David Giller, KI3V
Writing any non-trivial program on a Z80 to operate without using RAM is an extremely tight constraint. A few of the restrictions implied by lack of RAM are:
CALL
or RET
instructions in any conventional sensePUSH
, or move data between registers with PUSH
and POP
A
, will be consumed by just about any operation or computation at allI by no means an expert at Z80 assembly language; this project is my first Z80 program. I did not invent the techniques used in this program, although I have not seen them combined in exactly this form elsewhere. This was a learning experience for me, and this document aims to help others who are starting from a similar level of experience to mine learn some of the lessons I did.
The initial inspiration for the techniques used in this project was taken from this article by Jens Madsen on the z80.info web site. Jens describes how a 'stack' of sorts can be assembled by hand into ROM, and the addresses and parameters that form operations made out of machine-language primitives (subroutines that don't call any other subroutines) can be listed in a very space-efficient form and thus composed into larger, more complex programs without using the CALL
instruction, RAM, or the stack for return addresses.
(For brevity I'll call this Jens' method, though I don't believe he claims to have invented it. If anyone knows who did, I would be very interested to hear the story.)
These days, the word threading almost always means concurrent multiprocessing. But traditionally, the term "threaded code" had a different meaning, one more familiar to Forth language programmers.
Threaded code in this sense is a technique of stringing subroutines together using sequences of codes representing operations; these operations have been likened to user-definable opcodes in an abstract virtual machine. The extremely simple interpreter that ran this virtual machine is called an address interpreter.
In Direct Threaded Code, the opcodes are just the address of the machine-language subroutines that implement each operation. This is like a string of CALL mysubroutine
operations, without the CALL
opcode on the front. Interspersed with the opcodes, as necessary, are any parameters required by the machine-language opcode subroutines.
The technique described above by Jens is basically a form of direct threading code. However, this method still only allowed for one level of subroutines, and the order needs to be determined at compile time. More importantly, subroutines can start the CPU on a new 'stack' or stream of instructions, but can't perform a CALL
to other code that returns where the current code left off.
Threaded code is not generally a technique to avoid using RAM or a stack; quite the contrary, the Forth language makes extensive use of at least two stacks for fundamental operation. I needed to find a way to extend this mechanism to allow subroutine calls, even if only in a very limited form, without using RAM for variables of for a CPU stack.
I implemented a very simple address interpreter based on Jens' method.
To start defining a thread of code, I created the spthread_begin
macro. All this does is define a ROM-based 'stack' with its "top" beginning immediately after the macro is invoked. Z80 stacks grow from high addresses to low ones, so the "top" of the stack is the lowest address.
At the end of a string of threaded code, the spthread_end
macro wraps it up.
Defining threaded code, then, just looks like this:
spthread_begin
dw proc1
dw proc1arg1
dw proc2
dw proc2arg1
dw proc2arg2
dw proc3
spthread_end
This can be written more succinctly, which conveniently also looks more like a high level language:
spthread_begin
dw proc1, proc1arg1
dw proc2, proc2arg1, proc2arg2
dw proc3
spthread_end
These macros expand to generate code that looks like this:
ld sp,.threadstart
ret
.threadstart:
dw proc1, proc1arg1
dw proc2, proc2arg1, proc2arg2
dw proc3
dw .threadend
.threadend:
Conbine the above with subroutines such as the following:
proc1: pop hl
; do some processing
; ...
ret
proc2: pop hl
pop bc
; do some useful work
; ...
ret
proc3: ; do something without arguments
; ...
; ...
ret
For lack of a more creative term, I call this "Stack Pointer Threading", or SPT. This is not an invention of mine, just an implementation of the method Jens described. The trick is to remember that the SP
register serves the purpose of the instruction pointer, for the threaded code addresses.
(For attempted clarity, I will use the term "instruction pointer" to refer to threaded code, to distinguish between that and the Z80's PC
register.)
The "interpreter" consists of just two instructions: RET
calls the next threaded operation, and POP
fetches a parameter from the instruction stream. It's just two Z80 instructions. There is no need for an interpreter subroutine.
Now we have code that 'calls' subroutines — limited to one level deep, and with only fixed arguments passed on the stack. Still, they feel almost like regular subroutines, save for the fact that POP
instructions are not balanced with PUSH
instructions anywhere else.
The final piece, however, is how to make more than one level of subroutine call. I came to the solution for this after pondering on the advice by Jim Westfall to remember the EX
and EXX
instructions. These instructions are not really designed for general use; they are really intended to reduce context switching time in interrupt service routines by shortening the time required to save a single set of registers (instead of saving them to the stack). This ROM does not use interrupts, but the magic idea there is "instead of saving them to the stack".
The alternate resister set that is swapped in and out by the EX AF,AF'
and EXX
instructions don't include the stack pointer SP
. They do, however, include HL
, which happens to be the only register to and from which the SP
register can be transferred without using RAM.
The SPT address interpreter can make one level of subroutine calls, but those calls cannot make deeper calls because we need a place to save the SP
register (just like the Z80 must store a copy of the PC
register on the stack before jumping to a subroutine using the CALL
instruction).
Using this knowlege, I created a set of macros to form the function prologue and epilogue for subroutines that want to make deeper threaded subroutine calls. The sequence for a threaded subroutine call expands to look like this:
.prologue
exx ; prologue: push SP, (threaded IP),
; onto the emulated stack
ex de,hl ; copy hl' to de'
ld hl,0 ; copy sp to hl'
add hl,sp
exx
ld sp,.threadstart
ret ; begin address interpreter
.threadstart
db func1, func1arg1
db func2,
; ...
db .threadend
.threadend
exx ; epilogue: pop the previous SP off the emulated stack
ld sp,hl ; resume from the thread location saved in hl'
ex de,hl ; copy de' to hl'
exx
ret
This gives a two-level threaded-code stack, which gives up to three levels of subroutine nesting:
spthread_begin
and spthread_end
threaded code section (threaded IP saved in DE'
)
HL'
)
SP
)
If necessary, it would be possible to extend this method by one more register by also using BC'
. However, this seems excessive. Operating without RAM is for special situations such as this RAM testing firmware, and it seems likely that such programs can be structured to live with three levels of nesting or less.
It's worth saying this here: this is extremely slow compared to native Z80 CALL
and RET
. This technique is for when you don't have RAM and can't use CALL
!
There are a handful of refinements such as the ability to jump to a copy of the epliogue (not the prologue), and even returning to the previous threaded 'stack' frame from within a primitive operation, but the useful portions of the method are described here.
I doubt that I am the first to use this method, and I doubt that I have implemented it in the optimal way. I would be very interested to hear suggestions for improving this technique. Please submit a "Issue" or start a "Discussion" on the main Github page for this project with any suggestions or thoughts you may have.
If it seems like a stretch to call the threaded code a "virtual machine", consider that the technique described above already permits code like the following, which I think you'll agree starts to look like assembly language for a fictitious virtual processor:
main_program:
spthread_begin
.loop: dw sendbyte, $FFFF
dw readbyte $FF
dw jump_nonzero, .loop
spthread_end
; ... other code here
halt
sendbyte:
pop bc
out (c),b
ret
readbyte:
pop bc
in b,(c)
ret
jump_nonzero:
pop hl
sub a
cp b
ret z
ld sp,hl ; SP now points to .loop and...
ret ; this will continue execution at sendbyte
By changing SP
from a machine-level threaded-code "instruction", your code can move around inside the threaded instruction stream.
Some might be wondering why not simply use the IX
and IY
registers to store return addresses. After all, it is very easy to make a macro to do this:
ld iy,$+3 ; save the return address
jp musub ;
; ...
mysub: ; do something useful
; ...
jp (iy) ; return from whence we came
We have three registers available with similar powers, IX
, IY
, and even HL
. And this is much faster and simpler than the whole SPT mess.
The reason is simple. In this diagnostics ROM, where the goal was to operate with no RAM whatsoever, giving up IX
and IY
was too high a price to pay when they are needed for global variables.