[1]
The 8-bit 1802 is
the heart of what are widely regarded as just about the slowest home
computers ever sold. (8 clocks per machine cycle, 2 or 3 machine
cycles per instruction. Clock rate fixed at 1.7 MHz in order to
drive the 1861-based
NTSC video of the
typical system, and the video output was serviced by
per-scanline
interrupts and
per-8-pixel
DMA
cycles, which stole much time from your application program. So,
roughly
0.1 MIPS
even without the distractions, and its 'I's were nothing to write home
about, either. Comparable in performance, perhaps, to the much older
4-bit Intel
4004.)
Superficially the 8-bit 1802, with its 16 16-bit
registers,
appeared unexcelled until the much later 16-bit
Z8000 and
68000
devices, except that the slow instruction execution combined with some
truly unfortunate missing instructions[2]
served to hobble it significantly. In practice, except for its use in
small trainer computers like the
Elf and some very
early programmable home-video games, it was really only deployed in
places where its extreme tolerance of environmental stresses like
unstable power supplies and clocks, heat, cold, and radiation, or its
ultra-low full-CMOS
power requirements were crucial: Space vehicles,
boreholes, traffic
signs, engine computers, battery-operated handheld devices, etc.
(Though considered relatively unpopular, it nonetheless sold millions
of units in these applications, and is still in limited
production as of this writing, 48 years after its introduction.)
Generally it seemed to be most satisfactorily programmed in various
virtual
machine environments like
Chip-8,
SWEET16, or
threaded
Forth[3].
These, of course, exacted a significant performance penalty on top
of an already slow processor, but paid for themselves with
substantially reduced memory requirements. (Said memory also
needing to be as tolerant of the environment as the
CPU
needed to be, and thus possibly quite bulky and/or expensive when
compared to more mainstream memory devices. Memory, in the 8-bit era,
was relatively expensive even in its cheapest forms.)
[2]
The 16 16-bit registers can be used as program counters, pointer
registers, and counters, which is excellent, and can be treated as up
to 30 8-bit data holding registers, also good. However:
- All output and
ALU
instructions work only on memory, indexed by one of the
16 registers, which translates to a significant amount of data
shuffling through the 8-bit
accumulator,
the only gateway to both the registers and the memory. (This also
means that even the tiniest systems must†
include
RAM,
which is not necessarily the case with other register-rich
processors.)
- Unusually, stack operators are post-increment
and post-decrement, not pre-decrement, requiring
extra increment and decrement instructions when switching between
pushing and popping.
(Exemplified in
the MARK/RET
subroutine technique, below.)
- There is no
condition
code register, only the accumulator's separate carry bit.
Calculations can only test the current value of the accumulator
for zero, or the state of this carry bit. Thus, most value
testing is bulky, and signed arithmetic comparison and
overflow detection are particularly difficult.
- There are no compare instructions, thus all numeric comparisons
(except with zero) must be done with 'destructive' subtractions
and potentially bulky value tests.
- There are no traditional subroutine call or return instructions,
though they can be simulated‡.
- DMA
use costs one register (R0). Interrupt use costs two more (R1
& R2), though the necessary stack pointer register (R2) can
usually be shared with the foreground code. (As a result, by
convention R3 is the primary program counter, though R0 is used at
reset and in systems that do not use DMA.)
These six characteristics combine to make most 1802 native code rather
bloated in size, compared[4] to its 8-bit
peers, and particularly slow.
[†]
An application note [Using the 1802 Scratchpad To Store RAM
Variables by RCA's John Stahler] suggests that if you can
afford to burn a 256-byte page of ROM for a 1:1 translation table
this allows for RAM-less operation, whereby the low-order address
of R(X) is copied to the data bus, via the table, for ALU
instructions. Thus RX.hi selects the page containing the 1:1
table, and RX.lo contains the ALU (or output) instruction operand.
This trick costs you one of the 16-bit registers, as well as the
ROM space.
Alternately, address-decode circuitry can be used to control a 373
latch that fakes this 'ROM' translation table, but it seems to me
that said circuitry would not really be superior in any way to an
actual small 8-bit RAM device, except for possibly environmental
[voltage, power, temperature, radiation] reasons. And it still
costs you a register pair.
Or, use an 1804/1805, which contains a small amount of RAM.
[‡]
Subroutines are an interesting proposition for the quirky 1802.
There are basically three techniques in common use:
SEP:
Dedicate registers as program counters. Use a SEP instruction to
call, a complementary SEP to return. In subroutines the program
counter usually must be left at a suitable (re-)entry point after
returning. (For the next call, unless you want to reload the register
for every call.) Minimum of 4 cycles per subroutine call and return.
Pro:
- Smallest/fastest.
- Perfect for co-routines.
- No RAM required.
- Inline data can be passed to the subroutine.
Con:
- Can't afford to lose too many registers. Unrelated subroutines can
  share SEP registers, but that makes the code slower and bulkier,
  due to the reloading.
- Inflexible. Callee must know who the caller is in order to use the
  correct SEP to return.
- No recursion possible.

MARK/RET:
As above, but more flexible. Use MARK, SEP, and DEC 2 instructions to
call. Use SEX 2, INC 2, and RET (or DIS) instructions to return.
Minimum of 12 cycles per subroutine call and return.
Pro:
- Call tree can be flexible; subroutines return to whomever called
  them.
- Inline data can be passed to the subroutine.
Con:
- As above, can't afford to lose too many registers, etc.
- Must have a stack. (Costing a register [R2], and a byte of RAM per
  call depth.)
- Recursion still not possible.
- Must know whether interrupts are enabled or not, to use the correct
  RET or DIS instruction to return; no mixing of foreground
  subroutines and interrupt service subroutines.
- Larger and slower (3×) than pure SEP calling.

SCRT:
Standard Call and Return Technique. SEP used to call a 16-byte
30-cycle dedicated calling subroutine, which picks up the 2-byte
target address inline (after the SEP) and pushes the return address
on the stack. SEP used to call a 13-byte 24-cycle dedicated returning
subroutine which reverses this. Optimal when large numbers of
subroutines need calling (because there is no incremental register
cost per subroutine) or when recursion is necessary. Minimum of 54
cycles per subroutine call and return.
Pro:
- Normal subroutine appearance, behavior, and flexibility.
- Recursion.
- Smallest per-call incremental cost for larger numbers of
  subroutines.
- Inline data can be passed to the subroutine.
Con:
- Costs 29 bytes of code space for the call/return subroutines.
- Costs three registers. (Usually R4, R5, and R6; R3 is assumed to be
  the 'normal' program counter.)
- Must have a stack. (Costing a register [usually R2], and two bytes
  of RAM per call depth.)
- Slowest (13.5×) subroutine method.
None can be considered ideal, and variations of the above techniques
are possible. (Using a SCRT variant with a 1-byte selector instead of
a 2-byte address
like MACROSUB,
for example.) These calling techniques may be mixed as needed; they
are not mutually exclusive. (Though the subroutines themselves cannot
be called by techniques they were not designed for, as the necessary
return instructions are incompatible.)
If using the 1804–6, an additional mechanism is possible: the
new SCAL/SRET N instructions. (If using
1804–6 you'd probably never use SCRT, you'd use this instead,
though for compatibility SCRT does continue to work, as do the
others.)
SCAL/SRET:
Subroutine Call and Return. New 4-byte 10-cycle SCAL N instruction
(68 8N SS SS) used to call your subroutine, using a specified linkage
register N to hold the return address. (Prior contents of RN are
saved on the X stack.) RN may be used to pick up inline arguments, if
any. New 2-byte 8-cycle SRET N instruction (68 9N) used to return to
the RN address, X stack contents popped back into RN. Optimal when
large numbers of subroutines need calling (because there is no
incremental register cost per subroutine) or when recursion is
necessary. Minimum of 18 cycles per subroutine call and return.
Pro:
- Normal subroutine appearance, behavior, and flexibility.
- Recursion.
- Small per-call incremental cost for larger numbers of subroutines.
- Inline data can be passed to the subroutine.
- No code space penalty, beyond the extra byte in each call/return
  instruction.
Native programs for the 1802 do not grow gracefully, because you
quickly run out of registers to dedicate, and once you do the shuffle
starts to kill you. (Modularity is often impeded due to the bias
towards dedicated registers.) The 1802 is uniquely suited to
implementing virtual machines like FORTH, CHIP-8, SWEET16, P-machine,
etc., where a small number of oft-used subroutines, co-routines,
pointers, and counters are the bulk of what is necessary to
efficiently implement the virtual environment, and which can live in
dedicated registers and avoid the shuffle.
[3]
Threaded
Forth
is particularly noted for being memory-thrifty, but most find Forth to
be a difficult language to work in. It occurs to me that one could
combine a
P-machine
stack-oriented
execution architecture (for a more traditional
C
or
Pascal
compiler, say) with
Forth-style threading in a 2-stack
virtual
machine to get the best of both worlds. Classic P-machine runtime
interpreters
were stack-oriented, which means that the
basic blocks
that a compiler (C, or Pascal) generated could be characterized, just
like Forth words, by their stack effects. The compiler could generate
these blocks as threadable Forth-like words (with separate return and
data stacks, allowing for
zero-prologue
subroutines and simple machine refactoring) instead of the usual
inline code. If the compiler, as each basic block was identified,
fingerprinted the block by its data stack effects and its internal
operations, it could consult the cache of already emitted code (the
application itself, as built so far to that point) and before caching
a new block 'word' it could see if a match was already there, and if
it were thread to that instead of emitting fresh code. (Refactoring
the existing code at that point, as required, would be expected.) The
2-stack machine (versus native code) constraints would encourage a
greater block uniformity that could well lend itself to increased
sharing. Post-processing could even discover larger clusters of
common sequences (of threaded words) and re-factor them for increased
sharing, and even better code density. (Manipulations of the return
stack would stop the refactoring, of course, because the refactoring
itself affects the return stack.)
In this manner the P-machine compiler would be doing, partially
anyway, what the experienced Forth programmer does automatically:
extensive factoring of an application with an eye towards sharing code
that already exists. (This being the secret by which Forth was so
memory-efficient. In fact, Forth is difficult enough to work with
that it is practically impossible not to factor extensively.)
If the compiler did a good enough job of factoring and sharing, it
might make a C or Pascal environment for the 1802 practical for
non-toybox applications. Programs would be slow(-ish) due to the
triple penalty[5], but maybe they'd
fit† in memory.
[†]
One experience I had in school with
HP Pascal/64000 on
an 8085
embedded
system was an abject failure in this respect, and a useful object
lesson. The hardware had been designed with 4KB (2× 8755
devices, IIRC) of
EPROM for the program,
which was typical for embedded systems of the era and felt by all to
be sufficient for the intended door-lock application. Unfortunately
the project specification required Pascal as the
implementation language, not the usual assembly, and the necessary
program space, with all requested features present, turned out to be
about 3–4× what was available in the hardware. The Pascal
compiler churned out perfectly reasonable native code, but it was just
so bulky (especially on an 8-bit processor) when compared to
what an assembly (or Forth) programmer could do that the project was
doomed from the start. Adding source code generally resulted in a
linear increase in object code; there was no factoring economy
exhibited beyond what was expressed directly as subroutines in the
source code. The memory 'gas gauge' dropped precipitously with each
new line of code added, burying itself far below empty when the
application had barely even begun. The application was eventually
completed anyway, and functioned correctly, but only when using
substantial development system memory
emulation
resources. In fact it was never realized on the un-aided target
hardware, and the project was eventually abandoned, after all
the door lock hardware had been installed throughout the building.
This was a very useful (and painful) object lesson on the importance
of project specification and estimation, and on having a viable
Plan B, and
on how ill-suited 8-bit
microprocessors
were to handling languages conceived for the much more capable
mini- and
mainframe
computers of the day.
The same project supplied other painful lessons. The provided target
hardware was, in fact, not even fully functional: the floppy disk
subsystem was unable to actually save and retrieve data to the storage
medium, and only the '0' keys worked on the keypads at the doors.
(The hardware was clearly largely untested, and no test software
accompanied the provided platform. I should never have accepted the
software project as offered to me.) Development system disk
emulation resources were also necessary for the code demonstration,
and all demonstrated door combinations had to be composed only of
zeroes.
I learned a lot in that project!
[4]
Consider a not-uncommon small but significant task: copying a small
(<256) number of bytes from known (constant) places (src
and dst) in memory—your basic structure copy. In
C, perhaps something like:
void copy(char *src, char *dst, unsigned char count) {
    do {
        *dst++ = *src++;
    } while (--count);
}
I have hand-assembled some code fragments here for comparison
purposes. (There may be minor errors due to the hand-assembly.) We
begin with the canonical 8-bit processor: Intel's 8080, and compare it
to the 1802.
8080/8085:
Cycl Addr Code Label Mnemonic Comment
(10) 0000 21 ss ss copy: LXI H,src ; Setup
(10) 0003 11 dd dd LXI D,dst
( 7) 0006 0E nn MVI C,count
( 7) 0008 7E loop: MOV A,M ; Fetch source byte.
( 7) 0009 12       STAX D        ; Store dest byte.
( 5) 000A 23 INX H ; Advance both pointers.
( 5) 000B 13 INX D
( 5) 000C 0D DCR C ; Are we done?
(10) 000D C2 08 00 JNZ loop
1802:
Cycl Addr Code Label Mnemonic Comment
(16) 0000 F8 ss copy: LDI src.hi ; Setup
(16) 0002 BD PHI 13
(16) 0003 F8 ss LDI src.lo
(16) 0005 AD PLO 13
(16) 0006 F8 dd LDI dst.hi
(16) 0008 BE PHI 14
(16) 0009 F8 dd LDI dst.lo
(16) 000B AE PLO 14
(16) 000C F8 nn LDI count
(16) 000E AF PLO 15
(16) 000F 4D loop: LDA 13 ; Fetch source byte, advance.
(16) 0010 5E STR 14 ; Store dest byte,
(16) 0011 1E INC 14 ; advance.
(16) 0012 2F DEC 15
(16) 0013 8F GLO 15 ; Are we done?
(16) 0014 3A 0F BNZ loop
In the setup parts of the code, pre-loop, we are initializing two
16-bit pointer registers and an 8-bit counter. The 3-instruction
8-byte 8080 fragment is about as optimal as it could get. (Though an
8-bit processor, the 8080 sports a few crucial 16-bit instructions
designed for manipulating addresses.) It takes the 1802 almost twice
as much code, 10 instructions in 15 bytes, to do the exact same part
of the task. This is due to the 8-bit accumulator bottleneck, through
which all data must flow. (The only 16-bit operations the
1802 can do are to increment or decrement a register pair.)
In the loop proper, the 1802 actually has a 1-instruction 1-byte
advantage (3 vs 4) copying the data because it has a combined
load-and-increment-pointer (LDA) instruction.
For loop termination the 8080 takes 2 instructions, 4 bytes, to
decrement the counter and loop around if it's not zero. The 1802
takes 3 instructions, 4 bytes, to do the same thing, leaving the 1802
with only a 1-instruction (0-byte) penalty for this phase. This
penalty is due to the lack of a
condition-code
register: we have to pull the counter back into the accumulator to
see if it's expired. (The high half of the counter register is
undamaged, because we never decrement the low half below zero.)
So, for the loop itself the 8080 needed 6 instructions in 8 bytes to
copy each data byte; the 1802 also needed 6 instructions but in only 7
bytes, giving it a slight size edge—which does not even begin to
pay for the huge penalty it incurred setting up the loop.
Note that every 1802 instruction we used takes 16 clock cycles to
execute, whereas the 8080 instructions we used take 5, 7, or 10 cycles
to execute. No matter how you rate it, the 1802 comes in last on this
task. (And, indeed, most tasks.)
As a routine matter we preserved any processor resources (registers)
that we didn't need for the copy. For the 1802, registers 1–12
were untouched, as was the high half of our counter R15. In the case
of the 8080, our benchmark, only the 8-bit B register wasn't touched.
Using the slightly newer, more-popular
Z-80 successor to
the 8080, and its additional instructions, we can beat the 8080,
several different ways:
Z-80 (1):
(10) 0000 21 ss ss copy: LXI H,src
(10) 0003 11 dd dd LXI D,dst
( 7) 0006 0E nn MVI C,count
( 7) 0008 7E loop: MOV A,M
( 7) 0009 12       STAX D
( 5) 000A 23 INX H
( 5) 000B 13 INX D
( 5) 000C 0D DCR C
(12) 000D 20 F9 JRNZ loop
This is one byte shorter, using a relative jump, but it's also two
cycles (per iteration) slower.
Z-80 (2):
(10) 0000 21 ss ss copy: LXI H,src
(10) 0003 11 dd dd LXI D,dst
( 7) 0006 06 nn MVI B,count
( 7) 0008 7E loop: MOV A,M
( 7) 0009 12       STAX D
( 5) 000A 23 INX H
( 5) 000B 13 INX D
(13) 000C 10 FA DJNZ loop
We save 2 bytes here by using the new DJNZ instruction. It's also 2
cycles (per iteration) faster. We must use register B for the
counter, instead of C. (C is now our untouched resource.)
Z-80 (3):
(10) 0000 21 ss ss copy: LXI H,src
(10) 0003 11 dd dd LXI D,dst
(10) 0006 01 nn nn LXI B,count
(21n) 0009 ED B0 LDIR
This is smallest, by a fair margin, and the fastest. (You can only
beat its speed with unrolled LDI loops, which we are not
interested in here.) It also leaves one 8-bit register untouched: the
accumulator this time.
This is the canonical data-copy for the Z-80, and works for blocks
of any size.
These only make the 1802 look even worse, of course. Especially that
last one.
On the other hand, consider another wildly popular peer, and two that
were not so popular:
6502:
(2) 0000 A9 ss copy: LDA #src-1.lo
(3) 0002 85 00 STA from
(2) 0004 A9 ss LDA #src-1.hi
(3) 0006 85 01 STA from+1
(2) 0008 A9 dd LDA #dst-1.lo
(3) 000A 85 02 STA to
(2) 000C A9 dd LDA #dst-1.hi
(3) 000E 85 03 STA to+1
(2) 0010 A0 nn LDY #count
(5) 0012 B1 00 loop: LDA (from),Y
(6) 0014 91 02 STA (to),Y
(2) 0016 88 DEY
(3) 0017 D0 F9 BNE loop
This is the biggest fragment so far and in fact looks a lot
like the 1802 in that it has substantial setup code when compared to
the copy and loop sections of the code, and that is because the 6502,
like the 1802, can work only with 8-bit values.
By fiddling with the setup constants we can use the Y register as both
index and counter. (The X index register remains untouched.) The
copy also proceeds from the top down, unlike all the others, because
of Y's dual role. It is fairly typical of the 6502 that if you're
willing to warp your code a little, greater efficiency is possible.
We need two 16-bit pointers stored in page 0: from
and to, because the 6502 has no pointer registers. (Modularity
is often impeded due to the need to allocate storage in page 0.)
6800:
(2) 0000 C6 nn copy: LDB #count
(3) 0002 CE dd dd LDX #dest
(5) 0005 DF 02 STX to
(3) 0007 CE ss ss LDX #src
(5) 000A A6 00 loop: LDA 0,X    ; Fetch data byte.
(4) 000C 08 INX
(5) 000D DF 00 STX from ; Switch pointers.
(4) 000F DE 02 LDX to
(6) 0011 A7 00 STA 0,X ; Store data byte.
(4) 0013 08 INX
(5) 0014 DF 02 STX to ; Switch pointers.
(4) 0016 DE 00 LDX from
(2) 0018 5A DECB
(4) 0019 26 EF       BNE loop
This is even bigger than the 6502 fragment, but is partitioned
differently and is much slower. The efficient setup code is similar
to the 8080, but the fact that there is only one pointer
(index) register means we have to continually shuffle our two pointers
in and out of memory. (We use zero-page for the holding cells, for
better efficiency.) So, the setup and branch control portions are
reasonable, but the loop copy itself is, most decidedly, not.
If it weren't for the 1802 this would be the worst of these processors
at this task, without question.
If you can afford to disable interrupts the stack pointer can be
repurposed as a second index register, resulting in a very efficient
copy portion of the loop, but this has obvious system implications and
additional setup/cleanup code, and cannot be relied upon in general
practice.
The 6800's fatal weakness was having only the single pointer register,
which crippled it for just about any interesting task to which it
could be put. (Copying, comparing, and transforming data all require
two pointers.) It's better than no processor at all, of course, but
just about any other choice was superior from the software
perspective. (The
68HC11, one
of the successors to the 6800, attempted to rectify this by adding a
second index register, Y. There were others that did the same. There
was room in the 6800's instruction encoding for the Y register; not
including it from the very beginning is inexplicable. This oversight
helped ensure the competing 6502's success.)
6809:
(3) 0000 8E ss ss copy: LDX #src
(4) 0003 10 8E dd dd LDY #dest
(2) 0007 C6 nn LDB #count
(6) 0009 A6 80 loop: LDA ,X+
(6) 000B A7 A0 STA ,Y+
(2) 000D 5A DECB
(3) 000E 26 F9 BNE loop
Attempting to address all the 6800's weaknesses, this successor is
arguably the best 8-bit processor made, and not really a peer of the
1802 because it came along so much later. However, it is interesting
to note that even so it's neither the smallest nor fastest at this
particular task. Clock-corrected the Z-80 still beats it handily, and
is smaller, and the 6502 is one clock faster per byte.
(Due to instructions and addressing modes we aren't using here, this
processor is the undisputed 8-bit champion when running
threaded Forth, native-compiled C and Pascal, and ROM-based
relocatable code, all of which were explicit design targets of the
processor.)
Just as with the Z-80, the 6809 is rich enough that there are some
options:
6809 (2):
The 6809 is capable of moving 16 bits of data at a time, which
considerably improves its performance, but with more complex code. We
need another 16-bit register, which is one more than it comfortably
has. However, the user stack pointer (U) can be (ab)used for this,
keeping everything in-register, and thus fast:
(7) 0000 34 40 copy: PSHS U ; Save U stack.
(3) 0002 8E ss ss LDX #src
(4) 0005 10 8E dd dd LDY #dest
(2) 0009 C6 nn LDB #count
(2) 000B 54 LSRB ; Byte pair count.
(3) 000C 24 04 BCC loop
(6) 000E A6 80 LDA ,X+ ; Move the
(6) 0010 A7 A0 STA ,Y+ ; odd byte.
(8) 0012 EE 81 loop: LDU ,X++   ; Move the
(8) 0014 EF A1 STU ,Y++ ; byte pairs.
(2) 0016 5A DECB
(3) 0017 26 F9 BNE loop
(7) 0019 35 40 PULS U ; Restore U stack.
This is a 27-byte subroutine, as large as the 6800's, and instead of
the first routine's 9+17×N clocks it takes
41+10.5×N clocks, making it nearly as fast as the Z-80,
at least for larger N.
6809 (3):
Optimally (for even-sized copies) it's only 20 bytes in size, and
23+21×(N/2) clocks:
(7) 0000 34 40 copy: PSHS U ; Save U stack.
(3) 0002 8E ss ss LDX #src
(4) 0005 10 8E dd dd LDY #dest
(2) 0009 C6 nn LDB #count/2
(8) 000B EE 81 loop: LDU ,X++   ; Move the
(8) 000D EF A1 STU ,Y++ ; byte pairs.
(2) 000F 5A DECB
(3) 0010 26 F9 BNE loop
(7) 0012 35 40 PULS U ; Restore U stack.
The instruction count is quite reasonable, but unfortunately over half
the instructions we're using are 'extended' (compared to the parent
6800) instructions, and require an extra opcode byte to invoke, thus
increasing the size and cycle count.
And the Z-80 still beats it.
To handle blocks of any size, most of these fragments would need an
additional outer loop, or other equivalent modifications to handle
larger counters and/or offsets. (The Z-80's LDIR can already handle
any size.) I have ignored this as I don't think it's relevant for the
point(s) being made here.
The little-used
Hitachi 6309, a
successor to the 6809, added block move instructions similar to the
Z-80:
6309:
(3) 0000 8E ss ss copy: LDX #src
(4) 0003 10 8E dd dd LDY #dest
(4) 0007 10 86 nn nn LDW #count
(6+3n) 000B 11 38 12 TFM X+,Y+
Essentially the same four instructions as the Z-80, but three bytes
larger due to the extension-byte orientation of the 6809 family. (Yet
still two bytes smaller than the basic 6809 routine. To be fair, the
Z-80 also uses extension bytes, but only needs one for this task
rather than four.) This takes 17+3×N clocks, making it
the absolute winner, bar none, in the speed category, and it can
handle blocks of any size. (Too bad this processor was not more
popular, but by the time it was available the 16-bit processors, with
their vastly increased memory addressing abilities, were coming on
strong.)
So, here is our collected summary of the basic structure-copy
subroutines, showing size and time requirements for each 8-bit
processor:
Structure Copy Summary

| CPU  | Setup (B) | Copy (B) | Loop (B) | Total (B) | Setup (clk) | Clk/Byte |
|------|-----------|----------|----------|-----------|-------------|----------|
| 8080 |  8 |  4 | 4 | 16 |  27 | 39 |
| 1802 | 15 |  3 | 4 | 22 | 160 | 96 |
| Z-80 |  9 |  2 | 0 | 11 |  30 | 21 |
| 6502 | 18 |  4 | 3 | 25 |  22 | 16 |
| 6800 | 10 | 14 | 3 | 27 |  13 | 43 |
| 6809 |  9 |  4 | 3 | 16 |   9 | 17 |
| 6309 | 11 |  3 | 0 | 14 |  17 |  3 |

By speed:

| CPU  | Setup (B) | Copy (B) | Loop (B) | Total (B) | Setup (clk) | Clk/Byte |
|------|-----------|----------|----------|-----------|-------------|----------|
| 6309 | 11 |  3 | 0 | 14 |  17 |  3 |
| Z-80 |  9 |  2 | 0 | 11 |  30 | 21 |
| 6502 | 18 |  4 | 3 | 25 |  22 | 16 |
| 6809 |  9 |  4 | 3 | 16 |   9 | 17 |
| 8080 |  8 |  4 | 4 | 16 |  27 | 39 |
| 6800 | 10 | 14 | 3 | 27 |  13 | 43 |
| 1802 | 15 |  3 | 4 | 22 | 160 | 96 |

By size:

| CPU  | Setup (B) | Copy (B) | Loop (B) | Total (B) | Setup (clk) | Clk/Byte |
|------|-----------|----------|----------|-----------|-------------|----------|
| Z-80 |  9 |  2 | 0 | 11 |  30 | 21 |
| 6309 | 11 |  3 | 0 | 14 |  17 |  3 |
| 8080 |  8 |  4 | 4 | 16 |  27 | 39 |
| 6809 |  9 |  4 | 3 | 16 |   9 | 17 |
| 1802 | 15 |  3 | 4 | 22 | 160 | 96 |
| 6502 | 18 |  4 | 3 | 25 |  22 | 16 |
| 6800 | 10 | 14 | 3 | 27 |  13 | 43 |
Discounting the never-seen 6309 latecomer, the Z-80 is the clear winner
on all benchmark fronts here, once clock rates are made comparable.
(See below.) Speed-wise the 6502 is in second
place, and the 1802 is far in the rear, though it requires less code
than either the 6800 or 6502.
However, benchmark fragments aren't everything.
The 6502, for example, has two secret weapons: 1) due to the way its
index registers work it's very good at handling data
structures and arrays that are less than 256 bytes in size, and 2) it
can keep up to 128 pointers (using all of zero page) on hand at once.
So, that grotesque setup code probably need only be done once. Most
significant programs are quite repetitive in nature; avoiding the
setup on all but the first pass can really add up.
For example, the second time that structure copy needs to be
done, alone of these CPUs the 6502 can probably re-use the source and
destination pointers where they sit in
RAM.
The setup for the (second and any subsequent) copy then drops to two
bytes, and two clock cycles.
The 6502 has been called by some the first common
RISC
microprocessor, because it has so few internal resources. (An
arguable definition, but what is true is that the programmer
generally only works with three 8-bit registers, which is Not Much no
matter how you measure it. [TI's novel 16-bit
9900
takes this register-less approach even further, though in a different
way.]) Instead, the 6502 is designed to work very efficiently with
the first 256 bytes of RAM, using them as arrays and pointers, and
arrays of pointers. Substantial applications, once
written and optimized for the 6502, were
fast—often very fast.
(None[6] of the 8-bit processors, with the
exception of the later
6809/6309,
natively handles high-level languages like C and Pascal well at all.
Significant 8-bit applications that need maximum performance must be
written in assembly language, and are thus well-poised to take
advantage of architectural peculiarities.)
The various '80' processors don't have enough internal registers to
keep anything interesting long-term within the
CPU; everything has to live out in RAM and shuffle in and out
of the CPU while in use. The extremely introverted 1802, focused on its
generous 16 internal registers, still doesn't have enough registers to
keep very much control data within the CPU, especially considering its
instruction set peculiarities—most data still has to
live out in RAM and endure the shuffle. The extremely extroverted
6502, with practically no registers to speak of and focused instead on
its first 256 bytes of RAM, can keep the bulk of interesting
control information out there, where (using 8-bit zero-page
addressing) it is fairly fast to access and can be used where it sits,
no shuffling required. (Like the 1802, you can't build even the most
basic 6502 system without RAM. Also, like the 1802, modularity is
hampered by the necessity to allocate limited resources: zero-page RAM
[vs the 1802's registers].) While the 6502 instructions that access
the data directly in RAM are inherently slower than purely
register-based operations, avoiding the shuffle more than makes up for
it. That efficiency adds up over an entire application. (The
6809/6309 can do many of the same in-memory tricks, and even some
additional ones, but its instructions to do so are larger and slower,
largely negating that 'advantage' so far as relative performance is
concerned.)
In the heyday of the 8-bit era the 'fast' 4 MHz Z-80 systems were
generally (though erroneously) thought to be faster than the 'fast' 2
MHz 6502 systems. (These being approximately comparable as they
interfaced to the same-speed commodity memory, and in those pre-cache
days that was all that really mattered. 8-family processors were
approximately half as clock-efficient as 6-family processors; scaling
clock numbers by 2 was a reasonable equivalency approximation.)
The erroneous conclusion that the Z-80 was superior was certainly true
of small benchmarks: on a fragment-by-fragment basis the Z-80
does very well. But running real world applications the
2 MHz 6502 systems were usually comfortably in the
lead[7], followed by the 4 MHz Z-80
systems, followed by everything else, with the 1802 systems solidly
bringing up the rear. Larger applications that take advantage of the
6502 instruction architecture simply win. (The 1 MHz 6502
systems, exemplified by the highly regarded and extremely popular
Apple II, naturally
found it a lot harder to compete on performance than the 2 MHz
systems did, but they usually compensated by having inexpensive
integrated graphic displays that shared the main memory with no
performance hit or visual artifacts when accessing it.)
In fact, if the 1802 were to be implemented with circuitry as
cycle-efficient as the 6502, a not-unreasonable ask, it would actually
fare well in these contests, as it's not the worst in code size, and
having multiple pointer registers and counters available can be a real
advantage in avoiding the pesky shuffle. (Or use the 1804–6
successor, for some slight relief [4 bytes and 48 cycles] in the setup
portion. Using a SEP support subroutine,
à la SCRT, gives you the same size relief on an 1802, but
is considerably slower, costs you a register, and you have to provide
the support subroutine too.) Revisited:
| e1802 | 1804/1805/1806 | 1802 'Macros', à la SCRT
|
|---|
(2) 0000 F8 ss copy: LDI src.hi
(2) 0002 BD PHI 13
(2) 0003 F8 ss LDI src.lo
(2) 0005 AD PLO 13
(2) 0006 F8 dd LDI dst.hi
(2) 0008 BE PHI 14
(2) 0009 F8 dd LDI dst.lo
(2) 000B AE PLO 14
(2) 000C F8 nn LDI count
(2) 000E AF PLO 15
(2) 000F 4D loop: LDA 13
(2) 0010 5E STR 14
(2) 0011 1E INC 14
(2) 0012 2F DEC 15
(2) 0013 8F GLO 15
(2) 0014 3A 0F BNZ loop
|
(40) 0000 68 CD ss ss copy: RLDI 13,src
(40) 0004 68 CE dd dd RLDI 14,dst
(16) 0008 F8 nn LDI count
(16) 000A AF PLO 15
(16) 000B 4D loop: LDA 13
(16) 000C 5E STR 14
(16) 000D 1E INC 14
(16) 000E 2F DEC 15
(16) 000F 8F GLO 15
(16) 0010 3A 0F BNZ loop
|
(??) 0000 D1 0D ss ss copy: LDI 13,src
(??) 0004 D1 0E dd dd LDI 14,dst
(16) 0008 F8 nn LDI count
(16) 000A AF PLO 15
(16) 000B 4D loop: LDA 13
(16) 000C 5E STR 14
(16) 000D 1E INC 14
(16) 000E 2F DEC 15
(16) 000F 8F GLO 15
(16) 0010 3A 0F BNZ loop
|
Instead of 160+96×N clocks, our fictional enhancement
would take 20+12×N clocks, making it the fastest
of these processors at this task! (Equivalently, clock the real 1802
design at 16–32 MHz—in a non-caching design it's
the memory speed that matters, not the CPU clock rate.)
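As a quick sanity check on those formulas, a Python sketch; the numbers are just the per-instruction clock counts tallied from the listings above, and note that since both the setup and loop terms scale by 8, the hypothetical machine is exactly 8× faster at every size:

```python
def clocks_1802(n: int) -> int:
    """Real 1802: 16 clocks per instruction; 10-instruction setup (160
    clocks) plus a 6-instruction loop (96 clocks) per byte copied."""
    return 160 + 96 * n

def clocks_e1802(n: int) -> int:
    """Hypothetical cycle-efficient 'e1802': 2 clocks per instruction,
    so 20 clocks of setup plus 12 per byte."""
    return 20 + 12 * n

for n in (1, 128, 65536):
    print(n, clocks_1802(n), clocks_e1802(n),
          clocks_1802(n) / clocks_e1802(n))   # ratio is exactly 8.0
```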
Of course, if we're going to invent fictional derivatives of existing
processors a small change to the Z-80 would make it unbeatable in this
particular challenge. As it happens the Z-80 implements LDIR
by decrementing the program counter rather than incrementing
it for the second byte of the opcode, if the decremented BC counter
pair is not zero. This results in re-fetching the LDIR instruction
itself for each data byte transferred, except for the last one, which
is why it takes a leisurely 21 cycles per byte transferred. (This is
elegant, if inefficient, and also ties into preserving its ability to
refresh
DRAM
reliably while still remaining responsive to
interrupts during
this lengthy instruction, all while minimizing implementation
circuitry. [The WDC
65816 does the same thing for its block-move instructions.]) If
the Z-80 had been able to fetch the opcode pair only once and thereafter
just move data, under control of the internal registers, it could have
taken 13 cycles per byte transferred, or even a theoretical minimum of
6, far superior to any other 8-bit CPU. (The only non-caching
processors I'm aware of that could actually keep the memory bus
saturated with data while executing bulk copies were the 16-bit
68010, and
the 8-bit Hitachi
6309.)
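The per-byte rates make the comparison easy to tally. A minimal sketch, ignoring the one-off setup and the slightly cheaper final LDIR iteration (16 T-states rather than 21) for clarity; the 13- and 6-cycle figures are the hypothetical ones from the text:

```python
# T-states per byte moved on the Z-80, per the discussion above:
ACTUAL = 21           # real LDIR, re-fetching its own opcode pair each byte
NO_REFETCH = 13       # hypothetical LDIR that fetched the opcodes only once
THEORETICAL_MIN = 6   # hypothetical bus-saturating minimum

def copy_tstates(n_bytes: int, per_byte: int) -> int:
    """Approximate total T-states for a block copy (setup ignored)."""
    return n_bytes * per_byte

print(copy_tstates(128, ACTUAL))           # 2688
print(copy_tstates(128, NO_REFETCH))       # 1664
print(copy_tstates(128, THEORETICAL_MIN))  # 768
```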
Were I designing the ultimate 8-bit hobbyist computer I'd be inclined
to give it multiple CPU's for maximum flexibility, one each from the
three most popular families: 8080, 6800, and 6502. (Plenty
of dual CPU systems were sold on the market, exemplified by
the Commodore
128, which had both 6502 and Z-80 processors.) Sadly, the 1802
offers nothing unique enough in this space to compensate for its lack
of speed in this kind of system, so in spite of my liking for it I
would leave it out. I'd use the latest-and-greatest in each line: the
Zilog Z-80 (for its wealth of CP/M applications), the Hitachi 6309
(for its support of high-level languages), and the WDC 65816 (for its
speed, and larger memory addressing); each of the three brings a
significant strength to the partnership. Because they're all sharing
the same memory they can even act as co-processors for each other.
To continue on down the rabbit hole, if you used the 65816 as the
primary CPU the system could be built with more than 64KB of RAM. By
adding MMUs the other two CPUs could also access all of the memory.
(Morrow
used an MMU for the Z-80 back in the day that even gave it the ability
to run a protected-space Un*x OS. Throw that in too!) If one were
inclined to stretch the boundaries completely to the point of
ridiculousness, one could go ahead and throw in three 16-bit CPUs that
had 8-bit bus interfaces as well: the
8088,
68008 (though
a caching
68030 or
68040
constrained to a narrow 8-bit bus would be both faster and much more
capable), and
9995.
At that point maybe just go ahead and throw in an 1802 too, for the
heck of it, making a grand total of 7 CPU's sharing one memory;
this slippery-slope mental exercise already got stupid some time ago,
what's one more processor at this point? We can even add some sort of
MMU8 to let the 1802 also get at the
additional memory.
Regardless of how far you took it, this would be purely a
retrocomputing
exercise, because you could learn nothing that you couldn't learn
more simply by simulating any of these CPUs on modern high-speed desktop
computers, with vastly less effort and probably better (faster)
results.
This little digression has gone on far too long; I really
need to stop now!
|
| 5
|
Penalties: 1: P-machine interpreter. 2: Threading overhead. 3: 1802!
|
| 6
|
In fact, Intel's 8085 had
additional
instructions that were designed for the convenience of high level
languages like C and Pascal (particularly the
new LDHI/LDSI for pointing DE to structure
and stack frame members/variables (respectively), followed
by LDAX/STAX [8-bit A] or the
new LHLX/SHLX [16-bit HL] transfers), so it
wasn't just the 6809/6309 that could do this, though not so well as
they. Intel, however, chose to leave these instructions undocumented,
presumably to not further fractionate the burgeoning
CP/M market at the
time, or perhaps to not steal market share from their upcoming
8086 family. (No
doubt this deliberate omission was heartbreaking to the designers of
the 8085.) To my knowledge, no high-level language ever exploited the
8085's additional abilities.
|
| 7
|
The
TI-99/4A should
have been the fastest machine of the era due to its 3 MHz
9900
16-bit CPU, derived from their minicomputer architecture, but was
instead one of the slowest due to all of its application
memory being indirectly accessed, 8 bits at a time, through
the 9918 video
controller. (The CPU was given only 256 bytes of 16-bit RAM for use
by native code.) A truly heinous architectural decision that crippled
the product, dragging it down into
1802 territory so
far as apparent speed was concerned, and shooting
TI out of
the saddle in the lucrative and growing home computer market.
The 99/4 was a very clever design, and inexpensive to build, but a
spectacular failure. They designed what was essentially a very nice
closed-system calculator, in the shape of a computer with an
alphanumeric keyboard and color video, and entered it in a cage fight
with real computers. (Or: Don't bring a calculator to a
computer fight? Also, the closed nature of the system infuriated
potential software developers, which did not help.) The results were
not pretty, and TI eventually lost hundreds of millions of dollars on
the effort. This debacle informed IBM's later decision to make its
new 5150
PC an open system instead. See here for
more opinion on the 99/4 machines.
To my knowledge, every general-purpose machine that tried to
save money by using deliberately slower, narrower, bulk memory than
'native' failed in the market, except for the 8088-based IBM PC (for
which there were essentially no 8086-based competitors). The 99/4A
was merely the most egregious of these, being far slower than the
typical half-speed entrant. (And even it might have succeeded if it
were merely half the native speed, had the narrow bulk memory somehow
been directly connected to the CPU instead of through the video
controller.)
Here is our block copy benchmark for the 9900, assuming (like for the
best-case 6809 code) that the block is an even number of bytes and
aligned on even byte boundaries (not unreasonable for a 16-bit
system), and that we're not shoehorned into a crippled-by-design
machine like a 99/4A:
| 9900
|
|---|
Cycl Addr Code Label Mnemonic Comment
(12) 0000 0400 ssss copy: li r0,src
(12) 0004 0401 dddd li r1,dst
(12) 0008 0402 nnnn li r2,count
(30) 000C CC70 loop: mov *r0+,*r1+ ; Copy word and advance pointers
(10) 000E 0642 dect r2 ; Are we done?
(10) 0010 16FA jne loop
|
From the programmer's perspective this could not be simpler, or
cleaner. Only six instructions (18 bytes), half of them (6 bytes)
repeated in a loop. Its 18 bytes puts it in the middle of the size
pack, because all instructions are 16 bits, or multiples thereof.
It's a bit pedestrian at 36+50×(N/2) clocks, and this is
entirely due to the fact that 9900 'registers' are all actually out in
RAM. (For unbeatable context switching, but offset by modest
calculation speed that is reminiscent of using pointers on the 6502.)
The 9900 is not particularly fast, but its clean 16-bit architecture
generally makes up for it; it handles high-level languages well and
has enough registers that algorithms are not 'cramped'.
Note that the 9900 is a true 16-bit system. While it could do the
copy a byte at a time, using movb and dec
instructions in the loop, there's a substantial speed penalty to do
so:
- Twice as many passes through the loop, of course, and
- Each pass is slower, because the bus hardware has to do a
read-modify-write for each byte written in order to preserve the
byte (within the 16-bit word) that is not being modified;
the 9900 supports byte operations but its memory interface does
not have byte lane strobes that protect non-targeted bytes in a
word.
No one would willingly do this, not in a 16-bit system.
|
Here's how the calculated speed of the 9900 stacks up against the
8-bit CPU's, in their typical incarnations in home computers, copying
128 bytes of data.
Many of these CPU's were available in faster systems, so scale down
their times accordingly. Note that the 6800 is less than half as fast
as its direct competitor: the 8080. And yet, the 1802 is even slower!
|
| CPU | | MHz | MSec
|
|---|
| 6309 | | 1.00 | 0.40
| Z-80 | | 4.00 | 0.68
| 9900 | | 3.00 | 1.08
| 6502 | | 1.00 | 2.07
| 6809 | | 1.00 | 2.19
| 8080 | | 2.00 | 2.50
| 6800 | | 1.00 | 5.52
| 1802 | | 1.78 | 6.99
|
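Three of these rows can be reproduced directly from cycle formulas given elsewhere in this article; a quick Python check (the Z-80's 30-clock setup, three 10 T-state immediate loads, is my assumption, as is the 21/16 T-state LDIR split):

```python
# Recompute three table rows (milliseconds to copy 128 bytes) from the
# cycle formulas stated in the text:
#   1802: 160 + 96*N clocks at 1.78 MHz
#   9900:  36 + 50*(N/2) clocks at 3.00 MHz
#   Z-80: ~30 clocks setup + LDIR at 21 T/byte (16 for the last) at 4 MHz
N = 128

ms_1802 = (160 + 96 * N) / 1.78e6 * 1e3
ms_9900 = (36 + 50 * (N // 2)) / 3.00e6 * 1e3
ms_z80 = (30 + 21 * (N - 1) + 16) / 4.00e6 * 1e3

print(round(ms_1802, 2), round(ms_9900, 2), round(ms_z80, 2))  # 6.99 1.08 0.68
```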
| |
The Z-80 beats the 9900 at this task, but only by 37%, and that only
because of the Z-80's LDIR instruction. (The same goes
for the 6309.) Without LDIR the Z-80's just a 2×
8080, which puts it at 1.25 msec, behind the 9900. (And,
notably, also behind a 2 MHz 6502—those being the 8-bit
speed demons.)
And this is, essentially, a simplistic 8-bit task. Given the 9900's
superiority in registers and instruction set, the 9900 would
absolutely trounce the Z-80 at any real-world task involving
calculations or high-level languages.
The 6309/6809, having a few more registers and good support for
high-level languages, might not be too embarrassed, but it would still
lose to the 9900.
There
is no replacement for displacement, as they say.
|
TI, with their early (1976) 9900, could have owned the home
computer market. But the 99/4, what they chose to bet the farm
on, was an absolute dog.
|
For its time the 9900 looked pretty good compared to its 8-bit peers,
and held up well with time. Contrast this 9900 code with that for the
later mainstream 16-bit microprocessors, most of which have
specialized block/string instructions, as well as increased addressing
capabilities:
| 8086
| Z8000
| 68000
| 32016 (née 16032)
|
|---|
copy: lea si,src
lea di,dst
mov cx,count/2
cld
rep movsw
The 8086 exhibits considerable
8-bit heritage, such as special-
purpose registers and instructions,
and variable 8-bit instruction en-
coding. (1–6 bytes.)
|
copy: lda r1,src
lda r2,dst
ld r0,#count
ldir r2,r1,r0
Like the 9900, instructions are encoded
in a variable number of 16-bit words.
Only the Z8000 lets you choose which registers to use for the copy
instruction.
|
copy: lea src,a0
lea dst,a1
move #count/2-1,d0
loop: move (a0)+,(a1)+
dbra d0,loop
Like the 9900, instructions are encoded
in a variable number of 16-bit words,
and there are no block/string instructions.
(Misaligned word access is illegal.)
In the 68010, with its
'loop mode',
the
copy loop is as performant as other
processors' string instructions and far
more versatile, thus: elegant.
|
copy: addr src,r1
addr dst,r2
movw #count,r0
movsw
Like the 8086, instructions are encoded
using a variable number of bytes. (1–23)
|
We have used specialized instructions where available, as not using
them can result in a surprisingly poor showing:
| 8086
| Z8000
| 68000
| 32016
|
|---|
copy: lea si,src
lea di,dst
mov cx,count/2
more: mov ax,[si]
mov [di],ax
add si,2
add di,2
loop more
This 16-bit CPU doesn't have auto-
increment, so if not using string
instructions, and their dedicated
registers, the code is much larger
and slower. (No human would ever
choose to do this. A compiler might.)
(Optimally, using two inc instructions
to replace add is the same speed, but
smaller; using mov instead of lea is
both smaller and faster here.)
|
copy: lda r1,src
lda r2,dst
ld r0,#count/2
loop: ld r3,@r1
ld @r2,r3
inc r1,#2
inc r2,#2
djnz r0,loop
This 16-bit CPU doesn't have auto-
increment, so if not using string
instructions the code is much
larger and slower.
A stack-autoincrement can be used
for half the job, saving 1 instruction:
loop: pop r3,@r1
ld @r2,r3
inc r2,#2
djnz r0,loop
Not really good enough, though.
|
copy: lea src,a0
lea dst,a1
move #count/2-1,d0
loop: move (a0)+,(a1)+
dbra d0,loop
Still only two repeated instructions.
The 9900 and 68000 both have very
tidy expressions of this function, even
without specialized instructions.
|
copy: addr src,r1
addr dst,r2
movw #count,r0
loop: movw 0(r1),r3
movw r3,0(r2)
addw 2,r1
addw 2,r2
subw 2,r0
bnz loop
This 16-bit CPU doesn't have auto-
increment, so if not using the string
instructions, and their dedicated registers,
the code is much larger and slower. (No
human would ever choose to do this. A
compiler might.)
|
Compilers have the unenviable job of recognizing when the specialized
instructions can be used to good effect, and arranging the register
assignments to be able to use them.
Three of these four processor families, faced with the need to copy
data as quickly as possible, added specialized instructions. They do
this, and well, but they do only this. One went another way,
with the 68010, and found a way to not add instructions but
rather to make the existing instructions much more efficient. And,
incidentally, speeding up many tightly repetitive tasks, not
just copying. Beautiful. Author, aviator, and adventurer
Antoine
de Saint-Exupéry said it best: "Perfection is finally
attained not when there is no longer anything to add, but when there
is no longer anything to take away." Kudos, Motorola.
It should be noted that everything discussed up to now would be
classified as a
CISC
machine, no matter how primitive. The thrust towards the opposite,
RISC,
in recent decades embraces instruction streams that look more like our
non-specialized, and unattractive, fragments above. This direction is
predicated on several technological trends and breakthroughs:
- RISC instruction streams are notably less code-dense than CISC
streams for the same functionality. RISC programs are thus much
larger, and can require larger address spaces; to be competitive
they also need inexpensive bulk RAM with which to populate those
larger spaces cost-effectively.
- RISC instructions tend to be fixed-size, which makes the fetch and
decode logic inside the CPU simpler, and thus faster. (Again:
code size penalty.)
- The theory with RISC processors is that as you simplify the
internals you can speed up operation more than the
penalty you incur with less-dense instructions. (RISC doesn't
make sense without this.)
- RISC processors tend to have significant high-speed internal
instruction caches, to help mitigate the code size penalties.
This does add additional system complexity.
Note that none of this applied to the older
machines—they needed to be CISC in order to compete in
their markets. Memory was relatively slow and expensive; the CPU
needed to make up for that.
Note that there are still potential places where RISC might be a
poorer choice, for reasons other than base cost and functionality.
Environments where overall system complexity or high speeds were
penalized might still favor CISC. Space hardware is the first thing
that comes to mind. Space hardware needs to be significantly more
environmentally robust, and has near-zero tolerance for failure.
Larger silicon area (bigger target) is more susceptible to various
faults, and smaller feature geometries (more sensitive target) are
likewise more susceptible to environmental upsets. Power is often
limited.
A basic axiom is that as complexity increases reliability decreases,
and it is the total system reliability that usually matters.
You can afford to pump some additional "C" into your "ISC", if the
memory system complexity goes down enough to make up for it. There
are a lot of factors that would go into finding an ideal balance!
|
| 8
|
There are two (really: three) MMU possibilities for the 1802:
- A traditional I/O-accessed peripheral page-mapping MMU, probably
using 16 4KB pages. This gives the CPU access to only 64KB at a
time, but the 16 accessible pages can each come from anywhere in
up to 16MB of memory. (This is the sort that Morrow used for
their Z-80 running Micronix.) Best for running multiple 64KB (or
smaller) independent tasks at once. (Or, for bare-metal
programming.)
- The 1802 architecture and implementation allows for another, more
interesting possibility: effectively extending each of its 16-bit
registers (for addressing purposes) to 24 bits, similar to the
65816.
Remember that all memory access on the 1802 is addressed
by one of the 16 registers, and its instruction decode and
execution is simple enough that we can always track which register
is doing the addressing. (Let's call this 'M'.) The memory
address decoder can utilize an external 16×8 RAM bank
indexed by 'M' to supply the upper 8 bits of the 24-bit address.
Each register can thus have access to its own 64KB
address segment if we wish, allowing a single task to be much
larger than 64KB. (Or, for bare-metal programming. Multiple
tasks are also possible, but they can't be smaller than 64KB and
remain independent. Not without additional constraint hardware.)
There are two possibilities for programming this kind of MMU:
- I/O-accessed peripheral, as above, but remember that I/O on the
1802 is a bit cumbersome, and the data is
accessed through a (mapped) register, so it's entirely
possible to paint yourself into a corner if you're not careful.
- The application note Data Bus Contention During CDP1802
Register-to-Register Operations hints at another, very
interesting possibility. Paraphrased:
In 1802-based systems bus contention problems have been found to
occur during internal data transfer operations
(GHI, PHI, GLO, PLO)
if memory read decoding does not also include
the MRD signal,
losing [clobbering] data in one or more registers.
Basically the D register value is actually visible (and, more to the point,
vulnerable) on the data bus during
GHI/PHI/GLO/PLO
instructions, which means we can track what's going on. (Also,
interestingly enough, the full 16-bit register values are visible
on the address bus for INC
and DEC instructions during the execution [second]
cycle of the instruction.) By using a recognizable (by external
hardware) but normally never-used sequence of instructions we can
arrange that select PHI instructions affect both the
register bank and the MMU map page by copying the visible
D to the mapper as well. Perhaps:
; Point R1 at beginning of second 64KB bank of RAM. ($010000)
F8 01 LDI 1 ; D ← 1
B1 PHI 1 ; D → 1.HI
B1 PHI 1 ; D → 1.HI, D → map[1]
F8 00 LDI 0 ; D ← 0
B1 PHI 1 ; D → 1.HI
A1 PLO 1 ; D → 1.LO
Our trigger here is two
identical PHI instructions in a row, something that
you would never normally do. The hardware, upon seeing the
execution of the second PHI, would also write the
data bus contents (D) into map[N] (N=1 in this example).
For reading we cannot reliably do the same thing
for GHI, by doubling them, because it is neither safe
nor reliable to deliberately clash bus drivers (hence the
Application Note that led us here), and so our mapper contents
can't be read into D this way. Some other mechanism would have to
be provided, if reading were necessary. As it would be, if
interrupts are ever part of the picture. Interrupts would also
have to be temporarily blocked for the first instruction of any
potential special sequence so that these sequences couldn't be
interrupted, as there's no reliable way to save this halfway-there
state across an interrupt. DMA should need no special
consideration because those cycles are recognizable and
independent, and needn't affect our sequencing logic, and would
naturally have full access to our 24-bit space.
If you were willing to forgo 1804–6 compatibility
you could safely do something like:
; Fetch R1's 64KB bank select value into D
91 GHI 1
68 INP 0 ; D ← map[1]
The INP 0 instruction (68) normally can't be used
because there's nothing within that execution cycle to
recognize in order to gate something onto the bus, and so was
designated as an unused opcode. (With the probable side-effect of
destroying D, if executed anyway.) But with tracking logic we can
recognize it and use it to trigger the MMU read. Of course, the
68 opcode is the only formally unused one, and is the gate to all
of the 1804–6 enhancements, so doing it this way automatically
limits your CPU choice. Probably not the best idea.
Slightly less efficiently, which probably doesn't matter because
you don't often need to read these values anyway, you can maintain
full system flexibility by tracking this sequence instead:
; Fetch R1's 64KB bank select value into D
91 GHI 1
F8 XX LDI XX ; D ← map[1]
The tracking memory address decoder would turn off the normal RAM
read for the immediate operand, discarding it, and substitute the
MMU read data instead. A more efficient alternative might be:
; Fetch R1's 64KB bank select value into D
91 GHI 1
01 LDN 1 ; D ← map[1]
but would seem to be complicated by the fact that there is no LDN
0 instruction, its opcode being interpreted instead
as IDL; the tracker, however, needn't consider the register
indicated in the LDN instruction at all. (Any 'G'
instruction followed by any 'LD' instruction renders the 'G'
instruction essentially useless, and could be fair game for a
trapped sequence, but something has to indicate the
ultimate data source, and it probably should be the N field of the
first of our two trap-sequence instructions: the 'G' instruction.)
There are probably other 'never-used' sequences that could be
recognized and used instead, but one of these seems sufficient.
If using an MMU like this any of the 1802 subroutine calling
mechanisms could be used to call anywhere within the
extended address space. (SCRT would need modification, of course.
You might want both near- and far-calling versions, for
efficiency.) The 1804–6 calling mechanism would be
constrained to remain within the current 64KB segment, much like
short branches are constrained to stay in-page, and the linkage
register might not point to the correct bank unless you made prior
arrangements. The bus tracker necessary to support the
1804–6 would also be substantially more complex, in order to
do the correct bank mapping within the execution phases
of SCAL and SRET, and the other enhanced
instructions. (Some enhanced instructions could not participate
properly in our enhanced 24-bit address space because only 2 of
the 3 addressing 8-bit registers would be considered. For
example, the
new RLDI, RLXA, RSXD,
and DBNZ instructions would be significantly
hampered, and probably not truly usable in a 24-bit program.)
There are a couple of additional options for our 24-bit system.
First, as indicated above, there could be 64KB banks,
where an addressing register was constrained to stay within this
(OS-provided?) bank unless explicitly changed. Another, likely
better, option would be to also track incrementing and
decrementing instructions, and apply carry/borrow to the
appropriate MMU bank-select register. This would provide a 'flat'
24-bit addressing model, and is only possible because the 1802
actually exposes the 16-bit address during the execution
of INC and DEC instructions, even though
it's not used to access memory. (And here the lamentable lack of
a condition
code register in the 1802 is actually an advantage, if you can
call it that, because there are no side-effects to consider.)
Putting all this together, an any-size block copy in such a 24-bit
'flat' system would look like this:
| Native 24-bit | 24-bit 'Macros', à la SCRT
|
|---|
Cycl Addr Code Label Mnemonic Comment
(16) 000000 F8 ss copy: LDI src.ex ; Setup
(16) 000002 BD PHI 13
(16) 000003 BD PHI 13
(16) 000004 F8 ss LDI src.hi
(16) 000006 BD PHI 13
(16) 000007 F8 ss LDI src.lo
(16) 000009 AD PLO 13
(16) 00000A F8 dd LDI dst.ex
(16) 00000C BE PHI 14
(16) 00000D BE PHI 14
(16) 00000E F8 dd LDI dst.hi
(16) 000010 BE PHI 14
(16) 000011 F8 dd LDI dst.lo
(16) 000013 AE PLO 14
(16) 000014 F8 nn LDI cnt.ex
(16) 000016 BF PHI 15
(16) 000017 BF PHI 15
(16) 000018 F8 nn LDI cnt.hi
(16) 00001A BF PHI 15
(16) 00001B F8 nn LDI cnt.lo
(16) 00001D AF PLO 15
(16) 00001E 4D loop: LDA 13 ; Fetch source byte, advance.
(16) 00001F 5E STR 14 ; Store dest byte,
(16) 000020 1E INC 14 ; advance.
(16) 000021 2F DEC 15
(16) 000022 8F GLO 15 ; Are we done?
(16) 000023 3A 1E BNZ loop
(16) 000025 8F GHI 15
(16) 000026 3A 1E BNZ loop
(16) 000028 8F GHI 15
(16) 000029 0F LDN 15
(16) 00002A 3A 1E BNZ loop
|
Addr Code Label Mnemonic Comment
000000 D1 0D ss ss ss copy: LDI 13,src ; Setup
000005 D1 0E dd dd dd LDI 14,dst
00000A D1 0F nn nn nn LDI 15,cnt
00000F 4D loop: LDA 13 ; Fetch source byte, advance.
000010 5E STR 14 ; Store dest byte,
000011 1E INC 14 ; advance.
000012 2F DEC 15
000013 D1 1F 0F BNZ 15,loop ; Are we done?
Here we've used SEP 1 (not a good choice!) as a macro
lead-in, like SCRT, to select from common operations involving
extended registers.
This code is considerably smaller, but is also slower and would
require a (shared) support subroutine (not shown) that actually
implemented the operations.
| Once again the 1802 proves to be both: 1) remarkably versatile, yet 2) cumbersome and slow.
|
|
The setup is grotesque, as you would expect, because we're loading
three full 24-bit registers a piddly 8 bits at a time,
but the copy part is, notably, unchanged. Most of the
condition-testing cycles are also unchanged—every 256 bytes
we incur two more instructions, and every 65536 bytes we incur
three more instructions. So it essentially is no slower (per
byte) than the original copy routine, and not unreasonably larger,
both of which are fair accomplishments.
Instructions in which the MMU took an active interest are marked
in red, although it played a role
in every instruction: the program counter is also 24
bits. Also, the MMU has to half-rouse
on every PHI/GHI instruction
just in case it's part of a trapped sequence, and it also has to
block interrupts until after the second instruction in any
potential sequence so that a trapped sequence is not interrupted
in the middle.
In such a system you could have 'macro' instructions (like SCRT,
for example) for common activities like loading a constant into a
24-bit register or branches based on whether an extended register
was zero or not. (Also shown.) On the other hand, memory-saving
techniques like this might not actually be necessary in a 24-bit
system that had a lot of memory. Expanding these operations
inline, as shown first, would be bigger, but would execute
somewhat faster.
For compatibility with extant 16-bit programs the MMU should
probably not apply non-zero high-order address bits to memory
until the extension value(s) had been set. (On a
per-register basis, probably.) This would allow 16-bit counters
to wrap around as they would on a non-extended system. (And we'd
probably want a non-system-reset way to go back to 16-bit mode for
the use of an OS that could load both 16-bit and 24-bit programs.)
On the other hand, it's a rare 1802 system that ever had enough
memory that expecting a 16-bit wraparound would actually be
useful, so maybe this all would be unnecessary.
- In fact, one could even perhaps combine these addressing
enhancements since it would only take a 256×8 mapping RAM to
do so. This could give us multiple small independent tasks, with
the ability to give selected tasks greater than 64KB sizes. This
could be very versatile, but adds significant programming
complexity and really opens up the question of
inter-task protection9,
which has been lurking in the wings for a while now. In which case
you might wish to require that memory mapping be arranged through
a (protected? privileged?) OS layer, possibly precluding anything
so perilous as
PHI-tracking. (With even more tracking and
protection hardware even that might still be possible.) By using
a 2048×8 mapping RAM, practical since the early 80's, we
could even have 8 hardware-mapped (instantaneous map switching)
tasks, each independent and of variable size.
This combined system should probably preclude the 'flat' 24-bit
addressing mode discussed above, as that allows tasks to 'leak'
out of their OS-provided 'pens'. Unless, of course, associated
constraints were also provided as part of the 'M' context.
(Significantly more complexity.)
It should be noted that access to a larger 'flat' memory space is
the main reason why Motorola's 68000 gained significant
market share against Intel's earlier, established, 8086 family.
So, 'flat' is where it's at?
- ...And, what if 24-bit addressing was not enough? (I think that
the notion of an 1802 slinging more than 16MB of bits around is
doubly ludicrous, but...) Well, then you'd probably want to track
doubling
of both
PHI and PLO
instructions, to tack on an extra 16 bits of addressing, for a
full 32-bit addressing model. So, even if doing only 24-bit
addressing (at first), maybe the extension should be
on PLO just in case you ever want to go bigger in the
future, thus maintaining a bit more forward compatibility in
the software? It's a thought.
At this point it's getting more than a bit
silly10, 11.
If one were interested in this sort of thing, the best
approach would be to craft a virtual system using simulation,
and see how it all worked together: CPU, memory, tools, and
applications. Only if it tested out as practical, and you actually
had need of this in physical hardware, should you consider building
one.
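In that simulation spirit, here's a minimal Python sketch of the doubled-PHI trigger described above. The class and method names are invented for illustration; the model watches the executed opcode stream and, on a second consecutive identical PHI N, also latches the D value (visible on the data bus, per the application note) into map[N]:

```python
# Hypothetical model of the doubled-PHI MMU tracker.  On the 1802 the
# PHI N opcodes are 0xB0-0xBF, and D is visible on the data bus during
# their execution cycle (per the app note cited in the text).
class PhiTrackingMMU:
    def __init__(self) -> None:
        self.page_map = [0] * 16   # 16x8 mapping RAM, indexed by register
        self.last_op = None        # previous opcode, for pair detection

    def watch(self, opcode: int, d_bus: int) -> None:
        """Call once per executed instruction; d_bus is only meaningful
        (and only used) when a doubled PHI is trapped."""
        if 0xB0 <= opcode <= 0xBF and opcode == self.last_op:
            self.page_map[opcode & 0x0F] = d_bus & 0xFF  # map[N] <- D
            self.last_op = None    # consume the pair
        else:
            self.last_op = opcode

    def translate(self, m: int, addr16: int) -> int:
        """Extend register M's 16-bit address to 24 bits via the map."""
        return (self.page_map[m] << 16) | (addr16 & 0xFFFF)

mmu = PhiTrackingMMU()
# The example sequence from the text: point R1 at the second 64KB bank.
mmu.watch(0xF8, 0x01)   # LDI 1        D <- 1
mmu.watch(0xB1, 0x01)   # PHI 1        D -> R1.HI
mmu.watch(0xB1, 0x01)   # PHI 1 again  D -> map[1] as well (trapped)
print(hex(mmu.translate(1, 0x0000)))  # 0x10000
```

A real tracker would also have to block interrupts between the two PHIs, as noted above; that detail is omitted here.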
|
| 9
|
If you're going to support multiple tasks, with inter-task protection
(for running things like Unix), the MMU has even more
responsibilities. A bare minimum is a write-protect bit, per
page/segment. This prevents inadvertent corruption, but does nothing
for information security. To provide
that you have to arrange so that any executing task
has no access to any memory but its own. You also need a
protected context for the OS itself, where only it is allowed to mess
with the MMU configuration. And you need a reliable way to trap into
the OS for service.
The 1802 has no exception mechanism whatsoever, so the only way to do
an OS trap is to use an opcode tracking mechanism to detect and
intercept a 'useless' sequence; the sequence must be chosen carefully
so that it never trips except in a deliberate trap situation.
The IDL instruction (00) is a natural, because no user
application should ever do such a thing. Only a driver, in
OS context, or the OS itself should ever use that instruction. The
timing of this could be tricky, though, because you'd have to prevent
the CPU from seeing that instruction when in user context. Some
other, more innocuous sequence might be necessary, so that you could
let it actually execute before beginning the trap.
Then you need a guaranteed way to vector execution to a known place in
the OS context, without losing any state of the CPU, so that
it can be resumed cleanly. (Well, you can lose a little state if
that's part of the contract with the OS itself.) I think this can
only be accomplished by 'jamming' an instruction sequence into the
CPU, regardless of what the address bus might be doing, which takes
over the CPU while recording what it must for a clean resumption
later. (Basically what was done for the Z-80 CPU in
Applied
Microsystems' EM-180 in-circuit emulator.) Interrupts must be
blocked while the memory system is in this special state. Possible
instruction sequence, once the trap intercept has been triggered:
; Hardware has detected our OS trap sequence, and has started
; jamming instructions into the CPU regardless of its state.
; Memory access blocked during the ('****'-indicated) jam.
**** 79 MARK ; HW captures instruction read cycle addr (addr to return to),
; write cycle addr (R2) and write cycle data (X, P).
**** 73 STXD ; HW captures write cycle data: D. (Trap opcode?)
**** 83 GLO 3 ; HW captures R3 from data bus. (In case P was not 3.)
**** 93 GHI 3
**** D3 SEP 3 ; Jamming lets us do this before R3 is set.
**** F8 00 LDI 0 ; Jump to R3=0000 in OS context. HW is done capturing.
**** A3 PLO 3 ; Prevent R3.1 increment before next instruction.
**** B3 PHI 3 ; Two of these, if using 24-bit MMU described above8.
**** A3 PLO 3 ; Again, starts us at zero.
0000 E2 SEX 2 ; (Code could perhaps be in OS space rather than jam hardware.)
0001 B2 PHI 2 ; Two of these, if using 24-bit MMU described above8.
0002 A2 PLO 2
0003 22 DEC 2 ; Stack at R2=$FFFF
0004 CC LSIE ; Save interrupt-enabled state. D=0 means IE=1, use RET
0005 F8 01 LDI 1 ; to return; D=1 means IE=0, use DIS to return.
0007 73 STXD ; IE state (in D) now saved on (new) R2 stack.
0008 73 STXD ; HW now vomits saved R(P), D, X, P, R2, & R3 state, 8 bytes,
0009 73 STXD ; into the R2 stack. (D register value is ignored.)
000A 73 STXD
000B 73 STXD
000C 73 STXD
000D 73 STXD
000E 73 STXD
000F 73 STXD
; We are now running in OS context P=3 X=2, R2=$FFFA, R3=$0010
; Only R(P), X, P, D, R2, R3, and IE have been saved on R2 stack,
; any other CPU state must be preserved before we change it.
; Continue on to do the OS work from here...
What a mess! (47 cycles [51 in 24-bit system] to get into the OS, not
counting the trap sequence itself.) And that only got us into
the OS; we still have to craft a return-to-user-task mechanism that
reverses all this, with minimal damage to the processor state. And
deal with interrupts and DMA, in some kind of safe system-designed
manner. We will need interrupts for at least the task-switching
timer, if not for other device drivers.
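The 47/51-cycle figure is easy to sanity-check. Here's my own back-of-the-envelope tally (not from the article), assuming the standard 1802 timing of 2 machine cycles per instruction, 3 for long branches/skips such as LSIE:

```python
# Tally machine cycles for the OS-entry trap sequence listed above.

JAMMED = ['MARK', 'STXD', 'GLO 3', 'GHI 3', 'SEP 3',
          'LDI 0', 'PLO 3', 'PHI 3', 'PLO 3']        # hardware-jammed prologue
OS_ENTRY = (['SEX 2', 'PHI 2', 'PLO 2', 'DEC 2', 'LSIE'] +
            ['STXD'] * 9)                            # IE byte + 8 HW-saved bytes
# (assumes IE=1, so LSIE skips the LDI 1; the IE=0 path adds 2 cycles)

def cycles(seq):
    # 1802 timing: long skips (LSxx) are 3 machine cycles, all else here is 2
    return sum(3 if op.startswith('LS') else 2 for op in seq)

total16 = cycles(JAMMED) + cycles(OS_ENTRY)          # 16-bit system
total24 = total16 + 2 * 2                            # two extra PHIs in 24-bit system
```

At the Elf's 1.76 MHz clock (8 clocks per machine cycle) that's over 200 µs just to reach the OS, which puts "What a mess!" in perspective.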
And, of course, once within the OS we must do the task's requested
operation. Whether this operation was selected by register contents,
or inline (with the trap) opcodes, or some combination thereof is yet
to be determined. Time-slicing and other task-switching would also be
done before resuming the original (or some other) task.
A possible task-resumption sequence, hardware-assisted:
; OS has restored most registers by now, we've told hardware
; our intent to resume, and it has started jamming instructions
; into the CPU regardless of its state. We only need to restore
; R2, R3, X, P, D, and IE. Plus any HW task-selector. The exact
; jam sequence had to be prepared earlier in the OS, based on the
; saved information and the results of the trap we called.
**** F8 XX LDI XX ; Restore R2
**** B2 PHI 2
**** F8 XX LDI XX
**** A2 PLO 2
**** 22 DEC 2 ; Compensate for side-effect of upcoming RET/DIS.
**** F8 XX LDI XX ; Restore R3 (saved value decreased by 3 to compensate for
**** B3 PHI 3 ; the increments due to the post-PLO instruction execution).
**** F8 XX LDI XX
**** A3 PLO 3
**** F8 XX LDI XX ; Restore D. Result code? DF not restored, also result?
**** 7[01] RET/DIS ; Restore X, P, and IE. HW also un-blocks interrupt line.
; We are now running back in user context, with whatever OS
; task complete and any register changes appropriate to this.
More mess. (22 cycles to get back out of the OS, not counting work
done within the OS to prepare the hardware for this return sequence.)
If our system is using DMA and/or interrupts it'll be difficult to
support in a protected environment. One way might be to trap, in
user contexts, any use of registers 0 and 1 by user programs,
basically dedicating them to OS and driver use only.
The DIS instruction must also be prohibited. More work
for the instruction tracker, and we have to come up with a (similar to
above) trap mechanism for handling 'illegal' instructions. Our 'Unix'
tasks will probably wake up with R2 as the stack and R3 as the program
counter, though they can be changed freely after that. (But not, of
course, to either R0 or R1.)
With this implementation another natural choice for an OS service trap
instruction for user tasks would be SEP 0, an instruction
that can likely never be used if the DMA hardware is armed.
However, like any other prohibited instruction, you can't actually let
it execute before beginning the trap; timing will be tricky.
The instruction tracker and MMU/intercept logic will be involved
enough that we'd probably need to use an FPGA of some sort. Most of
these are now large enough that the rather simple 1802 itself could
also be emulated by the FPGA. In which case we don't need such
elaborate work for a trap, we can just force the CPU emulation to
change state wholesale, whenever and however necessary. Also, it
would be difficult to build this kind of hardware so that it actually
ran faster than a pure software simulation on a fast desktop computer,
so why bother to build anything? Which circles around to the
basic question: what is this all for? This all sounds far
too complicated, and limited, to actually build. A virtual machine
for research purposes would be just as useful, and far easier.
|
| 10
|
Appropriate memory size is really a question of proportion and
purpose. (Remember that we are talking about vintage technology
choices, either back in the day or retrocomputing.
With modern hardware on the table we have many more options.)
Early computers, through the minicomputer era, tended to have data
word sizes that matched their address width, though the address space of any
given machine might not have been fully populated. The 'small'
PDP-8, for example,
had a 12-bit word, and a 12-bit (4096-word) address space, usually
fully populated. Most early general-purpose microprocessors had a
smaller 8-bit word, but as 256 bytes of address space is not nearly
enough for any significant program they all tended to use two
words as addresses, resulting in the near-ubiquitous 64KB address
space of the time.
(This 8/16 dual nature meant that programs that dealt with the full
address space were usually somewhat awkward, as addresses often had to
be assembled piecemeal. A similar awkwardness was exhibited by the
subsequent 16-bit
8086, which had a
16/20 duality. A year later the 16/32
68000 was not
awkward at all, because it supported 32-bit instructions right from
the beginning. Hardware designers were finally learning, after enough
complaints from the software people!)
The common 16-bit 64KB space was more than generous in the early days,
as that amount of memory, regardless of form, was prohibitively
expensive. None of the early microcomputers ever had
anywhere near that amount of available memory in them; 1–16KB
was the norm.
Tools and techniques evolved to make do with less than the maximum
possible resources.
Forth
and the P-system
were strong, along with multi-phase language compilers, etc.
BASIC interpreters,
another way to increase the semantic density of limited memory, were
common.
For a general-purpose machine often the CPU choice is made
for you, by factors like corporate standards or the availability of
necessary software, or for compatibility with software you already
have. For a specific-purpose machine, like an embedded
system, where you have more flexibility, well...
The only defensible reason to choose an
1802 CPU for a new
project, out of all of the available 8-bitters, is for its
environmental strengths, which are considerable. But, as
mentioned before, the memory in such a system also
has to share the same strengths, which makes it particularly
expensive. (Space hardware, which would otherwise be
cost-insensitive, can't be overlarge because of
MTBF
concerns. Battery-operated handheld equipment just can't be
overlarge, period.) So, an 1802 system is, almost by definition, a
limited-memory environment. Unless the application is nearly trivial,
code density will be of paramount importance. Which for an 1802
means:
- A 24-bit direct address space extension is stupid. (Fun, to be
sure, but stupid.) The 1802 is slow to begin with, and
manhandling this amount of (ultra-expensive) memory will be
excruciating. If you actually need larger amounts of memory,
you'll probably be driven to find bulk storage that is cheaper
than RAM: some kind of disk or tape for example, and a paging
(block) access method. Which you can certainly do in a 16-bit
space, and which gives you non-volatility as a bonus. The program
will be more complicated and slower, but you won't really have a
choice.
The only time a 24-bit extension could possibly be
justified is if there existed some kind of slow, cheap
RAM technology that worked with the 1802 in its environment. It
didn't exist then, and I doubt it does even now.
- Thus, if limited to a 16-bit space you can compact (and speed up)
native 1802 code by using an 1804–6 CPU and its extended
instructions.
- To compact application programs even further you'll have to use a
more semantically dense language, such as Forth, P-system,
CHIP-8, etc.
(Even BASIC, though other choices are probably better.) This
costs you even more speed; 1802 systems are slow.
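The paging/block-access idea from the first bullet above can be sketched briefly. This is my own toy model, not any historical implementation: block size, class names, and the single-buffer write-back policy are all assumptions.

```python
# Window bulk data on cheap block storage (tape/disk) through a single
# small RAM buffer inside the 64KB space, instead of widening the bus.

BLOCK = 256                                   # assumed 'disk' block size

class BlockStore:
    """Stand-in for tape/disk: big, cheap, slow, non-volatile."""
    def __init__(self, nblocks):
        self.blocks = [bytearray(BLOCK) for _ in range(nblocks)]

class PagedArray:
    """Byte array far larger than 64KB, accessed one block at a time."""
    def __init__(self, store):
        self.store = store
        self.buf = bytearray(BLOCK)           # the only RAM actually used
        self.cur = None                       # block currently in the buffer
        self.dirty = False

    def _swap_to(self, blk):
        if blk != self.cur:
            if self.dirty and self.cur is not None:
                self.store.blocks[self.cur][:] = self.buf   # write back
            self.buf[:] = self.store.blocks[blk]            # read in
            self.cur, self.dirty = blk, False

    def read(self, addr):
        self._swap_to(addr // BLOCK)
        return self.buf[addr % BLOCK]

    def write(self, addr, value):
        self._swap_to(addr // BLOCK)
        self.buf[addr % BLOCK] = value & 0xFF
        self.dirty = True
```

Slow, as the text says, since every stray access can force a block swap; but the program's address arithmetic stays comfortably within 16 bits.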
| Sidebar:
|
The 1802 was chosen for small (256 bytes of RAM) trainer computers
like the Elf,
ostensibly because its built-in DMA system (in LOAD mode)
allowed for ROM-less operation and the manual loading of small
programs. In fact you could pretty much do the same thing
with any 8-bit CPU by using a few inexpensive jellybean logic
chips to implement 'load' mode by jamming a (CPU-specific,
starting-at-zero) NOP sequence into the CPU while holding
it in a wait state, then forcing a write into memory and briefly
releasing the wait to deposit a byte of program and advance to the
next location. The CPU would slowly NOP its way through
memory, generating the incrementing address sequence, while the
hardware took care of dribbling the program into memory. (This is
almost exactly what the 1802 does all by itself in LOAD
mode.) Such a trainer wouldn't be quite as small as an Elf due to the
extra jellybeans, but it'd be nearly as cheap. The real reason to use
an 1802 in a trainer like this was the inexpensive companion
1861 video chip,
which together with the 1802 and the paltry 256 bytes of RAM gave you
an actual bit-mapped video display to play with, at a very low
incremental cost to an already cheap machine, unlike every other
trainer computer ever made. The
iconic
1802 'Star Trek' demo occupies only 256 bytes of memory; the
displayed 64×32 bitmap shows the entire
memory—both the figure and the small 64-byte
DMA/Interrupt program that displays all the memory on the screen.
Expanding to 1KB was enough to hold two complete 64×32
alternating display fields, occupying half that memory, and
Ben
Hutchinson's (slow, naturally) implementation of Conway's
Game of
Life program that showed you one field while building the next.
Quite the deal for the 1978 hobbyist with only a little over $100 to
spend on a computer!
|
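The jam-load trick described in the sidebar reduces to a very small model. This is a toy simulation of mine, heavily simplified: the real hardware juggles wait states and bus timing, while here each loop pass stands in for one NOP-plus-forced-write step.

```python
# Toy model of jam-loading: the CPU free-runs NOPs out of reset,
# supplying the incrementing address sequence, while external logic
# forces one program byte into memory at each address.

def jam_load(program, memsize=256):
    memory = bytearray(memsize)
    pc = 0                         # CPU starts at address zero out of reset
    for byte in program:
        # CPU 'fetches' -- hardware jams a NOP onto the bus instead,
        # then forces a write of the next program byte at that address.
        memory[pc] = byte
        pc += 1                    # executing the NOP advances the address
    return memory

# e.g. deposit a tiny fragment: LDI $2A / PLO 3
mem = jam_load(bytes([0xF8, 0x2A, 0xA3]))
```

The 1802's LOAD mode does essentially this internally (via its DMA-IN channel), which is why the Elf needed no extra jellybeans at all.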
If speed is essential use anything but an 1802 (well,
probably not a
6800 either,
as illustrated above4) and hope that you
can find something that is capable of handling the environment,
whatever it is. For general computing you're in good
shape, all choices are available.
If code density is crucial in a system that is not environmentally
challenged, then use a
6502 and Forth,
P-system, etc. This 8/16 CPU, though a bit of a pain to program, is
likely to be the fastest 8-bitter at implementing the intermediate
language. It was also historically the cheapest general-purpose
8-bitter; cheap and fast is a fairly potent combination. A
6309 might be
another good choice, if its instructions particularly lend themselves
to the required solution. (Certain Forth implementations get a speedy
single-instruction NEXT primitive; C and Pascal compile
cleanly to native code as instructions can directly access stack-frame
and structure member variables; block memory copies are faster than
with any other 8-bit CPU. Also, if writing a significant amount of
assembly language it's probably the easiest of the 8-bitters to code
for, speeding development.)
If you actually do need a bigger-than-16-bit address space use a CPU
that can handle it cleanly and quickly, such as a 68000 or
80386; don't waste time
with anything less. (The 16/20 8086 will punish you if exceeding 16
addressing bits, so even if 20's enough, use a better CPU than that;
you'll be glad you did. The 8/24
65816 is far more
focused on compatibility [with the 6502 in this case]
than capability, though it is very capable.)
Unless, like the proverbial dancing bear (where the notable thing is
not how well it dances, but that it dances at all), you want to do
something odd just for the perverse pleasure of making it do something
the 'hard way'. In which case, something like a 24-bit 1802 system
would be ideal!
|
| 11
|
'These go to 11'
(Not yet they don't!)
|