The 1802, Footnotes

1 The 8-bit 1802 is the heart of what are widely regarded as just about the slowest home computers ever sold. (8 clocks per machine cycle, 2 or 3 machine cycles per instruction. Clock rate fixed at 1.7 MHz in order to drive the NTSC video of the typical system, and the video output was serviced by per-scanline interrupts and per-8-pixel DMA cycles, which stole much time from your application program. So, roughly 0.1 MIPS even without the distractions, and its 'I's were nothing to write home about, either. Comparable in performance, perhaps, to the much older 4-bit Intel 4004.)

Superficially the 8-bit 1802, with its 16 16-bit registers, appeared unexcelled until the much later 16-bit Z8000 and 68000 devices, except that the slow instruction execution combined with some truly unfortunate missing instructions2 served to hobble it significantly. In practice, except for its use in small trainer computers like the Elf and some very early programmable home-video games, it was really only deployed in places where its extreme tolerance of environmental stresses like unstable power supplies and clocks, heat, cold, and radiation, or its ultra-low full-CMOS power requirements, were crucial: space vehicles, boreholes, traffic signs, engine computers, battery-operated handheld devices, etc. (Though considered relatively unpopular, it nonetheless sold millions of units in these applications, and is still in limited production as of this writing, 48 years after its introduction.) Generally it seemed to be most satisfactorily programmed in various virtual machine environments like Chip-8, or threaded Forth3. These, of course, exacted a significant performance penalty on top of an already slow processor, but paid for themselves with substantially reduced memory requirements. (Said memory also needed to be as tolerant of the environment as the CPU, and thus could be quite bulky and/or expensive when compared to more mainstream memory devices.)


2 The 16 16-bit registers can be used as program counters, pointer registers, and counters, which is excellent, and can be treated as up to 30 8-bit data-holding registers, also good. However:
  1. All I/O and most ALU instructions work only on memory, indexed by one of the 16 registers, which translates to a significant amount of data shuffling through the 8-bit accumulator, the only gateway to both the registers and the memory. (This also means that even the tiniest systems must include RAM, which is not necessarily the case with other register-rich processors.)
  2. Unusually, the stack operators are post-increment and post-decrement, rather than the usual complementary pre/post pairing, requiring extra increment and decrement instructions when switching between pushing and popping.
  3. There is no condition code register, only the accumulator's separate carry bit. Calculations can only test the current value of the accumulator for zero, or the state of this carry bit.
  4. There are no compare instructions, thus all numeric comparisons must be done with 'destructive' subtractions.
  5. There are no traditional subroutine call or return instructions, though they can be simulated.
These five characteristics combine to make most 1802 machine code rather bloated in size compared4 to its peers, and particularly slow.
3 Threaded Forth is particularly noted for being memory-thrifty, but most find Forth to be a difficult language to work in. It occurs to me that one could combine a P-machine stack-oriented execution architecture (for a more traditional C or Pascal compiler, say) with Forth-style threading in a 2-stack virtual machine to get the best of both worlds.

Classic P-machine runtime interpreters were stack-oriented, which means that the basic blocks a compiler (C, or Pascal) generated could be characterized, just like Forth words, by their stack effects. The compiler could generate these blocks as threadable Forth-like words (with separate return and data stacks, allowing for zero-prologue subroutines and simple machine refactoring) instead of the usual inline code. If the compiler, as each basic block was identified, fingerprinted the block by its data stack effects and its internal operations, it could consult the cache of already-emitted code (the application itself, as built so far to that point); before caching a new block 'word' it could see if a match was already there, and if one were, thread to that instead of emitting fresh code.

The 2-stack machine constraints (versus native code) would encourage a greater block uniformity that could well lend itself to increased sharing. Post-processing could even discover larger clusters of common sequences (of threaded words) and re-factor them for increased sharing, and even better code density. (Manipulations of the return stack would stop the refactoring, of course, because the refactoring itself affects the return stack.)

In this manner the P-machine compiler would be doing, partially anyway, what the experienced Forth programmer does automatically: extensive factoring of an application with an eye towards sharing code that already exists. (This being the secret by which Forth was so memory-efficient.) If the compiler did a good enough job of factoring and sharing, it might make a C or Pascal environment for the 1802 practical for non-toybox applications. Programs would be slow(-ish) due to the triple penalty5, but maybe they'd fit in memory.

One experience I had in school with HP Pascal/64000 on an 8085 embedded system was an abject failure in this respect, and a useful object lesson. The hardware had been designed with 4KB (2× 8755 devices, IIRC) of EPROM for the program, which was typical for embedded systems of the era and felt by all to be sufficient for the intended door-lock application. Unfortunately the project specification required Pascal as the implementation language, not assembly, and the necessary program space turned out to be about 3–4× what was available in the hardware. The Pascal compiler churned out perfectly reasonable native code, but it was just so bulky (especially on an 8-bit processor) compared to what an assembly (or Forth) programmer could do that the project was doomed from the start. Adding source code generally resulted in a linear increase in object code; there was no factoring economy exhibited beyond what was expressed directly as subroutines in the source code. The memory 'gas gauge' dropped precipitously with each new line of code added, burying itself far below empty before the application was even well begun. The application was eventually completed anyway, and functioned correctly, but only when using substantial development-system memory-emulation resources. In fact it was never realized on the unaided target hardware, and the project was eventually abandoned, after all the door-lock hardware had been installed throughout the building. This was a very useful (and painful) object lesson on the importance of project specification and estimation, on having a viable Plan B, and on how ill-suited 8-bit microprocessors were to handling languages conceived for the much more capable mini- and mainframe computers of the day.

4 Consider a not-uncommon small but significant task: copying a small (<256) number of bytes from known (constant) places (src and dst) in memory—your basic structure copy. In C, perhaps something like:
void copy(char *src, char *dst, unsigned char count) {
   do {
      *dst++ = *src++;
   } while (--count);
}
I have hand-assembled these code fragments for comparison purposes; there may be minor errors as a result.
8080/8085:
(10) 0000 21 ss ss       LXI H,src   ; Setup
(10) 0003 11 dd dd       LXI D,dst
( 7) 0006 0E nn          MVI C,count

( 7) 0008 7E       loop: MOV A,M     ; Copy -- Fetch source byte
( 7) 0009 02             STAX D      ; Store dest byte
( 5) 000A 23             INX H
( 5) 000B 13             INX D

( 5) 000C 0D             DCR C       ; Are we done?
(10) 000D C2 08 00       JNZ loop
1802:
(16) 0000 F8 ss          LDI src.hi
(16) 0002 BD             PHI 13
(16) 0003 F8 ss          LDI src.lo
(16) 0005 AD             PLO 13
(16) 0006 F8 dd          LDI dst.hi
(16) 0008 BE             PHI 14
(16) 0009 F8 dd          LDI dst.lo
(16) 000B AE             PLO 14
(16) 000C F8 nn          LDI count
(16) 000E AF             PLO 15

(16) 000F 4D       loop: LDA 13   ; Fetch source byte, and advance.
(16) 0010 5E             STR 14   ; Store dest byte
(16) 0011 1E             INC 14

(16) 0012 2F             DEC 15
(16) 0013 8F             GLO 15   ; Are we done?
(16) 0014 3A 0F          BNZ loop
In the setup parts of the code, pre-loop, we are initializing two 16-bit pointer registers and an 8-bit counter. The 3-instruction 8-byte 8080 fragment is about as optimal as it could get. It takes the 1802 almost twice as much code, 10 instructions in 15 bytes, to do the exact same part of the task. This is due to the 8-bit accumulator bottleneck, through which all data must flow.

In the loop proper, the 1802 actually has a 1-instruction 1-byte advantage (3 vs 4) copying the data because it has a combined load-and-increment-pointer (LDA) instruction.

For loop termination the 8080 takes 2 instructions, 4 bytes, to decrement the counter and loop around if it's not zero. The 1802 takes 3 instructions, 4 bytes, to do the same thing, leaving the 1802 with only a 1-instruction (0-byte) penalty for this phase. This penalty is due to the lack of a condition-code register: we have to pull the counter back into the accumulator to see if it's expired. (The high half of the counter register is undamaged, because we never decrement the low half below zero.)

So, for the loop itself the 8080 needed 6 instructions in 8 bytes to copy each data byte; the 1802 also needed 6 instructions, but in only 7 bytes, giving it a slight size edge. Which does not even begin to pay for the huge penalty it incurred setting up the loop.

Note that every 1802 instruction we used takes 16 clock cycles to execute, whereas the 8080 instructions we used take 5, 7, or 10 cycles to execute. No matter how you rate it, the 1802 comes in last on this task. (And, indeed, most tasks.)

As a routine matter we preserved any processor resources (registers) that we didn't need for the copy. For the 1802, registers R1–R12 were untouched, as was the high half of our counter R15. In the case of the 8080, our benchmark, only the 8-bit B register wasn't touched.

Using the slightly newer, more-popular Z80 and its additional instructions we can beat the 8080 a little, several different ways:

Z80 (1):
(10) 0000 21 ss ss       LXI H,src
(10) 0003 11 dd dd       LXI D,dst
( 7) 0006 0E nn          MVI C,count

( 7) 0008 7E       loop: MOV A,M
( 7) 0009 02             STAX D
( 5) 000A 23             INX H
( 5) 000B 13             INX D

( 5) 000C 0D             DCR C
(12) 000D 20 F9          JRNZ loop
This is one byte shorter, using a relative jump, but it's also two cycles (per iteration) slower.
Z80 (2):
(10) 0000 21 ss ss       LXI H,src
(10) 0003 11 dd dd       LXI D,dst
( 7) 0006 06 nn          MVI B,count

( 7) 0008 7E       loop: MOV A,M
( 7) 0009 02             STAX D
( 5) 000A 23             INX H
( 5) 000B 13             INX D

(13) 000C 10 FA          DJNZ loop
We save 2 bytes here by using the new DJNZ instruction. It's also 2 cycles (per iteration) faster. We must use register B for the counter, instead of C. (C is now our untouched resource.)
Z80 (3):
(10) 0000 21 ss ss       LXI H,src
(10) 0003 11 dd dd       LXI D,dst
(10) 0006 01 nn nn       LXI B,count

(21) 0009 ED B0          LDIR
This is smallest, by a fair margin, and the fastest. (You can only beat its speed with unrolled LDI loops, which we are not interested in here.) It also leaves one 8-bit register untouched: the accumulator this time.

This is the canonical data-copy for the Z-80, and works for blocks of any size.

These only make the 1802 look even worse, of course. Especially that last one.

On the other hand, consider another wildly popular peer, and two that were not so popular:

6502:
(2) 0000 A9 ss       LDA #src-1.lo
(3) 0002 85 00       STA from
(2) 0004 A9 ss       LDA #src-1.hi
(3) 0006 85 01       STA from+1
(2) 0008 A9 dd       LDA #dst-1.lo
(3) 000A 85 02       STA to
(2) 000C A9 dd       LDA #dst-1.hi
(3) 000E 85 03       STA to+1
(2) 0010 A0 nn       LDY #count

(5) 0012 B1 00 loop: LDA (from),Y
(6) 0014 91 02       STA (to),Y

(2) 0016 88          DEY
(3) 0017 D0 F9       BNE loop
This is the biggest fragment so far, and in fact looks a lot like the 1802's in that the setup code is substantial compared to the copy and loop sections; that is because the 6502, like the 1802, can work only with 8-bit values.

By fiddling with the setup constants we can use the Y register both as index and counter. (The X index register remains untouched.) It is fairly typical of the 6502 that if you're willing to warp your code a little, greater efficiency is possible.

We need two 16-bit pointers stored in page 0: from and to, because the 6502 has no pointer registers.

6800:
(2) 0000 C6 nn          LDB #count
(3) 0002 CE dd dd       LDX #dest
(5) 0005 DF 02          STX to
(3) 0007 CE ss ss       LDX #src

(5) 000A A6 00    loop: LDA ix   ; Fetch data byte.
(4) 000C 08             INX
(5) 000D DF 00          STX from ; Switch pointers.
(4) 000F DE 02          LDX to
(6) 0011 A7 00          STA ix   ; Store data byte.
(4) 0013 08             INX
(5) 0014 DF 02          STX to   ; Switch pointers.
(4) 0016 DE 00          LDX from

(2) 0018 5A             DECB
(4) 0019 26 F0          BNE loop
This is even bigger than the 6502 fragment, but is partitioned differently and is much slower. The efficient setup code is similar to the 8080, but the fact that there is only one pointer (index) register means we have to continually shuffle our two pointers in and out of memory. (We use zero-page for the holding cells, for better efficiency.) So, the setup and branch control portions are reasonable, but the loop copy itself is, most decidedly, not. If it weren't for the 1802 this would be the worst of these processors at this task, without question.

The 6800's fatal weakness was having only the single pointer register, which crippled it for just about any interesting task to which it could be put. It's better than no processor at all, of course, but just about any other choice was superior from the software perspective. (The 68HC11, one of the successors to the 6800, attempted to rectify this by adding a second index register, Y.)

6809:
(3) 0000 8E ss ss          LDX #src
(4) 0003 10 8E dd dd       LDY #dest
(2) 0007 C6 nn             LDB #count

(6) 0009 A6 80       loop: LDA ,X+
(6) 000B A7 A0             STA ,Y+

(2) 000D 5A                DECB
(3) 000E 26 F9             BNE loop
Attempting to address all the 6800's weaknesses, this successor is arguably the best 8-bit processor made, and not really a peer of the 1802 because it came along so much later. However, it is interesting to note that even so it's neither the smallest nor the fastest at this particular task. Clock-corrected, the Z-80 still beats it handily, and the 6502 is one clock faster per byte.

(Due to instructions and addressing modes we aren't using here, this processor is the undisputed 8-bit champion when running threaded Forth, native-compiled C and Pascal, and ROM-based code, all of which were explicit design targets of the processor.)

Just as with the Z-80, the 6809 is rich enough that there are some options:

6809 (2):
The 6809 is capable of moving 16 bits of data at a time, which considerably improves its performance, but with more complex code. We need another 16-bit register, which is one more than it comfortably has. However, the user stack pointer (U) can be (ab)used for this, keeping everything in-register, and thus fast:
(7) 0000 34 40             PSHS U   ; Save U stack.
(3) 0002 8E ss ss          LDX #src
(4) 0005 10 8E dd dd       LDY #dest
(2) 0009 C6 nn             LDB #count
(2) 000B 54                LSRB     ; Byte pair count.
(3) 000C 24 04             BCC loop
(6) 000E A6 80             LDA ,X+  ; Move the
(6) 0010 A7 A0             STA ,Y+  ; odd byte.

(8) 0012 EE C1       loop: LDU ,X++ ; Move the
(8) 0014 EF A1             STU ,Y++ ; byte pairs.

(2) 0016 5A                DECB
(3) 0017 26 F9             BNE loop
(7) 0019 35 40             PULS U   ; Restore U stack.
This is a 27-byte subroutine, as large as the 6800's, and instead of the first routine's 9+17×N clocks it takes 41+10.5×N clocks, making it nearly as fast as the Z-80, at least for larger N.
6809 (3):
Optimally (for even-sized copies) it's only 20 bytes in size, and 23+21×(N/2) clocks:
(7) 0000 34 40             PSHS U   ; Save U stack.
(3) 0002 8E ss ss          LDX #src
(4) 0005 10 8E dd dd       LDY #dest
(2) 0009 C6 nn             LDB #count/2

(8) 000B EE C1       loop: LDU ,X++ ; Move the
(8) 000D EF A1             STU ,Y++ ; byte pairs.

(2) 000F 5A                DECB
(3) 0010 26 F9             BNE loop
(7) 0012 35 40             PULS U   ; Restore U stack.
The instruction count is quite reasonable, but unfortunately over half the instructions we're using are 'extended' (compared to the parent 6800) instructions, and require an extra opcode byte to invoke, thus increasing the size and cycle count.

And the Z-80 still beats it.

To handle blocks of any size, most of these fragments would need an additional outer loop, or other equivalent modifications to handle larger counters and/or offsets. (The Z-80's LDIR can already handle any size.) I have ignored this as I don't think it's relevant for the point(s) being made here.

The little-used Hitachi 6309, a successor to the 6809, added block move instructions similar to the Z-80:

6309
(3)    0000 8E ss ss          LDX #src
(4)    0003 10 8E dd dd       LDY #dest
(4)    0007 10 86 nn nn       LDW #count
(6+3n) 000B 11 38 12          TFM X+,Y+
Essentially the same four instructions as the Z-80, but three bytes larger due to the extension-byte orientation of the 6809 family. (Yet still two bytes smaller than the basic 6809 routine. To be fair, the Z-80 also uses extension bytes, but only needs one for this task rather than four.) This takes 17+3×N clocks, making it the absolute winner, bar none, in the speed category, and it can handle blocks of any size. (Too bad this processor was not more popular, but by the time it was available the 16-bit processors, with their vastly increased memory addressing abilities, were coming on strong.)

So, here is our collected summary of the basic structure-copy subroutines, showing size and time requirements for each processor:

Structure Copy Summary
          ------- bytes -------    Clocks/  Clocks/
CPU       Setup  Copy Loop Total    Setup     Byte
8080         8     4    4    16       27       39
1802        15     3    4    22      160       96
Z-80         9     2    0    11       30       21
6502        18     4    3    25       22       16
6800        10    14    3    27       13       43
6809         9     4    3    16        9       17
6309        11     3    0    14       17        3
Discounting the never-seen 6309, the Z-80 is the clear winner on all benchmark fronts here, once clock rates are made comparable. (See below.)

However, benchmark fragments aren't everything.

The 6502, for example, has two secret weapons: 1) due to the way its index registers work it's very good at handling data structures and arrays that are less than 256 bytes in size, and 2) it can keep up to 128 pointers (using all of zero page) on hand at once. So, that grotesque setup code probably need only be done once. Most significant programs are quite repetitive in nature; avoiding the setup on all but the first pass can really add up.

For example, the second time that structure copy needs to be done, alone of these CPUs the 6502 can probably re-use the source and destination pointers where they sit in RAM. The setup for the second and any subsequent copy then drops to two bytes, and two clock cycles.

The 6502 has been called by some the first common RISC microprocessor, because it has so few internal resources. (An arguable definition, but what is true is that the programmer generally only works with three 8-bit registers, which is Not Much no matter how you measure it. [TI's novel 16-bit 9900 takes this register-less approach even further, though in a different way.]) Instead, the 6502 is designed to work very efficiently with the first 256 bytes of RAM, using them as pointers and arrays, and arrays of pointers. Substantial applications, once written and optimized for the 6502, were fast—often very fast. (None6 of the 8-bit processors, with the exception of the later 6809/6309, natively handles high-level languages like C and Pascal well at all. Significant 8-bit applications that need maximum performance must be written in assembly language, and are thus well-poised to take advantage of architectural peculiarities.)

The various '80' processors don't have enough internal registers to keep anything interesting long-term within the CPU; everything has to live out in RAM and shuffle in and out of the CPU while in use. The extremely introverted 1802, focused on its generous 16 internal registers, still doesn't have enough registers to keep very much control data within the CPU, especially considering its instruction set peculiarities—most data still has to live out in RAM and endure the shuffle. The extremely extroverted 6502, with practically no registers to speak of and focused on its first 256 bytes of RAM, can keep the bulk of interesting control information out there, where (using 8-bit zero-page addressing) it is fairly fast to access and can be used where it sits, no shuffling required. (Like the 1802, you can't build even the most basic 6502 system without RAM.) While the 6502 instructions that access the data directly in RAM are inherently slower than purely register-based operations, avoiding the shuffle more than makes up for it. That efficiency adds up over an entire application. (The 6809/6309 can do many of the same in-memory tricks, and even some additional ones, but its instructions to do so are larger and slower, largely negating that 'advantage' so far as performance is concerned.)

In the heyday of the 8-bit era the 'fast' 4 MHz Z-80 systems were generally (though erroneously) thought to be faster than the 'fast' 2 MHz 6502 systems. (These being approximately comparable as they interfaced to the same-speed commodity memory, and in those pre-cache days that was all that really mattered. 8-family processors were approximately half as clock-efficient as 6-family processors; scaling clock numbers by 2 was a reasonable equivalency approximation.)

That conclusion was certainly true of small benchmarks: on a fragment-by-fragment basis the Z-80 does very well. But running real world applications the 2 MHz 6502 systems were usually comfortably in the lead, followed by the 4 MHz Z-80 systems, followed by everything else, with the 1802 systems solidly bringing up the rear. (The 1 MHz 6502 systems, exemplified by the Apple II, naturally found it a lot harder to compete on performance than the 2 MHz systems did, but they usually compensated by having inexpensive integrated graphic displays that shared the main memory with no performance hit or visual artifacts when accessing it.) Larger applications that take advantage of the 6502 instruction architecture simply win.

In fact, if the 1802 were implemented with circuitry as cycle-efficient as the 6502, a not-unreasonable ask, it would actually fare well in these contests, as it's not the worst in code size, and having multiple pointer registers and counters available can be a real advantage in avoiding the pesky shuffle. Revisited:

1802+
(2) 0000 F8 ss          LDI src.hi
(2) 0002 BD             PHI 13
(2) 0003 F8 ss          LDI src.lo
(2) 0005 AD             PLO 13
(2) 0006 F8 dd          LDI dst.hi
(2) 0008 BE             PHI 14
(2) 0009 F8 dd          LDI dst.lo
(2) 000B AE             PLO 14
(2) 000C F8 nn          LDI count
(2) 000E AF             PLO 15

(2) 000F 4D       loop: LDA 13   ; Fetch source byte, and advance.
(2) 0010 5E             STR 14   ; Store dest byte
(2) 0011 1E             INC 14

(2) 0012 2F             DEC 15
(2) 0013 8F             GLO 15   ; Are we done?
(2) 0014 3A 0F          BNZ loop
Instead of 160+96×N clocks, our fictional enhancement would take 20+12×N clocks, making it the fastest of these processors at this task! (Equivalently, clock the real 1802 design at 16–32 MHz—in a non-caching design it's the memory speed that matters, not the clock rate.)

Of course, if we're going to invent fictional derivatives of existing processors, a small change to the Z80 would make it unbeatable in this particular challenge. As it happens the Z80 implements LDIR by decrementing the program counter rather than incrementing it for the second byte of the opcode, if the decremented BC counter pair is not zero. This results in re-fetching the LDIR instruction itself for each data byte transferred, except for the last one, which is why it takes a leisurely 21 cycles per byte transferred. (This is elegant, if inefficient, and also ties into preserving its ability to refresh DRAM reliably while still remaining responsive to interrupts during this lengthy instruction, all while minimizing implementation circuitry.) If it had been able to fetch the opcode pair only once and thereafter just move data, under control of the internal registers, it could have taken 13 cycles per byte transferred, or even a theoretical minimum of 6, far superior to any other 8-bit CPU. (The only non-caching processors I'm aware of that could actually keep the memory bus saturated with data while copying bulk data were the 16-bit 68010, and the 8-bit Hitachi 6309.)


5 Penalties: 1: P-machine interpreter. 2: Threading overhead. 3: 1802!
6 In fact, Intel's 8085 had additional instructions that were designed for the convenience of high-level languages like C and Pascal, so it wasn't just the 6809/6309 that could do this. Intel, however, chose to leave them undocumented, presumably so as not to further fragment the burgeoning CP/M market at the time. (No doubt this deliberate omission was heartbreaking to the designers of the 8085.) To my knowledge, no high-level language ever exploited the 8085's additional abilities.
