[1]
The 8-bit 1802 is
the heart of what are widely regarded as just about the slowest home
computers ever sold. (8 clocks per machine cycle, 2 or 3 machine
cycles per instruction. Clock rate fixed at 1.7 MHz in order to
drive the 1861-based
NTSC video of the
typical system, and the video output was serviced by
per-scanline
interrupts and
per-8-pixel
DMA
cycles, which stole much time from your application program. So,
roughly
0.1 MIPS
even without the distractions, and its 'I's were nothing to write home
about, either. Comparable in performance, perhaps, to the much older
4-bit Intel
4004.)
Superficially the 8-bit 1802, with its 16 16-bit
registers,
appeared unexcelled until the much later 16-bit
Z8000 and
68000
devices, except that the slow instruction execution combined with some
truly unfortunate missing instructions[2]
served to hobble it significantly. In practice, except for its use in
small trainer computers like the
Elf and some very
early programmable home-video games, it was really only deployed in
places where its extreme tolerance of environmental stresses like
unstable power supplies and clocks, heat, cold, and radiation, or its
ultra-low full-CMOS
power requirements were crucial: Space vehicles,
boreholes, traffic
signs, engine computers, battery-operated handheld devices, etc.
(Though considered relatively unpopular, it nonetheless sold millions
of units in these applications, and is still in limited
production as of this writing, 48 years after its introduction.)
Generally it seemed to be most satisfactorily programmed in various
virtual
machine environments like
Chip-8,
SWEET16, or
threaded
Forth[3].
These, of course, exacted a significant performance penalty on top
of an already slow processor, but paid for themselves with
substantially reduced memory requirements. (Said memory also
needing to be as tolerant of the environment as the
CPU
needed to be, and thus possibly quite bulky and/or expensive when
compared to more mainstream memory devices. Memory, in the 8-bit era,
was relatively expensive even in its cheapest forms.)
[2]
The 16 16-bit registers can be used as program counters, pointer
registers, and counters, which is excellent, and can be treated as up
to 30 8-bit data holding registers, also good. However:
- All output and
ALU
instructions work only on memory, indexed by one of the
16 registers, which translates to a significant amount of data
shuffling through the 8-bit
accumulator,
the only gateway to both the registers and the memory. (This also
means that even the tiniest systems must†
include
RAM,
which is not necessarily the case with other register-rich
processors.)
- Unusually, stack operators are post-increment
and post-decrement, not pre-decrement, requiring
extra increment and decrement instructions when switching between
pushing and popping.
(Exemplified in
the MARK/RET
subroutine technique, below.)
- There is no
condition
code register, only the accumulator's separate carry bit.
Calculations can only test the current value of the accumulator
for zero, or the state of this carry bit. Thus, most value
testing is bulky, and signed arithmetic comparison and
overflow detection are particularly difficult.
- There are no compare instructions, thus all numeric comparisons
(except with zero) must be done with 'destructive' subtractions
and potentially bulky value tests.
- There are no traditional subroutine call or return instructions,
though they can be simulated‡.
- DMA
use costs one register (R0). Interrupt use costs two more (R1
& R2), though the necessary stack pointer register (R2) can
usually be shared with the foreground code. (As a result, by
convention R3 is the primary program counter, though R0 is used at
reset and in systems that do not use DMA.)
These six characteristics combine to make most 1802 native code rather
bloated in size, compared[4] to its 8-bit
peers, and particularly slow.
[†]
An application note [Using the 1802 Scratchpad To Store RAM
Variables by RCA's John Stahler] suggests that if you can
afford to burn a 256-byte page of ROM for a 1:1 translation table
this allows for RAM-less operation, whereby the low-order address
of R(X) is copied to the data bus, via the table, for ALU
instructions. Thus RX.hi selects the page containing the 1:1
table, and RX.lo contains the ALU (or output) instruction operand.
This trick costs you one of the 16-bit registers, as well as the
ROM space.
Alternately, address-decode circuitry can be used to control a 373
latch that fakes this 'ROM' translation table, but it seems to me
that said circuitry would not really be superior in any way to an
actual small 8-bit RAM device, except for possibly environmental
[voltage, power, temperature, radiation] reasons. And it still
costs you a register pair.
Or, use an 1804/1805, which contains a small amount of RAM.
[‡]
Subroutines are an interesting proposition for the quirky 1802.
There are basically three techniques in common use:
SEP:
Dedicate registers as program counters. Use a SEP instruction to
call, a complementary SEP to return. In subroutines the program
counter usually must be left at a suitable (re-)entry point after
returning. (For the next call, unless you want to reload the register
for every call.) Minimum of 4 cycles per subroutine call and return.
Pro:
- Smallest/fastest.
- Perfect for co-routines.
- No RAM required.
- Inline data can be passed to the subroutine.
Con:
- Can't afford to lose too many registers. Unrelated subroutines can
  share SEP registers, but that makes the code slower and bulkier,
  due to the reloading.
- Inflexible. Callee must know who the caller is in order to use the
  correct SEP to return.
- No recursion possible.

MARK/RET:
As above, but more flexible. Use MARK, SEP, and DEC 2 instructions to
call. Use SEX 2, INC 2, and RET (or DIS) instructions to return.
Minimum of 12 cycles per subroutine call and return.
Pro:
- Call tree can be flexible; subroutines return to whomever called
  them.
- Inline data can be passed to the subroutine.
Con:
- As above, can't afford to lose too many registers, etc.
- Must have a stack. (Costing a register [R2], and a byte of RAM per
  call depth.)
- Recursion still not possible.
- Must know whether interrupts are enabled or not, to use the correct
  RET or DIS instruction to return; no mixing of foreground
  subroutines and interrupt service subroutines.
- Larger and slower (3×) than pure SEP calling.

SCRT:
Standard Call and Return Technique. SEP used to call a 16-byte
30-cycle dedicated calling subroutine, which picks up the 2-byte
target address inline (after the SEP) and pushes the return address
on the stack. SEP used to call a 13-byte 24-cycle dedicated returning
subroutine which reverses this. Optimal when large numbers of
subroutines need calling (because there is no incremental register
cost per subroutine) or when recursion is necessary. Minimum of 54
cycles per subroutine call and return.
Pro:
- Normal subroutine appearance, behavior, and flexibility.
- Recursion.
- Smallest per-call incremental cost for larger numbers of
  subroutines.
- Inline data can be passed to the subroutine.
Con:
- Costs 29 bytes of code space for the call/return subroutines.
- Costs three registers. (Usually R4, R5, and R6; R3 is assumed to be
  the 'normal' program counter.)
- Must have a stack. (Costing a register [usually R2], and two bytes
  of RAM per call depth.)
- Slowest (13.5×) subroutine method.
None can be considered ideal, and variations of the above techniques
are possible. (Using a SCRT variant with a 1-byte selector instead of
a 2-byte address
like MACROSUB,
for example.) These calling techniques may be mixed as needed; they
are not mutually exclusive. (Though the subroutines themselves cannot
be called by techniques they were not designed for, as the necessary
return instructions are incompatible.)
If using the 1804–6, an additional mechanism is possible: the
new SCAL/SRET N instructions. (If using
1804–6 you'd probably never use SCRT, you'd use this instead,
though for compatibility SCRT does continue to work, as do the
others.)
SCAL/SRET:
Subroutine Call and Return. New 4-byte 10-cycle SCAL N instruction
(68 8N SS SS) used to call your subroutine, using a specified linkage
register N to hold the return address. (Prior contents of RN are
saved on the X stack.) RN may be used to pick up inline arguments, if
any. New 2-byte 8-cycle SRET N instruction (68 9N) used to return to
the RN address, X stack contents popped back into RN. Optimal when
large numbers of subroutines need calling (because there is no
incremental register cost per subroutine) or when recursion is
necessary. Minimum of 18 cycles per subroutine call and return.
Pro:
- Normal subroutine appearance, behavior, and flexibility.
- Recursion.
- Small per-call incremental cost for larger numbers of subroutines.
- Inline data can be passed to the subroutine.
- No code space penalty, beyond the extra byte in each call/return
  instruction.
Native programs for the 1802 do not grow gracefully, because you
quickly run out of registers to dedicate, and once you do the shuffle
starts to kill you. (Modularity is often impeded due to the bias
towards dedicated registers.) The 1802 is uniquely suited to
implementing virtual machines like FORTH, CHIP-8, SWEET16, P-machine,
etc., where a small number of oft-used subroutines, co-routines,
pointers, and counters are the bulk of what is necessary to
efficiently implement the virtual environment, and which can live in
dedicated registers and avoid the shuffle.
[3]
Threaded
Forth
is particularly noted for being memory-thrifty, but most find Forth to
be a difficult language to work in. It occurs to me that one could
combine a
P-machine
stack-oriented
execution architecture (for a more traditional
C
or
Pascal
compiler, say) with
Forth-style threading in a 2-stack
virtual
machine to get the best of both worlds. Classic P-machine runtime
interpreters
were stack-oriented, which means that the
basic blocks
that a compiler (C, or Pascal) generated could be characterized, just
like Forth words, by their stack effects. The compiler could generate
these blocks as threadable Forth-like words (with separate return and
data stacks, allowing for
zero-prologue
subroutines and simple machine refactoring) instead of the usual
inline code. If the compiler, as each basic block was identified,
fingerprinted the block by its data stack effects and its internal
operations, it could consult the cache of already emitted code (the
application itself, as built so far to that point) and before caching
a new block 'word' it could see if a match was already there, and if
it were thread to that instead of emitting fresh code. (Refactoring
the existing code at that point, as required, would be expected.) The
2-stack machine (versus native code) constraints would encourage a
greater block uniformity that could well lend itself to increased
sharing. Post-processing could even discover larger clusters of
common sequences (of threaded words) and re-factor them for increased
sharing, and even better code density. (Manipulations of the return
stack would stop the refactoring, of course, because the refactoring
itself affects the return stack.)
In this manner the P-machine compiler would be doing, partially
anyway, what the experienced Forth programmer does automatically:
extensive factoring of an application with an eye towards sharing code
that already exists. (This being the secret by which Forth was so
memory-efficient. In fact, Forth is difficult enough to work with
that it is practically impossible not to factor extensively.)
If the compiler did a good enough job of factoring and sharing, it
might make a C or Pascal environment for the 1802 practical for
non-toybox applications. Programs would be slow(-ish) due to the
triple penalty[5], but maybe they'd
fit† in memory.
[†]
One experience I had in school with
HP Pascal/64000 on
an 8085
embedded
system was an abject failure in this respect, and a useful object
lesson. The hardware had been designed with 4KB (2× 8755
devices, IIRC) of
EPROM for the program,
which was typical for embedded systems of the era and felt by all to
be sufficient for the intended door-lock application. Unfortunately
the project specification required Pascal as the
implementation language, not the usual assembly, and the necessary
program space, with all requested features present, turned out to be
about 3–4× what was available in the hardware. The Pascal
compiler churned out perfectly reasonable native code, but it was just
so bulky (especially on an 8-bit processor) when compared to
what an assembly (or Forth) programmer could do that the project was
doomed from the start. Adding source code generally resulted in a
linear increase in object code; there was no factoring economy
exhibited beyond what was expressed directly as subroutines in the
source code. The memory 'gas gauge' dropped precipitously with each
new line of code added, burying itself far below empty when the
application had barely even begun. The application was eventually
completed anyway, and functioned correctly, but only when using
substantial development system memory
emulation
resources. In fact it was never realized on the un-aided target
hardware, and the project was eventually abandoned, after all
the door lock hardware had been installed throughout the building.
This was a very useful (and painful) object lesson on the importance
of project specification and estimation, and on having a viable
Plan B, and
on how ill-suited 8-bit
microprocessors
were to handling languages conceived for the much more capable
mini- and
mainframe
computers of the day.
The same project supplied other painful lessons. The provided target
hardware was, in fact, not even fully functional: the floppy disk
subsystem was unable to actually save and retrieve data to the storage
medium, and only the '0' keys worked on the keypads at the doors.
(The hardware was clearly largely untested, and no test software
accompanied the provided platform. I should never have accepted the
software project as offered to me.) Development system disk
emulation resources were also necessary for the code demonstration,
and all demonstrated door combinations had to be composed only of
zeroes.
I learned a lot in that project!
[4]
Consider a not-uncommon small but significant task: copying a small
(<256) number of bytes from known (constant) places (src
and dst) in memory—your basic structure copy. In
C, perhaps something like:
void copy(char *src, char *dst, unsigned char count) {
    do {
        *dst++ = *src++;
    } while (--count);
}
I have hand-assembled some code fragments here for comparison
purposes. (There may be minor errors due to the hand-assembly.) We
begin with the canonical 8-bit processor: Intel's 8080, and compare it
to the 1802.
8080/8085:
Cycl Addr Code Label Mnemonic Comment
(10) 0000 21 ss ss copy: LXI H,src ; Setup
(10) 0003 11 dd dd LXI D,dst
( 7) 0006 0E nn MVI C,count
( 7) 0008 7E loop: MOV A,M ; Fetch source byte.
( 7) 0009 12       STAX D        ; Store dest byte.
( 5) 000A 23 INX H ; Advance both pointers.
( 5) 000B 13 INX D
( 5) 000C 0D DCR C ; Are we done?
(10) 000D C2 08 00 JNZ loop
1802:
Cycl Addr Code Label Mnemonic Comment
(16) 0000 F8 ss copy: LDI src.hi ; Setup
(16) 0002 BD PHI 13
(16) 0003 F8 ss LDI src.lo
(16) 0005 AD PLO 13
(16) 0006 F8 dd LDI dst.hi
(16) 0008 BE PHI 14
(16) 0009 F8 dd LDI dst.lo
(16) 000B AE PLO 14
(16) 000C F8 nn LDI count
(16) 000E AF PLO 15
(16) 000F 4D loop: LDA 13 ; Fetch source byte, advance.
(16) 0010 5E STR 14 ; Store dest byte,
(16) 0011 1E INC 14 ; advance.
(16) 0012 2F DEC 15
(16) 0013 8F GLO 15 ; Are we done?
(16) 0014 3A 0F BNZ loop
In the setup parts of the code, pre-loop, we are initializing two
16-bit pointer registers and an 8-bit counter. The 3-instruction
8-byte 8080 fragment is about as optimal as it could get. (Though an
8-bit processor, the 8080 sports a few crucial 16-bit instructions
designed for manipulating addresses.) It takes the 1802 almost twice
as much code, 10 instructions in 15 bytes, to do the exact same part
of the task. This is due to the 8-bit accumulator bottleneck, through
which all data must flow. (The only 16-bit operations the
1802 can do are to increment or decrement a register pair.)
In the loop proper, the 1802 actually has a 1-instruction 1-byte
advantage (3 vs 4) copying the data because it has a combined
load-and-increment-pointer (LDA) instruction.
For loop termination the 8080 takes 2 instructions, 4 bytes, to
decrement the counter and loop around if it's not zero. The 1802
takes 3 instructions, 4 bytes, to do the same thing, leaving the 1802
with only a 1-instruction (0-byte) penalty for this phase. This
penalty is due to the lack of a
condition-code
register: we have to pull the counter back into the accumulator to
see if it's expired. (The high half of the counter register is
undamaged, because we never decrement the low half below zero.)
So, for the loop itself the 8080 needed 6 instructions in 8 bytes to
copy each data byte; the 1802 also needed 6 instructions but in only 7
bytes, giving it a slight size edge—which does not even begin to
pay for the huge penalty it incurred setting up the loop.
Note that every 1802 instruction we used takes 16 clock cycles to
execute, whereas the 8080 instructions we used take 5, 7, or 10 cycles
to execute. No matter how you rate it, the 1802 comes in last on this
task. (And, indeed, most tasks.)
As a routine matter we preserved any processor resources (registers)
that we didn't need for the copy. For the 1802, registers 1–12
were untouched, as was the high half of our counter R15. In the case
of the 8080, our benchmark, only the 8-bit B register wasn't touched.
Using the slightly newer, more-popular
Z-80 successor to
the 8080, and its additional instructions, we can beat the 8080,
several different ways:
Z-80 (1):
(10) 0000 21 ss ss copy: LXI H,src
(10) 0003 11 dd dd LXI D,dst
( 7) 0006 0E nn MVI C,count
( 7) 0008 7E loop: MOV A,M
( 7) 0009 12       STAX D
( 5) 000A 23 INX H
( 5) 000B 13 INX D
( 5) 000C 0D DCR C
(12) 000D 20 F9 JRNZ loop
This is one byte shorter, using a relative jump, but it's also two
cycles (per iteration) slower.
Z-80 (2):
(10) 0000 21 ss ss copy: LXI H,src
(10) 0003 11 dd dd LXI D,dst
( 7) 0006 06 nn MVI B,count
( 7) 0008 7E loop: MOV A,M
( 7) 0009 12       STAX D
( 5) 000A 23 INX H
( 5) 000B 13 INX D
(13) 000C 10 FA DJNZ loop
We save 2 bytes here by using the new DJNZ instruction. It's also 2
cycles (per iteration) faster. We must use register B for the
counter, instead of C. (C is now our untouched resource.)
Z-80 (3):
(10) 0000 21 ss ss copy: LXI H,src
(10) 0003 11 dd dd LXI D,dst
(10) 0006 01 nn nn LXI B,count
(21n) 0009 ED B0 LDIR
This is smallest, by a fair margin, and the fastest. (You can only
beat its speed with unrolled LDI loops, which we are not
interested in here.) It also leaves one 8-bit register untouched: the
accumulator this time.
This is the canonical data-copy for the Z-80, and works for blocks
of any size.
These only make the 1802 look even worse, of course. Especially that
last one.
On the other hand, consider another wildly popular peer, and two that
were not so popular:
6502:
(2) 0000 A9 ss copy: LDA #src-1.lo
(3) 0002 85 00 STA from
(2) 0004 A9 ss LDA #src-1.hi
(3) 0006 85 01 STA from+1
(2) 0008 A9 dd LDA #dst-1.lo
(3) 000A 85 02 STA to
(2) 000C A9 dd LDA #dst-1.hi
(3) 000E 85 03 STA to+1
(2) 0010 A0 nn LDY #count
(5) 0012 B1 00 loop: LDA (from),Y
(6) 0014 91 02 STA (to),Y
(2) 0016 88 DEY
(3) 0017 D0 F9 BNE loop
This is the biggest fragment so far and in fact looks a lot
like the 1802 in that it has substantial setup code when compared to
the copy and loop sections of the code, and that is because the 6502,
like the 1802, can work only with 8-bit values.
By fiddling with the setup constants we can use the Y register as both
index and counter. (The X index register remains untouched.) The
copy also proceeds from the top down, unlike all the others, because
of Y's dual role. It is fairly typical of the 6502 that if you're
willing to warp your code a little, greater efficiency is possible.
We need two 16-bit pointers stored in page 0: from
and to, because the 6502 has no pointer registers. (Modularity
is often impeded due to the need to allocate storage in page 0.)
6800:
(2) 0000 C6 nn copy: LDB #count
(3) 0002 CE dd dd LDX #dest
(5) 0005 DF 02 STX to
(3) 0007 CE ss ss LDX #src
(5) 000A A6 00 loop: LDA 0,X    ; Fetch data byte.
(4) 000C 08 INX
(5) 000D DF 00 STX from ; Switch pointers.
(4) 000F DE 02 LDX to
(6) 0011 A7 00 STA 0,X ; Store data byte.
(4) 0013 08 INX
(5) 0014 DF 02 STX to ; Switch pointers.
(4) 0016 DE 00 LDX from
(2) 0018 5A DECB
(4) 0019 26 EF       BNE loop
This is even bigger than the 6502 fragment, but is partitioned
differently and is much slower. The efficient setup code is similar
to the 8080, but the fact that there is only one pointer
(index) register means we have to continually shuffle our two pointers
in and out of memory. (We use zero-page for the holding cells, for
better efficiency.) So, the setup and branch control portions are
reasonable, but the loop copy itself is, most decidedly, not.
If it weren't for the 1802 this would be the worst of these processors
at this task, without question.
If you can afford to disable interrupts the stack pointer can be
repurposed as a second index register, resulting in a very efficient
copy portion of the loop, but this has obvious system implications and
additional setup/cleanup code, and cannot be relied upon in general
practice.
The 6800's fatal weakness was having only the single pointer register,
which crippled it for just about any interesting task to which it
could be put. (Copying, comparing, and transforming data all require
two pointers.) It's better than no processor at all, of course, but
just about any other choice was superior from the software
perspective. (The
68HC11, one
of the successors to the 6800, attempted to rectify this by adding a
second index register, Y. There were others that did the same. There
was room in the 6800's instruction encoding for the Y register; not
including it from the very beginning is inexplicable. This oversight
helped ensure the competing 6502's success.)
6809:
(3) 0000 8E ss ss copy: LDX #src
(4) 0003 10 8E dd dd LDY #dest
(2) 0007 C6 nn LDB #count
(6) 0009 A6 80 loop: LDA ,X+
(6) 000B A7 A0 STA ,Y+
(2) 000D 5A DECB
(3) 000E 26 F9 BNE loop
Attempting to address all the 6800's weaknesses, this successor is
arguably the best 8-bit processor made, and not really a peer of the
1802 because it came along so much later. However, it is interesting
to note that even so it's neither the smallest nor fastest at this
particular task. Clock-corrected the Z-80 still beats it handily, and
is smaller, and the 6502 is one clock faster per byte.
(Due to instructions and addressing modes we aren't using here, this
processor is the undisputed 8-bit champion when running
threaded Forth, native-compiled C and Pascal, and ROM-based
relocatable code, all of which were explicit design targets of the
processor.)
Just as with the Z-80, the 6809 is rich enough that there are some
options:
6809 (2):
The 6809 is capable of moving 16 bits of data at a time, which
considerably improves its performance, but with more complex code. We
need another 16-bit register, which is one more than it comfortably
has. However, the user stack pointer (U) can be (ab)used for this,
keeping everything in-register, and thus fast:
(7) 0000 34 40 copy: PSHS U ; Save U stack.
(3) 0002 8E ss ss LDX #src
(4) 0005 10 8E dd dd LDY #dest
(2) 0009 C6 nn LDB #count
(2) 000B 54 LSRB ; Byte pair count.
(3) 000C 24 04 BCC loop
(6) 000E A6 80 LDA ,X+ ; Move the
(6) 0010 A7 A0 STA ,Y+ ; odd byte.
(8) 0012 EE 81 loop: LDU ,X++   ; Move the
(8) 0014 EF A1 STU ,Y++ ; byte pairs.
(2) 0016 5A DECB
(3) 0017 26 F9 BNE loop
(7) 0019 35 40 PULS U ; Restore U stack.
This is a 27-byte subroutine, as large as the 6800's, and instead of
the first routine's 9+17×N clocks it takes
41+10.5×N clocks, making it nearly as fast as the Z-80,
at least for larger N.
6809 (3):
Optimally (for even-sized copies) it's only 20 bytes in size, and
23+21×(N/2) clocks:
(7) 0000 34 40 copy: PSHS U ; Save U stack.
(3) 0002 8E ss ss LDX #src
(4) 0005 10 8E dd dd LDY #dest
(2) 0009 C6 nn LDB #count/2
(8) 000B EE 81 loop: LDU ,X++   ; Move the
(8) 000D EF A1 STU ,Y++ ; byte pairs.
(2) 000F 5A DECB
(3) 0010 26 F9 BNE loop
(7) 0012 35 40 PULS U ; Restore U stack.
The instruction count is quite reasonable, but unfortunately over half
the instructions we're using are 'extended' (compared to the parent
6800) instructions, and require an extra opcode byte to invoke, thus
increasing the size and cycle count.
And the Z-80 still beats it.
To handle blocks of any size, most of these fragments would need an
additional outer loop, or other equivalent modifications to handle
larger counters and/or offsets. (The Z-80's LDIR can already handle
any size.) I have ignored this as I don't think it's relevant for the
point(s) being made here.
The little-used
Hitachi 6309, a
successor to the 6809, added block move instructions similar to the
Z-80:
6309:
(3) 0000 8E ss ss copy: LDX #src
(4) 0003 10 8E dd dd LDY #dest
(4) 0007 10 86 nn nn LDW #count
(6+3n) 000B 11 38 12 TFM X+,Y+
Essentially the same four instructions as the Z-80, but three bytes
larger due to the extension-byte orientation of the 6809 family. (Yet
still two bytes smaller than the basic 6809 routine. To be fair, the
Z-80 also uses extension bytes, but only needs one for this task
rather than four.) This takes 17+3×N clocks, making it
the absolute winner, bar none, in the speed category, and it can
handle blocks of any size. (Too bad this processor was not more
popular, but by the time it was available the 16-bit processors, with
their vastly increased memory addressing abilities, were coming on
strong.)
So, here is our collected summary of the basic structure-copy
subroutines, showing size and time requirements for each 8-bit
processor:
Structure Copy Summary

| CPU  | Setup (B) | Copy (B) | Loop (B) | Total (B) | Setup (clk) | Clk/Byte |
|------|-----------|----------|----------|-----------|-------------|----------|
| 8080 |  8 |  4 | 4 | 16 |  27 | 39 |
| 1802 | 15 |  3 | 4 | 22 | 160 | 96 |
| Z-80 |  9 |  2 | 0 | 11 |  30 | 21 |
| 6502 | 18 |  4 | 3 | 25 |  22 | 16 |
| 6800 | 10 | 14 | 3 | 27 |  13 | 43 |
| 6809 |  9 |  4 | 3 | 16 |   9 | 17 |
| 6309 | 11 |  3 | 0 | 14 |  17 |  3 |

By speed:

| CPU  | Setup (B) | Copy (B) | Loop (B) | Total (B) | Setup (clk) | Clk/Byte |
|------|-----------|----------|----------|-----------|-------------|----------|
| 6309 | 11 |  3 | 0 | 14 |  17 |  3 |
| Z-80 |  9 |  2 | 0 | 11 |  30 | 21 |
| 6502 | 18 |  4 | 3 | 25 |  22 | 16 |
| 6809 |  9 |  4 | 3 | 16 |   9 | 17 |
| 8080 |  8 |  4 | 4 | 16 |  27 | 39 |
| 6800 | 10 | 14 | 3 | 27 |  13 | 43 |
| 1802 | 15 |  3 | 4 | 22 | 160 | 96 |

By size:

| CPU  | Setup (B) | Copy (B) | Loop (B) | Total (B) | Setup (clk) | Clk/Byte |
|------|-----------|----------|----------|-----------|-------------|----------|
| Z-80 |  9 |  2 | 0 | 11 |  30 | 21 |
| 6309 | 11 |  3 | 0 | 14 |  17 |  3 |
| 8080 |  8 |  4 | 4 | 16 |  27 | 39 |
| 6809 |  9 |  4 | 3 | 16 |   9 | 17 |
| 1802 | 15 |  3 | 4 | 22 | 160 | 96 |
| 6502 | 18 |  4 | 3 | 25 |  22 | 16 |
| 6800 | 10 | 14 | 3 | 27 |  13 | 43 |
Discounting the never-seen 6309 latecomer, the Z-80 is the clear winner
on all benchmark fronts here, once clock rates are made comparable.
(See below.) Speed-wise the 6502 is in second
place, and the 1802 is far in the rear, though it requires less code
than either the 6800 or 6502.
However, benchmark fragments aren't everything.
The 6502, for example, has two secret weapons: 1) due to the way its
index registers work it's very good at handling data
structures and arrays that are less than 256 bytes in size, and 2) it
can keep up to 128 pointers (using all of zero page) on hand at once.
So, that grotesque setup code probably need only be done once. Most
significant programs are quite repetitive in nature; avoiding the
setup on all but the first pass can really add up.
For example, the second time that structure copy needs to be
done, alone of these CPUs the 6502 can probably re-use the source and
destination pointers where they sit in
RAM.
The setup for the (second and any subsequent) copy then drops to two
bytes, and two clock cycles.
The 6502 has been called by some the first common
RISC
microprocessor, because it has so few internal resources. (An
arguable definition, but what is true is that the programmer
generally only works with three 8-bit registers, which is Not Much no
matter how you measure it. [TI's novel 16-bit
9900
takes this register-less approach even further, though in a different
way.]) Instead, the 6502 is designed to work very efficiently with
the first 256 bytes of RAM, using them as arrays and pointers, and
arrays of pointers. Substantial applications, once
written and optimized for the 6502, were
fast—often very fast.
(None[6] of the 8-bit processors, with the
exception of the later
6809/6309,
natively handles high-level languages like C and Pascal well at all.
Significant 8-bit applications that need maximum performance must be
written in assembly language, and are thus well-poised to take
advantage of architectural peculiarities.)
The various '80' processors don't have enough internal registers to
keep anything interesting long-term within the
CPU; everything has to live out in RAM and shuffle in and out
of the CPU while in use. The extremely introverted 1802, focused on its
generous 16 internal registers, still doesn't have enough registers to
keep very much control data within the CPU, especially considering its
instruction set peculiarities—most data still has to
live out in RAM and endure the shuffle. The extremely extroverted
6502, with practically no registers to speak of and focused instead on
its first 256 bytes of RAM, can keep the bulk of interesting
control information out there, where (using 8-bit zero-page
addressing) it is fairly fast to access and can be used where it sits,
no shuffling required. (Like the 1802, you can't build even the most
basic 6502 system without RAM. Also, like the 1802, modularity is
hampered by the necessity to allocate limited resources: zero-page RAM
[vs the 1802's registers].) While the 6502 instructions that access
the data directly in RAM are inherently slower than purely
register-based operations, avoiding the shuffle more than makes up for
it. That efficiency adds up over an entire application. (The
6809/6309 can do many of the same in-memory tricks, and even some
additional ones, but its instructions to do so are larger and slower,
largely negating that 'advantage' so far as relative performance is
concerned.)
In the heyday of the 8-bit era the 'fast' 4 MHz Z-80 systems were
generally (though erroneously) thought to be faster than the 'fast' 2
MHz 6502 systems. (These being approximately comparable as they
interfaced to the same-speed commodity memory, and in those pre-cache
days that was all that really mattered. 8-family processors were
approximately half as clock-efficient as 6-family processors; scaling
clock numbers by 2 was a reasonable equivalency approximation.)
The erroneous conclusion that the Z-80 was superior was certainly true
of small benchmarks: on a fragment-by-fragment basis the Z-80
does very well. But running real world applications the
2 MHz 6502 systems were usually comfortably in the
lead[7], followed by the 4 MHz Z-80
systems, followed by everything else, with the 1802 systems solidly
bringing up the rear. Larger applications that take advantage of the
6502 instruction architecture simply win. (The 1 MHz 6502
systems, exemplified by the highly regarded and extremely popular
Apple II, naturally
found it a lot harder to compete on performance than the 2 MHz
systems did, but they usually compensated by having inexpensive
integrated graphic displays that shared the main memory with no
performance hit or visual artifacts when accessing it.)
In fact, if the 1802 were to be implemented with circuitry as
cycle-efficient as the 6502, a not-unreasonable ask, it would actually
fare well in these contests, as it's not the worst in code size, and
having multiple pointer registers and counters available can be a real
advantage in avoiding the pesky shuffle. (Or use the 1804–6
successor, for some slight relief [4 bytes and 48 cycles] in the setup
portion. Using a SEP support subroutine,
à la SCRT, gives you the same size relief on an 1802, but
is considerably slower, costs you a register, and you have to provide
the support subroutine too.) Revisited:
| e1802 | 1804/1805/1806 | 1802 'Macros', à la SCRT
|
|---|
(2) 0000 F8 ss copy: LDI src.hi
(2) 0002 BD PHI 13
(2) 0003 F8 ss LDI src.lo
(2) 0005 AD PLO 13
(2) 0006 F8 dd LDI dst.hi
(2) 0008 BE PHI 14
(2) 0009 F8 dd LDI dst.lo
(2) 000B AE PLO 14
(2) 000C F8 nn LDI count
(2) 000E AF PLO 15
(2) 000F 4D loop: LDA 13
(2) 0010 5E STR 14
(2) 0011 1E INC 14
(2) 0012 2F DEC 15
(2) 0013 8F GLO 15
(2) 0014 3A 0F BNZ loop
|
(40) 0000 68 CD ss ss copy: RLDI 13,src
(40) 0004 68 CE dd dd RLDI 14,dst
(16) 0008 F8 nn LDI count
(16) 000A AF PLO 15
(16) 000B 4D loop: LDA 13
(16) 000C 5E STR 14
(16) 000D 1E INC 14
(16) 000E 2F DEC 15
(16) 000F 8F GLO 15
(16) 0010 3A 0F BNZ loop
|
(??) 0000 D1 0D ss ss copy: LDI 13,src
(??) 0004 D1 0E dd dd LDI 14,dst
(16) 0008 F8 nn LDI count
(16) 000A AF PLO 15
(16) 000B 4D loop: LDA 13
(16) 000C 5E STR 14
(16) 000D 1E INC 14
(16) 000E 2F DEC 15
(16) 000F 8F GLO 15
(16) 0010 3A 0F BNZ loop
|
Instead of 160+96×N clocks, our fictional enhancement
would take 20+12×N clocks, making it the fastest
of these processors at this task! (Equivalently, clock the real 1802
design at 16–32 MHz—in a non-caching design it's
the memory speed that matters, not the CPU clock rate.)
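As a quick sanity check on those formulas, a Python sketch; the numbers are just the per-instruction clock counts tallied from the listings above, and note that since both the setup and loop terms scale by 8, the hypothetical machine is exactly 8× faster at every size:

```python
def clocks_1802(n: int) -> int:
    """Real 1802: 16 clocks per instruction; 10-instruction setup (160
    clocks) plus a 6-instruction loop (96 clocks) per byte copied."""
    return 160 + 96 * n

def clocks_e1802(n: int) -> int:
    """Hypothetical cycle-efficient 'e1802': 2 clocks per instruction,
    so 20 clocks of setup plus 12 per byte."""
    return 20 + 12 * n

for n in (1, 128, 65536):
    print(n, clocks_1802(n), clocks_e1802(n),
          clocks_1802(n) / clocks_e1802(n))   # ratio is exactly 8.0
```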
Of course, if we're going to invent fictional derivatives of existing
processors a small change to the Z-80 would make it unbeatable in this
particular challenge. As it happens the Z-80 implements LDIR
by decrementing the program counter rather than incrementing
it for the second byte of the opcode, if the decremented BC counter
pair is not zero. This results in re-fetching the LDIR instruction
itself for each data byte transferred, except for the last one, which
is why it takes a leisurely 21 cycles per byte transferred. (This is
elegant, if inefficient, and also ties into preserving its ability to
refresh
DRAM
reliably while still remaining responsive to
interrupts during
this lengthy instruction, all while minimizing implementation
circuitry. [The WDC
65816 does the same thing for its block-move instructions.]) If
the Z-80 had been able to fetch the opcode pair only once and thereafter
just move data, under control of the internal registers, it could have
taken 13 cycles per byte transferred, or even a theoretical minimum of
6, far superior to any other 8-bit CPU. (The only non-caching
processors I'm aware of that could actually keep the memory bus
saturated with data while executing bulk copies were the 16-bit
68010, and
the 8-bit Hitachi
6309.)
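The per-byte rates make the comparison easy to tally. A minimal sketch, ignoring the one-off setup and the slightly cheaper final LDIR iteration (16 T-states rather than 21) for clarity; the 13- and 6-cycle figures are the hypothetical ones from the text:

```python
# T-states per byte moved on the Z-80, per the discussion above:
ACTUAL = 21           # real LDIR, re-fetching its own opcode pair each byte
NO_REFETCH = 13       # hypothetical LDIR that fetched the opcodes only once
THEORETICAL_MIN = 6   # hypothetical bus-saturating minimum

def copy_tstates(n_bytes: int, per_byte: int) -> int:
    """Approximate total T-states for a block copy (setup ignored)."""
    return n_bytes * per_byte

print(copy_tstates(128, ACTUAL))           # 2688
print(copy_tstates(128, NO_REFETCH))       # 1664
print(copy_tstates(128, THEORETICAL_MIN))  # 768
```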
Were I designing the ultimate 8-bit hobbyist computer I'd be inclined
to give it multiple CPU's for maximum flexibility, one each from the
three most popular families: 8080, 6800, and 6502. (Plenty
of dual CPU systems were sold on the market, exemplified by
the Commodore
128, which had both 6502 and Z-80 processors.) Sadly, the 1802
offers nothing unique enough in this space to compensate for its lack
of speed in this kind of system, so in spite of my liking for it I
would leave it out. I'd use the latest-and-greatest in each line: the
Zilog Z-80 (for its wealth of CP/M applications), the Hitachi 6309
(for its support of high-level languages), and the WDC 65816 (for its
speed, and larger memory addressing); each of the three brings a
significant strength to the partnership. Because they're all sharing
the same memory they can even act as co-processors for each other.
To continue on down the rabbit hole, if you used the 65816 as the
primary CPU the system could be built with more than 64KB of RAM. By
adding MMUs the other two CPUs could also access all of the memory.
(Morrow
used an MMU for the Z-80 back in the day that even gave it the ability
to run a protected-space Un*x OS. Throw that in too!) If one were
inclined to stretch the boundaries completely to the point of
ridiculousness, one could go ahead and throw in three 16-bit CPUs that
had 8-bit bus interfaces as well: the
8088,
68008 (though
a caching
68030 or
68040
constrained to a narrow 8-bit bus would be both faster and much more
capable), and
9995.
At that point maybe just go ahead and throw in an 1802 too, for the
heck of it, making a grand total of 7 CPU's sharing one memory;
this slippery-slope mental exercise already got stupid some time ago,
what's one more processor at this point? We can even add some sort of
MMU8 to let the 1802 also get at the
additional memory.
Regardless of how far you took it, this would be purely a
retrocomputing
exercise, because you could learn nothing that you couldn't learn
more simply by simulating any of these CPUs on modern high-speed desktop
computers, with vastly less effort and probably better (faster)
results.
This little digression has gone on far too long; I really
need to stop now!
|
| 5
|
Penalties: 1: P-machine interpreter. 2: Threading overhead. 3: 1802!
|
| 6
|
In fact, Intel's 8085 had
additional
instructions that were designed for the convenience of high level
languages like C and Pascal (particularly the
new LDHI/LDSI for pointing DE to structure
and stack frame members/variables (respectively), followed
by LDAX/STAX [8-bit A] or the
new LHLX/SHLX [16-bit HL] transfers), so it
wasn't just the 6809/6309 that could do this, though not so well as
they. Intel, however, chose to leave these instructions undocumented,
presumably to not further fractionate the burgeoning
CP/M market at the
time, or perhaps to not steal market share from their upcoming
8086 family. (No
doubt this deliberate omission was heartbreaking to the designers of
the 8085.) To my knowledge, no high-level language ever exploited the
8085's additional abilities.
|
| 7
|
The
TI-99/4A should
have been the fastest machine of the era due to its 3 MHz
9900
16-bit CPU, derived from their minicomputer architecture, but was
instead one of the slowest due to all of its application
memory being indirectly accessed, 8 bits at a time, through
the 9918 video
controller. (The CPU was given only 256 bytes of 16-bit RAM for use
by native code.) A truly heinous architectural decision that crippled
the product, dragging it down into
1802 territory so
far as apparent speed was concerned, and shooting
TI out of
the saddle in the lucrative and growing home computer market.
The 99/4 was a very clever design, and inexpensive to build, but a
spectacular failure. They designed what was essentially a very nice
closed-system calculator, in the shape of a computer with an
alphanumeric keyboard and color video, and entered it in a cage fight
with real computers. (Or: Don't bring a calculator to a
computer fight? Also, the closed nature of the system infuriated
potential software developers, which did not help.) The results were
not pretty, and TI eventually lost hundreds of millions of dollars on
the effort. This debacle informed IBM's later decision to make its
new 5150
PC an open system instead. See here for
more opinion on the 99/4 machines.
To my knowledge, every general-purpose machine that tried to
save money by using deliberately slower, narrower, bulk memory than
'native' failed in the market, except for the 8088-based IBM PC (for
which there were essentially no 8086-based competitors). The 99/4A
was merely the most egregious of these, being far slower than the
typical half-speed entrant. (And even it might have succeeded if it
were merely half the native speed, had the narrow bulk memory somehow
been directly connected to the CPU instead of through the video
controller.)
Here is our block copy benchmark for the 9900, assuming (like for the
best-case 6809 code) that the block is an even number of bytes and
aligned on even byte boundaries (not unreasonable for a 16-bit
system), and that we're not shoehorned into a crippled-by-design
machine like a 99/4A:
| 9900
|
|---|
Cycl Addr Code Label Mnemonic Comment
(12) 0000 0400 ssss copy: li r0,src
(12) 0004 0401 dddd li r1,dst
(12) 0008 0402 nnnn li r2,count
(30) 000C CC70 loop: mov *r0+,*r1+ ; Copy word and advance pointers
(10) 000E 0642 dect r2 ; Are we done?
(10) 0010 16FA jne loop
|
From the programmer's perspective this could not be simpler, or
cleaner. Only six instructions (18 bytes), half of them (6 bytes)
repeated in a loop. Its 18 bytes puts it in the middle of the size
pack, because all instructions are 16 bits, or multiples thereof.
It's a bit pedestrian at 36+50×(N/2) clocks, and this is
entirely due to the fact that 9900 'registers' are all actually out in
RAM. (For unbeatable context switching, but offset by modest
calculation speed that is reminiscent of using pointers on the 6502.)
The 9900 is not particularly fast, but its clean 16-bit architecture
generally makes up for it; it handles high-level languages well and
has enough registers that algorithms are not 'cramped'.
Note that the 9900 is a true 16-bit system. While it could do the
copy a byte at a time, using movb and dec
instructions in the loop, there's a substantial speed penalty to do
so:
- Twice as many passes through the loop, of course, and
- Each pass is slower, because the bus hardware has to do a
read-modify-write for each byte written in order to preserve the
byte (within the 16-bit word) that is not being modified;
the 9900 supports byte operations but its memory interface does
not have byte lane strobes that protect non-targeted bytes in a
word.
No one would willingly do this, not in a 16-bit system.
|
Here's how the calculated speed of the 9900 stacks up against the
8-bit CPU's, in their typical incarnations in home computers, copying
128 bytes of data.
Many of these CPU's were available in faster systems, so scale down
their times accordingly. Note that the 6800 is less than half as fast
as its direct competitor: the 8080. And yet, the 1802 is even slower!
|
| CPU | | MHz | MSec
|
|---|
| 6309 | | 1.00 | 0.40
| Z-80 | | 4.00 | 0.68
| 9900 | | 3.00 | 1.08
| 6502 | | 1.00 | 2.07
| 6809 | | 1.00 | 2.19
| 8080 | | 2.00 | 2.50
| 6800 | | 1.00 | 5.52
| 1802 | | 1.78 | 6.99
|
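Three of these rows can be reproduced directly from cycle formulas given elsewhere in this article; a quick Python check (the Z-80's 30-clock setup, three 10 T-state immediate loads, is my assumption, as is the 21/16 T-state LDIR split):

```python
# Recompute three table rows (milliseconds to copy 128 bytes) from the
# cycle formulas stated in the text:
#   1802: 160 + 96*N clocks at 1.78 MHz
#   9900:  36 + 50*(N/2) clocks at 3.00 MHz
#   Z-80: ~30 clocks setup + LDIR at 21 T/byte (16 for the last) at 4 MHz
N = 128

ms_1802 = (160 + 96 * N) / 1.78e6 * 1e3
ms_9900 = (36 + 50 * (N // 2)) / 3.00e6 * 1e3
ms_z80 = (30 + 21 * (N - 1) + 16) / 4.00e6 * 1e3

print(round(ms_1802, 2), round(ms_9900, 2), round(ms_z80, 2))  # 6.99 1.08 0.68
```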
| |
The Z-80 beats the 9900 at this task, but only by 37%, and that only
because of the Z-80's LDIR instruction. (The same goes
for the 6309.) Without LDIR the Z-80's just a 2×
8080, which puts it at 1.25 msec, behind the 9900. (And,
notably, also behind a 2 MHz 6502—those being the 8-bit
speed demons.)
And this is, essentially, a simplistic 8-bit task. Given the 9900's
superiority in registers and instruction set, the 9900 would
absolutely trounce the Z-80 at any real-world task involving
calculations or high-level languages.
The 6309/6809, having a few more registers and good support for
high-level languages, might not be too embarrassed, but it would still
lose to the 9900.
There
is no replacement for displacement, as they say.
|
TI, with their early (1976) 9900, could have owned the home
computer market. But the 99/4, what they chose to bet the farm
on, was an absolute dog.
|
For its time the 9900 looked pretty good compared to its 8-bit peers,
and held up well with time. Contrast this 9900 code with that for the
later mainstream 16-bit microprocessors, most of which have
specialized block/string instructions, as well as increased addressing
capabilities:
| 8086
| Z8000
| 68000
| 32016 (née 16032)
|
|---|
copy: lea si,src
lea di,dst
mov cx,count/2
cld
rep movsw
The 8086 exhibits considerable
8-bit heritage, such as special-
purpose registers and instructions,
and variable 8-bit instruction en-
coding. (1–6 bytes.)
|
copy: lda r1,src
lda r2,dst
ld r0,#count
ldir r2,r1,r0
Like the 9900, instructions are encoded
in a variable number of 16-bit words.
Only the Z8000 lets you choose which registers to use for the copy
instruction.
|
copy: lea src,a0
lea dst,a1
move #count/2-1,d0
loop: move (a0)+,(a1)+
dbra d0,loop
Like the 9900, instructions are encoded
in a variable number of 16-bit words,
and there are no block/string instructions.
(Misaligned word access is illegal.)
In the 68010, with its
'loop mode',
the
copy loop is as performant as other
processors' string instructions and far
more versatile, thus: elegant.
|
copy: addr src,r1
addr dst,r2
movw #count,r0
movsw
Like the 8086, instructions are encoded
using a variable number of bytes. (1–23)
|
We have used specialized instructions where available, as not using
them can result in a surprisingly poor showing:
| 8086
| Z8000
| 68000
| 32016
|
|---|
copy: lea si,src
lea di,dst
mov cx,count/2
more: mov ax,[si]
mov [di],ax
add si,2
add di,2
loop more
This 16-bit CPU doesn't have auto-
increment, so if not using string
instructions, and their dedicated
registers, the code is much larger
and slower. (No human would ever
choose to do this. A compiler might.)
(Optimally, using two inc instructions
to replace add is the same speed, but
smaller; using mov instead of lea is
both smaller and faster here.)
|
copy: lda r1,src
lda r2,dst
ld r0,#count/2
loop: ld r3,@r1
ld @r2,r3
inc r1,#2
inc r2,#2
djnz r0,loop
This 16-bit CPU doesn't have auto-
increment, so if not using string
instructions the code is much
larger and slower.
A stack-autoincrement can be used
for half the job, saving 1 instruction:
loop: pop r3,@r1
ld @r2,r3
inc r2,#2
djnz r0,loop
Not really good enough, though.
|
copy: lea src,a0
lea dst,a1
move #count/2-1,d0
loop: move (a0)+,(a1)+
dbra d0,loop
Still only two repeated instructions.
The 9900 and 68000 both have very
tidy expressions of this function, even
without specialized instructions.
|
copy: addr src,r1
addr dst,r2
movw #count,r0
loop: movw 0(r1),r3
movw r3,0(r2)
addw 2,r1
addw 2,r2
subw 2,r0
bnz loop
This 16-bit CPU doesn't have auto-
increment, so if not using the string
instructions, and their dedicated registers,
the code is much larger and slower. (No
human would ever choose to do this. A
compiler might.)
|
Compilers have the unenviable job of recognizing when the specialized
instructions can be used to good effect, and arranging the register
assignments to be able to use them.
Three of these four processor families, faced with the need to copy
data as quickly as possible, added specialized instructions. They do
this, and well, but they do only this. One went another way,
with the 68010, and found a way to not add instructions but
rather to make the existing instructions much more efficient. And,
incidentally, speeding up many tightly repetitive tasks, not
just copying. Beautiful. Author, aviator, and adventurer
Antoine
de Saint-Exupéry said it best: "Perfection is finally
attained not when there is no longer anything to add, but when there
is no longer anything to take away." Kudos, Motorola.
It should be noted that everything discussed up to now would be
classified as a
CISC
machine, no matter how primitive. The thrust towards the opposite,
RISC,
in recent decades embraces instruction streams that look more like our
non-specialized, and unattractive, fragments above. This direction is
predicated on several technological trends and breakthroughs:
- RISC instruction streams are notably less code-dense than CISC
streams for the same functionality. RISC programs are thus much
larger, and can require larger address spaces; to be competitive
they also need inexpensive bulk RAM with which to populate those
larger spaces cost-effectively.
- RISC instructions tend to be fixed-size, which makes the fetch and
decode logic inside the CPU simpler, and thus faster. (Again:
code size penalty.)
- The theory with RISC processors is that as you simplify the
internals you can speed up operation more than the
penalty you incur with less-dense instructions. (RISC doesn't
make sense without this.)
- RISC processors tend to have significant high-speed internal
instruction caches, to help mitigate the code size penalties.
This does add additional system complexity.
Note that none of this applied to the older
machines—they needed to be CISC in order to compete in
their markets. Memory was relatively slow and expensive; the CPU
needed to make up for that.
Note that there are still potential places where RISC might be a
poorer choice, for reasons other than base cost and functionality.
Environments where overall system complexity or high speeds were
penalized might still favor CISC. Space hardware is the first thing
that comes to mind. Space hardware needs to be significantly more
environmentally robust, and has near-zero tolerance for failure.
Larger silicon area (bigger target) is more susceptible to various
faults, and smaller feature geometries (more sensitive target) are
likewise more susceptible to environmental upsets. Power is often
limited.
A basic axiom is that as complexity increases reliability decreases,
and it is the total system reliability that usually matters.
You can afford to pump some additional "C" into your "ISC", if the
memory system complexity goes down enough to make up for it. There
are a lot of factors that would go into finding an ideal balance!
|
| 8
|
There are two (really: three) MMU possibilities for the 1802:
- A traditional I/O-accessed peripheral page-mapping MMU, probably
using 16 4KB pages. This gives the CPU access to only 64KB at a
time, but the 16 accessible pages can each come from anywhere in
up to 16MB of memory. (This is the sort that Morrow used for
their Z-80 running Micronix.) Best for running multiple 64KB (or
smaller) independent tasks at once. (Or, for bare-metal
programming.)
- The 1802 architecture and implementation allows for another, more
interesting possibility: effectively extending each of its 16-bit
registers (for addressing purposes) to 24 bits, similar to the
65816.
Remember that all memory access on the 1802 is addressed
by one of the 16 registers, and its instruction decode and
execution is simple enough that we can always track which register
is doing the addressing. (Let's call this 'M'.) The memory
address decoder can utilize an external 16×8 RAM bank
indexed by 'M' to supply the upper 8 bits of the 24-bit address.
Each register can thus have access to its own 64KB
address segment if we wish, allowing a single task to be much
larger than 64KB. (Or, for bare-metal programming. Multiple
tasks are also possible, but they can't be smaller than 64KB and
remain independent. Not without additional constraint hardware.)
There are two possibilities for programming this kind of MMU:
- I/O-accessed peripheral, as above, but remember that I/O on the
1802 is a bit cumbersome, and the data is
accessed through a (mapped) register, so it's entirely
possible to paint yourself into a corner if you're not careful.
- The application note Data Bus Contention During CDP1802
Register-to-Register Operations hints at another, very
interesting possibility. Paraphrased:
In 1802-based systems bus contention problems have been found to
occur during internal data transfer operations
(GHI, PHI, GLO, PLO)
if memory read decoding does not also include
the MRD signal,
losing [clobbering] data in one or more registers.
Basically the D register value is actually visible (and, more to the point,
vulnerable) on the data bus during
GHI/PHI/GLO/PLO
instructions, which means we can track what's going on. (Also,
interestingly enough, the full 16-bit register values are visible
on the address bus for INC
and DEC instructions during the execution [second]
cycle of the instruction.) By using a recognizable (by external
hardware) but normally never-used sequence of instructions we can
arrange that select PHI instructions affect both the
register bank and the MMU map page by copying the visible
D to the mapper as well. Perhaps:
; Point R1 at beginning of second 64KB bank of RAM. ($010000)
F8 01 LDI 1 ; D ← 1
B1 PHI 1 ; D → 1.HI
B1 PHI 1 ; D → 1.HI, D → map[1]
F8 00 LDI 0 ; D ← 0
B1 PHI 1 ; D → 1.HI
A1 PLO 1 ; D → 1.LO
Our trigger here is two
identical PHI instructions in a row, something that
you would never normally do. The hardware, upon seeing the
execution of the second PHI, would also write the
data bus contents (D) into map[N] (N=1 in this example).
For reading we cannot reliably do the same thing
for GHI, by doubling them, because it is neither safe
nor reliable to deliberately clash bus drivers (hence the
Application Note that led us here), and so our mapper contents
can't be read into D this way. Some other mechanism would have to
be provided, if reading were necessary. As it would be, if
interrupts are ever part of the picture. Interrupts would also
have to be temporarily blocked for the first instruction of any
potential special sequence so that these sequences couldn't be
interrupted, as there's no reliable way to save this halfway-there
state across an interrupt. DMA should need no special
consideration because those cycles are recognizable and
independent, and needn't affect our sequencing logic, and would
naturally have full access to our 24-bit space.
If you were willing to forgo 1804–6 compatibility
you could safely do something like:
; Fetch R1's 64KB bank select value into D
91 GHI 1
68 INP 0 ; D ← map[1]
The INP 0 instruction (68) normally can't be used
because there's nothing within that execution cycle to
recognize in order to gate something onto the bus, and so was
designated as an unused opcode. (With the probable side-effect of
destroying D, if executed anyway.) But with tracking logic we can
recognize it and use it to trigger the MMU read. Of course, the
68 opcode is the only formally unused one, and is the gate to all
of the 1804–6 enhancements, so doing it this way automatically
limits your CPU choice. Probably not the best idea.
Slightly less efficiently, which probably doesn't matter because
you don't often need to read these values anyway, you can maintain
full system flexibility by tracking this sequence instead:
; Fetch R1's 64KB bank select value into D
91 GHI 1
F8 XX LDI XX ; D ← map[1]
The tracking memory address decoder would turn off the normal RAM
read for the immediate operand, discarding it, and substitute the
MMU read data instead. A more efficient alternative might be:
; Fetch R1's 64KB bank select value into D
91 GHI 1
01 LDN 1 ; D ← map[1]
but would seem to be complicated by the fact that there is no LDN
0 instruction, its opcode being interpreted instead
as IDL; the tracker, however, needn't consider the register
indicated in the LDN instruction at all. (Any 'G'
instruction followed by any 'LD' instruction renders the 'G'
instruction essentially useless, and could be fair game for a
trapped sequence, but something has to indicate the
ultimate data source, and it probably should be the N field of the
first of our two trap-sequence instructions: the 'G' instruction.)
There are probably other 'never-used' sequences that could be
recognized and used instead, but one of these seems sufficient.
If using an MMU like this any of the 1802 subroutine calling
mechanisms could be used to call anywhere within the
extended address space. (SCRT would need modification, of course.
You might want both near- and far-calling versions, for
efficiency.) The 1804–6 calling mechanism would be
constrained to remain within the current 64KB segment, much like
short branches are constrained to stay in-page, and the linkage
register might not point to the correct bank unless you made prior
arrangements. The bus tracker necessary to support the
1804–6 would also be substantially more complex, in order to
do the correct bank mapping within the execution phases
of SCAL and SRET, and the other enhanced
instructions. (Some enhanced instructions could not participate
properly in our enhanced 24-bit address space because only 2 of
the 3 addressing 8-bit registers would be considered. For
example, the
new RLDI, RLXA, RSXD,
and DBNZ instructions would be significantly
hampered, and probably not truly usable in a 24-bit program.)
There are a couple of additional options for our 24-bit system.
First, as indicated above, there could be 64KB banks,
where an addressing register was constrained to stay within this
(OS-provided?) bank unless explicitly changed. Another, likely
better, option would be to also track incrementing and
decrementing instructions, and apply carry/borrow to the
appropriate MMU bank-select register. This would provide a 'flat'
24-bit addressing model, and is only possible because the 1802
actually exposes the 16-bit address during the execution
of INC and DEC instructions, even though
it's not used to access memory. (And here the lamentable lack of
a condition
code register in the 1802 is actually an advantage, if you can
call it that, because there are no side-effects to consider.)
Putting all this together, an any-size block copy in such a 24-bit
'flat' system would look like this:
| Native 24-bit | 24-bit 'Macros', à la SCRT
|
|---|
Cycl Addr Code Label Mnemonic Comment
(16) 000000 F8 ss copy: LDI src.ex ; Setup
(16) 000002 BD PHI 13
(16) 000003 BD PHI 13
(16) 000004 F8 ss LDI src.hi
(16) 000006 BD PHI 13
(16) 000007 F8 ss LDI src.lo
(16) 000009 AD PLO 13
(16) 00000A F8 dd LDI dst.ex
(16) 00000C BE PHI 14
(16) 00000D BE PHI 14
(16) 00000E F8 dd LDI dst.hi
(16) 000010 BE PHI 14
(16) 000011 F8 dd LDI dst.lo
(16) 000013 AE PLO 14
(16) 000014 F8 nn LDI cnt.ex
(16) 000016 BF PHI 15
(16) 000017 BF PHI 15
(16) 000018 F8 nn LDI cnt.hi
(16) 00001A BF PHI 15
(16) 00001B F8 nn LDI cnt.lo
(16) 00001D AF PLO 15
(16) 00001E 4D loop: LDA 13 ; Fetch source byte, advance.
(16) 00001F 5E STR 14 ; Store dest byte,
(16) 000020 1E INC 14 ; advance.
(16) 000021 2F DEC 15
(16) 000022 8F GLO 15 ; Are we done?
(16) 000023 3A 1E BNZ loop
(16) 000025 8F GHI 15
(16) 000026 3A 1E BNZ loop
(16) 000028 8F GHI 15
(16) 000029 0F LDN 15
(16) 00002A 3A 1E BNZ loop
|
Addr Code Label Mnemonic Comment
000000 D1 0D ss ss ss copy: LDI 13,src ; Setup
000005 D1 0E dd dd dd LDI 14,dst
00000A D1 0F nn nn nn LDI 15,cnt
00000F 4D loop: LDA 13 ; Fetch source byte, advance.
000010 5E STR 14 ; Store dest byte,
000011 1E INC 14 ; advance.
000012 2F DEC 15
000013 D1 1F 0F BNZ 15,loop ; Are we done?
Here we've used SEP 1 (not a good choice!) as a macro
lead-in, like SCRT, to select from common operations involving
extended registers.
This code is considerably smaller, but is also slower and would
require a (shared) support subroutine (not shown) that actually
implemented the operations.
| Once again the 1802 proves to be both: 1) remarkably versatile, yet 2) cumbersome and slow.
|
|
The setup is grotesque, as you would expect, because we're loading
three full 24-bit registers a piddly 8 bits at a time,
but the copy part is, notably, unchanged. Most of the
condition-testing cycles are also unchanged—every 256 bytes
we incur two more instructions, and every 65536 bytes we incur
three more instructions. So it essentially is no slower (per
byte) than the original copy routine, and not unreasonably larger,
both of which are fair accomplishments.
Instructions in which the MMU took an active interest are marked
in red, although it played a role
in every instruction: the program counter is also 24
bits. Also, the MMU has to half-rouse
on every PHI/GHI instruction
just in case it's part of a trapped sequence, and it also has to
block interrupts until after the second instruction in any
potential sequence so that a trapped sequence is not interrupted
in the middle.
In such a system you could have 'macro' instructions (like SCRT,
for example) for common activities like loading a constant into a
24-bit register or branches based on whether an extended register
was zero or not. (Also shown.) On the other hand, memory-saving
techniques like this might not actually be necessary in a 24-bit
system that had a lot of memory. Expanding these operations
inline, as shown first, would be bigger, but would execute
somewhat faster.
For compatibility with extant 16-bit programs the MMU should
probably not apply non-zero high-order address bits to memory
until the extension value(s) had been set. (On a
per-register basis, probably.) This would allow 16-bit counters
to wrap around as they would on a non-extended system. (And we'd
probably want a non-system-reset way to go back to 16-bit mode for
the use of an OS that could load both 16-bit and 24-bit programs.)
On the other hand, it's a rare 1802 system that ever had enough
memory that expecting a 16-bit wraparound would actually be
useful, so maybe this all would be unnecessary.
- In fact, one could even perhaps combine these addressing
enhancements since it would only take a 256×8 mapping RAM to
do so. This could give us multiple small independent tasks, with
the ability to give selected tasks greater than 64KB sizes. This
could be very versatile, but adds significant programming
complexity and really opens up the question of
inter-task protection9,
which has been lurking in the wings for a while now. In which case
you might wish to require that memory mapping be arranged through
a (protected? privileged?) OS layer, possibly precluding anything
so perilous as
PHI-tracking. (With even more tracking and
protection hardware even that might still be possible.) By using
a 2048×8 mapping RAM, practical since the early 80's, we
could even have 8 hardware-mapped (instantaneous map switching)
tasks, each independent and of variable size.
This combined system should probably preclude the 'flat' 24-bit
addressing mode discussed above, as that allows tasks to 'leak'
out of their OS-provided 'pens'. Unless, of course, associated
constraints were also provided as part of the 'M' context.
(Significantly more complexity.)
It should be noted that access to a larger 'flat' memory space is
the main reason why Motorola's 68000 gained significant
market share against Intel's earlier, established, 8086 family.
So, 'flat' is where it's at?
- ...And, what if 24-bit addressing was not enough? (I think that
the notion of an 1802 slinging more than 16MB of bits around is
doubly ludicrous, but...) Well, then you'd probably want to track
doubling
of both
PHI and PLO
instructions, to tack on an extra 16 bits of addressing, for a
full 32-bit addressing model. So, even if doing only 24-bit
addressing (at first), maybe the extension should be
on PLO just in case you ever want to go bigger in the
future, thus maintaining a bit more forward compatibility in
the software? It's a thought.
At this point it's getting more than a bit
silly10, 11.
If one were interested in this sort of thing, the best
approach would be to craft a virtual system using simulation,
and see how it all worked together: CPU, memory, tools, and
applications. Only if it tested out as practical, and you actually
had need of this in physical hardware, should you consider building
one.
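In that simulation spirit, here's a minimal Python sketch of the doubled-PHI trigger described above. The class and method names are invented for illustration; the model watches the executed opcode stream and, on a second consecutive identical PHI N, also latches the D value (visible on the data bus, per the application note) into map[N]:

```python
# Hypothetical model of the doubled-PHI MMU tracker.  On the 1802 the
# PHI N opcodes are 0xB0-0xBF, and D is visible on the data bus during
# their execution cycle (per the app note cited in the text).
class PhiTrackingMMU:
    def __init__(self) -> None:
        self.page_map = [0] * 16   # 16x8 mapping RAM, indexed by register
        self.last_op = None        # previous opcode, for pair detection

    def watch(self, opcode: int, d_bus: int) -> None:
        """Call once per executed instruction; d_bus is only meaningful
        (and only used) when a doubled PHI is trapped."""
        if 0xB0 <= opcode <= 0xBF and opcode == self.last_op:
            self.page_map[opcode & 0x0F] = d_bus & 0xFF  # map[N] <- D
            self.last_op = None    # consume the pair
        else:
            self.last_op = opcode

    def translate(self, m: int, addr16: int) -> int:
        """Extend register M's 16-bit address to 24 bits via the map."""
        return (self.page_map[m] << 16) | (addr16 & 0xFFFF)

mmu = PhiTrackingMMU()
# The example sequence from the text: point R1 at the second 64KB bank.
mmu.watch(0xF8, 0x01)   # LDI 1        D <- 1
mmu.watch(0xB1, 0x01)   # PHI 1        D -> R1.HI
mmu.watch(0xB1, 0x01)   # PHI 1 again  D -> map[1] as well (trapped)
print(hex(mmu.translate(1, 0x0000)))  # 0x10000
```

A real tracker would also have to block interrupts between the two PHIs, as noted above; that detail is omitted here.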
|
| 9
|
If you're going to support multiple tasks, with inter-task protection
(for running things like Unix), the MMU has even more
responsibilities. A bare minimum is a write-protect bit, per
page/segment. This prevents inadvertent corruption, but does nothing
for information security. To provide
that you have to arrange so that any executing task
has no access to any memory but its own. You also need a
protected context for the OS itself, where only it is allowed to mess
with the MMU configuration. And you need a reliable way to trap into
the OS for service.
The 1802 has no exception mechanism whatsoever, so the only way to do
an OS trap is to use an opcode tracking mechanism to detect and
intercept a 'useless' sequence; the sequence must be chosen carefully
so that it never trips except in a deliberate trap situation.
The IDL instruction (00) is a natural, because no user
application should ever do such a thing. Only a driver, in
OS context, or the OS itself should ever use that instruction. The
timing of this could be tricky, though, because you'd have to prevent
the CPU from seeing that instruction when in user context. Some
other, more innocuous sequence might be necessary, so that you could
let it actually execute before beginning the trap.
Then you need a guaranteed way to vector execution to a known place in
the OS context, without losing any state of the CPU, so that
it can be resumed cleanly. (Well, you can lose a little state if
that's part of the contract with the OS itself.) I think this can
only be accomplished by 'jamming' an instruction sequence into the
CPU, regardless of what the address bus might be doing, which takes
over the CPU while recording what it must for a clean resumption
later. (Basically what was done for the Z-80 CPU in
Applied
Microsystems' EM-180 in-circuit emulator.) Interrupts must be
blocked while the memory system is in this special state. Possible
instruction sequence, once the trap intercept has been triggered:
; Hardware has detected our OS trap sequence, and has started
; jamming instructions into the CPU regardless of its state.
; Memory access blocked during the ('****'-indicated) jam.
**** 79 MARK ; HW captures instruction read cycle addr (addr to return to),
; write cycle addr (R2) and write cycle data (X, P).
**** 73 STXD ; HW captures write cycle data: D. (Trap opcode?)
**** 83 GLO 3 ; HW captures R3 from data bus. (In case P was not 3.)
**** 93 GHI 3
**** D3 SEP 3 ; Jamming lets us do this before R3 is set.
**** F8 00 LDI 0 ; Jump to R3=0000 in OS context. HW is done capturing.
**** A3 PLO 3 ; Prevent R3.1 increment before next instruction.
**** B3 PHI 3 ; Two of these, if using 24-bit MMU described above8.
**** A3 PLO 3 ; Again, starts us at zero.
0000 E2 SEX 2 ; (Code could perhaps be in OS space rather than jam hardware.)
0001 B2 PHI 2 ; Two of these, if using 24-bit MMU described above8.
0002 A2 PLO 2
0003 22 DEC 2 ; Stack at R2=$FFFF
0004 CC LSIE ; Save interrupt-enabled state. D=0 means IE=1, use RET
0005 F8 01 LDI 1 ; to return; D=1 means IE=0, use DIS to return.
0007 73 STXD ; IE state (in D) now saved on (new) R2 stack.
0008 73 STXD ; HW now vomits saved R(P), D, X, P, R2, & R3 state, 8 bytes,
0009 73 STXD ; into the R2 stack. (D register value is ignored.)
000A 73 STXD
000B 73 STXD
000C 73 STXD
000D 73 STXD
000E 73 STXD
000F 73 STXD
; We are now running in OS context P=3 X=2, R2=$FFFA, R3=$0010
; Only R(P), X, P, D, R2, R3, and IE have been saved on R2 stack,
; any other CPU state must be preserved before we change it.
; Continue on to do the OS work from here...
What a mess! (47 cycles [51 in 24-bit system] to get into the OS, not
counting the trap sequence itself.) And that only got us into
the OS; we still have to craft a return-to-user-task mechanism that
reverses all this, with minimal damage to the processor state. And
deal with interrupts and DMA, in some kind of safe system-designed
manner. We will need interrupts for at least the task-switching
timer, if not for other device drivers.
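The 47/51-cycle figure is easy to sanity-check. Here's my own back-of-the-envelope tally (not from the article), assuming the standard 1802 timing of 2 machine cycles per instruction, 3 for long branches/skips such as LSIE:

```python
# Tally machine cycles for the OS-entry trap sequence listed above.

JAMMED = ['MARK', 'STXD', 'GLO 3', 'GHI 3', 'SEP 3',
          'LDI 0', 'PLO 3', 'PHI 3', 'PLO 3']        # hardware-jammed prologue
OS_ENTRY = (['SEX 2', 'PHI 2', 'PLO 2', 'DEC 2', 'LSIE'] +
            ['STXD'] * 9)                            # IE byte + 8 HW-saved bytes
# (assumes IE=1, so LSIE skips the LDI 1; the IE=0 path adds 2 cycles)

def cycles(seq):
    # 1802 timing: long skips (LSxx) are 3 machine cycles, all else here is 2
    return sum(3 if op.startswith('LS') else 2 for op in seq)

total16 = cycles(JAMMED) + cycles(OS_ENTRY)          # 16-bit system
total24 = total16 + 2 * 2                            # two extra PHIs in 24-bit system
```

At the Elf's 1.76 MHz clock (8 clocks per machine cycle) that's over 200 µs just to reach the OS, which puts "What a mess!" in perspective.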
And, of course, once within the OS we must do the task's requested
operation. Whether this operation was selected by register contents,
or inline (with the trap) opcodes, or some combination thereof is yet
to be determined. Time-slicing and other task-switching would also be
done before resuming the original (or some other) task.
A possible task-resumption sequence, hardware-assisted:
; OS has restored most registers by now, we've told hardware
; our intent to resume, and it has started jamming instructions
; into the CPU regardless of its state. We only need to restore
; R2, R3, X, P, D, and IE. Plus any HW task-selector. The exact
; jam sequence had to be prepared earlier in the OS, based on the
; saved information and the results of the trap we called.
**** F8 XX LDI XX ; Restore R2
**** B2 PHI 2
**** F8 XX LDI XX
**** A2 PLO 2
**** 22 DEC 2 ; Compensate for side-effect of upcoming RET/DIS.
**** F8 XX LDI XX ; Restore R3 (saved value decreased by 3 to compensate for
**** B3 PHI 3 ; the increments due to the post-PLO instruction execution).
**** F8 XX LDI XX
**** A3 PLO 3
**** F8 XX LDI XX ; Restore D. Result code? DF not restored, also result?
**** 7[01] RET/DIS ; Restore X, P, and IE. HW also un-blocks interrupt line.
; We are now running back in user context, with whatever OS
; task complete and any register changes appropriate to this.
More mess. (22 cycles to get back out of the OS, not counting work
done within the OS to prepare the hardware for this return sequence.)
If our system is using DMA and/or interrupts it'll be difficult to
support in a protected environment. One way might be to trap, in
user contexts, any use of registers 0 and 1 by user programs,
basically dedicating them to OS and driver use only.
The DIS instruction must also be prohibited. More work
for the instruction tracker, and we have to come up with a (similar to
above) trap mechanism for handling 'illegal' instructions. Our 'Unix'
tasks will probably wake up with R2 as the stack and R3 as the program
counter, though they can be changed freely after that. (But not, of
course, to either R0 or R1.)
With this implementation another natural choice for an OS service trap
instruction for user tasks would be SEP 0, an instruction
that can likely never be used if the DMA hardware is armed.
However, like any other prohibited instruction, you can't actually let
it execute before beginning the trap; timing will be tricky.
The instruction tracker and MMU/intercept logic will be involved
enough that we'd probably need to use an FPGA of some sort. Most of
these are now large enough that the rather simple 1802 itself could
also be emulated by the FPGA. In which case we don't need such
elaborate work for a trap, we can just force the CPU emulation to
change state wholesale, whenever and however necessary. Also, it
would be difficult to build this kind of hardware so that it actually
ran faster than a pure software simulation on a fast desktop computer,
so why bother to build anything? Which circles around to the
basic question: what is this all for? This all sounds far
too complicated, and limited, to actually build. A virtual machine
for research purposes would be just as useful, and far easier.
|
| 10
|
Appropriate memory size is really a question of proportion and
purpose. (Remember that we are talking about vintage technology
choices, either back in the day or retrocomputing.
With modern hardware on the table we have many more options.)
Early computers, through the minicomputer era, tended to have data
word sizes that matched their address width, though the address space of any
given machine might not have been fully populated. The 'small'
PDP-8, for example,
had a 12-bit word, and a 12-bit (4096-word) address space, usually
fully populated. Most early general-purpose microprocessors had a
smaller 8-bit word, but as 256 bytes of address space is not nearly
enough for any significant program they all tended to use two
words as addresses, resulting in the near-ubiquitous 64KB address
space of the time.
(This 8/16 dual nature meant that programs that dealt with the full
address space were usually somewhat awkward, as addresses often had to
be assembled piecemeal. A similar awkwardness was exhibited by the
subsequent 16-bit
8086, which had a
16/20 duality. A year later the 16/32
68000 was not
awkward at all, because it supported 32-bit instructions right from
the beginning. Hardware designers were finally learning, after enough
complaints from the software people!)
The common 16-bit 64KB space was more than generous in the early days,
as that amount of memory, regardless of form, was prohibitively
expensive. None of the early microcomputers ever had
anywhere near that amount of available memory in them; 1–16KB
was the norm.
Tools and techniques evolved to make do with less than the maximum
possible resources.
Forth
and the P-system
were strong, along with multi-phase language compilers, etc.
BASIC interpreters,
another way to increase the semantic density of limited memory, were
common.
For a general-purpose machine often the CPU choice is made
for you, by factors like corporate standards or the availability of
necessary software, or for compatibility with software you already
have. For a specific-purpose machine, like an embedded
system, where you have more flexibility, well...
The only defensible reason to choose an
1802 CPU for a new
project, out of all of the available 8-bitters, is for its
environmental strengths, which are considerable. But, as
mentioned before, the memory in such a system also
has to share the same strengths, which makes it particularly
expensive. (Space hardware, which would otherwise be
cost-insensitive, can't be overlarge because of
MTBF
concerns. Battery-operated handheld equipment just can't be
overlarge, period.) So, an 1802 system is, almost by definition, a
limited-memory environment. Unless the application is nearly trivial,
code density will be of paramount importance. Which for an 1802
means:
- A 24-bit direct address space extension is stupid. (Fun, to be
sure, but stupid.) The 1802 is slow to begin with, and
manhandling this amount of (ultra-expensive) memory will be
excruciating. If you actually need larger amounts of memory,
you'll probably be driven to find bulk storage that is cheaper
than RAM: some kind of disk or tape for example, and a paging
(block) access method. Which you can certainly do in a 16-bit
space, and which gives you non-volatility as a bonus. The program
will be more complicated and slower, but you won't really have a
choice.
The only time a 24-bit extension could possibly be
justified is if there existed some kind of slow, cheap
RAM technology that worked with the 1802 in its environment. It
didn't exist then, and I doubt it does even now.
- Thus, if limited to a 16-bit space you can compact (and speed up)
native 1802 code by using an 1804–6 CPU and its extended
instructions.
- To compact application programs even further you'll have to use a
more semantically dense language, such as Forth, P-system,
CHIP-8, etc.
(Even BASIC, though other choices are probably better.) This
costs you even more speed; 1802 systems are slow.
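The paging/block-access idea from the first bullet above can be sketched briefly. This is my own toy model, not any historical implementation: block size, class names, and the single-buffer write-back policy are all assumptions.

```python
# Window bulk data on cheap block storage (tape/disk) through a single
# small RAM buffer inside the 64KB space, instead of widening the bus.

BLOCK = 256                                   # assumed 'disk' block size

class BlockStore:
    """Stand-in for tape/disk: big, cheap, slow, non-volatile."""
    def __init__(self, nblocks):
        self.blocks = [bytearray(BLOCK) for _ in range(nblocks)]

class PagedArray:
    """Byte array far larger than 64KB, accessed one block at a time."""
    def __init__(self, store):
        self.store = store
        self.buf = bytearray(BLOCK)           # the only RAM actually used
        self.cur = None                       # block currently in the buffer
        self.dirty = False

    def _swap_to(self, blk):
        if blk != self.cur:
            if self.dirty and self.cur is not None:
                self.store.blocks[self.cur][:] = self.buf   # write back
            self.buf[:] = self.store.blocks[blk]            # read in
            self.cur, self.dirty = blk, False

    def read(self, addr):
        self._swap_to(addr // BLOCK)
        return self.buf[addr % BLOCK]

    def write(self, addr, value):
        self._swap_to(addr // BLOCK)
        self.buf[addr % BLOCK] = value & 0xFF
        self.dirty = True
```

Slow, as the text says, since every stray access can force a block swap; but the program's address arithmetic stays comfortably within 16 bits.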
| Sidebar:
|
The 1802 was chosen for small (256 bytes of RAM) trainer computers
like the Elf,
ostensibly because its built-in DMA system (in LOAD mode)
allowed for ROM-less operation and the manual loading of small
programs. In fact you could pretty much do the same thing
with any 8-bit CPU by using a few inexpensive jellybean logic
chips to implement 'load' mode by jamming a (CPU-specific,
starting-at-zero) NOP sequence into the CPU while holding
it in a wait state, then forcing a write into memory and briefly
releasing the wait to deposit a byte of program and advance to the
next location. The CPU would slowly NOP its way through
memory, generating the incrementing address sequence, while the
hardware took care of dribbling the program into memory. (This is
almost exactly what the 1802 does all by itself in LOAD
mode.) Such a trainer wouldn't be quite as small as an Elf due to the
extra jellybeans, but it'd be nearly as cheap. The real reason to use
an 1802 in a trainer like this was the inexpensive companion
1861 video chip,
which together with the 1802 and the paltry 256 bytes of RAM gave you
an actual bit-mapped video display to play with, at a very low
incremental cost to an already cheap machine, unlike every other
trainer computer ever made. The
iconic
1802 'Star Trek' demo occupies only 256 bytes of memory; the
displayed 64×32 bitmap shows the entire
memory—both the figure and the small 64-byte
DMA/Interrupt program that displays all the memory on the screen.
Expanding to 1KB was enough to hold two complete 64×32
alternating display fields, occupying half that memory, and
Ben
Hutchinson's (slow, naturally) implementation of Conway's
Game of
Life program that showed you one field while building the next.
Quite the deal for the 1978 hobbyist with only a little over $100 to
spend on a computer!
|
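The jam-load trick described in the sidebar reduces to a very small model. This is a toy simulation of mine, heavily simplified: the real hardware juggles wait states and bus timing, while here each loop pass stands in for one NOP-plus-forced-write step.

```python
# Toy model of jam-loading: the CPU free-runs NOPs out of reset,
# supplying the incrementing address sequence, while external logic
# forces one program byte into memory at each address.

def jam_load(program, memsize=256):
    memory = bytearray(memsize)
    pc = 0                         # CPU starts at address zero out of reset
    for byte in program:
        # CPU 'fetches' -- hardware jams a NOP onto the bus instead,
        # then forces a write of the next program byte at that address.
        memory[pc] = byte
        pc += 1                    # executing the NOP advances the address
    return memory

# e.g. deposit a tiny fragment: LDI $2A / PLO 3
mem = jam_load(bytes([0xF8, 0x2A, 0xA3]))
```

The 1802's LOAD mode does essentially this internally (via its DMA-IN channel), which is why the Elf needed no extra jellybeans at all.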
If speed is essential use anything but an 1802 (well,
probably not a
6800 either,
as illustrated above4) and hope that you
can find something that is capable of handling the environment,
whatever it is. For general computing you're in good
shape, all choices are available.
If code density is crucial in a system that is not environmentally
challenged, then use a
6502 and Forth,
P-system, etc. This 8/16 CPU, though a bit of a pain to program, is
likely to be the fastest 8-bitter at implementing the intermediate
language. It was also historically the cheapest general-purpose
8-bitter; cheap and fast is a fairly potent combination. A
6309 might be
another good choice, if its instructions particularly lend themselves
to the required solution. (Certain Forth implementations get a speedy
single-instruction NEXT primitive; C and Pascal compile
cleanly to native code as instructions can directly access stack-frame
and structure member variables; block memory copies are faster than
with any other 8-bit CPU. Also, if writing a significant amount of
assembly language it's probably the easiest of the 8-bitters to code
for, speeding development.)
If you actually do need a bigger-than-16-bit address space use a CPU
that can handle it cleanly and quickly, such as a 68000 or
80386; don't waste time
with anything less. (The 16/20 8086 will punish you if exceeding 16
addressing bits, so even if 20's enough, use a better CPU than that;
you'll be glad you did. The 8/24
65816 is far more
focused on compatibility [with the 6502 in this case]
than capability, though it is very capable.)
Unless, like the proverbial dancing bear (where the notable thing is
not how well it dances, but that it dances at all), you want to do
something odd just for the perverse pleasure of making it do something
the 'hard way'. In which case, something like a 24-bit 1802 system
would be ideal!
|
| 11
|
'These go to 11'
(Not yet they don't!)
|