1 |
The 8-bit 1802 is the heart of what are widely regarded as just about the slowest home computers ever sold. (8 clocks per machine cycle and 2 or 3 machine cycles per instruction, with the clock rate fixed at 1.7 MHz in order to drive the 1861-based NTSC video of the typical system; the video output was serviced by per-scanline interrupts and per-8-pixel DMA cycles, which stole much time from the application program. So, roughly 0.1 MIPS even without the distractions, and its 'I's were nothing to write home about, either. Comparable in performance, perhaps, to the much older 4-bit Intel 4004.)
Superficially the 8-bit 1802, with its 16 16-bit registers, appeared unexcelled until the much later 16-bit Z8000 and 68000 devices, except that the slow instruction execution, combined with some truly unfortunate missing instructions[2], served to hobble it significantly. In practice, except for its use in small trainer computers like the Elf and some very early programmable home video games, it was really only deployed in places where its extreme tolerance of environmental stresses (unstable power supplies and clocks, heat, cold, and radiation) or its ultra-low full-CMOS power requirements were crucial: space vehicles, boreholes, traffic signs, engine computers, battery-operated handheld devices, etc. (Though considered relatively unpopular, it nonetheless sold millions of units in these applications, and is still in limited production as of this writing, 48 years after its introduction.)

Generally it seemed to be most satisfactorily programmed in various virtual machine environments like Chip-8, or threaded Forth[3]. These, of course, exacted a significant performance penalty on top of an already slow processor, but paid for themselves with substantially reduced memory requirements. (Said memory also needed to be as tolerant of the environment as the CPU, and thus could be quite bulky and/or expensive when compared to more mainstream memory devices.)
2 |
The 16 16-bit registers can be used as program counters, pointer registers, and counters, which is excellent, and can be treated as up to 30 8-bit data-holding registers, also good. However:

Subroutines are an interesting proposition for the quirky 1802. There are basically three techniques in common use; none can be considered ideal, and variations of them are possible. The calling techniques may be mixed as needed, as they are not mutually exclusive. (Though the subroutines themselves cannot be called by techniques they were not designed for, as the necessary return instructions are incompatible.) Native programs for the 1802 do not grow gracefully, because you run out of registers quickly, and once you do, the shuffle starts to kill you. (Modularity is often impeded due to the bias towards dedicated registers.) The 1802 is uniquely suited to implementing virtual machines, like FORTH, CHIP-8, the P-machine, etc., where a small number of oft-used subroutines, co-routines, pointers, and counters are the bulk of what is necessary to efficiently implement the virtual environment, and these can live in dedicated registers.
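For a taste of the quirkiness, here is the simplest of the common techniques, the classic SEP-register call, hand-assembled from memory (the register assignments are conventional but arbitrary). A subroutine simply runs on its own program-counter register, and 'return' means switching the PC selection back:

    ; Main program runs with P=3 (R3 is the program counter);
    ; R4 permanently holds the subroutine's entry address.
            SEP  R4         ; 'call': R4 becomes the PC
            ...             ; execution resumes here on return

    EXIT:   SEP  R3         ; 'return': hand the PC back to R3
    SUB:    ...             ; subroutine body starts here
            BR   EXIT       ; exit via the SEP R3 just above, which
                            ; leaves R4 pointing at SUB again, ready
                            ; for the next call

The price is a dedicated register per resident subroutine, and no nesting without switching to some other technique, which is a large part of why native 1802 programs do not grow gracefully.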
3 |
Threaded Forth is particularly noted for being memory-thrifty, but most find Forth to be a difficult language to work in. It occurs to me that one could combine a P-machine stack-oriented execution architecture (for a more traditional C or Pascal compiler, say) with Forth-style threading in a 2-stack virtual machine to get the best of both worlds.

Classic P-machine runtime interpreters were stack-oriented, which means that the basic blocks a compiler (C, or Pascal) generated could be characterized, just like Forth words, by their stack effects. The compiler could generate these blocks as threadable Forth-like words (with separate return and data stacks, allowing for zero-prologue subroutines and simple machine refactoring) instead of the usual inline code. If the compiler, as each basic block was identified, fingerprinted the block by its data-stack effects and its internal operations, it could consult the cache of already-emitted code (the application itself, as built so far to that point), and before caching a new block 'word' it could see if a match was already there; if it was, it could thread to that instead of emitting fresh code. The 2-stack machine constraints (versus native code) would encourage a greater block uniformity that could well lend itself to increased sharing. Post-processing could even discover larger clusters of common sequences (of threaded words) and re-factor them for increased sharing, and even better code density. (Manipulations of the return stack would stop the refactoring, of course, because the refactoring itself affects the return stack.)
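To make the shape of the runtime concrete, here is a minimal sketch of such a two-stack threaded interpreter. It is token-threaded, and in portable C for clarity (a real 1802 implementation would thread addresses through the register file); all the names are my own illustration, not anything from a real P-machine:

    #include <stdint.h>
    #include <stdio.h>

    /* Threaded code: each cell is an opcode token or an inline operand. */
    enum op { LIT, DUP, ADD, DOT, CALL, RET, HALT };

    static intptr_t ds[64];        static int dsp;  /* data stack   */
    static const intptr_t *rs[16]; static int rsp;  /* return stack */

    static void run(const intptr_t *ip)
    {
        for (;;) switch (*ip++) {
        case LIT:  ds[dsp++] = *ip++;                 break;
        case DUP:  ds[dsp] = ds[dsp-1]; dsp++;        break;
        case ADD:  dsp--; ds[dsp-1] += ds[dsp];       break;
        case DOT:  printf("%ld\n", (long)ds[--dsp]);  break;
        case CALL: rs[rsp++] = ip + 1;                /* a call is just a   */
                   ip = (const intptr_t *)*ip; break; /* return-stack push: */
        case RET:  ip = rs[--rsp]; break;             /* zero prologue      */
        case HALT: return;
        }
    }

    /* A shared, factored 'word', with stack effect ( n -- 2n+5 ). */
    static const intptr_t twice_plus_5[] = { DUP, ADD, LIT, 5, ADD, RET };

    /* A 'compiled' program that threads through it; prints 25. */
    static const intptr_t prog[] =
        { LIT, 10, CALL, (intptr_t)twice_plus_5, DOT, HALT };

    int main(void) { run(prog); return 0; }

The CALL/RET pair is the entire subroutine mechanism, with no frame setup and no register saves, which is what makes the aggressive factoring described above cheap enough to be worth doing.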
In this manner the P-machine compiler would be doing, partially anyway, what the experienced Forth programmer does automatically: extensive factoring of an application with an eye towards sharing code that already exists. (This being the secret by which Forth was so memory-efficient.) If the compiler did a good enough job of factoring and sharing, it might make a C or Pascal environment for the 1802 practical for non-toybox applications. Programs would be slow(-ish) due to the triple penalty[5], but maybe they'd fit in memory.
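The compiler-side cache could be as simple as a fingerprint table over the words emitted so far. A sketch, again with invented names, and hashing with FNV-1a purely for illustration:

    #include <stdint.h>

    /* One entry per emitted 'word': its fingerprint and its address. */
    struct word_entry { uint64_t fp; int addr; };
    static struct word_entry cache[1024];
    static int n_words, here;       /* 'here' = next emit address */

    /* Fingerprint a basic block by its net stack effect (in/out)
       and its operation bytes. */
    static uint64_t fingerprint(const uint8_t *ops, int n, int in, int out)
    {
        uint64_t h = 14695981039346656037ull;
        h = (h ^ (uint64_t)in)  * 1099511628211ull;
        h = (h ^ (uint64_t)out) * 1099511628211ull;
        while (n--) h = (h ^ *ops++) * 1099511628211ull;
        return h;
    }

    /* Emit a block as a new threadable word, unless a matching word
       already exists, in which case thread to the old one.  (A real
       implementation would verify a fingerprint hit byte-for-byte
       before trusting it.) */
    static int emit_or_reuse(const uint8_t *ops, int n, int in, int out)
    {
        uint64_t h = fingerprint(ops, n, in, out);
        for (int i = 0; i < n_words; i++)
            if (cache[i].fp == h) return cache[i].addr;
        int addr = here;
        here += n;                  /* real code emission goes here */
        cache[n_words++] = (struct word_entry){ h, addr };
        return addr;
    }

    int main(void)                  /* the second copy of a block... */
    {
        static const uint8_t blk[] = { 1, 2, 3 };
        return emit_or_reuse(blk, 3, 2, 1)
            == emit_or_reuse(blk, 3, 2, 1) ? 0 : 1;   /* ...is shared */
    }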
One experience I had in school with HP Pascal/64000 on an 8085 embedded system was an abject failure in this respect, and a useful object lesson. The hardware had been designed with 4KB (2× 8755 devices, IIRC) of EPROM for the program, which was typical for embedded systems of the era and felt by all to be sufficient for the intended door-lock application. Unfortunately the project specification required Pascal as the implementation language, not assembly, and the necessary program space turned out to be about 3–4× what was available in the hardware. The Pascal compiler churned out perfectly reasonable native code, but it was just so bulky (especially on an 8-bit processor) when compared to what an assembly (or Forth) programmer could do that the project was doomed from the start. Adding source code generally resulted in a linear increase in object code; no factoring economy was exhibited beyond what was expressed directly as subroutines in the source code. The memory 'gas gauge' dropped precipitously with each new line of code added, burying itself far below empty before the application was even well begun.

The application was eventually completed anyway, and functioned correctly, but only when using substantial development-system memory emulation resources. In fact it was never realized on the un-aided target hardware, and the project was eventually abandoned, after all the door-lock hardware had been installed throughout the building. This was a very useful (and painful) object lesson on the importance of project specification and estimation, on having a viable Plan B, and on how ill-suited 8-bit microprocessors were to handling languages conceived for the much more capable mini- and mainframe computers of the day.
4 |
Consider a not-uncommon, small but significant task: copying a small (<256) number of bytes from known (constant) places (src and dst) in memory, i.e. your basic structure copy. In C, perhaps something like:
    copy(char *src, char *dst, unsigned char count)
    {
        do {
            *dst++ = *src++;
        } while (--count);
    }

I have hand-assembled some code fragments here for comparison purposes; there may be minor errors due to the hand-assembly. (The fragments themselves are collected below, following the analysis.)

In the setup parts of the code, pre-loop, we are initializing two 16-bit pointer registers and an 8-bit counter. The 3-instruction, 8-byte 8080 fragment is about as optimal as it could get. It takes the 1802 almost twice as much code, 10 instructions in 15 bytes, to do the exact same part of the task. This is due to the 8-bit accumulator bottleneck, through which all data must flow.

In the loop proper, the 1802 actually has a 1-instruction, 1-byte advantage (3 vs 4) copying the data, because it has a combined load-and-increment-pointer (LDA) instruction. For loop termination the 8080 takes 2 instructions, 4 bytes, to decrement the counter and loop around if it's not zero. The 1802 takes 3 instructions, 4 bytes, to do the same thing, leaving the 1802 with only a 1-instruction (0-byte) penalty for this phase. This penalty is due to the lack of a condition-code register: we have to pull the counter back into the accumulator to see if it's expired. (The high half of the counter register is undamaged, because we never decrement the low half below zero.)

So, for the loop itself the 8080 needed 6 instructions in 8 bytes to copy each data byte; the 1802 also needed 6 instructions, but in only 7 bytes, giving it a slight size edge. Which does not even begin to pay for the huge penalty it incurred setting up the loop. Note that every 1802 instruction we used takes 16 clock cycles to execute, whereas the 8080 instructions we used take 5, 7, or 10 cycles. No matter how you rate it, the 1802 comes in last on this task. (And, indeed, most tasks.)

As a routine matter we preserved any processor resources (registers) that we didn't need for the copy. For the 1802, registers 1–C were untouched, as was the high half of our counter R15. In the case of the 8080, our benchmark, only the 8-bit B register wasn't touched.

Using the slightly newer, more popular Z80 successor to the 8080, and its additional instructions, we can beat the 8080 a little, several different ways; these only make the 1802 look even worse, of course, especially that last one. On the other hand, consider another wildly popular peer, and two that were not so popular:
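The fragments, for reference, as I hand-assembled them (so, again, minor errors are possible); src, dst, and count stand for the known constants, and srcp/dstp are 6502 zero-page locations:

    ; 8080 -- setup: 3 instructions, 8 bytes
            LXI  H,src      ; HL -> source        (3 bytes, 10 cycles)
            LXI  D,dst      ; DE -> destination   (3 bytes, 10 cycles)
            MVI  C,count    ; C  =  byte count    (2 bytes,  7 cycles)
    ; loop: 6 instructions, 8 bytes
    LOOP:   MOV  A,M        ; fetch byte          (1 byte,   7 cycles)
            STAX D          ; store byte          (1 byte,   7 cycles)
            INX  H          ; bump source         (1 byte,   5 cycles)
            INX  D          ; bump destination    (1 byte,   5 cycles)
            DCR  C          ; count down          (1 byte,   5 cycles)
            JNZ  LOOP       ;                     (3 bytes, 10 cycles)

    ; 1802 -- setup: 10 instructions, 15 bytes (16 clocks each;
    ; high()/low() syntax varies by assembler)
            LDI  high(src)  ; every byte goes through D...
            PHI  R13
            LDI  low(src)
            PLO  R13        ; R13 -> source
            LDI  high(dst)
            PHI  R14
            LDI  low(dst)
            PLO  R14        ; R14 -> destination
            LDI  count
            PLO  R15        ; low half of R15 = byte count
    ; loop: 6 instructions, 7 bytes
    LOOP:   LDA  R13        ; fetch byte, bump source pointer
            STR  R14        ; store byte
            INC  R14        ; bump destination pointer
            DEC  R15        ; count down
            GLO  R15        ; no flags: pull count into D to test it
            BNZ  LOOP       ; (2 bytes)

    ; Z80, three ways -- first, just swap DJNZ for DCR/JNZ:
            LD   HL,src
            LD   DE,dst
            LD   B,count
    LOOP:   LD   A,(HL)
            LD   (DE),A
            INC  HL
            INC  DE
            DJNZ LOOP       ; decrement-and-branch in one 2-byte instruction

    ; ...or use the block-move primitive LDI, which does the move
    ; and all three bookkeeping steps at once:
            LD   HL,src
            LD   DE,dst
            LD   BC,count
    LOOP:   LDI             ; (DE) <- (HL); ++HL; ++DE; --BC
            JP   PE,LOOP    ; P/V flag stays set while BC <> 0

    ; ...or let LDIR do the entire loop: 4 instructions, 11 bytes
            LD   HL,src
            LD   DE,dst
            LD   BC,count
            LDIR            ; the whole copy in one instruction

    ; 6502 -- setup: 9 instructions, 18 bytes, the *first* time,
    ; parking the pointers in zero page
            LDA  #<src
            STA  srcp
            LDA  #>src
            STA  srcp+1     ; srcp -> source
            LDA  #<dst
            STA  dstp
            LDA  #>dst
            STA  dstp+1     ; dstp -> destination
            LDY  #0
    ; loop: 5 instructions, 9 bytes
    LOOP:   LDA  (srcp),Y   ; zero-page indirect, indexed
            STA  (dstp),Y
            INY
            CPY  #count
            BNE  LOOP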
Just as with the Z-80, the 6809 is rich enough that there are some options:
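Again hand-assembled, with minor errors possible:

    ; 6809 -- the basic byte-at-a-time routine: 9 bytes setup, 7 loop
            LDX  #src       ; X -> source
            LDY  #dst       ; Y -> destination (4 bytes: extension byte)
            LDB  #count
    LOOP:   LDA  ,X+        ; auto-incrementing indexed modes
            STA  ,Y+
            DECB
            BNE  LOOP

    ; ...or move two bytes per pass through the D accumulator
    ; (count must be even):
            LDX  #src
            LDY  #dst
            LDB  #count/2
    LOOP:   LDD  ,X++       ; 16 bits at a time
            STD  ,Y++
            DECB
            BNE  LOOP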
To handle blocks of any size, most of these fragments would need an additional outer loop, or other equivalent modifications to handle larger counters and/or offsets. (The Z-80's LDIR can already handle any size.) I have ignored this, as I don't think it's relevant for the point(s) being made here.

The little-used Hitachi 6309, a successor to the 6809, added block-move instructions similar to the Z-80's: essentially the same four instructions (three register loads and a TFM), but three bytes larger due to the extension-byte orientation of the 6809 family. (Yet still two bytes smaller than the basic 6809 routine. To be fair, the Z-80 also uses extension bytes, but only needs one for this task rather than four.) This takes 17+3×N clocks, making it the absolute winner, bar none, in the speed category, and it can handle blocks of any size. (Too bad this processor was not more popular, but by the time it was available the 16-bit processors, with their vastly increased memory-addressing abilities, were coming on strong.)

Summing up the basic structure-copy subroutines, by size and time requirements for each processor: discounting the never-seen 6309, the Z-80 is the clear winner on all benchmark fronts here, once clock rates are made comparable. (See below.)

However, benchmark fragments aren't everything. The 6502, for example, has two secret weapons: 1) due to the way its index registers work, it's very good at handling data structures and arrays that are less than 256 bytes in size, and 2) it can keep up to 128 pointers (using all of zero page) on hand at once. So, that grotesque setup code probably need only be done once. Most significant programs are quite repetitive in nature; avoiding the setup on all but the first pass can really add up. For example, the second time that structure copy needs to be done, alone of these CPUs the 6502 can probably re-use the source and destination pointers where they sit in RAM. The setup for the (second and any subsequent) copy then drops to two bytes, and two clock cycles.

The 6502 has been called by some the first common RISC microprocessor, because it has so few internal resources. (An arguable definition, but what is true is that the programmer generally only works with three 8-bit registers, which is Not Much no matter how you measure it. [TI's novel 16-bit 9900 takes this register-less approach even further, though in a different way.]) Instead, the 6502 is designed to work very efficiently with the first 256 bytes of RAM, using them as pointers and arrays, and arrays of pointers. Substantial applications, once written and optimized for the 6502, were fast, often very fast. (None[6] of the 8-bit processors, with the exception of the later 6809/6309, natively handles high-level languages like C and Pascal well at all. Significant 8-bit applications that need maximum performance must be written in assembly language, and are thus well-poised to take advantage of architectural peculiarities.)

The various '80' processors don't have enough internal registers to keep anything interesting long-term within the CPU; everything has to live out in RAM and shuffle in and out of the CPU in use. The extremely introverted 1802, focused on its generous 16 internal registers, still doesn't have enough of them to keep very much control data within the CPU, especially considering its instruction-set peculiarities: most data still has to live out in RAM and endure the shuffle.
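To put numbers on the 6502's re-use advantage mentioned above: once srcp and dstp have been parked in zero page by the first copy, every later copy of that structure needs only a 2-byte, 2-cycle counter reload:

    ; 6502 -- second and subsequent copies: setup is LDY alone
            LDY  #0         ; 2 bytes, 2 cycles
    LOOP:   LDA  (srcp),Y
            STA  (dstp),Y
            INY
            CPY  #count
            BNE  LOOP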
The extremely extroverted 6502, with practically no registers to speak of and focused on its first 256 bytes of RAM, can keep the bulk of interesting control information out there, where (using 8-bit zero-page addressing) it is fairly fast to access and can be used where it sits, no shuffling required. (As with the 1802, you can't build even the most basic 6502 system without RAM.) While the 6502 instructions that access data directly in RAM are inherently slower than purely register-based operations, avoiding the shuffle more than makes up for it. That efficiency adds up over an entire application. (The 6809/6309 can do many of the same in-memory tricks, and even some additional ones, but its instructions to do so are larger and slower, largely negating that 'advantage' so far as performance is concerned.)

In the heyday of the 8-bit era the 'fast' 4 MHz Z-80 systems were generally (though erroneously) thought to be faster than the 'fast' 2 MHz 6502 systems. (These being approximately comparable, as they interfaced to the same-speed commodity memory, and in those pre-cache days that was all that really mattered; 8-family processors were approximately half as clock-efficient as 6-family processors, so scaling clock numbers by 2 was a reasonable equivalency approximation.) That conclusion was certainly borne out by small benchmarks: on a fragment-by-fragment basis the Z-80 does very well. But running real-world applications, the 2 MHz 6502 systems were usually comfortably in the lead[7], followed by the 4 MHz Z-80 systems, followed by everything else, with the 1802 systems solidly bringing up the rear. (The 1 MHz 6502 systems, exemplified by the Apple II, naturally found it a lot harder to compete on performance than the 2 MHz systems did, but they usually compensated by having inexpensive integrated graphic displays that shared the main memory with no performance hit or visual artifacts when accessing it.) Larger applications that take advantage of the 6502 instruction architecture simply win.

In fact, if the 1802 were implemented with circuitry as cycle-efficient as the 6502's, a not-unreasonable ask, it would actually fare well in these contests, as it's not the worst in code size, and having multiple pointer registers and counters available can be a real advantage in avoiding the pesky shuffle. Revisiting the structure copy: instead of 160+96×N clocks, our fictional enhancement would take 20+12×N clocks, making it the fastest of these processors at this task (again discounting the 6309)! (Equivalently, clock the real 1802 design at 16–32 MHz; in a non-caching design it's the memory speed that matters, not the CPU clock rate.)

Of course, if we're going to invent fictional derivatives of existing processors, a small change to the Z80 would make it unbeatable in this particular challenge. As it happens, the Z80 implements LDIR by decrementing the program counter, rather than incrementing it, as it fetches the second byte of the opcode, whenever the decremented BC counter pair is not zero. This results in re-fetching the LDIR instruction itself for each data byte transferred, except the last one, which is why it takes a leisurely 21 cycles per byte transferred. (This is elegant, if inefficient, and it also ties into preserving the Z80's ability to refresh DRAM reliably, and to remain responsive to interrupts during this lengthy instruction, all while minimizing implementation circuitry.)
If it had been able to fetch the opcode pair only once and thereafter just move data, under control of the internal registers, it could have taken 13 cycles per byte transferred, or even a theoretical minimum of 6, far superior to any other 8-bit CPU. (The only non-caching processors I'm aware of that could actually keep the memory bus saturated with data while copying bulk data were the 16-bit 68010 and the 8-bit Hitachi 6309.)

Were I designing the ultimate 8-bit hobbyist computer, I'd be inclined to give it multiple CPUs for maximum flexibility, one each from the three most popular families: 8080, 6800, and 6502. (Plenty of dual-CPU systems were sold on the market, exemplified by the Commodore 128, which had both 6502-family and Z80 processors.) Sadly, the 1802 offers nothing unique enough to compensate for its lack of speed, and I would leave it out. I'd use the latest and greatest in each line: the Zilog Z-80 (for its wealth of CP/M applications), the Hitachi 6309 (for its support of high-level languages), and the WDC 65816 (for its speed, and larger memory addressing); each of the three brings a significant strength to the partnership.

To continue on down the rabbit hole: if you used the 65816 as the primary CPU, the system could be built with more than 64KB of RAM, and by adding MMUs the other two CPUs could also access all of that memory. (Morrow used an MMU for the Z-80 back in the day that even gave it the ability to run a protected-space Un*x OS. Throw that in too!) If one were inclined to stretch the boundaries completely to the point of ridiculousness, one could go ahead and throw in the three 16-bit CPUs that had 8-bit bus interfaces as well: the 8088, the 68008 (though a 68030 or 68040 constrained to an 8-bit bus would be both faster and much more capable), and the 9995. At that point maybe just go ahead and throw in an 1802 too, for the heck of it; this slippery-slope exercise got stupid some time ago anyway.

Regardless of how far you took it, this would be purely a retrocomputing exercise, because you could learn nothing that you couldn't learn by simply simulating any of these CPUs on a modern high-speed desktop computer, with vastly less effort and probably better (faster) results.
5 |
Penalties: 1: P-machine interpreter. 2: Threading overhead. 3: 1802!
6 |
In fact, Intel's 8085 had additional instructions that were designed for the convenience of high-level languages like C and Pascal, so it wasn't just the 6809/6309 that could do this, though the 8085 didn't do it as well. Intel, however, chose to leave them undocumented, presumably so as not to further fractionate the burgeoning CP/M market of the time. (No doubt this deliberate omission was heartbreaking to the designers of the 8085.) To my knowledge, no high-level language ever exploited the 8085's additional abilities.
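(For the curious: the commonly cited unofficial mnemonics include LDSI, which sets DE to SP plus an 8-bit constant, and LHLX/SHLX, which load/store HL through the DE pointer; together they make short work of stack-frame variables. A hypothetical fragment, using those unofficial mnemonics:)

    ; 8085, undocumented: fetch a 16-bit stack-frame variable
    ; at offset 4 into HL, and later store it back
            LDSI 4          ; DE = SP + 4
            LHLX            ; HL = word at (DE)
            ...             ; update the value in HL
            SHLX            ; word at (DE) = HL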
7 | The TI-99/4A should have been the fastest machine of the era, thanks to its 3 MHz 16-bit 9900 CPU, but was instead one of the slowest, due to all of its application memory being indirectly accessed, 8 bits at a time, through the video controller. (The CPU had only 256 bytes of 16-bit RAM for use by native code.) A truly heinous architectural decision that crippled the product, dragging it down into 1802 territory so far as apparent speed was concerned, and shooting TI out of the saddle in the lucrative and growing home-computer market.