CoCo (6809) Assembly on a modern computer

This article is a guide for anyone who is thinking about learning 6809 assembly language programming or who wants to use newer tools for 6809 assembly on the Tandy Color Computer.  It’s not an assembly tutorial; it’s an explanation of how to use some of the modern tools to help write and debug assembly language programs for the CoCo.

I got back into assembly programming on the CoCo about nine months ago, and I found that the modern tools available today make it easier and faster to learn assembly language programming.  Back in the 1980’s it was a long, slow process of assembling your program with EDTASM+, saving it, running it and debugging it.  Using MAME and lwasm you can assemble your program in a second and view the assembled code with extra information about how many cycles each instruction takes, which is vital when you want to optimize your code for the most speed or the smallest size.  LWTOOLS, which includes lwasm, is an amazing 6809/6309 assembler that is completely free.  It’s written by William Astle, who is on the CoCo mailing list.  LWTOOLS can be compiled for Mac, Linux and Windows.

My favourite emulator is MAME.  It’s been around for a long time emulating arcade machines, and its CPU emulation has been tested in many different scenarios.  There was a branch of MAME called MESS that took the same code and used it only for computer emulation, but for a few years now MESS has been merged into the main MAME code, so MAME includes both the arcade emulation and the computer emulation.  MAME is cross platform and is still being heavily developed.  MAME also has a special debug mode that lets you step through your program and see how it runs one instruction at a time, which is a fantastic testing and learning tool.  MAME can write the code it is executing to a text file with the trace command.  MAME also has something special called watchpoints, which let you mark locations in memory that will halt your program when they are written to or read from, and can even halt only when they are changed to a specific value!  Super useful for debugging…  Anyway, enough of a sales pitch; I figure you are reading this because you want to set up your assembly environment.

First you need to install LWTOOLS and MAME on your computer.  This is already done for you if you use a Raspberry Pi 3 with Ron Klein’s excellent SD image; you just need to add the CoCo roms and you’re ready to go…

You can compile both MAME and LWTOOLS yourself or download ready to use versions.  I use a Mac myself, and the quickest and easiest way to get MAME, lwtools and tons of other utilities installed is by using Homebrew.

Once brew is installed on your system it’s as simple as typing in these two commands:

$ brew install mame

$ brew install lwtools

You can probably find similarly easy installs of both programs on Linux using apt-get or similar.  I’m sure there are tons of ways to get these programs on Windows machines.  On Windows you might also want to use Cygwin, which gives you a Unix-like environment.

Once they are both installed you should create a directory where you will keep all your 6809 assembly source files.  Let’s call it CoCoAssembly.  This same folder is where you will assemble your program and run mame from.  In this directory you will need a subfolder called roms with the CoCo roms of the different CoCos you want to emulate/test your code on.

Here is a list of all the CoCo roms that can/should go into the roms folder (don’t ask me where to get them):


In your CoCoAssembly folder you should see a sub folder called roms where you have the above roms copied. From this CoCoAssembly folder type the following to test if your MAME is installed properly.

mame coco3 -window

Or to set the uimodekey to F12 use this to start MAME:

mame coco3 -window -uimodekey F12

To exit the MAME emulator, hit the keyboard emulation mode key.  On a Mac it’s the delete key (left of the end key).  Some laptops don’t have the other delete key, so use the F12 command line shown above.  On Linux and Windows the key is ScrLk.  You want to set this mode to partial, then press the Esc key to exit.

The complete installation of MAME includes some nice tools.  The most useful for us is called imgtool, which is used to create and manipulate our CoCo disk image files, or .dsk files.  If you don’t like imgtool you can use another image handling tool called toolshed.  I will be using imgtool below.

Create a new blank .DSK disk image that we can copy our program to with the following command:

imgtool create coco_jvc_rsdos Disk1.dsk

Now that we have a blank disk, let’s assemble a program and copy it onto this image file.  In your favourite editor type in the following short 6809 assembly program:

        ORG $4000
Start:
        PSHS   A,B
        LDA    #'H
        LDB    #'I
        STD    $500
        PULS   A,B,PC
        END    Start

Save the program as mycode.asm

This is the lwasm command (from LWTOOLS) that I use to assemble my 6809 source code:

lwasm -9bl -p cd -oNEW.BIN mycode.asm

You should see:

            ( mycode.asm):00001 ORG $4000
4000        ( mycode.asm):00002 Start:
4000 3406   ( mycode.asm):00003 [5+2] PSHS A,B
4002 8648   ( mycode.asm):00004 [2] LDA #'H
4004 C649   ( mycode.asm):00005 [2] LDB #'I
4006 FD0500 ( mycode.asm):00006 [6] STD $500
4009 3586   ( mycode.asm):00007 [5+4] PULS A,B,PC
            ( mycode.asm):00008 END Start

In the lwasm output there are numbers in square brackets.  These numbers are the CPU cycles used for each line of code.  This can be helpful if you want to figure out how to optimize your assembly code for the max speed or best size, as there are many tricks for speeding up code at the cost of size and vice versa.
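If you want a total for a routine rather than adding the bracketed numbers up by hand, the listing is easy to post-process.  The following is only a rough sketch in Python, not part of LWTOOLS: the function name total_cycles is made up for this example, and it assumes every instruction line carries a field shaped like [2] or [5+2], as in the output above.

```python
import re

# Matches the cycle field lwasm prints in a listing, e.g. "[2]" or "[5+2]".
CYCLES = re.compile(r"\[(\d+)(?:\+(\d+))?\]")


def total_cycles(listing_text):
    """Sum the bracketed cycle counts over every line of a listing."""
    total = 0
    for line in listing_text.splitlines():
        match = CYCLES.search(line)
        if match:
            base = int(match.group(1))
            extra = int(match.group(2) or 0)  # the part after the "+", if any
            total += base + extra
    return total


# The instruction lines from the listing above.
SAMPLE = """\
4000 3406   ( mycode.asm):00003 [5+2] PSHS A,B
4002 8648   ( mycode.asm):00004 [2] LDA #'H
4004 C649   ( mycode.asm):00005 [2] LDB #'I
4006 FD0500 ( mycode.asm):00006 [6] STD $500
4009 3586   ( mycode.asm):00007 [5+4] PULS A,B,PC
"""

print(total_cycles(SAMPLE))
```

This simply adds up every line once, so it gives the cost of a straight run through the code; loops still have to be multiplied out by hand.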

If you want to capture the assembly output to a file called listing.txt use this command. It’s useful to keep the output file to refer back to when debugging your code since it will have the addresses of the code in memory of the instructions…

lwasm -9bl -p cd -oNEW.BIN mycode.asm > listing.txt

The above command options tell lwasm to generate our output code as an RSDOS “LOADM” compatible 6809 program.

Once your program has assembled OK as NEW.BIN, you have to transfer it to the .DSK image so it can be run with the emulator.  We use imgtool for this:

imgtool put coco_jvc_rsdos Disk1.dsk NEW.BIN TEST.BIN

The above command tells imgtool to put (copy) the file NEW.BIN into the disk image file called Disk1.dsk, using the CoCo RSDOS format coco_jvc_rsdos, and to save the file on the disk with the name TEST.BIN.

Another useful feature of imgtool is deleting files from an image.  To delete the file TEST.BIN from the .dsk file use the following:

imgtool del coco_jvc_rsdos Disk1.dsk TEST.BIN

Imgtool has many more features; type imgtool without any options to see them all.
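Since the assemble and copy steps are always typed together, they can be wrapped in a small script.  This is a minimal sketch in Python under a few assumptions: the helper names build_commands and run_all are invented for this example, the file names are the ones used in this article, and lwasm and imgtool are assumed to be on your PATH.  It only uses the exact command lines shown above.

```python
import shlex
import subprocess


def build_commands(source, binary, disk, name_on_disk):
    """Return the two command lines used in this article, as argument lists."""
    assemble = ["lwasm", "-9bl", "-p", "cd", "-o" + binary, source]
    copy = ["imgtool", "put", "coco_jvc_rsdos", disk, binary, name_on_disk]
    return [assemble, copy]


def run_all(commands):
    """Run each command in order and stop at the first one that fails."""
    for command in commands:
        subprocess.run(command, check=True)


commands = build_commands("mycode.asm", "NEW.BIN", "Disk1.dsk", "TEST.BIN")
for command in commands:
    print(shlex.join(command))  # show what would be run

# Uncomment to actually assemble and copy (needs lwasm and imgtool installed):
# run_all(commands)
```

Stopping at the first failure means a program that does not assemble is never copied to the disk image.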

Let’s test and debug our program using MAME:

mame coco3 -window -debug -flop1 Disk1.dsk

This starts mame up in its debugger mode and you will see the following window, or similar with Windows and Linux.

It’s important to note that the mame debugger always uses hex values for its input and output as a default.

You can see the pink highlighted line at address 8C1B; this is the line that is about to be executed.  This is where the CoCo 3 first starts when you power on your computer.  On the top left is cycles (counts the CPU cycles) and beamx, which is where the beam of the picture tube is currently being drawn in the x direction.  Beamy is the row that is being drawn on the screen at this moment in time.  Flags shows the flags that are currently set in the CC (condition code) register of the 6809 CPU.

  • PC is the program counter and shows us the address where your instructions will be executed next.
  • S is the current stack pointer location
  • CC again is the condition code register but this time shown as a hex number.
  • DP is the Direct Page value
  • A is the A accumulator’s value
  • B is the B accumulator’s value
  • D is the A & B accumulators’ values together as a 16 bit value
  • X is the X register’s value
  • Y is the Y register’s value
  • U is the U register’s value

At the bottom of this window is a command line area where you can type commands for the debugger to execute, such as watchpoints or the trace function, another cool feature of the debugger.  You can get a lot of help from the debugger itself by typing the word help.  Let’s set up a watchpoint at $4000, which is the address where our little test program is going to be loaded and executed.  On the command line type the following:

wp 4000,1,w

This command sets a watchpoint at address $4000 that is 1 byte long and will stop the execution of the processor when there is a write operation at this address.  We could have made the watchpoint cover many bytes, and check for both read and write with the rw option or just read with the r option.

Now press F5 to make the debugger continue with execution. You’re thinking why did it stop? Disk Basic hasn’t even started yet? The reason it stopped is because Disk Basic is setting up the memory and it did a write instruction at address $4000. In the debug window it shows a message Stopped at watchpoint 1 writing byte to 00004000 (PC=C033) (data=32)

This is telling us that code at address C033 (Disk ROM address) wrote the byte 32 to address $4000 and since our watchpoint is set to stop code at this point it stopped so that we can now look at the code. We don’t really want to get into all the things RSDOS is doing as it boots up so let’s hit F5 again.

Now you should see the familiar RSDOS OK prompt. So let’s make sure our disk image is being used by mame. Type the DIR command in RSDOS and you should see the TEST.BIN as per the picture below. If you got this far things are looking good.

Let’s load our program: type LOADM"TEST" and hit Enter.

Our debugger stopped the code again as RSDOS loaded your program into memory at address $4000. That’s good, hit F5 once again to let it finish the loadm command.

The next thing we want to do is set up a breakpoint, which will stop program execution when the program counter gets to a certain address.  In our case our program is going to be executed at address $4000, so in the debug command line type the following command:

bp 4000

Next press F5 and in the RSDOS window type:

EXEC

This is where the fun begins.  The breakpoint stops execution and you can now step through the code line by line, watching the registers each step of the way.  You can also pull up a memory window with Command-D (probably Ctrl-D on Linux/Windows), or go to the Debug menu at the top of the screen and select Memory window.  In the Memory Window type 400, which is $400.  This will show us a hex view of the text screen of the CoCo.

Our program is going to write the word “HI” in the middle of the screen and you can see this in the Memory window as we step through the code.  Click on the debug window and hit Enter to step forward one line.  As you do, the S stack pointer will decrease by two bytes as it stores the A and B values in the stack memory space.  Hit Enter again and the LDA #$48 instruction is executed and the A accumulator will change to show the value 48.  Hit Enter again and the LDB #$49 will load the B accumulator with the value 49.  You can see D’s value is now 4849.  Hit Enter again and you can see the value at address 0500 in the memory window has changed to 48 49.  The RSDOS screen hasn’t changed yet, since time is frozen when we are debugging and the beam that refreshes our screen hasn’t moved much at all in the time it takes for the 6809 to execute the few instructions we have in our program.  We can now press F5 again and the PULS A,B,PC instruction will restore our accumulators to what they were before execution and return program execution to RSDOS.  You should then see the screen refresh and show the HI on the left side of the middle of the screen.  I hope you get the idea of how the debugger works.
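To see why storing to $500 lands on the left side of the middle of the screen, the address can be turned into a row and column.  This is just a small worked calculation in Python; it assumes the standard 32 column by 16 row text screen starting at $400, and the function name screen_position is made up for the example.

```python
TEXT_SCREEN_START = 0x400  # first byte of the text screen
COLUMNS = 32               # characters per row on the 32x16 text screen


def screen_position(address):
    """Return (row, column) on the text screen for a memory address."""
    return divmod(address - TEXT_SCREEN_START, COLUMNS)


row, column = screen_position(0x500)

# STD stores A first and then B, so D = $4849 puts 'H' at $500 and 'I' at $501.
d = (ord("H") << 8) | ord("I")

print(row, column, hex(d))
```

Row 8 of 16 (counting from 0) and column 0 is the start of the line just below the halfway point, which matches where the HI shows up.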

Another powerful feature of the watchpoint command is you can get it to stop execution only if the value of a certain RAM location changes to a specific value. For example (shrunk to fit on one line):

wp ffa0,10,w,{wpdata==0x0b},{printf "write to MMU %04X, value %02X @ %02X\n",wpaddr,wpdata,pc; g}

This command tells the debugger to watch $FFA0 through $FFAF (a length of $10 bytes) for a write operation.  If one occurs, it checks whether the value written is $0B and, if so, prints the message to the debug window and continues running.

An example of the output might be the following, where 200D is the program counter address when the value 0B was written to FFA2:

write to MMU FFA2, value 0B @ 200D
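Because the debugger takes a start address and a length (both in hex), it is easy to be off by one about the last address covered.  A tiny Python check, with a made-up helper name watched_range, shows the range for the command above:

```python
def watched_range(start, length):
    """Return the first and last address covered by a watchpoint."""
    return start, start + length - 1


first, last = watched_range(0xFFA0, 0x10)
print(f"{first:04X}-{last:04X}")
```

So a length of 10 (hex) starting at FFA0 covers sixteen addresses, the last one being FFAF.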

One last cool feature I want to show is the trace feature.  The trace command follows the execution of the program and saves the disassembled instructions to a text file to be analyzed.  You can set it up so it also saves the register data at those points in your file.  Before using it I’ll show you how to deactivate the watchpoints and breakpoints, since our watchpoint is still active.  In the Debug window, click the down arrow beside the command line bar and click on Break; this stops execution and allows you to use the debug features again, just like when a breakpoint or watchpoint has been triggered.  From the Debug menu select New (Break|Watch)points Window.

The window defaults to show the breakpoints but if you click on the top bar you can select ALL Breakpoints or ALL Watchpoints as below:

Select each view and click on the lines and you will see the X on the left turn into a red 0 to indicate it is disabled.

In this example I’m going to set a breakpoint at $4000 again manually and another breakpoint at the end of our program.  This is so the trace output will be short, as these files can get huge if you let them run for a few seconds, depending on the speed of your computer.

From the debug window command line type the following two lines to setup the two new breakpoints

bp 4000

bp 4009

Hit F5, go to the RSDOS window and type EXEC again.  After the breakpoint stops execution and the debugger shows address 4000, turn on the trace function with the following command, all on one line.  The trace command needs the name of the output file as its first parameter, so replace FILENAME with the name of the text file you want the trace written to:

trace FILENAME,0,,{tracelog "A=%02X,B=%02X,X=%02X,Y=%02X,U=%02X,S=%02X,CC=%02X ",a,b,x,y,u,s,cc}

Then hit F5 to continue the program execution, which will stop at $4009 where our last breakpoint was set. Turn off the trace function with the command in the debug window

trace off

Hit F5 again to get back to the RSDOS prompt.  When you want to close MAME, once again hit the keyboard emulation mode key and then the Esc key.

Once you are out of MAME you can view the trace file in a text editor and you should see the following:

A=00,B=44,X=ABAB,Y=AAF1,U=2E0,S=7F32,CC=84 4002: LDA #$48
A=48,B=44,X=ABAB,Y=AAF1,U=2E0,S=7F32,CC=80 4004: LDB #$49
A=48,B=49,X=ABAB,Y=AAF1,U=2E0,S=7F32,CC=80 4006: STD $0500
A=48,B=49,X=ABAB,Y=AAF1,U=2E0,S=7F32,CC=80 4009: PULS A,B,PC

This shows the values of the accumulators and registers on each line of code.
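Trace files get long quickly, so it can help to pick them apart with a script.  Here is a rough Python sketch that parses lines in the format shown above and reports which registers differ between two consecutive lines.  The function names are invented for this example, and it assumes every line has exactly the layout produced by the tracelog format string used in this article.

```python
import re

# One "NAME=HEX" pair, e.g. "A=48" or "X=ABAB".
FIELD = re.compile(r"(\w+)=([0-9A-Fa-f]+)")


def parse_trace_line(line):
    """Split one trace line into (registers, address, instruction)."""
    register_text, _, rest = line.partition(" ")
    registers = {name: int(value, 16) for name, value in FIELD.findall(register_text)}
    address, _, instruction = rest.partition(":")
    return registers, int(address, 16), instruction.strip()


def changed_registers(before, after):
    """Return the registers whose values differ between two trace lines."""
    return {name: value for name, value in after.items() if before.get(name) != value}


TRACE = [
    "A=00,B=44,X=ABAB,Y=AAF1,U=2E0,S=7F32,CC=84 4002: LDA #$48",
    "A=48,B=44,X=ABAB,Y=AAF1,U=2E0,S=7F32,CC=80 4004: LDB #$49",
]

first, _, _ = parse_trace_line(TRACE[0])
second, address, instruction = parse_trace_line(TRACE[1])
print(f"{address:04X} {instruction}", changed_registers(first, second))
```

In the sample output the registers on a line are the values before that line’s instruction runs, so the differences between two lines are the effect of the instruction on the first of them.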

Another feature of the debugger is that you can get it to run until an IRQ is triggered by hitting F7.

Another helpful thing you can do in the MAME debug mode is change the contents of any accumulator/register.

For example, when you stop execution you can type in the debug window’s command line:

pc=1000

This changes the program counter (pc) to address $1000, and the program will continue from address $1000 if you hit F5 or step through the code.

a=94

This changes the A accumulator’s value to $94.  You get the idea…

If you want to see the cycle counts in your code listing you can add these lines to your assembly source code:

        opt     c
        opt     ct
        opt     cd
        opt     cc

The code listing will output the cycle counts from the place you inserted the above special options.  Anytime you want to reset the counts you can just insert the following:

        opt     cd
        opt     cc

Here is some sample output so you can see how to use the options in your source code and where the actual cycle counts appear in the listing, just to the left of the 6809 instructions.  This example code shows different ways to clear data in memory (or the screen).  It’s from another article I’m working on about assembly optimization.

                      (       mycode.asm):00001                         opt     c
                      (       mycode.asm):00002                         opt     ct
                      (       mycode.asm):00003                 
                      (       mycode.asm):00004                         ORG     $4000
4000                  (       mycode.asm):00005                 Start:
                      (       mycode.asm):00006                         opt     cd
                      (       mycode.asm):00007                         opt     cc
                      (       mycode.asm):00008                 * Slow way
4000 8E4000           (       mycode.asm):00009 [3]     3               LDX     #$4000
4003 CE0000           (       mycode.asm):00010 [3]     6               LDU     #$0000
                      (       mycode.asm):00011                         opt     cd
                      (       mycode.asm):00012                         opt     cc
                      (       mycode.asm):00013                 * This loop is 15 cycles to update two bytes
                      (       mycode.asm):00014                 * We have to do this loop $2000 / 2 bytes each pass = $1000 times
                      (       mycode.asm):00015                 * 15 cycles * $1000 or 4096 = 61,440 cpu cycles
4006 EF81             (       mycode.asm):00016 [5+3]   8       !       STU     ,X++
4008 8C6000           (       mycode.asm):00017 [4]     12              CMPX    #$4000+$2000
400B 26F9             (       mycode.asm):00018 [3]     15              BNE     <
                      (       mycode.asm):00019                 
                      (       mycode.asm):00020                         opt     cd
                      (       mycode.asm):00021                         opt     cc
                      (       mycode.asm):00022                 * Faster way
400D 8E4000           (       mycode.asm):00023 [3]     3               LDX     #$4000
4010 CE0000           (       mycode.asm):00024 [3]     6               LDU     #$0000
4013 CC2000           (       mycode.asm):00025 [3]     9               LDD     #$2000
                      (       mycode.asm):00026                         opt     cd
                      (       mycode.asm):00027                         opt     cc
                      (       mycode.asm):00028                 * This loop is mostly 13 cycles sometimes 18 cycles every 256 bytes
                      (       mycode.asm):00029                 * $2000 / $100 = $20
                      (       mycode.asm):00030                 * $20 / 2 = $10  (half because we write 2 bytes per cycle)
                      (       mycode.asm):00031                 * $2000 - $20 = $1FE0
                      (       mycode.asm):00032                 * $1FE0 / 2 = $FF0  (half because we write 2 bytes per cycle)
                      (       mycode.asm):00033                 * 13 cycles * $FF0 + 18 cycles * $10 = $CF30 + $120 = $D050 = 53,328 cpu cycles
4016 EF81             (       mycode.asm):00034 [5+3]   8       !       STU     ,X++
4018 5A               (       mycode.asm):00035 [2]     10              DECB
4019 26FB             (       mycode.asm):00036 [3]     13              BNE     <
401B 4A               (       mycode.asm):00037 [2]     15              DECA
401C 26F8             (       mycode.asm):00038 [3]     18              BNE     <
                      (       mycode.asm):00098
                      (       mycode.asm):00099                 * Fastest method is to use unfolded loops
                      (       mycode.asm):00100                 * and use the U Stack pointer instead of a ST instruction
4073 CC0000           (       mycode.asm):00101 [3]     136             LDD     #$0000
4076 8E0000           (       mycode.asm):00102 [3]     139             LDX     #$0000
4079 3184             (       mycode.asm):00103 [4+0]   143             LEAY    ,X
407B CE6000           (       mycode.asm):00104 [3]     146             LDU     #$4000+$2000
                      (       mycode.asm):00105                         opt     cd
                      (       mycode.asm):00106                         opt     cc
                      (       mycode.asm):00107                 * This loop is 70 cycles to write 32 bytes
                      (       mycode.asm):00108                 * We cycle through the loop 256 times so the calculation is
                      (       mycode.asm):00109                 * 256 * 70 = 17,920 CPU Cycles
407E 3636             (       mycode.asm):00110 [5+6]   11      !       PSHU    D,X,Y
4080 3636             (       mycode.asm):00111 [5+6]   22              PSHU    D,X,Y
4082 3636             (       mycode.asm):00112 [5+6]   33              PSHU    D,X,Y
4084 3636             (       mycode.asm):00113 [5+6]   44              PSHU    D,X,Y
4086 3636             (       mycode.asm):00114 [5+6]   55              PSHU    D,X,Y
4088 3606             (       mycode.asm):00115 [5+2]   62              PSHU    D
408A 11834000         (       mycode.asm):00116 [5]     67              CMPU    #$4000
408E 22EE             (       mycode.asm):00117 [3]     70              BHI     <
                      (       mycode.asm):00118                 
                      (       mycode.asm):00119                         END     Start

I should also point out a nice feature of lwasm: the greater than > and less than < branch targets.  You don’t need a label for every branch instruction.  In the listing above you can see “BHI    <”, which tells the assembler to branch if higher back to the first “!” found above in the source.  You can also branch forward with an instruction like “BNE    >”, which tells the assembler to branch if not equal to the next “!” found below in your source code.
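The cycle totals in the comments of the clearing listing are easy to double check.  This short Python calculation just repeats the arithmetic from those comments (the variable names are mine), so you can plug in your own loop costs when you try other variations:

```python
BYTES_TO_CLEAR = 0x2000

# Slow way: 15 cycles per pass, 2 bytes written per pass.
slow = 15 * (BYTES_TO_CLEAR // 2)

# Faster way, following the listing's comments:
# most passes cost 13 cycles, and $10 passes cost 18 cycles.
long_passes = (BYTES_TO_CLEAR // 0x100) // 2
short_passes = (BYTES_TO_CLEAR - 0x20) // 2
faster = 13 * short_passes + 18 * long_passes

# Fastest way: 70 cycles per pass, 256 passes.
bytes_per_pass = 5 * 6 + 2  # five PSHU D,X,Y (6 bytes each) plus one PSHU D
fastest = 70 * 256

print(slow, faster, fastest)
print(bytes_per_pass * 256 == BYTES_TO_CLEAR)
print(round(slow / fastest, 2))
```

The last line shows how many times faster the unrolled PSHU version is than the simple loop.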

I should also point out there is a special version of MAME on GitHub that has some enhancements for the CoCo that might come in handy.  You can read up about it and its features here.

I hope this info helps others get the most out of using MAME to learn assembly language programming.


Zilog z80 to Motorola 6809 Transcode – Part 023 – Optimized sprite rendering, combining Compiled sprites with Stack Blasting

Talking with other CoCo users about optimizing the sprite rendering on the CoCo I’ve figured out the best way to render the sprites for Pac Man on the CoCo3.  This article will summarize what I’ve learned and then explain what I will use for Pac Man.

I got some great tips from Richard Goedeken’s game engine for the CoCo 3 called “Dynosprite”, which is available on GitHub.  Going through his source code I found that it not only does straight LDA, LDB, LDD commands but also checks to see if the following instructions can be used with the A or B accumulators:


This saves another byte over a straight LD instruction, but the speed is the same.  Every byte counts!

For the examples below I’m using a Pac Man sprite facing the right with his mouth wide open.  The current palette uses 9 as yellow and 4 as black, two pixels per byte.

Previously, the fastest method I knew for doing compiled sprites was the following:

[4+1]   5 LEAU 5,X
[3]     8 LDD #$4999
[3]    11 LDX #$9999
[4+1]  16 STA -3,U
[5+1]  22 STX -2,U
[5+1]  28 STD -4+128,U
[5+1]  34 STX -2+128,U
[4+4]  42 LEAU 256,U
[4]    46 LDY #$9994
[5+1]  52 STX -4,U
[6+1]  59 STY -2,U
[4+1]  64 STB -4+128,U
[5+1]  70 STX -3+128,U
[4+4]  78 LEAU 256,U
[5+1]  84 STD -5,U
[6+1]  91 STY -3,U
[4+1]  96 STA -5+128,U
[5+1] 102 STX -4+128,U
[4+4] 110 LEAU 256,U
[4+1] 115 STA -5,U
[6+1] 122 STY -4,U
[4+1] 127 STA -5+128,U
[5+1] 133 STX -4+128,U
[4+4] 141 LEAU 256,U
[5+1] 147 STD -5,U
[6+1] 154 STY -3,U
[4+1] 159 STB -4+128,U
[5+1] 165 STX -3+128,U
[4+4] 173 LEAU 256+128,U
[5+4] 182 STX -4-128,U
[6+4] 192 STY -2-128,U
[5+1] 198 STD -4,U
[5+1] 204 STX -2,U
[4+1] 209 STA -3+128,U
[5+1] 215 STX -2+128,U
[5]   220 RTS

This method is still faster than stack blasting: it takes 220 CPU cycles to draw on screen and 106 bytes of RAM, which is pretty good since a full 16×16 sprite would take 128 bytes.  Another benefit of compiled sprites is that you aren’t stuck writing a certain block size to the screen.  This actual sprite is really only 10 pixels x 13 rows.  As bitmap data for stack blasting it would still require 65 bytes of RAM, and you would need code to handle different size sprites.

The fastest method I have come up with is:

[4+4]   8 LEAU 5+128*12,X
[3]    11 LDD #$4999
[3]    14 LDX #$9999
[5+3]  22 PSHU A,X
[4+1]  27 LEAU -128+3,U
[5+4]  36 PSHU D,X
[4+1]  41 LEAU -128+4,U
[4]    45 LDY #$9994
[5+4]  54 PSHU D,Y
[4+1]  59 LEAU -128+3,U
[5+3]  67 PSHU B,X
[4+1]  72 LEAU -128+3,U
[5+4]  81 PSHU D,Y
[4+1]  86 LEAU -128+3,U
[5+3]  94 PSHU A,X
[4+1]  99 LEAU -128+3,U
[5+3] 107 PSHU A,Y
[4+1] 112 LEAU -128+3,U
[5+3] 120 PSHU A,X
[4+1] 125 LEAU -128+4,U
[5+4] 134 PSHU D,Y
[4+1] 139 LEAU -128+4,U
[5+3] 147 PSHU B,X
[4+1] 152 LEAU -128+4,U
[5+4] 161 PSHU X,Y
[4+1] 166 LEAU -128+4,U
[5+4] 175 PSHU D,X
[4+1] 180 LEAU -128+4,U
[5+3] 188 PSHU A,X
[5]   193 RTS

This is only 193 CPU cycles and 77 bytes of RAM.

The code above uses PSHU instead of ST instructions and works from the bottom of the sprite to the top (this suggestion was from Curtis L. Boyle).  Since U is changed automatically by each PSHU, the LEAU that follows only needs a small offset that fits in a signed 8 bit value, so it is shorter and faster.  Also, one PSHU is faster than multiple ST instructions.
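To put the numbers quoted for the two versions side by side, here is a quick Python calculation.  It only uses the figures given in this article (220 and 193 cycles, 106 and 77 bytes, two pixels per byte); the helper name bitmap_bytes is made up for the example.

```python
def bitmap_bytes(width_pixels, height_rows, pixels_per_byte=2):
    """Bytes needed to store a sprite as plain bitmap data."""
    bytes_per_row = -(-width_pixels // pixels_per_byte)  # round up
    return bytes_per_row * height_rows


st_cycles, st_bytes = 220, 106      # compiled sprite using ST instructions
pshu_cycles, pshu_bytes = 193, 77   # compiled sprite using PSHU

print(bitmap_bytes(16, 16), bitmap_bytes(10, 13))
print(st_cycles - pshu_cycles, st_bytes - pshu_bytes)
print(round(100 * (st_cycles - pshu_cycles) / st_cycles, 1))
```

The first line prints the bitmap sizes for a full 16×16 sprite and for the 10 x 13 sprite used here, the second the cycles and bytes saved by the PSHU version, and the third the cycle saving as a percentage.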

This compiled/halfstack blasted sprite technique can be used especially well for any CoCo1 game too, since you have less colours which means more repeating values in the sprites.

After doing my test video shown in my previous article I’ve decided I’m going to use a different method of updating the sprites on screen.  I’m planning on using double buffering, since the quick and easy method of redrawing whatever the ghosts run over and ignoring what is behind Pac Man is becoming very difficult, and in the end it wasn’t looking perfect.  The main struggle was when Pac Man goes around corners: there are times when it moves more than 2 pixels in each direction, and if that’s not accounted for properly there were some yellow pixels left on the screen.  I’m hoping double buffering will make these troubles go away.  It might be a little slower, but I think with the improved compiled sprite code using PSHU speed won’t be an issue.

So here I go again re-writing the graphics engine…  The graphics rendering is taking a lot longer than I thought, even more work than the actual z80 to 6809 transcode tool!  But it’s been a great learning experience.

See you in the next post.


Zilog z80 to Motorola 6809 Transcode – Part 022 – Quick and dirty Speed Test – Five Sprites and 2 audio samples at the same time

I decided to put the Pac Man Sprite code and audio code together and see how fast the Pac Man transcode could be drawing five compiled sprites and playing back two separate audio samples at the same time.  There’s still a lot of work to do at this point.  The graphics are still leaving junk on the screen and the cut scenes still need to be done.  There are still some audio problems…

But I’m very happy with the speed the game is currently playing.  I’m quite sure I can speed the sprites up a little more, whether it will speed up the game or not I don’t know since it’s all tied to the IRQ hitting 60 times a second.

I thought I’d share a short 2 minute video showing Pac Man running on the CoCo 3, so anyone who has been following this blog can see where I’m at.  The quality of my CoCo 3 monitor is pretty bad, but you can at least see the speed the game is playing at this point in time.  I thought it would be best to show it on a real CoCo 3 monitor rather than from MAME.

The video can be found on youtube here:

See you in the next post…



Zilog z80 to Motorola 6809 Transcode – Part 021 – Compiled Sprites are faster than Stack blasting!

Hello.  After writing previous blog posts on how great stack blasting is, I’ve recently found out about an even faster sprite rendering method called Compiled Sprites.  A few weeks ago I was watching a CoCo youtube video and they briefly mentioned using Compiled Sprites for game rendering.  So I googled it, and there wasn’t much information on the technique.  I guess this is because most computers had built in hardware for sprites, and only a few computers like the CoCo had hires graphics but no sprite hardware.  So what are Compiled Sprites?  It is a method of writing the data that needs to be put on the screen as assembly code that stores the picture data directly to the video RAM.  Can you guess what this example is?

        LEAX    7,X
        LDD     #$9999
        STD     -5,X
        STD     122,X
        LDB     #$90
        STD     124,X
        STB     -3,X
        LEAX    256,X
        LDB     #$99
        STD     -6,X
        STD     122,X
        STA     -4,X
        LDA     #$90
        STA     124,X
        LDA     #$09
        STA     -7,X
        STA     121,X
        LEAX    256,X
        LDA     #$99
        STD     -7,X
        STD     121,X
        STA     -5,X
        LDA     #$90
        STA     123,X
        LEAX    256,X
        LDA     #$99
        STD     -7,X
        STD     121,X
        LDA     #$90
        STA     123,X
        LEAX    256,X
        LDA     #$99
        STD     -7,X
        STD     122,X
        STA     -5,X
        LDA     #$90
        STA     124,X
        LDA     #$09
        STA     121,X
        LEAX    256,X
        LDA     #$99
        STD     -6,X
        STD     122,X
        LDB     #$90
        STD     124,X
        STA     -4,X
        LDA     #$09
        STA     -7,X
        LEAX    256,X
        LDD     #$9999
        STD     -5,X
        LDA     #$90
        STA     -3,X

The above code is actually a sprite of Pac Man just like the picture below.

[Screenshot: Pac Man sprite]

It turns out that rendering data to the screen as a bunch of LDD, LDA, LDB and STD, STA, STB instructions is faster than stack blasting!  It takes more RAM to store the sprites as code, but not a lot more.  In the tests I’ve done, stack blasting a 16×14 sprite takes 529 cycles.  Compiled Sprites vary in size and speed, since it depends on how much detail is in the 16×14 pixel sprite.  I’ve had some as low as 263 cycles, and the larger ones are still around 450 cycles.  These are incredible numbers considering how fast stack blasting already is.  There are also some extra benefits to Compiled Sprites besides speed.  You don’t have to worry about the stack pointers like you do with stack blasting.  You also get transparency, since you only write the bits that you need to the screen.
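To make the idea concrete, here is a very naive sprite compiler sketched in Python.  It is not the tool used for the sprites in this blog: the function name compile_sprite, the input format (rows of byte values with None for transparent bytes) and the 128 byte row stride are all assumptions for illustration.  It only uses the A accumulator and skips reloading A when it already holds the right value; it does not use D for 16 bit stores or LEAX to keep the offsets small, which the hand written examples do.

```python
ROW_STRIDE = 128  # bytes per screen row (an assumption for this sketch)


def compile_sprite(rows):
    """Turn sprite bytes into 6809 source lines that draw them at X.

    rows is a list of rows; each row is a list of byte values,
    with None marking a transparent byte that is not written.
    """
    lines = []
    in_a = None  # value currently held in accumulator A
    for y, row in enumerate(rows):
        for x, value in enumerate(row):
            if value is None:
                continue  # transparent: leave the screen untouched
            if value != in_a:
                lines.append(f"        LDA     #${value:02X}")
                in_a = value
            lines.append(f"        STA     {y * ROW_STRIDE + x},X")
    lines.append("        RTS")
    return lines


# A tiny made-up 3 byte by 2 row sprite, two pixels per byte.
sprite = [
    [None, 0x99, 0x99],
    [0x09, 0x99, None],
]
print("\n".join(compile_sprite(sprite)))
```

Even this simple version shows where the speed comes from: repeated values are loaded once, and transparent bytes cost nothing at all.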

I’m now in the process of converting my sprites to Compiled Sprites and then I’ll have to implement the new sprite handling into my Pac Man transcode.  This is going to be a lot of work, but in the end it will be that much better.

See you in the next post.


Zilog z80 to Motorola 6809 Transcode – Part 020 – Sound ideas

I’ve taken a break from optimizing the Pac Man code to do some experiments with getting the CoCo 3 to play audio.  This is something I’ve read about but have never coded myself before, I guess mainly because it involves Interrupt Requests, at least if you want the CoCo to make sounds and continue doing something else at the same time.  As I’ve written before, learning more about IRQs is one of the reasons I am doing this Pac Man transcode, and of course to see if the CoCo 3 can do it.  🙂

Recently I posted a question on the CoCo mailing list asking for some code to do sample playback, and I got some really solid information about how the CoCo DAC is used.  Basically, once the CoCo is configured properly, you simply feed samples at a constant rate to the 6 bit DAC.  The best way is to use the FIRQ with the timer interrupt: you set the timer to a certain value, and when the timer counts down to zero the FIRQ is triggered and its routine sends a byte of sampled data to the DAC.  Since this is going to happen very often while your program is running, it is very important that the FIRQ routine is optimized.  I looked around for some code examples for doing this using the FIRQ but I came up empty, so I decided to take a look at how John Kowalski did sound in his Donkey Kong game.  I hope he doesn’t mind, since he did post on his website that he hoped his version of Donkey Kong would inspire others to convert other video games to work on the CoCo.  He is a master at CoCo programming and I figured his code would be very optimized.  I didn’t decode all of his audio routine, but it looks like his FIRQ routine changes the CoCo 3 memory bank from the normal bank 0 to bank 1, where his audio samples are already loaded in memory, and from there it grabs a sample byte, does some addition to it and outputs it to the DAC.  Once his data pointer reaches $8000 (the sample end location) it stops or repeats (I can’t remember)…  His method gave me the idea to do something similar: if I use the top 32k for one sample and the bottom 32k for another, I can have two samples in memory at the same time and play them back together without much CPU involvement at all.  I’ll explain how I’ve worked this out below…

First things first…  I ran the original Pac Man code in MAME in debug mode and set it up to output to a WAV file.  I tweaked the original Pac Man code while running it in debug mode (since I have it all decoded now) so that I could disable certain sounds and enable others, isolating individual sounds to get the best sounding samples from Pac Man that I can.  The MAME command below is an example of capturing the cutscene music.

mame -window -natural -nojoy -debug pacman -wavwrite cutscenes.wav

Then I used some audio tools to trim the samples to the exact length I needed and converted them down to 6kHz mono, 8 bit unsigned raw data.  In this format the data is ready to send to the CoCo 3’s DAC.  I found some code on the internet from Robert Gault that explains how to set up the CoCo’s DAC so that you can simply send 8 bit audio to the CoCo and it will ignore the data in bits 0 & 1.  This means the audio doesn’t need to be processed ahead of time (setting bits 0 & 1 to zero) or masked with the ANDA  #$FC instruction before sending it to the DAC.  As it turns out, I ended up stripping those bits away in my Pac Man transcode to save space, but for other projects it will come in handy.

I’ll spare you the details about storing the 8 bit data as 6 bits and decompressing it.  But basically you take the high 6 bits of each sample and turn every 4 bytes of data into 3 bytes (4 bytes * 8 bits = 32 bits, 4 values * 6 bits = 24 bits, or 3 bytes).
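For anyone who does want the details, the packing arithmetic can be sketched in a few lines of Python. This is only an illustration of the 4-bytes-into-3 idea: the function names and the bit order inside each 3-byte group are assumptions, and the actual decompression code for this project also rearranges the data for the FIRQ routine, which is not modelled here.

```python
def pack_6bit(samples):
    """Pack the high 6 bits of each 8-bit sample, four samples per three bytes."""
    if len(samples) % 4:
        raise ValueError("sample count must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(samples), 4):
        a, b, c, d = (s >> 2 for s in samples[i:i + 4])  # keep bits 7..2
        word = (a << 18) | (b << 12) | (c << 6) | d       # one 24-bit group
        out += word.to_bytes(3, "big")
    return bytes(out)


def unpack_6bit(packed):
    """Expand packed data back to 8-bit samples with bits 0 and 1 cleared."""
    out = bytearray()
    for i in range(0, len(packed), 3):
        word = int.from_bytes(packed[i:i + 3], "big")
        for shift in (18, 12, 6, 0):
            out.append(((word >> shift) & 0x3F) << 2)
    return bytes(out)
```

Unpacking gives back each sample with its two low bits set to zero, which is all the 6 bit DAC uses anyway.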

As I stated above, the FIRQ is set up to be triggered when the timer counts down from a specific value.  For Pac Man I’m currently using a value of $280 with the timer speed set to 279.365 nanoseconds.  I have two FIRQ routines: one that plays a sample over and over (the constant siren sample while Pac Man is running in game mode), and another that plays the same looping sample but adds a second sample that is played only once.
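As a quick sanity check on those numbers, the playback rate can be estimated by multiplying the timer value by the tick length. The Python sketch below does just that; the names are made up for the example, and it assumes the timer fires after exactly the programmed number of ticks, ignoring any extra counts the hardware may add when it reloads.

```python
TIMER_TICK_NS = 279.365   # timer input period quoted in the post
TIMER_VALUE = 0x280       # countdown value quoted in the post (640 decimal)


def firq_rate_hz(timer_value, tick_ns=TIMER_TICK_NS):
    """Approximate FIRQ frequency for a given countdown value."""
    period_s = timer_value * tick_ns * 1e-9
    return 1.0 / period_s


rate = firq_rate_hz(TIMER_VALUE)
print(f"{rate:.0f} Hz")
```

Under these assumptions the FIRQ fires roughly 5,600 times per second, which is in the neighbourhood of the 6kHz the samples were converted to.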

There may be simpler and better ways to pull this off but this is the method I came up with:

In order to make the FIRQ as fast as possible I don’t want to use any more registers or accumulators than I need to.  Unlike the IRQ, which automatically pushes all the accumulators and registers to the stack and restores them with the RTI instruction, the FIRQ only stacks the CC register and the program counter, so it knows where to return to once it’s complete.  This makes it faster but you also have to manage the registers yourself.  I have the FIRQ working using only the A accumulator and no other registers.  Here is the code I’m currently using for the looped audio playback:

* Play Sample in the background continuously in mem $FD00-$8000 when it drops below $8000 it will loop back to its start location
Sample1Start:
        FDB     $0000       * location where sample starts counting down from, used for looping
* Play the Siren sample in the background continuously in mem block $8000-$9FFF when it hits $7FFF it will loop back to its start location
FIRQ_Interrupt1:
        STA FIRQRestoreA+1  * Save A
        LDA FIRQENR         * Re enable the FIRQ
        LDA #$21            * Set the MMU bank registers to
        STA INIT1_Register1 * Bank task 1 - alternate 64k bank is now the current one
LoadAudio1:
        LDA $9000           * Load the sample byte, the address operand is the Sample 1 pointer (self modified)
        STA PIA1_Byte_0_IRQ * OUTPUT sound to DAC
        DEC LoadAudio1+2    * Decrement the LSB of Sample 1 pointer
        BNE >               * check if we hit 00, if so decrement the MSB of the sample pointer
        DEC LoadAudio1+2    * Decrement the LSB of Sample 1 pointer (force it to go from xx00 to xxFF modified sample data to account for this while being decompressed)
        DEC LoadAudio1+1    * Decrement the MSB of Sample 1 pointer
        BMI >               * if negative then we are good still over $7FFF otherwise (end of sample 1, reset pointer)
Hit8000:                    * Restore the pointer to the start location of the sample
        LDA Sample1Start    * Point to the MSB of the sample Start location
        STA LoadAudio1+1    * Store the MSB of Sample 1 pointer
        LDA Sample1Start+1  * Point to the LSB of the sample Start location
        STA LoadAudio1+2    * Store the LSB of Sample 1 pointer
!       LDA     #$20        * Set the MMU bank registers to
        STA     INIT1_Register1 * Bank task 0 - Back to the normal 64k bank
FIRQRestoreA:
        LDA #$00            * The STA at the start of the FIRQ stores A's value here so it gets reloaded before the RTI, saves a cycle and a byte of RAM
        RTI * Return from the FIRQ

One extra step in processing the audio data was to reverse it, since the FIRQ routine has to count downward to know where the end of the sample is.  This is not the way John Kowalski did it in Donkey Kong, but the code and idea are similar.  The sample starts, for example, at $9355 and counts down until the MSB of the address pointer changes to a positive value, which means it has reached $7F.  Then it resets the pointer to $9355 and starts again.

To summarize the FIRQ:

  • The routine saves the A accumulator by storing it into the operand of the LDA instruction near the end (a kind of self-modifying code), so that this LDA restores A’s value just before the routine exits.
  • Next it re-enables the FIRQ by doing an LDA FIRQENR ($FF93).
  • Then it switches the MMU memory bank to bank 1, where I have already set up the sample in memory block 4 ($8000-$9FFF).
  • The program loads A from the current pointer location and sends it to the DAC.
  • It decrements the pointer to the sampled data, and if the MSB becomes a positive number it resets the pointer to the start of the data.
  • It swaps the MMU memory bank back to the normal bank 0.
  • Restore A.
  • Return from the interrupt.

There is one extra problem when decrementing the LSB and the MSB of the pointer separately.  The decrement is done before the check for 00, so if I only decremented the MSB when the LSB reached 00, the pointer would go from $9201 to $9200 and, because the LSB is now 00, straight on to $9100 in the same pass, and the next value would be $91FF.  To get around this problem, when the LSB becomes 00 I decrement the LSB again and decrement the MSB, so the value actually goes from $9201 to $9200 and then $91FF in one step, which skips the $9200 value.  I couldn’t think of a better way to do it and still keep the code tight, so I had to modify the audio sample data to compensate for this byte skipping, which is done in my decompression code.  I could have left the code stepping $9201, $9200, $9100, $91FF, $91FE… and modified the sample data to account for that instead, which would save me a few more cycles after every 255 bytes are sent to the DAC.  That saving is pretty negligible, but maybe in the future…
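The pointer behaviour is easier to see when it is stepped through outside the interrupt. The Python sketch below is a model of the DEC/BNE/DEC/DEC/BMI sequence from the looping FIRQ, written only to illustrate the address order; the function names are made up for the example.

```python
def step_pointer(ptr, start):
    """One FIRQ's worth of pointer update for the looping sample.

    The LSB is decremented, and when it reaches 00 it is decremented again
    (wrapping to FF) along with the MSB, so addresses ending in 00 are
    skipped.  If the MSB drops below $80 the pointer is reset to start.
    """
    msb, lsb = ptr >> 8, ptr & 0xFF
    lsb = (lsb - 1) & 0xFF
    if lsb == 0:
        lsb = 0xFF
        msb = (msb - 1) & 0xFF
        if msb < 0x80:          # BMI not taken: we fell below $8000
            return start
    return (msb << 8) | lsb


def addresses_read(start, count):
    """Return the first count addresses the routine would read from."""
    ptr, seen = start, []
    for _ in range(count):
        seen.append(ptr)
        ptr = step_pointer(ptr, start)
    return seen


if __name__ == "__main__":
    print([f"${a:04X}" for a in addresses_read(0x9202, 4)])
```

Starting at $9202, the addresses read are $9202, $9201, $91FF, $91FE, so each 256-byte page supplies 255 sample bytes, which is why the sample data has to be laid out to match.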

So why not just count upwards, you are probably asking?  It’s so that I can use the top 32k of RAM for one sample and the bottom 32k of RAM for the other.  My FIRQ has to exist in memory at all times in both MMU bank configurations (bank 0 and bank 1) to process the sound, and it is located in the memory at $E000-$FFFF, where the special CoCo configuration settings are also located ($FF00-$FFFF).  So I can’t have sample data in this location or it would clobber all those settings and my FIRQ code too.  Counting downwards allows me to check whether the MSB of the pointer changes from a negative value ($FF to $80) to a positive one ($7F to $00).  My other FIRQ routine does basically the same as the one above, except it also reads sample data from lower memory ($0000-$7FFF), adds it to the sample data from the top 32k, divides that value in half and sends it to the DAC.  This routine also counts downwards, and when the MSB of the second pointer passes $00, wraps to $FF and becomes negative, this 2nd FIRQ routine ends and changes the FIRQ pointer to point to the first FIRQ again.  In other words, the 2nd FIRQ plays a second sample once, along with the original sample that is being looped at the same time.
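The add-then-halve mixing step relies on the 6809 carry flag: ADDA leaves the ninth bit of the sum in carry, and RORA rotates that carry into bit 7, so the result is the true average even when the sum overflows 8 bits. Here is a small Python model of that pair of instructions, written only as an illustration (the function name is made up):

```python
def mix(sample_a, sample_b):
    """Average two unsigned 8-bit samples the way ADDA followed by RORA does."""
    total = sample_a + sample_b          # ADDA: 8-bit result plus a carry
    carry = total >> 8                   # carry flag after the add
    result = total & 0xFF                # what is left in A
    return (carry << 7) | (result >> 1)  # RORA: carry rotates into bit 7


if __name__ == "__main__":
    print(hex(mix(0xFF, 0xFF)), hex(mix(0x80, 0x40)))
```

Because the carry is kept, the mixed value never wraps around, and two full-volume samples simply average back to full volume.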

Here is the FIRQ2 code that plays the two samples together:

* Playback Sample 1 continuously, play Sample 2 once
* Siren will be at the normal location in mem block $FD00-$8000 when it hits $7FFF it will loop back to its start location
* 2nd sound will be in mem block $7FFF-$0000 backwards so we can count down to $0000 then leave this FIRQ when the sample is finished playing
* and jump to the main FIRQ just playing Sample 1 routine
FIRQ_Interrupt2:
        STA FIRQRestore2A+1  * Save A
        LDA FIRQENR          * Re enable the FIRQ
        LDA #$21             * Set the MMU bank registers to
        STA INIT1_Register1  * Bank task 1 - alternate 64k bank is now the current one
LoadAudio2:
        LDA $0000            * Load Siren Sound sample same address as the normal FIRQ (operand is the Sample 1 pointer)
AddAudio:
        ADDA $0000           * Add 2nd sample, operand points to address ($0000-$1FFF) counting downwards
        RORA                 * Divide the combined samples by two
        STA  PIA1_Byte_0_IRQ * OUTPUT sound to DAC
        DEC LoadAudio2+2     * Decrement the LSB of Sample 1 pointer
        BNE Decrement_Snd2   * check if we hit 00, if so decrement the MSB of the sample pointer
        DEC LoadAudio2+2     * Decrement the LSB of Sample 1 pointer (force it to go from xx00 to xxFF modified sample data to account for this while being decompressed)
        DEC LoadAudio2+1     * Decrement the MSB of Sample 1 pointer
        BMI Decrement_Snd2   * if negative then we are good still over $7FFF skip ahead, if not (end of sample, reset pointer)
Hit8000_2:                   * Restore the pointer to the start location of the sample
        LDA Sample1Start     * Point to the MSB of the sample Start location
        STA LoadAudio2+1     * Store the MSB of Sample 1 pointer
        LDA Sample1Start+1   * Point to the LSB of the sample Start location
        STA LoadAudio2+2     * Store the LSB of Sample 1 pointer
Decrement_Snd2:
        DEC AddAudio+2       * Decrement the LSB of the 2nd sample pointer
        BNE >                * If the LSB of the 2nd sample pointer has not reached 00 then exit routine
        DEC AddAudio+2       * Decrement the LSB of the 2nd sample pointer (force it to go from xx00 to xxFF modified sample data to account for this while being decompressed)
        DEC AddAudio+1       * Decrement the MSB of the 2nd sample pointer
        BPL >                * If we haven't reached $0000 then we are a positive number so exit routine, carry on if we are now at $FFFF
        LDA LoadAudio2+1     * copy the sample pointer position from this FIRQ
        STA LoadAudio1+1     * to the normal FIRQ sample pointer
        LDA LoadAudio2+2     * copy the sample pointer position from this FIRQ
        STA LoadAudio1+2     * to the normal FIRQ sample pointer
        LDA #FIRQ_Interrupt1/256 * The 2nd sample is finished so we can set the FIRQ back to the normal Just playing Siren Sound routine
        STA FIRQ_Start_Address   * Update FIRQ jump address MSB  ( in the future could use one digit if we use the Direct Page for the FIRQ)
        LDA #FIRQ_Interrupt1-((FIRQ_Interrupt1/256)*256) * Get LSB of other normal FIRQ routine
        STA FIRQ_Start_Address+1 * Update FIRQ jump address LSB
!       LDA #$20                 * Set the MMU bank registers to
        STA     INIT1_Register1  * Bank task 0 - Back to the normal 64k bank
FIRQRestore2A:
        LDA #$00                 * The STA at the start of the FIRQ stores A's value here so it gets reloaded before the RTI, saves a cycle and a byte of RAM
        RTI                      * Return from the FIRQ

Here is an example of setting up the FIRQs with the sample data:


* Play Background Siren Audio Sample over and over
        LDD     Snd_07_Insert_Coin_Length
        STD     LoadAudio1+1            * to the new FIRQ pointer
        LDA     #$2D                    * 11_Siren_6khz_8bit_Mono_rev MEM Block $2D
        STA     MMU_Reg_Bank1_4         * Page $8000-$9FFF  Block #4 - Move the sample Mem BLK to the $8000-$9FFF location in the alternate MMU Bank which the FIRQ uses
        LDX     #Snd_11_Siren_Length    * Get End location of sample to play
        STX     Sample1Start            * Start location that will be used to count down from and loop from
        STX     LoadAudio1+1            * Store where to start playback from


* Play Insert Coin Audio Sample
        LDD     LoadAudio1+1            * copy the sample pointer position from FIRQ1
        STD     LoadAudio2+1            * to the new FIRQ pointer (FIRQ2)
        LDD     Snd_07_Insert_Coin_Length
        STD     AddAudio+1              * Store the length of the 2nd sample which is also the pointer to start playing the sample from
        LDA     #$25                    * 07_Insert_Coin_6khz_8bit_Mono_rev MEM Block $25
        STA     MMU_Reg_Bank1_0         * Store the Mem BLK in MMU Bank 1 block 0 = $0000-$1FFF
        LDX     #FIRQ_Interrupt2        * Change the FIRQ to the one that handles playing two samples together
        STX     FIRQ_Start_Address      * Point to the FIRQ Jump Vector

It’s also a good idea to set up the Direct Page to point to the location where the FIRQ routines exist so they execute as fast as possible.

Just for completeness this is the code that I use to initialize the audio hardware on the CoCo:

* Configure Audio settings
        LDA     PIA0_Byte_1_HSYNC       * SELECT SOUND OUT
        ANDA    #$F7                    * RESET LSB OF MUX BIT
        STA     PIA0_Byte_1_HSYNC       * STORE
        LDA     PIA0_Byte_3_VSYNC       * SELECT SOUND OUT
        ANDA    #$F7                    * RESET MSB OF MUX BIT
        STA     PIA0_Byte_3_VSYNC       * STORE
        LDA     PIA1_Byte_3_IRQ_Ct_Snd  * GET PIA
        ORA     #$8                     * SET 6-BIT SOUND ENABLE
        STA     PIA1_Byte_3_IRQ_Ct_Snd  * STORE
* From Robert Gault
* This code masks off the two low bits written to $FF20 - we won't need this since we had to compress the audio but it is a neat feature
* So you can send the PCM Unsigned 8 Bit sample as is, no masking needed
        LDA     PIA1_Byte_1_IRQ
        PSHS    A
        ANDA    #%00110011            * FORCE BIT2 LOW
        STA     PIA1_Byte_1_IRQ       * $FF20 NOW DATA DIRECTION REGISTER
        LDA     #%11111100            * OUTPUT ON DAC, INPUT ON RS-232 & CDI
        STA     PIA1_Byte_0_IRQ
        PULS    A
        STA     PIA1_Byte_1_IRQ

I know this post was pretty technical but it needs to be if others want to use it in their projects in the future.  See you in the next post…



Zilog z80 to Motorola 6809 Transcode – Part 019 – You can try it, if you want…


First things first, just so I don’t get anyone telling me: at this point Pac Man is slow, about 1/3rd the speed of the original arcade machine.  The cutscenes need to be fixed and there is no sound…  Other than that it does play 100% true to the original arcade machine.

I’m at the point where I have to rewrite a lot of the way the Pac Man code relates to its own screen hardware in order to speed the game up and hopefully get it working at 100% speed.  Maybe even add sound if I can get it running fast enough.  At this point it’s more for educational purposes than for game play, as it is too slow (about 1/3rd real speed) to actually play for fun in its current state (unless you run the CoCo 3 and the game from a fast emulator).  After this point the video code will change a lot and probably won’t relate to the original hardware, so I figure the commented source code that I have now will be more useful to anyone who wants to learn how these old arcade games worked than a future final version that I’ll make available if I get it up to speed.

To get it working, and to keep the copyright holders happy, you must have the rights to the original ROMs, just as you would when using MAME to play this game.  It won’t play on the CoCo 3 unless you copy the ROM file PACMAN.5E to the floppy disk.

Here is what you need to do to try out the game:

1) – Copy the disk image Disk1.dsk file to a real floppy using your favourite disk image tool

2) – Copy the PACMAN.5E ROM file to the floppy

After copying the ROM file, type DIR and the disk should look like this:

Screen Shot 2017-03-08 at 4.35.10 PM

3) – Type RUN"PACMAN and hit Enter

If you get an NE Error it’s because you didn’t copy the PACMAN.5E file to the disk.  I’m sure taking a look at the PACMAN.BAS file will help.

You should see the loading screen:

Screen Shot 2017-03-08 at 4.52.14 PM

Then the Startup / Option selection screen:

Screen Shot 2017-03-08 at 4.52.49 PM

From this screen use the arrow keys to change the options.  Other than selecting RGB/Composite, these are the same as changing the DIP switch settings on a real Pac Man arcade machine.

From here press the Space Bar to start the Game, then hit 5 to insert a coin.  Press 1 to start a one player game or 2 to start a two player game.  Use the arrow keys to move Pac Man around.  Sorry no joystick support, as I don’t have a joystick for my CoCo 3.

Special Keys: 

A – Force scrolling to the top of the game screen

Z – Force scrolling to the bottom of the game screen

L – Skip the current level (this is built into the original Pac Man code for testing purposes)

A few things about this software:

The resolution of the real Pac Man hardware is 256×288, with the screen rotated 90 degrees, so the play field is 288×256.  The CoCo 3 hardware can use a maximum of 320×225 with 16 colours, so it can’t show the full screen at one time without scaling the screen pixels.  Since this is a translation of Pac Man and I wanted the experience to look as close as possible to the real machine, I decided that scrolling the screen vertically is the best option.  The game shows most of the maze at all times, and if Pac Man is on the bottom half of the screen then it scrolls down.  If Pac Man is on the top half it scrolls up.  It only scrolls as much as it needs to show the maze.  If you want to see the top of the screen where the points are shown then press the A key.  If you want to see the bottom of the screen to see what level you are on press the Z key.

Also included with the Disk1.dsk floppy image is my commented 6809 source code.  I’ve documented a lot of the code while transcoding the z80 code to the 6809 and also found a lot of commented code on the internet that is included.  So this is a really good resource if someone wants to learn how these old arcade games from the 80’s worked.

Here is the link

I hope others find it useful,

Glen Hewlett


Zilog z80 to Motorola 6809 Transcode – Part 018 – Transcode is complete and a look at level 256

Hello again,

I’ve transcoded all the Pac Man z80 code to the 6809 now, except for some audio related code which isn’t going to be used on the CoCo 3.  The game plays properly now.  The demo mode ghost movement matches the original Pac Man ghost movement so I think it should play the same as the real machine meaning you should be able to use Pac Man patterns on the CoCo 3 version just like the real hardware.

The cutscenes are coded in but they currently look pretty bad, since the sprites are changed around for normal game mode and the cutscenes use different sprites that I had to remove for normal game play (space limitation).  So I have to work on cleaning up the cutscenes.  Other than that, any other little graphic glitch is sprite related, and I don’t know if I’m going to fix them before working out a better way to render the sprites, which may fix those little glitches anyway…

I tested out the game and changed the level to 256 to see if it would be drawn crazy just like a real Pac Man machine does, and I’m happy to say that it does.  The only difference is that the colour of the text isn’t the same; this is because the palette info on the Pac Man hardware is handled differently than on the CoCo 3.



So from this point to make Pac Man perfect I need to:

  • Fix sprites while playing cutscene animations
  • Add Joystick support (I don’t have a joystick for my CoCo 3)
  • Speed up the game
  • Add sound output

Because the screen is a little too tall to be shown on the CoCo 3 screen I have written some code that auto scrolls vertically showing as much of the maze as it can at all times.  You can press the A key to see the top of the screen where the scores are shown.  Or you can press Z to see the bottom of the screen where the level info and the number of lives is shown.

See you in the next post…
