Optimizing 6809 Assembly Code: Part 2 – Speedup Storing Data – Unrolling Loops


Let’s move on to some more in-depth ways of speeding up useful tasks done in assembly.  This is probably a good place to point out that if you assemble your source code with lwasm from LWTOOLS, you can put the options below in your code to get cycle counts in your output listing.

opt c  - enable cycle counts
opt cd - enable detailed cycle counts breaking down addressing modes
opt ct - show a running subtotal of cycles
opt cc - clear the running subtotal

The cycle count info in the listings below was generated by lwasm.

I personally use lwasm to show me the cycle counts as I just described, but others might like to use a 6809 reference chart such as this one to look up the cycle counts on their own while programming.  Another good chart is the Sockmaster 6809/6309 reference chart.


First, a good little trick to keep in mind is to use the ABX instruction in place of LEAX   B,X

ABX is 2 cycles faster and one byte shorter than LEAX  B,X

Keep in mind that ABX adds the unsigned value of B (0 to 255) to X, while LEAX  B,X treats B as a signed value (-128 to 127) when adding it to X.
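The difference between the two instructions only shows up when bit 7 of B is set.  Here is a minimal Python sketch of the 16-bit arithmetic; the function names abx and leax_b_x are made up for this illustration and only model the address calculation, not the flags or timing.

```python
def abx(x, b):
    """Model ABX: add B to X with B treated as unsigned (0..255)."""
    return (x + (b & 0xFF)) & 0xFFFF


def leax_b_x(x, b):
    """Model LEAX B,X: add B to X with B treated as signed (-128..127)."""
    b &= 0xFF
    signed_b = b - 0x100 if b >= 0x80 else b
    return (x + signed_b) & 0xFFFF


if __name__ == "__main__":
    for b in (0x10, 0x7F, 0x80, 0xFF):
        print(f"B=${b:02X}  ABX -> ${abx(0x2000, b):04X}  "
              f"LEAX B,X -> ${leax_b_x(0x2000, b):04X}")
```

For B values of $00 to $7F the two give the same result, so swapping LEAX  B,X for ABX is only safe when B can never be meant as a negative offset.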

A note about using ABX from Sockmaster – “Oftentimes I rearrange register usage in my code just to make sure ABX can be used as much as possible.”


Another important way to speed up your program is to make use of the Direct Page (DP) register.  If you are playing sampled sounds in a game using an FIRQ, it is very useful to have your FIRQ routine in the DP space.  Also use the direct page for other small routines that get used a lot, and for storage of data that is accessed a lot.

                            LDA     #$FA
                            TFR     A,DP    * Set DP to $FA00-$FAFF
4000 FCFA55 [6]     6       LDD     $FA55   * slower and more bytes
4003 DC55   [5]     11      LDD     <$FA55  * faster and less bytes

If we don’t use direct addressing the LDD takes 6 cycles and 3 bytes.  If we use direct addressing indicated with the less than “<” symbol the LDD takes 5 cycles and 2 bytes.
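As a rough model of what direct addressing does: the CPU builds the address from the DP register (high byte) and the single operand byte (low byte).  The Python sketch below uses made-up helper names, direct_address and can_use_direct, to show when an address qualifies for the shorter form.

```python
def direct_address(dp, low_byte):
    """Effective address of a direct-mode access: DP is the high byte."""
    return ((dp & 0xFF) << 8) | (low_byte & 0xFF)


def can_use_direct(dp, address):
    """True when the address lies inside the 256-byte page selected by DP."""
    return ((address >> 8) & 0xFF) == (dp & 0xFF)


if __name__ == "__main__":
    dp = 0xFA
    print(f"${direct_address(dp, 0x55):04X}")      # the LDD <$FA55 example
    for addr in (0xFA00, 0xFAFF, 0xFB00):
        print(f"${addr:04X} direct page? {can_use_direct(dp, addr)}")
```

Note that the assembler cannot see what DP holds at run time.  Many 6809 assemblers, lwasm included, have a SETDP directive to tell the assembler which page to assume, or you can force the mode with "<" as in the listing above.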


Another thing to note is the impact on speed and size when you are using indexed addressing.

Mem  Code     Cycles               Assembly Code (Mnemonics)
4000 8E2000   [3]                  LDX     #$2000
4003 A684     [4+0]                LDA     ,X
4005 A61F     [4+1]                LDA     -1,X
4009 A610     [4+1]                LDA     -16,X
400B A688EF   [4+1]                LDA     -17,X
400E A68880   [4+1]                LDA     -128,X
4011 A689FF7F [4+4]                LDA     -129,X
4015 A601     [4+1]                LDA     1,X
4017 A60F     [4+1]                LDA     15,X
4019 A68810   [4+1]                LDA     16,X
401C A6887F   [4+1]                LDA     127,X
401F A6890080 [4+4]                LDA     128,X

Things to note about the above list:

  • Offsets from -16 to 15 need only two bytes
  • Offsets from -17 to -128 and from 16 to 127 need three bytes
  • Offsets of -129 or lower use 4 bytes and 8 cycles.  The same is true for positive offsets of 128 or higher.
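The table can be boiled down to a small lookup.  The Python sketch below uses a made-up function name, lda_indexed_cost, and reproduces the byte and cycle counts from the listing for LDA with a constant offset; it does not cover register offsets, auto increment or indirect modes.

```python
def lda_indexed_cost(offset):
    """Return (bytes, cycles) for LDA offset,X with a constant offset."""
    base_bytes, base_cycles = 2, 4          # opcode + postbyte, as in LDA ,X
    if offset == 0:
        extra_bytes, extra_cycles = 0, 0    # no offset at all
    elif -16 <= offset <= 15:
        extra_bytes, extra_cycles = 0, 1    # 5-bit offset held in the postbyte
    elif -128 <= offset <= 127:
        extra_bytes, extra_cycles = 1, 1    # 8-bit offset byte
    elif -32768 <= offset <= 32767:
        extra_bytes, extra_cycles = 2, 4    # 16-bit offset
    else:
        raise ValueError("offset does not fit in 16 bits")
    return base_bytes + extra_bytes, base_cycles + extra_cycles


if __name__ == "__main__":
    for off in (0, -1, -16, -17, -128, -129, 1, 15, 16, 127, 128):
        size, cycles = lda_indexed_cost(off)
        print(f"LDA {off:5d},X  {size} bytes  {cycles} cycles")
```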

For a real-world example: say we have a screen that is 64 bytes wide and we want to draw a line across the screen that is three pixels tall.  You could do this:

This routine takes 30 cycles * 32 = 960 cycles

Mem  Code   Cycles Running Total   Assembly Code (Mnemonics)
4000 8E2000   [3]                  LDX     #$2000
4003 CEFFFF   [3]                  LDU     #$FFFF      * White Pixels
4006 C602     [2]                  LDB     #2
4008 EF84     [5+0]   5    !       STU     ,X
400A EF8840   [5+1]   11           STU     64,X
400D EF890080 [5+4]   20           STU     128,X       * Big & Slow
4011 3A       [3]     23           ABX
4012 8C2040   [4]     27           CMPX    #$2000+64
4015 26F1     [3]     30           BNE     <

Or you could make it faster and smaller by indexing -64 and +64 as shown below.  You could also index back -128 and -64 and point X to the bottom row.

This version of the routine takes 27 cycles * 32 = 864 cycles

Mem  Code   Cycles Running Total   Assembly Code (Mnemonics)
4000 8E2040   [3]                  LDX     #$2000+64
4003 CEFFFF   [3]                  LDU     #$FFFF      * White Pixels
4006 C602     [2]                  LDB     #2
4008 EF88C0   [5+1]   6    !       STU     -64,X
400B EF84     [5+0]   11           STU     ,X
400D EF8840   [5+1]   17           STU     64,X
4010 3A       [3]     20           ABX
4011 8C2080   [4]     24           CMPX    #$2000+64+64
4014 26F2     [3]     27           BNE     <
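To double check the two totals, here is a small Python sketch that adds up the per-instruction cycle counts copied from the two listings.  The names FIRST_VERSION, SECOND_VERSION and total_cycles are just for this illustration.

```python
# (mnemonic, cycles) for one pass of each loop, copied from the listings.
FIRST_VERSION = [("STU ,X", 5), ("STU 64,X", 6), ("STU 128,X", 9),
                 ("ABX", 3), ("CMPX #", 4), ("BNE", 3)]
SECOND_VERSION = [("STU -64,X", 6), ("STU ,X", 5), ("STU 64,X", 6),
                  ("ABX", 3), ("CMPX #", 4), ("BNE", 3)]

PASSES = 64 // 2   # 64 bytes per row, 2 bytes written per STU


def loop_cycles(listing):
    """Cycles for a single pass through the loop."""
    return sum(cycles for _, cycles in listing)


def total_cycles(listing, passes=PASSES):
    """Cycles for the whole line, ignoring the setup instructions."""
    return loop_cycles(listing) * passes


if __name__ == "__main__":
    for name, listing in (("first", FIRST_VERSION), ("second", SECOND_VERSION)):
        print(name, loop_cycles(listing), total_cycles(listing))
```

The only difference between the two is swapping one 16-bit offset (9 cycles) for an 8-bit offset (6 cycles), which is where the 3 cycles per pass come from.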

Usually there are trade-offs when using techniques to speed up your code.  More speed typically means bigger or more complex code; not always, but typically.

Here are some examples of clearing some RAM to all zeros from location $2000 to $3FFF.  This could be used for clearing a graphics screen for a game.

Slow way, 61,440 CPU cycles:  This is a simple and easy loop to clear the RAM from $2000 to $3FFF

Mem  Code  Cycles  Running Total    Assembly Code (Mnemonics)
4000 8E2000 [3]                     LDX     #$2000
4003 CE0000 [3]                     LDU     #$0000
* This loop is 15 cycles to update two bytes
* We have to do this loop $2000 / 2 bytes each pass = $1000 times
* 15 cycles * $1000 or 4096 = 61,440 cpu cycles
4006 EF81   [5+3]   8       !       STU     ,X++
4008 8C4000 [4]     12              CMPX    #$2000+$2000
400B 26F9   [3]     15              BNE     <

 Faster way, 53,328 cpu cycles:  A faster way is to use the A and B accumulators as counters to see if our loop is finished.  

Mem  Code  Cycles Running Total     Assembly Code (Mnemonics)
400D 8E2000 [3]                     LDX     #$2000
4010 CE0000 [3]                     LDU     #$0000
4013 CC1000 [3]                     LDD     #$1000  * A = 16 outer passes, B = 0 gives 256 inner passes
* This loop is 13 cycles on most passes and 18 cycles on every 256th pass
* Total passes = $2000 bytes / 2 bytes per pass = $1000
* Passes that also run DECA/BNE = $1000 / $100 = $10
* Passes that only run DECB/BNE = $1000 - $10 = $FF0
* 13 cycles * $FF0 + 18 cycles * $10 = $CF30 + $120 = $D050 = 53,328 cpu cycles
4016 EF81   [5+3]   8       !       STU     ,X++
4018 5A     [2]     10              DECB
4019 26FB   [3]     13              BNE     <
401B 4A     [2]     15              DECA
401C 26F8   [3]     18              BNE     <
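Nested 8-bit counters are easy to get wrong, so here is a Python sketch that simulates the DECB/DECA loop above and counts passes and cycles.  The function name count_passes is made up for this illustration.  It starts A at $10 and B at $00, which are the values that give the $10 outer passes used in the cycle calculation.

```python
def count_passes(a, b):
    """Simulate the STU ,X++ / DECB / BNE / DECA / BNE loop.

    Returns (passes, cycles), where each pass writes two bytes.
    """
    passes = 0
    cycles = 0
    while True:
        passes += 1
        cycles += 8              # STU ,X++
        b = (b - 1) & 0xFF       # DECB wraps from $00 to $FF
        cycles += 2 + 3          # DECB + BNE
        if b != 0:
            continue             # inner branch taken
        a = (a - 1) & 0xFF       # DECA
        cycles += 2 + 3          # DECA + BNE
        if a == 0:
            return passes, cycles


if __name__ == "__main__":
    passes, cycles = count_passes(0x10, 0x00)
    print(f"{passes} passes, {passes * 2:#06x} bytes, {cycles} cycles")
```

Because B starts at zero, the inner loop runs 256 times per outer pass.  Starting A at $20 instead would write $4000 bytes, twice the range we want to clear.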

Even faster way – Code unrolled version = 34,048 CPU cycles:  It’s faster if we unroll the loop, which means less comparing is done to see if we are at the end of our loop.  This needs a little calculation ahead of time.  If we are going to use this for clearing the screen then 32 bytes per pass is a good number to use.  So using the above code we could unroll it to this:

Mem  Code  Cycles Running Total     Assembly Code (Mnemonics)
4049 8E2000 [3]     3               LDX     #$2000
404C CE0000 [3]     6               LDU     #$0000
404F 5F     [2]     8               CLRB
* This loop is 133 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 133 = 34,048 CPU Cycles
4050 EF81   [5+3]   8       !       STU     ,X++
4052 EF81   [5+3]   16              STU     ,X++
4054 EF81   [5+3]   24              STU     ,X++
4056 EF81   [5+3]   32              STU     ,X++
4058 EF81   [5+3]   40              STU     ,X++
405A EF81   [5+3]   48              STU     ,X++
405C EF81   [5+3]   56              STU     ,X++
405E EF81   [5+3]   64              STU     ,X++
4060 EF81   [5+3]   72              STU     ,X++
4062 EF81   [5+3]   80              STU     ,X++
4064 EF81   [5+3]   88              STU     ,X++
4066 EF81   [5+3]   96              STU     ,X++
4068 EF81   [5+3]   104             STU     ,X++
406A EF81   [5+3]   112             STU     ,X++
406C EF81   [5+3]   120             STU     ,X++
406E EF81   [5+3]   128             STU     ,X++
4070 5A     [2]     130             DECB
4071 26DD   [3]     133             BNE     <

Simon Jonassen (the Mad Man) shows that an even faster method is to use ABX and indexed addressing.  His method below ties a lot of the examples above together.  The example below is 26,880 cycles.

Mem  Code  Cycles  Running Total    Assembly Code (Mnemonics)
4000 8E2000 [3]     159             LDX     #$2000
4003 CE0000 [3]     162             LDU     #$0000
4006 CC0010 [3]     165             LDD     #$0010  * A = Loop 256 times, B adds 16 to X
* This loop is 105 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 105 = 26,880 CPU Cycles
4009 EF84   [5+0]   5       !       STU     ,X
400B EF02   [5+1]   11              STU     2,X
400D EF04   [5+1]   17              STU     4,X
400F EF06   [5+1]   23              STU     6,X
4011 EF08   [5+1]   29              STU     8,X
4013 EF0A   [5+1]   35              STU     10,X
4015 EF0C   [5+1]   41              STU     12,X
4017 EF0E   [5+1]   47              STU     14,X
4019 3A     [3]     50              ABX             * Move forward half a row
401A EF84   [5+0]   55              STU     ,X
401C EF02   [5+1]   61              STU     2,X
401E EF04   [5+1]   67              STU     4,X
4020 EF06   [5+1]   73              STU     6,X
4022 EF08   [5+1]   79              STU     8,X
4024 EF0A   [5+1]   85              STU     10,X
4026 EF0C   [5+1]   91              STU     12,X
4028 EF0E   [5+1]   97              STU     14,X
402A 3A     [3]     100             ABX             * Move forward to the next row
402B 4A     [2]     102             DECA
402C 26DB   [3]     105             BNE     <

At the expense of a little RAM we could improve the above code by using index values larger than 15.  The version below uses 26,368 cycles.

Mem  Code  Cycles Running Total     Assembly Code (Mnemonics)
4000 8E2000 [3]     159             LDX     #$2000
4003 CE0000 [3]     162             LDU     #$0000
4006 CC0020 [3]     165             LDD     #$0020  * A = Loop 256 times, B adds 32 to X
* This loop is 103 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 103 = 26,368 CPU Cycles
4009 EF84   [5+0]   5       !       STU     ,X
400B EF02   [5+1]   11              STU     2,X
400D EF04   [5+1]   17              STU     4,X
400F EF06   [5+1]   23              STU     6,X
4011 EF08   [5+1]   29              STU     8,X
4013 EF0A   [5+1]   35              STU     10,X
4015 EF0C   [5+1]   41              STU     12,X
4017 EF0E   [5+1]   47              STU     14,X        
4019 EF8810 [5+1]   53              STU     16,X
401C EF8812 [5+1]   59              STU     18,X
401F EF8814 [5+1]   65              STU     20,X
4022 EF8816 [5+1]   71              STU     22,X
4025 EF8818 [5+1]   77              STU     24,X
4028 EF881A [5+1]   83              STU     26,X
402B EF881C [5+1]   89              STU     28,X
402E EF881E [5+1]   95              STU     30,X
4031 3A     [3]     98              ABX             * Move forward to the next row
4032 4A     [2]     100             DECA
4033 26D4   [3]     103             BNE     <

One other tip from Darren Atkinson about the above indexing method is to use negative offsets where you can to keep the size of the code down, since offsets from -16 to 15 assemble to two-byte instructions.  Darren’s version is below:

Mem  Code  Cycles Running Total     Assembly Code (Mnemonics)
4000 8E2010 [3]     3               LDX    #$2000+16
4003 CE0000 [3]     6               LDU    #$0000
4006 CC0020 [3]     9               LDD    #$0020
* This loop is 103 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 103 = 26,368 CPU Cycles
4009 EF10   [5+1]   6           !   STU    -16,X
400B EF12   [5+1]   12              STU    -14,X
400D EF14   [5+1]   18              STU    -12,X
400F EF16   [5+1]   24              STU    -10,X
4011 EF18   [5+1]   30              STU    -8,X
4013 EF1A   [5+1]   36              STU    -6,X
4015 EF1C   [5+1]   42              STU    -4,X
4017 EF1E   [5+1]   48              STU    -2,X
4019 EF84   [5+0]   53              STU    ,X
401B EF02   [5+1]   59              STU    2,X
401D EF04   [5+1]   65              STU    4,X
401F EF06   [5+1]   71              STU    6,X
4021 EF08   [5+1]   77              STU    8,X
4023 EF0A   [5+1]   83              STU    10,X
4025 EF0C   [5+1]   89              STU    12,X
4027 EF0E   [5+1]   95              STU    14,X
4029 3A     [3]     98              ABX
402A 4A     [2]     100             DECA
402B 26DC   [3]     103             BNE    <

One last thing to note is that the more you unroll the code the faster it will be, at the expense of more RAM.  You just have to decide what is most important: RAM space or the speed of your code…
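To put rough numbers on that trade-off, here is a Python sketch that models the simple STU  ,X++ loop with a DECB/BNE at the end for different unroll factors.  It is a simplified model with made-up names: it assumes a flat 5 cycles of loop overhead per pass, even though a real loop needs a second counter once there are more than 256 passes.

```python
BYTES_TO_CLEAR = 0x2000
STU_POSTINC_CYCLES = 8      # STU ,X++ is [5+3]
LOOP_OVERHEAD_CYCLES = 5    # DECB [2] + BNE [3]


def unrolled_cycles(stores_per_pass):
    """Total cycles to clear the block with this many STU ,X++ per pass."""
    bytes_per_pass = 2 * stores_per_pass
    if BYTES_TO_CLEAR % bytes_per_pass:
        raise ValueError("unroll factor must divide the block evenly")
    passes = BYTES_TO_CLEAR // bytes_per_pass
    per_pass = stores_per_pass * STU_POSTINC_CYCLES + LOOP_OVERHEAD_CYCLES
    return passes * per_pass


def loop_bytes(stores_per_pass):
    """Size of the loop body: 2 bytes per STU ,X++ plus DECB and BNE."""
    return 2 * stores_per_pass + 3


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(f"{n:2d} stores/pass: {unrolled_cycles(n):6d} cycles, "
              f"{loop_bytes(n):3d} bytes of loop code")
```

Under this model, going from 16 to 32 stores per pass only saves another 640 cycles while nearly doubling the loop size, so the gains shrink quickly.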

The fastest method – This routine uses 17,920 CPU cycles:  It is fastest to use unrolled loops and push registers into RAM through a stack pointer instead of using store instructions.

Mem  Code   Cycles Running Total            Assembly Code (Mnemonics)
4073 CC0000   [3]                             LDD     #$0000
4076 8E0000   [3]                             LDX     #$0000
4079 3184     [4+0]                           LEAY    ,X
407B CE4000   [3]                             LDU     #$2000+$2000
* This loop is 70 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 70 = 17,920 CPU Cycles
407E 3636     [5+6]   11      !               PSHU    D,X,Y
4080 3636     [5+6]   22                      PSHU    D,X,Y
4082 3636     [5+6]   33                      PSHU    D,X,Y
4084 3636     [5+6]   44                      PSHU    D,X,Y
4086 3636     [5+6]   55                      PSHU    D,X,Y
4088 3606     [5+2]   62                      PSHU    D
408A 11832000 [5]     67                      CMPU    #$2000
408E 22EE     [3]     70                      BHI             <
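Since the stack grows downward, this routine clears from the top of the block to the bottom.  The Python sketch below simulates the loop above on a 64K byte array to check the pass count, the cycle total and that exactly $2000 to $3FFF gets cleared.  The function name stack_blast_clear is made up for this illustration.

```python
def stack_blast_clear(memory, start=0x2000, end=0x4000):
    """Simulate 5 x PSHU D,X,Y + PSHU D per pass with all registers zero.

    Returns (passes, cycles) for the loop only, not the setup.
    """
    u = end                      # LDU #$2000+$2000
    passes = 0
    cycles = 0

    def pshu(nbytes):
        nonlocal u, cycles
        u -= nbytes              # U moves down before the bytes land
        memory[u:u + nbytes] = bytes(nbytes)
        cycles += 5 + nbytes     # PSHU is 5 cycles plus 1 per byte pushed

    while True:
        for _ in range(5):
            pshu(6)              # PSHU D,X,Y
        pshu(2)                  # PSHU D
        passes += 1
        cycles += 5 + 3          # CMPU #$2000 + BHI
        if not u > start:        # BHI falls through once U reaches start
            return passes, cycles


if __name__ == "__main__":
    ram = bytearray(b"\xFF" * 0x10000)
    print(stack_blast_clear(ram))
    print(ram[0x2000:0x4000] == bytes(0x2000))
```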

I’ll go into the details of this method and more in Part 3 of this series of blogs.

Cheers,

Glen


Optimizing 6809 Assembly Code: Part 1 – Quick and Easy Changes to Speedup Your Code


A lot of the time assembly language programs are already fast enough.  But if you want to make yours faster or you are writing an arcade game then the following information will be helpful.  I’m always trying to improve my assembly language programming skills and I’ve been keeping some notes as I do more and more programming.  In the next few blogs I’ll share what I’ve learned and I hope this will be a useful guide for anyone who wants to make super fast 6809 assembly programs.

This first part will go over some of the quick and easy things you can do to speed up your program.  In fact if you have a favourite old 6809 computer game and you wanted to speed it up you could start with these changes.  Of course you would need to disassemble the game and make the changes that you can and  then re-assemble the code.  This would be a great learning experience.  Maybe I’ll make another blog showing how you could go about doing that.

Simply choosing to use certain instructions instead of others will speed up your program as shown below:

  • CMPX  is one byte shorter and one cycle faster than CMPY, CMPU and CMPD

If you have a lot of loops or counters and you use CMPX instead of the other CMP instructions, you save one CPU cycle every time that compare is done in the loop.   This could be a huge speed difference if you have large loops.


  • LDX & LDU are one byte shorter and one cycle faster than LDY.
  • STX & STU are likewise one byte shorter and one cycle faster than STY.

Again, if used inside loops the speed difference can be huge.


  • TFR  between 16 bit registers is slower than using LEA
   TFR    Y,U is 6 cycles and 2 bytes; this can be changed to
   LEAU   ,Y which is 4 cycles and 2 bytes

  • LEA with an 8 bit offset, or with A or B as the offset, takes 5 cycles, which is faster than LEA with a 16 bit offset or D (8 cycles), and the 8 bit offset form uses fewer bytes too.  This is the same for LEAX, LEAU, LEAY & LEAS.  Keep in mind that LEA does signed adds, so make sure your 8 bit values take that into account; for example, be careful changing from LEAU   D,U to  LEAU    B,U

Quick and easy things to change in existing code, use in the order from first to last:

  • BRA      – 3 cycles, 2 bytes
  • JMP       – 4 cycles, 3 bytes
  • LBRA    – 5 cycles, 3 bytes – Only use if you want your code to be relocatable

  • BSR       – 7 cycles, 2 bytes
  • JSR        – 8 cycles, 3 bytes
  • LBSR    – 9 cycles, 3 bytes – Only use if you want your code to be relocatable

Another way to optimize your code is to make the most of the way you use jumps or branches to subroutines.  If the second last instruction in your routine is a BSR, JSR or LBSR and your last instruction is an RTS you can change the BSR, JSR or LBSR to BRA, JMP or LBRA and remove your RTS command.  The last called routine will return for you.  For example:

        LDX     #$4000
        BSR     SAVEX        * 7 CPU cycles (2 bytes)
        RTS                  * 5 CPU cycles (1 byte)
                             * These two lines require 12 cycles and
                             * 3 bytes
        ...
SAVEX   STX     ,U++
        RTS

Can be changed to:

        LDX     #$4000
        BRA     SAVEX        * 3 CPU cycles and two bytes
        ...
SAVEX   STX ,U++
        RTS

This is a savings of 9 CPU cycles and 1 byte.


A common trick for routines that use the stack to save your registers and accumulators is to add PC to the PULS at the end of your routine instead of using the RTS command, as shown:

Code1   PSHS    D,X,Y
        LDD     #$0155
        STD     ,X++
        STD     ,Y++
        PULS    D,X,Y    * 5 + 6 CPU cycles, 2 bytes
        RTS              * 5 CPU cycles, 1 byte

Can be changed to:

Code1:  PSHS   D,X,Y
        LDD    #$0155
        STD    ,X++
        STD    ,Y++
        PULS   D,X,Y,PC  * 5 + 8 CPU cycles, 2 bytes
                         * Adding PC restores the Program counter
                         * which is saved on the stack when your
                         * routine was called with BSR,JSR or LBSR
                         * no RTS is needed.

This saves us 3 CPU cycles and 1 byte

Since we are talking about branching and its effect on the speed of your program: Steve Bamford makes a great point that you can make your program execute faster by thinking about your program flow and branching only to what is least likely.  This only matters if you have to do a long branch, as short branches always take the same time.  A long conditional branch that isn’t taken takes 5 cycles and one that is taken uses 6 cycles.  Sure, it’s just 1 cycle, but if it is in a crucial loop in your code it can make a big difference.

For example if A is most likely to be 1 then the code below will long branch to Ais1 most of the time which means the CPU must add the branch location to the PC and jump to it which adds a cycle to the execution of your program.

CheckA:
        CMPA    #1
        LBEQ     Ais1
AisNot1:
;       Code to handle A is not 1
        ...
Ais1:
;       Code to handle A is 1
        ...

The code would run faster if it was arranged as this:

CheckA:
        CMPA    #1
        LBNE     AisNot1 
Ais1:
;       Code to handle A is 1
        ...
AisNot1:
;       Code to handle A is not 1
        ...
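To see how much the arrangement matters, here is a small Python sketch of the average cost of one long conditional branch for a given chance of it being taken.  The function name and the 90% figure are only for illustration.

```python
LONG_BRANCH_TAKEN = 6       # cycles when the long branch is taken
LONG_BRANCH_NOT_TAKEN = 5   # cycles when it falls through


def average_branch_cycles(p_taken):
    """Expected cycles for a long conditional branch taken with chance p_taken."""
    if not 0.0 <= p_taken <= 1.0:
        raise ValueError("p_taken must be between 0 and 1")
    return p_taken * LONG_BRANCH_TAKEN + (1.0 - p_taken) * LONG_BRANCH_NOT_TAKEN


if __name__ == "__main__":
    p_a_is_1 = 0.9   # assume A is 1 in 90% of the calls
    print("LBEQ Ais1   :", average_branch_cycles(p_a_is_1))
    print("LBNE AisNot1:", average_branch_cycles(1.0 - p_a_is_1))
```

With A equal to 1 in 90% of the calls, the first arrangement averages about 5.9 cycles per branch and the second about 5.1.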

Sockmaster mentions a similar method in the comments below


If you need to load both A and B registers use LDD, for example:

    LDA   #$20     * 2 CPU cycles and 2 bytes
    LDB   #$55     * 2 CPU cycles and 2 bytes

Can be changed to:

    LDD   #$2055   * 3 CPU cycles and 3 bytes

Saves a cycle and a byte


Even more speed and space are saved when a pair of LDA and LDB (or STA and STB) instructions that access memory is changed to a single LDD (or STD), as in:

    LDA   $1F00    * 5 CPU cycles and 3 bytes
    LDB   $1F01    * 5 CPU cycles and 3 bytes

Can be changed to

    LDD   $1F00    * 6 CPU cycles and 3 bytes

This results in a savings of 4 CPU cycles and 3 bytes


Another way to speed up your code: if you load values from the same location in memory many times, load the address into a register and use that register as a pointer.  For example:

    LDA    $FF00   * 5 CPU Cycles, 3 bytes

Can be

    LDX   #$FF00    * 3 CPU Cycles, 3 bytes
    LDA   ,X        * 4 CPU Cycles, 2 bytes

If you are using LDA  in a loop then the second method will be faster, since the LDX is paid for only once and each LDA  ,X saves a cycle.
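The setup cost of the LDX has to be earned back, so it helps to know the break-even point.  This Python sketch uses the cycle and byte counts quoted above and assumes X is free to hold the pointer for the whole loop; the function names are made up for this illustration.

```python
def extended_cost(loads):
    """(cycles, bytes) for repeated LDA $FF00 using extended addressing."""
    return 5 * loads, 3 * loads


def pointer_cost(loads):
    """(cycles, bytes) for one LDX #$FF00 followed by repeated LDA ,X."""
    return 3 + 4 * loads, 3 + 2 * loads


def break_even_loads():
    """Smallest number of loads where the pointer version is faster."""
    loads = 1
    while pointer_cost(loads)[0] >= extended_cost(loads)[0]:
        loads += 1
    return loads


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 100):
        print(n, extended_cost(n), pointer_cost(n))
    print("pointer wins from", break_even_loads(), "loads")
```

The two are tied at three loads and the pointer version wins from the fourth load on.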


If you know of any more quick ways to save some CPU cycles please comment below and I will update this page with credit to you.

A great reference for cycle counts and the full 6809/6309 instruction set is Darren Atkinson’s Motorola 6809 and Hitachi 6309 Programmers Reference

Part 2 will cover topics that are a little more complex and show how to make them faster.

Stay tuned,

Glen


Sprite Compiler for the TRS-80 Color Computer 3

What are sprites anyways?

Sprites are little blocks of data that you move around the computer screen.  They are usually used for the characters that move around the screen in games, such as Mario or the barrels in Donkey Kong.  If you are making a game or a demo on the CoCo then you are left with the job of writing assembly code that moves your game data/characters around the screen using the CPU.  The CoCo doesn’t have hardware sprites so it’s all up to the processor.  So the faster we can move sprites around the better.  This might free up CPU cycles so you can add more sprites to your game or add sound.

There are many ways to move your character data around the screen on a CoCo:

  1. Where you load (LDD   ,X) the character data from memory and store (STD   ,U) the same data somewhere in video RAM.
  2. Faster is to stack blast.  This method is where you point the S stack pointer to the character data from memory and the U stack pointer to where in the video RAM you want your data to end up.  Then you do a PULS  D,X,Y  and PSHU   D,X,Y.  You also must keep track of the order of your data using this method and where in video RAM you are Pushing the data.
  3. Compiled sprites are even faster.  It is actual code that draws the character on screen directly.  The code is a lot of LDD or LDA instructions with a bunch of STD   ,X where you point X to the address you want your sprite to be in video RAM.
  4. Compiled/blasted sprites do the same as compiled sprites, but the sprite is drawn from the bottom up and whenever it’s possible instead of STD   ,X you do a PSHU  D,X,Y or similar.  Pushing is faster than storing.

What is a sprite compiler?

A sprite compiler takes your character data in the binary format that would be loaded and then stored on the screen and turns that data into assembly code that when executed writes the sprite data to the screen very fast.

Here is an example source picture and what the compiled sprite code looks like:

The tool produced the assembly code below for use on a 16 colour, 256 pixel wide screen.  When using the stack it’s best to work from the bottom row up.

**************************************************
* 00 00 00 00 FF FF 00 00 00 00 - 1 
* 00 00 0F FF FF FF FF 00 00 00 - 2 
* 00 0F EE FF FF FF FF F0 00 00 - 3 
* 00 0E 99 EF FF FF FF FF 00 00 - 4 
* 00 FE AA EF FF FF FF FF F0 00 - 5 
* 00 FE 99 EF FF FF FF FF F0 00 - 6 
* 0F FF FF FF FF FF FF FF EF 00 - 7 
* 0F FF FF FF FF FF FF FE EF 00 - 8 
* FF FF FF FF FF FF FF FE EF F0 - 9 
* FF FF FF FF FF FF FF EE EF F0 - 10 
* 0F FF FF FF FF FF FE EE EF 00 - 11 
* 0F FF FF FF FF FF EE DE EF 00 - 12 
* 00 FF FF FF FF FE ED EE F0 00 - 13 
* 00 FF FF FF FF EE DE EE F0 00 - 14 
* 00 0F FF FF EE EE EE EF 00 00 - 15 
* 00 00 FF EE EE EE EE F0 00 00 - 16 
* 00 00 0F FF EE EF FF 00 00 00 - 17 
* 00 00 00 00 FF F0 00 00 00 00 - 18 
**************************************************
_000056.raw:
        opt     c 
        opt     ct
        opt     cd
        opt     cc
* Row 17 
        LEAU    128*16+7-3079,U    * A=, B=, X=, Y=
* Row 17 and row 18 700
* 00 00 0F FF EE EF FF 00 00 00 
* 00 00 00 00 FF F0 00 00 00 00 
        LDB     126,U           * A=, B=, X=, Y=
        ORB     #$F0            * A=, B=, X=, Y=
        STB     126,U           * A=, B=XX, X=, Y=
        LDB     -5,U            * A=, B=XX, X=, Y=
        ORB     #$0F            * A=, B=XX, X=, Y=
        STB     -5,U            * A=, B=XX, X=, Y=
        LDD     #$FFEE          * A=, B=XX, X=, Y=
        STD     -4,U            * A=FF, B=EE, X=, Y=
        STA     -1,U            * A=FF, B=EE, X=, Y=
        STA     125,U           * A=FF, B=EE, X=, Y=
        LDB     #$EF            * A=FF, B=EE, X=, Y=
        STB     -2,U            * A=FF, B=EF, X=, Y=
* Row 16 800
* 00 00 FF EE EE EE EE F0 00 00 
        LEAU    -128*1,U        * A=FF, B=EF, X=, Y=
        LDA     ,U              * A=FF, B=EF, X=, Y=
        ORA     #$F0            * A=FF, B=EF, X=, Y=
        STA     ,U              * A=XX, B=EF, X=, Y=
        LDD     #$FFEE          * A=XX, B=EF, X=, Y=
        STD     -5,U            * A=FF, B=EE, X=, Y=
        STB     -3,U            * A=FF, B=EE, X=, Y=
        STB     -2,U            * A=FF, B=EE, X=, Y=
        STB     -1,U            * A=FF, B=EE, X=, Y=
* Row 15 800
* 00 0F FF FF EE EE EE EF 00 00 
        LEAU    -128*1+1,U      * A=FF, B=EE, X=, Y=
        LDB     -7,U            * A=FF, B=EE, X=, Y=
        ORB     #$0F            * A=FF, B=EE, X=, Y=
        STB     -7,U            * A=FF, B=XX, X=, Y=
        LDB     #$FF            * A=FF, B=FF, X=, Y=
        LDX     #$EEEE          * A=FF, B=FF, X=EEEE, Y=
        LDY     #$EEEF          * A=FF, B=FF, X=EEEE, Y=EEEF
        PSHU    D,X,Y           * A=FF, B=FF, X=EEEE, Y=EEEF
* Row 14 800
* 00 FF FF FF FF EE DE EE F0 00 
        LEAU    -128*1+6,U      * A=FF, B=FF, X=EEEE, Y=EEEF
        LDA     ,U              * A=FF, B=FF, X=EEEE, Y=EEEF
        ORA     #$F0            * A=FF, B=FF, X=EEEE, Y=EEEF
        STA     ,U              * A=XX, B=FF, X=EEEE, Y=EEEF
        LDA     #$FF            * A=FF, B=FF, X=EEEE, Y=EEEF
        LDX     #$FFEE          * A=FF, B=FF, X=FFEE, Y=EEEF
        LDY     #$DEEE          * A=FF, B=FF, X=FFEE, Y=DEEE
        PSHU    D,X,Y           * A=FF, B=FF, X=FFEE, Y=DEEE
        LDB     #$FF            * A=FF, B=FF, X=FFEE, Y=DEEE
        STB     -1,U            * A=FF, B=FF, X=FFEE, Y=DEEE
...

Why did you write such a tool?

When I was young there was a cool Amiga Demo from 1988 that I’ve always enjoyed watching, called Demons are Forever.  The demo has a bunch of sprites moving around the screen and changing from a sphere to a demon character and back.  It also has some cool music that plays in the background.  I thought it would be cool if this demo could be played on a CoCo 3 and I would learn a lot about smooth sprite movement so I started working on replicating this demo for the CoCo 3.

While compiling the many sprites by hand I decided that this would be easier if there was a tool to do this.  It wasn’t an easy program to write since creating a compressed and stack blasted sprite is pretty complicated.  I wanted to make sure the program made very optimized code so the sprites are rendered as fast as possible.  That’s the main reason for compiling sprites in the first place.  I ended up with a program that does a VERY good job creating compiled/stack blasted sprites from a raw binary file.  Looking through the assembly output that it creates I’ve been able to hand tweak a few things myself to make it a tiny bit faster.  But those changes were very minimal.  I figured since I found this tool useful maybe others could use it when they are working on their games and want their sprites to be as fast as possible on screen.

What format should the input file be?

The tool takes raw binary data, the same data your normal character would use to display on the CoCo 3 screen.  You need to save the binary data that represents your character to a file.  You will also need to know the width and height of your sprite to use it with the sprite compiler tool.

For example, my sprite above had a data file with the first four rows as:

* 00 00 00 00 FF FF 00 00 00 00 - 1 
* 00 00 0F FF FF FF FF 00 00 00 - 2 
* 00 0F EE FF FF FF FF F0 00 00 - 3 
* 00 0E 99 EF FF FF FF FF 00 00 - 4 

If you examined the input file with a hex editor the start of the file would look like this:

00 00 00 00 FF FF 00 00 00 00 00 00 0F FF FF FF FF 00 00 00 00 0F EE FF FF FF FF F0 00 00 00 0E 99 EF FF FF FF FF 00 00
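In other words the file is just the rows stored one after another with no header.  Here is a minimal Python sketch that splits such data back into rows; split_rows and format_row are made-up helper names, and the sample bytes are the 40 bytes shown above.

```python
SAMPLE = bytes.fromhex(
    "00 00 00 00 FF FF 00 00 00 00"
    "00 00 0F FF FF FF FF 00 00 00"
    "00 0F EE FF FF FF FF F0 00 00"
    "00 0E 99 EF FF FF FF FF 00 00"
)


def split_rows(data, width):
    """Split raw sprite data into rows of `width` bytes each."""
    if width <= 0 or len(data) % width:
        raise ValueError("data length is not a multiple of the row width")
    return [data[i:i + width] for i in range(0, len(data), width)]


def format_row(row):
    """Format one row the way the listing comments show it."""
    return " ".join(f"{byte:02X}" for byte in row)


if __name__ == "__main__":
    for number, row in enumerate(split_rows(SAMPLE, 10), start=1):
        print(f"* {format_row(row)} - {number}")
```

To use it on a real file you would read the bytes with open(name, "rb").read() and pass the same width you give the tool with -w.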

Things to consider before you draw your sprite character to be used with this sprite compiler:

  1. For transparency the tool always uses palette 0; any other value is treated as part of the sprite.
  2. If you can use palette number 15 = $F as the border/edge of your character then the compiled sprite will be drawn a little quicker.  The reason for this is a little hard to explain but here goes.  When the compiled sprite is being created and an edge of your sprite is one nibble in a byte and the other nibble is transparent (value of zero), the compiled code must load the byte from the screen, AND it with a mask ($F0 or $0F) to clear the nibble that will hold your palette, and then OR in the palette value.  If the compiler finds the palette is already $F then the AND instruction isn’t needed, because ORing with $F sets every bit of that nibble anyway.  Here is a little code to demonstrate this.

If the byte is $07, so the right nibble is using palette 7 and the left nibble is transparent, the code will look like this:

LDB     126,U   * Load B with whatever value is at this location
ANDB    #$F0    * AND the value so that we keep the transparent part 
ORB     #$07    * OR the value with our colour palette
STB     126,U   * Store the new value on the screen

If we use $0F for the edge colour/palette the code looks like this, and no AND instruction is needed:

LDB     126,U   * Load B with whatever value is at this location
ORB     #$0F    * OR the value with our colour palette
STB     126,U   * Store the new value on the screen
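The rule can be written down in a few lines.  This Python sketch is not the tool’s actual code; edge_byte_ops and apply_ops are made-up names that work out which AND/OR steps a half-transparent byte needs and check the result against a background byte.

```python
def edge_byte_ops(value):
    """Return the AND/OR steps for a byte with exactly one transparent nibble."""
    high, low = value & 0xF0, value & 0x0F
    if bool(high) == bool(low):
        raise ValueError("expected exactly one transparent (zero) nibble")
    keep_mask = 0xF0 if low else 0x0F        # background bits to preserve
    sprite_nibble_mask = 0xFF ^ keep_mask    # bits the sprite pixel occupies
    ops = []
    if value != sprite_nibble_mask:          # pixel is not $F: clear it first
        ops.append(("AND", keep_mask))
    ops.append(("OR", value))
    return ops


def apply_ops(background, ops):
    """Run the steps on a background byte, as load / AND / OR / store would."""
    result = background
    for op, operand in ops:
        result = result & operand if op == "AND" else result | operand
    return result


if __name__ == "__main__":
    for value in (0x07, 0x0F, 0x70, 0xF0):
        ops = edge_byte_ops(value)
        print(f"${value:02X}: {ops} -> ${apply_ops(0xA5, ops):02X}")
```

With a background byte of $A5, the $07 pixel needs both steps to give $A7, while the $0F pixel needs only the OR to give $AF.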

What are the options for this tool?

While writing the sprite compiler tool I needed it to do many things and it seemed like every day I was adding a new feature to it.  It currently has quite a lot of options as shown below.  The limitation right now is it only works with sprites that are for the CoCo 3, Hires screen with 16 colours.  In this graphics mode each pixel is one nibble and that is what my sprite compiler currently handles.  The compiler can handle transparencies down to the pixel (nibble) level.

Usage:
cmpblsp -wX -hY [-oswX -oshY -owX -ohY -aa] [-as] [-odd] [-sX]
 [-rXXXX] [-rvX] [-nSubroutine] [-c] [cyc] [-done] spritename
Where:
-w - X is the width in bytes for the width of the sprite
-h - Y is the number of rows for the height of the sprite
-osw - Offset used - width in bytes of the entire data file
-osh - Offset used - height of the entire data file
-ow - X offset, from the left side to where the sprite data starts
-oh - Y offset, from the top row to where the sprite data starts
-as - Autocenter, usually the top left corner of the sprite is used
      as the starting point of the sprite, this makes it the middle
-aa - Autocenter point is the middle of the entire data file
-odd - Also make an odd version of the sprite, by shifting the data
       one nibble to the right, outputs a separate _odd.asm file
-s - Sets the screen width the sprite will be used with
     (256 or 320) defaults to 320
-r - Creates a restore sprite that loads the data behind the sprite
     from address offset XXXX which is a 4 digit hex number
-rv - Creates a restore sprite that stores the value X which is a hex
      value between 0 to F. This value is substituted wherever any
      non Zero byte is in the sprite data. This is useful if the
      background behind your sprite is always a known palette, such
      as black. This is faster then having to load the background.
-n - Name of the subroutine that is created, if not used the routine
     name created will be _spritename:
-c - Add comments that shows the values of the registers
-cyc - Add cycle count instructions for lwasm
-done - When the program is complete it wont wait for you to press
        a key, the program will exit and close this window
        (useful for scripts)
sprite - This is the name of the sprite to compile
         Output name will be sprite with .asm extension added

Besides creating a compiled/blasted sprite at the pixel level, the tool can also create a pixel-shifted version of the sprite, which allows for smooth motion on the screen.  For example, say your sprite is drawn starting at pixel 100,100, or video RAM address $2440 (as an example), and you want it to move one pixel to the right.  You can’t just point your compiled sprite at $2441, because each byte holds two pixels (nibbles), so that address is actually two pixels to the right of where the last sprite was drawn.  You need to draw a different sprite, since the compiled sprite’s pixel data is only laid out for even pixel positions.  The -odd option creates a second compiled sprite, shifted one nibble to the right, in a file with the same name and _odd.asm on the end of the filename.
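To make the nibble shift concrete, here is a rough Python sketch of what shifting one row of sprite data one pixel to the right could look like.  It is only an illustration, not the code from the sprite compiler.  It assumes two pixels per byte with the left pixel in the high nibble, and that nibble 0 can be shifted in as the "empty" pixel; the function name is made up for this example.

```python
def shift_row_right_one_nibble(row, fill=0x0):
    """Shift a row of packed 4-bit pixels one pixel (nibble) to the right.

    row  : bytes, two pixels per byte, high nibble assumed to be the left pixel
    fill : nibble shifted in on the left (0 is an assumption for "empty")
    Returns a row one byte longer so the last pixel is not lost.
    """
    out = bytearray()
    carry = fill & 0x0F
    for value in row:
        # The previous byte's right pixel becomes this byte's left pixel.
        out.append((carry << 4) | (value >> 4))
        carry = value & 0x0F
    # The final right pixel spills into an extra byte.
    out.append((carry << 4) | (fill & 0x0F))
    return bytes(out)
```

Note that the shifted row is one byte wider than the original, which is one reason the odd sprite has to be compiled separately.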

Another thing the tool can do is create two different types of restore compiled sprites.  These can be used to restore the data behind a sprite as it moves across the screen.  The first and easiest restore sprite is useful if your background is all the same colour palette, for example all blue.  If the blue background palette is hex C then you would use the option -rvC.  The other type of restore sprite that the tool can make is a little more complicated.

Here is how it works:

If you have a complicated background graphics screen (more than one colour) you can keep a second copy of the background somewhere else in RAM and copy that image data back over the area where your sprite was previously drawn.  Use the option -rXXXX, where XXXX is a hex number giving the offset from the screen to where the backup screen is stored in memory.  If your backup screen is stored $6000 bytes forward in RAM then you would select the option -r6000.
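The effect of a -r6000 restore sprite can be modelled in a few lines of Python.  This is only a sketch of the idea, not what the tool generates (the real output is 6809 load and store instructions); the function name and the list of offsets are hypothetical.

```python
def restore_background(ram, screen_addr, offsets, backup_offset=0x6000):
    """Copy saved background bytes back over the bytes a sprite covered.

    ram           : bytearray standing in for CoCo memory
    screen_addr   : address where the sprite was drawn
    offsets       : byte offsets from screen_addr that the sprite touched
    backup_offset : distance from the screen to its backup copy (as in -r6000)
    """
    for off in offsets:
        ram[screen_addr + off] = ram[screen_addr + off + backup_offset]
```

Only the bytes the sprite actually touched are copied back, which is what keeps a restore sprite cheaper than redrawing a whole rectangle.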

The other advanced thing the tool does is allow you to select or crop the area that you want to use from your source data file.  This is useful if the data you have captured is bigger than the sprite.  You can set what parts of the data file are used for actual sprite data and make a compiled sprite only from that part.  For example, if you have some sprite data that is 21 bytes wide and 19 rows high but your character is only 5 bytes wide and 7 rows high and starts near the middle of the data, then you would use the following options.  -w5 -h7 -osw21 -osh19 -ow10 -oh9
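Here is a small Python sketch of how that kind of crop can be worked out.  It assumes the raw data file is stored row by row with no header, which is an assumption about the file layout; the function and parameter names are invented for the example.

```python
def crop_sprite(data, file_width, file_height, width, height, x_off, y_off):
    """Return the rows of a sub-rectangle cut from raw sprite data.

    Widths and x offsets are in bytes, heights and y offsets in rows,
    matching -osw/-osh, -w/-h and -ow/-oh respectively.
    """
    if len(data) != file_width * file_height:
        raise ValueError("data size does not match file_width * file_height")
    if x_off + width > file_width or y_off + height > file_height:
        raise ValueError("crop rectangle falls outside the data file")
    rows = []
    for row in range(y_off, y_off + height):
        start = row * file_width + x_off
        rows.append(data[start:start + width])
    return rows
```

With the options from the example above, the call would be crop_sprite(data, 21, 19, 5, 7, 10, 9), and the first byte used is at offset 9 × 21 + 10 = 199 in the data file.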

Remember that the tool defaults to creating sprites for a 320 pixel wide screen.  If you need to make a sprite for a 256 pixel wide screen use the -s256 option.

What language is this program written in and how can I get it for my computer?

I wrote the program using QB64 which is a modern version of QBASIC that runs on Windows, Linux and Mac.  It is very similar to Color Basic for the CoCo.  If you’ve done programming on the CoCo in BASIC then I think you will feel right at home using QB64.  You can download QB64 from the link here.

Once QB64 is installed you can simply copy the text version of my program from my pastebin account here:

https://pastebin.com/h9s9eRn7

and paste it into QB64 and create an executable version of the program to use on your computer.  The program isn’t copyrighted, do whatever you want with it.  The code is quite messy but it gets the job done.  I just hope it helps others to create more cool games for the CoCo!

I think a useful improvement would be to add support for importing the picture data from standard file formats like BMP, GIF, PNG and TIFF, and maybe some of the standard graphic formats that are native to the CoCo 3 like Color Max files.  That would be better than the way it currently works, which requires a raw sprite data file.  QB64 does have built in graphic file import routines for many common formats so it might not be too hard to add.  The problem is that you must be sure to keep the palette info intact when importing a file from a standard image format.  Feel free to improve the code…

I’ve only used the sprite compiler for my own limited use and I think it is ready for release but there might still be some bugs in it.  If you find any bugs please post the details in the comments below.

I don’t currently have any plans to add other screen resolutions to the sprite compiler.  I might go back and revisit Space Invaders and make it work without the need to rotate your screen, and then I might have a chance to add to the sprite compiler.  That is way in the future though…

Have fun,

Glen

 


How to make PMODE 4 CSM video files for the CoCo (TRS-80 Color Computer)

Hi All,

Using some awesome free tools, a bash script and a little C program I wrote, it’s possible to make PMODE 4 colour videos that play back at 23.3 fps with sound and have four colours: black, white, and the two artifact colours blue and red/orange.  Below I’m going to explain how this all works and how you can make your own.  As an example of what you can expect from this video conversion I’ve uploaded two YouTube videos showing the player on a CoCo 1.  The videos can be found here and here.  Don’t mind the mess on the floor; I usually have my CoCo 3 hooked up but wanted to show this on an actual CoCo model 1 with a 6309 CPU and 64k of RAM.

First we use a fairly recent version of FFMPEG, which is the most amazing video and audio conversion tool there is.  It will take just about any video or audio file as input and convert with filters and effects to many other formats.

The script uses FFMPEG to set the frames per second to 23.3, which the CSM player requires.  It also scales the input video width and height so that the aspect ratio is correct for the CoCo PMODE 4 screen.  The format that FFMPEG outputs is 256×192 and it will automatically generate black bars on the screen if needed.  It outputs the scaled video as individual still frames/images and numbers them as it produces them.  The images are stored in a folder called pics and are LZW-compressed TIFF images.
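The scaling and black bar arithmetic can be sketched in Python as follows.  This is not the FFMPEG filter the script uses (that is not shown here); it is just the general "fit inside 256×192 and pad the rest" calculation, assuming square pixels, with a made-up function name.

```python
def fit_with_bars(src_w, src_h, dst_w=256, dst_h=192):
    """Scale (src_w, src_h) to fit inside dst_w x dst_h, keeping aspect ratio.

    Returns (scaled_w, scaled_h, pad_x, pad_y), where pad_x and pad_y are
    the black bar sizes on the left and on the top.
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    scaled_w = min(dst_w, round(src_w * scale))
    scaled_h = min(dst_h, round(src_h * scale))
    return scaled_w, scaled_h, (dst_w - scaled_w) // 2, (dst_h - scaled_h) // 2
```

For a 1920×1080 source this gives a 256×144 picture with 24 rows of black above and below, while a 640×480 source fills the screen with no bars.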

Next the script uses a picture processing tool called ImageMagick.  ImageMagick is used to do the following things:

  1. Turns yellow pixels to white (ivory)
  2. Resizes the pictures to 128×192 (necessary for making artifacts)
  3. Handles picture levels (black, white and gamma controls)
  4. Normalizes the pictures as the pictures need to be fairly bright or they are hard to make out on the CoCo screen
  5. Remaps the colours of the original image to the black, white (blue, red/orange) colours.  It does this by looking at the palette of a 4 colour GIF file that represents the colours the CoCo 1 can display on a PMODE 4 screen
  6. Dithers that image to make it look like there are more colours on the screen
  7. Flips the image vertically and saves it as an uncompressed 16 colour bitmap (BMP) file.  The BMP format saves every picture upside down, so I flip it before saving it so the data in the file is ready to be used as it is.

Next we use FFMPEG again to process the audio of the source video.  At this point we use FFMPEG to convert the audio of the movie file to a 1 channel, 11932 samples/second, unsigned 8 bit audio file.  FFMPEG could have been used to re-encode the video to stills and the audio at the same time, but I like to do it separately so that it’s easier on the hard drive.  Otherwise you will be reading from the source video, writing the audio and generating the stills all at the same time.

At this point we use FFMPEG to take the still images and the 8 bit unsigned audio and convert them to a test.mp4 file that can be viewed on your regular computer.  This is a very good representation of how the video will look on the CoCo.  Watching the test.mp4 you can tell if you need to make the output video brighter or not.  This can be done using the -b option with the conversion script.

Next my little C program is executed.  It takes the BMP stills and goes through each pixel, converting it to a pair of bits: black (00), orange (01), blue (10) or white (11).  It puts four of these pairs together so it has a byte of data and stores that in the buffer.  It does that for the entire picture and then muxes the audio and new video data together in the format that Ed Snider’s player will accept.
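The packing step is easy to show in Python.  This is a sketch of the idea rather than a port of the C program; it assumes the leftmost pixel goes into the top two bits of each byte, and the names are made up for the example.

```python
# Two-bit codes quoted in the text above.
BLACK, ORANGE, BLUE, WHITE = 0b00, 0b01, 0b10, 0b11

def pack_pixels(codes):
    """Pack two-bit pixel codes into bytes, four per byte, leftmost first."""
    if len(codes) % 4:
        raise ValueError("need a multiple of four pixels")
    packed = bytearray()
    for i in range(0, len(codes), 4):
        value = 0
        for code in codes[i:i + 4]:
            value = (value << 2) | (code & 0b11)
        packed.append(value)
    return bytes(packed)
```

A 128 pixel wide row packs down to 32 bytes, which matches the 32 bytes per row of a PMODE 4 screen.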

The last step is to join the CSM header file with the CSM PMODE 4 player code and the muxed audio/video file into one new .CSM file that is ready to be copied to an SD card and played back on the CoCo.

How do I make my own CSM videos?

You must install FFMPEG and ImageMagick on your computer.  If you are on a Mac the easiest way to install the command line tools is using homebrew found here.  Once homebrew is installed on your Mac type brew install ffmpeg and brew install imagemagick to install.  The BASH script and C program I wrote are ready to be used if you have a Mac.

It should work without issues on a Linux box and Cygwin/Mingw on a Windows box.  You will have to compile the program yourself and make sure to install FFMPEG and ImageMagick using your package manager.

Ron Klein is working on ready to go versions for Linux, RPI3 & Windows.  They will be available soon from the link below.

Ed Snider is hosting the files with his other CoCo SDC Media player software.  The Mac version is already available to use and can be found here:

https://www.mediafire.com/folder/20xt2l2k0160i/CoCo_SDC_Media_Player

In a folder called Tools for making CSM files.

Uncompress the .ZIP file and, using the command line in the makecsm folder, type:

./makecsm.sh -h

This will give you a summary of the options available for the conversion.  Below is the help you will see:

makecsm – CoCoSDC CSM Video Maker v 1.00
Usage: [-s hh:mm:ss] [-e hh:mm:ss] [-d seconds] [-n] [-h] [-t] [-b 0.00 to 10.00] inputfile [outputfile[.CSM]]

option: -s hh:mm:ss is the time in the video to start conversion
-e hh:mm:ss is the time in the video to end conversion
-d duration in seconds
-n means no artifact colour, make a black and white video
-b x.xx sets the brightness level of the video (1.00 is default)
-h Prints this help message
-t normally a test.mp4 file is created so you can see
the resulting movie on this computer before copying it to the CoCoSDC
this option will disable the creation of a test.mp4 movie

Example: To make a movie from the source movie called mymovie.mkv
Starting at 53 seconds into the video and ending at 1 minute and 30 seconds.
The duration of the video will be 37 seconds. Brighten the video a little
and use the output filename COCOVID.CSM
Command would be: makecsm.sh -s 00:00:53 -e 00:01:30 -b 1.1 mymovie.mkv COCOVID.CSM

If no start time or end time is given then the entire video will be converted
If only the start time is given then the conversion will start at the given time
and it will convert the video until the end of the video.
If only the end time is given then the conversion will start from the beginning
of the video up to the end time given.
If no output filename is given then an output file will be created in the
current folder with the extension .CSM

The output filename must be uppercase and be a maximum of 8 characters long,
with the extension .CSM or the CoCoSDC player will not recognize it.

A few little notes on the options: you can use -d 100 (or any number of seconds for the duration) without the -s hh:mm:ss option and it will create a video from the start of the source video lasting that number of seconds (100 seconds in this example).
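If you want to double check the start, end and duration numbers before running a long conversion, the arithmetic is simple.  Here is a small Python sketch (the function names are made up; the script itself does not need this):

```python
def to_seconds(hhmmss):
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def clip_duration(start, end):
    """Length in seconds of a clip running from start to end."""
    return to_seconds(end) - to_seconds(start)
```

Using the times from the help example, clip_duration("00:00:53", "00:01:30") works out to the 37 seconds quoted there.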

You can make a black and white video without artifacts using the -n option.

One last thing to note

If your input or output video filenames have spaces then the script will probably fail.  It will be less troublesome if you move the source videos into the makecsm folder.
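The filename rules from the help text can be checked up front with something like the Python sketch below.  It reads the 8 character limit as applying to the part before .CSM, which is an assumption, and the function name is invented for the example.

```python
def is_valid_csm_name(filename):
    """Check an output name: uppercase, at most 8 characters, .CSM, no spaces."""
    name, dot, ext = filename.rpartition(".")
    return (
        dot == "."
        and ext == "CSM"
        and 1 <= len(name) <= 8
        and name == name.upper()
        and " " not in name
    )
```

So COCOVID.CSM passes, while cocovid.CSM, MY VID.CSM and TOOLONGNAME.CSM would all be rejected.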

How to improve the video quality

The makecsm.sh script is pretty straightforward.  Other than FFMPEG creating the images at the correct size, all of the image processing is done with ImageMagick’s convert command.  If you look up help on ImageMagick there are tons of options.  Maybe going through these options you can find better settings to improve the output quality of the artifact colours.  It’s really hard to get yellows and greens on the CoCo screen so these should probably be converted to grey or white.  The current script converts yellow to ivory, which is better than converting it to red.  Feel free to tweak the command and if you come up with a really good setting please post it below in the comments.

Last little artifact problem to deal with

When the CoCo 1 or 2 is turned on there is no way to know if the even bits of a PMODE 4 screen make a blue colour or if it’s the odd bits that make the blue colour.  So I wrote a little BASIC program that I placed on Ed Snider’s PLAY.DSK image.  I called the program GO.BAS and when it is run it fills the screen with the red/orange artifact colour.  The program then asks if the picture is blue; if so, hit reset and start the program again.  If the screen is orange/red then you press a key and it starts Ed’s CSM player.  A copy of GO.BAS is included with the script.  It will need to be copied to Ed’s PLAY.DSK image with imgtool, toolshed or a similar utility.

10 CLS
20 PMODE4,1:PCLS:SCREEN1,1
30 FOR X=1 TO 255 STEP 2
40 LINE(X,0)-(X,191),PSET
50 NEXTX
60 PRINT"IF THE SCREEN TURNED BLUE THEN PRESS RESET AND RUN THE PROGRAM AGAIN":PRINT
70 PRINT"IF THE SCREEN TURNED ORANGE/RED THEN PRESS ANY KEY TO START THE SDCM PLAYER"
80 I$=INKEY$
90 I$=INKEY$:IF I$="" THEN 90
100 LOADM"SDCM":EXEC&H5800
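To see what the FOR loop in GO.BAS leaves in video memory, here is a Python sketch that builds one screen row with every odd column set.  It assumes PMODE 4 stores one bit per pixel with bit 7 as the leftmost pixel and that PSET writes a 1 bit; the function name is made up.

```python
def pmode4_row(set_columns, width=256):
    """Build one PMODE 4 row (1 bit per pixel) with the given columns set."""
    row = bytearray(width // 8)
    for x in set_columns:
        row[x // 8] |= 0x80 >> (x % 8)  # assumption: bit 7 is the leftmost pixel
    return bytes(row)

# The same X values as the FOR loop in line 30 of GO.BAS.
odd_columns = range(1, 256, 2)
row = pmode4_row(odd_columns)
```

Every byte of the row comes out as $55 (binary 01010101).  Whether that pattern shows up as blue or as orange/red is the part that depends on how the machine powered up, which is why the program has to ask.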

Have Fun,

Glen


Zilog z80 to Motorola 6809 Transcode – Part 025 – My z80 to 6809 program

To end this series on transcoding the z80 code to the 6809 I thought I should include my C program called z80_to_6809_15_Pacman.c, which is what I used to help with the transcode.  It takes a z80 disassembly as input and outputs what it thinks is a compatible 6809 instruction in place.  It keeps the z80 source code to the right, which makes it easier when you are manually going through the code.  It’s not a very complicated program but it gets the job done.  The formatting of the text input file must have the correct spacing, which you will have to play with if you want to use the program for your own projects.

You can find it here.

I hope these posts were helpful for anyone interested in the CoCo 3 or transcoding.

Cheers,

Glen


Zilog z80 to Motorola 6809 Transcode – Part 024 – PAC MAN is finally complete, if you have a CoCo 3 with 512k give it a try…

Hello, well I’ve finally completed my translation from the z80 arcade version of PAC MAN to the 6809 for the CoCo 3.  If you want to play it download version 1.01 here (Update – this version no longer includes any executable files* See note below).  This newer version moves the palette changing routine that makes the power pills flash into the Vblank IRQ, which gets rid of a little flicker caused by a glitch in the GIME chip while the game is running.  Thanks to Nickolas Marentes for letting me know about the GIME chip glitch and also how to get around it.

The upload includes a user guide with instructions on how to copy the PACMAN.5E ROM onto the .DSK image so you can legally use the game.  This is similar to using MAME games and needing the rights to use the ROMs and play the games.

The user guide also explains what all the settings are in the option screen.

The .zip also has the 6809 assembly language source code files so others can hopefully play with them and learn from them.

Have fun,

Glen

* The updated version linked above previously included a compiled version of slz.c for both Windows and Mac.  SLZ is used when you assemble the source code into a binary file; it compresses the binary to a file size that will still fit on the CoCo disk.  I saw a post on Facebook that said the .ZIP contained a virus.  I don’t own a Windows PC so I really don’t know if there is a virus within the ZIP.  I had to use someone else’s Windows machine to compile the slz.exe file so it’s possible that it had a virus.

To play it safe I removed the Windows version and the Mac version.  The .ZIP no longer includes any executable files.  The original slz.c file is still included but you will have to compile it yourself to use it.  I hope this didn’t affect anyone.


CoCo (6809) Assembly on a modern computer

This article is a guide for anyone who is thinking about learning 6809 assembly language programming or wants to use newer tools for doing 6809 assembly for the Tandy Color Computer.  It’s not an assembly tutorial, it’s an explanation of how to use some of the modern tools to help write and debug assembly language programs for the CoCo.

I got back into assembly programming on the CoCo about nine months ago and I found that the modern tools available today make it easier and faster to learn assembly language programming.  Back in the 1980’s it used to be a long, slow process of assembling your program with EDTASM+, saving it, running it and debugging it.  Using MAME and lwasm you can assemble your program in a second and view the assembled code with extra information about how many cycles each instruction takes, which is vital when you want to optimize your code for the most speed or the smallest size.  LWTOOLS, which includes lwasm, is an amazing 6809/6309 assembler that is completely free.  It’s written by William Astle, who is on the CoCo mailing list.  LWTOOLS can be compiled for Mac, Linux and Windows.

My favourite emulator is MAME.  It’s been around for a long time emulating arcade machines and the CPU emulation has been tested in many different scenarios.  There was a branch of MAME called MESS that took the same code and used it only for computer emulation, but for a few years now MESS has been merged back into the main MAME code, so MAME includes both the arcade emulation and the computer emulation.  MAME is cross platform and is still being heavily developed.  MAME also has a special debug mode that lets you step through your program and see how it is running step by step, which is a fantastic testing and learning tool.  MAME can output the code it is executing to a text file when using the trace command.  MAME also has something special called watchpoints, which allow you to set up locations in memory that will halt your program if the locations are written to or read from, or even only if they are changed to a specific value!  Super useful for debugging…  Anyways, enough of a sales pitch, I figure you are reading this because you want to set up your assembly environment.

First you need to install LWTOOLS and MAME on your computer. Also this is already done for you if you want to use a Raspberry Pi 3 with Ron Klein’s excellent SD image. You just need to add the CoCo roms and you’re ready to go…

You can compile both MAME and LWTOOLS yourself or download ready-to-use versions.  I use a Mac myself, and the quickest and easiest way to get MAME, lwtools and tons of other utilities installed is by using Homebrew.

Once brew is installed on your system it’s as simple as typing in these two commands:

$ brew install mame

$ brew install lwtools

You can probably find similar easy installs of both programs on Linux using apt-get or similar.  I’m sure there are tons of ways to get these programs on Windows machines.  Also for Windows you might want to use Cygwin, which gives you a Unix-like environment.

Once they are both installed you should create a directory where you will keep all your 6809 assembly source files.  Let’s call it CoCoAssembly.  This same folder will be where you assemble your program and run MAME from.  In this directory you will need a subfolder called roms with the CoCo roms of the different CoCos you want to emulate/test your code on.

Here is a list of all the CoCo roms that can/should go into the roms folder (don’t ask me where to get them):

bas12.rom
bas13.rom
coco2b.zip
coco2.zip
coco3dw1.zip
coco3_hdb1.zip
coco3h.zip
coco3p.zip
coco3.zip
cocoe.zip
coco_fdc_v11.zip
coco_fdc.zip
coco.zip
disk10.rom
disk11.rom
extbas11.rom
hdbdw3bc3.rom
hdbdw3bck.rom
hdbdw3becker.rom
hdbdw3cc2.rom
HDBSDC.ROM
mc10.zip
RGBDOS2HD.ROM
yados.rom

In your CoCoAssembly folder you should see a sub folder called roms where you have the above roms copied. From this CoCoAssembly folder type the following to test if your MAME is installed properly.

mame coco3 -window

Or to set the uimodekey to F12 use this to start MAME:

mame coco3 -window -uimodekey F12

To exit the MAME emulator, first hit the keyboard emulation mode key.  On a Mac it’s the delete key (left of the end key).  Some laptops don’t have that delete key, so use the F12 command line shown above.  On Linux and Windows the key is ScrLk.  You want to set this mode to partial, then press the Esc key to exit.

The complete installation of MAME includes some nice tools.  The most useful for us is called imgtool, which is used to create and manipulate our CoCo disk image files or .dsk files.  If you don’t like imgtool you can use another image handling tool called toolshed.  I will be using imgtool below.

As per Glenn Parker’s comment below, creating a .dsk image with imgtool didn’t work for him using MAME v0.190, but when he created a DMK image file it all worked.  So if you are having trouble with .dsk images, maybe try using “imgtool create coco_dmk_rsdos Disk1.dmk”.  Of course this means you will have to substitute .dmk for .dsk in the lines below…

Create a new blank .DSK disk image that we can copy our program to with the following command:

imgtool create coco_jvc_rsdos Disk1.dsk

Now that we have a blank disk, let’s assemble a program and copy it onto this image file.  In your favourite editor type in the following short 6809 assembly program:

        ORG $4000
Start:
        PSHS   A,B
        LDA    #'H
        LDB    #'I
        STD    $500
        PULS   A,B,PC
        END    Start

Save the program as mycode.asm

This is the LWTOOLS command I use to assemble my 6809 source code:

lwasm -9bl -p cd -oNEW.BIN mycode.asm

You should see:

            ( mycode.asm):00001 ORG $4000
4000        ( mycode.asm):00002 Start:
4000 3406   ( mycode.asm):00003 [5+2] PSHS A,B
4002 8648   ( mycode.asm):00004 [2] LDA #'H
4004 C649   ( mycode.asm):00005 [2] LDB #'I
4006 FD0500 ( mycode.asm):00006 [6] STD $500
4009 3586   ( mycode.asm):00007 [5+4] PULS A,B,PC
            ( mycode.asm):00008 END Start

In the lwasm output there are numbers in the square brackets, these numbers are the CPU cycles used for each line of code.  This can be helpful if you want to figure out how to optimize your assembly code for the max speed or best size as there are many tricks to speeding up code at the cost of size and vice versa.
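If you want a quick total for a short routine, the bracketed numbers can be added up with a few lines of Python.  This is just a sketch that handles the two forms seen in the listing above, [6] and [5+2]; lwasm can also keep a running total for you with its own cycle count options, so treat this as an illustration only.

```python
import re

# Lines copied from the listing above.
LISTING = """\
4000 3406   ( mycode.asm):00003 [5+2] PSHS A,B
4002 8648   ( mycode.asm):00004 [2] LDA #'H
4004 C649   ( mycode.asm):00005 [2] LDB #'I
4006 FD0500 ( mycode.asm):00006 [6] STD $500
4009 3586   ( mycode.asm):00007 [5+4] PULS A,B,PC
"""

def total_cycles(listing):
    """Add up every bracketed cycle count such as [6] or [5+2]."""
    total = 0
    for match in re.finditer(r"\[(\d+(?:\+\d+)*)\]", listing):
        total += sum(int(part) for part in match.group(1).split("+"))
    return total
```

For the little HI program this adds up to 7 + 2 + 2 + 6 + 9 = 26 cycles.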

If you want to capture the assembly output to a file called listing.txt use this command. It’s useful to keep the output file to refer back to when debugging your code since it will have the addresses of the code in memory of the instructions…

lwasm -9bl -p cd -oNEW.BIN mycode.asm > listing.txt

The options in the above command tell lwasm to generate our output code as an RSDOS “LOADM” compatible 6809 program.
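For the curious, a LOADM file is just the machine code wrapped in a small header and trailer.  The Python sketch below builds a single-segment file by hand from the object bytes in the listing.  It follows the Disk BASIC binary layout as I understand it (a $00 preamble with length and load address, then a $FF postamble with the exec address); it is not lwasm’s code, and the function name is made up.

```python
def decb_binary(load_addr, code, exec_addr):
    """Wrap machine code in a single-segment Disk BASIC LOADM file."""
    # Preamble: $00, 16-bit length, 16-bit load address (big-endian).
    preamble = bytes([0x00]) + len(code).to_bytes(2, "big") + load_addr.to_bytes(2, "big")
    # Postamble: $FF, $0000, 16-bit exec address.
    postamble = bytes([0xFF, 0x00, 0x00]) + exec_addr.to_bytes(2, "big")
    return preamble + code + postamble

# Object bytes copied from the listing above.
CODE = bytes.fromhex("34068648C649FD05003586")
image = decb_binary(0x4000, CODE, 0x4000)
```

The 11 bytes of code end up as a 21 byte file, and the exec address in the trailer is what lets you type EXEC with no address after a LOADM.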

Once you have your program assembled OK as NEW.BIN you have to transfer it to the .DSK image so it can be run with the emulator.  We use imgtool for this.

imgtool put coco_jvc_rsdos Disk1.dsk NEW.BIN TEST.BIN

The above command tells imgtool to put (copy) the file NEW.BIN into the disk image file called Disk1.dsk, using the CoCo RSDOS format coco_jvc_rsdos, and save the file on the disk with the name TEST.BIN.

Another useful feature of imgtool is deleting files from an image.  To delete the file TEST.BIN from the .dsk file use the following:

imgtool del coco_jvc_rsdos Disk1.dsk TEST.BIN

Imgtool also has many more features.  Type imgtool without any options to see them all.

Let’s test and debug our program using MAME:

mame coco3 -window -debug -flop1 Disk1.dsk

This starts MAME up in its debugger mode and you will see the following window, or similar with Windows and Linux.

It’s important to note that the mame debugger always uses hex values for its input and output as a default.

You can see the pink highlighted line at address 8C1B; this is the line that is about to be executed.  This is where the CoCo 3 first starts when you power on your computer.  On the top left is cycles (which counts the CPU cycles) and beamx, which is where the beam of the picture tube is currently being drawn in the x direction.  Beamy is the row that is being drawn on the screen at this moment in time.  Flags shows the flags that are currently set in the CC (condition code) register of the 6809 CPU.

  • PC is the program counter and shows us the address of the instruction that will be executed next.
  • S is the current stack pointer location
  • CC again is the condition code register but this time shown as a hex number.
  • DP is the Direct Page value
  • A is the A accumulator’s value
  • B is the B accumulator’s value
  • D is the A & B accumulators’ values together as a 16 bit value
  • X is the X register’s value
  • Y is the Y register’s value
  • U is the U register’s value
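The relationship between A, B and D is worth spelling out: on the 6809, A is the high byte of D and B is the low byte.  A tiny Python sketch (function names made up for the example):

```python
def d_register(a, b):
    """Combine the 8-bit A and B accumulators into the 16-bit D value."""
    return ((a & 0xFF) << 8) | (b & 0xFF)

def split_d(d):
    """Split a 16-bit D value back into (A, B)."""
    return (d >> 8) & 0xFF, d & 0xFF
```

So with A = $48 and B = $49 the debugger will show D as $4849, and changing either accumulator changes D as well.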

At the bottom of this window is a command line area where you can type commands for the debugger to execute, such as setting watchpoints or using the trace function (another cool feature of the debugger).  You can get a lot of help from the debugger itself by typing the word help.  Let’s set up a watchpoint at $4000, which is the address where our little test program is going to be loaded and executed.  In the command line type the following:

wp 4000,1,w

This command sets a watchpoint at address $4000 that is 1 byte long and will stop the execution of the processor when there is a write operation at this address.  We could have made the watchpoint cover many bytes, and check for both reads and writes with the rw option or just reads with the r option.

Now press F5 to make the debugger continue with execution. You’re thinking why did it stop? Disk Basic hasn’t even started yet? The reason it stopped is because Disk Basic is setting up the memory and it did a write instruction at address $4000. In the debug window it shows a message Stopped at watchpoint 1 writing byte to 00004000 (PC=C033) (data=32)

This is telling us that code at address C033 (Disk ROM address) wrote the byte 32 to address $4000 and since our watchpoint is set to stop code at this point it stopped so that we can now look at the code. We don’t really want to get into all the things RSDOS is doing as it boots up so let’s hit F5 again.

Now you should see the familiar RSDOS OK prompt. So let’s make sure our disk image is being used by mame. Type the DIR command in RSDOS and you should see the TEST.BIN as per the picture below. If you got this far things are looking good.

Let’s load our program: type LOADM"TEST" and hit Enter.

Our debugger stopped the code again as RSDOS loaded your program into memory at address $4000. That’s good, hit F5 once again to let it finish the loadm command.

The next thing we want to do is set up a breakpoint, which will stop program execution when the program counter gets to a certain address.  In our case our program is going to be executed at address $4000, so in the debug command line type the following command:

bp 4000

Next press F5 and in the RSDOS window type:

EXEC

This is where the fun begins.  The breakpoint stops execution and you can now step through the code line by line, watching the registers each step of the way.  You can also pull up a memory window with Command-D (probably Control-D on Linux/Windows), or go to the Debug menu option at the top of the screen and select Memory window.  In the Memory Window type 400, which is $400.  This will show us a hex view of the text screen for the CoCo.

Our program is going to write the word “HI” in the middle of the screen and you can see this in the Memory window as we step through the code.  Click on the debug window and hit Enter to step forward one line; as you do, the S stack pointer will decrease by two bytes as the PSHS A,B instruction stores the A and B values in the stack memory space.  Hit Enter again and the LDA #$48 instruction is executed and the A accumulator will change to show the value 48.  Hit Enter again and the LDB #$49 will load the B accumulator with the value 49.  You can see D’s value is now 4849.  Hit Enter again and you can see the value at address 0500 in the memory window has changed to 48 49.  The RSDOS screen hasn’t changed yet, since time is frozen while we are debugging and the beam that refreshes our screen hasn’t moved much at all in the time it takes for the 6809 to execute the few instructions we have in our program.  We can now press F5 again and the PULS A,B,PC instruction will restore our accumulators back to what they were before execution and return program execution back to RSDOS.  You should then see the screen refresh and show the HI on the left side of the middle of the screen.  I hope you get the idea how the debugger works.
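Why does storing at $500 land on the left side of the middle of the screen?  Here is the arithmetic as a Python sketch, assuming the default 32 column by 16 row text screen starting at $400 (the constant and function names are made up):

```python
TEXT_SCREEN_START = 0x400  # start of the default 32x16 text screen
COLUMNS = 32

def text_position(address):
    """Return (row, column) of a text screen address, both counted from 0."""
    offset = address - TEXT_SCREEN_START
    return divmod(offset, COLUMNS)
```

Address $500 is $100 (256) bytes into the screen, and 256 / 32 = 8 with nothing left over, so the H goes in row 8, column 0 and the I goes right beside it at $501.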

Another powerful feature of the watchpoint command is you can get it to stop execution only if the value of a certain RAM location changes to a specific value. For example (shrunk to fit on one line):

wp ffa0,10,w,{wpdata==0x0b},{printf "write to MMU %04X, value %02X @ %02X\n",wpaddr,wpdata,pc; g}

This command tells the debugger to watch the $10 (16) bytes from $FFA0 to $FFAF for a write operation.  If one occurs it checks if the value written is $0B, and if so it writes the message to the debug window and then carries on running (the g at the end).

An example of the output might be the line below, where 200D is the program counter address at the moment the value 0B was written to FFA2:

write to MMU FFA2, value 0B @ 200D
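If the combination of address range, condition and action is hard to picture, here is a tiny Python model of the same logic.  It only mimics the idea and has nothing to do with how MAME implements watchpoints; all of the names are made up.

```python
def make_watchpoint(start, length, condition):
    """Return a function that reports whether a write should trigger."""
    def triggered(address, data):
        in_range = start <= address < start + length
        return in_range and condition(data)
    return triggered

# Same range and condition as the wp command above.
mmu_watch = make_watchpoint(0xFFA0, 0x10, lambda data: data == 0x0B)

def report(address, data, pc):
    """Format a message the same way as the printf in the wp command."""
    return "write to MMU %04X, value %02X @ %02X" % (address, data, pc)
```

A write of $0B to $FFA2 triggers the watchpoint, a write of $0C does not, and a write to $FFB0 is outside the 16 byte range so it is ignored.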

One last cool feature I want to show is the trace feature.  The trace command follows the execution of the program and saves the disassembled instructions to a text file to be analyzed.  You can set it up so it also saves the register data at each instruction in the file.  Here is how I use it.  From the debug window we still have our watchpoint activated, so I’ll show you how to deactivate the watchpoints and breakpoints first.  From the Debug window click the down arrow beside the command line bar and click on Break.  This stops execution and allows you to use the debug features again, just like when a breakpoint or watchpoint has been triggered.  From the Debug menu select New (Break|Watch)points Window.

The window defaults to show the breakpoints but if you click on the top bar you can select ALL Breakpoints or ALL Watchpoints as below:

Select each view and click on the lines and you will see the X on the left turn into a red 0 to indicate it is disabled.

In this example I’m going to set a breakpoint at $4000 again manually and another breakpoint at the end of our program.  This is so the trace output will be short, as these files can get huge if you let them run for even a few seconds, depending on the speed of your computer.

From the debug window command line type the following two lines to setup the two new breakpoints

bp 4000

bp 4009

Hit F5, go to the RSDOS window and type EXEC again.  After the breakpoint stops execution and the debugger shows address 4000, turn on the trace function with the following command, all on one line:

trace output.tr.txt,0,,{tracelog "A=%02X,B=%02X,X=%02X,Y=%02X,U=%02X,S=%02X,CC=%02X ",a,b,x,y,u,s,cc}

Then hit F5 to continue the program execution, which will stop at $4009 where our last breakpoint was set. Turn off the trace function with the command in the debug window

trace off

Hit F5 again to get the RSDOS prompt back.  When you want to close MAME, once again hit the keyboard emulation mode key and then the Esc key.

Once you are out of MAME you can view the trace file in a text editor and you should see the following:

A=00,B=44,X=ABAB,Y=AAF1,U=2E0,S=7F32,CC=84 4002: LDA #$48
A=48,B=44,X=ABAB,Y=AAF1,U=2E0,S=7F32,CC=80 4004: LDB #$49
A=48,B=49,X=ABAB,Y=AAF1,U=2E0,S=7F32,CC=80 4006: STD $0500
A=48,B=49,X=ABAB,Y=AAF1,U=2E0,S=7F32,CC=80 4009: PULS A,B,PC

This shows the values of the accumulators and registers on each line of code.
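If you would rather pick through a trace file with a script than by eye, a minimal Python sketch like the one below could split each line into its register values, address and instruction. The function name parse_trace_line and the regular expression are assumptions made for illustration (they are not part of MAME), and the pattern only fits the tracelog format string used above.

```python
import re

# Matches one line written by the tracelog format shown above, e.g.
# "A=48,B=49,X=ABAB,Y=AAF1,U=2E0,S=7F32,CC=80 4006: STD $0500"
TRACE_LINE = re.compile(
    r"^(?P<regs>(?:[A-Z]+=[0-9A-F]+,?)+)\s+(?P<addr>[0-9A-F]+):\s+(?P<instr>.+)$"
)

def parse_trace_line(line):
    """Return (registers, address, instruction), or None if the line does not match."""
    match = TRACE_LINE.match(line.strip())
    if match is None:
        return None
    registers = {}
    for pair in match.group("regs").rstrip(",").split(","):
        name, value = pair.split("=")
        registers[name] = int(value, 16)  # register values are logged in hex
    return registers, int(match.group("addr"), 16), match.group("instr")

# One line copied from the trace output above
sample = "A=48,B=49,X=ABAB,Y=AAF1,U=2E0,S=7F32,CC=80 4006: STD $0500"
regs, addr, instr = parse_trace_line(sample)
print(regs["A"], hex(addr), instr)
```

To process a whole file you could call parse_trace_line on each line of output.tr.txt and skip any line that returns None.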

Another feature of the debugger is that you can get it to run until an IRQ is triggered by hitting F7, as shown here:

Another helpful thing you can do in the MAME debug mode is change the contents of any accumulator or register.

For example, when execution is stopped you can type the following in the debug window’s command line:

pc=1000

This changes the program counter (pc) to address $1000, and the program will continue from address $1000 if you hit F5 or step through the code.

a=94

This changes the A accumulator’s value to $94.  You get the idea…

If you want to see the cycle counts in your code listing you can add these lines to your assembly source code:

        opt     c
        opt     ct
        opt     cd
        opt     cc

The code listing will output the cycle counts from the place you inserted the above special options.  Anytime you want to reset the counts you can just insert the following:

        opt     cd
        opt     cc

Here is a little sample output so you can see how to use these options in your source code, with the actual cycle counts shown in the listing just to the left of the 6809 instructions.  This is some example code showing different ways to clear data in memory (or the screen).  It’s from another article I’m working on about assembly optimization.

                      (       mycode.asm):00001                         opt     c
                      (       mycode.asm):00002                         opt     ct
                      (       mycode.asm):00003                 
                      (       mycode.asm):00004                         ORG     $4000
4000                  (       mycode.asm):00005                 Start:
                      (       mycode.asm):00006                         opt     cd
                      (       mycode.asm):00007                         opt     cc
                      (       mycode.asm):00008                 * Slow way
4000 8E4000           (       mycode.asm):00009 [3]     3               LDX     #$4000
4003 CE0000           (       mycode.asm):00010 [3]     6               LDU     #$0000
                      (       mycode.asm):00011                         opt     cd
                      (       mycode.asm):00012                         opt     cc
                      (       mycode.asm):00013                 * This loop is 15 cycles to update two bytes
                      (       mycode.asm):00014                 * We have to do this loop $2000 / 2 bytes each pass = $1000 times
                      (       mycode.asm):00015                 * 15 cycles * $1000 or 4096 = 61,440 cpu cycles
4006 EF81             (       mycode.asm):00016 [5+3]   8       !       STU     ,X++
4008 8C6000           (       mycode.asm):00017 [4]     12              CMPX    #$4000+$2000
400B 26F9             (       mycode.asm):00018 [3]     15              BNE     <
                      (       mycode.asm):00019                 
                      (       mycode.asm):00020                         opt     cd
                      (       mycode.asm):00021                         opt     cc
                      (       mycode.asm):00022                 * Faster way
400D 8E4000           (       mycode.asm):00023 [3]     3               LDX     #$4000
4010 CE0000           (       mycode.asm):00024 [3]     6               LDU     #$0000
4013 CC1000           (       mycode.asm):00025 [3]     9               LDD     #$1000
                      (       mycode.asm):00026                         opt     cd
                      (       mycode.asm):00027                         opt     cc
                      (       mycode.asm):00028                 * This loop is mostly 13 cycles, but 18 cycles on every 256th pass (when B reaches zero)
                      (       mycode.asm):00029                 * $2000 / $100 = $20
                      (       mycode.asm):00030                 * $20 / 2 = $10  (half because we write 2 bytes per pass)
                      (       mycode.asm):00031                 * $2000 - $20 = $1FE0
                      (       mycode.asm):00032                 * $1FE0 / 2 = $FF0  (half because we write 2 bytes per pass)
                      (       mycode.asm):00033                 * 13 cycles * $FF0 + 18 cycles * $10 = $CF30 + $120 = $D050 = 53,328 cpu cycles
4016 EF81             (       mycode.asm):00034 [5+3]   8       !       STU     ,X++
4018 5A               (       mycode.asm):00035 [2]     10              DECB
4019 26FB             (       mycode.asm):00036 [3]     13              BNE     <
401B 4A               (       mycode.asm):00037 [2]     15              DECA
401C 26F8             (       mycode.asm):00038 [3]     18              BNE     <
                      (       mycode.asm):00098
                      (       mycode.asm):00099                 * Fastest method is to use unrolled loops
                      (       mycode.asm):00100                 * and use the U Stack pointer instead of a ST instruction
4073 CC0000           (       mycode.asm):00101 [3]     136             LDD     #$0000
4076 8E0000           (       mycode.asm):00102 [3]     139             LDX     #$0000
4079 3184             (       mycode.asm):00103 [4+0]   143             LEAY    ,X
407B CE6000           (       mycode.asm):00104 [3]     146             LDU     #$4000+$2000
                      (       mycode.asm):00105                         opt     cd
                      (       mycode.asm):00106                         opt     cc
                      (       mycode.asm):00107                 * This loop is 70 cycles to write 32 bytes
                      (       mycode.asm):00108                 * We cycle through the loop 256 times so the calculation is
                      (       mycode.asm):00109                 * 256 * 70 = 17,920 CPU Cycles
407E 3636             (       mycode.asm):00110 [5+6]   11      !       PSHU    D,X,Y
4080 3636             (       mycode.asm):00111 [5+6]   22              PSHU    D,X,Y
4082 3636             (       mycode.asm):00112 [5+6]   33              PSHU    D,X,Y
4084 3636             (       mycode.asm):00113 [5+6]   44              PSHU    D,X,Y
4086 3636             (       mycode.asm):00114 [5+6]   55              PSHU    D,X,Y
4088 3606             (       mycode.asm):00115 [5+2]   62              PSHU    D
408A 11834000         (       mycode.asm):00116 [5]     67              CMPU    #$4000
408E 22EE             (       mycode.asm):00117 [3]     70              BHI     <
                      (       mycode.asm):00118                 
                      (       mycode.asm):00119                         END     Start
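The totals in the listing comments can be double-checked with a few lines of Python. This is only an arithmetic sketch: the per-pass cycle counts and the pass counts are copied from the listing and its comments, and the variable names are my own.

```python
# Cycle totals for clearing $2000 bytes, using the per-pass counts from the listing.
BYTES = 0x2000

# Slow way: 15 cycles per pass, 2 bytes written per pass.
slow = 15 * (BYTES // 2)

# Faster way: 13 cycles on most passes, 18 cycles on the passes where B reaches
# zero and DECA / BNE also run. The pass counts ($FF0 and $10) come from the
# listing comments.
faster = 13 * 0xFF0 + 18 * 0x10

# Fastest way: 70 cycles per pass, 32 bytes per pass
# (five PSHU D,X,Y at 6 bytes each plus one PSHU D at 2 bytes).
fastest = 70 * (BYTES // 32)

print(slow, faster, fastest)
```

On these numbers the unrolled PSHU version needs well under a third of the cycles of the first loop, which is why it is worth the extra bytes.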

I should also point out a nice feature of lwasm: the greater than “>” and less than “<” branch targets, which mean you don’t need a label for every branch instruction.  In the listing above you can see the use of “BHI <”, which tells the assembler to branch if higher back to the previous “!” found above in your source code.  Likewise a branch such as “BNE >” tells the assembler to branch if not equal to the next “!” found below in your source code.

I should also point out there is a special version of MAME on GitHub that has some special enhancements for the CoCo that might come in handy.  You can read up about it and its features here.

I hope this info helps others get the most out of using MAME to learn assembly language programming.
