Optimizing 6809 Assembly Code: Part 4 – Odds and Sods – More Tricks

I’ve run out of the notes I had for speeding up 6809 assembly.  I’ll update this page with any more cool ideas that anyone shares with me or puts in the comments.

Cheers.

Dave Philipsen wrote with this tip:

I don’t know that it necessarily optimizes for speed, but it saves space. When printing a string, or calling any kind of subroutine which requires a string, instead of pointing to the string and then calling the subroutine like this:

ldx   #strptr     * point to the string
jsr   prtstr      * print the string
lda   #??         * continue with the rest of the program

You can do this:

jsr   prtstr        * call the print string subroutine which pulls
                    * the address of the string from the
fcs   /text string/ * program counter which was just pushed to
                    * the stack
lda   #??           * continue with the rest of the program

The prtstr routine can change the program counter as it is saved on the stack so that when the routine returns, it returns to the point just past the end of the string. This optimizes for size by eliminating the need to load the pointer each time you print. It also reduces complexity because you don’t need to assign a label to the string.
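
One way prtstr itself could be written is sketched below. This is only an illustration, not Dave's actual routine: putchr is a hypothetical character output routine that is assumed to preserve X, and the string is assumed to end with a character that has bit 7 set, which is what FCS produces.

```asm
prtstr  puls  x           * X = return address, which is the first byte of the string
ploop   lda   ,x+         * get the next character and step past it
        bmi   plast       * bit 7 set marks the last character
        jsr   putchr      * hypothetical output routine, must preserve X
        bra   ploop
plast   anda  #$7f        * strip the end marker from the last character
        jsr   putchr
        jmp   ,x          * X now points just past the string, so continue there
```

Because the return address was pulled off the stack, the routine ends with a JMP instead of an RTS. Note that A and X are changed by the call.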

Another example of this might be calling a subroutine which positions the cursor on the screen.

Instead of:

ldd   #$0101      * A=1 (x coord), B=1 (y coord)
jsr   curXY       * position the cursor
lda   #$??        * continue with program

do this:

jsr   curXY       * position the cursor, X and Y are pointed to
                  * by the program counter
fdb   $0101       * A=1 (x coord), B=1 (y coord)
lda   #$??        * continue with the program
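
A sketch of how curXY might pick up its inline parameters follows. Here the return address is left on the stack and patched in place. The name setcur is a hypothetical routine that takes A = x and B = y and is assumed to end with an RTS.

```asm
curXY   ldx   ,s          * X = return address, which points at the inline FDB
        ldd   ,x++        * A = x coord, B = y coord, X steps past the two data bytes
        stx   ,s          * patch the return address so the RTS skips the data
        jmp   setcur      * hypothetical: position the cursor from A and B, its RTS returns to the caller
```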

Thanks Dave.

Art Flexser wrote with this tip:

When addressing some CoCo hardware registers, STA is two cycles faster than CLR (5 cycles instead of 7 with extended addressing) and has the same effect, because registers such as $FFDE respond to any write regardless of the value written.  Erik Gavriluk pointed out that using CLR does affect the Condition Codes and that should be taken into account.  Also “STA sets flags, too. It’s weird, but CLR really does write AND read from the memory address in question. There are CoCo hardware registers where this can cause a problem.”

CLR $FFDE         * Slow way
STA $FFDE         * Faster way

Thanks Art.

Optimizing 6809 Assembly Code: Part 3 – Stack Blasting and Self Modifying Code

At the end of Part 2 I left off with a tease about using the Stack to blast data on the screen.  I ended off with this example:

The fastest method is to use unrolled loops and push registers through the U stack pointer instead of using a store instruction.  This routine uses 17,920 CPU cycles:

Mem  Code   Cycles Running Total            Assembly Code (Mnemonics)
4073 CC0000   [3]                             LDD     #$0000
4076 8E0000   [3]                             LDX     #$0000
4079 3184     [4+0]                           LEAY    ,X
407B CE4000   [3]                             LDU     #$2000+$2000
* This loop is 70 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 70 = 17,920 CPU Cycles
407E 3636     [5+6]   11      !               PSHU    D,X,Y
4080 3636     [5+6]   22                      PSHU    D,X,Y
4082 3636     [5+6]   33                      PSHU    D,X,Y
4084 3636     [5+6]   44                      PSHU    D,X,Y
4086 3636     [5+6]   55                      PSHU    D,X,Y
4088 3606     [5+2]   62                      PSHU    D
408A 11832000 [5]     67                      CMPU    #$2000
408E 22EE     [3]     70                      BHI             <

This brings us to one of the fastest ways to speed up loading and storing data on the 6809.

When you push data onto the U or S stack the 6809 does this super fast.  The bonus is that it writes the data and moves the pointer all in one instruction.  We could have used even more registers in each PSHU instruction to make it faster still.  For example we could also push the S and DP registers with PSHU   D,X,Y,S,DP, which stores an extra 3 bytes for every PSHU instruction.  This method of moving data by pushing it onto a stack is known as “Stack Blasting.”
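
As a sketch of that idea, the clear loop could be reworked as below. This is untested and makes several assumptions: SaveS is a hypothetical 2 byte variable, IRQ and FIRQ are masked because S holds fill data instead of a valid stack address, and the original DP value is restored afterwards. Nine bytes do not divide evenly into 32, so the last push of each pass is a shorter one.

```asm
        PSHS    CC,DP           * keep the interrupt mask and DP on the real stack
        ORCC    #$50            * mask IRQ and FIRQ while S is not a valid stack
        STS     >SaveS          * SaveS is a hypothetical 2 byte variable
        LDD     #$0000
        TFR     A,DP            * DP = 0, it is now one more byte of fill data
        LDX     #$0000
        LEAY    ,X              * Y = 0
        LEAS    ,X              * S = 0
        LDU     #$2000+$2000    * one byte past the end of the area to clear
!       PSHU    D,X,Y,S,DP      * [5+9]  9 bytes
        PSHU    D,X,Y,S,DP      * [5+9] 18 bytes
        PSHU    D,X,Y,S,DP      * [5+9] 27 bytes
        PSHU    D,X,DP          * [5+5] 32 bytes
        CMPU    #$2000          * [5]
        BHI     <               * [3]   60 cycles per pass
        LDS     >SaveS          * back to the real stack
        PULS    CC,DP           * restore DP and the interrupt mask
```

At 60 cycles per 32 bytes, 256 passes come to 15,360 cycles, compared with 17,920 for the D,X,Y version.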

Stack Blasting is a little tricky to use as you have to account for the fact that it stores bytes downwards in RAM.  If you were to write some data to the screen using stack blasting you would load the U register with the address of the bottom right of the screen (strictly, one byte past the last byte you want written, because a push decrements the pointer before it stores).  Every time you do a PSHU instruction it stores the contents of the registers below the U pointer in memory and moves the U register down in memory.  Because of this you must arrange the data that is to be blasted on screen in the correct order ahead of time.  The speed increase is definitely worth the effort.

The other half of stack blasting is loading the registers with their data.  The above code was fine for blasting zeros to the screen, but if you want to draw something you have to load the registers with the data before pushing it on screen.  This is where you would use the S stack pointer and the PULS command.  A PUL instruction loads the registers with the data in RAM and moves its stack pointer forward.  You can use PULS with PSHU, or PULU with PSHS; either pairing works the same.

So you basically do something like this:

Draw_Backgrnd:
        PSHS    D,X,Y,DP
        STS     TempMem     * Save the Stack pointer
        LDS     #$5C3F      * Bottom right of the screen
        LDU     #$C000      * Address of data to copy to screen
Copy_bg1:
        PULU    D,X,Y,DP    * load the registers and move U forward
        PSHS    D,X,Y,DP    * Store the data and move S backwards
        CMPU    #$DC35      * Check if U has reached the end of data
        BLO     Copy_bg1    * if not keep copying
        LDS     TempMem     * Restore the Stack pointer
        PULS    D,X,Y,DP,PC

Stack Blasting data on the screen is a method used to quickly draw characters or sprites on the screen for games.  The classic arcade game Defender uses a 6809 running at 1 MHz and is a super fast game, with lots of objects on the screen moving and changing all the time.  Defender uses stack blasting for drawing all the ships and aliens on the screen…

Of course, as discussed in Part 2 of this series, unrolling the stack blasting loop above will speed it up even more!
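
A sketch of an unrolled copy loop is shown below. It assumes the data length has been padded to a multiple of 28 bytes so that the single CMPU still catches the end of the data. EndOfData is a hypothetical label for that end address.

```asm
Copy_bg1:
        PULU    D,X,Y,DP    * [5+7] load 7 bytes, U moves forward
        PSHS    D,X,Y,DP    * [5+7] store 7 bytes, S moves backwards
        PULU    D,X,Y,DP
        PSHS    D,X,Y,DP
        PULU    D,X,Y,DP
        PSHS    D,X,Y,DP
        PULU    D,X,Y,DP
        PSHS    D,X,Y,DP
        CMPU    #EndOfData  * [5] hypothetical label, start of data plus a multiple of 28
        BLO     Copy_bg1    * [3]
```

Each pass now costs 4 * 24 + 8 = 104 cycles for 28 bytes, against 4 * 32 = 128 cycles for the same 28 bytes in the loop that is not unrolled.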


A little self modifying code can speed things up and save memory too…

This code shown previously:

Mem  Code    Cycles Running Total     Assembly Code (Mnemonics)
4000                        Draw_Backgrnd:
4000 343E     [5+7]   12              PSHS    D,X,Y,DP
4002 10FF401D [7]     19              STS     TempMem     * Save the Stack pointer
4006 10CE5C3F [4]     23              LDS     #$5C3F
400A CEC000   [3]     26              LDU     #$C000
400D                        Copy_bg1:
400D 373E     [5+7]   38              PULU    D,X,Y,DP
400F 343E     [5+7]   50              PSHS    D,X,Y,DP
4011 1183DC35 [5]     55              CMPU    #$DC35
4015 25F6     [3]     58              BLO     Copy_bg1
4017 10FE401D [7]     65              LDS     TempMem     * Restore the Stack pointer
401B 35BE     [5+9]   79              PULS    D,X,Y,DP,PC

Can be changed to this:

Mem  Code    Cycles Running Total     Assembly Code (Mnemonics)
4000                       Draw_Backgrnd:
4000 343E     [5+7]   12              PSHS    D,X,Y,DP
4002 10FF4019 [7]     19              STS     Save_S_Here+2 * Save the Stack pointer
4006 10CE5C3F [4]     23              LDS     #$5C3F
400A CEC000   [3]     26              LDU     #$C000
400D                       Copy_bg1:
400D 373E     [5+7]   38              PULU    D,X,Y,DP
400F 343E     [5+7]   50              PSHS    D,X,Y,DP
4011 1183DC35 [5]     55              CMPU    #$DC35
4015 25F6     [3]     58              BLO     Copy_bg1
4017                         Save_S_Here:
4017 10CE0000 [4]     62              LDS     #$0000        * Restore the Stack pointer
401B 35BE     [5+9]   76              PULS    D,X,Y,DP,PC

Let me explain the changes above, which are the STS and LDS lines.  First the STS instruction saves the value of the S register in the code itself, at memory location $4019, which is where the LDS instruction using immediate addressing will load its value from.  The gain here is 3 cycles, and you no longer need to store the S value in the TempMem location, which saves two bytes of RAM.

Self modifying code makes following and debugging the code in the future a lot more difficult, so I would only use it if necessary.  One routine where I do use the above method is audio sample playback in the FIRQ routine.  When you play back sampled audio using the FIRQ you need the code to be as fast as possible since it will be triggered thousands of times a second.  This is an example FIRQ routine to play back audio samples on the 6809.  It uses the Timer, which is only available on the CoCo 3, but I think this is a good example of when it’s necessary to make code as fast as possible and pull out all the stops!

First make sure the DP register points to the page that holds the FIRQ routine; this will speed up the FIRQ too.

        LDA     #DirectPage/256
        TFR     A,DP

To save another cycle we make the FIRQ interrupt vector jump to the sample playing routine with a direct page JMP (3 cycles) rather than an extended JMP (4 cycles):

        LDA     #$0E                * JMP opcode using DP addressing
        LDB     #FIRQ_Audio%256
        STD     $FEF4               * Point the FIRQ vector at FIRQ_Audio

Next you need to insert some code that loads your sample data, making sure the sample data ends at address $8000.  Then store the starting address of the sample in memory at LoadAudio+1 by doing the following:

        LDX     #SampleStart        * Sample starting location in RAM
        STX     LoadAudio+1         * Store it where FIRQ will read

Then set up the FIRQ Timer to match the sample rate of your sound file and enable the FIRQ.

Below is an example of an FIRQ routine to play an audio sample:

        ORG     $FA00           * Address of the Table and Data to be loaded
DirectPage:
        SETDP   DirectPage/256
*****************************
FIRQ_Audio:
        STA     <FIRQ_Audio_Restore+1 * Save A for restore after FIRQ routine is complete
        LDA     FIRQENR        * Re enable the FIRQ
        INC     <LoadAudio+2   * Increment the LSB of the sample pointer
        BNE     LoadAudio      * jump ahead if LSB is not zero
        INC     <LoadAudio+1   * Increment the MSB of the sample pointer
        BPL     LoadAudio      * If we haven't hit $8000 then keep going
        LDA     #ReturnFIRQ%256 * Point the FIRQ to the RTI
        STA     $FEF5           * This Sample playing has now ended
LoadAudio:
        LDA     $F9FF          * Get next sample byte
        STA     $FF20          * $FF20 - store to DAC - Play a sample 🙂
FIRQ_Audio_Restore:
        LDA     #$00           * STA at the start of the FIRQ stores A's value, here we restore A before the RTI, saves a cycle and a byte of RAM
ReturnFIRQ:        
        RTI

The above code does a lot of self modifying:

  1. The first line saves the A accumulator’s value at FIRQ_Audio_Restore+1 which is loaded just before the end of the routine.  This is necessary since the FIRQ does not save the registers automatically like the IRQ does.  So we need to restore A’s value after the FIRQ is finished.
  2. The first INC instruction modifies the LSB of the sample pointer directly at address LoadAudio+2
  3. The second INC instruction modifies the MSB of the sample pointer directly at address LoadAudio+1
  4. The STA    $FEF5 changes the FIRQ vector to jump directly to the RTI instruction so the playback will no longer be active until setup again in the main program

See you in Part 4,

Glen

Optimizing 6809 Assembly Code: Part 2 – Speedup Storing Data – Unrolling Loops

Let’s move on to some more in-depth ways of speeding up useful tasks done in assembly.  This is probably a good place to point out that if you’re using lwasm from LWTOOLS to assemble your source code, you can use the options below in your code to get cycle counts in your output listing.

opt c  - enable cycle counts
opt cd - enable detailed cycle counts breaking down addressing modes
opt ct - show a running subtotal of cycles
opt cc - clear the running subtotal

The cycle count info in the listings below was generated by lwasm.

I personally use lwasm to show me the cycle counts as I just described, but others might prefer a 6809 reference chart to look up the cycle counts on their own while programming.  The Sockmaster 6809/6309 reference chart is a good one.


First, a good little trick to keep in mind is to use the ABX instruction in place of LEAX   B,X

ABX is 2 cycles faster and one byte shorter than LEAX  B,X

Keep in mind that ABX adds the unsigned value of B to X, while LEAX  B,X treats B as a signed value when adding it to X.
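
A small example of that difference, using values picked only for illustration:

```asm
        LDX     #$2000
        LDB     #$F0        * 240 unsigned, -16 signed
        ABX                 * X = $2000 + 240 = $20F0

        LDX     #$2000
        LDB     #$F0
        LEAX    B,X         * X = $2000 - 16 = $1FF0
```

The two instructions only give the same result while B is in the range 0 to 127.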

A note about using ABX from Sockmaster – “Oftentimes I rearrange register usage in my code just to make sure ABX can be used as much as possible.”


Another important way to speed up your program is to make use of the Direct Page (DP) register.  If you are playing sampled sounds in a game using an FIRQ, it is very useful to have your FIRQ routine in the DP space.  Also use it for other small routines that get used a lot, and for storage of data that is accessed a lot.

                            LDA     #$FA
                            TFR     A,DP    * Set DP to $FA00-$FAFF
4000 FCFA55 [6]     6       LDD     $FA55   * slower and more bytes
4003 DC55   [5]     11      LDD     <$FA55  * faster and less bytes

If we don’t use direct addressing the LDD takes 6 cycles and 3 bytes.  If we use direct addressing indicated with the less than “<” symbol the LDD takes 5 cycles and 2 bytes.
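
A sketch of keeping frequently used variables in the direct page is shown below. The names DPVars, Score and Lives and the addresses are made up for this example. SETDP generates no code, it only tells the assembler what value DP will hold so it knows which addresses can use direct addressing.

```asm
        ORG     $FA00
DPVars:                         * hypothetical block of frequently used variables
Score   RMB     2
Lives   RMB     1

        ORG     $4000
        SETDP   DPVars/256      * tell the assembler which page DP will hold
Start   LDA     #DPVars/256
        TFR     A,DP            * DP = $FA
        LDD     <Score          * [5] direct, 2 bytes (extended is [6], 3 bytes)
        INC     <Lives          * [6] direct, 2 bytes (extended is [7], 3 bytes)
```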


Another thing to note is the impact on speed and size when you are using indexed addressing.

Mem  Code     Cycles               Assembly Code (Mnemonics)
4000 8E2000   [3]                  LDX     #$2000
4003 A684     [4+0]                LDA     ,X
4005 A61F     [4+1]                LDA     -1,X
4007 A610     [4+1]                LDA     -16,X
4009 A688EF   [4+1]                LDA     -17,X
400C A68880   [4+1]                LDA     -128,X
400F A689FF7F [4+4]                LDA     -129,X
4013 A601     [4+1]                LDA     1,X
4015 A60F     [4+1]                LDA     15,X
4017 A68810   [4+1]                LDA     16,X
401A A6887F   [4+1]                LDA     127,X
401D A6890080 [4+4]                LDA     128,X

Things to note about the above list:

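  • Offsets of -1 to -16 are only two bytes, same with offsets of 1 to 15, and they add one cycle
  • Offsets of -17 to -128 and 16 to 127 need three bytes but still add only one cycle
  • Offsets of -129 or lower use 4 bytes and 8 cycles.  The same is true for positive offsets of 128 or higher.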

For a real world example: say we have a screen that is 64 bytes wide and we want to draw a line across the screen that is three pixels tall.  You could do this:

This routine takes 30 cycles * 32 = 960 cycles

Mem  Code   Cycles Running Total   Assembly Code (Mnemonics)
4000 8E2000   [3]                  LDX     #$2000
4003 CEFFFF   [3]                  LDU     #$FFFF      * White Pixels
4006 C602     [2]                  LDB     #2
4008 EF84     [5+0]   5    !       STU     ,X
400A EF8840   [5+1]   11           STU     64,X
400D EF890080 [5+4]   20           STU     128,X       * Big & Slow
4011 3A       [3]     23           ABX
4012 8C2040   [4]     27           CMPX    #$2000+64
4015 26F1     [3]     30           BNE     <

Or you could make it faster and smaller by indexing -64 and +64 as shown below.  You could also index back -128 and -64 and point X to the bottom row.

This version of the routine takes 27 cycles * 32 = 864 cycles

Mem  Code   Cycles Running Total   Assembly Code (Mnemonics)
4000 8E2040   [3]                  LDX     #$2000+64
4003 CEFFFF   [3]                  LDU     #$FFFF      * White Pixels
4006 C602     [2]                  LDB     #2
4008 EF88C0   [5+1]   6    !       STU     -64,X
400B EF84     [5+0]   11           STU     ,X
400D EF8840   [5+1]   17           STU     64,X
4010 3A       [3]     20           ABX
4011 8C2080   [4]     24           CMPX    #$2000+64+64
4014 26F2     [3]     27           BNE     <
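
The other arrangement mentioned above, pointing X at the bottom row and indexing back by -128 and -64, could be sketched like this. It has not been run through lwasm, so the cycle counts in the comments are worked out by hand:

```asm
        LDX     #$2000+128      * X points at the bottom row of the three
        LDU     #$FFFF          * White Pixels
        LDB     #2
!       STU     -128,X          * [5+1] top row, -128 still fits an 8 bit offset
        STU     -64,X           * [5+1] middle row
        STU     ,X              * [5+0] bottom row
        ABX                     * [3]
        CMPX    #$2000+128+64   * [4]
        BNE     <               * [3]   27 cycles per pass
```

This also works out to 27 cycles per pass, since -128 is the last offset that still fits in a single byte.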

Usually there are trade-offs when using techniques to speed up your code.  More speed usually means bigger or more complex code; not always, but typically this is true.

Here are some examples of clearing some RAM to all zeros from location $2000 to $3FFF. This could be used for clearing a graphics screen for a game.

Slow way, 61,440 cpu cycles:  This is a simple and easy Loop to clear the RAM from $2000 to $3FFF

Mem  Code  Cycles  Running Total    Assembly Code (Mnemonics)
4000 8E2000 [3]                     LDX     #$2000
4003 CE0000 [3]                     LDU     #$0000
* This loop is 15 cycles to update two bytes
* We have to do this loop $2000 / 2 bytes each pass = $1000 times
* 15 cycles * $1000 or 4096 = 61,440 cpu cycles
4006 EF81   [5+3]   8       !       STU     ,X++
4008 8C4000 [4]     12              CMPX    #$2000+$2000
400B 26F9   [3]     15              BNE     <

 Faster way, 53,328 cpu cycles:  A faster way is to use the A and B accumulators as counters to see if our loop is finished.  

Mem  Code  Cycles Running Total     Assembly Code (Mnemonics)
400D 8E2000 [3]                     LDX     #$2000
4010 CE0000 [3]                     LDU     #$0000
4013 CC1000 [3]                     LDD     #$1000  * A = 16 outer passes, B = 0 gives 256 inner passes
* This loop is 13 cycles on most passes, 18 cycles on every 256th pass
* Total passes = $2000 bytes / 2 bytes per pass = $1000 (4096)
* Passes that also run DECA and the second BNE = $1000 / $100 = $10 (16)
* Remaining passes = $1000 - $10 = $FF0 (4080)
* 13 cycles * $FF0 + 18 cycles * $10 = $CF30 + $120 = $D050 = 53,328 cpu cycles
4016 EF81   [5+3]   8       !       STU     ,X++
4018 5A     [2]     10              DECB
4019 26FB   [3]     13              BNE     <
401B 4A     [2]     15              DECA
401C 26F8   [3]     18              BNE     <

Even faster way – Code unrolled version = 34,048 CPU cycles:  It’s faster if we unroll the loop, which means less comparing is done to see if we are at the end of our loop.  This needs a little calculation ahead of time.  If we are going to use this for clearing the screen then 32 bytes is a good number to use.  So using the above code we could unroll it to this:

Mem  Code  Cycles Running Total     Assembly Code (Mnemonics)
4049 8E2000 [3]     3               LDX     #$2000
404C CE0000 [3]     6               LDU     #$0000
404F 5F     [2]     8               CLRB
* This loop is 133 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 133 = 34,048 CPU Cycles
4050 EF81   [5+3]   8       !       STU     ,X++
4052 EF81   [5+3]   16              STU     ,X++
4054 EF81   [5+3]   24              STU     ,X++
4056 EF81   [5+3]   32              STU     ,X++
4058 EF81   [5+3]   40              STU     ,X++
405A EF81   [5+3]   48              STU     ,X++
405C EF81   [5+3]   56              STU     ,X++
405E EF81   [5+3]   64              STU     ,X++
4060 EF81   [5+3]   72              STU     ,X++
4062 EF81   [5+3]   80              STU     ,X++
4064 EF81   [5+3]   88              STU     ,X++
4066 EF81   [5+3]   96              STU     ,X++
4068 EF81   [5+3]   104             STU     ,X++
406A EF81   [5+3]   112             STU     ,X++
406C EF81   [5+3]   120             STU     ,X++
406E EF81   [5+3]   128             STU     ,X++
4070 5A     [2]     130             DECB
4071 26DD   [3]     133             BNE     <

Simon Jonassen (the Mad Man) shows that an even faster method is to use ABX and indexed addressing.  His method below ties a lot of the examples above together.  The example below is 26,880 cycles.

Mem  Code  Cycles  Running Total    Assembly Code (Mnemonics)
4000 8E2000 [3]     159             LDX     #$2000
4003 CE0000 [3]     162             LDU     #$0000
4006 CC0010 [3]     165             LDD     #$0010  * A = Loop 256 times, B adds 16 to X
* This loop is 105 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 105 = 26,880 CPU Cycles
4009 EF84   [5+0]   5       !       STU     ,X
400B EF02   [5+1]   11              STU     2,X
400D EF04   [5+1]   17              STU     4,X
400F EF06   [5+1]   23              STU     6,X
4011 EF08   [5+1]   29              STU     8,X
4013 EF0A   [5+1]   35              STU     10,X
4015 EF0C   [5+1]   41              STU     12,X
4017 EF0E   [5+1]   47              STU     14,X
4019 3A     [3]     50              ABX             * Move forward half a row
401A EF84   [5+0]   55              STU     ,X
401C EF02   [5+1]   61              STU     2,X
401E EF04   [5+1]   67              STU     4,X
4020 EF06   [5+1]   73              STU     6,X
4022 EF08   [5+1]   79              STU     8,X
4024 EF0A   [5+1]   85              STU     10,X
4026 EF0C   [5+1]   91              STU     12,X
4028 EF0E   [5+1]   97              STU     14,X
402A 3A     [3]     100             ABX             * Move forward to the next row
402B 4A     [2]     102             DECA
402C 26DB   [3]     105             BNE     <

At the expense of a little RAM we could improve the above code by using index values larger than 15.  The version below uses 26,368 cycles.

Mem  Code  Cycles Running Total     Assembly Code (Mnemonics)
4000 8E2000 [3]     159             LDX     #$2000
4003 CE0000 [3]     162             LDU     #$0000
4006 CC0020 [3]     165             LDD     #$0020  * A = Loop 256 times, B adds 32 to X
* This loop is 103 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 103 = 26,368 CPU Cycles
4009 EF84   [5+0]   5       !       STU     ,X
400B EF02   [5+1]   11              STU     2,X
400D EF04   [5+1]   17              STU     4,X
400F EF06   [5+1]   23              STU     6,X
4011 EF08   [5+1]   29              STU     8,X
4013 EF0A   [5+1]   35              STU     10,X
4015 EF0C   [5+1]   41              STU     12,X
4017 EF0E   [5+1]   47              STU     14,X        
4019 EF8810 [5+1]   53              STU     16,X
401C EF8812 [5+1]   59              STU     18,X
401F EF8814 [5+1]   65              STU     20,X
4022 EF8816 [5+1]   71              STU     22,X
4025 EF8818 [5+1]   77              STU     24,X
4028 EF881A [5+1]   83              STU     26,X
402B EF881C [5+1]   89              STU     28,X
402E EF881E [5+1]   95              STU     30,X
4031 3A     [3]     98              ABX             * Move forward to the next row
4032 4A     [2]     100             DECA
4033 26D4   [3]     103             BNE     <

One other tip from Darren Atkinson about the above indexing method is to use negative numbers if you can to keep the size of the code down.  Darren’s version is below:

Mem  Code  Cycles Running Total     Assembly Code (Mnemonics)
4000 8E2010 [3]     3               LDX    #$2000+16
4003 CE0000 [3]     6               LDU    #$0000
4006 CC0020 [3]     9               LDD    #$0020
* This loop is 103 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 103 = 26,368 CPU Cycles
4009 EF10   [5+1]   6           !   STU    -16,X
400B EF12   [5+1]   12              STU    -14,X
400D EF14   [5+1]   18              STU    -12,X
400F EF16   [5+1]   24              STU    -10,X
4011 EF18   [5+1]   30              STU    -8,X
4013 EF1A   [5+1]   36              STU    -6,X
4015 EF1C   [5+1]   42              STU    -4,X
4017 EF1E   [5+1]   48              STU    -2,X
4019 EF84   [5+0]   53              STU    ,X
401B EF02   [5+1]   59              STU    2,X
401D EF04   [5+1]   65              STU    4,X
401F EF06   [5+1]   71              STU    6,X
4021 EF08   [5+1]   77              STU    8,X
4023 EF0A   [5+1]   83              STU    10,X
4025 EF0C   [5+1]   89              STU    12,X
4027 EF0E   [5+1]   95              STU    14,X
4029 3A     [3]     98              ABX
402A 4A     [2]     100             DECA
402B 26DC   [3]     103             BNE    <

One last thing to note is that the more you unroll the code the faster it will be, at the expense of more RAM.  You just have to decide what is most important: RAM space or the speed of your code…

The fastest method – This routine uses 17,920 CPU cycles:  It is fastest to use unrolled loops and push registers into RAM through a stack pointer instead of using a store instruction.

Mem  Code   Cycles Running Total            Assembly Code (Mnemonics)
4073 CC0000   [3]                             LDD     #$0000
4076 8E0000   [3]                             LDX     #$0000
4079 3184     [4+0]                           LEAY    ,X
407B CE4000   [3]                             LDU     #$2000+$2000
* This loop is 70 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 70 = 17,920 CPU Cycles
407E 3636     [5+6]   11      !               PSHU    D,X,Y
4080 3636     [5+6]   22                      PSHU    D,X,Y
4082 3636     [5+6]   33                      PSHU    D,X,Y
4084 3636     [5+6]   44                      PSHU    D,X,Y
4086 3636     [5+6]   55                      PSHU    D,X,Y
4088 3606     [5+2]   62                      PSHU    D
408A 11832000 [5]     67                      CMPU    #$2000
408E 22EE     [3]     70                      BHI             <

I’ll go into the details of this method and more in Part 3 of this series of blogs.

Cheers,

Glen

Optimizing 6809 Assembly Code: Part 1 – Quick and Easy Changes to Speedup Your Code

A lot of the time assembly language programs are already fast enough.  But if you want to make yours faster, or you are writing an arcade game, then the following information will be helpful.  I’m always trying to improve my assembly language programming skills and I’ve been keeping some notes as I do more and more programming.  In the next few blogs I’ll share what I’ve learned and I hope this will be a useful guide for anyone who wants to make super fast 6809 assembly programs.

This first part will go over some of the quick and easy things you can do to speed up your program.  In fact if you have a favourite old 6809 computer game and you wanted to speed it up you could start with these changes.  Of course you would need to disassemble the game and make the changes that you can and  then re-assemble the code.  This would be a great learning experience.  Maybe I’ll make another blog showing how you could go about doing that.

Simply choosing to use certain instructions instead of others will speed up your program as shown below:

  • CMPX  is one byte shorter and one cycle faster than CMPY, CMPU and CMPD

If you have a lot of loops or counters and you use CMPX instead of the other CMP instructions, you save one CPU cycle every time that compare is done in the loop.   This could be a huge speed difference if you have large loops.
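
For example, here is the same small fill loop written with Y and then with X as the pointer. The addresses are arbitrary and the cycle counts in the comments are worked out by hand:

```asm
* Y as the pointer
        LDY     #$2000      * [4] 4 bytes
!       STA     ,Y+         * [4+2]
        CMPY    #$2100      * [5] 4 bytes
        BNE     <           * [3]   14 cycles per pass

* X as the pointer
        LDX     #$2000      * [3] 3 bytes
!       STA     ,X+         * [4+2]
        CMPX    #$2100      * [4] 3 bytes
        BNE     <           * [3]   13 cycles per pass
```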


  • LDX & LDU are one byte shorter and a cycle faster than LDY
  • The same is true for STX & STU, which are one byte shorter and one cycle faster than STY

Again if used inside loops the speed difference can be huge.


  • TFR  between 16 bit registers is slower than using LEA
   TFR    Y,U is 6 cycles and 2 bytes this can be changed to
   LEAU   ,Y which is 4 cycles 2 bytes

  • LEA with an 8 bit offset, or with A or B as the offset, takes 5 cycles, which is faster than LEA with a 16 bit offset or with D as the offset, which takes 8 cycles.  For constant offsets the 8 bit form is one byte shorter too.  This is the same for all of LEAX, LEAU, LEAY & LEAS.  Keep in mind that LEA does signed adds, so make sure your 8 bit values take that into account; for example, be careful changing from LEAU   D,U to  LEAU    B,U
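
To illustrate why that last change needs care, here is a case where the two forms give different results. The values are chosen only for the example:

```asm
        LDU     #$3000
        LDD     #$0090      * D = 144
        LEAU    D,U         * U = $3000 + 144 = $3090

        LDU     #$3000
        LDB     #$90        * B = $90 is -112 when treated as signed
        LEAU    B,U         * U = $3000 - 112 = $2F90, not $3090
```

The swap is only safe when the offset is known to stay within -128 to 127.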

Quick and easy things to change in existing code.  Prefer these in order from first to last:

  • BRA      – 3 cycles, 2 bytes
  • JMP       – 4 cycles, 3 bytes
  • LBRA    – 5 cycles, 3 bytes – Only use if you want your code to be relocatable

  • BSR       – 7 cycles, 2 bytes
  • JSR        – 8 cycles, 3 bytes
  • LBSR    – 9 cycles, 3 bytes – Only use if you want your code to be relocatable

Another way to optimize your code is to make the most of the way you use jumps or branches to subroutines.  If the second last instruction in your routine is a BSR, JSR or LBSR and your last instruction is an RTS you can change the BSR, JSR or LBSR to BRA, JMP or LBRA and remove your RTS command.  The last called routine will return for you.  For example:

        LDX     #$4000
        BSR     SAVEX        * 7 CPU cycles (2 bytes)
        RTS                  * 5 CPU cycles (1 byte)
                             * These two lines require 12 cycles and
                             * 3 bytes
        ...
SAVEX   STX     ,U++
        RTS

Can be changed to:

        LDX     #$4000
        BRA     SAVEX        * 3 CPU cycles and two bytes
        ...
SAVEX   STX ,U++
        RTS

This is a savings of 9 CPU cycles and 1 byte.


A common trick for routines that use the stack to save your registers and accumulators is to add PC to the PULS at the end of your routine instead of using a separate RTS command, as shown:

Code1   PSHS    D,X,Y
        LDD     #$0155
        STD     ,X++
        STD     ,Y++
        PULS    D,X,Y    * 5 + 6 CPU cycles, 2 bytes
        RTS              * 5 CPU cycles, 1 byte

Can be changed to:

Code1:  PSHS   D,X,Y
        LDD    #$0155
        STD    ,X++
        STD    ,Y++
        PULS   D,X,Y,PC  * 5 + 8 CPU cycles, 2 bytes
                         * Adding PC restores the Program counter
                         * which is saved on the stack when your
                         * routine was called with BSR,JSR or LBSR
                         * no RTS is needed.

This saves us 3 CPU cycles and 1 byte

Since we are talking about branching and its effect on the speed of your program, Steve Bamford has a great point: to make your program execute faster, think about your program flow and branch only to what is least likely.  This only matters if you have to do a long branch, as short branches always take the same time.  A long branch that isn’t taken takes 5 cycles and a long branch that is taken uses 6 cycles.  Sure, it’s just 1 cycle, but if it is in a crucial loop in your code it can make a big difference.

For example if A is most likely to be 1 then the code below will long branch to Ais1 most of the time which means the CPU must add the branch location to the PC and jump to it which adds a cycle to the execution of your program.

CheckA:
        CMPA    #1
        LBEQ     Ais1
AisNot1:
;       Code to handle A is not 1
        ...
Ais1:
;       Code to handle A is 1
        ...

The code would run faster if it was arranged as this:

CheckA:
        CMPA    #1
        LBNE     AisNot1 
Ais1:
;       Code to handle A is 1
        ...
AisNot1:
;       Code to handle A is not 1
        ...

Sockmaster mentions a similar method in the comments below


If you need to load both A and B registers use LDD, for example:

    LDA   #$20     * 2 CPU cycles and 2 bytes
    LDB   #$55     * 2 CPU cycles and 2 bytes

Can be changed to:

    LDD   #$2055   * 3 CPU cycles and 3 bytes

Saves a cycle and a byte


Even more speed and space are saved when a pair of LDA and LDB (or STA and STB) instructions using extended addressing is changed to a single LDD (or STD), as:

    LDA   $1F00    * 5 CPU cycles and 3 bytes
    LDB   $1F01    * 5 CPU cycles and 3 bytes

Can be changed to

    LDD   $1F00    * 6 CPU cycles and 3 bytes

This results in a savings of 4 CPU cycles and 3 bytes
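
The store version works the same way. A sketch with the same addresses:

```asm
    STA   $1F00    * 5 CPU cycles and 3 bytes
    STB   $1F01    * 5 CPU cycles and 3 bytes

    STD   $1F00    * 6 CPU cycles and 3 bytes, A goes to $1F00 and B to $1F01
```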


Another way to speed up your code: if you load values from the same location in memory many times, load the address into a register once and use the register as a pointer.  For example:

    LDA    $FF00   * 5 CPU Cycles, 3 bytes

Can be

    LDX   #$FF00    * 3 CPU Cycles, 3 bytes
    LDA   ,X        * 4 CPU Cycles, 2 bytes

If you are using LDA in a loop then the second method will be faster, because the LDX is done once before the loop and every LDA  ,X after that saves a cycle and a byte.
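
The break-even point can be worked out from the cycle counts above.  This is a small Python sketch of that arithmetic; it assumes the LDX is executed once and nothing else in the loop changes, and the function names are only for illustration.

```python
# Cycle counts taken from the listings above.
EXTENDED_LDA = 5    # LDA $FF00
LDX_IMMEDIATE = 3   # LDX #$FF00, executed once before the loop
INDEXED_LDA = 4     # LDA ,X


def cycles_extended(loads):
    """Total cycles when every load uses extended addressing."""
    return loads * EXTENDED_LDA


def cycles_pointer(loads):
    """Total cycles when X is set up once and every load goes through ,X."""
    return LDX_IMMEDIATE + loads * INDEXED_LDA


def first_winning_load_count():
    """Smallest number of loads for which the pointer version is strictly cheaper."""
    loads = 1
    while cycles_pointer(loads) >= cycles_extended(loads):
        loads += 1
    return loads


if __name__ == "__main__":
    for n in (1, 3, 4, 100):
        print(n, "loads:", cycles_extended(n), "vs", cycles_pointer(n), "cycles")
    print("pointer version wins from", first_winning_load_count(), "loads onward")
```

With these numbers the two versions tie at three loads, so the pointer only pays off when the same address is read four or more times (and X is not needed for something else).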


If you know of any more quick ways to save some CPU cycles please comment below and I will update this page with credit to you.

A great reference to cycle counts and the full 6809/6309 instruction set is Darren Atkinson’s Motorola 6809 and Hitachi 6309 Programmers Reference.

Part 2 will cover topics that are a little more complex and show how to make them faster.

Stay tuned,

Glen


Sprite Compiler for the TRS-80 Color Computer 3

What are sprites anyways?

Sprites are little blocks of data that you move around the computer screen.  The characters that move around the screen in games such as Mario, or the barrels rolling around in Donkey Kong, are sprites.  If you are making a game or a demo on the CoCo then you are left with the job of writing assembly code that moves your game data/characters around the screen using the CPU.  The CoCo doesn’t have hardware sprites so it’s all up to the processor.  So the faster we can move sprites around the better.  This might free up CPU cycles so you can add more sprites to your game or add sound.

There are many ways to move your character data around the screen on a CoCo:

  1. Load the character data from memory (LDD   ,X) and store the same data somewhere in video RAM (STD   ,U).
  2. Faster is to stack blast.  With this method you point the S stack pointer at the character data in memory and the U stack pointer at the place in video RAM where you want your data to end up.  Then you do a PULS  D,X,Y and a PSHU   D,X,Y.  You must also keep track of the order of your data and of where in video RAM you are pushing it.
  3. Compiled sprites are even faster.  A compiled sprite is actual code that draws the character on screen directly.  The code is a lot of LDD or LDA instructions with a bunch of STD   ,X where you point X to the address in video RAM where you want your sprite to be.
  4. Compiled/blasted sprites do the same as compiled sprites, but the sprite is drawn from the bottom up and, whenever possible, a PSHU  D,X,Y or similar is used instead of STD   ,X.  Pushing is faster than storing.

What is a sprite compiler?

A sprite compiler takes your character data in the binary format that would be loaded and then stored on the screen and turns that data into assembly code that when executed writes the sprite data to the screen very fast.

Here is an example source picture and what the compiled sprite code looks like:

The tool produced the assembly code below for a 16 colour, 256 pixel wide screen.  When using the stack it’s best to work from the bottom row up.

**************************************************
* 00 00 00 00 FF FF 00 00 00 00 - 1 
* 00 00 0F FF FF FF FF 00 00 00 - 2 
* 00 0F EE FF FF FF FF F0 00 00 - 3 
* 00 0E 99 EF FF FF FF FF 00 00 - 4 
* 00 FE AA EF FF FF FF FF F0 00 - 5 
* 00 FE 99 EF FF FF FF FF F0 00 - 6 
* 0F FF FF FF FF FF FF FF EF 00 - 7 
* 0F FF FF FF FF FF FF FE EF 00 - 8 
* FF FF FF FF FF FF FF FE EF F0 - 9 
* FF FF FF FF FF FF FF EE EF F0 - 10 
* 0F FF FF FF FF FF FE EE EF 00 - 11 
* 0F FF FF FF FF FF EE DE EF 00 - 12 
* 00 FF FF FF FF FE ED EE F0 00 - 13 
* 00 FF FF FF FF EE DE EE F0 00 - 14 
* 00 0F FF FF EE EE EE EF 00 00 - 15 
* 00 00 FF EE EE EE EE F0 00 00 - 16 
* 00 00 0F FF EE EF FF 00 00 00 - 17 
* 00 00 00 00 FF F0 00 00 00 00 - 18 
**************************************************
_000056.raw:
        opt     c 
        opt     ct
        opt     cd
        opt     cc
* Row 17 
        LEAU    128*16+7-3079,U    * A=, B=, X=, Y=
* Row 17 and row 18 700
* 00 00 0F FF EE EF FF 00 00 00 
* 00 00 00 00 FF F0 00 00 00 00 
        LDB     126,U           * A=, B=, X=, Y=
        ORB     #$F0            * A=, B=, X=, Y=
        STB     126,U           * A=, B=XX, X=, Y=
        LDB     -5,U            * A=, B=XX, X=, Y=
        ORB     #$0F            * A=, B=XX, X=, Y=
        STB     -5,U            * A=, B=XX, X=, Y=
        LDD     #$FFEE          * A=, B=XX, X=, Y=
        STD     -4,U            * A=FF, B=EE, X=, Y=
        STA     -1,U            * A=FF, B=EE, X=, Y=
        STA     125,U           * A=FF, B=EE, X=, Y=
        LDB     #$EF            * A=FF, B=EE, X=, Y=
        STB     -2,U            * A=FF, B=EF, X=, Y=
* Row 16 800
* 00 00 FF EE EE EE EE F0 00 00 
        LEAU    -128*1,U        * A=FF, B=EF, X=, Y=
        LDA     ,U              * A=FF, B=EF, X=, Y=
        ORA     #$F0            * A=FF, B=EF, X=, Y=
        STA     ,U              * A=XX, B=EF, X=, Y=
        LDD     #$FFEE          * A=XX, B=EF, X=, Y=
        STD     -5,U            * A=FF, B=EE, X=, Y=
        STB     -3,U            * A=FF, B=EE, X=, Y=
        STB     -2,U            * A=FF, B=EE, X=, Y=
        STB     -1,U            * A=FF, B=EE, X=, Y=
* Row 15 800
* 00 0F FF FF EE EE EE EF 00 00 
        LEAU    -128*1+1,U      * A=FF, B=EE, X=, Y=
        LDB     -7,U            * A=FF, B=EE, X=, Y=
        ORB     #$0F            * A=FF, B=EE, X=, Y=
        STB     -7,U            * A=FF, B=XX, X=, Y=
        LDB     #$FF            * A=FF, B=FF, X=, Y=
        LDX     #$EEEE          * A=FF, B=FF, X=EEEE, Y=
        LDY     #$EEEF          * A=FF, B=FF, X=EEEE, Y=EEEF
        PSHU    D,X,Y           * A=FF, B=FF, X=EEEE, Y=EEEF
* Row 14 800
* 00 FF FF FF FF EE DE EE F0 00 
        LEAU    -128*1+6,U      * A=FF, B=FF, X=EEEE, Y=EEEF
        LDA     ,U              * A=FF, B=FF, X=EEEE, Y=EEEF
        ORA     #$F0            * A=FF, B=FF, X=EEEE, Y=EEEF
        STA     ,U              * A=XX, B=FF, X=EEEE, Y=EEEF
        LDA     #$FF            * A=FF, B=FF, X=EEEE, Y=EEEF
        LDX     #$FFEE          * A=FF, B=FF, X=FFEE, Y=EEEF
        LDY     #$DEEE          * A=FF, B=FF, X=FFEE, Y=DEEE
        PSHU    D,X,Y           * A=FF, B=FF, X=FFEE, Y=DEEE
        LDB     #$FF            * A=FF, B=FF, X=FFEE, Y=DEEE
        STB     -1,U            * A=FF, B=FF, X=FFEE, Y=DEEE
...

Why did you write such a tool?

When I was young there was a cool Amiga Demo from 1988 that I’ve always enjoyed watching, called Demons are Forever.  The demo has a bunch of sprites moving around the screen and changing from a sphere to a demon character and back.  It also has some cool music that plays in the background.  I thought it would be cool if this demo could be played on a CoCo 3 and I would learn a lot about smooth sprite movement so I started working on replicating this demo for the CoCo 3.

While compiling the many sprites by hand I decided that this would be easier if there was a tool to do this.  It wasn’t an easy program to write since creating a compressed and stack blasted sprite is pretty complicated.  I wanted to make sure the program made very optimized code so the sprites are rendered as fast as possible.  That’s the main reason for compiling sprites in the first place.  I ended up with a program that does a VERY good job creating compiled/stack blasted sprites from a raw binary file.  Looking through the assembly output that it creates I’ve been able to hand tweak a few things myself to make it a tiny bit faster.  But those changes were very minimal.  I figured since I found this tool useful maybe others could use it when they are working on their games and want their sprites to be as fast as possible on screen.

What format should the input file be?

The tool takes raw binary data, the same data you would normally copy to the CoCo 3 screen to display your character.  Save the binary data that represents your character to a file.  You will need to know the width and height of your sprite to use it with the sprite compiler tool.

For example, the data file for my sprite above has these first four rows:

* 00 00 00 00 FF FF 00 00 00 00 - 1 
* 00 00 0F FF FF FF FF 00 00 00 - 2 
* 00 0F EE FF FF FF FF F0 00 00 - 3 
* 00 0E 99 EF FF FF FF FF 00 00 - 4 

If you examined the input file with a hex editor the start of the file would look like this:

00 00 00 00 FF FF 00 00 00 00 00 00 0F FF FF FF FF 00 00 00 00 0F EE FF FF FF FF F0 00 00 00 0E 99 EF FF FF FF FF 00 00

Things to consider before you draw your sprite character to be used with this sprite compiler:

  1. For transparency the tool always uses palette 0; any other value is treated as part of the sprite.
  2. If you can use palette number 15 = $F as the border/edge of your character then the compiled sprite will be drawn a little quicker.  The reason is a little hard to explain but here goes.  When an edge of your sprite occupies one nibble of a byte and the other nibble is transparent (value of zero), the compiled code must load the byte from the screen, AND it with a mask that clears the sprite’s nibble while keeping the background nibble, and then OR in your palette value.  If the palette is already $F then the OR sets all four bits of that nibble anyway, so the AND instruction isn’t needed.  Here is a little code to demonstrate this.

If the byte is $07, so the right nibble is using palette 7 and the left nibble is transparent, the code will look like this:

LDB     126,U   * Load B with whatever value is at this location
ANDB    #$F0    * AND the value so that we keep the transparent part 
ORB     #$07    * OR the value with our colour palette
STB     126,U   * Store the new value on the screen

If we use $0F for the edge colour/palette the code looks like this; no AND instruction is needed.

LDB     126,U   * Load B with whatever value is at this location
ORB     #$0F    * OR the value with our colour palette
STB     126,U   * Store the new value on the screen
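
The two sequences above can be checked with a quick Python model.  It tries every possible background byte and confirms that dropping the AND is only safe when the edge nibble is $F.  The function names are just for illustration; they are not part of the sprite compiler.

```python
def draw_edge_with_mask(screen_byte, keep_mask, sprite_bits):
    """Models LDB / ANDB #keep_mask / ORB #sprite_bits / STB."""
    return (screen_byte & keep_mask) | sprite_bits


def draw_edge_or_only(screen_byte, sprite_bits):
    """Models LDB / ORB #sprite_bits / STB (no AND instruction)."""
    return screen_byte | sprite_bits


# Palette $F in the right nibble: the OR alone gives the same result as AND + OR
# for every possible byte already on the screen.
f_is_safe = all(
    draw_edge_with_mask(b, 0xF0, 0x0F) == draw_edge_or_only(b, 0x0F)
    for b in range(256)
)

# Palette 7 in the right nibble: old screen bits can leak into the sprite's nibble
# if the AND is skipped, so the full sequence is required.
seven_is_safe = all(
    draw_edge_with_mask(b, 0xF0, 0x07) == draw_edge_or_only(b, 0x07)
    for b in range(256)
)

if __name__ == "__main__":
    print("OR-only is safe for $0F:", f_is_safe)
    print("OR-only is safe for $07:", seven_is_safe)
```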

What are the options for this tool?

While writing the sprite compiler tool I needed it to do many things and it seemed like every day I was adding a new feature to it.  It currently has quite a lot of options as shown below.  The limitation right now is that it only works with sprites for the CoCo 3 hi-res screen with 16 colours.  In this graphics mode each pixel is one nibble and that is what my sprite compiler currently handles.  The compiler can handle transparencies down to the pixel (nibble) level.

Usage:
cmpblsp -wX -hY [-oswX -oshY -owX -ohY -aa] [-as] [-odd] [-sX]
 [-rXXXX] [-rvX] [-nSubroutine] [-c] [cyc] [-done] spritename
Where:
-w - X is the width in bytes for the width of the sprite
-h - Y is the number of rows for the height of the sprite
-osw - Offset used - width in bytes of the entire data file
-osh - Offset used - height of the entire data file
-ow - X offset, from the left side to where the sprite data starts
-oh - Y offset, from the top row to where the sprite data starts
-as - Autocenter, usually the top left corner of the sprite is used
      as the starting point of the sprite, this makes it the middle
-aa - Autocenter point is the middle of the entire data file
-odd - Also make an odd version of the sprite, by shifting the data
       one nibble to the right, outputs a separate _odd.asm file
-s - Sets the screen width the sprite will be used with
     (256 or 320) defaults to 320
-r - Creates a restore sprite that loads the data behind the sprite
     from address offset XXXX which is a 4 digit hex number
-rv - Creates a restore sprite that stores the value X which is a hex
      value between 0 to F. This value is substituted wherever any
      non Zero byte is in the sprite data. This is useful if the
      background behind your sprite is always a known palette, such
      as black. This is faster then having to load the background.
-n - Name of the subroutine that is created, if not used the routine
     name created will be _spritename:
-c - Add comments that shows the values of the registers
-cyc - Add cycle count instructions for lwasm
-done - When the program is complete it wont wait for you to press
        a key, the program will exit and close this window
        (useful for scripts)
sprite - This is the name of the sprite to compile
         Output name will be sprite with .asm extension added

Besides creating a compiled/blasted sprite at the pixel level, the tool can also create a pixel shifted version of the sprite, which allows for smooth motion on the screen.  For example, say your sprite is drawn starting at pixel 100,100 or video RAM address $2440 (as an example) and you want it to move one pixel to the right.  You can’t just point your compiled sprite at $2441 because that is actually two nibbles, or pixels, to the right of where the last sprite was drawn.  You need to draw a different sprite, since the compiled sprite’s pixel data is only laid out for even pixel positions.  The -odd option creates a second compiled sprite with the same name and _odd.asm on the end of the filename.
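
To illustrate what shifting the data one nibble to the right means, here is a small Python sketch that shifts one row of 16 colour sprite data by a single pixel.  It is only a model of the idea, not the tool’s actual code, and it assumes the shifted row grows by one byte with transparent (0) pixels filling the gaps.

```python
def shift_row_right_one_pixel(row):
    """Shift a row of packed 4-bit pixels one pixel (one nibble) to the right.

    The result is one byte longer than the input: the first and last nibbles
    become transparent (palette 0).
    """
    out = bytearray(len(row) + 1)
    carry = 0  # low nibble pushed out of the previous byte
    for i, value in enumerate(row):
        out[i] = (carry << 4) | (value >> 4)
        carry = value & 0x0F
    out[len(row)] = carry << 4
    return bytes(out)


if __name__ == "__main__":
    even_row = bytes([0x00, 0x0F, 0xEE, 0xFF])
    odd_row = shift_row_right_one_pixel(even_row)
    print(even_row.hex(" ").upper())
    print(odd_row.hex(" ").upper())
```

Drawing the even version at one address and the odd version at the same address gives the one pixel step that simply moving the address by a byte cannot.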

Another thing the tool can do is create two different types of restore compiled sprites.  These can be used to restore the data behind a sprite as it moves across the screen.  The first and easiest restore sprite is useful if your background is all the same colour palette, for example all blue.  If the blue background palette is hex C then you would use the option -rvC.  The other type of restore sprite that the tool can make is a little more complicated.

Here is how it works:

If you have a complicated background graphics screen (more than one colour) you can have a second copy of the background stored somewhere else in RAM and copy that image data back under where your sprite was previously drawn.  The option is -rXXXX, where XXXX is a hex number giving the offset of the backup screen relative to the displayed screen.  If your backup screen is stored $6000 bytes forward in RAM then you would select the option -r6000.

The other advanced thing the tool does is allow you to select or crop the area that you want to use from your source data file.  This is useful if the data you have captured is bigger than the sprite.  You can set what parts of the data file are used for actual sprite data and make a compiled sprite only from that part.  For example, if you have some sprite data that is 21 bytes wide and 19 rows high but your character is only 5 bytes wide and 7 rows high and starts near the middle of the data then you would use the following options:  -w5 -h7 -osw21 -osh19 -ow10 -oh9
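
The cropping options map onto a simple slice of the raw data.  The Python sketch below shows one way to think about it; the function and parameter names are hypothetical (they just mirror -osw, -osh, -ow, -oh, -w and -h) and this is not the QB64 code from the tool.

```python
def crop_sprite(data, file_width, file_height, x_offset, y_offset, width, height):
    """Return the sprite rows cut out of a larger raw data file.

    All widths and X offsets are in bytes, heights and Y offsets in rows.
    """
    if len(data) < file_width * file_height:
        raise ValueError("data file is smaller than file_width * file_height")
    if x_offset + width > file_width or y_offset + height > file_height:
        raise ValueError("crop area does not fit inside the data file")
    rows = []
    for row in range(y_offset, y_offset + height):
        start = row * file_width + x_offset
        rows.append(data[start:start + width])
    return rows


if __name__ == "__main__":
    # Dummy 21 x 19 byte data file, cropped as in -w5 -h7 -osw21 -osh19 -ow10 -oh9.
    raw = bytes(i % 256 for i in range(21 * 19))
    for sprite_row in crop_sprite(raw, 21, 19, 10, 9, 5, 7):
        print(sprite_row.hex(" ").upper())
```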

Remember that the tool defaults to creating sprites for a 320 pixel wide screen.  If you need to make a sprite for a 256 pixel wide screen use the -s256 option.

What language is this program written in and how can I get it for my computer?

I wrote the program using QB64 which is a modern version of QBASIC that runs on Windows, Linux and Mac.  It is very similar to Color Basic for the CoCo.  If you’ve done programming on the CoCo in BASIC then I think you will feel right at home using QB64.  You can download QB64 from the link here.

Once QB64 is installed you can simply copy the text version of my program from my pastebin account here:

https://pastebin.com/h9s9eRn7

and paste it into QB64 and create an executable version of the program to use on your computer.  The program isn’t copyrighted, do whatever you want with it.  The code is quite messy but it gets the job done.  I just hope it helps others to create more cool games for the CoCo!

I think a useful improvement would be to add support for importing the picture data from standard file formats like BMP, GIF, PNG and TIFF, and maybe some of the graphic formats that are native to the CoCo 3 like Color Max files.  That would be better than the way it currently works, which requires a raw sprite data file.  QB64 does have built in graphic file import routines for many common formats so it might not be too hard to add.  The problem is that you must be sure to keep the palette info intact when importing a file from a standard image format.  Feel free to improve the code…

I’ve only used the sprite compiler for my own limited use and I think it is ready for release but there might still be some bugs in it.  If you find any bugs please post the details in the comments below.

I don’t currently have any plans to add other screen resolutions to the sprite compiler.  I might go back and revisit Space Invaders and make it work without the need to rotate your screen and then I might have a chance to add to the sprite compiler.  That is way in the future though…

Have fun,

Glen

 


How to make PMODE 4 CSM video files for the CoCo (TRS-80 Color Computer)

Hi All,

Using some awesome free tools, a bash script and a little C program I wrote, it’s possible to make PMODE 4 colour videos that play back at 23.3 fps with sound and have four colours: black, white, and the two artifact colours blue and red/orange.  Below I’m going to explain how this all works and how you can make your own.  As an example of what you can expect from this video conversion I’ve uploaded two YouTube videos showing the player on a CoCo 1.  The videos can be found here and here.  Don’t mind the mess on the floor; I usually have my CoCo 3 hooked up but wanted to show this on an actual CoCo model 1 with a 6309 CPU and 64k of RAM.

First we use a fairly recent version of FFMPEG, which is the most amazing video and audio conversion tool there is.  It will take just about any video or audio file as input and convert with filters and effects to many other formats.

The script starts by using FFMPEG to set the frames per second to 23.3, which the CSM player requires.  It also scales the input video width and height so that the aspect ratio is right for the CoCo PMODE 4 screen.  The output is 256×192 and FFMPEG automatically adds black bars if needed.  It outputs the scaled video as individual still frames/images and numbers them as it produces them.  The images are stored in a folder called pics as LZW compressed TIFF images.
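
The scaling with black bars is handled by FFMPEG itself, but the arithmetic behind it is easy to sketch.  The Python below is only an illustration of fitting a source frame inside 256×192 while keeping its aspect ratio; the function name is made up and it is not taken from makecsm.sh.

```python
def fit_with_bars(src_w, src_h, dst_w=256, dst_h=192):
    """Fit a src_w x src_h frame inside dst_w x dst_h, keeping the aspect ratio.

    Returns (new_w, new_h, pad_x, pad_y) where pad_x and pad_y are the widths
    of the black bars on the left and top (the rest goes right and bottom).
    """
    if src_w * dst_h >= src_h * dst_w:
        # Source is at least as wide as the target shape: width is the limit.
        new_w = dst_w
        new_h = (src_h * dst_w + src_w // 2) // src_w  # rounded division
    else:
        # Source is taller than the target shape: height is the limit.
        new_h = dst_h
        new_w = (src_w * dst_h + src_h // 2) // src_h
    return new_w, new_h, (dst_w - new_w) // 2, (dst_h - new_h) // 2


if __name__ == "__main__":
    print("1920x1080 ->", fit_with_bars(1920, 1080))
    print("640x480   ->", fit_with_bars(640, 480))
```

A 16:9 source ends up 256×144 with bars above and below, while a 4:3 source fills the whole PMODE 4 screen.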

Next the script uses a picture processing tool called ImageMagick.  ImageMagick is used to do the following things:

  1. Turns yellow pixels to white (ivory)
  2. Resizes the pictures to 128×192 (necessary for making artifacts)
  3. Handles picture levels (black, white and gamma controls)
  4. Normalizes the pictures as the pictures need to be fairly bright or they are hard to make out on the CoCo screen
  5. Remaps the colours of the original image to the black, white (blue, red/orange) colours.  It does this by looking at the palette of a 4 colour GIF file that represents the colours the CoCo 1 can display on a PMODE 4 screen
  6. Dithers that image to make it look like there are more colours on the screen
  7. Flips the image vertically and saves it as an uncompressed 16 colour bitmap (BMP) file.  The BMP format stores the rows of a picture bottom row first, so I flip the image before saving it so the data in the file is in top to bottom order and ready to be used as it is.

Next we use FFMPEG again to process the audio of the source video.  At this point we use FFMPEG to convert the audio of the movie file to a 1 channel, 11932 samples/second, unsigned 8 bit audio file.  FFMPEG could have been used to re-encode the video to stills and the audio at the same time but I like to do it separately so that it’s easier on the hard drive; otherwise you would be reading from the source video while writing the audio and generating the stills all at the same time.

At this point we use FFMPEG to take the still images and the 8 bit unsigned audio and convert them to a test.mp4 file that can be viewed on your regular computer.  This is a very good representation of how the video will look on the CoCo.  Watching the test.mp4 you can tell if you need to make the output video brighter or not.  This can be done using the -b option with the conversion script.

Next my little C program is executed.  It takes the BMP stills and goes through each pixel, converting it to a two bit value: black (00), orange (01), blue (10) or white (11).  Four of these values are joined together to make a byte of data, which is stored in the buffer.  It does that for the entire picture and then muxes the audio and new video data together in the format that Ed Snider’s player will accept.
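
The pixel packing step can be sketched in a few lines of Python.  This is not the C program itself; the names are made up, and it assumes the first pixel of each group of four lands in the top two bits of the byte.

```python
# Two bit colour codes as described above.
BLACK, ORANGE, BLUE, WHITE = 0b00, 0b01, 0b10, 0b11


def pack_row(codes):
    """Pack a row of 2-bit colour codes into bytes, four pixels per byte.

    Assumption: the leftmost pixel of each group goes into the top two bits.
    """
    if len(codes) % 4 != 0:
        raise ValueError("row length must be a multiple of 4 pixels")
    packed = bytearray()
    for i in range(0, len(codes), 4):
        value = 0
        for code in codes[i:i + 4]:
            value = (value << 2) | (code & 0b11)
        packed.append(value)
    return bytes(packed)


def pack_frame(rows):
    """Pack a whole frame (a list of rows of colour codes) into one buffer."""
    return b"".join(pack_row(row) for row in rows)


if __name__ == "__main__":
    print(pack_row([WHITE, BLACK, ORANGE, BLUE]).hex().upper())
    frame = [[BLACK] * 128 for _ in range(192)]  # one 128x192 still
    print(len(pack_frame(frame)), "bytes per frame")
```

A 128×192 still packs down to 32 bytes per row and 6144 bytes per frame, which is the size of a PMODE 4 screen.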

The last step is to join the CSM header file with the CSM PMODE 4 player code and the muxed audio/video file into one new .CSM file that is ready to be copied to an SD card and played back on the CoCo.

How do I make my own CSM videos?

You must install FFMPEG and ImageMagick on your computer.  If you are on a Mac the easiest way to install the command line tools is using homebrew found here.  Once homebrew is installed on your Mac type brew install ffmpeg and brew install imagemagick to install.  The BASH script and C program I wrote are ready to be used if you have a Mac.

It should work without issues on a Linux box and Cygwin/Mingw on a Windows box.  You will have to compile the program yourself and make sure to install FFMPEG and ImageMagick using your package manager.

Ron Klein is working on ready to go versions for Linux, RPI3 & Windows.  They will be available soon from the link below.

Ed Snider is hosting the files with his other CoCo SDC Media player software.  The Mac version is already available to use and can be found here:

https://www.mediafire.com/folder/20xt2l2k0160i/CoCo_SDC_Media_Player

The tools are in a folder called Tools for making CSM files.

Uncompress the .ZIP file and, using the command line in the makecsm folder, type:

./makecsm.sh -h

This will give you a summary of the options available for the conversion.  Below is the help you will see:

makecsm – CoCoSDC CSM Video Maker v 1.00
Usage: [-s hh:mm:ss] [-e hh:mm:ss] [-d seconds] [-n] [-h] [-t] [-b 0.00 to 10.00] inputfile [outputfile[.CSM]]

option: -s hh:mm:ss is the time in the video to start conversion
-e hh:mm:ss is the time in the video to end conversion
-d duration in seconds
-n means no artifact colour, make a black and white video
-b x.xx sets the brightness level of the video (1.00 is default)
-h Prints this help message
-t normally a test.mp4 file is created so you can see
the resulting movie on this computer before copying it to the CoCoSDC
this option will disable the creation of a test.mp4 movie

Example: To make a movie from the source movie called mymovie.mkv
Starting at 53 seconds into the video and ending at 1 minute and 30 seconds.
The duration of the video will be 37 seconds. Brighten the video a little
and use the output filename COCOVID.CSM
Command would be: makecsm.sh -s 00:00:53 -e 00:01:30 -b 1.1 mymovie.mkv COCOVID.CSM

If no start time or end time is given then the entire video will be converted
If only the start time is given then the conversion will start at the given time
and it will convert the video until the end of the video.
If only the end time is given then the conversion will start from the beginning
of the video up to the end time given.
If no output filename is given then an output file will be created in the
current folder with the extension .CSM

The output filename must be uppercase and be a maximum of 8 characters long,
with the extension .CSM or the CoCoSDC player will not recognize it.
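
If you script your conversions it can be handy to check the output name before starting a long encode.  Here is a tiny Python sketch of the filename rule from the help text.  It assumes the 8 character limit applies to the part before .CSM and that digits are allowed alongside uppercase letters; neither detail is spelled out by the script.

```python
import re

# Assumed rule: 1 to 8 uppercase letters or digits, followed by ".CSM".
_CSM_NAME = re.compile(r"[A-Z0-9]{1,8}\.CSM")


def is_valid_csm_name(name):
    """Return True if name looks like something the CoCoSDC player will list."""
    return _CSM_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("COCOVID.CSM", "cocovid.csm", "MYLONGMOVIE.CSM", "COCOVID.MP4"):
        print(candidate, "->", is_valid_csm_name(candidate))
```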

A few little notes on the options: you can select -d 100 (or any number of seconds for the duration) without the -s hh:mm:ss option and it will create a video that runs from the start of the source video for that number of seconds (100 seconds in this example).

You can make a black and white video without artifacts using the -n option.

One last thing to note

If your input or output video filenames have spaces then the script will probably fail.  It will be less troublesome if you move the source videos into the makecsm folder.

How to improve the video quality

The makecsm.sh script is pretty straightforward.  Other than FFMPEG creating the images at the correct size, all of the image processing is done with ImageMagick’s convert command.  If you look up help on ImageMagick there are tons of options.  Maybe going through these options you can find better settings to improve the output quality of the artifact colours.  It’s really hard to get yellows and greens on the CoCo screen so these should probably be converted to grey or white.  The current script converts yellow to ivory, which is better than converting it to red.  Feel free to tweak the command and if you come up with a really good setting please post it below in the comments.

Last little artifact problem to deal with

When the CoCo 1 or 2 is turned on there is no way to know if the even bits of a PMODE 4 screen make a blue colour or if it’s the odd bits that make the blue colour.  So I wrote a little BASIC program that I placed on Ed Snider’s PLAY.DSK image.  I called the program GO.BAS and when it is run it fills the screen with what should be the red/orange artifact colour.  The program then asks if the picture is blue; if so, hit reset and start the program again.  If the screen is orange/red then you press a key and it starts Ed’s CSM player.  A copy of GO.BAS is included with the script.  It will need to be copied to Ed’s PLAY.DSK image with imgtool, toolshed or a similar utility.

10 CLS
20 PMODE4,1:PCLS:SCREEN1,1
30 FOR X=1 TO 255 STEP 2
40 LINE(X,0)-(X,191),PSET
50 NEXTX
60 PRINT"IF THE SCREEN TURNED BLUE THEN PRESS RESET AND RUN THE PROGRAM AGAIN":PRINT
70 PRINT"IF THE SCREEN TURNED ORANGE/RED THEN PRESS ANY KEY TO START THE SDCM PLAYER"
80 I$=INKEY$
90 I$=INKEY$:IF I$="" THEN 90
100 LOADM"SDCM":EXEC&H5800

Have Fun,

Glen


Zilog z80 to Motorola 6809 Transcode – Part 025 – My z80 to 6809 program

To end this series on transcoding the z80 code to the 6809 I thought I should include my C program called z80_to_6809_15_Pacman.c, which is what I used to help with the transcode.  It takes a z80 disassembly as input and outputs what it thinks is a compatible 6809 instruction in its place.  It keeps the z80 source code to the right, which makes it easier when you are manually going through the code.  It’s not a very complicated program but it gets the job done.  The text input file must be formatted with the correct spacing, which you will have to play with if you want to use the program for your own projects.

You can find it here.

I hope these posts were helpful for anyone interested in the CoCo 3 or transcoding.

Cheers,

Glen
