Optimizing 6809 Assembly Code: Part 3 – Stack Blasting and Self Modifying Code


At the end of Part 2 I left off with a teaser about using the stack to blast data onto the screen. I ended with this example:

The fastest method is to use unrolled loops and push registers onto the U stack instead of using store instructions. This routine uses 17,920 CPU cycles:

Mem  Code   Cycles Running Total            Assembly Code (Mnemonics)
4073 CC0000   [3]                             LDD     #$0000
4076 8E0000   [3]                             LDX     #$0000
4079 3184     [4+0]                           LEAY    ,X
407B CE4000   [3]                             LDU     #$2000+$2000
* This loop is 70 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 70 = 17,920 CPU Cycles
407E 3636     [5+6]   11      !               PSHU    D,X,Y
4080 3636     [5+6]   22                      PSHU    D,X,Y
4082 3636     [5+6]   33                      PSHU    D,X,Y
4084 3636     [5+6]   44                      PSHU    D,X,Y
4086 3636     [5+6]   55                      PSHU    D,X,Y
4088 3606     [5+2]   62                      PSHU    D
408A 11832000 [5]     67                      CMPU    #$2000
408E 22EE     [3]     70                      BHI     <

This brings us to one of the fastest ways to speed up loading and storing data on the 6809.

When you push data onto the U or S stack, the 6809 does it very quickly, and as a bonus it writes the data and moves the pointer in a single instruction: as the listing above shows, a PSHU costs 5 cycles plus 1 cycle per byte pushed. We could have used more registers in each PSHU instruction to make it even faster. For example, we could also push the S and DP registers with PSHU D,X,Y,S,DP, which stores an extra 3 bytes per PSHU instruction. This method of pushing data onto the stack is known as “Stack Blasting.”

Stack Blasting is a little tricky to use because it stores bytes downwards in RAM. Each PSHU first moves U down and then stores the registers, so the data ends up just below where U was pointing. To fill the screen you therefore load U with the address one byte past the bottom right of the screen (the example above loads $4000 to fill $2000-$3FFF). Because of this you must arrange the data that is to be blasted on screen in the correct order ahead of time: within one push the registers always land in a fixed order (for PSHU D,X,Y that is A, B, X, Y from the lowest address up), but each push lands below the previous one. The speed increase is definitely worth the effort.
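
To make the downward direction concrete, here is a small sketch of where the bytes of two pushes end up. The register values are arbitrary and only there for illustration; $4000 is assumed to be one byte past the end of the $2000-$3FFF screen from the first example.

        LDU     #$4000          * One byte past the last byte we want to write
        LDD     #$1122          * Arbitrary test values
        LDX     #$3344
        PSHU    D,X             * U is now $3FFC
* Memory after the first push:
*   $3FFC = $11  (A)
*   $3FFD = $22  (B)
*   $3FFE = $33  (X high byte)
*   $3FFF = $44  (X low byte)
        PSHU    D,X             * U is now $3FF8
* The same four bytes now also sit at $3FF8-$3FFB, below the first group

So the groups march down through memory, while the bytes inside each group still read A, B, X from low to high.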

The other half of stack blasting is loading the registers with their data. The above code was fine for blasting zeros to the screen, but if you want to draw something you have to load the registers with the data before pushing it onto the screen. This is where you use the other stack pointer and a pull instruction. A pull loads the registers with the data in RAM and moves its pointer forward. You can pair PULS with PSHU or PULU with PSHS; either works the same way.

So you basically do something like this:

Draw_Backgrnd:
        PSHS    D,X,Y,DP
        STS     TempMem     * Save the Stack pointer
        LDS     #$5C3F      * Bottom right of the screen
        LDU     #$C000      * Address of data to copy to screen
Copy_bg1:
        PULU    D,X,Y,DP    * load the registers and move U forward
        PSHS    D,X,Y,DP    * Store the data and move S backwards
        CMPU    #$DC35      * Check if U has reached the end of data
        BLO     Copy_bg1    * if not keep copying
        LDS     TempMem     * Restore the Stack pointer
        PULS    D,X,Y,DP,PC

Stack Blasting data onto the screen is a method used to quickly draw characters or sprites for games. The classic arcade game Defender uses a 6809 running at 1 MHz and is a super fast game, with lots of objects on the screen moving and changing all the time. Defender uses stack blasting for drawing all the ships and aliens on the screen…

[Image: Defender screen]

Of course, as discussed in Part 2 of this series, unrolling the stack blasting loop above will speed it up even more!
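
As a rough sketch of that unrolling, the loop could copy four 7-byte chunks per pass. The labels BgData, BgDataEnd, ScreenEnd and SavedS are placeholders made up for this sketch, not part of the original routine, and it assumes the data length is an exact multiple of 28 bytes. Using the same cycle counts as the listings in this article, one pass costs 4 x 24 + 8 = 104 cycles for 28 bytes, compared with 4 x 32 = 128 cycles for the single-pair loop.

Draw_Bg_Unrolled:
        PSHS    D,X,Y,DP        * Save the registers the loop will overwrite
        STS     >SavedS         * Save the real stack pointer (extended addressing)
        LDS     #ScreenEnd      * Hypothetical: one byte past the last screen byte
        LDU     #BgData         * Hypothetical: start of the pre-arranged data
Copy_bg4:
        PULU    D,X,Y,DP        * [12] chunk 1: load 7 bytes, U moves up
        PSHS    D,X,Y,DP        * [12]          store 7 bytes, S moves down
        PULU    D,X,Y,DP        * [12] chunk 2
        PSHS    D,X,Y,DP        * [12]
        PULU    D,X,Y,DP        * [12] chunk 3
        PSHS    D,X,Y,DP        * [12]
        PULU    D,X,Y,DP        * [12] chunk 4
        PSHS    D,X,Y,DP        * [12]
        CMPU    #BgDataEnd      * [5]  Hypothetical: first byte after the data
        BLO     Copy_bg4        * [3]  104 cycles per 28 bytes
        LDS     >SavedS         * DP holds screen data here, so force extended addressing
        PULS    D,X,Y,DP,PC     * Restore the registers and return
SavedS:
        RMB     2               * Hypothetical 2-byte slot for the saved S

The > on SavedS is there because DP is being used as a data register inside the loop, so direct page addressing cannot be trusted until DP has been restored.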

One last thing about putting data on the screen using the stack. If the data you need to copy to the screen, or anywhere in RAM (buffer space), is constantly changing, you can’t prepare it backwards ahead of time. In that case you can use the code below, which copies the data in its original order. It’s not as fast as the above method, but it’s pretty quick and might be useful.

   LDU     RAMPointer      * Source
   LDY     #Buffer+16      * Destination +16 because we start at -16,Y
   STS     RestoreS1+2     * Save the Stack pointer (self modify code)
Loop:
   PULU    D,X,S
   STD     -16,Y
   STX     -14,Y
   STS     -12,Y
   PULU    D,X,S
   STD     -10,Y
   STX     -8,Y
   STS     -6,Y
   PULU    D,X,S
   STD     -4,Y
   STX     -2,Y
   STS     ,Y
   PULU    D,X,S
   STD     2,Y
   STX     4,Y
   STS     6,Y
   PULU    D,X
   STD     8,Y
   STX     10,Y
   PULU    D,X
   STD     12,Y
   STX     14,Y
   LEAY    32,Y
   CMPY    #Buffer+16+$100   * Example copies $100 bytes
   BNE     Loop
RestoreS1:
   LDS     #$0000    * Restore the Stack saved from above

Adjust it according to the amount of data you need to copy. The above example copies 256 ($100) bytes. If you adjust the size for your needs, keep in mind that you should use S as little as possible, since an STS takes an extra cycle and an extra byte compared with an STX. As usual, if it’s a small amount of data you need to copy, you could unroll the loop completely to speed it up a little more.
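
One caution applies whenever S is borrowed as a data register like this: an IRQ or FIRQ that arrives in the middle of the copy pushes its registers wherever S happens to point. Below is a minimal sketch of guarding a small copy by masking both interrupts, assuming the main program can tolerate them being held off for a few dozen cycles. Copy8, Src and Dst are names made up for this sketch.

Copy8:
        PSHS    CC,D,X,Y,U      * Save registers, including the interrupt masks in CC
        ORCC    #$50            * Set the I and F bits: IRQ and FIRQ are masked
        STS     Copy8_S+2       * Save S inside the LDS below (self modifying)
        LDU     #Src            * Hypothetical source address
        PULU    D,X,Y,S         * Load 8 bytes; S is only a data register here
        STD     Dst             * Hypothetical destination address
        STX     Dst+2
        STY     Dst+4
        STS     Dst+6
Copy8_S:
        LDS     #$0000          * Operand was overwritten with the saved S
        PULS    CC,D,X,Y,U,PC   * Restore registers and masks, then return

Because CC is saved before the ORCC and restored by the final PULS, the interrupt masks come back exactly as the caller had them.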

A little self modifying code can speed things up and save memory too…

This code shown previously:

Mem  Code    Cycles Running Total     Assembly Code (Mnemonics)
4000                        Draw_Backgrnd:
4000 343E     [5+7]   12              PSHS    D,X,Y,DP
4002 10FF401D [7]     19              STS     TempMem     * Save the Stack pointer
4006 10CE5C3F [4]     23              LDS     #$5C3F
400A CEC000   [3]     26              LDU     #$C000
400D                        Copy_bg1:
400D 373E     [5+7]   38              PULU    D,X,Y,DP
400F 343E     [5+7]   50              PSHS    D,X,Y,DP
4011 1183DC35 [5]     55              CMPU    #$DC35
4015 25F6     [3]     58              BLO     Copy_bg1
4017 10FE401D [7]     65              LDS     TempMem     * Restore the Stack pointer
401B 35BE     [5+9]   79              PULS    D,X,Y,DP,PC

Can be changed to this:

Mem  Code    Cycles Running Total     Assembly Code (Mnemonics)
4000                       Draw_Backgrnd:
4000 343E     [5+7]   12              PSHS    D,X,Y,DP
4002 10FF4019 [7]     19              STS     Save_S_Here+2 * Save the Stack pointer
4006 10CE5C3F [4]     23              LDS     #$5C3F
400A CEC000   [3]     26              LDU     #$C000
400D                       Copy_bg1:
400D 373E     [5+7]   38              PULU    D,X,Y,DP
400F 343E     [5+7]   50              PSHS    D,X,Y,DP
4011 1183DC35 [5]     55              CMPU    #$DC35
4015 25F6     [3]     58              BLO     Copy_bg1
4017                         Save_S_Here:
4017 10CE0000 [4]     62              LDS     #$0000        * Restore the Stack pointer
401B 35BE     [5+9]   76              PULS    D,X,Y,DP,PC

Let me explain the changes above, which are the STS and LDS lines. First, the STS instruction saves the value of the S register in the code itself, at memory location $4019, which is where the LDS instruction using immediate addressing will load its value from. The gain here is 3 cycles, and you no longer need to store the S value in the TempMem location, which saves two bytes of RAM.
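
The +2 in Save_S_Here+2 depends on the instruction being patched. As a short sketch, with labels and values invented purely for illustration: LDX # has a one-byte opcode ($8E), so its operand starts one byte in, while LDS # and LDY # carry a $10 prefix byte, so their operands start two bytes in.

        STX     RestoreX+1      * LDX # assembles to 8E hh ll
        STY     RestoreY+2      * LDY # assembles to 10 8E hh ll
        LDX     #$1234          * Stand-in for work that overwrites X
        LDY     #$5678          * Stand-in for work that overwrites Y
RestoreX:
        LDX     #$0000          * hh ll was replaced by the saved X
RestoreY:
        LDY     #$0000          * hh ll was replaced by the saved Y

Getting this offset wrong overwrites an opcode byte instead of the operand, so it is worth checking against the assembled listing.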

Self modifying code makes following and debugging the code in the future a lot more difficult, so I would only use it if necessary. One routine where I do use the above method is audio sample playback in the FIRQ routine. When you play back sampled audio using the FIRQ, the code needs to be as fast as possible since it will be triggered thousands of times a second. This is an example FIRQ routine to play back audio samples on the 6809. The FIRQ uses the Timer, which is only available on the CoCo 3. But I think this is a good example of when it’s necessary to make code as fast as possible and pull out all the stops!

First make sure the DP register points to the page that contains the FIRQ routine. This lets the routine use direct addressing, which speeds up the FIRQ too.

        LDA     #DirectPage/256
        TFR     A,DP

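To save another cycle we make the FIRQ interrupt vector jump to the sample playing routine with a direct page JMP (3 cycles) instead of an extended JMP (4 cycles):

        LDA     #$0E                * JMP opcode using DP addressing
        LDB     #FIRQ_Audio%256     * Low byte of the routine's address
        STD     $FEF4               * Write JMP <FIRQ_Audio into the FIRQ vector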

Next you need to load your sample data into memory, making sure the sample data ends at address $8000. Then store the starting address of the sample in the operand of the LDA instruction at LoadAudio, which is located at LoadAudio+1, as follows:

        LDX     #SampleStart        * Sample starting location in RAM
        STX     LoadAudio+1         * Store it where FIRQ will read

Then set up the FIRQ Timer to match the sample rate of your sound file, and enable the FIRQ.

Below is an example of an FIRQ routine to play an audio sample:

        ORG     $FA00           * Address of the Table and Data to be loaded
DirectPage:
        SETDP   DirectPage/256
*****************************
FIRQ_Audio:
        STA     <FIRQ_Audio_Restore+1 * Save A for restore after FIRQ routine is complete
        LDA     FIRQENR        * Re-enable the FIRQ
        INC     <LoadAudio+2   * Increment the LSB of the sample pointer
        BNE     LoadAudio      * jump ahead if LSB is not zero
        INC     <LoadAudio+1   * Increment the MSB of the sample pointer
        BPL     LoadAudio      * If we haven't hit $8000 then keep going
        LDA     #ReturnFIRQ%256 * Point the FIRQ to the RTI
        STA     $FEF5           * This Sample playing has now ended
LoadAudio:
        LDA     $F9FF          * Get next sample byte
        STA     $FF20          * $FF20 - store to DAC - Play a sample 🙂
FIRQ_Audio_Restore:
        LDA     #$00           * Restore A (operand patched by the STA at the start); saves a cycle and a byte of RAM
ReturnFIRQ:
        RTI

The above code does a lot of self modifying:

The first line saves the A accumulator’s value at FIRQ_Audio_Restore+1, from where it is loaded again just before the end of the routine. This is necessary since the FIRQ does not save the registers automatically like the IRQ does, so we need to restore A’s value ourselves before the FIRQ finishes.
The first INC instruction modifies the LSB of the sample pointer directly at address LoadAudio+2.
The second INC instruction modifies the MSB of the sample pointer directly at address LoadAudio+1.
The STA $FEF5 changes the FIRQ vector to jump directly to the RTI instruction, so playback is no longer active until it is set up again in the main program.

See you in Part 4,

Glen

4 Responses to Optimizing 6809 Assembly Code: Part 3 – Stack Blasting and Self Modifying Code

Armand says:

October 12, 2018 at 1:23 am

Hi there, I have been trying stack blasting on a 6809 computer, but it only works with one register; the PULU/PSHS pair renders in a weird order, which makes the picture … weird!
But thanks a lot anyway for this trick. I was thinking about writing it myself, but having the piece of code ready to go saved me some time 🙂

- Armand says:
  
  October 12, 2018 at 1:25 am
  
  I forgot to say that I ended up with PULU D, PSHS D instead of PULU D,X,Y,….
  I’ll try to analyse more deeply what’s going on
  
Armand says:

September 10, 2022 at 6:45 pm

Hello, for some reason, some compilers expect the registers to be pushed in reverse order:
PULU X,Y,DP
PSHS DP,X,Y
cheers,

Armand

- nowhereman999 says:
  
  September 10, 2022 at 7:04 pm
  
  Hi Armand,
  
  A good assembler should take care of the order by itself. The 6809 always does a push or pull in the same order no matter what order you write them in your code.
  
  In memory, from the lowest address to the highest, the registers of a PSHS/PULS are always laid out as:
  CC,A,B,DP,X,Y,U,PC
  
  If you do a PSHU/PULU:
  CC,A,B,DP,X,Y,S,PC
  
  A pull reads the registers in that order. A push writes them in the reverse order (PC first, CC last) while the pointer moves down, so the layout in memory is the same.
  
  For example if you do a PSHS Y,D
  In memory the S register will be changed to S=S-4 and @ S the CPU will store D and @ S+2 the CPU will store Y
  The exact same thing will happen if you do a PSHS D,Y
  In memory the S register will be changed to S=S-4 and @ S the CPU will store D and @ S+2 the CPU will store Y
  
  If you do a PULS Y,D then
  D=value at S, S=S+2, Y=value at S, S=S+2
  again same thing if you do a PULS D,Y
  D=value at S, S=S+2, Y=value at S, S=S+2
  
  You must always remember the order for stack that the CPU will always use is:
  CC,A,B,DP,X,Y,U/S,PC
  No matter what order you write in your code.
  
  I hope that helps,
  Glen