Table of Contents
- Part 1 – Quick and Easy Changes to Speedup Your Code
- Part 2 – Speedup Storing Data – Unrolling Loops
- Part 3 – Stack Blasting and Self Modifying Code
- Part 4 – Odds and Sods – More Tricks
At the end of Part 2 I left off with a tease about using the stack to blast data onto the screen, ending with this example:
The fastest method is to use unrolled loops and push with the U stack pointer instead of using a store instruction. This routine uses 17,920 CPU cycles:
Mem   Code      Cycles  Running Total  Assembly Code (Mnemonics)
4073  CC0000                                   LDD   #$0000
4076  8E0000                                   LDX   #$0000
4079  3184      [4+0]                          LEAY  ,X
407B  CE4000                                   LDU   #$2000+$2000
                                       * This loop is 70 cycles to write 32 bytes
                                       * We cycle through the loop 256 times so the calculation is
                                       * 256 * 70 = 17,920 CPU Cycles
407E  3636      [5+6]   11             !       PSHU  D,X,Y
4080  3636      [5+6]   22                     PSHU  D,X,Y
4082  3636      [5+6]   33                     PSHU  D,X,Y
4084  3636      [5+6]   44                     PSHU  D,X,Y
4086  3636      [5+6]   55                     PSHU  D,X,Y
4088  3606      [5+2]   62                     PSHU  D
408A  11832000          67                     CMPU  #$2000
408E  22EE              70                     BHI   <
This brings us to one of the fastest ways to speed up loading and storing data on the 6809.
When you push data onto the U or S stack, the 6809 does it very quickly: a single instruction writes the data and moves the pointer. We could have made each PSHU instruction even more efficient by pushing more registers. For example, PSHU D,X,Y,S,DP also pushes the S and DP registers, which stores an extra 3 bytes for every PSHU instruction. This method of pushing data onto the stack is known as “Stack Blasting.”
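The gain from pushing more registers can be estimated from the cycle counts in the listing above, where each PSHU costs 5 cycles plus 1 cycle per byte pushed. The Python sketch below is only a back-of-the-envelope model, not 6809 code. The names REG_BYTES, pshu_cost and loop_cycles are made up for illustration, and the 8-cycle loop overhead is the CMPU plus BHI from the listing.

```python
# Byte sizes of the 6809 registers used in the examples.
REG_BYTES = {"D": 2, "X": 2, "Y": 2, "S": 2, "DP": 1}

def pshu_cost(regs):
    """Return (bytes_written, cycles) for one PSHU: 5 cycles + 1 per byte."""
    nbytes = sum(REG_BYTES[r] for r in regs)
    return nbytes, 5 + nbytes

def loop_cycles(pushes, overhead=8):
    """Bytes and cycles for one pass of an unrolled loop (overhead = CMPU + BHI)."""
    costs = [pshu_cost(p) for p in pushes]
    return sum(b for b, _ in costs), sum(c for _, c in costs) + overhead

# The loop from the listing: five PSHU D,X,Y and one PSHU D.
loop_bytes, loop_cyc = loop_cycles([["D", "X", "Y"]] * 5 + [["D"]])
total = (0x2000 // loop_bytes) * loop_cyc   # $2000 bytes of screen
print(loop_bytes, loop_cyc, total)

# Cycles per byte for a 6-byte push versus a 9-byte push.
for regs in (["D", "X", "Y"], ["D", "X", "Y", "S", "DP"]):
    nbytes, cycles = pshu_cost(regs)
    print(",".join(regs), nbytes, cycles, round(cycles / nbytes, 2))
```

Under this model the listing's loop comes out at 32 bytes and 70 cycles per pass, 17,920 cycles in total. The 9-byte push costs about 1.56 cycles per byte against about 1.83 for the 6-byte push. Keep in mind that pushing S means S is holding data rather than the system stack pointer.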
Stack Blasting is a little tricky to use because it stores bytes downwards in RAM. To write data to the screen using stack blasting, you load the U register with an address at the bottom right of the screen (the example above loads $4000, one past the last byte written, because PSHU decrements U before it stores). Every PSHU instruction stores the contents of the registers below the U pointer in memory and moves the U register down in memory. Because of this, you must arrange the data that is to be blasted on screen in the correct order ahead of time. The speed increase is definitely worth the effort.
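To see what "the correct order" means for a single push, here is a small Python model of PSHU. It is a sketch, not 6809 code, and the function pshu and the 12-byte pretend screen are made up for illustration. It relies on the 6809's fixed push order (PC, S, Y, X, DP, B, A, CC, with the low byte of each 16-bit register stored first), so that reading upwards in memory the registers always appear as CC, A, B, DP, X, Y, S, PC.

```python
PUSH_ORDER = ["PC", "S", "Y", "X", "DP", "B", "A", "CC"]
WIDE = {"PC", "S", "Y", "X"}  # 16-bit registers

def pshu(mem, u, regs):
    """Push the registers in `regs` (name -> value) below U; return the new U."""
    for name in PUSH_ORDER:
        if name not in regs:
            continue
        value = regs[name]
        u -= 1
        mem[u] = value & 0xFF              # low byte (or the whole 8-bit register)
        if name in WIDE:
            u -= 1
            mem[u] = (value >> 8) & 0xFF   # high byte lands at the lower address
    return u

screen = bytearray(12)   # pretend screen at addresses 0..11
u = 12                   # start one past the last screen byte
# PSHU D,X,Y twice; D is modelled as A (high byte) and B (low byte).
u = pshu(screen, u, {"A": 0x06, "B": 0x07, "X": 0x0809, "Y": 0x0A0B})
u = pshu(screen, u, {"A": 0x00, "B": 0x01, "X": 0x0203, "Y": 0x0405})
print(u, screen.hex())
```

The first push fills the last six bytes of the pretend screen and the second push fills the first six, so the data for the bottom of the screen has to be in the registers first.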
The other half of stack blasting is loading the registers with their data. The above code was fine for blasting zeros to the screen, but if you want to draw something you have to load the registers with the data before pushing it on screen. This is where you would use the S stack pointer and the PULS instruction. A PUL instruction loads the registers with the data in RAM and moves the stack pointer forward. You can pair PULS with PSHU or PULU with PSHS; either works the same way.
So you basically do something like this:
Draw_Backgrnd:
        PSHS    D,X,Y,DP
        STS     TempMem         * Save the Stack pointer
        LDS     #$5C3F          * Bottom right of the screen
        LDU     #$C000          * Address of data to copy to screen
Copy_bg1:
        PULU    D,X,Y,DP        * load the registers and move U forward
        PSHS    D,X,Y,DP        * Store the data and move S backwards
        CMPU    #$DC35          * Check if U has reached the end of data
        BLO     Copy_bg1        * if not keep copying
        LDS     TempMem         * Restore the Stack pointer
        PULS    D,X,Y,DP,PC
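The effect of this pull/push loop on data ordering can be modelled in a few lines of Python. This is a rough sketch under the assumption that the image size is a multiple of 7 bytes, and prepare, blast and the 21-byte image are made-up names and sizes, not part of the routine above. Because PULU D,X,Y,DP and PSHS D,X,Y,DP use the same register set, each 7-byte chunk keeps its internal byte order, but the chunks land on screen from the bottom up, so the source data has to be stored with its chunks in reverse order.

```python
CHUNK = 7   # PULU D,X,Y,DP and PSHS D,X,Y,DP each move 7 bytes

def prepare(image):
    """Reorder image data so that blasting it from the bottom up shows it correctly."""
    assert len(image) % CHUNK == 0
    chunks = [image[i:i + CHUNK] for i in range(0, len(image), CHUNK)]
    return b"".join(reversed(chunks))

def blast(data, screen, s):
    """Model Copy_bg1: U reads data upwards, S writes the screen downwards."""
    u = 0
    while u < len(data):               # CMPU #end / BLO Copy_bg1
        chunk = data[u:u + CHUNK]      # PULU D,X,Y,DP
        u += CHUNK
        s -= CHUNK                     # PSHS D,X,Y,DP
        screen[s:s + CHUNK] = chunk
    return s

image = bytes(range(21))               # what we want to see on screen
screen = bytearray(len(image))
s = blast(prepare(image), screen, len(screen))
print(s, bytes(screen) == image)
```

The prepare step would be done once ahead of time (for example by a build tool), so the routine itself pays nothing for the reordering.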
Stack Blasting data on the screen is a method used to quickly draw characters or sprites on the screen for games. The classic arcade game Defender uses a 6809 running at 1 MHz and is a super fast game, with lots of objects on the screen moving and changing all the time. Defender uses stack blasting for drawing all the ships and aliens on the screen…
Of course as discussed in Part 2 of this series, unrolling the stack blasting loop above will speed it up even more!
One last thing about putting data on the screen using the stack. If the data you need to copy to the screen, or anywhere in RAM (buffer space), is constantly changing, you can’t prepare it backwards ahead of time. In that case you can use the code below. It’s not as fast as the above method, but it’s pretty quick and might be useful.
        LDU     RAMPointer      * Source
        LDY     #Buffer+16      * Destination +16 because we start at -16,Y
        STS     RestoreS1+2     * Save the Stack pointer (self modify code)
Loop:
        PULU    D,X,S
        STD     -16,Y
        STX     -14,Y
        STS     -12,Y
        PULU    D,X,S
        STD     -10,Y
        STX     -8,Y
        STS     -6,Y
        PULU    D,X,S
        STD     -4,Y
        STX     -2,Y
        STS     ,Y
        PULU    D,X,S
        STD     2,Y
        STX     4,Y
        STS     6,Y
        PULU    D,X
        STD     8,Y
        STX     10,Y
        PULU    D,X
        STD     12,Y
        STX     14,Y
        LEAY    32,Y
        CMPY    #Buffer+16+$100 * Example copies $100 bytes
        BNE     Loop
RestoreS1:
        LDS     #$0000          * Restore the Stack saved from above
Adjust it according to the amount of data you need to copy. The above example copies 256 ($100) bytes. If you adjust the size for your needs, use S as little as possible, since an STS takes an extra cycle and an extra byte compared with an STX. As usual, if you only need to copy a small amount of data, you could unroll the loop completely to speed it up a little more.
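When adjusting the loop it is easy to get one of the store offsets wrong, so a quick model can help check that the offsets still cover every byte. The Python sketch below is not 6809 code. STEPS is a hand-copied table of the Y-relative offsets from the listing above, and copy_block is a made-up helper that replays the loop on a byte string.

```python
# Y-relative offsets stored after each PULU in the listing
# (one entry per 16-bit register pulled).
STEPS = [
    [-16, -14, -12],   # PULU D,X,S
    [-10, -8, -6],     # PULU D,X,S
    [-4, -2, 0],       # PULU D,X,S
    [2, 4, 6],         # PULU D,X,S
    [8, 10],           # PULU D,X
    [12, 14],          # PULU D,X
]

def copy_block(src, size, steps=STEPS):
    """Model the loop: U walks forward through src, Y starts at Buffer+16."""
    dest = bytearray(size)
    u, y = 0, 16
    while y != 16 + size:              # CMPY #Buffer+16+size / BNE Loop
        for offsets in steps:
            for off in offsets:
                dest[y + off:y + off + 2] = src[u:u + 2]
                u += 2
        y += 32                        # LEAY 32,Y
    return bytes(dest)

src = bytes(range(256))
result = copy_block(src, 0x100)
print(result == src)
```

If you change the number of PULU instructions per pass, the LEAY step and the starting offset need to change with it; a table like STEPS makes that easy to try out before editing the assembly.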
A little self modifying code can speed things up and save memory too…
This code shown previously:
Mem   Code      Cycles  Running Total  Assembly Code (Mnemonics)
4000                                   Draw_Backgrnd:
4000  343E      [5+7]   12                     PSHS  D,X,Y,DP
4002  10FF401D          19                     STS   TempMem        * Save the Stack pointer
4006  10CE5C3F          23                     LDS   #$5C3F
400A  CEC000            26                     LDU   #$C000
400D                                   Copy_bg1:
400D  373E      [5+7]   38                     PULU  D,X,Y,DP
400F  343E      [5+7]   50                     PSHS  D,X,Y,DP
4011  1183DC35          55                     CMPU  #$DC35
4015  25F6              58                     BLO   Copy_bg1
4017  10FE401D          65                     LDS   TempMem        * Restore the Stack pointer
401B  35BE      [5+9]   79                     PULS  D,X,Y,DP,PC
Can be changed to this:
Mem   Code      Cycles  Running Total  Assembly Code (Mnemonics)
4000                                   Draw_Backgrnd:
4000  343E      [5+7]   12                     PSHS  D,X,Y,DP
4002  10FF4019          19                     STS   Save_S_Here+2  * Save the Stack pointer
4006  10CE5C3F          23                     LDS   #$5C3F
400A  CEC000            26                     LDU   #$C000
400D                                   Copy_bg1:
400D  373E      [5+7]   38                     PULU  D,X,Y,DP
400F  343E      [5+7]   50                     PSHS  D,X,Y,DP
4011  1183DC35          55                     CMPU  #$DC35
4015  25F6              58                     BLO   Copy_bg1
4017                                   Save_S_Here:
4017  10CE0000          62                     LDS   #$0000         * Restore the Stack pointer
401B  35BE      [5+9]   76                     PULS  D,X,Y,DP,PC
Let me explain the changes to the STS and LDS lines above. First, the STS instruction saves the value of the S register in the code itself at memory location $4019, which is where the LDS instruction using immediate addressing will load its value from. The gain here is 3 cycles, and you no longer need to store the S value in the TempMem location, which saves two bytes of RAM.
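The patching can be pictured with a tiny Python model of the four bytes at Save_S_Here. This is an illustration only: sts_extended and lds_immediate are made-up helpers, and $7F00 is just a hypothetical stack pointer value. The cycle saving at the end is taken from the running totals in the two listings.

```python
mem = bytearray(0x10000)
SAVE_S_HERE = 0x4017
mem[SAVE_S_HERE:SAVE_S_HERE + 4] = bytes([0x10, 0xCE, 0x00, 0x00])  # LDS #$0000

def sts_extended(mem, addr, s):
    """STS addr: store S, high byte first, at addr and addr+1."""
    mem[addr] = s >> 8
    mem[addr + 1] = s & 0xFF

def lds_immediate(mem, pc):
    """Execute the LDS #imm found at pc and return the value loaded into S."""
    assert mem[pc:pc + 2] == b"\x10\xCE"
    return (mem[pc + 2] << 8) | mem[pc + 3]

caller_s = 0x7F00                                # hypothetical S on entry
sts_extended(mem, SAVE_S_HERE + 2, caller_s)     # STS Save_S_Here+2
restored = lds_immediate(mem, SAVE_S_HERE)       # the LDS, now patched
print(mem[SAVE_S_HERE:SAVE_S_HERE + 4].hex(), hex(restored))

# LDS TempMem takes 65 - 58 cycles, LDS #imm takes 62 - 58 cycles.
saved = (65 - 58) - (62 - 58)
print(saved)
```

The operand of the LDS doubles as the storage location, which is why the +2 offset (skipping the two opcode bytes $10 $CE) matters.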
Self modifying code makes following and debugging the code in the future a lot more difficult, so I would only use it if necessary. One routine where I do use the above method is audio sample playback in the FIRQ routine. When you play back sampled audio using the FIRQ you need the code to be as fast as possible, since it will be triggered thousands of times a second. This is an example FIRQ to play back audio samples on the 6809. The FIRQ uses the Timer and so only works on the CoCo 3. But I think this is a good example of when it’s necessary to make code as fast as possible and pull out all the stops!
First make sure the DP is set to the page containing the FIRQ routine; this will speed up the FIRQ too.
        LDA     #DirectPage/256
        TFR     A,DP
To save another cycle we make the FIRQ interrupt vector jump to the Sample playing routine:
        LDA     #$0E            * JMP opcode using DP addressing
        LDB     #FIRQ_Audio%256
        STD     $FEF4           * Point the FIRQ vector at FIRQ_Audio
Next you need to insert some code that adds your sample data, and make sure the sample data ends at address $8000. Store the starting address of the sample data at LoadAudio+1 by doing the following:
        LDX     #SampleStart    * Sample starting location in RAM
        STX     LoadAudio+1     * Store it where FIRQ will read
Then set up the FIRQ Timer to match the sample rate of your sound file, then enable the FIRQ.
Below is an example of an FIRQ routine to play an audio sample:
        ORG     $FA00           * Address of the Table and Data to be loaded
DirectPage:
        SETDP   DirectPage/256
*****************************
FIRQ_Audio:
        STA     <FIRQ_Audio_Restore+1   * Save A for restore after FIRQ routine is complete
        LDA     FIRQENR                 * Re enable the FIRQ
        INC     <LoadAudio+2            * Increment the LSB of the sample pointer
        BNE     LoadAudio               * jump ahead if LSB is not zero
        INC     <LoadAudio+1            * Increment the MSB of the sample pointer
        BPL     LoadAudio               * If we haven't hit $8000 then keep going
        LDA     #ReturnFIRQ%256         * Point the FIRQ to the RTI
        STA     $FEF5                   * This Sample playing has now ended
LoadAudio:
        LDA     $F9FF                   * Get next sample byte
        STA     $FF20                   * $FF20 - store to DAC - Play a sample 🙂
FIRQ_Audio_Restore:
* STA at the start of the FIRQ stores A's value, here we restore A before the RTI, saves a cycle and a byte of RAM
        LDA     #$00
ReturnFIRQ:
        RTI
The above code does a lot of self modifying:
- The first line saves the A accumulator’s value at FIRQ_Audio_Restore+1, which is loaded just before the end of the routine. This is necessary since the FIRQ does not save the registers automatically like the IRQ does, so we need to restore A’s value before the FIRQ returns.
- The first INC instruction modifies the LSB of the sample pointer directly at address LoadAudio+2
- The second INC instruction modifies the MSB of the sample pointer directly at address LoadAudio+1
- The STA $FEF5 changes the FIRQ vector to jump directly to the RTI instruction, so the playback will no longer be active until it is set up again in the main program
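The way the two INC instructions walk the sample pointer up to $8000 can be traced with a short Python model. This is a sketch, not 6809 code: firq_step is a made-up helper, and $7FFD is a hypothetical value for the pointer stored at LoadAudio+1.

```python
def firq_step(ptr):
    """One FIRQ: bump the 16-bit operand of LoadAudio; return (new_ptr, ended)."""
    msb, lsb = ptr >> 8, ptr & 0xFF
    ended = False
    lsb = (lsb + 1) & 0xFF            # INC <LoadAudio+2
    if lsb == 0:                      # BNE LoadAudio not taken
        msb = (msb + 1) & 0xFF        # INC <LoadAudio+1
        if msb & 0x80:                # BPL LoadAudio not taken: reached $8000
            ended = True              # vector is repointed at the RTI
    return (msb << 8) | lsb, ended

ptr = 0x7FFD                          # hypothetical value stored at LoadAudio+1
played = []
ended = False
while not ended:
    ptr, ended = firq_step(ptr)
    played.append(ptr)                # address read by the LDA in this interrupt
print([hex(a) for a in played])
```

In this model the pointer is incremented before the LDA, so the first byte read is one past the stored address, and the interrupt that detects the end still reads one last byte at $8000. That is worth keeping in mind when choosing the value to store at LoadAudio+1.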
See you in Part 4,