Table of Contents
- Part 1 – Quick and Easy Changes to Speedup Your Code
- Part 2 – Speedup Storing Data – Unrolling Loops
- Part 3 – Stack Blasting and Self Modifying Code
- Part 4 – Odds and Sods – More Tricks
A lot of the time, assembly language programs are already fast enough. But if you want to make yours faster, or you are writing an arcade game, the following information will be helpful. I’m always trying to improve my assembly language programming skills and I’ve been keeping notes as I do more and more programming. In the next few blog posts I’ll share what I’ve learned, and I hope this will be a useful guide for anyone who wants to write super fast 6809 assembly programs.
This first part goes over some of the quick and easy things you can do to speed up your program. In fact, if you have a favourite old 6809 computer game and you want to speed it up, you could start with these changes. Of course, you would need to disassemble the game, make the changes you can, and then re-assemble the code. This would be a great learning experience. Maybe I’ll make another blog post showing how you could go about doing that.
Simply choosing to use certain instructions instead of others will speed up your program as shown below:
- CMPX is one byte shorter and one cycle faster than CMPY, CMPU and CMPD
If you have a lot of loops or counters and you use CMPX instead of the other CMP instructions, you save one CPU cycle every time that compare is executed in the loop. This can add up to a big speed difference in loops that run many times.
- LDX & LDU are one byte shorter and one cycle faster than LDY
- The same is true for STX & STU, which are one byte shorter and one cycle faster than STY
Again, if these are used inside loops, the savings add up quickly.
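As a rough sketch of how the register choice plays out in a loop, the two routines below copy a block of bytes and differ only in which register holds the source pointer. The labels SRCBEG, SRCEND and DEST are made-up names for illustration (SRCEND is assumed to be the address one past the last source byte), and the cycle counts are for a stock 6809.

```asm
* Version 1: source pointer in Y, so the end-of-loop test is CMPY
        LDY     #SRCBEG     * 4 cycles, 4 bytes (once, outside the loop)
        LDU     #DEST       * 3 cycles, 3 bytes
LOOP1   LDA     ,Y+         * 6 cycles
        STA     ,U+         * 6 cycles
        CMPY    #SRCEND     * 5 cycles, 4 bytes
        BNE     LOOP1       * 3 cycles -> 20 cycles per byte copied

* Version 2: source pointer in X, so the end-of-loop test is CMPX
        LDX     #SRCBEG     * 3 cycles, 3 bytes (once, outside the loop)
        LDU     #DEST       * 3 cycles, 3 bytes
LOOP2   LDA     ,X+         * 6 cycles
        STA     ,U+         * 6 cycles
        CMPX    #SRCEND     * 4 cycles, 3 bytes
        BNE     LOOP2       * 3 cycles -> 19 cycles per byte copied
```

Copying 1,024 bytes this way saves about 1,024 cycles, just from swapping Y for X.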
- TFR of 16 bit registers is slower than using LEA
TFR Y,U is 6 cycles and 2 bytes; this can be changed to LEAU ,Y which is 4 cycles and 2 bytes. This only works when the destination is X, Y, U or S, and note that LEAX and LEAY also update the Z flag, which TFR does not.
- LEA with an 8 bit offset or with A or B (5 cycles) is faster than LEA with a 16 bit offset or with D (8 cycles), and with constant offsets it is one byte shorter too. This is the same for all of LEAX, LEAU, LEAY & LEAS. Keep in mind that LEA treats 8 bit offsets as signed, so make sure your 8 bit values take that into account; for example, be careful changing from LEAU D,U to LEAU B,U
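The signed-offset caveat is easy to trip over, so here is a small sketch of it. These are three separate fragments rather than one routine, and the values 200 and 100 are arbitrary numbers chosen for illustration.

```asm
* Fragment 1: original code, B holds 200 ($C8) and A is cleared,
* so D = $00C8 and U moves forward by 200
        CLRA                * 2 cycles, clear the high byte of D
        LEAU    D,U         * 8 cycles, U = U + 200

* Fragment 2: naive change with the same B = $C8
* B is sign-extended, so $C8 is treated as -56
        LEAU    B,U         * 5 cycles, but U = U - 56 (wrong here)

* Fragment 3: safe, because the offset is known to be 0-127 ($00-$7F)
        LDB     #100        * 2 cycles
        LEAU    B,U         * 5 cycles, U = U + 100
```

So the swap from D to B is only a free win when the value in B can never reach $80 or above.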
Branches, jumps and subroutine calls are also quick and easy to change in existing code. Prefer them in this order, from first to last:
- BRA – 3 cycles, 2 bytes
- JMP – 4 cycles, 3 bytes
- LBRA – 5 cycles, 3 bytes – Only use if you want your code to be relocatable
- BSR – 7 cycles, 2 bytes
- JSR – 8 cycles, 3 bytes
- LBSR – 9 cycles, 3 bytes – Only use if you want your code to be relocatable
Another way to optimize your code is to make the most of the way you use jumps or branches to subroutines. If the second last instruction in your routine is a BSR, JSR or LBSR and your last instruction is an RTS you can change the BSR, JSR or LBSR to BRA, JMP or LBRA and remove your RTS command. The last called routine will return for you. For example:
            LDX     #$4000
            BSR     SAVEX       * 7 CPU cycles (2 bytes)
            RTS                 * 5 CPU cycles (1 byte)
    * These two lines require 12 cycles and
    * 3 bytes
            ...
    SAVEX   STX     ,U++
            RTS
Can be changed to:
            LDX     #$4000
            BRA     SAVEX       * 3 CPU cycles and two bytes
            ...
    SAVEX   STX     ,U++
            RTS
This is a savings of 9 CPU cycles and 1 byte.
A common trick for routines that use the stack to save your registers and accumulators is to add PC to the PULS at the end of your routine instead of using the RTS command. As shown:
    Code1   PSHS    D,X,Y
            LDD     #$0155
            STD     ,X++
            STD     ,Y++
            PULS    D,X,Y       * 5 + 6 CPU cycles, 2 bytes
            RTS                 * 5 CPU cycles, 1 byte
Can be changed to:
    Code1:  PSHS    D,X,Y
            LDD     #$0155
            STD     ,X++
            STD     ,Y++
            PULS    D,X,Y,PC    * 5 + 8 CPU cycles, 2 bytes
    * Adding PC restores the Program counter
    * which is saved on the stack when your
    * routine was called with BSR,JSR or LBSR
    * no RTS is needed.
This saves us 3 CPU cycles and 1 byte.
While we are talking about branching and its effect on the speed of your program, Steve Bamford makes a great point: to make your program execute faster, think about your program flow and arrange it so that a branch is taken only in the least likely case. This only matters for long conditional branches, as short branches take the same time whether or not they are taken. A long conditional branch that isn’t taken takes 5 cycles and one that is taken uses 6 cycles. Sure, it’s just 1 cycle, but if it is in a crucial loop in your code it can make a big difference.
For example, if A is most likely to be 1, then the code below will long branch to Ais1 most of the time, which means the CPU must add the branch offset to the PC and jump to it, which adds a cycle to the execution of your program.
    CheckA:
            CMPA    #1
            LBEQ    Ais1
    AisNot1:
            ; Code to handle A is not 1
            ...
    Ais1:
            ; Code to handle A is 1
            ...
The code would run faster if it was arranged like this:
    CheckA:
            CMPA    #1
            LBNE    AisNot1
    Ais1:
            ; Code to handle A is 1
            ...
    AisNot1:
            ; Code to handle A is not 1
            ...
Sockmaster mentions a similar method in the comments below
If you need to load both A and B registers use LDD, for example:
            LDA     #$20        * 2 CPU cycles and 2 bytes
            LDB     #$55        * 2 CPU cycles and 2 bytes
Can be changed to:
LDD #$2055 * 3 CPU cycles and 3 bytes
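This saves a cycle and a byte.
Even more speed and space are saved when a pair of LDA and LDB (or STA and STB) instructions that access two adjacent memory locations is changed to a single LDD (or STD). With extended addressing, for example: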
            LDA     $1F00       * 5 CPU cycles and 3 bytes
            LDB     $1F01       * 5 CPU cycles and 3 bytes
Can be changed to:
LDD $1F00 * 6 CPU cycles and 3 bytes
This results in a savings of 4 CPU cycles and 3 bytes
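The same idea applies to stores. As a minimal sketch, reusing the $1F00 address from the load example (it is only a placeholder), two adjacent 8 bit stores can be merged into one 16 bit store:

```asm
* Before: two 8 bit stores to adjacent addresses
        STA     $1F00       * 5 CPU cycles and 3 bytes
        STB     $1F01       * 5 CPU cycles and 3 bytes

* After: one 16 bit store, A goes to $1F00 and B goes to $1F01
        STD     $1F00       * 6 CPU cycles and 3 bytes
```

This also saves 4 CPU cycles and 3 bytes. It relies on A being the high byte of D, so the value meant for the lower address has to be in A.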
Another way to speed up your code: if you load values from the same location in memory many times, load the address into a register once and then use that register as a pointer. For example:
            LDA     $FF00       * 5 CPU Cycles, 3 bytes
Can be changed to:
            LDX     #$FF00      * 3 CPU Cycles, 3 bytes
            LDA     ,X          * 4 CPU Cycles, 2 bytes
If the LDA is inside a loop and the LDX is done once before the loop, then the second method will be faster.
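Here is a rough sketch of what that looks like in a polling loop. MASK and its value are hypothetical and only there to give the loop something to test; the cycle counts are for a stock 6809.

```asm
MASK    EQU     $01         * hypothetical bit to test, for illustration only

* Before: extended addressing on every pass
POLL1   LDA     $FF00       * 5 cycles, 3 bytes
        BITA    #MASK       * 2 cycles
        BNE     POLL1       * 3 cycles -> 10 cycles per pass

* After: load the address once, then read through X
        LDX     #$FF00      * 3 cycles, 3 bytes (once, outside the loop)
POLL2   LDA     ,X          * 4 cycles, 2 bytes
        BITA    #MASK       * 2 cycles
        BNE     POLL2       * 3 cycles -> 9 cycles per pass
```

The LDX costs 3 cycles up front and each pass saves 1, so the pointer version breaks even on the third pass and is ahead from the fourth pass onward. For a single load it is actually slower (7 cycles versus 5), so this only pays off when the load is repeated.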
If you know of any more quick ways to save some CPU cycles, please comment below and I will update this page with credit to you.
A great reference for cycle counts and the full 6809/6309 instruction set is Darren Atkinson’s Motorola 6809 and Hitachi 6309 Programmers Reference.
Part 2 will cover topics that are a little more complex and show how to make them faster.
Another good one is using ABX instead of LEA B,X. It Adds B (unsigned) to X and it’s only 3 cycles, 1 byte! Oftentimes I rearrange register usage in my code just to make sure ABX can be used as much as possible. Extra bonus, it’s only 1 cycle on the 6309!
I have already added this to my Part 2 which I’ll probably post tonight. I’ll add your note about rearranging register usage just to make sure ABX can be used as much as possible.
Please keep the tips coming if you find more…
Maybe mention that when addressing some CoCo hardware registers, STA is faster than CLR (5 cycles versus 7 with extended addressing) and has the same effect, e.g., STA $FFDE vs CLR $FFDE.
Sometimes programs can be made faster in general by finding ways to cut cycles from frequently executed code, even if it’s at the expense of adding cycles to infrequently executed code. I’m not sure what a good general example would be, but it often means making a performance sacrifice in one routine to reserve/preserve best optimization of the other.
On a related note, if you do have long conditional branches:
LBEQ taken (5 cycles not taken, 6 cycles taken)
Especially if infrequently taken, you can save 2 cycles when it’s not taken at the expense of 1 cycle when taken:
BNE not_taken (3 cycles not taken)
JMP taken (7 cycles taken)
Thanks for the ideas, Steve Bamford mentioned a similar thing on the CoCo list and I updated this post with an example.