Optimizing 6809 Assembly Code: Part 1 – Quick and Easy Changes to Speedup Your Code


A lot of the time, assembly language programs are already fast enough. But if you want to make one faster, or you are writing an arcade game, then the following information will be helpful. I’m always trying to improve my assembly language programming skills and I’ve been keeping notes as I do more and more programming. In the next few blog posts I’ll share what I’ve learned, and I hope this will be a useful guide for anyone who wants to write super fast 6809 assembly programs.

This first part goes over some of the quick and easy things you can do to speed up your program. In fact, if you have a favourite old 6809 computer game and you want to speed it up, you could start with these changes. Of course, you would need to disassemble the game, make the changes you can, and then re-assemble the code. This would be a great learning experience. Maybe I’ll write another blog post showing how you could go about doing that.

Simply choosing to use certain instructions instead of others will speed up your program as shown below:

CMPX is one byte shorter and one cycle faster than CMPY, CMPU and CMPD, because those three need an extra prefix byte in their opcode

If you have a lot of loops or counters and you use CMPX instead of the other CMP instructions, you save one CPU cycle every time the compare executes. In a loop that runs thousands of times, that adds up to a noticeable speed difference.
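
As a rough illustration of a compare inside a loop, here is a minimal sketch that clears a block of memory using X as the pointer. The addresses and the CLRLP label are made up for this example, not taken from any real program.

```asm
* Hypothetical example: clear 512 bytes from $0400 to $05FF
        LDX     #$0400       * 3 cycles, 3 bytes
        LDD     #$0000       * 3 cycles, 3 bytes
CLRLP   STD     ,X++         * 8 cycles, 2 bytes
        CMPX    #$0600       * 4 cycles, 3 bytes
                             * CMPY #$0600 would be 5 cycles, 4 bytes
        BNE     CLRLP        * 3 cycles, 2 bytes
```

The loop runs 256 times, so using X instead of Y as the pointer saves 256 cycles on the compare alone.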

LDX & LDU are one byte shorter and one cycle faster than LDY. The same is true for STX & STU, which are one byte shorter and one cycle faster than STY

Again, if these are used inside loops, the saved cycles add up quickly.

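TFR between 16 bit index registers is slower than using LEA

   TFR    Y,U is 6 cycles and 2 bytes; this can be changed to
   LEAU   ,Y which is 4 cycles and 2 bytes

This only works when the destination is X, Y, U or S. Also note that LEAX and LEAY update the Z flag, while TFR, LEAU and LEAS leave the flags alone.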

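LEA with an 8 bit constant offset, or with A or B as the offset, takes 5 cycles, which is faster than LEA with a 16 bit constant offset or with D as the offset (8 cycles). The 8 bit constant form is also one byte shorter than the 16 bit constant form. This is the same for LEAX, LEAY, LEAU and LEAS. Keep in mind that LEA treats the offset as a signed value, so an 8 bit offset only covers -128 to +127. For example, be careful changing LEAU D,U to LEAU B,U, because any B value of $80 or above will be subtracted instead of added.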
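
To make the signed offset pitfall concrete, here is a small sketch. The starting address $2000 and the value $90 are arbitrary values chosen for illustration.

```asm
* B is treated as signed: $90 is 144 unsigned but -112 signed
        LDU     #$2000
        LDB     #$90
        LEAU    B,U          * 5 cycles, U = $2000 - 112 = $1F90

* If B must be added as an unsigned value, clear A and keep the D form
        LDU     #$2000
        LDB     #$90
        CLRA                 * 2 cycles, 1 byte
        LEAU    D,U          * 8 cycles, U = $2000 + $0090 = $2090
```

So the faster B form is only a safe swap when you know B stays in the range $00 to $7F, or when you really do want a signed offset.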

Quick and easy things to change in existing code. In each group, prefer the instructions in order from first (fastest) to last:

BRA – 3 cycles, 2 bytes
JMP – 4 cycles, 3 bytes
LBRA – 5 cycles, 3 bytes – Only use if you want your code to be relocatable

BSR – 7 cycles, 2 bytes
JSR – 8 cycles, 3 bytes
LBSR – 9 cycles, 3 bytes – Only use if you want your code to be relocatable

Another way to optimize your code is to make the most of the way you jump or branch to subroutines. If the second-to-last instruction in your routine is a BSR, JSR or LBSR and the last instruction is an RTS, you can change the BSR, JSR or LBSR to a BRA, JMP or LBRA and remove the RTS. The RTS of the routine you jump to will return to your original caller for you. For example:

        LDX     #$4000
        BSR     SAVEX        * 7 CPU cycles (2 bytes)
        RTS                  * 5 CPU cycles (1 byte)
                             * These two lines require 12 cycles and
                             * 3 bytes
        ...
SAVEX   STX     ,U++
        RTS

Can be changed to:

        LDX     #$4000
        BRA     SAVEX        * 3 CPU cycles and two bytes
        ...
SAVEX   STX     ,U++
        RTS

This is a savings of 9 CPU cycles and 1 byte.

A common trick for routines that use the stack to save registers and accumulators is to add PC to the final PULS instead of following it with an RTS, as shown:

Code1   PSHS    D,X,Y
        LDD     #$0155
        STD     ,X++
        STD     ,Y++
        PULS    D,X,Y    * 5 + 6 CPU cycles, 2 bytes
        RTS              * 5 CPU cycles, 1 byte

Can be changed to:

Code1   PSHS    D,X,Y
        LDD     #$0155
        STD     ,X++
        STD     ,Y++
        PULS    D,X,Y,PC * 5 + 8 CPU cycles, 2 bytes
                         * Adding PC restores the Program counter
                         * which is saved on the stack when your
                         * routine was called with BSR,JSR or LBSR
                         * no RTS is needed.

This saves us 3 CPU cycles and 1 byte.

While we are talking about branching and its effect on the speed of your program: Steve Bamford makes a great point that you can make your program execute faster by thinking about its flow and branching only in the least likely case. This only matters for long conditional branches, as a short conditional branch takes 3 cycles whether it is taken or not. A long conditional branch takes 5 cycles when it isn’t taken and 6 cycles when it is. Sure, it’s just 1 cycle, but in a crucial loop in your code it can make a real difference.

For example, if A is most likely to be 1, then the code below will long branch to Ais1 most of the time. Each time it does, the CPU must add the branch offset to the PC, which adds a cycle to the execution of your program.

CheckA:
        CMPA    #1
        LBEQ    Ais1
AisNot1:
;       Code to handle A is not 1
        ...
Ais1:
;       Code to handle A is 1
        ...

The code would run faster if it was arranged as this:

CheckA:
        CMPA    #1
        LBNE    AisNot1
Ais1:
;       Code to handle A is 1
        ...
AisNot1:
;       Code to handle A is not 1
        ...

Sockmaster mentions a similar method in the comments below.

If you need to load both the A and B registers, use LDD. For example:

    LDA   #$20     * 2 CPU cycles and 2 bytes
    LDB   #$55     * 2 CPU cycles and 2 bytes

Can be changed to:

    LDD   #$2055   * 3 CPU cycles and 3 bytes

This saves a cycle and a byte.

Even more speed and space are saved when an LDA and LDB pair, or an STA and STB pair, that accesses two adjacent memory addresses is changed to a single LDD or STD:

    LDA   $1F00    * 5 CPU cycles and 3 bytes
    LDB   $1F01    * 5 CPU cycles and 3 bytes

Can be changed to:

    LDD   $1F00    * 6 CPU cycles and 3 bytes

This results in a savings of 4 CPU cycles and 3 bytes
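
The same change applies to stores. Here is a minimal sketch that reuses the addresses from the load example and assumes A and B already hold the two bytes to write:

```asm
        STA     $1F00    * 5 CPU cycles and 3 bytes
        STB     $1F01    * 5 CPU cycles and 3 bytes

* Can be changed to:

        STD     $1F00    * 6 CPU cycles and 3 bytes
                         * A goes to $1F00 and B goes to $1F01
```

As with the loads, this saves 4 CPU cycles and 3 bytes.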

Another way to speed up your code: if you load values from the same location in memory many times, load the address into an index register once and use that register as a pointer. For example:

    LDA    $FF00   * 5 CPU Cycles, 3 bytes

Can be changed to:

    LDX   #$FF00    * 3 CPU Cycles, 3 bytes
    LDA   ,X        * 4 CPU Cycles, 2 bytes

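On its own the second method is slower, because of the 3 cycles spent on the LDX. But if the LDA is inside a loop and the LDX is done once before the loop starts, every pass through the loop saves 1 cycle and 1 byte, so the second method will be faster.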
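
To see the pointer version pay off inside a loop, here is a minimal sketch that reads $FF00 one hundred times and stores each value into a buffer. The buffer address $3000, the count of 100 and the SAMPLE label are made up for this example.

```asm
        LDX     #$FF00       * 3 cycles, 3 bytes, paid once
        LDU     #$3000       * hypothetical buffer address
        LDB     #100         * hypothetical number of reads
SAMPLE  LDA     ,X           * 4 cycles, 2 bytes
                             * LDA $FF00 would be 5 cycles, 3 bytes
        STA     ,U+          * 6 cycles, 2 bytes
        DECB                 * 2 cycles, 1 byte
        BNE     SAMPLE       * 3 cycles, 2 bytes
```

Over 100 passes the loop saves 100 cycles, less the 3 cycles spent on the LDX before the loop. This only works if X is free for the whole loop.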

If you know of any more quick ways to save some CPU cycles, please comment below and I will update this page with credit to you.

A great reference for cycle counts and the full 6809/6309 instruction set is Darren Atkinson’s Motorola 6809 and Hitachi 6309 Programmers Reference.

Part 2 will cover topics that are a little more complex and show how to make them faster.

Stay tuned,

Glen

9 Responses to Optimizing 6809 Assembly Code: Part 1 – Quick and Easy Changes to Speedup Your Code

Sockmaster says:

September 15, 2017 at 6:49 pm

Another good one is using ABX instead of LEA B,X. It Adds B (unsigned) to X and it’s only 3 cycles, 1 byte! Oftentimes I rearrange register usage in my code just to make sure ABX can be used as much as possible. Extra bonus, it’s only 1 cycle on the 6309!

- nowhereman999 says:
  
  September 15, 2017 at 9:26 pm
  
  Thanks Sockmaster,
  
  I have already added this to my Part 2 which I’ll probably post tonight. I’ll add your note about rearranging register usage just to make sure ABX can be used as much as possible.
  
  Please keep the tips coming if you find more…
  Glen
  
Art Flexser says:

September 15, 2017 at 9:49 pm

Maybe mention that when addressing some CoCo hardware registers, STA is a cycle faster than CLR and has the same effect, e.g., STA $FFDE vs CLR $FFDE.

Art

Sockmaster says:

September 16, 2017 at 2:40 pm

Sometimes programs can be made faster in general by finding ways to cut cycles from frequently executed code, even if it’s at the expense of adding cycles to infrequently executed code. I’m not sure what a good general example would be, but it often means making a performance sacrifice in one routine to reserve/preserve best optimization of the other.

On a related note, if you do have long conditional branches:
LBEQ taken (5 cycles not taken, 6 cycles taken)
…

Especially if infrequently taken, you can save 2 cycles when it’s not taken at the expense of 1 cycle when taken:
BNE not_taken (3 cycles not taken)
JMP taken (7 cycles taken)
not_taken …

- nowhereman999 says:
  
  September 16, 2017 at 4:19 pm
  
  Thanks for the ideas, Steve Bamford mentioned a similar thing on the CoCo list and I updated this post with an example.
  