Table of Contents (click on a link to jump to that web page)

- Part 1 – Quick and Easy Changes to Speedup Your Code
- Part 2 – Speedup Storing Data – Unrolling Loops
- Part 3 – Stack Blasting and Self Modifying Code
- Part 4 – Odds and Sods – More Tricks

Let’s move on to some more in-depth ways of speeding up useful tasks done in assembly. This is probably a good place to point out that if you’re using LWTOOL’s, lwasm to assemble your source code you can use the options below in your code to get the cycle counts in your output listing.

opt c - enable cycle counts opt cd - enable detailed cycle counts breaking down addressing modes opt ct - show a running subtotal of cycles opt cc - clear the running subtotal

The cycle count info in the listings below were generated by lwasm.

I personally use lwasm to show me the cycle counts as I just described but others might like to use a 6809 reference chart such as this to look up the cycle counts on their own while programming. This is also a good chart from sockmaster. Sockmaster 6809/6309 reference chart.

First a good little trick to keep in mind is the ABX instruction in place of LEAX B,X

ABXis 2 cycles faster and one byte less thenLEAX B,X

Keep in mind that ABX adds the unsigned value of B and adds it to X, while LEAX B,X takes into account the singed value of B when adding it to X.

A note about using **ABX** from Sockmaster – “Oftentimes I rearrange register usage in my code just to make sure **ABX** can be used as much as possible.”

Another important way to speed up your program is to make use of the **Direct Page (DP)** register. It is very useful if you are playing sampled sounds in a game using an FIRQ that your FIRQ is in the DP space. Also use it for other small routines that get used a lot. Also use it for storage of data that is accessed a lot.

LDA #$FATFR A,DP * Set DP to $FA00-$FAFF4000 FCFA55 [6] 6 LDD $FA55 * slower and more bytes 4003 DC55 [5] 11 LDD<$FA55 * faster and less bytes

If we don’t use direct addressing the LDD takes 6 cycles and 3 bytes. If we use direct addressing indicated with the less than **“<” **symbol the LDD takes 5 cycles and 2 bytes.

Another thing to note is the impact of speed and size when you are using indexed addressing.

Mem Code Cycles Assembly Code (Mnemonics)4000 8E2000 [3] LDX #$2000 4003 A684 [4+0] LDA ,X 4005 A61F [4+1] LDA -1,X 4009 A610 [4+1] LDA -16,X 400B A688EF [4+1] LDA -17,X 400E A68880 [4+1] LDA -128,X4011 A689FF7F [4+4] LDA -129,X4015 A601 [4+1] LDA 1,X 4017 A60F [4+1] LDA 15,X 4019 A68810 [4+1] LDA 16,X 401C A6887F [4+1] LDA 127,X401F A6890080 [4+4] LDA 128,X

Things to note about the above list:

- Values of -1 to -16 are only two bytes, same with values 1 to 15
- Values of -129 or lower use 4 bytes and 8 cycles. The same is true for positive numbers of 128 or higher.

For a real world example: We have a screen that is 64 bytes wide and we want to draw a line across the screen that is three pixels tall you could do this:

This routine takes 30 cycles * 32 = 960 cycles

Mem Code Cycles Running Total Assembly Code (Mnemonics)4000 8E2000 [3] LDX #$2000 4003 CEFFFF [3] LDU #$FFFF * White Pixels 4006 C602 [2] LDB #2 4008 EF84 [5+0] 5 ! STU ,X 400A EF8840 [5+1] 11 STU 64,X400D EF890080 [5+4] 20 STU 128,X * Big & Slow4011 3A [3] 23 ABX 4012 8C0810 [4] 27 CMPX #2000+64 4015 26F1 [3] 30 BNE <

Or you could make it faster and smaller by indexing -64 and +64 as shown below. You could also index back -128 and -64 and point X to the bottom row.

This version of the routine takes 27 cycles * 32 = 864 cycles

Mem Code Cycles Running Total Assembly Code (Mnemonics)4000 8E2040 [3] LDX #$2000+64 4003 CCFFFF [3] LDD #$FFFF * White Pixels 4006 C602 [2] LDB #2 4008 EF88C0 [5+1] 6 ! STU -64,X 400B EF84 [5+0] 11 STU ,X 400D EF8840 [5+1] 17 STU 64,X 4010 3A [3] 20 ABX 4011 8C0850 [4] 24 CMPX #2000+64+64 4014 26F2 [3] 27 BNE <

Usually there are trade offs using techniques of speeding up your code. Where speed usually means bigger code or more complex code, not always but typically this is true.

Here are some examples of clearing some RAM to all zeros from location $2000 to $4000. This could be used for clearing a graphics screen for a game.

Slow way, 61,440 cpu cycles: This is a simple and easy Loop to clear the RAM from $2000 to $3FFF

Mem Code Cycles Running Total Assembly Code (Mnemonics)4000 8E2000 [3] LDX #$2000 4003 CE0000 [3] LDU #$0000 * This loop is 15 cycles to update two bytes * We have to do this loop $2000 / 2 bytes each pass = $1000 times * 15 cycles * $1000 or 4096 = 61,440 cpu cycles4006 EF81 [5+3] 8 ! STU ,X++4008 8C4000 [4] 12 CMPX #$2000+$2000400B 26F9 [3] 15 BNE <

Faster way, 53,328 cpu cycles: A faster way is to use the A and B accumulators as counters to see if our loop is finished.

Mem Code Cycles Running Total Assembly Code (Mnemonics)400D 8E2000 [3] LDX #$2000 4010 CE0000 [3] LDU #$0000 4013 CC2000 [3] LDD #$2000 * This loop is mostly 13 cycles sometimes 18 cycles every 256 bytes * $2000 / $100 = $20 * $20 / 2 = $10 (half because we write 2 bytes per cycle) * $2000 - $20 = $1FE0 * $1FE0 / 2 = $FF0 (half because we write 2 bytes per cycle) * 13 cycles * $FF0 + 18 cycles * $10 = $CF30 + $120 = $D050 = 53,328 cpu cycles4016 EF81 [5+3] 8 ! STU ,X++4018 5A [2] 10 DECB4019 26FB [3] 13 BNE <401B 4A [2] 15 DECA401C 26F8 [3] 18 BNE <

Even faster way – Code unrolled version = 34,048 CPU cycles: It’s faster if we unroll the loops which means less comparing is done to see if we are at the end of our loop. This needs a little calculations ahead of time. If we are going to use this for clearing the screen then 32 bytes is a good number to use. So using the above code we could unroll it to this:

Mem Code Cycles Running Total Assembly Code (Mnemonics)4049 8E2000 [3] 3 LDX #$2000 404C CE0000 [3] 6 LDU #$0000 404F 5F [2] 8 CLRB * This loop is 133 cycles to write 32 bytes * We cycle through the loop 256 times so the calculation is * 256 * 133 = 34,048 CPU Cycles4050 EF81 [5+3] 8 ! STU ,X++4052 EF81 [5+3] 16 STU ,X++4054 EF81 [5+3] 24 STU ,X++4056 EF81 [5+3] 32 STU ,X++4058 EF81 [5+3] 40 STU ,X++405A EF81 [5+3] 48 STU ,X++405C EF81 [5+3] 56 STU ,X++405E EF81 [5+3] 64 STU ,X++4060 EF81 [5+3] 72 STU ,X++4062 EF81 [5+3] 80 STU ,X++4064 EF81 [5+3] 88 STU ,X++4066 EF81 [5+3] 96 STU ,X++4068 EF81 [5+3] 104 STU ,X++406A EF81 [5+3] 112 STU ,X++406C EF81 [5+3] 120 STU ,X++406E EF81 [5+3] 128 STU ,X++4070 5A [2] 130 DECB4071 26DD [3] 133 BNE <

Simon Jonassen (the Mad Man) shows an Even faster method is to use ABX and indexed addressing. His method below ties a lot of the examples above together: The example below is 26,880 cycles.

Mem Code Cycles Running Total Assembly Code (Mnemonics)4000 8E2000 [3] 159 LDX #$2000 4003 CE0000 [3] 162 LDU #$0000 4006 CC0010 [3] 165 LDD #$0010 * A = Loop 256 times, B adds 16 to X * This loop is 105 cycles to write 32 bytes * We cycle through the loop 256 times so the calculation is * 256 * 105 = 26,880 CPU Cycles 4009 EF84 [5+0] 5 ! STU ,X 400B EF02 [5+1] 11 STU 2,X 400D EF04 [5+1] 17 STU 4,X 400F EF06 [5+1] 23 STU 6,X 4011 EF08 [5+1] 29 STU 8,X 4013 EF0A [5+1] 35 STU 10,X 4015 EF0C [5+1] 41 STU 12,X 4017 EF0E [5+1] 47 STU 14,X 4019 3A [3] 50 ABX * Move forward half a row 401A EF84 [5+0] 55 STU ,X 401C EF02 [5+1] 61 STU 2,X 401E EF04 [5+1] 67 STU 4,X 4020 EF06 [5+1] 73 STU 6,X 4022 EF08 [5+1] 79 STU 8,X 4024 EF0A [5+1] 85 STU 10,X 4026 EF0C [5+1] 91 STU 12,X 4028 EF0E [5+1] 97 STU 14,X 402A 3A [3] 100 ABX * Move forward to the next row 402B 4A [2] 102 DECA 402C 26DB [3] 105 BNE <

At the expense of a little RAM we could improve the above code by using values larger then 15 for the indexing. The version below uses 26,368 cycles.

Mem Code Cycles Running Total Assembly Code (Mnemonics)4000 8E2000 [3] 159 LDX #$2000 4003 CE0000 [3] 162 LDU #$0000 4006 CC0020 [3] 165 LDD #$0020 * A = Loop 256 times, B adds 32 to X * This loop is 103 cycles to write 32 bytes * We cycle through the loop 256 times so the calculation is * 256 * 103 = 26,368 CPU Cycles 4009 EF84 [5+0] 5 ! STU ,X 400B EF02 [5+1] 11 STU 2,X 400D EF04 [5+1] 17 STU 4,X 400F EF06 [5+1] 23 STU 6,X 4011 EF08 [5+1] 29 STU 8,X 4013 EF0A [5+1] 35 STU 10,X 4015 EF0C [5+1] 41 STU 12,X 4017 EF0E [5+1] 47 STU 14,X 4019 EF8810 [5+1] 53 STU 16,X 401C EF8812 [5+1] 59 STU 18,X 401F EF8814 [5+1] 65 STU 20,X 4022 EF8816 [5+1] 71 STU 22,X 4025 EF8818 [5+1] 77 STU 24,X 4028 EF881A [5+1] 83 STU 26,X 402B EF881C [5+1] 89 STU 28,X 402E EF881E [5+1] 95 STU 30,X 4031 3A [3] 98 ABX * Move forward to the next row 4032 4A [2] 100 DECA 4033 26D4 [3] 103 BNE <

One other tip from Darren Atkinson about the above indexing method is to use negative numbers if you can to keep the size of the code down. Darren’s version is below:

Mem Code Cycles Running Total Assembly Code (Mnemonics)4000 8E2010 [3] 3 LDX #$2000+16 4003 CE0000 [3] 6 LDU #$0000 4006 CC0020 [3] 9 LDD #$0020 * This loop is 103 cycles to write 32 bytes * We cycle through the loop 256 times so the calculation is * 256 * 103 = 26,368 CPU Cycles 4009 EF10 [5+1] 6 ! STU -16,X 400B EF12 [5+1] 12 STU -14,X 400D EF14 [5+1] 18 STU -12,X 400F EF16 [5+1] 24 STU -10,X 4011 EF18 [5+1] 30 STU -8,X 4013 EF1A [5+1] 36 STU -6,X 4015 EF1C [5+1] 42 STU -4,X 4017 EF1E [5+1] 48 STU -2,X 4019 EF84 [5+0] 53 STU ,X 401B EF02 [5+1] 59 STU 2,X 401D EF04 [5+1] 65 STU 4,X 401F EF06 [5+1] 71 STU 6,X 4021 EF08 [5+1] 77 STU 8,X 4023 EF0A [5+1] 83 STU 10,X 4025 EF0C [5+1] 89 STU 12,X 4027 EF0E [5+1] 95 STU 14,X 4029 3A [3] 98 ABX 402A 4A [2] 100 DECA 402B 26DC [3] 103 BNE <

One last thing to note is the more you unroll the code the faster it will be at the expense of more RAM. You just have to decide what is most important RAM space or the speed of your code…

The fastest method – This routine uses 17,920 CPU cycles: It is fastest to use unrolled loops and push a stack pointer and it’s data into RAM instead of using a store instruction.

Mem Code Cycles Running Total Assembly Code (Mnemonics)4073 CC0000 [3] LDD #$0000 4076 8E0000 [3] LDX #$0000 4079 3184 [4+0] LEAY ,X 407B CE4000 [3] LDU #$2000+$2000 * This loop is 70 cycles to write 32 bytes * We cycle through the loop 256 times so the calculation is * 256 * 70 = 17,920 CPU Cycles 407E 3636 [5+6] 11 ! PSHU D,X,Y 4080 3636 [5+6] 22 PSHU D,X,Y 4082 3636 [5+6] 33 PSHU D,X,Y 4084 3636 [5+6] 44 PSHU D,X,Y 4086 3636 [5+6] 55 PSHU D,X,Y 4088 3606 [5+2] 62 PSHU D 408A 11832000 [5] 67 CMPU #$2000 408E 22EE [3] 70 BHI <

I’ll go into the details of this method and more in Part 3 of this series of blogs.

Cheers,

Glen