Optimizing 6809 Assembly Code: Part 2 – Speedup Storing Data – Unrolling Loops

Table of Contents (click on a link to jump to that web page)

Let’s move on to some more in-depth ways of speeding up useful tasks done in assembly.  This is probably a good place to point out that if you’re using LWTOOL’s,  lwasm  to assemble your source code you can use the options below in your code to get the cycle counts in your output listing.

opt c  - enable cycle counts
opt cd - enable detailed cycle counts breaking down addressing modes
opt ct - show a running subtotal of cycles
opt cc - clear the running subtotal

The cycle count info in the listings below were generated by lwasm.

I personally use lwasm to show me the cycle counts as I just described but others might like to use a 6809 reference chart such as this to look up the cycle counts on their own while programming.  This is also a good chart from sockmaster.  Sockmaster 6809/6309 reference chart.


First a good little trick to keep in mind is the ABX instruction in place of LEAX   B,X

ABX is 2 cycles faster and one byte less then LEAX  B,X

Keep in mind that ABX adds the unsigned value of B and adds it to X, while LEAX  B,X takes into account the singed value of B when adding it to X.

A note about using ABX from Sockmaster – “Oftentimes I rearrange register usage in my code just to make sure ABX can be used as much as possible.”


Another important way to speed up your program is to make use of the Direct Page (DP) register.  It is very useful if you are playing sampled sounds in a game using an FIRQ that your FIRQ is in the DP space.  Also use it for other small routines that get used a lot.  Also use it for storage of data that is accessed a lot.

                            LDA     #$FA
                            TFR     A,DP    * Set DP to $FA00-$FAFF
4000 FCFA55 [6]     6       LDD     $FA55   * slower and more bytes
4003 DC55   [5]     11      LDD     <$FA55  * faster and less bytes

If we don’t use direct addressing the LDD takes 6 cycles and 3 bytes.  If we use direct addressing indicated with the less than “<” symbol the LDD takes 5 cycles and 2 bytes.


Another thing to note is the impact of speed and size when you are using indexed addressing.

Mem  Code     Cycles               Assembly Code (Mnemonics)
4000 8E2000   [3]                  LDX     #$2000
4003 A684     [4+0]                LDA     ,X
4005 A61F     [4+1]                LDA     -1,X
4009 A610     [4+1]                LDA     -16,X
400B A688EF   [4+1]                LDA     -17,X
400E A68880   [4+1]                LDA     -128,X
4011 A689FF7F [4+4]                LDA     -129,X
4015 A601     [4+1]                LDA     1,X
4017 A60F     [4+1]                LDA     15,X
4019 A68810   [4+1]                LDA     16,X
401C A6887F   [4+1]                LDA     127,X
401F A6890080 [4+4]                LDA     128,X

Things to note about the above list:

  • Values of -1 to -16 are only two bytes, same with values 1 to 15
  • Values of -129 or lower use 4 bytes and 8 cycles.  The same is true for positive numbers of 128 or higher.

For a real world example:  We have a screen that is 64 bytes wide and we want to draw a line across the screen that is three pixels tall you could do this:

This routine takes 30 cycles * 32 = 960 cycles

Mem  Code   Cycles Running Total   Assembly Code (Mnemonics)
4000 8E2000   [3]                  LDX     #$2000
4003 CEFFFF   [3]                  LDU     #$FFFF      * White Pixels
4006 C602     [2]                  LDB     #2
4008 EF84     [5+0]   5    !       STU     ,X
400A EF8840   [5+1]   11           STU     64,X
400D EF890080 [5+4]   20           STU     128,X       * Big & Slow
4011 3A       [3]     23           ABX
4012 8C0810   [4]     27           CMPX    #2000+64
4015 26F1     [3]     30           BNE     <

Or you could make it faster and smaller by indexing -64 and +64 as shown below.  You could also index back -128 and -64 and point X to the bottom row.

This version of the routine takes 27 cycles * 32 = 864 cycles

Mem  Code   Cycles Running Total   Assembly Code (Mnemonics)
4000 8E2040   [3]                  LDX     #$2000+64
4003 CCFFFF   [3]                  LDD     #$FFFF      * White Pixels
4006 C602     [2]                  LDB     #2
4008 EF88C0   [5+1]   6    !       STU     -64,X
400B EF84     [5+0]   11           STU     ,X
400D EF8840   [5+1]   17           STU     64,X
4010 3A       [3]     20           ABX
4011 8C0850   [4]     24           CMPX    #2000+64+64
4014 26F2     [3]     27           BNE     <

Usually there are trade offs using techniques of speeding up your code.  Where speed usually means bigger code or more complex code, not always but typically this is true.

Here are some examples of clearing some RAM to all zeros from location $2000 to $4000. This could be used for clearing a graphics screen for a game.

Slow way, 61,440 cpu cycles:  This is a simple and easy Loop to clear the RAM from $2000 to $3FFF

Mem  Code  Cycles  Running Total    Assembly Code (Mnemonics)
4000 8E2000 [3]                     LDX     #$2000
4003 CE0000 [3]                     LDU     #$0000
* This loop is 15 cycles to update two bytes
* We have to do this loop $2000 / 2 bytes each pass = $1000 times
* 15 cycles * $1000 or 4096 = 61,440 cpu cycles
4006 EF81   [5+3]   8       !       STU     ,X++
4008 8C4000 [4]     12              CMPX    #$2000+$2000
400B 26F9   [3]     15              BNE     <

 Faster way, 53,328 cpu cycles:  A faster way is to use the A and B accumulators as counters to see if our loop is finished.  

Mem  Code  Cycles Running Total     Assembly Code (Mnemonics)
400D 8E2000 [3]                     LDX     #$2000
4010 CE0000 [3]                     LDU     #$0000
4013 CC2000 [3]                     LDD     #$2000
* This loop is mostly 13 cycles sometimes 18 cycles every 256 bytes
* $2000 / $100 = $20
* $20 / 2 = $10  (half because we write 2 bytes per cycle)
* $2000 - $20 = $1FE0
* $1FE0 / 2 = $FF0  (half because we write 2 bytes per cycle)
* 13 cycles * $FF0 + 18 cycles * $10 = $CF30 + $120 = $D050 = 53,328 cpu cycles
4016 EF81   [5+3]   8       !       STU     ,X++
4018 5A     [2]     10              DECB
4019 26FB   [3]     13              BNE     <
401B 4A     [2]     15              DECA
401C 26F8   [3]     18              BNE     <

Even faster way – Code unrolled version = 34,048 CPU cycles:  It’s faster if we unroll the loops which means less comparing is done to see if we are at the end of our loop.  This needs a little calculations ahead of time.  If we are going to use this for clearing the screen then 32 bytes is a good number to use.  So using the above code we could unroll it to this:

Mem  Code  Cycles Running Total     Assembly Code (Mnemonics)
4049 8E2000 [3]     3               LDX     #$2000
404C CE0000 [3]     6               LDU     #$0000
404F 5F     [2]     8               CLRB
* This loop is 133 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 133 = 34,048 CPU Cycles
4050 EF81   [5+3]   8       !       STU     ,X++
4052 EF81   [5+3]   16              STU     ,X++
4054 EF81   [5+3]   24              STU     ,X++
4056 EF81   [5+3]   32              STU     ,X++
4058 EF81   [5+3]   40              STU     ,X++
405A EF81   [5+3]   48              STU     ,X++
405C EF81   [5+3]   56              STU     ,X++
405E EF81   [5+3]   64              STU     ,X++
4060 EF81   [5+3]   72              STU     ,X++
4062 EF81   [5+3]   80              STU     ,X++
4064 EF81   [5+3]   88              STU     ,X++
4066 EF81   [5+3]   96              STU     ,X++
4068 EF81   [5+3]   104             STU     ,X++
406A EF81   [5+3]   112             STU     ,X++
406C EF81   [5+3]   120             STU     ,X++
406E EF81   [5+3]   128             STU     ,X++
4070 5A     [2]     130             DECB
4071 26DD   [3]     133             BNE     <

Simon Jonassen (the Mad Man) shows an Even faster method is to use ABX and indexed addressing.  His method below ties a lot of the examples above together:  The example below is 26,880 cycles.

Mem  Code  Cycles  Running Total    Assembly Code (Mnemonics)
4000 8E2000 [3]     159             LDX     #$2000
4003 CE0000 [3]     162             LDU     #$0000
4006 CC0010 [3]     165             LDD     #$0010  * A = Loop 256 times, B adds 16 to X
* This loop is 105 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 105 = 26,880 CPU Cycles
4009 EF84   [5+0]   5       !       STU     ,X
400B EF02   [5+1]   11              STU     2,X
400D EF04   [5+1]   17              STU     4,X
400F EF06   [5+1]   23              STU     6,X
4011 EF08   [5+1]   29              STU     8,X
4013 EF0A   [5+1]   35              STU     10,X
4015 EF0C   [5+1]   41              STU     12,X
4017 EF0E   [5+1]   47              STU     14,X
4019 3A     [3]     50              ABX             * Move forward half a row
401A EF84   [5+0]   55              STU     ,X
401C EF02   [5+1]   61              STU     2,X
401E EF04   [5+1]   67              STU     4,X
4020 EF06   [5+1]   73              STU     6,X
4022 EF08   [5+1]   79              STU     8,X
4024 EF0A   [5+1]   85              STU     10,X
4026 EF0C   [5+1]   91              STU     12,X
4028 EF0E   [5+1]   97              STU     14,X
402A 3A     [3]     100             ABX             * Move forward to the next row
402B 4A     [2]     102             DECA
402C 26DB   [3]     105             BNE     <

At the expense of a little RAM we could improve the above code by using values larger then 15 for the indexing.  The version below uses 26,368 cycles.

Mem  Code  Cycles Running Total     Assembly Code (Mnemonics)
4000 8E2000 [3]     159             LDX     #$2000
4003 CE0000 [3]     162             LDU     #$0000
4006 CC0020 [3]     165             LDD     #$0020  * A = Loop 256 times, B adds 32 to X
* This loop is 103 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 103 = 26,368 CPU Cycles
4009 EF84   [5+0]   5       !       STU     ,X
400B EF02   [5+1]   11              STU     2,X
400D EF04   [5+1]   17              STU     4,X
400F EF06   [5+1]   23              STU     6,X
4011 EF08   [5+1]   29              STU     8,X
4013 EF0A   [5+1]   35              STU     10,X
4015 EF0C   [5+1]   41              STU     12,X
4017 EF0E   [5+1]   47              STU     14,X        
4019 EF8810 [5+1]   53              STU     16,X
401C EF8812 [5+1]   59              STU     18,X
401F EF8814 [5+1]   65              STU     20,X
4022 EF8816 [5+1]   71              STU     22,X
4025 EF8818 [5+1]   77              STU     24,X
4028 EF881A [5+1]   83              STU     26,X
402B EF881C [5+1]   89              STU     28,X
402E EF881E [5+1]   95              STU     30,X
4031 3A     [3]     98              ABX             * Move forward to the next row
4032 4A     [2]     100             DECA
4033 26D4   [3]     103             BNE     <

One other tip from Darren Atkinson about the above indexing method is to use negative numbers if you can to keep the size of the code down.  Darren’s version is below:

Mem  Code  Cycles Running Total     Assembly Code (Mnemonics)
4000 8E2010 [3]     3               LDX    #$2000+16
4003 CE0000 [3]     6               LDU    #$0000
4006 CC0020 [3]     9               LDD    #$0020
* This loop is 103 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 103 = 26,368 CPU Cycles
4009 EF10   [5+1]   6           !   STU    -16,X
400B EF12   [5+1]   12              STU    -14,X
400D EF14   [5+1]   18              STU    -12,X
400F EF16   [5+1]   24              STU    -10,X
4011 EF18   [5+1]   30              STU    -8,X
4013 EF1A   [5+1]   36              STU    -6,X
4015 EF1C   [5+1]   42              STU    -4,X
4017 EF1E   [5+1]   48              STU    -2,X
4019 EF84   [5+0]   53              STU    ,X
401B EF02   [5+1]   59              STU    2,X
401D EF04   [5+1]   65              STU    4,X
401F EF06   [5+1]   71              STU    6,X
4021 EF08   [5+1]   77              STU    8,X
4023 EF0A   [5+1]   83              STU    10,X
4025 EF0C   [5+1]   89              STU    12,X
4027 EF0E   [5+1]   95              STU    14,X
4029 3A     [3]     98              ABX
402A 4A     [2]     100             DECA
402B 26DC   [3]     103             BNE    <

One last thing to note is the more you unroll the code the faster it will be at the expense of more RAM.  You just have to decide what is most important RAM space or the speed of your code…

The fastest method – This routine uses 17,920 CPU cycles:  It is fastest to use unrolled loops and push a stack pointer and it’s data into RAM instead of using a store instruction.

Mem  Code   Cycles Running Total            Assembly Code (Mnemonics)
4073 CC0000   [3]                             LDD     #$0000
4076 8E0000   [3]                             LDX     #$0000
4079 3184     [4+0]                           LEAY    ,X
407B CE4000   [3]                             LDU     #$2000+$2000
* This loop is 70 cycles to write 32 bytes
* We cycle through the loop 256 times so the calculation is
* 256 * 70 = 17,920 CPU Cycles
407E 3636     [5+6]   11      !               PSHU    D,X,Y
4080 3636     [5+6]   22                      PSHU    D,X,Y
4082 3636     [5+6]   33                      PSHU    D,X,Y
4084 3636     [5+6]   44                      PSHU    D,X,Y
4086 3636     [5+6]   55                      PSHU    D,X,Y
4088 3606     [5+2]   62                      PSHU    D
408A 11832000 [5]     67                      CMPU    #$2000
408E 22EE     [3]     70                      BHI             <

I’ll go into the details of this method and more in Part 3 of this series of blogs.

Cheers,

Glen

Advertisements
This entry was posted in CoCo Programming and tagged , , , , . Bookmark the permalink.

One Response to Optimizing 6809 Assembly Code: Part 2 – Speedup Storing Data – Unrolling Loops

  1. Pingback: Optimizing 6809 Assembly Code: Part 4 – Odds and Sods – More Tricks | Glen's Weblog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s