|
|
| Author |
Message |
mikep
Joined: 24 Sep 2005
Posts: 765
Location: Austin, TX
|
|
Posted: 26 February 2006, 19:59 PM Post subject: ZBasic Performance Measurements |
|
|
I'm trying to measure the performance of ZBasic. Here is my test program:
| Code: |
Const count as Long = 100000
Sub Main()
Dim b as Byte
Dim i as Long
Dim t as Single, o as Single
' empty loop
t = timer()
For i=1 to count
Next
o = timer() - t
Debug.Print "Loop Overhead:";CStr(o)
' measurement loop
t = timer()
For i=1 to count
b = b + 1
Next
t = timer() - t - o
Debug.Print "Time for ";CStr(count); " iterations:";CStr(t)
Debug.Print "Number of iterations per second "; CStr(CSng(count)/t)
End Sub |
When b is declared as a Byte I get the result:
| Code: |
Number of iterations per second 99805.07 |
When b is an Integer or Long I get slightly slower speeds, obviously because there is more work to perform the addition.
However, this result doesn't match the stated claim of 160,000 per second. What am I missing here?
BTW, when the same program is run under BasicX, the result is 27061 iterations per second. |
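The subtraction used in the test program (time the empty loop, time the loop with the statement, take the difference) can be restated as a small calculation. This is only a sketch, written in Python rather than ZBasic; the function name `net_rate` and the timing values passed to it are made-up placeholders, not measurements from this thread.

```python
def net_rate(count, t_empty, t_loaded):
    """Iterations per second attributable to the statement under test.

    count    -- number of loop iterations in each run
    t_empty  -- seconds taken by the empty loop (loop overhead)
    t_loaded -- seconds taken by the loop containing the statement
    """
    net = t_loaded - t_empty
    if net <= 0:
        raise ValueError("loaded loop must take longer than the empty loop")
    return count / net


# Hypothetical timings, for illustration only.
rate = net_rate(100000, t_empty=2.0, t_loaded=3.0)
print(rate)
```

The figure this produces describes only the added statement, so it will look much faster than a rate computed from the total loop time.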
|
dkinzer Site Admin
Joined: 03 Sep 2005
Posts: 2499
Location: Portland, OR
|
|
Posted: 27 February 2006, 4:55 AM Post subject: |
|
|
The claim of 160,000 maximum instructions per second is based on the measured execution time of a byte increment instruction. The test method is to measure the execution time of a loop without the subject instruction and then measure it again with the subject instruction. Here is the code for the basic loop that establishes the baseline speed.
| Code: |
Sub Main()
Register.DDRC = &H01
Do
Register.PortC = Register.PortC Xor &H01
Loop
End Sub |
When this program is run, pin 12 will produce a square wave whose period is twice the nominal loop execution time. The period is measured with an oscilloscope or logic analyzer.
The program is then modified as shown below, adding a statement that generates a byte increment instruction.
| Code: |
Dim b as Byte
Sub Main()
Register.DDRC = &H01
Do
Register.PortC = Register.PortC Xor &H01
b = b + 1
Loop
End Sub |
When this program is compiled and run, the increase in the period of the output from pin 12 will be attributable to the added instruction. The measured difference is twice the execution time of the added statement.
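The factor of two comes from the pin toggling once per loop pass, so one full period of the square wave spans two passes. A minimal sketch of the arithmetic, in Python; the function names are hypothetical and the 100.0 us baseline period is a placeholder, not a measured value. Only the 160,000 per second figure comes from the thread.

```python
def statement_time_us(period_base_us, period_with_us):
    """Execution time of the added statement, in microseconds.

    One period covers two loop passes, so the added statement is
    counted twice in the period increase.
    """
    return (period_with_us - period_base_us) / 2.0


def instructions_per_second(time_us):
    """Convert a per-instruction time in microseconds to a rate."""
    return 1_000_000.0 / time_us


# 160,000 instructions per second is 6.25 us per instruction, which
# would appear as a 12.5 us increase in the measured period.
t = statement_time_us(100.0, 112.5)
print(t, instructions_per_second(t))
```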
Note that in this case, the tested statement involves the incrementing of a module-level variable which yields a single pcode instruction. Your test code increments a local variable for which the following code is generated:
| Code: |
b = b + 1
006e 0d0000 PSHR_A bp+0
0071 c4 INCI_B |
There is no pcode instruction for incrementing a variable whose address is relative to the stack frame. There is, however, an instruction for incrementing a variable whose address is on the top of the stack, so that is used since it is still more efficient than pushing the value, incrementing the variable on the TOS, and then popping the value.
There are 8-bit, 16-bit and 32-bit versions of the three address mode variations of the increment instruction as well as corresponding decrement instructions.
Last edited by dkinzer on 27 February 2006, 19:23 PM; edited 1 time in total |
|
mikep
Joined: 24 Sep 2005
Posts: 765
Location: Austin, TX
|
|
Posted: 27 February 2006, 6:44 AM Post subject: |
|
|
Yes, my mistake. I forgot to check the list file. When I move the variable b to be a global instead of a stack-relative local I get the following (rounded) results:
- Byte = 168,977 increments per second
- Integer = 158,514 increments per second
- Long = 145,868 increments per second
|
|
spamiam
Joined: 13 Nov 2005
Posts: 665
|
|
Posted: 27 February 2006, 16:28 PM Post subject: |
|
|
| mikep wrote: | Yes, my mistake. I forgot to check the list file. When I move the variable b to be a global instead of a stack-relative local I get the following (rounded) results:
- Byte = 168,977 increments per second
- Integer = 158,514 increments per second
- Long = 145,868 increments per second
|
Wow, I would have expected a GREATER performance hit for a long vs a byte.
The AVR machine code under optimal circumstances consists of a subtract, then 3 subtracts w/ carry. (Under the RISC architecture, no add machine code exists.)
So, a long increment takes the hardware 4 times longer than a byte increment. I suppose this means that most of the time for a ZBasic instruction is taken up by operations other than just reading the data from memory into registers and then doing the operation.
I wonder what the raw speed of the hardware is for a byte, int, and long increment (or add). Does anyone here play with the AVRs in C or ASM and have a scope? I don't have a scope, and I only use C, and the compiler optimizes away trivial stuff like repetitive increments unless I turn off ALL optimizations.
-Tony |
|
stevech
Joined: 23 Feb 2006
Posts: 657
|
|
Posted: 27 February 2006, 16:38 PM Post subject: |
|
|
I did a little test with something like
flag = TRUE
do
n=n+1 ' an unsigned long
loop while flag
And I had earlier created a task that sleeps for 10 seconds and then clears "flag". I then printed n divided by 10 and got about 360,000.
I did it this way to reduce the overhead in the loop iteration. I'll try it again with n as a global rather than a local. |
|
mikep
Joined: 24 Sep 2005
Posts: 765
Location: Austin, TX
|
|
Posted: 27 February 2006, 16:52 PM Post subject: |
|
|
| spamiam wrote: | | Wow, I would have expected a GREATER performance hit for a long vs a byte. |
Given a clock speed of 14.7456 MHz, 168,977 increments per second means 87 clocks for each increment.
Assuming 3 instructions for a byte increment, the remaining clocks are used for loading the code from EEPROM, instruction decoding, and loading/storing into the ZBasic stack.
Similarly, for a long increment the number of clocks is 101. These extra 14 clocks are the instructions to deal with the extra complication of a long increment and loading/saving 4 bytes instead of 1. |
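The clock counts above follow from dividing the clock frequency by the measured rate. A short Python sketch of that division, using the three rates reported earlier in the thread; the names `CLOCK_HZ` and `clocks_per_increment` are just for illustration.

```python
CLOCK_HZ = 14_745_600  # 14.7456 MHz, as stated in the post


def clocks_per_increment(increments_per_second):
    """Average CPU clocks consumed by one increment."""
    return CLOCK_HZ / increments_per_second


# Rates reported earlier in the thread for a module-level variable.
rates = {"Byte": 168_977, "Integer": 158_514, "Long": 145_868}
for name, rate in rates.items():
    print(name, round(clocks_per_increment(rate)))
```

Rounded to whole clocks this gives about 87 for Byte, 93 for Integer, and 101 for Long.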
|
dkinzer Site Admin
Joined: 03 Sep 2005
Posts: 2499
Location: Portland, OR
|
|
Posted: 27 February 2006, 17:02 PM Post subject: |
|
|
The setup code is nearly the same for byte, word and dword increment/decrement, explaining why the times are not proportional to operand length.
| Quote: | | The AVR machine code under optimal circumstances consists of a subtract, then 3 subtracts w/ carry. |
The optimal code is a word add, then 2 byte adds with carry. The timing is the same as the sequence that you described but the code is two bytes shorter this way.
| Code: |
adiw r24, 1
adc r22, zero
adc r23, zero |
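To see how the carry ripples through that three-instruction sequence, here is a rough Python model of it. It is a sketch under an assumption: that the word add operates on the low 16 bits of the value and the two add-with-carry instructions handle the upper two bytes in order. The function name `increment_long` is hypothetical, and the mapping of bytes to registers is not confirmed by the post.

```python
def increment_long(value):
    """Increment a 32-bit value as a word add plus two byte adds with carry."""
    low_word = value & 0xFFFF
    byte2 = (value >> 16) & 0xFF
    byte3 = (value >> 24) & 0xFF

    # Word add (adiw): add 1 to the 16-bit low word; carry is set on overflow.
    low_word += 1
    carry = low_word >> 16
    low_word &= 0xFFFF

    # First adc: add zero plus the carry to the third byte.
    byte2 += carry
    carry = byte2 >> 8
    byte2 &= 0xFF

    # Second adc: add zero plus the carry to the top byte; the final
    # carry out is discarded, so the value wraps at 32 bits.
    byte3 = (byte3 + carry) & 0xFF

    return (byte3 << 24) | (byte2 << 16) | low_word


print(hex(increment_long(0x0000FFFF)))
```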
|
|
mikep
Joined: 24 Sep 2005
Posts: 765
Location: Austin, TX
|
|
Posted: 27 February 2006, 17:11 PM Post subject: |
|
|
| stevech wrote: | I then printed n divided by 10 and got about 360,000.
I did it this way to reduce the overhead in the loop iteration. I'll try it again with n as a global rather than a local. |
The real way to eliminate the loop iteration overhead is to time two tests (one a null loop and one with the code under test), as Don and I both remarked earlier. It turns out that the loop time is much longer than the increment time.
Are you sure you divided by 10? Your result doesn't make sense: ZBasic is not as fast as 360,000 per second. I rewrote the test using your sample code (which includes the loop overhead) and got 38,506 iterations per second.
| Code: |
Private flag as Boolean
Private n as UnsignedLong
Private n as UnsignedLong
Private delayStack(1 to 50) as Byte
Public Sub Main()
flag = TRUE
n = 0
CallTask "DelayTask", delayStack
Do
n=n+1
Loop While flag
Debug.Print CStr(n\10)
End Sub
Private Sub DelayTask()
Call Sleep(10.0)
flag = FALSE
End Sub |
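A rough way to see how much of each iteration is loop overhead is to convert both figures to clocks and subtract. This Python sketch combines numbers from two different tests in this thread (38,506 iterations per second for the whole loop, 145,868 per second for a Long increment alone) and assumes an UnsignedLong increment costs about the same as a Long increment, so treat the result as an estimate only.

```python
CLOCK_HZ = 14_745_600  # 14.7456 MHz, as stated earlier in the thread


def clocks_per_iteration(iterations_per_second):
    """Average CPU clocks consumed per iteration at the given rate."""
    return CLOCK_HZ / iterations_per_second


total = clocks_per_iteration(38_506)        # Do ... Loop While plus the increment
increment = clocks_per_iteration(145_868)   # Long increment alone
overhead = total - increment                # estimate of the loop test itself

print(round(total), round(increment), round(overhead))
```

On these figures the loop test accounts for roughly 282 of about 383 clocks per iteration, which is consistent with the loop time being much longer than the increment time.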
|
|
stevech
Joined: 23 Feb 2006
Posts: 657
|
|
Posted: 27 February 2006, 22:50 PM Post subject: |
|
|
360K - That's what I recall but I may be mistaken; I'm away from home now. Your code, above, is virtually identical to what I did. So...
I never make mistakes (I thought I made one once - but I was wrong!) |
|
|