wokwi / avr8js

Arduino (8-bit AVR) simulator, written in JavaScript and runs in the browser / Node.js
https://blog.wokwi.com/avr8js-simulate-arduino-in-javascript/
MIT License
481 stars 78 forks source link

Optimize CPU busy waits #27

Closed urish closed 3 years ago

urish commented 4 years ago

There are many cases where the Arduino code is simply waiting for an external event. One example is calling delay(), which is implemented as a loop that repeatedly looks at the value of micros(). Another example is Serial.write, which waits for an interrupt or an external flag to be set when the TX buffer is full.

The basic idea would be to detect any loop that has no side effects (i.e.it only reads from the memory), and skip all the iterations of the loop until the next external event (timer firing / interrupt).

There are several factors that makes the implementation challenging:

  1. Many instructions update the status register (SREG).
  2. In many cases, the busy-waiting code does some calculations, and thus it has a local side effect of updating register values.
  3. We still need to keep the clock cycle count correct, as if we actually executed the skipped instructions.
  4. Currently, the avrInstruction() method advances the clock cycle count, and other peripherals (e.g. timers) synchronize with it by looking at the value of cpu.cycles. We will need to implement a new mechanism to keep track of the clock cycles and to synchronize the different peripherals to it.

As an example, here is the disassembly of the Arduino delay() function (as compiled by the Arduino CLI). It was obtained by compiling the "Blink" program and then running avr-objdump -S on the generated ELF file:

 1de:   0e 94 b8 00     call    0x170   ; 0x170 <micros>
 1e2:   68 19           sub     r22, r8
 1e4:   79 09           sbc     r23, r9
 1e6:   8a 09           sbc     r24, r10
 1e8:   9b 09           sbc     r25, r11
 1ea:   68 3e           cpi     r22, 0xE8       ; 232
 1ec:   73 40           sbci    r23, 0x03       ; 3
 1ee:   81 05           cpc     r24, r1
 1f0:   91 05           cpc     r25, r1
 1f2:   a8 f3           brcs    .-22            ; 0x1de <delay.constprop.1+0x24>

As you can see here, the code does some calculations that updates the registers r22 - r25, as well as the status register, and it also calls the micros() function. The call instruction pushes a value to the stack and updates the stack pointer, causing another temporary side effect (that is later cancelled out when we call ret in micros()).

And the disassembly of micros():

unsigned long micros() {
        unsigned long m;
        uint8_t oldSREG = SREG, t;
 170:   3f b7           in      r19, 0x3f       ; 63

        cli();
 172:   f8 94           cli
        m = timer0_overflow_count;
 174:   80 91 05 01     lds     r24, 0x0105     ; 0x800105 <timer0_overflow_count>
 178:   90 91 06 01     lds     r25, 0x0106     ; 0x800106 <timer0_overflow_count+0x1>
 17c:   a0 91 07 01     lds     r26, 0x0107     ; 0x800107 <timer0_overflow_count+0x2>
 180:   b0 91 08 01     lds     r27, 0x0108     ; 0x800108 <timer0_overflow_count+0x3>
        t = TCNT0;
 184:   26 b5           in      r18, 0x26       ; 38
        if ((TIFR0 & _BV(TOV0)) && (t < 255))
 186:   a8 9b           sbis    0x15, 0 ; 21
 188:   05 c0           rjmp    .+10            ; 0x194 <micros+0x24>
 18a:   2f 3f           cpi     r18, 0xFF       ; 255
 18c:   19 f0           breq    .+6             ; 0x194 <micros+0x24>
                m++;
 18e:   01 96           adiw    r24, 0x01       ; 1
 190:   a1 1d           adc     r26, r1
 192:   b1 1d           adc     r27, r1
        if ((TIFR & _BV(TOV0)) && (t < 255))
                m++;
        SREG = oldSREG;
 194:   3f bf           out     0x3f, r19       ; 63

        return ((m << 8) + t) * (64 / clockCyclesPerMicrosecond());
 196:   ba 2f           mov     r27, r26
 198:   a9 2f           mov     r26, r25
 19a:   98 2f           mov     r25, r24
 19c:   88 27           eor     r24, r24
 19e:   bc 01           movw    r22, r24
 1a0:   cd 01           movw    r24, r26
 1a2:   62 0f           add     r22, r18
 1a4:   71 1d           adc     r23, r1
 1a6:   81 1d           adc     r24, r1
 1a8:   91 1d           adc     r25, r1
 1aa:   42 e0           ldi     r20, 0x02       ; 2
 1ac:   66 0f           add     r22, r22
 1ae:   77 1f           adc     r23, r23
 1b0:   88 1f           adc     r24, r24
 1b2:   99 1f           adc     r25, r25
 1b4:   4a 95           dec     r20
 1b6:   d1 f7           brne    .-12            ; 0x1ac <micros+0x3c>
}
 1b8:   08 95           ret

This code also changes the status register (through the in and out instructions, and as a result of the arithmetic instructions), as well as registers r18, r20, and r22-r27.

urish commented 3 years ago

Seems like a fun issue, but performance is not a big issue right now, and there are lower hanging fruits, e.g. optimizing the opcode decoding.