nashif commented 7 years ago

Reported by Sergey Arkhipov:

Bug in Zephyr OS high-precision timings sub-system (function sys_cycle_get_32())

Configuration

any Seen on nano (v1.5.0), but potentially exists on any kernel type/revision Bug detailed description ======================== Subsequent calls to sys_cycle_get_32() may return logically wrong time-stamps when the value returned earlier is greater than the value returned later. This is NOT a 32-bit counter overflow issue which happens naturally by design, but rather a design flaw, which could be seen randomly, sometimes even every several milliseconds, i.e. a way more often than underlying counter overflow. Test case to replicate ===================== Set up 1 ms system tick. Modify hello-world sample: ``` void main(void) { uint32_t t0, t1, diff; PRINT("Hello World! %s\n", CONFIG_ARCH); for (uint32_t i = 0; i < ULONG_MAX; i++) { t0 = sys_cycle_get_32(); t1 = sys_cycle_get_32(); if (t1 >= t0) { diff = t1 - t0; } else { diff = ULONG_MAX - t0 + t1; // Fixes 32-bit counter overflow PRINT("diff = %u (t0 = %u : t1 = %u); i = %u\n", diff, t0, t1, i); } } } ``` Expected result =============== Something like this, when subsequent 'diff' printouts could happen approx. every 2 minutes (i.e. 32-bit timer overflow time @ 32MHz CPU clock): ``` Hello World! x86 diff = 28 (t0 = 4294967291 : t1 = 24); i = 61659273 diff = 28 (t0 = 4294967283 : t1 = 16); i = 125651793 diff = 28 (t0 = 4294967290 : t1 = 23); i = 253639277 .... ``` Actual result ============= Something like this, when subsequent 'diff' printouts could happen approx. every several milliseconds (i.e. well below the 32-bit timer overflow time @ 32MHz CPU clock): ``` Hello World! x86 diff = 4294935318 (t0 = 383987 : t1 = 352010); i = 4005 diff = 4294935320 (t0 = 607989 : t1 = 576014); i = 4779 diff = 4294935840 (t0 = 863993 : t1 = 832538); i = 6115 diff = 4294935320 (t0 = 1055982 : t1 = 1024007); i = 6326 diff = 4294935318 (t0 = 1375987 : t1 = 1344010); i = 8676 diff = 4294935318 (t0 = 1663985 : t1 = 1632008); i = 10464 diff = 4294935855 (t0 = 2047991 : t1 = 2016551); i = 13878 diff = 4294935320 (t0 = 2239984 : t1 = 2208009); i = 13921 diff = 4294935842 (t0 = 2527993 : t1 = 2496540); i = 15621 .... ``` Notes ===== Basically when the next sequence occurs in the code t0 = sys_cycle_get_32(); t1 = sys_cycle_get_32(); one would expect t0 to be less than t1 provided there is no underlying 32-bit counter overflow (which is rather rare event, happening approx every 2 min at 32MHz clock cycle, and thus expected and easily fixed). Instead, within timer overflow cycle multiple events occur when t0 is greater than t1, basically limiting OS high-precision timer service to one system tick resolution. The core problem is in the code which is almost the same for all platforms/revisions. Here it is for x86: ``` uint32_t sys_cycle_get_32(void) { uint32_t val; /* system clock value */ ... val = accumulated_cycle_count - current_count_register_get() + cycles_per_tick; ... return val; } ``` The problem occurs when sys_cycle_get_32() is called at the moment when hardware down counter reaches zero and then gets auto-reinitialized to the original value but 'accumulated_cycle_count' value has not been updated yet because it happens asynchronously inside the ISR. Then it may happen that t1 value is less then t0 value for an amount of up to 'accumulated_cycle_count'. This makes OS high-precision timing sub-system useless. Temporary workaround ==================== For x86 architecture use _NanoTscRead() instead of sys_cycle_get_32(). It works flawlessly. (Imported from Jira ZEP-1546)

nashif commented 7 years ago

by Benjamin Walsh:

I think it's a duplicate of GH-1292.

nashif commented 7 years ago

by Sergey Arkhipov:

Probably yes. It just extends to sys_cycle_get_32() function so it appears to be not very useful.

nashif commented 7 years ago

by Andy Ross:

OK, starting with repeating my understanding:

The root cause is a straightforward race (in both loapic_timer and cortex_m_systick) between a "accumulated" cycle count which is updated in an ISR and a register read of the "current" cycle count since the last timer interrupt. Because the ISR is async, it may not have occured before the timer count reaches zero and so you can see either the correct/updated accumulated count or a stale-by-one-tick non-updated one alongside a zero cycle count.

(Random aside: I can sort of see this for APIC as it's an "external" device, but on ARM the cycle counter is a CPU regiser: how did they manage NOT to specify delivery of the timer interrupt to be cycle exact vs. reads of the cycle count register! Ugh.)

I think I agree with Ben in GH-1292, there's no proper way to fix this except by putting a irq_lock() around the two reads, which I'm reasonably certain is going to be impossible for perfrmance reasons.

But it's possible to almost fix this with a heuristic. Use an atomic flag to signal the begin/end of the read pair and check that it wasn't set from the ISR in the middle. Something like this pseudocode:

static atomic_val_t timer_isr_happened; /* Our new flag */

static volatile accum_count; /* Set to 1 by ISR */

uint32_t get_curr_cycle_count() { /* read hardwre */

uint32_t k_cycle_get_32(void)
{
    uint32_t accum, curr;
    do {
        atomic_clear(&timer_isr_happend);
    accum = accum_count;
    curr = get_curr_cycle_count();
    } while (atomic_get(&timer_isr_happened));
}

(Note we don't strictly need the atomic API for this purpose and a volatile int would be fine on both ARM and x86, we might want to do that to save the handful of cycles on ARM due to the assembly functions. I think x86 just resolves to single-instruction compiler builtins).

This would robustly detect the ISR race and redo both reads when that happens so you never see a non-monotonic timer read. Under normal circumstances it's just that one extra read/test/branch to do, so should be fast enough.

But it's not perfect: it fixes the case in the bug here where you have a thread rapidly polling the cycle count and likely to hit the overflow every time, but it can still race against other callers to k_cycle_get_32(). If you context get preempted inside that critical section and the other thread calls k_cycle_get_32(), then it will reset the flag for you even if the ISR triggered already and you'll miss the update you were trying to detect.

Honestly, I argue that that's rare enough to ignore. The race is a very small window, and it's only possible to see in code like this (which is reasonable code!) that spins on the cycle count making millions of calls per tick to get high precision. But until Zephyr gets true SMP, it's literally impossible for two threads to do this at once, so we can't actually hit this bug that way.

Ben?

nashif commented 7 years ago

by Andy Ross:

Oh, and Sergey: which hardware are you seeing this on most reliably? I'm guessing interrupt delivery in qemu may not be crazy enough to trigger it?

nashif commented 7 years ago

by Andy Ross:

One final note: I just stumbled on this comment in loapic_timer:

     * NOTE: Although the cycle count is supposed to stop decrementing once it
     * hits zero in one-shot mode, not all targets implement this properly
     * (and continue to decrement).  Thus a second comparison is required to
     * check for wrap-around.

If that's true, then we have one more test to do to clean up overflow conditions. Do we have any idea which platforms this is talking about? Sergey, does this match behavior you see or are you seeing only the on-tick aliasing described above?

nashif commented 7 years ago

by Sergey Arkhipov:

Andy, we are using a custom board with Intel Curie (Quark SE C1000) on it. As the performance is paramount in our application, we have to do thorough profiling of each and every section of code. That’s when we’ve faced the issue which was driving us crazy for a while, giving us deltas of around 2^32 cycles because of the issue. It happens consistently, every several minutes at best. The sample I’ve included with the issue description triggers the issue every millisecond or so. I’d assume that for x86 architecture in TICKLESS_IDLE kernel configuration disabled that we are using the timer is configured for auto-reloading mode, not the one-shot mode. So when it reaches zero, it gets reloaded to the initial value hardware-wise. Which is fine for us, otherwise it could be hiccups in the base hi-res timeline of the kernel occurring, kind of a ‘floating ground’, because timer ISR execution could be delayed due to interrupt disabled somewhere. But I have to check the code in loapic_timer.c just in case. Yes, we are using sys_cycle_get_32(), not k_cycle_get_32() as we are still on v1.5.0 by now.

nashif commented 7 years ago

by Sergey Arkhipov:

BTW, ARC architecture also has the same issue, not only x86 or ARM.

nashif commented 7 years ago

by Sergey Arkhipov:

Another side effect you haven’t mentioned is that the issue totally eliminates hi-res timings readouts for Zephyr. What is the point to call k_cycle_get_32() if it can’t guarantee the precision better than a system tick?

nashif commented 7 years ago

by Sergey Arkhipov:

Luckily, for x86 architecture, we can use _NanoTscRead(). But for the ARC we had to settle with system-tick resolution.

nashif commented 7 years ago

by Benjamin Walsh:

Andy Ross I was looking at this issue this morning, and I was coming up with a somewhat similar solution. I was looking at the arcv2_timer0 driver especially.

I don't think you need the ISR to do anything new: k_cycle_get_32() should be able to know by reading the accumulated_cycle_count and checking it has not changed. Also, we should be able to avoid the race condition you mentioned by locking the scheduler. Here's an example of what I am talking about:

uint32_t k_cycle_get_32(void)
{
        uint32_t cycles, accum[2], count;
        int cur = 0;

        k_sched_lock();

        accum[0] = accumulated_cycle_count;
        for (;;) {
                count = timer0_count_register_get();
                accum[!cur] = accumulated_cycle_count;

                if (accum[0] == accum[1]) {
                        cycles = accum[cur] + count;
                        break;
                }
                cur = !cur;
        }

        k_sched_unlock();

        return cycles;
}

accumulated_cycle_count has to be made volatile for this to work too, as mentioned in GH-1292.

IIRC, the target that had an issue was Simics, 2-3 years ago. Might have been fixed by now.

Sergey Arkhipov Yes, I was looking at the ARC timer0 driver, and I saw it has the same issue indeed.

nashif commented 7 years ago

by Benjamin Walsh:

Sergey Arkhipov Not sure why you're saying that this does not have sub-tick resolution ? The API still reads the hardware counter, which is normally much higher resolution than the system tick...

nashif commented 7 years ago

by Andy Ross:

Good point; your idea is cleaner and lacks the cross-thread race that mine did. One review-ish comment: do you really want the k_sched_lock() there? That's not a correctness requirement, it just makes it less likely to need the second iteration due to preemption. But the whole point is that the retry is going to be very rare, and easily detected. I think paying the extra cycles for the lock call (which I see is just the two function calls and dec/increment) isn't worth it, IMHO.

nashif commented 7 years ago

by Benjamin Walsh:

Andy Ross Yeah, I think you're right about not needing the sched_lock/unlock.

Instead of tracking "cur" as well, this is even cleaner:

uint32_t k_cycle_get_32(void)
{
        uint32_t accum[2], count;

        accum[0] = accumulated_cycle_count;
        for (;;) {
                count = timer0_count_register_get();
                accum[1] = accumulated_cycle_count;

                if (accum[0] == accum[1]) {
                        return accum[0] + count;
                }
                accum[0] = accum[1];
        }
}

nashif commented 7 years ago

by Sergey Arkhipov:

Benjamin Walsh Well, in our code I tried to make a workaround to beat non-monotonous timer behavior for users of k_cycle_get_32() function. I had to do something with deltas of about 2^32 magnitude:

t0 = k_cycle_get_32();

/* Some code */
…

t1 = k_cycle_get_32();

/* Now calculate the performance metrics for the code */

if (t1 >= t0) {
    diff = t1 - t0;
} else {
    /* Fix 32-bit timer overflow */
    diff = ULONG_MAX - t0 + t1;
    /* Perform workaround against bug in k_cycle_get_32() */
    if (diff >= (ULONG_MAX - sys_clock_hw_cycles_per_tick)) {
        diff = t0 - t1;
    }
}

And the diff value was floating, especially for the case when non-monotonous timer issue occurred. Probably this is for the cases when another task or/and some lengthy ISR running preempts current task for the period of more than one system tick. I believe this is more a conceptual issue. One would expect for the function like k_cycle_get_32() to be atomic in a sense that it returns underlying timer value AT THE TIME it gets called. But it is not the case as it can be pre-empted inside by ISR or another task which could be quite lengthy and non-monotonous timer issue makes things even worse. By the time k_cycle_get_32() returns it could be much water has flowed under the bridge.

nashif commented 7 years ago

by Benjamin Walsh:

Sergey Arkhipov

"One would expect for the function like k_cycle_get_32() to be atomic in a sense that it returns underlying timer value AT THE TIME it gets called."

But that is what it's supposed to do if not for that bug that we are trying to fix. If you don't protect against preemption while trying to compute an amount of time between two points in the code flow, all bets are off.

I still don't understand your comment about tick granularity though. On some implementations of k_cycle_get_32, basically on targets where the timer current counter is reloaded every tick, and thus has a period that is (way) smaller than what would fill 32-bits, we have an accumulator that gets updated every tick (in the timer ISR), and when the cycle value is needed, the hw register is read and added to the accumulator. That gives hw-frequency precision/granularity, not system tick granularity. On some other implementations that have a free running 64-bit counter, like the hpet, we can return the low 32-bits directly from the hardware. In both cases, there is overflow after 32-bits, so the behaviour is (in theory) the same.

nashif commented 7 years ago

by Sergey Arkhipov:

Benjamin Walsh Yes I never had issues with the function _NanoTscRead() which is kind of bare metal one. Probably this non-monotonous timer issue which makes k_cycle_get_32() function sometimes to return very inconsistent results has given me an idea that it can’t be trusted upon in regards of the precision. I’ll try the fix you’ve suggested. Hopefully, it resolves all the issues.

nashif commented 7 years ago

by M R Rosen:

Isnt the fundamental problem that the CCR register (at least when discussing x86 LOAPIC timer) is free running while accumulated_cycle_count depends on the interrupt to update correctly, thus, the issue arises when the CCR value is mismatched with the accumulated_cycle_count value (by using a accumulated_cycle_count value from the previous period but using the CCR value from the current period), the resulting value of the function fails to capture time correctly?As such, even with the suggested fix to k_cycle_get_32(), the function will not work correctly for subsequent calls within an ISR or a block of code where ISRs are blocked. If so, it would be important to make that note with regards to this function (that it may only work reliably when interrupts are enabled).

nashif commented 7 years ago

by Sergey Arkhipov:

M R Rosen Exactly. That’s why I would call the behavior of this function inherently non-atomic. It seems that even if we put ‘interrupt disabled’ brackets around the code to be profiled including the k_cycle_get32() calls, we can not get precise enough results. We can’t assume that we are only one accumulated_cycle_count ahead of the value read from the CCR. In applications with very fast system ticks (we currently have 1 ms system tick) and relatively slow code to be profiled it is possible that we have several timer interrupts requested while doing the metrics. In this case, it is not easy to recover actual delta value. The more reliable way is to use non-portable architecture dependent functions like _NanoTscRead() for x86. This one is inherently atomic.

nashif commented 7 years ago

by Sergey Arkhipov:

I did try the solution suggested for the sample hello-world application mentioned above and it improved the non-monotonous timer behaviour indeed. However, the k_cycle_get_32() function still sometimes returns wrong values. Instead of doing that every several milliseconds as before the fix applied, it does it on start-up and sometimes after PRINT (which are mapped on printk()) function calls. So the printk() calls act as the triggers for the issue now for the hello-world application. May be printk() does something with interrupts and thus affects k_cycle_get_32() non-monotonous timer misbehaviour?

This is the result:

Hello World! x86

diff = 4294935327 (t0 = 575974 : t1 = 544006); i = 5937

diff = 4294935327 (t0 = 959974 : t1 = 928006); i = 8818

diff = 4294935324 (t0 = 1311972 : t1 = 1280001); i = 11248

diff = 4294935327 (t0 = 1695974 : t1 = 1664006); i = 13996

diff = 34 (t0 = 4294967265 : t1 = 4); i = 56327349

diff = 34 (t0 = 4294967273 : t1 = 12); i = 169000964

diff = 4294935326 (t0 = 410088 : t1 = 378119); i = 169004111

diff = 4294935326 (t0 = 762088 : t1 = 730119); i = 169006167

diff = 4294935326 (t0 = 1146088 : t1 = 1114119); i = 169008643

diff = 30 (t0 = 4294967270 : t1 = 5); i = 450680420

diff = 4294935326 (t0 = 261608 : t1 = 229639); i = 450681653

diff = 4294935326 (t0 = 517608 : t1 = 485639); i = 450682449

diff = 4294935326 (t0 = 741608 : t1 = 709639); i = 450682826

diff = 30 (t0 = 4294967278 : t1 = 13); i = 563346314

diff = 4294935329 (t0 = 183012 : t1 = 151046); i = 563346481

diff = 4294935331 (t0 = 407010 : t1 = 375046); i = 563346858

Printouts with diff = 30..34 happen when we have 32-bit timer overflow every 134 seconds and are absolutely valid. As it can be seen, these printouts cause series of wrong diff values happening reflecting non-monotonous timer issue. It is interesting that the results shown above are 100% reproducible, with exactly the same values for diif, t0, t1, i at every loop iteration.

nashif commented 7 years ago

by Andy Ross:

I'm guessing (maybe) you're seeing the bug I found commented in the loapic driver: apparently the counter doesn't stop at zero and will keep going, so near zero you can read an underflowed value. If that's true, then it should (heh, famous last words) be sufficient just to treat the timer value as a signed quantify and in the subtraction, so that negative values get added to the accumulator.

Honestly that might be happening anyway though if the 2's complement math isn't otherwise doing tricky stuff, so I'm not sure. This gets hard in Jira. Can you post a patch to gerrit showing exactly what you're doing? A patch to 1.5 code will work fine to submit (obviously we won't merge be able to merge it), so no need to try to port anything.

nashif commented 7 years ago

by Sergey Arkhipov:

Andy Ross Andy, you might be right. It is not printk(). I removed it from the cycle:

void main(void)
{
    uint32_t t0, t1, diff, error = 0, overflow = 0;

    PRINT("Hello World! %s\n", CONFIG_ARCH);

    for (uint32_t i = 0; i < 500000000ul; i++) {
        t0 = sys_cycle_get_32();
        t1 = sys_cycle_get_32();

        if (t1 >= t0) {
            diff = t1 - t0;
        } else {
            diff = ULONG_MAX - t0 + t1; // Fixes 32-bit counter overflow
            if (diff > 1000ul)
            {
                error++;
            } else {
                overflow++;
            }
        }
    }
    PRINT("error = %u; overflow = %u\n", error, overflow);
}

and still have the next:

Hello World! x86

error = 5; overflow = 5

It looks like now with the fix applied we have the issue only when the 32-bit timer overflows.

nashif commented 7 years ago

by Sergey Arkhipov:

It looks like the fix suggested for k_cycle_get_32() fixes the issue for the ARC architecture. But it still occurs in x86 architecture.

nashif commented 7 years ago

by Andrew Boie:

Are we overthinking this?

1) I have no problem with documenting k_cycle_get_32() that it will not work reliably with interrupts locked. 2) I think we should make k_cycle_get_32() correctly return a monotonic value, even if that means locking interrupts briefly in k_cycle_get_32(), or rechecking values, declaring the accumulator volatile, etc. Correctness should trump performance for this function. 3) People who need high-performance timekeeping should be using the TSC or similar HW feature. 4) We need to fix the issue Andy identified in the LOAPIC driver as well.

nashif commented 7 years ago

by Andrew Boie:

I'm seeing what I suspect is a race condition, not mentioned in this discussion previously, on LOAPIC with CONFIG_TICKLESS_IDLE=n. I'm using Sergey Arkhipov 's test case. I'm running on an Arduino 101.

In this scenario, the timer is configured as periodic. The interrupt handler is very simple: it adds cycles_per_tick to the accumulated_cycle_count. In the test case, we make two calls to k_cycle_get_32() in succession.

What I am seeing occasionally is that accumulated_cycle_count is the same between the two calls (i.e. the interrupt handler hasn't run), but the CCR has reset itself for the next tick. So the second call to k_cycle_get_32() returns a smaller value. In every case, the delta is about 319900.

Is there an execution window between when CCR resets itself, and the interrupt fires?

This patch illustrates the problem: https://gerrit.zephyrproject.org/r/11245

Note that in the above patch, if I remove the irq_lock()s in main.c the problem still occurs, just the printouts aren't consistent since accumulated_cycle_count may have been updated. But there still seems to be a window where accumulated_cycle_count has not been updated, yet CCR has been.

Here is the output of my test. Note the acc values are the same. t0 is always acc + an amount just below cycles_per_tick, and t1 is acc plus a very small value.

***** BOOTING ZEPHYR OS v1.6.99 - BUILD: Feb 14 2017 22:58:17 *****
cycles_per_tick = 320000
diff = -319956 (t0 = 959963 : t1 = 640007); i = 9028 acc1 = 640000 acc2 = 640000
diff = -319956 (t0 = 1599965 : t1 = 1280009); i = 12925 acc1 = 1280000 acc2 = 1280000
diff = -319956 (t0 = 2239983 : t1 = 1920027); i = 16662 acc1 = 1920000 acc2 = 1920000
diff = -319956 (t0 = 2559974 : t1 = 2240018); i = 17105 acc1 = 2240000 acc2 = 2240000
diff = -319956 (t0 = 3839979 : t1 = 3520023); i = 27429 acc1 = 3520000 acc2 = 3520000
diff = -319956 (t0 = 5119977 : t1 = 4800021); i = 37753 acc1 = 4800000 acc2 = 4800000
diff = -319956 (t0 = 5759965 : t1 = 5440009); i = 41490 acc1 = 5440000 acc2 = 5440000
diff = -319956 (t0 = 6719965 : t1 = 6400009); i = 48520 acc1 = 6400000 acc2 = 6400000
diff = -319956 (t0 = 7999983 : t1 = 7680027); i = 58844 acc1 = 7680000 acc2 = 7680000
diff = -319956 (t0 = 8319982 : t1 = 8000026); i = 59287 acc1 = 8000000 acc2 = 8000000
diff = -319956 (t0 = 9279973 : t1 = 8960017); i = 66318 acc1 = 8960000 acc2 = 8960000
diff = -319956 (t0 = 9599980 : t1 = 9280024); i = 66761 acc1 = 9280000 acc2 = 9280000
diff = -319956 (t0 = 9919971 : t1 = 9600015); i = 67204 acc1 = 9600000 acc2 = 9600000
diff = -319956 (t0 = 10559970 : t1 = 10240014); i = 70941 acc1 = 10240000 acc2 = 10240000
diff = -319954 (t0 = 11519960 : t1 = 11200006); i = 77843 acc1 = 11200000 acc2 = 11200000
diff = -319956 (t0 = 12159976 : t1 = 11840020); i = 81451 acc1 = 11840000 acc2 = 11840000
diff = -319956 (t0 = 12479975 : t1 = 12160019); i = 81766 acc1 = 12160000 acc2 = 12160000
diff = -319956 (t0 = 12799966 : t1 = 12480010); i = 82081 acc1 = 12480000 acc2 = 12480000
diff = -319956 (t0 = 13439975 : t1 = 13120019); i = 85689 acc1 = 13120000 acc2 = 13120000
diff = -319956 (t0 = 14719973 : t1 = 14400017); i = 95885 acc1 = 14400000 acc2 = 14400000
diff = -319956 (t0 = 15359963 : t1 = 15040007); i = 99494 acc1 = 15040000 acc2 = 15040000
diff = -319956 (t0 = 16319979 : t1 = 16000023); i = 106396 acc1 = 16000000 acc2 = 16000000
diff = -319956 (t0 = 18559965 : t1 = 18240009); i = 126441 acc1 = 18240000 acc2 = 18240000
diff = -319956 (t0 = 19519982 : t1 = 19200026); i = 133311 acc1 = 19200000 acc2 = 19200000
diff = -319956 (t0 = 20159979 : t1 = 19840023); i = 136887 acc1 = 19840000 acc2 = 19840000
diff = -319954 (t0 = 20799960 : t1 = 20480006); i = 140463 acc1 = 20480000 acc2 = 20480000
diff = -319955 (t0 = 21119961 : t1 = 20800006); i = 140746 acc1 = 20800000 acc2 = 20800000
diff = -319956 (t0 = 21759979 : t1 = 21440023); i = 144323 acc1 = 21440000 acc2 = 21440000
diff = -319956 (t0 = 23039965 : t1 = 22720009); i = 154487 acc1 = 22720000 acc2 = 22720000
diff = -319956 (t0 = 24319965 : t1 = 24000009); i = 164651 acc1 = 24000000 acc2 = 24000000
diff = -319956 (t0 = 24639963 : t1 = 24320007); i = 164934 acc1 = 24320000 acc2 = 24320000
diff = -319954 (t0 = 25599960 : t1 = 25280006); i = 171804 acc1 = 25280000 acc2 = 25280000
diff = -319956 (t0 = 26879980 : t1 = 26560024); i = 181968 acc1 = 26560000 acc2 = 26560000
diff = -319956 (t0 = 28159976 : t1 = 27840020); i = 192132 acc1 = 27840000 acc2 = 27840000
diff = -319956 (t0 = 28479964 : t1 = 28160008); i = 192414 acc1 = 28160000 acc2 = 28160000
diff = -319956 (t0 = 29759965 : t1 = 29440009); i = 202578 acc1 = 29440000 acc2 = 29440000
diff = -319956 (t0 = 30719973 : t1 = 30400017); i = 209448 acc1 = 30400000 acc2 = 30400000
diff = -319956 (t0 = 31039970 : t1 = 30720014); i = 209731 acc1 = 30720000 acc2 = 30720000
diff = -319956 (t0 = 31359962 : t1 = 31040006); i = 210014 acc1 = 31040000 acc2 = 31040000
diff = -319956 (t0 = 31999981 : t1 = 31680025); i = 213591 acc1 = 31680000 acc2 = 31680000
diff = -319956 (t0 = 32959957 : t1 = 32640001); i = 220461 acc1 = 32640000 acc2 = 32640000
diff = -319954 (t0 = 33279960 : t1 = 32960006); i = 220744 acc1 = 32960000 acc2 = 32960000
diff = -319956 (t0 = 34879965 : t1 = 34560009); i = 234201 acc1 = 34560000 acc2 = 34560000
diff = -319956 (t0 = 35839963 : t1 = 35520007); i = 241071 acc1 = 35520000 acc2 = 35520000
errors = 44

In previous discussion it had been speculated that locking interrupts would solve all of this, but it appears to me that the reset of the CCR, and the firing of the interrupt, are decoupled. Locking interrupts around the body of k_cycle_get_32() is not fixing this for me.

nashif commented 7 years ago

by Andy Ross:

That doesn't sound 100% insane to me. Interrupt delivery is at least a little bit asynchronous always: there are inevitable hardware cycles between a device posting an interrupt or pulling the IRQ line down and the actual execution of the ISR (at the very least, the CPU needs to wait for the pipeline to clear). And in that time it's possible to read the state that caused the interrupt in the first place and see the values from "after" the interrupt was posted.

Now, this timer is actually on the interrupt controller, so it would actually be possible to "hold" the timer rollover until after the interrupt delivery was signaled by the hardware. But I'll bet it doesn't. And in general we can't rely on that anyway.

Isn't the solution just to treat the rollover as a valid condition, by e.g. doing the delta using signed math and allowing times from the "future" to be legal?

nashif commented 7 years ago

by Andrew Boie:

Andy Ross I'm not sure what you mean by signed math, there's no negative value in play. I don't see any way to detect this situation. With CCR being reset back to cycles_per_tick, for all intents and purposes we have moved back in time. I don't see a way to programmatically know "oh, CCR reset to initial value but the interrupt hasn't fired yet"

nashif commented 7 years ago

by Andy Ross:

Ah, OK. I understood that the timer had a wraparound behavior (documented as a "bug" in the loapic comments somewhere) and not a static reset. Yeah, that's a mess without some kind of extra clock (TSC maybe?) to reference against.

nashif commented 7 years ago

by Andy Ross:

I guess another possibility without using a second clock would be to define a "fudge window" around the top of the timer countdown where we "expect that the ISR hasn't fired". If you get a value in that range (which in your data above is never more than 44 cycles long), just busy-wait polling it until it gets low enough. That comes to a 0.03% chance that a call to k_cycle_get_32() will be delayed by 100 cycles or so. Maybe not too high a price.

nashif commented 7 years ago

by Andrew Boie:

It's hard to use the TSC in tandem with LOAPIC timer as they use different units and wrap around at different times. The TSC is in terms of CPU cycles, LOAPIC in terms of bus cycles.

Also I've seen the window as large as 500 cycles in some runs of the test, I think a 'fudge window' on the order of a 1000 cycles may be too large.

Andy Ross Benjamin Walsh I'm wondering if we should take a simpler approach and just have LOAPIC's k_cycle_get_32() simply return accumulated_cycle_count. It will drastically reduce the resolution, but I'm not sure there is a requirement for this function to be high resolution. x86 (and soon ARC) implement _tsc_read(), which is what I think people should be using for high-resolution use-cases like timekeeping, possibly promoted to a public kernel API.

alternatively, just use TSC for _k_cycle_get_32() and be done with it.

nashif commented 7 years ago

by M R Rosen:

Andrew Boie

I would recommend against having the function just return accumulated_cycle_count because that's now just going to get you a counter with kernel tick resolution which you already have an API for, right?

I think using TSC and other architecture-specific clocks might be the best way to implement what is promised by k_cycle_get_32(). I'm not sure there is a solid way to sure up the current mechanism unfortunately :(

nashif commented 7 years ago

by Andrew Boie:

M R Rosen agreed. I'm looking into a fix which will use the TSC.

nashif commented 7 years ago

by Andrew Boie:

This issue is also reproducible in QEMU on Nios 2 and RISCV. I can't reproduce on x86 and ARM in QEMU. However, on x86 this very easily reproduces on Arduino 101. I haven't tested on ARM boards yet. I think I'm going to need to feed my test case to TCF and see what we get.

Most timer drivers have the same design: they update an accumulator in the ISR, and k_cycle_get_32() returns the sum of the accumulator + the value in the timer's count register. I believe at this point that this design is fundamentally unsound.

It's also worth noting that this bug has been with us forever.

nashif commented 7 years ago

by Andrew Boie:

Also reproducible on ARC, tested on Arduino 101 ARC side. My expectation is that all implementations of k_cycle_get_32 that are based on an accumulated cycle count updated by the interrupt have this problem.

Kumar Gala Maureen Helm I have no ARM hardware at my disposal and all my colleagues that do are traveling internationally. If one of you has a chance, can you run this test https://gerrit.zephyrproject.org/r/#/c/11305/1 on an ARM board of your choosing and let me know the result? QEMU didn't report any issue, but QEMU didn't on x86 either I had to run it on a real board.

nashif commented 7 years ago

by M R Rosen:

Andrew Boie

If you have an Arduino 101, cant you flash the Nordic BLE core? Thats an ARM Cortex-M0( + ?) and should be supported by Zephyr 1.6+ (one module, 3 architectures :) )

nashif commented 7 years ago

by Andrew Boie:

M R Rosen aha, I did not know this. Will give that a try. Thanks!

nashif commented 7 years ago

by Maureen Helm:

The test passes on frdm_k64f (Cortex M4F @120 MHz), fails on frdm_kw41z (Cortex M0+ @40 MHz)

nashif commented 7 years ago

by Andrew Boie:

Maureen Helm thanks!

For drivers that re-arm the timer with every interrupt, this can typically be fixed by eliminating races such as locking interrupts around the body of k_cycle_get_32(). This fixed the problem on ARC.

However on ARM, this looks like a situation like on x86 and Nios II, where the timer is simply periodic and the interrupt just updates an accumulated cycle count and tells the system that a timer tick has elapsed. This introduces an unavoidable race where the periodic timer register could reset and be readable for a short period of time before the CPU gets the associated interrupt.

In this case, a second clock source is needed. On x86 we used the CPU timestamp counter (TSC). On Nios II you can use another timer IP block. On ARM it looks like the right thing to do is use the ARM Cycle Count Register (CCNT). I'm looking to see if CMSIS has APIs for this. Also need to figure out if CCNT frequency is the same as CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC.

I have patches up which are either merged or under review for all the other arches, so AFAIK ARM is the only arch left to fix.

nashif commented 7 years ago

by Andrew Boie:

M R Rosen I looked into the BLE CPU on Arduino 101, but "make flash" is unimplemented on it, and I could not find documentation on how to get an image onto the device https://wiki.zephyrproject.org/view/Supported_Boards.

Do you have any pointers for getting an image onto the Arduino 101's ARM CPU and capturing console output?

nashif commented 7 years ago

by Andrew Boie:

FIxes for all arches now merged.

nashif commented 7 years ago

by Sharron LIU:

verified cycle counter wrap-around working normal @x86 @arc @arm. Andrew Boie or Sergey Arkhipov , could you verify @nios2 @riscv32 @xtensa? Thanks.

nashif commented 7 years ago

by M R Rosen:

Andrew Boie

Sorry, I missed your comment earlier, but just for your own reference in the future: You can assign the UART port on the nrf51 series chip in Curie to any digital GPIO pins (you can see the Kconfig for the driver to see which flags control this). On the Arduino 101 however, the 3 digital pins from the SoC are not brought out (BLE_SDA, BLE_SCL, BT_GPIO can all be used for any purpose). You can use the interface with the Quark SE C1000 though to bridge the gap. Ie, put a simple pass-through application that takes anything received on UART_0 and prints it out on the x86 core and then leave the pins to the defaults for the arduino_101_ble board. Just note that due to defconfig issues, it was difficult for me to set the baudrate to the standard 115200 on both cores, the arduino_101_ble kinda forces it to be 1000000 I think. As for flashing, you can put images on the board by using dfu-util and flashing to alternate 8, ie ble_core. So, you can directly call "dfu-util -a ble_core -d 8087:0aba -D $(OUT_DIR)/zephyr.bin" to flash the core.

nashif commented 7 years ago

Duplicate of ZEP-1416

zephyriot / zep-jira14

Bug in Zephyr OS high-precision timings sub-system (function sys_cycle_get_32()) #1412

Configuration