_Swap on ARM is nonatomic by design, can we fix that?

andyross commented 5 years ago

The way _Swap() is implemented on ARM is nonatomic, which while not incorrect has turned out to be very surprising. It's done with a PendSV exception whose priority sits below that of hardware interrupts, so it's possible for a process to decide to context switch based on atomic state that then gets changed under the interrupt handler before the context switch actually happens.

This trick has resulted in three moderatly excruciating bug hunts so far (c.f. commits 41070c3b35f5 and 6c95dafd8260 and the current work submitted in PR #12448). In all honestly I doubt that's going to be the last of them.

There's a framework now which will help keep the workarounds (which so far haven't been complicated) tidy, which should help some.

But basically: how wedded are we to this architecture? How hard would it be and what would it break to set the PendSV exception to the maximum interrupt priority such that it can't be interrupted like this? That would match ARM's behavior to that of other systems (who do context switching with more typical musical-chairs register swaps and not in an exception handler), albeit at the cost of higher worst case latencies for high priority interrupts.

IMHO it would make for better reliability on ARM in the long term, and certainly would make my life easier.

andyross commented 5 years ago

@agross-linaro @andrewboie and anyone you want to add who might be interested. Mostly I'm interested in whoever jumps in with "No! you can't do that because..."

pabigot commented 5 years ago

More background from material I prepared before I saw this had been opened:

PR #12248 enhances the sys_dlist_t data structure to support detecting and diagnosing API mis-use, in particular removing a node from the same list twice. With this change the gap in interrupt blocking present in the ARM context switch produces in a failure in the case where (a) the current thread is being swapped because it has been aborted and removed from the run queue, but (b) the timeslicing infrastructure wakes up and attempts to remove it from the queue a second time before the PendSV handler is invoked.

The data structure enhancements found at least one unknown API violation, and have been approved on their own, but before they can be merged a fix for the root fault they expose needs to be in place.

A previous observation of this issue was resolved through a workaround in #8126 by @mike-scott. It recurred in the testing of #12248. Through coordinated effort the root cause was identified. A similar workaround in that PR was rejected in favor of a more complete fix, a sentiment I agree with in principle, but not for something of this complexity and lack of visibility.

@andyross has provided a fix currently attached to #12248. I have some concerns, but github won't let me review it in that context because I submitted the PR. So I've marked that PR DNM until this issue is properly fixed, or the local workaround restored.

agross-oss commented 5 years ago

The pendsv has historically always been the lowest priority so that it ends up tail chaining on to any irqs that occur. This means that multiple swap() requests could occur before the swap actually occurs due to multiple irqs calling the swap() mechanism. The actual swap won't occur until all of the irqs are handled and the pendsv is then called last due to it being the lowest prio.

If you look at the irq code in swap_helper.S, it disables irqs, then does the list manipulations (per your change in 401073c . So there is an opportunity for it to be interrupted, but only right before it locks the irqs. Once that is done, you won't be interrupting until it completes the actual swap operation. Are you worried there is another race condition?

Do you have a concrete example of this that occurs today? [never mind, i just read through 12248 fully].

what is this framework of workarounds?

agross-oss commented 5 years ago

The pendsv is configurable. But if we are going to go this route, why even use pendsv for context switch?

pabigot commented 5 years ago

@agross-linaro The framework of workarounds would be the last three patches @andyross pushed onto that PR, the last being this one. It makes the tests run, as does my smaller patch, and was expected to be merged by now, but I requested a delay so they can be reviewed properly in a distinct PR.

I suspect I could come up with strained examples where having an outdated current_ causes issues with non-timer commits, so I don't think a change that only fixes problems with ISRs that go through z_clock_announce is enough. I might be wrong about that. Your point about chaining ISRs has merit; is that how this operates on non-ARM architectures?

agross-oss commented 5 years ago

@pabigot Ok, it sounded so formal that I thought there was something more than just a few patches. Given your issue I can see how this is occurring. This takes multiple events happening that 1) cause a swap and 2) change the conditions that precipitated the swap before the actual swap occurs.

And no, no other architecture has this tail chaining behavior. If you search around a little, you'll find the pendsv is the preferred method for context switching for ARM processors. That doesn't mean we are forced to use it or even use it with the lowest priority. However, if we were thinking of bumping the priority I would suggest we just abandon it and come up with something else.

@galak Do you have any thoughts on this?

pabigot commented 5 years ago

This paper might be useful. Their results (which I have not reviewed) show software switching is faster than PendSV on Cortex-M3, at least as they implemented each.

agross-oss commented 5 years ago

@pabigot i don't doubt the switching comes in faster. The cost is slightly higher irq latency, which they don't show. The issue at heart here is the fact that we have an expectation that a swap is atomic. I say this because of the things we have had to do to 'fix' issues related to the tail chaining.

andyross commented 5 years ago

The issue at heart here is the fact that we have an expectation that a swap is atomic

To be clear: we don't. But the fact that we don't causes bugs, and I think is likely to continue to do so.

I mean, if you think about it it's a lower level violation of the basic idea of a condition variable: you want to check if some condition is true and then decide to block if it isn't. So in a cvar API, the "release the lock" and "go to sleep" steps are guaranteed by the system to be atomic. And the _Swap() API, which takes a key argument received from irq_lock(), is piggy backing on the same metaphor.

But in Zephyr, that can't be guaranteed on ARM -- something else may run after you've checked the state and before you are put to sleep, so you're not guaranteed that state isn't going to change. And in a handful of cases it has.

Personally... I dunno, I have a hard time believing that a non-atomic context switch can ever really be a good thing. It's just too dangerous in the long term for the benefit. Apps with super-tight latency requirements can certainly work with this kind of PendSV priority magic on their own for their own interrupts, but right now the exception sits below everything in the kernel.

pabigot commented 5 years ago

I also think a non-atomic context switch is a perversion of nature. Regardless, I certainly have an expectation that Zephyr be consistent in whether context switch is atomic, so if ARM is the only architecture where it isn't I think that should be changed.

We should make sure stakeholders who really care about IRQ latency on ARM systems get involved. Things like Bluetooth or 6lowpan time-synchronized network transactions are one case. I know both Nordic and EFx32 have hardware-triggered events that reduce the risk of timing violations, but not how they might be affected by a change. @carlescufi?

agross-oss commented 5 years ago

@andyross We're talking about the condition changing multiple times before the pendsv fires. And the only way i can think this would work properly is to make the swap atomic and just accept the fact that you might have multiple save/restores of the stack frame + PSP switching instead of just a single hw assisted swap. That's all you get out of tail chaining the PENDSV.

galak commented 5 years ago

@andyross We're talking about the condition changing multiple times before the pendsv fires. And the only way i can think this would work properly is to make the swap atomic and just accept the fact that you might have multiple save/restores of the stack frame + PSP switching instead of just a single hw assisted swap. That's all you get out of tail chaining the PENDSV.

My general feel is having _Swap be atomic makes sense, but I think we should get input from the Nordic guys as I wonder if that will impact timing for Bluetooth on low end M0 devices were they already have pretty tight deadlines.

agross-oss commented 5 years ago

You guys want to discuss this in the API meeting tomorrow? or the earlier mmu/mpu meeting?

carlescufi commented 5 years ago

CC @cvinayak @pizi-nordic

We need to understand what bringing this atomicity in would entail in terms of interrupt latency. How long would the interrupts be disabled for? As @agross-linaro mentions, we are already disabling interrupts today in order to perform the actual swap, so I am unsure whether the maximum time with interrupts disabled would increase by a significant amount. I assume this would lead to more swaps being executed though in the general case, and in particular with the Bluetooth stack generating a lot of events to be consumed by the threads. All the hard real-time sections of the Link Layer are executed without kernel intervention, so I assume there would be no issues on that side of things.

carlescufi commented 5 years ago

I know both Nordic and EFx32 have hardware-triggered events that reduce the risk of timing violations, but not how they might be affected by a change. @carlescufi?

The hardware triggered events are unfortunately not enough in the case of Bluetooth. An actual radio ISR needs to execute in order to meet the hard real-time deadlines.

carlescufi commented 5 years ago

This paper might be useful. Their results (which I have not reviewed) show software switching is faster than PendSV on Cortex-M3, at least as they implemented each.

Thanks for liking that, it's definitely interesting. The problem I think is that they don't consider interrupt latency worst times in that paper, only the actual time it takes to context switch. Also, some of the instructions they use (stmfd and ldmfd) are not available on Cortex-M0 IIRC

carlescufi commented 5 years ago

I discussed this with @cvinayak. The issue we see is that we'd lose something valuable: running swap() only once for all nested interrupts. If we switch from using PendSV to actually invoking swap() explicitly at ISR exit, there will be multiple context switches executed that will override each other instead of doing a single one like we do now at the end of the execution of all nested ISRs and just before returning to thread mode.

That said, as we see it there are two distinct places where a swap can occur:

In thread mode (_pend_current_thread())
At the end of an ISR execution (direct call to __swap() after the ISR execution or, in the case of arch/arm directly setting the PendSV bit in exc_exit.S)

In the case of 1. we don't see a problem in switching from using PendSV to an atomic in-place execution of the register swapping. In the case of 2. however, we would then incur in the added CPU execution time wasted in useless register swaps in case of multiple nested interrupts. So would it be possible to have 1. use an atomic, in-place register swap and leave 2. as-is?

carlescufi commented 5 years ago

You guys want to discuss this in the API meeting tomorrow? or the earlier mmu/mpu meeting?

Sure, if @pizi-nordic is around it would be great to discuss this in the API meeting today. I don't think that @cvinayak will be able to join due to the timezone difference.

pabigot commented 5 years ago

@carlescufi what is the hard realtime requirement for the radio ISR: i.e., how long can you tolerate a radio ISR being delayed?

cvinayak commented 5 years ago

@pabigot on our nRF51 Cortex M0 at 16 MHz clock, between trx the CPU used is ~130 us of the hard limit of 138 us. By experience, any latencies introduced in vectoring to Radio ISR breaks the controller. Only latencies of upto ~10us is tolerated as of today.

agross-oss commented 5 years ago

Let's target the API meeting for discussing this as this is more related to that instead of mmu/mpu/tee.

agross-oss commented 5 years ago

The cost on this would be at least 1 copy of stack frame information. If we have multiple IRQs requesting a swap() then it'd be multiple copies (as opposed to the single one occurring inside the tail chained pendsv handler). So the cost would be variable depending on how many irq's request swap before the normal thread context is run. It looks like the BT tolerance is pretty darn low and this concerns me.

andyross commented 5 years ago

The core takeaways from the meeting:

In principle, @agross-linaro sees no objection to implementing a more conventional context switch on ARM Cortex-M. He'll investigate as time is available.
The existing workaround in place are tolerable, and no other known bugs exist.
We should come up with measured performance numbers for how long a PendSV exception takes to execute, so we know what kind of interrupt latency impact such a change would have.

The discussion also showed why a naive change to make PendSV higher priority would not work well: ARM interrupts, on exit (in _ExcExit()) will check the current thread vs. the scheduler choice and set the PendSV flag if a context switch is needed. But unlike other architectures, they do this in nested interrupts too, so a stacked set of N interrupts that each invoked the scheduler and caused a (high priority) context switch would cause N-1 needless context switches on unwinding, when all we really want is one on exception exit. That's sort of a performance non-starter.

That is: we want PendSV to be "high priority" when triggered from userspace and "low priority" when flagged from an interrupt. It's probably possible to implement that somehow, but not without impact and complexity, at which point we might as well be investigating conventional context switching.

ioannisg commented 5 years ago

We should come up with measured performance numbers for how long a PendSV exception takes to execute, so we know what kind of interrupt latency impact such a change would have.

PendSV might take long to execute due to MPU re-programming, FPU save/restore, and possibly other things we do when we swap. I feel this rules out the possibility of raising PendSV prio to highest.

andrewboie commented 5 years ago

The issue we see is that we'd lose something valuable: running swap() only once for all nested interrupts. If we switch from using PendSV to actually invoking swap() explicitly at ISR exit, there will be multiple context switches executed that will override each other instead of doing a single one like we do now at the end of the execution of all nested ISRs and just before returning to thread mode.

I'm very confused by this assertion, I don't think this is true at all. You're not going to lose this, look at x86's code (or indeed, every arch except ARM). We track in the kernel the nested exception count, we only call swap once the last exception has been processed. We don't call swap at the end of every nested exception!

Also FWIW, for irqs that have extremely sensitive latencies we already support "direct" interrupts which bypass all the common interrupt code and just service the IRQ.

zephyrbot commented 9 months ago

Hi @agross-oss,

This issue, marked as an Enhancement, was opened a while ago and did not get any traction. Please confirm the issue is correctly assigned and re-assign it otherwise.

Please take a moment to review if the issue is still relevant to the project. If it is, please provide feedback and direction on how to move forward. If it is not, has already been addressed, is a duplicate, or is no longer relevant, please close it with a short comment explaining the reason.

Thanks!

zephyrproject-rtos / zephyr

_Swap on ARM is nonatomic by design, can we fix that? #12342