xapi-project / sm

Xapi Project storage managers
GNU Lesser General Public License v2.1
21 stars 91 forks source link

CA-386316 Fix race condition between sr_detach and GC #664

Closed rdn32 closed 8 months ago

rdn32 commented 10 months ago

The problem showed up with NFSSR, but other SR types work the same way: the detach method was causing any active GC for an SR to abort, before going on to the rest of work of detaching; but there was nothing preventing a new GC starting in between. The change is to prevent the GC lock for an SR being acquired (as the GC need to do in order to become active) whilst something else (such as an sr_detach operation) is holding the SR lock; and (in case the GC does this just after the SR has been detached) it avoids doing anything if the SR is no longer plugged in.

It might have been sufficient to guard the acquisition of the GC lock like this in _gcLoop, but I decided to wrap the GC lock so that any acquisition of it requires the SR lock momentarily be acquired.

rdn32 commented 10 months ago

I don't know whether you already use read-write locks or not

The sm code implements distinct WriteLock and ReadLock classes, but as far as I can see it never uses the latter.

I think currently an unplug operation may fail if a GC is active

I thought that was the point of aborting GC as part of detaching. (That said, I've not yet tried to understand the mechanism used to do this, and I have no idea how reliable it is.)

MarkSymsCtx commented 10 months ago

I think currently an unplug operation may fail if a GC is active

I thought that was the point of aborting GC as part of detaching. (That said, I've not yet tried to understand the mechanism used to do this, and I have no idea how reliable it is.)

Indeed, the expectation is that an SR detach/unplug shall terminate any in progress GC activity, which the code attempted to do but didn't hold the relevant locks for long enough to prevent things kicking off while the detach was happening.

MarkSymsCtx commented 10 months ago

The change to cleanup is fine.

I'm less happy about the unit test changes. It would be preferable I think to avoid needing to write a script out to storage and then execute it but i'm not sure if even using a fork as per https://github.com/xapi-project/sm/blob/master/drivers/cleanup.py#L2867 and then having the two processes play with locks would be sufficiently reliable and safe?

rdn32 commented 10 months ago

I'm less happy about the unit test changes. It would be preferable I think to avoid needing to write a script out to storage and then execute it

I wrote it like that, rather than using mocking, was that I thought the most likely source of bugs was me misunderstanding what other bits of code, specifically Lock, were doing. Now I know it passes in this form, I'm happy to rework it to be more straightforward.