Closed skyzh closed 3 years ago
Yeah, well, the allocator code deserves some simplification and refactoring, which is one of the things I'm planning to look into when I'm back from my leave in the coming weeks. I think what we really want is for the allocator to just pick from a list of available zones, and to handle the resets, the finishes, and the management of the "available zones" list(s) separately from that. I hope you can help review that work. Cheers!
@skyzh : have you been able to verify that the fix mentioned resolves the issue? We might want to pull in a fix for this problem before we start the allocator refactoring.
In our testing environment, the active zone and open zone counts were calculated correctly throughout a day-long benchmark. I believe this PR could be merged before the refactor starts.
… and you may pull this patch https://github.com/bzbd/zenfs/pull/22
@skyzh : merged to master in #52 (I added the description you provided in this issue to the commit message) - cheers!
This is also an issue caused by a race condition, and it makes `AllocateZone` return `nullptr` in rare cases (though in practice it might happen once a day when the number of active zones is close to the limit).

Let's walk through this part of the code: https://github.com/westerndigitalcorporation/zenfs/blob/master/fs/zbd_zenfs.cc#L64. Assume that when we close a zone, active zones = open zones = 13, which is exactly the hardware limit. We calculate the number of "idle zones" as `active_io_zones_ - open_io_zones_`. If "idle zones" > 0, then `AllocateZone` should return a zone. Otherwise, `AllocateZone` should block on the condition variable. The problem is that current ZenFS can return `nullptr` when it thinks that "idle zones" > 0.

Assume some thread receives the signal and starts allocating a zone between `NotifyIOZoneClosed` and `NotifyIOZoneFull`. At that point we have 13 active zones: 12 of them are open, and one is full. `AllocateZone` gets past the condition variable and goes ahead to find a zone to allocate. But there is only one zone that is active but not open, and that zone is already full, so `AllocateZone` ends up returning `nullptr`. In other words, `AllocateZone` might mistakenly assume that a fully written zone is available for use.

A simple fix is to hold `zone_resources_mtx_` throughout the `CloseWR` function, instead of taking it twice, once inside each of `NotifyIOZone{Full|Closed}`. That way, `AllocateZone` always sees a consistent state.

Please point it out if I'm wrong, and thanks in advance for reviewing this!
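
To make the window concrete, here is a minimal single-threaded sketch of the accounting described above. It is not the ZenFS code: `ZoneAccounting`, `CloseFullZoneTwoLocks`, `CloseFullZoneOneLock`, and `ObserveIdle` are hypothetical names, and the `between` callback only stands in for another thread being scheduled after the first lock is released.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Hypothetical, simplified model of the counters discussed in this issue.
// Both start at 13, the hardware limit assumed in the description.
struct ZoneAccounting {
  std::mutex mtx;
  int active_io_zones = 13;
  int open_io_zones = 13;
};

// What an allocating thread would compute under the lock:
// zones that are active but not open.
int ObserveIdle(ZoneAccounting& z) {
  std::lock_guard<std::mutex> lock(z.mtx);
  return z.active_io_zones - z.open_io_zones;
}

// Racy variant: closing a full zone updates each counter under a separate
// lock acquisition. `between` models another thread running in the gap.
void CloseFullZoneTwoLocks(ZoneAccounting& z,
                           const std::function<void()>& between) {
  {
    std::lock_guard<std::mutex> lock(z.mtx);
    z.open_io_zones--;  // the "zone closed" notification
  }
  between();  // the lock is free here, so an allocator can look at the state
  {
    std::lock_guard<std::mutex> lock(z.mtx);
    z.active_io_zones--;  // the "zone full" notification
  }
}

// Fixed variant: one lock held across both updates, so no other thread
// can observe the intermediate state.
void CloseFullZoneOneLock(ZoneAccounting& z) {
  std::lock_guard<std::mutex> lock(z.mtx);
  z.open_io_zones--;
  z.active_io_zones--;
}
```

Starting from 13 active and 13 open zones, an observer running in the gap of the two-lock variant computes one idle zone even though that zone is full. With the single lock, the counters go from 13/13 to 12/12 and no observer can see the intermediate state.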