westerndigitalcorporation / zenfs

ZenFS is a storage backend for RocksDB that enables support for ZNS SSDs and SMR HDDs.
GNU General Public License v2.0

`AllocateZone` might mistakenly assume fully-written zone is available to use #35

Closed skyzh closed 3 years ago

skyzh commented 3 years ago

This is another issue caused by a race condition: it makes `AllocateZone` return `nullptr` in rare cases (in practice it can happen about once a day when the number of active zones is close to the limit).

Let's walk through this part of the code: https://github.com/westerndigitalcorporation/zenfs/blob/master/fs/zbd_zenfs.cc#L64. Assume that when we close a zone, active zones = open zones = 13, which is exactly the hardware limit. We compute the number of "idle zones" as `active_io_zones_ - open_io_zones_`. If idle zones > 0, `AllocateZone` should return a zone; otherwise it should block on the condition variable. The problem is that current ZenFS can return `nullptr` even when it believes idle zones > 0.

void Zone::CloseWR() {
  // assume current zone is full
  assert(open_for_write_);
  open_for_write_ = false;
  // active_io_zones_ = open_io_zones_ = 13, idle zone = 0
  if (Close().ok()) {
    zbd_->NotifyIOZoneClosed();
    // active_io_zones_ = 13, open_io_zones_ = 12, idle zone = 1
  }

  // assume some thread starts `AllocateZone` now...

  if (capacity_ == 0) zbd_->NotifyIOZoneFull();
  // active_io_zones_ = 12, open_io_zones_ = 12, idle zone = 0
}

Assume some thread receives the signal and starts `AllocateZone` between `NotifyIOZoneClosed` and `NotifyIOZoneFull`. Now 13 zones are active: 12 of them are open, and one is full. In `AllocateZone`,

  {
    std::unique_lock<std::mutex> lk(zone_resources_mtx_);
    zone_resources_.wait(lk, [this] {
      if (open_io_zones_.load() < max_nr_open_io_zones_) return true;
      return false;
    });
  }

We successfully pass this condition-variable wait and go ahead to look for a zone to allocate. But the only zone that is active yet not open is already full, so `AllocateZone` ends up returning `nullptr`. In other words, `AllocateZone` mistakenly assumes a fully-written zone is available for use.

A simple fix is to hold `zone_resources_mtx_` throughout `CloseWR`, instead of taking it twice inside `NotifyIOZone{Full|Closed}`, so that `AllocateZone` always sees a consistent state.

Please point out if I'm wrong, and thanks in advance for reviewing this!

yhr commented 3 years ago

Yeah, the allocator code deserves some simplification and refactoring, which is one of the things I'm planning to look into when I'm back from leave in the coming weeks. I think what we really want is for the allocator to just pick from a list of available zones, and to handle resets, finishes, and "available zones" list management separately from that. I hope you can help review that work. Cheers!

yhr commented 3 years ago

@skyzh : have you been able to verify that the fix mentioned resolves the issue? We might want to pull in a fix for this problem before we start the allocator refactoring.

skyzh commented 3 years ago

> @skyzh : have you been able to verify that the fix mentioned resolves the issue? We might want to pull in a fix for this problem before we start the allocator refactoring.

In our testing environment, the active and open zone counts were computed correctly throughout a day-long benchmark. I believe this PR could be merged before the refactoring starts.

skyzh commented 3 years ago

… and you may pull this patch https://github.com/bzbd/zenfs/pull/22

yhr commented 3 years ago

@skyzh : merged to master in #52 (i added the description you provided in this issue to the commit message) - cheers!