westerndigitalcorporation / zenfs

ZenFS is a storage backend for RocksDB that enables support for ZNS SSDs and SMR HDDs.
GNU General Public License v2.0

[Feature Request] We need a better ZoneAllocation implementation #36

Closed royguo closed 2 years ago

royguo commented 3 years ago

Background:

  1. WAL sync should be enabled by default.
  2. Multiple WAL files may be opened at the same time (only one WAL file can be written into)
  3. Over 14 files could be concurrently opened & written
  4. WAL throughput could be larger than 150MB/s (we have KV separation in TerarkDB, so you can lower this for RocksDB)
  5. Multiple column families (3~4 CFs)
  6. Key 16B, value 8KB
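
For context, a minimal RocksDB setup matching the workload above might look something like the sketch below. The path, column family names, and key/value sizes are illustrative only, and plugging ZenFS in as the storage backend is omitted.

```cpp
#include <rocksdb/db.h>

#include <string>
#include <vector>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.create_missing_column_families = true;

  // 3~4 column families, as in item 5 above.
  std::vector<rocksdb::ColumnFamilyDescriptor> cfs = {
      {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
      {"cf1", rocksdb::ColumnFamilyOptions()},
      {"cf2", rocksdb::ColumnFamilyOptions()},
      {"cf3", rocksdb::ColumnFamilyOptions()}};

  std::vector<rocksdb::ColumnFamilyHandle*> handles;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/zenfs-demo", cfs, &handles, &db);
  if (!s.ok()) return 1;

  // Item 1: WAL sync enabled for every write.
  rocksdb::WriteOptions wopts;
  wopts.sync = true;

  // Item 6: 16-byte keys, 8 KiB values.
  std::string key(16, 'k');
  std::string value(8 * 1024, 'v');
  db->Put(wopts, handles[1], key, value);

  for (auto* h : handles) db->DestroyColumnFamilyHandle(h);
  delete db;
  return 0;
}
```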

Problem Description:

Expectations:

yhr commented 3 years ago

Yes, this is something I've been thinking about as well. One solution I would like to try out is to prioritize WAL writes by reserving at least one available zone (empty, or without an active write) for WAL allocation. Or, perhaps even better, a configurable parameter that tells ZenFS the user wants at least N MBs of WAL space. There will be backpressure eventually if compaction can't keep up, but it should remove the latency spikes.
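
A rough sketch of that reservation idea (invented names, not the actual ZenFS allocator): non-WAL allocations back off once only the reserved zones remain free, so a WAL writer never has to wait for a reset or finish to complete.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical types for illustration; they do not match the real ZenFS code.
struct Zone {
  bool used = false;
  bool has_active_write = false;
};

class ZoneAllocator {
 public:
  ZoneAllocator(std::size_t zone_count, std::size_t wal_reserve)
      : zones_(zone_count), wal_reserve_(wal_reserve) {}

  // Returns a free zone, or nullptr if the caller should wait.
  // Non-WAL allocations are refused once only the reserved zones remain,
  // keeping at least `wal_reserve_` zones ready for WAL files.
  Zone* Allocate(bool for_wal) {
    std::lock_guard<std::mutex> lk(mu_);
    std::size_t free_zones = 0;
    Zone* candidate = nullptr;
    for (auto& z : zones_) {
      if (!z.used && !z.has_active_write) {
        ++free_zones;
        if (candidate == nullptr) candidate = &z;
      }
    }
    if (candidate == nullptr) return nullptr;
    if (!for_wal && free_zones <= wal_reserve_) return nullptr;
    candidate->used = true;
    return candidate;
  }

 private:
  std::mutex mu_;
  std::vector<Zone> zones_;
  std::size_t wal_reserve_;  // e.g. 1 zone, or derived from "N MBs of WAL space"
};
```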

I don't know what you mean by a random read zone; all zones can be randomly read. Could you explain?

royguo commented 3 years ago

Hi, @yhr

We've patched a simple solution for now:

Sorry for the random read question, I misremembered. I thought we had to open a zone before reading from it, LOL, just ignore it.

royguo commented 3 years ago

Besides that, we also found that RocksDB does not close WAL files explicitly by default, which can tie up too many open-zone resources during heavy benchmark workloads.

In our case, we found almost 6~7 WALs open at the same time (though only one of them can be written to); the other WAL files stay open until RocksDB syncs/closes them in the background (we just fixed this in TerarkDB).

RocksDB has the same problem (WAL files are not closed immediately), but it doesn't matter on ext4 since open files consume no extra resources there.

But this is not the highest priority; we handled it with a hacky fix in RocksDB/TerarkDB for now.

We have two temporary solutions that both work for us; we selected the first one for now:

skyzh commented 3 years ago

FYI, this is one of our attempts to reduce allocation latency. https://github.com/bzbd/zenfs/pull/19

yhr commented 3 years ago

Thanks, I'm still off on paternal leave, but I'm hoping to look over all the reported issues (and the suggested fixes) properly and figure out the best way forward. I think RocksDB should be able to close the WALs not being written to (so that may be a bug in RocksDB). I plan to address the latency issue in the allocator by introducing a background thread. Let's keep this issue open until I've created new issues discussing and tracking that work.

yhr commented 3 years ago

Thanks!

royguo commented 3 years ago

@yhr A background thread seems like the right long-term solution, please go ahead. (Our solution, for now, simply rearranges the allocation locks.)

And even if we allocate zones in a background thread, we should still make sure WAL files get higher priority when taking a zone from the background zone queue.

On the RocksDB WAL file problem: RocksDB doesn't close the file immediately in the foreground but leaves it to a background flush thread (which periodically closes all obsolete WAL files). I don't think it is a bug, because when WAL sync = false there really is no need to close the file immediately. Just FYI.

Since we always need WAL sync = true, we moved the WAL close() logic to the foreground and let the original background thread skip the WAL close action. But this should be considered in your solution, since the problem makes RocksDB keep a lot of WALs open for quite a few seconds.
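
A conceptual sketch of that change, with invented names (the real fix lives inside RocksDB/TerarkDB internals): when the WAL is synced on every write, the old WAL can be closed in the foreground at switch time instead of waiting for the periodic background pass, so it stops pinning an open zone.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the RocksDB internals discussed above.
struct WalFile {
  void Sync() { /* flush pending WAL data */ }
  void Close() { /* releases the ZenFS open/active zone resource */ }
};

struct WalManager {
  std::unique_ptr<WalFile> current;
  std::vector<std::unique_ptr<WalFile>> obsolete;  // awaiting background close
  bool sync_wal = true;  // the issue assumes WAL sync is always on

  void SwitchWal() {
    if (current) {
      if (sync_wal) {
        // Foreground path (the change described above): the old WAL is
        // already durable, so close it right away and free its zone.
        current->Sync();
        current->Close();
      } else {
        // Stock behavior: leave the close to the periodic background thread.
        obsolete.push_back(std::move(current));
      }
    }
    current = std::make_unique<WalFile>();
  }
};
```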

Again, no hurry since we've fixed it temporarily. Good luck.

skyzh commented 3 years ago

By the way, the latency problem might be easier to observe when:

yhr commented 3 years ago

My plan is to do this work in three steps:

  1. Refactor the allocator function, separating the zone management and allocation functionality (the allocator would then just pick from a list of available zones, or wait if there are none at the moment)
  2. Move zone management (resets, finishes) to a background thread, fixing the latency problem
  3. Add WAL allocation prioritization (keep N zones in the available list for WALs)
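
A minimal sketch of how those three steps could fit together, with invented names and none of the actual device calls; this is not the implementation that eventually landed, just an illustration of the direction.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Zone {};  // placeholder for a real zone descriptor

class ZoneProvider {
 public:
  // Step 3: keep `wal_reserve` zones on the available list for WAL files only.
  ZoneProvider(std::size_t zone_count, std::size_t wal_reserve)
      : pool_(zone_count), wal_reserve_(wal_reserve) {
    for (auto& z : pool_) needs_reset_.push_back(&z);
    bg_ = std::thread([this] { BackgroundWork(); });
  }

  ~ZoneProvider() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    bg_.join();
  }

  // Step 1: allocation is just picking from the available list, or waiting.
  Zone* Allocate(bool for_wal) {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [&] {
      return stop_ || available_.size() > (for_wal ? 0 : wal_reserve_);
    });
    if (stop_) return nullptr;
    Zone* z = available_.front();
    available_.pop_front();
    return z;
  }

  // Called when a zone becomes full or its data is obsolete.
  void Release(Zone* z) {
    std::lock_guard<std::mutex> lk(mu_);
    needs_reset_.push_back(z);
    cv_.notify_all();
  }

 private:
  // Step 2: resets/finishes run here, off the foreground write path, and
  // refill the available list so writers never wait on a device command.
  void BackgroundWork() {
    std::unique_lock<std::mutex> lk(mu_);
    while (!stop_) {
      while (!needs_reset_.empty()) {
        Zone* z = needs_reset_.front();
        needs_reset_.pop_front();
        // The real code would issue the zone reset/finish command here.
        available_.push_back(z);
      }
      cv_.notify_all();
      cv_.wait_for(lk, std::chrono::milliseconds(10));
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<Zone> pool_;
  std::deque<Zone*> available_;
  std::deque<Zone*> needs_reset_;
  bool stop_ = false;
  std::size_t wal_reserve_;
  std::thread bg_;
};
```
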
yhr commented 2 years ago

The allocator has been reworked as part of https://github.com/westerndigitalcorporation/zenfs/pull/114 , closing this :)