westerndigitalcorporation / zenfs

ZenFS is a storage backend for RocksDB that enables support for ZNS SSDs and SMR HDDs.
GNU General Public License v2.0

[Feature Request] We need a better ZoneAllocation implementation #36

Closed royguo closed 2 years ago

royguo commented 3 years ago

Background:

  1. WAL sync should be enabled by default.
  2. Multiple WAL files may be opened at the same time (only one WAL file can be written into)
  3. Over 14 files could be concurrently opened & written
  4. WAL throughput could be larger than 150MB/s (we have KV separation in TerarkDB, so you can lower this for RocksDB)
  5. Multiple column families (3~4 CFs)
  6. Key 16B, value 8KB
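
For context, a minimal RocksDB setup matching the workload above might look something like the sketch below. The path, column family names, and key/value sizes are illustrative only, and plugging ZenFS in as the storage backend is omitted.

```cpp
#include <rocksdb/db.h>

#include <string>
#include <vector>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.create_missing_column_families = true;

  // 3~4 column families, as in item 5 above.
  std::vector<rocksdb::ColumnFamilyDescriptor> cfs = {
      {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
      {"cf1", rocksdb::ColumnFamilyOptions()},
      {"cf2", rocksdb::ColumnFamilyOptions()},
      {"cf3", rocksdb::ColumnFamilyOptions()}};

  std::vector<rocksdb::ColumnFamilyHandle*> handles;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/zenfs-demo", cfs, &handles, &db);
  if (!s.ok()) return 1;

  // Item 1: WAL sync enabled for every write.
  rocksdb::WriteOptions wopts;
  wopts.sync = true;

  // Item 6: 16-byte keys, 8 KiB values.
  std::string key(16, 'k');
  std::string value(8 * 1024, 'v');
  db->Put(wopts, handles[1], key, value);

  for (auto* h : handles) db->DestroyColumnFamilyHandle(h);
  delete db;
  return 0;
}
```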

Problem Description:

Expectations:

yhr commented 3 years ago

Yes, this is something I've been thinking about as well. One solution I would like to try out is to prioritize WAL writes by reserving at least one available zone (empty, or without an active write) for WAL allocation. Or, perhaps even better, a configurable parameter that tells ZenFS the user wants at least N MBs of WAL space. There will be backpressure eventually if compaction can't keep up, but it should remove the latency spikes.
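
A rough sketch of that reservation idea (invented names, not the actual ZenFS allocator): non-WAL allocations back off once only the reserved zones remain free, so a WAL writer never has to wait for a reset or finish to complete.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical types for illustration; they do not match the real ZenFS code.
struct Zone {
  bool used = false;
  bool has_active_write = false;
};

class ZoneAllocator {
 public:
  ZoneAllocator(std::size_t zone_count, std::size_t wal_reserve)
      : zones_(zone_count), wal_reserve_(wal_reserve) {}

  // Returns a free zone, or nullptr if the caller should wait.
  // Non-WAL allocations are refused once only the reserved zones remain,
  // keeping at least `wal_reserve_` zones ready for WAL files.
  Zone* Allocate(bool for_wal) {
    std::lock_guard<std::mutex> lk(mu_);
    std::size_t free_zones = 0;
    Zone* candidate = nullptr;
    for (auto& z : zones_) {
      if (!z.used && !z.has_active_write) {
        ++free_zones;
        if (candidate == nullptr) candidate = &z;
      }
    }
    if (candidate == nullptr) return nullptr;
    if (!for_wal && free_zones <= wal_reserve_) return nullptr;
    candidate->used = true;
    return candidate;
  }

 private:
  std::mutex mu_;
  std::vector<Zone> zones_;
  std::size_t wal_reserve_;  // e.g. 1 zone, or derived from "N MBs of WAL space"
};
```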

I don't know what you mean by a random read zone; all zones can be randomly read. Could you explain?

royguo commented 3 years ago

Hi, @yhr

We've patched a simple solution for now:

Sorry for the random read question, I misremembered. I thought we had to open a zone before reading from it, LOL, just ignore it.

royguo commented 3 years ago

Besides that, we also found that RocksDB does not close WAL files explicitly by default, which can tie up too many open-zone resources during heavy benchmark workloads.

In our case, we found almost 6~7 WALs open at the same time (though only one of them can be written to); the other WAL files stay open until RocksDB syncs/closes them in the background (we just fixed this in TerarkDB).

RocksDB has the same problem (WAL files are not closed immediately), but it doesn't matter on ext4 since open files consume no extra resources there.

But this is not the highest priority; we handled it with a hacky fix in RocksDB/TerarkDB for now.

We have two temporary solutions that both work for us; we selected the first one for now:

skyzh commented 3 years ago

FYI, this is one of our attempts to reduce allocation latency. https://github.com/bzbd/zenfs/pull/19

yhr commented 3 years ago

Thanks, I'm still off on paternal leave, but I'm hoping to look over all the reported issues (and the suggested fixes) properly and figure out the best way forward. I think RocksDB should be able to close the WALs not being written to (so that may be a bug in RocksDB). I plan to address the latency issue in the allocator by introducing a background thread. Let's keep this issue open until I've created new issues discussing and tracking that work.

yhr commented 3 years ago

Thanks!

royguo commented 3 years ago

@yhr A background thread seems like the right long-term solution, please go ahead. (Our solution, for now, simply rearranges the allocation locks.)

And even if we allocate zones in a background thread, we should still make sure WAL files get higher priority when taking a zone from the background zone queue.

On the RocksDB WAL file problem: RocksDB doesn't close the file immediately in the foreground but leaves it to a background flush thread (which periodically closes all obsolete WAL files). I don't think it is a bug, because when WAL sync = false there really is no need to close the file immediately. Just FYI.

Since we always need WAL sync = true, we moved the WAL close() logic to the foreground and let the original background thread skip the WAL close action. But this should be considered in your solution, since the problem makes RocksDB keep a lot of WALs open for quite a few seconds.
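
A conceptual sketch of that change, with invented names (the real fix lives inside RocksDB/TerarkDB internals): when the WAL is synced on every write, the old WAL can be closed in the foreground at switch time instead of waiting for the periodic background pass, so it stops pinning an open zone.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the RocksDB internals discussed above.
struct WalFile {
  void Sync() { /* flush pending WAL data */ }
  void Close() { /* releases the ZenFS open/active zone resource */ }
};

struct WalManager {
  std::unique_ptr<WalFile> current;
  std::vector<std::unique_ptr<WalFile>> obsolete;  // awaiting background close
  bool sync_wal = true;  // the issue assumes WAL sync is always on

  void SwitchWal() {
    if (current) {
      if (sync_wal) {
        // Foreground path (the change described above): the old WAL is
        // already durable, so close it right away and free its zone.
        current->Sync();
        current->Close();
      } else {
        // Stock behavior: leave the close to the periodic background thread.
        obsolete.push_back(std::move(current));
      }
    }
    current = std::make_unique<WalFile>();
  }
};
```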

Again, no hurry since we've fixed it temporarily. Good luck.

skyzh commented 3 years ago

By the way, the latency problem might be easier to observe when:

yhr commented 3 years ago

My plan is to do this work in three steps:

  1. Refactor the allocator function, separating the zone management and allocation functionality (the allocator would then just pick from a list of available zones, or wait if there are none at the moment)
  2. Move zone management (resets, finishes) to a background thread, fixing the latency problem
  3. Add WAL allocation prioritization (keep N zones in the available list for WALs)
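
A minimal sketch of how those three steps could fit together, with invented names and none of the actual device calls; this is not the implementation that eventually landed, just an illustration of the direction.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Zone {};  // placeholder for a real zone descriptor

class ZoneProvider {
 public:
  // Step 3: keep `wal_reserve` zones on the available list for WAL files only.
  ZoneProvider(std::size_t zone_count, std::size_t wal_reserve)
      : pool_(zone_count), wal_reserve_(wal_reserve) {
    for (auto& z : pool_) needs_reset_.push_back(&z);
    bg_ = std::thread([this] { BackgroundWork(); });
  }

  ~ZoneProvider() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    bg_.join();
  }

  // Step 1: allocation is just picking from the available list, or waiting.
  Zone* Allocate(bool for_wal) {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [&] {
      return stop_ || available_.size() > (for_wal ? 0 : wal_reserve_);
    });
    if (stop_) return nullptr;
    Zone* z = available_.front();
    available_.pop_front();
    return z;
  }

  // Called when a zone becomes full or its data is obsolete.
  void Release(Zone* z) {
    std::lock_guard<std::mutex> lk(mu_);
    needs_reset_.push_back(z);
    cv_.notify_all();
  }

 private:
  // Step 2: resets/finishes run here, off the foreground write path, and
  // refill the available list so writers never wait on a device command.
  void BackgroundWork() {
    std::unique_lock<std::mutex> lk(mu_);
    while (!stop_) {
      while (!needs_reset_.empty()) {
        Zone* z = needs_reset_.front();
        needs_reset_.pop_front();
        // The real code would issue the zone reset/finish command here.
        available_.push_back(z);
      }
      cv_.notify_all();
      cv_.wait_for(lk, std::chrono::milliseconds(10));
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<Zone> pool_;
  std::deque<Zone*> available_;
  std::deque<Zone*> needs_reset_;
  bool stop_ = false;
  std::size_t wal_reserve_;
  std::thread bg_;
};
```
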
yhr commented 2 years ago

The allocator has been reworked as part of https://github.com/westerndigitalcorporation/zenfs/pull/114 , closing this :)