Closed: yhr closed this issue 2 years ago
The metadata zone is mostly written to by WAL syncs. What about storing WALs as logs in separate zones?
e.g. we could reserve a zone for the WAL. On each WAL sync, we write size + CRC + data to that zone, without touching the metadata zone. When recovering from an error, we can simply scan the WAL zone.
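A minimal sketch of the record format proposed above, appending size + CRC + data so recovery is a simple forward scan. The names and the CRC choice are illustrative assumptions, not the ZenFS API:

```cpp
// Hypothetical WAL-zone record: [u32 size][u32 crc][data], appended
// sequentially. Recovery just walks the zone from the start.
#include <cstdint>
#include <cstring>
#include <vector>

// Minimal bitwise CRC32 (reflected, polynomial 0xEDB88320).
uint32_t Crc32(const uint8_t* p, size_t n) {
  uint32_t c = 0xFFFFFFFFu;
  for (size_t i = 0; i < n; i++) {
    c ^= p[i];
    for (int k = 0; k < 8; k++)
      c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
  }
  return c ^ 0xFFFFFFFFu;
}

// Append one record onto the (in-memory stand-in for the) WAL zone.
void AppendWalRecord(std::vector<uint8_t>& zone,
                     const uint8_t* data, uint32_t size) {
  uint32_t crc = Crc32(data, size);
  uint8_t hdr[8];
  std::memcpy(hdr, &size, 4);
  std::memcpy(hdr + 4, &crc, 4);
  zone.insert(zone.end(), hdr, hdr + 8);
  zone.insert(zone.end(), data, data + size);
}
```

On mount, a scanner would read each header, verify the CRC against the payload, and stop at the first record that fails to validate.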
... and I think this requires a more careful design. What if the zone we are writing the superblock and snapshot to is full?
@skyzh : A sparse file format is a good idea too. I've actually implemented an extension to the rocksdb WAL format that pads the WAL to be block-size aligned before sync requests (for a different reason: I used the same file format for writing the metadata log).
If every write is block-aligned and contains the number of bytes padded, we don't need to sync metadata, and we can recover easily by parsing the file up to the write pointer. We only need to write file metadata when allocating a new extent. This is extra cool because a conventional SSD can't do this :)
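The padding scheme described above can be sketched as follows. This is an assumption-laden illustration (the struct, field names, and 4 KiB block size are hypothetical), not the actual RocksDB patch: each buffered write carries a header recording the real payload length, and the whole write is rounded up to a block boundary so the device write pointer stays block-aligned.

```cpp
// Sketch: pad every buffered write to the block size and record the real
// payload length in the header, so replay can skip the padding without
// any metadata sync.
#include <cstdint>

constexpr uint32_t kBlockSize = 4096;  // illustrative device block size

// Round a length up to the next block boundary.
uint32_t AlignUp(uint32_t n) {
  return (n + kBlockSize - 1) / kBlockSize * kBlockSize;
}

struct PaddedWriteHeader {
  uint32_t payload_size;  // real bytes carried by this write
  uint32_t padded_size;   // header + payload + pad, always block-aligned
};

// Build the header for a buffered write of `payload` bytes.
PaddedWriteHeader MakeHeader(uint32_t payload) {
  PaddedWriteHeader h;
  h.payload_size = payload;
  h.padded_size = AlignUp(payload + sizeof(PaddedWriteHeader));
  return h;
}
```

On recovery, the parser reads `payload_size` bytes of data and then skips `padded_size - payload_size - sizeof(header)` bytes of padding to reach the next header.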
I decided not to upstream the rocksdb patches, as there was no clear performance benefit at the time and zenfs would have been the only user.
I think we could make buffered files default to having extent info written in-line, and only sync when we need to switch to a new zone (or close the file). I'll dig into the code to see if that is feasible.
IIRC, zenfs syncs metadata for SST files only after they are closed, so I don't expect inlined SST extents to bring much benefit. But inlined WAL metadata looks more general and promising; it only costs us a zone scan on startup, which I think is not a big problem.
But I should point out that the most significant drawback comes from active-zone contention, not the writes themselves… so inlined WAL metadata is a "nice to have". The first priority for solving this completely is a better zone management strategy, LOL
The allocator needs to be improved for sure, but as you have a workaround for that at the moment, I'll focus on this issue first, then rework the allocator. Once we've fixed both of these issues, you should be able to re-align with zenfs master. Does that make sense @royguo ?
If we store all extent starts, we don't have to store any metadata while writing to the active zone. When we need to switch to a new zone, we would store all the extents (for fast recovery). For buffered writes (like the WAL) we need to add a header to each write (describing how much data was actually flushed). For non-buffered (direct) writes we don't need this, as all writes are block-aligned.
By doing this we would only need to scan the latest written zone up to the write pointer to reconstruct the file extents when we mount.
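The mount-time scan described above could look roughly like this. The record layout ([u32 size][data]) and all names are assumptions for illustration; the real implementation would also validate CRCs and padding as discussed earlier in the thread:

```cpp
// Sketch: scan only the last active zone up to the write pointer,
// rebuilding one extent per record header found.
#include <cstdint>
#include <cstring>
#include <vector>

struct Extent {
  uint64_t offset;  // byte offset of the payload within the zone
  uint64_t length;  // payload length
};

// Walk `zone` from 0 to `write_pointer`; stop at a zero-length or
// torn (truncated) record, which marks the valid tail of the log.
std::vector<Extent> RecoverExtents(const std::vector<uint8_t>& zone,
                                   size_t write_pointer) {
  std::vector<Extent> extents;
  size_t pos = 0;
  while (pos + 4 <= write_pointer) {
    uint32_t size;
    std::memcpy(&size, zone.data() + pos, 4);
    if (size == 0 || pos + 4 + size > write_pointer) break;  // torn tail
    extents.push_back({pos + 4, size});
    pos += 4 + size;
  }
  return extents;
}
```

Because older zones were fully described when the file switched away from them, only this single partial zone needs scanning at mount time.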
We are also doing our own refactoring, as I mentioned before (focused on the new allocator workaround & metadata rolling), but if your implementation gets better results, I would definitely use yours.
But of course, we will follow your latest updates to see if there's anything we can cherry-pick.
This is being addressed as part of https://github.com/westerndigitalcorporation/zenfs/pull/117 ; that PR reduces the frequency of metadata rolls (from every 3s under heavy writes with --sync=1 to roughly once per hour).
When rolling to a new metadata zone, a significant amount of write latency is introduced.
The write latency is due to holding the file mutex (holding off any other metadata syncs, new file creations, etc.) while:
We could do the rolling in the background, at the cost of reserving one extra active zone for metadata rolls.