Closed: yhr closed this issue 2 years ago
The metadata zone is mostly written to by WAL syncs. What about storing WALs as logs in separate zones?
e.g. we could reserve a zone for the WAL. On each WAL sync, we write size + CRC + data to that zone, without touching the metadata zone. When recovering from an error, we can simply scan the WAL zone.
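A minimal sketch of the record format proposed above, appending size + CRC + data so recovery is a simple forward scan. The names and the CRC choice are illustrative assumptions, not the ZenFS API:

```cpp
// Hypothetical WAL-zone record: [u32 size][u32 crc][data], appended
// sequentially. Recovery just walks the zone from the start.
#include <cstdint>
#include <cstring>
#include <vector>

// Minimal bitwise CRC32 (reflected, polynomial 0xEDB88320).
uint32_t Crc32(const uint8_t* p, size_t n) {
  uint32_t c = 0xFFFFFFFFu;
  for (size_t i = 0; i < n; i++) {
    c ^= p[i];
    for (int k = 0; k < 8; k++)
      c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
  }
  return c ^ 0xFFFFFFFFu;
}

// Append one record onto the (in-memory stand-in for the) WAL zone.
void AppendWalRecord(std::vector<uint8_t>& zone,
                     const uint8_t* data, uint32_t size) {
  uint32_t crc = Crc32(data, size);
  uint8_t hdr[8];
  std::memcpy(hdr, &size, 4);
  std::memcpy(hdr + 4, &crc, 4);
  zone.insert(zone.end(), hdr, hdr + 8);
  zone.insert(zone.end(), data, data + size);
}
```

On mount, a scanner would read each header, verify the CRC against the payload, and stop at the first record that fails to validate.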
... and I think this requires a more careful design. What if the zone we are writing the superblock and snapshot to is full?
@skyzh : A sparse file format is a good idea too. I've actually implemented an extension to the rocksdb WAL format that pads the WAL to be block-size aligned before sync requests (for a different reason: I used the same file format for writing the metadata log).
If every write is block-aligned and contains the number of bytes padded, we don't need to sync metadata, and we can recover easily by parsing the file up to the write pointer. We only need to write file metadata when allocating a new extent. This is extra cool because a conventional SSD can't do this :)
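The padding scheme described above can be sketched as follows. This is an assumption-laden illustration (the struct, field names, and 4 KiB block size are hypothetical), not the actual RocksDB patch: each buffered write carries a header recording the real payload length, and the whole write is rounded up to a block boundary so the device write pointer stays block-aligned.

```cpp
// Sketch: pad every buffered write to the block size and record the real
// payload length in the header, so replay can skip the padding without
// any metadata sync.
#include <cstdint>

constexpr uint32_t kBlockSize = 4096;  // illustrative device block size

// Round a length up to the next block boundary.
uint32_t AlignUp(uint32_t n) {
  return (n + kBlockSize - 1) / kBlockSize * kBlockSize;
}

struct PaddedWriteHeader {
  uint32_t payload_size;  // real bytes carried by this write
  uint32_t padded_size;   // header + payload + pad, always block-aligned
};

// Build the header for a buffered write of `payload` bytes.
PaddedWriteHeader MakeHeader(uint32_t payload) {
  PaddedWriteHeader h;
  h.payload_size = payload;
  h.padded_size = AlignUp(payload + sizeof(PaddedWriteHeader));
  return h;
}
```

On recovery, the parser reads `payload_size` bytes of data and then skips `padded_size - payload_size - sizeof(header)` bytes of padding to reach the next header.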
I decided not to upstream the rocksdb patches, as there was no clear performance benefit at the time and zenfs would have been the only user.
I think we could make buffered files default to having extent info written in-line, and only sync when we need to switch to a new zone (or close the file). I'll dig into the code to see if that is feasible.
IIRC, zenfs syncs metadata for SST files only after they are closed, so I don't expect inlined SST extents to bring much benefit. But inlined WAL metadata looks more general and promising; it only costs us a zone scan on startup, which I think is not a big problem.
But I should point out that the most significant drawback comes from active-zone contention, not the writes themselves… so inlined WAL metadata is a "nice to have". The first priority for solving this completely is a better zone management strategy, LOL
The allocator needs to be improved for sure, but as you have a workaround for that at the moment, I'll focus on this issue first, then rework the allocator. Once we've fixed both of these issues, you should be able to re-align with zenfs master. Does that make sense @royguo ?
If we store all extent starts, we don't have to store any metadata while writing to the active zone. When we need to switch to a new zone, we would store all the extents (for fast recovery). For buffered writes (like the WAL) we need to add a header to each write (describing how much data was actually flushed). For non-buffered (direct) writes we don't need this, as all writes are block-aligned.
By doing this we would only need to scan the latest written zone up to the write pointer to reconstruct the file extents when we mount.
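The mount-time scan described above could look roughly like this. The record layout ([u32 size][data]) and all names are assumptions for illustration; the real implementation would also validate CRCs and padding as discussed earlier in the thread:

```cpp
// Sketch: scan only the last active zone up to the write pointer,
// rebuilding one extent per record header found.
#include <cstdint>
#include <cstring>
#include <vector>

struct Extent {
  uint64_t offset;  // byte offset of the payload within the zone
  uint64_t length;  // payload length
};

// Walk `zone` from 0 to `write_pointer`; stop at a zero-length or
// torn (truncated) record, which marks the valid tail of the log.
std::vector<Extent> RecoverExtents(const std::vector<uint8_t>& zone,
                                   size_t write_pointer) {
  std::vector<Extent> extents;
  size_t pos = 0;
  while (pos + 4 <= write_pointer) {
    uint32_t size;
    std::memcpy(&size, zone.data() + pos, 4);
    if (size == 0 || pos + 4 + size > write_pointer) break;  // torn tail
    extents.push_back({pos + 4, size});
    pos += 4 + size;
  }
  return extents;
}
```

Because older zones were fully described when the file switched away from them, only this single partial zone needs scanning at mount time.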
We are also doing our own refactoring, as I mentioned before (focused on the new allocator workaround & metadata rolling), but if your implementation gets better results, I would definitely use yours.
But of course, we will follow your latest updates to see if there's anything we can cherry-pick.
This is being addressed as part of https://github.com/westerndigitalcorporation/zenfs/pull/117 ; that PR reduces the frequency of metadata rolls (from every 3s under heavy writes with --sync=1 to roughly once per hour).
When rolling to a new metadata zone, a significant amount of write latency is introduced.
The write latency is due to holding the file mutex (holding off any other metadata syncs, new file creations, etc.) while:
We could do the rolling in the background, at the cost of reserving one extra active zone for metadata rolls.