kunga commented 2 months ago

Description

Make Shared Cache concurrent (better lock-free).

This improvement opens us the possibility of making DataShard reads in parallel.

Steps

1. Implement new cache eviction strategies that do not require data structure mutation on each page access

Instead of the current SLRU cache, which requires LRU lists to be modified on each page access, some better cache algorithms should be implemented and tested.

Our first candidates are:

S3-FIFO, because it's the newest and simplest in the implementation cache, and promised to be good according to the author's research
Clock Pro, because it is used in other databases (for example CockroachDB) and has an implementation that can be adopted

Tasks:

2. Use Shared Cache directly where possible

Instead of using Private Cache, use shared cache directly for pinning, touching pages.

[ ] https://github.com/ydb-platform/ydb/pull/11779
[ ] Replace sticky pages with TSharedPageRef
[ ] Replace pages pinning with TSharedPageRef
[ ] Replace NSharedCache::TEvTouch with counters increment

3. Get rid of Private Cache

Instead of Private Cache and cache coherency protocol, access pages from Shared Cache (or its pages snapshot) directly.

Internal design doc.

4. Make further improvements

[ ] Quota serverless per-database cache usage
[ ] Improve cache usage during compactions
[ ] Improve cache usage during scans
[ ] Improve cache work with SST's index pages

ben-manes commented 2 months ago

You should run real workloads against the policies because it is very likely that you will significantly lower your hit rates. For example here is a SQL database trace.

hit_rate

A very simple trick for LRU-based caches is to use an optimistic tryLock on read to perform the reordering, where if unsuccessful it is skipped, and exclusively lock on a write for insertion and eviction. This has the read penalty is merely a volatile read and, if unlocked, then a CAS. If under heavy contention then while read reorderings will be lost, but they only need to be best-effort as the popularity skew of accesses will cause the frequent items to be reordered eventually. For example memcached used that approach along with a 60s threshold between per-entry reorderings to simply ignore the unnecessary CASes. This way you maintain the O(1) time complexity with minimal changes, rather than O(n) with a lower hit rate.

kunga commented 2 months ago

Hi @ben-manes, thanks for commenting on this.

You should run real workloads against the policies because it is very likely that you will significantly lower your hit rates.

Yes, our current plan is to implement new strategies and test them. At first, we will run synthetic database workloads such as TPC-C and YSCB and then gradually move these changes to the real production environment. That's why we have implemented hot cache policy switch. We also have thousands of different databases with different load and memory capacity so some of them might get better and some get worse, but we will make the final decision which policy is to use using the real data.

For example here is a SQL database trace.

I'm wondering whether these results come from your own project or maybe they were taken from some public paper/talk/etc? If so, please provide a link to them.

A very simple trick for LRU-based caches is to use an optimistic tryLock on read to perform the reordering, where if unsuccessful it is skipped, and exclusively lock on a write for insertion and eviction.

If I understand the idea correctly, tryLock should be taken for the whole data structure? If so, we're trying to avoid this behaviour as there may be hundreds of pages are read during one transaction execution, and that lock can be a bottleneck.

Also in Shared Cache we store small data pages (parts of a contiguous SST data structure). So we want the cache to be scan-resilient.

Anyway, thanks for the idea, will keep it in mind.

ben-manes commented 2 months ago

maybe they were taken from some public paper/talk/etc? If so, please provide a link to them.

That trace came from the ARC paper (DS1, repository). Its search has similar results. The UMass trace repository includes other search traces have a similar workload pattern. These are the public traces that I’m aware of, and I’m always glad to see that grow with a variety to cross check against.

I ran it using my simulator. For my implementations of the policies, I verify correctness against the author’s original under multiple traces to ensure that I have an honest reproduction. In that case I found the only difference is for variably sized entries (aka weight limit not entry count limit). My simulator adjusts the existing entry to the newly observed size, whereas the author’s does not (~0.2% difference in hit rates). The reason is because they S3-F allows concurrent lock-free evictions, but when the entry is resized in an update while being simultaneously shuffled between queues, this race corrupts the accounting of what the queue’s current capacity is. By making the entry’s byte size immutable this race is avoided, but could lead to a memory leak because the cache no longer honors the maximum capacity limits.

If I understand the idea correctly, tryLock should be taken for the whole data structure?

Typically you would minimize the eviction policy’s lock to only its data structures, the SLRU doubly linked lists, and pair it with a concurrent hash table. This way the reads can be lock-free map lookups and optimistic policy updates if the tryLock is acquired (else skipped).

Optionally you can also keep it separate in writes so table updates are concurrent and briefly serialize for a the policy update/eviction logic. However this requires bookkeeping of entry lifecycles (create, update, delete), because you loose strict ordering between threads for the same key. So since cache writes are rarer, often that complexity is avoided by coarsening the policy lock by requiring all writes to both data structures go under it’s exclusive access. That’s often fine, like it was for memcached, but has limitations such as losing linearizability of per-key locking for loading through the cache to avoid cache stampedes.

Caffeine uses the BP-Wrapper approach from the Lirs/ClockPro authors, which combines the tryLock with ring buffers and scheduling logic. Instead of reads or writes immediately updating the policy, a pubsub idea means it is eventually consistent. The client performs the hash table operation, enqueues the event, schedules a drain if appropriate, and returns. That might defer to a background thread or be amortized by callers, but in effect avoids lock contention because threads no longer require the global lock and can those ring buffers can be striped to avoid hotspots. On my M3 Max 14-core with 16 threads, this has 900M reads/s and 40M writes/s (updates), and scales with core/thread count inline with the hash table’s rate. It’s kind of like a write-ahead transaction log.

That’s a more complex design but creates a lot of flexibility, as the policy data structures no longer need to be thread safe. So we use timing wheels for expiration, countmin sketch for ghost history, LRU lists, hill climbing to adaptive resize queues, etc. we can optimize the algorithms independently instead of degrade them trying to make them more concurrency friendly.

Go’s Otter tried to follow the S3F paper with the author’s help, but struggled because the paper has a memory leak (unbounded ghost queue), the varying entry size corruption, high eviction contention on hot head/tail pointers, O(n) evictions of repeated scans searching for a cold victim while reads marked them as hot, and poor hit rates due as shown above. He ported a lot of Caffeine’s code to replace their concurrency with BP-Wrapper’s to fix the throughput problems. I advised an early threshold limit for how many entries the Clock queues could traverse on its search for a cold entry, e.g. 20, which made it effectively O(1) and greatly improved the hit rates in those traces as the cache more aggressively evict newer arrivals. Note that likely degrades in different workloads as there is no heuristic that is universally optimal for all workloads, which is why Caffeine is adaptive to tune itself

— — — — — — — — — — — — — — — —

I know that was a lot but I hope it offered some insights as you explore your design tradeoffs.

kunga commented 2 months ago

@ben-manes Thank you very much! We will certainly take your suggestions into consideration!

kunga commented 1 month ago

TPC-C Results

https://nda.ya.ru/t/UE3RlZTC78wDTy

https://paste.yandex-team.ru/af2cd6b7-72fe-4c0e-b290-f7c99fd54bd9/text

Run 1h TPC-C load with different replacement policies, 8k wh data, 2k wh load, 10 GB Shared Cache.

First 20 minutes without compactions, then start compacting script to simulate real database lifestyle.

ThreeLeveledLRU

================RESULTS================
           Time, s |               3300
         NewOrders |            1379226
             TPM-C |           25076.84
        Efficiency |             97.50%
reqs/s: Results(nanoSeconds=3600760378556, measuredRequests=3066854) = 
  851.7239909282409 requests/sec (throughput), 900.7825178582781 requests/sec (goodput)

S3FIFO

================RESULTS================
           Time, s |               3300
         NewOrders |            1387282
             TPM-C |           25223.31
        Efficiency |             98.07%
reqs/s: Results(nanoSeconds=3600708348089, measuredRequests=3081606) = 
  855.8332700385182 requests/sec (throughput), 904.8852850659887 requests/sec (goodput)

ClockPro

================RESULTS================
           Time, s |               3300
         NewOrders |            1277476
             TPM-C |           23226.84
        Efficiency |             90.31%
reqs/s: Results(nanoSeconds=3600770085502, measuredRequests=2862789) = 
  795.0490956161354 requests/sec (throughput), 838.2437446236453 requests/sec (goodput)

Read only total latency percentiles (ms):

Read/Write total latency percentiles (ms):

Tablets cache hits:

Tablets cache misses:

Tablets cache misses%:

CPU:

Size:

Compactions:

TxFinished:

TxRetried/TxPostponed:

TxRetried/TxPostponed%:

ben-manes commented 1 month ago

It looks like your ClockPro is based on CockroachDB's, which in turn is based on the Python version. This might explain the hit rate you observe. It has been a long time but, if I recall correctly, the cockroach's removed the non-resident history in order to reduce lock hold times due to scanning (it has some weird looping problem). That made it more like a Clock-based SLRU. While the Python one included non-resident entries, it did not use researcher's C version as a reference and had much lower hit rates and a much longer run time due to various honest mistakes (the papers are difficult to understand).

Caffeine simulator's version was also based on the Python one, as I exhausted my time debugging the researcher's LIRS version to make that match exactly. Chanyoung Park rewrote it to match their ClockPro correctly so there is a very readable reference. He also tried to make a simpler variant for maintainability and lower scan times and an enhanced version that adds ARC's adaptive sizing. He was very diligent in ensuring the hit rates matched the research reference and trying to make it understandable for others.

I'd still recommend that you capture a trace on cache access events, which should give you an I/O trace that is independent of the policy, that could then be simulated without a requiring a full implementation. Then you can observe the hit rates across a wider range of policies, cache sizes, etc. and can validate in unit tests that your real implementation matches the reference (as real-world needs often cause subtle changes, and a small mistake can have surprising repercussions).

kunga commented 1 month ago

Hi @ben-manes!

If I recall correctly, the cockroach's removed the non-resident history in order to reduce lock hold times due to scanning (it has some weird looping problem).

They have non-resident history (test pages), but I found out that they don't distinguish cold resident pages and cold resident test pages. These are presented in the paper and are missed in the Python implementation and in a slides that I used as a reference (as you mentioned, it's too complicated and confusing to understand the algorithm from the paper).

Quotes from the paper:

To give the cold pages a chance to compete with the hot pages and to ensure their cold/hot statuses accurately reflect their current access behavior, we grant a cold page a test period once it is accepted into the list. Then, if it is re-accessed during its test period, the cold page turns into a hot page. If the cold page passes the test period without a re-access, it will leave the list.

For each cold page, there is a flag indicating if the page is in the test period.

Accordingly, that may be a cause of a low efficiency during a compaction process.

I also double-checked that I have the same implementation as CockroachDB has using their canon-data test.

I'd still recommend that you capture a trace on cache access events, which should give you an I/O trace that is independent of the policy, that could then be simulated without a requiring a full implementation.

Any suggestions on how to achieve this? For example, if we'd collect that traces, what is the best way to evaluate them with existing algorithms? I found libCacheSim that seems to do what I need, but maybe you have some other useful tools in mind.

I/O trace that is independent of the policy

There is also a few caveats to evaluate and collect these traces.

Firstly, when a transaction is executing in YDB and then page-faulted (has to fetch data from distributed storage), we also run a special prefetch process, that try to guess data that is also might be needed in that transaction. So, sometimes futher reads are depend on what data is in cache right now.

Also, the cache miss metric isn't really suitable for us, as we mostly want to optimize latency & throughput of the database. Maybe the amount of page-faulted transactions would be more suitable here, but I'm not sure.

Taking into account these points, and after discussions within our team, we decided to move forward with S3FIFO and delivering it to the production for now.

ben-manes commented 4 weeks ago

Hi @kunga!

I also double-checked that I have the same implementation as CockroachDB has using their canon-data test.

~Thanks, I'll try to run the trace and give you back some metrics.~

That trace appears too small to be meaningful for analysis. It's decent for validation that it matches the original code, but almost all eviction policies have a similar hit rate and vary by a few percentage points. As a unit test its reasonable, but since the original Python code was an incorrect implementation of ClockPro it doesn't tell us anything useful.

if we'd collect that traces, what is the best way to evaluate them with existing algorithms?

You could use my simulator, which I wrote as necessary tooling when trying to decide what more modern eviction policy to use in my caching library. It uses a 64-bit integer hash for the key, it is easy to add custom trace readers, and I tried to be careful by checking my implementations against the research author's simulators if available, else their paper's hit rate.

I would shy away from academic simulators because the code quality varies, since the authors are students and lack experience. Unfortunately while their own policy code can be trusted, they are rarely very careful in implementing other's at the (convenient) misfortune of their competition. The code is written for a paper's publication rather than for informing engineers, so I use them as sanity checks with LRU as a common baseline. Similarly, I would not put a lot of weight on a paper's claims because very often I find exaggerations, cherry-picking, fabrications, and misleading bias. There is no repercussions for falsehoods and the competitive pressures results in a lot of dishonest behaviors.

we also run a special prefetch process, that try to guess data that is also might be needed in that transaction. So, sometimes further reads are depend on what data is in cache right now.

I would imagine this translates into either a key, e.g. row id, or a I/O start address + limit with block sizes. Either is quite easy in my trace readers since they return a sequence of keys (e.g. ARC trace) as they stream the file. The easiest is to use a logger at the cache lookup method to record the event(s). As a logger is already designed to be low overhead, conditionally enabled, etc. you don't need to worry about setting up bespoke infrastructure.

Also, the cache miss metric isn't really suitable for us, as we mostly want to optimize latency & throughput of the database. Maybe the amount of page-faulted transactions would be more suitable here, but I'm not sure.

That's fair but I think different benchmarks, unfortunately, as I think that is a comprehensive to the cache design rather than the specific eviction policy. Caffeine's eviction policy is not concurrent, but achieves a robustly higher hit rate across the board (example). The concurrency mechanism is an independent design, where a Zipfian benchmark is used to measure contention / latency, where my M3 Max 14-core achieves 900M reads/s compared to a baseline of 2.3B, implying that the concurrency mechanism is not throttled by the eviction policy. Of course once any cache reaches a performance goal then any additional speedup is likely unmeasurable, so one looks to see if it is fast enough to the cache not be a bottleneck.

Latency-aware caching is a fairly niche topic and might be part of your consideration in policy design. I've sketched my own ideas of a design, which a contributor prototyped with success. That may be a different type of latency problem then you have, though.

ydb-platform / ydb

👑 Concurrent Shared Cache #8447

Description

Steps

1. Implement new cache eviction strategies that do not require data structure mutation on each page access

2. Use Shared Cache directly where possible

3. Get rid of Private Cache

4. Make further improvements

TPC-C Results

ThreeLeveledLRU

S3FIFO

ClockPro