zkat / cacache-rs

A high-performance, concurrent, content-addressable disk cache, with support for both sync and async APIs. 💩💵 but for your 🦀
https://crates.io/crates/cacache
Other
529 stars 35 forks source link

Eviction? #28

Open piegamesde opened 3 years ago

piegamesde commented 3 years ago

I'm looking at different caching libraries right now for my project and this one looks really cool! However, I cannot find any information on cache eviction. Do I need to implement it manually on top of cacache? Have others already done it? (Is this even possible?)

What I need is some simple (access) time eviction, but I think most use cases require bounded size and some LRU/LFU algorithm.

piegamesde commented 3 years ago

Normally, I'd suggest having this as an option and then have all methods automatically enforce the eviction invariants. But because the API design suggests that a "cache" is no more than the path to a directory, I instead propose adding some evict method, that takes in a path and an evictor object, which has to be called manually when desired.

The naive implementation would then simply iterate over the content and then take out some of the keys based on the configured metric. But the more apparent problem is that Metadata does not store enough relevant data in order to implement most eviction algorithms.

zkat commented 2 years ago

Original cacache actually has a mark-and-sweep garbage collector built in that's extensible enough to build this feature into, but no one ever did it, so I figured it wasn't that important? Eviction is a very application-specific feature, imo.

tarka commented 1 year ago

Not exactly eviction, but you can implement a TTL using metadata:

pub fn get_cached(cache: &str, key: &str, ttl: u128) -> Result<Option<String>> {
    let md = match cacache::metadata_sync(cache, key)? {
        Some(m) => m,
        None => return Ok(None)
    };

    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)?
        .as_millis();
    if now - md.time > ttl {
        info!("Cached valued expired: {} - {} = {} > {}", now, md.time, now - md.time, ttl);
        return Ok(None);
    }

    let bval = cacache::read_sync(cache, key)?;
    let uval = String::from_utf8(bval)?;
    debug!("Cache hit: {} -> {}", key, uval);
    Ok(Some(uval))
}

A slightly nicer variant of this would be if the desired TTL could be added to the metadata at put time, but there doesn't seem to be a method of setting arbitrary metadata from the user PoV.

fiag commented 1 year ago

Read all index entries, and filter by TTL, then remove index and content.

use std::time::{Duration, SystemTime, UNIX_EPOCH};

fn main() {
    let cache = "~/.my-cache";
    let ttl = Duration::from_secs(60);
    for md in cacache::list_sync(&cache).flatten() {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_millis();
        if now - md.time > ttl {
            cacache::remove_hash_sync(&cache, &md.integrity).unwrap();
            cacache::remove_sync(&cache, &md.key).unwrap();
        }
    }
}
matt-phylum commented 9 months ago

Read all index entries, and filter by TTL, then remove index and content.

This almost works, but there are two problems related to remove_hash_sync:

  1. In a content addressable cache, multiple keys may have the same integrity value, so remove_hash_sync when removing an old key may also remove the data for newer keys which are not being removed.
  2. If a previous operation has been interrupted, there may be cached data for which there is no associated key. list_sync returns only keys and their associated information, so if operations returning unique information are being interrupted, eventually the cache will fill up with orphan data for which remove_hash_sync needs to be called but it is not accessible via the output of list_sync.