wvwwvwwv / scalable-concurrent-containers

High performance containers and utilities for concurrent and asynchronous programming
Apache License 2.0

What cases will `HashIndex` clone keys or values? => [HashIndex::{modify, modify_async}] implemented #96

Closed novacrazy closed 1 year ago

novacrazy commented 1 year ago

I've seen in your notes that HashIndex requires the key and value types to be Clone to maintain its immutability benefits, but in what cases does it actually clone them?

Furthermore, the docs for update state "The caller has to make sure that there is no reader of the entry". What happens if a reader does exist? There is no way to ensure that short of using a Mutex, which defeats the entire point of this library. That seems like exactly the case where cloning the value for the update would help.

wvwwvwwv commented 1 year ago

The Hash* structs share the same HashTable::resize trait method, which adjusts the capacity of the hash table. After a new bucket array is allocated, HashMap/HashSet and HashIndex behave differently: HashMap moves the key and value instances from the old bucket array to the new one, whereas HashIndex clones them and drops the ones in the old bucket array once the old array becomes completely unreachable. This is because HashIndex::read* and HashIndex::iter are allowed to read entries even while they are locked by HashIndex::update*, HashIndex::remove*, and HashIndex::insert*; in particular, HashIndex::read_with and HashIndex::iter can even leak a reference to a locked entry and keep that reference alive longer than the HashIndex itself! => K, V: Clone is not about HashIndex::update*, but about resizing.
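
A minimal way to observe the clone-on-resize behavior described above is a value type that counts its own clones. This is an illustrative sketch (the Tracked type and the insert count are made up; how often cloning happens depends on the crate's resize heuristics):

use std::sync::atomic::{AtomicUsize, Ordering};

use scc::HashIndex;

// Counts how many times the value type is cloned anywhere in the program.
static CLONES: AtomicUsize = AtomicUsize::new(0);

struct Tracked(u32);

impl Clone for Tracked {
    fn clone(&self) -> Self {
        CLONES.fetch_add(1, Ordering::Relaxed);
        Tracked(self.0)
    }
}

fn main() {
    let index: HashIndex<u32, Tracked> = HashIndex::default();
    for i in 0..4096 {
        let _ = index.insert(i, Tracked(i));
    }
    // The inserts above trigger several resizes; each resize clones the values it
    // carries over to the new bucket array, so the counter ends up non-zero even
    // though this code never calls clone() itself.
    println!("values cloned during resizes: {}", CLONES.load(Ordering::Relaxed));
}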

Note that HashIndex::update* does not implement copy-on-write semantics; it is for in-place updates of an entry. I may have to implement copy-on-write methods later as well.

So, if "The caller has to make sure that there is no reader of the entry" is not guaranteed, there can be a reader of an entry that is being updated through HashIndex::update*. HashIndex guarantees that there is no other writer to the entry being updated, but unfortunately, as you stated, read operations are out of its control. These semantics mean that HashIndex::update* is only safe if V is safe in single-writer-multiple-readers scenarios, e.g., V: Copy.
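
One way to satisfy that single-writer-multiple-readers requirement without touching the unsafe update path is interior mutability through atomics. A sketch, assuming the read(key, closure) accessor mentioned earlier in the thread (key type and names are illustrative):

use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

use scc::HashIndex;

fn main() {
    // Arc<AtomicU64> satisfies the Clone bound (cloning only bumps the refcount),
    // and the payload can be mutated while other threads read it concurrently.
    let counters: HashIndex<u64, Arc<AtomicU64>> = HashIndex::default();
    let _ = counters.insert(1, Arc::new(AtomicU64::new(0)));

    // "Writer": bump the counter in place through a shared reference.
    let _ = counters.read(&1, |_, v| v.fetch_add(1, Ordering::Relaxed));

    // "Reader": observe the value without any locking.
    let n = counters.read(&1, |_, v| v.load(Ordering::Relaxed));
    assert_eq!(n, Some(1));
}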

novacrazy commented 1 year ago

By “safe”, do you mean it’s possible to invoke undefined behavior by calling update? If so, that’s pretty bad. I’m unsure how significant the part about leaking references is; if the reference is to a dropped element, that’s also pretty bad.

Cloning during resize also has me wanting to wrap all my values in Arc, which would make update useless anyway.

Could at least the associated value be stored behind a separate pointer, to allow for more efficient resizing and potentially a copy-on-write scheme? I doubt the extra indirection would matter much for real-world random access, though I suppose it would consume extra space for the pointer itself.

wvwwvwwv commented 1 year ago

Yes, calling update may lead to undefined behavior if V is not ready for it, and that's why the method is unsafe. On the other hand, the leaking reference part is not at all unsafe; the references are protected by EBR, and the instance won't be dropped until the references are gone.

And right, wrapping instances in Arc doesn't help much here, as overwriting an Arc in place is still not thread-safe. Actually, using Arc amounts to the extra indirection through a separate pointer, and that would allow HashIndex to drop the Clone constraints. What concerns me about that approach is repeated heap memory allocation; every V gets its own heap allocation, which is definitely bad for performance unless a very good arena allocator is used. => I'd leave this up to the user: if the user prefers not to clone V, Arc<V> can simply be used instead.
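
A sketch of that Arc<V> escape hatch (payload size and key count are illustrative): resizing clones only the Arc handle, never the payload, at the cost of one extra heap allocation and pointer chase per value.

use std::sync::Arc;

use scc::HashIndex;

fn main() {
    let index: HashIndex<u64, Arc<Vec<u8>>> = HashIndex::default();
    for i in 0..1024 {
        // The 4 KiB buffer is allocated exactly once per key; a resize only
        // increments the Arc reference count when carrying the entry over.
        let _ = index.insert(i, Arc::new(vec![0u8; 4096]));
    }
}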

Copy-on-write semantics is a slightly different story; update_cow* (?) could be implemented as: lock the bucket -> insert a new version of the value -> mark the old version of the value as removed -> unlock the bucket. Simple! => Would this suit your use case?

novacrazy commented 1 year ago

Oh, my mistake, I completely missed that update was unsafe. Just glanced right over that somehow.

update_cow sounds good. Pretty much the ideal version of a locking copy-on-write.

Might also be worth looking into something like arc-swap's rcu method. rcu may run the updater function multiple times in order to stay lock-free, but there are cases where that's acceptable.

In addition to update_cow, it would be amazing to provide an rcu method even if you can't make it lock-free:

pub fn rcu<Q, F>(&self, key: &Q, updater: F) -> bool // returns true if exists and updated
where
    K: Borrow<Q>,
    Q: Eq + Hash + ?Sized,
    F: FnOnce(&K, &V) -> V;

pub async fn rcu_async(...) -> bool ...

such that the new value can be generated from scratch rather than taking a mutable reference, since a mutable reference would be useless if V = Arc<T>.

Perhaps rename the old update to update_in_place or similar as well? Make update_cow the new update.

wvwwvwwv commented 1 year ago

The suggested rcu API looks great; I guess F: FnOnce(&K, &V) -> Option<V> would be better, letting the caller decide whether to create a new version based on the current value.

I'm reluctant to change the naming of existing methods in v1.x.x; I'll clean up any confusing naming in the crate in v2.0.0 (next year? hopefully when allocator_api becomes stable).

I'll upload a draft version of new methods soon.

novacrazy commented 1 year ago

Might be a bit of feature creep, but if you're adding in an option to "do nothing", perhaps an enum such as:

enum RcuAction<T> {
    DoNothing,
    Update(T),
    Remove,
}
impl<T> From<Option<T>> for RcuAction<T> ... // DoNothing or Update
impl<T> From<()> for RcuAction<T> ... // DoNothing

pub fn rcu<Q, F, R>(&self, key: &Q, updater: F) -> bool
where
    K: Borrow<Q>,
    Q: Eq + Hash + ?Sized,
    F: FnOnce(&K, &V) -> R,
    R: Into<RcuAction<V>>;

to also allow removing the element while we hold the lock. Could be useful if it's not difficult to implement.

remove_if could then be implemented on top of rcu, even.

wvwwvwwv commented 1 year ago

Yep, a bit of feature creep, but nice to have since HashMap already has entry/entry_async methods that semantically do the same thing, and it's not at all difficult to implement.

However, the name rcu wouldn't fit here, since rcu typically implies lock-free code, whereas a lock-free update of a HashIndex entry is virtually impossible (not 100% sure whether it is totally impossible) unless some API semantics are relaxed (e.g., linearizable -> eventually consistent). Any suggestions other than rcu? I'm thinking of

novacrazy commented 1 year ago

entry would be good if there were also a way to insert, such as by providing (&K, Option<&V>) to the callback and allowing it to insert the returned value instead. That would cover the complete range of operations.

However, that introduces annoyances with the key: it would either need to be given by value or cloned automatically on insertion. Granted, the entry API elsewhere requires it by value anyway, so this would at least match that.

enum Action<T> {
    DoNothing,
    Set(T),
    Remove,
}
impl<T> From<Option<T>> for Action<T> ... // DoNothing or Set
impl<T> From<()> for Action<T> ... // DoNothing

pub fn entry<F, R>(&self, key: K, updater: F) -> bool
where
    F: FnOnce(&K, Option<&V>) -> R,
    R: Into<Action<V>>;

Using something like std::borrow::Cow for the key would work instead, maybe.

Otherwise, if it must require that the value already exists, modify.

wvwwvwwv commented 1 year ago

OK, inserting a new entry through the new methods doesn't look nice at all :-( I'll then go for modify, which does read -> remove, update, or nothing.

novacrazy commented 1 year ago
table.entry(key, |_, entry| Some(match entry {
    None => Default::default(),
    Some(&value) => value + 1,
}));

looks decent to me. Similar to:

match table.entry(key) {
    Entry::Vacant(v) => { v.insert(Default::default()); }
    Entry::Occupied(mut v) => { *v.get_mut() += 1; }
}

novacrazy commented 1 year ago

I just had an idea right after I sent that. Instead of relying on the return value, you could create a hash_index::Entry struct similar to hash_map::Entry, but provide it to the closure instead, so the semantics would be identical to hash_map::Entry?

hash_index_table.entry(key, |entry| {
    // use entry like you would hash_map::Entry
});
wvwwvwwv commented 1 year ago

If I were to implement HashIndex::entry, I would just mirror the HashMap implementation, e.g., receiving K by value and returning an Entry that owns the bucket. A HashIndex::entry resembling that of HashMap is definitely meaningful, so I'll implement it in the foreseeable future. => I'm not quite fond of HashIndex::entry receiving a closure and a reference to K, since the entry method is usually invoked by a user with the intention of inserting a key.

I guess I digressed by mentioning HashMap::entry, thinking that the semantics of rcu were similar to those of HashMap::entry, which is not quite true because of insertion.

Therefore, I'd like to just implement modify<F: FnOnce(&K, &V) -> Action<V>>(&key, action) -> bool for this issue. Would that be OK with you?

novacrazy commented 1 year ago

You're right, the important part of rcu/modify was that the user explicitly copies the value themselves. I was hung up on the wrong thing. modify will work just fine.

Thank you sincerely for your work on this.

wvwwvwwv commented 1 year ago

Thanks too for your great ideas! I'll upload a new version (v1.6.3) in a couple of days.

wvwwvwwv commented 1 year ago

Just published 1.6.3. Feel free to give feedback about the new methods!
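
For readers following along, a hypothetical usage sketch of the method shape agreed on above. The Action enum, its import path, and the exact signature are taken from the proposals in this thread rather than from the released 1.6.3 API, so consult the crate documentation for the real thing:

use scc::HashIndex;
// Assumed path and variant names, following the Action sketch earlier in this thread.
use scc::hash_index::Action;

fn main() {
    let index: HashIndex<u64, u64> = HashIndex::default();
    let _ = index.insert(1, 10);

    // Copy-on-write style update: build a new value from the current one while
    // the bucket is locked; concurrent readers keep seeing the old value until
    // the new one is published.
    let updated = index.modify(&1, |_k, v| Action::Set(v + 1));
    assert!(updated);

    // The same method can remove the entry instead of updating it.
    let removed = index.modify(&1, |_k, _v| Action::Remove);
    assert!(removed);
}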

ibraheemdev commented 3 months ago

> This is because HashIndex::read and HashIndex::iter are allowed to read entries even when they are locked by HashIndex::update, HashIndex::remove and HashIndex::insert

How exactly does this work if the updates are not performed atomically? If HashIndex::insert is overwriting a value that is being read by another thread, what is done to avoid a data race?

wvwwvwwv commented 3 months ago

@ibraheemdev Hi! "Remove" does not actually remove an entry; it only updates the metadata part of the HashIndex atomically, so any ongoing read operations can still read the data. Outdated entries are cleaned up on resize, or when a large number of entries in the HashIndex have become outdated.
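
A small sketch of that guarantee, assuming the Guard-based peek API of the current crate (scc::ebr::Guard, HashIndex::peek); method names may differ between versions:

use scc::ebr::Guard;
use scc::HashIndex;

fn main() {
    let index: HashIndex<u64, String> = HashIndex::default();
    let _ = index.insert(1, "first".to_string());

    // The Guard pins the current epoch; a reference obtained through it remains
    // valid even if the entry is removed afterwards, because removal only flips
    // metadata and physical reclamation is deferred until no reader can see it.
    let guard = Guard::new();
    let v = index.peek(&1, &guard);

    assert!(index.remove(&1));
    assert_eq!(v.map(String::as_str), Some("first"));
}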

ibraheemdev commented 3 months ago

Ah, so entries can never be overwritten and reclamation only happens at the bucket array level?

wvwwvwwv commented 3 months ago

Yes, your understanding is correct. Ideally, those outdated entries could be reclaimed by marking them empty as soon as all the current readers are gone, but that is not implemented yet (it is possible, but implementing it efficiently is pretty hard).