Closed novacrazy closed 1 year ago
`Hash*` structs share the same `HashTable::resize` trait method, which adjusts the capacity of the hash table. Here, `HashMap`/`HashSet` and `HashIndex` act differently after a new bucket array is allocated: `HashMap` moves key and value instances from the old bucket array to the new array, whereas `HashIndex` clones them and drops the ones in the old bucket array once the old array becomes totally unreachable. This is because `HashIndex::read*` and `HashIndex::iter` are allowed to read entries even when they are locked by `HashIndex::update*`, `HashIndex::remove*`, and `HashIndex::insert*`; in particular, `HashIndex::read_with` and `HashIndex::iter` can even leak a reference to a locked entry and keep the reference longer than the `HashIndex` itself!
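The two resize strategies can be sketched with plain std types; `resize_by_move` and `resize_by_clone` below are hypothetical stand-ins for illustration, not scc internals:

```rust
use std::sync::Arc;

// HashMap-style resize: entries are moved, the old array is consumed.
fn resize_by_move<V>(old: Vec<Option<V>>) -> Vec<Option<V>> {
    let target = old.len() * 2;
    let mut new: Vec<Option<V>> = old.into_iter().collect();
    new.resize_with(target, || None);
    new
}

// HashIndex-style resize: entries are cloned so readers of the old array
// stay valid; the old array is dropped only once no reader can reach it.
fn resize_by_clone<V: Clone>(old: &[Option<V>]) -> Vec<Option<V>> {
    let mut new = old.to_vec();
    new.resize_with(old.len() * 2, || None);
    new
}

fn main() {
    let old = vec![Some(Arc::new(1)), None];
    let new = resize_by_clone(&old);
    assert_eq!(old[0].as_deref(), Some(&1)); // readers of the old array are unaffected
    assert_eq!(new[0].as_deref(), Some(&1));
    assert_eq!(resize_by_move(old).len(), 4); // the old array is consumed here
}
```

This is why the clone-based path needs `K, V: Clone` while the move-based path does not.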
=> `K, V: Clone` is not about `HashIndex::update*`, but about resizing.

Note that `HashIndex::update*` does not implement copy-on-write semantics; it is for in-place updates of an entry. I may have to implement copy-on-write methods as well later.
So, if "The caller has to make sure that there is no reader of the entry" is not guaranteed, there can be a reader of an entry that is being updated through `HashIndex::update*`. `HashIndex` guarantees that there is no writer to the entry being updated, but unfortunately, as you stated, read operations are out of its control. These semantics suggest that `HashIndex::update*` is only safe if `V` is safe for single-writer-multiple-reader scenarios, e.g., `V: Copy`.
By "safe", do you mean that it's possible to invoke undefined behavior by calling `update`? If so, that's pretty bad. I'm unsure how significant the part about leaking references is; if it's a reference to a dropped element, then that's also pretty bad.

Cloning during resize also has me wanting to wrap all my values in `Arc`, which would make `update` useless anyway.

Could at least the associated value be stored behind a separate pointer to allow for more efficient resizing and potentially a copy-on-write system? I doubt the extra indirection would matter much in real-world random access, though I suppose it would consume more space for the pointer itself.
Yes, calling `update` may lead to undefined behavior if `V` is not ready for it, and that's why the method is `unsafe`. On the other hand, the leaking-reference part is not at all unsafe; the references are protected by EBR, and the instance won't be dropped until the references are gone.

And right, wrapping instances in `Arc` doesn't help much here, as `Arc` itself does not make the inner value safe to mutate concurrently. Actually, using `Arc` amounts to the extra indirection through a separate pointer, and that would enable `HashIndex` to get rid of the `Clone` constraints. What concerns me about that approach is repeated heap memory allocation; every `V` gets its own heap allocation, which is definitely bad for performance unless very good arena allocators are used.
=> I'd leave this up to the user; if the user prefers not to clone `V`, `Arc<V>` can simply be used instead.
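The `Arc<V>` workaround can be illustrated with a plain std `HashMap` standing in for the concurrent container: cloning the value (as a clone-based resize would per entry) then only bumps a reference count instead of deep-copying.

```rust
use std::collections::HashMap;
use std::sync::Arc;

fn main() {
    // A large value wrapped in Arc: cloning only bumps a reference count.
    let big = Arc::new(vec![0u8; 1 << 20]);
    let mut index: HashMap<&str, Arc<Vec<u8>>> = HashMap::new();
    index.insert("blob", Arc::clone(&big));

    // What a clone-based resize would do per entry: a cheap pointer copy.
    let cloned = index["blob"].clone();
    assert!(Arc::ptr_eq(&big, &cloned));      // same heap allocation
    assert_eq!(Arc::strong_count(&big), 3);   // `big`, the map entry, `cloned`
}
```

The trade-off the comment above points out still applies: every inserted `V` now costs a separate heap allocation up front.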
Copy-on-write semantics is a slightly different story; `update_cow*` (?) can be implemented as: lock the bucket -> insert a new version of the value -> mark the old version of the value as removed -> unlock the bucket. Simple!

=> Would this suit your use case?
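The lock -> insert new version -> mark old version removed -> unlock sequence can be sketched against a single hypothetical bucket using a std `Mutex` as the bucket lock; `Bucket` and `update_cow` here are illustrative models, not the scc implementation:

```rust
use std::sync::{Arc, Mutex};

// Hypothetical single-bucket model: a list of value versions, where old
// versions linger (marked removed) until no reader can reach them.
struct Bucket<V> {
    versions: Vec<(Arc<V>, bool)>, // (value, removed flag)
}

fn update_cow<V>(bucket: &Mutex<Bucket<V>>, make_new: impl FnOnce(&V) -> V) {
    // 1. lock the bucket
    let mut guard = bucket.lock().unwrap();
    if let Some(pos) = guard.versions.iter().rposition(|(_, removed)| !removed) {
        // 2. insert a new version of the value
        let new_version = Arc::new(make_new(&guard.versions[pos].0));
        guard.versions.push((new_version, false));
        // 3. mark the old version as removed (existing readers are unaffected)
        guard.versions[pos].1 = true;
    }
    // 4. unlock the bucket (the guard is dropped here)
}

fn main() {
    let bucket = Mutex::new(Bucket { versions: vec![(Arc::new(1), false)] });
    // A reader obtained a reference to the current version earlier:
    let old = Arc::clone(&bucket.lock().unwrap().versions[0].0);
    update_cow(&bucket, |v| v + 1);
    let guard = bucket.lock().unwrap();
    let (current, _) = guard.versions.iter().rfind(|(_, removed)| !removed).unwrap();
    assert_eq!(**current, 2); // new readers see the new version
    assert_eq!(*old, 1);      // the earlier reader still sees its version
}
```

In the real container the "removed" marking would be atomic metadata and reclamation would be deferred via EBR rather than a version list, but the locking sequence is the same.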
Oh, my mistake, I completely missed that `update` was `unsafe`. I just glanced right over that somehow.

`update_cow` sounds good; pretty much the ideal version of a locking copy-on-write.

It might also be worth looking into something like `arc-swap`'s `rcu` method. `rcu` may run the updater function multiple times to make it lock-free, but there are some cases where that's acceptable.
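That retry behavior is essentially a compare-and-swap loop; here is a minimal std-only analog over a single `AtomicUsize` (the real `arc-swap` method swaps `Arc` pointers, but the shape is the same):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Lock-free read-copy-update over a single word: the updater closure may
// run multiple times if another thread changes the value concurrently
// (or the weak CAS fails spuriously), which is why it must be repeatable.
fn rcu(cell: &AtomicUsize, mut updater: impl FnMut(usize) -> usize) -> usize {
    let mut current = cell.load(Ordering::Acquire);
    loop {
        let new = updater(current);
        match cell.compare_exchange_weak(current, new, Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => return new,
            Err(actual) => current = actual, // lost the race; recompute from the latest value
        }
    }
}

fn main() {
    let cell = AtomicUsize::new(41);
    assert_eq!(rcu(&cell, |v| v + 1), 42);
    assert_eq!(cell.load(Ordering::Relaxed), 42);
}
```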
In addition to `update_cow`, it would be amazing to provide an `rcu` method even if you can't make it lock-free:

```rust
pub fn rcu<Q, F>(&self, key: &Q, updater: F) -> bool // returns true if the key exists and was updated
where
    K: Borrow<Q>,
    Q: Eq + Hash + ?Sized,
    F: FnOnce(&K, &V) -> V;

pub async fn rcu_async(...) -> bool ...
```

such that the new value can be generated from scratch rather than through a mutable reference, since a mutable reference would be useless if `V = Arc<T>`.
Perhaps rename the old `update` to `update_in_place` or similar as well, and make `update_cow` the new `update`?
The suggested `rcu` API looks great; I guess `F: FnOnce(&K, &V) -> Option<V>` would be better, so the caller can decide whether to create a new version based on the current value.

I'm reluctant to change the naming of existing methods in v1.x.x; I'll clean up any confusing naming in the crate in v2.0.0 (next year? hopefully when `allocator_api` becomes stable).

I'll upload a draft version of the new methods soon.
Might be a bit of feature creep, but if you're adding an option to "do nothing", perhaps an enum such as:

```rust
enum RcuAction<T> {
    DoNothing,
    Update(T),
    Remove,
}

impl<T> From<Option<T>> for RcuAction<T> ... // DoNothing or Update
impl<T> From<()> for RcuAction<T> ...        // DoNothing

pub fn rcu<Q, F, R>(&self, key: &Q, updater: F) -> bool
where
    K: Borrow<Q>,
    Q: Eq + Hash + ?Sized,
    F: FnOnce(&K, &V) -> R,
    R: Into<RcuAction<V>>;
```

to also allow removing the element while we hold the lock. It could be useful if it's not difficult to implement.
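The sketched `From` impls can be made concrete; the following is a std-only illustration of how the dispatch would work (`RcuAction` and `apply` here are illustrative, not scc API):

```rust
// Illustrative: the proposed RcuAction plus blanket From impls let an
// updater closure return (), Option<V>, or RcuAction<V> interchangeably.
enum RcuAction<T> {
    DoNothing,
    Update(T),
    Remove,
}

impl<T> From<Option<T>> for RcuAction<T> {
    fn from(opt: Option<T>) -> Self {
        opt.map_or(RcuAction::DoNothing, RcuAction::Update)
    }
}

impl<T> From<()> for RcuAction<T> {
    fn from(_: ()) -> Self {
        RcuAction::DoNothing
    }
}

// Apply an action to a slot; returns true if the entry still exists.
fn apply<V, R: Into<RcuAction<V>>>(slot: &mut Option<V>, result: R) -> bool {
    match result.into() {
        RcuAction::DoNothing => slot.is_some(),
        RcuAction::Update(v) => {
            *slot = Some(v);
            true
        }
        RcuAction::Remove => {
            *slot = None;
            false
        }
    }
}

fn main() {
    let mut slot = Some(1);
    assert!(apply(&mut slot, Some(2))); // Option<V> converts to Update
    assert_eq!(slot, Some(2));
    assert!(apply(&mut slot, ()));      // () converts to DoNothing
    assert!(!apply(&mut slot, RcuAction::Remove));
    assert_eq!(slot, None);
}
```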
`remove_if` could then even be implemented on top of `rcu`.
Yep, a bit of feature creep, but it's nice to have since `HashMap` already has `entry`/`entry_async` methods that semantically do the same, and it's not at all difficult to implement.

However, the name `rcu` wouldn't fit here, since `rcu` typically implies lock-free code, whereas a lock-free update of a `HashIndex` entry is virtually impossible (not 100% sure if it is totally impossible) unless some API semantics are relaxed (e.g., linearizable -> eventually consistent). Any suggestions other than `rcu`? I'm thinking of `entry`, `modify`, or `read_modify`.
`entry` would be good if there was also a way to insert, such as by providing `(&K, Option<&V>)` to the callback and allowing it to insert the returned value instead: the complete range of operations. However, that introduces annoyances with the key, where it would either need to be given by value or be cloned automatically on insertion. Granted, the `entry` API elsewhere does require it by value anyway, so it would at least match that.
```rust
enum Action<T> {
    DoNothing,
    Set(T),
    Remove,
}

impl<T> From<Option<T>> for Action<T> ... // DoNothing or Set
impl<T> From<()> for Action<T> ...        // DoNothing

pub fn entry<F, R>(&self, key: K, updater: F) -> bool
where
    F: FnOnce(&K, Option<&V>) -> R,
    R: Into<Action<V>>;
```

Using something like `std::borrow::Cow` for the key would work instead, maybe. Otherwise, if it must require the value to exist, `modify`.
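The `Cow` idea can be sketched with std types; `upsert` below is a hypothetical helper, not a proposed scc signature, showing that the key only materializes into an owned value when an insert actually happens:

```rust
use std::borrow::Cow;
use std::collections::HashMap;

// Hypothetical upsert: the key stays borrowed unless an insert is needed,
// so no key clone happens on the read-only or no-op paths.
fn upsert<F>(table: &mut HashMap<String, i32>, key: Cow<'_, str>, f: F)
where
    F: FnOnce(Option<&i32>) -> Option<i32>,
{
    if let Some(new) = f(table.get(key.as_ref())) {
        // The clone (for Cow::Borrowed keys) only happens here.
        table.insert(key.into_owned(), new);
    }
}

fn main() {
    let mut table = HashMap::new();
    upsert(&mut table, Cow::Borrowed("k"), |v| Some(v.copied().unwrap_or(0) + 1));
    upsert(&mut table, Cow::Borrowed("k"), |v| Some(v.copied().unwrap_or(0) + 1));
    assert_eq!(table["k"], 2);
    upsert(&mut table, Cow::Borrowed("missing"), |_| None); // no insert, no key clone
    assert_eq!(table.len(), 1);
}
```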
OK, inserting a new entry through the new methods doesn't look nice at all :-( I'll then go for `modify`, which does read -> remove, update, or nothing.
```rust
table.entry(key, |_, entry| Some(match entry {
    None => Default::default(),
    Some(&value) => value + 1,
}));
```

looks decent to me. Similar to:

```rust
match table.entry(key) {
    Entry::Vacant(v) => v.insert_entry(Default::default()),
    Entry::Occupied(v) => *v.get_mut() += 1,
}
```
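For comparison, the same insert-or-increment pattern is expressible with std's stable `hash_map::Entry` combinators (a runnable std-only sketch, not scc code):

```rust
use std::collections::HashMap;

fn main() {
    let mut table: HashMap<&str, i32> = HashMap::new();
    // Insert the default on the first visit, then increment on later visits.
    for _ in 0..3 {
        table.entry("key").and_modify(|v| *v += 1).or_insert_with(Default::default);
    }
    assert_eq!(table["key"], 2); // inserted 0, then incremented twice
}
```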
I just had an idea right after sending that: instead of relying on the return value, you could create a `hash_index::Entry` struct similar to `hash_map::Entry`, but provide it to the closure instead, and the semantics would be identical to the other `hash_map::Entry`?

```rust
hash_index_table.entry(key, |entry| {
    // use entry like you would hash_map::Entry
});
```
If I had to implement `HashIndex::entry`, I would just resemble the `HashMap` implementation, e.g., receiving `K` by value and returning an `Entry` that owns the bucket. `HashIndex::entry` resembling that of `HashMap` is definitely meaningful, so I'll implement it in the foreseeable future.

=> I'm not quite fond of `HashIndex::entry` receiving a closure and a reference to `K`, since the `entry` method is usually invoked by a user with the intention of inserting a key.

I guess I digressed from the topic by mentioning `HashMap::entry`, thinking that the semantics of `rcu` were similar to those of `HashMap::entry`, which is not quite true due to insertion.

Therefore, I'd like to just implement `modify<F: FnOnce(&K, &V) -> Action<V>>(&key, action) -> bool` for this issue. Would that be OK with you?
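The agreed-upon `modify` semantics (read the entry, then update, remove, or do nothing) can be sketched under a single coarse lock; the `Mutex<HashMap>` and this `Action` enum stand in for the real per-bucket locking, so this is a model of the semantics, not the implementation:

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::Mutex;

enum Action<V> {
    DoNothing,
    Set(V),
    Remove,
}

// Sketch of `modify`: returns true if the key existed when the lock was held.
fn modify<K, V, F>(table: &Mutex<HashMap<K, V>>, key: &K, f: F) -> bool
where
    K: Eq + Hash,
    F: FnOnce(&K, &V) -> Action<V>,
{
    let mut guard = table.lock().unwrap();
    // Read the current value under the lock; bail out if absent.
    let action = match guard.get(key) {
        Some(current) => f(key, current),
        None => return false,
    };
    match action {
        Action::DoNothing => {}
        Action::Set(v) => {
            if let Some(slot) = guard.get_mut(key) {
                *slot = v;
            }
        }
        Action::Remove => {
            guard.remove(key);
        }
    }
    true
}

fn main() {
    let table = Mutex::new(HashMap::from([("k", 1)]));
    assert!(modify(&table, &"k", |_, v| Action::Set(v + 1)));
    assert_eq!(table.lock().unwrap()["k"], 2);
    assert!(modify(&table, &"k", |_, _| Action::Remove));
    assert!(!modify(&table, &"k", |_, _| Action::DoNothing)); // key is gone
}
```

Note the closure receives only references and returns an owned `Action<V>`, so the caller explicitly constructs any new value, which is the copy-on-write property discussed above.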
You're right, the important part of `rcu`/`modify` was that the user explicitly copies the value themselves; I was hung up on the wrong thing. `modify` will work just fine.
Thank you sincerely for your work on this.
Thanks too for your great ideas! I'll upload a new version (v1.6.3) in a couple of days.
Just published 1.6.3. Feel free to give feedback about the new methods!
> This is because `HashIndex::read` and `HashIndex::iter` are allowed to read entries even when they are locked by `HashIndex::update`, `HashIndex::remove` and `HashIndex::insert`

How exactly does this work if the updates are not performed atomically? If `HashIndex::insert` is overwriting a value that is being read by another thread, what is done to avoid a data race?
@ibraheemdev Hi! "Remove" does not actually remove an entry; it only updates the metadata part of `HashIndex` atomically, so any ongoing read operations can still read the data. Outdated entries are cleaned up on resize, or when a large number of entries in the `HashIndex` are outdated.
Ah, so entries can never be overwritten and reclamation only happens at the bucket array level?
Yes, your understanding is correct. Ideally, those outdated entries could be reclaimed by marking them empty as soon as all the current readers are gone, but that is not implemented yet (possible, but implementing it efficiently is pretty hard).
I've seen in your notes that `HashIndex` requires the key/value to be `Clone` to maintain immutability benefits, but what are the cases where they actually get cloned?

Furthermore, the docs for `update` state "The caller has to make sure that there is no reader of the entry". What happens if that occurs? There is no way to ensure that other than to use a `Mutex`, defeating the entire point of this library. It seems like that would be the use case for cloning the value for the update.