wvwwvwwv / scalable-concurrent-containers

High performance containers and utilities for concurrent and asynchronous programming
Apache License 2.0

Bulk remove items from TreeIndex: `remove_range` #120

Closed novacrazy closed 5 months ago

novacrazy commented 6 months ago

Would it be possible to add an efficient `remove_range` or drain-like API on `TreeIndex`?

I'd like to use a `TreeIndex<u64, Arc<[u8]>>` as a queue where many readers keep a pointer into it by index, so they can keep up at their own pace, and periodically clear old entries like `.remove_range(..least_recent_index)`. The alternative would be just repeated calls to `.remove` with a decrementing counter until it returns false.
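A single-threaded sketch of this queue pattern, using std's `BTreeMap` as a stand-in for `TreeIndex` (the names `prune_below` and the value type here are hypothetical; `TreeIndex` itself is concurrent and has a different API):

```rust
use std::collections::BTreeMap;

// Hypothetical single-threaded stand-in for the TreeIndex queue pattern:
// readers track an index, and entries below the slowest reader's index are
// periodically dropped in bulk, like `.remove_range(..least_recent_index)`.
fn prune_below(queue: &mut BTreeMap<u64, Vec<u8>>, least_recent_index: u64) {
    // split_off returns the entries with keys >= least_recent_index;
    // assigning it back drops everything below in one operation.
    *queue = queue.split_off(&least_recent_index);
}

fn main() {
    let mut queue: BTreeMap<u64, Vec<u8>> = (0..10).map(|i| (i, vec![i as u8])).collect();
    prune_below(&mut queue, 7);
    assert_eq!(queue.len(), 3);
    assert!(queue.keys().all(|&k| k >= 7));
}
```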

wvwwvwwv commented 6 months ago

Yes, that kind of API would be helpful. However, I wouldn't expect a huge performance gain, since other threads can always be inserting values in that range, which prevents me from implementing something similar to `split_off`. -> Still, I can get rid of one tree traversal per removal (compared to `for e in range { if cond { tree.remove(e.0); } }`), so I'd estimate the gain at around 30%.

I'll implement something next week.

novacrazy commented 6 months ago

For my use case, at least, the key is always atomically incremented (I keep an `Arc<AtomicU64>` around for that), so nothing would ever be re-inserted below the range. Perhaps some sort of `unsafe fn split_off` or `drain` that assumes this? If not, I understand; this is a hefty ask.
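The ever-increasing-key scheme described above can be sketched with a shared atomic counter (the helper name `next_key` is hypothetical; the actual application code isn't shown in this issue):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Each producer draws its key from one shared counter, so once entries below
// a cutoff are pruned, no key below that cutoff can ever be inserted again.
fn next_key(counter: &AtomicU64) -> u64 {
    counter.fetch_add(1, Ordering::Relaxed)
}

fn main() {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || (0..100).map(|_| next_key(&c)).collect::<Vec<u64>>())
        })
        .collect();
    let mut keys: Vec<u64> = handles.into_iter().flat_map(|h| h.join().unwrap()).collect();
    keys.sort_unstable();
    // fetch_add hands out every value exactly once, even across threads.
    assert_eq!(keys, (0..400).collect::<Vec<u64>>());
}
```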

Thank you again for your continued development and upkeep of scc. It's become a staple library for me.

wvwwvwwv commented 6 months ago

In my opinion, implementing a new data structure for this purpose would be optimal, since the ever-increasing-key assumption enables a lot of optimizations; I think a ring-buffer-like data structure would be better than a tree (I haven't thought about it seriously yet).

On the other hand, implementing `fn split_off` seems to be a viable option, and maybe it doesn't have to be `unsafe`, because `fn insert` can check whether the inserted key has just been split off; though implementing it correctly will be super hard.

Anyway, I'll first focus on optimizing the `for e in range { if cond { tree.remove(e.0); } }` case, and then think about other possible long-term solutions (including a new data structure and `split_off`) afterwards.

wvwwvwwv commented 6 months ago
  1. Implement `fn remove` for `Iter` and `Range`. => It's relatively trivial; ~14 Jan.
  2. Implement `fn remove_range<R: RangeBounds<K>>(&self, range: R)` for bulk removal. => Slightly complicated, but it will be doable! Though I need to figure out the possible outcomes when other threads are inserting keys in the range, and document them. => ~mid Feb?
wvwwvwwv commented 6 months ago

OK, I tried to implement `remove` for `Iter` and `Range`, but it turned out to be much more complicated than expected.

On the other hand, O(1) bulk removal seems to be quite doable in a couple of weeks.

So, in this issue, as originally suggested, I'll directly implement `remove_range`.

novacrazy commented 6 months ago

O(1) bulk removal would be more than optimal. Also absolutely no rush on this, and thank you again!

wvwwvwwv commented 6 months ago

You're welcome! Just note that O(1) is the best-case scenario; most of the time, removing entries close to the start and end bounds will require tree traversal.

wvwwvwwv commented 6 months ago

FYI: uploaded SCC 2.0.9, which contains an unoptimized version of `remove_range`. To be optimized later.

wvwwvwwv commented 5 months ago

FYI: uploaded 2.0.14, which contains the proposed optimized version of `remove_range`. -> Time complexity is O(depth) = O(log N), independent of the size of the range. -> I'm not 100% sure about the correctness of the method. If it doesn't seem to function correctly, please use `for (k, _) in tree.range(a..b, guard) { tree.remove(k); }` instead.
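The suggested fallback (iterate the range, then remove each key) can be sketched single-threaded with std's `BTreeMap` standing in for `TreeIndex` (hypothetical helper name; the real `TreeIndex::range` also takes a `Guard`):

```rust
use std::collections::BTreeMap;

// Single-threaded sketch of the fallback loop: collect the keys in [a, b),
// then remove them one by one. Collecting first avoids mutating the map
// while iterating over it.
fn remove_range_fallback(map: &mut BTreeMap<u64, String>, a: u64, b: u64) {
    let keys: Vec<u64> = map.range(a..b).map(|(&k, _)| k).collect();
    for k in keys {
        map.remove(&k);
    }
}

fn main() {
    let mut map: BTreeMap<u64, String> = (0..8).map(|i| (i, format!("v{i}"))).collect();
    remove_range_fallback(&mut map, 2, 6);
    assert_eq!(map.keys().copied().collect::<Vec<_>>(), vec![0, 1, 6, 7]);
}
```

Note that, unlike this sketch, each `tree.remove(k)` in the concurrent fallback costs a full tree traversal, which is exactly the overhead `remove_range` was added to avoid.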