Weaviate is an open-source vector database that stores both objects and vectors, allowing you to combine vector search with structured filtering, with the fault tolerance and scalability of a cloud-native database.
If the cardinality for a range query is very high (i.e. it matches many distinct values), the current approach is essentially a full-table scan on the inverted index. This runs very slowly on large datasets. This ticket is a feature request to build support for native range queries.
Bitmaps can be used to build dedicated range-based indexes that can retrieve all matches in just a few lookups.
From what I understand from the (fantastic!) FeatureBase article on range bitmaps, we would need bit-sliced bitmaps for an efficient range index.
This means that for an int64/float64 we would need 64+1=65 bitmaps (one per bit position, plus an existence bitmap).
That is fairly reasonable overall, because the number of bitmaps is constant regardless of how many values are indexed.
However, we need to be clever about the write strategy, because every incoming value would want to write into those same 64 bitmaps (or a subset of them) simultaneously. That's a recipe for lock contention.
Based on this, I would suggest that:
We should make this feature opt-in. Not every property needs to be range-indexed, and there's a risk of a significant index cost, so we should only build this for the props that need it.
We should be smart about the memtable logic.
I could imagine that it makes sense not to insert every value into every bitmap directly. Instead, we could store the full value first (fairly cheap: just a 64-bit append).
The same applies to the WAL. We don't need a WAL per bitmap; we can just write the 64-bit value and the 64-bit doc ID onto the WAL, since that's already the most efficient representation of the 65 bitmaps.
Then, at flush time, we could build the 65 bitmaps. This could be bitmap-centric, e.g. one goroutine per bitmap, and should be possible to achieve without any locks (the values are now immutable, so we can safely read them concurrently, and each bitmap can be built in isolation).
A segment merge should be brutally cheap (compared to a roaring set bitmap) because there are never more than 65 bitmaps per segment, so it's just 65 OR operations.