Closed by Sheharyar570 1 month ago
The ground truth distance metric depends on the dataset. See: https://github.com/zilliztech/VectorDBBench/blob/main/vectordb_bench/backend/dataset.py
e.g.
name: str = "LAION"
dim: int = 768
metric_type: MetricType = MetricType.L2
Re: "if the ground truth distance metric and algorithm distance metric are different": yes, since they are different metrics, this would affect recall. However, I have run some ad-hoc tests comparing Euclidean (L2) and Cosine (the two metrics used by VectorDBBench) on the LAION 100M dataset, and the difference was minimal: it didn't change recall by more than about 0.01.
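For anyone who wants to check this on their own data, here is a minimal sketch of that kind of comparison. It is not VectorDBBench code: the helper names (`top_k`, `recall_at_k`, `random_vector`) and the small random dataset are assumptions made up for illustration. It computes brute-force ground truth with L2, searches with cosine, and measures the overlap. It also shows the one case where the mismatch cannot matter: on unit-length vectors, L2 and cosine produce the same ranking.

```python
import math
import random

def l2(a, b):
    # Euclidean distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; smaller means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, vectors, k, dist):
    # Brute-force nearest neighbours: ids of the k closest vectors.
    order = sorted(range(len(vectors)), key=lambda i: dist(query, vectors[i]))
    return order[:k]

def recall_at_k(found, truth):
    # Fraction of ground-truth ids that the search returned.
    return len(set(found) & set(truth)) / len(truth)

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

rng = random.Random(0)

def random_vector(dim):
    # Hypothetical data: random direction with a random length,
    # so that L2 and cosine are free to disagree.
    scale = rng.uniform(0.5, 2.0)
    return [scale * rng.gauss(0, 1) for _ in range(dim)]

vectors = [random_vector(16) for _ in range(500)]
query = random_vector(16)
k = 10

# Ground truth computed with L2, search performed with cosine.
truth_l2 = top_k(query, vectors, k, l2)
found_cos = top_k(query, vectors, k, cosine_distance)
recall_mismatch = recall_at_k(found_cos, truth_l2)

# Same experiment after normalizing everything to unit length.
unit_vectors = [normalize(v) for v in vectors]
unit_query = normalize(query)
truth_l2_unit = top_k(unit_query, unit_vectors, k, l2)
found_cos_unit = top_k(unit_query, unit_vectors, k, cosine_distance)
recall_unit = recall_at_k(found_cos_unit, truth_l2_unit)

print("recall with mismatched metrics (raw vectors): ", recall_mismatch)
print("recall with mismatched metrics (unit vectors):", recall_unit)
```

How far the first number falls below 1.0 depends on how much the vector lengths vary in the dataset, so a small gap on one dataset does not guarantee a small gap on another.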
Thanks @greenhal. If you could answer one more thing: how do shuffled and non-shuffled datasets affect recall?
I think shuffled data is just a way of shuffling the source data so it's not loaded in order of the id.
It shouldn't have any impact on recall.
@greenhal Yes.
When we designed the filter case, the condition we used was based on (int) id > xxx. We found that a lot of vector DBs behave strangely when the filter condition is related to the insertion order, so we designed the shuffled option. It will not have any effect on non-filter test cases.
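To illustrate why insertion order and an id-based filter interact, here is a small sketch. It is only a toy model, not VectorDBBench code: `filtered_positions`, the row count, and the threshold are hypothetical. When rows are inserted in id order, every row matching id > threshold sits in one contiguous block at the end of the insertion sequence. When the rows are shuffled first, the same matching ids are scattered across the whole sequence.

```python
import random

def filtered_positions(insert_order, threshold):
    # Insertion positions of the rows that pass the filter id > threshold.
    return [pos for pos, row_id in enumerate(insert_order) if row_id > threshold]

n = 1000
threshold = 899  # hypothetical filter: id > 899 keeps the top 10% of ids

sequential = list(range(n))         # rows inserted in id order
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)  # rows inserted in random order

seq_pos = filtered_positions(sequential, threshold)
shuf_pos = filtered_positions(shuffled, threshold)

# The same ids match the filter either way; only their positions differ.
seq_ids = sorted(sequential[p] for p in seq_pos)
shuf_ids = sorted(shuffled[p] for p in shuf_pos)

# Width of the insertion-position range that the matching rows occupy.
span_seq = max(seq_pos) - min(seq_pos) + 1
span_shuf = max(shuf_pos) - min(shuf_pos) + 1

print("matching rows:", len(seq_pos))
print("position span, sequential insert:", span_seq)
print("position span, shuffled insert:  ", span_shuf)
```

In the sequential case the span equals the number of matching rows, meaning the filter selects exactly the most recently inserted rows. Without a filter, the set of true nearest neighbours is the same regardless of insertion order, which is consistent with the shuffled option not affecting non-filter cases.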
Hi everyone, is the dataset's ground truth in VectorDBBench computed using the cosine distance metric, euclidean, or another distance metric? Also, if the ground truth distance metric and algorithm distance metric are different, would this impact recall?