zilliztech / VectorDBBench

A Benchmark Tool for VectorDB
MIT License
561 stars 151 forks

Distance metric used to compute Ground Truth #374

Closed Sheharyar570 closed 1 month ago

Sheharyar570 commented 1 month ago

Hi everyone, is the dataset's ground truth in VectorDBBench computed using cosine distance, Euclidean distance, or another metric? Also, if the ground truth distance metric and the algorithm's distance metric are different, would this impact recall?

greenhal commented 1 month ago

The ground truth distance metric depends on the dataset. See: https://github.com/zilliztech/VectorDBBench/blob/main/vectordb_bench/backend/dataset.py

e.g.

    name: str = "LAION"
    dim: int = 768
    metric_type: MetricType = MetricType.L2

Regarding "if the ground truth distance metric and algorithm distance metric are different": yes, since they are different metrics, this would affect recall. However, I have run some ad-hoc tests comparing Euclidean (L2) and Cosine (the two metrics used by VectorDBBench) on the LAION 100M dataset, and the difference was minimal: recall changed by no more than about 0.01.
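To make the comparison concrete, here is a minimal sketch of that kind of check on random data. It is not VectorDBBench code: the helper names (`top_k`, `recall`, `normalize`) and the toy sizes are assumptions made for illustration only. It computes brute-force ground truth with L2, runs the "search" with cosine, and measures recall between the two. It also repeats the check on unit-length vectors, where L2 and cosine produce the same ranking (for unit vectors, squared L2 distance equals 2 - 2 * cosine similarity), which may be part of why the gap can be small when embeddings are close to unit length.

```python
import math
import random


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, vectors, k, distance):
    """Return the ids of the k nearest vectors, found by brute force."""
    ranked = sorted(range(len(vectors)), key=lambda i: distance(query, vectors[i]))
    return ranked[:k]


def recall(ground_truth_ids, result_ids):
    """Fraction of ground-truth ids that also appear in the results."""
    return len(set(ground_truth_ids) & set(result_ids)) / len(ground_truth_ids)


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Toy data (hypothetical sizes, not a real benchmark dataset).
rng = random.Random(0)
dim, n, k = 16, 500, 10
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
query = [rng.gauss(0, 1) for _ in range(dim)]

# Ground truth computed with L2, search results computed with cosine.
gt_l2 = top_k(query, vectors, k, l2_distance)
res_cos = top_k(query, vectors, k, cosine_distance)
print("recall (raw vectors):", recall(gt_l2, res_cos))

# Same check after normalizing every vector to unit length.
unit_vectors = [normalize(v) for v in vectors]
unit_query = normalize(query)
gt_l2_unit = top_k(unit_query, unit_vectors, k, l2_distance)
res_cos_unit = top_k(unit_query, unit_vectors, k, cosine_distance)
print("recall (unit vectors):", recall(gt_l2_unit, res_cos_unit))
```

On raw vectors the recall depends on how much the vector norms vary; on unit vectors the two top-k sets should match.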

Sheharyar570 commented 1 month ago

Thanks @greenhal. If you could answer one more thing: how does using a shuffled versus a non-shuffled dataset affect recall?

greenhal commented 1 month ago

I think shuffled data is just a way of shuffling the source data so it's not loaded in order of the id.

see: https://github.com/zilliztech/VectorDBBench/blob/51b1eced3b9d7a6283a4a119956eecdc262f88a0/README.md?plain=1#L304

It shouldn't have any impact on recall.

alwayslove2013 commented 1 month ago

@greenhal Yes.

"I think shuffled data is just a way of shuffling the source data so it's not loaded in order of the id. It shouldn't have any impact on recall."

When we designed the filter case, the condition we used was based on (int) id > xxx. We found that many vector databases behave strangely when the filter condition is correlated with the insertion order, so we added the shuffled option. It has no effect on non-filter test cases.
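As a rough illustration of why insertion order can matter for an id-based filter, the sketch below counts how many rows pass `id > threshold` in each consecutive chunk of inserts. The chunk ("segment") size, the threshold, and the function name are hypothetical, and this says nothing about how any particular vector database stores data; it only shows how the matching rows are distributed across the insertion order.

```python
import random


def matches_per_segment(insert_order, threshold, segment_size):
    """Count rows passing `id > threshold` in each consecutive chunk of inserts."""
    counts = []
    for start in range(0, len(insert_order), segment_size):
        segment = insert_order[start:start + segment_size]
        counts.append(sum(1 for row_id in segment if row_id > threshold))
    return counts


n, segment_size = 1000, 100
threshold = 899  # hypothetical filter: id > 899 keeps 10% of the rows

sequential = list(range(n))         # rows inserted in id order
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)  # rows inserted in random order

seq_counts = matches_per_segment(sequential, threshold, segment_size)
shuf_counts = matches_per_segment(shuffled, threshold, segment_size)
print("sequential:", seq_counts)
print("shuffled:  ", shuf_counts)
```

With sequential inserts, every matching row lands in the last chunk, so the filter result is tightly tied to insertion order. With shuffled inserts, the same 100 matching rows are spread across the chunks.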