Open svim-ig opened 1 day ago
@svim-ig After selecting the test case, VectorDBBench will automatically download the required dataset, so there is no need to download it in advance.
@alwayslove2013 For now I am not going to use the datasets through the VectorDBBench tool. I would like to download them and use them separately to evaluate the performance of my LLM models and vector database, so I need the datasets on their own.
@svim-ig The dataset used by VectorDBBench is derived from publicly available datasets (excluding OpenAI). I recommend using the original datasets for testing. You can easily find download links online, such as on Hugging Face.
The download links for the VectorDBBench datasets can be somewhat complex.
The basic format is:
[data_source]/benchmark/[dataset_dir]/[file_type]
For example:
assets.zilliz.com/benchmark/openai_medium_500k/shuffle_train.parquet
The data_source is categorized by region; you can choose based on your network conditions.
assets.zilliz.com (AWS US West)
assets.zilliz.com.cn (Aliyun CN Shanghai)
Common dataset_dir categories include:
openai_large_5m
openai_medium_500k
cohere_large_10m
cohere_medium_1m
(Note: Cohere datasets are 768 dimensions, while OpenAI datasets are 1536 dimensions.)
The file_type is divided into three categories:
test.parquet - Test vectors
neighbors.parquet - Ground truth IDs
shuffle_train.parquet - Training vector set. It is important to note that for large datasets, the files are split into multiple smaller files, with naming conventions like shuffle_train-04-of-10.parquet.
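For anyone assembling these links by hand, the format above can be sketched in Python roughly as follows. This is only an illustration, not code from VectorDBBench: the helper names `dataset_url` and `split_train_files`, the `https` scheme, the region keys, and the starting index of the split files are all assumptions, and the number of split files for a given dataset is not stated in this thread.

```python
from urllib.request import urlretrieve

# Hosts quoted in this thread; the "us"/"cn" keys are made up for this sketch.
DATA_SOURCES = {
    "us": "assets.zilliz.com",     # AWS US West
    "cn": "assets.zilliz.com.cn",  # Aliyun CN Shanghai
}


def dataset_url(dataset_dir, file_name, region="us", scheme="https"):
    """Build [data_source]/benchmark/[dataset_dir]/[file_type].

    The scheme is an assumption; the thread gives the links without one.
    """
    return f"{scheme}://{DATA_SOURCES[region]}/benchmark/{dataset_dir}/{file_name}"


def split_train_files(total, first_index=0):
    """List split training file names such as shuffle_train-04-of-10.parquet.

    Whether numbering starts at 0 or 1 is an assumption; adjust first_index
    if the first file is not found.
    """
    return [
        f"shuffle_train-{i:02d}-of-{total:02d}.parquet"
        for i in range(first_index, first_index + total)
    ]


def download(dataset_dir, file_name, region="us"):
    """Fetch one file into the current directory and return its local name."""
    urlretrieve(dataset_url(dataset_dir, file_name, region), file_name)
    return file_name


if __name__ == "__main__":
    # Print the links for a small dataset without downloading anything.
    for name in ("test.parquet", "neighbors.parquet", "shuffle_train.parquet"):
        print(dataset_url("openai_medium_500k", name))
```

Calling `download("openai_medium_500k", "test.parquet")` would then fetch the test vectors; for a large dataset, loop over `split_train_files(total)` with the real file count once you know it.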
Unable to download the datasets mentioned in the repo. Same issue with the GIST and Cohere datasets. Please share the download links for the datasets used in the repo for benchmark testing.