Unable to download the datasets

svim-ig commented 1 day ago

I am unable to download the datasets mentioned in the repo. I have the same issue with the GIST and Cohere datasets. Please share the download links for the datasets used in the repo for benchmark testing.

alwayslove2013 commented 14 hours ago

@svim-ig After selecting the test case, VectorDBBench will automatically download the required dataset, so there is no need to download it in advance.

svim-ig commented 12 hours ago

@alwayslove2013 For now, I am not going to use the datasets through the VectorDBBench tool. I would like to download the datasets and use them to evaluate the performance of my LLM models and vector database, so I need the datasets separately.

alwayslove2013 commented 10 hours ago

@svim-ig The datasets used by VectorDBBench are derived from publicly available datasets (excluding OpenAI). I recommend using the original datasets for testing. You can easily find download links online, such as on Hugging Face.

The download links for the VectorDBBench datasets can be somewhat complex.

The basic format is: [data_source]/benchmark/[dataset_dir]/[file_type]

For example: assets.zilliz.com/benchmark/openai_medium_500k/shuffle_train.parquet

The data_source is categorized by region; choose one based on your network conditions.

assets.zilliz.com (AWS US West)
assets.zilliz.com.cn (Aliyun CN Shanghai)

Common dataset_dir categories include:

openai_large_5m
openai_medium_500k
cohere_large_10m
cohere_medium_1m

Note: Cohere datasets have 768 dimensions, while OpenAI datasets have 1536 dimensions.

The file_type is divided into three categories:

test.parquet - Test vectors
neighbors.parquet - Ground truth IDs
shuffle_train.parquet - Training vector set. Note that for large datasets, this file is split into multiple smaller files, named like shuffle_train-04-of-10.parquet.
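
As a rough illustration of the URL format above, here is a minimal Python sketch that composes download links. The helper names (dataset_url, train_shard_names) are hypothetical and not part of VectorDBBench. The https scheme, the two-digit zero padding, and the zero-based shard numbering are assumptions inferred from the single example name shuffle_train-04-of-10.parquet, and the shard count for a given dataset is not stated here, so you need to supply it yourself.

```python
from typing import List


def dataset_url(data_source: str, dataset_dir: str, file_name: str,
                scheme: str = "https") -> str:
    """Compose [data_source]/benchmark/[dataset_dir]/[file_type].

    The scheme is an assumption; the format above does not state one.
    """
    return f"{scheme}://{data_source}/benchmark/{dataset_dir}/{file_name}"


def train_shard_names(shard_count: int, first_index: int = 0) -> List[str]:
    """List split training file names for a large dataset.

    Assumes two-digit zero-padded indices starting at first_index
    (zero-based by default). Adjust first_index if the real files
    turn out to be numbered from 1.
    """
    return [
        f"shuffle_train-{i:02d}-of-{shard_count}.parquet"
        for i in range(first_index, first_index + shard_count)
    ]


if __name__ == "__main__":
    # Single-file dataset, matching the example given above.
    print(dataset_url("assets.zilliz.com", "openai_medium_500k",
                      "shuffle_train.parquet"))

    # Split dataset; shard_count=10 is only taken from the example name.
    for name in train_shard_names(10):
        print(dataset_url("assets.zilliz.com", "cohere_large_10m", name))
```

The resulting URLs can then be fetched with any HTTP client. If a link returns an error, check the shard count and numbering for that dataset first, since both are guesses in this sketch.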

zilliztech / VectorDBBench

Unable to download the datasets #411