zilliztech / VectorDBBench

A Benchmark Tool for VectorDB
MIT License

Chroma test fails - 1M dataset, 768 dim, filter 1% - Expected operand value to be an int or a float for operator $gt, got >=10000 (serial_runner.py:191) (916) #276

Closed itay-barzion closed 2 months ago

itay-barzion commented 4 months ago

The Chroma DB test fails on 1M vectors (768 dimensions) under a low filtering rate (1% of vectors). Can you fix it, please?

WARNING: VectorDB search_embedding error: Expected operand value to be an int or a float for operator $gt, got >=10000 (serial_runner.py:191) (916)

```
2024-02-25 14:53:25,356 | INFO: generated uuid for the tasks: d62f7103b4934207b008fbe76871717b (interface.py:69) (199)
2024-02-25 14:53:25,578 | INFO | DB          | CaseType     Dataset              Filter  | task_label (task_runner.py:288)
2024-02-25 14:53:25,578 | INFO | ----------- | ------------ -------------------- ------- | ------- (task_runner.py:288)
2024-02-25 14:53:25,578 | INFO | Chroma-ebs  | Performance  Cohere-MEDIUM-1M     0.01    | 2024022514 (task_runner.py:288)
2024-02-25 14:53:25,578 | INFO: task submitted: id=d62f7103b4934207b008fbe76871717b, 2024022514, case number: 1 (interface.py:235) (199)
2024-02-25 14:53:26,702 | INFO: [1/1] start case: {'label': <CaseLabel.Performance: 2>, 'dataset': {'data': {'name': 'Cohere', 'size': 1000000, 'dim': 768, 'metric_type': <MetricType.COSINE: 'COSINE'>}}, 'db': 'Chroma-ebs'}, drop_old=True (interface.py:167) (382)
2024-02-25 14:53:27,063 | INFO: Chroma client drop_old collection: example2 (chroma.py:38) (382)
2024-02-25 14:53:27,280 | INFO: local dataset root path not exist, creating it: /tmp/vectordb_bench/dataset/cohere/cohere_medium_1m (data_source.py:126) (382)
2024-02-25 14:53:27,280 | INFO: Start to downloading files, total count: 5 (data_source.py:142) (382)
100%|████████████████████████████████████████████████| 5/5 [02:18<00:00, 27.62s/it]
2024-02-25 14:55:45,367 | INFO: Succeed to download all files, downloaded file count = 5 (data_source.py:147) (382)
2024-02-25 14:55:45,367 | INFO: Read the entire file into memory: test.parquet (dataset.py:229) (382)
2024-02-25 14:55:46,888 | INFO: (SpawnProcess-1:1) Start inserting embeddings in batch 5000 (serial_runner.py:35) (433)
2024-02-25 14:55:46,888 | INFO: Get iterator for shuffle_train.parquet (dataset.py:247) (433)
2024-02-25 14:58:17,354 | INFO: (SpawnProcess-1:1) Loaded 100000 embeddings into VectorDB (serial_runner.py:61) (433)
2024-02-25 15:01:08,708 | INFO: (SpawnProcess-1:1) Loaded 200000 embeddings into VectorDB (serial_runner.py:61) (433)
2024-02-25 15:04:18,759 | INFO: (SpawnProcess-1:1) Loaded 300000 embeddings into VectorDB (serial_runner.py:61) (433)
2024-02-25 15:07:52,795 | INFO: (SpawnProcess-1:1) Loaded 400000 embeddings into VectorDB (serial_runner.py:61) (433)
2024-02-25 15:11:45,831 | INFO: (SpawnProcess-1:1) Loaded 500000 embeddings into VectorDB (serial_runner.py:61) (433)
2024-02-25 15:15:53,432 | INFO: (SpawnProcess-1:1) Loaded 600000 embeddings into VectorDB (serial_runner.py:61) (433)
2024-02-25 15:20:20,517 | INFO: (SpawnProcess-1:1) Loaded 700000 embeddings into VectorDB (serial_runner.py:61) (433)
2024-02-25 15:25:11,862 | INFO: (SpawnProcess-1:1) Loaded 800000 embeddings into VectorDB (serial_runner.py:61) (433)
2024-02-25 15:30:17,823 | INFO: (SpawnProcess-1:1) Loaded 900000 embeddings into VectorDB (serial_runner.py:61) (433)
2024-02-25 15:35:40,085 | INFO: (SpawnProcess-1:1) Loaded 1000000 embeddings into VectorDB (serial_runner.py:61) (433)
2024-02-25 15:35:40,085 | INFO: (SpawnProcess-1:1) Finish loading all dataset into VectorDB, dur=2393.1970115460026 (serial_runner.py:63) (433)
2024-02-25 15:35:43,460 | INFO: Finish loading the entire dataset into VectorDB, insert_duration=2395.502501733001, optimize_duration=0.11211290799838025 load_duration(insert + optimize) = 2395.6146 (task_runner.py:136) (382)
2024-02-25 15:35:43,535 | INFO: Read the entire file into memory: neighbors_head_1p.parquet (dataset.py:229) (382)
2024-02-25 15:35:45,052 | INFO: SpawnProcess-1:3 start search the entire test_data to get recall and latency (serial_runner.py:173) (916)
2024-02-25 15:35:45,182 | WARNING: VectorDB search_embedding error: Expected operand value to be an int or a float for operator $gt, got >=10000 (serial_runner.py:191) (916)
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/runner/serial_runner.py", line 184, in search
    results = self.db.search_embedding(
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/clients/chroma/chroma.py", line 112, in search_embedding
    results = self.collection.query(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/chromadb/api/models/Collection.py", line 295, in query
    valid_where = validate_where(where) if where else {}
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/chromadb/api/types.py", line 346, in validate_where
    validate_where(where_expression)
  File "/usr/local/lib/python3.12/site-packages/chromadb/api/types.py", line 359, in validate_where
    raise ValueError(
ValueError: Expected operand value to be an int or a float for operator $gt, got >=10000
2024-02-25 15:35:46,229 | WARNING: search error: Expected operand value to be an int or a float for operator $gt, got >=10000, Expected operand value to be an int or a float for operator $gt, got >=10000 (task_runner.py:174) (382)
2024-02-25 15:35:46,229 | WARNING: Failed to run performance case, reason = Expected operand value to be an int or a float for operator $gt, got >=10000 (task_runner.py:146) (382)
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/task_runner.py", line 143, in _run_perf_case
    m.recall, m.serial_latency_p99 = self._serial_search()
                                     ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/task_runner.py", line 176, in _serial_search
    raise e from None
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/task_runner.py", line 172, in _serial_search
    return self.serial_search_runner.run()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/runner/serial_runner.py", line 226, in run
    return self._run_in_subprocess()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/runner/serial_runner.py", line 222, in _run_in_subprocess
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
ValueError: Expected operand value to be an int or a float for operator $gt, got >=10000
2024-02-25 15:35:46,230 | WARNING: [1/1] case {'label': <CaseLabel.Performance: 2>, 'dataset': {'data': {'name': 'Cohere', 'size': 1000000, 'dim': 768, 'metric_type': <MetricType.COSINE: 'COSINE'>}}, 'db': 'Chroma-ebs'} failed to run, reason=Expected operand value to be an int or a float for operator $gt, got >=10000 (interface.py:187) (382)
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/interface.py", line 168, in _async_task_v2
    case_res.metrics = runner.run(drop_old)
                       ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/task_runner.py", line 101, in run
    return self._run_perf_case(drop_old)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/task_runner.py", line 148, in _run_perf_case
    raise e from None
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/task_runner.py", line 143, in _run_perf_case
    m.recall, m.serial_latency_p99 = self._serial_search()
                                     ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/task_runner.py", line 176, in _serial_search
    raise e from None
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/task_runner.py", line 172, in _serial_search
    return self.serial_search_runner.run()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/runner/serial_runner.py", line 226, in run
    return self._run_in_subprocess()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vectordb_bench/backend/runner/serial_runner.py", line 222, in _run_in_subprocess
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
ValueError: Expected operand value to be an int or a float for operator $gt, got >=10000
2024-02-25 15:35:46,232 | INFO |Task summary: run_id=d62f7, task_label=2024022514 (models.py:285)
2024-02-25 15:35:46,232 | INFO |DB     | db_label case                label      | load_dur    qps        latency(p99)    recall        max_load_count | label (models.py:285)
2024-02-25 15:35:46,232 | INFO |------ | -------- ------------------- ---------- | ----------- ---------- --------------- ------------- -------------- | ----- (models.py:285)
2024-02-25 15:35:46,232 | INFO |Chroma | ebs      Performance768D1M1P 2024022514 | 0.0         0.0        0.0             0.0           0              | x (models.py:285)
2024-02-25 15:35:46,232 | INFO: local result directory not exist, creating it: /usr/local/lib/python3.12/site-packages/vectordb_bench/results/Chroma (models.py:131) (382)
2024-02-25 15:35:46,232 | INFO: write results to disk /usr/local/lib/python3.12/site-packages/vectordb_bench/results/Chroma/result_20240225_2024022514_chroma.json (models.py:143) (382)
2024-02-25 15:35:46,233 | INFO: Succes to finish task: label=2024022514, run_id=d62f7103b4934207b008fbe76871717b (interface.py:207) (382)
```

alwayslove2013 commented 4 months ago

Seems something is wrong with the parameterization of the hybrid search. We will check it.

ValueError: Expected operand value to be an int or a float for operator $gt, got >=10000

https://github.com/zilliztech/VectorDBBench/blob/438c1677f84dbfce4a04e0ef70398c783be39817/vectordb_bench/backend/clients/chroma/chroma.py#L107-L127
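For context, the failure mode can be reproduced with a small self-contained sketch. The `filters` shape comes from the benchmark discussion below; the `validate_where` here is a simplified stand-in for Chroma's validator, not Chroma's actual code:

```python
# VectorDBBench passes a filters dict whose "metadata" value is a
# Milvus-style expression string, while "id" holds the numeric cutoff:
filters = {"metadata": ">=10000", "id": 10000}

# The Chroma client built the where clause from the string value, so the
# $gt operand ended up being ">=10000" instead of the int 10000:
broken_where = {"id": {"$gt": filters["metadata"]}}

# Chroma rejects non-numeric operands for $gt/$gte/$lt/$lte;
# a minimal stand-in for that check:
def validate_where(where):
    for field, expr in where.items():
        for op, operand in expr.items():
            if op in ("$gt", "$gte", "$lt", "$lte") and not isinstance(operand, (int, float)):
                raise ValueError(
                    f"Expected operand value to be an int or a float for operator {op}, got {operand}"
                )
    return where

# Passing the numeric id value instead keeps the operand valid:
fixed_where = {"id": {"$gt": filters["id"]}}
```

Running `validate_where(broken_where)` raises the same `ValueError` message seen in the logs, while `fixed_where` passes.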

alwayslove2013 commented 4 months ago

@s-gruneberg could you help fix it?

Maybe using id_value rather than metadata_value would work.

FYI https://docs.trychroma.com/usage-guide#using-where-filters

XuanYang-cn commented 4 months ago

The benchmark assumes test filters of the format: {'metadata': '>=10000', 'id': 10000}

metadata_value (the 'metadata' entry) is a str used by the Milvus expr; the Chroma client should use the numeric id only.

s-gruneberg commented 4 months ago

> @s-gruneberg could you help fix it?
>
> Maybe using id_value rather than metadata_value would work.
>
> FYI https://docs.trychroma.com/usage-guide#using-where-filters

I drafted something up and opened a PR. The unit tests are passing; unfortunately, I don't have time to do a full-scale test right now. I think it should work with the format of the test filters, but it may need some tweaking to handle edge cases in the filters dictionary that gets passed in. It will also only filter by 'id'.

Hope this helps!

alwayslove2013 commented 4 months ago

@itay-barzion This has been fixed on the latest main branch, and PyPI will be updated in the next release. PR: https://github.com/zilliztech/VectorDBBench/pull/278

You can test the latest changes via git:

```shell
git clone https://github.com/zilliztech/VectorDBBench.git
cd VectorDBBench
pip install -e ".[test]"
init_bench
```

itay-barzion commented 4 months ago

Hi, thanks very much, I will check it.

itay-barzion commented 4 months ago

Hi, using the fix I no longer see the old error, but it still seems to get stuck after a while on search:

```
2024-03-04 14:01:50,824 | INFO: generated uuid for the tasks: 972ae2b286504feaa4c05a96d35c96bb (interface.py:69) (206)
2024-03-04 14:01:51,051 | INFO | DB          | CaseType     Dataset              Filter  | task_label (task_runner.py:288)
2024-03-04 14:01:51,051 | INFO | ----------- | ------------ -------------------- ------- | ------- (task_runner.py:288)
2024-03-04 14:01:51,051 | INFO | Chroma-fsx  | Performance  Cohere-MEDIUM-1M     0.01    | 2024030414 (task_runner.py:288)
2024-03-04 14:01:51,051 | INFO: task submitted: id=972ae2b286504feaa4c05a96d35c96bb, 2024030414, case number: 1 (interface.py:235) (206)
2024-03-04 14:01:52,199 | INFO: [1/1] start case: {'label': <CaseLabel.Performance: 2>, 'dataset': {'data': {'name': 'Cohere', 'size': 1000000, 'dim': 768, 'metric_type': <MetricType.COSINE: 'COSINE'>}}, 'db': 'Chroma-fsx'}, drop_old=True (interface.py:167) (243)
2024-03-04 14:01:52,564 | INFO: Chroma client drop_old collection: example2 (chroma.py:32) (243)
2024-03-04 14:01:52,786 | INFO: local dataset root path not exist, creating it: /tmp/vectordb_bench/dataset/cohere/cohere_medium_1m (data_source.py:126) (243)
2024-03-04 14:01:52,787 | INFO: Start to downloading files, total count: 5 (data_source.py:142) (243)
100%|████████████████████████████████████████████████| 5/5 [02:17<00:00, 27.40s/it]
2024-03-04 14:04:09,804 | INFO: Succeed to download all files, downloaded file count = 5 (data_source.py:147) (243)
2024-03-04 14:04:09,805 | INFO: Read the entire file into memory: test.parquet (dataset.py:229) (243)
2024-03-04 14:04:11,355 | INFO: (SpawnProcess-1:1) Start inserting embeddings in batch 5000 (serial_runner.py:35) (293)
2024-03-04 14:04:11,355 | INFO: Get iterator for shuffle_train.parquet (dataset.py:247) (293)
2024-03-04 14:06:42,553 | INFO: (SpawnProcess-1:1) Loaded 100000 embeddings into VectorDB (serial_runner.py:61) (293)
2024-03-04 14:09:37,565 | INFO: (SpawnProcess-1:1) Loaded 200000 embeddings into VectorDB (serial_runner.py:61) (293)
2024-03-04 14:12:51,249 | INFO: (SpawnProcess-1:1) Loaded 300000 embeddings into VectorDB (serial_runner.py:61) (293)
2024-03-04 14:16:22,450 | INFO: (SpawnProcess-1:1) Loaded 400000 embeddings into VectorDB (serial_runner.py:61) (293)
2024-03-04 14:20:08,331 | INFO: (SpawnProcess-1:1) Loaded 500000 embeddings into VectorDB (serial_runner.py:61) (293)
2024-03-04 14:24:10,327 | INFO: (SpawnProcess-1:1) Loaded 600000 embeddings into VectorDB (serial_runner.py:61) (293)
2024-03-04 14:28:28,498 | INFO: (SpawnProcess-1:1) Loaded 700000 embeddings into VectorDB (serial_runner.py:61) (293)
2024-03-04 14:33:09,604 | INFO: (SpawnProcess-1:1) Loaded 800000 embeddings into VectorDB (serial_runner.py:61) (293)
2024-03-04 14:38:08,783 | INFO: (SpawnProcess-1:1) Loaded 900000 embeddings into VectorDB (serial_runner.py:61) (293)
2024-03-04 14:43:20,308 | INFO: (SpawnProcess-1:1) Loaded 1000000 embeddings into VectorDB (serial_runner.py:61) (293)
2024-03-04 14:43:20,308 | INFO: (SpawnProcess-1:1) Finish loading all dataset into VectorDB, dur=2348.953247871017 (serial_runner.py:63) (293)
2024-03-04 14:43:23,358 | INFO: Finish loading the entire dataset into VectorDB, insert_duration=2350.8661376769887, optimize_duration=0.13282266608439386 load_duration(insert + optimize) = 2350.999 (task_runner.py:136) (243)
2024-03-04 14:43:23,426 | INFO: Read the entire file into memory: neighbors_head_1p.parquet (dataset.py:229) (243)
2024-03-04 14:43:24,988 | INFO: SpawnProcess-1:3 start search the entire test_data to get recall and latency (serial_runner.py:173) (766)
```

s-gruneberg commented 4 months ago

@itay-barzion I can try to help you troubleshoot this. Can you see the logs for your Chroma database looking busy while it's searching? I'm running locally, using Chroma in Docker, and my logs are very busy with search queries. I trimmed the dataset down to 10% of its size to see if I could get the same filter performance test running quickly, and it finished the entire (10%-size) test in about 20 minutes. If the logs are busy, it's searching, just slowly.

If the Chroma logs are not busy while the search is running, do the Chroma unit tests in tests/test_chroma.py pass? I normally clear my database/collections manually when running the unit tests, but you can modify the unit tests to delete the collection as shown below (I forgot to include this in the last PR I made for Chroma). This will clear out the collection made by VectorDBBench in Chroma and will require you to reload the vectors when running your test.

[Screenshot 2024-03-04 at 12.31.55 PM]

You should also append this to the end of the test_chroma.py file to clear out the collection after the unit tests run:

[Screenshot 2024-03-04 at 12.17.42 PM]

I need to modify these unit tests and create a pull request to make troubleshooting Chroma easier in the future, but I have not done it yet (sorry!). So if the Chroma logs are not busy while searching, could you please modify your local files as shown above and run the unit tests?
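For readers without the screenshots, the cleanup step can be approximated by a helper like the one below. The host, port, and collection name `example2` (taken from the logs above) are assumptions; adjust them to your deployment:

```python
def drop_benchmark_collection(host: str = "localhost", port: int = 8000,
                              name: str = "example2") -> None:
    """Delete the collection VectorDBBench created, so the next run starts clean."""
    import chromadb  # imported lazily so this helper is easy to stub in tests

    client = chromadb.HttpClient(host=host, port=port)
    client.delete_collection(name=name)
```

Calling this after the unit tests run forces VectorDBBench to reload the vectors on the next benchmark run.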

@alwayslove2013 I'll try to help troubleshoot this the best I can. I will make another PR for this issue with the unit-test modifications screenshotted above, to help troubleshoot issues with Chroma in the future. The test is running for me locally.

itay-barzion commented 4 months ago

Thanks for all the help, I will check it and update.