zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://docs.privategpt.dev
Apache License 2.0

Ingestion of documents with Ollama is incredibly slow #1691

Open Zirgite opened 4 months ago

Zirgite commented 4 months ago

I upgraded to the latest version of privateGPT and ingestion is much slower than in previous versions, to the point of being unusable. I use the recommended Ollama option. After more than an hour the document is still not finished, even though I have a 3090 and an 18-core CPU and am using the very small Mistral model. I am ingesting a 105 kB PDF file with 37 pages of text. I later switched to the less recommended 'llms-llama-cpp' option in privateGPT, which solved the problem. But is there still any way to get fast ingestion with Ollama?

yangyushi commented 4 months ago

I have the exact same issue with the ollama embedding mode pre-configured in the file settings-ollama.yaml.

I ingested my documents at a reasonable (much faster) speed with the huggingface embedding mode.

imartinez commented 4 months ago

Interesting. Ollama's embedding model is much bigger than the default huggingface one, which may be the main cause: the dimensionality of its vectors is double that of the default model.
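
For reference, a quick way to check the dimensionality locally is to request one embedding from Ollama and look at the vector length. This is only a sketch: it assumes a local Ollama on the default port serving the nomic-embed-text model, and the 768 vs. 384 figures in the comments are assumptions about typical model defaults, not numbers taken from this thread.

    # Hedged sketch: inspect the dimensionality of the embedding model served by Ollama.
    # Assumes Ollama is running locally on its default port and that the profile uses
    # the nomic-embed-text model (both are assumptions, not confirmed by this thread).
    import requests

    resp = requests.post(
        "http://127.0.0.1:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": "hello world"},
        timeout=30,
    )
    resp.raise_for_status()
    vector = resp.json()["embedding"]
    # nomic-embed-text returns 768-dimensional vectors, roughly double the 384 of a
    # small HuggingFace embedding model such as BAAI/bge-small-en-v1.5.
    print(len(vector))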

iotnxt commented 4 months ago

I can confirm a performance degradation on 0.4.0 when running with poetry install --extras "ui llms-ollama embeddings-ollama vector-stores-qdrant": it is unusable on older PCs. And if I try poetry install --extras "ui llms-llama-cpp embeddings-huggingface vector-stores-qdrant" instead, I hit authentication issues with huggingface when running poetry run python scripts/setup.

dbzoo commented 4 months ago

Embedding model changes:

iotnxt commented 4 months ago

Thanks @dbzoo but I think it might be more than just that.

During the 60+ minutes it was ingesting, resource utilisation was very modest: ~8.4% of 32 GB RAM, ~20% CPU (8 cores @ 3.2 GHz), and only sporadic, small spikes of activity on the 1.5 TB SSD.

At least one of those resources should have been heavily utilised (on average) during those 60+ minutes of processing that small PDF before I decided to cancel it.

Note: there is no GPU on my modest system, but not long ago the same file took 20 minutes on an earlier version of privateGPT, and asking questions worked afterwards (replies were slow, but it did work).

cc: @imartinez FEATURE Request: please show a progress bar or a percentage indicating how much has been ingested. (Maybe I cancelled it without knowing there was just one minute left.)

imartinez commented 4 months ago

@iotnxt maybe Ollama's support for embedding models is not fully optimized yet; that could be the case. Go back to Huggingface embeddings for intensive use cases.

About the feature request, feel free to contribute through a PR! To be transparent, the roadmap is full of functional improvements, and the "progress bar" would never be prioritized; it is perfect for a contribution, though.
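
If you want to sanity-check the HuggingFace path on your own machine before switching, here is a minimal timing sketch; the model name is an assumption and may differ from private-gpt's configured default:

    # Hedged sketch: time a local sentence-transformers embedding model, to compare
    # against the ~2 s per chunk reported for the Ollama embedding endpoint.
    # The model name below is an assumption, not necessarily private-gpt's default.
    import time
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    chunks = ["a short test chunk of text"] * 32
    start = time.perf_counter()
    model.encode(chunks)
    print(f"{(time.perf_counter() - start) / len(chunks) * 1000:.1f} ms per chunk")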

Robinsane commented 4 months ago

For me it's very slow too, and I keep getting the error below after a certain amount of time: (posted at #1723)

    chipgpt | Traceback (most recent call last):
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/gradio/queueing.py", line 495, in call_prediction
    chipgpt |     output = await route_utils.call_process_api(
    chipgpt |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/gradio/route_utils.py", line 235, in call_process_api
    chipgpt |     output = await app.get_blocks().process_api(
    chipgpt |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/gradio/blocks.py", line 1627, in process_api
    chipgpt |     result = await self.call_function(
    chipgpt |              ^^^^^^^^^^^^^^^^^^^^^^^^^
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/gradio/blocks.py", line 1173, in call_function
    chipgpt |     prediction = await anyio.to_thread.run_sync(
    chipgpt |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    chipgpt |     return await get_asynclib().run_sync_in_worker_thread(
    chipgpt |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    chipgpt |     return await future
    chipgpt |            ^^^^^^^^^^^^
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    chipgpt |     result = context.run(func, *args)
    chipgpt |              ^^^^^^^^^^^^^^^^^^^^^^^^
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/gradio/utils.py", line 690, in wrapper
    chipgpt |     response = f(*args, **kwargs)
    chipgpt |                ^^^^^^^^^^^^^^^^^^
    chipgpt |   File "/home/worker/app/private_gpt/ui/ui.py", line 266, in _upload_file
    chipgpt |     self._ingest_service.bulk_ingest([(str(path.name), path) for path in paths])
    chipgpt |   File "/home/worker/app/private_gpt/server/ingest/ingest_service.py", line 84, in bulk_ingest
    chipgpt |     documents = self.ingest_component.bulk_ingest(files)
    chipgpt |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    chipgpt |   File "/home/worker/app/private_gpt/components/ingest/ingest_component.py", line 198, in bulk_ingest
    chipgpt |     return self._save_docs(documents)
    chipgpt |            ^^^^^^^^^^^^^^^^^^^^^^^^^^
    chipgpt |   File "/home/worker/app/private_gpt/components/ingest/ingest_component.py", line 210, in _save_docs
    chipgpt |     self._index.insert_nodes(nodes, show_progress=True)
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 320, in insert_nodes
    chipgpt |     self._insert(nodes, **insert_kwargs)
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 311, in _insert
    chipgpt |     self._add_nodes_to_index(self._index_struct, nodes, **insert_kwargs)
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 233, in _add_nodes_to_index
    chipgpt |     new_ids = self._vector_store.add(nodes_batch, **insert_kwargs)
    chipgpt |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/llama_index/vector_stores/qdrant/base.py", line 254, in add
    chipgpt |     points, ids = self._build_points(nodes)
    chipgpt |                   ^^^^^^^^^^^^^^^^^^^^^^^^^
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/llama_index/vector_stores/qdrant/base.py", line 221, in _build_points
    chipgpt |     vectors.append(node.get_embedding())
    chipgpt |                    ^^^^^^^^^^^^^^^^^^^^
    chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/llama_index/core/schema.py", line 344, in get_embedding
    chipgpt |     raise ValueError("embedding not set.")
    chipgpt | ValueError: embedding not set.

RolT commented 4 months ago

There's an issue with ollama + nomic-embed-text. Fixed but not yet released. Using ollama 0.1.29 fixed the issue for me.

https://github.com/ollama/ollama/issues/3029
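
If you are unsure which Ollama version private-gpt is talking to, a quick check (assuming the /api/version endpoint is available on your build and the default port) is:

    # Hedged sketch: print the version of the locally running Ollama server.
    # Assumes the default port and that /api/version exists on your build.
    import requests

    print(requests.get("http://127.0.0.1:11434/api/version", timeout=5).json().get("version"))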

btonasse commented 3 months ago

+1. It takes ~2 s to generate embeddings for a 4-word phrase.

codespearhead commented 3 months ago

Ollama v0.1.30 has recently been released.

Is this issue still reproducible in that version?

fcarsten commented 2 months ago

I seem to have the same or a very similar problem with "ollama" default settings and running ollama v0.1.32.

The console shows parsing nodes at ~1000 it/s and generating embeddings at ~2 s/it.

The strange thing is that private-gpt/ollama seem to use hardly any of the available resources: CPU < 4%, memory < 50%, GPU < 4% utilisation (1.5/12 GB GPU memory), disk < 1%, etc. on an Intel i7-13700K, 32 GB RAM, RTX 4070.

Example output from the console log:

    [...]
    Generating embeddings: 0it [00:00, ?it/s]
    Parsing nodes: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 998.88it/s]
    Generating embeddings: 100%|████████████████████████████████| 19/19 [00:39<00:00, 2.08s/it]
    Generating embeddings: 0it [00:00, ?it/s]
    Parsing nodes: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 999.83it/s]
    Generating embeddings: 100%|████████████████████████████████| 18/18 [00:37<00:00, 2.10s/it]
    Generating embeddings: 0it [00:00, ?it/s]
    [...]
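
For scale, a rough back-of-the-envelope calculation shows how a ~2 s/it embedding rate turns even a modest document into a very long ingestion; the chunk count below is a hypothetical figure for illustration, not a measurement from this thread:

    # Hedged arithmetic sketch: at ~2 s per embedded chunk (as in the log above),
    # estimate total embedding time. The chunk count is a hypothetical assumption.
    seconds_per_chunk = 2.1
    chunks = 500  # hypothetical chunk count for a multi-page PDF
    print(f"~{seconds_per_chunk * chunks / 60:.0f} minutes just for embeddings")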

stevenlafl commented 2 months ago

Still excruciatingly slow, with it barely hitting the GPU. Embeddings run at ~8 it/s on a 3080. It does use the GPU, I confirmed that much. If I double the number of workers, the it/s performance halves, so there is zero recourse there.

I'm on ollama 0.1.33-rc6, so that patch should already be applied.

zubairahmed-ai commented 1 month ago

Can confirm. Using mxbai-embed-large from HF, even a 1.44 MB file has been ingesting for close to an hour and is still unfinished, while CPU and GPU utilization stay below 50% and 10% respectively, on the latest version of PrivateGPT.

Any fixes @imartinez ?

dougy83 commented 1 month ago

> +1. It takes ~2 s to generate embeddings for a 4-word phrase.

I noticed the same when using the HTTP API and the Python interface. The server says it took <50 ms (CPU), so I'm guessing the problem is in detecting that the response is complete. Setting my request timeout to 100 ms makes each request take 100 ms.

If I use fetch() in nodejs, the response takes <30ms.

I've never used private-gpt, but I'm guessing it's the same problem.

EDIT: The python request is fast if I use http://127.0.0.1 rather than http://localhost
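
A hedged way to reproduce this is to time the same request against both hostnames; the model name and port are assumptions, and one possible explanation is that "localhost" resolves to IPv6 first and the client stalls before falling back to IPv4:

    # Hedged sketch: time one embeddings request against "localhost" vs "127.0.0.1".
    # Assumes a local Ollama on the default port serving nomic-embed-text (assumption).
    import time
    import requests

    for host in ("localhost", "127.0.0.1"):
        start = time.perf_counter()
        requests.post(
            f"http://{host}:11434/api/embeddings",
            json={"model": "nomic-embed-text", "prompt": "a short test phrase"},
            timeout=10,
        )
        print(host, f"{time.perf_counter() - start:.3f} s")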

Castolus commented 3 weeks ago

> cc: @imartinez FEATURE Request: please show a progress bar or a percentage indicating how much has been ingested.

Hi, maybe I'm too late, but I'll post it anyway.

You can get a progress bar in the console by editing ui.py.

Instead of this line (line 345):

    self._ingest_service.bulk_ingest([(str(path.name), path) for path in paths_to_ingest])

put this one:

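    # ingest one file at a time so tqdm can report per-file progress in the console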
    for path in tqdm(paths_to_ingest, desc="Ingesting files"):
        self._ingest_service.bulk_ingest([(str(path.name), path)])

By using tqdm you'll be able to see something like this in the console:

    Ingesting files: 40%|████      | 2/5 [00:38<00:49, 16.44s/it]14:10:07.319 [INFO ] private_gpt.server.ingest.ingest_service - Ingesting

Don't forget to import the library:

    from tqdm import tqdm

I'll probably integrate it into the UI in the future. I have some other features that may be interesting to @imartinez.

Cheers