zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://docs.privategpt.dev
Apache License 2.0

Too many document IDs for one document. #1974

Open · dvcoa opened this issue 3 weeks ago

dvcoa commented 3 weeks ago

Too many document IDs are assigned to a single document, causing it to be treated as multiple documents.

```
(venv) PS C:\Users\asdfasdf\MyAIAsisstant> python .\list_privateGPT_doc_id.py
a8dea6c8-279c-49c0-9c9c-78411ddc9b6e
45e82b06-28d2-483d-89c1-0cb45f507bd3
69858812-355e-42ae-bf23-401c93ed53f9
698f7f9e-5b90-4090-8376-8c82ba3c9e39
0ec0d6ca-d3d7-4f9c-8200-dda745504dd0
8ebf0c06-5271-4100-a350-e1de6f7b622b
5963e620-1587-4be4-84d2-7d4431bfa178
9fdf519c-0b5b-4678-8c98-407c67bc03c9
ddcb6417-0473-487a-b6eb-d3a6a760a399
3f91ad89-e72a-4cdd-a738-efe1ecb0f903
f91bc72a-3ecf-4fee-817a-5718c025651e
1060b0b8-d6f8-4d20-b830-dca5ff1f8174
c8391595-f01c-4e13-91f4-0d7216a08cbe
1275d7aa-b90b-4e4e-92b4-6fc4f75d1eba
c03c9317-ccb4-4d04-bd90-b20b774674b3
a75aa39c-ffe0-486a-9253-c9573602332b
fe341d43-2f34-4581-be37-0537aeb641b3
3d291a62-2e31-4423-8051-6ef4a2ea965b
a27c0171-f210-4672-96bd-00525075e8d7
2bc6b2f4-1d74-4e53-a063-53b60f289e90
bc1a301f-47ef-4c9a-b1b1-92e3688bb2a1
7fd481f9-a293-40e3-acde-f2b7566e7bcd
4f19ae57-46cb-42c0-96b1-e44436337752
e1da4f1b-2cd9-4247-b53b-0f0c98f31e1e
1c9dd1c9-d002-4c21-8aef-a8181119e278
97d592de-a821-4dc2-8063-f009928adf22
521d0219-6de5-4eec-86ab-4dce8dc04234
9327549d-f3b2-4716-982e-ff00b42607c2
c711de0a-9cda-412d-b88d-2f1d92352076
40279fe7-4f86-4909-9a04-73b5c3e5df3f
c24a2fbf-1093-4336-984b-e93404cb7a03
cc3ec721-8b86-4ed2-a96c-c7fae680c621
453e1f3d-cae9-421d-a54e-45c573eb8a94
2e202eb6-bb19-4529-9a3b-03880ef6a398
4e5c12ab-d750-41c5-ab49-fa34081359c0
cb3cf70c-2377-43fe-b834-befe4501640c
2c2d84d9-7dbe-4bdf-b9e6-dfcc0b8eae29
ac4f7108-a203-459b-bf54-2e209b5fd472
4be9bc50-b357-49c6-8b43-c9e4e2de7c6d
54eed172-4a0a-42eb-9b99-59a11eb314fa
3f6f279c-ec9f-4585-9dab-3a65c27f786b
5708b69d-2b53-4ad0-85d3-144e3b97e015
4405f9ac-80a0-463d-8aa7-a106ebf708ee
8a04cae2-fef6-46c4-a355-1c10125979c9
cc361a2d-4ded-45e3-be2c-77838d9ee158
ec431997-0e00-4ffa-9859-4e2d582c78a1
8f162158-12ed-4dd0-b29c-2851effc423c
1b132fad-80d0-4c95-8e7b-96c217295350
d21c8086-5597-4f0b-b508-fedb13225285
80680505-1524-4d77-bca3-f3f74a4fa202
9ff61495-4764-4093-b681-c73d4bbd2c43
c191ffce-e9fb-4a1e-a22e-416d8356db20
baafa992-21c8-4b1e-b520-709c8117b734
3d273b93-53fa-4d44-a9d1-d58a891b158c
70d4e1f9-4ecc-4a27-a27d-74b4085b49db
5c72b1d3-cca3-4d6d-911c-8ef807d4659e
00d63be5-5f88-4470-a57e-838e05fbb3bc
a141fb9c-499e-4805-b2f6-2233ffbd3ff3
9ccf71e5-eb0c-4f62-bb25-b48df4424cf3
b7cb847b-4c9b-4c8f-8400-bc5d6ec59f27
e3e2381b-3415-4bb5-8d13-e5a23a53ebb0
0093c75b-65f8-4d62-869b-5dc7d30a978a
aee861e6-7dd4-4f76-8d27-6156670997a9
0177fdd0-25ba-4578-8fc4-dad2b95864d8
0881deba-3134-44f0-a93d-0989f25899ec
4f9bc816-492d-47c8-93ff-845ea61d14fc
29be2c7e-7ccd-470e-b5c4-ed6b68485e71
ab75a336-8c02-40d7-b68b-57d6f1056ad6
111d5c84-d832-48fe-9596-690b7ace0ba6
269c3ca1-c4b4-45b2-be10-85e93a649ce3
5a9535a3-ed26-4931-a51e-e75327a8b7ef
35e1c00f-cdb4-4bcd-8d43-4fb20e574caf
cc420370-c619-45e0-a89d-f318dc4c6654
075ffaa6-3967-4ce4-860d-47d541939141
cccfcb91-c32e-47ad-baff-ef43c9133b97
b40172f4-8f77-4b3e-bd42-d27d1c7ad62a
cc1b002e-cde2-4d91-8e99-f9cdab663b17
e79ea87f-e57e-42b0-8446-55f5f184c71d
a2feb0a4-61d6-4d9e-8e24-cd4116045990
5a06e85e-8f09-49d2-bfca-cb2c70b9451c
f922cc0c-8884-42e3-a029-a4d6f0e633f0
55d12317-fdb4-48a5-8553-15b287c1fcb0
c9952c38-1bce-4adb-b40e-99b6cad12796
8c3afdb0-5d99-4aa3-b1cc-05f27af90991
6cf423d8-33f8-4ecb-aa84-a2ed07600168
177c0192-722f-4333-aedb-cf3d89517674
e27c49b0-22b6-4e10-96ed-57366e587c24
8b55609c-d798-4d7d-8d82-3a7bdaee8691
008bcc1f-f07b-4f9f-ab53-b8c255a9a401
117f39ab-e7f9-4828-800c-f3ef1bb7c49c
4b7d9630-2ea8-4c84-b7de-9e3fe7033acc
aceff978-f922-45c5-8a78-fc6114a62616
d4c1c01a-dad8-4180-bf81-303fcc04c654
626c34a3-8024-4190-ac6b-51522e82e7d5
f1d661ed-860e-452c-85ef-6066c18679d0
5909591d-44de-44ce-9f18-0a719cf4080f
bc809096-384a-4b17-92cb-62f61ce9a829
d530f1d9-dc7f-459b-b0f6-d432039aa6b5
a5d3a180-fce8-4267-b4b2-033de9e54ca6
3995db4d-40b7-46a9-8ee3-f36908421e43
(venv) PS C:\Users\asdfasdf\MyAIAsisstant>
```

nopmop commented 1 week ago

Are you sure you're listing doc_id and not node_id?

dvcoa commented 1 week ago

> Are you sure you're listing doc_id and not node_id?

Yes. There's no reason it would return node_id; I also tested with curl against the API.
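For reference, the curl-style check can also be done from Python. This is a minimal sketch, not part of the original report; it assumes the server from the setup below at http://localhost:8001 and the GET /v1/ingest/list route that the pgpt_python client wraps:

```python
# Minimal sketch: query the ingestion list endpoint directly,
# bypassing the pgpt_python client. Assumes the PrivateGPT server
# at http://localhost:8001 and its GET /v1/ingest/list endpoint.
import requests

resp = requests.get("http://localhost:8001/v1/ingest/list")
resp.raise_for_status()

# Each entry in "data" describes one ingested document.
for doc in resp.json()["data"]:
    print(doc["doc_id"])
```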

nopmop commented 1 week ago

I checked this from the local_data directory and everything looks fine. Perhaps you should share the source of your Python program (and the details of your setup).

dvcoa commented 1 week ago

> I checked this from the local_data directory and everything looks fine. Perhaps you should share the source of your Python program (and the details of your setup).

Doc link: https://www.woollahra.nsw.gov.au/files/assets/public/v/2/plans-policies-publications/development-control-plans/chapter-b3-general-development-controls.pdf

Code:

```python
from pgpt_python.client import PrivateGPTApi

client = PrivateGPTApi(base_url="http://localhost:8001/")

for doc in client.ingestion.list_ingested().data:
    print(doc.doc_id)
```
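A quick way to see whether all of those IDs belong to a single file is to group them by file name. The following is an illustrative sketch, not part of the original report; it assumes each ingested document's doc_metadata is a dict carrying a file_name key, which is how PrivateGPT normally tags ingested chunks:

```python
# Sketch: count how many doc_ids each ingested file produced.
# Assumes IngestedDoc.doc_metadata is a dict containing "file_name".
from collections import Counter

from pgpt_python.client import PrivateGPTApi

client = PrivateGPTApi(base_url="http://localhost:8001/")

counts = Counter(
    (doc.doc_metadata or {}).get("file_name", "<unknown>")
    for doc in client.ingestion.list_ingested().data
)
for file_name, n in counts.most_common():
    print(f"{file_name}: {n} doc_ids")
```

If one PDF shows dozens of doc_ids here, that matches the splitting behaviour discussed at the end of this thread.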

dvcoa commented 1 week ago

Details of my setup:

```
Using python3 (3.11.9)
20:37:41.862 [INFO    ] private_gpt.settings.settings_loader - Starting application with profiles=['default', 'local']
tokenizer_config.json: 100%|██████████| 1.47k/1.47k [00:00<?, ?B/s]
20:38:02.007 [WARNING ] py.warnings - C:\Users\baehw\AppData\Local\pypoetry\Cache\virtualenvs\private-gpt-0t2nuzJx-py3.11\Lib\site-packages\huggingface_hub\file_download.py:157: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\baehw\privateGPT\private-gpt\models\cache\models--mistralai--Mistral-7B-Instruct-v0.2. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the HF_HUB_DISABLE_SYMLINKS_WARNING environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations. To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
tokenizer.model: 100%|██████████| 493k/493k [00:01<00:00, 281kB/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:04<00:00, 403kB/s]
special_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<?, ?B/s]
20:38:10.979 [INFO    ] private_gpt.components.llm.llm_component - Initializing the LLM in mode=llamacpp
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\baehw\privateGPT\private-gpt\models\mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = llama
llama_model_loader: - kv   1: general.name str = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2: llama.context_length u32 = 32768
llama_model_loader: - kv   3: llama.embedding_length u32 = 4096
llama_model_loader: - kv   4: llama.block_count u32 = 32
llama_model_loader: - kv   5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv   6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv   7: llama.attention.head_count u32 = 32
llama_model_loader: - kv   8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv   9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv  10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv  11: general.file_type u32 = 15
llama_model_loader: - kv  12: tokenizer.ggml.model str = llama
llama_model_loader: - kv  13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv  14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv  17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv  18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv  19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv  20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv  21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv  22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23: general.quantization_version u32 = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token        = 1 ''
llm_load_print_meta: EOS token        = 2 ''
llm_load_print_meta: UNK token        = 0 ''
llm_load_print_meta: PAD token        = 0 ''
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CUDA0 buffer size = 4095.05 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 3900
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 487.50 MiB
llama_new_context_with_model: KV self size = 487.50 MiB, K (f16): 243.75 MiB, V (f16): 243.75 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 16.65 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 283.37 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 2
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'general.name': 'mistralai_mistral-7b-instruct-v0.2', 'general.architecture': 'llama', 'llama.context_length': '32768', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'llama.feed_forward_length': '14336', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '15', 'llama.attention.head_count_kv': '8', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.freq_base': '1000000.000000', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}"}
Guessed chat format: mistral-instruct
20:38:20.841 [INFO    ] private_gpt.components.embedding.embedding_component - Initializing the embedding model in mode=huggingface
20:38:22.808 [INFO    ] llama_index.core.indices.loading - Loading all indices.
20:38:23.085 [INFO    ] private_gpt.ui.ui - Mounting the gradio UI, at path=/
20:38:23.226 [INFO    ] uvicorn.error - Started server process [17992]
20:38:23.226 [INFO    ] uvicorn.error - Waiting for application startup.
20:38:23.227 [INFO    ] uvicorn.error - Application startup complete.
20:38:23.228 [INFO    ] uvicorn.error - Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
```

nopmop commented 1 week ago

I verified this. It happens when you re-ingest the same files.

dvcoa commented 1 week ago

> I verified this. It happens when you re-ingest the same files.

I tried ingesting a new file after removing the local_data folder, but the result is the same.

When I ingest a new file, it looks like this:

```
00:40:19.376 [INFO    ] private_gpt.server.ingest.ingest_service - Ingesting file_names=['Alian Interview.pdf']
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 67.35it/s]
Generating embeddings: 100%|██████████| 1/1 [00:03<00:00,  3.30s/it]
Generating embeddings: 0it [00:00, ?it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 999.60it/s]
Generating embeddings: 100%|██████████| 3/3 [00:00<00:00, 17.15it/s]
Generating embeddings: 0it [00:00, ?it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 500.39it/s]
Generating embeddings: 100%|██████████| 15/15 [00:01<00:00, 10.61it/s]
Generating embeddings: 0it [00:00, ?it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 993.20it/s]
Generating embeddings: 100%|██████████| 2/2 [00:01<00:00,  1.21it/s]
Generating embeddings: 0it [00:00, ?it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<?, ?it/s]
Generating embeddings: 100%|██████████| 2/2 [00:00<00:00, 23.66it/s]
Generating embeddings: 0it [00:00, ?it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 1000.55it/s]
Generating embeddings: 100%|██████████| 2/2 [00:00<00:00, 21.85it/s]
Generating embeddings: 0it [00:00, ?it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 1001.74it/s]
Generating embeddings: 100%|██████████| 8/8 [00:00<00:00, 15.41it/s]
Generating embeddings: 0it [00:00, ?it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 996.27it/s]
Generating embeddings: 100%|██████████| 13/13 [00:00<00:00, 18.46it/s]
Generating embeddings: 0it [00:00, ?it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<?, ?it/s]
Generating embeddings: 100%|██████████| 4/4 [00:00<00:00, 23.80it/s]
Generating embeddings: 0it [00:00, ?it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 499.38it/s]
Generating embeddings: 100%|██████████| 20/20 [00:01<00:00, 18.85it/s]
Generating embeddings: 0it [00:00, ?it/s]
```

nopmop commented 1 week ago

There is a known problem with ingestion creating duplicates when the same files are re-ingested. However, in your case it's not a bug, it's a feature; that's just how ingestion works: https://github.com/zylon-ai/private-gpt/blob/c7212ac7cc891f9e3c713cc206ae9807c5dfdeb6/private_gpt/components/ingest/ingest_helper.py#L75. Each file is transformed into multiple documents before ingestion, which yields better retrieval results.
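A practical consequence of this design: one file legitimately maps to many doc_ids (with the default llama-index PDF reader, roughly one document per page), so any per-file operation has to cover all of them. The sketch below is an illustration rather than documented usage from this thread; it assumes pgpt_python's ingestion.delete_ingested(doc_id) method and a file_name key in doc_metadata:

```python
# Sketch: remove every ingested document that came from one file.
# Assumes ingestion.delete_ingested(doc_id) exists in pgpt_python and
# that doc_metadata carries the originating "file_name".
from pgpt_python.client import PrivateGPTApi

client = PrivateGPTApi(base_url="http://localhost:8001/")
target = "chapter-b3-general-development-controls.pdf"  # the file from this thread

for doc in client.ingestion.list_ingested().data:
    if (doc.doc_metadata or {}).get("file_name") == target:
        client.ingestion.delete_ingested(doc.doc_id)
        print(f"deleted {doc.doc_id}")
```

In other words, treat the doc_id as the ID of one ingested chunk-group, not of the whole file; the file is identified by its metadata.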