zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0
53.98k stars 7.26k forks

Bulk Local Ingestion problem. #1226

Open frenchiveruti opened 11 months ago

frenchiveruti commented 11 months ago

Per the docs https://docs.privategpt.dev/#section/Ingesting-and-Managing-Documents:

Bulk Local Ingestion

When you are running PrivateGPT in a fully local setup, you can ingest a complete folder for convenience (containing pdf, text files, etc.) and optionally watch changes on it with the command:

make ingest /path/to/folder -- --watch

To log the processed and failed files to an additional file, use:

make ingest /path/to/folder -- --watch --log-file /path/to/log/file.log

After ingestion is complete, you should be able to chat with your documents by navigating to http://localhost:8001/ and using the option Query documents, or using the completions / chat API.

When I run any of these variations, I get this instead:

llama_new_context_with_model: total VRAM used: 4857.93 MB (model: 4095.05 MB, context: 762.87 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
Traceback (most recent call last):
  File "C:\Users\Fran\privateGPT\scripts\ingest_folder.py", line 80, in <module>
    raise ValueError(f"Path {args.folder} does not exist")
ValueError: Path `arg=C:\\Users\\Fran\\privateGPT\\FilesFeed does not exist
make: *** [Makefile:52: ingest] Error 1

This happens no matter how I phrase the path. I think it's because the argument is being converted to "`arg=(...)", meaning an arg= prefix is being added for no reason.

I'm running PrivateGPT on Windows 10 with anaconda via powershell.

rar8022 commented 11 months ago

It's a "hack job", but it works, if it helps you solve it. I went into ingest_folder.py and edited line 78, changing "path = Path(args.folder)" to "path = Path(args.folder[5:])" to strip out the arg= prefix being added. Then the command make ingest "C:\Users\import" works as intended. I've spent a few hours trying to figure out where the arg= is inserted and I just can't.
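A slightly less brittle variant of this workaround, as a minimal sketch: instead of slicing off a fixed number of characters, strip the stray prefix only when it is actually present. The helper name clean_folder_arg is hypothetical and not part of PrivateGPT; the prefix "`arg=" is taken from the error message earlier in this thread.

```python
from pathlib import Path

# Prefix seen in the error message in this thread (backtick followed by "arg=").
STRAY_PREFIX = "`arg="


def clean_folder_arg(raw: str) -> Path:
    """Hypothetical helper: drop the stray Makefile prefix if present.

    Also trims whitespace, quotes and backticks that may surround the path.
    """
    text = raw.strip()
    if text.startswith(STRAY_PREFIX):
        text = text[len(STRAY_PREFIX):]
    # Remove leftover quoting characters on either side of the path.
    text = text.strip(" \"'`")
    return Path(text)
```

Under that assumption, the edited line in scripts/ingest_folder.py would read path = clean_folder_arg(args.folder), and the script would keep working when the path arrives without the prefix.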

FrangSierra commented 11 months ago

The arg= param comes from the Makefile. However, the problem you are probably facing as a Windows user is that you need to set the argument by name on the command line. Instead of make ingest C:\Users\yourUsers\Documents\bulkFolder\ try make ingest arg=C:\Users\yourUsers\Documents\bulkFolder\

This should solve the trouble caused by the Windows console while keeping the generic solution in the Makefile. For Mac or Linux users the param is inferred.

Give it a try and let me know if that fixes it for you @rar8022 @frenchiveruti

frenchiveruti commented 11 months ago

@FrangSierra Sorry, that did not work.

mferris77 commented 11 months ago

Windows user here with the same issue. I don't have much experience with 'make', but I looked at the Makefile and was able to figure out that it's just running scripts/ingest_folder.py, so I got it to ingest an entire folder by running the following (after activating the venv):

python scripts\ingest_folder.py C:\PATH\TO\PDFS --watch --log-file ingestLog.txt
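The same direct call can be sketched in Python, which avoids shell quoting entirely by passing the arguments as a list. This is an illustration only: build_ingest_command is a hypothetical name, and the only flags used are --watch and --log-file, both quoted from the docs at the top of this thread.

```python
import sys
from typing import List, Optional


def build_ingest_command(
    folder: str,
    watch: bool = False,
    log_file: Optional[str] = None,
    script: str = "scripts/ingest_folder.py",
) -> List[str]:
    """Build the argument list for calling the ingest script without make."""
    # sys.executable is the interpreter of the currently active environment.
    cmd = [sys.executable, script, folder]
    if watch:
        cmd.append("--watch")
    if log_file is not None:
        cmd.extend(["--log-file", log_file])
    return cmd


# To actually run it from the repository root, inside the virtual environment:
#   import subprocess
#   subprocess.run(build_ingest_command("path/to/folder", watch=True), check=True)
```

Because the folder is a single list element, spaces or trailing backslashes in the path are passed through unchanged rather than being reinterpreted by a shell.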

However, after an unfortunate power loss and restarting the ingest, it seems to be re-ingesting the files it previously ingested. I'm not sure if 'doubling up' the content will impact anything or if I should just wipe what's been ingested and start from scratch. Not really related to the topic, but an observation worth mentioning, I suppose.

lucabat commented 11 months ago

I had the same problem on Windows. I tried all the solutions above but none of them worked. What worked for me was to follow @rar8022's approach and also delete the last characters from args.folder. I don't know why this happens, but now it's working. These are my modifications to the file scripts\ingest_folder.py:

path = Path(args.folder[5:-2])
if not path.exists():
    raise ValueError(f"Path {args.folder[5:-2]} does not exist")

And this is the command I used: make ingest "C:\Users\lucabat\Documents\" --

If I added --watch to the command, I had to remove 7 more characters.

frenchiveruti commented 11 months ago

Well, the "not so hack" way is this one:


Ingest documents: 
#Missing docx2txt
conda install -c conda-forge docx2txt
poetry run python .\scripts\ingest_folder.py "D:\IngestDataPGPT"
poetry run python -m uvicorn private_gpt.main:app --reload --port 8001

from: https://gist.github.com/mberman84/9b3c281ae5e3e92b7e946f6a09787cde?permalink_comment_id=4759723#gistcomment-4759723

FrangSierra commented 11 months ago

For those of you who are doing the hack of removing the first 5 characters of the argument: at the top of the Makefile you can find:

# Any args passed to the make script, use with $(call args, default_value)
args = `arg="$(filter-out $@,$(MAKECMDGOALS))" && echo $${arg:-${1}}`

This is what causes the arg to be read as part of the path and not as an argument. It seems that Windows handles this quite differently than OSX. The idea of this line is to provide a generic way to handle one or more parameters passed to make. It's probably also related to your PowerShell version. I'm using 7.0 on Windows 10 Pro. Adding the param name made it work for me, as I shared above.
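One possible reading of where the prefix comes from, offered as an assumption rather than a verified diagnosis: the args line relies on backtick command substitution and ${arg:-default}, which are POSIX shell features. If the shell that make invokes on Windows does not evaluate backticks and treats && as a command separator, the script would receive the literal text before the &&. The sketch below only models that string handling; all three function names are hypothetical.

```python
def expand_args_macro(extra_goals: str, default: str = "") -> str:
    """Text that make substitutes for $(call args, default) in the recipe."""
    return f'`arg="{extra_goals}" && echo ${{arg:-{default}}}`'


def posix_shell_result(extra_goals: str, default: str = "") -> str:
    """What a POSIX shell yields: the extra goals, or the default if empty."""
    return extra_goals if extra_goals else default


def first_argument_without_posix_shell(expanded: str) -> str:
    """Assumed Windows behavior: no backtick evaluation, split at '&&'.

    The double quotes are dropped the way a typical argv parser drops them.
    """
    before_and = expanded.split("&&", 1)[0].strip()
    return before_and.replace('"', "")
```

Under this model the leaked argument starts with a backtick followed by arg=, which is 5 characters and matches the [5:] slice used earlier in the thread. With no folder given, the model yields exactly "`arg=".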

However, for those of you still having trouble: instead of doing the Path(args.folder[5:-2]) hack, you can try to tweak the args expression in the Makefile to work properly on Windows. You can find it here: https://github.com/imartinez/privateGPT/blob/main/Makefile

Anyway, I realized that the ingest target is not passing a default_value, and the generic expression above expects one:

ingest:
    @poetry run python scripts/ingest_folder.py $(call args)

I'm away for a couple of days, but you could try this and see if the behaviour changes:

# Pass a default value, in this case an empty string.
ingest:
    @poetry run python scripts/ingest_folder.py $(call args, "")

Please share anything you discover here, so we can improve the project documentation to cover all these edge cases. Have a nice weekend!

dhsellars commented 10 months ago

Quoting @frenchiveruti: Well, the "not so hack" way is this one:

Ingest documents: 
#Missing docx2txt
conda install -c conda-forge docx2txt
poetry run python .\scripts\ingest_folder.py "D:\IngestDataPGPT"
poetry run python -m uvicorn private_gpt.main:app --reload --port 8001

from: https://gist.github.com/mberman84/9b3c281ae5e3e92b7e946f6a09787cde?permalink_comment_id=4759723#gistcomment-4759723

This is what finally worked for me. I didn't need to install docx2txt first; just running with poetry handled it on my Windows 11 system. THANK YOU!

poetry run python .\scripts\ingest_folder.py "D:\IngestDataPGPT"

gx2g commented 3 months ago

Is this issue ever going to get resolved for Windows? It still does not work.

Traceback (most recent call last):
  File "C:\Project-Alice\private-gpt\scripts\ingest_folder.py", line 98, in <module>
    raise ValueError(f"Path {args.folder} does not exist")
ValueError: Path `arg= does not exist
make: *** [ingest] Error 1

Any feedback would be greatly appreciated.