zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0

bad operand type for unary -: 'list' #604

Closed thugib closed 7 months ago

thugib commented 1 year ago

I am running ingest.py. My source_documents folder has 3,255 documents (all PDFs except for 7 .txt files). When I run ingest.py, it seems to get to 46 documents before it fails. I am using Python 3.10.11 on CentOS 7.5.1804 in an anaconda environment.

I was wondering if the code could be written so that the error message identifies the specific document that caused the error, and also names the document responsible for the "Data-loss while decompressing corrupted data" message?

Here is my error message:

Creating new vectorstore
Loading documents from source_documents
Loading new documents:   1%|▏ | 44/3255 [00:04<06:23, 8.38it/s]
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Loading new documents:   1%|▎ | 46/3255 [00:04<05:04, 10.53it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Array/privateGPT/privateGPT-main/ingest.py", line 89, in load_single_document
    return loader.load()[0]
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/pdf.py", line 207, in load
    return list(self.lazy_load())
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/pdf.py", line 214, in lazy_load
    yield from self.parser.parse(blob)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/base.py", line 87, in parse
    return list(self.lazy_parse(blob))
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/parsers/pdf.py", line 35, in lazy_parse
    text = extract_text(pdf_file_obj)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/high_level.py", line 175, in extract_text
    interpreter.process_page(page)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1016, in render_contents
    self.execute(list_value(streams))
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1042, in execute
    func(*args)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 816, in do_TL
    self.textstate.leading = -cast(float, leading)
TypeError: bad operand type for unary -: 'list'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Array/privateGPT/privateGPT-main/ingest.py", line 167, in <module>
    main()
  File "/Array/privateGPT/privateGPT-main/ingest.py", line 157, in main
    texts = process_documents()
  File "/Array/privateGPT/privateGPT-main/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/Array/privateGPT/privateGPT-main/ingest.py", line 108, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
TypeError: bad operand type for unary -: 'list'

melroy89 commented 1 year ago

Seems like a numeric value was expected, but the code received a 'list' instead. The unary - operator can't be applied to a list, which happens here during a float cast.
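For illustration, here is a toy snippet (not repo code) that reproduces the same TypeError pdfminer hits in do_TL when a malformed PDF supplies a list where a number is expected:

leading = [12.0]  # a corrupt PDF operand can come through as a list
try:
    value = -leading  # unary minus is not defined for lists
except TypeError as e:
    print(e)  # prints: bad operand type for unary -: 'list'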

thugib commented 1 year ago

Is there any way I can identify what document caused this error ?

melroy89 commented 1 year ago

Looks like the error is coming from this line: https://github.com/imartinez/privateGPT/blob/main/ingest.py#L89

Comparing this code with your stack trace (return loader.load()[0]) shows a difference at the code level...!? Are you sure you are using the latest main branch code? If not, try upgrading first (a fresh git clone, or git pull).

thugib commented 1 year ago

I cloned with git clone https://github.com/imartinez/privateGPT.git and received the same error, though it seemed to get a couple of documents further. However, I had added a few more documents to the source_directory, so it could still be the same document causing the error.

python ingest.py
Creating new vectorstore
Loading documents from source_documents
Loading new documents:   1%|▎ | 47/3256 [00:04<05:52, 9.11it/s]
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Loading new documents:   1%|▎ | 47/3256 [00:04<05:15, 10.16it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Array/privateGPT/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/pdf.py", line 207, in load
    return list(self.lazy_load())
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/pdf.py", line 214, in lazy_load
    yield from self.parser.parse(blob)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/base.py", line 87, in parse
    return list(self.lazy_parse(blob))
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/parsers/pdf.py", line 35, in lazy_parse
    text = extract_text(pdf_file_obj)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/high_level.py", line 175, in extract_text
    interpreter.process_page(page)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1016, in render_contents
    self.execute(list_value(streams))
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1042, in execute
    func(*args)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 816, in do_TL
    self.textstate.leading = -cast(float, leading)
TypeError: bad operand type for unary -: 'list'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Array/privateGPT/privateGPT/ingest.py", line 166, in <module>
    main()
  File "/Array/privateGPT/privateGPT/ingest.py", line 156, in main
    texts = process_documents()
  File "/Array/privateGPT/privateGPT/ingest.py", line 118, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/Array/privateGPT/privateGPT/ingest.py", line 107, in load_documents
    for i, docs in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
TypeError: bad operand type for unary -: 'list'

melroy89 commented 1 year ago

Yes, so you're now at least on the latest version: the code in the error message now matches the main branch.

Too bad that didn't solve the problem. Looking at the code, it seems you are indeed putting documents/files with a valid file extension into the source_directory directory.

Adding a print statement with the file_path variable on line 88 of ingest.py could at least help you pin-point the document that is causing the issue:

print(file_path)

That should print all your document paths to the console until the moment it crashes. The last document printed to your console would then be the file that is causing issues here...

Some more background info: depending on the file extension, this Python code will load the corresponding loader class (like PDFMinerLoader in the case of a PDF) and then execute its load() method. I assume it actually uses this class, since your Python exception goes through the pdf.py loader file.

EDIT: It could be a bug in the loader class, or maybe in the filtered_files variable. Maybe filtered_files is not of type string[] (an array of strings). Let's see what you get first. You might also want to print this filtered_files data.

To do that, you can add another print statement, but definitely above the for loop. So, around line 103, you can add:

print(filtered_files)

This variable should be a list/array of strings (something like: ['blabla', 'blablabla', 'testtest', ...]).
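Since the pool interleaves output from several workers, another option (just a sketch, this is not in the current ingest.py) is to attach the failing path to the exception itself inside load_single_document, so the RemoteTraceback names the document directly:

def load_single_document(file_path: str) -> List[Document]:
    ext = "." + file_path.rsplit(".", 1)[-1]
    if ext in LOADER_MAPPING:
        loader_class, loader_args = LOADER_MAPPING[ext]
        loader = loader_class(file_path, **loader_args)
        try:
            return loader.load()
        except Exception as exc:
            # Re-raise with the offending path attached, so the parent
            # process reports exactly which document failed.
            raise RuntimeError(f"Failed to load {file_path}") from exc
    raise ValueError(f"Unsupported file extension '{ext}'")

That way the document name survives even when the unordered pool shuffles the printed output.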

thugib commented 1 year ago

Thank you very much for your patience and help. I was getting to about 58 documents of the ~3,200 in my source_directory. By printing out the document name, I have been removing documents from the source_directory when they seem to be the last or next-to-last before the error. I have removed 34 documents so far, and I can now make it to 309 documents, but I am now receiving a different error (TypeError: unsupported operand type(s) for *: 'float' and 'PSLiteral'). I remove the last and next-to-last documents before the error, but I still only get through 309 or so documents before the exception, and different documents are being loaded near where it occurs. This type of behavior occurred previously: ingest.py would always seem to come to a halt after a certain number of documents were loaded. Perhaps there is a document further upstream, before the exception, that is causing the problem. From looking at the code, could it be that documents are loaded in groups of three?
I will paste two separate outputs below, so you can see that the exception occurs near 309-311 documents loaded, but with different documents being loaded near where the exception occurs. (Since 309 documents do get ingested, I will not paste the whole output, only the part closer to the exception.)

I also added a print statement around line 103 of ingest.py, but I am not sure of the correct indentation. Here is where I have it now:

    with Pool(processes=os.cpu_count()) as pool:
        results = []
        with tqdm(total=len(filtered_files), desc='Loading new documents', ncols=80) as pbar:
            for i, docs in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
                results.extend(docs)
                pbar.update()
    print(filtered_files)
    return results

First output below:

...
source_documents/Bhatla2018_Chapter_Auxins.pdf
Loading new documents:   9%|█▌ | 301/3222 [01:39<14:27, 3.37it/s]
source_documents/AtPID Arabidopsis thaliana protein interactome database an integrative platform for plant systems biology NucAcidRes_08.pdf
Loading new documents:   9%|█▌ | 302/3222 [01:39<14:17, 3.41it/s]
source_documents/Efficient strategies for controlled release of nanoencapsulated phytohormones to improve plant stress tolerance.pdf
Loading new documents:   9%|█▌ | 303/3222 [01:39<12:26, 3.91it/s]
source_documents/OmicsARules a R package for integration of multi omics datasets via association rules mining BMCBioinfo_19.pdf
source_documents/WRKY transcription factors evolution binding and action PythopathRes_19.pdf
Loading new documents:   9%|█▌ | 306/3222 [01:40<13:15, 3.67it/s]
source_documents/science_daily_Putting vision into context.pdf
source_documents/ALal2018_Chapter_ConceptsInMetabolism.pdf
source_documents/A Systematic Evaluation of single Cell RNA seq Analysis Pipelines bioRxiv_19.pdf
Loading new documents:  10%|█▋ | 308/3222 [01:40<10:55, 4.45it/s]
source_documents/Global cancer genomics project comes to fruition Nature_20.pdf
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
source_documents/The Arabidopsis lyrata genome sequence and the basis of rapid genome size change NatGenet_11.pdf
Loading new documents:  10%|█▋ | 309/3222 [01:41<15:55, 3.05it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Array/privateGPT/privateGPT/ingest.py", line 90, in load_single_document
    return loader.load()
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/pdf.py", line 207, in load
    return list(self.lazy_load())
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/pdf.py", line 214, in lazy_load
    yield from self.parser.parse(blob)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/base.py", line 87, in parse
    return list(self.lazy_parse(blob))
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/parsers/pdf.py", line 35, in lazy_parse
    text = extract_text(pdf_file_obj)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/high_level.py", line 175, in extract_text
    interpreter.process_page(page)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1016, in render_contents
    self.execute(list_value(streams))
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1042, in execute
    func(*args)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 909, in do_Tj
    self.do_TJ([s])
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 902, in do_TJ
    self.device.render_string(
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfdevice.py", line 116, in render_string
    dxscale = 0.001 * fontsize * scaling
TypeError: unsupported operand type(s) for *: 'float' and 'PSLiteral'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Array/privateGPT/privateGPT/ingest.py", line 167, in <module>
    main()
  File "/Array/privateGPT/privateGPT/ingest.py", line 157, in main
    texts = process_documents()
  File "/Array/privateGPT/privateGPT/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/Array/privateGPT/privateGPT/ingest.py", line 108, in load_documents
    for i, docs in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
TypeError: unsupported operand type(s) for *: 'float' and 'PSLiteral'

A second output follows below:

...
source_documents/pbio_1000320.pdf
Loading new documents:  10%|█▋ | 309/3220 [01:41<06:48, 7.12it/s]
source_documents/PlantTFDB3p0_NAR2014.pdf
source_documents/pcaExplorer an R Bioconductor_BCMBioinfo_2020.pdf
Loading new documents:  10%|█▋ | 311/3220 [01:42<10:26, 4.64it/s]
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
Data-loss while decompressing corrupted data
source_documents/nihms814917.pdf
Loading new documents:  10%|█▋ | 311/3220 [01:42<15:56, 3.04it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Array/privateGPT/privateGPT/ingest.py", line 90, in load_single_document
    return loader.load()
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/pdf.py", line 207, in load
    return list(self.lazy_load())
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/pdf.py", line 214, in lazy_load
    yield from self.parser.parse(blob)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/base.py", line 87, in parse
    return list(self.lazy_parse(blob))
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/parsers/pdf.py", line 35, in lazy_parse
    text = extract_text(pdf_file_obj)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/high_level.py", line 175, in extract_text
    interpreter.process_page(page)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1016, in render_contents
    self.execute(list_value(streams))
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1042, in execute
    func(*args)
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 909, in do_Tj
    self.do_TJ([s])
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 902, in do_TJ
    self.device.render_string(
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pdfminer/pdfdevice.py", line 116, in render_string
    dxscale = 0.001 * fontsize * scaling
TypeError: unsupported operand type(s) for *: 'float' and 'PSLiteral'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Array/privateGPT/privateGPT/ingest.py", line 167, in <module>
    main()
  File "/Array/privateGPT/privateGPT/ingest.py", line 157, in main
    texts = process_documents()
  File "/Array/privateGPT/privateGPT/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/Array/privateGPT/privateGPT/ingest.py", line 108, in load_documents
    for i, docs in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Array/bin/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
TypeError: unsupported operand type(s) for *: 'float' and 'PSLiteral'

melroy89 commented 1 year ago

You are correct: the documents are processed in parallel by this pool.imap_unordered line, so several files are in flight at once and the last path printed is not necessarily the one that failed. Printing filtered_files should give you a list of all your files.
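A standalone toy script (hypothetical, not part of ingest.py) shows why the last path printed under imap_unordered can be misleading:

from multiprocessing import Pool

def work(path: str) -> str:
    print(path)  # a worker prints when it picks up the item
    if path == "bad.pdf":
        raise ValueError(f"cannot parse {path}")
    return path

if __name__ == "__main__":
    files = ["a.pdf", "b.pdf", "bad.pdf", "c.pdf", "d.pdf"]
    with Pool(processes=4) as pool:
        try:
            for _ in pool.imap_unordered(work, files):
                pass
        except ValueError as e:
            # Other workers may already have printed paths after bad.pdf
            # was picked up, so the last printed line is not the culprit.
            print("crashed:", e)

This is exactly why removing the last or next-to-last printed document did not reliably remove the culprit.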

You can try to add a print statement within the load_single_document method instead, to give you document-by-document output:

def load_single_document(file_path: str) -> List[Document]:
    print(file_path) # <-- Like HERE :)
    ext = "." + file_path.rsplit(".", 1)[-1]
    if ext in LOADER_MAPPING:
        loader_class, loader_args = LOADER_MAPPING[ext]
        loader = loader_class(file_path, **loader_args)
        return loader.load()

Ideally, we should turn off this parallel pool, so it's easier for you to pin-point the actual issue and document. For that, we can change pool.imap_unordered to a plain map call:

        results = []
        with tqdm(total=len(filtered_files), desc='Loading new documents', ncols=80) as pbar:
            # See the line below: I just use map() now. This will be slower, of course, but it helps you debug which document is causing issues
            for i, docs in enumerate(map(load_single_document, filtered_files)):
                results.extend(docs)
                pbar.update()

This WILL help you pin-point the exact document that is causing issues, but it will not solve the actual problem yet.
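Once the culprits are known, a possible follow-up (a sketch of an idea, not an accepted fix in this repo) is to skip unreadable documents instead of aborting the whole run, as a drop-in replacement for the loop above:

        results = []
        failed = []
        with tqdm(total=len(filtered_files), desc='Loading new documents', ncols=80) as pbar:
            for file_path in filtered_files:
                try:
                    results.extend(load_single_document(file_path))
                except Exception as exc:
                    # Log and skip corrupt files rather than crashing the ingest.
                    failed.append(file_path)
                    print(f"Skipping {file_path}: {exc}")
                pbar.update()
        if failed:
            print(f"Skipped {len(failed)} unreadable document(s)")

That way one corrupt PDF among thousands no longer kills the entire ingest run, and the skipped files can be inspected or repaired afterwards.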