zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0

AttributeError: 'NoneType' object has no attribute 'strip' when using a single csv file #412

Closed PierrickLozach closed 1 year ago

PierrickLozach commented 1 year ago

Describe the bug and how to reproduce it

ingest.py fails with a single csv file

Downloading (…)5dded/.gitattributes: 100%|█████████████████████████████████████████████████████████████████████████████████| 1.18k/1.18k [00:00<00:00, 2.31MB/s]
Downloading (…)_Pooling/config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 190/190 [00:00<00:00, 1.01MB/s]
Downloading (…)4d81d5dded/README.md: 100%|█████████████████████████████████████████████████████████████████████████████████| 10.6k/10.6k [00:00<00:00, 8.96MB/s]
Downloading (…)81d5dded/config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 573/573 [00:00<00:00, 1.23MB/s]
Downloading (…)ce_transformers.json: 100%|██████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 288kB/s]
Downloading (…)ded/data_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████| 39.3k/39.3k [00:00<00:00, 13.7MB/s]
Downloading pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 134M/134M [00:12<00:00, 11.0MB/s]
Downloading (…)nce_bert_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████| 53.0/53.0 [00:00<00:00, 131kB/s]
Downloading (…)cial_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 112/112 [00:00<00:00, 26.1kB/s]
Downloading (…)5dded/tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████| 466k/466k [00:00<00:00, 1.61MB/s]
Downloading (…)okenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████| 352/352 [00:00<00:00, 640kB/s]
Downloading (…)dded/train_script.py: 100%|█████████████████████████████████████████████████████████████████████████████████| 13.2k/13.2k [00:00<00:00, 40.5MB/s]
Downloading (…)4d81d5dded/vocab.txt: 100%|███████████████████████████████████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 10.3MB/s]
Downloading (…)1d5dded/modules.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 349/349 [00:00<00:00, 2.31MB/s]
Creating new vectorstore
Loading documents from source_documents
Loading new documents:   0%|                              | 0/1 [00:01<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()[0]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/document_loaders/csv_loader.py", line 48, in load
    content = "\n".join(f"{k.strip()}: {v.strip()}" for k, v in row.items())
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/document_loaders/csv_loader.py", line 48, in <genexpr>
    content = "\n".join(f"{k.strip()}: {v.strip()}" for k, v in row.items())
AttributeError: 'NoneType' object has no attribute 'strip'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 167, in <module>
    main()
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 157, in main
    texts = process_documents()
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 108, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
AttributeError: 'NoneType' object has no attribute 'strip'

my .env:

PERSIST_DIRECTORY=db
MODEL_TYPE=GPT4All
MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L12-v2
MODEL_N_CTX=1000

I cannot share the CSV file, but it is semicolon-separated (;).

haschm commented 1 year ago

Try replacing the CSV loader entry in LOADER_MAPPING in ingest.py with: ".csv": (CSVLoader, {"csv_args": {"delimiter": ";"}})
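To see why the delimiter matters, here is a minimal sketch using only the standard library. It assumes that CSVLoader forwards csv_args as keyword arguments to csv.DictReader (the row.items() call in the traceback is consistent with that, but the thread does not confirm it). SAMPLE and read_rows are made-up names for illustration.

```python
import csv
import io

# Hypothetical two-line file in the same shape as the reporter's data.
SAMPLE = 'question;answer\n"How are recordings decrypted?";"On demand."\n'


def read_rows(text, **csv_args):
    """Parse CSV text into a list of dicts, forwarding csv_args to DictReader."""
    return list(csv.DictReader(io.StringIO(text), **csv_args))


# With the default comma delimiter the header is read as one single column.
default_rows = read_rows(SAMPLE)

# With delimiter=";" the header splits into the two intended columns.
semicolon_rows = read_rows(SAMPLE, delimiter=";")

print(list(default_rows[0].keys()))
print(semicolon_rows[0])
```

Without the delimiter option the whole line ends up in one column named "question;answer", so setting it is necessary for a semicolon-separated file even if it is not sufficient on its own.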

PierrickLozach commented 1 year ago

No change unfortunately. Still getting the same error.

Updated code:

LOADER_MAPPING = {
    ".csv": (CSVLoader, {"csv_args": {"delimiter": ";"}}),

Sample csv (I modified some of the content to remove anything sensitive):

question;answer
"Confirm that user privileges are/can be reviewed for toxic combinations";"Customers control user access, roles and permissions within the \nCloud CX application. The platform will display roles that any user have access\nto and all the permissions for a user can be viewed from the user\nprofile.&nbsp; User permissions are controlled by the roles that are\nassigned.&nbsp; Full detail here: https://link-here"
"Do we use any external cyber intelligence service to gather intelligence on latest vulnerabilities?";"We do use intelligence services and teams are in various industry standard groups where threat knowledge is shared.  We do however not publish details on these."
"How and when are call recordings decrypted.";"When an authenticated and authorised request for the replay or download of a recoridng is recieved the&nbsp;recording file is copied to temporary storage and decrypted on demand and made available."

error:

Creating new vectorstore
Loading documents from source_documents
Loading new documents:   0%|                              | 0/1 [00:01<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/pierrick.lozach/anaconda3/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()[0]
  File "/Users/pierrick.lozach/anaconda3/lib/python3.10/site-packages/langchain/document_loaders/csv_loader.py", line 48, in load
    content = "\n".join(f"{k.strip()}: {v.strip()}" for k, v in row.items())
  File "/Users/pierrick.lozach/anaconda3/lib/python3.10/site-packages/langchain/document_loaders/csv_loader.py", line 48, in <genexpr>
    content = "\n".join(f"{k.strip()}: {v.strip()}" for k, v in row.items())
AttributeError: 'NoneType' object has no attribute 'strip'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 178, in <module>
    main()
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 167, in main
    texts = process_documents()
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 121, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 109, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Users/pierrick.lozach/anaconda3/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
AttributeError: 'NoneType' object has no attribute 'strip'

arnacb commented 1 year ago

Updating to Python 3.11 solved it for me.

PierrickLozach commented 1 year ago

No luck for me.

Python version:

Python 3.11.3

Error:

Creating new vectorstore
Loading documents from source_documents
Loading new documents:   0%|                              | 0/1 [00:01<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/pierrick.lozach/anaconda3/envs/privateGPT/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()[0]
           ^^^^^^^^^^^^^
  File "/Users/pierrick.lozach/anaconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/document_loaders/csv_loader.py", line 48, in load
    content = "\n".join(f"{k.strip()}: {v.strip()}" for k, v in row.items())
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pierrick.lozach/anaconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/document_loaders/csv_loader.py", line 48, in <genexpr>
    content = "\n".join(f"{k.strip()}: {v.strip()}" for k, v in row.items())
                           ^^^^^^^
AttributeError: 'NoneType' object has no attribute 'strip'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 178, in <module>
    main()
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 167, in main
    texts = process_documents()
            ^^^^^^^^^^^^^^^^^^^
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 121, in process_documents
    documents = load_documents(source_directory, ignored_files)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 109, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Users/pierrick.lozach/anaconda3/envs/privateGPT/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
AttributeError: 'NoneType' object has no attribute 'strip'

haschm commented 1 year ago

Strange! Your file (stored as csv) works for me when I use the delimiter option. (Python 3.10.11 btw)

PierrickLozach commented 1 year ago

Thanks for that. I reduced the number of entries in my CSV and it does work. I guess some rows must be malformed. I will work on that.

PierrickLozach commented 1 year ago

FYI, I just ran into this issue again, and it seems to be caused by invalid characters (escaped quotes in my case).

The root cause seems to be CSVLoader itself, as it is referenced in this issue: https://github.com/hwchase17/langchain/issues/2074
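For anyone who wants to reproduce this without langchain, here is a rough sketch with the standard csv module. It assumes the loader iterates over csv.DictReader rows, and the BAD sample is invented, not taken from the original file. A backslash-escaped quote is not understood by the default dialect, so the field is split at the semicolon inside it and the row ends up with more fields than the header. DictReader stores the surplus under the key None, and calling strip() on that key fails. The carets in the Python 3.11 traceback above sit under k.strip(), which suggests the key rather than the value was None.

```python
import csv
import io

# Invented sample: the second field contains backslash-escaped quotes
# and a semicolon, which the default csv dialect does not handle.
BAD = 'question;answer\n"q1";"He said \\"yes; maybe\\" today"\n'

rows = list(csv.DictReader(io.StringIO(BAD), delimiter=";"))
row = rows[0]

# The extra field is collected in a list under the key None.
print(row)

# Same expression as csv_loader.py line 48 in the traceback.
try:
    content = "\n".join(f"{k.strip()}: {v.strip()}" for k, v in row.items())
    error = None
except AttributeError as exc:
    error = str(exc)

print(error)
```

The opposite case, a row with fewer fields than the header, leaves the missing values as None and would fail on v.strip() instead.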

ne0YT commented 1 year ago

@PierrickI3 do you know of a tool I can use to check my docs and see what the issue is?
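No dedicated tool is named in this thread, but a short script can flag the rows that would trip the loader. This is only a sketch under the assumption that the loader reads rows with csv.DictReader: it reports rows that have more fields than the header (key None) or fewer (value None). find_bad_rows and the sample text are hypothetical names and data.

```python
import csv
import io


def find_bad_rows(text, delimiter=";"):
    """Return (line_number, problem) pairs for rows that do not match the header."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    problems = []
    for row in reader:
        # More fields than header columns: the surplus is stored under None.
        if None in row:
            problems.append((reader.line_num, f"extra fields: {row[None]!r}"))
        # Fewer fields than header columns: the missing values are None.
        missing = [k for k, v in row.items() if k is not None and v is None]
        if missing:
            problems.append((reader.line_num, f"missing values for: {missing}"))
    return problems


# Invented sample: line 2 is fine, line 3 has escaped quotes, line 4 is short.
text = (
    'question;answer\n'
    '"ok";"fine"\n'
    '"q";"a \\"b; c\\" d"\n'
    '"only question"\n'
)

problems = find_bad_rows(text)
for line_num, problem in problems:
    print(line_num, problem)
```

To check a real file, read it with open(path, newline="") and pass the contents to find_bad_rows. Note that for a quoted field spanning several lines, the reported number is the last physical line of that row.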