Everything I have been trying to ingest caused an encoding error

unalignedcoder commented 11 months ago

On Windows 10, Python 3.11.6

Using bulk ingestion, with the command: poetry run python scripts/ingest_folder.py "folder\path"

I keep getting this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 169: character maps to <undefined>, after which ingestion stops. In fact, judging from the Gradio interface, nothing has been ingested at all.

I tried different folders, with mixed kinds of files (emails, ebooks...). Sooner or later it encounters a file which breaks the process.

I have seen similar earlier issues resolved by an encoding-related correction to a file named ingest.py, which I cannot find anywhere.

Full console error message:

Traceback (most recent call last):
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 41, in <module>
    _recursive_ingest_folder(path)
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 28, in _recursive_ingest_folder
    _recursive_ingest_folder(file_path)
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 28, in _recursive_ingest_folder
    _recursive_ingest_folder(file_path)
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 26, in _recursive_ingest_folder
    _do_ingest(file_path)
  File "I:\privateGPT-0.0.2\scripts\ingest_folder.py", line 34, in _do_ingest
    ingest_service.ingest(changed_path.name, changed_path)
  File "I:\privateGPT-0.0.2\private_gpt\server\ingest\ingest_service.py", line 80, in ingest
    text = file_data.read_text()
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<user>\.pyenv\pyenv-win\versions\3.11.6\Lib\pathlib.py", line 1059, in read_text
    return f.read()
           ^^^^^^^^
  File "C:\Users\<user>\.pyenv\pyenv-win\versions\3.11.6\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 169: character maps to <undefined>
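
For reference, the failure in the traceback can be reproduced without private-gpt. This is only a minimal sketch. It assumes the offending file is really UTF-8 (the thread does not confirm the file's actual encoding) and that read_text() is called without an encoding argument, so Python on Windows falls back to the cp1252 code page. The sample character and the try_decode helper are made up for illustration.

```python
# Assumption: the file is UTF-8. "č" (U+010D) is stored as the bytes C4 8D in UTF-8,
# and byte 0x8d has no mapping in cp1252, which triggers the 'charmap' error.
sample = "č".encode("utf-8")


def try_decode(raw: bytes, encoding: str) -> str:
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc}"


# ascii() keeps the output printable even on a cp1252 console.
print(ascii(try_decode(sample, "cp1252")))  # fails the same way as the traceback
print(ascii(try_decode(sample, "utf-8")))   # decodes cleanly
```

If the same bytes decode cleanly as UTF-8, the problem is the default encoding used for reading, not the file itself.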

unalignedcoder commented 11 months ago

No ideas or suggestions? I can't be the only one who's having this problem...

imamcs19 commented 11 months ago

I still have the same problem: 'charmap' codec can't decode byte 0x9d in position 321: character maps to <undefined>

When I upload files via http://localhost:8001/, some *.txt files are ingested successfully, but other files that contain Unicode characters produce the error above.

doobidoo commented 11 months ago

I have a similar issue. I asked an AI, got the following answer, and modified ingest_helper.py accordingly.

The error traceback indicates that there is a UnicodeDecodeError when trying to read a file using the read_text() method from the pathlib module. The specific error is "'utf-8' codec can't decode byte 0xcf in position 1060: invalid continuation byte."

This error typically occurs when the file contains characters that are not valid UTF-8. To handle this issue, you have a few options:

Specify the Encoding: If you know the encoding of the files you're working with, you can explicitly specify it when reading the file. For example, if the files are encoded in Latin-1, you can modify the _load_file_to_documents function in ingest_helper.py:
```
return string_reader.load_data([file_data.read_text(encoding='latin-1')])
```
Replace 'latin-1' with the correct encoding of your files.

Handle Unicode Errors: You can handle Unicode errors by using the errors parameter of read_text(). For example:
```
return string_reader.load_data([file_data.read_text(errors='replace')])
```
The errors='replace' option replaces any undecodable bytes with the Unicode replacement character (U+FFFD). This may not be the best solution, as it might lead to loss of information.

Choose the option that best fits your use case and the nature of your data. If your files may contain characters from multiple encodings, you might need a more sophisticated approach to handle various encodings appropriately.
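
One way to sketch such a "more sophisticated approach" is to try a short list of encodings in order. The helper names and the candidate list below are hypothetical and not part of private-gpt. The order is an assumption too: UTF-8 first, because it rejects most non-UTF-8 input, then cp1252 as a common Windows fallback.

```python
from pathlib import Path

# Hypothetical candidate list; adjust it to the files you actually have.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252")


def decode_with_fallback(raw: bytes, encodings=CANDIDATE_ENCODINGS) -> str:
    """Try each encoding in order; fall back to latin-1, which accepts any byte."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")


def read_text_with_fallback(path: Path) -> str:
    """Read bytes first so the platform default encoding is never used."""
    return decode_with_fallback(Path(path).read_bytes())
```

Assuming file_data is a pathlib.Path, as the traceback suggests, the read_text() call could be swapped for read_text_with_fallback(file_data). Note that a file decoded with the wrong fallback still yields wrong characters, just without an exception.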

TomLucidor commented 10 months ago

Same here, but with HTML pages (@imamcs19 sees it with TXT files). I think there is an issue with auto-detecting the text encoding. Not sure if any of these are useful: https://github.com/PyYoshi/cChardet https://github.com/chardet/chardet https://github.com/douban/PyCharlockHolmes https://github.com/sonicdoe/detect-character-encoding https://github.com/CharsetDetector/UTF-unknown
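
A rough sketch of what detection could look like with chardet, one of the libraries linked above. The function name guess_and_decode is hypothetical, chardet is a third-party package that has to be installed separately, and nothing here is taken from private-gpt. Strict UTF-8 is tried first because a detector's guess is only a statistical estimate.

```python
try:
    import chardet  # third-party and optional: pip install chardet
except ImportError:
    chardet = None


def guess_and_decode(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding_used) for raw bytes of unknown encoding."""
    # Strict UTF-8 is cheap to check and rarely accepts non-UTF-8 input by accident.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Ask the detector for a guess, if it is installed.
    if chardet is not None:
        guess = chardet.detect(raw).get("encoding")
        if guess:
            try:
                return raw.decode(guess), guess
            except (UnicodeDecodeError, LookupError):
                pass
    # Last resort: latin-1 maps every byte to a character and never fails.
    return raw.decode("latin-1"), "latin-1"
```

Returning the encoding alongside the text makes it possible to log which files needed a guess, so bad detections can be spotted later.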

gthieleb commented 10 months ago

https://dev.to/methane/python-use-utf-8-mode-on-windows-212i
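
The linked article describes Python's UTF-8 mode, which makes open() and read_text() default to UTF-8 instead of the Windows code page. It can be enabled by setting the environment variable PYTHONUTF8=1 (for example `set PYTHONUTF8=1` in cmd before the poetry command) or by running python with `-X utf8`. The snippet below only reports whether the mode is active in the current interpreter; the function name is made up for illustration.

```python
import locale
import sys


def describe_default_encoding() -> dict:
    """Report whether UTF-8 mode is on and which encoding text files default to."""
    return {
        "utf8_mode": bool(sys.flags.utf8_mode),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


print(describe_default_encoding())
```

If utf8_mode is reported as False and the preferred encoding is cp1252, files read without an explicit encoding will be decoded as cp1252, which matches the traceback in the first post.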

urlan commented 3 months ago

ingest_helper.py can be found in private_gpt > components > ingest

However, about this:

return string_reader.load_data([file_data.read_text(errors='replace')])

I suggest NOT using this solution. It only produces garbage.
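
A small illustration of why errors='replace' on its own tends to produce garbage. It assumes the file is actually UTF-8 and that Python is still decoding it with the Windows default, cp1252. The sample string is invented.

```python
original = "“quoted”"          # curly quotes, common in ebooks and emails
raw = original.encode("utf-8")  # assumption: this is how the file is stored

# The wrong codec plus errors="replace" raises no exception but mangles the text.
as_cp1252 = raw.decode("cp1252", errors="replace")

# The right codec needs no error handler at all.
as_utf8 = raw.decode("utf-8")

# ascii() keeps the output printable on any console.
print(ascii(as_cp1252))
print(as_utf8 == original)
```

The exception goes away in the first case, but each curly quote turns into several unrelated characters and the undecodable byte 0x9d becomes U+FFFD, so the ingested text no longer matches the source.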

zylon-ai / private-gpt

Everything I have been trying to ingest caused an encoding error #1153
