Open unalignedcoder opened 11 months ago
no ideas or suggestions? I can't be the only one who's having this problem...
I still have the same problem.
'charmap' codec can't decode byte 0x9d in position 321: character maps to <undefined>
I upload files via http://localhost:8001/. Some *.txt files are ingested successfully, but other files that contain Unicode characters fail with the error above.
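One plausible cause of the 'charmap' error above (an assumption, since the failing file is not shown): the file is UTF-8, but it is being decoded with a Windows code page such as cp1252, in which byte 0x9d has no mapping. The sketch below reproduces that kind of failure with a temporary sample file; the file name and contents are made up for illustration.

```python
import tempfile
from pathlib import Path

# Sample UTF-8 file containing curly quotes. The closing quote (U+201D)
# is stored as the bytes E2 80 9D in UTF-8.
sample = Path(tempfile.mkdtemp()) / "sample.txt"
original = "He said \u201chello\u201d"
sample.write_bytes(original.encode("utf-8"))

# Decoding as cp1252 fails, because byte 0x9d is unassigned in that code page.
try:
    sample.read_text(encoding="cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# Decoding with the encoding the file was actually written in succeeds.
text = sample.read_text(encoding="utf-8")

print(cp1252_error)
print(text == original)
```

If this is what is happening, passing encoding="utf-8" explicitly to read_text() would avoid relying on the platform default.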
The error traceback indicates that there is a UnicodeDecodeError when trying to read a file using the read_text() method from the pathlib module. The specific error is "'utf-8' codec can't decode byte 0xcf in position 1060: invalid continuation byte."
This error typically occurs when the file contains characters that are not valid UTF-8. To handle this issue, you have a few options:
Specify the Encoding: If you know the encoding of the files you're working with, you can explicitly specify it when reading the file. For example, if the files are encoded in Latin-1, you can modify the _load_file_to_documents function in ingest_helper.py:
return string_reader.load_data([file_data.read_text(encoding='latin-1')])
Replace 'latin-1' with the correct encoding of your files.
Handle Unicode Errors: You can handle Unicode errors by using the errors parameter of the read_text method. For example:
return string_reader.load_data([file_data.read_text(errors='replace')])
The errors='replace' option will replace any undecodable bytes with the Unicode replacement character (U+FFFD). This may not be the best solution, as the replaced characters cannot be recovered.
Choose the option that best fits your use case and the nature of your data. If your files may contain characters from multiple encodings, you might need a more sophisticated approach to handle various encodings appropriately.
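As a rough sketch of such an approach, one could try a short list of candidate encodings in order and keep the first that decodes without error. The helper name read_text_with_fallback and the candidate list are hypothetical, not part of private-gpt; the sample files are created only for the demonstration.

```python
import tempfile
from pathlib import Path

# Hypothetical helper, not part of private-gpt.
def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"none of {encodings} could decode {path}")

# Demo files (made up): one UTF-8, one cp1252, one with a byte that is
# invalid in both UTF-8 and cp1252.
tmp = Path(tempfile.mkdtemp())
(tmp / "utf8.txt").write_bytes("na\u00efve caf\u00e9".encode("utf-8"))
(tmp / "cp1252.txt").write_bytes("caf\u00e9 au lait".encode("cp1252"))
(tmp / "odd.txt").write_bytes(b"abc\x9d")

results = {p.name: read_text_with_fallback(p) for p in sorted(tmp.iterdir())}
for name, (text, encoding) in results.items():
    print(name, encoding, repr(text))
```

Order matters here: UTF-8 is strict, so it goes first, while latin-1 accepts every byte and therefore only makes sense as a last resort. A clean decode does not prove the guess was right, so the result may still need a sanity check.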
Same here, but with HTML pages rather than the TXT files @imamcs19 mentioned. I think there is an issue with auto-detecting text encoding. Not sure if any of these are useful: https://github.com/PyYoshi/cChardet https://github.com/chardet/chardet https://github.com/douban/PyCharlockHolmes https://github.com/sonicdoe/detect-character-encoding https://github.com/CharsetDetector/UTF-unknown
ingest_helper.py can be found in private_gpt > components > ingest
However, about this:
return string_reader.load_data([file_data.read_text(errors='replace')])
I suggest NOT to use this solution. It only makes garbage.
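A small illustration of what errors='replace' does (the sample bytes are made up): text saved as cp1252 and decoded as UTF-8 no longer raises, but the undecodable byte becomes U+FFFD and the original character is gone.

```python
# "café au lait" saved as cp1252; the é is the single byte 0xE9.
raw = "caf\u00e9 au lait".encode("cp1252")

# Decoding with the wrong codec plus errors="replace" hides the failure
# and substitutes U+FFFD for the byte that could not be decoded.
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the text was written in keeps the character.
correct = raw.decode("cp1252")

lost = replaced.count("\ufffd")
print(repr(replaced), repr(correct), lost)
```

With files that contain many non-ASCII characters, the replaced text can end up mostly U+FFFD, which is presumably the garbage described above.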
On Windows 10, Python 3.11.6
Using bulk ingestion, with the command:
poetry run python scripts/ingest_folder.py "folder\path"
I keep getting this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 169: character maps to <undefined>
After this error, ingestion stops. In fact, judging from the Gradio interface, nothing has been ingested at all. I tried different folders, with mixed kinds of files (emails, ebooks...). Sooner or later it encounters a file which breaks the process.
I have seen previous similar issues being addressed by making an encoding-related correction to a file named ingest.py, which however I cannot find anywhere.
Full console error message: