UnicodeDecodeError: 'charmap' codec can't decode byte `0x8f` when reading archive file

ferrybig commented 1 year ago

Trying to run the program gives the following error:

$ reddit-user-to-sqlite archive C:\Users\fernando\Downloads\export_ferrybig_20230601
loading data found in archive at C:\Users\fernando\Downloads\export_ferrybig_20230601 into reddit.db
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "c:\users\fernando\.local\bin\reddit-user-to-sqlite.exe\__main__.py", line 7, in <module>
  File "C:\Users\fernando\.local\pipx\venvs\reddit-user-to-sqlite\Lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\fernando\.local\pipx\venvs\reddit-user-to-sqlite\Lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\fernando\.local\pipx\venvs\reddit-user-to-sqlite\Lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\fernando\.local\pipx\venvs\reddit-user-to-sqlite\Lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\fernando\.local\pipx\venvs\reddit-user-to-sqlite\Lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\fernando\.local\pipx\venvs\reddit-user-to-sqlite\Lib\site-packages\reddit_user_to_sqlite\cli.py", line 106, in archive
    comment_ids = load_ids_from_file(db, archive_path, "comments")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\fernando\.local\pipx\venvs\reddit-user-to-sqlite\Lib\site-packages\reddit_user_to_sqlite\csv_helpers.py", line 26, in load_ids_from_file
    return [
           ^
  File "C:\Users\fernando\.local\pipx\venvs\reddit-user-to-sqlite\Lib\site-packages\reddit_user_to_sqlite\csv_helpers.py", line 26, in <listcomp>
    return [
           ^
  File "C:\Python311\Lib\csv.py", line 111, in __next__
    row = next(self.reader)
          ^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 3438: character maps to <undefined>

The traceback mentions "comments" and "position 3438". Looking up that position in the comments file with Notepad++ leads to the following line:

e5e21gp,https://www.reddit.com/r/Minecraft/comments/9cstbv/whats_the_difference_between_java_edition_and/e5e21gp/,2018-09-04 18:38:43 UTC,,Minecraft,0,https://www.reddit.com/r/Minecraft/comments/9cstbv/whats_the_difference_between_java_edition_and/,e5d1u04,"The Java version also has cross play between all platforms it runs on, Linux people an play together with Windows people on a server hosted on a Mac device",

The error points to a position in the middle of the subreddit name, the gap between "Mine" and "craft".

xavdid commented 1 year ago

Interesting, I'll take a look. Thanks!

xavdid commented 1 year ago

I was unable to reproduce this after pasting your line. If there's a rogue character in your CSV, it's possible GitHub has stripped it out.

Some brief googling also suggests that this may be a Windows-specific issue, which I wouldn't be able to reproduce anyway.

If it's just that one line giving you trouble, I'd try manually retyping the word Minecraft in your CSV; hopefully the rest then works fine.

The other option would be to make sure I specify a file encoding, but I'm not sure whether that has implications for other users who write non-English content.

ferrybig commented 1 year ago

Running the import on the same file on Linux worked without issues. Maybe Linux and Windows default to different encodings when opening a file.
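For illustration, here is a minimal sketch of how the same bytes can decode cleanly as UTF-8 yet fail under cp1252, the codec named in the traceback. The sample row and the choice of U+200F (an invisible character whose UTF-8 encoding ends in byte 0x8f) are assumptions; the actual character in the archive is not known.

```python
# Hypothetical sample row: U+200F (RIGHT-TO-LEFT MARK) is invisible and its
# UTF-8 encoding is E2 80 8F. The real offending character is not known.
row = "e5e21gp,Mine\u200fcraft,0\n"
raw = row.encode("utf-8")  # the bytes as they would sit on disk


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc}"


utf8_result = try_decode(raw, "utf-8")      # decodes back to the original row
cp1252_result = try_decode(raw, "cp1252")   # byte 0x8f has no mapping in cp1252

print(repr(utf8_result))
print(cp1252_result)
```

Byte 0x8f is one of the few values that Python's cp1252 codec leaves undefined, which would explain why only some files trigger the error.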

Green0Photon commented 1 year ago

I came across this issue myself on Windows. Pretty simple fix, thankfully. (Though I'm annoyed my archive took 16 days to generate, coming to me only after all the deletions. 😔)

In csv_helpers.py, on line 41:

    with open(validate_and_build_path(archive_path, filename), encoding="utf8") as archive_rows:

and on line 50:

    with open(validate_and_build_path(archive_path, "statistics"), encoding="utf8") as stat_rows:

To clarify, the difference is adding `, encoding="utf8"` so that Python uses UTF-8 to open the archive files, because that's the encoding these files are stored in.

I don't know what other OSes do by default (I suppose Python does use UTF-8 on Linux by default), but on Windows it apparently doesn't.

Eh, chances are you might not need that second change, since statistics.csv is all ASCII, but eh. If it were me, I'd change both, because I like the consistency, but it doesn't really matter.
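As a self-contained sketch of the same idea (not the project's actual code): the helper name `load_rows`, the sample file, and its contents below are all hypothetical. It only shows that passing an explicit encoding makes the read independent of the platform default. `"utf8"` and `"utf-8"` are aliases for the same codec.

```python
import csv
import tempfile
from pathlib import Path


def load_rows(path: Path) -> list[dict[str, str]]:
    """Read a CSV file as UTF-8 regardless of the platform's default encoding."""
    # newline="" is what the csv module docs recommend for files given to a reader.
    with open(path, encoding="utf-8", newline="") as rows:
        return list(csv.DictReader(rows))


# Write a small hypothetical archive file containing non-ASCII characters.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "comments.csv"
    sample.write_bytes("id,body\nabc123,caf\u00e9 \U0001F44F\n".encode("utf-8"))
    loaded = load_rows(sample)

print(loaded)
```

The clapping-hands emoji in the sample is encoded as F0 9F 91 8F, so without the explicit encoding this read would hit the same 0x8f failure on a cp1252 system.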

xavdid commented 1 year ago

Thanks for hunting that down! You're totally right that it's different per-platform. The docs mention that specifically. TIL!
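For anyone who wants to check their own machine, a quick sketch (the variable names are just for illustration): `locale.getpreferredencoding(False)` reports the encoding that `open()` falls back to in text mode when no `encoding=` is passed.

```python
import locale
import sys

# Encoding that open() uses in text mode when no encoding= argument is given.
default_encoding = locale.getpreferredencoding(False)

# True when Python's UTF-8 mode is active (python -X utf8 or PYTHONUTF8=1),
# in which case the default above is UTF-8 on every platform.
utf8_mode = bool(sys.flags.utf8_mode)

print(default_encoding, utf8_mode)
```

On a Western European Windows install this typically prints `cp1252`, while most Linux setups report UTF-8.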

I've released 0.4.1, which specifies a default encoding. Sorry for the hassle!

xavdid / reddit-user-to-sqlite

UnicodeDecodeError: 'charmap' codec can't decode byte `0x8f` when reading archive file #10