webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

cdx-indexer UnicodeDecodeError #10 #312

Closed by dportabella 6 years ago

dportabella commented 6 years ago
$ cdx-indexer -r CC-MAIN-2016-36_filtered.cdx CC-MAIN-2016-36_filtered/

Traceback (most recent call last):
  File "/home/david/.local/bin/cdx-indexer", line 11, in <module>
    sys.exit(main())
  File "/home/david/.local/lib/python2.7/site-packages/pywb/indexer/cdxindexer.py", line 454, in main
    minimal=cmd.minimal_cdxj)
  File "/home/david/.local/lib/python2.7/site-packages/pywb/indexer/cdxindexer.py", line 287, in write_multi_cdx_index
    writer.write(entry, filename)
  File "/home/david/.local/lib/python2.7/site-packages/pywb/indexer/cdxindexer.py", line 56, in write
    self.write_cdx_line(self.out, entry, filename)
  File "/home/david/.local/lib/python2.7/site-packages/pywb/indexer/cdxindexer.py", line 126, in write_cdx_line
    out.write(entry['url'])
  File "/usr/lib/python2.7/codecs.py", line 369, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 80: ordinal not in range(128)

ikreymer commented 6 years ago

Sorry for the delay. Can you provide a sample WARC where this happens? Is it a CommonCrawl WARC? Does it only happen in Python 2?

dportabella commented 6 years ago

I have only checked with Python 2:

$ wget "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-36/segments/1471982290442.1/warc/CC-MAIN-20160823195810-00000-ip-10-153-172-175.ec2.internal.warc.gz"

$ cdx-indexer warc.cdx CC-MAIN-20160823195810-00000-ip-10-153-172-175.ec2.internal.warc.gz

dportabella commented 6 years ago

I just tried with Python 3, and it works.

ikreymer commented 6 years ago

Fixed in 2.0.4!
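
For anyone hitting a similar traceback: the message suggests that the URL was a byte string containing non-ASCII bytes (0xc3 is a common lead byte of a two-byte UTF-8 sequence), and that writing it to a codecs stream writer under Python 2 triggered an implicit ASCII decode. The following is a minimal Python 3 sketch of that failure mode and of decoding explicitly before writing. The URL and the helper names `implicit_ascii_decode` and `to_text` are made up for illustration; this is not the actual change made in pywb 2.0.4.

```python
import io

# Hypothetical URL bytes containing UTF-8 encoded "é" (0xc3 0xa9).
# This value is an assumption for illustration, not taken from the WARC above.
raw_url = b"http://example.com/caf\xc3\xa9"


def implicit_ascii_decode(data):
    """Mimic the implicit step Python 2 performs when a byte string is
    handed to a text encoder: it first decodes the bytes as ASCII."""
    return data.decode("ascii")


def to_text(value, encoding="utf-8"):
    """Return text, decoding bytes explicitly instead of relying on ASCII.

    errors="replace" keeps indexing going if a URL is not valid UTF-8.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    return value


# Reproduce the same class of error as in the traceback.
try:
    implicit_ascii_decode(raw_url)
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decode explicitly, then write to a text stream.
out = io.StringIO()
out.write(to_text(raw_url))
written = out.getvalue()

print(error_message)
print(written)
```

In Python 3, bytes and text are separate types and are never mixed implicitly, which is consistent with the same command working there.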