webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0
1.34k stars 207 forks

Indexing Errors with YouTube JSON in POST Request Payload #869

Open mona-ul opened 8 months ago

mona-ul commented 8 months ago

Describe the bug

A WARC file cannot be indexed with pywb (wb-manager reindex, cdx-indexer) or with cdxj-indexer. All three indexing methods return an error. (“Error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte”)

WARC file: https://storage.googleapis.com/rhizome-hurl/stuttering-by-nathaniel-stern-20230821133816.warc

The WARC record causing the problem seems to be a POST request whose payload contains query data in JSON. Identified WARC Records causing the error:

cdxj-indexer Error Message

cdxj-indexer -p [warc file] > [index file] 
Error parsing: {"context":{"client":{"hl":"en","gl":"US","clientName":1,"clientVersion":"2.20230815.00.00","configInfo": [...]

The error refers to the payload of the Request Record urn:uuid:4bdccb83-e6e4-43d6-887d-51b81d6d1e90. (Full payload attached above)

wb-manager reindex Error Message

wb-manager reindex [collection]
Error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

cdx-indexer Error Message

cdx-indexer -p [WARC file]
[...]
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mona/.local/bin/cdx-indexer", line 8, in <module>
    sys.exit(main())
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/cdxindexer.py", line 468, in main
    write_multi_cdx_index(cmd.output, cmd.inputs,
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/cdxindexer.py", line 306, in write_multi_cdx_index
    for entry in entry_iter:
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 342, in __call__
    for entry in entry_iter:
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 215, in join_request_records
    for entry in entry_iter:
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 188, in create_record_iter
    post_query = MethodQueryCanonicalizer(method,
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/warcserver/inputrequest.py", line 281, in __init__
    sys.stderr.write("Ignoring query, error parsing as json: " + query.decode("utf-8") + "\n")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
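
The traceback shows the exception being raised inside the error-handling path itself, where `query.decode("utf-8")` is called to build the log message. The sketch below reproduces the same message with a stand-in payload and shows one tolerant way to build such a message. It rests on assumptions: the payload here is a hypothetical gzip-compressed body (gzip data starts with the bytes 0x1f 0x8b, which would match "byte 0x8b in position 1"), and `describe_query` is an illustrative helper, not pywb code or a proposed patch.

```python
import gzip

# Hypothetical stand-in for the request body; NOT taken from the WARC file.
payload = gzip.compress(b'{"context": {"client": {"hl": "en"}}}')


def describe_query(query: bytes) -> str:
    """Build a log message without assuming the bytes are valid UTF-8.

    Illustrative helper only; undecodable bytes become U+FFFD.
    """
    text = query.decode("utf-8", errors="replace")
    return "Ignoring query, error parsing as json: " + text


# A strict decode fails on the second byte of the gzip header.
try:
    payload.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = str(exc)

print(strict_error)

# The tolerant variant never raises, so logging cannot abort indexing.
message = describe_query(payload)
print(message[:45])
```

With a tolerant decode the warning would still be printed, but a binary payload would no longer stop the whole indexing run.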

Steps to reproduce the bug

Environment

Additional context

Identification of Error Records

When any of the indexing methods is run, an index.cdxj is partly written. By comparing the entries of the WARC file and the index file in order, the first entry in the WARC file that is missing from the index file was identified as the record causing the problem. To verify this, the record was removed using warcio. After that, the indexing worked.

WARC-Processing with warcio

The WARC file and the identified error records were processed with warcio, and no UTF-8 error occurred.

import sys

from warcio.archiveiterator import ArchiveIterator

# Path to the WARC file, passed as the first command-line argument
warc1_path = sys.argv[1]

with open(warc1_path, 'rb') as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        print(i, record.rec_headers.get_header('WARC-Target-URI'))
        print(i, record.rec_headers.get_header('WARC-Record-ID'))
        if record.rec_type == 'request':
            # Read the request payload and decode it as UTF-8
            content = record.content_stream().read()
            print(content.decode('utf-8'))
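
One possible reason the script above decodes cleanly while the indexers fail, offered as an unverified assumption: warcio's `content_stream()` decompresses a payload that carries a Content-Encoding before returning it, whereas the indexer may be handed the raw bytes. If the request body is gzip-compressed, the raw bytes begin with 0x1f 0x8b and cannot be decoded as UTF-8. The following standard-library sketch shows the idea of checking for the gzip magic number before parsing. `load_json_payload` is a hypothetical helper for illustration, not part of pywb, cdxj-indexer, or warcio.

```python
import gzip
import json

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def load_json_payload(raw: bytes):
    """Parse a JSON request body that may or may not be gzip-compressed.

    Hypothetical helper: assumes the only encoding in play is gzip.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


# Example with made-up data shaped like the start of the reported payload.
body = {"context": {"client": {"hl": "en", "gl": "US"}}}
plain = json.dumps(body).encode("utf-8")
compressed = gzip.compress(plain)

print(load_json_payload(plain) == body)
print(load_json_payload(compressed) == body)
```

Whether the problem record actually carries a gzip Content-Encoding would need to be confirmed by inspecting its HTTP headers.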