webrecorder / cdxj-indexer

CDXJ Indexing of WARC/ARCs
Apache License 2.0

AttributeError: 'NoneType' object has no attribute 'protocol' #16

Closed edsu closed 2 years ago

edsu commented 2 years ago

While using cdxj-indexer to index a backlog of WARC data I ran into this error when using --post-append:

was@was-dev:~$ cdxj-indexer --sort --post-append /web-archiving-stacks/data/collections/jt898xc8096/fq/567/wq/8955/ARCHIVEIT-5425-MONTHLY-JOB292430-20170430083101595-00035.warc.gz > x
Traceback (most recent call last):
  File "/opt/app/was/.local/bin/cdxj-indexer", line 8, in <module>
    sys.exit(main())
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/main.py", line 477, in main
    write_cdx_index(cmd.output, cmd.inputs, vars(cmd))
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/main.py", line 492, in write_cdx_index
    indexer.process_all()
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/main.py", line 210, in process_all
    super().process_all()
  File "/opt/app/was/.local/lib/python3.8/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/main.py", line 244, in process_one
    for record in wrap_it:
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/bufferiter.py", line 49, in buffering_record_iter
    join_req_resp(req, resp, post_append, url_key_func)
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/bufferiter.py", line 103, in join_req_resp
    method = req.http_headers.protocol
AttributeError: 'NoneType' object has no attribute 'protocol'

I tracked it down to a request record that has no body (Content-Length: 0). That seems wrong in itself, but it probably shouldn't raise an error? These records came from Archive-It.

WARC/1.0
WARC-Type: request
WARC-Target-URI: https://img1.doubanio.com/icon/u3927203-87.jpg
WARC-Date: 2017-04-30T11:39:19Z
WARC-Concurrent-To: <urn:uuid:4ababcf0-a610-4839-9b3e-57e3f1f056e2>
WARC-Record-ID: <urn:uuid:ef603b53-b29b-412e-89c3-2bac194b9224>
Content-Length: 0
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Block-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ

Maybe a guard against req.http_headers being None in join_req_resp would be helpful in (admittedly obscure) cases like this?
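Something along these lines is what I have in mind. It is only a rough sketch, not a patch against bufferiter.py: request_method is a hypothetical helper name, and the SimpleNamespace objects are stand-ins for parsed records so the snippet runs without a WARC file. The only thing taken from the traceback is that the method is read from req.http_headers.protocol.

```python
from types import SimpleNamespace
from typing import Optional


def request_method(req) -> Optional[str]:
    """Return the HTTP method of a request record, or None when the
    record has no parsed HTTP headers (e.g. a Content-Length: 0 block).

    Hypothetical helper: it only wraps the attribute access that fails
    in the traceback (req.http_headers.protocol).
    """
    http_headers = getattr(req, "http_headers", None)
    if http_headers is None:
        # Nothing to join on; the caller can skip the request/response join.
        return None
    return http_headers.protocol


# Stand-ins for parsed records (assumption: real records expose
# an http_headers attribute that may be None).
empty_req = SimpleNamespace(http_headers=None)
normal_req = SimpleNamespace(http_headers=SimpleNamespace(protocol="GET"))

for name, req in (("empty", empty_req), ("normal", normal_req)):
    method = request_method(req)
    if method is None:
        print(f"{name}: no HTTP headers, skipping post-append handling")
    else:
        print(f"{name}: method is {method}")
```

Whether the right behaviour for such a record is to skip the join quietly or to log a warning is a call for the maintainers; the sketch just avoids the AttributeError.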