While using cdxj-indexer to index a backlog of WARC data I ran into this error when using --post-append:
was@was-dev:~$ cdxj-indexer --sort --post-append /web-archiving-stacks/data/collections/jt898xc8096/fq/567/wq/8955/ARCHIVEIT-5425-MONTHLY-JOB292430-20170430083101595-00035.warc.gz > x
Traceback (most recent call last):
  File "/opt/app/was/.local/bin/cdxj-indexer", line 8, in <module>
    sys.exit(main())
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/main.py", line 477, in main
    write_cdx_index(cmd.output, cmd.inputs, vars(cmd))
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/main.py", line 492, in write_cdx_index
    indexer.process_all()
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/main.py", line 210, in process_all
    super().process_all()
  File "/opt/app/was/.local/lib/python3.8/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/main.py", line 244, in process_one
    for record in wrap_it:
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/bufferiter.py", line 49, in buffering_record_iter
    join_req_resp(req, resp, post_append, url_key_func)
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/bufferiter.py", line 103, in join_req_resp
    method = req.http_headers.protocol
AttributeError: 'NoneType' object has no attribute 'protocol'
I tracked it down to a request record that seems to lack a body, which seems wrong, but probably shouldn't generate an error? These records came from Archive-It.
Maybe a guard against req.http_headers being None here would be helpful in (admittedly obscure) cases like this?
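Something along these lines is what I have in mind. This is only a rough sketch and not the actual cdxj-indexer code: request_method is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real WARC records so the snippet runs on its own. The only thing taken from the traceback is the req.http_headers.protocol lookup.

```python
from types import SimpleNamespace


def request_method(req):
    """Return the HTTP method of a request record, or None when the
    record has no parsed HTTP headers (hypothetical helper)."""
    http_headers = getattr(req, "http_headers", None)
    if http_headers is None:
        # Nothing to read the method from; let the caller skip the
        # post-append handling for this record instead of crashing.
        return None
    # Same attribute lookup as the failing line in the traceback.
    return http_headers.protocol


# Stand-in records (assumptions, not real warcio objects).
ok_req = SimpleNamespace(http_headers=SimpleNamespace(protocol="POST"))
bad_req = SimpleNamespace(http_headers=None)

print(request_method(ok_req))
print(request_method(bad_req))
```

With a check like that, join_req_resp could presumably return early when the method comes back as None, so a single odd record doesn't abort indexing of the whole WARC.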