webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

Error reading WAT files #102

Closed MohammedElsayyed closed 4 years ago

MohammedElsayyed commented 4 years ago

When I try to use warcio to read WAT files generated by the archive-metadata-extractor tool, I get this error message:

    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: -97518
    Remainder: b'WARC/1.0\r\n'
Traceback (most recent call last):
  File "/home/.local/lib/python3.7/site-packages/warcio/recordloader.py", line 220, in _detect_type_load_headers
    rec_headers = self.warc_parser.parse(stream, statusline)
  File "/home/.local/lib/python3.7/site-packages/warcio/statusandheaders.py", line 264, in parse
    raise StatusAndHeadersParserException(msg, full_statusline)
warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: WARC-Type: metadata

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.vscode/extensions/ms-python.python-2020.1.58038/pythonFiles/ptvsd_launcher.py", line 43, in <module>
    main(ptvsdArgs)
  File "/home/.vscode/extensions/ms-python.python-2020.1.58038/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 432, in main
    run()
  File "/home/.vscode/extensions/ms-python.python-2020.1.58038/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 316, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/usr/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/warcio_reader.py", line 4, in <module>
    for record in ArchiveIterator(stream):
  File "/home/.local/lib/python3.7/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/home/.local/lib/python3.7/site-packages/warcio/archiveiterator.py", line 262, in _next_record
    self.check_digests)
  File "/home/.local/lib/python3.7/site-packages/warcio/recordloader.py", line 88, in parse_record_stream
    known_format))
  File "/home/.local/lib/python3.7/site-packages/warcio/recordloader.py", line 225, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: metadata

This is the code snippet I used to read WAT files:

from warcio.archiveiterator import ArchiveIterator

with open('file.wat.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'metadata':
            print(record.rec_headers.get_header('WARC-Target-URI'))

wumpus commented 4 years ago

That error message is saying that warcio thinks the file is invalid. Is it valid? If you can send me a copy, I'll look at it for you.

MohammedElsayyed commented 4 years ago

https://we.tl/t-Lm5Uc40Wf6

wumpus commented 4 years ago

Looks to me like the metadata records in this WAT are missing a \r\n at the end of the body -- there are supposed to be 2 pairs, and there's only 1. So yes, it's invalid.
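To make the "2 pairs" rule concrete, here is a minimal sketch that checks whether a single uncompressed record is followed by two `\r\n` pairs. The helper `record_has_valid_terminator` and the sample records are hypothetical illustrations, not part of warcio, and the check assumes the `Content-Length` header itself is correct.

```python
import re

# Hypothetical helper, not part of warcio. It assumes `data` holds
# uncompressed WARC bytes and that Content-Length is accurate.
_CONTENT_LENGTH = re.compile(rb"^Content-Length:[ \t]*(\d+)", re.I | re.M)


def record_has_valid_terminator(data: bytes, offset: int = 0) -> bool:
    """Return True if the record at `offset` is followed by CRLF CRLF."""
    # The header block ends at the first blank line (CRLF CRLF).
    header_end = data.index(b"\r\n\r\n", offset) + 4
    match = _CONTENT_LENGTH.search(data, offset, header_end)
    if match is None:
        raise ValueError("record has no Content-Length header")
    # The body is exactly Content-Length bytes long.
    body_end = header_end + int(match.group(1))
    # A conforming record is followed by two CRLF pairs.
    return data[body_end:body_end + 4] == b"\r\n\r\n"


# Made-up sample records for illustration only.
body = b'{"Envelope": {}}'
headers = (
    b"WARC/1.0\r\n"
    b"WARC-Type: metadata\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n\r\n"
)
good = headers + body + b"\r\n\r\n"  # two CRLF pairs after the body
bad = headers + body + b"\r\n"       # only one pair, as in the reported WAT

print(record_has_valid_terminator(good))
print(record_has_valid_terminator(bad))
```

When a record like `bad` is followed directly by the next record, the four bytes after the body are `\r\nWA` rather than `\r\n\r\n`, which is consistent with the `Remainder: b'WARC/1.0\r\n'` warning in the report above.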

This is not that unusual in the WARC community; standards conformance has historically been hit-or-miss.

If this is a common problem, I think warcio ought to tolerate this weirdness.

youssefeldakar commented 4 years ago

Thanks for checking it out.

The WAT is generated by archive-metadata-extractor, so I suppose the options are either to fix archive-metadata-extractor or to add tolerance in warcio. Is there a potential downside to making warcio tolerate the missing \r\n pair?
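As a rough illustration of what "tolerance" could mean, the sketch below walks uncompressed WARC bytes record by record, trusts each `Content-Length`, and re-emits every record with two `\r\n` pairs. This is not how warcio works internally; `repair_terminators` is a hypothetical helper, and the approach assumes that `Content-Length` is correct and that the headers contain no blank line. It also does not re-compress the output per record.

```python
import re

CRLF = b"\r\n"
_CONTENT_LENGTH = re.compile(rb"^Content-Length:[ \t]*(\d+)", re.I | re.M)


def repair_terminators(data: bytes) -> bytes:
    """Hypothetical helper: rewrite records so each ends with CRLF CRLF.

    `data` must be uncompressed WARC bytes with correct Content-Length
    headers. Raises ValueError on a record that cannot be parsed.
    """
    out = []
    pos = 0
    while pos < len(data):
        # The header block ends at the first blank line after `pos`.
        header_end = data.index(CRLF * 2, pos) + 4
        match = _CONTENT_LENGTH.search(data, pos, header_end)
        if match is None:
            raise ValueError(f"no Content-Length in record at offset {pos}")
        body_end = header_end + int(match.group(1))
        # Emit headers and body followed by exactly two CRLF pairs.
        out.append(data[pos:body_end] + CRLF * 2)
        pos = body_end
        # Skip however many CRLF pairs the original writer emitted.
        while data.startswith(CRLF, pos):
            pos += 2
    return b"".join(out)
```

For a per-record gzipped `.wat.gz`, `gzip.decompress(raw)` from the standard library can supply the uncompressed bytes, since it decodes concatenated gzip members. The main downside of this kind of tolerance is that it hides a wrong `Content-Length`: the reader can no longer use the terminator as a sanity check on the declared length.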

https://webarchive.jira.com/wiki/spaces/Iresearch/pages/14057510/archive-metadata-extractor.jar

Alternatively, are there other tools for going from WARC/ARC to WAT besides archive-metadata-extractor?

wumpus commented 4 years ago

Which version of archive-metadata-extractor are you running? If it's the one linked from https://webarchive.jira.com/wiki/spaces/Iresearch/pages/14057510/archive-metadata-extractor.jar, notice the comment at the bottom explaining that a newer version fixes this bug.

youssefeldakar commented 4 years ago

@wumpus Thanks so much again for the help and sorry we overlooked the comment about the bug.

I know it's not directly related to warcio, but we are unsure how to invoke webarchive-commons the same way we used to invoke the archive-metadata-extractor.jar from the command line to generate a WAT and weren't able to find docs on that. Any quick tips?

We appreciate it.

MohammedElsayyed commented 4 years ago

Building webarchive-commons generates 2 jar files under the target directory, as follows:

webarchive-commons-1.1.5-IA.jar webarchive-commons-jar-with-dependencies.jar

When executing

java -jar webarchive-commons-jar-with-dependencies.jar

it throws this error message:

no main manifest attribute, in webarchive-commons-jar-with-dependencies.jar

Any suggestions?
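The "no main manifest attribute" message means the jar's `META-INF/MANIFEST.MF` has no `Main-Class` entry, so `java -jar` does not know which class to start. A jar like that can still be run by naming the entry-point class explicitly, in the general form `java -cp <jar> <fully.qualified.MainClass>`; which class that is for webarchive-commons is not stated in this thread. The sketch below only shows how to inspect a jar's manifest; `jar_main_class` is a hypothetical helper name.

```python
import zipfile


def jar_main_class(jar):
    """Return the Main-Class declared in a jar's manifest, or None.

    `jar` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(jar) as zf:
        try:
            manifest = zf.read("META-INF/MANIFEST.MF").decode("utf-8")
        except KeyError:
            # The jar has no manifest at all.
            return None
    # Manifest lines longer than 72 bytes are wrapped; a continuation
    # line starts with a single space, so unfold those first.
    manifest = manifest.replace("\r\n ", "").replace("\n ", "")
    for line in manifest.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "main-class":
            return value.strip()
    return None
```

Calling `jar_main_class("webarchive-commons-jar-with-dependencies.jar")` on the jar from the build above would be expected to return `None`, given the error message Java reported.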

ikreymer commented 4 years ago

Thanks @wumpus for looking into this. I don't know that this is particularly common, and it's very old code from IA that's generating these WATs. IA might have an updated version of these files; I would recommend checking with them.

I suppose warcio recompress might be able to fix these types of errors, but I haven't really seen this issue in general, so closing for now.