Closed MohammedElsayyed closed 4 years ago
That error message is saying that warcio thinks the file is invalid. Is it valid? If you can send me a copy, I'll look at it for you.
Looks to me like the metadata records in this WAT are missing a \r\n at the end of the body -- there are supposed to be 2 pairs, and there's only 1. So yes, it's invalid.
This is not that unusual in the WARC community; standards conformance has historically been hit-or-miss.
If this is a common problem, I think warcio ought to tolerate this weirdness.
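For anyone who wants to check their own files, here is a rough sketch of the framing rule described above: after the header block and a body of Content-Length bytes, a record should be followed by two \r\n pairs. The function name has_full_record_terminator is made up for illustration, it is not part of warcio, and it assumes the bytes hold a single uncompressed record.

```python
def has_full_record_terminator(data: bytes) -> bool:
    """Return True if the single uncompressed WARC record in `data`
    is followed by two CRLF pairs after its body.

    Hypothetical helper for illustration only; not a warcio API.
    """
    # The header block ends at the first blank line (CRLF CRLF).
    header_end = data.index(b"\r\n\r\n") + 4  # ValueError if there is no separator
    headers = data[:header_end].decode("utf-8", errors="replace")

    # Find Content-Length, skipping the "WARC/1.0" version line.
    length = None
    for line in headers.split("\r\n")[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-length":
            length = int(value.strip())
            break
    if length is None:
        raise ValueError("no Content-Length header found")

    # The body is exactly `length` bytes; two CRLF pairs should follow it.
    body_end = header_end + length
    return data[body_end:body_end + 4] == b"\r\n\r\n"
```

A record whose body is followed by only one \r\n, as in this WAT, would make the function return False.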
Thanks for checking it out.
The WAT is generated by archive-metadata-extractor, so I suppose the options are to either fix archive-metadata-extractor or add tolerance in warcio. Is there a potential downside to making warcio tolerate the missing \r\n pair?
https://webarchive.jira.com/wiki/spaces/Iresearch/pages/14057510/archive-metadata-extractor.jar
Alternatively, are there other tools for going from WARC/ARC to WAT besides archive-metadata-extractor?
Which version of archive-metadata-extractor are you running? If it's the one linked from https://webarchive.jira.com/wiki/spaces/Iresearch/pages/14057510/archive-metadata-extractor.jar, notice the comment at the bottom explaining that a newer version fixes this bug.
@wumpus Thanks so much again for the help and sorry we overlooked the comment about the bug.
I know it's not directly related to warcio, but we are unsure how to invoke webarchive-commons the same way we used to invoke the archive-metadata-extractor.jar from the command line to generate a WAT and weren't able to find docs on that. Any quick tips?
We appreciate it.
Building webarchive-commons generates 2 jar files under the target directory:
webarchive-commons-1.1.5-IA.jar webarchive-commons-jar-with-dependencies.jar
When executing
java -jar webarchive-commons-jar-with-dependencies.jar
it throws this error message
no main manifest attribute, in webarchive-commons-jar-with-dependencies.jar
Any suggestions?
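As a side note on that error: "no main manifest attribute" means the jar's META-INF/MANIFEST.MF has no Main-Class entry, so java -jar has nothing to launch and the entry-point class would have to be named explicitly with java -cp. A small sketch for confirming this from Python follows; jar_main_class is a hypothetical helper name, and it does not tell you which class in webarchive-commons to run.

```python
import zipfile


def jar_main_class(jar_path):
    """Return the Main-Class value from a jar's manifest, or None if absent.

    Hypothetical helper for illustration; a jar is just a zip archive.
    """
    with zipfile.ZipFile(jar_path) as jar:
        try:
            manifest = jar.read("META-INF/MANIFEST.MF").decode("utf-8")
        except KeyError:
            # No manifest file in the archive at all.
            return None

    # Normalize line endings, then unfold continuation lines
    # (in a manifest they begin with a single space).
    manifest = manifest.replace("\r\n", "\n").replace("\n ", "")
    for line in manifest.split("\n"):
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "main-class":
            return value.strip()
    return None
```

If this returns None for webarchive-commons-jar-with-dependencies.jar, that matches the error above.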
Thanks @wumpus for looking into this. I don't know that this is particularly common, and it's very old code from IA that's generating these WATs. IA might have an updated version of these files; I would recommend checking with them.
I suppose warcio recompress might be able to fix these types of errors, but I haven't really seen this issue in general, so closing for now.
When I try to use warcio to read WAT files generated by the archive-metadata-extractor tool, it gives me this error message:
This is the code snippet I used to read WAT files: