webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

Not compatible with WARC files/records written by ArchiveSpark #131

Open parismic opened 2 years ago

parismic commented 2 years ago

warcio raises warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: response at the second WARC record in a WARC file written with ArchiveSpark. Both projects state that they follow the ISO standard: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf

warcio also returns a warning before the error:

WARNING: Record not followed by newline, perhaps Content-Length is invalid
Offset: 433
Remainder: b'WARC/1.0\r\n'

Either ArchiveSpark should write an additional empty line between the records, or warcio is not in line with the ISO standard.

warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: WARC-Type: response

I'll post this issue on ArchiveSpark as well. Does anyone know more?
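
To narrow down which side is at fault, the separator hypothesis above can be checked without warcio. The WARC 1.0 grammar defines a record as the header, a CRLF, the block, and then two CRLFs, so a record that is immediately followed by the next WARC/1.0 line (as in the Remainder shown in the warning) would be missing that trailer. Below is a minimal sketch, assuming an uncompressed WARC held in memory and a correct Content-Length in every record; the function name find_missing_separators is hypothetical and not part of warcio or ArchiveSpark.

```python
import re

SEPARATOR = b"\r\n\r\n"  # two CRLFs: ends the header and trails each record

# Content-Length is looked up only in the WARC header, never in the payload.
_CONTENT_LENGTH = re.compile(rb"^Content-Length:\s*(\d+)", re.IGNORECASE | re.MULTILINE)


def find_missing_separators(data: bytes) -> list:
    """Return the offsets at which a record block is not followed by CRLF CRLF.

    Assumes `data` is an uncompressed WARC and that each Content-Length is
    accurate; scanning stops at the first record that cannot be parsed.
    """
    missing = []
    pos = 0
    while pos < len(data):
        header_end = data.find(SEPARATOR, pos)
        if header_end == -1:
            break  # no complete header left
        match = _CONTENT_LENGTH.search(data[pos:header_end])
        if match is None:
            break  # header without Content-Length: give up
        block_end = header_end + len(SEPARATOR) + int(match.group(1))
        if data[block_end:block_end + len(SEPARATOR)] == SEPARATOR:
            pos = block_end + len(SEPARATOR)
        else:
            missing.append(block_end)  # next record starts too early
            pos = block_end
    return missing
```

Calling find_missing_separators(open(path, "rb").read()) on an affected file (path being whatever file triggers the error) should return a non-empty list if the records are written back to back; an empty list would suggest the problem lies elsewhere, for example in a wrong Content-Length. Gzip-compressed files would have to be decompressed first.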

wumpus commented 2 years ago

It's a pretty straightforward thing to look at if you send us a link to an actual WARC that has this problem.

slaimon commented 2 months ago

> It's a pretty straightforward thing to look at if you send us a link to an actual WARC that has this problem.

I tried to open a few WARCs using WebRecorder Player and got this exact error message. I don't know whether they were created with ArchiveSpark, but they may be useful for solving the problem. They can be found here.