webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

Not compatible with WARC files/records written by ArchiveSpark #131

Closed. parismic closed this issue 2 months ago

parismic commented 3 years ago

warcio raises warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: response at the second WARC record in a WARC file written with ArchiveSpark. Both projects state that they follow the ISO draft: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf

warcio also returns a warning before the error:

WARNING: Record not followed by newline, perhaps Content-Length is invalid
Offset: 433
Remainder: b'WARC/1.0\r\n'

Either ArchiveSpark should write an additional empty line between the records, or warcio is not in line with the ISO standard.

warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: WARC-Type: response

I'll post this issue on ArchiveSpark as well. Does anyone know more?
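For readers following along, here is a minimal stdlib-only sketch of the record layout in question. It assumes the layout from the WARC specification (header block, blank line, Content-Length bytes of content, then two CRLFs) and reports where that trailing separator is missing. The helper names make_record and missing_separators are hypothetical, and this is not how warcio itself parses records.

```python
CRLF = b"\r\n"


def make_record(warc_type: str, body: bytes, terminate: bool = True) -> bytes:
    """Build a minimal WARC/1.0 record (illustrative; real records carry more headers)."""
    header = CRLF.join([
        b"WARC/1.0",
        b"WARC-Type: " + warc_type.encode("ascii"),
        b"Content-Length: " + str(len(body)).encode("ascii"),
    ]) + CRLF
    # Layout: header block, blank line, content block, then two CRLFs ending the record.
    tail = CRLF + CRLF if terminate else b""
    return header + CRLF + body + tail


def missing_separators(data: bytes) -> list:
    """Return offsets where a content block is not followed by CRLF CRLF."""
    bad, pos = [], 0
    while pos < len(data):
        header_end = data.find(CRLF + CRLF, pos)
        if header_end == -1:
            raise ValueError(f"no end of header block after offset {pos}")
        length = None
        # Skip the version line, then look for Content-Length among the headers.
        for line in data[pos:header_end].split(CRLF)[1:]:
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip().decode("ascii"))
        if length is None:
            raise ValueError(f"no Content-Length in record at offset {pos}")
        pos = header_end + 4 + length  # first byte after the content block
        if data[pos:pos + 4] == CRLF + CRLF:
            pos += 4  # well-formed: step over the record separator
        else:
            bad.append(pos)  # the next record (or EOF) starts too early
    return bad
```

If a writer omits the separator, the bytes found at the reported offset are the next record's WARC/1.0 version line, which resembles the Remainder shown in the warning above.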

wumpus commented 3 years ago

It's a pretty straightforward thing to look at if you send us a link to an actual WARC that has this problem.

slaimon commented 7 months ago

> It's a pretty straightforward thing to look at if you send us a link to an actual WARC that has this problem.

I tried to open a few WARCs using WebRecorder Player and got this exact error message. I don't know whether they were created with ArchiveSpark, but maybe they can help solve the problem. They can be found here.

bohemia420 commented 2 months ago

Yeah, I asked ChatGPT to generate a WARC file for me, to be read by the WarcReader of datatrove (by HF), and I get the same error!! I even tried the saturn.warc from Common Crawl, with no luck!

wumpus commented 2 months ago

@slaimon I downloaded the file anywhere.warc.gz, and both warcio index and warcio extract anywhere.warc.gz 427 (the second record) work fine.

If I do a bad thing and gunzip the warc file, warcio throws the error shown at the top.

@parismic don't gunzip warc files. If you have, use warcio recompress to fix them.
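As background on why the compressed form matters: by convention a .warc.gz is compressed one record at a time, so each record is its own gzip member, and an offset such as the 427 above presumably points into the compressed file. The following stdlib-only sketch illustrates that idea; the function names are hypothetical and this is not warcio's implementation.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def looks_gzipped(data: bytes) -> bool:
    """Cheap check for whether a file still starts with a gzip member."""
    return data[:2] == GZIP_MAGIC


def compress_per_record(records) -> bytes:
    """Compress each record separately and concatenate the gzip members."""
    return b"".join(gzip.compress(record) for record in records)


def member_offsets(data: bytes) -> list:
    """Return the byte offset at which each gzip member starts."""
    offsets, pos = [], 0
    while pos < len(data):
        offsets.append(pos)
        # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
        decomp.decompress(data[pos:])
        if not decomp.eof:
            raise ValueError(f"truncated gzip member at offset {pos}")
        # Whatever zlib did not consume belongs to the following member(s).
        pos = len(data) - len(decomp.unused_data)
    return offsets
```

With a file laid out this way, a reader can seek to any member offset and decompress just that one record. Running gunzip over the file removes those member boundaries, so offsets recorded against the .warc.gz no longer apply.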