parismic closed this issue 2 months ago.
It would be pretty straightforward to look into if you sent us a link to an actual WARC that has this problem.
I tried to open a few WARCs using WebRecorder Player and got this exact error message. I don't know whether they were created with ArchiveSpark, but they may be useful for tracking down the problem. They can be found here.
Yeah, I asked ChatGPT to generate a WARC file for me, to be read by the WarcReader of datatrove (by HF), and I get the same error! I even tried the saturn.warc from Common Crawl, with no luck.
@slaimon I downloaded the file anywhere.warc.gz. Both warcio index and warcio extract anywhere.warc.gz 427 (the second record) work fine.
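For context on why an offset like 427 can point at the second record: in a record-compressed .warc.gz, each record is normally its own gzip member, so, as far as I understand, the offset is a byte position in the compressed file. Here is a minimal stdlib-only sketch that lists where each gzip member starts. It is not how warcio does it internally, and the function name gzip_member_offsets is made up for illustration.

```python
import gzip
import zlib


def gzip_member_offsets(data):
    """Return the byte offset at which each gzip member in `data` starts."""
    offsets = []
    pos = 0
    while pos < len(data):
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        decomp.decompress(data[pos:])
        if not decomp.eof:
            raise ValueError("truncated gzip member at offset %d" % pos)
        offsets.append(pos)
        # Bytes left over after this member belong to the next one.
        pos = len(data) - len(decomp.unused_data)
    return offsets


if __name__ == "__main__":
    # Two stand-in "records", each compressed as a separate gzip member.
    first = gzip.compress(b"stand-in for record 1")
    second = gzip.compress(b"stand-in for record 2")
    print(gzip_member_offsets(first + second))
```

This reads the whole file into memory and re-slices it for every member, so it is only suitable for small test files.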
If I do a bad thing and gunzip the warc file, warcio throws the error shown at the top.
@parismic don't gunzip warc files. If you have, use warcio recompress to fix them.
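If you are not sure whether a given file has already been gunzipped, one quick check is to look for the gzip magic number in the first two bytes. A small sketch follows; the helper name looks_gzipped is hypothetical, and it only inspects the file header, it does not validate the WARC itself.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in files, not real WARCs: one gzipped, one plain.
        gz_path = os.path.join(tmp, "sample.warc.gz")
        plain_path = os.path.join(tmp, "sample.warc")
        with gzip.open(gz_path, "wb") as fh:
            fh.write(b"WARC/1.0\r\n")
        with open(plain_path, "wb") as fh:
            fh.write(b"WARC/1.0\r\n")
        print(gz_path, looks_gzipped(gz_path))
        print(plain_path, looks_gzipped(plain_path))
```

A file named .warc.gz that fails this check has most likely been decompressed at some point and only kept its old name.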
warcio raises
warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: response
at the second WARC record in a WARC file written with ArchiveSpark. Both state that they use the ISO standard: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
warcio also returns a warning before the error:
warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: WARC-Type: response
It could be that ArchiveSpark should write an additional empty line between the records or warcio is not in line with the ISO.
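To check the record layout independently of warcio, here is a rough stdlib-only sketch. It assumes the layout from the WARC specification: a version line, header lines, an empty line, a block of exactly Content-Length bytes, then two CRLFs before the next record. The function name first_boundary_problem is made up for illustration, it only handles uncompressed input, and it does not reproduce warcio's parser.

```python
import io


def first_boundary_problem(stream):
    """Walk an uncompressed WARC stream record by record.

    Returns None if every record follows the expected layout, otherwise a
    (byte_offset, message) tuple describing the first deviation.
    """
    while True:
        start = stream.tell()
        version = stream.readline()
        if not version:
            return None  # clean end of file
        if not version.startswith(b"WARC/"):
            found = version.rstrip(b"\r\n").decode("utf-8", "replace")
            return start, "expected version line, found: " + found
        length = None
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header section
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            return start, "record has no Content-Length header"
        stream.seek(length, io.SEEK_CUR)  # skip the block without reading it
        trailer_at = stream.tell()
        trailer = stream.read(4)
        if trailer != b"\r\n\r\n":
            return trailer_at, "expected CRLF CRLF after block, found: %r" % trailer


if __name__ == "__main__":
    # Hand-built stand-in record; the second copy is missing one CRLF.
    body = b"hello"
    head = b"WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 5\r\n\r\n"
    good = head + body + b"\r\n\r\n"
    short = head + body + b"\r\n"
    print(first_boundary_problem(io.BytesIO(good + good)))
    print(first_boundary_problem(io.BytesIO(short + good)))
```

If this reports a problem at the end of the first record, the writer's separator or Content-Length is the more likely suspect. If it reports nothing, the file layout is probably fine and the reader is the place to look.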
I'll post this issue on ArchiveSpark as well. Does anyone know more?