webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

Record not followed by newline (conversion error) #140

Open mw0000 opened 2 years ago

mw0000 commented 2 years ago

Hi, how should I deal with this error? I'm trying to convert some really old ARC files for use in SolrWayback.

mw@webarch:~/solrwayback/indexing/warcs1$ warcio recompress test2.arc.gz test2.warc.gz
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 52006972
    Remainder: b'http://www.omega.poznet.pl:80/rekin.html 212.126.5.228 200101211835 text/html 4274\n'
Recompress Failed: test2.arc.gz could not be read as a WARC or ARC

ikreymer commented 2 years ago

Can you share the ARC file that is causing the error? It may be using a format that has not been supported so far.
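
While the file is being shared, it may help to look at the `Remainder` bytes from the warning. They appear to be the header line of the next ARC record, which would fit the warning's hint that the preceding record's declared length does not match its actual body. The following is only a minimal sketch using the Python standard library, not warcio's own parser: `parse_arc_header` and `ArcHeader` are hypothetical names, and the five-field layout (URL, IP address, archive date, content type, length) is assumed from the ARC v1 URL-record format.

```python
from typing import NamedTuple


class ArcHeader(NamedTuple):
    """Hypothetical container for the five fields of an ARC v1 record header."""
    url: str
    ip: str
    date: str
    mime: str
    length: int


def parse_arc_header(line: bytes) -> ArcHeader:
    """Split a space-separated ARC v1 record header line into its fields."""
    parts = line.decode("latin-1").rstrip("\r\n").split(" ")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    url, ip, date, mime, length = parts
    return ArcHeader(url, ip, date, mime, int(length))


# The Remainder value copied from the warning in this issue.
remainder = (
    b"http://www.omega.poznet.pl:80/rekin.html 212.126.5.228 "
    b"200101211835 text/html 4274\n"
)

header = parse_arc_header(remainder)
print(header)
print("timestamp digits:", len(header.date))
```

The line parses cleanly into five fields, so the leftover data looks like a well-formed record header rather than garbage. The timestamp has 12 digits, whereas the ARC v1 format describes a 14-digit `YYYYMMDDhhmmss` date. Whether that has anything to do with the failure is not established here.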