Closed joshuaavalon closed 5 years ago
wget 1.19 writes WARC-Target-URI headers with angle brackets around the URL, which breaks some software. Maybe that's what's happening here? If you're able, perhaps you could try rewriting the WARC without those brackets and see if that fixes it?
Here's some example Python code to do that using warcio:
>>> from warcio.archiveiterator import ArchiveIterator
>>> from warcio.warcwriter import WARCWriter
>>> output = open('lfes-not-in-ia-1.warc.gz', 'wb')
>>> writer = WARCWriter(output, gzip=True)
>>> with open('brackets.lfes-not-in-ia-1.warc.gz', 'rb') as stream:
...     for record in ArchiveIterator(stream):
...         if 'WARC-Target-URI' in record.rec_headers:
...             record.rec_headers['WARC-Target-URI'] = record.rec_headers['WARC-Target-URI'].lstrip('<').rstrip('>')
...         writer.write_record(record)
...
>>> output.close()
@vitorio Yes it works.
I have the same issue. I'm not experienced enough to rewrite the WARC myself. Can I use the Python code above, and if so, how do I run it?
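In case it helps, here is a rough sketch of the same idea as a standalone script, so nothing has to be typed into the interpreter. It assumes warcio is installed (`pip install warcio`). The names `strip_brackets`, `rewrite_warc`, and the file name `fix_warc.py` are made up for this example and are not part of warcio. Unlike the `lstrip`/`rstrip` version, it only removes the brackets when both are present.

```python
def strip_brackets(uri):
    """Remove one pair of enclosing angle brackets, if both are present."""
    if uri.startswith('<') and uri.endswith('>'):
        return uri[1:-1]
    return uri


def rewrite_warc(src_path, dst_path):
    """Copy every record from src_path to dst_path, fixing WARC-Target-URI."""
    # Imported here so strip_brackets() can be used without warcio installed.
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    with open(src_path, 'rb') as stream, open(dst_path, 'wb') as output:
        writer = WARCWriter(output, gzip=True)
        for record in ArchiveIterator(stream):
            uri = record.rec_headers.get_header('WARC-Target-URI')
            if uri is not None:
                # Replace the header value in place before writing the record.
                record.rec_headers.replace_header(
                    'WARC-Target-URI', strip_brackets(uri))
            # Records without a target URI (e.g. warcinfo) are copied as-is.
            writer.write_record(record)
```

To try it, save the block as `fix_warc.py`, add a final line such as `rewrite_warc('brackets.lfes-not-in-ia-1.warc.gz', 'lfes-not-in-ia-1.warc.gz')` with your own file names, and run `python fix_warc.py` from the folder that holds the WARC. The input file is left untouched, so you can compare the two afterwards.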
Good news! We've recently added support for wget 1.19.4 WARCs in our WARC reader library (webrecorder/warcio#42) so these types of WARCs should "just work" without any changes.
The next update release of Webrecorder Player will include this fix.
This is great news! Thank you!
We've just released 1.6.0 (https://github.com/webrecorder/webrecorderplayer-electron/releases/tag/v1.6.0), and you should now be able to open wget 1.19.4+ WARCs.
Version: 1.0.9 (64-bit)
OS: Windows
warc.sh
I have tested with wget 1.17.1 and wget 1.19.4. The program can only read WARCs created by 1.17.1. It shows a blank page for WARCs created by 1.19.4.