webrecorder / webrecorder-player

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Apache License 2.0

Cannot read WARC created by wget 1.19.4 #57

Status: Closed (joshuaavalon closed this 5 years ago)

joshuaavalon commented 6 years ago

Version: 1.0.9 (64-bit). OS: Windows

warc.sh

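# Browser-like User-Agent string sent with every request
USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
# Host to mirror, passed as the first argument
SAVE_HOST="$1"
# Name the WARC after the host and today's date
DATE=$(date +%Y-%m-%d)
WARC_NAME="$SAVE_HOST-$DATE"

# Mirror the site with its page requisites and record the traffic to a WARC plus a CDX index
wget \
    -e robots=off --mirror --page-requisites \
    --waitretry 5 --timeout 60 --tries 5 --wait 1 \
    --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" \
    -U "$USER_AGENT" "$SAVE_HOST"

sh warc.sh example.com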

I have tested with wget 1.17.1 and wget 1.19.4. The program can only read WARCs created by 1.17.1; it shows a blank page for WARCs created by 1.19.4.

vitorio commented 6 years ago

wget 1.19 writes WARC-Target-URI headers with angle brackets (< and >) around the URL, which breaks some software. Maybe that's what's happening here? If you're able, perhaps you could try rewriting the WARC without those brackets and see if that fixes it?

Here's some example Python code to do that using warcio:

>>> from warcio.archiveiterator import ArchiveIterator
>>> from warcio.warcwriter import WARCWriter
>>> output = open('lfes-not-in-ia-1.warc.gz', 'wb')
>>> writer = WARCWriter(output, gzip=True)
>>> with open('brackets.lfes-not-in-ia-1.warc.gz', 'rb') as stream:
...     for record in ArchiveIterator(stream):
...         if 'WARC-Target-URI' in record.rec_headers:
...             record.rec_headers['WARC-Target-URI'] = record.rec_headers['WARC-Target-URI'].lstrip('<').rstrip('>')
...         writer.write_record(record)
...
>>> output.close()
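For anyone who would rather run this as a script than type it into the interpreter, here is a minimal sketch of the same idea. The function names and the file name fix_warc.py are made up for illustration and are not part of warcio. It assumes warcio is installed (pip install warcio), and unlike the lstrip/rstrip version above it removes only one enclosing pair of angle brackets.

```python
import sys


def strip_angle_brackets(uri):
    """Return uri without one enclosing pair of angle brackets, if present."""
    if uri.startswith('<') and uri.endswith('>'):
        return uri[1:-1]
    return uri


def rewrite_warc(src_path, dst_path):
    """Copy src_path to dst_path, cleaning every WARC-Target-URI header."""
    # Imported here so strip_angle_brackets can be used without warcio installed.
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    with open(src_path, 'rb') as stream, open(dst_path, 'wb') as output:
        writer = WARCWriter(output, gzip=True)
        for record in ArchiveIterator(stream):
            uri = record.rec_headers.get_header('WARC-Target-URI')
            if uri is not None:
                record.rec_headers.replace_header(
                    'WARC-Target-URI', strip_angle_brackets(uri))
            writer.write_record(record)


# Runs only when called as: python fix_warc.py INPUT.warc.gz OUTPUT.warc.gz
if __name__ == '__main__' and len(sys.argv) == 3:
    rewrite_warc(sys.argv[1], sys.argv[2])
```

Saved as fix_warc.py, it could be run as: python fix_warc.py brackets.lfes-not-in-ia-1.warc.gz lfes-not-in-ia-1.warc.gz. The output is written as a new file, so the original WARC is left untouched.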

joshuaavalon commented 6 years ago

@vitorio Yes it works.

nvanderperren commented 5 years ago

I have the same issue. I'm not experienced enough to rewrite the WARC myself. Can I use the Python code, and if so, how do I do this?

ikreymer commented 5 years ago

Good news! We've recently added support for wget 1.19.4 WARCs in our WARC reader library (webrecorder/warcio#42) so these types of WARCs should "just work" without any changes.

The next release of Webrecorder Player will include this fix.

nvanderperren commented 5 years ago

This is great news! Thank you!

ikreymer commented 5 years ago

We've just released 1.6.0 (https://github.com/webrecorder/webrecorderplayer-electron/releases/tag/v1.6.0), and you should now be able to open wget 1.19.4+ WARCs.