webrecorder / webrecorder-player

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Apache License 2.0

Is there a means by which one can crawl a self-hosted Warc file? #103

Open deltabravozulu opened 3 years ago

deltabravozulu commented 3 years ago

My general use case is this: I have a personal website that I recorded at one point, but I no longer have the original files. I'd like to rehost my site, but without the old source code I cannot. I'm trying to figure out a way to recover all my links and put everything back in order from the WARC file in my backups, but thus far this has been in vain.

I've found that Webrecorder (not the player) puts WARC files together in such a way that other tools built over the years cannot take them apart (e.g. warc to zip, warcat, or warc-extractor). Each one runs into errors when trying to figure out the indexing of the WARC.

As such, I ran sudo netstat -tulpn | grep -i webrecord, which gave me a host:port of http://127.0.0.1:35535. I found that instead of going through webrecorder-player, I could open the whole site in Chrome by going to http://127.0.0.1:35535/local/collection/http://deltabravozu.lu. Because I can access it in the browser with all links working as they do in webrecorder-player, I figured I should be able to crawl the site and pull down the intact site structure using, say, wget or httrack. Thus far, though, I've been able to crawl nothing more than the first page and a few offsite links encoded in the webrecorder-player server (e.g. https://www.w3.org).

For wget, I used:

wget --force-directories --timestamping --level=inf --no-remove-listing --debug --page-requisites --adjust-extension --convert-links --retry-connrefused --span-hosts --follow-ftp --retry-on-host-error --execute robots=off http://127.0.0.1:35535/local/collection/http://deltabravozu.lu
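To make the idea concrete, here is a rough Python sketch of the kind of crawl I have in mind, using only the standard library. It is a minimal sketch under assumptions, not something verified against the player. It assumes that every page can be requested as the replay prefix plus the original URL (which is what works for the home page above), and it guesses that links rewritten by the server may carry one extra path segment between the prefix and the original URL. All function names (resolve_original, local_path, crawl, http_fetch) are hypothetical helpers of my own.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen

# Values taken from the netstat output and the URL that works in Chrome.
REPLAY_ORIGIN = "http://127.0.0.1:35535"
REPLAY_PATH = "/local/collection/"
REPLAY_PREFIX = REPLAY_ORIGIN + REPLAY_PATH
SITE_HOST = "deltabravozu.lu"

# Assumption: a rewritten link looks like <prefix>[one optional segment/]<original URL>.
EMBEDDED = re.compile(re.escape(REPLAY_PATH) + r"(?:[^/]*/)?(https?://.*)$")


class LinkParser(HTMLParser):
    """Collect every href/src attribute value found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def resolve_original(link, page_original):
    """Map a link found on a replayed page to the original URL it points at.

    Returns None when the target is not on SITE_HOST (offsite, mailto:, etc.).
    """
    link = urldefrag(link.strip()).url
    match = EMBEDDED.search(link)
    # Links that are not replay URLs are resolved against the page's original
    # URL, so relative links stay inside the archived site.
    target = match.group(1) if match else urljoin(page_original, link)
    return target if urlsplit(target).hostname == SITE_HOST else None


def local_path(original):
    """Relative file path under which a fetched original URL could be saved."""
    parts = urlsplit(original)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return parts.hostname + path


def http_fetch(url):
    """Fetch one URL from the local replay server and return the raw bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def crawl(start_original, fetch=http_fetch, limit=500):
    """Breadth-first crawl of SITE_HOST; returns {original_url: body_bytes}."""
    seen = {start_original}
    queue = deque([start_original])
    pages = {}
    while queue and len(pages) < limit:
        original = queue.popleft()
        try:
            body = fetch(REPLAY_PREFIX + original)
        except OSError:
            continue  # skip URLs the replay server cannot serve
        pages[original] = body
        parser = LinkParser()
        parser.feed(body.decode("utf-8", errors="replace"))
        for link in parser.links:
            target = resolve_original(link, original)
            if target and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

Calling crawl("http://deltabravozu.lu/") with the player running would request each page through the replay prefix, and each body could then be written to the path given by local_path. The sketch does not strip anything the player may inject into replayed pages, does not follow links inside CSS or JavaScript, and treats www.deltabravozu.lu as a different host.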

Does anyone have any idea as to how I might more effectively go about my little task?