webrecorder / webrecorder-player

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Apache License 2.0

Support direct replay of URLs that are not pages (was: Problem playing back WARC with long URLs) #77

Open peterk opened 5 years ago

peterk commented 5 years ago

I have WARC files collected with node-warc 3.1.0 that cannot be opened in Webrecorder Player ("No pages found"). The only distinguishing characteristic is that the files are archived from Facebook posts with long URLs. Other files archived with the same tool seem to work fine.

Listing URLs from the WARC with warcio works. Not sure if this is a bug in Webrecorder player or node-warc. Example file in the related node-warc issue: https://github.com/N0taN3rd/node-warc/issues/25
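For anyone who wants to reproduce the URL listing without extra dependencies, here is a minimal standard-library sketch. It is not warcio's implementation and is less robust than warcio's `ArchiveIterator`. It assumes well-formed records with a `Content-Length` header and no folded header lines, and the function names `iter_warc_headers` and `list_response_urls` are made up for this example.

```python
import gzip


def iter_warc_headers(stream):
    """Yield a dict of lower-cased WARC headers for each record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip the record body; its size is given by Content-Length.
        stream.read(int(headers.get("content-length", 0)))
        yield headers


def list_response_urls(path):
    """Return the WARC-Target-URI of every response record in a WARC file."""
    # gzip.open reads multi-member gzip files, which is how .warc.gz is usually stored.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return [
            h["warc-target-uri"]
            for h in iter_warc_headers(stream)
            if h.get("warc-type") == "response" and "warc-target-uri" in h
        ]
```

Calling `list_response_urls("example.warc.gz")` (a placeholder file name) returns the target URLs in file order, which makes it easy to check whether the long Facebook URLs are present in the archive even when the player reports no pages.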

Version details: webrecorder player 1.6.1 (Mac), webrecorder 4.1.5 (@e926c65), pywb 2.1.1 (@3e0bb49), har2warc 1.0.4, warcio 1.6.2

peterk commented 5 years ago

The same file opens correctly in OpenWayback.

ikreymer commented 5 years ago

To clarify, the issue is not that the file doesn't load; it's related to page detection. To make things easier for the user, when opening non-Webrecorder WARCs, we attempt to 'detect' which URLs are pages, and you are right that long URLs are occasionally rejected. (The other option is for squidwarc to write the page metadata directly, as WR does, and @N0taN3rd and I are looking into that as well.)
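To illustrate the kind of heuristic involved, here is a purely hypothetical sketch of a page detector. The actual rules used by Webrecorder Player are not described in this thread, so the function name `looks_like_page`, the status and MIME checks, and the length cutoff are all assumptions made for illustration only.

```python
from urllib.parse import urlsplit

# Hypothetical cutoff chosen for this example; not taken from Webrecorder.
MAX_PAGE_URL_LENGTH = 2000


def looks_like_page(url, status, mime, max_len=MAX_PAGE_URL_LENGTH):
    """Guess whether a captured response is a top-level page (illustrative only)."""
    if status != 200:
        return False  # redirects and errors are not treated as pages here
    if mime.split(";")[0].strip().lower() != "text/html":
        return False  # images, scripts, JSON, etc. are embedded resources
    if urlsplit(url).scheme not in ("http", "https"):
        return False
    # A length cap like this is one way long URLs could end up rejected.
    return max_len is None or len(url) <= max_len


long_url = "https://www.facebook.com/story.php?" + "x" * 3000
print(looks_like_page("http://example.com/", 200, "text/html; charset=utf-8"))
print(looks_like_page(long_url, 200, "text/html"))
print(looks_like_page(long_url, 200, "text/html", max_len=None))
```

Under these assumptions the long URL is a valid HTML capture but is still filtered out by the length check, which would produce a "No pages found" result even though the record can be replayed if the URL is entered directly.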

OpenWayback does not have any such page detection, but allows you to enter URLs directly. We also need to add support for loading an arbitrary URL that you know, even if it's not detected as a page. We plan to make exploring the WARC easier as well.