webrecorder / replayweb.page

Serverless replay of web archives directly in the browser
https://replayweb.page
GNU Affero General Public License v3.0

"No pages are defined in this archive" + no URLs #22

Closed nvanderperren closed 3 years ago

nvanderperren commented 4 years ago

Hi,

A few days ago I created a WARC file with Heritrix. Webrecorder Player discovers around 10,000 pages in it; replayweb.page finds 0. There certainly are pages and URLs in that WARC file. Is this a bug? Or is there maybe a dependency for the app that I should have installed first?

ikreymer commented 4 years ago

This is expected for pages, since replayweb.page has no page detection (which was always a bit experimental, since other tools don't have a concept of pages, only URLs). Page detection was a workaround, since WR Player did not have a way to load URLs. However, the URL search in replayweb.page should allow searching for HTML as well as all other types of resources.

The plan is for 'page detection' to be part of the WACZ format, with detection happening optionally. Are you able to search by URL from the URL tab, or is that also blank? If it is blank, something is likely wrong. Would you be able to share the WARC file in question?
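In the meantime, one way to confirm that a WARC really contains replayable URLs is to scan its records outside the browser. The sketch below is only an illustration and is not how replayweb.page or WR Player index archives: the function names (`open_warc`, `iter_records`, `summarize`) are hypothetical, it uses only the Python standard library, and the "page" heuristic (an HTTP 200 response with a `text/html` content type) is an assumption made for this example.

```python
import gzip
import io
import sys


def open_warc(path):
    """Open a WARC file, transparently handling gzip (.warc.gz)."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    return gzip.open(path, "rb") if magic == b"\x1f\x8b" else open(path, "rb")


def iter_records(stream):
    """Yield (headers, body) for each WARC record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        body = stream.read(int(headers.get("content-length", "0")))
        yield headers, body


def looks_like_page(body):
    """Assumed heuristic: HTTP status 200 and a text/html content type."""
    head = body.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    parts = lines[0].split()
    if len(parts) < 2 or parts[1] != "200":
        return False
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-type":
            return value.strip().lower().startswith("text/html")
    return False


def summarize(stream):
    """Return (all response URLs, URLs that look like HTML pages)."""
    urls, pages = [], []
    for headers, body in iter_records(stream):
        if headers.get("warc-type") != "response":
            continue
        url = headers.get("warc-target-uri", "").strip("<>")
        urls.append(url)
        if looks_like_page(body):
            pages.append(url)
    return urls, pages


if __name__ == "__main__" and len(sys.argv) > 1:
    with open_warc(sys.argv[1]) as warc:
        found_urls, found_pages = summarize(warc)
    print(f"{len(found_urls)} response URLs, {len(found_pages)} look like HTML pages")
```

Running the script with the path to the WARC as its only argument prints the two counts. If it reports response URLs while the URL tab stays blank, that would point to an indexing problem rather than an empty archive.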

nvanderperren commented 4 years ago

It's also blank if I search by URL. I can share the WARC file if you let me know how I can send it to you.

nvanderperren commented 4 years ago

Further investigation 🕵️‍♀️

I don't have this problem with WARCs created with Browsertrix, Webrecorder Desktop, Squidwarc and Brozzler.

ikreymer commented 3 years ago

Tested the WARC shared in #23: the URLs are now showing up (there are no pages in this WARC), and I think the related indexing issues have been fixed.