webrecorder / replayweb.page

Serverless replay of web archives directly in the browser
https://replayweb.page
GNU Affero General Public License v3.0

Displaying incorrect url when using a wacz file, works when using a warc.gz #31

Closed cfeddersen closed 3 years ago

cfeddersen commented 3 years ago

Hi there,

First of all, thank you very much for this great tool and for the idea of the WACZ format to simplify the handling of WARC files. I'm not sure if this is the correct project to report this to, so please let me know if I should go elsewhere.

I've created a warc.gz file with grab-site. It's valid according to a check with warcio.

Using the warc.gz (4.3 GB in size) works fine with replayweb.page if you're patient while it loads (using Chrome).

I created a wacz with

wacz create

It loads fine within replayweb.page.

Browsing through it leads to unexpected behaviour: clicking on some links takes you to a different page than the one linked. I crawled a bulletin board, and clicking on a thread from an overview page leads to a different thread.

Example: Overview page contains a link to showflat.php?Number=5970002, but it will take you to showflat.php?Number=1005878.

This is only happening with the wacz file, not with the warc.gz, so I'm pretty confident that it's either something in the wacz creation process, in the wabac.js handling or within replayweb.page.

I've verified that the index.cdx file within the warc does contain an entry for showflat.php?Number=5970002. The index.idx file does not, but I'm not sure if it should.

Looking at the IndexedDB via the Chrome dev console shows 107813 resource entries if I open the warc.gz file, but just 2700 entries for the wacz file.

I'm happy to share the warc file if that helps or any other information that helps to debug this.

ikreymer commented 3 years ago

Thanks for reporting! Would you be able to share the WACZ file?

The WACZ format is designed to load everything on demand. However, the replay system also has a fuzzy matching system that attempts to match inexact responses. That system is necessary to match inexact timestamps or other params.

My guess is that for some reason, the two systems may be conflicting here, and the fuzzy matching is being applied without first pulling in additional data to see if there is an exact match.
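To illustrate the suspected interaction, here is a minimal Python sketch. It is not wabac.js code: `resolve`, `fuzzy_key`, `fetch_more`, and the "ignore the query string" fuzzy rule are all hypothetical stand-ins, chosen only to show how a fuzzy fallback can pick the wrong thread when the exact entry has not been pulled in yet.

```python
from urllib.parse import urlsplit

BASE = "http://forum.herr-der-ringe-film.de/showflat.php"

# Everything the archive's index knows about (illustrative subset).
FULL_INDEX = {
    BASE + "?Number=1005878",
    BASE + "?Number=5970002",
}


def fuzzy_key(url):
    # Hypothetical fuzzy rule: compare host and path only, ignoring the query.
    parts = urlsplit(url)
    return parts.netloc + parts.path


def resolve(url, loaded, fetch_more=None):
    """Return the archived URL that would be served for `url`.

    `loaded` is the set of entries already in memory. `fetch_more` is an
    optional callback that pulls additional index entries on demand.
    """
    if url not in loaded and fetch_more is not None:
        # Look for an exact match in not-yet-loaded data before going fuzzy.
        loaded.update(fetch_more(url))
    if url in loaded:
        return url
    # Fuzzy fallback: any loaded entry with the same fuzzy key will do.
    matches = sorted(u for u in loaded if fuzzy_key(u) == fuzzy_key(url))
    return matches[0] if matches else None


wanted = BASE + "?Number=5970002"
partially_loaded = {BASE + "?Number=1005878"}

# Fuzzy matching applied to partial data: lands on the wrong thread.
fuzzy_only = resolve(wanted, set(partially_loaded))

# Exact lookup against the full index first: lands on the right thread.
exact_first = resolve(
    wanted,
    set(partially_loaded),
    fetch_more=lambda u: {e for e in FULL_INDEX if e == u},
)
```

With only part of the index in memory, `fuzzy_only` resolves to the `Number=1005878` thread, which matches the symptom reported above, while `exact_first` resolves to the requested URL.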

In the case of the WARC, everything is loaded into memory, but this is less than ideal and does not scale well for large archives. I'll see if I can repro this with a smaller data set, but if you can share the WACZ, that would be helpful!

cfeddersen commented 3 years ago

Sure, here you go: https://drive.google.com/file/d/1sYtCxxNno_XCqYN86B2cGi2WFs4E6fxT/view?usp=sharing

Url in my example is http://forum.herr-der-ringe-film.de/showflat.php?Number=5970002&fpart=all

ikreymer commented 3 years ago

@cfeddersen Sorry for the delay. I think this should work now on replayweb.page and in the upcoming release of the app. I believe this was fixed by the latest change in wabac.js: the uppercase characters in the query string caused the URL to be sorted incorrectly due to a missing lowercase conversion.
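For readers wondering how a missing lowercase conversion can hide an entry, here is a small Python sketch of the general failure mode. It is an illustration under assumptions, not the actual wabac.js fix: `make_key` and `has_exact` are hypothetical helpers, and the index is assumed to be a sorted list of lowercased keys searched with binary search.

```python
from bisect import bisect_left


def make_key(url, lowercase=True):
    # Hypothetical key function; the index below stores lowercased keys.
    return url.lower() if lowercase else url


# Sorted index of lowercased keys (illustrative entries only).
index = sorted(
    make_key(u)
    for u in [
        "showflat.php?Number=1005878",
        "showflat.php?Number=5970002",
        "showflat.php?board=1",
        "index.php",
    ]
)


def has_exact(sorted_keys, key):
    # Binary search: only correct if `key` is built the same way as the index.
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key


wanted = "showflat.php?Number=5970002"

# "N" sorts before every lowercase letter, so the search lands on the wrong
# slot and the entry appears to be missing.
found_without_lowercase = has_exact(index, make_key(wanted, lowercase=False))

# With the same lowercase conversion as the index, the entry is found.
found_with_lowercase = has_exact(index, make_key(wanted, lowercase=True))
```

When the exact lookup misses like this, a replay system that falls back to fuzzy matching can serve a neighbouring entry instead of reporting the page as missing.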

You should be able to test locally or even directly from Google Drive. If you have the file loaded already, select 'Purge Cache + Full Reload' just in case.

ikreymer commented 3 years ago

I'm fairly certain this is fixed, so closing for now. Please report back if this is still an issue.

cfeddersen commented 3 years ago

Working fine now, thank you very much!