webrecorder / archiveweb.page

A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!
https://chrome.google.com/webstore/detail/webrecorder/fpeoodllldobpkbkabpblcfaogecpndd
GNU Affero General Public License v3.0

Ensure page loaded from POST request has correct page info (was: WACZ file replays in ArchiveWebPage but not ReplayWebPage) #242

Closed edsu closed 1 month ago

edsu commented 1 month ago

ReplayWeb.page Version

v2.1.1 (AWP 0.12.4)

What did you expect to happen? What happened instead?

I created a small archive using ArchiveWebPage v0.12.4 of this page:

https://rescarta.lapl.org/ResCarta-Web/jsp/RcWebImageViewer.jsp?doc_id=040428be-8b21-4de1-9b1e-3421068c0f1c/cl000000/20190731/00000011

I exported the archive as a WACZ and then imported into ReplayWebPage and the page does not replay. It reports Archived Page Not Found.

I then deleted the archive in ArchiveWebPage and imported the WACZ into ArchiveWebPage and the page replays ok.

Step-by-step reproduction instructions

Using Chrome, go to https://replayweb.page and load this WACZ file by URL:

https://edsu-webarchives.s3.amazonaws.com/tmp/lapl.wacz

If you try to replay the one page, you will see "Archived Page Not Found".

Screenshot 2024-07-12 at 9 34 13 AM

If you go to https://archiveweb.page and load the same URL you should see that it plays fine.

Screenshot 2024-07-12 at 9 34 16 AM

Additional details

When using ReplayWebPage, I noticed that the dev console shows "RangeError: Invalid time value" and that the page date in the interface is blank:

Screenshot 2024-07-12 at 9 36 40 AM

When using ArchiveWebPage there is no console error, but the page date shows up as "Invalid Date":

Screenshot 2024-07-12 at 9 36 35 AM
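Both symptoms are consistent with a page entry whose timestamp is missing or unparseable. The following is a minimal sketch, not code from either project: `describePageDate` is a hypothetical helper, and the assumption that the page's timestamp was absent is only inferred from the symptoms above. In JavaScript, an invalid `Date` stringifies as "Invalid Date", while calling `toISOString()` on it throws a `RangeError`.

```typescript
interface PageDateInfo {
  label: string;        // what a UI would show if it called toString()
  iso: string | null;   // ISO form, or null when the date is invalid
  error: string | null; // error raised by toISOString(), if any
}

// Hypothetical helper: builds a Date from a possibly missing timestamp.
function describePageDate(ts: string | number | undefined): PageDateInfo {
  // A missing timestamp becomes NaN, which yields an invalid Date.
  const date = new Date(ts ?? NaN);
  const label = date.toString(); // "Invalid Date" for an invalid Date
  try {
    return { label, iso: date.toISOString(), error: null };
  } catch (e) {
    // toISOString() throws a RangeError when the time value is NaN.
    const name = e instanceof RangeError ? "RangeError" : "Error";
    return { label, iso: null, error: `${name}: ${(e as Error).message}` };
  }
}

console.log(describePageDate(undefined).label);            // "Invalid Date"
console.log(describePageDate("2024-07-12T13:34:13Z").iso); // "2024-07-12T13:34:13.000Z"
```

Code that formats the date with `toString()` shows "Invalid Date", while code that calls `toISOString()` fails with an exception. That would match the two different behaviours reported here.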
ikreymer commented 1 month ago

This is indeed strange. I think what happened is that a POST request was made to the page, and ArchiveWeb.page saved only the POST version, not the GET version (both should have been saved). Did you perhaps land on it by submitting a form? ReplayWeb.page does not look up the POST version from the page link, though you can find it in the URL resources list and load the page from there. ArchiveWeb.page does look it up, so I will try to sort out the discrepancy. The data is loaded fully in ArchiveWeb.page but looked up on demand in ReplayWeb.page, so that may be part of it.

If you just visit the page again and record it, and download that, it does work in both (just a GET request).
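To illustrate how a POST-only capture can be present in an archive yet unreachable from a page link, here is a rough sketch. It does not reflect the actual ReplayWeb.page or ArchiveWeb.page index. `ResourceIndex`, its key format, and the example URL are all hypothetical, chosen only to show a lookup that assumes GET next to one that falls back to POST.

```typescript
type Method = "GET" | "POST";

interface ArchivedResource {
  url: string;
  method: Method;
  status: number;
}

// Hypothetical index keyed by method + URL (not the real schema).
class ResourceIndex {
  private byKey = new Map<string, ArchivedResource>();

  private static key(url: string, method: Method): string {
    return `${method} ${url}`;
  }

  add(resource: ArchivedResource): void {
    this.byKey.set(ResourceIndex.key(resource.url, resource.method), resource);
  }

  // Page-link lookup that only tries GET: misses POST-only captures.
  lookupPageGetOnly(url: string): ArchivedResource | undefined {
    return this.byKey.get(ResourceIndex.key(url, "GET"));
  }

  // Lookup that falls back to a POST capture when no GET capture exists.
  lookupPageAnyMethod(url: string): ArchivedResource | undefined {
    return (
      this.lookupPageGetOnly(url) ??
      this.byKey.get(ResourceIndex.key(url, "POST"))
    );
  }
}

const index = new ResourceIndex();
index.add({ url: "https://example.com/viewer", method: "POST", status: 200 });

console.log(index.lookupPageGetOnly("https://example.com/viewer"));           // undefined
console.log(index.lookupPageAnyMethod("https://example.com/viewer")?.method); // "POST"
```

In this toy model the resource is still listed under its URL, so it can be opened from a resource list even though the GET-only page lookup reports it as missing.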

edsu commented 1 month ago

Yes, I think I did start archiving the page after a POST.

ikreymer commented 1 month ago

I think the issue is in AWP, as it does not seem to set the page info properly when a page is created for a POST request. I will move the issue there.

ikreymer commented 1 month ago

ArchiveWeb.page didn't handle POST request pages properly. Since it is possible to arrive at a page from a POST request, it will now check for that and update the page listing accordingly. The attached WACZ wasn't created properly (hence the invalid date), but this is now fixed in 0.12.5. You should be able to archive a page after a POST request and then load it in ReplayWeb.page (though existing archives won't be affected).
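Since existing archives are not affected by the fix, it may help to check whether an older WACZ has the same problem. The sketch below is an assumption-laden illustration, not part of either tool. It assumes the archive's page list is a JSON Lines file whose entries carry `url` and `ts` fields after a header line with a `format` field, and that an affected page shows up as an entry with a missing or unparseable `ts`. `findPagesWithInvalidDate` is a hypothetical name.

```typescript
interface PageEntry {
  format?: string; // assumed to be present only on the header line
  url?: string;
  ts?: string;
}

// Returns the URLs of page entries whose timestamp is missing or invalid.
function findPagesWithInvalidDate(pagesJsonl: string): string[] {
  const bad: string[] = [];
  for (const line of pagesJsonl.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed) continue; // skip blank lines
    const entry = JSON.parse(trimmed) as PageEntry;
    // Skip the header line and anything that is not a page entry.
    if (entry.format !== undefined || entry.url === undefined) continue;
    if (entry.ts === undefined || Number.isNaN(Date.parse(entry.ts))) {
      bad.push(entry.url);
    }
  }
  return bad;
}

// Made-up sample data for illustration only.
const sample = [
  '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}',
  '{"id": "a", "url": "https://example.com/ok", "ts": "2024-07-12T13:34:13Z"}',
  '{"id": "b", "url": "https://example.com/after-post"}',
].join("\n");

console.log(findPagesWithInvalidDate(sample)); // ["https://example.com/after-post"]
```

Pages flagged this way would need to be recorded again with 0.12.5 or later, as suggested above.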