webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

PYWB stripping out part of URLs on timeline page <url>#/<something> #863

Open ChrisDoyleMW opened 10 months ago

ChrisDoyleMW commented 10 months ago

Describe the bug

PYWB seems to be stripping out part of the URL when a timeline page is requested. For example, https://webarchive.nationalarchives.gov.uk/*/https://www.arcgis.com/apps/opsdashboard/index.html#/f94c3c90da5b4e9f9a0b19484dd4bb14 loads a timeline for https://www.arcgis.com/apps/opsdashboard/index.html. Each instance shown is for index.html and not index.html#/f94c3c90da5b4e9f9a0b19484dd4bb14.

Steps to reproduce the bug

  1. Open this url in your browser: https://webarchive.nationalarchives.gov.uk/*/https://www.arcgis.com/apps/opsdashboard/index.html#/f94c3c90da5b4e9f9a0b19484dd4bb14
  2. Click on the link dated 02 April 2020.
  3. Initially a page with the url: https://webarchive.nationalarchives.gov.uk/ukgwa/20200402132156/https://www.arcgis.com/apps/opsdashboard/index.html starts to load. Note that the string after the # symbol has been stripped out.
  4. The page does not load but redirects to this url: https://webarchive.nationalarchives.gov.uk/ukgwa/20200328185042/https://www.arcgis.com/sharing/rest/oauth2/authorize?client_id=opsdashboard&display=default&response_type=token&expiration=20160&redirect_uri=https%3A%2F%2Fwww.arcgis.com%2Fapps%2Fopsdashboard%2FpostSignIn.html&locale=en-gb&state=%7B%22redirect%22%3A%22https%3A%2F%2Fwww.arcgis.com%2Fapps%2Fopsdashboard%2Findex.html%22%2C%22portalUrl%22%3A%22https%3A%2F%2Fwww.arcgis.com%2Fsharing%2Frest%2F%22%7D which displays as a blank page.

Expected behavior

I'd expect the timeline page to show the timeline for the correct URL and allow visitors to view the capture history for this specific URL, without stripping out the final part of the URL.

Screenshots

Screenshot 2023-09-04 at 12 52 44 Screenshot 2023-09-04 at 12 51 21 Screenshot 2023-09-04 at 12 50 44 Screenshot 2023-09-04 at 12 55 20

Environment

• OS: Linux
• Browser: Any
• Version: PYWB 2.7

petsva commented 10 months ago

Everything after a # in a URL is the fragment part. It is never sent to the server; the web browser handles it on its own (normally to scroll to a certain position on the page). Hence a harvester can only harvest a URL with the fragment part stripped. That is why Pywb strips it and shows what it found in the index for the URL without the fragment part.
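To illustrate the split described above, here is a small sketch using Python's standard library on the URL from this issue. It only shows how the URL divides into the part that reaches the server and the fragment; it is not pywb's own code.

```python
from urllib.parse import urldefrag

requested = (
    "https://www.arcgis.com/apps/opsdashboard/index.html"
    "#/f94c3c90da5b4e9f9a0b19484dd4bb14"
)

# urldefrag separates the URL a client would send to the server
# from the fragment, which stays in the browser.
url_without_fragment, fragment = urldefrag(requested)

print(url_without_fragment)  # https://www.arcgis.com/apps/opsdashboard/index.html
print(fragment)              # /f94c3c90da5b4e9f9a0b19484dd4bb14
```

Only the first value can appear in a crawl index, which matches the timeline shown for index.html.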

But maybe Pywb could put the fragment back into the links, to trick the browser into scrolling according to it. Or maybe that would be confusing.
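A rough sketch of that idea, purely as an illustration: `reattach_fragment` is a hypothetical helper, not part of pywb. Since the server never receives the fragment, a real version would presumably have to run as script in the browser on the timeline page; the Python below only shows the URL manipulation involved.

```python
from urllib.parse import urldefrag


def reattach_fragment(replay_link: str, requested_url: str) -> str:
    """Hypothetical helper: copy the fragment of the URL the visitor
    asked for onto a replay link that was built without it."""
    _, fragment = urldefrag(requested_url)
    base, _ = urldefrag(replay_link)  # drop any fragment already on the link
    return f"{base}#{fragment}" if fragment else base


requested = (
    "https://webarchive.nationalarchives.gov.uk/*/"
    "https://www.arcgis.com/apps/opsdashboard/index.html"
    "#/f94c3c90da5b4e9f9a0b19484dd4bb14"
)
replay_link = (
    "https://webarchive.nationalarchives.gov.uk/ukgwa/20200402132156/"
    "https://www.arcgis.com/apps/opsdashboard/index.html"
)

print(reattach_fragment(replay_link, requested))
```

Even with the fragment restored, the capture itself is still the one for index.html, so whether the dashboard loads would depend on what else was archived.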