webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

How should we handle playback of redirects to the web archive itself? #591

Open anjackson opened 3 years ago

anjackson commented 3 years ago

Expected behavior

We've archived this page in the past: http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx

The 2008 copy works fine, but it's been replaced with a redirect to us, the UK Web Archive: https://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx

And since then, we've archived the redirect, so now the archive points at itself. This ends with a blank page (at least when using a more recent pywb, here: https://beta.webarchive.org.uk/wayback/archive/20140613220103mp_/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx)

It should ideally somehow know those are self-redirects and drop them, rolling back to the 2008 version: http://beta.webarchive.org.uk/wayback/archive/cdx?url=http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx


EDIT to try and make what's going on clear: The actual WARC response record has a Location header that points back to us, the UK Web Archive, i.e. we indexed a redirect to ourselves, because they put in redirects to us but we kept archiving their pages.

Really, I guess we don't want to index responses that point to any web archive, so perhaps this is an indexing problem not a playback problem?
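
To make the indexing-side idea concrete, here is a minimal sketch of a filter that would keep such records out of the index. It is not part of pywb or cdxj-indexer: the names `ARCHIVE_HOST_SUFFIXES`, `points_to_archive` and `should_index` are hypothetical, and the suffix list is an assumption based on the host seen in this issue.

```python
from urllib.parse import urlsplit

# Hypothetical list of archive hosts; not an existing pywb or cdxj-indexer setting.
ARCHIVE_HOST_SUFFIXES = ("webarchive.org.uk",)


def points_to_archive(location, suffixes=ARCHIVE_HOST_SUFFIXES):
    """Return True if a Location header value targets one of the archive hosts."""
    host = (urlsplit(location).hostname or "").lower()
    return any(host == s or host.endswith("." + s) for s in suffixes)


def should_index(status, location):
    """Return False for 3xx responses whose Location points at a web archive."""
    is_redirect = 300 <= status < 400
    return not (is_redirect and location and points_to_archive(location))
```

With this rule, a 301 whose Location is on www.webarchive.org.uk would be dropped, while an ordinary http-to-https redirect on the original site would still be indexed.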


What actually happened

Blank page instead of 2008 instance.

Browser

All.

ikreymer commented 3 years ago

Hm, it looks like that 301 response does not include a Location header, hence the blank page. Is that what it is in the WARC record? Did the crawler end up crawling UKWA itself, or did it stop there? It seems like this should just be a special case for the self-redirect check, but it's not entirely clear from the response yet...

anjackson commented 3 years ago

Hm, the WARC records look alright to me (see below). We do have some crufty records from accidentally crawling our own archive in the past, but we don't seem to have one for this particular page.

/heritrix/output/frequent-npld/20191203215907/warcs/BL-NPLD-20191207020008229-03586-75~npld-heritrix3-worker-1~8443.warc.gz 634600266 0
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2019-12-07T05:49:33Z
WARC-IP-Address: 52.84.141.67
WARC-Payload-Digest: sha1:Z6IJ46JXZU7TCLCDINT3OMVFHV5GZPYU
WARC-Record-ID: <urn:uuid:73e71ddc-1886-4431-af60-d7792f41716e>
Content-Type: application/http; msgtype=response
Content-Length: 635

HTTP/1.1 301 Moved Permanently
Server: CloudFront
Date: Sat, 07 Dec 2019 05:49:33 GMT
Content-Type: text/html
Content-Length: 183
Connection: close
Location: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
X-Cache: Redirect from cloudfront
Via: 1.1 3ddebf82c7d3a31f75ae0b53cadb99f3.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: MAN50-C3
X-Amz-Cf-Id: BWDyc1daE0vinNIGPj4gwlw9nl-SwIAhmKDNQtH7EM3Gx5n06DFUTA==

<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>CloudFront</center>
</body>
</html>

/heritrix/output/warcs/quarterly/20191001020435/BL-20191005233634759-01995-62~ukwa-h3-pulse-quarterly~8443.warc.gz 383937190 0
WARC/1.0
WARC-Type: response
WARC-Target-URI: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2019-10-05T23:42:42Z
WARC-IP-Address: 54.192.33.125
WARC-Payload-Digest: sha1:Z6S5IZX7WMF4M6AQ7W4C3MIH7IUZN3QT
WARC-Record-ID: <urn:uuid:15b1d8ec-45c1-492b-954d-bb2339f41d63>
Content-Type: application/http; msgtype=response
Content-Length: 1401

HTTP/1.1 301 Moved Permanently
Content-Type: text/html; charset=iso-8859-1
Content-Length: 350
Connection: close
Date: Sat, 05 Oct 2019 23:42:42 GMT
Set-Cookie: AWSALB=Vrci55fIlihaFHr0WbnluCDpZfXRjrPdDr3JvSry9znByUayv6KtF4h3/AAK2wOo3de3me9gbcg6po1sdD5puEy3ISo6n8YsniPmgBg3Le2PNebeVlTOzFvP668R; Expires=Sat, 12 Oct 2019 23:42:42 GMT; Path=/
Server: Apache
X-Frame-Options: SAMEORIGIN
Strict-Transport-Security: max-age=31536000
Feature-Policy: microphone 'none'; payment 'none'; sync-xhr 'self' https://www.jisc.ac.uk”
Referrer-Policy: same-origin
X-Xss-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Location: http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
Cache-Control: max-age=1209600
Expires: Sat, 19 Oct 2019 23:42:42 GMT
X-Cache: Miss from cloudfront
Via: 1.1 3eb04a11bfe0f7e0abb7389a916f0d41.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: MAN50-C1
X-Amz-Cf-Id: ocfzaOYcoFIy_K7w7LiEp4s_awox_A9ZwW9ezra34owjGN8xUfmRlw==

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx">here</a>.</p>
</body></html>

/heritrix/output/warcs/quarterly/20190701020558/BL-20190704080945660-00599-63~ukwa-h3-pulse-quarterly~8443.warc.gz 959790954 0
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2019-07-04T10:24:33Z
WARC-IP-Address: 54.192.34.75
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Payload-Digest: sha1:Z6S5IZX7WMF4M6AQ7W4C3MIH7IUZN3QT
WARC-Refers-To-Date: 2019-04-03T19:48:41Z
WARC-Refers-To-Target-URI: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Record-ID: <urn:uuid:25369e12-dfe1-46bd-878e-c12521176c7c>
Content-Type: application/http; msgtype=response
Content-Length: 1051

HTTP/1.1 301 Moved Permanently
Content-Type: text/html; charset=iso-8859-1
Content-Length: 350
Connection: close
Date: Thu, 04 Jul 2019 10:24:33 GMT
Set-Cookie: AWSALB=XWV2MwRQmJwO4YL/voHHHea8XDmOWBK9tcsyquOhIyceJF52oDj6ZHlABSdG5I9oKwyk9zZ0eA/GBMVg+4Y7Jtkfs05FvUFjCeMcR+VwAqtgEpaASOEUguGc0tRJ; Expires=Thu, 11 Jul 2019 10:24:33 GMT; Path=/
Server: Apache
x-frame-options: SAMEORIGIN
Strict-Transport-Security: max-age=31536000
Feature-Policy: microphone 'none'; payment 'none'; sync-xhr 'self' https://www.jisc.ac.uk”
Referrer-Policy: same-origin
X-Xss-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Location: http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
Cache-Control: max-age=1209600
Expires: Thu, 18 Jul 2019 10:24:33 GMT
X-Cache: Miss from cloudfront
Via: 1.1 a364335587d085de3832514f7712e0e0.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: MAN50-C1
X-Amz-Cf-Id: 3DYqGb2l7Hhe7loZSahKhc0WzoGjrsLKBUJX6i4tpo9lRzukWKN2fQ==

/heritrix/output/warcs/quarterly/20190401020202/BL-20190403192157681-01649-62~ukwa-h3-pulse-quarterly~8443.warc.gz 904849596 0
WARC/1.0
WARC-Type: response
WARC-Target-URI: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2019-04-03T19:48:41Z
WARC-IP-Address: 54.192.33.85
WARC-Payload-Digest: sha1:Z6S5IZX7WMF4M6AQ7W4C3MIH7IUZN3QT
WARC-Record-ID: <urn:uuid:8545283f-6a9e-460b-8079-f9686b7f7fe8>
Content-Type: application/http; msgtype=response
Content-Length: 1377

HTTP/1.1 301 Moved Permanently
Content-Type: text/html; charset=iso-8859-1
Content-Length: 350
Connection: close
Date: Wed, 03 Apr 2019 19:48:41 GMT
Set-Cookie: AWSALB=UykjQ7bRJJiF79tYvgooUZaygW6Ms4qy4z9V7fR4YCNJ79mZ+Qc80QTP2y8zVY32/k070noNtxK98AHX2+f6Sujfg+obKz+Al03s+gBPz7XqtGM5eKZ8X50ukhlT; Expires=Wed, 10 Apr 2019 19:48:41 GMT; Path=/
Server: Apache
X-Frame-Options: SAMEORIGIN
Strict-Transport-Security: max-age=31536000
Feature-Policy: microphone 'none'; payment 'none'; sync-xhr 'self' https://www.jisc.ac.uk”
Referrer-Policy: same-origin
X-Xss-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Location: http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
Cache-Control: max-age=1209600
Expires: Wed, 17 Apr 2019 19:48:41 GMT
X-Cache: Miss from cloudfront
Via: 1.1 c6c27fb3a8bc413f99e81981948a67c6.cloudfront.net (CloudFront)
X-Amz-Cf-Id: ydQSmLN9906psJSAJK0v21hrcA30BKpfRpIrtMR7Q8Ct1gare825cg==

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx">here</a>.</p>
</body></html>

/heritrix/output/warcs/quarterly/20190401020202/BL-20190403192157680-01648-62~ukwa-h3-pulse-quarterly~8443.warc.gz 882680918 0
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2019-04-03T19:48:40Z
WARC-IP-Address: 54.192.33.85
WARC-Payload-Digest: sha1:Z6IJ46JXZU7TCLCDINT3OMVFHV5GZPYU
WARC-Record-ID: <urn:uuid:f52e8a14-3662-484b-9789-ff0d5be11a3a>
Content-Type: application/http; msgtype=response
Content-Length: 611

HTTP/1.1 301 Moved Permanently
Server: CloudFront
Date: Wed, 03 Apr 2019 19:48:40 GMT
Content-Type: text/html
Content-Length: 183
Connection: close
Location: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
X-Cache: Redirect from cloudfront
Via: 1.1 5df88084d2e6c90392a3f4e5a634f39d.cloudfront.net (CloudFront)
X-Amz-Cf-Id: LVWNGCzgau2FFu6GgC-i51SJPBwqoAB7C6JtGZDtuRUu5ntyTXVZIQ==

<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>CloudFront</center>
</body>
</html>

/heritrix/output/warcs/quarterly/20181001021312/BL-20181013053614140-01151-63~ukwa-h3-pulse-quarterly~8443.warc.gz 623624382 0
WARC/1.0
WARC-Type: response
WARC-Target-URI: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2018-10-13T06:15:46Z
WARC-IP-Address: 13.33.54.2
WARC-Payload-Digest: sha1:Z6S5IZX7WMF4M6AQ7W4C3MIH7IUZN3QT
WARC-Record-ID: <urn:uuid:882211ed-fcc1-45f0-a1aa-be342c641bf7>
Content-Type: application/http; msgtype=response
Content-Length: 1443

HTTP/1.1 301 Moved Permanently
Content-Type: text/html; charset=iso-8859-1
Content-Length: 350
Connection: close
Date: Sat, 13 Oct 2018 06:15:46 GMT
Server: Apache
X-Frame-Options: SAMEORIGIN
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Feature-Policy: microphone 'none'; payment 'none'; sync-xhr 'self' https://www.jisc.ac.uk”
Referrer-Policy: same-origin
Public-Key-Pins: pin-sha256='X3pGTSOuJeEVw989IJ/cEtXUEmy52zs1TZQrU06KUKg='; pin-sha256='MHJYVThihUrJcxW6wcqyOISTXIsInsdj3xK8QrZbHec='; pin-sha256='isi41AizREkLvvft0IRW4u3XMFR2Yg7bvrF7padyCJg='; includeSubdomains; max-age=2592000
X-Xss-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Location: http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
Cache-Control: max-age=1209600
Expires: Sat, 27 Oct 2018 06:15:46 GMT
X-Cache: Miss from cloudfront
Via: 1.1 4583e6648e47a3495c29f53f72bab417.cloudfront.net (CloudFront)
X-Amz-Cf-Id: GnI55yw_sYGl6cyVo3HiaES8TeMc3iBJDNkZv2RTAbUMdO6zwcG7ww==

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx">here</a>.</p>
</body></html>

/heritrix/output/warcs/quarterly/20181001021312/BL-20181013053614145-01152-63~ukwa-h3-pulse-quarterly~8443.warc.gz 635049293 0
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2018-10-13T06:15:45Z
WARC-IP-Address: 13.33.54.2
WARC-Payload-Digest: sha1:Z6IJ46JXZU7TCLCDINT3OMVFHV5GZPYU
WARC-Record-ID: <urn:uuid:9ef26f78-33b1-4b42-8d87-c2b040076ee6>
Content-Type: application/http; msgtype=response
Content-Length: 611

HTTP/1.1 301 Moved Permanently
Server: CloudFront
Date: Sat, 13 Oct 2018 06:15:45 GMT
Content-Type: text/html
Content-Length: 183
Connection: close
Location: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
X-Cache: Redirect from cloudfront
Via: 1.1 17570bdaeda2a4497e4f831a500e55ff.cloudfront.net (CloudFront)
X-Amz-Cf-Id: msONVO-E33_sEVZ63a55-FDGPwH7U32RF2dtRVV5q2HX-ib_2G6Qvw==

<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>CloudFront</center>
</body>
</html>

/data/129641/8618135/WARCS/BL-8618135-72.warc.gz 44738434 0
WARC/1.0
WARC-Type: response
WARC-Date: 2008-06-25T15:26:08Z
WARC-Record-ID: <urn:uuid:2b18bb8e-14bc-47bf-8688-5469fe75767d>
WARC-IP-Address: 83.137.214.22
WARC-Target-URI: http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Payload-Digest: sha512:3c8ece225eeef7b8484991d572e59f10335b4acaf689e6923554b087a69b8056e5703c0aaaed22da452a911ae74faa3a9ec3f2fa0e4668b2059b22d6b80386fe
Content-Type: application/http;msgtype=response
WARC-Identified-Payload-Type: text/html
Content-Length: 16490

HTTP/1.1 200 OK
Connection: close
Date: Wed, 25 Jun 2008 15:26:09 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Cache-Control: no-cache, no-store
Pragma: no-cache
Expires: -1
Content-Type: text/html; charset=utf-8
Content-Length: 16207

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head> 
    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type" />
    <link href="/css/print.css" rel="style
...

anjackson commented 3 years ago

A-ha, I think this arises because there's a closest_limit of 10 that's used when looking up the URL in OutbackCDX. PyWB appends &limit=10&matchType=exact to the query, and that fails if there are a lot of revisits etc. I can't find a way to configure this setting!? HELP! :-)
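
A rough illustration of why a fixed limit could cause this, assuming replay walks the returned captures in order and skips the ones it cannot use. The `first_playable` helper and the `skip` flag are hypothetical and only model the suspected behaviour; they are not pywb code.

```python
def first_playable(captures, limit=None):
    """Return the first capture that is not skipped, looking at no more than `limit` entries."""
    window = captures if limit is None else captures[:limit]
    for cap in window:
        if not cap["skip"]:
            return cap
    return None


# Twelve unusable captures (revisits, self-redirects) followed by the good 2008 one.
captures = [{"ts": f"2019{i:02d}", "skip": True} for i in range(1, 13)]
captures.append({"ts": "20080625152608", "skip": False})

print(first_playable(captures, limit=10))  # the good capture is outside the window
print(first_playable(captures))            # without a limit it is found
```

If the ten closest entries are all skipped, the lookup never reaches the older capture, which would match the symptom described here.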

anjackson commented 3 years ago

Hm, something weird is going on. I've deployed our latest PyWB on our BETA service, and made it filter out revisits, leading to this calendar:

https://beta.webarchive.org.uk/wayback/archive/*/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx#

The ones prior to 20181013061545 work, but 20181013061545 onwards does not work. I've added an API so you can access the raw WARC record directly:

https://beta.webarchive.org.uk/api/query/warc/20181013061545/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
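
For reference, filtering revisits out of an index can be sketched as below. This assumes CDXJ-style lines (`urlkey timestamp {json}`) in which revisit records carry `"mime": "warc/revisit"`; the function names are hypothetical and this is not how the BETA service is actually configured.

```python
import json


def parse_cdxj_line(line):
    """Split a CDXJ line into (urlkey, timestamp, fields)."""
    urlkey, timestamp, raw = line.strip().split(" ", 2)
    return urlkey, timestamp, json.loads(raw)


def drop_revisits(lines):
    """Yield only the lines whose JSON block is not marked as a revisit."""
    for line in lines:
        _, _, fields = parse_cdxj_line(line)
        if fields.get("mime") != "warc/revisit":
            yield line
```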

anjackson commented 3 years ago

To be clear, limiting closest_limit to a hardcoded value of 10 is definitely a problem and is causing various playback issues. It may not be the only problem.

https://github.com/webrecorder/pywb/blob/54d8bccf4a4eebf305012d49cb7330eaddea9eba/pywb/warcserver/index/indexsource.py#L116-L121

anjackson commented 3 years ago

Proposed #606.

anjackson commented 3 years ago

Following the update to run under 2.5.0, I think this should work fine. On a test server, it still says:

The url http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx could not be found in this collection. 

But I think that's because it's not running on www.webarchive.org.uk and that'll be fine once on live.

anjackson commented 2 years ago

Unfortunately, this doesn't seem to work on live, e.g. https://www.webarchive.org.uk/act/wayback/archive/20181013061546/https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx still says

The url http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx could not be found in this collection. 

EDIT: there are some suggestions the service might be seeing its internal server name prod1, at the NGINX level at least. See https://github.com/ukwa/w3act/issues/664

ikreymer commented 2 years ago

Hm, yes, the error message suggests that something odd is happening with the URL lookup, and maybe some sort of mismatch on the prefix...

anjackson commented 1 year ago

Hey @ikreymer I don't suppose you've any idea how to fix this? It's now causing us major problems. Okay, we have a workaround, but this is a bit of a pain. Perhaps it's easier to modify cdxj-indexer to stop the records getting to the index.

anjackson commented 1 year ago

Looking at: https://github.com/webrecorder/pywb/blob/main/pywb/warcserver/resource/responseloader.py#L120 I feel pretty sure the cases that are causing problems are a whole class of redirects that are not covered at all by the current implementation.

The current implementation catches redirect loops, where the redirect location is the same as the current URL.

The cases I'm hitting are cases where the redirect of URL goes to https://www.webarchive.org.uk/wayback/archive/URL - i.e. extra configuration/logic is needed to drop redirects that go to hosts like *.webarchive.org.uk. It may be possible to block these at indexing time (see webrecorder/cdxj-indexer#21) but ideally they should be blocked here too.

If I understand the code flow, I think this could be added to the raise_on_self_redirect function as an additional case it deals with and treats as a self-redirect, so that those corresponding index entries get skipped.
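
A minimal sketch of what that extra case might look like. This is not the actual raise_on_self_redirect code: `is_self_redirect`, `first_replayable`, the `archive_suffixes` parameter and the capture dictionaries are all hypothetical, and comparing URLs with the scheme stripped is an assumption.

```python
from urllib.parse import urlsplit


def strip_scheme(url):
    """Drop the scheme and any trailing slash so http/https variants compare equal."""
    return url.split("://", 1)[-1].rstrip("/")


def is_self_redirect(request_url, status, location, archive_suffixes=()):
    """Return True for redirect loops and for redirects into an archive host."""
    if not (300 <= status < 400) or not location:
        return False
    if strip_scheme(location) == strip_scheme(request_url):
        return True  # redirect loop: Location is the URL being replayed
    # Proposed extra case: Location points at a configured archive host.
    host = (urlsplit(location).hostname or "").lower()
    return any(host == s or host.endswith("." + s) for s in archive_suffixes)


def first_replayable(request_url, captures, archive_suffixes=()):
    """Return the first capture that is not a self-redirect, or None."""
    for cap in captures:
        if not is_self_redirect(request_url, cap["status"],
                                cap.get("location"), archive_suffixes):
            return cap
    return None
```

Given captures ordered by closeness, passing `archive_suffixes=("webarchive.org.uk",)` would skip the 2018 and 2019 redirects in this issue and fall through to the 2008 capture.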

VascoRatoFCCN commented 10 months ago

We at Arquivo.pt had a similar issue: we archived a page which includes a link to another archived page. So when we try to follow the archived link, pywb tries to redirect us to an archived version of our own archived page. Since we don't archive ourselves, pywb fails to replay anything. We would have preferred it if pywb could recognize that it's trying to replay an archived page of our own archive and redirect to the original page instead. (Link to our GitHub issue)

However, in our case I believe this could be easily fixed during replay: since we're not dealing with an HTTP redirect, we don't need to look at headers or anything, so this is all happening on the client side. I thought that maybe we could just prevent pywb from processing URLs that point towards ourselves, and instead just go to the desired endpoint.

As a proof of concept, I messed with the template of our pywb instance and implemented a very crude way to detect links that point towards ourselves (see below). This worked, so we will probably use a more robust version of this as a workaround for now, but it'd be great if this could be a configurable behavior for pywb.

window.addEventListener("message", onMessage, false);

function onMessage(event) {
  // The posted message payload is in event.data.
  const data = event.data;
  if (!data || typeof data.wb_type === 'undefined') {
    return;
  }
  if (data.wb_type === "load" || data.wb_type === "replace-url") {
    // The pywb event includes data.url, the original URL of the archived link.
    if (typeof data.url === 'string' && data.url.includes('arquivo.pt/wayback')) {
      // Force the browser to go to that URL instead of trying to get an archived version.
      window.location.href = data.url;
    }
  }
}
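
A server-side counterpart of the same idea would be to unwrap a URL that already points into the archive and replay its target directly. The sketch below is only an illustration: `unwrap_archive_url` and the prefix list are hypothetical, and the timestamp pattern (up to 14 digits, optionally followed by a two-letter modifier such as `mp_`) is an assumption based on the URLs in this thread.

```python
import re

# Timestamp of up to 14 digits, optionally followed by a modifier like "mp_".
TS_RE = re.compile(r"^(\d{1,14})(?:[a-z]{2}_)?$")


def unwrap_archive_url(url, prefixes):
    """Return (timestamp, original_url) if `url` points into the archive, else None."""
    bare = url.split("://", 1)[-1]  # drop the outer scheme only
    for prefix in prefixes:
        if bare.startswith(prefix):
            ts_part, sep, original = bare[len(prefix):].partition("/")
            match = TS_RE.match(ts_part)
            if match and sep and original:
                return match.group(1), original
    return None
```

For example, with the prefix `www.webarchive.org.uk/wayback/archive/`, the redirect target from this issue would unwrap to the timestamp `20140613220103` and the original jisc.ac.uk URL.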