webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

No calendar/timeline: front end URLs in the template add the static_prefix after the pywb.host_prefix (Playback) #854

Closed ptrourke closed 12 months ago

ptrourke commented 12 months ago

Describe the bug

When I make a request like https://hostname.domain.tld/collection_name/*/example.com, no results are displayed, and the backend OutbackCDX server logs no requests.

I suspect the issue is due to the VueJS scripts being loaded with an http:// scheme rather than the https:// scheme. At @ldko 's suggestion, I looked at the developer tools console, and saw these errors:

Blocked loading mixed active content “http://hostname.domain.tld/static/css/bootstrap.min.css”
example.com
Blocked loading mixed active content “http://hostname.domain.tld/static/css/font-awesome.min.css”
example.com
Blocked loading mixed active content “http://hostname.domain.tld/static/css/base.css”
example.com
Blocked loading mixed active content “http://hostname.domain.tld/static/js/jquery-latest.min.js”
example.com
Loading failed for the <script> with source “http://hostname.domain.tld/static/js/jquery-latest.min.js”. example.com:14:75
Blocked loading mixed active content “http://hostname.domain.tld/static/js/bootstrap.min.js”
example.com
Loading failed for the <script> with source “http://hostname.domain.tld/static/js/bootstrap.min.js”. example.com:15:71
Blocked loading mixed active content “http://hostname.domain.tld/static/loading-spinner/loading-spinner.js”
example.com
Loading failed for the <script> with source “http://hostname.domain.tld/static/loading-spinner/loading-spinner.js”. example.com:19:86
Blocked loading mixed active content “http://hostname.domain.tld/static/vue/vueui.js”
example.com
Loading failed for the <script> with source “http://hostname.domain.tld/static/vue/vueui.js”. example.com:20:64
Uncaught ReferenceError: VueUI is not defined
    <anonymous> https://hostname.domain.tld/general/*/example.com:100
example.com:100:3
Loading mixed (insecure) display content “http://hostname.domain.tld/favicon.ico/” on a secure page
example.com
GET
https://hostname.domain.tld/favicon.ico/
[HTTP/1.1 404 Not Found 0ms]
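The same check can be made without a browser. The following is only a minimal sketch using the Python standard library: it scans an HTML string for `http://` URLs in `<script src>`, `<link href>`, and `<img src>`. The class and function names, and the sample markup, are hypothetical and not part of pywb.

```python
from html.parser import HTMLParser


class InsecureResourceFinder(HTMLParser):
    """Collect http:// URLs from <script src>, <link href>, and <img src>."""

    # Tag name -> attribute that carries the resource URL.
    WATCHED = {"script": "src", "link": "href", "img": "src"}

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        wanted = self.WATCHED.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value and value.startswith("http://"):
                self.insecure.append(value)


def find_insecure_resources(html):
    """Return every watched resource URL that uses the http:// scheme."""
    finder = InsecureResourceFinder()
    finder.feed(html)
    finder.close()
    return finder.insecure


# Hypothetical markup modelled on the console errors above.
sample = """<html><head>
<link rel="stylesheet" href="http://hostname.domain.tld/static/css/base.css">
<script src="http://hostname.domain.tld/static/vue/vueui.js"></script>
<script src="https://hostname.domain.tld/static/ok.js"></script>
</head></html>"""

for url in find_insecure_resources(sample):
    print(url)
```

On an https page, any URL this prints would be blocked by the browser as mixed active content (scripts, stylesheets) or flagged as mixed display content (images).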

I saw in the templates (e.g., URL Query/Calendar Page Template) that there is a {{static_prefix}} variable, and in the documentation, I see the suggestion that a value for {{static_prefix}} might be

{{ static_prefix }} - the prefix from which static files will be accessed from, e.g. http://localhost:8080/static/.

(This was also brought to my attention by @ldko ).

However, when I added the key static_prefix: https://hostname.domain.tld/ to config.yaml, I still got "Blocked loading mixed active content" errors, with URLs like http://hostname.domain.tld/https://hostname.domain.tld//static/vue/vueui.js

I noticed that in issue 688 there was a change to set

environ['pywb.static_prefix'] = environ['pywb.host_prefix'] + environ['pywb.app_prefix'] + '/' + self.static_prefix

I'm wondering if this bug could be a result of that change?

Steps to reproduce the bug

static_prefix: https://hostname.domain.tld/
banner_html: banner.html
head_insert_html: head_insert.html
frame_insert_html: frame_insert.html

query_html: query.html
search_html: search.html
not_found_html: not_found.html

home_html: index.html
error_html: error.html

proxy_cert_download_html: proxy_cert_download.html
proxy_select_html: proxy_select.html

info_json: collinfo.json

html_templates:

rules_config: pkg://pywb/rules.yaml

* In /opt/applications/playback/config/collections/general/, have no files (minimal test of OutbackCDX connection).
* Owner on /opt/applications/playback/config/collections and child directories is root:root, mode is 755 
* Owner on /opt/applications/playback/config/config.yaml is root:root, mode is 644 
* Podman is run in root mode by root user
* Navigate to https://hostname.domain.tld/collection_name/*/example.com. No results appear.
* Open the developer tools in the browser. You'll see a number of "Blocked loading mixed active content" errors.
* On the container itself (`sudo podman container exec -it container-name bash`), try `curl https://outbackserver.domain.tld/collection_name?url=example.com`. The expected result from the OutbackCDX server appears.
* On the container itself, in the python interactive environment, try:
```python
import requests

# Query the OutbackCDX backend directly from inside the container
result = requests.get("https://outbackserver.domain.tld/collection_name?url=example.com")
result.status_code
```

The expected result of 200 appears.

Environment

Additional context

Environment requires https everywhere, connections to port 80 are prohibited by network policy. Preferred deployment practice is containerization. OutbackCDX backend currently has dozens of records matching the query, which appear when curled from inside the container.

Hostnames, container names, file paths in all examples have been modified.

Thanks to @tw4l , @ldko, and @thatcher for assistance characterizing this issue.

ptrourke commented 12 months ago

I tried setting pywb.host_prefix as an ENV in the Dockerfile, but it didn't resolve the issue. I also tried host_prefix and prefix instead of static_prefix; those did not help either. When I first talked to @ldko and @tw4l, I thought the issue was with our CA, but I resolved that with a few lines in the Dockerfile. It now works with curl and requests, so I'm pretty sure it's not the certificate on the OutbackCDX backend.

tw4l commented 12 months ago

Thanks @ptrourke! Going to start looking into this in the afternoon.

ato commented 12 months ago

I don't think it should be necessary to set static_prefix as long as the Host and X-Forwarded-Proto request headers are set by the proxy to appropriate values for generating the correct frontend URLs.

For Apache that probably looks like this (untested):

ProxyPreserveHost on
RequestHeader set X-Forwarded-Proto "https"

tw4l commented 12 months ago

I would try @ato's recommendation first. In addition you could look into https://uwsgi-docs.readthedocs.io/en/latest/Vars.html?highlight=scheme#uwsgi-scheme, which may be able to force HTTPS by setting the WSGI scheme to https. I see there's a check for the scheme value in the rewriterapp: https://github.com/webrecorder/pywb/blob/83b2113be2c2574ec120ba292006d706e3cc3d53/pywb/apps/rewriterapp.py#L820. I'm not sure if that would be used by the Vue app/service worker for fetching CDX however, so it seems that making sure outbound requests going through nginx/Apache are HTTPS might be the way to go.

On another note: static_prefix expects only the name of the static files directory, which follows the HTTP(S) scheme and host, so that you can e.g. rename the static directory to something else. That's why the value set there is appended after the host, producing e.g. http://hostname.domain.tld/https://hostname.domain.tld//static/vue/vueui.js when it is set to a full https URL.
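To make the concatenation concrete, here is a minimal sketch that mirrors the `environ['pywb.static_prefix']` assignment quoted earlier in this issue. The helper name `build_static_prefix` and the example values (an `http` host prefix and an empty app prefix) are assumptions for illustration, not pywb's actual code.

```python
def build_static_prefix(host_prefix, app_prefix, static_prefix):
    """Mirror the concatenation quoted from issue 688 (illustrative only)."""
    return host_prefix + app_prefix + "/" + static_prefix


# Assumed values: the app sees plain http from the proxy, with no app prefix.
host_prefix = "http://hostname.domain.tld"
app_prefix = ""

# A bare directory name yields a sensible URL.
print(build_static_prefix(host_prefix, app_prefix, "static"))
# prints: http://hostname.domain.tld/static

# A full URL is simply appended after the host, as in the reported errors.
print(build_static_prefix(host_prefix, app_prefix, "https://hostname.domain.tld/"))
# prints: http://hostname.domain.tld/https://hostname.domain.tld/
```

The second result matches the start of the blocked URLs in the report, which suggests the scheme has to be corrected through the host prefix rather than through static_prefix.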

ptrourke commented 12 months ago

Thanks, @tw4l . The documentation seems to suggest that it will take a full URL prefix, but that's probably just me reading into it. I think I had some confusion over which methods would be used by the Vue app/service worker and which would not!

I'm going to try @ato 's suggestions, which honestly would be easier for our deployment scripts, too. I hadn't seen X-Forwarded-Proto before, I'm afraid. If that works, I'll probably put in a pull request with slight documentation suggestions for the next time someone with my scenario runs into a misunderstanding. Can we leave this open while I test it?
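For anyone else new to X-Forwarded-Proto: a TLS-terminating proxy usually talks plain HTTP to the application, so the application only knows the original scheme if the proxy passes it along in that header. The sketch below shows the general idea for a WSGI app, where request headers appear as `HTTP_*` keys in the environ. The function name is hypothetical and this is not pywb's actual implementation.

```python
def host_prefix_from_environ(environ):
    """Build scheme://host from a WSGI environ, preferring X-Forwarded-Proto."""
    # Fall back to the scheme of the proxy-to-app connection if the header is absent.
    scheme = environ.get("HTTP_X_FORWARDED_PROTO") or environ.get("wsgi.url_scheme", "http")
    # The Host header is preserved by the proxy when ProxyPreserveHost is on.
    host = environ.get("HTTP_HOST") or environ.get("SERVER_NAME", "localhost")
    return scheme + "://" + host


# Proxy forwards the Host header but not the original scheme.
without_header = {"wsgi.url_scheme": "http", "HTTP_HOST": "hostname.domain.tld"}
print(host_prefix_from_environ(without_header))
# prints: http://hostname.domain.tld

# Proxy also sets X-Forwarded-Proto: https.
with_header = dict(without_header, HTTP_X_FORWARDED_PROTO="https")
print(host_prefix_from_environ(with_header))
# prints: https://hostname.domain.tld
```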

tw4l commented 12 months ago

Absolutely, and any documentation clarifications or additions are welcome, thanks!

ikreymer commented 12 months ago

> I don't think it should be necessary to set static_prefix as long as the Host and X-Forwarded-Proto request headers are set by the proxy to appropriate values for generating the correct frontend URLs.
>
> For Apache that probably looks like this (untested):
>
> ProxyPreserveHost on
> RequestHeader set X-Forwarded-Proto "https"

Another option might be to set

SetEnv UWSGI_SCHEME https

for the proxy.
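If changing the proxy configuration is not an option, a rough application-level equivalent is to force the WSGI scheme in a small middleware. This is only a sketch (untested against pywb): the `ForceHttpsScheme` class and `demo_app` are hypothetical, and it assumes every request really did arrive over https at the proxy.

```python
class ForceHttpsScheme:
    """WSGI middleware that reports every request as https to the wrapped app."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # Override whatever scheme the server derived from the proxy connection.
        environ["wsgi.url_scheme"] = "https"
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Tiny stand-in application that echoes the scheme it sees."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ["wsgi.url_scheme"].encode("ascii")]


wrapped = ForceHttpsScheme(demo_app)
body = wrapped({"wsgi.url_scheme": "http"}, lambda status, headers: None)
print(body)
# prints: [b'https']
```

Setting the scheme at the proxy keeps the application unaware of the deployment, which is probably the better choice when it is available.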

I noticed our Sample Apache Config doesn't set this variable while the nginx one does, though it may not be needed if using the uwsgi module. I can't quite tell whether the uwsgi module is being used here or just a regular proxy. Perhaps in the latter case it should be set, also per: https://uwsgi-docs.readthedocs.io/en/latest/Vars.html#uwsgi-protocol-magic-variables

If that works, we can definitely update this in the docs. We have tested a lot more with nginx than apache, so any improvements here would be welcome!

ptrourke commented 12 months ago

@ato 's suggestion worked perfectly.

tw4l commented 12 months ago

@ptrourke Glad to hear it! Are we okay to close this issue then?

ptrourke commented 12 months ago

Sure, I'll still submit a pull request later, but this issue is resolved.