webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

SCRIPT_NAME environment variable undefined? #39

Closed protonpopsicle closed 10 years ago

protonpopsicle commented 10 years ago

The index page loads fine, but a request to mydomain.com/pywb/*/example.com returns this error page:

Pywb Error

'SCRIPT_NAME'
Error Details:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/wsgi_wrappers.py", line 62, in __call__
    response = wb_router(env)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/proxy.py", line 28, in __call__
    response = super(ProxyArchivalRouter, self).__call__(env)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/archivalrouter.py", line 33, in __call__
    result = route(env, self.abs_path)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/archivalrouter.py", line 77, in __call__
    wbrequest = self.parse_request(env, use_abs_prefix)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/archivalrouter.py", line 90, in parse_request
    rel_prefix = env['SCRIPT_NAME'] + '/' + matched_str + '/'
KeyError: 'SCRIPT_NAME'

I'm running pywb 0.4.7 (installed w/ pip) via uWSGI behind Nginx.

Nginx server block

upstream pywb {
    server 127.0.0.1:8001;
}

server {
    listen 80 default_server;
    listen [::]:80 default_server ipv6only=on;

    location / {
        uwsgi_pass pywb;
        include /etc/nginx/uwsgi_params;
    }
}

contents of uwsgi_params: https://github.com/phusion/nginx/blob/master/conf/uwsgi_params

command to run uWSGI: $ /usr/local/bin/uwsgi --ini /etc/pywb/wsgi.ini

contents of /etc/pywb/wsgi.ini

[uwsgi]
socket = :8001
master = true
processes = 10
buffer-size = 65536
die-on-term = true

# specify config file here
env = PYWB_CONFIG_FILE=/etc/pywb/config.yaml
chdir = /usr/local/lib/python2.7/dist-packages/pywb/
wsgi = pywb.apps.wayback

contents of /etc/pywb/config.yaml

# pywb config file
# ========================================
#
# Settings for each collection

collections:
    # <name>: <cdx_path>
    # collection will be accessed via /<name>
    # <cdx_path> is a string or list of:
    #  - string or list of one or more local .cdx file
    #  - string or list of one or more local dirs with .cdx files
    #  - a string value indicating remote http cdx server
    pywb: /my_archive/cdx/

    # ex with filtering: filter CDX lines by filename starting with 'dupe'
    #pywb-filt: {'index_paths': './sample_archive/cdx/', 'filters': ['filename:dupe*']}

# indicate if cdx files are sorted by SURT keys -- eg: com,example)/
# SURT keys are recommended for future indices, but non-SURT cdxs
# are also supported
#
#   * Set to true if cdxs start with surts: com,example)/
#   * Set to false if cdx start with urls: example.com)/
#
# default:
# surt_ordered: true

# list of path prefixes that pywb uses to 'resolve' WARC and ARC filenames
# in the cdx to their absolute path
#
# if path is:
#   * local dir, use path as prefix
#   * local file, lookup prefix in tab-delimited sorted index
#   * http:// path, use path as remote prefix
#   * redis:// path, use redis to lookup full path for w:<warc> as key

archive_paths: /my_archive/warcs/

# The following are default settings -- uncomment to change
# Set to '' to disable the ui

# ==== UI: HTML/Jinja2 Templates ====

# template for <head> insert into replayed html content
#head_insert_html: ui/head_insert.html

# template for 'calendar' query,
# eg, a listing of captures in response to a ../*/<url>
#
# may be a simple listing or a more complex 'calendar' UI
# if omitted, will list raw cdx in plain text
#query_html: ui/query.html

# template for search page, which is displayed when no search url is entered
# in a collection
#search_html: ui/search.html

# template for home page.
# if no other route is set, this will be rendered at /, /index.htm and /index.html
#home_html: ui/index.html

# error page template for formatting the error message and details
# if omitted, a text response is returned
#error_html: ui/error.html

# ==== Other Paths ====

# list of host names that pywb will be running from to detect
# 'fallthrough' requests based on referrer
#
# eg: an incorrect request for http://localhost:8080/image.gif with a referrer
# of http://localhost:8080/pywb/index.html, pywb can correctly redirect
# to http://localhost:8080/pywb/image.gif
#

#hostpaths: ['http://localhost:8080']

# Rewrite urls with absolute paths instead of relative
#absoulte_paths: true

# List of route names:
# <route>: <package or file path>
# default route static/default for pywb defaults
static_routes:
    static/default: pywb/static/

# ==== New / Experimental Settings ====
# Not yet production ready -- used primarily for testing

# Enable simple http proxy mode
enable_http_proxy: true

# enable cdx server api for querying cdx directly (experimental)
enable_cdx_api: true

# custom rules for domain specific matching
# set to false to disable
#domain_specific_rules: rules.yaml

# Memento support, enable
enable_memento: true

# Replay content in an iframe
framed_replay: true

protonpopsicle commented 10 years ago

I don't have this problem when running via SimpleHTTPServer.

ikreymer commented 10 years ago

For whatever reason, the stock uwsgi_params file does not define the standard SCRIPT_NAME variable, presumably to force the user to define it.

It seems that others have also run into this: https://bitbucket.org/akorn/wheezy.web/issue/2/keyerror-script_name-nginx-uwsgi

adding:

uwsgi_param SCRIPT_NAME '';

after the include uwsgi_params should fix it for you.
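If editing the nginx config is not an option, the same default can be supplied on the Python side. The following is a minimal sketch of a generic WSGI middleware, not anything pywb ships: `DefaultScriptName` and `demo_app` are hypothetical names, and `demo_app` only stands in for the real application.

```python
class DefaultScriptName(object):
    """WSGI middleware that fills in SCRIPT_NAME when the server omits it."""

    def __init__(self, app, default=''):
        self.app = app
        self.default = default

    def __call__(self, environ, start_response):
        # setdefault leaves any value the server did provide untouched
        environ.setdefault('SCRIPT_NAME', self.default)
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    # Hypothetical stand-in for the real application; it reads
    # SCRIPT_NAME directly, as the failing line in the traceback does.
    body = ('prefix=' + environ['SCRIPT_NAME'] + '/pywb/').encode('utf-8')
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [body]


# The wrapped callable is what the WSGI server would be pointed at.
application = DefaultScriptName(demo_app)
```

With the wrapper in place, the direct `environ['SCRIPT_NAME']` lookup no longer raises a KeyError when the server omits the variable.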

PEP 3333 permits a server to omit SCRIPT_NAME when its value would be an empty string, so pywb should not assume the key is present. There's no harm in changing the key lookup in pywb from:

env['SCRIPT_NAME']

to env.get('SCRIPT_NAME', '')

to avoid this error altogether in the future.

I'll add that to next release.
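For illustration, here is a small sketch of the tolerant lookup applied to the prefix expression from the traceback. `build_rel_prefix` is a hypothetical helper name used only for this example; the actual change in pywb may be structured differently.

```python
def build_rel_prefix(env, matched_str):
    """Build the relative prefix without assuming SCRIPT_NAME is set."""
    # env.get() falls back to '' when the server did not pass SCRIPT_NAME,
    # where env['SCRIPT_NAME'] would raise KeyError.
    return env.get('SCRIPT_NAME', '') + '/' + matched_str + '/'


# SCRIPT_NAME missing, as with the stock uwsgi_params
print(build_rel_prefix({}, 'pywb'))  # /pywb/

# SCRIPT_NAME provided by the server
print(build_rel_prefix({'SCRIPT_NAME': '/archive'}, 'pywb'))  # /archive/pywb/
```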

protonpopsicle commented 10 years ago

great, figured as much. thanks for the quick response!