Closed protonpopsicle closed 10 years ago
contents of our config file:
# pywb config file
# ========================================
#
# Settings for each collection
collections:
# <name>: <cdx_path>
# collection will be accessed via /<name>
# <cdx_path> is a string or list of:
# - string or list of one or more local .cdx file
# - string or list of one or more local dirs with .cdx files
# - a string value indicating remote http cdx server
ArtBase: /Users/rhiz/Desktop/my_archive/cdx/
# ex with filtering: filter CDX lines by filename starting with 'dupe'
#pywb-filt: {'index_paths': './sample_archive/cdx/', 'filters': ['filename:dupe*']}
# indicate if cdx files are sorted by SURT keys -- eg: com,example)/
# SURT keys are recommended for future indices, but non-SURT cdxs
# are also supported
#
# * Set to true if cdxs start with surts: com,example)/
# * Set to false if cdx start with urls: example.com)/
#
# default:
# surt_ordered: true
# list of paths prefixes for pywb look to 'resolve' WARC and ARC filenames
# in the cdx to their absolute path
#
# if path is:
# * local dir, use path as prefix
# * local file, lookup prefix in tab-delimited sorted index
# * http:// path, use path as remote prefix
# * redis:// path, use redis to lookup full path for w:<warc> as key
archive_paths: /Users/rhiz/Desktop/my_archive/warcs/
# The following are default settings -- uncomment to change
# Set to '' to disable the ui
# ==== UI: HTML/Jinja2 Templates ====
# template for <head> insert into replayed html content
#head_insert_html: ui/head_insert.html
# template to for 'calendar' query,
# eg, a listing of captures in response to a ../*/<url>
#
# may be a simple listing or a more complex 'calendar' UI
# if omitted, will list raw cdx in plain text
#query_html: ui/query.html
# template for search page, which is displayed when no search url is entered
# in a collection
#search_html: ui/search.html
# template for home page.
# if no other route is set, this will be rendered at /, /index.htm and /index.html
#home_html: ui/index.html
# error page temlpate for may formatting error message and details
# if omitted, a text response is returned
#error_html: ui/error.html
# ==== Other Paths ====
# list of host names that pywb will be running from to detect
# 'fallthrough' requests based on referrer
#
# eg: an incorrect request for http://localhost:8080/image.gif with a referrer
# of http://localhost:8080/pywb/index.html, pywb can correctly redirect
# to http://localhost:8080/pywb/image.gif
#
#hostpaths: ['http://localhost:8080']
# Rewrite urls with absolute paths instead of relative
#absoulte_paths: true
# List of route names:
# <route>: <package or file path>
# default route static/default for pywb defaults
static_routes:
static/default: pywb/static/
# ==== New / Experimental Settings ====
# Not yet production ready -- used primarily for testing
# Enable simple http proxy mode
enable_http_proxy: true
# enable cdx server api for querying cdx directly (experimental)
enable_cdx_api: true
# custom rules for domain specific matching
# set to false to disable
#domain_specific_rules: rules.yaml
# Memento support, enable
enable_memento: true
# Use lxml parser, if available
use_lxml_parser: false
# Replay content in an iframe
framed_replay: true
Issue was caused by improper encoding detection. To solve this issue, and potentially others, switching to just using raw bytes for html rewriting, as suggested by @despens Since most encodings are ascii compatible, this should lead to better results. Will need to detect UTF-16 and other rare encodings and properly decode them, but in general seems like this will work.
Implemented in 70b7e29b366036a73b0874357500f1a14c8384c4, to be part of 0.4.7 release
see stack trace below. we took a warc from our collection, indexed and visited the url in a locally running pywayback. this warc was made by wget (we can send the file via email but it is too big to upload here). other warcs we have tried that we created using webrecorder.io work perfectly. we're on v0.4.5