webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

encoding issue - failing to playback warc #38

Closed protonpopsicle closed 10 years ago

protonpopsicle commented 10 years ago

See the stack trace below. We took a WARC from our collection, indexed it, and visited the URL in a locally running pywayback. This WARC was made by wget (we can send the file via email, but it is too big to upload here). Other WARCs we have tried, which we created using webrecorder.io, work perfectly. We're on v0.4.5.

Pywb Error

'utf8' codec can't decode byte 0xf1 in position 12562: invalid continuation byte
Error Details:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py", line 62, in __call__
    response = wb_router(env)
  File "/usr/local/lib/python2.7/site-packages/pywb/framework/proxy.py", line 28, in __call__
    response = super(ProxyArchivalRouter, self).__call__(env)
  File "/usr/local/lib/python2.7/site-packages/pywb/framework/archivalrouter.py", line 33, in __call__
    result = route(env, self.abs_path)
  File "/usr/local/lib/python2.7/site-packages/pywb/framework/archivalrouter.py", line 78, in __call__
    return self.handler(wbrequest) if wbrequest else None
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/handlers.py", line 41, in __call__
    cdx_callback)
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 81, in __call__
    return self.render_content(wbrequest, *args)
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 159, in render_content
    failed_files)
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 224, in replay_capture
    response_iter)
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 242, in buffered_response
    for buff in iterator:
  File "/usr/local/lib/python2.7/site-packages/pywb/rewrite/rewrite_content.py", line 224, in stream_to_gen
    buff = rewrite_func(buff)
  File "/usr/local/lib/python2.7/site-packages/pywb/rewrite/rewrite_content.py", line 151, in do_rewrite
    buff = self._decode_buff(buff, stream, encoding)
  File "/usr/local/lib/python2.7/site-packages/pywb/rewrite/rewrite_content.py", line 179, in _decode_buff
    buff = buff.decode(encoding)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 12562: invalid continuation byte
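
For readers unfamiliar with this error, the failure mode can be reproduced outside pywb. The following is a minimal sketch in Python 3 (the traceback above is from Python 2.7), assuming the archived page was actually served in a single-byte encoding such as Latin-1 but was decoded as UTF-8. The sample string and the `try_decode` helper are hypothetical and are not taken from the WARC in question or from pywb.

```python
# Hypothetical sample text: in Latin-1 the character U+00F1 (n with tilde)
# is the single byte 0xF1, the same byte value reported in the traceback.
latin1_bytes = "ma\u00f1ana".encode("latin-1")


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# Decoding with the wrong codec fails: in UTF-8, 0xF1 announces a multi-byte
# sequence, but the following byte (0x61, "a") is not a continuation byte.
text, err = try_decode(latin1_bytes, "utf-8")

# Decoding with the codec the bytes were written in succeeds.
good, _ = try_decode(latin1_bytes, "latin-1")

if __name__ == "__main__":
    print("utf-8 result:", text, "| failed at byte offset:", err.start)
    print("latin-1 result:", good)
```

The point of the sketch is only that a wrongly detected encoding is enough to abort the whole replay, even when the rest of the page is plain ASCII.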

protonpopsicle commented 10 years ago

contents of our config file:

# pywb config file
# ========================================
#
# Settings for each collection

collections:
    # <name>: <cdx_path>
    # collection will be accessed via /<name>
    # <cdx_path> is a string or list of:
    #  - string or list of one or more local .cdx file
    #  - string or list of one or more local dirs with .cdx files
    #  - a string value indicating remote http cdx server
    ArtBase: /Users/rhiz/Desktop/my_archive/cdx/

    # ex with filtering: filter CDX lines by filename starting with 'dupe'
    #pywb-filt: {'index_paths': './sample_archive/cdx/', 'filters': ['filename:dupe*']}

# indicate if cdx files are sorted by SURT keys -- eg: com,example)/
# SURT keys are recommended for future indices, but non-SURT cdxs
# are also supported
#
#   * Set to true if cdxs start with surts: com,example)/
#   * Set to false if cdx start with urls: example.com)/
#
# default:
# surt_ordered: true

# list of path prefixes that pywb uses to 'resolve' WARC and ARC filenames
# in the cdx to their absolute path
#
# if path is:
#   * local dir, use path as prefix
#   * local file, lookup prefix in tab-delimited sorted index
#   * http:// path, use path as remote prefix
#   * redis:// path, use redis to lookup full path for w:<warc> as key

archive_paths: /Users/rhiz/Desktop/my_archive/warcs/

# The following are default settings -- uncomment to change
# Set to '' to disable the ui

# ==== UI: HTML/Jinja2 Templates ====

# template for <head> insert into replayed html content
#head_insert_html: ui/head_insert.html

# template for 'calendar' query,
# eg, a listing of captures  in response to a ../*/<url>
#
# may be a simple listing or a more complex 'calendar' UI
# if omitted, will list raw cdx in plain text
#query_html: ui/query.html

# template for search page, which is displayed when no search url is entered
# in a collection
#search_html: ui/search.html

# template for home page.
# if no other route is set, this will be rendered at /, /index.htm and /index.html
#home_html: ui/index.html

# error page template for formatting the error message and details
# if omitted, a text response is returned
#error_html: ui/error.html

# ==== Other Paths ====

# list of host names that pywb will be running from to detect
# 'fallthrough' requests based on referrer
#
# eg: an incorrect request for http://localhost:8080/image.gif with a referrer
# of http://localhost:8080/pywb/index.html, pywb can correctly redirect
# to http://localhost:8080/pywb/image.gif
#

#hostpaths: ['http://localhost:8080']

# Rewrite urls with absolute paths instead of relative
#absoulte_paths: true

# List of route names:
# <route>: <package or file path>
# default route static/default for pywb defaults
static_routes:
          static/default: pywb/static/

# ==== New / Experimental Settings ====
# Not yet production ready -- used primarily for testing

# Enable simple http proxy mode
enable_http_proxy: true

# enable cdx server api for querying cdx directly (experimental)
enable_cdx_api: true

# custom rules for domain specific matching
# set to false to disable
#domain_specific_rules: rules.yaml

# Memento support, enable
enable_memento: true

# Use lxml parser, if available
use_lxml_parser: false

# Replay content in an iframe
framed_replay: true

ikreymer commented 10 years ago

The issue was caused by improper encoding detection. To solve this issue, and potentially others, I'm switching to using raw bytes for HTML rewriting, as suggested by @despens. Since most encodings are ASCII-compatible, this should lead to better results. UTF-16 and other rare encodings will still need to be detected and properly decoded, but in general this seems like it will work.
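
To illustrate the idea of rewriting on raw bytes, here is a rough Python 3 sketch. It is not the code from pywb and makes several assumptions: the `PREFIX` value, the `ATTR_RX` pattern, and the function names are all hypothetical, only `href` and `src` attributes with absolute http(s) URLs are handled, and UTF-16 is detected solely by its byte order mark.

```python
import codecs
import re

# Hypothetical replay prefix, made up for this example (not a pywb default).
PREFIX = b"/ArtBase/20140101000000/"

# Bytes pattern: an href/src attribute whose value starts with http(s)://.
# Markup is ASCII in every ASCII-compatible encoding, so no decoding is needed.
ATTR_RX = re.compile(rb'''(\b(?:href|src)\s*=\s*["'])(https?://)''', re.IGNORECASE)


def rewrite_ascii_compatible(buff):
    """Prefix matched URLs; all other bytes pass through untouched."""
    return ATTR_RX.sub(lambda m: m.group(1) + PREFIX + m.group(2), buff)


def rewrite_html_bytes(buff):
    """Rewrite raw bytes, decoding only when a UTF-16 BOM is present."""
    # Assumption: a UTF-32 LE BOM also begins with FF FE and is not
    # distinguished here.
    for bom, codec in ((codecs.BOM_UTF16_LE, "utf-16-le"),
                       (codecs.BOM_UTF16_BE, "utf-16-be")):
        if buff.startswith(bom):
            # UTF-16 is not ASCII compatible: transcode, rewrite, transcode back.
            as_utf8 = buff[len(bom):].decode(codec).encode("utf-8")
            rewritten = rewrite_ascii_compatible(as_utf8)
            return bom + rewritten.decode("utf-8").encode(codec)
    return rewrite_ascii_compatible(buff)


if __name__ == "__main__":
    # The 0xF1 byte that broke UTF-8 decoding survives unchanged.
    page = b'<a href="http://example.com/">ma\xf1ana</a>'
    print(rewrite_html_bytes(page))
```

Because the rewriter never decodes ASCII-compatible input, a wrong or missing charset declaration cannot raise a `UnicodeDecodeError`; the trade-off is that encodings that are not ASCII compatible need explicit detection, as in the BOM check above.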

ikreymer commented 10 years ago

Implemented in 70b7e29b366036a73b0874357500f1a14c8384c4, to be part of 0.4.7 release