webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

Indexing errors on Content-Type: multipart/form-data when "boundary" is missing #598

Closed: ldko closed this issue 3 years ago

ldko commented 3 years ago

Describe the bug

When indexing a WARC file that contains records whose Content-Type header is multipart/form-data with no "boundary" parameter (a well-formed header would look like multipart/form-data; boundary=----WebKitFormBoundaryrdRXu11VSoXKFFBV), the indexing fails at:

https://github.com/webrecorder/pywb/blob/7b51101b040628ce6ceddb7bd79440b03c0081d4/pywb/warcserver/inputrequest.py#L262

with ValueError: Invalid boundary in multipart form: b''
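
To illustrate the failure mode, here is a minimal sketch of how an indexer could detect a missing boundary before handing the body to a multipart parser. This is not pywb's code and not necessarily what the eventual fix does; the helper names `extract_boundary` and `can_parse_multipart` are hypothetical, and the parsing is deliberately simple (it does not handle a quoted boundary that contains a semicolon).

```python
from typing import Optional


def extract_boundary(content_type: str) -> Optional[str]:
    """Return the boundary parameter of a Content-Type value, or None.

    Hypothetical helper for illustration; not part of pywb.
    """
    # Skip the media type itself; only parameters after the first ';' matter.
    for param in content_type.split(";")[1:]:
        name, sep, value = param.strip().partition("=")
        if sep and name.strip().lower() == "boundary":
            # Drop optional surrounding quotes; treat an empty value as missing.
            return value.strip().strip('"') or None
    return None


def can_parse_multipart(content_type: str) -> bool:
    """True only for multipart/form-data that carries a usable boundary."""
    media_type = content_type.split(";")[0].strip().lower()
    return (
        media_type == "multipart/form-data"
        and extract_boundary(content_type) is not None
    )


if __name__ == "__main__":
    # The header from the failing records: no boundary parameter at all.
    print(can_parse_multipart("multipart/form-data"))
    # A well-formed header, as quoted in the report.
    print(can_parse_multipart(
        "multipart/form-data; boundary=----WebKitFormBoundaryrdRXu11VSoXKFFBV"
    ))
```

With a check like this, a caller could skip form-field parsing for records where it returns False and index the record without the POST fields, rather than letting the ValueError abort the whole run.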

Steps to reproduce the bug

Download this sample WARC (created with Brozzler) that contains records with Content-Type: multipart/form-data. Try to index the WARC.

Expected behavior

Indexing should complete without raising an exception, even when a record's Content-Type header is malformed (here, multipart/form-data without a boundary).

Environment

ldko commented 3 years ago

Fixed in #599.