webrecorder / warcit

Convert Directories, Files and ZIP Files to Web Archives (WARC)
https://pypi.python.org/pypi/warcit
Apache License 2.0

URLs of file names containing # are not escaped correctly #5

Open despens opened 6 years ago

despens commented 6 years ago

File names containing # end up unescaped in WARC-Target-URI, which may hint at other escaping issues.

Example:

WARC/1.0
WARC-Date: 2004-11-10T16:15:13Z
WARC-Source-URI: file://waste/images/17#.jpg
WARC-Created-Date: 2018-02-06T16:26:13Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:73015799-0b5a-11e8-9ac5-5ce0c57ec2e1>
WARC-Target-URI: http://heise.de/tp/kunst/waste/images/17#.jpg
WARC-Payload-Digest: sha1:GLC3CKKQ4LSVN4FD75TBXBOOAHA6WP6N
WARC-Block-Digest: sha1:GLC3CKKQ4LSVN4FD75TBXBOOAHA6WP6N
Content-Type: image/jpeg
Content-Length: 5222

Should be:

WARC/1.0
WARC-Date: 2004-11-10T16:15:13Z
WARC-Source-URI: file://waste/images/17#.jpg
WARC-Created-Date: 2018-02-06T16:26:13Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:73015799-0b5a-11e8-9ac5-5ce0c57ec2e1>
WARC-Target-URI: http://heise.de/tp/kunst/waste/images/17%23.jpg
WARC-Payload-Digest: sha1:GLC3CKKQ4LSVN4FD75TBXBOOAHA6WP6N
WARC-Block-Digest: sha1:GLC3CKKQ4LSVN4FD75TBXBOOAHA6WP6N
Content-Type: image/jpeg
Content-Length: 5222
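
For illustration, the expected escaping can be reproduced with Python's standard library. This is only a sketch of the idea, not warcit's actual code: `build_target_uri` is a hypothetical helper, and the prefix and path are taken from the record above.

```python
from urllib.parse import quote


def build_target_uri(prefix, rel_path):
    """Hypothetical helper: join a URL prefix and a relative file path,
    percent-encoding characters that are not valid in a URL path."""
    # quote() keeps "/" unescaped by default (safe="/") but encodes
    # reserved characters such as "#", "?" and spaces.
    return prefix.rstrip("/") + "/" + quote(rel_path.lstrip("/"))


print(build_target_uri("http://heise.de/tp/kunst/", "waste/images/17#.jpg"))
# → http://heise.de/tp/kunst/waste/images/17%23.jpg
```

Without the escaping, everything after the # would be read as a URL fragment, so the record would effectively be addressed as `.../waste/images/17`.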

despens commented 6 years ago

Fixed by https://github.com/webrecorder/warcit/pull/2

despens commented 6 years ago

Fixed in https://github.com/webrecorder/warcit/pull/2