webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

Broken URI archive_paths support #560

Open anjackson opened 4 years ago

anjackson commented 4 years ago

This change breaks our archive_paths: "webhdfs://server/" because `os.path.join` just discards the prefix when the suffix is an absolute path.

https://github.com/webrecorder/pywb/blob/92e459bda52a2b03f33a4b0b8094ed424248d2a5/pywb/warcserver/resource/pathresolvers.py#L40

ikreymer commented 4 years ago

Hm, not sure I understand.. This seems as expected:

The filename should generally be a relative path:

>>> os.path.join('webhdfs://server/', 'filename.warc')
'webhdfs://server/filename.warc'

Though, if it needs to be absolute, then archive_paths: '' should work:

>>> os.path.join('', 'webhdfs://filename.warc')
'webhdfs://filename.warc'

Or do you have a mix of absolute and relative? Then this would be problematic:

>>> os.path.join('webhdfs://server/', 'webhdfs://server/filename.warc')
'webhdfs://server/webhdfs://server/filename.warc'

anjackson commented 4 years ago

The problem is, ours look like this:

os.path.join('webhdfs://server', '/file/path/on/hdfs.warc.gz')

which gives /file/path/on/hdfs.warc.gz but the old code gave webhdfs://server/file/path/on/hdfs.warc.gz.
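For anyone reproducing this, here is a minimal sketch of the difference. It uses `posixpath.join` so the result is the same on any platform (on POSIX systems `os.path.join` is the same function). The plain string concatenation is only an assumption about what the old code effectively did, inferred from the output quoted above.

```python
import posixpath

prefix = "webhdfs://server"
filename = "/file/path/on/hdfs.warc.gz"

# join() treats the second argument as an absolute path and drops
# everything before it, including the URI scheme and host.
joined = posixpath.join(prefix, filename)

# Assumed old behavior: plain concatenation keeps the prefix.
concatenated = prefix + filename

print(joined)
print(concatenated)
```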

ikreymer commented 4 years ago

Ah, I see. Hm, perhaps we should just keep the old behavior for now. The change was designed to deal with edge cases where the slash is missing.
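One possible middle ground, sketched here purely as an illustration: a join that adds a missing slash but never discards the prefix. The helper name `join_prefix` is hypothetical and is not part of pywb. Note that it would also prepend a non-empty prefix to filenames that are meant to be absolute local paths, so it may not suit every setup.

```python
def join_prefix(prefix: str, filename: str) -> str:
    """Hypothetical helper: join prefix and filename with exactly one
    slash between them, always keeping the prefix."""
    if not prefix:
        # An empty prefix leaves the filename untouched, so fully
        # qualified filenames still work with archive_paths: ''.
        return filename
    # Drop leading slashes so the filename cannot override the prefix.
    name = filename.lstrip("/")
    if prefix.endswith("/"):
        return prefix + name
    return prefix + "/" + name


print(join_prefix("webhdfs://server", "/file/path/on/hdfs.warc.gz"))
print(join_prefix("webhdfs://server", "filename.warc"))
```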