webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

Proper encoding of load_url #658

Closed maeb closed 3 years ago

maeb commented 3 years ago

Formatting of load_url does not properly encode the url parameter when it ends up in the query string of the configured url_field (replay_url):

https://github.com/webrecorder/pywb/blob/843fe28ed8cc497c3a11345243dbcfc288455337/pywb/warcserver/index/indexsource.py#L160-L162

Some URLs do not survive query parameter parsing unscathed when the url parameter is part of the query string of the load_url.

This seems to fix the issue:

        cdx[self.url_field] = res_template(self.replay_url, dict(url=cdx['url'],
                                                                 timestamp=cdx['timestamp'],
                                                                 src_coll=source_coll))

I believe this is a proper fix without breaking changes, but I am not sure. Shall I post a PR?
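
The failure mode described above can be reproduced outside pywb with a minimal sketch. It assumes a hypothetical template and target URL (`TEMPLATE`, `TARGET`) and a hypothetical helper `received_url` that mimics how a backend would parse the query string; none of these names come from pywb.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Hypothetical template and target URL, chosen only to illustrate the problem.
TEMPLATE = "http://example.com/resource?url={url}&closest={timestamp}"
TARGET = "http://example.org/search?a=1&b=2"


def received_url(load_url):
    """Return the 'url' query parameter as a receiving server would parse it."""
    query = urlsplit(load_url).query
    return parse_qs(query)["url"][0]


# Plain substitution: the '&' inside TARGET is read as a parameter separator.
raw = TEMPLATE.format(url=TARGET, timestamp="20200101000000")

# Percent-encoding the value first keeps it intact through query parsing.
encoded = TEMPLATE.format(url=quote(TARGET, safe=""), timestamp="20200101000000")

print(received_url(raw))      # truncated at the first '&'
print(received_url(encoded))  # identical to TARGET
```

With plain substitution the receiving side sees only `http://example.org/search?a=1`, and `b=2` is treated as a separate parameter of the outer request.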

maeb commented 3 years ago

Referencing my previous issue #656 here. That issue concerned encoding of the url parameter in the query string of the request between the frontend and the backend. This issue concerns encoding of the same parameter in the request between the backend and the warcserver (as configured via the replay_url).

ikreymer commented 3 years ago

Just to confirm, this was for use with OutbackCDX, right? Or some other configuration?

maeb commented 3 years ago

We use our own indexer and loader backend: https://github.com/nlnwa/gowarcserver

maeb commented 3 years ago

Our config looks something like:

collections:
  veidemann:
    index:
      type: cdx
      api_url: http://gowarcserver:9999/warcserver/all/index?url={url}&closest={closest}
      replay_url: http://gowarcserver:9999/warcserver/all/resource?url={url}&closest={timestamp}&output=content
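
For a replay_url like the one in this config, one possible way to get an encoded load URL is a formatter that percent-encodes each substituted value. This is only a sketch and not pywb's actual implementation: the class name `QueryEncodingFormatter` and the sample capture values are assumptions made for illustration.

```python
from string import Formatter
from urllib.parse import quote


class QueryEncodingFormatter(Formatter):
    """Hypothetical formatter that percent-encodes every substituted value."""

    def format_field(self, value, format_spec):
        # Apply the normal formatting first, then encode the result so that
        # characters such as '&', '?' and '=' cannot break the query string.
        return quote(super().format_field(value, format_spec), safe="")


# replay_url template taken from the config above.
REPLAY_URL = ("http://gowarcserver:9999/warcserver/all/resource"
              "?url={url}&closest={timestamp}&output=content")

# Sample capture values, assumed for illustration only.
load_url = QueryEncodingFormatter().format(
    REPLAY_URL,
    url="http://example.org/?a=1&b=2",
    timestamp="20200101000000",
)
print(load_url)
```

Encoding unconditionally like this would probably not be safe for templates that place `{url}` in the path rather than the query string, since the slashes would be encoded too, so a real fix would likely need to distinguish the two cases.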