webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

Missing documentation on how to use proxy with auth based collection selection #909

Open despens opened 3 months ago

despens commented 3 months ago

I would like to activate the selection of a collection in proxy mode via proxy auth. In this mode, a user would access a pywb instance via proxy, and instead of getting to a pre-configured default collection, the collection could be selected by typing its name into a proxy auth dialog.

The documentation doesn't explain how to activate this mode.

Describe the solution you'd like

I'd like a proxy mode in which the collection can be selected via proxy auth. As far as I know this mode exists but remains undocumented.

Describe alternatives you've considered

I am currently setting up multiple instances of pywb, each accepting connections on a different port and each with its own proxy configuration. These instances then use the Memento and CDX APIs to access different collections on my main pywb instance. This seems like overkill 😅

ato commented 3 months ago

I don't think pywb has this mode built in anymore. There's some old documentation on the wiki, but as far as I can tell the code for it has been removed.

The configuration guide mentions:

Extensions to pywb can override proxy_route_request() to provide custom handling, such as setting the collection dynamically or based on external data sources.

I'm not sure what the proper way to write an extension is, but putting this in a .py file and running it worked for me:

#!/usr/bin/env python
from base64 import b64decode

from pywb.apps.frontendapp import FrontEndApp
from pywb.utils.geventserver import GeventServer


def require_auth(env):
    # Return a prompt message while no Basic proxy credentials are present;
    # return None once the header is there so the request can be routed.
    proxy_auth = env.get('HTTP_PROXY_AUTHORIZATION')
    if proxy_auth is None or not proxy_auth.lower().startswith('basic '):
        return 'Collection as username, blank password'
    return None


class App(FrontEndApp):
    def proxy_route_request(self, url, env):
        # Header value has the form "Basic <base64 of username:password>";
        # the username is used as the collection name.
        auth = env['HTTP_PROXY_AUTHORIZATION'].split(' ', 1)[1]
        user_pass = b64decode(auth).decode('utf-8')
        coll = user_pass.split(':')[0]
        if '/' in coll:
            raise Exception("Username can't contain /")
        return '/{0}/bn_/{1}'.format(coll, url)

    # this is weird but it's what wsgiprox calls to prompt for auth
    proxy_route_request.require_auth = require_auth


if __name__ == '__main__':
    GeventServer(App(), port=8080, hostname='0.0.0.0').join()
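
To check the header handling without starting pywb, the parsing step can be tried on its own. This is only a rough standard-library sketch: `make_proxy_auth`, `collection_from_proxy_auth` and the collection name `my-coll` are hypothetical names for illustration, not part of pywb or wsgiprox.

```python
from base64 import b64decode, b64encode


def make_proxy_auth(collection, password=''):
    """Build the Proxy-Authorization value a client would send (hypothetical helper)."""
    token = b64encode('{0}:{1}'.format(collection, password).encode('utf-8'))
    return 'Basic ' + token.decode('ascii')


def collection_from_proxy_auth(header):
    """Return the username of a Basic auth header, treated as the collection name."""
    scheme, _, token = header.partition(' ')
    if scheme.lower() != 'basic' or not token:
        raise ValueError('expected a Basic Proxy-Authorization header')
    # Decoded credentials look like "username:password"; keep the username.
    username = b64decode(token.strip()).decode('utf-8').partition(':')[0]
    if not username or '/' in username:
        raise ValueError('collection name must be non-empty and contain no /')
    return username


# 'my-coll' is an assumed collection name, used only for illustration.
header = make_proxy_auth('my-coll')
coll = collection_from_proxy_auth(header)
route = '/{0}/bn_/{1}'.format(coll, 'http://example.com/')
print(route)  # /my-coll/bn_/http://example.com/
```

With the server from the script above listening on port 8080, a client such as curl should be able to select the collection through the proxy username, for example `curl -x http://localhost:8080 --proxy-user my-coll: http://example.com/`, assuming a collection named `my-coll` exists on that instance.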

despens commented 3 months ago

Thank you so much, @ato! I'll give this a try and report back.