webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

Route based on PATH_INFO? #4

Closed jcushman closed 10 years ago

jcushman commented 10 years ago

Speaking of REQUEST_URI, would it make sense to do the routing part with PATH_INFO instead of the full request_uri? For example, my main wsgi file routes /warc urls to pywb:

from werkzeug.wsgi import DispatcherMiddleware
application = DispatcherMiddleware(
    get_wsgi_application(), # Django
    {
        '/warc': warc_application # pywb
    }
)

Then a request to /warc/foo/bar?a=b comes in with env = {'SCRIPT_NAME': '/warc', 'PATH_INFO': '/foo/bar', 'QUERY_STRING': 'a=b'}

If my pywb routes then match against PATH_INFO, I can change the location of the whole application without needing to edit the routes.
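To make the point concrete, here is a minimal sketch (not pywb's actual router) of matching against PATH_INFO rather than the full request URI. Per the WSGI spec, DispatcherMiddleware moves the mount point into SCRIPT_NAME, so routes that only look at PATH_INFO keep working wherever the app is mounted:

```python
# Sketch: a PATH_INFO-based lookup is independent of the mount point.
# The mount prefix ('/warc' here) lives in SCRIPT_NAME; routes only
# ever see the remainder of the path.
def route(environ):
    script = environ.get('SCRIPT_NAME', '')   # e.g. '/warc'
    path = environ.get('PATH_INFO', '')       # e.g. '/foo/bar'
    query = environ.get('QUERY_STRING', '')   # e.g. 'a=b'
    return script, path, query

env = {'SCRIPT_NAME': '/warc', 'PATH_INFO': '/foo/bar', 'QUERY_STRING': 'a=b'}
print(route(env))  # ('/warc', '/foo/bar', 'a=b')
```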

ikreymer commented 10 years ago

Hm, I think that does make sense.. I was using REQUEST_URI because it included the query, but of course that can be retrieved from QUERY_STRING.

However, pywb also does its own routing, as there may be multiple paths that correspond to different cdx collections of warc data, other handlers, etc...

Also, I think there is currently an assumption that it is running at the server root; it would be nice to get rid of that.

A possible configuration might be:

wbparser = ArchivalRequestRouter(
    [
      MatchPrefix('coll1', coll1Handler),
      MatchPrefix('coll2', coll2Handler),
    ]
)

Currently, this leads to urls being handled and rewritten with /coll1/ and /coll2/ prefixes, e.g.: http://example.com -> /coll1/timestamp/http://example.com or http://example.com -> /coll2/timestamp/http://example.com

Hm, and it should probably add SCRIPT_NAME to the wb_prefix that is being rewritten?
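As a sketch of that idea (the function name and environ handling here are illustrative, not pywb's actual API), the rewrite prefix could be built from SCRIPT_NAME plus the matched collection, so rewritten links survive a re-mount:

```python
# Hypothetical sketch: include SCRIPT_NAME in the rewrite prefix so
# rewritten urls stay correct no matter where the app is mounted.
def wb_prefix(environ, coll):
    script = environ.get('SCRIPT_NAME', '')   # e.g. '/warc'
    return script + '/' + coll + '/'

print(wb_prefix({'SCRIPT_NAME': '/warc'}, 'coll1'))  # '/warc/coll1/'
print(wb_prefix({}, 'coll1'))                        # '/coll1/' at server root
```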

In your example, what is the url rewriting scheme that you'd be using?

do urls get rewritten with /warc/ or something else?

(I actually considered using werkzeug routing, but decided to do it manually for simplicity for now)

ikreymer commented 10 years ago

Ah just found: https://github.com/jcushman/perma/blob/dev.warc_server/perma_web/warc_server/globalwb.py

Trying to better understand the use case: are you not using the timestamped version of urls?

also, I just removed customParams and replaced it with queryFilter (it seemed safer), sorry.. I see you're using it; I can bring it back..

Hopefully, will try to stabilize the interface to cdx serving component soon.

So far, I've just been using it with our http-based cdx server, but it should also support a simple cdx stream returned locally. High on the list of things to implement :)

ikreymer commented 10 years ago

Let me know how this looks.. basically, in your globalwb, you should be able to just use

MatchRegex(r'([a-zA-Z0-9\-]+)', replay.WBHandler(query_handler, replay_handler))

in your declaration, and the warc prefix will be added as part of SCRIPT_NAME
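To illustrate how a MatchRegex-style route could resolve a request under this change (the names and structure below are a sketch, not pywb's actual implementation): the regex captures the id portion of PATH_INFO, while SCRIPT_NAME supplies the mount prefix for rewriting.

```python
import re

# Sketch of regex-based routing against PATH_INFO; the '/warc' mount
# point stays in SCRIPT_NAME and joins the rewrite prefix.
ROUTE = re.compile(r'/([a-zA-Z0-9\-]+)/(.*)')

def resolve(environ):
    m = ROUTE.match(environ.get('PATH_INFO', ''))
    if not m:
        return None
    coll, rest = m.group(1), m.group(2)
    prefix = environ.get('SCRIPT_NAME', '') + '/' + coll + '/'
    return prefix, rest

env = {'SCRIPT_NAME': '/warc', 'PATH_INFO': '/ABCD-1234/http://example.com'}
print(resolve(env))  # ('/warc/ABCD-1234/', 'http://example.com')
```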

Also added a way to pass in a custom ArchivalUrl subclass.. More changes and improvements coming :)

jcushman commented 10 years ago

Sweet, I'll try this out.

RE: "Trying to better understand the use case: are you not using the timestamped version of urls?", yes, that's right.

In our use case an author requests that a particular URL be archived at a particular moment, a unique ID is generated (like ABCD-1234), a .warc is archived for that ID, and then they review the archive and make sure it's correct before using it in their published work. If you take (let's say) a court decision that uses Perma, it will reference a unique ID each time the judge cites a webpage, which refers to a unique archive of the cited page, made and approved by the judge. So (in theory) if three users asked for the same page to be archived at the same moment, they would each have their own .warc generated with their own ID.

So then when someone requests a particular ID (like perma.cc/ABCD-1234), we already know that they want a particular URL and its assets, which will be stored in a particular .warc. So the cleanest way to serve up the .warc is to forward it to pywb at a url like /warc/ABCD-1234/[archived_url] (in an iframe) and include the ID in the CDX lookup, rather than messing around with datestamps.
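The id-keyed lookup described above might be sketched like this (the index structure, names, and paths here are purely illustrative, not Perma's or pywb's actual code): each unique ID maps to exactly one .warc, so no timestamp resolution is needed.

```python
# Hypothetical sketch: resolve an (id, url) pair straight to its warc,
# since each id corresponds to a single capture of a single url.
INDEX = {
    ('ABCD-1234', 'http://example.com/'): 'archives/ABCD-1234.warc',
}

def find_warc(perma_id, url):
    # Including the id in the lookup replaces timestamp-based resolution.
    return INDEX.get((perma_id, url))

print(find_warc('ABCD-1234', 'http://example.com/'))  # 'archives/ABCD-1234.warc'
```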

Thanks for thinking about this, and the tweaks you've already made to make it easier.

ikreymer commented 10 years ago

Hmm.. I see. That should work, but FWIW, there is a lot of value in the datestamp itself. For one, it is clearer to the user when the citation was crawled. Another, perhaps even more significant, point is that if you stick with the /datestamp/url format, the links are compatible with other archives (such as IA).

If Perma were down / unavailable for some reason, and a user is left with a link like /ABCD-1234/example.com, it would be much harder to retrieve that url from another source unless they also have the id ABCD-1234. But if it has the datestamp, you could cross-reference with other archives (like IA, uk web archive, etc...). I suppose you could do both, e.g. ABCD-1234-201410101010, and ignore the datestamp part for your lookup, but that would lead to longer urls. You guys probably thought about this before, so just my two cents :)

jcushman commented 10 years ago

Thanks -- feedback is always good. :)

It might make sense to have a long cite format at some point, but our primary use case right now is printed law journal articles and court decisions, where the bitly-like brevity is valuable, and the metadata is already there in more readable form. In context our links might look something like this in print:

Diversity and Inclusion, GOLDMAN SACHS, http://www.goldmansachs.com/who-we-are/diversity-and-inclusion/index.html (last visited May 10, 2013), archived at http://perma.cc/0W27YcMtFsf.

When you go to the permalink you get a friendly landing page with the date and a few different versions of the archive and so on. Improved landing-page design is arriving tomorrow hopefully. :)

So our target audience is not going to use us instead of the original link any time soon; we need to give them something short to add. And readers will also appreciate not having to type out http://perma.cc/201305101010/http://www.goldmansachs.com/who-we-are/diversity-and-inclusion/index.html

In terms of durability, this will shortly be mirrored across a network of libraries, so http://perma.law.harvard.edu/0W27YcMtFsf will still work if perma.cc is down. And we'll also have contingency plans set up so if something more serious happens, we can at least have a lightweight server forwarding to the best available alternative (hopefully IA :)