webrecorder / specs

Specifications developed and maintained by the Webrecorder community.
https://specs.webrecorder.net
124 stars 14 forks source link

New spec: Request body canonicalization #141

Open tw4l opened 1 year ago

tw4l commented 1 year ago

POST canonicalization (or POST append) is implemented in pywb, warcio.js, and cdxj-indexer, but doesn't have a written specification.

This was first implemented in pywb in https://github.com/webrecorder/pywb/pull/636. Various implementations are described here: https://github.com/webrecorder/replayweb.page/issues/69#issuecomment-935194637

A proper specification would likely help in resolving issues such as https://github.com/webrecorder/pywb/issues/768 as well as generally making it clear to users and developers the rationale and expected behavior for POST canonicalization.

ato commented 1 year ago

I had an attempt at documenting it while writing a Java implementation.

http://iipc.github.io/warc-specifications/guidelines/cdx-non-get-requests/

There's some interesting quirks to it. JSON null is encoded as the string "None". It can also produce slightly different output when run on old versions of Python (< 3.7).

tw4l commented 1 year ago

Thank you @ato , this is very helpful! Documenting the quirks especially - I'll have to look into the Python < 3.7 issues!

tw4l commented 1 year ago

We will be modifying pywb and warcio.js to be consistent - necessary issues have been opened. Notably, we'll be making pywb use native JSON values rather than Pythonic values (e.g. True, None) in the canonicalized URL.

We'll need to make sure that this isn't a breaking change and that already-canonicalized URLs created by pywb will either continue to work with pywb and wabac.js's fuzzy matching (preferred) or that we have a conversion process available.

kaij commented 1 year ago

@tw4l I have the modified outbackcdx from @ato using index version 5 running and indexed a warc using cdx-indexer -j -p. The urlkeys in outbackcdx look good for POST request (they include __wb_method as well as the parameters as described in atos draft at http://iipc.github.io/warc-specifications/guidelines/cdx-non-get-requests/).

However, I can't get the replay of POST requests working. It seems the __wb_method and value parameters are not getting through to outbackcdx (latest pywb 2.7.4) in the request from pywb. Before looking too far - did I miss something in the config? (I'm using index_paths: cdx+http://outbackcdx:8080/nb-webarchive)

ato commented 1 year ago

Pywb needs this patch to pass them through: https://github.com/nla/pywb/commit/2bb97fc7081a3260d6fdf7f2d248e0dd51dd6129

kaij commented 1 year ago

It works very nicely now, thanks @ato!

Do you see any problems for using index version 5 in production? (besides having to recreate the index if the spec changes at a later time)

ato commented 1 year ago

We run index version 5 it in production and haven't had any issues so far. The only reason it's not the default yet is because the upgrade process needs a bit more polishing.

kaij commented 1 year ago

We run index version 5 it in production and haven't had any issues so far. The only reason it's not the default yet is because the upgrade process needs a bit more polishing.

Sounds good! Mentioning upgrade... so there is a way to upgrade the index? (so far I only used a newly created index) It would of course simplify things a lot.

ato commented 1 year ago

I've written up some notes about upgrading here: https://github.com/nla/outbackcdx/issues/117 If you have any further OutbackCDX questions please post them there as we've drifted off the topic of the POST canonicalization spec. :-)

tw4l commented 7 months ago

@ato PR is in and would love your eyes on it if you have the bandwith: https://github.com/webrecorder/specs/pull/149