Open tw4l opened 1 year ago
I had an attempt at documenting it while writing a Java implementation.
http://iipc.github.io/warc-specifications/guidelines/cdx-non-get-requests/
There's some interesting quirks to it. JSON null is encoded as the string "None". It can also produce slightly different output when run on old versions of Python (< 3.7).
Thank you @ato , this is very helpful! Documenting the quirks especially - I'll have to look into the Python < 3.7 issues!
We will be modifying pywb and warcio.js to be consistent - necessary issues have been opened. Notably, we'll be making pywb use native JSON values rather than Pythonic values (e.g. True
, None
) in the canonicalized URL.
We'll need to make sure that this isn't a breaking change and that already-canonicalized URLs created by pywb will either continue to work with pywb and wabac.js's fuzzy matching (preferred) or that we have a conversion process available.
@tw4l I have the modified outbackcdx from @ato using index version 5 running and indexed a warc using cdx-indexer -j -p
. The urlkeys in outbackcdx look good for POST request (they include __wb_method as well as the parameters as described in atos draft at http://iipc.github.io/warc-specifications/guidelines/cdx-non-get-requests/).
However, I can't get the replay of POST requests working. It seems the __wb_method and value parameters are not getting through to outbackcdx (latest pywb 2.7.4) in the request from pywb. Before looking too far - did I miss something in the config? (I'm using index_paths: cdx+http://outbackcdx:8080/nb-webarchive
)
Pywb needs this patch to pass them through: https://github.com/nla/pywb/commit/2bb97fc7081a3260d6fdf7f2d248e0dd51dd6129
It works very nicely now, thanks @ato!
Do you see any problems for using index version 5 in production? (besides having to recreate the index if the spec changes at a later time)
We run index version 5 it in production and haven't had any issues so far. The only reason it's not the default yet is because the upgrade process needs a bit more polishing.
We run index version 5 it in production and haven't had any issues so far. The only reason it's not the default yet is because the upgrade process needs a bit more polishing.
Sounds good! Mentioning upgrade... so there is a way to upgrade the index? (so far I only used a newly created index) It would of course simplify things a lot.
I've written up some notes about upgrading here: https://github.com/nla/outbackcdx/issues/117 If you have any further OutbackCDX questions please post them there as we've drifted off the topic of the POST canonicalization spec. :-)
@ato PR is in and would love your eyes on it if you have the bandwith: https://github.com/webrecorder/specs/pull/149
POST canonicalization (or POST append) is implemented in pywb, warcio.js, and cdxj-indexer, but doesn't have a written specification.
This was first implemented in pywb in https://github.com/webrecorder/pywb/pull/636. Various implementations are described here: https://github.com/webrecorder/replayweb.page/issues/69#issuecomment-935194637
A proper specification would likely help in resolving issues such as https://github.com/webrecorder/pywb/issues/768 as well as generally making it clear to users and developers the rationale and expected behavior for POST canonicalization.