webrecorder / cdxj-indexer

CDXJ Indexing of WARC/ARCs
Apache License 2.0

Revisit records with POST requests lack a POST append in their URL key #22

Open ARiedijk opened 1 year ago

ARiedijk commented 1 year ago

When cdxj-indexer indexes a page that issues several different HTTP POST requests that all receive the same response, it appends the POST data to the URL key only for the response record. The later, identical responses are stored as revisit records, and their URL keys are written without the POST append.

Expected result when running cdxj-indexer:

com,example)/inc/postdatastatic.php?__wb_method=post&body=counter0&rnd=result1-0 20230118150449 {... "mime": "text/html"}
com,example)/inc/postdatastatic.php?__wb_method=post&body=counter0&rnd=result1-1 20230118150449 {... "mime": "warc/revisit"}

Actual result:

com,example)/inc/postdatastatic.php?__wb_method=post&body=counter0&rnd=result1-0 20230118150449 {... "mime": "text/html"}
com,example)/inc/postdatastatic.php?rnd=result1-1 20230118150449 {... "mime": "warc/revisit"}
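To make the expected behaviour concrete, here is a minimal sketch. It is not the cdxj-indexer code and does not reflect the attached patch. The functions `url_key` and `index_keys` and the record dictionaries are hypothetical, and the key building is a simplified stand-in for real SURT canonicalization. The only point is that the POST append is applied to revisit records in the same way as to response records.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode


def url_key(url, method="GET", post_body=""):
    """Build a simplified SURT-style key (hypothetical helper).

    For POST requests, "__wb_method=post" and the form-encoded body
    parameters are merged into the query string before sorting.
    """
    parts = urlsplit(url)
    host = ",".join(reversed(parts.hostname.split(".")))
    params = parse_qsl(parts.query, keep_blank_values=True)
    if method.upper() == "POST":
        params.append(("__wb_method", "post"))
        params.extend(parse_qsl(post_body, keep_blank_values=True))
    query = urlencode(sorted(params))
    key = f"{host}){parts.path}"
    return f"{key}?{query}" if query else key


def index_keys(records):
    """Return (url key, mime) pairs, treating revisit like response."""
    keys = []
    for rec in records:
        # Assumption: each record dict already carries the method and
        # body of the request it was paired with.
        if rec["type"] in ("response", "revisit"):
            key = url_key(rec["url"], rec["method"], rec["post_body"])
            keys.append((key, rec["mime"]))
    return keys


# Hypothetical records mirroring the two entries in this report.
records = [
    {"type": "response", "mime": "text/html", "method": "POST",
     "url": "http://example.com/inc/postdatastatic.php?rnd=result1-0",
     "post_body": "body=counter0"},
    {"type": "revisit", "mime": "warc/revisit", "method": "POST",
     "url": "http://example.com/inc/postdatastatic.php?rnd=result1-1",
     "post_body": "body=counter0"},
]

for key, mime in index_keys(records):
    print(key, mime)
```

With this approach both keys carry `__wb_method=post&body=counter0`, so a replay lookup for the second POST request can match its revisit record.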

Attached patch: fix_postappend-revist.patch