webrecorder / cdxj-indexer

CDXJ Indexing of WARC/ARCs
Apache License 2.0

SURT are not created for HTTP CONNECT requests in WARC file #20

Open ARiedijk opened 1 year ago

ARiedijk commented 1 year ago

Hi, we are using the cdxj-indexer tool and found that, when replaying our WACZ files in the ReplayWeb.page player, certain resources were sometimes not found even though they were present in the WARC files.

It turned out that our WARC files contain CONNECT requests, and these are not converted to a SURT. For example, url=distillery.wistia.com:443 is still distillery.wistia.com:443 after the surt.surt(url) call. The ReplayWeb.page player checks whether index.idx contains SURT keys using useSurt = prefix.indexOf(")/") > 0; in MultiWacz.js. If the last line of a block happens to be a CONNECT entry, that block is treated as surt = false in the CDX. Querying the browser DB with the upperBound method then does not work properly.
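
To make the detection problem concrete, here is a minimal Python sketch of the check quoted above. It is only a rendering of the one-line expression from the issue, not the player's actual code; the function name looks_like_surt and the extra org,example)/ key are made up for illustration.

```python
def looks_like_surt(prefix: str) -> bool:
    """Python rendering of the quoted check: prefix.indexOf(")/") > 0.

    str.find, like JavaScript's indexOf, returns -1 when the substring
    is missing.
    """
    return prefix.find(")/") > 0


# Key produced for the CONNECT record, as reported in this issue.
connect_key = "distillery.wistia.com:443"
# Key the reporter expected instead.
surt_key = "com,wistia,distillery)/"

surt_detected = looks_like_surt(surt_key)
connect_detected = looks_like_surt(connect_key)

# Index lines are sorted lexicographically, so a non-SURT key is interleaved
# with SURT keys and can land on any line of a block, including the last one.
# "org,example)/" is a hypothetical neighbouring key.
sorted_keys = sorted([surt_key, connect_key, "org,example)/"])

if __name__ == "__main__":
    print(surt_detected, connect_detected)
    print(sorted_keys)
```

The CONNECT key contains no ")/" at all, so any block whose sampled line carries it fails the check, even when every other line in that block is a proper SURT.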

Given:

A WARC file with:

WARC/1.0
Content-Length: 308
Content-Type: application/http;msgtype=request
WARC-Block-Digest: sha1:XDTRC67IG3EYGKYRBFK7BOYLBRJHW52X
WARC-Date: 2022-09-14T14:45:01Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Record-ID: <urn:uuid:d083e59a-e1c5-4079-bb20-cf6115fa342d>
WARC-Target-URI: distillery.wistia.com:443
WARC-Type: request

CONNECT distillery.wistia.com:443 HTTP/1.1
Accept-Encoding: *, compress;q=0, br;q=0
Content-Length: 0
Host: distillery.wistia.com:443
Proxy-Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/105.0.5195.102 Safari/537.36

When running the cdxj_indexer with the following parameters:

main.py -p -o index.idx -c index.cdx.gz -s -d -l 1024 small.warc

Then the result in the index is:

!meta 0 {"format": "cdxj-gzip-1.0", "filename": "c:\\temp\\index.cdx.gz"}
distillery.wistia.com:443 20220914144501 {"offset": 0, "length": 371, "digest": "sha256:8e8d3aa0f13b077615de09a2d349121130ec5fca9783c97d10c07721e1d13585"}

Expected:

com,wistia,distillery)/ 20220914144501 {"offset": 0, "length": 377, "digest": "sha256:b75ede157ec02f31a25126270771b287d1ccc42554c9678ebc2c1446249a554d"}
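
One possible way to obtain a key of the expected shape is sketched below using only the standard library. This is a hypothetical helper, not part of cdxj-indexer or the surt package: the name authority_to_surt, the regular expression, and the choice to drop ports 80 and 443 are all assumptions made for illustration. It only covers the bare host:port form used by CONNECT and returns None for anything else, so a caller could fall back to the normal SURT path.

```python
import re
from typing import Optional

# Matches a bare "host:port" authority, the request-target form of CONNECT.
# Hypothetical pattern; it does not handle IPv6 literals or userinfo.
_AUTHORITY_RE = re.compile(r"^(?P<host>[A-Za-z0-9.-]+):(?P<port>\d+)$")

# Assumption: default HTTP/HTTPS ports are omitted from the key, which matches
# the expected output in this issue (":443" is gone).
_DEFAULT_PORTS = {"80", "443"}


def authority_to_surt(target: str) -> Optional[str]:
    """Turn 'host:port' into a SURT-style key, or return None if it is not one."""
    match = _AUTHORITY_RE.match(target)
    if match is None:
        return None
    host = match.group("host").lower().rstrip(".")
    # Reverse the host labels: distillery.wistia.com -> com,wistia,distillery
    key = ",".join(reversed(host.split(".")))
    port = match.group("port")
    if port not in _DEFAULT_PORTS:
        key += ":" + port
    return key + ")/"


if __name__ == "__main__":
    print(authority_to_surt("distillery.wistia.com:443"))
```

Whether CONNECT records should be indexed this way, or skipped entirely, is a design decision for the maintainers; the sketch only shows that the expected key can be derived from the WARC-Target-URI value.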