webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Question: is there any processing done to URI values? #492

Closed: benoit74 closed this issue 6 months ago

benoit74 commented 6 months ago

In https://github.com/openzim/warc2zim/issues/206, we are wondering what is responsible for the URL-encoding we find in the WARC-Target-URI record header.

For instance, if a page HTML specifies a link like ./images/urlencoding1_icône-débuter-Solidarité-Numérique_1@300x.png, what we find in the WARC record header is https://xxx.xxx.xx/xxxx/images/urlencoding1_ico%CC%82ne-de%CC%81buter-Solidarite%CC%81-Nume%CC%81rique_1@300x.png.
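The byte sequence %CC%82 is the UTF-8 encoding of U+0302 (combining circumflex accent), so the path apparently contained the decomposed (NFD) form of "ô" before it was percent-encoded. The following minimal Python sketch reproduces both variants; it uses urllib.parse.quote as a stand-in for the browser's encoder, which is an assumption made for illustration and not what the crawler itself runs.

```python
import unicodedata
from urllib.parse import quote

# File name fragment from the example above. It is normalized explicitly so
# the result does not depend on how this text happens to be stored.
name = "icône-débuter-Solidarité-Numérique"
nfc = unicodedata.normalize("NFC", name)  # precomposed: "ô" is one code point
nfd = unicodedata.normalize("NFD", name)  # decomposed: "o" + U+0302

# quote() percent-encodes the UTF-8 bytes of every non-ASCII character.
encoded_nfc = quote(nfc)
encoded_nfd = quote(nfd)

print(encoded_nfc)  # ic%C3%B4ne-d%C3%A9buter-Solidarit%C3%A9-Num%C3%A9rique
print(encoded_nfd)  # ico%CC%82ne-de%CC%81buter-Solidarite%CC%81-Nume%CC%81rique
```

Only the NFD variant matches the value seen in the WARC record, which suggests that no Unicode normalization was applied between the HTML source and the recorded URI.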

Is there any processing done by the crawler / warc.io lib / ... or is it directly the URL returned by the browser?

When the URL is an href and is hence processed by Browsertrix as part of its logic to retrieve the pages to fetch, is the URL extracted "as-is" and passed to the browser? Is there any processing (there is at least some to transform it into an absolute URL, I imagine)?

Finally, is there any specification which says what the URI format (encoded or not) should be inside the WARC? In ZIMs we have decided that item paths must not be encoded, for instance, and it is the scraper's responsibility to decode what needs to be decoded.

Thank you!

ato commented 6 months ago

Is there any processing done by the crawler / warc.io lib / ... or is it directly the URL returned by the browser?

In the current version of Browsertrix, WARC-Target-URI comes from the url field in the Network.Request struct in the Chrome DevTools Protocol. In older versions of Browsertrix, resources were recorded using an HTTP proxy and the URL was constructed from the HTTP request message as sent over the wire. In both cases the browser's URL parser will already have handled parsing, resolving and encoding according to the HTML standard.

When the URL is an href and is hence processed by Browsertrix as part of its logic to retrieve the pages to fetch, is the URL extracted "as-is" and passed to the browser? Is there any processing (there is at least some to transform it into an absolute URL, I imagine)?

The URL is extracted by calling the href getter in the DOM API which will parse and encode the URL per the HTML standard. I didn't see any further processing in the code.
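As a rough illustration of what the href getter does (resolve against the base URL, then percent-encode), here is a Python approximation. Python's urllib is not a WHATWG URL parser, so the helper name resolve_and_encode, the PATH_SAFE set and the example.com URLs are all assumptions for this sketch, and only the path component is encoded.

```python
from urllib.parse import quote, urljoin, urlsplit, urlunsplit

# Characters left untouched in the path (an illustrative choice, not the
# exact set a browser uses). "%" is included so that sequences which are
# already percent-encoded are not encoded a second time.
PATH_SAFE = "/%@:!$&'()*+,;="


def resolve_and_encode(base_url: str, href: str) -> str:
    """Hypothetical helper: resolve href against base_url, then encode the path."""
    absolute = urljoin(base_url, href)  # handles "./" and "../" segments
    parts = urlsplit(absolute)
    encoded_path = quote(parts.path, safe=PATH_SAFE)
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


print(resolve_and_encode("https://example.com/xxxx/index.html",
                         "./images/caf\u00e9_1@300x.png"))
# https://example.com/xxxx/images/caf%C3%A9_1@300x.png
```

The point is only that the relative, unencoded attribute value and the absolute, encoded URL differ before the crawler ever sees the link.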

is there any specification which says what should be the URI format (encoded or not) inside the WARC?

The WARC specification says that the WARC-Target-URI field "shall be written as specified in RFC 3986" which only allows a subset of US-ASCII characters.
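To make the "subset of US-ASCII" concrete, the sketch below checks whether a string uses only characters that RFC 3986 permits in a URI: unreserved characters, reserved delimiters, and well-formed percent-encoded triplets. It is a character-level check only, not a validation of the full URI grammar, and the function name is hypothetical.

```python
import re

# unreserved / gen-delims / sub-delims from RFC 3986, plus pct-encoded triplets
_RFC3986_CHARS = re.compile(
    r"(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*"
)


def uses_only_rfc3986_chars(uri: str) -> bool:
    """Hypothetical helper: True if every character is allowed by RFC 3986."""
    return _RFC3986_CHARS.fullmatch(uri) is not None


encoded = ("https://xxx.xxx.xx/xxxx/images/urlencoding1_ico%CC%82ne-"
           "de%CC%81buter-Solidarite%CC%81-Nume%CC%81rique_1@300x.png")
print(uses_only_rfc3986_chars(encoded))                                   # True
print(uses_only_rfc3986_chars("https://example.com/images/ic\u00f4ne.png"))  # False
```

Under this check, the percent-encoded value from the WARC record passes, while the same kind of URL with a raw non-ASCII character does not.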

The HTML specification defines how to follow a hyperlink, and one of the steps is encoding-parsing-and-serializing a URL from the href attribute, which percent-encodes non-ASCII characters so that the resulting URL contains only US-ASCII characters.
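Since WARC-Target-URI values are therefore percent-encoded, a consumer that wants unencoded paths (as described for ZIM items in the question) has to decode them itself. A minimal sketch in Python follows; decoded_path is a hypothetical helper and not warc2zim's actual code, and whether to also apply Unicode normalization is left as a separate decision.

```python
from urllib.parse import unquote, urlsplit


def decoded_path(target_uri: str) -> str:
    """Hypothetical helper: return the path of a WARC-Target-URI, percent-decoded."""
    parts = urlsplit(target_uri)
    # unquote() decodes the percent-encoded bytes as UTF-8 by default.
    return unquote(parts.path)


path = decoded_path("https://example.com/images/ico%CC%82ne_1@300x.png")
# The decoded path keeps the decomposed form: "o" followed by U+0302.
print(path == "/images/ico\u0302ne_1@300x.png")  # True
```

Decoding restores exactly the characters that were encoded, so a name that was decomposed in the page stays decomposed in the decoded path.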

benoit74 commented 6 months ago

@ato Thank you very much! 🙏🏼