webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

`.warc` strict conforming output? #554

Closed dbuenzli closed 2 months ago

dbuenzli commented 3 months ago

I'm evaluating browsertrix-crawler for the long-term preservation efforts of a non-profit archival organisation. As such, I have a few questions about the .warc files it generates:

  1. I noticed that the .warc files have non-standard headers like WARC-JSON-Metadata or WARC-Page-ID. I understand the explanations there, but it's a bit problematic to have these fields if their semantics eventually end up being standardized differently in the future. Is there a way to convince the crawler to generate strictly conformant .warc files?

  2. It seems the .warc files contain resource records with a WARC-Target-URI of the form urn:pageinfo:URI; I gather these represent a page and the resources it needs for display. But is this scheme standard and/or described somewhere? (A sketch of how I am surfacing both of the above follows this list.)
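For concreteness, a minimal sketch of how I am surfacing both of the above, using Webrecorder's warcio library. The file name and the (abridged) WARC/1.1 field list are placeholders of mine, not anything the crawler itself produces:

```python
from warcio.archiveiterator import ArchiveIterator

# Named fields from WARC/1.1, abridged; anything outside this set is
# treated as an extension header for the purpose of this sketch.
WARC11_FIELDS = {
    "WARC-Type", "WARC-Record-ID", "WARC-Date", "Content-Length",
    "Content-Type", "WARC-Target-URI", "WARC-Payload-Digest",
    "WARC-Block-Digest", "WARC-Concurrent-To", "WARC-IP-Address",
    "WARC-Warcinfo-ID", "WARC-Refers-To", "WARC-Filename", "WARC-Profile",
}

extensions, pageinfo_uris = set(), []
with open("crawl.warc.gz", "rb") as f:  # placeholder file name
    for record in ArchiveIterator(f):
        for name, _ in record.rec_headers.headers:
            if name not in WARC11_FIELDS:
                extensions.add(name)
        uri = record.rec_headers.get_header("WARC-Target-URI") or ""
        if record.rec_type == "resource" and uri.startswith("urn:pageinfo:"):
            pageinfo_uris.append(uri)

print(sorted(extensions))        # e.g. WARC-JSON-Metadata, WARC-Page-ID
print(len(pageinfo_uris), "urn:pageinfo: resource records")
```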

tw4l commented 3 months ago

Hi @dbuenzli, thanks for these comments.

In terms of the new fields, yes, perhaps we should create/propose an extension to the core WARC format with these new fields and push for them to be included in future versions of the core standard. I'm putting this issue on our sprint board for consideration after the IIPC WAC conference, so thanks for raising it.

> Is there a way to convince the crawler to generate strictly conformant .warc files?

This might be a possibility. However, some of these fields are necessary for features of Browsertrix Crawler: e.g. WARC-Page-ID relates WARC records to the pages list in the pages.jsonl/extraPages.jsonl files of the WACZ format, and the urn:pageinfo:URI records are necessary for our new Quality Assurance (QA) features. So we would have to carefully consider how to handle those fields in a "strict" WARC mode if we were to go down that path, and whether we'd want to manage that complexity.
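As a rough illustration of that linkage, a warcio sketch: it assumes each pages.jsonl entry carries an id matching the WARC-Page-ID header, and the file paths follow the usual WACZ layout (both are assumptions of the sketch, not a documented API):

```python
import json

from warcio.archiveiterator import ArchiveIterator

# Load the WACZ pages list; the first line of pages.jsonl is a format
# header without an "id", so id-less entries are skipped.
with open("pages/pages.jsonl") as f:
    pages = {p["id"]: p for p in map(json.loads, f) if "id" in p}

# Relate WARC records back to their page via the WARC-Page-ID header.
with open("archive/data.warc.gz", "rb") as f:
    for record in ArchiveIterator(f):
        page_id = record.rec_headers.get_header("WARC-Page-ID")
        if page_id in pages:
            print(record.rec_headers.get_header("WARC-Target-URI"),
                  "-> page:", pages[page_id].get("url"))
```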

dbuenzli commented 3 months ago

Thanks for your answers @tw4l.

Note that, strictly speaking, I don't mind post-processing the files to remove these non-standard elements for our long-term archive. But that raises two questions:

  1. Can the urn:pageinfo:URI records be regenerated from a .warc file from which these records have been pruned? I.e. is this derived data, or is something essential about the crawling process being captured here?

  2. If I prune these elements from the .warc files, would it still be possible to view them with your very nice serverless viewer (https://replayweb.page/)? (The pruning I mean is sketched below.)
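The kind of pruning I have in mind is roughly the following warcio sketch (file names are placeholders, and I'm not claiming this is advisable, just showing the post-processing in question):

```python
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

# Copy every record except the urn:pageinfo: resource records.
with open("in.warc.gz", "rb") as inp, open("out.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    for record in ArchiveIterator(inp):
        uri = record.rec_headers.get_header("WARC-Target-URI") or ""
        if record.rec_type == "resource" and uri.startswith("urn:pageinfo:"):
            continue  # drop the page-info record
        writer.write_record(record)
```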

ikreymer commented 2 months ago

> Note that, strictly speaking, I don't mind post-processing the files to remove these non-standard elements for our long-term archive.

I would strongly recommend against doing this. Our goal is to write up proposals for the urn:pageinfo records as well as the custom WARC-* headers, publish them on our specs site at https://specs.webrecorder.net/, and also submit them to https://github.com/iipc/warc-specifications to be tracked there; we just haven't had time.

Per the WARC 1.1 standard, unknown headers and records should be ignored by conforming tools. The reason for this is to encourage extension of the WARC spec in a way that lets users try new things. ISO standardization can take a long time (the standard is reviewed on a five-year cycle), and the community decided (if I remember correctly) that it would be best to see what is actually in use 'in the wild' before proposing further extensions to WARC: only things in actual use should be standardized, and that simply takes time. We do hope that our extensions can be standardized one day, but our focus is on making sure we can archive at-risk content today at the highest fidelity.

> Can the urn:pageinfo:URI records be regenerated from a .warc file from which these records have been pruned? I.e. is this derived data, or is something essential about the crawling process being captured here?

The answer is 'maybe', but not necessarily. The goal of these records is to capture all resources loaded by a browser at the time of capture, so that they can later be analyzed and compared with the resources loaded at time of replay. This includes resources that are duplicates, loaded from cache, etc., so they don't correspond one-to-one with a new WARC record, or even a revisit record.
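If it helps to see what is captured there, the JSON payloads of these records can be dumped for inspection, e.g. with warcio (the file name below is a placeholder):

```python
import json

from warcio.archiveiterator import ArchiveIterator

# Print the JSON payload of each urn:pageinfo: resource record.
with open("archive/data.warc.gz", "rb") as f:
    for record in ArchiveIterator(f):
        uri = record.rec_headers.get_header("WARC-Target-URI") or ""
        if record.rec_type == "resource" and uri.startswith("urn:pageinfo:"):
            info = json.loads(record.content_stream().read())
            print(uri, "->", json.dumps(info, indent=2))
```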

> If I prune these elements from the .warc files, would it still be possible to view them with your very nice serverless viewer (https://replayweb.page/)?

Yes, replayweb.page isn't using these records at the moment, and we won't require them, but we may add features in the future that do use them. For now, these records are intended to be forward-compatible.

I don't think you gain anything by removing them, and I would generally advise against reprocessing WARCs to remove unused records. WARCs often contain additional records that are not used during replay, such as warcinfo records. Other crawlers such as Heritrix also write custom metadata records (crawl logs, etc.) that are not used during replay but may be useful for other types of analysis. We use resource records, which are a standard type of WARC record and should be handled or ignored by conforming WARC readers.

dbuenzli commented 2 months ago

Thanks @ikreymer for taking the time to respond.

I don't mind having additional stuff in there, but I'm a bit uneasy that:

  1. Some of these things are (AFAIK) entirely undocumented, e.g. the formal purpose and JSON format of the urn:pageinfo:URI payload. (Plans to document them are good, but for now they remain plans.)

  2. The semantics of a name may change between the time of capture and possible standardization. Shouldn't these things perhaps live under X- prefixes? I gather that in some contexts this brings its own set of problems (e.g. the bureaucracy around vendor prefixes in CSS). But from a long-term archival perspective, the fact that such a semantic change happened can easily be lost, even over the course of a couple of years. Software comes, evolves and goes quicker than the websites we want to preserve.

tw4l commented 2 months ago

We are in the process of documenting these new headers and fields, tracked in https://github.com/webrecorder/browsertrix/issues/1588