webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://browsertrix.com
GNU Affero General Public License v3.0

Document new WARC fields in 1.x crawler-produced WACZ files #1588

Open tuehlarsen opened 3 months ago

tuehlarsen commented 3 months ago

Browsertrix Cloud Version

v1.9.3-79a217b

What did you expect to happen? What happened instead?

I have found some new WARC fields and files in the newest WACZ from the beta.browsertrix release 1.9.3, Browsertrix-Crawler 1.0.0-beta.6 (with warcio.js 2.2.1):

Here is a snippet from the WARC file:

WARC/1.1
WARC-Page-ID: 61046c48-286b-485a-a8ed-9974f79a179d
WARC-Resource-Type: document
WARC-JSON-Metadata: {"cert":{"issuer":"GlobalSign Atlas R3 DV TLS CA 2024 Q1","ctc":"0"}}
...

I can't find any proposal concerning WARC-Page-ID at https://iipc.github.io/warc-specifications/. I also found text and screenshot warc.gz files, but no documentation for them. All files validate with the newest version of jwat: https://github.com/netarchivesuite/jwat-tools/releases/tag/v0.7.2-beta1

Any comments?

Step-by-step reproduction instructions

see above

Additional details

No response

tuehlarsen commented 3 months ago

Are the three new WARC fields above mandatory for modern browser-based replay, and are they used de facto in other tools today?

tw4l commented 3 months ago

Hi @tuehlarsen, longer explanation coming but in short:

None of these additions should cause WARC validation to fail or cause any replay issues. Most software other than Browsertrix will simply ignore the fields, as is suggested in section 5.1 of the WARC 1.1 specification:

Because new fields may be defined in extensions to the core WARC format, WARC processing software shall ignore fields with unrecognized names.
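
As a rough illustration of that rule, here is a minimal Python sketch of a reader that keeps the fields it knows and sets aside the rest instead of failing. It is not how Browsertrix, jwat, or any other tool is implemented. The function name `split_warc_fields`, the `KNOWN_FIELDS` subset, and the `WARC-Type: response` line in the sample are assumptions made for the example; the three other fields are the ones quoted in this issue. Header line folding is not handled.

```python
import json

# Illustrative subset of field names defined in WARC 1.1 (NOT the full list).
KNOWN_FIELDS = {
    "warc-type", "warc-record-id", "warc-date", "content-length",
    "content-type", "warc-target-uri", "warc-payload-digest",
    "warc-block-digest", "warc-concurrent-to", "warc-refers-to",
}


def split_warc_fields(header_block: str):
    """Split a WARC header block into (version, known, unrecognized).

    Hypothetical helper: unrecognized names are collected, never rejected,
    which mirrors the "shall ignore" wording quoted above.
    """
    lines = header_block.split("\r\n")
    version = lines[0]
    known, unrecognized = {}, {}
    for line in lines[1:]:
        if not line:  # a blank line ends the header block
            break
        # Split on the first colon only, so JSON values stay intact.
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        target = known if name.lower() in KNOWN_FIELDS else unrecognized
        target[name] = value
    return version, known, unrecognized


# Sample record header. "WARC-Type: response" is assumed for illustration;
# the remaining fields are copied from the snippet in this issue.
SAMPLE = "\r\n".join([
    "WARC/1.1",
    "WARC-Type: response",
    "WARC-Page-ID: 61046c48-286b-485a-a8ed-9974f79a179d",
    "WARC-Resource-Type: document",
    'WARC-JSON-Metadata: {"cert":{"issuer":"GlobalSign Atlas R3 DV TLS CA 2024 Q1","ctc":"0"}}',
    "",
    "",
])

version, known, unrecognized = split_warc_fields(SAMPLE)
print(version, sorted(known), sorted(unrecognized))
# A tool that does understand the extension can still decode the JSON value.
print(json.loads(unrecognized["WARC-JSON-Metadata"])["cert"]["issuer"])
```

A reader written this way can process the new WACZ files unchanged, while a tool that understands the extension fields can pick them out of the second dictionary.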

Re: your comment about missing documentation for screenshot and text WARC files, that is noted and should be coming shortly! As of the latest 1.0.0 crawler beta release, these WARCs will also be prefixed if a WARC prefix is specified.

tuehlarsen commented 2 months ago

I hope the documentation will be more explicit, as it is of great importance for large, older web archives to know what the new WARC fields are for and how they will be used in the future. This requires a syntax definition and description that can serve as input to a later ISO standardization process.