Open tuehlarsen opened 8 months ago
Are the three new WARC fields above mandatory for modern browser-based replay, and are they de facto used in other tools today?
Hi @tuehlarsen, longer explanation coming but in short:
WARC-Resource-Type
: Proposal created at https://github.com/iipc/warc-specifications/issues/96; this is used to differentiate resources fetched via JavaScript from those loaded directly in the page, and has other possibilities for future analysis of crawls
WARC-Page-ID
: We added this in Browsertrix to be able to easily associate pages between original crawls and QA replay crawls
WARC-JSON-Metadata
: Proposal created at https://github.com/iipc/warc-specifications/issues/27

None of these additions should cause WARC validation to fail or cause any replay issues. Most software other than Browsertrix will simply ignore the fields, as is suggested in section 5.1 of the WARC 1.1 specification:
Because new fields may be defined in extensions to the core WARC format, WARC processing software shall ignore fields with unrecognized names.
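As a rough illustration of that rule, here is a minimal Python sketch of a header reader that keeps the fields it understands and skips the rest. It is not how Browsertrix, warcio.js, or any other tool is implemented. `KNOWN_FIELDS`, `parse_warc_headers`, and the `WARC-Type` and `Content-Length` values in the sample are assumptions made up for the example, and folded continuation lines are not handled.

```python
# Hypothetical subset of the field names a tool might understand.
# This is NOT the full list from the WARC 1.1 specification.
KNOWN_FIELDS = {
    "warc-type",
    "warc-record-id",
    "warc-date",
    "warc-target-uri",
    "content-type",
    "content-length",
}


def parse_warc_headers(block):
    """Split a WARC header block into (version, known fields, ignored names).

    Simplified sketch: assumes CRLF line endings and one field per line.
    """
    lines = block.split("\r\n")
    version = lines[0]
    known = {}
    ignored = []
    for line in lines[1:]:
        if not line:
            break  # an empty line ends the header block
        name, _, value = line.partition(":")
        name = name.strip()
        # Field names are compared case-insensitively.
        if name.lower() in KNOWN_FIELDS:
            known[name] = value.strip()
        else:
            # Unrecognized names are skipped instead of raising an error.
            ignored.append(name)
    return version, known, ignored


# Sample record header. WARC-Type and Content-Length are assumed values,
# added only so the example has something "known" to keep.
sample = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Page-ID: 61046c48-286b-485a-a8ed-9974f79a179d\r\n"
    "WARC-Resource-Type: document\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

version, known, ignored = parse_warc_headers(sample)
print(version)
print(known)
print(ignored)
```

A reader written this way keeps working when a crawler adds new `WARC-` fields, which is the behavior the quoted sentence asks for.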
Re: your comment about missing documentation for screenshot and text WARC files, that is noted and should be coming shortly! As of the latest 1.0.0 crawler beta release, these WARCs will also be prefixed if a WARC prefix is specified.
I hope the documentation will be more explicit, as it is of great importance for large, older web archives to know what the new WARC fields are for and how they will be used in the future. This requires a syntax definition and description that can serve as input to a later ISO standardization process.
Browsertrix Cloud Version
v1.9.3-79a217b
What did you expect to happen? What happened instead?
I have found some new WARC fields and files in the newest WACZ from the beta.browsertrix release 1.9.3, Browsertrix-Crawler 1.0.0-beta.6 (with warcio.js 2.2.1):
Here is a snippet from the WARC file:
WARC/1.1^M
WARC-Page-ID: 61046c48-286b-485a-a8ed-9974f79a179d^M
WARC-Resource-Type: document^M
WARC-JSON-Metadata: {"cert":{"issuer":"GlobalSign Atlas R3 DV TLS CA 2024 Q1","ctc":"0"}}^M
...
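For anyone who wants to inspect these fields in their own archives, the following Python sketch pulls them out of a header block shaped like the snippet above. It assumes that the `^M` markers are carriage returns and that the `WARC-JSON-Metadata` value is a single-line JSON object, as it appears here. `extension_fields` is a hypothetical helper and not part of any WARC library.

```python
import json

# The header lines from the snippet, with ^M written as "\r".
# The elided remainder of the record ("...") is deliberately left out.
snippet = (
    "WARC/1.1\r\n"
    "WARC-Page-ID: 61046c48-286b-485a-a8ed-9974f79a179d\r\n"
    "WARC-Resource-Type: document\r\n"
    'WARC-JSON-Metadata: {"cert":{"issuer":"GlobalSign Atlas R3 DV TLS CA 2024 Q1","ctc":"0"}}\r\n'
)


def extension_fields(block):
    """Return the named fields of a header block as a dict (hypothetical helper)."""
    fields = {}
    for line in block.split("\r\n")[1:]:  # skip the "WARC/1.1" version line
        if not line:
            break
        # Split on the first colon only, so colons inside the JSON value survive.
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


fields = extension_fields(snippet)

# Assumption: the field value is valid JSON on a single line.
metadata = json.loads(fields["WARC-JSON-Metadata"])

print(fields["WARC-Page-ID"])
print(fields["WARC-Resource-Type"])
print(metadata["cert"]["issuer"])
```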
I can't find any proposal concerning WARC-Page-ID here: https://iipc.github.io/warc-specifications/ . I also found a text and a screenshot warc.gz file, but no documentation. All files validate with the newest version of jwat here: https://github.com/netarchivesuite/jwat-tools/releases/tag/v0.7.2-beta1
Any comments?
Step-by-step reproduction instructions
see above
Additional details
No response