webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Support Contextual Information in datapackage.json for WACZ #268

Open markpbaggett opened 1 year ago

markpbaggett commented 1 year ago

With browsertrix-crawler, a user can use combineWARC to write the contextual information defined in the warcinfo property into the destination WARC. When the WARC is read, those fields appear in its warcinfo record:

WARC/1.0
WARC-Filename: exhibitstwo_0.warc.gz
WARC-Date: 2023-03-29T13:58:06Z
WARC-Type: warcinfo
WARC-Record-ID: <urn:uuid:0b216a80-253c-4324-9dc7-a0e18e535b66>
Content-Type: application/warc-fields
Content-Length: 712

software: Browsertrix-Crawler 0.8.0-beta.2 (with warcio.js 1.6.2 pywb 2.7.3)
format: WARC File Format 1.0
title: Database of the Smokies
type: collection
operator: Mark Baggett askDI@utk.edu
hostname: www.lib.utk.edu
creator: University of Tennessee, Libraries
description: This warc contains web assets associated with the Database of the Smokies (DOTS). DOTS provided citations to written works about the Great Smoky Mountains National Park and bordering  communities from 1935 to the present day. In 2023, a decision was made to sunset DOTS but provide ongoing access to its contents. While DOTS will be spun down, its web archive will be made available for perpetuity.
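
For readers who want to pull these fields back out, here is a minimal sketch of parsing the body of a warcinfo record (the `name: value` lines above) into a dictionary. The function name `parse_warc_fields` is hypothetical, and the sketch assumes one field per line: it ignores folded continuation lines, and a repeated name keeps only its last value.

```python
def parse_warc_fields(payload: str) -> dict:
    """Parse an application/warc-fields body into a dict of name -> value.

    Simplifying assumption: one "name: value" pair per line, no folded lines.
    """
    fields = {}
    for line in payload.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first colon only, so values may contain colons.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not "name: value"
        fields[name.strip()] = value.strip()
    return fields


sample = """software: Browsertrix-Crawler 0.8.0-beta.2 (with warcio.js 1.6.2 pywb 2.7.3)
format: WARC File Format 1.0
title: Database of the Smokies
type: collection
hostname: www.lib.utk.edu
"""
info = parse_warc_fields(sample)
print(info["title"])
```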

There does not currently appear to be a way in Browsertrix-Crawler (0.8.0-beta.2) to do something similar for a WACZ, that is, to add contextual information about its contents to the datapackage.json file. The WACZ spec states that a "WACZ file includes all the data that is needed for the rendering archived content as well as contextual information required for users to interpret it." It goes on to state that "the datapackage.json SHOULD include properties that allow rendering applications to present the user with contextual information about the web archive:" including title, description, "other properties from the [FRICTIONLESS-DATA-PACKAGE] specification such as licenses, version, organization, contributors, email," and "custom properties that do not interfere with pre-existing properties."
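
To illustrate what a rendering application would look for, here is a rough sketch that reads the contextual properties out of a WACZ. It relies only on a WACZ being a ZIP file with datapackage.json at its root. The function name `read_wacz_metadata` and the in-memory demo archive are made up for illustration.

```python
import io
import json
import zipfile


def read_wacz_metadata(wacz, keys=("title", "description")):
    """Return selected datapackage.json properties (None when absent).

    `wacz` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(wacz) as zf:
        with zf.open("datapackage.json") as fh:
            package = json.load(fh)
    return {key: package.get(key) for key in keys}


# Demo: build a tiny stand-in archive in memory (not a complete WACZ).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "datapackage.json",
        json.dumps({"profile": "data-package", "title": "Database of the Smokies"}),
    )
buffer.seek(0)

metadata = read_wacz_metadata(buffer)
print(metadata)
```

A missing property comes back as None, which makes it easy to spot a WACZ that was written without a title or description.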

Is there a way currently or are there plans to add a method to add contextual information from the config to datapackage.json? If not, I think this would be a great addition.

ikreymer commented 1 year ago

Yes, this is a good idea, and just something we haven't gotten around to yet. The warcinfo option (which I forgot about!) exists because someone requested it at one point. I think it's just a matter of exposing these settings. Are there specific properties that you'd like to see included? title and description?

A tricky aspect of this is that this sort of metadata may change while the actual archived data remains the same, e.g. you may want to edit the description, but perhaps not. At least with WACZ, it is easier to update than in the WARC.

ikreymer commented 1 year ago

@tw4l what do you think would make sense here? Perhaps just adding --title and --desc flags, or a more extensive metadata json blob in the config?

markpbaggett commented 1 year ago

@ikreymer Great question. In our planning, we were looking to use title and description from the WACZ spec along with homepage and contributors from Frictionless Data. It looks like title and description are already defined in py-wacz.

Since the range of homepage is a string, it could follow the same pattern as title and description, but I'm not sure how you all feel about expanding py-wacz to support things that aren't prescribed explicitly in WACZ 4.2.4. contributors is of course more complex since it's an array of objects, each with its own optional fields.

In case it's helpful, this was what we were hoping to capture as it closely matches what we put into the warcinfo:

{
  "profile": "data-package",
  "resources": [
    {
      "name": "filename.ext",
      "path": "path_in_wacz/filename.ext",
      "hash": "sha256:917573c8c06fbe3784f9255652a7aaa7e4f04436b9e0bfab929d7082f88f3a14",
      "bytes": 961064
    }
  ],
  "title": "Database of the Smokies",
  "description": "This WACZ and its associated WARCs contain web assets associated with the Database of the Smokies (DOTS). DOTS provided citations to written works about the Great Smoky Mountains National Park and bordering communities from 1935 to the present day. In 2023, a decision was made to sunset DOTS but provide ongoing access to its contents via WACZ. While DOTS will be spun down, its web archive will be made available for perpetuity.",
  "homepage": "https://www.lib.utk.edu",
  "created": "2023-03-30T19:13:15Z",
  "wacz_version": "1.1.1",
  "software": "py-wacz 0.4.8",
  "contributors": [
    {
      "title": "Mark Baggett",
      "email": "mbagget1@utk.edu",
      "role": "author",
      "organization": "University of Tennessee, Libraries"
    }
  ]
}

I think hearing what others think would be great. Honestly, just having title and description (which already seem to function if you use py-wacz directly) would be awesome.
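
As a sketch of how extra properties like the ones above could be layered onto an existing datapackage.json without clobbering what is already there, in the spirit of the spec's "custom properties that do not interfere with pre-existing properties": the function name `add_context` is hypothetical and this is not how py-wacz or the crawler does it.

```python
def add_context(package: dict, extra: dict) -> dict:
    """Return a copy of `package` with `extra` properties added.

    Properties already present in `package` are left untouched.
    """
    merged = dict(package)
    for key, value in extra.items():
        merged.setdefault(key, value)  # existing values win
    return merged


package = {
    "profile": "data-package",
    "title": "Database of the Smokies",
    "software": "py-wacz 0.4.8",
}
extra = {
    "homepage": "https://www.lib.utk.edu",
    "software": "something else",  # ignored: already set by the writer
    "contributors": [
        {"title": "Mark Baggett", "role": "author"},
    ],
}
result = add_context(package, extra)
print(sorted(result))
```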

tw4l commented 1 year ago

> @tw4l what do you think would make sense here? Perhaps just adding --title and --desc flags, or a more extensive metadata json blob in the config?

I think being able to pass title and description in via --title and --desc/--description crawler args is a great start, especially since the spec states they SHOULD be present and they're supported in py-wacz already.

I'd love to eventually get support for a more extensive metadata json blob (maybe/ideally limited to Frictionless Data Package metadata fields; not sure to what degree we could do validation, but it'd be a nice value-add). As @markpbaggett's example suggests, though, this might be a little more involved, so I think starting with title and description seems reasonable.

tw4l commented 1 year ago

@markpbaggett Initial PR is merged and the 0.9.0 release of the crawler will include --title and --description CLI args :)

Thanks for the full example of the metadata you'd like to be able to pass into datapackage.json! Keeping this issue open so that we can track adding more metadata to the WACZ in time.
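
For anyone scripting crawls, a rough sketch of assembling a crawl command that uses the new arguments might look like the following. Only --title and --description come from this thread. The rest (the `webrecorder/browsertrix-crawler` image name, the `crawl` entry point, --url, --generateWACZ, and the /crawls/ volume) is assumed from the crawler's usual Docker usage, so check it against the docs for your version. The helper name `build_crawl_command` is made up.

```python
import shlex
from pathlib import Path


def build_crawl_command(url, title, description, crawls_dir="crawls"):
    """Build an argv list for a Docker-based crawl that writes a WACZ."""
    volume = f"{Path(crawls_dir).resolve()}:/crawls/"
    return [
        "docker", "run", "-v", volume,
        "webrecorder/browsertrix-crawler", "crawl",  # assumed image and entry point
        "--url", url,
        "--generateWACZ",
        "--title", title,              # added in 0.9.0 per this thread
        "--description", description,  # added in 0.9.0 per this thread
    ]


cmd = build_crawl_command(
    "https://www.lib.utk.edu",
    "Database of the Smokies",
    "Web assets associated with the Database of the Smokies (DOTS).",
)
print(shlex.join(cmd))  # shell-quoted form, for inspection only
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with long descriptions.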

markpbaggett commented 1 year ago

@tw4l and @ikreymer awesome 😎. This is a great start.