Open markpbaggett opened 1 year ago
Yes, this is a good idea, and just something we haven't gotten around to yet. The warcinfo option (which I forgot about!) exists because someone requested it at one point.
I think it's just a matter of exposing these settings. Are there specific properties that you'd like to see included? title and description?
A tricky aspect of this is that this sort of metadata may change while the actual archived data remains the same, e.g. you may want to edit the description, but perhaps not. At least with WACZ, it is easier to update than in the WARC.
@tw4l what do you think would make sense here? Perhaps just adding --title and --desc flags, or a more extensive metadata json blob in the config?
@ikreymer Great question. In our planning, we were looking to use title and description from the wacz spec along with homepage and contributors from frictionless data. It looks like title and description are already defined in py-wacz.
Since the range of homepage is a string, it could follow the same pattern as title and description, but I'm not sure how you all feel about expanding py-wacz to support things that aren't prescribed explicitly in wacz 4.2.4. contributors is of course more complex since it's an array of objects, with its own optional fields.
In case it's helpful, this was what we were hoping to capture as it closely matches what we put into the warcinfo:
{
  "profile": "data-package",
  "resources": [
    {
      "name": "filename.ext",
      "path": "path_in_wacz/filename.ext",
      "hash": "sha256:917573c8c06fbe3784f9255652a7aaa7e4f04436b9e0bfab929d7082f88f3a14",
      "bytes": 961064
    }
  ],
  "title": "Database of the Smokies",
  "description": "This WACZ and its associated WARCs contain web assets associated with the Database of the Smokies (DOTS). DOTS provided citations to written works about the Great Smoky Mountains National Park and bordering communities from 1935 to the present day. In 2023, a decision was made to sunset DOTS but provide ongoing access to its contents via WACZ. While DOTS will be spun down, its web archive will be made available for perpetuity.",
  "homepage": "https://www.lib.utk.edu",
  "created": "2023-03-30T19:13:15Z",
  "wacz_version": "1.1.1",
  "software": "py-wacz 0.4.8",
  "contributors": [
    {
      "title": "Mark Baggett",
      "email": "mbagget1@utk.edu",
      "role": "author",
      "organization": "University of Tennessee, Libraries"
    }
  ]
}
I think hearing what others think would be great. Honestly, just having title and description (which already seem to function if you use py-wacz directly) would be awesome.
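For illustration, here is a minimal sketch of how user-supplied fields like the ones in the example above might be layered onto a generated datapackage.json. The helper `merge_metadata` and the `RESERVED_KEYS` set are hypothetical names, not part of py-wacz or the crawler; the sketch simply assumes the generated keys should win so that custom properties never overwrite them.

```python
import json

# Keys written by the WACZ tooling in the example above. Treating them as
# reserved is an assumption of this sketch, not documented py-wacz behavior.
RESERVED_KEYS = {"profile", "resources", "created", "wacz_version", "software"}


def merge_metadata(generated: dict, user: dict) -> dict:
    """Return a copy of `generated` with the non-reserved `user` keys added."""
    merged = dict(generated)
    for key, value in user.items():
        if key in RESERVED_KEYS:
            continue  # skip anything that would clash with generated data
        merged[key] = value
    return merged


if __name__ == "__main__":
    generated = {
        "profile": "data-package",
        "resources": [],
        "wacz_version": "1.1.1",
        "software": "py-wacz 0.4.8",
    }
    user = {
        "title": "Database of the Smokies",
        "homepage": "https://www.lib.utk.edu",
        "software": "ignored",  # reserved key, so it is dropped
    }
    print(json.dumps(merge_metadata(generated, user), indent=2))
```

Keeping the generated keys authoritative is one possible design choice; a real implementation might instead reject the config with an error when a reserved key is supplied.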
> @tw4l what do you think would make sense here? Perhaps just adding --title and --desc flags, or a more extensive metadata json blob in the config?
I think having title and description able to be passed in via --title and --desc/--description crawler args is a great start, especially since the spec states they SHOULD be present and they're supported in py-wacz already.
I'd love to eventually get support for a more extensive metadata json blob (maybe/ideally limited to Frictionless Data Package metadata fields - not sure to what degree we could do validation but it'd be a nice value-add), but as @markpbaggett's example suggests, this might be a little more involved, so I think starting with title and description seems reasonable.
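As a rough idea of what validation could look like for the more involved contributors field, here is a minimal sketch. It assumes the Frictionless Data Package v1 convention that contributors is an array of objects, each with a required title and optional string fields such as email, path, role, and organization. The function name `validate_contributors` is hypothetical and not an existing py-wacz or crawler API.

```python
from typing import Any, List

# Optional string fields on a contributor (assumed from Data Package v1).
OPTIONAL_FIELDS = ("email", "path", "role", "organization")


def validate_contributors(contributors: Any) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(contributors, list):
        return ["contributors must be an array"]
    errors = []
    for index, entry in enumerate(contributors):
        where = f"contributors[{index}]"
        if not isinstance(entry, dict):
            errors.append(f"{where} must be an object")
            continue
        title = entry.get("title")
        if not isinstance(title, str) or not title.strip():
            errors.append(f"{where}.title is required and must be a non-empty string")
        for field in OPTIONAL_FIELDS:
            if field in entry and not isinstance(entry[field], str):
                errors.append(f"{where}.{field} must be a string")
    return errors


if __name__ == "__main__":
    example = [
        {
            "title": "Mark Baggett",
            "role": "author",
            "organization": "University of Tennessee, Libraries",
        }
    ]
    print(validate_contributors(example))
```

Unknown keys are deliberately not rejected here, since the spec allows custom properties.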
@markpbaggett Initial PR is merged and the 0.9.0 release of the crawler will include --title and --description CLI args :)
Thanks for the full example of the metadata you'd like to be able to pass into datapackage.json! Keeping this issue open so that we can track adding more metadata to the WACZ in time.
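For anyone scripting crawls, a sketch of how the new args might be passed is shown below. The --title and --description flags come from this thread. The docker image name, the crawl subcommand, --url, --generateWACZ, and the /crawls/ volume mount are assumed from the crawler README and may differ for your version. `build_crawl_command` is a hypothetical helper that only builds the argument list and does not run anything.

```python
import os
import shlex
from typing import List


def build_crawl_command(url: str, title: str, description: str) -> List[str]:
    """Build (but do not run) a docker command line for a WACZ crawl."""
    # Host directory for crawl output; mounting it at /crawls/ is an assumption.
    crawls_dir = os.path.join(os.getcwd(), "crawls")
    return [
        "docker", "run",
        "-v", f"{crawls_dir}:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--generateWACZ",
        "--title", title,  # flag name taken from this thread
        "--description", description,  # flag name taken from this thread
    ]


if __name__ == "__main__":
    command = build_crawl_command(
        "https://example.com/",
        "Database of the Smokies",
        "Web archive of the Database of the Smokies (DOTS).",
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(command))
```

Passing the arguments as a list avoids shell-quoting problems with long descriptions if the command is later handed to subprocess.run.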
@tw4l and @ikreymer awesome 😎 . this is a great start.
With browsertrix-crawler, a user can use combineWARC to write contextual information defined in the warcinfo property into the destination warc. When the warc is read, the fields defined in the property can be read.
There does not appear to be a way currently with Browsertrix-Crawler (0.8.0-beta.2) to do a similar thing where one could add contextual information about the contents of a WACZ to the datapackage.json file. The WACZ spec states that a "WACZ file includes all the data that is needed for the rendering archived content as well as contextual information required for users to interpret it." It goes further to state that "the datapackage.json SHOULD include properties that allow rendering applications to present the user with contextual information about the web archive:" including title, description, "other properties from the [FRICTIONLESS-DATA-PACKAGE] specification such as licenses, version, organization, contributors, email," and "custom properties that do not interfere with pre-existing properties."
Is there a way currently or are there plans to add a method to add contextual information from the config to datapackage.json? If not, I think this would be a great addition.
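To show what a rendering application could do with that contextual information, here is a minimal sketch that reads datapackage.json out of a WACZ using only the Python standard library. It relies on a WACZ being a ZIP file with datapackage.json at its root. The function name `read_wacz_metadata` is hypothetical, and the demo builds a tiny in-memory stand-in rather than a real crawl output.

```python
import io
import json
import zipfile


def read_wacz_metadata(wacz, fields=("title", "description")) -> dict:
    """Return the requested datapackage.json fields (None when absent)."""
    # `wacz` may be a filesystem path or a binary file-like object.
    with zipfile.ZipFile(wacz) as archive:
        with archive.open("datapackage.json") as handle:
            package = json.load(handle)
    return {name: package.get(name) for name in fields}


if __name__ == "__main__":
    # Build a stand-in WACZ in memory that contains only datapackage.json.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "datapackage.json",
            json.dumps({"profile": "data-package", "title": "Database of the Smokies"}),
        )
    buffer.seek(0)
    print(read_wacz_metadata(buffer))
```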