wellcomecollection / platform

Wellcome Collection Digital Platform
https://developers.wellcomecollection.org/
MIT License

What file metadata should be surfaced in or via IIIF for born digital? #5605

Open tomcrane opened 2 years ago

tomcrane commented 2 years ago

The DDS reads Archivematica METS and builds IIIF Manifests.

In doing so it parses the various file characterisation tool outputs to extract information. This is WIP!

Some of this information is essential for creating the Manifest:

Some of it is effectively essential because of decisions made about labelling and identifiers:

These are all reflected in machine-readable properties in the IIIF Manifest.

But there is a lot more file information that might be of interest to users and/or API consumers, both as information for display and as machine-readable API data. This information sits at the file level, below the item level of the Catalogue record, but it is perfectly possible to attach it individually to files in the Manifest, and it would be useful for anyone analysing born digital items in bulk. Things like:

Information like this can be surfaced in IIIF metadata name/value pairs, but those are just for display. They're not machine-readable (well, they are, but machines shouldn't be encouraged to read from those values, because they have no semantics). We could introduce a simple file info data model (ideally an existing vocab) and provide a machine-readable chunk of data per file.

Alternatively the Manifest could provide access to the raw METS and allow users to attempt to extract file characteristics from the tool outputs. Seeing as we're parsing that stuff anyway, it seems friendlier to pull a few more values of interest out while we're in there and surface them in the IIIF.

Just making something up:

{
    "format": "application/msword",
    "pronomKey": "fmt/123",
    "formatName": "Word for Macintosh 2005",
    "fileSizeBytes": 2343254,
    "fileSizeDisplay": "2.3 MB",
    "pageCount": 12,
    "created": "2005-12-23 13:23",
    "etc": "etc"
}

But maybe someone has a nice vocab/pattern for doing this simply.
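To make the display side concrete, here is a minimal sketch of how a flat record like the made-up one above could be turned into IIIF Presentation 3 style metadata entries (label/value language maps). The function name, the `LABELS` mapping and the display labels are all hypothetical, and this is not how the DDS actually does it.

```python
def to_iiif_metadata(file_info, labels, lang="en"):
    """Build IIIF Presentation 3 style metadata entries from a flat dict.

    `labels` maps a key in `file_info` to the label shown to users.
    Keys missing from `file_info` are skipped.
    """
    entries = []
    for key, label in labels.items():
        if key not in file_info:
            continue  # nothing reported for this property
        entries.append({
            "label": {lang: [label]},
            "value": {lang: [str(file_info[key])]},
        })
    return entries


# Hypothetical mapping from the made-up keys to display labels.
LABELS = {
    "formatName": "File format",
    "pronomKey": "PRONOM key",
    "fileSizeBytes": "File size (bytes)",
    "pageCount": "Pages",
}

file_info = {
    "format": "application/msword",
    "pronomKey": "fmt/123",
    "formatName": "Word for Macintosh 2005",
    "fileSizeBytes": 2343254,
}

metadata = to_iiif_metadata(file_info, LABELS)
```

The values end up as plain strings, which is exactly why they are fine for display but a poor source for machines: the typed record would still need to be exposed separately.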

alexwlchan commented 2 years ago

In terms of patterns, one thing I would lean towards is creating objects for related data, which is what we've done in the catalogue APIs. This is possibly a bit far, but something like:

{
    "format": {
      "id": "application/msword",
      "label": "Word for Macintosh 2005",
      "pronomKey": "fmt/123",
      "type": "Format"
    },
    …
}

The other thing I'd note is that I'd lean away from encoding display logic in the API; have it return a byte count, but not a formatted string.
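As a rough illustration of keeping display logic on the consumer side: if the API only returns a byte count, a client could format it itself with something like the sketch below. The function name and the choice of decimal (1000-based) units with two decimal places are assumptions, not anything the catalogue APIs define.

```python
def format_bytes(size_bytes):
    """Format a byte count for display using decimal (1000-based) units."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    units = ["bytes", "KB", "MB", "GB", "TB"]
    size = float(size_bytes)
    for unit in units:
        # Stop at the first unit where the value is below 1000,
        # or at the largest unit we know about.
        if size < 1000 or unit == units[-1]:
            break
        size /= 1000
    if unit == "bytes":
        return f"{size_bytes} bytes"
    return f"{size:.2f} {unit}"


display = format_bytes(2343254)
```

A different client could just as easily choose binary units or a localised number format, which is the argument for leaving the string out of the API.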

jtweed commented 2 years ago

I think extracting useful metadata for files is a good idea, and the manifest is the correct place to provide it. I don't think the front end should be parsing any METS; that should definitely all be contained within DDS.

I can totally see us displaying what kind of file something is, so the PRONOM ID and label are more important than the MIME type as metadata. File size is also extremely useful, more so than page count. Everything else I would see as additional, and it could come later.

tomcrane commented 2 years ago

For now I've added some info as Presentation metadata - for display to users, rather than machines - we can come back to a little model later. I've put these in for now to remind us that we can do this and add more.

image

tomcrane commented 2 years ago

Incidentally, this matches:

https://leedsunilibrary.wordpress.com/2021/01/06/a-year-and-three-months-as-a-bridging-the-digital-gap-trainee-in-leeds/

"As a result, users will be able to explore this collection and discover several features for each digital file – such as file format, size and original file path – along with format-specific information – such as image size, video encoding, word count etc."

tomcrane commented 2 years ago

Just occurred to me that Manifest-level metadata could provide info like total size of this item on disk by summing the file sizes; a little report as a summary field:

2.46 MB in 12 files:

That kind of data is to hand in the DDS when building the manifest, because it's already traversing the files, but it may be very hard to get hold of quantitatively at other times.
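A minimal sketch of that summary, assuming the per-file byte counts have already been collected while traversing the files. The function names, the `totalSizeBytes` and `fileCount` keys, and the use of decimal megabytes are all made up for illustration.

```python
def summarise_files(file_sizes):
    """Return a machine-readable summary of a list of per-file byte counts."""
    return {
        "totalSizeBytes": sum(file_sizes),
        "fileCount": len(file_sizes),
    }


def summary_label(summary):
    """Build a display string such as '2.46 MB in 12 files'."""
    megabytes = summary["totalSizeBytes"] / 1_000_000  # decimal MB (assumption)
    count = summary["fileCount"]
    noun = "file" if count == 1 else "files"
    return f"{megabytes:.2f} MB in {count} {noun}"


# Twelve hypothetical files of 205,000 bytes each.
summary = summarise_files([205_000] * 12)
label = summary_label(summary)
```

Keeping the raw totals alongside the label means the display string can go in a Manifest-level metadata field while the numbers stay available for anyone analysing items in bulk.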