wellcomecollection / goobi-infrastructure
Wellcome Collection digital workflow infrastructure

A/V Workflow Changes in Goobi RFC #255

Closed: aray-wellcome closed this issue 4 years ago

aray-wellcome commented 4 years ago

As we ramp up A/V digitisation, it is necessary to make changes to the Goobi A/V workflows in order to:

1. Add poster images to current Wellcome Film process

Also see https://github.com/wellcometrust/workflow/issues/222

Our current Wellcome Film process template accepts MP2s. We will need to keep this workflow for the ~200 backlog MP2s we have to ingest and for any rollbacks that might be needed in the future.

However, we will need this process template to allow us to ingest JPG poster images into the bag alongside the MP2.

Before the migration, MP2s were ingested into Goobi and the poster image was placed into a local drive. DDS was configured to look into the local drive and serve up any images it found as the poster image for the film in the Universal Viewer.

For the migration, Tom Crane moved the poster images into the bags. The ticket on this is here https://github.com/wellcometrust/platform/issues/3718

What we need Goobi to do now is accept both an MP2 and a JPG poster image at ingest, write metadata about both, and place the MP2 in the objects folder of the bag and the poster image in a poster image folder of the bag, as Tom Crane specified.
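For illustration only, a minimal sketch of that bag assembly using the Library of Congress bagit Python library; this is not Goobi's actual export code, and the "posters" folder name and the helper function are assumptions standing in for whatever Tom Crane's spec says.

# Hypothetical sketch, not Goobi's bagging step. The "posters" folder name is
# an assumption; the real poster image folder name follows Tom Crane's spec.
import shutil
from pathlib import Path

import bagit  # pip install bagit


def build_av_bag(work_dir: Path, mp2: Path, poster_jpg: Path, mets: Path) -> bagit.Bag:
    """Copy the payload into place, then turn work_dir into a BagIt bag."""
    (work_dir / "objects").mkdir(parents=True, exist_ok=True)
    (work_dir / "posters").mkdir(exist_ok=True)

    shutil.copy2(mp2, work_dir / "objects" / mp2.name)
    shutil.copy2(poster_jpg, work_dir / "posters" / poster_jpg.name)
    shutil.copy2(mets, work_dir / mets.name)

    # make_bag moves the existing contents under data/ and writes the manifests.
    return bagit.make_bag(str(work_dir), {"Source-Organization": "Wellcome Collection"})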

2. Create new Wellcome Film workflow for MXF and MP4 files

Wellcome is now receiving digitised visual material as JPEG 2000 video in an MXF container, rather than as an MP2. We then transcode the MXF into an MP4 via MediaConvert.
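As a rough sketch of that transcode step (not Wellcome's actual MediaConvert configuration: the role ARN, bucket names, keys and encode settings below are all placeholders), a job submission via boto3 might look something like this:

import boto3

# MediaConvert needs an account-specific endpoint, discovered at runtime.
mc = boto3.client("mediaconvert", region_name="eu-west-1")
endpoint = mc.describe_endpoints()["Endpoints"][0]["Url"]
mc = boto3.client("mediaconvert", region_name="eu-west-1", endpoint_url=endpoint)

job = mc.create_job(
    Role="arn:aws:iam::123456789012:role/MediaConvertRole",  # placeholder role
    Settings={
        "Inputs": [
            {
                "FileInput": "s3://example-av-masters/b1234567x.mxf",  # placeholder
                "AudioSelectors": {"Audio Selector 1": {"DefaultSelection": "DEFAULT"}},
            }
        ],
        "OutputGroups": [
            {
                "OutputGroupSettings": {
                    "Type": "FILE_GROUP_SETTINGS",
                    "FileGroupSettings": {"Destination": "s3://example-av-derivatives/"},
                },
                "Outputs": [
                    {
                        "ContainerSettings": {"Container": "MP4", "Mp4Settings": {}},
                        "VideoDescription": {
                            "CodecSettings": {
                                "Codec": "H_264",
                                "H264Settings": {
                                    "RateControlMode": "QVBR",
                                    "QvbrSettings": {"QvbrQualityLevel": 8},
                                    "MaxBitrate": 8000000,
                                },
                            }
                        },
                        "AudioDescriptions": [
                            {
                                "CodecSettings": {
                                    "Codec": "AAC",
                                    "AacSettings": {
                                        "Bitrate": 160000,
                                        "CodingMode": "CODING_MODE_2_0",
                                        "SampleRate": 48000,
                                    },
                                }
                            }
                        ],
                    }
                ],
            }
        ],
    },
)
print("Submitted MediaConvert job", job["Job"]["Id"])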

We would like to be able to ingest this MXF, MP4, and JPG poster image into one process in Goobi. Goobi should write metadata about these files and then bag them and send them to the storage service. (No normalization will occur in Goobi).

Work with Digirati will be needed to help DDS understand which file it should use to create its own MP4 (it is unclear whether it should create this from the MXF or from the high-def MP4 we created, but probably the latter).

3. S3 Upload of A/V

Currently, the wellcomecollection-workflow-upload bucket that is used to upload items into Goobi has a trigger on it that fires when it receives a .zip file. If the .zip file has the same name as a process in Goobi, Goobi imports the images into that process. This works well for our photography uploads.
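For context, a suffix-filtered trigger of this kind is typically wired up as an S3 bucket notification pointing at a Lambda function; the snippet below is only a guess at the shape of that wiring, with a placeholder function ARN, not the real configuration:

import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="wellcomecollection-workflow-upload",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "notify-goobi-on-zip-upload",  # illustrative name
                # Placeholder ARN; the real target is the existing upload lambda.
                "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:goobi-upload",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".zip"}]}
                },
            }
        ]
    },
)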

However, most A/V files are already in S3, and we are hoping to avoid having to download these from S3, zip them together, and re-upload them to the wellcomecollection-workflow-upload bucket.

We would like another trigger for A/V items that does not involve a .zip file in order to transfer items within the cloud.

One suggestion is that a folder with the same name as a Goobi process could be created in the wellcomecollection-workflow-upload bucket; once it contains either an MP2 and a JPG, or an MXF, an MP4, and a JPG, it would be considered complete and uploaded into Goobi.
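A minimal sketch of how that folder-based check could work in the upload lambda, assuming the folder name matches the Goobi process name; the notify_goobi stub is a placeholder for whatever endpoint Goobi ends up exposing, and the extension sets are just the combinations described above:

import urllib.parse

import boto3

s3 = boto3.client("s3")

# Extension combinations that make a process folder "complete".
COMPLETE_SETS = [
    {".mpg", ".jpg"},          # current Wellcome Film: MP2 plus poster image
    {".mxf", ".mp4", ".jpg"},  # new workflow: master, access copy, poster image
]


def folder_is_complete(bucket: str, prefix: str) -> bool:
    """True if the folder contains one of the expected file combinations."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    extensions = {
        "." + obj["Key"].rsplit(".", 1)[-1].lower()
        for obj in resp.get("Contents", [])
        if "." in obj["Key"]
    }
    return any(required <= extensions for required in COMPLETE_SETS)


def notify_goobi(process_name: str) -> None:
    """Placeholder for calling whatever A/V import endpoint Goobi provides."""
    print(f"Process {process_name} is complete, notifying Goobi")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        process_name = key.split("/", 1)[0]  # folder name == Goobi process name
        if folder_is_complete(bucket, process_name + "/"):
            notify_goobi(process_name)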

rsehr commented 4 years ago

1.) Can you send us a METS file that contains both an MPEG and a JPEG? Is the JPEG a second object that is mentioned in the structMap/structLink area, or is it a second fileGrp for the single video file?

If it is a second file, most of the workflow is already done and should work out of the box. However, I don't think the bagit creation tool allows JPEG files at the moment, so we would need to implement PREMIS metadata extraction for JPEGs and allow JPEGs to go into the zipped bag. (A sketch of the values involved follows after point 3 below.)

2.) Can you send us an example including the expected folder structure and a sample METS file? I guess all files will end up in the data/ directory, but it is unclear how the METS file should be designed. As above, are the three files mentioned in the structMap as separate entries or as a single object with three different representations?

3.) Currently we have a lambda function that calls Goobi as soon as a file is uploaded. The import itself (extracting, moving to the right destination, etc.) is done by Goobi. I think we must create a second endpoint in Goobi to handle A/V uploads. We must also adapt the lambda function so that it also checks the contents of the folder. I am not sure whether we can implement the completeness check in the lambda function or whether it must be done in Goobi.
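Regarding point 1.), the technical values a PREMIS entry for a JPEG would typically carry (size, fixity, format, dimensions) are straightforward to gather; the following is a sketch of the idea only, not the actual bagit creation tool, and the dict keys are just illustrative labels:

import hashlib
from pathlib import Path

from PIL import Image  # pip install Pillow


def jpeg_premis_values(path: Path) -> dict:
    """Collect size, fixity and basic image characteristics for a JPEG file."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    with Image.open(path) as img:
        width, height = img.size
        format_name = img.format  # "JPEG"
    return {
        "file": path.name,
        "size_bytes": path.stat().st_size,
        "message_digest_algorithm": "SHA-256",
        "message_digest": digest,
        "format_name": format_name,
        "mimetype": "image/jpeg",
        "width": width,
        "height": height,
    }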

aray-wellcome commented 4 years ago

@rsehr

  1. Here are XML samples from a film, and then from a film plus a transcript.

b16747823.txt b16747823_0001.txt b16747823_0002.txt b21650020.txt

The poster image is mentioned in the techMD section of the METS for the .mpg.

aray-wellcome commented 4 years ago
  2. I'm asking Digirati about the second point.

  3. That sounds about right to me in terms of the new endpoint and the new lambda. Would we need a different bucket?

On another point: these films don't have transcripts at the moment, but we'd like the ability to add transcripts later, though not as an MMO. Do we have any options for that we could discuss?

aray-wellcome commented 4 years ago

Re: number 2

Ashley Ray 12:26 PM
@tomcrane We are looking at changing some of our workflows in Goobi for A/V material. First, we are going to get them to add the poster image into the ingest. Second, for material we're getting digitised now, we're going to want to ingest an MXF, an MP4 and a JPG poster image through Goobi for one process, and have DDS read only the MP4 (and convert it in DLCS) and the JPG. Intranda needs to know what the METS would look like for this in order to start considering the work. Do you have an idea what DDS would need in terms of METS for this?

tomcrane 12:34 PM
Hi @ashleyray I think the DDS should just adapt to whatever makes sense in the METS for digital preservation, workflow etc. The example given in https://github.com/wellcometrust/platform/issues/3718 is the MVP arrived at for the migration, just to get poster images into the METS and into a bag, for storage. At the moment, the DDS understands METS that looks like that, but it's deliberately avoiding going any further than it needs. You could take that as a start, or have a completely different approach; it shouldn't be very complicated to change the way the DDS reads the METS to determine what the poster image is and where it lives in storage. (edited)
12:36 This comment, specifically, and the few that follow: https://github.com/wellcometrust/platform/issues/3718#issuecomment-516326742 (edited)
12:36 What's missing is any reference to this techMD entry from the logical or physical structMap. That's the bit that Intranda can suggest a good model for: what's a METS-idiomatic way of including a poster image in the digital object description in METS?
12:38 So basically, decide what works best for you and Goobi, and then I can make the DDS understand that model.

Ashley Ray 12:45 PM
Ok, so we kind of have free rein then, haha.

tomcrane 12:51 PM
I trust you to create METS that is not so impenetrable that there's no way I can see what I need to pull out of it!
12:51 (challenge)

rsehr commented 4 years ago

We have basically two options for creating the METS files. The first would look like this:

<mets:fileSec>
  <mets:fileGrp USE="OBJECTS">
    <mets:file ID="FILE_0001_OBJECTS" MIMETYPE="video/mpeg">
      <mets:FLocat LOCTYPE="URL" xlink:href="objects/some_name.mpg" />
    </mets:file>
  </mets:fileGrp>
  <mets:fileGrp USE="POSTER">
    <mets:file ID="FILE_0001_POSTER" MIMETYPE="image/jpeg">
      <mets:FLocat LOCTYPE="URL" xlink:href="objects/some_name.jpg" />
  </mets:fileGrp>
</mets:fileSec>
<mets:structMap TYPE="LOGICAL">
  <mets:div ADMID="AMD" DMDID="DMDLOG_0000" ID="LOG_0000" LABEL="main title" TYPE="Video" />
</mets:structMap>
<mets:structMap TYPE="PHYSICAL">
  <mets:div DMDID="DMDPHYS_0000" ID="PHYS_0000" TYPE="physSequence">
    <mets:div ADMID="AMD_0001" ID="PHYS_0001" ORDER="1" ORDERLABEL=" - " TYPE="page">
      <mets:fptr FILEID="FILE_0001_OBJECTS" />
      <mets:fptr FILEID="FILE_0001_POSTER" />
    </mets:div>
  </mets:div>
</mets:structMap>
<mets:structLink>
  <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0001" />
</mets:structLink>

This means you still have one physical object, but two different representations of it. With MP4, MXF and JPEG you would have three fileGrps. The problem with this solution is that you can assign only one mets:amdSec to it, so you cannot attach separate administrative metadata to each individual file.

The second solution would look like this:

<mets:fileSec>
  <mets:fileGrp USE="OBJECTS">
    <mets:file ID="FILE_0001_OBJECTS" MIMETYPE="video/mpeg">
      <mets:FLocat LOCTYPE="URL" xlink:href="objects/some_name.mpg" />
    </mets:file>
    <mets:file ID="FILE_0002_OBJECTS" MIMETYPE="image/jpeg">
      <mets:FLocat LOCTYPE="URL" xlink:href="objects/some_name.jpg" />
  </mets:fileGrp>
</mets:fileSec>
<mets:structMap TYPE="LOGICAL">
  <mets:div ADMID="AMD" DMDID="DMDLOG_0000" ID="LOG_0000" LABEL="main title" TYPE="Video" />
</mets:structMap>
<mets:structMap TYPE="PHYSICAL">
  <mets:div DMDID="DMDPHYS_0000" ID="PHYS_0000" TYPE="physSequence">
    <mets:div ADMID="AMD_0001" ID="PHYS_0001" ORDER="1" ORDERLABEL=" - " TYPE="page">
      <mets:fptr FILEID="FILE_0001_OBJECTS" />
    </mets:div>
    <mets:div ADMID="AMD_0002" ID="PHYS_0002" ORDER="2" ORDERLABEL=" - " TYPE="page">
      <mets:fptr FILEID="FILE_0002_OBJECTS" />
    </mets:div>
  </mets:div>
</mets:structMap>
<mets:structLink>
  <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0001" />
  <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0002" />
</mets:structLink>

Now you have several physical objects, one for each file type. Semantically it means that they have a sequence and an order, so it is more than just different representations of the same object. But we have the option to assign a separate amdSec to each file. Therefore I would prefer this solution.
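To make the amdSec point concrete, one way the per-file administrative metadata could be laid out in this second model; the PREMIS fields below are illustrative placeholders rather than Goobi's actual output, and the techMD IDs line up with the ADMID values used on the physical divs above:

<mets:amdSec ID="AMD">
  <mets:techMD ID="AMD_0001">
    <mets:mdWrap MDTYPE="PREMIS">
      <mets:xmlData>
        <premis:object xsi:type="premis:file">
          <premis:objectIdentifier>
            <premis:objectIdentifierType>local</premis:objectIdentifierType>
            <premis:objectIdentifierValue>objects/some_name.mpg</premis:objectIdentifierValue>
          </premis:objectIdentifier>
          <premis:objectCharacteristics>
            <premis:size>1234567890</premis:size>
            <premis:format>
              <premis:formatDesignation>
                <premis:formatName>video/mpeg</premis:formatName>
              </premis:formatDesignation>
            </premis:format>
          </premis:objectCharacteristics>
        </premis:object>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:techMD>
  <mets:techMD ID="AMD_0002">
    <!-- same structure for objects/some_name.jpg, with MIME type image/jpeg -->
  </mets:techMD>
</mets:amdSec>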

tomcrane commented 4 years ago

This is fine for me, because I can extract the necessary information. I have one question though - does the METS need to convey in some formal way that some_name.jpg is the poster image for some_name.mpg? Is it possible that a package like this could comprise a video file and an image for some other reason, where the image is not supposed to be the poster for the video, and that any representation of the object should show them both as equally important things?

This may be a non-real-world scenario and nothing to worry about.

tomcrane commented 4 years ago

For a user encountering the object, the behaviour of a poster image is very different from the behaviour of a thing with two media that are equally valid and choose-able.

aray-wellcome commented 4 years ago

> This is fine for me, because I can extract the necessary information. I have one question though - does the METS need to convey in some formal way that some_name.jpg is the poster image for some_name.mpg? Is it possible that a package like this could comprise a video file and an image for some other reason, where the image is not supposed to be the poster for the video, and that any representation of the object should show them both as equally important things?
>
> This may be a non-real-world scenario and nothing to worry about.

I can't think of any other JPG we would put in with a film besides a poster image.

In the case of the new A/V workflow, we will have an MXF and an MP4, and we'll want the poster image associated with the MP4, as that's the one we'll want DDS to use.

aray-wellcome commented 4 years ago

> Now you have several physical objects, one for each file type. Semantically it means that they have a sequence and an order, so it is more than just different representations of the same object. But we have the option to assign a separate amdSec to each file. Therefore I would prefer this solution.

I think several physical objects, with an amdSec assigned to each, is probably the way to go, too. If we needed to include transcripts in the ingest, this would probably work best for us in terms of writing the METS too, right?
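If transcripts are added later, the second model would extend naturally with a third file and a third physical div; the fragment below is a sketch assuming a PDF transcript (the format and file name are assumptions), slotting into the fileSec, physical structMap and structLink shown above:

<!-- additional file in the OBJECTS fileGrp -->
<mets:file ID="FILE_0003_OBJECTS" MIMETYPE="application/pdf">
  <mets:FLocat LOCTYPE="URL" xlink:href="objects/some_name_transcript.pdf" />
</mets:file>

<!-- additional entry in the physical structMap -->
<mets:div ADMID="AMD_0003" ID="PHYS_0003" ORDER="3" ORDERLABEL=" - " TYPE="page">
  <mets:fptr FILEID="FILE_0003_OBJECTS" />
</mets:div>

<!-- additional link in the structLink section -->
<mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0003" />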