wellcomecollection / platform

Wellcome Collection Digital Platform
https://developers.wellcomecollection.org/
MIT License

New METS structure for audio and video ingests (File Group and Usage Attributes) #4788

Open aray-wellcome opened 3 years ago

aray-wellcome commented 3 years ago

Background

Audio and video ingests can have 1) multiple audio or video files, 2) transcripts, and/or 3) poster images.

To handle multiple files or transcripts, the item used to be ingested into Goobi (via an old workflow we no longer use) as a multiple manifestation: each file or transcript had its own METS, and they were tied together by an anchor file. The poster image was picked up from a local network drive.

This no longer works for us for two reasons:

1) For the storage service migration, the poster images for the audio and video ingests were included in the bag and are now served from the Storage Service. The local network drive for poster images should no longer be used, but we have no way to add the poster images into the bag through Goobi.

2) We no longer want to create multiple manifestations for audio and video ingests because they are not true multiple manifestations. We instead want everything together in the same ingest.

New A/V Ingest Workflow and METS

Working with Intranda, we have re-created our entire A/V workflow in staging Goobi to allow for multiple audio or video files, transcripts, and poster images to be ingested together through Goobi and be stored in one bag in the Storage Service.

This change means that the METS for audio and video ingests has had to be re-written as well. It now includes file groups with use attributes for DDS to use. DDS needs to be updated to be able to understand and use the FileGroups in the new METS.

File Group and Use Explanations:

File Group: Masters, Use: Preservation
This file group will only be seen in the METS for ingests that include an MXF file. The MXF is too large to use, so DDS should ignore this file in the METS; the MXF will be sent to Deep Store in the Storage Service.

File Group: Objects, Use: Access
This file group will contain files like MP4s (which will be bagged with an MXF) and MPGs for video ingests, and WAVs and MP3s for audio ingests. These are the files that DDS/DLCS should use to serve the file to the user.

File Group: Poster Image, Use: Poster
This file group will contain JPG poster images (for now it will be JPG) for audio or video ingests. DDS should apply this JPG to the audio or video file as a placeholder image before the file is played.

File Group: Transcript, Use: Transcript
This file group will contain the PDF transcript (for now it will be PDF) for audio or video ingests. DDS should display the PDF transcript alongside the audio or video file for the user.

File Ingest Combinations
Ingests will vary depending on what is available for the audio or video file at the time. While all ingests will have the Objects File Group, some may not have the Transcript File Group and/or the Poster Image File Group. Only MXF ingests will have the Masters File Group.
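
As a rough illustration of the file group rules above, the sketch below reads the `USE` attribute of each `mets:fileGrp` and maps every file to what DDS is expected to do with it. It is written in Python for brevity and is not the DDS implementation. The sample XML, the file IDs and paths, the handling labels, and every `USE` string other than `ACCESS` (which is confirmed later in this thread) are assumptions; the attached sample METS files are the authority on the real values.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical mapping from fileGrp USE to the handling described above.
# Only "ACCESS" is confirmed in this thread; the other keys are assumed.
HANDLING_BY_USE = {
    "PRESERVATION": "ignore",    # Masters (MXF): goes to deep storage only
    "ACCESS": "serve",           # MP4 / MPG / WAV / MP3
    "POSTER": "poster",          # JPG placeholder shown before playback
    "TRANSCRIPT": "transcript",  # PDF shown alongside the media
}

# Made-up METS fragment with the "full house" of file groups.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                          xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="PRESERVATION">
      <mets:file ID="FILE_0001_PRESERVATION">
        <mets:FLocat LOCTYPE="URL" xlink:href="objects/example.mxf"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="ACCESS">
      <mets:file ID="FILE_0001_ACCESS">
        <mets:FLocat LOCTYPE="URL" xlink:href="objects/example.mp4"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="POSTER">
      <mets:file ID="FILE_0001_POSTER">
        <mets:FLocat LOCTYPE="URL" xlink:href="posters/example.jpg"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="TRANSCRIPT">
      <mets:file ID="FILE_0001_TRANSCRIPT">
        <mets:FLocat LOCTYPE="URL" xlink:href="transcripts/example.pdf"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def classify_files(mets_xml):
    """Return {file path: handling} for every file in every file group."""
    root = ET.fromstring(mets_xml)
    handling_by_path = {}
    for group in root.findall(".//mets:fileGrp", NS):
        use = (group.get("USE") or "").upper()
        handling = HANDLING_BY_USE.get(use, "unknown")
        for file_el in group.findall("mets:file", NS):
            locat = file_el.find("mets:FLocat", NS)
            if locat is not None:
                handling_by_path[locat.get(XLINK_HREF)] = handling
    return handling_by_path


for path, handling in classify_files(SAMPLE_METS).items():
    print(f"{path}: {handling}")
```

Because Transcript, Poster Image and Masters groups are optional, anything keyed on them should tolerate their absence; only the access group can be relied on.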

XML Samples

I have attached samples of different METS File Group combinations below:

Film

b30496160.xml = MXF, MP4, JPG, PDF (Filegroups: Master, Objects, Poster, Transcript) b30496160.txt

b30496020.xml = MPG, JPG, PDF (Filegroups: Objects, Poster, Transcript) b30496020.txt

Audio

b30655729.xml = WAV, PDF (Filegroups: Objects, Transcript) b30655729.txt

b31630327.xml = 2 WAV, PDF, docstruct labels in the Logical metadata (Filegroups: Objects, Transcript) b31630327.txt

b22488522.xml = 7 WAV, PDF (Filegroups: Objects, Transcript) b22488522.txt

b30655730.xml = WAV, PDF (Filegroups: Objects, Transcript) b30655730.txt

b32494361.xml = WAV, JPG (Filegroups: Objects, Poster) b32494361.txt

b2914615x.xml = 2 MP3, docstruct labels in the Logical metadata (Filegroups: Objects) b2914615x.txt

Notes

We are testing some structural metadata labels for the audio files in the Logical section of the METS. We are waiting to see what the audio looks like on DDS before deciding if a change is needed for those.

Also, these METS do not yet include the width, height, or parsed duration values, which are being discussed in #4777.

tomcrane commented 3 years ago

@aray-wellcome do you plan to retrospectively apply this to existing AV, replacing the old METS with things that look like this?

I want to get the existing AV working in new IIIF Builder as a baseline, to verify we've got the logic we need out of old DDS, before handling the new model. But it would be nice if everything used the new model.

Would any AV still be a Multiple Manifestation in the new model? I know most things wouldn't be, but would there ever be a complex film or set of films that have one b number but would still make sense as a IIIF Collection? (I suspect not).

At the moment, the naive METS to IIIF mapping has to be stopped from building a IIIF Collection for Video+Transcript MMs; we go back and gather up the outputs and build a single IIIF Manifest with a single Canvas.

aray-wellcome commented 3 years ago

We have no plans at this moment to retrospectively apply this to the ~1000 AV items we already have in there. This would be for moving forward (though, if we ever had to re-ingest an old item, we'd re-ingest it via the new METS). As such, we'd need support for the old AV METS and the new AV METS, which is almost definitely not what you want to hear!

But we don't foresee any reason in the future with the new A/V workflow where we'd have a MM. Everything that should be together should be together in the bag.

tomcrane commented 3 years ago

For MMs, they'd still be together in the bag but described in more than one METS/Anchor file.

So we do need to support old and new.

Now that we have examples available in staging, I think we do want to build in support for this before we finish the DDS migration project, so that in a big initial population, this model is supported.

In the current chunk of work, we should get existing AV working (as in the correct IIIF built). I think we should get the new flow working as the first task in the next chunk.

aray-wellcome commented 3 years ago

Update: Goobi now uses MediaInfo to analyze the a/v files and write the duration, width, and height in a consistent way. The duration is in the format seconds.milliseconds, and height and width just give the number of pixels, with no suffix. These can be adjusted if necessary.

Example:

b2923721x (wav files) b2923721x.txt

b3216273x (mxf and mp4) b3216273x.txt
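
For anyone consuming these values, here is a minimal Python sketch of parsing the formats described above (duration as seconds.milliseconds, width and height as bare pixel counts). Where the values sit in the METS is not stated in this thread, so the functions simply take the raw strings; the function names and the sample values are made up for illustration.

```python
def parse_duration(value: str) -> float:
    """Parse a 'seconds.milliseconds' string into seconds."""
    return float(value.strip())


def parse_pixels(value: str) -> int:
    """Parse a bare pixel count (no 'px' suffix) into an int."""
    return int(value.strip())


def format_duration(seconds: float) -> str:
    """Render seconds as hh:mm:ss.mmm for display."""
    total_ms = round(seconds * 1000)
    minutes, ms = divmod(total_ms, 60_000)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{ms / 1000:06.3f}"


# Made-up sample values, not taken from the attached METS files.
duration = parse_duration("3725.5")
width, height = parse_pixels("1920"), parse_pixels("1080")
print(format_duration(duration), width, height)
```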

tomcrane commented 3 years ago

Morning @aray-wellcome

Some questions about the new workflow and its DDS/IIIF-Builder implementation

  1. Will the previous test approach to MXF files, from back in April, be used at all? This is where the MXF file, the MPEG and the poster image are all separate entries in the METS physical files section, rather than the new approach where there is a single entry in physical files, which points to more than one mets:File, differentiated by USE="xxx" attributes. There is a fair amount of complexity that could be eliminated if that previous additional model was not supported - just the new model and the original Multiple Manifestation approach.
  2. Are any of these in production storage? Can they be? Not essential but handy to test.
  3. The ones in staging storage don't quite match the examples given as text attachments above. They are all there (same b numbers) but they don't show the same arrangement of files. E.g., b30496160 in the test one above has a full house of Master, Access, Poster and Transcript but the METS file in storage for b30496160 only has Access and Transcript. I can test many aspects of this using the above test fixtures in isolation but some tests need to pull from storage; it would be good to have at least one in staging or production storage with the full house of files.
  4. It would be handy to have a safe way of quickly identifying a METS file as one that uses the new workflow. I think the presence of USE="ACCESS" on the filegroup that contains the access copy is a way of doing this (old workflows have USE="OBJECTS"). Does this sound right?

aray-wellcome commented 3 years ago

Morning, @tomcrane

  1. The test approach with all the items as separate entities in the METS physical files section will not be used. Moving forward we will only use the METS with the usage attributes USE="xxx" for ingests. The MM approach needs to be retained as we've used it to ingest 1200 items previously and it will allow us to update these as necessary.
  2. We don't have any of the new a/v ingests in production, no. I can get Intranda to install the new film workflow on production if you need this, though.
  3. We tested these items over and over, so I think they've gotten a little out of sync with what we wanted to show you. That said, I can ingest an MXF ingest and an MP2 ingest with all of the files present for you. As far as I'm aware we're only moving forward with the films to be put on old DDS as the audio is too complicated to deal with right now. If you need audio now though, please let me know. I'll let you know which bnumbers I've ingested with all the files ASAP as I need to get the info from Harkiran.
  4. Yes, that sounds right to me. The usage attribute on the new METS for a/v files will not have use=objects but use=access, and in this way we'll know it's the new METS.

tomcrane commented 3 years ago

Thanks @aray-wellcome !

That's good to know. This allows us to remove some over-complex code, as there are only 2 patterns to support, not 3 - and the new pattern is an extension of the original one (ALTOs are referenced in a similar way) rather than an alternative approach with some Physical files that need to be ignored. It feels nicer for sure, the model for ALTO just clicks into place with the MXF/Poster/Transcript use.
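
The workflow check agreed above (USE="ACCESS" on the access file group means new workflow, USE="OBJECTS" means old) could be sketched roughly as follows. This is Python for illustration only, not the DDS code; the function name and both sample fragments are made up, and only the two USE values come from this thread.

```python
import xml.etree.ElementTree as ET

METS_FILEGRP = "{http://www.loc.gov/METS/}fileGrp"


def is_new_av_workflow(mets_xml: str) -> bool:
    """True if any file group carries USE="ACCESS" (new-workflow marker)."""
    root = ET.fromstring(mets_xml)
    uses = {(group.get("USE") or "").upper() for group in root.iter(METS_FILEGRP)}
    return "ACCESS" in uses


# Made-up minimal fragments: one per workflow.
NEW_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec><mets:fileGrp USE="ACCESS"/></mets:fileSec>
</mets:mets>"""

OLD_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec><mets:fileGrp USE="OBJECTS"/></mets:fileSec>
</mets:mets>"""

print(is_new_av_workflow(NEW_METS), is_new_av_workflow(OLD_METS))
```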

I need a word to describe these variants. The one we are interested in in the DDS is the access copy, which for old workflow is the only one present. What do we call the files that sit alongside? Derivative is the wrong word - the MXF is not a derivative of the MPEG. Variant isn't right either. We need a term that will make sense in the code, and I've picked adjunct - but if there is a better term that you might use in a preservation context I'll use that.

aray-wellcome commented 3 years ago

@tomcrane I have got an MXF ingest and an MPG ingest with full file loads making their way into the storage staging system. I'll let you know when they are available.

I'm glad the method we fixed clicks in well!

As for the variants, do you mean you need a word for, say, what the MXF, poster image, and transcript are collectively compared to the MP4 for access? These would collectively be called adjuncts? We have just been calling the MXF a master file and then hand waving at the poster image and transcript to include them in a conversation, no real terms yet haha.

tomcrane commented 3 years ago

I'll stick with adjunct. It's so that code makes sense...

foreach (var adjunct in physicalFile.GetAdjuncts())...
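
To make the "adjunct" idea concrete, here is a small Python sketch of the same shape: a physical file holds one access copy plus zero or more adjuncts. The class names, the fields, and the rule that both ACCESS and OBJECTS count as the access copy are assumptions for illustration; this is not the actual DDS model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed: the access copy is USE="ACCESS" (new workflow) or
# USE="OBJECTS" (old workflow); everything else is an adjunct.
ACCESS_USES = {"ACCESS", "OBJECTS"}


@dataclass
class StoredFile:
    use: str   # fileGrp USE value the file belongs to
    path: str  # relative path inside the bag (made-up values below)


@dataclass
class PhysicalFile:
    files: List[StoredFile] = field(default_factory=list)

    def get_access_file(self) -> Optional[StoredFile]:
        """The file DDS is primarily interested in, if present."""
        return next((f for f in self.files if f.use.upper() in ACCESS_USES), None)

    def get_adjuncts(self) -> List[StoredFile]:
        """Files that sit alongside the access copy (MXF, poster, transcript)."""
        return [f for f in self.files if f.use.upper() not in ACCESS_USES]


physical_file = PhysicalFile([
    StoredFile("PRESERVATION", "objects/example.mxf"),
    StoredFile("ACCESS", "objects/example.mp4"),
    StoredFile("POSTER", "posters/example.jpg"),
    StoredFile("TRANSCRIPT", "transcripts/example.pdf"),
])

for adjunct in physical_file.get_adjuncts():
    print(adjunct.use, adjunct.path)
```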

aray-wellcome commented 3 years ago

@tomcrane Sorry for the delay but we got two clean ingests through overnight.

b32248398 has the mxf, mp4, pdf, and jpg b32248398.txt

b30496111 has the mpg, pdf, and jpg b30496111.txt

Both are now in the wellcomecollection-storage-staging bucket. Just to note, they're both ingested as open to make life easier for testing but they should really be restricted.

Would you like me to kick off more ingests with the complete file sets to use? When do you need an example on prod? Intranda says it'll take a day or two to get the new a/v workflows on prod, so I'm just trying to figure out timing among our other work.

tomcrane commented 3 years ago

Don't worry about getting it on prod yet, I'll develop and test against these ones (and the existing ones) in the staging environment. The only reason to have some on prod too was to be able to test old and new things side by side - regression testing of the new sync/dashboard/iiif-generation code - and I think stage storage was cleared of old stuff.

An alternative would be to replicate a few example audio, video, book and archive items on stage-storage alongside these new workflow examples.

aray-wellcome commented 3 years ago

We do have a few things in staging-storage for you:

Book from Internet Archive: b19677911
Archive item: sa_eug_c_172_box_17_b16232665 (on its way to the storage service, at least, currently)
Manuscript: ms_4263_b19367454

We don't have any audio or video that's from the old workflow, though.

tomcrane commented 3 years ago

Morning @aray-wellcome and sorry for labouring this point, just want to be 100% sure before I start merrily ripping out code on Monday.

Is it true to say that there are NEVER entries in the METS physical files list that should be ignored by the DDS?

In the older first-attempt MXF workflow, MXF files had their own entries in the physical files list. The DDS ignored these physical file entries. Now, in the new version, MXF files are one of the set of file pointers for a physical file entry. The DDS ignores that particular pointer, but doesn't ignore the physical file element itself because one of the other pointers is the mpeg, and (sometimes) one is the transcript.

So now, some of the actual files pointed at from a physical file element can be ignored by the DDS (like MXF), and/or not sent to DLCS (like ALTO), but there will always be at least one mets:filePtr that needs to be sent to the DLCS, in the 1, 2, 3 or 4 file pointers in any given physical file element - either a JP2, or an mpeg|wav|etc, or a transcript.

I'm 99.9% sure this is true, just quadruple-checking.

Why this is a nice simplification will become more apparent when you see the dashboard UI.

aray-wellcome commented 3 years ago

Hi @tomcrane, I had to read this a few times, but what you're saying is correct here and will be the only way we're moving forward on this:

Now, in the new version, MXF files are one of the set of file pointers for a physical file entry. The DDS ignores that particular pointer, but doesn't ignore the physical file element itself because one of the other pointers is the mpeg, and (sometimes) one is the transcript.

So now, some of the actual files pointed at from a physical file element can be ignored by the DDS (like MXF), and/or not sent to DLCS (like ALTO), but there will always be at least one mets:filePtr that needs to be sent to the DLCS, in the 1, 2, 3 or 4 file pointers in any given physical file element - either a JP2, or an mpeg|wav|etc, or a transcript.

The DDS should ignore any file under the Masters pointer in the Physical file element. In this case, the masters pointer points to the MXF file but in the distant future or in a different workflow could be another type of file.
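
The rule confirmed above can be sketched in Python as: resolve each physical element's file pointers to their file group USE, drop the ones DDS skips, and treat "nothing left" as an error. This is only an illustration under assumptions. In the METS schema the pointer element is `mets:fptr` with a `FILEID` attribute; the USE strings `PRESERVATION` and `ALTO`, the IDs, and the sample XML are made up, and the real DDS logic may differ.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
METS_DIV = "{http://www.loc.gov/METS/}div"

# Assumed USE values for pointers that are skipped: Masters files are
# ignored by DDS, ALTO is not sent to DLCS.
SKIPPED_USES = {"PRESERVATION", "ALTO"}

SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="PRESERVATION"><mets:file ID="FILE_0001_PRESERVATION"/></mets:fileGrp>
    <mets:fileGrp USE="ACCESS"><mets:file ID="FILE_0001_ACCESS"/></mets:fileGrp>
    <mets:fileGrp USE="TRANSCRIPT"><mets:file ID="FILE_0001_TRANSCRIPT"/></mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:div ID="PHYS_0001">
        <mets:fptr FILEID="FILE_0001_PRESERVATION"/>
        <mets:fptr FILEID="FILE_0001_ACCESS"/>
        <mets:fptr FILEID="FILE_0001_TRANSCRIPT"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def pointers_to_process(mets_xml):
    """Return {physical div ID: [file IDs left after skipping]}."""
    root = ET.fromstring(mets_xml)

    # Map every file ID to the USE of the file group that contains it.
    use_by_id = {}
    for group in root.findall(".//mets:fileGrp", NS):
        for file_el in group.findall("mets:file", NS):
            use_by_id[file_el.get("ID")] = (group.get("USE") or "").upper()

    result = {}
    for struct_map in root.findall("mets:structMap", NS):
        if struct_map.get("TYPE") != "PHYSICAL":
            continue
        for div in struct_map.iter(METS_DIV):
            file_ids = [p.get("FILEID") for p in div.findall("mets:fptr", NS)]
            if not file_ids:
                continue  # container div (e.g. the sequence), no pointers
            kept = [fid for fid in file_ids if use_by_id.get(fid) not in SKIPPED_USES]
            if not kept:
                # Would contradict the invariant confirmed in this thread.
                raise ValueError(f"{div.get('ID')} has no file to process")
            result[div.get("ID")] = kept
    return result


print(pointers_to_process(SAMPLE_METS))
```

Raising on an empty result is one possible design choice: it surfaces any METS that breaks the "never ignore a whole physical file element" assumption instead of silently dropping the element.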

I hope that helped confirm and didn't confuse you more...!