wellcomecollection / goobi-infrastructure

Wellcome Collection digital workflow infrastructure
MIT License

Goobi to MediaGraph API #440

Closed aray-wellcome closed 5 months ago

aray-wellcome commented 1 year ago

TandemVault has been allowing us to test their new product MediaGraph, which is an updated version of TandemVault. Our instance is at https://mediagraph.io/wellcome

MediaGraph has their new API docs here https://docs.mediagraph.io/

For the moment, we will need both APIs (one to MediaGraph, one to TandemVault) working as we continue to test MediaGraph.

The MediaGraph API should be used in nearly the same way as TandemVault's, but I think we need a few changes based on other things happening with this workflow outside of Goobi:

  1. I think we should split up Ad Hoc and Editorial Photography ingests to the EP workflow by changing prefixes. Ad Hoc would be AH and Editorial Photography would be EP
  2. If the above happens, we'll need to make sure that saving these items in the S3 bucket works with the new prefix

In MediaGraph, normally all uploads from Goobi would be dumped into one Uploads folder. This won't work for us as the folder would get too large to open properly so the MediaGraph devs want to use the metadata we send in (Shoot type) to decide what folder it should go in. I know Shoot Types for Ad Hoc will say Digitisation. Shoot types like Editorial, Events, Exhibitions, Objects, Portraits would go into the Editorial Photography folder.
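The Shoot type rule described above could be sketched roughly as follows. This is only an illustration: `folder_for_shoot_type` is a hypothetical helper, not part of Goobi or the MediaGraph API, and the ad hoc folder name is an assumption (only the Editorial Photography folder is named above).

```python
# Hypothetical sketch: choose a MediaGraph folder from the Shoot type value
# in the shoot export csv. Folder names are assumptions, not confirmed names.
AD_HOC_FOLDER = "Ad Hoc Digitisation"      # assumed name
EDITORIAL_FOLDER = "Editorial Photography"

# Shoot types listed in the comment above as editorial work.
EDITORIAL_SHOOT_TYPES = {"editorial", "events", "exhibitions", "objects", "portraits"}


def folder_for_shoot_type(shoot_type: str) -> str:
    """Return the target folder for a Shoot type, ignoring case and padding."""
    normalised = shoot_type.strip().lower()
    if normalised == "digitisation":
        return AD_HOC_FOLDER
    if normalised in EDITORIAL_SHOOT_TYPES:
        return EDITORIAL_FOLDER
    # Fail loudly rather than dropping the shoot into a catch-all folder.
    raise ValueError(f"Unrecognised Shoot type: {shoot_type!r}")
```

Raising on an unknown Shoot type is a design choice for the sketch; the real integration might prefer a fallback folder instead.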

I've been in touch with MediaGraph and they think the best way to work through this would be for you to contact them so you can work together: "Please feel free to have the Intranda team reach out directly to our CTO, Nick Merwin: nick@MediaGraph.io. He will work with them to make sure assets ingest into the correct storage folders in MediaGraph."

aray-wellcome commented 1 year ago

How the Editorial Photography Workflow works at present with TandemVault

(To the best of my knowledge)

  1. A digitization or editorial request comes in via a form to our LightBlue software which automatically assigns an EP shoot number (EP_000XXX)
  2. Once the digitization job or editorial photoshoot is completed the images (usually TIFFs) are zipped up with a shoot export.csv from LightBlue. Goobi is expecting this shoot.csv to have the following headers but only the asterisked items must have a value:

Attached are two examples: Shoot export 2022-10-25 10-40-27.csv and Shoot export 2022-10-04 10-58-35.csv

  3. The zipped package is uploaded to the S3 bucket /wellcomecollection-workflow-upload/editorial where a Lambda triggers and sends the zip to Goobi.

  4. Goobi then runs the item through the Editorial Photography Workflow [image]. Within this workflow the metadata from the shoot export csv is put on the images (I think?) and JPGs are made from the TIFFs.

  5. The JPGs are packaged up and sent to TandemVault, where an upload set is created based on the title and reference from the shoot export csv. All filled-out headers from the shoot export csv should be applied in TandemVault in this section: [image]

  6. The shoot export csv and master TIFFs are sent to the S3 bucket wellcomecollection-editorial-photography and stored under folders based on the last two digits of the EP number.
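The "last two digits" folder rule for the masters bucket might look like the sketch below. The exact key layout inside wellcomecollection-editorial-photography is not stated in this thread, so the `NN/REFERENCE/` shape is an assumption, as is the hypothetical `storage_prefix` helper. It also accepts the proposed AH prefix so the same rule could cover both streams.

```python
import re

# Matches references such as EP_000123 or the proposed AH_001207.
REFERENCE_PATTERN = re.compile(r"^(EP|AH)_(\d{2,})$")


def storage_prefix(reference: str) -> str:
    """Build an S3 key prefix from the last two digits of a shoot number.

    The "NN/REFERENCE/" layout is an assumed illustration, not the
    confirmed layout of the real bucket.
    """
    match = REFERENCE_PATTERN.match(reference)
    if match is None:
        raise ValueError(f"Not an EP/AH reference: {reference!r}")
    digits = match.group(2)
    return f"{digits[-2:]}/{reference}/"
```

For example, `storage_prefix("EP_000123")` gives `23/EP_000123/` under this assumed layout.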

Updating TandemVault

There is a second workflow in Goobi that is used to update an upload set in TandemVault. The procedure is the same as above, but if a Goobi process already exists with the same EP number reference, the item is sent to the Editorial_Photography_Update workflow, which overwrites the images in the upload set in TandemVault and in the S3 wellcomecollection-editorial-photography bucket.

image

However, this workflow almost never works. I have to delete the items so a fresh ingest can go through Editorial_Photography instead.
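The routing rule between the two workflows can be sketched as below. The workflow names come from this comment; `existing_references` is a hypothetical stand-in for however Goobi looks up existing processes, which is not described here.

```python
def choose_workflow(reference: str, existing_references: set[str]) -> str:
    """Pick the Goobi workflow for an incoming shoot.

    existing_references is an assumed stand-in for a lookup of the
    references that already have a Goobi process.
    """
    if reference in existing_references:
        # A process with this EP number already exists: treat as an update.
        return "Editorial_Photography_Update"
    return "Editorial_Photography"
```

Under this rule, deleting the existing process first is what sends a re-upload back through Editorial_Photography as a fresh ingest.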

Deleting Editorial Photography ingests in TandemVault

If a fresh upload must be made, I delete:

  - the upload set from TandemVault from the Upload Set page
  - the folder of images and metadata from wellcomecollection-editorial-photography
  - the process in the Editorial_Photography workflow in Goobi

aray-wellcome commented 1 year ago

We had been planning to look at our Editorial Photography workflow for a while, but with MediaGraph appearing, we kind of dived into making changes along the whole workflow.

Two kinds of work in the Editorial_Photography Workflow

Essentially, the Editorial_Photography workflow in Goobi serves two streams of work:

  1. Ad hoc digitization - we offer free digitization services to enquirers that need something that we haven't digitized yet. It's small amounts of work but takes a lot of resources from the team. If there is something that is requested for digitization that cannot go on Wellcome Collection's site (because of copyright or only part of a book was digitized) this would go into TandemVault. The enquirer would be delivered a copy of the files in TandemVault via a Lightbox

  2. Editorial photography - we have an in-house photography team that takes photos of Wellcome events and staged photoshoots for marketing and Wellcome Stories. They store these shoots in TandemVault so that colleagues can easily access them for various uses, and as a way to archive them.

Changes

  1. EP numbering split into EP and AH numbering

Right now both ad hoc photography and editorial photography shoots use an EP number that is generated by our LightBlue shoot software. But ad hoc digitization orders will most likely be moved out of LightBlue and into a new system, Quickbase (still being built and tested). It's too difficult to continue using the EP numbering convention in another system, so ad hoc digitization ingests will most likely use an AH_00XXXX numbering convention.

Editorial Photography should continue using the EP numbering convention even if moved to a new software.

  2. Upload set changes

In TandemVault, both digitized and editorial photography shoots are put into Upload Sets together.

image

The only way you can tell whether something is an ad hoc digitized item or an editorial photography shoot is to click on the Upload set and then a photo.

Ad Hoc Digitization shoots have a tag that says Digitisation [image]

Editorial Photography shoots have a tag that says WEP (Wellcome Editorial Photography) [image]

This has worked fine for us, but MediaGraph no longer has Upload sets. Instead it has the File Vault. Our items from TandemVault were migrated to MediaGraph for us, but we quickly realized that the way they were imported wasn't working for us. The File Vault had everything arranged by who imported the items. The vast majority of imports were under Intranda's name, which basically made one giant folder that was unopenable.

We have since manually re-arranged the File Vault into CP items (Corporate Photography, a legacy project from years ago and shouldn't have anything added to it), EP - Digitisation Requests, and EP - WEP (Wellcome Editorial Photography)

image

When Goobi starts sending ingests in via the API, we need a way for Goobi and MediaGraph to work together to put the ingests in new folders, most likely called something like Ad Hoc Digitisation and Editorial Photography Shoots. In this way we'll avoid having a folder we can't open, at least for a while. It may be the case that we have to switch up the folder in the future as it fills up and slows down.

I suppose that they can be sorted by number, either AH or EP, or by the tags that get applied via the shoot export.csv in Shoot Type

  3. Saving masters

The EP/AH workflows should both still send JPGs to MediaGraph as they're smaller and easier to store. The original TIFFs should be sent to wellcomecollection-editorial-photography as usual for now. But it may be that we want to migrate all images from wellcomecollection-editorial-photography into the Wellcome Storage Service to sit in a space alongside wellcomecollection-storage/digitised. This means the new EP/AH workflow would need to be able to bag up the master TIFFs and metadata and store the bag after sending the JPGs in.

But as I said, this is still just something we're looking at, nothing is scheduled to happen as of me writing this. I just wanted to give a heads up so we can look at building the flexibility to do this in now if we need to.

aray-wellcome commented 1 year ago

How MediaGraph workflows should work

Or at least what I think so far...

  1. An Editorial Photography order comes through on LightBlue and gets an EP number/ an Ad Hoc order comes into Quickbase and gets an AH number

  2. Once the job is completed, the EP images get zipped up with a shoot export.csv from LightBlue/ the AH images get zipped up with a shoot export.csv from Quickbase (I can make Quickbase match LightBlue's. But I'm unsure if the API requires any headers/values to be changed to get the same information we need in MediaGraph)

  3. An EP zip/an AH zip is uploaded into wellcomecollection-upload-workflow/editorial (might need to make this a more generic title? Or leave as is) and the Lambda sends it to Goobi

  4. Goobi's workflow that sends things to MediaGraph handles the EP/AH item (I think that one workflow for this stuff should still work?). The workflow should still embed all metadata, make JPGs and send them to MG, and send the masters and shoot csv to S3 (likely still just the editorial-photography bucket at this stage)

  5. The EP item is received by MediaGraph and is put into the File Vault under Editorial Photography shoots with a folder title that has the EP number + the title from the shoot export.csv, like so: [image]

An AH item is received by MediaGraph and is put into the File Vault under Ad Hoc Digitisation with a folder title that has the AH number + the title from the shoot export.csv

  6. The masters and csvs sent to the editorial-photography bucket are stored under a folder based on their last two digits, as they are currently
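Step 5 above (parent folder chosen by prefix, folder title made from the number plus the title) could be sketched like this. The parent folder names follow the wording in step 5; the `/` separator, the space between number and title, and the `vault_folder_path` helper itself are assumptions, since the screenshot showing the real title format is not reproduced here.

```python
# Parent File Vault folders, named as in step 5 above (not confirmed names).
PARENT_FOLDERS = {
    "EP": "Editorial Photography shoots",
    "AH": "Ad Hoc Digitisation",
}


def vault_folder_path(reference: str, title: str) -> str:
    """Build an assumed File Vault path: parent folder / "<number> <title>"."""
    prefix, _, number = reference.partition("_")
    if prefix not in PARENT_FOLDERS or not number.isdigit():
        raise ValueError(f"Not an EP/AH reference: {reference!r}")
    # The title comes from the shoot export.csv; trim stray whitespace.
    return f"{PARENT_FOLDERS[prefix]}/{reference} {title.strip()}"
```

Keying on the prefix rather than the Shoot type tag only works for new ingests; older ad hoc shoots still carry EP numbers, so they would need the tag to be told apart.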

Updating

I have no idea how updating things with the new API and MediaGraph should work, but I think we should still have the ability to do so. It might have to work in the same way as the old TandemVault update workflow, but only if it's more stable.

Deleting

We do have things that we have to delete sometimes, either for reingest or because someone's asked for a shoot they're in to be deleted. I assume we could delete a folder in File Vault, S3 and Goobi as we do now?

aray-wellcome commented 1 year ago

Additional thoughts: Though most of the stuff we put through is going to be TIFFs, sometimes mp4s are put in as well so we should be able to handle those too.
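One way a workflow step might tell TIFFs and mp4s apart is by file extension, as in the sketch below. What should then happen to an mp4 (for example, whether it is sent on unchanged) is not decided in this thread, so the sketch only classifies; `classify` is a hypothetical helper.

```python
from pathlib import PurePosixPath

IMAGE_SUFFIXES = {".tif", ".tiff"}
VIDEO_SUFFIXES = {".mp4"}


def classify(filename: str) -> str:
    """Classify a file from the zip as "image", "video" or "other"."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "image"
    if suffix in VIDEO_SUFFIXES:
        return "video"
    # Everything else, including the shoot export csv.
    return "other"
```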