wellcomecollection / storage-service

🗃️ Managing the safe, long-term storage of our digital collections in the cloud

Ingest corporate archive images #1126

Open jcateswellcome opened 2 months ago

jcateswellcome commented 2 months ago

To support the accession of corporate photography into the archive, we would like to find a way to ingest the images into Archivematica automatically.

https://www.notion.so/wellcometrust/Ingest-corporate-archive-images-09d2b2fc47b846a0a377900a6c7e386d?pvs=4

paul-butcher commented 1 month ago

Questions - just to make sure I understand what it is I'm supposed to be doing:

  1. Broadly speaking, this task is to write something that will iterate over the list of shoots, fetching each one from S3 as a folder, and for each one:
     • throw away the two redundant metadata files
     • create the appropriate metadata file for ingest
     • zip that all up
     • stick it in the right place on S3 for Archivematica to consume it

     (rough sketch of what I mean below)
  2. There is something about "if a shoot needs to be broken up" - is that a size constraint? If so, what is it, or do I have to just suck it and see?
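For my own reference, the sketch of step 1 I have in mind, assuming boto3; the bucket names, the names of the redundant metadata files, the metadata.csv columns and the final destination key are all placeholders rather than confirmed details:

```python
import io
import zipfile

import boto3

s3 = boto3.client("s3")

# Placeholder names - the real buckets, prefixes and redundant filenames are TBC
SOURCE_BUCKET = "example-corporate-photography-source"
TARGET_BUCKET = "example-archivematica-transfer-source"
REDUNDANT_FILES = {"redundant-metadata-1.xml", "redundant-metadata-2.xml"}


def package_shoot(shoot_prefix: str) -> None:
    """Fetch one shoot from S3, drop the redundant metadata, add a metadata
    file, zip the lot and upload it for Archivematica to consume.
    (Assumes the objects have already been restored if they're in Glacier.)"""
    buffer = io.BytesIO()
    filenames = []

    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=shoot_prefix):
            for obj in page.get("Contents", []):
                filename = obj["Key"].rsplit("/", 1)[-1]
                if filename in REDUNDANT_FILES:
                    continue  # throw away the redundant metadata files
                body = s3.get_object(Bucket=SOURCE_BUCKET, Key=obj["Key"])["Body"].read()
                zf.writestr(f"objects/{filename}", body)
                filenames.append(filename)

        # Create the metadata file for ingest (columns are illustrative only)
        rows = ["filename,dc.identifier"] + [
            f"objects/{name},{shoot_prefix}" for name in filenames
        ]
        zf.writestr("metadata/metadata.csv", "\n".join(rows))

    # Stick it in the right place on S3 for Archivematica (exact key TBC)
    s3.put_object(
        Bucket=TARGET_BUCKET,
        Key=f"{shoot_prefix}.zip",
        Body=buffer.getvalue(),
    )
```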
paul-butcher commented 1 month ago

Regarding the Glacier aspect, I think we can trigger a Bulk retrieval then use Notifications to trigger the next step
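A minimal sketch of the restore half of that, assuming boto3; the bucket and key are placeholders. The completion side would then hang off an `s3:ObjectRestore:Completed` event notification configured on the bucket:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket/key - the real source location is still to be confirmed
bucket = "example-corporate-photography-source"
key = "shoots/CP_1234/CP_1234_0001.tif"

# Ask S3 to restore the archived object using the cheap (but slow) Bulk tier.
# The restored copy stays readable for `Days` days before reverting to Glacier-only.
s3.restore_object(
    Bucket=bucket,
    Key=key,
    RestoreRequest={
        "Days": 7,
        "GlacierJobParameters": {"Tier": "Bulk"},
    },
)
```

When the restore finishes, S3 can publish the `s3:ObjectRestore:Completed` event to SNS/SQS/Lambda, which is where the next step would be triggered.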

paul-butcher commented 1 month ago

There is a mention of this forming a "Repeatable pipeline is established for future ingests from S3 buckets" - are there some other ingests expected in the near future? Knowing this would help establish what kinds of options/parameters I might need to provide.

jcateswellcome commented 1 month ago

@paul-butcher my understanding is that the next, and most likely, ingest would be when the next year of corporate photography shoots is accessioned, so it is likely to be a largely similar/uniform kind of thing from a similar sort of place - if that is vaguely precise enough for now.


jcateswellcome commented 1 month ago

> Questions - just to make sure I understand what it is I'm supposed to be doing:
>
>   1. Broadly speaking, this task is to write something that will iterate over the list of shoots, fetching each one from S3 as a folder, and for each one:
>      • throw away the two redundant metadata files
>      • create the appropriate metadata file for ingest
>      • zip that all up
>      • stick it in the right place on S3 for Archivematica to consume it
>   2. There is something about "if a shoot needs to be broken up" - is that a size constraint? If so, what is it, or do I have to just suck it and see?

  1. yes, sounds right
  2. I think suck it and see - this will come down to an Archivematica ingest limitation. I can try and find out from Ashley if there is a known or approximate number here.
paul-butcher commented 1 month ago


That's great. At the lowest level, the main thing I was wondering about is whether Glacier will normally be involved - or, if it's only going to be involved occasionally, whether it might be easiest to do that bit manually. I don't need an answer on this, as I'll probably work it out as I go along.

paul-butcher commented 1 month ago

Are there any folders in this format that are currently not in Glacier (i.e. less than 6 months old)? Not necessarily on the list - just something where I can run a realistic not-quite-end-to-end test. No worries - I've just put a folder under ST for it.

jcateswellcome commented 1 month ago

Don't know - assume not. I am out on Tuesday/Wednesday, so I suggest you get in touch with Ashley?

paul-butcher commented 1 month ago

Ah. I've just spotted the minor wrinkle that the two buckets are in different accounts. That's a little bit of a pain.
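One way around it, sketched below with boto3: assume a role in the account that owns the source bucket and use those credentials just for reads from it (the role ARN here is hypothetical - something like it would need to exist or be created). A bucket policy granting cross-account read would be the other option, in which case a single client would do.

```python
import boto3

# Hypothetical role in the account that owns the source bucket
SOURCE_ACCOUNT_ROLE_ARN = "arn:aws:iam::111111111111:role/corporate-photography-read"

sts = boto3.client("sts")
assumed = sts.assume_role(
    RoleArn=SOURCE_ACCOUNT_ROLE_ARN,
    RoleSessionName="corporate-photography-ingest",
)
credentials = assumed["Credentials"]

# Client for the source bucket in the other account
source_s3 = boto3.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)

# Client for the target bucket in the account the process normally runs in
target_s3 = boto3.client("s3")
```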

paul-butcher commented 1 month ago

Archivematica can be a bit flaky when ingesting large amounts of data. We may need to do some kind of retry.

The limit on ingest is the number of files per package - Ashley recalls that the maximum is probably 500. I will set a maximum of 250 in order to steer well clear of that.

If we have to retry because of ephemeral issues, I'd like to be pretty sure we aren't also failing because the packages are too big.
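For the splitting itself, I'm thinking of something as simple as this (a sketch - 250 is just the conservative figure above, and the filenames are illustrative):

```python
# Split a shoot's files into packages of at most MAX_FILES_PER_PACKAGE,
# so one oversized shoot becomes several zips rather than one huge one.
MAX_FILES_PER_PACKAGE = 250


def split_into_packages(filenames: list[str]) -> list[list[str]]:
    return [
        filenames[i : i + MAX_FILES_PER_PACKAGE]
        for i in range(0, len(filenames), MAX_FILES_PER_PACKAGE)
    ]


files = [f"CP_1234_{n:04d}.tif" for n in range(1, 601)]  # e.g. a 600-file shoot
assert [len(p) for p in split_into_packages(files)] == [250, 250, 100]
```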

paul-butcher commented 3 weeks ago

I assume that the target for these is wellcomecollection-workflow-(stage-)?upload, with the presence of stage- depending on whether the process is being run for real or just as a trial.

Should it go into some (new?) subfolder in that bucket?

When ephemeral failures occur, is it just a matter of moving a zip from failed back to where it was originally uploaded, or do I have to store the zips elsewhere in order to resubmit them?

If they fail for a legitimate reason, would it be appropriate to download the failed zip from there, modify/split it, then upload? (as opposed to storing the zips elsewhere and fetching the one that corresponds to the failure)

aray-wellcome commented 3 weeks ago

If we want to practice this (and I think we should, because it's really hard to delete things in storage if you mess up) then it'll need to go into the born-digital-accessions folder of /wellcomecollection-archivematica-staging-transfer-source. No other subfolders are needed - just put all the zips into born-digital-accessions.

The /wellcomecollection-archivematica-staging-transfer-source bucket doesn't have a failed folder. You'll get either a success or failure log for each zip you put in. In the case of a failure, you can open the log and see what's wrong with it. The Lambda that produces the logs looks for issues with the metadata.csv and the structure of the zip, I think.

If you need to resubmit the zip because it failed in Archivematica (usually because Archivematica fell over rather than there being anything legitimately wrong with the zip), you can just copy the zip into the same location; it'll overwrite the old one and the Lambda should pick it up again.
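So resubmitting could be as simple as re-uploading to the same key, something like this (the zip filename is just an example):

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "wellcomecollection-archivematica-staging-transfer-source"
KEY = "born-digital-accessions/CP_1234.zip"  # example zip name

# Uploading to the same key overwrites the previous object, which should
# make the transfer Lambda notice the zip and process it again.
s3.upload_file("CP_1234.zip", BUCKET, KEY)
```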

If a zip fails for a legitimate reason then you should be able to pick it back up out of /wellcomecollection-archivematica-staging-transfer-source, but we should check that it's not set to automatically clean up the successful items. I feel like Alex Chan did have some sort of cleanup code on this bucket, but I have no idea whether that's actually there and, if it is, whether it's still working.