wellcomecollection / platform

Wellcome Collection Digital Platform
https://developers.wellcomecollection.org/
MIT License

Clean up miro-sourced Wellcome Images in S3 #4885

Open kenoir opened 3 years ago

kenoir commented 3 years ago

Part of: #4809

After we've stored all images from Miro in the storage service, we should look at removing the images duplicated in S3 in the following buckets:


Timetable:

alexwlchan commented 3 years ago

Done!

alexwlchan commented 3 years ago

tl;dr: I think we pause cleaning up the remaining Miro-sourced images from wellcomecollection-assets-workingstorage until we put the Editorial Photography images in the storage service.

Here's a quick summary of the current issue:

So here's my proposal: We leave the unmatched Editorial_Photography images in wellcomecollection-assets-workingstorage as is, and we clean them up when we put the Editorial Photography images in the storage service.

alexwlchan commented 1 year ago

Returning to this as a lingering bit of cleanup. I don't know if I'll finish it, but I want to at least get a sense of how much work is involved.

All the Miro objects got sent to Glacier Deep Archive, so we need to restore them first. To do that I need a list of all the objects in an S3 Inventory or CSV, so I've set up an S3 Inventory job in the console for it.
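For anybody repeating this with a CSV rather than an S3 Inventory report, here is a rough sketch of writing a manifest with one `bucket,key` row per object. The bucket name and key below are made up, and URL-encoding the keys (leaving `/` alone) is my understanding of what S3 Batch Operations expects, so check the AWS docs before relying on it.

```python
import csv
import io
from urllib.parse import quote


def write_manifest(bucket, keys, out):
    """Write one `bucket,key` row per object, with URL-encoded keys."""
    writer = csv.writer(out, lineterminator="\n")
    for key in keys:
        # Assumption: keys are URL-encoded, but path separators are kept.
        writer.writerow([bucket, quote(key, safe="/")])


# Hypothetical bucket and key, for illustration only.
buf = io.StringIO()
write_manifest("example-bucket", ["miro/A images/A0001.jpg"], buf)
print(buf.getvalue(), end="")
```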

alexwlchan commented 1 year ago

I've kicked off an S3 Batch Operation to restore all the objects from Deep Archive, which should take a day or so to complete.

alexwlchan commented 1 year ago

Okay, so what's left in the old bucket is 419,876 files, taking up ~8TB of data.

alexwlchan commented 1 year ago

An awful lot of what's in here is Editorial Photography, which is spread across the storage service and the wellcomecollection-editorial-photography bucket. The latter is the canonical store of Editorial Photography images; it's managed by Goobi and nicely structured.

The reason they weren't caught last time is that the files aren't exact matches – they have been stored as a mixture of JP2, TIF and JPEG, and you can't just look at Size/ETag to confirm two files are the same image. e.g. C0001234.jp2 and C0001234.tif might be the same or they might be different, but unless you download the files you have no way of telling.
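A minimal sketch of how candidate pairs could be found when Size/ETag are no help: match on the filename with the extension removed, then compare the actual pixels later. The function names and key layout here are assumptions for illustration, not the scripts used in this ticket.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_stem(keys):
    """Map an upper-cased filename stem (e.g. C0001234) to its S3 keys."""
    groups = defaultdict(list)
    for key in keys:
        groups[PurePosixPath(key).stem.upper()].append(key)
    return groups


def candidate_pairs(assets_keys, canonical_keys):
    """Yield (assets_key, canonical_key) pairs that share a stem.

    A shared stem only makes two files *candidates*; they still need
    downloading and comparing before anything is deleted.
    """
    canonical = group_by_stem(canonical_keys)
    for key in assets_keys:
        stem = PurePosixPath(key).stem.upper()
        for other in canonical.get(stem, []):
            yield key, other
```

For example, `miro/C0001234.jp2` in the assets bucket would be paired with `ep/C0001234.tif` in the canonical bucket, but not with `ep/C0009999.tif`.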

I'm now downloading all the files: I've initiated a restore of the relevant objects in wellcomecollection-editorial-photography, and I'm doing a visual diff by running the downloaded images through a Python library – if there's an image in assets with a perfect match, I'm deleting it. It's quite slow, but it's the only way to be sure we aren't losing data.
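This thread doesn't say which library or metric was used, so the following is only a stand-in to show the general shape of a visual diff: a normalised mean absolute difference between two images that have already been decoded to greyscale and resized to the same dimensions. The function name and the flat-list pixel representation are assumptions.

```python
def diff_score(pixels_a, pixels_b, max_value=255):
    """Mean absolute pixel difference, scaled to 0..1 (0.0 = identical).

    Both arguments are flat sequences of greyscale values; decoding and
    resizing the images is assumed to have happened already.
    """
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same number of pixels")
    if not pixels_a:
        raise ValueError("no pixels to compare")
    total = sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))
    return total / (len(pixels_a) * max_value)
```

A score like this tolerates the small differences introduced by re-encoding between JP2, TIF and JPEG, which a byte-level comparison would not.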

alexwlchan commented 1 year ago

I've taken a very substantial chunk out of this today, by doing visual diffs of the Editorial Photography images – we're down to 9,764 files, taking up ~300GB of data. For every image I deleted, I can point to an object in wellcomecollection-editorial-photography or wellcomecollection-storage that is a visual match.

What's left are the more "interesting" examples, e.g. irregular filenames like C1234a.tif when all my regexes are only looking for C1234.tif.
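As an illustration of the kind of regex that would also catch names like C1234a.tif, here is a sketch that splits a filename into a base ID, an optional single-letter suffix and an extension. The exact naming rules are an assumption based on the examples in this ticket; real filenames may be messier.

```python
import re

# Assumption: one letter, then digits, then an optional one-letter suffix.
MIRO_NAME = re.compile(
    r"^(?P<base>[A-Z]\d+)(?P<suffix>[A-Za-z]?)\.(?P<ext>tiff?|jp2|jpe?g)$",
    re.IGNORECASE,
)


def parse_name(filename):
    """Return (base_id, suffix, extension), or None if the name doesn't fit."""
    match = MIRO_NAME.match(filename)
    if match is None:
        return None
    return (
        match.group("base").upper(),
        match.group("suffix").lower(),
        match.group("ext").lower(),
    )
```

With this, C1234.tif and C1234a.tif share the base ID C1234, so the suffixed file can be queued for comparison rather than silently skipped.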

Unfortunately at least some of these point to historical errors – we've migrated between TIF and JP2 at various points for editorial photography, and I've already found two corrupted JP2 images in the storage service. The copy in wellcomecollection-assets-workingstorage may be the only "good" copy of these images.

I'll continue to investigate what's in here, but it'll be a slower process to whittle down the remaining images.

alexwlchan commented 1 year ago

Down to 8,227 objects and 234.4GB.

alexwlchan commented 1 year ago

I did a couple of extra passes to pick up the stragglers:

It's down to 1,830 objects and 18GB.

At least some of what's left seems to be down to the image diff threshold, e.g. a diff score of 0.023, which is just above the 0.02 threshold. Given the small number of images left, I might just work through those by hand rather than continuing to crank up the threshold and potentially start binning non-dupe images.
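One way to avoid raising the threshold is to add a "needs a human" band just above it. This is a sketch of that idea only: the 0.02 threshold is from this ticket, but the 0.05 review ceiling and the band names are hypothetical, as is treating a score of exactly 0.02 as a match.

```python
def triage(score, match_threshold=0.02, review_ceiling=0.05):
    """Sort a diff score into 'match', 'review' or 'different'.

    review_ceiling is a made-up value; anything between the two limits
    goes to a person instead of being deleted automatically.
    """
    if score <= match_threshold:
        return "match"
    if score <= review_ceiling:
        return "review"
    return "different"
```

Under these assumed limits, a score of 0.023 lands in the review band rather than being auto-deleted or ignored.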

alexwlchan commented 1 year ago

Down to 820 objects and 10.7GB, mostly through a mix of manual review and improving the heuristic for name matching.

alexwlchan commented 1 year ago

Down to 525 objects and 9.7GB, by spot-checking some of the images.

alexwlchan commented 1 year ago

The headline

Having spot-checked those remaining 525, most of the errors fall into three categories:

I'm not sure we can do much about the corrupted images until everything gets reingested into METS – putting new Miro images in the storage service is possible, but fiddly; we threw away all the Archivematica code we used for the initial ingest.

I consider this a partial win – we haven't totally solved the problem, but it will be easier to identify the images that need fixing the next time we do some work on the Miro data.

alexwlchan commented 1 year ago

I’m going to use this ticket to provide a brief recap of the Miro images, for anybody who stumbles upon this ticket in future.

Miro was the back-end management system for Wellcome Images, an image library that predates the current Wellcome Collection website and was hosted at wellcomeimages.org (now redirected to wc.org). It was a mixture of images from external contributors and images we'd produced in-house.

The image files were stored on an on-premise network file share, and organised by hand. Everything on this file share was uploaded to an S3 assets bucket before we removed the on-prem storage.

When we shut down Wellcome Images, we sorted the images into three buckets:

Most of the images were then copied from the assets bucket to the storage service, grouped based on which bucket they were in. There are also images in the editorial-photography bucket, which is where Goobi keeps all the Editorial Photography images.

We've removed most of what's in the assets bucket, by looking for matching images, which usually means some combination of:

There are ~500 images in the assets bucket which haven't yet been matched to anything in permanent storage; it's possible that in some cases these are the last copy of a particular image and we don't want to delete them!

Eventually we should clean up all these images and put them in permanent storage, but each of these images will likely need checking by hand.

alexwlchan commented 1 year ago

I'm going to put this ticket back in the backlog for now – although it's not completely done, the smaller image set is much easier to deal with.

alexwlchan commented 1 year ago

Just to further complicate matters, it looks like not all the images in the editorial-photography bucket are safe. 😭

This is one example I found of image corruption: C0125908. On the left is the high-resolution TIF image from the editorial-photography bucket; on the right is the JPEG derivative from the assets bucket.

Screenshot 2023-04-20 at 21 46 46

alexwlchan commented 1 year ago

I was doing some spot checking, and realised we can probably peel off a few more images.

In particular, while I detected corruption in the previous pass, I didn't check where the corruption was. Here's another example: the high-resolution TIFF in the editorial photography bucket is fine, but the JPEG derivative in the assets bucket is corrupted. But that's not an issue for us – we can see they were the same image originally, and the high-resolution copy is fine, so we can delete the corrupted derivative.

Screenshot 2023-06-22 at 09 39 47

alexwlchan commented 1 year ago

Before this pass:

502 objects, totalling 8.9 GB, last modified 18 April

alexwlchan commented 1 year ago

Here's another category of images which are essentially equivalent: the same image, but one of them is rotated differently. I could clean these up automatically, but it's easier to just work through them individually – there can't be many like this.
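If the rotated images ever did need cleaning up automatically, the check could look roughly like this: rotate one image in 90° steps and see whether any orientation matches the other. This sketch uses small 2D lists and exact equality to keep it self-contained; real images would need decoding first and a tolerance-based comparison instead.

```python
def rotate_clockwise(grid):
    """Rotate a 2D list of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def matching_rotation(grid_a, grid_b):
    """Return the clockwise rotation (0, 90, 180 or 270 degrees) that
    turns grid_a into grid_b, or None if no rotation matches."""
    rotated = [list(row) for row in grid_a]
    target = [list(row) for row in grid_b]
    for degrees in (0, 90, 180, 270):
        if rotated == target:
            return degrees
        rotated = rotate_clockwise(rotated)
    return None
```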

Screenshot 2023-06-22 at 09 52 08

alexwlchan commented 1 year ago

There are a bunch of images in the C0084000 prefix where the JPEG/TIFFs do match, but with different IDs. I'm going to ignore them for now.

alexwlchan commented 1 year ago

I reduced it by another 20%:

382 objects, totalling 5.6 GB, last modified 18 April

and I'm going to toss this back onto the backlog.

alexwlchan commented 1 year ago

I've whittled this down to a couple of hundred images, so I'm going to go through what's left and make a spreadsheet of the errors, which we can pass to Collections and/or Production to deal with.