kenoir commented 3 years ago

Part of: #4809

After we've stored all the images from Miro in the storage service, we should look at removing the duplicate copies in S3 from the following buckets:

wellcomecollection-assets-workingstorage/miro
wellcomecollection-miro-images-public

Timetable:

[x] 15 March – send an email to Collections explaining the buckets are changing, and how to find images in the storage-service
[x] 29 March – send a reminder email
[x] 6 March – disable access to images in the platform account
[x] Delete objects from miro-images-public
[x] Delete objects from wellcomecollection-images
[ ] Delete objects from wellcomecollection-assets-workingstorage

alexwlchan commented 3 years ago

Done!

alexwlchan commented 3 years ago

tl;dr: I think we pause cleaning up the remaining Miro-sourced images from wellcomecollection-assets-workingstorage until we put the Editorial Photography images in the storage service.

Here's a quick summary of the current issue:

We assumed that Miro IDs were globally unique, and that two files with the same ID would refer to the same image. We now know this is wrong. In particular, we have examples from the Editorial_Photography folder where:
- The same ID is used for an image at different stages of the process (e.g. one file is a single layer, another file is the complete image)
- The same ID is used to describe entirely different photographs
In the latter case, we have examples of images that don't obviously exist anywhere outside wellcomecollection-assets-workingstorage – for example, there's an image labelled C0100987.jp2, but the photo it contains doesn't appear in Tandem Vault, the storage service, or the Editorial Photography backups bucket.
We will lose images if we do a blanket deletion on wellcomecollection-assets-workingstorage.

How many? I don't know; I don't have a good grasp of the scale of the issue. It could be a handful of images, it could be hundreds.

I've deleted the files which are "obviously" safe – that is, files that have another object with matching size, filename and ETag/checksum in wellcomecollection-storage. That leaves ~400k remaining objects to check.
We’d need to do a lot more work to check the remaining images.
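A minimal sketch of the "obviously safe" check, assuming each bucket listing has already been loaded into memory (e.g. from an S3 Inventory). The `S3Object` type and function names are hypothetical, not the script that was actually run. Note that ETags for multipart uploads aren't a plain MD5 of the content, so a mismatch here doesn't prove two files differ; it only means they need a closer look.

```python
from typing import NamedTuple


class S3Object(NamedTuple):
    """One row of a bucket listing (hypothetical shape)."""
    key: str
    size: int
    etag: str


def filename(key: str) -> str:
    """Return the last path component of an S3 key."""
    return key.rsplit("/", 1)[-1]


def find_exact_duplicates(assets, storage):
    """Return the asset objects that have a counterpart in storage
    with the same filename, size and ETag."""
    seen = {(filename(o.key), o.size, o.etag) for o in storage}
    return [o for o in assets if (filename(o.key), o.size, o.etag) in seen]
```

Anything this returns is a candidate for deletion; everything else stays put until it has been checked another way.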

A lot (all?) of the remaining images have a counterpart in the wellcomecollection-editorial-photography bucket. We've said something to the tune of "We have a file called C0123456.jpg in the assets bucket, and we can see a file called C0123456.tif in the editorial photography bucket. Those must be the same image, right?". This is no longer a safe assumption.

If we can no longer rely on filename matching, I think we want to look at image diffing (like we did for the Miro/DLCS migration). Pull an image from assets-workingstorage, pull an image from editorial-photography, run an image diff and check the numbers. I'd expect to see 99% matches if they're the same.
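To make "run an image diff and check the numbers" concrete, here's a rough sketch of one possible scoring function. It assumes both images have already been decoded to flat sequences of 0-255 channel values with the same dimensions and mode; the metric (mean absolute difference) and the function names are illustrative assumptions, not the tool we used for the Miro/DLCS migration. The 0.02 default is the threshold quoted further down this thread.

```python
def diff_score(pixels_a, pixels_b):
    """Mean absolute difference between two equal-length sequences of
    0-255 channel values, scaled to the range 0.0 (identical) to 1.0."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same dimensions and mode")
    if not pixels_a:
        return 0.0
    total = sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))
    return total / (255 * len(pixels_a))


def is_visual_match(pixels_a, pixels_b, threshold=0.02):
    """Treat two images as equivalent if their diff score is at or
    below the threshold (the boundary handling is an assumption)."""
    return diff_score(pixels_a, pixels_b) <= threshold
```

A score just over the threshold doesn't mean the images differ, only that they need a human to look at them.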

Unfortunately, the images in that bucket have all been lifecycled to Glacier. It wouldn't be too expensive to retrieve them (I'd guess ~$50), but it's more work to build that tooling, and we have higher priorities. There's also no guarantee it would finish this off – we might find a subset of images that need more attention.

So here's my proposal: We leave the unmatched Editorial_Photography images in wellcomecollection-assets-workingstorage as is, and we clean them up when we put the Editorial Photography images in the storage service.

- There'll be more time to work on this
- We can give it our full attention, rather than trying to rush it to round out the quarter
- We'll be un-Glaciering all the images in the current editorial-photography bucket when that happens, so it'll be much easier and quicker to do image diffing

alexwlchan commented 1 year ago

Returning to this as a lingering bit of cleanup. I don't know if I'll finish it, but I want to at least get a sense of how much work is involved.

All the Miro objects got sent to Glacier Deep Archive, so we need to restore them first. To do that I need a list of all the objects in an S3 Inventory or CSV, so I've set up an S3 Inventory job in the console for it.

alexwlchan commented 1 year ago

I've kicked off an S3 Batch Operation to restore all the objects from Deep Archive, which should take a day or so to complete.

alexwlchan commented 1 year ago

Okay, so what's left in the old bucket is 419,876 files, taking up ~8TB of data.

alexwlchan commented 1 year ago

An awful lot of what's in here is Editorial Photography, which is spread across the storage service and the wellcomecollection-editorial-photography bucket. The latter is the canonical store of Editorial Photography images; it's managed by Goobi and nicely structured.

The reason they weren't caught last time is that the files aren't exact matches – they've been stored as a mixture of JP2, TIF and JPEG, and you can't just look at Size/ETag to confirm two files are similar. For example, C0001234.jp2 and C0001234.tif might be the same or they might be different, but unless you download the files you have no way of telling.

I'm now downloading all the files; I've initiated a restore of the relevant objects in wellcomecollection-editorial-photography, and I'm doing a visual diff by downloading the images and running them through a Python library – if an image in assets has a perfect match, I'm deleting it. It's quite slow, but it's the only way to be sure we aren't losing data.

alexwlchan commented 1 year ago

I've taken a very substantial chunk out of this today, by doing visual diffs of the Editorial Photography images – we're down to 9,764 files, taking up ~300GB of data. For every image I deleted, I can point to an object in wellcomecollection-editorial-photography or wellcomecollection-storage that is a visual match.

What's left are the more "interesting" examples, e.g. irregular filenames like C1234a.tif when all my regexes are only looking for C1234.tif.
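As an illustration of the difference between a strict and a loosened name match, here's a small sketch. The two patterns are assumptions based on the example filenames in this thread (C1234.tif vs C1234a.tif), not the regexes that were actually used.

```python
import re
from pathlib import PurePosixPath

# Strict: a letter followed by digits, e.g. "C1234".
STRICT_ID = re.compile(r"^[A-Z]\d+$")
# Loose: also allow a single trailing lowercase letter, e.g. "C1234a".
LOOSE_ID = re.compile(r"^([A-Z]\d+)[a-z]?$")


def strict_id(key):
    """Return the Miro-style ID for a regular filename, else None."""
    stem = PurePosixPath(key).stem
    return stem if STRICT_ID.match(stem) else None


def loose_id(key):
    """Return the ID with any single-letter suffix removed, else None."""
    match = LOOSE_ID.match(PurePosixPath(key).stem)
    return match.group(1) if match else None


def candidate_pairs(asset_keys, permanent_keys):
    """Map each asset key to the permanent-storage keys sharing its loose ID."""
    by_id = {}
    for key in permanent_keys:
        ident = loose_id(key)
        if ident is not None:
            by_id.setdefault(ident, []).append(key)
    return {key: by_id.get(loose_id(key), []) for key in asset_keys}
```

A name match only produces candidates. Since the same ID can refer to different photographs, each pair still needs a visual diff before anything is deleted.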

Unfortunately at least some of these point to historical errors – we've migrated between TIF and JP2 at various points for editorial photography, and I've already found two corrupted JP2 images in the storage service. The copy in wellcomecollection-assets-workingstorage may be the only "good" copy of these images.

I'll continue to investigate what's in here, but it'll be a slower process to whittle down the remaining images.

alexwlchan commented 1 year ago

Down to 8,227 objects and 234.4GB.

alexwlchan commented 1 year ago

I did a couple of extra passes to pick up the stragglers:

- Account for different image modes – the image diff tool I was using wouldn't allow you to compare, say, an RGB image with an RGBA image. I solved this by converting to the more-permissive format, e.g. RGB into RGBA, or L into RGB
- Increase the ephemeral storage available to the Lambda – turns out you can get up to 10GB now!
- Increase the memory available to the Lambda – converting the images was causing OOM errors
- Loosen the name matches to make it easier to find candidate matches, e.g. there's stuff like _DSC0123.tif
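The "convert to the more-permissive format" step can be sketched like this. It's a pure-Python illustration working on single pixel values, assuming only the three modes mentioned above (L, RGB, RGBA); the helper names are hypothetical. With Pillow you'd pick the target mode the same way and call `Image.convert` on whichever image needs it.

```python
# Modes ordered from least to most permissive.
MODE_RANK = {"L": 0, "RGB": 1, "RGBA": 2}


def common_mode(mode_a, mode_b):
    """Return whichever of the two modes can represent both images."""
    return max(mode_a, mode_b, key=MODE_RANK.__getitem__)


def to_mode(pixel, src, dst):
    """Convert one pixel value from src to a more permissive dst mode."""
    if MODE_RANK[dst] < MODE_RANK[src]:
        raise ValueError(f"refusing lossy conversion {src} -> {dst}")
    if src == "L" and dst != "L":
        pixel = (pixel, pixel, pixel)  # greyscale to equal R, G, B
        src = "RGB"
    if src == "RGB" and dst == "RGBA":
        pixel = (*pixel, 255)  # add a fully opaque alpha channel
    return pixel
```

Only ever converting "upwards" means the comparison never throws away colour or alpha information from either image.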

It's down to 1,830 objects and 18GB.

At least some of what's left seems to be down to the image diff threshold, e.g. a diff score of 0.023, which is just above the 0.02 threshold. Given the small number of images left, I might just work through those by hand rather than continuing to raise the threshold and risk binning non-duplicate images.

alexwlchan commented 1 year ago

Down to 820 objects and 10.7GB, mostly a mix of manual review and improving the heuristic for name matching.

alexwlchan commented 1 year ago

Down to 525 objects and 9.7GB, by running spot checking on some of the images.

alexwlchan commented 1 year ago

The headline

Having spot-checked those remaining 525, most of the errors fall into three categories:

- There's a corrupted image somewhere. There's corruption in both directions – some in the assets bucket, some in the storage service (or sometimes in both). For example, C0044452.tif in the assets bucket is fine, but C0044452.jp2 in the storage service is corrupted. I've found at least one case (C0131243) where the TIF and the JP2 are both corrupted, but the JPEG derivative is fine.
- The same image ID refers to two completely different things; when you look in the storage service and the assets bucket you find two different images. (I'm now struggling to find an example, but I have seen them.)
- The images are the same, but in a way that's harder to detect, e.g. one is a rotated version of the other.

I'm not sure we can do much about the corrupted images until everything gets reingested into METS – putting new Miro images in the storage service is possible, but fiddly; we threw away all the Archivematica code we used for the initial ingest.

I consider this a partial win – we haven't totally solved the problem, but it will be easier to identify the images that need fixing the next time we do some work on the Miro data.

alexwlchan commented 1 year ago

I’m going to use this ticket to provide a brief recap of the Miro images, for anybody who stumbles upon this ticket in future.

Miro was the back-end management system for Wellcome Images, an image library that predates the current Wellcome Collection website and was hosted at wellcomeimages.org (now redirected to wc.org). It was a mixture of images from external contributors and images we'd produced in-house.

The image files were stored on an on-premise network file share, and organised by hand. Everything on this file share was uploaded to an S3 assets bucket before we removed the on-prem storage.

When we shut down Wellcome Images, we sorted the images into three buckets:

- Library content – anything we could make publicly available in the new search, under a Creative Commons license. These are the images now available at wellcomecollection.org/search/images.
- Private content – images we wanted to keep, but for staff use only. These were stored in Tandem Vault (now MediaGraph), and include a lot of Editorial Photography.
- Cold store – images we didn't want to keep or make available, e.g. because they were from an external contributor who never replied to our messages about the closure of Wellcome Images

Most of the images were then copied from the assets bucket to the storage service, grouped based on which bucket they were in. There are also images in the editorial-photography bucket, which is where Goobi keeps all the Editorial Photography images.

We've removed most of what's in the assets bucket, by looking for matching images, which usually means some combination of:

- Files are in the permanent storage buckets, i.e. the storage service or editorial-photography
- The filenames match or are similar (e.g. C0001234.jp2 and C0001234.tif)
- The files have the same size and ETag (which implies identical content), or the images are visually similar enough to be considered equivalent

There are ~500 images in the assets bucket which haven't yet been matched to anything in permanent storage; it's possible that in some cases these are the last copy of a particular image and we don't want to delete them!

Eventually we should clean up all these images and put them in permanent storage, but each of these images will likely need checking by hand.

alexwlchan commented 1 year ago

I'm going to put this ticket back in the backlog for now – although it's not completely done, the smaller image set is much easier to deal with.

alexwlchan commented 1 year ago

Just to further complicate matters, it looks like not all the images in the editorial-photography bucket are safe. 😭

This is one example I found of image corruption: C0125908. On the left is the high-resolution TIF image from the editorial-photography bucket; on the right is the JPEG derivative from the assets bucket.

alexwlchan commented 1 year ago

I was doing some spot checking, and realised we can probably peel off a few more images.

In particular, while I detected corruption in the previous pass, I didn't check where the corruption was. Here's another example: the high-resolution TIFF in the editorial photography bucket is fine, but the JPEG derivative in the assets bucket is corrupted. But that's not an issue for us – we can see they were the same image originally, and the high-resolution copy is fine, so we can delete the corrupted derivative.

alexwlchan commented 1 year ago

Before this pass:

502 objects, totalling 8.9 GB, last modified 18 April

alexwlchan commented 1 year ago

Here's another category of images which are essentially equivalent: the same image, but one of them is rotated differently. I could clean these up automatically, but it's easier to just work through them individually – there can't be many like this.
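For reference, an automatic rotation check could look something like the sketch below. It treats an image as a 2D grid of pixel values and tries each quarter turn; the function names are hypothetical, and it uses exact equality to keep the example short. For real JPEG/TIFF pairs you'd compare each rotation with a diff score and a threshold instead.

```python
def rotate90(grid):
    """Rotate a 2D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def matching_rotation(a, b):
    """Return the clockwise rotation of `a` (0, 90, 180 or 270 degrees)
    that makes it equal to `b`, or None if no quarter turn matches."""
    current = [list(row) for row in a]
    target = [list(row) for row in b]
    for degrees in (0, 90, 180, 270):
        if current == target:
            return degrees
        current = rotate90(current)
    return None
```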

alexwlchan commented 1 year ago

There are a bunch of images in the C0084000 prefix where the JPEG/TIFFs do match, but with different IDs. I'm going to ignore them for now.

alexwlchan commented 1 year ago

I reduced it by roughly another quarter:

382 objects, totalling 5.6 GB, last modified 18 April

and I'm going to toss this back onto the backlog.

alexwlchan commented 1 year ago

I've whittled this down to a couple of hundred images, so I'm going to go through what's left and make a spreadsheet of the errors, which we can pass to Collections and/or Production to deal with.

wellcomecollection / platform

Clean up miro-sourced Wellcome Images in S3 #4885
