We shouldn't rush into this; the replicator/verifier is the most critical part of the storage service, and we don't want to introduce new bugs. The existing code has also been extensively tested in production, and we know it's pretty reliable.
We should fix this, because it is a bug, and it will become more likely to occur as we push more and bigger bags through the storage service.
This would be a good chance for another Scala developer to pair with me, as a way for them to get some experience with the storage service. It'd be better to wait and use this to do some knowledge sharing than to have me rush into writing a fix and miss this opportunity.
This is what I suspect to be the root cause of https://github.com/wellcomecollection/platform/issues/5484. I'm going to write it out with quite a lot of context, because (1) I don't know how quickly we'll implement this and (2) it's helpful for somebody else to understand it.
Context
What is the bag replicator?
The bag replicator is the service that copies a bag (a collection of files) from one storage location to another. It runs three times when storing a normal bag, once for each replica location.
Each time, it calls ListObjects on the working storage, then does one-by-one copies of objects from the source to the destination.
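Roughly, that operation has the shape of the sketch below. This is illustrative only: the bucket names and prefixes are made up, the real replicator is driven by pipeline messages, and it has far more error handling. It assumes a recent AWS SDK for Java v2.

```scala
import scala.jdk.CollectionConverters._
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{CopyObjectRequest, ListObjectsV2Request}

object ReplicateBagSketch extends App {
  val s3 = S3Client.create()

  // Hypothetical locations; the real replicator gets these from the pipeline message.
  val srcBucket = "example-working-storage"
  val srcPrefix = "digitised/b12345678/v1/"
  val dstBucket = "example-permanent-storage"
  val dstPrefix = "digitised/b12345678/v1/"

  // List every object under the source prefix...
  val listRequest =
    ListObjectsV2Request.builder().bucket(srcBucket).prefix(srcPrefix).build()
  val sourceKeys =
    s3.listObjectsV2Paginator(listRequest).contents().asScala.map(_.key())

  // ...then copy them to the destination, one at a time.
  sourceKeys.foreach { srcKey =>
    val dstKey = dstPrefix + srcKey.stripPrefix(srcPrefix)
    s3.copyObject(
      CopyObjectRequest.builder()
        .sourceBucket(srcBucket).sourceKey(srcKey)
        .destinationBucket(dstBucket).destinationKey(dstKey)
        .build()
    )
  }
}
```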
How the bag replicator tries to do write-once operations
We want the bag replicator to write exactly once to the permanent storage location – so we can replicate an object, verify it, then be certain that it hasn't changed after it was verified.
We don't currently enable the S3 features that allow you to do that (Legal Holds, Object Lock, governance mode, etc.). We do enable legal holds in Azure.
So instead, the bag replicator checks for an existing object itself, and only writes the object if there isn't already an object at the destination. If there's already an object, it confirms it's the same as what it's about to replicate.
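In outline, that per-object check looks something like the sketch below. This is a simplified illustration, not the real code: the function name is made up, the comparison is much cruder than what the replicator actually does, and the error handling is minimal. It assumes the AWS SDK for Java v2.

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{CopyObjectRequest, HeadObjectRequest, S3Exception}

// Hypothetical helper: copy srcBucket/srcKey to dstBucket/dstKey, but only if there
// isn't already something there.  If there is, check it matches what we'd have written.
def replicateObjectOnce(s3: S3Client,
                        srcBucket: String, srcKey: String,
                        dstBucket: String, dstKey: String): Unit = {
  val srcHead = s3.headObject(
    HeadObjectRequest.builder().bucket(srcBucket).key(srcKey).build())

  val existingDst =
    try Some(s3.headObject(HeadObjectRequest.builder().bucket(dstBucket).key(dstKey).build()))
    catch { case e: S3Exception if e.statusCode() == 404 => None }

  existingDst match {
    case None =>
      // Nothing at the destination yet: safe to write.
      s3.copyObject(
        CopyObjectRequest.builder()
          .sourceBucket(srcBucket).sourceKey(srcKey)
          .destinationBucket(dstBucket).destinationKey(dstKey)
          .build())

    case Some(dstHead) if dstHead.contentLength() == srcHead.contentLength() =>
      // Something is already there and it looks like the same object, so leave it alone.
      // (The real replicator does a stronger comparison than just the size.)
      ()

    case Some(_) =>
      throw new RuntimeException(s"Object at $dstBucket/$dstKey doesn't match the source!")
  }
}
```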
To prevent two processes doing this check simultaneously, the bag replicator will lock around the destination prefix – if it's got the lock, it should be the only thing working on this prefix.
This uses our Dynamo locking code, which has locks that expire after a certain period. (This is so applications that crash without releasing their locks don't gum up a process in perpetuity.)
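The overall shape is roughly the sketch below. This is a hypothetical, much-simplified interface, not the real locking library, and the 15-minute expiry is just an illustrative value.

```scala
import java.time.{Duration, Instant}

// Hypothetical, simplified locking interface.  The real implementation stores locks
// in DynamoDB, and acquire() fails if somebody else already holds an unexpired lock.
case class Lock(id: String, expires: Instant)

trait LockDao {
  def acquire(id: String, expiresAfter: Duration): Option[Lock]
  def release(lock: Lock): Unit
}

// Run a replication while holding a lock on the destination prefix.  The lock expires
// after a fixed period even if we never release it (e.g. because we crashed) -- which
// also means it can expire while we're still working, if the work takes too long.
def withPrefixLock[T](lockDao: LockDao, dstPrefix: String)(doReplication: => T): T =
  lockDao.acquire(dstPrefix, Duration.ofMinutes(15)) match {
    case Some(lock) =>
      try doReplication
      finally lockDao.release(lock)
    case None =>
      // Somebody else is already working on this prefix; retry the message later.
      throw new RuntimeException(s"Couldn't acquire a lock on $dstPrefix")
  }
```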
S3 consistency and lazy checking
This approach had a bad interaction with the way S3 consistency used to work.
S3 used to be eventually consistent. If you called GetObject on a non-existent object, then PutObject, then GetObject again, you might get a 404 on the second call. This could cause verification failures: the replicator would check for an object (and get a 404), write the object, and then a later read might still return a 404, even though the object was now there.
To get around this, the replicator starts by looking for existing objects in the prefix using a ListObjects call. If there aren't any, it knows that nobody is going to write any new objects because it's holding the lock, so it can skip checking for each object in turn. This avoids querying any object individually and hitting this consistency issue.
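Something like this hypothetical helper (illustrative names, assuming the AWS SDK for Java v2): if the destination prefix is completely empty at the start, and we're holding the lock on it, nothing else will appear there, so we can skip the per-object existence checks.

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request

// Hypothetical helper: is there anything at all under the destination prefix?
// If not, and we hold the prefix lock, we can copy every object without first
// checking for each one individually.
def destinationPrefixIsEmpty(s3: S3Client, dstBucket: String, dstPrefix: String): Boolean = {
  val listRequest = ListObjectsV2Request.builder()
    .bucket(dstBucket)
    .prefix(dstPrefix)
    .maxKeys(1)   // we only need to know whether there's at least one object
    .build()

  s3.listObjectsV2(listRequest).contents().isEmpty
}
```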
S3 is now strongly consistent so we no longer need this mitigation, but we haven't removed it yet.
The problem
I think this is the rough sequence of events with the problem bag: replicating it takes longer than the lock expiry, so the first replicator task's lock on the destination prefix expires while it's still copying objects. A second replicator task then picks up the same bag (for example from a retried queue message), acquires a fresh lock on the now-unlocked prefix, and starts copying the same objects.
Because we have legal holds enabled on the Azure container, at some point both tasks encounter an object that the other task has written, and the Azure copy fails.
Proposed changes
Changes I think we should definitely make:
Lock around individual destination objects, not the entire prefix. This problem became possible because the replicator kept working after its lock had expired. We could build a way to keep locks "fresh" and tie that to our SQS visibility timeout, but this is adding complexity to an already pretty-complicated bit of code.
This means we'd spend more on DynamoDB locking, but individual locks would only have to last as long as it takes to replicate a single object. The likelihood of a lock expiring while work is ongoing drops dramatically.
Always look for an existing object before replicating. The "skip the per-object checks if the prefix is empty" shortcut was added to work around an S3 limitation that no longer applies, so we could remove that code and simplify the replicator. Always checking is less efficient (because we'll check for lots of objects that don't exist), but combined with the individual locking it gives us a stronger level of protection against overwrites.
Removing the shortcut might be an issue if we ever want to expand our replication locations to another object store that's eventually consistent, but (1) we have no plans to do that and (2) we can't be sure this hypothetical future store would have the same failure mode. There's a rough sketch of what these two changes might look like together below.
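To make the proposal concrete, here's that sketch of the two changes combined: lock per destination object, and always check before writing. The names and interfaces are hypothetical (they're not the real storage-service APIs), and the five-minute expiry is an illustrative value.

```scala
import java.time.Duration

// Hypothetical, simplified interfaces for illustration only.
trait ObjectLockDao {
  def acquire(lockId: String, expiresAfter: Duration): Boolean
  def release(lockId: String): Unit
}

trait ObjectStore {
  def exists(bucket: String, key: String): Boolean
  def sameAs(srcBucket: String, srcKey: String, dstBucket: String, dstKey: String): Boolean
  def copy(srcBucket: String, srcKey: String, dstBucket: String, dstKey: String): Unit
}

// Replicate a single object: lock just this destination key (not the whole prefix),
// and always check for an existing object before writing, even if the prefix looked
// empty at the start of the replication.
def replicateObject(locks: ObjectLockDao, store: ObjectStore,
                    srcBucket: String, srcKey: String,
                    dstBucket: String, dstKey: String): Unit = {
  val lockId = s"$dstBucket/$dstKey"

  // The lock only has to live for as long as a single copy, so a short expiry is fine,
  // and the chance of it expiring mid-copy is much smaller than with a per-prefix lock.
  if (!locks.acquire(lockId, Duration.ofMinutes(5)))
    throw new RuntimeException(s"Couldn't lock $lockId; will retry later")

  try {
    if (!store.exists(dstBucket, dstKey)) {
      store.copy(srcBucket, srcKey, dstBucket, dstKey)
    } else if (!store.sameAs(srcBucket, srcKey, dstBucket, dstKey)) {
      throw new RuntimeException(s"Existing object at $lockId doesn't match the source!")
    }
  } finally locks.release(lockId)
}
```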
And a change I considered but don't think we should do: