wellcomecollection / storage-service

🗃️ Managing the safe, long-term storage of our digital collections in the cloud

Fix the issues around long-running bag replications #993

Open alexwlchan opened 2 years ago

alexwlchan commented 2 years ago

This is what I suspect to be the root cause of https://github.com/wellcomecollection/platform/issues/5484. I'm going to write it out with quite a lot of context, because (1) I don't know how quickly we'll implement this and (2) it's helpful for somebody else to understand it.

Context

What is the bag replicator?

The bag replicator is the service that copies a bag (a collection of files) from one storage location to another. It runs three times when storing a normal bag. Each time, it calls ListObjects on the working storage, then copies the objects one by one from the source to the destination.
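
For illustration, here's a minimal sketch of that list-then-copy loop, written against the AWS SDK for Java v2 from Scala. It isn't the production replicator code (which wraps all of this in our own storage libraries), and the bucket/prefix parameters are placeholders:

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{CopyObjectRequest, ListObjectsV2Request}

import scala.jdk.CollectionConverters._

object ReplicateBagSketch {
  // Copy every object under srcPrefix in srcBucket to the same relative key
  // under dstPrefix in dstBucket.  This mirrors the "ListObjects, then copy
  // one-by-one" behaviour described above; all names are made up.
  def replicate(s3: S3Client,
                srcBucket: String, srcPrefix: String,
                dstBucket: String, dstPrefix: String): Unit = {
    val listRequest = ListObjectsV2Request.builder()
      .bucket(srcBucket)
      .prefix(srcPrefix)
      .build()

    // The paginator handles continuation tokens for bags with many objects.
    s3.listObjectsV2Paginator(listRequest).contents().asScala.foreach { obj =>
      val relativeKey = obj.key().stripPrefix(srcPrefix)

      s3.copyObject(
        CopyObjectRequest.builder()
          .sourceBucket(srcBucket)
          .sourceKey(obj.key())
          .destinationBucket(dstBucket)
          .destinationKey(dstPrefix + relativeKey)
          .build()
      )
    }
  }
}
```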

How the bag replicator tries to do write-once operation

We want the bag replicator to write exactly once to the permanent storage location – so we can replicate an object, verify it, then be certain that it hasn't changed after it was verified.

We don't currently enable the S3 features that allow you to do that (Legal Holds, Object Lock, governance mode, etc.). We do enable legal holds in Azure.

So instead, the bag replicator checks for an existing object itself, and only writes the object if there isn't already an object at the destination. If there's already an object, it confirms it's the same as what it's about to replicate.

flowchart TD
    F[get object to replicate] --> DE{does the <br/>destination have<br/> this object?}
    DE-- yes -->DEY{are the source <br/>and destination <br/>objects the same?}
    DE-- no -->DEN["✅ replicate the object"]
    DEY-- yes -->DEYY["✅ nothing to do"]
    DEY-- no -->DEYN["⚠️ something has gone wrong"]
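
In code, the decision in that flowchart corresponds to something like this sketch; the `Checksum` type is a hypothetical stand-in for however the source and destination objects actually get compared:

```scala
object WriteOnceCheckSketch {
  sealed trait ReplicationDecision
  case object CopyNewObject extends ReplicationDecision         // destination is empty: safe to write
  case object NothingToDo extends ReplicationDecision           // identical object already there
  case object SomethingHasGoneWrong extends ReplicationDecision // different object already there: never overwrite

  // Hypothetical stand-in; the real comparison might use size and/or checksum.
  case class Checksum(value: String)

  def decide(source: Checksum, destination: Option[Checksum]): ReplicationDecision =
    destination match {
      case None                                 => CopyNewObject
      case Some(existing) if existing == source => NothingToDo
      case Some(_)                              => SomethingHasGoneWrong
    }
}
```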

To prevent two processes doing this check simultaneously, the bag replicator will lock around the destination prefix – if it's got the lock, it should be the only thing working on this prefix.

This uses our Dynamo locking code, which has locks that expire after a certain period. (This is so applications that crash without releasing their locks don't gum up a process in perpetuity.)
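
The real implementation lives in our shared locking library, but the underlying idea can be sketched as a conditional write to DynamoDB: create the lock row only if no row exists, or if the existing row's expiry has passed. The table and attribute names below are invented for illustration:

```scala
import java.time.Instant

import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{
  AttributeValue, ConditionalCheckFailedException, PutItemRequest
}

import scala.jdk.CollectionConverters._

object PrefixLockSketch {
  // Try to acquire an expiring lock on a destination prefix.
  // Returns true if we got the lock.
  def acquireLock(dynamo: DynamoDbClient, prefix: String, expirySeconds: Long): Boolean = {
    val now = Instant.now()
    val item = Map(
      "id"      -> AttributeValue.builder().s(prefix).build(),
      "expires" -> AttributeValue.builder().n(now.plusSeconds(expirySeconds).getEpochSecond.toString).build()
    ).asJava

    val request = PutItemRequest.builder()
      .tableName("replicator-locks")
      .item(item)
      // Only succeed if nobody holds the lock, or the existing lock has expired.
      .conditionExpression("attribute_not_exists(#id) OR #expires < :now")
      .expressionAttributeNames(Map("#id" -> "id", "#expires" -> "expires").asJava)
      .expressionAttributeValues(
        Map(":now" -> AttributeValue.builder().n(now.getEpochSecond.toString).build()).asJava
      )
      .build()

    try {
      dynamo.putItem(request)
      true
    } catch {
      case _: ConditionalCheckFailedException => false
    }
  }
}
```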

S3 consistency and lazy checking

This approach had a bad interaction with the way S3 consistency used to work.

S3 used to be eventually consistent. If you called GetObject on a non-existent object, then PutObject, then GetObject again, the second GetObject might still return a 404. This could cause verification failures, like so:

sequenceDiagram
  participant replicator
  participant verifier
  participant S3

  replicator->>S3: GetObject(f)
  S3->>replicator: 404 Not Found

  replicator->>S3: CopyObject(f, destination)
  S3->>replicator: OK, done!

  replicator->>verifier: you can verify f
  verifier->>S3: GetObject(f)
  S3->>verifier: 404 Not Found (!)

To get around this, the replicator starts by looking for existing objects in the prefix using a ListObjects call. If there aren't any, it knows that nobody is going to write any new objects because it's holding the lock, so it can skip checking for each object in turn. This avoids querying any object individually and hitting this consistency issue.
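
A sketch of that fast-path check, again with placeholder bucket/prefix names and the AWS SDK for Java v2 rather than the real replicator code:

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request

object LazyCheckSketch {
  // Returns true if there is nothing under dstPrefix in dstBucket, i.e. the
  // replicator can copy every object without a per-object existence check.
  def destinationPrefixIsEmpty(s3: S3Client, dstBucket: String, dstPrefix: String): Boolean = {
    val response = s3.listObjectsV2(
      ListObjectsV2Request.builder()
        .bucket(dstBucket)
        .prefix(dstPrefix)
        .maxKeys(1)   // we only care whether *any* object exists
        .build()
    )

    response.contents().isEmpty
  }
}
```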

S3 is now strongly consistent so we no longer need this mitigation, but we haven't removed it yet.

The problem

I think this is the rough sequence of events with the problem bag:

  1. The Azure replicator queue gets a message "Please replicate b32832631"
  2. Task A picks up the message. It acquires a lock on the destination prefix, sees there are no objects already there, starts replicating.
  3. For some reason, the task gets stuck (possibly because we're doing a lot of S3→Azure replication at the same time). It's still working, but it doesn't replicate any objects before the SQS timeout and prefix lock both expire.
  4. Task B picks up the message. It acquires a lock on the destination prefix, sees there are no objects already there, starts replicating. At this point, both A and B are trying to replicate into the same prefix, and both think they have exclusive rights to do so.
  5. Eventually the logjam clears, and both tasks are able to replicate objects into Azure.
  6. Because we have legal holds enabled on the Azure container, at some point both tasks encounter an object that the other task has written, and the Azure copy fails:

    This operation is not permitted as the blob is immutable due to one or more legal holds
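
To make the race concrete, here's a toy model (all names and numbers invented) of the key condition: the lock's TTL elapses while its holder is still working, so a second task can take a fresh lock on the same prefix:

```scala
import java.time.{Duration, Instant}

// A toy model of the race: a lock that silently expires while its holder is
// still working on the prefix it "protects".
case class ExpiringLock(holder: String, acquiredAt: Instant, ttl: Duration) {
  def isExpiredAt(t: Instant): Boolean = t.isAfter(acquiredAt.plus(ttl))
}

object LockExpiryRace extends App {
  val start   = Instant.now()
  val lockTtl = Duration.ofMinutes(30)

  // Task A acquires the lock, then gets stuck for longer than the TTL.
  val lockA = ExpiringLock(holder = "task-A", acquiredAt = start, ttl = lockTtl)

  // By the time Task B picks up the redelivered message, the lock looks free --
  // even though Task A is still running and will write objects later.
  val whenTaskBArrives = start.plus(Duration.ofMinutes(45))

  if (lockA.isExpiredAt(whenTaskBArrives)) {
    println("task-B acquires the prefix lock while task-A is still working")
  }
}
```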

Proposed changes

Changes I think we should definitely make:

And a change I considered but don't think we should do:

alexwlchan commented 2 years ago

When should we do this?

We shouldn't rush into this; the replicator/verifier is the most critical part of the storage service, and we don't want to introduce new bugs. The existing code has also been extensively tested in production, and we know it's pretty reliable.

We should still fix this, because it is a real bug, and it will become more likely to occur as we push more (and bigger) bags through the storage service.

Who should do this?

Another Scala developer pairing with me, as a way for them to get some experience with the storage service. It'd be better to wait and use this for some knowledge sharing than to have me rush into writing a fix and miss that opportunity.