We shouldn't rush into this; the replicator/verifier is the most critical part of the storage service, and we don't want to introduce new bugs. The existing code has also been extensively tested in production, and we know it's pretty reliable.
We should fix this, because it is a bug, and it will become more likely to occur as we push more and bigger bags through the storage service.
This would be a good chance for another Scala developer to pair with me, as a way for them to get some experience with the storage service. It'd be better to wait and use this to do some knowledge sharing than to have me rush into writing a fix and miss this opportunity.
This is what I suspect to be the root cause of https://github.com/wellcomecollection/platform/issues/5484. I'm going to write it out with quite a lot of context, because (1) I don't know how quickly we'll implement this and (2) it's helpful for somebody else to understand it.
Context
What is the bag replicator?
The bag replicator is the service that copies a bag (a collection of files) from one storage location to another. It runs three times when storing a normal bag, once for each replica location.
Each time, it calls ListObjects on the working storage, then does one-by-one copies of objects from the source to the destination.
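Roughly, that operation has the shape of the sketch below. This is illustrative only: the bucket names and prefixes are made up, the real replicator is driven by pipeline messages, and it has far more error handling. It assumes a recent AWS SDK for Java v2.

```scala
import scala.jdk.CollectionConverters._
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{CopyObjectRequest, ListObjectsV2Request}

object ReplicateBagSketch extends App {
  val s3 = S3Client.create()

  // Hypothetical locations; the real replicator gets these from the pipeline message.
  val srcBucket = "example-working-storage"
  val srcPrefix = "digitised/b12345678/v1/"
  val dstBucket = "example-permanent-storage"
  val dstPrefix = "digitised/b12345678/v1/"

  // List every object under the source prefix...
  val listRequest =
    ListObjectsV2Request.builder().bucket(srcBucket).prefix(srcPrefix).build()
  val sourceKeys =
    s3.listObjectsV2Paginator(listRequest).contents().asScala.map(_.key())

  // ...then copy them to the destination, one at a time.
  sourceKeys.foreach { srcKey =>
    val dstKey = dstPrefix + srcKey.stripPrefix(srcPrefix)
    s3.copyObject(
      CopyObjectRequest.builder()
        .sourceBucket(srcBucket).sourceKey(srcKey)
        .destinationBucket(dstBucket).destinationKey(dstKey)
        .build()
    )
  }
}
```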
How the bag replicator tries to do write-once operations
We want the bag replicator to write exactly once to the permanent storage location – so we can replicate an object, verify it, then be certain that it hasn't changed after it was verified.
We don't currently enable the S3 features that allow you to do that (Legal Holds, Object Lock, governance mode, etc.). We do enable legal holds in Azure.
So instead, the bag replicator checks for an existing object itself, and only writes the object if there isn't already an object at the destination. If there's already an object, it confirms it's the same as what it's about to replicate.
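In outline, that per-object check looks something like the sketch below. This is a simplified illustration, not the real code: the function name is made up, the comparison is much cruder than what the replicator actually does, and the error handling is minimal. It assumes the AWS SDK for Java v2.

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{CopyObjectRequest, HeadObjectRequest, S3Exception}

// Hypothetical helper: copy srcBucket/srcKey to dstBucket/dstKey, but only if there
// isn't already something there.  If there is, check it matches what we'd have written.
def replicateObjectOnce(s3: S3Client,
                        srcBucket: String, srcKey: String,
                        dstBucket: String, dstKey: String): Unit = {
  val srcHead = s3.headObject(
    HeadObjectRequest.builder().bucket(srcBucket).key(srcKey).build())

  val existingDst =
    try Some(s3.headObject(HeadObjectRequest.builder().bucket(dstBucket).key(dstKey).build()))
    catch { case e: S3Exception if e.statusCode() == 404 => None }

  existingDst match {
    case None =>
      // Nothing at the destination yet: safe to write.
      s3.copyObject(
        CopyObjectRequest.builder()
          .sourceBucket(srcBucket).sourceKey(srcKey)
          .destinationBucket(dstBucket).destinationKey(dstKey)
          .build())

    case Some(dstHead) if dstHead.contentLength() == srcHead.contentLength() =>
      // Something is already there and it looks like the same object, so leave it alone.
      // (The real replicator does a stronger comparison than just the size.)
      ()

    case Some(_) =>
      throw new RuntimeException(s"Object at $dstBucket/$dstKey doesn't match the source!")
  }
}
```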
To prevent two processes doing this check simultaneously, the bag replicator will lock around the destination prefix – if it's got the lock, it should be the only thing working on this prefix.
This uses our Dynamo locking code, which has locks that expire after a certain period. (This is so applications that crash without releasing their locks don't gum up a process in perpetuity.)
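The overall shape is roughly the sketch below. This is a hypothetical, much-simplified interface, not the real locking library, and the 15-minute expiry is just an illustrative value.

```scala
import java.time.{Duration, Instant}

// Hypothetical, simplified locking interface.  The real implementation stores locks
// in DynamoDB, and acquire() fails if somebody else already holds an unexpired lock.
case class Lock(id: String, expires: Instant)

trait LockDao {
  def acquire(id: String, expiresAfter: Duration): Option[Lock]
  def release(lock: Lock): Unit
}

// Run a replication while holding a lock on the destination prefix.  The lock expires
// after a fixed period even if we never release it (e.g. because we crashed) -- which
// also means it can expire while we're still working, if the work takes too long.
def withPrefixLock[T](lockDao: LockDao, dstPrefix: String)(doReplication: => T): T =
  lockDao.acquire(dstPrefix, Duration.ofMinutes(15)) match {
    case Some(lock) =>
      try doReplication
      finally lockDao.release(lock)
    case None =>
      // Somebody else is already working on this prefix; retry the message later.
      throw new RuntimeException(s"Couldn't acquire a lock on $dstPrefix")
  }
```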
S3 consistency and lazy checking
This approach had a bad interaction with the way S3 consistency used to work.
S3 used to be eventually consistent. If you called GetObject on a non-existent object, then PutObject, then GetObject again, you might get a 404 on the second call. This could cause verification failures: the replicator would check for an object (and get a 404), write the object, and then a later read might still return a 404, even though the object was now there.
To get around this, the replicator starts by looking for existing objects in the prefix using a ListObjects call. If there aren't any, it knows that nobody is going to write any new objects because it's holding the lock, so it can skip checking for each object in turn. This avoids querying any object individually and hitting this consistency issue.
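Something like this hypothetical helper (illustrative names, assuming the AWS SDK for Java v2): if the destination prefix is completely empty at the start, and we're holding the lock on it, nothing else will appear there, so we can skip the per-object existence checks.

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request

// Hypothetical helper: is there anything at all under the destination prefix?
// If not, and we hold the prefix lock, we can copy every object without first
// checking for each one individually.
def destinationPrefixIsEmpty(s3: S3Client, dstBucket: String, dstPrefix: String): Boolean = {
  val listRequest = ListObjectsV2Request.builder()
    .bucket(dstBucket)
    .prefix(dstPrefix)
    .maxKeys(1)   // we only need to know whether there's at least one object
    .build()

  s3.listObjectsV2(listRequest).contents().isEmpty
}
```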
S3 is now strongly consistent so we no longer need this mitigation, but we haven't removed it yet.
The problem
I think this is the rough sequence of events with the problem bag: replicating it takes longer than the lock expiry, so the first replicator task's lock on the destination prefix expires while it's still copying objects. A second replicator task then picks up the same bag (for example from a retried queue message), acquires a fresh lock on the now-unlocked prefix, and starts copying the same objects.
Because we have legal holds enabled on the Azure container, at some point both tasks encounter an object that the other task has written, and the Azure copy fails.
Proposed changes
Changes I think we should definitely make:
Lock around individual destination objects, not the entire prefix. This problem became possible because the replicator kept working after its lock had expired. We could build a way to keep locks "fresh" and tie that to our SQS visibility timeout, but this is adding complexity to an already pretty-complicated bit of code.
This means we'd spend more on DynamoDB locking, but individual locks would only have to last as long as it takes to replicate a single object. The likelihood of a lock expiring while work is ongoing drops dramatically.
Always look for an existing object before replicating. The "skip the per-object checks if the prefix is empty" shortcut was added to work around an S3 limitation that no longer applies, so we could remove that code and simplify the replicator. Always checking is less efficient (because we'll check for lots of objects that don't exist), but combined with the individual locking it gives us a stronger level of protection against overwrites.
Removing the shortcut might be an issue if we ever want to expand our replication locations to another object store that's eventually consistent, but (1) we have no plans to do that and (2) we can't be sure this hypothetical future store would have the same failure mode. There's a rough sketch of what these two changes might look like together below.
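To make the proposal concrete, here's that sketch of the two changes combined: lock per destination object, and always check before writing. The names and interfaces are hypothetical (they're not the real storage-service APIs), and the five-minute expiry is an illustrative value.

```scala
import java.time.Duration

// Hypothetical, simplified interfaces for illustration only.
trait ObjectLockDao {
  def acquire(lockId: String, expiresAfter: Duration): Boolean
  def release(lockId: String): Unit
}

trait ObjectStore {
  def exists(bucket: String, key: String): Boolean
  def sameAs(srcBucket: String, srcKey: String, dstBucket: String, dstKey: String): Boolean
  def copy(srcBucket: String, srcKey: String, dstBucket: String, dstKey: String): Unit
}

// Replicate a single object: lock just this destination key (not the whole prefix),
// and always check for an existing object before writing, even if the prefix looked
// empty at the start of the replication.
def replicateObject(locks: ObjectLockDao, store: ObjectStore,
                    srcBucket: String, srcKey: String,
                    dstBucket: String, dstKey: String): Unit = {
  val lockId = s"$dstBucket/$dstKey"

  // The lock only has to live for as long as a single copy, so a short expiry is fine,
  // and the chance of it expiring mid-copy is much smaller than with a per-prefix lock.
  if (!locks.acquire(lockId, Duration.ofMinutes(5)))
    throw new RuntimeException(s"Couldn't lock $lockId; will retry later")

  try {
    if (!store.exists(dstBucket, dstKey)) {
      store.copy(srcBucket, srcKey, dstBucket, dstKey)
    } else if (!store.sameAs(srcBucket, srcKey, dstBucket, dstKey)) {
      throw new RuntimeException(s"Existing object at $lockId doesn't match the source!")
    }
  } finally locks.release(lockId)
}
```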
And a change I considered but don't think we should do: