yetanotherco / aligned_layer

Aligned is a verification layer for zero-knowledge proofs using EigenLayer. Our mission is to accelerate the adoption of zero-knowledge and validity proofs on Ethereum.
https://alignedlayer.com/

fix(operator, aggregator): aggregator or operator downtime can break eventual consistency #962

Open entropidelic opened 1 month ago

entropidelic commented 1 month ago

Tasks

Operator

If an Operator experiences downtime, it will not process the NewBatch events emitted while it was offline. If over one-third of the Operators experience downtime and consequently miss a NewBatch event (e.g., due to a poorly rolled out upgrade), it will be impossible for the Aggregator to accumulate a quorum of verifications for the corresponding batch.

Recommended remediation

To address Operator downtime, we recommend that the Operator periodically log timestamps to a file. When the Operator comes back online after downtime, it can read the file to learn when it went offline and process every batch emitted since then that does not already have a corresponding BatchVerified event. Additionally, we recommend that the Operator log every batch it verifies and, after downtime, skip re-verifying batches that appear in this log: these are batches the Aggregator has already received verifications for but that have not yet triggered a BatchVerified event from the smart contract, either because a quorum has not yet been reached or because the respondToTask transaction is still in the mempool.
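
A minimal sketch of the timestamp-logging idea in Go; the file path, interval, and function names are illustrative and not taken from the codebase. On restart, the recovered timestamp could bound an eth_getLogs query that replays missed NewBatch events, from which batches that already have a matching BatchVerified event can be filtered out.

```go
package operator

import (
	"os"
	"strconv"
	"time"
)

// writeHeartbeat periodically persists the current Unix timestamp so that,
// after a crash, the Operator knows roughly when it went offline.
// The path and interval are illustrative choices, not the project's.
func writeHeartbeat(path string, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			ts := strconv.FormatInt(time.Now().Unix(), 10)
			// Write to a temp file and rename so a crash mid-write
			// never leaves a torn timestamp behind.
			tmp := path + ".tmp"
			if err := os.WriteFile(tmp, []byte(ts), 0o644); err == nil {
				_ = os.Rename(tmp, path)
			}
		case <-stop:
			return
		}
	}
}

// lastHeartbeat returns the last persisted timestamp, or the zero time if no
// heartbeat file exists yet (fresh start).
func lastHeartbeat(path string) time.Time {
	data, err := os.ReadFile(path)
	if err != nil {
		return time.Time{}
	}
	secs, err := strconv.ParseInt(string(data), 10, 64)
	if err != nil {
		return time.Time{}
	}
	return time.Unix(secs, 0)
}
```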

Aggregator

If the Aggregator experiences downtime, batches could similarly be skipped. If an Operator's message fails to reach the Aggregator, the Operator will keep retrying for 100 seconds. If the Aggregator does not become available within this time, data loss will occur.

Recommended remediation

To address Aggregator downtime, we recommend mitigating Operator-to-Aggregator message loss by placing a highly available queue, such as Amazon SQS, in front of the Aggregator. However, Operator signatures accumulated by the Aggregator that had not yet reached a quorum would still be lost if the Aggregator crashed. To remediate this, Operator signatures could be logged to a file that is read on service startup. However, if an Aggregator experiencing issues is replaced by another Aggregator, these accumulated signatures would not be available, since the log file would live on a separate instance. A shared datastore, such as Redis, would be required to support node replacement. For catching up on missed events from AlignedLayerServiceManager, the timestamp-logging solution described above should suffice.
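
A rough sketch of the shared-datastore idea, assuming go-redis as the client and a Redis hash keyed by batch merkle root; the key layout and type names are assumptions for illustration, not the project's schema.

```go
package aggregator

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// SignatureStore keeps per-batch Operator signatures in a shared Redis
// instance so a replacement Aggregator can resume quorum accumulation.
type SignatureStore struct {
	rdb *redis.Client
}

func NewSignatureStore(addr string) *SignatureStore {
	return &SignatureStore{rdb: redis.NewClient(&redis.Options{Addr: addr})}
}

// Save records one Operator's signature for a batch. Using a hash keyed by
// batch merkle root makes each write idempotent per Operator.
func (s *SignatureStore) Save(ctx context.Context, batchRoot, operatorID string, sig []byte) error {
	return s.rdb.HSet(ctx, "signatures:"+batchRoot, operatorID, sig).Err()
}

// Load returns all signatures accumulated so far for a batch, e.g. when a
// replacement Aggregator starts up and wants to resume counting toward quorum.
func (s *SignatureStore) Load(ctx context.Context, batchRoot string) (map[string]string, error) {
	return s.rdb.HGetAll(ctx, "signatures:"+batchRoot).Result()
}
```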

Oppen commented 1 month ago

> To address Operator downtime, we recommend that the Operator periodically log timestamps to a file. When the Operator comes back online after downtime, it can read the file to learn when it went offline and process every batch emitted since then that does not already have a corresponding BatchVerified event. Additionally, we recommend that the Operator log every batch it verifies and, after downtime, skip re-verifying batches that appear in this log: these are batches the Aggregator has already received verifications for but that have not yet triggered a BatchVerified event from the smart contract, either because a quorum has not yet been reached or because the respondToTask transaction is still in the mempool.

Would it make more sense to just log the latest verified batch? Then we could just assume that:

  1. Any batch up to and including that one is verified, and we don't verify it again.
  2. Any batch after it is not verified, so we start verifying from there. This assumes a total order exists, of course.
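
A minimal sketch of that single-checkpoint variant, under the total-order assumption; if batches are identified by merkle root rather than a counter, the block number of the NewBatch event could serve as the ordering key. The checkpoint path and function names are hypothetical.

```go
package operator

import (
	"os"
	"strconv"
)

// latestVerified reads the checkpoint: the ordering key (e.g. block number)
// of the last batch this Operator verified. Returns -1 when no checkpoint
// exists yet (fresh start).
func latestVerified(path string) int64 {
	data, err := os.ReadFile(path)
	if err != nil {
		return -1
	}
	n, err := strconv.ParseInt(string(data), 10, 64)
	if err != nil {
		return -1
	}
	return n
}

// recordVerified atomically overwrites the checkpoint with the newest key.
// A single number suffices under the total-order assumption: everything
// <= n is done, everything > n still needs verification.
func recordVerified(path string, key int64) error {
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, []byte(strconv.FormatInt(key, 10)), 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}
```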
Oppen commented 1 month ago

> To address Aggregator downtime, we recommend mitigating Operator-to-Aggregator message loss by placing a highly available queue, such as Amazon SQS, in front of the Aggregator. However, Operator signatures accumulated by the Aggregator that had not yet reached a quorum would still be lost if the Aggregator crashed. To remediate this, Operator signatures could be logged to a file that is read on service startup. However, if an Aggregator experiencing issues is replaced by another Aggregator, these accumulated signatures would not be available, since the log file would live on a separate instance. A shared datastore, such as Redis, would be required to support node replacement. For catching up on missed events from AlignedLayerServiceManager, the timestamp-logging solution should suffice.

I would take a different approach here. Rather than having the queue consider every delivered message as done, we could use explicit acknowledgements (RabbitMQ supports this, for example) to signal that the Aggregator is finished with the message. Otherwise, the message is never removed from the queue in the first place. That way, every message is either completely processed or considered undelivered and still queued.
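
A minimal consumer sketch of that pattern with RabbitMQ's Go client (github.com/rabbitmq/amqp091-go); the connection URL, queue name, and handler are placeholders, not the project's.

```go
package main

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

// processSignedTaskResponse stands in for the Aggregator's real handling of
// an Operator's signed response; hypothetical for this sketch.
func processSignedTaskResponse(body []byte) error {
	return nil
}

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// autoAck=false: deliveries stay unacked until we explicitly confirm them,
	// so a crash mid-processing leaves the message queued for redelivery.
	msgs, err := ch.Consume("operator-responses", "aggregator", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for d := range msgs {
		if err := processSignedTaskResponse(d.Body); err != nil {
			// Negative-ack with requeue so another (or restarted) Aggregator
			// instance can pick the message up again.
			_ = d.Nack(false, true)
			continue
		}
		// Only now is the message considered done and dropped from the queue.
		_ = d.Ack(false)
	}
}
```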