yetanotherco / aligned_layer

Aligned is a verification layer for zero-knowledge proofs using EigenLayer. Our mission is to accelerate the adoption of zero-knowledge and validity proofs on Ethereum.
https://alignedlayer.com/
MIT License

fix: improve telemetry spans #1472

Open avilagaston9 opened 6 days ago

avilagaston9 commented 6 days ago

Improve Telemetry Spans

Motivation

We found that the Batcher sometimes tries to cancel batches that were actually included in a block. In that case it calls the batcherTaskCreationFailed endpoint, which finalizes the trace and prevents the Aggregator from registering its spans in that trace.
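To make the failure mode concrete, here is a minimal sketch in Rust. It is a toy, in-memory model and not the actual Telemetry server code: `TraceStore`, its methods, the `0xabc` key, and the span names other than Batcher - Task Creation Failed are all hypothetical. It assumes that finalizing a trace drops its metadata, so any span reported afterwards has nothing to attach to.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the Telemetry server's trace store.
/// Keying traces by a batch identifier is an assumption of this sketch.
#[derive(Default)]
struct TraceStore {
    /// Open traces: batch identifier -> span names recorded so far.
    open: HashMap<String, Vec<String>>,
    /// Traces that were finalized and handed off.
    exported: Vec<(String, Vec<String>)>,
}

#[derive(Debug, PartialEq)]
enum SpanError {
    /// The trace was never created or has already been finalized.
    UnknownTrace,
}

impl TraceStore {
    fn init_trace(&mut self, id: &str) {
        self.open.insert(id.to_string(), Vec::new());
    }

    /// Records a span in an open trace; fails once the trace is finalized.
    fn add_span(&mut self, id: &str, name: &str) -> Result<(), SpanError> {
        let spans = self.open.get_mut(id).ok_or(SpanError::UnknownTrace)?;
        spans.push(name.to_string());
        Ok(())
    }

    /// Finalizes the trace and removes its metadata from the store.
    fn finish_trace(&mut self, id: &str) {
        if let Some(spans) = self.open.remove(id) {
            self.exported.push((id.to_string(), spans));
        }
    }
}

/// Replays the scenario: the Batcher reports a failure for a batch that was
/// actually included, then the Aggregator tries to register its span.
fn aggregator_span_after_false_failure() -> Result<(), SpanError> {
    let mut store = TraceStore::default();
    store.init_trace("0xabc");
    store.add_span("0xabc", "Batcher - Task Sent").unwrap();
    // False alarm: the batch was included, but the Batcher gives up anyway
    // and the failure handler finalizes the trace.
    store.add_span("0xabc", "Batcher - Task Creation Failed").unwrap();
    store.finish_trace("0xabc");
    // The Aggregator still receives the task, but the trace is already gone.
    store.add_span("0xabc", "Aggregator - Task Received")
}
```

In this model `aggregator_span_after_false_failure()` returns `Err(SpanError::UnknownTrace)`, which is the toy equivalent of the Aggregator spans missing from the trace.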

Description

Observations

On a real batcherTaskCreationFailed, the Aggregator won't receive the new task, and the trace will remain unfinished. Furthermore, the trace metadata won't be removed from the Telemetry server store. Despite that, we will still be able to visualize the orphan spans, with a warning that their parent ID is invalid.

#1477 was created to address this issue.

How To Test

  1. Check that everything is working normally:

Run Anvil and all Aligned components with one or more operators, then start telemetry:

make telemetry_full_start

Go to Jaeger and explore the generated traces.

  2. Check the scenario addressed in this PR:

Change the Batcher create_new_task_retryable function in batcher/aligned-batcher/src/retry/batcher_retryables.rs:165 to return an error after receiving the receipt:

    // Timeout to prevent a deadlock while waiting for the transaction to be included in a block.
    let _result = timeout(Duration::from_millis(transaction_wait_timeout), pending_tx)
        .await
        .map_err(|e| {
            warn!("Error while waiting for batch inclusion: {e}");
            RetryError::Permanent(BatcherError::ReceiptNotFoundError)
        })?
        .map_err(|e| {
            warn!("Error while waiting for batch inclusion: {e}");
            RetryError::Permanent(BatcherError::ReceiptNotFoundError)
        })?
        .ok_or(RetryError::Permanent(BatcherError::ReceiptNotFoundError));
    // Test-only change: the receipt was received, but report a failure anyway.
    Err(RetryError::Permanent(BatcherError::ReceiptNotFoundError))

Then start all components again. You should be able to see the Aggregator spans even when the Batcher sends Batcher - Task Creation Failed.

PatStiles commented 6 days ago

Ran on local devnet and observed the Aggregator spans even when Batcher - Task Creation Failed was thrown.

Oppen commented 6 days ago

I don't think a span for every event is the right solution here. I believe we still want real spans, just not a single one. The times for the spans seem off as well, and it looks like the aggregator sees the future here:

image

Still, if it works better than what we have I think we should merge it and review the solution later.

avilagaston9 commented 6 days ago

The problem with events is that they are attached to a parent span: if for some reason we never finalize that span, we lose all of its events. As for the timing, it seems that the task event arrives at the Aggregator before the receipt arrives at the Batcher. This behavior is also observed in staging, and I don't consider it a problem. IMO, we should split this into two traces in the future, one per component, removing the dependency between the two. I have created #1477 to revisit this solution later. @Oppen
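To illustrate the difference between events and per-event spans, here is a small Rust sketch. It is a toy model rather than our telemetry code or any tracing library's API: `Exporter`, `Span`, and the span and event names are hypothetical. It assumes a span's events are only exported when the span itself ends, while a separate span is exported as soon as it ends.

```rust
/// Toy exporter: only data handed over in `Span::end` reaches the tracing backend.
#[derive(Default)]
struct Exporter {
    received: Vec<String>,
}

/// A span buffers its events and ships them only when it ends.
struct Span {
    name: String,
    events: Vec<String>,
}

impl Span {
    fn new(name: &str) -> Self {
        Span { name: name.to_string(), events: Vec::new() }
    }

    fn add_event(&mut self, event: &str) {
        self.events.push(event.to_string());
    }

    /// Ending the span is the only point at which it and its events are exported.
    fn end(self, exporter: &mut Exporter) {
        exporter.received.push(self.name);
        exporter.received.extend(self.events);
    }
}

/// Events on a parent span that is never ended: nothing is exported.
fn events_on_unfinished_parent() -> Vec<String> {
    let exporter = Exporter::default();
    let mut parent = Span::new("batch trace");
    parent.add_event("Aggregator - Task Received");
    // `parent.end(..)` is never called, e.g. because the trace was abandoned.
    drop(parent);
    exporter.received
}

/// One short span per event: it is exported when it ends,
/// whether or not the parent is ever finalized.
fn span_per_event_with_unfinished_parent() -> Vec<String> {
    let mut exporter = Exporter::default();
    let parent = Span::new("batch trace");
    Span::new("Aggregator - Task Received").end(&mut exporter);
    drop(parent); // the parent is still never ended
    exporter.received
}
```

In this model the first function returns an empty list and the second returns the single Aggregator entry, which is why the spans survive an unfinished parent and show up as orphans instead of disappearing.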