Open avilagaston9 opened 6 days ago
Ran on local devnet and observed the Aggregator spans even when Batcher - Task Creation Failed
was thrown.
I don't think a span for every event is the right solution here. I believe we still want real spans, just not a single one. The times for the spans seem off as well, and it looks like the aggregator sees the future here:
Still, if it works better than what we have I think we should merge it and review the solution later.
The problem with the events is that they are associated with a parent span, but if for some reason we don't finalize it, we lose all the associated events. It seems that the task event arrives at the Aggregator before the receipt arrives at the Batcher. This behavior is also observed in staging, but I don't consider this a problem. IMO, we should separate this into two traces in the future, one for each component, removing the dependency between these two. I have created #1477 to revisit this solution later. @Oppen
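To illustrate the trade-off discussed above, here is a minimal, self-contained sketch. It is a toy model, not the OpenTelemetry API and not the Aligned telemetry server; `OpenSpan`, `ExportedSpan`, `orphan_spans` and the span names are hypothetical. It assumes, as the comment describes, that events are buffered on their parent span and only reach the backend when that span is ended, while each span is exported on its own.

```rust
use std::collections::HashSet;

/// A finished span as it would reach the tracing backend (toy model).
#[derive(Debug, Clone)]
struct ExportedSpan {
    id: u64,
    parent_id: Option<u64>,
    name: String,
    events: Vec<String>,
}

/// An in-flight span: its events live in memory until `end` is called.
struct OpenSpan {
    id: u64,
    parent_id: Option<u64>,
    name: String,
    events: Vec<String>,
}

impl OpenSpan {
    fn new(id: u64, parent_id: Option<u64>, name: &str) -> Self {
        OpenSpan { id, parent_id, name: name.to_string(), events: Vec::new() }
    }

    fn add_event(&mut self, event: &str) {
        self.events.push(event.to_string());
    }

    /// Ending the span is the only way it (and its events) reach the backend.
    fn end(self, backend: &mut Vec<ExportedSpan>) {
        backend.push(ExportedSpan {
            id: self.id,
            parent_id: self.parent_id,
            name: self.name,
            events: self.events,
        });
    }
}

/// Names of exported spans whose parent never reached the backend.
fn orphan_spans(backend: &[ExportedSpan]) -> Vec<String> {
    let exported: HashSet<u64> = backend.iter().map(|s| s.id).collect();
    backend
        .iter()
        .filter(|s| s.parent_id.map_or(false, |p| !exported.contains(&p)))
        .map(|s| s.name.clone())
        .collect()
}

/// Aggregator step recorded as an event on a root span that is never ended.
fn events_on_unfinished_root() -> Vec<ExportedSpan> {
    let mut backend = Vec::new();
    let mut root = OpenSpan::new(1, None, "batch trace");
    root.add_event("Aggregator - task received");
    drop(root); // never ended: the event is lost with it
    backend
}

/// Aggregator step recorded as its own span under a root that is never ended.
fn spans_under_unfinished_root() -> Vec<ExportedSpan> {
    let mut backend = Vec::new();
    let root = OpenSpan::new(1, None, "batch trace");
    let step = OpenSpan::new(2, Some(root.id), "Aggregator - task received");
    step.end(&mut backend); // exported independently of the root
    drop(root); // root still never ended
    backend
}
```

In this model the first function exports nothing, while the second exports the Aggregator span, which `orphan_spans` reports because its parent ID does not match any exported span.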
Improve Telemetry Spans
Motivation
We found that our Batcher sometimes tries to cancel batches that were actually included in the network by calling the batcherTaskCreationFailed endpoint, which finalizes the trace and prevents the Aggregator from registering its spans in it.
Description
batcherTaskCreationFailed occurs.
Observations
On a real batcherTaskCreationFailed, the Aggregator won't receive the new task, and the trace will remain unfinished. Furthermore, the trace metadata won't be removed from the Telemetry server store. Despite that, we will still be able to visualize the orphan spans, with a warning that their parent ID is invalid. #1477 was created to address this issue.
How To Test
Run anvil, all Aligned components with one or more operators and start telemetry:
Go to Jaeger and explore the generated traces.
Change the Batcher create_new_task_retryable function in batcher/aligned-batcher/src/retry/batcher_retryables.rs:165 to return an error after receiving the receipt.
Then, start all components again and you should be able to see the Aggregator spans even when the Batcher sends Batcher - Task Creation Failed.
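The kind of change described in this step could look roughly like the sketch below. It is not the actual code in batcher_retryables.rs: `Receipt`, `TaskError`, `send_create_task` and `create_new_task_with_fault` are hypothetical stand-ins, and the real function's signature and error type will differ. It only shows the idea of letting the transaction succeed and then reporting a failure anyway.

```rust
/// Hypothetical stand-in for the transaction receipt returned by the chain.
#[derive(Debug, Clone, PartialEq)]
struct Receipt {
    tx_hash: String,
}

/// Hypothetical error type; the real Batcher uses its own retry error.
#[derive(Debug, PartialEq)]
enum TaskError {
    Injected(String),
}

/// Hypothetical stand-in for submitting the task and waiting for its receipt.
/// In this sketch it always succeeds.
fn send_create_task(batch_root: &str) -> Result<Receipt, TaskError> {
    Ok(Receipt { tx_hash: format!("0xtx-for-{batch_root}") })
}

/// The task is created and the receipt is received, but when `inject_fault`
/// is set the caller is told the creation failed, which is the situation
/// this test step reproduces.
fn create_new_task_with_fault(batch_root: &str, inject_fault: bool) -> Result<Receipt, TaskError> {
    let receipt = send_create_task(batch_root)?;
    if inject_fault {
        // Test-only: discard the successful receipt and report an error.
        return Err(TaskError::Injected(format!(
            "simulated failure after receipt {}",
            receipt.tx_hash
        )));
    }
    Ok(receipt)
}
```

Because the task really was created, the Aggregator still receives it, which is why its spans should show up alongside the Batcher's failure report.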
Type of change
Checklist
testnet, everything else to staging
1477