wormhole-foundation / wormhole

A reference implementation for the Wormhole blockchain interoperability protocol.
https://wormhole.com

node: Solana watcher filters transactions based on logs which may be truncated, leading to missed messages #4065

Open evan-gray opened 1 month ago

evan-gray commented 1 month ago

Description and context

Background

Wormhole

The guardian is responsible for witnessing all core contract message emissions on all chains. On Solana, the message data is stored in a Solana account via the postMessage or postMessageUnreliable instruction.

https://github.com/wormhole-foundation/wormhole/blob/46bcc70e9563121eb30b09797ffe5582474fe8ab/node/pkg/watchers/solana/client.go#L192-L193

The guardian must also be able to detect and process all Wormhole messages within a block in less time than it takes the following block to be produced, so as not to fall behind, all while performing its other guardian node tasks. As of this writing, according to the Solana Explorer, the average block time is ~422ms with about 3.7k transactions per second.

Notably, the Solana contract includes only one log (msg!) which logs the sequence number.

https://github.com/wormhole-foundation/wormhole/blob/46bcc70e9563121eb30b09797ffe5582474fe8ab/solana/bridge/program/src/api/post_message.rs#L224-L225

Solana

With the advent of Versioned Transactions, v0 transactions added support for Address Lookup Tables. This means that, for a v0 transaction, an instruction's program index may resolve into the lookup table, and populating that table requires an additional RPC call for the given account. A quick check of a recent block showed 172 v0 transactions (79 of which used address table lookups) alongside 3074 legacy transactions.
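To illustrate why lookup tables force the extra RPC call, here is a minimal sketch (not the watcher's actual code) of how a v0 transaction's account indexes resolve. Per the versioned-transaction design, the combined key list is the static keys followed by the addresses loaded as writable and then read-only from lookup tables, so an index past the static keys cannot be resolved until the table contents are fetched. All key strings below are hypothetical placeholders.

```go
package main

import "fmt"

// resolveAccountKey resolves an instruction's account/program index for a v0
// transaction. The combined key list is: static keys, then addresses loaded
// as writable, then addresses loaded as read-only from address lookup
// tables. Resolving an index past the static keys therefore requires having
// fetched the lookup table accounts first (an extra RPC call per table).
func resolveAccountKey(static, loadedWritable, loadedReadonly []string, idx int) (string, bool) {
	combined := append(append(append([]string{}, static...), loadedWritable...), loadedReadonly...)
	if idx < 0 || idx >= len(combined) {
		return "", false
	}
	return combined[idx], true
}

func main() {
	static := []string{"Payer111", "SysProg11"}
	writable := []string{"LookedUpW1"}
	readonly := []string{"CoreBridge1"} // hypothetical program key loaded via a lookup table

	// Index 3 lands in the read-only loaded addresses: without the lookup
	// table contents, this index cannot be resolved to a program key.
	key, ok := resolveAccountKey(static, writable, readonly, 3)
	fmt.Println(key, ok) // CoreBridge1 true
}
```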

Additionally, there is a long-standing Solana node DoS protection around log messages that truncates a transaction's log to a default of 10k log bytes.
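The interaction between this truncation and a log-based filter can be modeled with a short sketch (a simplified, hypothetical model, not the node's or watcher's actual code; the real node appends a truncation marker once the budget is spent, modeled crudely here):

```go
package main

import (
	"fmt"
	"strings"
)

// mentionsProgram mimics a log-based transaction filter: it reports whether
// any collected log line mentions the given program ID. If the interesting
// invocation happens after the log budget is exhausted, the mention never
// appears in the collected logs and the filter misses the transaction.
func mentionsProgram(logs []string, programID string) bool {
	for _, l := range logs {
		if strings.Contains(l, programID) {
			return true
		}
	}
	return false
}

// truncateLogs mimics the node-side DoS protection: stop collecting log
// lines once the byte budget is spent (a simplified model of the behavior).
func truncateLogs(logs []string, budget int) []string {
	var out []string
	used := 0
	for _, l := range logs {
		if used+len(l) > budget {
			out = append(out, "Log truncated")
			break
		}
		out = append(out, l)
		used += len(l)
	}
	return out
}

func main() {
	core := "worm2ZoG2kUd4vFXhvjh93UUH596ayRfgQ2MgjNMTth"
	logs := []string{
		strings.Repeat("x", 10_000),       // a program logs 10k bytes first...
		"Program " + core + " invoke [1]", // ...so the core bridge invocation
		"Program log: Sequence: 42",       // and its sequence log are dropped.
	}
	truncated := truncateLogs(logs, 10_000)
	fmt.Println(mentionsProgram(truncated, core)) // false: the filter skips this transaction
}
```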

Current Watcher Implementation

The current guardian watcher (note: there are two, one for confirmed and one for finalized) performs the following steps:

  1. For each slot between the last read (exclusive) and the latest slot (inclusive), fetch the block (RPC call) [source]
  2. For each successful transaction in the block... [source]
  3. That includes a Wormhole Core program log... [source]
  4. Decode the transaction and populate the lookup table accounts (RPC call for each v0 transaction leveraging lookup tables) [source]
  5. Process each top-level instruction [source]
  6. Process each inner instruction [source]

The explicit purpose of the log filter (step 3) was to prevent the RPC footprint of this method from growing linearly with the number of v0 transactions using lookup tables.

However, this comes with a notable shortcoming: the watcher will skip any transaction whose critical log does not appear within the first 10k log bytes. Reliable messages missed in this way can still be reobserved.

Steps to reproduce

Write and invoke a Solana program that performs the following:

  1. Log 10k bytes
  2. Call post_message on the core bridge

Experienced behavior

The message is not observed by the guardians and a VAA is not produced.

Expected behavior

The message is observed by the guardians and a VAA is produced.

Solution recommendation

I am not immediately confident that a different solution is more desirable than the status quo, as they all come with trade-offs. Here is a list of alternatives I have considered.

  1. Investigate an alternative to the log check in step 3 above. This could be a check for instructions which otherwise look like the postMessage or postMessageUnreliable instruction. However, this requires decoding all transactions in a block and has the potential for false positives, leading to loading more lookup tables than necessary. The scaling performance of decoding all transactions would have to be considered along with the false-positive rate based on historical transactions.
  2. Switch to the websockets implementation used by Pyth. The trade-off here is that contributors have seen degradation and misses relying on programSubscribe. For this reason, there is a check which prevents this from being used for a chain other than Pyth. The reliability could be investigated and then toggled via feature flag to allow individual guardians to test the performance and reliability against their RPC nodes. My understanding is that these subscriptions do require greater RPC resources than the existing approach.
  3. Rewrite the watcher to use getSignaturesForAddress to filter the transactions. I'm unsure of the cost of this, but at least this could narrow down the transactions to only those from the core bridge program. However, this is at least one additional RPC call per block, which again would have to be completed quickly, and those transactions would still need to go through all of the existing processing.
  4. Offer a Geyser plugin *handwave handwave*.
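A rough sketch of alternative 1 (an instruction-shape prefilter instead of a log filter) might look like the following. Note the instruction discriminators below (1 for postMessage, 8 for postMessageUnreliable) are illustrative assumptions, not verified against the deployed program, and a real implementation would also have to handle program indexes that resolve into address lookup tables:

```go
package main

import "fmt"

// CompiledIx is a hypothetical stand-in for a compiled instruction.
type CompiledIx struct {
	ProgramIdx int
	Data       []byte
}

const (
	ixPostMessage           = 1 // assumed discriminator, must be verified
	ixPostMessageUnreliable = 8 // assumed discriminator, must be verified
)

// looksLikePostMessage reports whether an instruction might be a core
// bridge postMessage: a matching first data byte, plus the correct program
// key when the index resolves within the static keys. Indexes beyond the
// static keys are treated as "maybe" (true) to avoid false negatives,
// which is exactly the false-positive trade-off described above.
func looksLikePostMessage(ix CompiledIx, staticKeys []string, coreProgram string) bool {
	if len(ix.Data) == 0 {
		return false
	}
	if d := ix.Data[0]; d != ixPostMessage && d != ixPostMessageUnreliable {
		return false
	}
	if ix.ProgramIdx < len(staticKeys) {
		return staticKeys[ix.ProgramIdx] == coreProgram
	}
	// Program key lives in a lookup table: keep the transaction and let the
	// RPC-backed decode step resolve it, accepting a false positive.
	return true
}

func main() {
	static := []string{"Payer", "CoreBridge"} // hypothetical keys
	fmt.Println(looksLikePostMessage(CompiledIx{ProgramIdx: 1, Data: []byte{1}}, static, "CoreBridge")) // true
	fmt.Println(looksLikePostMessage(CompiledIx{ProgramIdx: 1, Data: []byte{2}}, static, "CoreBridge")) // false
	fmt.Println(looksLikePostMessage(CompiledIx{ProgramIdx: 5, Data: []byte{8}}, static, "CoreBridge")) // true (in lookup table: maybe)
}
```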

It is again important to note that RPC load can cause a Solana node to fall behind the network and slow RPC responses can cause a guardian to fall behind its peers and delay quorum for messages. As is the case for every guardian responsibility, it is critical for guardians to process Solana messages in a timely and performant manner. The log limitation is an effective compromise, but I am opening this issue to document the limitation, reveal the considerations, and weigh alternatives.

linuxhjkaru commented 1 month ago

@evan-gray How can we make the guardian reobserve the missing message?

evan-gray commented 1 month ago

> @evan-gray How can we make the guardian reobserve the missing message?

Guardians may manually or automatically re-observe missing transactions via their admin commands - ideally, these are not required during normal network operations.

As far as I understand, integrators who believe they have a missing message should reach out on the Wormhole discord for support.