zeta-chain / node

ZetaChain’s blockchain node and an observer validator client
https://zetachain.com
MIT License

Smoke tests sometimes freeze because TSS is never generated #983

Closed lumtis closed 1 year ago

lumtis commented 1 year ago

I sometimes see the smoke tests get stuck in the initialization phase because the TSS is never created: https://github.com/zeta-chain/node/blob/6b860efb1cd378328775113fb572a72271918270/contrib/localnet/orchestrator/smoketest/main.go#L137

The cause is that, for some reason, the ZetaClient container restarts and misses the block height set for keygen:

zetaclient0  | 2023-08-16T09:55:59Z INF Waiting For Keygen Block to arrive or new keygen block to be set. Keygen Block : 20 Current Block : 9 module=keygen
zetaclient0  | 2023-08-16T09:56:00Z INF Waiting For Keygen Block to arrive or new keygen block to be set. Keygen Block : 20 Current Block : 10 module=keygen
zetaclient0  | 2023-08-16T09:56:03Z INF Waiting For Keygen Block to arrive or new keygen block to be set. Keygen Block : 20 Current Block : 11 module=keygen
zetaclient0  | Wait for zetacore to exchange genesis file
zetaclient0  | operatorAddress: zeta16c4u3tledclhenpagjqvqt42rqluq5nws5xp79
zetaclient0  | Start zetaclientd
zetaclient0  | rm: can't remove '/root/.tss/*': No such file or directory
zetaclient0  | 2023-08-16T09:57:07Z INF ZetaCore is ready , Trying to connect to  module=startup
zetaclient0  | 2023-08-16T09:57:07Z INF Zeta-core height: 16 module=CoreBridge

…

zetaclient0  | 2023-08-16T10:41:44Z INF Successfully announced! module=communication
zetaclient0  | 2023-08-16T10:41:44Z INF Starting the TSS servers
zetaclient0  | 2023-08-16T10:41:44Z INF LocalID: 16Uiu2HAmCXPGhc6B85FkjK3awKpHxMbnGoDzKjnQCxyQBEQ1VuiN
zetaclient0  | 2023-08-16T10:41:44Z INF TSS Keyshare file NOT found
zetaclient0  | 2023-08-16T10:41:45Z INF Waiting For Keygen Block to arrive or new keygen block to be set. Keygen Block : 20 Current Block : 34 module=keygen
zetaclient0  | 2023-08-16T10:41:48Z INF Waiting For Keygen Block to arrive or new keygen block to be set. Keygen Block : 20 Current Block : 35 module=keygen
zetaclient0  | 2023-08-16T10:41:50Z INF Waiting For Keygen Block to arrive or new keygen block to be set. Keygen Block : 20 Current Block : 36 module=keygen

Considered solution

It appears to me that ZetaClient does not need to be exactly at cfg.Keygen.BlockNumber to generate the TSS; being at or past this block should be enough. In this case, the solution would be to replace the condition at: https://github.com/zeta-chain/node/blob/6b860efb1cd378328775113fb572a72271918270/cmd/zetaclientd/keygen_tss.go#L65

to

if currentBlock >= cfg.Keygen.BlockNumber { 

I think we should also add a maximum retry count at: https://github.com/zeta-chain/node/blob/6b860efb1cd378328775113fb572a72271918270/contrib/localnet/orchestrator/smoketest/main.go#L137 so that the smoke test stops with a failure instead of looping forever during initialization.
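A minimal sketch of what such a bounded wait could look like. This is not the actual smoke test code: `waitForTSS`, `ErrTSSNotGenerated`, and the `check` callback are hypothetical names, and `check` stands in for whatever query the smoke test uses to see whether the TSS exists.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTSSNotGenerated is a hypothetical sentinel error returned when the
// TSS still does not exist after the last allowed attempt.
var ErrTSSNotGenerated = errors.New("TSS not generated")

// waitForTSS calls check up to maxAttempts times, sleeping for interval
// between attempts. It returns nil as soon as check reports true, and an
// error wrapping ErrTSSNotGenerated once the attempts are exhausted.
// check is an assumed callback; in the smoke test it would wrap the TSS query.
func waitForTSS(check func() bool, maxAttempts int, interval time.Duration) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if check() {
			return nil
		}
		// No need to sleep after the final failed attempt.
		if attempt < maxAttempts {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("%w after %d attempts", ErrTSSNotGenerated, maxAttempts)
}
```

The caller would treat a non-nil error as a test failure and exit, so a missed keygen block shows up as a failed run rather than a hung one.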

kingpinXD commented 1 year ago

Using a height here is to coordinate the keygen so that every client broadcasts the transaction simultaneously. If we modify it to >= even though it can enter the keygen loop, the keygen itself will fail. Curious, though, how often are you getting the error?
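To make the trade-off concrete, here is a small sketch that keeps the exact-height rule but distinguishes "still waiting" from "already missed". It does not reflect the real code in keygen_tss.go: `keygenAction` and `classifyKeygen` are hypothetical names used only for illustration.

```go
package main

// keygenAction is a hypothetical type describing what a client should do
// at a given block height relative to the configured keygen block.
type keygenAction int

const (
	keygenWait   keygenAction = iota // keygen block not reached yet
	keygenStart                      // exactly at the keygen block: start the ceremony
	keygenMissed                     // keygen block already passed: a new one must be set
)

// classifyKeygen compares the current block with the keygen block.
// The ceremony only starts on an exact match, so that all clients
// begin at the same height.
func classifyKeygen(currentBlock, keygenBlock int64) keygenAction {
	switch {
	case currentBlock < keygenBlock:
		return keygenWait
	case currentBlock == keygenBlock:
		return keygenStart
	default:
		return keygenMissed
	}
}
```

With the heights from the logs above, block 9 against keygen block 20 is a wait, while block 34 against keygen block 20 is a miss, which a client could report explicitly instead of continuing to print the waiting message.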

lumtis commented 1 year ago

Ok, I think it's fine; the issue arises around 20% of the time. So maybe just making sure the smoke test fails when the TSS is not generated is enough.

I'm still wondering why the generation couldn't happen asynchronously. Do we need all txs in the same block?

brewmaster012 commented 1 year ago

TSS keygen is an interactive "ceremony", which is why these are often called Keygen/Keysign ceremonies. It's a pretty heavy MPC computation. The participants need to be online at roughly the same time; otherwise it will fail. This is why all zetaclients need to be synchronized to a certain block.

In your case, why would zetaclientd miss that exact block? Timing or slow computer?

lumtis commented 1 year ago

Thanks for the explanation.

Somehow I can't reproduce the issue anymore. I will close it for now and reopen it if it occurs again.