An `eth_estimate` request uses stale state in a mining node.

egonspace commented 5 months ago

System information

All go-wemix version.

Expected behaviour

Some tx to a mining node is failed as the tx processor does not refer the latest state. But other call requests getting some info are processed successfully.

The difference between the tx request and the information inquiry request is that the tx request gets the state through the pending block, and the information inquiry request gets the state through the latest block.

When we analyzed the cause, it was a problem that the update of the pending block was a little late, in Wemix, the logic of updating the pending block in the EN node and mining node is slightly different.

Here is debugging log.

DEBUG[06-11|18:51:36.419] Mining Token, failure                    height=164 parent-hash=578bf1..b768cf error="not eligible"
INFO [06-11|18:51:36.419] CHECKING REFRESH PENDING                 parent block number=163
DEBUG[06-11|18:51:37.218] Fetching single header                   id=427e989a4c8e9a02 conn=staticdial hash=19ad33..10f6ed
DEBUG[06-11|18:51:37.220] Fetching batch of block bodies           id=427e989a4c8e9a02 conn=staticdial count=1
DEBUG[06-11|18:51:37.221] Queued delivered header or block         peer=427e989a4c8e9a0279b80791367d93dc37d282def6e1b31c0fe325fe44307ae4 number=164 hash=19ad33..10f6ed queued=1
DEBUG[06-11|18:51:37.221] Importing propagated block               peer=427e989a4c8e9a0279b80791367d93dc37d282def6e1b31c0fe325fe44307ae4 number=164 hash=19ad33..10f6ed
DEBUG[06-11|18:51:37.288] Persisted trie from memory database      nodes=61 size=10.23KiB time=17.136417ms  gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=-164.00B
INFO [06-11|18:51:37.289] Imported new chain segment               blocks=1 txs=1 mgas=0.561 elapsed=59.739ms  mgasps=9.396 number=164 hash=19ad33..10f6ed dirty=0.00B
DEBUG[06-11|18:51:37.289] Inserted new block                       number=164 hash=19ad33..10f6ed uncles=0 txs=1 gas=561,304 elapsed=50.377ms  root=2f928a..0b4420
DEBUG[06-11|18:51:37.289] Reinjecting stale transactions           count=0
INFO [06-11|18:51:37.346] Estimate gas!!!!!                        pendigBlock=-2
INFO [06-11|18:51:37.346] Estimate gas2222!!!!!                    pendigBlock=164
DEBUG[06-11|18:51:37.363] 1111111111!!!!!                          res="&{UsedGas:120763 Fee:+12076300000000000 Err:execution reverted ReturnData:[8 195 121 160 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 67 104 101 99 107 32 104 101 114 101 48 48 55 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]}" err=nil
DEBUG[06-11|18:51:37.364] Mining Token, successful                 height=165 parent-hash=19ad33..10f6ed
DEBUG[06-11|18:51:37.364] time-it                                  round=1       offset=1  height=164 dt=1

Based on the above log, the following logic sequence can be found.

A mining node's latest block = 163, pending block = 164(not actual block but virtual block, state = 163 block)
A mining node failed to acquire the mining token (not mining in this turn)
A mining node receives the propagated block(164) and inserts this its chain.
A mining does not update pending block yet.
A mining node acquires the mining token.
A mining node sleeps for block interval.
A mining node receives the tx request (it uses state of block 163 although it has block 164 for the latest block) => it's bug!

Expected behavior

mining node should update pending block as soon as possible when it receives new block

Steps to reproduce the behaviour

reproduction conditions:

tx should be to a mining node (tx to an EN node is processed normally)
long block interval
3 nodes at least (2 nodes chain does not change the mining node frequently)

Backtrace

[backtrace]

When submitting logs: please submit them as text and not screenshots.

jennwiederholen commented 5 months ago

given issue could be solved by executing refreshPending() method right after/before node is trying to acquire mining token. currently, refreshPending() method is executed when node failed to acquire mining token and the node who got the mining token updates pending states after the block is made which takes certain length of sleep time. ( to adjust blockCreationTime)

so if refreshPending() is executed whether or not mining token is acquired, pending state will update the current block as soon as it's propagated.

// before
func (w *worker) commitWork(interrupt *int32, noempty bool, timestamp int64) {
    ... 
    if !wemixminer.IsPoW() {
        parent := w.chain.CurrentBlock()
        height := new(big.Int).Add(parent.Number(), common.Big1)
        ok, err := wemixminer.AcquireMiningToken(height, parent.Hash())
        if ok {
            log.Debug("Mining Token, successful", "height", height, "parent-hash", parent.Hash())
        } else {
            log.Debug("Mining Token, failure", "height", height, "parent-hash", parent.Hash(), "error", err)
        }
        if !ok {
                         w.refreshPending(true)
            return
        }
    }
    ...

// after 
func (w *worker) commitWork(interrupt *int32, noempty bool, timestamp int64) {
    ... 
    if !wemixminer.IsPoW() {
        parent := w.chain.CurrentBlock()
        height := new(big.Int).Add(parent.Number(), common.Big1)
        ok, err := wemixminer.AcquireMiningToken(height, parent.Hash())
                 w.refreshPending(true)
        if ok {
            log.Debug("Mining Token, successful", "height", height, "parent-hash", parent.Hash())
        } else {
            log.Debug("Mining Token, failure", "height", height, "parent-hash", parent.Hash(), "error", err)
        }
        if !ok {
            return
        }
    }
    ...

need more check if any side effects though..

jed-wemade commented 5 months ago

In Ethereum v1.10.15 using PoW, Engine.FinalizeAndAssemble() (called in w.commit()) comes after w.updateSnapshot(). In Ethereum v1.10.16, Engine.FinalizeAndAssemble() comes before w.updateSnapshot(). The change is applied by commit https://github.com/ethereum/go-ethereum/commit/78636ee56856ef50299183dd04d02a3e7f555cbc. I cannot find the commit message or discussion why the order is changed. It seems that the snapshot which is exposed to public is not so critical for synchronization. There is a short period that is out of sync.

In Wemix, however, (*worker).commitTransactionsEx() sleeps until env.till so the out-of-sync time stands out. So the waiting must be done before or after both commit and updating snapshot.

egonspace commented 5 months ago

For reference, this issue does not occur on the mainnet. This is because miners on the mainnet do not receive transactions directly.

There are two ways to resolve this issue:

Avoid sleeping in worker.commitTransactionEx() and instead perform updateSnapshot() in worker.commitEx(), followed by sleep(blockInterval).
Call refreshPending() before sleeping in worker.commitTransactionEx().

Option 1 is more desirable because it reflects the transactions of the currently created block in the pending state. In option 2, since the transaction receipts cannot yet be obtained, it is not possible to reflect the currently processed transactions in the pending state, and the only option is to refresh with the state of the latest block.

egonspace commented 5 months ago

After investigating this issue, I hope to solve the following three problems.

If it is a mining node, do updateSnapshot() before going to sleep during block interval
Calculate from a latestBlock, not a pendingBlock, in eth.estimateGas and eth.createAccessList
- refer ethereum issue(https://github.com/ethereum/execution-apis/issues/495)
- PR: https://github.com/ethereum/go-ethereum/pull/28623
No need to add newWork in newWorkLoopEx() if it is a non-mining node

jennwiederholen commented 5 months ago

the issue is fixed in #111. close the issue

wemixarchive / go-wemix