witnet / witnet-rust

Open source Rust implementation of the Witnet decentralized oracle protocol, including full node and wallet backend 👁️🦀
https://docs.witnet.io
GNU General Public License v3.0

Missing blocks in storage #2102

Open tmpolaczyk opened 3 years ago

tmpolaczyk commented 3 years ago

I noticed some errors about missing blocks in one node. It is not clear whether that's a problem with witnet-rust or with RocksDB, so I think the best solution for now would be to assume that this can happen and perform automatic database integrity checks or something similar.

I implemented a JSON-RPC command to check for missing blocks, available in this branch. It can be run as:

echo '{"jsonrpc":"2.0","id":"1","method":"checkBlockChain","params":null}' | ./target/release/witnet -c witnet.toml node raw

The result is this list of (epoch, block_hash) pairs that are missing from that one node:

[[681700,"c2674475ed7cb14f83d782ca6d40d27908b1393df779445e07253667e690983c"],[681701,"661fae380e01a03acd7332f0430d735c1b98eb615bb8ef00c58183b72317a173"],[681702,"1162fdfc976212e6571b848a59d4288496333bf5a4e87413093df937bd9e37e4"],[681703,"e004ef4925350f4bb4b916fc72bb14a3043cb6bc926a646465d9d180978fa9bf"],[681704,"dadabbbb1e9d99ed5c444cfceacc71791ab657931306feaac6f6e6e9edc7ef38"],[681705,"ae36350715c1f68df72377aa07590c3c4b0e8d78985881aba9b2658e1c9624d8"],[681706,"218f41843fcdc5f67e2e0e798fbc49c7293b0df51ac9dd05a3bf77324b17857c"],[681707,"07995e9b4857265fa0f6df95f60357f67ce7e5256f62a017b896992c69bdd6c4"],[681708,"7adcc6fd6df0fe2f205adf2d7cd44db63d253b0f2253b224c1496f4bdd690dd2"],[681709,"b995950291409753dd4c91184a5edbf703210284069d500d2d3401536bf3cb60"]]
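A list like this is easier to read when consecutive epochs are collapsed into ranges. The following is only a minimal sketch of that idea, assuming the epochs have already been extracted from the JSON result and are sorted in ascending order; `missing_ranges` is a hypothetical helper, not part of witnet-rust.

```rust
/// Collapse a sorted list of missing epochs into inclusive (start, end) ranges.
/// Hypothetical helper for illustration; not part of witnet-rust.
fn missing_ranges(epochs: &[u32]) -> Vec<(u32, u32)> {
    let mut ranges: Vec<(u32, u32)> = Vec::new();
    for &epoch in epochs {
        if let Some(last) = ranges.last_mut() {
            // Extend the current range when this epoch directly follows it.
            if last.1.checked_add(1) == Some(epoch) {
                last.1 = epoch;
                continue;
            }
        }
        // Otherwise start a new single-epoch range.
        ranges.push((epoch, epoch));
    }
    ranges
}

fn main() {
    // Epochs taken from the result above.
    let missing: Vec<u32> = (681700..=681709).collect();
    for (start, end) in missing_ranges(&missing) {
        println!("missing epochs {}..={} ({} blocks)", start, end, end - start + 1);
    }
}
```

For the result above this reports a single gap of 10 consecutive blocks, from epoch 681700 to 681709.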

Currently this kind of error can be solved by doing a rewind, which will process all the blocks until the first missing block and then continue with a normal (slow) synchronization. So the next step would be to implement some functionality to retrieve those missing blocks from other peers automatically, without requiring the user to run the rewind command.
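As a rough illustration of what an integrity check plus automatic recovery would need to compute, here is a minimal sketch. It assumes the chain can be viewed as a map from epoch to block hash and storage as a set of stored hashes; both types, and the names `find_missing_blocks` and `first_missing_epoch`, are hypothetical stand-ins and do not reflect the actual witnet-rust storage API or the `checkBlockChain` implementation.

```rust
use std::collections::{BTreeMap, HashSet};

/// Return every (epoch, block_hash) in the chain index whose block is not in storage.
/// `chain_index` and `stored` are hypothetical stand-ins for the node's real data.
fn find_missing_blocks(
    chain_index: &BTreeMap<u32, String>,
    stored: &HashSet<String>,
) -> Vec<(u32, String)> {
    // BTreeMap iterates in ascending key order, so the result is sorted by epoch.
    chain_index
        .iter()
        .filter(|(_, hash)| !stored.contains(*hash))
        .map(|(epoch, hash)| (*epoch, hash.clone()))
        .collect()
}

/// Epoch of the first missing block, i.e. the point up to which a rewind
/// could process blocks before falling back to normal synchronization.
fn first_missing_epoch(missing: &[(u32, String)]) -> Option<u32> {
    missing.first().map(|(epoch, _)| *epoch)
}

fn main() {
    // Toy data: five epochs in the chain, two of them absent from storage.
    let chain_index: BTreeMap<u32, String> = (100..105u32)
        .map(|epoch| (epoch, format!("hash-{}", epoch)))
        .collect();
    let stored: HashSet<String> = [100u32, 101, 104]
        .iter()
        .map(|epoch| format!("hash-{}", epoch))
        .collect();

    let missing = find_missing_blocks(&chain_index, &stored);
    println!("blocks to request from peers: {:?}", missing);
    println!("first missing epoch: {:?}", first_missing_epoch(&missing));
}
```

With the list of missing hashes available, the node could in principle request only those blocks from peers instead of rewinding; how that request would be made is outside this sketch.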

tmpolaczyk commented 2 years ago

I was able to reproduce this issue the other day. It can happen while synchronizing a node. Steps to try to reproduce:

The node will not stop immediately because it needs to finish processing the current batch of blocks. However, after processing the batch it keeps running for a second and then crashes with the message:

pthread lock: Invalid argument
Aborted (core dumped)

After that, the node is missing exactly 20 blocks from this batch.

Probably related to #2008

Tommytrg commented 2 years ago

@tmpolaczyk was this issue closed in #2159?

tmpolaczyk commented 2 years ago

No, this is still not fixed.