Open tmpolaczyk opened 3 years ago
I was able to reproduce this issue the other day. It can happen while synchronizing a node. Steps to try to reproduce:
The node will not stop immediately because it needs to finish processing the current batch of blocks. However, after processing the batch it keeps running for about a second and then crashes with the message:
pthread lock: Invalid argument
Aborted (core dumped)
And then the node is missing exactly 20 blocks from this batch.
Probably related to #2008
@tmpolaczyk was this issue closed in #2159?
No, this is still not fixed.
I noticed some errors about missing blocks on one node. It is not clear whether that is a problem with witnet-rust or with rocksdb, so I think the best approach for now is to assume that this can happen and perform automatic database integrity checks or something similar.
I implemented a JSON-RPC command to check for missing blocks, available in this branch. It can be run as:
And the result is this list of `(epoch, block_hash)` pairs that are missing from that one node:

Currently this kind of error can be solved by doing a `rewind`, which will process all the blocks until the first missing block, and then continue with a normal (slow) synchronization. So the next step would be to implement some functionality to retrieve those missing blocks from other peers automatically, without requiring the user to run the rewind command.
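
For illustration, the missing-blocks check and the point at which a rewind falls back to slow synchronization could be sketched roughly as follows. This is only a minimal sketch under assumptions, not the code from the branch: `Epoch`, `BlockHash`, `find_missing_blocks` and `first_missing_epoch` are hypothetical names, the chain index is modelled as a plain `BTreeMap`, and block storage as a `HashSet` of the hashes that are actually present.

```rust
use std::collections::{BTreeMap, HashSet};

// Hypothetical stand-ins for illustration; these are not witnet-rust types.
type Epoch = u32;
type BlockHash = [u8; 32];

/// Return every (epoch, block_hash) pair that the chain index refers to
/// but whose block is not present in storage, ordered by epoch.
fn find_missing_blocks(
    chain_index: &BTreeMap<Epoch, BlockHash>,
    stored: &HashSet<BlockHash>,
) -> Vec<(Epoch, BlockHash)> {
    chain_index
        .iter()
        .filter(|(_, hash)| !stored.contains(*hash))
        .map(|(epoch, hash)| (*epoch, *hash))
        .collect()
}

/// Epoch of the first missing block, if any. In this sketch it marks the
/// point where a rewind would stop replaying local blocks and fall back
/// to normal synchronization.
fn first_missing_epoch(missing: &[(Epoch, BlockHash)]) -> Option<Epoch> {
    missing.iter().map(|(epoch, _)| *epoch).min()
}
```

An automatic recovery step could then request exactly the hashes returned by `find_missing_blocks` from peers instead of rewinding, but how such a request would be made is not covered by this sketch.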