wirelineio / wns

Wireline Naming Service
GNU General Public License v3.0

"proof is unexpectedly empty" for wns-lite sync without obvious cause #258

Closed telackey closed 4 years ago

telackey commented 4 years ago

Working at 18:44:42:

2020-06-18 18:44:42.874: time="2020-06-18T18:44:42Z" level=debug msg="RPC endpoints: [tcp://node1.dxos.network:26657 tcp://45.79.120.249:26657 tcp://127.0.0.1:26657]"
2020-06-18 18:44:47.391: time="2020-06-18T18:44:47Z" level=info msg="Syncing from tcp://45.79.120.249:26657 at height: 1124654"
2020-06-18 18:44:53.999: time="2020-06-18T18:44:53Z" level=info msg="Syncing from tcp://45.79.120.249:26657 at height: 1124655"

Broken at 18:55:16:

2020-06-18 18:55:16.045: time="2020-06-18T18:55:16Z" level=info msg="Syncing from tcp://127.0.0.1:26657 at height: 1124656"
2020-06-18 18:55:16.046: time="2020-06-18T18:55:16Z" level=error msg="ABCIQuery: Post \"http://127.0.0.1:26657\": dial tcp 127.0.0.1:26657: connect: connection refused"
2020-06-18 18:55:17.201: time="2020-06-18T18:55:17Z" level=info msg="Syncing from tcp://45.79.120.249:26657 at height: 1124656"
2020-06-18 18:55:17.581: time="2020-06-18T18:55:17Z" level=error msg="error fetching state: {\"codespace\":\"sdk\",\"code\":1,\"message\":\"proof is unexpectedly empty; ensure height has not been pruned\"}"
2020-06-18 18:55:18.738: time="2020-06-18T18:55:18Z" level=info msg="Syncing from tcp://node1.dxos.network:26657 at height: 1124656"
2020-06-18 18:55:18.899: time="2020-06-18T18:55:18Z" level=error msg="error fetching state: {\"codespace\":\"sdk\",\"code\":1,\"message\":\"proof is unexpectedly empty; ensure height has not been pruned\"}"

Nothing leaps out in the logs, but once this error begins, it is persistent. I fixed it by restarting wnsd-lite with --reset.

ashwinphatak commented 4 years ago

This happens if the lite node is offline for a few minutes and the full node prunes state in the IAVL store. Full nodes have a flag to disable pruning, which prevents this error. However, we can't force full nodes outside our control to set that flag, and in those cases a reset is required.
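For reference, on Cosmos-SDK-based full nodes of this era, pruning is typically controlled by the `--pruning` flag on the start command. A sketch of what a cooperating full-node operator could run (the `wnsd` binary name is assumed from this repo; check the actual daemon's help output):

```shell
# Hypothetical invocation: retain all historical state (and the IAVL
# proofs for old heights) by disabling pruning on the full node.
# Without this, pruned heights cause "proof is unexpectedly empty"
# for lite nodes that fall behind.
wnsd start --pruning=nothing
```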

telackey commented 4 years ago

I am not sure how many minutes it takes, but in practice I cannot run my local KUBE for more than a week with confidence, even when there has been no known disruption in connectivity. More rarely, I have seen this even on the cloud nodes.

I think the most tractable solution would be to make this easy to recover from, rather than trying to make it impossible. If the lite node automatically reset and restarted when it hit this condition, for all practical purposes it would not be an issue.

telackey commented 4 years ago

(Note: even if we simply exited when we hit this condition, the KUBE's auto-restart logic would take care of the rest.)
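The exit-on-error idea above could look roughly like the sketch below. The helper name and the error-matching approach are assumptions for illustration; the real sync loop presumably has structured errors rather than string matching:

```go
package main

import (
	"fmt"
	"strings"
)

// isPrunedProofError reports whether an error message matches the
// "proof is unexpectedly empty" failure described in this issue.
// (Hypothetical helper; string matching stands in for whatever
// structured error the real ABCIQuery path returns.)
func isPrunedProofError(msg string) bool {
	return strings.Contains(msg, "proof is unexpectedly empty")
}

func main() {
	// Simulated error payload, copied from the logs above.
	err := fmt.Errorf(`{"codespace":"sdk","code":1,"message":"proof is unexpectedly empty; ensure height has not been pruned"}`)

	if isPrunedProofError(err.Error()) {
		// In the real daemon this would be os.Exit(1), letting the
		// KUBE's supervisor reset and restart the lite node.
		fmt.Println("unrecoverable pruning error detected; exiting so auto-restart can recover")
	}
}
```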

ashwinphatak commented 4 years ago

I'm pretty sure your node went offline sometime.

"Automatic reset" is wrong for reasons too lengthy to explain in full here (short version: attacks/malicious nodes), but the least evil option is to time out and exit if the lite node hasn't made progress in > N minutes.

https://github.com/wirelineio/testnet/issues/22