wkalt / dp3

multimodal log database

nondeterministic failure when querying concurrently with big import #11

Closed wkalt closed 6 months ago

wkalt commented 7 months ago

I am able to cause a failure that logs this:

2024/04/05 19:45:08 ERROR Internal server error msg="error getting messages: failed to get messages: failed to load iterators: failed to get iterators: failed to get next message: failed to get next leaf: failed to get node 17858078595780745765:15437249478555840127:10845002079556675049: node 17858078595780745765:15437249478555840127:10845002079556675049 not found" request_id=5d4ebdbb-2fca-498a-8137-b20b2c5e359c

That node ID looks random, so either the node serialization logic is not overwriting all temporary addresses, or the bytes we are interpreting as an ID are misaligned. Better logging would also show which version we are working on here, which should allow us to inspect the correct tree state.

This seems to happen rarely, and subsequent requests (concurrent with newer inserts) succeed.

Edit: I forgot to provide instructions. To reproduce this, I run a concurrent import of all of my MCAP data:

./dp3 import --producer my-robot ~/data/**/*.mcap --workers 16

while repeatedly starting and killing requests to export all topics:

./dp3 export --producer my-robot --json

wkalt commented 6 months ago

current example of the error:

2024/04/17 13:37:46 ERROR Internal server error msg="error getting messages: failed to get messages: failed to load iterators: failed to get iterators: failed to get next message from root 5000000036:0:374: failed to get next leaf: failed to get node 2290813043743012272:9948196836361821987:8848029873248386146: node default/!()!image_raw/my-robot/2290813043743012272:9948196836361821987:8848029873248386146 not found" request_id=c0b009ac-ed4e-4ea2-837a-b3490b8a0caf
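
In this log the root ID (5000000036:0:374) has small fields, while every field of the missing node's ID is a huge number, which fits the earlier guess that the ID is a leftover temporary address or misaligned bytes. A rough sketch of a check that flags such IDs follows. It assumes only that an ID prints as three colon-separated unsigned 64-bit integers, as in the logs above; the names `parseNodeID` and `looksRandom` and the 2^48 threshold are hypothetical, and the meaning of the individual fields is not assumed.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNodeID splits an ID as printed in the logs ("a:b:c") into three
// uint64 fields. The fields are treated as opaque numbers.
func parseNodeID(s string) ([3]uint64, error) {
	var id [3]uint64
	parts := strings.Split(s, ":")
	if len(parts) != 3 {
		return id, fmt.Errorf("expected 3 fields, got %d", len(parts))
	}
	for i, p := range parts {
		v, err := strconv.ParseUint(p, 10, 64)
		if err != nil {
			return id, fmt.Errorf("field %d: %w", i, err)
		}
		id[i] = v
	}
	return id, nil
}

// looksRandom is a heuristic only: it reports true when every field is at
// least 2^48, which is unlikely for real counters, offsets, or lengths.
func looksRandom(id [3]uint64) bool {
	const threshold = 1 << 48 // assumed cutoff, not a dp3 constant
	for _, v := range id {
		if v < threshold {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{
		"5000000036:0:374", // root ID from the log
		"2290813043743012272:9948196836361821987:8848029873248386146", // missing node
	} {
		id, err := parseNodeID(s)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s looksRandom=%v\n", s, looksRandom(id))
	}
}
```

A check like this could run wherever node IDs are read back, so a suspicious ID is logged together with the root it was reached from.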

wkalt commented 6 months ago

https://github.com/wkalt/dp3/pull/18