The same happens with VolatileTx disabled:
{"resolved":[1721744218000,0]}
{"update":{"v":"xSoIAAAAAOTFKggAAAAA5MUqCAAAAADkxSoIAAAAAOTFKggAAAAA"},"key":[3458764513821096247],"ts":[1721744215418,844425147299155]}
https://github.com/ydb-platform/ydb/pull/7051/files works for the case of disabled volatile transactions.
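For reference, a minimal sketch of the check that flags such violations in a changefeed log. This is not the actual Jepsen checker; the JSON-lines format and the lexicographic ordering of [millis, counter] pairs are assumptions based on the excerpts in this issue:

```python
#!/usr/bin/env python3
# Sketch of a changefeed log validator (JSON-lines format as above).
# Assumed invariants: resolved timestamps never decrease, and every
# update carries a ts strictly greater than any prior resolved message.
import json
import sys

def check(lines):
    last_resolved = None  # largest resolved timestamp emitted so far
    for n, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        if "resolved" in msg:
            r = tuple(msg["resolved"])
            if last_resolved is not None and r < last_resolved:
                print(f"line {n}: resolved went backwards: {r} < {last_resolved}")
            last_resolved = r if last_resolved is None else max(last_resolved, r)
        elif "update" in msg:
            ts = tuple(msg["ts"])
            if last_resolved is not None and ts <= last_resolved:
                print(f"line {n}: update ts {ts} at or below resolved {last_resolved}")

if __name__ == "__main__":
    check(sys.stdin)
```

Run against the two-line excerpt above, it flags the update: its ts [1721744215418,...] is below the already-emitted resolved [1721744218000,0].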
I tried adding a changefeed to the Jepsen tables and validating the resulting log, and the issue is easy to see. It reproduces both with and without volatile transactions. It looks somehow related to splits (more splits -> better chance to reproduce), but the exact mechanism is unclear so far.
Reproduced with a build that logs per-shard heartbeat messages (89d27b20ac5081734898c740846e17c91a30c5fd):
https://gist.github.com/snaury/9245b47b09e7f6c943bfeea134c0111b
When 72075186224037898 split into 72075186224037901 and 72075186224037902, both child shards started streaming their updates and heartbeats. However, while the pq partition clearly waited for new heartbeats from the second shard before emitting, the second shard appears to be ignored when the resolved value is actually produced:
{"shard_id":72075186224037898,"resolved_private":[1725549507000,0]}
{"resolved":[1725549507000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549507000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549508000,0]}
...
{"shard_id":72075186224037901,"resolved_private":[1725549514000,0]}
{"shard_id":72075186224037902,"resolved_private":[1725549507000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549515000,0]}
{"shard_id":72075186224037902,"resolved_private":[1725549508000,0]}
{"resolved":[1725549515000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549516000,0]}
{"shard_id":72075186224037902,"resolved_private":[1725549509000,0]}
{"resolved":[1725549516000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549517000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549518000,0]}
{"shard_id":72075186224037902,"resolved_private":[1725549510000,0]}
{"resolved":[1725549518000,0]}
It is very suspicious that a new resolved message is always generated right after a new resolved_private from 72075186224037902, yet the emitted value is the last resolved_private from 72075186224037901 rather than the minimum across both shards. This kept happening after 72075186224037902 itself split into 72075186224037903 and 72075186224037904.
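To illustrate with the values above: assuming the partition-level resolved timestamp is supposed to be the minimum of the latest resolved_private across the partition's shards (an assumption about the intended semantics, not the actual YDB code), the log does not match:

```python
# Latest resolved_private per shard just before the emitted
# {"resolved":[1725549516000,0]} in the excerpt above:
latest = {
    72075186224037901: (1725549516000, 0),
    72075186224037902: (1725549509000, 0),
}

# Expected: the partition-level resolved is the minimum across shards.
print(min(latest.values()))  # (1725549509000, 0)

# Observed: {"resolved":[1725549516000,0]} -- the value of shard ...901
# alone, as if ...902 merely triggers emission but is excluded from
# the minimum.
```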
It gets even weirder when it backtracks:
{"resolved":[1725549546000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549547000,0]}
{"shard_id":72075186224037903,"resolved_private":[1725549548000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549548000,0]}
{"shard_id":72075186224037904,"resolved_private":[1725549534000,0]}
{"resolved":[1725549548000,0]}
{"shard_id":72075186224037904,"resolved_private":[1725549535000,0]}
{"resolved":[1725549547000,0]}
Notice how it first produced [1725549548000,0], but then a new heartbeat from 72075186224037904 appeared and it produced [1725549547000,0], a previous value!
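Until the root cause is fixed, a consumer could at least drop regressed resolved messages defensively. A hypothetical sketch (monotonic_resolved is not part of any YDB API); note it only restores monotonicity for downstream logic, not correctness, since the earlier, higher value was the erroneous one and updates below it may still arrive:

```python
def monotonic_resolved(messages):
    """Pass through changefeed messages, dropping any resolved message
    that is below the highest resolved seen so far (hypothetical helper)."""
    watermark = None
    for msg in messages:
        if "resolved" in msg:
            r = tuple(msg["resolved"])
            if watermark is not None and r < watermark:
                # e.g. drops [1725549547000,0] arriving after [1725549548000,0]
                continue
            watermark = r
        yield msg
```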
{"resolved":[1721729882000,0]} {"resolved":[1721729924000,0]} {"resolved":[1721729955000,0]} {"update":{"v":"1MkPAAAAAEjUyQ8AAAAASNTJDwAAAABI1MkPAAAAAA=="},"key":[4899916394580082192],"ts":[1721729950551,281475194075732]} {"update":{"v":"wqMAAAAAAJDCowAAAAAA"},"key":[4899916394579553682],"ts":[1721729950556,281475194075747]} {"resolved":[1721729985000,0]} {"resolved":[1721730001000,0]}