whatyouhide / xandra

Fast, simple, and robust Cassandra/ScyllaDB driver for Elixir.
ISC License

Fix decoding multiple frames in a single envelope in native protocol v5 #368

Closed: whatyouhide closed this 4 months ago

whatyouhide commented 5 months ago

@jvf @harunzengin this PR can be a collaborative attempt at figuring out what the flying horse crab hell is going on with #356.

At the time of opening it, the PR only adds a test to reproduce the timeouts (which does reproduce them ~90% of the time in my experience) and some additional logging.

Most of the time, Xandra never receives the response frame for the request that times out. This is weird. I’m running this against a locally running Dockerized Cassandra, in case it helps.

Btw, I’m opening this because I won't have a ton of time to dedicate to this as I’m pretty busy at work, but I figured we can dig in together, especially after @jvf's fantastic reproducing steps and tests in #356 🙃

jvf commented 5 months ago

We will have a look.

jvf commented 5 months ago

I may have found something:

CASSANDRA_NATIVE_PROTOCOL=v3 mix test --only test:"test concurrent requests on a single connection"

and

CASSANDRA_NATIVE_PROTOCOL=v4 mix test --only test:"test concurrent requests on a single connection"

do not produce a failure, only

CASSANDRA_NATIVE_PROTOCOL=v5 mix test --only test:"test concurrent requests on a single connection"

does. Tested with up to max_requests = 100 (in test/xandra_test.exs:349). So this may be a problem with native protocol v5!

whatyouhide commented 5 months ago

Wooooah fantastic find!!!!

jvf commented 4 months ago

I started looking at the v5 implementation, but nothing jumped out at me. Since the workaround (forcing protocol_version: :v4) is sufficient for us, I did not get approval to investigate this further.
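
For anyone else who needs the same workaround, a minimal sketch of what forcing the protocol version might look like. The node address is an assumption for a local Dockerized Cassandra, and the start_link call is left commented out so the snippet runs without a database:

```elixir
# Connection options with the native protocol pinned to v4, so that
# protocol negotiation does not end up on v5.
# The node address below is an assumption; replace it with your own.
xandra_options = [
  nodes: ["127.0.0.1:9042"],
  protocol_version: :v4
]

# With Xandra available as a dependency and a node reachable, the options
# would be passed when starting the connection:
# {:ok, conn} = Xandra.start_link(xandra_options)

IO.inspect(xandra_options, label: "xandra options")
```

This only sidesteps the v5 code path; it does not explain the timeouts.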

whatyouhide commented 4 months ago

I've created https://issues.apache.org/jira/browse/CASSANDRA-19753 to see if this might be a C* issue.

whatyouhide commented 4 months ago

With the help of Sam in the JIRA issue, we figured out that the issue was that I screwed up decoding multiple frames in a single envelope in native protocol v5 🤦 My bad. I pushed fixes into this PR.
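
To illustrate the class of bug, here is a rough sketch (not the actual code from this PR) of decoding every message packed into one payload instead of stopping after the first. It assumes each message carries the usual 9-byte native protocol header (version, flags, 2-byte stream ID, opcode, 4-byte body length) and ignores the outer v5 layer with its checksums and optional compression. The module name, function names, and returned map shape are all hypothetical.

```elixir
defmodule MultiMessageSketch do
  @moduledoc """
  Hypothetical illustration only: splits a payload into all the messages
  it contains, rather than decoding just the first one.
  """

  # Returns {:ok, messages} when the payload ends on a message boundary,
  # or {:incomplete, messages, rest} when trailing bytes do not yet form
  # a whole message (the caller would keep `rest` and wait for more data).
  def decode_all(payload) when is_binary(payload), do: decode_all(payload, [])

  # Nothing left to decode: return the messages in arrival order.
  defp decode_all(<<>>, acc), do: {:ok, Enum.reverse(acc)}

  # One full message is available: 9-byte header, then `length` body bytes.
  # Recursing on `rest` is the important part. Dropping `rest` here would
  # silently lose every message after the first one.
  defp decode_all(
         <<_version, _flags, stream_id::16-signed, opcode, length::32,
           body::binary-size(length), rest::binary>>,
         acc
       ) do
    message = %{stream_id: stream_id, opcode: opcode, body: body}
    decode_all(rest, [message | acc])
  end

  # Fewer bytes than a whole message: hand back what was decoded so far.
  defp decode_all(rest, acc), do: {:incomplete, Enum.reverse(acc), rest}
end

# Usage: build two fake messages (opcode 8, made-up bodies) and decode both.
build = fn stream_id, body ->
  <<0x85, 0, stream_id::16, 8, byte_size(body)::32, body::binary>>
end

payload = build.(1, "first") <> build.(2, "second")
IO.inspect(MultiMessageSketch.decode_all(payload))
```

A decoder that handles only the first message would leave the callers waiting on the later stream IDs without a response, which is consistent with the timeouts seen under concurrent requests.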