Closed by whatyouhide 4 months ago
We will have a look.
I may have found something:
CASSANDRA_NATIVE_PROTOCOL=v3 mix test --only test:"test concurrent requests on a single connection"
and
CASSANDRA_NATIVE_PROTOCOL=v4 mix test --only test:"test concurrent requests on a single connection"
do not produce a failure, only
CASSANDRA_NATIVE_PROTOCOL=v5 mix test --only test:"test concurrent requests on a single connection"
does. Tested with up to max_requests = 100 (in test/xandra_test.exs:349). So this may be a problem with native protocol v5!
Wooooah fantastic find!!!!
I started looking at the v5 implementation, but nothing jumped out at me. Since the workaround (forcing protocol_version: :v4) is sufficient for us, I did not get the approval to investigate this further.
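For anyone else hitting this, the workaround mentioned above might look roughly like the following when starting a connection. This is only a sketch: the node address is an assumption (a local Cassandra on the default port), and the only part taken from this thread is the protocol_version: :v4 option.

```elixir
# Sketch only: assumes Xandra is a dependency and that Cassandra is
# reachable at 127.0.0.1:9042 (the address is an assumption, not from this thread).
{:ok, conn} =
  Xandra.start_link(
    nodes: ["127.0.0.1:9042"],
    # Pin the native protocol version to v4 instead of letting the
    # connection use v5.
    protocol_version: :v4
  )
```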
I've created https://issues.apache.org/jira/browse/CASSANDRA-19753 to see if this might be a C* issue.
With the help of Sam in the JIRA issue, we figured out that the issue was that I screwed up decoding multiple frames in a single envelope in native protocol v5 🤦 My bad. I pushed fixes into this PR.
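To illustrate the general class of bug (not the actual fix in this PR, and not Xandra's code), here is a minimal sketch of a decoder that keeps reading until a buffer is exhausted, rather than stopping after the first message. It assumes the 9-byte header layout from the native protocol spec (version, flags, signed 16-bit stream ID, opcode, 32-bit body length). The module name EnvelopeSketch and the map shape it returns are hypothetical.

```elixir
defmodule EnvelopeSketch do
  # Hypothetical module for illustration only; not part of Xandra.

  # Decodes every complete message found in `payload`.
  # Returns {decoded_messages, leftover_bytes}.
  def decode_all(payload) when is_binary(payload), do: decode_all(payload, [])

  # Header: version (1 byte), flags (1), stream ID (2, signed),
  # opcode (1), body size (4), then `body_size` bytes of body.
  defp decode_all(
         <<_version, _flags, stream_id::16-signed, opcode, body_size::32,
           body::binary-size(body_size), rest::binary>>,
         acc
       ) do
    # Keep going with `rest`: stopping here would drop every message after
    # the first, and the requests waiting on them would time out.
    decode_all(rest, [%{stream_id: stream_id, opcode: opcode, body: body} | acc])
  end

  # Nothing left, or only an incomplete message: hand the remainder back
  # so the caller can prepend it to the next chunk of bytes.
  defp decode_all(leftover, acc), do: {Enum.reverse(acc), leftover}
end

# Usage: two responses for different stream IDs arriving in one buffer.
first = <<0x85, 0, 1::16, 0x08, 2::32, "ab">>
second = <<0x85, 0, 2::16, 0x08, 3::32, "cde">>
{messages, leftover} = EnvelopeSketch.decode_all(first <> second)
```

With several requests in flight on one connection, responses are more likely to be batched together like this, which would fit the failure only showing up in the concurrent-requests test.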
@jvf @harunzengin this PR can be a collaborative attempt at figuring out what the flying horse crab hell is going on with #356.
At the time of opening it, the PR only adds a test to reproduce the timeouts (which does reproduce them ~90% of the time in my experience) and some additional logging.
Most of the time, Xandra does not receive the frame that times out. This is weird. I’m running this with a locally-running Dockerized Cassandra in case it helps.
Btw, I’m opening this because I won't have a ton of time to dedicate to this as I’m pretty busy at work, but I figured we can dig in together, especially after @jvf's fantastic reproducing steps and tests in #356 🙃