zookzook / elixir-mongodb-driver

MongoDB driver for Elixir
Apache License 2.0

Transactions issue on sharded cluster #61

Closed: AlexKovalevych closed this issue 4 years ago

AlexKovalevych commented 4 years ago

We use MongoDB transactions following your example here, but on a sharded cluster we constantly get errors like:

[error: %Mongo.Error{code: 251, host: nil, message: "cannot continue txnId -1 for session 91bfeae1-beb3-4413-aad7-0db4e173900f - VDMmLO0tnJxxSRigNiYTZC14rDR7jUqPYfaCjQVq164= with txnId 1"}]

The only mention of this error is on the MongoDB site: https://docs.mongodb.com/manual/core/transactions-production-consideration/#use-of-mongodb-4-0-drivers.

Do you have any idea what the problem could be? Is the driver not ready for sharded transactions yet?

zookzook commented 4 years ago

Yes, it seems there is still a little work to do to support transactions in sharded deployments.

https://github.com/mongodb/specifications/blob/master/source/transactions/transactions.rst#sharded-transactions

zookzook commented 4 years ago

Could you provide some source code that reproduces the error?

AlexKovalevych commented 4 years ago

Unfortunately, I wasn't able to reproduce it locally, and now I understand why: it needs a sharded cluster, not a regular one. It probably won't be too hard to reproduce now, though. Our transaction fails even if it contains a single insert (but many such transactions run at the same moment due to load).

zookzook commented 4 years ago

I created a sharded cluster with two mongos:

mlaunch --replicaset --sharded 3 --mongos 2
launching: "mongod" on port 27019
launching: "mongod" on port 27020
launching: "mongod" on port 27021
launching: "mongod" on port 27022
launching: "mongod" on port 27023
launching: "mongod" on port 27024
launching: "mongod" on port 27025
launching: "mongod" on port 27026
launching: "mongod" on port 27027
launching: config server on port 27028
replica set 'configRepl' initialized.
replica set 'shard01' initialized.
replica set 'shard02' initialized.
replica set 'shard03' initialized.
launching: mongos on port 27017
launching: mongos on port 27018
adding shards. can take up to 30 seconds...

and executed the following inserts:

alias Mongo.Session

# Connect to both mongos instances directly
{:ok, top} = Mongo.start_link(url: "mongodb://localhost:27017,localhost:27018/test-db")

# Create the collection before starting the transaction
Mongo.create(top, "dogs")

# Pass the session opts to every operation so they all run in the same transaction
{:ok, ids} = Session.with_transaction(top, fn opts ->
  {:ok, _} = Mongo.insert_one(top, "dogs", %{name: "Greta"}, opts)
  {:ok, _} = Mongo.insert_one(top, "dogs", %{name: "Waldo"}, opts)
  {:ok, _} = Mongo.insert_one(top, "dogs", %{name: "Tom"}, opts)
  {:ok, :ok}
end)
{:ok, :ok}
iex(8)> Mongo.find(top, "dogs", %{}) |> Enum.to_list()
[
  %{"_id" => #BSON.ObjectId<5ec8edce306a5f06073709c5>, "name" => "Greta"},
  %{"_id" => #BSON.ObjectId<5ec8edce306a5f06073709c6>, "name" => "Waldo"},
  %{"_id" => #BSON.ObjectId<5ec8edce306a5f06073709c7>, "name" => "Tom"}
]

So I cannot reproduce the error. Could you share some source code that reproduces it?

AlexKovalevych commented 4 years ago

I did pretty much the same, except that I created a sharded collection and ran about 30 transactions in parallel. I couldn't reproduce it either. I'll give you more information next week, since I don't have access to the environment where it's reproducible over the weekend.
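For readers who want to try a similar parallel run, here is a minimal sketch. It is not the code used in this thread. `ParallelTxSketch` and `run_tx` are hypothetical names, and the helper only fans out a caller-supplied function and tallies the results, so it runs without a database.

```elixir
defmodule ParallelTxSketch do
  # Hypothetical helper: runs `count` calls of `run_tx` concurrently
  # and counts how many returned {:ok, _} versus anything else.
  def run(count, run_tx) do
    1..count
    |> Task.async_stream(run_tx, max_concurrency: count, timeout: 30_000)
    |> Enum.reduce(%{ok: 0, error: 0}, fn
      {:ok, {:ok, _}}, acc -> Map.update!(acc, :ok, &(&1 + 1))
      {:ok, _other}, acc -> Map.update!(acc, :error, &(&1 + 1))
    end)
  end
end

# With the driver, assuming `top` is the connection from the example above,
# each task could wrap one transaction:
#
# ParallelTxSketch.run(30, fn i ->
#   Mongo.Session.with_transaction(top, fn opts ->
#     {:ok, _} = Mongo.insert_one(top, "dogs", %{name: "dog-#{i}"}, opts)
#     {:ok, i}
#   end)
# end)
```

The commented-out usage needs a running cluster. The tally makes it easy to see whether any of the parallel transactions failed.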

AlexKovalevych commented 4 years ago

We can confirm that the issue happens with multiple mongos instances (6; we also tried 2) deployed in k8s, using MongoDB 4.2 with elixir-mongodb-driver 0.7.0, on the first transaction we run in the cluster. After we scaled down to a single mongos and restarted the applications, the issue was gone. I couldn't reproduce it locally with mlaunch, though. So it looks exactly like what the note on the MongoDB site describes: the driver seems to be missing something related to MongoDB 4.2.

zookzook commented 4 years ago

Is there a load-balancer used in front of the mongos?

AlexKovalevych commented 4 years ago

Correct, there is a load balancer in front of the mongos.

zookzook commented 4 years ago

That is the problem!

I think the mongos used during the transaction changed partway through, and that is why you got this error message. The requirement behind it is called mongos pinning: once you start a transaction, all of its requests must be sent to the same mongos server where it was started.

So you don't need to load-balance the connection from the driver, because the driver contains its own (random) balancing code. Just specify your mongos servers in the connection URL and everything should be fine.
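To illustrate why a load balancer breaks this, here is a toy model of pinning in plain Elixir. It does not talk to MongoDB or use the driver. `PinningSketch`, the mongos names, and the `:unknown_transaction` error tuple are all made up for the sketch, which assumes the transaction has at least one statement.

```elixir
defmodule PinningSketch do
  # Toy model, no real MongoDB involved.
  # `route` maps a statement index to the mongos that receives it.
  # The first statement starts the transaction on whichever mongos it reaches;
  # every later statement must reach that same mongos.
  def run_transaction(statement_count, route) do
    pinned = route.(0)

    0..(statement_count - 1)
    |> Enum.drop(1)
    |> Enum.reduce_while(:ok, fn index, :ok ->
      case route.(index) do
        ^pinned -> {:cont, :ok}
        other -> {:halt, {:error, {:unknown_transaction, other}}}
      end
    end)
  end
end

# A load balancer that alternates between two mongos instances
round_robin = fn index -> Enum.at([:mongos_a, :mongos_b], rem(index, 2)) end

# A driver that keeps the whole transaction on the mongos it picked first
pinned = fn _index -> :mongos_a end

IO.inspect(PinningSketch.run_transaction(3, round_robin))
IO.inspect(PinningSketch.run_transaction(3, pinned))
```

In the round-robin case the second statement lands on a mongos that never saw the transaction start. That is roughly the situation behind the code 251 error above. Listing the mongos hosts directly in the connection URL, as in the `Mongo.start_link` call earlier in the thread, lets the driver keep each transaction on one mongos.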

AlexKovalevych commented 4 years ago

You're right, that was it! Thank you, the problem is solved.

zookzook commented 4 years ago

You're welcome!