xerions / mariaex

Pure Elixir database driver for MariaDB / MySQL
Other
263 stars 89 forks source link

ssl recv timeout on mysql/percona 5.7 connection with TLS enabled #249

Open kalymero opened 5 years ago

kalymero commented 5 years ago

Environment

Description

Starting from OTP-21.2.0, when trying to connect to a mysql instance with TLS enabled from an Elixir/Ecto application the :ssl.recv() call fails due a timeout with the following stacktrace:

** (EXIT) an exception was raised:
--
** (Mariaex.Error) [ssl] `recv` failed with: :timeout
(ecto) lib/ecto/adapters/sql.ex:431: Ecto.Adapters.SQL.execute_and_cache/7
(ecto) lib/ecto/repo/queryable.ex:133: Ecto.Repo.Queryable.execute/5
(ecto) lib/ecto/repo/queryable.ex:37: Ecto.Repo.Queryable.all/4
(custom_lib) lib/data/cache.ex:nnn: Cache.seed_table/3
(elixir) lib/enum.ex:769: Enum."-each/2-lists^foreach/1-0-"/2
(elixir) lib/enum.ex:769: Enum.each/2
(custom_lib) lib/data/cache.ex:nn: Cache.init/1
(stdlib) gen_server.erl:374: :gen_server.init_it/2
(stdlib) gen_server.erl:342: :gen_server.init_it/6
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
** (Mix) Could not start application app: Application.start(:normal, []) returned an error: shutdown: failed to start child:

For the test we used Percona and Mysql 5.7 official docker images.

Please note that it works fine with older OTP versions (for instance OTP-21.1.4)

Also the Erlang layer looks good, it's possible to connect and query the db in pure Erlang

1> {ok, Pid} = mysql:start_link( [ {host, "localhost"}, {user, "root"}, {password, ""}, {database, "test_db"},  {ssl, [ {version, 'tlsv1.2'}, {verify, verify_none} ] } ] ).
{ok,<0.143.0>}
2> mysql:query(Pid, <<"SHOW DATABASES;">>, []).
{ok,[<<"Database">>],
[[<<"information_schema">>],
[<<"mysql">>],
[<<"performance_schema">>],
[<<"sys">>],
[<<"test_db">>]]}
3> mysql:query(Pid, <<"SELECT * FROM test_table">>, []).
{ok,[<<"id">>,<<"name">>],[[1,<<"it works!">>]]}

An educated guess is that in the new OTP versions some lib changed behaviour and it breaks the Mariaex flow

How to reproduce

  1. Start a Percona docker container with SSL enabled:

$ docker run --rm -d -p 3306:3306/tcp -e MYSQL_ALLOW_EMPTY_PASSWORD=yes percona:5.7 --require-secure-transport=ON

  1. Create an empty db

$ mysql -h 127.0.0.1 -u root -p -e "CREATE DATABASE test_db;"

  1. Try to connect to the 127.0.0.1:3306 socket with tlsv1.2 to the test_db database.

(you can also use mysql:5.7 image and tlsv1.1)

The easiest way is to create a new ecto project

$ mix phx.new . --app test_ecto --database mysql --no-webpack --no-html

configure Ecto

config :test_ecto, TestEcto.Repo,
  ssl: true,
  ssl_opts: [versions: [:"tlsv1.2"], ciphers: :ssl.cipher_suites(:all)],
  username: "root",
  password: "",
  database: "test_db",
  hostname: "localhost",
  pool_size: 10

run it

$ mix deps.get && iex -S mix

Erlang/OTP 21 [erts-10.2.3] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe] [dtrace]
--
 
Interactive Elixir (1.8.1) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> TestEcto.Repo.query!("select * from test;")
[error] Mariaex.Protocol (#PID<0.277.0>) disconnected: ** (DBConnection.ConnectionError) ssl recv: timeout
[error] Mariaex.Protocol (#PID<0.281.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.283.0> timed out because it queued and checked out the connection for longer than 15000ms
[error] Mariaex.Protocol (#PID<0.273.0>) disconnected: ** (DBConnection.ConnectionError) ssl recv: timeout
[error] Mariaex.Protocol (#PID<0.276.0>) disconnected: ** (DBConnection.ConnectionError) ssl recv: timeout
[error] Mariaex.Protocol (#PID<0.282.0>) disconnected: ** (DBConnection.ConnectionError) ssl recv: timeout
[error] Mariaex.Protocol (#PID<0.275.0>) disconnected: ** (DBConnection.ConnectionError) ssl recv: timeout
[error] Mariaex.Protocol (#PID<0.280.0>) disconnected: ** (DBConnection.ConnectionError) ssl recv: timeout
[error] Mariaex.Protocol (#PID<0.278.0>) disconnected: ** (DBConnection.ConnectionError) ssl recv: timeout
[error] Mariaex.Protocol (#PID<0.274.0>) disconnected: ** (DBConnection.ConnectionError) ssl recv: timeout
[error] Mariaex.Protocol (#PID<0.279.0>) disconnected: ** (DBConnection.ConnectionError) ssl recv: timeout
hauntedhost commented 5 years ago

I'm also experiencing this same issue. Downgrading to Erlang 21.1.4 worked.

burnison commented 5 years ago

After running a git bisect on OTP, it seems that this is related to a change in the SSL implementation:

6168cf2f5f8b5839b1a56ce870658d76faf3c22f is the first bad commit
commit 6168cf2f5f8b5839b1a56ce870658d76faf3c22f
Author: Ingela Anderton Andin <ingela@erlang.org>
Date:   Mon Nov 19 13:50:35 2018 +0100

    ssl: Internaly use {active, N}

    Make next_record an internal help function to next_event and avoid
    duplicate calls to tls_socket:setopts for setting the active option.

I have been able to restore original behaviour with the following change to mariaex:

diff --git a/lib/mariaex/protocol.ex b/lib/mariaex/protocol.ex
index 7b24c62..ecdfca9 100644
--- a/lib/mariaex/protocol.ex
+++ b/lib/mariaex/protocol.ex
@@ -311,7 +311,7 @@ defmodule Mariaex.Protocol do
   DBConnection callback
   """
   def checkout(%{buffer: :active_once, sock: {sock_mod, sock}} = s) do
-    case setopts(s, [active: :false], :active_once) do
+    case setopts(s, [active: 0], :active_once) do
       :ok                       -> sock_mod.recv_active(sock, 0, "") |> handle_recv_buffer(s)
       {:disconnect, _, _} = dis -> dis
     end

However, I am only passingly familiar with Elixir and even more unfamiliar with Erlang. As such, I do not know what implications this change may have. From what I can make of the code, it seems that this works for similar reasons as described in the line 814 up the upstream change: https://github.com/erlang/otp/commit/6168cf2f5f8b5839b1a56ce870658d76faf3c22f?diff=unified#diff-f053ce3c6b9ffc18f497e4bcbc9661f2R814.

Can any maintainers validate this would be a safe change? I'd be happy to turn it into a pull request.

burnison commented 5 years ago

Setting active, 0 seems to have worked by coincidence, and I have proposed a change in #253 that seems to "more-correctly" address the actual underlying issue. However, I am not sure that the proposed change retains the exact same semantics as the original code and a thorough review should be done before using or merging.