Open myronmarston opened 7 years ago
We're also getting similar function clause errors with <<6>>
:
Mariaex.Messages.decode_msg(<<6>>, :running)
These errors are happening fairly often (a few times per hour) and I'd love to see this fixed. I can pitch in, if guidance could be provided. I don't really know the MySQL protocol so I'm not sure what to do with this kind of message. Is it safe to add a decode_msg
definition that ignores (or logs) unrecognized messages instead of crashing?
Hi! @myronmarston Do you have any test case with a mock data, how to reproduce it or it really happen somewhere spontaneously during long run?
Unfortunately, I have no idea how to reproduce this. I've only ever seen it in our production logs, and it's not clear what queries were running when this happens.
@myronmarston Can you check the logs, please and check, if the stacktraces always points to some disconnect routine ( Mariaex.Protocol.disconnect
) or there is something else involved?
Do you mean check to see if (mariaex) lib/mariaex/protocol.ex:217: Mariaex.Protocol.disconnect/2
is always in the stacktrace when this error occurs?
@myronmarston For example, yes.
I looked through our prod logs and it looks like that line is always part of the stack trace.
Its not safe to do a socket recv in the disconnect callback of DBConnection unless there's a DBConnection.SojournError
as the first argument. The client process might be using the socket or have data in the buffer that is part of a query.
We have similar issues, @fishcakez any hints how to pin point the issue?
After digging in a bit more I found out that we actually get one more error 2-5 seconds earlier always before this decode issue, something like:
Elixir.Mariaex.Error [tcp] `recv` failed with: :timeout
lib/ecto/adapters/sql.ex:444 Ecto.Adapters.SQL.execute_or_reset/7
lib/ecto/repo/queryable.ex:130 Ecto.Repo.Queryable.execute/5
lib/ecto/repo/queryable.ex:35 Ecto.Repo.Queryable.all/4
timer.erl:166 :timer.tc/1
lib/ex_statsd.ex:184 anonymous fn/4 in ExStatsD.timing/3
lib/preload.ex:24 anonymous fn/2 in Preload.preload_store/2
timer.erl:166 :timer.tc/1
lib/ex_statsd.ex:184 anonymous fn/4 in ExStatsD.timing/3
So probably resolving this should resolve the decode issue right?
@driv3r the timeout issue is described in https://github.com/xerions/mariaex/issues/162 and disconnect recv is at https://github.com/xerions/mariaex/blob/f4fe6f6f5f746f0ac9476e94891faf2fc915170b/lib/mariaex/protocol.ex#L224
We are still getting this periodically (usually a time or two a day) and it would be great to see this get fixed. I often jump into open source projects we use to fix issues we face but I don't really understand how to fix this one. Anything I can do to help?
On our servers we get the timeout issue and the get decode_msg called with "*"
and :running
logdump
** (FunctionClauseError) no function clause matching in Mariaex.Messages.decode_msg/2
(mariaex) lib/mariaex/messages.ex:261: Mariaex.Messages.decode_msg("*", :running)
(mariaex) lib/mariaex/messages.ex:225: Mariaex.Messages.decode/2
(mariaex) lib/mariaex/protocol.ex:1097: Mariaex.Protocol.msg_decode/2
(mariaex) lib/mariaex/protocol.ex:1065: Mariaex.Protocol.msg_recv/3
(mariaex) lib/mariaex/protocol.ex:254: Mariaex.Protocol.disconnect/2
(db_connection) lib/db_connection/connection.ex:178: DBConnection.Connection.disconnect/2
(connection) lib/connection.ex:767: Connection.disconnect/3
(stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
Last message: {:timeout, #Reference<0.1092615215.1916010497.47981>, {DBConnection.Connection, #PID<0.13935.9>, 15000}}
I have no idea what that message means nor what the library is supposed to do at that point otherwise I would submit a pull request right this second, however, I will go look it up and see if I can contribute anything - especially since this problem dings us pretty hard every few days
We are also seeing this after a timeout talking to MySQL.
2018-01-17 21:40:52 =ERROR REPORT====
** Generic server <0.2091.0> terminating
** Last message in was {timeout,#Ref<0.1184361718.207355906.33227>,{'Elixir.DBConnection.Connection',<0.2539.0>,120000}}
** When Server state == {'Elixir.Mariaex.Protocol',#{'__struct__' => 'Elixir.Mariaex.Protocol',binary_as => field_type_var_string,buffer => <<>>,cache => #Ref<0.1184361718.204341254.50865>,catch_eof => false,connection_id => <0.2091.0>,cursors => #{},deprecated_eof => false,lru_cache => {100,#Ref<0.1184361718.204341254.50866>},opts => [{socket_options,[]},{sock_type,tcp},{cache_size,100},{name,'Elixir.Sample.Repo.Pool'},{otp_app,sample_app},{repo,'Elixir.Sample.Repo'},{adapter,'Elixir.Ecto.Adapters.MySQL'},{username,<<"user">>},{password,<<"password">>},{database,<<"sample_db">>},{hostname,<<"127.0.0.1">>},{timeout,120000},{pool_size,320},{port,3306},{loggers,[{'Elixir.Ecto.LogEntry',log,[]},{'Elixir.Sample.Repo.Metrics',record_metric,[]}]},{pool_timeout,5000},{adapter,'Elixir.Ecto.Adapters.MySQL'},{username,{system,<<"DB_USER">>}},{password,{system,<<"DB_PASS">>}},{database,{system,<<"DB_NAME">>}},{hostname,{system,<<"DB_HOST">>}},{timeout,120000},{pool_size,{system,integer,<<"DB_POOL_SIZE">>,20}},{loggers,[{'Elixir.Ecto.LogEntry',log,[]},{'Elixir.Sample.Repo.Metrics',record_metric,[]}]},{pool,'Elixir.DBConnection.Poolboy'}],seqnum => 0,sock => {'Elixir.Mariaex.Connection.Tcp',#Port<0.4032>},ssl_conn_state => not_used,state => running,state_data => nil,timeout => 120000}}
** Reason for termination ==
** {function_clause,[{'Elixir.Mariaex.Messages',decode_msg,[<<18>>,running],[{file,"lib/mariaex/messages.ex"},{line,261}]},{'Elixir.Mariaex.Messages',decode,2,[{file,"lib/mariaex/messages.ex"},{line,225}]},{'Elixir.Mariaex.Protocol',msg_decode,2,[{file,"lib/mariaex/protocol.ex"},{line,1097}]},{'Elixir.Mariaex.Protocol',msg_recv,3,[{file,"lib/mariaex/protocol.ex"},{line,1065}]},{'Elixir.Mariaex.Protocol',disconnect,2,[{file,"lib/mariaex/protocol.ex"},{line,254}]},{'Elixir.DBConnection.Connection',disconnect,2,[{file,"lib/db_connection/connection.ex"},{line,178}]},{'Elixir.Connection',disconnect,3,[{file,"lib/connection.ex"},{line,767}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]}]}
2018-01-17 21:40:52 =CRASH REPORT====
crasher:
initial call: Elixir.DBConnection.Connection:init/1
pid: <0.2091.0>
registered_name: []
exception error: {function_clause,[{'Elixir.Mariaex.Messages',decode_msg,[<<18>>,running],[{file,"lib/mariaex/messages.ex"},{line,261}]},{'Elixir.Mariaex.Messages',decode,2,[{file,"lib/mariaex/messages.ex"},{line,225}]},{'Elixir.Mariaex.Protocol',msg_decode,2,[{file,"lib/mariaex/protocol.ex"},{line,1097}]},{'Elixir.Mariaex.Protocol',msg_recv,3,[{file,"lib/mariaex/protocol.ex"},{line,1065}]},{'Elixir.Mariaex.Protocol',disconnect,2,[{file,"lib/mariaex/protocol.ex"},{line,254}]},{'Elixir.DBConnection.Connection',disconnect,2,[{file,"lib/db_connection/connection.ex"},{line,178}]},{'Elixir.Connection',disconnect,3,[{file,"lib/connection.ex"},{line,767}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]}]}
ancestors: [<0.2037.0>,'Elixir.Sample.Repo.Pool','Elixir.Sample.Repo','Elixir.Sample.Supervisor',<0.2033.0>]
message_queue_len: 1
messages: [{'$gen_cast',{disconnect,#Ref<0.1184361718.206831620.161378>,#{'__exception__' => true,'__struct__' => 'Elixir.Mariaex.Error',action => <<"recv">>,connection_id => <0.2091.0>,mariadb => nil,message => nil,reason => timeout,tag => tcp},#{'__struct__' => 'Elixir.Mariaex.Protocol',binary_as => field_type_var_string,buffer => <<>>,cache => #Ref<0.1184361718.204341254.50865>,catch_eof => false,connection_id => <0.2091.0>,cursors => #{},deprecated_eof => false,lru_cache => {100,#Ref<0.1184361718.204341254.50866>},opts => [{socket_options,[]},{sock_type,tcp},{cache_size,100},{name,'Elixir.Sample.Repo.Pool'},{otp_app,sample_app},{repo,'Elixir.Sample.Repo'},{adapter,'Elixir.Ecto.Adapters.MySQL'},{username,<<"user">>},{password,<<"password">>},{database,<<"sample_db">>},{hostname,<<"127.0.0.1">>},{timeout,120000},{pool_size,320},{port,3306},{loggers,[{'Elixir.Ecto.LogEntry',log,[]},{'Elixir.Sample.Repo.Metrics',record_metric,[]}]},{pool_timeout,5000},{adapter,'Elixir.Ecto.Adapters.MySQL'},{username,{system,<<"DB_USER">>}},{password,{system,<<"DB_PASS">>}},{database,{system,<<"DB_NAME">>}},{hostname,{system,<<"DB_HOST">>}},{timeout,120000},{pool_size,{system,integer,<<"DB_POOL_SIZE">>,20}},{loggers,[{'Elixir.Ecto.LogEntry',log,[]},{'Elixir.Sample.Repo.Metrics',record_metric,[]}]},{pool,'Elixir.DBConnection.Poolboy'}],seqnum => 0,sock => {'Elixir.Mariaex.Connection.Tcp',#Port<0.4032>},ssl_conn_state => not_used,state => running,state_data => nil,timeout => 120000}}}]
links: [<0.2036.0>,<0.2037.0>,#Port<0.4032>]
dictionary: [{lru_cache,{100,#Ref<0.1184361718.204341254.50866>}},{cache,#Ref<0.1184361718.204341254.50865>}]
trap_exit: false
status: running
heap_size: 2586
stack_size: 27
reductions: 38528
neighbours:
2018-01-17 21:40:52 =SUPERVISOR REPORT====
Supervisor: {<0.2037.0>,poolboy_sup}
Context: child_terminated
Reason: {function_clause,[{'Elixir.Mariaex.Messages',decode_msg,[<<18>>,running],[{file,"lib/mariaex/messages.ex"},{line,261}]},{'Elixir.Mariaex.Messages',decode,2,[{file,"lib/mariaex/messages.ex"},{line,225}]},{'Elixir.Mariaex.Protocol',msg_decode,2,[{file,"lib/mariaex/protocol.ex"},{line,1097}]},{'Elixir.Mariaex.Protocol',msg_recv,3,[{file,"lib/mariaex/protocol.ex"},{line,1065}]},{'Elixir.Mariaex.Protocol',disconnect,2,[{file,"lib/mariaex/protocol.ex"},{line,254}]},{'Elixir.DBConnection.Connection',disconnect,2,[{file,"lib/db_connection/connection.ex"},{line,178}]},{'Elixir.Connection',disconnect,3,[{file,"lib/connection.ex"},{line,767}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]}]}
Offender: [{pid,<0.2091.0>},{id,'Elixir.DBConnection.Poolboy.Worker'},{mfargs,{'Elixir.DBConnection.Poolboy.Worker',start_link,undefined}},{restart_type,temporary},{shutdown,5000},{child_type,worker}]
This error has been happening very frequently for us, 186 times in the last 24 hours. This is the stack trace we get:
I periodically get this in my logs:
Looks like there's some message mariaex isn't able to handle?