xerions / mariaex

Pure Elixir database driver for MariaDB / MySQL
Other
261 stars 89 forks source link

Periodically getting `FunctionClauseError` from `Mariaex.Messages.decode_msg(<<5>>, :running)` #151

Open myronmarston opened 7 years ago

myronmarston commented 7 years ago

I periodically get this in my logs:

** (FunctionClauseError) no function clause matching in Mariaex.Messages.decode_msg/2
    (mariaex) lib/mariaex/messages.ex:215: Mariaex.Messages.decode_msg(<<5>>, :running)
    (mariaex) lib/mariaex/messages.ex:209: Mariaex.Messages.decode/2
    (mariaex) lib/mariaex/protocol.ex:610: Mariaex.Protocol.msg_recv/3
    (mariaex) lib/mariaex/protocol.ex:217: Mariaex.Protocol.disconnect/2
    (db_connection) lib/db_connection/connection.ex:178: DBConnection.Connection.disconnect/2
    (connection) lib/connection.ex:767: Connection.disconnect/3
    (stdlib) gen_server.erl:615: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:681: :gen_server.handle_msg/5

=CRASH REPORT==== 16-Nov-2016::21:51:15 ===
  crasher:
    initial call: Elixir.DBConnection.Connection:init/1
    pid: <0.1844.0>
    registered_name: []
    exception exit: {function_clause,
                        [{'Elixir.Mariaex.Messages',decode_msg,
                             [<<5>>,running],
                             [{file,"lib/mariaex/messages.ex"},{line,215}]},
                         {'Elixir.Mariaex.Messages',decode,2,
                             [{file,"lib/mariaex/messages.ex"},{line,209}]},
                         {'Elixir.Mariaex.Protocol',msg_recv,3,
                             [{file,"lib/mariaex/protocol.ex"},{line,610}]},
                         {'Elixir.Mariaex.Protocol',disconnect,2,
                             [{file,"lib/mariaex/protocol.ex"},{line,217}]},
                         {'Elixir.DBConnection.Connection',disconnect,2,
                             [{file,"lib/db_connection/connection.ex"},
                              {line,178}]},
                         {'Elixir.Connection',disconnect,3,
                             [{file,"lib/connection.ex"},{line,767}]},
                         {gen_server,try_dispatch,4,
                             [{file,"gen_server.erl"},{line,615}]},
                         {gen_server,handle_msg,5,
                             [{file,"gen_server.erl"},{line,681}]}]}
      in function  gen_server:terminate/7 (gen_server.erl, line 814)
    ancestors: [<0.1841.0>,'Elixir.Delorean.Legion.Shard1Repo.Pool',
                  'Elixir.Delorean.Legion.Shard1Repo',
                  'Elixir.Delorean.Legion.Supervisor',<0.1822.0>]
    messages: []
    links: [<0.1840.0>,<0.1841.0>,#Port<0.5966>,#Port<0.1321426>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 9491627
  neighbours:

=SUPERVISOR REPORT==== 16-Nov-2016::21:51:15 ===
     Supervisor: {<0.1841.0>,poolboy_sup}
     Context:    child_terminated
     Reason:     {function_clause,
                     [{'Elixir.Mariaex.Messages',decode_msg,
                          [<<5>>,running],
                          [{file,"lib/mariaex/messages.ex"},{line,215}]},
                      {'Elixir.Mariaex.Messages',decode,2,
                          [{file,"lib/mariaex/messages.ex"},{line,209}]},
                      {'Elixir.Mariaex.Protocol',msg_recv,3,
                          [{file,"lib/mariaex/protocol.ex"},{line,610}]},
                      {'Elixir.Mariaex.Protocol',disconnect,2,
                          [{file,"lib/mariaex/protocol.ex"},{line,217}]},
                      {'Elixir.DBConnection.Connection',disconnect,2,
                          [{file,"lib/db_connection/connection.ex"},
                           {line,178}]},
                      {'Elixir.Connection',disconnect,3,
                          [{file,"lib/connection.ex"},{line,767}]},
                      {gen_server,try_dispatch,4,
                          [{file,"gen_server.erl"},{line,615}]},
                      {gen_server,handle_msg,5,
                          [{file,"gen_server.erl"},{line,681}]}]}
     Offender:   [{pid,<0.1844.0>},
                  {id,'Elixir.DBConnection.Poolboy.Worker'},
                  {mfargs,
                      {'Elixir.DBConnection.Poolboy.Worker',start_link,
                          undefined}},
                  {restart_type,temporary},
                  {shutdown,5000},
                  {child_type,worker}]

Looks like there's some message mariaex isn't able to handle?

myronmarston commented 7 years ago

We're also getting similar function clause errors with <<6>>:

Mariaex.Messages.decode_msg(<<6>>, :running)

These errors are happening fairly often (a few times per hour) and I'd love to see this fixed. I can pitch in, if guidance could be provided. I don't really know the MySQL protocol so I'm not sure what to do with this kind of message. Is it safe to add a decode_msg definition that ignores (or logs) unrecognized messages instead of crashing?

liveforeverx commented 7 years ago

Hi! @myronmarston Do you have any test case with a mock data, how to reproduce it or it really happen somewhere spontaneously during long run?

myronmarston commented 7 years ago

Unfortunately, I have no idea how to reproduce this. I've only ever seen it in our production logs, and it's not clear what queries were running when this happens.

liveforeverx commented 7 years ago

@myronmarston Can you check the logs, please and check, if the stacktraces always points to some disconnect routine ( Mariaex.Protocol.disconnect ) or there is something else involved?

myronmarston commented 7 years ago

Do you mean check to see if (mariaex) lib/mariaex/protocol.ex:217: Mariaex.Protocol.disconnect/2 is always in the stacktrace when this error occurs?

liveforeverx commented 7 years ago

@myronmarston For example, yes.

myronmarston commented 7 years ago

I looked through our prod logs and it looks like that line is always part of the stack trace.

fishcakez commented 7 years ago

Its not safe to do a socket recv in the disconnect callback of DBConnection unless there's a DBConnection.SojournError as the first argument. The client process might be using the socket or have data in the buffer that is part of a query.

driv3r commented 7 years ago

We have similar issues, @fishcakez any hints how to pin point the issue?

driv3r commented 7 years ago

After digging in a bit more I found out that we actually get one more error 2-5 seconds earlier always before this decode issue, something like:

Elixir.Mariaex.Error [tcp] `recv` failed with: :timeout 
    lib/ecto/adapters/sql.ex:444 Ecto.Adapters.SQL.execute_or_reset/7
    lib/ecto/repo/queryable.ex:130 Ecto.Repo.Queryable.execute/5
    lib/ecto/repo/queryable.ex:35 Ecto.Repo.Queryable.all/4
    timer.erl:166 :timer.tc/1
    lib/ex_statsd.ex:184 anonymous fn/4 in ExStatsD.timing/3
    lib/preload.ex:24 anonymous fn/2 in Preload.preload_store/2
    timer.erl:166 :timer.tc/1
    lib/ex_statsd.ex:184 anonymous fn/4 in ExStatsD.timing/3

So probably resolving this should resolve the decode issue right?

fishcakez commented 7 years ago

@driv3r the timeout issue is described in https://github.com/xerions/mariaex/issues/162 and disconnect recv is at https://github.com/xerions/mariaex/blob/f4fe6f6f5f746f0ac9476e94891faf2fc915170b/lib/mariaex/protocol.ex#L224

myronmarston commented 7 years ago

We are still getting this periodically (usually a time or two a day) and it would be great to see this get fixed. I often jump into open source projects we use to fix issues we face but I don't really understand how to fix this one. Anything I can do to help?

MonkeyIsNull commented 6 years ago

On our servers we get the timeout issue and the get decode_msg called with "*" and :running

logdump

** (FunctionClauseError) no function clause matching in Mariaex.Messages.decode_msg/2
    (mariaex) lib/mariaex/messages.ex:261: Mariaex.Messages.decode_msg("*", :running)
    (mariaex) lib/mariaex/messages.ex:225: Mariaex.Messages.decode/2
    (mariaex) lib/mariaex/protocol.ex:1097: Mariaex.Protocol.msg_decode/2
    (mariaex) lib/mariaex/protocol.ex:1065: Mariaex.Protocol.msg_recv/3
    (mariaex) lib/mariaex/protocol.ex:254: Mariaex.Protocol.disconnect/2
    (db_connection) lib/db_connection/connection.ex:178: DBConnection.Connection.disconnect/2
    (connection) lib/connection.ex:767: Connection.disconnect/3
    (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
Last message: {:timeout, #Reference<0.1092615215.1916010497.47981>, {DBConnection.Connection, #PID<0.13935.9>, 15000}}

I have no idea what that message means nor what the library is supposed to do at that point otherwise I would submit a pull request right this second, however, I will go look it up and see if I can contribute anything - especially since this problem dings us pretty hard every few days

lmarlow commented 6 years ago

We are also seeing this after a timeout talking to MySQL.

2018-01-17 21:40:52 =ERROR REPORT====
** Generic server <0.2091.0> terminating 
** Last message in was {timeout,#Ref<0.1184361718.207355906.33227>,{'Elixir.DBConnection.Connection',<0.2539.0>,120000}}
** When Server state == {'Elixir.Mariaex.Protocol',#{'__struct__' => 'Elixir.Mariaex.Protocol',binary_as => field_type_var_string,buffer => <<>>,cache => #Ref<0.1184361718.204341254.50865>,catch_eof => false,connection_id => <0.2091.0>,cursors => #{},deprecated_eof => false,lru_cache => {100,#Ref<0.1184361718.204341254.50866>},opts => [{socket_options,[]},{sock_type,tcp},{cache_size,100},{name,'Elixir.Sample.Repo.Pool'},{otp_app,sample_app},{repo,'Elixir.Sample.Repo'},{adapter,'Elixir.Ecto.Adapters.MySQL'},{username,<<"user">>},{password,<<"password">>},{database,<<"sample_db">>},{hostname,<<"127.0.0.1">>},{timeout,120000},{pool_size,320},{port,3306},{loggers,[{'Elixir.Ecto.LogEntry',log,[]},{'Elixir.Sample.Repo.Metrics',record_metric,[]}]},{pool_timeout,5000},{adapter,'Elixir.Ecto.Adapters.MySQL'},{username,{system,<<"DB_USER">>}},{password,{system,<<"DB_PASS">>}},{database,{system,<<"DB_NAME">>}},{hostname,{system,<<"DB_HOST">>}},{timeout,120000},{pool_size,{system,integer,<<"DB_POOL_SIZE">>,20}},{loggers,[{'Elixir.Ecto.LogEntry',log,[]},{'Elixir.Sample.Repo.Metrics',record_metric,[]}]},{pool,'Elixir.DBConnection.Poolboy'}],seqnum => 0,sock => {'Elixir.Mariaex.Connection.Tcp',#Port<0.4032>},ssl_conn_state => not_used,state => running,state_data => nil,timeout => 120000}}
** Reason for termination == 
** {function_clause,[{'Elixir.Mariaex.Messages',decode_msg,[<<18>>,running],[{file,"lib/mariaex/messages.ex"},{line,261}]},{'Elixir.Mariaex.Messages',decode,2,[{file,"lib/mariaex/messages.ex"},{line,225}]},{'Elixir.Mariaex.Protocol',msg_decode,2,[{file,"lib/mariaex/protocol.ex"},{line,1097}]},{'Elixir.Mariaex.Protocol',msg_recv,3,[{file,"lib/mariaex/protocol.ex"},{line,1065}]},{'Elixir.Mariaex.Protocol',disconnect,2,[{file,"lib/mariaex/protocol.ex"},{line,254}]},{'Elixir.DBConnection.Connection',disconnect,2,[{file,"lib/db_connection/connection.ex"},{line,178}]},{'Elixir.Connection',disconnect,3,[{file,"lib/connection.ex"},{line,767}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]}]}
2018-01-17 21:40:52 =CRASH REPORT====
  crasher:
    initial call: Elixir.DBConnection.Connection:init/1
    pid: <0.2091.0>
    registered_name: []
    exception error: {function_clause,[{'Elixir.Mariaex.Messages',decode_msg,[<<18>>,running],[{file,"lib/mariaex/messages.ex"},{line,261}]},{'Elixir.Mariaex.Messages',decode,2,[{file,"lib/mariaex/messages.ex"},{line,225}]},{'Elixir.Mariaex.Protocol',msg_decode,2,[{file,"lib/mariaex/protocol.ex"},{line,1097}]},{'Elixir.Mariaex.Protocol',msg_recv,3,[{file,"lib/mariaex/protocol.ex"},{line,1065}]},{'Elixir.Mariaex.Protocol',disconnect,2,[{file,"lib/mariaex/protocol.ex"},{line,254}]},{'Elixir.DBConnection.Connection',disconnect,2,[{file,"lib/db_connection/connection.ex"},{line,178}]},{'Elixir.Connection',disconnect,3,[{file,"lib/connection.ex"},{line,767}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]}]}
    ancestors: [<0.2037.0>,'Elixir.Sample.Repo.Pool','Elixir.Sample.Repo','Elixir.Sample.Supervisor',<0.2033.0>]
    message_queue_len: 1
    messages: [{'$gen_cast',{disconnect,#Ref<0.1184361718.206831620.161378>,#{'__exception__' => true,'__struct__' => 'Elixir.Mariaex.Error',action => <<"recv">>,connection_id => <0.2091.0>,mariadb => nil,message => nil,reason => timeout,tag => tcp},#{'__struct__' => 'Elixir.Mariaex.Protocol',binary_as => field_type_var_string,buffer => <<>>,cache => #Ref<0.1184361718.204341254.50865>,catch_eof => false,connection_id => <0.2091.0>,cursors => #{},deprecated_eof => false,lru_cache => {100,#Ref<0.1184361718.204341254.50866>},opts => [{socket_options,[]},{sock_type,tcp},{cache_size,100},{name,'Elixir.Sample.Repo.Pool'},{otp_app,sample_app},{repo,'Elixir.Sample.Repo'},{adapter,'Elixir.Ecto.Adapters.MySQL'},{username,<<"user">>},{password,<<"password">>},{database,<<"sample_db">>},{hostname,<<"127.0.0.1">>},{timeout,120000},{pool_size,320},{port,3306},{loggers,[{'Elixir.Ecto.LogEntry',log,[]},{'Elixir.Sample.Repo.Metrics',record_metric,[]}]},{pool_timeout,5000},{adapter,'Elixir.Ecto.Adapters.MySQL'},{username,{system,<<"DB_USER">>}},{password,{system,<<"DB_PASS">>}},{database,{system,<<"DB_NAME">>}},{hostname,{system,<<"DB_HOST">>}},{timeout,120000},{pool_size,{system,integer,<<"DB_POOL_SIZE">>,20}},{loggers,[{'Elixir.Ecto.LogEntry',log,[]},{'Elixir.Sample.Repo.Metrics',record_metric,[]}]},{pool,'Elixir.DBConnection.Poolboy'}],seqnum => 0,sock => {'Elixir.Mariaex.Connection.Tcp',#Port<0.4032>},ssl_conn_state => not_used,state => running,state_data => nil,timeout => 120000}}}]
    links: [<0.2036.0>,<0.2037.0>,#Port<0.4032>]
    dictionary: [{lru_cache,{100,#Ref<0.1184361718.204341254.50866>}},{cache,#Ref<0.1184361718.204341254.50865>}]
    trap_exit: false
    status: running
    heap_size: 2586
    stack_size: 27
    reductions: 38528
  neighbours:
2018-01-17 21:40:52 =SUPERVISOR REPORT====
     Supervisor: {<0.2037.0>,poolboy_sup}
     Context:    child_terminated
     Reason:     {function_clause,[{'Elixir.Mariaex.Messages',decode_msg,[<<18>>,running],[{file,"lib/mariaex/messages.ex"},{line,261}]},{'Elixir.Mariaex.Messages',decode,2,[{file,"lib/mariaex/messages.ex"},{line,225}]},{'Elixir.Mariaex.Protocol',msg_decode,2,[{file,"lib/mariaex/protocol.ex"},{line,1097}]},{'Elixir.Mariaex.Protocol',msg_recv,3,[{file,"lib/mariaex/protocol.ex"},{line,1065}]},{'Elixir.Mariaex.Protocol',disconnect,2,[{file,"lib/mariaex/protocol.ex"},{line,254}]},{'Elixir.DBConnection.Connection',disconnect,2,[{file,"lib/db_connection/connection.ex"},{line,178}]},{'Elixir.Connection',disconnect,3,[{file,"lib/connection.ex"},{line,767}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,616}]}]}
     Offender:   [{pid,<0.2091.0>},{id,'Elixir.DBConnection.Poolboy.Worker'},{mfargs,{'Elixir.DBConnection.Poolboy.Worker',start_link,undefined}},{restart_type,temporary},{shutdown,5000},{child_type,worker}]
michaelst commented 5 years ago

This error has been happening very frequently for us, 186 times in the last 24 hours. This is the stack trace we get: