Consumer is executing message infinitely if for some reason it handles a message it's not the owner

Soulou commented 6 years ago

From time to time, we're experiencing some weird behavior with our NSQ consumers. It might be related to load pike, we haven't found a way to reproduce it 100% sure, but the issue appears once every week at least in our infrastructure.

2018-05-28 10:18:33.363289812 +0200 CEST [nsqworker-2] account - 09b9e490041f8000 - starting
2018-05-28 10:18:33.383662927 +0200 CEST [nsqworker-2] account - 09b9e490041f8000 - done  - 0.023424769s
2018-05-28 10:19:16.799691171 +0200 CEST [nsqworker-2] W, [2018-05-28T08:19:16.793795 #15]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 10:19:17.069883413 +0200 CEST [nsqworker-2] E, [2018-05-28T08:19:17.065925 #15] ERROR -- : [host: <host1> port: 4150]  Error received: E_FIN_FAILED FIN 09b9e490041f8000 failed client does not own message

So we can see here that between the execution and the FIN message, an error occurs concerning the NSQd connection, then the FIN fails.

It's what's happening in a loop, as after this, the message (with the same ID) is executed again by the process, and fails to FIN again, etc.

When I look at the logging history about <host1>, I get this:

2018-05-28 05:19:58.395189938 +0200 CEST [nsqworker-2] W, [2018-05-28T03:19:58.386943 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:20:58.460665543 +0200 CEST [nsqworker-2] W, [2018-05-28T03:20:58.457603 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:21:58.546145107 +0200 CEST [nsqworker-2] W, [2018-05-28T03:21:58.545515 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:22:58.625993464 +0200 CEST [nsqworker-2] W, [2018-05-28T03:22:58.621311 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:23:58.741414566 +0200 CEST [nsqworker-2] W, [2018-05-28T03:23:58.733896 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:24:58.835791494 +0200 CEST [nsqworker-2] W, [2018-05-28T03:24:58.829582 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:25:58.932305585 +0200 CEST [nsqworker-2] W, [2018-05-28T03:25:58.921822 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:26:59.021718841 +0200 CEST [nsqworker-2] W, [2018-05-28T03:26:59.017355 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:27:59.121631901 +0200 CEST [nsqworker-2] W, [2018-05-28T03:27:59.113451 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:28:59.221418436 +0200 CEST [nsqworker-2] W, [2018-05-28T03:28:59.209866 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:29:59.330347765 +0200 CEST [nsqworker-2] W, [2018-05-28T03:29:59.327889 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:30:59.410571465 +0200 CEST [nsqworker-2] W, [2018-05-28T03:30:59.409541 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:31:59.545008079 +0200 CEST [nsqworker-2] W, [2018-05-28T03:31:59.543487 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:32:59.773488847 +0200 CEST [nsqworker-2] W, [2018-05-28T03:32:59.770228 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket
2018-05-28 05:33:59.877373181 +0200 CEST [nsqworker-2] W, [2018-05-28T03:33:59.871865 #16]  WARN -- : [host: <host1> port: 4150]  Died from: No data from socket

So as you can see it's like if an old broken connection was kept in the pool.

On the nsqd part I have:

[nsqd] 2018/05/28 03:26:59.033853 [10.127.0.116:56708] IDENTIFY: {ShortID: LongID: ClientID:nsqworker-2 Hostname:nsqworker-2 HeartbeatInterval:30000 OutputBufferSize:0 OutputBufferTimeout:250 FeatureNegotiation:true TLSv1:true Deflate:false DeflateLevel:0 Snappy:false SampleRate:0 UserAgent:nsq-ruby/2.2.0 MsgTimeout:60000}

[nsqd] 2018/05/28 03:27:59.101820 PROTOCOL(V2): [10.127.0.116:56708] exiting ioloop
[nsqd] 2018/05/28 03:27:59.101925 ERROR: client(10.127.0.116:56708) - failed to read command - read tcp 172.17.0.3:4150->10.127.0.116:56708: i/o timeout
[nsqd] 2018/05/28 03:27:59.101940 PROTOCOL(V2): [10.127.0.116:56708] exiting messagePump

Is the timeout coming from the consumer not responding? But whatever is the issue, I don't sure that executing a message in a loop is a good behavior.

Thanks a lot for your insights

bschwartz commented 6 years ago

@Soulou Nothing is jumping out at me. I'm grasping at straws a bit here but are you using max_attempts (see: https://github.com/wistia/nsq-ruby/commit/5461c9844dc5daeb3d5551767ee21a8d27ce03f9). Maybe that's causing it to try to fin the message over and over?

In terms of getting the Died from: No data from socket over and over, is your thought that it's trying to repeatedly fin the message and that's raising an exception in the read_loop (https://github.com/wistia/nsq-ruby/blob/8774b0a8a06389f3664e86d1d5263807a51aca9d/lib/nsq/connection.rb#L312) which then triggers a reconnect?

Sorry I'm not more helpful here. I suppose that forking nsq-ruby and adding some more logging is the way to go. If you want to add more debugging level logging, happy to have that in the main repo too in case it's helpful to others.

Soulou commented 6 years ago

Hi @bschwartz, forking and adding logging is what we've started to do. We're going to check concerning max_attempts, but I don't think it's infinite.

Concerning the Died from I've actually no real idea of the source of it, all I know is that the nsqd instance is alive and healthy, and that some golang producer/consumer don't have these issues, so that's why I'm thinking there is something wrong in here, I just don't know what yet :-)

wistia / nsq-ruby

Consumer is executing message infinitely if for some reason it handles a message it's not the owner #47