x-cubed / event-store-client

JS client library for connecting to Event Store over TCP/IP
The Unlicense

Error from long-running event handler #33

Open mikegibney opened 6 years ago

mikegibney commented 6 years ago

(Disclaimer: I'm not sure whether this error is coming from event-store-client, from Event Store itself, or from something else entirely that we haven't identified yet.)

We have an event handler that throws the following error when it runs for a sufficiently long time:

Error: write after end
    at writeAfterEnd (_stream_writable.js:166:12)
    at Socket.Writable.write (_stream_writable.js:217:5)
    at Socket.write (net.js:650:40)
    at Connection.sendMessage (/home/app/tableservice/node_modules/event-store-client/lib/connection.js:533:21)
    at receiveMessage (/home/app/tableservice/node_modules/event-store-client/lib/connection.js:160:22)
    at Socket.onData (/home/app/tableservice/node_modules/event-store-client/lib/connection.js:81:21)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:172:18)
    at Socket.Readable.push (_stream_readable.js:130:10)
    at TCP.onread (net.js:542:20)
/home/app/tableservice/tableEventStore/index.js:404
            throw err;
            ^

Error: write after end
    at writeAfterEnd (_stream_writable.js:166:12)
    at Socket.Writable.write (_stream_writable.js:217:5)
    at Socket.write (net.js:650:40)
    at Connection.sendMessage (/home/app/tableservice/node_modules/event-store-client/lib/connection.js:533:21)
    at receiveMessage (/home/app/tableservice/node_modules/event-store-client/lib/connection.js:160:22)
    at Socket.onData (/home/app/tableservice/node_modules/event-store-client/lib/connection.js:81:21)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:172:18)
    at Socket.Readable.push (_stream_readable.js:130:10)
    at TCP.onread (net.js:542:20)

If I restart the service running this event handler, it picks up where it died just fine and continues processing events.

Something that might be relevant - the event handler that appears to be the source of this error is subscribed to one stream. When it picks up a certain class of event, it opens up a different stream to read detailed events relating to the subscribed stream.

x-cubed commented 6 years ago

My suspicion is that Event Store is closing the connection due to a timeout. If you need to maintain a long-running connection to Event Store, it's necessary to send data occasionally for the server to consider the connection to be active. The easiest way to do this is to call the ping method on a regular basis. See #24 for another issue describing this scenario.
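
A rough sketch of that keep-alive (the connection settings and the 10-second interval here are placeholders for whatever your setup already uses):

    var client = require('event-store-client');

    var connection = new client.Connection({
        // your existing host, port, and credential settings go here
        onConnect: function () {
            // Ping every 10 seconds so the server sees traffic and
            // doesn't close the connection as idle.
            setInterval(function () {
                connection.sendPing(function () {
                    // empty callback; we only care that something was sent
                });
            }, 10 * 1000);
        }
    });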

mikegibney commented 6 years ago

Is there an idiomatic way of checking the current state of the connection, or would I need to track that myself in the onConnect/onError/onClose callbacks?

x-cubed commented 6 years ago

You would need to track that yourself from those events.

mikegibney commented 6 years ago

As near as we've been able to determine, this problem stems from the automated heartbeat messages that Event Store sends (adding the sendPing call helped, but the connection still eventually died). In event-store-client/lib/connection.js, in the receiveMessage method:

        // Handle the message
        if (command == Commands.HeartbeatRequest) {
            // Automatically respond to heartbeat requests
            instance.sendMessage(correlationId, Commands.HeartbeatResponse);

        }

For whatever reason, it seems that sometimes the connection can be closed between receiving a heartbeat request and sending the response. Is this something that we should be accounting for in our application?
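
For illustration, the kind of guard I mean would look roughly like this inside receiveMessage (a sketch only; "socket" is a stand-in for whatever connection.js actually calls its underlying net.Socket):

        // Sketch: skip the heartbeat reply if the socket has already been
        // closed, rather than triggering "write after end".
        if (command == Commands.HeartbeatRequest) {
            if (socket && socket.writable && !socket.destroyed) {
                instance.sendMessage(correlationId, Commands.HeartbeatResponse);
            }
        }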

mikegibney commented 6 years ago

And for any future visitors, this is roughly the ping logic we're using:

    var connected = false;

    connection = new client.Connection({
        // settings
        onConnect: function () {
            connected = true;
            setInterval(() => {
                if (connection && connected) {
                    connection.sendPing(
                        () => {} // empty callback to avoid "unknown correlation ID"
                    );
                }
            }, 10 * 1000);
            // rest of the connection logic
        },
        onError: function (err) {
            connected = false;
            // whatever error-handling logic
            throw err;
        },
        onClose: (err) => {
            connected = false;
        }
    });

x-cubed commented 6 years ago

Ok, if you suspect the heartbeat responses are the problem, then we'll need some more information to diagnose it.

I would recommend looking at the Event Store logs to see what it has to say about the connection (that will tell you whether it has closed the connection, or it thinks everything is okay). The other thing that would help would be a packet capture from Wireshark, so that we can see if it's a particular sequence of events that is causing the problem, or if there's something wrong with the heartbeat response packets themselves.

mikegibney commented 6 years ago

Here's the relevant bit from /var/log/eventstore/:

[PID:09313:025 2018.03.27 15:28:05.991 INFO  TcpConnectionManager] Connection 'external-normal' [10.10.0.32:54834, {e4fc4784-93a0-4111-b8bb-d037ce0d8708}] closed: Success.
[PID:09313:031 2018.03.27 15:28:05.992 DEBUG PersistentSubscripti] Lost connection from 10.10.0.32:54834
[PID:09313:032 2018.03.27 15:28:05.998 INFO  TcpConnection       ] ES TcpConnection closed [15:28:05.998: N10.10.0.32:54836, L10.10.0.34:1113, {a1e065d1-662f-429a-8a12-80ef35b81915}]:Received bytes: 106, Sent bytes: 16141
[PID:09313:032 2018.03.27 15:28:05.999 INFO  TcpConnection       ] ES TcpConnection closed [15:28:05.999: N10.10.0.32:54836, L10.10.0.34:1113, {a1e065d1-662f-429a-8a12-80ef35b81915}]:Send calls: 1, callbacks: 1
[PID:09313:032 2018.03.27 15:28:06.000 INFO  TcpConnection       ] ES TcpConnection closed [15:28:06.000: N10.10.0.32:54836, L10.10.0.34:1113, {a1e065d1-662f-429a-8a12-80ef35b81915}]:Receive calls: 2, callbacks: 2
[PID:09313:032 2018.03.27 15:28:06.000 INFO  TcpConnection       ] ES TcpConnection closed [15:28:06.000: N10.10.0.32:54836, L10.10.0.34:1113, {a1e065d1-662f-429a-8a12-80ef35b81915}]:Close reason: [Success] Socket closed
[PID:09313:031 2018.03.27 15:28:06.001 INFO  TcpService          ] External TCP connection accepted: [Normal, 10.10.0.32:54838, L10.10.0.34:1113, {d2c0aaee-e55a-49b5-883d-e517f673618a}].
[PID:09313:032 2018.03.27 15:28:06.003 INFO  TcpConnectionManager] Connection 'external-normal' [10.10.0.32:54836, {a1e065d1-662f-429a-8a12-80ef35b81915}] closed: Success.
[PID:09313:021 2018.03.27 15:28:06.004 DEBUG PersistentSubscripti] Lost connection from 10.10.0.32:54836
[PID:09313:019 2018.03.27 15:28:09.891 TRACE TcpConnectionManager] Closing connection 'external-normal' [10.10.0.32:54504, L10.10.0.34:1113, {0b40a3c5-8a8e-448f-b9aa-932d2b7ac315}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 7
[PID:09313:019 2018.03.27 15:28:09.892 INFO  TcpConnection       ] ES TcpConnection closed [15:28:09.892: N10.10.0.32:54504, L10.10.0.34:1113, {0b40a3c5-8a8e-448f-b9aa-932d2b7ac315}]:Received bytes: 382, Sent bytes: 759
[PID:09313:019 2018.03.27 15:28:09.893 INFO  TcpConnection       ] ES TcpConnection closed [15:28:09.893: N10.10.0.32:54504, L10.10.0.34:1113, {0b40a3c5-8a8e-448f-b9aa-932d2b7ac315}]:Send calls: 8, callbacks: 8
[PID:09313:019 2018.03.27 15:28:09.893 INFO  TcpConnection       ] ES TcpConnection closed [15:28:09.893: N10.10.0.32:54504, L10.10.0.34:1113, {0b40a3c5-8a8e-448f-b9aa-932d2b7ac315}]:Receive calls: 8, callbacks: 7
[PID:09313:019 2018.03.27 15:28:09.894 INFO  TcpConnection       ] ES TcpConnection closed [15:28:09.894: N10.10.0.32:54504, L10.10.0.34:1113, {0b40a3c5-8a8e-448f-b9aa-932d2b7ac315}]:Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 7
[PID:09313:019 2018.03.27 15:28:09.895 INFO  TcpConnectionManager] Connection 'external-normal' [10.10.0.32:54504, {0b40a3c5-8a8e-448f-b9aa-932d2b7ac315}] closed: Success.
[PID:09313:017 2018.03.27 15:28:09.896 DEBUG PersistentSubscripti] Lost connection from 10.10.0.32:54504
[PID:09313:021 2018.03.27 15:28:10.242 INFO  TcpConnection       ] ES TcpConnection closed [15:28:10.242: N10.10.0.32:54838, L10.10.0.34:1113, {d2c0aaee-e55a-49b5-883d-e517f673618a}]:Received bytes: 106, Sent bytes: 16230
[PID:09313:021 2018.03.27 15:28:10.243 INFO  TcpConnection       ] ES TcpConnection closed [15:28:10.243: N10.10.0.32:54838, L10.10.0.34:1113, {d2c0aaee-e55a-49b5-883d-e517f673618a}]:Send calls: 2, callbacks: 2
[PID:09313:021 2018.03.27 15:28:10.244 INFO  TcpConnection       ] ES TcpConnection closed [15:28:10.244: N10.10.0.32:54838, L10.10.0.34:1113, {d2c0aaee-e55a-49b5-883d-e517f673618a}]:Receive calls: 2, callbacks: 2
[PID:09313:021 2018.03.27 15:28:10.244 INFO  TcpConnection       ] ES TcpConnection closed [15:28:10.244: N10.10.0.32:54838, L10.10.0.34:1113, {d2c0aaee-e55a-49b5-883d-e517f673618a}]:Close reason: [Success] Socket closed
[PID:09313:021 2018.03.27 15:28:10.245 INFO  TcpConnectionManager] Connection 'external-normal' [10.10.0.32:54838, {d2c0aaee-e55a-49b5-883d-e517f673618a}] closed: Success.
[PID:09313:026 2018.03.27 15:28:10.246 DEBUG PersistentSubscripti] Lost connection from 10.10.0.32:54838
[PID:09313:027 2018.03.27 15:36:46.456 TRACE InMemoryBus         ] SLOW BUS MSG [manager input bus]: GetStatistics - 54ms. Handler: ProjectionManager.
[PID:09313:027 2018.03.27 15:36:46.457 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [Projections Master]: GetStatistics - 54ms. Q: 0/0.
[PID:09313:027 2018.03.27 15:48:10.475 TRACE InMemoryBus         ] SLOW BUS MSG [manager input bus]: GetStatistics - 75ms. Handler: ProjectionManager.
[PID:09313:027 2018.03.27 15:48:10.476 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [Projections Master]: GetStatistics - 76ms. Q: 0/0.

The exact timestamp of the error from the container was 2018-03-27T15:28:10.225626957Z. 10.10.0.32 is the IP of the consuming service. Still working on getting a packet capture.

mikegibney commented 6 years ago

Sorry, I lost the thread on this; I had to deal with a couple of other issues that popped up. What would you want from a packet capture? This is running on an Ubuntu server, so it looks like I'd be using tshark or tcpdump, but if I'm being completely honest, I'm in way over my head with this. I'm far from a TCP expert, so I'm not even sure what I'm looking for or how to go about getting it.

x-cubed commented 6 years ago

So the Event Store logs are suggesting that we didn't reply to the Heartbeat request with a Heartbeat response. A packet capture should be able to tell us if that is the case, and if not, might identify what the confusion is.

All we need to capture is the TCP data on port 1113 when this issue is happening. I'm no expert on tcpdump (I usually use Wireshark myself), but it looks like tcpdump port 1113 -w capture.pcap should capture the data we need and save it to a file that you can then attach to this issue.

mikegibney commented 6 years ago

After a certain degree of bumbling and false starts, I have successfully obtained a packet capture... which is chock full of proprietary data that I don't think I can share. :/ Is there an easy way to extract just the heartbeat packets?

x-cubed commented 6 years ago

The heartbeat packets by themselves probably aren't that useful; it's more useful to see the full stream of packets, in case there's something wrong with the order in which they're being sent.

Three possible approaches from here:

mikegibney commented 6 years ago

Okay, I managed to randomize/obfuscate the sensitive data with relative ease. Let me know if this is what you need.

capture.zip

x-cubed commented 6 years ago

Ok, there's definitely something wrong with the heartbeat responses. Here's a list of packets from your capture showing the server requesting a heartbeat while the client carries on regardless, without sending a heartbeat response:

[Screenshot "heartbeatrequestbutnoresponse": packet list showing HeartbeatRequest packets from the server with no HeartbeatResponse from the client]

x-cubed commented 6 years ago

So the interesting thing to note there is that it does respond to the heartbeat request in packet 81343, but then fails to respond later to the second request in 81567. That would suggest that the receive logic is not handling incoming messages after the ReadStreamEventsForwardCompleted is received.

You could try enabling the debug flag on the connection to get log output from the client, to see if it recognizes the second HeartbeatRequest.
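
Something along these lines (a minimal sketch; everything except the debug option stays as you already have it):

    connection = new client.Connection({
        // existing host, port, credentials, and callbacks stay as they are
        debug: true // log the protocol messages the client sends and receives
    });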