xmppo / node-xmpp-bosh

An XMPP BOSH & WebSocket server (connection manager) written on node.js using Javascript
https://github.com/xmppo/node-xmpp-bosh
263 stars 85 forks source link

Unhandled 'error' event #76

Open valeriansaliou opened 11 years ago

valeriansaliou commented 11 years ago

Hello,

I upgraded Jappix.com to latest NXB v0.7.3 from NPM + node v0.6.12, and it appears it is sometimes crashing with the following bug (repeated 2 times there):

Error: 3072534224:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:s23_clnt.c:741:

    at CleartextStream._puller (tls.js:501:24)
    at CleartextStream._pull (tls.js:432:19)
    at SecurePair.cycle (tls.js:727:20)
    at EncryptedStream.write (tls.js:130:13)
    at Socket.ondata (stream.js:38:26)
    at Socket.emit (events.js:67:17)
    at TCP.onread (net.js:372:14)

events.js:48
        throw arguments[1]; // Unhandled 'error' event
                       ^
Error: 3072566992:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:s23_clnt.c:741:

    at CleartextStream._puller (tls.js:501:24)
    at CleartextStream._pull (tls.js:432:19)
    at SecurePair.cycle (tls.js:727:20)
    at EncryptedStream.write (tls.js:130:13)
    at Socket.ondata (stream.js:38:26)
    at Socket.emit (events.js:67:17)
    at TCP.onread (net.js:372:14)
dhruvbird commented 11 years ago

@valeriansaliou Any idea which host(s) this happened for (do you have debug logs enabled)? It seems to be a TLS connection to the upstream xmpp server that is causing problems.

dhruvbird commented 11 years ago

@valeriansaliou By any chance, did you update openssl on that box? https://bugs.launchpad.net/ubuntu/+source/openssl/+bug/986147

valeriansaliou commented 11 years ago

This cannot happen with jappix.com and anonymous.jappix.com since they are defined as "non-TLS hosts" in NXB configuration file. I can enable debug logs but I'm afraid the event we're looking for will get lost in a junk of messages.

About OpenSSL, I'm running OpenSSL 1.0.1e 11 Feb 2013 from Debian 7 Wheezy official repos. System is up-to-date.

dhruvbird commented 11 years ago

Sorry, so I just realized that node.js comes bundles with openssl and it uses its own version rather than the system default.

Can you try running (from within the node.js_source_dir/deps/openss/)

node-v0.6.17/deps/openssl$ cat README.chromium  | grep Version
dhruvbird commented 11 years ago

Will try and send you a patch that enables logging only for TLS initiation. That might narrow the problem down.

dhruvbird commented 11 years ago

@valeriansaliou Can you try running https://github.com/dhruvbird/node-xmpp-bosh/tree/log_TLS_initiation with --logging=WARN and let me know what you see.

valeriansaliou commented 11 years ago

Thanks Dhruv! BTW, I'm running node v0.6.21 (latest one of the 0.6 branch).

OpenSSL version is 0.9.8o (result from the cat you gave me on my current node sources).

I'll run this with WARN logging, the log result may take up to a few hours since it doesn't crash a lot (3-4 times a day only - at very random intervals).

valeriansaliou commented 11 years ago

One side question: I'm using NPM, any way to update/reinstall NXB from NPM using this log_TLS_initiation branch? (just as ROR does with Gems that are directly fetched from Github).

dhruvbird commented 11 years ago

So, you can install everything normally from npm, and just apply the diff using this command:

$ cd node-xmpp-bosh-git-directory
$ git pull --rebase
$ git checkout -b log_TLS_initiation --track origin/log_TLS_initiation
$ git diff master # Will show you the changes

It's just a single line change, so you can copy the file over in your npm install directory (i.e. wherever npm has installed NXB).

valeriansaliou commented 11 years ago

Done. I'll report you the WARN trace when I'll get one. Thanks ;)

BTW, not in the topic but is that normal that the BOSH send me back this:

<body xmlns="http://jabber.org/protocol/httpbind" message="Timed out"/>

When the wait maximum time has come? I mean, I was getting a blank message with the previous NXB version I was using:

<body xmlns="http://jabber.org/protocol/httpbind">
dhruvbird commented 11 years ago

@valeriansaliou Thanks for reporting this actually! :)

And yes, we added "message" to some of the responses for better debugability so that when we see logs, it's easier to see what's been happening.

valeriansaliou commented 11 years ago

Okay, that's fine. An user of our server thought something went wrong and asked me why he always got this.

BTW, here's the crash log I got 12 minutes ago (crashed 2 times in 14 hours):

Restoring old listeners
Restore::event: connect
Remove all listeners for event: connect
Restore::event: error
Remove all listeners for event: error
Restore::event: timeout
Remove all listeners for event: timeout
[2013-05-15 14:20:28.355] [WARN] [xmpp-proxy.js:XMPPProxy._starttls:143] - startTLS to billard-aktuell.de requested
Restoring old listeners
Restore::event: connect
Remove all listeners for event: connect
Restore::event: error
Remove all listeners for event: error
Restore::event: timeout
Remove all listeners for event: timeout
[2013-05-15 14:20:31.178] [WARN] [xmpp-proxy.js:XMPPProxy._starttls:143] - startTLS to billard-aktuell.de requested
Restoring old listeners
Restore::event: connect
Remove all listeners for event: connect
Restore::event: error
Remove all listeners for event: error
Restore::event: timeout
Remove all listeners for event: timeout
[2013-05-15 14:20:36.637] [WARN] [xmpp-proxy.js:XMPPProxy._starttls:143] - startTLS to billard-aktuell.de requested
+----------------------------------------------------------------------------------------------------------------------+
| Starting BOSH server 'v0.7.3' on 'http://:::5280/^\/http-bind(\/+)?$/' at 'Wed May 15 2013 16:21:02 GMT+0200 (CEST)' |
+----------------------------------------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------------------+
| Starting WEBSOCKET server 'v0.7.3' on ws://:::5280' at 'Wed May 15 2013 16:21:02 GMT+0200 (CEST)' |
+---------------------------------------------------------------------------------------------------+
Restoring old listeners
Restore::event: connect
Remove all listeners for event: connect
Restore::event: error
Remove all listeners for event: error
Restore::event: timeout
Remove all listeners for event: timeout
Restoring old listeners
Restore::event: connect
Remove all listeners for event: connect

And the associated error:

events.js:48
        throw arguments[1]; // Unhandled 'error' event
                       ^
Error: 3072362192:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:s23_clnt.c:741:

    at CleartextStream._puller (tls.js:501:24)
    at CleartextStream._pull (tls.js:432:19)
    at SecurePair.cycle (tls.js:727:20)
    at EncryptedStream.write (tls.js:130:13)
    at Socket.ondata (stream.js:38:26)
    at Socket.emit (events.js:67:17)
    at TCP.onread (net.js:372:14)

The crash occurred exactly on 2013-05-15 14:20:36.637.

dhruvbird commented 11 years ago

@valeriansaliou I tried reproducing the crash by connecting to billard-aktuell.de by connecting to it in a loop and doing a starttls, but both node-v0.6.17 and node-v0.6.21 seem to not crash. Do you know if this crash happens only for this host?

valeriansaliou commented 11 years ago

@dhruvbird not sure it only happens on this host, but I have checked my logs and it seems to be recurrent (checked 3 different crashes, and the same domain comes out everytime).

Also, the crash actually happens one time over a hundred. I have many TLS logs for this host, but not that many crash after the TLS log.

dhruvbird commented 11 years ago

@valeriansaliou I've pushed out a commit to try and see if the TLS module is emitting an error event after emitting a close event (that could explain why the error event is unhandled). You should be able to apply it on top of current npm (v0.7.4). Commit is on branch: https://github.com/dhruvbird/node-xmpp-bosh/tree/log_TLS_initiation

valeriansaliou commented 11 years ago

Logging started. I will report you subsequent crashes. ;)

valeriansaliou commented 11 years ago

Got this now:

From bosh.err:

events.js:48
        throw arguments[1]; // Unhandled 'error' event
                       ^
Error: 3072407248:error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error:s23_clnt.c:741:

    at CleartextStream._puller (tls.js:501:24)
    at CleartextStream._pull (tls.js:432:19)
    at SecurePair.cycle (tls.js:727:20)
    at EncryptedStream.write (tls.js:130:13)
    at Socket.ondata (stream.js:38:26)
    at Socket.emit (events.js:67:17)
    at TCP.onread (net.js:372:14)

From bosh.log:

starttls::onclose() -> 547
starttls::onclose() -> 550
starttls::onclose() -> 416
starttls::onclose() -> 552
+----------------------------------------------------------------------------------------------------------------------+
| Starting BOSH server 'v0.7.4' on 'http://:::5280/^\/http-bind(\/+)?$/' at 'Sun May 19 2013 13:21:01 GMT+0200 (CEST)' |
+----------------------------------------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------------------+
| Starting WEBSOCKET server 'v0.7.4' on ws://:::5280' at 'Sun May 19 2013 13:21:01 GMT+0200 (CEST)' |
+---------------------------------------------------------------------------------------------------+
starttls::onclose() -> 24
starttls::onclose() -> 25
starttls::onclose() -> 27

I also got some WARN that I assume are normal (apart from the crash, those WARN came maybe half an hour before it happened, unrelated):

starttls::onclose() -> 68
starttls::onclose() -> 69
starttls::onclose() -> 70
[2013-05-19 09:24:29.337] [WARN] [lookup-service.js:give_up_trying_to_connect:118] - Giving up connection attempts to caronte@jappix.com
starttls::onclose() -> 71
starttls::onclose() -> 72
[2013-05-19 09:24:36.061] [WARN] [lookup-service.js:give_up_trying_to_connect:118] - Giving up connection attempts to caronte@jappix.com
starttls::onclose() -> 73
starttls::onclose() -> 74
[2013-05-19 09:25:08.118] [WARN] [lookup-service.js:give_up_trying_to_connect:118] - Giving up connection attempts to hotmail.com
starttls::onclose() -> 63
[2013-05-19 09:25:31.320] [WARN] [lookup-service.js:give_up_trying_to_connect:118] - Giving up connection attempts to caronte@jappix.com
starttls::onclose() -> 75
starttls::onclose() -> 76
starttls::onclose() -> 77
[2013-05-19 09:25:57.199] [WARN] [lookup-service.js:give_up_trying_to_connect:118] - Giving up connection attempts to https://jappix.com/
starttls::onclose() -> 78
starttls::onclose() -> 79
dhruvbird commented 11 years ago

@valeriansaliou I see a lot of onclose() calls, but no onerror() calls. Were any onerror() calls logged for this duration?

valeriansaliou commented 11 years ago

Yeah, actually there's one:

starttls::onerror() -> 29
valeriansaliou commented 11 years ago

Check your emails BTW ;)

valeriansaliou commented 11 years ago

After restarting my XMPPd today, I noticed NXB crashed just some milliseconds after the restart, the first time in 3 days it did not crash.

Same error log as before.

I do think this is only due to an unhandled error type on socket drop/close, that may have been introduced in nodejs 0.6.21 (well, the last one from 0.6 branch, not sure it is 21). Adding the error handler should do the trick, right?

dhruvbird commented 11 years ago

I tried adding the error handler in the patch I sent, but it doesn't seem to help. Will need some more investigation.

valeriansaliou commented 11 years ago

Oh, I updated NXB to v0.7.4 in between. I was thinking you did not change anything else than logging. Sorry for the misunderstanding.

I'm re-adding your changes to my running v0.7.4, and keep you up-to-date in a week about whether it crashed or not.

dhruvbird commented 11 years ago

Link to node.js group post: https://groups.google.com/forum/?fromgroups#!topic/nodejs/5XxHhuia-lw

valeriansaliou commented 11 years ago

Mhh so you suggest I give a try to nodejs v0.10?

I got some strange issues with my proxy server on front of NXB when first testing with node v0.10 but this may be linked to the fact I was changing a lot of things on the server, so I might have messed up my mind. The issue was that I started to get blank proxied pages after a while. Not a 500 error, blank pages. But anyway I can retry.

Will node v0.10 be more clear on the error message or fix the whole issue?

dhruvbird commented 11 years ago

Okay, so there were some bug-fixes on an external starttls module that my very old code didn't have. I've pushed those changes here: https://github.com/dhruvbird/node-xmpp-bosh/tree/log_TLS_initiation Please could you run from that branch and let me know if that fixes the crash?

valeriansaliou commented 11 years ago

Thanks @dhruvbird ;)

I have updated Node to v0.10.8 (it's said it's 2 times faster on I/O than the 0.6 branch I was using). NXB from log_TLS_initiation branch is now running, I'll report you any further error.

dhruvbird commented 11 years ago

@valeriansaliou Cool - were you able to it to work for ipv6 addresses?

valeriansaliou commented 11 years ago

Absolutely. A telnet on IPv6 :5280 can be opened and my lighttpd frontend proxies on local loopback IPv6 sockets anyway ;)

valeriansaliou commented 11 years ago

Mhh, crashed after an hour (running very well on latest Node version before crash):

From bosh.err:

node: ../src/nodecrypto.cc:975: void node::crypto::Connection::ClearError(): Assertion `handle->Get(String::New("error"))->BooleanValue() == false' failed.

It seem to be related to the error we previously had.

This time, the crash followed an overactivity kick from myself. I generated a lot of traffic on my account session, getting a lot of vCards at the same time, and should have been kicked. Instead NXB crashed.

Note that it's really weird I got something coming from node_crypto.css, which is a TLS lib as I understand, but I was not using a TLS c2s connection since connections to my local loopback are set to be plain text (in NXB config file, no TLS domains).

dhruvbird commented 11 years ago

@valeriansaliou The error seems helpful. I'll post on google groups and see if something comes out of it.

Are you sure the config file was being used? nxb prints out the config. options in debug (or maybe trace) mode.

valeriansaliou commented 11 years ago

@dhruvbird great. I'm in FATAL for now, but config is actually used, otherwise NXB won't be listening on an IPv6 socket ;)

valeriansaliou commented 11 years ago

Downgrading to v0.8.23, v0.10.x is causing blank pages after w/o making NXB crash (and no error log). I assume this is an HTTP server bug.

valeriansaliou commented 11 years ago

Now with v0.8.23 it looks like I don't get blank page issue, but still crashing. Crashes gives that error trace log:

assert.js:102 throw new assert.AssertionError({ ^ AssertionError: handle != self._handle at Object.afterConnect as oncomplete

Seems to match v0.10 error log.

By the way, I got some WARNs during NXB install using NPM, that some piece of third party software were deprecated:

npm WARN unmet dependency /usr/local/lib/node_modules/npm/node_modules/fstream-npm/node_modules/fstream-> ignore requires inherits@'~1.0.0' but will load npm WARN unmet dependency /usr/local/lib/node_modules/npm/node_modules/fstream-npm/node_modules/inherits, npm WARN unmet dependency which is version 2.0.0

Otherwise the rest seems to be okay. Related to a stream package, may be related to the crash?