yfinkelstein / node-zookeeper

node.js client for Apache Zookeeper
MIT License
479 stars 111 forks source link

Node crashes if timeout (rc: -7) is thrown during init() #262

Open sirmonin opened 3 years ago

sirmonin commented 3 years ago

Describe the bug If timeout is reached (rc: -7) during the first init() call, I receive the error below and app crashes with code: 139.

Assertion failed: loop->watchers[w->fd] == w

I am sure that app doesn't use Zookeeper instance until there is a 'connect' event (which is never received).

To Reproduce 1) Initialize Zookeeper via .init() or .connect() with a timeout set to a tiny value. E.g. 5ms. (This will produce the same effect as on our server, where timeout is set to 10000ms.) 2) See error.

Error Log

2021-01-11 16:58:46,945:14539:ZOO_INFO@log_env@1102: Client environment:zookeeper.version=zookeeper C client 3.5.8
2021-01-11 16:58:46,945:14539:ZOO_INFO@log_env@1106: Client environment:host.name=project
2021-01-11 16:58:46,945:14539:ZOO_INFO@log_env@1113: Client environment:os.name=Linux
2021-01-11 16:58:46,945:14539:ZOO_INFO@log_env@1114: Client environment:os.arch=4.19.121-linuxkit
2021-01-11 16:58:46,946:14539:ZOO_INFO@log_env@1115: Client environment:os.version=#1 SMP Tue Dec 1 17:50:32 UTC 2020
2021-01-11 16:58:46,946:14539:ZOO_INFO@log_env@1123: Client environment:user.name=(null)
2021-01-11 16:58:46,946:14539:ZOO_INFO@log_env@1131: Client environment:user.home=/home/node
2021-01-11 16:58:46,946:14539:ZOO_INFO@log_env@1143: Client environment:user.dir=/srv
2021-01-11 16:58:46,946:14539:ZOO_INFO@zookeeper_init_internal@1177: Initiating client connection, host=zookeeper:2181 sessionTimeout=5 watcher=0x7f988ef61490 sessionId=0 sessionPasswd=<null> context=0x5623b4960d20 flags=0
2021-01-11 16:58:46,947:14539:ZOO_DEBUG@get_next_server_in_reconfig@1330: [OLD] count=0 capacity=0 next=0 hasnext=0
2021-01-11 16:58:46,947:14539:ZOO_DEBUG@get_next_server_in_reconfig@1333: [NEW] count=1 capacity=16 next=0 hasnext=1
2021-01-11 16:58:46,947:14539:ZOO_DEBUG@get_next_server_in_reconfig@1344: Using next from NEW=172.18.0.2:2181
2021-01-11 16:58:46,947:14539:ZOO_DEBUG@zookeeper_connect@2232: [zk] connect()
Interest in (fd=19, read=true, write=true, timeout=1)
2021-01-11 16:58:46,957:14539:ZOO_WARN@zookeeper_interest@2284: Exceeded deadline by 9ms
zk_timer_cb fired
ping timer went off
2021-01-11 16:58:46,957:14539:ZOO_ERROR@handle_socket_error_msg@2386: Socket [172.18.0.2:2181] zk retcode=-7, errno=110(Operation timed out): connection to 172.18.0.2:2181 timed out (exceeded timeout by 7ms)
2021-01-11 16:58:46,958:14539:ZOO_DEBUG@cleanup@1785: Previous connection=[172.18.0.2:2181] delay=0
yield:zookeeper_interest returned error: -7 - operation timeout
zk_timer_cb fired
ping timer went off
2021-01-11 16:58:46,959:14539:ZOO_WARN@zookeeper_interest@2284: Exceeded deadline by 11ms
2021-01-11 16:58:46,960:14539:ZOO_DEBUG@get_next_server_in_reconfig@1330: [OLD] count=0 capacity=0 next=0 hasnext=0
2021-01-11 16:58:46,960:14539:ZOO_DEBUG@get_next_server_in_reconfig@1333: [NEW] count=1 capacity=16 next=1 hasnext=0
2021-01-11 16:58:46,960:14539:ZOO_DEBUG@get_next_server_in_reconfig@1355: Failed to find either new or old
2021-01-11 16:58:46,960:14539:ZOO_DEBUG@zookeeper_connect@2232: [zk] connect()
Assertion failed: loop->watchers[w->fd] == w (../deps/uv/src/unix/core.c: uv__io_stop: 932)

Expected behavior Timeout doesn't crash node app and closes client connection, so I could recreate client instance and retry to connect.

Desktop (please complete the following information):

DavidVujic commented 3 years ago

Thank you for reporting this @sirmonin!

DavidVujic commented 3 years ago

Could be related to https://github.com/yfinkelstein/node-zookeeper/issues/238

DavidVujic commented 3 years ago

Hi @sirmonin!

To find out what could be wrong:

sirmonin commented 3 years ago

@DavidVujic, thanks for a quick reply!

node-zookeeper: Tested both on 4.6.0 and 4.7.1 Zookeeper: I am using Docker image bitnami/zookeeper:latest which is 3.6.2

DavidVujic commented 3 years ago

Great, thanks!

sirmonin commented 3 years ago

@DavidVujic It is also observed, that timeout with crashing happens while application is running under some load.

DavidVujic commented 3 years ago

Thanks!

I'm planning to look at this later this week.

Notes: I think there is need for some extra uv_closing (or something similar) checks here: https://github.com/yfinkelstein/node-zookeeper/blob/master/src/node-zk.cpp#L265

As hinted in this StackOverflow post: https://stackoverflow.com/questions/47522854/main-src-unix-core-c117-uv-close-assertion-uv-is-closinghandle-failed

If you, or if you know someone with C and/or C++ skills, please feel free to jump in and help 😄 It is much appreciated.

DavidVujic commented 3 years ago

Just a short update: I'm working on this issue and am trying different approaches on how to solve it. I also believe I can reproduce the assertion error using the code in the examples folder and restart a docker container with a zookeeper server running.

DavidVujic commented 3 years ago

Hello again @sirmonin!

I've been working on solving this issue, but cannot really find where and how to prevent the assertion error to happen.

What I've found out is that it most likely is when watchers are attached. Earlier today I pushed new versions of the code in the examples folder, making sure only one client is used, and function calls is made only when the client is connected.

And that has made a difference: on Mac OS X I get no assertion error. But on Linux I get the error - only occasionally. Haven't found a pattern.

The way I've been doing is: starting a zookeeper server hosted in a docker container on my local machine. starting the examples with node examples/index.js running docker restart <container-id>

Before the changes in the example code, I got assertion error every time. Now, only randomly it seems and only on Linux.

I have never got the error you are describing (with short timeouts).

Could you help me by sharing another code example, or if you can try the example code described above?

Here's a diff of the changes I have made: https://github.com/yfinkelstein/node-zookeeper/pull/264/files

sirmonin commented 3 years ago

Hi @DavidVujic! Thanks a lot for time and effort you've spent on this issue.

My zookeeper connect method looks like this. Timeout is set to 5ms to simulate timeout error.

public connect = () => {
        const host = this.config.get('ZOOKEEPER_HOST');
        const port = this.config.get('ZOOKEEPER_PORT');
        this.zkConfig = {
            connect: `${host}:${port}`,
            timeout: 5,
            debug_level: ZooKeeper.ZOO_LOG_LEVEL_DEBUG,
            host_order_deterministic: false,
        };

        this.client = new ZooKeeper(this.zkConfig);

        this.client.init(this.zkConfig);
    };

This produces error logs:

2021-02-01 17:34:32,821:139:ZOO_INFO@log_env@1098: Client environment:zookeeper.version=zookeeper C client 3.5.5
2021-02-01 17:34:32,821:139:ZOO_INFO@log_env@1102: Client environment:host.name=test
2021-02-01 17:34:32,821:139:ZOO_INFO@log_env@1109: Client environment:os.name=Linux
2021-02-01 17:34:32,821:139:ZOO_INFO@log_env@1110: Client environment:os.arch=4.19.121-linuxkit
2021-02-01 17:34:32,821:139:ZOO_INFO@log_env@1111: Client environment:os.version=#1 SMP Tue Dec 1 17:50:32 UTC 2020
2021-02-01 17:34:32,821:139:ZOO_INFO@log_env@1119: Client environment:user.name=(null)
2021-02-01 17:34:32,821:139:ZOO_INFO@log_env@1127: Client environment:user.home=/home/node
2021-02-01 17:34:32,821:139:ZOO_INFO@log_env@1139: Client environment:user.dir=/srv
2021-02-01 17:34:32,821:139:ZOO_INFO@zookeeper_init_internal@1173: Initiating client connection, host=zookeeper:2181 sessionTimeout=5 watcher=0x7f365d600480 sessionId=0 sessionPasswd=<null> context=0x55d714f278a0 flags=0
Interest in (fd=19, read=true, write=true, timeout=1)2021-02-01 17:34:32,823:139:ZOO_DEBUG@get_next_server_in_reconfig@1326: [OLD] count=0 capacity=0 next=0 hasnext=0
2021-02-01 17:34:32,823:139:ZOO_DEBUG@get_next_server_in_reconfig@1329: [NEW] count=1 capacity=16 next=0 hasnext=1
2021-02-01 17:34:32,823:139:ZOO_DEBUG@get_next_server_in_reconfig@1340: Using next from NEW=172.18.0.2:2181
2021-02-01 17:34:32,823:139:ZOO_DEBUG@zookeeper_connect@2228: [zk] connect()
zk_timer_cb firedping timer went off2021-02-01 17:34:32,832:139:ZOO_WARN@zookeeper_interest@2280: Exceeded deadline by 9ms
yield:zookeeper_interest returned error: -7 - operation timeout
2021-02-01 17:34:32,833:139:ZOO_ERROR@handle_socket_error_msg@2382: Socket [172.18.0.2:2181] zk retcode=-7, errno=110(Operation timed out): connection to 172.18.0.2:2181 timed out (exceeded timeout by 7ms)
2021-02-01 17:34:32,833:139:ZOO_DEBUG@cleanup@1781: Previous connection=[172.18.0.2:2181] delay=0
zk_timer_cb firedping timer went off2021-02-01 17:34:32,834:139:ZOO_WARN@zookeeper_interest@2280: Exceeded deadline by 11ms
2021-02-01 17:34:32,835:139:ZOO_DEBUG@get_next_server_in_reconfig@1326: [OLD] count=0 capacity=0 next=0 hasnext=0
2021-02-01 17:34:32,835:139:ZOO_DEBUG@get_next_server_in_reconfig@1329: [NEW] count=1 capacity=16 next=1 hasnext=0
2021-02-01 17:34:32,836:139:ZOO_DEBUG@get_next_server_in_reconfig@1351: Failed to find either new or old
2021-02-01 17:34:32,836:139:ZOO_DEBUG@zookeeper_connect@2228: [zk] connect()
Interest in (fd=20, read=true, write=true, timeout=1)Assertion failed: loop->watchers[w->fd] == w (../deps/uv/src/unix/core.c: uv__io_stop: 932)
sirmonin commented 3 years ago

However, adding a setTimeout seemed to help a bit. App is no longer crashing and tries to reconnect.

public connect = () => {
        const host = this.config.get('ZOOKEEPER_HOST');
        const port = this.config.get('ZOOKEEPER_PORT');
        this.zkConfig = {
            connect: `${host}:${port}`,
            timeout: 5,
            debug_level: ZooKeeper.ZOO_LOG_LEVEL_DEBUG,
            host_order_deterministic: false,
        };

        this.client = new ZooKeeper(this.zkConfig);

        setTimeout(() => {
            this.client.init(this.zkConfig);
        }, 1000);
    };
DavidVujic commented 3 years ago

Thanks @sirmonin!

DavidVujic commented 3 years ago

I'm closing this one. Please reopen if it still is an issue!

sirmonin commented 2 years ago

@DavidVujic Issue still persists when server is under high load.

2022-04-29 17:48:16,829:448:ZOO_ERROR@handle_socket_error_msg@2489: Socket 172.19.0.4:2181 zk retcode=-7, errno=110(Operation timed out): connection to 172.19.0.4:2181 timed out (exceeded timeout by 8849ms)
yield:zookeeper_interest returned error: -7 - operation timeout
2022-04-29 17:48:18,481:448:ZOO_WARN@zookeeper_interest@2375: Exceeded deadline by 12182ms
Assertion failed: loop->watchers[w->fd] == w (../deps/uv/src/unix/core.c: uv__io_stop: 930)

Node.js v14.16.1 node-zookeeper v5.4.0

DavidVujic commented 2 years ago

Thank you for notifying. I'll reopen this one!

sirmonin commented 2 years ago

@DavidVujic I tested under Node v16.14.2. It looks like yield:zookeeper_interest returned error: -7 - operation timeout no longer causes libuv to fail.

Will keep you posted in case issue repeats itself again. For now the issue is solved by updating to Node v16.14.2 or later.

However, timeout issue still persists (eventhough it doesn't crash node anymore)

2022-04-29 20:18:04,369:218:ZOO_ERROR@handle_socket_error_msg@2489: Socket 172.19.0.4:2181 zk retcode=-7, errno=110(Operation timed out): connection to 172.19.0.4:2181 timed out (exceeded timeout by 10530ms)
yield:zookeeper_interest returned error: -7 - operation timeout

Update: Node.js v14.19.1 doesn't crash as well.