Closed eric closed 12 years ago
So, Zookeeper has a fairly nuanced concept of client session, and it takes a little doing to deal with all the different states properly.
From the docs:
One of the parameters to the ZooKeeper client library call to create a ZooKeeper session is the session timeout in milliseconds. The client sends a requested timeout, the server responds with the timeout that it can give the client. The current implementation requires that the timeout be a minimum of 2 times the tickTime (as set in the server configuration) and a maximum of 20 times the tickTime. The ZooKeeper client API allows access to the negotiated timeout.
When a client (session) becomes partitioned from the ZK serving cluster it will begin searching the list of servers that were specified during session creation. Eventually, when connectivity between the client and at least one of the servers is re-established, the session will either again transition to the "connected" state (if reconnected within the session timeout value) or it will transition to the "expired" state (if reconnected after the session timeout). It is not advisable to create a new session object (a new ZooKeeper.class or zookeeper handle in the c binding) for disconnection. The ZK client library will handle reconnect for you. In particular we have heuristics built into the client library to handle things like "herd effect", etc... Only create a new session when you are notified of session expiration (mandatory).
Session expiration is managed by the ZooKeeper cluster itself, not by the client. When the ZK client establishes a session with the cluster it provides a "timeout" value detailed above. This value is used by the cluster to determine when the client's session expires. Expirations happens when the cluster does not hear from the client within the specified session timeout period (i.e. no heartbeat). At session expiration the cluster will delete any/all ephemeral nodes owned by that session and immediately notify any/all connected clients of the change (anyone watching those znodes). At this point the client of the expired session is still disconnected from the cluster, it will not be notified of the session expiration until/unless it is able to re-establish a connection to the cluster. The client will stay in disconnected state until the TCP connection is re-established with the cluster, at which point the watcher of the expired session will receive the "session expired" notification.
This is particularly importnat to look out for if you're using watches. If your program sets up a bunch of watches and essentially just waits for events to happen, if you miss the SESSION EXPIRED, you will essentially just go into zombie mode: your watches will never fire. As the docs say, if you get a session expired event, you should close your client, and re-establish whatever state you need. I generally handle this in a fairly brute-force way. I use runit for process management, so when I get a session expired, I just treat it as a fatal event. I shut down everything that needs to be gracefully closed, and exit. This ensures that when I come back, everything will be in a correct state (kinda like crash-only software). At the very least, you need to close your client instance and re-connect.
On the client, you can use the on_expired_session
callback to register a block to be called when this event happens. I'm 90% sure these callbacks don't need to be re-registered every time (they're not one-shots like node events), it's been a while since I poked around that portion of the code, so I'll have to try it and get back to you conclusively.
In terms of "why is this happening," generally I see this happen when there's either high network load, or lots of IO load slowing down the zookeeper server. I've never seen it happen in development (i.e. both processes running on the same box). You might want to check the server logs for clues (perhaps bump up the logging), you can also assign a debug log to ZK.logger
, Zookeeper.logger
, and if you're really curious (and using the MRI version), you can do Zookeeper.set_debug_level(4)
which will set the underlying zkc connection debug logging level (which unfortunately only goes to stderr, but it's better than nothing).
Hope that helps :)
This was super informative. Since I'm right in the middle of integrating ZK into the redis_failover gem, I went ahead and wrote a thin wrapper that works around the session expiring (on-demand and also with the callback, just to be extra safe). Interestingly enough, I was running into this same error today because I happened to close my laptop while my ZK client instance was still connected. An hour later I open the laptop and BAM, session expiration error and things become useless.
Here's the commit for that: https://github.com/ryanlecompte/redis_failover/blob/zookeeper/lib/redis_failover/zk_client.rb
Wouldn't it be better to call reopen() on the existing client than it would to create a new one?
I think you're right. I'm running into weird hanging issues when using #close!.
Is this in processes that are using fork (passenger, resque, etc?) or just in basic testing?
Just in basic testing, just a virgin ZK instance in a pry console. Create the instance, then call #close! on it and it just hangs indefinitely. I switched to using #reopen and it gracefully recovers after the session expires (I keep closing my laptop lid to test, hehe). One thing that I noticed, though, is that after calling #reopen, even though I set new watchers, they never get fired. My guess is it's a bug in the ZK client implementation of #reopen or something. In my particular case it's not the end of the world, since right before making any client redis operation I always fetch the freshest nodes from the ZK client (i.e., I don't suddenly brake if the watcher for the redis nodes fails to come through).
Hmm, weird, i'll have to take a look into that. I don't really use reopen much myself, it was part of the original Zookeeper API, and I just carried it forward into ZK.
It's funny, one of the first things I'd written with ZK was a redis failover library (both client/server side), but never deployed it (we wound up getting rid of resque and replacing it with a zookeeper-based solution that I'm currently trying to get open-sourced).
I'll have a look at reopen and the interaction with the watches.
@slyphon that's interesting :) I'm really hoping that redis_failover with ZK becomes a useful option for folks. Just today I integrated ZK (previously it was using a fragile HTTP-based failover mechanism). @eric is going to code review also, but would love to get more eyes on it if you have the time: https://github.com/ryanlecompte/redis_failover/tree/zookeeper
Cool, I'd love to have a look. :)
I've opened #9 for the watcher/reopen behavior, and added a spec for what I think you're describing. It passes in jruby, but in 1.9.3 it dies with an Errno::EBADF
, so I obviously have a little bit of investigation ahead :)
Cool! One other thing that I think might be missing from the library is support for the #sync call. I was reading through the docs and came across something a bit disturbing, as my understanding was that all clients would see the same values at all times (this unfortunately causes problem for redis_failover, since in the current approach two different clients could see different redis master / slaves):
Sometimes developers mistakenly assume one other guarantee that ZooKeeper does not in fact make. This is:
Simultaneously Consistent Cross-Client Views ZooKeeper does not guarantee that at every instance in time, two different clients will have identical views of ZooKeeper data. Due to factors like network delays, one client may perform an update before another client gets notified of the change. Consider the scenario of two clients, A and B. If client A sets the value of a znode /a from 0 to 1, then tells client B to read /a, client B may read the old value of 0, depending on which server it is connected to. If it is important that Client A and Client B read the same value, Client B should should call the sync() method from the ZooKeeper API method before it performs its read.
So, ZooKeeper by itself doesn't guarantee that changes occur synchronously across all servers, but ZooKeeper primitives can be used to construct higher level functions that provide useful client synchronization. (For more information, see the ZooKeeper Recipes. [tbd:..]).
From http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html
This could also be supported by having Client A and B try to create() the same path. Since the server knows the correct state, even if one client doesn't see the right value, only one will create the node, and the other will fail. This avoids check-and-set race conditions (i.e. always try to write and catch the error).
I've created #10, and I'll look into adding support for sync().
@slyphon, that's interesting. I'm not sure if that would solve my existing problem with the way the redis_failover uses ZK right now. Let me explain what I'm doing, and maybe you could see a way to avoid these races:
The approach that I'm currently taking with the zookeeper-based redis_failover is this:
1) The redis failover server (now named just the node manager daemon) is a single process that periodically checks for redis servers that have gone away. It writes the current set of redis servers (master/slaves) to a single znode in zookeeper.
2) The clients no longer access the daemon directly now (they used to in an earlier version of the gem). Instead, they only perform read operations on the znode that contains the list of redis servers set by the node manager daemon. Right before every read/write operation to redis, it first fetches the znode data and updates its internal redis clients appropriately.
What do you think about this approach? I still think there are cases where (for at least a brief moment in time), the clients could end up writing to two different masters. Imagine this scenario:
client B will continue using the old master until its view of the znode data "times out". I'm not sure how to close that window of stale data?
Is this something I just can't avoid? (i.e., small windows of having separate clients write to two different masters temporarily).
Thanks!
The way I arranged things (or rather planned to but never quite finished), was:
The redis daemon was started by a process that first connected to ZK, then attempted to win an election using ZK::Election::Candidate
.
redis.call("SLAVEOF", "NO", "ONE")
call to set ourselves as the master noderedis.slaveof(host, port)
One thorny issue was how to ensure that the redis servers all used the correct master/slaves.
The clients all use ZK::Election::Observer
to watch for changes in leadership, and act as proxies to the underlying redis connection. When the leader is not defined (i.e. the current master fails), the clients all go into 'safe mode' and only raise an error indicating the current state. When a new leader is elected, the clients connect to the new leader and everything continues as normal.
The advantage to having the redis server directly started by the zk candidate is that you are immediately notified if/when the redis process dies. Also, when a shutdown signal is received, you can gracefully close the zk connection, abdicating your leadership (otherwise, it could take up to timeout seconds for the server to realize you've really gone away and your session is invalid).
On the client side, you are always going to have a small window where the current redis server may be dead and you haven't been notified via ZK yet, but that's just the nature of clusters :). The best you can do on that side is to be aware of the exceptions in your client and try to handle the state change as gracefully as possible.
The reason I never completed this was that the state transitions involve so many intermediate steps. On a later project, WhoCan, dealing with a similar issue (having two rabbitmq connections, a 'master' and 'hot standby'), I wound up using the state_machine
gem to manage all the intermediate states (you can see a diagram of the transitions here). Some of the corner cases you'd have to handle:
Anyway, my incomplete attempt is now available as motionbox/redis-mgr. The salient bits are the server.rb and the client.rb.
But enough about my implementation (sorry, I'm just happy I finally got to share this code with someone :))...
I think you'd have to make it so that it was impossible, on the redis server side, to write to the non-master. This is available (after 2.6) via the slave-read-only
option. So the flow would look like:
"Director" decides to promote a different master:
SLAVEOF newmaster port
)SLAVEOF NO ONE
)Ensuring that all clients see the change is difficult, so if you make it so they can't do the wrong thing (or at least that they could recognize the error state and take corrective action), then at least you can be sure you won't wind up with corrupt data.
Wow, that's fantastic @slyphon ! I totally forgot about the read-only option, which I will definitely have to use in this case. What happens in the case of a network partition, though? If a client is partitioned due to a flappy network, it won't see that the znode has been recreated, right? (i.e., it won't receive the watcher callbacks). Won't it continue to try and write to the old master? Also, if the Node Manager (director) wasn't able to successfully set the old master to be read-only/slave, the "stale" client could still make writes to the old master, right? Or maybe ZK somehow prevents this?
One more note: currently, right before the client performs any operation, it queries the ZK node for the list of Redis servers to use. I'm wondering if I really need to use a watcher here, since the node manager could just delete the master from the znode right before it decides to promote a new master. Then, the clients should see that and not use a master for writes until they see a master appearing the next time they fetch the znode data. The client already has built-in support for retrying a write operation if there is no master in the znode data. Thoughts?
On Apr 16, 2012, at 8:23 AM, Jonathan Simmsreply@reply.github.com wrote:
The way I arranged things (or rather planned to but never quite finished), was:
The redis daemon was started by a process that first connected to ZK, then attempted to win an election using
ZK::Election::Candidate
.
- If we win
- we use the
redis.call("SLAVEOF", "NO", "ONE")
call to set ourselves as the master node- the 'current master' node is updated with our uri
- otherwise
- we watch the 'current master' node and update our redis server to point to the leader using
redis.slaveof(host, port)
One thorny issue was how to ensure that the redis servers all used the correct master/slaves.
The clients all use
ZK::Election::Observer
to watch for changes in leadership, and act as proxies to the underlying redis connection. When the leader is not defined (i.e. the current master fails), the clients all go into 'safe mode' and only raise an error indicating the current state. When a new leader is elected, the clients connect to the new leader and everything continues as normal.The advantage to having the redis server directly started by the zk candidate is that you are immediately notified if/when the redis process dies. Also, when a shutdown signal is received, you can gracefully close the zk connection, abdicating your leadership (otherwise, it could take up to timeout seconds for the server to realize you've really gone away and your session is invalid).
On the client side, you are always going to have a small window where the current redis server may be dead and you haven't been notified via ZK yet, but that's just the nature of clusters :). The best you can do on that side is to be aware of the exceptions in your client and try to handle the state change as gracefully as possible.
The reason I never completed this was that the state transitions involve so many intermediate steps. On a later project, WhoCan, dealing with a similar issue (having two rabbitmq connections, a 'master' and 'hot standby'), I wound up using the
state_machine
gem to manage all the intermediate states (you can see a diagram of the transitions here). Some of the corner cases you'd have to handle:
- Zookeeper session becomes invalid (for either the client or server)
- Redis server dies, ZK not notified yet
- Zookeeper server dies, redis state unknown
Anyway, my incomplete attempt is now available as motionbox/redis-mgr. The salient bits are the server.rb and the client.rb.
But enough about my implementation (sorry, I'm just happy I finally got to share this code with someone :))...
I think you'd have to make it so that it was impossible, on the redis server side, to write to the non-master. This is available (after 2.6) via the
slave-read-only
option. So the flow would look like:"Director" decides to promote a different master:
- znode is removed
- Clients will be notified (via a watcher) that the node has been removed and will block
- Current master, if still up, is changed to point to the to-be-promoted node (using
SLAVEOF newmaster port
)- New master is promoted (using
SLAVEOF NO ONE
)- Leader znode is updated appropriately
- Clients are notified that the node has been re-created, will read the current data and continue.
Ensuring that all clients see the change is difficult, so if you make it so they can't do the wrong thing (or at least that they could recognize the error state and take corrective action), then at least you can be sure you won't wind up with corrupt data.
Reply to this email directly or view it on GitHub: https://github.com/slyphon/zk/issues/8#issuecomment-5154649
In terms of the watchers and partitions, if you get a SESSION_EXPIRED, then you should treat that as "OMG I have no idea what's going on," and take no actions until you re-establish your ZK session and see what the current state of the cluster is. If you don't get an expired session, and you have a temporary loss of connection (i.e. less than the negotiated timeout), then ZK will deliver watch events for the client as expected when it (transparently) reconnects.
The advantage of a watcher is that you don't need to read from ZK for every operation on redis. I mean, ZK is fast for reads, but a round trip to the server is still an extra round trip. This is actually why I went with the architecture above, in that all the clients would be notified out-of-band that something had changed, and it wouldn't impede the normal performance of the redis client.
However, if they're doing a get() on each operation, then they're also getting back the stat for the node, so they could look at stat.version
and they could know that they're out of date.
BTW, I added sync support to slyphon/zookeeper. It'll take me a little bit to get a release out, and then add support in ZK, but should be next day or so.
Ah, that makes sense! I will change it so that I don't ping ZK for every redis read/write operation. I was doing that for fear that watches wouldn't be delivered, but now I see that if I should be fine if I properly handle the SESSION_EXPIRED error condition. I'm already handling that by attempting to reopen the ZK connection, but perhaps I should just destroy it and create a new client since there's that bug where #reopen does not subsequently propagate watch events on 1.9.3.
If I go with this architecture, is there a need to perform the #sync before each read operation, or can I solely rely on the watch callbacks for the znode?
I'm seeing something interesting while trying to implement this new approach. Here's what I'm seeing:
1) the director deletes the znode 2) the director then creates the znode
On the client, I'm only getting a watcher update for the delete event, but not the create event. Here's how it's setup:
# register a watcher for future changes
@zkclient.watcher.register(ZK_PATH) do |event|
logger.info("Update for ZK node #{ZK_PATH}: #{event.inspect}")
if event.node_created?
logger.info("ZK node #{ZK_PATH} created")
build_clients
elsif event.node_changed?
logger.info("ZK node #{ZK_PATH} changed")
build_clients
elsif event.node_deleted?
logger.info("ZK node #{ZK_PATH} deleted")
@lock.synchronize { purge_clients }
@zkclient.stat(ZK_PATH, :watch => true)
else
logger.error("Unknown ZK node event: #{event.inspect}")
end
end
My guess is the watch event is fired BEFORE I get a chance to re-register for watches with the @zkclient.stat(ZK_PATH, :watch => true) call?
I was able to get around the issue above. I just updated the ZK redis_failover branch with the approach that you've outilned, @slyphon. I'd love for you and @eric to take a look and comment:
https://github.com/ryanlecompte/redis_failover/tree/zookeeper/lib/redis_failover
Here is where we promote a new master in the Node Manager: https://github.com/ryanlecompte/redis_failover/blob/zookeeper/lib/redis_failover/node_manager.rb#L98
And in the client, here is where we register for watches:
https://github.com/ryanlecompte/redis_failover/blob/zookeeper/lib/redis_failover/client.rb#L109
Thanks!
Ah, one question regarding your last comment: "If you don't get an expired session, and you have a temporary loss of connection (i.e. less than the negotiated timeout), then ZK will deliver watch events for the client as expected when it (transparently) reconnects."
What if someone tries to do a write during that temporary loss of ZK connection? In that case, I wouldn't get a watch event called telling me to refresh my redis client list, and I'll still make a write operation on a stale master node.
I wonder if there is any way to avoid that? Maybe before making any read/write operation on a Redis client, first check to make sure that I'm not disconnected from ZK? It seems like there still might be an edge case.
Ok so the problem here is that someone could have updated the node behind your back.
Basically, you probably want to take the approach that:
gah, there's a pattern that fits this, but I'll have to dig out a decent example of this a little later.
Right, when the delete event is handled, I purge the clients and re-register for watch updates. It's not until the next watch update (changed or created) that I try to read the list of nodes and recreate the clients. I think this is fine since the node manager keeps updating the set of redis servers in the znode every time that the redis servers are periodically checked and deemed available/unavailable. So, the client will keep getting the change watch events and will eventually be able to rebuild its clients. The client knows that if he received a watch change event, but the list of redis servers hasn't changed (see https://github.com/ryanlecompte/redis_failover/blob/zookeeper/lib/redis_failover/client.rb#L178), then it will just gracefully go about its business and not tear down the clients again.
I just read some of the latest changes, including the ones that warn about calling close!
in the event thread.
Does this also pertain to calling reopen
? Can I do this?
zk.on_expired_session { zk.reopen }
and have it work properly?
Reopen should work, as right now, it doesn't shut down the event_dispatch_thread
, it only swaps out the underlying zookeeper connection. I'm pretty sure this is the one case where doing any kind of blocking operation on the event thread should be OK, as you won't be blocking any events because your session is expired :).
As a general rule, I want to start encouraging people to use a queue and thread that they own to process the delivered events, as doing work on someone else's thread is a bad idea.
Hmmmm... that's interesting.
It sort of takes some of the benefit out of using a callback-based system, doesn't it? ;-)
I'm going to have to think about the implications of this further.
@eric hang on, i'm gonna create a new issue for this, I think this is important, and I'd like your input on how this should be designed.
okay, i think i'm gonna close this one
I have been seeing SessionExpired exceptions when performing get operations. What should I be doing in these cases (or to prevent them from happening in the first place)?