zuazo / zookeeper_bridge-cookbook

Chef cookbook to help integrate the Chef run with ZooKeeper.
https://supermarket.chef.io/cookbooks/zookeeper_bridge
Apache License 2.0

Support for TTLs #1

Open · stensonb opened this issue 9 years ago

stensonb commented 9 years ago

First off...THIS IS AWESOME! I've been dreaming about this for a while, and then found you've already done most of the legwork! :)

It would be cool to see a lock expire based on a supplied TTL...and/or a method to update the TTL of an existing lock -- basically, I'd like the "dead-man's switch" model...
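Roughly what I have in mind, in recipe terms (the `ttl` and `refresh` attributes below are hypothetical -- nothing like them exists in the cookbook today, and the `path` attribute is just illustrative):

```ruby
# Hypothetical sketch -- "ttl" and "refresh" are not real attributes.
zookeeper_bridge_rdlock 'critical_section' do
  path '/locks/my_lock' # illustrative
  ttl 300               # lock auto-expires 300 s after the last refresh
  refresh 60            # holder re-arms the TTL every 60 s: dead-man's switch
end
```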

zuazo commented 9 years ago

Seriously @stensonb, thanks for your words :wink: This is still a bit experimental and any help testing the cookbook is welcome.

Please explain in more detail what you need, or send me a link with more information. Unfortunately, I'm not very familiar with lock TTLs. Isn't zookeeper_bridge_rdlock#wait what you want?

By the way, I'm using the zk gem to implement the lockers, so we may be somewhat limited by what it supports.
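For reference, the zk gem lockers we build on are used roughly like this (a minimal sketch):

```ruby
require 'zk'

zk = ZK.new('localhost:2181')
lock = zk.locker('my_lock') # exclusive locker backed by ephemeral znodes

# Block form: acquires the lock, yields, and releases it afterwards.
lock.with_lock do
  # guarded code runs here
end

zk.close!
```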

stensonb commented 9 years ago

I'm thinking of the case where the guarded code barfs and causes chef-client to fail...either because of a software exception (which could probably still release the lock gracefully), or some hardware failure (the machine shuts down and cannot release the lock).

The question is this: how does an abandoned lock get released? I suggest a TTL...
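To make the TTL idea concrete: any client could then garbage-collect lock znodes whose modification time is older than the TTL. A rough sketch against the zk gem (the `/locks` layout is made up, and a real reaper would need to avoid racing a live holder that refreshes its lock):

```ruby
require 'zk'

TTL = 300 # seconds

zk = ZK.new('localhost:2181')
zk.children('/locks').each do |child|
  path = "/locks/#{child}"
  stat = zk.stat(path)
  next unless stat.exists?
  age = Time.now.to_i - stat.mtime / 1000 # mtime is in milliseconds
  zk.delete(path) if age > TTL            # reap the abandoned lock
end
zk.close!
```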

stensonb commented 9 years ago

Another feature, which could be tangential to this one, is that of persisting the lock locally -- in order to recover from chef-client run failures.

That is, when Node A runs chef-client, it gets Lock B and performs some task C. If task C fails to complete (software or hardware), it should be possible for Node A to run chef-client again (or in daemon mode), resume by deserializing Lock B from persistent storage, and continue the chef-client run.

stensonb commented 9 years ago

Through some local testing, I was able to determine that:

  1. locks are automatically removed from ZooKeeper if/when the chef-client dies during the "guarded" code execution (this must be a ZooKeeper feature of the particular lock the zk gem builds -- I killed the chef-client process mid-run to check; see the sketch after this list).
  2. it doesn't look like the ZK gem supports local persistence of the lock (in order to start back up where it left off)...chef-clients running again would simply get another lock from ZK.
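For what it's worth, my test for point 1 looked roughly like this (lock name and timings are arbitrary):

```ruby
require 'zk'

# Simulate a chef-client dying mid-run: take the lock in a child
# process, then kill -9 it so it never gets a chance to unlock.
pid = fork do
  zk = ZK.new('localhost:2181')
  zk.locker('test_lock').with_lock { sleep 300 }
end
sleep 5 # give the child time to acquire the lock
Process.kill('KILL', pid)
Process.wait(pid)

# with_lock blocks until the lock is free; it returns as soon as
# ZooKeeper expires the dead session and drops its ephemeral znode.
zk = ZK.new('localhost:2181')
zk.locker('test_lock').with_lock { puts 'abandoned lock was released' }
zk.close!
```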

zuazo commented 9 years ago

First of all, thank you for your research.

  1. locks are automatically removed from ZooKeeper if/when the chef-client dies during the "guarded" code execution (this must be a ZooKeeper feature of the particular lock the zk gem builds -- as I killed the chef-client process mid-run).

AFAIR, zk locks use ephemeral znodes. So, yes, if the connection to ZooKeeper fails or is closed, the lock should be released.

  2. it doesn't look like the ZK gem supports local persistence of the lock (in order to start back up where it left off)...chef-clients running again would simply get another lock from ZK.

Local persistence also seems like the kind of feature the zk gem maintainers might well decline to include.

I'm not sure, but I think this should be implemented using other existing lock patterns rather than in the way you propose. I'm sorry, but I don't quite understand your example use case (in a real scenario, I mean).

stensonb commented 9 years ago

Yup, looks like the ZK::Locker::Semaphore class is using :ephemeral_sequential when creating its znodes...and this makes ZooKeeper clear them when the client session ends (e.g., the TCP connection is closed and not re-established).

The frustrating part is that -- currently -- there is no way for zookeeper_bridge_sem to report to other nodes whether the chef-client completed successfully or not.
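For example (a sketch only -- I'm improvising the attribute names, and `restart_web_server` stands in for whatever the guarded code does):

```ruby
# Sketch: attribute names and restart_web_server are illustrative.
zookeeper_bridge_sem 'web_restart' do
  size 1 # one node in the load balancer converges at a time
  block do
    restart_web_server # may raise and kill the chef-client run
  end
end
# Whether the block converged or blew up, the semaphore slot is freed
# either way (the session's ephemeral znode disappears), so the next
# node proceeds with no indication that this one failed.
```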

The result is that other nodes in the cluster will eventually get the lock and converge their local resources. This could be disastrous for web servers behind a load balancer, for example (a service restart that fails could bring the entire load-balanced solution down...not good).

In summary, zookeeper_bridge_sem currently only sequences the start of a resource, not its successful convergence.

(I understand this discussion has now diverged wildly from the original "TTL" question...maybe I'll move this to the wiki)

zuazo commented 9 years ago

@stensonb, sorry for the delay :cold_sweat:

Perhaps the zookeeper_bridge_sem resource does not cover your use case, but I think it works as expected. I mean, the purpose of a semaphore is to control how many processes can access a resource at the same time. And that's just what this resource does. Nothing more, nothing less.

To handle errors inside semaphores, I think you should combine them with other resources, like locks or waits. Maybe we should implement a resource like zookeeper_bridge_signal to notify zookeeper_bridge_wait (or zookeeper_bridge_sem?) and use it to tell the other processes that there were no errors.
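Something along these lines, perhaps (everything here is only a proposal sketch -- zookeeper_bridge_signal does not exist, and I am improvising the attributes and the `restart_web_server` helper):

```ruby
# Proposal sketch: none of this exists yet.
zookeeper_bridge_sem 'web_restart' do
  size 1
  block do
    restart_web_server # hypothetical guarded code
  end
end

# Only reached if the semaphore block converged without errors:
zookeeper_bridge_signal '/signals/web_restart_ok' do
  action :create
end

# Other nodes would gate on the success signal, not just on the
# semaphore slot becoming free:
zookeeper_bridge_wait '/signals/web_restart_ok' do
  status :created
end
```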

  (I understand this discussion has now diverged wildly from the original "TTL" question...maybe I'll move this to the wiki)

I think this place is better than the wiki for discussions. But you can add to the wiki whatever you think appropriate.