Open pbrumm opened 5 years ago
Thanks for using nsq-ruby
and for the PR, Pete!
You mention running into timeout prone scenarios. Is it possible to write some tests to simulate those scenarios, each in isolation, so that we can be confident these changes address them specifically?
Also, it looks like there are many places where you add timeouts or change the default timeouts to be more aggressive. We've found the existing defaults to work sufficiently and I'd prefer not to change them without good reason, but I'm certainly not opposed to making more things configurable.
I'd love to hear more about your setup and why you're running into some of this.
Yep. I noticed some of those defaults changed and will revert them. I can also expose configuration options for the ones that aren't currently configurable.
I am running on Linux and have had trouble with the test suite. I will take a further look into ways to test these scenarios.
My use case is pulling logs from hundreds of servers that scale up and down with volume. Sometimes they are removed in a way that looks like a timeout, or they get into that state on their own due to low memory.
The issue I was seeing was that new servers were not getting discovered for potentially days after they came up, and exceptions thrown during discovery were exiting the discovery loop so that it was never attempted again.
If the discovery loop tries connecting to a server that is blocked (ctrl-z to reproduce), the loop will never complete. So I added non-blocking reads and a reasonable timeout of 10 seconds to receive any data from the connection (see the sketch below).
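For illustration, here's a minimal sketch of that pattern in plain Ruby (not the actual nsq-ruby internals): `IO.select` enforces an overall deadline while `read_nonblock` avoids blocking on a peer that has gone silent.

```ruby
require 'socket'
require 'timeout'

READ_TIMEOUT = 10 # seconds; the "reasonable timeout" mentioned above

# Read `bytes_needed` bytes or raise once the deadline passes, so a peer
# that was SIGSTOP'd (ctrl-z) or wedged can't hang the discovery loop.
def read_with_deadline(socket, bytes_needed, timeout: READ_TIMEOUT)
  data = +''
  deadline = Time.now + timeout
  while data.bytesize < bytes_needed
    remaining = deadline - Time.now
    raise Timeout::Error, 'peer unresponsive' if remaining <= 0

    begin
      data << socket.read_nonblock(bytes_needed - data.bytesize)
    rescue IO::WaitReadable
      # Wait until the socket is readable, but never longer than the deadline.
      IO.select([socket], nil, nil, remaining) or
        raise Timeout::Error, 'peer unresponsive'
    end
  end
  data
end
```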
lookupd is less likely to be unresponsive, but I added protections there as well for cases where the network connection is bad.
For the retries, I split the lookupd scenario out from the directly configured nsqd situation. If lookupd is providing the server list, the client should trust that source and give up on individual nsqds more quickly, so it retries fewer times to break out faster. Previously, if 100 servers were removed, each would go through the retry logic individually in its own thread for potentially 8 hours before giving up.
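Roughly, the split looks like this (the method, attempt counts, and backoff cap are hypothetical, not the patch's exact values):

```ruby
# Give up sooner on nsqds that lookupd told us about, since lookupd will
# re-announce them if they return; retry longer on directly configured nsqds.
def reconnect_with_retries(connection, via_lookupd:)
  max_attempts = via_lookupd ? 5 : 50
  attempts = 0
  begin
    attempts += 1
    connection.connect
  rescue StandardError
    raise if attempts >= max_attempts
    # Exponential backoff, capped so a wave of removed servers doesn't
    # pin threads for hours.
    sleep [2**attempts, 30].min
    retry
  end
end
```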
To handle cases where the server connects initially but becomes unresponsive later, I added a monitor on last_heartbeat so a connection can be marked bad if the nsqd can't keep up with sending heartbeats.
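A simplified sketch of that idea, assuming the connection object exposes `last_heartbeat` and `close` (names stand in for whatever the real object provides):

```ruby
# Background thread that drops a connection whose heartbeats have gone stale,
# so an nsqd that accepted the connection but can no longer keep up is
# treated as dead and discovery/retry can take over.
class HeartbeatMonitor
  def initialize(connection, max_stale_seconds: 60, check_every: 5)
    @connection = connection
    @max_stale = max_stale_seconds
    @check_every = check_every
  end

  def start
    Thread.new do
      loop do
        sleep @check_every
        stale_for = Time.now - @connection.last_heartbeat
        if stale_for > @max_stale
          @connection.close
          break
        end
      end
    end
  end
end
```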
I switched the rescues to StandardError, which lets compile-related issues (SyntaxError, LoadError, etc.) propagate instead of being swallowed.
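For example (with `connection` and `logger` as placeholders), a rescue like this still lets `SyntaxError`, `LoadError`, and `Interrupt` surface, because they sit outside the `StandardError` hierarchy:

```ruby
begin
  connection.connect
rescue StandardError => e
  # Catches ordinary runtime and network errors (Errno::*, IOError, ...),
  # but not SyntaxError/LoadError/Interrupt, which continue to propagate.
  logger.warn("connect failed: #{e.class}: #{e.message}")
  # schedule a retry...
end
```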
The switch to non-blocking reads and writes also keeps the retry loops running. Before, if one connection was hung, the code would just be stuck waiting until the nsqd dropped the connection or became responsive again.
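The write side follows the same pattern as the read sketch above; again, this is illustrative rather than the gem's actual code:

```ruby
require 'socket'
require 'timeout'

# Write the whole payload or raise once the deadline passes, so a peer with
# full TCP buffers (or a stopped process) can't block the caller forever.
def write_with_deadline(socket, payload, timeout: 10)
  deadline = Time.now + timeout
  until payload.empty?
    remaining = deadline - Time.now
    raise Timeout::Error, 'write stalled' if remaining <= 0

    begin
      written = socket.write_nonblock(payload)
      payload = payload.byteslice(written..-1)
    rescue IO::WaitWritable
      IO.select(nil, [socket], nil, remaining) or
        raise Timeout::Error, 'write stalled'
    end
  end
end
```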
Reworking to expose the following configuration options would be a good improvement as well. Let me know if you agree with the names and I can update the pull: `max_retry_sleep_seconds`, `max_retry_attempts`, `heartbeat_seconds`, `heartbeat_max_stale_seconds`.
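From the consumer's point of view the options might look like this; the four new names are only the proposal above and not part of the released gem, while `nsqlookupd`/`topic`/`channel` follow the existing constructor:

```ruby
require 'nsq'

consumer = Nsq::Consumer.new(
  nsqlookupd: ['127.0.0.1:4161'],
  topic:      'logs',
  channel:    'ingest',
  max_retry_sleep_seconds:     30,  # cap on backoff between reconnect attempts
  max_retry_attempts:          10,  # give up on an nsqd after this many tries
  heartbeat_seconds:           30,  # heartbeat interval requested from nsqd
  heartbeat_max_stale_seconds: 90   # drop the connection if no heartbeat for this long
)
```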
I was able to figure out the specs and am adding some now.
I pushed the nsq-cluster pull as well. I ran the specs against that and added the new ones.
Let me know if the new variable names make sense and I will update that as well.
It will definitely make the specs run faster too.
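One way to simulate a hung nsqd in isolation, without a full cluster, is a bare TCP server that accepts the connection and then goes silent. This spec sketch exercises the `read_with_deadline` helper sketched earlier and is illustrative only, not one of the actual specs:

```ruby
require 'socket'
require 'timeout'

RSpec.describe 'read_with_deadline' do
  it 'gives up on a peer that accepts the connection but never responds' do
    server = TCPServer.new('127.0.0.1', 0)
    # Accept the connection, then never send a byte (a stand-in for ctrl-z).
    Thread.new { server.accept; sleep }

    client = TCPSocket.new('127.0.0.1', server.addr[1])
    expect {
      read_with_deadline(client, 4, timeout: 1)
    }.to raise_error(Timeout::Error)
  end
end
```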
Sorry for the complete radio silence. I did merge your nsq-cluster change in and tested against your branch, but it seems like there are still issues with the specs.
I can check. We have been running this patch in prod for years and it resolved all the lockups we had in the past.
I have run into many timeout-prone scenarios and this resolves them.
If a lookupd times out, break after a reasonable timeout and move on. If an nsqd or server times out because it is out of memory or shutting down, time out quickly and move on.
This fixes the case where an nsqd would time out and hang the discovery process so that no new servers were identified, even after that nsqd became unreachable or came back.
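As a sketch of bounding the lookupd query itself, assuming a plain Net::HTTP call against nsqlookupd's `/lookup` endpoint rather than the gem's actual HTTP wrapper:

```ruby
require 'net/http'
require 'json'

# Query a lookupd with hard connect/read timeouts so a flaky network or an
# unresponsive lookupd can't stall discovery; treat failures as "no answer".
def query_lookupd(host, port, topic, timeout: 5)
  http = Net::HTTP.new(host, port)
  http.open_timeout = timeout   # must connect within this many seconds
  http.read_timeout = timeout   # ...and answer within this many seconds
  response = http.get("/lookup?topic=#{topic}")
  JSON.parse(response.body)
rescue StandardError
  # Includes Net::OpenTimeout / Net::ReadTimeout; the next discovery pass
  # will simply try again.
  nil
end
```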