thecityofguanyu closed this issue 8 years ago
How many nodes do you have? Does it see that 5min interval on first run after restart?
If the interval is long enough that a single thread can fetch all nodes within it, only a single thread will run. It runs the nodes back-to-back with no particular delay until it has finished the cycle, and then it should wait until the interval has passed.
@ytti
How many nodes do you have?
78 nodes.
Does it see that 5min interval on first run after restart?
No. Upon restarting Oxidized, it immediately begins the next node after the first one finishes.
If the interval is long enough that a single thread can fetch all nodes within it, only a single thread will run. It runs the nodes back-to-back with no particular delay until it has finished the cycle, and then it should wait until the interval has passed.
Forgive me here, but I guess I don't understand where this 5min "interval" is coming from. Is that separate from the "interval" that is defined in the config file? Or am I misunderstanding the configurable "interval" and possibly need to lower it?
My understanding of interval was that it sets how often to fetch configs. I set it to 86400, assuming that it would then fetch the config for each device once per day.
D, [2016-03-01T09:15:19.434189 #14141] DEBUG -- : Jobs 1, Want: 2
This says you already have one job running that hasn't finished. I was having a similar problem with the F5 waiting in a pager and never finishing. This causes the 5 minutes that you mentioned to fetch the next node.
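To make the "Jobs 1, Want: 2" line easier to read, here is a minimal sketch of how a desired job count could be derived from the node count, the average fetch duration, and the interval. It is an illustration under assumptions, not Oxidized's actual code: the helper name `wanted_jobs`, the `max_jobs` ceiling of 30, and the durations are all hypothetical.

```ruby
# Hypothetical helper, not Oxidized's actual API: estimate how many
# concurrent jobs are needed to visit every node once per interval.
def wanted_jobs(node_count:, avg_duration:, interval:, max_jobs:)
  # Total work per cycle (seconds) divided by the time available.
  want = (node_count * avg_duration / interval.to_f).ceil
  # Always keep at least one job, never exceed the configured ceiling.
  want.clamp(1, max_jobs)
end

# 78 nodes at roughly 10 s each fit easily into a daily interval: one job.
healthy = wanted_jobs(node_count: 78, avg_duration: 10, interval: 86_400, max_jobs: 30)

# With an (assumed) average of 1200 s per node, a second job is wanted.
inflated = wanted_jobs(node_count: 78, avg_duration: 1200, interval: 86_400, max_jobs: 30)

puts "healthy=#{healthy} inflated=#{inflated}"
```

Under this model, "Jobs 1, Want: 2" would mean one job is still running while the scheduler would like two, which is consistent with a job that has not finished.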
@danilopopeye
This says you already have one job running that hasn't finished.
Thanks for pointing me in the right direction! I believe I have found the device that is causing the trouble. It's definitely the device, not Oxidized. I'm not sure what's wrong with it yet. It's a JunOS device that sometimes becomes unresponsive in an SSH session, which explains why its job never finishes.
This causes the 5 minutes that you mentioned to fetch the next node.
And thank you for pointing that out.
I do have a question, though. Is there not a mechanism to kill a job should it not complete after X period of time? I assumed it would have been killed via SIGTERM per the configured 60 second "timeout", as I saw in the README:
timeout: hard timeout for the command execution. SIGTERM will be sent to the child process after the timeout has elapsed.
Is there not a mechanism to kill a job should it not complete after X period of time?
@thecityofguanyu after a quick search of the source, I'd say: no. The only timeout code I found is for the initial connection.
Am I talking rubbish, @ytti + @ElvinEfendi?
There should be a timeout for login as well as for each expect. Where are we waiting indefinitely?
I guess we could wrap this over some deadline timer, but hopefully not: https://github.com/ytti/oxidized/blob/master/lib/oxidized/node.rb#L61
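As a rough illustration of what a deadline timer around a fetch could look like, here is a minimal sketch using Ruby's standard `Timeout` module. The wrapper name `fetch_with_deadline`, the deadline values, and the block contents are hypothetical; this is not how Oxidized's `node.rb` is written.

```ruby
require "timeout"

# Hypothetical wrapper: run a node fetch under an overall deadline.
# `deadline` is in seconds; the block stands in for the real fetch.
def fetch_with_deadline(deadline)
  Timeout.timeout(deadline) { yield }
rescue Timeout::Error
  # The fetch overran the deadline; report failure instead of blocking forever.
  :timeout
end

# A fetch that finishes in time returns its result.
ok = fetch_with_deadline(1) { "config text" }

# A session that never returns is abandoned once the deadline passes.
hung = fetch_with_deadline(0.1) { sleep 5 }

p ok, hung
```

One reason to be wary of this approach is that `Timeout.timeout` interrupts the running block at an arbitrary point, so any cleanup (closing the SSH session, for instance) has to tolerate being cut off mid-operation.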
Is this still relevant?
@ytti Not currently. I'll close it.
This seems to impact all nodes, regardless of device model, throughout the entire job. Based on my config, I believe it should run nightly (86400 seconds) and time out if a job fails after one minute, allowing only one retry.
It doesn't seem like it has trouble exiting a device, as the 5 minute interval takes place between the "configuration updated" message and the next "Jobs" message.
Any ideas?
My config file: