ytti / oxidized

Oxidized is a network device configuration backup tool. It's a RANCID replacement!
Apache License 2.0

Five minutes pass between sending exit to a node and the next beginning #351

Closed thecityofguanyu closed 8 years ago

thecityofguanyu commented 8 years ago

This seems to affect every node, regardless of device model, throughout the entire run. Based on my config, I believe it should run nightly (every 86400 seconds), time out a job after one minute, and allow only one retry.

It doesn't seem like it has trouble exiting a device, as the 5 minute interval takes place between the "configuration updated" message and the next "Jobs" message.

D, [2016-03-01T09:05:19.275679 #14141] DEBUG -- : Jobs 1, Want: 2
D, [2016-03-01T09:05:21.690032 #14141] DEBUG -- : SSH: show version @ node1
D, [2016-03-01T09:05:22.292586 #14141] DEBUG -- : SSH: show running-config @ node1
D, [2016-03-01T09:05:24.322021 #14141] DEBUG -- : SSH: exit @ node1
I, [2016-03-01T09:05:25.303752 #14141]  INFO -- : Configuration updated for /node1
D, [2016-03-01T09:10:19.352370 #14141] DEBUG -- : Jobs 1, Want: 2
D, [2016-03-01T09:10:21.900248 #14141] DEBUG -- : SSH: show version @ node2
D, [2016-03-01T09:10:22.502688 #14141] DEBUG -- : SSH: show running-config @ node2
D, [2016-03-01T09:10:24.713715 #14141] DEBUG -- : SSH: exit @ node2
I, [2016-03-01T09:10:25.384255 #14141]  INFO -- : Configuration updated for /node2
D, [2016-03-01T09:15:19.434189 #14141] DEBUG -- : Jobs 1, Want: 2
D, [2016-03-01T09:15:21.828973 #14141] DEBUG -- : SSH: show version @ node3
D, [2016-03-01T09:15:22.432247 #14141] DEBUG -- : SSH: show running-config @ node3
D, [2016-03-01T09:15:24.541796 #14141] DEBUG -- : SSH: exit @ node3
I, [2016-03-01T09:15:25.470966 #14141]  INFO -- : Configuration updated for /node3
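
To put a number on the delay, the timestamps can be compared directly. The following is a small sketch, not part of Oxidized; the helper names `timestamps` and `gaps` are made up for illustration, and the `LOG` string is just a subset of the lines shown above.

```python
import re
from datetime import datetime

# A subset of the log lines quoted above.
LOG = """\
D, [2016-03-01T09:05:19.275679 #14141] DEBUG -- : Jobs 1, Want: 2
I, [2016-03-01T09:05:25.303752 #14141]  INFO -- : Configuration updated for /node1
D, [2016-03-01T09:10:19.352370 #14141] DEBUG -- : Jobs 1, Want: 2
I, [2016-03-01T09:10:25.384255 #14141]  INFO -- : Configuration updated for /node2
D, [2016-03-01T09:15:19.434189 #14141] DEBUG -- : Jobs 1, Want: 2
"""

# Matches the "[<timestamp> #<pid>]" part of each log line.
STAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) #\d+\]")


def timestamps(log, marker):
    """Return the timestamps of all log lines that contain `marker`."""
    found = []
    for line in log.splitlines():
        match = STAMP.search(line)
        if match and marker in line:
            found.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f"))
    return found


def gaps(stamps):
    """Return the seconds elapsed between each pair of consecutive timestamps."""
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


job_gaps = gaps(timestamps(LOG, "Jobs "))
for gap in job_gaps:
    print(f"{gap:.3f} s between consecutive 'Jobs' lines")
```

Measured from one "Jobs" line to the next, both gaps come out at just over 300 seconds, so the spacing looks like a fixed timer between job starts rather than a delay tied to how long each device takes to exit.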

Any ideas?

My config file:


---
username: [redacted]
password: [redacted]
interval: 86400
log: /root/.config/oxidized/log/oxidized.log
debug: true
threads: 30
timeout: 60
retries: 1
prompt: !ruby/regexp /^([\w.@-]+[#>]\s?)$/
rest: [redacted]:80
vars: {}
groups: {}
input:
  default: ssh, telnet
  debug: false
  ssh:
    secure: false
output:
  default: git
  git:
    user: oxidized
    email: oxidized@[redacted]
    repo: "/root/.config/oxidized/devices.git"
source:
  default: csv
  csv:
    file: /root/.config/oxidized/router.db
    delimiter: !ruby/regexp /:/
    map:
      name: 0
      model: 2
      username: 4
      password: 5
      ip: 1
    vars_map:
      enable: 3
model_map:
  cisco: ios
  juniper: junos
  foundry: ironware
  adtran: adtran
hooks:
  vaonet-alert:
    type: exec
    events: [node_fail]
    cmd: 'snmptrap -v 2c -c public [redacted] 1 1.2.3.4.5.6.7.8.9 1 s "$OX_NODE_NAME" 2 s "backup-failed"'

ytti commented 8 years ago

How many nodes do you have? Does it see that 5min interval on first run after restart?

If the interval is so high that all nodes can be fetched within the interval by a single thread, it'll only run a single thread, but it'll run jobs back-to-back with no particular delay until it has finished the cycle. Then it should wait until the interval has passed.
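
One way to read this: the scheduler only needs as many parallel jobs as it takes to visit every node once per interval. Below is a minimal sketch of such a sizing rule, assuming a simple ceiling formula. The function name `wanted_jobs`, its parameters, and the formula itself are illustrative assumptions, not taken from Oxidized's source; the six-second per-node figure is read off the log earlier in this thread.

```python
import math


def wanted_jobs(node_count, avg_duration_s, interval_s, max_threads):
    """Estimate how many parallel jobs are needed to fetch every node once
    per interval (hypothetical sizing rule, not Oxidized's actual code).

    interval_s must be greater than zero.
    """
    # Total work per cycle divided by the time available for one cycle.
    needed = math.ceil(node_count * avg_duration_s / interval_s)
    # Never fewer than one job, never more than the node count or thread limit.
    return max(1, min(needed, node_count, max_threads))


# 78 nodes at roughly 6 s each, once per day, with threads: 30.
print(wanted_jobs(78, 6, 86400, 30))
```

Under these assumptions, 78 nodes at about six seconds each need under eight minutes of work per day, so a single job is enough and the `threads: 30` limit never comes into play.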

thecityofguanyu commented 8 years ago

@ytti

How many nodes do you have?

78 nodes.

Does it see that 5min interval on first run after restart?

No. Upon restarting Oxidized, it immediately begins the next node after the first one finishes.

If the interval is so high that all nodes can be fetched within the interval by a single thread, it'll only run a single thread, but it'll run jobs back-to-back with no particular delay until it has finished the cycle. Then it should wait until the interval has passed.

Forgive me here, but I guess I don't understand where this 5min "interval" is coming from. Is that separate from the "interval" that is defined in the config file? Or am I misunderstanding the configurable "interval" and possibly need to lower it?

My understanding of interval was that it sets how often to fetch configs. I set it to 86400 assuming that it would then fetch the config for each device once per day.

danilopopeye commented 8 years ago

D, [2016-03-01T09:15:19.434189 #14141] DEBUG -- : Jobs 1, Want: 2

This says you already have one job running that hasn't finished. I had a similar problem with an F5 device waiting at a pager and never finishing. This causes the 5-minute wait you mentioned before the next node is fetched.

thecityofguanyu commented 8 years ago

@danilopopeye

This says you already have one job running that hasn't finished.

Thanks for pointing me in the right direction! I believe I have found the device that is causing the trouble. It's definitely the device, not Oxidized. I'm not sure what's wrong with it yet. It's a JunOS device that sometimes becomes unresponsive in an SSH session, which explains why its job never finishes.

This causes the 5-minute wait you mentioned before the next node is fetched.

And thank you for pointing that out.

I do have a question, though. Is there not a mechanism to kill a job should it not complete after X period of time? I assumed it would have been killed via SIGTERM per the configured 60 second "timeout", as I saw in the README:

timeout: hard timeout for the command execution. SIGTERM will be sent to the child process after the timeout has elapsed.
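
For reference, this is what a hard timeout with SIGTERM on a child process generally looks like. It is a generic sketch, not Oxidized's code; the function name `run_with_hard_timeout` and the grace period are assumptions, and nothing in this thread establishes that the README line above applies to node fetch jobs.

```python
import subprocess
import sys


def run_with_hard_timeout(cmd, timeout_s, grace_s=5.0):
    """Run `cmd` as a child process and terminate it if it outlives `timeout_s`.

    Returns a (returncode, timed_out) tuple.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        proc.terminate()  # sends SIGTERM on POSIX systems
        try:
            return proc.wait(timeout=grace_s), True
        except subprocess.TimeoutExpired:
            proc.kill()  # the child ignored SIGTERM, so force it to stop
            return proc.wait(), True


# A child that would sleep for 30 s is stopped after about half a second.
code, timed_out = run_with_hard_timeout(
    [sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=0.5
)
print(code, timed_out)
```

This only works because the work runs in a separate process that can receive a signal. A job running as a thread inside the main process cannot be stopped this way.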

danilopopeye commented 8 years ago

Is there not a mechanism to kill a job should it not complete after X period of time?

@thecityofguanyu from a quick search of the source, I'd say no. The only timeout code I found is for the initial connection.

Am I saying rubbish @ytti + @ElvinEfendi ?

ytti commented 8 years ago

There should be a timeout for login as well as for each expect. Where are we waiting indefinitely?

I guess we could wrap this in some deadline timer, but hopefully that isn't needed: https://github.com/ytti/oxidized/blob/master/lib/oxidized/node.rb#L61
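
As a rough illustration of the deadline-timer idea, here is a sketch that bounds the total time spent waiting on one job, independent of any per-step timeouts inside it. It is written in Python rather than Ruby and is not a proposed patch; `run_with_deadline` and `DeadlineExceeded` are hypothetical names.

```python
import threading
import time


class DeadlineExceeded(Exception):
    """Raised when a job does not finish within its overall deadline."""


def run_with_deadline(job, deadline_s):
    """Run `job()` in a worker thread and wait at most `deadline_s` seconds.

    Returns the job's result, re-raises its exception, or raises
    DeadlineExceeded if the job is still running at the deadline.
    """
    outcome = {}

    def target():
        try:
            outcome["value"] = job()
        except Exception as exc:  # hand the failure back to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(deadline_s)
    if worker.is_alive():
        # The worker is abandoned, not killed; it may keep running.
        raise DeadlineExceeded(f"job still running after {deadline_s} s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


# A job that hangs for 2 s is given up on after 0.1 s.
try:
    run_with_deadline(lambda: time.sleep(2), 0.1)
except DeadlineExceeded as exc:
    print(exc)
```

The caller stops waiting at the deadline, but the stuck worker is only abandoned, so any connection it holds stays open until the worker ends on its own. That trade-off is one reason to prefer reliable timeouts on each login and expect step.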

ytti commented 8 years ago

Is this still relevant?

thecityofguanyu commented 8 years ago

@ytti Not currently. I'll close it.