ytti / oxidized

Oxidized is a network device configuration backup tool. It's a RANCID replacement!
Apache License 2.0

Oxidized "slows Down" #887

Closed (jasonyates closed this issue 6 years ago)

jasonyates commented 7 years ago

I'm not too sure how to describe this issue. Every 1-2 months we see a problem where Oxidized simply stops backing up 90% of our devices. There is nothing in the log file that suggests a problem; the tool is not failing to log in due to connectivity or credential issues, it simply doesn't try. Most devices in the GUI turn blue. It's almost as if it slows down and decides it can only back up 1 device every 5 minutes instead of running its usual threads.

My polling configuration looks like the below, and we have 750 devices:

interval: 3600
threads: 30
timeout: 45
retries: 2

I have a hook configured to push collection metrics into InfluxDB and generate an alert when our success metric drops over a 15-minute period, so the "slow down" can be visualized. We drop from an average of 4 devices every minute to 1 every 5 minutes with no configuration changes on our side.

https://www.dropbox.com/s/12m79h1i4nfhkde/Screenshot%202017-06-12%2006.38.00.png?dl=0

Normal operation is restored by restarting Oxidized.

Thanks

ytti commented 7 years ago

This might be a bit of a pickle to debug, as you need to wait 1-2 months to get more data. For shits and giggles, if it's easy, try updating oxidized, ruby and the kernel to the latest and greatest and see if the problem remains.

laf commented 7 years ago

@jasonyates Have you managed to test updating the software like ytti has said?

InsaneSplash commented 7 years ago

I have about 2000 devices at the moment and I find the same thing happens, but after a much shorter time. I find that I need to restart the service regularly to keep it operational...

jasonyates commented 7 years ago

I have just updated to 0.20.0 (from 0.19.0) along with updating Ruby.

I'll monitor for the next few weeks and see if we get the same slow down.

Is there any form of debugging I can enable to help diagnose?

Thanks

jasonyates commented 7 years ago

So this still happens after the upgrade to 0.20.0; it actually died after only a few hours this time around.

I noticed the following in the logs right before it fell back to only running once every 5 minutes:

W, [2017-09-11T09:02:51.983719 #27673]  WARN -- : 172.x.x.x raised Net::SSH::ConnectionTimeout (rescued RuntimeError) with msg "timeout during server version negotiating"
W, [2017-09-11T09:02:52.818668 #27673]  WARN -- : mydevicename status no_connection, retry attempt 1
I, [2017-09-11T09:08:23.954798 #27673]  INFO -- : Configuration updated for mygroup/mydevice

laf commented 6 years ago

That looks like standard output.

We run about 2.5k+ devices and I can't say I notice issues like this.

interval: 3600
threads: 30
timeout: 20
retries: 3

This is on an E5620 @ 2.40GHz with a total of 16 cores and 48 GB of RAM. We do see about 65% CPU usage and 20% memory usage.

carlofab commented 6 years ago

Hi, I'm seeing the same behaviour in my environment: starting from last Friday, after 70 devices, Oxidized takes 5 minutes to begin the backup of the next device instead of the usual average of 30 seconds. I'm using Oxidized 0.20.0 with Ubuntu 16.04.3, and I use cron to start and stop Oxidized for the daily backup. Any suggestions?

comp-rnelson commented 6 years ago

I'm running into the same issue. I haven't determined exactly how long it takes to "slow down" to 1 node every 5 minutes, but it happens within a day or two of a restart for sure. Also, the gap is exactly 5 minutes between timestamps, plus maybe 1 or 2 seconds.

I'm on v0.20.0:

interval: 14400
threads: 40
timeout: 20
retries: 3

carlofab commented 6 years ago

Hi, today I noticed the issue is back... after 1 week without problems! Currently I'm using Oxidized with 200 devices. Looking at the log file, the 5-minute gaps start about 15 minutes into the run (after 75 devices). I don't know if this helps to understand the reasons... maybe it does!

carlofab commented 6 years ago

Hi, looking into the code I saw this line in oxidized/lib/oxidized/jobs.rb:

 MAX_INTER_JOB_GAP = 300 # add job if more than X from last job started

I reduced this to 30 and the backup seems to be working fine, but I don't know the role of this parameter. I'm going to test this setup for one week and I'll keep you posted on the results. Regards, Carlo
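
For clarity, the change is just that constant's value in lib/oxidized/jobs.rb; the surrounding code is untouched (shown here as a sketch):

 MAX_INTER_JOB_GAP = 30 # add job if more than X from last job started (was 300)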

laf commented 6 years ago

@carlofab Thanks, let us know how you get on.

carlofab commented 6 years ago

@laf After 3 weeks with MAX_INTER_JOB_GAP set to 30 instead of 300, I've noticed that the problem seems to be solved and I have not seen the "slow down" issue. I don't know the role of this timer or whether 30 is too small. Up to now I'm working with 190 devices and the whole backup completes in 50 minutes. Please let me know if I can help with more troubleshooting. Regards, Carlo

laf commented 6 years ago

Also now running this in production so we'll see how it goes.

laf commented 6 years ago

I'm actually now seeing this in our production install even running with the MAX_INTER_JOB_GAP patch.

laf commented 6 years ago

I've added some debugging to our install. Spot the issue:

I, [2018-01-30T16:59:22.299996 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 10 of 10 - ended: 4178 of 3982
I, [2018-01-30T16:59:23.300256 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 10 of 10 - ended: 4178 of 3982
I, [2018-01-30T16:59:24.335873 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 9 of 10 - ended: 4179 of 3982
I, [2018-01-30T16:59:25.372126 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 9 of 10 - ended: 4180 of 3982
I, [2018-01-30T16:59:26.410066 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 9 of 10 - ended: 4181 of 3982
I, [2018-01-30T16:59:27.419916 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 10 of 10 - ended: 4181 of 3982
I, [2018-01-30T16:59:28.420239 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 10 of 10 - ended: 4181 of 3982
I, [2018-01-30T16:59:29.420521 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 10 of 10 - ended: 4181 of 3982

I'm expecting it to get to "Jobs running: 0 of 0" as it peaked at 15 and is now reducing all the time. So it could just be a counter issue somewhere. Will continue to monitor / debug.

laf commented 6 years ago

And this morning:

I, [2018-01-31T08:08:30.429132 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 18 of 9 - ended: 1474 of 3982
I, [2018-01-31T08:08:31.429403 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 18 of 9 - ended: 1474 of 3982
I, [2018-01-31T08:08:32.429878 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 18 of 9 - ended: 1474 of 3982
I, [2018-01-31T08:08:33.430150 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 18 of 9 - ended: 1474 of 3982
I, [2018-01-31T08:08:34.430424 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 18 of 9 - ended: 1474 of 3982
I, [2018-01-31T08:08:35.430705 #15530]  INFO -- : lib/oxidized/worker.rb: Jobs running: 18 of 9 - ended: 1474 of 3982

laf commented 6 years ago

So I have a feeling it's to do with the wanted jobs being reset to lower than the number currently running. I'm going to try some fixes and see if I can get it resolved.

laf commented 6 years ago

More info on this: even though oxidized thinks the job size is X (16 in my current testing), only 4 jobs are actually running at the moment.

Those 4 haven't had their config backed up yet and have been going for at least 30 minutes now. The timeout value is 40 and retries is 3, so they should have timed out by now.

No debug logs currently exist for these devices.

I'll move on to putting more debug output in places to help track this down but if anyone else has any insights that would be ace.

laf commented 6 years ago

Ok, with more debugging I can see that it's the same 4 devices over and over again; these keep getting re-added to the queue until it fills up.

They can't be backed up at present as they are using the wrong credentials, but obviously oxidized shouldn't care about that, yet it does. Will continue to prod.

laf commented 6 years ago

Is anyone else still experiencing this issue (@jasonyates / @comp-rnelson / @carlofab)? I think I've tracked it down for us: with keyboard-interactive set in ssh.rb, some devices (ios and nxos) that fail to auth will hang on the password prompt. This PR introduced the change:

https://github.com/ytti/oxidized/pull/128

I've reverted it in my install and will leave it for a bit to see how it gets on.
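
For anyone who wants to try the same thing, here's a minimal sketch of connecting without keyboard-interactive. It is not the exact change from my install; it assumes the auth method list is passed straight through to Net::SSH, and the hostname and credentials are placeholders.

require 'net/ssh'

# Connect without offering keyboard-interactive, so a device that hangs on an
# interactive password prompt fails authentication instead of wedging a worker.
Net::SSH.start(
  'device.example.net', 'backup_user',
  password:     'secret',
  auth_methods: %w[none publickey password], # 'keyboard-interactive' removed
  timeout:      45
)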

comp-rnelson commented 6 years ago

@laf,

I had set up a scheduled reboot for the middle of every night as a kind of workaround, but I'm going to disable that and try removing the keyboard-interactive option. I should know if that helps within a few days.

laf commented 6 years ago

@comp-rnelson Ace thanks. Out of interest what devices are you backing up?

If you are using Arista, you will need to add this config:

management ssh
   authentication mode password

comp-rnelson commented 6 years ago

@laf junos, ios, ciscosmb, procurve, routeros, and airos. No Arista (yet)

laf commented 6 years ago

So it's definitely affecting ios devices as I have those that exhibited this issue.

laf commented 6 years ago

I've submitted a PR with my fixes: https://github.com/ytti/oxidized/pull/1175/files

laf commented 6 years ago

@comp-rnelson How have you got on with this?

@jasonyates Any chance you can test this as well?

comp-rnelson commented 6 years ago

I made the suggested changes, but it did resolve the issue for me.

laf commented 6 years ago

@comp-rnelson Sorry did or didn't?

comp-rnelson commented 6 years ago

Sorry, it did not resolve the issue

laf commented 6 years ago

Ok, thanks for testing.

jsynack commented 6 years ago

(this may be a little long)

I saw something today that made me think of this issue and a potential cause of it.

If folks are willing to dig a bit to see if it matches what they are seeing - that could be quite helpful to determining if it is this issue.

In my case, oxidized reported that it had 4 running threads; in reality it was only doing 1 thread's worth of work according to the log output (watched for several minutes).

When I looked at the underlying sockets - I found that 3 sockets were to the same remote device - TCP ESTABLISHED etc. But the 3 connections were otherwise idle on the wire. The 4th socket/file handle was properly connecting to many devices over the time window I watched it as it tried to make up for the 3 idle threads.

Firewall devices in between the oxidized server and the target device no longer had those connections listed in their state tables - So they had long since aged the connections out. The remote device also didn't show any active connections from the oxidized host.

But since oxidized was still considering them to be running (e.g. 4 running threads), it was not spawning any additional threads to handle my desired interval / node count. This means that there was only one 'real' worker running, so it was not hitting my desired interval.

The sockets were not being created with generic TCP keepalive per netstat and ss outputs - so the OS/kernel wasn't going to prune them.

It also looks like the default settings for net-ssh are to disable the SSH-level keepalives on connections.

While Oxidized passes a :timeout parameter as an ssh_opt, this only covers the initial connection timeout. Oxidized (and net-ssh) do not activate SSH keepalives via :keepalive => true by default; enabling them may assist in detecting certain types of failures that take place post-connection.

My theory is that at some point the connection is made and the TCP three-way handshake completes, so the :timeout isn't triggered, but some other problem causes the connection to stall and stick around forever, occupying a worker thread that would otherwise be productive.

It is probably a good idea to activate SSH keepalives by default so that net-ssh can properly relay back to oxidized if there is a failure post-connection.

For those that see this - one way to check is to look at the sockets on the oxidized server - and see if they are properly rotating to new devices - or if they stay 'wedged' to the same device (or set of devices).

Commands to run (assuming linux environment) would be commands such as:

pidof oxidized   (this gives you the pid of your running oxidized instance)
lsof -p 123 | grep TCP   (where 123 is the number returned by the first command)
ss -teipo state established '( dport = :ssh )'

The tell-tale sign for me was the same device using up 3 sockets/threads of the oxidized process - and they did not clear after a reasonable amount of time.

As a test for my environment - I have added the keepalive option to the input/ssh.rb ssh_opt. I have set a calendar reminder to check on it in a week to see if there are any non-productive worker threads. If things look clean - I will put in a PR to set this as a default for oxidized.
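
For reference, a minimal sketch of the kind of option I mean, assuming the hash is passed straight through to Net::SSH.start (the hostname and credentials below are placeholders, not my actual settings):

require 'net/ssh'

# Enable SSH-level keepalives so a post-connection stall is detected and the
# session torn down, rather than silently occupying a worker thread forever.
Net::SSH.start(
  'device.example.net', 'backup_user',
  password:           'secret',
  timeout:            20,   # only covers the initial connection
  keepalive:          true, # send SSH keepalive packets once connected
  keepalive_interval: 60,   # seconds between keepalives
  keepalive_maxcount: 3     # drop the session after this many unanswered keepalives
)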

For the curious - the device in question is / was nothing special - it is one of many of the same exact type of model. Not sure why it landed 3 times on that one to be problematic - it may have some other item going on with it - perhaps a TCP packet getting dropped by IPsec tunnels or something else.

This also means that a higher level timer / timeout could be useful to track the underlying input method / worker and clobber it if it appears stale.

laf commented 6 years ago

@jsynack How has your testing got on?

I don't seem to see the same thing at present on our install, which was suffering from this, but we're running my patch from PR #1175. Your suggestion could be the actual fix for this as well, but I approached it differently.

jsynack commented 6 years ago

Greetings @laf ,

I checked on this at the 1-week, 2-week and 3-week marks and found that my oxidized instance is running 'healthy' as far as times go (within 10% of my targeted interval of 30 min).

I wrote a quick script to calculate the time delta between the 'Starting fetching process for ' log messages, with the oxidized process running since Mar 21st. For a given sample device:

Total samples: 866
Avg minutes  : 33
Highest      : 39 (16:02:59 Mar 28, 2018)
Lowest       : 30 (23:25:32 Mar 21, 2018)

This above was an example of a single sample device - but was the same with several other sample devices/tests with similar bounds on the high/low.
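
A sketch of what such a script can look like, for reference; this is not my original script, it just assumes the default oxidized log format and takes the device name and log path as arguments:

#!/usr/bin/env ruby
# Sketch: measure the gaps between "Starting fetching process for <node>" log
# entries for one device and print sample count plus min/avg/max in minutes.
require 'time'

device  = ARGV.fetch(0)                          # e.g. "core-sw1" (placeholder)
logfile = ARGV.fetch(1, '/var/log/oxidized.log') # adjust to your log location

times = []
File.foreach(logfile) do |line|
  next unless line.include?("Starting fetching process for #{device}")
  # Lines look like: I, [2018-03-21T23:25:32.123456 #1234]  INFO -- : ...
  stamp = line[/\[(\S+) /, 1]
  times << Time.parse(stamp) if stamp
end

deltas = times.each_cons(2).map { |a, b| (b - a) / 60.0 }
abort 'Not enough samples' if deltas.empty?

puts "Total samples: #{deltas.size}"
puts format('Avg minutes : %.1f', deltas.sum / deltas.size)
puts format('Highest     : %.1f', deltas.max)
puts format('Lowest      : %.1f', deltas.min)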

I think I would have noticed a larger slow-down by now, but I am continuing to collect data to make sure. I don't see adding the keepalive as a bad thing, and it may help this case IMO.

laf commented 6 years ago

If you want to submit a PR, I don't mind removing my patch and testing this to see if it also helps with our issue, or if they're two different things. My debugging showed the same symptom: certain devices locked some threads until max threads were hit and nothing else was backed up.

wk commented 6 years ago

This could also be related to net-ssh/net-ssh#550

laf commented 6 years ago

So, to keep this thread up to date: #1276 doesn't resolve this issue for us, but it does work as a standalone PR.

no-response[bot] commented 6 years ago

This issue has been automatically closed because there has been no response to our request for more information from the original author. The information that is currently in the issue is insufficient to take further action. Feel free to re-open this issue if additional information becomes available, or if you believe it has been closed in error.

ytti commented 6 years ago

Just for future reference, the only reason why it /should/ slow down is if we're hitting the job limit, or if the job size is equal to the node size and we still can't fetch in time. Either of these can happen if some jobs are hanging and not doing real work.

In any case, it might be interesting to see if the original logic fares any better: https://github.com/ytti/oxidized/blob/aea148d771dca0e10ca1a2f5505446fbc689dad2/lib/oxidized/jobs.rb
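
As a rough illustration of the two limits mentioned above (a deliberate simplification for this thread, not the actual jobs.rb code):

# Simplified sketch of how a wanted-job count can be derived and where the two
# limits come in: the node count and the configured thread maximum.
def wanted_jobs(node_count, avg_duration_s, interval_s, max_threads)
  want = ((node_count * avg_duration_s) / interval_s).ceil # keep up with the interval
  want = 1          if want < 1
  want = node_count if want > node_count  # no point exceeding the node count
  [want, max_threads].min                 # hitting this cap is when runs fall behind
end

# With the 750-node / 3600 s / 30-thread setup from earlier in the thread:
wanted_jobs(750, 10.0, 3600, 30)   # => 3  (healthy: short jobs)
wanted_jobs(750, 300.0, 3600, 30)  # => 30 (hanging jobs drag the average up and pin it at the limit)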

wk commented 6 years ago

#934 seems to be an instance of this or a very similar issue.