yaoweibin / nginx_upstream_check_module

Health checks upstreams for nginx
http://github.com/yaoweibin/nginx_upstream_check_module

Check timeouts with nginx 1.12 #153

Open onitake opened 7 years ago

onitake commented 7 years ago

The upstream_check module seems to be partially incompatible with nginx 1.12. With the new patch, it will build and run, but I am encountering lots of sudden upstream timeouts when the server is under high load.

I did not observe this with nginx 1.8 or 1.10.

The timeouts seem to follow no pattern: sometimes both the main and the backup server go offline at the same time, sometimes multiple vhosts are dropped, and sometimes they all work for a long time without problems.

onitake commented 7 years ago

Correction, it seems this happens with 1.10 as well, though to a lesser extent.

yaoweibin commented 7 years ago

OK, I will check it tomorrow.

Thanks.


onitake commented 7 years ago

Here's a little bit more context:

I use nginx as a reverse proxy for multiple web servers. Some of the vhosts are configured with upstream checks. When I restart nginx during times of high load, I see a lot of timeouts. This causes sites to drop on and off continuously - even if they have multiple upstream servers. After a while (~15-30 minutes), there are fewer timeouts, and I only see occasional fail-overs to the backup servers. Some upstreams will then work for days without issues.

While I was tracing the problem, I found a misconfigured upstream check that had a timeout value larger than the interval (4s vs. 1.5s). This also seemed to have a negative effect, but changing the timeout did not resolve the problem. If such a configuration is indeed wrong, a note in the documentation or a check during startup might be warranted.
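
For reference, a sketch of what a consistent check directive could look like, with the timeout kept below the interval. All addresses, paths, and values here are illustrative, not taken from my actual config:

```nginx
upstream backend {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080 backup;

    # interval and timeout are in milliseconds; keeping
    # timeout below interval means a check always finishes
    # (or fails) before the next one is due
    check interval=3000 rise=2 fall=3 timeout=1500 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}
```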

yaoweibin commented 7 years ago

Could you show me your config file?

onitake commented 6 years ago

Sorry for the delay.

Here is the (anonymised) config for one service: https://gist.github.com/onitake/075afa3ec99a99ac1e9ec17a273b7043
And here's another: https://gist.github.com/onitake/8650f72c82b97c1dfe1c8858f15432ee

Does that help?

onitake commented 6 years ago

I have more of these configurations. The total number of upstream checks is more than 20.

Could it be that they are all executed from the same thread, with a global timer? If some checks take longer, they could delay the others and cause them to time out, which would explain why many upstreams fail at once.
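
If that is the case, a possible mitigation would be to stagger the check intervals so the timers drift apart instead of firing in the same tick. This is just an untested sketch with made-up upstream names, addresses, and values:

```nginx
upstream service_a {
    server 10.0.0.1:8080;
    # hypothetical: give each upstream a slightly different
    # interval so the checks don't all run at once
    check interval=3000 timeout=1000 type=tcp;
}

upstream service_b {
    server 10.0.0.2:8080;
    check interval=3700 timeout=1000 type=tcp;
}
```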

vozerov commented 6 years ago

+1, same thing here - I tried setting default_down=false, but after 2 checks nginx logged "no live upstreams".

`check interval=10000 fall=2 rise=1 timeout=5000 default_down=false type=tcp;`

Ubuntu 16.04, nginx 1.12.1, built with:

```
$ nginx -V
nginx version: nginx/1.12.1
built by gcc 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
built with OpenSSL 1.0.2g-fips 1 Mar 2016 (running with OpenSSL 1.0.2g 1 Mar 2016)
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-http_geoip_module --add-module=/home/vozerov/ngx_devel_kit --add-module=/home/vozerov/nginx_upstream_check_module --add-module=/home/vozerov/form-input-nginx-module --add-module=/home/vozerov/ngx_http_geoip2_module --add-module=/home/vozerov/lua-nginx-module --add-module=/home/vozerov/nginx-module-vts --with-cc-opt='-g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'
```

vozerov commented 6 years ago

@onitake hey! Could you please confirm that nginx 1.8 works fine under high load with this module? Thanks!

onitake commented 6 years ago

I can neither confirm nor deny that, since I haven't used 1.8 in a long time. I do remember that I didn't observe these errors before 1.10, but that could also have had a different cause.

It would make sense, though: Tengine is still based on nginx 1.8 and includes this module.

vozerov commented 6 years ago

@onitake @yaoweibin I found the issue in my case - it wasn't the upstream_check module. I had set up nginx with 8 workers, and it seems all of them were using too much CPU (~95% per worker). I increased the worker count to 16, and it reloads fine for now.
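
For anyone hitting the same symptom: the relevant knob lives in the main (top-level) context of nginx.conf. A minimal sketch - 16 is just the value that worked for me, `auto` sizes the pool to the number of CPU cores:

```nginx
# main context of nginx.conf, outside http {}
worker_processes 16;   # or: worker_processes auto;
```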