winshining / nginx-http-flv-module

A media streaming server based on nginx-rtmp-module. In addition to the features nginx-rtmp-module provides, HTTP-FLV, GOP cache, VHosts (one IP for multiple domain names) and JSON-style statistics are now supported.
BSD 2-Clause "Simplified" License

Upgrade to nginx 1.24 auto_push causes looping signal 29 and signal 17 crash #238

Closed: streamingsystems closed this issue 1 year ago

streamingsystems commented 1 year ago

Hi All,

We recently upgraded to nginx 1.24 and compiled this module as we always have. The module still works great; however, after an nginx "reload" we see the following in our nginx error.log, it keeps repeating forever, and it leaves the module inoperable. When we do an nginx "restart", everything goes back to working fine.

When I turn auto_push "off" the problem does not happen, so it's something specific to auto_push.

This has been working fine for years up until a recent upgrade to nginx 1.24.

Any help would be appreciated.

2023/05/13 10:27:04 [notice] 2190648#2190648: start worker process 3310276
2023/05/13 10:27:04 [debug] 3310275#3310275: auto_push: creating sockets
2023/05/13 10:27:04 [debug] 3310275#3310275: auto_push: create socket '/tmp/nginx-http-flv.9'
2023/05/13 10:27:04 [debug] 3310275#3310275: auto_push: delete existing socket '/tmp/nginx-http-flv.9'
2023/05/13 10:27:04 [debug] 3310276#3310276: auto_push: creating sockets
2023/05/13 10:27:04 [debug] 3310276#3310276: auto_push: create socket '/tmp/nginx-http-flv.15'
2023/05/13 10:27:04 [debug] 3310276#3310276: auto_push: delete existing socket '/tmp/nginx-http-flv.15'
2023/05/13 10:27:04 [notice] 2190648#2190648: signal 17 (SIGCHLD) received from 3310270
2023/05/13 10:27:04 [alert] 2190648#2190648: worker process 3310270 exited on signal 11
2023/05/13 10:27:04 [alert] 2190648#2190648: worker process 3310272 exited on signal 11
2023/05/13 10:27:04 [notice] 2190648#2190648: start worker process 3310277
2023/05/13 10:27:04 [notice] 2190648#2190648: start worker process 3310278
2023/05/13 10:27:04 [notice] 2190648#2190648: signal 29 (SIGIO) received
2023/05/13 10:27:04 [notice] 2190648#2190648: signal 17 (SIGCHLD) received from 3310272
2023/05/13 10:27:04 [alert] 2190648#2190648: worker process 3310269 exited on signal 11
2023/05/13 10:27:04 [notice] 2190648#2190648: start worker process 3310279
2023/05/13 10:27:04 [debug] 3310277#3310277: auto_push: creating sockets
2023/05/13 10:27:04 [debug] 3310277#3310277: auto_push: create socket '/tmp/nginx-http-flv.10'
2023/05/13 10:27:04 [debug] 3310277#3310277: auto_push: delete existing socket '/tmp/nginx-http-flv.10'
2023/05/13 10:27:04 [debug] 3310278#3310278: auto_push: creating sockets
2023/05/13 10:27:04 [debug] 3310278#3310278: auto_push: create socket '/tmp/nginx-http-flv.11'
2023/05/13 10:27:04 [debug] 3310278#3310278: auto_push: delete existing socket '/tmp/nginx-http-flv.11'
2023/05/13 10:27:04 [debug] 3310279#3310279: auto_push: creating sockets
2023/05/13 10:27:04 [debug] 3310279#3310279: auto_push: create socket '/tmp/nginx-http-flv.12'
2023/05/13 10:27:04 [debug] 3310279#3310279: auto_push: delete existing socket '/tmp/nginx-http-flv.12'
2023/05/13 10:27:04 [notice] 2190648#2190648: signal 29 (SIGIO) received
2023/05/13 10:27:04 [notice] 2190648#2190648: signal 29 (SIGIO) received
2023/05/13 10:27:04 [notice] 2190648#2190648: signal 17 (SIGCHLD) received from 3310271
2023/05/13 10:27:04 [alert] 2190648#2190648: worker process 3310271 exited on signal 11
2023/05/13 10:27:04 [notice] 2190648#2190648: start worker process 3310280
2023/05/13 10:27:04 [debug] 3310280#3310280: auto_push: creating sockets
2023/05/13 10:27:04 [debug] 3310280#3310280: auto_push: create socket '/tmp/nginx-http-flv.14'
2023/05/13 10:27:04 [debug] 3310280#3310280: auto_push: delete existing socket '/tmp/nginx-http-flv.14'
2023/05/13 10:27:04 [notice] 2190648#2190648: signal 17 (SIGCHLD) received from 3310276
2023/05/13 10:27:04 [alert] 2190648#2190648: worker process 3310276 exited on signal 11
2023/05/13 10:27:04 [notice] 2190648#2190648: start worker process 3310281
2023/05/13 10:27:04 [notice] 2190648#2190648: signal 29 (SIGIO) received
2023/05/13 10:27:04 [notice] 2190648#2190648: signal 29 (SIGIO) received
2023/05/13 10:27:04 [notice] 2190648#2190648: signal 17 (SIGCHLD) received from 3310273
2023/05/13 10:27:04 [alert] 2190648#2190648: worker process 3310273 exited on signal 11
2023/05/13 10:27:04 [alert] 2190648#2190648: worker process 3310274 exited on signal 11
2023/05/13 10:27:04 [notice] 2190648#2190648: start worker process 3310282
2023/05/13 10:27:04 [notice] 2190648#2190648: start worker process 3310283
2023/05/13 10:27:04 [notice] 2190648#2190648: signal 17 (SIGCHLD) received from 3310274
2023/05/13 10:27:04 [debug] 3310281#3310281: auto_push: creating sockets
2023/05/13 10:27:04 [debug] 3310281#3310281: auto_push: create socket '/tmp/nginx-http-flv.15'
2023/05/13 10:27:04 [debug] 3310281#3310281: auto_push: delete existing socket '/tmp/nginx-http-flv.15'
2023/05/13 10:27:04 [debug] 3310282#3310282: auto_push: creating sockets
2023/05/13 10:27:04 [debug] 3310282#3310282: auto_push: create socket '/tmp/nginx-http-flv.8'
2023/05/13 10:27:04 [debug] 3310282#3310282: auto_push: delete existing socket '/tmp/nginx-http-flv.8'
2023/05/13 10:27:04 [debug] 3310283#3310283: auto_push: creating sockets
2023/05/13 10:27:04 [debug] 3310283#3310283: auto_push: create socket '/tmp/nginx-http-flv.13'
2023/05/13 10:27:04 [debug] 3310283#3310283: auto_push: delete existing socket '/tmp/nginx-http-flv.13'

streamingsystems commented 1 year ago

More info. The stack trace:

#0  0x00007faa26ebe034 in ngx_rtmp_auto_push_init_process (cycle=) at flv-1.24.0/ngx_rtmp_auto_push_module.c:323
#1  ngx_rtmp_auto_push_init_process (cycle=) at flv-1.24.0/ngx_rtmp_auto_push_module.c:128
#2  0x00005596ef988776 in ngx_worker_process_init (cycle=cycle@entry=0x5596f1294660, worker=worker@entry=0) at src/os/unix/ngx_process_cycle.c:893
#3  0x00005596ef988c5f in ngx_worker_process_cycle (cycle=cycle@entry=0x5596f1294660, data=data@entry=0x0) at src/os/unix/ngx_process_cycle.c:706
#4  0x00005596ef98759f in ngx_spawn_process (cycle=cycle@entry=0x5596f1294660, proc=0x5596ef988c30 , data=0x0, name=0x5596efa34c27 "worker process", respawn=respawn@entry=1) at src/os/unix/ngx_process.c:199
#5  0x00005596ef989d27 in ngx_reap_children (cycle=0x5596f1294660) at src/os/unix/ngx_process_cycle.c:598
#6  ngx_master_process_cycle (cycle=0x5596f1294660) at src/os/unix/ngx_process_cycle.c:174
#7  0x00005596ef95f698 in main (argc=, argv=) at src/core/nginx.c:383

I am wondering if it has to do with this fix in nginx:

https://github.com/nginx/nginx/commit/d52e5684437aedd6bb74d6b2b602b7306f4212ba

Or this fix:

https://github.com/nginx/nginx/commit/593bb543873ac57efade09dbcc2f93ab6b4d5ab0

winshining commented 1 year ago

Your nginx.conf file?

My nginx.conf:

worker_processes  auto;
...
rtmp_auto_push on;

Nginx version:

root@debian:~# /usr/local/nginx/sbin/nginx -V
nginx version: nginx/1.24.0
built by gcc 10.2.1 20210110 (Debian 10.2.1-6) 
built with OpenSSL 1.1.1n  15 Mar 2022
TLS SNI support enabled
configure arguments: --with-debug --add-module=../nginx-http-flv-module

Executed "reload" but didn't reproduce the bug:

2023/05/14 16:14:14 [notice] 14873#0: signal process started
2023/05/14 16:15:28 [notice] 14976#0: signal process started
2023/05/14 16:18:08 [notice] 15098#0: signal process started
2023/05/14 16:28:53 [notice] 15472#0: signal process started
streamingsystems commented 1 year ago

I fixed this bug.

This is what I found:

In ngx_rtmp_auto_push_module.c, at line 323, there is this code:

                old = ls->previous->connection;

                if (ngx_del_event(old->read, NGX_READ_EVENT, NGX_CLOSE_EVENT)
                    == NGX_ERROR)
                {
                    return NGX_ERROR;
                }

                old->fd = (ngx_socket_t) -1;

What I am seeing is that "old" is NULL, so when this code executes it crashes because it tries to dereference old->read on a NULL pointer.

It happens on every server we have when we issue a "reload".

To fix it I made this change:

            old = ls->previous->connection;
            if (old) {
                if (ngx_del_event(old->read, NGX_READ_EVENT, NGX_CLOSE_EVENT)
                    == NGX_ERROR)
                {
                    return NGX_ERROR;
                }

                old->fd = (ngx_socket_t) -1;
            }

It now reloads perfectly in our entire environment.

I am wondering what conditions in the new 1.24 cause ls->previous->connection to be NULL, which makes the old variable NULL and the old->read dereference crash.

/sbin/nginx -V
nginx version: nginx/1.24.0
built by gcc 8.5.0 20210514 (Red Hat 8.5.0-16) (GCC)
built with OpenSSL 1.1.1k FIPS 25 Mar 2021
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib64/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-openssl-opt=enable-ktls --with-cc-opt='-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -pie'

We have our configs broken up into different files, but here are the relevant parts:

rtmp_auto_push on;

rtmp {
    out_queue           4096;
    out_cork            8;
    max_streams         128;
    timeout             10s;
    drop_idle_publisher 10s;
    respawn             off;

    server {
        .....
    }
}

user nginx;
worker_processes auto;
worker_rlimit_nofile 32768;
pid /run/nginx.pid;

include /usr/share/nginx/modules/*.conf;

events {
    worker_connections 2048;
    use epoll;
    multi_accept on;
}

winshining commented 1 year ago

I found the reason: if the module is compiled as a dynamic module, the bug is 100% reproducible. The change you mentioned above doesn't fully resolve the issue and will lead to a socket leak. I will fix the issue soon. Thank you for reporting the bug!
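For readers following along, here is a minimal sketch of what a leak-free variant of the guard could look like, assuming the stale listening descriptor is still held in ls->previous->fd inside the ngx_rtmp_auto_push_init_process() loop quoted earlier. It only illustrates the trade-off described above; it is not the fix that was actually committed to the repository:

            old = ls->previous->connection;

            if (old) {
                /* detach the inherited connection from the event loop,
                 * exactly as the original code path does */
                if (ngx_del_event(old->read, NGX_READ_EVENT, NGX_CLOSE_EVENT)
                    == NGX_ERROR)
                {
                    return NGX_ERROR;
                }

                old->fd = (ngx_socket_t) -1;

            } else if (ls->previous->fd != (ngx_socket_t) -1) {
                /* no connection object to detach; close the old listening
                 * socket explicitly so its descriptor is not leaked */
                ngx_close_socket(ls->previous->fd);
                ls->previous->fd = (ngx_socket_t) -1;
            }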

streamingsystems commented 1 year ago

Great, thanks for the information. While you're in there fixing that bug, do you want to also look at the proxy protocol and why it's not working? I can show you the code changes that I made.


winshining commented 1 year ago

Fixed, please fetch the latest code to test. I will take a look at the proxy protocol issue you posted to me in my spare time.