Opened 4 years ago

Closed 3 years ago

#1101 closed defect (worksforme)

nginx does not respond to SIGHUP/SIGUSR2

Reported by: tiandrey@…
Owned by:
Priority: major
Milestone:
Component: nginx-core
Version: 1.10.x
Keywords:
Cc:
uname -a: Linux server03 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.10.1
built by gcc 4.9.2 (Debian 4.9.2-10)
built with OpenSSL 1.0.1k 8 Jan 2015 (running with OpenSSL 1.0.1t 3 May 2016)
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/run/nginx.pid --lock-path=/run/lock/nginx.lock --http-client-body-temp-path=/run/shm/body_temp --http-proxy-temp-path=/run/shm/proxy_temp --http-fastcgi-temp-path=/run/shm/fastcgi_temp --http-uwsgi-temp-path=/run/shm/uwsgi_temp --http-scgi-temp-path=/run/shm/scgi_temp --with-http_ssl_module --with-http_realip_module --with-http_sub_module --with-http_dav_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_stub_status_module --with-mail --with-http_v2_module --with-http_geoip_module --with-select_module --with-http_auth_request_module --with-poll_module --with-debug

Description

We have nginx installed on frontend servers; the configuration is quite simple - nginx decrypts HTTPS traffic and proxies it to backend servers.

The problem is that sometimes nginx stops responding to SIGHUP for configuration updates and to SIGUSR2 for binary updates; it seems to just ignore the signals and do nothing. Today I even spotted some zombie nginx processes.

I don't see any pattern in this behavior - out of our ten servers there may be two with correctly working nginx and eight where nginx ignores signals, or vice versa, although the hardware and software configuration of the servers is identical.

Here is a little example of what I am talking about:

root@server03 ~ # ps axjf | grep [n]ginx
    1  2078  2078  2078 ?           -1 Ss       0   0:03 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
 2078 16829  2078  2078 ?           -1 S    65534 1399:13  \_ nginx: worker process                           
 2078 16830  2078  2078 ?           -1 S    65534 1396:36  \_ nginx: worker process                           
 2078 16831  2078  2078 ?           -1 R    65534 1396:16  \_ nginx: worker process                           
 2078 16832  2078  2078 ?           -1 S    65534 1392:03  \_ nginx: worker process                           
 2078 16833  2078  2078 ?           -1 S    65534 1390:33  \_ nginx: worker process                           
 2078 16834  2078  2078 ?           -1 S    65534 1395:36  \_ nginx: worker process                           
 2078 16835  2078  2078 ?           -1 R    65534 1394:23  \_ nginx: worker process                           
 2078 16836  2078  2078 ?           -1 R    65534 1385:00  \_ nginx: worker process                           
 2078 16837  2078  2078 ?           -1 S    65534 1409:25  \_ nginx: worker process                           
 2078 16839  2078  2078 ?           -1 Z    65534 1296:41  \_ [nginx] <defunct>
 2078 16840  2078  2078 ?           -1 S    65534 1399:22  \_ nginx: worker process                           
 2078 16841  2078  2078 ?           -1 Z    65534 1294:17  \_ [nginx] <defunct>
 2078 16842  2078  2078 ?           -1 R    65534 1401:06  \_ nginx: worker process                           
 2078 16843  2078  2078 ?           -1 R    65534 1398:11  \_ nginx: worker process                           
 2078 16844  2078  2078 ?           -1 R    65534 1398:19  \_ nginx: worker process                           
 2078 16845  2078  2078 ?           -1 S    65534   4:22  \_ nginx: cache manager process                    
 2078 54245  2078  2078 ?           -1 R    65534 533:57  \_ nginx: worker process                           
root@server03 ~ # cat /run/nginx.pid 
2078
root@server03 ~ # kill -HUP $(cat /run/nginx.pid)
root@server03 ~ # kill -USR2 $(cat /run/nginx.pid)
root@server03 ~ # ps axjf | grep [n]ginx         
    1  2078  2078  2078 ?           -1 Ss       0   0:03 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
 2078 16829  2078  2078 ?           -1 R    65534 1399:41  \_ nginx: worker process                           
 2078 16830  2078  2078 ?           -1 R    65534 1397:04  \_ nginx: worker process                           
 2078 16831  2078  2078 ?           -1 R    65534 1396:44  \_ nginx: worker process                           
 2078 16832  2078  2078 ?           -1 S    65534 1392:32  \_ nginx: worker process                           
 2078 16833  2078  2078 ?           -1 S    65534 1391:01  \_ nginx: worker process                           
 2078 16834  2078  2078 ?           -1 S    65534 1396:07  \_ nginx: worker process                           
 2078 16835  2078  2078 ?           -1 S    65534 1394:51  \_ nginx: worker process                           
 2078 16836  2078  2078 ?           -1 S    65534 1385:29  \_ nginx: worker process                           
 2078 16837  2078  2078 ?           -1 R    65534 1409:54  \_ nginx: worker process                           
 2078 16839  2078  2078 ?           -1 Z    65534 1296:41  \_ [nginx] <defunct>
 2078 16840  2078  2078 ?           -1 S    65534 1399:51  \_ nginx: worker process                           
 2078 16841  2078  2078 ?           -1 Z    65534 1294:17  \_ [nginx] <defunct>
 2078 16842  2078  2078 ?           -1 S    65534 1401:35  \_ nginx: worker process                           
 2078 16843  2078  2078 ?           -1 S    65534 1398:40  \_ nginx: worker process                           
 2078 16844  2078  2078 ?           -1 R    65534 1398:46  \_ nginx: worker process                           
 2078 16845  2078  2078 ?           -1 S    65534   4:22  \_ nginx: cache manager process                    
 2078 54245  2078  2078 ?           -1 S    65534 534:25  \_ nginx: worker process                           
root@server03 ~ # nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

After a restart nginx works fine for some time, usually several days.

Change History (9)

comment:1 by Maxim Dounin, 4 years ago

The "ps" output suggests that at least one worker process exited unexpectedly and was respawned (note that pid 16838 is missing and there is a new process with pid 54245). Try looking into logs to see why this happened. Other lines from the error log may also help to understand what's going on.

You may also want to provide the full configuration of nginx, and make sure no 3rd party modules are loaded in your configuration. Checking the servers' memory for errors may also be a good idea, since a problem that appears only on some servers suggests it may be hardware-related.

comment:2 by tiandrey@…, 4 years ago

Try looking into logs to see why this happened. Other lines from the error log may also help to understand what's going on.

How can I log system events only? The servers are under high load, so if I log everything that happens there, I'll just get tons and tons of useless logs. Also there are many different server {} blocks, so I'm not sure where this should be configured.

You may also want to provide the full configuration of nginx

Unfortunately I'm not sure this is possible, but I can provide specific parts of configuration.

and make sure no 3rd party modules are loaded in your configuration

Right now the wallarm module is loaded, but that configuration is only about a month old. Earlier no dynamic modules were loaded, and the symptoms were the same, even on previous nginx versions (at least 1.8).

Checking the servers' memory for errors may also be a good idea, since a problem that appears only on some servers suggests it may be hardware-related.

Nah, there are no servers that are always faulty; it's a kind of random thing. Also the hardware is relatively new and reliable - HP servers with HP-branded ECC RAM, deployed in January 2016.

in reply to:  2 comment:3 by Maxim Dounin, 4 years ago

Replying to tiandrey@…:

Try looking into logs to see why this happened. Other lines from the error log may also help to understand what's going on.

How can I log system events only? The servers are under high load, so if I log everything that happens there, I'll just get tons and tons of useless logs. Also there are many different server {} blocks, so I'm not sure where this should be configured.

System events are logged to the global error log as set at the global level (i.e., outside of the http{} block). You can override logging at any level, including http{} and server{}, and make logs less verbose there. Though in general it is a good idea to have logs everywhere at least at the "crit" level. See the docs for details.
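For example, a minimal layout along these lines (the log paths and levels here are illustrative, not taken from the ticket) keeps a global error_log outside http{} and only reduces verbosity inside it:

    # global level - records signal handling, worker crashes, respawns
    error_log   /var/log/nginx/error.log notice;

    http {
        # per-level override; keep at least "crit" so fatal events still show up
        error_log   /var/log/nginx/error-http.log crit;
        # ... the rest of the http configuration ...
    }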

comment:4 by psukharev.avito.ru@…, 4 years ago

Actually we have some logs from the host (not from all vhosts); nothing unusual:

2016/10/11 15:35:07 [crit] 19143#0: *3103397054 SSL_do_handshake() failed (SSL: error:14094458:SSL routines:SSL3_READ_BYTES:tlsv1 unrecognized name:SSL alert number 112) while SSL handshaking, client: 217.151.131.23, server: 0.0.0.0:443
2016/10/11 15:35:30 [crit] 19154#0: *3103567869 SSL_do_handshake() failed (SSL: error:1006706B:elliptic curve routines:ec_GFp_simple_oct2point:point is not on curve error:1408B010:SSL routines:SSL3_GET_CLIENT_KEY_EXCHANGE:EC lib) while SSL handshaking, client: 83.149.45.121, server: 0.0.0.0:443
2016/10/11 15:39:08 [crit] 19147#0: *3105238623 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 78.25.122.213, server: 0.0.0.0:443
2016/10/11 15:42:25 [crit] 19150#0: *3106741156 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 109.184.206.140, server: 0.0.0.0:443
2016/10/11 15:48:42 [crit] 19144#0: *3109637346 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 85.113.45.87, server: 0.0.0.0:443
2016/10/11 15:48:42 [crit] 19144#0: *3109637345 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 85.113.45.87, server: 0.0.0.0:443
2016/10/11 15:54:52 [crit] 19143#0: *3112483266 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 178.184.111.69, server: 0.0.0.0:443
2016/10/11 15:56:05 [crit] 19155#0: *3113036656 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 178.184.111.69, server: 0.0.0.0:443
2016/10/11 15:59:12 [crit] 19155#0: *3114459830 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 80.87.154.101, server: 0.0.0.0:443
2016/10/11 15:59:15 [crit] 19158#0: *3114481931 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 80.87.154.101, server: 0.0.0.0:443
2016/10/11 15:59:22 [crit] 19147#0: *3114535029 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 80.87.154.101, server: 0.0.0.0:443
2016/10/11 15:59:23 [crit] 19157#0: *3114542511 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 80.87.154.101, server: 0.0.0.0:443
2016/10/11 15:59:34 [crit] 19154#0: *3114628224 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 83.149.19.136, server: 0.0.0.0:443

comment:5 by Maxim Dounin, 4 years ago

You need to look into the global logs, not the logs of a particular server{} block. The ps output provided above suggests there should be at least one alert about an abnormal worker process exit.

comment:6 by psukharev.avito.ru@…, 4 years ago

This is the log configuration from the http{} section of the main nginx.conf:

    access_log  off;
    error_log   /var/log/nginx/error.log notice;

We can't reproduce it, but we can take some action next time it happens - strace, for example. What data would be helpful?

comment:7 by Maxim Dounin, 4 years ago

Quoting comment:3:

System events are logged to global error log as set at global level (i.e., outside of the http{} block).

Note "outside". Logs configured inside http{} section contain http-specific information and won't help in debugging low-level problems like the one you are facing.

First of all, I would recommend configuring appropriate logging and making sure you see global events like signal handling in the logs. Try sending something like SIGUSR1 to nginx and check that you see the corresponding logging, for example:

2016/10/18 16:50:22 [notice] 82337#100109: signal 30 (SIGUSR1) received, reopening logs
2016/10/18 16:50:22 [notice] 82337#100109: reopening logs
2016/10/18 16:50:22 [notice] 82338#100576: reopening logs

And once the problem happens again, check the logs to see if there are any problems logged. Additionally, try sending some signals to nginx to see whether nginx receives them. And if it doesn't, find out where it is blocked. This can be done, for example, by looking at the wait channel as reported by sudo ps -alx (note that it needs root to report wait channels for other users' processes).
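If your ps supports output-format selection, the wait channel can also be printed directly; the column list below is an assumption about a procps-style ps:

    sudo ps axo pid,stat,wchan:30,args | grep '[n]ginx'

A worker stuck writing to a full pipe would typically show a wait channel such as pipe_wait or pipe_write, while a normally idle worker sits in the event wait (e.g. ep_poll).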

comment:8 by psukharev.avito.ru@…, 4 years ago

I think we've got it and know how to reproduce it.
We use a Python script as a log analyzer, fed via a pipe file, to temporarily ban IP addresses:

access_log /var/log/nginx.pipe;

Steps:
0) Make a pipe file: mkfifo /var/log/nginx.pipe
1) Take the default nginx config and point access_log at the pipe file, e.g. /var/log/nginx.pipe
2) In a separate console, run "tail -f /var/log/nginx.pipe"
3) Start nginx and use a load tool (siege in my case) to fetch the main page
4) Close the tail process and run it again (you will no longer see new logs, but nginx as a whole keeps working)
5) Run "service nginx reload" several times and you will see in the process tree:

root@test3:~# ps axjf | grep [n]ginx | grep -v tail
    1  1896  1896  1896 ?           -1 Rs       0   0:01 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
 1896  6196  1896  1896 ?           -1 Z       33   0:01  \_ [nginx] <defunct>
 1896  6203  1896  1896 ?           -1 S       33   0:01  \_ nginx: worker process is shutting down          
 1896  6207  1896  1896 ?           -1 Z       33   0:01  \_ [nginx] <defunct>
 1896  6211  1896  1896 ?           -1 S       33   0:01  \_ nginx: worker process is shutting down          
 1896  6258  1896  1896 ?           -1 R       33   0:00  \_ nginx: worker process                           
 1896  6259  1896  1896 ?           -1 R       33   0:00  \_ nginx: worker process                           
 1896  6260  1896  1896 ?           -1 R       33   0:00  \_ nginx: worker process                           

but in this test case the nginx master repairs it.
In production we have over 100 000 active connections per HW node, and it does not seem to heal itself.
For now we have disabled access_log via the pipe and are watching the situation.
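For what it's worth, the blocking behaviour of FIFOs can be shown without nginx at all: opening a FIFO for writing blocks until some reader opens the other end (and writes can likewise stall once the pipe buffer fills). A minimal sketch of the open-side blocking, using an arbitrary temp path:

```shell
# Opening a FIFO for writing blocks until a reader appears; a process
# logging to such a pipe can therefore hang and miss its shutdown,
# much like the "is shutting down" / defunct workers above.
pipe=$(mktemp -u /tmp/fifo-demo.XXXXXX)          # hypothetical temp path
mkfifo "$pipe"

( head -c 1048576 /dev/zero > "$pipe" ) &        # writer blocks in open(): no reader yet
writer=$!
sleep 1

if kill -0 "$writer" 2>/dev/null; then status=blocked; else status=done; fi
echo "writer is $status"                         # prints "writer is blocked"

cat "$pipe" > /dev/null                          # attach a reader: writer unblocks and finishes
wait "$writer"
rm -f "$pipe"
```

Since nginx reopens its log files on reload, a worker that reopens an access_log FIFO while no reader is attached would stall in exactly this way - a plausible reading of the steps above, though the ticket itself does not pin down the exact blocking point.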

comment:9 by Maxim Dounin, 3 years ago

Resolution: worksforme
Status: new → closed

Feedback timeout.
