Opened 8 years ago
Closed 8 years ago
#1101 closed defect (worksforme)
nginx does not respond to SIGHUP/SIGUSR2
| Reported by: | | Owned by: | |
|---|---|---|---|
| Priority: | major | Milestone: | |
| Component: | nginx-core | Version: | 1.10.x |
| Keywords: | | Cc: | |
uname -a: Linux server03 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux
nginx -V:
nginx version: nginx/1.10.1
built by gcc 4.9.2 (Debian 4.9.2-10)
built with OpenSSL 1.0.1k 8 Jan 2015 (running with OpenSSL 1.0.1t 3 May 2016)
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/run/nginx.pid --lock-path=/run/lock/nginx.lock --http-client-body-temp-path=/run/shm/body_temp --http-proxy-temp-path=/run/shm/proxy_temp --http-fastcgi-temp-path=/run/shm/fastcgi_temp --http-uwsgi-temp-path=/run/shm/uwsgi_temp --http-scgi-temp-path=/run/shm/scgi_temp --with-http_ssl_module --with-http_realip_module --with-http_sub_module --with-http_dav_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_stub_status_module --with-mail --with-http_v2_module --with-http_geoip_module --with-select_module --with-http_auth_request_module --with-poll_module --with-debug
Description
We have nginx installed on frontend servers; the configuration is quite simple - nginx decrypts HTTPS traffic and proxies it to backend servers.
The problem is that sometimes nginx stops responding to SIGHUP for configuration updates and to SIGUSR2 for binary updates; it seems to simply ignore the signals and do nothing. Today I've even spotted some zombie nginx processes.
I don't see any pattern in this behavior - out of our ten servers there may be two with correctly working nginx and eight where nginx ignores signals, or vice versa, even though the hardware and software configuration of the servers is identical.
Here is a little example of what I am talking about:
root@server03 ~ # ps axjf | grep [n]ginx
1 2078 2078 2078 ? -1 Ss 0 0:03 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
2078 16829 2078 2078 ? -1 S 65534 1399:13 \_ nginx: worker process
2078 16830 2078 2078 ? -1 S 65534 1396:36 \_ nginx: worker process
2078 16831 2078 2078 ? -1 R 65534 1396:16 \_ nginx: worker process
2078 16832 2078 2078 ? -1 S 65534 1392:03 \_ nginx: worker process
2078 16833 2078 2078 ? -1 S 65534 1390:33 \_ nginx: worker process
2078 16834 2078 2078 ? -1 S 65534 1395:36 \_ nginx: worker process
2078 16835 2078 2078 ? -1 R 65534 1394:23 \_ nginx: worker process
2078 16836 2078 2078 ? -1 R 65534 1385:00 \_ nginx: worker process
2078 16837 2078 2078 ? -1 S 65534 1409:25 \_ nginx: worker process
2078 16839 2078 2078 ? -1 Z 65534 1296:41 \_ [nginx] <defunct>
2078 16840 2078 2078 ? -1 S 65534 1399:22 \_ nginx: worker process
2078 16841 2078 2078 ? -1 Z 65534 1294:17 \_ [nginx] <defunct>
2078 16842 2078 2078 ? -1 R 65534 1401:06 \_ nginx: worker process
2078 16843 2078 2078 ? -1 R 65534 1398:11 \_ nginx: worker process
2078 16844 2078 2078 ? -1 R 65534 1398:19 \_ nginx: worker process
2078 16845 2078 2078 ? -1 S 65534 4:22 \_ nginx: cache manager process
2078 54245 2078 2078 ? -1 R 65534 533:57 \_ nginx: worker process
root@server03 ~ # cat /run/nginx.pid
2078
root@server03 ~ # kill -HUP $(cat /run/nginx.pid)
root@server03 ~ # kill -USR2 $(cat /run/nginx.pid)
root@server03 ~ # ps axjf | grep [n]ginx
1 2078 2078 2078 ? -1 Ss 0 0:03 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
2078 16829 2078 2078 ? -1 R 65534 1399:41 \_ nginx: worker process
2078 16830 2078 2078 ? -1 R 65534 1397:04 \_ nginx: worker process
2078 16831 2078 2078 ? -1 R 65534 1396:44 \_ nginx: worker process
2078 16832 2078 2078 ? -1 S 65534 1392:32 \_ nginx: worker process
2078 16833 2078 2078 ? -1 S 65534 1391:01 \_ nginx: worker process
2078 16834 2078 2078 ? -1 S 65534 1396:07 \_ nginx: worker process
2078 16835 2078 2078 ? -1 S 65534 1394:51 \_ nginx: worker process
2078 16836 2078 2078 ? -1 S 65534 1385:29 \_ nginx: worker process
2078 16837 2078 2078 ? -1 R 65534 1409:54 \_ nginx: worker process
2078 16839 2078 2078 ? -1 Z 65534 1296:41 \_ [nginx] <defunct>
2078 16840 2078 2078 ? -1 S 65534 1399:51 \_ nginx: worker process
2078 16841 2078 2078 ? -1 Z 65534 1294:17 \_ [nginx] <defunct>
2078 16842 2078 2078 ? -1 S 65534 1401:35 \_ nginx: worker process
2078 16843 2078 2078 ? -1 S 65534 1398:40 \_ nginx: worker process
2078 16844 2078 2078 ? -1 R 65534 1398:46 \_ nginx: worker process
2078 16845 2078 2078 ? -1 S 65534 4:22 \_ nginx: cache manager process
2078 54245 2078 2078 ? -1 S 65534 534:25 \_ nginx: worker process
root@server03 ~ # nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
After a restart nginx works fine for some time, usually several days.
Change History (9)
comment:1 by , 8 years ago

The "ps" output suggests that at least one worker process exited unexpectedly and was respawned (note pid 16838 is missing, and there is a new process with pid 54245). Try looking into logs to see why this happened. Other lines from the error log may also help to understand what's going on.

You may also want to provide the full configuration of nginx, and make sure no 3rd party modules are loaded in your configuration. Checking servers memory for errors may also be a good idea, as happening only on some servers suggests this may be a hardware problem.
comment:2 by , 8 years ago (follow-up: comment:3)
> Try looking into logs to see why this happened. Other lines from the error log may also help to understand what's going on.

How can I log system events only? The servers are under high load, so if I log everything that happens there, I'll just get tons and tons of useless logs. Also, there are many different server {} blocks, so I'm not sure where this should be configured.

> You may also want to provide the full configuration of nginx

Unfortunately I'm not sure this is possible, but I can provide specific parts of the configuration.

> and make sure no 3rd party modules are loaded in your configuration

Right now the wallarm module is loaded, but it has only been in the configuration for about a month. Earlier there were no dynamic modules loaded, and the symptoms were the same, even on previous nginx versions (at least on 1.8).

> Checking servers memory for errors may also be a good idea, as happening only on some servers suggests this may be a hardware problem.

Nah, there are no servers that are always faulty; it's kind of a random thing. Also, the hardware is relatively new and reliable: HP servers with HP-branded ECC RAM, deployed in January 2016.
comment:3 by , 8 years ago
Replying to tiandrey@…:
> > Try looking into logs to see why this happened. Other lines from the error log may also help to understand what's going on.
>
> How can I log system events only? The servers are under high load, so if I log everything that happens there, I'll just get tons and tons of useless logs. Also, there are many different server {} blocks, so I'm not sure where this should be configured.

System events are logged to the global error log as set at the global level (i.e., outside of the http{} block). You can override logging at any level, including http{} and server{}, and make the logs less verbose there. In general, though, it is a good idea to have logs everywhere at least at the "crit" level. See the docs for details.
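For illustration, a minimal sketch of such a layout (file names, log paths and levels here are only examples, not taken from this ticket):

# nginx.conf - global error log, outside of http{}: core events such as
# signal handling and worker exits/respawns are recorded here.
error_log /var/log/nginx/error.log notice;

events {
    worker_connections 1024;
}

http {
    # HTTP-level logging can be made less verbose without hiding core events.
    error_log /var/log/nginx/http-error.log crit;
    access_log off;

    server {
        listen 80;
        server_name example.com;

        # Per-server override; keeping it at least at "crit" is a good idea.
        error_log /var/log/nginx/example-error.log crit;
    }
}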
comment:4 by , 8 years ago
Actually we have some logs from this host (not from all vhosts), and there is nothing unusual:
2016/10/11 15:35:07 [crit] 19143#0: *3103397054 SSL_do_handshake() failed (SSL: error:14094458:SSL routines:SSL3_READ_BYTES:tlsv1 unrecognized name:SSL alert number 112) while SSL handshaking, client: 217.151.131.23, server: 0.0.0.0:443
2016/10/11 15:35:30 [crit] 19154#0: *3103567869 SSL_do_handshake() failed (SSL: error:1006706B:elliptic curve routines:ec_GFp_simple_oct2point:point is not on curve error:1408B010:SSL routines:SSL3_GET_CLIENT_KEY_EXCHANGE:EC lib) while SSL handshaking, client: 83.149.45.121, server: 0.0.0.0:443
2016/10/11 15:39:08 [crit] 19147#0: *3105238623 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 78.25.122.213, server: 0.0.0.0:443
2016/10/11 15:42:25 [crit] 19150#0: *3106741156 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 109.184.206.140, server: 0.0.0.0:443
2016/10/11 15:48:42 [crit] 19144#0: *3109637346 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 85.113.45.87, server: 0.0.0.0:443
2016/10/11 15:48:42 [crit] 19144#0: *3109637345 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 85.113.45.87, server: 0.0.0.0:443
2016/10/11 15:54:52 [crit] 19143#0: *3112483266 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 178.184.111.69, server: 0.0.0.0:443
2016/10/11 15:56:05 [crit] 19155#0: *3113036656 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 178.184.111.69, server: 0.0.0.0:443
2016/10/11 15:59:12 [crit] 19155#0: *3114459830 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 80.87.154.101, server: 0.0.0.0:443
2016/10/11 15:59:15 [crit] 19158#0: *3114481931 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 80.87.154.101, server: 0.0.0.0:443
2016/10/11 15:59:22 [crit] 19147#0: *3114535029 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 80.87.154.101, server: 0.0.0.0:443
2016/10/11 15:59:23 [crit] 19157#0: *3114542511 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 80.87.154.101, server: 0.0.0.0:443
2016/10/11 15:59:34 [crit] 19154#0: *3114628224 SSL_do_handshake() failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL handshaking, client: 83.149.19.136, server: 0.0.0.0:443
comment:5 by , 8 years ago
You need to look into the global logs, not the logs of a particular server{} block. The ps output as provided above suggests there should be at least one alert about an abnormal worker process exit.
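For reference, such abnormal worker exits are normally recorded by the master process in that global log; a quick way to check (log path as in this setup, exact message wording may differ between versions) might be:

# Look for abnormal worker exits recorded by the master process.
grep -E 'exited on signal|exited with.*code' /var/log/nginx/error.log | tail -n 20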
comment:6 by , 8 years ago
This is the log configuration from the http{} section in the main nginx.conf:

access_log off;
error_log /var/log/nginx/error.log notice;

We can't reproduce it, but we can take some action the next time it happens (strace, for example). What data would be helpful?
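As a rough idea of what could be captured the next time it happens (the file names and flags below are just an illustrative sketch, not something prescribed in this ticket):

# Attach to the master and trace only signal-related syscalls for a short while.
MASTER=$(cat /run/nginx.pid)
strace -f -tt -e trace=signal -o /tmp/nginx-signals.trace -p "$MASTER" &
STRACE_PID=$!

# After a short delay, send the signal that is being ignored.
sleep 1
kill -HUP "$MASTER"

# Give it a few seconds, then detach strace and inspect the trace file.
sleep 5
kill "$STRACE_PID"
less /tmp/nginx-signals.trace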
comment:7 by , 8 years ago
Quoting comment:3:
> System events are logged to the global error log as set at the global level (i.e., outside of the http{} block).

Note "outside". Logs configured inside the http{} section contain http-specific information and won't help in debugging low-level problems like the one you are facing.

First of all, I would recommend configuring appropriate logging and making sure you see global events like signal handling in the logs. Try sending something like SIGUSR1 to nginx and make sure you see the corresponding log entries, for example:
2016/10/18 16:50:22 [notice] 82337#100109: signal 30 (SIGUSR1) received, reopening logs
2016/10/18 16:50:22 [notice] 82337#100109: reopening logs
2016/10/18 16:50:22 [notice] 82338#100576: reopening logs
And once the problem happens again, check the logs to see if any problems are logged. Additionally, try sending some signals to nginx to see whether nginx receives them, and if it doesn't, find out where it is blocked. This can be done, for example, by looking at the wait channel as reported by sudo ps -alx (note that root is needed to report wait channels for other users' processes).
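Put together, a check along these lines (log path as configured in this ticket) could look like:

# Send SIGUSR1 to the master and confirm the corresponding notices show up
# in the global error log.
kill -USR1 "$(cat /run/nginx.pid)"
grep -E 'SIGUSR1|reopening logs' /var/log/nginx/error.log | tail -n 5

# If nothing is logged, look at the kernel wait channel of each nginx process
# to see where it is blocked (root is needed for other users' processes).
sudo ps -alx | grep [n]ginx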
comment:8 by , 8 years ago
I think we have got it and know how to reproduce it.
We use a Python script as a log analyzer via a pipe file, to temporarily ban IP addresses:
access_log /var/log/nginx.pipe;
Steps (a scripted sketch is included at the end of this comment):
0) Make a pipe file: mkfifo /var/log/nginx.pipe
1) Take the default nginx config and point access_log to the pipe file, e.g. /var/log/nginx.pipe
2) In a separate console, open "tail -f /var/log/nginx.pipe"
3) Start nginx and use a load tool (siege in my case) to request the main page
4) Close the tail process and run it again (you will not be able to see new logs, but nginx as a whole keeps working fine)
5) Run "service nginx reload" several times and you will see this in the process tree:
root@test3:~# ps axjf | grep [n]ginx | grep -v tail
1 1896 1896 1896 ? -1 Rs 0 0:01 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
1896 6196 1896 1896 ? -1 Z 33 0:01 \_ [nginx] <defunct>
1896 6203 1896 1896 ? -1 S 33 0:01 \_ nginx: worker process is shutting down
1896 6207 1896 1896 ? -1 Z 33 0:01 \_ [nginx] <defunct>
1896 6211 1896 1896 ? -1 S 33 0:01 \_ nginx: worker process is shutting down
1896 6258 1896 1896 ? -1 R 33 0:00 \_ nginx: worker process
1896 6259 1896 1896 ? -1 R 33 0:00 \_ nginx: worker process
1896 6260 1896 1896 ? -1 R 33 0:00 \_ nginx: worker process
In this test case, however, it eventually gets repaired by the nginx master.
In production we have over 100,000 active connections per HW node, and there it does not seem to heal itself.
For now we have disabled the access_log via the pipe and are watching the situation.
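For convenience, the reproduction steps above can be scripted roughly like this (the siege options and loop counts are arbitrary; it assumes the config already points access_log at the FIFO, i.e. step 1 has been done):

#!/bin/sh
# Sketch of the reproduction from the steps above.

mkfifo /var/log/nginx.pipe                # step 0: create the pipe file

tail -f /var/log/nginx.pipe &             # step 2: attach a reader to the pipe
TAIL_PID=$!

service nginx start                       # step 3: start nginx...
siege -c 10 -t 30S http://localhost/      # ...and fetch the main page under load

kill "$TAIL_PID"                          # step 4: close the reader...
tail -f /var/log/nginx.pipe &             # ...and start a new one

for i in 1 2 3; do                        # step 5: reload several times
    service nginx reload
    sleep 2
done

ps axjf | grep [n]ginx | grep -v tail     # look for <defunct> / "shutting down" workers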
The "ps" output config suggests that at least one worker process exited unexpectedly and was respawn (note pid 16838 is missing, and there a new process with pid 54245). Try looking into logs to see why this happened. Other lines from the error log may also help to understand what's going on.
You may also want to provide the full configuration of nginx, and make sure no 3rd party modules are loaded in your configuration. Checking servers memory for errors may also be a good idea, as happening only on some servers suggests this may be a hardware problem.