Opened 15 months ago

Last modified 13 months ago

#1610 new defect

Writing connections counter leak in stub_status.

Reported by: dyeldandi@… Owned by:
Priority: major Milestone:
Component: other Version: 1.12.x
Keywords: Cc:
uname -a: Linux hostname 2.6.32-696.6.3.el6.x86_64 #1 SMP Wed Jul 12 14:17:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.12.2 built by gcc 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC) built with OpenSSL 1.0.2l 25 May 2017 TLS SNI support enabled configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib64/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-openssl=openssl-1.0.2l --with-openssl-opt=shared --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --add-dynamic-module=nginx-vod-module-1.22 --with-http_geoip_module=dynamic --with-cc-opt='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -pie'

Description

The Writing connections counter reported by stub_status grows slowly over time.

Active connections: 23374 
server accepts handled requests
 993360041 993360041 11278418586 
Reading: 0 Writing: 13887 Waiting: 12213 

However, netstat shows fewer connections:

# netstat -anp | grep -v CLOSE_WAIT | grep -c nginx  
12137

If I restart nginx, the number of writing connections drops to around 300, which is close to the expected value:

Active connections: 12130 
server accepts handled requests
 26485 26485 198399 
Reading: 0 Writing: 277 Waiting: 11817 

but after a week or two it gradually grows back to the same large value (thousands).

On average the server receives about 2000 requests per second, and I can't pinpoint a specific request that permanently increases the writing connections counter. However, I think this problem started when we moved to HTTP/2 (though I'm not completely sure).
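The comparison above (stub_status Writing vs. netstat) can be scripted. A minimal sketch, using the sample counters from this report embedded as a string; against a live server the text would come from fetching the status page (the /status URL below is an assumption):

```shell
# Sample stub_status response copied from this report; on a live server:
#   status=$(curl -s http://127.0.0.1/status)    # URL is an assumption
status='Active connections: 23374
server accepts handled requests
 993360041 993360041 11278418586
Reading: 0 Writing: 13887 Waiting: 12213'

# The Writing value is the 4th field of the Reading/Writing/Waiting line.
printf '%s\n' "$status" | awk '/Writing/ { print $4 }'   # → 13887
```

Taking this number and `netstat -anp | grep -c nginx` at roughly the same moment makes the discrepancy easier to track over time.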

Attachments (1)

nginx.conf (14.3 KB) - added by dyeldandi@… 15 months ago.
nginx.conf as reported by nginx -T with paths and server names changed


Change History (7)

comment:1 Changed 15 months ago by mdounin

An increasing Writing connections counter might indicate one of the following:

  • Crashed or killed worker processes. This is unlikely in your case, though, as it would cause other values to grow as well, notably Waiting.
  • A real socket leak somewhere. These usually result in sockets in the CLOSE_WAIT state; try looking at the netstat output without the grep -v CLOSE_WAIT filter. Leaked sockets are reported by nginx during graceful shutdown of a worker process (for example, after a configuration reload): if there are any leaked sockets, nginx writes open socket ... left in connection ... alerts to the error log.

Enabling HTTP/2 might be an issue. It has been known to cause socket leaks in the past (the most recent one was fixed in nginx 1.11.3), and there are reports that something may still be unfixed; see #1163.

To further investigate things, please do the following:

  • Upgrade to the latest mainline version, nginx 1.15.2, without any 3rd-party modules, and check whether you can reproduce the issue.
  • Try disabling HTTP/2 to see if it fixes the issue.
  • Check if you are seeing open socket ... left in connection ... alerts on configuration reload.

If you are able to reproduce the issue with a recent version, please also provide:

  • Full configuration, as shown by nginx -T.
  • Debug log with at least one leaked socket as reported by an open socket ... left... alert. See the nginx debugging-log documentation for details on how to configure debug logging.
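Checking for the leaked-socket alerts mentioned above amounts to a grep over the error log. A sketch against a temporary file with a made-up sample alert (the timestamp, PID, and socket numbers are illustrative; on a real box you would grep the log configured with --error-log-path, here /var/log/nginx/error.log):

```shell
# Fabricated sample line following the "open socket ... left in connection"
# alert shape, written to a temp file so the grep pattern can be demonstrated.
log=$(mktemp)
printf '2018/08/22 12:00:00 [alert] 123#0: open socket #42 left in connection 7\n' > "$log"

# Count leaked-socket alerts; on a live server this would be:
#   grep -c 'open socket .* left in connection' /var/log/nginx/error.log
grep -c 'open socket .* left in connection' "$log"   # → 1
rm -f "$log"
```

A count of zero across a reload (as reported in comment:2 below) argues against a plain socket leak.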

comment:2 Changed 15 months ago by dyeldandi@…

open socket ... left in connection ...

Nope, not a single one in two weeks.

Upgrade to the latest mainline version ... without any 3rd party modules

I'm afraid that won't be possible. These nginx servers serve HLS traffic via the VOD module, so running them without 3rd-party modules means they won't get any traffic :)

Try disabling HTTP/2 to see if it fixes the issue

Yep, I reconfigured one of the servers without HTTP/2; I'll check it in a week.

Thank you!

comment:3 Changed 15 months ago by dyeldandi@…

No (major) leaks with HTTP/2 disabled.

A week after restart, the server WITH HTTP/2 enabled:

Active connections: 13913 
server accepts handled requests
 120731540 120731540 1453989168 
Reading: 0 Writing: 1872 Waiting: 12288 

# netstat -anp | grep -c nginx
12691
#

The server WITHOUT HTTP/2 enabled:

Active connections: 14652 
server accepts handled requests
 232682967 232682967 1407244359 
Reading: 0 Writing: 118 Waiting: 14462 

# netstat -anp | grep -c nginx
13046
#

Both servers serve more or less the same traffic.

Changed 15 months ago by dyeldandi@…

nginx.conf as reported by nginx -T with paths and server names changed

comment:4 Changed 15 months ago by mdounin

In both cases the number of active connections does not match the total number of connections reported by netstat: 13913 vs. 12691 with HTTP/2, and 14652 vs. 13046 without HTTP/2. So it looks like the problem exists both with and without HTTP/2, and it just manifests itself differently when HTTP/2 is enabled.

This is also in line with the fact that you are not seeing open socket ... alerts on reloads (please note: these alerts are only expected to appear on configuration reloads and/or worker process shutdown, not during normal operation), and may indicate crashed or killed worker processes. Please make sure that no processes have crashed or been killed. Ideally, grab a ps listing of all nginx processes at start, and verify that the same processes are still running after a while.
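The ps check suggested here can be made mechanical by diffing two sorted PID snapshots. A sketch with stand-in PID lists (on a live server each snapshot would come from something like `pgrep -f 'nginx: worker process' | sort`):

```shell
before=$(mktemp); after=$(mktemp)

# Stand-in snapshots for illustration: worker 31712 disappears between
# the two samples and a new worker 31799 appears in its place.
printf '%s\n' 31710 31711 31712 > "$before"
printf '%s\n' 31710 31711 31799 > "$after"

# comm -23 prints lines present only in the first (sorted) file,
# i.e. workers that were running at start but are gone now.
comm -23 "$before" "$after"   # → 31712
rm -f "$before" "$after"
```

An empty diff after a week would rule out silently restarted workers.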

Note that overall this looks like a problem in the 3rd-party module you are using. It is unlikely we can help unless you can reproduce it without 3rd-party modules.

comment:5 Changed 15 months ago by dyeldandi@…

My understanding is that since I can't run netstat at exactly the same moment I fetch the /status page, the numbers will always differ. I ran netstat 5 times in a loop and the results vary by about ±2K:

# for i in `seq 1 5 `; do netstat -anp | grep -c nginx; done
13001
11999
12928
12598
12829
#

So I guess the difference is within the margin of error.
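The five samples above can be summarized to back up that ±2K estimate. A quick sketch with the reported values:

```shell
# Mean and standard deviation of the five netstat samples shown above.
printf '%s\n' 13001 11999 12928 12598 12829 \
  | awk '{ s += $1; ss += $1 * $1; n++ }
         END { m = s / n; printf "mean=%.0f sd=%.0f\n", m, sqrt(ss / n - m * m) }'
# → mean=12671 sd=362
```

A spread of a few hundred connections is small next to the thousands by which the Writing counter drifts, so sampling jitter alone does not explain the growth.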

I don't think the nginx processes crash; all the workers have been alive since the master started (Aug 22), and the PID numbers are nearly consecutive.

# ps -ef | grep nginx
root      6648  6632  0 15:29 pts/0    00:00:00 grep nginx
root     31709     1  0 Aug22 ?        00:00:04 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
1100     31710 31709  6 Aug22 ?        14:47:32 nginx: worker process                   
1100     31711 31709  5 Aug22 ?        11:48:59 nginx: worker process                   
1100     31712 31709  5 Aug22 ?        11:16:58 nginx: worker process                   
1100     31713 31709  6 Aug22 ?        12:47:15 nginx: worker process                   
1100     31714 31709  8 Aug22 ?        18:02:36 nginx: worker process                   
1100     31715 31709  6 Aug22 ?        14:30:48 nginx: worker process                   
1100     31716 31709 66 Aug22 ?        5-19:53:05 nginx: worker process                   
1100     31717 31709  6 Aug22 ?        14:32:46 nginx: worker process                   
1100     31719 31709  7 Aug22 ?        16:15:57 nginx: worker process                   
1100     31720 31709  6 Aug22 ?        14:00:28 nginx: worker process                   
1100     31721 31709 60 Aug22 ?        5-08:44:01 nginx: worker process                   
1100     31722 31709  6 Aug22 ?        13:19:19 nginx: worker process                   
1100     31723 31709  5 Aug22 ?        12:21:53 nginx: worker process                   
1100     31724 31709  6 Aug22 ?        12:53:08 nginx: worker process                   
1100     31725 31709  5 Aug22 ?        12:04:23 nginx: worker process                   
1100     31726 31709  5 Aug22 ?        11:24:42 nginx: worker process                   
1100     31727 31709  6 Aug22 ?        14:16:52 nginx: worker process                   
1100     31728 31709  9 Aug22 ?        20:14:31 nginx: worker process                   
1100     31729 31709 63 Aug22 ?        5-14:57:12 nginx: worker process                   
1100     31730 31709 65 Aug22 ?        5-18:52:53 nginx: worker process                   
1100     31731 31709  6 Aug22 ?        13:50:16 nginx: worker process                   
1100     31732 31709 66 Aug22 ?        5-20:52:52 nginx: worker process                   
1100     31733 31709  5 Aug22 ?        11:18:26 nginx: worker process                   
1100     31734 31709 41 Aug22 ?        3-15:17:54 nginx: worker process                   
1100     31735 31709  5 Aug22 ?        12:23:48 nginx: worker process                   
1100     31736 31709  5 Aug22 ?        12:00:03 nginx: worker process                   
1100     31737 31709 19 Aug22 ?        1-16:14:45 nginx: worker process                   
1100     31738 31709  5 Aug22 ?        11:18:00 nginx: worker process                   
1100     31739 31709 15 Aug22 ?        1-09:36:22 nginx: worker process                   
1100     31740 31709  5 Aug22 ?        11:23:54 nginx: worker process                   
1100     31741 31709 28 Aug22 ?        2-11:55:52 nginx: worker process                   
1100     31742 31709 57 Aug22 ?        5-01:35:41 nginx: worker process                   
1100     31743 31709 12 Aug22 ?        1-01:32:45 nginx: worker process                   
1100     31744 31709  5 Aug22 ?        11:27:39 nginx: worker process                   
1100     31745 31709  5 Aug22 ?        11:46:17 nginx: worker process                   
1100     31746 31709 13 Aug22 ?        1-05:03:12 nginx: worker process                   
1100     31747 31709  5 Aug22 ?        11:38:25 nginx: worker process                   
1100     31748 31709 10 Aug22 ?        22:31:50 nginx: worker process                   
1100     31749 31709  5 Aug22 ?        11:33:06 nginx: worker process                   
1100     31750 31709 53 Aug22 ?        4-17:31:15 nginx: worker process                   
1100     31751 31709  6 Aug22 ?        13:44:08 nginx: worker process                   
1100     31752 31709 34 Aug22 ?        3-00:32:47 nginx: worker process                   
1100     31753 31709  6 Aug22 ?        12:57:28 nginx: worker process                   
1100     31754 31709 47 Aug22 ?        4-04:05:54 nginx: worker process                   
1100     31755 31709  5 Aug22 ?        11:44:20 nginx: worker process                   
1100     31756 31709 23 Aug22 ?        2-01:26:29 nginx: worker process                   
1100     31757 31709  5 Aug22 ?        11:16:23 nginx: worker process                   
1100     31758 31709 52 Aug22 ?        4-15:22:58 nginx: worker process 

I'll try to replicate this problem on a spare server without the VOD module.

comment:6 Changed 13 months ago by mdounin

See also #1650 which might be related.
