Opened 2 months ago

Closed 13 days ago

#1650 closed defect (invalid)

Active connections not cleared when sessions close

Reported by: kfir.snunit.org.il@… Owned by:
Priority: major Milestone:
Component: nginx-core Version: 1.15.x
Keywords: Cc:
uname -a: Linux varnish62 4.4.0-134-generic #160-Ubuntu SMP Wed Aug 15 14:58:00 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.15.2

Description

Hello,
I've been trying for a while to figure out why sessions are not being closed, without success.

At least once every two days I see more than 1000 active connections reported by nginx when I run this command:
curl -s http://127.0.0.1:1337/nginx_status

But netstat actually shows only about 300 established connections to the server on ports 80 and 443 combined:
netstat -anp | grep :80 | grep ESTABLISHED | wc -l
netstat -anp | grep :443 | grep ESTABLISHED | wc -l
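The two numbers can be compared mechanically. A minimal sketch, assuming the stub_status endpoint from the report (`127.0.0.1:1337/nginx_status`) and the standard stub_status output format:

```shell
#!/bin/sh
# Extract the "Active connections" count from a stub_status response.
active_connections() {
    awk '/^Active connections:/ { print $3 }'
}

# In practice the input would come from:
#   curl -s http://127.0.0.1:1337/nginx_status | active_connections
# Demonstrated here on a sample response:
sample='Active connections: 974
server accepts handled requests
 487023 487023 3976818
Reading: 29 Writing: 629 Waiting: 586'

printf '%s\n' "$sample" | active_connections   # prints 974
```

The extracted number can then be compared against the netstat/ss counts above; a persistent large gap (without HTTP/2 in play) is what would suggest a leak.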

And that number can continue to grow until I restart the service:
service nginx restart

It started happening after I installed the official package from your repo instead of the package from the Ubuntu repo.
The issue is the same from v1.14.0 to v1.15.2.

Please advise...

Regards, Kfir Man

Change History (10)

comment:1 Changed 2 months ago by kfir.snunit.org.il@…

The nginx -V output:

nginx version: nginx/1.15.2
built by gcc 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)
built with OpenSSL 1.0.2g 1 Mar 2016
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'

comment:2 Changed 2 months ago by mdounin

Do you have HTTP/2 enabled? With HTTP/2, number of active connections reported by stub_status includes HTTP/2 streams, and might not match numbers reported by netstat.
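For reference, in nginx of this era HTTP/2 is enabled per listen socket via the `http2` parameter, so it can be toggled off for testing. A hypothetical server block for illustration (names are assumptions, not the reporter's config):

```nginx
server {
    listen 443 ssl http2;   # HTTP/2 on: stub_status counts streams, not sockets
    # listen 443 ssl;       # HTTP/2 off: one stub_status connection per socket
    server_name example.com;
}
```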

comment:3 Changed 2 months ago by kfir.snunit.org.il@…

The thing is that HTTP/2 was already set up before version 1.14.0, where the problem started.

And even if there really were 1000 active connections to the server, they should have come back after restarting the service; instead, the number of active connections dropped from 1000 to 200.

comment:4 Changed 2 months ago by mdounin

And even if there really were 1000 active connections to the server, they should have come back after restarting the service

This is not necessarily the case. Restarting the server will close existing connections with clients, and depending on the exact use case and behaviour of the clients, these may or may not be reopened in a reasonable time. It is possible that what you are observing is a bug, but more information is needed to say this for sure.

If you suspect that numbers observed indicate leaks, consider the following:

  • Try switching off HTTP/2 to see if it helps.
  • Make sure worker processes are not killed / crashed.
  • Make sure you observe the same behaviour with latest nginx mainline version, nginx 1.15.5, and without any 3rd party modules.
  • Check if you are seeing any sockets in the CLOSE_WAIT state (for more than several seconds).
  • Check what happens on graceful shutdown of running worker processes. In particular, check for open socket ... left in connection ... alerts. If there are any, consider trying to obtain a debugging log with such a socket.
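The last two checks in the list can be sketched as shell commands. Paths follow the configure output in comment:1 and are assumptions for any other system; the live commands are shown as comments, and the grep pattern is demonstrated on a simulated alert line so the filter itself can be verified:

```shell
#!/bin/sh
# Count sockets stuck in CLOSE_WAIT (run repeatedly; entries that
# persist for more than a few seconds are suspicious):
#   ss -tan state close-wait | tail -n +2 | wc -l

# After a graceful shutdown (nginx -s quit), look for leaked-connection
# alerts in the error log:
#   grep 'left in connection' /var/log/nginx/error.log

# Simulated alert line, matching the "open socket ... left in connection"
# format mentioned above:
sample='2018/11/17 21:00:00 [alert] 1234#1234: open socket #42 left in connection 7'
printf '%s\n' "$sample" | grep -c 'left in connection'   # prints 1
```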

See also #1610 which might be related.

comment:5 Changed 2 months ago by kfir.snunit.org.il@…

  • Nginx worker processes are not killed or crashed.
  • I am experiencing the same behaviour with latest nginx mainline version 1.15.5.
  • No open socket ... left in connection alerts.
  • 0 sockets in the CLOSE_WAIT state.
  • I still think it has nothing to do with HTTP/2. 95% of the time, the number of ESTABLISHED connections on ports 80/443 from netstat exactly matches the number of active connections reported by nginx.

comment:6 Changed 8 weeks ago by kfir.snunit.org.il@…

I tried disabling HTTP/2; the behavior is still the same.

I noticed one strange thing: while the problem exists, the number of reading requests does not change.
It always stays at 29.

Nginx status output:
Active connections: 974
server accepts handled requests
487023 487023 3976818
Reading: 29 Writing: 629 Waiting: 586

Active connections: 1054
server accepts handled requests
491779 491779 4028536
Reading: 29 Writing: 757 Waiting: 647

The Writing/Waiting/Active counters change all the time, but Reading stays at 29 permanently.
Does that mean anything to you?
Could it be that mod_security is related to this?
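The Reading counter above can be sampled over time to confirm it really is stuck. A small sketch, again assuming the endpoint from the report; in practice one would loop with something like `while sleep 5; do curl -s http://127.0.0.1:1337/nginx_status | reading; done`:

```shell
#!/bin/sh
# Extract the "Reading" counter from a stub_status response.
reading() {
    awk '/^Reading:/ { print $2 }'
}

# Demonstrated on the second sample from the comment above:
sample='Active connections: 1054
server accepts handled requests
 491779 491779 4028536
Reading: 29 Writing: 757 Waiting: 647'

printf '%s\n' "$sample" | reading   # prints 29
```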

comment:7 Changed 8 weeks ago by mdounin

Could it be that mod_security is related to this?

As previously suggested, please make sure you are able to reproduce the problem without any 3rd party modules. Any module can easily break things, and mod_security in particular has a long history of introducing various problems.

comment:8 Changed 3 weeks ago by kfir.snunit.org.il@…

Hello,
After a thorough review I was able to find the problem, but not the cause.
Because log rotation ran after the problem arose, I could not find it in the logs.
But now I have new information on this issue.

I can confirm that nginx worker processes were killed.

The nginx error.log shows:
2018/11/17 21:31:04 [alert] 23669#23669: worker process 17191 exited on signal 11 (core dumped)
2018/11/17 21:33:09 [alert] 23669#23669: worker process 17852 exited on signal 11 (core dumped)

The syslog shows:
Nov 17 21:26:56 kernel: [340028.785042] nginx[17191]: segfault at 0 ip 00007fdb9bdfa746 sp 00007fff7c0475f8 error 4 in libc-2.23.so[7fdb9bd6f000+1c0000]
Nov 17 21:33:09 kernel: [340401.577572] nginx[17852]: segfault at 0 ip 00007fdb9bdfa746 sp 00007fff7c047678 error 4 in libc-2.23.so[7fdb9bd6f000+1c0000]
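Segfaults like these are much easier to diagnose from a worker core dump. A sketch of the nginx.conf directives (main context) that enable them; the directory is an assumption and must be writable by the worker user:

```nginx
# Allow workers to write core files up to 500 MB:
worker_rlimit_core  500m;
# Directory where core files are written (must be writable by the
# nginx worker user; path is an example):
working_directory   /var/cache/nginx/;
```

A resulting core can then be inspected with gdb (e.g. `gdb /usr/sbin/nginx /var/cache/nginx/core.<pid>`, then `bt`) to see whether the crash originates in a third-party module.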

ps aux | grep nginx
root 15837 0.0 0.0 12944 1032 pts/0 S+ 09:12 0:00 grep --color=auto nginx
nginx 17186 0.8 3.1 1194640 1034896 ? S Nov14 50:05 nginx: worker process
nginx 17187 1.2 3.1 1194640 1039008 ? R Nov14 71:05 nginx: worker process
nginx 17188 1.8 3.1 1194640 1039072 ? S Nov14 102:29 nginx: worker process
nginx 17189 2.5 3.1 1194640 1039072 ? S Nov14 143:18 nginx: worker process
nginx 17190 4.0 3.1 1194640 1039072 ? R Nov14 228:49 nginx: worker process
nginx 17192 10.2 3.1 1194640 1039072 ? R Nov14 575:35 nginx: worker process
nginx 17193 12.0 3.1 1194640 1039072 ? R Nov14 677:13 nginx: worker process
nginx 18228 5.0 3.1 1194640 1039008 ? R Nov17 35:40 nginx: worker process
root 23669 0.0 3.0 1170036 1013272 ? Ss Nov14 0:17 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx 23672 0.3 2.0 821436 667072 ? S Nov14 22:44 nginx: worker process is shutting down
nginx 23673 0.4 2.0 843064 688668 ? S Nov14 27:37 nginx: worker process is shutting down

I think this is indeed related to the ModSecurity-nginx connector, and I'm checking this with the vendor, SpiderLabs:
https://github.com/SpiderLabs/ModSecurity-nginx/issues/136

I will update here if new information is discovered.

comment:9 Changed 2 weeks ago by kfir.snunit.org.il@…

I confirm that the bug was caused by an old version of libmodsecurity. After updating to the latest libmodsecurity and ModSecurity-nginx connector, the problem has been solved!

comment:10 Changed 13 days ago by mdounin

  • Resolution set to invalid
  • Status changed from new to closed

Ok, thanks for confirming this.
