#2145 closed defect (fixed)
CLOSE_WAIT socket leak in downstream connections with keepalive
Reported by: | Anton Ovchinnikov | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | |
Component: | nginx-core | Version: | 1.19.x |
Keywords: | keepalive, CLOSE_WAIT | Cc: | |
uname -a: | Linux 4ecd0abe453e 4.19.76-linuxkit #1 SMP Tue May 26 11:42:35 UTC 2020 x86_64 GNU/Linux | | |

nginx -V:

    nginx version: nginx/1.19.7
    built by gcc 8.3.0 (Debian 8.3.0-6)
    built with OpenSSL 1.1.1d 10 Sep 2019
    TLS SNI support enabled
    configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -fdebug-prefix-map=/data/builder/debuild/nginx-1.19.7/debian/debuild-base/nginx-1.19.7=. -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'
Description
It looks like in certain situations Nginx doesn't properly close connections to downstream peers, and as a result a large number of sockets end up in the CLOSE_WAIT state.

One such situation: Nginx sits behind an Envoy proxy (https://www.envoyproxy.io/) and responds unconditionally (e.g. with return 200) or rejects requests via rate limiting (limit_req), all with keepalive enabled. Clients send POST requests with a non-empty body to Envoy, which proxies them to Nginx.
Here you can find a reproducible example with instructions: https://github.com/tonyo/nginx-too-many-close-wait/

Here is a basic Nginx configuration taken from there that was used to reproduce the issue:

    worker_processes 1;

    events {}

    http {
        server {
            listen      8080;
            server_name _;
            return 200;
        }
    }
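The second trigger mentioned above (limit_req) can be sketched along the same lines; note that the zone name, size, and rate below are illustrative values made up for this sketch, not taken from the linked repository:

    worker_processes 1;

    events {}

    http {
        # Hypothetical rate-limit variant: excess requests are answered with
        # 503 before the request body has been read, keepalive at defaults.
        limit_req_zone $binary_remote_addr zone=repro:10m rate=1r/m;

        server {
            listen      8080;
            server_name _;

            location / {
                # burst defaults to 0, so any further request within the same
                # minute is rejected immediately with 503
                limit_req zone=repro;
            }
        }
    }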
I could reproduce the issue with all the versions I tested (1.14.2, 1.18.0, 1.19.7).
On the network level, the issue seems to manifest itself when Nginx responds before fully reading the request body and then doesn't properly react to the TCP FIN packet received from the downstream side.
I found this forum thread that seems related: https://forum.nginx.org/read.php?2,286665,286699#msg-286699

We briefly tested the patch provided there and it does seem to work, but I don't know how much we can rely on it, or whether it is still up to date.
Let me know if I can provide any more details.
Change History (5)
comment:1 by , 4 years ago
Status: | new → accepted |
---|
comment:2 by , 4 years ago (follow-up: comment:5)
Thanks for your reply and explanation.

Do you perhaps also have insight into what risks this issue might bring? In our setup on Google Cloud Platform, keepalive_timeout is set to a pretty high value (620 seconds, as recommended by https://cloud.google.com/load-balancing/docs/https#timeouts_and_retries), so these CLOSE_WAIT connections might linger for more than 10 minutes. What we also observe is that they seem to be closed/recycled when the total number of connections (including the CLOSE_WAIT ones) for a given worker reaches its worker_connections limit. We haven't seen any hard crashes or 500s when this happens, but the question is, can we rely on Nginx here doing the "right" thing of closing these half-broken connections first, and not the active/live ones?
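For reference, the relevant parts of our configuration look roughly like this (keepalive_timeout follows the GCP recommendation above; the worker_connections value is only illustrative, not our exact setting):

    events {
        # Illustrative value: each worker's total connection count, including
        # sockets stuck in CLOSE_WAIT, is capped by this limit.
        worker_connections 1024;
    }

    http {
        # 620s as recommended for GCP load balancers, so an unclosed
        # CLOSE_WAIT connection can linger for more than 10 minutes.
        keepalive_timeout 620s;
    }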
comment:4 by , 4 years ago
Resolution: | → fixed |
---|---|
Status: | accepted → closed |
Fix committed, thanks for the report.
comment:5 by , 4 years ago
Replying to Anton Ovchinnikov:
> the question is, can we rely on Nginx here doing the "right" thing of closing these half-broken connections first, and not the active/live ones?
That's correct. When approaching the worker_connections limit, nginx is expected to close the least recently used keepalive connections.
Thanks for the ticket. This is exactly the issue you've linked, and the patch by Sergey should still apply (a better link would be http://mailman.nginx.org/pipermail/nginx/2020-January/058867.html though). Thanks for your feedback on the patch as well.
Note though that there is no leak here: rather, a connection is kept open longer than it should be, given that a TCP FIN has already been received, and is only closed once keepalive_timeout expires. This certainly needs to be fixed, but it's rather a suboptimal behaviour in a specific and uncommon use case, and hence low priority.
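In other words, the time such a connection can linger is bounded by keepalive_timeout. As a purely illustrative sketch (75s is the directive's default, not a recommendation):

    http {
        # An idle keepalive connection on which the FIN went unnoticed is
        # still closed once this timer expires (default: 75s).
        keepalive_timeout 75s;
    }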