Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#2145 closed defect (fixed)

CLOSE_WAIT socket leak in downstream connections with keepalive

Reported by: Anton Ovchinnikov
Owned by:
Priority: minor
Milestone:
Component: nginx-core
Version: 1.19.x
Keywords: keepalive, CLOSE_WAIT
Cc:
uname -a: Linux 4ecd0abe453e 4.19.76-linuxkit #1 SMP Tue May 26 11:42:35 UTC 2020 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.19.7
built by gcc 8.3.0 (Debian 8.3.0-6)
built with OpenSSL 1.1.1d 10 Sep 2019
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -fdebug-prefix-map=/data/builder/debuild/nginx-1.19.7/debian/debuild-base/nginx-1.19.7=. -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'

Description

It looks like in certain situations Nginx doesn't close downstream connections properly, and as a result, lots of sockets end up in the CLOSE_WAIT state.

One such situation: Nginx sitting behind an Envoy proxy (https://www.envoyproxy.io/) and responding either unconditionally (e.g. with return 200) or with rate limits activated (via limit_req), in both cases with keepalive enabled. Clients send POST requests with a non-empty body to Envoy, which proxies them to Nginx.

A reproducible example with instructions can be found here: https://github.com/tonyo/nginx-too-many-close-wait/. Here is the basic Nginx configuration from that repository that was used to reproduce the issue:

worker_processes 1;

events {}

http {
  server {
    listen 8080;
    server_name _;

    return 200;
  }
}
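
For reference, the other trigger mentioned in the description is rate limiting via limit_req. A minimal sketch of what that variant might look like (the zone name, zone size, rate, and document root below are illustrative assumptions, not values taken from the linked repository):

worker_processes 1;

events {}

http {
  # assumption: illustrative zone parameters, not from the repro repository
  limit_req_zone $binary_remote_addr zone=perip:10m rate=1r/s;

  server {
    listen 8080;
    server_name _;

    location / {
      limit_req zone=perip;
      # excess requests are rejected with 503 during the preaccess phase,
      # i.e. before the request body has been read
      root /usr/share/nginx/html;  # assumption: default Debian/Docker html root
    }
  }
}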

I could reproduce the issue with all the versions I tested (1.14.2, 1.18.0, 1.19.7).

At the network level, the issue seems to manifest itself when Nginx responds before fully reading the request body and then doesn't properly react to the TCP FIN packet received from the downstream side.
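
For anyone trying to confirm this, the behaviour can be observed with standard tooling while the repro is running (port 8080 matches the configuration above; everything else is generic):

# downstream sockets stuck in CLOSE_WAIT on the Nginx listen port
ss -tn state close-wait '( sport = :8080 )'

# watch the FIN arriving from the downstream side; before the fix, Nginx sends
# no FIN of its own until keepalive_timeout expires
tcpdump -i any -nn 'tcp port 8080 and tcp[tcpflags] & tcp-fin != 0'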

I found an issue that seems related: https://forum.nginx.org/read.php?2,286665,286699#msg-286699. We briefly tested the patch provided there and it does seem to work, but I don't really know how much we can rely on it, or whether it is still up to date.

Let me know if I can provide any more details.

Change History (5)

comment:1 by Maxim Dounin, 3 years ago

Status: new → accepted

Thanks for the ticket. This is exactly the issue you've linked, and the patch by Sergey should still apply (a better link would be http://mailman.nginx.org/pipermail/nginx/2020-January/058867.html though). Thanks for your feedback on the patch as well.

Note though that there is no leak here: rather, a connection is kept open longer than it should be given that a TCP FIN has already been received, and it is only closed once keepalive_timeout expires. This certainly needs to be fixed, but it is rather a suboptimal behaviour in a specific and uncommon use case, and hence low priority.
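
One way to see that keepalive_timeout is indeed the bound is to set it to a short value in the repro configuration and watch the CLOSE_WAIT socket disappear once the timeout fires; the 5-second value below is an arbitrary assumption chosen only to make the expiry quickly observable:

worker_processes 1;

events {}

http {
  keepalive_timeout 5s;  # assumption: short value only to make the expiry visible

  server {
    listen 8080;
    server_name _;

    return 200;
  }
}

In a separate shell, watch -n1 "ss -tn state close-wait '( sport = :8080 )'" should show the socket lingering for roughly the configured timeout and then being closed.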

comment:2 by Anton Ovchinnikov, 3 years ago

Thanks for your reply and explanation.

Do you perhaps also have insight into what potential dangers the issue might bring? In our setup on Google Cloud Platform, keepalive_timeout is set to a fairly high value (620 seconds, as recommended by https://cloud.google.com/load-balancing/docs/https#timeouts_and_retries), so these CLOSE_WAIT connections can linger for more than 10 minutes. We also observe that they seem to be closed/recycled when the total number of connections (including the CLOSE_WAIT ones) for a given worker reaches its worker_connections limit. We haven't seen any hard crashes or 500s when this happens, but the question is: can we rely on Nginx doing the "right" thing here and closing these half-broken connections first, rather than the active/live ones?
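
As a monitoring aid, one option is to expose stub_status (the module is included per the nginx -V output above) and compare its connection counters with the CLOSE_WAIT count reported by ss; the port and location name below are assumptions:

server {
  listen 8081;

  location = /nginx_status {
    stub_status;       # reports active connections, accepts, handled, requests
    allow 127.0.0.1;
    deny all;
  }
}

Comparing curl -s http://127.0.0.1:8081/nginx_status against ss -tan state close-wait | wc -l over time gives a rough picture of how close the CLOSE_WAIT backlog gets to the worker_connections limit.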

comment:3 by Sergey Kandaurov <pluknet@…>, 3 years ago

In 7804:4a9d28f8f39e/nginx:

Cancel keepalive and lingering close on EOF better (ticket #2145).

Unlike in 75e908236701, which added the logic to ngx_http_finalize_request(),
this change moves it to a more generic routine ngx_http_finalize_connection()
to cover cases when a request is finalized with NGX_DONE.

In particular, this fixes unwanted connection transition into the keepalive
state after receiving EOF while discarding request body. With edge-triggered
event methods that means the connection will last for extra seconds as set in
the keepalive_timeout directive.

comment:4 by Sergey Kandaurov, 3 years ago

Resolution: fixed
Status: accepted → closed

Fix committed, thanks for the report.

comment:5 by Sergey Kandaurov, 3 years ago (in reply to comment:2)

Replying to Anton Ovchinnikov:

the question is, can we rely on Nginx here doing the "right" thing of closing these half-broken connections first, and not the active/live ones?

That's correct. When approaching the worker_connections limit, nginx is expected to close the least recently used keepalive connections.
