Opened 4 years ago

Closed 4 years ago

#2026 closed defect (invalid)

Excessive attempts to reconnect when upstream connection refused

Reported by: Nuru
Owned by:
Priority: major
Milestone:
Component: nginx-core
Version: 1.19.x
Keywords:
Cc:
uname -a: Linux ingress-nginx-ingress-controller-pp28p 4.14.181-142.260.amzn2.x86_64 #1 SMP Wed Jun 24 19:07:39 UTC 2020 x86_64 Linux
nginx -V: nginx version: nginx/1.19.1
built by gcc 9.2.0 (Alpine 9.2.0)
built with OpenSSL 1.1.1g 21 Apr 2020
TLS SNI support enabled
configure arguments: --prefix=/usr/local/nginx --conf-path=/etc/nginx/nginx.conf --modules-path=/etc/nginx/modules --http-log-path=/var/log/nginx/access.log --error-log-path=/var/log/nginx/error.log --lock-path=/var/lock/nginx.lock --pid-path=/run/nginx.pid --http-client-body-temp-path=/var/lib/nginx/body --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-proxy-temp-path=/var/lib/nginx/proxy --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-compat --with-pcre-jit --with-http_ssl_module --with-http_stub_status_module --with-http_realip_module --with-http_auth_request_module --with-http_addition_module --with-http_dav_module --with-http_geoip_module --with-http_gzip_static_module --with-http_sub_module --with-http_v2_module --with-stream --with-stream_ssl_module --with-stream_realip_module --with-stream_ssl_preread_module --with-threads --with-http_secure_link_module --with-http_gunzip_module --with-file-aio --without-mail_pop3_module --without-mail_smtp_module --without-mail_imap_module --without-http_uwsgi_module --without-http_scgi_module --with-cc-opt='-g -Og -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wno-deprecated-declarations -fno-strict-aliasing -D_FORTIFY_SOURCE=2 --param=ssp-buffer-size=4 -DTCP_FASTOPEN=23 -fPIC -Wno-cast-function-type -I/root/.hunter/_Base/2c5c6fc/d64af22/92161a9/Install/include -m64 -mtune=native' --with-ld-opt='-fPIE -fPIC -pie -Wl,-z,relro -Wl,-z,now -L/root/.hunter/_Base/2c5c6fc/d64af22/92161a9/Install/lib' --user=www-data --group=www-data --add-module=/tmp/build/ngx_devel_kit-0.3.1 --add-module=/tmp/build/set-misc-nginx-module-0.32 --add-module=/tmp/build/headers-more-nginx-module-0.33 --add-module=/tmp/build/nginx-http-auth-digest-cd8641886c873cf543255aeda20d23e4cd603d05 --add-module=/tmp/build/ngx_http_substitutions_filter_module-bc58cb11844bc42735bbaef7085ea86ace46d05b --add-module=/tmp/build/lua-nginx-module-0.10.17 --add-module=/tmp/build/stream-lua-nginx-module-0.0.8 --add-module=/tmp/build/lua-upstream-nginx-module-0.07 --add-module=/tmp/build/nginx-influxdb-module-5b09391cb7b9a889687c0aa67964c06a2d933e8b --add-dynamic-module=/tmp/build/nginx-opentracing-0.9.0/opentracing --add-dynamic-module=/tmp/build/ModSecurity-nginx-b55a5778c539529ae1aa10ca49413771d52bb62e --add-dynamic-module=/tmp/build/ngx_http_geoip2_module-3.3 --add-module=/tmp/build/nginx_ajp_module-bf6cd93f2098b59260de8d494f0f4b1f11a84627 --add-module=/tmp/build/ngx_brotli

Description (last modified by Nuru)

Using the TCP load balancer, when the upstream refuses the connection, Nginx immediately retries without adequate throttling. I am seeing 4,000 retries per second with log entries like:

2020-08-16T06:21:03.662449141Z 2020/08/16 06:21:03 [error] 78#78: *15771 connect() failed (111: Connection refused) while connecting to upstream, client: 10.105.13.228, server: 0.0.0.0:26808, upstream: "10.105.0.30:26808", bytes from/to client:0/0, bytes from/to upstream:0/0
2020-08-16T06:21:03.662452948Z 2020/08/16 06:21:03 [error] 78#78: *15500 connect() failed (111: Connection refused) while connecting to upstream, client: 10.105.13.228, server: 0.0.0.0:26808, upstream: "10.105.0.30:26808", bytes from/to client:0/0, bytes from/to upstream:0/0

This causes a dramatic increase in CPU and Memory usage. I am not sure whether the client is retrying that quickly (the client is the Amazon Web Services Network Load Balancer health check), but even if it is, Nginx should throttle upstream connection attempts according to the Passive TCP Health Checks documentation:

The default values are 10 seconds and 1 attempt. So if a connection attempt times out or fails at least once in a 10‑second period, NGINX marks the server as unavailable for 10 seconds.
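
For context, the settings that quote refers to are the max_fails and fail_timeout parameters of the server directive; the sketch below writes the documented defaults out explicitly (the address is taken from the log excerpt above and is purely illustrative):

stream {
        upstream backend {
                # 1 failed connection attempt within a 10-second window marks
                # the server as unavailable for 10 seconds (the documented defaults).
                server 10.105.0.30:26808 max_fails=1 fail_timeout=10s;
        }

        server {
                listen      26808;
                proxy_pass  backend;
        }
}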

See also: https://github.com/kubernetes/ingress-nginx/issues/5425

Configuration excerpt (configuration created by ingress-nginx v0.34.1):

stream {
        ...

        upstream upstream_balancer {
                server 0.0.0.1:1234; # placeholder

                balancer_by_lua_block {
                        tcp_udp_balancer.balance()
                }
        }

        server {
                preread_by_lua_block {
                        ngx.var.proxy_upstream_name="tcp-example-26808";
                }

                listen                  26808;

                proxy_timeout           600s;
                proxy_pass              upstream_balancer;

        }
}

Change History (3)

comment:1 by Nuru, 4 years ago

Please add me to CC list. Seems I cannot do that myself.

comment:2 by Nuru, 4 years ago

Description: modified (diff)

comment:3 by Maxim Dounin, 4 years ago

Resolution: invalid
Status: new → closed

Nginx immediately retries without adequate throttling

As per the logs provided, nginx does not retry at all. Instead, the next connection attempt is the result of another connection from the client (the differing connection numbers, *15771 and *15500, correspond to separate client connections).

Nginx should throttle upstream connection attempts according to the Passive TCP Health Checks documentation:

The max_fails and fail_timeout parameters of the server directive in the upstream block only apply when there is more than one server configured, so nginx can avoid directing traffic to "dead" servers. If there is only one server, nginx simply maps all connection attempts to that server.

If you want nginx to throttle upstream connection attempts for some reason despite the fact that you only have one upstream server, you can configure the same server twice, so the max_fails and fail_timeout parameters will take effect. Note though that there isn't much difference from a performance point of view, as the root cause of the "dramatic increase in CPU and Memory usage" you observed is the client's connection attempts.
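
A minimal sketch of that workaround, assuming a plain stream upstream rather than the Lua-balanced configuration generated by ingress-nginx (address and parameter values are illustrative):

stream {
        upstream backend {
                # The same address is listed twice so that max_fails/fail_timeout
                # take effect: an entry that fails once within 10 seconds is
                # marked unavailable for 10 seconds.
                server 10.105.0.30:26808 max_fails=1 fail_timeout=10s;
                server 10.105.0.30:26808 max_fails=1 fail_timeout=10s;
        }

        server {
                listen      26808;
                proxy_pass  backend;
        }
}

As noted above, this only changes how nginx treats the upstream once a client connection has been accepted; it does not reduce the rate of incoming client (health-check) connections.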
