Opened 2 years ago

Last modified 6 months ago

#2421 accepted enhancement

proxy_next_upstream_tries might be ignored with upstream keepalive

Reported by: fischerthiebaut@… Owned by:
Priority: minor Milestone:
Component: nginx-core Version: 1.23.x
Keywords: Cc:
uname -a: Linux static01-test 5.13.19-4-pve #1 SMP PVE 5.13.19-8 (Mon, 31 Jan 2022 10:09:37 +0100) x86_64 GNU/Linux
nginx -V:
nginx version: nginx/1.23.2
built by gcc 10.2.1 20210110 (Debian 10.2.1-6)
built with OpenSSL 1.1.1n 15 Mar 2022
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -ffile-prefix-map=/data/builder/debuild/nginx-1.23.2/debian/debuild-base/nginx-1.23.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'

Description

There is a bug with proxy_next_upstream_tries: it is ignored when nginx is under load and the upstream server closes the connection prematurely.
It only seems to occur when the connection is closed.
If enough requests trigger this error, it is possible to bring all the upstreams down for a certain time.

Here is a reproducer on Docker, but the issue was also observed on Debian 9/11 with nginx versions from 1.10 up to 1.23.0.

https://github.com/tibz7/nginx_next_upstream_retries_bug
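A minimal configuration of the kind that triggers this behaviour might look like the following sketch (this is not taken from the linked reproducer; server names, ports, and values are illustrative assumptions):

```nginx
upstream backend {
    # Illustrative servers and tuning values; adjust to your environment.
    server app1:8080 max_fails=3 fail_timeout=10s;
    server app2:8080 max_fails=3 fail_timeout=10s;
    server app3:8080 max_fails=3 fail_timeout=10s;

    # Keepalive (cached) connections to the upstream servers; relevant
    # because the reported retry behaviour only shows up with cached
    # connections.
    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;

        # HTTP/1.1 and a cleared Connection header are required for
        # upstream keepalive to work.
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Retry on errors, but limit the total number of servers tried.
        proxy_next_upstream error timeout;
        proxy_next_upstream_tries 2;
    }
}
```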

Change History (7)

comment:1 by Maxim Dounin, 2 years ago

Resolution: invalid
Status: new → closed

The proxy_next_upstream_tries directive limits the number of upstream servers tried by nginx. In your case, as per the no live upstreams error seen in the logs, all upstream servers are considered unavailable, and therefore there are no servers left to try. If you want nginx to ignore errors and not consider servers unavailable after errors, consider tuning max_fails and fail_timeout.
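For reference, such tuning might look like the following sketch (the server names and values are illustrative assumptions; per the nginx documentation, max_fails=0 disables the accounting of failed attempts entirely):

```nginx
upstream backend {
    # max_fails=0: failed attempts are not counted, so this server is
    # never marked unavailable.
    server app1:8080 max_fails=0;

    # Alternatively, raise max_fails and/or shorten fail_timeout to
    # tolerate more transient errors before marking a server down.
    server app2:8080 max_fails=10 fail_timeout=5s;
}
```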

comment:2 by fischerthiebaut@…, 2 years ago

Resolution: invalid
Status: closed → reopened

Yes, max_fails AND fail_timeout are used in my configuration. I give more details about the bug in the repository.
Of course it is possible that all upstreams go down; that is an expectable consequence, not the bug itself.
The real issue is that under load these settings are bypassed:
instead of seeing 2 error messages (as set by proxy_next_upstream_tries),
there will be one per upstream (so 6 in my example),
which can then naturally lead to a no live upstreams error.

comment:3 by Maxim Dounin, 2 years ago

Status: reopened → accepted
Summary: proxy_next_upstream_tries ignored under some circumstances → proxy_next_upstream_tries might be ignored with upstream keepalive
Type: defect → enhancement

Sorry, I've misread what you are actually complaining about.

It looks like your configuration uses keepalive connections to upstream servers, and the error you are generating is indistinguishable from an asynchronous close event. As such, when a cached connection to an upstream server is used, nginx retries the request regardless of proxy_next_upstream_tries until all servers are tried (or until a non-cached connection is used, if there is only one upstream server). See ticket #2022 for details.

In recent editions of the HTTP standard, the requirements for retrying requests are relaxed compared to RFC 2616 referenced above, so it might make sense to relax nginx's behaviour somehow so that it won't unconditionally retry such requests.

comment:4 by fischerthiebaut@…, 2 years ago

Thank you for your reply. So if I understand correctly, this is not a bug, it is a feature?
I still don't completely understand why this behaviour would happen only under load.
It will try the first cached connection for the first available upstream; once that is closed, why does it need to try all of them and bypass proxy_next_upstream_tries?

comment:5 by Maxim Dounin, 2 years ago

RFC 2616 requires clients to retry requests if a request used a cached connection and it was closed by the server while the client was sending the request.

To make sure such requests are retried, nginx retries them unconditionally, ignoring the number of tries left (which is normally the number of servers in the upstream, but can be set to a lower value with proxy_next_upstream_tries). Most notably, this makes it possible to retry requests to an upstream with just one server (normally, such requests are never retried). As a side effect, though, this makes proxy_next_upstream_tries ineffective for such requests, since the number of tries is ignored.

That is, as implemented, nginx will work as follows:

  • It will select a server according to the configured balancing method.
  • If there are cached connections to this server, it will use one.
  • If a cached connection is closed while nginx was sending the request, nginx will retry the request, repeating the above steps.

For upstreams with just one server, this essentially means that nginx will retry as long as there are cached connections. For upstreams with multiple servers, nginx keeps track of the servers it has tried, and the first step will fail once all servers have been tried.

Hope this better explains how it works now, and why the side effects observed do appear. It's an open question how to change this behaviour to make it less surprising.
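As a consequence of the retry logic described above, even a single-server upstream with keepalive can see retries, which otherwise never happens. A minimal sketch of such a configuration (the server name, port, and values are illustrative assumptions):

```nginx
upstream single {
    # Only one server: requests to it are normally never retried...
    server app1:8080;

    # ...but cached connections make unconditional retries possible when
    # the server closes a connection while nginx is sending a request.
    keepalive 16;
}

server {
    listen 80;
    location / {
        proxy_pass http://single;
        proxy_http_version 1.1;        # required for upstream keepalive
        proxy_set_header Connection "";
    }
}
```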

comment:6 by fischerthiebaut@…, 2 years ago

OK, thank you for the explanation. But from our point of view, it is still surprising that even with these limits set we can bring all upstreams down.

comment:7 by Konrad Baumgart, 6 months ago

Retries after asynchronous close events need a limit.

I found a request with 9707 attempts to connect to the upstream in my log; the truncated JSON log looks like:

{
  "upstream_status": "502, 502, 502, <…>, 502, -",
  "upstream_response_time": "0.003, 0.001, 0.002, <…>, 0.002, 0.002",
  "request_time": 14.561,
  "status": 499,
  <…>

and errors look like

2024/06/07 11:47:25 [error] 52738#52738: *48268611 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 37.….….…, server: …, request: "GET …

I have a busy nginx server, 1.21.6 (inside ingress-nginx 1.9.4), proxy_next_upstream_tries is set to 2, and the poisoned request caused the upstream to break connections while reading the request line.

Last edited 6 months ago by Konrad Baumgart