Opened 2 weeks ago

Closed 5 days ago

#2096 closed defect (fixed)

proxy_next_upstream returns 502 Bad Gateway when one of the servers is down

Reported by: pszemus@…
Owned by:
Priority: minor
Milestone:
Component: nginx-module
Version: 1.18.x
Keywords: ngx_http_proxy_module, proxy_next_upstream
Cc:
uname -a: Linux pszemus-legion 4.19.128-microsoft-standard #1 SMP Tue Jun 23 12:58:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.18.0
built by gcc 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
built with OpenSSL 1.1.1d FIPS 10 Sep 2019 (running with OpenSSL 1.1.1g FIPS 21 Apr 2020)
TLS SNI support enabled
configure arguments: --prefix=/usr/share/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib64/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --http-client-body-temp-path=/var/lib/nginx/tmp/client_body --http-proxy-temp-path=/var/lib/nginx/tmp/proxy --http-fastcgi-temp-path=/var/lib/nginx/tmp/fastcgi --http-uwsgi-temp-path=/var/lib/nginx/tmp/uwsgi --http-scgi-temp-path=/var/lib/nginx/tmp/scgi --pid-path=/run/nginx.pid --lock-path=/run/lock/subsys/nginx --user=nginx --group=nginx --with-file-aio --with-ipv6 --with-http_ssl_module --with-http_v2_module --with-http_realip_module --with-stream_ssl_preread_module --with-http_addition_module --with-http_xslt_module=dynamic --with-http_image_filter_module=dynamic --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_random_index_module --with-http_secure_link_module --with-http_degradation_module --with-http_slice_module --with-http_stub_status_module --with-http_perl_module=dynamic --with-http_auth_request_module --with-mail=dynamic --with-mail_ssl_module --with-pcre --with-pcre-jit --with-stream=dynamic --with-stream_ssl_module --with-google_perftools_module --with-debug --with-cc-opt='-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection' --with-ld-opt='-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -Wl,-E'

Description

Using proxy_next_upstream gives inconsistent behaviour when one of the upstream's servers is down.

Take this simple configuration:

events {}

http {

    upstream test_upstream {
        server postman-echo.com;
        server postman-echo.com;
    }

    server {

        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504 http_403 http_404;

        location / {
            proxy_pass http://test_upstream;
        }

    }

}

Every request to postman-echo.com that returns 500/502/503/504/403/404 is repeated to the other server.

E.g.

$ curl -i localhost/status/403
HTTP/1.1 403 Forbidden
Server: nginx/1.18.0
[...]

In the above example, the first server in the upstream group returned 403, then the second one returned 403, and finally nginx returned 403 to the client (curl). The same happens with the other HTTP codes configured in proxy_next_upstream.

But as soon as one of the servers is marked down:

upstream test_upstream {
    server postman-echo.com;
    server postman-echo.com down;
}

nginx reports:

2020/11/16 08:55:18 [error] 559#0: *1 no live upstreams while connecting to upstream, client: 127.0.0.1, server: , request: "GET /status/403 HTTP/1.1", upstream: "http://test_upstream/status/403", host: "localhost"

and returns 502 Bad Gateway:

$ curl -i localhost/status/403
HTTP/1.1 502 Bad Gateway
Server: nginx/1.18.0
[...]

<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.18.0</center>
</body>
</html>

The "no live upstreams" message is odd because there is a live upstream: the first server is live. I'd expect the consistent behaviour to be returning 403 from the first (available) upstream server.

Attachments (1)

nginx-proxy_next_upstream-test.conf (338 bytes ) - added by pszemus@… 2 weeks ago.
Simple test nginx configuration


Change History (8)

by pszemus@…, 2 weeks ago

Simple test nginx configuration

comment:1 by pszemus@…, 2 weeks ago

What's also weird is that when one of the upstream servers is marked down $upstream_addr and $upstream_status have an additional entry named after an upstream name, eg:

GET /status/403 HTTP/1.1 502 upstreams:[52.7.61.87:80, 107.23.124.180:80, test_upstream], upstream_status:[403, 403, 502]

comment:2 by Maxim Dounin, 2 weeks ago

Status: new → accepted

Thanks, this looks like the problem is that the number of tries is set after the total number of upstream servers, regardless of the down flag in the configuration. The down flag probably should be taken into account.

A quick and dirty workaround would be to use proxy_next_upstream_tries 1; (or 2, given that the name in the configuration snippet resolves to two IP addresses).
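
A minimal sketch of that workaround, applied to the test configuration from the description. The value 2 is an assumption based on the two addresses visible in comment:1; if the name resolves to a different number of addresses, the value would need adjusting.

```nginx
events {}

http {

    upstream test_upstream {
        server postman-echo.com;
        server postman-echo.com down;
    }

    server {

        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504 http_403 http_404;

        # Workaround: cap the number of tries explicitly so that the
        # server marked "down" is not counted as an available try.
        proxy_next_upstream_tries 2;

        location / {
            proxy_pass http://test_upstream;
        }

    }

}
```

With the cap in place, the response from the last server actually tried should be passed to the client once the tries are used up, rather than a further selection attempt ending in "no live upstreams".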

in reply to:  1 comment:3 by Maxim Dounin, 2 weeks ago

Replying to pszemus@…:

What's also weird is that when one of the upstream servers is marked down $upstream_addr and $upstream_status have an additional entry named after an upstream name, eg:

GET /status/403 HTTP/1.1 502 upstreams:[52.7.61.87:80, 107.23.124.180:80, test_upstream], upstream_status:[403, 403, 502]

That's normal. When nginx tries to select an upstream server but fails to do so due to the "no live upstreams" error (all servers being already tried or unavailable), the $upstream_addr variable contains the upstream name in the relevant position. This is explicitly documented: "If a server cannot be selected, the variable keeps the name of the server group".
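
For context, log lines in the shape quoted above can be produced with a log_format along the following lines. This is a guess at the reporter's setup, not their actual configuration: the format name upstream_debug is hypothetical, and the log path is simply the one from the configure arguments.

```nginx
http {
    # Hypothetical format name. Prints every upstream tried for a request
    # together with the status obtained from each try.
    log_format upstream_debug '[$time_local] $request $status '
                              'upstreams:[$upstream_addr], '
                              'upstream_status:[$upstream_status]';

    access_log /var/log/nginx/access.log upstream_debug;
}
```

When several servers are contacted while processing one request, both variables list the entries separated by commas, which is why a failed selection appears as the group name at the end of the list.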

comment:4 by pszemus@…, 2 weeks ago

Thanks Maxim for that clarification.

There's one more thing:
If I define a server group with one server failing (e.g. connection failure) and the second returning a valid response (e.g. 403) then, using the above nginx configuration, I get:

2020/11/17 10:40:58 [error] 871#0: *1 upstream timed out (110: Connection timed out) while connecting to upstream, client: 127.0.0.1, server: , request: "GET /status/403 HTTP/1.1", upstream: "http://52.7.61.87:666/status/403", host: "localhost"
[17/Nov/2020:10:40:58 +0100] GET /status/403 HTTP/1.1 504 upstreams:[52.7.61.87:80, 52.7.61.87:666], upstream_status:[403, 504]
2020/11/17 10:41:02 [error] 871#0: *4 no live upstreams while connecting to upstream, client: 127.0.0.1, server: , request: "GET /status/403 HTTP/1.1", upstream: "http://test_upstream/status/403", host: "localhost"
[17/Nov/2020:10:41:02 +0100] GET /status/403 HTTP/1.1 502 upstreams:[52.7.61.87:80, test_upstream], upstream_status:[403, 502]

As you can see, the first request returned 504 (Gateway Time-out) because the first server in the group responded with 403 and the second one timed out. Then nginx marked the second server as unhealthy and stopped sending requests to it. The subsequent requests I sent returned 502 (Bad Gateway) with the message "no live upstreams", even with the first server still being healthy. Nginx should return 403 from the first (healthy) server here.

I think this is similar to the original issue and it's, again, because of the wrong number of healthy servers in the group, which should be computed dynamically.
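
For reference, a server group consistent with the addresses in the log lines above might look like the following. It is reconstructed from the log output, not taken from the reporter's file; max_fails and fail_timeout are written out with their documented default values only to make the "marked as unhealthy" step visible.

```nginx
upstream test_upstream {
    # Responds with the requested status (403 in the test).
    server 52.7.61.87:80  max_fails=1 fail_timeout=10s;

    # Assumed to have nothing listening: connection attempts time out,
    # and after one failed attempt the server is skipped for fail_timeout.
    server 52.7.61.87:666 max_fails=1 fail_timeout=10s;
}
```

The second request in the log arrives four seconds after the timeout, which is inside the 10s fail_timeout window, so only the first server is selectable at that point.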

comment:5 by Maxim Dounin, 2 weeks ago

The subsequent requests I sent returned 502 (Bad Gateway) with the message "no live upstreams", even with the first server still being healthy. Nginx should return 403 from the first (healthy) server here.

This is not how it works. As long as the response status is listed in proxy_next_upstream, and there are additional tries left, nginx will stop reading the response from the upstream server (403 in your case), and will try to obtain a response from the additional servers left. In your case these servers are either not responding (so 504 is returned) or disabled due to previous errors (so 502 is returned).

I think this is similar to the original issue and it's, again, because of the wrong number of healthy servers in the group, which should be computed dynamically.

While this is indeed similar to the original issue, there is a difference: while it is easy enough to reduce the number of tries while parsing a configuration with the down flag, it is hardly possible to compute the number of tries dynamically. It is not known which servers can or cannot be used at a particular moment in time. So the only option is to actually try to select a server. And the "no live upstreams" error means nginx wasn't able to.

In theory it might be possible to rewrite the balancing in a way which is more friendly to configurations like the ones in your tests. That is, before trying to switch to a next upstream server, look it up to see if we can select one. Or even try to connect to the one selected. But, while theoretically possible, it would seriously complicate the code and is hardly worth the effort.

So the current behaviour is as follows: once the response is not good enough to be returned to the client (as per proxy_next_upstream), and the number of attempts is less than the static number allowed (proxy_next_upstream_tries, defaults to the number of upstream servers configured), nginx will try to send the request to another server.

The original issue is about providing a better default to proxy_next_upstream_tries if there are servers marked down. Certainly there are no plans to rewrite things to a completely different approach.

comment:6 by Ruslan Ermilov <ru@…>, 5 days ago

In 7750:90cc7194e993/nginx:

Upstream: excluded down servers from the next_upstream tries.

Previously, the number of next_upstream tries included servers marked
as "down", resulting in "no live upstreams" with the code 502 instead
of the code derived from an attempt to connect to the last tried "up"
server (ticket #2096).

comment:7 by Ruslan Ermilov, 5 days ago

Resolution: fixed
Status: accepted → closed