Opened 4 years ago
Closed 4 years ago
#2096 closed defect (fixed)
proxy_next_upstream returns 502 Bad Gateway when one of the servers is down
Reported by: | | Owned by: |
---|---|---|---
Priority: | minor | Milestone: |
Component: | nginx-module | Version: | 1.18.x
Keywords: | ngx_http_proxy_module, proxy_next_upstream | Cc: |

uname -a: Linux pszemus-legion 4.19.128-microsoft-standard #1 SMP Tue Jun 23 12:58:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

nginx -V:
nginx version: nginx/1.18.0
built by gcc 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
built with OpenSSL 1.1.1d FIPS 10 Sep 2019 (running with OpenSSL 1.1.1g FIPS 21 Apr 2020)
TLS SNI support enabled
configure arguments: --prefix=/usr/share/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib64/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --http-client-body-temp-path=/var/lib/nginx/tmp/client_body --http-proxy-temp-path=/var/lib/nginx/tmp/proxy --http-fastcgi-temp-path=/var/lib/nginx/tmp/fastcgi --http-uwsgi-temp-path=/var/lib/nginx/tmp/uwsgi --http-scgi-temp-path=/var/lib/nginx/tmp/scgi --pid-path=/run/nginx.pid --lock-path=/run/lock/subsys/nginx --user=nginx --group=nginx --with-file-aio --with-ipv6 --with-http_ssl_module --with-http_v2_module --with-http_realip_module --with-stream_ssl_preread_module --with-http_addition_module --with-http_xslt_module=dynamic --with-http_image_filter_module=dynamic --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_random_index_module --with-http_secure_link_module --with-http_degradation_module --with-http_slice_module --with-http_stub_status_module --with-http_perl_module=dynamic --with-http_auth_request_module --with-mail=dynamic --with-mail_ssl_module --with-pcre --with-pcre-jit --with-stream=dynamic --with-stream_ssl_module --with-google_perftools_module --with-debug --with-cc-opt='-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection' --with-ld-opt='-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -Wl,-E'
Description
Using proxy_next_upstream gives inconsistent behaviour when one of the upstream's servers is down. Take this simple configuration:
events {}

http {
    upstream test_upstream {
        server postman-echo.com;
        server postman-echo.com;
    }

    server {
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504 http_403 http_404;

        location / {
            proxy_pass http://test_upstream;
        }
    }
}
Every request to postman-echo.com that returns 500/502/503/504/403/404 is repeated to the other server.
E.g.:

$ curl -i localhost/status/403
HTTP/1.1 403 Forbidden
Server: nginx/1.18.0
[...]
In the above example the first server from the upstream group returned 403, then the second one returned 403, and finally nginx returned 403 to the client (curl). The same happens with the rest of the HTTP codes configured in proxy_next_upstream.
But as soon as one of the servers is marked down:
upstream test_upstream {
    server postman-echo.com;
    server postman-echo.com down;
}
nginx reports:
2020/11/16 08:55:18 [error] 559#0: *1 no live upstreams while connecting to upstream, client: 127.0.0.1, server: , request: "GET /status/403 HTTP/1.1", upstream: "http://test_upstream/status/403", host: "localhost"
and returns 502 Bad Gateway:
$ curl -i localhost/status/403
HTTP/1.1 502 Bad Gateway
Server: nginx/1.18.0
[...]
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.18.0</center>
</body>
</html>
The "no live upstreams" message is odd because there is a live upstream: the first server is up. I'd expect the consistent behaviour to be to return 403 from the first (available) upstream server.
Attachments (1)
Change History (8)
by , 4 years ago
Attachment: nginx-proxy_next_upstream-test.conf added
comment:1 by , 4 years ago (follow-up: comment:3)
What's also weird is that when one of the upstream servers is marked down, $upstream_addr and $upstream_status have an additional entry named after the upstream name, e.g.:
GET /status/403 HTTP/1.1 502 upstreams:[52.7.61.87:80, 107.23.124.180:80, test_upstream], upstream_status:[403, 403, 502]
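Entries like this come from a custom access log; a log_format roughly along these lines would produce them (an assumed sketch, not a copy of the attached configuration):

    # http-level logging, approximating the entries quoted above
    log_format upstreams_log '[$time_local] $request $status '
                             'upstreams:[$upstream_addr], '
                             'upstream_status:[$upstream_status]';
    access_log /var/log/nginx/access.log upstreams_log;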
comment:2 by , 4 years ago
Status: new → accepted
Thanks, it looks like the problem is that the number of tries is set from the total number of upstream servers, regardless of the down flag in the configuration. The flag probably should be taken into account.

A quick and dirty workaround would be to use proxy_next_upstream_tries 1; (or 2, given that the name in the configuration snippet resolves to two IP addresses).
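Applied to the configuration from the description, a minimal sketch of that workaround could look like this (the rest of the config stays as posted):

    http {
        upstream test_upstream {
            server postman-echo.com;
            server postman-echo.com down;
        }

        server {
            proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504 http_403 http_404;
            # allow only one attempt, so the 403 from the live server is returned
            # instead of "no live upstreams" / 502
            proxy_next_upstream_tries 1;

            location / {
                proxy_pass http://test_upstream;
            }
        }
    }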
comment:3 by , 4 years ago
Replying to pszemus@…:
What's also weird is that when one of the upstream servers is marked down, $upstream_addr and $upstream_status have an additional entry named after the upstream name, e.g.:
GET /status/403 HTTP/1.1 502 upstreams:[52.7.61.87:80, 107.23.124.180:80, test_upstream], upstream_status:[403, 403, 502]
That's normal: as long as nginx tries to select an upstream server but fails to do so due to the "no live upstreams" error (all servers being already tried or unavailable), the $upstream_addr variable contains the upstream name in the relevant position. This is explicitly documented: "If a server cannot be selected, the variable keeps the name of the server group".
comment:4 by , 4 years ago
Thanks, Maxim, for that clarification. There's one more thing:

If I define a server group with one server failing (e.g. a connection failure) and the second returning a valid response (e.g. 403), then, using the above nginx configuration, I get:
2020/11/17 10:40:58 [error] 871#0: *1 upstream timed out (110: Connection timed out) while connecting to upstream, client: 127.0.0.1, server: , request: "GET /status/403 HTTP/1.1", upstream: "http://52.7.61.87:666/status/403", host: "localhost"
[17/Nov/2020:10:40:58 +0100] GET /status/403 HTTP/1.1 504 upstreams:[52.7.61.87:80, 52.7.61.87:666], upstream_status:[403, 504]
2020/11/17 10:41:02 [error] 871#0: *4 no live upstreams while connecting to upstream, client: 127.0.0.1, server: , request: "GET /status/403 HTTP/1.1", upstream: "http://test_upstream/status/403", host: "localhost"
[17/Nov/2020:10:41:02 +0100] GET /status/403 HTTP/1.1 502 upstreams:[52.7.61.87:80, test_upstream], upstream_status:[403, 502]
As you can see, the first request returned 504 (Gateway Time-out), as the first server from the group responded 403 and the second one timed out. Then nginx marked the second server as unhealthy and stopped sending requests to it. The subsequent requests I sent returned 502 (Bad Gateway) with the message "no live upstreams", even with the first server still being healthy. Nginx should return 403 from the first (healthy) server here.

I think this is similar to the original issue and it's, again, because of the wrong number of healthy servers in the group, which should be computed dynamically.
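For reference, a server group reproducing this second scenario would look roughly like this (a sketch based on the logged addresses above, not the exact attached configuration):

    upstream test_upstream {
        server postman-echo.com;        # resolves to 52.7.61.87:80, answers 403
        server postman-echo.com:666;    # same host on a closed port, connect times out
    }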
comment:5 by , 4 years ago
The subsequent requests I sent returned 502 (Bad Gateway) with the message "no live upstreams", even with the first server still being healthy. Nginx should return 403 from the first (healthy) server here.
This is not how it works. As long as the response matches the proxy_next_upstream list and there are additional tries left, nginx will not pass that response (403 in your case) to the client, and will instead try to obtain a response from the remaining servers. In your case those servers are either not responding (so 504 is returned) or disabled due to previous errors (so 502 is returned).
I think this is similar to the original issue and it's, again, because of the wrong number of healthy servers in the group, which should be computed dynamically.
While this is indeed similar to the original issue, there is a difference: while it is easy enough to reduce the number of tries when parsing a configuration with the down flag, it is hardly possible to compute the number of tries dynamically. It is not known which servers can or cannot be used at any particular moment in time, so the only option is to actually try to select a server, and the "no live upstreams" error means nginx wasn't able to.

In theory it might be possible to rewrite the balancing in a way that is more friendly to configurations like the ones in your tests: that is, before trying to switch to the next upstream server, look it up to see whether one can be selected, or even try to connect to the one selected. But, while theoretically possible, this would seriously complicate the code and is hardly worth the effort.
So the current behaviour is as follows: once a response is not good enough to be returned to the client (as per proxy_next_upstream), and the number of attempts is less than the static number allowed (proxy_next_upstream_tries, which defaults to the number of upstream servers configured), nginx will try to make a request to another server.
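Spelled out against the two-server configuration from the description, this means roughly the following:

    upstream test_upstream {
        server postman-echo.com;        # tried first, responds 403
        server postman-echo.com down;   # never selectable, but still counted
    }

    # proxy_next_upstream_tries is not set, so the limit defaults to the number of
    # configured servers (2, including the one marked down): after the first 403
    # nginx wants a second try, cannot select a live server, logs
    # "no live upstreams" and returns 502 instead of the 403 it already received.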
The original issue is about providing a better default for proxy_next_upstream_tries if there are servers marked down. There are certainly no plans to rewrite things to use a completely different approach.
comment:7 by , 4 years ago
Resolution: → fixed
Status: accepted → closed
nginx-proxy_next_upstream-test.conf: Simple test nginx configuration