#64 closed defect (fixed)
Nginx discards alive upstreams and returns 502
Reported by: | Vladimir Protasov | Owned by: | somebody |
---|---|---|---|
Priority: | minor | Milestone: | |
Component: | nginx-module | Version: | 1.0.x |
Keywords: | upstream, proxy, load balancing | Cc: | |
uname -a: | Linux tst-hostname 2.6.32-21-generic-pae #32-Ubuntu SMP Fri Apr 16 09:39:35 UTC 2010 i686 GNU/Linux | | |
nginx -V:
nginx version: nginx/1.0.10
built by gcc 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
configure arguments: --prefix=/usr/local/nginx --with-debug
Description
We've set up Nginx as a load balancer, but sometimes it returns a 502 instead of the page contents.
We've found that the problem happens when some of the backends go down for maintenance (while the others are alive and look healthy).
It looks like Nginx sometimes decides that live servers are down when they aren't.
Everything was tested under very low load (32 requests performed from a single client in a few threads).
The problem is easily reproducible and affects all Nginx versions from 0.9.4 up to 1.1.9. It's harder to reproduce in 0.8.55, but it looks like that version is affected too.
I'm attaching a sample minimized configuration with debug logs for the latest stable version, which should help to reproduce and figure out the problem.
Also, the problem was reproduced on CentOS 6.0 with backends running Apache, Nginx and IIS, on both Windows and Linux (wherever possible).
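For illustration, a minimal configuration of the kind described above, with one live backend and the rest stopped for maintenance, might look like the following sketch (the addresses and ports are illustrative, not taken from the attached file):
```
upstream backends {
    server 127.0.0.1:8080;  # alive
    server 127.0.0.1:8081;  # stopped for maintenance
    server 127.0.0.1:8082;  # stopped for maintenance
}

server {
    listen 80;

    location / {
        proxy_pass http://backends;
    }
}
```
With two of the three backends unreachable, even a few concurrent requests are enough to hit the failure: nginx exhausts its tries on the dead peers and answers 502 without ever contacting the live one.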
Attachments (5)
Change History (12)
by , 13 years ago
Attachment: access.80.debug.log added

by , 13 years ago
Attachment: errors.8080.debug.log added
Debug log from the backend (there is nothing interesting in it).

by , 13 years ago
Attachment: errors.80.debug.log.gz added
Gzipped debug log from the frontend (the interesting things are at the end).
comment:1 by , 13 years ago
Status: new → accepted
Ack, I see the problem. Relevant lines of the debug log:
2011/11/30 13:53:57 [debug] 7699#0: *11 http init upstream, client timer: 0
2011/11/30 13:53:57 [debug] 7699#0: *11 get rr peer, try: 3
2011/11/30 13:53:57 [debug] 7699#0: *11 get rr peer, current: 2 1
2011/11/30 13:53:57 [debug] 7699#0: *11 get rr peer, current: 1 1
...
2011/11/30 13:54:02 [debug] 7699#0: *11 get rr peer, try: 1
2011/11/30 13:54:02 [error] 7699#0: *11 no live upstreams ...
On the first try nginx selects peer 2 in ngx_http_upstream_get_peer() (based on current weight), rejects it as down, then goes into ngx_http_upstream_get_peer() again and selects peer 1. Peer 1 eventually times out. On the second try it moves on to the next peer, i.e. peer 2, which has already been tried, and returns EBUSY / "no live upstreams". Peer 0 (the one which is alive) is never tried.
comment:2 by , 13 years ago
Strange. All the servers in the upstream have the same weight, so why does nginx decide not to try the live upstream?
Anyway, it causes 502 errors to appear on our production servers. We've worked around it by marking servers as "down" during maintenance, but that's not a good solution, so we'd be glad to know when it will be fixed (approximately, of course).
Thanks in advance.
comment:3 by , 13 years ago
It's not "strange", it's just a bug in upstream selection logic which manifest itself if there are more than one dead backend.
No timeline yet for the fix, though you may want to try 1.1.x as it has improved dead backend detection/recheck logic. The bug is still present there, but it has much lower chance to happen.
And, BTW, explicitly removing backends during maintenance is a good idea, regardless of the bug. It's not really a workaround, it's what you are expected to do in a first place.
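A minimal sketch of that, again with illustrative addresses: the down parameter of the server directive takes a backend out of rotation explicitly, so the balancer never counts it among the peers to try.
```
upstream backends {
    server 127.0.0.1:8080;
    server 127.0.0.1:8081 down;  # explicitly removed during maintenance
    server 127.0.0.1:8082 down;  # explicitly removed during maintenance
}
```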
comment:4 by , 13 years ago
Yep, but sometimes that's not possible: the developers of parts of the application shouldn't have access to the load balancer, and the admins have a lot of other things to do, so it's not as good an idea in practice as it should be.
Okay, we'll switch to 1.1.9 and wait for the fix, thanks.
Requests from the test client