Opened 9 years ago

Closed 8 years ago

Last modified 8 years ago

#64 closed defect (fixed)

Nginx discards alive upstreams and returns 502

Reported by: Vladimir Protasov
Owned by: somebody
Priority: minor
Milestone:
Component: nginx-module
Version: 1.0.x
Keywords: upstream, proxy, load balancing
Cc:
uname -a: Linux tst-hostname 2.6.32-21-generic-pae #32-Ubuntu SMP Fri Apr 16 09:39:35 UTC 2010 i686 GNU/Linux
nginx -V: nginx: nginx version: nginx/1.0.10
nginx: built by gcc 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
nginx: configure arguments: --prefix=/usr/local/nginx --with-debug

Description

We've set up Nginx as a load balancer, but sometimes it returns 502 instead of the page contents.
We've found that the problem happens when some of the backends go down for maintenance (while the others are alive and look healthy).
It looks like Nginx sometimes decides that alive servers are down when they aren't.

Everything was tested under very low load (32 requests were made from a single client in a few threads).

The problem is easily reproducible and affects all Nginx versions from 0.9.4 up to 1.1.9. It's harder to reproduce in 0.8.55, but it looks like that version is affected too.

I'm attaching a minimized sample configuration with debug logs for the latest stable version, which should help to reproduce and figure out the problem.

Also, the problem was reproduced on CentOS 6.0 with backends running Apache, Nginx and IIS on both Windows and Linux (wherever that was possible).

Attachments (5)

access.80.debug.log (2.8 KB) - added by Vladimir Protasov 9 years ago.
Requests from test client
access.8080.debug.log (1.9 KB) - added by Vladimir Protasov 9 years ago.
Access log on the alive upstream.
errors.8080.debug.log (139.0 KB) - added by Vladimir Protasov 9 years ago.
Debug log from backend (there is nothing interesting).
nginx.conf (1.3 KB) - added by Vladimir Protasov 9 years ago.
Nginx configuration
errors.80.debug.log.gz (18.4 KB) - added by Vladimir Protasov 9 years ago.
Gzipped debug log from frontend (interesting things are at the end)


Change History (12)

comment:1 by Maxim Dounin, 9 years ago

Status: new → accepted

Ack, I see the problem. Relevant lines of the debug log:

2011/11/30 13:53:57 [debug] 7699#0: *11 http init upstream, client timer: 0
2011/11/30 13:53:57 [debug] 7699#0: *11 get rr peer, try: 3
2011/11/30 13:53:57 [debug] 7699#0: *11 get rr peer, current: 2 1
2011/11/30 13:53:57 [debug] 7699#0: *11 get rr peer, current: 1 1
...
2011/11/30 13:54:02 [debug] 7699#0: *11 get rr peer, try: 1
2011/11/30 13:54:02 [error] 7699#0: *11 no live upstreams ...

On the first try nginx selects peer 2 in ngx_http_upstream_get_peer() (based on current weight), rejects it as down, then goes into ngx_http_upstream_get_peer() again and selects peer 1. Peer 1 eventually times out. On the second try it tries the next peer, i.e. 2, which has already been tried, and returns EBUSY / "no live upstreams". Peer 0 (the one which is alive) is never tried.
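
The sequence described in this comment can be illustrated with a toy model. This is only a sketch of the behaviour as described, not nginx's actual C code: the function names, the `tried` set and the three-peer setup are assumptions made for illustration, and the "scanning" variant is just one plausible alternative, not necessarily how the committed fix works.

```python
def pick_on_retry_buggy(current, tried, n_peers):
    """Retry path as described above: only the peer after `current` is considered."""
    candidate = (current + 1) % n_peers
    if candidate in tried:
        # Gives up immediately; in the ticket this surfaces as "no live upstreams".
        return None
    return candidate


def pick_on_retry_scanning(current, tried, n_peers):
    """Hypothetical alternative: walk every remaining peer before giving up."""
    for step in range(1, n_peers + 1):
        candidate = (current + step) % n_peers
        if candidate not in tried:
            return candidate
    return None


# State after the first try in the debug log: peer 2 was rejected as down,
# peer 1 was selected and timed out, peer 0 (alive) has not been touched.
tried = {2, 1}
current = 1

print(pick_on_retry_buggy(current, tried, 3))
print(pick_on_retry_scanning(current, tried, 3))
```

In this model the first function gives up without ever looking at peer 0, which matches the "no live upstreams" line in the log, while the scanning variant reaches peer 0.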

comment:2 by Vladimir Protasov, 9 years ago

Strange. All the servers in the upstream have the same weight, so why does nginx decide not to try the alive upstream?
Anyway, it causes 502 errors on our production servers. We've worked around it by marking servers as "down" during maintenance, but that's not a good solution, so we'd be glad to know when it will be fixed (approximately, of course).

Thanks in advance.

comment:3 by Maxim Dounin, 9 years ago

It's not "strange", it's just a bug in the upstream selection logic which manifests itself if there is more than one dead backend.

No timeline yet for the fix, though you may want to try 1.1.x as it has improved dead backend detection/recheck logic. The bug is still present there, but it has a much lower chance of happening.

And, BTW, explicitly removing backends during maintenance is a good idea, regardless of the bug. It's not really a workaround, it's what you are expected to do in the first place.
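
For reference, taking a backend out of rotation explicitly might look like the fragment below. It uses the `down` parameter of the `server` directive in an `upstream` block; the upstream name `backend` and the addresses are placeholders, not values taken from the attached nginx.conf.

```nginx
upstream backend {
    server 192.0.2.10:8080;
    server 192.0.2.11:8080 down;  # under maintenance, nginx will not select it
    server 192.0.2.12:8080;
}
```

After editing, the configuration has to be reloaded (for example with `nginx -s reload`) for the change to take effect.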

comment:4 by Vladimir Protasov, 9 years ago

Yep, but sometimes it's not possible: developers of parts of the application shouldn't have access to the load balancer, and the admins have a lot of other things to do, so it's not as good an idea in practice as it should be.

Okay, we'll switch to 1.1.9 and wait for the fix, thanks.

comment:5 by Maxim Dounin, 8 years ago

In [4622/nginx]:

(The changeset message doesn't reference this ticket)

comment:6 by Maxim Dounin, 8 years ago

Resolution: fixed
Status: accepted → closed

Fix committed, thnx.

comment:7 by sync, 8 years ago

In [4668/nginx]:

(The changeset message doesn't reference this ticket)
