Opened 13 years ago

Closed 13 years ago

Last modified 13 years ago

#64 closed defect (fixed)

Nginx discards alive upstreams and returns 502

Reported by: Vladimir Protasov
Owned by: somebody
Priority: minor
Milestone:
Component: nginx-module
Version: 1.0.x
Keywords: upstream, proxy, load balancing
Cc:
uname -a: Linux tst-hostname 2.6.32-21-generic-pae #32-Ubuntu SMP Fri Apr 16 09:39:35 UTC 2010 i686 GNU/Linux
nginx -V: nginx: nginx version: nginx/1.0.10
nginx: built by gcc 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
nginx: configure arguments: --prefix=/usr/local/nginx --with-debug

Description

We've set up Nginx as a load balancer, but sometimes it returns 502 instead of the page contents.
We've found that the problem happens when some of the backends go down for maintenance (while the others are alive and look healthy).
It looks like Nginx sometimes decides that alive servers are down when they aren't.

Everything was tested under very low load (32 requests were performed from a single client in a few threads).

The problem is easily reproducible and affects all Nginx versions from 0.9.4 up to 1.1.9. It's harder to reproduce in 0.8.55, but it looks like that version is affected too.

I'm attaching a minimized sample configuration with debug logs for the latest stable version, which should help to reproduce and diagnose the problem.

The problem was also reproduced on CentOS 6.0 with backends running Apache, Nginx and IIS on both Windows and Linux (wherever it was possible).

Attachments (5)

access.80.debug.log (2.8 KB) - added by Vladimir Protasov 13 years ago.
Requests from test client
access.8080.debug.log (1.9 KB) - added by Vladimir Protasov 13 years ago.
Access log on the alive upstream.
errors.8080.debug.log (139.0 KB) - added by Vladimir Protasov 13 years ago.
Debug log from backend (there is nothing interesting).
nginx.conf (1.3 KB) - added by Vladimir Protasov 13 years ago.
Nginx configuration
errors.80.debug.log.gz (18.4 KB) - added by Vladimir Protasov 13 years ago.
Gzipped debug log from frontend (interesting things are at the end)

Change History (12)

comment:1 by Maxim Dounin, 13 years ago

Status: new → accepted

Ack, I see the problem. Relevant lines of the debug log:

2011/11/30 13:53:57 [debug] 7699#0: *11 http init upstream, client timer: 0
2011/11/30 13:53:57 [debug] 7699#0: *11 get rr peer, try: 3
2011/11/30 13:53:57 [debug] 7699#0: *11 get rr peer, current: 2 1
2011/11/30 13:53:57 [debug] 7699#0: *11 get rr peer, current: 1 1
...
2011/11/30 13:54:02 [debug] 7699#0: *11 get rr peer, try: 1
2011/11/30 13:54:02 [error] 7699#0: *11 no live upstreams ...

On the first try nginx selects peer 2 in ngx_http_upstream_get_peer() (based on current weight), rejects it as down, then goes into ngx_http_upstream_get_peer() again and selects peer 1. Peer 1 eventually times out. On the second try it tries the next peer, i.e. 2, which has already been tried, and returns EBUSY / "no live upstreams". Peer 0 (the one which is alive) is never tried.

comment:2 by Vladimir Protasov, 13 years ago

Strange. All the servers in the upstream have the same weight, so why does nginx decide not to try the alive upstream?
Anyway, it causes 502 errors on our production servers. We've worked around it by marking servers as "down" during maintenance, but that's not a good solution, so we'd be glad to know when it will be fixed (approximately, of course).

Thanks in advance.

comment:3 by Maxim Dounin, 13 years ago

It's not "strange", it's just a bug in the upstream selection logic which manifests itself if there is more than one dead backend.

No timeline yet for the fix, though you may want to try 1.1.x as it has improved dead backend detection/recheck logic. The bug is still present there, but it is much less likely to occur.

And, BTW, explicitly removing backends during maintenance is a good idea, regardless of the bug. It's not really a workaround, it's what you are expected to do in the first place.

comment:4 by Vladimir Protasov, 13 years ago

Yep, but sometimes it's not possible: developers of parts of the application shouldn't have access to the load balancer, and the admins have a lot of other things to do, so it's not as good an idea as it should be.

Okay, we'll switch to 1.1.9 and wait for the fix, thanks.

comment:5 by Maxim Dounin, 13 years ago

In [4622/nginx]:

Upstream: smooth weighted round-robin balancing.

For edge case weights like { 5, 1, 1 } we now produce { a, a, b, a, c, a, a }
sequence instead of { c, b, a, a, a, a, a } produced previously.

Algorithm is as follows: on each peer selection we increase current_weight
of each eligible peer by its weight, select peer with greatest current_weight
and reduce its current_weight by total number of weight points distributed
among peers.

In case of { 5, 1, 1 } weights this gives the following sequence of
current_weight's:

      a   b   c
      0   0   0   (initial state)

      5   1   1   (a selected)
     -2   1   1

      3   2   2   (a selected)
     -4   2   2

      1   3   3   (b selected)
      1  -4   3

      6  -3   4   (a selected)
     -1  -3   4

      4  -2   5   (c selected)
      4  -2  -2

      9  -1  -1   (a selected)
      2  -1  -1

      7   0   0   (a selected)
      0   0   0

To preserve weight reduction in case of failures the effective_weight
variable was introduced, which usually matches peer's weight, but is
reduced temporarily on peer failures.

This change also fixes loop with backup servers and proxy_next_upstream
http_404 (ticket #47), and skipping alive upstreams in some cases if there
are multiple dead ones (ticket #64).
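
The selection rule described in this commit message can be sketched in a few lines of Python. This is only an illustration of the algorithm as worded above, not the nginx C code: the `Peer` class, the `select_peer` function and the `skip` parameter (standing in for "not eligible", e.g. already tried or down) are hypothetical names, and the effective_weight handling on failures is left out. Ties are assumed to go to the first peer with the greatest current_weight, which is what the table above implies when b is chosen over c.

```python
class Peer:
    """Illustrative peer record; field names mirror the commit message."""

    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.current_weight = 0


def select_peer(peers, skip=frozenset()):
    """Pick one peer by smooth weighted round-robin.

    Peers whose name is in `skip` are treated as ineligible
    (an assumption standing in for "down" or "already tried").
    Returns None when no peer is eligible.
    """
    total = 0
    best = None
    for peer in peers:
        if peer.name in skip:
            continue
        # Every eligible peer gains its own weight on each selection.
        peer.current_weight += peer.weight
        total += peer.weight
        # Strict comparison: on a tie the earlier peer wins.
        if best is None or peer.current_weight > best.current_weight:
            best = peer
    if best is None:
        return None
    # The winner pays back the total weight distributed in this round.
    best.current_weight -= total
    return best


peers = [Peer("a", 5), Peer("b", 1), Peer("c", 1)]
sequence = [select_peer(peers).name for _ in range(7)]
print(sequence)
```

With weights { 5, 1, 1 } the seven selections follow the sequence given in the commit message, and every current_weight is back at 0 afterwards. Because each call scans all eligible peers rather than stepping to a "next" index, a call such as `select_peer(peers, skip={"b", "c"})` still reaches the remaining peer, which is the situation this ticket is about.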

comment:6 by Maxim Dounin, 13 years ago

Resolution: fixed
Status: accepted → closed

Fix committed, thnx.

comment:7 by sync, 13 years ago

In [4668/nginx]:

Merge of r4622, r4623: balancing changes.

*) Upstream: smooth weighted round-robin balancing.

For edge case weights like { 5, 1, 1 } we now produce { a, a, b, a, c, a, a }
sequence instead of { c, b, a, a, a, a, a } produced previously.

Algorithm is as follows: on each peer selection we increase current_weight
of each eligible peer by its weight, select peer with greatest current_weight
and reduce its current_weight by total number of weight points distributed
among peers.

In case of { 5, 1, 1 } weights this gives the following sequence of
current_weight's:

      a   b   c
      0   0   0   (initial state)

      5   1   1   (a selected)
     -2   1   1

      3   2   2   (a selected)
     -4   2   2

      1   3   3   (b selected)
      1  -4   3

      6  -3   4   (a selected)
     -1  -3   4

      4  -2   5   (c selected)
      4  -2  -2

      9  -1  -1   (a selected)
      2  -1  -1

      7   0   0   (a selected)
      0   0   0

To preserve weight reduction in case of failures the effective_weight
variable was introduced, which usually matches peer's weight, but is
reduced temporarily on peer failures.

This change also fixes loop with backup servers and proxy_next_upstream
http_404 (ticket #47), and skipping alive upstreams in some cases if there
are multiple dead ones (ticket #64).

*) Upstream: fixed ip_hash rebalancing with the "down" flag.

Due to weight being set to 0 for down peers, order of peers after sorting
wasn't the same as without the "down" flag (with down peers at the end),
resulting in client rebalancing for clients on other servers. The only
rebalancing which should happen after adding "down" to a server is one
for clients on the server.

The problem was introduced in r1377 (which fixed endless loop by setting
weight to 0 for down servers). The loop is no longer possible with new
smooth algorithm, so preserving original weight is safe.
