Opened 8 months ago

Last modified 6 months ago

#1431 new enhancement

Performance issue with wss and mod_proxy

Reported by: srats.genymobile.com@… Owned by:
Priority: minor Milestone:
Component: nginx-module Version: 1.10.x
Keywords: websocket proxy Cc:
uname -a: Linux ns3018241 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.10.3 (Ubuntu) built with OpenSSL 1.0.2g 1 Mar 2016 TLS SNI support enabled configure arguments: --with-cc-opt='-g -O2 -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2' --with-ld-opt='-Wl,-Bsymbolic-functions -fPIE -pie -Wl,-z,relro -Wl,-z,now' --prefix=/usr/share/nginx --conf-path=/etc/nginx/nginx.conf --http-log-path=/var/log/nginx/access.log --error-log-path=/var/log/nginx/error.log --lock-path=/var/lock/nginx.lock --pid-path=/run/nginx.pid --http-client-body-temp-path=/var/lib/nginx/body --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-proxy-temp-path=/var/lib/nginx/proxy --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-pcre-jit --with-ipv6 --with-http_ssl_module --with-http_stub_status_module --with-http_realip_module --with-http_auth_request_module --with-http_addition_module --with-http_dav_module --with-http_geoip_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_image_filter_module --with-http_v2_module --with-http_sub_module --with-http_xslt_module --with-stream --with-stream_ssl_module --with-mail --with-mail_ssl_module --with-threads

Description

Hi!

First of all, thanks for the great product :)

Today, we are facing a performance issue in our infrastructure and it seems to be related to nginx.

A bit of context first:
We are using nginx as a websocket proxy, using the default nginx-ubuntu16.04 configuration, with the following server configuration:

location ~* ^/(40[0-2][0-9][0-9])/?$ {
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_pass https://127.0.0.1:$1;
    proxy_http_version 1.1;
}

Basically, it forwards requests like https://my-host.com/40000 to localhost:40000.
This works perfectly fine.

Here is a general overview of the whole infrastructure:

|-------------------|               |-----------------|               |--------------------------------------------|
|                   |               |                 |               |                                            |
| CLIENT MACHINE    |               | CloudFlare      |               | Server                                     |
|                   |               |                 |               |                                            |
| |--------------|  |               | |-------------| |               | |-------------|               |----------| |
| |              |  |   WEBSOCKET   | |             | |   WEBSOCKET   | |             |   WEBSOCKET   |          | |
| | SOFTWARE     |=>|>=============>|>|>=  nginx  =>|>|>=============>|>|>=  nginx  =>|>=============>| SOFTWARE | |
| |              |  |     (WSS)     | |             | |     (WSS)     | |             |     (WSS)     |          | |
| |--------------|  |               | |-------------| |               | |-------------|               |----------| |
|-------------------|               |-----------------|               |--------------------------------------------|

In the websocket, we are sending and receiving a lot of stuff. Some commands are basic (simple text), some a bit heavier (file upload/download).
The websocket is multiplexed, so that we can open multiple instances of the client software to send command to the server software.
This work just perfectly with simple usage.

The issue we encounter:
The issue appears when we try to do any operations when one heavy "outgoining" operation (downloading a file from the server to the client, for example) is processing.
The instances doing the heavy operation receive some data on the websocket, while all others instances seems to receive no (or very few) messages from the server software.

We first thought of a software issue, but we tracked it down to nginx.

Investigations:

  • We first disabled Cloudflare, to make sure it wasn't a Cloudflare issue. Issue was still there.
  • Then, we bypassed nginx on our servers to directly connect to the software. Issue was gone.

Cool. Let's do some more tests.

Not Cool. That's when we remembered that CloudFlare? was using nginx.

HTTP/1.1 101 Switching Protocols
Connection: upgrade
Upgrade: websocket
Server: cloudflare-nginx

Workaround:

  • Apache2 and mod_wstunnel works fine;
  • HA Proxy works fine.

Logs:
Attached nginx debug logs.
You can notice basic commands in the logs with the following block (may vary a bit):

2017/11/17 14:31:26 [debug] 11432#11432: *1 http upstream request: "/40002?"
2017/11/17 14:31:26 [debug] 11432#11432: *1 http upstream process upgraded, fu:1
2017/11/17 14:31:26 [debug] 11432#11432: *1 SSL_read: 47
2017/11/17 14:31:26 [debug] 11432#11432: *1 SSL_read: -1
2017/11/17 14:31:26 [debug] 11432#11432: *1 SSL_get_error: 2
2017/11/17 14:31:26 [debug] 11432#11432: *1 SSL to write: 47
2017/11/17 14:31:26 [debug] 11432#11432: *1 SSL_write: 47
2017/11/17 14:31:26 [debug] 11432#11432: *1 event timer del: 17: 1510929145576
2017/11/17 14:31:26 [debug] 11432#11432: *1 event timer add: 17: 60000:1510929146077
2017/11/17 14:31:26 [debug] 11432#11432: *1 delete posted event 000055603EE75590
2017/11/17 14:31:26 [debug] 11432#11432: *1 http upstream request: "/40002?"
2017/11/17 14:31:26 [debug] 11432#11432: *1 http upstream process upgraded, fu:0
2017/11/17 14:31:26 [debug] 11432#11432: *1 event timer: 17, old: 1510929146077, new: 1510929146077
2017/11/17 14:31:26 [debug] 11432#11432: *1 post event 000055603EE63580
2017/11/17 14:31:26 [debug] 11432#11432: *1 post event 000055603EE75590
2017/11/17 14:31:26 [debug] 11432#11432: *1 delete posted event 000055603EE63580

You can notice the differences when the heavy command (file transfert) starts, at 2017/11/17 14:33:34.

We'll be glad to help you in any ways we can to investigate the bottleneck.
In the mean time, we'll have to use apache :(

Thanks again for the great product.

Attachments (1)

logs.tar.xz (109.8 KB) - added by srats.genymobile.com@… 8 months ago.
nginx debug logs

Download all attachments as: .zip

Change History (4)

Changed 8 months ago by srats.genymobile.com@…

nginx debug logs

comment:1 Changed 8 months ago by mdounin

When an even happens on a connection, nginx generally works with the connection till it encounters blocking somewhere - for example, till the client's socket buffer is full. Then it switches to handling other events on different connections.

In case of proxying, this means that if both client and upstream server are fast enough, and socket buffers used are big enough and/or nginx-side processing is slow enough, a particular connection can monopolize processing in the worker, resulting in larger latency for other clients. Note well that using CloudFlare might actually be a part of the problem here, as it generally makes any client fast from nginx point of view.

In the particular log provided maximum such monopolization seems to be for about ~2.5 megabytes, just before the following line:

2017/11/17 14:33:37 [debug] 11432#11432: *1 http upstream process upgraded, fu:1

Just for the record, found with the following command:

perl -nle 'if (/SSL_write: (.+)/) { if ($1 > 0) { $t += $1; } } elsif (/process upgraded/) { print $_; print "total: $t"; if ($t > $max) { $max = $t; } $t = 0; } END { print "max: $max" }' nginx-debug.log

The 2.5 megabytes does not look serious, though larger runs can happen in other situation, and these might be the issue. In particular, incorrect system tuning might make things worse, see below.

To reduce the likelihood of monopolizing the worker by a connection, consider the following:

  • Use smaller socket buffers. On Linux, socket buffers can be tuned using /proc/sys/net/ipv4/tcp_rmem and /proc/sys/net/ipv4/tcp_wmem, and in nginx itself you can tune client-oriented socket buffers by using listen ... rcvbuf=... sndbuf=..., see listen. The particular log seems to use defaults, so socket buffers are scaled up to about about 1.5 megabytes. With larger sizes you are likely to see larger runs of monopolization.
  • Use larger proxy_buffer_size. When the buffer is larger than corresponding socket buffer, there will be no monopolization at all - as nginx will be able to immediately read all the data available from the socket buffer, and will process no more than one socket buffer per event. Using larger buffer is also expected to reduce processing overhead. Current proxy buffer size is the default, 4096 bytes, and this looks too small compared to socket buffers used.

(Just a side note: monopolization for larger than a socket buffer is also not at all possible when using the kqueue event method, as it reports number of bytes available at the time of the event, and this is how many bytes nginx will process. Unfortunately, Linux has no kqueue support.)

comment:2 Changed 8 months ago by srats.genymobile.com@…

Hi!

Thanks for the information, we'll try to play with proxy_buffer_size :)

comment:3 Changed 6 months ago by mdounin

See also #1468.

Note: See TracTickets for help on using tickets.