Opened 7 years ago
Closed 5 years ago
#1431 closed enhancement (fixed)
Performance issue with wss and mod_proxy
Reported by: | Samuel Rats | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | |
Component: | nginx-module | Version: | 1.10.x |
Keywords: | websocket proxy | Cc: | |
uname -a: | Linux ns3018241 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | ||
nginx -V: |
nginx version: nginx/1.10.3 (Ubuntu)
built with OpenSSL 1.0.2g 1 Mar 2016 TLS SNI support enabled configure arguments: --with-cc-opt='-g -O2 -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2' --with-ld-opt='-Wl,-Bsymbolic-functions -fPIE -pie -Wl,-z,relro -Wl,-z,now' --prefix=/usr/share/nginx --conf-path=/etc/nginx/nginx.conf --http-log-path=/var/log/nginx/access.log --error-log-path=/var/log/nginx/error.log --lock-path=/var/lock/nginx.lock --pid-path=/run/nginx.pid --http-client-body-temp-path=/var/lib/nginx/body --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-proxy-temp-path=/var/lib/nginx/proxy --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-pcre-jit --with-ipv6 --with-http_ssl_module --with-http_stub_status_module --with-http_realip_module --with-http_auth_request_module --with-http_addition_module --with-http_dav_module --with-http_geoip_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_image_filter_module --with-http_v2_module --with-http_sub_module --with-http_xslt_module --with-stream --with-stream_ssl_module --with-mail --with-mail_ssl_module --with-threads |
Description
Hi!
First of all, thanks for the great product :)
Today, we are facing a performance issue in our infrastructure and it seems to be related to nginx.
A bit of context first:
We are using nginx as a websocket proxy, using the default nginx-ubuntu16.04 configuration, with the following server configuration:
location ~* ^/(40[0-2][0-9][0-9])/?$ {
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_pass https://127.0.0.1:$1;
    proxy_http_version 1.1;
}
Basically, it forwards requests like https://my-host.com/40000 to localhost:40000.
This works perfectly fine.
Here is a general overview of the whole infrastructure:
[CLIENT MACHINE: websocket software (WSS)] ==> [CloudFlare: nginx (WSS)] ==> [Server: nginx (WSS) ==> websocket software]
In the websocket, we are sending and receiving a lot of data. Some commands are basic (simple text), some are a bit heavier (file upload/download).
The websocket is multiplexed, so that we can open multiple instances of the client software to send commands to the server software.
This works perfectly with simple usage.
The issue we encounter:
The issue appears when we try to do any operation while a heavy "outgoing" operation (downloading a file from the server to the client, for example) is in progress.
The instances doing the heavy operation receive data on the websocket, while all other instances seem to receive no (or very few) messages from the server software.
We first thought of a software issue, but we tracked it down to nginx.
Investigations:
- We first disabled CloudFlare, to make sure it wasn't a CloudFlare issue. The issue was still there.
- Then, we bypassed nginx on our servers to connect directly to the software. The issue was gone.
Cool. Let's do some more tests.
- Re-enabled CloudFlare. The issue was back.
Not cool. That's when we remembered that CloudFlare also uses nginx:
HTTP/1.1 101 Switching Protocols
Connection: upgrade
Upgrade: websocket
Server: cloudflare-nginx
Workaround:
- Apache2 with mod_wstunnel works fine;
- HAProxy works fine.
Logs:
Attached nginx debug logs.
You can spot basic commands in the logs by blocks like the following (they may vary a bit):
2017/11/17 14:31:26 [debug] 11432#11432: *1 http upstream request: "/40002?"
2017/11/17 14:31:26 [debug] 11432#11432: *1 http upstream process upgraded, fu:1
2017/11/17 14:31:26 [debug] 11432#11432: *1 SSL_read: 47
2017/11/17 14:31:26 [debug] 11432#11432: *1 SSL_read: -1
2017/11/17 14:31:26 [debug] 11432#11432: *1 SSL_get_error: 2
2017/11/17 14:31:26 [debug] 11432#11432: *1 SSL to write: 47
2017/11/17 14:31:26 [debug] 11432#11432: *1 SSL_write: 47
2017/11/17 14:31:26 [debug] 11432#11432: *1 event timer del: 17: 1510929145576
2017/11/17 14:31:26 [debug] 11432#11432: *1 event timer add: 17: 60000:1510929146077
2017/11/17 14:31:26 [debug] 11432#11432: *1 delete posted event 000055603EE75590
2017/11/17 14:31:26 [debug] 11432#11432: *1 http upstream request: "/40002?"
2017/11/17 14:31:26 [debug] 11432#11432: *1 http upstream process upgraded, fu:0
2017/11/17 14:31:26 [debug] 11432#11432: *1 event timer: 17, old: 1510929146077, new: 1510929146077
2017/11/17 14:31:26 [debug] 11432#11432: *1 post event 000055603EE63580
2017/11/17 14:31:26 [debug] 11432#11432: *1 post event 000055603EE75590
2017/11/17 14:31:26 [debug] 11432#11432: *1 delete posted event 000055603EE63580
You can notice the difference when the heavy command (file transfer) starts, at 2017/11/17 14:33:34.
We'll be glad to help in any way we can to investigate the bottleneck.
In the meantime, we'll have to use Apache :(
Thanks again for the great product.
Attachments (1)
Change History (7)
by , 7 years ago
Attachment: | logs.tar.xz added |
---|
comment:1 by , 7 years ago
When an event happens on a connection, nginx generally works with the connection till it encounters blocking somewhere - for example, till the client's socket buffer is full. Then it switches to handling other events on different connections.
In the case of proxying, this means that if both the client and the upstream server are fast enough, and the socket buffers used are big enough and/or nginx-side processing is slow enough, a particular connection can monopolize processing in the worker, resulting in higher latency for other clients. Note well that using CloudFlare might actually be part of the problem here, as it generally makes any client look fast from nginx's point of view.
In the particular log provided, the maximum such monopolization seems to be about 2.5 megabytes, just before the following line:
2017/11/17 14:33:37 [debug] 11432#11432: *1 http upstream process upgraded, fu:1
Just for the record, found with the following command:
perl -nle 'if (/SSL_write: (.+)/) { if ($1 > 0) { $t += $1; } } elsif (/process upgraded/) { print $_; print "total: $t"; if ($t > $max) { $max = $t; } $t = 0; } END { print "max: $max" }' nginx-debug.log
The 2.5 megabytes does not look serious, though larger runs can happen in other situations, and these might be the issue. In particular, incorrect system tuning might make things worse, see below.
To reduce the likelihood of a single connection monopolizing the worker, consider the following:
- Use smaller socket buffers. On Linux, socket buffers can be tuned using /proc/sys/net/ipv4/tcp_rmem and /proc/sys/net/ipv4/tcp_wmem, and in nginx itself you can tune client-side socket buffers using listen ... rcvbuf=... sndbuf=..., see the listen directive. The particular log seems to use the defaults, so socket buffers are scaled up to about 1.5 megabytes. With larger sizes you are likely to see longer runs of monopolization.
- Use a larger proxy_buffer_size. When the buffer is larger than the corresponding socket buffer, there will be no monopolization at all, as nginx will be able to immediately read all the data available from the socket buffer, and will process no more than one socket buffer per event. Using a larger buffer is also expected to reduce processing overhead. The current proxy buffer size is the default, 4096 bytes, which looks too small compared to the socket buffers used. (A configuration sketch illustrating both suggestions follows the side note below.)
(Just a side note: monopolization for more than one socket buffer is also not possible at all when using the kqueue event method, as it reports the number of bytes available at the time of the event, and that is how many bytes nginx will process. Unfortunately, Linux has no kqueue support.)
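For reference, a minimal configuration sketch combining both suggestions with the reporter's location block. The listen line and the buffer sizes below are illustrative assumptions, not values recommended in this ticket; they should be adjusted to the actual traffic and available memory:

server {
    # Smaller client-side socket buffers limit how much data nginx can push
    # through a single connection before it blocks and yields to other
    # connections (rcvbuf/sndbuf are parameters of the listen directive).
    listen 443 ssl rcvbuf=64k sndbuf=64k;

    location ~* ^/(40[0-2][0-9][0-9])/?$ {
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_pass https://127.0.0.1:$1;
        proxy_http_version 1.1;

        # A proxy buffer at least as large as the upstream socket buffer
        # lets nginx drain at most one socket buffer per event.
        proxy_buffer_size 128k;
    }
}

System-wide buffer autotuning limits can also be lowered via /proc/sys/net/ipv4/tcp_rmem and /proc/sys/net/ipv4/tcp_wmem, as noted above.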
comment:2 by , 7 years ago
Hi!
Thanks for the information, we'll try to play with proxy_buffer_size :)