#1210 closed defect (wontfix)
connections not distributed evenly after restart under high load with accept_mutex enabled
Reported by: | | Owned by: | |
---|---|---|---
Priority: | minor | Milestone: | |
Component: | other | Version: | 1.10.x |
Keywords: | | Cc: | |
uname -a: | FreeBSD c099.ord001.dev.ix.nflxvideo.net 12.0-CURRENT FreeBSD 12.0-CURRENT #0: Fri Feb 24 13:40:24 EST 2017 root@viserion:/data/NCD-3545/.obj/nano.nflx/data/NCD-3545/FreeBSD/sys/RED1 amd64 | | |
nginx -V:

nginx version: nginx/1.10.0
built with OpenSSL 1.0.2k-freebsd 26 Jan 2017
TLS SNI support enabled
configure arguments: --prefix=/usr/local/etc/nginx --with-cc-opt='-I /usr/local/include' --with-ld-opt='-L /usr/local/lib' --conf-path=/usr/local/etc/nginx/nginx.conf --sbin-path=/usr/local/sbin/nginx --pid-path=/var/run/nginx.pid --error-log-path=/var/log/nginx-error.log --user=www --group=www --add-module=/var/tmp/portswork/_.p/netflix/nginx-1.10/work/nginx-1.10.0/nflx_smooth --add-module=/var/tmp/portswork/_.p/netflix/nginx-1.10/work/nginx-1.10.0/nflx_geo_check --add-module=/var/tmp/portswork/_.p/netflix/nginx-1.10/work/nginx-1.10.0/nflx_signed_uri --add-module=/var/tmp/portswork/_.p/netflix/nginx-1.10/work/nginx-1.10.0/nflx_range_path --add-module=/var/tmp/portswork/_.p/netflix/nginx-1.10/work/nginx-1.10.0/nflx_variables --add-module=/var/tmp/portswork/_.p/netflix/nginx-1.10/work/nginx-1.10.0/nflx_nccp_conf --add-module=/var/tmp/portswork/_.p/netflix/nginx-1.10/work/nginx-1.10.0/nflx_logstat --add-module=/var/tmp/portswork/_.p/netflix/nginx-1.10/work/nginx-1.10.0/nflx_loghot --add-module=/var/tmp/portswork/_.p/netflix/nginx-1.10/work/nginx-1.10.0/nflx_status --add-module=/var/tmp/portswork/_.p/netflix/nginx-1.10/work/nginx-1.10.0/nflx_upstream_jdomain --add-module=/var/tmp/portswork/_.p/netflix/nginx-1.10/work/nginx-1.10.0/nflx_rorder --add-module=/var/tmp/portswork/_.p/netflix/nginx-1.10/work/nginx-1.10.0/nflx_delay --add-module=/var/tmp/portswork/_.p/netflix/nginx-1.10/work/nginx-1.10.0/nflx_sendfile_flags --with-debug --with-file-aio --with-ipv6 --http-client-body-temp-path=/var/tmp/nginx/client_body_temp --http-fastcgi-temp-path=/var/tmp/nginx/fastcgi_temp --http-proxy-temp-path=/var/tmp/nginx/proxy_temp --http-scgi-temp-path=/var/tmp/nginx/scgi_temp --http-uwsgi-temp-path=/var/tmp/nginx/uwsgi_temp --http-log-path=/var/log/nginx-access.log --with-http_v2_module --with-pcre --with-http_ssl_module --with-threads
Description
On a heavily loaded server (90,000 long-lived connections), restarting nginx results in a few nginx workers at the connection limit and others with very few connections.
conf file snippets:
worker_processes 64;
worker_rlimit_nofile 32768;

events {
    worker_connections 4096;
    accept_mutex on;
}
Before service nginx restart, the server is serving 90+ Gb/s across ~90,000 connections:
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes  colls
   2237651     0      0  167011176    7789342     0 11633619064     0
   2214132     0      0  165215742    7719175     0 11531815268     0
   2255186     0      0  168221493    7816878     0 11679556820     0
<...>
nginx was then restarted. top now shows uneven load across the new nginx workers:
last pid: 4557;  load averages: 16.89, 19.71, 17.60  up 0+00:52:43  20:00:44
227 processes: 5 running, 222 sleeping
CPU:  4.8% user,  0.0% nice, 10.8% system, 21.1% interrupt, 63.2% idle
Mem: 2547M Active, 91G Inact, 3260K Laundry, 27G Wired, 1585M Buf, 2710M Free
Swap: 32G Total, 79M Used, 32G Free

  PID USERNAME  THR PRI NICE   SIZE     RES STATE    C   TIME    WCPU COMMAND
 4364 www         1  48   -4   270M    177M CPU18   18   0:50  40.25% nginx
 4371 www         1  39   -4   246M    183M CPU4     4   0:56  33.52% nginx
 4347 www         1  35   -4   236M    170M kqread   4   0:54  26.62% nginx
 4398 www         1  32   -4   230M    167M kqread  18   0:55  26.00% nginx
 4391 www         1  30   -4   216M    151M kqread  15   0:53  24.04% nginx
 4381 www         1  31   -4   214M    150M kqread   7   0:50  23.07% nginx
 4372 www         1  27   -4   220M    157M kqread  20   0:38  22.09% nginx
 4346 www         1  26   -4   228M    166M kqread   5   0:37  19.00% nginx
 4380 www         1  26   -4   232M    158M CPU26   26   0:48  18.59% nginx
 4363 www         1  25   -4   206M    144M kqread  20   0:31  17.16% nginx
 4396 www         1  26   -4   210M    147M kqread  14   0:34  16.60% nginx
 4402 www         1  24   -4   186M    123M kqread   3   0:27  14.66% nginx
 4345 www         1  24   -4   198M    136M kqread   6   0:30  14.58% nginx
 4349 www         1  24   -4   200M    137M kqread   6   0:31  14.30% nginx
 4365 www         1  20   -4   152M  91584K kqread   5   0:14  10.41% nginx
 4355 www         1  20   -4   138M  76796K kqread   7   0:12   7.28% nginx
 4351 www         1  20   -4   146M  84196K kqread  15   0:14   6.81% nginx
 4384 www         1  20   -4   120M  56260K kqread  21   0:06   4.76% nginx
 4400 www         1  20   -4   126M  63348K CPU17   17   0:08   4.49% nginx
 4383 www         1  20   -4   116M  53096K kqread   9   0:04   3.80% nginx
 4376 www         1  20   -4   116M  52824K kqread  20   0:04   3.76% nginx
 4397 www         1  20   -4   118M  53928K kqread  22   0:05   3.45% nginx
 4367 www         1  20   -4   112M  46252K kqread   9   0:03   3.32% nginx
 4368 www         1  20   -4   110M  43836K kqread   8   0:02   3.30% nginx
 4352 www         1  20   -4   112M  48280K kqread  22   0:03   2.84% nginx
 4373 www         1  20   -4   110M  42596K kqread  28   0:02   2.33% nginx
 4392 www         1  20   -4   112M  45620K kqread   5   0:03   2.14% nginx
 4377 www         1  20   -4   104M  31236K kqread  22   0:01   1.85% nginx
 4379 www         1  20   -4   108M  39072K kqread  30   0:02   1.68% nginx
 4361 www         1  20   -4   106M  36536K kqread  30   0:01   1.65% nginx
 4369 www         1  20   -4   108M  41188K kqread  20   0:02   1.63% nginx
 4388 www         1  20   -4   106M  36644K kqread   7   0:01   1.52% nginx
 4405 www         1  20   -4   106M  35744K kqread  13   0:01   1.32% nginx
 4395 www         1  20   -4   106M  37380K kqread  17   0:02   1.31% nginx
 4378 www         1  20   -4   104M  34352K kqread  13   0:01   1.28% nginx
 4407 www         1  20   -4   106M  39596K kqread   1   0:02   1.18% nginx
 4354 www         1  20   -4   106M  37312K kqread  29   0:02   1.17% nginx
 4382 www         1  20   -4   104M  31644K kqread   4   0:01   1.17% nginx
 4350 www         1  20   -4   104M  33508K kqread   8   0:01   0.94% nginx
 4401 www         1  20   -4   104M  30320K kqread  14   0:01   0.93% nginx
 4406 www         1  20   -4   104M  32572K kqread  12   0:01   0.92% nginx
 4408 www         1  20   -4   104M  31300K kqread  10   0:01   0.85% nginx
 4360 www         1  20   -4   106M  35236K kqread   1   0:01   0.72% nginx
 4362 www         1  20   -4   104M  32712K kqread  26   0:01   0.68% nginx
 4366 www         1  20   -4   102M  28800K kqread   3   0:00   0.65% nginx
 4353 www         1  20   -4   104M  30888K kqread  28   0:01   0.58% nginx
 4375 www         1  20   -4   104M  32060K kqread  18   0:01   0.57% nginx
 4403 www         1  20   -4   104M  29636K kqread   6   0:01   0.52% nginx
  568 root        1  20    0  10496K  2588K select  17   0:09   0.45% syslogd
 4385 www         1  20   -4   104M  29536K kqread   5   0:01   0.45% nginx
 4404 www         1  20   -4   102M  26208K kqread  22   0:00   0.44% nginx
 4387 www         1  20   -4   102M  22384K kqread   6   0:00   0.41% nginx
 4370 www         1  20   -4   102M  22328K kqread  10   0:00   0.41% nginx
 4399 www         1  20   -4   102M  23440K kqread  10   0:00   0.40% nginx
 4358 www         1  20   -4   102M  26216K kqread  12   0:00   0.38% nginx
 4390 www         1  20   -4   102M  26612K kqread  17   0:00   0.34% nginx
 4356 www         1  20   -4   102M  24788K kqread   7   0:00   0.33% nginx
The most heavily loaded workers are near the connection limit, while the least loaded have almost no connections:
c099.ord001.dev# lsof -n -p 4364 | grep TCP | wc -l
    4058
c099.ord001.dev# lsof -n -p 4380 | grep TCP | wc -l
    4058
c099.ord001.dev# lsof -n -p 4389 | grep TCP | wc -l
      77
c099.ord001.dev# ps augxw | grep 4389
www   4389  0.0  0.0 104392 27436  -  S<  19:57   0:00.60 nginx: worker process (nginx)
root  4618  0.0  0.0  14812  2484  0  S+  20:01   0:00.00 grep 4389
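To survey the distribution across all worker processes at once, rather than checking individual PIDs, a loop along the following lines can be used (a hypothetical sketch, not part of the original report; the pgrep -u www nginx selection and the TCP filter simply mirror the lsof commands above):

# hypothetical: print the number of TCP descriptors held by each nginx worker (run as root)
for pid in $(pgrep -u www nginx); do
    printf '%6s  ' "$pid"
    lsof -n -p "$pid" | grep -c TCP
done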
The uneven loading seems to persist indefinitely; that is, the connection counts never seem to even out.
As a workaround, disabling the accept mutex helps, but causes other problems (kernel lock contention in kqueue).
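In configuration terms, this workaround amounts to something like the following variation of the events block shown above (illustration only; as noted, it trades uneven distribution for kqueue lock contention):

events {
    worker_connections 4096;
    accept_mutex off;    # workaround: all workers accept directly, at the cost of kernel kqueue lock contention
}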
Change History (2)
comment:1 by , 8 years ago

Resolution: | → wontfix
---|---
Status: | new → closed

With accept mutex enabled, the distribution of connections is expected to be uneven as long as the worker processes are not fully loaded. This is a result of how the accept mutex works.

If you want to improve the distribution of connections between worker processes, consider reducing accept_mutex_delay, or switching accept_mutex off.

comment:2 by , 8 years ago

Thanks; I had not realized that. Reducing accept_mutex_delay does indeed seem to mitigate the problem.

Thank you again!
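For illustration, the suggested tuning might look roughly like this (a sketch only; the 100ms value is an arbitrary example rather than a recommendation from this ticket, and the nginx default for accept_mutex_delay is 500ms):

events {
    worker_connections 4096;
    accept_mutex on;
    accept_mutex_delay 100ms;    # example value; the default is 500ms
}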