Opened 8 years ago

Closed 5 years ago

#1163 closed defect (worksforme)

cache size grows over max_size

Reported by: turchanov@… Owned by:
Priority: critical Milestone:
Component: nginx-core Version: 1.11.x
Keywords: Cc:
uname -a: Linux 4.1.12-61.1.19.el7uek.x86_64 #2 SMP Fri Nov 18 17:00:09 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.11.7
built by gcc 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)
built with OpenSSL 1.0.2j-fips 26 Sep 2016
TLS SNI support enabled
configure arguments: --prefix=/opt/nginx-1.11.7 --pid-path=/var/run/nginx.pid --http-client-body-temp-path=/tmp/nginx-client_body_temp --http-proxy-temp-path=/tmp/nginx-proxy_temp --http-fastcgi-temp-path=/tmp/nginx-fastcgi_temp --http-uwsgi-temp-path=/tmp/nginx-uwsgi_temp --http-scgi-temp-path=/tmp/nginx-scgi_temp --with-pcre --with-http_ssl_module --with-openssl=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/openssl-1.0.2j --with-openssl-opt='no-krb5 enable-camellia enable-seed enable-tlsext enable-rfc3779 no-mdc2 no-ec2m no-gost no-srp -fno-strict-aliasing fips' --with-http_v2_module --with-http_stub_status_module --with-http_gzip_static_module --with-http_gunzip_module --with-http_realip_module --with-stream --with-stream_ssl_module --with-http_geoip_module=dynamic --add-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/ngx_devel_kit-master --add-dynamic-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/lua-nginx-module-master --add-dynamic-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/lua-upstream-nginx-module-master --add-dynamic-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/set-misc-nginx-module-master --add-dynamic-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/array-var-nginx-module-master --add-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/echo-nginx-module-0.60 --add-dynamic-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/nginx-module-vts-0.1.11 --add-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/nginx_upstream_check_module-master --add-dynamic-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/nchan-1.0.8

Description

Periodically our proxy_cache size grows over the configured limit (see the attached graph; note that Zabbix's 150G equals 140GiB) and usually returns to normal after several hours. Some gdb poking in the 'cache manager' process shows that ngx_http_file_cache_manager is called and that the current cache size is indeed larger than max_size:

(gdb)
1940            size = cache->sh->size;
(gdb) print size
$2 = 42999201
(gdb) print cache->max_size
$3 = 36700160

but ngx_http_file_cache_forced_expire deletes nothing because of locked cache entries (fcn->count == 1), and after the hard-coded limit of 20 attempts the search is aborted.

#749 suggests worker crashes as a cause, but we had none. However, several times after a reconfigure I did kill workers stuck in the "worker process is shutting down" state with the TERM signal, as I needed to suppress the activity of workers running the previous configuration as soon as possible.

So the question is: could terminating workers with the TERM signal cause the issue, and if so, why don't workers clean up the cache locks they hold? Or could proxy_cache_lock on be the cause of the trouble?

And one thing bothers me: why does the cache manager eventually succeed in cleaning up the cache (as you can see on the attached graph)? If the problem were locked entries left hanging after a worker was killed with the TERM signal, the cache manager would never be able to clean them up.

My config (I hope that I didn't cut too much):

http {
...
  proxy_cache_path /opt/cache/static levels=1:2 use_temp_path=off keys_zone=static:2048m inactive=1y max_size=140g manager_files=50000 manager_sleep=1 manager_threshold=1000 loader_files=50000 loader_threshold=1000 loader_sleep=1;
...
  server {
    location / {
      proxy_cache static;

      proxy_no_cache $no_cache_static;
      proxy_cache_bypass $no_cache_static;

      proxy_cache_lock on;
      proxy_cache_lock_timeout 1s;

      proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
      proxy_cache_valid 200      1y;

      proxy_pass http://static_nodes;
    }
  }
}
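As a quick sanity check for the symptom described above, the on-disk usage can be compared against the configured limit with a short shell sketch. The path and the 140g limit are taken from the proxy_cache_path line above; adjust both for your setup:

```shell
#!/bin/sh
# Compare on-disk cache usage with the configured max_size (140 GiB here).
CACHE_DIR="${CACHE_DIR:-/opt/cache/static}"   # path from proxy_cache_path above
MAX_KB=$((140 * 1024 * 1024))                 # 140 GiB expressed in KiB
USED_KB=$(du -sk "$CACHE_DIR" 2>/dev/null | cut -f1)
USED_KB="${USED_KB:-0}"                       # 0 if the directory is missing
echo "used: ${USED_KB} KiB, limit: ${MAX_KB} KiB"
if [ "$USED_KB" -gt "$MAX_KB" ]; then
    echo "cache is over max_size"
fi
```

Note that du measures actual disk usage, while nginx tracks the sum of cached response sizes, so small discrepancies are normal; only a sustained, growing gap matches the symptom reported here.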

Attachments (1)

nginx-cache-size.png (39.9 KB ) - added by turchanov@… 8 years ago.
Cache size


Change History (26)

by turchanov@…, 8 years ago

Attachment: nginx-cache-size.png added

Cache size

comment:1 by Maxim Dounin, 8 years ago

Resolution: invalid
Status: new → closed

Killing workers with the TERM signal terminates them immediately, without any shared memory cleanup: the signal is expected to be used by the master process during termination of nginx itself, so the worker doesn't try to clean up anything. Therefore such termination is expected to cause trouble with shared memory entries remaining locked, including cache and limit_conn entries.

In the cache, locked entries will cause trouble for max_size-based cleanup, at least when there are many such entries. The inactive-based cleanup is able to skip such entries though, moving them to the tail of the cleanup queue and emitting appropriate alerts in the logs.

If you want workers that are shutting down to stop as fast as possible, consider dropping their connections using tcpdrop / tcpkill instead.

comment:2 by turchanov@…, 8 years ago

Resolution: invalid
Status: closed → reopened

This bug still happens even when I do not kill workers, and even without doing a reload at all.

And it appears to be related to HTTP/2 being configured for the 'server' section. I disabled HTTP/2 and everything worked fine for a couple of weeks, but when I re-enable HTTP/2 the problem reappears within 1-2 days (always!).

comment:3 by turchanov@…, 8 years ago

I see the problem with current version 1.12.0, btw.

comment:4 by Maxim Dounin, 8 years ago

Resolution: invalid
Status: reopened → closed

The fact that the problem happens only with HTTP/2 indicates that the problem is not with the cache, but with HTTP/2 (or with the 3rd-party modules you use, when used with HTTP/2). Most likely, workers crash for some reason, causing the same problems as the TERM signal. See ticket #749, as referenced in this ticket's description.

comment:5 by turchanov@…, 8 years ago

Resolution: invalid
Status: closed → reopened

I beg to differ!

There were no segfaults, and I do not use any 3rd-party modules in my setup (I did compile them as dynamic modules, but I do not load them).

The server is configured as a reverse proxy with caching (the basic layout is in the first message). The only difference is that this bug happens only if HTTP/2 is configured in the server section:

server {
    listen x.x.x.x:443 ssl http2;
    ...
}

There is nothing special about the upstream, which is configured with a bunch of backend servers:

upstream static_nodes {
    least_conn;
    keepalive 30;

    server x.x.x.x:yy max_fails=100;
    ...
}

comment:6 by Maxim Dounin, 8 years ago

Again, to clarify: the growing cache you are seeing is just a symptom. It is a result of another problem elsewhere. Or it may not even be the result of a problem at all, but a property of your workload (and your nginx configuration): for example, if there are very long responses nginx tries to cache, these responses may block max_size-based cache expiration. It is highly unlikely that anybody else will be able to trace the cause of what you are seeing for you.

Some trivial suggestions on how to investigate things:

  • Check error logs for any emerg, alert, and crit messages. If there are any, investigate their causes. Make sure such messages are properly logged - for example, killing a worker with SIGKILL is expected to result in an [alert] ... worker process ... exited on signal 9 error message.
  • Set a small enough inactive time to trigger inactive-based expiration instead of max_size-based expiration, and watch for ignore long locked inactive cache entry messages in the logs. Investigate the resources they appear for.
  • When there is a problem, try reloading the configuration and watch the old workers. If they do not exit for a long enough time, this indicates there are long-lived requests in these workers. Investigate what happens in these requests.
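The first two checks above can be sketched in shell; the log path is an assumption, adjust it to your error_log setting:

```shell
#!/bin/sh
# Count serious log messages and "long locked" cache alerts in the error log.
LOG="${LOG:-/var/log/nginx/error.log}"   # adjust to your error_log setting
# emerg/alert/crit messages that need investigating:
grep -cE '\[(emerg|alert|crit)\]' "$LOG" 2>/dev/null || true
# entries skipped by inactive-based cleanup because they stayed locked:
grep -c 'ignore long locked inactive cache entry' "$LOG" 2>/dev/null || true
```

A steadily growing count of the second message over time, with no crash-related alerts, would match the "locked entries, no crashes" picture discussed in this ticket.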

comment:7 by turchanov@…, 8 years ago

I will check your suggestions but, bear with me, aren't you missing the fact that this problem doesn't occur with HTTP/2 disabled?

I mean that we have (statistically) the same clients, the same workload, the same (potentially) long-running responses, etc... and everything works fine for more than 2 weeks (cache size on disk never grows even 0.5% above the configured limit) until I re-enable HTTP/2, and then it breaks within a couple of days.

Last edited 8 years ago by turchanov@…

comment:8 by Maxim Dounin, 8 years ago

The fact that it doesn't happen without HTTP/2 may mean that this is a bug in the HTTP/2 implementation, or that using HTTP/2 somehow changes the behaviour of your clients. In either case this needs investigation.

comment:9 by turchanov@…, 8 years ago

1) I configured "error_log logs/error.log notice;" and verified that I see "worker process XXXX exited on signal 9" when I send KILL to a worker process.

Tested with max_size expiration.
The bug occurred after ~2 days and there were no "exited on signal" messages in error.log.

2) I set inactive to 1h and verified that inactivity-based expiration is performed: cache usage oscillates between 1/5 and 1/2 of max_size.

I see quite a few "ignore long locked inactive cache entry" messages in the logs.

There is nothing special in those entries. Some of the entries already existed in the cache before the nginx restart and were a 'HIT'; some of the entries were an initial 'MISS'.

3) I did a reload while using inactivity-based expiration, after seeing 'locked entry' messages in the logs. Most workers exited within ~5 minutes and the last one exited after ~14 minutes. But I still see 'locked entry' messages for the same hash codes (... and newer ones after a while).

comment:10 by Maxim Dounin, 8 years ago

So from (2) it is clear that the problem is locked entries in the cache. And (3) suggests that these entries are not locked by active requests, so this is likely a bug in HTTP/2.

Please make sure that there are no alert messages in the logs after (3) - there may be open socket left in connection ... alerts, indicating socket leaks. If there are any, some hints on further debugging can be found here.

Further debugging steps will likely require debug logs of requests to a particular resource which becomes "long locked". It may not be trivial to obtain such logs though.

comment:11 by turchanov@…, 8 years ago

I did a reload and there are open socket left in connection messages in error.log:

2017/05/12 16:59:13 [notice] 26997#0: signal process started
2017/05/12 16:59:13 [notice] 4432#0: signal 1 (SIGHUP) received, reconfiguring
2017/05/12 16:59:13 [notice] 4432#0: reconfiguring
...
2017/05/12 16:59:13 [notice] 23185#0: gracefully shutting down
2017/05/12 16:59:56 [alert] 23185#0: *73408205 open socket #209 left in connection 94
2017/05/12 16:59:56 [alert] 23185#0: *63917659 open socket #177 left in connection 122
2017/05/12 16:59:56 [alert] 23185#0: *134259471 open socket #91 left in connection 363
2017/05/12 16:59:56 [alert] 23185#0: *116373135 open socket #127 left in connection 374
2017/05/12 16:59:56 [alert] 23185#0: *161321399 open socket #253 left in connection 467
2017/05/12 16:59:56 [alert] 23185#0: *104341809 open socket #282 left in connection 564
2017/05/12 16:59:56 [alert] 23185#0: aborting
...

I will configure debug_points abort and perform the steps described in "Socket leaks". But I am not sure whether I need to configure the debug level in error.log for that or not.

comment:12 by Maxim Dounin, 8 years ago

To obtain a core dump with "debug_points abort;", the debug level in logs is not needed. Debug logs will likely be needed later though, for further investigation of what causes the socket leak in question.

As obtaining a core dump is usually easier than enabling debug logging on production systems, in most cases the optimal strategy is: obtain a core dump and take a look at it; if and when debug logs are required for further investigation, obtain a new core dump with debug logs enabled.
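A minimal sketch of the nginx.conf settings for this core-dump-first approach (the directives are stock nginx core; the working_directory path is an assumption, use any directory writable by the worker user):

```nginx
# main (top-level) context of nginx.conf
worker_rlimit_core  512m;        # allow workers to write large core files
working_directory   /var/tmp/;   # directory where core files are dumped
debug_points        abort;       # call abort() on alerts such as
                                 # "open socket ... left in connection"
```

With debug_points abort; set, the worker aborts at the point the leak is detected, so the resulting core reflects the leaked connection's state.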

comment:13 by totallyunknown@…, 7 years ago

I experience the same bug with 1.10.0-0ubuntu0.16.04.4. It's related to http2 or http1.1/TLS.

Here are two servers serving nearly identical content (round-robin load balanced):

https://gist.github.com/totallyunknown/734817bea1676bac7215b68886e2fb64

  • the yellow one serves a very small portion of HTTP/1.1 TLS / HTTP/2 traffic; everything else is plain HTTP
  • the green one is serving 100% TLS/HTTP2

Actually it is somehow related to stale file handles (3rd chart in the link above).

$ lsof /data/nginx/cache/ | grep deleted
nginx    3777 www-data  320r   REG   0,39     5205 903745311 /data/nginx/cache/temp/2/69/2ddd8ba2cb2ed7280744b04c8cb7ee81 (deleted)
nginx    3778 www-data  243r   REG   0,39     4942 903839128 /data/nginx/cache/temp/2/72/492ca4c777dbbb4ca4f39aa4e62b2564 (deleted)
nginx    3778 www-data  303r   REG   0,39     3546 903958194 /data/nginx/cache/temp/5/42/ce09607dd6a578cee7e2be3dc6aef4ef (deleted)
nginx    3778 www-data  670r   REG   0,39     3029 904161980 /data/nginx/cache/temp/6/39/6f0b67ed81399491b91ea44e10955b0e (deleted)
nginx    3781 www-data  413r   REG   0,39     4312 903515529 /data/nginx/cache/temp/3/24/207fa20f5a948e9e4d7eae2a79e32da0 (deleted)

The first chart shows the number of files with a cache age of <= 2 min. It looks like the cache manager is not able to evict the content fast enough. The other server is performing well.
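The leaked handles can be counted with a variant of the lsof check above (the path matches the proxy_cache_path used in this report; `+D` recurses, which can be slow on a large cache):

```shell
#!/bin/sh
# Count cache files that are deleted on disk but still held open by nginx.
CACHE="${CACHE:-/data/nginx/cache}"   # proxy_cache_path from this report
lsof +D "$CACHE" 2>/dev/null | awk '/\(deleted\)/ {n++} END {print n+0}'
```

Deleted-but-open files still consume disk space, which explains why the filesystem fills up beyond what the cache directory listing suggests.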

A SIGHUP to nginx will release some stale file handles, but not all. The following messages show up in the log:

2017/05/28 18:48:52 [alert] 3778#3778: *38326 open socket #312 left in connection 81
2017/05/28 18:48:52 [alert] 3778#3778: *69684 open socket #554 left in connection 129
2017/05/28 18:48:52 [alert] 3778#3778: *62518 open socket #602 left in connection 241
2017/05/28 18:48:52 [alert] 3778#3778: *3098 open socket #261 left in connection 333
2017/05/28 18:48:52 [alert] 3778#3778: *71110 open socket #666 left in connection 607
2017/05/28 18:48:52 [alert] 3778#3778: *102851 open socket #431 left in connection 623

relevant cache config:

    proxy_cache_path /data/nginx/cache levels=1:2 keys_zone=edge_cache:128m max_size=115g inactive=3h use_temp_path=off;
    proxy_cache_use_stale updating;
    proxy_cache edge_cache;
    proxy_cache_min_uses 1;
    proxy_cache_lock on;
    proxy_cache_key $uri$is_args$args;
    proxy_cache_valid 200 302 10m;
    proxy_cache_valid 301      1h;

The filesystem for the cache has a total size of 150GB, but it looks like that's not enough for an nginx cache of 115GB.

I'll try to enable debug logs, but I think this bug is easy to reproduce with many requests over long-open connections and http2.

comment:14 by totallyunknown@…, 7 years ago

I've thrown some time at debugging the issue with debug logs - not successful so far. But I can confirm that this bug goes away if I disable http2.

@turchanov: Can you try lowering http2_max_requests, maybe to 100? Before 1.11.6 the number of requests per connection was unlimited - maybe it's somehow related to this bug.
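For reference, the suggestion above would look like this in the server block (a sketch; the http2_max_requests directive exists since nginx 1.11.6, with a default of 1000):

```nginx
server {
    listen 443 ssl http2;

    # Recycle HTTP/2 connections after 100 requests instead of the
    # default 1000, so any leaked per-connection state is freed sooner.
    http2_max_requests 100;
}
```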

comment:15 by Valentin V. Bartenev, 7 years ago

There was a socket leak fixed in 1.11.3, so it's better to use the latest version for debugging.

comment:16 by caseyf@…, 7 years ago

I'm also observing this problem with nginx 1.11.12 and proxy_cache + HTTP/2.

I'm testing with a newer nginx now since there have been some HTTP/2 fixes.

Last edited 7 years ago by caseyf@…

comment:17 by turchanov@…, 7 years ago

(Sorry for the long delay.)
I tested with nginx 1.13.3 and no longer experience the original problem. I do see a lot of "ignore long locked inactive cache entry" messages in the error log with expiration by max_size, so the bug is still there, but it no longer hinders the cleanup of cached entries (I guess the workaround from 1.13.1, "now the cache manager ignores long locked cache entries when cleaning cache based on the 'max_size' parameter", really helps).

comment:18 by Maxim Dounin, 7 years ago

So it looks like socket leaks are still there in 1.13.3. Without any 3rd-party modules, and only with http2 enabled, correct? Please see comment:10 for further steps on how to debug this.

comment:19 by rihad.mail.ru@…, 7 years ago

Same issue with nginx 1.14.0 on FreeBSD 10.3. If inactive is set high enough to become effectively ineffective (7 days or so), it becomes obvious that the limit imposed by max_size never seems to kick in, and disk usage constantly grows at a slow pace. http2 is enabled.

$ nginx -V
nginx version: nginx/1.14.0
built with LibreSSL 2.7.3
TLS SNI support enabled
configure arguments: --prefix=/usr/local/etc/nginx --with-cc-opt='-I /usr/local/include' --with-ld-opt='-L /usr/local/lib' --conf-path=/usr/local/etc/nginx/nginx.conf --sbin-path=/usr/local/sbin/nginx --pid-path=/var/run/nginx.pid --error-log-path=/var/log/nginx/error.log --user=www --group=www --modules-path=/usr/local/libexec/nginx --with-file-aio --with-cc-opt='-DNGX_HAVE_INET6=0 -I /usr/local/include' --with-threads --without-mail_imap_module --without-mail_pop3_module --without-mail_smtp_module --http-client-body-temp-path=/var/tmp/nginx/client_body_temp --http-fastcgi-temp-path=/var/tmp/nginx/fastcgi_temp --http-proxy-temp-path=/var/tmp/nginx/proxy_temp --http-scgi-temp-path=/var/tmp/nginx/scgi_temp --http-uwsgi-temp-path=/var/tmp/nginx/uwsgi_temp --http-log-path=/var/log/nginx/access.log --with-http_addition_module --with-http_auth_request_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_random_index_module --with-http_realip_module --with-pcre --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --add-dynamic-module=/usr/ports/www/nginx/work/ngx_brotli-37ab9b2

comment:20 by Maxim Dounin, 7 years ago

What's in the logs? Is the problem resolved when you disable http2? Are you able to reproduce the problem without 3rd party modules? See also comment:6 for more hints on debugging things.

Please also note that starting with nginx 1.13.1, cleaning the cache based on max_size ignores long locked entries, see CHANGES. If max_size is not working for you in nginx 1.14.0, you are likely facing a completely different issue.

comment:21 by rihad.mail.ru@…, 6 years ago

@mdounin, I think our problem is the same as everyone else's and happens when http2 is enabled. The problem goes away after an nginx restart: any files exceeding max_size are gradually deleted as soon as the cache loader finishes its work. A mere reload (SIGHUP) isn't enough. And at the same time new size leaks begin accumulating gradually (and constantly).

Last edited 6 years ago by rihad.mail.ru@…

in reply to:  21 comment:22 by Maxim Dounin, 6 years ago

Replying to rihad.mail.ru@…:

@mdounin, I think our problem is the same as everyone else's and happens when http2 is enabled.

As already asked in comment:20, does the problem go away when you disable http2?

The problem goes away after an nginx restart: any files exceeding max_size are gradually deleted as soon as the cache loader finishes its work. A mere reload (SIGHUP) isn't enough. And at the same time new size leaks begin accumulating gradually (and constantly).

What's in the logs? Are you able to reproduce the problem without 3rd party modules? See also comment:6 for more hints on debugging things.

comment:23 by jayce@…, 6 years ago

I also had this problem with nginx 1.11.2 and proxy_cache + HTTP/2 + slice_module, when I disable http2 this problem is gone.

I think it is caused by an http2 socket leak; see the gdb trace:

(gdb) print *((ngx_http_v2_connection_t *)(cycle->connections[143]->data))->streams_index[5]->stream->request->cache->node
$43 = {node = {key = 5900161498957275486, left = 0x7f544f7d8018, right = 0x7f544f7d8018, parent = 0x7f545cd7f900, color = 1 '\001', data = 0 '\000'},
  queue = {prev = 0x7f5455c2c528, next = 0x7f54604b7ba8}, key = "'\212\302U4\302\315\062", count = 0, uses = 22, valid_msec = 0, error = 0, exists = 0,
  updating = 0, deleting = 0, uniq = 0, expire = 1530285768, valid_sec = 0, body_start = 0, fs_size = 0, lock_time = 1529940173277}
(gdb) print *((ngx_http_v2_connection_t *)(cycle->connections[143]->data))->streams_index[5]->stream->request->pool->cleanup
$44 = {handler = 0x46ff80 <ngx_http_file_cache_cleanup>, data = 0x7f544b991060, next = 0x7f544b20b440}

in reply to:  23 comment:24 by Maxim Dounin, 6 years ago

Replying to jayce@…:

I also had this problem with nginx 1.11.2 and proxy_cache + HTTP/2 + slice_module, when I disable http2 this problem is gone.

There is at least one known socket leak in the HTTP/2 code of nginx 1.11.2, fixed in 1.11.3. Quoting CHANGES:

Changes with nginx 1.11.3                                        26 Jul 2016

...

    *) Bugfix: socket leak when using HTTP/2.

...

Trying to debug anything with 1.11.2 doesn't really make sense. Please upgrade to the latest mainline version (nginx 1.15.1) and test if you are able to reproduce the problem.

comment:25 by Maxim Dounin, 5 years ago

Resolution: worksforme
Status: reopened → closed

A number of socket leaks in the HTTP/2 code have been fixed since the last meaningful comment in this ticket. Most likely, the leaks observed by the original reporter are fixed as well. Closing this.
