Opened 10 months ago

Last modified 2 months ago

#1163 reopened defect

cache size grows over max_size

Reported by: turchanov@… Owned by:
Priority: critical Milestone:
Component: nginx-core Version: 1.11.x
Keywords: Cc:
Sensitive: no
uname -a: Linux 4.1.12-61.1.19.el7uek.x86_64 #2 SMP Fri Nov 18 17:00:09 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.11.7 built by gcc 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) built with OpenSSL 1.0.2j-fips 26 Sep 2016 TLS SNI support enabled configure arguments: --prefix=/opt/nginx-1.11.7 --pid-path=/var/run/nginx.pid --http-client-body-temp-path=/tmp/nginx-client_body_temp --http-proxy-temp-path=/tmp/nginx-proxy_temp --http-fastcgi-temp-path=/tmp/nginx-fastcgi_temp --http-uwsgi-temp-path=/tmp/nginx-uwsgi_temp --http-scgi-temp-path=/tmp/nginx-scgi_temp --with-pcre --with-http_ssl_module --with-openssl=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/openssl-1.0.2j --with-openssl-opt='no-krb5 enable-camellia enable-seed enable-tlsext enable-rfc3779 no-mdc2 no-ec2m no-gost no-srp -fno-strict-aliasing fips' --with-http_v2_module --with-http_stub_status_module --with-http_gzip_static_module --with-http_gunzip_module --with-http_realip_module --with-stream --with-stream_ssl_module --with-http_geoip_module=dynamic --add-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/ngx_devel_kit-master --add-dynamic-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/lua-nginx-module-master --add-dynamic-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/lua-upstream-nginx-module-master --add-dynamic-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/set-misc-nginx-module-master --add-dynamic-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/array-var-nginx-module-master --add-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/echo-nginx-module-0.60 --add-dynamic-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/nginx-module-vts-0.1.11 --add-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/nginx_upstream_check_module-master --add-dynamic-module=/opt/orabuild/rpmbuild/BUILD/nginx-1.11.7-addons/nchan-1.0.8

Description

Periodically the proxy_cache size grows over the configured limit (see the attached graph; note that Zabbix's 150G equals 140GiB) and usually returns to normal after several hours. I did some gdb poking in the 'cache manager' process, and it shows that ngx_http_file_cache_manager is called and that the current cache size is indeed larger than max_size:

(gdb)
1940            size = cache->sh->size;
(gdb) print size
$2 = 42999201
(gdb) print cache->max_size
$3 = 36700160

However, ngx_http_file_cache_forced_expire deletes nothing because the cache entries are locked (fcn->count == 1), and after the hard-coded limit of 20 attempts the search is aborted.

#749 suggests nginx worker crashes, but we had none. However, after a reconfigure I did several times kill workers in the "worker process is shutting down" state with the TERM signal, as I needed to suppress the activity of workers running the previous configuration as soon as possible.

So the question is: could terminating workers with the TERM signal cause this issue, and if so, why didn't the workers clean up the locks they held in the cache? Or could it be proxy_cache_lock on that causes the trouble?

And there is one thing that bothers me: why does the cache manager eventually succeed in cleaning up the cache (as you can see on the attached graph)? If the problem were locked entries left hanging after a worker process was killed with the TERM signal, the cache manager would never be able to clean up.

My config (I hope that I didn't cut too much):

http {
...
  proxy_cache_path /opt/cache/static levels=1:2 use_temp_path=off keys_zone=static:2048m inactive=1y max_size=140g manager_files=50000 manager_sleep=1 manager_threshold=1000 loader_files=50000 loader_threshold=1000 loader_sleep=1;
...
  server {
    location / {
      proxy_cache static;

      proxy_no_cache $no_cache_static;
      proxy_cache_bypass $no_cache_static;

      proxy_cache_lock on;
      proxy_cache_lock_timeout 1s;

      proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
      proxy_cache_valid 200      1y;

      proxy_pass http://static_nodes;
    }
  }
}

Attachments (1)

nginx-cache-size.png (39.9 KB) - added by turchanov@… 10 months ago.
Cache size


Change History (19)

Changed 10 months ago by turchanov@…

Cache size

comment:1 Changed 10 months ago by mdounin

  • Resolution set to invalid
  • Status changed from new to closed

Killing workers with the TERM signal terminates them immediately without any shared memory cleanup: the signal is expected to be used by the master process during termination of nginx itself, so it doesn't try to clean anything up. Therefore such termination is expected to cause trouble with shared memory entries being left locked, including cache and limit_conn zones.

In the cache, locked entries will break max_size-based cleanup as long as there are many such entries. The inactive-based cleanup is able to skip such entries, though, by moving them to the tail of the cleanup queue and logging appropriate alerts.

If you want shutting-down workers to stop as fast as possible, consider dropping the connections to the workers in question with tcpdrop / tcpkill instead.

comment:2 Changed 6 months ago by turchanov@…

  • Resolution invalid deleted
  • Status changed from closed to reopened

This bug still happens even when I do not kill workers, or even do no reload at all.

And it appears to be related to HTTP/2 being configured in the 'server' section. I disabled HTTP/2 and everything worked fine for a couple of weeks, but when I re-enable HTTP/2 the problem reappears within 1-2 days (always!).

comment:3 Changed 6 months ago by turchanov@…

I see the problem with current version 1.12.0, btw.

comment:4 Changed 6 months ago by mdounin

  • Resolution set to invalid
  • Status changed from reopened to closed

The fact that the problem happens only with HTTP/2 indicates that the problem is not with the cache, but with HTTP/2 (or with the 3rd-party modules you use, when used with HTTP/2). Most likely, workers crash for some reason, causing the same problems as the TERM signal. See ticket #749, referenced in this ticket's description.

comment:5 Changed 6 months ago by turchanov@…

  • Resolution invalid deleted
  • Status changed from closed to reopened

I beg to differ!

There were no segfaults, and I do not use any 3rd-party modules in my setup (I did compile them as dynamic modules, but I do not load them).

The server is configured as a reverse proxy with caching (the basic layout is in the first message). The only difference is that this bug happens only if HTTP/2 is configured in the server section:

server {
    listen x.x.x.x:443 ssl http2;
    ...
}

There is nothing special with the upstream which is configured with a bunch of backend servers:

upstream static_nodes {
    least_conn;
    keepalive 30;

    server x.x.x.x:yy max_fails=100;
    ...
}

comment:6 Changed 6 months ago by mdounin

Again, to clarify: the growing cache you are seeing is just a symptom. It is the result of another problem elsewhere. Or it may even be not the result of a problem at all, but a property of your workload (and your nginx configuration): for example, if there are very long responses nginx tries to cache, these responses may block max_size-based cache expiration. It is highly unlikely anybody will be able to help you with tracing the cause of what you are seeing.

Some trivial suggestions on how to investigate things:

  • Check error logs for any emerg, alert, and crit messages. If there are any, investigate their causes. Make sure such messages are properly logged - for example, killing a worker with SIGKILL is expected to result in an "[alert] ... worker process ... exited on signal 9" error message.
  • Set a small enough inactive time to trigger inactive-based expiration instead of max_size-based expiration, and watch for "ignore long locked inactive cache entry" messages in the logs. Investigate the resources they appear for.
  • When there is a problem, try reloading the configuration and watch the old workers. If they do not exit for a long enough time, this indicates there are long-lived requests in these workers. Investigate what happens in these requests.
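The log checks above can be wrapped in a small shell helper (a minimal sketch; the message strings come from the suggestions above, while the function name and usage path are illustrative assumptions):

```shell
#!/bin/sh
# Sketch: scan an nginx error log for the messages suggested above.
# The function name is illustrative, not part of nginx.
check_nginx_log() {
    log=$1
    # emerg/alert/crit messages (worker crashes appear as
    # "[alert] ... worker process ... exited on signal ...")
    grep -E '\[(emerg|alert|crit)\]' "$log" || true
    # how many entries the cache manager had to skip because they
    # stayed locked
    grep -c 'ignore long locked inactive cache entry' "$log" || true
}

# Usage: check_nginx_log /path/to/error.log
```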

comment:7 Changed 6 months ago by turchanov@…

I will check your suggestions, but, bear with me, aren't you missing the fact that this problem doesn't occur with HTTP/2 disabled?

I mean that we have (statistically) the same clients, the same workload, the same (potentially) long-running responses, etc., and everything works fine for more than two weeks (the cache size on disk never grows even 0.5% above the configured limit) until I re-enable HTTP/2, and then it breaks within a couple of days.

Last edited 6 months ago by turchanov@…

comment:8 Changed 6 months ago by mdounin

The fact that it doesn't happen without HTTP/2 may mean that this is a bug in the HTTP/2 implementation, or that using HTTP/2 somehow changes the behaviour of your clients. Either case needs investigation.

comment:9 Changed 6 months ago by turchanov@…

1) I configured "error_log logs/error.log notice;" and verified that I see "worker process XXXX exited on signal 9" when I send KILL to a worker process.

I tested with max_size-based expiration.
The bug occurred after ~2 days, and there were no "exited on signal" messages in error.log.

2) I set inactive to 1h and verified that inactivity-based expiration is performed: cache usage oscillates between 1/5 and 1/2 of max_size.

I see quite a few "ignore long locked inactive cache entry" messages in the logs.

There is nothing special about those entries. Some of them already existed in the cache before the nginx restart and were a 'HIT'; some were an initial 'MISS'.

3) I did a reload when using inactivity-based expiration, after seeing 'locked entry' messages in the logs. Most workers exited within ~5 minutes, and the last one exited after ~14 minutes. But I still see 'locked entry' messages for the same hash codes (... and newer ones after a while).

comment:10 Changed 5 months ago by mdounin

So from (2) it is clear that the problem is locked entries in the cache. And (3) suggests that these entries are not locked by any active requests, so this is likely a bug in HTTP/2.

Please make sure that there are no alert messages in the logs after (3) - there may be "open socket ... left in connection" alerts, indicating socket leaks. If there are any, some hints on further debugging can be found here.

Further debugging will likely require debug logs of requests to a particular resource which becomes "long locked". It may not be trivial to obtain such logs, though.

comment:11 Changed 5 months ago by turchanov@…

I did a reload and there are open socket left in connection messages in error.log:

2017/05/12 16:59:13 [notice] 26997#0: signal process started
2017/05/12 16:59:13 [notice] 4432#0: signal 1 (SIGHUP) received, reconfiguring
2017/05/12 16:59:13 [notice] 4432#0: reconfiguring
...
2017/05/12 16:59:13 [notice] 23185#0: gracefully shutting down
2017/05/12 16:59:56 [alert] 23185#0: *73408205 open socket #209 left in connection 94
2017/05/12 16:59:56 [alert] 23185#0: *63917659 open socket #177 left in connection 122
2017/05/12 16:59:56 [alert] 23185#0: *134259471 open socket #91 left in connection 363
2017/05/12 16:59:56 [alert] 23185#0: *116373135 open socket #127 left in connection 374
2017/05/12 16:59:56 [alert] 23185#0: *161321399 open socket #253 left in connection 467
2017/05/12 16:59:56 [alert] 23185#0: *104341809 open socket #282 left in connection 564
2017/05/12 16:59:56 [alert] 23185#0: aborting
...

I will configure "debug_points abort;" and perform the steps described in "Socket leaks". But I am not sure whether I need to configure the debug level in error.log for that.

comment:12 Changed 5 months ago by mdounin

To obtain a core dump with "debug_points abort;", the debug level in logs is not needed. Debug logs will likely be needed later, though, for further investigation of what causes the socket leak in question.

As obtaining a core dump is usually easier than enabling debug logging on production systems, in most cases the optimal strategy is: obtain a core dump and take a look at it; if and when debug logs are required for further investigation, obtain a new core dump with debug logs enabled.
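For reference, a core-dump setup built around the directive discussed above might look like this (a sketch only; `debug_points abort;` is the directive from the ticket, while the directory path and rlimit value are illustrative assumptions):

```nginx
# Sketch: make a worker call abort() and dump core when it detects a
# leaked socket ("open socket ... left in connection" alert).
debug_points abort;

# Allow and locate core dumps; the path and size are examples,
# not values from this ticket.
working_directory /var/tmp/nginx-cores;
worker_rlimit_core 2g;
```

The operating system must also permit core dumps (e.g. a suitable `ulimit -c` / `kernel.core_pattern` on Linux) for the file to actually appear.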

comment:13 Changed 5 months ago by totallyunknown@…

I experience the same bug with 1.10.0-0ubuntu0.16.04.4. It's related to HTTP/2 or HTTP/1.1 over TLS.

Here are two servers serving nearly identical content (round-robin load balanced):

https://gist.github.com/totallyunknown/734817bea1676bac7215b68886e2fb64

  • the yellow one serves a very small portion of HTTP/1.1 TLS / HTTP/2 traffic; everything else is plain HTTP
  • the green one serves 100% TLS/HTTP/2 traffic

It actually seems somehow related to stale file handles (3rd chart in the link above):

$ lsof /data/nginx/cache/ | grep deleted
nginx    3777 www-data  320r   REG   0,39     5205 903745311 /data/nginx/cache/temp/2/69/2ddd8ba2cb2ed7280744b04c8cb7ee81 (deleted)
nginx    3778 www-data  243r   REG   0,39     4942 903839128 /data/nginx/cache/temp/2/72/492ca4c777dbbb4ca4f39aa4e62b2564 (deleted)
nginx    3778 www-data  303r   REG   0,39     3546 903958194 /data/nginx/cache/temp/5/42/ce09607dd6a578cee7e2be3dc6aef4ef (deleted)
nginx    3778 www-data  670r   REG   0,39     3029 904161980 /data/nginx/cache/temp/6/39/6f0b67ed81399491b91ea44e10955b0e (deleted)
nginx    3781 www-data  413r   REG   0,39     4312 903515529 /data/nginx/cache/temp/3/24/207fa20f5a948e9e4d7eae2a79e32da0 (deleted)

The first chart shows the number of files with a cache age of <= 2 minutes. It looks like the cache manager is not able to evict content fast enough. The other server is performing well.
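Leaked handles of the kind shown above can be tallied per worker with a short pipeline (a sketch; it parses `lsof` output of the shape listed above, and the function name is an illustrative assumption):

```shell
#!/bin/sh
# Sketch: count deleted-but-still-open cache temp files per nginx
# worker PID, from lsof output of the shape shown above.
count_deleted_per_pid() {
    # reads lsof lines on stdin; column 2 is the PID
    grep '(deleted)' | awk '{ n[$2]++ } END { for (p in n) print p, n[p] }' | sort
}

# Usage: lsof /data/nginx/cache/ | count_deleted_per_pid
```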

A SIGHUP to nginx will release some stale file handles, but not all. The following messages show up in the log:

2017/05/28 18:48:52 [alert] 3778#3778: *38326 open socket #312 left in connection 81
2017/05/28 18:48:52 [alert] 3778#3778: *69684 open socket #554 left in connection 129
2017/05/28 18:48:52 [alert] 3778#3778: *62518 open socket #602 left in connection 241
2017/05/28 18:48:52 [alert] 3778#3778: *3098 open socket #261 left in connection 333
2017/05/28 18:48:52 [alert] 3778#3778: *71110 open socket #666 left in connection 607
2017/05/28 18:48:52 [alert] 3778#3778: *102851 open socket #431 left in connection 623

relevant cache config:

    proxy_cache_path /data/nginx/cache levels=1:2 keys_zone=edge_cache:128m max_size=115g inactive=3h use_temp_path=off;
    proxy_cache_use_stale updating;
    proxy_cache edge_cache;
    proxy_cache_min_uses 1;
    proxy_cache_lock on;
    proxy_cache_key $uri$is_args$args;
    proxy_cache_valid 200 302 10m;
    proxy_cache_valid 301      1h;

The filesystem for the cache has a total size of 150GB, but it looks like that's not enough for an nginx cache with max_size=115g.

I'll try to enable debug logs, but I think this bug is easy to reproduce with many requests over long-lived open connections and HTTP/2.
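Growth over max_size like the kind reported here can be watched with a trivial disk check (a sketch; the function name is an illustrative assumption, and the usage line plugs in the 115g limit from the config above, expressed in KiB):

```shell
#!/bin/sh
# Sketch: report whether the on-disk cache size exceeds a limit.
# Arguments: cache directory, limit in KiB.
cache_over_limit() {
    dir=$1
    limit_kib=$2
    used_kib=$(du -sk "$dir" | awk '{ print $1 }')
    if [ "$used_kib" -gt "$limit_kib" ]; then
        echo "OVER: ${used_kib}KiB used, limit ${limit_kib}KiB"
    else
        echo "OK: ${used_kib}KiB used, limit ${limit_kib}KiB"
    fi
}

# Usage: cache_over_limit /data/nginx/cache $((115 * 1024 * 1024))
```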

comment:14 Changed 5 months ago by totallyunknown@…

I've spent some time debugging the issue with debug logs - not successful so far. But I can confirm that this bug goes away if I disable HTTP/2.

@turchanov: Can you try lowering http2_max_requests, maybe to 100? Before 1.11.6 the number of requests per connection was unlimited - maybe that's somehow related to this bug.
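The suggested change would look like this in the server block (a sketch; http2_max_requests is the directive proposed above, available since 1.11.6, and the listen line mirrors the redacted config earlier in this ticket):

```nginx
server {
    listen x.x.x.x:443 ssl http2;

    # Close an HTTP/2 connection after 100 requests so long-lived
    # connections get recycled sooner; 100 is the value suggested above.
    http2_max_requests 100;
}
```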

comment:15 Changed 5 months ago by vbart

There was a socket leak fixed in 1.11.3, so it's better to use the latest version for debugging.

comment:16 Changed 4 months ago by caseyf@…

I'm also observing this problem with nginx 1.11.12 and proxy_cache + HTTP/2.

I'm testing with a newer nginx now since there have been some HTTP/2 fixes.

Last edited 4 months ago by caseyf@…

comment:17 Changed 2 months ago by turchanov@…

(Sorry for the long delay.)
I tested with nginx 1.13.3 and no longer experience the original problem. I do see a lot of "ignore long locked inactive cache entry" messages in the error log with max_size-based expiration, so the bug is still there, but it no longer hinders the cleanup of cached entries (I guess the 1.13.1 workaround, "now the cache manager ignores long locked cache entries when cleaning cache based on the 'max_size' parameter", really helps).

comment:18 Changed 2 months ago by mdounin

So it looks like the socket leaks are still there in 1.13.3. Without any 3rd-party modules, and only with HTTP/2 enabled, correct? Please see comment:10 for further steps on how to debug this.
