Opened 8 months ago

Closed 8 months ago

#2540 closed defect (invalid)

nginx stable 1.24 issues with cache file deletion under heavy load

Reported by: p4rancesc0@… Owned by:
Priority: major Milestone:
Component: nginx-core Version: 1.24.x
Keywords: cache expires file deletion disk space Cc: p4rancesc0@…
uname -a: [root@conginx01 live]# uname -a
Linux conginx01.XXX.YYY 4.18.0-372.32.1.el8_6.x86_64 #1 SMP Thu Oct 27 15:18:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.24.0
built by gcc 8.5.0 20210514 (Red Hat 8.5.0-4) (GCC)
built with OpenSSL 1.1.1k FIPS 25 Mar 2021
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib64/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -pie'

Description (last modified by p4rancesc0@…)

Hi,

during some synthetic benchmarks on our nginx we observed a strange behaviour:
after the benchmark was stopped and the cached files' validity had expired, we found a large amount of files (>200K) still shown by lsof on the cache directory, while a find -type f of the same directory returned nothing. The space inside the directory was still in use.

Our test:

  • using a custom client we generate >10 Gbit/s over 900 connections to one nginx server, asking every 4 seconds for new files that the backend will provide (a rough sketch of the client loop follows this list).
  • each file is requested 2 times.
  • /data/nginx/live is our caching directory
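Roughly, the client behaves like the sketch below (the real load generator is a custom tool; the URL pattern here is purely illustrative):

    #!/bin/bash
    # Illustrative sketch only -- the real client is a custom tool.
    # Every 4 seconds ask for a fresh set of files (new names each round,
    # so the backend provides them), and request each file twice so the
    # second hit is served from the cache.
    while true; do
        ts=$(date +%s)                # fresh names every iteration
        for i in $(seq 1 900); do     # roughly one request per connection
            curl -s -o /dev/null "http://conginx01/live/seg-$ts-$i" &
            curl -s -o /dev/null "http://conginx01/live/seg-$ts-$i" &
        done
        wait
        sleep 4
    done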

See the initial nginx status:

[root@conginx01 live]# find /data/nginx/live/ -type f
[root@conginx01 live]# lsof /data/nginx/live/ | grep deleted | wc -l
0
[root@conginx01 live]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G     0  205G   0% /data/nginx/live
  • we start the benchmark for 10 minutes; we can easily saturate our 205 GB RAM cache disk in a few minutes (around 5).
  • once the cache disk is full, in the sense that usage is around its size minus min_free, we keep running for 5 more minutes.

see:

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  196G  9.8G  96% /data/nginx/live

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  196G   10G  96% /data/nginx/live

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  196G  9.8G  96% /data/nginx/live


Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  196G   10G  96% /data/nginx/live
  • we stop the benchmark and wait 10 minutes for the cache to expire and be deleted.

During the 10-minute test we measured:

  • average response duration: 0.003 seconds
  • maximum response duration: 0.636 seconds
  • a total of 2169600 requests
  • a total of 41 errors on the client side
  • zero 4xx,5xx
  • 802022225512 bytes transferred, around 800GB

At this point this is the status:

[root@conginx01 live]# find /data/nginx/live/ -type f | wc -l
345645
[root@conginx01 live]# lsof /data/nginx/live/ | grep deleted | wc -l
359
[root@conginx01 live]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  195G   11G  96% /data/nginx/live

At this point everything looks FINE; there is no longer any traffic on the nginx server.

After a while we see:

  • used space going down, OK
[root@conginx01 live]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  143G   63G  70% /data/nginx/live
  • cache file number decreasing, OK
[root@conginx01 live]# find /data/nginx/live/ -type f | wc -l
240997
  • files in deleted status growing
[root@conginx01 live]# lsof /data/nginx/live/ | grep deleted | wc -l
11810

We keep waiting for all the cache items to expire:

Finally we see:

  • no more files in cache partition
[root@conginx01 live]# find /data/nginx/live/ -type f | wc -l
0
  • a lot of space still used in the cache partition
[root@conginx01 live]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  134G   72G  65% /data/nginx/live
  • around 250K files in deleted status according to lsof, a stable value
[root@conginx01 live]# date; lsof /data/nginx/live/ | grep deleted | wc -l
Wed Aug 30 14:39:41 CEST 2023
268690
[root@conginx01 live]# date; lsof /data/nginx/live/ | grep deleted | wc -l
Wed Aug 30 14:41:05 CEST 2023
268690
[root@conginx01 live]# date; lsof /data/nginx/live/ | grep deleted | wc -l
Wed Aug 30 14:42:21 CEST 2023
268690

Most of them are unique cache files; some have 2, 3, 4, 5 occurrences:

see:

[root@conginx01 live]# cat log | grep "/data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206"
nginx   3277770 nginx 2087r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
nginx   3277774 nginx *637r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
nginx   3277776 nginx 7101r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
nginx   3277779 nginx *163r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
nginx   3277789 nginx   94r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
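(We counted the occurrences per file with something along these lines, with the lsof output saved in the file "log"; the awk field math matches the output format above:)

    # Count how many worker fds still reference each deleted cache file.
    # In the lsof output the path is the second-to-last field and
    # "(deleted)" is the last one.
    awk '/deleted/ {print $(NF-1)}' log | sort | uniq -c | sort -rn | head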

We are monitoring the disk usage with Nagios; in the previous tests "those files" stayed there from 16:39 yesterday until 10:36 today, when we started a new load test.
In the meantime no clients were connected to nginx.
kill -USR1 doesn't cleanup the "(deleted)" files.
kill -HUP cleans them up immediately.
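For reference, the check boils down to this (pid file path as set by our build's --pid-path):

    # count cache files that are unlinked but still held open by workers
    lsof /data/nginx/live/ | grep -c deleted
    # USR1 (reopen logs) leaves them in place; HUP (reconfigure) shuts
    # down the old workers, which close their descriptors
    kill -HUP "$(cat /var/run/nginx.pid)"
    lsof /data/nginx/live/ | grep -c deleted   # back to 0 after reload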

We are wondering if this behaviour is correct.

I'll put our cache configuration down here:

    open_file_cache_errors      on; 
    proxy_cache_path            /data/nginx/live
                                levels=1:1:2
                                keys_zone=cache-live:200m
                                inactive=10m
                                max_size=200g
                                min_free=10g
                                use_temp_path=off
                                manager_files=10000
                                manager_threshold=5000ms
                                manager_sleep=50ms
                                loader_files=10000
                                loader_threshold=1000ms;

    proxy_connect_timeout       2s; 
    proxy_send_timeout          2s;
    proxy_read_timeout          2s; 
    proxy_buffering             on; 
    proxy_cache_lock            on;     
    proxy_cache_lock_timeout    100ms; 
    proxy_cache_lock_age        50ms;   
    proxy_cache_key             $host$uri;
    proxy_cache_methods         GET HEAD POST;
    proxy_cache_use_stale       updating error timeout invalid_header http_500 http_502 http_503 http_504;
    proxy_cache_revalidate      on;
    proxy_cache_background_update       on;
    proxy_http_version          1.1;
    proxy_ignore_headers        X-Accel-Expires Expires Cache-Control;
    proxy_hide_header           Accept-Encoding;

    proxy_next_upstream         error timeout invalid_header http_403 http_404 http_502 http_503 http_504;
    proxy_set_header            Accept-Encoding none;
    proxy_set_header            Connection "keep-alive";
    proxy_set_header            Host       $host;
    proxy_set_header            X-Real-IP  $remote_addr;
    proxy_max_temp_file_size    0; # Get rid of [warn] 30083#30083: *511554 an upstream response is buffered to a temporary file /data/nginx_temp/0000054131 while reading upstream,
    proxy_buffers               64 8k; #http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_buffers
    gzip                        on;
    gzip_proxied                off;

Then, in the relevant location, we include this:

   set $cache cache-live;
   proxy_cache_valid 400 401 403 405 2s;
   proxy_cache_valid 404 412 500 501 502 503 504 505 509 0s;
   proxy_cache_valid any 5m;
   open_file_cache_valid   10m; #LIVE
   open_file_cache         max=200000 inactive=5m; #LIVE
   gzip              off;
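(For context: the enclosing location, which is not shown here, uses the $cache variable roughly as below; the location path and upstream name are placeholders, not our real configuration.)

    location /live/ {
        proxy_pass  http://backend;   # placeholder upstream
        proxy_cache $cache;           # resolves to the "cache-live" zone
    }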

Best Regards
Francesco

Change History (7)

comment:1 by p4rancesc0@…, 8 months ago

Description: modified (diff)

comment:2 by p4rancesc0@…, 8 months ago

Description: modified (diff)

comment:3 by p4rancesc0@…, 8 months ago

Description: modified (diff)

comment:4 by Maxim Dounin, 8 months ago

There is no traffic on nginx after the test stopped, correct?

What you are seeing looks like a result of open_file_cache caching open file descriptors. While it is configured to close descriptors after 5 minutes of inactivity (inactive=5m), closing inactive file descriptors only happens when the open file cache is actually accessed (more specifically, the open file cache closes up to 2 inactive file descriptors when a request which used the open file cache is finalized). As such, in your particular scenario it keeps file descriptors open till the worker process is shut down.

Consider re-testing without open_file_cache configured; there should be no files left open. Alternatively, try requesting some files after the test, to let the open file cache close inactive file descriptors.
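(For example, something as simple as the loop below would exercise the open file cache after the test; the URL is illustrative, any request served through the cache will do:)

    # Each finalized request that used the open file cache lets nginx
    # close up to 2 inactive cached descriptors, so a burst of requests
    # gradually drains the leftover "(deleted)" files.
    for i in $(seq 1 100000); do
        curl -s -o /dev/null "http://conginx01/live/any-cached-path"
    done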

comment:5 by p4rancesc0@…, 8 months ago

Hi, thanks for the reply.

I can confirm that there is no traffic on nginx after the test is stopped.

I'll try to remove open_file_cache.
Then I'll be back.

BR
Francesco

Last edited 8 months ago by p4rancesc0@… (previous) (diff)

comment:6 by p4rancesc0@…, 8 months ago

Hi,

by removing those two directives:

   open_file_cache_valid   10m; #LIVE
   open_file_cache         max=200000 inactive=5m; #LIVE

our problem is fully resolved.

See, before the test:

[root@conginx01 ~]# lsof /data/nginx/live/ | grep deleted | wc -l
0
[root@conginx01 ~]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G     0  205G   0% /data/nginx/live
[root@conginx01 ~]# find /data/nginx/live/ -type f | wc -l
0

After the benchmark is stopped, there are no files left open in the cache directory.

[root@conginx01 ~]# lsof /data/nginx/live/
[root@conginx01 ~]#


Just 10 minutes later everything is empty.

[root@conginx01 ~]# find /data/nginx/live/ -type f | wc -l
0
[root@conginx01 ~]# lsof /data/nginx/live/
[root@conginx01 ~]# df -h /data/nginx/live
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G     0  205G   0% /data/nginx/live
[root@conginx01 ~]#

During those 10 minutes you can see the cache manager freeing space.

I think you can close this ticket; this behaviour is actually what we expected.

Best Regards.

Francesco

comment:7 by Maxim Dounin, 8 months ago

Resolution: invalid
Status: new → closed

Thanks for confirming, closing this.
