﻿id	summary	reporter	owner	description	type	status	priority	milestone	component	version	resolution	keywords	cc	uname	nginx_version
2540	nginx stable 1.24 issues with cache file deletion under heavy load	p4rancesc0@…		"Hi, 

during some synthetic benchmarks on our nginx we faced a strange behaviour:
after the benchmark was stopped and the cache files validity expired we found a large ammount (>200K files) that appeared in lsof of cache directory while a find -type f of the same returned nothing. The space inside the directory was still in use.

Our test:

- using a custom client generate > 10Gbit/s using 900 connection on one nginx server, asking every 4 seconds new files that the backend will provide.
- each file is asked 2 times. 
- /data/live/nginx is our caching directory

see nginx initial status:


{{{
[root@conginx01 live]# find /data/nginx/live/ -type f
[root@conginx01 live]# lsof /data/nginx/live/ | grep deleted | wc -l
0
[root@conginx01 live]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G     0  205G   0% /data/nginx/live
}}}


- we start the benchmark for 10 minutes, we can easily saturate our 205GB RAM cache disk in few minutes (5 circa...)
- after te cache disk is full, in the meaning of it's arounf its size MINUS min_free we keep runnign 5 min.

see:


{{{
 Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  196G  9.8G  96% /data/nginx/live

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  196G   10G  96% /data/nginx/live

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  196G  9.8G  96% /data/nginx/live


Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  196G   10G  96% /data/nginx/live
}}}





- we stop the benchmark and we wait 10 minutes the cache to expire and to be deleted.

we measured during the 10 minutes test:
 - average response duration: 0.003 seconds
 - maximum response duration: 0.636 seconds
 - a total of 2169600 requests
 - a total of 41 errors on the client side
 - zero 4xx,5xx
 - 802022225512 bytes transferred, around 800GB

at this point this is the status:


{{{
[root@conginx01 live]# find /data/nginx/live/ -type f | wc -l
345645
[root@conginx01 live]# lsof /data/nginx/live/ | grep deleted | wc -l
359
[root@conginx01 live]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  195G   11G  96% /data/nginx/live
}}}



At this point everything looks FINE, the isn't anymore any traffic on the nginx server.

after while we see:

- used space going down, OK


{{{
[root@conginx01 live]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  143G   63G  70% /data/nginx/live
}}}


- cache file number decreasing, OK


{{{
[root@conginx01 live]# find /data/nginx/live/ -type f | wc -l
240997
}}}



- files in deleted status growing


{{{
[root@conginx01 live]# lsof /data/nginx/live/ | grep deleted | wc -l
11810
}}}

we still wait all cache items to expire:

finally we see:

- no more files in cache partition 

{{{
[root@conginx01 live]# find /data/nginx/live/ -type f | wc -l
0
}}}

- a lot of space used the the cache partition


{{{
[root@conginx01 live]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  134G   72G  65% /data/nginx/live
}}}

- around 250K files in status deleted with lsof, a stable value


{{{
[root@conginx01 live]# date; lsof /data/nginx/live/ | grep deleted | wc -l
Wed Aug 30 14:39:41 CEST 2023
268690
[root@conginx01 live]# date; lsof /data/nginx/live/ | grep deleted | wc -l
Wed Aug 30 14:41:05 CEST 2023
268690
[root@conginx01 live]# date; lsof /data/nginx/live/ | grep deleted | wc -l
Wed Aug 30 14:42:21 CEST 2023
268690
}}}


Most of them are unique cache files, some have 2, 3, 4, 5 occurences:

see:


{{{
[root@conginx01 live]# cat log  | grep ""/data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206""                                                                                                                                             nginx   3277770 nginx 2087r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
nginx   3277774 nginx *637r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
nginx   3277776 nginx 7101r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
nginx   3277779 nginx *163r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
nginx   3277789 nginx   94r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
}}}


We are monitoring the disk usage with nagios, ""those files"" in the previous tests stayed there from 16:39 of yesterday until 10:36 until we started a new load test.
In the meanwhile no clients where connetcted to the nginx.
kill -USR1 doesn't cleanup the ""(deleted)"" files.
kill -HUP cleans them up immediately.

We are wondering if this behaviour is correct.

I'll put down here our cache configuration:



{{{
    open_file_cache_errors      on; 
    proxy_cache_path            /data/nginx/live
                                levels=1:1:2
                                keys_zone=cache-live:200m
                                inactive=10m
                                max_size=200g
                                min_free=10g
                                use_temp_path=off
                                manager_files=10000
                                manager_threshold=5000ms
                                manager_sleep=50ms
                                loader_files=10000
                                loader_threshold=1000ms;

    proxy_connect_timeout       2s; 
    proxy_send_timeout          2s;
    proxy_read_timeout          2s; 
    proxy_buffering             on; 
    proxy_cache_lock            on;     
    proxy_cache_lock_timeout    100ms; 
    proxy_cache_lock_age        50ms;   
    proxy_cache_key             $host$uri;
    proxy_cache_methods         GET HEAD POST;
    proxy_cache_use_stale       updating error timeout invalid_header http_500 http_502 http_503 http_504;
    proxy_cache_revalidate      on;
    proxy_cache_background_update       on;
    proxy_http_version          1.1;
    proxy_ignore_headers        X-Accel-Expires Expires Cache-Control;
    proxy_hide_header           Accept-Encoding;

    proxy_next_upstream         error timeout invalid_header http_403 http_404 http_502 http_503 http_504;
    proxy_set_header            Accept-Encoding none;
    proxy_set_header            Connection ""keep-alive"";
    proxy_set_header            Host       $host;
    proxy_set_header            X-Real-IP  $remote_addr;
    proxy_max_temp_file_size    0; # Get rid of [warn] 30083#30083: *511554 an upstream response is buffered to a temporary file /data/nginx_temp/0000054131 while reading upstream,
    proxy_buffers               64 8k; #http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_buffers
    gzip                        on;
    gzip_proxied                off;
}}}


Best Regards
Francesco

"	defect	new	major		nginx-core	1.24.x		cache expires file deletion disk space	p4rancesc0@…	"[root@conginx01 live]# uname -a
Linux conginx01.XXX.YYY 4.18.0-372.32.1.el8_6.x86_64 #1 SMP Thu Oct 27 15:18:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux"	"nginx version: nginx/1.24.0
built by gcc 8.5.0 20210514 (Red Hat 8.5.0-4) (GCC)
built with OpenSSL 1.1.1k  FIPS 25 Mar 2021
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib64/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -pie'"
