Opened 8 months ago

Closed 8 months ago

#2540 closed defect (invalid)

nginx stable 1.24 issues with cache file deletion under heavy load

Reported by: p4rancesc0@… Owned by:
Priority: major Milestone:
Component: nginx-core Version: 1.24.x
Keywords: cache expires file deletion disk space Cc: p4rancesc0@…
uname -a: [root@conginx01 live]# uname -a
Linux conginx01.XXX.YYY 4.18.0-372.32.1.el8_6.x86_64 #1 SMP Thu Oct 27 15:18:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.24.0
built by gcc 8.5.0 20210514 (Red Hat 8.5.0-4) (GCC)
built with OpenSSL 1.1.1k FIPS 25 Mar 2021
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib64/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -pie'

Description (last modified by p4rancesc0@…)

Hi,

during some synthetic benchmarks on our nginx we observed a strange behaviour:
after the benchmark was stopped and the cached files' validity had expired, we found a large amount of files (>200K) still shown by lsof on the cache directory, while a find -type f of the same directory returned nothing. The space inside the directory was still in use.

Our test:

  • using a custom client we generate >10 Gbit/s over 900 connections to one nginx server, asking every 4 seconds for new files that the backend will provide (a rough sketch of the client loop follows this list).
  • each file is requested 2 times.
  • /data/nginx/live is our caching directory
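Roughly, the client behaves like the sketch below (the real load generator is a custom tool; the URL pattern here is purely illustrative):

    #!/bin/bash
    # Illustrative sketch only -- the real client is a custom tool.
    # Every 4 seconds ask for a fresh set of files (new names each round,
    # so the backend provides them), and request each file twice so the
    # second hit is served from the cache.
    while true; do
        ts=$(date +%s)                # fresh names every iteration
        for i in $(seq 1 900); do     # roughly one request per connection
            curl -s -o /dev/null "http://conginx01/live/seg-$ts-$i" &
            curl -s -o /dev/null "http://conginx01/live/seg-$ts-$i" &
        done
        wait
        sleep 4
    done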

See the initial nginx status:

[root@conginx01 live]# find /data/nginx/live/ -type f
[root@conginx01 live]# lsof /data/nginx/live/ | grep deleted | wc -l
0
[root@conginx01 live]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G     0  205G   0% /data/nginx/live
  • we start the benchmark for 10 minutes; we can easily saturate our 205 GB RAM cache disk in a few minutes (around 5).
  • once the cache disk is full, in the sense that usage is around its size minus min_free, we keep running for 5 more minutes.

see:

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  196G  9.8G  96% /data/nginx/live

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  196G   10G  96% /data/nginx/live

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  196G  9.8G  96% /data/nginx/live


Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  196G   10G  96% /data/nginx/live
  • we stop the benchmark and wait 10 minutes for the cache to expire and be deleted.

During the 10-minute test we measured:

  • average response duration: 0.003 seconds
  • maximum response duration: 0.636 seconds
  • a total of 2169600 requests
  • a total of 41 errors on the client side
  • zero 4xx,5xx
  • 802022225512 bytes transferred, around 800GB

At this point this is the status:

[root@conginx01 live]# find /data/nginx/live/ -type f | wc -l
345645
[root@conginx01 live]# lsof /data/nginx/live/ | grep deleted | wc -l
359
[root@conginx01 live]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  195G   11G  96% /data/nginx/live

At this point everything looks FINE; there is no longer any traffic on the nginx server.

After a while we see:

  • used space going down, OK
[root@conginx01 live]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  143G   63G  70% /data/nginx/live
  • cache file number decreasing, OK
[root@conginx01 live]# find /data/nginx/live/ -type f | wc -l
240997
  • files in deleted status growing
[root@conginx01 live]# lsof /data/nginx/live/ | grep deleted | wc -l
11810

We keep waiting for all the cache items to expire:

Finally we see:

  • no more files in cache partition
[root@conginx01 live]# find /data/nginx/live/ -type f | wc -l
0
  • a lot of space still used in the cache partition
[root@conginx01 live]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G  134G   72G  65% /data/nginx/live
  • around 250K files in deleted status according to lsof, a stable value
[root@conginx01 live]# date; lsof /data/nginx/live/ | grep deleted | wc -l
Wed Aug 30 14:39:41 CEST 2023
268690
[root@conginx01 live]# date; lsof /data/nginx/live/ | grep deleted | wc -l
Wed Aug 30 14:41:05 CEST 2023
268690
[root@conginx01 live]# date; lsof /data/nginx/live/ | grep deleted | wc -l
Wed Aug 30 14:42:21 CEST 2023
268690

Most of them are unique cache files; some have 2, 3, 4, 5 occurrences:

see:

[root@conginx01 live]# cat log | grep "/data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206"
nginx   3277770 nginx 2087r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
nginx   3277774 nginx *637r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
nginx   3277776 nginx 7101r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
nginx   3277779 nginx *163r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
nginx   3277789 nginx   94r   REG   0,47    47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
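(We counted the occurrences per file with something along these lines, with the lsof output saved in the file "log"; the awk field math matches the output format above:)

    # Count how many worker fds still reference each deleted cache file.
    # In the lsof output the path is the second-to-last field and
    # "(deleted)" is the last one.
    awk '/deleted/ {print $(NF-1)}' log | sort | uniq -c | sort -rn | head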

We are monitoring the disk usage with Nagios; in the previous tests "those files" stayed there from 16:39 yesterday until 10:36 today, when we started a new load test.
In the meantime no clients were connected to nginx.
kill -USR1 doesn't cleanup the "(deleted)" files.
kill -HUP cleans them up immediately.
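For reference, the check boils down to this (pid file path as set by our build's --pid-path):

    # count cache files that are unlinked but still held open by workers
    lsof /data/nginx/live/ | grep -c deleted
    # USR1 (reopen logs) leaves them in place; HUP (reconfigure) shuts
    # down the old workers, which close their descriptors
    kill -HUP "$(cat /var/run/nginx.pid)"
    lsof /data/nginx/live/ | grep -c deleted   # back to 0 after reload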

We are wondering if this behaviour is correct.

I'll put our cache configuration down here:

    open_file_cache_errors      on; 
    proxy_cache_path            /data/nginx/live
                                levels=1:1:2
                                keys_zone=cache-live:200m
                                inactive=10m
                                max_size=200g
                                min_free=10g
                                use_temp_path=off
                                manager_files=10000
                                manager_threshold=5000ms
                                manager_sleep=50ms
                                loader_files=10000
                                loader_threshold=1000ms;

    proxy_connect_timeout       2s; 
    proxy_send_timeout          2s;
    proxy_read_timeout          2s; 
    proxy_buffering             on; 
    proxy_cache_lock            on;     
    proxy_cache_lock_timeout    100ms; 
    proxy_cache_lock_age        50ms;   
    proxy_cache_key             $host$uri;
    proxy_cache_methods         GET HEAD POST;
    proxy_cache_use_stale       updating error timeout invalid_header http_500 http_502 http_503 http_504;
    proxy_cache_revalidate      on;
    proxy_cache_background_update       on;
    proxy_http_version          1.1;
    proxy_ignore_headers        X-Accel-Expires Expires Cache-Control;
    proxy_hide_header           Accept-Encoding;

    proxy_next_upstream         error timeout invalid_header http_403 http_404 http_502 http_503 http_504;
    proxy_set_header            Accept-Encoding none;
    proxy_set_header            Connection "keep-alive";
    proxy_set_header            Host       $host;
    proxy_set_header            X-Real-IP  $remote_addr;
    proxy_max_temp_file_size    0; # Get rid of [warn] 30083#30083: *511554 an upstream response is buffered to a temporary file /data/nginx_temp/0000054131 while reading upstream,
    proxy_buffers               64 8k; #http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_buffers
    gzip                        on;
    gzip_proxied                off;

Then, in the relevant location, we include this:

   set $cache cache-live;
   proxy_cache_valid 400 401 403 405 2s;
   proxy_cache_valid 404 412 500 501 502 503 504 505 509 0s;
   proxy_cache_valid any 5m;
   open_file_cache_valid   10m; #LIVE
   open_file_cache         max=200000 inactive=5m; #LIVE
   gzip              off;
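(For context: the enclosing location, which is not shown here, uses the $cache variable roughly as below; the location path and upstream name are placeholders, not our real configuration.)

    location /live/ {
        proxy_pass  http://backend;   # placeholder upstream
        proxy_cache $cache;           # resolves to the "cache-live" zone
    }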

Best Regards
Francesco

Change History (7)

comment:1 by p4rancesc0@…, 8 months ago

Description: modified (diff)

comment:2 by p4rancesc0@…, 8 months ago

Description: modified (diff)

comment:3 by p4rancesc0@…, 8 months ago

Description: modified (diff)

comment:4 by Maxim Dounin, 8 months ago

There is no traffic on nginx after the test stopped, correct?

What you are seeing looks like a result of open_file_cache caching open file descriptors. While it is configured to close descriptors after 5 minutes of inactivity (inactive=5m), closing inactive file descriptors only happens when the open file cache is actually accessed (more specifically, the open file cache closes up to 2 inactive file descriptors when a request which used the open file cache is finalized). As such, in your particular scenario it keeps file descriptors open till the worker process is shut down.

Consider re-testing without open_file_cache configured; there should be no files left open. Alternatively, try requesting some files after the test, to let the open file cache close inactive file descriptors.
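(For example, something as simple as the loop below would exercise the open file cache after the test; the URL is illustrative, any request served through the cache will do:)

    # Each finalized request that used the open file cache lets nginx
    # close up to 2 inactive cached descriptors, so a burst of requests
    # gradually drains the leftover "(deleted)" files.
    for i in $(seq 1 100000); do
        curl -s -o /dev/null "http://conginx01/live/any-cached-path"
    done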

comment:5 by p4rancesc0@…, 8 months ago

Hi, thanks for the reply.

I can confirm that there is no traffic on nginx after the test is stopped.

I'll try to remove open_file_cache.
Then I'll be back.

BR
Francesco

Last edited 8 months ago by p4rancesc0@… (previous) (diff)

comment:6 by p4rancesc0@…, 8 months ago

Hi,

by removing those two directives:

   open_file_cache_valid   10m; #LIVE
   open_file_cache         max=200000 inactive=5m; #LIVE

our problem is fully resolved.

See, before the test:

[root@conginx01 ~]# lsof /data/nginx/live/ | grep deleted | wc -l
0
[root@conginx01 ~]# df -h /data/nginx/live/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G     0  205G   0% /data/nginx/live
[root@conginx01 ~]# find /data/nginx/live/ -type f | wc -l
0

After the benchmark is stopped, there are no files left open in the cache directory.

[root@conginx01 ~]# lsof /data/nginx/live/
[root@conginx01 ~]#


Just 10 minutes later everything is empty.

[root@conginx01 ~]# find /data/nginx/live/ -type f | wc -l
0
[root@conginx01 ~]# lsof /data/nginx/live/
[root@conginx01 ~]# df -h /data/nginx/live
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           205G     0  205G   0% /data/nginx/live
[root@conginx01 ~]#

During those 10 minutes you can see the cache manager freeing space.

I think you can close this ticket; this behaviour is actually what we expected.

Best Regards.

Francesco

comment:7 by Maxim Dounin, 8 months ago

Resolution: invalid
Status: new → closed

Thanks for confirming, closing this.
