Opened 13 months ago
Closed 13 months ago
#2540 closed defect (invalid)
nginx stable 1.24 issues with cache file deletion under heavy load
Reported by: | Owned by: | ||
---|---|---|---|
Priority: | major | Milestone: | |
Component: | nginx-core | Version: | 1.24.x |
Keywords: | cache expires file deletion disk space | Cc: | p4rancesc0@… |
uname -a: |
[root@conginx01 live]# uname -a
Linux conginx01.XXX.YYY 4.18.0-372.32.1.el8_6.x86_64 #1 SMP Thu Oct 27 15:18:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
||
nginx -V: |
nginx version: nginx/1.24.0
built by gcc 8.5.0 20210514 (Red Hat 8.5.0-4) (GCC) built with OpenSSL 1.1.1k FIPS 25 Mar 2021 TLS SNI support enabled configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib64/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -pie' |
Description (last modified by )
Hi,
during some synthetic benchmarks on our nginx we faced a strange behaviour:
after the benchmark was stopped and the cache files validity expired we found a large ammount (>200K files) that appeared in lsof of cache directory while a find -type f of the same returned nothing. The space inside the directory was still in use.
Our test:
- using a custom client generate > 10Gbit/s using 900 connection on one nginx server, asking every 4 seconds new files that the backend will provide.
- each file is asked 2 times.
- /data/live/nginx is our caching directory
see nginx initial status:
[root@conginx01 live]# find /data/nginx/live/ -type f [root@conginx01 live]# lsof /data/nginx/live/ | grep deleted | wc -l 0 [root@conginx01 live]# df -h /data/nginx/live/ Filesystem Size Used Avail Use% Mounted on tmpfs 205G 0 205G 0% /data/nginx/live
- we start the benchmark for 10 minutes, we can easily saturate our 205GB RAM cache disk in few minutes (5 circa...)
- after te cache disk is full, in the meaning of it's arounf its size MINUS min_free we keep runnign 5 min.
see:
Filesystem Size Used Avail Use% Mounted on tmpfs 205G 196G 9.8G 96% /data/nginx/live Filesystem Size Used Avail Use% Mounted on tmpfs 205G 196G 10G 96% /data/nginx/live Filesystem Size Used Avail Use% Mounted on tmpfs 205G 196G 9.8G 96% /data/nginx/live Filesystem Size Used Avail Use% Mounted on tmpfs 205G 196G 10G 96% /data/nginx/live
- we stop the benchmark and we wait 10 minutes the cache to expire and to be deleted.
we measured during the 10 minutes test:
- average response duration: 0.003 seconds
- maximum response duration: 0.636 seconds
- a total of 2169600 requests
- a total of 41 errors on the client side
- zero 4xx,5xx
- 802022225512 bytes transferred, around 800GB
at this point this is the status:
[root@conginx01 live]# find /data/nginx/live/ -type f | wc -l 345645 [root@conginx01 live]# lsof /data/nginx/live/ | grep deleted | wc -l 359 [root@conginx01 live]# df -h /data/nginx/live/ Filesystem Size Used Avail Use% Mounted on tmpfs 205G 195G 11G 96% /data/nginx/live
At this point everything looks FINE, the isn't anymore any traffic on the nginx server.
after while we see:
- used space going down, OK
[root@conginx01 live]# df -h /data/nginx/live/ Filesystem Size Used Avail Use% Mounted on tmpfs 205G 143G 63G 70% /data/nginx/live
- cache file number decreasing, OK
[root@conginx01 live]# find /data/nginx/live/ -type f | wc -l 240997
- files in deleted status growing
[root@conginx01 live]# lsof /data/nginx/live/ | grep deleted | wc -l 11810
we still wait all cache items to expire:
finally we see:
- no more files in cache partition
[root@conginx01 live]# find /data/nginx/live/ -type f | wc -l 0
- a lot of space used the the cache partition
[root@conginx01 live]# df -h /data/nginx/live/ Filesystem Size Used Avail Use% Mounted on tmpfs 205G 134G 72G 65% /data/nginx/live
- around 250K files in status deleted with lsof, a stable value
[root@conginx01 live]# date; lsof /data/nginx/live/ | grep deleted | wc -l Wed Aug 30 14:39:41 CEST 2023 268690 [root@conginx01 live]# date; lsof /data/nginx/live/ | grep deleted | wc -l Wed Aug 30 14:41:05 CEST 2023 268690 [root@conginx01 live]# date; lsof /data/nginx/live/ | grep deleted | wc -l Wed Aug 30 14:42:21 CEST 2023 268690
Most of them are unique cache files, some have 2, 3, 4, 5 occurences:
see:
[root@conginx01 live]# cat log | grep "/data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206" nginx 3277770 nginx 2087r REG 0,47 47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted) nginx 3277774 nginx *637r REG 0,47 47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted) nginx 3277776 nginx 7101r REG 0,47 47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted) nginx 3277779 nginx *163r REG 0,47 47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted) nginx 3277789 nginx 94r REG 0,47 47685 88106191 /data/nginx/live/6/0/82/2f68c2d5bdd35176a5606507d0ee8206 (deleted)
We are monitoring the disk usage with nagios, "those files" in the previous tests stayed there from 16:39 of yesterday until 10:36 until we started a new load test.
In the meanwhile no clients where connetcted to the nginx.
kill -USR1 doesn't cleanup the "(deleted)" files.
kill -HUP cleans them up immediately.
We are wondering if this behaviour is correct.
I'll put down here our cache configuration:
open_file_cache_errors on; proxy_cache_path /data/nginx/live levels=1:1:2 keys_zone=cache-live:200m inactive=10m max_size=200g min_free=10g use_temp_path=off manager_files=10000 manager_threshold=5000ms manager_sleep=50ms loader_files=10000 loader_threshold=1000ms; proxy_connect_timeout 2s; proxy_send_timeout 2s; proxy_read_timeout 2s; proxy_buffering on; proxy_cache_lock on; proxy_cache_lock_timeout 100ms; proxy_cache_lock_age 50ms; proxy_cache_key $host$uri; proxy_cache_methods GET HEAD POST; proxy_cache_use_stale updating error timeout invalid_header http_500 http_502 http_503 http_504; proxy_cache_revalidate on; proxy_cache_background_update on; proxy_http_version 1.1; proxy_ignore_headers X-Accel-Expires Expires Cache-Control; proxy_hide_header Accept-Encoding; proxy_next_upstream error timeout invalid_header http_403 http_404 http_502 http_503 http_504; proxy_set_header Accept-Encoding none; proxy_set_header Connection "keep-alive"; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_max_temp_file_size 0; # Get rid of [warn] 30083#30083: *511554 an upstream response is buffered to a temporary file /data/nginx_temp/0000054131 while reading upstream, proxy_buffers 64 8k; #http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_buffers gzip on; gzip_proxied off;
Then in the correct location we include this one:
set $cache cache-live; proxy_cache_valid 400 401 403 405 2s; proxy_cache_valid 404 412 500 501 502 503 504 505 509 0s; proxy_cache_valid any 5m; open_file_cache_valid 10m; #LIVE open_file_cache max=200000 inactive=5m; #LIVE gzip off;
Best Regards
Francesco
Change History (7)
comment:1 by , 13 months ago
Description: | modified (diff) |
---|
comment:2 by , 13 months ago
Description: | modified (diff) |
---|
comment:3 by , 13 months ago
Description: | modified (diff) |
---|
comment:4 by , 13 months ago
comment:5 by , 13 months ago
Hi, thanks for the reply.
I can confirm that there is no traffic on nginx after the test is stopped.
I'll try to remove open_file_cache.
Then I'll be back.
BR
Francesco
comment:6 by , 13 months ago
Hi,
by removing those two directives:
open_file_cache_valid 10m; #LIVE open_file_cache max=200000 inactive=5m; #LIVE
out problems is fully resolved.
See, before the test:
[root@conginx01 ~]# lsof /data/nginx/live/ | grep deleted | wc -l 0 [root@conginx01 ~]# df -h /data/nginx/live/ Filesystem Size Used Avail Use% Mounted on tmpfs 205G 0 205G 0% /data/nginx/live [root@conginx01 ~]# find /data/nginx/live/ -type f | wc -l 0
after the benchmark is stopped there are no open files in the cache directory.
[root@conginx01 ~]# lsof /data/nginx/live/ [root@conginx01 ~]#
just after 10 minutes everything is empty.
[root@conginx01 ~]# find /data/nginx/live/ -type f | wc -l 0 [root@conginx01 ~]# lsof /data/nginx/live/ [root@conginx01 ~]# df -h /data/nginx/live Filesystem Size Used Avail Use% Mounted on tmpfs 205G 0 205G 0% /data/nginx/live [root@conginx01 ~]#
During those 10 minutes you can see cache manager freeing space.
I think you can close this ticket, this behaviuor is actually what we expected.
Best Regards.
Francesco
comment:7 by , 13 months ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
Thanks for confirming, closing this.
There is no any traffic on nginx after the test stopped, correct?
What you are seeing looks like a result of
open_file_cache
caching open file descriptors. While it is configured to close descriptors after 5 minutes of inactivity (inactive=5m
), closing inactive file descriptors only happens when the open file cache is actually accessed (more specifically, open file cache closes up to 2 inactive file descriptors when a request which used open file cache is finalized). As such, in your particular scenario it keeps file descriptors open till the worker process is shut down.Consider re-testing without
open_file_cache
configured, there should be no files left open. Alternatively, try requesting some files after the test, to let open file cache to close inactive file descriptors.