Opened 4 years ago

Last modified 4 years ago

#2182 new enhancement

Nginx doesn't delete temp cache files after a crash

Reported by: ifel@… Owned by:
Priority: minor Milestone:
Component: nginx-module Version: 1.14.x
Keywords: Cc:
uname -a: Linux cc001.eag1.facebook.com 5.2.9-228_fbk15_4185_g357f49b36602 #1 SMP Mon Aug 10 09:34:08 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.14.1
built by gcc 8.2.1 20180905 (Red Hat 8.2.1-3) (GCC)
built with OpenSSL 1.1.1 FIPS 11 Sep 2018 (running with OpenSSL 1.1.1g FIPS 21 Apr 2020)
TLS SNI support enabled
configure arguments: --prefix=/usr/share/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib64/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --http-client-body-temp-path=/var/lib/nginx/tmp/client_body --http-proxy-temp-path=/var/lib/nginx/tmp/proxy --http-fastcgi-temp-path=/var/lib/nginx/tmp/fastcgi --http-uwsgi-temp-path=/var/lib/nginx/tmp/uwsgi --http-scgi-temp-path=/var/lib/nginx/tmp/scgi --pid-path=/run/nginx.pid --lock-path=/run/lock/subsys/nginx --user=nginx --group=nginx --with-file-aio --with-ipv6 --with-http_ssl_module --with-http_v2_module --with-http_realip_module --with-http_addition_module --with-http_xslt_module=dynamic --with-http_image_filter_module=dynamic --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_random_index_module --with-http_secure_link_module --with-http_degradation_module --with-http_slice_module --with-http_stub_status_module --with-http_perl_module=dynamic --with-http_auth_request_module --with-mail=dynamic --with-mail_ssl_module --with-pcre --with-pcre-jit --with-stream=dynamic --with-stream_ssl_module --with-debug --with-cc-opt='-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -Wl,-E'

Description

When Nginx cache module is downloading an object from an upstream, it saves it into a temp file, and once the download is finished, Nginx renames the temp file to a cache one. Nginx does not track these temp files in any persistent storage, nor Cache manager detects them. Thus, if Nginx crashes or gets killed when it's downloading objects, these temp files stays on disk and nothing cleans them up. Since most of modern linux uses systemd, it's a default behavior, if after sending TERM to a process, it does not finish within 60 sec, systemd sends KILL.

There should be some housekeeping mechanism cleaning these stale files up.

Example of these files, it was captured on May 12:
-rw------- 1 nginx nginx 52035584 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935649
-rw------- 1 nginx nginx 52117504 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935663
-rw------- 1 nginx nginx 52592640 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935679
-rw------- 1 nginx nginx 52248576 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935687
-rw------- 1 nginx nginx 41664512 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935688
-rw------- 1 nginx nginx 45858816 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935701
-rw------- 1 nginx nginx 41746432 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935705
-rw------- 1 nginx nginx 42139648 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935708
-rw------- 1 nginx nginx 37421056 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935714
-rw------- 1 nginx nginx 36159488 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935720
-rw------- 1 nginx nginx 36683776 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935730
-rw------- 1 nginx nginx 36847616 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935734
-rw------- 1 nginx nginx 40304640 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935739
-rw------- 1 nginx nginx 24592384 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935743
-rw------- 1 nginx nginx 27017216 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935746
-rw------- 1 nginx nginx 30097408 May 10 12:06 d43aaa174b966c213a8b65a8d53e8c01.0000935754

Change History (3)

comment:1 by Maxim Dounin, 4 years ago

Priority: majorminor
Type: defectenhancement

While it would be nice to have something to remove stale temporary cache files, it is generally not possible to do this automatically, since these files might be actually used by running nginx processes. Current recommendation is to remove such files manually if needed.

Note that it is generally a bad idea to abruptly stop nginx. Instead, consider using configuration reload and binary upgrade. And, of course, if nginx crashes, you should investigate why it does so, and resolve the underlying issue.

comment:2 by ifel@…, 4 years ago

I understand that it's better to use reload. But there are some cases when it needs to be stopped. Normally, this is done by a supervisor, on modern OS it's systemd, and it's default behavior of systemd - run a command specified in ExecStop=, otherwise send SIGTERM, wait for TimeoutStopSec= seconds, default 90s, and if the processes are still running send SIGKILL.

So, if the process does not stop within this timeout it will be killed. Of course it's possible to increase the timeout, but it does not guarantee it will work in all the cases, and it will not prevent from appearing stale files in other situations like bad/buggy supervisor, or OOM killer.

I don't know if the cache manager has access to processes, but in theory it can be possible to have the threads store file names they are working with in a shared memory and have the cache manager reading ctime/mtime of the files to avoid race condition, something like this:

cur_time = now()
active_temp_fns = read_fns_from_memory()
fns_on_disk = read_fn_from_disk()
for f in fns_on_disk:

if f in active_temp_fns:

continue

if stat(f).mtime < cur_time:

rm(f)

This would've made it cleaning after itself despite any external processes/users killing the process.

comment:3 by Maxim Dounin, 4 years ago

I don't know if the cache manager has access to processes, but in theory it can be possible to have the threads store file names they are working with in a shared memory

This is not going to work, as during binary upgrade there is no shared memory between old and new processes. Not to mention the overhead this approach implies.

Note: See TracTickets for help on using tickets.