Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#478 closed defect (wontfix)

open_file_cache doesn't invalidate cache entries when it should

Reported by: Ilari Stenroth
Owned by:
Priority: minor
Milestone:
Component: nginx-core
Version: 1.4.x
Keywords: cache
Cc:
uname -a: Linux *** 2.6.32-358.18.1.el6.x86_64 #1 SMP Fri Aug 2 17:04:38 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.4.3
built by gcc 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC)
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-http_ssl_module --with-http_realip_module --with-http_addition_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_random_index_module --with-http_secure_link_module --with-http_stub_status_module --with-mail --with-mail_ssl_module --with-file-aio --with-ipv6 --with-cc-opt='-O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic'

Description

We have an XML file that is requested frequently (several times per second) from an NFS mount. The file is replaced rarely (about once every two weeks). When the file is replaced on the NFS mount, Nginx starts serving an empty response body and the error log shows "Stale file handle" errors. The error persists until Nginx is restarted or reloaded. There are no issues with the NFS mount itself and no NFS-related errors in the kernel log. As I see it, there are two distinct bugs (or missing features) in open_file_cache:
1) "open_file_cache_valid 30s" is not respected: more than 10 minutes had passed when we noticed the error, and Nginx still had a stale file handle.
2) Nginx should invalidate a cache entry as soon as it first encounters a "Stale file handle" error; in our experience it does not.

We had to disable the open_file_cache feature because of this bug.

Configuration:

    open_file_cache max=4096 inactive=30s;
    open_file_cache_errors off;
    open_file_cache_min_uses 2;
    open_file_cache_valid 30s;

Error log:

2014/01/02 15:34:33 [alert] 24098#0: *37403021 sendfile() failed (116: Stale file handle) while sending response to client, client: ***, server: ***, request: "GET /file/***/***.xml HTTP/1.1", host: "***", referrer: "http://***/***/***.html"

NFS mount flags:

Flags:	ro,noatime,nodiratime,vers=3,rsize=32768,wsize=32768,namlen=255,soft,nolock,proto=tcp,timeo=600,retrans=4,sec=sys,mountaddr=10.40.197.1,mountvers=3,mountport=1234,mountproto=udp,local_lock=all,addr=10.40.197.1

Change History (3)

comment:1 by Maxim Dounin, 7 years ago

Resolution: wontfix
Status: new → closed

The open_file_cache subsystem caches file descriptors for a specified time, and after this time it checks the file's inode number. If the inode number has changed, the file is reopened. What you observe suggests that the file's inode number doesn't change for some reason. This isn't really a surprise, though, given that NFS is far from normal Unix filesystem semantics. Not using open_file_cache on NFS looks like the correct solution.

Please also note that it's not recommended to serve files from NFS mounts. File operations are blocking, and serving files from NFS mounts may result in nginx workers being blocked for a long time on NFS operations. Instead, using a proxy to another instance of nginx on a destination server is recommended.
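
The proxy setup recommended above could look roughly like the following sketch. The host name "fileserver.example.internal" and the path "/srv/files" are placeholders that do not come from this ticket, and the /file/ prefix is taken from the request line in the error log; adjust all of them to the actual deployment.

```nginx
# Front-end server: forward requests to the file server over HTTP
# instead of reading the files through the NFS mount.
# "fileserver.example.internal" is a hypothetical host name.
server {
    listen 80;

    location /file/ {
        proxy_pass http://fileserver.example.internal;
    }
}

# Destination server (the machine that holds the files on local disk).
# "/srv/files" is a hypothetical path; a request for /file/a.xml
# is served from /srv/files/file/a.xml.
server {
    listen 80;

    location /file/ {
        root /srv/files;
    }
}
```

With this layout the blocking file operations happen on a local filesystem on the destination server, and the front-end worker only waits on a network connection, which nginx handles without blocking.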

comment:2 by Ilari Stenroth, 7 years ago

OK, I understand the reasoning for using a proxy instead of an NFS mount. But it doesn't solve the stale file handle issue. The same error can occur on a proxy server and, depending on the front-end server's cache configuration, it will sooner or later start serving an empty response body. Is my suggestion to flush a cache entry when this error occurs not viable?

comment:3 by Maxim Dounin, 7 years ago

This error isn't expected to occur on filesystems with normal Unix semantics - removed files are still available via previously opened file descriptors, see unlink(2):

If one or more processes have the file open when the last link is removed, the link shall be removed before unlink() returns, but the removal of the file contents shall be postponed until all references to the file are closed.

Given the NFS-specific nature of the error and the complexity of providing feedback from errors that happen while reading files back to open_file_cache, it is hardly worth the effort. If you can't avoid using NFS, not using open_file_cache on NFS looks like a simple and correct solution.
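
If other content on the same server is stored locally, the cache does not have to be disabled globally. A minimal sketch, assuming the NFS-backed files are served from a /file/ location (the prefix seen in the error log; the real layout may differ), is to turn the cache off only there:

```nginx
http {
    # Cache settings for locally stored files (values from the report).
    open_file_cache          max=4096 inactive=30s;
    open_file_cache_errors   off;
    open_file_cache_min_uses 2;
    open_file_cache_valid    30s;

    server {
        listen 80;

        # Assumed location for files that live on the NFS mount:
        # with the cache off, every request opens the file anew,
        # so no descriptor is kept across requests.
        location /file/ {
            open_file_cache off;
        }
    }
}
```

The open_file_cache directive is valid in the http, server, and location contexts, so the inner setting overrides the inherited one for that location only.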
