Opened 11 months ago

Closed 11 months ago

Last modified 11 months ago

#2557 closed defect (invalid)

Odd STALE reponse under long run stability test

Reported by: p4rancesc0@… Owned by:
Priority: minor Milestone:
Component: nginx-core Version: 1.24.x
Keywords: stale cache Cc: p4rancesc0@…
uname -a: Linux nginx01.xx.yy 4.18.0-372.32.1.el8_6.x86_64 #1 SMP Thu Oct 27 15:18:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.24.0
built by gcc 8.5.0 20210514 (Red Hat 8.5.0-4) (GCC)
built with OpenSSL 1.1.1k FIPS 25 Mar 2021
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib64/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -pie'

Description

Hi,

in a test enviroment the run some 12h+ tests to verify an odd behaviour we noticed in production.

A couple of scritps were asking the same file with a ramdom sleep beetween few seconds and 65 seconds.

Nginx is acting as reverse proxy and caching server.

The file from the real server is always coming from the the same real server, in the worst case the difference between Date and Last-Modified will be 6 seconds on the real server.

Nginx has STALE activated.
Nginx cache is configure to cache "any" for 1 second.
We are expecting on nginx a difference between Date and Last-Modified of about 7/8 seconds, maximum 9 seconds.
Are we are wrong ?

Here's the output of the scripts, combined.

Date
Cache Status
Difference between Date and Last-Modified
sleep for xxx milliseconds

--------
Fri Oct 27 00:28:26 CEST 2023
MISS
2
sleep for 81137 milliseconds
--------
Fri Oct 27 00:28:56 CEST 2023
MISS
2
sleep for 60115 milliseconds
--------
Fri Oct 27 00:29:47 CEST 2023
MISS
3
sleep for 60890 milliseconds
--------
	Fri Oct 27 00:29:57 CEST 2023
	STALE
	13
	sleep for 60968 milliseconds
	--------
	Fri Oct 27 00:30:48 CEST 2023
	MISS
	2
	sleep for 26001 milliseconds
	--------
	Fri Oct 27 00:30:58 CEST 2023
	STALE
	12
	sleep for 61950 milliseconds
--------
Fri Oct 27 00:31:14 CEST 2023
MISS
1
sleep for 58485 milliseconds
--------
Fri Oct 27 00:32:00 CEST 2023
MISS
1
sleep for 59154 milliseconds
--------
Fri Oct 27 00:32:13 CEST 2023
MISS
3
sleep for 27922 milliseconds

You can spot a couple of STALE 12 and 13 seconds old.

logs are attached...

here's the proxy cache configuration:

    proxy_connect_timeout       2s; #proxy_connect_timeout       10s; 
    proxy_send_timeout          2s; #proxy_send_timeout          10s; 
    proxy_read_timeout          2s; #proxy_read_timeout          5s; 
    proxy_buffering             on; #TODO TO TEST
    proxy_cache_lock            on;     #CACHE LOCK
    proxy_cache_lock_timeout    100ms;  #CACHE LOCK, old value 1s
    proxy_cache_lock_age        50ms;   #CACHE LOCK, old value default = 5s
    proxy_cache_key             $proxy_host$uri;
    proxy_cache_methods         GET HEAD POST;
    proxy_cache_use_stale       updating error timeout invalid_header http_500 http_502 http_503 http_504;  # http_404; No 404 stale response
    proxy_cache_revalidate      on;
    proxy_cache_background_update       on;
    proxy_http_version          1.1;
    proxy_ignore_headers        X-Accel-Expires Expires Cache-Control;
    proxy_hide_header           Accept-Encoding;
    # TODO to be discussed
    proxy_next_upstream         error timeout invalid_header http_403 http_404 http_502 http_503 http_504;
    proxy_set_header            Accept-Encoding none;
    proxy_set_header            Connection "keep-alive";
    proxy_set_header            Host       $host;
    proxy_set_header            X-Real-IP  $remote_addr;


    proxy_cache_path            /data/nginx/manifest
                                levels=1:1
                                keys_zone=cache-manifest:1m
                                inactive=1s
                                max_size=100m
                                min_free=50m
                                use_temp_path=off
                                manager_files=10000
                                manager_threshold=1000ms
                                manager_sleep=50ms
                                loader_files=10000
                                loader_threshold=1000ms;

and here's the cache configuration in the correct location:

            set $cache cache-manifest;
            proxy_pass          http://usp-live-preprod;
            proxy_cache_valid   any 1s;
            add_header X-Proxy-Cache $upstream_cache_status;
            add_header Access-Control-Allow-Origin *;  # Can restrict this to specific origins.
            add_header 'Access-Control-Expose-Headers' 'Date,X-CDN' always;
            add_header 'Nginx-Cache-Name' $cache always;

Why are we experiencing STALE response instead of a MISS sometimes ?
Here's some reaults:

2 11
4 10
7 9

14 8
15 5
15 7
18 6
82 4

194 0
328 2
334 3
337 1

2 response where 11 seconds old
4 responses where 10 seconds old
7 responses where 9 seconds old

you can see that this is happening just sometimes...

BR
Francesco

Attachments (1)

logs.txt (6.6 KB ) - added by p4rancesc0@… 11 months ago.
acces, error and origin logs

Download all attachments as: .zip

Change History (4)

by p4rancesc0@…, 11 months ago

Attachment: logs.txt added

acces, error and origin logs

comment:1 by Maxim Dounin, 11 months ago

Resolution: invalid
Status: newclosed

You can spot a couple of STALE 12 and 13 seconds old.

The proxy_cache_use_stale updating; combined with proxy_cache_background_update on; in your configuration implies that any stale response present in nginx cache will be returned to the client, and nginx will try to update the cache item with a background subrequest, see docs for details.

This behaviour is consistent with what you are seeing:

--------
Fri Oct 27 00:29:47 CEST 2023
MISS
3
sleep for 60890 milliseconds
--------
	Fri Oct 27 00:29:57 CEST 2023
	STALE
	13
	sleep for 60968 milliseconds
	--------

Note that previous response, MISS 3, caches the response. 10 seconds later response is still in the cache, and nginx returns it to the client as per the configuration, hence STALE 13.

If you don't want nginx to return stale responses to clients, consider changing your configuration accordingly.

comment:2 by p4rancesc0@…, 11 months ago

Thanks Maxim, it makes absolulely sense.

We anyway some more questions...

1) how can the "MISS 3" be present in the cache after 10 seconds if the cache is configured with inactive=1s and proxy_cache_valid any 1s ?

2) why STALE is happening just (2+4+7) times in the tests while all the other (14+15+15+18+82+194+328+334+337) times it's just a MISS ?

It's not easy to reproduce, it just happens sometimes...

For instace:

  • GET -> MISS
  • wait 10seconds
  • GET -> MISS
  • wait 10seconds

can goes on for hours before the first STALE....

If you have the patience to give us a follow up it would be really great.

BR
Francesco

in reply to:  2 comment:3 by Maxim Dounin, 11 months ago

Replying to p4rancesc0@…:

1) how can the "MISS 3" be present in the cache after 10 seconds if the cache is configured with inactive=1s and proxy_cache_valid any 1s ?

The validity of the cache items, as configured with proxy_cache_valid, does not affect presence of cache items in the cache: an expired (stale) cache item can be still present in the cache for a long time as long as it is not yet removed from the cache, and that's basically how proxy_cache_use_stale works.

The inactive= parameter of the proxy_cache_path indeed does specify how cache items are removed from the cache, and assuming inactive=1s is followed strictly the item shouldn't be in the cache after 10 seconds since the last request. The cache item might be removed at some time later though - for example, if the cache manager is busy doing other things. In particular, the cache manager process can sleep for up to 10 seconds if it has nothing to do - and probably this is what happens in your case.

2) why STALE is happening just (2+4+7) times in the tests while all the other (14+15+15+18+82+194+328+334+337) times it's just a MISS ?

Since you have inactive=1s in your configuration, in most cases stale responses are removed from the cache before they are used.

Note: See TracTickets for help on using tickets.