Opened 5 weeks ago

Closed 4 weeks ago

#2384 closed defect (invalid)

http upstream cache and slicing leaks open files

Reported by: Thibault VINCENT Owned by:
Priority: minor Milestone:
Component: nginx-core Version: 1.23.x
Keywords: Cc:
uname -a: Linux 5.19.4-200.fc36.x86_64
nginx -V: nginx version: nginx/1.23.1
built by gcc 12.2.1 20220819 (Red Hat 12.2.1-1) (GCC)
configure arguments: --without-http_rewrite_module --without-http_gzip_module --with-http_slice_module

Description

Using proxy_cache_path and proxy_cache with slice is causing a file descriptor exhaustion while serving subrequests, which leads to a worker crash.

Test configuration:

proxy_cache_path /mnt/cache levels=1:2 keys_zone=zone:100m
    max_size=500g min_free=10g inactive=4h use_temp_path=off;

server {
    listen 80 default_server reuseport backlog=4096;

    location / {
        add_header Cache-Control "public, no-transform";
        expires 1d;

        # low to trigger bug
        slice 1k;

        proxy_cache zone;
        proxy_cache_key $uri$slice_range;
        proxy_set_header Range $slice_range;

        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_pass_request_body off;
        proxy_pass http://192.168.123.123;
    }
}

Running with a limit of 1000 open files, this proxy will crash when serving the 500th slice or so. It does not matter whether every slice is a cache miss or a cache hit.
I guess the root cause may be cache file descriptors left open till the main request is cleaned up.
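A rough back-of-the-envelope check of these numbers, as a sketch: the slice count is the object size divided by the slice size, rounded up, and if every slice subrequest keeps its descriptors until the main request is finalized, the number of open files grows linearly with it. The figure of two descriptors per slice is an assumption inferred only from the failure showing up near the 500th slice under a 1000-file limit, and the helper names are hypothetical.

```python
import math

def slice_count(file_size: int, slice_size: int) -> int:
    """Number of slice subrequests needed to cover the whole object."""
    return math.ceil(file_size / slice_size)

def held_descriptors(slices_served: int, fds_per_slice: int = 2, base_fds: int = 0) -> int:
    """Descriptors still open if no slice releases its files before the main
    request is finalized. fds_per_slice=2 is an assumed, not measured, value."""
    return base_fds + slices_served * fds_per_slice

def max_slices_before_exhaustion(nofile_limit: int, fds_per_slice: int = 2,
                                 base_fds: int = 0) -> int:
    """Last slice that still fits under the open-files limit."""
    return (nofile_limit - base_fds) // fds_per_slice

# A hypothetical 1 MiB object served with `slice 1k` needs 1024 subrequests.
print(slice_count(1024 * 1024, 1024))      # prints 1024
print(max_slices_before_exhaustion(1000))  # prints 500
```

In a real worker, `base_fds` would also count listening sockets, the client connection, log files and so on, which lowers the ceiling further.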

[crit] 65993#0: *1 open() "/mnt/cache/f/5c/4cca8c6cb531b0aaa5a2fe16f43d45cf.0000000996" failed (24: Too many open files) while reading upstream, client: 127.0.0.1, server: server, request: "GET /test HTTP/1.1", subrequest: "/test", upstream: "http://192.168.123.123/test", host: "127.0.0.1:80"

Is it a design limitation of the slice module, or could it be improved?

Thanks

Change History (3)

in reply to:  description ; comment:1 by Maxim Dounin, 5 weeks ago

Replying to Thibault VINCENT:

Using proxy_cache_path and proxy_cache with slice is causing a file descriptor exhaustion while serving subrequests, which leads to a worker crash.

Could you please clarify what you mean by "leads to a worker crash"? Once file descriptors are exhausted, nginx will report appropriate errors and continue to work on other requests.

If you instead see a worker process crash, that is, if a worker exits abnormally, for example, due to a segmentation fault, this might be something to further dig into.

I guess the root cause may be cache file descriptors left open till the main request is cleaned up.

Exactly.

Is it a design limitation of the slice module, or could it be improved?

The slice module creates a subrequest for each slice. Each subrequest allocates resources, including file descriptors, which are normally freed with the main request. While theoretically it is possible to free some resources earlier, this requires noticeable additional effort for each resource and is not something usually done.

In general, this is something to consider while configuring the slice module. Too small a slice size can easily lead to significant resource usage, eliminating the advantages of using the slice module.
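To make the trade-off concrete, a small sketch that inverts the arithmetic: given an object size and how many descriptors a single request may hold, it returns the smallest slice size that stays within that budget. The function name, the two-descriptors-per-slice figure and the 1 MiB rounding are illustrative assumptions, not nginx behaviour.

```python
import math

def min_slice_size(file_size: int, fd_budget: int, fds_per_slice: int = 2,
                   granularity: int = 1024 * 1024) -> int:
    """Smallest slice size, rounded up to `granularity`, such that serving the
    whole object in one request keeps at most `fd_budget` descriptors open.

    Assumes (hypothetically) that every slice subrequest holds `fds_per_slice`
    descriptors until the main request is finalized.
    """
    max_slices = fd_budget // fds_per_slice
    if max_slices < 1:
        raise ValueError("fd_budget is too small for even one slice")
    raw = math.ceil(file_size / max_slices)          # bytes per slice, unrounded
    return math.ceil(raw / granularity) * granularity

size = min_slice_size(100 * 1024 * 1024, fd_budget=100)
print(size)  # prints 2097152, i.e. a 2 MiB slice
```

The budget applies per main request: with many clients downloading concurrently, the worker's open-files limit (for example, as raised with `worker_rlimit_nofile`) has to cover the sum over all of them.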

in reply to:  1 comment:2 by Thibault VINCENT, 4 weeks ago

Replying to Maxim Dounin:

Could you please clarify what you mean by "leads to a worker crash"? Once file descriptors are exhausted, nginx will report appropriate errors and continue to work on other requests.

Apologies, I was overwhelmed by debug logs and may have over-interpreted the presence of process reaping messages:

2022/08/30 11:54:10 [crit] 88242#0: *1 open() "/mnt/cache/3/7b/759e4ec9ae96efa161ca49ceed2c97b3.0000000996" failed (24: Too many open files) while reading upstream, client: 127.0.0.1, server: server, request: "GET /test HTTP/1.1", subrequest: "/test", upstream: "http://192.168.123.123:80/test", host: "127.0.0.1:80"
[...]
2022/08/30 11:54:51 [debug] 88248#0: http file cache loader time elapsed: 5
2022/08/30 11:54:51 [notice] 88248#0: http file cache: /mnt/cache 167.258M, bsize: 4096
2022/08/30 11:54:51 [notice] 88236#0: signal 17 (SIGCHLD) received from 88248
2022/08/30 11:54:51 [notice] 88236#0: cache loader process 88248 exited with code 0
2022/08/30 11:54:51 [debug] 88236#0: shmtx forced unlock
2022/08/30 11:54:51 [debug] 88236#0: shmtx forced unlock
2022/08/30 11:54:51 [debug] 88236#0: wake up, sigio 0
2022/08/30 11:54:51 [debug] 88236#0: reap children

Nothing is suggesting an abnormal exit.


While theoretically it is possible to free some resources earlier, this requires noticeable additional effort for each resource and is not something usually done.

It made sense when looking at the code. Thank you for the confirmation.

In general, this is something to consider while configuring the slice module. Too small a slice size can easily lead to significant resource usage, eliminating the advantages of using the slice module.

Indeed, I've had to tweak the slice size a few times lately as the dataset changed.
This module may not suit the use case anymore.

Thanks

comment:3 by Maxim Dounin, 4 weeks ago

Resolution: invalid
Status: new → closed

Nothing is suggesting an abnormal exit.

Thanks for confirming.

Indeed, I've had to tweak the slice size a few times lately as the dataset changed.
This module may not suit the use case anymore.

Note that the slice module is more about dealing with network throughput and disk space limits. It makes it possible to effectively cache huge files, notably when:

  • Downloading files from the upstream server takes significant time, and therefore proxy_cache_lock is not effective in preventing multiple simultaneous downloads.
  • Only small ranges are in fact requested by clients, and therefore downloading and caching the whole file wastes significant resources.

It is not, however, expected to help when a single client requests a file and is going to download the whole file. Rather, in such a scenario the slice module will certainly need more resources.
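A simplified model of why these cases differ: each client range is widened to slice boundaries, and only the slices it touches become subrequests (and cache entries, keyed by `$uri$slice_range` as in the configuration above). The function below is a hypothetical illustration that mimics the `bytes=start-end` shape of `$slice_range`; it ignores the object's total length and nginx's actual subrequest bookkeeping.

```python
def slices_for_range(first_byte: int, last_byte: int, slice_size: int) -> list[str]:
    """Aligned upstream ranges needed to serve client bytes first_byte..last_byte
    (inclusive). Simplified model: does not clamp to the object's length."""
    if first_byte < 0 or last_byte < first_byte or slice_size <= 0:
        raise ValueError("invalid range")
    start = (first_byte // slice_size) * slice_size  # round down to a slice boundary
    ranges = []
    while start <= last_byte:
        ranges.append(f"bytes={start}-{start + slice_size - 1}")
        start += slice_size
    return ranges

print(slices_for_range(1500, 1600, 1024))  # prints ['bytes=1024-2047']
print(len(slices_for_range(0, 1024 * 1024 - 1, 1024)))  # prints 1024
```

A client reading a few bytes of a large file touches one or two slices, while a full download touches every slice, each as a separate subrequest holding its own resources until the main request completes.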

Closing this, as the reported behaviour is an expected downside of using multiple subrequests to load the file.
