Opened 4 months ago

Closed 7 weeks ago

Last modified 44 hours ago

#2614 closed defect (fixed)

Memory-leak-like growth occurs as long as nginx holds long-lived gRPC stream connections

Reported by: sm815lee@… Owned by:
Priority: critical Milestone: nginx-1.27
Component: nginx-core Version: 1.25.x
Keywords: grpc, memory, leak Cc:
uname -a: Linux ip-10-0-2-164 6.5.0-1014-aws #14~22.04.1-Ubuntu SMP Thu Feb 15 15:27:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.25.4
built by gcc 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
built with OpenSSL 3.0.2 15 Mar 2022
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-http_v3_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -ffile-prefix-map=/data/builder/debuild/nginx-1.25.4/debian/debuild-base/nginx-1.25.4=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'

Description (last modified by sm815lee@…)

Hello all,

We have a workload in which the load-balancer server holds long-lived gRPC-stream connections, much like a conventional TCP server, and the clients occasionally send small gRPC calls through the reverse proxy. To measure nginx's CPU and memory usage under a similar workload, we wrote an in-house gRPC-stream test-client program.
The network topology and test environment are simple; we ran the tests on AWS EC2 VMs.

https://trac.nginx.org/nginx/attachment/ticket/2614/nginx_simple_topology.png
[simple topology picture]
https://drive.usercontent.google.com/download?id=12QLfhidHyKAiB4Z4S7WiODfNPlwdenU8&export=view&authuser=0

Under this workload, we found that the proxy server's memory consumption increased gradually and linearly for the duration of the test, until the process was eventually terminated by the OOM killer.
https://trac.nginx.org/nginx/attachment/ticket/2614/vanilla_nginx_400k_conn_grpc.png
[memory consumption picture]
https://drive.usercontent.google.com/download?id=1l8t4_cZ7pLEBBLP8m0eT1-wWeOLhccl0&export=view&authuser=0

Because the issue is easy to reproduce, we ran some tests under Valgrind with the massif tool and found where most of the allocation happens: ngx_alloc_chain_link(), called from ngx_chain_writer().
However, what ngx_chain_writer() does is perfectly legitimate, and because nginx manages memory through pools, with many places taking pre-allocated memory from a pool, it was hard to audit every allocation and deallocation site, especially for those of us new to nginx internals.

[massif analyzer graph picture]
https://trac.nginx.org/nginx/attachment/ticket/2614/nginx_memory_massif.png
https://drive.usercontent.google.com/download?id=1FZ40HtHeWXDPFVgw9lk86Nv4AzadGo4Z&export=view&authuser=0

We did find memory-management recommendations in the documentation, e.g. nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_requests,
but we don't think it is fair for nginx to cost gradual memory on every connection over its lifespan, like a tax ;)
Moreover, the relevant directives, keepalive_requests and keepalive_time, do not affect gRPC-stream connections at all.
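This matches how those directives are defined: they count completed requests on a connection, and a long-lived gRPC stream is a single HTTP/2 request that never completes, so they are never consulted. A hedged sketch of what one might experiment with instead (values purely illustrative; these grpc-module timeouts bound an idle stream's lifetime, which only caps the growth per stream rather than fixing the underlying behavior):

```nginx
location / {
    grpc_pass grpc://grpcsvr;

    # Illustrative values: close a stream once no data has flowed in
    # the given direction for 10 minutes.  This bounds how long a
    # single stream can accumulate memory; it does not fix the growth.
    grpc_read_timeout 10m;
    grpc_send_timeout 10m;
}
```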

At this point we don't know whether this behavior is what nginx intends.
With so many configuration directives, modules, and compile options, it is hard to rule out every combination.
We have in fact tried many combinations of directives to avoid this issue, such as not using certificates, but had no luck with any of them.

So any recommendations for dealing with this issue would be very helpful.

* Update, March 13th 2024 *
During a recent test, we found that it is not the connection itself that drives the memory growth, but the gRPC stream's messages:
with the same number of connections, doubling the number of gRPC stream messages almost exactly doubles memory consumption.

https://trac.nginx.org/nginx/attachment/ticket/2614/grpc-message-memory-ratio.png

Here are the configuration files:

## nginx.conf
worker_cpu_affinity auto;
worker_priority 0;
worker_processes auto;
worker_rlimit_core 1000000;
worker_rlimit_nofile 1000000;
worker_shutdown_timeout 30m;

error_log /var/log/nginx/error.log debug;
pid /var/run/nginx.pid;

daemon on;
debug_points abort;

lock_file logs/nginx.lock;
master_process on;
pcre_jit off;
thread_pool default threads=64 max_queue=65536;
timer_resolution 1s;
working_directory /etc/nginx;

include [];

events {
    accept_mutex off;
    accept_mutex_delay 500ms;
    multi_accept off;
    use epoll;
    worker_aio_requests 256;
    worker_connections 50000;
}

stream {
    upstream stream_upstream {
        least_conn;

        server 10.0.2.210:80;
        server 10.0.2.210:81;
        server 10.0.2.210:82;
        server 10.0.2.210:83;
        server 10.0.2.210:84;
        server 10.0.2.210:85;
        server 10.0.2.210:86;
        server 10.0.2.210:87;
        server 10.0.2.210:88;
        server 10.0.2.210:89;
        server 10.0.2.210:90;
        server 10.0.2.210:91;
        server 10.0.2.210:92;
        server 10.0.2.210:93;
        server 10.0.2.210:94;
        server 10.0.2.210:95;
        server 10.0.2.210:96;
        server 10.0.2.210:97;
        server 10.0.2.210:98;
        server 10.0.2.210:99;
        server 10.0.2.139:80;
        server 10.0.2.139:81;
        server 10.0.2.139:82;
        server 10.0.2.139:83;
        server 10.0.2.139:84;
        server 10.0.2.139:85;
        server 10.0.2.139:86;
        server 10.0.2.139:87;
        server 10.0.2.139:88;
        server 10.0.2.139:89;
        server 10.0.2.139:90;
        server 10.0.2.139:91;
        server 10.0.2.139:92;
        server 10.0.2.139:93;
        server 10.0.2.139:94;
        server 10.0.2.139:95;
        server 10.0.2.139:96;
        server 10.0.2.139:97;
        server 10.0.2.139:98;
        server 10.0.2.139:99;
        server 10.0.2.54:80;
        server 10.0.2.54:81;
        server 10.0.2.54:82;
        server 10.0.2.54:83;
        server 10.0.2.54:84;
        server 10.0.2.54:85;
        server 10.0.2.54:86;
        server 10.0.2.54:87;
        server 10.0.2.54:88;
        server 10.0.2.54:89;
        server 10.0.2.54:90;
        server 10.0.2.54:91;
        server 10.0.2.54:92;
        server 10.0.2.54:93;
        server 10.0.2.54:94;
        server 10.0.2.54:95;
        server 10.0.2.54:96;
        server 10.0.2.54:97;
        server 10.0.2.54:98;
        server 10.0.2.54:99;
        server 10.0.2.38:80;
        server 10.0.2.38:81;
        server 10.0.2.38:82;
        server 10.0.2.38:83;
        server 10.0.2.38:84;
        server 10.0.2.38:85;
        server 10.0.2.38:86;
        server 10.0.2.38:87;
        server 10.0.2.38:88;
        server 10.0.2.38:89;
        server 10.0.2.38:90;
        server 10.0.2.38:91;
        server 10.0.2.38:92;
        server 10.0.2.38:93;
        server 10.0.2.38:94;
        server 10.0.2.38:95;
        server 10.0.2.38:96;
        server 10.0.2.38:97;
        server 10.0.2.38:98;
        server 10.0.2.38:99;
        server 10.0.2.16:80;
        server 10.0.2.16:81;
        server 10.0.2.16:82;
        server 10.0.2.16:83;
        server 10.0.2.16:84;
        server 10.0.2.16:85;
        server 10.0.2.16:86;
        server 10.0.2.16:87;
        server 10.0.2.16:88;
        server 10.0.2.16:89;
        server 10.0.2.16:90;
        server 10.0.2.16:91;
        server 10.0.2.16:92;
        server 10.0.2.16:93;
        server 10.0.2.16:94;
        server 10.0.2.16:95;
        server 10.0.2.16:96;
        server 10.0.2.16:97;
        server 10.0.2.16:98;
        server 10.0.2.16:99;
    }

    server {
        listen 0.0.0.0:1000;
        proxy_pass stream_upstream;
    }
}

http {
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/conf.d/backend/*.conf;

    #open_file_cache max=200000 inactive=20s;
    #open_file_cache_valid 30s;
    #open_file_cache_min_uses 2;
    #open_file_cache_errors on;

    # to boost I/O on HDD we can disable access logs
    access_log off;

    # copies data between one FD and other from within the kernel
    # faster than read() + write()
    sendfile on;

    # send headers in one piece, it is better than sending them one by one
    tcp_nopush on;

    keepalive_timeout 300;
    keepalive_requests 999999999;
}

## /etc/nginx/conf.d/grpc.conf
upstream grpcsvr {
    server 10.0.2.210:50010;
    server 10.0.2.210:50011;
    server 10.0.2.210:50012;
    server 10.0.2.210:50013;
    server 10.0.2.210:50014;
    server 10.0.2.210:50015;
    server 10.0.2.210:50016;
    server 10.0.2.210:50017;
    server 10.0.2.210:50018;
    server 10.0.2.210:50019;
    server 10.0.2.210:50020;
    server 10.0.2.210:50021;
    server 10.0.2.210:50022;
    server 10.0.2.210:50023;
    server 10.0.2.210:50024;
    server 10.0.2.210:50025;
    server 10.0.2.210:50026;
    server 10.0.2.210:50027;
    server 10.0.2.210:50028;
    server 10.0.2.210:50029;
    server 10.0.2.139:50010;
    server 10.0.2.139:50011;
    server 10.0.2.139:50012;
    server 10.0.2.139:50013;
    server 10.0.2.139:50014;
    server 10.0.2.139:50015;
    server 10.0.2.139:50016;
    server 10.0.2.139:50017;
    server 10.0.2.139:50018;
    server 10.0.2.139:50019;
    server 10.0.2.139:50020;
    server 10.0.2.139:50021;
    server 10.0.2.139:50022;
    server 10.0.2.139:50023;
    server 10.0.2.139:50024;
    server 10.0.2.139:50025;
    server 10.0.2.139:50026;
    server 10.0.2.139:50027;
    server 10.0.2.139:50028;
    server 10.0.2.139:50029;
    server 10.0.2.54:50010;
    server 10.0.2.54:50011;
    server 10.0.2.54:50012;
    server 10.0.2.54:50013;
    server 10.0.2.54:50014;
    server 10.0.2.54:50015;
    server 10.0.2.54:50016;
    server 10.0.2.54:50017;
    server 10.0.2.54:50018;
    server 10.0.2.54:50019;
    server 10.0.2.54:50020;
    server 10.0.2.54:50021;
    server 10.0.2.54:50022;
    server 10.0.2.54:50023;
    server 10.0.2.54:50024;
    server 10.0.2.54:50025;
    server 10.0.2.54:50026;
    server 10.0.2.54:50027;
    server 10.0.2.54:50028;
    server 10.0.2.54:50029;
    server 10.0.2.38:50010;
    server 10.0.2.38:50011;
    server 10.0.2.38:50012;
    server 10.0.2.38:50013;
    server 10.0.2.38:50014;
    server 10.0.2.38:50015;
    server 10.0.2.38:50016;
    server 10.0.2.38:50017;
    server 10.0.2.38:50018;
    server 10.0.2.38:50019;
    server 10.0.2.38:50020;
    server 10.0.2.38:50021;
    server 10.0.2.38:50022;
    server 10.0.2.38:50023;
    server 10.0.2.38:50024;
    server 10.0.2.38:50025;
    server 10.0.2.38:50026;
    server 10.0.2.38:50027;
    server 10.0.2.38:50028;
    server 10.0.2.38:50029;
    server 10.0.2.16:50010;
    server 10.0.2.16:50011;
    server 10.0.2.16:50012;
    server 10.0.2.16:50013;
    server 10.0.2.16:50014;
    server 10.0.2.16:50015;
    server 10.0.2.16:50016;
    server 10.0.2.16:50017;
    server 10.0.2.16:50018;
    server 10.0.2.16:50019;
    server 10.0.2.16:50020;
    server 10.0.2.16:50021;
    server 10.0.2.16:50022;
    server 10.0.2.16:50023;
    server 10.0.2.16:50024;
    server 10.0.2.16:50025;
    server 10.0.2.16:50026;
    server 10.0.2.16:50027;
    server 10.0.2.16:50028;
    server 10.0.2.16:50029;
    least_conn;
}

server {
    #listen 0.0.0.0:50010 ssl;
    listen 0.0.0.0:50010;
    http2 on;

    #ssl_certificate /server.crt;
    #ssl_certificate_key /server.key;

    location / {
        grpc_pass grpc://grpcsvr;
        add_header Last-Modified $date_gmt;
        add_header Cache-Control 'no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0';
        if_modified_since off;
        expires off;
        etag off;
    }
}

Attachments (4)

nginx_simple_topology.png (36.1 KB ) - added by sm815lee@… 4 months ago.
nginx_memory_massif.png (344.3 KB ) - added by sm815lee@… 4 months ago.
vanilla_nginx_400k_conn_grpc.png (36.6 KB ) - added by sm815lee@… 4 months ago.
grpc-message-memory-ratio.png (45.0 KB ) - added by sm815lee@… 4 months ago.


Change History (13)

by sm815lee@…, 4 months ago

Attachment: nginx_simple_topology.png added

by sm815lee@…, 4 months ago

Attachment: nginx_memory_massif.png added

by sm815lee@…, 4 months ago

comment:1 by sm815lee@…, 4 months ago

Description: modified (diff)

comment:2 by lutzmex@…, 4 months ago

I've encountered a memory management challenge with NGINX (version 1.25.4) running on an Ubuntu server (6.5.0-1014-aws) where the server is handling long-lived gRPC stream connections. Despite following best practices and tuning based on the official documentation, we are observing a gradual and linear increase in memory consumption that eventually triggers the OOM killer. This behavior persists across a range of configurations, and the issue has been reproducible under our test environment that mimics a typical workload with frequent, small gRPC calls to the reverse proxy server.

Through extensive testing, including the use of tools like Valgrind and Massif, we've identified significant memory allocations happening within ngx_alloc_chain_link() in the ngx_chain_writer() function. Given NGINX's memory pool management approach, pinpointing the exact cause of the leak or unnecessary memory retention is challenging. We also noted that directives like keepalive_requests and keepalive_time, which are often recommended for managing memory in HTTP connections, do not influence the behavior of gRPC stream connections.

Here's a snapshot of our NGINX configuration:

(Configuration details were provided)
Given this context, my questions are as follows:

1. Has anyone experienced similar memory management issues with NGINX handling long-lived gRPC connections, and how did you address them?
2. Are there specific NGINX directives or configuration strategies that can mitigate this gradual memory consumption for gRPC stream connections?
3. Could this behavior be the result of an inherent limitation or known issue in NGINX's handling of gRPC streams, and if so, are there any patches or workarounds available?
4. Would adjusting NGINX's memory pool settings or other underlying system parameters offer a potential solution to this problem?

Any insights, experiences, or recommendations on how to tackle this memory-leak-like issue in NGINX would be greatly appreciated.
Thanks

Last edited 44 hours ago by lutzmex@…

by sm815lee@…, 4 months ago

comment:3 by sm815lee@…, 4 months ago

Description: modified (diff)

comment:4 by sm815lee@…, 4 months ago

Description: modified (diff)

comment:5 by m.herasimovich, 3 months ago

Milestone: nginx-1.25 → nginx-1.27

Ticket retargeted after milestone closed

comment:6 by Roman Arutyunyan, 7 weeks ago

A similar problem with chain links was fixed in #1046.

comment:7 by Roman Arutyunyan <arut@…>, 7 weeks ago

In 9248:f7d53c7f7014/nginx:

Optimized chain link usage (ticket #2614).

Previously chain links could sometimes be dropped instead of being reused,
which could result in increased memory consumption during long requests.

A similar chain link issue in ngx_http_gzip_filter_module was fixed in
da46bfc484ef (1.11.10).

Based on a patch by Sangmin Lee.

comment:8 by Roman Arutyunyan, 7 weeks ago

Resolution: fixed
Status: new → closed

comment:9 by Roman Arutyunyan <arut@…>, 6 weeks ago

In 9260:b317a71f75ae/nginx:

Optimized chain link usage (ticket #2614).

Previously chain links could sometimes be dropped instead of being reused,
which could result in increased memory consumption during long requests.

A similar chain link issue in ngx_http_gzip_filter_module was fixed in
da46bfc484ef (1.11.10).

Based on a patch by Sangmin Lee.
