Opened 3 months ago

Last modified 4 weeks ago

#2614 new defect

Memory-leak-like issue occurs while nginx holds long-lived gRPC stream connections

Reported by: sm815lee@… Owned by:
Priority: critical Milestone: nginx-1.27
Component: nginx-core Version: 1.25.x
Keywords: grpc, memory, leak Cc:
uname -a: Linux ip-10-0-2-164 6.5.0-1014-aws #14~22.04.1-Ubuntu SMP Thu Feb 15 15:27:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.25.4
built by gcc 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
built with OpenSSL 3.0.2 15 Mar 2022
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-http_v3_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -ffile-prefix-map=/data/builder/debuild/nginx-1.25.4/debian/debuild-base/nginx-1.25.4=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'

Description

Hello all,

Our workload has the load-balancer server hold long-lived gRPC connections, much as a typical TCP server would.
From time to time, the client sends small gRPC calls to the reverse proxy server. We therefore wrote an in-house gRPC test client to measure nginx's CPU and memory usage under a similar workload.
The network topology and test environment are simple; we tested on AWS EC2 VMs.

[simple topology picture]
https://drive.usercontent.google.com/download?id=12QLfhidHyKAiB4Z4S7WiODfNPlwdenU8&export=view&authuser=0

With this workload, nginx's memory consumption increased gradually and linearly throughout the test.
Eventually, the proxy server was hit by the OOM killer.
[memory consumption picture]
https://drive.usercontent.google.com/download?id=1l8t4_cZ7pLEBBLP8m0eT1-wWeOLhccl0&export=view&authuser=0
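
For anyone trying to reproduce the graph, the growth can be put into numbers with a short script. This is only a minimal sketch, assuming a Linux host where /proc/<pid>/status is readable and the nginx worker PIDs are already known; the function names are hypothetical and are not part of our test client.

```python
import re


def parse_vmrss_kib(status_text):
    """Return VmRSS in KiB from the text of /proc/<pid>/status, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the resident set size of one process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kib(f.read())


def growth_rate(samples):
    """Least-squares slope of (seconds, KiB) samples, in KiB per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    num = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    return num / den
```

Calling read_rss_kib() for each worker at a fixed interval and passing the collected (time, RSS) pairs to growth_rate() gives a slope; dividing it by the number of open connections gives a rough per-connection growth figure.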

Because the issue is easy to reproduce, we ran several tests under Valgrind's massif tool and found that most of the memory allocation happens in ngx_alloc_chain_link(), called from ngx_chain_writer().
What ngx_chain_writer() does looks legitimate, however. Because nginx uses memory pools, pre-allocated memory is handed out from the pool in many places, so it is hard to check every place where memory is allocated and released, especially for those who are new to nginx internals.

[massif analyzer graph picture]
https://drive.usercontent.google.com/download?id=1FZ40HtHeWXDPFVgw9lk86Nv4AzadGo4Z&export=view&authuser=0
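
To illustrate why pool-based allocation makes this hard to track down: memory taken from an nginx pool is generally given back only when the whole pool is destroyed, so anything allocated per request from a connection-lifetime pool stays until the connection closes. Chain links can be recycled through a per-pool free list (ngx_free_chain() puts a link back, ngx_alloc_chain_link() reuses it), which keeps usage flat only if the links are actually returned. The toy model below is not nginx code and does not claim to be the cause of this ticket; ToyPool, simulate() and the 16-byte link size are hypothetical names and values chosen for illustration.

```python
LINK_SIZE = 16  # illustrative size of one chain link, in bytes


class ToyPool:
    """Toy pool: allocated bytes are only released when the pool itself goes away."""

    def __init__(self):
        self.allocated = 0   # bytes taken from the pool so far
        self.free_links = 0  # links waiting on the free list for reuse

    def alloc_link(self):
        # Reuse a recycled link if one is available, otherwise grow the pool.
        if self.free_links:
            self.free_links -= 1
        else:
            self.allocated += LINK_SIZE

    def free_link(self):
        # Return a link to the free list; the pool itself does not shrink.
        self.free_links += 1


def simulate(requests, recycle):
    """Bytes held by one long-lived connection's pool after `requests` small calls."""
    pool = ToyPool()
    for _ in range(requests):
        pool.alloc_link()
        if recycle:
            pool.free_link()
    return pool.allocated
```

In this model simulate(1000, recycle=True) stays at a single link, while simulate(1000, recycle=False) grows linearly with the number of calls, which is the shape of the graph we observed.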

We found some memory-management recommendations in the nginx documentation, such as nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_requests,
but we think it is not fair that every connection costs gradually more memory over its lifespan, like a tax ;)
Also, some nginx directives, such as keepalive_requests and keepalive_time, did not affect gRPC connections at all.

Currently, we do not know whether this is the intended behavior of nginx.
There are so many configuration directives, modules, and compile options that I think it is really hard to avoid this issue by combining them.
We have tried many combinations of directives to avoid this issue, such as not using certificates, but had no luck.

So it would be very helpful if you could share recommendations for this issue.

Here are the configuration files
[configuration file]

## nginx.conf
worker_cpu_affinity auto;
worker_priority 0;
worker_processes auto;
worker_rlimit_core 1000000;
worker_rlimit_nofile 1000000;
worker_shutdown_timeout 30m;

error_log /var/log/nginx/error.log debug;
pid /var/run/nginx.pid;

daemon on;
debug_points abort;

lock_file logs/nginx.lock;
master_process on;
pcre_jit off;
thread_pool default threads=64 max_queue=65536;
timer_resolution 1s;
working_directory /etc/nginx;

include [];

events {
    accept_mutex off;
    accept_mutex_delay 500ms;
    multi_accept off;
    use epoll;
    worker_aio_requests 256;
    worker_connections 50000;
}

stream {
    upstream stream_upstream {

        least_conn;
            server 10.0.2.210:80;
            server 10.0.2.210:81;
            server 10.0.2.210:82;
            server 10.0.2.210:83;
            server 10.0.2.210:84;
            server 10.0.2.210:85;
            server 10.0.2.210:86;
            server 10.0.2.210:87;
            server 10.0.2.210:88;
            server 10.0.2.210:89;
            server 10.0.2.210:90;
            server 10.0.2.210:91;
            server 10.0.2.210:92;
            server 10.0.2.210:93;
            server 10.0.2.210:94;
            server 10.0.2.210:95;
            server 10.0.2.210:96;
            server 10.0.2.210:97;
            server 10.0.2.210:98;
            server 10.0.2.210:99;
            server 10.0.2.139:80;
            server 10.0.2.139:81;
            server 10.0.2.139:82;
            server 10.0.2.139:83;
            server 10.0.2.139:84;
            server 10.0.2.139:85;
            server 10.0.2.139:86;
            server 10.0.2.139:87;
            server 10.0.2.139:88;
            server 10.0.2.139:89;
            server 10.0.2.139:90;
            server 10.0.2.139:91;
            server 10.0.2.139:92;
            server 10.0.2.139:93;
            server 10.0.2.139:94;
            server 10.0.2.139:95;
            server 10.0.2.139:96;
            server 10.0.2.139:97;
            server 10.0.2.139:98;
            server 10.0.2.139:99;
            server 10.0.2.54:80;
            server 10.0.2.54:81;
            server 10.0.2.54:82;
            server 10.0.2.54:83;
            server 10.0.2.54:84;
            server 10.0.2.54:85;
            server 10.0.2.54:86;
            server 10.0.2.54:87;
            server 10.0.2.54:88;
            server 10.0.2.54:89;
            server 10.0.2.54:90;
            server 10.0.2.54:91;
            server 10.0.2.54:92;
            server 10.0.2.54:93;
            server 10.0.2.54:94;
            server 10.0.2.54:95;
            server 10.0.2.54:96;
            server 10.0.2.54:97;
            server 10.0.2.54:98;
            server 10.0.2.54:99;
            server 10.0.2.38:80;
            server 10.0.2.38:81;
            server 10.0.2.38:82;
            server 10.0.2.38:83;
            server 10.0.2.38:84;
            server 10.0.2.38:85;
            server 10.0.2.38:86;
            server 10.0.2.38:87;
            server 10.0.2.38:88;
            server 10.0.2.38:89;
            server 10.0.2.38:90;
            server 10.0.2.38:91;
            server 10.0.2.38:92;
            server 10.0.2.38:93;
            server 10.0.2.38:94;
            server 10.0.2.38:95;
            server 10.0.2.38:96;
            server 10.0.2.38:97;
            server 10.0.2.38:98;
            server 10.0.2.38:99;
            server 10.0.2.16:80;
            server 10.0.2.16:81;
            server 10.0.2.16:82;
            server 10.0.2.16:83;
            server 10.0.2.16:84;
            server 10.0.2.16:85;
            server 10.0.2.16:86;
            server 10.0.2.16:87;
            server 10.0.2.16:88;
            server 10.0.2.16:89;
            server 10.0.2.16:90;
            server 10.0.2.16:91;
            server 10.0.2.16:92;
            server 10.0.2.16:93;
            server 10.0.2.16:94;
            server 10.0.2.16:95;
            server 10.0.2.16:96;
            server 10.0.2.16:97;
            server 10.0.2.16:98;
            server 10.0.2.16:99;
    }

    server {
        listen 0.0.0.0:1000;
        proxy_pass stream_upstream;
    }
}

http {
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/conf.d/backend/*.conf;

    #open_file_cache max=200000 inactive=20s;
    #open_file_cache_valid 30s;
    #open_file_cache_min_uses 2;
    #open_file_cache_errors on;

    # to boost I/O on HDD we can disable access logs
    access_log off;

    # copies data between one FD and other from within the kernel
    # faster than read() + write()
    sendfile on;

    # send headers in one piece, it is better than sending them one by one
    tcp_nopush on;

    keepalive_timeout 300;
    keepalive_requests 999999999;
}

## /etc/nginx/conf.d/grpc.conf
upstream grpcsvr {
    server 10.0.2.210:50010;
    server 10.0.2.210:50011;
    server 10.0.2.210:50012;
    server 10.0.2.210:50013;
    server 10.0.2.210:50014;
    server 10.0.2.210:50015;
    server 10.0.2.210:50016;
    server 10.0.2.210:50017;
    server 10.0.2.210:50018;
    server 10.0.2.210:50019;
    server 10.0.2.210:50020;
    server 10.0.2.210:50021;
    server 10.0.2.210:50022;
    server 10.0.2.210:50023;
    server 10.0.2.210:50024;
    server 10.0.2.210:50025;
    server 10.0.2.210:50026;
    server 10.0.2.210:50027;
    server 10.0.2.210:50028;
    server 10.0.2.210:50029;
    server 10.0.2.139:50010;
    server 10.0.2.139:50011;
    server 10.0.2.139:50012;
    server 10.0.2.139:50013;
    server 10.0.2.139:50014;
    server 10.0.2.139:50015;
    server 10.0.2.139:50016;
    server 10.0.2.139:50017;
    server 10.0.2.139:50018;
    server 10.0.2.139:50019;
    server 10.0.2.139:50020;
    server 10.0.2.139:50021;
    server 10.0.2.139:50022;
    server 10.0.2.139:50023;
    server 10.0.2.139:50024;
    server 10.0.2.139:50025;
    server 10.0.2.139:50026;
    server 10.0.2.139:50027;
    server 10.0.2.139:50028;
    server 10.0.2.139:50029;
    server 10.0.2.54:50010;
    server 10.0.2.54:50011;
    server 10.0.2.54:50012;
    server 10.0.2.54:50013;
    server 10.0.2.54:50014;
    server 10.0.2.54:50015;
    server 10.0.2.54:50016;
    server 10.0.2.54:50017;
    server 10.0.2.54:50018;
    server 10.0.2.54:50019;
    server 10.0.2.54:50020;
    server 10.0.2.54:50021;
    server 10.0.2.54:50022;
    server 10.0.2.54:50023;
    server 10.0.2.54:50024;
    server 10.0.2.54:50025;
    server 10.0.2.54:50026;
    server 10.0.2.54:50027;
    server 10.0.2.54:50028;
    server 10.0.2.54:50029;
    server 10.0.2.38:50010;
    server 10.0.2.38:50011;
    server 10.0.2.38:50012;
    server 10.0.2.38:50013;
    server 10.0.2.38:50014;
    server 10.0.2.38:50015;
    server 10.0.2.38:50016;
    server 10.0.2.38:50017;
    server 10.0.2.38:50018;
    server 10.0.2.38:50019;
    server 10.0.2.38:50020;
    server 10.0.2.38:50021;
    server 10.0.2.38:50022;
    server 10.0.2.38:50023;
    server 10.0.2.38:50024;
    server 10.0.2.38:50025;
    server 10.0.2.38:50026;
    server 10.0.2.38:50027;
    server 10.0.2.38:50028;
    server 10.0.2.38:50029;
    server 10.0.2.16:50010;
    server 10.0.2.16:50011;
    server 10.0.2.16:50012;
    server 10.0.2.16:50013;
    server 10.0.2.16:50014;
    server 10.0.2.16:50015;
    server 10.0.2.16:50016;
    server 10.0.2.16:50017;
    server 10.0.2.16:50018;
    server 10.0.2.16:50019;
    server 10.0.2.16:50020;
    server 10.0.2.16:50021;
    server 10.0.2.16:50022;
    server 10.0.2.16:50023;
    server 10.0.2.16:50024;
    server 10.0.2.16:50025;
    server 10.0.2.16:50026;
    server 10.0.2.16:50027;
    server 10.0.2.16:50028;
    server 10.0.2.16:50029;
    least_conn;
}

server {
    #listen 0.0.0.0:50010 ssl;
    listen 0.0.0.0:50010;
    http2 on;
    
    #ssl_certificate /server.crt;
    #ssl_certificate_key /server.key;

    location / {
        grpc_pass grpc://grpcsvr;
        add_header Last-Modified $date_gmt;
        add_header Cache-Control 'no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0';
        if_modified_since off;
        expires off;
        etag off;
    }
} 

Change History (3)

by sm815lee@…, 3 months ago

Attachment: nginx_simple_topology.png added

by sm815lee@…, 3 months ago

Attachment: nginx_memory_massif.png added

by sm815lee@…, 3 months ago
