Opened 10 months ago
Last modified 5 months ago
#2614 closed defect
Memory-leak-like issue occurs for as long as nginx holds long-lived gRPC stream connections (at Version 1)
Reported by: | Owned by:
Priority: critical | Milestone: nginx-1.27
Component: nginx-core | Version: 1.25.x
Keywords: grpc, memory, leak | Cc:
uname -a: Linux ip-10-0-2-164 6.5.0-1014-aws #14~22.04.1-Ubuntu SMP Thu Feb 15 15:27:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
nginx -V:
nginx version: nginx/1.25.4
built by gcc 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
built with OpenSSL 3.0.2 15 Mar 2022
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-http_v3_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -ffile-prefix-map=/data/builder/debuild/nginx-1.25.4/debian/debuild-base/nginx-1.25.4=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'
Description (last modified by )
Hello all,
We have a workload in which the load-balancer server keeps long-lived gRPC-stream connections open, much like an ordinary TCP server. From time to time the client sends small gRPC calls through the reverse proxy. To measure nginx's CPU and memory usage under a similar workload, we wrote an in-house gRPC-stream test-client program.
The network topology and test environment are simple; we ran the tests on AWS EC2 VMs.
[simple topology picture]
https://drive.usercontent.google.com/download?id=12QLfhidHyKAiB4Z4S7WiODfNPlwdenU8&export=view&authuser=0
With this workload, the memory-consumption graph increased gradually and linearly for the duration of the test, and the proxy server eventually ran into the OOM killer.
[memory consumption picture]
https://drive.usercontent.google.com/download?id=1l8t4_cZ7pLEBBLP8m0eT1-wWeOLhccl0&export=view&authuser=0
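For reference, the growth can also be watched without any special tooling by periodically sampling a worker's resident set size. A minimal sketch (WORKER_PID and the short 3x1-second loop are placeholders; a real soak test would point at an actual nginx worker pid and sample at a much longer interval):

```shell
# Sample the resident set size (RSS, in KiB) of one process over time.
# WORKER_PID is a placeholder -- substitute an nginx worker's pid; it
# defaults to the current shell here only so the sketch is runnable.
WORKER_PID=${WORKER_PID:-$$}
for i in 1 2 3; do
    ps -o rss= -p "$WORKER_PID"
    sleep 1   # use e.g. 10s and many more iterations for a real test
done
```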
Because the issue is quite easy to reproduce, we ran some tests with Valgrind's massif tool and found where most of the allocation happens: ngx_alloc_chain_link() called from ngx_chain_writer().
What ngx_chain_writer() does is legitimate in itself, but since nginx manages memory through pools, there are many places that take pre-allocated memory from a pool, and it was hard to audit every spot where allocation and deallocation happen, especially for those of us who are new to nginx internals.
[massif analyzer graph picture]
https://drive.usercontent.google.com/download?id=1FZ40HtHeWXDPFVgw9lk86Nv4AzadGo4Z&export=view&authuser=0
We found some memory-management recommendations in the guide documents, e.g. nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_requests,
but we don't think it's fair that nginx gradually costs memory on every connection over its lifespan, like a tax ;)
Moreover, some of nginx's directives, keepalive_requests and keepalive_time, didn't affect gRPC-stream connections at all.
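Our understanding of why those two directives don't help here (an assumption on our side, please correct us): both limits are enforced only between requests, when a request on the connection completes. A single long-lived gRPC stream counts as one in-flight request, so a stream that never finishes never reaches the check:

```nginx
http {
    # Checked only after a request on the connection completes.  A single
    # never-ending gRPC stream is one in-flight request, so neither limit
    # ever fires for it, no matter how small the values are set.
    keepalive_requests 100;
    keepalive_time     30m;
}
```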
At this point we don't even know whether this is the behavior nginx originally intended.
There are so many configuration directives, modules, and compile options that it seems very hard to rule this issue out across all their combinations. We have in fact tried many combinations of directives to work around it, such as not using certificates, but had no luck.
So any recommendations for dealing with this issue would be very helpful.
Here are the configuration files:
[configuration file]
## nginx.conf
worker_cpu_affinity auto;
worker_priority 0;
worker_processes auto;
worker_rlimit_core 1000000;
worker_rlimit_nofile 1000000;
worker_shutdown_timeout 30m;
error_log /var/log/nginx/error.log debug;
pid /var/run/nginx.pid;
daemon on;
debug_points abort;
lock_file logs/nginx.lock;
master_process on;
pcre_jit off;
thread_pool default threads=64 max_queue=65536;
timer_resolution 1s;
working_directory /etc/nginx;
include [];

events {
    accept_mutex off;
    accept_mutex_delay 500ms;
    multi_accept off;
    use epoll;
    worker_aio_requests 256;
    worker_connections 50000;
}

stream {
    upstream stream_upstream {
        least_conn;
        server 10.0.2.210:80; server 10.0.2.210:81; server 10.0.2.210:82; server 10.0.2.210:83; server 10.0.2.210:84;
        server 10.0.2.210:85; server 10.0.2.210:86; server 10.0.2.210:87; server 10.0.2.210:88; server 10.0.2.210:89;
        server 10.0.2.210:90; server 10.0.2.210:91; server 10.0.2.210:92; server 10.0.2.210:93; server 10.0.2.210:94;
        server 10.0.2.210:95; server 10.0.2.210:96; server 10.0.2.210:97; server 10.0.2.210:98; server 10.0.2.210:99;
        server 10.0.2.139:80; server 10.0.2.139:81; server 10.0.2.139:82; server 10.0.2.139:83; server 10.0.2.139:84;
        server 10.0.2.139:85; server 10.0.2.139:86; server 10.0.2.139:87; server 10.0.2.139:88; server 10.0.2.139:89;
        server 10.0.2.139:90; server 10.0.2.139:91; server 10.0.2.139:92; server 10.0.2.139:93; server 10.0.2.139:94;
        server 10.0.2.139:95; server 10.0.2.139:96; server 10.0.2.139:97; server 10.0.2.139:98; server 10.0.2.139:99;
        server 10.0.2.54:80; server 10.0.2.54:81; server 10.0.2.54:82; server 10.0.2.54:83; server 10.0.2.54:84;
        server 10.0.2.54:85; server 10.0.2.54:86; server 10.0.2.54:87; server 10.0.2.54:88; server 10.0.2.54:89;
        server 10.0.2.54:90; server 10.0.2.54:91; server 10.0.2.54:92; server 10.0.2.54:93; server 10.0.2.54:94;
        server 10.0.2.54:95; server 10.0.2.54:96; server 10.0.2.54:97; server 10.0.2.54:98; server 10.0.2.54:99;
        server 10.0.2.38:80; server 10.0.2.38:81; server 10.0.2.38:82; server 10.0.2.38:83; server 10.0.2.38:84;
        server 10.0.2.38:85; server 10.0.2.38:86; server 10.0.2.38:87; server 10.0.2.38:88; server 10.0.2.38:89;
        server 10.0.2.38:90; server 10.0.2.38:91; server 10.0.2.38:92; server 10.0.2.38:93; server 10.0.2.38:94;
        server 10.0.2.38:95; server 10.0.2.38:96; server 10.0.2.38:97; server 10.0.2.38:98; server 10.0.2.38:99;
        server 10.0.2.16:80; server 10.0.2.16:81; server 10.0.2.16:82; server 10.0.2.16:83; server 10.0.2.16:84;
        server 10.0.2.16:85; server 10.0.2.16:86; server 10.0.2.16:87; server 10.0.2.16:88; server 10.0.2.16:89;
        server 10.0.2.16:90; server 10.0.2.16:91; server 10.0.2.16:92; server 10.0.2.16:93; server 10.0.2.16:94;
        server 10.0.2.16:95; server 10.0.2.16:96; server 10.0.2.16:97; server 10.0.2.16:98; server 10.0.2.16:99;
    }

    server {
        listen 0.0.0.0:1000;
        proxy_pass stream_upstream;
    }
}

http {
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/conf.d/backend/*.conf;

    #open_file_cache max=200000 inactive=20s;
    #open_file_cache_valid 30s;
    #open_file_cache_min_uses 2;
    #open_file_cache_errors on;

    # to boost I/O on HDD we can disable access logs
    access_log off;

    # copies data between one FD and other from within the kernel
    # faster than read() + write()
    sendfile on;

    # send headers in one piece, it is better than sending them one by one
    tcp_nopush on;

    keepalive_timeout 300;
    keepalive_requests 999999999;
}

## /etc/nginx/conf.d/grpc.conf
upstream grpcsvr {
    server 10.0.2.210:50010; server 10.0.2.210:50011; server 10.0.2.210:50012; server 10.0.2.210:50013; server 10.0.2.210:50014;
    server 10.0.2.210:50015; server 10.0.2.210:50016; server 10.0.2.210:50017; server 10.0.2.210:50018; server 10.0.2.210:50019;
    server 10.0.2.210:50020; server 10.0.2.210:50021; server 10.0.2.210:50022; server 10.0.2.210:50023; server 10.0.2.210:50024;
    server 10.0.2.210:50025; server 10.0.2.210:50026; server 10.0.2.210:50027; server 10.0.2.210:50028; server 10.0.2.210:50029;
    server 10.0.2.139:50010; server 10.0.2.139:50011; server 10.0.2.139:50012; server 10.0.2.139:50013; server 10.0.2.139:50014;
    server 10.0.2.139:50015; server 10.0.2.139:50016; server 10.0.2.139:50017; server 10.0.2.139:50018; server 10.0.2.139:50019;
    server 10.0.2.139:50020; server 10.0.2.139:50021; server 10.0.2.139:50022; server 10.0.2.139:50023; server 10.0.2.139:50024;
    server 10.0.2.139:50025; server 10.0.2.139:50026; server 10.0.2.139:50027; server 10.0.2.139:50028; server 10.0.2.139:50029;
    server 10.0.2.54:50010; server 10.0.2.54:50011; server 10.0.2.54:50012; server 10.0.2.54:50013; server 10.0.2.54:50014;
    server 10.0.2.54:50015; server 10.0.2.54:50016; server 10.0.2.54:50017; server 10.0.2.54:50018; server 10.0.2.54:50019;
    server 10.0.2.54:50020; server 10.0.2.54:50021; server 10.0.2.54:50022; server 10.0.2.54:50023; server 10.0.2.54:50024;
    server 10.0.2.54:50025; server 10.0.2.54:50026; server 10.0.2.54:50027; server 10.0.2.54:50028; server 10.0.2.54:50029;
    server 10.0.2.38:50010; server 10.0.2.38:50011; server 10.0.2.38:50012; server 10.0.2.38:50013; server 10.0.2.38:50014;
    server 10.0.2.38:50015; server 10.0.2.38:50016; server 10.0.2.38:50017; server 10.0.2.38:50018; server 10.0.2.38:50019;
    server 10.0.2.38:50020; server 10.0.2.38:50021; server 10.0.2.38:50022; server 10.0.2.38:50023; server 10.0.2.38:50024;
    server 10.0.2.38:50025; server 10.0.2.38:50026; server 10.0.2.38:50027; server 10.0.2.38:50028; server 10.0.2.38:50029;
    server 10.0.2.16:50010; server 10.0.2.16:50011; server 10.0.2.16:50012; server 10.0.2.16:50013; server 10.0.2.16:50014;
    server 10.0.2.16:50015; server 10.0.2.16:50016; server 10.0.2.16:50017; server 10.0.2.16:50018; server 10.0.2.16:50019;
    server 10.0.2.16:50020; server 10.0.2.16:50021; server 10.0.2.16:50022; server 10.0.2.16:50023; server 10.0.2.16:50024;
    server 10.0.2.16:50025; server 10.0.2.16:50026; server 10.0.2.16:50027; server 10.0.2.16:50028; server 10.0.2.16:50029;
    least_conn;
}

server {
    #listen 0.0.0.0:50010 ssl;
    listen 0.0.0.0:50010;
    http2 on;
    # ssl_certificate /server.crt;
    # ssl_certificate_key /server.key;

    location / {
        grpc_pass grpc://grpcsvr;
        add_header Last-Modified $date_gmt;
        add_header Cache-Control 'no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0';
        if_modified_since off;
        expires off;
        etag off;
    }
}
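One workaround we could imagine (a sketch only; the values are illustrative assumptions, not tested recommendations) is to bound the lifetime of idle streams with the grpc module's timeouts, so that a stream's request pool is eventually destroyed and its allocations released:

```nginx
location / {
    grpc_pass grpc://grpcsvr;
    # ngx_http_grpc_module timeouts: if there is no activity in the given
    # direction for this long, nginx closes the stream, which destroys its
    # request pool and releases everything allocated from it.  Clients must
    # then be prepared to reconnect.
    grpc_read_timeout  5m;
    grpc_send_timeout  5m;
}
```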
Change History (4)
by , 10 months ago
Attachment: nginx_simple_topology.png added

by , 10 months ago
Attachment: nginx_memory_massif.png added

by , 10 months ago
Attachment: vanilla_nginx_400k_conn_grpc.png added

comment:1 by , 10 months ago
Description: modified (diff)