Opened 3 years ago

Closed 3 years ago

#2293 closed defect (worksforme)

Nginx 1.20.1 excessive memory usage

Reported by: luyangliang@… Owned by:
Priority: major Milestone:
Component: nginx-core Version: 1.19.x
Keywords: Cc:
uname -a: bash-4.2$ uname -a
Linux tenant-router-7b95bc584c-s6mcx 5.4.17-2102.205.7.3.el7uek.x86_64 #2 SMP Fri Sep 17 16:52:13 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.20.2

Description (last modified by luyangliang@…)

Running Nginx 1.20.x with HTTP/2 traffic at 225 req/s, we see memory usage keep increasing. Within a couple of hours it reached 1.5 GB, and the memory was never released.

We rolled back to 1.19.1 and memory usage is much better, but there still appears to be a slow memory leak as traffic runs.

In the nginx.conf for 1.20.1 we had to replace http2_max_requests with keepalive_requests, since the directive is now obsolete.

We have:

proxy_buffering off;
proxy_cache off;

From meminfo we still see cache usage growing as traffic runs.
The config for 1.20.x is attached.
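
Roughly, the directive change between the two versions looks like this (a minimal sketch based only on the relevant lines, not our full config):

# 1.19.1
http2_max_requests           10000;
http2_max_concurrent_streams 1024;

# 1.20.1 (http2_max_requests is obsolete, keepalive_requests is used instead)
keepalive_requests           10000;
http2_max_concurrent_streams 1024;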

Change History (11)

comment:1 by luyangliang@…, 3 years ago

Description: modified (diff)

I can't attach the configuration. I will share it when it is needed.

comment:2 by Maxim Dounin, 3 years ago

First of all, please elaborate on how you measure memory usage. Additionally, please show the "nginx -V" output and the keepalive_requests setting in your configuration.

From meminfo we still see cache usage growing as traffic runs.

Note that "Cached" in meminfo is memory in the pagecache, and expected to grow under load per OS caching algorithms, regardless of whether you use nginx cache or not.

comment:3 by luyangliang@…, 3 years ago

For nginx 1.20.1, we configure:
keepalive_requests 10000;
http2_max_concurrent_streams 1024;

We run Nginx as pods in Kubernetes and were monitoring memory via the pods' memory usage.
In 1.19.1 we configure:
http2_max_requests 10000;
http2_max_concurrent_streams 1024;

comment:4 by Maxim Dounin, 3 years ago

keepalive_requests 10000;

You may want to check if "keepalive_requests 1000;" (which is the default in 1.20.x) makes any difference.

We run Nginx as pods in Kubernetes and were monitoring memory via the pods' memory usage.

How exactly do you monitor memory usage of pods? What makes you think that the memory usage is from nginx, and not OS activity such as the pagecache mentioned earlier?

Also, sorry to repeat it again, but please show "nginx -V" output.

comment:5 by luyangliang@…, 3 years ago

Thanks for getting back!

bash-4.2$ nginx -V
nginx version: nginx/1.20.1
built by gcc 4.8.5 20150623 (Red Hat 4.8.5-44.0.3) (GCC)
built with OpenSSL 1.0.2k-fips  26 Jan 2017
TLS SNI support enabled
configure arguments: --with-cc-opt='-fstack-protector-all' --with-ld-opt='-Wl,-z,relro,-z,now' --sbin-path=/usr/sbin/nginx --conf-path=/etc/nginx/nginx.conf --pid-path=/tmp/nginx.pid --http-log-path=/dev/stdout --error-log-path=/dev/stdout --http-client-body-temp-path=/tmp/client_temp --http-proxy-temp-path=/tmp/proxy_temp --http-fastcgi-temp-path=/tmp/fastcgi_temp --http-uwsgi-temp-path=/tmp/uwsgi_temp --http-scgi-temp-path=/tmp/scgi_temp --with-file-aio --with-http_v2_module --with-http_ssl_module --with-http_stub_status_module --with-pcre --with-stream --with-stream_ssl_module --with-threads

I will try "keepalive_requests 1000".
We monitor the container memory usage, which is just Nginx, and the only change is the version: 1.19.1 vs. 1.20.1. Can you recommend a way to monitor Nginx memory usage?
Thanks!

in reply to:  5 comment:6 by Maxim Dounin, 3 years ago

Replying to luyangliang@…:

Thanks for getting back!

bash-4.2$ nginx -V
nginx version: nginx/1.20.1
built by gcc 4.8.5 20150623 (Red Hat 4.8.5-44.0.3) (GCC)
built with OpenSSL 1.0.2k-fips  26 Jan 2017
TLS SNI support enabled
configure arguments: --with-cc-opt='-fstack-protector-all' --with-ld-opt='-Wl,-z,relro,-z,now' --sbin-path=/usr/sbin/nginx --conf-path=/etc/nginx/nginx.conf --pid-path=/tmp/nginx.pid --http-log-path=/dev/stdout --error-log-path=/dev/stdout --http-client-body-temp-path=/tmp/client_temp --http-proxy-temp-path=/tmp/proxy_temp --http-fastcgi-temp-path=/tmp/fastcgi_temp --http-uwsgi-temp-path=/tmp/uwsgi_temp --http-scgi-temp-path=/tmp/scgi_temp --with-file-aio --with-http_v2_module --with-http_ssl_module --with-http_stub_status_module --with-pcre --with-stream --with-stream_ssl_module --with-threads

Thanks. Are aio on;, aio threads;, or aio_write on; actually used in the configuration? Are any additional modules loaded via the load_module directive in the configuration?

I will try "keepalive_requests 1000".

You may also review other settings affected by the version change, notably http2_recv_timeout, http2_idle_timeout, and keepalive_timeout.
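
For reference, the removed HTTP/2 directives map onto the unified ones roughly as follows (illustrative values close to the old HTTP/2 defaults, not a recommendation; note these directives apply to HTTP/1.x connections as well):

client_header_timeout 30s;   # stands in for the removed http2_recv_timeout (old default 30s)
keepalive_timeout     3m;    # stands in for the removed http2_idle_timeout (old default 3m)
keepalive_requests    1000;  # stands in for the removed http2_max_requests (old default 1000)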

Also, could you please show the various buffer and connection limit settings, so we can estimate memory usage: notably worker_processes, worker_connections, proxy_buffer_size, proxy_buffers (and/or grpc_buffer_size and grpc_buffers if gRPC proxying is used, and/or the relevant fastcgi/scgi/uwsgi values).

We monitor the container memory usage, which is just Nginx, and the only change is the version: 1.19.1 vs. 1.20.1. Can you recommend a way to monitor Nginx memory usage?

Usually, monitoring the VIRT column in top for nginx processes is a good way to monitor nginx memory usage. The RES column might also be interesting. Note that these numbers are expected to grow when nginx starts, yet they are expected to stabilize under load after some time.

If you see it still growing at the same rate, e.g., after a couple of hours, this might indicate that something is wrong - that is, some memory or socket leak. If it stabilizes instead, this likely means it has reached the size corresponding to the load and the configuration.

It might also be a good idea to monitor the number of connections (active/reading/writing) as reported by the stub_status module (and/or the OS). These numbers can greatly simplify understanding what is going on.
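
For example, something along these lines exposes those counters on a local port (a minimal sketch; the port and location name are arbitrary, and your build already includes --with-http_stub_status_module):

server {
  listen 127.0.0.1:8080;        # arbitrary local-only monitoring port
  location = /basic_status {
    stub_status;                # reports active, reading, writing, waiting connections
    allow 127.0.0.1;
    deny all;
  }
}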

comment:7 by luyangliang@…, 3 years ago

Setting keepalive_requests to 1000 actually improves things a lot. Memory does stabilize after a couple of hours at the same traffic rate.

The client side starts with 4 connections and sends requests round-robin. We used to set keepalive_requests to 10000 to match the client-side setting, because if Nginx releases the connection when the max request count is hit, the client receives a 503 (while still using the expired connection). To avoid the 503s, we also set the client-side max requests per connection to 10000, so that the client disconnects first. Any suggestion on how to handle this? The default of 1000 is a little low under our traffic load.

We don't load any other modules via load_module. aio on;, aio threads;, and aio_write on; are not used in nginx.conf.

These are the other related configuration settings:

events {
  use                            epoll;
  worker_connections             1024;
  multi_accept                   on;
}

http {
  # Required: Headers with underscores
  underscores_in_headers         on;

  # Recommended: Tuning (Unofficial)
  proxy_busy_buffers_size        256k;
  proxy_buffers                  4 256k;
  proxy_buffer_size              128k;
  proxy_read_timeout             300s;
  client_max_body_size           10m;
  server_names_hash_bucket_size  256;
  variables_hash_bucket_size     256;
  sendfile                       on;
  keepalive_timeout              120;
  keepalive_requests             1000;  # just updated to 1000

  server {

    ....
    http2_max_concurrent_streams 1024;

    proxy_buffering off;
    ....
  }
}

Last edited 3 years ago by luyangliang@… (previous) (diff)

in reply to:  7 comment:8 by Maxim Dounin, 3 years ago

Replying to luyangliang@…:

Setting keepalive_requests to 1000 actually improves things a lot. Memory does stabilize after a couple of hours at the same traffic rate.

So it looks like it's just a question of memory consumed in your configuration, not a leak. You may want to try using larger keepalive_requests to see if it stabilizes as well, probably at some larger memory size.

It would also be interesting to compare the memory used with the number of active connections, to see whether memory usage scales linearly with the number of connections, or whether memory used per connection depends on keepalive_requests (that is, whether there are some per-request allocations).

Because if Nginx releases the connection when the max request count is hit, the client receives a 503 (while still using the expired connection).

Note that this means that the client probably needs improvement.

These are the other related configuration settings:

  worker_connections             1024;
  proxy_buffers                  4 256k;
  proxy_buffer_size              128k;
    http2_max_concurrent_streams 1024;
    proxy_buffering off;

So the configuration allows up to about 1 million parallel requests per worker (worker_connections * http2_max_concurrent_streams), and at least up to 128k of memory per request (proxy_buffer_size, given that proxy_buffering is switched off). This gives 128 gigabytes of maximum memory usage per worker process, which is clearly not reached.

BTW, what's the body_buffer_size value? One of the changes in HTTP/2 code in 1.19.x is that it is now more likely to use this buffer. If it's set to a large value, this might explain the change in overall memory usage you observe.

comment:9 by luyangliang@…, 3 years ago

We did not configure body_buffer_size, so it is using the default of 16k.
This nginx supports up to 10 customers; we should support a traffic rate of more than 2250 req/s. Each customer has clients from both browsers and an on-premise traffic generator. I have tried decreasing proxy_buffer_size, but some of the browser requests would fail (due to large cookie sizes).
I will reduce http2_max_concurrent_streams, since that is per connection.
We still don't understand the large memory usage difference between 1.19 and 1.20 when using 10k keepalive_requests.
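
For the stream reduction, this is roughly what we plan to test (the value is a first guess, not validated yet):

http2_max_concurrent_streams 128;   # was 1024; limits parallel streams (requests) per connection
# with proxy_buffering off, the worst case per worker is roughly
# worker_connections * streams * proxy_buffer_size = 1024 * 128 * 128k ≈ 16 GB
# (compared to the ~128 GB worst case with 1024 streams mentioned above)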

Last edited 3 years ago by luyangliang@… (previous) (diff)

comment:10 by Maxim Dounin, 3 years ago

We did not configure body_buffer_size, so it is using the default of 16k.

Err, sorry, client_body_buffer_size is certainly unrelated, as relevant changes are only in 1.21.x, not in 1.19.x.

We still don't understand the large memory usage difference between 1.19 and 1.20 when using 10k keepalive_requests.

My best guess is that the memory difference is due to a different number of connections being open in these versions. This might be caused by the unification of various settings between HTTP/1.x and HTTP/2. For example, http2_recv_timeout was 30s by default and was replaced with client_header_timeout, which is 60s by default. It would be interesting to check the number of connections being open in the different versions, and the memory-per-connection metrics.

Another important change in the 1.19.x branch which might also affect the number of open connections is the introduction of lingering close for HTTP/2, but that was already present in 1.19.1. There were some related SSL shutdown changes in subsequent versions, though, which might affect client behaviour.

comment:11 by Maxim Dounin, 3 years ago

Resolution: worksforme
Status: new → closed

Feedback timeout, so closing this. As previously found out in the comments, there is clearly no memory leak, and the difference in memory consumption in this particular configuration is likely explained by the different number of connections being kept alive in the different versions due to configuration changes.
