Opened 11 months ago
Closed 10 months ago
#2558 closed defect (invalid)
Memory Leak (maybe WebSocket)
Reported by: | Roman | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | |
Component: | nginx-core | Version: | 1.24.x |
Keywords: | memory leak websocket | Cc: | Roman |
uname -a: | Linux 5.10.7-1.el7.elrepo.x86_64 #1 SMP Mon Jan 11 17:32:42 EST 2021 x86_64 x86_64 x86_64 GNU/Linux | ||
nginx -V: |
nginx version: nginx/1.24.0
built by gcc 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) built with OpenSSL 1.0.2k-fips 26 Jan 2017 TLS SNI support enabled configure arguments: --add-module=debian/modules/nginx_upstream_check_module-0.4.0 --add-module=debian/modules/ngx_http_geoip2_module --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --with-http_gzip_static_module --with-http_stub_status_module --http-client-body-temp-path=/home/nginx/tmp/body-temp-path --http-proxy-temp-path=/home/nginx/tmp/proxy-temp-path --http-fastcgi-temp-path=/home/nginx/tmp/fastcgi-temp-path --with-http_ssl_module --with-http_realip_module --with-http_addition_module --with-http_dav_module --with-http_flv_module --with-http_stub_status_module --with-http_image_filter_module --with-http_secure_link_module --with-http_auth_request_module --with-ipv6 --with-stream --with-http_v2_module --with-http_geoip_module --with-stream_geoip_module |
Description (last modified by )
We use WebSocket a little.
From time to time we run
kill -USR1 $(cat /var/run/nginx-81.pid)
to reopen log files, and nginx creates many workers in shutdown mode. These workers keep running for a long time, and the next time we reopen our log files (1200 files) nginx again creates shutting-down workers, so some time later the server starts to consume a lot of RAM.
On the chart you can see the drops: that is the OOM Killer.
How can I debug this memory problem?
Attachments (2)
Change History (11)
by , 11 months ago
Attachment: | Снимок экрана от 2023-11-07 16-32-25.png added |
---|
follow-up: 6 comment:5 by , 11 months ago
Reopening log files with kill -USR1 is not expected to result in any shutting down workers. Could you please double-check it's actually what you are doing?
My best guess is that you are using kill -HUP instead, which is a configuration reload. Configuration reload is expected to start new worker processes with new configuration, while gracefully shutting down old worker processes. That is, old worker processes will stop accepting new requests, but already running requests will be serviced. For WebSocket connections this means that old worker processes will keep running till all the WebSocket connections are closed, which might take a while.
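To get a rough idea of how many old workers are still draining connections and how much memory they hold, a sketch along these lines may help. It assumes Linux procps `ps` (RSS reported in KiB) and that such workers carry the process title "nginx: worker process is shutting down"; the function name `summarize_shutting_down` is made up for illustration.

```shell
# Summarize nginx workers that are still draining connections.
# Reads lines in `ps -eo pid,etime,rss,args` format on stdin and prints
# the number of shutting-down workers and their total RSS in KiB.
summarize_shutting_down() {
    awk '
        /nginx: worker process is shutting down/ { count++; rss += $3 }
        END { printf "workers=%d rss_kib=%d\n", count, rss }
    '
}

# Usage on a live host (assumes procps ps):
#   ps -eo pid,etime,rss,args | summarize_shutting_down
```

Running this before and after each signal should show whether the number of shutting-down workers grows with reloads or with log reopens.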
To facilitate graceful configuration reload it is recommended to close WebSocket connections periodically from the backend side, and make sure connections are re-opened by clients appropriately. Alternatively, the worker_shutdown_timeout directive can be used to forcefully drop such long connections after a specified timeout.
This shouldn't be needed for reopening logs though - reopening logs is a simple operation, and nginx can do it without starting new worker processes, see here. Make sure to use USR1 signal to instruct nginx to reopen logs, and not HUP, which causes reconfiguration.
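One way to avoid mixing the two signals up in rotation scripts is to name the intent and let a small helper pick the signal. This is only a sketch: the helper names `nginx_signal_for` and `nginx_signal` are hypothetical, and the pid file path in the usage comment is the one from the ticket description.

```shell
# Map an intent to the signal nginx expects.
#   reopen -> USR1 (reopen log files, workers keep running)
#   reload -> HUP  (re-read configuration, start new workers)
nginx_signal_for() {
    case "$1" in
        reopen) echo USR1 ;;
        reload) echo HUP ;;
        *) echo "unknown action: $1" >&2; return 1 ;;
    esac
}

# Send the signal for the given action to the master process
# whose pid is stored in the given pid file.
nginx_signal() {
    action=$1
    pidfile=$2
    sig=$(nginx_signal_for "$action") || return 1
    kill -s "$sig" "$(cat "$pidfile")"
}

# Example, e.g. from a log rotation script:
#   nginx_signal reopen /var/run/nginx-81.pid
```

The same two actions are also available as `nginx -s reopen` and `nginx -s reload`, provided the binary is invoked with the configuration that names the right pid file.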
follow-up: 7 comment:6 by , 11 months ago
Replying to Maxim Dounin:
Sorry. I rechecked, and we do use kill -USR1 for log rotation.
We use kill -HUP much less often, for real configuration reloads, because our service needs to reload the configuration.
But I want to know about worker_shutdown_timeout: if I set it to one hour, is that normal for WebSocket?
follow-up: 8 comment:7 by , 11 months ago
Replying to Roman:
Replying to Maxim Dounin:
Reopening log files with kill -USR1 is not expected to result in any shutting down workers. Could you please double-check it's actually what you are doing?
Sorry. I rechecked, and we do use kill -USR1 for log rotation.
To be sure, consider configuring nginx error logging at the global level at least to notice level, so nginx will log signals received, and all worker processes it starts. For example, here is what kill -USR1 produces:
2023/11/08 04:45:43 [notice] 33032#100133: signal 30 (SIGUSR1) received from 942, reopening logs
2023/11/08 04:45:43 [notice] 33032#100133: reopening logs
2023/11/08 04:45:43 [notice] 33032#100133: signal 23 (SIGIO) received
2023/11/08 04:45:43 [notice] 33033#100149: reopening logs
And kill -HUP:
2023/11/08 04:46:34 [notice] 33032#100133: signal 1 (SIGHUP) received from 942, reconfiguring
2023/11/08 04:46:34 [notice] 33032#100133: reconfiguring
2023/11/08 04:46:34 [notice] 33032#100133: using the "kqueue" event method
2023/11/08 04:46:34 [notice] 33032#100133: start worker processes
2023/11/08 04:46:34 [notice] 33032#100133: start worker process 33036
2023/11/08 04:46:35 [notice] 33032#100133: signal 23 (SIGIO) received
2023/11/08 04:46:35 [notice] 33033#100149: gracefully shutting down
2023/11/08 04:46:35 [notice] 33033#100149: exiting
2023/11/08 04:46:35 [notice] 33033#100149: exit
2023/11/08 04:46:35 [notice] 33032#100133: signal 23 (SIGIO) received
2023/11/08 04:46:35 [notice] 33032#100133: signal 20 (SIGCHLD) received from 33033
2023/11/08 04:46:35 [notice] 33032#100133: worker process 33033 exited with code 0
2023/11/08 04:46:35 [notice] 33032#100133: signal 23 (SIGIO) received
2023/11/08 04:46:35 [notice] 33032#100133: signal 23 (SIGIO) received
2023/11/08 04:46:35 [notice] 33032#100133: signal 23 (SIGIO) received
Similarly, you can match worker process startup times, as reported by ps -ef, against the corresponding signals. This will make it possible to understand what actually causes shutting down worker processes to appear in your case.
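The check described above could be scripted roughly as follows. This is a sketch, not an nginx tool: the function name `count_signal_events` is hypothetical, and it assumes the error log is written at notice level (for example `error_log /var/log/nginx/error.log notice;` in the main context) so that the "signal N (SIG...) received" lines are present. The signal number is matched loosely because it differs between systems.

```shell
# Count how often the master process was told to reopen logs (USR1)
# versus reload its configuration (HUP), based on notice-level lines
# in an nginx error log. Reads the files given as arguments, or stdin.
count_signal_events() {
    awk '
        /signal [0-9]+ \(SIGUSR1\) received/ { reopen++ }
        /signal [0-9]+ \(SIGHUP\) received/  { reload++ }
        END { printf "reopen=%d reload=%d\n", reopen, reload }
    ' "$@"
}

# Usage (path taken from the --error-log-path build option):
#   count_signal_events /var/log/nginx/error.log
```

If shutting-down workers appear only around the times of SIGHUP entries, the reloads rather than the log reopens are the likely cause.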
We use kill -HUP much less often, for real configuration reloads, because our service needs to reload the configuration.
This might be the actual cause for shutting down worker processes you are seeing. As said, kill -USR1 is not expected to result in new worker processes being started.
But I want to know about worker_shutdown_timeout: if I set it to one hour, is that normal for WebSocket?
With worker_shutdown_timeout, WebSocket connections which still remain open when the timeout expires will be forcefully closed by nginx. This certainly might cause issues if a WebSocket application does not expect connection to be closed - or might not, this depends on the particular WebSocket application. On the other hand, this is certainly better than the OOM Killer.
As explained above, the recommended solution is to make sure that the particular WebSocket application can properly handle connection close at least at certain moments, and close the connections periodically from the server side. Still, it is understood that such a solution is not always possible, hence the worker_shutdown_timeout band-aid: it provides some time for connections to shut down gracefully, and forcefully closes the remaining ones.
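For the one-hour value asked about, the directive would be written as `worker_shutdown_timeout 1h;` in the main (top-level) context of nginx.conf. A quick way to confirm what a given config file sets might look like the sketch below; `shutdown_timeout_of` is a hypothetical helper that does a plain-text scan only, so it does not follow include directives or handle unusual formatting.

```shell
# Print the value of worker_shutdown_timeout found in an nginx config
# file, or "unset" when the directive is absent. Commented-out lines
# are ignored because their first field is not the directive name.
shutdown_timeout_of() {
    awk '
        $1 == "worker_shutdown_timeout" { sub(/;.*/, "", $2); value = $2 }
        END { print (value == "" ? "unset" : value) }
    ' "$1"
}

# Usage (path taken from the --conf-path build option):
#   shutdown_timeout_of /etc/nginx/nginx.conf
```

After changing the value, `nginx -t` can be used to validate the configuration before sending HUP.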
comment:8 by , 10 months ago
Replying to Maxim Dounin:
Now I want to set worker_shutdown_timeout to 1h and test this configuration.
comment:9 by , 10 months ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
Feedback timeout. No details provided to support the claim that kill -USR1 results in shutting down worker processes, and the behaviour of shutting down worker processes is as expected.