Opened 6 months ago

Closed 5 months ago

#2558 closed defect (invalid)

Memory Leak (maybe WebSocket)

Reported by: Roman Owned by:
Priority: minor Milestone:
Component: nginx-core Version: 1.24.x
Keywords: memory leak websocket Cc: Roman
uname -a: Linux 5.10.7-1.el7.elrepo.x86_64 #1 SMP Mon Jan 11 17:32:42 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.24.0
built by gcc 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)
built with OpenSSL 1.0.2k-fips 26 Jan 2017
TLS SNI support enabled
configure arguments: --add-module=debian/modules/nginx_upstream_check_module-0.4.0 --add-module=debian/modules/ngx_http_geoip2_module --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --with-http_gzip_static_module --with-http_stub_status_module --http-client-body-temp-path=/home/nginx/tmp/body-temp-path --http-proxy-temp-path=/home/nginx/tmp/proxy-temp-path --http-fastcgi-temp-path=/home/nginx/tmp/fastcgi-temp-path --with-http_ssl_module --with-http_realip_module --with-http_addition_module --with-http_dav_module --with-http_flv_module --with-http_stub_status_module --with-http_image_filter_module --with-http_secure_link_module --with-http_auth_request_module --with-ipv6 --with-stream --with-http_v2_module --with-http_geoip_module --with-stream_geoip_module

Description (last modified by Roman)

We make light use of WebSockets.

And sometimes we execute

kill -USR1 $(cat /var/run/nginx-81.pid)

to reopen the log files, and nginx creates many workers in "shutting down" mode. These workers keep running for a long time; the next time we reopen the log files (about 1200 files), nginx again creates shutting-down workers, and some time later the server starts to consume a lot of RAM.
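For illustration, a minimal sketch of the rotate-then-reopen sequence (the log paths here are assumptions; in reality about 1200 files are rotated):

mv /var/log/nginx/access.log /var/log/nginx/access.log.1   # rename the old log file(s)
kill -USR1 $(cat /var/run/nginx-81.pid)                     # ask the master process to reopen log files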

https://trac.nginx.org/nginx/attachment/ticket/2558/ram.png

On the chart you can see the drops: that is the OOM Killer.
How can I debug this memory problem?

Attachments (2)

Снимок экрана от 2023-11-07 16-32-25.png (56.4 KB) - added by Roman 6 months ago.
ram.png
ram.png (56.4 KB) - added by Roman 6 months ago.


Change History (11)

comment:1 by Roman, 6 months ago

Description: modified (diff)

comment:2 by Roman, 6 months ago

Description: modified (diff)

by Roman, 6 months ago

Attachment: ram.png added

comment:3 by Roman, 6 months ago

Description: modified (diff)

comment:4 by Roman, 6 months ago

Description: modified (diff)

comment:5 by Maxim Dounin, 6 months ago

Reopening log files with kill -USR1 is not expected to result in any shutting down workers. Could you please double-check it's actually what you are doing?

My best guess is that you are using kill -HUP instead, which is a configuration reload. Configuration reload is expected to start new worker processes with new configuration, while gracefully shutting down old worker processes. That is, old worker processes will stop accepting new requests, but already running requests will be serviced. For WebSocket connections this means that old worker processes will keep running till all the WebSocket connections are closed, which might take a while.

To facilitate graceful configuration reload it is recommended to close WebSocket connections periodically from the backend side, and make sure connections are re-opened by clients appropriately. Alternatively, the worker_shutdown_timeout directive can be used to forcefully drop such long connections after a specified timeout.

This shouldn't be needed for reopening logs though - reopening logs is a simple operation, and nginx can do it without starting new worker processes, see here. Make sure to use USR1 signal to instruct nginx to reopen logs, and not HUP, which causes reconfiguration.
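For illustration, a minimal sketch of the directive in the main context of nginx.conf; the value here is just an example, not a recommendation:

worker_shutdown_timeout 10m;   # close connections still open 10 minutes after a graceful shutdown starts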

in reply to:  5 ; comment:6 by Roman, 6 months ago

Replying to Maxim Dounin:

Reopening log files with kill -USR1 is not expected to result in any shutting down workers. Could you please double-check it's actually what you are doing?

My best guess is that you are using kill -HUP instead, which is a configuration reload. Configuration reload is expected to start new worker processes with new configuration, while gracefully shutting down old worker processes. That is, old worker processes will stop accepting new requests, but already running requests will be serviced. For WebSocket connections this means that old worker processes will keep running till all the WebSocket connections are closed, which might take a while.

To facilitate graceful configuration reload it is recommended to close WebSocket connections periodically from the backend side, and make sure connections are re-opened by clients appropriately. Alternatively, the worker_shutdown_timeout directive can be used to forcefully drop such long connections after a specified timeout.

This shouldn't be needed for reopening logs though - reopening logs is a simple operation, and nginx can do it without starting new worker processes, see here. Make sure to use USR1 signal to instruct nginx to reopen logs, and not HUP, which causes reconfiguration.

Sorry. I rechecked, and we use kill -USR1 for log rotation.

kill -HUP we use much less often, for actual configuration reloads, because our service sometimes needs its configuration reloaded.

But I want to know about worker_shutdown_timeout: if I set it to one hour, is that acceptable for WebSocket?

Last edited 6 months ago by Roman (previous) (diff)

in reply to:  6 ; comment:7 by Maxim Dounin, 6 months ago

Replying to Roman:

Replying to Maxim Dounin:

Reopening log files with kill -USR1 is not expected to result in any shutting down workers. Could you please double-check it's actually what you are doing?

Sorry. I rechecked, and we use kill -USR1 for log rotation.

To be sure, consider configuring nginx error logging at the global level to at least the notice level, so that nginx will log the signals it receives and all worker processes it starts. For example, here is kill -USR1:

2023/11/08 04:45:43 [notice] 33032#100133: signal 30 (SIGUSR1) received from 942, reopening logs
2023/11/08 04:45:43 [notice] 33032#100133: reopening logs
2023/11/08 04:45:43 [notice] 33032#100133: signal 23 (SIGIO) received
2023/11/08 04:45:43 [notice] 33033#100149: reopening logs
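For reference, the notice-level logging shown here corresponds to a global error_log directive along these lines (the path is an assumption):

error_log /var/log/nginx/error.log notice;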

And kill -HUP:

2023/11/08 04:46:34 [notice] 33032#100133: signal 1 (SIGHUP) received from 942, reconfiguring
2023/11/08 04:46:34 [notice] 33032#100133: reconfiguring
2023/11/08 04:46:34 [notice] 33032#100133: using the "kqueue" event method
2023/11/08 04:46:34 [notice] 33032#100133: start worker processes
2023/11/08 04:46:34 [notice] 33032#100133: start worker process 33036
2023/11/08 04:46:35 [notice] 33032#100133: signal 23 (SIGIO) received
2023/11/08 04:46:35 [notice] 33033#100149: gracefully shutting down
2023/11/08 04:46:35 [notice] 33033#100149: exiting
2023/11/08 04:46:35 [notice] 33033#100149: exit
2023/11/08 04:46:35 [notice] 33032#100133: signal 23 (SIGIO) received
2023/11/08 04:46:35 [notice] 33032#100133: signal 20 (SIGCHLD) received from 33033
2023/11/08 04:46:35 [notice] 33032#100133: worker process 33033 exited with code 0
2023/11/08 04:46:35 [notice] 33032#100133: signal 23 (SIGIO) received
2023/11/08 04:46:35 [notice] 33032#100133: signal 23 (SIGIO) received
2023/11/08 04:46:35 [notice] 33032#100133: signal 23 (SIGIO) received

Similarly, you can match worker processes' startup times, as reported by ps -ef, with the appropriate signals. This will make it possible to understand what actually causes shutting down worker processes to appear in your case.
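For illustration, a rough sketch of such matching (the error log path is an assumption):

ps -ef | grep '[n]ginx'    # worker start times are shown in the STIME column
grep -E 'SIGUSR1|SIGHUP|reopening logs|reconfiguring' /var/log/nginx/error.log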

kill -HUP we use much less often, for actual configuration reloads, because our service sometimes needs its configuration reloaded.

This might be the actual cause of the shutting down worker processes you are seeing. As said, kill -USR1 is not expected to result in new worker processes being started.

But I want to know about worker_shutdown_timeout: if I set it to one hour, is that acceptable for WebSocket?

With worker_shutdown_timeout, WebSocket connections that still remain open when the timeout expires will be forcefully closed by nginx. This might cause issues if a WebSocket application does not expect the connection to be closed - or it might not; this depends on the particular WebSocket application. On the other hand, this is certainly better than the OOM Killer.

As explained above, the recommended solution is to make sure that the particular WebSocket application can properly handle connection close at least at certain moments, and to close the connections periodically from the server side. Still, it is understood that such a solution is not always possible, hence the worker_shutdown_timeout band-aid: it provides some time for connections to shut down gracefully, and forcefully closes the remaining ones.
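For illustration, a minimal sketch of how these pieces might fit together in a configuration; the backend address, URI, and timeout values are assumptions, not recommendations:

worker_shutdown_timeout 10m;                     # main context: cap graceful shutdown time

events { }

http {
    server {
        listen 80;

        location /ws/ {
            proxy_pass http://127.0.0.1:8080;    # hypothetical WebSocket backend
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_read_timeout 1h;               # timeout between successive reads on the connection
        }
    }
}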

in reply to:  7 comment:8 by Roman, 6 months ago

Replying to Maxim Dounin:


Now I want to set worker_shutdown_timeout to 1h and test this configuration.
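For reference, the plan is roughly the following; the pid file path is the one from the description above:

worker_shutdown_timeout 1h;               # in nginx.conf, main context
kill -HUP $(cat /var/run/nginx-81.pid)    # reload so the new directive takes effect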

comment:9 by Maxim Dounin, 5 months ago

Resolution: invalid
Status: new → closed

Feedback timeout. No details provided to support the claim that kill -USR1 results in shutting down worker processes, and the behaviour of shutting down worker processes is as expected.
