#2555 closed defect (fixed)

Race when both aio threads and background subrequests are used

Reported by: guo.bojun@…
Owned by:
Priority: minor
Milestone:
Component: nginx-core
Version: 1.25.x
Keywords:
Cc:
uname -a:
nginx -V: not provided

Description

When a specific combination of configuration directives is used (e.g. aio threads together with proxy_cache_background_update on), multiple threads can operate on the same request connection structure, and https://hg.nginx.org/nginx/file/tip/src/event/ngx_event_openssl.c#l3361
can cause an ngx_ssl_recv() call from another thread to access a NULL c->ssl pointer.

Change History (8)

comment:1 by Maxim Dounin, 14 months ago

nginx -V: modified (diff)

Could you please provide more details? A test case which demonstrates the problem would be helpful, as well as "nginx -V".

Note that threaded operations are not expected to access connection structures, but rather implement simple primitives, such as reading from a file, writing to a file, or sending a file with the sendfile() syscall to a specific socket. As such, it is certainly not possible that multiple threads can operate on the same connection structure. Further, the ngx_ssl_shutdown() function you've linked is not expected to be called unless all background subrequests have been completed. If you observe issues with a particular configuration, there should be another explanation.

comment:2 by guo.bojun@…, 14 months ago

I'm not sure it is feasible to reproduce the crash reliably, and the actual production configuration is too complicated to trim down, but I'm confident it is along these lines: the configuration changes were introduced recently, the crashes started only after that, and removing any one of the elements stops them.

The race is probably related to the c->recv member in particular: a thread may have already entered the function it previously pointed to (ngx_ssl_recv()) when ngx_ssl_shutdown() was called from another thread, or the write to c->recv was simply not seen by other threads without a memory barrier.

comment:3 by Maxim Dounin, 14 months ago

I'm not sure it is feasible to reproduce the crash reliably, and the actual production configuration is too complicated to trim down, but I'm confident it is along these lines: the configuration changes were introduced recently, the crashes started only after that, and removing any one of the elements stops them.

You may start with simple debugging steps then. In particular:

  • Make sure you are able to reproduce the crash without 3rd party modules. Unfortunately, many 3rd party modules are incorrectly coded and might easily introduce bugs, especially in complex configurations involving aio and subrequests.
  • Obtain a core dump and show a backtrace.
  • Obtain a debug log to investigate (as long as nginx crashes, logging to a memory buffer would be the cheapest solution).

Unfortunately, without either a way to reproduce or at least some debugging information it is not possible to help any further.

The race is probably related to the c->recv member in particular: a thread may have already entered the function it previously pointed to (ngx_ssl_recv()) when ngx_ssl_shutdown() was called from another thread, or the write to c->recv was simply not seen by other threads without a memory barrier.

To re-iterate: nginx only uses the connection structure from the main thread, and the race you describe is completely impossible.

comment:4 by guo.bojun@…, 14 months ago

[12002403.260268] nginx[2338740]: segfault at 10 ip 0000559a1feaa422 sp 00007ffc6e125860 error 4 in nginx[559a1fe3a000+1b1000]
[12002403.260452] Code: 41 55 49 89 f5 41 54 49 89 d4 55 53 48 89 fb 48 83 ec 18 48 8b 97 98 00 00 00 64 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 <48> 8b 42 10 f6 42 5a 08 0f 85 20 01 00 00 48 83 f8 ff 0f 84 c6 01

PID: 2338740 (nginx)
UID: 991 (nginx)
GID: 991 (nginx)

Signal: 11 (SEGV)

Timestamp: Sun 2023-11-05 00:57:00 CST (1min 55s ago)

Command Line: nginx: worker process

Executable: /usr/sbin/nginx

Control Group: /system.slice/nginx.service

Unit:

Slice: system.slice

Boot ID: 9c70f5b23534449da41def7174b7a8fb

Machine ID: 16fa26e4e84d4c2b9c2c30d82c682a82

Hostname: localhost.localdomain

Storage: none
Message: Process 2338740 (nginx) of user 991 dumped core.

Stack trace of thread 2338740:
#0 0x0000559a1feaa422 ngx_ssl_recv (nginx)
#1 0x0000559a1feaa822 ngx_ssl_recv_chain (nginx)
#2 0x0000559a1fe9d0c5 ngx_event_pipe (nginx)
#3 0x0000559a1fee48ed ngx_http_upstream_process_downstream (nginx)
#4 0x0000559a1fee3946 ngx_http_upstream_thread_event_handler (nginx)
#5 0x0000559a1fea57b7 ngx_thread_pool_handler (nginx)
#6 0x0000559a1fea3ae8 ngx_epoll_notify_handler (nginx)
#7 0x0000559a1fea3f7e ngx_epoll_process_events (nginx)
#8 0x0000559a1fe9a078 ngx_process_events_and_timers (nginx)
#9 0x0000559a1fea1eb9 ngx_worker_process_cycle (nginx)
#10 0x0000559a1fea079f ngx_spawn_process (nginx)
#11 0x0000559a1fea22c4 ngx_start_worker_processes (nginx)
#12 0x0000559a1fea2d7c ngx_master_process_cycle (nginx)
#13 0x0000559a1fe75738 main (nginx)
#14 0x00007feb40172d85 libc_start_main (libc.so.6)
#15 0x0000559a1fe7579e _start (nginx)

Stack trace of thread 2338803:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338782:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338796:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338790:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338800:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338788:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338791:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338792:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338779:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338797:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338777:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338780:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338787:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338784:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338781:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338778:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338786:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338793:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338789:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338783:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338799:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338801:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338805:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338795:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338802:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338806:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338794:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338804:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338808:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338807:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338798:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)

Stack trace of thread 2338785:
#0 0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2 0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3 0x00007feb413481ca start_thread (libpthread.so.0)
#4 0x00007feb40171e73 clone (libc.so.6)


comment:5 by Maxim Dounin, 14 months ago

Status: new → accepted

Thanks for the backtrace.

The backtrace as shown suggests that an upstream-related aio operation (that is, an aio write) completed, and the completion handler triggered a read from the upstream connection in the context of a subrequest which was already finalized, likely due to an error in the main request context.

I was able to reproduce this with nginx instrumented to introduce artificial 5 seconds delay in ngx_thread_write_chain_to_file_handler(), and the following configuration:

server {
    listen 8080;

    location / {
        proxy_pass http://127.0.0.1:8081/t/4k;
        proxy_cache one;
        proxy_cache_valid 200 1s;
        proxy_cache_background_update on;
        proxy_cache_use_stale updating;
        aio threads;
        aio_write on;
        limit_rate 1000;
    }
}

server {
    listen 8081;
    limit_rate 1000;
    # /t/4k is a 4k file
}

by terminating the client while it was reading the response, once the background update had been triggered. Without SSL to the upstream server, this results in readv() EBADF errors instead (since the connection is already closed):

2023/11/05 05:29:22 [alert] 26362#100160: *4 readv() failed (9: Bad file descriptor) while reading upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.1", subrequest: "/", upstream: "http://127.0.0.1:8081/t/4k", host: "127.0.0.1:8080"

I'll take a look at how to fix this properly.

comment:6 by Maxim Dounin, 13 months ago

Posted a patch series here which fixes this:

https://mailman.nginx.org/pipermail/nginx-devel/2023-November/WZKQYXIBZFLDUEK5CMLHPMPIGPE2MFN4.html

Specifically, this issue is fixed by the first patch in the series, available here:

https://mailman.nginx.org/pipermail/nginx-devel/2023-November/IAXAOKRZYCFTEYOA5THCHOGBHUBSGMNG.html

Review and testing appreciated.

comment:7 by Maxim Dounin <mdounin@…>, 11 months ago

In 9203:0de20f43db25/nginx:

Fixed request termination with AIO and subrequests (ticket #2555).

When a request was terminated due to an error via ngx_http_terminate_request()
while an AIO operation was running in a subrequest, various issues were
observed. This happened because ngx_http_request_finalizer() was only set
in the subrequest where ngx_http_terminate_request() was called, but not
in the subrequest where the AIO operation was running. After completion
of the AIO operation normal processing of the subrequest was resumed, leading
to issues.

In particular, in case of the upstream module, termination of the request
called upstream cleanup, which closed the upstream connection. Attempts to
further work with the upstream connection after AIO operation completion
resulted in segfaults in ngx_ssl_recv(), "readv() failed (9: Bad file
descriptor) while reading upstream" errors, or socket leaks.

In ticket #2555, issues were observed with the following configuration
with cache background update (with thread writing instrumented to
introduce a delay, when a client closes the connection during an update):

location = /background-and-aio-write {
    proxy_pass ...
    proxy_cache one;
    proxy_cache_valid 200 1s;
    proxy_cache_background_update on;
    proxy_cache_use_stale updating;
    aio threads;
    aio_write on;
    limit_rate 1000;
}

Similarly, the same issue can be seen with SSI, and can be caused by
errors in subrequests, such as in the following configuration
(where "/proxy" uses AIO, and "/sleep" returns 444 after some delay,
causing request termination):

location = /ssi-active-boom {
    ssi on;
    ssi_types *;
    return 200 '
        <!--#include virtual="/proxy" -->
        <!--#include virtual="/sleep" -->
    ';
    limit_rate 1000;
}

Or the same with both AIO operation and the error in non-active subrequests
(which needs slightly different handling, see below):

location = /ssi-non-active-boom {
    ssi on;
    ssi_types *;
    return 200 '
        <!--#include virtual="/static" -->
        <!--#include virtual="/proxy" -->
        <!--#include virtual="/sleep" -->
    ';
    limit_rate 1000;
}

Similarly, issues can be observed with just static files. However,
with static files potential impact is limited due to timeout safeguards
in ngx_http_writer(), and the fact that c->error is set during request
termination.

In a simple configuration with an AIO operation in the active subrequest,
such as in the following configuration, the connection is closed right
after completion of the AIO operation anyway, since ngx_http_writer()
tries to write to the connection and fails due to c->error set:

location = /ssi-active-static-boom {
    ssi on;
    ssi_types *;
    return 200 '
        <!--#include virtual="/static-aio" -->
        <!--#include virtual="/sleep" -->
    ';
    limit_rate 1000;
}

In the following configuration, with an AIO operation in a non-active
subrequest, the connection is closed only after send_timeout expires:

location = /ssi-non-active-static-boom {
    ssi on;
    ssi_types *;
    return 200 '
        <!--#include virtual="/static" -->
        <!--#include virtual="/static-aio" -->
        <!--#include virtual="/sleep" -->
    ';
    limit_rate 1000;
}

Fix is to introduce r->main->terminated flag, which is to be checked
by AIO event handlers when the r->main->blocked counter is decremented.
When the flag is set, handlers are expected to wake up the connection
instead of the subrequest (which might be already cleaned up).

Additionally, now ngx_http_request_finalizer() is always set in the
active subrequest, so waking up the connection properly finalizes the
request even if termination happened in a non-active subrequest.

comment:8 by Maxim Dounin, 11 months ago

Resolution: fixed
Status: accepted → closed

Fix committed, thanks for reporting this.
