Opened 14 months ago
Closed 11 months ago
#2555 closed defect (fixed)
Race when both aio threads and background subrequests are used
Reported by: | — | Owned by: | —
---|---|---|---
Priority: | minor | Milestone: | —
Component: | nginx-core | Version: | 1.25.x
Keywords: | — | Cc: | —
uname -a: | — | nginx -V: | not provided
Description
When a specific combination of configuration directives is used (e.g. aio threads together with proxy_cache_background_update on), multiple threads can operate on the same request's connection structure, and the ngx_ssl_shutdown() call at https://hg.nginx.org/nginx/file/tip/src/event/ngx_event_openssl.c#l3361 can cause an ngx_ssl_recv() call from another thread to access a NULL c->ssl pointer.
Change History (8)
comment:1 by , 14 months ago
nginx -V: modified (diff)
comment:2 by , 14 months ago
I'm not sure it is feasible to reproduce the crash reliably, and the actual production configuration is too complicated to trim down, but I'm confident the problem is along these lines: the configuration changes were introduced recently, the crashes started only after that, and removing any one of the elements stops them.
The race is probably related to the c->recv member in particular: a thread may already have entered ngx_ssl_recv() when ngx_ssl_shutdown() is called from another thread, or the write to c->recv was simply not visible to other threads without a memory barrier.
comment:3 by , 14 months ago
> I'm not sure it is feasible to reproduce the crash reliably and the actual configuration on production is quite complicated to trim down, but I'm confident that it is along those lines, as the configuration changes were recently introduced and only since then the crash starts to happen. Removing any one of the elements also stops the crash.
You may start with simple debugging steps then. In particular:
- Make sure you are able to reproduce the crash without 3rd party modules. Unfortunately, many 3rd party modules are incorrectly coded and might easily introduce bugs, especially in complex configurations involving aio and subrequests.
- Obtain a core dump and show a backtrace.
- Obtain a debug log to investigate (as long as nginx crashes, logging to a memory buffer would be the cheapest solution).
Unfortunately, without either a way to reproduce or at least some debugging information it is not possible to help any further.
> The race is probably particularly related to the c->recv member that a thread may have already entered the previous function(ngx_ssl_recv()) during the ngx_ssl_shutdown() call from another thread or the write operation was simply not seen by other threads without a memory barrier.
To re-iterate: nginx only uses the connection structure from the main thread, and the race you describe is completely impossible.
comment:4 by , 14 months ago
[12002403.260268] nginx[2338740]: segfault at 10 ip 0000559a1feaa422 sp 00007ffc6e125860 error 4 in nginx[559a1fe3a000+1b1000]
[12002403.260452] Code: 41 55 49 89 f5 41 54 49 89 d4 55 53 48 89 fb 48 83 ec 18 48 8b 97 98 00 00 00 64 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 <48> 8b 42 10 f6 42 5a 08 0f 85 20 01 00 00 48 83 f8 ff 0f 84 c6 01

PID: 2338740 (nginx)  UID: 991 (nginx)  GID: 991 (nginx)
Signal: 11 (SEGV)  Timestamp: Sun 2023-11-05 00:57:00 CST (1min 55s ago)
Command Line: nginx: worker process
Executable: /usr/sbin/nginx
Control Group: /system.slice/nginx.service
Unit: Slice: system.slice
Boot ID: 9c70f5b23534449da41def7174b7a8fb
Machine ID: 16fa26e4e84d4c2b9c2c30d82c682a82
Hostname: localhost.localdomain
Storage: none
Message: Process 2338740 (nginx) of user 991 dumped core.

Stack trace of thread 2338740:
#0  0x0000559a1feaa422 ngx_ssl_recv (nginx)
#1  0x0000559a1feaa822 ngx_ssl_recv_chain (nginx)
#2  0x0000559a1fe9d0c5 ngx_event_pipe (nginx)
#3  0x0000559a1fee48ed ngx_http_upstream_process_downstream (nginx)
#4  0x0000559a1fee3946 ngx_http_upstream_thread_event_handler (nginx)
#5  0x0000559a1fea57b7 ngx_thread_pool_handler (nginx)
#6  0x0000559a1fea3ae8 ngx_epoll_notify_handler (nginx)
#7  0x0000559a1fea3f7e ngx_epoll_process_events (nginx)
#8  0x0000559a1fe9a078 ngx_process_events_and_timers (nginx)
#9  0x0000559a1fea1eb9 ngx_worker_process_cycle (nginx)
#10 0x0000559a1fea079f ngx_spawn_process (nginx)
#11 0x0000559a1fea22c4 ngx_start_worker_processes (nginx)
#12 0x0000559a1fea2d7c ngx_master_process_cycle (nginx)
#13 0x0000559a1fe75738 main (nginx)
#14 0x00007feb40172d85 __libc_start_main (libc.so.6)
#15 0x0000559a1fe7579e _start (nginx)

Stack traces of the remaining 32 threads (idle thread pool workers, TIDs 2338777-2338808) are identical:
#0  0x00007feb4134e45c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x0000559a1fea6444 ngx_thread_cond_wait (nginx)
#2  0x0000559a1fea5c1b ngx_thread_pool_cycle (nginx)
#3  0x00007feb413481ca start_thread (libpthread.so.0)
#4  0x00007feb40171e73 __clone (libc.so.6)
Turning off aio_write also prevents the crash.
comment:5 by , 14 months ago
Status: new → accepted
Thanks for the backtrace.
The backtrace as shown suggests that an upstream-related aio operation (that is, an aio write) completed, and the completion handler triggered a read from the upstream connection in the context of a subrequest which was already finalized, likely due to an error in the main request context.
I was able to reproduce this with nginx instrumented to introduce an artificial 5-second delay in ngx_thread_write_chain_to_file_handler(), and the following configuration:
server {
    listen 8080;

    location / {
        proxy_pass http://127.0.0.1:8081/t/4k;
        proxy_cache one;
        proxy_cache_valid 200 1s;
        proxy_cache_background_update on;
        proxy_cache_use_stale updating;
        aio threads;
        aio_write on;
        limit_rate 1000;
    }
}

server {
    listen 8081;
    limit_rate 1000;

    # /t/4k is a 4k file
}
by terminating the client while it was reading the response, after the background update had been triggered. Without SSL to the upstream server, this results in readv() EBADF errors (since the connection is already closed):
2023/11/05 05:29:22 [alert] 26362#100160: *4 readv() failed (9: Bad file descriptor) while reading upstream, client: 127.0.0.1, server: , request: "GET / HTTP/1.1", subrequest: "/", upstream: "http://127.0.0.1:8081/t/4k", host: "127.0.0.1:8080"
I'll take a look at how to fix this properly.
comment:6 by , 13 months ago
Posted a patch series here which fixes this:
https://mailman.nginx.org/pipermail/nginx-devel/2023-November/WZKQYXIBZFLDUEK5CMLHPMPIGPE2MFN4.html
Specifically, this issue is fixed by the first patch in the series, available here:
https://mailman.nginx.org/pipermail/nginx-devel/2023-November/IAXAOKRZYCFTEYOA5THCHOGBHUBSGMNG.html
Review and testing appreciated.
comment:8 by , 11 months ago
Resolution: → fixed
Status: accepted → closed
Fix committed, thanks for reporting this.
Could you please provide more details? A test case which demonstrates the problem would be helpful, as well as "nginx -V" output.
Note that threaded operations are not expected to access connection structures, but rather implement simple primitives, such as reading from a file, writing to a file, or sending a file to a specific socket with the sendfile() syscall. As such, it is certainly not possible for multiple threads to operate on the same connection structure. Further, the ngx_ssl_shutdown() function you've linked is not expected to be called unless all background subrequests have completed. If you observe issues with a particular configuration, there should be another explanation.