#1468 closed defect (duplicate)
408 generated during active transfer to upstream
Reported by: | Krzysztof Malinowski | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | |
Component: | other | Version: | 1.11.x |
Keywords: | upstream proxy timeout | Cc: | |
uname -a: | Linux 662f1e7c418c 3.10.0-327.36.1.el7.x86_64 #1 SMP Wed Aug 17 03:02:37 EDT 2016 x86_64 GNU/Linux | ||
nginx -V: |
nginx version: nginx/1.11.10
built by gcc 4.9.2 (Debian 4.9.2-10)
built with OpenSSL 1.0.1t 3 May 2016
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie' --with-debug
Description
I have nginx set up as a reverse proxy to a backend server. During large (~60 GB) file uploads, the connection sometimes gets terminated due to a client request body timeout (HTTP 408) even though the transfer is still running (as observed via strace and by the backend still receiving data).
Relevant part of configuration:
worker_processes 4;

events {
    worker_connections 1024;
}

http {
    sendfile on;
    #tcp_nopush on;
    keepalive_timeout 65;
    #gzip on;
    include /etc/nginx/conf.d/*.conf;
}

server {
    listen 443 ssl http2;

    location / {
        proxy_pass http://upstream:8081;
        include conf.d/proxy-settings.inc;
        add_header Access-Control-Allow-Origin $http_origin always;
        add_header Access-Control-Expose-Headers Server always;
    }
}

# configuration file /etc/nginx/conf.d/proxy-settings.inc:
if ($http_x_forwarded_proto = '') {
    set $http_x_forwarded_proto $scheme;
}
chunked_transfer_encoding on;
client_max_body_size 0;
proxy_read_timeout 900;
proxy_pass_header Server;
proxy_cookie_path ~*^/.* /;
proxy_set_header X-Forwarded-Port $server_port;
proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto;
proxy_set_header Host $http_host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
include conf.d/response-buffering.inc;

# configuration file /etc/nginx/conf.d/response-buffering.inc:
proxy_buffering on;
proxy_buffers 10240 16k;
proxy_busy_buffers_size 10240k;
proxy_buffer_size 8k;
proxy_max_temp_file_size 0;

# configuration file /etc/nginx/conf.d/request_buffering.conf:
proxy_request_buffering off;
The upload is randomly terminated with 408 at varying points in the file: after ~2 GB, ~10 GB, ~20 GB, or any other amount; sometimes the upload also finishes correctly.
I enabled debug logging while stracing the worker process at the same time. I noticed that the behavior is triggered by an EAGAIN on the client->nginx connection:
strace (in local time - GMT+1):
17:51:18.225939 read(11, ""..., 33093) = 33093 <0.000042>
17:51:18.226527 writev(12, [...], 1) = 8192 <0.000061>
17:51:18.227442 writev(12, [...], 1) = 8192 <0.000058>
17:51:18.228364 writev(12, [...], 1) = 8192 <0.000058>
17:51:18.229238 writev(12, [...], 1) = 8192 <0.000058>
17:51:18.229624 read(11, ""..., 16146) = 8236 <0.000022>
17:51:18.229680 read(11, 0x7f6bed03513a, 7910) = -1 EAGAIN (Resource temporarily unavailable) <0.000018>
17:51:18.237114 epoll_wait(17, {{EPOLLIN|EPOLLOUT, {u32=3975967392, u64=140101514195616}}}, 512, 60000) = 1 <0.000017>
17:51:18.237652 read(11, ""..., 7910) = 7910 <0.000103>
17:51:18.238506 writev(12, [...], 1) = 8192 <0.000120>
17:51:18.239582 writev(12, [...], 1) = 8192 <0.000061>
17:51:18.239995 read(11, ""..., 33093) = 33093 <0.000081>
17:51:18.240682 writev(12, [...], 1) = 8192 <0.000083>
17:51:18.241743 writev(12, [...], 1) = 8192 <0.000063>
17:51:18.242684 writev(12, [...], 1) = 8192 <0.000087>
17:51:18.243681 writev(12, [...], 1) = 8192 <0.000080>
<--- read/writev continues (fd:11 == client->nginx, fd:12 == nginx->backend) --->
19:03:41.032687 read(11, ""..., 33093) = 32826 <0.000102>
19:03:41.034186 writev(12, [...], 1) = 8192 <0.000081>
19:03:41.036440 writev(12, [...], 1) = 8192 <0.000089>
19:03:41.038823 writev(12, [...], 1) = 8192 <0.000095>
19:03:41.040972 writev(12, [...], 1) = 8192 <0.000086>
19:03:41.041956 read(11, 0x7f6bed033003, 33093) = -1 EAGAIN (Resource temporarily unavailable) <0.000045>
19:03:41.043862 epoll_wait(17, {{EPOLLOUT, {u32=3975967392, u64=140101514195616}}}, 512, 60000) = 1 <0.000049>
19:03:41.046868 close(12) = 0 <0.000092>
19:03:41.049310 write(5, ""..., 341) = 341 <0.007480>
19:03:41.058692 close(11) = 0 <0.000099>
19:03:41.059661 epoll_wait(17, <detached ...>
error.log (in GMT):
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http read client request body
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 SSL_read: 8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http client request body recv 8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http body new buf t:1 f:0 00007F6BECFB0230, pos 00007F6BECFB0230, size: 8192 file: 0, size: 0
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 chain writer buf fl:1 s:8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 chain writer in: 00007F6BECF543A0
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 writev: 8192 of 8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 chain writer out: 0000000000000000
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http read client request body
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 SSL_read: 8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http client request body recv 8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http body new buf t:1 f:0 00007F6BECFB0230, pos 00007F6BECFB0230, size: 8192 file: 0, size: 0
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 chain writer buf fl:1 s:8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 chain writer in: 00007F6BECF543A0
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 writev: 8192 of 8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 chain writer out: 0000000000000000
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http read client request body
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 SSL_read: -1
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 SSL_get_error: 2
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http client request body recv -2
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http client request body rest 36277250217
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 event timer add: 11: 60000:1516726338237
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http run request: "/path/to/backend?"
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http upstream check client, write event:1, "/path/to/backend"
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 http run request: "/path/to/backend?"
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 http upstream check client, write event:1, "/path/to/backend"
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 event timer del: 11: 1516726338237
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 http run request: "/path/to/backend?"
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 http upstream read request handler
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 finalize http upstream request: 408
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 finalize http proxy request
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 free rr peer 1 0
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 close http upstream connection: 12
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 free: 00007F6BECFC6140, unused: 48
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 reusable connection: 0
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 http finalize request: 408, "/path/to/backend?" a:1, c:1
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 http terminate request count:1
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 http terminate cleanup count:1 blk:0
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 http posted request: "/path/to/backend?"
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 http terminate handler count:1
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 http request count:1 blk:0
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 http close request
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 http log handler
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 free: 00007F6BECFB0230
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 free: 00007F6BECFBD240, unused: 15
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 free: 00007F6BECF53650, unused: 138
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 close http connection: 11
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 SSL_shutdown: 1
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 reusable connection: 0
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 free: 00007F6BECFA8580
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 free: 00007F6BECF3A970, unused: 0
2018/01/23 18:03:41 [debug] 65378#65378: *1147733 free: 00007F6BECF6C6D0, unused: 464
Note that error.log does not have any entries between 16:51:18 and 18:03:41, which corresponds to those EAGAINs from strace. The timer is set (client_body_timeout, I presume) before going into "http upstream check client", and then it is not clear where the code goes. strace shows that data is still being proxied between the client and the backend, but error.log only comes back to life when another EAGAIN shows up, and the connection is then immediately terminated due to the expired timer (after 1h12m, although the timer itself is set to the default of 60s).
It is possible that this behavior only shows up with HTTP/1.1. I haven't been able to reproduce it with HTTP/2 yet; with HTTP/1.1 it happens quite often, though irregularly.
Am I missing something in the configuration, or is it a bug in the code? How can I debug what nginx is doing between 16:51:18 and 18:03:41? Any help appreciated.
Attachments (2)
Change History (16)
comment:1 by , 7 years ago
The pause between 16:51:18 and 18:03:41 suggests that the request should have been terminated much earlier, at 16:52:18. It looks like the system time was changed, and this is what causes both the timeout (see #189: large system time changes can cause unexpected timer expiration) and the missing log entries. Could you please check whether the system time indeed changed?
comment:2 by , 7 years ago
No, it's not. In fact, my strace shows read/writev sequences ongoing for the whole period of time, up until another EAGAIN, which seems to wake up nginx logging.
comment:3 by , 7 years ago
Please show the full strace then, or at least a couple of minutes at each end.
by , 7 years ago
Attachment: | excerpt-from-strace.zip added |
---|
Excerpt from strace showing continuous transfer between EAGAINs.
comment:4 by , 7 years ago
OK, I attached an excerpt from the strace. This is one worker process traced on a test instance of nginx that manifests the behavior. No other transfers were running through this test instance; regular transfers to the same backend were handled by the production nginx.
My complete strace file is over 2 GB, so I just copied an excerpt starting after the first EAGAIN and ending before the last one. Everything in between looks similar: a sequence of read/writev calls throughout the whole hour of the transfer.
comment:5 by , 7 years ago
Ah, ok, I see what's going on here. The problem is that both the client and the upstream server are fast enough, so neither reading from the client nor writing to the upstream server blocks, and nginx is busy doing read() / writev() between the sockets in a loop for the whole hour, similar to how it happens in websocket proxying in ticket #1431. See the comments there for a detailed explanation.
Similar to the ticket mentioned, a workaround would be to use a larger buffer for reading the body, to trigger blocking with higher probability; see http://nginx.org/r/client_body_buffer_size.
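For illustration, a minimal sketch of this workaround (the 1m value is a hypothetical starting point, not taken from this ticket, and needs tuning):

    # hypothetical example: inside the server{} or location / block shown in the description
    # read the request body in larger chunks so that reading from the client
    # hits EAGAIN more often and nginx returns to the event loop
    client_body_buffer_size 1m;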
comment:6 by , 7 years ago (follow-up: comment:7)
Thank you for the hint, I will try the workaround. Still, I have some questions:
- which code path does nginx take in this scenario? I couldn't figure it out, as all the paths I suspected should have logged something in the debug log, yet nothing is logged until the next EAGAIN shows up,
- if processing continues normally (read/write goes on), why doesn't the timer expiration fire? And why doesn't the timer get deleted?
I am also a bit cautious about the connection monopolization mentioned in ticket #1431. This was a test instance and no other connections were being served; only one from the client and one to the upstream. There was nothing else to serve, but it seems the processing escaped the regular event loop? Is that possible? I believe we should be hitting ngx_http_do_read_client_request_body(), but then it should at least log the http read client request body message and remove the timer; still, neither happens.
comment:7 by , 7 years ago
Replying to raspy@…:
Thank you for the hint, I will try the workaround. Still, I have some questions:
- which code path does nginx take in this scenario? I couldn't figure it out, as all the paths I suspected should have logged something in the debug log, yet nothing is logged until the next EAGAIN shows up,
It loops in ngx_http_upstream_send_request_body(), calling ngx_http_read_unbuffered_request_body() and ngx_output_chain(). This code path does have debug logging, but all the log lines use the time cached at the start of the event loop iteration. As the code never returns to the event loop, the time is not updated and is the same for all lines. That is, the following lines in your log show the last iterations of the loop in ngx_http_upstream_send_request_body() just before EAGAIN:
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http read client request body
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 SSL_read: 8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http client request body recv 8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http body new buf t:1 f:0 00007F6BECFB0230, pos 00007F6BECFB0230, size: 8192 file: 0, size: 0
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 chain writer buf fl:1 s:8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 chain writer in: 00007F6BECF543A0
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 writev: 8192 of 8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 chain writer out: 0000000000000000
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http read client request body
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 SSL_read: 8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http client request body recv 8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http body new buf t:1 f:0 00007F6BECFB0230, pos 00007F6BECFB0230, size: 8192 file: 0, size: 0
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 chain writer buf fl:1 s:8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 chain writer in: 00007F6BECF543A0
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 writev: 8192 of 8192
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 chain writer out: 0000000000000000
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 http read client request body
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 SSL_read: -1
2018/01/23 16:51:18 [debug] 65378#65378: *1147733 SSL_get_error: 2
- if processing continues normally (read/write goes on), why doesn't the timer expiration fire? And why doesn't the timer get deleted?
Timers are checked once per event loop iteration; they are not expected to interrupt running code.
comment:8 by , 7 years ago
OK, so I tried with a 16k buffer and the issue still exists. I have gathered another strace, though, and it looks quite interesting. There are more EAGAINs now during the transmission, and it seems that every EAGAIN toggles the behavior: the first EAGAIN blocks logging at http upstream check client, then the next one unblocks it. This works as long as each blocked interval is shorter than client_body_timeout; once it is longer, the connection gets disconnected.
comment:9 by , 7 years ago
As explained in #1431, for the issue to disappear you have to use a buffer comparable to the socket buffers. That is, try something like 2m (or even more, depending on /proc/sys/net/ipv4/tcp_rmem on your system).
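As a rough sketch (the numbers are illustrative, not from this ticket): if /proc/sys/net/ipv4/tcp_rmem reads something like "4096 87380 6291456", the kernel may grow the socket read buffer to roughly 6 MB, so a body buffer on the order of megabytes would be needed:

    # illustrative sizing: make the body buffer comparable to the kernel's
    # maximum TCP receive buffer (the third value of net.ipv4.tcp_rmem)
    client_body_buffer_size 2m;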
comment:10 by , 7 years ago
Please see the stops in logging/timer handling between 14:18:03/14:18:09, 14:18:09/14:18:13, and 14:18:13/14:18:16. To illustrate, let me show the following fragment:
strace.log
15:18:09.238059 read(11, 0x7f6becf9da93, 33093) = -1 EAGAIN (Resource temporarily unavailable) <0.000061>
15:18:09.240822 epoll_wait(17, {{EPOLLOUT, {u32=3975982752, u64=140101514210976}}}, 512, 60000) = 1 <0.000065>
15:18:09.241548 epoll_wait(17, {{EPOLLIN|EPOLLOUT, {u32=3975982752, u64=140101514210976}}}, 512, 54025) = 1 <0.000765>
15:18:09.243778 read(11, ""..., 33093) = 16413 <0.000155>
15:18:09.245830 writev(12, [...], 1) = 16384 <0.000136>
15:18:09.246668 read(11, 0x7f6becf9da93, 33093) = -1 EAGAIN (Resource temporarily unavailable) <0.000058>
15:18:09.248655 epoll_wait(17, {{EPOLLIN|EPOLLOUT, {u32=3975982752, u64=140101514210976}}}, 512, 60000) = 1 <0.000038>
15:18:09.250392 read(11, ""..., 33093) = 32826 <0.000104>
15:18:09.251544 writev(12, [...], 1) = 16384 <0.000085>
15:18:09.253010 writev(12, [...], 1) = 16384 <0.000075>
15:18:09.253720 read(11, 0x7f6becf9da93, 33093) = -1 EAGAIN (Resource temporarily unavailable) <0.000024>
15:18:09.254895 epoll_wait(17, {{EPOLLIN|EPOLLOUT, {u32=3975982752, u64=140101514210976}}}, 512, 60000) = 1 <0.002547>
15:18:09.259629 read(11, ""..., 33093) = 16413 <0.000217>
error.log
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http upstream check client, write event:1, "/path/to/backend"
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http run request: "/path/to/backend?"
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http upstream read request handler
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http upstream send request
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http upstream send request body
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http read client request body
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 SSL_read: 16384
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http client request body recv 16384
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http body new buf t:1 f:0 00007F6BECF5C8B0, pos 00007F6BECF5C8B0, size: 16384 file: 0, size: 0
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 event timer del: 11: 1516889943265
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 chain writer buf fl:1 s:16384
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 chain writer in: 00007F6BECF7A200
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 writev: 16384 of 16384
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 chain writer out: 0000000000000000
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http read client request body
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 SSL_read: -1
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 SSL_get_error: 2
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http client request body recv -2
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http client request body rest 59215456425
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 event timer add: 11: 60000:1516889949242
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http run request: "/path/to/backend?"
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http upstream check client, write event:1, "/path/to/backend"
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http run request: "/path/to/backend?"
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http upstream read request handler
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http upstream send request
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http upstream send request body
2018/01/25 14:18:09 [debug] 47244#47244: *1147747 http read client request body
I see it as follows:
15:18:09.238059 - EAGAIN arrives and unblocks logging --> http run request
15:18:09.243778 - read(11, ... --> SSL_read: 16384
15:18:09.245830 - writev(12, ... --> writev: 16384 of 16384
15:18:09.246668 - EAGAIN arrives and blocks logging --> http upstream check client
15:18:09.250392 - read(11, ... --> ???
15:18:09.251544 - writev(12, ... --> ???
15:18:09.253010 - writev(12, ... --> ???
15:18:09.253720 - EAGAIN arrives and unblocks logging --> http run request
Am I missing something? This pattern now repeats many times, until finally one blocked section is 4m1s long and the connection gets terminated with 408.
comment:11 by , 7 years ago
OK, let me try with a bigger buffer. I still feel that this behavior is unintuitive.
One more question: is it possible to get error.log debug output with millisecond resolution? It would be easier to match against the strace logs.
comment:12 by , 7 years ago
Resolution: | → duplicate |
---|---|
Status: | new → closed |
No, error logs can only show seconds. Further, as I already explained in comment:7, logging uses the time obtained at the start of the event loop iteration. As such, EAGAIN does not "block and unblock" logging, but rather allows nginx to return to the event loop and update the time used for logging.
(As a hack, you may try using timer_resolution. On Linux it will result in the time being updated via signals, independently of the event loop. This will result in the correct time being used in the logs, so you won't be able to see "blocked" logging. It won't resolve the root issue, though; rather, it will make it harder to detect.)
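For reference, a minimal sketch of that hack (the 100ms interval is an arbitrary example):

    # main-context directive (top level of nginx.conf): update the cached time
    # at a fixed interval via a signal instead of only when the worker returns
    # to the event loop
    timer_resolution 100ms;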
Either way, the fact that you now observe "more EAGAINs" actually indicates that even the 16k buffer helped a lot. The problem in your original setup was that EAGAINs did not happen at all. Using a larger buffer will likely resolve the problem completely.
As this is clearly another variant of #1431 in a slightly different scenario, I'm closing this as a duplicate of #1431.
comment:13 by , 7 years ago
Thank you for pointing out the code that loops. I looked through it and you are right, this code logs as well.
I looked through my error.log again and you are also right: the timestamp is only updated when nginx hits EAGAIN and gets back to the event loop. So it seems that the actual reason for this behavior is the timestamp becoming stale.
When the client and the server are both fast enough that neither blocks, all the processing is done with the same timestamp. I believe it goes like this:
- EAGAIN happens on the client connection
  - sets up the client_body_timeout timer
- nginx gets back to the event loop; ngx_process_events_and_timers() calls:
  - ngx_process_events()
    - updates the time (possibly jumping e.g. 1 hour ahead)
    - processes the upstream connection (handler: ngx_http_upstream_check_broken_connection())
  - ngx_event_expire_timers()
    - expires the client_body_timeout timer because of the time update
- the next ngx_process_events_and_timers() iteration calls:
  - ngx_process_events()
    - updates the time (probably not by much this time, since checking the broken connection does not take long)
    - processes the client connection (handler: ngx_http_read_client_request_body_handler()), which terminates the connection with 408 due to the expired timer
It seems to me that to trigger the issue, two EAGAINs need to happen on the client connection: the first one sets the timer, and the second (if it comes more than client_body_timeout later) terminates the connection. This would not happen if the current timestamp were updated more often. Unfortunately, timer_resolution will not help, as the signal handler only sets a flag for ngx_process_events() to update the timestamp, so the update still will not happen before returning to the main loop.
I still consider this behavior a bug. When the timer is set, the cached current time is already in the past, so the timeout is inaccurate.
comment:14 by , 7 years ago
One more idea: maybe the timestamp update could stay as infrequent as it is, but be forced whenever a timer is set up? Otherwise a newly set timer may already have its expiration in the past.