Opened 20 months ago

Closed 20 months ago

Last modified 13 months ago

#2383 closed defect (invalid)

nginx grpc has a bug: when the backend gRPC server response is over 64k, the request times out (HTTP 504)

Reported by: wangdeyi1987@… Owned by:
Priority: minor Milestone:
Component: documentation Version: 1.23.x
Keywords: grpc http2 grpc_pass Cc:
uname -a: Linux localhost 3.10.0-1160.25.1.el7.x86_64 #1 SMP Wed Apr 28 21:49:45 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: configure arguments: --prefix=/usr/local/nginx --with-http_v2_module --with-debu

Description

It happens when the gRPC response is over 64k:

2022/08/27 18:20:19 [debug] 5802#0: *24 http2 WINDOW_UPDATE frame sid:0 window:20471
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 frame complete pos:00007F4F83B09055 end:00007F4F83B09062
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 frame type:8 f:0 l:4 sid:0
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 WINDOW_UPDATE frame sid:0 window:20471
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 frame complete pos:00007F4F83B09062 end:00007F4F83B09062
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 frame out: 00000000022A63B8 sid:0 bl:0 len:8
2022/08/27 18:20:19 [debug] 5802#0: *24 writev: 17 of 17
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 frame sent: 00000000022A63B8 sid:0 bl:0 len:8
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 read handler
2022/08/27 18:20:19 [debug] 5802#0: *24 recv: eof:0, avail:-1
2022/08/27 18:20:19 [debug] 5802#0: *24 recv: fd:3 28 of 1048544
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 frame type:8 f:0 l:4 sid:0
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 WINDOW_UPDATE frame sid:0 window:65535
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 frame complete pos:00007F4F83B0901D end:00007F4F83B0902C
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 frame type:4 f:0 l:6 sid:0
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 SETTINGS frame
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 setting 4:131070
2022/08/27 18:20:19 [debug] 5802#0: *24 http2:1 adjusted window: 132836
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 frame complete pos:00007F4F83B0902C end:00007F4F83B0902C
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 frame out: 00000000022A63B8 sid:0 bl:0 len:0
2022/08/27 18:20:19 [debug] 5802#0: *24 writev: 9 of 9
2022/08/27 18:20:19 [debug] 5802#0: *24 http2 frame sent: 00000000022A63B8 sid:0 bl:0 len:0
2022/08/27 18:21:19 [debug] 5802#0: *24 event timer del: 4: 89994730
2022/08/27 18:21:19 [debug] 5802#0: *24 http upstream request: "/MemberSrv.Explains/config?"
2022/08/27 18:21:19 [debug] 5802#0: *24 http upstream process non buffered upstream
2022/08/27 18:21:19 [error] 5802#0: *24 upstream timed out (110: Connection timed out) while reading upstream, client: 192.168.56.1, server: , request: "POST /MemberSrv.Explains/config HTTP/2.0", upstream: "grpc://127.0.0.1:9513", host: "local.com:5513"
2022/08/27 18:21:19 [debug] 5802#0: *24 finalize http upstream request: 504
2022/08/27 18:21:19 [debug] 5802#0: *24 finalize grpc request
2022/08/27 18:21:19 [debug] 5802#0: *24 free rr peer 1 0
2022/08/27 18:21:19 [debug] 5802#0: *24 close http upstream connection: 4
2022/08/27 18:21:19 [debug] 5802#0: *24 run cleanup: 00000000022A56D0

Attachments (2)

normal.error.log (31.0 KB ) - added by wangdeyi1987@… 20 months ago.
bad.error.log (33.4 KB ) - added by wangdeyi1987@… 20 months ago.


Change History (7)

comment:1 by Maxim Dounin, 20 months ago

Could you please provide full debug log for the particular request? What is the upstream gRPC server you are testing with? Are you able to reproduce this issue with other servers?

From the provided debug log snippet it looks like the upstream server fails to send the response, and so nginx correctly reports the timeout error. That is, it looks like an issue in the upstream gRPC server.
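
For reference, one quick way to answer the last question is to put a known-good HTTP/2 implementation behind the same grpc_pass location. The Go sketch below is not part of this ticket and makes several assumptions (cleartext HTTP/2 on port 9513, a 128 KB dummy payload, no real gRPC framing or grpc-status trailer); it only serves a body larger than the 64 KB default window, which is enough to see whether the 504 reproduces with a different upstream:

package main

// Minimal cleartext HTTP/2 (h2c) backend returning a body larger than 64 KB.
// Sketch only: it is not a real gRPC server, and the port is hypothetical.

import (
	"log"
	"net/http"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

func main() {
	payload := make([]byte, 128*1024) // well above the 64 KB default HTTP/2 window

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/grpc")
		w.Write(payload) // golang.org/x/net/http2 handles flow control for us
	})

	srv := &http.Server{
		Addr:    ":9513", // hypothetical port, mirroring the ticket's upstream
		Handler: h2c.NewHandler(handler, &http2.Server{}),
	}
	log.Fatal(srv.ListenAndServe())
}

If nginx delivers this 128 KB response without timing out, the problem is specific to the original upstream server rather than to grpc_pass.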

in reply to:  1 comment:2 by wangdeyi1987@…, 20 months ago

Replying to Maxim Dounin:

Could you please provide full debug log for the particular request? What is the upstream gRPC server you are testing with? Are you able to reproduce this issue with other servers?

From the provided debug log snippet it looks like the upstream server fails to send the response, and so nginx correctly reports the timeout error. That is, it looks like an issue in the upstream gRPC server.

Thank you for answering my question!
You can use the following commands to try it; I have opened up the upstream gRPC server API.

This is the normal case, where the response content length is < 64K.
Command:
curl -v -X POST -H "Content-Type: application/grpc" -H "brandid:10" -H "brandcode:TCF" -d "11111" http://120.79.197.176:8513/MemberSrv.Explains/config --http2-prior-knowledge --output arthas
result:
Note: Unnecessary use of -X or --request, POST is already inferred.

  • Trying 120.79.197.176:8513... % Total % Received % Xferd Average Speed Time Time Time Current

Dload Upload Total Spent Left Speed

0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to 120.79.197.176 (120.79.197.176) port 8513 (#0)

  • Using HTTP2, server supports multiplexing
  • Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
  • h2h3 [:method: POST]
  • h2h3 [:path: /MemberSrv.Explains/config]
  • h2h3 [:scheme: http]
  • h2h3 [:authority: 120.79.197.176:8513]
  • h2h3 [user-agent: curl/7.85.0-DEV]
  • h2h3 [accept: */*]
  • h2h3 [content-type: application/grpc]
  • h2h3 [brandid: 10]
  • h2h3 [brandcode: TCF]
  • h2h3 [content-length: 5]
  • Using Stream ID: 1 (easy handle 0x16d50d0)

    POST /MemberSrv.Explains/config HTTP/2
    Host: 120.79.197.176:8513
    user-agent: curl/7.85.0-DEV
    accept: */*
    content-type: application/grpc
    brandid:10
    brandcode:TCF
    content-length: 5

} [5 bytes data]

  • We are completely uploaded and fine
  • Connection state changed (MAX_CONCURRENT_STREAMS == 128)!

< HTTP/2 200
< server: openresty/1.19.9.1
< date: Wed, 31 Aug 2022 08:12:08 GMT
< content-type: application/grpc
< content-length: 43169
< trailer: grpc-status, grpc-message
<
{ [8096 bytes data]
< grpc-status: 0
< grpc-message:
100 43174 100 43169 100 5 2599k 308 --:--:-- --:--:-- --:--:-- 2635k

  • Connection #0 to host 120.79.197.176 left intact

This is the bad case, where the response content length is > 64K.

Command:
curl -v -X POST -H "Content-Type: application/grpc" -H "brandid:5" -H "brandcode:SSE" -d "11111" http://120.79.197.176:8513/MemberSrv.Explains/config --http2-prior-knowledge --output arthas

result:
Note: Unnecessary use of -X or --request, POST is already inferred.

  • Trying 120.79.197.176:8513... % Total % Received % Xferd Average Speed Time Time Time Current

Dload Upload Total Spent Left Speed

0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to 120.79.197.176 (120.79.197.176) port 8513 (#0)

  • Using HTTP2, server supports multiplexing
  • Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
  • h2h3 [:method: POST]
  • h2h3 [:path: /MemberSrv.Explains/config]
  • h2h3 [:scheme: http]
  • h2h3 [:authority: 120.79.197.176:8513]
  • h2h3 [user-agent: curl/7.85.0-DEV]
  • h2h3 [accept: */*]
  • h2h3 [content-type: application/grpc]
  • h2h3 [brandid: 5]
  • h2h3 [brandcode: SSE]
  • h2h3 [content-length: 5]
  • Using Stream ID: 1 (easy handle 0x25250d0)

    POST /MemberSrv.Explains/config HTTP/2
    Host: 120.79.197.176:8513
    user-agent: curl/7.85.0-DEV
    accept: */*
    content-type: application/grpc
    brandid:5
    brandcode:SSE
    content-length: 5

} [5 bytes data]

  • We are completely uploaded and fine
  • Connection state changed (MAX_CONCURRENT_STREAMS == 128)!

< HTTP/2 200
< server: openresty/1.19.9.1
< date: Wed, 31 Aug 2022 08:15:27 GMT
< content-type: application/grpc
< content-length: 78473
< trailer: grpc-status, grpc-message
<
{ [8096 bytes data]

83 78478 83 65535 0 5 1106 0 0:01:10 0:00:59 0:00:11 0* HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
83 78478 83 65535 0 5 1091 0 0:01:11 0:01:00 0:00:11 0

  • Connection #0 to host 120.79.197.176 left intact

curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)

So I guess that when the response content length is > 64k, nginx fetches the content in chunks of 64k, but the gRPC server does not support that.
So how can I modify that chunk size (64k), e.g. change it to 128k? Or is there another way to fix the bug?

comment:3 by Maxim Dounin, 20 months ago

You can use the following commands to try it; I have opened up the upstream gRPC server API.

From the provided output it looks like you've opened access to an OpenResty proxy, and not to the origin/upstream gRPC server. Either way, it's not what I've asked about. To reiterate:

Could you please provide full debug log for the particular request? What is the upstream gRPC server you are testing with? Are you able to reproduce this issue with different upstream gRPC servers?

Note that 64k is special in HTTP/2 and gRPC, since it is the default window size. The fact that responses over 64k hang with the particular upstream gRPC server might indicate that the server has issues with window size handling (e.g., fails to process the initial window size setting).
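
To make the 64k figure concrete: an HTTP/2 sender may only transmit as much DATA as the smaller of its stream-level and connection-level send windows allows, and both default to 65535 octets. The small Go simulation below is only an illustration (it is not code from nginx or from the upstream server; the response size is taken from the failing curl request above) of why a peer that never applies the receiver's SETTINGS and WINDOW_UPDATE frames stops after exactly 65535 bytes:

package main

import "fmt"

func main() {
	// Default HTTP/2 flow-control windows (RFC 7540): both the per-stream
	// window and the connection window start at 65535 octets.
	streamWindow, connWindow := int64(65535), int64(65535)
	responseSize := int64(78473) // content-length of the failing response above

	sent := int64(0)
	for sent < responseSize {
		budget := streamWindow
		if connWindow < budget {
			budget = connWindow
		}
		if budget == 0 {
			// A sender that ignored the receiver's SETTINGS/WINDOW_UPDATE
			// frames blocks here forever, and the proxy eventually reports 504.
			break
		}
		chunk := responseSize - sent
		if chunk > budget {
			chunk = budget
		}
		sent += chunk
		streamWindow -= chunk
		connWindow -= chunk
	}
	fmt.Println(sent) // 65535, exactly where the transfer above stalls
}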

by wangdeyi1987@…, 20 months ago

Attachment: normal.error.log added

by wangdeyi1987@…, 20 months ago

Attachment: bad.error.log added

in reply to:  3 comment:4 by wangdeyi1987@…, 20 months ago

Replying to Maxim Dounin:

You can use the following commands to try it; I have opened up the upstream gRPC server API.

From the provided output it looks like you've opened access to an OpenResty proxy, and not to the origin/upstream gRPC server. Either way, it's not what I've asked about. To reiterate:

Could you please provide full debug log for the particular request? What is the upstream gRPC server you are testing with? Are you able to reproduce this issue with different upstream gRPC servers?

Note that 64k is special in HTTP/2 and gRPC, since it is the default window size. The fact that responses over 64k hang with the particular upstream gRPC server might indicate that the server has issues with window size handling (e.g., fails to process the initial window size setting).

Thanks!
I have uploaded the full debug logs as attachments: "normal.error.log" is for a gRPC server response with content length < 64k, "bad.error.log" for one > 64k.
The upstream gRPC server is Hyperf (a PHP framework).

comment:5 by Maxim Dounin, 20 months ago

Resolution: invalid
Status: new → closed

From the debug log, the upstream server sends a 64k response:

...
2022/09/01 15:21:30 [debug] 5135#0: *10 recv: eof:0, avail:-1
2022/09/01 15:21:30 [debug] 5135#0: *10 recv: fd:5 65658 of 131072
2022/09/01 15:21:30 [debug] 5135#0: *10 grpc response: 00004e010400000001887685c7eab2d92f5f8b1d75d0620d263d4c4d656440854d833505b3929acac8b21234da8fa526b2b22d495081cc5f6196df3dbf4a002a6e2d6a0802128076e0837190298b46ff0f0d8471d640e700400000000000000100000106e508641207535543434553531a13323032322d30392d30312031353a32313a3330229b8d0432a990013c70207374796c653d226d617267696e2d626f74746f6d3a203870783b2077686974652d73706163653a206e6f726d616c3b206c696e652d6865696768743a20323970783b223e3c7374726f6e673e3c7370616e207374796c653d22666f6e742d73697a653a20323170783b223e5468652043..., len: 65658
2022/09/01 15:21:30 [debug] 5135#0: *10 grpc frame: 1, len: 78, f:4, i:1
2022/09/01 15:21:30 [debug] 5135#0: *10 grpc parse header: start
...
2022/09/01 15:21:30 [debug] 5135#0: *10 grpc header done
...
2022/09/01 15:21:30 [debug] 5135#0: *10 grpc filter bytes:65571
2022/09/01 15:21:30 [debug] 5135#0: *10 grpc frame: 0, len: 16384, f:0, i:1
2022/09/01 15:21:30 [debug] 5135#0: *10 grpc output buf 00007F393346B070
2022/09/01 15:21:30 [debug] 5135#0: *10 grpc frame: 0, len: 16384, f:0, i:1
2022/09/01 15:21:30 [debug] 5135#0: *10 grpc output buf 00007F393346F079
2022/09/01 15:21:30 [debug] 5135#0: *10 grpc frame: 0, len: 16384, f:0, i:1
2022/09/01 15:21:30 [debug] 5135#0: *10 grpc output buf 00007F3933473082
2022/09/01 15:21:30 [debug] 5135#0: *10 grpc frame: 0, len: 16383, f:0, i:1
2022/09/01 15:21:30 [debug] 5135#0: *10 grpc output buf 00007F393347708B
...

and then fails to send any additional data till the timeout expires.

At the very start of a gRPC connection nginx sends a SETTINGS frame with SETTINGS_INITIAL_WINDOW_SIZE set to 0x7fffffff to ensure a large initial window size on newly created streams, and sends a WINDOW_UPDATE frame with a 0x7fff0000 update for stream id 0 to ensure a large window on the whole connection. That is, at this point the upstream server is allowed to send 2G of data, yet it stops at 64k for some reason.

As previously suggested, this looks like an incorrect flow control implementation in the upstream gRPC server.
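
For completeness, the bookkeeping a compliant HTTP/2 sender has to perform when it receives those frames looks roughly like the Go sketch below. This is an illustration of the flow-control rules in RFC 7540, not code from nginx or from Hyperf; the key point is that a SETTINGS_INITIAL_WINDOW_SIZE change must also be applied to streams that are already open (RFC 7540, section 6.9.2), and skipping that step leaves the sender stuck at 64k:

package main

import "fmt"

// Sender-side flow-control bookkeeping of a compliant HTTP/2 peer; an
// illustration only, not code from nginx or from the upstream server.
type sender struct {
	initialWindow int64            // current SETTINGS_INITIAL_WINDOW_SIZE
	connWindow    int64            // connection-level send window
	streamWindow  map[uint32]int64 // per-stream send windows
}

func newSender() *sender {
	return &sender{initialWindow: 65535, connWindow: 65535, streamWindow: map[uint32]int64{}}
}

// A SETTINGS_INITIAL_WINDOW_SIZE change must also adjust streams that are
// already open, by the delta (RFC 7540, section 6.9.2).
func (s *sender) onSettingsInitialWindowSize(v int64) {
	delta := v - s.initialWindow
	s.initialWindow = v
	for id := range s.streamWindow {
		s.streamWindow[id] += delta
	}
}

// WINDOW_UPDATE enlarges the connection window (stream 0) or one stream's window.
func (s *sender) onWindowUpdate(streamID uint32, inc int64) {
	if streamID == 0 {
		s.connWindow += inc
	} else {
		s.streamWindow[streamID] += inc
	}
}

// available is how much DATA the sender may emit on a stream right now.
func (s *sender) available(streamID uint32) int64 {
	if w := s.streamWindow[streamID]; w < s.connWindow {
		return w
	}
	return s.connWindow
}

func main() {
	s := newSender()
	s.streamWindow[1] = s.initialWindow // a stream opened with the default window

	// The frames nginx sends at the start of a gRPC upstream connection:
	s.onSettingsInitialWindowSize(0x7fffffff) // SETTINGS frame
	s.onWindowUpdate(0, 0x7fff0000)           // WINDOW_UPDATE for stream 0

	fmt.Println(s.available(1)) // 2147483647: the peer may send ~2G, not just 64k
}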
