Opened 3 years ago

Last modified 3 years ago

#2224 reopened enhancement

HTTP/2 in nginx does not use double-GOAWAY for graceful connection shutdown

Reported by: ejona.google.com@… Owned by:
Priority: minor Milestone:
Component: nginx-module Version: 1.19.x
Keywords: Cc: ejona.google.com@…
uname -a:
nginx -V: nginx version: nginx/1.19.8

Description

As defined in RFC 7540 §6.8:

A server that is attempting to gracefully shut down a connection SHOULD send an initial GOAWAY frame with the last stream identifier set to 231-1 and a NO_ERROR code. This signals to the client that a shutdown is imminent and that initiating further requests is prohibited. After allowing time for any in-flight stream creation (at least one round-trip time), the server can send another GOAWAY frame with an updated last stream identifier. This ensures that a connection can be cleanly shut down without losing requests.

I see multiple nginx tickets where clients are blamed for not retrying. But I saw no mention of the RFC recommendation nor the latency impact caused by nginx's behavior. Statements like "It does not seem to be possible to resolve this on nginx side" seem inaccurate.
https://trac.nginx.org/nginx/ticket/1250
https://trac.nginx.org/nginx/ticket/1590
https://trac.nginx.org/nginx/ticket/2155

I've seen users having trouble with this when interacting with grpc-java in the past, but only now chose to file an issue. Historically it seems users have increased keepalive_requests to reduce the rate of failures. It is becoming a bit more noticeable now because grpc-java has improved its error reporting to distinguish the case where a failure was caused by abrupt GOAWAY, so it is easier to notice poorly-behaved servers. This came up this time as part of https://github.com/grpc/grpc-java/issues/8310, but I have a resolution available for that issue.

I understand that nginx would need to put some limits on the number of additional RPCs and the length to allow for additional RPCs. I also understand that nginx doing graceful GOAWAY does not remove the need for client-side retries.

Change History (3)

comment:1 by Maxim Dounin, 3 years ago

Resolution: wontfix
Status: newclosed

As already explained here, two-stage GOAWAY won't help with #2155, as even strictly limited concurrency doesn't help. Further, two-stage GOAWAY is broken in Chrome, and cannot be used in practice due to this.

comment:2 by ejona.google.com@…, 3 years ago

Resolution: wontfix
Status: closedreopened

Oh, https://bugs.chromium.org/p/chromium/issues/detail?id=1030255 wasn't linked to before. That is a pretty nasty bug. I guess most other proxies don't use a setting like keepalive_requests, so they don't hit the Chrome problem and so it isn't generally impacting many Chrome users.

Further, two-stage GOAWAY is broken in Chrome, and cannot be used in practice due to this.

I don't see how two-stage GOAWAY is any more broken in Chrome than one-stage GOAWAY. I agree the Chrome issue is serious for nginx and there's limited options for working around Chrome's behavior, but I don't see how it negatively impacts two-stage GOAWAY.

comment:3 by Maxim Dounin, 3 years ago

Priority: majorminor
Type: defectenhancement

I don't see how two-stage GOAWAY is any more broken in Chrome than one-stage GOAWAY.

The one-stage GOAWAY breaks loading of some resources which aren't re-requested by Chrome for some reason. Using two-stage GOAWAY additionally introduces hard-to-debug and unavoidable delay in processing of all requests in Chrome.

That is, two-stage GOAWAY is useless in prevention of issues as seen with browsers, and additionally introduces other issues to deal with.

Given the above, as of now there are no plans to implement two-stage GOAWAY. This can be reconsidered when client-side support will become better and/or if a simple enough implementation will be available which does not trigger the issue in question or adequately minimizes the impact.

Note: See TracTickets for help on using tickets.