HTTP/2 in nginx does not use double-GOAWAY for graceful connection shutdown
|nginx -V:||nginx version: nginx/1.19.8|
As defined in RFC 7540 §6.8:
A server that is attempting to gracefully shut down a connection SHOULD send an initial GOAWAY frame with the last stream identifier set to 2^31-1 and a NO_ERROR code. This signals to the client that a shutdown is imminent and that initiating further requests is prohibited. After allowing time for any in-flight stream creation (at least one round-trip time), the server can send another GOAWAY frame with an updated last stream identifier. This ensures that a connection can be cleanly shut down without losing requests.
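To make the recommendation concrete, here is a minimal sketch of the two frames a server would emit. It is not nginx code. The frame layout follows RFC 7540 §4.1 and §6.8; the function names `goaway_frame` and `graceful_shutdown_frames` are hypothetical and exist only for illustration.

```python
import struct

GOAWAY_TYPE = 0x7            # frame type for GOAWAY (RFC 7540 §6.8)
NO_ERROR = 0x0               # error code for a graceful shutdown
MAX_STREAM_ID = 2**31 - 1    # largest valid stream identifier


def goaway_frame(last_stream_id: int, error_code: int = NO_ERROR) -> bytes:
    """Encode one GOAWAY frame with no debug data (hypothetical helper)."""
    if not 0 <= last_stream_id <= MAX_STREAM_ID:
        raise ValueError("last_stream_id must fit in 31 bits")
    # Payload: reserved bit + 31-bit last stream id, then 32-bit error code.
    payload = struct.pack(">II", last_stream_id, error_code)
    # Header: 24-bit length, 8-bit type, 8-bit flags, 32-bit stream id.
    # GOAWAY always applies to the connection, so the stream id is 0.
    header = struct.pack(">I", len(payload))[1:] + struct.pack(
        ">BBI", GOAWAY_TYPE, 0, 0
    )
    return header + payload


def graceful_shutdown_frames(highest_processed_stream_id: int):
    """Return (first, second) GOAWAY frames for a double-GOAWAY shutdown.

    The caller is expected to send `first`, wait at least one round-trip
    time while still accepting new streams, and only then send `second`.
    """
    first = goaway_frame(MAX_STREAM_ID)                  # "shutdown is imminent"
    second = goaway_frame(highest_processed_stream_id)   # final cutoff
    return first, second
```

In this sketch the value passed as `highest_processed_stream_id` has to be read after the wait, so that streams the client opened before it saw the first frame are covered by the second one.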
I see multiple nginx tickets where clients are blamed for not retrying, but none of them mentions the RFC recommendation or the latency impact of nginx's behavior. Statements like "It does not seem to be possible to resolve this on nginx side" seem inaccurate.
I've seen users having trouble with this when interacting with grpc-java in the past, but only now chose to file an issue. Historically it seems users have increased keepalive_requests to reduce the rate of failures. It is becoming a bit more noticeable now because grpc-java has improved its error reporting to distinguish the case where a failure was caused by abrupt GOAWAY, so it is easier to notice poorly-behaved servers. This came up this time as part of https://github.com/grpc/grpc-java/issues/8310, but I have a resolution available for that issue.
I understand that nginx would need to limit both the number of additional RPCs it accepts after the first GOAWAY and the length of time it allows for them. I also understand that graceful GOAWAY in nginx does not remove the need for client-side retries.
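The kind of limit meant here could look roughly like the sketch below. It is an illustration only: the class `DrainWindow`, its fields, and the idea of passing the clock in as an argument are all assumptions and do not describe anything in nginx.

```python
from dataclasses import dataclass


@dataclass
class DrainWindow:
    """Bounds the streams accepted between the first and second GOAWAY.

    All names here are hypothetical; times are in seconds.
    """
    started_at: float        # when the first GOAWAY was sent
    grace_seconds: float     # how long new streams are still admitted
    max_extra_streams: int   # cap on streams admitted during the window
    accepted: int = 0

    def admit(self, now: float) -> bool:
        """Return True if a stream opened at `now` should still be served."""
        if now - self.started_at > self.grace_seconds:
            return False     # window expired, the final GOAWAY is due
        if self.accepted >= self.max_extra_streams:
            return False     # cap reached, refuse further streams
        self.accepted += 1
        return True
```

Streams that fall outside the window are still the client's to retry, which is why the limits do not change the need for client-side retries.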