Opened 5 weeks ago

Last modified 5 weeks ago

#2031 new defect

gRPC connections ingressed through nginx closed with RST_STREAM error code 2

Reported by: mpgermano@… Owned by:
Priority: major Milestone:
Component: nginx-core Version: 1.17.x
Keywords: rst_stream Cc:
uname -a: Linux orc8r-nginx-66cf7485-ztcg6 4.15.0-43-generic #46~16.04.1-Ubuntu SMP Fri Dec 7 13:31:08 UTC 2018 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.17.10
built by gcc 8.3.0 (Debian 8.3.0-6)
built with OpenSSL 1.1.1d 10 Sep 2019
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -fdebug-prefix-map=/data/builder/debuild/nginx-1.17.10/debian/debuild-base/nginx-1.17.10=. -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'

Description

My team recently migrated from nghttpx to nginx. We use bidirectional gRPC streams to send requests between components in our system. nginx is the ingress for our cloud component, which consists of gRPC microservices, and is responsible for TLS termination.

Since migrating to nginx, we are seeing gRPC connections closed with RST_STREAM error code 2 (HTTP/2 INTERNAL_ERROR) during load testing. Each test makes around 10,000 RPC calls over ~2-3 minutes. The RST_STREAM error occurs consistently, at least once per test and 2-3 times on average.
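For reference, the load profile is roughly equivalent to a run of the ghz gRPC benchmarking tool along these lines (a sketch only: the proto file, certificate paths, and concurrency are illustrative placeholders, not our actual harness):

```shell
# Illustrative reproduction of the load profile with ghz.
# Proto file, cert paths, and concurrency level are placeholders.
ghz --proto ./session_manager.proto \
    --call magma.lte.CentralSessionController/CreateSession \
    -n 10000 -c 20 \
    --cacert ./certifier.pem --cert ./client.crt --key ./client.key \
    session_proxy-orc8r-nginx-proxy.magma.svc.cluster.local:8443
```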

During the peak of load testing, /nginx_status returned:

Active connections: 14
server accepts handled requests
 21136 21136 318756
Reading: 0 Writing: 108 Waiting: 5

The associated error in the log:

2020-08-05T09:36:12.569375933Z stderr F 2020/08/05 09:36:12 [info] 516#516: *335529 client terminated stream 1841 due to internal error while sending request to upstream, client: 172.17.10.183, server: ~^(?<srv>.+)-orc8r-nginx-proxy.magma.svc.cluster.local$, request: "POST /magma.lte.CentralSessionController/CreateSession HTTP/2.0", upstream: "grpc://10.254.214.50:9079", host: "session_proxy-orc8r-nginx-proxy.magma.svc.cluster.local"

I've tried upgrading to v1.19, but the issue persists. My hunch is that the issue stems from differences in how nghttpx and nginx handle the maximum number of requests per HTTP/2 connection. However, increasing http2_max_requests in nginx.conf didn't seem to help.
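For completeness, this is a sketch of the HTTP/2 tuning I experimented with (the specific values here are illustrative, not ones we settled on):

```nginx
http {
  # Allow more requests per HTTP/2 connection before nginx closes it
  # (default is 1000; in nginx 1.19.7+ this directive was removed in
  # favor of keepalive_requests).
  http2_max_requests 10000;

  # Keep idle HTTP/2 connections open longer between streams
  # (default 3m; also removed in 1.19.7 in favor of keepalive_timeout).
  http2_idle_timeout 5m;
}
```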

nginx.conf:

user root;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;

events {
  worker_connections 1024;
}

http {
  # Custom JSON-formatted log
  log_format json_custom escape=json
    '{'
      '"nginx.time_local": "$time_local",'
      '"nginx.remote_addr": "$remote_addr",'
      '"nginx.request": "$request",'
      '"nginx.request_method": "$request_method",'
      '"nginx.request_uri": "$request_uri",'
      '"nginx.status": $status,'
      '"nginx.body_bytes_sent": $body_bytes_sent,'
      '"nginx.request_length": $request_length,'
      '"nginx.request_time": $request_time,'
      '"nginx.server_name": "$server_name",'
      '"nginx.clientcert_port": $srvport,'
      '"nginx.open_port": $open_srvport,'
      '"nginx.client_serial": "$ssl_client_serial",'
      '"nginx.client_cn": "$ssl_client_s_dn_cn"'
    '}';

  # See https://kubernetes.github.io/ingress-nginx/examples/grpc/#notes-on-using-responserequest-streams
  grpc_send_timeout 1200s;
  grpc_read_timeout 1200s;
  client_body_timeout 1200s;

  # Blackhole (9070) unrecognized services
  map $srv $srvport {
    default 9070;
    download 9102;
    vpnservice 9104;
    testcontroller 9109;
    fbinternal 9111;
    analytics 9200;
    cwf 9115;
    lte 9113;
    subscriberdb 9083;
    policydb 9085;
    orchestrator 9112;
    streamer 9082;
    metricsd 9084;
    accessd 9091;
    dispatcher 9096;
    directoryd 9100;
    state 9105;
    device 9106;
    configurator 9108;
    tenants 9110;
    devmand 9116;
    wifi 9117;
    feg 9114;
    feg_relay 9103;
    s6a_proxy 9079;
    session_proxy 9079;
    swx_proxy 9079;
    csfb 9079;
    feg_hello 9079;
    ocs 9079;
    pcrf 9079;
    health 9107;
  }

  # Blackhole (9070) any services that aren't proxy_type 'open'
  map $srv $open_srvport {
    default 9070;
    bootstrapper 9088;
  }

  # Use a regex to pull the client cert common name out of the DN
  # The DN will look something like "CN=foobar,OU=,O=,C=US"
  map $ssl_client_s_dn $ssl_client_s_dn_cn {
    default "";
    ~CN=(?<CN>[^/,]+) $CN;
  }

  # Server block for controller
  server {
    listen              8443 ssl http2;
    server_name         ~^(?<srv>.+)-orc8r-nginx-proxy.magma.svc.cluster.local$;
    root                /var/www;

    error_log  /var/log/nginx/error.log info;
    access_log /var/log/nginx/access.log json_custom;

    ssl_certificate     /var/opt/magma/certs/controller.crt;
    ssl_certificate_key /var/opt/magma/certs/controller.key;
    ssl_verify_client on;
    ssl_client_certificate /var/opt/magma/certs/certifier.pem;

    location / {
      resolver coredns.kube-system.svc.cluster.local;
      grpc_pass grpc://orc8r-controller.magma.svc.cluster.local:$srvport;

      grpc_set_header x-magma-client-cert-cn $ssl_client_s_dn_cn;
      grpc_set_header x-magma-client-cert-serial $ssl_client_serial;
      grpc_set_header Host $srv-orc8r-controller.magma.svc.cluster.local:$srvport;
    }

    # Setting max allowed size for client requests body to 50MB
    client_max_body_size 50M;
  }

  # Server block for bootstrapper and any other non-clientcert services
  server {
    listen 8444 ssl http2;
    server_name         ~^(?<srv>.+)-orc8r-nginx-proxy.magma.svc.cluster.local$;
    root                /var/www;

    error_log  /var/log/nginx/error.log info;
    access_log /var/log/nginx/access.log json_custom;

    ssl_certificate     /var/opt/magma/certs/controller.crt;
    ssl_certificate_key /var/opt/magma/certs/controller.key;

    location / {
      resolver coredns.kube-system.svc.cluster.local;

      grpc_pass grpc://orc8r-controller.magma.svc.cluster.local:$open_srvport;
    }
  }

  # Catch-all server block for REST HTTP/1.1 requests from browsers
  server {
    listen 9443 ssl default_server;
    server_name _;

    error_log  /var/log/nginx/error.log info;
    access_log /var/log/nginx/access.log json_custom;

    ssl_certificate     /var/opt/magma/certs/controller.crt;
    ssl_certificate_key /var/opt/magma/certs/controller.key;
    ssl_verify_client on;
    ssl_client_certificate /var/opt/magma/certs/certifier.pem;

    location / {
      resolver coredns.kube-system.svc.cluster.local;
      proxy_pass http://orc8r-controller.magma.svc.cluster.local:9081;

      proxy_set_header x-magma-client-cert-cn $ssl_client_s_dn_cn;
      proxy_set_header x-magma-client-cert-serial $ssl_client_serial;
    }
  }

  # Open port 80 for k8s liveness check. Just returns a 200.
  server {
    listen 80;
    server_name _;

    location / {
      return 200;
    }
  }
}

Change History (1)

comment:1 by Maxim Dounin, 5 weeks ago

The error message "client terminated stream 1841 due to internal error" suggests that it was the client who terminated the stream. That is, your client reported that it experienced some internal error and therefore closed the stream. Additional details might be available in the client logs, if any. If you think the problem is in nginx rather than in your client, please elaborate on what makes you think so.
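For example (assuming a stock gRPC client; the ./client binary name is a placeholder), client-side tracing can be enabled via environment variables before reproducing the failure:

```shell
# gRPC C-core based clients (C++, Python, Ruby, ...):
GRPC_VERBOSITY=debug GRPC_TRACE=http,call_error ./client

# grpc-go clients use different knobs:
GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=99 ./client
```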
