Opened 4 months ago

Closed 4 months ago

#2110 closed defect (invalid)

nginx as grpc reverse proxy report 499 after few weeks of work

Reported by: shuliakovsky@… Owned by:
Priority: major Milestone: nginx-1.19
Component: nginx-module Version: 1.19.x
Keywords: grpc module Cc:
uname -a: Linux clarifaiproxy-79bc996599-glr95 4.19.0-0.bpo.6-amd64 #1 SMP Debian 4.19.67-2+deb10u2~bpo9+1 (2019-11-12) x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.19.5
built by gcc 8.3.0 (Debian 8.3.0-6)
built with OpenSSL 1.1.1d 10 Sep 2019
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -fdebug-prefix-map=/data/builder/debuild/nginx-1.19.5/debian/debuild-base/nginx-1.19.5=. -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'

Description

NGINX used as grpc-proxy with queue to avoid throttling from grpc service with listed below configuration:

After few thousands of successful proxied requests or after few weeks of good work( Normal work logs)

============= Normal work logs: ================
10.244.0.10 - - [16/Dec/2020:00:15:36 +0000] "POST /clarifai.api.V2/PostModelOutputs HTTP/2.0" 200 1267 "-" "grpc-java-netty/1.29.0"
10.244.0.10 - - [16/Dec/2020:00:15:36 +0000] "POST /clarifai.api.V2/PostModelOutputs HTTP/2.0" 200 604 "-" "grpc-java-netty/1.29.0"
10.244.0.10 - - [16/Dec/2020:00:15:36 +0000] "POST /clarifai.api.V2/PostModelOutputs HTTP/2.0" 200 1260 "-" "grpc-java-netty/1.29.0"
10.244.0.10 - - [16/Dec/2020:00:15:39 +0000] "POST /clarifai.api.V2/PostModelOutputs HTTP/2.0" 200 1514 "-" "grpc-java-netty/1.29.0"
10.244.0.10 - - [16/Dec/2020:00:15:39 +0000] "POST /clarifai.api.V2/PostModelOutputs HTTP/2.0" 200 576 "-" "grpc-java-netty/1.29.0"
10.244.0.10 - - [16/Dec/2020:00:15:39 +0000] "POST /clarifai.api.V2/PostModelOutputs HTTP/2.0" 200 1258 "-" "grpc-java-netty/1.29.0"
===============================================================

(means randomly) NGINX begins report 499 error on all or on part of workers (Abnormal work logs example).

=========== Abnormal work logs example ================
10.244.0.88 - - [11/Nov/2020:06:32:22 +0000] "POST /clarifai.api.V2/PostModelOutputs HTTP/2.0" 499 0 "-" "grpc-java-netty/1.29.0"
10.244.0.88 - - [11/Nov/2020:06:33:21 +0000] "POST /clarifai.api.V2/PostModelOutputs HTTP/2.0" 499 0 "-" "grpc-java-netty/1.29.0"
=======================================================

after "nginx -s reload" everything begin works properly for next few weeks or few thousands of requests.

============ begin of nginx.conf ==========
user nginx;
worker_processes 4;

error_log /dev/stdout debug;
pid /var/run/nginx.pid;

events {

worker_connections 1024;

}

http {

default_type application/octet-stream;

log_format main '$remote_addr - $remote_user [$time_local] "$request" '

'$status $body_bytes_sent "$http_referer" "$limit_req_status"'
'"$http_user_agent"';

access_log /dev/stdout main;

sendfile on;

keepalive_timeout 2048;

gzip on;

# Limit Rate
limit_req_zone $server_name zone=perserver:10m rate=8r/s;

include /etc/nginx/conf.d/*.conf;

}
=============== end of nginx.conf ================

=============== begin of /etc/nginx/conf.d/default.conf ===========
server {

listen 1080 http2;
server_name ~.*$;

# Logs
access_log /dev/stdout ;
error_log /dev/stdout ;

client_max_body_size 10m;

# Proxy
#proxy_ignore_client_abort on;
proxy_connect_timeout 2048;
proxy_read_timeout 2048;
proxy_send_timeout 2048;

# /Default Location
location / {

resolver 1.1.1.1 valid=30s;
grpc_pass grpc://api-grpc.clarifai.com:18080;
limit_req zone=perserver burst=10000 delay=1;
grpc_connect_timeout 240;
grpc_read_timeout 2048;
grpc_send_timeout 2048;
grpc_socket_keepalive on;

}
location /clarifai_proxy_status {

auth_basic "Restricted Content";
auth_basic_user_file /etc/nginx/.htpasswd;
stub_status;
access_log off;

}

}
============ end of /etc/nginx/conf.d/default.conf ===========

Change History (7)

comment:1 by shuliakovsky@…, 4 months ago

P.S.
Nothing abnormal with memory or cpu usage not detected.

comment:2 by Maxim Dounin, 4 months ago

Could you please clarify what makes you think that the observed behaviour is a bug in nginx?

Note that 499 status code means that the client closed the connection. Usually this happens when a client gives up waiting for the response. Looking at your configuration suggests limit_req might delay client requests. Alternatively, there might be delays at the backend server. To further diagnose what happens you may want to log $request_time and $upstream_response_time variables.

Note well that access logs provided do not match the main format from your configuration, so the $limit_req_status variable is not available in the logs.

in reply to:  2 comment:3 by shuliakovsky@…, 4 months ago

Replying to Maxim Dounin:

Could you please clarify what makes you think that the observed behaviour is a bug in nginx?

I'm completely understanding that it's sounds like "the bullets flew out from my side and problem on your side" but when problem is happened, I mean NGINX begins report 499, It begins report it on each request, independent of requests number per second. Even when there is no request more than hour and I send only one via proxy it will report 499, until I execute "nginx -s reload"

And right after executing command "nginx -s reload" everything begins work properly without 499 status code (it changes to 200)
Also numbers of requests right now lover than 1 per second, so according my configuration delays does not happening, if I clear understand how limit_req works.

Note that 499 status code means that the client closed the connection. Usually this happens when a client gives up waiting for the response. Looking at your configuration suggests limit_req might delay client requests. Alternatively, there might be delays at the backend server. To further diagnose what happens you may want to log $request_time and $upstream_response_time variables.

Note well that access logs provided do not match the main format from your configuration, so the $limit_req_status variable is not available in the logs.

Thank You for this note I had modifying it to obtain correct logs.

Last edited 4 months ago by shuliakovsky@… (previous) (diff)

comment:4 by Maxim Dounin, 4 months ago

I mean NGINX begins report 499

Again: it's not nginx reports 499, it's your client closes the connection before a response was sent to the client. The exact reason why the client closes the connection is to be find out on the client. I suspect this is because client times out waiting for the response. So most likely the problem is that requests are not answered within a reasonable time.

It's up to the additional investigation to find out why requests are not answered within a reasonable time. While this might be due to some unknown bug in nginx, this doesn't seem to be the likely reason. More likely to happen due to incorrect configuration, or the backend server not responding in time for some reason.

And right after executing command "nginx -s reload" everything begins work properly without 499 status code (it changes to 200)
Also numbers of requests right now lover than 1 per second, so according my configuration delays does not happening, if I clear understand how limit_req works.

I think you are right and limit_req is not relevant here.

Another thing which catches my eye is the name of the backend server. It doesn't seem to be your own backend, but rather some public service. If it's the case, the behaviour you describe might be the result of some changes on the service, such as moving to a different IP address.

Note that nginx resolves name in the configuration during the configuration parsing. If the address changes, you have to instruct nginx to re-read configuration to re-resolve the name (and this is what nginx reload does). Alternatively, you can consider making the name only available at run time using variables, so nginx will resolve it while processing requests (will require resolver).

in reply to:  4 comment:5 by shuliakovsky@…, 4 months ago

Another thing which catches my eye is the name of the backend server. It doesn't seem to be your own backend, but rather some public service. If it's the case, the behaviour you describe might be the result of some changes on the service, such as moving to a different IP address.

Note that nginx resolves name in the configuration during the configuration parsing. If the address changes, you have to instruct nginx to re-read configuration to re-resolve the name (and this is what nginx reload does). Alternatively, you can consider making the name only available at run time using variables, so nginx will resolve it while processing requests (will require resolver).

I know about "resolver" and it present in my configuration:

resolver 1.1.1.1 valid=30s;

But possible you are right and I use it in wrong way:
I had reconfigure it to:

resolver 1.1.1.1 valid=10s;
set $clarifai_servers api-grpc.clarifai.com;
grpc_pass grpc://$clarifai_servers:18080;

and will check if it works in another way.

In any case thank You for response.
Also like temporary solution I had scheduled "nginx -s reload" (I know this is a crutch)

Last edited 4 months ago by shuliakovsky@… (previous) (diff)

comment:6 by shuliakovsky@…, 4 months ago

I think task may be closed until problem will happen in future.

comment:7 by Maxim Dounin, 4 months ago

Resolution: invalid
Status: newclosed
Note: See TracTickets for help on using tickets.