Opened 7 years ago

Closed 7 years ago

#1291 closed defect (invalid)

TCP HealthCheck knocking down Nginx

Reported by: samysilva@…
Owned by:
Priority: minor
Milestone: 1.13
Component: nginx-module
Version: 1.13.x
Keywords: stream
Cc:
uname -a: Linux datalog-ugt-slb1.gtservicos 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.13.0
built by gcc 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
built with OpenSSL 1.0.1e-fips 11 Feb 2013
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib64/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -pie'

Description

We recently added Nagios to monitor our SLB setup built on NGINX. After we added the monitoring, the cluster became unstable and went down.

We have verified that NGINX does not properly handle probe connections that carry no payload (no body):

"./check_tcp -H10.154.3.96 -p3515"

This type of monitoring request brings down the nginx service.

To work around the problem, we changed the monitoring routines to send an arbitrary string, so as not to knock down the nginx cluster:

"./check_tcp -H10.154.3.96 -p3515 -sAnyString"

Attached is a PDF document with more information about the problem and how to reproduce it.

Attachments (1)

Issue presented after we added the Nagios solution to monitor the NGINX TCP ports.docx (143.5 KB ) - added by samysilva@… 7 years ago.


Change History (8)

comment:1 by samysilva@…, 7 years ago

The PDF document was over 250 KB. Please see the attached document in docx format instead.

comment:2 by Maxim Dounin, 7 years ago

Component: nginx-core → nginx-module
Keywords: stream added
Priority: blocker → minor

From the description it doesn't look like there is anything wrong with nginx itself; rather, it cannot connect to the upstream servers, and this is what causes the service degradation.

Could you please check how your backends behave when tested directly with the same health checks without any payload?
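For example, something along these lines (the backend address and port here are assumptions, substitute one of your actual backend servers):

    # probe a backend directly, bypassing nginx
    ./check_tcp -H 10.154.3.99 -p 3515               # no payload, as in the failing case
    ./check_tcp -H 10.154.3.99 -p 3515 -s AnyString  # with payload, as in the workaround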

comment:3 by samysilva@…, 7 years ago

Hi, could you look at the evidence in the attached file? I'm going to record a video to demonstrate the problem.

comment:4 by samysilva@…, 7 years ago

The servers running Logstash are monitored directly with Nagios and show no problems.

The problem occurs when monitoring the nginx IPs.

Here's a video demonstrating the problem.
https://www.youtube.com/watch?v=qD87fy3RNw0


comment:5 by Maxim Dounin, 7 years ago

Note that the LAST_ACK state you refer to in both your file and your video means that the connection was closed by the other side and then closed by the application, but the last ACK has not yet been received. As such, the fact that such lines are present in your netstat output indicates that something is wrong at the network level. So I would recommend focusing on what happens at the network level, in particular which addresses are used by the involved parties, your firewall rules, and routing.

The direct reason for the problem as demonstrated in the video seems to be that you are using 10.154.3.97 as a client and testing nginx at 10.154.3.98. These addresses are on the same network as your backends (10.154.3.99, 10.154.3.100, https://youtu.be/KCLgbV8lXzU?t=25s), in contrast to the "external" nginx address (10.154.4.103) normally used for balancing (https://youtu.be/KCLgbV8lXzU?t=1m27s). With transparent proxying, this results in the backends sending packets directly to the client host, and eventually leads to service degradation, as from nginx's point of view the backends are not responding.
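For reference, a minimal sketch of the kind of transparent-proxy stream configuration implied above (the upstream name is hypothetical; the addresses are taken from the video and may not match your actual config):

    stream {
        upstream logstash_backends {
            server 10.154.3.99:3515;
            server 10.154.3.100:3515;
        }

        server {
            listen 3515;
            # transparent proxying: connect to the backends using the
            # original client address as the source address
            proxy_bind $remote_addr transparent;
            proxy_pass logstash_backends;
        }
    }

With such a configuration the backends send their reply packets toward the client address; unless routing forces those packets back through the nginx host, nginx never sees the replies and considers the backends unresponsive.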

Some additional notes:

  1. Providing information in text form directly in the ticket is much easier to work with than external docx files and videos. Please use text for any further information.
  2. The blog post you refer to is not official documentation; it is merely a blog post written by a particular author. You may contact the author via the comments below the blog post if you have questions about the recommended configurations, or suggestions on how to improve them.
  3. Transparent proxying is not trivial to configure and is in general very fragile, as it breaks normal networking rules and requires packets to be processed by a particular host. Additionally, it requires nginx to run worker processes under root, which isn't good from a security point of view. In most cases it is a good idea to avoid transparent proxying as long as there are other ways to do the required balancing; a plain, non-transparent proxy block is sketched after this list for comparison.
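For comparison, a minimal non-transparent variant of such a balancing setup might look like this (upstream name and addresses are the same assumptions as in the sketch above):

    stream {
        upstream logstash_backends {
            server 10.154.3.99:3515;
            server 10.154.3.100:3515;
        }

        server {
            listen 3515;
            # no "proxy_bind ... transparent": connections to the backends
            # originate from nginx's own address, and worker processes
            # do not need to run under root
            proxy_pass logstash_backends;
        }
    }

The trade-off is that the backends then see connections coming from nginx's own address rather than from the original client.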

comment:6 by samysilva@…, 7 years ago

Thanks for the info, mdounin.

We use nginx as an SLB service to distribute the load to our log servers. We need to use transparent proxying so that the log servers can correctly identify the log source.

I am considering replacing the balancing layer with an A10 Networks vThunder. Currently the nginx solution is working perfectly, except for the monitoring problem. The problem also occurs if we make monitoring calls from external networks to the VIP, so we consider that the service can be compromised.

I need to find a solution.

For now we have changed all the monitoring routines to pass the "-s" parameter, so that each request sends some text. This is not needed in our monitoring routines for other SLB systems (BIG-IP, vThunder, etc.).

For now the application is stable, but we are still exposed to problems if any client on the internet makes requests similar to those of the monitoring system. To get around this, I have published the service without the SLB for now.

Thank you for the information.

comment:7 by Maxim Dounin, 7 years ago

Resolution: invalid
Status: new → closed

As previously suggested, the observed behaviour is the result of an attempt to connect to the internal address from the backends' network. The solution is to avoid connecting to the address in question and to use the external address instead (the one used by real clients); it should work just fine. Configuring nginx to accept connections only on the external address might also be a good idea, as it will prevent such mistakes.
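For example, a minimal sketch (the external address below is taken from the video and is an assumption; logstash_backends stands for your existing upstream):

    stream {
        server {
            # accept connections only on the external address used by
            # real clients, instead of listening on all local addresses
            listen 10.154.4.103:3515;
            proxy_pass logstash_backends;
        }
    }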
