Opened 4 years ago

Closed 4 years ago

#1976 closed defect (invalid)

Nginx DNS cache issue. ngx_http_core_module valid config is not working. — at Version 1

Reported by: aswin020@… Owned by:
Priority: major Milestone:
Component: nginx-core Version: 1.12.x
Keywords: resolver cache dns Cc:
uname -a: Linux ip-10-200-18-244.us-west-1.compute.internal 4.14.72-73.55.amzn2.x86_64 #1 SMP Thu Sep 27 23:37:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.12.2
built by gcc 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)
built with OpenSSL 1.0.2k-fips 26 Jan 2017
TLS SNI support enabled
configure arguments: --prefix=/usr/share/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib64/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --http-client-body-temp-path=/var/lib/nginx/tmp/client_body --http-proxy-temp-path=/var/lib/nginx/tmp/proxy --http-fastcgi-temp-path=/var/lib/nginx/tmp/fastcgi --http-uwsgi-temp-path=/var/lib/nginx/tmp/uwsgi --http-scgi-temp-path=/var/lib/nginx/tmp/scgi --pid-path=/run/nginx.pid --lock-path=/run/lock/subsys/nginx --user=nginx --group=nginx --with-file-aio --with-ipv6 --with-http_auth_request_module --with-http_ssl_module --with-http_v2_module --with-http_realip_module --with-http_addition_module --with-http_xslt_module=dynamic --with-http_image_filter_module=dynamic --with-http_geoip_module=dynamic --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_random_index_module --with-http_secure_link_module --with-http_degradation_module --with-http_slice_module --with-http_stub_status_module --with-http_perl_module=dynamic --with-mail=dynamic --with-mail_ssl_module --with-pcre --with-pcre-jit --with-stream=dynamic --with-stream_ssl_module --with-google_perftools_module --with-debug --with-cc-opt='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic' --with-ld-opt='-Wl,-z,relro -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -Wl,-E'

Description (last modified by Maxim Dounin)

Setup:

EC2 instance with Nginx 1.12.2 behind an AWS load balancer. The load balancer has an idle timeout of 60 seconds. The upstream is another load balancer. The DNS TTL of the CNAME and A records is 60 seconds.

Problem:

AWS ELB IPs change periodically; that is how ELB works. When the IP changes, Nginx does not pick up the new IP address; instead, requests keep going to the cached (old) IP. This results in a 499 once request_time reaches 59.xxx seconds, because the ELB closes the request after the 60-second idle timeout. The issue is that Nginx caches the ELB IP and does not update it.
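
For context, a minimal sketch of the kind of proxy configuration assumed here (the ticket does not show the proxy_pass line; the hostname is the ELB name from the dig output below, and the location is illustrative):

server {
    listen       81;
    server_name  my.host.name;

    location /api/ {
        # Literal ELB hostname in proxy_pass (illustrative; the actual
        # location and URI used on this machine are not shown in the ticket).
        proxy_pass http://internal-a0330ec15812a11e9a1660649a282fad-316863378.us-west-1.elb.amazonaws.com;
    }
}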

Debugging:

  1. I read the docs and checked previously reported issues. As far as I understand, this issue should not occur: the resolver documentation says that, by default, Nginx caches answers using the TTL value of a response, and my DNS TTL is 60 seconds, yet I have seen the error persist for more than 5 minutes, until Nginx is reloaded. (The docs also note: "Before version 1.1.9, tuning of caching time was not possible, and nginx always cached answers for the duration of 5 minutes.")
  2. I added the resolver config to override the cache (if there is one):
server {
    listen       81;
    server_name  my.host.name;

    resolver 10.200.0.2 valid=60s;

It is not working; the DNS result is still cached until Nginx is reloaded.

  3. Updated the DNS record from a CNAME to an Alias A record. The TTL is 60 seconds for both the CNAME and the A records. The issue still exists.

Supporting docs/Logs:

This is a production machine and I have only the logs.

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for" "$upstream_addr"'
                      ' $request_time';

Before Nginx Reload:

10.200.201.145 - - 05/May/2020:19:51:39 +0000 "GET /api/monitor HTTP/1.1" 499 0 "-" "Pingdom.com_bot_version_1.4_http://www.pingdom.com/" "76.72.167.90" "10.200.83.167:80" 59.337

DNS report: here we can see that DNS now returns different IPs than the $upstream_addr logged above.

# dig +short rails.randomname.us-west-1.infra
internal-a0330ec15812a11e9a1660649a282fad-316863378.us-west-1.elb.amazonaws.com.
10.200.94.115
10.200.94.221

After Nginx reload:

10.200.201.145 - - 05/May/2020:19:52:39 +0000 "GET /api/monitor HTTP/1.1" 200 49 "-" "Pingdom.com_bot_version_1.4_http://www.pingdom.com/" "185.180.12.65" "10.200.94.221:80" 0.003

The issue is always resolved after an Nginx reload. Please advise me on how to fix this. Let me know if I can provide more data for debugging.

Change History (1)

comment:1 by Maxim Dounin, 4 years ago

Description: modified (diff)
Resolution: invalid
Status: new → closed

Domain names used in nginx configuration are normally resolved during parsing of the configuration. Currently, the only exceptions are:

  • When proxy_pass, fastcgi_pass, etc. directives contain variables.
  • When using ssl_stapling and resolving OCSP responder hostname.
  • When using a server ... resolve; in an upstream block (available as a part of the commercial subscription).
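
For the last case, a minimal sketch of the upstream syntax (commercial subscription only; the upstream name elb_backend and the zone size are illustrative, the hostname and resolver address are taken from the ticket):

# At the http{} level; a resolver is required for run-time resolution:
resolver 10.200.0.2 valid=60s;

upstream elb_backend {
    # A shared memory zone is required for the "resolve" parameter.
    zone elb_backend 64k;
    # Periodically re-resolve the ELB hostname, honoring the DNS TTL
    # (or the resolver's valid= override).
    server internal-a0330ec15812a11e9a1660649a282fad-316863378.us-west-1.elb.amazonaws.com resolve;
}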

All these cases explicitly document that they are using resolver. For example, quoting proxy_pass documentation:

Parameter value can contain variables. In this case, if an address is specified as a domain name, the name is searched among the described server groups, and, if not found, is determined using a resolver.

Don't expect that names written in the configuration will be periodically re-resolved by nginx. Depending on the particular use case, you may want to either reconfigure nginx as IP addresses change, or configure nginx in a way that implies periodic re-resolution of the names you want re-resolved.
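
As an illustration of the first case above, a minimal sketch of run-time resolution on stock nginx, achieved by putting the upstream name into a variable so that proxy_pass consults the resolver at request time (the listen, server_name and resolver values follow the ticket's snippet; the $elb_host variable and the location are illustrative):

server {
    listen       81;
    server_name  my.host.name;

    resolver 10.200.0.2 valid=60s;

    location / {
        # Because proxy_pass contains a variable, the name is resolved at
        # run time through the resolver above instead of once at startup.
        set $elb_host internal-a0330ec15812a11e9a1660649a282fad-316863378.us-west-1.elb.amazonaws.com;
        proxy_pass http://$elb_host;
    }
}

As the quoted documentation notes, with a variable the name is first searched among defined upstream groups and only then handed to the resolver.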

Note that re-resolution implies run-time overhead and may also leave you with a non-working nginx if the DNS server becomes unreachable. On the other hand, nginx reconfiguration implies additional resource usage until the reconfiguration completes, which may take a while. For this and other reasons, automatic reconfiguration might not be a good idea.
