Opened 10 months ago

Closed 9 months ago

#2513 closed defect (worksforme)

nginx resolver issue

Reported by: bsiara@… Owned by:
Priority: major Milestone:
Component: nginx-core Version: 1.21.x
Keywords: Cc:
uname -a: Linux ec1-prd-k8s-nginxfront-1 5.15.0-1036-aws #40~20.04.1-Ubuntu SMP Mon Apr 24 00:21:13 UTC 2023 x86_64 Linux
nginx -V: built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r4)
built with OpenSSL 3.0.7 1 Nov 2022
TLS SNI support enabled
configure arguments: --prefix=/opt/nginx/nginx --with-cc-opt=-O2 --add-module=../ngx_devel_kit-0.3.1 --add-module=../echo-nginx-module-0.62 --add-module=../xss-nginx-module-0.06 --add-module=../ngx_coolkit-0.2 --add-module=../set-misc-nginx-module-0.33 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.09 --add-module=../srcache-nginx-module-0.32 --add-module=../ngx_lua-0.10.21 --add-module=../ngx_lua_upstream-0.07 --add-module=../headers-more-nginx-module-0.33 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.19 --add-module=../redis2-nginx-module-0.15 --add-module=../redis-nginx-module-0.3.9 --add-module=../rds-json-nginx-module-0.15 --add-module=../rds-csv-nginx-module-0.09 --add-module=../ngx_stream_lua-0.0.11 --with-ld-opt=-Wl,-rpath,/opt/nginx/luajit/lib --sbin-path=/opt/nginx/sbin/nginx --modules-path=/opt/nginx/etc/nginx/modules --conf-path=/opt/nginx/etc/nginx/nginx.conf --error-log-path=/opt/nginx/logs/error.log --pid-path=/opt/nginx/run/nginx.pid --lock-path=/opt/nginx/run/nginx.lock --user=nginx --group=nginx --with-select_module --with-poll_module --with-threads --with-file-aio --with-http_ssl_module --with-http_v2_module --with-http_realip_module --with-http_addition_module --with-http_xslt_module=dynamic --with-http_image_filter_module=dynamic --with-http_geoip_module=dynamic --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_auth_request_module --with-http_random_index_module --with-http_secure_link_module --with-http_degradation_module --with-http_slice_module --with-http_stub_status_module --with-http_perl_module=dynamic --http-log-path=/opt/nginx/logs/access.log --http-client-body-temp-path=/opt/nginx/cache/client_temp --http-proxy-temp-path=/opt/nginx/cache/proxy_temp --http-fastcgi-temp-path=/opt/nginx/cache/fastcgi_temp --http-uwsgi-temp-path=/opt/nginx/cache/uwsgi_temp --http-scgi-temp-path=/opt/nginx/cache/scgi_temp --with-perl_modules_path=/usr/lib/perl5/vendor_perl --with-mail=dynamic --with-mail_ssl_module --with-stream=dynamic --with-stream_ssl_module --with-stream_realip_module --with-stream_geoip_module=dynamic --with-stream_ssl_preread_module --with-pcre-jit --add-module=/opt/nginx/openresty-1.21.4.1/../ModSecurity-nginx --add-module=/opt/nginx/openresty-1.21.4.1/../nginx-module-vts --add-module=/opt/nginx/openresty-1.21.4.1/../ngx_brotli --add-dynamic-module=/opt/nginx/openresty-1.21.4.1/../nginx-opentracing/opentracing --with-compat --with-stream

Description

nginx (openresty 1.21.4.1) is compiled in docker alpine 3.17, in nginx.conf in http block I have a delcaration of resolver

resolver 192.168.1.1 ipv6=off;
resolver_timeout 10s;

Sometime (aproximetly one time for week) nginx resolver behave strange. In upstream I have a name which is refreshed every 300s to the new A records. Sometime nginx does not resolve this name to new records and still resolve to old A records. In this situation reload fix the issue, but this is not a solution. This happends only for one name at a time which is refreshed.

Change History (6)

comment:1 by Maxim Dounin, 10 months ago

Are you able to reproduce this with vanilla nginx, without any 3rd party patches and modules?

comment:2 by bsiara@…, 10 months ago

I can not reproduce this problem because this is a production system and in this place I can not use vanila nginx.

comment:3 by Maxim Dounin, 10 months ago

From the symptoms I would suggest one of the following:

  • There is a response with large TTL unexpectedly returned by your DNS server, which results in DNS response being cached for a long time. Consider using resolver ... valid=300s; to see if it helps (see docs for details).
  • There is a bug in the 3rd party patches or modules you are using, such as memory corruption, which results in incorrect TTL being used. I'm afraid it's not possible to rule out this option without reproducing the issue with vanilla nginx.
  • There is a previously unknown bug in nginx which results in incorrect TTL being used.

Unless there are more specific reasons to assume the latter is the case, I would rather focus on the first two options.

comment:4 by bsiara@…, 10 months ago

I tryed "valid" option and this does not take effect. Openresty does not touch vanila code where is embedded resolver code, they only use moduses support as you explain here https://www.nginx.com/blog/nginx-dynamic-modules-how-they-work/, and I suggest that problem is rather on third or second option.

comment:5 by Maxim Dounin, 10 months ago

I tryed "valid" option and this does not take effect.

Thanks for checking.

Openresty does not touch vanila code where is embedded resolver code, they only use moduses support as you explain here ​https://www.nginx.com/blog/nginx-dynamic-modules-how-they-work/, and I suggest that problem is rather on third or second option.

Unfortunately, Openresty is known to break nginx API in multiple places, and was seen many times to break things even if the particular parts of the code are not directly modified. Not to mention that bugs in modules can result in arbitrary issues in seemingly unrelated areas, and this is something seen multiple times with various 3rd party modules.

As such, for more or less any bug report the first thing we are asking is to demonstrate that the issue being reported is an issue in nginx itself, and wasn't the one introduced by 3rd party modifications or 3rd party modules.

The most simple way to do this is to ensure that the issue can be observed without any 3rd party modifications and 3rd party modules.

If it's not possible, an alternative approach would be to debug and analyze the issue in place. Consider collecting debug logs which demonstrate the issue. The most interesting lines are the ones related to resolving, such as resolving a name when connecting to an upstream:

...
2023/06/23 02:27:17 [debug] 39365#100154: *1 http init upstream, client timer: 0
2023/06/23 02:27:17 [debug] 39365#100154: *1 kevent set event: 6: ft:-2 fl:0025
2023/06/23 02:27:17 [debug] 39365#100154: *1 http script copy: "Host"
2023/06/23 02:27:17 [debug] 39365#100154: *1 http script var: "mdounin.ru"
2023/06/23 02:27:17 [debug] 39365#100154: *1 http script copy: "Connection"
2023/06/23 02:27:17 [debug] 39365#100154: *1 http script copy: "close"
2023/06/23 02:27:17 [debug] 39365#100154: *1 http script copy: ""
2023/06/23 02:27:17 [debug] 39365#100154: *1 http script copy: ""
2023/06/23 02:27:17 [debug] 39365#100154: *1 http proxy header: "User-Agent: curl/8.0.1"
2023/06/23 02:27:17 [debug] 39365#100154: *1 http proxy header: "Accept: */*"
2023/06/23 02:27:17 [debug] 39365#100154: *1 http proxy header:
"GET / HTTP/1.0
Host: mdounin.ru
Connection: close
User-Agent: curl/8.0.1
Accept: */*

"
2023/06/23 02:27:17 [debug] 39365#100154: *1 http cleanup add: 21688E6C
2023/06/23 02:27:17 [debug] 39365#100154: malloc: 21676580:120
2023/06/23 02:27:17 [debug] 39365#100154: resolve: "mdounin.ru"
2023/06/23 02:27:17 [debug] 39365#100154: malloc: 21676600:116
2023/06/23 02:27:17 [debug] 39365#100154: malloc: 216ADEA0:10
2023/06/23 02:27:17 [debug] 39365#100154: malloc: 216A6EC0:28
2023/06/23 02:27:17 [debug] 39365#100154: resolve: "mdounin.ru" A 2090
2023/06/23 02:27:17 [debug] 39365#100154: UDP socket 9
2023/06/23 02:27:17 [debug] 39365#100154: connect to 8.8.8.8:53, fd:9 #2
2023/06/23 02:27:17 [debug] 39365#100154: kevent set event: 9: ft:-1 fl:0025
2023/06/23 02:27:17 [debug] 39365#100154: send: fd:9 28 of 28
2023/06/23 02:27:17 [debug] 39365#100154: malloc: 2166E780:56
2023/06/23 02:27:17 [debug] 39365#100154: event timer add: -1: 30000:673867318
2023/06/23 02:27:17 [debug] 39365#100154: event timer add: -1: 5000:673842318
2023/06/23 02:27:17 [debug] 39365#100154: *1 http finalize request: -4, "/?" a:1, c:2
2023/06/23 02:27:17 [debug] 39365#100154: *1 http request count:2 blk:0
2023/06/23 02:27:17 [debug] 39365#100154: timer delta: 30
2023/06/23 02:27:17 [debug] 39365#100154: worker cycle
2023/06/23 02:27:17 [debug] 39365#100154: kevent timer: 5000, changes: 2
2023/06/23 02:27:17 [debug] 39365#100154: kevent events: 1
2023/06/23 02:27:17 [debug] 39365#100154: kevent: 6: ft:-2 fl:0020 ff:00000000 d:49032 ud:217CF828
2023/06/23 02:27:17 [debug] 39365#100154: *1 http run request: "/?"
2023/06/23 02:27:17 [debug] 39365#100154: *1 http upstream check client, write event:1, "/"
2023/06/23 02:27:17 [debug] 39365#100154: timer delta: 0
2023/06/23 02:27:17 [debug] 39365#100154: worker cycle
2023/06/23 02:27:17 [debug] 39365#100154: kevent timer: 5000, changes: 0
2023/06/23 02:27:17 [debug] 39365#100154: kevent events: 1
2023/06/23 02:27:17 [debug] 39365#100154: kevent: 9: ft:-1 fl:0020 ff:00000000 d:44 ud:217C78A0
2023/06/23 02:27:17 [debug] 39365#100154: recv: fd:9 44 of 4096
2023/06/23 02:27:17 [debug] 39365#100154: resolver DNS response 2090 fl:8180 1/1/0/0
2023/06/23 02:27:17 [debug] 39365#100154: resolver DNS response qt:1 cl:1
2023/06/23 02:27:17 [debug] 39365#100154: malloc: 216F1BB0:11
2023/06/23 02:27:17 [debug] 39365#100154: resolver qs:mdounin.ru
2023/06/23 02:27:17 [debug] 39365#100154: resolver naddrs:1 cname:00000000 ttl:3600
2023/06/23 02:27:17 [debug] 39365#100154: *1 http upstream resolve: "/?"
2023/06/23 02:27:17 [debug] 39365#100154: *1 name was resolved to 144.76.182.204
2023/06/23 02:27:17 [debug] 39365#100154: resolve name done: 0
2023/06/23 02:27:17 [debug] 39365#100154: event timer del: -1: 673867318
2023/06/23 02:27:17 [debug] 39365#100154: resolver expire
2023/06/23 02:27:17 [debug] 39365#100154: event timer del: -1: 673842318
2023/06/23 02:27:17 [debug] 39365#100154: *1 get rr peer, try: 1
2023/06/23 02:27:17 [debug] 39365#100154: *1 stream socket 10
2023/06/23 02:27:17 [debug] 39365#100154: *1 connect to 144.76.182.204:80, fd:10 #3
...

Note the following lines, which show the name being resolved, DNS request being sent, and the response received, including TTL:

...
2023/06/23 02:27:17 [debug] 39365#100154: resolve: "mdounin.ru"
...
2023/06/23 02:27:17 [debug] 39365#100154: resolve: "mdounin.ru" A 2090
...
2023/06/23 02:27:17 [debug] 39365#100154: recv: fd:9 44 of 4096
2023/06/23 02:27:17 [debug] 39365#100154: resolver DNS response 2090 fl:8180 1/1/0/0
2023/06/23 02:27:17 [debug] 39365#100154: resolver DNS response qt:1 cl:1
2023/06/23 02:27:17 [debug] 39365#100154: malloc: 216F1BB0:11
2023/06/23 02:27:17 [debug] 39365#100154: resolver qs:mdounin.ru
2023/06/23 02:27:17 [debug] 39365#100154: resolver naddrs:1 cname:00000000 ttl:3600
...

And using a cached name:

...
2023/06/23 02:46:34 [debug] 39365#100154: *4 http cleanup add: 21667E6C
2023/06/23 02:46:34 [debug] 39365#100154: malloc: 21676580:120
2023/06/23 02:46:34 [debug] 39365#100154: resolve: "mdounin.ru"
2023/06/23 02:46:34 [debug] 39365#100154: resolve cached
2023/06/23 02:46:34 [debug] 39365#100154: *4 http upstream resolve: "/?"
2023/06/23 02:46:34 [debug] 39365#100154: *4 name was resolved to 144.76.182.204
2023/06/23 02:46:34 [debug] 39365#100154: resolve name done: 0
2023/06/23 02:46:34 [debug] 39365#100154: resolver expire
2023/06/23 02:46:34 [debug] 39365#100154: *4 get rr peer, try: 1
2023/06/23 02:46:34 [debug] 39365#100154: *4 stream socket 10
2023/06/23 02:46:34 [debug] 39365#100154: *4 connect to 144.76.182.204:80, fd:10 #5
...

Note though that debug logs might be huge, and collecting them over a long period of time might be a challenge.

While debug logs does not guarantee the issue cause will be immediately obvious, they are expected to demonstrate the issue, and usually help to identify the root cause.

comment:6 by Maxim Dounin, 9 months ago

Resolution: worksforme
Status: newclosed

Feedback timeout.

Note: See TracTickets for help on using tickets.