Opened 15 months ago
Closed 14 months ago
#2513 closed defect (worksforme)
nginx resolver issue
Reported by: | Owned by: | ||
---|---|---|---|
Priority: | major | Milestone: | |
Component: | nginx-core | Version: | 1.21.x |
Keywords: | Cc: | ||
uname -a: | Linux ec1-prd-k8s-nginxfront-1 5.15.0-1036-aws #40~20.04.1-Ubuntu SMP Mon Apr 24 00:21:13 UTC 2023 x86_64 Linux | ||
nginx -V: |
built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r4)
built with OpenSSL 3.0.7 1 Nov 2022 TLS SNI support enabled configure arguments: --prefix=/opt/nginx/nginx --with-cc-opt=-O2 --add-module=../ngx_devel_kit-0.3.1 --add-module=../echo-nginx-module-0.62 --add-module=../xss-nginx-module-0.06 --add-module=../ngx_coolkit-0.2 --add-module=../set-misc-nginx-module-0.33 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.09 --add-module=../srcache-nginx-module-0.32 --add-module=../ngx_lua-0.10.21 --add-module=../ngx_lua_upstream-0.07 --add-module=../headers-more-nginx-module-0.33 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.19 --add-module=../redis2-nginx-module-0.15 --add-module=../redis-nginx-module-0.3.9 --add-module=../rds-json-nginx-module-0.15 --add-module=../rds-csv-nginx-module-0.09 --add-module=../ngx_stream_lua-0.0.11 --with-ld-opt=-Wl,-rpath,/opt/nginx/luajit/lib --sbin-path=/opt/nginx/sbin/nginx --modules-path=/opt/nginx/etc/nginx/modules --conf-path=/opt/nginx/etc/nginx/nginx.conf --error-log-path=/opt/nginx/logs/error.log --pid-path=/opt/nginx/run/nginx.pid --lock-path=/opt/nginx/run/nginx.lock --user=nginx --group=nginx --with-select_module --with-poll_module --with-threads --with-file-aio --with-http_ssl_module --with-http_v2_module --with-http_realip_module --with-http_addition_module --with-http_xslt_module=dynamic --with-http_image_filter_module=dynamic --with-http_geoip_module=dynamic --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_auth_request_module --with-http_random_index_module --with-http_secure_link_module --with-http_degradation_module --with-http_slice_module --with-http_stub_status_module --with-http_perl_module=dynamic --http-log-path=/opt/nginx/logs/access.log --http-client-body-temp-path=/opt/nginx/cache/client_temp --http-proxy-temp-path=/opt/nginx/cache/proxy_temp --http-fastcgi-temp-path=/opt/nginx/cache/fastcgi_temp --http-uwsgi-temp-path=/opt/nginx/cache/uwsgi_temp --http-scgi-temp-path=/opt/nginx/cache/scgi_temp --with-perl_modules_path=/usr/lib/perl5/vendor_perl --with-mail=dynamic --with-mail_ssl_module --with-stream=dynamic --with-stream_ssl_module --with-stream_realip_module --with-stream_geoip_module=dynamic --with-stream_ssl_preread_module --with-pcre-jit --add-module=/opt/nginx/openresty-1.21.4.1/../ModSecurity-nginx --add-module=/opt/nginx/openresty-1.21.4.1/../nginx-module-vts --add-module=/opt/nginx/openresty-1.21.4.1/../ngx_brotli --add-dynamic-module=/opt/nginx/openresty-1.21.4.1/../nginx-opentracing/opentracing --with-compat --with-stream |
Description
nginx (openresty 1.21.4.1) is compiled in docker alpine 3.17, in nginx.conf in http block I have a delcaration of resolver
resolver 192.168.1.1 ipv6=off; resolver_timeout 10s;
Sometime (aproximetly one time for week) nginx resolver behave strange. In upstream I have a name which is refreshed every 300s to the new A records. Sometime nginx does not resolve this name to new records and still resolve to old A records. In this situation reload fix the issue, but this is not a solution. This happends only for one name at a time which is refreshed.
Change History (6)
comment:1 by , 15 months ago
comment:2 by , 15 months ago
I can not reproduce this problem because this is a production system and in this place I can not use vanila nginx.
comment:3 by , 15 months ago
From the symptoms I would suggest one of the following:
- There is a response with large TTL unexpectedly returned by your DNS server, which results in DNS response being cached for a long time. Consider using
resolver ... valid=300s;
to see if it helps (see docs for details). - There is a bug in the 3rd party patches or modules you are using, such as memory corruption, which results in incorrect TTL being used. I'm afraid it's not possible to rule out this option without reproducing the issue with vanilla nginx.
- There is a previously unknown bug in nginx which results in incorrect TTL being used.
Unless there are more specific reasons to assume the latter is the case, I would rather focus on the first two options.
comment:4 by , 15 months ago
I tryed "valid" option and this does not take effect. Openresty does not touch vanila code where is embedded resolver code, they only use moduses support as you explain here https://www.nginx.com/blog/nginx-dynamic-modules-how-they-work/, and I suggest that problem is rather on third or second option.
comment:5 by , 15 months ago
I tryed "valid" option and this does not take effect.
Thanks for checking.
Openresty does not touch vanila code where is embedded resolver code, they only use moduses support as you explain here https://www.nginx.com/blog/nginx-dynamic-modules-how-they-work/, and I suggest that problem is rather on third or second option.
Unfortunately, Openresty is known to break nginx API in multiple places, and was seen many times to break things even if the particular parts of the code are not directly modified. Not to mention that bugs in modules can result in arbitrary issues in seemingly unrelated areas, and this is something seen multiple times with various 3rd party modules.
As such, for more or less any bug report the first thing we are asking is to demonstrate that the issue being reported is an issue in nginx itself, and wasn't the one introduced by 3rd party modifications or 3rd party modules.
The most simple way to do this is to ensure that the issue can be observed without any 3rd party modifications and 3rd party modules.
If it's not possible, an alternative approach would be to debug and analyze the issue in place. Consider collecting debug logs which demonstrate the issue. The most interesting lines are the ones related to resolving, such as resolving a name when connecting to an upstream:
... 2023/06/23 02:27:17 [debug] 39365#100154: *1 http init upstream, client timer: 0 2023/06/23 02:27:17 [debug] 39365#100154: *1 kevent set event: 6: ft:-2 fl:0025 2023/06/23 02:27:17 [debug] 39365#100154: *1 http script copy: "Host" 2023/06/23 02:27:17 [debug] 39365#100154: *1 http script var: "mdounin.ru" 2023/06/23 02:27:17 [debug] 39365#100154: *1 http script copy: "Connection" 2023/06/23 02:27:17 [debug] 39365#100154: *1 http script copy: "close" 2023/06/23 02:27:17 [debug] 39365#100154: *1 http script copy: "" 2023/06/23 02:27:17 [debug] 39365#100154: *1 http script copy: "" 2023/06/23 02:27:17 [debug] 39365#100154: *1 http proxy header: "User-Agent: curl/8.0.1" 2023/06/23 02:27:17 [debug] 39365#100154: *1 http proxy header: "Accept: */*" 2023/06/23 02:27:17 [debug] 39365#100154: *1 http proxy header: "GET / HTTP/1.0 Host: mdounin.ru Connection: close User-Agent: curl/8.0.1 Accept: */* " 2023/06/23 02:27:17 [debug] 39365#100154: *1 http cleanup add: 21688E6C 2023/06/23 02:27:17 [debug] 39365#100154: malloc: 21676580:120 2023/06/23 02:27:17 [debug] 39365#100154: resolve: "mdounin.ru" 2023/06/23 02:27:17 [debug] 39365#100154: malloc: 21676600:116 2023/06/23 02:27:17 [debug] 39365#100154: malloc: 216ADEA0:10 2023/06/23 02:27:17 [debug] 39365#100154: malloc: 216A6EC0:28 2023/06/23 02:27:17 [debug] 39365#100154: resolve: "mdounin.ru" A 2090 2023/06/23 02:27:17 [debug] 39365#100154: UDP socket 9 2023/06/23 02:27:17 [debug] 39365#100154: connect to 8.8.8.8:53, fd:9 #2 2023/06/23 02:27:17 [debug] 39365#100154: kevent set event: 9: ft:-1 fl:0025 2023/06/23 02:27:17 [debug] 39365#100154: send: fd:9 28 of 28 2023/06/23 02:27:17 [debug] 39365#100154: malloc: 2166E780:56 2023/06/23 02:27:17 [debug] 39365#100154: event timer add: -1: 30000:673867318 2023/06/23 02:27:17 [debug] 39365#100154: event timer add: -1: 5000:673842318 2023/06/23 02:27:17 [debug] 39365#100154: *1 http finalize request: -4, "/?" a:1, c:2 2023/06/23 02:27:17 [debug] 39365#100154: *1 http request count:2 blk:0 2023/06/23 02:27:17 [debug] 39365#100154: timer delta: 30 2023/06/23 02:27:17 [debug] 39365#100154: worker cycle 2023/06/23 02:27:17 [debug] 39365#100154: kevent timer: 5000, changes: 2 2023/06/23 02:27:17 [debug] 39365#100154: kevent events: 1 2023/06/23 02:27:17 [debug] 39365#100154: kevent: 6: ft:-2 fl:0020 ff:00000000 d:49032 ud:217CF828 2023/06/23 02:27:17 [debug] 39365#100154: *1 http run request: "/?" 2023/06/23 02:27:17 [debug] 39365#100154: *1 http upstream check client, write event:1, "/" 2023/06/23 02:27:17 [debug] 39365#100154: timer delta: 0 2023/06/23 02:27:17 [debug] 39365#100154: worker cycle 2023/06/23 02:27:17 [debug] 39365#100154: kevent timer: 5000, changes: 0 2023/06/23 02:27:17 [debug] 39365#100154: kevent events: 1 2023/06/23 02:27:17 [debug] 39365#100154: kevent: 9: ft:-1 fl:0020 ff:00000000 d:44 ud:217C78A0 2023/06/23 02:27:17 [debug] 39365#100154: recv: fd:9 44 of 4096 2023/06/23 02:27:17 [debug] 39365#100154: resolver DNS response 2090 fl:8180 1/1/0/0 2023/06/23 02:27:17 [debug] 39365#100154: resolver DNS response qt:1 cl:1 2023/06/23 02:27:17 [debug] 39365#100154: malloc: 216F1BB0:11 2023/06/23 02:27:17 [debug] 39365#100154: resolver qs:mdounin.ru 2023/06/23 02:27:17 [debug] 39365#100154: resolver naddrs:1 cname:00000000 ttl:3600 2023/06/23 02:27:17 [debug] 39365#100154: *1 http upstream resolve: "/?" 2023/06/23 02:27:17 [debug] 39365#100154: *1 name was resolved to 144.76.182.204 2023/06/23 02:27:17 [debug] 39365#100154: resolve name done: 0 2023/06/23 02:27:17 [debug] 39365#100154: event timer del: -1: 673867318 2023/06/23 02:27:17 [debug] 39365#100154: resolver expire 2023/06/23 02:27:17 [debug] 39365#100154: event timer del: -1: 673842318 2023/06/23 02:27:17 [debug] 39365#100154: *1 get rr peer, try: 1 2023/06/23 02:27:17 [debug] 39365#100154: *1 stream socket 10 2023/06/23 02:27:17 [debug] 39365#100154: *1 connect to 144.76.182.204:80, fd:10 #3 ...
Note the following lines, which show the name being resolved, DNS request being sent, and the response received, including TTL:
... 2023/06/23 02:27:17 [debug] 39365#100154: resolve: "mdounin.ru" ... 2023/06/23 02:27:17 [debug] 39365#100154: resolve: "mdounin.ru" A 2090 ... 2023/06/23 02:27:17 [debug] 39365#100154: recv: fd:9 44 of 4096 2023/06/23 02:27:17 [debug] 39365#100154: resolver DNS response 2090 fl:8180 1/1/0/0 2023/06/23 02:27:17 [debug] 39365#100154: resolver DNS response qt:1 cl:1 2023/06/23 02:27:17 [debug] 39365#100154: malloc: 216F1BB0:11 2023/06/23 02:27:17 [debug] 39365#100154: resolver qs:mdounin.ru 2023/06/23 02:27:17 [debug] 39365#100154: resolver naddrs:1 cname:00000000 ttl:3600 ...
And using a cached name:
... 2023/06/23 02:46:34 [debug] 39365#100154: *4 http cleanup add: 21667E6C 2023/06/23 02:46:34 [debug] 39365#100154: malloc: 21676580:120 2023/06/23 02:46:34 [debug] 39365#100154: resolve: "mdounin.ru" 2023/06/23 02:46:34 [debug] 39365#100154: resolve cached 2023/06/23 02:46:34 [debug] 39365#100154: *4 http upstream resolve: "/?" 2023/06/23 02:46:34 [debug] 39365#100154: *4 name was resolved to 144.76.182.204 2023/06/23 02:46:34 [debug] 39365#100154: resolve name done: 0 2023/06/23 02:46:34 [debug] 39365#100154: resolver expire 2023/06/23 02:46:34 [debug] 39365#100154: *4 get rr peer, try: 1 2023/06/23 02:46:34 [debug] 39365#100154: *4 stream socket 10 2023/06/23 02:46:34 [debug] 39365#100154: *4 connect to 144.76.182.204:80, fd:10 #5 ...
Note though that debug logs might be huge, and collecting them over a long period of time might be a challenge.
While debug logs does not guarantee the issue cause will be immediately obvious, they are expected to demonstrate the issue, and usually help to identify the root cause.
Are you able to reproduce this with vanilla nginx, without any 3rd party patches and modules?