Opened 8 years ago

Closed 8 years ago

#1087 closed defect (invalid)

System crashes that were solved by increasing server_names_hash_max_size

Reported by: langemeijer@…
Owned by:
Priority: major
Milestone:
Component: other
Version: 1.11.x
Keywords:
Cc:
uname -a: Linux lb1 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u2 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.11.4
built with OpenSSL 1.0.2h 3 May 2016
TLS SNI support enabled
configure arguments: --with-cc-opt='-g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2' --with-ld-opt=-Wl,-z,relro --prefix=/usr/share/nginx --conf-path=/etc/nginx/nginx.conf --http-log-path=/var/log/nginx/access.log --error-log-path=/var/log/nginx/error.log --lock-path=/var/lock/nginx.lock --pid-path=/run/nginx.pid --http-client-body-temp-path=/var/lib/nginx/body --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-proxy-temp-path=/var/lib/nginx/proxy --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-pcre-jit --with-ipv6 --with-http_ssl_module --with-http_stub_status_module --with-http_realip_module --with-http_auth_request_module --with-http_gunzip_module --with-file-aio --with-threads --with-http_v2_module --with-http_addition_module --with-http_dav_module --with-http_flv_module --with-http_geoip_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_image_filter_module --with-http_mp4_module --with-http_perl_module --with-http_random_index_module --with-http_secure_link_module --with-http_sub_module --with-http_xslt_module --with-mail --with-mail_ssl_module

Description

For a couple of months I have had system crashes that I could not explain. I got "BUG: unable to handle kernel paging request" messages on the console from many different processes, including kworker, keepalived and other seemingly random ones. Over the last week the crashes became more frequent, eventually occurring every 10 hours or so.

The machine this happened on is a simple keepalived / nginx reverse proxy that also does SSL termination for my cluster of webservers. Because keepalived failed over to our secondary proxy we still had a working cluster, but a crash every 10 hours was getting annoying.

I started investigating and noticed the "nginx: [warn] could not build optimal server_names_hash, you should increase either server_names_hash_max_size:" message.

Obviously that's what I fixed first. Not because I thought this would be the solution to my problem, just because it seemed good housekeeping to do so immediately.

But: This seems to have fixed my problem.

I have made no other configuration changes to the system. Obviously there have been numerous reboots, and it is a working production system taking unknown, varying load from the internet that I cannot control.

The more frequent crashes recently could be explained by the fact that gradually more server{} configuration clauses have been added for ssl termination.

I am not running a huge number of server{} blocks: 100 or so, with 2-8 server_names each. Some of them are wildcards, most are exact host names.

server_names_hash_max_size was default (512?), I set it to 1024.
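
For reference, the change amounts to a single directive. A minimal sketch, assuming it is placed in the http{} context of /etc/nginx/nginx.conf (the conf-path from the build above); the rest of the configuration is omitted and the surrounding layout is illustrative only:

```nginx
http {
    # Raised from the default, as described above.
    server_names_hash_max_size 1024;

    # ... existing server{} blocks stay unchanged ...
}
```

After editing, `nginx -t` checks the configuration and prints any remaining warnings about the hash before a reload.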

It scares me that an ignored warning could potentially crash an entire system.

Also, I've read the documentation on hash_max_size and hash_bucket_size, but I find it hard to grasp. Why couldn't nginx just allocate enough memory to hold the server_names I've configured?

Change History (1)

comment:1 by Maxim Dounin, 8 years ago

Resolution: invalid
Status: new → closed

The message from the kernel suggests you are facing either a hardware problem (likely bad RAM) or a kernel bug.

Changing server_names_hash_max_size is expected to change memory allocation and usage patterns, and this can result in the problem no longer being triggered. It is not a good idea to rely on this, though, as the problem can easily reappear after other unrelated changes. You may want to start with a good memory test to see whether it finds anything.
