Opened 3 years ago

Closed 21 months ago

#626 closed defect (fixed)

Open Socket Alerts on Reload

Reported by: www.google.com/accounts/o8/id?id=AItOawmwGdIb8wPZiAAOHZ0bybIC0rrZpdcgdNk Owned by: Valentin Bartenev <vbart@…>
Priority: minor Milestone:
Component: nginx-core Version: 1.6.x
Keywords: spdy, http2 Cc:
Sensitive: no
uname -a: Linux web1 2.6.32-431.23.3.el6.x86_64 #1 SMP Thu Jul 31 17:20:51 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.6.1

Description

I am using Nginx 1.6.1 from the official Nginx repository and recently (I think starting from 1.6.1) we started getting these errors on reload:

2014/09/15 11:00:13 [alert] 15749#0: *418441897 open socket #786 left in connection 17
2014/09/15 11:00:13 [alert] 15749#0: *25679726 open socket #932 left in connection 36
2014/09/15 11:00:13 [alert] 15749#0: *250563297 open socket #108 left in connection 64
2014/09/15 11:00:13 [alert] 15749#0: *255280874 open socket #467 left in connection 108
2014/09/15 11:00:13 [alert] 15749#0: *83647387 open socket #687 left in connection 116
2014/09/15 11:00:13 [alert] 15749#0: *172055940 open socket #582 left in connection 158
2014/09/15 11:00:13 [alert] 15749#0: *42380963 open socket #370 left in connection 162
2014/09/15 11:00:13 [alert] 15749#0: *187857605 open socket #834 left in connection 196
2014/09/15 11:00:13 [alert] 15749#0: *96301018 open socket #290 left in connection 226
2014/09/15 11:00:13 [alert] 15749#0: *229257100 open socket #520 left in connection 232
2014/09/15 11:00:13 [alert] 15749#0: *310210706 open socket #453 left in connection 251
2014/09/15 11:00:13 [alert] 15749#0: *273001877 open socket #615 left in connection 307
2014/09/15 11:00:13 [alert] 15749#0: *213129612 open socket #329 left in connection 312
2014/09/15 11:00:13 [alert] 15749#0: *60552692 open socket #839 left in connection 347

We got about 650 of these errors last time we reloaded. I don't remember ever seeing these errors before. Also it doesn't happen on every reload, only if nginx has been running longer (for now I know that it happens if it's running for at least 24h).

I don't know if this is of any help for you, but if I can help you with attaching a configuration file, or anything else feel free to ask.

Change History (18)

comment:1 Changed 3 years ago by vbart

Yes, please, provide a configuration file.

comment:2 Changed 3 years ago by www.google.com/accounts/o8/id?id=AItOawmwGdIb8wPZiAAOHZ0bybIC0rrZpdcgdNk

For some reason I cannot attach a file, so here it is:

user  nginx;
worker_processes  16;
worker_rlimit_nofile 100000;

error_log  /var/log/nginx/error.log;
pid        /var/run/nginx.pid;

events {
    worker_connections  10240;
    use epoll;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    #access_log  /var/log/nginx/access.log  main;
    access_log  off;

    sendfile        on;
    tcp_nopush     on;
    keepalive_timeout  30;
    reset_timedout_connection on;
    tcp_nodelay on;
    server_tokens off;

    gzip  on;
    gzip_http_version 1.1;
    gzip_vary on;
    gzip_comp_level 6;
    gzip_proxied any;
    gzip_types text/plain text/css application/json application/x-javascript text/xml application/xml application/xml+rss text/javascript;
    gzip_buffers 16 8k;

    charset utf-8;

    upstream php1  {
    	server   unix:/dev/shm/.php1.socket;
    }
    upstream php2  {
    	server   unix:/dev/shm/.php2.socket;
    }
    upstream php3  {
    	server   unix:/dev/shm/.php3.socket;
    }

    ssl_certificate     /etc/pki/tls/certs/site.com.crt;
    ssl_certificate_key /etc/pki/tls/certs/site.com.key;
    ssl_ciphers         RC4-SHA:AES128-SHA:AES256-SHA:CAMELLIA128-SHA:HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;


    limit_req_zone  $binary_remote_addr  zone=one:10m   rate=1r/s;
    limit_req_zone  $uri  zone=two:10m   rate=200r/s;

    include /etc/nginx/conf.d/*.conf;
}

The rest of the files are virtual hosts, which have nothing special in them.

comment:3 Changed 3 years ago by vbart

Nothing special means that you don't use SPDY, right?

comment:4 Changed 3 years ago by www.google.com/accounts/o8/id?id=AItOawmwGdIb8wPZiAAOHZ0bybIC0rrZpdcgdNk

I forgot about SPDY, that is special and we do use it.

comment:5 Changed 3 years ago by vbart

What is about dynamic resolving? Do you have the resolver directive? There were a number of bugs in it found and fixed, that could result such problems.

Actually, what you think as "nothing special" can contain a lot of special.

Could you test with spdy turned off? Also, if you are using spdy, it's recommended to switch to the mainline branch.

Last edited 3 years ago by vbart (previous) (diff)

comment:6 follow-up: Changed 3 years ago by www.google.com/accounts/o8/id?id=AItOawmwGdIb8wPZiAAOHZ0bybIC0rrZpdcgdNk

We don't use resolver anywhere.

By nothing special, we basically have a few HTTP, SSL and SPDY virtual hosts, which have a few locations with some IF statements as well, a few return 301 url;, a few locations have a proxy_pass to an external web server and we have rate limiting on one location; that is basically it.

Regarding SPDY, why is it recommended to use mainline instead of stable?

Last edited 3 years ago by www.google.com/accounts/o8/id?id=AItOawmwGdIb8wPZiAAOHZ0bybIC0rrZpdcgdNk (previous) (diff)

comment:7 Changed 3 years ago by www.google.com/accounts/o8/id?id=AItOawmwGdIb8wPZiAAOHZ0bybIC0rrZpdcgdNk

Also I will try if the same thing will happen tomorrow when I reload and if it does, I will turn off SPDY and try the next day.

comment:8 in reply to: ↑ 6 Changed 3 years ago by vbart

Replying to Strahinja Kustudić <strahinjak@nordeus.eu>:

Regarding SPDY, why is it recommended to use mainline instead of stable?

Because the mainline branch receives more bugfixes, which is especially critical when you are using new/experimental functionality. See also: http://nginx.com/blog/nginx-1-6-1-7-released/

comment:9 follow-up: Changed 3 years ago by www.google.com/accounts/o8/id?id=AItOawmwGdIb8wPZiAAOHZ0bybIC0rrZpdcgdNk

I wanted to ask about that blogpost, but I didn't want to derail this bug report. I have mixed feelings about it and it doesn't make that much sense tbh. You are practically saying that stable version has more bugs than the mainline version, even thought the mainline version is introducing new functionality and the stable one isn't. Why have a stable version in the first place, if it is less stable?

So SPDY is an experimental feature in the stable version? Right?

comment:10 in reply to: ↑ 9 Changed 3 years ago by vbart

Replying to Strahinja Kustudić <strahinjak@nordeus.eu>:

I wanted to ask about that blogpost, but I didn't want to derail this bug report. I have mixed feelings about it and it doesn't make that much sense tbh. You are practically saying that stable version has more bugs than the mainline version, even thought the mainline version is introducing new functionality and the stable one isn't. Why have a stable version in the first place, if it is less stable?

There're at least two goals for the stable branch:

  • Avoid changes in internal API, don't break 3rd-party modules;
  • Don't break anything: only critical fixes (usually for annoying bugs, that can't be avoided by configuration changes) and with some time lag after being fixed in the mainline branch.

The stable doesn't imply bug-free. And any bug in spdy can be avoided by turning it off.

So SPDY is an experimental feature in the stable version? Right?

The SPDY module is currently experimental regardless of version (as well as the protocol itself is experimental too).

comment:11 Changed 3 years ago by www.google.com/accounts/o8/id?id=AItOawmwGdIb8wPZiAAOHZ0bybIC0rrZpdcgdNk

I reloaded today again and the same thing happened. I disabled SPDY now and will try tomorrow what happens.

I also noticed one interesting thing, which might be nothing, but it is a strange configuration. We have a few SSL virtual hosts using the same wild-card SSL certificate, but only one virtual host was using SPDY.

comment:12 Changed 3 years ago by www.google.com/accounts/o8/id?id=AItOawmwGdIb8wPZiAAOHZ0bybIC0rrZpdcgdNk

Sorry for not getting back sooner, but you were right. It was an issue with SPDY. After I disabled it I didn't have any problems. For now I'm keeping it disabled, since we didn't notice any performance gain anyway.

comment:13 Changed 3 years ago by vetrakazabijak@…

I can confirm this still exists at 1.7.10 with SPDY on and we are able to reproduce it (it happens every time when there are open connections using SPDY & HTTPS and we reload the configuration). Whenever you'll be ready to get deep into debugging this let me know and I can provide all the info you might need.

comment:14 Changed 2 years ago by mdounin

  • Keywords spdy added

comment:15 Changed 2 years ago by lowks@…

I can confirm this bug still exists even with http2. We encountered using vanilla 1.9.6 and 1.9.7 nginx. with the following config. Steps to reproduce 1. curl a large file from the server 2. issue a reload during that curl session 3. Things cock up.:

#user  nobody;
worker_processes  1;

#error_log  logs/error.log;
#error_log  logs/error.log  notice;
#error_log  logs/error.log  info;
#pid        logs/nginx.pid;

events {
    worker_connections  2048;
}

http {
    include       mime.types;
    default_type  application/octet-stream;
    sendfile        on;
    keepalive_timeout  65;
    gzip on;

    # HTTPS server
    #
    server {
        listen       443 ssl http2;
        server_name  localhost;

        ssl_certificate      /home/lowks/temp_certs/myssl.crt;
        ssl_certificate_key  /home/lowks/temp_certs/myssl.key;

        ssl_session_cache    shared:SSL:1m;
        ssl_session_timeout  5m;

        ssl_ciphers  HIGH:!aNULL:!MD5;
        ssl_prefer_server_ciphers  on;

        location / {
            root   /home/lowks/Downloads;
            autoindex on;
        }
    }
}

Every time a reload is done, errors like these show up:

2015/12/02 11:36:17 [alert] 3608#0: *1 open socket #3 left in connection 2
2015/12/02 11:36:17 [alert] 3608#0: aborting
2015/12/02 11:37:39 [notice] 3627#0: signal process started
2015/12/02 11:37:46 [notice] 3629#0: signal process started
2015/12/02 11:37:46 [alert] 3628#0: *2 open socket #4 left in connection 2
2015/12/02 11:37:46 [alert] 3628#0: aborting
Last edited 2 years ago by lowks@… (previous) (diff)

comment:16 Changed 2 years ago by vbart

  • Status changed from new to accepted

comment:17 Changed 22 months ago by vbart

  • Keywords http2 added

comment:18 Changed 21 months ago by Valentin Bartenev <vbart@…>

  • Owner set to Valentin Bartenev <vbart@…>
  • Resolution set to fixed
  • Status changed from accepted to closed

In 6412:4ba91a4c66a3/nginx:

HTTP/2: implemented per request timeouts (closes #626).

Previously, there were only three timeouts used globally for the whole HTTP/2
connection:

  1. Idle timeout for inactivity when there are no streams in processing (the "http2_idle_timeout" directive);
  1. Receive timeout for incomplete frames when there are no streams in processing (the "http2_recv_timeout" directive);
  1. Send timeout when there are frames waiting in the output queue (the "send_timeout" directive on a server level).

Reaching one of these timeouts leads to HTTP/2 connection close.

This left a number of scenarios when a connection can get stuck without any
processing and timeouts:

  1. A client has sent the headers block partially so nginx starts processing a new stream but cannot continue without the rest of HEADERS and/or CONTINUATION frames;
  1. When nginx waits for the request body;
  1. All streams are stuck on exhausted connection or stream windows.

The first idea that was rejected was to detect when the whole connection
gets stuck because of these situations and set the global receive timeout.
The disadvantage of such approach would be inconsistent behaviour in some
typical use cases. For example, if a user never replies to the browser's
question about where to save the downloaded file, the stream will be
eventually closed by a timeout. On the other hand, this will not happen
if there's some activity in other concurrent streams.

Now almost all the request timeouts work like in HTTP/1.x connections, so
the "client_header_timeout", "client_body_timeout", and "send_timeout" are
respected. These timeouts close the request.

The global timeouts work as before.

Previously, the c->write->delayed flag was abused to avoid setting timeouts on
stream events. Now, the "active" and "ready" flags are manipulated instead to
control the processing of individual streams.

Note: See TracTickets for help on using tickets.