Opened 12 years ago

Closed 12 years ago

Last modified 12 years ago

#145 closed defect (fixed)

Client disconnect issue lasts for 24 hours

Reported by: Andrew Benson Owned by: somebody
Priority: major Milestone:
Component: nginx-core Version: 0.8.x
Keywords: mail timeout Cc: ajbenson@…
uname -a: SunOS hostname 5.10 Generic_147440-10 sun4v sparc SUNW,Sun-Blade-T6340
nginx -V: nginx version: nginx/0.8.55
built by Sun C 5.11 SunOS_sparc 145354-05 2011/09/13
TLS SNI support disabled
configure arguments: --with-cc=/opt/solstudio12.2/bin/cc --with-cpp=/opt/solstudio12.2/bin/cc --with-cc-opt='-xtarget=ultraT2plus -xO5 -I /usr/sfw/include' --with-ld-opt='-R/usr/sfw/lib -L/usr/sfw/lib' --without-http --with-mail --with-mail_ssl_module --error-log-path=logs/nginx.log --pid-path=/var/run/nginx.pid --with-debug

Description

Running nginx-0.8.55 (also occurs with 0.7.67 and 1.0.14) as an imap/pop proxy only, we have experienced a very strange bug that seems to occur every 50 days or so and last just under 24 hours at each occurrence. It has occurred 4 times now.

Symptoms: imap/pop clients are frequently disconnected from the nginx proxy with the following message printed in the nginx.log:
2012/04/03 21:25:11 [info] 29999#0: *23561178 client timed out (145: Connection timed out) while proxying, client: redacted, server: 0.0.0.0:993, login: "redacted", upstream: 127.0.0.1:14300

First occurrence (Eastern Daylight Time EDT):
Begin: 2011/11/06 17:16:33
End: 2011/11/07 16:46:51

Second (Eastern Standard Time EST):
Begin: 2011/12/26 10:17:55
End: 2011/12/27 09:49:57

Third (EST):
Begin: 2012/02/14 03:21:09
End: 2012/02/15 02:52:01

Fourth (EDT):
Begin: 2012/04/03 21:25:11
End: 2012/04/04 20:55:26

The only way we've been able to resolve the issue has been to set the system clock to a point in time outside of the affected window. We believe the network is not at fault as we were able to reproduce the problem consistently between client and server on the same network segment with no firewall or load balancer in between.

We can reproduce the issue at any time by setting the system clock to a point within any one of the last occurrences. I can provide packet captures if it is useful but nothing seems out of the ordinary in them.

Attachments (2)

nginx.conf (1.4 KB ) - added by Andrew Benson 12 years ago.
nginx-debug.log (79.8 KB ) - added by Andrew Benson 12 years ago.

Download all attachments as: .zip

Change History (14)

by Andrew Benson, 12 years ago

Attachment: nginx.conf added

comment:1 by Maxim Dounin, 12 years ago

Status: newaccepted

Looks like improper integer overflow handling in ngx_event_timer.c:

$ env TZ=America/New_York date -j -f "%F %T" "2011-11-06 17:16:33" +%s
1320617793
$ env TZ=America/New_York date -j -f "%F %T" "2011-12-26 10:17:55" +%s
1324912675
$ perl -le 'print ((1324912675 - 1320617793) / 86400)'
49.7092824074074
$ perl -le 'print ((0xffffffff / 1000) / 86400)'
49.7102696180556

And it lasts for 24 hours due to proxy_timeout being 24h by default.

comment:2 by Maxim Dounin, 12 years ago

Could you please check what the following program prints in your environment if compiled by the same compiler with the same optimization flags?

#include <limits.h>
#include <stdio.h>

int
main(void)
{
    unsigned int a, b;

    a = 2000000000U;
    b = 3000000000U;

    if ((int) b - (int) a <= 0) {
        printf("not ok 1\n");
    } else {
        printf("ok 1\n");
    }

    if ((int) a <= 0) {
        printf("not ok 2\n");
    } else {
        printf("ok 2\n");
    }

    if ((int) b <= 0) {
        printf("ok 3\n");
    } else {
        printf("not ok 3\n");
    }

    if ((int) (b - a) <= 0) {
        printf("not ok 4\n");
    } else {
        printf("ok 4\n");
    }

    if (b - a > INT_MAX) {
        printf("not ok 5\n");
    } else {
        printf("ok 5\n");
    }

    a = 4000000000U;
    b = 1000000000U;

    if ((int) (b - a) <= 0) {
        printf("not ok 6\n");
    } else {
        printf("ok 6\n");
    }

    if (b - a > INT_MAX) {
        printf("not ok 7\n");
    } else {
        printf("ok 7\n");
    }

    return 0;
}

Just in case, use something like this to compile and run:

cc -fast -xtarget=ultraT2plus -xO5 t.c
./a.out

Expected output is:

ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
ok 7

I suspect at least the first test will fail.

comment:3 by Andrew Benson, 12 years ago

$ /opt/solstudio12.2/bin/cc -fast -xtarget=ultraT2plus -xO5 t.c
$ ./a.out 
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
ok 7

comment:4 by Andrew Benson, 12 years ago

Thank you for the prompt response by the way.

comment:5 by Maxim Dounin, 12 years ago

Ok, so either I was wrong assuming SunC does bad things due to signed integer underflow (just to clarify: it's allowed to, as this is implementation defined behaviour per C99) or it's not that easy to reproduce.

Could you please provide debug log with a failure? Preferably to reproduce with a single connection on otherwise idle system, and collect debug log on global levels, this will allow to see all timer-related operations.

comment:6 by Andrew Benson, 12 years ago

Enabled debug logging on the system and set the system time to a point during the last failure: Wednesday, April 4, 2012 08:00:00 AM EDT.

I made a single connection, successfully authenticated, then received the error.

I'll attach the log file, the only thing I did was remove the password from the log file.

nginx is built 32-bit.

by Andrew Benson, 12 years ago

Attachment: nginx-debug.log added

comment:7 by Maxim Dounin, 12 years ago

Could you please test if the following patch helps?

diff -r 7a429135c4ae src/core/ngx_rbtree.c
--- a/src/core/ngx_rbtree.c	Tue Jul 19 13:59:47 2011 +0000
+++ b/src/core/ngx_rbtree.c	Fri Apr 06 21:26:28 2012 +0400
@@ -135,8 +135,7 @@ ngx_rbtree_insert_timer_value(ngx_rbtree
 
         /*  node->key < temp->key */
 
-        p = ((ngx_rbtree_key_int_t) node->key - (ngx_rbtree_key_int_t) temp->key
-              < 0)
+        p = ((ngx_rbtree_key_int_t) (node->key - temp->key) < 0)
             ? &temp->left : &temp->right;
 
         if (*p == sentinel) {
diff -r 7a429135c4ae src/event/ngx_event_timer.c
--- a/src/event/ngx_event_timer.c	Tue Jul 19 13:59:47 2011 +0000
+++ b/src/event/ngx_event_timer.c	Fri Apr 06 21:26:28 2012 +0400
@@ -66,7 +66,7 @@ ngx_event_find_timer(void)
 
     ngx_mutex_unlock(ngx_event_timer_mutex);
 
-    timer = (ngx_msec_int_t) node->key - (ngx_msec_int_t) ngx_current_msec;
+    timer = (ngx_msec_int_t) (node->key - ngx_current_msec);
 
     return (ngx_msec_t) (timer > 0 ? timer : 0);
 }
@@ -94,7 +94,7 @@ ngx_event_expire_timers(void)
 
         /* node->key <= ngx_current_time */
 
-        if ((ngx_msec_int_t) node->key - (ngx_msec_int_t) ngx_current_msec <= 0)
+        if ((ngx_msec_int_t) (node->key - ngx_current_msec) <= 0)
         {
             ev = (ngx_event_t *) ((char *) node - offsetof(ngx_event_t, timer));
 

The patch is against 0.8.55 to simplify testing. It basically checks the same assumption as in the above test, but on real code.

comment:8 by Andrew Benson, 12 years ago

It appears to help. I am not able to reproduce the problem with the patch applied.

comment:9 by Maxim Dounin, 12 years ago

In [4577/nginx]:

(The changeset message doesn't reference this ticket)

comment:10 by Maxim Dounin, 12 years ago

Resolution: fixed
Status: acceptedclosed

Fix committed, thanks.

comment:11 by Andrew Benson, 12 years ago

Thank you for your help with this! Will this be incorporated into future releases or can I apply this patch to the 1.0 branch? Thanks!

comment:12 by Maxim Dounin, 12 years ago

This will be in the next release, which is expected to be 1.2.0 stable.

Note: See TracTickets for help on using tickets.