#145 closed defect (fixed)
Client disconnect issue lasts for 24 hours
Reported by: | Andrew Benson | Owned by: | somebody |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | nginx-core | Version: | 0.8.x |
Keywords: | mail timeout | Cc: | ajbenson@… |
uname -a: | SunOS hostname 5.10 Generic_147440-10 sun4v sparc SUNW,Sun-Blade-T6340 | ||
nginx -V: |
nginx version: nginx/0.8.55
built by Sun C 5.11 SunOS_sparc 145354-05 2011/09/13 TLS SNI support disabled configure arguments: --with-cc=/opt/solstudio12.2/bin/cc --with-cpp=/opt/solstudio12.2/bin/cc --with-cc-opt='-xtarget=ultraT2plus -xO5 -I /usr/sfw/include' --with-ld-opt='-R/usr/sfw/lib -L/usr/sfw/lib' --without-http --with-mail --with-mail_ssl_module --error-log-path=logs/nginx.log --pid-path=/var/run/nginx.pid --with-debug |
Description
Running nginx-0.8.55 (also occurs with 0.7.67 and 1.0.14) as an imap/pop proxy only, we have experienced a very strange bug that seems to occur every 50 days or so and last just under 24 hours at each occurrence. It has occurred 4 times now.
Symptoms: imap/pop clients are frequently disconnected from the nginx proxy with the following message printed in the nginx.log:
2012/04/03 21:25:11 [info] 29999#0: *23561178 client timed out (145: Connection timed out) while proxying, client: redacted, server: 0.0.0.0:993, login: "redacted", upstream: 127.0.0.1:14300
First occurrence (Eastern Daylight Time EDT):
Begin: 2011/11/06 17:16:33
End: 2011/11/07 16:46:51
Second (Eastern Standard Time EST):
Begin: 2011/12/26 10:17:55
End: 2011/12/27 09:49:57
Third (EST):
Begin: 2012/02/14 03:21:09
End: 2012/02/15 02:52:01
Fourth (EDT):
Begin: 2012/04/03 21:25:11
End: 2012/04/04 20:55:26
The only way we've been able to resolve the issue has been to set the system clock to a point in time outside of the affected window. We believe the network is not at fault as we were able to reproduce the problem consistently between client and server on the same network segment with no firewall or load balancer in between.
We can reproduce the issue at any time by setting the system clock to a point within any one of the last occurrences. I can provide packet captures if it is useful but nothing seems out of the ordinary in them.
Attachments (2)
Change History (14)
by , 13 years ago
Attachment: | nginx.conf added |
---|
comment:1 by , 13 years ago
Status: | new → accepted |
---|
comment:2 by , 13 years ago
Could you please check what the following program prints in your environment if compiled by the same compiler with the same optimization flags?
#include <limits.h> #include <stdio.h> int main(void) { unsigned int a, b; a = 2000000000U; b = 3000000000U; if ((int) b - (int) a <= 0) { printf("not ok 1\n"); } else { printf("ok 1\n"); } if ((int) a <= 0) { printf("not ok 2\n"); } else { printf("ok 2\n"); } if ((int) b <= 0) { printf("ok 3\n"); } else { printf("not ok 3\n"); } if ((int) (b - a) <= 0) { printf("not ok 4\n"); } else { printf("ok 4\n"); } if (b - a > INT_MAX) { printf("not ok 5\n"); } else { printf("ok 5\n"); } a = 4000000000U; b = 1000000000U; if ((int) (b - a) <= 0) { printf("not ok 6\n"); } else { printf("ok 6\n"); } if (b - a > INT_MAX) { printf("not ok 7\n"); } else { printf("ok 7\n"); } return 0; }
Just in case, use something like this to compile and run:
cc -fast -xtarget=ultraT2plus -xO5 t.c ./a.out
Expected output is:
ok 1 ok 2 ok 3 ok 4 ok 5 ok 6 ok 7
I suspect at least the first test will fail.
comment:3 by , 13 years ago
$ /opt/solstudio12.2/bin/cc -fast -xtarget=ultraT2plus -xO5 t.c $ ./a.out ok 1 ok 2 ok 3 ok 4 ok 5 ok 6 ok 7
comment:5 by , 13 years ago
Ok, so either I was wrong assuming SunC does bad things due to signed integer underflow (just to clarify: it's allowed to, as this is implementation defined behaviour per C99) or it's not that easy to reproduce.
Could you please provide debug log with a failure? Preferably to reproduce with a single connection on otherwise idle system, and collect debug log on global levels, this will allow to see all timer-related operations.
comment:6 by , 13 years ago
Enabled debug logging on the system and set the system time to a point during the last failure: Wednesday, April 4, 2012 08:00:00 AM EDT.
I made a single connection, successfully authenticated, then received the error.
I'll attach the log file, the only thing I did was remove the password from the log file.
nginx is built 32-bit.
by , 13 years ago
Attachment: | nginx-debug.log added |
---|
comment:7 by , 13 years ago
Could you please test if the following patch helps?
diff -r 7a429135c4ae src/core/ngx_rbtree.c --- a/src/core/ngx_rbtree.c Tue Jul 19 13:59:47 2011 +0000 +++ b/src/core/ngx_rbtree.c Fri Apr 06 21:26:28 2012 +0400 @@ -135,8 +135,7 @@ ngx_rbtree_insert_timer_value(ngx_rbtree /* node->key < temp->key */ - p = ((ngx_rbtree_key_int_t) node->key - (ngx_rbtree_key_int_t) temp->key - < 0) + p = ((ngx_rbtree_key_int_t) (node->key - temp->key) < 0) ? &temp->left : &temp->right; if (*p == sentinel) { diff -r 7a429135c4ae src/event/ngx_event_timer.c --- a/src/event/ngx_event_timer.c Tue Jul 19 13:59:47 2011 +0000 +++ b/src/event/ngx_event_timer.c Fri Apr 06 21:26:28 2012 +0400 @@ -66,7 +66,7 @@ ngx_event_find_timer(void) ngx_mutex_unlock(ngx_event_timer_mutex); - timer = (ngx_msec_int_t) node->key - (ngx_msec_int_t) ngx_current_msec; + timer = (ngx_msec_int_t) (node->key - ngx_current_msec); return (ngx_msec_t) (timer > 0 ? timer : 0); } @@ -94,7 +94,7 @@ ngx_event_expire_timers(void) /* node->key <= ngx_current_time */ - if ((ngx_msec_int_t) node->key - (ngx_msec_int_t) ngx_current_msec <= 0) + if ((ngx_msec_int_t) (node->key - ngx_current_msec) <= 0) { ev = (ngx_event_t *) ((char *) node - offsetof(ngx_event_t, timer));
The patch is against 0.8.55 to simplify testing. It basically checks the same assumption as in the above test, but on real code.
comment:8 by , 13 years ago
It appears to help. I am not able to reproduce the problem with the patch applied.
comment:11 by , 13 years ago
Thank you for your help with this! Will this be incorporated into future releases or can I apply this patch to the 1.0 branch? Thanks!
comment:12 by , 13 years ago
This will be in the next release, which is expected to be 1.2.0 stable.
Looks like improper integer overflow handling in ngx_event_timer.c:
$ env TZ=America/New_York date -j -f "%F %T" "2011-11-06 17:16:33" +%s
1320617793
$ env TZ=America/New_York date -j -f "%F %T" "2011-12-26 10:17:55" +%s
1324912675
$ perl -le 'print ((1324912675 - 1320617793) / 86400)'
49.7092824074074
$ perl -le 'print ((0xffffffff / 1000) / 86400)'
49.7102696180556
And it lasts for 24 hours due to proxy_timeout being 24h by default.