Opened 3 weeks ago

Last modified 3 weeks ago

#2371 new defect

nginx + POST reqsuest = 408 for 0.1% of users, probably SSL related

Reported by: woodholly@… Owned by:
Priority: minor Milestone:
Component: nginx-core Version: 1.23.x
Keywords: post 408 Cc:
uname -a: Linux so3 4.19.0-21-amd64 #1 SMP Debian 4.19.249-2 (2022-06-30) x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.23.1
built by gcc 8.3.0 (Debian 8.3.0-6)
built with LibreSSL 3.5.3
TLS SNI support enabled
configure arguments: --with-openssl=/root/soft/nginx/libressl-3.5.3/ --with-http_geoip_module=dynamic --with-stream=dynamic --lock-path=/var/lock/nginx.lock --pid-path=/run/nginx.pid --conf-path=/etc/nginx/nginx.conf --http-log-path=/var/log/nginx/access.log --error-log-path=/var/log/nginx/error.log --http-client-body-temp-path=/var/lib/nginx/body --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-proxy-temp-path=/var/lib/nginx/proxy --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-pcre-jit --with-http_ssl_module --with-http_stub_status_module --with-http_realip_module --with-http_auth_request_module --with-http_v2_module --with-http_dav_module --with-http_slice_module --with-threads --with-http_addition_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_image_filter_module=dynamic --with-http_sub_module

Description

I am struggling with this bug for almost two years, please guys I need your help.

What: 0.1% of users of my website complain that website sometimes returns errors.
I can clearly see their problems in log but can't do anything to help them.
It looks like HTTP 408 error with 0 size. Example:

89.28.38.219 - service-online.su [23/Jul/2022:08:46:08 +0300] "GET /blank/.........pdf HTTP/1.1" 200 62034 "........ this is successful request, I just added it to demonstrate that user can access the website. And then this happens:

89.28.38.219 - service-online.su [23/Jul/2022:09:37:26 +0300] "POST /forms/cenniki/blanc.php HTTP/1.1" 408 0 "https://service-online.su/forms/cenniki/" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36" "-" rt=15.000 ut="-" uh=- cs=- cn=283675 n=1 request_length=1160 request_time=15.000 http_accept="text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"

As you can see I use non-standard log format, with information useful for debugging (debug log is available too).

Error is ALWAYS POST request and HTTP code 408 and size=0.
It happens after a series of successful GET/POST requests, but then POST just fails and can continue to fail for 5-20 times. Users just click that cursed "submit" button in hope that this time this POST request succeeds and they get what they want.

This happens for different browsers and different operating systems.
This is difficult to debug and replicate, all information I got from website users and was unable to access their PC.
Backend is php-fpm.

What I've tried:

1) replacing Debian's nginx with version built from sources (1.23.1) does not help.
2) replacing Openssl with Libressl does not help.
3) I played with keepalive - no affect.
4) I played with SSL ciphers - and got some result - disabling TLSv1.2 decreases number of 408 errors a bit.
5) I played with other settings, disabled http2 and this seems to help a bit.

Debug log for that particular 408 request (cn=283675) attached to this ticket.

Attachments (2)

debug.log.gz (1.8 KB ) - added by woodholly@… 3 weeks ago.
user_sessions_with_408.zip (56.9 KB ) - added by woodholly@… 3 weeks ago.

Download all attachments as: .zip

Change History (6)

by woodholly@…, 3 weeks ago

Attachment: debug.log.gz added

by woodholly@…, 3 weeks ago

Attachment: user_sessions_with_408.zip added

comment:1 by woodholly@…, 3 weeks ago

I've added attachment with user sessions where you can see how this error happens and feel their pain :)
Filter by 408.

comment:2 by Maxim Dounin, 3 weeks ago

From the debug log it looks like perfectly legitimate client timeout: client is expected to send 12109-byte request body, and nothing else happens on the connection.

First of all, I would recommend to focus on possible network issues.

The fact that some POSTs are failing suggests that the issue might be with Path MTU Discovery from client to server, which is a common issue when tunnels are used and "Fragmentation Needed" ICMP packets are filtered out somewhere along the network path. With non-working Path MTU Discovery it is quite possible that only small POSTs will normally succeed (since large POSTs will try to send full TCP packets, which won't into tunnel's MTU), and from the logs this what seems to happen.

Possible solutions include removing ICMP filters (though usually not possible), clamping MSS on the tunnel (if tunnel is under your or your client's control), forcing a lower MTU on the client, enabling Path MTU Discovery Black Hole Detection on the client (see here), or completely disabling Path MTU Discovery on the client.

Hope this helps.

in reply to:  2 comment:3 by woodholly@…, 3 weeks ago

Replying to Maxim Dounin:

From the debug log it looks like perfectly legitimate client timeout: client is expected to send 12109-byte request body, and nothing else happens on the connection.

Thank you, I started to test your assumption.

To check this I've added "http_content_length" to the log and this seems unrelated to MTU:

31.173.86.123 - host [26/Jul/2022:14:18:38 +0300] "POST /forms/kvitantsiya/blanc.php HTTP/1.1" 408 0 "https://ref" "Mozilla/5.0 (Linux; arm_64; Android 11; RMX3151) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.127 YaApp_Android/20.93.1 YaSearchBrowser/20.93.1 BroPP/1.0 SA/3 Mobile Safari/537.36" "-" rt=15.000 ut="-" uh=- cs=- cn=3358125 n=2 request_length=1109 request_time=15.000 http_accept="..." http_content_length="1350"

http_content_length=1350 is real Content-length expected from client, request_length=1109 is a size of headers. And as you can see, 408 happens for really small packets too.

One more example:

188.162.252.195 - host [26/Jul/2022:04:52:17 +0300] "POST /text/text-editor/blanc.php HTTP/1.1" 408 0 "https://ref" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"

"-" rt=15.001 ut="-" uh=- cs=- cn=2233147 n=1 request_length=1066 request_time=15.001 http_accept="..." http_content_length="451"

http_content_length="451" is enough for error.

comment:4 by Maxim Dounin, 3 weeks ago

Note that other possible reasons include real network issues. In particular, temporary connectivity issues might be common in mobile networks, and both 31.173.86.123 and 188.162.252.195 seems to be mobile. Note well that you are using relatively low client body timeout (15 seconds), and this might not be enough for mobile users with bad connectivity.

You may want to try capturing tcpdump data on the server to further investigate what happens on the network level. Note that it might be a good idea to log $remote_port to simplify matching access log with particular TCP packets.

Note: See TracTickets for help on using tickets.