Opened 11 years ago
Closed 11 years ago
#457 closed defect (fixed)
Win32: ngx_utf8_to_utf16 doesn't allow file names outside U+FFFF
Reported by: | Kroward 1 | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | |
Component: | nginx-core | Version: | |
Keywords: | win32 | Cc: | |
uname -a: | Microsoft Windows XP [Version 5.1.2600] | | |
nginx -V: | nginx/Windows-1.5.7 (prebuild binary) | | |
Description
The function ngx_utf8_to_utf16() in 'win32\ngx_files.c' returns an error on UTF-8 characters with code points above U+FFFF.
Such characters are perfectly valid and should be represented as a surrogate pair on Windows.
This code fixes the situation (the check appears twice in the source):
    if (n > 0x10ffff) {
        ngx_set_errno(NGX_EILSEQ);
        return NULL;
    }

    if (n > 0xffff) {
        // LE order
        *u++ = (u_short) (0xd800 | ( ((n >> 16) & 0x1f) - 1 ) << 6 | (n & 0xffff) >> 10); // high surrogate
        // ??? check buffer length here ???
        *u++ = (u_short) (0xdc00 | n & 0x3ff); // low surrogate
    } else {
        *u++ = (u_short) n;
    }
You can verify this by extracting the example CJK-named file from the attached archive.
(The autoindex module does not work with Unicode, so the URL should be entered manually, like .../%f0%a9%ba%8a.txt)
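To see why this file name triggers the error, the percent-encoded bytes in the URL can be decoded by hand. The sketch below is only an illustration: `decode_utf8_4` is a hypothetical helper, not an nginx function, and it handles 4-byte sequences only. For the bytes f0 a9 ba 8a it yields U+29E8A, which lies above U+FFFF.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of nginx): decode one 4-byte UTF-8
 * sequence into a code point.  Returns 0xffffffff on malformed input.
 */
static uint32_t
decode_utf8_4(const unsigned char *p)
{
    uint32_t  n;

    /* the lead byte must be 11110xxx, the other three 10xxxxxx */
    if ((p[0] & 0xf8) != 0xf0
        || (p[1] & 0xc0) != 0x80
        || (p[2] & 0xc0) != 0x80
        || (p[3] & 0xc0) != 0x80)
    {
        return 0xffffffff;
    }

    /* 3 + 6 + 6 + 6 payload bits, most significant first */
    n = ((uint32_t) (p[0] & 0x07) << 18)
        | ((uint32_t) (p[1] & 0x3f) << 12)
        | ((uint32_t) (p[2] & 0x3f) << 6)
        | (uint32_t) (p[3] & 0x3f);

    /* reject overlong forms and values above U+10FFFF */
    if (n < 0x10000 || n > 0x10ffff) {
        return 0xffffffff;
    }

    return n;
}
```

Calling `decode_utf8_4` on the array `{ 0xf0, 0xa9, 0xba, 0x8a }` gives 0x29e8a, so the old `n > 0xffff` check rejects the name.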
Attachments (1)
Change History (4)
by , 11 years ago
comment:1 by , 11 years ago
Keywords: | win32 added |
---|---|
Status: | new → accepted |
Correct, Windows 2000 and newer support UTF-16, not just UCS-2. This needs to be addressed. The suggested code obviously needs to be changed to properly check whether there is enough space left in the buffer used.
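The buffer concern can be shown in isolation: a character outside the BMP needs two 16-bit units, so the writer must confirm that two slots remain before storing the pair. The following is a minimal sketch under assumptions, not nginx code; `utf16_put` and its return convention are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not nginx code): store code point n at u without
 * writing at or past last.  Returns the number of 16-bit units written,
 * or 0 if n is above U+10FFFF or does not fit.
 */
static size_t
utf16_put(uint16_t *u, const uint16_t *last, uint32_t n)
{
    if (n > 0x10ffff) {
        return 0;
    }

    if (n <= 0xffff) {
        if (last - u < 1) {
            return 0;
        }

        u[0] = (uint16_t) n;
        return 1;
    }

    /* a surrogate pair needs two free units */
    if (last - u < 2) {
        return 0;
    }

    n -= 0x10000;
    u[0] = (uint16_t) (0xd800 + (n >> 10));     /* high surrogate */
    u[1] = (uint16_t) (0xdc00 + (n & 0x03ff));  /* low surrogate */

    return 2;
}
```

With only one free unit, `utf16_put(buf, buf + 1, 0x1d11e)` returns 0 and leaves the buffer untouched, so the caller can fall back to a larger buffer.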
Something like this seems to be a proper fix:
    diff -r 692afcea9d0d -r 06b47c205b0c src/os/win32/ngx_files.c
    --- a/src/os/win32/ngx_files.c Tue Dec 03 22:07:03 2013 +0400
    +++ b/src/os/win32/ngx_files.c Fri Dec 06 23:31:38 2013 +0400
    @@ -799,13 +799,25 @@ ngx_utf8_to_utf16(u_short *utf16, u_char
                 continue;
             }
     
    +        if (u + 1 == last) {
    +            *len = u - utf16;
    +            break;
    +        }
    +
             n = ngx_utf8_decode(&p, 4);
     
    -        if (n > 0xffff) {
    +        if (n > 0x10ffff) {
                 ngx_set_errno(NGX_EILSEQ);
                 return NULL;
             }
     
    +        if (n > 0xffff) {
    +            n -= 0x10000;
    +            *u++ = (u_short) (0xd800 + (n >> 10));
    +            *u++ = (u_short) (0xdc00 + (n & 0x03ff));
    +            continue;
    +        }
    +
             *u++ = (u_short) n;
         }
     
    @@ -838,12 +850,19 @@ ngx_utf8_to_utf16(u_short *utf16, u_char
     
             n = ngx_utf8_decode(&p, 4);
     
    -        if (n > 0xffff) {
    +        if (n > 0x10ffff) {
                 free(utf16);
                 ngx_set_errno(NGX_EILSEQ);
                 return NULL;
             }
     
    +        if (n > 0xffff) {
    +            n -= 0x10000;
    +            *u++ = (u_short) (0xd800 + (n >> 10));
    +            *u++ = (u_short) (0xdc00 + (n & 0x03ff));
    +            continue;
    +        }
    +
             *u++ = (u_short) n;
         }
     
Just in case, example characters (e.g., MUSICAL SYMBOL G CLEF, U+1D11E, 𝄞) as well as various details can be found at http://en.wikipedia.org/wiki/UTF-16.
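For reference, the bitwise expression from the description and the subtract-and-shift form in the patch should produce the same surrogates. The sketch below checks this over every code point from U+10000 to U+10FFFF. The function names are hypothetical, and the bitwise form is written with explicit parentheses that match C operator precedence in the original expression.

```c
#include <stdint.h>

/* High surrogate, bitwise form as in the ticket description. */
static uint16_t
high_bitwise(uint32_t n)
{
    return (uint16_t) (0xd800
                       | ((((n >> 16) & 0x1f) - 1) << 6)
                       | ((n & 0xffff) >> 10));
}

/* Low surrogate, bitwise form as in the ticket description. */
static uint16_t
low_bitwise(uint32_t n)
{
    return (uint16_t) (0xdc00 | (n & 0x3ff));
}

/* High surrogate, subtract form as in the patch. */
static uint16_t
high_subtract(uint32_t n)
{
    n -= 0x10000;
    return (uint16_t) (0xd800 + (n >> 10));
}

/* Low surrogate, subtract form as in the patch. */
static uint16_t
low_subtract(uint32_t n)
{
    n -= 0x10000;
    return (uint16_t) (0xdc00 + (n & 0x03ff));
}

/* Returns 1 if both forms agree on every supplementary code point. */
static int
forms_agree(void)
{
    uint32_t  n;

    for (n = 0x10000; n <= 0x10ffff; n++) {
        if (high_bitwise(n) != high_subtract(n)
            || low_bitwise(n) != low_subtract(n))
        {
            return 0;
        }
    }

    return 1;
}
```

For MUSICAL SYMBOL G CLEF (U+1D11E), both forms give the pair 0xD834 0xDD1E.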
Example of a file with a problematic Unicode name (should be extracted from the archive)