Opened 12 years ago
Closed 12 years ago
#457 closed defect (fixed)
Win32: ngx_utf8_to_utf16 doesn't allow file names outside U+FFFF
| Reported by: | Kroward 1 | Owned by: | |
|---|---|---|---|
| Priority: | minor | Milestone: | |
| Component: | nginx-core | Version: | |
| Keywords: | win32 | Cc: | |
| uname -a: | Microsoft Windows XP [Version 5.1.2600] | ||
| nginx -V: | nginx/Windows-1.5.7 (prebuild binary) | ||
Description
The function ngx_utf8_to_utf16() in 'win32\ngx_files.c' returns an error for UTF-8 characters with code points above U+FFFF.
Such characters are perfectly valid and should be represented as a surrogate pair on Windows.
The following code fixes the problem (the check appears twice in the source):
if (n > 0x10ffff) {
    ngx_set_errno(NGX_EILSEQ);
    return NULL;
}
if (n > 0xffff) {
    // LE order
    *u++ = (u_short) (0xd800 | ( ((n >> 16) & 0x1f) - 1 ) << 6 | (n & 0xffff) >> 10); // high surrogate
    // ??? check buffer length here ???
    *u++ = (u_short) (0xdc00 | n & 0x3ff); // low surrogate
} else {
    *u++ = (u_short) n;
}
You can verify this by extracting the example CJK-named file from the attached archive.
(The autoindex module does not work with Unicode, so the URL has to be entered manually, e.g. .../%f0%a9%ba%8a.txt)
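To see which character the example URL refers to, the escaped bytes can be decoded by hand. The sketch below is an illustration only, not nginx code: decode_utf8_4, high_bitwise, high_subtract and low_surrogate are hypothetical helpers, and the decoder assumes a well-formed four-byte sequence. It also computes the high surrogate both with the bit manipulation from the snippet above and with the more common "subtract 0x10000" form, which give the same result for every code point from U+10000 to U+10FFFF.

```c
#include <stdint.h>

/* Hypothetical helper: decode one well-formed 4-byte UTF-8 sequence. */
static uint32_t
decode_utf8_4(const unsigned char *p)
{
    return ((uint32_t) (p[0] & 0x07) << 18)
           | ((uint32_t) (p[1] & 0x3f) << 12)
           | ((uint32_t) (p[2] & 0x3f) << 6)
           | (uint32_t) (p[3] & 0x3f);
}

/* High surrogate computed with the bit manipulation from the ticket. */
static uint16_t
high_bitwise(uint32_t n)
{
    return (uint16_t) (0xd800 | ((((n >> 16) & 0x1f) - 1) << 6)
                       | ((n & 0xffff) >> 10));
}

/* High surrogate in the conventional form: subtract 0x10000 first. */
static uint16_t
high_subtract(uint32_t n)
{
    return (uint16_t) (0xd800 + ((n - 0x10000) >> 10));
}

/* Low surrogate: the low 10 bits are unaffected by the subtraction. */
static uint16_t
low_surrogate(uint32_t n)
{
    return (uint16_t) (0xdc00 | (n & 0x3ff));
}
```

For the bytes F0 A9 BA 8A this yields U+29E8A, which UTF-16 encodes as the pair D867 DE8A.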
Attachments (1)
Change History (4)
by , 12 years ago
comment:1 by , 12 years ago
| Keywords: | win32 added |
|---|---|
| Status: | new → accepted |
Correct, Windows 2000 and newer support UTF-16, not just UCS-2. This needs to be addressed. The suggested code obviously needs to be changed to properly check whether there is enough space in the buffer being used.
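The kind of buffer check meant here can be sketched in isolation. This is not the nginx code: put_utf16 is a hypothetical helper that appends one code point to a caller-supplied buffer and, as an assumption of this sketch, reports "no room" rather than writing half of a surrogate pair when only one slot is left.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: append code point n at *u without writing at or
 * past last.  Returns the number of 16-bit units written, 0 if there is
 * not enough room, or -1 if n is above U+10FFFF.
 */
static int
put_utf16(uint16_t **u, const uint16_t *last, uint32_t n)
{
    size_t  room = (size_t) (last - *u);

    if (n > 0x10ffff) {
        return -1;
    }

    if (n <= 0xffff) {
        if (room < 1) {
            return 0;
        }

        *(*u)++ = (uint16_t) n;
        return 1;
    }

    if (room < 2) {
        return 0;                       /* never emit half a pair */
    }

    n -= 0x10000;
    *(*u)++ = (uint16_t) (0xd800 + (n >> 10));
    *(*u)++ = (uint16_t) (0xdc00 + (n & 0x03ff));

    return 2;
}
```

A caller would stop converting (or grow the buffer) when the helper returns 0, and treat -1 as an illegal sequence.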
Something like this seems to be a proper fix:
diff -r 692afcea9d0d -r 06b47c205b0c src/os/win32/ngx_files.c
--- a/src/os/win32/ngx_files.c	Tue Dec 03 22:07:03 2013 +0400
+++ b/src/os/win32/ngx_files.c	Fri Dec 06 23:31:38 2013 +0400
@@ -799,13 +799,25 @@ ngx_utf8_to_utf16(u_short *utf16, u_char
             continue;
         }
 
+        if (u + 1 == last) {
+            *len = u - utf16;
+            break;
+        }
+
         n = ngx_utf8_decode(&p, 4);
 
-        if (n > 0xffff) {
+        if (n > 0x10ffff) {
             ngx_set_errno(NGX_EILSEQ);
             return NULL;
         }
 
+        if (n > 0xffff) {
+            n -= 0x10000;
+            *u++ = (u_short) (0xd800 + (n >> 10));
+            *u++ = (u_short) (0xdc00 + (n & 0x03ff));
+            continue;
+        }
+
         *u++ = (u_short) n;
     }
 
@@ -838,12 +850,19 @@ ngx_utf8_to_utf16(u_short *utf16, u_char
 
         n = ngx_utf8_decode(&p, 4);
 
-        if (n > 0xffff) {
+        if (n > 0x10ffff) {
             free(utf16);
             ngx_set_errno(NGX_EILSEQ);
             return NULL;
         }
 
+        if (n > 0xffff) {
+            n -= 0x10000;
+            *u++ = (u_short) (0xd800 + (n >> 10));
+            *u++ = (u_short) (0xdc00 + (n & 0x03ff));
+            continue;
+        }
+
         *u++ = (u_short) n;
     }
 
Just in case, example characters (e.g., MUSICAL SYMBOL G CLEF, U+1D11E, 𝄞) as well as various details can be found at http://en.wikipedia.org/wiki/UTF-16.
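As a quick sanity check of the surrogate arithmetic, the standalone sketch below encodes a supplementary code point with the same two expressions as the patch and then decodes the pair back. The helper names encode_pair and decode_pair are made up for this illustration and do not exist in nginx.

```c
#include <stdint.h>

/* Encode a code point in 0x10000..0x10ffff, using the patch's arithmetic. */
static void
encode_pair(uint32_t n, uint16_t *hi, uint16_t *lo)
{
    n -= 0x10000;
    *hi = (uint16_t) (0xd800 + (n >> 10));
    *lo = (uint16_t) (0xdc00 + (n & 0x03ff));
}

/* Inverse: combine a high and a low surrogate into a code point. */
static uint32_t
decode_pair(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t) (hi - 0xd800) << 10)
                      | (uint32_t) (lo - 0xdc00));
}
```

For MUSICAL SYMBOL G CLEF, U+1D11E, this gives the pair D834 DD1E, and decoding that pair returns U+1D11E again.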

Example of a file with a problematic Unicode name (should be extracted from the archive)