Opened 4 years ago
Closed 4 years ago
#1995 closed defect (invalid)
Regular expression with Unicode property test doesn't match as expected
Reported by: | | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | |
Component: | nginx-core | Version: | 1.19.x |
Keywords: | regex unicode | Cc: | funkyfuture@… |
uname -a: Linux 6f16b1a35cc8 5.4.0-33-generic #37-Ubuntu SMP Thu May 21 12:53:59 UTC 2020 x86_64 GNU/Linux

nginx -V:

    nginx version: nginx/1.19.0
    built by gcc 8.3.0 (Debian 8.3.0-6)
    built with OpenSSL 1.1.1d 10 Sep 2019
    TLS SNI support enabled
    configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -fdebug-prefix-map=/data/builder/debuild/nginx-1.19.0/debian/debuild-base/nginx-1.19.0=. -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'
Description
with this server configuration, whose regex pattern relies on Unicode properties (to match all characters that are classified as letter in every script):
    server {
        listen 8000;
        server_name _;

        location / {
            return 404 "Did not match.\n";
        }

        location ~ "^/\p{L}+$" {
            return 200 "Matched.\n";
        }
    }
i get the following responses:
    + curl http://localhost:8000/123
    Did not match.
    + curl http://localhost:8000/test
    Matched.
    + curl http://localhost:8000/täst
    Did not match.
    + curl http://localhost:8000/спаси́бо
    Did not match.
i would expect all of the last three requests to yield the "Matched." string though, because all given paths consist of letters only.
afaict, PCRE does support the Unicode properties based tests and the first two requests seem to confirm that. see also here: http://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5
i'm of course not sure whether this is actually a defect, just docs that could be improved, or a mistake on my side, as it involves regular expressions.
what baffles me is that at least the english parts of the web seem not yet to have discussed the use of regular expressions with Unicode properties in nginx.
i'd appreciate anyone who tested this on a different platform and shares her/his findings.
What you see is the result of the fact that URI characters aren't Unicode, despite the fact that nowadays URIs are often in UTF-8. Rather, a URI is a string of 1-byte characters. As such, testing t\xC3\xA4st for \p{L} fails at the C3 character.

You may try to instruct PCRE to match assuming UTF-8 by using the (*UTF8) option at the start of the pattern. This is expected to work properly as long as the actual URI is in UTF-8. Note though that this is likely to result in unexpected pcre_exec() errors if the URI is not in UTF-8. As such, I would not recommend doing this.

If you want to reinterpret the URI as a UTF-8 string and do some testing on it, consider using some programming language instead, which will allow you to do actual testing and define appropriate behaviour if the URI is not in UTF-8. In particular, embedded perl or njs might be worth trying.
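For reference, the (*UTF8) variant mentioned in the reply could look roughly like the sketch below. It reuses the server block from the description and only prefixes the pattern with the option. It assumes nginx is linked against the classic PCRE library, as the pcre_exec() reference implies, and it carries the caveat already stated above: a request URI that is not valid UTF-8 is likely to produce pcre_exec() errors instead of a clean non-match.

```nginx
server {
    listen 8000;
    server_name _;

    location / {
        return 404 "Did not match.\n";
    }

    # (*UTF8) must be the very first item in the pattern. It asks PCRE to
    # treat the pattern and the URI as UTF-8 instead of single-byte characters,
    # so \p{L} is tested against whole code points such as U+00E4, not against
    # the individual bytes C3 and A4.
    location ~ "(*UTF8)^/\p{L}+$" {
        return 200 "Matched.\n";
    }
}
```

With this in place, the /täst request from the description would be expected to return "Matched.", provided the client sends the path as UTF-8.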