Opened 4 years ago

Closed 4 years ago

#1995 closed defect (invalid)

Regular expression with Unicode property test doesn't match as expected

Reported by: funkyfuture@…
Owned by:
Priority: minor
Milestone:
Component: nginx-core
Version: 1.19.x
Keywords: regex unicode
Cc: funkyfuture@…
uname -a: Linux 6f16b1a35cc8 5.4.0-33-generic #37-Ubuntu SMP Thu May 21 12:53:59 UTC 2020 x86_64 GNU/Linux
nginx -V: nginx version: nginx/1.19.0
built by gcc 8.3.0 (Debian 8.3.0-6)
built with OpenSSL 1.1.1d 10 Sep 2019
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -fdebug-prefix-map=/data/builder/debuild/nginx-1.19.0/debian/debuild-base/nginx-1.19.0=. -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'

Description

with this server configuration, whose regex pattern relies on Unicode properties (to match all characters that are classified as letter in every script):

server {
  listen 8000;
  server_name _;

  location / {
      return 404 "Did not match.\n";
  }

  location ~ "^/\p{L}+$" {
      return 200 "Matched.\n";
  }
}

i get the following responses:

+ curl http://localhost:8000/123
Did not match.
+ curl http://localhost:8000/test
Matched.
+ curl http://localhost:8000/täst
Did not match.
+ curl http://localhost:8000/спаси́бо
Did not match.

i would expect all of the last three requests to yield the "Matched." string though, because all given paths consist of letters only.

afaict, PCRE does support the Unicode properties based tests and the first two requests seem to confirm that. see also here: http://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5

i'm of course not sure whether this is actually a defect, a gap in the docs that could be improved, or a mistake on my side, as it involves regular expressions.

what baffles me is that at least the english parts of the web seem not yet to have discussed the use of regular expressions with Unicode properties in nginx.

i'd appreciate it if anyone who has tested this on a different platform could share their findings.

Change History (1)

comment:1 by Maxim Dounin, 4 years ago

Resolution: invalid
Status: new → closed

What you see is the result of the fact that URI characters aren't Unicode, despite the fact that nowadays URIs are often in UTF-8. Rather, a URI is a string of 1-byte characters. As such, testing t\xC3\xA4st for \p{L} fails at the C3 character.
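The byte-versus-character distinction can be reproduced outside nginx. The following is only a rough sketch in Python, not what nginx or PCRE do internally: Python's built-in re module has no \p{L}, so the class [^\W\d_] is used here as an assumed approximation of "letters", and the names BYTES_PATTERN, TEXT_PATTERN and raw_uri are purely illustrative.

```python
import re

# Assumption: "[^\W\d_]" (word characters minus digits and underscore)
# stands in for \p{L}, which Python's built-in `re` does not support.
BYTES_PATTERN = re.compile(rb"^/[^\W\d_]+$")  # byte-wise, ASCII-only classes
TEXT_PATTERN = re.compile(r"^/[^\W\d_]+$")    # Unicode-aware classes

# "/täst" as it arrives on the wire: ä is the two bytes C3 A4 in UTF-8.
raw_uri = b"/t\xc3\xa4st"

# Matching the raw bytes stops at 0xC3, which is not an ASCII letter.
byte_match = BYTES_PATTERN.match(raw_uri) is not None

# Matching after decoding sees one character "ä", which is a letter.
text_match = TEXT_PATTERN.match(raw_uri.decode("utf-8")) is not None

print(byte_match)  # False
print(text_match)  # True
```

The same input fails or succeeds depending solely on whether the matcher treats it as a sequence of bytes or as decoded text.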

You may try to instruct PCRE to match assuming UTF-8 by using the (*UTF8) option, for example:

location ~ "(*UTF8)^/\p{L}+$" {
   return 200 "Matched.\n";
}

This is expected to work properly as long as the actual URI is in UTF-8. Note though that this is likely to result in unexpected pcre_exec() errors if the URI is not in UTF-8. As such, I would not recommend doing this.

If you want to reinterpret the URI as a UTF-8 string and do some testing on it, consider using some programming language instead, which will allow you to do actual testing and define appropriate behaviour if the URI is not in UTF-8. In particular, embedded perl or njs might be worth trying.
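As a language-neutral illustration of that approach, here is a minimal sketch in Python (chosen only for brevity; the comment above suggests embedded perl or njs for use inside nginx). The function name classify_path and its return values are hypothetical, and [^\W\d_] is again an assumed approximation of \p{L}.

```python
import re

# Assumption: "[^\W\d_]" approximates \p{L}; Python's `re` lacks \p{L}.
LETTERS_ONLY = re.compile(r"^/[^\W\d_]+$")


def classify_path(raw_path: bytes) -> str:
    """Decode the path as UTF-8 first, then test it for letters only.

    Invalid UTF-8 is reported explicitly instead of surfacing as a
    matching error, so the caller can decide how to respond.
    """
    try:
        text = raw_path.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "invalid-utf8"
    return "matched" if LETTERS_ONLY.match(text) else "not-matched"


print(classify_path(b"/test"))         # matched
print(classify_path(b"/123"))          # not-matched
print(classify_path(b"/t\xc3\xa4st"))  # matched (valid UTF-8 for "/täst")
print(classify_path(b"/t\xe4st"))      # invalid-utf8 (Latin-1 byte, not UTF-8)
```

Separating the decoding step from the matching step is what makes it possible to define a specific response for paths that are not valid UTF-8.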
