Here are the required steps:
- Use the `u` pattern option. This turns on `PCRE_UTF8` and `PCRE_UCP` (the PHP docs forget to mention that one):

  PCRE_UTF8
  This option causes PCRE to regard both the pattern and the subject as strings of UTF-8 characters instead of single-byte strings. However, it is available only when PCRE is built to include UTF support. If not, the use of this option provokes an error. Details of how this option changes the behaviour of PCRE are given in the pcreunicode page.

  PCRE_UCP
  This option changes the way PCRE processes `\B`, `\b`, `\D`, `\d`, `\S`, `\s`, `\W`, `\w`, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters. More details are given in the section on generic character types in the pcrepattern page. If you set PCRE_UCP, matching one of the items it affects takes much longer. The option is available only if PCRE has been compiled with Unicode property support.

- `\d` will do just fine with `PCRE_UCP` (it's equivalent to `\p{Nd}` already), but you have to replace these `[a-z]` ranges to account for accented characters:
  - Replace `[a-zA-Z]` with `\p{L}`
  - Replace `[a-z]` with `\p{Ll}`
  - Replace `[A-Z]` with `\p{Lu}`

  `\p{X}` means: a character from Unicode category X, where `L` means letter, `Ll` means lowercase letter and `Lu` means uppercase letter. You can get a list from the docs.

  Note that you can use `\p{X}` inside a character class: `[\p{L}\d\s]` for instance.

- And make sure you use UTF-8 encoding for your strings in PHP. Also, make sure you use Unicode-aware functions to handle these strings.
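As a rough illustration of the steps above, here is a minimal sketch that compares an ASCII-only pattern with its Unicode-aware rewrite. The helper names (`is_valid_ascii`, `is_valid_unicode`, `starts_with_uppercase`) and the sample pattern are hypothetical, chosen only for this example. It assumes PCRE was built with UTF-8 and Unicode property support and that the mbstring extension is loaded.

```php
<?php
// Hypothetical helpers for illustration; only preg_match(), strlen()
// and mb_strlen() are real PHP functions here.

// ASCII-only original: letters, digits and whitespace.
const ASCII_PATTERN = '/^[a-zA-Z\d\s]+$/';
// Unicode-aware rewrite: \p{L} replaces [a-zA-Z], and the u option
// makes PCRE treat pattern and subject as UTF-8.
const UNICODE_PATTERN = '/^[\p{L}\d\s]+$/u';

function is_valid_ascii(string $s): bool
{
    return preg_match(ASCII_PATTERN, $s) === 1;
}

function is_valid_unicode(string $s): bool
{
    // With the u option, preg_match() returns false on invalid UTF-8,
    // so the strict comparison also rejects malformed input.
    return preg_match(UNICODE_PATTERN, $s) === 1;
}

function starts_with_uppercase(string $s): bool
{
    // \p{Lu} replaces [A-Z].
    return preg_match('/^\p{Lu}/u', $s) === 1;
}

// \u{...} escapes keep the example independent of the file's encoding:
// this is "José Ángel 42" encoded as UTF-8.
$name = "Jos\u{00E9} \u{00C1}ngel 42";

var_dump(is_valid_ascii($name));                   // bool(false)
var_dump(is_valid_unicode($name));                 // bool(true)
var_dump(starts_with_uppercase("\u{00C1}ngel"));   // bool(true)

// Byte-oriented vs. Unicode-aware string functions:
var_dump(strlen("Jos\u{00E9}"));                   // int(5), counts bytes
var_dump(mb_strlen("Jos\u{00E9}", 'UTF-8'));       // int(4), counts characters
```

The last two lines show why the string functions matter as much as the pattern: `strlen()` counts bytes, so any length check done with it will be off for accented input even when the regex itself is correct.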