[Solved] Converting regex to account for international characters

Accepted Answer

Here are the required steps:

Use the u pattern modifier (for example /pattern/u). This turns on both PCRE_UTF8 and PCRE_UCP (the PHP docs forget to mention the latter):

PCRE_UTF8

This option causes PCRE to regard both the pattern and the subject as strings of UTF-8 characters instead of single-byte strings. However, it is available only when PCRE is built to include UTF support. If not, the use of this option provokes an error. Details of how this option changes the behaviour of PCRE are given in the pcreunicode page.

PCRE_UCP

This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters. More details are given in the section on generic character types in the pcrepattern page. If you set PCRE_UCP, matching one of the items it affects takes much longer. The option is available only if PCRE has been compiled with Unicode property support.
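
As a rough illustration of what the two options change, here is a minimal sketch. It assumes a PHP build whose PCRE has UTF-8 and Unicode property support, and the variable names are made up for the example. The subjects are written as byte escapes so the result does not depend on the encoding of the source file.

```php
<?php
// "é" (U+00E9) and the Arabic-Indic digit three (U+0663), as UTF-8 bytes.
$letter = "\xC3\xA9";
$digit  = "\xD9\xA3";

// Without /u the subject is a byte string: each of these is two bytes,
// so a pattern that expects exactly one character cannot match.
$letterWithoutU = preg_match('/^\w$/', $letter);
$digitWithoutU  = preg_match('/^\d$/', $digit);

// With /u the subject is decoded as UTF-8 (PCRE_UTF8) and \w / \d are
// classified by Unicode properties (PCRE_UCP).
$letterWithU = preg_match('/^\w$/u', $letter);
$digitWithU  = preg_match('/^\d$/u', $digit);

var_dump($letterWithoutU, $digitWithoutU); // int(0) int(0)
var_dump($letterWithU, $digitWithU);       // int(1) int(1)
```
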
\d works as is under PCRE_UCP (it then matches any Unicode decimal digit, i.e. it is equivalent to \p{Nd}), but you have to replace the [a-z]-style ranges to account for accented characters:
- Replace [a-zA-Z] with \p{L}
- Replace [a-z] with \p{Ll}
- Replace [A-Z] with \p{Lu}
\p{X} means "a character from Unicode category X", where L means letter, Ll means lowercase letter and Lu means uppercase letter. The full list of category codes is in the PCRE documentation and on the "Unicode character properties" page of the PHP manual.

Note that you can use \p{X} inside a character class: [\p{L}\d\s] for instance.
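
The replacements above can be sketched as follows. The "capitalized name" pattern is a hypothetical example, not the pattern from the question, and the subjects are again written as UTF-8 byte escapes.

```php
<?php
// Hypothetical ASCII-only pattern: one uppercase letter, then lowercase letters.
$asciiPattern = '/^[A-Z][a-z]+$/';

// Same pattern after the replacements: [A-Z] -> \p{Lu}, [a-z] -> \p{Ll}, plus /u.
$unicodePattern = '/^\p{Lu}\p{Ll}+$/u';

$name = "\xC3\x89mile"; // "Émile" in UTF-8

$asciiResult   = preg_match($asciiPattern, $name);   // fails on the first byte of "É"
$unicodeResult = preg_match($unicodePattern, $name); // "É" is Lu, "mile" is Ll

// \p{L} used inside a character class, together with \d and \s.
$classResult = preg_match('/^[\p{L}\d\s]+$/u', "Caf\xC3\xA9 42"); // "Café 42"

var_dump($asciiResult, $unicodeResult, $classResult); // int(0) int(1) int(1)
```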
Finally, make sure your strings are UTF-8 encoded in PHP, and handle them with Unicode-aware functions (for example the mb_* functions from the mbstring extension) rather than the byte-oriented ones.
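
A short sketch of why this last step matters, assuming the mbstring extension is loaded (that is an assumption; the answer does not name a specific extension). Byte-oriented functions count bytes, and a subject that is not valid UTF-8 makes a /u pattern fail outright rather than simply not match.

```php
<?php
$word = "na\xC3\xAFve"; // "naïve" in UTF-8: 5 characters, 6 bytes

$byteLength = strlen($word);                 // counts bytes
$charLength = mb_strlen($word, 'UTF-8');     // counts characters
$upper      = mb_strtoupper($word, 'UTF-8'); // "NAÏVE"

// A lone lead byte is not valid UTF-8.
$broken      = "\xC3";
$isValid     = mb_check_encoding($broken, 'UTF-8');
$matchResult = preg_match('/./u', $broken); // false: an error, not "no match"
$matchError  = preg_last_error();           // PREG_BAD_UTF8_ERROR

var_dump($byteLength, $charLength);  // int(6) int(5)
var_dump($isValid, $matchResult);    // bool(false) bool(false)
```

Checking the input with mb_check_encoding() before matching, or testing preg_match() against false with ===, keeps an encoding problem from being mistaken for a failed match.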