[Solved] A regex that doesn’t match with this character sequence


This answer only demonstrates that it is possible. Whether to use it in production code is questionable.

It is possible with Java's String.replaceAll method:

String input = "Hi, Mr.Xyz! Your account number is :- (1234567890) , (.*) &$@%#*(....))(((";
String output = input.replaceAll("\\G((?:[^()\\[\\]{}?+\\\\.$^*|!&@#%_\":<>/;'`~-]|\\Q(.*)\\E)*+)([()\\[\\]{}?+\\\\.$^*|!&@#%_\":<>/;'`~-])", "$1\\\\$2");

Result:

"Hi, Mr\.Xyz\! Your account number is \:\- \(1234567890\) , (.*) \&\$\@\%\#\*\(\.\.\.\.\)\)\(\(\("

Another test, using the same replaceAll call:

String input = "(.*) sdfHi test message <> >>>>><<<<f<f<,,,,<> <>(.*) sdf (.*)  sdf (.*)";

Result:

"(.*) sdfHi test message \<\> \>\>\>\>\>\<\<\<\<f\<f\<,,,,\<\> \<\>(.*) sdf (.*)  sdf (.*)"
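To make both tests easy to rerun, here is a minimal self-contained sketch. The class name EscapeSpecialsDemo, the helper escapeSpecials, and the SPECIALS constant are names made up for this illustration; the regex and replacement string are the ones from the answer, only assembled from a shared constant.

```java
import java.util.regex.Pattern;

class EscapeSpecialsDemo {
    // Body of the character class: every character the answer treats as special.
    static final String SPECIALS = "()\\[\\]{}?+\\\\.$^*|!&@#%_\":<>/;'`~-";

    // Same regex as in the answer, built from SPECIALS and compiled once.
    static final Pattern PATTERN = Pattern.compile(
        "\\G((?:[^" + SPECIALS + "]|\\Q(.*)\\E)*+)([" + SPECIALS + "])");

    // Hypothetical helper: backslash-escape each special character,
    // but leave every literal "(.*)" sequence untouched.
    static String escapeSpecials(String input) {
        return PATTERN.matcher(input).replaceAll("$1\\\\$2");
    }

    public static void main(String[] args) {
        String first = "Hi, Mr.Xyz! Your account number is :- (1234567890) , (.*) &$@%#*(....))(((";
        String second = "(.*) sdfHi test message <> >>>>><<<<f<f<,,,,<> <>(.*) sdf (.*)  sdf (.*)";
        System.out.println(escapeSpecials(first));
        System.out.println(escapeSpecials(second));
    }
}
```

Compiling the pattern once is only a convenience for repeated calls; String.replaceAll with the same two strings behaves the same way.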

Explanation

Raw regex:

\G((?:[^()\[\]{}?+\\.$^*|!&@#%_":<>/;'`~-]|\Q(.*)\E)*+)([()\[\]{}?+\\.$^*|!&@#%_":<>/;'`~-])

Note that each \ must be escaped once more when the regex is written as a Java string literal, and " must be escaped as well. The resulting string literal is shown in the code above.

Raw replacement string:

$1\\$2

Since $ has a special meaning in the replacement string and you want to keep that meaning for $2, the \ must itself be escaped so that it does not escape the $. When the replacement string is then written as a Java string literal, each \ must be doubled again, which gives "$1\\\\$2".
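The difference between the two levels of escaping is easiest to see on a tiny input. The sketch below is only an illustration; the class and method names are made up, and the short regex (a)(\.) stands in for the full one.

```java
class ReplacementEscapingDemo {
    // Raw replacement $1\\$2: group 1, one literal backslash, then group 2.
    static String escapedBackslash(String input) {
        return input.replaceAll("(a)(\\.)", "$1\\\\$2");
    }

    // Raw replacement $1\$2: the backslash escapes the $,
    // so the two characters "$2" are inserted literally instead of group 2.
    static String unescapedBackslash(String input) {
        return input.replaceAll("(a)(\\.)", "$1\\$2");
    }

    public static void main(String[] args) {
        System.out.println(escapedBackslash("a.b"));   // prints a\.b
        System.out.println(unescapedBackslash("a.b")); // prints a$2b
    }
}
```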

Before we dissect the monster, let's talk about the idea. We consume non-special characters and the sequence that we don't want to replace, as many times as possible. The next character is then either a special character that is not part of a sequence we want to keep, or the end of the string (which means that we have found every character that needs replacing, if any).

Naturally, we can think of any arbitrary string as a series of consecutive chunks of the form [0 or more (non-special character or special pattern not to be replaced)][special character], followed by a final [0 or more (non-special character or special pattern not to be replaced)].

When replaceAll is used with a regex without \G, it may find matches that are not consecutive. Such a match can start in the middle of the sequence that must not be replaced and mess it up. \G matches at the end of the previous match (or at the start of the input for the first match), so it makes sure that each match starts exactly where the previous one left off.
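As a rough illustration of what goes wrong without \G, the sketch below runs the same regex with and without the anchor on a short input. The class and method names are made up for this example.

```java
class AnchorDemo {
    static final String SPECIALS = "()\\[\\]{}?+\\\\.$^*|!&@#%_\":<>/;'`~-";
    // The regex from the answer, minus the leading \G.
    static final String BODY =
        "((?:[^" + SPECIALS + "]|\\Q(.*)\\E)*+)([" + SPECIALS + "])";

    // With \G, every match must begin where the previous one ended.
    static String withAnchor(String input) {
        return input.replaceAll("\\G" + BODY, "$1\\\\$2");
    }

    // Without \G, the engine may retry from any later position,
    // including a position inside "(.*)".
    static String withoutAnchor(String input) {
        return input.replaceAll(BODY, "$1\\\\$2");
    }

    public static void main(String[] args) {
        System.out.println(withAnchor("x (.*) y"));    // prints x (.*) y
        System.out.println(withoutAnchor("x (.*) y")); // prints x (\.\*\) y
    }
}
```

In the unanchored run, the attempts starting at "x", " " and "(" all fail because no special character follows the protected sequence, so the engine restarts at "." and escapes the inside of (.*).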

  • \G: Starts from last match

  • ((?:[^()\[\]{}?+\\.$^*|!&@#%_":<>/;'`~-]|\Q(.*)\E)*+): Capture 0 or more of either a non-special character or the special pattern not to be replaced. Note that I have added the possessive quantifier + after *. This prevents the engine from backtracking into the group (and giving up an already matched (.*)) when it cannot find the special character that we specify after this.

    • [^()\[\]{}?+\\.$^*|!&@#%_":<>/;'`~-]: Negated character class of special characters.

    • \Q(.*)\E: Special sequence (.*) not to be replaced, literal quoted by \Q and \E.

  • ([()\[\]{}?+\\.$^*|!&@#%_":<>/;'`~-]): Capture the single special character.

The whole regex matches a string with a minimum length of 1 (the special character). The first capturing group contains the part that should be left unchanged, and the second capturing group contains the special character that should be escaped.
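The effect of the possessive quantifier can be checked the same way. The sketch below compares the regex from the answer with a variant that uses a plain greedy * instead of *+; the class and method names are made up for this example.

```java
class PossessiveDemo {
    static final String SPECIALS = "()\\[\\]{}?+\\\\.$^*|!&@#%_\":<>/;'`~-";

    // Builds the regex with the given quantifier on the first group.
    static String regex(String quantifier) {
        return "\\G((?:[^" + SPECIALS + "]|\\Q(.*)\\E)" + quantifier
             + ")([" + SPECIALS + "])";
    }

    // Possessive *+ as in the answer: what the group consumed is never given back.
    static String possessive(String input) {
        return input.replaceAll(regex("*+"), "$1\\\\$2");
    }

    // Greedy *: when no special character follows, the engine backtracks,
    // gives up the matched "(.*)" and then treats its "(" as a special character.
    static String greedy(String input) {
        return input.replaceAll(regex("*"), "$1\\\\$2");
    }

    public static void main(String[] args) {
        System.out.println(possessive("x (.*) y")); // prints x (.*) y
        System.out.println(greedy("x (.*) y"));     // prints x \(\.\*\) y
    }
}
```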
