[Solved] How to replace multiple value in capturing group with regexp


That’s a fairly nasty bit of corruption you’ve got in your file; it looks like CDATA is malformed in several different ways. This catches all of the errors you’ve described:

<tag>.*?\K((?:<!|!<)CDATA\[.*?\]+>+)(?=.*<\/tag>)

This regex finds <tag>, lazily consumes text up to the “start” of your CDATA tag, and then uses \K to throw all of that away so it isn’t part of the match. Then, it looks for ! and < in either order, followed by CDATA[ and any text inside. Next comes one or more ] followed by one or more >, so as many of each as we can find, though always at least one of each. The final bit of the regex is a lookahead to make sure the closing </tag> is present later on the line.

Note that this will only match one malformed tag per line on each pass, since every match has to start over from <tag> (this assumes one <tag>...</tag> pair per line). In order to get them all, it’s likely you’ll need to run a replacement with this regex a few times. Once the regex has no more matches, you can be sure you’re free of malformed tags… or at least, tags with the mutations you’ve described in your question.
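
If you’d rather script the repeat-until-clean step, here’s a rough sketch in Python. A few assumptions on my part: Python’s built-in re module doesn’t support \K, so I’ve swapped it for a capture group that gets written back by the replacement; the function name strip_malformed is made up; and I’m assuming you simply want the malformed tags deleted, which may not be what your replacement actually is.

```python
import re

# Python's built-in re has no \K, so the text before the CDATA tag is
# captured in group 1 and written back instead of being discarded.
MALFORMED = re.compile(r"(<tag>.*?)(?:<!|!<)CDATA\[.*?\]+>+(?=.*</tag>)")

def strip_malformed(line):
    """Delete CDATA-like tags from one line, one per pass, until none match.

    Hypothetical helper: deleting the tag is an assumption, not
    necessarily the replacement the original question needs.
    """
    while True:
        # Keep group 1 (the prefix), drop the matched CDATA tag.
        line, count = MALFORMED.subn(r"\1", line)
        if count == 0:
            return line

print(strip_malformed("<tag>a<!CDATA[x]]>>b!<CDATA[y]>c</tag>"))
```

Each pass removes at least one character, so the loop always terminates. Lines without a closing </tag> are left untouched because the lookahead fails.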


As an aside, if you want to keep all “properly formatted” CDATA tags, the regex gets WAY uglier:

<tag>.*?\K(?!<!CDATA\[[^\n\]]*\]>(?:[^>]|$))((?:<!|!<)CDATA\[.*?\]+>+)(?=.*<\/tag>)

This includes a negative lookahead to assert you’re not matching a “properly formatted” CDATA tag (here described as <!CDATA[...]>). This one runs really slowly if the start <tag> does not have a closing </tag> that matches, so if that’s an issue in your file(s), be warned.
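
To see what the negative lookahead buys you, here’s a small Python sketch of the same pattern. Same caveat as before: Python’s re has no \K, so the prefix goes in group 1 and the malformed tag in group 2. The name first_malformed is just something I picked for illustration.

```python
import re

# Group 1 stands in for \K (the discarded prefix); group 2 is the tag itself.
KEEP_VALID = re.compile(
    r"(<tag>.*?)"
    r"(?!<!CDATA\[[^\n\]]*\]>(?:[^>]|$))"  # refuse a well-formed <!CDATA[...]>
    r"((?:<!|!<)CDATA\[.*?\]+>+)"          # any CDATA-like tag that remains
    r"(?=.*</tag>)"                        # closing tag must follow
)

def first_malformed(line):
    """Return the first malformed CDATA tag on the line, or None."""
    match = KEEP_VALID.search(line)
    return match.group(2) if match else None

# The well-formed tag is skipped and the malformed one is reported.
print(first_malformed("<tag><!CDATA[ok]>a!<CDATA[bad]]></tag>"))
```

A tag like <!CDATA[x]>> still counts as malformed here, because the lookahead only accepts a single > that is followed by something other than another >.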

Good luck!
