[Solved] Regular Expression in String [closed]


That’s XML, and parsing XML with regular expressions is a bad idea. The reason is that these XML snippets are semantically identical (same elements, attributes, and text; only the whitespace between tags differs):

<ul type="disc">
  <li
      class="MsoNormal"
      style="line-height: normal; margin: 0in 0in 10pt; color: black; mso-list: l1 level1 lfo2; tab-stops: list .5in; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto">
    <span style="mso-fareast-font-family: 'Times New Roman'; mso-bidi-font-family: 'Times New Roman'">
      <font size="3">
        <font face="Calibri">Highlight the items you want to recover.</font>
      </font>
    </span>
  </li>
</ul>

And:

<ul
type="disc"
><li
class="MsoNormal"
style="line-height: normal; margin: 0in 0in 10pt; color: black; mso-list: l1 level1 lfo2; tab-stops: list .5in; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto"
><span
style="mso-fareast-font-family: 'Times New Roman'; mso-bidi-font-family: 'Times New Roman'"
><font
size="3"
><font
face="Calibri"
>Highlight the items you want to recover.</font></font></span></li></ul>

And:

<ul type="disc"><li class="MsoNormal" style="line-height: normal; margin: 0in 0in 10pt; color: black; mso-list: l1 level1 lfo2; tab-stops: list .5in; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto"><span style="mso-fareast-font-family: 'Times New Roman'; mso-bidi-font-family: 'Times New Roman'"><font size="3"><font face="Calibri">Highlight the items you want to recover.</font></font></span></li></ul>
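To make that concrete, here is a minimal sketch of how a line-oriented regex breaks on these layouts. The pattern `$naive` and the shortened sample strings are made up for illustration; they are not from the question.

```perl
use strict;
use warnings;

# Hypothetical naive pattern: expects '<span style="...">' on one line,
# with exactly one space after the tag name.
my $naive = qr/<span style="([^"]*)">/;

# The same element laid out two ways (shortened from the snippets above).
my $one_line   = q{<span style="mso-bidi-font-family: 'Times New Roman'">text</span>};
my $multi_line = qq{<span\nstyle="mso-bidi-font-family: 'Times New Roman'"\n>text</span>};

# Returns 1 if the naive pattern finds a span, 0 otherwise.
sub naive_match {
    my ($xml) = @_;
    return $xml =~ $naive ? 1 : 0;
}

for my $case ( [ 'one line', $one_line ], [ 'multi line', $multi_line ] ) {
    my ( $label, $xml ) = @$case;
    printf "%-10s %s\n", $label, naive_match($xml) ? 'matched' : 'NOT matched';
}
```

The regex finds the one-line form and misses the multi-line form, even though an XML parser treats both as the same element. You can keep patching the pattern, but a parser handles every valid layout.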

So please, use a parser. Since you’ve tagged Perl, here is a Perl solution:

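use strict;
use warnings;
use XML::Twig;

# Call the handler for every <span> element as the file is parsed.
XML::Twig->new(
    twig_handlers => {
        # $_ is the current element; fall back to '' if it has no style attribute.
        'span' => sub { print $_->att('style') // '', "\n" },
    },
)->parsefile('your_file.xml');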

This prints the style attribute of each span element, one per line. Once you have that string, you can turn it into key-value pairs by splitting on ; and then using : as the key-value separator.

E.g.:

use Data::Dumper;    # provides Dumper

# Inside the span handler, where $_ is the current element:
my $style  = $_->att('style');
my %styles = map { split( ': ', $_, 2 ) } split( '; ', $style );
print Dumper \%styles;
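The one-liner above assumes the separators are exactly `; ` and `: `. If the spacing in your data varies, or there is a trailing semicolon, something like the following may be safer. This is only a sketch in core Perl; `parse_style` is a hypothetical helper name and the sample string is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a CSS-style declaration string into a hash reference.
sub parse_style {
    my ($style) = @_;
    my %styles;
    for my $decl ( split /;/, $style // '' ) {
        # Split on the first colon only, so values containing ':' stay intact.
        my ( $key, $value ) = split /:/, $decl, 2;
        next unless defined $value;    # skip empty or malformed declarations
        s/^\s+|\s+$//g for $key, $value;    # trim surrounding whitespace
        $styles{$key} = $value if length $key;
    }
    return \%styles;
}

# Sample input with inconsistent spacing and a trailing semicolon.
my $styles = parse_style(
    "mso-fareast-font-family: 'Times New Roman';mso-bidi-font-family:'Times New Roman' ; "
);
print "$_ => $styles->{$_}\n" for sort keys %$styles;
```

Inside the twig handler you would call it as `parse_style( $_->att('style') )`. An undefined attribute simply gives an empty hash.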

Exactly what you do next depends on what you’re trying to accomplish.
