[Solved] How to get the contents inside alt? [closed]


Foreword

You should really use an HTML parser for this. That said, you seem to have creative control over the source string, and if the markup really is this simple, the edge cases should be few.
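For comparison, here is a minimal sketch of the parser route, assuming Python and its standard-library `html.parser` module. The original question does not name a language, and the class name `AltCollector` and the helper `extract_alts` are made up for this illustration.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collects the alt value of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, and value is None for valueless attributes.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "alt" and value is not None:
                self.alts.append(value)
                break  # keep only the first alt attribute of the tag


def extract_alts(html):
    """Hypothetical helper: return the alt text of each <img> in html."""
    collector = AltCollector()
    collector.feed(html)
    collector.close()
    return collector.alts


print(extract_alts('<p><img src="a.png" alt="first"><img src="b.png"></p>'))
```

Self-closing tags such as `<img ... />` also reach `handle_starttag`, because the default `handle_startendtag` forwards to it.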

Description

<img\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>

This regular expression will do the following:

  • find all the image tags
  • require the image tag to have an alt attribute
  • capture the alt attribute value into capture group 1
  • allow attribute values to be surrounded by single quotes, double quotes, or no quotes
  • avoid some tricky edge cases that make matching HTML with a regular expression difficult

Example

Live Demo

https://regex101.com/r/cN0lD4/2

Sample text

Note the difficult edge case in the second img tag.

<a href="gallery.com/gallery-name"; target="_blank"> <img class="aligncenter" src="myblog.com/wp-content/image.jpg" alt=" I want to get this text" width =" 400 " height="300" /></a>

<img onmouseover="  alt="This is not the droid you are looking for" ;"  class="aligncenter" src="myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />

Sample Matches

  • Capture group 0 gets the entire img tag
  • Capture group 1 gets just the value in the alt attribute, not including any surrounding quotes

[0][0] = <img class="aligncenter" src="myblog.com/wp-content/image.jpg" alt=" I want to get this text" width =" 400 " height="300" />
[0][1] =  I want to get this text

[1][0] = <img onmouseover="  alt="This is not the droid you are looking for" ;"  class="aligncenter" src="myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />
[1][1] = This is the droid I'm looking for.
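To reproduce these matches outside the live demo, the expression can be run from a short script. This is only a sketch, assuming Python's `re` module; the answer itself does not say which regex engine the asker uses. The name `IMG_ALT` is invented for the example, and the `href` and `src` values in the sample are shortened to placeholders.

```python
import re

# The expression from the Description section, split across two raw strings
# purely for line length; Python concatenates them into one pattern.
IMG_ALT = re.compile(
    r"""<img\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=['"]([^"]*)['"]?)"""
    r"""(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>"""
)

# The two sample tags, with shortened placeholder URLs.
sample = '''
<a href="gallery-name" target="_blank"> <img class="aligncenter" src="image.jpg" alt=" I want to get this text" width =" 400 " height="300" /></a>

<img onmouseover="  alt="This is not the droid you are looking for" ;"  class="aligncenter" src="image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />
'''

# group(0) is the whole img tag, group(1) is the alt value without quotes.
alts = [match.group(1) for match in IMG_ALT.finditer(sample)]

for alt in alts:
    print(repr(alt))
```

The second tag shows why the look-ahead skips whole quoted values: the `alt=` hidden inside the `onmouseover` value is consumed as part of that value, so only the real attribute is captured.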

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  <img                     '<img'
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    alt=                     'alt='
----------------------------------------------------------------------
    ['"]                     any character of: ''', '"'
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    ['"]?                    any character of: ''', '"' (optional
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"\s]*                 any character except: ''', '"',
                             whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  \s?                      whitespace (\n, \r, \t, \f, and " ")
                           (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  \/?                      '/' (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
