Foreword
You should really use an HTML parser for this, but you seem to have creative control over the source string, and if the markup really is this simple, the edge cases should be limited.
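For comparison, here is a minimal sketch of the parser route mentioned above, assuming Python and its standard-library `html.parser` module (the question itself does not name a language). The class name `AltCollector` and the helper `collect_alts` are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collect the alt attribute of every <img> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased.
        if tag == "img":
            for name, value in attrs:
                if name == "alt":
                    self.alts.append(value)

    # Self-closing tags such as <img ... /> go through
    # handle_startendtag, whose default implementation calls
    # handle_starttag, so they are covered as well.


def collect_alts(html):
    """Return the alt values found in the given HTML string."""
    parser = AltCollector()
    parser.feed(html)
    parser.close()
    return parser.alts


print(collect_alts('<a href="x"><img src="a.jpg" alt="hello world" /></a>'))
```

The parser handles quoting and attribute order for you, which is the main reason to prefer it when the input is not under your control.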
Description
<img\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>
This regular expression will do the following:
- find all the image tags
- require the image tag to have an alt attribute
- capture the alt attribute value and put it into capture group 1
- allow the other attribute values to be surrounded by single, double, or no quotes; the alt value itself is read from its opening quote up to the next double quote
- avoid some tricky edge cases that make matching HTML with a regular expression difficult
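As a rough illustration of how the expression might be applied, the sketch below assumes Python's `re` module. The names `IMG_ALT` and `alt_texts` and the sample HTML string are made up for this example and are not part of the original question.

```python
import re

# The expression from this answer, split over two adjacent raw strings
# only to keep the source lines short; Python joins them into one pattern.
IMG_ALT = re.compile(
    r"""<img\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=['"]([^"]*)['"]?)"""
    r"""(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>"""
)


def alt_texts(html):
    """Return capture group 1 (the alt value) for every matching img tag."""
    return [m.group(1) for m in IMG_ALT.finditer(html)]


html = '<p><img src="a.png" alt="A red kite" width="40" /></p>'
print(alt_texts(html))  # prints ['A red kite']
```

A tag without an alt attribute fails the look-ahead and is skipped, so the returned list only contains values from tags that actually carry the attribute.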
Example
Live Demo
https://regex101.com/r/cN0lD4/2
Sample text
Note the difficult edge case in the second img tag.
<a href="https://stackoverflow.com/questions/37667007/gallery.com/gallery-name"; target="_blank"> <img class="aligncenter" src="https://stackoverflow.com/questions/37667007/myblog.com/wp-content/image.jpg" alt=" I want to get this text" width =" 400 " height="300" /></a>
<img onmouseover=" alt="This is not the droid you are looking for" ;" class="aligncenter" src="https://stackoverflow.com/questions/37667007/myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />
Sample Matches
- Capture group 0 gets the entire img tag
- Capture group 1 gets just the value in the alt attribute, not including any surrounding quotes
[0][0] = <img class="aligncenter" src="https://stackoverflow.com/questions/37667007/myblog.com/wp-content/image.jpg" alt=" I want to get this text" width =" 400 " height="300" />
[0][1] = I want to get this text
[1][0] = <img onmouseover=" alt="This is not the droid you are looking for" ;" class="aligncenter" src="https://stackoverflow.com/questions/37667007/myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />
[1][1] = This is the droid I'm looking for.
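To make the edge case concrete, the following sketch (Python's `re` module assumed) runs a deliberately naive pattern and the full expression against the second tag, with the src URL shortened for readability. `NAIVE` is a hypothetical pattern invented here for contrast; it is not something the answer recommends.

```python
import re

# Second sample tag, with the src value shortened to "image.jpg".
tag = (
    '<img onmouseover=" alt="This is not the droid you are looking for" ;" '
    'class="aligncenter" src="image.jpg" '
    'alt="This is the droid I\'m looking for." width =" 400 " height="300" />'
)

# Hypothetical naive pattern: grabs the first thing that looks like alt="...".
NAIVE = re.compile(r'<img[^>]*?\salt="([^"]*)"')

# The expression from this answer.
FULL = re.compile(
    r"""<img\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=['"]([^"]*)['"]?)"""
    r"""(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>"""
)

# The naive pattern is fooled by the alt= text inside the onmouseover value.
print(NAIVE.search(tag).group(1))
# The full expression steps over quoted values and finds the real attribute.
print(FULL.search(tag).group(1))
```

The difference comes from the look-ahead consuming each quoted attribute value as a single unit, so text that merely resembles an attribute inside another value is never inspected on its own.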
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
  <img                     '<img'
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    alt=                     'alt='
----------------------------------------------------------------------
    ['"]                     any character of: ''', '"'
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    ['"]?                    any character of: ''', '"' (optional
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"\s]*                 any character except: ''', '"',
                             whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  \s?                      whitespace (\n, \r, \t, \f, and " ")
                           (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  \/?                      '/' (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
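The same breakdown can be kept next to the pattern itself. The sketch below assumes Python's `re.VERBOSE` flag, which ignores unescaped whitespace outside character classes and treats `#` as the start of a comment. The name `IMG_ALT_VERBOSE` is hypothetical, and the comments paraphrase the explanation above rather than add anything new.

```python
import re

IMG_ALT_VERBOSE = re.compile(
    r"""
    <img\s                  # literal '<img' followed by one whitespace
    (?=                     # look ahead (consumes nothing) for the alt attribute
      (?: [^>=]             #   any character except '>' and '='
        | ='[^']*'          #   or a single-quoted attribute value
        | ="[^"]*"          #   or a double-quoted attribute value
        | =[^'"][^\s>]*     #   or an unquoted attribute value
      )*?                   #   repeated as few times as possible
      \salt=['"]            #   whitespace, 'alt=', then an opening quote
      ([^"]*)               #   group 1: everything up to the next double quote
      ['"]?                 #   optional closing quote
    )
    (?: [^>=]               # then consume the tag itself, attribute by attribute
      | ='[^']*'
      | ="[^"]*"
      | =[^'"\s]*
    )*
    "\s?\/?>                # final quote, optional space, optional '/', then '>'
    """,
    re.VERBOSE,
)

tag = '<img class="c" src="a.jpg" alt="some text" width =" 400 " height="300" />'
match = IMG_ALT_VERBOSE.search(tag)
print(match.group(1))  # prints some text
```

Verbose mode only changes how the pattern is written, not what it matches, so this form is mainly useful when the expression has to be maintained later.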