[Solved] need to find a pattern and extract that out [closed]


I’ll take one stab at this with two implementations.

First, I’ll use a character vector, v. If your strings are in a column of a data frame, replace v with myframe$mycolumn.

v <- c("110231 validation 108871 validation 85933",
"21102 validation 93442 21232 validation 73769 26402 validation 127221 26402",
"99763 99763 validation 99763 validation 99763",
"validation 199022 validation 122099 validation 12209 validation 199022 validation 199022 validation 122099")

Extraction of “validation number” matches

re <- gregexpr("validation [0-9]+", v)
re
# [[1]]
# [1]  8 26
# attr(,"match.length")
# [1] 17 16
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
# [[2]] ...
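To make those positions and lengths a little more concrete, here is a rough sketch that cuts the matches out by hand for a single string. The names x, starts, ends and manual are mine, purely for illustration; this only mimics what regmatches does for us.

```r
# One of the example strings from above
x <- "110231 validation 108871 validation 85933"

# Match data for this single string
m <- gregexpr("validation [0-9]+", x)[[1]]

# Start positions plus match lengths give the end positions
starts <- as.integer(m)
ends <- starts + attr(m, "match.length") - 1

# Cut out each match by hand
manual <- substring(x, starts, ends)
manual
# [1] "validation 108871" "validation 85933"
```
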

We can extract the matching substrings with regmatches:

regmatches(v, re)
# [[1]]
# [1] "validation 108871" "validation 85933" 
# [[2]]
# [1] "validation 93442"  "validation 73769"  "validation 127221"
# [[3]]
# [1] "validation 99763" "validation 99763"
# [[4]]
# [1] "validation 199022" "validation 122099" "validation 12209" 
# [4] "validation 199022" "validation 199022" "validation 122099"

We now have a list in which each of your strings contributes one or more matching substrings. To keep only the first match per string, iterate over the list and take the first element of each.

sapply(regmatches(v, re), `[`, 1)
# [1] "validation 108871" "validation 93442"  "validation 99763" 
# [4] "validation 199022"

This does not fail even when a string contains no match for the pattern:

v <- c(v, "nothing here")
re <- gregexpr("validation [0-9]+", v)
sapply(regmatches(v, re), `[`, 1)
# [1] "validation 108871" "validation 93442"  "validation 99763" 
# [4] "validation 199022" NA                 

The NA indicates that the string had no match, while still preserving its place in your string vector.
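If only the first match is ever needed, regexpr (which reports a single match per string) is a possible alternative to gregexpr. The sketch below applies it to a data frame column. The frame myframe, its contents, and the new columns first_match and number are made up for illustration; adapt the names to your data.

```r
# Hypothetical data frame standing in for myframe$mycolumn
myframe <- data.frame(
  mycolumn = c("110231 validation 108871 validation 85933",
               "99763 99763 validation 99763 validation 99763",
               "nothing here"),
  stringsAsFactors = FALSE
)

# regexpr() reports only the first match per string (-1 means no match)
m <- regexpr("validation [0-9]+", myframe$mycolumn)

# With regexpr, regmatches() returns matched strings only,
# so fill a pre-sized NA vector to keep one result per row
first <- rep(NA_character_, nrow(myframe))
first[m != -1] <- regmatches(myframe$mycolumn, m)
myframe$first_match <- first

# Optionally keep just the number
myframe$number <- as.numeric(sub("^validation ", "", first))
```

The NA-filled vector is there because regmatches silently drops non-matching strings when given regexpr output, which would otherwise misalign the result with the rows.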

gsub only

This version starts again from the original four strings in v. First, remove any numbers and spaces that come before (but not including) the first “validation”:

gsub("^[0-9 ]*(?=validation)", "", v, perl=TRUE)
# [1] "validation 108871 validation 85933"                                                                        
# [2] "validation 93442 21232 validation 73769 26402 validation 127221 26402"                                     
# [3] "validation 99763 validation 99763"                                                                         
# [4] "validation 199022 validation 122099 validation 12209 validation 199022 validation 199022 validation 122099"

Now remove anything after the first “number”:

gsub("([0-9])\\b.*", "\\1", gsub("^[0-9 ]*(?=validation)", "", v, perl=TRUE))
# [1] "validation 108871" "validation 93442"  "validation 99763"  "validation 199022"
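The two substitution steps can probably be folded into one call with a capture group. This is only a sketch: the pattern and the names pattern and first are my own, and it assumes the same “validation number” layout as above. Note that sub and gsub return a string unchanged when nothing matches, so non-matching strings are set to NA explicitly here.

```r
v <- c("110231 validation 108871 validation 85933",
       "21102 validation 93442 21232 validation 73769 26402 validation 127221 26402",
       "nothing here")

# Lazy prefix, captured first "validation number", then the rest of the string
pattern <- "^.*?(validation [0-9]+).*$"
first <- sub(pattern, "\\1", v, perl = TRUE)

# Strings without a match come back unchanged, so mark them as NA
first[!grepl("validation [0-9]+", v)] <- NA
first
# [1] "validation 108871" "validation 93442"  NA
```
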
