[Solved] Extract components of strings and fill-down where missing in R


Based on the discussion in the comments it appears that it is not feasible to construct a key for joining the Species_name with a look-up table.

What you could do is determine the “similarity” of your (species) names with such a look-up table. This assumes that your (species) names are different enough. To determine the “similarity” or better the “distance” of the strings (your names), we use the package {stringdist}.
There are many string-(dis)similarity algorithms. For the example below, I work on the basis that your (species) names begin in a similar way. (Note: you may have to ensure this or use another algorithm for the distance measure).

You did not provide a fully reproducible example. I construct an example input tibble based on your question and derive the look-up vector from this tibble. You may be able to extract the look-up from your data in another manner.

prep work

library(dplyr)                     # for data crunching
# install.packages("stringdist")   # install from CRAN
library(stringdist)                # use stringdist

# emulate example
input <- tibble::tribble(
   ~SEQ, ~Species_name
    , 1, "Aglaia lawii"
    , 2, "Aglaia lawii"
    , 2, "Aglaia lawii (Wight) C.J.Saldanha ex Ramamoorthy"
    , 3, "Alangium uniloculare"
    , 4, "Alangium uniloculare (Griff.) King"
   )

# create a look-up vector of all "full names"
lookup <- input %>% 
   filter(grepl(pattern = ".\\("    # checking for the opening bracket
                , x = Species_name)
   ) %>% 
   pull(Species_name)               # keep the names column as look-up vector

string (dis)similarity

There are many algorithms to determine the (dis)similarity of strings, and the {stringdist} package offers several of them through its stringdist() function, which returns a distance. Without going into too much detail, think of the distance roughly as the number of operations required to turn a string into one of the look-up strings. The algorithms differ in which operations they allow and how they weigh them; see the package documentation for details. The Jaro-Winkler algorithm gives preference to strings that match from the beginning.
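
To make the "number of operations" idea concrete, here is a small sketch using the Levenshtein method ("lv"), which counts single-character insertions, deletions and substitutions. The strings are made up for illustration and are not from your data.

```r
library(stringdist)

# classic textbook pair: 2 substitutions + 1 insertion
d_kitten <- stringdist("kitten", "sitting", method = "lv")   # 3

# Levenshtein does not care where the extra characters sit:
# appending or prepending " King" / "King " both cost 5 insertions
d_suffix <- stringdist("Aglaia lawii", "Aglaia lawii King", method = "lv")   # 5
d_prefix <- stringdist("Aglaia lawii", "King Aglaia lawii", method = "lv")   # 5

c(d_kitten, d_suffix, d_prefix)
```

Because a plain edit distance treats both cases the same, a method that rewards a shared beginning is the more natural fit for names that only differ by a trailing author citation.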

Let’s check how the Jaro-Winkler implementation (method = "jw") in {stringdist} works:

stringdist(input$Species_name[1], lookup, method = "jw")
[1] 0.2500000 0.3908497

From this we can see that the first element of our lookup, Aglaia lawii (Wight) C.J.Saldanha ex Ramamoorthy, has the lower distance, i.e. it is the more similar string and it starts the same way as Aglaia lawii.
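
One detail worth checking against the package documentation: in {stringdist}, method = "jw" takes a prefix factor p, and with the default p = 0 the result is the plain Jaro distance. The Winkler preference for a shared beginning only kicks in when p is set above 0 (the valid range is 0 to 0.25). A minimal sketch, where p = 0.1 is my assumption and not a recommendation:

```r
library(stringdist)

short <- "Aglaia lawii"
full  <- "Aglaia lawii (Wight) C.J.Saldanha ex Ramamoorthy"

# default p = 0: plain Jaro distance
d_jaro <- stringdist(short, full, method = "jw")

# p > 0: Winkler's prefix bonus shrinks the distance for a shared beginning
d_winkler <- stringdist(short, full, method = "jw", p = 0.1)

c(jaro = d_jaro, winkler = d_winkler)
```

For the example data the ranking is the same either way, so the rest of this answer keeps the default.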

iterating over our tibble to identify the best full names

As we work with a tibble of species names, we need to compare the actual species names with the look-up vector. To iterate over the different rows we use {purrr} and apply the (dis)similarity comparison.
For the latter we construct a short function to make this more compact.

Note: The lookup was created above based on the example data. This might however come from elsewhere.

library(purrr)

best_guess_full_name <- function(.df, .options = lookup){
    # best guess (bg): pick the look-up entry with the smallest distance
    bg <- .options[which.min(stringdist(.df$Species_name, .options, method = "jw"))]
    # append the "best guess" to the row (data frame) and return it
    .df %>% mutate(Full_species = bg)
}

Let’s test the function on the 2nd row of input (i.e. input[2, ]).

input[2, ] %>% best_guess_full_name()
# A tibble: 1 x 3
    SEQ Species_name Full_species                                    
  <dbl> <chr>        <chr>                                           
1     2 Aglaia lawii Aglaia lawii (Wight) C.J.Saldanha ex Ramamoorthy

putting it all together

With this we can now process the input dataframe/tibble row by row and return the result as a dataframe/tibble.

input %>% 
   split(seq_len(nrow(input))) %>%            # creates a list, each row a data frame (in row order)
   purrr::map_dfr(.f = best_guess_full_name)  # iterate over every list element (row data frame)

# A tibble: 5 x 3
    SEQ Species_name                                     Full_species                                    
  <dbl> <chr>                                            <chr>                                           
1     1 Aglaia lawii                                     Aglaia lawii (Wight) C.J.Saldanha ex Ramamoorthy
2     2 Aglaia lawii                                     Aglaia lawii (Wight) C.J.Saldanha ex Ramamoorthy
3     2 Aglaia lawii (Wight) C.J.Saldanha ex Ramamoorthy Aglaia lawii (Wight) C.J.Saldanha ex Ramamoorthy
4     3 Alangium uniloculare                             Alangium uniloculare (Griff.) King              
5     4 Alangium uniloculare (Griff.) King               Alangium uniloculare (Griff.) King 
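
If the row-by-row split feels heavy, the same look-up can probably be done in one vectorised call with amatch() from {stringdist}, which returns the index of the closest table entry for each name. This is only a sketch of an alternative: the plain data frame stands in for the tibble above, and maxDist = 1 is my assumption to accept any match for "jw" (its default is much stricter).

```r
library(stringdist)

# same toy data as above, as a base R data frame
input <- data.frame(
  SEQ = c(1, 2, 2, 3, 4),
  Species_name = c("Aglaia lawii",
                   "Aglaia lawii",
                   "Aglaia lawii (Wight) C.J.Saldanha ex Ramamoorthy",
                   "Alangium uniloculare",
                   "Alangium uniloculare (Griff.) King")
)

# look-up vector: all names that contain an opening bracket
lookup <- input$Species_name[grepl("\\(", input$Species_name)]

# index of the closest look-up entry per name (NA if nothing is within maxDist)
idx <- amatch(input$Species_name, lookup, method = "jw", maxDist = 1)

input$Full_species <- lookup[idx]
input
```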

This is what I think you can do if the left_join() approach with proper keys does not work. The drawback of this solution is that we only “estimate” the full names.

  • Thus, if the functions from {stringdist} do not help to get this right, you would need to specify your own “similarity” measurement.
  • The result might be wrong: several full names can end up at the same distance from a given name, in which case which.min() simply picks the first one. You may need to break such ties yourself.
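
To get a feeling for how often these two problems occur, one option is to compute the full distance matrix and flag ties and weak matches. The sketch below is an illustration only: the third species name and the cut-off max_dist = 0.3 are hypothetical and would need tuning on your data.

```r
library(stringdist)

species <- c("Aglaia lawii", "Alangium uniloculare", "Quercus robur")
lookup  <- c("Aglaia lawii (Wight) C.J.Saldanha ex Ramamoorthy",
             "Alangium uniloculare (Griff.) King")

# distance matrix: one row per name, one column per look-up entry
d <- stringdistmatrix(species, lookup, method = "jw")

best      <- apply(d, 1, which.min)    # index of the closest look-up entry
best_dist <- apply(d, 1, min)          # distance to that entry
n_ties    <- rowSums(d == best_dist)   # number of entries sharing the minimum

max_dist <- 0.3   # hypothetical cut-off: matches above it are treated as "no match"

result <- data.frame(
  Species_name = species,
  Full_species = ifelse(best_dist <= max_dist, lookup[best], NA_character_),
  distance     = best_dist,
  tied         = n_ties > 1
)
result
```

Rows with tied = TRUE or Full_species = NA are the ones to inspect by hand.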

Hope this gets you going!
