[Solved] sum lines with similar substring [closed]

Accepted Answer

We can match the substring that runs from the first _ to the end of the string (_.*$) in the ‘IsomiR’ column and replace it with '' using sub. The result serves as the grouping variable. With dplyr, summarise_each can then sum multiple columns at once.

library(dplyr)
df1 %>%
   group_by(IsomiR= sub('_.*$', '', IsomiR)) %>%
   summarise_each(funs(sum))
#         IsomiR X185R X68G X60G X134G X124R
#1 hsa-let-7a-3p     6   11    4    15    10
#2 hsa-let-7b-3p    28   14   18    57    38
#3 hsa-let-7c-3p   125   34  106   138    81
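
In later dplyr releases summarise_each and funs are deprecated. A minimal sketch of the same grouping with across, assuming dplyr 1.0.0 or newer; toy is a trimmed copy of df1 (rows 1, 2, 4 and 5, first two count columns) introduced here only to keep the example short:

```r
library(dplyr)

# Trimmed copy of df1 (an assumption for illustration, not the full dataset)
toy <- data.frame(
  IsomiR = c("hsa-let-7a-3p_ATACAATCTACTGTCTTTCCT",
             "hsa-let-7a-3p_ATATACAATCTACTGTCTTT",
             "hsa-let-7b-3p_ATATACAATCTACTGTCTTTCC",
             "hsa-let-7b-3p_ATATACAATCTACTGTCTTTCT"),
  X185R = c(1L, 1L, 7L, 15L),
  X68G  = c(6L, 0L, 5L, 6L),
  stringsAsFactors = FALSE
)

res <- toy %>%
  # strip everything from the first "_" onwards to build the group key
  group_by(IsomiR = sub("_.*$", "", IsomiR)) %>%
  # everything() skips the grouping column, so only the counts are summed
  summarise(across(everything(), sum))

print(res)
```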

Or we can use separate from tidyr to split the ‘IsomiR’ column into two columns by specifying sep='_', use the first as the grouping variable, and select the columns to sum in summarise_each with a regex pattern in matches.

library(tidyr)
separate(df1, IsomiR, into=c('IsomiR', 'unWanted'), sep='_') %>%
             group_by(IsomiR) %>%
             summarise_each(funs(sum), matches('[0-9]+[A-Z]$'))
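
To see which columns that pattern picks up, the same regex can be tried on the column names with base R's grep. This is only a sketch: cols is a hand-written vector of the names present after separate, and ignore.case = TRUE mirrors the default of matches.

```r
# Column names as they would look after separate() (written out by hand here)
cols <- c("IsomiR", "unWanted", "X185R", "X68G", "X60G", "X134G", "X124R")

# Keep names that end in one or more digits followed by a single letter
selected <- grep("[0-9]+[A-Z]$", cols, value = TRUE, ignore.case = TRUE)

print(selected)
```

‘IsomiR’ and ‘unWanted’ contain no digits, so neither is selected and only the count columns are summed.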

Using data.table, we convert the ‘data.frame’ to a ‘data.table’ by reference (setDT(df1)), remove the substring in ‘IsomiR’ with sub, use the result as the grouping variable, then loop through the remaining columns (lapply(.SD, ..)) and get the sum (suggested by @David Arenburg in the comments).

library(data.table)
setDT(df1)[, lapply(.SD, sum), by = .(IsomiR = sub('_.*', '', IsomiR))]
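
Because setDT changes the class of df1 in place, a variant that works on a copy may be preferable when the original data.frame is still needed. A minimal sketch using as.data.table on toy, a trimmed copy of df1 (rows 1, 2, 4 and 5, first two count columns) assumed here for brevity:

```r
library(data.table)

# Trimmed copy of df1 (an assumption for illustration, not the full dataset)
toy <- data.frame(
  IsomiR = c("hsa-let-7a-3p_ATACAATCTACTGTCTTTCCT",
             "hsa-let-7a-3p_ATATACAATCTACTGTCTTT",
             "hsa-let-7b-3p_ATATACAATCTACTGTCTTTCC",
             "hsa-let-7b-3p_ATATACAATCTACTGTCTTTCT"),
  X185R = c(1L, 1L, 7L, 15L),
  X68G  = c(6L, 0L, 5L, 6L),
  stringsAsFactors = FALSE
)

# as.data.table() returns a new data.table; toy itself stays a data.frame
res <- as.data.table(toy)[, lapply(.SD, sum),
                          by = .(IsomiR = sub("_.*", "", IsomiR))]

print(res)
print(class(toy))
```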

Another option is the formula method of aggregate from base R, applied after we transform the ‘IsomiR’ column of the original dataset as described above.

 aggregate(.~IsomiR, transform(df1, IsomiR= sub('_.*', '', IsomiR)), sum)
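
The formula method drops rows containing NA by default (na.action = na.omit). If that matters for your data, the by= interface of aggregate is a possible alternative, since it passes the values to the function as they are. A sketch on toy, a trimmed copy of df1 (rows 1, 2, 4 and 5, first two count columns) assumed here for brevity:

```r
# Trimmed copy of df1 (an assumption for illustration, not the full dataset)
toy <- data.frame(
  IsomiR = c("hsa-let-7a-3p_ATACAATCTACTGTCTTTCCT",
             "hsa-let-7a-3p_ATATACAATCTACTGTCTTT",
             "hsa-let-7b-3p_ATATACAATCTACTGTCTTTCC",
             "hsa-let-7b-3p_ATATACAATCTACTGTCTTTCT"),
  X185R = c(1L, 1L, 7L, 15L),
  X68G  = c(6L, 0L, 5L, 6L),
  stringsAsFactors = FALSE
)

# Sum every column except the first, grouped by the prefix before "_"
res <- aggregate(toy[-1],
                 by = list(IsomiR = sub("_.*", "", toy$IsomiR)),
                 FUN = sum)

print(res)
```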

data

df1 <- structure(list(IsomiR = c("hsa-let-7a-3p_ATACAATCTACTGTCTTTCCT", 
"hsa-let-7a-3p_ATATACAATCTACTGTCTTT", 
"hsa-let-7a-3p_ATATACAATCTACTGTCTTTC", 
"hsa-let-7b-3p_ATATACAATCTACTGTCTTTCC",
"hsa-let-7b-3p_ATATACAATCTACTGTCTTTCT", 
"hsa-let-7b-3p_CCATACAATCTACTGTCTTTCT", "hsa-let-7b-3p_CTATACAATCTACTGTCTT", 
"hsa-let-7c-3p_CTATACAATCTACTGTCTTT", "hsa-let-7c-3p_CTATACAATCTACTGTCTTTC",    
"hsa-let-7c-3p_CTATACAATCTACTGTCTTTCA"), X185R = c(1L, 1L, 4L, 
 7L, 15L, 4L, 2L, 29L, 85L, 11L), X68G = c(6L, 0L, 5L, 5L, 6L, 
1L, 2L, 7L, 24L, 3L), X60G = c(1L, 1L, 2L, 2L, 14L, 1L, 1L, 26L, 
73L, 7L), X134G = c(2L, 1L, 12L, 6L, 49L, 0L, 2L, 21L, 109L, 
8L), X124R = c(2L, 4L, 4L, 3L, 32L, 0L, 3L, 19L, 59L, 3L)),
.Names = c("IsomiR", 
 "X185R", "X68G", "X60G", "X134G", "X124R"), class = "data.frame", 
row.names = c(NA, -10L))