We could match the substring starting from _ to the end of the string (.*$) in the ‘IsomiR’ column and replace it with '' using sub, then use the result as the grouping variable. If we are doing this with dplyr, summarise_each can be used for summing multiple columns. Note that summarise_each and funs are deprecated in recent dplyr versions in favour of summarise with across.
library(dplyr)
df1 %>%
group_by(IsomiR= sub('_.*$', '', IsomiR)) %>%
summarise_each(funs(sum))
#          IsomiR X185R X68G X60G X134G X124R
#1 hsa-let-7a-3p     6   11    4    15    10
#2 hsa-let-7b-3p    28   14   18    57    38
#3 hsa-let-7c-3p   125   34  106   138    81
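To see the grouping key on its own, here is a minimal base R sketch of the sub step applied to two values copied from the data section below. The names ids and prefix are only illustrative and are not part of the original code.

```r
# Two IsomiR values taken from the data section; 'ids' and 'prefix' are
# illustrative names, not objects used elsewhere in this answer.
ids <- c("hsa-let-7a-3p_ATACAATCTACTGTCTTTCCT",
         "hsa-let-7b-3p_CTATACAATCTACTGTCTT")

# Drop everything from the first underscore to the end of the string
prefix <- sub("_.*$", "", ids)
print(prefix)
```

Rows that share the same prefix end up in the same group once this value is used in group_by.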
Or we can use separate from tidyr to split the ‘IsomiR’ column into two by specifying sep='_', use the first part as the grouping variable, and in summarise_each select the columns to sum using a regex pattern in matches.
library(tidyr)
separate(df1, IsomiR, into=c('IsomiR', 'unWanted'), sep='_') %>%
group_by(IsomiR) %>%
summarise_each(funs(sum), matches('[0-9]+[A-Z]$'))
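Since summarise_each and funs are deprecated in current dplyr, the same grouping can probably be written with across. This is a sketch assuming dplyr >= 1.0.0; toy is a hypothetical three-row subset of df1 (two columns only), used so the sums can be checked by hand.

```r
library(dplyr)

# 'toy' is a hypothetical cut-down version of df1, for illustration only
toy <- data.frame(
  IsomiR = c("hsa-let-7a-3p_ATACAATCTACTGTCTTTCCT",
             "hsa-let-7a-3p_ATATACAATCTACTGTCTTT",
             "hsa-let-7b-3p_ATATACAATCTACTGTCTTTCC"),
  X185R = c(1L, 1L, 7L),
  X68G  = c(6L, 0L, 5L)
)

res <- toy %>%
  # overwrite IsomiR with its prefix and group on it
  group_by(IsomiR = sub("_.*$", "", IsomiR)) %>%
  # everything() skips the grouping column, so only the counts are summed
  summarise(across(everything(), sum))

print(res)
```

To restrict the sum to the sample columns as in the matches example, everything() could be swapped for matches('[0-9]+[A-Z]$').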
Using data.table, we convert the ‘data.frame’ to a ‘data.table’ (setDT(df1)), remove the substring in ‘IsomiR’ with sub, use that as the grouping variable, loop through the remaining columns (lapply(.SD, ..)) and get the sum (suggested by @David Arenburg in the comments).
library(data.table)
setDT(df1)[, lapply(.SD, sum), by = .(IsomiR = sub('_.*', '', IsomiR))]
Or another option is the formula method of aggregate from base R, after we transform the ‘IsomiR’ column of the original dataset as described above.
aggregate(.~IsomiR, transform(df1, IsomiR= sub('_.*', '', IsomiR)), sum)
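As a quick sanity check of the aggregate call, here is the same pattern on a small subset where the sums are easy to verify by hand. toy and agg are hypothetical names introduced only for this sketch; the rows are copied from the data section.

```r
# 'toy' is a hypothetical three-row subset of df1 with two count columns
toy <- data.frame(
  IsomiR = c("hsa-let-7a-3p_ATACAATCTACTGTCTTTCCT",
             "hsa-let-7a-3p_ATATACAATCTACTGTCTTT",
             "hsa-let-7b-3p_ATATACAATCTACTGTCTTTCC"),
  X185R = c(1L, 1L, 7L),
  X68G  = c(6L, 0L, 5L)
)

# transform() rewrites IsomiR to its prefix; '.' on the left-hand side
# means "all other columns", each summed within the IsomiR groups
agg <- aggregate(. ~ IsomiR,
                 transform(toy, IsomiR = sub("_.*", "", IsomiR)),
                 sum)
print(agg)
```

The two hsa-let-7a-3p rows collapse into one row (X185R = 1 + 1, X68G = 6 + 0), and the single hsa-let-7b-3p row is carried through unchanged.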
data
df1 <- structure(list(IsomiR = c("hsa-let-7a-3p_ATACAATCTACTGTCTTTCCT",
"hsa-let-7a-3p_ATATACAATCTACTGTCTTT",
"hsa-let-7a-3p_ATATACAATCTACTGTCTTTC",
"hsa-let-7b-3p_ATATACAATCTACTGTCTTTCC",
"hsa-let-7b-3p_ATATACAATCTACTGTCTTTCT",
"hsa-let-7b-3p_CCATACAATCTACTGTCTTTCT", "hsa-let-7b-3p_CTATACAATCTACTGTCTT",
"hsa-let-7c-3p_CTATACAATCTACTGTCTTT", "hsa-let-7c-3p_CTATACAATCTACTGTCTTTC",
"hsa-let-7c-3p_CTATACAATCTACTGTCTTTCA"), X185R = c(1L, 1L, 4L,
7L, 15L, 4L, 2L, 29L, 85L, 11L), X68G = c(6L, 0L, 5L, 5L, 6L,
1L, 2L, 7L, 24L, 3L), X60G = c(1L, 1L, 2L, 2L, 14L, 1L, 1L, 26L,
73L, 7L), X134G = c(2L, 1L, 12L, 6L, 49L, 0L, 2L, 21L, 109L,
8L), X124R = c(2L, 4L, 4L, 3L, 32L, 0L, 3L, 19L, 59L, 3L)),
.Names = c("IsomiR",
"X185R", "X68G", "X60G", "X134G", "X124R"), class = "data.frame",
row.names = c(NA, -10L))