[Solved] Grouping and summarizing [closed]

Accepted Answer

Correcting my previous wrong answer, and borrowing the use of rle from @akron: assuming that your data is in a data.frame named “df” and your “frame classes” are in a column named “frame_class”, as in the code below, this should work:

df = data.frame(n_frame = 1:13, frame_type = "frame_type",
                frame_class = c("I_frame", "P_frame", "P_frame", "B_frame", "P_frame", "P_frame",
                                "B_frame", "I_frame", "B_frame", "P_frame", "I_frame", "P_frame", "I_frame"))
df$frame_letter = substring(df$frame_class,1,1) # get only the beginning letter

# Find the location of I_frames
where_i = which(df$frame_class == "I_frame") 
num_i = length(where_i)
out_codes = list()

for (ind_i in seq_len(num_i - 1)) { # cycle on "sandwiches"; skipped if fewer than 2 I_frames
  start = where_i[ind_i]
  end = where_i[ind_i + 1]
  # Get data in a sandwich; this is empty when two I_frames are adjacent
  sub_data = df$frame_letter[start + seq_len(end - start - 1)]
  count_reps = rle(sub_data)  # find repetition pattern

  # build the code: run length (omitted when it is 1) followed by the letter
  run_codes = paste0(ifelse(count_reps$lengths == 1, "", count_reps$lengths),
                     count_reps$values)
  out_codes[[ind_i]] = paste0(c("I", run_codes), collapse = "") # put in list
}
out_codes

which gives:

> out_codes
[[1]]
[1] "I2PB2PB"

[[2]]
[1] "IBP"

[[3]]
[1] "IP"

Note that this is really quick and dirty: you should at least add some checks to be sure that the series always starts and ends with an “I_frame”, but this should point you in the right direction.
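As a rough idea of what such a check could look like, here is a minimal sketch. The helper name check_frame_bounds is made up for illustration and is not part of the answer above; it only assumes the "I_frame" label used in the example data.

```r
# Hypothetical helper (not from the original answer): stop early if the
# series does not begin and end with an "I_frame".
check_frame_bounds = function(frame_class) {
  frame_class = as.character(frame_class)  # also works if the column is a factor
  n = length(frame_class)
  if (n == 0) stop("frame_class is empty")
  if (!isTRUE(frame_class[1] == "I_frame")) stop("series does not start with an I_frame")
  if (!isTRUE(frame_class[n] == "I_frame")) stop("series does not end with an I_frame")
  invisible(TRUE)
}

frames = c("I_frame", "P_frame", "B_frame", "I_frame")
check_frame_bounds(frames)  # passes silently
```

You could call it as check_frame_bounds(df$frame_class) before the loop, so that leading or trailing frames outside a sandwich are reported instead of silently ignored.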

Also note that this could be slow for large datasets, since it loops over the sandwiches in R and grows the output list one element at a time.
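If speed becomes a problem, one possible direction (a sketch only, not benchmarked) is to label each sandwich with cumsum and encode all of them in one pass with split and vapply. The names sandwich_id, encode_sandwich and out_codes_vec are made up for this illustration; the data is the same as in the example above.

```r
frame_class = c("I_frame", "P_frame", "P_frame", "B_frame", "P_frame", "P_frame",
                "B_frame", "I_frame", "B_frame", "P_frame", "I_frame", "P_frame", "I_frame")
frame_letter = substring(frame_class, 1, 1)
is_i = frame_letter == "I"
sandwich_id = cumsum(is_i)             # increases by 1 at every I_frame
n_sandwiches = max(sum(is_i) - 1, 0)   # number of I...I pairs

# Hypothetical helper: run-length encode one sandwich and prefix "I"
encode_sandwich = function(x) {
  r = rle(x)
  paste0("I", paste0(ifelse(r$lengths == 1, "", r$lengths), r$values, collapse = ""))
}

# Frames before the first or after the last I_frame fall outside the
# factor levels, so split() drops them (same as the loop version).
groups = factor(sandwich_id[!is_i], levels = seq_len(n_sandwiches))
out_codes_vec = unname(vapply(split(frame_letter[!is_i], groups),
                              encode_sandwich, character(1)))
out_codes_vec
```

On the example data this returns the same three codes as the loop, as a character vector rather than a list. Two adjacent I_frames give an empty sandwich, which is encoded as just "I".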

Lorenzo