[Solved] R computing not so fast


This isn’t quite right, but it may give some indication of how to make this type of operation faster. Here’s the data:

url <- "http://pastebin.com/raw.php?i=hsGACr2L"
dfi <- read.csv(url)

I calculate the product of price and volume, and the cumulative sum of that product. Both calculations are vectorized, so they are fast.

pv <- with(dfi, Price * Volume)
cpv <- cumsum(pv)
vol_range <- 100000

My strategy was to figure out how to group the data in a relatively efficient way. I did this by creating a logical vector that is TRUE wherever a new group starts. (I think the actual calculation below is wrong, and that there are edge cases that will fail; the strategy probably needs to be re-thought, but the idea is to minimize the non-vectorized data modification.)

grp <- logical(nrow(dfi))
i <- 1
repeat {
    grp[i] <- TRUE
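    ## running total of pv from row i onward; find the first index where it
    ## exceeds vol_range (which.max returns the first TRUE)
    i <- which.max(cpv - (cpv[i] - pv[i]) > vol_range)
    ## prevent an endless loop when a single row exceeds vol_range,
    ## e.g., any(pv > vol_range)
    if (i > 1L && grp[i])
        i <- i + 1L
    ## i == 1L: no TRUE values, so FALSE is max, and elt 1 is first FALSE
    ## i > length(grp): the last row alone exceeded vol_range
    if (i == 1L || i > length(grp))
        break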
}
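
Since the loop above may have edge cases, one way to test it is against a plain, non-vectorized reference on small inputs. The sketch below is not part of the original approach: the name group_starts_ref is made up for illustration, and it assumes the intended rule is that a row starts a new group as soon as the running sum of pv since the current group's first row exceeds vol_range (this is inferred from the loop, not confirmed), with all pv values non-negative.

```r
## Hypothetical reference implementation: slow but easy to reason about.
## Assumes pv is non-negative and that a row opens a new group when the
## running total since the current group's first row exceeds vol_range.
group_starts_ref <- function(pv, vol_range) {
    grp <- logical(length(pv))
    running <- 0
    for (k in seq_along(pv)) {
        running <- running + pv[k]
        if (k == 1L || running > vol_range) {
            grp[k] <- TRUE      # row k opens a new group
            running <- pv[k]    # restart the total at row k
        }
    }
    grp
}

## Toy input (made up): the fifth value alone exceeds the threshold
toy_pv <- c(40, 40, 40, 40, 150, 10, 10)
ref <- group_starts_ref(toy_pv, vol_range = 100)
which(ref)
## [1] 1 3 5 6
```

Running the repeat loop on the same toy input (with pv <- toy_pv and cpv <- cumsum(toy_pv)) and comparing the two logical vectors with identical() gives a quick check before trusting the fast version on the full data.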

cumsum(grp) numbers the rows by group (1 for the first group, 2 for the second, …), and I add this to the data frame:

dfi$Group <- cumsum(grp)
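
To see why cumsum turns group-start flags into group ids, here is a tiny illustration on a made-up logical vector (starts is a hypothetical stand-in for grp). Each TRUE adds one to the running count, so every row up to the next TRUE shares the same id.

```r
## Made-up start flags: groups begin at rows 1, 4 and 6
starts <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)

## TRUE counts as 1 and FALSE as 0, so the running sum is the group id
group_id <- cumsum(starts)
group_id
## [1] 1 1 1 2 2 3
```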

For the output, the basic strategy is to split Price (etc.) by Group and apply a function to each group. There are a number of ways to do this. tapply is not particularly efficient, but it is likely to be sufficient for data of this scale. (data.table excels at these types of calculations, but does not provide any particular benefit up to this point.)

dfo <- with(dfi, {
    data.frame(
        open = tapply(Price, Group, function(x) x[1]),
        high = tapply(Price, Group, max),
        low = tapply(Price, Group, min),
        close = tapply(Price, Group, function(x) x[length(x)]),
        volume = tapply(Volume, Group, sum),
        pv = tapply(Price * Volume, Group, sum))
})
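
For larger inputs, the same per-group summary could be written with data.table, which was mentioned above as well suited to this step. This is only a sketch: it assumes the data.table package is installed, uses a made-up toy table rather than the pastebin data, and takes the Group column as already computed. It uses min for the low.

```r
library(data.table)

## Toy data (made up for illustration); Group is assumed to be already
## computed, e.g. by cumsum() over the group-start flags.
toy <- data.table(
    Price  = c(10, 12, 11, 9, 13, 14),
    Volume = c(1, 2, 3, 4, 5, 6),
    Group  = c(1L, 1L, 2L, 2L, 2L, 3L))

## One grouped pass; .N is the number of rows in the current group
toy_ohlc <- toy[, .(open   = Price[1L],
                    high   = max(Price),
                    low    = min(Price),
                    close  = Price[.N],
                    volume = sum(Volume),
                    pv     = sum(Price * Volume)),
                by = Group]
toy_ohlc
```

All six summaries come out of a single grouped expression, instead of six separate tapply calls that each split the data again.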

This takes a fraction of a second for the 10,000 row sample data.
