[Solved] Data selection error [closed]

Accepted Answer

Your code seems to do the reverse of what you’re trying to achieve:

“For each gene (in d2) which SNPs (from d1) are within 10kb of that gene?”

First of all, your code for d1$matched is backwards. Every p and d2 in it should be swapped (as written it doesn’t make much sense), which would give you a list of SNPs that are in cis with each gene (+/- 10kb).

I would approach it the way I’ve phrased your question:

cisWindow <- 10000 # size of your +/- window, in this case 10kb.
d3 <- data.frame()
# For each gene, locate the cis-SNPs
for (i in seq_len(nrow(d2))) { # seq_len() is safe even if d2 has no rows
  # Broken down into steps for readability.
  # 1. SNPs on the same chromosome as the gene:
  inCis <- d1[which(d1[,"CHR"] == d2[i, "chromosome"]),]
  # 2. ...at or after the gene start minus the window:
  inCis <- inCis[which(inCis[,"POS"] >= (d2[i, "start"] - cisWindow)),]
  # 3. ...and at or before the gene end plus the window:
  inCis <- inCis[which(inCis[,"POS"] <= (d2[i, "end"] + cisWindow)),]
  # Now we have the cis-SNPs, so let's build the data.frame for this gene,
  # and grow our data.frame d3:
  if (nrow(inCis) > 0) {
    d3 <- rbind(d3, cbind(d2[i,], inCis))
  }
}

I tried to find a solution that didn’t involve growing d3 in the loop, but because you’re attaching each row of d2 to zero or more rows from d1, I wasn’t able to come up with one that isn’t horribly inefficient.
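
For what it’s worth, one possible way to avoid the repeated rbind is to build one piece per gene in a list and bind them all once at the end. This is only a sketch, not a benchmarked replacement: the toy d1 and d2 below (including the SNP and gene columns) are invented for illustration, and only the CHR, POS, chromosome, start and end column names are taken from the loop above.

```r
# Toy data, invented for illustration; CHR/POS and chromosome/start/end
# follow the column names used in the loop above.
d1 <- data.frame(SNP = c("rs1", "rs2", "rs3", "rs4"),
                 CHR = c(1, 1, 2, 2),
                 POS = c(5000, 50000, 12000, 90000))
d2 <- data.frame(gene = c("geneA", "geneB", "geneC"),
                 chromosome = c(1, 2, 3),
                 start = c(10000, 20000, 1000),
                 end = c(20000, 30000, 2000))

cisWindow <- 10000 # +/- window around each gene, here 10kb

# Build one data.frame per gene; genes with no cis-SNPs yield NULL.
pieces <- lapply(seq_len(nrow(d2)), function(i) {
  keep <- which(d1$CHR == d2$chromosome[i] &
                d1$POS >= d2$start[i] - cisWindow &
                d1$POS <= d2$end[i] + cisWindow)
  if (length(keep) == 0) return(NULL)
  # Repeat the gene row once per matching SNP, then attach the SNP rows.
  cbind(d2[rep(i, length(keep)), , drop = FALSE],
        d1[keep, , drop = FALSE])
})

# Drop the empty genes and bind everything in a single rbind call.
pieces <- Filter(Negate(is.null), pieces)
d3 <- do.call(rbind, pieces) # NULL if no gene has any cis-SNP
rownames(d3) <- NULL
```

The matching logic is the same as in the loop; the only change is that rbind runs once over the whole list instead of once per gene, so d3 is not copied on every iteration.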