[Solved] k-mer counting into R using perfect hashing [closed]


Item 1: I’m not versed in k-mer counting and the definition of x is a bit shaky. If you could update your question, then I’ll try to address point 1. But, to me it seems as if you just need to use an algorithm that converts from base 10 back to base 4?

Item 2: Yes. The iteration that is being performed is suboptimal as you are constantly converting from a string to int and so on. Moreover, why are the functions being declared inline? Also, how large is the data being passed into A?

To address this, we port the string entirely to a std::vector<int>. In fact, I would say just port it directly to an std::vector<int> and avoid listing that it can be switched to non-rcpp code. Additionally, we opt to use a pass by reference paradigm instead of just a const variable. Lastly, I’ve cut down on the amount of inline declaration.

#include <Rcpp.h>
using namespace Rcpp;

//************************************************
inline const short int V (const char x){
  switch(x){
  case 'A':case 'a':
    return 0;
    break;
  case 'C':case 'c':
    return 1;
    break;   
  case 'G':case 'g':
    return 2;
    break;
  default:
    return 3;
  break;  
  }

}

std::vector<int> conv_A2V(const std::string & A){
 unsigned int obs = A.length()
 std::vector<int> out(obs);
 for(unsigned int i = 0; i < obs; i++){
   out(i) = V(A[i]);
 }
 return out;
}

unsigned int X0( const std::vector<int> & V_A, const int k, const int n){
  unsigned int  result=0;
  int j=k;
  for( int i=n-1;i>n-k-1;i--) {
    result+= pow(4,k-j)*V_A[i];
  j--;
  }
  return result;
}

// [[Rcpp::export]]
IntegerVector kmer4(const std::string A, const int n,const int k)
{
  // Convert
  std::vector<int> V_A = conv_A2V(A);

  IntegerVector P(pow(4,k));                  
  int x=X0(V_A,k,n);                              
  P[x]++;                   
  const int N=pow(4,k-1);               
  for( int i=n-k-1;i>-1;i--){
    x=N*V_A[i]+x/4-x%4/4;
    P[x]++;
  }
  return P;
}

4

solved k-mer counting into R using perfect hashing [closed]