[Solved] What should I use to perform similarity functions on a 200-column, 12-million-row dataset? [closed]


After getting suggestions from a couple of friends, I looked up the documentation on Elasticsearch. It seems like the perfect tool for my use-case: it is built for search/retrieval needs like this, shards horizontally across nodes, and handles huge datasets comfortably. Here's what should be done:

Store each row as a document, using the row's key as the _id and mapping each of the f1, f2… columns to its own field (a sketch of one such indexing call follows). The boost feature can then be used to increase the relevance of certain fields (essentially assigning them more weight, which eliminates the need for my own similarity function). Boosts can even be applied at query time, letting the user assign weights depending on the use-case.
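
For example, indexing a single row might look roughly like this (a minimal sketch: the index name rows, the type name row, the document id 12345 and all field values are made up for illustration, and only a few of the 200 columns are shown):

curl -XPUT 'localhost:9200/rows/row/12345' -d '
{
  "f16"  : "4324",
  "f25"  : "76783",
  "f67"  : "1093",
  "f192" : "232"
}'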

Here’s an example query that might work for this use-case (untested):

{
  "query" : {
    "filtered" : {
      "query" : {
        "bool" : {
          "should" : [
            { "match" : { "f192" : { "boost" : 2,   "query" : "232"   } } },
            { "match" : { "f16"  : { "boost" : 1,   "query" : "4324"  } } },
            { "match" : { "f25"  : { "boost" : 0.2, "query" : "76783" } } }
          ]
        }
      },
      "filter" : {
        "exists" : { "field" : "f67" }
      }
    }
  }
}
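
The query above uses the older (1.x) filtered syntax. On newer Elasticsearch releases (2.x onwards) the filtered query is deprecated and later removed; the same intent can be expressed with a bool query whose filter clause holds the exists check, roughly like this (again untested, same field names as above):

{
  "query" : {
    "bool" : {
      "should" : [
        { "match" : { "f192" : { "boost" : 2,   "query" : "232"   } } },
        { "match" : { "f16"  : { "boost" : 1,   "query" : "4324"  } } },
        { "match" : { "f25"  : { "boost" : 0.2, "query" : "76783" } } }
      ],
      "filter" : {
        "exists" : { "field" : "f67" }
      }
    }
  }
}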
