[Solved] What should I use to perform similarity functions on 200 column 12 million row dataset? [closed]

After getting suggestions from a couple of friends, I looked up the documentation on ElasticSearch. It seems like the perfect tool for my use case: it's built for search and retrieval needs such as this, shards well, and can handle very large datasets. Here's what should be done: Store each row in a document, with the key elements being …
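As a rough illustration of the "store each row in a document" step, here is a minimal sketch that turns table rows into per-document actions. The function name `rows_to_actions`, the index name `rows`, and the column names `row_id`, `col_a`, `col_b` are all hypothetical and not from the original answer. The dictionary layout (`_index`, `_id`, `_source`) is the one the bulk helper in the Python Elasticsearch client is generally given; the sketch only builds the dictionaries and does not talk to a cluster.

```python
from typing import Any, Dict, Iterable, Iterator


def rows_to_actions(
    rows: Iterable[Dict[str, Any]],
    index_name: str,
    id_field: str,
) -> Iterator[Dict[str, Any]]:
    """Yield one indexing action per row (one row becomes one document)."""
    for row in rows:
        yield {
            "_index": index_name,    # hypothetical index name supplied by the caller
            "_id": row[id_field],    # use the row's own key as the document id
            # every other column becomes a field of the document body
            "_source": {k: v for k, v in row.items() if k != id_field},
        }


# Hypothetical sample rows; a real dataset would have around 200 columns.
sample_rows = [
    {"row_id": 1, "col_a": "red", "col_b": 3.5},
    {"row_id": 2, "col_a": "blue", "col_b": 1.0},
]
actions = list(rows_to_actions(sample_rows, "rows", "row_id"))
```

Because the function is a generator, the rows can be streamed in batches instead of loading all 12 million into memory at once.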

[Solved] Similarity measure in classification algorithm

There are a number of possible measures of similarity. Ideally, you should derive one yourself that takes account of the reason why you are doing this classification, so that a good similarity score corresponds to something that performs well when you use it in practice. Here are a few examples. 1) Cosine similarity. Treat the two sets …
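For reference, cosine similarity can be sketched as below, assuming each item has already been turned into a numeric vector of the same length (the excerpt is cut off before it says how). The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

The score ranges from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction), and it ignores vector magnitude, so two items whose features differ only by a constant scale factor are treated as identical.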