[Solved] Similarity measure in classification algorithm


There are a number of possible measures of similarity. Ideally, you should derive one yourself that takes account of why you are doing this classification, so that a good similarity score corresponds to good performance when the classifier is used in practice. Here are a few examples.

1) Cosine similarity. Treat the two sets of percentages as vectors, scale them to unit vectors, and take the dot product. Because percentages are never negative, this gives you something between 0 and 1, with 1 meaning the proportions are identical. So in your example you would have (10 * 10 + 20 * 30 + 30 * 20 + 40 * 40) / (sqrt(10 * 10 + 20 * 20 + 30 * 30 + 40 * 40) * sqrt(10 * 10 + 30 * 30 + 20 * 20 + 40 * 40)) = 2900 / 3000, or about 0.97.

2) If the expert and the classification system classified the same sperm and you kept track of which was which, you could work out what percentage the classification system got correct. You didn’t do this, but you can work out the maximum possible percentage consistent with the data you have by taking, for each class, the smaller of the two percentages assigned to that class and adding these up. In your example, the classification system could have been correct for at most min(10, 10) + min(20, 30) + min(30, 20) + min(40, 40) = 90 percent. This score will be somewhere between 0 and 100 percent, with 100 percent for a perfect match.

3) If the result of your classification is used as an input to a diagnostic test (e.g. patient will be infertile if…), then instead of comparing the classification output directly, look at how often the results of your classification produce the same test result as the results of the expert classification. See http://en.wikipedia.org/wiki/Receiver_operating_characteristic
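
Measures 1) and 2) are short enough to sketch in code. The Python below is only an illustration, not a reference implementation: the function names cosine_similarity and max_possible_agreement and the variables first and second are made up for this example, and it assumes both lists give the percentages for the same classes in the same order and that neither list is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    if len(a) != len(b):
        raise ValueError("both lists must cover the same classes")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumes neither vector is all zeros, otherwise this divides by zero.
    return dot / (norm_a * norm_b)


def max_possible_agreement(a, b):
    """Upper bound on the percentage classified identically:
    for each class, take the smaller of the two percentages, then add them up."""
    if len(a) != len(b):
        raise ValueError("both lists must cover the same classes")
    return sum(min(x, y) for x, y in zip(a, b))


# The two sets of percentages from the example above (hypothetical variable names).
first = [10, 20, 30, 40]
second = [10, 30, 20, 40]

print(cosine_similarity(first, second))       # 2900 / 3000, about 0.967
print(max_possible_agreement(first, second))  # 90
```

Note that the second function gives an upper bound only. Without per-item records, the true agreement could be anywhere below it.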

