[Solved] Regex stemmer code explanation


It splits a word into two parts: stem and end. There are three cases:

  1. The word ends with ss (or with even more s characters): stem <- word and end <- ""
  2. The word ends with a single s: stem <- word without "s" and end <- "s"
  3. The word does not end with s: stem <- word and end <- ""
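
The three cases can also be written out without a regular expression, which may make the intended behaviour easier to see. This is a minimal Python sketch; the function name `split_plain` is hypothetical and not part of the original code.

```python
def split_plain(word):
    """Split word into (stem, end) following the three cases above."""
    # Case 2: exactly one trailing "s" is moved into the end part.
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1], "s"
    # Case 1 (ends in "ss" or more) and case 3 (no trailing "s"):
    # the whole word is the stem and the end part is empty.
    return word, ""


print(split_plain("glass"))  # ('glass', '')
print(split_plain("cats"))   # ('cat', 's')
print(split_plain("cat"))    # ('cat', '')
```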

This is done by a regular expression that matches the whole word (because it is anchored with ^ and $). The first group (the stem) matches either as much as possible ending in ss (.*ss) or, if that is not possible, as little as possible (.*?). An optional trailing s is then captured as the end part.

Note that in the first case (as much as possible ending in ss) there can never be an additional s for the end part: the greedy .* extends the stem up to the last ss in the word, so no trailing s is left over.
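
To check this behaviour, here is a small sketch using Python's `re` module. The pattern `^(.*ss|.*?)(s)?$` is assembled from the pieces described above and is an assumption about the code in question; the function name `split_word` is hypothetical.

```python
import re

# Assumed pattern: group 1 is the stem, group 2 the optional trailing "s".
PATTERN = re.compile(r"^(.*ss|.*?)(s)?$")


def split_word(word):
    """Return (stem, end) for word using the assumed stemmer pattern."""
    m = PATTERN.match(word)
    if m is None:
        # No match, e.g. an embedded newline, which "." does not match.
        return word, ""
    # group(2) is None when the optional "s" did not take part in the match.
    return m.group(1), m.group(2) or ""


print(split_word("glass"))    # ('glass', '')   case 1
print(split_word("cats"))     # ('cat', 's')    case 2
print(split_word("cat"))      # ('cat', '')     case 3
print(split_word("asss"))     # ('asss', '')    greedy .*ss takes every trailing s
print(split_word("classes"))  # ('classe', 's') an inner "ss" does not trigger case 1
```

For "classes" the first alternative fails because nothing ending in ss reaches the end of the word, so the lazy alternative takes over and the final s is split off.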
