It splits a word into two parts: `stem` and `end`. There are three cases:
- The word ends with `ss` (or even more `s`): `stem <- word` and `end <- ""`
- The word ends with a single `s`: `stem <- word without "s"` and `end <- "s"`
- The word does not end with `s`: `stem <- word` and `end <- ""`
This is done by a regular expression that matches the full word (because of the `^....$` anchors). The first part (i.e. `stem`) consists either of as much as possible ending in `ss` (`.*ss`) or, if that is not possible, of as little as possible (`.*?`). Then an optional trailing `s` is taken as the `end` part.
Note that in the first case (as much as possible ending in `ss`) there can never be an additional `s` for the `end` part, because the greedy `.*ss` has already consumed the word up to its final `ss`.
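To see the three cases in action, here is a minimal Python sketch. The full pattern is not quoted above, so `^(.*ss|.*?)(s)?$` is an assumption rebuilt from the fragments `^`, `.*ss`, `.*?` and `$`; the function name `split_word` is likewise hypothetical.

```python
import re

# Assumed pattern, rebuilt from the fragments quoted in the explanation:
#   group 1 (stem): greedy ".*ss" if possible, otherwise lazy ".*?"
#   group 2 (end):  an optional trailing "s"
PATTERN = re.compile(r"^(.*ss|.*?)(s)?$")


def split_word(word):
    """Return (stem, end) for a single word without line breaks."""
    match = PATTERN.match(word)
    stem = match.group(1)
    # group(2) is None when the optional "s" did not take part in the match
    end = match.group(2) or ""
    return stem, end


print(split_word("glass"))  # ('glass', '')  ends with "ss"
print(split_word("cats"))   # ('cat', 's')   ends with a single "s"
print(split_word("dog"))    # ('dog', '')    does not end with "s"
```

The order of the alternatives matters: the regex engine tries `.*ss` first and only falls back to the lazy `.*?` when the word cannot end in `ss`, which is what keeps `glass` from being split into `glas` and `s`.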