[Solved] How to get unigrams (words) from a list in python?


Aren’t the strings that contain a single word, e.g. “evaporation” and “sunlight”, the unigrams? It seems to me that you want to retain the unigrams, not remove them.

You can do that using a list comprehension:

list1 = ['water vapor','evaporation','carbon dioxide','sunlight','green plants']
unigrams = [word for word in list1 if ' ' not in word]

>>> print(unigrams)
['evaporation', 'sunlight']

This assumes that words are separated by one or more spaces. That may be an oversimplification of what constitutes an n-gram for n > 1, because other whitespace characters can also delimit words, e.g. tabs, newlines, and various Unicode whitespace code points. To handle those, you could use a regular expression:

import re

list1 = ['water vapor','evaporation','carbon dioxide','sunlight','green plants', 'word with\ttab', 'word\nword', 'abcd\refg']
unigram_pattern = re.compile(r'^\S+$')    # raw string; matches strings made up only of non-whitespace chars
unigrams = [word for word in list1 if unigram_pattern.match(word)]

>>> print(unigrams)
['evaporation', 'sunlight']

The pattern ^\S+$ matches a string that consists entirely of one or more non-whitespace characters, from the beginning of the string to its end.
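One caveat with this pattern: `$` also matches just before a trailing newline, so a string such as 'evaporation\n' would still pass. A minimal sketch of a stricter check follows, assuming Python 3.4 or later for `re.fullmatch`. The helper name `is_unigram` is hypothetical and not part of the answer above.

```python
import re

# Hypothetical helper: True only if the whole string is non-whitespace.
def is_unigram(text):
    # fullmatch requires the pattern to cover the entire string,
    # so no ^ or $ anchors are needed.
    return re.fullmatch(r'\S+', text) is not None

anchored = re.compile(r'^\S+$')

trailing_newline = 'evaporation\n'
print(bool(anchored.match(trailing_newline)))  # $ matches before the final newline
print(is_unigram(trailing_newline))            # fullmatch rejects the string
print(is_unigram('evaporation'))
```

Whether a trailing newline should disqualify a string depends on your data; if the list comes from reading lines of a file, stripping each line first may be the simpler fix.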

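If you need to support Unicode whitespace, you can specify the Unicode flag when compiling the pattern (on Python 3 this is already the default for str patterns, so the flag only matters on Python 2):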

list1.extend([u'punctuation\u2008space', u'NO-BREAK\u00a0SPACE'])
unigram_pattern = re.compile(r'^\S+$', re.UNICODE)
unigrams = [word for word in list1 if unigram_pattern.match(word)]

>>> print(unigrams)
['evaporation', 'sunlight']

Now it will also filter out strings that contain Unicode whitespace, e.g. no-break space (U+00A0) and punctuation space (U+2008).
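If a regular expression feels heavier than needed, a similar filter can be sketched with `str.split()`, which on Python 3 splits on any run of Unicode whitespace when called without arguments. This is an alternative technique rather than the approach above, and it treats surrounding whitespace differently: ' sunlight ' counts as a unigram here but would be rejected by the pattern. The name `words` is illustrative.

```python
words = ['water vapor', 'evaporation', 'carbon dioxide', 'sunlight',
         'green plants', 'word with\ttab', 'word\nword',
         'punctuation\u2008space', 'NO-BREAK\u00a0SPACE']

# Keep strings that split into exactly one whitespace-delimited token.
# Assumes Python 3, where str.split() with no arguments splits on
# Unicode whitespace as well as ASCII whitespace.
unigrams = [w for w in words if len(w.split()) == 1]

print(unigrams)
```

Empty strings produce zero tokens, so they are dropped as well.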
