Here are some ideas:
- Use
new_str = text.upper()
so that "beer" and "Beer" are treated as the same word (if you need this). Here text is your input string; avoid naming variables str, list, or set, because that shadows the built-ins.
- Use
words = new_str.split()
to make a list of the words in your string.
- Use
unique_words = set(words)
to get rid of duplicate words if needed.
- Start with an empty word_list and add the words of the first set to it. In the following steps, loop over the entries of each new set and append any that are not yet in word_list:
for word in unique_words:
    if word not in word_list:
        word_list.append(word)
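- Now you can make a multi-hot vector for each sentence: entry i is 1 if word_list[i] is in the sentence's set of words, else 0. Check against the set of words, not the raw string, so that substrings of longer words do not count as matches.
- Don’t forget to pad the earlier multi-hot vectors with additional zeros whenever you add a word to word_list, so that all vectors have the same length.
- Last step: make a matrix from your vectors, one row per sentence.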
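
Putting the steps together, here is a minimal sketch in plain Python (no libraries). The function name build_multi_hot and the sample sentences are made up for illustration. It also takes one shortcut compared to the steps above: it collects the complete word_list first and only then builds the vectors, so no zero-padding is needed afterwards.

```python
def build_multi_hot(sentences):
    """Return (word_list, matrix) for a list of sentence strings."""
    word_list = []    # vocabulary, in order of first appearance
    token_sets = []   # one set of unique words per sentence

    for sentence in sentences:
        # Upper-case so that "beer" and "Beer" count as the same word.
        tokens = set(sentence.upper().split())
        token_sets.append(tokens)
        # Sets are unordered; sorting keeps word_list reproducible.
        for word in sorted(tokens):
            if word not in word_list:
                word_list.append(word)

    # The vocabulary is complete here, so every row has the same length.
    matrix = [
        [1 if word in tokens else 0 for word in word_list]
        for tokens in token_sets
    ]
    return word_list, matrix


# Hypothetical sample input.
sentences = ["I like beer", "Beer is cold", "I like cold beer"]
word_list, matrix = build_multi_hot(sentences)

print(word_list)
for row in matrix:
    print(row)
```

Each row of matrix is the multi-hot vector of one sentence, and each column corresponds to one entry of word_list. If you prefer to build the vectors sentence by sentence, as described above, you have to append zeros to the earlier rows every time word_list grows.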