You’re asking what ML representation you should use for user-classification of chat text.
Bag-of-words and word vectors are the main representations generally used in text processing. However, user classification of chat is not the usual text-processing task: here we look for telltale features indicative of a specific user.
Here are some:
- character length, word length, sentence length of each comment
- typing speed (esp. if you have timestamps in seconds)
- ratio of punctuation (e.g. 17 punctuation symbols in 80 chars = 17/80)
- ratio of capitalization
- ratio of numerals
- ratio of whitespace
- character n-grams (and notice these can pick up e.g. l0ser, f##k, 🙂 )
- use of Unicode (emojis, symbols e.g. stars)
- ratio of specific punctuation (e.g. how many ‘.’, ‘!’, ‘?’, ‘*’, ‘#’ )
- word-counts, esp. anything statistically anomalous
- anything else you can think of that seems predictive for these two users, e.g. number of misspelled words per sentence (may be actual typos, or come from predictive swiping on a cellphone)
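A few of the ratios in the list above can be sketched in plain Python. This is only an illustration, not a definitive feature set: the function names (`stylometric_features`, `char_ngrams`) and the feature keys are my own hypothetical choices, and `string.punctuation` only covers ASCII punctuation, so curly quotes or emojis are counted under the non-ASCII ratio instead.

```python
import string
from collections import Counter


def stylometric_features(comment: str) -> dict:
    """Per-comment style features; names are illustrative assumptions."""
    n_chars = len(comment)
    denom = n_chars or 1  # avoid division by zero on empty comments
    features = {
        "char_len": n_chars,
        "word_len": len(comment.split()),
        # ASCII punctuation only (assumption: good enough for a first pass)
        "punct_ratio": sum(c in string.punctuation for c in comment) / denom,
        "upper_ratio": sum(c.isupper() for c in comment) / denom,
        "digit_ratio": sum(c.isdigit() for c in comment) / denom,
        "space_ratio": sum(c.isspace() for c in comment) / denom,
        # crude proxy for "use of Unicode" (emojis, symbols, ...)
        "non_ascii_ratio": sum(ord(c) > 127 for c in comment) / denom,
    }
    # counts of specific punctuation marks
    for mark in ".!?*#":
        features[f"count_{mark}"] = comment.count(mark)
    return features


def char_ngrams(comment: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams, e.g. 'l0s', '0se', 'ser'."""
    return Counter(comment[i:i + n] for i in range(len(comment) - n + 1))


if __name__ == "__main__":
    print(stylometric_features("OMG!! l0ser 42"))
    print(char_ngrams("l0ser"))
```

You would compute one such feature dict per comment (or per user, averaged over comments) and feed the resulting vectors to whatever classifier you like. Typing speed is left out because it needs timestamps, which this sketch does not assume you have.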