[Solved] What representation of chat text data should I use for user classification? [closed]

Accepted Answer

You’re asking what ML representation you should use for user-classification of chat text.

Bag-of-words and word vectors are the main representations generally used in text processing. However, user classification of chat is not the usual text-processing task: here we look for telltale features indicative of a specific user.
Here are some:

character length, word length, sentence length of each comment
typing speed (esp. if you have timestamps in seconds)
ratio of punctuation (e.g. 17 punctuation symbols in 80 chars = 17/80)
ratio of capitalization
ratio of numerals
ratio of whitespace
character n-grams (and notice these can pick up e.g. l0ser, f##k, 🙂 )
use of Unicode (emojis, symbols e.g. stars)
ratio of specific punctuation (e.g. how many ‘.’, ‘!’, ‘?’, ‘*’, ‘#’ )
word-counts, esp. anything statistically anomalous
anything else you can think of that seems predictive for these two users, e.g. number of misspelled words per sentence (may be actual typos, or come from predictive swiping on a cellphone)
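
As a rough illustration of how several of the features above could be computed per comment, here is a minimal sketch in plain Python. The function names `style_features` and `char_ngrams`, the feature keys, and the sample messages are all made up for this example. It assumes ratios are taken over the total character count, and that "punctuation" means ASCII punctuation only (`string.punctuation`), so typographic quotes and the like would need extra handling. Typing speed is left out because it needs timestamps.

```python
import string
from collections import Counter


def style_features(message: str) -> dict:
    """Per-comment style features; every ratio is over total characters."""
    n_chars = len(message)
    words = message.split()
    denom = max(n_chars, 1)  # avoid division by zero on empty comments
    return {
        "char_len": n_chars,
        "word_count": len(words),
        "mean_word_len": sum(len(w) for w in words) / max(len(words), 1),
        # Assumption: ASCII punctuation only (string.punctuation).
        "punct_ratio": sum(c in string.punctuation for c in message) / denom,
        "upper_ratio": sum(c.isupper() for c in message) / denom,
        "digit_ratio": sum(c.isdigit() for c in message) / denom,
        "space_ratio": sum(c.isspace() for c in message) / denom,
        # Crude proxy for emoji/symbol use: any code point outside ASCII.
        "non_ascii_ratio": sum(ord(c) > 127 for c in message) / denom,
        "exclaim_count": message.count("!"),
        "question_count": message.count("?"),
    }


def char_ngrams(message: str, n: int = 3) -> Counter:
    """Counts of overlapping character n-grams, case preserved."""
    return Counter(message[i:i + n] for i in range(len(message) - n + 1))


feats = style_features("OMG l0ser!!! u r 2 slow")
grams = char_ngrams("l0ser l0ser", n=3)
```

The resulting dictionaries can be concatenated into one feature vector per comment (or averaged per user) and fed to any standard classifier. Character n-grams are kept case-sensitive and unnormalized on purpose, since spellings like `l0ser` are exactly the kind of signal this task relies on.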