[Solved] What representation of chat text data should I use for user classification? [closed]


You’re asking what ML representation you should use for user-classification of chat text.

Bag-of-words and word vectors are the main representations used in text processing. However, classifying chat by user is not the usual text-processing task: here we look for telltale features indicative of a specific user.
Here are some:

  • character length, word length, sentence length of each comment
  • typing speed (esp. if you have timestamps in seconds)
  • ratio of punctuation (e.g. 17 punctuation symbols in 80 chars = 17/80)
  • ratio of capitalization
  • ratio of numerals
  • ratio of whitespace
  • character n-grams (and notice these can pick up e.g. l0ser, f##k, 🙂 )
  • use of Unicode (emojis, symbols e.g. stars)
  • ratio of specific punctuation (e.g. how many ‘.’, ‘!’, ‘?’, ‘*’, ‘#’ )
  • word-counts, esp. anything statistically anomalous
  • anything else you can think of that seems predictive for these two users, e.g. number of misspelled words per sentence (may be actual typos, or come from predictive swiping on a cellphone)
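
As a rough illustration of how several of the features above could be computed per comment, here is a minimal sketch using only the Python standard library. The function names (`style_features`, `char_ngrams`) and the feature keys are made up for this example, and "punctuation" here means ASCII punctuation from `string.punctuation`, which is an assumption you may want to widen. Timing-based features such as typing speed are left out because they depend on how your timestamps are stored.

```python
import string
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams; case and symbols are kept on purpose."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def style_features(comment):
    """Return a dict of simple per-comment length and ratio features."""
    n_chars = len(comment)
    denom = max(n_chars, 1)  # avoid division by zero on empty comments
    words = comment.split()
    features = {
        "char_len": n_chars,
        "word_count": len(words),
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "punct_ratio": sum(c in string.punctuation for c in comment) / denom,
        "upper_ratio": sum(c.isupper() for c in comment) / denom,
        "digit_ratio": sum(c.isdigit() for c in comment) / denom,
        "space_ratio": sum(c.isspace() for c in comment) / denom,
        # crude proxy for emoji/symbol use: anything outside ASCII
        "non_ascii_ratio": sum(ord(c) > 127 for c in comment) / denom,
    }
    # ratios for a few specific punctuation marks
    for mark in ".!?*#":
        features["ratio_" + mark] = comment.count(mark) / denom
    return features


if __name__ == "__main__":
    print(style_features("Hi!! 42"))
    print(char_ngrams("l0ser", 3))
```

The resulting dicts can be turned into a numeric matrix (one row per comment) and combined with the n-gram counts before being fed to whatever classifier you choose.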
