top 10 most used words

B-Weyl commented 7 years ago

Get a user's top 10 most used words, using nltk's stopwords to filter out common words.

x0rz commented 7 years ago

Hi! Thanks for the PR 👍

I've tested your changes and here are a few thoughts:

The nltk package installation is kind of a pain in the ass (you have to manually open a Python shell, import nltk, call nltk.download(), and choose the right files to download) 😒
The results are not 100% relevant (IMHO):

[+] Top 10 most used words
- rt          286 (3%)
- #privacy     47 (0%)
- it's         38 (0%)
- like         34 (0%)
- ;)           30 (0%)
- don't        27 (0%)
- use          27 (0%)
- security     26 (0%)
- that's       26 (0%)
- using        24 (0%)

Last thing, I get this warning (python2):

./tweets_analyzer.py:201: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if word not in stopwords.words('english'):
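
For context, Python 2 emits this warning when a byte string containing non-ASCII bytes is compared with a unicode string. A minimal sketch of one way to avoid it, assuming the words taken from tweets are the byte strings here: decode them to text before the membership test. `STOPWORDS`, `to_text` and `is_stopword` are hypothetical names, and the small set only stands in for `stopwords.words('english')`.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for stopwords.words('english'); the real list comes from nltk.
# Building the set once also avoids re-reading the list for every word.
STOPWORDS = frozenset([u"the", u"a", u"is", u"it", u"to"])


def to_text(word, encoding="utf-8"):
    """Return `word` as a text string, decoding byte strings if needed."""
    if isinstance(word, bytes):
        return word.decode(encoding, "replace")
    return word


def is_stopword(word):
    """Compare text with text so no implicit ASCII decoding is attempted."""
    return to_text(word).lower() in STOPWORDS
```

The check on line 201 could then read something like `if not is_stopword(word):`, though whether that matches the rest of the script is untested.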

I would suggest:

[ ] Excluding RTs from the word analysis (they're not the author's original words, so leaving them out makes more sense to me)
[ ] Adding a --words option of some kind to enable the nltk import and the words analysis
[ ] Maybe display a top15/top20? Some kind of ASCII word cloud would be perfect but I can't find any library doing this 😁
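
A rough sketch of how the first and last suggestions could fit together with an opt-in flag, using only the standard library. Everything here is an assumption for illustration: the `--words` and `--top` flag names, the `top_words` and `build_parser` helpers, the tiny `STOPWORDS` set (a stand-in for nltk's list), and the rule that a retweet is any tweet whose first token is "rt".

```python
import argparse
from collections import Counter

# Hypothetical stand-in for nltk's stopwords.words('english').
STOPWORDS = frozenset(["the", "a", "is", "it", "to", "and", "of", "i"])


def top_words(tweets, limit=10):
    """Count words across tweets, skipping retweets and stopwords."""
    counts = Counter()
    for text in tweets:
        words = text.lower().split()
        # Assumption: a retweet is a tweet whose first token is "rt".
        if words and words[0] == "rt":
            continue
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(limit)


def build_parser():
    """Build a parser with illustrative, hypothetical flag names."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--words", action="store_true",
                        help="enable the most-used-words analysis")
    parser.add_argument("--top", type=int, default=10,
                        help="how many words to display")
    return parser
```

With this shape, the nltk import could happen only when `--words` is set, so users who skip the analysis would not need the corpus installed.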

I won't have much time next week (on vacation). Please feel free to update your PR with additional code 👍

-- x0rz

JusticeRage commented 7 years ago

Based on the results ("use" and "using" both present), a stemming algorithm would also need to be applied for the results to be meaningful.
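
To illustrate the idea, here is a deliberately crude suffix-stripping sketch. It is not the Porter algorithm or any real stemmer, and `crude_stem`, `count_stems` and the suffix list are hypothetical. A real implementation would more likely use an existing stemmer such as nltk's `PorterStemmer`.

```python
from collections import Counter

# Crude illustrative suffix list, checked in order; not the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word


def count_stems(words):
    """Count words after collapsing them to their crude stems."""
    return Counter(crude_stem(w) for w in words)
```

With this toy rule set, "use" and "using" both reduce to "us", so the 27 and 24 occurrences above would be counted as a single entry of 51.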

x0rz commented 6 years ago

Conflicting PR at this time

x0rz / tweets_analyzer

top 10 most used words #16