weallwegot / omscs_chatbot

Issues Repository for Georgia Tech OMSCS Chat bot
MIT License

implement stop words & count most commonly occurring words #40

Closed weallwegot closed 6 years ago

weallwegot commented 6 years ago

From @weAllWeGot on December 6, 2016 19:19

this can then be the basis of the categories. this is probably a new branch, but an interesting one. i wouldn't get rid of the predefined categories, but i would add to them more smartly/automatically with what the analysis comes up with.

_Copied from original issue: weAllWeGot/kbai_chatbot3#20_

weallwegot commented 6 years ago

http://stackoverflow.com/questions/9953619/technique-to-remove-common-wordsand-their-plural-versions-from-a-string

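from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

# a set gives fast membership checks
s = set(stopwords.words('english'))

txt = "a long string of text about him and her"
# keep only the words that are not stop words
print([w for w in txt.split() if w not in s])

weallwegot commented 6 years ago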

tried this. lmao, it took 25 minutes to get the count of all of the words in the reviews... will have to look into this again. i should have sorted the output list... grr
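One possible way to get the counts quickly and already sorted is `collections.Counter` from the standard library. This is only a sketch, not what the bot currently does: `count_words`, `STOP_WORDS`, and the sample reviews are made-up names and data for illustration, and the tiny stop word set stands in for the NLTK one above.

```python
from collections import Counter

# Hypothetical stand-in for set(stopwords.words('english')) from NLTK.
STOP_WORDS = {"a", "of", "and", "the", "is", "was"}

def count_words(reviews, stop_words=STOP_WORDS):
    """Count non-stop words across an iterable of review strings."""
    counts = Counter()
    for review in reviews:
        # Lowercase so "Hard" and "hard" are counted together.
        counts.update(w for w in review.lower().split() if w not in stop_words)
    return counts

# Made-up sample data, not real reviews.
reviews = [
    "the class was hard and the projects were long",
    "hard class and a long final project",
]
counts = count_words(reviews)
top = counts.most_common(3)  # list of (word, count), highest count first
print(top)
```

`most_common()` returns the words sorted by count, so the output does not need a separate sort. Each word is also looked up in a set and a dict-like counter, so the work grows roughly linearly with the number of words.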

weallwegot commented 6 years ago

- Add extra stop words
- Add cleaning data function
- In cleaning data function:
  - Make list into set
  - Then cut it out of set if the first character is punctuation
  - Then cut it out of set if last character is punctuation
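The cleaning steps above could be sketched roughly like this. `clean_words` and `extra_stop_words` are hypothetical names, and the sketch assumes "cut it out of set" means dropping the whole word rather than stripping the punctuation off it.

```python
import string

def clean_words(words, extra_stop_words=()):
    """Return a set of words without extra stop words or punctuation-edged tokens."""
    unique = set(words)                # make list into set
    unique -= set(extra_stop_words)    # drop the extra stop words
    return {
        w for w in unique
        # Assumption: a word that starts or ends with punctuation is dropped entirely.
        if w and w[0] not in string.punctuation and w[-1] not in string.punctuation
    }

# Made-up sample input for illustration.
cleaned = clean_words(
    ["course", "hard,", "(optional", "course", "lol"],
    extra_stop_words=["lol"],
)
print(cleaned)
```

If dropping turns out to lose too many real words (e.g. "hard," at the end of a clause), `w.strip(string.punctuation)` would keep the word and only trim the punctuation.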

Write the word counts out to a file. The file can just be stored in the local directory, since this usually takes a couple of minutes to run. This will only reduce running time by a little, though. If it's under 3 seconds total, then we will add it.
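A minimal sketch of writing the counts out and reading them back, assuming JSON is an acceptable format (the thread does not say which format to use). `save_counts`, `load_counts`, and the `word_counts.json` file name are hypothetical; the demo writes to a temporary directory only so it does not leave files behind.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

def save_counts(counts, path):
    """Write word counts to a JSON file, highest count first."""
    ordered = dict(Counter(counts).most_common())
    Path(path).write_text(json.dumps(ordered, indent=2), encoding="utf-8")

def load_counts(path):
    """Read word counts back into a Counter."""
    return Counter(json.loads(Path(path).read_text(encoding="utf-8")))

# Demo with made-up counts; in the repo the path would be in the local directory.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "word_counts.json"  # hypothetical file name
    save_counts(Counter({"class": 2, "hard": 3}), cache)
    restored = load_counts(cache)

print(restored.most_common())
```

On later runs the bot could check whether the file exists and call `load_counts` instead of recounting every review.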

weallwegot commented 6 years ago

maybe use the words file in the next iteration, it's in the repo