Greetings,

I am working on the week 2 homework (part 1) notebook and have a question about `bow_vocabulary`: the length of the `bow_vocabulary` I created is different from the length of the set of all tokens in the training set.
See the screenshot below:
The way I created `bow_vocabulary` is as follows:
Basically, following the idea from week 1, I split the text on " " (space), which recovers the tokens generated by `TweetTokenizer`.
Then I count the occurrences of each token and keep only the top `k` words in the vocabulary.
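In code, what I did is roughly this (a sketch; `texts` and `k` are placeholder names for the tokenized training texts and the vocabulary size):

```python
from collections import Counter

# texts: the training texts, already tokenized by TweetTokenizer and
# joined with spaces, so splitting on " " recovers the tokens.
token_counts = Counter()
for text in texts:
    token_counts.update(text.split(" "))

# Keep only the k most frequent tokens as the vocabulary.
bow_vocabulary = [token for token, _ in token_counts.most_common(k)]
```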
When I put all tokens into a `set`, some tokens (`str`) are treated as the same one, so the length decreases.
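Concretely, this is the comparison where I see the mismatch (again a sketch, with `texts` as above):

```python
# Collect the distinct tokens across the whole training set.
all_tokens = set()
for text in texts:
    all_tokens.update(text.split(" "))

# In my run these two lengths do not match.
print(len(bow_vocabulary), len(all_tokens))
```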
I am wondering:
Is my way of creating `bow_vocabulary` correct?
My understanding is that we should keep and use the tokens from the tokenizer as-is, so that the vocabulary is created as: tokens -> vocabulary.
However, I also understand that some strings might be meaningless (consisting of symbols only, for example) and could be merged together, like the `set` used in the notebook. In that case the vocabulary would be created as: tokens -> pre-processing -> vocabulary.
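For example, the kind of pre-processing step I have in mind might look something like this (the filtering rule here is just my guess, not what the notebook actually does):

```python
import re

# Hypothetical pre-processing: drop symbol-only tokens, i.e. tokens
# that contain no letters or digits, before building the vocabulary.
def preprocess(tokens):
    return [t for t in tokens if re.search(r"[A-Za-z0-9]", t)]
```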
Could you shed some light on this? That would be super helpful!
Thanks for taking the time to look at this, and I'm looking forward to your reply!