yandexdataschool / nlp_course

YSDA course in Natural Language Processing
https://lena-voita.github.io/nlp_course.html
MIT License

A question of bow_vocabulary in w2_homework_part1 #113

Closed ZequnZ closed 1 week ago

ZequnZ commented 1 year ago

Greetings,

I am working on the week 2 homework (part 1) notebook and have a question about bow_vocabulary: the length of the bow_vocabulary I created differs from the length of the set of all tokens in the training set. See the screenshot below:

[screenshot: vocabulary length vs. token-set length]

I created bow_vocabulary as follows:
Following the idea from week 1, I split each text on " " (space), which recovers the tokens produced by TweetTokenizer.
Then I count the occurrences of each token and keep only the top k tokens in the vocabulary.

[screenshot: vocabulary-building code]
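The construction described above can be sketched roughly like this (a minimal sketch of my understanding; the helper name and the toy data are mine, and I assume each text is already a space-joined string of TweetTokenizer tokens):

```python
from collections import Counter

def build_bow_vocabulary(texts, k):
    """Count tokens across all texts and keep the k most frequent ones.

    texts: iterable of strings whose tokens are separated by single spaces
    (as produced upstream by TweetTokenizer — an assumption here).
    """
    counts = Counter()
    for text in texts:
        counts.update(text.split(" "))
    # most_common(k) returns the k highest-count (token, count) pairs
    return [token for token, _ in counts.most_common(k)]

# toy example (hypothetical data)
texts = ["good movie", "not a good movie", "a good one"]
vocab = build_bow_vocabulary(texts, 3)
```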

When all tokens are put into a set, some distinct token strings appear to be treated as the same one, so the set's length is smaller.
I am wondering:

  1. Is my way of creating bow_vocabulary correct?
  2. My understanding is that we should keep and use the tokens from the tokenizer, so that the vocabulary is built as: tokens -> vocabulary.
    However, I also understand that some strings may be meaningless (e.g. symbols only) and could be merged together, as in the set used in the notebook, so that the vocabulary is built as: tokens -> pre-processing -> vocabulary.
    Could you shed some light on this? That would be super helpful!
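To make the merging effect in point 2 concrete, here is a toy illustration (the tokens are hypothetical, and I am only guessing that some normalization such as lowercasing is what shrinks the set):

```python
# hypothetical tokens that are distinct as raw strings
tokens = ["Good", "good", "GOOD", "movie", ":)"]

# raw tokens: all casing variants count separately
raw_size = len(set(tokens))                        # 5

# after a pre-processing step (lowercasing), variants merge
normalized_size = len(set(t.lower() for t in tokens))  # 3
```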

Thanks for taking the time to look at this; I look forward to your reply!

poedator commented 1 week ago

Thank you for your interest in the course. We do not provide public feedback on homework solutions here, since that would create unwanted spoilers.