yandexdataschool / nlp_course

YSDA course in Natural Language Processing
https://lena-voita.github.io/nlp_course.html
MIT License

A question of bow_vocabulary in w2_homework_part1 #113

Closed ZequnZ closed 1 week ago

ZequnZ commented 1 year ago

Greetings,

I am working on the week 2 homework (part 1) notebook and have a question about bow_vocabulary: the length of the bow_vocabulary I created differs from the length of the set of all tokens in the training set. See the screenshot below:

[screenshot: vocabulary length vs. token-set length]

I created bow_vocabulary as follows:
Following the idea from week 1, I split each text on " " (space), which recovers the tokens produced by TweetTokenizer.
Then I count the occurrences of each token and keep only the top k tokens in the vocabulary.

[screenshot: vocabulary-building code]
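The construction described above can be sketched roughly like this (a minimal sketch of my understanding; the helper name and the toy data are mine, and I assume each text is already a space-joined string of TweetTokenizer tokens):

```python
from collections import Counter

def build_bow_vocabulary(texts, k):
    """Count tokens across all texts and keep the k most frequent ones.

    texts: iterable of strings whose tokens are separated by single spaces
    (as produced upstream by TweetTokenizer — an assumption here).
    """
    counts = Counter()
    for text in texts:
        counts.update(text.split(" "))
    # most_common(k) returns the k highest-count (token, count) pairs
    return [token for token, _ in counts.most_common(k)]

# toy example (hypothetical data)
texts = ["good movie", "not a good movie", "a good one"]
vocab = build_bow_vocabulary(texts, 3)
```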

When all tokens are put into a set, some distinct token strings appear to be treated as the same one, so the set's length is smaller.
I am wondering:

  1. Is my way of creating bow_vocabulary correct?
  2. My understanding is that we should keep and use the tokens from the tokenizer, so that the vocabulary is built as: tokens -> vocabulary.
    However, I also understand that some strings may be meaningless (e.g. symbols only) and could be merged together, as in the set used in the notebook, so that the vocabulary is built as: tokens -> pre-processing -> vocabulary.
    Could you shed some light on this? That would be super helpful!
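To make the merging effect in point 2 concrete, here is a toy illustration (the tokens are hypothetical, and I am only guessing that some normalization such as lowercasing is what shrinks the set):

```python
# hypothetical tokens that are distinct as raw strings
tokens = ["Good", "good", "GOOD", "movie", ":)"]

# raw tokens: all casing variants count separately
raw_size = len(set(tokens))                        # 5

# after a pre-processing step (lowercasing), variants merge
normalized_size = len(set(t.lower() for t in tokens))  # 3
```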

Thanks for taking the time to look at this; I look forward to your reply!

poedator commented 1 week ago

Thank you for your interest in the course. We do not provide public feedback on homework solutions here, since that would create unwanted spoilers.