Open · KeepingItClassy opened this issue 6 years ago
Thanks for your note.
I've also spotted this before, but the preprocessing simply follows Karpathy's steps, in which he just applied the nltk tokenizer.
Hope this helps.
But it's actually skewing the evaluation results because a different tokenizer creates a different vocabulary. For accurate results, both the training and the validation/test captions need to be tokenized in the same way.
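To make the effect concrete, here is a minimal sketch (not code from the repo; the caption string and the hand-written PTB-style token list are made up for illustration) of how a vocabulary built with nltk.tokenize.word_tokenize can end up mapping evaluation-side tokens to <unk>:

```python
import nltk
# nltk.download('punkt')  # uncomment on first run if the punkt models are missing

caption = "A man rides a bicycle down a tree-lined street."

# Training-side tokenization (nltk.tokenize.word_tokenize, as in build_vocab.py):
# hyphenated words such as "tree-lined" are kept as a single token.
train_tokens = nltk.tokenize.word_tokenize(caption.lower())
vocab = set(train_tokens)  # stand-in for the vocabulary built over all training captions

# Evaluation-side tokens, written out by hand to mimic the PTB-style split described
# in the report below (the real PTBTokenizer needs Java and the Stanford CoreNLP jar):
eval_tokens = ["a", "man", "rides", "a", "bicycle", "down",
               "a", "tree", "lined", "street"]

# The pieces of the split hyphenated word are not in the nltk-built vocabulary,
# so they end up as <unk>:
print([tok if tok in vocab else "<unk>" for tok in eval_tokens])
# ['a', 'man', 'rides', 'a', 'bicycle', 'down', 'a', '<unk>', '<unk>', 'street']
```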
Yeah, you are totally right. That's what most recent papers do as well, since they follow Karpathy's standard preprocessing.
@KeepingItClassy Could you share your test code, i.e. code that can show the images and the generated captions on screen? Thanks very much.
Hi, first of all thank you for the great repo! I noticed that you are using different tokenizers for the training data (nltk.tokenize.word_tokenize in build_vocab.py) and for the validation data (PTBTokenizer from coco/pycocoevalcap/tokenizer/ptbtokenizer.py). The first one doesn't split on punctuation, while the second one does, which leads to many <unk> tokens for captions that have hyphenated words. It would be great if you could have both training and captioning use the same tokenizer. Thanks!
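For concreteness, one way to make them consistent would be to run the evaluation-side PTBTokenizer over the training captions before building the vocabulary. The sketch below is my own, not the repo's code: the import path may need adjusting (the tokenizer lives under coco/pycocoevalcap/ here), PTBTokenizer shells out to the Stanford CoreNLP jar so Java must be available, and the tokenize_captions helper and the captions dict layout are just illustrative.

```python
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer  # adjust path to the repo layout

def tokenize_captions(captions):
    """captions: {image_id: [raw caption string, ...]}
    returns:     {image_id: [[token, ...], ...]} using the evaluation tokenizer."""
    tokenizer = PTBTokenizer()
    # PTBTokenizer.tokenize expects {id: [{'caption': str}, ...]} and returns
    # {id: ['token token ...', ...]}, lower-cased and with punctuation removed.
    formatted = {img_id: [{"caption": c} for c in caps]
                 for img_id, caps in captions.items()}
    tokenized = tokenizer.tokenize(formatted)
    return {img_id: [c.split() for c in caps]
            for img_id, caps in tokenized.items()}
```

Building the vocabulary in build_vocab.py from these tokens (or, going the other way, tokenizing the evaluation references with nltk and skipping PTBTokenizer) would keep training and evaluation on a single tokenization.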