Open · KeepingItClassy opened this issue 6 years ago
Thanks for your note.
I've also spotted this before, but the preprocessing simply follows Karpathy's steps, in which he just applied the nltk tokenizer.
Hope this helps.
But it's actually skewing the evaluation results because a different tokenizer creates a different vocabulary. For accurate results, both the training and the validation/test captions need to be tokenized in the same way.
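To make the effect concrete, here is a minimal sketch (not code from the repo; the caption string and the hand-written PTB-style token list are made up for illustration) of how a vocabulary built with nltk.tokenize.word_tokenize can end up mapping evaluation-side tokens to <unk>:

```python
import nltk
# nltk.download('punkt')  # uncomment on first run if the punkt models are missing

caption = "A man rides a bicycle down a tree-lined street."

# Training-side tokenization (nltk.tokenize.word_tokenize, as in build_vocab.py):
# hyphenated words such as "tree-lined" are kept as a single token.
train_tokens = nltk.tokenize.word_tokenize(caption.lower())
vocab = set(train_tokens)  # stand-in for the vocabulary built over all training captions

# Evaluation-side tokens, written out by hand to mimic the PTB-style split described
# in the report below (the real PTBTokenizer needs Java and the Stanford CoreNLP jar):
eval_tokens = ["a", "man", "rides", "a", "bicycle", "down",
               "a", "tree", "lined", "street"]

# The pieces of the split hyphenated word are not in the nltk-built vocabulary,
# so they end up as <unk>:
print([tok if tok in vocab else "<unk>" for tok in eval_tokens])
# ['a', 'man', 'rides', 'a', 'bicycle', 'down', 'a', '<unk>', '<unk>', 'street']
```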
Yeah, you are totally right. That's what most recent papers do as well, since they follow Karpathy's standard preprocessing.
@KeepingItClassy Could you share your test code, i.e. code that can show the images and the generated captions on screen? Thanks very much.
Hi, first of all thank you for the great repo! I noticed that you are using different tokenizers for the training data (nltk.tokenize.word_tokenize in build_vocab.py) and for the validation data (PTBTokenizer from coco/pycocoevalcap/tokenizer/ptbtokenizer.py). The first one doesn't split on punctuation, while the second one does, which leads to many <unk> tokens for captions that have hyphenated words. It would be great if you could have both training and captioning use the same tokenizer. Thanks!
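For concreteness, one way to make them consistent would be to run the evaluation-side PTBTokenizer over the training captions before building the vocabulary. The sketch below is my own, not the repo's code: the import path may need adjusting (the tokenizer lives under coco/pycocoevalcap/ here), PTBTokenizer shells out to the Stanford CoreNLP jar so Java must be available, and the tokenize_captions helper and the captions dict layout are just illustrative.

```python
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer  # adjust path to the repo layout

def tokenize_captions(captions):
    """captions: {image_id: [raw caption string, ...]}
    returns:     {image_id: [[token, ...], ...]} using the evaluation tokenizer."""
    tokenizer = PTBTokenizer()
    # PTBTokenizer.tokenize expects {id: [{'caption': str}, ...]} and returns
    # {id: ['token token ...', ...]}, lower-cased and with punctuation removed.
    formatted = {img_id: [{"caption": c} for c in caps]
                 for img_id, caps in captions.items()}
    tokenized = tokenizer.tokenize(formatted)
    return {img_id: [c.split() for c in caps]
            for img_id, caps in tokenized.items()}
```

Building the vocabulary in build_vocab.py from these tokens (or, going the other way, tokenizing the evaluation references with nltk and skipping PTBTokenizer) would keep training and evaluation on a single tokenization.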