I wonder whether we could improve the final score by encoding each word and masking numeric entries before classification, rather than classifying at the character level.

zzzDavid / ICDAR-2019-SROIE

ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction

MIT License


Thank you @shreeshiv! Constructing a dictionary is indeed a valid approach and, I believe, common practice in NLP. And yes, there is a solid chance that it would improve performance. However, it also has disadvantages: we would not be able to recognize any word outside the constructed dictionary, and it shifts more of the heavy lifting onto the encoding step.

In our case, we thought it very likely that non-dictionary words, such as abbreviations, shop names, or menu entries, would appear in the test set. Characters, on the other hand, are easy to encode, can handle new words, and have yielded satisfactory results.
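To make the trade-off concrete, here is a minimal sketch of the two encodings. It is not the code used in this repository: the `<num>` and `<unk>` tokens, the numeric regex, the tiny training lines, and all function names are assumptions chosen for illustration.

```python
import re
import string

NUM_TOKEN = "<num>"  # hypothetical placeholder for masked numeric entries
UNK_TOKEN = "<unk>"  # hypothetical placeholder for out-of-dictionary words


def mask_numbers(token):
    """Map numeric-looking tokens (12.50, 01/02/2019) to NUM_TOKEN; lowercase the rest."""
    if re.fullmatch(r"\d+(?:[.,:/-]\d+)*", token):
        return NUM_TOKEN
    return token.lower()


def build_word_vocab(lines):
    """Build a word dictionary from whitespace-split training lines."""
    vocab = {UNK_TOKEN: 0, NUM_TOKEN: 1}
    for line in lines:
        for tok in line.split():
            vocab.setdefault(mask_numbers(tok), len(vocab))
    return vocab


def encode_words(line, vocab):
    """Word-level encoding: words missing from the dictionary collapse to UNK_TOKEN."""
    return [vocab.get(mask_numbers(tok), vocab[UNK_TOKEN]) for tok in line.split()]


# Character-level encoding: a small fixed alphabet, with index 0 reserved
# for characters outside it.
CHAR_VOCAB = {ch: i + 1 for i, ch in enumerate(string.printable)}


def encode_chars(line):
    """Character-level encoding: every printable ASCII character gets its own index."""
    return [CHAR_VOCAB.get(ch, 0) for ch in line]


train_lines = ["TOTAL 12.50", "CASH 20.00"]
vocab = build_word_vocab(train_lines)

known = encode_words("TOTAL 9.90", vocab)            # [2, 1]
unseen = encode_words("KOPITIAM TOTAL 9.90", vocab)  # [0, 2, 1]: shop name becomes <unk>
chars = encode_chars("KOPITIAM")                     # eight non-zero indices
```

In this sketch, masking numbers keeps the word dictionary small, but the unseen shop name loses its identity entirely, whereas the character encoding still represents it letter by letter.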

However, I do encourage you to explore a word-based approach if you would like!


I wonder whether we could improve the final score by encoding each word and masking numeric entries before classification, rather than classifying at the character level. #11