zzzDavid / ICDAR-2019-SROIE

ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction
MIT License

Could we improve the final score by encoding each word and masking numeric entries before classification, rather than classifying at the character level? #11

Open shreeshiv opened 4 years ago

shreeshiv commented 4 years ago

I wonder, could we improve the final score for task 3 if we encode each word and mask numeric entries before classification, rather than classifying at the character level?

patrick22414 commented 4 years ago

Thank you @shreeshiv! Constructing a dictionary is indeed a valid approach and, I believe, a common practice in NLP. And yes, there is a solid chance that it would improve performance. However, it also comes with some disadvantages: we would not be able to recognize a word outside the constructed dictionary, and it shifts more of the heavy lifting onto the encoding step.

In our case, we thought it very likely that non-dictionary words, such as abbreviations, shop names, or menu entries, would appear in the test set. Characters, on the other hand, are easy to encode, can deal with new words, and have yielded satisfactory results.
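To make the trade-off concrete, here is a minimal sketch of the two encodings. It is not the code in this repository: the `<NUM>` and `<UNK>` placeholders, the numeric pattern, the helper names, and the sample receipt lines are all assumptions made up for illustration.

```python
import re
import string

NUM_TOKEN = "<NUM>"  # hypothetical placeholder for masked numeric entries
UNK_TOKEN = "<UNK>"  # hypothetical placeholder for out-of-dictionary words

# Assumed definition of "numeric entry": digits plus common separators.
NUMERIC = re.compile(r"[\d.,:/\-]*\d[\d.,:/\-]*")


def mask_numeric(word):
    """Replace prices, dates, and similar tokens with a single placeholder."""
    return NUM_TOKEN if NUMERIC.fullmatch(word) else word


def build_word_vocab(lines):
    """Build a word -> id dictionary from training lines."""
    vocab = {UNK_TOKEN: 0, NUM_TOKEN: 1}
    for line in lines:
        for word in line.upper().split():
            vocab.setdefault(mask_numeric(word), len(vocab))
    return vocab


def encode_words(line, vocab):
    """Word-level encoding: unseen words all collapse to the <UNK> id."""
    return [vocab.get(mask_numeric(w), vocab[UNK_TOKEN]) for w in line.upper().split()]


# Character-level encoding: a small fixed alphabet, id 0 reserved for unknown characters.
CHAR_VOCAB = {c: i + 1 for i, c in enumerate(string.printable)}


def encode_chars(line):
    return [CHAR_VOCAB.get(c, 0) for c in line]


train = ["TOTAL 12.50", "CASH 20.00", "DATE 25/12/2018"]  # made-up sample lines
vocab = build_word_vocab(train)

test_line = "TESCO EXPRESS TOTAL 9.90"  # shop name never seen in training
word_ids = encode_words(test_line, vocab)
char_ids = encode_chars(test_line)
```

With these sample lines, the two shop-name words in `test_line` both map to the `<UNK>` id in `word_ids`, so the word-level model cannot tell them apart, while every character in `char_ids` still gets a known id.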

However, I do encourage you to explore a word-based approach if you would like!