karimcossentini opened this issue 3 years ago
Hi ducviet00, is the vocabulary a set of characters or words? If it is a set of characters, does it mean we just list "abcd...zABCD...Z" plus numbers and special chars in keys.txt? Thanks!
Hi babyhockey, it's a good question, because I have a problem with this vocab file. Did you find the answer?
Yes, it's a list of English characters plus numbers and special characters.
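Based on that answer, a minimal sketch of generating keys.txt for an English dataset might look like the following. The exact character set (and whether to include a space) is an assumption; adjust it to match your data:

```python
import string

# Assumed character set: lowercase + uppercase letters, digits,
# and common special characters from Python's string module.
chars = (
    string.ascii_lowercase
    + string.ascii_uppercase
    + string.digits
    + string.punctuation
)

# Write one character per vocabulary entry into keys.txt.
with open("keys.txt", "w", encoding="utf-8") as f:
    f.write(chars)

print(len(chars))  # 94 characters in this sketch
```

If your documents contain characters outside this set (accented letters, currency symbols), they must be added to keys.txt as well, or they cannot be encoded.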
The default vocab file (keys.txt) in this repo is in Chinese. I translated it and noticed that it contains not only characters but also words, sentences, etc., so I did not understand what this file actually is.
As far as I can tell, the default file contains a list of Chinese characters.
For a custom dataset, how do I write the keys.txt file?
You can iterate through your whole training dataset and collect the set of characters, including the space. Characters are encoded by their index in keys.txt, and the vocab size of the Embedding layer equals the length of keys. I think this approach does not compare well with newer methods like BPE encoding.
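The procedure above can be sketched as follows. The sample transcriptions and the `build_keys` helper are hypothetical; substitute however your dataset stores its ground-truth text:

```python
def build_keys(texts):
    """Collect the set of characters seen in the training
    transcriptions (space included) into a sorted string."""
    charset = set()
    for text in texts:
        charset.update(text)  # adds every character, space too
    return "".join(sorted(charset))

# Hypothetical training transcriptions for illustration.
texts = ["Total: $12.50", "Date 2021-05-26"]
keys = build_keys(texts)

with open("keys.txt", "w", encoding="utf-8") as f:
    f.write(keys)

# Each character is encoded by its index in keys, so the
# Embedding layer's vocab size equals len(keys).
char2idx = {c: i for i, c in enumerate(keys)}
```

Sorting the character set makes keys.txt deterministic across runs, which matters because the character-to-index mapping must stay stable between training and inference.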