karimcossentini opened this issue 3 years ago
Hi ducviet00, is the vocabulary a set of characters or words? If it is a set of characters, does it mean we just list "abcd...zABCD...Z" plus numbers and special chars in keys.txt? Thanks!
Hi babyhockey, it's a good question, because I have a problem with this vocab file. Did you find the answer?
Yes, it's a list of English characters plus numbers and special characters.
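Based on that answer, a minimal sketch of generating keys.txt for an English dataset might look like the following. The exact character set (and whether to include a space) is an assumption; adjust it to match your data:

```python
import string

# Assumed character set: lowercase + uppercase letters, digits,
# and common special characters from Python's string module.
chars = (
    string.ascii_lowercase
    + string.ascii_uppercase
    + string.digits
    + string.punctuation
)

# Write one character per vocabulary entry into keys.txt.
with open("keys.txt", "w", encoding="utf-8") as f:
    f.write(chars)

print(len(chars))  # 94 characters in this sketch
```

If your documents contain characters outside this set (accented letters, currency symbols), they must be added to keys.txt as well, or they cannot be encoded.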
The default vocab file (keys.txt) in this repo is in Chinese. I translated it and noticed that it contains not only characters but also words, sentences, etc., so I did not understand what this file actually is.
As far as I can tell, the default file contains a list of Chinese characters.
For a custom dataset, how do I write the keys.txt file?
You can iterate through your whole training dataset and collect the set of characters, including the space. Characters are encoded by their index in keys.txt, and the vocab size of the Embedding layer equals the length of keys. I think this approach does not compare well with newer methods like BPE encoding.
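The procedure above can be sketched as follows. The sample transcriptions and the `build_keys` helper are hypothetical; substitute however your dataset stores its ground-truth text:

```python
def build_keys(texts):
    """Collect the set of characters seen in the training
    transcriptions (space included) into a sorted string."""
    charset = set()
    for text in texts:
        charset.update(text)  # adds every character, space too
    return "".join(sorted(charset))

# Hypothetical training transcriptions for illustration.
texts = ["Total: $12.50", "Date 2021-05-26"]
keys = build_keys(texts)

with open("keys.txt", "w", encoding="utf-8") as f:
    f.write(keys)

# Each character is encoded by its index in keys, so the
# Embedding layer's vocab size equals len(keys).
char2idx = {c: i for i, c in enumerate(keys)}
```

Sorting the character set makes keys.txt deterministic across runs, which matters because the character-to-index mapping must stay stable between training and inference.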