Byte Pair Encoding

kbrajwani commented 3 years ago

Hey, I saw that you are using a keys.txt file to encode the data. Please correct me if I am wrong.

If you are using keys.txt, how do you create it, and can a word that is not in the training data be handled?
I need to train on another language. How can I do that?

cuongngm commented 3 years ago

I have the same question. Can you explain the purpose and effect of that file?

cuongngm commented 3 years ago

It looks like you need to build a vocabulary for your language in a new keys.txt file and retrain the model. In my case, the vocabulary is: aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0123456789!"#$%&''()*+,-./:;<=>?@[]^_`{|}~
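
One way to produce such a string is to collect every distinct character that appears in your training transcripts. This is only a minimal sketch, assuming the vocabulary file is a single line of characters like the one above. The function name `build_keys` and its parameters are hypothetical and are not part of PICK-pytorch. Text is normalized to NFC first, because accented Vietnamese letters can otherwise be stored as a base letter plus combining marks and would be split into separate entries.

```python
import unicodedata


def build_keys(texts, extra=""):
    """Return a string holding every distinct character in `texts` and `extra`.

    Hypothetical helper, not part of PICK-pytorch. Characters keep their
    first-seen order so the result is deterministic across runs.
    """
    seen = {}  # dict preserves insertion order
    for text in list(texts) + [extra]:
        # Compose base letter + combining mark into one code point where possible.
        for ch in unicodedata.normalize("NFC", text):
            if ch not in ("\n", "\r"):  # line breaks are not vocabulary entries
                seen.setdefault(ch, None)
    return "".join(seen)


# Example with made-up transcripts; digits are added explicitly via `extra`.
keys = build_keys(["Hóa đơn", "Tổng cộng"], extra="0123456789")
print(keys)
```

The resulting string can then be saved as the new keys.txt (UTF-8 encoded) before retraining. Whether the space character or any special symbols belong in the file depends on how the repository reads it, which this thread does not state.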

kbrajwani commented 3 years ago

I got your point: we have to build a vocabulary that includes every character. I was confused because I was thinking about words, but I forgot that PICK works at the character level. Thanks.
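
Since the vocabulary is per character, an unseen word is not a problem as long as each of its characters is in keys.txt. A quick check along these lines can list the characters that are not covered. It is a sketch only: `missing_chars` is a hypothetical helper, and how the model treats a character that is missing is not covered in this thread.

```python
def missing_chars(text, keys):
    """Return the distinct characters of `text` that are absent from `keys`.

    Hypothetical helper, not part of PICK-pytorch. Results are listed in
    first-seen order.
    """
    vocab = set(keys)
    missing = []
    for ch in text:
        if ch not in vocab and ch not in missing:
            missing.append(ch)
    return missing


# A new word is fine if all of its characters are already in the vocabulary.
print(missing_chars("hello", "helo"))
```

An empty list means the text can be represented with the existing keys.txt. Anything else is a candidate to add to the file before retraining.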

wenwenyu / PICK-pytorch

Byte Pair Encoding #52