Closed · MSamiee closed this issue 1 year ago
The Llama tokenizer does not work with vocab.json.
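To make the mismatch concrete: a vocab.json from a GPT-2-style BPE tokenizer is a flat token-to-id JSON mapping, while Llama ships a binary SentencePiece model (tokenizer.model). A minimal sketch of the difference, where "vocabs.json" and the model path are hypothetical placeholders:

```python
# Sketch of why a vocab.json cannot be dropped into the Llama tokenizer.
# "vocabs.json" and the model path below are hypothetical placeholders.
import json

from transformers import LlamaTokenizer

# A GPT-2-style vocab.json is just a {token_string: token_id} mapping.
with open("vocabs.json", encoding="utf-8") as f:
    persian_vocab = json.load(f)
print(f"{len(persian_vocab)} entries, e.g. {list(persian_vocab.items())[:3]}")

# Llama's tokenizer, by contrast, wraps a binary SentencePiece model,
# so the JSON mapping has no slot to plug into.
llama_tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b")
print(type(llama_tokenizer.sp_model))  # <class 'sentencepiece.SentencePieceProcessor'>
```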
I would suggest first training a Persian tokenizer with sentencepiece on Persian text (several GB of data), then merging the Persian tokenizer into the Llama tokenizer by following the steps in merge_tokenizers.py.
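A minimal sketch of those two steps, in the spirit of merge_tokenizers.py. The corpus path, model paths, vocabulary size, and training parameters here are assumptions to tune for your own data:

```python
# Sketch: train a Persian SentencePiece model, then append its pieces to
# the Llama tokenizer's SentencePiece proto. All paths and training
# parameters are illustrative assumptions, not the script's defaults.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
from transformers import LlamaTokenizer

# Step 1: train a Persian tokenizer on several GB of raw Persian text.
spm.SentencePieceTrainer.train(
    input="persian_corpus.txt",   # hypothetical path to your corpus
    model_prefix="persian_sp",    # writes persian_sp.model / persian_sp.vocab
    vocab_size=20000,             # assumed size; pick per your data
    model_type="bpe",
    character_coverage=0.9995,
)

# Step 2: merge the Persian pieces into the Llama tokenizer's proto,
# skipping any pieces Llama already has.
llama_tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b")  # hypothetical path
llama_proto = sp_pb2_model.ModelProto()
llama_proto.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())

persian_sp = spm.SentencePieceProcessor(model_file="persian_sp.model")
persian_proto = sp_pb2_model.ModelProto()
persian_proto.ParseFromString(persian_sp.serialized_model_proto())

existing_pieces = {p.piece for p in llama_proto.pieces}
for p in persian_proto.pieces:
    if p.piece not in existing_pieces:
        new_piece = sp_pb2_model.ModelProto().SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0
        llama_proto.pieces.append(new_piece)

# Save the merged SentencePiece model.
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
```

You would then typically load the merged model with `LlamaTokenizer(vocab_file="merged_tokenizer.model")` and resize the model's token embeddings to the new vocabulary size before further pre-training.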
Thank you so much for your response
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
Closing the issue since no updates have been observed. Feel free to re-open if you need any further assistance.
Check before submitting issues
Type of Issue
Other issues
Base Model
LLaMA-7B
Operating System
None
Describe your issue in detail
Hi, I want to expand the vocabulary for Persian using your approach, but I don't know how to do it from scratch. There are some Persian LLMs that have been pre-trained on massive amounts of Persian text, and they come with a "vocabs.json" file. I want to use this "vocabs.json" file to expand the Persian vocabulary for Llama, but I don't know how. Do you have any suggestions?
Dependencies (must be provided for code-related issues)
Execution logs or screenshots