ymcui / Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models + local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki
Apache License 2.0

Expand vocabulary for another language from scratch #802

Closed · MSamiee closed this issue 1 year ago

MSamiee commented 1 year ago

Check before submitting issues

Type of Issue

Other issues

Base Model

LLaMA-7B

Operating System

None

Describe your issue in detail

# Please copy-and-paste your command here.

Hi, I want to expand the vocabulary for Persian using your approach, but I don't know how to do it from scratch. There are some Persian LLMs that have been pre-trained on massive amounts of Persian text, and they come with a "vocabs.json" file. I want to use this "vocabs.json" file to add Persian vocabulary to Llama, but I don't know how. Do you have any suggestions?

Dependencies (must be provided for code-related issues)

# Please copy-and-paste your dependencies here.

Execution logs or screenshots


# Please copy-and-paste your logs here.
airaria commented 1 year ago

The Llama tokenizer does not work with vocab.json. I would suggest first training a Persian tokenizer with sentencepiece on Persian texts (several GB of data), then merging the Persian tokenizer with the Llama tokenizer by following the steps in merge_tokenizers.py.
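
A minimal sketch of those two steps, assuming the `sentencepiece` and `transformers` packages; the corpus path `persian_corpus.txt`, the directory names, and `vocab_size=20000` are placeholder choices, and the merge loop condenses the logic of the repo's merge_tokenizers.py rather than reproducing it exactly:

```python
import os
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
from transformers import LlamaTokenizer

# Step 1: train a Persian sentencepiece tokenizer on several GB of raw text.
# "persian_corpus.txt" and vocab_size=20000 are placeholders.
spm.SentencePieceTrainer.train(
    input="persian_corpus.txt",
    model_prefix="persian_sp",
    vocab_size=20000,
    model_type="bpe",
    character_coverage=0.9995,
)

# Step 2: merge the new pieces into the Llama sentencepiece model.
llama_tokenizer = LlamaTokenizer.from_pretrained("llama_tokenizer_dir")  # placeholder path
persian_sp = spm.SentencePieceProcessor(model_file="persian_sp.model")

llama_proto = sp_pb2_model.ModelProto()
llama_proto.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
persian_proto = sp_pb2_model.ModelProto()
persian_proto.ParseFromString(persian_sp.serialized_model_proto())

# Append every Persian piece that the Llama vocabulary does not already contain.
existing = {p.piece for p in llama_proto.pieces}
for p in persian_proto.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2_model.ModelProto().SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0
        llama_proto.pieces.append(new_piece)

# Save the merged sentencepiece model and wrap it as a Hugging Face tokenizer.
os.makedirs("merged_tokenizer_sp", exist_ok=True)
with open("merged_tokenizer_sp/merged_llama.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
merged_hf = LlamaTokenizer(vocab_file="merged_tokenizer_sp/merged_llama.model")
merged_hf.save_pretrained("merged_tokenizer_hf")
```

After merging, the base model's token embedding matrix has to be resized to the new vocabulary size (e.g. `model.resize_token_embeddings(len(merged_hf))`) before any further pre-training on Persian text, since the new tokens have no trained embeddings yet.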

MSamiee commented 1 year ago

Thank you so much for your response

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 1 year ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.