The Chinese tokenizer was trained with sentencepiece in the standard way. In our Chinese-LLaMA-Alpaca repo there is a script demonstrating how to merge the trained Chinese tokenizer with the original LLaMA tokenizer: https://github.com/ymcui/Chinese-LLaMA-Alpaca/tree/main/scripts/merge_tokenizer
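For reference, a minimal sketch of what that merge step looks like, loosely following the linked script; the model name and file paths below are placeholders, not the exact values used in the repo:

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
from transformers import LlamaTokenizer

# Load the original LLaMA tokenizer and a newly trained SentencePiece model
# (the model name and paths are placeholders).
llama_tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
new_sp = spm.SentencePieceProcessor()
new_sp.load("new_language_sp.model")

# Parse both tokenizers into their protobuf representations.
llama_proto = sp_pb2_model.ModelProto()
llama_proto.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
new_proto = sp_pb2_model.ModelProto()
new_proto.ParseFromString(new_sp.serialized_model_proto())

# Append every piece from the new tokenizer that the LLaMA vocab
# does not already contain.
existing_pieces = {p.piece for p in llama_proto.pieces}
for p in new_proto.pieces:
    if p.piece not in existing_pieces:
        new_piece = sp_pb2_model.ModelProto().SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0
        llama_proto.pieces.append(new_piece)

# Serialize the merged vocabulary to disk.
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
```

The merged `.model` file can then be loaded with `LlamaTokenizer(vocab_file=...)` to check the new vocabulary size and verify that encoding/decoding round-trips correctly.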
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
Type of Issue
Model training and fine-tuning
Base Model
Chinese-LLaMA-2 (7B/13B)
Operating System
Linux
Describe your issue in detail
Hey there, I've been trying to use your code and reproduce the same pipeline for Indian languages, but I haven't been able to work out how to train the tokenizer. I generally follow the spm tokenizer training here, but I wasn't sure how to add special tokens etc. to make the tokenizer compatible with the LLaMA-2 tokenizer, similar to how you trained yours.
I just don't want to end up training the tokenizer the wrong way.
Could you please share detailed instructions on how you trained the tokenizer?
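For context, this is roughly the SentencePiece training call I've been experimenting with; the corpus path, vocab size, and flags are my own guesses, not something taken from your repo:

```python
import sentencepiece as spm

# Rough guess at a LLaMA-style SentencePiece training setup
# (corpus path, vocab size, and flag values are placeholders).
spm.SentencePieceTrainer.train(
    input="indic_corpus.txt",        # plain-text training corpus
    model_prefix="indic_sp",         # produces indic_sp.model / indic_sp.vocab
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,       # high coverage for Indic scripts
    byte_fallback=True,              # the LLaMA tokenizer uses byte fallback
    split_digits=True,               # LLaMA splits digits into single tokens
    allow_whitespace_only_pieces=True,
    remove_extra_whitespaces=False,
    # user_defined_symbols=[...],    # this is the part I'm unsure about:
    #                                # whether/how to add special tokens here
)
```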
Dependencies (must be provided for code-related issues)
No response
Execution logs or screenshots
No response