ymcui / Chinese-LLaMA-Alpaca-2

中文LLaMA-2 & Alpaca-2大模型二期项目 + 64K超长上下文模型 (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models)
Apache License 2.0

tokenizer training #358

Closed StephennFernandes closed 1 year ago

StephennFernandes commented 1 year ago

Check before submitting issues

Type of Issue

Model training and fine-tuning

Base Model

Chinese-LLaMA-2 (7B/13B)

Operating System

Linux

Describe your issue in detail

Hey there, I've been trying to use your code and reproduce the same for Indian languages, but I haven't been able to figure out how to train the tokenizer. I generally follow the SPM tokenizer training here, but I wasn't sure how to add special tokens etc. to make the tokenizer compliant with the LLaMA-2 tokenizer and consistent with how you trained yours.

I just don't want to end up stuck having trained the tokenizer the wrong way.

Could you please share detailed instructions on how you trained the tokenizer?

Dependencies (must be provided for code-related issues)

No response

Execution logs or screenshots

No response

airaria commented 1 year ago

The Chinese tokenizer was trained with sentencepiece in the standard way. In our Chinese-LLaMA-Alpaca repo there is a script demonstrating how to merge the trained Chinese tokenizer with the original LLaMA tokenizer: https://github.com/ymcui/Chinese-LLaMA-Alpaca/tree/main/scripts/merge_tokenizer
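In outline, the two steps look roughly like the sketch below: train a SentencePiece model on your own-language corpus, then append its pieces to the original LLaMA tokenizer's protobuf. This is a minimal sketch, not the exact script from the repo; the corpus path, `vocab_size`, model names, and output paths are illustrative placeholders.

```python
# Minimal sketch: train a SentencePiece tokenizer and merge it into the
# original LLaMA tokenizer. Paths, vocab_size, and model names are
# illustrative, not the exact values used by this project.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
from transformers import LlamaTokenizer

# 1) Train a tokenizer on your own-language corpus in the standard way.
spm.SentencePieceTrainer.train(
    input="corpus.txt",            # plain-text corpus, one sentence per line
    model_prefix="my_lang_sp",     # writes my_lang_sp.model / my_lang_sp.vocab
    vocab_size=20000,              # size of the new vocabulary (illustrative)
    model_type="bpe",
    character_coverage=0.9995,
)

# 2) Load the original LLaMA tokenizer and the newly trained model as protobufs.
llama_tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_spm = sp_pb2_model.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())

new_spm = sp_pb2_model.ModelProto()
with open("my_lang_sp.model", "rb") as f:
    new_spm.ParseFromString(f.read())

# 3) Append pieces that are not already in the LLaMA vocabulary.
existing_pieces = {p.piece for p in llama_spm.pieces}
for p in new_spm.pieces:
    if p.piece not in existing_pieces:
        new_piece = sp_pb2_model.ModelProto().SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0
        llama_spm.pieces.append(new_piece)

# 4) Save the merged model; it can be loaded with
#    LlamaTokenizer(vocab_file="merged_tokenizer.model").
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_spm.SerializeToString())
```

The special tokens (`<s>`, `</s>`, `<unk>`) come from the original LLaMA tokenizer, so they do not need to be re-declared when training the new SentencePiece model; only the new subword pieces are appended.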

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.