ymcui / Chinese-LLaMA-Alpaca-2

中文LLaMA-2 & Alpaca-2大模型二期项目 + 64K超长上下文模型 (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models)
Apache License 2.0

tokenizer training #358

Closed StephennFernandes closed 1 year ago

StephennFernandes commented 1 year ago

Check before submitting issues

Type of Issue

Model training and fine-tuning

Base Model

Chinese-LLaMA-2 (7B/13B)

Operating System

Linux

Describe your issue in detail

Hey there, I've been trying to use your code and reproduce the same for Indian languages, but I haven't been able to figure out how to train the tokenizer. I generally follow the SPM tokenizer training here, but I wasn't sure how to add special tokens etc. to make the tokenizer compliant with the LLaMA-2 tokenizer and consistent with how you trained yours.

I just don't want to end up stuck having trained the tokenizer the wrong way.

Could you please share detailed instructions on how you trained the tokenizer?

Dependencies (must be provided for code-related issues)

No response

Execution logs or screenshots

No response

airaria commented 1 year ago

The Chinese tokenizer was trained with sentencepiece in the standard way. In our Chinese-LLaMA-Alpaca repo there is a script demonstrating how to merge the trained Chinese tokenizer with the original LLaMA tokenizer: https://github.com/ymcui/Chinese-LLaMA-Alpaca/tree/main/scripts/merge_tokenizer
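In outline, the two steps look roughly like the sketch below: train a SentencePiece model on your own-language corpus, then append its pieces to the original LLaMA tokenizer's protobuf. This is a minimal sketch, not the exact script from the repo; the corpus path, `vocab_size`, model names, and output paths are illustrative placeholders.

```python
# Minimal sketch: train a SentencePiece tokenizer and merge it into the
# original LLaMA tokenizer. Paths, vocab_size, and model names are
# illustrative, not the exact values used by this project.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
from transformers import LlamaTokenizer

# 1) Train a tokenizer on your own-language corpus in the standard way.
spm.SentencePieceTrainer.train(
    input="corpus.txt",            # plain-text corpus, one sentence per line
    model_prefix="my_lang_sp",     # writes my_lang_sp.model / my_lang_sp.vocab
    vocab_size=20000,              # size of the new vocabulary (illustrative)
    model_type="bpe",
    character_coverage=0.9995,
)

# 2) Load the original LLaMA tokenizer and the newly trained model as protobufs.
llama_tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_spm = sp_pb2_model.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())

new_spm = sp_pb2_model.ModelProto()
with open("my_lang_sp.model", "rb") as f:
    new_spm.ParseFromString(f.read())

# 3) Append pieces that are not already in the LLaMA vocabulary.
existing_pieces = {p.piece for p in llama_spm.pieces}
for p in new_spm.pieces:
    if p.piece not in existing_pieces:
        new_piece = sp_pb2_model.ModelProto().SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0
        llama_spm.pieces.append(new_piece)

# 4) Save the merged model; it can be loaded with
#    LlamaTokenizer(vocab_file="merged_tokenizer.model").
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_spm.SerializeToString())
```

The special tokens (`<s>`, `</s>`, `<unk>`) come from the original LLaMA tokenizer, so they do not need to be re-declared when training the new SentencePiece model; only the new subword pieces are appended.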

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.