How to recreate the SentencePiece model?

abhinand5 commented 10 months ago

Check before submitting issues

[X] Make sure to pull the latest code, as some issues and bugs have been fixed.
[X] Due to frequent dependency updates, please ensure you have followed the steps in our Wiki
[X] I have read the FAQ section AND searched for similar issues and did not find a similar problem or solution
[X] Third-party plugin issues - e.g., llama.cpp, text-generation-webui, LlamaChat, we recommend checking the corresponding project for solutions
[X] Model validity check - Be sure to check the model's SHA256.md. If the model is incorrect, we cannot guarantee its performance

Type of Issue

Other issues

Base Model

None

Operating System

Linux

Describe your issue in detail

Excellent work by the community in open-sourcing this project. It serves as a guide for many people like me who want to fine-tune Llama 2 on our own languages.

I've looked through all the code but couldn't find the script or spm_train command that you used to train chinese_sp.model.

Information such as the following would help greatly:

Dataset size used to train this model
SentencePiece model details such as model_type and other configuration options
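Some of these details may be recoverable from the model file itself. A SentencePiece .model file is a serialized ModelProto, and it normally carries the trainer_spec that was used at training time. The sketch below is only an illustration under that assumption: the helper names (load_trainer_spec, summarize_trainer_spec) and the path chinese_sp.model are placeholders, not something taken from this repository's training code. It also needs the sentencepiece and protobuf packages installed. Note that the recorded settings do not include the corpus size; at most they include the input path and any sentence-sampling limit.

```python
# Mirrors the TrainerSpec.ModelType enum in sentencepiece_model.proto.
MODEL_TYPE_NAMES = {1: "unigram", 2: "bpe", 3: "word", 4: "char"}


def summarize_trainer_spec(spec):
    """Pick out a few training settings from a TrainerSpec-like object."""
    return {
        "model_type": MODEL_TYPE_NAMES.get(spec.model_type, "unknown"),
        "vocab_size": spec.vocab_size,
        "character_coverage": spec.character_coverage,
        # 0 means no sampling limit was set on the input sentences.
        "input_sentence_size": spec.input_sentence_size,
    }


def load_trainer_spec(model_path):
    """Parse a .model file and return its embedded trainer_spec."""
    # Imported lazily so summarize_trainer_spec works without sentencepiece.
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2

    proto = sp_pb2.ModelProto()
    with open(model_path, "rb") as f:
        proto.ParseFromString(f.read())
    return proto.trainer_spec


# Usage (the path is an assumption; point it at the model you want to inspect):
# print(summarize_trainer_spec(load_trainer_spec("chinese_sp.model")))
```

If the values come back as defaults, the model may have been re-saved or merged after training, in which case the original settings would still have to come from the maintainers.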

Example command I am using:

spm_train --input=tamil_sentence_corpus_1.6m.txt \
    --model_prefix=tamil_sp \
    --vocab_size=16000 \
    --character_coverage=1.0 \
    --model_type=unigram

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 9 months ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.

ymcui / Chinese-LLaMA-Alpaca