ymcui / Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models, with local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki
Apache License 2.0
17.98k stars, 1.84k forks

Can't open chinese_sp.model #839

Closed: polm-stability closed this issue 10 months ago

polm-stability commented 10 months ago

Type of Issue

Other issues

Base Model

None

Operating System

Linux

Describe your issue in detail

I am looking at how the tokenizer for the model was created. The merge script looks fine, but the chinese_sp.model file doesn't seem to open in SentencePiece, and I get an error. Is there an issue with the file in the repo, or am I doing something wrong?

I thought this might be a protobuf error, but using the os.environ setting from the merge script doesn't change the error.

import sentencepiece as spm
model = spm.SentencePieceProcessor("chinese_sp.model")
# same result as ...
model = spm.SentencePieceProcessor()
model.Load("chinese_sp.model")

Dependencies (must be provided for code-related issues)

sentencepiece==0.1.97

Execution logs or screenshots

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/pool/work/stability/tokenizers/chinese-llama-alpaca/env/lib/python3.11/site-packages/sentencepiece/__init__.py", line 447, in Init
    self.Load(model_file=model_file, model_proto=model_proto)
  File "/mnt/pool/work/stability/tokenizers/chinese-llama-alpaca/env/lib/python3.11/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/pool/work/stability/tokenizers/chinese-llama-alpaca/env/lib/python3.11/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

airaria commented 10 months ago

import sentencepiece as spm
model = spm.SentencePieceProcessor("chinese_sp.model")
# same result as ...
model = spm.SentencePieceProcessor()
model.Load("chinese_sp.model")

I encountered no error with the code above.

My version is sentencepiece==0.1.99

polm-stability commented 10 months ago

Thanks for the quick reply. I re-downloaded the model and it loaded fine, so I must have gotten a bad copy somehow. Sorry for the noise.
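
For anyone who hits the same ParseFromArray error: since a fresh download fixed it here, a quick check of the file on disk may save time before suspecting sentencepiece or protobuf. The following is only a sketch using the Python standard library. The helper name inspect_model_file is hypothetical, and the two failure modes it looks for (a Git LFS pointer file, or an HTML page saved in place of the binary) are common causes of bad downloads in general, not something this thread confirms for chinese_sp.model.

```python
import hashlib
from pathlib import Path

# First line of a Git LFS pointer file (per the Git LFS pointer format).
LFS_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def inspect_model_file(path):
    """Return (size_in_bytes, sha256_hex, verdict) for a downloaded file.

    Hypothetical helper: it only looks for obvious signs of a bad
    download and does not validate the SentencePiece model itself.
    """
    data = Path(path).read_bytes()
    head = data[:256].lstrip().lower()
    if not data:
        verdict = "empty file"
    elif data.startswith(LFS_PREFIX):
        verdict = "Git LFS pointer, not the real file"
    elif head.startswith(b"<!doctype") or head.startswith(b"<html"):
        verdict = "HTML page saved instead of the binary"
    else:
        verdict = "no obvious problem; compare the hash with a fresh download"
    return len(data), hashlib.sha256(data).hexdigest(), verdict
```

Calling print(inspect_model_file("chinese_sp.model")) on both the failing copy and a fresh download shows whether the sizes and SHA-256 hashes differ, which would point to a corrupted or truncated file rather than a library problem.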