Can't open chinese_sp.model

polm-stability commented 10 months ago

Check before submitting issues

[X] Make sure to pull the latest code, as some issues and bugs have been fixed.
[X] Due to frequent dependency updates, please ensure you have followed the steps in our Wiki
[X] I have read the FAQ section AND searched for similar issues and did not find a similar problem or solution
[X] Third-party plugin issues - e.g., llama.cpp, text-generation-webui, LlamaChat, we recommend checking the corresponding project for solutions
[X] Model validity check - Be sure to check the model's SHA256.md. If the model is incorrect, we cannot guarantee its performance

Type of Issue

Other issues

Base Model

None

Operating System

Linux

Describe your issue in detail

I am looking at how the tokenizer for the model was created. The merge script looks fine, but the chinese_sp.model file fails to load in SentencePiece with the error shown below. Is there an issue with the file in the repo, or am I doing something wrong?

I thought this might be a protobuf error, but applying the same os.environ setting that the merge script uses doesn't change the error.

import sentencepiece as spm
model = spm.SentencePieceProcessor("chinese_sp.model")
# same result as ...
model = spm.SentencePieceProcessor()
model.Load("chinese_sp.model")

Dependencies (must be provided for code-related issues)

sentencepiece==0.1.97

Execution logs or screenshots

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/pool/work/stability/tokenizers/chinese-llama-alpaca/env/lib/python3.11/site-packages/sentencepiece/__init__.py", line 447, in Init
    self.Load(model_file=model_file, model_proto=model_proto)
  File "/mnt/pool/work/stability/tokenizers/chinese-llama-alpaca/env/lib/python3.11/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/pool/work/stability/tokenizers/chinese-llama-alpaca/env/lib/python3.11/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
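
The failing call in the traceback is model_proto->ParseFromArray, which means the bytes read from disk did not parse as a SentencePiece model protobuf. One quick thing to rule out is that the download saved something other than the model itself. The sketch below is only an illustration: sniff_model_file is a hypothetical helper (not part of sentencepiece), and the cases it checks (an empty file, a Git LFS pointer, an HTML page) are assumed common failure modes, not something confirmed for this repo.

```python
# Git LFS pointer files are small text files that begin with this line.
GIT_LFS_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def sniff_model_file(path):
    """Classify a downloaded file by its first bytes.

    Hypothetical helper for illustration. Returns one of
    "empty", "git-lfs-pointer", "html", or "other".
    "other" does not prove the file is a valid model, only that it is
    not one of the obvious non-model downloads checked here.
    """
    with open(path, "rb") as f:
        data = f.read(512)
    if not data:
        return "empty"
    if data.startswith(GIT_LFS_PREFIX):
        return "git-lfs-pointer"
    head = data.lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "other"
```

Calling sniff_model_file("chinese_sp.model") and getting anything but "other" would suggest re-downloading the file before looking at sentencepiece or protobuf versions.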

airaria commented 10 months ago

import sentencepiece as spm
model = spm.SentencePieceProcessor("chinese_sp.model")
# same result as ...
model = spm.SentencePieceProcessor()
model.Load("chinese_sp.model")

I get no error with the code above.

My sentencepiece version is 0.1.99.

polm-stability commented 10 months ago

Thanks for the quick reply. I re-downloaded the model and it now loads fine, so I must have gotten a bad copy somehow. Sorry for the noise.
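
Since the cause here was a bad download, comparing checksums is a cheap way to catch it before debugging further. The following is a minimal sketch using only the Python standard library. The function names are made up for illustration, and the expected digest has to come from a copy that is known to load; the thread does not say whether the repo's SHA256.md lists chinese_sp.model.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with a known-good value.

    expected_hex is an assumption of this sketch: it must be obtained
    from a trusted source, such as a copy of the file that loads.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

If matches_expected("chinese_sp.model", known_good_digest) returns False, the local file differs from the good copy and should be downloaded again.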

ymcui / Chinese-LLaMA-Alpaca