zxysnow opened 5 years ago
Hi @zxysnow,
yes, the vocab size is 32K, you can easily check that with:
wget https://storage.googleapis.com/xlnet/released_models/cased_L-24_H-1024_A-16.zip
unzip cased_L-24_H-1024_A-16.zip
cd cased_L-24_H-1024_A-16  # cd into the extracted directory, not the .zip
Then load the model in Python to get the vocab size:
import sentencepiece as spm
s = spm.SentencePieceProcessor()
s.Load('spiece.model')
# Retrieve size
print(s.get_piece_size())
This outputs 32000.
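As a side note, SentencePiece counts the control and user-defined symbols passed at training time toward `vocab_size`, so the number of actually learned pieces is slightly below 32,000. A back-of-the-envelope sketch (this assumes that budgeting and ignores built-in specials such as `<unk>` for simplicity):

```python
# Symbols taken from the spm_train command quoted from the README.
control_symbols = ["<cls>", "<sep>", "<pad>", "<mask>", "<eod>"]
user_defined_symbols = ["<eop>", ".", "(", ")", '"', "-", "–", "£", "€"]

vocab_size = 32000  # the value reported by get_piece_size()

# Assumption: control/user-defined symbols are counted inside vocab_size;
# built-in specials like <unk> are not subtracted here.
learned_pieces = vocab_size - len(control_symbols) - len(user_defined_symbols)
print(learned_pieces)  # 31986
```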
The parameters for training the SentencePiece model are specified in the README:
spm_train \
--input=$INPUT \
--model_prefix=sp10m.cased.v3 \
--vocab_size=32000 \
--character_coverage=0.99995 \
--model_type=unigram \
--control_symbols='<cls>,<sep>,<pad>,<mask>,<eod>' \
--user_defined_symbols='<eop>,.,(,),",-,–,£,€' \
--shuffle_input_sentence \
--input_sentence_size=10000000
This randomly samples 10,000,000 input sentences from the input corpus for training the SentencePiece model :)
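At toy scale, the effect of `--shuffle_input_sentence` combined with `--input_sentence_size` is roughly a uniform random sample of the corpus lines. A minimal stdlib sketch of that idea (the corpus here is hypothetical, not the actual XLNet training data):

```python
import random

# Hypothetical stand-in for the training corpus ($INPUT in spm_train).
corpus = [f"sentence {i}" for i in range(1000)]

# Mimic --shuffle_input_sentence --input_sentence_size at toy scale:
# draw a fixed-size random sample of lines to train on.
sample_size = 100
random.seed(0)  # reproducibility for this sketch only
sample = random.sample(corpus, sample_size)

print(len(sample))       # 100
print(len(set(sample)))  # 100 (sampling without replacement)
```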
@stefan-it thank you very much, that's very helpful! Do you know where I could find the input corpus used to train spiece.model? I found that this model works much better than the one I trained, but I don't know why :(
I use a SentencePiece model that was pretrained on Chinese Wikipedia data, with a vocab size of 100,000. It works fine for me.
@knightBoy Thanks, but my scenario requires an English corpus. I'm just wondering which corpus is usually used to train an English SentencePiece model.
What is the vocab size of spiece.model (it seems to be 32k)? I am trying to improve this part; could anyone share the vocab size of spiece.model? Also, how much data was used to train spiece.model? Thanks!