zihangdai / xlnet

XLNet: Generalized Autoregressive Pretraining for Language Understanding

spiece.model vocab size #150

Open zxysnow opened 5 years ago

zxysnow commented 5 years ago

What is the vocab size of spiece.model (it seems to be 32k)? I am trying to improve this part, so could anyone share the vocab size of spiece.model? Also, how large was the data used to train spiece.model? Thanks!

stefan-it commented 5 years ago

Hi @zxysnow,

yes, the vocab size is 32K. You can easily check that with:

wget https://storage.googleapis.com/xlnet/released_models/cased_L-24_H-1024_A-16.zip
unzip cased_L-24_H-1024_A-16.zip
cd xlnet_cased_L-24_H-1024_A-16

Load the model in Python to get vocab size:

import sentencepiece as spm
s = spm.SentencePieceProcessor()
s.Load('spiece.model')

# Retrieve size
print(s.get_piece_size())

This outputs 32000.
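
Beyond the vocab size, you can poke at the loaded model a bit more to see the subword pieces and the reserved control symbols. A minimal sketch using the standard SentencePiece API (the example sentence is just made up):

import sentencepiece as spm

s = spm.SentencePieceProcessor()
s.Load('spiece.model')

# Tokenize an arbitrary example sentence into subword pieces and ids
print(s.EncodeAsPieces('XLNet uses a SentencePiece vocabulary.'))
print(s.EncodeAsIds('XLNet uses a SentencePiece vocabulary.'))

# Look up the ids of the control symbols reserved in this vocab
for symbol in ['<cls>', '<sep>', '<pad>', '<mask>', '<eod>']:
    print(symbol, s.PieceToId(symbol))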

The parameters for training a SentencePiece model are specified in the README:

spm_train \
    --input=$INPUT \
    --model_prefix=sp10m.cased.v3 \
    --vocab_size=32000 \
    --character_coverage=0.99995 \
    --model_type=unigram \
    --control_symbols=<cls>,<sep>,<pad>,<mask>,<eod> \
    --user_defined_symbols=<eop>,.,(,),",-,–,£,€ \
    --shuffle_input_sentence \
    --input_sentence_size=10000000

This randomly samples 10,000,000 input sentences from the input corpus for training the SentencePiece model :)
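
If you prefer to launch this from Python rather than the spm_train CLI, the same flags can be passed as a single argument string to SentencePieceTrainer. A sketch under the assumption that your corpus lives in corpus.txt (a placeholder path), one sentence per line:

import sentencepiece as spm

# Same settings as the spm_train command above; corpus.txt is a placeholder
# for your own one-sentence-per-line training file.
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt '
    '--model_prefix=sp10m.cased.v3 '
    '--vocab_size=32000 '
    '--character_coverage=0.99995 '
    '--model_type=unigram '
    '--control_symbols=<cls>,<sep>,<pad>,<mask>,<eod> '
    '--user_defined_symbols=<eop>,.,(,),",-,–,£,€ '
    '--shuffle_input_sentence=true '
    '--input_sentence_size=10000000'
)

This writes sp10m.cased.v3.model and sp10m.cased.v3.vocab, named after the model_prefix.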

zxysnow commented 5 years ago

@stefan-it thank you very much, that's very helpful! Do you know where I could find the input file used to train spiece.model? I found this model is much better than the one I trained, but I don't know why :(

weilonghu commented 5 years ago

I use a SentencePiece model that was pretrained on Chinese Wikipedia data, and its vocab size is 100,000. It works well enough for me.

zxysnow commented 5 years ago

@knightBoy Thanks, but my scenario requires an English corpus. I just wonder which corpus is usually used to train an English SentencePiece model.