wietsedv / bertje

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. (EMNLP Findings 2020) "What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models"
https://aclanthology.org/2020.findings-emnlp.389/
Apache License 2.0

sentencepiece model #1

Closed jwijffels closed 4 years ago

jwijffels commented 4 years ago

Many thanks for releasing these resources! I'm trying to make an R wrapper around PyTorch as explained at https://huggingface.co/transformers/torchscript.html to connect to it from the C++ side such that it can be used as an R package. I have some questions (a rough sketch of the export step I have in mind follows the questions):

  1. Would it be possible to also release the SentencePiece model that you indicate in the paper you have created?
  2. If yes, does the SentencePiece model give the same token ids as the WordPiece model for which you have provided the vocabulary at https://bertje.s3.eu-central-1.amazonaws.com/v1/vocab.txt? (How did you convert the SentencePiece model to WordPiece?)
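For context, this is a minimal sketch of the TorchScript export step I have in mind, following the linked transformers documentation. The model identifier and the example sentence are placeholders for illustration, not the actual released files:

```python
import torch
from transformers import BertModel, BertTokenizer

# Placeholder model identifier; substitute the actual BERTje files/checkpoint.
MODEL_NAME = "wietsedv/bert-base-dutch-cased"

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
# torchscript=True makes the model return tuples so it can be traced.
model = BertModel.from_pretrained(MODEL_NAME, torchscript=True)
model.eval()

# Trace with example inputs so the graph can be loaded from LibTorch / C++.
inputs = tokenizer("Dit is een voorbeeldzin.", return_tensors="pt")
traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]))
torch.jit.save(traced, "bertje_traced.pt")
```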
wietsedv commented 4 years ago

I do have the original SentencePiece model, but I am not sure how useful it would be for your project. SentencePiece and WordPiece are two algorithms that try to find an optimal set of subword units for a language. The two may work differently internally, but only the output vocabulary is relevant.

I did not convert the models; I just trained the SentencePiece model and converted the output vocabulary (vocab.txt). SentencePiece marks a preceding whitespace with a ▁ prefix (not an actual underscore, but a character that looks like one), whereas the WordPiece format uses the ## prefix when a piece continues a word, i.e. when the characters before it are not whitespace. Conversion therefore simply boils down to a search and replace of these prefixes.
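A minimal sketch of that conversion, assuming the SentencePiece vocabulary is a plain-text file with one piece (optionally followed by a tab and a score) per line; the file names here are placeholders:

```python
# Read the SentencePiece vocabulary (piece per line, optional "\t<score>" suffix).
with open("sp_vocab.txt", encoding="utf-8") as f:
    pieces = [line.split("\t")[0] for line in f if line.strip()]

wordpiece_vocab = []
for piece in pieces:
    if piece.startswith("\u2581"):       # ▁ marks a word-initial piece: drop the prefix
        wordpiece_vocab.append(piece[1:] or "\u2581")
    else:                                # continuation piece: add the ## prefix
        wordpiece_vocab.append("##" + piece)

with open("vocab_converted.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(wordpiece_vocab))
```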

The token ids of the original SentencePiece model and the resulting WordPiece format do not match, since I sorted the vocabulary after conversion. But even if I had not sorted it, tokenization results may not always be the same for the first word in a document (if you do not prepend it with a whitespace, which I did do for training the SentencePiece model).

In short, my vocabulary is only meant for WordPiece-based tokenization, since BERT and the BERT model within Transformers rely solely on this format. I think releasing the SentencePiece format would only cause confusion.

jwijffels commented 4 years ago

Ok. Thanks for the details on the way you exported the vocabulary and sorted it.

My sole goal is to make it easy for an R user to get the embeddings of the last layer for further downstream NLP tasks. Would you mind providing the SentencePiece model in private to jwijffels [at] bnosac [dot] be? I want to reduce the R wrapper around LibTorch to only the transformer, and use for the tokenisation the wrapper I just wrote around sentencepiece (https://github.com/bnosac/sentencepiece) instead of the WordPiece implementation, for which I presume you used this one: https://github.com/google-research/bert/blob/master/tokenization.py#L300? Is that last statement correct? That would probably speed things up with similar BPE tokenisation.
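For reference, a minimal sketch of the greedy longest-match-first WordPiece algorithm used in that tokenization.py; the function name and arguments here are illustrative, and `vocab` is assumed to be the set of pieces in vocab.txt applied to an already whitespace/punctuation-split word:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=200):
    """Split one word into WordPiece tokens using greedy longest-match-first."""
    if len(word) > max_chars:
        return [unk_token]
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        # Find the longest vocabulary piece that matches starting at `start`.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece     # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:                  # no piece matches: the whole word is unknown
            return [unk_token]
        tokens.append(cur)
        start = end
    return tokens
```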

wietsedv commented 4 years ago

The last statement is correct. I will send you the model, but keep in mind that the indices do not match the ones in vocab.txt.