wietsedv / bertje

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. Paper (Findings of EMNLP 2020): "What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models"
https://aclanthology.org/2020.findings-emnlp.389/
Apache License 2.0

Important tokens missing in vocabulary? #9

Closed: visionscaper closed this issue 4 years ago

visionscaper commented 4 years ago

Hello,

Thanks for creating BERTje! I'm working on an NER application that will classify named entities in Dutch text documents, so BERTje is really useful to me.

When trying to apply BERTje, I found that the tokeniser is missing some basic tokens, e.g. '@' for email addresses, but also lower-case single characters like 'o' or 'h', which are needed to tokenise infrequent names (or nonsensical ones, which happens a lot with internet domains, for instance). Interestingly enough, upper-case single characters do seem to be available.

Example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-dutch-cased", force_download=True)
tokens = tokenizer.tokenize("Dit is een email adres: test@hjk.nl. Zou dat werken?")
print(tokens)

The result is:

['Dit', 'is', 'een', 'email', 'adres', ':', 'test', '[UNK]', '[UNK]', '.', '[UNK]', '.', 'Zou', 'dat', 'werken', '?']

As you can see, "@hjk" and even 'nl' are not tokenised. This seems incorrect to me; there are many situations in text where an @ sign is used. Furthermore, infrequent names can easily require single-character tokens in order to be tokenised by a subword vocabulary.

Am I missing something? If these important tokens are really missing, I can imagine that your NER benchmark results (reported in the README) could also be (much) better.

wietsedv commented 4 years ago

Unfortunately, this is indeed a known issue that was noticed after the original publication. The SentencePiece segmenter does not add every reasonable character to the vocabulary once its character coverage on the training data is high enough. I should have checked and made sure that all common characters were present.

The simple solution is to add the tokens to the tokenizer and then to resize the token embeddings of the model to the new vocabulary size. The embeddings of previously missing tokens are randomly initialised, but the correct embeddings may be learned during fine-tuning. It should at least be better than UNK.
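For reference, here is a minimal sketch of that workaround with HuggingFace Transformers (the model class and the token list are illustrative; pick whichever characters are missing for your data):

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("wietsedv/bert-base-dutch-cased")
model = BertModel.from_pretrained("wietsedv/bert-base-dutch-cased")

num_added = tokenizer.add_tokens(["@", "o", "h"])  # characters missing from the vocabulary
model.resize_token_embeddings(len(tokenizer))      # new embedding rows are randomly initialised

The randomly initialised rows would then be expected to improve during fine-tuning.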

Let me know what effect this has on your model performance! If adding tokens significantly helps performance, I may provide a general solution later.

visionscaper commented 4 years ago

Hi @wietsedv, thanks for your reply. Yeah, it's too bad these characters were not picked up by the segmenter. Thanks for the suggestion, I will definitely try it. It might take some time, but I will get back to you when I have some results.

simonevanbruggen commented 3 years ago

Hi! I had a similar issue with the missing lower-case characters, and as mentioned above I tried adding them to the tokenizer. However, since manually added tokens are given priority when tokenizing text (see https://discuss.huggingface.co/t/add-new-tokens-for-subwords/489), this breaks any text containing those characters down into single characters.

For example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("wietsedv/bert-base-dutch-cased")
tokenizer.tokenize('e')     # ['[UNK]']
tokenizer.tokenize('test')  # ['test']

tokenizer.add_tokens(['t', 'e', 's'])
tokenizer.tokenize('e')     # ['e']
tokenizer.tokenize('test')  # ['t', 'e', 's', 't']

Any ideas or different solutions for manually extending the tokenizer, while preserving its original tokenization for existing tokens? Thanks!

wietsedv commented 3 years ago

Thanks for pointing out that the simple add_tokens fix does not actually work. I had never tested it myself, but I assumed it would be the simple solution. It appears that add_tokens does not extend the regular vocabulary, but rather creates a separate, higher-priority vocabulary. The normal use case for this function is to add longer (domain-specific) tokens, so this is normally not a problem.

I think the only actual fix would be to manually edit config.json (vocab_size) and vocab.txt, and then run model.resize_token_embeddings. This weekend I will edit the base model on the HuggingFace hub to include all regular characters (and initialize their embeddings with the [UNK] value, so that MLM output should not change).
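A rough sketch of that manual route, assuming the model and tokenizer files have been downloaded to a local directory (the directory name "bertje/" and the token list are illustrative):

from transformers import BertTokenizer, BertModel

missing_chars = ["@", "o", "h"]  # whichever characters are absent from the original vocabulary
with open("bertje/vocab.txt", "a", encoding="utf-8") as f:
    for token in missing_chars:
        f.write(token + "\n")

tokenizer = BertTokenizer.from_pretrained("bertje")  # local directory with the edited vocab.txt
model = BertModel.from_pretrained("bertje")
model.resize_token_embeddings(len(tokenizer))        # grows the embedding matrix; keeps config.vocab_size in sync
model.save_pretrained("bertje")                      # also writes the updated config.json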

wietsedv commented 3 years ago

If you now use GroNLP/bert-base-dutch-cased with HuggingFace Transformers, you will find that the problem has been fixed. Transformers should automatically download the new tokenizer+model files with 73 new tokens. Their embeddings have been initialized to the [UNK] representation for input compatibility's sake, but with a tiny amount of random noise (much smaller than regular inter-token distances) so that output will stay deterministic.
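Conceptually, the initialisation described above looks roughly like the following sketch (not the exact code behind the hub update; the local checkpoint directory "bertje/" and the noise scale are assumptions):

import torch
from transformers import BertTokenizer, BertModel

def init_new_embeddings_like_unk(model, tokenizer, first_new_id, noise_std=1e-5):
    # Copy the [UNK] embedding into every newly added row, plus a tiny bit of noise.
    emb = model.get_input_embeddings().weight
    with torch.no_grad():
        unk_vec = emb[tokenizer.unk_token_id].clone()
        for token_id in range(first_new_id, len(tokenizer)):
            emb[token_id] = unk_vec + noise_std * torch.randn_like(unk_vec)

tokenizer = BertTokenizer.from_pretrained("bertje")  # tokenizer with the extended vocab.txt
model = BertModel.from_pretrained("bertje")          # checkpoint that still has the old vocab size
old_size = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))
init_new_embeddings_like_unk(model, tokenizer, first_new_id=old_size)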

Hope this solves the issue! I should have done this sooner, but I did not know that add_tokens wasn't a solution.

simonevanbruggen commented 3 years ago

Great, thank you! That fixed the problem 👍