wietsedv / bertje

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. Paper (Findings of EMNLP 2020): "What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models"
https://aclanthology.org/2020.findings-emnlp.389/
Apache License 2.0

Fine-tuning BERTje for custom NER #4

Closed · NielsRogge closed this issue 4 years ago

NielsRogge commented 4 years ago

Hello, I'd like to fine-tune BERTje for custom named-entity recognition in Dutch (for example, to recognize street names). Is this possible by initializing BertForTokenClassification with 'bert-base-dutch-cased'? And do you think this approach is viable? Roughly how many annotated training examples would be needed to obtain reasonable performance? Would 200 annotated sentences per entity type be enough?
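
Concretely, I had something like the following minimal sketch in mind (the checkpoint name and the BIO tagset for street names are just placeholders for my own setup):

```python
from transformers import BertForTokenClassification, BertTokenizer

# Hypothetical BIO tagset for a custom street-name entity type
labels = ["O", "B-STREET", "I-STREET"]

tokenizer = BertTokenizer.from_pretrained("wietsedv/bert-base-dutch-cased")
model = BertForTokenClassification.from_pretrained(
    "wietsedv/bert-base-dutch-cased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The token-classification head on top of BERTje is randomly initialized
# here, so it still has to be fine-tuned on the annotated sentences.
```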

Ideally, a BERTje model already fine-tuned on CoNLL-2002/SoNaR-1 would be an even better starting point for transfer learning. I see you're planning to release such fine-tuned models in the future, so I'm looking forward to that.

Niekvdplas commented 4 years ago

I am also looking forward to your NER fine-tuned model. Do you have an ETA when something like this will be made available?

flieks commented 4 years ago

Also interested :)

wietsedv commented 4 years ago

(Sorry for the late response, I did not receive or notice the GitHub notifications.)

I will try to test and release fine-tuned models before the weekend, but I cannot make promises. I will only release models that I consider useful and trustworthy in practice, which is not a problem for NER. For SRL, for instance, I will have to verify the tagsets and annotations first (the source SoNaR annotations can sometimes be a bit dubious).

wietsedv commented 4 years ago

It took a bit longer than I intended, but I have uploaded the fine-tuned NER models based on BERTje and mBERT. I linked to them in the readme.

I may add more details and usage instructions later when I get to it, but usage should be straightforward if you are familiar with Hugging Face Transformers. The three source datasets and tagsets are quite different from each other, so I cannot give a single recommendation.
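
As a rough sketch, inference could look something like this (the model identifier below is just an example; the actual names of the released models are in the readme):

```python
from transformers import pipeline

# Example identifier; check the readme for the actual released model names
ner = pipeline(
    "ner",
    model="wietsedv/bert-base-dutch-cased-finetuned-conll2002-ner",
    grouped_entities=True,  # merge word pieces into whole entity spans
)
print(ner("Ik woon aan de Grote Markt in Groningen."))
```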

To give a quick overview, these are the data sizes and tagset sizes of the training data: [image: table of training data sizes and tagset sizes per dataset]

flieks commented 4 years ago

Thanks a lot!