wietsedv / bertje

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. Paper (Findings of EMNLP 2020): "What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models"
https://aclanthology.org/2020.findings-emnlp.389/
Apache License 2.0

Using BERTje word embeddings #32

Closed · SjoerdBraaksma closed this issue 1 year ago

SjoerdBraaksma commented 1 year ago

Hi Wietse!

I am relatively new to using BERT models, and I was wondering whether it is possible to access the word embeddings directly so that I can use them in other frameworks. In my specific use case, I want to use them as the embedding model in Top2Vec.

Is this possible, and if so, how can I do it?

Thanks in advance!

wietsedv commented 1 year ago

If you need to extract embeddings and fine-tuning is not an option, you can extract hidden states from BERTje. I think the simplest way is this: https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/pipelines#transformers.FeatureExtractionPipeline

```python
from transformers import pipeline

# model_kwargs are forwarded to from_pretrained, so num_hidden_layers=10
# truncates the model to its first 10 transformer layers.
extractor = pipeline(model="GroNLP/bert-base-dutch-cased", task="feature-extraction", model_kwargs={"num_hidden_layers": 10})
result = extractor("Dit is een test.", return_tensors=True)

# A tensor of shape [1, sequence_length, hidden_dimension] representing the input string.
result.shape  # torch.Size([1, 8, 768])
```
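
For Top2Vec specifically, which expects one vector per document, you still need to pool the per-token hidden states into a single vector. A minimal sketch, reusing the `extractor` defined above and simple mean pooling over the sequence dimension (just one common option, not the only one):

```python
import torch

def sentence_embedding(text: str) -> torch.Tensor:
    # Average the per-token hidden states into one sentence-level vector.
    hidden = extractor(text, return_tensors=True)  # [1, seq_len, hidden_dim]
    return hidden.mean(dim=1).squeeze(0)           # [hidden_dim]

vec = sentence_embedding("Dit is een test.")
vec.shape  # torch.Size([768])
```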

Choose num_hidden_layers (used in the snippet above) between 1 and 12. It defaults to 12, but that is suboptimal for feature extraction.
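
If you would rather keep the full 12-layer model and pick a layer's output yourself, a sketch of the same idea using output_hidden_states from the standard transformers API:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased", output_hidden_states=True)

inputs = tokenizer("Dit is een test.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of 13 tensors (the embedding output plus
# one per transformer layer), each of shape [1, sequence_length, 768].
layer_10 = outputs.hidden_states[10]  # hidden states after layer 10
```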

I hope this helps. This is mostly a generic language modeling question. I refer you to the Hugging Face Forums if you need more help.