wietsedv / bertje

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. Paper (Findings of EMNLP 2020): "What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models"
https://aclanthology.org/2020.findings-emnlp.389/
Apache License 2.0

Using BERTje word embeddings #32

Closed · SjoerdBraaksma closed this issue 1 year ago

SjoerdBraaksma commented 1 year ago

Hi Wietse!

I am relatively new to using BERT models, and I was wondering whether it is possible to access the word embeddings directly so that I can use them in other frameworks. In my specific use case, I want to use them as the embedding model in Top2Vec.

Is this possible, and if so, how can I do it?

Thanks in advance!

wietsedv commented 1 year ago

If you need to extract embeddings and fine-tuning is not an option, you can extract hidden states from BERTje. I think the simplest way is this: https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/pipelines#transformers.FeatureExtractionPipeline

```python
from transformers import pipeline

# model_kwargs are forwarded to from_pretrained, so num_hidden_layers=10
# truncates the model to its first 10 transformer layers.
extractor = pipeline(model="GroNLP/bert-base-dutch-cased", task="feature-extraction", model_kwargs={"num_hidden_layers": 10})
result = extractor("Dit is een test.", return_tensors=True)

# A tensor of shape [1, sequence_length, hidden_dimension] representing the input string.
result.shape  # torch.Size([1, 8, 768])
```
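
For Top2Vec specifically, which expects one vector per document, you still need to pool the per-token hidden states into a single vector. A minimal sketch, reusing the `extractor` defined above and simple mean pooling over the sequence dimension (just one common option, not the only one):

```python
import torch

def sentence_embedding(text: str) -> torch.Tensor:
    # Average the per-token hidden states into one sentence-level vector.
    hidden = extractor(text, return_tensors=True)  # [1, seq_len, hidden_dim]
    return hidden.mean(dim=1).squeeze(0)           # [hidden_dim]

vec = sentence_embedding("Dit is een test.")
vec.shape  # torch.Size([768])
```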

Choose num_hidden_layers (used in the snippet above) between 1 and 12. It defaults to 12, but that is suboptimal for feature extraction.
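
If you would rather keep the full 12-layer model and pick a layer's output yourself, a sketch of the same idea using output_hidden_states from the standard transformers API:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased", output_hidden_states=True)

inputs = tokenizer("Dit is een test.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of 13 tensors (the embedding output plus
# one per transformer layer), each of shape [1, sequence_length, 768].
layer_10 = outputs.hidden_states[10]  # hidden states after layer 10
```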

I hope this helps. This is mostly a generic language modeling question. I refer you to the Hugging Face Forums if you need more help.