webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License

When using EmbeddingBasedQueryStrategy with some transformers, the model receives an unsupported input `token_type_ids` when creating embeddings. #54

Open RaymondUoE opened 7 months ago

RaymondUoE commented 7 months ago

Bug description

- `query_strategy` must be a subclass of `EmbeddingBasedQueryStrategy`, such as `EmbeddingKMeans`.
- `transformer_model` must be a model that does not expect `token_type_ids` in its forward function, such as `distilbert-base-uncased`.

Steps to reproduce

When performing active learning with this configuration, embedding creation fails because the model receives the unsupported input `token_type_ids`. A minimal sketch to reproduce this is below.
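For illustration, a minimal reproduction sketch, assuming the small-text 1.3.x API; the toy texts and labels are placeholders:

import numpy as np
from transformers import AutoTokenizer
from small_text import PoolBasedActiveLearner, EmbeddingKMeans
from small_text.integrations.transformers import (
    TransformerBasedClassificationFactory,
    TransformerModelArguments,
    TransformersDataset,
)

# toy binary classification data (placeholder)
texts = ['this is great', 'this is terrible'] * 10
labels = np.array([0, 1] * 10)

model_name = 'distilbert-base-uncased'  # forward() does not accept token_type_ids
tokenizer = AutoTokenizer.from_pretrained(model_name)

train = TransformersDataset.from_arrays(texts, labels, tokenizer,
                                        max_length=32,
                                        target_labels=np.array([0, 1]))

clf_factory = TransformerBasedClassificationFactory(
    TransformerModelArguments(model_name), num_classes=2)

active_learner = PoolBasedActiveLearner(clf_factory, EmbeddingKMeans(), train)
active_learner.initialize_data(np.arange(10), labels[:10])

# query() embeds the pool via _create_embeddings(), which passes
# token_type_ids=None; DistilBERT's forward() then raises a TypeError
active_learner.query(num_samples=5)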

Expected behavior

The keys of the model input should be adjusted to the specific model, i.e. only arguments that the model's forward function accepts should be passed.

Cause:

In small_text/integrations/transformers/classifiers/classification.py, the function _create_embeddings contains the following call:

outputs = self.model(text,
                     token_type_ids=None,
                     attention_mask=masks,
                     output_hidden_states=True)

which needs to be changed to

outputs = self.model(text,
                     attention_mask=masks,
                     output_hidden_states=True)

that is, the token_type_ids argument should be removed whenever the seed model does not expect token_type_ids in its forward function.
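A generic version of this fix (a sketch, not an actual patch from the maintainers) could inspect the wrapped model's forward signature and pass token_type_ids only when it is accepted; self.model, text, and masks are the variables from the snippet above:

import inspect

# only pass token_type_ids if the model's forward() accepts it
model_kwargs = dict(attention_mask=masks, output_hidden_states=True)
if 'token_type_ids' in inspect.signature(self.model.forward).parameters:
    model_kwargs['token_type_ids'] = None

outputs = self.model(text, **model_kwargs)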

Environment:

Python version: 3.11.7
small-text version: 1.3.3
small-text integrations (e.g., transformers): transformers 4.36.2
PyTorch version: 2.1.2
PyTorch-cuda: 11.8

chschroeder commented 7 months ago

Yes, such errors may happen, as models can have arbitrary arguments. What you suggest here sounds like a good solution for the case where the calling side passes more parameters than the model accepts.

Moreover, there were plans to add a list of supported models to the documentation, which might also be useful here, so that someone who encounters such an error does not have to try model after model.