webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License

When using EmbeddingBasedQueryStrategy with some transformers, the model receives an unsupported input `token_type_ids` when creating embeddings. #54

Open RaymondUoE opened 7 months ago

RaymondUoE commented 7 months ago

Bug description

- `query_strategy` must be a subclass of `EmbeddingBasedQueryStrategy`, such as `EmbeddingKMeans`.
- `transformer_model` must be a model that does not expect `token_type_ids` in its forward function, such as `distilbert-base-uncased`.

Steps to reproduce

When performing active learning with this configuration, embedding creation fails because the model receives the unsupported input `token_type_ids`. A minimal sketch to reproduce this is below.
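For illustration, a minimal reproduction sketch, assuming the small-text 1.3.x API; the toy texts and labels are placeholders:

import numpy as np
from transformers import AutoTokenizer
from small_text import PoolBasedActiveLearner, EmbeddingKMeans
from small_text.integrations.transformers import (
    TransformerBasedClassificationFactory,
    TransformerModelArguments,
    TransformersDataset,
)

# toy binary classification data (placeholder)
texts = ['this is great', 'this is terrible'] * 10
labels = np.array([0, 1] * 10)

model_name = 'distilbert-base-uncased'  # forward() does not accept token_type_ids
tokenizer = AutoTokenizer.from_pretrained(model_name)

train = TransformersDataset.from_arrays(texts, labels, tokenizer,
                                        max_length=32,
                                        target_labels=np.array([0, 1]))

clf_factory = TransformerBasedClassificationFactory(
    TransformerModelArguments(model_name), num_classes=2)

active_learner = PoolBasedActiveLearner(clf_factory, EmbeddingKMeans(), train)
active_learner.initialize_data(np.arange(10), labels[:10])

# query() embeds the pool via _create_embeddings(), which passes
# token_type_ids=None; DistilBERT's forward() then raises a TypeError
active_learner.query(num_samples=5)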

Expected behavior

The keys of the model input should be adjusted to the specific model, i.e. only arguments that the model's forward function accepts should be passed.

Cause:

In small_text/integrations/transformers/classifiers/classification.py, the function _create_embeddings contains the following call:

outputs = self.model(text,
                     token_type_ids=None,
                     attention_mask=masks,
                     output_hidden_states=True)

which needs to be changed to

outputs = self.model(text,
                     attention_mask=masks,
                     output_hidden_states=True)

that is, the token_type_ids argument should be removed whenever the seed model does not expect token_type_ids in its forward function.
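A generic version of this fix (a sketch, not an actual patch from the maintainers) could inspect the wrapped model's forward signature and pass token_type_ids only when it is accepted; self.model, text, and masks are the variables from the snippet above:

import inspect

# only pass token_type_ids if the model's forward() accepts it
model_kwargs = dict(attention_mask=masks, output_hidden_states=True)
if 'token_type_ids' in inspect.signature(self.model.forward).parameters:
    model_kwargs['token_type_ids'] = None

outputs = self.model(text, **model_kwargs)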

Environment:

Python version: 3.11.7
small-text version: 1.3.3
small-text integrations (e.g., transformers): transformers 4.36.2
PyTorch version: 2.1.2
PyTorch-cuda: 11.8

chschroeder commented 7 months ago

Yes, such errors may happen, as models can have arbitrary arguments. What you suggest here sounds like a good solution for the case where the calling side passes more parameters than the model accepts.

Moreover, there were plans to add a list of supported models to the documentation, which might also be useful here, so that someone who encounters such an error does not have to try model after model.