Adding special tokens to tokenizer (transformers-integration)

HannahKirk commented 2 years ago

I need to add some special tokens to the BERT tokenizer. However, I am not sure how to resize the model tokenizer to incorporate the added special tokens with the small-text transformers integration.

With transformers, you can add special tokens using:

tokenizer.add_tokens(['newWord', 'newWord2'])
model.resize_token_embeddings(len(tokenizer)

How does this change with a clf_factory and initialising the transformers model as a pool based active learner? E.g. with the code from the 01-active-learning-for-text-classification-with-small-text-intro.ipynb notebook:

from small_text.integrations.transformers.datasets import TransformersDataset

def get_transformers_dataset(tokenizer, data, labels, max_length=60):

    data_out = []

    for i, doc in enumerate(data):
        encoded_dict = tokenizer.encode_plus(
            doc,
            add_special_tokens=True,
            padding='max_length',
            max_length=max_length,
            return_attention_mask=True,
            return_tensors='pt',
            truncation='longest_first'
        )

        data_out.append((encoded_dict['input_ids'], encoded_dict['attention_mask'], labels[i]))

    return TransformersDataset(data_out)

train = get_transformers_dataset(tokenizer, raw_dataset['train']['text'], raw_dataset['train']['label'])
test = get_transformers_dataset(tokenizer, raw_dataset['test']['text'], raw_dataset['test']['label'])

transformer_model = TransformerModelArguments(transformer_model_name)
clf_factory = TransformerBasedClassificationFactory(transformer_model, 
                                                    num_classes, 
                                                    kwargs=dict({'device': 'cuda', 
                                                                 'mini_batch_size': 32,
                                                                 'early_stopping_no_improvement': -1
                                                                }))
active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, train)

chschroeder commented 2 years ago

I have not tried this before (at least not in combination with active learning), but I think this should work just fine if you do it as follows:

1. Prepare your Tokenizer

I think the way to go here is to simply save the tokenizer:

tokenizer.add_tokens(['newWord', 'newWord2'])
model.resize_token_embeddings(len(tokenizer)
tokenizer.save_pretrained('/path/to/some/directory/tokenizer/')

You could even train this tokenizer at this point, in which case it makes even more sense to save the result. I think you need to do the same for the model (or alternatively find a way to call resize_token_embeddings() later):

model.save_pretrained('/path/to/some/directory/model/')

2. Passing a Custom Tokenizer into Small-Text

TransformerModelArguments is a simple "container object" and the parameters you set here will later be used to load the model/tokenizer/config using the respective methods provided by huggingface transformers.

# we pass the same directories as above
transformer_model = TransformerModelArguments('/path/to/some/directory/model/', tokenizer='/path/to/some/directory/tokenizer/')

In general, you have to make sure that the tokenizer used for your preprocessing matches the one passed to TransformerModelArguments.

Let me know if this does not work. This is an important use case and should be possible when using small-text.

HannahKirk commented 2 years ago

Hi, Thanks for the speedy reply. Quick question on 1., in the small-text code I sent above, can you access the model directly to do model.resize_token_embeddings()? active_learner..classifier.model only doesn't return none after running labeled_indices = initialize_active_learner(active_learner, train.y).

Or are you suggesting alternatively to load the model and tokenizer from pretrained in a separate script, then load these into the transformer model arguments?

Thanks!

chschroeder commented 2 years ago

Yes, the model in classifier.model is only initialized after calling fit() (which also can happen during initialization). If you do it this way, however, the fit method also trains the model using the original tokenizer, which is probably not what you want.

Your second suggestion to create a separate script seems like the better choice. If you only want to resize the model, this is possible in just a few lines of code. You can take a look here, this is the function to load the model. After calling resize_token_embeddings() and saving the altered model, the result is, again, a pre-trained model, whose only change is the resized token embedding.

HannahKirk commented 2 years ago

This worked perfectly thank you!

webis-de / small-text

Adding special tokens to tokenizer (transformers-integration) #8