webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License

Device-side assertion not passed when training on cuda device and when there are added tokens to the tokenizer #53

Open RaymondUoE opened 7 months ago

RaymondUoE commented 7 months ago

Bug description

When using cuda, the transformer model fails a device-side assertion if additional special tokens have been added to the tokenizer. This does not happen when device='cpu' or device='mps' is specified, suggesting that this might be a PyTorch issue. However, the workaround cannot be applied through the small-text API and requires modifying its source code.

Steps to reproduce

Using small-text/tree/main/examples/examplecode/transformers_multiclass_classification.py as an example:

Change tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL.model, cache_dir='.cache/') to

tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL.model, cache_dir='.cache/')
tokenizer.add_special_tokens({'additional_special_tokens': ['[SPECIAL1]', '[SPECIAL2]']})

This will cause the device-side assertion to fail when using cuda:

clf_factory = TransformerBasedClassificationFactory(TRANSFORMER_MODEL,
                                                    num_classes,
                                                    kwargs=dict({
                                                        'device': 'cuda'
                                                    }))

due to an embedding size mismatch: the tokenizer's vocabulary has grown, but the model's embedding matrix has not.
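
For context, the mismatch can be reproduced outside small-text with plain transformers. The following is a minimal sketch; 'bert-base-uncased' is an assumed stand-in for TRANSFORMER_MODEL.model:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_special_tokens({'additional_special_tokens': ['[SPECIAL1]', '[SPECIAL2]']})

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
print(len(tokenizer))                               # 30524: the vocabulary grew by two
print(model.get_input_embeddings().num_embeddings)  # 30522: the embedding matrix did not

model = model.to('cuda')
inputs = tokenizer('some text with [SPECIAL1]', return_tensors='pt').to('cuda')
model(**inputs)  # ids >= 30522 index past the embedding matrix -> CUDA device-side assert

(On 'cpu', the same out-of-range lookup raises an ordinary IndexError rather than a device-side assert.)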

Expected behavior

The model should automatically resize its token embeddings to match the tokenizer's new vocabulary size.

Workaround:

In small_text/integrations/transformers/utils/classification.py, in the function _initialize_transformer_components, change the following

model = AutoModelForSequenceClassification.from_pretrained(
        transformer_model.model,
        from_tf=False,
        config=config,
        cache_dir=cache_dir,
        force_download=from_pretrained_options.force_download,
        local_files_only=from_pretrained_options.local_files_only
    )

to

model = AutoModelForSequenceClassification.from_pretrained(
        transformer_model.model,
        from_tf=False,
        config=config,
        cache_dir=cache_dir,
        force_download=from_pretrained_options.force_download,
        local_files_only=from_pretrained_options.local_files_only
    )
    model.resize_token_embeddings(new_num_tokens=NEW_VOCAB_SIZE)

adding the final line. This requires the new vocab size to be hard-coded, because the customised tokenizer is not accessible in this function. If the tokenizer were accessible, the final line could simply be model.resize_token_embeddings(new_num_tokens=len(tokenizer)).
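
With access to the tokenizer, the fix can be verified in plain transformers. A minimal sketch, again assuming 'bert-base-uncased' as a stand-in for the model used in the example:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_special_tokens({'additional_special_tokens': ['[SPECIAL1]', '[SPECIAL2]']})

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.resize_token_embeddings(new_num_tokens=len(tokenizer))  # embedding matrix now has 30524 rows

model = model.to('cuda')
inputs = tokenizer('some text with [SPECIAL1]', return_tensors='pt').to('cuda')
with torch.no_grad():
    logits = model(**inputs).logits  # forward pass succeeds after the resize

resize_token_embeddings preserves the existing embedding rows and randomly initializes only the added ones, so the pretrained weights are unaffected.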

Environment:

Python version: 3.11.7
small-text version: 1.3.3
small-text integrations (e.g., transformers): transformers 4.36.2
PyTorch version: 2.1.2
PyTorch-cuda: 11.8

chschroeder commented 7 months ago

Thanks for reporting this! I will look into it.

chschroeder commented 7 months ago

@RaymondUoE With just the additional tokenizer.add_special_tokens() call, I cannot reproduce the error. Can you provide details on the assertion output?