webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License

Device-side assertion not passed when training on cuda device and when there are added tokens to the tokenizer #53

Open RaymondUoE opened 7 months ago

RaymondUoE commented 7 months ago

Bug description

When using cuda, the transformer model fails a device-side assertion if additional special tokens have been added to the tokenizer. This does not happen when device='cpu' or device='mps' is specified, suggesting that this might be a PyTorch issue. However, the workaround cannot be applied through the small-text API and requires modifying its source code.

Steps to reproduce

Using small-text/tree/main/examples/examplecode/transformers_multiclass_classification.py as an example:

Change tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL.model, cache_dir='.cache/') to

tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL.model, cache_dir='.cache/')
tokenizer.add_special_tokens({'additional_special_tokens': ['[SPECIAL1]', '[SPECIAL2]']})

This will cause the device-side assertion to fail when using cuda:

clf_factory = TransformerBasedClassificationFactory(TRANSFORMER_MODEL,
                                                    num_classes,
                                                    kwargs=dict({
                                                        'device': 'cuda'
                                                    }))

due to an embedding size mismatch: the tokenizer's vocabulary has grown, but the model's embedding matrix has not.
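
For context, the mismatch can be reproduced outside small-text with plain transformers. The following is a minimal sketch; 'bert-base-uncased' is an assumed stand-in for TRANSFORMER_MODEL.model:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_special_tokens({'additional_special_tokens': ['[SPECIAL1]', '[SPECIAL2]']})

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
print(len(tokenizer))                               # 30524: the vocabulary grew by two
print(model.get_input_embeddings().num_embeddings)  # 30522: the embedding matrix did not

model = model.to('cuda')
inputs = tokenizer('some text with [SPECIAL1]', return_tensors='pt').to('cuda')
model(**inputs)  # ids >= 30522 index past the embedding matrix -> CUDA device-side assert

(On 'cpu', the same out-of-range lookup raises an ordinary IndexError rather than a device-side assert.)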

Expected behavior

The model should automatically resize its token embeddings to match the tokenizer's new vocabulary size.

Workaround:

In small_text/integrations/transformers/utils/classification.py, in the function _initialize_transformer_components, change the following

model = AutoModelForSequenceClassification.from_pretrained(
        transformer_model.model,
        from_tf=False,
        config=config,
        cache_dir=cache_dir,
        force_download=from_pretrained_options.force_download,
        local_files_only=from_pretrained_options.local_files_only
    )

to

model = AutoModelForSequenceClassification.from_pretrained(
        transformer_model.model,
        from_tf=False,
        config=config,
        cache_dir=cache_dir,
        force_download=from_pretrained_options.force_download,
        local_files_only=from_pretrained_options.local_files_only
    )
    model.resize_token_embeddings(new_num_tokens=NEW_VOCAB_SIZE)

adding the final line. This requires the new vocab size to be hard-coded, because the customised tokenizer is not accessible in this function. If the tokenizer were accessible, the final line could simply be model.resize_token_embeddings(new_num_tokens=len(tokenizer)).
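
With access to the tokenizer, the fix can be verified in plain transformers. A minimal sketch, again assuming 'bert-base-uncased' as a stand-in for the model used in the example:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_special_tokens({'additional_special_tokens': ['[SPECIAL1]', '[SPECIAL2]']})

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.resize_token_embeddings(new_num_tokens=len(tokenizer))  # embedding matrix now has 30524 rows

model = model.to('cuda')
inputs = tokenizer('some text with [SPECIAL1]', return_tensors='pt').to('cuda')
with torch.no_grad():
    logits = model(**inputs).logits  # forward pass succeeds after the resize

resize_token_embeddings preserves the existing embedding rows and randomly initializes only the added ones, so the pretrained weights are unaffected.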

Environment:

Python version: 3.11.7
small-text version: 1.3.3
small-text integrations (e.g., transformers): transformers 4.36.2
PyTorch version: 2.1.2
PyTorch-cuda: 11.8

chschroeder commented 7 months ago

Thanks for reporting this! I will look into it.

chschroeder commented 7 months ago

@RaymondUoE With just the additional tokenizer.add_special_tokens() call, I cannot reproduce the error. Can you provide details on the assertion output?