@RaymondUoE With just the additional tokenizer.add_special_tokens() call, I cannot reproduce the error. Can you provide details on the assertion output?
Bug description
When using CUDA, the transformer model fails a device-side assertion if additional special tokens are added to the tokenizer. This does not happen when device='cpu' or device='mps' is specified, suggesting that this might be a PyTorch issue. However, the workaround cannot be applied through the small-text API and requires modifying its source code.
Steps to reproduce
Using small-text/tree/main/examples/examplecode/transformers_multiclass_classification.py as an example, change
tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL.model, cache_dir='.cache/')
to a version that additionally calls tokenizer.add_special_tokens() (see the sketch below). This will cause the device-side assertion to fail when using CUDA, due to an embedding size mismatch.
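A minimal sketch of the modified snippet, assuming arbitrary placeholder special tokens (the token strings are illustrative, not the ones from the original setup):

from transformers import AutoTokenizer  # TRANSFORMER_MODEL is defined in the example script

tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL.model, cache_dir='.cache/')
# Adding special tokens grows the tokenizer vocabulary, but the model's
# embedding matrix is not resized to match, so the new token ids later
# index past the embedding table and trigger the device-side assert on CUDA.
tokenizer.add_special_tokens({'additional_special_tokens': ['[SPECIAL_1]', '[SPECIAL_2]']})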
Expected behavior
The model should adjust its embedding size to the new vocabulary size automatically.
Workaround:
In file small_text/integrations/transformers/utils/classification.py, function _initialize_transformer_components, add a model.resize_token_embeddings(...) call as the final line of the model initialization (see the sketch below). This requires the new vocab size to be hard-coded, because the customised tokenizer is inaccessible in this function. If the tokenizer were accessible, the final line could simply be changed to
model.resize_token_embeddings(new_num_tokens=len(tokenizer))
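For reference, a minimal sketch of the patched initialization, assuming the function builds the model with AutoModelForSequenceClassification.from_pretrained and that transformer_model, config and cache_dir are already in scope; the surrounding call is paraphrased rather than the exact small-text source, and the hard-coded size is a placeholder:

from transformers import AutoModelForSequenceClassification

NEW_NUM_TOKENS = 30524  # placeholder: base vocabulary size plus the added special tokens

model = AutoModelForSequenceClassification.from_pretrained(
    transformer_model.model,
    config=config,
    cache_dir=cache_dir
)
# Added final line: grow the embedding matrix to match the enlarged tokenizer,
# so the added token ids no longer index past the embedding table on CUDA.
model.resize_token_embeddings(new_num_tokens=NEW_NUM_TOKENS)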
Environment:
Python version: 3.11.7
small-text version: 1.3.3
small-text integrations (e.g., transformers): transformers 4.36.2
PyTorch version: 2.1.2
PyTorch-cuda: 11.8