webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License

Use of large custom transformer model #63

Open Adaickalavan opened 3 months ago

Adaickalavan commented 3 months ago

I am referring to the active learning for text classification example given here.

In the given example, we have:

from small_text import TransformerBasedClassificationFactory, TransformerModelArguments

transformer_model_name = 'bert-base-uncased'
transformer_model = TransformerModelArguments(transformer_model_name)
clf_factory = TransformerBasedClassificationFactory(
    transformer_model,
    num_classes,
    kwargs={'device': 'cuda', 'mini_batch_size': 32,
            'class_weight': 'balanced'})

In my case, I would like to use the language model meta-llama/Llama-2-7b-chat-hf as a sequence classifier by loading it as follows:

from transformers import AutoModelForSequenceClassification

base_model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
    num_labels=1,
)

Then, I would like to fine-tune this Llama sequence classifier with active learning on the dataset Birchlabs/openai-prm800k-stepwise-critic, which I would load roughly as sketched below.
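A rough sketch of how I intend to load the data (the split and column names here are placeholders that I still need to check against the dataset card):

from datasets import load_dataset

# Placeholder split/column names; the actual field names need to be taken
# from the dataset card of Birchlabs/openai-prm800k-stepwise-critic.
dataset = load_dataset('Birchlabs/openai-prm800k-stepwise-critic')
train_split = dataset['train']

texts = train_split['text']     # hypothetical text column
labels = train_split['label']   # hypothetical label column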

Questions:

1) How do I modify the example in the repository to get a clf_factory which uses the above base_model instead of providing TransformerModelArguments?

2) How do I use small-text to handle the large model size of Llama and potentially distribute its training over multiple GPUs?

chschroeder commented 3 months ago

Hi @Adaickalavan, thank you for your interest.

  1. TransformerModelArguments is just a wrapper for the Hugging Face names/paths of the model, tokenizer, and config. Some models work out of the box, others need adaptations. I cannot cover this 100%, since the transformers library does not impose many restrictions on the individual models, and the newest models can always deviate from the existing interfaces.
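In principle, plugging in your Llama checkpoint would only require passing its name instead of 'bert-base-uncased' (a sketch only; as the error further down shows, this does not yet work out of the box):

from small_text import TransformerBasedClassificationFactory, TransformerModelArguments

# Sketch: swap in the Llama checkpoint name; tokenizer and config names
# default to the model name if not given explicitly.
transformer_model = TransformerModelArguments('meta-llama/Llama-2-7b-chat-hf')
clf_factory = TransformerBasedClassificationFactory(
    transformer_model,
    num_classes=1,
    kwargs={'device': 'cuda', 'mini_batch_size': 32})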

I briefly tried a smaller Llama model (1B):

<...>
File /path/to/site-packages/transformers/models/llama/modeling_llama.py:1371, in LlamaForSequenceClassification.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1369     batch_size = inputs_embeds.shape[0]
   1371 if self.config.pad_token_id is None and batch_size != 1:
-> 1372     raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
   1373 if self.config.pad_token_id is None:
   1374     sequence_lengths = -1

ValueError: Cannot handle batch sizes > 1 if no padding token is defined.

The error seems to be known, but the workaround is difficult to achieve with the current API. I will keep this in mind for v2.0.0, but for now I would recommend copying or subclassing TransformerBasedClassification and adapting it until it fits your needs; a sketch of the pad-token workaround such an adaptation would need is below.
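For reference, this is the usual transformers-level workaround that such a subclass (or adapted copy) would have to apply after model and tokenizer are loaded; shown here on plain transformers objects, not via the small-text API:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'meta-llama/Llama-2-7b-chat-hf'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

# Llama checkpoints ship without a padding token, which triggers the
# "Cannot handle batch sizes > 1" error above. Reuse the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id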

  2. This is currently not supported. You could write your own Classifier implementation to do that; a rough sketch of the required interface is below. Somewhere down my list of ideas I have a PyTorch Lightning integration, which would help with distributed training; however, I think for Llama 2 you will still need other libraries as well.
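A very rough sketch of such a custom classifier (assuming the abstract Classifier base class from small_text.classifiers and its fit/predict/predict_proba interface); the actual multi-GPU logic, e.g. via Hugging Face Accelerate or DeepSpeed, would live inside fit() and predict():

from small_text.classifiers import Classifier  # assumed import path


class DistributedLlamaClassifier(Classifier):
    """Sketch of a custom classifier that shards a large model over several GPUs."""

    def __init__(self, model_name, num_classes):
        self.model_name = model_name
        self.num_classes = num_classes

    def fit(self, train_set):
        # Load the model distributed over the available GPUs and fine-tune it here.
        raise NotImplementedError

    def predict(self, data_set, return_proba=False):
        # Run (distributed) inference and return the predicted labels.
        raise NotImplementedError

    def predict_proba(self, data_set):
        # Return class probabilities for query strategies that require them.
        raise NotImplementedError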