Hi all, thanks for the open-source release! Are you also planning to release the fine-tuned model weights, namely the Thermostability model?
Hi!
That's a good question. For now we are not going to release the fine-tuned models for the diverse downstream tasks, because each fine-tuned model has the same size as the original model, and releasing all of them would increase the maintenance cost. We are planning to add PEFT (Parameter-Efficient Fine-Tuning) support to our model, so users can download much smaller weight files later.
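For context, here is a rough sketch of why PEFT/LoRA keeps the downloadable files small. The checkpoint name `facebook/esm2_t33_650M_UR50D` and the LoRA hyperparameters below are illustrative assumptions, not our exact settings:

```python
# Rough sketch of the PEFT/LoRA idea: only the small adapter is saved,
# not the full backbone. Checkpoint name and hyperparameters are illustrative.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "facebook/esm2_t33_650M_UR50D", num_labels=1
)
config = LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"])
model = get_peft_model(base, config)
model.print_trainable_parameters()

# After fine-tuning, save_pretrained writes only the adapter weights
# (a few MB) instead of the full ~650M-parameter backbone.
model.save_pretrained("thermostability_lora_adapter")
```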
Interesting: freeze_backbone = True really does reduce the GPU memory requirement, but use_lora = True brings the memory requirements back up and training exits with OutOfMemoryError: CUDA out of memory on my GPU.
When you start training, some training information is printed to the screen, such as the number of trainable parameters. Fine-tuning with LoRA may still require a GPU with a certain amount of memory.
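If you want to check this yourself, a quick way to count what is actually trainable (assuming `model` is the nn.Module you pass to the trainer) is:

```python
# Count trainable vs. total parameters of the model being trained.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} || all params: {total} "
      f"|| trainable%: {100 * trainable / total}")
```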
LoRA model is initialized for training. trainable params: 7723521 || all params: 658966422 || trainable%: 1.17206594177571
It goes through the freeze_backbone branch (without LoRA I can train it):
│ 32 │ │ if self.freeze_backbone: │
│ ❱ 33 │ │ │ repr = torch.stack(self.get_hidden_states(inputs, reduction="mean")) │
│ 34 │ │ │ x = self.model.classifier.dropout(repr) │
│ 35 │ │ │ x = self.model.classifier.dense(x) │
│ 36 │ │ │ x = torch.tanh(x)
....
│ /opt/conda/lib/python3.7/site-packages/transformers/models/esm/modeling_esm.py:378 in forward │
│ │
│ 375 │ │ if head_mask is not None: │
│ 376 │ │ │ attention_probs = attention_probs * head_mask │
│ 377 │ │ │
│ ❱ 378 │ │ context_layer = torch.matmul(attention_probs, value_layer) │
│ 379 │ │ │
│ 380 │ │ context_layer = context_layer.permute(0, 2, 1, 3).contiguous() │
│ 381 │ │ new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
OutOfMemoryError: CUDA out of memory. Tried to allocate 82.00 MiB (GPU 0; 15.78 GiB total capacity; 14.38 GiB already allocated; 12.44 MiB free; 14.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0: 0%| | 0/3166 [00:06<?, ?it/s]
It seems that even 7,723,521 trainable parameters exceed the limit of your GPU memory. We recommend setting a minimal batch size or just freezing the whole backbone.
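As a sketch of the usual memory knobs (the allocator setting comes straight from the error message above; the dataset stand-in and the rest are illustrative and depend on your training script):

```python
import os
# Set the allocator hint before the first CUDA allocation (from the error message).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for a tokenized dataset; replace with your own.
train_dataset = TensorDataset(torch.zeros(32, 128, dtype=torch.long))
loader = DataLoader(train_dataset, batch_size=1, shuffle=True)  # minimal batch size

# For Hugging Face backbones such as ESM, activation checkpointing trades
# extra compute for a large drop in peak memory:
# model.gradient_checkpointing_enable()
```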