yyyyychen / LowMemoryBP

The official implementation of the paper "Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation"

Issue with MS-Layernorm in Roberta-base #6

Open NitzanHod opened 1 month ago

NitzanHod commented 1 month ago

Memory-Sharing normalization avoids saving its input activation for the backward pass, relying on the next layer to save the normalization output instead. This makes sense if the next layer is linear (as in LLaMA), but in RoBERTa all LayerNorms appear before nn.Dropout layers. Sadly, Dropout layers save only their binary mask for backward, not their input activation.
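
To illustrate the point, here is a minimal sketch (not from this repo) that uses torch.autograd.graph.saved_tensors_hooks (available since PyTorch 1.11) to log what each op stashes for backward; the helper function and toy sizes are mine:

import torch
import torch.nn as nn

def show_saved(label, fn):
    # Record (shape, dtype) of every tensor autograd saves for the backward pass.
    saved = []
    def pack(t):
        saved.append((tuple(t.shape), t.dtype))
        return t
    with torch.autograd.graph.saved_tensors_hooks(pack, lambda t: t):
        fn().sum().backward()
    print(label, saved)

hidden = 8
x = torch.randn(2, 4, hidden, requires_grad=True)
ln = nn.LayerNorm(hidden)
linear = nn.Linear(hidden, hidden)
drop = nn.Dropout(p=0.1)  # modules are in training mode by default

# LayerNorm -> Linear: the Linear saves its input, which is exactly the
# LayerNorm output, so a memory-sharing LayerNorm can reuse that tensor.
show_saved("LayerNorm -> Linear:", lambda: linear(ln(x)))

# LayerNorm -> Dropout: Dropout saves only its mask, so nothing downstream
# keeps the LayerNorm output around.
show_saved("LayerNorm -> Dropout:", lambda: drop(ln(x)))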

Neither your code nor your paper mentions this problem anywhere, yet you report results on RoBERTa-base.

I know you said you'll add the RoBERTa results eventually, but could you please explain how this problem is overcome?

shimudong commented 1 month ago

Thank you for your question.

When we fine-tune RoBERTa with LoRA, the LayerNorms inside the encoder layers (the only other one is in RobertaEmbeddings) are in RobertaSelfOutput and RobertaOutput. As the code below shows, they come after the nn.Dropout layers, and before the linear layers of the next RobertaAttention / RobertaIntermediate modules, so their outputs are saved by those linear layers anyway.

The code from transformers: https://huggingface.co/transformers/v3.5.1/_modules/transformers/modeling_roberta.html

# Copied from transformers.modeling_bert.BertSelfOutput
class RobertaSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states

# Copied from transformers.modeling_bert.BertOutput
class RobertaOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
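
To double-check this on the actual module, here is a rough sketch (my own, assuming a recent transformers release where the class lives under transformers.models.roberta.modeling_roberta, and PyTorch >= 1.11 for the saved-tensors hooks): it runs RobertaSelfOutput followed by a stand-in Linear for the next sublayer and verifies that the LayerNorm output is among the tensors the downstream Linear saves for backward.

import torch
import torch.nn as nn
from transformers import RobertaConfig
# In v3.5.1 (linked above) the class is in transformers.modeling_roberta instead.
from transformers.models.roberta.modeling_roberta import RobertaSelfOutput

config = RobertaConfig(hidden_size=16, num_attention_heads=2,
                       intermediate_size=32, hidden_dropout_prob=0.1)
self_output = RobertaSelfOutput(config)  # dense -> dropout -> LayerNorm
next_linear = nn.Linear(16, 16)          # stand-in for the next sublayer's linear projection

hidden = torch.randn(2, 4, 16, requires_grad=True)
residual = torch.randn(2, 4, 16)

saved_ptrs = set()
def pack(t):
    saved_ptrs.add(t.data_ptr())
    return t

with torch.autograd.graph.saved_tensors_hooks(pack, lambda t: t):
    ln_out = self_output(hidden, residual)  # the LayerNorm output is the sublayer output
    next_linear(ln_out).sum().backward()    # the following Linear saves its input for backward

# The LayerNorm output stays alive because the next (linear) layer saves it,
# which is exactly the tensor a memory-sharing LayerNorm reuses in backward.
print(ln_out.data_ptr() in saved_ptrs)  # expected: True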