webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License

Question: Regarding potential bottleneck during training of Classifiers #36

Closed: JP-SystemsX closed this issue 1 year ago

JP-SystemsX commented 1 year ago

I am currently trying to optimize an AL pipeline using small-text, so I have been looking into the source code and came across something that looks slightly bizarre. In integrations/transformers/classifiers/classification.py, the _get_layer_params() function executes the following code to collect the parameters to train:

    if hasattr(base_model, 'encoder'):
        layers = base_model.encoder.layer
    else:
        layers = base_model.transformer.layer

If I understand correctly, this implies that instead of training the head of the (in my case) RobertaForSequenceClassification (RSC) model, the whole encoder is trained and the head is neglected, which would be far more computationally expensive and quite an unusual approach to the problem. Therefore, the following questions arise for me:

  1. Is this intended?
  2. If so, why?
  3. Is the head (called classifier in RSC) trained somewhere else? I wasn't able to find where.
  4. If so, can the training of the encoder be disabled to speed up the training process?
chschroeder commented 1 year ago

Hi @JP-SystemsX,

Sorry, this part could use some more detailed comments. I am happy someone questioned this, though: the more people who do, the better the quality of the implementations can be ensured.

1. Is this intended?

Yes, it is! This is used to set different learning rates and gradient decay multipliers for different parameter groups (where one group is a "layer" in this specific small-text example).

This reportedly helps against catastrophic forgetting, and moreover, it is part of the ULMFiT method. I first learned about this from "How to Fine-Tune BERT for Text Classification?", but if I remember correctly, I found such a separation by parameter groups in one of the first BERT implementations as well. As soon as I remember where, I will add this here.

2. What happens here? And Why?

The function _get_layer_params() constructs a list of parameter dictionaries for the torch optimizer. This is used to override the optimizer settings of specific parameter groups. Have a look at per-parameter options in torch.optim, I think this is what caused the confusion (in addition to the insufficient documentation). For parameters that have not been specified, the optimizer uses the base settings, e.g. the base learning rate.
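For illustration, here is a minimal, self-contained sketch of how per-parameter options work in plain PyTorch (the two linear layers and learning rates here are made up; this is not small-text code):

    import torch

    # Two toy modules standing in for "layers" of a larger model.
    layer_a = torch.nn.Linear(10, 10)
    layer_b = torch.nn.Linear(10, 2)

    optimizer = torch.optim.AdamW(
        [
            {'params': layer_a.parameters(), 'lr': 1e-5},  # group-specific learning rate
            {'params': layer_b.parameters()},              # falls back to the base lr below
        ],
        lr=2e-5,  # base settings, used wherever a group does not override them
    )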

Next, you can see how the optimizer is constructed for the TransformerBasedClassification class.

3. Where is the classifier head?

The actual model is referenced by the classifier.model attribute. For BERT, this is a BertForSequenceClassification object, where in turn you can find the classifier head.
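As a quick illustration (plain transformers, outside of small-text; the checkpoint name is just an example), the head sits directly on the *ForSequenceClassification object next to the encoder backbone:

    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
    print(type(model.base_model))  # the BertModel backbone (embeddings + encoder + pooler)
    print(model.classifier)        # the classification head, a small linear layer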

4. If so can the training of the encoder be disabled to speed up the training process ?

Definitely possible. Not sure how convenient it is with the current interfaces.

Whether that is reasonable depends on your goal. Do you just want to minimize the training time? There are currently a lot of methods for parameter-efficient fine-tuning around; they might be what you are looking for. Admittedly, I haven't had the time yet to use them for active learning, but this would be interesting to me as well. There is, however, an excellent paper on this topic by @josipjukic.

I am definitely interested in hearing more about your progress here. Also, if you find any interfaces that are difficult to extend (and there are some; I have a few on my radar already), please let me know.

JP-SystemsX commented 1 year ago

Hi @chschroeder, thank you so much for your answer. The papers you recommended were fascinating and helped me a lot in understanding what's going on. However, if I understand the optimizer docs you kindly provided correctly, it seems to me that every parameter that should be trained needs to be included in some group, and some groups just don't need a specific learning rate assigned to them. Am I wrong about that?

If I am right with that assumption, then I can see how all parameters are provided in the else-clause of the code you marked. But if the condition is met, i.e. fine-tuning arguments are provided, it seems to include only the parameters of the encoder and to completely neglect the classifier, as can be seen here. That's the part that confuses me the most.

chschroeder commented 1 year ago

You are completely right. I think this might be a bug. This only occurs if fine-tuning arguments are set, as you correctly stated.

I think the way the parameter dict is currently assembled is not optimal, and if I just added the missing parameters now, it would remain prone to similar bugs (for other models). Maybe we should collect all parameters first, and then try to override the "layer-specific" parameters. What do you think?

JP-SystemsX commented 1 year ago

Sure, that sounds like a good idea. However, if we try to get them the same way as without fine-tuning arguments, i.e. via params = [param for param in self.model.parameters() if param.requires_grad], we have to take into account that they are no longer as well grouped: via _get_layer_params() we get up to 12 neatly grouped RobertaLayers, each consisting of 16 sublayers, which makes gradual unfreezing and the like relatively simple, whereas via self.model.parameters() we immediately get all 201 sublayers flattened out. The problem is, as you already pointed out, that each model is different, so it might be a bit hard to group them back together such that all sublayers of a layer get the same learning rate or are frozen/unfrozen together. Do you have an idea how to preserve or reconstruct the grouping?

chschroeder commented 1 year ago

> However, if we try to get them the same way as without fine-tuning arguments, i.e. via params = [param for param in self.model.parameters() if param.requires_grad]

Yes, I agree on this. What I have so far is a mechanism that checks all trainable parameters against our constructed set of parameters. If they are not the same size, we can raise an error and at least prevent such a situation from happening again (e.g., with incompatible models).

> Do you have an idea how to preserve or reconstruct the grouping?

With the above mechanism, I think one possibility is to just collect all parameters, and override those which we recognize as belonging to a specific layer. Alternatively, we can collect the layer-specific parameters first, and then add all others.

Moreover, it has to be compatible with the existing gradual unfreezing mechanism, which has to be changed to work with this.
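To sketch what I mean (purely illustrative, not the actual implementation; the function name and the base_lr/decay_factor arguments are made up):

    def build_param_groups(model, base_model, base_lr, decay_factor):
        # Illustrative sketch: build layer-specific groups first, then add a
        # catch-all group for everything else, and finally verify that no
        # trainable parameter was missed.
        groups = []
        seen = set()

        if hasattr(base_model, 'encoder'):
            layers = base_model.encoder.layer
        else:
            layers = base_model.transformer.layer

        num_layers = len(layers)
        for i, layer in enumerate(layers):
            params = [p for p in layer.parameters() if p.requires_grad]
            seen.update(id(p) for p in params)
            # Lower layers get smaller learning rates (discriminative fine-tuning).
            groups.append({'params': params,
                           'lr': base_lr * decay_factor ** (num_layers - 1 - i)})

        # Everything not covered above: embeddings, pooler, classifier head, ...
        remaining = [p for p in model.parameters()
                     if p.requires_grad and id(p) not in seen]
        groups.append({'params': remaining})

        # Safety check: the groups must cover exactly the trainable parameters.
        covered = sum(len(g['params']) for g in groups)
        trainable = sum(1 for p in model.parameters() if p.requires_grad)
        if covered != trainable:
            raise ValueError('Parameter groups do not cover all trainable parameters.')

        return groups

The returned list could then be passed directly to the torch optimizer as its first argument.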

JP-SystemsX commented 1 year ago

Sounds like a promising approach. As I am fairly new to PyTorch, I would definitely be interested in seeing how this works out in code.

The only objection I could come up with is the following: if I understand your approach correctly, it would treat all of the remaining modules the same. While inspecting the RoBERTa model, I found that this would include the token embeddings, the classifier, and possibly a pooler. I am not sure whether that will pose a problem, but I imagine it might be at least a bit odd to freeze the lower encoder layers while keeping the embeddings unfrozen.

Alternatively, one could ask for the classifier's parameters and embedding parameters in a similar way as is already done, via something like:

    if hasattr(model, 'classifier'):
        layers.append(model.classifier.parameters())

This would have the nice side effect that one could freeze the entire encoder and train only the classifier, to speed up training and decrease response time if the pipeline is used together with a real user.

One could do the same for the embeddings and the pooler, e.g.:

    if hasattr(base_model, 'pooler') and base_model.pooler is not None:
        layers.append(base_model.pooler.parameters())
    if hasattr(base_model, 'embeddings'):
        layers = [base_model.embeddings.parameters()] + layers

To be fair, this is probably not the most elegant way to do it, and it is based on the assumption that the naming conventions are consistent across the Hugging Face classifier models, which does not need to be the case.

But I would definitely include your ideas as well, and not only because I want to see how this is done (even though I do), but because it might be a good way to avoid unexpected errors, for example when an unexpected layer appears.

chschroeder commented 1 year ago

> Sounds like a promising approach. As I am fairly new to PyTorch, I would definitely be interested in seeing how this works out in code.

You can obtain all trainable parameters via params = [param for param in self.model.parameters() if param.requires_grad]. The list of parameters for the optimizer config needs to be of exactly that size.

> The only objection I could come up with is the following: if I understand your approach correctly, it would treat all of the remaining modules the same. While inspecting the RoBERTa model, I found that this would include the token embeddings, the classifier, and possibly a pooler. I am not sure whether that will pose a problem, but I imagine it might be at least a bit odd to freeze the lower encoder layers while keeping the embeddings unfrozen.

Treating everything the same is the only thing I can think of that could support a wider range of models, exactly because you don't know what they named their embedding module, or whether there are any other modules beyond this.

> This would have the nice side effect that one could freeze the entire encoder and train only the classifier, to speed up training and decrease response time if the pipeline is used together with a real user.

Most importantly, the size of the final list of parameters needs to match the list above exactly.

Freezing a layer is best done before the params are collected (and should be done via param.requires_grad = False); then the frozen parameters are already excluded at this point. Still, your objection is valid: having frozen parameters should be supported here as well. (This might be a separate change though; I am not sure about the scope of this.)
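For example, a minimal sketch of freezing the whole encoder up front, assuming a BERT-like model from transformers (not small-text's interface):

    import torch
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
    base_model = model.base_model  # the encoder backbone

    # Freeze the entire encoder before the parameters are collected.
    for param in base_model.encoder.parameters():
        param.requires_grad = False

    # Collected afterwards, the frozen parameters are already excluded.
    params = [param for param in model.parameters() if param.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=2e-5)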

> Alternatively, one could ask for the classifier's parameters and embedding parameters in a similar way as is already done, via something like:

    if hasattr(model, 'classifier'):
        layers.append(model.classifier.parameters())

Sure, this also works, but it is similar to the way it is done now. The weakness here is that if my classifier head has a different variable name than 'classifier', it will be missed again.

Either way, with the new check, this would at least raise an error. But if the more general approach works, I would be in favor of that.

chschroeder commented 1 year ago

How urgent is this for you? I have already started a fix, but I will not have much time during the next few days, so I cannot promise anything right now. It might be done within the week, otherwise likely within 1-2 weeks.

If you are interested in contributing, you could of course try to provide a PR yourself.

Afterwards, I will likely make a bugfix release (v1.3.1).

JP-SystemsX commented 1 year ago

No, don't worry, it's not urgent for me at all. Thanks to the resources you provided I understand the code a lot better now, and I found out that the fine-tuning arguments are probably not the right approach for my situation, as I should keep my model as vanilla as possible so that my results remain easily reproducible.

JP-SystemsX commented 1 year ago

I just made a PR; please check it before merging, since, as I said, I am fairly new to PyTorch. I couldn't come up with a different way to preserve the ordering of layers and sublayers, which is needed to follow the methods described in the original papers. Therefore, I sadly had to fall back on the naming conventions, which e.g. already don't work for ALBERT, but RoBERTa and BERT now seem to work at least. I also check at the end whether all trainable parameters were at least considered.

While doing so, I noticed a small bug in the calculation of the learning rate as well: it was sometimes raised to a negative power, which made the learning rate explode. I tried to fix that too.
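For context, layer-wise learning rate decay is usually computed along these lines (a sketch of the common formulation, not the exact code from the PR):

    base_lr = 2e-5
    decay_factor = 0.95
    num_layers = 12

    # Layer 0 is the lowest encoder layer; the top layer keeps the base learning rate.
    layer_lrs = [base_lr * decay_factor ** (num_layers - 1 - i) for i in range(num_layers)]
    # A sign or off-by-one error in the exponent makes decay_factor ** (negative number)
    # exceed 1.0, so the learning rate grows layer by layer instead of decaying.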

All was done under the assumption that a more general approach would do either of the following 2:

chschroeder commented 1 year ago

Thanks for the new PR :+1:.

I am currently in the process of reviewing and have already learned and realized a lot.

  1. We disagreed on what constitutes a "layer". (Nothing bad, this just means the code and functionality were ambiguous before.)
  2. The functionality we are trying to build relies on a lot of assumptions.
    • Certain modules have to be present (encoder, layers)
    • Certain modules have to be named a certain way (layer0, layer1, ..., layerN)
    • Also, the "layer" structure is not reflected in the object structure, so what constitutes a layer is really subjective.

I wonder if this feature should have been included at all, but for now I will stick with the original idea that we are able to use this for discriminative learning rates/gradual unfreezing on the most common transformer models.

chschroeder commented 1 year ago

Done. This is trickier than I thought, I am sorry. This is a difficult first PR ;).

I have suggested a slightly different strategy than before. This one basically assumes the following structure of a transformer class:

These layers are then flattened in the following way:

I think this is the most flexible approach we can take. Let me know what you think.

JP-SystemsX commented 1 year ago

OK, I thought a bit more about it and found these four alternatives:

Please tell me which approach you prefer.

chschroeder commented 1 year ago

> We keep the approach as is

Do you mean the version before or after your PR? I found the version from your PR quite good in the end. Did you see my latest comment in the PR (*)? I would prefer this for now. This fixes the functionality "as is", and then we can think about deprecating or replacing it.

> We use the named children approach: your counter-argument that we cannot infer the module order correctly convinced me, so I dropped that idea.

Unless we could get the order right somehow, this seems like a bad option.

> We use a hook to get the order from the forward pass

I thought about this as well, but even then: what is a layer? If we go by the autograd graph, we might be flexible, but still not get the results the user intends. Or am I wrong here?

I think the most correct option would be a configuration mechanism (passing it in from the "outside"), but then again, that is a pain to set up ;).

Edit: Do you agree here? If so, I will merge the PR.


(*): My favored solution would be to merge your PR, i.e. the first of your presented options, but with your fixes.

JP-SystemsX commented 1 year ago

So sorry, yesterday was a busy day and I forgot to check. But yes, I am OK with all of the approaches; they all have their own specific advantages and disadvantages.

So yes, I agree: the hook would only be a complicated fix for the rare problem that might arise when using the named_children method, and one would still need to "manually" open the encoder by name.

Touché: I forgot the option of passing it in from the outside ;)

But yes, I am fine with merging; no further changes were made since the PR. I would have made some if another approach had been preferred, but if you are happy with the current version, I have no objection to the merge.

chschroeder commented 1 year ago

No worries! Thanks for the update and for your opinion on this.

Unfortunately, I'm busy today, but I think I'll get the rest done within the next two days, and then we will have a bugfix release soon :).

I will keep you updated.

chschroeder commented 1 year ago

    if hasattr(base_model, 'embeddings'):
        layers.append(base_model.embeddings.parameters())

    if hasattr(base_model, 'encoder'):
        if hasattr(base_model.encoder, 'layer'):
            layers += [l.parameters() for l in base_model.encoder.layer]
    else:
        layers += [l.parameters() for l in base_model.transformer.layer]

    if hasattr(base_model, 'pooler') and base_model.pooler is not None:
        layers.append(base_model.pooler.parameters())
    if hasattr(model, 'classifier'):
        layers.append(model.classifier.parameters())

Now, while adding the tests, I am wondering whether we should be stricter about asserting the existence of the embeddings and classifier layers. In our discussion we said that we now expect a certain structure, and otherwise the generic mechanism will not be usable (because we prefer raising an error over risking faulty behavior).

I would like to require at least the classifier, if not both. Do you agree with this?
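For what it's worth, an illustrative version of such a check could simply fail fast (the function name and messages here are made up):

    def check_expected_modules(model, base_model):
        # Illustrative only: raise instead of silently building an incomplete
        # set of parameter groups for an unsupported model structure.
        if not hasattr(model, 'classifier'):
            raise ValueError('Expected the model to have a "classifier" module.')
        if not hasattr(base_model, 'embeddings'):
            raise ValueError('Expected the base model to have an "embeddings" module.')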

Edit: By the way, do you want to be listed in the list of contributors?

chschroeder commented 1 year ago

Done. I cleaned up a little, extracted the function to a util, wrote tests, and added documentation and a changelog entry. The logic is otherwise exactly as you built it.

Thank you for the great work, pleasant communication, and helpful discussions!