yz93 / LAVT-RIS


About Bert model training #26

Closed YeonjeeJung closed 1 year ago

YeonjeeJung commented 1 year ago

Hi, I am very interested in your paper and have been using your code for training. However, I have a question about the BERT model training part.

In your code, only the parameters of the first 10 layers of the BERT model are included in `params_to_optimize`:

if args.model != 'lavt_one':
    params_to_optimize = [
        {'params': backbone_no_decay, 'weight_decay': 0.0},
        {'params': backbone_decay},
        {"params": [p for p in single_model.classifier.parameters() if p.requires_grad]},
        # the following are the parameters of bert
        {"params": reduce(operator.concat,
                          [[p for p in single_bert_model.encoder.layer[i].parameters()
                            if p.requires_grad] for i in range(10)])},
    ]
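
For context, the `reduce(operator.concat, ...)` call simply flattens the per-layer parameter lists into one list (it relies on `from functools import reduce` and `import operator` elsewhere in the file); a plain list comprehension would be equivalent:

# Equivalent to the reduce(operator.concat, ...) call above: collect the
# trainable parameters of BERT encoder layers 0-9 into one flat list.
bert_params = [p
               for i in range(10)
               for p in single_bert_model.encoder.layer[i].parameters()
               if p.requires_grad]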

I wonder why the parameters of the last two layers are not optimized. In general, when training a BERT model, my understanding is that either all parameters are optimized or only the last few layers are. I'll be waiting for your reply. Thank you.

yz93 commented 1 year ago

Hi, thank you.

That particular choice (optimizing the first 10 layers) was inherited from Ref-VOS without much thought or investigation of my own. I never found it to be an issue and didn't pay much attention to it. It is a bit counter-intuitive, and interesting, I'd say.

As you said, and I agree, it is more common and probably more intuitive to choose one of the following: (1) optimize all layers of BERT, excluding the embeddings; (2) optimize all of BERT, including the embeddings; (3) optimize just the last few layers; or (4) freeze BERT entirely. In addition, I don't know whether it makes sense to optimize only certain middle layers, e.g., add one layer to the optimizer for every 3 layers, such as the 12th, 9th, 6th, and 3rd layers (probably a bad idea).
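
For concreteness, here is a minimal sketch of how options (1) and (3) could be set up as optimizer parameter groups. It reuses the variable names from the snippet above purely for illustration and assumes a standard 12-layer `BertModel`; it is not the repo's actual code:

# Option (1): all 12 encoder layers, excluding the embeddings.
bert_all_layers = [p for layer in single_bert_model.encoder.layer
                   for p in layer.parameters() if p.requires_grad]

# Option (3): only the last few encoder layers (here, the last two).
bert_last_layers = [p for layer in single_bert_model.encoder.layer[-2:]
                    for p in layer.parameters() if p.requires_grad]

params_to_optimize = [
    {'params': backbone_no_decay, 'weight_decay': 0.0},
    {'params': backbone_decay},
    {'params': [p for p in single_model.classifier.parameters() if p.requires_grad]},
    {'params': bert_all_layers},  # swap in bert_last_layers for option (3)
]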

So, in general, I don't have a conclusion about those choices, except that I also feel fine-tuning the embeddings would be a very bad idea, because the corpora of these evaluation datasets are too small.
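
If you want to guarantee that the embeddings stay fixed (rather than just leaving them out of the optimizer), a few lines like the following would do it, again assuming a standard `BertModel` layout:

# Freeze the token/position/segment embeddings so they receive no gradient updates.
for p in single_bert_model.embeddings.parameters():
    p.requires_grad = False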

I don't have machines at the moment, so I can't test these myself. Feel free to investigate this if you'd like.

Thanks!

yz93 commented 1 year ago

I wanted to add a minor comment that is not related to optimizing the language model, but to feature fusion.

You could try extracting multiple levels of features from BERT instead of just the last one, and then using them for feature fusion at the respective stages, to see whether this gives better results at all.
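
A rough sketch of the idea, written against the Hugging Face transformers API for illustration (the choice of which hidden states feed which fusion stage is a hypothetical mapping, not something I have tested):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer('a man in a red shirt', return_tensors='pt')
with torch.no_grad():
    outputs = bert(**inputs, output_hidden_states=True)

# hidden_states is a tuple of 13 tensors: the embedding output followed by
# the outputs of the 12 encoder layers, each of shape (batch, seq_len, 768).
hidden_states = outputs.hidden_states

# Hypothetical mapping: one intermediate BERT level per fusion stage
# (e.g., the 4 stages of LAVT's Swin backbone) instead of reusing the
# final layer everywhere.
stage_feats = [hidden_states[i] for i in (3, 6, 9, 12)]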