yikaiw / CEN

[TPAMI 2023, NeurIPS 2020] Code release for "Deep Multimodal Fusion by Channel Exchanging"
MIT License

Method to choose a good lambda (in Equation 4) #8

Closed Beniko95J closed 3 years ago

Beniko95J commented 3 years ago

Hi, I am trying to use channel exchanging in a multimodal self-supervised network (depth and RGB) and have followed this line of code to add the sparse constraints.

When I plot the scaling factors under the sparse constraints, I find that all of them eventually decrease to zero. However, Figure 5 in your paper seems to show a stable ratio of scaling factors that do not become zero (they stay above the threshold and are therefore not exchanged). May I ask whether you have encountered the case where all scaling factors become zero in your experiments?
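For reference, this is a minimal sketch of the kind of sparse constraint I mean: an L1 penalty on the BatchNorm scaling factors, in the spirit of network slimming. The helper names are my own, and for brevity I penalize every BN weight here (CEN gathers the penalized factors into slim_params, which may cover only part of the channels):

```python
import torch
import torch.nn as nn

def collect_slim_params(model):
    # Illustrative helper: gather the BatchNorm scaling factors (gamma) to sparsify.
    # CEN collects these into `slim_params`; here we simply take every BN weight.
    return [m.weight for m in model.modules() if isinstance(m, nn.BatchNorm2d)]

def sparsity_loss(slim_params, lam):
    # L1 penalty in the spirit of Equation 4: lam * sum_i |gamma_i|
    return lam * sum(p.abs().sum() for p in slim_params)

# Training step (sketch):
# loss = task_loss + sparsity_loss(slim_params, lam)
# loss.backward(); optimizer.step()
```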

Best, beniko

yikaiw commented 3 years ago

Hi, thanks for your interest in our work. You can have a look at exchange_details.txt in the provided Google Drive link, which shows that when the best performance is reached at epoch 174, the portion of slim_params below the threshold is only 8.35%:

```
Epoch 174, 3% smallest slim_params: 0.0009
Epoch 174, portion of slim_params < 2e-02: 0.0835
Epoch 174  (rgb)     glob_acc=76.27    mean_acc=63.95    IoU=50.64    0.69
Epoch 174  (depth)   glob_acc=75.00    mean_acc=60.98    IoU=47.71    0.31
Epoch 174  (ens)     glob_acc=77.01    mean_acc=64.37    IoU=51.58            (best)
```

Yet eventually, at the last epoch (449), the portion of slim_params below the threshold reaches 45.7%. Fortunately, the performance remains stable even under such a large exchanging portion.
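If it helps with your debugging, the statistics in the log above can be tracked with something like the following sketch (the threshold 2e-2 follows the log; the variable names are illustrative, not the actual CEN code):

```python
import torch

def log_exchange_stats(slim_params, epoch, threshold=2e-2):
    # Flatten all penalized BN scaling factors into a single vector.
    gammas = torch.cat([p.detach().abs().view(-1) for p in slim_params])
    # Mean of the 3% smallest scaling factors.
    k = max(1, int(0.03 * gammas.numel()))
    smallest_mean = gammas.sort().values[:k].mean().item()
    # Fraction of scaling factors below the exchange threshold.
    portion = (gammas < threshold).float().mean().item()
    print(f"Epoch {epoch}, 3% smallest slim_params: {smallest_mean:.4f}")
    print(f"Epoch {epoch}, portion of slim_params < {threshold:.0e}: {portion:.4f}")
```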

If lambda (in Equation 4) is set too large, the scaling factors quickly decrease to zero before the fusion process has had enough training steps, so a mild lambda needs to be chosen.

Finally, I'm not sure about your self-supervised task. In our framework, both the RGB branch and the depth branch are supervised, and the two losses are added up. Does this pipeline still hold in your self-supervised setting?

Beniko95J commented 3 years ago

Hi, thank you for the quick reply!

> Yet eventually, at the last epoch (449), the portion of slim_params lower than the threshold reaches 45.7%. Luckily, the performance is still stable under such a large exchanging portion. If I set the gamma too large, scaling factors will quickly decrease to zero before the fusion process acquires enough training steps. Thus a mild gamma needs to be chosen.

I see, so I think I may try a smaller gamma than 2e-2 in my case. Actually, the L1 norm of the scaling factors is initially more than twice the self-supervised loss, and the scaling factors decrease to zero after 10 epochs. By the way, may I ask roughly what the ratio of the supervised loss to the L1 norm of the scaling factors is at the start and at the end of training in your case?

> In the end, I'm not sure about your self-supervised task. In our framework, both the RGB branch and the depth branch must be supervised, and the two losses are added up. So does this pipeline still hold in your self-supervised task?

Yes, there is still a two-stream network like yours in my self-supervised network. The two-stream network only outputs a single estimate (the ensembled one), which is then used in the self-supervised task. I think this has some similarities to your task, so I am giving the channel exchanging strategy a try.

I am not sure whether it would be better to output three estimates, calculate the self-supervised loss three times, and add the losses together as the final loss. Have you tried something like this?

Best, beniko

yikaiw commented 3 years ago

I edited my last reply, changing gamma to lambda (in Equation 4). To choose lambda, you can first apply the L1 penalty in single-modality training (without multimodal fusion, i.e., without channel exchanging); a good lambda should not noticeably affect performance but should push a portion of the scaling factors to zero. This property is well established in the network pruning literature.
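One way to turn this advice into a concrete procedure is a small lambda sweep on the single-modality model. This is only a sketch: train_single_modality, evaluate, and fraction_below_threshold are placeholders for your own training loop, validation routine, and the statistic from the snippet above.

```python
# Sweep candidate lambdas on single-modality training (no channel exchanging).
candidates = [1e-4, 5e-4, 1e-3, 5e-3]   # assumed search range

baseline_acc = evaluate(train_single_modality(lam=0.0))

for lam in candidates:
    model = train_single_modality(lam=lam)
    acc = evaluate(model)
    portion = fraction_below_threshold(model, threshold=2e-2)
    print(f"lambda={lam:.0e}  acc={acc:.2f}  near-zero portion={portion:.3f}")

# Pick the largest lambda whose accuracy stays close to baseline_acc while the
# near-zero portion is clearly above zero.
```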

Do the three estimates mean RGB, depth, and the ensemble? We do calculate the loss three times and add the three losses together as the final loss, although we have not tried self-supervised tasks.
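Roughly speaking, the objective then looks like the following sketch (criterion, target, pred_rgb, pred_depth, slim_params, and lam are placeholder names; the simple average used for the ensemble is only illustrative and not necessarily how the two branches are actually combined):

```python
# Ensemble prediction (simple average, purely illustrative).
pred_ens = 0.5 * (pred_rgb + pred_depth)

loss_rgb   = criterion(pred_rgb, target)
loss_depth = criterion(pred_depth, target)
loss_ens   = criterion(pred_ens, target)

# Final objective: the three task losses plus the sparsity penalty on slim_params.
loss = loss_rgb + loss_depth + loss_ens + sparsity_loss(slim_params, lam)
loss.backward()
```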

Beniko95J commented 3 years ago

Thank you for the reply. I will try to find a good lambda for my case.