zju-vipa / KamalEngine

Knowledge Amalgamation Engine
Apache License 2.0

how to train each block? #2

Closed liujianzhao6328057 closed 5 years ago

Ssssseason commented 5 years ago

@liujianzhao6328057 Which example do you want to run?

liujianzhao6328057 commented 5 years ago

I want to run the code for "Student Becoming the Master". I am not sure my understanding is correct: each block (in both the encoder and the decoder) should be trained to amalgamate the knowledge of the two teachers. However, I cannot find the corresponding code in the example. Even when the phase is set to "block", I can only train the blocks in the decoder part. So how can I train the encoder part block by block? Thank you and best wishes.

Ssssseason commented 5 years ago

Our proposed method only removes decoder blocks. The encoder part is just a feature extractor, while the decoder part is expected to "project", or transfer, the learned features into each task domain. Of course, you can try to implement it, but I wonder whether branching out in the encoder part makes sense.

liujianzhao6328057 commented 5 years ago

However, in "5.1.3 Branch Out" you say that "after training the N blocks of TargetNet using the loss function (6), we can acquire the final losses for each block". The total block number is N, and the decoder blocks are in the range (N/2, N]. Since you don't train the encoder blocks, how can you get the losses of the encoder blocks? It is confusing.

Ssssseason commented 5 years ago

You're very careful. In fact, the original code (not especially elegant) is in TensorFlow, where the encoder part can also be trained block by block. Since we set N/2 < n ≤ N when choosing where to branch out in the paper, the results of training the encoder part have no influence. Thus, in this PyTorch implementation, you can't easily train the encoder part for now. Maybe you can implement it and make a pull request :).

liujianzhao6328057 commented 5 years ago

But if I don't train the encoder blocks, where can I get the pre-trained weights of the encoder part? By the way, when will the TF implementation be released? Thanks.

Ssssseason commented 5 years ago

You can set "indices" to [0, 0] to train all blocks in the encoder part. There are no plans to open-source the TF implementation yet, but its method of training the encoder block by block is the same.
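As a hedged illustration of why `indices = [0, 0]` trains the whole encoder (the modules below are toy stand-ins, not the repo's actual classes): with `out_idx = 0`, each task branch takes the encoder output itself as input, so the loss gradient reaches every encoder block.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the student encoder and the two per-task branch decoders.
encoder = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
branch_decoders = nn.ModuleList([nn.Linear(8, 4) for _ in range(2)])
indices = [0, 0]  # both task branches attach at position 0: the encoder output

x = torch.randn(4, 8)
decoder_features = [encoder(x)]  # position 0 holds the encoder output

outputs = []
for i, out_idx in enumerate(indices):
    out = decoder_features[out_idx]          # out_idx == 0 -> encoder features
    outputs.append(branch_decoders[i](out))  # each branch produces its task output

loss = sum(o.pow(2).mean() for o in outputs)
loss.backward()
# Gradients reach the encoder because both branches start from its output.
assert all(p.grad is not None for p in encoder.parameters())
```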

liujianzhao6328057 commented 5 years ago

Mimic teachers.

```python
outputs = None  # must be initialized before the loop
for i in range(len(self.indices)):
    out_idx = self.indices[i]
    output = decoder_features[out_idx]
    output = output * self.student_adaptors_list[i][out_idx](
        F.avg_pool2d(output, output.shape[2:3]))
    for j in range(out_idx, 5):
        output = self.student_b_decoders_list[i][j](
            output,
            decoder_indices[j],
            decoder_shapes[j])
    # Concatenate each task branch's output along the channel dimension.
    outputs = output if outputs is None else torch.cat((outputs, output), dim=1)

return outputs
```

If I set indices to [0, 0], the output still comes from self.student_b_decoders_list, not from the encoder.

Ssssseason commented 5 years ago

Yes. As we demonstrated in 5.1.2, we replace the features in the teacher model with those in the student model. In `output = decoder_features[out_idx]`, the output comes from the encoder part. The line `output = output * self.student_adaptors_list[i][out_idx](F.avg_pool2d(output, output.shape[2:3]))` is the channel-coding part. The inner loop streams the student's data through the remaining decoder blocks of the teacher model, which is equivalent to the "replace" operation in Figure 2 used to obtain L_depth and L_seg.
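For intuition, the channel-coding step can be sketched as a squeeze-and-excitation-style gate (this `ChannelAdaptor` is a hypothetical stand-in, not the repo's actual adaptor class): global-average-pool the features, map the pooled vector to per-channel weights, and rescale the feature map with them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAdaptor(nn.Module):
    """Hypothetical adaptor: pooled features -> per-channel gate weights."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, pooled):   # pooled: (B, C, 1, 1)
        return self.gate(pooled) # per-channel weights in (0, 1)

output = torch.randn(2, 64, 16, 16)                        # encoder features
adaptor = ChannelAdaptor(64)
weights = adaptor(F.avg_pool2d(output, output.shape[2:]))  # (2, 64, 1, 1)
recoded = output * weights  # channel-coded features fed to the decoder blocks
```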

HowieMa commented 3 years ago

Hi @Ssssseason , thank you for sharing such an awesome project. I also have some questions about the encoder part, and I sincerely hope you can help me solve them.

Do you have the S / D channel-coding part (the adaptor) on blocks from the encoders, like the variables down1, down2, ..., down in task_branch?

In your code, you only apply them to the outputs of the decoders, as in `decoder_features = [down5, up5, up4, up3, up2]`. It confuses me that you say "output = decoder_features[out_idx]; here the output is from the encoder part", because from your code we know that all of these variables (up5, up4, ...) come from the decoders:

```python
up5 = self.student_decoders[0](down5, indices_5, unpool_shape5)
up4 = self.student_decoders[1](up5, indices_4, unpool_shape4)
up3 = self.student_decoders[2](up4, indices_3, unpool_shape3)
up2 = self.student_decoders[3](up3, indices_2, unpool_shape2)
up1 = self.student_decoders[4](up2, indices_1, unpool_shape1)
```

Also, Figure 2 is for the "knowledge amalgamation module", not for the branch-out, and you don't say that n > N/2 there. Besides, in Figure 1 you place the knowledge amalgamation at the very beginning of the network, which makes me think you also apply it to the encoder. I guess my question is similar to @liujianzhao6328057's.

I have this question because in the previous work, "layerwise amalgamation", the knowledge amalgamation module (amal block) is applied to every block, and all of those blocks belong to the feature extractor. To me, this means there is guidance on every block of the network, whereas previous works like KD and MTL only provide guidance on the output.

In this code, however, you only apply knowledge amalgamation to the decoder part. Thus I cannot find how you use the teachers' knowledge to "guide" the student's encoder, which confuses me.

I would really appreciate it if you could help me. Look forward to your reply, thanks!

Ssssseason commented 3 years ago
  1. "Do you have the S / D -channel coding part (adaptor) on blocks from the encoders?" In this repo, no.
  2. "It makes me confused that you say 'output = decoder_features[out_idx]; here the output is from the encoder part.'" I mean that when out_idx is set to 0, decoder_features[out_idx] is the output of the encoder.
  3. "...which makes me think that you also apply it on the encoder... I cannot find how you use the teacher's knowledge to 'guide' the encoder of the student." Since we require n > N/2 later in the paper, the loss of each encoder block is unused. To save training time while keeping similar performance, we can first train all encoder blocks at the same time, before training each decoder block (set branch_indices to [0, 0, ...]). Put another way, in this repo the teachers' knowledge guides the student's whole encoder in the first training stage, instead of guiding each encoder block separately.
  4. "In the previous work, 'layerwise amalgamation', you apply the knowledge amalgamation module (amal block) on every block." These two papers are not by the same authors.

Hope this answers your questions!
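A minimal sketch of that two-stage schedule (every module here is a toy `nn.Linear` stand-in, and the MSE feature-mimicry loss is illustrative, not the repo's actual amalgamation loss): first train the whole encoder at once against the teachers' features, then train each decoder block in turn with everything else frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a student encoder, its decoder blocks, and two frozen teachers.
encoder = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
decoder_blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])
teachers = [nn.Linear(8, 8), nn.Linear(8, 8)]
for t in teachers:
    t.requires_grad_(False)

x = torch.randn(16, 8)

# Stage 1: train the whole encoder at once to mimic both teachers' features.
opt = torch.optim.SGD(encoder.parameters(), lr=0.1)
for _ in range(20):
    loss = sum(F.mse_loss(encoder(x), t(x)) for t in teachers)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: train each decoder block separately, with earlier parts fixed.
for n, block in enumerate(decoder_blocks):
    opt = torch.optim.SGD(block.parameters(), lr=0.1)
    h = encoder(x).detach()              # the encoder is no longer updated
    for prev in decoder_blocks[:n]:
        h = prev(h).detach()             # already-trained blocks are frozen
    for _ in range(20):
        loss = sum(F.mse_loss(block(h), t(x)) for t in teachers)
        opt.zero_grad(); loss.backward(); opt.step()
```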
HowieMa commented 3 years ago

Thank you for the detailed answer, it helps me a lot!