Sorry @zylo117, I still don't get it. In the official paper the authors say that they take the level 3-7 features (P3-P7), so why do you say in the other issue that, at first, there are only 3 features, P3-P5? Maybe it's obvious, but I really don't get it.
Yes, it's really confusing, because the paper doesn't mention this part. The truth is, only three features from the backbone are forwarded to BiFPN: P3, P5 and P7. But for the sake of consistency, the names were changed to P3/P4/P5. P6 is then generated from P5, and P7 from P6.
Why? Because each pyramid feature's width and height must be half of the previous level's. There are stages whose feature sizes are not pyramid-like, so some of the stages have to be left out.
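For intuition, here is a rough sketch of how P6 and P7 can be derived by stride-2 downsampling (just an illustration of the idea, not the exact code in this repo; the 320/64 channel counts and the 512 input are assumptions based on B0/D0):

```python
import torch
import torch.nn as nn

# Sketch of the idea only (not this repo's exact code): P6 is derived from the
# deepest backbone feature by halving its width/height, and P7 from P6 the same
# way, so every level is half the size of the level below it.
num_channels = 64                          # BiFPN width; 64 is the D0 value, used here just for illustration
p5_to_p6 = nn.Sequential(
    nn.Conv2d(320, num_channels, 1),       # match channel count first (320 = deepest B0 feature)
    nn.BatchNorm2d(num_channels),
    nn.MaxPool2d(3, stride=2, padding=1),  # halve width/height
)
p6_to_p7 = nn.MaxPool2d(3, stride=2, padding=1)

p5 = torch.randn(1, 320, 16, 16)           # deepest backbone feature for a 512x512 input
p6 = p5_to_p6(p5)                          # -> (1, 64, 8, 8)
p7 = p6_to_p7(p6)                          # -> (1, 64, 4, 4)
```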
Thanks @zylo117 I'm starting to understand but I still have some confusion. From the EfficientNet paper:
EfficientNet-B0 baseline network. Each row describes a stage i with L_i layers, input resolution ⟨H_i, W_i⟩ and output channels C_i:
Stage | Operator | Resolution | #Channels | #Layer |
---|---|---|---|---|
1 | Conv3x3 | 224 × 224 | 32 | 1 |
2 | MBConv1, k3x3 | 112 × 112 | 16 | 1 |
3 | MBConv6, k3x3 | 112 × 112 | 24 | 2 |
4 | MBConv6, k5x5 | 56 × 56 | 40 | 2 |
5 | MBConv6, k3x3 | 28 × 28 | 80 | 3 |
6 | MBConv6, k5x5 | 14 × 14 | 112 | 3 |
7 | MBConv6, k5x5 | 14 × 14 | 192 | 4 |
8 | MBConv6, k3x3 | 7 × 7 | 320 | 1 |
9 | Conv1x1 & Pooling & FC | 7 × 7 | 1280 | 1 |
First: is P1 the output of stage 1 (input res 224 x 224, output res 112 x 112), P2 the output of stage 2 and so on?
If yes, P4_backbone's width/height (28 × 28) is already half of P3_backbone's (56 × 56), not the same (I'm 100% sure I'm missing something, but I don't understand exactly what, sorry).
Oh, I didn't check it carefully, sorry. I meant that there are stages whose feature sizes are not pyramid-like.
And stage 2 is P1.
Only the second-last layers of stages 4, 6 and 8 are used to build the BiFPN.
https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/master/efficientdet/model.py#L415
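Roughly, the idea looks like this (a toy sketch, not the repo's actual code; `ToyBackbone` and its channels/strides are made up to loosely mimic B0):

```python
import torch
import torch.nn as nn

# Toy version of the idea: walk through the backbone blocks and, whenever the
# *next* block downsamples with stride 2, keep the feature computed so far;
# also keep the final output.
class ToyBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Conv2d(3, 16, 3, stride=2, padding=1),
            nn.Conv2d(16, 16, 3, stride=1, padding=1),
            nn.Conv2d(16, 24, 3, stride=2, padding=1),
            nn.Conv2d(24, 24, 3, stride=1, padding=1),
            nn.Conv2d(24, 40, 3, stride=2, padding=1),
            nn.Conv2d(40, 40, 3, stride=1, padding=1),
            nn.Conv2d(40, 112, 3, stride=2, padding=1),
            nn.Conv2d(112, 112, 3, stride=1, padding=1),
            nn.Conv2d(112, 320, 3, stride=2, padding=1),
        ])

    def forward(self, x):
        feature_maps = []
        last_x = None
        for block in self.blocks:
            if block.stride[0] == 2 and last_x is not None:
                feature_maps.append(last_x)   # feature right before a downsampling block
            x = block(x)
            last_x = x
        feature_maps.append(last_x)           # final backbone output
        return feature_maps

feats = ToyBackbone()(torch.randn(1, 3, 512, 512))
print([tuple(f.shape[1:]) for f in feats])
# [(16, 256, 256), (24, 128, 128), (40, 64, 64), (112, 32, 32), (320, 16, 16)]
```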
Let's see what I understood (or not) so far...
From https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/b5092321ac6a647d48e81b1f8590feb3dbc3ad87/efficientnet/model.py#L154 and below, I see that `self._blocks` contains only, and all of, the MBConv layers.
From your link I see that, as one can expect, you take the feature maps just before every downsampling.
From my table above, this means that at the end of https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/b5092321ac6a647d48e81b1f8590feb3dbc3ad87/efficientdet/model.py#L408 you have 5 feature maps (the outputs of the `elif` condition). You then discard the `None` that you have at the beginning of the `feature_maps` list, given how you have implemented this part of the logic:
https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/b5092321ac6a647d48e81b1f8590feb3dbc3ad87/efficientdet/model.py#L420
What I don't get is that here I see 5 feature maps, not 3. What am I not understanding?
It's 3. I extract 5 features from the backbone, but only the last 3 of them go to BiFPN.
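Just to illustrate with made-up tensors (shapes assume a 512 × 512 input and B0-like channel counts):

```python
import torch

# Purely illustrative tensors: 5 features come out of the backbone, but only
# the last 3 become the P3/P4/P5 inputs of the first BiFPN layer.
feature_maps = [
    torch.randn(1, 16, 256, 256),   # stride 2
    torch.randn(1, 24, 128, 128),   # stride 4
    torch.randn(1, 40, 64, 64),     # stride 8
    torch.randn(1, 112, 32, 32),    # stride 16
    torch.randn(1, 320, 16, 16),    # stride 32
]
p3, p4, p5 = feature_maps[-3:]
print(p3.shape, p4.shape, p5.shape)
```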
Ok, so you keep the last 3 feature maps, which, using the EfficientNet table above and changing the input from 224 × 224 to 512 × 512 (as in the EfficientDet paper for D0), come out at 64 × 64, 32 × 32 and 16 × 16.
You then generate P6 (8 × 8) from P5 and P7 (4 × 4) from P6.
Thanks a lot for the explanation @zylo117, I think (I hope) I get it
Hi, your reply to me does not explain why there are 2 residual convs instead of just 1, when both have the job of bringing the original inputs to the right sizes. You generate two separate p4_in and p5_in from p4 and p5: the first time is shown in my question 1 and the second time in question 2.
`p4_in = self.p4_down_channel(p4)`
`p5_in = self.p5_down_channel(p5)`
and
`p4_in = self.p4_down_channel_2(p4)`
`p5_in = self.p5_down_channel_2(p5)`
Can you explain why not just have one p4_in/p5_in? Sorry, I could not find this mentioned in the paper or the official implementation.
It really is in the official implementation, you just have to dig deeper. And I explained it in the README, FAQ Q2, second part, point 8.
Thanks, I will try to read tf-implementation again.
If 0, 1 or 2 is in `inputs_offsets`, `resample_feature_map()` will create a new p3, p4 or p5, because the channels of `feat[:3]` are not equal to `config.fpn_num_filters`.
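Conceptually the resampling does something like this (a sketch of the idea only, not the official `resample_feature_map`; `match_channels` is a made-up helper and 64 assumes D0's `fpn_num_filters`):

```python
import torch
import torch.nn as nn

# Conceptual sketch: when a BiFPN input comes straight from the backbone, its
# channel count differs from fpn_num_filters, so a 1x1 conv (+ BN) is applied
# to produce a "new" p3/p4/p5 with the right number of channels.
def match_channels(feat: torch.Tensor, target_channels: int) -> torch.Tensor:
    if feat.shape[1] != target_channels:
        conv = nn.Conv2d(feat.shape[1], target_channels, kernel_size=1)
        bn = nn.BatchNorm2d(target_channels)
        feat = bn(conv(feat))
    return feat

p3_raw = torch.randn(1, 40, 64, 64)     # raw backbone feature, 40 channels
p3_in = match_channels(p3_raw, 64)      # 64 = fpn_num_filters for D0
print(p3_in.shape)                      # torch.Size([1, 64, 64, 64])
```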
The issue is why there is a `p4_down_channel_2` when the first `p4_down_channel` has already created a p4_in of the correct size. In the paper there are two separate streams from p4_in, to p4_up and p4_out, so I guess the duplication is there to create these two streams. However, in the paper there is also stream duplication for P6, so why is this not done in the code?
@opeide There are. Check out `conv6_up` and `conv6_down` in the BiFPN module.
https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/c533bc2de65135a6fe1d25ca437765c630943afb/efficientdet/model.py#L214-L215
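To put the P4 case in code, it is roughly this (a simplified sketch, not the repo's exact module; the 112/64 channel numbers are just B0/D0-style assumptions):

```python
import torch
import torch.nn as nn

# Simplified sketch: the raw backbone P4 feeds two different nodes of the BiFPN
# graph -- the top-down node that produces P4_up and the bottom-up node that
# produces P4_out -- so it is down-channeled twice with two independent 1x1
# convs instead of sharing a single one.
num_channels = 64
p4_down_channel = nn.Sequential(nn.Conv2d(112, num_channels, 1), nn.BatchNorm2d(num_channels))
p4_down_channel_2 = nn.Sequential(nn.Conv2d(112, num_channels, 1), nn.BatchNorm2d(num_channels))

p4 = torch.randn(1, 112, 32, 32)            # raw backbone P4 (B0-ish shape for a 512 input)
p4_in_for_topdown = p4_down_channel(p4)     # consumed when computing p4_up
p4_in_for_bottomup = p4_down_channel_2(p4)  # consumed when computing p4_out
```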
Hi @zylo117, I would be grateful if you could clarify some points.
1st, I have read the EfficientDet paper again, but I couldn't find anything related to pyramid-like feature sizes. Then I searched the Feature Pyramid Networks for Object Detection paper (arXiv:1612.03144) and found this sentence, which describes the bottom-up pathway: "The bottom-up pathway is the feedforward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2"
Is this what you mean by "a pyramid feature's width and height must be half of the last feature's"?
2nd, in the Feature Pyramid Networks paper, the authors use ResNets as the backbone and choose "the output of the last layer of each stage as reference set of feature maps". However, in EfficientDet, the second-last layer is used. => It doesn't have to be the last or the second-last layer of each stage, right? We can choose whichever layer of each stage we want, as long as that layer meets the condition of "pyramid-like feature sizes"?
Thank you so much for your reply.
I have another concern about the second last layer of each stage.
I dug into EfficientNet's code from TensorFlow. It seems to me that EfficientNet consists of 9 stages, in which stage 1 is the Stem block and stage 9 is the Top block for classification.
Stages 2 to 8 are the 7 blocks. Each block has sub-blocks, which are repeated several times. Each sub-block consists of 4 phases: Expansion phase - Depthwise Convolution phase - Squeeze-and-Excitation phase - Output phase. When a sub-block is repeated, each phase has some modifications.
Based on the code, I have summarized the Output phase:
During the 1st repetition of a sub-block, the Output phase doesn't have Dropout and Add layers (due to the different number of input and output filters) => the Output phase consists of Conv2D and BatchNormalization layers.
From the 2nd repetition of a sub-block onwards, the Output phase has Dropout and Add layers (since the number of filters and the strides are updated when repeating the sub-block) => the Output phase consists of Conv2D, BatchNormalization, Dropout and Add layers.
So in the 1st case the second-last layer is a Conv2D layer, while in the 2nd case the second-last layer is a Dropout layer?
The image below shows the Output phase of the 2nd repetition of a sub-block in P3 from EfficientNet-B0, which consists of 4 layers.
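For reference, here is a rough PyTorch sketch of the sub-block structure described above (`MBConvSketch` is made up for illustration; the real implementation is in TensorFlow and differs in details such as drop-connect vs. plain dropout):

```python
import torch
import torch.nn as nn

# Rough sketch of the four phases: Expansion -> Depthwise Conv -> SE -> Output.
class MBConvSketch(nn.Module):
    def __init__(self, in_ch, out_ch, expand_ratio=6, stride=1, se_ratio=0.25):
        super().__init__()
        mid = in_ch * expand_ratio
        self.expand = nn.Sequential(nn.Conv2d(in_ch, mid, 1, bias=False),
                                    nn.BatchNorm2d(mid), nn.SiLU())
        self.depthwise = nn.Sequential(nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),
                                       nn.BatchNorm2d(mid), nn.SiLU())
        squeezed = max(1, int(in_ch * se_ratio))
        self.se = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(mid, squeezed, 1), nn.SiLU(),
                                nn.Conv2d(squeezed, mid, 1), nn.Sigmoid())
        # Output phase: project conv + BN; dropout + residual add only apply
        # when input/output shapes match (i.e. from the 2nd repetition onwards).
        self.project = nn.Sequential(nn.Conv2d(mid, out_ch, 1, bias=False),
                                     nn.BatchNorm2d(out_ch))
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        out = self.expand(x)
        out = self.depthwise(out)
        out = out * self.se(out)          # channel-wise re-weighting
        out = self.project(out)
        if self.use_residual:
            out = x + self.dropout(out)   # Dropout + Add only in the repeated case
        return out

y = MBConvSketch(40, 40)(torch.randn(1, 40, 64, 64))  # repeated sub-block: residual applies
```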
@vietanh090893 I'm a bit confused here. What's your question exactly? Are you wondering why BiFPN takes the features whose next conv has stride 2, plus the final output of EfficientNet? It's because of the shapes of those features, and because it sends more data to BiFPN and lets it extract the features, instead of extracting the features with the EfficientNet backbone like signatrix/efficientdet did.
Though this doesn't mean that extracting features using EfficientNet will definitely result in bad performance.
About your graph, I think you misunderstood the so-called second-last layer. It means that if block N has a `_depthwise_conv` with stride 2, then the output of block N-1 will be an input of BiFPN. Considering that layer X of block N is currently the last layer, the final layer of block N-1 is the second-last layer.
Take your graph as an example: if block4 has a `_depthwise_conv` layer with stride 2, then the corresponding input of BiFPN will be the output of block3b_add.
I really misunderstood the term second last layer. Thank you for your explanation. It makes more sense to me now.
I noticed your implementation is different from the official one in two places:
1) You apply a conv on the encoder outputs, and p6, p7 are downsampled versions of the original.
2) You apply a separate convolution for the residual, instead of reusing the one above.
Can you please justify or explain why you made these changes?