zylo117 / Yet-Another-EfficientDet-Pytorch

The PyTorch re-implementation of the official EfficientDet, with SOTA real-time performance and pretrained weights.

Different from official implementation #202

Closed · bluesky314 closed 4 years ago

bluesky314 commented 4 years ago

I noticed your implementation is different from the official's in two places:

if self.first_time:
            p3, p4, p5 = inputs

            p6_in = self.p5_to_p6(p5)
            p7_in = self.p6_to_p7(p6_in)

            p3_in = self.p3_down_channel(p3)
            p4_in = self.p4_down_channel(p4)
            p5_in = self.p5_down_channel(p5)

1) You apply a conv on the encoder outputs, and p6, p7 are downsampled versions of the original.

2) You apply a separate convolution for the residual path instead of reusing the projections above.

if self.first_time:
            p4_in = self.p4_down_channel_2(p4)
            p5_in = self.p5_down_channel_2(p5)

Can you please justify or explain why you made these changes?

zylo117 commented 4 years ago
  1. No, it's the same as the official's. You must have missed it.
  2. https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/issues/19#issuecomment-611983918
mnslarcher commented 4 years ago

Sorry @zylo117, I still don't get it. In the official paper the authors say that they take the level 3-7 features (P3-P7), so why do you say in the other issue that for the first time there are only 3 features, P3-P5? Maybe it's obvious, but I really don't get it.

zylo117 commented 4 years ago

Yes, it's really confusing, because the paper doesn't mention this part. The truth is that only three features from the backbone are forwarded to the BiFPN, P3&P5&P7, but for the sake of consistency the names were changed to P3/P4/P5. P6 is then generated from P5, and P7 from P6.

Why? Because each pyramid feature's width and height must be half of the previous level's. There are stages whose feature sizes are not pyramid-like, so some of the stages have to be left out.
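To make the size relationship concrete, here is a minimal sketch of how the two extra levels can be produced from P5. The channel counts (320 in, 64 out) are just the D0 values used for illustration, and plain nn layers stand in for the repo's same-padding conv/pool wrappers; only the halving of width/height per level is the point.

```python
import torch
import torch.nn as nn

class ExtraPyramidLevels(nn.Module):
    """Sketch: build P6 and P7 from the backbone's deepest feature P5."""

    def __init__(self, p5_channels=320, num_channels=64):
        super().__init__()
        # 1x1 conv to match the BiFPN channel width, then stride-2 pooling
        self.p5_to_p6 = nn.Sequential(
            nn.Conv2d(p5_channels, num_channels, kernel_size=1),
            nn.BatchNorm2d(num_channels),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # P7 is just another stride-2 pooling of P6
        self.p6_to_p7 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, p5):
        p6 = self.p5_to_p6(p5)   # half the width/height of P5
        p7 = self.p6_to_p7(p6)   # half the width/height of P6
        return p6, p7


# For a 512x512 D0 input, P5 is 16x16, so P6 is 8x8 and P7 is 4x4.
p6, p7 = ExtraPyramidLevels()(torch.randn(1, 320, 16, 16))
print(p6.shape, p7.shape)  # torch.Size([1, 64, 8, 8]) torch.Size([1, 64, 4, 4])
```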

mnslarcher commented 4 years ago

Thanks @zylo117, I'm starting to understand, but I still have some confusion. From the EfficientNet paper:

EfficientNet-B0 baseline network. Each row describes a stage i with L_i layers, input resolution <H_i, W_i> and output channels C_i:

| Stage | Operator | Resolution | #Channels | #Layers |
| --- | --- | --- | --- | --- |
| 1 | Conv3x3 | 224 × 224 | 32 | 1 |
| 2 | MBConv1, k3x3 | 112 × 112 | 16 | 1 |
| 3 | MBConv6, k3x3 | 112 × 112 | 24 | 2 |
| 4 | MBConv6, k5x5 | 56 × 56 | 40 | 2 |
| 5 | MBConv6, k3x3 | 28 × 28 | 80 | 3 |
| 6 | MBConv6, k5x5 | 14 × 14 | 112 | 3 |
| 7 | MBConv6, k5x5 | 14 × 14 | 192 | 4 |
| 8 | MBConv6, k3x3 | 7 × 7 | 320 | 1 |
| 9 | Conv1x1 & Pooling & FC | 7 × 7 | 1280 | 1 |

First: is P1 the output of stage 1 (input res 224 x 224, output res 112 x 112), P2 the output of stage 2 and so on?

If yes, then P4_backbone's width/height (28, 28) is half of P3_backbone's (56 × 56), not the same (I'm 100% sure I'm missing something, but I don't understand exactly what, sorry).

zylo117 commented 4 years ago

Oh, I didn't check it carefully, sorry. I meant that there are stages whose feature sizes are not pyramid-like.

And stage 2 is P1.

Only the second last layers of stages 4, 6 and 8 are used to build the BiFPN.

https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/master/efficientdet/model.py#L415
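For readers who don't want to open the link, here is a rough sketch of what that part of the forward pass does (simplified; drop-connect handling and the exact return slicing are omitted, and `stem`/`blocks` are assumed to be the backbone's stem and its list of MBConv blocks). The previous block's output is recorded whenever the current block downsamples, which is the "second last layer" rule discussed further down.

```python
def extract_backbone_features(stem, blocks, x):
    """Sketch of the feature collection around efficientdet/model.py#L415."""
    x = stem(x)
    feature_maps, last_x = [], None
    for idx, block in enumerate(blocks):
        x = block(x)
        # in this backbone the depthwise conv stores its stride as a list
        if block._depthwise_conv.stride == [2, 2]:
            # this block halves the resolution, so keep the previous output
            feature_maps.append(last_x)
        elif idx == len(blocks) - 1:
            # deepest feature; nothing downsamples after it
            feature_maps.append(x)
        last_x = x
    # 5 maps for EfficientNet-B0; only the 3 deepest are passed on to the BiFPN
    return feature_maps
```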

mnslarcher commented 4 years ago

Let's see what I understood (or not) so far...

From https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/b5092321ac6a647d48e81b1f8590feb3dbc3ad87/efficientnet/model.py#L154 and the lines below it, I see that self._blocks contains only (and all of) the MBConv blocks.

From your link I see that, as one would expect, you take the feature maps just before every downsampling.

From my table above, this means that at the end of https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/b5092321ac6a647d48e81b1f8590feb3dbc3ad87/efficientdet/model.py#L408 you have 5 feature maps (the outputs collected just before each downsampling, plus the final block's output).

You then discard the None that you have at the beginning of the feature_maps list, because of how you have implemented this part of the logic: https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/b5092321ac6a647d48e81b1f8590feb3dbc3ad87/efficientdet/model.py#L420

What I don't get is that here I see 5 feature maps, not 3. What am I not understanding?

zylo117 commented 4 years ago

It's 3: I extract 5 features from the backbone, but only the last 3 of them go to the BiFPN.
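In pseudo-terms (a minimal sketch; `backbone` and `bifpn` are just placeholder names for the wrapper modules, not the repo's exact API):

```python
features = backbone(inputs)      # 5 feature maps, shallow -> deep
p3, p4, p5 = features[-3:]       # only the 3 deepest levels are used
outputs = bifpn((p3, p4, p5))    # the BiFPN creates P6/P7 from P5 internally
```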

mnslarcher commented 4 years ago

Ok, so you keep the last 3 feature maps. Using the EfficientNet table above and changing the input from 224 x 224 to 512 x 512, as in the EfficientDet paper for D0, that results in:

P3: 64 x 64 (40 channels), P4: 32 x 32 (112 channels), P5: 16 x 16 (320 channels)

you then generate:

P6: 8 x 8 from P5, and P7: 4 x 4 from P6
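A quick sanity check of those sizes, assuming the usual convention that pyramid level i has stride 2**i:

```python
# For a 512x512 D0 input, level i of the pyramid has stride 2**i:
for level in range(3, 8):
    size = 512 // 2 ** level
    print(f"P{level}: {size}x{size}")
# -> P3: 64x64, P4: 32x32, P5: 16x16, P6: 8x8, P7: 4x4
```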

Thanks a lot for the explanation @zylo117, I think (I hope) I get it

bluesky314 commented 4 years ago

Hi, your reply to me does not explain why there are 2 residual convs instead of just 1, given that both have the job of bringing the original inputs to the right sizes. You generate two separate p4_in and p5_in from p4 and p5; the first time is shown in my question 1 and the second time in question 2.

p4_in = self.p4_down_channel(p4)
p5_in = self.p5_down_channel(p5)

and

p4_in = self.p4_down_channel_2(p4)
p5_in = self.p5_down_channel_2(p5)

Can you explain why not just have one p4_in/p5_in? Sorry, I could not find this mentioned in the paper or the official implementation.

zylo117 commented 4 years ago

It really is in the official implementation, you just have to dig deeper. And I explained it in the README, FAQ Q2, the second part, point 8.

bluesky314 commented 4 years ago

Thanks, I will try to read the TF implementation again.

Xiao-OMG commented 4 years ago

https://github.com/google/automl/blob/3614751749a21ca2fcb299b60238c6651ff51125/efficientdet/efficientdet_arch.py#L493

https://github.com/google/automl/blob/3614751749a21ca2fcb299b60238c6651ff51125/efficientdet/efficientdet_arch.py#L579

https://github.com/google/automl/blob/3614751749a21ca2fcb299b60238c6651ff51125/efficientdet/efficientdet_arch.py#L90

If 0, 1, or 2 is in 'inputs_offsets', resample_feature_map() will create a new p3, p4 or p5, because the channels of feat[:3] are not equal to config.fpn_num_filters.
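Roughly, that resample_feature_map() call boils down to the following (a hedged PyTorch-style paraphrase, not the official TF code; upsampling and other details are omitted): project the channels with a 1x1 conv + BN when they don't match the FPN width, then pool down to the target resolution when it is larger. Since the raw backbone features never match config.fpn_num_filters, every BiFPN node that consumes one of them gets its own fresh projection, which is what the p*_down_channel / p*_down_channel_2 pairs mirror in this repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def resample_feature_map_sketch(feat, proj, target_channels, target_size):
    """Paraphrase of the official resample_feature_map() behaviour.

    `proj` is a 1x1 conv + BN created per consuming node; it is only applied
    when the channel counts differ, e.g. for the raw p3/p4/p5.
    """
    if feat.shape[1] != target_channels:
        feat = proj(feat)                  # channel projection (1x1 conv + BN)
    if feat.shape[-1] > target_size:
        feat = F.max_pool2d(feat, kernel_size=3, stride=2, padding=1)
    return feat

# hypothetical usage: bring a raw 112-channel P4 (32x32) down to 64 FPN channels
proj = nn.Sequential(nn.Conv2d(112, 64, 1), nn.BatchNorm2d(64))
p4_in = resample_feature_map_sketch(torch.randn(1, 112, 32, 32), proj, 64, 32)
print(p4_in.shape)  # torch.Size([1, 64, 32, 32])
```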

opeide commented 3 years ago

The issue is: why is there a p4_down_channel_2 when the first p4_down_channel has already created a p4_in of the correct size? In the paper there are two separate streams from p4_in to p4_up and p4_out, so I guess the duplication is there to create these two streams. However, in the paper there is also stream duplication for P6, so why is this not done in the code?

zylo117 commented 3 years ago

@opeide There is. Check out conv6_up and conv6_down in the BiFPN module. https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/c533bc2de65135a6fe1d25ca437765c630943afb/efficientdet/model.py#L214-L215

https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/c533bc2de65135a6fe1d25ca437765c630943afb/efficientdet/model.py#L256-L258
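To make the two streams explicit, here is a self-contained sketch of the P4 node on the first BiFPN pass (the fast-attention fusion weights, swish and separable convs are left out, and the 112/64 channel counts are just the D0 values): p4_down_channel feeds the top-down stream that produces p4_up, while p4_down_channel_2 is an independent projection of the same backbone P4 feeding the bottom-up stream that produces p4_out, matching the two arrows leaving P4 in the paper's BiFPN figure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class P4NodeSketch(nn.Module):
    """Sketch of the two P4 streams on the first BiFPN pass."""

    def __init__(self, backbone_channels=112, fpn_channels=64):
        super().__init__()
        # two independent 1x1 projections of the SAME backbone feature p4
        self.p4_down_channel = nn.Sequential(
            nn.Conv2d(backbone_channels, fpn_channels, 1),
            nn.BatchNorm2d(fpn_channels))
        self.p4_down_channel_2 = nn.Sequential(
            nn.Conv2d(backbone_channels, fpn_channels, 1),
            nn.BatchNorm2d(fpn_channels))
        self.conv4_up = nn.Conv2d(fpn_channels, fpn_channels, 3, padding=1)
        self.conv4_down = nn.Conv2d(fpn_channels, fpn_channels, 3, padding=1)

    def forward(self, p4, p5_up, p3_out):
        # top-down stream: first projection fused with upsampled P5
        p4_up = self.conv4_up(
            self.p4_down_channel(p4) + F.interpolate(p5_up, scale_factor=2.0))
        # bottom-up stream: second projection fused with p4_up and downsampled P3
        p4_out = self.conv4_down(
            self.p4_down_channel_2(p4)
            + p4_up
            + F.max_pool2d(p3_out, kernel_size=3, stride=2, padding=1))
        return p4_up, p4_out


# D0-sized example: backbone P4 is 32x32x112; P5/P3 are already at 64 channels
node = P4NodeSketch()
p4_up, p4_out = node(torch.randn(1, 112, 32, 32),
                     torch.randn(1, 64, 16, 16),
                     torch.randn(1, 64, 64, 64))
print(p4_up.shape, p4_out.shape)  # both torch.Size([1, 64, 32, 32])
```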

vietanh090893 commented 3 years ago

> Oh, I didn't check it carefully, sorry. I meant that there are stages whose feature sizes are not pyramid-like.
>
> And stage 2 is P1.
>
> Only the second last layers of stages 4, 6 and 8 are used to build the BiFPN.
>
> https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/master/efficientdet/model.py#L415

Hi @zylo117, I would be grateful if you could clarify some points.

First, I have read the EfficientDet paper again, but I couldn't find anything related to pyramid-like feature sizes. Then I searched the Feature Pyramid Networks for Object Detection paper (1612.03144) and found this sentence, which describes the bottom-up pathway: "The bottom-up pathway is the feedforward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2"

Is this what you mean by "each pyramid feature's width and height must be half of the previous level's"?

Second, in the Feature Pyramid Networks paper the authors use ResNets as the backbone and choose "the output of the last layer of each stage as reference set of feature maps". However, in EfficientDet the second last layer is used. So it doesn't have to be the last or the second last layer of each stage, right? We can choose whichever layer of each stage we want, as long as that layer meets the condition of pyramid-like feature sizes?

zylo117 commented 3 years ago
  1. Yes.
  2. Yes, you can do that with an extra conv/pooling layer resizing the FPN input, but this way is more optimized (see the sketch below).
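A minimal sketch of what that extra conv/pooling resizing could look like if you wanted to feed some other intermediate layer into the FPN (the helper and the layer shapes below are made up for illustration; nothing like this exists in the repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fit_to_pyramid(feat, proj, target_hw):
    """Resize an arbitrary backbone feature map so it fits an FPN level:
    project the channels with a 1x1 conv, then force the spatial size."""
    feat = proj(feat)                              # match the FPN channel width
    return F.adaptive_max_pool2d(feat, target_hw)  # match the pyramid grid size

# e.g. take a hypothetical 56-channel, 48x48 feature and fit it to a
# 64-channel, 32x32 pyramid level
proj = nn.Conv2d(56, 64, kernel_size=1)
p_level = fit_to_pyramid(torch.randn(1, 56, 48, 48), proj, (32, 32))
print(p_level.shape)  # torch.Size([1, 64, 32, 32])
```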
vietanh090893 commented 3 years ago

Thank you so much for your reply.

I have another concern about the second last layer of each stage.

I dug into EfficientNet's TensorFlow code. It seems to me that EfficientNet consists of 9 stages, in which stage 1 is the stem block and stage 9 is the top block for classification.

Stages 2 through 8 are 7 blocks. Each block has sub-blocks, which are repeated several times. Each sub-block consists of 4 phases: expansion, depthwise convolution, squeeze-and-excitation, and output. When the sub-block is repeated, each phase has some modifications.

Based on the code, I have summarized the Output phase:

So in the 1st case the second last layer is a Conv2D layer, while in the 2nd case the second last layer is a Dropout layer?

The image below shows the output phase of the 2nd repetition of the sub-blocks in P3 from EfficientNet-B0, which consists of 4 layers.

[image: P3 - Last output phase]

zylo117 commented 3 years ago

@vietanh090893 I'm a bit confused here. What's your question exactly? Are you wondering why the BiFPN takes the features whose next conv stride is 2, plus the final output of EfficientNet? It's because of the shapes of those features, and because this sends more data to the BiFPN and lets it extract the features there, instead of extracting the features with the EfficientNet backbone the way signatrix/efficientdet did.

Though this doesn't mean that extracting features with EfficientNet will necessarily result in bad performance.

About your graph, I think you misunderstood the so-called second last layer. It means that if block N has a _depthwise_conv with stride 2, then the output of block N-1 will be an input of the BiFPN. Considering that layer X of block N is currently the last layer, the final layer of block N-1 is the second last layer.

Taking your graph as an example: if block4 has a _depthwise_conv layer with stride 2, then the next input of the BiFPN will be the output of block3b_add.

vietanh090893 commented 3 years ago

I really misunderstood the term second last layer. Thank you for your explanation. It makes more sense to me now.