weiliu89 / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

New Diagrams and Questions #392

Open leonardoaraujosantos opened 7 years ago

leonardoaraujosantos commented 7 years ago

New diagrams and layers explanation

Idea: Explain the SSD flow using diagrams and describe its new layers (questions on what I did not understand yet are below).

Training Diagram

[image: ssd_blockdiagram_train]

Prediction Diagram

[image: ssd_blockdiagram]

Extra Convs

Group of 1x1 and 3x3 convs used to sample activations from conv4_3 and the last (FC) layers; they are used to improve the detection of bigger objects.

[image: extraconvs]

Detection layer

[image: ssd_detection]

New layers added to support SSD

Priorboxes

Generates the default boxes from the image and feature map dimensions.
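A minimal numpy sketch of the idea (not the layer's actual code): for every cell of a feature map, emit a few default boxes with configurable sizes and aspect ratios, directly in the normalized [0, 1] scale mentioned in the answers below. The min_size/max_size/aspect_ratios names only mirror the PriorBox options, and the concrete values are made up for illustration.

```python
import numpy as np

def prior_boxes(feat_h, feat_w, min_size, max_size, aspect_ratios=(2.0,)):
    """Illustrative default-box generation for one feature map.

    Sizes are fractions of the image side, so the output boxes are already
    in the normalized [0, 1] scale. This mirrors the idea of the PriorBox
    layer, not its exact implementation.
    """
    boxes = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Center of the cell, normalized to [0, 1].
            cx = (j + 0.5) / feat_w
            cy = (i + 0.5) / feat_h
            # Square box with the minimum size.
            boxes.append((cx, cy, min_size, min_size))
            # Square box with scale sqrt(min * max).
            s = np.sqrt(min_size * max_size)
            boxes.append((cx, cy, s, s))
            # One box per extra aspect ratio (and its reciprocal).
            for ar in aspect_ratios:
                w = min_size * np.sqrt(ar)
                h = min_size / np.sqrt(ar)
                boxes.append((cx, cy, w, h))
                boxes.append((cx, cy, h, w))
    # (cx, cy, w, h) -> (xmin, ymin, xmax, ymax), clipped to [0, 1]
    b = np.array(boxes)
    out = np.stack([b[:, 0] - b[:, 2] / 2, b[:, 1] - b[:, 3] / 2,
                    b[:, 0] + b[:, 2] / 2, b[:, 1] + b[:, 3] / 2], axis=1)
    return np.clip(out, 0.0, 1.0)

# e.g. a 38x38 feature map with 4 default boxes per cell
print(prior_boxes(38, 38, min_size=0.1, max_size=0.2).shape)  # (38*38*4, 4)
```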

Annotated Data

  1. Does image augmentation.
  2. Generates random patches/images (is this the negative mining part?).
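As a rough illustration of what "generates random patches" means here, the sketch below draws random crops in normalized coordinates until one has enough Jaccard overlap with a ground-truth box, which is approximately what a batch_sampler entry asks for. This is not the actual sampler code, just the idea; the thresholds are arbitrary.

```python
import random

def jaccard(a, b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def sample_patch(gt_boxes, min_jaccard=0.5, max_trials=50):
    """Illustrative patch sampler: draw random crops (normalized coordinates)
    until one overlaps some ground-truth box by at least `min_jaccard`."""
    for _ in range(max_trials):
        w = random.uniform(0.3, 1.0)
        h = random.uniform(0.3, 1.0)
        x = random.uniform(0.0, 1.0 - w)
        y = random.uniform(0.0, 1.0 - h)
        patch = (x, y, x + w, y + h)
        if any(jaccard(patch, gt) >= min_jaccard for gt in gt_boxes):
            return patch
    return (0.0, 0.0, 1.0, 1.0)  # fall back to the whole image

print(sample_patch([(0.2, 0.2, 0.6, 0.7)]))
```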

DetectionOutput

Does the non-maximum suppression during prediction to keep the best region per object.
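For reference, a compact numpy version of greedy non-maximum suppression. The DetectionOutput layer also does confidence thresholding, per-class handling and top-k filtering, so treat this only as a sketch of the core NMS step.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression (illustrative).

    boxes:  (N, 4) array of (xmin, ymin, xmax, ymax)
    scores: (N,) confidence for one class
    Returns the indices of the boxes that survive.
    """
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the best remaining box with all the others
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-12)
        # Drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[0.10, 0.1, 0.50, 0.5],
                  [0.12, 0.1, 0.52, 0.5],
                  [0.60, 0.6, 0.90, 0.9]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much
```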

SmoothL1

Distance metric used in the MultiBoxLoss layer, more specifically for the localization part.
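The smooth L1 function itself is the standard one from Fast R-CNN: quadratic for small residuals, linear for large ones, which makes the localization loss less sensitive to outliers. A tiny numpy version:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber-like) loss, applied elementwise to the localization
    residuals (predicted offsets minus encoded ground-truth offsets)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

# Quadratic near zero, linear for large errors.
print(smooth_l1(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))
# [1.5   0.125 0.    0.125 1.5  ]
```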

Permute

Used to change the position of the tensor dimensions. (I don't know why yet; see question 4 below.)

Normalize

Used to make the activations of conv4_3 smaller. (There are no tests checking whether this is needed on other layers, or whether it could be replaced by BatchNorm or one of its variants such as WeightNorm (NIPS 2016).)
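A sketch of what the Normalize layer computes, assuming the ParseNet-style formulation: each spatial location of the blob is L2-normalized across channels and then rescaled by a learnable per-channel factor (commonly initialized to 20 for conv4_3). The function below is illustrative numpy, not the layer's implementation.

```python
import numpy as np

def l2_normalize(x, scale, eps=1e-10):
    """ParseNet-style L2 normalization for an (N, C, H, W) blob.

    Each spatial location is normalized across channels independently, then
    rescaled by a learnable per-channel factor. Nothing depends on the batch
    size, unlike BatchNorm.
    """
    norm = np.sqrt((x ** 2).sum(axis=1, keepdims=True)) + eps  # (N, 1, H, W)
    return x / norm * scale.reshape(1, -1, 1, 1)

x = np.random.randn(2, 512, 38, 38).astype(np.float32)  # conv4_3-sized blob
y = l2_normalize(x, scale=np.full(512, 20.0, dtype=np.float32))
print(y.shape)  # (2, 512, 38, 38)
```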

Deconvolution/AtrousConvolution/Dilated Conv

Actually this was already in Caffe; during the experiments it was found to give better performance (frame rate).
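To make the receptive-field argument concrete, here is a toy 1D dilated convolution: the kernel taps are spread `dilation` samples apart, so the receptive field grows without adding weights or downsampling. (SSD uses a dilated 3x3 convolution in the converted fc6 layer; this snippet is only conceptual, not how Caffe implements it.)

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """Toy 1D dilated ("atrous") convolution with 'valid' padding."""
    k = len(kernel)
    span = (k - 1) * dilation + 1            # effective receptive field
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        taps = x[i : i + span : dilation]    # every `dilation`-th sample
        out[i] = np.dot(taps, kernel)
    return out

x = np.arange(10, dtype=float)
k = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, k, dilation=1))  # receptive field 3
print(dilated_conv1d(x, k, dilation=2))  # receptive field 5, same 3 weights
```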

Open Questions

  1. Is the AnnotatedData layer responsible for augmenting the negative examples?
  2. Could the Normalize layer be replaced by BatchNorm, or other variants like WeightNorm (NIPS 2016)? The idea is that during backprop we would find optimal normalization coefficients for every detection branch.
  3. To handle bigger objects you used priorboxes on feature maps of different sizes. Must the default box sizes be the same on all layers in order to grab a bigger portion of the feature map, or is there some kind of resizing?
  4. Why do we need to permute the tensors?
weiliu89 commented 7 years ago
  1. The AnnotatedData layer is only responsible for generating random patches according to the configuration. If the batch_sampler allows it, it is possible to generate an image without any ground truth objects. The negative mining is done in the MultiBoxLoss layer, after matching ground truth boxes and default boxes.
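To make the hard negative mining in MultiBoxLoss concrete, a minimal sketch assuming the 3:1 negative-to-positive ratio from the SSD paper: after matching, the unmatched default boxes are ranked by their confidence loss and only the hardest ones contribute to the loss. Illustrative numpy, not the layer's code.

```python
import numpy as np

def hard_negative_mining(conf_loss, is_positive, neg_pos_ratio=3):
    """Pick the hardest negatives so that #negatives ~= ratio * #positives.

    conf_loss:   (num_priors,) per-default-box confidence loss
    is_positive: (num_priors,) bool, True where a prior matched a ground truth
    Returns a bool mask of the priors (positives + selected negatives)
    that contribute to the confidence loss.
    """
    num_pos = int(is_positive.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))
    neg_loss = np.where(is_positive, -np.inf, conf_loss)  # ignore positives
    hardest = np.argsort(neg_loss)[::-1][:num_neg]        # highest-loss negatives
    mask = is_positive.copy()
    mask[hardest] = True
    return mask

loss = np.array([0.2, 2.0, 0.1, 1.5, 0.3])
pos  = np.array([True, False, False, False, False])
print(hard_negative_mining(loss, pos))  # [ True  True False  True  True]
```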

  2. Since VGG is not trained with batch norm, we found that using L2 normalization (from ParseNet) is a nice and easy workaround to stabilize the training with VGG. An advantage over batch normalization is that L2 normalization can be done separately per feature map location and does not depend on the number of images in a batch.

  3. The size of the default boxes at each feature map layer is configurable. Note that the default boxes are in the normalized scale [0, 1].

  4. To better combine predictions from different layers.
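Concretely, the per-layer prediction blobs have different channel counts and spatial sizes, so each one is permuted from NCHW to NHWC and flattened before concatenation, which puts all values belonging to one default box next to each other. A numpy sketch of that reshaping (the real network does this with Permute/Flatten/Concat layers; the box counts below match SSD300's conv4_3 and conv7 but are otherwise illustrative):

```python
import numpy as np

num_classes, batch = 21, 1
# Per-layer class predictions, NCHW, channels = boxes_per_cell * num_classes
conf_conv4_3 = np.random.randn(batch, 4 * num_classes, 38, 38)
conf_conv7   = np.random.randn(batch, 6 * num_classes, 19, 19)

def permute_and_flatten(blob):
    # NCHW -> NHWC, then flatten everything but the batch dimension,
    # so values are ordered (y, x, box, class) instead of (box/class, y, x).
    return blob.transpose(0, 2, 3, 1).reshape(blob.shape[0], -1)

merged = np.concatenate(
    [permute_and_flatten(b) for b in (conf_conv4_3, conf_conv7)], axis=1)
# Now one reshape gives a clean (batch, total_boxes, num_classes) view.
print(merged.reshape(batch, -1, num_classes).shape)  # (1, 38*38*4 + 19*19*6, 21)
```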