weiliu89 / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Training and Prediction flow #388

Open leonardoaraujosantos opened 7 years ago

leonardoaraujosantos commented 7 years ago

Introduction

Hi @weiliu89, thanks again for the support. Could you confirm my findings from what I have learned about SSD so far?

Training phase

ssd_blockdiagram_train

  1. AnnotatedData layer: merges the default boxes with the ground truth (using the Jaccard overlap) and also does the augmentation.
  2. The image is given to a modified VGG_16 with the FC layers changed to CONV.
  3. The activation map from CONV4_3 is sampled and given to the "Detection part" layers.
  4. Three "Extra Conv" layers are cascaded, starting from the end (the converted FC part) of VGG_16.
  5. The activation maps from the "Extra Conv" layers are sampled and given to the "Detection part" layers.
  6. The prior boxes, confidences, and locations from all "Detection part" layers are merged.
  7. The merged information is given to the Multibox loss.
  8. The Multibox loss, besides calculating the loss, also does the hard negative mining.
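
As a rough illustration of the hard negative mining in step 8, here is a minimal Python sketch. It assumes the rule described in the SSD paper: unmatched (negative) default boxes are sorted by confidence loss and only the top ones are kept, so that the negative-to-positive ratio is at most 3:1. The function name and its arguments are hypothetical; this is not the code of MultiBoxLossLayer.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return indices of the negatives kept for the confidence loss.

    conf_losses: per-default-box confidence loss (list of floats).
    is_positive: per-default-box flag, True if matched to a ground truth box.
    neg_pos_ratio: assumed maximum ratio of negatives to positives.
    """
    num_pos = sum(1 for pos in is_positive if pos)
    # Collect the unmatched boxes and sort them by decreasing loss.
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    # Keep only the hardest negatives, up to ratio * number of positives.
    return negatives[:neg_pos_ratio * num_pos]


losses = [0.1, 2.0, 0.5, 0.3, 1.5, 0.05]
positive = [True, False, False, False, False, False]
kept = hard_negative_mining(losses, positive)
```

With one positive box, at most three negatives are kept, and they are the ones the classifier currently gets most wrong.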

Extra CONVS

extraconvs

Basically, this block will learn how to better scale down the features from the end of VGG_16. This will be useful for detecting small objects in the image.

Detection part layer

ssd_detection

This layer takes as input the feature map sampled from the middle of VGG and the CONV3x3 activation maps of the extra layers. It is composed of the following elements:

  1. Priorbox: generates default boxes using the image and feature map dimensions (and also crops boxes that fall outside the image).
  2. CONV 3x3: responsible for regressing the offset needed to place the bounding box centered on the object.
  3. CONV 3x3: responsible for giving a score to every available class plus the background class.
  4. During prediction, the output of this block is sent to the Detection Output layer, which does non-maximum suppression to decide which box fits the object best.
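
The non-maximum suppression mentioned in step 4 can be sketched as follows. This is a generic greedy NMS in plain Python, not the Detection Output layer itself; the function names, the `(xmin, ymin, xmax, ymax)` box format, and the 0.45 threshold are assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, overlap_threshold=0.45):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(jaccard(boxes[i], boxes[k]) <= overlap_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0.0, 0.0, 1.0, 1.0), (0.05, 0.0, 1.05, 1.0), (2.0, 2.0, 3.0, 3.0)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first almost completely and has a lower score, so it is suppressed, while the third box is far away and survives.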

Prediction Phase

ssd_blockdiagram

In the prediction phase, the main difference is that the merged locations, confidences, and default boxes are fed to the Detection Output layer, which decides which boxes best detect the objects found.

weiliu89 commented 7 years ago

A couple of points are not correct.

  1. In AnnotatedDataLayer, default box is not used. Only GroundTruth boxes are used to do data augmentation.

  2. Extra conv layers are useful for detecting bigger objects (not smaller).

  3. PriorBox is not cropped anymore (for the part outside of image).
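
To illustrate point 3, here is a small Python sketch of default box generation for one square feature map, following the formulas in the SSD paper: centers at ((i + 0.5) / f, (j + 0.5) / f), width s * sqrt(ar), and height s / sqrt(ar). The function name, the single-scale simplification, and the `clip` argument are assumptions for the example and do not reproduce PriorBoxLayer.

```python
import math


def prior_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5), clip=False):
    """Default boxes as (xmin, ymin, xmax, ymax) in normalized image coordinates."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Box center sits in the middle of each feature map cell.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                box = [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]
                if clip:
                    # Optional cropping of the part outside the image.
                    box = [min(max(v, 0.0), 1.0) for v in box]
                boxes.append(tuple(box))
    return boxes


unclipped = prior_boxes(2, 0.8)
clipped = prior_boxes(2, 0.8, clip=True)
```

Without clipping, boxes near the border extend outside the [0, 1] range; with `clip=True` they are cropped to the image.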

leonardoaraujosantos commented 7 years ago

Thanks @weiliu89!

So the AnnotatedDataLayer will not do the match between the GroundTruth and the default boxes? And if not where is this merge (Using the Jaccard overlap) done?

Thanks again for the support

weiliu89 commented 7 years ago

The match is done in the MultiBoxLossLayer.

leonardoaraujosantos commented 7 years ago

Thanks a lot, @weiliu89. I think the confusion was this:

I assumed the Jaccard overlap is used to transform the ground truth positions into something more compatible with the default boxes, so I followed every caller of the JaccardOverlap function:

  1. The AnnotatedDataLayer::load_batch method calls the function GenerateBatchSamples
  2. GenerateBatchSamples calls the function GenerateSamples
  3. GenerateSamples calls SatisfySampleConstraint
  4. SatisfySampleConstraint calls JaccardOverlap

Then, on page 5 of the paper, we have the following: "Matching strategy: During training we need to determine which default boxes correspond to a ground truth detection and train the network accordingly. For each ground truth box we are selecting from default boxes that vary over location, aspect ratio, and scale. We begin by matching each ground truth box to the default box with the best jaccard overlap"

As the AnnotatedDataLayer is used during training, I supposed that this layer was responsible for the matching strategy between the ground truth and the default boxes.

weiliu89 commented 7 years ago

AnnotatedDataLayer is used to load images, do data augmentation, and apply image transformations. It uses ground truth boxes to help guide the sampling process. GenerateSamples generates random regions (not default boxes) within an image.
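
A rough Python sketch of this kind of sampling is below: draw random regions and accept one only if it overlaps some ground truth box enough. The function names, the scale and aspect ratio ranges, and the trial limit are assumptions for illustration and are not taken from GenerateSamples or SatisfySampleConstraint.

```python
import math
import random


def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_region(gt_boxes, min_overlap, max_trials=50, rng=None):
    """Return a random region overlapping some ground truth box, or None."""
    rng = rng or random.Random()
    for _ in range(max_trials):
        # Assumed sampler ranges: scale in [0.3, 1], aspect ratio in [0.5, 2].
        scale = rng.uniform(0.3, 1.0)
        ar = rng.uniform(0.5, 2.0)
        w = min(scale * math.sqrt(ar), 1.0)
        h = min(scale / math.sqrt(ar), 1.0)
        xmin = rng.uniform(0.0, 1.0 - w)
        ymin = rng.uniform(0.0, 1.0 - h)
        region = (xmin, ymin, xmin + w, ymin + h)
        # The ground truth only guides acceptance; no default boxes are involved.
        if any(jaccard(region, gt) >= min_overlap for gt in gt_boxes):
            return region
    return None


gt = [(0.0, 0.0, 1.0, 1.0)]
region = sample_region(gt, 0.1, rng=random.Random(0))
```

The accepted region would then be used to crop the image for augmentation; matching against default boxes happens elsewhere.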

leonardoaraujosantos commented 7 years ago

Thanks for the explanation, @weiliu89. So which layer, or which part of the code, is responsible for the action described in this passage of the paper (matching against the ground truth using the Jaccard overlap)?

"Matching strategy: During training we need to determine which default boxes correspond to a ground truth detection and train the network accordingly. For each ground truth box we are selecting from default boxes that vary over location, aspect ratio, and scale. We begin by matching each ground truth box to the default box with the best jaccard overlap"

Thanks again. I'm updating the diagrams and will attach the explanations soon!

weiliu89 commented 7 years ago

MultiBoxLossLayer. There is a function called FindMatching.
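
For readers following along, the matching strategy from the paper can be sketched in Python as below. This is a simplified illustration and not the FindMatching implementation: the function names are hypothetical, the 0.5 threshold is the value quoted in the paper, and conflicts where two ground truth boxes pick the same default box are resolved naively (the later one wins).

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(default_boxes, gt_boxes, threshold=0.5):
    """Return, per default box, the index of its matched ground truth or -1."""
    match = [-1] * len(default_boxes)
    if not default_boxes or not gt_boxes:
        return match
    overlaps = [[jaccard(d, g) for g in gt_boxes] for d in default_boxes]
    # Step 1: each ground truth box claims the default box with the best overlap.
    for g in range(len(gt_boxes)):
        best_d = max(range(len(default_boxes)), key=lambda d: overlaps[d][g])
        if overlaps[best_d][g] > 0:
            match[best_d] = g
    # Step 2: remaining default boxes match any ground truth above the threshold.
    for d in range(len(default_boxes)):
        if match[d] == -1:
            best_g = max(range(len(gt_boxes)), key=lambda g: overlaps[d][g])
            if overlaps[d][best_g] > threshold:
                match[d] = best_g
    return match


defaults = [(0.0, 0.0, 0.5, 0.5), (0.05, 0.0, 0.55, 0.5), (0.5, 0.5, 1.0, 1.0)]
ground_truth = [(0.0, 0.0, 0.5, 0.5)]
matches = match_default_boxes(defaults, ground_truth)
```

In this example the first default box is the best match, the second is matched through the threshold rule, and the third stays unmatched and would be treated as a negative.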