yeezhu / SPN.pytorch

PyTorch implementation of "Soft Proposal Networks for Weakly Supervised Object Localization", ICCV 2017.
http://yzhu.work/spn.html
MIT License

Questions about the VOC2007 CorLoc results #10

Closed vadimkantorov closed 6 years ago

vadimkantorov commented 6 years ago
  1. Do I understand correctly that the model is trained on VOC2012 and COCO2014? (as stated in the "Pointing with Prediction" section: "We upgrade a pre-trained VGG16 model to SPN and respectively fine-tune it on VOC2012 and COCO2014 dataset for 20 epochs. Results are reported in Tab. 3.")

    Or is Table 4 produced with a model trained only on VOC2007?

  2. You aren't using VGG16's pretrained fully-connected layers, right?

  3. Are you applying any bounding box coordinate adjustments due to convolution padding in VGG?

Thanks!

yeezhu commented 6 years ago
  1. In Table 4, I trained SPN on VOC07.
  2. Yes, it is a truncated VGG16.
  3. No adjustments; we keep the same settings as other methods.

vadimkantorov commented 6 years ago

Thanks for your answers!

Here's a standalone script I wrote to evaluate CorLoc for a model trained by demo_voc2007.py; it gets CorLoc = 37% (versus 65% reported in the paper). Is there an obvious error?


# to be called as: python corloc.py ./logs/voc2007/checkpoint.pth.tar ../data/voc/VOCdevkit/VOC2007

import os
import sys
import types
import xml.dom.minidom
import scipy.misc
import torch
import experiment.models

def load_pascal_ground_truth_boxes(VOCdevkit_VOCYEAR):
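    # Build per-subset records: image names and paths, per-class labels (+1 / -1 / 0, from the per-class ImageSets files), and per-image ground-truth boxes [x1, y1, x2, y2, class_idx] parsed from the XML annotations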
    class_labels = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
    res = {subset : types.SimpleNamespace() for subset in ['train', 'val']}
    image_file_list = {subset : os.path.join(VOCdevkit_VOCYEAR, 'ImageSets', 'Main', subset + '.txt') for subset in res.keys()}
    for subset, s in res.items():
        s.image_file_name, s.image_file_path, s.ground_truth_boxes = [], [], []
        for example_idx, line in enumerate(open(image_file_list[subset])):
            s.image_file_name.append(line[:-1])
            s.image_file_path.append(os.path.join(VOCdevkit_VOCYEAR, 'JPEGImages', s.image_file_name[-1] + '.jpg'))

        s.class_labels = class_labels
        s.labels = torch.IntTensor(len(s.image_file_path), len(class_labels)).fill_(-1)
        for class_label_ind, class_label in enumerate(class_labels):
            for example_idx, line in enumerate(open(image_file_list[subset].replace(subset, class_label + '_' + subset))):
                s.labels[example_idx][class_label_ind] = 1 if ' 1' in line else (-1 if ' -1' in line else 0)
        for image_file_name in s.image_file_name:
            anno = xml.dom.minidom.parse(os.path.join(VOCdevkit_VOCYEAR, 'Annotations', image_file_name + '.xml'))
            boxes = []
            for elem in anno.getElementsByTagName('object'):
                coords = [int(elem.getElementsByTagName(tag)[0].firstChild.data) for tag in ['xmin', 'ymin', 'xmax', 'ymax']]
                class_label_ind = class_labels.index(elem.getElementsByTagName('name')[0].firstChild.data)
                boxes.append(coords + [class_label_ind])
            s.ground_truth_boxes.append(torch.Tensor(boxes))
    trainval = types.SimpleNamespace()
    trainval.image_file_name = res['train'].image_file_name + res['val'].image_file_name
    trainval.image_file_path = res['train'].image_file_path + res['val'].image_file_path
    trainval.ground_truth_boxes = res['train'].ground_truth_boxes + res['val'].ground_truth_boxes
    trainval.labels = torch.cat((res['train'].labels, res['val'].labels))
    trainval.class_labels = class_labels
    res['trainval'] = trainval
    return res

def area(boxes = None, x1 = None, y1 = None, x2 = None, y2 = None):
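    # Area of [x1, y1, x2, y2] boxes, or of intersections given their corner coordinates (clamped at zero when empty)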
    return (boxes.select(-1, 3) - boxes.select(-1, 1)) * (boxes.select(-1, 2) - boxes.select(-1, 0)) if boxes is not None else (x2 - x1).clamp(min = 0) * (y2 - y1).clamp(min = 0)

def overlap(box1, box2 = None, rectint = False):
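    # Pairwise IoU between two sets of boxes (or just the intersection area if rectint = True)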
    b1, b2 = [(b if b.dim() == 2 else b.unsqueeze(0)).t().contiguous() for b in [box1, (box2 if box2 is not None else box1)]]
    n1, n2 = b1.size(1), b2.size(1)

    xx1 = torch.max(b1[0].unsqueeze(1).expand(n1, n2), b2[0].unsqueeze(0).expand(n1, n2))
    yy1 = torch.max(b1[1].unsqueeze(1).expand(n1, n2), b2[1].unsqueeze(0).expand(n1, n2))
    xx2 = torch.min(b1[2].unsqueeze(1).expand(n1, n2), b2[2].unsqueeze(0).expand(n1, n2))
    yy2 = torch.min(b1[3].unsqueeze(1).expand(n1, n2), b2[3].unsqueeze(0).expand(n1, n2))

    inter = area(x1 = xx1, y1 = yy1, x2 = xx2, y2 = yy2)
    return inter / (area(b1.t()).unsqueeze(1).expand(n1, n2) + area(b2.t()).unsqueeze(0).expand(n1, n2) - inter) if not rectint else inter

def corloc(dataset_subset, boxes = None, threshold = 0.5):
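    # Per-class CorLoc: over images positive for a class, count a hit when the predicted box overlaps a ground-truth box of that class with IoU > threshold; return the mean over classes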
    class_corlocs = []
    for class_label_ind in range(len(dataset_subset.class_labels)):
        hits = []
        for example_idx in dataset_subset.labels[:, class_label_ind].eq(1).nonzero().squeeze():
            gt = dataset_subset.ground_truth_boxes[example_idx]
            iou = overlap(boxes[example_idx][class_label_ind], gt)
            hits.append(iou.gt(threshold).mul(gt[:, 4].eq(class_label_ind)).max())
        class_corlocs.append(torch.Tensor(hits).mean())
    return torch.Tensor(class_corlocs).mean()

def detect(conv_map, image_height, image_width):
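    # Upsample the per-class response maps to image size, binarize each at its mean, and return the bounding box of the surviving region for every class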
    conv_map = torch.nn.functional.upsample(conv_map.unsqueeze(0), size = (image_height, image_width), mode = 'bilinear').squeeze(0).data
    binary_map = conv_map > conv_map.view(len(conv_map), -1).mean(-1).view(len(conv_map), 1, 1)
    return torch.Tensor([[IJ[:, 1].min(), IJ[:, 0].min(), IJ[:, 1].max(), IJ[:, 0].max()] for IJ in map(torch.nonzero, binary_map)])

def produce_conv_map(model, x):
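    # Forward through the VGG features and SPN pooling layers, then apply the classifier weights as a 1x1 convolution to obtain per-class response maps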
    x = model.features(x)
    x = model.spatial_pooling.adconv(x)
    x = model.spatial_pooling.maps(x)
    x = model.spatial_pooling.sp(x)
    x = torch.nn.functional.conv2d(x, model.classifier[1].weight.unsqueeze(-1).unsqueeze(-1))
    return x

dataset_subset = load_pascal_ground_truth_boxes(sys.argv[2])['trainval']

state_dict = torch.load(sys.argv[1])['state_dict']
model = experiment.models.vgg16_sp(len(dataset_subset.class_labels))
model.load_state_dict(state_dict)
model.cuda()
model.eval()

conv_maps, image_height, image_width = [], [], []
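# Resize each trainval image to 224x224, normalize it, and collect its per-class response map together with the original image size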
for example_idx in range(len(dataset_subset.image_file_path)):
    image_rgb = torch.from_numpy(scipy.misc.imread(dataset_subset.image_file_path[example_idx])).permute(2, 0, 1).cuda().float() / 255.0
    image_rgb_normalized = (torch.nn.functional.upsample(image_rgb.unsqueeze(0), size = (224, 224), mode = 'bilinear').data - torch.Tensor(model.image_normalization_mean).cuda().view(1, 3, 1, 1)) / torch.Tensor(model.image_normalization_std).cuda().view(1, 3, 1, 1)
    conv_maps.append(produce_conv_map(model, torch.autograd.Variable(image_rgb_normalized, volatile = True)).squeeze(0).data.cpu())
    image_height.append(image_rgb.size(1))
    image_width.append(image_rgb.size(2))

boxes_gt_mean = list(map(detect, conv_maps, image_height, image_width))  # list(...) keeps this indexable under Python 3, where map returns an iterator
print(corloc(dataset_subset, boxes = boxes_gt_mean))

chenbinghui1 commented 6 years ago

@vadimkantorov Have you reproduced the CorLoc results in Table 4? I trained a VGG19 model and tested it on VOC2007, getting only 36%.

vadimkantorov commented 6 years ago

I haven't achieved the reported results either.

Hoping for an eventual testing code release by @yeezhu.

yeezhu commented 6 years ago

@vadimkantorov @chenbinghui1 Sorry for the delay; the evaluation demo will be released as soon as possible. Thank you for your patience!

yeezhu commented 6 years ago

@vadimkantorov @chenbinghui1 The demo has been released; please check here. Feel free to contact me if you have questions.

vadimkantorov commented 6 years ago

@yeezhu Thanks for the updated eval code. I noticed the Notebook has "Corloc: 60.50" versus 65% in the paper. Could you clarify the reasons for the disparity? (different base arch or something else?)

Thanks!

yeezhu commented 6 years ago

@vadimkantorov Please check the paper; the reported number is 60.6.

vadimkantorov commented 6 years ago

@yeezhu You are right. Sorry for the confusion.

vadimkantorov commented 6 years ago

UPD 1: found the cause of CorLoc: 0.00. The division in class_corloc.append(cor/len(cls_inds[0])) turns into integer division under Python 2; the fix is to wrap cor in a float() cast. After this, CorLoc: 38.29. Adding the image size 112 then brought it to CorLoc: 45.22.
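
For reference, the fixed line (names as in the demo's CorLoc code quoted above):

class_corloc.append(float(cor) / len(cls_inds[0]))  # float() avoids Python 2 integer (floor) division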

UPD 2: found the cause of the Out-Of-Memory crash. In modern PyTorch the volatile variable attribute has been deprecated, so https://github.com/yeezhu/SPN.pytorch/blob/master/demo/experiment/engine.py#L82-L84 should be replaced by torch.set_grad_enabled(training). It then runs fine on a 12 GB Titan X: CorLoc: 59.72.
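
A minimal sketch of the replacement (training, input and model are illustrative names for the engine's mode flag, batch and network; the exact surrounding code may differ):

# PyTorch <= 0.3 marked eval-time inputs volatile to skip autograd:
#   input_var = torch.autograd.Variable(input, volatile=not training)
# modern PyTorch deprecates volatile; toggle autograd explicitly instead:
torch.set_grad_enabled(training)
output = model(input)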

I've tried to run the repro.

Unfortunately, I was unable to train the model on all three scales: on my 12 GB Titan X, training crashes with Out-of-Memory when the image size 560 stage starts. So I trained the model for 20 epochs with an image size of only 224; that model obtained testMAP = 84.82%. A few questions, if you have the time to check them:

  1. If I may ask, what was the rationale for swapping the torchvision VGG16 weights for the Caffe ones?
  2. How important is the 560 scale, in your experience?
  3. Does training work for you on a 12 GB GPU (if you have tried)?

Then I fixed the DATA_ROOT path in EvaluationDemo.ipynb and put together the following CorLoc test script:

import os
import torch
import numpy as np
from PIL import Image
from tqdm import tqdm

from spn import object_localization
import experiment.util as utils
DATA_ROOT = '../data/voc/VOCdevkit/VOC2007'
ground_truth = utils.load_ground_truth_voc(DATA_ROOT, 'trainval')

model_path = './logs/voc2007/model.pth.tar'
model_dict = utils.load_model_voc(model_path, True)

predictions = []
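# For each image, localize boxes only for its ground-truth-positive classes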
for img_idx in tqdm(range(len(ground_truth['image_list']))):
    image_name = os.path.join(DATA_ROOT, 'JPEGImages', ground_truth['image_list'][img_idx] + '.jpg')
    _, input_var = utils.load_image_voc(image_name)
    preds, labels = object_localization(model_dict, input_var, location_type='bbox', gt_labels=(ground_truth['gt_labels'][img_idx] == 1).nonzero()[0], nms_threshold=0.7)
    predictions += [(img_idx,) + p for p in preds]

print("Corloc: {:.2f}".format(utils.corloc(np.array(predictions), ground_truth) * 100.))

This script prints CorLoc: 0.00, so obviously I'm doing something wrong. I put the model.pth.tar here: https://1drv.ms/u/s!Apx8USiTtrYmqOAfvzo955RfqBdOUA (66 MB). I would very much appreciate it if you could take a look at the snippet above or at the trained model file.

Thanks!

vadimkantorov commented 6 years ago

If there is interest, I can prepare a PR with the Python 2 compat changes and the modern-PyTorch change.

All my issues are now resolved, except the question about moving to the Caffe-based VGG16 weights. I'd appreciate it a lot if you could elaborate on this choice.

Thanks!

yeezhu commented 6 years ago

@vadimkantorov We use the Caffe-based VGG16 weights for consistency with the Torch version. It would be great if you could send the PR. Many thanks for your help :)

chenbinghui1 commented 6 years ago

@vadimkantorov Hi, due to personal reasons I have no time to reproduce the results, so I'd like to learn from your experience. Does the final CorLoc score (i.e., 60.6) come from fusing the three image scales (112, 224, 560)?

yeezhu commented 6 years ago

@chenbinghui1 Hi, following the compared methods (e.g., WSDDN), a multi-scale setting (112, 224, 560) is applied in the CorLoc experiment. If you have more implementation details to discuss, feel free to contact me at zhu.yee@outlook.com. Thanks!

vadimkantorov commented 6 years ago

@chenbinghui1 In my experience, multiple scales are extremely important for the final result (from what I understand, a separate model is trained for every scale). Also, multi_objects = True is extremely important. Without either, the result degrades significantly.

chenbinghui1 commented 6 years ago

@vadimkantorov @yeezhu Thanks for your help

vadimkantorov commented 6 years ago

@yeezhu I've done some stepping through, and it seems that during the CorLoc computation you keep several detections (around 5-8) for every positive class and then take the maximum over their respective IoUs. It seems this is why the multi_objects = True setting has so much impact on accuracy.

As far as I am aware, this is not the traditional definition; in https://arxiv.org/abs/1511.02853 or in https://arxiv.org/abs/1501.06170 only the top-1 detection for every positive class is considered.
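
For concreteness, a minimal sketch of the two variants (ious is an illustrative list holding, for each kept detection in score order, its best IoU against the ground-truth boxes of the class):

hit_top1 = ious[0] > 0.5       # top-1 CorLoc, as in the papers above
hit_multi = max(ious) > 0.5    # multi-object CorLoc, as computed here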

Grateful for any clarifications, happy to be corrected if I am wrong.

yeezhu commented 6 years ago

@vadimkantorov Sorry for the delay, we are on winter vacation :P

In contrast to the proposal classification methods you mentioned above, we extract object bboxes based on the class response maps of an image classification network, without using additional proposal priors, e.g., Selective Search (SS) or EdgeBoxes (EB). Therefore, the object bboxes predicted from the same map share the same class confidence score. We compute the CorLoc metric on multi-object predictions to evaluate the percentage of images in which SPN correctly localizes at least one object of the target class. And experiments show that the result drops significantly without the SP layer.

Thanks for the review; we will add some instructions to avoid confusion.

vadimkantorov commented 6 years ago

Happy Chinese New Year!