Closed vadimkantorov closed 6 years ago
Thanks for your answers!
Here's a standalone script I wrote to evaluate CorLoc for a model trained by demo_voc2007.py; it gets CorLoc = 37% (versus 65% reported in the paper). Is there an obvious error?
# to be called as: python corloc.py ./logs/voc2007/checkpoint.pth.tar ../data/voc/VOCdevkit/VOC2007
import os
import sys
import xml.dom.minidom
import scipy.misc
import torch
import experiment.models
def load_pascal_ground_truth_boxes(VOCdevkit_VOCYEAR):
    class_labels = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
    res = {subset : (lambda : None) for subset in ['train', 'val']}
    image_file_list = {subset : os.path.join(VOCdevkit_VOCYEAR, 'ImageSets', 'Main', subset + '.txt') for subset in res.keys()}
    for subset, s in res.items():
        s.image_file_name, s.image_file_path, s.ground_truth_boxes = [], [], []
        for example_idx, line in enumerate(open(image_file_list[subset])):
            s.image_file_name.append(line[:-1])
            s.image_file_path.append(os.path.join(VOCdevkit_VOCYEAR, 'JPEGImages', s.image_file_name[-1] + '.jpg'))
        s.class_labels = class_labels
        s.labels = torch.IntTensor(len(s.image_file_path), len(class_labels)).fill_(-1)
        for class_label_ind, class_label in enumerate(class_labels):
            for example_idx, line in enumerate(open(image_file_list[subset].replace(subset, class_label + '_' + subset))):
                s.labels[example_idx][class_label_ind] = 1 if ' 1' in line else (-1 if ' -1' in line else 0)
        for image_file_name in s.image_file_name:
            anno = xml.dom.minidom.parse(os.path.join(VOCdevkit_VOCYEAR, 'Annotations', image_file_name + '.xml'))
            s.ground_truth_boxes.append(torch.Tensor([[int(elem.getElementsByTagName(tag)[0].firstChild.data) for tag in ['xmin', 'ymin', 'xmax', 'ymax']] + [class_labels.index(elem.getElementsByTagName('name')[0].firstChild.data)] for elem in anno.getElementsByTagName('object')]))
    res['trainval'] = (lambda: None)
    res['trainval'].image_file_name, res['trainval'].image_file_path, res['trainval'].ground_truth_boxes, res['trainval'].labels, res['trainval'].class_labels = res['train'].image_file_name + res['val'].image_file_name, res['train'].image_file_path + res['val'].image_file_path, res['train'].ground_truth_boxes + res['val'].ground_truth_boxes, torch.cat((res['train'].labels, res['val'].labels)), class_labels
    return res
def area(boxes = None, x1 = None, y1 = None, x2 = None, y2 = None):
    return (boxes.select(-1, 3) - boxes.select(-1, 1)) * (boxes.select(-1, 2) - boxes.select(-1, 0)) if boxes is not None else (x2 - x1).clamp(min = 0) * (y2 - y1).clamp(min = 0)
def overlap(box1, box2 = None, rectint = False):
    b1, b2 = [(b if b.dim() == 2 else b.unsqueeze(0)).t().contiguous() for b in [box1, (box2 if box2 is not None else box1)]]
    n1, n2 = b1.size(1), b2.size(1)
    xx1 = torch.max(b1[0].unsqueeze(1).expand(n1, n2), b2[0].unsqueeze(0).expand(n1, n2))
    yy1 = torch.max(b1[1].unsqueeze(1).expand(n1, n2), b2[1].unsqueeze(0).expand(n1, n2))
    xx2 = torch.min(b1[2].unsqueeze(1).expand(n1, n2), b2[2].unsqueeze(0).expand(n1, n2))
    yy2 = torch.min(b1[3].unsqueeze(1).expand(n1, n2), b2[3].unsqueeze(0).expand(n1, n2))
    inter = area(x1 = xx1, y1 = yy1, x2 = xx2, y2 = yy2)
    return inter / (area(b1.t()).unsqueeze(1).expand(n1, n2) + area(b2.t()).unsqueeze(0).expand(n1, n2) - inter) if not rectint else inter
def corloc(dataset_subset, boxes = None, threshold = 0.5):
    class_corlocs = []
    for class_label_ind in range(len(dataset_subset.class_labels)):
        class_corlocs.append(torch.Tensor([overlap(boxes[example_idx][class_label_ind], dataset_subset.ground_truth_boxes[example_idx]).gt(threshold).mul(dataset_subset.ground_truth_boxes[example_idx][:, 4].eq(class_label_ind)).max() for example_idx in dataset_subset.labels[:, class_label_ind].eq(1).nonzero().squeeze()]).mean())
    return torch.Tensor(class_corlocs).mean()
def detect(conv_map, image_height, image_width):
    conv_map = torch.nn.functional.upsample(conv_map.unsqueeze(0), size = (image_height, image_width), mode = 'bilinear').squeeze(0).data
    binary_map = conv_map > conv_map.view(len(conv_map), -1).mean(-1).view(len(conv_map), 1, 1)
    return torch.Tensor([[IJ[:, 1].min(), IJ[:, 0].min(), IJ[:, 1].max(), IJ[:, 0].max()] for IJ in map(torch.nonzero, binary_map)])
def produce_conv_map(model, x):
    x = model.features(x)
    x = model.spatial_pooling.adconv(x)
    x = model.spatial_pooling.maps(x)
    x = model.spatial_pooling.sp(x)
    x = torch.nn.functional.conv2d(x, model.classifier[1].weight.unsqueeze(-1).unsqueeze(-1))
    return x
dataset_subset = load_pascal_ground_truth_boxes(sys.argv[2])['trainval']
state_dict = torch.load(sys.argv[1])['state_dict']
model = experiment.models.vgg16_sp(len(dataset_subset.class_labels))
model.load_state_dict(state_dict)
model.cuda()
model.eval()
conv_maps, image_height, image_width = [], [], []
for example_idx in range(len(dataset_subset.image_file_path)):
    image_rgb = torch.from_numpy(scipy.misc.imread(dataset_subset.image_file_path[example_idx])).permute(2, 0, 1).cuda().float() / 255.0
    image_rgb_normalized = (torch.nn.functional.upsample(image_rgb.unsqueeze(0), size = (224, 224), mode = 'bilinear').data - torch.Tensor(model.image_normalization_mean).cuda().view(1, 3, 1, 1)) / torch.Tensor(model.image_normalization_std).cuda().view(1, 3, 1, 1)
    conv_maps.append(produce_conv_map(model, torch.autograd.Variable(image_rgb_normalized, volatile = True)).squeeze(0).data.cpu())
    image_height.append(image_rgb.size(1))
    image_width.append(image_rgb.size(2))
boxes_gt_mean = map(detect, conv_maps, image_height, image_width)
print(corloc(dataset_subset, boxes = boxes_gt_mean))
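For readers following the script, the box extraction in detect (threshold each class map at its own mean, then take the tight box around everything above it) can be sketched in plain Python. This is only an illustration under assumptions: box_from_response_map is a hypothetical name, the map is a plain list of rows rather than a tensor, and an all-constant map returns None, a case the script above does not handle.

```python
def box_from_response_map(response_map):
    """Tight (xmin, ymin, xmax, ymax) box around values above the map mean.

    response_map is a list of rows (hypothetical stand-in for one class map).
    Returns None when no value is strictly above the mean.
    """
    values = [v for row in response_map for v in row]
    threshold = sum(values) / float(len(values))
    # Collect (row, col) positions of all pixels strictly above the mean.
    hot = [(i, j)
           for i, row in enumerate(response_map)
           for j, v in enumerate(row)
           if v > threshold]
    if not hot:
        return None
    rows = [i for i, _ in hot]
    cols = [j for _, j in hot]
    # x corresponds to columns, y to rows, as in the script's detect().
    return (min(cols), min(rows), max(cols), max(rows))


toy_map = [
    [0, 0, 0, 0],
    [0, 5, 6, 0],
    [0, 4, 0, 0],
]
toy_box = box_from_response_map(toy_map)
flat_box = box_from_response_map([[2, 2], [2, 2]])
```

On the toy map the mean is 1.25, so the three non-zero pixels are kept and the box covers columns 1 to 2 and rows 1 to 2. A single mean threshold always yields one box per class, which may matter when an image contains several instances.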
@vadimkantorov Have you achieved the CorLoc results in Table 4? I trained a VGG19 model and tested it on VOC2007, only getting 36%.
I haven't achieved the reported results either.
Hoping for an eventual testing code release by @yeezhu.
@vadimkantorov @chenbinghui1 Sorry for the delay, the evaluation demo will be released as soon as possible. Thank you for your patience!
@vadimkantorov @chenbinghui1 The demo was released, please check here. Please feel free to contact me if you have questions.
@yeezhu Thanks for the updated eval code. I noticed the Notebook has "Corloc: 60.50" versus 65% in the paper. Could you clarify the reasons for the disparity? (different base arch or something else?)
Thanks!
@yeezhu You are right. Sorry for the confusion.
UPD 1: found the issue with CorLoc: 0.00. It is caused by the division in class_corloc.append(cor/len(cls_inds[0])), which turns into integer division in Python 2; a fix is to put a float() cast around cor. After this, CorLoc: 38.29. Trying to add the image size 112. Adding 112 led to Corloc: 45.22.
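The integer-division pitfall is easy to reproduce in isolation. A minimal sketch, assuming cor is an integer count of correctly localized images and the denominator is the number of positive images for the class; both function names are hypothetical, and `//` stands in for what `/` does on two ints in Python 2:

```python
def class_corloc_py2_style(cor, num_positive_images):
    # Floor division: this is how `cor / n` behaves for two ints in Python 2.
    return cor // num_positive_images


def class_corloc_fixed(cor, num_positive_images):
    # Casting the numerator to float forces true division in Python 2 and 3.
    return float(cor) / num_positive_images


truncated = class_corloc_py2_style(3, 8)
correct = class_corloc_fixed(3, 8)
```

With 3 hits out of 8 images, the first version gives 0 and the second gives 0.375. Since a per-class ratio is always below 1 unless every image is a hit, floor division zeroes out almost every class, which is consistent with the 0.00 reported above. Adding `from __future__ import division` at the top of the module is another way to get the same effect in Python 2.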
UPD 2: found the issue with Out-Of-Memory. In modern PyTorch, the volatile Variable attribute has been deprecated, so https://github.com/yeezhu/SPN.pytorch/blob/master/demo/experiment/engine.py#L82-L84 should be replaced by torch.set_grad_enabled(training). Then it works fine on a 12 GB Titan X: Corloc: 59.72
I've tried to run the repro.
Unfortunately, I was unable to train the model on all three scales. On my 12 GB Titan X it crashes with Out-of-Memory as soon as training at image size 560 starts. So I trained the model for 20 epochs with an image size of only 224. The model obtained testMAP = 84.82%. A few questions, if you have the time to check them:
Then I've fixed the path DATA_ROOT in EvaluationDemo.ipynb and composed the following CorLoc test script:
import os
import torch
import numpy as np
from PIL import Image
from tqdm import tqdm
from spn import object_localization
import experiment.util as utils
DATA_ROOT = '../data/voc/VOCdevkit/VOC2007'
ground_truth = utils.load_ground_truth_voc(DATA_ROOT, 'trainval')
model_path = './logs/voc2007/model.pth.tar'
model_dict = utils.load_model_voc(model_path, True)
predictions = []
for img_idx in tqdm(range(len(ground_truth['image_list']))):
    image_name = os.path.join(DATA_ROOT, 'JPEGImages', ground_truth['image_list'][img_idx] + '.jpg')
    _, input_var = utils.load_image_voc(image_name)
    preds, labels = object_localization(model_dict, input_var, location_type='bbox', gt_labels=(ground_truth['gt_labels'][img_idx] == 1).nonzero()[0], nms_threshold=0.7)
    predictions += [(img_idx,) + p for p in preds]
print("Corloc: {:.2f}".format(utils.corloc(np.array(predictions), ground_truth) * 100.))
This script prints CorLoc: 0.00. Obviously I'm doing something wrong. I put the model.pth.tar here: https://1drv.ms/u/s!Apx8USiTtrYmqOAfvzo955RfqBdOUA (66 Mb). I would very much appreciate it if you could take a look at the snippet above or the trained model file.
Thanks!
If there is interest, I can prepare a PR with the Python 2 compatibility changes and the modern PyTorch change.
All my issues are now resolved, except for the question about moving to Caffe-based VGG16 weights. I would appreciate it a lot if you could elaborate on this choice.
Thanks!
@vadimkantorov We use the Caffe-based VGG16 weights for consistency with the Torch version. It would be great if you could send the PR. Many thanks for your help :)
@vadimkantorov Hi, for personal reasons I have no time to reproduce the results, so I would like to learn from your experience. Does the final CorLoc score (i.e. 60.6) come from the fusion of three image scales (112, 224, 560)?
@chenbinghui1 Hi, following the compared methods, e.g., WSDDN, a multi-scale setting (112, 224, 560) is applied in the CorLoc experiment. If you have more implementation details to discuss, feel free to contact me via zhu.yee@outlook.com. Thanks!
@chenbinghui1 From my experience, multiple scales are extremely important for the final result (from what I understand, a separate model is trained for every scale). Also, multi_objects = True is extremely important. Without either, the result degrades significantly.
@vadimkantorov @yeezhu Thanks for your help
@yeezhu I've done some stepping through, and it seems that during CorLoc computation you're keeping several detections (around 5-8) for every positive class, and then computing the maximum over their respective IoUs. It seems this is why the multi_objects = True setting has so much impact on accuracy.
As far as I am aware, this is not the traditional definition: in https://arxiv.org/abs/1511.02853 and in https://arxiv.org/abs/1501.06170, only the top-1 detection for every positive class is considered.
Grateful for any clarifications, happy to be corrected if I am wrong.
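To make the difference concrete, here is a toy comparison of the two counting rules. It is a sketch under assumptions, not the repository's evaluation code: iou and corloc are hypothetical helpers, boxes are (xmin, ymin, xmax, ymax) tuples for a single class, predictions are assumed to be ordered by confidence, and a hit requires IoU strictly above 0.5 as in the script at the top of the thread.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / float(union) if union > 0 else 0.0


def corloc(images, top_k=None, threshold=0.5):
    """Fraction of images where a kept prediction overlaps a ground-truth box.

    images: list of (predicted_boxes, ground_truth_boxes) pairs for one class.
    top_k=1 keeps only the first prediction; top_k=None keeps all of them.
    """
    hits = 0
    for predicted, ground_truth in images:
        kept = predicted if top_k is None else predicted[:top_k]
        if any(iou(p, g) > threshold for p in kept for g in ground_truth):
            hits += 1
    return float(hits) / len(images)


toy_images = [
    # First prediction misses, second one matches the object.
    ([(0, 0, 10, 10), (20, 20, 40, 40)], [(21, 21, 40, 40)]),
    # Single prediction that matches the object.
    ([(0, 0, 10, 10)], [(0, 0, 10, 9)]),
]
top1_score = corloc(toy_images, top_k=1)
multi_score = corloc(toy_images)
```

On this toy data the top-1 rule scores 0.5 and the keep-all rule scores 1.0, so taking the maximum over several detections can only raise the number relative to top-1.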
@vadimkantorov Sorry for the delay, we are on the winter vacation :P
In contrast to the proposal classification methods you mentioned above, we extract object bboxes based on class response maps of image classification networks without using additional proposal priors, e.g. SS or EB. Therefore the object bboxes predicted from the same map share the same class confidence score. We perform the CorLoc metric on multi-object predictions to evaluate the percentage of images in which SPN correctly localizes at least one object of the target class. And experiments show that the result drops significantly without the SP layer.
Thanks for the review, we will add some instructions to avoid confusion.
Happy Lunar New Year!
Do I understand correctly that the model is trained on VOC2012 and COCO2014? (as stated in the Pointing with Prediction section, "We upgrade a pre-trained VGG16 model to SPN and respectively fine-tune it on VOC2012 and COCO2014 dataset for 20 epochs. Results are reported in Tab. 3.")
Or is Table 4 produced with a model trained only on VOC2007?
You aren't using VGG16's pretrained fully-connected layers, right?
Are you applying any bounding box coordinate adjustments due to convolution padding in VGG?
Thanks!