@pyupcgithub if you want to train the model with pycaffe, you have to write the data layer yourself and then train the model. But that is really difficult, because the data layer is quite complex.
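The skeleton of such a pycaffe data layer looks roughly like this (a minimal sketch only; the module/class names, the param_str keys, and the fixed input size are placeholders, and a real detection data layer additionally has to emit ground-truth boxes, im_info, sampling, and augmentation, which is where the complexity lies):

import caffe
import numpy as np
import cv2

# prototxt side, for reference:
# layer {
#   name: "data"  type: "Python"
#   top: "data"   top: "label"
#   python_param { module: "simple_data_layer" layer: "SimpleDataLayer"
#                  param_str: "{'source': 'train.txt', 'batch_size': 1}" }
# }
class SimpleDataLayer(caffe.Layer):
    def setup(self, bottom, top):
        params = eval(self.param_str)  # set via python_param in the prototxt
        self.batch_size = params.get('batch_size', 1)
        self.image_list = open(params['source']).read().splitlines()
        self.idx = 0

    def reshape(self, bottom, top):
        # fixed input size for simplicity; real detection layers reshape
        # per batch to the sampled image size
        top[0].reshape(self.batch_size, 3, 600, 1000)  # data
        top[1].reshape(self.batch_size, 1)             # label (placeholder)

    def forward(self, bottom, top):
        for i in range(self.batch_size):
            im = cv2.imread(self.image_list[self.idx]).astype(np.float32)
            im -= np.array([104, 117, 123], dtype=np.float32)  # BGR mean
            im = cv2.resize(im, (1000, 600))
            top[0].data[i, ...] = im.transpose((2, 0, 1))  # HWC -> CHW
            top[1].data[i, ...] = 0  # fill in the real target here
            self.idx = (self.idx + 1) % len(self.image_list)

    def backward(self, top, propagate_down, bottom):
        pass  # data layers do not backpropagate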
@Peng-wei-Yu ....... if so, how could I use the MATLAB script with the pretrained model to run detection on a single image?
@Shadow992 why can't I open the URL?
@pyupcgithub @Peng-wei-Yu @Shadow992 Hi, have you managed to train the resnet101+fpn+cascade or resnet50+fpn+cascade model on your own dataset?
@huinsysu I have trained my own model based on Inception v3 on my own data. However, I trained it on two different datasets:
1. Detecting license plates in images. This worked amazingly well: without much finetuning I reached quite good results (about 90% accuracy on different datasets).
2. Detecting characters on the cropped license plates. This one still does not work after two months of testing/finetuning/etc.
I found other threads and especially papers like "DSOD: Learning Deeply Supervised Object Detectors from Scratch" ( http://openaccess.thecvf.com/content_ICCV_2017/papers/Shen_DSOD_Learning_Deeply_ICCV_2017_paper.pdf ), which suggest that training detection models from scratch may fail because of the ROI pooling layer, so you should always start from a pre-trained model. I therefore tried training a plain classifier on the dataset that only classifies the characters, then kept its convolutional layers and replaced the rest with the RCNN components. It still fails to learn. No matter what hyperparameters, regions, aspect ratios, anchor points, etc. I choose, it always fails. If there were no paper using Faster RCNN for OCR, I would have said: Faster RCNN simply cannot do it.
But I guess there is something wrong with my dataset/training/hyperparameters. However, I have not been able to find it so far...
Edit: "Does not work" means not bad results, but accuracy of about maybe 0.1% for detection/localization. It just does not work. However when I overfit the model on my training data and use the same inference code in C++ I used to use, then it works like a charm. Therefore I highly suspect that the way I am doing training or similar does not work perfectly or needs more finetuning. Especially localization seems to work extremely poor. Classification seems okish to me but still quite bad.
@Shadow992 can you show me the URL? I can't view it.
@Shadow992 Thanks for your detailed reply. When I trained the res50-15s-800-fpn-cascade model on my dataset, I just used the ImageNet pre-trained model, but the results were very bad. The scores of the bboxes the model detected were very low, which means the model classified those bboxes as background, and I have no idea how to solve this. So I plan to dive into the training code and hope to find the cause of the problem.
@huinsysu This also happened to me when training for character recognition: my model nearly always predicts background with ~90% or higher probability. I am suffering from exactly the same problem. However, as mentioned, nearly the same network works great for license plate detection, so I guess something is wrong, or it at least needs more finetuning. I am now trying to make the ROI pooling layer much bigger (now a size of 15x9). Hopefully this helps, but I doubt it.
A "tiny workaround" would be to interpret backgrounds with lower probability of 90% or similar as foreground and extract the maximum foreground object. But this does still not work quite good, especially when thinking about bbox regression and similar...
What kind of objects do you want to detect?
@pyupcgithub
Python inference:
from __future__ import division, print_function

import os
import sys
import argparse
import time

import numpy as np
import cv2
from PIL import Image, ImageDraw

# Make sure that caffe is on the python path:
caffe_root = '../../..'
sys.path.insert(0, os.path.join(caffe_root, 'python'))
import caffe


class CaffeDetection:
    def __init__(self, gpu_id, model_def, model_weights, cascade=0, FPN=0):
        if gpu_id < 0:
            caffe.set_mode_cpu()
        else:
            caffe.set_device(gpu_id)
            caffe.set_mode_gpu()

        # Load the net in the test phase for inference.
        self.net = caffe.Net(model_def,      # defines the structure of the model
                             model_weights,  # contains the trained weights
                             caffe.TEST)     # use test mode (e.g., don't perform dropout)
        self.cascade = cascade > 0
        self.FPN = FPN > 0

        if not self.cascade:
            # baseline model
            if self.FPN:
                self.proposal_blob_names = ['proposals_to_all']
            else:
                self.proposal_blob_names = ['proposals']
            self.bbox_blob_names = ['output_bbox_1st']
            self.cls_prob_blob_names = ['cls_prob_1st']
            self.output_names = ['1st']
        else:
            # cascade-rcnn model; the last two entries reuse the 2nd/3rd stage
            # proposals for the score-averaged outputs
            if self.FPN:
                self.proposal_blob_names = ['proposals_to_all', 'proposals_to_all_2nd',
                                            'proposals_to_all_3rd', 'proposals_to_all_2nd',
                                            'proposals_to_all_3rd']
            else:
                self.proposal_blob_names = ['proposals', 'proposals_2nd', 'proposals_3rd',
                                            'proposals_2nd', 'proposals_3rd']
            self.bbox_blob_names = ['output_bbox_1st', 'output_bbox_2nd', 'output_bbox_3rd',
                                    'output_bbox_2nd', 'output_bbox_3rd']
            self.cls_prob_blob_names = ['cls_prob_1st', 'cls_prob_2nd', 'cls_prob_3rd',
                                        'cls_prob_2nd_avg', 'cls_prob_3rd_avg']
            self.output_names = ['1st', '2nd', '3rd', '2nd_avg', '3rd_avg']

        self.num_outputs = len(self.proposal_blob_names)
        assert self.num_outputs == len(self.bbox_blob_names)
        assert self.num_outputs == len(self.cls_prob_blob_names)
        assert self.num_outputs == len(self.output_names)

        # detection configuration
        # self.det_thr = 0.001  # threshold for testing
        self.det_thr = 0.3      # threshold for demo
        self.max_per_img = 100  # max number of detections per image
        self.nms_thresh = 0.5   # NMS IoU threshold
        if FPN:
            self.shortSize = 800
            self.longSize = 1312
        else:
            self.shortSize = 600
            self.longSize = 1000
        # the mean must be float; subtracting uint8 arrays would underflow
        self.PIXEL_MEANS = np.array([104, 117, 123], dtype=np.float32)
        self.num_cls = 80  # must match the number of foreground classes of the model

    def detect(self, image_file):
        '''rcnn detection on a single image'''
        image = cv2.imread(image_file)  # BGR, 3-channel (cv2.IMREAD_COLOR)
        orgH, orgW, channel = image.shape
        print("image shape:", image.shape)

        # resize so the short side is shortSize, capped by longSize
        rzRatio = self.shortSize / min(orgH, orgW)
        imgH = min(rzRatio * orgH, self.longSize)
        imgW = min(rzRatio * orgW, self.longSize)
        imgH = round(imgH / 32) * 32  # must be a multiple of 32
        imgW = round(imgW / 32) * 32
        hwRatios = [imgH / orgH, imgW / orgW]
        resized_w = int(imgW)
        resized_h = int(imgH)
        print('resized ->', (resized_w, resized_h))
        image = cv2.resize(image, (resized_w, resized_h), interpolation=cv2.INTER_LINEAR)
        image = image.astype(np.float32)
        image -= self.PIXEL_MEANS
        transformed_image = np.transpose(image, (2, 0, 1))  # HWC -> CHW

        # set net to batch size of 1 and run the forward pass
        self.net.blobs['data'].reshape(1, 3, resized_h, resized_w)
        self.net.blobs['data'].data[...] = transformed_image
        start = time.time()
        blobs_out = self.net.forward()
        end = time.time()
        print('output_bbox_1st ->', blobs_out['output_bbox_1st'].shape)
        print('detection cost ms:', int((end - start) * 1000))

        detect_final_boxes = []
        for nn in range(self.num_outputs):
            # bbox blob rows: [batch_ind, x1, y1, x2, y2]; copy before modifying
            tmp = self.net.blobs[self.bbox_blob_names[nn]].data.copy()
            print(self.bbox_blob_names[nn], tmp.shape)
            tmp = tmp[:, :, 0, 0]
            # scale back to the original image size
            tmp[:, 1] /= hwRatios[1]
            tmp[:, 3] /= hwRatios[1]
            tmp[:, 2] /= hwRatios[0]
            tmp[:, 4] /= hwRatios[0]
            # clip boxes to image borders
            tmp[:, 1] = np.maximum(0, tmp[:, 1])
            tmp[:, 2] = np.maximum(0, tmp[:, 2])
            tmp[:, 3] = np.minimum(orgW, tmp[:, 3])
            tmp[:, 4] = np.minimum(orgH, tmp[:, 4])
            tmp[:, 3] = tmp[:, 3] - tmp[:, 1] + 1  # w
            tmp[:, 4] = tmp[:, 4] - tmp[:, 2] + 1  # h
            output_bboxs = tmp[:, 1:]

            tmp = self.net.blobs[self.cls_prob_blob_names[nn]].data
            print(self.cls_prob_blob_names[nn], tmp.shape)
            cls_prob = tmp.reshape((-1, self.num_cls + 1))

            tmp = self.net.blobs[self.proposal_blob_names[nn]].data.copy()
            print(self.proposal_blob_names[nn], tmp.shape)
            tmp = tmp[:, 1:]
            tmp[:, 2] = tmp[:, 2] - tmp[:, 0] + 1  # w
            tmp[:, 3] = tmp[:, 3] - tmp[:, 1] + 1  # h
            proposals = tmp

            # drop degenerate proposals
            keep_id = np.where((proposals[:, 2] > 0) & (proposals[:, 3] > 0))[0]
            proposals = proposals[keep_id, :]
            output_bboxs = output_bboxs[keep_id, :]
            cls_prob = cls_prob[keep_id, :]

            detect_boxes = []
            for i in range(self.num_cls):
                cls_id = i + 1  # 0 is background
                prob = cls_prob[:, cls_id][:, np.newaxis]
                bbset = np.hstack([output_bboxs, prob])
                if self.det_thr > 0:
                    keep_id = np.where(prob >= self.det_thr)[0]
                    bbset = bbset[keep_id, :]
                keep = self.cpu_nms_single_cls(bbset, self.nms_thresh)
                if len(keep) == 0:
                    continue
                bbset = bbset[keep, :]
                cls_ids = np.array([cls_id] * len(bbset))[:, np.newaxis]
                detect_boxes.extend(np.hstack([cls_ids, bbset]).tolist())

            print('detected box num:', len(detect_boxes))
            detect_boxes = np.asarray(detect_boxes)
            if self.max_per_img > 0 and len(detect_boxes) > self.max_per_img:
                # keep only the max_per_img highest-scoring boxes
                rank_scores = np.sort(detect_boxes[:, 5])[::-1]  # descending
                keep_id = np.where(detect_boxes[:, 5] >= rank_scores[self.max_per_img - 1])[0]
                detect_boxes = detect_boxes[keep_id, :]
            detect_final_boxes.append(detect_boxes.tolist())
        return detect_final_boxes

    def cpu_nms_single_cls(self, dets, thresh):
        """Pure Python NMS baseline on [x, y, w, h, score] rows."""
        x1 = dets[:, 0]
        y1 = dets[:, 1]
        w = dets[:, 2]
        h = dets[:, 3]
        scores = dets[:, 4]
        x2 = x1 + w - 1
        y2 = y1 + h - 1
        areas = w * h
        order = scores.argsort()[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # intersection of the current top box with all remaining boxes
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            iw = np.maximum(0.0, xx2 - xx1 + 1)
            ih = np.maximum(0.0, yy2 - yy1 + 1)
            inter = iw * ih
            ovr = inter / (areas[i] + areas[order[1:]] - inter)
            inds = np.where(ovr <= thresh)[0]
            order = order[inds + 1]
        return keep


def main(args):
    '''run detection on one image and draw the result'''
    # class-name lists for two datasets (not used below; the demo draws class ids)
    wordname_15 = ['__background__', 'plane', 'baseball-diamond', 'bridge', 'ground-track-field',
                   'small-vehicle', 'large-vehicle', 'ship', 'tennis-court', 'basketball-court',
                   'storage-tank', 'soccer-ball-field', 'roundabout', 'harbor', 'swimming-pool',
                   'helicopter']
    wordname_5 = ['__background__', '1:plane', '2:ship', '3:storage', '4:harbor', '5:bridge']
    detection = CaffeDetection(args.gpu_id, args.model_def, args.model_weights,
                               cascade=args.cascade, FPN=args.FPN)
    results = detection.detect(args.image_file)
    img = Image.open(args.image_file)
    draw = ImageDraw.Draw(img)
    for item in results[-1]:  # the last output, e.g. 3rd_avg for the cascade model
        xmin = int(round(item[1]))
        ymin = int(round(item[2]))
        xmax = int(round(item[1] + item[3] - 1))
        ymax = int(round(item[2] + item[4] - 1))
        cls_id = int(item[0])
        draw.rectangle([xmin, ymin, xmax, ymax], outline=(255, 0, 0))
        draw.text([xmin, ymin], str(cls_id), (0, 0, 255))
        print([cls_id, xmin, ymin, xmax, ymax, round(item[-1] * 1000) / 1000])
    img.save('detect_result.jpg')


def parse_args():
    '''parse args'''
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu_id', type=int, default=0, help='gpu id')
    parser.add_argument('--model_def', default='models/deploy.prototxt')
    parser.add_argument('--cascade', default=0, type=int)
    parser.add_argument('--FPN', default=0, type=int)
    parser.add_argument('--model_weights', default='models/models_iter_120000.caffemodel')
    parser.add_argument('--image_file', default='examples/images/fish-bike.jpg')
    return parser.parse_args()


if __name__ == '__main__':
    main(parse_args())
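Assuming you save this as e.g. demo.py inside the repo (the filename is arbitrary), it can be run like:

python demo.py --gpu_id 0 --cascade 1 --FPN 1 --model_def models/deploy.prototxt --model_weights models/models_iter_120000.caffemodel --image_file examples/images/fish-bike.jpg

(the paths are just the argparse defaults above; adjust them to your model).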
@Shadow992 I am participating in a tank detection competition and there are 189 classes in the dataset. I tried to test the effect of the RPN on some pictures and found that the boxes the RPN proposed were not so bad; at least the high-score bboxes were around the ground truth, so I guess the RPN works. It seems strange to me that classification performs well in stage 1 but badly in stage 2. Since there are so many classes in the dataset, I want to reduce the number of training classes and see how it behaves on the subset. If you find any solution to this problem, please let me know. Thanks!
@huinsysu Can we stay in closer contact, e.g. via Skype/Discord or similar, so we can update each other on a regular basis? I would also be highly interested in solving this issue...
You can simply write me a mail if you want to: Removed Email
@Shadow992 thank you for the python inference code.
@Shadow992 I tested the python inference code you provided; however, I found it does not get results as good as the MATLAB inference code. I just tested the trained model the author provides. Do you know why?
@Shadow992 like this: I use the same model with different inference code and only detect people. Using the MATLAB inference code the result is 4; using the python inference code the result is 5. This is one of the differences. Can you help me solve this problem?
You probably have to finetune parameters and change algorithms, maybe apply some post-processing. However, as I am not the author and this goes beyond the scope of this issue, I suggest closing it and playing around with the code yourself.
en..... @Shadow992 glad to receive your reply. One more question: if I run the MATLAB inference code and the python inference code on the same image, the outputs are different, especially the confidence scores. I cannot understand why the outputs differ although I use the same model on the same image. Is it just because of the difference between the MATLAB and python interfaces? I feel really confused. Looking forward to your reply.
@Shadow992 like this output, for example. In MATLAB: detect_boxes =
1.0e+03 *
0.0010 0.0010 0.0411 0.2543 0.1361 0.1713 0.0010
0.0010 0.0010 1.0196 0.5097 0.1098 0.1028 0.0010
0.0010 0.0010 0.8632 0.4844 0.1704 0.1621 0.0010
0.0010 0.0010 0.4153 0.2646 0.1221 0.1774 0.0009
0.0010 0.0010 0.8185 0.3617 0.0951 0.1131 0.0009
0.0010 0.0010 0.5891 0.4763 0.1830 0.2173 0.0009
0.0010 0.0010 0 0.0624 0.0936 0.1204 0.0009
0.0010 0.0010 0.7620 0.4800 0.1422 0.1488 0.0009
0.0010 0.0010 0.5406 0.2861 0.1220 0.1924 0.0009
0.0010 0.0010 0.4313 0.1939 0.1162 0.1401 0.0009
0.0010 0.0010 0.7104 0.3299 0.1051 0.1452 0.0009
0.0010 0.0010 0.6834 0.2532 0.0991 0.1096 0.0009
0.0010 0.0010 0.2715 0.1798 0.1358 0.3119 0.0008
0.0010 0.0010 0.5974 0.2326 0.0972 0.1194 0.0007
in Python:
[1.0, 880.21435546875, 481.3125305175781, 157.732666015625, 167.05300903320312, 0.9509992599487305]
[1.0, 17.491165161132812, 251.82504272460938, 161.8278350830078, 180.31094360351562, 0.8915714025497437]
[1.0, 1019.8265991210938, 510.274169921875, 117.43792724609375, 108.07086181640625, 0.8827307224273682]
[1.0, 3.20751953125, 62.82538986206055, 89.68313598632812, 127.01773071289062, 0.8688411116600037]
[1.0, 413.26153564453125, 263.39776611328125, 172.83038330078125, 176.29830932617188, 0.7869312763214111]
[1.0, 872.2628784179688, 398.31329345703125, 47.70599365234375, 77.80911254882812, 0.7195702791213989]
[1.0, 482.8349304199219, 190.6676788330078, 141.91409301757812, 143.6922149658203, 0.6592983603477478]
[1.0, 678.8290405273438, 250.62388610839844, 103.9158935546875, 187.9010467529297, 0.6096863746643066]
[1.0, 417.4122009277344, 269.6662902832031, 256.9304504394531, 228.03402709960938, 0.5993428826332092]
[1.0, 543.610595703125, 287.9579772949219, 129.0999755859375, 187.22341918945312, 0.5436170697212219]
[1.0, 603.4822998046875, 230.12550354003906, 91.89520263671875, 231.51707458496094, 0.5290579199790955]
[1.0, 1256.5042724609375, 627.1253051757812, 23.7042236328125, 65.6826171875, 0.4823896884918213]
[1.0, 691.6392211914062, 254.77044677734375, 87.100341796875, 107.69497680664062, 0.47900158166885376]
[1.0, 641.8724365234375, 228.37530517578125, 102.33233642578125, 231.7137451171875, 0.4411845803260803]
[1.0, 634.2685546875, 228.98475646972656, 55.02569580078125, 123.33226013183594, 0.42556729912757874]
[1.0, 584.3050537109375, 463.4409484863281, 319.29437255859375, 186.97891235351562, 0.40026265382766724]
Note especially the last column, the confidence scores: they differ between the two.
@pyupcgithub As mentioned earlier, I am not the author of this code and I barely code in MATLAB/Python. However, it looks like the Python version applies a different kind of NMS or sets the IoU threshold differently. Just play around with the parameters; I am not able to help you with this problem, sorry.
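If you want to experiment, the relevant knobs in the Python script above are det_thr and nms_thresh; a hypothetical starting point (the 0.001 value is the commented-out testing threshold in the script, and whether it matches the MATLAB demo is an assumption you would have to verify):

# Assumes the CaffeDetection class from the inference script above.
detection = CaffeDetection(0, 'models/deploy.prototxt',
                           'models/models_iter_120000.caffemodel',
                           cascade=1, FPN=1)
detection.det_thr = 0.001   # low testing threshold instead of the 0.3 demo value
detection.nms_thresh = 0.5  # try matching the MATLAB NMS IoU threshold here
results = detection.detect('your_image.jpg')  # placeholder image path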
@Shadow992 yes, I think you are right. Anyway, thank you.
When I have a trained model and want to use python code for detection, can someone tell me how to run inference on one image with the trained model?