zhaoweicai / cascade-rcnn

Caffe implementation of multiple popular object detection frameworks

Have you written the python inference code on one image? #49

Closed pyupcgithub closed 6 years ago

pyupcgithub commented 6 years ago

I have a trained model and want to use Python code to do detection. Can someone tell me how to run inference on a single image with the trained model?

Peng-wei-Yu commented 6 years ago

@pyupcgithub If you want to train the model with pycaffe, you have to write the data layer yourself and then train the model. That is really difficult, though, because the data layer is very complex.

pyupcgithub commented 6 years ago

@Peng-wei-Yu ....... If so, how can I use the MATLAB script to run detection on a single image with the pretrained model?

Shadow992 commented 6 years ago

Have a look at:

https://gist.github.com/makefile/6731ca0e311b6401681c15635bb97330

pyupcgithub commented 6 years ago

@Shadow992 Why can't I open the URL?

huinsysu commented 6 years ago

@pyupcgithub @Peng-wei-Yu @Shadow992 Hi, have you successfully trained the resnet101+fpn+cascade or resnet50+fpn+cascade model on your own dataset?

Shadow992 commented 6 years ago

@huinsysu I have trained my own model, based on Inception v3, on my own data. I trained it on two different datasets:

  1. Detecting license plates in images. This one worked amazingly well. Without much fine-tuning I reached quite good results (about 90% accuracy on different datasets).

  2. Detecting characters on the cropped license plates. This one still does not work after two months of testing, fine-tuning, etc.

I found other threads, and especially papers like "DSOD: Learning Deeply Supervised Object Detectors from Scratch" ( http://openaccess.thecvf.com/content_ICCV_2017/papers/Shen_DSOD_Learning_Deeply_ICCV_2017_paper.pdf ), which suggest that training models from scratch may fail because of the ROI pooling layer. Therefore you should always use pre-trained models. I tried training a classifier on the dataset that only classifies the characters, then kept the convolutional layers and replaced the rest with RCNN components. It still fails to learn. No matter which hyperparameters, regions, aspect ratios, anchor points, etc. I choose, it always fails. If there were not a paper that uses Faster RCNN for OCR, I would have said that Faster RCNN simply cannot do it.

But I guess there is something wrong with my dataset, training, or hyperparameters. However, I have not been able to find it so far...

Edit: "Does not work" does not mean merely bad results; it means a detection/localization accuracy of maybe 0.1%. It just does not work. However, when I overfit the model on my training data and use the same C++ inference code I have always used, it works like a charm. Therefore I strongly suspect that my training procedure is flawed or needs more fine-tuning. Localization in particular seems to work extremely poorly. Classification seems OK-ish to me but is still quite bad.

pyupcgithub commented 6 years ago

@Shadow992 Can you show me the URL that I can't view?

huinsysu commented 6 years ago

@Shadow992 Thanks for your detailed reply. When I trained the res50-15s-800-fpn-cascade model on my dataset, I only used the ImageNet pre-trained model. But the results were very bad. The scores of the boxes the model detected were very low, which means the model classified those boxes as background. I have no idea how to solve this problem, so I plan to dive into the training code and hope to find the cause.

Shadow992 commented 6 years ago

@huinsysu This also happened to me when training on character recognition: my model nearly always predicts background with a probability of about 90% or higher. However, as mentioned, nearly the exact same network works great for license plate detection. I guess something must be wrong, or at least needs more fine-tuning. I am now trying a much bigger ROI pooling layer (now 15x9). Hopefully this helps, but I guess not.

A "tiny workaround" would be to treat boxes whose background probability is below 90% or so as foreground and take the highest-scoring foreground class. But this still does not work very well, especially when thinking about bbox regression and similar...
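The workaround described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `relabel_low_confidence_background`, the `bg_thresh` parameter, and the probability values are made up for the example, and column 0 is assumed to be the background class, as in the inference script in this thread.

```python
import numpy as np

def relabel_low_confidence_background(cls_prob, bg_thresh=0.9):
    """Return one class id per box (hypothetical helper).

    cls_prob: (N, C+1) array of class probabilities, column 0 = background.
    A box stays background (id 0) only if its background probability is at
    least bg_thresh; otherwise it gets the best-scoring foreground class.
    """
    cls_prob = np.asarray(cls_prob, dtype=np.float64)
    best_fg = cls_prob[:, 1:].argmax(axis=1) + 1   # foreground ids start at 1
    is_fg = cls_prob[:, 0] < bg_thresh
    return np.where(is_fg, best_fg, 0)

# Made-up probabilities for three boxes and two foreground classes.
probs = np.array([
    [0.95, 0.03, 0.02],   # confident background
    [0.85, 0.05, 0.10],   # background below the threshold
    [0.60, 0.30, 0.10],
])
labels = relabel_low_confidence_background(probs)
print(labels.tolist())
```

Note that this only changes the labels; it does nothing for the box regression concern raised above.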

What kind of objects do you want to detect?

@pyupcgithub

Python inference:

import os
import sys
import argparse
import numpy as np
from PIL import Image, ImageDraw
import cv2
import time

# Make sure that caffe is on the python path:
caffe_root = '../../..'
#os.chdir(caffe_root)
sys.path.insert(0, os.path.join(caffe_root, 'python'))
import caffe

# from google.protobuf import text_format
# from caffe.proto import caffe_pb2

class CaffeDetection:
    def __init__(self, gpu_id, model_def, model_weights,
                 cascade=0, FPN=0):
        if gpu_id < 0:
            caffe.set_mode_cpu()
        else:
            caffe.set_device(gpu_id)
            caffe.set_mode_gpu()

        # Load the net in the test phase for inference, and configure input preprocessing.
        self.net = caffe.Net(model_def,      # defines the structure of the model
                             model_weights,  # contains the trained weights
                             caffe.TEST)     # use test mode (e.g., don't perform dropout)
        # input preprocessing: 'data' is the name of the input blob == net.inputs[0]
        #self.transformer = caffe.io.Transformer({'data': self.net.blobs['data'].data.shape})
        #self.transformer.set_transpose('data', (2, 0, 1))
        #self.transformer.set_mean('data', np.array([104, 117, 123])) # mean pixel
        ## the reference model operates on images in [0,255] range instead of [0,1]
        #self.transformer.set_raw_scale('data', 255)
        ## the reference model has channels in BGR order instead of RGB
        #self.transformer.set_channel_swap('data', (2, 1, 0))

        self.cascade = cascade > 0
        self.FPN = FPN > 0
        print("cascade=%s FPN=%s" % (cascade, FPN))
        if not self.cascade:
            # baseline model
            if self.FPN:
                self.proposal_blob_names = ['proposals_to_all']
            else:
                self.proposal_blob_names = ['proposals']

            self.bbox_blob_names = ['output_bbox_1st']
            self.cls_prob_blob_names = ['cls_prob_1st']
            self.output_names = ['1st']
        else:
            # cascade-rcnn model
            if self.FPN:
                self.proposal_blob_names = ['proposals_to_all', 'proposals_to_all_2nd',
                                       'proposals_to_all_3rd', 'proposals_to_all_2nd', 'proposals_to_all_3rd']
            else:
                self.proposal_blob_names = ['proposals', 'proposals_2nd', 'proposals_3rd',
                                       'proposals_2nd', 'proposals_3rd']

            self.bbox_blob_names = ['output_bbox_1st', 'output_bbox_2nd', 'output_bbox_3rd',
                           'output_bbox_2nd', 'output_bbox_3rd']
            self.cls_prob_blob_names = ['cls_prob_1st', 'cls_prob_2nd', 'cls_prob_3rd',
                               'cls_prob_2nd_avg', 'cls_prob_3rd_avg']
            self.output_names = ['1st', '2nd', '3rd', '2nd_avg', '3rd_avg']

        self.num_outputs = len(self.proposal_blob_names)
        assert(self.num_outputs==len(self.bbox_blob_names))
        assert(self.num_outputs==len(self.cls_prob_blob_names))
        assert(self.num_outputs==len(self.output_names))
        # detection configuration
        # detect_final_boxes = np.zeros(nImg, num_outputs)
        #self.det_thr = 0.001 # threshold for testing
        self.det_thr = 0.3 # threshold for demo
        self.max_per_img = 100 # max number of detections
        self.nms_thresh = 0.5 # NMS
        if FPN:
            self.shortSize = 800
            self.longSize = 1312
        else:
            self.shortSize = 600
            self.longSize = 1000

        self.PIXEL_MEANS = np.array([104, 117, 123],dtype=np.uint8)
        self.num_cls = 80

    def detect(self, image_file):
        '''
        rcnn detection
        '''
        #image = caffe.io.load_image(image_file)
        image = cv2.imread(image_file) # BGR, default is cv2.IMREAD_COLOR 3-channel
        orgH, orgW, channel = image.shape
        print("image shape:",image.shape)
        rzRatio = float(self.shortSize) / min(orgH, orgW)  # float() avoids integer division under Python 2
        imgH = min(rzRatio * orgH, self.longSize)
        imgW = min(rzRatio * orgW, self.longSize)
        imgH = round(imgH / 32) * 32
        imgW = round(imgW / 32) * 32 # must be the multiple of 32
        hwRatios = [imgH/orgH, imgW/orgW]
        #transformed_image = self.transformer.preprocess('data', image)
        #image = cv2.resize(im_orig, None, None, fx=im_scale, fy=im_scale,
        resized_w = int(imgW)
        resized_h = int(imgH)
        print('resized -> %s' % ((resized_w, resized_h),))
        image = cv2.resize(image, (resized_w, resized_h), interpolation=cv2.INTER_LINEAR)
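        # Subtract the mean in float32; in-place uint8 subtraction would wrap around.
        image = image.astype(np.float32) - self.PIXEL_MEANS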
        #cv2.imwrite("transformed_image.jpg", image)
        transformed_image = np.transpose(image, (2,0,1)) # C H W

        # set net to batch size of 1
        self.net.blobs['data'].reshape(1, 3, resized_h, resized_w)

        #Run the net and examine the top_k results
        self.net.blobs['data'].data[...] = transformed_image.astype(np.float32, copy=False)

        start = time.time()
        # Forward pass.
        blobs_out = self.net.forward()
        print('output_bbox_1st---',blobs_out['output_bbox_1st'].shape)
        #print blobs_out
        end = time.time()
        cost_millis = int((end - start) * 1000)
        print("detection cost ms: %d" % cost_millis)

        detect_final_boxes = []
        for nn in range(self.num_outputs):
            # detect_boxes = cell(num_cls, 1);
            tmp = self.net.blobs[self.bbox_blob_names[nn]].data.copy() # copy: the boxes are modified in place below
            print(self.bbox_blob_names[nn], tmp.shape)
            #tmp = tmp.reshape((-1,5))
            tmp = tmp[:,:,0,0]
            tmp[:,1] /= hwRatios[1]
            tmp[:,3] /= hwRatios[1]
            tmp[:,2] /= hwRatios[0]
            tmp[:,4] /= hwRatios[0]

            # clip boxes to the image borders
            tmp[:, 1] = np.maximum(0,tmp[:,1])
            tmp[:, 2] = np.maximum(0,tmp[:,2])
            tmp[:, 3] = np.minimum(orgW,tmp[:,3])
            tmp[:, 4] = np.minimum(orgH,tmp[:,4])
            tmp[:, 3] = tmp[:, 3] - tmp[:, 1] + 1 # w
            tmp[:, 4] = tmp[:, 4] - tmp[:, 2] + 1 # h

            output_bboxs = tmp[:,1:]

            tmp = self.net.blobs[self.cls_prob_blob_names[nn]].data
            print(self.cls_prob_blob_names[nn], tmp.shape)
            cls_prob = tmp.reshape((-1,self.num_cls+1))

            tmp = self.net.blobs[self.proposal_blob_names[nn]].data.copy()
            print(self.proposal_blob_names[nn], tmp.shape)
            tmp = tmp[:,1:]
            tmp[:, 2] = tmp[:, 2] - tmp[:, 0] + 1  # w
            tmp[:, 3] = tmp[:, 3] - tmp[:, 1] + 1  # h
            proposals = tmp
            keep_id = np.where((proposals[:, 2] > 0) & (proposals[:, 3] > 0))[0]
            proposals = proposals[keep_id,:]
            output_bboxs = output_bboxs[keep_id,:]
            cls_prob = cls_prob[keep_id,:]

            detect_boxes = []
            for i in range(self.num_cls):
                cls_id = i + 1
                prob = cls_prob[:, cls_id][:, np.newaxis] # 0 is background
                #print (output_bboxs.shape, prob.shape)
                bbset = np.hstack([output_bboxs, prob])
                if self.det_thr > 0:
                    keep_id = np.where(prob >= self.det_thr)[0]
                    bbset = bbset[keep_id,:]

                keep = self.cpu_nms_single_cls(bbset, self.nms_thresh)
                if len(keep) == 0: continue
                bbset = bbset[keep,:]
                cls_ids = np.array([cls_id] * len(bbset))[:, np.newaxis]
                #print "cls_ids.shape", cls_ids.shape, bbset.shape
                detect_boxes.extend(np.hstack([cls_ids, bbset]).tolist())
            print("detected box num: %d" % len(detect_boxes))
            detect_boxes = np.asarray(detect_boxes)
            if self.max_per_img > 0 and len(detect_boxes) > self.max_per_img:
                # Sort scores in descending order; keep boxes scoring at least the max_per_img-th best.
                rank_scores = np.sort(detect_boxes[:, 5])[::-1]
                print("%d %d" % (len(rank_scores), self.max_per_img))
                keep_id = np.where(detect_boxes[:, 5] >= rank_scores[self.max_per_img - 1])[0]
                detect_boxes = detect_boxes[keep_id,:]
            #detect_final_boxes.extend(detect_boxes.tolist())
            detect_final_boxes.append(detect_boxes.tolist())

        return detect_final_boxes

    def cpu_nms_single_cls(self, dets, thresh):
        """Pure Python NMS baseline."""
        x1 = dets[:, 0]
        y1 = dets[:, 1]
        w = dets[:, 2]
        h = dets[:, 3]
        scores = dets[:, 4]

        x2 = x1 + w - 1
        y2 = y1 + h - 1
        # areas = (x2 - x1 + 1) * (y2 - y1 + 1)
        areas = w * h
        order = scores.argsort()[::-1]

        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])

            w = np.maximum(0.0, xx2 - xx1 + 1)
            h = np.maximum(0.0, yy2 - yy1 + 1)
            inter = w * h
            ovr = inter / (areas[i] + areas[order[1:]] - inter)

            inds = np.where(ovr <= thresh)[0]
            order = order[inds + 1]

        return keep

def main(args):
    '''main '''
    wordname_15 = ['__background__', 'plane', 'baseball-diamond', 'bridge', 'ground-track-field', 'small-vehicle', 'large-vehicle', 'ship', 'tennis-court',
                   'basketball-court', 'storage-tank',  'soccer-ball-field', 'roundabout', 'harbor', 'swimming-pool', 'helicopter']
    wordname_5 = ['__background__', '1:plane', '2:ship', '3:storage', '4:harbor', '5:bridge']
    # {cls_name: cls_id} # start from 1
    #cls_ids = {k: idx+1 for idx, k in enumerate(wordname_15)}

    detection = CaffeDetection(args.gpu_id,
                               args.model_def, args.model_weights,
                               cascade=args.cascade, FPN=args.FPN)
    results = detection.detect(args.image_file)
    #print(results)

    img = Image.open(args.image_file)
    draw = ImageDraw.Draw(img)
    width, height = img.size
    for item in results[-1]:  # last output (the 3rd_avg result for cascade models)
        xmin = int(round(item[1]))
        ymin = int(round(item[2]))
        xmax = int(round(item[1] + item[3] - 1))
        ymax = int(round(item[2] + item[4] - 1))
        cls_id = int(item[0])
        draw.rectangle([xmin, ymin, xmax, ymax], outline=(255, 0, 0))
        draw.text([xmin, ymin], str(cls_id), (0, 0, 255))
        print([cls_id, xmin, ymin, xmax, ymax, round(item[-1]*1000)/1000])

    img.save('detect_result.jpg')

def parse_args():
    '''parse args'''
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu_id', type=int, default=0, help='gpu id')
    parser.add_argument('--model_def',
                        default='models/deploy.prototxt')
    parser.add_argument('--cascade', default=0, type=int)
    parser.add_argument('--FPN', default=0, type=int)
    parser.add_argument('--model_weights',
                        default='models/models_iter_120000.caffemodel')
    parser.add_argument('--image_file', default='examples/images/fish-bike.jpg')
    return parser.parse_args()

if __name__ == '__main__':
    main(parse_args())

huinsysu commented 6 years ago

@Shadow992 I am participating in a tank detection competition, and there are 189 classes in the dataset. I tested the RPN on some pictures and found that the boxes it proposed were not so bad; at least the high-scoring boxes were around the ground truth. So I guess the RPN works. It seems strange to me that classification performs well in stage 1 but badly in stage 2. Since there are so many classes in the dataset, I want to reduce the number of training classes and see the effect on a subset. If you find a solution to this problem, please let me know. Thanks!

Shadow992 commented 6 years ago

@huinsysu Can we stay in closer contact? E.g. by using Skype/Discord or similar? So we can update each other on a regular basis? I would also be highly interested in solving this issue...

You can simply write me a mail for contact, if you want to: Removed Email

pyupcgithub commented 6 years ago

@Shadow992 Thank you for the Python inference code.

pyupcgithub commented 6 years ago

@Shadow992 I tested the Python inference code you provided; however, I find it does not give results as good as the MATLAB inference code. I only tested the trained model provided by the author. Do you know why?

pyupcgithub commented 6 years ago

@Shadow992 Like this: I use the same model with different inference code, and I only detect people. Using the MATLAB inference code, result is 4. 9 using the Python inference code, result is 5. (attached image: detect_result) This is one of the differences. Can you help me solve this problem?

Shadow992 commented 6 years ago

You probably have to fine-tune parameters and change algorithms, and maybe apply some post-processing. However, I am not the author and this goes beyond the scope of this issue, so I suggest closing it and experimenting with the code yourself.

pyupcgithub commented 6 years ago

Hmm... @Shadow992 Glad to receive your reply. One more question: if I run the MATLAB inference code and the Python inference code on the same image, the outputs are different, especially the confidence scores. I cannot understand why the outputs differ when I use the same model on the same image. Is it just because of the difference between the MATLAB and Python interfaces? I feel really confused. Looking forward to your reply.
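When the same model gives different scores under two front ends, one thing that may be worth ruling out is the preprocessing, for example the dtype used for mean subtraction. The following is a minimal sketch with a made-up pixel value, not a confirmed diagnosis of the difference reported here; it only shows that NumPy `uint8` arithmetic wraps around modulo 256 while float arithmetic keeps negative values.

```python
import numpy as np

# One made-up BGR pixel and the per-channel means used in the thread's script.
pixel = np.array([[[50, 200, 100]]], dtype=np.uint8)
means = np.array([104, 117, 123], dtype=np.uint8)

# uint8 - uint8 stays uint8, so negative results wrap around modulo 256.
wrapped = pixel - means

# Converting to float32 first keeps the negative values intact.
correct = pixel.astype(np.float32) - means

print(wrapped[0, 0].tolist())   # wrapped values
print(correct[0, 0].tolist())   # signed values
```

If the network input differs like this between the two pipelines, the scores would be expected to differ even with identical weights.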

pyupcgithub commented 6 years ago

@Shadow992 Like this output result. In MATLAB:

detect_boxes =

1.0e+03 *

0.0010    0.0010    0.0411    0.2543    0.1361    0.1713    0.0010
0.0010    0.0010    1.0196    0.5097    0.1098    0.1028    0.0010
0.0010    0.0010    0.8632    0.4844    0.1704    0.1621    0.0010
0.0010    0.0010    0.4153    0.2646    0.1221    0.1774    0.0009
0.0010    0.0010    0.8185    0.3617    0.0951    0.1131    0.0009
0.0010    0.0010    0.5891    0.4763    0.1830    0.2173    0.0009
0.0010    0.0010         0    0.0624    0.0936    0.1204    0.0009
0.0010    0.0010    0.7620    0.4800    0.1422    0.1488    0.0009
0.0010    0.0010    0.5406    0.2861    0.1220    0.1924    0.0009
0.0010    0.0010    0.4313    0.1939    0.1162    0.1401    0.0009
0.0010    0.0010    0.7104    0.3299    0.1051    0.1452    0.0009
0.0010    0.0010    0.6834    0.2532    0.0991    0.1096    0.0009
0.0010    0.0010    0.2715    0.1798    0.1358    0.3119    0.0008
0.0010    0.0010    0.5974    0.2326    0.0972    0.1194    0.0007

In Python:

[1.0, 880.21435546875, 481.3125305175781, 157.732666015625, 167.05300903320312, 0.9509992599487305]
[1.0, 17.491165161132812, 251.82504272460938, 161.8278350830078, 180.31094360351562, 0.8915714025497437]
[1.0, 1019.8265991210938, 510.274169921875, 117.43792724609375, 108.07086181640625, 0.8827307224273682]
[1.0, 3.20751953125, 62.82538986206055, 89.68313598632812, 127.01773071289062, 0.8688411116600037]
[1.0, 413.26153564453125, 263.39776611328125, 172.83038330078125, 176.29830932617188, 0.7869312763214111]
[1.0, 872.2628784179688, 398.31329345703125, 47.70599365234375, 77.80911254882812, 0.7195702791213989]
[1.0, 482.8349304199219, 190.6676788330078, 141.91409301757812, 143.6922149658203, 0.6592983603477478]
[1.0, 678.8290405273438, 250.62388610839844, 103.9158935546875, 187.9010467529297, 0.6096863746643066]
[1.0, 417.4122009277344, 269.6662902832031, 256.9304504394531, 228.03402709960938, 0.5993428826332092]
[1.0, 543.610595703125, 287.9579772949219, 129.0999755859375, 187.22341918945312, 0.5436170697212219]
[1.0, 603.4822998046875, 230.12550354003906, 91.89520263671875, 231.51707458496094, 0.5290579199790955]
[1.0, 1256.5042724609375, 627.1253051757812, 23.7042236328125, 65.6826171875, 0.4823896884918213]

the score of confidence is
[1.0, 691.6392211914062, 254.77044677734375, 87.100341796875, 107.69497680664062, 0.47900158166885376]
[1.0, 641.8724365234375, 228.37530517578125, 102.33233642578125, 231.7137451171875, 0.4411845803260803]
[1.0, 634.2685546875, 228.98475646972656, 55.02569580078125, 123.33226013183594, 0.42556729912757874]
[1.0, 584.3050537109375, 463.4409484863281, 319.29437255859375, 186.97891235351562, 0.40026265382766724]

Especially the last column, the confidence score.

Shadow992 commented 6 years ago

@pyupcgithub As mentioned earlier, I am not the author of this code and I barely code in MATLAB/Python. However, this looks like Python is applying a different kind of NMS or using a different IoU threshold. Just play around with the parameters. I am not able to help you with this problem, sorry.
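To illustrate how much the IoU threshold alone can change the number of surviving boxes, here is a small sketch of greedy NMS on `[x, y, w, h, score]` rows, the same box layout the Python script in this thread passes to its NMS. The helper names `iou_xywh` and `nms_xywh` and the three boxes are made up for the example; this is not the repository's MATLAB or C++ NMS.

```python
import numpy as np

def iou_xywh(a, b):
    """IoU of two boxes given as [x, y, w, h, ...] in pixel coordinates."""
    ax2, ay2 = a[0] + a[2] - 1, a[1] + a[3] - 1
    bx2, by2 = b[0] + b[2] - 1, b[1] + b[3] - 1
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]) + 1)
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]) + 1)
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def nms_xywh(dets, thresh):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every already kept box is at most thresh."""
    dets = np.asarray(dets, dtype=np.float64)
    keep = []
    for i in dets[:, 4].argsort()[::-1]:
        if all(iou_xywh(dets[i], dets[j]) <= thresh for j in keep):
            keep.append(int(i))
    return keep

# Made-up detections: the first two overlap heavily, the third is isolated.
dets = [
    [10, 10, 100, 100, 0.9],
    [20, 20, 100, 100, 0.8],
    [300, 300, 50, 50, 0.7],
]
overlap = iou_xywh(dets[0], dets[1])
print(overlap)
print(nms_xywh(dets, 0.5))   # stricter threshold suppresses the overlapping box
print(nms_xywh(dets, 0.7))   # looser threshold keeps it
```

Running both pipelines with the same score threshold and the same NMS threshold would be one way to check whether the difference comes from post-processing or from the network outputs themselves.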

pyupcgithub commented 6 years ago

@Shadow992 Yes, I think you are right. Anyway, thank you.