gurkirt opened 8 years ago
Try doing it in batches
Thanks for the reply. Unfortunately, I can't use batches.
I got 20 fps with cuDNN, but that is still very low. I am using the 300x300 architecture.
I am working on streaming video. I need to process frames as they come, so I can't use batches; I want to do the forward pass on a single image at a time.
The per-frame time includes the image transformation, the forward pass, and accessing the location and detection blobs and saving them in .mat format. Removing the save part only increases the rate to 22 fps. Most of the time is taken by the forward pass and accessing the blobs.
Here is the Python code:
```python
for img_name in image_list:
    input_img_file = '{}/{}'.format(video_input_dir, img_name)
    # strip the extension; rstrip('.jpg') would also eat trailing j/p/g characters
    save_file = '{}/{}.mat'.format(video_output_dir, os.path.splitext(img_name)[0])
    if not os.path.isfile(save_file):
        image = caffe.io.load_image(input_img_file)
        transformed_image = transformer.preprocess('data', image)
        net.blobs['data'].data[...] = transformed_image
        # Forward pass.
        net.forward()
        detections = net.blobs['detection_out'].data[0][0]
        confidences = net.blobs['mbox_conf_softmax'].data[0]
        # print np.shape(detections), np.shape(confidences)
        number_detection = np.shape(detections)[0]
        final_detection = np.zeros((number_detection, num_classes + 7))
        for i in range(number_detection):
            final_detection[i, :7] = detections[i, :]
            final_detection[i, 7:] = confidences[int(detections[i, 0]), :]
        sio.savemat(save_file, mdict={'detections': final_detection})
```
whereas running

```
./build/tools/caffe time --model models/VGGNet/VOC0712/SSD_300x300/deploy.prototxt --gpu 0
```

gives me the following timing numbers:

```
I1014 19:37:53.356597 19125 caffe.cpp:412] Average Forward pass: 17.2102 ms.
I1014 19:37:53.356611 19125 caffe.cpp:414] Average Backward pass: 23.4602 ms.
I1014 19:37:53.356621 19125 caffe.cpp:416] Average Forward-Backward: 40.7575 ms.
I1014 19:37:53.356631 19125 caffe.cpp:418] Total Time: 2037.87 ms.
I1014 19:37:53.356637 19125 caffe.cpp:419] * Benchmark ends *
```
This suggests the forward pass should take only 17 ms, i.e. about 58 fps, but I only get 20 fps, roughly 3 times slower than it should be.
What can I do to improve it?
I am using an NVIDIA GTX 960 and I am getting similar numbers. Which GPU do you use?
@gurkirt Try commenting out sio.savemat
?
@weiliu89 and @vj-1988
I am using a Titan X 12 GB. I timed each section independently; it seems most of the time goes into image preprocessing and the forward pass. On average, the forward pass takes 30 ms, which is much higher than the 17 ms I got using `caffe time`. sio.savemat takes negligible time.
Here is the code to time each part; the results follow the code.
```python
t3 = time.time()
image = caffe.io.load_image(input_img_file)
transformed_image = transformer.preprocess('data', image)
net.blobs['data'].data[...] = transformed_image
t4 = time.time()
print('time taken to read and preprocess the image ', t4 - t3, ' seconds')

net.forward()
t5 = time.time()
print('time taken for forward pass', t5 - t4, ' seconds')
detections = net.blobs['detection_out'].data[0][0]
confidences = net.blobs['mbox_conf_softmax'].data[0]

number_detection = np.shape(detections)[0]
final_detection = np.zeros((number_detection, num_classes + 7))
for i in range(number_detection):
    final_detection[i, :7] = detections[i, :]
    final_detection[i, 7:] = confidences[int(detections[i, 0]), :]
t6 = time.time()
print('time taken to post process ', t6 - t5, ' seconds')
sio.savemat(save_file, mdict={'detections': final_detection})
print('time taken to save as .mat ', time.time() - t6, ' seconds')
```

The output is 15.6 fps, with each part timed as follows:

```
('time taken to read and preprocess the image ', 0.02192997932434082, ' seconds')
('time taken for forward pass', 0.030692100524902344, ' seconds')
('time taken to post process ', 0.00045108795166015625, ' seconds')
('time taken to save as .mat ', 0.008970022201538086, ' seconds')
```
but the `caffe time` command gives me the following numbers:

```
I1025 13:48:34.542702 18268 caffe.cpp:412] Average Forward pass: 17.0406 ms.
I1025 13:48:34.542711 18268 caffe.cpp:414] Average Backward pass: 22.8031 ms.
I1025 13:48:34.542718 18268 caffe.cpp:416] Average Forward-Backward: 39.9358 ms.
I1025 13:48:34.542726 18268 caffe.cpp:418] Total Time: 1996.79 ms.
I1025 13:48:34.542732 18268 caffe.cpp:419] * Benchmark ends *
```
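As an aside, the post-processing loop in the snippet above can be vectorized with NumPy fancy indexing (it is already sub-millisecond here, so the gain is marginal). A sketch with made-up shapes; `num_classes` and the use of column 0 of `detections` as a row index into `confidences` mirror the original loop:

```python
import numpy as np

# Made-up stand-ins: 5 detections, 21 classes, 100 confidence rows.
num_classes = 21
detections = np.random.rand(5, 7)
detections[:, 0] = [0, 2, 1, 4, 3]              # row index into confidences
confidences = np.random.rand(100, num_classes)

# Vectorized equivalent of the per-detection Python loop.
number_detection = detections.shape[0]
final_detection = np.zeros((number_detection, num_classes + 7))
final_detection[:, :7] = detections
final_detection[:, 7:] = confidences[detections[:, 0].astype(int), :]
```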
I think you can achieve the 72 fps if you: a) use batches, b) evaluate fps only for the "pure GPU computation", i.e. measure only the forward pass, and c) use cuDNN 5.1.
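If a small latency budget is acceptable even for a stream, point (a) just means grouping incoming frames before one forward pass. A minimal batching helper (plain Python; the frame names are hypothetical):

```python
def chunked(seq, batch_size):
    """Yield successive batches from seq; the last batch may be smaller."""
    for start in range(0, len(seq), batch_size):
        yield seq[start:start + batch_size]

# Hypothetical frame names standing in for a stream buffer.
frames = ['frame_{:04d}.jpg'.format(i) for i in range(10)]
batches = list(chunked(frames, 4))   # 3 batches: sizes 4, 4, 2
```

Each batch would then be stacked into `net.blobs['data']` after reshaping the data blob to the batch size.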
@blackunicorn15 That is not my issue. I am using cuDNN 5.1. I get 58 fps with a batch size of 1 using the `time` command, and 72 fps with a batch size of 8 using pascal_speed.py. BTW, pascal_speed.py is a wrapper over the CUDA code; all the computation is done through the ./build/tools/caffe interface.
The problem is that the forward pass from Python, as shown in the code above, takes 30 ms. Is it because pycaffe is slower? Why is that?
I tested the code with batching on a Titan X 12 GB. For a batch size of 64, net.forward() takes around 1.01 seconds, which is roughly 63-64 fps.
My machine has CUDA 8.0 and cuDNN 5.1 installed. The reported speed for VGG16 (300x300) is around 72 fps, so I guess the reduction in speed could be due to wrapper overhead.
@vj-1988
Thanks for your input. It seems the wrapper overhead is smaller when the batch size is large.
In my case, with a batch size of 1, the per-image wrapper overhead is probably very high compared to a large batch size.
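The overhead story can be put into numbers with a toy model: assume each call pays a fixed wrapper cost on top of a per-image GPU cost. The 13 ms and 17 ms below are illustrative assumptions only, loosely matching the 30 ms single-image forward vs. the 17 ms `caffe time` figure above:

```python
def effective_fps(batch_size, overhead_ms=13.0, gpu_ms_per_image=17.0):
    """Throughput if each call pays a fixed overhead plus per-image GPU time."""
    total_ms = overhead_ms + batch_size * gpu_ms_per_image
    return 1000.0 * batch_size / total_ms

fps_b1 = effective_fps(1)   # overhead dominates: ~33 fps
fps_b8 = effective_fps(8)   # overhead amortized: ~54 fps
```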
I am encountering a strange result. I'd be glad if anyone could help out.
I have to stick to a batch_size of 2, since increasing it beyond that causes a cudaSuccess (2 vs. 0) out-of-memory error.
I had CUDA 8.0 with cuDNN 5.1 installed on Amazon EC2 with an NVIDIA GRID K520. After building with `USE_CUDNN := 1`, I get:

```
Average Forward pass: 111.995 ms.
Average Backward pass: 131.617 ms.
Average Forward-Backward: 243.737 ms.
```

and without cuDNN it gets better:

```
Average Forward pass: 84.0685 ms.
Average Backward pass: 129.108 ms.
Average Forward-Backward: 213.288 ms.
```

I also tried CUDA 7.0 and cuDNN 4 just to try my luck, but the results were the same as with CUDA 8.0.
Any input is appreciated. :)
Use cuDNN 5.1 or above.
I have already tried 5.1, but let me try newer versions 👍
So I tried reinstalling everything on a fresh instance using a Docker image. I have been able to get to 17-18 fps by manually timing this line:

```python
detections = net.forward()['detection_out']
```

I am on a Tesla K80 with 12 GB, using a train batch of 32 and a test batch of 8. Increasing the test batch to anything > 8 fails with a memory error. Would a TITAN X (Maxwell) get me to 45 fps? I have tried a lot of tweaks without any luck.
While testing as in the `ssd_detect.ipynb`, we can set the batch size here:

```python
image_resize = 300
net.blobs['data'].reshape(batch_size, 3, image_resize, image_resize)
```
The shape of `detections` for a batch size of 8 gives me `[1, 1, x, 7]`, where x is the total number of detections over all 8 images. Is there a way to break this down per image? I haven't been able to figure it out. Thanks for any help on this or the fps issue :)
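If I read the detection_out layout right, each of the x rows is [image_id, label, score, xmin, ymin, xmax, ymax], with the first column being the image's index within the batch, so the flat output can be split per image by filtering on that column. A sketch with made-up rows:

```python
import numpy as np

# Hypothetical detections, rows: [image_id, label, score, xmin, ymin, xmax, ymax].
detections = np.array([
    [0, 7, 0.9, 0.1, 0.1, 0.5, 0.5],
    [0, 3, 0.8, 0.2, 0.2, 0.6, 0.6],
    [1, 7, 0.7, 0.3, 0.3, 0.7, 0.7],
    [3, 5, 0.6, 0.4, 0.4, 0.8, 0.8],
])

batch_size = 8
# One array per image; images with no detections get an empty (0, 7) array.
per_image = [detections[detections[:, 0] == i] for i in range(batch_size)]
```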
Hi,
I get only 9 fps when I do the forward pass in Python.
Is it because pycaffe is slower?
Thanks, G.