yasenh / libtorch-yolov5

A LibTorch inference implementation of YOLOv5.
MIT License

Post-processing takes too long #3

Closed: sudapure closed this issue 3 years ago

sudapure commented 3 years ago

Model forwarding takes only ~5 ms to infer the input blob, but post-processing takes about 50 ms. This is surprising, since the PyTorch (Python) implementation takes only ~15 ms for inference and post-processing combined, yet here post-processing alone takes much longer. Is there any way to optimize the post-processing for low latency?

yasenh commented 3 years ago

Update: "inference time" below should read "post-processing time".

Hi @sudapure, thanks for the feedback! I did some tests on my end with the same set of images and the same model, timing the inference + NMS parts. The yolov5 Python version takes ~0.015 s per image, while the LibTorch one takes ~0.012 s.

I put the timer in detector.cpp (similar to the yolov5 Python version): image

Here are the results for 5 images: image

Question: are you using the GPU version or the CPU version?

A quick solution I can think of right now is to use the NMS from TorchVision. I didn't use it here because I want to minimize the dependencies of this project, but I would suggest you give it a try if latency is a concern. Meanwhile, I will try to optimize the current implementation as much as possible.
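For reference, the core of a greedy NMS in plain LibTorch looks roughly like the sketch below. This is neither this repo's implementation nor the TorchVision operator, just an illustration of the computation, assuming boxes of shape [N, 4] in (x1, y1, x2, y2) format and scores of shape [N]:

    #include <torch/torch.h>
    #include <vector>

    // Greedy non-maximum suppression: repeatedly keep the highest-scoring box and
    // drop the remaining boxes that overlap it above the IoU threshold.
    std::vector<int64_t> GreedyNms(const torch::Tensor& boxes,
                                   const torch::Tensor& scores,
                                   float iou_threshold) {
        auto x1 = boxes.select(1, 0), y1 = boxes.select(1, 1);
        auto x2 = boxes.select(1, 2), y2 = boxes.select(1, 3);
        auto areas = (x2 - x1) * (y2 - y1);
        auto order = scores.argsort(0, /*descending=*/true);

        std::vector<int64_t> keep;
        while (order.numel() > 0) {
            auto i = order[0].item<int64_t>();
            keep.push_back(i);
            if (order.numel() == 1) break;
            auto rest = order.slice(0, 1);
            // Intersection of the chosen box with the remaining candidates.
            auto xx1 = torch::max(x1.index_select(0, rest), x1[i]);
            auto yy1 = torch::max(y1.index_select(0, rest), y1[i]);
            auto xx2 = torch::min(x2.index_select(0, rest), x2[i]);
            auto yy2 = torch::min(y2.index_select(0, rest), y2[i]);
            auto inter = (xx2 - xx1).clamp_min(0) * (yy2 - yy1).clamp_min(0);
            auto iou = inter / (areas.index_select(0, rest) + areas[i] - inter);
            // Keep only candidates whose IoU with the chosen box is below the threshold.
            order = rest.masked_select(iou <= iou_threshold);
        }
        return keep;
    }

TorchVision's NMS operator implements the same idea with optimized CPU/CUDA kernels, which is why it is the lower-latency option.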

sudapure commented 3 years ago

Hi @yasenh, I am using the GPU version, on a GTX 1050 card (2 GB). From your readings it looks like inference plus post-processing averages about 7 ms, but in my case inference takes ~5 ms while post-processing takes much longer, sometimes more than 50 ms. I modified the code to take input from a video stream instead of an image file, but the average post-processing latency is too high in both cases.

mheriyanto commented 3 years ago

Hi @yasenh and @sudapure, I am using a GTX 1050 Ti and added a timer in main.cpp:

image

My GPU is utilized during the inference:

image

And these are the results for 4 images:

image

yasenh commented 3 years ago

@mheriyanto Looks like the inference time is reasonable. Could you measure how much time is taken by the "Demo" function? I suspect "cv::namedWindow" and "cv::imshow" might take a bit longer.
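For reference, a minimal way to time such a call with std::chrono (the Demo(...) call here is only a placeholder, not the repo's exact signature):

    #include <chrono>
    #include <iostream>

    // Bracket the call with a steady clock and print the elapsed milliseconds.
    auto start = std::chrono::steady_clock::now();
    Demo(img, result, class_names);  // hypothetical call, adjust to the actual function
    auto end = std::chrono::steady_clock::now();
    std::cout << "Demo takes : "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms" << std::endl;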

mheriyanto commented 3 years ago

@yasenh Actually my "cv::namedWindow" and "cv::imshow" calls are moved out of the "Demo" function, and I added timers measuring the "Demo" and "Detector" functions like this:

image

I got these results:

image

yasenh commented 3 years ago

@mheriyanto Thanks, very interesting; it seems this part consumes most of the time. Could you try to measure the processing time of the code above? Or you can comment out that code segment and check what the processing time is.

mheriyanto commented 3 years ago

@yasenh , Okay, I measured the processing time like this:

image

and got these results:

image

yasenh commented 3 years ago

Hmm, that's weird... I will do some more testing tomorrow. Thanks again for these measurements.

yasenh commented 3 years ago

Hi @mheriyanto and @sudapure,

You could pull the repo and checkout the "debug/process_time_measurement" branch: https://github.com/yasenh/libtorch-yolov5/tree/debug/process_time_measurement

Here is what I got; the end-to-end time is around 16 ms per image (note that the first run takes longer): image

sudapure commented 3 years ago

@yasenh, I know it takes a bit longer the first time, but only for inference; even after several iterations I get the same results. As I already mentioned, inference time is not an issue in my case (around 5 ms), but the post-process function takes too long.

yasenh commented 3 years ago

Hi @sudapure, yep, I understand your pain point and am trying to figure out the root cause. As I showed above, the post-process time is roughly 8 ms on my end. Did you try running the debug branch and still get a similar result (50 ms)? The only thing I can think of is that we may have different software/hardware setups.

sudapure commented 3 years ago

@yasenh, I reproduced the inference run from the debug branch you specified earlier:

pre-process takes : 2 ms
inference takes : 394 ms
post-process takes : 171 ms
end-to-end time : 568 ms

pre-process takes : 7 ms
inference takes : 5 ms
post-process takes : 24 ms
end-to-end time : 37 ms

pre-process takes : 2 ms
inference takes : 4 ms
post-process takes : 24 ms
end-to-end time : 31 ms

pre-process takes : 2 ms
inference takes : 5 ms
post-process takes : 24 ms
end-to-end time : 32 ms

Clearly the post-processing adds a lot of latency here. From my analysis, this might be happening for two reasons:

  1. During TorchScript export, you removed the final Detect layer using model.model[-1].export = False, so bbox extraction has to be implemented after retrieving the output tensors from the network forward pass (_module.forward()). This certainly adds extra processing latency on the CPU; during the run I observed CPU utilization peaking above 50%, whereas in the Ultralytics Python implementation it does not exceed 16% on average.

  2. The second reason would be output tensor processing hopping between host memory and GPU memory. In the code the output tensors are allocated on the GPU, so there might be certain operations that cannot be done on the GPU, and at execution time such operations may hop back to the CPU, which will certainly add latency (see the sketch after this comment).

I guess that to confirm both of the above possibilities, the code needs to be profiled.
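For illustration of the second point, one common mitigation (a hedged sketch, not the repo's actual code) is to finish the tensor-level filtering on the GPU and then move the surviving detections to the CPU in a single copy, so the per-box loop afterwards does not trigger repeated device-to-host transfers:

    // Assumed: 'detections' is a GPU tensor of candidate boxes that has already
    // been filtered by confidence. One explicit copy brings it to host memory;
    // everything after this line can read values without touching the GPU.
    detections = detections.to(torch::kCPU).contiguous();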

yasenh commented 3 years ago

@sudapure Thanks for the analysis.

image

You can find that if we set export = True, then self.training will be true and some of the post-processing will be skipped: https://github.com/ultralytics/yolov5/blob/916d4aad9a154587dc5914f9b04be575a0990529/models/yolo.py#L38 https://github.com/ultralytics/yolov5/blob/916d4aad9a154587dc5914f9b04be575a0990529/models/yolo.py#L44-L51

So I think my implementation for the first item is correct.

pengyang1225 commented 3 years ago

The post-processing time is too long, mainly due to the time spent in "auto det = torch::masked_select(detections[batch_i], conf_mask[batch_i]).view({-1, num_classes + item_attr_size});". Do you have a plan to fix it?
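(For reference, one possible alternative to the masked_select + view pattern is to select the passing rows by index; a minimal sketch, where conf_thres is an assumed confidence threshold variable:)

    // Assumed: detections[batch_i] has shape [num_boxes, num_classes + item_attr_size]
    // and column 4 holds the objectness score.
    auto scores = detections[batch_i].select(1, 4);
    auto keep = (scores >= conf_thres).nonzero().squeeze(1);  // row indices above threshold
    auto det = detections[batch_i].index_select(0, keep);     // keeps the 2-D shape directly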

yasenh commented 3 years ago

@pengyang1225 Working on it, hopefully can improve it on this weekend :)

sainttelant commented 3 years ago

It is a really excellent project, cool!

pengyang1225 commented 3 years ago

Waiting for you to solve the problem

yasenh commented 3 years ago

@pengyang1225 LOL, I have been too busy with work recently; hopefully I can get it done ASAP.

ganler commented 3 years ago

@yasenh @sudapure Hi, from my experience, I think the profiling results might be invalid. PyTorch uses an asynchronous kernel execution model: before you run your profiling program, you should set the environment variable CUDA_LAUNCH_BLOCKING=1.

For example:

CUDA_LAUNCH_BLOCKING=1 ./main

And you will find that post-processing is not the bottleneck; the DNN inference is.

This link explains more if you want the details: https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution

yasenh commented 3 years ago

@ganler Thanks, that is a great document, lesson learned!

yasenh commented 3 years ago

Hi @mheriyanto and @sudapure,

I did some profiling based on this post, adding synchronization points before starting and stopping the timer for the forward pass, like this:

// Block until all pending work on the current CUDA stream has finished,
// so the CPU-side timer measures the real GPU execution time.
at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream();
AT_CUDA_CHECK(cudaStreamSynchronize(stream));
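Put together, the measurement pattern looks roughly like this (a sketch; module and inputs stand for the loaded TorchScript model and its input vector, and <chrono> plus the matching CUDA headers are assumed to be included):

    auto sync_stream = [] {
        // Wait for every queued CUDA kernel before reading the wall clock.
        AT_CUDA_CHECK(cudaStreamSynchronize(at::cuda::getCurrentCUDAStream()));
    };

    sync_stream();
    auto start = std::chrono::high_resolution_clock::now();
    auto output = module.forward(inputs);
    sync_stream();
    auto end = std::chrono::high_resolution_clock::now();
    auto inference_ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();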

And here is the time check (tested with a single GTX 1070 GPU): image

I also followed @ganler's suggestion and tested with CUDA_LAUNCH_BLOCKING=1: image

Compared with the NN inference time, post-processing is not the bottleneck here. As @pengyang1225 mentioned, there might be some places where we can find a better approach for the post-processing; I am doing some experiments and hopefully can release it soon.

yasenh commented 3 years ago

Hi @ganler, I notice that the inference itself, even without pre-processing and post-processing, normally takes around 20 ms to process a 640x640 image, whereas the yolov5 Python version only needs 8-9 ms. I did some tests with the OpenMP config and a Release build based on post1 and post2, but it does not improve much. Any ideas?

ganler commented 3 years ago

I think ideally the C++ version should have the same performance as the Python version. There are a few things worth checking:

  1. Original dynamic-graph execution vs. the JIT (TorchScript) model?
  2. Batched input vs. batch size = 1 in the profiling? (Batched input will have better throughput; see the sketch at the end of this comment.)
  3. Async vs. sync. Async execution is faster...

I haven't had time to profile your code. If you are still stuck, I can give it a try myself.

Torch provides a profiling module, but unfortunately it runs much slower in profiling mode.

Hope my advice helps. :-)
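For illustration of item 2, batching several pre-processed frames into a single forward() call might look roughly like this in LibTorch (a sketch; the frame tensors are hypothetical):

    // Stack N pre-processed 3x640x640 float tensors into one N x 3 x 640 x 640 batch.
    std::vector<torch::Tensor> frames = {img_a, img_b, img_c};  // hypothetical inputs
    torch::Tensor batch = torch::stack(frames).to(torch::kCUDA);

    std::vector<torch::jit::IValue> inputs{batch};
    // The output is [N, num_candidates, 85]; post-process each batch index separately.
    auto output = module.forward(inputs).toTuple()->elements()[0].toTensor();

The trade-off is per-frame latency versus overall throughput.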

ganler commented 3 years ago

BTW, I don't think it has anything to do with OpenMP; it is a GPU thing...

Also, you can take a look at the GPU utilization (e.g. using gpustat -i). That can tell you something.

yasenh commented 3 years ago

@ganler Appreciate your suggestions!

mojiayuan commented 3 years ago

@yasenh, I ran into the second situation described in @sudapure's analysis above (output tensor processing hopping between host and GPU memory). Do you have a solution?

longjiu commented 3 years ago
I solved this problem like this (NMSBoxes comes from OpenCV's DNN module, #include <opencv2/dnn/dnn.hpp>):

    // Forward pass; the first element of the output tuple holds the detections.
    torch::jit::IValue output = module.forward(inputs);
    auto detections = output.toTuple()->elements()[0].toTensor();

    // Keep only candidates whose objectness score is >= 0.5.
    auto conf_mask = detections.select(2, 4).ge(0.5).unsqueeze(2);
    detections = torch::masked_select(detections[0], conf_mask[0]).view({ -1, 85 });

    // Move the filtered detections to the CPU once before the per-box loop.
    detections = detections.to(torch::kFloat).to(torch::kCPU);

    std::vector<cv::Rect> boxes;
    std::vector<float> confidences;
    float* ptr = (float*)detections.data_ptr();
    for (int i = 0; i < (int)detections.size(0); ++i, ptr += 85)
    {
        float confidence = ptr[4];
        // Rescale the box center/size from the 640x640 network input to the frame size.
        int centerX = (int)((ptr[0] / 640) * frame.cols);
        int centerY = (int)((ptr[1] / 640) * frame.rows);
        int width   = (int)((ptr[2] / 640) * frame.cols);
        int height  = (int)((ptr[3] / 640) * frame.rows);
        int left = centerX - width / 2;
        int top  = centerY - height / 2;
        confidences.push_back(confidence);
        // 1920x1200 is the hard-coded frame size used here to clamp the boxes.
        boxes.push_back(cv::Rect(std::max(left, 0), std::max(top, 0),
                                 std::min(width, 1920 - left), std::min(height, 1200 - top)));
    }

    // Non-maximum suppression from OpenCV.
    std::vector<int> indices;
    cv::dnn::NMSBoxes(boxes, confidences, 0.5, 0.2, indices);

yasenh commented 3 years ago

@longjiu Thanks for your input! I did some optimization based on the OpenCV NMS and some tricks from the yolov5 Python repo;

now the post-processing drops to 1 ms:

Benchmark with the following command:

    $ CUDA_LAUNCH_BLOCKING=1 ./libtorch-yolov5 --source ../images/bus.jpg --weights ../weights/yolov5s.torchscript.pt --gpu --view-img

image