zhiqwang / yolort

yolort is a runtime stack for YOLOv5 on specialized accelerators such as TensorRT, LibTorch, ONNX Runtime, TVM and NCNN.
https://zhiqwang.com/yolort
GNU General Public License v3.0

Putting a YOLOv5 ONNX model exported from ultralytics into the export_engine API slows down postprocessing in the C++ deployment. #492

Closed YoungjaeDev closed 1 year ago

YoungjaeDev commented 1 year ago

🐛 Describe the bug

There is no problem when I put the .pt file directly into export_tensorrt_engine. But if I put the ONNX file exported from ultralytics into the export_tensorrt_engine model_path, the engine is still created, yet postprocessing is much slower (0.1 ms vs 6 ms on my PC).

https://github.com/zhiqwang/yolov5-rt-stack/blob/8b578eb9a7910f1dcb28188a36c8c540d15a9430/deployment/tensorrt/main.cpp#L382-L408
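To make the comparison concrete, here is a minimal sketch of the two paths being compared. The file names are hypothetical, the `model_path` keyword follows the wording above, and the import path and exact signature of `export_tensorrt_engine` are assumptions that may differ from the actual yolort API:

```python
# Sketch only: file names are hypothetical and the exact signature of
# export_tensorrt_engine may differ from what is shown here.
from yolort.runtime.trt_helper import export_tensorrt_engine

# Path A: hand the .pt checkpoint to yolort directly -> fast postprocess.
export_tensorrt_engine(model_path="yolov5s.pt")

# Path B: hand over an ONNX file previously exported by ultralytics ->
# the engine still builds, but postprocess is ~6 ms instead of ~0.1 ms.
export_tensorrt_engine(model_path="yolov5s_ultralytics.onnx")
```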

Versions

No specifics

zhiqwang commented 1 year ago

Hi @youngjae-avikus, sorry for missing this ticket. I think this is expected: it seems a serialization operation is done first, and that serialization is a very time-consuming process.
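For reference, a minimal TensorRT sketch of why this matters: building and serializing the network from ONNX is the expensive, one-time step, while deserializing a cached engine plan is comparatively cheap. This uses the standard TensorRT Python API; file names are hypothetical:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# Building (and serializing) an engine from ONNX is the slow, one-time step.
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:  # hypothetical file name
    parser.parse(f.read())
config = builder.create_builder_config()
serialized = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized)

# At deploy time, deserializing the cached plan is cheap by comparison.
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
```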

YoungjaeDev commented 1 year ago

@zhiqwang

Thank you, and sorry for the confusion. Let me describe the exact issue more precisely.

When using the export_tensorrt_engine API, most of the time is consumed by enqueue, and the memcpy code mentioned above does not take much time. However, if I directly feed in the ONNX exported from ultralytics (I do this because the newly renovated architecture has a new block or layer, so it cannot be converted directly by yolov5-rt-stack), then strangely the enqueue is short but the memcpy is long.

The conclusion is that enqueue + memcpy takes about the same total time either way (few test cases, but probably!), yet the split differs depending on how the engine is created. Can you tell me why?
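To pin down where the time goes, something like the following can time enqueue and the D2H copy separately. This is a sketch using pycuda; `context`, `bindings`, and the output buffers are assumed to be set up elsewhere (hypothetical names):

```python
import time

import pycuda.autoinit  # noqa: F401 -- initializes a CUDA context
import pycuda.driver as cuda


def profile_once(context, bindings, h_output, d_output):
    """Time TensorRT enqueue and the D2H copy separately (in ms).

    `context` is a TensorRT IExecutionContext; the bindings and buffers
    are assumed to be allocated elsewhere (hypothetical setup).
    """
    stream = cuda.Stream()

    t0 = time.perf_counter()
    context.execute_async_v2(bindings, stream.handle)  # enqueueV2 in C++
    stream.synchronize()  # wait, so the kernel time is not hidden
    t1 = time.perf_counter()

    cuda.memcpy_dtoh_async(h_output, d_output, stream)  # the "memcpy" above
    stream.synchronize()
    t2 = time.perf_counter()

    return (t1 - t0) * 1e3, (t2 - t1) * 1e3
```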

zhiqwang commented 1 year ago

I think this also meets expectations. NMS is equivalent to a filter: putting NMS into the model greatly reduces the size of the output tensors. You can observe that the D2H (device-to-host) copy time is greatly reduced, but the cost is that this part of the computation moves onto the device, which may increase the enqueue time.
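A back-of-the-envelope illustration of the trade-off. The numbers assume a stock YOLOv5 at 640x640 with 80 classes, fp32, and a 100-detection cap after NMS; your model may differ:

```python
# Why in-model NMS shrinks the D2H copy: assumed stock YOLOv5 numbers.
raw_preds = 25200 * 85          # all anchors x (4 box + 1 obj + 80 cls)
raw_bytes = raw_preds * 4       # ~8.6 MB to copy back, then NMS on CPU

max_dets = 100                  # typical cap after an in-model NMS plugin
kept_bytes = max_dets * 6 * 4   # box(4) + score + class ~= 2.4 KB

print(f"raw output:  {raw_bytes / 1e6:.1f} MB")
print(f"after NMS:   {kept_bytes / 1e3:.1f} KB")
```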

YoungjaeDev commented 1 year ago

@zhiqwang

Thanks for your answer, but it seems the two methods I'm comparing (feeding a .pt or an ONNX file) should end up the same, in that both ultimately produce an engine with the same NMS plugin added. However, if I put the ONNX extracted from ultralytics directly into export_tensorrt_engine, the resulting engine's enqueue time is shorter and its memcpy time is longer.

YoungjaeDev commented 1 year ago

@zhiqwang

https://github.com/Linaom1214/TensorRT-For-YOLO-Series/blob/accaf0a41dab8f1e132db4d2c43e3005b3fe2190/export.py#L147-L215

If I build and use an End2End engine with the Nms_TRT plugin using the above code, D2H takes longer than enqueueV2. Do you have any other secrets in yolov5-rt-stack about how the engine is made?
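For context, an End2End engine built with such a script typically exposes four small NMS outputs. The names below follow the common EfficientNMS_TRT convention and are an assumption about the linked export.py, not verified against it; either way, the D2H payload itself is only a few kilobytes:

```python
import numpy as np

batch, max_dets = 1, 100  # assumed values
outputs = {
    "num_dets":    np.zeros((batch, 1), dtype=np.int32),
    "det_boxes":   np.zeros((batch, max_dets, 4), dtype=np.float32),
    "det_scores":  np.zeros((batch, max_dets), dtype=np.float32),
    "det_classes": np.zeros((batch, max_dets), dtype=np.int32),
}
total = sum(a.nbytes for a in outputs.values())
print(f"total D2H payload: {total / 1e3:.1f} KB")  # only a few KB
```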

zhiqwang commented 1 year ago

If I build and use an End2End engine with the Nms_TRT plugin using the above code, D2H takes longer than enqueueV2.

I remember that the part of his repository that integrates NMS was originally inherited from the yolort repo, but I don't know if he later added some new techniques to it.

Do you have any other secrets in yolov5-rt-stack about how the engine is made?

Nope

YoungjaeDev commented 1 year ago

Since the mAP is almost the same, there is no problem using it, right? I mentioned the phenomenon because it was a little strange.

zhiqwang commented 1 year ago

Since the mAP is almost the same, there is no problem using it, right?

Yep

I mentioned the phenomenon because it was a little strange.

Indeed, but the information here is limited, and I'm sorry I cannot analyze the reasons behind it.