zhiqwang / yolort

yolort is a runtime stack for YOLOv5 on specialized accelerators such as TensorRT, LibTorch, ONNX Runtime, TVM and NCNN.
https://zhiqwang.com/yolort
GNU General Public License v3.0

Slower than expected GPU inference in `deployment/libtorch` example #273

Closed mattpopovich closed 2 years ago

mattpopovich commented 2 years ago

🐛 Describe the bug

I created some yolov5-rt-stack TorchScript models by following the script here. I then followed the README instructions to build the LibTorch C++ code. Everything works as expected, except that inference on the GPU is much slower (roughly 7x) than on the CPU.

Can you confirm these results, or am I doing something wrong? I believe that previously (in the July-August 2021 timeframe) I was seeing inference times in the 8-10 ms range.

v4.0:

Click to show v4.0

```console
root@pc:yolov5-rt-stack/deployment/libtorch/build# ./yolort_torch --input_source ../../../bus.jpg --checkpoint ../../../yolov5s-v4.0-RT-v0.5.2-YOLOv5.torchscript.pt --labelmap ../../../coco.names
Set CPU mode
Loading model
Model loaded
Run once on empty image
[W TensorImpl.h:1153] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator())
Pre-process takes : 18 ms
Inference takes : 106 ms
Detected labels:
 0
 0
 0
 5
 0
[ CPULongType{5} ]
Detected boxes:
 669.2656  391.3025  809.8663  885.2344
  54.0635  397.8318  235.9531  901.3731
 222.8834  406.8119  341.5572  854.7792
  18.6320  232.9767  810.9739  760.1169
   0.4640  502.0519   88.5140  887.0480
[ CPUFloatType{5,4} ]
Detected scores:
 0.8901
 0.8733
 0.8537
 0.7234
 0.3769
[ CPUFloatType{5} ]
root@pc:yolov5-rt-stack/deployment/libtorch/build# ./yolort_torch --input_source ../../../bus.jpg --checkpoint ../../../yolov5s-v4.0-RT-v0.5.2-YOLOv5.torchscript.pt --labelmap ../../../coco.names --gpu
Set GPU mode
Loading model
Model loaded
Run once on empty image
[W TensorImpl.h:1153] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator())
Pre-process takes : 21 ms
Inference takes : 748 ms
Detected labels:
 0
 0
 0
 5
 0
[ CUDALongType{5} ]
Detected boxes:
 669.2656  391.3025  809.8663  885.2344
  54.0635  397.8318  235.9531  901.3730
 222.8834  406.8120  341.5572  854.7791
  18.6320  232.9767  810.9739  760.1170
   0.4640  502.0522   88.5139  887.0480
[ CUDAFloatType{5,4} ]
Detected scores:
 0.8901
 0.8733
 0.8537
 0.7234
 0.3769
[ CUDAFloatType{5} ]
```

v6.0:

Click to show v6.0

```console
root@pc:yolov5-rt-stack/deployment/libtorch/build# ./yolort_torch --input_source ../../../bus.jpg --checkpoint ../../../yolov5s-v6.0-RT-v0.5.2-YOLOv5.torchscript.pt --labelmap ../../../coco.names
Set CPU mode
Loading model
Model loaded
Run once on empty image
[W TensorImpl.h:1153] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator())
Pre-process takes : 15 ms
Inference takes : 95 ms
Detected labels:
 0
 0
 0
 5
 0
[ CPULongType{5} ]
Detected boxes:
 224.5497  402.5811  342.7194  862.6057
  51.8626  398.3438  245.3290  906.3114
 679.8232  385.5574  809.3773  883.1394
   0.1952  201.8805  812.9611  786.3345
   0.0480  558.7347   75.8148  871.5754
[ CPUFloatType{5,4} ]
Detected scores:
 0.8959
 0.8846
 0.8579
 0.5181
 0.3932
[ CPUFloatType{5} ]
root@pc:yolov5-rt-stack/deployment/libtorch/build# ./yolort_torch --input_source ../../../bus.jpg --checkpoint ../../../yolov5s-v6.0-RT-v0.5.2-YOLOv5.torchscript.pt --labelmap ../../../coco.names --gpu
Set GPU mode
Loading model
Model loaded
Run once on empty image
[W TensorImpl.h:1153] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator())
Pre-process takes : 28 ms
Inference takes : 746 ms
Detected labels:
 0
 0
 0
 5
 0
[ CUDALongType{5} ]
Detected boxes:
 224.5497  402.5810  342.7194  862.6058
  51.8626  398.3439  245.3289  906.3113
 679.8232  385.5574  809.3773  883.1393
   0.1954  201.8804  812.9608  786.3347
   0.0480  558.7346   75.8148  871.5754
[ CUDAFloatType{5,4} ]
Detected scores:
 0.8959
 0.8846
 0.8579
 0.5181
 0.3932
[ CUDAFloatType{5} ]
```

Thanks again for all your help thus far. I'm going to look into `deployment/tensorrt` next to see what inference times I can get there.

Versions

Click to display Versions

```console
# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.9.0a0+gitd69c22d
Is debug build: False
CUDA used to build PyTorch: 11.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31
Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.4.0-92-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.2.152
GPU models and configuration:
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080
GPU 2: GeForce GTX 1080
Nvidia driver version: 460.91.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.8
[pip3] torch==1.9.0a0+gitd69c22d
[pip3] torchmetrics==0.6.2
[pip3] torchvision==0.10.0a0+300a8a4
[conda] Could not collect
```

zhiqwang commented 2 years ago

Hi @mattpopovich ,

It seems that PyTorch 1.9 requires two warm-up runs on the GPU, so we need to ignore the first two measured times. Could you test it again, or upgrade your PyTorch to 1.10.1? (See https://github.com/pytorch/pytorch/pull/58801 for more details.)
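For anyone reproducing this outside of the `yolort_torch` example, a minimal sketch of the warm-up-then-time pattern being described might look like the following. This is only an illustration, not code from this repository: the checkpoint path and input size are placeholders, and it assumes a CUDA-enabled LibTorch build where `torch::cuda::synchronize()` is available.

```cpp
// Minimal warm-up/timing sketch for a TorchScript model on CUDA (illustrative only).
#include <chrono>
#include <iostream>
#include <vector>

#include <torch/script.h>
#include <torch/torch.h>

int main() {
  // Placeholder path: substitute your exported yolort TorchScript checkpoint.
  torch::jit::script::Module module = torch::jit::load("yolov5s.torchscript.pt");
  module.to(torch::kCUDA);
  module.eval();

  std::vector<torch::jit::IValue> inputs;
  inputs.emplace_back(torch::rand({1, 3, 640, 640}, torch::kCUDA));

  torch::NoGradGuard no_grad;

  // Warm-up: the first couple of forward calls pay one-off CUDA/cuDNN
  // initialization and JIT optimization costs, so discard their timings.
  for (int i = 0; i < 3; ++i) {
    module.forward(inputs);
  }
  torch::cuda::synchronize();

  // Timed run.
  auto start = std::chrono::steady_clock::now();
  module.forward(inputs);
  torch::cuda::synchronize();  // wait for the GPU before stopping the clock
  auto end = std::chrono::steady_clock::now();

  std::cout << "Inference takes : "
            << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
            << " ms" << std::endl;
  return 0;
}
```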

The TensorRT C++ part is still under development. We have implemented the core parts of the model conversion, but several pieces still need to be implemented:

  1. We use the `YOLO.load_from_yolov5()` strategy for TensorRT, so we need to implement the pre-processing in the C++ example; the existing version is a bit rough (see the sketch after this list).
  2. We use a static shape mechanism when converting the model to a TensorRT engine; we need to add dynamic shape support, which is very important for practical applications. See #266.
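As a rough illustration of the pre-processing in point 1, a YOLOv5-style letterbox in C++/OpenCV could be sketched as below. This is only a sketch of the idea, not the code shipped in `deployment/`; the function name and signature are made up for the example.

```cpp
// Illustrative YOLOv5-style letterbox: resize while keeping the aspect ratio and
// pad the remainder with gray (114, 114, 114), as YOLOv5's letterbox does.
#include <algorithm>
#include <cmath>

#include <opencv2/opencv.hpp>

cv::Mat letterbox(const cv::Mat& img, int new_w, int new_h,
                  float& scale, int& pad_w, int& pad_h) {
  scale = std::min(new_w / static_cast<float>(img.cols),
                   new_h / static_cast<float>(img.rows));
  const int resized_w = static_cast<int>(std::round(img.cols * scale));
  const int resized_h = static_cast<int>(std::round(img.rows * scale));
  pad_w = (new_w - resized_w) / 2;
  pad_h = (new_h - resized_h) / 2;

  cv::Mat resized;
  cv::resize(img, resized, cv::Size(resized_w, resized_h));

  cv::Mat out;
  cv::copyMakeBorder(resized, out,
                     pad_h, new_h - resized_h - pad_h,
                     pad_w, new_w - resized_w - pad_w,
                     cv::BORDER_CONSTANT, cv::Scalar(114, 114, 114));
  return out;
}
```

The returned scale and padding would then be needed again after inference to map the predicted boxes back to the original image.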

And all contributions are welcome here!

mattpopovich commented 2 years ago

Great find! I ran far too many tests on my machine (below) with PyTorch, TorchVision, and OpenCV built from source. (Originally I was seeing slow inference no matter how many times I "warmed up" the model, but I have since been unable to reproduce that.)

It looks like three warm-up runs are necessary for all recent versions of PyTorch.
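A stripped-down way to check how many warm-up iterations are needed is to time every forward call and watch where the latency settles. The helper below is only illustrative (the name `benchmark_iterations` is made up, and the module/input setup is assumed to be the same as in the sketch in the previous comment):

```cpp
// Illustrative helper: run `iters` forward passes and print each latency, so the
// slow first iterations (the warm-up) are easy to spot.
#include <chrono>
#include <iostream>
#include <vector>

#include <torch/script.h>
#include <torch/torch.h>

void benchmark_iterations(torch::jit::script::Module& module,
                          std::vector<torch::jit::IValue>& inputs,
                          int iters = 10) {
  torch::NoGradGuard no_grad;
  for (int i = 0; i < iters; ++i) {
    auto start = std::chrono::steady_clock::now();
    module.forward(inputs);
    torch::cuda::synchronize();  // make sure the GPU work has finished
    auto end = std::chrono::steady_clock::now();
    std::cout << "iteration " << i << ": "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms" << std::endl;
  }
}
```

The configurations I tested: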


CUDA 11.4.3, PyTorch 1.10.1, TorchVision 0.11.2, OpenCV 4.5.5:

Click to show software configuration

```console
# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0a0+git302ee7b
Is debug build: False
CUDA used to build PyTorch: 11.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31
Python version: 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.152
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.0
[pip3] pytorch-lightning==1.5.8
[pip3] torch==1.10.0a0+git302ee7b
[pip3] torchmetrics==0.6.2
[pip3] torchvision==0.11.0a0+e7ec7e2
[conda] Could not collect
```

CUDA 11.4.2, PyTorch 1.10.1, TorchVision 0.11.2, OpenCV 4.5.5:

Click to show software configuration

```console
# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0a0+git302ee7b
Is debug build: False
CUDA used to build PyTorch: 11.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31
Python version: 3.8.10 (default, Sep 28 2021, 16:10:42) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.120
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.8
[pip3] torch==1.10.0a0+git302ee7b
[pip3] torchmetrics==0.6.2
[pip3] torchvision==0.11.0a0+e7ec7e2
[conda] Could not collect
```

CUDA 11.4.2, PyTorch 1.10.0, TorchVision 0.11.1, OpenCV 4.5.4:

Click to show software configuration

```console
# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0a0+git36449ea
Is debug build: False
CUDA used to build PyTorch: 11.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31
Python version: 3.8.10 (default, Sep 28 2021, 16:10:42) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.120
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.4
[pip3] torch==1.10.0a0+git36449ea
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.11.0a0+fa347eb
[conda] Could not collect
```

CUDA 11.4.1, PyTorch 1.10.1, TorchVision 0.11.2, OpenCV 4.5.5:

Click to show software configuration

```console
# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0a0+git302ee7b
Is debug build: False
CUDA used to build PyTorch: 11.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31
Python version: 3.8.10 (default, Sep 28 2021, 16:10:42) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.120
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.8
[pip3] torch==1.10.0a0+git302ee7b
[pip3] torchmetrics==0.6.2
[pip3] torchvision==0.11.0a0+e7ec7e2
[conda] Could not collect
```

CUDA 11.4.1, PyTorch 1.10.0 commit 3fd9dcf, TorchVision 0.11.1, OpenCV 4.5.4:

Click to show software configuration

```console
# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0a0+git3fd9dcf
Is debug build: False
CUDA used to build PyTorch: 11.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31
Python version: 3.8.10 (default, Sep 28 2021, 16:10:42) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.120
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.4
[pip3] torch==1.10.0a0+git3fd9dcf
[pip3] torchmetrics==0.6.1
[pip3] torchvision==0.11.0a0+fa347eb
[conda] Could not collect
```

CUDA 11.4.0, PyTorch 1.10.0, TorchVision 0.11.1, OpenCV 4.5.4:

Click to show software configuration

```console
# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0a0+git36449ea
Is debug build: False
CUDA used to build PyTorch: 11.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31
Python version: 3.8.10 (default, Sep 28 2021, 16:10:42) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.120
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.4
[pip3] torch==1.10.0a0+git36449ea
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.11.0a0+fa347eb
[conda] Could not collect
```

CUDA 11.3.1, PyTorch 1.9.1, TorchVision 0.10.1, OpenCV 4.5.4:

Click to show software configuration

```console
python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.9.0a0+gitdfbd030
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31
Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.4
[pip3] torch==1.9.0a0+gitdfbd030
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.10.0a0+ca1a620
[conda] Could not collect
```

CUDA 11.2.0, PyTorch 1.9.0, TorchVision 0.10.0, OpenCV 4.5.2:

Click to show software configuration

```console
python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.9.0a0+gitd69c22d
Is debug build: False
CUDA used to build PyTorch: 11.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31
Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.2.67
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
Nvidia driver version: 470.74
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.2
[pip3] torch==1.9.0a0+gitd69c22d
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.10.0a0+300a8a4
[conda] Could not collect
```


A different PC with everything pre-built, not running in Docker:

CUDA 11.5, PyTorch 1.10.0, TorchVision 0.11.0, OpenCV 4.5.3:

Click to show software configuration

```console
$ python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0a0+git36449ea
Is debug build: False
CUDA used to build PyTorch: 11.5
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~18.04) 9.4.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.25
Python version: 3.6.9 (default, Dec 8 2021, 21:08:43) [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-1065-azure-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.5.119
GPU models and configuration: GPU 0: Tesla V100-PCIE-16GB
Nvidia driver version: 495.29.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.10.0a0+git36449ea
[pip3] torchvision==0.11.0a0+cdacbe0
[conda] Could not collect
```

zhiqwang commented 2 years ago

Hi @mattpopovich, thanks for the detailed experimental data you provided here! I believe this phenomenon is now explained, so I'm closing this ticket, but let us know if you have further questions.