zhiqwang / yolort

yolort is a runtime stack for yolov5 on specialized accelerators such as tensorrt, libtorch, onnxruntime, tvm and ncnn.
https://zhiqwang.com/yolort
GNU General Public License v3.0

Add Multi-Weight Support API #282

Open mattpopovich opened 2 years ago

mattpopovich commented 2 years ago

🐛 Describe the bug

Before I closed #273, I wanted to make a PR adding some documentation on how to go from Ultralytics weights --> yolort weights --> LibTorch C++ inference (i.e. how to run deployment/libtorch/main.cpp with Ultralytics weights). I was going to point the documentation at the CLI tool you mentioned for the weights conversion, but I'm not sure how to use that script properly.

It seems to run just fine:

# python3 yolov5_to_yolort.py --checkpoint_path models/yolov5s-v6.0.pt --version 'r6.0' --image_path bus.jpg --output_path conversion_testing
No protocol specified
Command Line Args: Namespace(checkpoint_path='yolov5-rt-stack/models/yolov5s-v6.0.pt', image_path='yolov5-rt-stack/bus.jpg', output_path='yolov5-rt-stack/conversion_testing', version='r6.0')

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  yolort.v5.models.common.Conv            [3, 32, 6, 2, 2]              
  1                -1  1     18560  yolort.v5.models.common.Conv            [32, 64, 3, 2]                
  2                -1  1     18816  yolort.v5.models.common.C3              [64, 64, 1]                   
  3                -1  1     73984  yolort.v5.models.common.Conv            [64, 128, 3, 2]               
  4                -1  2    115712  yolort.v5.models.common.C3              [128, 128, 2]                 
  5                -1  1    295424  yolort.v5.models.common.Conv            [128, 256, 3, 2]              
  6                -1  3    625152  yolort.v5.models.common.C3              [256, 256, 3]                 
  7                -1  1   1180672  yolort.v5.models.common.Conv            [256, 512, 3, 2]              
  8                -1  1   1182720  yolort.v5.models.common.C3              [512, 512, 1]                 
  9                -1  1    656896  yolort.v5.models.common.SPPF            [512, 512, 5]                 
 10                -1  1    131584  yolort.v5.models.common.Conv            [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  yolort.v5.models.common.Concat          [1]                           
 13                -1  1    361984  yolort.v5.models.common.C3              [512, 256, 1, False]          
 14                -1  1     33024  yolort.v5.models.common.Conv            [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  yolort.v5.models.common.Concat          [1]                           
 17                -1  1     90880  yolort.v5.models.common.C3              [256, 128, 1, False]          
 18                -1  1    147712  yolort.v5.models.common.Conv            [128, 128, 3, 2]              
 19          [-1, 14]  1         0  yolort.v5.models.common.Concat          [1]                           
 20                -1  1    296448  yolort.v5.models.common.C3              [256, 256, 1, False]          
 21                -1  1    590336  yolort.v5.models.common.Conv            [256, 256, 3, 2]              
 22          [-1, 10]  1         0  yolort.v5.models.common.Concat          [1]                           
 23                -1  1   1182720  yolort.v5.models.common.C3              [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  yolort.v5.models.yolo.Detect            [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /resources/pytorch/c10/core/TensorImpl.h:1153.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Model Summary: 270 layers, 7235389 parameters, 7235389 gradients

But I'm not sure how to use the output model. If you can elaborate on how it's meant to be used, I'd be happy to make a PR with added documentation.

I tried to use it directly in deployment/libtorch/main.cpp, but that gave the same error as #142:

# ./yolort_torch --input_source ../../../bus.jpg --checkpoint ../../../conversion_testing/yolov5_darknet_pan_s_r60_custom.pt --labelmap ../../../coco.names --gpu
Set GPU mode
Loading model
Error loading the model: PytorchStreamReader failed locating file constants.pkl: file not found
Exception raised from valid at /resources/pytorch/caffe2/serialize/inline_container.cc:151 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7feab3a7b7ac in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7feab3a47866 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: caffe2::serialize::PyTorchStreamReader::valid(char const*, char const*) + 0x35b (0x7feaa84ff1cb in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamReader::getRecordID(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x57 (0x7feaa84ffa77 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamReader::getRecord(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x5c (0x7feaa84ffb1c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, std::shared_ptr<torch::jit::StorageContext>) + 0x125 (0x7feaa9bf6e15 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x4071beb (0x7feaa9befbeb in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x4074327 (0x7feaa9bf2327 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: torch::jit::load(std::shared_ptr<caffe2::serialize::ReadAdapterInterface>, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x1ba (0x7feaa9bf378a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0xc2 (0x7feaa9bf54a2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>) + 0x6a (0x7feaa9bf558a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x22528 (0x56274e516528 in ./yolort_torch)
frame #12: __libc_start_main + 0xf3 (0x7fea98eb80b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #13: <unknown function> + 0x2100e (0x56274e51500e in ./yolort_torch)
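If I understand the error correctly, torch::jit::load expects a TorchScript archive, so perhaps the converted .pt isn't meant to be loaded that way at all. For reference, here is a minimal sketch (my assumption, not something I've confirmed) of what I'd expect the LibTorch path to need: export a scripted model from Python and point deployment/libtorch/main.cpp at that file instead.

```python
import torch
from yolort.models import YOLOv5

# Sketch (untested): build the yolort model from the original Ultralytics
# checkpoint, then export a TorchScript archive that torch::jit::load in
# deployment/libtorch/main.cpp can read.
model = YOLOv5.load_from_yolov5("models/yolov5s-v6.0.pt", version="r6.0")
model = model.eval()

# Assumption on my part: the YOLOv5 wrapper is scriptable end to end.
scripted = torch.jit.script(model)
scripted.save("yolov5s-v6.0.torchscript.pt")
```

Is that the intended workflow, or should yolov5_darknet_pan_s_r60_custom.pt be loadable by torch::jit::load directly?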

I tried to load it in Python, but with no luck:

# python3
>>> from yolort.models import YOLO, YOLOv5
>>> checkpoint_path = 'yolov5-rt-stack/conversion_testing/yolov5_darknet_pan_s_r60_custom.pt' 
>>> model = YOLOv5.load_from_yolov5(checkpoint_path = checkpoint_path, version = 'r6.0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mpopovich/git/yolov5-rt-stack/yolort/models/yolo_module.py", line 308, in load_from_yolov5
    model = YOLO.load_from_yolov5(checkpoint_path, **kwargs)
  File "/home/mpopovich/git/yolov5-rt-stack/yolort/models/yolo.py", line 204, in load_from_yolov5
    model_info = load_from_ultralytics(checkpoint_path, version=version)
  File "/home/mpopovich/git/yolov5-rt-stack/yolort/utils/update_module_state.py", line 60, in load_from_ultralytics
    checkpoint_yolov5 = load_yolov5_model(checkpoint_path)
  File "/home/mpopovich/git/yolov5-rt-stack/yolort/v5/helper.py", line 67, in load_yolov5_model
    model_ckpt = ckpt["model"]  # load model
KeyError: 'model'
>>> model = YOLO.load_from_yolov5(checkpoint_path = checkpoint_path, version = 'r6.0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mpopovich/git/yolov5-rt-stack/yolort/models/yolo.py", line 204, in load_from_yolov5
    model_info = load_from_ultralytics(checkpoint_path, version=version)
  File "/home/mpopovich/git/yolov5-rt-stack/yolort/utils/update_module_state.py", line 60, in load_from_ultralytics
    checkpoint_yolov5 = load_yolov5_model(checkpoint_path)
  File "/home/mpopovich/git/yolov5-rt-stack/yolort/v5/helper.py", line 67, in load_yolov5_model
    model_ckpt = ckpt["model"]  # load model
KeyError: 'model'
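If I'm reading the traceback right, load_from_yolov5() goes through load_from_ultralytics(), which expects an Ultralytics-format checkpoint (a dict with a "model" key), so it understandably rejects the already-converted file. Handing it the original Ultralytics checkpoint does appear to be the supported path, e.g.:

```python
from yolort.models import YOLOv5

# Per the traceback, load_from_yolov5() reads the *original* Ultralytics
# checkpoint, which still contains the "model" key that yolort/v5/helper.py
# looks for.
model = YOLOv5.load_from_yolov5(
    checkpoint_path="models/yolov5s-v6.0.pt",
    version="r6.0",
)
model.eval()
```

But that sidesteps the converted yolov5_darknet_pan_s_r60_custom.pt entirely, which brings me back to the question of what that file is for.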

You also mentioned I might be able to use the converted model weights if "I load the translated checkpoints in yolort.models.yolov5s()". I'm not seeing any argument that would allow me to load a checkpoint through yolort.models.yolov5s():

>>> from yolort.models import yolov5s
>>> model = yolov5s(checkpoint_path = checkpoint_path)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mpopovich/git/yolov5-rt-stack/yolort/models/__init__.py", line 57, in yolov5s
    model = YOLOv5(arch="yolov5_darknet_pan_s_r60", **kwargs)
  File "/home/mpopovich/git/yolov5-rt-stack/yolort/models/yolo_module.py", line 59, in __init__
    model = yolo.__dict__[arch](
  File "/home/mpopovich/git/yolov5-rt-stack/yolort/models/yolo.py", line 523, in yolov5_darknet_pan_s_r60
    return build_model(
  File "/home/mpopovich/git/yolov5-rt-stack/yolort/models/yolo.py", line 262, in build_model
    model = YOLO(backbone, num_classes, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'checkpoint_path'
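The closest I can guess (and it is only a guess) is that the converted file is a plain state_dict meant to be loaded into an already-constructed yolov5s() model, something like:

```python
import torch
from yolort.models import yolov5s

# Pure guesswork on my part: assume yolov5_to_yolort.py wrote a plain
# state_dict for the yolov5_darknet_pan_s_r60 architecture rather than a
# full Ultralytics-style checkpoint or a TorchScript archive.
state_dict = torch.load(
    "conversion_testing/yolov5_darknet_pan_s_r60_custom.pt", map_location="cpu"
)

model = yolov5s()
# I'm not sure which module the keys line up with; it might instead need to be
# model.model.load_state_dict(state_dict).
model.load_state_dict(state_dict)
model.eval()
```

But I haven't been able to confirm that this is what the script produces.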

Let me know what I'm doing wrong - thank you!

Versions


```console
# python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.9.0a0+gitd69c22d
Is debug build: False
CUDA used to build PyTorch: 11.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.31

Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.4.0-92-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.2.152
GPU models and configuration:
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080
GPU 2: GeForce GTX 1080

Nvidia driver version: 460.91.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.5.8
[pip3] torch==1.9.0a0+gitd69c22d
[pip3] torchmetrics==0.6.2
[pip3] torchvision==0.10.0a0+300a8a4
[conda] Could not collect
```

zhiqwang commented 2 years ago

Hi @mattpopovich ,

But I'm not sure how to use the output model.

It's a little tricky here. We load the translated checkpoint weights below, and model_urls[weights_name] points to a checkpoint translated from yolov5 with this CLI tool. It seems that we have to manually edit the code here; see more details about the limitations of this interface at https://pytorch.org/blog/introducing-torchvision-new-multi-weight-support-api/#limitations-of-the-current-api. https://github.com/zhiqwang/yolov5-rt-stack/blob/3485ea144fec7b2857d0d2e0d4ff329959e77027/yolort/models/yolo.py#L263-L267

If you can elaborate on how, I'd be happy to make a PR with added documentation.

Sure! The torchvision team introduced a new API for multi-weight support at https://pytorch.org/blog/introducing-torchvision-new-multi-weight-support-api/#multi-weight-support. I think this interface is just what we need here, and we could follow their strategy.
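Roughly, I imagine something along the following lines. This is only a sketch to illustrate the direction: the Weights record and the YOLOv5s_R60_Weights enum are hypothetical (the URL is a placeholder), and I'm reusing the existing yolov5_darknet_pan_s_r60 builder just to keep the example short.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional

from torch.hub import load_state_dict_from_url

from yolort.models import yolo  # existing internal builders


@dataclass
class Weights:
    """Minimal stand-in for torchvision's Weights record."""

    url: str
    meta: Dict[str, Any] = field(default_factory=dict)


class YOLOv5s_R60_Weights(Enum):
    """Hypothetical weight registry for the yolov5_darknet_pan_s_r60 arch."""

    COCO_V1 = Weights(
        url="https://example.com/yolov5_darknet_pan_s_r60_coco.pt",  # placeholder URL
        meta={"num_classes": 80, "version": "r6.0"},
    )


def yolov5s(*, weights: Optional[YOLOv5s_R60_Weights] = None, progress: bool = True, **kwargs):
    """Sketch of a multi-weight builder in the spirit of the torchvision post."""
    # Build the architecture without touching the hard-coded model_urls dict
    # (assuming the builder accepts pretrained=False as it does today).
    model = yolo.yolov5_darknet_pan_s_r60(pretrained=False, **kwargs)
    if weights is not None:
        state_dict = load_state_dict_from_url(weights.value.url, progress=progress)
        model.load_state_dict(state_dict)
    return model
```

Calling it would then look like model = yolov5s(weights=YOLOv5s_R60_Weights.COCO_V1), and adding a new set of weights would just mean adding a new enum entry instead of editing model_urls.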


zhiqwang commented 2 years ago

I'm a little hesitant here about whether we should adopt this brand-new interface outright or provide a backward-compatible interface the way torchvision did.

Because we do not yet support training, my judgment is that most people currently use the classmethods YOLO.load_from_yolov5() or YOLOv5.load_from_yolov5() to load custom checkpoints, and we will keep those classmethods. So I'm inclined to go the route of fully adopting the new interface from torchvision.
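For completeness, the backward-compatible route would roughly mean adding a shim like the one below to each builder, reusing the YOLOv5s_R60_Weights enum from the sketch in my previous comment (again, purely hypothetical):

```python
import warnings


def yolov5s(*, weights=None, pretrained: bool = False, **kwargs):
    """Hypothetical shim mirroring torchvision's deprecation path: map the
    legacy 'pretrained' flag onto the new 'weights' argument, then continue
    exactly as in the multi-weight builder sketched above."""
    if pretrained:
        warnings.warn(
            "'pretrained' is deprecated; please use the 'weights' argument instead.",
            DeprecationWarning,
        )
        weights = YOLOv5s_R60_Weights.COCO_V1  # default weights from the earlier sketch
    ...
```

That extra layer is what I would prefer to avoid by adopting the new interface directly.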

Let me know if you have more concerns about this.