zjjMaiMai / TinyHITNet

HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching

Export ONNX #11

Closed deephog closed 2 years ago

deephog commented 2 years ago

Thank you for creating this great work!

Have you ever tried to export an ONNX model of your HITNet model? When I try to export one, it always complains that tensors from both the CPU and the GPU are used in the model, which is very strange because I moved both the input data and the model to the GPU. Please share your thoughts on this issue. Thanks!

zjjMaiMai commented 2 years ago

Try exporting the ONNX model on the CPU, and set opset_version to 11.
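
For reference, a minimal export sketch along those lines might look like the following. It assumes `model` is an already-trained HITNet/TinyHITNet module; the dummy input size, file name, and output name are placeholders.

```python
import torch

# Assumes `model` is your trained HITNet/TinyHITNet nn.Module with weights loaded.
model = model.to("cpu").eval()          # export on the CPU, as suggested above

# Dummy stereo pair on the CPU (shape is only an example).
left = torch.randn(1, 3, 480, 640)
right = torch.randn(1, 3, 480, 640)

torch.onnx.export(
    model,
    (left, right),
    "hitnet.onnx",
    opset_version=11,                   # opset version suggested above
    input_names=["left", "right"],
    output_names=["disparity"],
    do_constant_folding=True,
)
```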

PINTO0309 commented 2 years ago
deephog commented 2 years ago

Thank you for sharing!! I tried your HITNet; I didn't realize you have a TinyHITNet as well.

The thing is, I still want to train the model with my own data and then convert it into ONNX. I tried what the author suggested: export on the CPU with opset version 11. The tensors-from-different-devices issue is gone, and an ONNX file is generated successfully.

However, when I try to build a TensorRT engine, it gives me the following error:

[03/28/2022-13:03:36] [TRT] [V] Pad_144 [Pad] inputs: [290 -> (1, 16, 720, 1280)[FLOAT]], [316 -> (8)[INT32]], [317 -> ()[FLOAT]],
[03/28/2022-13:03:36] [TRT] [V] Registering layer: Pad_144 for ONNX node: Pad_144
[03/28/2022-13:03:36] [TRT] [E] [shuffleNode.cpp::symbolicExecute::391] Error Code 4: Internal Error (Reshape_133: IShuffleLayer applied to shape tensor must have 0 or 1 reshape dimensions: dimensions were [-1,2])

It seems that TensorRT doesn't like the F.pad operations. Did you try to build a TensorRT engine? If so, please share how you solved this issue.

Thanks!

deephog commented 2 years ago

I also tried to build a TensorRT engine from your TinyHITNet ONNX file, and it failed for a similar reason: the padding operations. However, I successfully built your HITNet model. Do they come from different sources? I can see that this author's HITNet_XL version has similar padding operations, and I cannot build it either.

PINTO0309 commented 2 years ago

Conversion after optimization does not cause any particular problem. It is laborious work, so I simply committed all of my TinyHITNet models without performing any optimization.

```bash
$ docker run --gpus all -it --rm \
    -v `pwd`:/home/user/workdir \
    ghcr.io/pinto0309/openvino2tensorflow:latest

$ onnxsim hitnet_sf_finalpass_180x320_nonopt.onnx hitnet_sf_finalpass_180x320_opt.onnx
$ onnx2trt hitnet_sf_finalpass_180x320_opt.onnx -o hitnet_sf_finalpass_180x320_opt.trt -b 1 -d 16 -v
```

----------------------------------------------------------------
Input filename:   hitnet_sf_finalpass_180x320_opt.onnx
ONNX IR version:  0.0.7
Opset version:    12
Producer name:    pytorch
Producer version: 1.10
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
[2022-03-28 23:44:53    INFO] [MemUsageChange] Init CUDA: CPU +458, GPU +0, now: CPU 612, GPU 875 (MiB)
[2022-03-28 23:44:53    INFO] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 612 MiB, GPU 875 MiB
[2022-03-28 23:44:54    INFO] [MemUsageSnapshot] End constructing builder kernel library: CPU 766 MiB, GPU 919 MiB
Parsing model
[2022-03-28 23:44:54 WARNING] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[2022-03-28 23:44:54 WARNING] onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[2022-03-28 23:44:54 WARNING] Tensor DataType is determined at build time for tensors not marked as input or output.
Building TensorRT engine, FP16 available:1
    Max batch size:     1
    Max workspace size: 1024 MiB
[2022-03-28 23:44:55 WARNING] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.5 but loaded cuBLAS/cuBLAS LT 11.6.1
[2022-03-28 23:44:55    INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +791, GPU +340, now: CPU 1772, GPU 1259 (MiB)
[2022-03-28 23:44:56    INFO] [MemUsageChange] Init cuDNN: CPU +195, GPU +342, now: CPU 1967, GPU 1601 (MiB)
[2022-03-28 23:44:56    INFO] Local timing cache in use. Profiling results in this builder pass will not be stored.
[2022-03-28 23:48:13    INFO] Detected 2 inputs and 1 output network tensors.
[2022-03-28 23:48:13    INFO] Total Host Persistent Memory: 262768
[2022-03-28 23:48:13    INFO] Total Device Persistent Memory: 1515520
[2022-03-28 23:48:13    INFO] Total Scratch Memory: 2311200
[2022-03-28 23:48:13    INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 539 MiB
[2022-03-28 23:48:13    INFO] [BlockAssignment] Algorithm ShiftNTopDown took 59.8616ms to assign 12 blocks to 225 nodes requiring 16339456 bytes.
[2022-03-28 23:48:13    INFO] Total Activation Memory: 16339456
[2022-03-28 23:48:13 WARNING] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.5 but loaded cuBLAS/cuBLAS LT 11.6.1
[2022-03-28 23:48:13    INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3066, GPU 2122 (MiB)
[2022-03-28 23:48:13    INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3066, GPU 2130 (MiB)
[2022-03-28 23:48:13    INFO] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +1, GPU +72, now: CPU 1, GPU 72 (MiB)
Writing TensorRT engine to hitnet_sf_finalpass_180x320_opt.trt
All done

(image attached)
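
For anyone who prefers driving the build from Python instead of onnx2trt or trtexec, a rough FP16 build along the same lines could look like this. This is only a sketch assuming a TensorRT 8.x Python installation; the file names simply mirror the ones used above.

```python
import tensorrt as trt

ONNX_PATH = "hitnet_sf_finalpass_180x320_opt.onnx"   # simplified model from above
ENGINE_PATH = "hitnet_sf_finalpass_180x320_opt.trt"

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the simplified ONNX graph.
with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30          # 1 GiB, mirroring the log above
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

# Build and serialize the engine to disk.
engine_bytes = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(engine_bytes)
```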

deephog commented 2 years ago

> Conversion after optimization does not cause any particular problem. It is laborious work, so I simply committed all of my TinyHITNet models without performing any optimization. […]

I used the trtexec tool provided with TensorRT and never tried onnx2trt. Thank you!

Also, did you modify his model in some particular way, or use tools other than torch.onnx.export to generate your ONNX model? It would be much appreciated if you could share how you generated your ONNX models. Thanks!

PINTO0309 commented 2 years ago
  1. Export to ONNX without doing anything special
  2. Optimization with onnx-simplifier (onnxsim)
  3. Conversion to trt engine with onnx2trt

onnx2trt is just a backend call to trtexec.
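
Step 2 can also be driven from Python rather than the onnxsim CLI; a minimal sketch (the file names are placeholders) would be:

```python
import onnx
from onnxsim import simplify

model = onnx.load("hitnet_nonopt.onnx")
model_simplified, check = simplify(model)
assert check, "simplified model failed the validation check"
onnx.save(model_simplified, "hitnet_opt.onnx")
```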

deephog commented 2 years ago
> 1. Export to ONNX without doing anything special […]

Hi Hyodo-san, I used exactly your Docker container and followed your steps. I guess the step that fixed TensorRT's "pad operation not supported" issue is running onnxsim beforehand. However, when I try to run onnxsim, even though my ONNX file is only 16 MB, it fails with "ValueError: Message onnx.ModelProto exceeds maximum protobuf size of 2GB: 2364128722" and stops the simplification. I did use a larger input than yours, 720x1280, but, as I said, the ONNX file before simplification is only 16 MB, so I don't understand how it can exceed 2 GB. Please share your thoughts.

Thanks!

PINTO0309 commented 2 years ago

@deephog A full explanation would be very difficult to follow, so I will just give the gist of it.

Most models that process stereo images try to merge features from the left and right images by breaking the images into small patches. This patch processing is realized with countless Slice, GatherND, and ScatterND operations, and onnx-simplifier tries to replace all internal parameters with constants in order to optimize those operations to the limit. As a result of onnx-simplifier converting various parameters into INT64 constants, the amount of INT64 data embedded in the model grows enormously with the size of the image.

Therefore, the internal processing of onnx-simplifier must be modified to keep the size of the final output .onnx file under 2 GB. ONNX models are serialized as Protocol Buffers, and Protocol Buffers are limited to a maximum size of 2 GB.
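
A quick way to see which embedded constants dominate an .onnx file (one that still fits under the limit, for example a lower-resolution export) is to sum the initializer sizes; a sketch using the standard onnx Python package, with a placeholder file name:

```python
import onnx
from onnx import numpy_helper

model = onnx.load("hitnet_opt.onnx")

# Collect (name, dtype, size in bytes) for every constant baked into the graph.
sizes = []
for init in model.graph.initializer:
    arr = numpy_helper.to_array(init)
    sizes.append((init.name, arr.dtype, arr.nbytes))

# Largest constants first; INT64 index tensors tend to dominate after folding.
for name, dtype, nbytes in sorted(sizes, key=lambda s: -s[2])[:10]:
    print(f"{name}: {dtype}, {nbytes / 1e6:.1f} MB")
```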

To that end, I created a validation repository yesterday and have successfully experimented with compressing the size of the ONNX model. Soon I will fork onnx-simplifier, add my experimental implementation, and try it out. https://github.com/PINTO0309/scs4onnx

deephog commented 2 years ago

> @deephog A full explanation would be very difficult to follow, so I will just give the gist of it. […]

Thank you for your prompt reply! I will try your tool and come back with any results I get.

deephog commented 2 years ago

> @deephog A full explanation would be very difficult to follow, so I will just give the gist of it. […]

I still failed to generate an ONNX file because of the size limit. It looks like the compression worked on a few layers, but maybe not all of them. Or maybe I did something wrong; could you try to export with a dummy input shape of (1280, 720)? Thanks!

PINTO0309 commented 2 years ago

I told you to wait and I would do it right away.

I just upgraded. https://github.com/PINTO0309/scs4onnx

Only 2 lines of onnx-simplifier need to be modified. I work another job during the day, so I can only work on this in my private time.

deephog commented 2 years ago

> I told you to wait and I would do it right away. I just upgraded. https://github.com/PINTO0309/scs4onnx […]

Sorry, I didn't mean to push you; I just misunderstood your reply. I thought the compression tool itself was ready and that you only needed to merge it into onnxsim, so I downloaded your other Docker image and tried the compression tool separately.

Anyway, thank you for your tremendous work, and don't let my ignorant comments upset you. I will try it again and come back with the results. I'm on US time (I assume you are in Japan), so I may comment whenever I have results, but you don't need to reply right away; please answer whenever it is convenient. Thanks!

PINTO0309 commented 2 years ago

I have not tried it yet, but I have added a trial method to the README. After tweaking the onnx-simplifier source code, you will need to reinstall onnx-simplifier from source code. https://github.com/PINTO0309/scs4onnx#key-concept

PINTO0309 commented 2 years ago

@deephog My prediction was too optimistic. I found that logic constrained by the 2 GB limit is everywhere in onnx-simplifier; modifying just a few lines of the program results in an error.

I will temporarily suspend this research work, as it will take quite some time to handle the compression for high-resolution models such as 720x1280. :cry:

deephog commented 2 years ago

> @deephog My prediction was too optimistic. […]

Yes, you are right; I found a similar issue. Instead of putting your compressor where you suggested, I put it under line 543, inside the function "simplify" -> "constant_folding", right after the constant folding and before the checker.

However, I encountered another issue, raised in your compressor this time. It says "The input shape of the next OP does not match the output shape".

Please focus on your other work first and come back when you are available. I can wait, days or weeks, no hurry. Have a nice day!

PINTO0309 commented 2 years ago

This is a note of last resort that I came up with just before going to bed.

Recombine after optimization. Splitting and merging seem like they should be straightforward. Optimize each partitioned ONNX component, in the order onnx-simplifier → scs4onnx, to optimize the structure while keeping the buffer size to a minimum, then recombine the optimized components to reconstruct the whole graph. Finally, run scs4onnx again on the reconstructed, optimized overall graph to further reduce the model-wide constants. A rough sketch of the idea is below.
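
A very rough sketch of that split-optimize-recombine idea, using only the standard onnx utilities plus onnxsim. The cut tensor name ("mid_tensor"), the input/output names, and the file names are hypothetical and would have to match the real graph; details such as opset alignment and the final scs4onnx pass are glossed over.

```python
import onnx
import onnx.utils
from onnx import compose
from onnxsim import simplify

# Hypothetical cut point: "mid_tensor" must be an actual tensor name in the graph.
onnx.utils.extract_model("hitnet_full.onnx", "part1.onnx",
                         input_names=["left", "right"], output_names=["mid_tensor"])
onnx.utils.extract_model("hitnet_full.onnx", "part2.onnx",
                         input_names=["mid_tensor"], output_names=["disparity"])

# Optimize each partition separately so no single protobuf exceeds 2 GB.
parts = []
for path in ("part1.onnx", "part2.onnx"):
    simplified, ok = simplify(onnx.load(path))
    assert ok, f"simplification check failed for {path}"
    parts.append(simplified)

# Recombine: wire part1's output back into part2's input.
merged = compose.merge_models(parts[0], parts[1],
                              io_map=[("mid_tensor", "mid_tensor")])
onnx.save(merged, "hitnet_merged_opt.onnx")
```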

PINTO0309 commented 2 years ago

@deephog I have successfully optimized a 720x1280 HITNet with onnx-simplifier. I found the only workaround that avoids exceeding the 2 GB limit. Perhaps you are no longer interested, so I will not describe the details. Incidentally, this figure is before model size compression with scs4onnx. (image attached)

deephog commented 2 years ago

> @deephog I have successfully optimized a 720x1280 HITNet with onnx-simplifier. […]

> • hitnet_xl_sf_finalpass_from_tf_720x1280_cast_opt.onnx (image attached)

Hi Hyodo-san, I'm so glad you figured it out! I am very interested in how you did it. As I said, I want to train the model my own way and convert it to TensorRT afterwards. Please share how you circumvented the 2 GB limitation! Thanks!

PINTO0309 commented 2 years ago

I have posted a fairly detailed issue on onnx-simplifier. If you read the contents carefully, you should be able to understand and reproduce the cause.

"Excessive bloating of ONNX files due to over-efficient conversion of "Tile" to constants (Protocol Buffers .onnx > 2GB)" https://github.com/daquexian/onnx-simplifier/issues/178

pcb9382 commented 2 years ago

> Thank you for sharing!! I tried your HITNet; I didn't realize you have a TinyHITNet as well. […]

I used the CPU and opset_version=11 to generate the ONNX file, but the file is only 90 KB.

pcb9382 commented 2 years ago

> Thank you for sharing!! […]

I used the CPU and opset_version=11 to generate the ONNX file, but the file is only 90 KB.

```python
from __future__ import print_function

import argparse

import torch
import pytorch_lightning as pl

from models import build_model


class PredictModel(pl.LightningModule):
    def __init__(self, **kwargs):
        super().__init__()
        self.save_hyperparameters()
        self.model = build_model(self.hparams)

    def forward(self, left, right):
        left = left * 2 - 1
        right = right * 2 - 1
        return self.model(left, right)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Test')
    parser.add_argument("--images", nargs=2, required=False)
    parser.add_argument("--model", type=str, default="HITNet_SF")
    parser.add_argument("--ckpt", type=str, default="ckpt/hitnet_sf_finalpass.ckpt")
    parser.add_argument("--width", type=int, default=None)
    parser.add_argument("--output", default="./")
    args = parser.parse_args()

    model = PredictModel(**vars(args))
    model.eval()
    ckpt = torch.load("ckpt/hitnet_sf_finalpass.ckpt")
    model.load_state_dict(ckpt["state_dict"])

    device = torch.device("cpu")
    model = model.to(device)

    input_names = ["input0", "input1"]
    output_names = ["output"] + ["_%d" % i for i in range(14)]
    print(output_names)

    left = torch.randn(1, 3, 375, 1242).to(device)
    right = torch.randn(1, 3, 375, 1242).to(device)
    export_onnx_file = "./HITNet_SF.onnx"

    # Note: export_params=False writes the graph without the trained weights,
    # which is why the resulting .onnx file is only ~90 KB.
    torch.onnx.export(model, args=(left, right), f=export_onnx_file, verbose=False,
                      input_names=input_names, output_names=output_names,
                      export_params=False, opset_version=11, do_constant_folding=True)
```

@zjjMaiMai @deephog This is my conversion code; please help me see what is wrong. Thank you!

PINTO0309 commented 2 years ago

The issue I posted against onnx-simplifier with a feature-improvement suggestion has been adopted, and the ONNX file bloat issue was resolved two days ago.