webonnx / wonnx

A WebGPU-accelerated ONNX inference runtime written 100% in Rust, ready for native and the web

I get an error when trying to use yolov8m.onnx in wonnx-wasm-example #174

Open vitiok123 opened 11 months ago

vitiok123 commented 11 months ago

Describe the bug
Hi, I'm trying to use your example (https://github.com/webonnx/wonnx-wasm-example). When I load yolov8m.onnx (a COCO-dataset model exported to ONNX with YOLOv8), I get this error:

SessionError 'IR error: output node for output /model.0/conv/Conv_output_0 not found'

To Reproduce Steps to reproduce the behavior:

  1. const [modelBytes, initResult] = await Promise.all([fetchBytes("./data/models/yolov8m.onnx"), init()])
  2. const session = await Session.fromBytes(modelBytes)

Expected behavior
The model loads without an error.

Screenshots
[two screenshots attached]

pixelspark commented 11 months ago

Can you share the specific onnx file you are using?

In general, this error means the declared graph output is missing: no node in the graph produces it. If the output is present in the ONNX file and properly connected, there may be an issue in the optimizer.
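
For anyone hitting this, a quick way to check whether every declared graph output actually has a producing node is to compare the graph outputs against the node outputs; if everything checks out in the file itself, the optimizer is the likely culprit (a sketch using the Python onnx package; the path is hypothetical):

    import onnx

    model = onnx.load("yolov8m.onnx")  # hypothetical path
    produced = {out for node in model.graph.node for out in node.output}
    for output in model.graph.output:
        status = "ok" if output.name in produced else "no producing node"
        print(output.name, "-", status)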

vitiok123 commented 11 months ago

Hi, you can find the ONNX file (yolov8m.onnx) in this repository: https://github.com/AndreyGermanov/yolov8_onnx_python

I used Python and YOLOv8 to export this file. When exporting, it's possible to pass some arguments; the full list is at https://docs.ultralytics.com/modes/export/#arguments. Maybe this will help determine whether the problem lies in the export settings.
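
For illustration, a YOLOv8 export with explicit arguments looks roughly like this (a sketch assuming the ultralytics package; the opset and simplify values are examples picked from the linked argument list, not the settings actually used here):

    from ultralytics import YOLO

    model = YOLO("yolov8m.pt")  # pretrained COCO weights
    model.export(
        format="onnx",          # writes yolov8m.onnx next to the weights
        opset=12,               # example value, not the setting actually used
        simplify=True,          # run onnx-simplifier as part of the export
    )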

pixelspark commented 11 months ago

Hm, the linked file does not have all of its shapes inferred (nnx prepare is unable to infer all shapes, but that is expected, as shape inference for Conv is not yet supported).

After simplifying with onnx-simplifier (see the README), there are still issues, as the outputs of some Resize nodes are not inferred yet:

[2023-07-18T18:55:43Z ERROR nnx::info] Node '/model.10/Resize' input '' has unknown shape
[2023-07-18T18:55:43Z ERROR nnx::info] Node '/model.13/Resize' input '' has unknown shape
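
To see which tensors still lack fully inferred shapes, one can run shape inference from the Python onnx package and list the incomplete entries (a sketch; the filename is hypothetical):

    import onnx

    model = onnx.load("simplified.onnx")  # hypothetical path
    inferred = onnx.shape_inference.infer_shapes(model)

    # Report intermediate tensors whose shape is missing or not fully static.
    for vi in inferred.graph.value_info:
        tensor_type = vi.type.tensor_type
        if not tensor_type.HasField("shape") or any(
            not dim.HasField("dim_value") for dim in tensor_type.shape.dim
        ):
            print(f"{vi.name}: shape not fully inferred")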

The issue seems to be that this node has no name specified for one of its inputs (this is allowed for optional inputs, as roi is in this case):

[screenshot: the Resize node's inputs, with an empty name for the optional roi input]

This should, however, not pose an issue, since the optimizer moves the inputs of Resize to attributes and ignores the optional roi input in the process.
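
For reference, an omitted optional input shows up as an empty string in the node's input list, which is easy to check with the Python onnx package (a sketch; the path is hypothetical):

    import onnx

    model = onnx.load("simplified.onnx")  # hypothetical path
    for node in model.graph.node:
        if node.op_type == "Resize":
            # Omitted optional inputs (such as roi) appear as empty strings.
            print(node.name, list(node.input))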

So my suggestion would be to try again with the optimized version (obtained using python3 -m onnxsim ./model.onnx ./simplified.onnx).

vitiok123 commented 11 months ago

Wow, cool, thanks a lot! I will try it and give you feedback.

vitiok123 commented 11 months ago

After running python3 -m onnxsim ./model.onnx ./simplified.onnx, these are the statistics:

Simplifying...
Finish! Here is the difference:
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃            ┃ Original Model ┃ Simplified Model ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ Add        │ 15             │ 14               │
│ Concat     │ 19             │ 19               │
│ Constant   │ 189            │ 183              │
│ Conv       │ 84             │ 84               │
│ Div        │ 2              │ 1                │
│ Gather     │ 1              │ 0                │
│ MaxPool    │ 3              │ 3                │
│ Mul        │ 80             │ 78               │
│ Reshape    │ 5              │ 5                │
│ Resize     │ 2              │ 2                │
│ Shape      │ 1              │ 0                │
│ Sigmoid    │ 78             │ 78               │
│ Slice      │ 2              │ 2                │
│ Softmax    │ 1              │ 1                │
│ Split      │ 9              │ 9                │
│ Sub        │ 2              │ 2                │
│ Transpose  │ 1              │ 1                │
│ Model Size │ 99.0MiB        │ 98.9MiB          │
└────────────┴────────────────┴──────────────────┘

Using the simplified model, I get this in the console:

Info logs

transferring input split for op Split to i64 attribute (initializer data type: I64): [48, 48]
applying padding optimization to tensor model.2.m.0.cv1.conv.weight: strides data is 82944 bytes before, 110592 bytes after
applying padding optimization to tensor model.2.m.0.cv2.conv.weight: strides data is 82944 bytes before, 110592 bytes after
applying padding optimization to tensor model.2.m.1.cv1.conv.weight: strides data is 82944 bytes before, 110592 bytes after
applying padding optimization to tensor model.2.m.1.cv2.conv.weight: strides data is 82944 bytes before, 110592 bytes after
transferring input split for op Split to i64 attribute (initializer data type: I64): [96, 96]
applying padding optimization to tensor model.4.m.0.cv1.conv.weight: strides data is 331776 bytes before, 442368 bytes after
applying padding optimization to tensor model.4.m.0.cv2.conv.weight: strides data is 331776 bytes before, 442368 bytes after
applying padding optimization to tensor model.4.m.1.cv1.conv.weight: strides data is 331776 bytes before, 442368 bytes after
applying padding optimization to tensor model.4.m.1.cv2.conv.weight: strides data is 331776 bytes before, 442368 bytes after
applying padding optimization to tensor model.4.m.2.cv1.conv.weight: strides data is 331776 bytes before, 442368 bytes after
applying padding optimization to tensor model.4.m.2.cv2.conv.weight: strides data is 331776 bytes before, 442368 bytes after
applying padding optimization to tensor model.4.m.3.cv1.conv.weight: strides data is 331776 bytes before, 442368 bytes after
applying padding optimization to tensor model.4.m.3.cv2.conv.weight: strides data is 331776 bytes before, 442368 bytes after
transferring input split for op Split to i64 attribute (initializer data type: I64): [192, 192]
applying padding optimization to tensor model.6.m.0.cv1.conv.weight: strides data is 1327104 bytes before, 1769472 bytes after
applying padding optimization to tensor model.6.m.0.cv2.conv.weight: strides data is 1327104 bytes before, 1769472 bytes after
applying padding optimization to tensor model.6.m.1.cv1.conv.weight: strides data is 1327104 bytes before, 1769472 bytes after
applying padding optimization to tensor model.6.m.1.cv2.conv.weight: strides data is 1327104 bytes before, 1769472 bytes after
applying padding optimization to tensor model.6.m.2.cv1.conv.weight: strides data is 1327104 bytes before, 1769472 bytes after
applying padding optimization to tensor model.6.m.2.cv2.conv.weight: strides data is 1327104 bytes before, 1769472 bytes after
applying padding optimization to tensor model.6.m.3.cv1.conv.weight: strides data is 1327104 bytes before, 1769472 bytes after
applying padding optimization to tensor model.6.m.3.cv2.conv.weight: strides data is 1327104 bytes before, 1769472 bytes after
transferring input split for op Split to i64 attribute (initializer data type: I64): [288, 288]
applying padding optimization to tensor model.8.m.0.cv1.conv.weight: strides data is 2985984 bytes before, 3981312 bytes after
applying padding optimization to tensor model.8.m.0.cv2.conv.weight: strides data is 2985984 bytes before, 3981312 bytes after
applying padding optimization to tensor model.8.m.1.cv1.conv.weight: strides data is 2985984 bytes before, 3981312 bytes after
applying padding optimization to tensor model.8.m.1.cv2.conv.weight: strides data is 2985984 bytes before, 3981312 bytes after

And after that, the error:

panicked at 'internal error: entered unreachable code', wonnx/src/optimizer.rs:95:67

Stack:

Error
    at imports.wbg.__wbg_new_abda76e883ba8a5f (http://localhost:3000/node_modules/@webonnx/wonnx-wasm/wonnx.js?v=a126f01e:481:21)
    at http://localhost:3000/node_modules/@webonnx/wonnx-wasm/wonnx_bg.wasm:wasm-function[1080]:0x14a444
    at http://localhost:3000/node_modules/@webonnx/wonnx-wasm/wonnx_bg.wasm:wasm-function[2887]:0x18e73a
    at http://localhost:3000/node_modules/@webonnx/wonnx-wasm/wonnx_bg.wasm:wasm-function[1666]:0x17502e
    at http://localhost:3000/node_modules/@webonnx/wonnx-wasm/wonnx_bg.wasm:wasm-function[1812]:0x17b84f
    at http://localhost:3000/node_modules/@webonnx/wonnx-wasm/wonnx_bg.wasm:wasm-function[2232]:0x187b4c
    at http://localhost:3000/node_modules/@webonnx/wonnx-wasm/wonnx_bg.wasm:wasm-function[2441]:0x18b798
    at http://localhost:3000/node_modules/@webonnx/wonnx-wasm/wonnx_bg.wasm:wasm-function[2273]:0x188936
    at http://localhost:3000/node_modules/@webonnx/wonnx-wasm/wonnx_bg.wasm:wasm-function[180]:0x1b83e
    at http://localhost:3000/node_modules/@webonnx/wonnx-wasm/wonnx_bg.wasm:wasm-function[189]:0x34dbb
Uncaught (in promise) RuntimeError: unreachable
    at wonnx_bg.wasm:0x175068
    at wonnx_bg.wasm:0x17b84f
    at wonnx_bg.wasm:0x187b4c
    at wonnx_bg.wasm:0x18b798
    at wonnx_bg.wasm:0x188936
    at wonnx_bg.wasm:0x1b83e
    at wonnx_bg.wasm:0x34dbb
    at wonnx_bg.wasm:0xef85f
    at wonnx_bg.wasm:0x3492e
    at wonnx_bg.wasm:0xef85f
pixelspark commented 11 months ago

Good news and bad news:

The above does seem to be a bug in the optimizer: it appears to attempt constant folding on the missing node. I just committed https://github.com/webonnx/wonnx/commit/5d20e966473ad71fcdafba0bf5664a34f07f8a95 to fix that. Unfortunately, I now get a different issue:

RUST_LOG=wonnx=debug RUST_BACKTRACE=1 cargo run --release -- infer ~/Downloads/yolov8m-simplified-2.onnx
[2023-07-18T19:50:08Z DEBUG wonnx::gpu] sequence tensor onnx::Split_180 (outputs readable=false)
[2023-07-18T19:50:08Z WARN  wonnx::gpu] initializers with int64 data type are not supported, converting into int32 initializer
[2023-07-18T19:50:08Z INFO  wonnx::gpu] creating buffer: onnx::Split_180 8b
[2023-07-18T19:50:08Z DEBUG wonnx::gpu] sequence op: /model.2/Split_output_0 (Split) (outputs readable=false)
thread 'main' panicked at 'wgpu error: Validation Error

Caused by:
    In Device::create_bind_group
      note: label = `/model.2/Split_output_0`
    Number of bindings in bind group descriptor (4) does not match the number of bindings defined in the bind group layout (3)

It does appear the split input (number 2) is properly transferred to an attribute:

[2023-07-18T19:55:15Z DEBUG wonnx::optimizer] locally_optimized_node_with NodeIdentifier(0x600001c81b40, "/model.2/Split_output_0") op: /model.2/Split (Split)
[2023-07-18T19:55:15Z INFO  wonnx::optimizer] transferring input split for op Split to i64 attribute (initializer data type: I64): [48, 48]

So for some reason it thinks there should be four buffers in one place but three in another. The generated shader code has three (as expected, since the split input was moved to an attribute earlier):

    @group(0) @binding(0)
    var<storage, read> input_0: Array;

    @group(0) @binding(1)
    var<storage, read_write> output_0: Array;

    @group(0) @binding(2)
    var<storage, read_write> output_1: Array;

Hence, there must still be two inputs in the IR (even after split is moved to an attribute), while only one is ever used by the shader (which expects all other inputs to have been moved to attributes). That extra input accounts for the four bindings in the bind group descriptor versus the three in the layout, which leads to the error.
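
A quick way to check whether the Split nodes in a given file still carry their sizes as a second input (a sketch using the Python onnx package; the path is hypothetical):

    import onnx

    model = onnx.load("simplified.onnx")  # hypothetical path
    for node in model.graph.node:
        if node.op_type == "Split":
            # Since opset 13, the split sizes are a second input rather than
            # a 'split' attribute; two entries here match the behavior above.
            print(node.name, "inputs:", list(node.input))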

This needs some further investigation (I don't have the time for it now) but at least we know where to look.

vitiok123 commented 11 months ago

Cool, good to know. No problem; when it's done, it's done :)

Thanks a lot for your super-fast help and answers.

mersinvald commented 4 months ago

Hi @pixelspark, I've encountered the same error trying to run YOLOv8 via wonnx. Have you had a chance to look into this issue yet?

If you don't have time for that, but could offer some guidance in debugging, that would be very much appreciated too :)

pixelspark commented 4 months ago

I haven't (and frankly don't have the time), unfortunately.

If I were you, I would start by investigating whether your ONNX file also has the issue with the Split operator, and check how many inputs it has. You might be able to rewrite the ONNX file (using the Python onnx package) into something wonnx accepts; a sketch of such a rewrite is below. Another possibility would be to tweak the ONNX opset version (perhaps the issue arises because Split takes different forms depending on the opset version).
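
As an example of such a rewrite, the following moves the split sizes from Split's second input back into a split attribute, the form Split had before opset 13 (a sketch; the filenames are hypothetical and it has not been validated against wonnx):

    import onnx
    from onnx import helper, numpy_helper

    model = onnx.load("simplified.onnx")  # hypothetical input path
    inits = {init.name: init for init in model.graph.initializer}

    for node in model.graph.node:
        if node.op_type == "Split" and len(node.input) == 2 and node.input[1] in inits:
            sizes = numpy_helper.to_array(inits[node.input[1]]).tolist()
            # Re-express the sizes as an attribute (the pre-opset-13 form of Split).
            node.attribute.append(helper.make_attribute("split", sizes))
            del node.input[1]

    onnx.save(model, "rewritten.onnx")    # hypothetical output path

For the opset route, onnx.version_converter.convert_version(model, 12) may achieve the same, though not every operator has a downgrade adapter, so the manual rewrite is probably the safer bet.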