AmineDiro opened 7 months ago
Hi @AmineDiro!
In WONNX, shape inference and model inference are two completely separate components. Model inference requires all shapes to be inferred (which you can do using the shape inference module).
The `UnimplementedVariant` error means that WONNX isn't able to compile the operation to a shader to calculate the result, because the specific variant of the op's usage has not been implemented. In this case, see this part of compiler.rs. As you can see, the compiler calculates all sorts of template parameters, which it then feeds to the matrix/gemm.wgsl template for the WGSL shader.
Conceptually, to implement broadcasting for MatMul, you need to add the calculation in the shader and invoke it the right way (by setting parameters or switching to a different shader altogether) in compiler.rs. There is some code for broadcasting already (Ctrl-F `multi_broadcast`), but mostly around calculating the effective shapes.
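To make the shape side of this concrete, here is a rough sketch of my own (not WONNX's actual `multi_broadcast` code) of how ONNX multidirectional broadcasting combines two shapes:

```rust
/// Illustrative sketch (not WONNX's implementation): compute the ONNX
/// multidirectional-broadcast shape of two input shapes, or None if the
/// shapes are incompatible.
fn broadcast_shape(a: &[i64], b: &[i64]) -> Option<Vec<i64>> {
    let rank = a.len().max(b.len());
    let mut out = Vec::with_capacity(rank);
    for i in 0..rank {
        // Walk both shapes from the trailing dimension; absent dims count as 1.
        let da = if i < a.len() { a[a.len() - 1 - i] } else { 1 };
        let db = if i < b.len() { b[b.len() - 1 - i] } else { 1 };
        match (da, db) {
            (x, y) if x == y => out.push(x),
            (1, y) => out.push(y),
            (x, 1) => out.push(x),
            _ => return None, // incompatible dimensions
        }
    }
    out.reverse();
    Some(out)
}
```

For example, `broadcast_shape(&[5, 1, 4], &[3, 1])` yields `Some(vec![5, 3, 4])`, while `broadcast_shape(&[2, 3], &[4, 3])` yields `None`.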
To avoid reinventing the wheel, I suggest having a look at other GPU-based frameworks and their implementations (I often peeked at ONNXRuntime to see their test implementations as well).
Feel free to go ahead and implement this, and if you have any questions, let me know!
Hi @pixelspark ,
Thanks a lot for the detailed explanation. I'll start working on this; it's a good excuse to start learning wgpu 😃
Hi there @pixelspark,
I started working on MatMul broadcasting this weekend and encountered an issue with the existing implementation. I am new to wgpu, but I have some CUDA programming background. I added a small test to verify random square matrix multiplication: I used `ndarray` and compared the results of both computations using some tolerance.
```rust
#[test]
fn test_matmul_square_random_matrix() {
    let tol = 1e-3f32;
    let n = 10;
    let _ = env_logger::builder().is_test(true).try_init();

    // Generate random arrays
    let x_data = ndarray::Array2::<f32>::random((n, n), Uniform::new(0f32, 100f32));
    let y_data = ndarray::Array2::<f32>::random((n, n), Uniform::new(0f32, 10f32));
    let expected = x_data.dot(&y_data).as_slice().unwrap().to_owned();

    let mut input_data = HashMap::new();
    input_data.insert("X".to_string(), x_data.as_slice().unwrap().into());
    input_data.insert("Y".to_string(), y_data.as_slice().unwrap().into());

    let n = n as i64;
    let shape = [1, n, n];
    let model = model(graph(
        vec![tensor("X", &shape), tensor("Y", &shape)],
        vec![tensor("Z", &shape)],
        vec![],
        vec![],
        vec![node(vec!["X", "Y"], vec!["Z"], "matmul", "MatMul", vec![])],
    ));

    let session =
        pollster::block_on(wonnx::Session::from_model(model)).expect("Session did not create");
    let result = pollster::block_on(session.run(&input_data)).unwrap();

    match result.get("Z").unwrap() {
        OutputTensor::F32(v) => {
            let diff: Vec<f32> = v.iter().zip(expected.iter()).map(|(l, r)| l - r).collect();
            let diff = diff
                .iter()
                .filter(|e| f32::abs(**e) > tol)
                .collect::<Vec<_>>();
            assert_eq!(diff.len(), 0);
        }
        _ => unreachable!(),
    }
}
```
This test fails for n=10 but passes for n=16, i.e. a vec of 256 elements. As far as I could gather, wgpu caps work group sizes at 256, which suggests the existing shader doesn't produce a correct matrix multiplication for arbitrarily sized square matrices 🤔. Am I missing something? Thanks for your help!
Hm, these are always tough issues to figure out... First of all, you should establish that this is not actually some rounding error (differences are expected, given that wgpu is a cake of abstractions on top of abstractions, so it might depend on your tolerance). That said, the fact that it works with n=16 suggests this has something to do with the chunk sizes (it could even be an off-by-one somewhere).
I am unfortunately not too familiar with the matmul implementation (@haixuanTao wrote it iirc, he might have a clue). I suspect he wrote this for the n=16 case for a specific model and then called it a day (for most ONNX operators there are just so many variations that this is a very reasonable thing to do).
For most of the shaders I implemented, I let the shader coordinates refer to the output coordinate and let it 'gather' all the input data needed (there are probably more efficient ways to do it but this for me was the easiest to reason about). This way the group size should not really matter for the result in any way (I also did not use any complicated stuff like sharing data within work groups or barriers). You could of course try to write something of your own this way for the case where n!=16.
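The 'gather' strategy above can be sketched on the CPU roughly like this (my own illustrative code, not one of the WONNX shaders):

```rust
/// CPU sketch of the 'gather' strategy: each output element (row, col) is
/// computed independently, exactly as a single shader invocation would, so
/// the result does not depend on workgroup size. `a` is m x k and `b` is
/// k x n, both row-major.
fn matmul_gather(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for row in 0..m {
        for col in 0..n {
            // One "invocation": gather a row of `a` and a column of `b`.
            let mut acc = 0.0f32;
            for i in 0..k {
                acc += a[row * k + i] * b[i * n + col];
            }
            out[row * n + col] = acc;
        }
    }
    out
}
```

Since every output element only reads input data and writes its own slot, no shared memory or barriers are needed, at the cost of some efficiency.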
Thanks for your response !
> Hm, these are always tough issues to figure out... First of all, you should establish that this is not actually some rounding error (differences are expected, given that wgpu is a cake of abstractions on top of abstractions, so it might depend on your tolerance). That said, the fact that it works with n=16 suggests this has something to do with the chunk sizes (it could even be an off-by-one somewhere).
100%, I used a pretty high tolerance for testing. Usually, matching floating-point multiplication precision between two pieces of hardware doesn't require this much tolerance.
> For most of the shaders I implemented, I let the shader coordinates refer to the output coordinate and let it 'gather' all the input data needed (there are probably more efficient ways to do it but this for me was the easiest to reason about). This way the group size should not really matter for the result in any way (I also did not use any complicated stuff like sharing data within work groups or barriers). You could of course try to write something of your own this way for the case where n!=16.
I am not sure I follow here. To reformulate: are you suggesting eliminating the choice of group sizes entirely by setting them to a fixed size and having the shader logic access an arbitrarily long input array? Thanks again for your help :)
So, you should probably look at the matmul implementation. Basically, each shader invocation can utilize up to 16 elements as a mat4x4; this is the fastest way to do matmul.
Using an arbitrary matrix whose size is not a multiple of 16 will probably not work, as you're not contiguous.
You can revert to a simpler matmul kernel or try some form of padding.
☺️
**Describe the bug**
Hello, thanks for this amazing project!
I was trying to run the sentence-transformers/all-MiniLM-L6-v2 model and I had the following error: it seems that you can't multiply 1x512x384 @ 384x384 = 1x512x384. I don't know if there is a need to match the same dimensions.
I understand that MatMul doesn't support shape inference, and I don't know if this issue is related to that. I might be doing something wrong when preparing the model, so could you help on this matter?
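For reference, here is a sketch (a helper of my own, not part of WONNX) of the ONNX/NumPy MatMul shape semantics: the last two dimensions multiply as matrices, and any leading (batch) dimensions broadcast, so 1x512x384 @ 384x384 should yield 1x512x384:

```rust
/// Sketch of ONNX MatMul output-shape semantics (2-D-or-more inputs only;
/// the 1-D special cases are omitted). Returns None if the shapes are
/// incompatible.
fn matmul_output_shape(a: &[i64], b: &[i64]) -> Option<Vec<i64>> {
    if a.len() < 2 || b.len() < 2 {
        return None; // 1-D cases not covered in this sketch
    }
    let (m, ka) = (a[a.len() - 2], a[a.len() - 1]);
    let (kb, n) = (b[b.len() - 2], b[b.len() - 1]);
    if ka != kb {
        return None; // inner dimensions must match
    }
    // Broadcast the leading (batch) dimensions, aligned from the right.
    let batch_a = &a[..a.len() - 2];
    let batch_b = &b[..b.len() - 2];
    let rank = batch_a.len().max(batch_b.len());
    let mut out = Vec::with_capacity(rank + 2);
    for i in 0..rank {
        // Missing leading dims count as 1.
        let da = if i >= rank - batch_a.len() { batch_a[i - (rank - batch_a.len())] } else { 1 };
        let db = if i >= rank - batch_b.len() { batch_b[i - (rank - batch_b.len())] } else { 1 };
        match (da, db) {
            (x, y) if x == y => out.push(x),
            (1, y) => out.push(y),
            (x, 1) => out.push(x),
            _ => return None,
        }
    }
    out.push(m);
    out.push(n);
    Some(out)
}
```

With this, `matmul_output_shape(&[1, 512, 384], &[384, 384])` gives `Some(vec![1, 512, 384])`, which is the shape the model expects.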
all-MiniLM-L6-v2 is basically a BERT model, and I saw that BERT ran successfully using wonnx.

**To Reproduce**
- Exported sentence-transformers/all-MiniLM-L6-v2 to ONNX format using the optimum lib
- Simplified the exported model with onnx-simplify
- Ran the following test:

**Expected behavior**
Run session inference

**Desktop (please complete the following information):**
Thanks for your help. I'm also open to contributing if there is a fix to implement, if you point me in the right direction 😄