xtensor-stack / xtensor

C++ tensors with broadcasting and lazy computing
BSD 3-Clause "New" or "Revised" License

np.matmul implementation #1375

Open chongyi-zheng opened 5 years ago

chongyi-zheng commented 5 years ago

Currently, I am using the xtensor library to develop my own project. I searched the documentation for a whole night but failed to find anything like np.matmul. The only available functions are dot and tensordot. Finally, I decided to do it myself. Here is a recursive implementation:


#include <stddef.h>
#include <typeinfo>
#include <stdexcept>

#include <xtensor/xarray.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-blas/xlinalg.hpp>

template <class _Tp>
xt::xarray<_Tp> matmul(xt::xarray<_Tp> a, xt::xarray<_Tp> b) noexcept(false)
{
    if (a.shape().size() != b.shape().size()) {
        throw std::runtime_error("Shape mismatching!");
    }

    if (typeid(_Tp) != typeid(int)
        && typeid(_Tp) != typeid(size_t)
        && typeid(_Tp) != typeid(float)
        && typeid(_Tp) != typeid(double)) {
        throw std::runtime_error("Element type mismatching!");
    }

    typename xt::xarray<_Tp>::shape_type shape = a.shape();
    xt::xarray<_Tp> out(shape);
    //
    // Both arguments are 2-D; end the recursion.
    //
    //      a - (M, M)
    //      b - (M, M)
    // 
    if (shape.size() == 2) {
        // Multiply column by column: out(:, col) = dot(a, b(:, col)).
        xt::xarray<_Tp> vector;
        for (size_t col = 0; col < *(shape.end() - 1); col++) {
            vector = xt::view(b, xt::all(), col);
            xt::view(out, xt::all(), col) = xt::linalg::dot(a, vector);
        }
    }
    //
    // Both arguments are N-D; recurse one layer deeper.
    //
    //
    //      a - (..., M, M)
    //      b - (..., M, M)
    //
    else {
        // Recurse over the leading (batch) dimension.
        xt::xarray<_Tp> aSlice, bSlice;
        for (size_t i = 0; i < shape[0]; i++) {
            aSlice = xt::view(a, i);
            bSlice = xt::view(b, i);
            xt::view(out, i) = matmul(aSlice, bSlice);
        }
    }
    }

    return out;
}

I have tested it like this:


#include <iostream>

#include <xtensor/xarray.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-blas/xlinalg.hpp>
#include <xtensor/xio.hpp>

#include "matmul.hpp"

int main(int argc, char const *argv[])
{
    xt::xarray<double> arr1 {
        {
            {1, 2, 3},
            {4, 5, 6},
            {7, 8, 9}
        },
        {
            {1, 2, 3},
            {4, 5, 6},
            {7, 8, 9}
        }
    };// (2, 3, 3)

    xt::xarray<double> res = matmul(arr1, arr1);// (2, 3, 3)
    std::cout << res << std::endl;

    return 0;
}

Output:


{{{  30.,   36.,   42.},
  {  66.,   81.,   96.},
  { 102.,  126.,  150.}},
 {{  30.,   36.,   42.},
  {  66.,   81.,   96.},
  { 102.,  126.,  150.}}}

I think it is correct, right? But I don't think this is an efficient approach, and the type checking looks ugly. In the numpy documentation of np.matmul I found the list of supported cases (the plain 2-D matrix product, N-D inputs treated as broadcast stacks of matrices, and the 1-D promotion rules).

I have only implemented the first two cases. I wonder if there will be an efficient xt::linalg::matmul in this library, and whether parallel computing such as OpenMP or CUDA will be introduced in the future. Hopefully, this will help.
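
For example (my own illustration of the limitation), the version above rejects mixed-rank inputs that np.matmul would broadcast, because of the shape-size check at the top:


#include <xtensor/xarray.hpp>
#include <xtensor/xbuilder.hpp>

xt::xarray<double> batch = xt::ones<double>({2, 3, 3});  // stack of two 3x3 matrices
xt::xarray<double> mat   = xt::ones<double>({3, 3});     // single 3x3 matrix

// matmul(batch, mat) throws "Shape mismatching!" here, while np.matmul
// would broadcast mat across the leading batch dimension.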

wolfv commented 5 years ago

Hi, thanks for tackling this! I think the implementation should also live in xtensor-blas for consistency. The idea with the views is good, I think. Maybe we can use the cxxblas::gemm functions directly in xtensor-blas. Did you look at the implementation of linalg::dot? In general, dot and matmul are quite similar, I think.

Regarding OpenMP, we've started to do some (very little) work to parallelize certain mathematical operations using Intel TBB. Broadcasted matmul could be trivially parallelized with TBB as well.
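
Something along these lines would work (a rough sketch only, not what we would ship; it assumes 3-D inputs with a shared leading batch dimension and a made-up function name):


#include <cstddef>

#include <tbb/parallel_for.h>
#include <xtensor/xarray.hpp>
#include <xtensor/xbuilder.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-blas/xlinalg.hpp>

template <class T>
xt::xarray<T> parallel_batched_matmul(const xt::xarray<T>& a, const xt::xarray<T>& b)
{
    // a is (batch, M, K), b is (batch, K, N); each 2-D product is independent,
    // so the batch dimension can be distributed over TBB worker threads.
    xt::xarray<T> out = xt::zeros<T>({a.shape()[0], a.shape()[1], b.shape()[2]});
    tbb::parallel_for(std::size_t(0), a.shape()[0], [&](std::size_t i) {
        xt::view(out, i) = xt::linalg::dot(xt::view(a, i), xt::view(b, i));
    });
    return out;
}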

CUDA (or other GPU support) is on the roadmap for '19 as well.

chongyi-zheng commented 5 years ago

Oh, thanks for the quick reply! Actually, I haven't read the code of linalg::dot yet; it's a good idea to check that implementation out. This is the first time I have worked with tensors in C++, and I have not used BLAS or LAPACK before. Maybe I should use cxxblas::gemm to write it from scratch.

So, did you mean that there will be a matmul in a coming version of this library?

By the way, I am using this toolkit to build a real-time 3D modeling system, which means efficiency counts a lot. Hopefully, when the work is done, I will get feedback at around 15 fps.

wolfv commented 5 years ago

Cool! Yes, we take performance very seriously, too. If you find performance-related bugs, let us know! And if your library is open source, we can also take a look at your code and give some recommendations.

We definitely want to have matmul in xtensor-blas. I could start from your initial work here, or you could make a PR for this and I can guide you on how to do it (e.g. using "raw" gemm from BLAS). It should be quite straightforward.

Are you interested in contributing this?

chongyi-zheng commented 5 years ago

That's great to hear about a future plan for xt::linalg::matmul, and it's a pleasure to contribute my work to this powerful library. But is it possible to leave it until my current work gets done? My project will become open source once I have a basic architecture in place.

Another problem I came across yesterday is that I could not use xt::xtuple dynamically. What I mean is that when I want to call xt::stack, I have to construct the xt::xtuple inside the calling line, like xt::stack(xt::xtuple(arr1, arr2, arr3), 0). However, if I want to gather a bunch of xt::xarrays into an xtuple one by one in a for-loop before calling xt::stack, it just doesn't work. I have tried xt::concatenate, but it's obviously less efficient. I have also used std::tuple_cat in the for-loop to concatenate them together, but the return type of std::tuple_cat seems incompatible with the xtuple argument expected by xt::stack, and I just got a compile error in the end. I am trying other approaches to figure it out.
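
The only workaround I can think of so far (just a sketch with a made-up helper name, assuming all arrays share the same shape) is to allocate the stacked result up front and fill it slice by slice instead of building a tuple at runtime:


#include <cstddef>
#include <vector>

#include <xtensor/xarray.hpp>
#include <xtensor/xview.hpp>

// Stack a runtime-sized list of equally-shaped arrays along a new first axis.
xt::xarray<double> stack_all(const std::vector<xt::xarray<double>>& arrays)
{
    auto shape = arrays.front().shape();        // copy of the common shape
    shape.insert(shape.begin(), arrays.size()); // prepend the new stacking axis

    xt::xarray<double> stacked(shape);
    for (std::size_t i = 0; i < arrays.size(); ++i) {
        xt::view(stacked, i) = arrays[i];       // stacked[i] = arrays[i]
    }
    return stacked;
}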

Maybe I should open another issue for this with a detailed description of the problem?

Thanks for the reply again!

JohanMabille commented 5 years ago

@YeeCY sorry for the late reply. Regarding the implementation of matmul, we don't have the bandwidth to implement it now, so it can definitely wait for your current work to get done ;)

Regarding the issue with the tuple, I agree that a dedicated issue would be more appropriate so we avoid mixing discussions.

chongyi-zheng commented 5 years ago

Oh, thanks for inviting me to join you guys. I have almost got my work done! Actually, the only thing left is the documentation, and that's not a big problem.

I am happy to start contributing to this powerful library. To deal with multiplication involving N-D vectors and N-D matrices, another, more complete version of matmul has been tested in my code. I think it will be helpful once added.

By the way, if there are any new instructions, feel free to send me an email.

tort-dla-psa commented 4 years ago

Is there any progress on that?

wolfv commented 4 years ago

@tort-dla-psa we still have dot in xtensor-blas, which does pretty much the same thing as matmul.
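
For reference, the plain 2-D case is just (standard usage of the existing function):


#include <xtensor/xarray.hpp>
#include <xtensor-blas/xlinalg.hpp>

xt::xarray<double> a {{1., 2.}, {3., 4.}};
xt::xarray<double> b {{5., 6.}, {7., 8.}};

// For 2-D inputs, dot is the ordinary matrix product, matching np.matmul.
xt::xarray<double> c = xt::linalg::dot(a, b);  // {{19., 22.}, {43., 50.}}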

tort-dla-psa commented 4 years ago

@wolfv but is it parallelized?

wolfv commented 4 years ago

Only as far as the BLAS implementation you are using parallelizes the GEMM routine. One could do higher-level parallelization for the case where dot is broadcasting. Check out the code in xtensor-blas, it's fairly simple :)

zhiayang commented 4 years ago

Neither xtensor nor xtensor-blas has implemented matmul as described here, so I implemented my own version that properly handles broadcasting; e.g. you can perform (2, 4, 3) x (3, 5) as well as (5, 2) x (2, 4, 6) multiplications.

I have neither a strong math background nor a BLAS background, so I don't know how correct this is; it looks mathematically sound from my small tests, but I'm sure it can be made more efficient.

Also, is there a way to template-ise the input arguments so they can accept xexpressions as well? A brief attempt of mine led to too many template errors, so I gave up on that front.

#include <cassert>
#include <utility>

#include <xtensor/xarray.hpp>
#include <xtensor/xview.hpp>
#include <xtensor/xstrided_view.hpp>
#include <xtensor-blas/xlinalg.hpp>

template <class _Tp>
xt::xarray<_Tp> matrix_mul(xt::xarray<_Tp> a, xt::xarray<_Tp> b)
{
    using Arr = xt::xarray<_Tp>;

    if(a.dimension() == 1 && b.dimension() == 1)
    {
        return xt::linalg::outer(a, b);
    }
    else if(a.dimension() <= 2 && b.dimension() <= 2)
    {
        return xt::linalg::dot(a, b);
    }
    else
    {
        if(a.dimension() == b.dimension())
        {
            assert(a.shape()[0] == b.shape()[0]);
            size_t layers = a.shape()[0];

            Arr tmp;
            {
                Arr a0 = xt::view(a, 0);
                Arr b0 = xt::view(b, 0);
                tmp = matrix_mul(std::move(a0), std::move(b0));
            }

            auto out_shape = tmp.shape();
            out_shape.insert(out_shape.begin(), layers);

            auto result = Arr::from_shape(out_shape);
            xt::view(result, 0) = tmp;

            for(size_t i = 1; i < layers; i++)
            {
                Arr ai = xt::view(a, i);
                Arr bi = xt::view(b, i);
                xt::view(result, i) = matrix_mul(std::move(ai), std::move(bi));
            }

            return result;
        }
        else if(a.dimension() > b.dimension())
        {
            size_t layers = a.shape()[0];

            Arr tmp;
            {
                Arr a0 = xt::view(a, 0);
                tmp = matrix_mul(std::move(a0), b);
            }

            auto out_shape = tmp.shape();
            out_shape.insert(out_shape.begin(), layers);

            auto result = Arr::from_shape(out_shape);
            xt::view(result, 0) = std::move(tmp);

            for(size_t i = 1; i < layers; i++)
            {
                Arr ai = xt::view(a, i);
                xt::view(result, i) = matrix_mul(std::move(ai), b);
            }

            return result;
        }
        else
        {
            assert(a.dimension() < b.dimension());
            size_t layers = b.shape().back();

            Arr tmp;
            {
                Arr b0 = xt::strided_view(b, { xt::ellipsis(), 0 });
                tmp = matrix_mul(a, std::move(b0));
            }

            auto out_shape = tmp.shape();
            out_shape.insert(out_shape.end(), layers);

            auto result = Arr::from_shape(out_shape);
            xt::strided_view(result, { xt::ellipsis(), 0 }) = std::move(tmp);

            for(size_t i = 1; i < layers; i++)
            {
                Arr bi = xt::strided_view(b, { xt::ellipsis(), i });
                xt::strided_view(result, { xt::ellipsis(), i }) = matrix_mul(a, std::move(bi));
            }

            return result;
        }
    }
}

tort-dla-psa commented 4 years ago

@zhiayang broadcasting? but xt::linalg::dot already does that, AFAIK.

Also, is there a way to template-ise the input arguments so they can accept xexpressions as well?

There is, read up on "perfect forwarding", "duck typing" and "std::forward". Pretty nice feature. You may also need to guard your parameters with "std::is_base_of".
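
Something along these lines (an untested sketch that just forwards to an xarray-based implementation like yours above; the std::decay_t is needed because forwarding references deduce reference types):


#include <type_traits>
#include <utility>

#include <xtensor/xarray.hpp>
#include <xtensor/xexpression.hpp>

// Accept only types that derive from xt::xexpression (the CRTP base of all
// xtensor expressions), then evaluate and dispatch.
template <class E1, class E2,
          class = std::enable_if_t<
              std::is_base_of<xt::xexpression<std::decay_t<E1>>, std::decay_t<E1>>::value &&
              std::is_base_of<xt::xexpression<std::decay_t<E2>>, std::decay_t<E2>>::value>>
auto matrix_mul(E1&& a, E2&& b)
{
    // Evaluate the (possibly lazy) expressions into concrete arrays and hand
    // them to the xarray-based matrix_mul overload above.
    using T = std::common_type_t<typename std::decay_t<E1>::value_type,
                                 typename std::decay_t<E2>::value_type>;
    xt::xarray<T> a_eval = std::forward<E1>(a);
    xt::xarray<T> b_eval = std::forward<E2>(b);
    return matrix_mul(std::move(a_eval), std::move(b_eval));
}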

zhiayang commented 4 years ago

Right, I just looked into the xt::linalg::dot implementation and it does work properly for the (a, b, c) x (c, d) case, giving (a, b, d). It doesn't do the (a, b) x (b, c, d) -> (a, c, d) case (is that even a useful thing? I don't know) nor the (a, b, c) x (a, c, d) -> (a, b, d) case -- or at least the output is different from that of numpy.

As for the template thing, this is what I came up with after digging a bit into the sources:

template <typename At, typename Bt, typename R = std::common_type_t<typename At::value_type, typename Bt::value_type>>
xt::xarray<R> matrix_mul(const xt::xexpression<At>& aexp, const xt::xexpression<Bt>& bexp)
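// A possible body for this signature (just a sketch; eagerly evaluating into
// xarray first is my assumption, not necessarily the best approach):
{
    // derived_cast() unwraps the CRTP xexpression wrapper into the concrete
    // expression type; assigning into xt::xarray<R> evaluates it eagerly.
    xt::xarray<R> a = aexp.derived_cast();
    xt::xarray<R> b = bexp.derived_cast();

    // Reuse the xarray-based matrix_mul above on the evaluated operands.
    return matrix_mul(std::move(a), std::move(b));
}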

more importantly, is the implementation mathematically/logically sound?

faze-geek commented 2 months ago

Hi, what is the progress here? I want to use something like xt::matmul() for matrix multiplication. A requirement of my project is to be extremely lightweight, so I wouldn't want to sacrifice that much disk space (xtensor-blas depends on xflens, which is about 15 MB) if xt::linalg::dot() gives the correct answer only in some cases.

faze-geek commented 2 months ago

Hi, by using some snippets from here and following the behavior of np.matmul, I have made a custom implementation using pure xtensor.

You can have a look here - https://gist.github.com/faze-geek/014525b78fae02de526a001cff026e96 Do let me know your thoughts.

Ivorforce commented 2 weeks ago

Batched matrix multiplication is pretty easy to achieve with the normal API, even without xtensor-blas:

auto a_broadcast = xt::view(a, xt::ellipsis(), xt::newaxis());
auto b_broadcast = xt::view(b, xt::ellipsis(), xt::newaxis(), xt::all(), xt::all());
auto multiplied = xt::sum(a_broadcast * b_broadcast, { -2 });

We can confirm this is equivalent to matrix multiplication, e.g. in numpy:

# float[batch..., rows, cols]
a = np.random.random((3, 2, 5, 6, 7))
# float[batch..., rows, cols]
b = np.random.random((3, 2, 5, 7, 8))

np.allclose(np.sum(a[..., :, :, None] * b[..., None, :, :], axis=-2), (a @ b))

The main downside of this implementation is that it produces intermediate values, i.e. if you want to make a function out of it, it has to return the result as an xarray. But perhaps this insight helps someone else.
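
For example, a function built on this idea might look like the following (a sketch with an assumed name and a fixed double element type):


#include <xtensor/xarray.hpp>
#include <xtensor/xmath.hpp>
#include <xtensor/xview.hpp>

xt::xarray<double> batched_matmul(const xt::xarray<double>& a,
                                  const xt::xarray<double>& b)
{
    // (..., M, K) -> (..., M, K, 1)
    auto a_broadcast = xt::view(a, xt::ellipsis(), xt::newaxis());
    // (..., K, N) -> (..., 1, K, N)
    auto b_broadcast = xt::view(b, xt::ellipsis(), xt::newaxis(), xt::all(), xt::all());
    // The product broadcasts to (..., M, K, N); summing over K (axis -2)
    // gives the matrix product (..., M, N). Returning xarray forces evaluation.
    return xt::sum(a_broadcast * b_broadcast, {-2});
}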