chongyi-zheng opened this issue 5 years ago
Hi, thanks for tackling this!
I think the implementation should also live in xtensor-blas for consistency. The idea with the views is good, I think. Maybe we can use the cxxblas::gemm functions directly in xtensor-blas.
Did you look at the implementation of linalg::dot? In general, dot and matmul are quite similar, I think.
Regarding OpenMP, we've started to do some (very little) work to parallelize certain mathematical operations using Intel TBB. Broadcasted matmul could be trivially parallelized with TBB as well.
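For illustration, a minimal sketch of such a batched parallelization, assuming a leading batch axis on both operands and using xt::linalg::dot for the per-slice product (the function name is illustrative, not an existing xtensor-blas API):

#include <cstddef>
#include <vector>
#include <tbb/parallel_for.h>

#include <xtensor/xarray.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-blas/xlinalg.hpp>

// Multiplies a (batch, m, k) stack of matrices by a (batch, k, n) stack, one TBB task per batch entry.
template <class T>
xt::xarray<T> batched_matmul_tbb(const xt::xarray<T>& a, const xt::xarray<T>& b)
{
    const std::size_t batch = a.shape()[0];
    std::vector<std::size_t> out_shape = {batch, a.shape()[1], b.shape()[2]};
    auto result = xt::xarray<T>::from_shape(out_shape);
    tbb::parallel_for(std::size_t(0), batch, [&](std::size_t i)
    {
        // each slice is an independent 2-D GEMM, writing into a disjoint part of result
        xt::xarray<T> ai = xt::view(a, i);
        xt::xarray<T> bi = xt::view(b, i);
        xt::view(result, i) = xt::linalg::dot(ai, bi);
    });
    return result;
}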
CUDA (or other GPU support) is on the roadmap for '19 as well.
Oh, thanks for the quick reply!
Actually, I haven't read the code of linalg::dot yet. It's a good idea to check that implementation out. Since this is the first time I am doing something with tensors in C++, I have not used BLAS or LAPACK before. Maybe I should use cxxblas::gemm to write it from scratch.
So, did you mean that there will be a matmul in a coming version of this library?
By the way, I am using this toolkit to code a real-time 3D modeling system, which means efficiency counts a lot. Hopefully, when the work is done, I will get something like 15 fps feedback.
Cool! Yes, we take performance very seriously, too. If you find performance-related bugs, let us know! And if your library is open source, we can also take a look at your code to give some recommendations.
We definitely want to have matmul in xtensor-blas. I could start from your initial work here, or you could make a PR for this and I can guide you on how to do it (e.g. using "raw" gemm from BLAS). It should be quite straightforward.
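For reference, this is roughly what the "raw" gemm route looks like. xtensor-blas goes through FLENS/cxxblas internally, but the mapping is the same as the classic CBLAS call below; a sketch for the 2-D row-major case, with illustrative names (alpha/beta are the usual GEMM scaling factors):

#include <cblas.h>
#include <vector>

// C (m x n) = A (m x k) * B (k x n), all stored row-major in flat vectors.
void matmul_2d(const std::vector<double>& A, const std::vector<double>& B,
               std::vector<double>& C, int m, int k, int n)
{
    // alpha = 1, beta = 0: plain product, no accumulation into the existing contents of C
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, A.data(), k,
                B.data(), n,
                0.0, C.data(), n);
}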
Are you interested in contributing this?
That's great to hear about a future plan for xt::linalg::matmul. And it's a pleasure to contribute my work to this powerful library. But is it possible to leave it until my current work gets done? Sure enough, my project will become open source once I have built a basic architecture.
Another problem I came across yesterday is that I cannot use xt::xtuple dynamically.
What I mean is that when I want to call xt::stack, I have to construct the xt::xtuple inside the calling line, like xt::stack(xt::xtuple(arr1, arr2, arr3), 0). However, if I want to gather a bunch of xt::xarrays into an xtuple one by one in a for-loop before calling xt::stack, it just doesn't work.
I have come up with xt::concatenate, but it's obviously less efficient. I have also used std::tuple_cat in the for-loop to concatenate them together, but the return type of std::tuple_cat seems incompatible with the xt::xtuple argument of xt::stack; I just got a compile error in the end. I am trying other approaches to figure it out.
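For illustration, a minimal sketch of the kind of dynamic stacking I am trying to achieve, preallocating the output instead of building an xt::xtuple (the helper name is made up, and all arrays are assumed to share a shape, as xt::stack would require):

#include <cstddef>
#include <vector>

#include <xtensor/xarray.hpp>
#include <xtensor/xview.hpp>

xt::xarray<double> stack_dynamic(const std::vector<xt::xarray<double>>& arrays)
{
    // new leading axis of length arrays.size(), followed by the common shape of each array
    const auto& shape = arrays.front().shape();
    std::vector<std::size_t> out_shape(shape.begin(), shape.end());
    out_shape.insert(out_shape.begin(), arrays.size());

    auto result = xt::xarray<double>::from_shape(out_shape);
    for (std::size_t i = 0; i < arrays.size(); ++i)
    {
        xt::view(result, i) = arrays[i];
    }
    return result;
}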
Maybe I should open another issue for this and add a detailed description of the problem?
Thanks for the reply again!
@YeeCY sorry for the late reply. Regarding the implementation of matmul, we don't have the bandwidth to implement it now, so it can definitely wait for your current work to get done ;)
Regarding the issue with the tuple, I agree that a dedicated issue would be more appropriate so we avoid mixing discussions.
Oh, thanks for inviting me to join you guys. I have almost got my work done! Actually, the only thing left is the documentation; that's not a big problem.
It's fine for me to start contributing to this powerful library. To deal with multiplication involving N-D vectors and N-D matrices, another, complete version of matmul has been tested in my code. I think it will be helpful once added.
By the way, if there are any new instructions, feel free to send me an email.
Is there any progress on that?
@tort-dla-psa we still have dot in xtensor-blas, which does pretty much the same as matmul.
@wolfv but is it parallelized?
Only as far as the BLAS implementation you are using parallelizes the GEMM call. One could do higher-level parallelization for the case where dot is broadcasting. Check out the code in xtensor-blas; it's fairly simple :)
Neither xtensor nor xtensor-blas has implemented matmul as described here, so I implemented my own version that properly handles broadcasting; e.g. you can perform (2, 4, 3) x (3, 5) as well as (5, 2) x (2, 4, 6) multiplications.
I have neither a strong math background nor a BLAS background, so idk how correct this is; it looks mathematically sound from my small tests, but I'm sure it can be made more efficient.
Also, is there a way to template-ise the input arguments so they can accept xexpressions as well? A brief attempt of mine led to too many template errors, so I gave up on that front.
#include <cassert>
#include <cstddef>

#include <xtensor/xarray.hpp>
#include <xtensor/xstrided_view.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-blas/xlinalg.hpp>

template <class _Tp>
xt::xarray<_Tp> matrix_mul(xt::xarray<_Tp> a, xt::xarray<_Tp> b)
{
    using Arr = xt::xarray<_Tp>;
    if(a.dimension() == 1 && b.dimension() == 1)
    {
        // two 1-D vectors: outer product
        return xt::linalg::outer(a, b);
    }
    else if(a.dimension() <= 2 && b.dimension() <= 2)
    {
        // at most 2-D on both sides: plain dot handles it
        return xt::linalg::dot(a, b);
    }
    else
    {
        if(a.dimension() == b.dimension())
        {
            // same rank: recurse over the shared leading (batch) axis
            assert(a.shape()[0] == b.shape()[0]);
            size_t layers = a.shape()[0];
            Arr tmp;
            {
                Arr a0 = xt::view(a, 0);
                Arr b0 = xt::view(b, 0);
                tmp = matrix_mul(std::move(a0), std::move(b0));
            }
            // prepend the batch axis to the shape of a single slice's result
            auto out_shape = tmp.shape();
            out_shape.insert(out_shape.begin(), layers);
            auto result = Arr::from_shape(out_shape);
            xt::view(result, 0) = tmp;
            for(size_t i = 1; i < layers; i++)
            {
                Arr ai = xt::view(a, i);
                Arr bi = xt::view(b, i);
                xt::view(result, i) = matrix_mul(std::move(ai), std::move(bi));
            }
            return result;
        }
        else if(a.dimension() > b.dimension())
        {
            // a has extra leading axes: recurse over a's first axis, reusing b for every slice
            size_t layers = a.shape()[0];
            Arr tmp;
            {
                Arr a0 = xt::view(a, 0);
                tmp = matrix_mul(std::move(a0), b);
            }
            auto out_shape = tmp.shape();
            out_shape.insert(out_shape.begin(), layers);
            auto result = Arr::from_shape(out_shape);
            xt::view(result, 0) = std::move(tmp);
            for(size_t i = 1; i < layers; i++)
            {
                Arr ai = xt::view(a, i);
                xt::view(result, i) = matrix_mul(std::move(ai), b);
            }
            return result;
        }
        else
        {
            // b has extra axes: recurse over b's last axis and append that axis to the result
            assert(a.dimension() < b.dimension());
            size_t layers = b.shape().back();
            Arr tmp;
            {
                Arr b0 = xt::strided_view(b, { xt::ellipsis(), 0 });
                tmp = matrix_mul(a, std::move(b0));
            }
            auto out_shape = tmp.shape();
            out_shape.insert(out_shape.end(), layers);
            auto result = Arr::from_shape(out_shape);
            xt::strided_view(result, { xt::ellipsis(), 0 }) = std::move(tmp);
            for(size_t i = 1; i < layers; i++)
            {
                Arr bi = xt::strided_view(b, { xt::ellipsis(), i });
                xt::strided_view(result, { xt::ellipsis(), i }) = matrix_mul(a, std::move(bi));
            }
            return result;
        }
    }
}
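A small usage example for the shapes mentioned above (assuming matrix_mul is compiled together with a main like this):

#include <iostream>
#include <xtensor/xarray.hpp>
#include <xtensor/xbuilder.hpp>

int main()
{
    // (2, 4, 3) x (3, 5) should broadcast to a (2, 4, 5) result
    xt::xarray<double> a = xt::ones<double>({2, 4, 3});
    xt::xarray<double> b = xt::ones<double>({3, 5});
    xt::xarray<double> c = matrix_mul(a, b);
    for (auto s : c.shape())
    {
        std::cout << s << ' ';   // prints: 2 4 5
    }
    std::cout << std::endl;
}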
@zhiayang broadcasting? but xt::linalg::dot already does that, AFAIK.
Also, is there a way to template-ise the input arguments so they can accept xexpressions as well?
There is; look up "perfect forwarding", "duck typing" and std::forward. Pretty nice feature. Also, you may need to guard your parameters with std::is_base_of.
Right, I just looked into the xt::linalg::dot implementation and it does work properly for the (a, b, c) x (c, d) case, giving (a, b, d). It doesn't do the (a, b) x (b, c, d) -> (a, c, d) case (is that even a useful thing? idk) nor the (a, b, c) x (a, c, d) -> (a, b, d) case -- or at least the output is different from that of numpy.
As for the template thing, this is what I came up with, after digging a bit into the sources:
template <typename At, typename Bt, typename R = std::common_type_t<typename At::value_type, typename Bt::value_type>>
xarray<R> matrix_mul(const xexpression<At>& aexp, const xexpression<Bt>& bexp)
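For completeness, a hedged sketch of how the body could look: derived_cast() recovers the concrete expression, and assigning it to an xarray evaluates it, so the call can then forward to the xarray-based matrix_mul above (this assumes both overloads are visible in the same translation unit):

#include <type_traits>
#include <utility>

#include <xtensor/xarray.hpp>
#include <xtensor/xexpression.hpp>

template <typename At, typename Bt,
          typename R = std::common_type_t<typename At::value_type, typename Bt::value_type>>
xt::xarray<R> matrix_mul(const xt::xexpression<At>& aexp, const xt::xexpression<Bt>& bexp)
{
    // evaluate both expressions into concrete arrays of the common value type
    xt::xarray<R> a = aexp.derived_cast();
    xt::xarray<R> b = bexp.derived_cast();
    // overload resolution prefers the exact xarray overload here, so this does not recurse into itself
    return matrix_mul(std::move(a), std::move(b));
}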
more importantly, is the implementation mathematically/logically sound?
Hi, what is the progress here? I want to use something like xt::matmul() for matrix multiplication. A requirement of my project is to be extremely lightweight, so I wouldn't want to sacrifice that much disk space (xtensor-blas depends on xflens, which is about 15 MB) if xt::linalg::dot() gives the correct answer only in some cases.
Hi, by using some snippets from here and following this behavior of np.matmul, I have made a custom implementation using pure xtensor.
You can have a look here - https://gist.github.com/faze-geek/014525b78fae02de526a001cff026e96 Do let me know your thoughts.
Batched matrix multiplication is pretty easy to achieve with the normal API, even without xtensor-blas:
// a: (batch..., rows, inner) -> (batch..., rows, inner, 1)
auto a_broadcast = xt::view(a, xt::ellipsis(), xt::newaxis());
// b: (batch..., inner, cols) -> (batch..., 1, inner, cols)
auto b_broadcast = xt::view(b, xt::ellipsis(), xt::newaxis(), xt::all(), xt::all());
// the element-wise product broadcasts to (batch..., rows, inner, cols); summing axis -2 removes the inner axis
auto multiplied = xt::sum(a_broadcast * b_broadcast, { -2 });
We can confirm this is equivalent to matrix multiplication, e.g. in numpy:
# float[batch..., rows, cols]
a = np.random.random((3, 2, 5, 6, 7))
# float[batch..., rows, cols]
b = np.random.random((3, 2, 5, 7, 8))
np.allclose(np.sum(a[..., :, :, None] * b[..., None, :, :], axis=-2), (a @ b))
The main downside to this implementation is that it produces intermediate values, i.e. if you want to make a function out of it, it has to return the result as an xarray. But perhaps this insight helps someone else.
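For example, a minimal sketch of such a wrapper, mirroring the snippet above (the function name is made up; forcing the return type to xarray is what triggers the evaluation):

#include <xtensor/xarray.hpp>
#include <xtensor/xmath.hpp>
#include <xtensor/xview.hpp>

// a: (batch..., rows, inner), b: (batch..., inner, cols) -> (batch..., rows, cols)
template <class T>
xt::xarray<T> batched_matmul(const xt::xarray<T>& a, const xt::xarray<T>& b)
{
    auto a_broadcast = xt::view(a, xt::ellipsis(), xt::newaxis());
    auto b_broadcast = xt::view(b, xt::ellipsis(), xt::newaxis(), xt::all(), xt::all());
    // the lazy sum expression is evaluated into the xarray return value before the views go out of scope
    return xt::sum(a_broadcast * b_broadcast, { -2 });
}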
Currently, I am using the xtensor library to develop my own project. I have searched the documentation for a whole night, but just failed to find something like np.matmul. The only available functions are dot and tensordot right now. Finally, I decided to do it by myself. Here is a recursive implementation:

I have tested it like this:

Output:

I think it is correct, right? But I don't think it is an efficient way, and the type checking seems ugly. In the numpy documentation of np.matmul I found:

I have only implemented the first two cases. I wonder if there will be an efficient xt::linalg::matmul in this library. And will parallel computing like OpenMP and CUDA be introduced in the future? Hopefully, this will help.