parallel reduction using tbb

gpipelee commented 4 years ago

I'm trying to parallellize a for loop in which I sum submatrices to a given matrix. That is, I would like to do something like

#include <iostream>
#include <xtensor/xarray.hpp>
#include <xtensor/xview.hpp>
#include <xtensor/xio.hpp>
#include <tbb/parallel_for.h>
#include <tbb/combinable.h>

int main()
{
    std::size_t Nx{ 10 }, Ny{ 12 }, N{ 10000 }; 

    tbb::combinable<xt::xarray<double>> mat_(xt::zeros<double>({ Nx, Ny }));
    tbb::parallel_for(
        tbb::blocked_range<int>(0, N),
        [&](tbb::blocked_range<int> r) {
            for (int i = r.begin(); i < r.end(); ++i)
            {
                int ix_0{ 2 };
                int iy_0{ 4 };
                int ix_1{ 5 };
                int iy_1{ 8 };

                xt::view(mat_.local(), xt::range(ix_0, ix_1), xt::range(iy_0, iy_1)) += 0.1 * xt::ones<double>({ ix_1 - ix_0, iy_1 - iy_0 });
            }
        }
    );
    auto mat = mat_.combine([](const xt::xarray<double>& x, const xt::xarray<double>& y) {return x + y; });

    std::cout << mat;
    return 0;
}

Of course, in my actual code the views and the rhs of the += change over the iterations. The small example above fails to execute, and in my actual code the failure flags an 'heap corruption error'.

I have a workaround, but this requires a new initialization for every kernel:

#include <iostream>
#include <xtensor/xarray.hpp>
#include <xtensor/xview.hpp>
#include <xtensor/xio.hpp>
#include <tbb/parallel_for.h>
#include <tbb/combinable.h>

int main()
{
    std::size_t Nx{ 10 }, Ny{ 12 }, N{ 10000 };

    tbb::combinable<xt::xarray<double>> mat_(xt::zeros<double>({ Nx, Ny }));
    tbb::parallel_for(
        tbb::blocked_range<int>(0, N),
        [&](tbb::blocked_range<int> r) {
            xt::xarray<double> mat_temp = xt::zeros<double>({ Nx, Ny });
            for (int i = r.begin(); i < r.end(); ++i)
            {
                int ix_0{ 2 };
                int iy_0{ 4 };
                int ix_1{ 5 };
                int iy_1{ 8 };

                xt::view(mat_temp, xt::range(ix_0, ix_1), xt::range(iy_0, iy_1)) += 0.1 * xt::ones<double>({ ix_1 - ix_0, iy_1 - iy_0 });
            }
            mat_.local() += mat_temp;
        }
    );
    auto mat = mat_.combine([](const xt::xarray<double>& x, const xt::xarray<double>& y) {return x + y; });

    std::cout << mat;
    return 0;
}

I'm not sure the tbb parallellized assignment of xtensor, i.e. the code below, is doing what I want...

#define XTENSOR_USE_TBB

#include <iostream>
#include <xtensor/xarray.hpp>
#include <xtensor/xview.hpp>
#include <xtensor/xio.hpp>

int main()
{
    std::size_t Nx{ 10 }, Ny{ 12 }, N{ 10000 };

    xt::xarray<double>mat{ xt::zeros<double>({ Nx, Ny }) };
    for (std::size_t i = 0; i < N; ++i)
    {
        int ix_0{ 2 };
        int iy_0{ 4 };
        int ix_1{ 5 };
        int iy_1{ 8 };

        xt::view(mat, xt::range(ix_0, ix_1), xt::range(iy_0, iy_1)) += 0.1 * xt::ones<double>({ ix_1 - ix_0, iy_1 - iy_0 });
    }

    std::cout << mat;
    return 0;
}

Could you perhaps indicate the right way of doing what I want to do?

zhujun98 commented 4 years ago

xtensor is not linked to TBB in xtensorConfig.cmake.in. So users have to find TBB and then link the target to it.

Is it a bug? The documentation https://xtensor.readthedocs.io/en/latest/build-options.html?highlight=XTENSOR_USE_TBB#external-dependencies says one can activate TBB with only set(XTENSOR_USE_TBB 1)

JohanMabille commented 4 years ago

I think we missed the link directive in the xtensorConfig.cmake.in file

zhujun98 commented 4 years ago

There is another problem with find_dependency(TBB) and find_package(TBB) in xtensorConfig.cmake.in. They both can only find the version, but not TBB_LIBRARIES and TBB_INCLUDE_DIRS. xtensor has a customized FindTBB.cmake. I am wondering what's the recommended way to make it work in xtensorConfig.cmake.in

JohanMabille commented 4 years ago

Quoting @bluescarni from the Quantstack gitter channel

Usually for dependencies which do not provide CMake package config files I do ship FindXXX.cmake scripts, but I make sure to namespace them in order to minimise the chance of name collisions with other projects. For instance, if my library foo depends on library bar, I don't ship a Findbar.cmake, but rather a Findfoo_bar.cmake, and within Findfoo_bar.cmake I create an imported target called foo::bar, rather than something like bar::bar or similar.

I find this solution really stylish, we should implement it for TBB.

xtensor-stack / xtensor

parallel reduction using tbb #2016