xtensor-stack / xtensor

C++ tensors with broadcasting and lazy computing
BSD 3-Clause "New" or "Revised" License

Bad performance of views/slicing #2734

Open razorx89 opened 10 months ago

razorx89 commented 10 months ago

Hi,

I am still getting used to the library, but I was able to isolate an unexpected performance hit. I want to update just a subregion of a pre-allocated 1D tensor. Maybe there is a better pattern to achieve the same result?

#include <chrono>
#include <iostream>
#include <xtensor/xrandom.hpp>
#include <xtensor/xtensor.hpp>

double mean_milliseconds_from_total(std::chrono::nanoseconds total,
                                    size_t num_repeats) {
  std::chrono::duration<double, std::milli> total_ms = total;
  return total_ms.count() / (double)num_repeats;
}

int main() {
  size_t num_repeats = 100;
  xt::xtensor<double, 1> a = xt::random::rand<double>({10000000});
  xt::xtensor<double, 1> b = xt::random::rand<double>({10000000});
  xt::xtensor<double, 1> c = xt::zeros<double>({10000000});

  // case 1: full tensor
  auto started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    c = a + b;
  auto finished = std::chrono::high_resolution_clock::now();
  std::cout << "elapsed time: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 2: view of tensor with xt::all()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::all()) = xt::view(a + b, xt::all());
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "elapsed time: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 3: view of tensor with xt::range()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::range(0, c.size())) =
        xt::view(a + b, xt::range(0, c.size()));
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "elapsed time: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;
  return 0;
}

Result:

elapsed time: 8.00238ms
elapsed time: 31.0913ms
elapsed time: 30.9484ms

I understand that introducing views can incur some overhead, but for what is essentially the same task (same memory layout, contiguous memory, full range, step size of one), a roughly 4x slowdown is quite a big hit. Is this expected behavior, or am I doing something wrong?
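
For reference, the actual pattern I care about looks like this sketch (offset and n are hypothetical runtime values, reusing a, b, c from the benchmark above):

// update only the subregion c[offset, offset + n) from the lazy expression a + b
std::size_t offset = 1000, n = 500000;  // hypothetical bounds
xt::view(c, xt::range(offset, offset + n)) =
    xt::view(a + b, xt::range(offset, offset + n));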

Thanks.

Versions:

razorx89 commented 10 months ago

In a bit more detail:

#include <chrono>
#include <iostream>
#include <xtensor/xrandom.hpp>
#include <xtensor/xtensor.hpp>

double mean_milliseconds_from_total(std::chrono::nanoseconds total,
                                    size_t num_repeats) {
  std::chrono::duration<double, std::milli> total_ms = total;
  return total_ms.count() / (double)num_repeats;
}

int main() {
  size_t num_repeats = 100;
  xt::xtensor<double, 1> a = xt::random::rand<double>({10000000});
  xt::xtensor<double, 1> b = xt::random::rand<double>({10000000});
  xt::xtensor<double, 1> c = xt::zeros<double>({10000000});

  // case 1: full tensor
  auto started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    c = a + b;
  auto finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 1:  "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 2a: view of tensor and expression with xt::all()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::all()) = xt::view(a + b, xt::all());
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 2a: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 2b: view of only tensor with xt::all()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::all()) = a + b;
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 2b: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 2c: view of only expression with xt::all()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    c = xt::view(a + b, xt::all());
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 2c: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 3a: view of tensor and expression with xt::range()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::range(0, c.size())) =
        xt::view(a + b, xt::range(0, c.size()));
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 3a: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 3b: view of only tensor with xt::range()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    xt::view(c, xt::range(0, c.size())) = a + b;
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 3b: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;

  // case 3c: view of only expression with xt::range()
  started = std::chrono::high_resolution_clock::now();
  for (size_t i = 0; i < num_repeats; ++i)
    c = xt::view(a + b, xt::range(0, c.size()));
  finished = std::chrono::high_resolution_clock::now();
  std::cout << "case 3c: "
            << mean_milliseconds_from_total(finished - started, num_repeats)
            << "ms" << std::endl;
  return 0;
}

Result:

case 1:  7.82969ms
case 2a: 31.7327ms
case 2b: 9.01278ms
case 2c: 29.9268ms
case 3a: 31.1293ms
case 3b: 8.89544ms
case 3c: 29.6933ms

Accessing the expression through a view seems to be the expensive part.
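
If that holds, a possible workaround (a sketch suggested by cases 2b/3b above, not a confirmed recommendation) is to keep the view on the container side and slice the operands rather than the expression:

// fast path from cases 2b/3b: view only the destination, assign the bare expression
xt::view(c, xt::all()) = a + b;

// for a genuine subregion, slice the operands instead of the a + b expression;
// whether this hits the fast assignment path would still need benchmarking
std::size_t offset = 1000, n = 500000;  // hypothetical bounds
xt::view(c, xt::range(offset, offset + n)) =
    xt::view(a, xt::range(offset, offset + n)) +
    xt::view(b, xt::range(offset, offset + n));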

spectre-ns commented 9 months ago

This is actually a compiler issue. I posted this in the Gitter channel as well. If Intel can make sense of the view objects, then the other implementations should be able to as well. If you're using MSVC, Clang, or GCC that doesn't help much, but it's good to know the optimizations exist.

With MSVC (/O2 /Ob2 /arch:AVX512), USING XSIMD SIMD SIZE: 8:

case 1:  38.3063ms
case 2a: 114.402ms
case 2b: 35.6107ms
case 2c: 92.7817ms
case 3a: 106.946ms
case 3b: 35.6352ms
case 3c: 102.884ms

With the Intel 2023 DPC++/C++ Optimizing Compiler using equivalent flags, USING XSIMD SIMD SIZE: 8:

case 1:  38.7353ms
case 2a: 45.1576ms
case 2b: 40.1831ms
case 2c: 30.8375ms
case 3a: 35.3932ms
case 3b: 37.2431ms
case 3c: 27.0555ms
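
For context, the "USING XSIMD SIMD SIZE" banner reports the xsimd batch width for double; a minimal sketch that prints it (assuming xsimd is available) looks like:

#include <iostream>
#include <xsimd/xsimd.hpp>

int main() {
  // doubles per SIMD register for the compile-time target arch (8 with AVX-512)
  std::cout << "USING XSIMD SIMD SIZE: " << xsimd::batch<double>::size
            << std::endl;
  return 0;
}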

spectre-ns commented 1 month ago

@tdegeus @razorx89 I have done more digging and the root cause of the bad performance is actually here: https://github.com/xtensor-stack/xtensor/blob/8c0a484f04eccd0dbc0e25eb58a97de000fb048b/include/xtensor/xview.hpp#L226

As a result, all views of xfunctions are pessimistically assigned using a stepper rather than a linear or strided assign. When an xfunction is assigned directly to a container, however, the assignment is linear.

This also rules out SIMD acceleration. I don't see any reason why xfunctions that already support linear assignment cannot have linearly assigned views. The has_data_interface attribute is deeply intertwined with the xview interface, though, so it could take a while to figure out.
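
To illustrate why the stepper path is so much slower, here is a simplified sketch of the two strategies (not xtensor's actual code; the stepper struct is a hypothetical stand-in):

#include <cstddef>

// Linear assign: one flat loop over contiguous storage. The compiler sees
// unit-stride pointer arithmetic and can vectorize and unroll freely.
void linear_assign(double* dst, const double* a, const double* b,
                   std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    dst[i] = a[i] + b[i];
}

// Stepper-style assign: every element is reached through an abstract cursor.
// In xtensor the real steppers carry per-dimension stride bookkeeping, and
// that indirection is what hides the contiguity from the vectorizer.
struct stepper {  // hypothetical stand-in for xtensor's steppers
  const double *pa, *pb;
  double value() const { return *pa + *pb; }
  void step() { ++pa; ++pb; }  // real steppers step per dimension
};

void stepper_assign(double* dst, stepper s, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i, s.step())
    dst[i] = s.value();
}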