xtensor-stack / xtensor

C++ tensors with broadcasting and lazy computing
BSD 3-Clause "New" or "Revised" License

Tensor View Operations Slower Than Manual Looping #2776

Open SteveMacenski opened 8 months ago

SteveMacenski commented 8 months ago

Hi,

First off, thanks for all the great work on xtensor. It's made it possible to build state-of-the-art Model Predictive algorithms in ROS' Nav2 project using only the CPU, matching or in some cases beating GPU-enabled versions.

Since March 1st I've been on a kick of really narrowing in on our uses of xtensor and every operation, to squeeze the last bits of performance we can out of the system. I've found a couple of interesting results that I wanted to ask the maintainers about, since they seem very counterintuitive.

Manual Copy Loop 8-10x Faster Than View Assignment

There's a point where we have a method that assigns a set of controls to absolute velocities as a pass-through on the system dynamics. If that doesn't grok, I'm basically just copying one tensor into another with a 1-index offset, shown below.

    xt::noalias(xt::view(state.vx, xt::all(), xt::range(1, _))) =
      xt::view(state.cvx, xt::all(), xt::range(0, -1));

    xt::noalias(xt::view(state.wz, xt::all(), xt::range(1, _))) =
      xt::view(state.cwz, xt::all(), xt::range(0, -1));

    if (isHolonomic()) {
      xt::noalias(xt::view(state.vy, xt::all(), xt::range(1, _))) =
        xt::view(state.cvy, xt::all(), xt::range(0, -1));
    }

This operation with a tensor size of {2000, 56} takes about 0.8-1.3ms per iteration. I thought to myself, "that's weird," so I wrote a quick loop doing the same thing, and it takes only 0.15-0.2ms.

    const bool is_holo = isHolonomic();
    for (unsigned int i = 0; i != state.vx.shape(0); i++) {
      for (unsigned int j = 1; j != state.vx.shape(1); j++) {
        state.vx(i, j) = state.cvx(i, j - 1);
        state.wz(i, j) = state.cwz(i, j - 1);
        if (is_holo) {
          state.vy(i, j) = state.cvy(i, j - 1);
        }
      }
    }

I'm not sure what to make of this except that I feel like I must be missing some subtle detail.
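For what it's worth, the manual loop above boils down to a row-offset copy between contiguous buffers. A minimal sketch of that access pattern in plain C++ (no xtensor; `shifted_row_copy` and the flat `std::vector` layout are stand-ins I made up for illustration, not anything from the codebase):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Copy each row of `src` into `dst` shifted right by one column,
// mirroring state.vx(i, j) = state.cvx(i, j - 1) for j in [1, cols).
// Row-major contiguous buffers stand in for the xtensor containers.
void shifted_row_copy(const std::vector<double>& src,
                      std::vector<double>& dst,
                      std::size_t rows, std::size_t cols) {
  for (std::size_t i = 0; i < rows; ++i) {
    const double* s = src.data() + i * cols;  // row i of source
    double* d = dst.data() + i * cols;        // row i of destination
    std::copy(s, s + (cols - 1), d + 1);      // dst(i, 1..) = src(i, 0..cols-2)
  }
}
```

Each row here is one contiguous `std::copy`, which compilers typically lower to a block move. My guess is that the view-assignment path instead walks a pair of strided view iterators element by element, which is much harder for the library or compiler to collapse into a block copy, but I'd defer to the maintainers on what actually happens inside the expression assignment.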

An aside

I've also been running into some interesting results with xt::cumsum, xt::atan2, xt::cos, and xt::sin, where the equivalent loop (or vectorize()) is approximately the same speed as the xtensor function. I don't know if you expect these to be significantly faster (I would have thought so with SIMD), but I was surprised to find them as slow as they are. About ~1.5ms of our total ~6ms control loop is spent just computing xt::cumsum over three {2000, 56} tensors. I would have liked to use it in more places, but its overhead made that impractical. Example:

  xt::noalias(trajectories.x) = state.pose.pose.position.x +
    xt::cumsum(dx * settings_.model_dt, 1);

...

  auto yaws_between_points = xt::atan2(
    goal_y - data.trajectories.y,
    goal_x - data.trajectories.x);
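For context on the cumsum numbers: a cumulative sum along axis 1 is an inherently serial scan per row, so there may be a ceiling on how much SIMD can help regardless of the library. A minimal plain-C++ sketch of the equivalent computation (`cumsum_axis1` and the flat row-major layout are hypothetical illustrations, not xtensor's internals):

```cpp
#include <cstddef>
#include <vector>

// Row-wise cumulative sum over a row-major {rows, cols} buffer,
// equivalent in effect to xt::cumsum(t, 1) on a 2-D tensor.
// Each output element depends on the previous one in the row
// (out(i, j) = out(i, j - 1) + t(i, j)), so the inner loop is a
// serial dependency chain along the scanned axis.
std::vector<double> cumsum_axis1(const std::vector<double>& t,
                                 std::size_t rows, std::size_t cols) {
  std::vector<double> out(t.size());
  for (std::size_t i = 0; i < rows; ++i) {
    double acc = 0.0;
    for (std::size_t j = 0; j < cols; ++j) {
      acc += t[i * cols + j];
      out[i * cols + j] = acc;  // running sum along the row
    }
  }
  return out;
}
```

That serial chain would explain cumsum matching a hand-written loop; it doesn't explain the elementwise atan2/cos/sin cases, where I'd have expected the xsimd-backed batches to pull ahead.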