xcompact3d / x3d2

https://xcompact3d.github.io/x3d2

Reduce the memory usage #37

Closed semi-h closed 7 months ago

semi-h commented 7 months ago

I can't believe I missed this!

In `transeq`, at the very end, we do `du_x = du_x + du_y + du_z`.

However, due to the restrictions on running GPU kernels with specific thread and block dimensions, we carry out this operation in 2 separate calls: `du_x = du_x + du_y` followed by `du_x = du_x + du_z`.

And this gives us an opportunity to move the y2x sum up to just below the `transeq_y` call. Then we can release `du_y`, `dv_y`, `dw_y` right after adding them into their `_x` counterparts (see the sketch below).
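A minimal, self-contained sketch of the idea, with plain array assignments standing in for the actual `transeq_*` kernels (everything below is illustrative, not the real x3d2 code):

```fortran
program reorder_sums_sketch
   ! Hypothetical sketch: fold the y-direction results into the x-direction
   ! fields as soon as they are available, so the y scratch arrays can be
   ! released before the z-direction pass allocates its own.
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: n = 64
   real(dp), allocatable, dimension(:, :, :) :: du_x, dv_x, dw_x
   real(dp), allocatable, dimension(:, :, :) :: du_y, dv_y, dw_y
   real(dp), allocatable, dimension(:, :, :) :: du_z, dv_z, dw_z

   ! x-direction contributions (stand-in for transeq_x)
   allocate (du_x(n, n, n), dv_x(n, n, n), dw_x(n, n, n))
   du_x = 1._dp; dv_x = 1._dp; dw_x = 1._dp

   ! y-direction contributions (stand-in for transeq_y)
   allocate (du_y(n, n, n), dv_y(n, n, n), dw_y(n, n, n))
   du_y = 2._dp; dv_y = 2._dp; dw_y = 2._dp

   ! fold y into x immediately, then release the y scratch fields
   du_x = du_x + du_y; dv_x = dv_x + dv_y; dw_x = dw_x + dw_y
   deallocate (du_y, dv_y, dw_y)

   ! the z scratch fields are only allocated after the y ones are gone,
   ! so fewer fields are live at the peak
   allocate (du_z(n, n, n), dv_z(n, n, n), dw_z(n, n, n))
   du_z = 3._dp; dv_z = 3._dp; dw_z = 3._dp   ! stand-in for transeq_z

   du_x = du_x + du_z; dv_x = dv_x + dv_z; dw_x = dw_x + dw_z
   deallocate (du_z, dv_z, dw_z)

   print *, 'du_x(1,1,1) =', du_x(1, 1, 1)   ! expect 6.0
end program reorder_sums_sketch
```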

This simple fix reduced the memory usage from 18 scalar fields down to 15 (15 GiB for a $512^3$ simulation), without affecting the performance of the CUDA backend at all. This figure excludes the Poisson solver's memory requirement, which is not yet in the codebase.
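For reference, assuming double precision (8 bytes per value), a single scalar field at $512^3$ is

$$512^3 \times 8\ \mathrm{B} = 2^{30}\ \mathrm{B} = 1\ \mathrm{GiB},$$

so 15 fields come to 15 GiB.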

Any further reduction in memory usage beyond this point would increase the runtime of the simulation. For example, we could probably get down to 12 fields, which shouldn't be that hard, but it would require some extra reordering operations, so I think it's better not to work on that at this stage.

Assuming that the FFT-based Poisson solver will require the equivalent of ~4 scalar fields of memory, we should be able to fit a $1024^3$ simulation on a typical 4xA100 node!
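As a rough back-of-the-envelope under the same double-precision assumption: a $1024^3$ scalar field takes $1024^3 \times 8\ \mathrm{B} = 8\ \mathrm{GiB}$, so $15 + 4 = 19$ fields come to $152\ \mathrm{GiB}$ in total, or roughly $38\ \mathrm{GiB}$ per GPU once the domain is split across 4 devices.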

Currently we have separate `sum_yintox`, `sum_zintox`, and `vecadd` subroutines in the backends, all similar to some extent. They can all be combined into a single subroutine, as we did with the reorder subroutines, and that's what I'll do next. I'll create a separate issue to discuss it further and won't include that step in the current PR. I'm happy to merge this one as soon as someone approves.
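Purely as an illustration of the kind of consolidation I have in mind (the name, interface, and semantics below are hypothetical, not a proposal for the actual API), all three routines boil down to a scaled field addition:

```fortran
module vecadd_sketch
   ! Hypothetical sketch only: a single axpby-style entry point that the
   ! sum_yintox, sum_zintox, and vecadd use cases could share. In the real
   ! backends the source field may sit in a y- or z-oriented layout, so a
   ! unified kernel would also have to fold in the reordering, much like
   ! the combined reorder subroutine does.
   implicit none
   integer, parameter :: dp = kind(1.0d0)
contains
   subroutine vecadd(a, x, b, y)
      ! y = a*x + b*y, applied element-wise over a scalar field
      real(dp), intent(in) :: a, b
      real(dp), intent(in) :: x(:, :, :)
      real(dp), intent(inout) :: y(:, :, :)

      y = a*x + b*y
   end subroutine vecadd
end module vecadd_sketch
```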