Reuse result field in transeq_component as a temporary field

In transeq we call the _x, _y and _z components separately, and each time we pass a field to store the result as well as 3 temporary fields that the distributed algorithm requires. Because the result is assembled only at the very last step, we can pass just 2 temporary fields and use the result field as the third temporary storage. This saves one field's worth of memory, bringing the total down to 14 (see #37).

It won't make any difference to performance in the CUDA backend, but OpenMP backend performance should improve slightly. On the OpenMP backend, reusing the result as a temporary array removes one unnecessary field-sized read, reducing the total from 14 to 13 per transeq_component call and saving about 7% (1/14) of memory bandwidth. I expect an equivalent speedup in this case.

https://github.com/xcompact3d/x3d2/blob/26ccadd4b033a99a63072ab47518bd092060978f/src/cuda/backend.f90#L259-L263 The result field is passed on line 260 and the temporary fields on line 263.

https://github.com/xcompact3d/x3d2/blob/26ccadd4b033a99a63072ab47518bd092060978f/src/cuda/exec_dist.f90#L112-L113 And here, on line 113, we would pass only r_u, v, dud and d2u, because du will be stored in r_u.
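
To illustrate the idea, here is a minimal Python sketch (not the actual Fortran). The helpers `deriv1` and `deriv2`, the viscosity `nu` and the final assembly formula are assumptions made up for this example and are not taken from x3d2. The only point is that du can live in r_u until the last step without changing the result.

```python
# Hypothetical stand-ins for the real derivative kernels: periodic central
# differences on a 1D list, NOT the compact schemes used in x3d2.
def deriv1(out, f, h=1.0):
    n = len(f)
    for i in range(n):
        out[i] = (f[(i + 1) % n] - f[i - 1]) / (2.0 * h)


def deriv2(out, f, h=1.0):
    n = len(f)
    for i in range(n):
        out[i] = (f[(i + 1) % n] - 2.0 * f[i] + f[i - 1]) / (h * h)


def transeq_three_temps(r_u, u, v, nu):
    """Current layout: result field plus 3 temporaries (du, dud, d2u)."""
    n = len(u)
    du, dud, d2u = [0.0] * n, [0.0] * n, [0.0] * n
    deriv1(du, u)
    deriv1(dud, [a * b for a, b in zip(u, v)])
    deriv2(d2u, u)
    # Assumed assembly formula, for illustration only.
    for i in range(n):
        r_u[i] = -0.5 * (v[i] * du[i] + dud[i]) + nu * d2u[i]


def transeq_reuse_result(r_u, u, v, nu):
    """Proposed layout: only 2 temporaries; du is stored in r_u."""
    n = len(u)
    dud, d2u = [0.0] * n, [0.0] * n
    deriv1(r_u, u)  # r_u temporarily holds du
    deriv1(dud, [a * b for a, b in zip(u, v)])
    deriv2(d2u, u)
    # r_u[i] is read as du[i] and then overwritten in place with the result,
    # so no separate du field has to be read in the final step.
    for i in range(n):
        r_u[i] = -0.5 * (v[i] * r_u[i] + dud[i]) + nu * d2u[i]
```

Both functions perform the same floating-point operations in the same order, so they should produce identical output. The second one simply never allocates or reads a separate du field.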

Reuse result field in transeq_component as a temporary field #40