xcompact3d / x3d2

https://xcompact3d.github.io/x3d2
BSD 3-Clause "New" or "Revised" License

Implement transeq in omp backend #27

Closed: Nanoseb closed this 6 months ago

Nanoseb commented 8 months ago

closes #21

Nanoseb commented 7 months ago

Just reimplemented it following your comment. Note that, for now, I have kept velmul_omp because removing it will require rewriting der_univ_dist to work in place. So it isn't as optimised as it could be yet, though I will likely do that in another branch because I may try to simplify der_univ_dist itself at the same time.

It still isn't ready to be merged; I need to clean up the naming convention I used, test it, and add a test.

semi-h commented 7 months ago

I would only test exec_dist_transeq_compact for now. It would be much simpler and more targeted. transeq only calls exec_dist_transeq_compact 3x3 times after all, and testing it requires instantiating the allocator and the backend and then setting a few variables. I think that is beyond the scope of the current test, which should be more low level. Also, it's very likely that the allocator will change a bit, and then we would need to edit this test program too. There will probably be changes in how we set the variables in the backend as well, and mirroring all of that here in the transeq test shouldn't be necessary. I think any test instantiating the backend should be one level up, so that we can test all backends, picking the right one at runtime, and the main procedures in the backends can be tested in a better way and in a single program.

semi-h commented 7 months ago

Also, it's a good idea to test the performance of exec_dist_transeq_compact now that we have it. With the distributed strategy, the first phase of the algorithm reads u and v (or only u, but let's test with u and v) and writes out 3 fields: du, dud, and d2u. There is a bit of buffering and MPI after the first phase, but let's ignore the cost of this for the moment. Then we have the second phase, where we read the 3 fields du, dud, d2u and write out 1 field as the result. So in total the data movement is equal to 5 reads and 4 writes, i.e. 5 + 4*2 = 13 field accesses. If we run this function on a $512^3$ mesh, the data movement is equal to 13 GiB. If you run the test on ARCHER2 on a single node with 8 ranks, so that there is a rank per NUMA zone, where the total available bandwidth is ~400 GiB/s, I would expect a single execution of this function to take 0.046 seconds, assuming we achieve a modest 70% of peak BW. Would you be able to test this when you have time? If it is too different from what we expect, we should investigate what causes it.
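For reference, a worked version of that estimate (assuming 8-byte double-precision fields, and counting each write twice to account for write-allocate traffic, which is where the 4*2 comes from):

$$
\begin{aligned}
\text{field size} &= 512^3 \times 8\ \text{B} = 1\ \text{GiB} \\
\text{data moved} &= (5 + 2 \times 4) \times 1\ \text{GiB} = 13\ \text{GiB} \\
t &\approx \frac{13\ \text{GiB}}{0.7 \times 400\ \text{GiB/s}} \approx 0.046\ \text{s}
\end{aligned}
$$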

Nanoseb commented 7 months ago

> I would only test exec_dist_transeq_compact for now.

Added a test for it too. Now we are testing both.

Nanoseb commented 7 months ago

> Also, it's a good idea to test the performance of the exec_dist_transeq_compact...

Now that we've decided to separate performance tests from verification/unit tests, I think I will leave that for now until we have a framework in place (see #35).

semi-h commented 7 months ago

The distributed solver requires up to 128/256 points per rank, depending on the particular compact scheme we solve. If you want to do parallel tests, the test can fail if you have too few points per rank.

We don't have a parallel testing environment yet, but can you confirm that the test passes when you run the executable by hand with multiple ranks? Because you use the default schemes, I think 64 points per rank should be more than enough.
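A manual run could look like the sketch below; the test executable name and rank count are illustrative only, use whatever target the build actually produces:

```sh
# Hypothetical manual check of the new test with multiple MPI ranks;
# substitute the actual test binary name from the build directory.
mpirun -n 4 ./test_omp_transeq
```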

Nanoseb commented 7 months ago

> The distributed solver requires up to 128/256 points per rank, depending on the particular compact scheme we solve. If you want to do parallel tests, the test can fail if you have too few points per rank.

Yes, exactly. With 64 cells it was failing with 4 cores (error ~7e-8) but working with 2. That's why I have now increased it to 96, so it works with up to 4 cores at the tolerance that was set (1e-8).

Nanoseb commented 7 months ago

I lowered the cell count just to run the tests faster; it went from 2 s to 0.4 s, I think. For now it isn't a big deal, but when we have 10 or 20 tests, that quickly adds up. Being able to run many tests quickly makes you run them more often.