yambo-code / yambo

This is the official GPL repository of the yambo code
http://www.yambo-code.eu/
GNU General Public License v2.0

Introduce Chunking in BSE KERNEL IO #48

Closed muralidhar-nalabothula closed 1 year ago

muralidhar-nalabothula commented 1 year ago

Dear All,

It looks like Yambo does not enable chunking when writing large files. I guess it is known that the performance is awful in parallel IO mode when writing reasonably large BSE kernels; in fact, when building sufficiently large kernels (> 10 GB), a significant amount of time is spent in IO, which severely hurts performance. From the software point of view, the root cause is that Yambo does not chunk the variables when writing large datasets, and netCDF does not enable chunking by default.

This PR adds chunking support to the write functions and enables it when writing the BSE kernel (for now only in "2D standard" mode and only with parallel IO). Chunking should subsequently be enabled throughout the code when writing other variables with very large file sizes. In general, this brings the parallel IO performance on par with non-parallel IO builds.
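As an illustration of the mechanism this PR relies on, here is a minimal standalone sketch (not the actual yambo code) of how an explicit chunk shape is requested through the netCDF Fortran API; the file name, dimension names, variable name, and sizes are purely illustrative:

 program chunk_sketch
   ! Standalone illustration: request an explicit chunk shape for a large 2D dataset.
   use netcdf
   implicit none
   integer, parameter :: H_dim = 4096      ! hypothetical BSE matrix dimension
   integer, parameter :: chunk_size = 512  ! default proposed in this PR
   integer :: ncid, dimids(2), varid, ierr
   ierr = nf90_create("kernel_sketch.nc", NF90_NETCDF4, ncid)
   ierr = nf90_def_dim(ncid, "BS_dim_1", H_dim, dimids(1))
   ierr = nf90_def_dim(ncid, "BS_dim_2", H_dim, dimids(2))
   ierr = nf90_def_var(ncid, "BSE_kernel", NF90_DOUBLE, dimids, varid)
   ! Without the call below the library chooses the storage layout on its own; with
   ! it, the data is stored as chunk_size x chunk_size tiles that are read/written whole.
   ierr = nf90_def_var_chunking(ncid, varid, NF90_CHUNKED, &
                                (/min(chunk_size,H_dim), min(chunk_size,H_dim)/))
   ierr = nf90_enddef(ncid)
   ierr = nf90_close(ncid)
 end program chunk_sketch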

Please note that, as said, this is only a fix from the software point of view; the hardware side generally has its own tweaks to improve parallel IO. For example, Lustre has a striping mechanism that lets parts of a file be written to different OSTs (with the stripe count setting the number of fragments and the stripe size setting the size of each part). As always, users should consult their HPC documentation.

Best regards, Murali

PS: I do not have access to the test suite, but I did try a few test cases myself. Please run the test suite and let me know if any cases fail.

sangallidavide commented 1 year ago

Dear Murali, thank you very much for this.

I do not know much about chunking. Indeed, my attempt to speed up the I/O was to impose syncing at 5% steps of the BSE matrix; see the end of io_BS_PAR_block.F. Do you see any conflict between syncing and chunking?

For the chunking, how did you decide that 512 was a good value? Instead of hard-coding 512 inside io_BS_PAR_init.F, I'd rather define a variable inside mod_IO, something like y_chunk_default=512. Eventually this could be promoted to an input variable for fine tuning in the future.
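Just to sketch what I have in mind (names are only indicative, this is not the actual yambo code):

 module chunk_defaults_sketch
   implicit none
   ! Proposed default chunk edge for large netCDF variables; could later be
   ! promoted to an input variable for fine tuning.
   integer :: y_chunk_default = 512
 end module chunk_defaults_sketch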

muralidhar-nalabothula commented 1 year ago

Dear Davide,

If you introduced syncing only to improve speed, then you no longer need it, since data is now read/written in the chunk shapes (a chunk is read/written as a whole; previously the entire matrix had to be read from the local hard disk in order to write small parts, which severely affected performance). But as far as I understood, you were doing this to keep track of the percentage of the BSE matrix computed, for restart purposes. If that is the purpose, then sync and chunking can go well together.

In general, there is no rule of thumb for the chunk size; it should depend on the matrix size. The value 512 is a default adapted from HDF5 (HDF5 defaults to 1 MB chunks, which should be increased in most cases), so I chose that. The rationale is that chunking is much better than not chunking, and 512 indeed worked very well for me (tested for BSE matrix sizes ranging from 256 to 147456). Just to give a benchmark, on our HPC (using Lustre and writing to 16 OSTs), writing a BSE kernel of size 147456 x 147456:

 before chunking          : io_BS : 01h-40m   CPU (389 calls, 15.431 sec avg)
 after chunking (size 512): io_BS : 177.8820s CPU (388 calls,  0.458 sec avg)

It is a great idea to create an input variable so that users can tune it, but I think there should be a good default value. Leaving it at 512 is fine, or maybe something like nc x nv x sqrt(nk), or nc x nv x max(k_grid), or BS_Hdim/max(k_grid), where k_grid = (nkx, nky, nkz) as defined in k_lattice.F.
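For instance, a rough sketch of the nc x nv x max(k_grid) idea (purely indicative names; the 512 floor and the cap at the kernel dimension are just one possible way to combine it with the current default):

 integer function bse_chunk_guess(nc, nv, k_grid, BS_H_dim)
   ! Heuristic default chunk edge for the BSE kernel: nc*nv*max(k_grid),
   ! kept between the 512 default and the kernel dimension itself.
   implicit none
   integer, intent(in) :: nc, nv      ! number of conduction / valence bands
   integer, intent(in) :: k_grid(3)   ! (nkx, nky, nkz) as in k_lattice.F
   integer, intent(in) :: BS_H_dim    ! linear size of the BSE kernel
   bse_chunk_guess = min(max(512, nc*nv*maxval(k_grid)), BS_H_dim)
 end function bse_chunk_guess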

Best regards, Murali

sangallidavide commented 1 year ago

> If you introduced syncing only to improve speed, then you no longer need it, since data is now read/written in the chunk shapes (a chunk is read/written as a whole; previously the entire matrix had to be read from the local hard disk in order to write small parts, which severely affected performance). But as far as I understood, you were doing this to keep track of the percentage of the BSE matrix computed, for restart purposes. If that is the purpose, then sync and chunking can go well together.

Yeah, I introduced syncing for restart reasons, but then I thought this was the source of the slowdown, and I changed the code to force syncing only at steps of 5%. This is the reason behind these lines:

 ! nf90_sync is performed up to n_BS_blks_min and at steps of 5% of BSE kernel
 file_sync= i_block<=n_BS_blks_min .and. mod(i_block,max(1,n_BS_blks_min/20))==0 .and. index(BSK_IO_mode,"norestart")==0

> In general, there is no rule of thumb for the chunk size; it should depend on the matrix size. The value 512 is a default adapted from HDF5 (HDF5 defaults to 1 MB chunks, which should be increased in most cases), so I chose that. The rationale is that chunking is much better than not chunking, and 512 indeed worked very well for me (tested for BSE matrix sizes ranging from 256 to 147456). Just to give a benchmark, on our HPC (using Lustre and writing to 16 OSTs), writing a BSE kernel of size 147456 x 147456: before chunking: io_BS : 01h-40m CPU (389 calls, 15.431 sec avg); after chunking (size 512): io_BS : 177.8820s CPU (388 calls, 0.458 sec avg).

Got it. That's already really great!

> It is a great idea to create an input variable so that users can tune it, but I think there should be a good default value. Leaving it at 512 is fine, or maybe something like nc x nv x sqrt(nk), or nc x nv x max(k_grid), or BS_Hdim/max(k_grid), where k_grid = (nkx, nky, nkz) as defined in k_lattice.F.

Sure, I had in mind a default and then, eventually, a high-verbosity input variable for the future. Fine, let's set 512 as the default; later we will see if an input variable is really needed in some cases.

sangallidavide commented 1 year ago

Addendum: after this change, we can probably simplify to file_sync = i_block<=n_BS_blks_min and remove the "norestart" option.

About i_block<=n_BS_blks_min: I had to set n_BS_blks_min because, in the tests I did, I cannot call nf90_sync unless all MPI tasks that called nf90_open also go through nf90_sync each time; if not, the simulation stays stuck forever. Since every MPI task may have a different number of BS_blks to write, I can sync only up to the minimum value; after that I cannot anymore.

Question (sorry, I am taking a bit of advantage of your expertise): is this normal, or am I doing something wrong?

muralidhar-nalabothula commented 1 year ago

Honestly, I would leave it as it is now. I would expect nf90_sync to cost some IO performance, and I indeed saw a performance hit in some cases. Although the IO times are drastically cut down with chunking, I would personally still leave it to the user whether to restart or not. Please note that the benefits of nf90_sync outweigh the cost in most cases, though (so the default for restart is on).

Yes, it is normal. What you do makes sense to me: nf90_sync is a collective call and all the CPUs must call the function. Since the blocks are not distributed equally, some CPUs may not call nf90_sync, leading to the behaviour you saw (the CPUs that called it wait for the others, which never arrive because they did not call the function, just like calling MPI_Barrier from one CPU and not the others).
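Schematically, the constraint looks like this (the subroutine and variable names only follow the discussion above, this is not the actual io_BS_PAR_block.F code, and the per-block write is elided):

 subroutine sync_up_to_min(ncid, n_BS_blks, n_BS_blks_min)
   ! Every MPI task that opened the file must reach nf90_sync the same number of
   ! times, so syncing is only done while i_block <= n_BS_blks_min (the smallest
   ! number of blocks owned by any task).
   use netcdf
   implicit none
   integer, intent(in) :: ncid, n_BS_blks, n_BS_blks_min
   integer :: i_block, ierr
   do i_block = 1, n_BS_blks                 ! n_BS_blks differs between MPI tasks
      ! ... compute the block and write it with nf90_put_var here ...
      if (i_block <= n_BS_blks_min) ierr = nf90_sync(ncid)  ! reached by all tasks
   end do
 end subroutine sync_up_to_min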

BTW, I noticed that the Haydock solver on the current master branch (without this PR) is very slow compared to the release tarballs, and I sometimes see the following error, which disappears if I run it again:

 <53s> P1-aion-0147: [08.01.05.01] Screened interaction header I/O
 P1-aion-0147: [ERROR] STOP signal received while in[08.01.05.01] Screened interaction header I/O
 P1-aion-0147: [ERROR] Using screening with less G than Weh interaction

Also, please let me know whether the PR passes all the test cases.