wrf-model / WRF

The official repository for the Weather Research and Forecasting (WRF) model
Other
1.24k stars 680 forks source link

Optimization flags for gfortran #1254

Open milancurcic opened 4 years ago

milancurcic commented 4 years ago

By changing -O2 to -Ofast -ffast-math flags with gfortran, I got about an 18% speed up with WRF v4.2.

In configure.wrf:

#FCOPTIM         =       -O2 -ftree-vectorize -funroll-loops
FCOPTIM         =       -Ofast -ffast-math -ftree-vectorize -funroll-loops

Is anybody on the WRF team aware of problems with using -Ofast -ffast-math? If not, should we consider using them by default? They will change the results relative to -O2 (or even -O3), but I doubt with bias in any direction. I may be wrong.

davegill commented 4 years ago

@milancurcic @dudhia @weiwangncar Milan, We tend to be VERY conservative with our optimization flags. Nearly a 20% performance boost is very impressive.

Would you be interested in proposing a PR? Perhaps we could add this option (commented out) to the arch/configure.defaults for all of the GNU stanzas? Perhaps we could add a new GNU stanza with this option?

negin513 commented 4 years ago

@milancurcic

@davidedelvento and I did a study a few years ago which showed similar results; however, the results will not be BFB.

We even showed in some bottlenecks of the WRF code only by changing to -Ofast we can have up to 6X speedup (in the bottleneck).

milancurcic commented 4 years ago

@davegill Sure, I'd be happy to. But we'd need to test that all schemes (if possible) still work. Do you have a test suite that you could run for this purpose?

What is a GNU stanza? Would it result in something like this?

32. (serial)  33. (smpar)  34. (dmpar)  35. (dm+sm)   GNU (gfortran/gcc), moderate optimization level
36. (serial)  37. (smpar)  38. (dmpar)  39. (dm+sm)   GNU (gfortran/gcc), high optimization level

If we can test everything, I'd suggest we have just one recommended GNU stanza.

Alternatively, if we can't test everything, perhaps we can have two as above, and add a note like:

36. (serial)  37. (smpar)  38. (dmpar)  39. (dm+sm)   GNU (gfortran/gcc), high optimization level (experimental)

the results will not be BFB.

@negin513 Right, we shouldn't expect BFB compared to different optimization level as the compiler emits different code. But as long as we can test that stronger optimization doesn't break any existing code, I think this is fine. Do you agree?

davegill commented 4 years ago

@milancurcic Milan,

  1. We do have a way to test a number of schemes, though by default it turns off optimization. We can fix that.
  2. Look in the arch/configure.defaults file. You will see separate sections for each combination of compiler/architecture. Grab one of the GNU sections and COPY it towards the very bottom of the file. You will see this instruction:
    #insert new stanza before the Fujitsu block, keep Fujitsu at the end of the list
    ###########################################################
  3. I'd agree that with including "higher optimization" in the description, but let's not include "experimental".
  4. This would not be a bug fix, so we would consider this a new development. As such, the modifications would be based from the "develop" branch. Take a look at a new wiki we are putting together: https://github.com/wrf-model/WRF/wiki/PR
dudhia commented 4 years ago

Another alternative would be a new configure flag, like we have configure -d, perhaps configure -o for more optimized? Just a thought. Dave's response may be delayed.

On Tue, Jul 14, 2020 at 10:39 AM Milan Curcic notifications@github.com wrote:

@davegill https://github.com/davegill Sure, I'd be happy to. But we'd need to test that all schemes (if possible) still work. Do you have a test suite that you could run for this purpose?

What is a GNU stanza? Would it result in something like this?

  1. (serial) 33. (smpar) 34. (dmpar) 35. (dm+sm) GNU (gfortran/gcc), moderate optimization level
  2. (serial) 37. (smpar) 38. (dmpar) 39. (dm+sm) GNU (gfortran/gcc), high optimization level

If we can test everything, I'd suggest we have just one recommended GNU stanza.

Alternatively, if we can't test everything, perhaps we can have two as above, and add a note like:

  1. (serial) 37. (smpar) 38. (dmpar) 39. (dm+sm) GNU (gfortran/gcc), high optimization level (experimental)

the results will not be BFB.

@negin513 https://github.com/negin513 Right, we shouldn't expect BFB compared to different optimization level as the compiler emits different code. But as long as we can test that stronger optimization doesn't break any existing code, I think this is fine. Do you agree?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/wrf-model/WRF/issues/1254#issuecomment-658286315, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEIZ77EV4L2OJY5EQ3RX7VDR3SC5VANCNFSM4OYZO66A .

davegill commented 4 years ago

@dudhia Jimy, I like your concept. We would need to agree on what "optimized" is for each compiler. Intel may support various AVX instructions, but only with some chip sets. Could we always assume -fast (or whatever the syntax is) for all compilers? I do like this cleaner approach than an entirely new stanza.

weiwangncar commented 4 years ago

It may not need to be available for every compiler. It can be made when it is available and when some tests are performed.

davegill commented 4 years ago

@weiwangncar @dudhia We could have this available for GNU and Intel. Maybe do PGI.

bociusz commented 4 years ago

There is an additional ~15%-20% performance boost by using "-march=native" in gfortran, I propose to include it as well. The only issue I found with using optimization flags higher than the default -O2 is the gen_be_wrapper.ksh script of WRFDA, which gave me NaN errors.

dudhia commented 4 years ago

configure could also be called configure -fast to match the typical compiler flag, Jimy

On Tue, Jul 14, 2020 at 10:52 AM Dave Gill notifications@github.com wrote:

@dudhia https://github.com/dudhia Jimy, I like your concept. We would need to agree on what "optimized" is for each compiler. Intel may support various AVX instructions, but only with some chip sets. Could we always assume -fast (or whatever the syntax is) for all compilers? I do like this cleaner approach than an entirely new stanza.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/wrf-model/WRF/issues/1254#issuecomment-658292977, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEIZ77FHVLFPAPM6AZDO2Q3R3SEORANCNFSM4OYZO66A .

Plantain commented 4 years ago

I would suggest not merging flags for which make the resulting binary non-portable across machines of the same architecture like -march. Similarly options that allegedly give NaN errors... surely the defaults should be the most correct and portable, and leave it to users to override if desired.

On Wed, 15 Jul 2020, 12:07 pm Bálint Szente-Varga, notifications@github.com wrote:

There is an additional ~15%-20% performance boost by using "-march=native" in gfortran, I propose to include it as well. The only issue I found with using optimization flags higher than the default -O2 is the gen_be_wrapper.ksh script of WRFDA, which gave me NaN errors.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/wrf-model/WRF/issues/1254#issuecomment-658677940, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACODF7EBLUL6BZCE4UQ3HDR3V5UZANCNFSM4OYZO66A .

davidedelvento commented 4 years ago

@Plantain I agree that in some systems the compiling may be happening on a (slightly) different architecture than where the run will occur, possibly creating even Illegal Instruction errors. Yet, I believe the proposed PR is for a commented out option which some disclaimer, and so I think that architecture specific flags should be on the table. They may or may not provide the "advertized" benefit to each individual user/machine: performance is already hard, performance portability is even worse... (and here we are just talking about compiler flags, not really "performance" per se)

milancurcic commented 4 years ago

From the feedback it seems that there's prevalent agreement that the highest optimization flags shouldn't be the default. I think Jimy's proposed configure -f is nice. I see that it's already used and passed to arch/Config.pl but it looks to me like it only controls the sources that are optimized or not, but not the compiler flags.

@davegill how to proceed? Should this PR add the highest optimization flags for all compilers? This will be impossible for me to test. I think Intel already uses max optimization by default.

weiwangncar commented 4 years ago

@milancurcic If the optimization is introduced as an option, that gives people time to try out. It's possible that it can become default at some later time.

dudhia commented 4 years ago

Yes, it looks like -f is the default optimization. Perhaps another flag is needed for faster optimizations. Not all stanzas would have this extra option, so that has to be handled somehow too.

On Thu, Jul 23, 2020 at 9:25 AM Milan Curcic notifications@github.com wrote:

From the feedback it seems that there's prevalent agreement that the highest optimization flags shouldn't be the default. I think Jimy's proposed configure -f is nice. I see that it's already used and passed to arch/Config.pl but it looks to me like it only controls the sources that are optimized or not, but not the compiler flags.

@davegill https://github.com/davegill how to proceed? Should this PR add the highest optimization flags for all compilers? This will be impossible for me to test. I think Intel already uses max optimization by default.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/wrf-model/WRF/issues/1254#issuecomment-663070634, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEIZ77GGQFVDQ4BC6ONIABDR5BI4ZANCNFSM4OYZO66A .