Open IncubatorShokuhou opened 4 years ago
@IncubatorShokuhou Hao Lyu, We would appreciate you help in this matter.
@davegill
I discovered that NaN appear after subroutine nislfv_rain_plm6
for the first time. After calling nislfv_rain_plm6
, some positions in denqrs3
became NaN, and then
qrs(i, k, 3) = max(denqrs3(i, k)/den(i, k), 0.)
is excuted.
Theoretically, qrs should be a real number larger than or equal to 0. However, when I tryed to print qrs(its:ite,kts:kte, 3)
after the loop, I found that there are NaN in it.
So I print qrs(i, k, 3)
after qrs(i, k, 3) = max(denqrs3(i, k)/den(i, k), 0.)
to make sure if this code was correctly executed. However, I found that these code DO correctly executed——there was no NaN in qrs!
So I further more printed qrs twice: before and after the loop. It shows again that the number has no NaN in the loop and has NaN outside the loop——but there should be nothing done upon the qrs
in between.
Now, here is the weirdest thing in the whole story: I tried to print qrs
again, like this:
do k = kts, kte
do i = its, ite
qrs(i, k, 1) = max(denqrs1(i, k)/den(i, k), 0.)
qrs(i, k, 2) = max(denqrs2(i, k)/den(i, k), 0.)
qrs(i, k, 3) = max(denqrs3(i, k)/den(i, k), 0.)
print * ,qrs(i, k, 3)
fall(i, k, 1) = denqrs1(i, k)*workr(i, k)/delz(i, k)
fall(i, k, 2) = denqrs2(i, k)*worka(i, k)/delz(i, k)
fall(i, k, 3) = denqrs3(i, k)*worka(i, k)/delz(i, k)
print * ,"2nd",qrs(i, k, 3)
enddo
enddo
do k = kts, kte
do i = its, ite
print * ,"3rd", qrs(i, k, 3)
enddo
enddo
and now, there is no NaN in "3rd" at all !
I am wondering why this happened.
PS: my fortran complier version is pgf90 19.4-0 LLVM 64-bit target on x86-64 Linux -tp skylake
. I'm wondering if this error is related to the compiler.
I will carry on checking why denqrs3
has NaN, which caused the NaN in qrs(:,:,3) and qg. All further information concerned will be updated.
I noticed that qci
and qrs
are all paddinted 0 for negative values generated by dynamics at the very beginnning of subroutine WSM62D
, which is likely to be the reason of how NaN is used in the subroutine.
However, I personally recommand that denqrs1(i, k)/den(i, k)
should be firstly checked to determine whether it is NaN before using function max
, as is mentioned by opendoc of gfortran,
The Fortran standard does not specify what the result of the MAX and MIN intrinsics are if one of the arguments is a NaN. Accordingly, the GNU Fortran compiler does not specify that either, as this allows for faster and more compact code to be generated. If the programmer wishes to take some specific action in case one of the arguments is a NaN, it is necessary to explicitly test the arguments before calling MAX or MIN, e.g. with the IEEE_IS_NAN function from the intrinsic module IEEE_ARITHMETIC.
@IncubatorShokuhou
I noticed that qci and qrs are all paddinted 0 for negative values generated by dynamics at the very beginnning of subroutine WSM62D, which is likely to be the reason of how NaN is used in the subroutine.
I do not see how a max(qci(i,k,1),0.0) is likely to be the reason of how a NaN in introduced into the scheme?
If the qci or qcs is a NaN, can you get the (i,k,j) location? Is it reproducible? Can you back trace this to the solve_em.F file where the microphysics driver is called? If you use positive definite monotonic advection, does the NaN still appear? If you change MP schemes, is there still a problem? If you run a different day, does the problem still occur? If you start at a different time?
Can you build the code with floating traps on? Maybe that helps? For WRF that would be clean -a configure -D compile
I am sympathetic when it comes to print statements that move errors around, it is very frustrating.
@davegill
Sorry, maybe I did not explained it clearly. max(qci(i,k,1),0.0)
might be the solution of the NaN input, not the cause of NaN. As mentioned above, max(denqrs1(i, k)/den(i, k), 0.)
might be the cause in my opinion.
If the qci or qcs is a NaN, can you get the (i,k,j) location? Is it reproducible?
I located the i,k,j and I make sure that they are all in its-ite,kts-kte,jts-jte. It is reproducible and all the reproduction are exactly same.
Can you back trace this to the solve_em.F file where the microphysics driver is called?
I actually exported the input of the subroutine WSM6
in phys/module_microphysics_driver.F
, right before calling wsm6. And there ARE NaN in these inputs. Nothing was done within the subroutine WSM6
at that time.
If you use positive definite monotonic advection, does the NaN still appear?
Sorry, I don't know what is a 'positive definite monotonic advection'. Would you please give me more information about it?
If you change MP schemes, is there still a problem?
Have not checked yet, but I guess that there might be some similar problems as wsm5 and wsm3 has a similar structure in caculating qrs
using max(denqrs1(i,k)/den(i,k),0)
If you run a different day, does the problem still occur? If you start at a different time?
I tried several different days and time, and the Nan problems still remained. I also tried to compile wrf with gcc/gfortran, and there are still NaNs, all of which are within high levels, which are supposed to be zero( in my own opinion).
So eventually, my key questions are:
max(NaN,0)
?do k = kts, kte
do i = its, ite
qrs(i, k, 1) = max(denqrs1(i, k)/den(i, k), 0.)
qrs(i, k, 2) = max(denqrs2(i, k)/den(i, k), 0.)
qrs(i, k, 3) = max(denqrs3(i, k)/den(i, k), 0.)
print * ,"1st",qrs(i, k, 3)
fall(i, k, 1) = denqrs1(i, k)*workr(i, k)/delz(i, k)
fall(i, k, 2) = denqrs2(i, k)*worka(i, k)/delz(i, k)
fall(i, k, 3) = denqrs3(i, k)*worka(i, k)/delz(i, k)
print * ,"2nd",qrs(i, k, 3)
enddo
enddo
@IncubatorShokuhou This seems like good news.
The max(qci,0.0) is an old trick to handle negative values of moisture. It was never intended to fix a NaN value.
The qci and qcs are passed into the scheme from the microphysics driver, which is called from solve_em.F. Trace the specific location, the (i,1,j), of the moist variables (ice, cloud, rain, snow, graupel) back up the call tree in the solve routine. Find which routine is throwing a NaN at the location. Once you identify the routine, go in there and find the problematic loop.
If you can get a reproducible NaN after a restart, this would be much faster if you output a restart file a single time step before the first NaN occurred. Then just run one time step with loads of print out.
@davegill
I am afraid that the NaN is generated inside the wsm6
subroutine instead of outside. As is menthoned above, I did a experiment, converting all the NaNs in the input of subroutine WSM6
, and then called the WSM6
subroutine, and it turned out that there are NaNs in the output, which proves that WSM6
itself can cause the NaNs.
Also as I said above, the NaNs appeared after subroutine nislfv_rain_plm6
for the first time, which indicates(as far as I am concerned) that NaNs in denqrs3
caused the NaNs in qrs
.
@IncubatorShokuhou
NaNs usually result from floating point exceptions, so have you tried compiling with an option to trap those? For pgf90, it looks like '-Ktrap=fp' might the option you want. (I don't have experience with that compiler -- that is what I found in a man page). Otherwise, an exception could occur and the code will still continue.
Also, do you need the '-tp skylake' option and have you tried turning that off?
MODEL VERSION: v4.1.2
TYPE: question
KEYWORDS: WSM6
SOURCE FILE: phys/module_mp_wsm6.F
DESCRIPTION OF PROBLEMS:
Recently I am trying to do something related to GPU acceleration in microphysics scheme, so I tried to output all the variables used by wsm6 subroutine in
phys/module_microphysics_driver.F
before subroutine wsm6 is called. However, I found that there are some "NaN" values in the called variables(e.g.qg
). So here are my questions:phys/module_mp_wsm6.F
to a "single-coloum" "WSM1D" subroutine, putting the i loops outside of the subroutine. I compared the output of my codes and the oroginal wsm6 code, with a same input (in which the NaN has been replaced by 0.0) generated by a normal wrf case. I found that the output of my "WSM1D" code has no NaN values, which is quite different from the original "2d" code(still has NaN), while all the other variables(th
et.al.) were exactly the same. What might be the causions?Thanks! -- Hao Lyu