WSM6 NaN values in qg - Githubissues

IncubatorShokuhou commented 4 years ago

MODEL VERSION: v4.1.2

TYPE: question

KEYWORDS: WSM6

SOURCE FILE: phys/module_mp_wsm6.F

DESCRIPTION OF PROBLEMS:

Recently I am trying to do something related to GPU acceleration in microphysics scheme, so I tried to output all the variables used by wsm6 subroutine in phys/module_microphysics_driver.F before subroutine wsm6 is called. However, I found that there are some "NaN" values in the called variables(e.g. qg). So here are my questions:

How could the variables with "NaN" be able to be used in the subroutine wsm6? Or, what has been done on this NaN values? I cannnot find any clue in the concerned code.
I tried to rewrite the subroutine "WSM62D" in phys/module_mp_wsm6.F to a "single-coloum" "WSM1D" subroutine, putting the i loops outside of the subroutine. I compared the output of my codes and the oroginal wsm6 code, with a same input (in which the NaN has been replaced by 0.0) generated by a normal wrf case. I found that the output of my "WSM1D" code has no NaN values, which is quite different from the original "2d" code(still has NaN), while all the other variables(th et.al.) were exactly the same. What might be the causions?

Thanks! -- Hao Lyu

davegill commented 4 years ago

@IncubatorShokuhou Hao Lyu, We would appreciate you help in this matter.

Are the NaN values within the computed horizontal region? The input variables, such as qg, are dimensioned as (ims:ime,kms:kme,jms:jme). Are the NaN values occurring in the (its:ite,kts:kte,jts:jte) region?
If you know the (i,k,j) of the NaN value, go backwards through the call tree to see where this is happening. There are very few places that modify qg. Is this a boundary, caused by transport, etc? Is this only in a single domain? Is this immediately after initialization?

IncubatorShokuhou commented 4 years ago

@davegill I discovered that NaN appear after subroutine nislfv_rain_plm6 for the first time. After calling nislfv_rain_plm6, some positions in denqrs3 became NaN, and then qrs(i, k, 3) = max(denqrs3(i, k)/den(i, k), 0.) is excuted. Theoretically, qrs should be a real number larger than or equal to 0. However, when I tryed to print qrs(its:ite,kts:kte, 3) after the loop, I found that there are NaN in it. So I print qrs(i, k, 3) after qrs(i, k, 3) = max(denqrs3(i, k)/den(i, k), 0.) to make sure if this code was correctly executed. However, I found that these code DO correctly executed——there was no NaN in qrs! So I further more printed qrs twice: before and after the loop. It shows again that the number has no NaN in the loop and has NaN outside the loop——but there should be nothing done upon the qrs in between. Now, here is the weirdest thing in the whole story: I tried to print qrs again, like this:

         do k = kts, kte
            do i = its, ite
               qrs(i, k, 1) = max(denqrs1(i, k)/den(i, k), 0.)
               qrs(i, k, 2) = max(denqrs2(i, k)/den(i, k), 0.)
               qrs(i, k, 3) = max(denqrs3(i, k)/den(i, k), 0.)
               print * ,qrs(i, k, 3)
               fall(i, k, 1) = denqrs1(i, k)*workr(i, k)/delz(i, k)
               fall(i, k, 2) = denqrs2(i, k)*worka(i, k)/delz(i, k)
               fall(i, k, 3) = denqrs3(i, k)*worka(i, k)/delz(i, k)
            print * ,"2nd",qrs(i, k, 3)
            enddo

         enddo

         do k = kts, kte
            do i = its, ite
               print * ,"3rd", qrs(i, k, 3)
            enddo
         enddo

and now, there is no NaN in "3rd" at all !

I am wondering why this happened. PS: my fortran complier version is pgf90 19.4-0 LLVM 64-bit target on x86-64 Linux -tp skylake. I'm wondering if this error is related to the compiler. I will carry on checking why denqrs3 has NaN, which caused the NaN in qrs(:,:,3) and qg. All further information concerned will be updated.

IncubatorShokuhou commented 4 years ago

I noticed that qci and qrs are all paddinted 0 for negative values generated by dynamics at the very beginnning of subroutine WSM62D, which is likely to be the reason of how NaN is used in the subroutine. However, I personally recommand that denqrs1(i, k)/den(i, k) should be firstly checked to determine whether it is NaN before using function max, as is mentioned by opendoc of gfortran,

The Fortran standard does not specify what the result of the MAX and MIN intrinsics are if one of the arguments is a NaN. Accordingly, the GNU Fortran compiler does not specify that either, as this allows for faster and more compact code to be generated. If the programmer wishes to take some specific action in case one of the arguments is a NaN, it is necessary to explicitly test the arguments before calling MAX or MIN, e.g. with the IEEE_IS_NAN function from the intrinsic module IEEE_ARITHMETIC.

davegill commented 4 years ago

@IncubatorShokuhou

I noticed that qci and qrs are all paddinted 0 for negative values generated by dynamics at the very beginnning of subroutine WSM62D, which is likely to be the reason of how NaN is used in the subroutine.

I do not see how a max(qci(i,k,1),0.0) is likely to be the reason of how a NaN in introduced into the scheme?

If the qci or qcs is a NaN, can you get the (i,k,j) location? Is it reproducible? Can you back trace this to the solve_em.F file where the microphysics driver is called? If you use positive definite monotonic advection, does the NaN still appear? If you change MP schemes, is there still a problem? If you run a different day, does the problem still occur? If you start at a different time?

Can you build the code with floating traps on? Maybe that helps? For WRF that would be clean -a configure -D compile

I am sympathetic when it comes to print statements that move errors around, it is very frustrating.

IncubatorShokuhou commented 4 years ago

@davegill Sorry, maybe I did not explained it clearly. max(qci(i,k,1),0.0)might be the solution of the NaN input, not the cause of NaN. As mentioned above, max(denqrs1(i, k)/den(i, k), 0.)might be the cause in my opinion.

If the qci or qcs is a NaN, can you get the (i,k,j) location? Is it reproducible?

I located the i,k,j and I make sure that they are all in its-ite,kts-kte,jts-jte. It is reproducible and all the reproduction are exactly same.

Can you back trace this to the solve_em.F file where the microphysics driver is called?

I actually exported the input of the subroutine WSM6 in phys/module_microphysics_driver.F, right before calling wsm6. And there ARE NaN in these inputs. Nothing was done within the subroutine WSM6at that time.

If you use positive definite monotonic advection, does the NaN still appear?

Sorry, I don't know what is a 'positive definite monotonic advection'. Would you please give me more information about it?

If you change MP schemes, is there still a problem?

Have not checked yet, but I guess that there might be some similar problems as wsm5 and wsm3 has a similar structure in caculating qrs using max(denqrs1(i,k)/den(i,k),0)

If you run a different day, does the problem still occur? If you start at a different time?

I tried several different days and time, and the Nan problems still remained. I also tried to compile wrf with gcc/gfortran, and there are still NaNs, all of which are within high levels, which are supposed to be zero( in my own opinion).

So eventually, my key questions are:

Is NaN in high levels acceptable? Should they be converted to 0, which are not successfully completed by the function max(NaN,0)?

What caused the difference in the qrs inside and outside the loop? There SHOULD be nothing done upon qrs, but in the loop there is no NaN and outside the loop there are Nans?

do k = kts, kte
        do i = its, ite
           qrs(i, k, 1) = max(denqrs1(i, k)/den(i, k), 0.)
           qrs(i, k, 2) = max(denqrs2(i, k)/den(i, k), 0.)
           qrs(i, k, 3) = max(denqrs3(i, k)/den(i, k), 0.)
           print * ,"1st",qrs(i, k, 3)
           fall(i, k, 1) = denqrs1(i, k)*workr(i, k)/delz(i, k)
           fall(i, k, 2) = denqrs2(i, k)*worka(i, k)/delz(i, k)
           fall(i, k, 3) = denqrs3(i, k)*worka(i, k)/delz(i, k)
        print * ,"2nd",qrs(i, k, 3)
        enddo
 enddo

davegill commented 4 years ago

@IncubatorShokuhou This seems like good news.

The input has NaNs
It is always reproducible

The max(qci,0.0) is an old trick to handle negative values of moisture. It was never intended to fix a NaN value.

The qci and qcs are passed into the scheme from the microphysics driver, which is called from solve_em.F. Trace the specific location, the (i,1,j), of the moist variables (ice, cloud, rain, snow, graupel) back up the call tree in the solve routine. Find which routine is throwing a NaN at the location. Once you identify the routine, go in there and find the problematic loop.

If you can get a reproducible NaN after a restart, this would be much faster if you output a restart file a single time step before the first NaN occurred. Then just run one time step with loads of print out.

IncubatorShokuhou commented 4 years ago

@davegill I am afraid that the NaN is generated inside the wsm6 subroutine instead of outside. As is menthoned above, I did a experiment, converting all the NaNs in the input of subroutine WSM6, and then called the WSM6 subroutine, and it turned out that there are NaNs in the output, which proves that WSM6 itself can cause the NaNs. Also as I said above, the NaNs appeared after subroutine nislfv_rain_plm6 for the first time, which indicates(as far as I am concerned) that NaNs in denqrs3 caused the NaNs in qrs.

MicroTed commented 4 years ago

@IncubatorShokuhou

NaNs usually result from floating point exceptions, so have you tried compiling with an option to trap those? For pgf90, it looks like '-Ktrap=fp' might the option you want. (I don't have experience with that compiler -- that is what I found in a man page). Otherwise, an exception could occur and the code will still continue.

Also, do you need the '-tp skylake' option and have you tried turning that off?

wrf-model / WRF

WSM6 NaN values in qg #1167