Fix the source of the dt=0 error when using adaptive time step

JeroBnd commented 3 weeks ago

Fixed the source of the dt=0 error that occurs when the time step is adjusted to BC and output times.

TYPE: bug fix

KEYWORDS: time, step, adaptive

SOURCE: Jeronimo Bande (IDING SAS)

DESCRIPTION OF CHANGES: Problem: Adjusting the time step to match BC and output times can produce dt=0.

Solution: What was done algorithmically and in the source code to address the problem?

ISSUE: Fixes #1560

LIST OF MODIFIED FILES: /dyn_em/adapt_timestep_em.F

TESTS CONDUCTED:

Do mods fix problem? How can that be demonstrated, and was that test conducted?
Are the Jenkins tests all passing?

RELEASE NOTE: Corrected the adaptive time step adjustment at BC and output times.

weiwangncar commented 3 weeks ago

The regression test results:

Test Type              | Expected  | Received |  Failed
= = = = = = = = = = = = = = = = = = = = = = = =  = = = =
Number of Tests        : 23          24
Number of Builds       : 60          57
Number of Simulations  : 158         150         0
Number of Comparisons  : 95          86          0

Failed Simulations are: 
None
Which comparisons are not bit-for-bit: 
None

dudhia commented 3 weeks ago

To help us review this, can you add more explanation of the fix in the description section?

JeroBnd commented 3 weeks ago

Hello.

The adaptive time step module operates with a precision of 1/100 seconds, resulting in simulation times with the same precision.

The process that determines the dtInterval involves several steps, checking various conditions. One of these conditions is the precision of 1/100 seconds.

When adjusting the time step to boundary conditions (BC) and output times, the algorithm divides a temporary time interval (with 1/100 sec precision) by two and assigns it to dtInterval. When this temporary time interval has an odd value, the precision changes to 1/200 sec, and the next simulation time also has a precision of 1/200 sec.

In the next step, adjacent to the BC or output time, the algorithm truncates the 1/200 sec precision, setting the simulation time to 1/200 sec before the BC or output time (without the mitigation done in #154).

The following dtInterval, which is 1/200 sec, is then truncated to a precision of 1/100 sec, resulting in a dtInterval of 0.

This effect is mitigated in #154, but that change does not address the underlying source of the problem.
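The precision argument above can be reproduced with exact rational arithmetic. The following is only a minimal Python sketch of the description in this comment, not the code in adapt_timestep_em.F: the function names and the simple halve-then-truncate loop are assumptions made for illustration.

```python
from fractions import Fraction

HUNDREDTH = Fraction(1, 100)  # precision of the module, as described above


def truncate_to_hundredths(seconds):
    """Drop anything finer than 1/100 s (hypothetical helper)."""
    return (seconds // HUNDREDTH) * HUNDREDTH


def steps_to_target(remaining):
    """Return the sequence of dt values used to reach a BC/output time.

    Assumed behavior, following the comment: the first adjustment halves
    the remaining interval without truncation; every later dt is the
    remaining time truncated to 1/100 s.
    """
    history = []
    dt = remaining / 2            # odd hundredths -> 1/200 s precision
    history.append(dt)
    remaining -= dt
    while remaining > 0:
        dt = truncate_to_hundredths(remaining)
        history.append(dt)
        if dt == 0:               # the dt=0 condition discussed in this PR
            break
        remaining -= dt
    return history


odd_case = steps_to_target(Fraction(1001, 100))   # 10.01 s to the BC time
even_case = steps_to_target(Fraction(1000, 100))  # 10.00 s to the BC time
print(odd_case)
print(even_case)
```

In this sketch, an odd number of hundredths leaves 1/200 s before the target time, and truncating that remainder gives a zero step, while an even number of hundredths reaches the target exactly.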

weiwangncar commented 3 weeks ago

@JeroBnd Can you expand IDING SAS? Are you working with Kugler who posted the issue?

JeroBnd commented 3 weeks ago

I am not working with Kugler. I am from Córdoba, Argentina. IDING SAS is a startup. We provide services to APRHI (Provincial Administration of Water Resources) for reservoir management. In this context, we are operationally running a high-resolution weather forecast ensemble with WRF.

I found this bug while debugging a mistake I had made in namelist.input that did not trigger a warning.

There are several things to do in the adapt_timestep module...

weiwangncar commented 3 weeks ago

@JeroBnd Thanks for the info. I tested one of the cases Kugler had a problem with, the em_b_wave case, and your change didn't help. Did you test that case using his namelist.input file?

JeroBnd commented 3 weeks ago

@weiwangncar, the adapt_timestep module is not prepared to handle this kind of idealized case. The algorithm uses the remaining time until the boundary condition is applied, with a counter that resets when the boundary condition is used.

In this idealized case, the counter starts with a value of 10800 (equivalent to 3 hours), but when the supposed boundary condition should occur, it doesn’t, and the counter continues decreasing into negative values, causing the simulation to terminate prematurely.

I wrote a small patch to fix this, but the module should be reconsidered. It uses several variables from the namelist.input without properly checking them.
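The counter behavior described above can be illustrated with a toy countdown. This is a hedged Python sketch, not the module's logic: the function name, the fixed 600 s step, and the reset rule are assumptions, and it only shows the counter going negative, not how the run then terminates.

```python
def run_bc_countdown(total_seconds, dt, bc_interval, bc_is_applied):
    """Return the time-until-next-BC counter after each step.

    bc_is_applied=True mimics a real-data run, where the counter is
    reset when the boundary condition is used. bc_is_applied=False
    mimics the idealized case, where no boundary update ever happens.
    """
    counter = bc_interval
    elapsed = 0
    values = []
    while elapsed < total_seconds:
        elapsed += dt
        counter -= dt
        if counter <= 0 and bc_is_applied:
            counter += bc_interval  # reset when the BC is used
        values.append(counter)
    return values


# 4 hours of simulation, 600 s steps, counter starting at 10800 s (3 hours)
real_run = run_bc_countdown(14400, 600, 10800, bc_is_applied=True)
ideal_run = run_bc_countdown(14400, 600, 10800, bc_is_applied=False)
print(min(real_run), ideal_run[-1])
```

With the reset, the counter stays positive. Without it, the counter passes zero after 3 hours and keeps decreasing.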

weiwangncar commented 3 weeks ago

@JeroBnd Did you encounter a problem associated with dt=0? If so, what version of the code is it?

JeroBnd commented 3 weeks ago

@weiwangncar I am doing one-way nesting using Ndown and adaptive time step. The boundary conditions (BC) are taken from domain 1 to domain 2 every 30 minutes (1800 seconds).

However, when running WRF for domain 2 with a mistake in the namelist.input file, specifically with the variable interval_seconds set to 3600 seconds, one random run out of the 24-member ensemble usually fails with a CFL error and a segmentation fault or stops at NOAH MP.

For this to happen, two conditions must occur simultaneously:

1 - The time interval between the current simulation time and the next BC time is an odd value (resulting in a simulation time precision of 1/200 sec).

2 - The BC time does not match the BC time derived from interval_seconds, which in my case should occur at times ending in 30 minutes, such as 1:30.

When both of these conditions occur, the mitigation algorithm for dt=0 (#154) sets the dtInterval to match the BC time derived from interval_seconds, generating a time interval of 30 minutes, which crashes the run.
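The size of the resulting step follows from simple arithmetic. The sketch below only restates the scenario described in this comment and has not been checked against the WRF source: `next_bc_time` and `snapped_dt` are hypothetical helpers, and the assumption is that the step is stretched to the next multiple of interval_seconds.

```python
from fractions import Fraction


def next_bc_time(now_s, interval_s):
    """First multiple of interval_s strictly after now_s (hypothetical)."""
    return (now_s // interval_s + 1) * interval_s


def snapped_dt(now_s, interval_s):
    """Step length if dt is stretched to land on the derived BC time."""
    return next_bc_time(now_s, interval_s) - now_s


# Simulation time sits 1/200 s before the real BC time of 01:30 (5400 s).
now = Fraction(5400) - Fraction(1, 200)

dt_correct = snapped_dt(now, 1800)  # interval_seconds matches the BC data
dt_wrong = snapped_dt(now, 3600)    # interval_seconds set by mistake
print(float(dt_correct), float(dt_wrong))
```

Under these assumptions, the correct interval gives a step of 1/200 s, while the mistaken 3600 s interval targets 02:00 and gives a step of roughly 30 minutes.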

weiwangncar commented 3 weeks ago

@brianreen We wonder if you could help review this PR? Thanks.

brianreen commented 2 weeks ago

Yes, I will help review this PR.

brianreen commented 2 weeks ago

@JeroBnd

Does this issue only occur when interval_seconds is set incorrectly?

Since I did not think #154 changed code that sets dtInterval, could you clarify what lines of code you are referring to when you say "the mitigation algorithm for dt=0 (#154)"?

wrf-model / WRF

Fix the source of the dt=0 error when using adaptive time step #2103