wrf-model / WRF

The official repository for the Weather Research and Forecasting (WRF) model

Z-staggering for URB_PARAM in Registry.EM_COMMON. Bug? #1687

Open wgustafson opened 2 years ago

wgustafson commented 2 years ago

Heng Xiao (@hengxiao80) and I have been trying to work through an issue where real.exe is crashing while trying to read URB_PARAM. This only happens when we use large domains. One thing I noticed is that URB_PARAM is not indicated to have Z-staggering in Registry.EM_COMMON. I vaguely remember from work I did many years ago in WRF-Chem that variables that use one of the constant dimensions must have that marked as Z-staggered, even if it really isn't thought of that way. Is that correct? If so, then this is likely a bug for the definition of URB_PARAM that may also require a few changes to looping parameters in the code.

The symptom we are seeing is a buffer overflow in the read routines. We actually have the urban parameterization turned off, so URB_PARAM is not in our input met_em files. So, this read attempt is superfluous. Our workaround is to just turn off the "i1" read in Registry.EM_COMMON, but this may not work for others' situations if they run into this same problem.
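The workaround above amounts to editing the IO column of the URB_PARAM `state` entry in the Registry. A minimal sketch of that edit follows, assuming the usual `state` column order (type, symbol, dims, use, time levels, stagger, IO, ...). The sample line is illustrative only and is not copied from Registry.EM_COMMON; the helper `disable_io` is a hypothetical name, and it only handles an IO field that is exactly the token given.

```python
import re

# A Registry field is either a double-quoted string or a run of non-blanks.
TOKEN = re.compile(r'"[^"]*"|\S+')
IO_FIELD = 7  # 0-based position of the IO column in a `state` entry (assumed layout)


def disable_io(line: str, symbol: str, io_token: str) -> str:
    """Set the IO field to '-' when `line` is the state entry for `symbol`
    and its IO field is exactly `io_token`; otherwise return `line` unchanged."""
    tokens = list(TOKEN.finditer(line))
    if len(tokens) <= IO_FIELD:
        return line
    if tokens[0].group() != "state":
        return line
    if tokens[2].group().lower() != symbol.lower():
        return line
    io = tokens[IO_FIELD]
    if io.group() != io_token:
        return line
    # Splice in place so the rest of the line keeps its spacing.
    return line[:io.start()] + "-" + line[io.end():]


# Illustrative entry, NOT the real Registry.EM_COMMON line.
example = ('state  real  URB_PARAM  i{urb}j  misc  1  -  i1  '
           '"URB_PARAM"  "urban parameters (illustrative)"  "dimensionless"')

print(disable_io(example, "URB_PARAM", "i1"))
```

Registry changes only take effect after a clean rebuild, and, as noted above, this only suits runs that do not need URB_PARAM.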

Note that we tried changing the staggering to "Z", but we still got the crash if we left the read turned on. So, I do not know what the actual issue is. However, I wanted to report the possible Z-staggering issue for the urban developers to be aware of.

The issue happens for the 2nd domain when processing domains set up with

 max_dom                             = 2
 e_we                                = 751, 2146
 e_sn                                = 866, 2776
 e_vert                              = 150,  150

The error messages reported in rsl.error.0000 are:

d02 2019-01-29_12:00:00  NetCDF error in wrf_io.F90, line        2885  Varname URB_PARAM
0: MPICH2 Error: Failed to register memory address 0x2aabb2fcc000 with length of 0x641000 (6557696) bytes.
0: Unable to register memory at the requested address. This may indicate an application bug. See process virtual memory mappings below:

< EXCLUDING LOTS OF EXTRA OUTPUT WE HAVE TURNED ON THAT IS IRRELEVANT... >

Rank 0 [Fri Feb  4 20:43:06 2022] [c0-0c0s3n3] Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(448)........................: MPI_Scatterv(sbuf=0x2aac0229e020, scnts=0x13f9f4f0, displs=0x14d21e80, MPI_CHAR, rbuf=0x7fffff9934d0, rcount=6614784, MPI_CHAR, root=0, comm=0xc4000002) failed
MPIR_Scatterv_impl(234)...................:
MPIR_CRAY_Scatterv(178)...................:
MPIR_Scatterv(164)........................:
MPIR_Waitall_impl(221)....................:
MPIDI_CH3I_Progress(537)..................:
MPID_nem_mpich_blocking_recv(1140)........:
MPID_nem_gni_poll(1632)...................:
MPID_nem_gni_progress_lmt_start_send(2010):
MPID_nem_gni_lmt_mem_register(130)........:
MPID_nem_gni_dreg_register(569)...........: UDREG_Register 1
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
real_debug.exe     000000000BA3E624  Unknown               Unknown  Unknown
libpthread-2.26.s  00002AAAADE272D0  Unknown               Unknown  Unknown
libc-2.26.so       00002AAAAE270520  gsignal               Unknown  Unknown
libc-2.26.so       00002AAAAE271B01  abort                 Unknown  Unknown
libmpich_intel.so  00002AAAADA545F8  Unknown               Unknown  Unknown
libmpich_intel.so  00002AAAAD9DD462  MPIR_Handle_fatal     Unknown  Unknown
libmpich_intel.so  00002AAAAD9DD59E  MPIR_Err_return_c     Unknown  Unknown
libmpich_intel.so  00002AAAAD8F90E2  MPI_Scatterv          Unknown  Unknown
real_debug.exe     0000000001E854D6  Unknown               Unknown  Unknown
real_debug.exe     00000000014DCBCB  wrf_global_to_pat        8393  module_dm.f90
real_debug.exe     00000000014DA6C0  wrf_global_to_pat        8223  module_dm.f90
real_debug.exe     000000000140EF92  call_pkg_and_dist       23240  module_io.f90
real_debug.exe     0000000001407BA5  call_pkg_and_dist       22691  module_io.f90
real_debug.exe     0000000001406CB9  call_pkg_and_dist       22600  module_io.f90
real_debug.exe     00000000013FB046  wrf_read_field1_        21178  module_io.f90
real_debug.exe     00000000013FA8B8  wrf_read_field_         20968  module_io.f90
real_debug.exe     000000000512067A  wrf_ext_read_fiel         130  wrf_ext_read_field.f90
real_debug.exe     000000000420D54C  input_wrf_               1643  input_wrf.f90
real_debug.exe     0000000004031DC3  module_io_domain_         898  module_io_domain.f90
real_debug.exe     0000000000417676  med_sidata_input_         414  real_em.f90
real_debug.exe     0000000000415289  MAIN__                    244  real_em.f90
real_debug.exe     0000000000413852  Unknown               Unknown  Unknown
libc-2.26.so       00002AAAAE25B34A  __libc_start_main     Unknown  Unknown
real_debug.exe     000000000041376A  Unknown               Unknown  Unknown

We can provide sample inputs from met_em, namelist, etc. if desired. But I will not attach them all here due to size.

dudhia commented 2 years ago

My recollection is that the Z staggering only applies to the output/input dimension and the variable in the code is always kde. What would we need to do to repeat your error with standard code? Sounds like a memory size issue. The urban array can be quite large. Does the number of processors you use for real.exe matter?

hengxiao80 commented 2 years ago

Yes. I also think it might be some kind of memory size issue. URB_PARAM is very large: a real array of size nx (2146) × ny (2776) × 132 (a constant), all allocated on rank 0. (If I read the code correctly, it actually gets allocated twice before dist_on_comm0 sends the global array out.) The machine and the number of nodes used matter to some extent. We have had successful real.exe runs with the same setup on a different machine. But even there, we saw a strange sensitivity to the NetCDF io_form (2 vs. 11) used in the namelist.input for real.exe, and we had to use a large number of nodes.
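A back-of-envelope check of the sizes quoted above, as a sketch under stated assumptions: 4-byte reals (an assumption, not confirmed in this thread), the grid dimensions taken directly from the d02 namelist values, and the `rcount` value copied from the MPI_Scatterv line in the error log.

```python
NX, NY = 2146, 2776   # e_we, e_sn for d02, from the namelist in the report
N_URB = 132           # third dimension of URB_PARAM, as stated in this thread
BYTES_PER_REAL = 4    # ASSUMPTION: default 4-byte reals
INT32_MAX = 2**31 - 1 # largest value a signed 32-bit integer can hold
RCOUNT = 6614784      # rcount (in MPI_CHAR units) from the MPI_Scatterv error


def field_bytes(nx: int, ny: int, nz: int, bytes_per_value: int = BYTES_PER_REAL) -> int:
    """Size in bytes of one global nx * ny * nz array."""
    return nx * ny * nz * bytes_per_value


total = field_bytes(NX, NY, N_URB)

# If rcount is one rank's share of URB_PARAM, it should divide evenly into
# (horizontal points per patch) * 132 * 4 bytes.
points_per_patch, remainder = divmod(RCOUNT, N_URB * BYTES_PER_REAL)

print(f"global URB_PARAM: {total} bytes ({total / 2**30:.2f} GiB)")
print(f"exceeds signed 32-bit range: {total > INT32_MAX}")
print(f"rcount -> {points_per_patch} horizontal points per patch, remainder {remainder}")
```

Under these assumptions a single global copy is roughly 2.9 GiB, which is more bytes than a signed 32-bit integer can count, and `rcount` divides evenly by 132 × 4. Whether either fact is connected to the MPI_Scatterv failure is not established in this thread.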

For the machine we use now, increasing the node and/or MPI rank counts does not seem to solve the problem (we also tried the large-memory nodes on the cluster), even though we had more memory per node. We had to turn off the initialization of URB_PARAM completely for real.exe to work for the larger/higher-res. domain. But for the smaller/lower-res. domain we have not had any problems on any of the machines. We also had no problem when we ran real.exe on the larger domain but with met_em data with fewer vertical levels.

Thank you,

Heng Xiao

dudhia commented 2 years ago

Do you need to run real.exe on the nest? Is it the nest that causes the problems?

hengxiao80 commented 2 years ago

Thanks for the clarifying question. real.exe is run for the nested domain only to generate the wrfqnainp file for the aerosol-aware Thompson microphysics (option 28). We actually don't need the wrfinput for the nested domain, but I don't know how to bypass the wrfinput processing.

Heng Xiao

From: dudhia @.> Date: Tuesday, February 22, 2022 at 1:07 PM To: wrf-model/WRF @.> Cc: Xiao, Heng @.>, Mention @.> Subject: Re: [wrf-model/WRF] Z-staggering for URB_PARAM in Registry.EM_COMMON. Bug? (Issue #1687) Do you need to run real.exe on the nest? Is it the nest that causes the problems?

On Tue, Feb 22, 2022 at 1:28 PM Heng Xiao @.***> wrote:

Yes. I also think it might be some kind of memory size issue. URB_PARAM is very large, a real array the size of nx (2146) ny (2776) 132 (a constant) all allocated on rank 0 (if I read the code correctly, it actually gets allocated twice somehow before dist_on_comm0 sent the global array out.). The machine and the number of nodes used matter to some extent. We have had successful real.exe runs with the same setup on a different machine. But even there, we saw strange sensitivity to the NETCDF io_form (2 vs 11) we used in the namelist.input for real.exe and we had to use a large number of nodes.

For the machine we use now, increasing node and/or MPI rank counts does not seem to solve the problem (we also tried to use the large-memory nodes on the cluster) even thought we had more memory per node. We had to turn off the initialization of URB_PARAM completely for real.exe to work for the larger/higher-res. domain. But for the smaller/lower-res. domain we have not had any problems on any of the machines. We also had no problem when we ran real.exe on the larger domain but with met_em data with fewer vertical levels.

Thank you,

Heng Xiao

From: dudhia @.> Date: Tuesday, February 22, 2022 at 10:38 AM To: wrf-model/WRF @.> Cc: Xiao, Heng @.>, Mention @.> Subject: Re: [wrf-model/WRF] Z-staggering for URB_PARAM in Registry.EM_COMMON. Bug? (Issue #1687) Check twice before you click! This email originated from outside PNNL.

My recollection is that the Z staggering only applies to the output/input dimension and the variable in the code is always kde. What would we need to do to repeat your error with standard code? Sounds like a memory size issue. The urban array can be quite large. Does the number of processors you use for real.exe matter?

On Tue, Feb 22, 2022 at 10:05 AM William Gustafson @.***> wrote:

Heng Xiao @.*** https://github.com/hengxiao80) and I have been trying to work through an issue where real.exe is crashing while trying to read URB_PARAM. This only happens when we use large domains. One thing I noticed is that URB_PARAM is not indicated to have Z-staggering in Registry.EM_COMMON. I vaguely remember from work I did many years ago in WRF-Chem that variables that use one of the constant dimensions must have that marked as Z-staggered, even if it really isn't thought of that way. Is that correct? If so, then this is likely a bug for the definition of URB_PARAM that may also require a few changes to looping parameters in the code.

The symptom we are seeing is a buffer overflow in the read routines. We actually have the urban parameterization turned off, so URB_PARAM is not in our input met_em files. So, this read attempt is superfluous. Our workaround is to just turn off the "i1" read in Registry.EM_COMMON, but this may not work for others' situations if they run into this same problem.

Note that we tried changing the staggering to "Z", but we still got the crash if we left the read turned on. So, I do not know what the actual issue is. However, I wanted to report the possible Z-staggering issue for the urban developers to be aware of.

The issue happens for the 2nd domain when processing domains set up with

max_dom = 2
e_we = 751, 2146
e_sn = 866, 2776
e_vert = 150, 150

The error messages reported in rsl.error.0000 are:

d02 2019-01-29_12:00:00 NetCDF error in wrf_io.F90, line 2885 Varname URB_PARAM
0: MPICH2 Error: Failed to register memory address 0x2aabb2fcc000 with length of 0x641000 (6557696) bytes.
0: Unable to register memory at the requested address. This may indicate an application bug. See process virtual memory mappings below:

< EXCLUDING LOTS OF EXTRA OUTPUT WE HAVE TURNED ON THAT IS IRRELEVANT...

Rank 0 [Fri Feb 4 20:43:06 2022] [c0-0c0s3n3] Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(448)........................: MPI_Scatterv(sbuf=0x2aac0229e020, scnts=0x13f9f4f0, displs=0x14d21e80, MPI_CHAR, rbuf=0x7fffff9934d0, rcount=6614784, MPI_CHAR, root=0, comm=0xc4000002) failed
MPIR_Scatterv_impl(234)...................:
MPIR_CRAY_Scatterv(178)...................:
MPIR_Scatterv(164)........................:
MPIR_Waitall_impl(221)....................:
MPIDI_CH3I_Progress(537)..................:
MPID_nem_mpich_blocking_recv(1140)........:
MPID_nem_gni_poll(1632)...................:
MPID_nem_gni_progress_lmt_start_send(2010):
MPID_nem_gni_lmt_mem_register(130)........:
MPID_nem_gni_dreg_register(569)...........: UDREG_Register 1
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
real_debug.exe 000000000BA3E624 Unknown Unknown Unknown
libpthread-2.26.s 00002AAAADE272D0 Unknown Unknown Unknown
libc-2.26.so 00002AAAAE270520 gsignal Unknown Unknown
libc-2.26.so 00002AAAAE271B01 abort Unknown Unknown
libmpich_intel.so 00002AAAADA545F8 Unknown Unknown Unknown
libmpich_intel.so 00002AAAAD9DD462 MPIR_Handle_fatal Unknown Unknown
libmpich_intel.so 00002AAAAD9DD59E MPIR_Err_return_c Unknown Unknown
libmpich_intel.so 00002AAAAD8F90E2 MPI_Scatterv Unknown Unknown
real_debug.exe 0000000001E854D6 Unknown Unknown Unknown
real_debug.exe 00000000014DCBCB wrf_global_to_pat 8393 module_dm.f90
real_debug.exe 00000000014DA6C0 wrf_global_to_pat 8223 module_dm.f90
real_debug.exe 000000000140EF92 call_pkg_and_dist 23240 module_io.f90
real_debug.exe 0000000001407BA5 call_pkg_and_dist 22691 module_io.f90
real_debug.exe 0000000001406CB9 call_pkg_and_dist 22600 module_io.f90
real_debug.exe 00000000013FB046 wrf_readfield1 21178 module_io.f90
real_debug.exe 00000000013FA8B8 wrf_readfield 20968 module_io.f90
real_debug.exe 000000000512067A wrf_ext_read_fiel 130 wrf_ext_read_field.f90
real_debug.exe 000000000420D54C inputwrf 1643 input_wrf.f90
real_debug.exe 0000000004031DC3 module_iodomain 898 module_io_domain.f90
real_debug.exe 0000000000417676 med_sidatainput 414 real_em.f90
real_debug.exe 0000000000415289 MAIN 244 real_em.f90
real_debug.exe 0000000000413852 Unknown Unknown Unknown
libc-2.26.so 00002AAAAE25B34A libc_start_main Unknown Unknown
real_debug.exe 000000000041376A Unknown Unknown Unknown

We can provide sample inputs from met_em, namelist, etc. if desired. But I will not attach them all here due to size.


dudhia commented 2 years ago

Depending on what fields it reads, there may be a way of just interpolating them from the coarse grid, like we can do with SST.

On Tue, Feb 22, 2022 at 2:18 PM Heng Xiao wrote:

Thanks for the clarifying question. It is run for the nested domain only to generate the wrfqnainp file for the aerosol-aware Thompson microphysics (option 28). We actually don't need the wrfinput for the nested domain, but I don't know how to bypass the processing for wrfinput.

Heng Xiao

From: dudhia
Date: Tuesday, February 22, 2022 at 1:07 PM
Subject: Re: [wrf-model/WRF] Z-staggering for URB_PARAM in Registry.EM_COMMON. Bug? (Issue #1687)

Do you need to run real.exe on the nest? Is it the nest that causes the problems?


jamiebresch commented 2 years ago

@hengxiao80 Please click "view it on GitHub" in your email to see how your reply shows up on GitHub.

wgustafson commented 2 years ago

Staggering...

Back to @dudhia's comment about whether staggering is needed: I thought we had to have Z indicated in the stagger column of the registry file to properly handle the output, to prevent an indexing issue for "constant" dims. The indexing when accessing the array inside the non-I/O code is a different matter. The chemistry emissions are a good example to go by. E.g., e_so2 has dims of "i+j", so the "Z" is turned on for the stagger (+=kemit). When looping over e_so2 in module_cbmz_addemiss.F, the vertical indexing is done from kms to kemit, so the staggering in the registry doesn't matter there. It only matters for the I/O, as you suggested.

If this is just a red herring and I am mistaken, then we can close this issue in terms of the possible bug for URB_PARAM. Otherwise, we might want to forward the issue to the urban physics developer so they are aware of it.

Memory issue...

The secondary issue of us possibly running into a memory problem is a different topic. We definitely can run into memory issues if we are not careful with these giant domains. However, I do not think it is a simple out-of-memory problem, because the code works fine when we do not do the read for URB_PARAM (by turning off the i1 read command in the registry), and we have more levels for the other volumetric variables than the number of urban parameters in URB_PARAM's constant dimension. So, something like U is bigger than URB_PARAM and should trigger a memory issue if memory alone were the problem (which it will if we don't use enough nodes with io_form=11). I'm suspicious that something might be awry with our MPI buffer handling in WRF, or with a setting in the MPI library's setup/configuration. (We are forced to use OpenMPI on this cluster, which is an MPI vendor I have not used elsewhere.)
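The size comparison between U and URB_PARAM can be checked with a small sketch. It assumes the usual WRF convention that an unstaggered horizontal or vertical extent is one less than the corresponding e_we, e_sn, or e_vert value, and that U is staggered in x only; the helper name `field_elements` is hypothetical and not part of WRF.

```python
# Element counts for the d02 fields, using the namelist values from this issue.
# Assumption: unstaggered extents are e_* - 1; staggered extents are e_*.
E_WE, E_SN, E_VERT = 2146, 2776, 150
N_URB = 132  # constant third dimension of URB_PARAM


def field_elements(nz: int, stag_x: bool = False, stag_y: bool = False) -> int:
    """Number of elements in a global 3-D field with nz entries in the third dim."""
    nx = E_WE if stag_x else E_WE - 1
    ny = E_SN if stag_y else E_SN - 1
    return nx * ny * nz


# U: staggered in x, on the E_VERT - 1 mass levels.
u_elems = field_elements(E_VERT - 1, stag_x=True)
# URB_PARAM: unstaggered horizontally, 132 entries in its constant dimension.
urb_elems = field_elements(N_URB)

print(f"U         : {u_elems:,} elements")
print(f"URB_PARAM : {urb_elems:,} elements")
print(f"ratio U / URB_PARAM = {u_elems / urb_elems:.3f}")
```

Under these assumptions U has roughly 13% more elements than URB_PARAM, which is consistent with the argument that total size alone does not single out URB_PARAM.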

If we could hack the code to handle just the aerosol input file for the inner domain, it would be a quicker way of handling our current situation, where we don't need anything else for the inner domain. Off the top of my head, I'm not sure what would be involved to do this or if it is worth our development effort at this point.

dudhia commented 2 years ago

Regarding staggering, I only know about the use of Z for normal vertically dimensioned arrays, where Z adds 1 for variables like W and PH when outputting. There may be other uses I am unaware of. Maybe it is a memory issue related to the I/O only. I don't know enough about the mechanism of reading NetCDF to say more.
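As an illustration of the "Z adds 1" behavior for ordinary vertical dimensions only, here is a minimal sketch. It assumes that an unstaggered field is written with e_vert - 1 levels and a Z-staggered one with e_vert levels; it says nothing about how "constant" dimensions such as the one used by URB_PARAM are treated, which is the open question in this thread. The function name is hypothetical.

```python
def levels_on_file(e_vert: int, z_staggered: bool) -> int:
    """Vertical levels written for a field on the normal vertical dimension.

    Assumption: Z-staggered fields (e.g. W, PH) get e_vert levels,
    unstaggered fields (e.g. T) get e_vert - 1.
    """
    return e_vert if z_staggered else e_vert - 1


e_vert = 150  # value used for both domains in this issue
w_levels = levels_on_file(e_vert, z_staggered=True)
t_levels = levels_on_file(e_vert, z_staggered=False)

print(f"W  (Z-staggered): {w_levels} levels")
print(f"T  (unstaggered): {t_levels} levels")
```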
