wrfchem-leeds / WRFotron

Tools to automatise WRF-Chem runs with re-initialised meteorology
https://wrfchem-leeds.github.io/WRFotron/
GNU Affero General Public License v3.0

Errors on clean WRFotron #42

Closed: bjsilver closed this issue 3 years ago

bjsilver commented 3 years ago

Hello

I recently cloned a clean WRFotron repo (after the recent bug fixes) and tried to run the default domain/time. I got some errors that seem to lead to main crashing, and I think they may be related to the latest bugfix. I've tried to work out what is going on, without success.

In pre.bash, all the log files are fine except diurnal_emiss.out, which has this error:

(0)     ==== Preparing the emission files ====
(0)     == Emission source files not accessible, failure in emission_file_setup, check these paths:
(0)     /nobackup/eebjs/simulation_WRFChem4.2_test/run/base/2015-10-11_18:00:00-2015-10-13_00:00:00/wrfchemi_00z_d02
(0)     /nobackup/eebjs/simulation_WRFChem4.2_test/run/base/2015-10-11_18:00:00-2015-10-13_00:00:00/wrfchemi_12z_d02
processing /nobackup/eebjs/simulation_WRFChem4.2_test/run/base/2015-10-11_18:00:00-2015-10-13_00:00:00/wrfchemi_00z_d01
writing updated /nobackup/eebjs/simulation_WRFChem4.2_test/run/base/2015-10-11_18:00:00-2015-10-13_00:00:00/wrfchemi_00z_d01
processing /nobackup/eebjs/simulation_WRFChem4.2_test/run/base/2015-10-11_18:00:00-2015-10-13_00:00:00/wrfchemi_12z_d01
writing updated /nobackup/eebjs/simulation_WRFChem4.2_test/run/base/2015-10-11_18:00:00-2015-10-13_00:00:00/wrfchemi_12z_d01

I also noticed this message in the file: "(0) No diurnal cycle applied to the following emission variables, because of lack of sector information (was this intended?)". Here is the full file: diurnal_emiss.out
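
In case it helps, this is roughly how I checked whether those d02 paths from the error actually exist and are readable (just a quick sketch using the paths printed above):

import os

paths = [
    "/nobackup/eebjs/simulation_WRFChem4.2_test/run/base/2015-10-11_18:00:00-2015-10-13_00:00:00/wrfchemi_00z_d02",
    "/nobackup/eebjs/simulation_WRFChem4.2_test/run/base/2015-10-11_18:00:00-2015-10-13_00:00:00/wrfchemi_12z_d02",
]
for path in paths:
    # report whether each file exists and whether it is readable by this user
    print(path, "exists:", os.path.exists(path), "readable:", os.access(path, os.R_OK))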

When main started, it created the first wrfout file fine, then crashed on the second one. In the rsl files there is this error:

 mediation_integrate: med_read_wrf_chem_emissions: Open file wrfchemi_12z_d01
 HOURLY EMISSIONS UPDATE TIME        0.0       0.0
mediation_integrate: med_read_wrf_chem_emissions: Read emissions for time 2015-10-11_18:00:00
mediation_integrate: med_read_wrf_chem_emissions: Skip emissions    1
d01 2015-10-11_18:00:00  input_wrf: begin
d01 2015-10-11_18:00:00 module_io.F: in wrf_inquire_filename
d01 2015-10-11_18:00:00  input_wrf: filestate =          103
d01 2015-10-11_18:00:00  input_wrf: dryrun =  F  switch           31
d01 2015-10-11_18:00:00 module_io.F: in wrf_inquire_filename
d01 2015-10-11_18:00:00  Error trying to read metadata

which I think is what causes this error to show up in main.bash.e*:

MPI_ABORT was invoked on rank 11 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[d8s0b2.arc4.leeds.ac.uk:04943] 127 more processes have sent help message help-mpi-api.txt / mpi-abort
[d8s0b2.arc4.leeds.ac.uk:04943] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Does anyone know what might be causing the diurnal_emiss error, and whether this is what causes the crash in main? Also, has anyone run the test case since the bug fix, and does it work OK for you? It could just be an issue at my end.
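
For reference, this is a rough sketch of how I searched the rsl files for that metadata error (run from the run directory, with the search string taken from the log above):

import glob

# look through all rsl.error.* and rsl.out.* files for the metadata-read failure
for fname in sorted(glob.glob("rsl.error.*") + glob.glob("rsl.out.*")):
    with open(fname, errors="ignore") as f:
        for i, line in enumerate(f, 1):
            if "Error trying to read metadata" in line:
                print(f"{fname}:{i}: {line.strip()}")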

Cheers

lukeconibear commented 3 years ago

I ran a successful test with the default WRFotron the other day.

I thought those messages in diurnal_emiss.out were a quirk of WRF_UoM_EMIT, because we didn't have sector information for those sources (e.g. aircraft). If you run python plotwrfchemi.py, is the diurnal cycle applied to the total emissions or not?
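
If plotwrfchemi.py isn't to hand, a quick check along these lines would also show whether the hourly totals vary over the day. This is only a sketch: E_CO is just an example species name, and the (Time, z, south_north, west_east) layout of the wrfchemi variables is assumed.

import numpy as np
from netCDF4 import Dataset

# sum one emission species over the whole domain for each hourly frame;
# if no diurnal cycle was applied, the totals will be flat across the hours
with Dataset("wrfchemi_00z_d01") as nc:
    emis = np.array(nc.variables["E_CO"][:])   # assumed shape: (Time, z, south_north, west_east)
hourly_totals = emis.sum(axis=(1, 2, 3))
print(hourly_totals)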

I also see a similar message in the rsl.error.0000 file, so I'm not sure how this is connected to the MPI_ABORT error you're getting. What's the path to this run folder? I could take a look (you'll need to give me read and execute access, +rx, all the way down).

bjsilver commented 3 years ago

Hi Luke, thanks for getting back to me. I ran plotwrfchemi.py and it showed the diurnal cycle in the emissions, so the diurnal emissions stage looks fine in that case (probably going to be some dumb mistake from me, sorry).

Thanks very much for having a look, here is the path: /nobackup/eebjs/simulation_WRFChem4.2_test/run/base/2015-10-11_18:00:00-2015-10-13_00:00:00

lukeconibear commented 3 years ago

No worries. I think this might be a memory error. Try a test of main.bash with 2 GB per core (i.e. #$ -l h_vmem=2G), and for this 24-hour test run you can probably also decrease the wall clock time to shorten the wait in the queue (i.e. #$ -l h_rt=04:00:00). This has been happening occasionally, so I'll increase the memory in the default WRFotron. Let me know how it goes.

bjsilver commented 3 years ago

Thanks Luke, I'll submit that now and get back to you. A memory issue would make sense: I didn't get this crash initially, but it started happening after I made some changes to the domain, and when I went back to the clean version it was still happening.

lukeconibear commented 3 years ago

Okay, that makes sense, especially if the timesteps (the resolution-to-timestep ratios) changed.
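
(For reference, the usual WRF rule of thumb is a model time step of about 6 × dx, with dx in km and the time step in seconds, so e.g. a 30 km grid would typically run with a ~180 s time step; changing the resolution without adjusting time_step in the namelist changes that ratio.)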

bjsilver commented 3 years ago

Hi Luke, a similar crash happened at hour 13 with h_vmem=2G. I'll try again with 4G.

bjsilver commented 3 years ago

At h_vmem=4G all the wrfout files are created, but I still get the MPI_ABORT error in main.bash.e, and the first wrfout file (at the beginning of the 6 hr met spinup) is less than half the size of the others, so possibly something went wrong there.
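
(For comparison, this is roughly how I listed the file sizes; the wrfout_d01_* pattern is just the default naming in my run directory:)

import glob
import os

# list each wrfout file with its size in MB, so the undersized one stands out
for fname in sorted(glob.glob("wrfout_d01_*")):
    print(f"{fname}  {os.path.getsize(fname) / 1e6:.1f} MB")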

lukeconibear commented 3 years ago

Okay. Could you make the simulation path readable again and I'll take a look?

bjsilver commented 3 years ago

Ok just done that (I think)

lukeconibear commented 3 years ago

Did you update the execute access too? I still can't see it, i.e. chmod -R a+rx /nobackup/eebjs/simulation_WRFChem4.2_test

bjsilver commented 3 years ago

Thanks Luke, done it now.

lukeconibear commented 3 years ago

Well, I think the MPI_ABORT message might be associated with that smaller first hour of spin-up. The remainder of the spin-up looks fine, though, and the whole of the spin-up is discarded anyway, as its only purpose is to set reasonable initial conditions for the simulation. I'm not sure if this is related to hardware, as there is not much information to go on. The rest of the simulation looks okay.
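
If you want to pin down what the smaller first file is missing, a quick comparison like this would show whether it simply has fewer variables or fewer time frames than a later one (a sketch; the two filenames are placeholders for the first and a later wrfout):

from netCDF4 import Dataset

# compare the variable list and the number of time frames of two wrfout files
def summarise(path):
    with Dataset(path) as nc:
        return set(nc.variables), len(nc.dimensions["Time"])

first_vars, first_times = summarise("wrfout_first")   # placeholder filename
later_vars, later_times = summarise("wrfout_later")   # placeholder filename
print("time frames:", first_times, "vs", later_times)
print("variables missing from the first file:", sorted(later_vars - first_vars))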