wrfchem-leeds / WRFotron

Tools to automatise WRF-Chem runs with re-initialised meteorology
https://wrfchem-leeds.github.io/WRFotron/
GNU Affero General Public License v3.0

anthro_emis segmentation fault #44

Closed ailishgraham closed 3 years ago

ailishgraham commented 3 years ago

Running anthro_emis (the /nobackup/WRFChem/anthro_emis version) with new emissions causes a segmentation fault (see the error below). The emissions netCDF files follow the same format as EDGAR-HTAP2, but include extra sectors ('emis_tot_no_awb'). The segmentation fault occurs both when reading in the individual sectors (of which there are 14) and when reading in only the total (i.e. 1 sector). The fault seems to occur once 12 files have been read in (each file is 3.5 GB in size). The error always occurs after reading in the last file, so if more than 12 files are read in (e.g. 15) it occurs on the final file (file 15). The segmentation fault prevents the final 'anthro_emis complete' statement from being printed. However, the wrfchemi_00z and wrfchemi_12z files are generated and look reasonable.
I have tried following the online help on the GEOS-Chem website (http://wiki.seas.harvard.edu/geos-chem/index.php/Segmentation_faults) to find the root of the error. The GEOS-Chem website suggests the error arises from either:

error:

will use source file for C2H6

get_src_time_ndx; src_dir,src_fn = /nobackup/ee15amg/wrf3.7.1_data/emissions/EDGARv52015_CAMS2016_MEIC2017/EDGARv5 _2015_CAMS_v4.2_2016_MEIC_v1.3_2017_Malley_C2H6_monthly_0.1x0.1.nc
get_src_time_ndx; interp_date,datesec,ntimes = 20170904 0 12
get_src_time_ndx; tndx = 9
aera_interp: raw dataset max value = 1.7067602E-08
aera_interp: raw dataset max indices = 2188 991
aera_interp: raw dataset max value = 7.5946782E-10
aera_interp: raw dataset max indices = 2048 1322
aera_interp: raw dataset max value = 3.3164188E-10
aera_interp: raw dataset max indices = 2314 1258
aera_interp: raw dataset max value = 1.1780937E-10
aera_interp: raw dataset max indices = 2178 1460
aera_interp: raw dataset max value = 8.0596892E-12
aera_interp: raw dataset max indices = 1111 1339
aera_interp: raw dataset max value = 0.0000000E+00
aera_interp: raw dataset max indices = 1 1
aera_interp: raw dataset max value = 1.3037157E-13
aera_interp: raw dataset max indices = 2854 851
aera_interp: raw dataset max value = 1.7423774E-08
aera_interp: raw dataset max indices = 2188 991
aera_interp: raw dataset max value = 1.7423773E-08
aera_interp: raw dataset max indices = 2188 991
aera_interp: raw dataset max value = 4.7226974E-13
aera_interp: raw dataset max indices = 1796 1415
aera_interp: raw dataset max value = 4.8074793E-13
aera_interp: raw dataset max indices = 1886 1401
aera_interp: raw dataset max value = 2.9458816E-12
aera_interp: raw dataset max indices = 922 1320
forrtl: severe (174): SIGSEGV, segmentation fault occurred

Image              PC                Routine            Line     Source
anthro_emis        0000000000479C33  for__signal_handl  Unknown  Unknown
libpthread-2.17.s  00007F9FA5BD45D0  Unknown            Unknown  Unknown
libc-2.17.so       00007F9FA587D71C  cfree              Unknown  Unknown
anthro_emis        00000000004AE590  for_dealloc_alloc  Unknown  Unknown
anthro_emis        000000000042D658  Unknown            Unknown  Unknown
anthro_emis        000000000044A318  Unknown            Unknown  Unknown
anthro_emis        000000000040C5E2  Unknown            Unknown  Unknown
libc-2.17.so       00007F9FA581A495  __libc_start_main  Unknown  Unknown
anthro_emis        000000000040C4E9  Unknown            Unknown  Unknown

My anthro_emis.inp file is as follows:

anthro_dir = '/nobackup/ee15amg/wrf3.7.1_data/emissions/EDGARv52015_CAMS2016_MEIC2017'
domains = 1

src_file_prefix = 'EDGARv5_2015_CAMS_v4.2_2016_MEIC_v1.3_2017Malley'
src_file_suffix = '_monthly_0.1x0.1.nc'

src_names = 'CO(28)','NOx(30)','SO2(64)','NH3(17)','BC(12)','OC(12)','PM2.5(1)','BIGALK(72)','BIGENE(56)',
            'C2H4(28)','C2H5OH(46)','C2H6(30)'

sub_categories = 'emis_ind', ! CAMS, industrial non-power+CAMS, fugitive emissions+CAMS, solvent emissions
                 'emis_dom', ! CAMS, residential energy and other + CAMS, solid waster and waste water
                 'emis_tra', ! CAMS, off road transport+CAMS, road transport
                 'emis_ene', ! CAMS, power generation
                 'emis_ship', ! CAMS, shipping
                 'emis_agr', ! CAMS, Agricultural soils+CAMS, Agricultural livestock
                 'emis_awb', ! CAMS, Agricultural waste burning
                 'emis_tot', ! CAMS, total with awb
                 'emis_tot_no_awb', ! CAMS, total with awb
                 'emis_cds', ! EDGAR-v5, aircraft - climbing and descent
                 'emis_crs', ! EDGAR-v5, aircraft - cruise
                 'emis_lto', ! EDGAR-v5 aircraft - landing and take off
                 'emis_1A1_1A2', ! EDGAR-HTAPv2.2, CH4 only, Energy manufacturing transformation
                 'emis_1A3a_c_d_e', ! EDGAR-HTAPv2.2, CH4 only, Non-road transportation
                 'emis_1A3b', ! EDGAR-HTAPv2.2, CH4 only, Road transportation
                 'emis_1A4', ! EDGAR-HTAPv2.2, CH4 only, Energy for buildings
                 'emis_1B1', ! EDGAR-HTAPv2.2, CH4 only, Fugitive from solid
                 'emis_1B2a', ! EDGAR-HTAPv2.2, CH4 only, Oil production and refineries
                 'emis_1B2b', ! EDGAR-HTAPv2.2, CH4 only, Gas production and distribution
                 'emis_2', ! EDGAR-HTAPv2.2, CH4 only, Industrial process and product use
                 'emis_4A', ! EDGAR-HTAPv2.2, CH4 only, Enteric fermentation
                 'emis_4B', ! EDGAR-HTAPv2.2, CH4 only, Manure management
                 'emis_4C_4D', ! EDGAR-HTAPv2.2, CH4 only, Agricultural soils
                 'emis_4F', ! EDGAR-HTAPv2.2, CH4 only, Agricultural waste burning
                 'emis_6A_6C', ! EDGAR-HTAPv2.2, CH4 only, Solid waste disposal
                 'emis_6B', ! EDGAR-HTAPv2.2, CH4 only, Waste water
                 'emis_7A' ! EDGAR-HTAPv2.2, CH4 only, Fossil Fuel Fires

serial_output = .false.
!data_yrs_offset = 2 ! make sure to update this!
data_yrs_offset = 1 ! make sure to update this!
emissions_zdim_stag = 1

! make sure to update these dates!
start_data_time = '2016-01-01_00:00:00'
stop_data_time = '2016-12-31_00:00:00'

emis_map =
!'CO->CO(emis_tot)',
!'NO->0.8*NOx(emis_tot)',
!'NO2->0.2*NOx(emis_tot)',
!'SO2->SO2(emis_tot)',
!'NH3->NH3(emis_tot)',
!'ECI(a)->0.1*BC(emis_tot)',
!'ECJ(a)->0.9*BC(emis_tot)',
!'ORGI(a)->0.1*OC(emis_tot)',
!'PM25I(a)->0.1*PM2.5(emis_tot)'

    'CO_TRA->CO(emis_tra)','CO_IND->CO(emis_ind)',
        'CO_RES->CO(emis_dom)','CO_POW->CO(emis_ene)',
        'CO_SHP->CO(emis_ship)','CO_CDS->CO(emis_cds)',
        'CO_CRS->CO(emis_crs)','CO_LTO->CO(emis_lto)',
        'CO->CO(emis_tot)','CO_NO_AWB->CO(emis_tot_no_awb)',
    'CO_AWB->CO(emis_awb)','CO_AGR->CO(emis_agr)',

        'NO_TRA->0.8*NOx(emis_tra)','NO_IND->0.8*NOx(emis_ind)',
        'NO_RES->0.8*NOx(emis_dom)','NO_POW->0.8*NOx(emis_ene)',
        'NO_SHP->0.8*NOx(emis_ship)','NO_CDS->0.8*NOx(emis_cds)',
        'NO_CRS->0.8*NOx(emis_crs)','NO_LTO->0.8*NOx(emis_lto)',
        'NO->0.8*NOx(emis_tot)','NO_NO_AWB->0.8*NOx(emis_tot_no_awb)',
    'NO_AWB->0.8*NOx(emis_awb)','NO_AGR->0.8*NOx(emis_agr)',

        'NO2_TRA->0.2*NOx(emis_tra)','NO2_IND->0.2*NOx(emis_ind)',
        'NO2_RES->0.2*NOx(emis_dom)','NO2_POW->0.2*NOx(emis_ene)',
        'NO2_SHP->0.2*NOx(emis_ship)','NO2_CDS->0.2*NOx(emis_cds)',
        'NO2_CRS->0.2*NOx(emis_crs)','NO2_LTO->0.2*NOx(emis_lto)',
        'NO2->0.2*NOx(emis_tot)','NO2_NO_AWB->0.2*NOx(emis_tot_no_awb)',
    'NO2_AWB->0.2*NOx(emis_awb)','NO2_AGR->0.2*NOx(emis_agr)',

        'SO2_TRA->SO2(emis_tra)','SO2_IND->SO2(emis_ind)',
        'SO2_RES->SO2(emis_dom)','SO2_POW->SO2(emis_ene)',
        'SO2_SHP->SO2(emis_ship)','SO2_CDS->SO2(emis_cds)',
        'SO2_CRS->SO2(emis_crs)','SO2_LTO->SO2(emis_lto)',
        'SO2->SO2(emis_tot)','SO2_NO_AWB->SO2(emis_tot_no_awb)',
    'SO2_AWB->SO2(emis_awb)','SO2_AGR->SO2(emis_agr)',

    'NH3_TRA->NH3(emis_tra)','NH3_IND->NH3(emis_ind)',
        'NH3_RES->NH3(emis_dom)','NH3_POW->NH3(emis_ene)',
        'NH3_SHP->NH3(emis_ship)','NH3_CDS->NH3(emis_cds)',
        'NH3_CRS->NH3(emis_crs)','NH3_LTO->NH3(emis_lto)',
        'NH3->NH3(emis_tot)','NH3_NO_AWB->NH3(emis_tot_no_awb)',
        'NH3_AWB->NH3(emis_awb)','NH3_AGR->NH3(emis_agr)',

    'ECI_TRA(a)->0.1*BC(emis_tra)','ECI_IND(a)->0.1*BC(emis_ind)',
        'ECI_RES(a)->0.1*BC(emis_dom)','ECI_POW(a)->0.1*BC(emis_ene)',
        'ECI_SHP(a)->0.1*BC(emis_ship)','ECI_CDS(a)->0.1*BC(emis_cds)',
        'ECI_CRS(a)->0.1*BC(emis_crs)','ECI_LTO(a)->0.1*BC(emis_lto)',
        'ECI(a)->0.1*BC(emis_tot)','ECI_NO_AWB(a)->0.1*BC(emis_tot_no_awb)',
        'ECI_AGRI(a)->0.1*BC(emis_agr)','ECI_AWB(a)->0.1*BC(emis_awb)',

        'ECJ_TRA(a)->0.9*BC(emis_tra)','ECJ_IND(a)->0.9*BC(emis_ind)',
        'ECJ_RES(a)->0.9*BC(emis_dom)','ECJ_POW(a)->0.9*BC(emis_ene)',
        'ECJ_SHP(a)->0.9*BC(emis_ship)','ECJ_CDS(a)->0.9*BC(emis_cds)',
        'ECJ_CRS(a)->0.9*BC(emis_crs)','ECJ_LTO(a)->0.9*BC(emis_lto)',
        'ECJ(a)->0.9*BC(emis_tot)','ECJ_NO_AWB(a)->0.9*BC(emis_tot_no_awb)',
        'ECJ_AGRI(a)->0.9*BC(emis_agr)','ECJ_AWB(a)->0.9*BC(emis_awb)',

    'ORGI_TRA(a)->0.1*OC(emis_tra)','ORGI_IND(a)->0.1*OC(emis_ind)',
        'ORGI_RES(a)->0.1*OC(emis_dom)','ORGI_POW(a)->0.1*OC(emis_ene)',
        'ORGI_SHP(a)->0.1*OC(emis_ship)','ORGI_CDS(a)->0.1*OC(emis_cds)',
        'ORGI_CRS(a)->0.1*OC(emis_crs)','ORGI_LTO(a)->0.1*OC(emis_lto)',
        'ORGI(a)->0.1*OC(emis_tot)','ORGI_NO_AWB(a)->0.1*OC(emis_tot_no_awb)',
        'ORGI_AGR(a)->0.1*OC(emis_agr)','ORGI_AWB(a)->0.1*OC(emis_awb)',

        'PM25I_TRA(a)->0.1*PM2.5(emis_tra)','PM25I_IND(a)->0.1*PM2.5(emis_ind)',
        'PM25I_RES(a)->0.1*PM2.5(emis_dom)','PM25I_POW(a)->0.1*PM2.5(emis_ene)',
        'PM25I_SHP(a)->0.1*PM2.5(emis_ship)','PM25I_CDS(a)->0.1*PM2.5(emis_cds)',
        'PM25I_CRS(a)->0.1*PM2.5(emis_crs)','PM25I_LTO(a)->0.1*PM2.5(emis_lto)',
        'PM25I(a)->0.1*PM2.5(emis_tot)','PM25I_NO_AWB(a)->0.1*PM2.5(emis_tot_no_awb)',
        'PM25I_AGR(a)->0.1*PM2.5(emis_agr)','PM25I_AWB(a)->0.1*PM2.5(emis_awb)',

        'BIGALK_TRA->BIGALK(emis_tra)','BIGALK_IND->BIGALK(emis_ind)',
        'BIGALK_RES->BIGALK(emis_dom)','BIGALK_POW->BIGALK(emis_ene)',
        'BIGALK_SHP->BIGALK(emis_ship)','BIGALK_CDS->BIGALK(emis_cds)',
        'BIGALK_CRS->BIGALK(emis_crs)','BIGALK_LTO->BIGALK(emis_lto)',
        'BIGALK->BIGALK(emis_tot)','BIGALK_NO_AWB->BIGALK(emis_tot_no_awb)',
        'BIGALK_AGR->BIGALK(emis_agr)','BIGALK_AWB->BIGALK(emis_awb)',

        'BIGENE_TRA->BIGENE(emis_tra)','BIGENE_IND->BIGENE(emis_ind)',
        'BIGENE_RES->BIGENE(emis_dom)','BIGENE_POW->BIGENE(emis_ene)',
        'BIGENE_SHP->BIGENE(emis_ship)','BIGENE_CDS->BIGENE(emis_cds)',
        'BIGENE_CRS->BIGENE(emis_crs)','BIGENE_LTO->BIGENE(emis_lto)',
        'BIGENE->BIGENE(emis_tot)','BIGENE_NO_AWB->BIGENE(emis_tot_no_awb)',
        'BIGENE_AGR->BIGENE(emis_agr)','BIGENE_AWB->BIGENE(emis_awb)',

        'C2H4_TRA->C2H4(emis_tra)','C2H4_IND->C2H4(emis_ind)',
        'C2H4_RES->C2H4(emis_dom)','C2H4_POW->C2H4(emis_ene)',
        'C2H4_SHP->C2H4(emis_ship)','C2H4_CDS->C2H4(emis_cds)',
        'C2H4_CRS->C2H4(emis_crs)','C2H4_LTO->C2H4(emis_lto)',
        'C2H4->C2H4(emis_tot)','C2H4_NO_AWB->C2H4(emis_tot_no_awb)',
        'C2H4_AGR->C2H4(emis_agr)','C2H4_AWB->C2H4(emis_awb)',

        'C2H6_TRA->C2H6(emis_tra)','C2H6_IND->C2H6(emis_ind)',
        'C2H6_RES->C2H6(emis_dom)','C2H6_POW->C2H6(emis_ene)',
        'C2H6_SHP->C2H6(emis_ship)','C2H6_CDS->C2H6(emis_cds)',
        'C2H6_CRS->C2H6(emis_crs)','C2H6_LTO->C2H6(emis_lto)',
        'C2H6->C2H6(emis_tot)','C2H6_NO_AWB->C2H6(emis_tot_no_awb)',
        'C2H6_AGR->C2H6(emis_agr)','C2H6_AWB->C2H6(emis_awb)'

/

lukeconibear commented 3 years ago

Segmentation faults are often caused by exceeding memory limits. Assuming this is being run within pre.bash, and not manually, can you try increasing the memory (e.g., #$ -l h_vmem=128G)?

ailishgraham commented 3 years ago

Hi Luke,

I also tried this, sorry, I should have said. I increased the memory to 128 GB as you suggested and still got the same error. I guess it isn't possible to run pre.bash over multiple nodes? I think I saw that the memory limit is 192 GB for the nodes we use; are there any larger ones we can request? If not, perhaps a solution would be to run anthro_emis multiple times in chunks and (if this solves the issue) add some Python code to merge the netCDF files created?

lukeconibear commented 3 years ago

Okay. Did 192 GB fail too? The preprocessors are compiled from serial code, so they would need to be rewritten to run in parallel. I suppose you could run anthro_emis over subsets of species instead of all of them together and then add the results back to the same wrfchemi files, though this is a bit of a hack.
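
As a rough illustration of the subset idea, the sketch below splits src_names into chunks and keeps only the emis_map entries that belong to each chunk, so that each chunk could be written to its own anthro_emis.inp. The function names (species_of, split_namelist) are hypothetical and not part of anthro_emis or WRFotron. It assumes every emis_map entry has the single-term form target->[factor*]species(sector) seen in the namelist in this thread, and it does not merge the resulting wrfchemi files.

```python
def species_of(mapping):
    """Return the source species on the right-hand side of one emis_map entry.

    Assumes the single-term form 'target->[factor*]species(sector)',
    e.g. 'NO_TRA->0.8*NOx(emis_tra)' gives 'NOx'.
    """
    rhs = mapping.split("->", 1)[1]
    # Drop the '(sector)' part, then any leading 'factor*'.
    return rhs.split("(", 1)[0].split("*")[-1]


def split_namelist(src_names, emis_map, chunk_size):
    """Yield (src_names_subset, emis_map_subset) pairs, chunk_size species at a time."""
    for start in range(0, len(src_names), chunk_size):
        subset = src_names[start:start + chunk_size]
        # src_names entries look like 'CO(28)'; keep only the species name.
        wanted = {name.split("(", 1)[0] for name in subset}
        yield subset, [m for m in emis_map if species_of(m) in wanted]


if __name__ == "__main__":
    names = ["CO(28)", "NOx(30)", "SO2(64)"]
    mapping = [
        "CO->CO(emis_tot)",
        "NO->0.8*NOx(emis_tot)",
        "NO2->0.2*NOx(emis_tot)",
        "SO2->SO2(emis_tot)",
    ]
    for subset, entries in split_namelist(names, mapping, chunk_size=2):
        print("src_names =", ",".join(f"'{s}'" for s in subset))
        print("emis_map  =", ",".join(f"'{e}'" for e in entries))
```

Each printed pair would still need to be placed into a full namelist alongside the unchanged settings (anthro_dir, sub_categories and so on).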

ARC4 does have high-memory nodes up to 768GB. To use these you add #$ -l node_type=40core-768G to the preamble at the top of pre.bash, and then increase the memory using #$ -l h_vmem=...G.

Though 192 GB is already a lot of memory. If you request the job output to be emailed to you (as below), you can see if the job did fail from requesting too much memory.

-m be
-M email@leeds.ac.uk

If it doesn't work with 768GB of memory then the problem is something else haha.

ailishgraham commented 3 years ago

I upped the memory to 256 GB on the high-memory nodes, but if I print out the job info I can see it used a maximum of about 14.1 GB (maxvmem below), so this is not why it was crashing before. I then altered the order in which things were being read in (i.e. so that the order in which the species are read in matches src_names exactly). This seems to have got things working. I hadn't realised these needed to match (or is this just a coincidence?).

qacct -j 2798500

qname        40core-768G.q
hostname     d8mem1.arc4.leeds.ac.uk
group        EAR
owner        ee15amg
project      ENG
department   defaultdepartment
jobname      pre.bash
jobnumber    2798500
taskid       undefined
account      sge
priority     0
qsub_time    Fri Aug 27 17:44:29 2021
start_time   Fri Aug 27 17:45:14 2021
end_time     Fri Aug 27 17:51:42 2021
granted_pe   ib-edr-part-2
slots        1
failed       0
exit_status  0
ru_wallclock 388s
ru_utime     265.451s
ru_stime     71.413s
ru_maxrss    14.074MB
ru_ixrss     0.000B
ru_ismrss    0.000B
ru_idrss     0.000B
ru_isrss     0.000B
ru_minflt    2173666
ru_majflt    12
ru_nswap     0
ru_inblock   48327584
ru_oublock   74783728
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     25518
ru_nivcsw    1510
cpu          336.864s
mem          730.063GBs
io           216.952GB
iow          0.000s
maxvmem      14.073GB
arid         undefined
ar_sub_time  undefined
category     -U tomcat -l env=centos7,h_rt=10800,h_vmem=256G,node_type=40core-768G,project=arc -pe ib-edr-part-* 1
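
A minimal sketch of making the same comparison programmatically, assuming the qacct -j output has been saved as text. The helper names (to_bytes, memory_headroom) are hypothetical, and treating the size suffixes as binary multiples is an assumption about how the scheduler reports them.

```python
import re

# Assumed binary multipliers for the size suffixes in the accounting output.
_UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text):
    """Convert a size such as '14.073GB', '256G' or '0.000B' to bytes."""
    match = re.fullmatch(r"([0-9.]+)\s*([KMGT]?)B?", text.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[unit.upper() or "B"]


def memory_headroom(qacct_text):
    """Return (peak_bytes, requested_bytes) read from qacct -j output.

    peak_bytes comes from the 'maxvmem' field and requested_bytes from the
    'h_vmem=' resource request in the 'category' field.
    """
    peak = re.search(r"\bmaxvmem\s+(\S+)", qacct_text)
    requested = re.search(r"\bh_vmem=([^,\s]+)", qacct_text)
    if peak is None or requested is None:
        raise ValueError("maxvmem or h_vmem not found in the accounting text")
    return to_bytes(peak.group(1)), to_bytes(requested.group(1))


if __name__ == "__main__":
    sample = (
        "maxvmem 14.073GB\n"
        "category -U tomcat -l env=centos7,h_rt=10800,h_vmem=256G,"
        "node_type=40core-768G,project=arc -pe ib-edr-part-* 1\n"
    )
    peak, requested = memory_headroom(sample)
    print(f"peak is {peak / requested:.1%} of the request")
```

With the values from this job, the peak is well under a tenth of the 256 GB request, which supports the reading that the crash was not caused by the memory limit.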

lukeconibear commented 3 years ago

Okay. That sounds like it wasn't a memory issue.

I'm not sure I follow your solution. The list of species in src_names is the order of processing. What exactly did you change in the anthro_emis input namelist?

ailishgraham commented 3 years ago

I've been testing what fixed it by reverting the changes in the anthro_emis.inp file one by one, but have yet to find what breaks it again. To fix it:
- I first tested just reading in totals for each species (i.e. emis_tot), adding one species at a time.
- Once that worked, I read in all sectors except awb and total_no_awb ('emis_awb', 'emis_tot_no_awb'), again adding in one species at a time.
- Then I added in awb and total_no_awb last.
- I did all of those steps with high memory and then reduced the memory to see if it still worked.
- For all of the above steps I kept the mapping for each species (emis_map) in the same order as the species list (src_names), i.e. src_names = 'CO(28)', 'NOx(30)', 'SO2(64)', ... and emis_map = 'CO->CO(emis_tot)', 'NO->0.8*NOx(emis_tot)', 'NO2->0.2*NOx(emis_tot)', 'SO2->SO2(emis_tot)', ...

I have since tested not matching the order of src_names and emis_map and this still works.

Looking through the anthro_emis source code, the fault occurred after area_mapper had finished for the final species in the list and the wrfchemi files had been created. This suggests it was within the 'cleanup for next domain' step (in the anthro_emis.f90 file), as this needs to complete before the 'anthro_emis succesful' message is printed.
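
Since the namelist was edited by hand many times during this testing, a small consistency check can help rule out typos between src_names, sub_categories and emis_map. The sketch below is only that: check_emis_map is a hypothetical helper, it assumes the single-term entry form target->[factor*]species(sector) used in this thread, and it is not a diagnosis of the segmentation fault.

```python
import re

# Right-hand side of one entry: optional 'factor*', then 'species(sector)'.
ENTRY = re.compile(r"(?:[0-9.]+\*)?([^()*+]+)\(([^()]+)\)")


def check_emis_map(src_names, sub_categories, emis_map):
    """Return a list of problems; an empty list means the entries are consistent."""
    # src_names entries look like 'CO(28)'; keep only the species name.
    species = {name.split("(", 1)[0] for name in src_names}
    sectors = set(sub_categories)
    problems = []
    for entry in emis_map:
        _target, arrow, rhs = entry.partition("->")
        match = ENTRY.fullmatch(rhs) if arrow else None
        if match is None:
            problems.append(f"{entry}: not of the form target->[factor*]species(sector)")
            continue
        name, sector = match.groups()
        if name not in species:
            problems.append(f"{entry}: species {name} is not in src_names")
        if sector not in sectors:
            problems.append(f"{entry}: sector {sector} is not in sub_categories")
    return problems


if __name__ == "__main__":
    issues = check_emis_map(
        src_names=["CO(28)", "NOx(30)"],
        sub_categories=["emis_tot", "emis_awb"],
        emis_map=["CO->CO(emis_tot)", "NO->0.8*NOx(emis_tot)", "CO_AWB->CO(emis_abw)"],
    )
    for issue in issues:
        print(issue)
```

The example call deliberately misspells one sector name so that the check has something to report.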

lukeconibear commented 3 years ago

Okay. Well, maybe it was a temporary hardware glitch. I'm not sure how much value there is in persisting with replicating the old bug now that things are working again with the original settings. In summary, it sounds like there is nothing to change in the general settings and we can close this issue now.

ailishgraham commented 3 years ago

Yes I agree, thanks for the help.