xCDAT / xcdat

An extension of xarray for climate data analysis on structured grids.
https://xcdat.readthedocs.io/en/latest/
Apache License 2.0

[Refactor] Improve the performance of temporal group averaging #689

Closed tomvothecoder closed 2 months ago

tomvothecoder commented 2 months ago

Description

TODO:

Checklist

If applicable:

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (584fcce) to head (6459c1b). Report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##              main      #689   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           15        15
  Lines         1544      1546    +2
=========================================
+ Hits          1544      1546    +2
```


tomvothecoder commented 2 months ago

Hi @chengzhuzhang, this PR is ready for review.

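After refactoring, the user CPU time dropped as follows (wall times from the benchmark output are in parentheses):

  1. Annual climatology: 33 s -> 5.85 s (wall time: 35.4 s -> 8.78 s)
  2. Annual departures: 1 min 9 s -> 11.6 s (wall time: 1 min 14 s -> 15.9 s)
  3. Monthly group averages: 33.5 s -> 5.59 s (wall time: 35.9 s -> 7.65 s)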
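
For reference, the speedups implied by these timings can be worked out with a few lines of Python. This is only a sketch that restates the numbers from the benchmark output in this comment; the `timings` dictionary and the `speedup` helper are names introduced here for illustration and are not part of xCDAT.

```python
# Timings in seconds, copied from the benchmark output in this comment.
# Each tuple is (main, refactor/688-temp-api-perf).
timings = {
    "Annual climatology": {"user": (33.0, 5.85), "wall": (35.4, 8.78)},
    "Annual departures": {"user": (69.0, 11.6), "wall": (74.0, 15.9)},
    "Monthly group averages": {"user": (33.5, 5.59), "wall": (35.9, 7.65)},
}


def speedup(before: float, after: float) -> float:
    """Return how many times faster the refactored run is."""
    return before / after


for name, kinds in timings.items():
    user = speedup(*kinds["user"])
    wall = speedup(*kinds["wall"])
    print(f"{name}: {user:.1f}x user CPU, {wall:.1f}x wall time")
```

By this arithmetic the user CPU time improves by roughly 5.6x to 6.0x and the wall time by roughly 4.0x to 4.7x.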

I also ran a regression test on the same e3sm_diags dataset, comparing main against this branch, and the results were identical. The GitHub Actions build also passes.

Benchmarking script

# %%
import xarray as xr
import xcdat as xc

# Open the input dataset shared by all three benchmarks
file_path = "/global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20221103.v2.LR.amip.NGD_v3atm.chrysalis/arm-diags-data/PRECT_sgpc1_198501_201412.nc"
ds = xc.open_dataset(file_path)

# Label for the output filenames: set to "main" or "dev" to match the checked-out branch
branch = "dev"
# %%
# 1. Calculate annual climatology
# -------------------------------
ds_annual_cycle = ds.temporal.climatology("PRECT", "month", keep_weights=True)
ds_annual_cycle.to_netcdf(f"temporal_climatology_{branch}.nc")
"""
main
--------------------------
CPU times: user 33 s, sys: 2.41 s, total: 35.4 s
Wall time: 35.4 s

refactor/688-temp-api-perf
--------------------------
CPU times: user 5.85 s, sys: 2.88 s, total: 8.72 s
Wall time: 8.78 s
"""

# %%
# 2. Calculate annual departures
# ------------------------------
ds_annual_cycle_anom = ds.temporal.departures("PRECT", "month", keep_weights=True)
ds_annual_cycle_anom.to_netcdf(f"temporal_departures_{branch}.nc")
"""
main
--------------------------
CPU times: user 1min 9s, sys: 4.8 s, total: 1min 14s
Wall time: 1min 14s

refactor/688-temp-api-perf
--------------------------
CPU times: user 11.6 s, sys: 4.32 s, total: 15.9 s
Wall time: 15.9 s
"""

# %%
# 3. Calculate monthly group averages
# -----------------------------------
ds_monthly_avg = ds.temporal.group_average("PRECT", "month", keep_weights=True)
ds_monthly_avg.to_netcdf(f"temporal_group_average_{branch}.nc")

"""
main
--------------------------
CPU times: user 33.5 s, sys: 2.27 s, total: 35.8 s
Wall time: 35.9 s

refactor/688-temp-api-perf
--------------------------
CPU times: user 5.59 s, sys: 2.06 s, total: 7.65 s
Wall time: 7.65 s
"""

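The "CPU times" and "Wall time" figures quoted in the script look like the output of IPython's `%%time` cell magic, although the magic itself does not appear in the pasted cells. Outside a notebook, similar numbers could be collected with the standard library. The sketch below is an assumption about how one might do that, not the method used for this PR; `timed` is a hypothetical helper, and unlike `%%time` it reports user and system CPU time as a single sum.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float, float]:
    """Run func and return (result, cpu_seconds, wall_seconds)."""
    cpu_start = time.process_time()   # user + system CPU time of this process
    wall_start = time.perf_counter()  # monotonic wall-clock timer
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, cpu_seconds, wall_seconds


if __name__ == "__main__":
    # Stand-in workload. In the benchmark this would be a call such as
    # ds.temporal.climatology("PRECT", "month", keep_weights=True).
    total, cpu_seconds, wall_seconds = timed(sum, range(1_000_000))
    print(f"CPU time: {cpu_seconds:.3f} s, wall time: {wall_seconds:.3f} s")
```

Timing the computation separately from the `to_netcdf` write would also show how much of each cell's runtime is I/O.
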
Regression testing script

import glob

import xarray as xr

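# Get the filepaths for the dev and main branches
dev_filepaths = sorted(glob.glob("qa/issue-688/dev/*.nc"))
main_filepaths = sorted(glob.glob("qa/issue-688/main/*.nc"))

# zip() stops at the shorter list, so fail early if a file is missing on either side
assert len(dev_filepaths) == len(main_filepaths), "File counts differ between dev and main"

for fp, mp in zip(dev_filepaths, main_filepaths):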
    print(f"Comparing {fp} and {mp}")
    # Load the datasets
    dev_ds = xr.open_dataset(fp)
    main_ds = xr.open_dataset(mp)

    # Compare the datasets
    try:
        xr.testing.assert_identical(dev_ds, main_ds)
    except AssertionError as e:
        print(f"Datasets are not identical: {e}")
    else:
        print("Datasets are identical")
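
The script above pairs files by their position in two sorted lists, which works as long as both directories hold the same set of outputs. A more defensive variant could pair them by name instead. The sketch below assumes the files keep the `_dev` and `_main` suffixes written by the benchmarking script; `pair_by_stem` is a hypothetical helper, not something in xCDAT or xarray.

```python
from pathlib import Path
from typing import List, Tuple


def pair_by_stem(
    dev_paths: List[str],
    main_paths: List[str],
    dev_suffix: str = "_dev",
    main_suffix: str = "_main",
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Pair dev/main files whose stems match once the branch suffix is removed.

    Returns (pairs, unmatched), where unmatched lists the keys found on only one side.
    """

    def key(path: str, suffix: str) -> str:
        stem = Path(path).stem
        # Strip the branch suffix if present; otherwise use the stem unchanged.
        return stem[: -len(suffix)] if suffix and stem.endswith(suffix) else stem

    dev = {key(p, dev_suffix): p for p in dev_paths}
    main = {key(p, main_suffix): p for p in main_paths}
    pairs = [(dev[k], main[k]) for k in sorted(dev.keys() & main.keys())]
    unmatched = sorted(dev.keys() ^ main.keys())
    return pairs, unmatched


pairs, unmatched = pair_by_stem(
    [
        "qa/issue-688/dev/temporal_climatology_dev.nc",
        "qa/issue-688/dev/temporal_departures_dev.nc",
    ],
    ["qa/issue-688/main/temporal_climatology_main.nc"],
)
print(pairs)
print(unmatched)
```

Each pair can then be passed to `xr.testing.assert_identical` as in the loop above, and any entry in `unmatched` flags an output that exists for only one branch.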

Next steps

  1. I will investigate the differences you pointed out here between the xCDAT and e3sm_diags climatology functions separately from this PR (related e3sm_diags discussion post).
  2. I will open a GitHub issue in the xarray repo about the large performance hit when grouping by auxiliary time coordinates.