xCDAT / xcdat

An extension of xarray for climate data analysis on structured grids.
https://xcdat.readthedocs.io/en/latest/
Apache License 2.0
101 stars 11 forks source link

[PR]: Update Regrid2 missing and fill value behaviors to align with CDAT and add `unmapped_to_nan` arg for output data #613

Closed jasonb5 closed 3 months ago

jasonb5 commented 3 months ago

Description

Todo

Checklist

If applicable:

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (472c04b) to head (fc17e7a).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #613 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 15 15 Lines 1521 1543 +22 ========================================= + Hits 1521 1543 +22 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

tomvothecoder commented 3 months ago

Hey @lee1043, at our past meeting on 3/13, Jason mentioned having an unmapped_to_nan=True option for the Regrid2 API. This enables us to have consistency with the xESMF API and keeps plots looking correct when data has missing values. As a reminder, this option applies to output data. For input data, this PR updates the Regrid2 API to use fill_value to align with CDAT's Regrid2 behavior.

Do you agree with this idea? Thanks.

lee1043 commented 3 months ago

@tomvothecoder I agree, thanks!

tomvothecoder commented 3 months ago

Hi @jasonb5, just checking to see if you can get this done by next Tues (4/2)? I would like to release v0.7.0 mid-next week.

I'm currently debugging #615 to see if I can get anywhere.

lee1043 commented 3 months ago

I confirm this resolves #615.

lee1043 commented 3 months ago

It looks like the current version of the code in this PR could introduce unexpected behavior (guess it's a bug?) when there are NaN values in the original field and regrid using regrid2.

import os
import xcdat as xc
from pcmdi_metrics.utils import apply_landmask

input_data = os.path.join(
    "/p/css03/esgf_publish/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/ts/gn/v20200511",
    "ts_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc")

target_grid = xc.create_uniform_grid(-90, 89, 1.0, 0, 359, 1.0)

ds = xc.open_dataset(input_data)

# Apply land mask
ds_masked = ds.copy()
da_masked = apply_landmask(ds, "ts", keep_over="ocean")
ds_masked["ts"] = da_masked

ds_masked["ts"].isel(time=0).plot()

# regrid
ds_masked_regrid = ds_masked.regridder.horizontal("ts", target_grid, tool="regrid2")

ds_masked_regrid["ts"].isel(time=0).plot()

ds (raw data): output1

ds_masked: output2

ds_masked_regrid (from code in main branch) -- what I expect output3

ds_masked_regrid (from this branch) -- what I get output4

@jasonb5 @tomvothecoder do you have any idea?

lee1043 commented 3 months ago

For the above, it looks like 1e20 is filled for grid with NaN then do the regridding, which results 1e20 over the land and ~1e19 along the coastlines. If I plot using the below command, I get the 3rd plot in the above comment instead of the last one.

ds_masked_regrid["ts"].where(ds_masked_regrid["ts"] < 1e16).isel(time=0).plot()
tomvothecoder commented 3 months ago

Hi @jasonb5, just checking to see if you can get this done by next Tues (4/2)? I would like to release v0.7.0 mid-next week.

I'm currently debugging #615 to see if I can get anywhere.

@jasonb5 any updates? Also #615 is fixed. Jiwoo opened PR #634 to get that one in separately if this PR is going to take longer.

We can push back v0.7.0 another week if needed.

jasonb5 commented 3 months ago

@tomvothecoder @lee1043 I'll have time today to finish this up.

tomvothecoder commented 3 months ago

Thanks for the update @jasonb5!

jasonb5 commented 3 months ago

@lee1043 Fixed the issue, after regridding the missing values weren't replaced with Nan's again which resulted in the bad plot. There is still potentially an issue with coastline values ~1e19 so you still need ds_masked_regrid["ts"].where(ds_masked_regrid["ts"] < 1e16).isel(time=0).plot() to clean up the plot. I'm looking into this, might be a separate issue with regrid2.

lee1043 commented 3 months ago

@jasonb5 thank you very much for following up and for the update. I test with the latest code in this PR, which has changed 1e20 to nan. However, as you mentioned, there still are issues with coastlines that still concerns me. I think that indicates some of the original information is missed by averaging with 1e20, occurring along the coastlines in this case. Having ~1e19 numbers cause substantial impact on the statistical numbers when conducting model evaluation. I can exclude them by taking < 1e16, but still, that make the coastline grids as blank, which influences to the statistics. Is there any chance to make it those 1e20 or nan to be completely excluded before the regrinding process starts, so they won't affect to the regridded output at all?

ds_masked_regrid (from this branch) -- what I get

output5

By applying ds_masked_regrid["ts"].where(ds_masked_regrid["ts"] < 1e16).isel(time=0).plot()

output6

jasonb5 commented 3 months ago

@lee1043 I'm investigating that issue how, I'll probably update this PR to just include a few fixes.

lee1043 commented 3 months ago

@jasonb5 thank you for following up. I opened PR #635 for my proposal on this matter. Could you please take a look? Please feel free to discard or merge it to this PR if that looks okay to you.

jasonb5 commented 3 months ago

@lee1043 I added another fix, your example still displays the same issue.

The following example works:

import os
import xcdat as xc
import numpy as np
from pcmdi_metrics.utils import apply_landmask, create_land_sea_mask

input_data = os.path.join(
    "/p/css03/esgf_publish/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/ts/gn/v20200511",
    "ts_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc")

target_grid = xc.create_uniform_grid(-90, 89, 1.0, 0, 359, 1.0)

ds = xc.open_dataset(input_data)

# Apply land mask
ds_masked = ds.copy()
mask = create_land_sea_mask(ds_masked)
# invert mask to keep ocean
ds_masked["mask"] = np.absolute(mask - 1)

# ds_masked["psl"].isel(time=0).plot()

# regrid
ds_masked_regrid = ds_masked.regridder.horizontal("ts", target_grid, tool="regrid2")

ds_masked_regrid["ts"].isel(time=0).plot()

I'm still looking into why your example still fails.

lee1043 commented 3 months ago

@lee1043 I added another fix, your example still displays the same issue.

The following example works:

import os
import xcdat as xc
import numpy as np
from pcmdi_metrics.utils import apply_landmask, create_land_sea_mask

input_data = os.path.join(
    "/p/css03/esgf_publish/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/ts/gn/v20200511",
    "ts_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc")

target_grid = xc.create_uniform_grid(-90, 89, 1.0, 0, 359, 1.0)

ds = xc.open_dataset(input_data)

# Apply land mask
ds_masked = ds.copy()
mask = create_land_sea_mask(ds_masked)
# invert mask to keep ocean
ds_masked["mask"] = np.absolute(mask - 1)

# ds_masked["psl"].isel(time=0).plot()

# regrid
ds_masked_regrid = ds_masked.regridder.horizontal("ts", target_grid, tool="regrid2")

ds_masked_regrid["ts"].isel(time=0).plot()

I'm still looking into why your example still fails.

Hi @jasonb5, thanks for following up. But I don't think the code in this comment is working. This is what I get which still has coastline issue. Seeing the result plot below, on top of colorbar color range indicates 1e19, which appears along coastline. output7

It becomes more clear when plotting where >1e10 ds_masked_regrid["ts"].where(ds_masked_regrid["ts"]>1e10).isel(time=0).plot()

output8

I think the key issue here is that 1e20 is being used when interpolating.

In this PR, logic of the code is checking if there is attributes for FillValue or missing value, and if so, 1e20 is injected instead of NaN in the input variable (here). This input variable is passed to regrid2 (here), which include 1e20 when actually interpolating values.

Instead, I think inclusion of 1e20 when interpolating can be prevented by replacing 1e20 or any missing value if exists (nan_replace in the code) to actual NaN (here), before calling the regridding. I proposed this solution in #635. Could you please take a look and let me know if there is any concern?

jasonb5 commented 3 months ago

@lee1043 Merged your change, and added a new one. Both pre-applied mask and passing mask as a variable should both work now.

Also note that how the mask is applied by cdms2 and passing the mask as a variable to xcdat differs from pcmdi_metrics apply_landmask function. The prior allows more data through, while the latter is tighter around the coastlines (not sure if this is a problem).

import os
import xcdat as xc
import numpy as np
import xarray as xr
import cdms2
import matplotlib.pyplot as plt
from pcmdi_metrics.utils import apply_landmask, create_land_sea_mask

fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(10,10))

input_data = "ts_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc"

target_grid = xc.create_uniform_grid(-90, 89, 1.0, 0, 359, 1.0)
cdms2_grid = cdms2.createUniformGrid(-90, 180, 1.0, 0, 360, 1.0)

ds = xc.open_dataset(input_data)
cdms2_ds = cdms2.open(input_data)
cdms2_var = cdms2_ds("ts")

mask = create_land_sea_mask(ds)
# cdms2 automatically inverts the mask
cdms2_mask = mask.values

cdms2_regrid = cdms2_var.regrid(cdms2_grid, mask=cdms2_mask, regridTool="regrid2")
cdms2_regrid = xr.DataArray(cdms2_regrid.data, dims=("time", "lat", "lon"))
cdms2_regrid = cdms2_regrid.where(cdms2_regrid!=1e20)
cdms2_regrid_t0 = cdms2_regrid.isel(time=0)

cdms2_regrid_t0.plot(ax=axs[0,0])
(cdms2_regrid_t0 - cdms2_regrid_t0).plot(ax=axs[0,1])

ds1 = ds.copy()
ds1_masked = apply_landmask(ds1, "ts", keep_over="ocean")
ds1["ts"] = ds1_masked

ds1_regrid = ds1.regridder.horizontal("ts", target_grid, tool="regrid2")
ds1_regrid_t0 = ds1_regrid["ts"].isel(time=0)
ds1_regrid_t0.plot(ax=axs[1,0])
# replace nan with 1e20 to show where data is missing
(ds1_regrid_t0.fillna(1e20) - cdms2_regrid_t0.fillna(1e20)).plot(ax=axs[1,1])

ds2 = ds.copy()
# need to manually invert the mask for ocean
ds2["mask"] = np.absolute(mask - 1)

ds2_regrid = ds2.regridder.horizontal("ts", target_grid, tool="regrid2")
ds2_regrid_t0 = ds2_regrid["ts"].isel(time=0)
ds2_regrid_t0.plot(ax=axs[2,0])
# replace nan with 1e20 to show where data is missing
(ds2_regrid_t0.fillna(1e20) - cdms2_regrid_t0.fillna(1e20)).plot(ax=axs[2,1])

plt.tight_layout()
plt.draw()

image

lee1043 commented 3 months ago

Hi @jasonb5, thank you for the update. I checked the updated code in this PR works well for my example code above.

Also note that how the mask is applied by cdms2 and passing the mask as a variable to xcdat differs from pcmdi_metrics apply_landmask function. The prior allows more data through, while the latter is tighter around the coastlines (not sure if this is a problem).

Thank you for pointing out. It is kind of intended behavior that in PMP I made the apply landmask function to be more strict when selecting either land or ocean grid. Default criteria for land is land fraction >= 0.8 while for the ocean < 0.2. I might have to rethink on this more to make it more consistent to what CDAT does, to keep minimal impact of xcdat conversion in PMP results.

jasonb5 commented 3 months ago

@lee1043 Sounds good, just wanted to make a note.

@tomvothecoder I think this is good to merge, and I'll tackle #625 in the next release.

tomvothecoder commented 3 months ago

@jasonb5 Great! I'll do a quick final review and then merge.

tomvothecoder commented 3 months ago

I'm going to add the codecov token to the repo and GitHub Actions build to hopefully fix the codecov upload problem and prevent/limit future occurrences.

https://community.codecov.com/t/upload-issues-unable-to-locate-build-via-github-actions-api/3954