Investigate / document CDAT-xCDAT spatial differences (again)

xCDAT / xcdat

An extension of xarray for climate data analysis on structured grids.

https://xcdat.readthedocs.io/en/latest/

Apache License 2.0

116 stars 12 forks source link

Investigate / document CDAT-xCDAT spatial differences (again) #500

Closed pochedls closed 1 year ago

pochedls commented 1 year ago

A recent issue prompted me to look through (and update) the spatial averaging routine.

I validated this PR by looking at monthly CMIP5/6 historical tas files (1 per model for 117 climate models) and 21 averaging domains (I skipped NAFM due to this discussion).

In validating the proposed changes, I have (again) come across CDAT-xCDAT differences. In 28 cases of 2415 cases (1.1%) the spatial averages are different (as judged by np.allclose()). The monthly differences are mostly below 0.01 K, but exceed 0.1 (in rare cases). These results apply to both the current release v0.5.0 and the proposed PR.

I plan to summarize the reasons for the differences in this issue.

pochedls commented 1 year ago

For a subset of models and domains, the CDAT-xCDAT differences were traced to CDAT dropping the model's latitude bounds (and then replacing the bounds with its own calculated bounds). This resulted in different weights and small differences in the area averages (see table at bottom).

First – an example of incorrect bounds:

Excerpt of lat_bnds from ncdump -v lat_bnds /p/css03/cmip5_css02/data/cmip5/output1/BNU/BNU-ESM/historical/mon/atmos/Amon/r1i1p1/tas/1/tas_Amon_BNU-ESM_historical_r1i1p1_185001-200512.nc:

lat_bnds = -87.8638000488281, -86.4801635742188, -86.4801635742188, -83.704719543457, -83.704719543457, -80.9192581176758, -80.9192581176758, -78.1312561035156, -78.1312561035156, -75.3422088623047, -75.3422088623047, -72.5526351928711,

These appear to match the bounds in cdscan. But they are different from those loaded using cdms2:

fn = '/p/user_pub/xclim/CMIP5/CMIP/historical/atmos/mon/tas/CMIP5.CMIP.historical.BNU.BNU-ESM.r1i1p1.mon.tas.atmos.glb-z1-gu.1.0000000.0.xml'
f = cdms2.open(fn)
tas = f('tas')
print(tas.getLatitude().getBounds())

[[-89.24743652 -86.48016358]
[-86.48016358 -83.70471955] [-83.70471955 -80.91925812] [-80.91925812 -78.13125229] [-78.13125229 -75.34220887] [-75.34220887 -72.55263519]

If I drop the lat_bnds in xcdat and re-calculate them using .add_missing_bounds() the spatial average differences are order 10^-10. This resolves 15 / 28 CDAT-xCDAT differences. I think CDAT probably drops the latitude bounds systematically, but it only matters in these cases. xCDAT does not drop the model latitude bounds (and I don't think it should – we should trust the model data and the user should decide to drop this data).

model	domain	max absolute difference
BNU-ESM	global	0.0337137471188953
BNU-ESM	NAM	0.03921949346317888
BNU-ESM	SAM	0.057204666153268136
FIO-ESM	global	0.033644891283586276
FIO-ESM	NAM	0.03955337948826809
FIO-ESM	SAM	0.0555397851591124
bcc-csm1-1	global	0.03277778145468346
bcc-csm1-1	NAM	0.04099915332000137
bcc-csm1-1	SAM	0.056736843962255534
bcc-csm1-1-m	global	0.005671941710033934
bcc-csm1-1-m	NAM	0.006484684673694119
bcc-csm1-1-m	SAM	0.009858142884070276
fio-esm	global	0.033644891283586276
fio-esm	NAM	0.03955337948826809
fio-esm	SAM	0.0555397851591124

pochedls commented 1 year ago

In another small set of cases (see table below), small differences seem to be introduced by the precision of the axis information.

For example: ncdump -h /p/css03/esgf_publish/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r10i1p1f1/Amon/tas/gr/v20180803/tas_Amon_IPSL-CM6A-LR_historical_r10i1p1f1_gr_185001-201412.nc | grep lat

The lat axis is stored as a float (i.e., float32 not double). CDAT (type(tas.getLatitude().getBounds()[0, 0])) reads this as numpy.float64 whereas xcdat reads (type(ds.lat_bnds.values[0, 0])) this (properly) as numpy.float32. These datasets do not have bounds (so they are calculated by both CDAT and xCDAT). The difference in precision introduces numerical roundoff (10^-11 to 10^-8) differences in the bounds, then the weights (10^-11 0 10^-8), then the spatial averages.

If I cast the lat and lon axes to float64 in the xcdat dataset, recompute the bounds, and the spatial average, then xcdat and CDAT match. This resolves 6 of the 28 differences. This means 21 of 28 of the differences are resolved.

In the past (and repeated here), this issue had been resolved with my own averaging calculations. We could not figure out why my calculations differed from xarray, though we did realize that modifying the precision mattered. I think the custom spatial averager likely worked because it was probably casting the arrays to the same type which led to np.allclose resulting in True. I haven't tried to figure out exactly where the difference happens (it seems to stem from CDAT reading in the bounds with the wrong precision, which affects downstream calculations).

model	domain	max absolute difference
IPSL-CM6A-LR	SAM	0.003096939779936747
IPSL-CM6A-LR	AllM	0.007789411802434643
IPSL-CM6A-LR-INCA	global	0.0037057772779576226
IPSL-CM6A-LR-INCA	SAM	0.003101609951784212
IPSL-CM6A-LR-INCA	AllM	0.007794648910362412
IPSL-CM6A-LR	global	0.0037208329281384067

pochedls commented 1 year ago

One set of cases that I don't fully understand is with CESM2-WACCM (see differences). The bounds and weights appear to be identical in CDAT and xCDAT. The differences are "all close" if I either a) manually calculate the spatial average (10^-5 differences), b) reverse the axis order when averaging in xcdat (10^-3 differences), or drop and recompute the latitude bounds (and weights) (10^-7 differences).

Unlike with IPSL data (above) the axes seem to be stored (and opened) with double precision. The bounds are stored as float in the netCDF file. Both xcdat and CDAT bounds/weights seem to open as float32. But if I drop the lat_bnds and re-create them in xcdat, they are float64 and I get very close to the CDAT spatial averages (10^-7 differences). So I think this is another precision/roundoff type difference.

If we accept this as an explanation, this resolves 4/28 differences (3/28 left to figure out).

model	domain	max absolute difference
CESM2-WACCM	tropics	0.004078462585937359
CESM2-WACCM	global	0.006396168779986056
CESM2-WACCM	SAM	0.0029629751547304295
CESM2-WACCM	AllM	0.008556697618530507

pochedls commented 1 year ago

The final set of differences (see table below) appear to be related to incorrect longitude weights assigned by CDAT in rare instances.

For example, for the Sahel domain (/p/css03/esgf_publish/CMIP6/CMIP/NUIST/NESM3/historical/r1i1p1f1/Amon/tas/gn/v20190630/) which has the bounds 13 - 18^oN and -10 to 10^oE, CDAT gives the following relative weights:

lon_bnd	relative weight
[-9.375 -7.5 ]	1.875
[-7.5 -5.625]	1.875
[-5.625 -3.75 ]	1.875
[-3.75 -1.875]	1.875
[-1.875 0. ]	1.875
[0. 1.875]	1.875
[1.875 3.75 ]	1.875
[3.75 5.625]	1.875
[5.625 7.5 ]	1.875
[7.5 9.375]	1.875
[ 9.375 10. ]	0.625

It correctly adjusts the bound of the lat grid cell (from 9.375 : 11.25 to 9.375 : 10.), but it drops weight from the grid cell spanning (-11.25 : -9.375) [it isn't listed]. xCDAT does weight this grid cell:

lon_bnd	relative weight
[-11.25 -9.375]	0.625
[-9.375 -7.5 ]	1.875
[-7.5 -5.625]	1.875
[-5.625 -3.75 ]	1.875
[-3.75 -1.875]	1.875
[-1.875 0. ]	1.875
[0. 1.875]	1.875
[1.875 3.75 ]	1.875
[3.75 5.625]	1.875
[5.625 7.5 ]	1.875
[7.5 9.375]	1.875
[ 9.375 11.25 ]	0.625

Note the partial weights in the first and last grid cells. I don't know why CDAT doesn't provide the correct weights in these rare instances. This resolves the final 3 CDAT / xCDAT discrepancies.

model	domain	max absolute difference
NESM3	Sahel	0.15666759125781482
NESM3	GoG	0.036151854691411245
NESM3	SAmo	0.10544204167973703

pochedls commented 1 year ago

The issue that prompted the PR has to do with an E3SM dataset with unusual bounds.

The first set of bounds span 359.6484375 to 0.3515625. This appears to be inconsistent with CF-conventions (see, e.g., here) which states that bounds should be increasing. We don't have examples of bounds like this in the CMIP archive. If bounds wrap around the prime meridian they are stored in increasing order, e.g., [ -0.9375, 0.9375] for /p/css03/esgf_publish/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/.

The weird E3SM bounds are not handled correctly by the spatial averager in v0.5.0. This issue is caught and fixed in the current PR.

tomvothecoder commented 1 year ago

Thanks for documenting the 28 cases where you found differences in the spatial averaging results between CDAT and xCDAT. I was able to follow your findings for these cases and am satisfied with the explanations.

One set of cases that I don't fully understand is with CESM2-WACCM (see differences). The bounds and weights appear to be identical in CDAT and xCDAT. The differences are "all close" if I either a) manually calculate the spatial average (10-5 differences), b) reverse the axis order when averaging in xcdat (10-3 differences), or drop and recompute the latitude bounds (and weights) (10-7 differences).

Unlike with IPSL data (above) the axes seem to be stored (and opened) with double precision. The bounds are stored as float in the netCDF file. Both xcdat and CDAT bounds/weights seem to open as float32. But if I drop the lat_bnds and re-create them in xcdat, they are float64 and I get very close to the CDAT spatial averages (10-7 differences). So I think this is another precision/roundoff type difference.

If we accept this as an explanation, this resolves 4/28 differences (3/28 left to figure out).

As mentioned in this comment, we found that CDAT incorrectly type casts weights to np.float64 even though bounds in the input dataset are np.float32 (source notebook). There might be some subtle type casting going on resulting in precision differences. We probably shouldn't be worried here.

pochedls commented 1 year ago

I am also happy with where things are. I think this can be closed with #495.