xgcm / xgcm

python package for analyzing general circulation model output data
http://xgcm.readthedocs.org
MIT License

Automatic grid generation #73

Closed rabernat closed 7 years ago

rabernat commented 7 years ago

cc @jbusecke, @naomi-henderson

All of the basic functionality in xgcm now could easily be applied to generic gridded datasets (e.g. satellite, model data without auxiliary coordinates, etc.), if we had the ability to automatically generate other axis variables. (This may also be necessary for @naomi-henderson to move forward with pangeo-data/pangeo-discussion#1.)

What would this look like? Imagine we have an xarray dataset that looks like this

import xarray as xr
import numpy as np
ds = xr.DataArray(np.random.rand(180, 360),
                         dims=['lat', 'lon'],
                         coords={'lon': np.arange(0, 360)+0.5,
                                 'lat': np.arange(-90, 90)+0.5}
                        ).to_dataset(name='somedata')
ds
<xarray.Dataset>
Dimensions:   (lat: 180, lon: 360)
Coordinates:
  * lon       (lon) float64 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 ...
  * lat       (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 -84.5 -83.5 -82.5 ...
Data variables:
    somedata  (lat, lon) float64 0.2838 0.3383 0.1441 0.8124 0.6643 0.6623 ...

This is a dataset whose values are located at the midpoints of one-degree cells, i.e. the lat and lon coordinates mark cell centers.

It would be amazing to be able to say this

grid = xgcm.autogenerate_axis(ds, axes={'X': 'lon', 'Y': 'lat'})

There are lots of potential options we would want to include, such as the coordinate positions to generate (i.e. left, right, inner, outer). I'm also assuming we would want to treat the original coordinates as the cell centers, but maybe that's not always the case.

This could be implemented via a lower level function called autogenerate_axis that just works on one axis at a time (the way the Grid object works now). I would recommend creating a new module called autogenerate for these functions.

A reminder of the lexicon of grid positions

 Center
 |------o-------|------o-------|------o-------|------o-------|
       [0]            [1]            [2]            [3]

 Left
 |------o-------|------o-------|------o-------|------o-------|
[0]            [1]            [2]            [3]

 Right
 |------o-------|------o-------|------o-------|------o-------|
               [0]            [1]            [2]            [3]

 Inner
 |------o-------|------o-------|------o-------|------o-------|
               [0]            [1]            [2]

 Outer
 |------o-------|------o-------|------o-------|------o-------|
[0]            [1]            [2]            [3]            [4]
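Assuming the original coordinates are evenly spaced cell centers, these positions can be inferred mechanically. Here is a minimal numpy sketch (the helper name `infer_axis_positions` is hypothetical, not the proposed xgcm API):

```python
import numpy as np

def infer_axis_positions(center):
    """Infer left/right/inner/outer positions from evenly spaced cell centers.

    A hypothetical illustration of the idea, not the xgcm interface.
    """
    center = np.asarray(center, dtype=float)
    dx = np.diff(center).mean()  # uniform spacing assumed
    # n + 1 cell edges, from the left edge of the first cell to the
    # right edge of the last
    edges = np.concatenate([[center[0] - dx / 2], center + dx / 2])
    return {
        'center': center,      # n points
        'left':   edges[:-1],  # n points, left edge of each cell
        'right':  edges[1:],   # n points, right edge of each cell
        'inner':  edges[1:-1], # n - 1 interior edges
        'outer':  edges,       # n + 1 edges including both boundaries
    }

# lon centers from the example dataset above
pos = infer_axis_positions(np.arange(0, 360) + 0.5)
```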

spencerahill commented 7 years ago

@rabernat this Issue, along with https://github.com/pangeo-data/pangeo-discussion/issues/1, makes me want to take stock of the state of the existing similar-but-different finite differencing efforts, to avoid unnecessary overlap to the extent possible. In short, I want you/xgcm to take on everything 😜

IIRC, one outcome of the initial pangeo workshop last November was the desire to merge efforts among xgcm, @lesommer's oocgcm, and my infinite-diff. xgcm (before this issue at least!) and oocgcm have focused on C-grid-aware operations, while infinite-diff's focus has been on grid-naive cases such as those that this issue discusses.

It seems that oocgcm has gone dormant, with no new commits in the past 8 months. @lesommer, any updates?

From the outside, it may also seem that infinite-diff has gone similarly dormant. However, @spencerkclark has introduced in this gist what I think is an amazing approach: using sympy and xarray to compute finite differences on evenly spaced grids to arbitrarily high orders of accuracy, all of it dask compatible. This approach is so vastly better than what I currently have implemented in infinite-diff that, if infinite-diff development is to continue at all, it will be via essentially re-writing everything to use this new approach.
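The gist itself is not reproduced here, but its core idea can be sketched: sympy.finite_diff_weights generates stencil coefficients for any order of accuracy, and applying them as a sum of shifted arrays is the pattern that stays dask-compatible. A minimal sketch, assuming an evenly spaced axis and ignoring boundary handling:

```python
import numpy as np
import sympy

# Stencil offsets for a 4th-order-accurate centered first derivative
offsets = [-2, -1, 0, 1, 2]
# finite_diff_weights(deriv_order, x_list, x0); [1][-1] selects the
# first-derivative weights using the full five-point stencil
w = [float(c) for c in sympy.finite_diff_weights(1, offsets, 0)[1][-1]]

x = np.linspace(0, 2 * np.pi, 1000)
dx = x[1] - x[0]
f = np.sin(x)
# Apply the stencil as a sum of shifted arrays (the pattern that maps
# naturally onto dask-backed xarray objects); np.roll wraps, so only
# the interior points are meaningful here
df = sum(wi * np.roll(f, -oi) for wi, oi in zip(w, offsets)) / dx
err = np.abs(df[2:-2] - np.cos(x)[2:-2]).max()
```

The 4th-order weights come out as [1/12, -2/3, 0, 2/3, -1/12], and the interior error is far below what a 2nd-order stencil would give at this resolution.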

But if, as this Issue would suggest, you see space for such grid-naive operations within xgcm, then I (and I believe @spencerkclark also) would be highly receptive to you incorporating his approach. Frankly, I would love to deprecate infinite-diff completely...both selfishly because then I don't have to maintain it, and more importantly because I think it would be better to have a single go-to package for finite differencing, whether C-grid-specific or grid-unaware.

(There is also the issue of spectral methods, which I have implemented previously via windspharm. These are necessary for the budget adjustment methods I alluded to in https://github.com/pangeo-data/pangeo-discussion/issues/1#issuecomment-326149844. But maybe this is a topic for another day.)

So let me know your take on all of this. Sorry if this somewhat hijacks the thread -- we can move it to a separate issue if you'd like. Thanks!

rabernat commented 7 years ago

Thanks for bringing back up the overlap issue. You're absolutely right...despite our good intentions, we are still maintaining overlapping packages wrt finite differencing. My explanation for this is simply that none of the packages is mature / general purpose enough for others to start using it off-the-shelf. oocgcm appeared to have great promise, but I couldn't get it to work with any of my models, so I went back to my own package (xgcm). oocgcm in its current state is still too NEMO-specific. xgcm currently works pretty well with MITgcm (obviously) and ROMS, but its scope is pretty narrow.

Thanks for the update on infinite-diff! I agree that @spencerkclark's gist holds enormous potential.

I think we all agree that, for gcm data that is provided with native grid spacing, we want to be able to reproduce the internal numerical schemes used by the model. That is what xgcm and oocgcm aim to do.

The key implementation question for "grid-naive" datasets seems to be whether it is best to

  1. generate a "fake" finite volume grid and then treat it like a gcm
  2. use the infinite-diff approach

It would be productive to understand the pros and cons of each approach. I like finite volumes because I understand them and I know they have good conservation properties. But I think what you guys are doing is very elegant and probably better.
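The conservation property is easy to state concretely: in a finite-volume scheme the per-cell tendency is a difference of face fluxes, so the domain-integrated tendency telescopes down to the boundary fluxes alone, however the fluxes were estimated. A tiny numpy illustration (unit cell width assumed):

```python
import numpy as np

n = 50
flux = np.random.rand(n + 1)   # fluxes at the n + 1 cell faces ('outer' points)
tendency = -np.diff(flux)      # per-cell tendency: -(F[i+1] - F[i]), with dx = 1

# Interior fluxes cancel pairwise in the sum, so the total tracer change
# depends only on what enters and leaves at the domain boundaries
total = tendency.sum()
```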

rabernat commented 7 years ago

It would be great to hear from @naomi-henderson what numerical approach they used in their past work with generic gridded datasets (like CMIP5). This paper on moisture budgets for example is the basis for one of our Pangeo use cases.

spencerahill commented 7 years ago

Thanks for the quick response.

My explanation for this is simply that none of the packages is mature / general purpose enough for others to start using it off-the-shelf.

No worries. I agree, and I think this constitutes a major gap in the python geosciences toolkit. It would be a really fantastic outcome if this emerged out of the pangeo tracer budgets use-case. It would also make the use-case all the more useful to others.

I think we all agree that, for gcm data that is provided with native grid spacing, we want to be able to reproduce the internal numerical schemes used by the model.

Absolutely.

I like finite volumes because I understand them and I know they have good conservation properties. But I think what you guys are doing is very elegant and probably better.

Particularly for the use case of tracer budgets, conservation is key. So I wouldn't be so fast to dismiss the finite volume approach. Is there a way to combine the two approaches? @spencerkclark what do you think?

spencerkclark commented 7 years ago

First off I think @rabernat's approach for grid-aware datasets is pretty elegant in its own right! It's a really clever use of xarray's data model, and makes things just about as explicit as possible by keeping track of the coordinate changes across differencing and interpolation operations. If I understand things correctly, it provides the building blocks for computing estimates of derivatives using finite-volume methods (rather than methods to compute those estimates directly). (So I might say what @rabernat has going in xgcm is "lower level" and therefore more flexible than what is in my gist).
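To illustrate that composition idea, here is a numpy caricature (not the actual xgcm API) of grid-position-aware diff and interp on a periodic axis: each operation moves data from center points to left points, and a derivative estimate is then just diff divided by spacing:

```python
import numpy as np

def diff_center_to_left(f):
    """Difference of center values, landing on 'left' points (periodic axis)."""
    return f - np.roll(f, 1)

def interp_center_to_left(f):
    """Linear interpolation of center values onto 'left' points (periodic axis)."""
    return 0.5 * (f + np.roll(f, 1))

# A derivative is composed from the building blocks rather than computed
# directly: diff(f) / dx, with the result living at the left points x - dx/2
x = np.linspace(0, 2 * np.pi, 360, endpoint=False)   # cell centers
dx = x[1] - x[0]
dfdx_left = diff_center_to_left(np.sin(x)) / dx
err = np.abs(dfdx_left - np.cos(x - dx / 2)).max()   # 2nd-order accurate
```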

My gist was not initially developed with the tracer-budget use case in mind (that is to say, I don't know whether computing terms of a tracer budget from GCM output is necessarily a good application for it). One aspect that may not be entirely desirable here is that it trades compatibility with unevenly spaced grids for the ability to compute finite-difference approximations of derivatives to arbitrary order of accuracy. As neat as arbitrary order of accuracy might be, an efficient, dask-compatible, 2nd-order-accurate method that could be applied to arbitrary grids might be more valuable (due to its flexibility).
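As a point of reference for that trade-off: numpy's own np.gradient already accepts non-uniform coordinate values and uses a three-point stencil that remains second-order accurate in the interior, which gives a baseline for what a flexible 2nd-order method looks like (a sketch, not the gist's approach):

```python
import numpy as np

# A smoothly stretched (non-uniform) grid on [0, 2*pi]
t = np.linspace(0, np.pi, 200)
x = np.pi * (1 - np.cos(t))
f = np.sin(x)

# Passing the coordinate array (rather than a scalar spacing) makes
# np.gradient use the uneven-spacing, second-order interior stencil
df = np.gradient(f, x)
err = np.abs(df[1:-1] - np.cos(x[1:-1])).max()
```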

On the flip-side though, if added to infinite-diff, the functions I've implemented might be an upgrade from what is currently there (currently the methods are not dask-compatible and hard-coded for specific finite-difference stencils; that said the methods there currently do support non-uniform grid spacing). Given that infinite-diff was used successfully in a tracer-budget context [e.g. Hill et al. (2017)] on "grid-naive" datasets, it perhaps should not be written off.

It would be productive to understand the pros and cons of each approach.

Ultimately I'll admit I have only dabbled in tracer-budget analysis, so I defer to @spencerahill or @naomi-henderson for their instincts on what the pros and cons of each approach might be.

naomi-henderson commented 7 years ago

The finite volume approach is certainly key to tracer conservation. We do try to match the differencing technique used in an offline budget to that of the model which produced it. This would certainly include the order of approximation as well as the type of grid (A-grid, B-grid, C-grid, etc), method (finite difference, finite volume, finite element, spectral, ...) and allow for uneven spacing. For our offline atmospheric budget calculations, the time aggregation (using hourly, daily or monthly mean fields instead of at each time step) has already degraded the accuracy which the higher order approximations cannot recover, so we have never attempted anything beyond 2nd order.

spencerahill commented 7 years ago

Thanks @naomi-henderson. That all makes sense. (As an aside, I also must apologize for not being aware of your paper and thus not citing it in my work so far. It looks like exactly what I should have been referring to for years now.)

It sounds like @spencerkclark's gist is thus not the solution for the pangeo budget project. So the remaining question (mostly for @rabernat) is: is there still interest in incorporating it in xgcm? I am still interested in the idea of one finite differencing package to rule them all, but if xgcm development will be closely tied to the pangeo effort for the foreseeable future, it might not make sense to try to take on this additional thing at the same time.

The alternative would be to revamp infinite-diff using it (including better packaging & docs), while still retaining a fallback for unevenly spaced grids. The door would remain open of course to merge the two in the future. I am fine with either -- don't want to hijack either xgcm development or the pangeo budget efforts.

rabernat commented 7 years ago

So the remaining question (mostly for @rabernat) is: is there still interest in incorporating it in xgcm?

Yes! Options are good. Having multiple options in the same package will allow us to attempt to actually answer the question of which approach is best.

if xgcm development will be closely tied to the pangeo effort for the foreseeable future

Not really the case. xgcm is still my own pet project...we don't have dedicated dev time for it via Pangeo.

So bottom line...PR welcome!

jbusecke commented 7 years ago

I took a shot at a very basic implementation (#74). Based on the example above from @rabernat it would work something like this:

import xgcm as xg
import xarray as xr
import numpy as np

# create a 'typical observational' dataset
ds = xr.DataArray(np.random.rand(180, 360 * 2, 10 * 2),
                  dims=['lat', 'lon', 'z'],
                  coords={'lon': np.arange(-180, 180, 0.5) + 0.25,
                          'lat': np.arange(-90, 90) + 0.5,
                          'z': np.arange(0, 10, 0.5) + 0.25}
                  ).to_dataset(name='somedata')
ds

>>> <xarray.Dataset>
>>> Dimensions:   (lat: 180, lon: 720, z: 20)
>>> Coordinates:
>>>   * lon       (lon) float64 -179.8 -179.2 -178.8 -178.2 -177.8 -177.2 -176.8 ...
>>>   * lat       (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 -84.5 -83.5 -82.5 ...
>>>   * z         (z) float64 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 ...
>>> Data variables:
>>>     somedata  (lat, lon, z) float64 0.2058 0.06754 0.89 0.9464 0.02511 ...

# autogenerate_ds recovers the 'original' coordinates
new_ds = xg.autogenerate_ds(ds, axes={'X': 'lon', 'Y': 'lat', 'Z': 'z'},
                            position='center')

print(new_ds['lon'].data[0:3])
print(new_ds['lon'].data[-4:-1])
print(new_ds['lon_inferred'].data[0:3])
print(new_ds['lon_inferred'].data[-4:-1])

print(new_ds['lat'].data[0:3])
print(new_ds['lat'].data[-4:-1])
print(new_ds['lat_inferred'].data[0:3])
print(new_ds['lat_inferred'].data[-4:-1])

print(new_ds['z'].data[0:3])
print(new_ds['z'].data[-4:-1])
print(new_ds['z_inferred'].data[0:3])
print(new_ds['z_inferred'].data[-4:-1])

>>> [-179.75 -179.25 -178.75]
>>> [ 178.25  178.75  179.25]
>>> [-180.  -179.5 -179. ]
>>> [ 178.   178.5  179. ]
>>> [-89.5 -88.5 -87.5]
>>> [ 86.5  87.5  88.5]
>>> [-90. -89. -88.]
>>> [ 86.  87.  88.]
>>> [ 0.25  0.75  1.25]
>>> [ 8.25  8.75  9.25]
>>> [ 0.   0.5  1. ]
>>> [ 8.   8.5  9. ]

Some elements are still a bit clunky, particularly how add_slice is implemented in xgcm.grid, but it seems to work.

Still a lot more to implement (e.g. calculating cell areas from the lat/lon coordinates), but I wanted to see how everybody feels about this basic functionality.
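For the area piece, the usual exact spherical formula may be a useful starting point; a sketch assuming regular lat/lon cell bounds in degrees (the helper `cell_areas` is hypothetical, not part of the PR):

```python
import numpy as np

R = 6.371e6  # Earth radius in meters

def cell_areas(lat_bounds_deg, lon_bounds_deg):
    """Exact area of lat/lon cells on a sphere:
    A = R**2 * dlon * (sin(lat_top) - sin(lat_bottom)).

    A hypothetical helper for illustration only.
    """
    lat_b = np.deg2rad(np.asarray(lat_bounds_deg, dtype=float))
    lon_b = np.deg2rad(np.asarray(lon_bounds_deg, dtype=float))
    dsin = np.sin(lat_b[1:]) - np.sin(lat_b[:-1])   # shape (nlat,)
    dlon = np.diff(lon_b)                           # shape (nlon,)
    return R**2 * dsin[:, None] * dlon[None, :]     # shape (nlat, nlon)

# A 1-degree global grid, like the first example dataset in this thread;
# the areas sum to the full sphere, 4 * pi * R**2
area = cell_areas(np.arange(-90, 91), np.arange(0, 361))
```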

jbusecke commented 7 years ago

With the current commit of #74 I added the capability to auto-generate multidimensional coordinates as well. Here is a little example of how that would work in practice.