I can definitely help with this; however, I'm not very good with this kind of stuff. I really liked @mgrover1's first blogpost regarding xwrf though, so maybe we could think along those lines and I can make a first draft...
To flesh out what we are going for, I think the features we should showcase are:

- `dask`-integration and the associated speedup
- `xgcm`
- `metpy`

And we should definitely have some beautiful plots ;)

Am I missing anything? Otherwise I would look into pushing this over the weekend...
Not sure about necessary features, but it seems like the pieces are in place with the merge of #69.
@lpilz Is this still something you'd like to take the lead on putting together ahead of an initial release? If not, that's fine too, I can definitely use what you've made for the docs to put together something in blog post format.
Also, with this package now residing in xarray-contrib, would it make the most sense to request this be posted in https://github.com/xarray-contrib/xarray.dev (a la pint-xarray)?
On it! I'll get a draft to you by tomorrow.
Alright, I have the draft finished, finally got it to work! I had to hack it a bit because unfortunately `xgcm.apply_as_grid_ufunc` didn't work as expected. Maybe you can find a better way to do this (i.e. without the transpose)...
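For context, the destaggering in question is just an interpolation from the staggered grid onto the mass (centered) grid; a minimal xgcm sketch of that operation, with hypothetical dimension names `x`/`x_stag` and an already-opened dataset `ds`, might look like:

import xgcm

# Describe the staggered ('outer') and centered positions of the x-axis
# (the dimension names here are placeholders)
grid = xgcm.Grid(
    ds,
    coords={'X': {'center': 'x', 'outer': 'x_stag'}},
    periodic=False,
)

# Interpolating from 'outer' to 'center' averages neighboring staggered
# points, which is the WRF destaggering operation
u_destaggered = grid.interp(ds['U'], 'X')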
To download the data used for the blogpost, you can use `aws-cli`. For SSP245:

aws s3 cp --include="wrfout_d01_2099-10*" --no-sign-request s3://wrf-cmip6-noversioning/downscaled_products/gcm/cesm2_r11i1p1f1_ssp245/6hourly/2099/d02/ d02 --recursive

and for SSP585:

aws s3 cp --include="wrfout_d01_2099-10*" --no-sign-request s3://wrf-cmip6-noversioning/downscaled_products/gcm/cesm2_r11i1p1f1_ssp585/6hourly/2099/d02/ d02 --recursive
Edit: Updated file!

- `compute()` calls used for debugging
- `transpose` call
- `dask`'s Task Stream - I think this is pretty well optimized

And please do comment and amend this draft, there is probably some room for improvement! :)
@xarray-contrib/xwrf-devs Just wanted to ping you regarding the blog post. Since we just released v0.0.1, we should probably get it published sooner rather than later. Is there any way I can help?
I'll try looking at the draft more closely today. I agree we should get this out to announce the release.
@lpilz Thank you for putting that together! It looks like a great example to showcase what xWRF can do in tandem with xgcm and dask. If you don't mind, there are still a few parts of it I'd like to amend:
Also, if this is indeed something that we hope to share with the broader xarray audience, I'd really love to see a cleaner solution to the dask-compatible destaggering. If we were willing to hold off on the blog post for just a little bit longer, I could wrap up https://github.com/jthielen/xwrf/tree/add-destagger / https://github.com/xarray-contrib/xwrf/pull/37 in the next day or so, and we could share the blog post along with a v0.0.2 release?
That all being said, if instead we wanted a more limited audience for the blog post (e.g., still on the NCAR blog or a Pangeo discourse announcement) so as to get a faster turnaround on announcing this, then several of the changes are not as important.
@jthielen I agree on points 1 and 2 and will get back to you on those. I'm not sure point 3 is correct because the last blog post had some interactive output (it not updating is something we're discussing in #89).
I don't have any strong opinions on the destaggering part. It would indeed be nice to have this integrated into the accessor rather than being a hacky custom function. However, I would not push the release of the blog post further back than a couple of days. I feel like otherwise we might lose track again and it would be really nice to get this finished in the near future. @jthielen if you don't think you'll have time to work on this today, I can take a crack at it if you want. I just won't have time to do so this weekend.
> @jthielen I agree on points 1 and 2 and will get back to you on those.
Sounds good! If fsspec isn't something you've used before and it would be a hassle, I could also try working on that later today instead.
> I'm not sure point 3 is correct because the last blog post had some interactive output (it not updating is something we're discussing in #89).
I apologize for the confusion! The reason I bring this up is because the platform being used for this blog post is likely going to be different than the previous one. I was assuming (based on xwrf now being an xarray-contrib project, rather than ncar-xdev) that we would not post on the NCAR-ESDS blog, but instead on the xarray blog (and if not that, then something in the Pangeo realm). So, we need to have content that can be supported by the platform we're targeting.
> I don't have any strong opinions on the destaggering part. It would indeed be nice to have this integrated into the accessor rather than being a hacky custom function. However, I would not push the release of the blog post further back than a couple of days. I feel like otherwise we might lose track again and it would be really nice to get this finished in the near future. @jthielen if you don't think you'll have time to work on this today, I can take a crack at it if you want. I just won't have time to do so this weekend.
I'll plan on working on this later today. Given that I already have a starting point to which I'll primarily just be adding tests and any needed fixes, I'll get a draft PR up right away so that you (or anyone else) can comment on it early/whenever you get the chance.
> Sounds good! If fsspec isn't something you've used before and it would be a hassle, I could also try working on that later today instead.
I'll let you know if I'm stuck anywhere, thanks for the offer.
> I apologize for the confusion! The reason I bring this up is because the platform being used for this blog post is likely going to be different than the previous one. I was assuming (based on xwrf now being an xarray-contrib project, rather than ncar-xdev) that we would not post on the NCAR-ESDS blog, but instead on the xarray blog (and if not that, then something in the Pangeo realm). So, we need to have content that can be supported by the platform we're targeting.
Sorry about that - you're of course right, I was still thinking in terms of NCAR-ESDS. I can try to make a small gif, which we can include in the blog post.
> I'll plan on working on this later today. Given that I already have a starting point to which I'll primarily just be adding tests and any needed fixes, I'll get a draft PR up right away so that you (or anyone else) can comment on it early/whenever you get the chance.
Sounds great!
👍🏽 for waiting a few days and sharing the blog post when it's ready... Also, 👍🏽 for publishing this on the xarray blog (more visibility/reach). I'm available if you need help setting that up.
I have an update of the blogpost draft but I got a bit stuck implementing the `fsspec` data access. Since one can only natively use `s3://` wildcard-links with the zarr engine, I am using `s3fs` to glob the S3 bucket and get the file names. However, when using `open_mfdataset` to read them, I get the error `NetCDF: Attempt to use feature that was not turned on when netCDF was built`.

I had a look at `nc-config` and it seems all features (incl. `nczarr`) are turned on except for `parallel-netcdf`. Unfortunately, the only netcdf-parallel release for conda is restricted to python<3.8, and support for 3.7 was dropped from xarray. Can somebody help with this?
> I have an update of the blogpost draft but I got a bit stuck implementing the `fsspec` data access. Since one can only natively use `s3://` wildcard-links with the zarr engine, I am using `s3fs` to glob the S3 bucket and get the file names. However, when using `open_mfdataset` to read them, I get the error `NetCDF: Attempt to use feature that was not turned on when netCDF was built`. I had a look at `nc-config` and it seems all features (incl. `nczarr`) are turned on except for `parallel-netcdf`. Unfortunately, the only netcdf-parallel release for conda is restricted to python<3.8, and support for 3.7 was dropped from xarray. Can somebody help with this?
Sorry for the delay in getting back to you on this! The two ideas I had to get around this (given that I'm not sure how to directly enable parallel netCDF in general) are: 1) open the files serially, then concatenate, or 2) see if kerchunk could help us out with the S3 parallelization, and perhaps also use intake to streamline the code for file access.
If you don't mind, I could try implementing the kerchunk + intake approach later today and report back with how it goes?
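For anyone following along, here is a rough sketch of the kerchunk + intake idea against the wrf-cmip6 bucket. Exact kwargs vary between kerchunk versions, and the `Time` handling may need a `coo_map` tweak, so treat this as an outline rather than a recipe:

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

fs = fsspec.filesystem('s3', anon=True)
urls = fs.glob(
    'wrf-cmip6-noversioning/downscaled_products/gcm/'
    'cesm2_r11i1p1f1_ssp245/6hourly/2099/d02/wrfout_d01_2099-10*'
)

# Scan each netCDF4/HDF5 file once, recording the byte ranges of its chunks
single_refs = []
for url in urls:
    with fs.open(url) as f:
        single_refs.append(SingleHdf5ToZarr(f, f's3://{url}').translate())

# Combine the per-file reference sets along the time axis
mzz = MultiZarrToZarr(
    single_refs,
    concat_dims=['Time'],
    remote_protocol='s3',
    remote_options={'anon': True},
)
combined_refs = mzz.translate()

# The combined references let the zarr engine read the netCDF files
# via plain byte-range requests
ds = xr.open_dataset(
    'reference://',
    engine='zarr',
    backend_kwargs={
        'consolidated': False,
        'storage_options': {
            'fo': combined_refs,
            'remote_protocol': 's3',
            'remote_options': {'anon': True},
        },
    },
)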
No worries at all. I got it ~~to work~~ not to error serially, but it's a bit ugly and at some point the kernel just dies, so maybe using `kerchunk` and `intake` sounds like a good idea. Looking forward to seeing whether it works!
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='us-west-2'))
files = fs.glob('s3://wrf-cmip6-noversioning/downscaled_products/gcm/cesm2_r11i1p1f1_ssp585/6hourly/2099/d02/wrfout_d01_2099-10*')

# Open each file serially; keep the file handles open so the lazy
# (chunked) reads still work after opening
dslist = []
for file in files:
    f = fs.open(file)
    dslist.append(xr.open_dataset(f, chunks='auto'))

# xr.concat takes the list of datasets and a single `dim` argument
# (`combine`/`concat_dim` belong to open_mfdataset)
ssp5_ds = xr.concat(dslist, dim='Time')
ssp5_ds
I was able to get kerchunk reference json files (ssp585, ssp245, and a single sample) and an intake catalog generated for these wrf-cmip6 samples, and everything seemed to work locally. However, after placing those files on s3, I haven't been able to figure out the right incantation for anonymous access and instead get a `ServerTimeoutError: Connection timeout to host http://***************/latest/api/token` error whenever I try to open a dataset.
So, I'm not sure on the best route forward...
Yeah, so I don't think this is possible with the `netcdf4` backend. With `zarr`, you can give it a `storage_options` dict via the `backend_kwargs` keyword, but I'm reasonably certain this doesn't exist with `netcdf4`, cf. https://stackoverflow.com/questions/66512985/how-to-add-fsspec-open-local-in-backend-of-xarray and https://github.com/pydata/xarray/blob/main/xarray/backends/netCDF4_.py#L534-L552. Also see https://docs.xarray.dev/en/stable/whats-new.html#id179:

> open_dataset() and open_mfdataset() now accept fsspec URLs (including globs for the latter) for engine="zarr", and so allow reading from many remote and other file systems
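(For reference, with the zarr engine that pattern is simply the following; the store path here is purely hypothetical.)

import xarray as xr

# The zarr backend accepts fsspec URLs plus storage_options,
# which the netcdf4 backend does not
ds = xr.open_dataset(
    's3://some-bucket/some-store.zarr',  # hypothetical path
    engine='zarr',
    backend_kwargs={'storage_options': {'anon': True}},
)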
As far as I remember, I could only get past the `anon` access using:
import s3fs
fs = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='us-west-2'))
I'm also at a loss regarding the further direction. Originally, I chose these large datasets because they show very nicely why one would want to use `dask` with WRF data, but if we can't make it work at all I guess we'll have to revert to different data. Also, is it really not possible to download the data manually using `aws-cli` and then just use a local `open_mfdataset`? I think this is how it was done in the original blog post.
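(For the record, the local route would presumably be something like the sketch below, assuming the files were downloaded into d02/ with the aws-cli commands posted earlier.)

import xarray as xr

# Raw wrfout files don't carry a usable time coordinate, so concatenate
# explicitly along the 'Time' dimension
ds = xr.open_mfdataset(
    'd02/wrfout_d01_2099-10*',
    combine='nested',
    concat_dim='Time',
    chunks='auto',
)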
@jthielen can you make sure that your reference file is openly available?

`s3://sample-data-jthielen/xwrf/ssp245_gcm_wrfout_combined.json`

I tried accessing it, and received a `PermissionError: Access Denied` error...
@jthielen @lpilz @andersy005 - I moved those reference files and catalog into a github repository, and adjusted the links accordingly. The following code snippet works!
import intake
# Read in the data catalog from github
cat = intake.open_catalog('https://raw.githubusercontent.com/mgrover1/sample-wrf-data/main/catalog.yml')
# Read one of the sample datasets
ds = cat["xwrf-sample-ssp245"].to_dask()
ds
<xarray.Dataset>
Dimensions: (XTIME: 124, Time: 1, south_north: 340,
west_east: 270, bottom_top_stag: 40,
bottom_top: 39,
soil_levels_or_lake_levels_stag: 10,
snow_and_soil_levels_stag: 15, soil_layers_stag: 4,
seed_dim_stag: 2, west_east_stag: 271,
south_north_stag: 341, snow_layers_stag: 3,
interface_levels_stag: 16, snso_layers_stag: 7)
Coordinates:
XLAT (Time, south_north, west_east) float32 dask.array<chunksize=(1, 170, 135), meta=np.ndarray>
XLAT_U (Time, south_north, west_east_stag) float32 dask.array<chunksize=(1, 170, 136), meta=np.ndarray>
XLAT_V (Time, south_north_stag, west_east) float32 dask.array<chunksize=(1, 171, 135), meta=np.ndarray>
XLONG (Time, south_north, west_east) float32 dask.array<chunksize=(1, 170, 135), meta=np.ndarray>
XLONG_U (Time, south_north, west_east_stag) float32 dask.array<chunksize=(1, 170, 136), meta=np.ndarray>
XLONG_V (Time, south_north_stag, west_east) float32 dask.array<chunksize=(1, 171, 135), meta=np.ndarray>
* XTIME (XTIME) datetime64[ns] 2099-10-01 ... 2099-10-31T1...
Dimensions without coordinates: Time, south_north, west_east, bottom_top_stag,
bottom_top, soil_levels_or_lake_levels_stag,
snow_and_soil_levels_stag, soil_layers_stag,
seed_dim_stag, west_east_stag,
south_north_stag, snow_layers_stag,
interface_levels_stag, snso_layers_stag
Data variables: (12/283)
ACGRDFLX (XTIME, Time, south_north, west_east) float32 dask.array<chunksize=(1, 1, 170, 135), meta=np.ndarray>
ACHFX (XTIME, Time, south_north, west_east) float32 dask.array<chunksize=(1, 1, 170, 135), meta=np.ndarray>
ACLHF (XTIME, Time, south_north, west_east) float32 dask.array<chunksize=(1, 1, 170, 135), meta=np.ndarray>
ACLWDNB (XTIME, Time, south_north, west_east) float32 dask.array<chunksize=(1, 1, 170, 135), meta=np.ndarray>
ACLWDNBC (XTIME, Time, south_north, west_east) float32 dask.array<chunksize=(1, 1, 170, 135), meta=np.ndarray>
ACLWDNT (XTIME, Time, south_north, west_east) float32 dask.array<chunksize=(1, 1, 170, 135), meta=np.ndarray>
... ...
ZNU (XTIME, Time, bottom_top) float32 dask.array<chunksize=(1, 1, 39), meta=np.ndarray>
ZNW (XTIME, Time, bottom_top_stag) float32 dask.array<chunksize=(1, 1, 40), meta=np.ndarray>
ZS (XTIME, Time, soil_layers_stag) float32 dask.array<chunksize=(1, 1, 4), meta=np.ndarray>
ZSNSO (XTIME, Time, snso_layers_stag, south_north, west_east) float32 dask.array<chunksize=(1, 1, 7, 170, 135), meta=np.ndarray>
ZWT (XTIME, Time, south_north, west_east) float32 dask.array<chunksize=(1, 1, 170, 135), meta=np.ndarray>
Z_LAKE3D (XTIME, Time, soil_levels_or_lake_levels_stag, south_north, west_east) float32 dask.array<chunksize=(1, 1, 10, 170, 135), meta=np.ndarray>
Attributes: (12/149)
ADAPT_DT_MAX: 72.0
ADAPT_DT_MIN: 36.0
ADAPT_DT_START: 54.0
AERCU_FCT: 1.0
AERCU_OPT: 0
AER_ANGEXP_OPT: 1
... ...
WEST-EAST_PATCH_END_STAG: 271
WEST-EAST_PATCH_END_UNSTAG: 270
WEST-EAST_PATCH_START_STAG: 1
WEST-EAST_PATCH_START_UNSTAG: 1
W_DAMPING: 0
YSU_TOPDOWN_PBLMIX: 0
We can move this directly into this repository instead of the sample repository I added.
Here is a link to the repo
@jthielen feel free to submit a PR to the main xwrf repo, otherwise I can do this later today.
That looks really good, thanks for your work! Did you check whether the `XTIME` vs `Time` dimension lengths give us any problems in postprocessing? When I `open_mfdataset` them locally, they are already combined by `Time`.
> Yeah, so I don't think this is possible with the `netcdf4` backend. With `zarr`, you can give it a `storage_options` dict via the `backend_kwargs` keyword, but I'm reasonably certain this doesn't exist with `netcdf4`, cf. https://stackoverflow.com/questions/66512985/how-to-add-fsspec-open-local-in-backend-of-xarray and https://github.com/pydata/xarray/blob/main/xarray/backends/netCDF4_.py#L534-L552. Also see https://docs.xarray.dev/en/stable/whats-new.html#id179:
>
> > open_dataset() and open_mfdataset() now accept fsspec URLs (including globs for the latter) for engine="zarr", and so allow reading from many remote and other file systems
This should be the sort of thing that kerchunk is meant to resolve as far as I'm aware, since it allows the zarr engine to drive things with just byte range requests for the data chunks. Thankfully, it looks like that is indeed the case based on @mgrover1's change to host these on github instead.
> @jthielen @lpilz @andersy005 - I moved those reference files and catalog into a github repository, and adjusted the links accordingly. The following code snippet works!
>
> ...
>
> We can move this directly into this repository instead of the sample repository I added. Here is a link to the repo. @jthielen feel free to submit a PR to the main xwrf repo, otherwise I can do this later today.
That's awesome @mgrover1 that changing the host location/method fixes it! Moving these to xwrf-data seems like a great path forward. Also motivates me to (sometime later) figure out what I messed up in s3 (be it permissions, bucket location, or something else).
> That looks really good, thanks for your work! Did you check whether the `XTIME` vs `Time` dimension lengths give us any problems in postprocessing? When I `open_mfdataset` them locally, they are already combined by `Time`.
Very good point; sorry I didn't mention that side-effect of the kerchunk process earlier! I haven't gotten the chance to test this either. If it presents issues, either I could go back and see if the kerchunk output could be tweaked, or we could just adapt the resulting dataset with `squeeze` and/or `swap_dims`.
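For concreteness, that adaptation might look something like this (a sketch based on the repr above, untested):

# Drop the length-1 'Time' dimension left over from the kerchunk combine,
# then rename 'XTIME' so it becomes the 'Time' dimension downstream code expects
ds = ds.squeeze('Time', drop=True).rename({'XTIME': 'Time'})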
Alright - submitted a PR over in xwrf-data! Added some docs to the readme too; figured it would be good to include these in a separate "catalogs" directory.
With https://github.com/xarray-contrib/xwrf-data/pull/33 now merged, these wrf-cmip6 samples can be loaded simply as
import intake
import xwrf
cat = intake.open_catalog('https://raw.githubusercontent.com/xarray-contrib/xwrf-data/main/catalogs/catalog.yml')
ds_unprocessed = cat["xwrf-sample-ssp245"].to_dask()
ds = ds_unprocessed.xwrf.postprocess()
print(ds)
<xarray.Dataset>
Dimensions: (Time: 124, y: 340, x: 270, z_stag: 40, z: 39,
soil_levels_or_lake_levels_stag: 10,
snow_and_soil_levels_stag: 15,
soil_layers_stag: 4, seed_dim_stag: 2,
x_stag: 271, y_stag: 341, snow_layers_stag: 3,
interface_levels_stag: 16, snso_layers_stag: 7)
Coordinates: (12/15)
CLAT (y, x) float32 dask.array<chunksize=(170, 135), meta=np.ndarray>
* Time (Time) datetime64[ns] 2099-10-01 ... 2099-10-3...
XLAT (y, x) float32 dask.array<chunksize=(170, 135), meta=np.ndarray>
XLAT_U (y, x_stag) float32 dask.array<chunksize=(170, 136), meta=np.ndarray>
XLAT_V (y_stag, x) float32 dask.array<chunksize=(171, 135), meta=np.ndarray>
XLONG (y, x) float32 dask.array<chunksize=(170, 135), meta=np.ndarray>
... ...
* z (z) float32 0.9969 0.9899 ... 0.009174 0.002948
* z_stag (z_stag) float32 1.0 0.9938 ... 0.005896 0.0
* y (y) float64 -3.341e+05 -3.251e+05 ... 2.717e+06
* x (x) float64 -4.728e+06 -4.719e+06 ... -2.307e+06
* y_stag (y_stag) float64 -3.386e+05 ... 2.721e+06
* x_stag (x_stag) float64 -4.733e+06 ... -2.303e+06
Dimensions without coordinates: soil_levels_or_lake_levels_stag,
snow_and_soil_levels_stag, soil_layers_stag,
seed_dim_stag, snow_layers_stag,
interface_levels_stag, snso_layers_stag
Data variables: (12/280)
ACGRDFLX (Time, y, x) float32 dask.array<chunksize=(1, 170, 135), meta=np.ndarray>
ACHFX (Time, y, x) float32 dask.array<chunksize=(1, 170, 135), meta=np.ndarray>
ACLHF (Time, y, x) float32 dask.array<chunksize=(1, 170, 135), meta=np.ndarray>
ACLWDNB (Time, y, x) float32 dask.array<chunksize=(1, 170, 135), meta=np.ndarray>
ACLWDNBC (Time, y, x) float32 dask.array<chunksize=(1, 170, 135), meta=np.ndarray>
ACLWDNT (Time, y, x) float32 dask.array<chunksize=(1, 170, 135), meta=np.ndarray>
... ...
Z_LAKE3D (Time, soil_levels_or_lake_levels_stag, y, x) float32 dask.array<chunksize=(1, 10, 170, 135), meta=np.ndarray>
air_potential_temperature (Time, z, y, x) float32 dask.array<chunksize=(1, 39, 170, 135), meta=np.ndarray>
air_pressure (Time, z, y, x) float32 dask.array<chunksize=(1, 39, 170, 135), meta=np.ndarray>
geopotential (Time, z_stag, y, x) float32 dask.array<chunksize=(1, 40, 170, 135), meta=np.ndarray>
geopotential_height (Time, z_stag, y, x) float32 dask.array<chunksize=(1, 40, 170, 135), meta=np.ndarray>
wrf_projection object +proj=lcc +x_0=0 +y_0=0 +a=6370000 +b=6...
Attributes: (12/149)
ADAPT_DT_MAX: 72.0
ADAPT_DT_MIN: 36.0
ADAPT_DT_START: 54.0
AERCU_FCT: 1.0
AERCU_OPT: 0
AER_ANGEXP_OPT: 1
... ...
WEST-EAST_PATCH_END_STAG: 271
WEST-EAST_PATCH_END_UNSTAG: 270
WEST-EAST_PATCH_START_STAG: 1
WEST-EAST_PATCH_START_UNSTAG: 1
W_DAMPING: 0
YSU_TOPDOWN_PBLMIX: 0
(note that I did go back and regenerate the kerchunk json files to fix up the Time dimension issue pointed out above)
Now, with recent updates to https://github.com/xarray-contrib/xwrf/pull/93, we have destaggering too!
import intake
import xwrf
cat = intake.open_catalog('https://raw.githubusercontent.com/xarray-contrib/xwrf-data/main/catalogs/catalog.yml')
ds_unprocessed = cat["xwrf-sample-ssp245"].to_dask()
ds = ds_unprocessed.xwrf.postprocess().xwrf.destagger()
print(ds['U'])
<xarray.DataArray 'U' (Time: 124, z: 39, y: 340, x: 270)>
dask.array<mul, shape=(124, 39, 340, 270), dtype=float32, chunksize=(1, 39, 170, 135), chunktype=numpy.ndarray>
Coordinates:
XLAT (y, x) float32 dask.array<chunksize=(170, 135), meta=np.ndarray>
XLONG (y, x) float32 dask.array<chunksize=(170, 135), meta=np.ndarray>
* x (x) float64 -4.728e+06 -4.719e+06 ... -2.316e+06 -2.307e+06
* y (y) float64 -3.341e+05 -3.251e+05 ... 2.708e+06 2.717e+06
CLAT (y, x) float32 dask.array<chunksize=(170, 135), meta=np.ndarray>
* Time (Time) datetime64[ns] 2099-10-01 ... 2099-10-31T18:00:00
XTIME (Time) float32 dask.array<chunksize=(124,), meta=np.ndarray>
* z (z) float32 0.9969 0.9899 0.981 0.9698 ... 0.0161 0.009174 0.002948
Hey everyone! I have another update of the blogpost draft ready. I didn't manage to make a GIF of the holoviews interactive plot yet (at least not one without considerable lag), but I hope to have that done tomorrow. However, if somebody else could quickly whip this GIF together, that would be great too!
I now created a small GIF to show off the interactive plot. Please review the blogpost draft above and get back to me regarding improvements, so we can get this show on the road asap.
@lpilz This looks like an awesome demonstration, and the GIF looks great! I do have feedback on several slight adjustments to adapt the format (both technically and descriptively) for the xarray.dev blog. Would you be good with me just making those changes and then submitting a draft PR to the xarray.dev blog from that? Then we'd have a bit easier time collaborating on changes (compared to passing around what amounts to a file attachment here).
I agree! Just didn't know where to start a PR. Now we have xarray.dev/343 to iterate over :)
Closing this, as https://xarray.dev/blog/introducing-xwrf is online :tada: :rocket:
Description
I think that once we have a first release of xwrf, we should write a blog post demonstrating its use. It would be great if one of our WRF expert collaborators could spearhead this blog. Any volunteers?
Implementation
Personally, I think that a Jupyter Notebook is a good medium for a demonstration, and the notebook can be easily converted to a markdown doc for a blog-post.
Tests
N/A
Questions
Before embarking on this, though, we need to complete the features that we want in the first release. That said, I wouldn't be overly excited about delaying the release. Earlier is better, even if incomplete.