Open jypeter opened 2 years ago
I like this idea, but I'm wondering how this could be implemented in a way that is easy to maintain. Perhaps we could add some functionality to directly download (e.g., from ESGF) example netCDF files (e.g., xcdat.get_test_data())?
I was curious about what xarray does – it seems like they generate toy data rather than providing data.
Should this be a discussion item?
This is the up-to-date link for toy data you mentioned, but I'd rather have data coming from actual netCDF files than toy data generated in memory!
Some not-too-big test data files could come from ESGF, the way I've done it in #284, but we also need a way to get other static/known test data files:
xcdat can handle, and also provide example scripts to show how to correct the files and save corrected files
I have just checked that cartopy mostly generates toy data on the fly for its examples, but iris uses a directory with actual data files (the way vcs and cdms2 did)
>>> import iris
>>> help(iris.sample_data_path)
sample_data_path(*path_to_join)
    Given the sample data resource, returns the full path to the file.

    .. note::

        This function is only for locating files in the iris sample data
        collection (installed separately from iris). It is not needed or
        appropriate for general file access.
>>> iris.sample_data_path("E1_north_america.nc")
'/home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/iris_sample_data/sample_data/E1_north_america.nc'
ls -lh /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/iris_sample_data/sample_data/
total 24M
-rw-rw-r-- 2 jypeter lsce 110K Jun 25 2020 A1B.2098.pp
-rw-rw-r-- 2 jypeter lsce 1.8M Jun 25 2020 A1B_north_america.nc
-rw-rw-r-- 2 jypeter lsce 28K Jun 25 2020 air_temp.pp
-rw-rw-r-- 2 jypeter lsce 34K Jun 25 2020 atlantic_profiles.nc
-rw-rw-r-- 2 jypeter lsce 3.5M Jun 25 2020 colpex.pp
-rw-rw-r-- 2 jypeter lsce 110K Jun 25 2020 E1.2098.pp
-rw-rw-r-- 2 jypeter lsce 1.8M Jun 25 2020 E1_north_america.nc
drwxr-xr-x 2 jypeter lsce 4.0K Sep 10 2021 GloSea4/
-rw-rw-r-- 2 jypeter lsce 662K Jun 25 2020 hybrid_height.nc
-rw-rw-r-- 2 jypeter lsce 7.5M Jun 25 2020 NAME_output.txt
drwxr-xr-x 2 jypeter lsce 4.0K Sep 10 2021 NEMO/
-rw-rw-r-- 2 jypeter lsce 2.0M Jun 25 2020 orca2_votemper.nc
-rw-rw-r-- 2 jypeter lsce 1.7M Jun 25 2020 ostia_monthly.nc
-rw-rw-r-- 2 jypeter lsce 26K Jun 25 2020 polar_stereo.grib2
-rw-rw-r-- 2 jypeter lsce 110K Jun 25 2020 pre-industrial.pp
-rw-rw-r-- 2 jypeter lsce 19K Jun 25 2020 rotated_pole.nc
-rw-rw-r-- 2 jypeter lsce 163K Jun 25 2020 SOI_Darwin.nc
-rw-rw-r-- 2 jypeter lsce 243K Jun 25 2020 space_weather.nc
-rw-rw-r-- 2 jypeter lsce 514K Jun 25 2020 toa_brightness_stereographic.nc
-rw-rw-r-- 2 jypeter lsce 3.3M Jun 25 2020 uk_hires.pp
drwxr-xr-x 2 jypeter lsce 12K Sep 10 2021 UM/
-rw-rw-r-- 2 jypeter lsce 2.4K Jun 25 2020 wind_speed_lake_victoria.pp
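A helper in the spirit of iris.sample_data_path could look roughly like the sketch below. This is only an illustration: the data_dir argument and the error handling are assumptions made for the example, not how iris (or xcdat) actually implements it.

```python
from pathlib import Path


def sample_data_path(*path_to_join, data_dir):
    """Return the full path of a file inside a sample data directory.

    `data_dir` is a hypothetical argument: a real package would more likely
    hard-wire the location of its installed data directory.
    """
    target = Path(data_dir).joinpath(*path_to_join)
    if not target.exists():
        # Fail early with a clear message instead of handing back a bad path
        raise FileNotFoundError(f"Sample file not found: {target}")
    return str(target)
```

Because the files ship with (or next to) the package, nothing is downloaded at run time, which is what makes this approach convenient in CI environments.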
Thanks for this @jypeter. This has been discussed and was already in mind, although a GH issue was not opened for it.
I explored a possible implementation similar to xarray. xarray uses a GH repo (https://github.com/pydata/xarray-data) to host test datasets, and provides xarray.tutorial methods to open up the test datasets using a package called pooch.
We didn't pursue this idea since xarray supports direct download of data using OpenDAP. However, I think this idea is worthwhile because it standardizes and streamlines the testing processes with easy access to the same real-world datasets.
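To make the "download once, then reuse" idea concrete, here is a minimal standard-library sketch of a checksum-verified download cache. It is a stand-in for the kind of thing pooch does, not pooch's API, and the fetch function, its arguments, and the registry layout are all hypothetical.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(name, base_url, registry, cache_dir):
    """Return a local path for `name`, downloading it only when needed.

    `registry` is a hypothetical dict mapping file names to the expected
    SHA-256 digest of each file.
    """
    expected = registry[name]
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if target.exists() and sha256_of(target) == expected:
        return str(target)  # cached copy is intact, no download needed
    with urllib.request.urlopen(base_url + name) as response:
        with open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    if sha256_of(target) != expected:
        target.unlink()  # do not keep a corrupted or unexpected file
        raise ValueError(f"Checksum mismatch for {name}")
    return str(target)
```

Pinning a checksum per file is what guarantees that everybody runs tests and examples against exactly the same data.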
Hmmm, I had a quick look at the pooch GH page. It looks really nice and fancy but:
lib directory. And I hate default cache locations in hidden sub-directories of the users' home dir. We have nightly backups of the home dirs at LSCE, and we archive the interns' home dir when they are finished. I don't want to have backups of hidden test files!
Having a dedicated python package with just the data could also be an easy solution: e.g. basemap-data-hires
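One way to address the cache location concern is to let an environment variable override the default directory. The sketch below assumes a hypothetical XCDAT_DATA_DIR variable and a hypothetical default under ~/.cache; neither exists in xcdat today.

```python
import os
from pathlib import Path


def data_cache_dir(env_var="XCDAT_DATA_DIR", environ=None):
    """Return the directory where sample files should be stored.

    `XCDAT_DATA_DIR` is a made-up variable name used for illustration.
    When it is set, it wins over the default hidden directory in the
    user's home.
    """
    environ = os.environ if environ is None else environ
    override = environ.get(env_var)
    if override:
        return Path(override).expanduser()
    # Hypothetical default, only used when no override is given
    return Path.home() / ".cache" / "xcdat"
```

A site administrator could then set the variable once (for instance to a shared, non-backed-up directory) so that no test files end up in the users' home dirs.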
Another data sample example from xoa
>>> import xoa
>>> xoa.show_data_samples()
gdp-6203641.csv hycom.gdp.u.nc hycom.gdp.v.nc hycom.gdp.h.nc croco.south-africa.surf.nc hycom.cfg croco.cfg gdp.cfg mercator.cfg argo.cfg croco.south-africa.zonal.nc croco.south-africa.meridional.nc ibi-argo-7900573.nc argo-7900573.nc
>>> xoa.get_data_sample('hycom.gdp.u.nc')
'/home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples/hycom.gdp.u.nc'
> du -sh /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples
1.1M /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples
> ls -lh /home/share/unix_files/cdat/miniconda3_21-02/envs/cdatm_py3/lib/python3.8/site-packages/xoa/_samples
total 1.1M
-rw-rw-r-- 2 jypeter lsce 92K Feb 25 09:56 argo-7900573.nc
-rw-rw-r-- 2 jypeter lsce 305 Feb 25 09:56 argo.cfg
-rw-rw-r-- 2 jypeter lsce 714 Feb 25 09:56 croco.cfg
-rw-rw-r-- 2 jypeter lsce 61K Feb 25 09:56 croco.south-africa.meridional.nc
-rw-rw-r-- 2 jypeter lsce 190K Feb 25 09:56 croco.south-africa.surf.nc
-rw-rw-r-- 2 jypeter lsce 61K Feb 25 09:56 croco.south-africa.zonal.nc
-rw-rw-r-- 2 jypeter lsce 43K Feb 25 09:56 gdp-6203641.csv
-rw-rw-r-- 2 jypeter lsce 73 Feb 25 09:56 gdp.cfg
-rw-rw-r-- 2 jypeter lsce 487 Feb 25 09:56 hycom.cfg
-rw-rw-r-- 2 jypeter lsce 174K Feb 25 09:56 hycom.gdp.h.nc
-rw-rw-r-- 2 jypeter lsce 173K Feb 25 09:56 hycom.gdp.u.nc
-rw-rw-r-- 2 jypeter lsce 173K Feb 25 09:56 hycom.gdp.v.nc
-rw-rw-r-- 2 jypeter lsce 71K Feb 25 09:56 ibi-argo-7900573.nc
-rw-rw-r-- 2 jypeter lsce 195 Feb 25 09:56 mercator.cfg
@tomvothecoder was there a plan to have a test suite with just the kind of (few timesteps) data that @jypeter was describing? It seems that CDAT was using the sample_data subdir which enabled testing in the CI envs, similar to what iris appears to do (https://github.com/xCDAT/xcdat/issues/277#issuecomment-1199068571 above)
Note: see example usage of vcs.sample_data + '/tas_mo.nc' in https://github.com/xCDAT/xcdat/issues/310#issuecomment-1212866276
I have added an Easy to use datasets section to my python page, with test/tutorials datasets from several packages
@tomvothecoder It seems that xarray uses xarray.tutorial.load_dataset. Maybe xcdat could have a similar xcdat.tutorial.load_dataset pointing to some useful sample CMIP6 data (and possibly the equivalent CMIP5 data, if somebody wants to make a CMIP5/CMIP6 comparison example)
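A rough sketch of what such a loader could look like is given below. Everything in it is hypothetical: the short names, the placeholder file names, and the data_dir argument are invented for illustration, and only xr.open_dataset and Dataset.load are real xarray calls.

```python
import os

# Hypothetical short names mapped to placeholder file names; a real registry
# would list actual CMIP5/CMIP6 sample files.
SAMPLE_FILES = {
    "cmip6_tas": "cmip6_tas_sample.nc",
    "cmip5_tas": "cmip5_tas_sample.nc",
}


def resolve_sample(name):
    """Translate a short dataset name into its file name."""
    try:
        return SAMPLE_FILES[name]
    except KeyError:
        available = ", ".join(sorted(SAMPLE_FILES))
        raise ValueError(
            f"Unknown sample {name!r}; available: {available}"
        ) from None


def load_dataset(name, data_dir):
    """Open a sample file from `data_dir` and load it fully into memory."""
    import xarray as xr  # imported lazily: resolving names needs no xarray

    path = os.path.join(data_dir, resolve_sample(name))
    with xr.open_dataset(path) as ds:
        return ds.load()
```

Keeping the name-to-file mapping in one dict would make a CMIP5/CMIP6 comparison example a matter of calling load_dataset twice with two different short names.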
Describe your documentation update
I wonder if there are xCDAT (or xarray) test files that can be (pre)downloaded and used for:
I'm thinking of (something like) the cdms2/vcs test data
I think these files are the ones listed in CDMS Sample Dataset and they are still online!