xarray-contrib / xbatcher

Batch generation from xarray datasets
https://xbatcher.readthedocs.io
Apache License 2.0

[use case demonstration] Kvikio Direct-to-gpu -> xarray -> xbatcher -> ml model #87

Open · jhamman opened this issue 2 years ago

jhamman commented 2 years ago

What is your issue?

Recent developments by @nvidia and @dcherian are opening the door to direct-to-GPU data loading in Xarray. Combined with Xbatcher and the TensorFlow or PyTorch data loaders, this could enable a complete workflow, from Zarr all the way to ML model training, without the data ever touching a CPU.

Here's a short illustration of the potential workflow:

import xarray as xr
import xbatcher
import xbatcher.loaders.keras  # the loaders submodule needs an explicit import (requires tensorflow)

# `store` is a Zarr store; the kvikio engine reads chunks straight into GPU memory
ds = xr.open_dataset(store, engine="kvikio", consolidated=False)

# `xvars`/`yvars` are lists of feature and target variable names
x_gen = xbatcher.BatchGenerator(ds[xvars], {'time': 10})
y_gen = xbatcher.BatchGenerator(ds[yvars], {'time': 10})

tf_dataset = xbatcher.loaders.keras.CustomTFDataset(x_gen, y_gen)

model.fit(tf_dataset, ...)
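
The PyTorch route could look much the same; here's a minimal sketch assuming xbatcher's torch loader (xbatcher.loaders.torch.MapDataset) and the same placeholder store/xvars/yvars:

import torch.utils.data
import xarray as xr
import xbatcher
import xbatcher.loaders.torch  # requires pytorch

ds = xr.open_dataset(store, engine="kvikio", consolidated=False)

x_gen = xbatcher.BatchGenerator(ds[xvars], {'time': 10})
y_gen = xbatcher.BatchGenerator(ds[yvars], {'time': 10})

# Map-style dataset that yields one (x, y) pair per xbatcher batch
dataset = xbatcher.loaders.torch.MapDataset(x_gen, y_gen)

# batch_size=None: batches are already assembled by xbatcher
loader = torch.utils.data.DataLoader(dataset, batch_size=None)

for x_batch, y_batch in loader:
    ...  # feed the model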

This would be awesome to demonstrate in a single example. Perhaps as a second tutorial on Xbatcher's documentation site.

xref: https://github.com/xarray-contrib/cupy-xarray/pull/10

cc @dcherian, @negin513, and @weiji14

dcherian commented 2 years ago

I like how you tagged NVIDIA hahaha.

The RAPIDS folks (@jakirkham, @madsbk, @jacobtomlinson) were really interested in a blogpost about this stuff

weiji14 commented 2 years ago

:+1: for a blog post. I'd be happy to contribute to a draft, as @dcherian suggested at a recent Pangeo meeting, for https://medium.com/pangeo (or https://medium.com/rapids-ai), but we'd probably need to wait for https://github.com/pydata/xarray/pull/6874 and https://github.com/zarr-developers/zarr-python/pull/934 to get merged, and for new xarray and Zarr releases, first.

One issue with having this kvikio tutorial on xbatcher's documentation though is that we don't have GPUs in GitHub Actions CI or Readthedocs, so it can't be built dynamically :slightly_smiling_face: We'll either need to cache the outputs, or find another way or place to host the tutorial.

jhamman commented 2 years ago

I love the idea of a blog post here. Perhaps we publish the post in a few places at once (xarray's blog would also work).

> One issue with having this kvikio tutorial on xbatcher's documentation though is that we don't have GPUs in GitHub Actions CI or Readthedocs, so it can't be built dynamically 🙂 We'll either need to cache the outputs, or find another way or place to host the tutorial.

I think it's probably worth publishing a "cached" notebook here even though it won't be run by most folks. A strong disclaimer at the top stating its purpose will probably be sufficient to avoid confusion in the future.
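
Concretely, the caching could be as simple as committing the executed notebook and telling the docs build never to re-execute it. A sketch, assuming the docs use nbsphinx (myst-nb has an equivalent nb_execution_mode = "off" setting):

# docs/conf.py
# Render the committed notebook outputs as-is; never execute on RTD/CI (no GPU there)
nbsphinx_execute = "never"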

dcherian commented 2 years ago

OK thanks for the prompt. I added a super brief intro blogpost here: https://github.com/xarray-contrib/xarray.dev/pull/308 to get the word out. The post proposed here could then just link to that one for extra details.

weiji14 commented 2 years ago

> One issue with having this kvikio tutorial on xbatcher's documentation though is that we don't have GPUs in GitHub Actions CI or Readthedocs, so it can't be built dynamically :slightly_smiling_face: We'll either need to cache the outputs, or find another way or place to host the tutorial.
>
> I think it's probably worth publishing a "cached" notebook here even though it won't be run by most folks. A strong disclaimer at the top stating its purpose will probably be sufficient to avoid confusion in the future.

At https://discourse.pangeo.io/t/statement-of-need-integrating-jupyterbook-and-jupyterhubs-via-ci/2705, there are some ideas on how to run 'expensive' (read: GPU-required) notebooks via the Pangeo Binder JupyterHub. It'll be more work than the caching solution, but probably allows for easier long-term reproducibility for the wider community, especially if the GPU Direct Storage/kvikIO technology gets updated in the future and we need to re-run things for newer versions. Thoughts?

maxrjones commented 2 years ago

> One issue with having this kvikio tutorial on xbatcher's documentation though is that we don't have GPUs in GitHub Actions CI or Readthedocs, so it can't be built dynamically :slightly_smiling_face: We'll either need to cache the outputs, or find another way or place to host the tutorial.
>
> I think it's probably worth publishing a "cached" notebook here even though it won't be run by most folks. A strong disclaimer at the top stating its purpose will probably be sufficient to avoid confusion in the future.
>
> At https://discourse.pangeo.io/t/statement-of-need-integrating-jupyterbook-and-jupyterhubs-via-ci/2705, there are some ideas on how to run 'expensive' (read: GPU-required) notebooks via the Pangeo Binder JupyterHub. It'll be more work than the caching solution, but probably allows for easier long-term reproducibility for the wider community, especially if the GPU Direct Storage/kvikIO technology gets updated in the future and we need to re-run things for newer versions. Thoughts?

I think the eventual goal should be to build the examples that are 'expensive' and cross-cutting in terms of software (e.g., Kvikio Direct-to-gpu -> xarray -> xbatcher -> ml model) as part of the Project Pythia cookbooks, and link to those cookbooks from the individual package docs (e.g., xbatcher). But, as discussed in that thread, some infrastructure developments are required before Project Pythia can support those examples. The notebook discussed here could be a great test case for the integration between JupyterHubs and JupyterBook, and could be "cached" in the xbatcher docs while that development happens.

weiji14 commented 2 years ago

Just on the infrastructure point, I noticed that GPU-enabled GitHub Actions is on the roadmap (https://github.com/github/roadmap/issues/505), but I'm unsure whether it will be limited to Teams/Enterprise plans, as with https://github.blog/changelog/2022-09-01-github-actions-larger-runners-are-now-in-public-beta. In theory, this would allow us to store an uncached version of the notebook and run it from time to time (though it will probably cost some $$).

Still, I think the Project Pythia cookbook method is worth pursuing, as the close integration with Pangeo Binder would allow users to actually run the example kvikIO notebook on the cloud. In practical terms, we could:

  1. Wait for the PRs mentioned in https://github.com/xarray-contrib/cupy-xarray/pull/10 to be merged, and releases made for xarray/cupy-xarray/zarr
  2. Have a 'cached' kvikIO notebook
  3. Have an un-cached kvikIO notebook using either:
    1. GitHub Actions GPU (if it becomes available)
    2. Project Pythia infrastructure

joshmoore commented 2 years ago

> @weiji14 (14 days ago): but probably need to wait for ... zarr-developers/zarr-python#934 to get merged and new xarray and Zarr releases first.

Now available in zarr-python 2.13.0a2 for testing.
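
For anyone wanting to test, note that pip needs the exact pin (or the --pre flag) to pick up an alpha release:

pip install zarr==2.13.0a2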

dcherian commented 2 years ago

Is there a cloud provider that has the necessary GDS stuff set up?

weiji14 commented 2 years ago

> Is there a cloud provider that has the necessary GDS stuff set up?

Tried running on Microsoft Planetary Computer (gpu-pytorch container); GPU Direct Storage doesn't work yet, but compatibility mode does. Below are the results from python single-node-io.py (script from https://github.com/rapidsai/kvikio/blob/29c52f76035002d91f301895250c0ff14f18f50a/python/benchmarks/single-node-io.py):

----------------------------------
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   WARNING - KvikIO compat mode   
      libcufile.so not used       
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
GPU               | Unknown (install pynvml)
GPU Memory Total  | Unknown (install pynvml)
BAR1 Memory Total | Unknown (install pynvml)
GDS driver        | N/A (Compatibility Mode)
GDS config.json   | /etc/cufile.json
----------------------------------
nbytes            | 10485760 bytes (10.00 MiB)
4K aligned        | True
pre-reg-buf       | True
directory         | /tmp/tmp9a8nd5kz
nthreads          | 1
nruns             | 1
==================================
cufile read       |   4.28 GiB/s
cufile write      |  92.59 MiB/s
posix read        |   1.23 GiB/s
posix write       |   1.24 GiB/s

Could try to get a PR in to install the necessary GPU Direct Storage and kvikIO packages; they're usually pretty responsive. Edit: opened an issue at https://github.com/microsoft/planetary-computer-containers/issues/51.
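
As an aside, a quick way to confirm from Python whether kvikIO has fallen back to compatibility mode (i.e. libcufile.so wasn't picked up) is via its defaults module; a minimal sketch:

import kvikio.defaults

# True => POSIX read/write fallback; False => GPU Direct Storage is active
print(kvikio.defaults.compat_mode())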

weiji14 commented 2 years ago

Oh, and if we do get GPU Direct Storage set up on Microsoft Planetary Computer (on Azure West Europe), I have an idea for getting a demo working with the https://github.com/carbonplan/cmip6-downscaling dataset (since it's also on Azure West Europe?). This may or may not require the multi-resolution issue at #93 to be resolved, but it looks like a good Zarr machine-learning dataset to play with.

As a start, I did try this quickly:

xr.open_dataset(
    "https://cpdataeuwest.blob.core.windows.net/cp-cmip/version1/data/DeepSD/ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.DeepSD.pr.zarr",
    engine="kvikio",
    consolidated=False,
)

but got a strange GroupNotFoundError: group not found at path '' (using xr.open_zarr worked fine, though). So realistically there are still a few things to iron out in cupy-xarray and xarray; maybe a month or two's worth of work?
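
For reference, the CPU-path equivalent that did work would be something like this (a sketch; the consolidated flag here is an assumption about the store's metadata):

import xarray as xr

# Default zarr engine: same store, but data lands in NumPy (host) arrays
ds = xr.open_zarr(
    "https://cpdataeuwest.blob.core.windows.net/cp-cmip/version1/data/DeepSD/ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.DeepSD.pr.zarr",
    consolidated=True,
)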

weiji14 commented 1 year ago

Ok, looks like I've severely underestimated how long this is going to take :sweat_smile: Hoping to get some time to work on this in October 2023 :crossed_fingers:, but just gonna make a TODO list on things that need to happen:

Longer term, we'll also look into:

dcherian commented 1 year ago

Maybe start with one cloud provider (AWS?), and ensure that the disk partitions, network connections, and all that are set up properly to ensure low I/O latency.

It may be a lot easier to experiment on NCAR systems once they can do it. @negin513 seems very interested in this kind of thing :)

maxrjones commented 1 year ago

Thanks for creating the to-do list @weiji14! As we discussed earlier today, I'll also have some time in October to contribute, and am particularly interested in the kerchunk connections.

jakirkham commented 1 year ago

Starting with the name brand CSPs is a reasonable first step

While lesser known, CoreWeave has been putting good effort into configuring hardware optimally.

Though if you have your own system that you are planning to use long term, setting up there sounds good

weiji14 commented 1 year ago

Cool, the idea is to enable more people to run kvikIO/NVIDIA GPUDirect Storage, either on a local GPU or in the cloud if they don't have one. That's why I'd like to start with the documentation; we could experiment on NCAR first to understand how involved the configuration is. Once we've figured out the config settings, we can expand to other HPC or commercial cloud systems. That CoreWeave offering does look nice, though I can't tell from their webpage whether they support NVIDIA GDS (I'd like to hope they do)!

weiji14 commented 11 months ago

Have managed to run some benchmark experiments on a WeatherBench2/ERA5 subset comparing the kvikio (GPU-based) and zarr (CPU-based) engines; see https://github.com/zarr-developers/zarr-benchmark/discussions/14, where I describe the technical details. And yes, the benchmark code uses xbatcher too :wink:

[Figure: compare_kvikio_zarr, comparing data-loading times for the kvikio and zarr engines]

Initial results show kvikIO taking ~25% less time to load data than zarr (though I'm not yet confident in that number, because it changes drastically between runs due to factors like caching). I'll be giving a talk next week at FOSS4G SotM Oceania 2023 to get people excited about this, and hope that things can move forward a bit more :smile:
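
At its core, the comparison boils down to something like this sketch (era5_subset.zarr and the batch size are placeholders; the real benchmark code lives in the zarr-benchmark repo linked above):

import time

import xarray as xr
import xbatcher

for engine in ("zarr", "kvikio"):
    ds = xr.open_dataset("era5_subset.zarr", engine=engine, consolidated=False)
    bgen = xbatcher.BatchGenerator(ds, input_dims={"time": 32})

    t0 = time.perf_counter()
    for batch in bgen:
        batch.load()  # force the actual read (into host or device memory)
    print(f"{engine}: {time.perf_counter() - t0:.2f} s")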

KiranModukuri commented 11 months ago

@weiji14 can you please describe where these tests were run: on a local machine or in a cloud environment?

weiji14 commented 11 months ago

Hi @KiranModukuri, yes, these tests were run locally (using an NVIDIA RTX A2000 8GB GPU). I did try to set up a GCP container to run the benchmarks (WeatherBench2's ERA5 is at https://console.cloud.google.com/storage/browser/weatherbench2/datasets/era5), but ran into quota issues allocating GPUs in us-central1, where the dataset is stored.