zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

Adding GPU CI #2041

Closed akshaysubr closed 3 weeks ago

akshaysubr commented 1 month ago

Opening this issue to try and figure out how to set up GPU CI. The new Buffer, NDBuffer and BufferProtocol abstractions allow adding GPU support but the current blocker for https://github.com/zarr-developers/zarr-python/pull/1967 is the lack of GPU CI.

A few options are available:

  1. Use GitHub's own GPU CI. The pricing for this is $0.07 per minute, or $4.20 per hour.
  2. Use cirun.io to spawn custom cloud GPU spot instances and interface with GitHub Actions. With the same T4 GPU as in option 1, running on AWS, the pricing is $0.24 per hour.
  3. Ubicloud is another option and they have native support for GitHub Actions and they just added GPU runners. They don't mention pricing for their GPU runners though.
  4. Potentially use Quansight's open-gpu-server.
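To put the two quoted rates side by side, here is a rough Python sketch. The two rates are the ones listed above; the number of runs per month and the minutes per run are hypothetical placeholders, not measurements from zarr-python CI.

```python
# Rates quoted in the options above.
GITHUB_T4_PER_MINUTE = 0.07   # USD, GitHub-hosted GPU runner
CIRUN_AWS_T4_PER_HOUR = 0.24  # USD, T4 on AWS via cirun.io


def per_hour(rate_per_minute: float) -> float:
    """Convert a per-minute rate to a per-hour rate."""
    return rate_per_minute * 60


def monthly_cost(rate_per_hour: float, runs_per_month: int, minutes_per_run: float) -> float:
    """Estimate monthly spend for a given hourly rate and CI workload."""
    total_hours = runs_per_month * minutes_per_run / 60
    return rate_per_hour * total_hours


if __name__ == "__main__":
    # Hypothetical workload: adjust to the observed CI usage.
    runs, minutes = 200, 10
    github = monthly_cost(per_hour(GITHUB_T4_PER_MINUTE), runs, minutes)
    cirun = monthly_cost(CIRUN_AWS_T4_PER_HOUR, runs, minutes)
    print(f"GitHub GPU runner: ${github:.2f}/month")
    print(f"cirun on AWS T4:   ${cirun:.2f}/month")
```

The estimate ignores billing granularity, instance start-up time, and any free minutes, so it is only a first-order comparison.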

A very short non-exhaustive survey on how other libraries handle GPU CI:

The main question is which option is best suited to zarr-python and, if we need to pay for cloud cycles, how that can be done as an organization.

cc @jakirkham @rabernat @jhamman

rabernat commented 1 month ago

The github GPU CI would probably be simplest, no? But it doesn't seem to be available yet: https://resources.github.com/devops/accelerate-your-cicd-with-arm-and-gpu-runners-in-github-actions/

We would be happy to pay the costs. But we would want to think carefully about what GPU tests to run as part of CI in order to avoid blowing up the bill.
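One way to keep the set of GPU tests small and explicit is to gate them behind a decorator, so only the tagged tests ever need a GPU runner. The sketch below uses only the standard library `unittest` module and assumes the GPU code path depends on CuPy; the `requires_gpu` name and the example test are hypothetical, not part of zarr-python.

```python
import importlib.util
import unittest


def gpu_library_available() -> bool:
    """Return True if CuPy can be imported (assumed GPU dependency)."""
    return importlib.util.find_spec("cupy") is not None


# Hypothetical decorator: skips the test when CuPy is not installed.
requires_gpu = unittest.skipUnless(
    gpu_library_available(), "CuPy not installed; skipping GPU test"
)


class TestGpuArray(unittest.TestCase):
    @requires_gpu
    def test_sum_on_device(self):
        import cupy as cp  # imported lazily so CPU-only runs never touch it

        data = cp.arange(10)
        self.assertEqual(int(data.sum()), 45)
```

Note that an installed CuPy does not guarantee a usable device, so a real check would probably also query the CUDA runtime.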

byucesoy commented 1 month ago

Hello, This is Burak from Ubicloud team. I'd be happy to answer any questions you might have regarding our runners. You can find pricing info here.

jakirkham commented 1 month ago

@aterrel do you know a good contact at GitHub, who could provide more details on their offering?

@aktech would you be able to share more about the cirun / Quansight approach?

aktech commented 1 month ago

> @aktech would you be able to share more about the cirun / Quansight approach?

Thanks for the ping, @jakirkham. If you have a cloud account in one of the supported clouds, you can spin up pretty cheap GPU runners with cirun. The service itself is free for open source, so you'll be paying the per-second runner cost to the chosen cloud (AWS is usually the best in my experience).

You can also use spot instances on AWS to reduce the cost further. The scverse folks were able to reduce the cost to about 1 cent per run with AWS spot instances: https://github.com/scverse/anndata/issues/1067#issuecomment-1709802780 (of course it depends on the time taken for each run; this is just for perspective).

My opinion might be biased towards cirun (being the founder), so feel free to explore and choose what works best for you. I am happy to help/support to make it work (I just need cloud account access) if you happen to choose cirun.

🚀 Sample run of this repo on GPU runner via cirun.io: https://github.com/aktech/zarr-python/actions/runs/10304821277/job/28524202276#step:4:1

betatim commented 1 month ago

Hello all 👋

I recently set up GPU CI for scikit-learn. We use the GitHub GPU runner. I don't think we had to do anything, such as joining a beta program, to be able to use it. We went this way because we had spent a few months working, on and off, on getting this going on cirun but never got to the finish line. Getting the GitHub GPU CI going was quick enough that we managed to "get it done".

The workflow is defined in https://github.com/scikit-learn/scikit-learn/blob/main/.github/workflows/cuda-ci.yml. It is triggered by applying a "CUDA CI" label to a PR. The label is removed by a separate workflow: https://github.com/scikit-learn/scikit-learn/blob/main/.github/workflows/cuda-label-remover.yml. The reason to split the workflow is permissions.
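The label gate itself is simple. The real setup expresses it in the workflow YAML linked above; the Python sketch below only illustrates the decision, assuming the standard `pull_request` webhook payload (an `action` field and a `label.name` field) and the `GITHUB_EVENT_PATH` variable that GitHub Actions sets. The function names are hypothetical.

```python
import json
import os

TRIGGER_LABEL = "CUDA CI"  # label name mentioned above


def should_run_gpu_ci(event: dict, trigger_label: str = TRIGGER_LABEL) -> bool:
    """Return True only when the event is the trigger label being applied."""
    if event.get("action") != "labeled":
        return False
    label = event.get("label") or {}
    return label.get("name") == trigger_label


def load_event(path=None) -> dict:
    """Load the event payload from GITHUB_EVENT_PATH, or return an empty dict."""
    path = path or os.environ.get("GITHUB_EVENT_PATH")
    if not path:
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    print(should_run_gpu_ci(load_event()))
```

Gating on a label means GPU minutes are spent only when a maintainer opts in for a given PR.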

I also wrote up a blog post describing what we did/what we did wrong/etc. However it isn't quite done yet. I'll link you to it anyway https://hackmd.io/f68r4NjHSvO0tvchb61Z0g?edit - I think in particular the part about where to click in the UI on github.com is useful. The rest is better learned from the workflow files from the repo.

jhamman commented 3 weeks ago

Thanks folks for the input here.

This is set up now using a GitHub Actions runner for the time being. Over the next few months, we'll measure usage and decide on a long-term arrangement.

jakirkham commented 3 weeks ago

Thanks Joe! 🙏