yutik-nn / Xarray-adventure

SIParCS 2023 project work
https://yutik-nn.github.io/Xarray-adventure/

Dask tutorial #2

Open · yutik-nn opened this issue 1 year ago

dcherian commented 1 year ago
  1. Xarray tutorial: https://tutorial.xarray.dev/intermediate/xarray_and_dask.html
  2. Xarray docs: https://docs.xarray.dev/en/stable/user-guide/dask.html

Your comments on (2) would be super helpful. It should be rewritten to be more friendly to newcomers.

yutik-nn commented 1 year ago

I definitely agree that the Dask tutorial and docs could benefit from restructuring and rewriting. I will leave some comments below for both, from the perspective of a new user.

(1) Tutorial

What's unclear:

  1. "dask lets you scale from memory-sized datasets to disk-sized datasets". Requires more explanation.

  2. The first exercise of the tutorial says: "Try calling mean.values and mean.data. Do you understand the difference?" This comes before any variable 'mean' (or anything mean-related) actually exists. (The difference itself is what the first sketch after this list tries to show.)

  3. It is not very clear when to use .compute() and when to use .load() (see the second sketch after this list).

  4. Lack of clarity when it comes to data types. It is hard to know what cue to look for when trying to tell what kind of data you have: the xarray dataset repr shows a line like data type: numpy.array, but then you see chunks and realize it is a dask array (the first sketch after this list also shows the .chunks cue).

Dask is not a required dependency, and yet it effectively is when using .open_mfdataset, where you clearly see your data being a dask.array.

  5. Everything about a Cluster. Again, when running .open_mfdataset you do not explicitly start a Cluster, yet things run in parallel. What happens if you run client = Client(local_directory='/tmp') in several notebooks? What exactly happens when you execute client.close()? (See the third sketch after this list.)

  6. No real intro anywhere to the dashboard and how to access it, or what to do when you run into an error with the default port (also covered in the third sketch after this list).
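
For points 2 and 4, here is a minimal sketch of the difference and of the cue I eventually settled on. The synthetic DataArray and variable names are just mine; only .values, .data and .chunks are actual xarray attributes:

```python
import numpy as np
import xarray as xr

# A tiny synthetic DataArray, then a chunked (dask-backed) copy of it.
da = xr.DataArray(np.arange(20.0).reshape(4, 5), dims=("x", "y"))
da_dask = da.chunk({"x": 2})

mean = da_dask.mean("y")        # lazy: nothing has been computed yet

print(type(mean.data))          # dask array -> still lazy
print(type(mean.values))        # numpy.ndarray -> .values triggers the computation

# Cue for "what kind of data is this?": look at .chunks (or the type of .data).
print(da.chunks)                # None            -> numpy-backed
print(da_dask.chunks)           # ((2, 2), (5,))  -> dask-backed
```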
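For point 3, my current understanding, sketched below: .compute() returns a new, computed object and leaves the original lazy, while .load() computes in place. The Dataset here is again synthetic:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"t": (("x",), np.random.rand(100))}).chunk({"x": 25})

computed = ds.compute()         # new Dataset backed by numpy; ds itself stays lazy
print(type(ds.t.data))          # still a dask array
print(type(computed.t.data))    # numpy.ndarray

ds.load()                       # computes in place: ds is now numpy-backed too
print(type(ds.t.data))          # numpy.ndarray
```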
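For points 5 and 6, a sketch of explicitly starting a local cluster, finding the dashboard, and shutting everything down. The worker count, dashboard port and local_directory are just illustrative choices:

```python
from dask.distributed import Client, LocalCluster

# Explicitly start a local cluster; pick a non-default dashboard port
# if 8787 is already taken by a cluster in another notebook.
cluster = LocalCluster(n_workers=2, dashboard_address=":8788", local_directory="/tmp")
client = Client(cluster)

print(client.dashboard_link)    # URL of the diagnostics dashboard

# ... xarray/dask computations go here ...

client.close()                  # disconnect this client
cluster.close()                 # shut down the workers and the scheduler
```

As far as I can tell, running client = Client(local_directory='/tmp') in several notebooks starts several independent local clusters, each wanting the default dashboard port (which is presumably where the port complaints come from), and client.close() only tears down the one created in that notebook.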

(2) Docs

  1. This document goes into explaining open_mfdataset(), but then mentions the parallel=True option. Since dask is a required dependency for open_mfdataset, what is the functionality of parallel=True? I think it means that dask.delayed is used, but how is that different from the default chunking? (See the first sketch after this list.)

  2. .persist(): the language of single versus multiple machines is confusing, because I do not think it means cores. What would be an example of a workflow where I'd be using multiple machines, and what is the "machine" in that context? (See the second sketch after this list.)

  3. The note at the end: "In the future, we would like to enable automatic alignment of Dask chunksizes (but not the other way around). We might also require that all arrays in a dataset share the same chunking alignment. Neither of these are currently done."

I do not understand what chunking alignment is (the third sketch after this list is my best guess).
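
For point 1, my current guess, sketched below: parallel=True only parallelizes opening (and preprocessing) the individual files via dask.delayed, while chunks still controls how the resulting arrays are chunked. The file pattern is hypothetical:

```python
import xarray as xr

files = "data/temperature_*.nc"   # hypothetical set of NetCDF files

# Default: the files are opened one after another; the data is still
# lazy (dask-backed) because open_mfdataset always uses dask.
ds_serial = xr.open_mfdataset(files, combine="by_coords")

# parallel=True wraps the per-file open (and any preprocess function)
# in dask.delayed, so the files are opened in parallel; chunks still
# controls how the resulting arrays are chunked.
ds_parallel = xr.open_mfdataset(
    files,
    combine="by_coords",
    parallel=True,
    chunks={"time": 365},
)
```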
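For point 2, the way I currently read it: "multiple machines" means a distributed cluster (HPC or cloud), where .persist() keeps the computed chunks spread across the workers' memory instead of gathering everything back into the notebook process the way .load() does. A sketch, with a local Client standing in for a real multi-machine cluster:

```python
import numpy as np
import xarray as xr
from dask.distributed import Client

client = Client()   # local stand-in for a real multi-machine cluster

ds = xr.Dataset({"t": (("time",), np.random.rand(10_000))}).chunk({"time": 1_000})
anomaly = ds.t - ds.t.mean()

# .persist() starts the computation on the workers and keeps the results
# distributed across their memory; the object is still dask-backed, so
# later operations reuse those in-memory chunks.
anomaly = anomaly.persist()

# .load()/.compute() would instead pull everything into this one process
# as a single numpy array: fine on one machine, wasteful on a big cluster.
print(anomaly.mean().compute())

client.close()
```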
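For point 3, my best guess at what "chunking alignment" means: whether arrays that share a dimension also share the same chunk boundaries along it. A small sketch of misaligned chunks and of unifying them:

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(12.0), dims="x").chunk({"x": 4})  # chunks: 4, 4, 4
b = xr.DataArray(np.arange(12.0), dims="x").chunk({"x": 5})  # chunks: 5, 5, 2

# The chunk boundaries along "x" do not line up, so the chunking is
# "misaligned"; combining the arrays forces dask to split chunks.
print(a.chunks, b.chunks)

c = a + b
print(c.chunks)     # dask falls back to a common, finer chunking

# xarray.unify_chunks rechunks arrays that share dimensions to one scheme.
a2, b2 = xr.unify_chunks(a, b)
print(a2.chunks, b2.chunks)
```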

yutik-nn commented 1 year ago

Also, with no prior exposure to the xarray-dask environment, I feel like introducing Dask by saying "for a full example of how to use xarray’s Dask integration" go read Stephan Hoyer's post is slightly misleading. It's an excellent post, but it dates from the times when xarray was still xray, and I would not say it can be considered a 'full example'. Again, this is not from the standpoint of someone who already uses xarray and dask and knows who the core developers are, but from a beginner's perspective.

yutik-nn commented 1 year ago