xarray-contrib / xarray-tutorial

Xarray Tutorials
https://tutorial.xarray.dev/
Apache License 2.0
165 stars, 105 forks

Add non-geoscience example datasets #172

Open dcherian opened 1 year ago

dcherian commented 1 year ago

Example:

  1. https://sgkit-dev.github.io/sgkit/latest/getting_started.html#data-structures
  2. https://anopheles-genomic-surveillance.github.io/workshop-5/module-1-xarray.html

rsatapat commented 4 months ago

Xarray is a great tool for neuroscience research, since we typically gather data involving multiple dimensions (trials, days, animals, conditions, etc.). The Allen Institute provides an SDK for reading and processing such data, along with an "observatory" that contains relevant data (https://allensdk.readthedocs.io/en/latest/).
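To make the dimensions mentioned above concrete, here is a minimal sketch of how such an experiment could be laid out as a DataArray. The data are random, and the dimension names, sizes, and labels are all invented for illustration; this does not use the Allen SDK or any real recording.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Synthetic stand-in for recorded responses; sizes and labels are invented.
response = xr.DataArray(
    rng.normal(size=(3, 2, 4, 10)),
    dims=("animal", "day", "condition", "trial"),
    coords={
        "animal": ["m1", "m2", "m3"],
        "day": [1, 2],
        "condition": ["baseline", "stim_low", "stim_high", "recovery"],
    },
    name="response",
)

# Average over trials by dimension name, not by axis number.
trial_mean = response.mean(dim="trial")

# Pick one condition by its label rather than by its position.
stim_high = trial_mean.sel(condition="stim_high")
print(stim_high.dims, stim_high.shape)
```

The point of the sketch is that reductions and selections refer to "trial" and "condition" by name, so nobody has to remember which axis is which.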

negin513 commented 1 month ago

Hello @rsatapat, can we add a subset of the data to xarray-data for future tutorials? Are there any concerns about adding a subset of the data for tutorials?

negin513 commented 1 month ago

Relevant content from @jsiegle: https://xarray.dev/blog/xarray-for-neurophysiology

scottyhq commented 1 month ago

Just keeping a list of some other examples here

Already using Xarray:

Would require modification to use xarray instead of numpy or custom objects:

TomNicholas commented 1 month ago

https://discourse.pangeo.io/t/potential-for-adapting-pythia-foundations-for-different-disciplines-e-g-neuro/4239

scottyhq commented 1 month ago

It would be interesting to modify some of these examples to see whether Xarray works well in place of plain NumPy arrays: https://numpy.org/numpy-tutorials/ (also an excellent repository overall).

scottyhq commented 1 month ago

Brainstormed a bit more on this today with @TomNicholas. There are really two separate things to accomplish:

  1. Just highlight (visually) a few non-geoscience example data structures in the tutorial and Xarray docs to make it clear that Xarray is flexible and relevant to different domains. For instance, from the genomic surveillance example above:
    1. "a set of genotype calls obtained from sequencing some mosquitoes. These data can be stored as a 3-dimensional array, where one dimension of the array corresponds to positions (variants) within a reference genome, another dimension corresponds to the individual mosquitoes that were sequenced (samples), and a third dimension corresponds to the number of genomes within each individual (ploidy)." :
image

Note: On one hand it's nice to reuse the existing graphic and actual dataset, but we could simplify even further by reducing the size, adding dimension labels to the image on the left, dropping "alleles", and running set_index() on the DataArray on the right so the two match up easily!
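A reduced version of the genotype example could look something like the sketch below. The call values, genome positions, and sample IDs are made up, and the variable name is only illustrative; it is not the actual workshop dataset or its schema.

```python
import numpy as np
import xarray as xr

# Made-up genotype calls: 4 variants x 3 samples x 2 genomes per individual.
# 0 = reference allele, 1 = alternate allele.
calls = np.array(
    [
        [[0, 0], [0, 1], [1, 1]],
        [[0, 1], [0, 0], [0, 0]],
        [[1, 1], [1, 1], [0, 1]],
        [[0, 0], [0, 0], [0, 0]],
    ],
    dtype="int8",
)

genotypes = xr.DataArray(
    calls,
    dims=("variants", "samples", "ploidy"),
    coords={
        "variants": [1021, 2040, 3555, 4100],        # hypothetical genome positions
        "samples": ["mosq_A", "mosq_B", "mosq_C"],   # hypothetical sample IDs
    },
    name="genotype_calls",
)

# Number of alternate alleles carried by each sample at each variant.
alt_count = (genotypes == 1).sum(dim="ploidy")

# Label-based selection: one mosquito at one genome position.
print(int(alt_count.sel(variants=3555, samples="mosq_C")))
```

With labeled coordinates on the variants and samples dimensions, the array on the right of the figure can be matched to the picture on the left by name.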

  2. Bespoke formats (text or binary) are pervasive, as opposed to HDF, Zarr, netCDF, or TIFF. It would be great to add an example that coerces such a format into Xarray and does a simple, useful visualization or computation.
    1. NumPy .npz files plus metadata, which can be opened into xarray variables easily. Many people definitely still use .npz, but which example in the wild should we use?
    2. A collection of X-ray images could work (https://numpy.org/numpy-tutorials/content/tutorial-x-ray-image-processing.html), but to be really useful we would want to illustrate labeling (and ultimately selection) by physical coordinates, so we would have to invent some (patientID, x_distance(mm)).
      1. This would segue nicely into the docs on building a custom backend: https://tutorial.xarray.dev/advanced/backends/backends.html
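As a rough sketch of the .npz idea, the snippet below writes a tiny archive to an in-memory buffer, reads it back, and attaches invented physical coordinates. The array contents, the `pixel_size_mm` key, the patient IDs, and the dimension names are all assumptions made for illustration; a real example would start from an existing file and its actual metadata.

```python
import io

import numpy as np
import xarray as xr

# Write a small .npz to an in-memory buffer so the sketch is self-contained.
buffer = io.BytesIO()
np.savez(
    buffer,
    images=np.arange(2 * 3 * 4, dtype="float32").reshape(2, 3, 4),
    pixel_size_mm=np.array(0.5),  # hypothetical metadata stored next to the data
)
buffer.seek(0)

# Read the arrays back, as one would from a file on disk.
with np.load(buffer) as npz:
    images = npz["images"]
    pixel_size_mm = float(npz["pixel_size_mm"])

n_patients, ny, nx = images.shape

# Wrap the bare array with named dimensions and invented physical coordinates.
xray = xr.DataArray(
    images,
    dims=("patient_id", "y", "x"),
    coords={
        "patient_id": ["p001", "p002"],
        "y": np.arange(ny) * pixel_size_mm,
        "x": np.arange(nx) * pixel_size_mm,
    },
    attrs={"pixel_size_mm": pixel_size_mm},
    name="intensity",
)

# Select by patient label and by distance in mm instead of by array index.
profile = xray.sel(patient_id="p002", x=1.0)
print(profile.values)
```

The same wrapping logic is roughly what a custom backend would hide behind `xr.open_dataset`, which is why this could lead into the backend docs.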

dcherian commented 1 month ago

https://docs.google.com/forms/d/1x9bOIelnUsDMyI1tF4bN7TWK0v4nBDiwhpxh9mi6PaI/edit#responses

One of the user survey responses specifically calls this out:

Examples with Astropy to read FITS files, using Astropy Tables

scottyhq commented 1 month ago

> Examples with Astropy to read FITS files, using Astropy Tables

Some renewed activity in this repository that seems relevant! https://github.com/ratt-ru/xarray-fits/issues/26

TomNicholas commented 3 weeks ago

@tomwhite mentioned that the sgkit file openers / converters are actually about to be deprecated in favour of a new package called bio2zarr. Basically their motivation is that the text-based VCF format etc. is so awfully-designed that efficient access via a kerchunk-like approach is basically impossible, so they end up having to convert it to zarr anyway.

tomwhite commented 3 weeks ago

> @tomwhite mentioned that the sgkit file openers / converters are actually about to be deprecated in favour of a new package called bio2zarr. Basically their motivation is that the text-based VCF format etc. is so awfully-designed that efficient access via a kerchunk-like approach is basically impossible, so they end up having to convert it to zarr anyway.

The VCF conversion code in sgkit and the new bio2zarr project both output the same Zarr format (specified here). The reason for bio2zarr is that users were struggling to get the Dask-based sgkit VCF conversion working reliably, so the code was rewritten as a command-line application that runs on multi-core local machines or HPC schedulers; bio2zarr is the result.

There are a couple of example sgkit tutorials that may be of interest here: https://sgkit-dev.github.io/sgkit/latest/examples/index.html