zarr-developers / community

An open community with an interest in developing and using new technologies for tensor data storage.

Usage from R #18

Open alimanfoo opened 6 years ago

alimanfoo commented 6 years ago

It would be great to be able to use zarr format data from R. This issue is intended for discussing options for enabling/supporting usage from R.

alimanfoo commented 6 years ago

One option might be to use the zarr python package from R via reticulate. It would be good to try this out and find out if there are any interoperability issues. One way of doing this could be to try to run all the code examples from the zarr tutorial but from R via reticulate. Some benchmarking would probably also be useful, to identify any areas where performance is affected by having to move or translate data between R and python.

If it is a workable option, it might then be cool to write a version of the zarr tutorial but for R users, which could be based off the current zarr python tutorial but include any specific information that R users might need to be aware of.

alimanfoo commented 6 years ago

Another option could be to write R bindings for the Z5 C++ library, e.g., via Rcpp. This would be more work but might provide opportunities for better performance by avoiding any unnecessary data transformations or copies required when using reticulate.

alimanfoo commented 6 years ago

A technical point of interest: R arrays use column-major (Fortran) memory layout. Zarr provides the option to use either row-major (C) or column-major (F) memory layout for data within chunks, and the same layout is used when retrieving data for all or part of a zarr array into a numpy array. E.g.:

In [20]: z = zarr.zeros((100, 100), order='F')

In [21]: a = z[:]

In [22]: a.flags
Out[22]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

So when using zarr from R, using order='F' should be more natural and give better performance.
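
To make the layout point concrete, here is a small sketch in plain Python (no zarr or numpy needed) of how an element's index maps to a flat memory offset under each order. The function name flat_offset is made up for illustration and is not part of zarr or numpy.

```python
def flat_offset(index, shape, order="C"):
    """Return the flat offset of `index` in an array of `shape`.

    order="C": last axis varies fastest (row-major, numpy default).
    order="F": first axis varies fastest (column-major, as in R).
    """
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same length")
    axes = range(len(shape))
    if order == "C":
        axes = reversed(axes)
    elif order != "F":
        raise ValueError("order must be 'C' or 'F'")
    offset, stride = 0, 1
    for axis in axes:
        offset += index[axis] * stride
        stride *= shape[axis]
    return offset


shape = (100, 100)
# Moving down a column (first index + 1) is a step of 1 in F order...
print(flat_offset((1, 0), shape, order="F"))  # 1
# ...but a step of a whole row in C order.
print(flat_offset((1, 0), shape, order="C"))  # 100
```

With order='F' the chunk bytes are already in the order R expects, so in principle no transposition or copy is needed when handing a chunk over to R.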

mikejiang commented 6 years ago

I could be wrong, but from my understanding, isn't zarr more of a piece of software that uses a key-value, chunked, compressed mechanism to provide an efficient on-disk array solution? That is to say, being able to load zarr data in R is far from having a full-fledged R package that can access a zarr backend as efficiently as the current Python library (even if the R binding for the Z5 library is implemented). Can you provide more insight into the amount of software engineering effort required to translate zarr to R without reticulate?

alimanfoo commented 6 years ago

Hi Mike, yes I imagine that having a native implementation would be more powerful than using reticulate, although I have not tried it yet and so don't have a clear view of the limitations.

FWIW there are basically 3 main components in the Zarr internal architecture, each with a simple API.

The storage module contains classes which expose a key-value interface where keys are ASCII strings and values are blobs. Minimum would be an implementation of this interface for the filesystem, allowing you to read Zarr data stored on disk. Other possible implementations include cloud object stores etc.

The codecs module contains classes that expose an encode/decode interface, and includes the main compressors like blosc, gzip etc. For the Python implementation that's in a separate package called numcodecs.

Then the core module provides the translation between an array-like interface and the underlying management, encoding and storage of chunks.

There is also a hierarchy module which deals with creating and accessing groups etc.
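
As a rough illustration of how those layers fit together, here is a toy sketch using only the Python standard library. It is not the actual zarr classes or the full on-disk format: the names DictStore, ZlibCodec and TinyArray are invented here, the metadata is reduced to two fields, and only 1-D byte data is handled.

```python
import json
import zlib


class DictStore(dict):
    """Storage layer: ASCII string keys -> bytes values (here, in memory)."""


class ZlibCodec:
    """Codec layer: anything with encode/decode over bytes."""

    def encode(self, raw: bytes) -> bytes:
        return zlib.compress(raw)

    def decode(self, blob: bytes) -> bytes:
        return zlib.decompress(blob)


class TinyArray:
    """Core layer: maps chunk positions to store keys and back."""

    def __init__(self, store, codec, length, chunk_len):
        self.store, self.codec = store, codec
        self.length, self.chunk_len = length, chunk_len
        # Metadata lives in the store too, under its own key.
        store[".zarray"] = json.dumps(
            {"shape": [length], "chunks": [chunk_len]}
        ).encode("ascii")

    def write(self, data: bytes):
        if len(data) != self.length:
            raise ValueError("data length does not match array length")
        for i, start in enumerate(range(0, self.length, self.chunk_len)):
            chunk = data[start:start + self.chunk_len]
            self.store[str(i)] = self.codec.encode(chunk)

    def read(self) -> bytes:
        n_chunks = -(-self.length // self.chunk_len)  # ceiling division
        return b"".join(
            self.codec.decode(self.store[str(i)]) for i in range(n_chunks)
        )


store = DictStore()
arr = TinyArray(store, ZlibCodec(), length=10, chunk_len=4)
arr.write(bytes(range(10)))
print(sorted(store))                   # ['.zarray', '0', '1', '2']
print(arr.read() == bytes(range(10)))  # True
```

Swapping DictStore for a class that reads and writes files in a directory would give the minimum filesystem implementation mentioned above, without touching the codec or array code.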

FWIW there's a bit more info on the architecture in the ESIP talk I gave here: http://alimanfoo.github.io/2018/04/12/zarr-tech-dive.html

To get a basic working implementation was actually less work than you might imagine, but as always the devil is in the detail.


gdkrmr commented 4 years ago

I took a stab at wrapping z5 from R, it currently compiles, but it is not functional yet and there is still quite a lot of work to do, so don't judge me :-). I am sharing this to avoid duplicated efforts, anyone who wants can join development

https://github.com/gdkrmr/zarr-R

constantinpape commented 4 years ago

I took a stab at wrapping z5 from R, it currently compiles, but it is not functional yet and there is still quite a lot of work to do, so don't judge me

Let me know if you have any questions or need any help from the z5 side.

alimanfoo commented 4 years ago

Thanks a lot for sharing, great to hear about this!


gdkrmr commented 4 years ago

Let me know if you have any questions or need any help from the z5 side.

Thanks, what is the ETA for v2.0.0?

constantinpape commented 4 years ago

Thanks, what is the ETA for v2.0.0?

The API redesign is done, I just need to test it a bit more. Initially, my plan was to wait for the implementation of the S3 backend. I have put in a bit of work into this, but it's not quite done yet and I don't have time to finish it right now. (I was initially hoping for some external contributions to the S3 part, but this hasn't happened yet).

Anyway, I think I will release 2.0.0 without S3 or other cloud backends and push this to 2.1.0. I can probably do it next week. I will let you know once it's there.

jakirkham commented 4 years ago

@gdkrmr, it would be great if you could hop on one of our meetings ( https://github.com/zarr-developers/community/issues/1 ), am sure others would be interested in hearing about your work and how we can help you.

gdkrmr commented 4 years ago

constantinpape commented 4 years ago

* I still need to think how to deal with the other data types (e.g. Booleans).

Just fyi, I don't support bool right now in z5. @alimanfoo @jakirkham Are there any optimisations when zarr stores bools or is it storing a bool as one byte?

* It produces an .so file that is almost 40Mb large :-)

Interesting, for the python bindings the .so is quite a bit smaller, ~ 2.5 MB (build on Ubuntu 18 with gcc 7 and Release).

jakirkham commented 4 years ago

I don't think we are doing anything special. Though could imagine one implementing a bit packing codec.

Maybe there are some compiler flags that can help?

constantinpape commented 4 years ago

Maybe there are some compiler flags that can help?

Probably yes.

@gdkrmr What operating system are you using and which compiler? Are you using CMake? If so, maybe try compiling with Release or with MinSizeRel.

alimanfoo commented 4 years ago

Are there any optimisations when zarr stores bools or is it storing a bool as one byte?

Same as numpy, bool as one byte.

gdkrmr commented 4 years ago

* I still need to think how to deal with the other data types (e.g. Booleans).

Just fyi, I don't support bool right now in z5.

@alimanfoo @jakirkham Are there any optimisations when zarr stores bools or is it storing a bool as one byte?

R stores bools as bytes (EDIT: no, they are stored as int32), because there is also a NA for bools. So I guess the way to go is to add an argument when reading to transform the data either into R integers or bools
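
A minimal sketch of that conversion, written in Python for consistency with the zarr example earlier in the thread rather than in R or C++. It assumes R's convention that logical vectors are 32-bit integers with NA encoded as INT_MIN; the function names and the na_as argument are hypothetical, not part of z5 or any R package.

```python
R_NA_LOGICAL = -2**31  # R's NA for logical/integer vectors is INT_MIN


def r_logical_to_bool_bytes(values, na_as=None):
    """Pack R-style int32 logicals into one byte per element.

    values: iterable of ints (0, 1, or R_NA_LOGICAL).
    na_as:  bool to substitute for NA, or None to raise on NA.
    """
    out = bytearray()
    for v in values:
        if v == R_NA_LOGICAL:
            if na_as is None:
                raise ValueError("NA has no one-byte bool representation")
            v = na_as
        out.append(1 if v else 0)
    return bytes(out)


def bool_bytes_to_r_logical(blob):
    """Widen one-byte bools back to R-style ints (no NA can appear)."""
    return [1 if b else 0 for b in blob]


print(r_logical_to_bool_bytes([1, 0, 1]))                      # b'\x01\x00\x01'
print(r_logical_to_bool_bytes([1, R_NA_LOGICAL], na_as=False))  # b'\x01\x00'
print(bool_bytes_to_r_logical(b"\x01\x00"))                    # [1, 0]
```

The asymmetry is the point: reading one-byte bools into R is lossless, while writing R logicals back needs a policy for NA.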

Maybe there are some compiler flags that can help?

Probably yes.

@gdkrmr What operating system are you using and which compiler? Are you using CMake? If so, maybe try compiling with Release or with MinSizeRel.

Ubuntu 16.04 and I have to use the R build system, which uses Makefiles.

* It produces an .so file that is almost 40Mb large :-)

Interesting, for the python bindings the .so is quite a bit smaller, ~ 2.5 MB (build on Ubuntu 18 with gcc 7 and Release).

I can get the size of the .so down to < 1MB if I strip debug symbols or use link time optimization. I have asked on the R developers mailing list and the CRAN (the official R package repository) policy is quite restrictive with these kinds of flags, so they have to live with it. Ironically their checker throws a warning if the .so gets too large :-). I just found that this was a curious fact, nothing to really worry about.

EDIT: R stores logicals as int32, not uint8

gdkrmr commented 4 years ago

What is the state of Zarr support in R? I haven't looked at my package for a while and wonder if someone else has done some work on this in the meantime or is planning to work on this?

LTLA commented 3 years ago

I'm late to the party, but Googling most permutations of "zarr for R" gives this thread as the top hit, then @gdkrmr's repo, and Bioconductor's ZarrExperiment (I'll get to this later). So I'd guess your stuff is still the best we've got right now.

If you're planning to keep working on your zarr R package, I'd be willing to test it out on some genome-scale data. I've been eyeing some alternatives to HDF5 for a while and would be very interested in building on top of whatever you make.

Our current approach in ZarrExperiment just does the simple thing of dispatching to the Python library via reticulate. A native port would be much preferred if it is feasible. If your package gets more mature, we would use it to create a DelayedArray backend for zarr that would work in all analysis pipelines as a plug-and-play replacement for HDF5.

(Maybe you should call the package zarrr, ho ho ho.)

ocefpaf commented 3 years ago

I'm late to the party, but Googling most permutations of "zarr for R" gives this thread as the top hit, then @gdkrmr's repo, and Bioconductor's ZarrExperiment (I'll get to this later). So I'd guess your stuff is still the best we've got right now.

Same here. I'll be teaching a workshop for R users soon and I was wondering about zarr support. So far I got it via nczarr. See the last cell of this notebook. But it would be nice to add alternatives that don't require a netcdf installation.

joshmoore commented 3 years ago

Would it help to get zarrrrrr interested parties together at the next community meeting (May 5th) to discuss a path forward?

From my side, I'd love to see one (or more?) R implementation in https://github.com/zarr-developers/zarr_implementations/

cc: @gdkrmr @keller-mark @ocefpaf @LTLA (@dominikl? @jkh1?)

jkh1 commented 3 years ago

Count me in. As a regular R user, this is something I've been thinking about recently. I'd favour the C++/Rcpp path over the reticulate approach as I've had issues with reticulate before (in my experience, R doesn't always play well with the various python envs/conda).

davidbrochart commented 3 years ago

Maybe we could provide R bindings of xtensor-zarr? We already do that for xtensor, and there exists an R package for xtensor already. We could improve this package so that it allows Zarr access, and users could use the same package for array processing. The package would then be equivalent to something like Zarr + NumPy.

gdkrmr commented 3 years ago

The netcdf-c library has added support for zarr files. netcdf-c is the basis for the R package ncdf4. There is a discussion on how to get it working in R: https://github.com/Unidata/netcdf-c/issues/1982.

keller-mark commented 3 years ago

This sounds great! I started an extremely rough pure R function for producing a single Zarr chunk from an R matrix here https://github.com/vitessce/vitessce-r/blob/keller-mark/zarr/R/zarr.R#L215 in case anyone is interested. Unfortunately I cannot attend at 2pm eastern time on May 5th due to a conflict but perhaps @ilan-gold @manzt @th789 @mccalluc are interested

gdkrmr commented 3 years ago

I will try to attend but cannot make any promises.

joshmoore commented 3 years ago

Looks like the time slot didn't work out for R folks. No worries. Note that the 19th is cancelled; we'll be back on the regular zoom on the 2nd though. If a different time slot would be better, feel free to say the word.

ocefpaf commented 3 years ago

The netcdf-c library has added support for zarr files. netcdf-c is the basis for the R package ncdf4. There is a discussion on how to get it working in R: Unidata/netcdf-c#1982.

In a way that already works. See the last cell of https://nbviewer.jupyter.org/gist/ocefpaf/4a078b19db4fd5507d2d21691abaa689

But nczarr is not exactly the same as zarr. I'm not well versed in the details but maybe a core zarr (c/c++/rust, whatever) that we can wrap in Python and R is still needed?

joshmoore commented 3 years ago

@ocefpaf : I only know what's on the docs and what I've tested on the CLI, but my understanding was that nczarr has a mode to work with pure Zarr that may be of interest. I'd defer to @DennisHeimbigner whether a portion of the library could be used as a core.

DennisHeimbigner commented 3 years ago

Josh is correct. We support pure zarr read/write, so as long as you are willing to live with the restricted meta-data of pure zarr you can use netcdf-c for pure V2 zarr. The next netcdf-c release (version 4.8.1) will also add support for the Xarray convention for named dimensions. As for pulling out pieces, that is doable. As is usual, the documentation could be improved. Much of the code uses the netcdf internal data structures for implementing the netcdf-C API. But at least these parts might be usable.

  1. the code caching and read/write of chunks
  2. the code that reads/writes zarr metadata.
  3. the code that wraps access to the underlying storage, e.g. files, zip files, and S3.

DennisHeimbigner commented 3 years ago

BTW you could try this experiment with R wrapping netcdf-4.8.0

  1. Take a simple R program that creates a netcdf .nc file, call it simple.nc
  2. Modify the program so that, instead of calling the R equivalent of nc_create("simple.nc"...), it calls the equivalent of nc_create("file://simple.zarr#mode=zarr,file",...)

This should create a directory called "simple.zarr" that contains a pure zarr container. The name "simple.zarr" is not special, you can call it whatever you want. If you try this then let me know what happens.

jkh1 commented 3 years ago

Related discussion on image.sc.

schienstockd commented 2 years ago

I have a beginner's question about opening zarr files with netcdf.

I have built netcdf-c with zarr support (I think) and then built the R package ncdf4. I am not sure how to open the dataset, though. I have an OME-ZARR file generated from bioformats2raw and tried to open the file like this:

library(ncdf4)

# open file
ncin <- nc_open(
  "file:///Users/me/image.ome.zarr#mode=nczarr,zarr"
)
ncin
# Error in R_nc4_inq: NetCDF: Invalid argument
# Error in nc_get_grp_info(gids[ib], root_group$fqgn, format) : 
#   nc_get_grp_info: R_nc4_inq returned error on group id 524289

gdkrmr commented 2 years ago

I couldn't get it to work either, see https://github.com/Unidata/netcdf-c/issues/1982

DennisHeimbigner commented 2 years ago

A couple of things.

  1. It appears you are using netcdf-c version 4.8.1, correct?
  2. Try this command to rule out any R interference: ncdump -h "file:///Users/me/image.ome.zarr#mode=nczarr,zarr"

DennisHeimbigner commented 2 years ago

BTW what operating system are you using?

schienstockd commented 2 years ago

MacOS Catalina 10.15.5

It is a multiscale OME-zarr where the image is in the path /0/0 of the file. When I try to read from the root, the file is not found. When I try to read from the dataset path directly, the header seems to be empty:

> ncdump -h "file:///Users/me/ccidImage.ome.zarr#mode=nczarr,zarr"
ncdump: file:///Users/me/ccidImage.ome.zarr#mode=nczarr,zarr: No such file or directory

> ncdump -h "file:///Users/me/ccidImage.ome.zarr/0/0#mode=nczarr,zarr"
netcdf \0 {
}

nc-config --all

This netCDF 4.8.1-development has been built with the following features: 

  --cc            -> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
  --cflags        -> -I/usr/local/include
  --libs          -> -L/usr/local/lib -lnetcdf
  --static        -> -lhdf5_hl -lhdf5 -lsz -lz -ldl -lm -lsz -lcurl -lzip

  --has-c++       -> no
  --cxx           -> 

  --has-c++4      -> yes
  --cxx4          -> /usr/local/Homebrew/Library/Homebrew/shims/mac/super/clang++
  --cxx4flags     -> -I/usr/local/Cellar/osgeo-netcdf/4.7.4/include
  --cxx4libs      -> -L/usr/local/Cellar/osgeo-netcdf/4.7.4/lib -lnetcdf-cxx4 -lnetcdf

  --has-fortran   -> yes
  --fc            -> /usr/local/bin/gfortran
  --fflags        -> /usr/local/Cellar/osgeo-netcdf/4.7.4/include
  --flibs         -> -L/usr/local/Cellar/osgeo-netcdf/4.7.4/lib
  --has-f90       -> TRUE
  --has-f03       -> FALSE

  --has-dap       -> yes
  --has-dap2      -> yes
  --has-dap4      -> yes
  --has-nc2       -> yes
  --has-nc4       -> yes
  --has-hdf5      -> yes
  --has-hdf4      -> no
  --has-logging   -> no
  --has-pnetcdf   -> no
  --has-szlib     -> yes
  --has-cdf5      -> yes
  --has-parallel4 -> no
  --has-parallel  -> no
  --has-nczarr    -> yes

  --prefix        -> /usr/local
  --includedir    -> /usr/local/include
  --libdir        -> /usr/local/lib
  --version       -> netCDF 4.8.1-development

joshmoore commented 2 years ago

When I try to read from the dataset path directly the header seems to be empty:

Could this be related to the "dimension_separator" metadata, @DennisHeimbigner ? @schienstockd , can you show us the content of 0/0/.zarray?

DennisHeimbigner commented 2 years ago

I think I see the problem. I use a heuristic to break a key into the variable key and the chunk index/key. The heuristic says to take the longest suffix of integers as the chunk index. So, in this case it is eating up too much of the key as the chunk index. I can fix it, but out of curiosity, why do you have a variable named "0"?
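
To illustrate the failure mode, here is a sketch of that kind of heuristic in Python. It is an interpretation of the description above, not the netcdf-c code, and both function names are made up. The second function shows one possible alternative (taking the chunk rank from the array metadata); whether the actual fix works that way is not stated in this thread.

```python
def split_longest_integer_suffix(key, sep="/"):
    """Heuristic: the longest run of trailing integer segments is the chunk index."""
    parts = key.split(sep)
    cut = len(parts)
    while cut > 0 and parts[cut - 1].isdigit():
        cut -= 1
    return sep.join(parts[:cut]), parts[cut:]


def split_with_known_rank(key, rank, sep="/"):
    """Alternative: take exactly `rank` trailing segments as the chunk index."""
    parts = key.split(sep)
    return sep.join(parts[:-rank]), parts[-rank:]


# Works when the variable name is not an integer:
print(split_longest_integer_suffix("temperature/12/3"))
# ('temperature', ['12', '3'])

# Array at path "0/0" with a 5-D chunk index: the name is swallowed too.
key = "0/0/12/3/7/0/0"
print(split_longest_integer_suffix(key))
# ('', ['0', '0', '12', '3', '7', '0', '0'])

# Knowing the array has 5 dimensions removes the ambiguity.
print(split_with_known_rank(key, rank=5))
# ('0/0', ['12', '3', '7', '0', '0'])
```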

schienstockd commented 2 years ago

@joshmoore

0/0/.zarray

{
  "chunks" : [
    1,
    1,
    1,
    512,
    512
  ],
  "compressor" : {
    "clevel" : 5,
    "blocksize" : 0,
    "shuffle" : 1,
    "cname" : "lz4",
    "id" : "blosc"
  },
  "dtype" : ">u2",
  "fill_value" : 0,
  "filters" : null,
  "order" : "C",
  "shape" : [
    180,
    4,
    8,
    512,
    512
  ],
  "zarr_format" : 2,
  "dimension_separator" : "/"
}

I am not sure where the '0' variable comes from .. I used bioformats2raw to convert the image
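
For anyone reading along, the .zarray above can be unpacked with a few lines of standard-library Python to see how many chunks to expect and what their keys look like. This is only a sketch: the metadata is abbreviated to the fields used, and the item size is parsed from the dtype string on the assumption that it is a simple numeric dtype like ">u2".

```python
import json
import math

zarray = json.loads("""
{
  "chunks": [1, 1, 1, 512, 512],
  "dtype": ">u2",
  "shape": [180, 4, 8, 512, 512],
  "dimension_separator": "/"
}
""")

# Number of chunks along each axis (ceiling division).
grid = [math.ceil(s / c) for s, c in zip(zarray["shape"], zarray["chunks"])]
n_chunks = math.prod(grid)

# Assumption: trailing digits of a simple dtype string are the item size in bytes.
itemsize = int(zarray["dtype"][2:])
chunk_bytes = math.prod(zarray["chunks"]) * itemsize

# Key of the last chunk, relative to the array directory.
last_key = zarray["dimension_separator"].join(str(g - 1) for g in grid)

print(grid)         # [180, 4, 8, 1, 1]
print(n_chunks)     # 5760
print(chunk_bytes)  # 524288 (uncompressed)
print(last_key)     # 179/3/7/0/0
```

Because the separator is "/", every chunk key is a nested path of integers under the array directory 0/0, which is the shape of key the chunk-index heuristic has to split.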

jakirkham commented 2 years ago

Looks like there is a very rough Zarr implementation in R

https://github.com/keller-mark/pizzarr

cc @keller-mark (hopefully I've clarified that correctly; please feel free to correct me if not)

keller-mark commented 2 years ago

Yes very rough indeed. Of course open to contributions or more detailed feature requests / issues.

joshmoore commented 2 years ago

See discussion post under https://github.com/zarr-developers/zarr-python/discussions/1088

cc: @mike-lawrence

bart1 commented 1 year ago

The stars package seems to have an implementation (I did not test it): https://r-spatial.org/r/2022/09/13/zarr.html

mike-lawrence commented 1 year ago

The stars package seems to have an implementation (I did not test it): https://r-spatial.org/r/2022/09/13/zarr.html

I think stars only provides read access, no write.

mike-lawrence commented 1 year ago

Seems to be solid progress here

jkh1 commented 1 year ago

The Rarr package is now on Bioconductor. The repository is here. It's written in C and writing is supported although for now limited to double and string types.

mike-lawrence commented 1 year ago

The Rarr package is now on Bioconductor. The repository is here. It's written in C and writing is supported although for now limited to double and string types.

Cool! I always forget to check Bioconductor for packages 🤦‍♂️

keller-mark commented 1 year ago

Hi all, update on pizzarr: some things are working now!

I have updated the docs a bit, with a simple OME-NGFF demo at https://keller-mark.github.io/pizzarr/articles/ome-ngff.html

MSanKeys963 commented 1 year ago

Thanks for working on Pizzarr and updating us, @keller-mark.

May I add this to our website (https://zarr.dev/implementations/)?