Error converting hdf5 file to zarr

joshua-gould commented 2 years ago

Zarr version

2.13.2

Numcodecs version

0.10.2

Python Version

3.10.4

Operating System

Mac

Installation

pip

Description

Converting an HDF5 file to Zarr with zarr.copy_all raises "TypeError: Cannot compare structured or void to non-void arrays".

Steps to reproduce

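import urllib.request

import h5py
import zarr

# Download the example AnnData file (urllib.URLopener does not exist in Python 3).
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/chanzuckerberg/cellxgene/main/example-dataset/pbmc3k.h5ad",
    "pbmc3k.h5ad",
)
with h5py.File("pbmc3k.h5ad", "r") as h5_file:
    # Copy every group and dataset; this is the call that raises the TypeError.
    zarr.copy_all(h5_file, zarr.open("pbmc3k.zarr", mode="w"))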

Additional output

No response

joshmoore commented 2 years ago

Hi @joshua-gould. Thanks for the test file. For those reading along:

$ h5ls pbmc3k.h5ad
X                        Dataset {2638, 1838}
obs                      Dataset {2638}
obsm                     Dataset {2638}
raw.X                    Group
raw.var                  Dataset {13714}
uns                      Group
var                      Dataset {1838}
varm                     Dataset {1838}

$ h5ls -v pbmc3k.h5ad
Opened "pbmc3k.h5ad" with sec2 driver.
X                        Dataset {2638/2638, 1838/1838}
    Location:  1:800
    Links:     1
    Chunks:    {83, 115} 38180 bytes
    Storage:   19394576 logical bytes, 17564316 allocated bytes, 110.42% utilization
    Filter-0:  deflate-1 OPT {4}
    Type:      native float
obs                      Dataset {2638/2638}
    Location:  1:16780884
    Links:     1
    Chunks:    {330} 10890 bytes
    Storage:   87054 logical bytes, 42051 allocated bytes, 207.02% utilization
    Filter-0:  deflate-1 OPT {4}
    Type:      struct {
                   "index"            +0    16-byte null-padded ASCII string
                   "n_genes"          +16   native long
                   "percent_mito"     +24   native float
                   "n_counts"         +28   native float
                   "louvain"          +32   native signed char
               } 33 bytes
obsm                     Dataset {2638/2638}
    Location:  1:16781612
    Links:     1
    Chunks:    {83} 20584 bytes
    Storage:   654224 logical bytes, 617223 allocated bytes, 105.99% utilization
    Filter-0:  deflate-1 OPT {4}
    Type:      struct {
                   "X_pca"            +0    [50] native float
                   "X_tsne"           +200  [2] native double
                   "X_umap"           +216  [2] native double
                   "X_draw_graph_fr"  +232  [2] native double
               } 248 bytes
raw.X                    Group
    Attribute: h5sparse_format scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "csr"
    Attribute: h5sparse_shape {2}
        Type:      native long
        Data:  2638, 13714
    Location:  1:19223132
    Links:     1
raw.var                  Dataset {13714/13714}
    Location:  1:20353253
    Links:     1
    Chunks:    {429} 11583 bytes
    Storage:   370278 logical bytes, 107553 allocated bytes, 344.27% utilization
    Filter-0:  deflate-1 OPT {4}
    Type:      struct {
                   "index"            +0    19-byte null-padded ASCII string
                   "n_cells"          +19   native long
               } 27 bytes
uns                      Group
    Location:  1:16782540
    Links:     1
var                      Dataset {1838/1838}
    Location:  1:16781340
    Links:     1
    Chunks:    {460} 11960 bytes
    Storage:   47788 logical bytes, 14501 allocated bytes, 329.55% utilization
    Filter-0:  deflate-1 OPT {4}
    Type:      struct {
                   "index"            +0    18-byte null-padded ASCII string
                   "n_cells"          +18   native long
               } 26 bytes
varm                     Dataset {1838/1838}
    Location:  1:16781996
    Links:     1
    Chunks:    {58} 11600 bytes
    Storage:   367600 logical bytes, 342628 allocated bytes, 107.29% utilization
    Filter-0:  deflate-1 OPT {4}
    Type:      struct {
                   "PCs"              +0    [50] native float
               } 200 bytes

The error is coming from the fill_value of (b'', 0, 0., 0., 0) for the dtype:

dtype([('index', 'S16'), ('n_genes', '<i8'), ('percent_mito', '<f4'), ('n_counts', '<f4'), ('louvain', 'i1')])

in the obs array. A possible workaround would be to read the HDF5 file with anndata and then use its write_zarr method, which should know how to handle this issue. (Though @ivirshup and co. can tell us more.)
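For readers who want to inspect that fill value without downloading the test file, here is a minimal NumPy-only sketch. It rebuilds the obs dtype quoted above, creates the all-zero structured scalar, and compares it with an array field by field, which avoids comparing a structured value with a plain one. The variable names are illustrative, and this is not a description of how zarr performs its own check; whether a direct structured-versus-plain comparison raises, warns, or returns False appears to depend on the NumPy version.

```python
import numpy as np

# The obs dtype as reported in the comment above.
obs_dtype = np.dtype([
    ("index", "S16"),
    ("n_genes", "<i8"),
    ("percent_mito", "<f4"),
    ("n_counts", "<f4"),
    ("louvain", "i1"),
])

# An all-zero scalar of this dtype is a structured (np.void) value.
fill_value = np.zeros((), dtype=obs_dtype)[()]
print(type(fill_value), fill_value.tolist())
print("itemsize:", obs_dtype.itemsize)  # h5ls reports 33 bytes for this struct

# Field-by-field comparison of an array against the fill value.
arr = np.zeros(3, dtype=obs_dtype)
matches_fill = all(
    bool(np.all(arr[name] == fill_value[name])) for name in obs_dtype.names
)
print("array equals fill value:", matches_fill)
```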

ninousf commented 1 year ago

Same issue with copy_all, and also when trying to build a Zarr array with an HDF5 structured dtype:

import h5py as h5
import numpy as np
import os
import zarr as za

h5path = '/tmp/toto.h5'
dt = np.dtype([('address', 'S4'), ('value','S8')])

try:
    with h5.File(h5path, 'a') as f:
        ds = f.create_dataset("toto", (1,), dtype=dt)

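        # Print every non-dunder dtype attribute side by side (h5py dtype vs. NumPy dtype).
        for k in dir(ds.dtype):
            if not k.startswith('__'):
                print(k, getattr(ds.dtype, k), getattr(dt, k))

        z = za.zeros(1, dtype=dt)  # OK: plain NumPy dtype
        z = za.zeros(1, dtype=ds.dtype)  # fails: dtype read back from h5py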

finally:
    if os.path.exists(h5path):
        os.remove(h5path)

Traceback (most recent call last):
  File "git/zarr-python/zarr/meta.py", line 123, in decode_array_metadata
    dtype = cls.decode_dtype(meta["dtype"])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "git/zarr-python/zarr/meta.py", line 197, in decode_dtype
    d = cls._decode_dtype_descr(d)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "git/zarr-python/zarr/meta.py", line 192, in _decode_dtype_descr
    d = [(k[0], cls._decode_dtype_descr(k[1])) + tuple(k[2:]) for k in d]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "git/zarr-python/zarr/meta.py", line 192, in <listcomp>
    d = [(k[0], cls._decode_dtype_descr(k[1])) + tuple(k[2:]) for k in d]
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "git/zarr-python/zarr/meta.py", line 192, in _decode_dtype_descr
    d = [(k[0], cls._decode_dtype_descr(k[1])) + tuple(k[2:]) for k in d]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "git/zarr-python/zarr/meta.py", line 192, in <listcomp>
    d = [(k[0], cls._decode_dtype_descr(k[1])) + tuple(k[2:]) for k in d]
          ~^^^
KeyError: 0
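Reading the traceback, the KeyError seems to come from the recursion reaching the metadata dict and indexing it with 0. The pure-Python sketch below re-creates only the list comprehension shown at meta.py line 192; the function name, the isinstance guard, and the JSON round-trip (standing in for stored array metadata) are assumptions made for illustration, not a copy of zarr's code.

```python
import json


def decode_dtype_descr(d):
    # Stand-in for the comprehension at meta.py line 192 in the traceback.
    # The isinstance guard is an assumption about the surrounding code.
    if isinstance(d, list):
        d = [(k[0], decode_dtype_descr(k[1])) + tuple(k[2:]) for k in d]
    return d


# JSON turns tuples into lists, as would happen for stored metadata.
plain = json.loads(json.dumps([("address", "|S4"), ("value", "|S8")]))
with_meta = json.loads(json.dumps([
    ("address", ("|S4", {"h5py_encoding": "ascii"})),
    ("value", ("|S8", {"h5py_encoding": "ascii"})),
]))

decoded_plain = decode_dtype_descr(plain)
print(decoded_plain)

try:
    decode_dtype_descr(with_meta)
    error = None
except KeyError as exc:
    # The recursion reaches the metadata dict and evaluates k[0] on it.
    error = exc
print(repr(error))
```

In this sketch the plain descriptor decodes to a list of tuples, while the descriptor carrying the h5py metadata fails once the dict is treated like a nested field entry.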

The only difference between a NumPy dtype and an HDF5 dtype is the descr attribute:

- HDF5: [('address', ('|S4', {'h5py_encoding': 'ascii'})), ('value', ('|S8', {'h5py_encoding': 'ascii'}))]
- NumPy: [('address', '|S4'), ('value', '|S8')]

Same issue in NumPy: https://github.com/numpy/numpy/issues/15488
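One possible way around the descr difference is to rebuild the dtype without its metadata before handing it to zarr. The sketch below is NumPy-only; strip_dtype_metadata is a hypothetical helper, not part of zarr, h5py, or NumPy, and the metadata dict is attached by hand to mimic what h5py reports in this thread. Note that dropping metadata also discards information h5py relies on for some types (for example variable-length strings), and whether zarr then accepts the cleaned dtype was not verified here.

```python
import numpy as np


def strip_dtype_metadata(dt):
    """Return an equivalent dtype with metadata removed (hypothetical helper)."""
    dt = np.dtype(dt)
    if dt.fields is not None:
        # Structured dtype: rebuild field by field, keeping offsets and itemsize.
        return np.dtype({
            "names": list(dt.names),
            "formats": [strip_dtype_metadata(dt.fields[n][0]) for n in dt.names],
            "offsets": [dt.fields[n][1] for n in dt.names],
            "itemsize": dt.itemsize,
        })
    if dt.subdtype is not None:
        # Sub-array dtype such as ('<f4', (50,)): clean the base type.
        base, shape = dt.subdtype
        return np.dtype((strip_dtype_metadata(base), shape))
    # Plain dtype: rebuilding from the type string drops the metadata.
    return np.dtype(dt.str)


# Mimic the h5py dtype from the example above by attaching metadata manually.
s4 = np.dtype("S4", metadata={"h5py_encoding": "ascii"})
s8 = np.dtype("S8", metadata={"h5py_encoding": "ascii"})
h5_like = np.dtype([("address", s4), ("value", s8)])

clean = strip_dtype_metadata(h5_like)
print(h5_like.descr)
print(clean.descr)
```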
