Open joshua-gould opened 2 years ago
Hi @joshua-gould. Thanks for the test file. For those reading along:
$ h5ls pbmc3k.h5ad
X Dataset {2638, 1838}
obs Dataset {2638}
obsm Dataset {2638}
raw.X Group
raw.var Dataset {13714}
uns Group
var Dataset {1838}
varm Dataset {1838}
$ h5ls -v pbmc3k.h5ad
Opened "pbmc3k.h5ad" with sec2 driver.
X Dataset {2638/2638, 1838/1838}
    Location: 1:800
    Links: 1
    Chunks: {83, 115} 38180 bytes
    Storage: 19394576 logical bytes, 17564316 allocated bytes, 110.42% utilization
    Filter-0: deflate-1 OPT {4}
    Type: native float
obs Dataset {2638/2638}
    Location: 1:16780884
    Links: 1
    Chunks: {330} 10890 bytes
    Storage: 87054 logical bytes, 42051 allocated bytes, 207.02% utilization
    Filter-0: deflate-1 OPT {4}
    Type: struct {
        "index" +0 16-byte null-padded ASCII string
        "n_genes" +16 native long
        "percent_mito" +24 native float
        "n_counts" +28 native float
        "louvain" +32 native signed char
    } 33 bytes
obsm Dataset {2638/2638}
    Location: 1:16781612
    Links: 1
    Chunks: {83} 20584 bytes
    Storage: 654224 logical bytes, 617223 allocated bytes, 105.99% utilization
    Filter-0: deflate-1 OPT {4}
    Type: struct {
        "X_pca" +0 [50] native float
        "X_tsne" +200 [2] native double
        "X_umap" +216 [2] native double
        "X_draw_graph_fr" +232 [2] native double
    } 248 bytes
raw.X Group
    Attribute: h5sparse_format scalar
        Type: variable-length null-terminated UTF-8 string
        Data: "csr"
    Attribute: h5sparse_shape {2}
        Type: native long
        Data: 2638, 13714
    Location: 1:19223132
    Links: 1
raw.var Dataset {13714/13714}
    Location: 1:20353253
    Links: 1
    Chunks: {429} 11583 bytes
    Storage: 370278 logical bytes, 107553 allocated bytes, 344.27% utilization
    Filter-0: deflate-1 OPT {4}
    Type: struct {
        "index" +0 19-byte null-padded ASCII string
        "n_cells" +19 native long
    } 27 bytes
uns Group
    Location: 1:16782540
    Links: 1
var Dataset {1838/1838}
    Location: 1:16781340
    Links: 1
    Chunks: {460} 11960 bytes
    Storage: 47788 logical bytes, 14501 allocated bytes, 329.55% utilization
    Filter-0: deflate-1 OPT {4}
    Type: struct {
        "index" +0 18-byte null-padded ASCII string
        "n_cells" +18 native long
    } 26 bytes
varm Dataset {1838/1838}
    Location: 1:16781996
    Links: 1
    Chunks: {58} 11600 bytes
    Storage: 367600 logical bytes, 342628 allocated bytes, 107.29% utilization
    Filter-0: deflate-1 OPT {4}
    Type: struct {
        "PCs" +0 [50] native float
    } 200 bytes
The error is coming from the fill_value of (b'', 0, 0., 0., 0) for the dtype
dtype([('index', 'S16'), ('n_genes', '<i8'), ('percent_mito', '<f4'), ('n_counts', '<f4'), ('louvain', 'i1')])
in the obs array. A possible workaround would be to read the HDF5 file with anndata and then use its write_zarr method, which should know how to handle this issue. (Though @ivirhsup and co. can tell us more.)
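For readers who want to see what that fill_value looks like, here is a minimal NumPy-only sketch. It rebuilds the obs dtype quoted above, creates a zero-initialised record of it, and compares a structured array against that record. The `chunk` array is a made-up stand-in, not data from pbmc3k.h5ad, and where exactly Zarr performs its comparison is not shown here. The point is only that a structured array is compared against a structured value of the same dtype, rather than against a plain scalar such as 0.

```python
import numpy as np

# dtype of the obs dataset, copied from the report above
obs_dtype = np.dtype([
    ("index", "S16"),
    ("n_genes", "<i8"),
    ("percent_mito", "<f4"),
    ("n_counts", "<f4"),
    ("louvain", "i1"),
])

# A zero-initialised record of that dtype (0-d structured array);
# this is the kind of value reported as fill_value.
fill_value = np.zeros((), dtype=obs_dtype)

# Hypothetical chunk of three records, all still equal to the fill value.
chunk = np.zeros(3, dtype=obs_dtype)

# Structured vs. structured comparison with the same dtype is well defined
# and is evaluated record by record.
all_fill = bool(np.all(chunk == fill_value))

print(obs_dtype.itemsize)  # 33, matching "} 33 bytes" in the h5ls output
print(fill_value.item())
print(all_fill)
```

The 33-byte item size is a quick sanity check that the NumPy dtype and the HDF5 compound type describe the same packed layout.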
Same issue with copy_all, and when trying to build a Zarr array with an HDF5 structured dtype:
import h5py as h5
import numpy as np
import os
import zarr as za

h5path = '/tmp/toto.h5'
dt = np.dtype([('address', 'S4'), ('value', 'S8')])
try:
    with h5.File(h5path, 'a') as f:
        ds = f.create_dataset("toto", (1,), dtype=dt)
        # Print each public dtype attribute: HDF5-derived value, then NumPy value
        for k in dir(ds.dtype):
            if not k.startswith('__'):
                print(k, getattr(ds.dtype, k), getattr(dt, k))
        z = za.zeros(1, dtype=dt)  # OK
        z = za.zeros(1, dtype=ds.dtype)  # fails
finally:
    # Clean up the temporary HDF5 file
    if os.path.exists(h5path):
        os.remove(h5path)
Traceback (most recent call last):
  File "git/zarr-python/zarr/meta.py", line 123, in decode_array_metadata
    dtype = cls.decode_dtype(meta["dtype"])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "git/zarr-python/zarr/meta.py", line 197, in decode_dtype
    d = cls._decode_dtype_descr(d)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "git/zarr-python/zarr/meta.py", line 192, in _decode_dtype_descr
    d = [(k[0], cls._decode_dtype_descr(k[1])) + tuple(k[2:]) for k in d]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "git/zarr-python/zarr/meta.py", line 192, in <listcomp>
    d = [(k[0], cls._decode_dtype_descr(k[1])) + tuple(k[2:]) for k in d]
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "git/zarr-python/zarr/meta.py", line 192, in _decode_dtype_descr
    d = [(k[0], cls._decode_dtype_descr(k[1])) + tuple(k[2:]) for k in d]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "git/zarr-python/zarr/meta.py", line 192, in <listcomp>
    d = [(k[0], cls._decode_dtype_descr(k[1])) + tuple(k[2:]) for k in d]
          ~^^^
KeyError: 0
The only difference between a NumPy dtype and an HDF5 dtype is the descr attribute.
Same issue in NumPy: https://github.com/numpy/numpy/issues/15488
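To illustrate how two structured dtypes with the same fields can still differ in descr, here is a NumPy-only sketch. The `explicit` dtype is a stand-in built with explicit offsets and padding; whether h5py produces this kind of layout for the dataset in the report is an assumption, not something verified here. `rebuild_dtype` is a hypothetical helper, not part of NumPy, h5py, or Zarr. It rebuilds a packed dtype from field names and base types only. Whether passing such a rebuilt dtype to Zarr avoids the failure above is untested.

```python
import numpy as np


def rebuild_dtype(dt):
    """Return a packed copy of `dt` built only from field names and base types.

    Hypothetical helper for illustration; explicit offsets and padding are dropped.
    """
    dt = np.dtype(dt)
    if dt.subdtype is not None:
        # Sub-array field such as ('<f4', (50,)): rebuild the base, keep the shape.
        base, shape = dt.subdtype
        return np.dtype((rebuild_dtype(base), shape))
    if dt.names is None:
        # Plain (non-structured) type: round-trip through its string code.
        return np.dtype(dt.str)
    return np.dtype([(name, rebuild_dtype(dt.fields[name][0])) for name in dt.names])


# Same two fields as in the reproduction script, declared in two different ways.
plain = np.dtype([("address", "S4"), ("value", "S8")])
explicit = np.dtype({
    "names": ["address", "value"],
    "formats": ["S4", "S8"],
    "offsets": [0, 8],   # assumed layout with 4 bytes of padding after "address"
    "itemsize": 16,
})

print(plain.descr)
print(explicit.descr)  # contains an extra entry describing the padding
print(rebuild_dtype(explicit) == plain)
```

Printing descr for both dtypes side by side is a quick way to check whether a dtype coming from another library carries layout information that a plain list-of-tuples dtype does not.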
Zarr version
2.13.2
Numcodecs version
0.10.2
Python Version
3.10.4
Operating System
Mac
Installation
pip
Description
TypeError: Cannot compare structured or void to non-void arrays when converting h5 file to zarr
Steps to reproduce
Additional output
No response