Some of these solutions probably can't be implemented in Python; instead, they would have to be implemented in a compiled language like C++ or Rust.
Start simple: loading a single uncompressed file. Note that fio shows that, on an SSD (and maybe on an HDD), best performance requires reading multiple parts of the file in parallel. Compare Zarr to numpy.
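As a starting point, a minimal timing sketch for that comparison is below. The paths `data.npy` and `data.zarr` are placeholders, and `data.zarr` is assumed to hold a single array saved ahead of time.

```python
import time

import numpy as np
import zarr


def time_load(label, fn):
    t0 = time.perf_counter()
    arr = fn()
    print(f"{label}: {time.perf_counter() - t0:.3f} s, shape={arr.shape}")


# np.load reads one uncompressed file in a single pass.
time_load("numpy", lambda: np.load("data.npy"))

# Zarr reads (and, if compressed, decodes) every chunk, then assembles them.
time_load("zarr", lambda: zarr.open("data.zarr", mode="r")[:])
```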
Lots of round-trips to RAM: decompression copies compressed data from RAM into CPU cache, then copies the uncompressed chunk back to RAM. Then, when we copy those individual uncompressed chunks into the final, contiguous numpy array, we have to pull each chunk into CPU cache again and copy it back out to the merged array in RAM. It would be better to decompress and move each chunk immediately, while the uncompressed data is still in CPU cache.
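A minimal sketch of decompressing straight into the chunk's final home, assuming each chunk maps onto a contiguous block of rows of the destination (so the output view is contiguous). It uses numcodecs' Blosc codec and its `decode(buf, out=...)` parameter; the shapes and the randomly generated "compressed chunks" are purely illustrative.

```python
import numpy as np
from numcodecs import Blosc

codec = Blosc(cname="zstd")
chunk_rows, n_chunks = 256, 4
final = np.empty((chunk_rows * n_chunks, 1024), dtype=np.float32)

# Stand-ins for chunks read from disk: one compressed buffer per chunk.
compressed = [
    codec.encode(np.random.rand(chunk_rows, 1024).astype(np.float32))
    for _ in range(n_chunks)
]

for i, buf in enumerate(compressed):
    # Decode directly into a view of the merged array: the uncompressed
    # bytes land in their final position while still hot in cache, instead
    # of going to a temporary chunk array and being copied again later.
    codec.decode(buf, out=final[i * chunk_rows : (i + 1) * chunk_rows])
```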
Decompress chunk n whilst loading chunk n+1, in parallel.
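A minimal pipelining sketch, assuming we already know each chunk's byte offset and length within the file (names and layout are illustrative). The single-threaded IO pool reads the next chunk while the main thread decompresses the current one; the file read releases the GIL, so the two overlap.

```python
from concurrent.futures import ThreadPoolExecutor

from numcodecs import Blosc

codec = Blosc(cname="zstd")


def read_chunk(path, offset, length):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def load_pipelined(path, chunk_locations):  # [(offset, length), ...]
    decompressed = []
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        future = io_pool.submit(read_chunk, path, *chunk_locations[0])
        for next_loc in chunk_locations[1:]:
            buf = future.result()                                  # wait for chunk n
            future = io_pool.submit(read_chunk, path, *next_loc)   # start chunk n+1
            decompressed.append(codec.decode(buf))                 # decompress n meanwhile
        decompressed.append(codec.decode(future.result()))         # last chunk
    return decompressed
```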
Decompress multiple chunks in parallel.
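A minimal sketch of decompressing several chunks concurrently with a thread pool. This only pays off if the codec releases the GIL during the C decompression call; otherwise a process pool, or a compiled implementation as noted above, would be needed.

```python
from concurrent.futures import ThreadPoolExecutor

from numcodecs import Blosc

codec = Blosc(cname="zstd")


def decompress_all(compressed_chunks, max_workers=8):
    # Each worker decodes one chunk; results come back in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(codec.decode, compressed_chunks))
```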
Decompress into an existing buffer (so we only have to create a buffer once).
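A minimal sketch of buffer reuse, assuming every chunk decompresses to the same size (as interior Zarr chunks do); the shape and dtype are illustrative.

```python
import numpy as np
from numcodecs import Blosc

codec = Blosc(cname="zstd")
chunk_shape, dtype = (256, 1024), np.float32
scratch = np.empty(chunk_shape, dtype=dtype)  # allocated once, reused per chunk


def decode_into_scratch(compressed_chunk):
    # Decode into the preallocated buffer instead of allocating a new one.
    codec.decode(compressed_chunk, out=scratch)
    return scratch  # caller must copy/consume it before the next call
```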
memcpy chunks to the final contiguous numpy array in parallel.
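A minimal sketch of scattering decompressed chunks into the merged array from several threads. Whether this scales depends on NumPy releasing the GIL during the copies and on the available memory bandwidth; the function and argument names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def scatter_chunks(final, chunks_and_slices, max_workers=8):
    """chunks_and_slices: iterable of (ndarray, tuple-of-slices) pairs."""

    def copy_one(item):
        chunk, where = item
        final[where] = chunk  # bulk copy of this chunk into its place

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(copy_one, chunks_and_slices))
```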
Partial loading from disk (only for uncompressed chunks).
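A minimal sketch of reading only the byte range we need from an uncompressed, C-ordered chunk whose offset within the file is known; the helper name and arguments are illustrative. (With compressed chunks this isn't possible, because the bytes for an arbitrary slice can't be located without decompressing.)

```python
import numpy as np


def read_partial(path, chunk_offset, row_start, row_stop, row_nbytes, dtype):
    """Read rows [row_start, row_stop) of an uncompressed C-ordered chunk."""
    with open(path, "rb") as f:
        f.seek(chunk_offset + row_start * row_nbytes)
        raw = f.read((row_stop - row_start) * row_nbytes)
    # frombuffer avoids an extra copy; the result is read-only.
    return np.frombuffer(raw, dtype=dtype).reshape(row_stop - row_start, -1)
```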
Use the operating system's high-performance async, batched IO (io_uring on Linux, kqueue on macOS, and IOCP on Windows).
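io_uring, kqueue and IOCP aren't reachable from pure Python, so the sketch below is only a stand-in for the same pattern (submitting a whole batch of positional reads at once), built on `os.pread` and a thread pool, Unix-only. A real implementation would sit in a native extension, as noted above.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def batch_read(path, requests, max_workers=16):
    """requests: list of (offset, length); returns the bytes for each request."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # os.pread does a positional read, so threads can share one fd
            # without seeking; results come back in request order.
            return list(pool.map(lambda r: os.pread(fd, r[1], r[0]), requests))
    finally:
        os.close(fd)
```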