zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License

VLenUTF8().encode(buffer) fails if buffer is read-only #514

Closed ivirshup closed 2 months ago

ivirshup commented 6 months ago

Minimal, reproducible code sample, a copy-pastable example if possible

import numpy as np
from numcodecs import VLenUTF8

codec = VLenUTF8()

a = np.array(list("abc"), dtype=object)
a.flags.writeable = False

codec.encode(a)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[39], line 9
      6 a = np.array(list("abc"), dtype=object)
      7 a.flags.writeable = False
----> 9 codec.encode(a)

File numcodecs/vlen.pyx:87, in numcodecs.vlen.VLenUTF8.encode()

File <stringsource>:663, in View.MemoryView.memoryview_cwrapper()

File <stringsource>:353, in View.MemoryView.memoryview.__cinit__()

ValueError: buffer source array is read-only

Problem description

Short description: this shouldn't error, as the codec shouldn't care whether it can write to the buffer it's passed.

Long description:

I can't think of a reason that .encode would need to modify the buffer, so it shouldn't care that it's read-only.
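To illustrate why encoding only needs read access, here is a minimal pure-Python sketch. The function name `encode_strings_readonly` is hypothetical, and the length-prefixed layout (a 4-byte item count, then a 4-byte length before each UTF-8 payload) is an assumption for illustration, not necessarily numcodecs' exact wire format. It runs on a read-only array because it only iterates over the items.

```python
import struct

import numpy as np


def encode_strings_readonly(values):
    """Hypothetical encoder: reads each item and never writes to `values`."""
    parts = [struct.pack("<I", len(values))]        # little-endian item count
    for item in values:                             # iteration only reads
        data = str(item).encode("utf-8")
        parts.append(struct.pack("<I", len(data)))  # per-item byte length
        parts.append(data)                          # UTF-8 payload
    return b"".join(parts)


a = np.array(list("abc"), dtype=object)
a.flags.writeable = False                           # same setup as the report

encoded = encode_strings_readonly(a)                # succeeds on read-only input
```

Nothing in this sketch mutates `a`, which is the behaviour the report expects from `.encode`.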

Version and installation information

Please provide the following:

Also, if you think it might be relevant, please provide the output from pip list or conda list depending on which was used to install NumCodecs.

conda list

```python
# packages in environment at /mnt/workspace/mambaforge/envs/scanpy-dev:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
anndata 0.10.5.post1 pypi_0 pypi
annoy 1.17.3 pypi_0 pypi
array-api-compat 1.4.1 pypi_0 pypi
asciitree 0.3.3 pypi_0 pypi
asttokens 2.4.1 pypi_0 pypi
atk-1.0 2.38.0 hd4edc92_1 conda-forge
attrs 23.2.0 pypi_0 pypi
bokeh 3.3.4 pypi_0 pypi
bzip2 1.0.8 hd590300_5 conda-forge
ca-certificates 2023.11.17 hbcca054_0 conda-forge
cairo 1.18.0 h3faef2a_0 conda-forge
click 8.1.7 pypi_0 pypi
cloudpickle 3.0.0 pypi_0 pypi
comm 0.2.1 pypi_0 pypi
contourpy 1.2.0 pypi_0 pypi
cycler 0.12.1 pypi_0 pypi
cython 3.0.8 pypi_0 pypi
dask 2024.3.0 pypi_0 pypi
dask-expr 1.0 pypi_0 pypi
dask-glm 0.3.2 pypi_0 pypi
dask-ml 2023.3.24 pypi_0 pypi
debugpy 1.8.0 pypi_0 pypi
decorator 5.1.1 pypi_0 pypi
deprecated 1.2.14 pypi_0 pypi
distributed 2024.1.1 pypi_0 pypi
execnet 2.0.2 pypi_0 pypi
executing 2.0.1 pypi_0 pypi
expat 2.5.0 hcb278e6_1 conda-forge
fasteners 0.19 pypi_0 pypi
fbpca 1.0 pypi_0 pypi
font-ttf-dejavu-sans-mono 2.37 hab24e00_0 conda-forge
font-ttf-inconsolata 3.000 h77eed37_0 conda-forge
font-ttf-source-code-pro 2.038 h77eed37_0 conda-forge
font-ttf-ubuntu 0.83 h77eed37_1 conda-forge
fontconfig 2.14.2 h14ed4e7_0 conda-forge
fonts-conda-ecosystem 1 0 conda-forge
fonts-conda-forge 1 0 conda-forge
fonttools 4.47.2 pypi_0 pypi
freetype 2.12.1 h267a509_2 conda-forge
fribidi 1.0.10 h36c2ea0_0 conda-forge
fsspec 2023.12.2 pypi_0 pypi
future 0.18.3 pypi_0 pypi
gdk-pixbuf 2.42.10 h829c605_4 conda-forge
geosketch 1.2 pypi_0 pypi
gettext 0.21.1 h27087fc_0 conda-forge
giflib 5.2.1 h0b41bf4_3 conda-forge
gprof2dot 2022.7.29 pypi_0 pypi
graphite2 1.3.13 h58526e2_1001 conda-forge
graphtools 1.5.3 pypi_0 pypi
graphviz 9.0.0 h78e8752_1 conda-forge
gtk2 2.24.33 h7f000aa_3 conda-forge
gts 0.7.6 h977cf35_4 conda-forge
h5py 3.10.0 pypi_0 pypi
harfbuzz 8.3.0 h3d44ed6_0 conda-forge
harmonypy 0.0.9 pypi_0 pypi
icu 73.2 h59595ed_0 conda-forge
igraph 0.11.3 pypi_0 pypi
imageio 2.33.1 pypi_0 pypi
importlib-metadata 7.0.1 pypi_0 pypi
iniconfig 2.0.0 pypi_0 pypi
intervaltree 3.1.0 pypi_0 pypi
ipykernel 6.29.0 pypi_0 pypi
ipython 8.20.0 pypi_0 pypi
jedi 0.19.1 pypi_0 pypi
jinja2 3.1.3 pypi_0 pypi
joblib 1.3.2 pypi_0 pypi
jupyter-client 8.6.0 pypi_0 pypi
jupyter-core 5.7.1 pypi_0 pypi
kiwisolver 1.4.5 pypi_0 pypi
lazy-loader 0.3 pypi_0 pypi
ld_impl_linux-64 2.40 h41732ed_0 conda-forge
legacy-api-wrap 1.4 pypi_0 pypi
leidenalg 0.10.2 pypi_0 pypi
lerc 4.0.0 h27087fc_0 conda-forge
libdeflate 1.19 hd590300_0 conda-forge
libexpat 2.5.0 hcb278e6_1 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 13.2.0 h807b86a_4 conda-forge
libgd 2.3.3 h119a65a_9 conda-forge
libglib 2.78.3 h783c2da_0 conda-forge
libgomp 13.2.0 h807b86a_4 conda-forge
libiconv 1.17 hd590300_2 conda-forge
libjpeg-turbo 3.0.0 hd590300_1 conda-forge
libnsl 2.0.1 hd590300_0 conda-forge
libpng 1.6.39 h753d276_0 conda-forge
librsvg 2.56.3 he3f83f7_1 conda-forge
libsqlite 3.44.2 h2797004_0 conda-forge
libstdcxx-ng 13.2.0 h7e041cc_4 conda-forge
libtiff 4.6.0 ha9c0a0a_2 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libwebp 1.3.2 h658648e_1 conda-forge
libwebp-base 1.3.2 hd590300_0 conda-forge
libxcb 1.15 h0b41bf4_0 conda-forge
libxcrypt 4.4.36 hd590300_1 conda-forge
libxml2 2.12.4 h232c23b_1 conda-forge
libzlib 1.2.13 hd590300_5 conda-forge
llvmlite 0.41.1 pypi_0 pypi
locket 1.0.0 pypi_0 pypi
magic-impute 3.0.0 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
markupsafe 2.1.4 pypi_0 pypi
matplotlib 3.8.2 pypi_0 pypi
matplotlib-inline 0.1.6 pypi_0 pypi
matplotx 0.3.10 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
memory-profiler 0.61.0 pypi_0 pypi
msgpack 1.0.7 pypi_0 pypi
multipledispatch 1.0.0 pypi_0 pypi
natsort 8.4.0 pypi_0 pypi
ncurses 6.4 h59595ed_2 conda-forge
nest-asyncio 1.6.0 pypi_0 pypi
networkx 3.2.1 pypi_0 pypi
numba 0.58.1 pypi_0 pypi
numcodecs 0.12.1 pypi_0 pypi
numpy 1.26.3 pypi_0 pypi
openssl 3.2.0 hd590300_1 conda-forge
packaging 23.2 pypi_0 pypi
pandas 2.2.0 pypi_0 pypi
pango 1.50.14 ha41ecd1_2 conda-forge
parso 0.8.3 pypi_0 pypi
partd 1.4.1 pypi_0 pypi
patsy 0.5.6 pypi_0 pypi
pbr 6.0.0 pypi_0 pypi
pcre2 10.42 hcad00b1_0 conda-forge
perfplot 0.10.2 pypi_0 pypi
pexpect 4.9.0 pypi_0 pypi
pillow 10.2.0 pypi_0 pypi
pip 23.3.2 pyhd8ed1ab_0 conda-forge
pixman 0.43.2 h59595ed_0 conda-forge
platformdirs 4.1.0 pypi_0 pypi
pluggy 1.4.0 pypi_0 pypi
profimp 0.1.0 pypi_0 pypi
prompt-toolkit 3.0.43 pypi_0 pypi
psutil 5.9.8 pypi_0 pypi
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
ptyprocess 0.7.0 pypi_0 pypi
pure-eval 0.2.2 pypi_0 pypi
pyarrow 15.0.1 pypi_0 pypi
pygments 2.17.2 pypi_0 pypi
pygsp 0.5.1 pypi_0 pypi
pynndescent 0.5.11 pypi_0 pypi
pyparsing 3.1.1 pypi_0 pypi
pytest 7.4.4 pypi_0 pypi
pytest-mock 3.12.0 pypi_0 pypi
pytest-nunit 1.0.4 pypi_0 pypi
pytest-profiling 1.7.0 pypi_0 pypi
pytest-xdist 3.5.0 pypi_0 pypi
python 3.11.7 hab00c5b_1_cpython conda-forge
python-dateutil 2.8.2 pypi_0 pypi
python-graphviz 0.20.1 pypi_0 pypi
pytz 2023.4 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
pyzmq 25.1.2 pypi_0 pypi
readline 8.2 h8228510_1 conda-forge
rich 13.7.1 pypi_0 pypi
scanorama 1.7.4 pypi_0 pypi
scanpy 1.10.0.dev197+g96e19540 pypi_0 pypi
scikit-image 0.22.0 pypi_0 pypi
scikit-learn 1.4.0 pypi_0 pypi
scikit-misc 0.3.1 pypi_0 pypi
scipy 1.12.0 pypi_0 pypi
scprep 1.1.0 pypi_0 pypi
scrublet 0.2.3 pypi_0 pypi
seaborn 0.13.2 pypi_0 pypi
session-info 1.0.0 pypi_0 pypi
setuptools 69.0.3 pyhd8ed1ab_0 conda-forge
six 1.16.0 pypi_0 pypi
sortedcontainers 2.4.0 pypi_0 pypi
sparse 0.15.1 pypi_0 pypi
stack-data 0.6.3 pypi_0 pypi
statsmodels 0.14.1 pypi_0 pypi
stdlib-list 0.10.0 pypi_0 pypi
tasklogger 1.2.0 pypi_0 pypi
tblib 3.0.0 pypi_0 pypi
texttable 1.7.0 pypi_0 pypi
threadpoolctl 3.2.0 pypi_0 pypi
tifffile 2023.12.9 pypi_0 pypi
tk 8.6.13 noxft_h4845f30_101 conda-forge
toolz 0.12.1 pypi_0 pypi
tornado 6.4 pypi_0 pypi
tqdm 4.66.1 pypi_0 pypi
traitlets 5.14.1 pypi_0 pypi
tzdata 2023.4 pypi_0 pypi
umap-learn 0.5.5 pypi_0 pypi
urllib3 2.1.0 pypi_0 pypi
wcwidth 0.2.13 pypi_0 pypi
wheel 0.42.0 pyhd8ed1ab_0 conda-forge
wrapt 1.16.0 pypi_0 pypi
xorg-kbproto 1.0.7 h7f98852_1002 conda-forge
xorg-libice 1.1.1 hd590300_0 conda-forge
xorg-libsm 1.2.4 h7391055_0 conda-forge
xorg-libx11 1.8.7 h8ee46fc_0 conda-forge
xorg-libxau 1.0.11 hd590300_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xorg-libxext 1.3.4 h0b41bf4_2 conda-forge
xorg-libxrender 0.9.11 hd590300_0 conda-forge
xorg-renderproto 0.11.1 h7f98852_1002 conda-forge
xorg-xextproto 7.3.0 h0b41bf4_1003 conda-forge
xorg-xproto 7.0.31 h7f98852_1007 conda-forge
xyzservices 2023.10.1 pypi_0 pypi
xz 5.2.6 h166bdaf_0 conda-forge
zarr 2.17.1 pypi_0 pypi
zict 3.0.0 pypi_0 pypi
zipp 3.17.0 pypi_0 pypi
zlib 1.2.13 hd590300_5 conda-forge
zstd 1.5.5 hfc55251_0 conda-forge
```

martindurant commented 6 months ago

Does object[:] input_values allow for static (meaning we promise not to change the values, as opposed to changing the value of the pointer)? In true C-land, we cannot truly guarantee that code will not write to any buffer passed.
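For context, the buffer protocol does carry a read-only flag, and a consumer that only asks for read access is not refused. The sketch below uses plain NumPy and the built-in `memoryview`. A `uint8` array stands in for the object array purely to keep the example simple, which is an assumption of this illustration.

```python
import numpy as np

a = np.arange(3, dtype=np.uint8)
a.flags.writeable = False        # mark the array read-only

view = memoryview(a)             # a read-only buffer request succeeds
is_readonly = view.readonly      # the exported buffer reports read-only

try:
    view[0] = 1                  # writing through the view is rejected
    write_error = None
except TypeError as exc:
    write_error = exc
```

The `ValueError` in the traceback above is consistent with the Cython typed memoryview asking for a writable buffer, which a read-only array cannot provide.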

ivirshup commented 6 months ago

Do you mean like:

      Error compiling Cython file:
      ------------------------------------------------------------
      ...
          @cython.wraparound(False)
          @cython.boundscheck(False)
          def encode(self, buf):
              cdef:
                  Py_ssize_t i, l, n_items, data_length, total_length
                  const object[:] values
                  ^
      ------------------------------------------------------------

      numcodecs/vlen.pyx:351:12: Const/volatile base type cannot be a Python object

Apparently not.

I would have thought that this is handleable since pandas is presumably passing these arrays into cython code.

martindurant commented 6 months ago

Pandas has recently started wrapping the low-level arrays into immutable ones, which is maybe why you are seeing this now. I assume they internally access the low-level writable buffer somewhere. I think this is part of their move towards arrow, since arrow buffers are supposed to be immutable (which makes sense when there are offsets/indexes around, rather than just values).

ivirshup commented 6 months ago

> Pandas has recently started wrapping the low-level arrays into immutable ones

It looks like if you access the .array backing a Series you can get a mutable interface to the memory via the public API. Unclear if I should rely on that though.

martindurant commented 6 months ago

If you're not doing any ._data or similar, I don't see why not. It would fail for some extension array that doesn't offer that API, but some extension arrays wouldn't be appropriate input anyway.

Or we could require the caller to always provide a raw, writable numpy-like.
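If the caller is made responsible for passing a writable array, a small helper on the caller's side could look like the sketch below. The name `ensure_writable` is hypothetical and not part of numcodecs or pandas. It copies only when the input is flagged read-only.

```python
import numpy as np


def ensure_writable(arr):
    """Hypothetical helper: return `arr` itself if writable, else a copy."""
    arr = np.asarray(arr)
    if arr.flags.writeable:
        return arr               # no copy needed
    return arr.copy()            # a fresh copy owns its data and is writable


readonly = np.array(list("abc"), dtype=object)
readonly.flags.writeable = False

writable = ensure_writable(readonly)     # copied, because input was read-only
untouched = np.array(list("abc"), dtype=object)
same = ensure_writable(untouched)        # returned as-is
```

The result of `ensure_writable(...)` could then be passed to `codec.encode` without tripping the read-only check.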

ivirshup commented 6 months ago

My concern is that pandas may not intentionally be giving me a writable view, and may change this behaviour in the future.

I was pointed at:

For how pandas deals with this case. AFAICT, it's basically changing the typing from a memoryview to an ndarray.

ivirshup commented 6 months ago

I've opened a PR which should handle this on the numcodecs side. Does the approach look fine to you @martindurant?

martindurant commented 6 months ago

Yes, I suppose it's fine. We should maybe document this somewhere, since having to make a copy of the data, even temporarily, may surprise some people.
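As a rough guide to what such a temporary copy costs, and assuming it is made with something like `ndarray.copy()` (an assumption, since the PR itself is not shown here), copying an object array duplicates the array of references but not the strings they point to:

```python
import numpy as np

a = np.array(["alpha", "beta", "gamma"], dtype=object)
a.flags.writeable = False

b = a.copy()                                        # new, writable array

shares = np.shares_memory(a, b)                     # separate reference storage
same_objects = all(x is y for x, y in zip(a, b))    # strings are not duplicated
copied_bytes = b.nbytes                             # one pointer per element
```

So the extra memory is on the order of one pointer per element rather than a second copy of every string.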

jakirkham commented 2 months ago

Linking the upstream Cython issue: https://github.com/cython/cython/issues/2485