xtensor-stack / xtensor-python

Python bindings for xtensor
BSD 3-Clause "New" or "Revised" License

Numpy Array of Strings #142

Closed iamthebot closed 6 years ago

iamthebot commented 6 years ago

How would one go about taking in (and returning) a numpy array of strings using xtensor-python (assuming ASCII)?

The use case is I have a numpy array containing a bunch of Base64 encoded JPEG images. I want to decode this batch using an OpenMP loop in C++. Ideally I should also be able to return a numpy array of strings.

I know I can work around this by creating a 2D numpy array of bytes (each row contains the ASCII string's bytes) but the problem is that it requires two passes since we have to find the max string length. Not to mention string conversions in python.
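For reference, the two-pass workaround described above can be sketched in plain C++ as follows. This is only an illustration of the packing step: `PackedStrings` and `pack_fixed_width` are hypothetical names, not part of xtensor-python, and the sketch assumes ASCII input held in `std::string` objects.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container: a row-major (count x width) byte buffer,
// one zero-padded row per string.
struct PackedStrings
{
    std::size_t width;        // length of the longest input string
    std::vector<char> bytes;  // count * width bytes
};

// Hypothetical helper showing why the workaround needs two passes.
inline PackedStrings pack_fixed_width(const std::vector<std::string>& items)
{
    // Pass 1: find the longest string, which fixes the row width.
    std::size_t width = 0;
    for (const auto& s : items)
    {
        width = std::max(width, s.size());
    }

    // Pass 2: copy each string into its row; the rest of the row stays '\0'.
    PackedStrings out{width, std::vector<char>(items.size() * width, '\0')};
    for (std::size_t i = 0; i < items.size(); ++i)
    {
        std::copy(items[i].begin(), items[i].end(), out.bytes.data() + i * width);
    }
    return out;
}
```

The resulting buffer could then be exposed as a 2D array of bytes with shape (count, width), which is the layout the workaround relies on.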

iamthebot commented 6 years ago

Looks like attempting to make an xt::pyarray<std::string> yields the following error:

error: 'index' is not a member of 'pybind11::detail::is_fmt_numeric<std::__cxx11::basic_string<char>, void>'
             static constexpr int type_num = value_list[pybind11::detail::is_fmt_numeric<value_type>::index];

SylvainCorlay commented 6 years ago

NumPy arrays of strings are an interesting piece. In fact, they store all the strings of the array in a single buffer. Each string is padded to match the length of the longest.

Maybe we should tackle this using xtl's stack-allocated strings. Although, while NumPy strings are null-terminated, I am not sure they leave space to store the size...

wolfv commented 6 years ago

I think wrapping the buffer, and mapping the contents to a std::string_view could be a good approach, and has the correct semantics. Also there is a constructor from a char* which finds the null-termination automatically.
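A minimal sketch of that idea, using only the standard library: `item_view` is a hypothetical helper, and the buffer is assumed to hold fixed-width, zero-padded ASCII items laid out back to back. One caveat worth noting is that an item which fills its whole slot has no terminating null byte, so the sketch bounds the search with `std::memchr` rather than relying on the `const char*` constructor of `std::string_view`, which would read into the next item.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical helper: returns a non-owning view of item `i` in a buffer of
// fixed-width, zero-padded items. No bytes are copied.
inline std::string_view item_view(const char* buffer, std::size_t width, std::size_t i)
{
    const char* first = buffer + i * width;

    // Look for the first padding byte, but never past the end of this item.
    const void* nul = std::memchr(first, '\0', width);
    const std::size_t len = nul != nullptr
        ? static_cast<std::size_t>(static_cast<const char*>(nul) - first)
        : width;  // the item fills its slot, so there is no terminator

    return std::string_view(first, len);
}
```

Because the views do not own their data, they are only valid while the underlying array buffer is alive and unchanged.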

I think the syntax to create these arrays would look a bit more like xt::pyarray<char[20]>, which I believe is also the syntax pybind11 supports. However, this would imply that you need to know the maximum string length at compile time (and at run time, it's probably advisable to create the array with a concrete dtype, e.g. <U20 for a 20 character string).
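To illustrate what a compile-time width implies, here is a small standard-library sketch that does not use xtensor-python at all. `fixed_bytes` and `make_fixed` are hypothetical names, and `std::array<char, N>` stands in for `char[N]`. As far as I can tell, pybind11 maps `char[N]` to a byte-string dtype (`S<N>`, one byte per character), whereas a `U` dtype stores four bytes per character, so treat the dtype correspondence as an assumption to verify.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical element type: exactly N bytes, zero padded.
template <std::size_t N>
using fixed_bytes = std::array<char, N>;

// Holds on mainstream standard libraries: an item occupies exactly N bytes,
// so a contiguous sequence of items has a stride of N.
static_assert(sizeof(fixed_bytes<20>) == 20, "unexpected padding in std::array");

// Hypothetical helper: copies `s` into an N-byte item. Input longer than N is
// silently truncated, which is the cost of fixing the width at compile time.
template <std::size_t N>
fixed_bytes<N> make_fixed(std::string_view s)
{
    fixed_bytes<N> out{};  // value-initialised, so every byte starts as '\0'
    std::copy_n(s.data(), std::min(s.size(), N), out.data());
    return out;
}
```

For Base64 payloads of varying size, this means N has to be chosen as an upper bound ahead of time, and any longer input must be rejected or handled separately.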

wolfv commented 6 years ago

Merged and released!