hexomethyl opened this issue 4 days ago
`int8_t` is a `char`, so consequently, it is represented as a string. Obviously, the actual intent of `int8_t` is to be an integer type, so I'm trying to catch as many instances as possible when coming from Cling to make `int8_t` an integer type, but it's not complete (Cling is no help; I've been asking for proper support for ages), especially with templates, as in your case. I can probably make a reproducer myself for `std::vector<int8_t>`, but for `cudf::column (int8_t)`, if different from the `std::vector` case, you'd have to provide one.
Which version of cppyy is this? If you build master from source: something broke in typedef resolution for `int8_t` when moving to LLVM16 and I have yet to resolve it (those are my final tests to fix before releasing LLVM16-based cppyy), so more tests are failing with HEAD than with the latest release.
Aside, for an LLV, the relevant call is `CreateLowLevelView_i8`, since obviously, `int8_t` can't be type matched.
Running cppyy 3.1.2 (from PyPI).
Both `std::vector` and `cudf::column` face the same issue, which occurs when attempting to obtain a pointer to the container's contiguous buffer via `.data()`. For primitive types this usually results immediately in an LLV; in the case of `int8_t` it results immediately in a Python `str` object, and I have no way of obtaining the raw address to then possibly cast it.
I may investigate further, as it seems cppyy somewhere along the chain converts `int8_t` or `std::int8_t` to `signed char` instead of retaining the original type(?). Not sure; just attempting to brainstorm whether the problem is on my side?
No, again, it's not a conversion (implicit or otherwise): `int8_t` is a typedef of `signed char`. cppyy isn't converting `int8_t` to `signed char`: `signed char` is the actual type. It's just that we don't want the actual type in the case of `int8_t`; we want to pretend it's an integer type.
At issue is that all types need to resolve to something canonical, to make sure there is only ever one type, and typedefs are then simply references to the canonical type. Thus, typedefs are normally completely resolved. For basic use of `int8_t`, however, it is intercepted to make sure it is not resolved, but if `int8_t` appears somewhere along the way (in a chain of typedefs, or, as is the case here, as a template argument), then cppyy never sees the `int8_t`, as LLVM allows jumping directly from a typedef to the canonical type, which is what Cling then does.
That is, something like `typedef int8_t mytype;` and subsequent use of `mytype` never involves `int8_t`, as the canonical type of `mytype` is immediately found to be `signed char`, rather than being found through `int8_t`. Or, more specifically to your problem, `std::vector<int8_t>::value_type` cannot be inferred to be `int8_t`.
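As a rough Python-side analogy (using the standard-library `ctypes`, not cppyy itself): `ctypes.c_int8` is defined as an alias of `ctypes.c_byte`, so the alias is resolved at definition time and the intermediate name is never seen again, much like a typedef resolving to its canonical type:

```python
import ctypes

# c_int8 is defined in the ctypes module as an alias of c_byte, so both
# names refer to the exact same type object -- the "int8" name is
# resolved away, analogous to a C++ typedef resolving to its canonical
# type (signed char).
print(ctypes.c_int8 is ctypes.c_byte)   # True

# An alias of the alias ("typedef of a typedef") still lands directly on
# the same canonical type; the intermediate name is never visible.
mytype = ctypes.c_int8
print(mytype is ctypes.c_byte)          # True
```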
Now, it's possible to create a bunch of special cases for common uses, and `std::vector<int8_t>` could reasonably be considered a common enough type to warrant special handling. But not so `cudf::column`. What I need to know is what the actual use is. If `cudf::column` is a typedef itself, or its access to the buffer is exposed through a typedef similar to `value_type`, then it is simply not possible to intercept `int8_t`. If the top-level use is something like `cudf::column<int8_t>`, then maybe, but only if created directly from Python, not if it is returned from a function.
Note that it's always possible to create a C++ helper function at run time, using `cppdef`, to access the buffer by casting it to a `void*`, but e.g. `std::vector::data` is pythonized to also include the size, so you'd miss out on that.
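For illustration, the "get the raw address and cast it yourself" escape hatch that such a `void*` helper enables can be sketched with plain `ctypes`; the buffer here is a hypothetical stand-in for the actual C++ one:

```python
import ctypes

# Hypothetical stand-in for a C++ buffer of int8_t values.
buf = (ctypes.c_int8 * 4)(-1, 0, 1, 127)

# Given only the raw address (which is all a void*-returning helper
# would hand back), reinterpret the memory as signed 8-bit integers.
addr = ctypes.addressof(buf)
view = (ctypes.c_int8 * 4).from_address(addr)
print(list(view))   # [-1, 0, 1, 127]
```

Note that, unlike the pythonized `std::vector::data`, a raw address carries no size information, so the element count must be known separately.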
Ah, I see now, sorry for the misunderstanding. This is the function for reference; it's called on a `cudf::mutable_column_view` (`cudf::column::mutable_view()`) to obtain the `data` pointer.
I tried a few options, but it falls flat: I can't get the backend (Cling) to have two distinct representations of the same type (such as e.g. `std::vector<signed char>` and `std::vector<int8_t>`) without crashing at some point. Theoretically, there could be two proxies that both resolve to the same type, but then it's still unclear when to resolve to `int8_t` and when to `signed char`. For `std::vector`, it may be somewhat simple, but it's otherwise perfectly possible for a class templated with `int8_t` to sometimes return an actual `signed char`, and vice versa.
That said, I forgot that a return of `unsigned char*` already is a low-level view. This is because (historically) the probability was higher for `unsigned char[]` to mean "array of bytes" than to mean "C string". Thus, `std::vector<uint8_t>::data()` already returns a low-level view. (It also, unfortunately, means that for a vector `v = std.vector[uint8_t](range(10))`, `v[0]` is not the same as `v.data()[0]`. That is, the former is a `str`, the latter an integer.)
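The same char-versus-byte split exists in plain Python, for comparison: indexing a `str` yields a one-character `str`, while indexing a `bytes` object yields an integer:

```python
s = "abc"
b = b"abc"

# Text indexing returns text...
print(s[0])   # a
# ...while bytes indexing returns the integer value of the byte.
print(b[0])   # 97
```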
I implemented the same for `signed char` (and thus `int8_t`), under the same reasoning, and that doesn't break any existing tests. Aside, Python's `array.array` represents both `unsigned char` and `signed char` arrays as integers, too, so it certainly isn't a strange thing to do. This fixes this particular problem (definitely for `vector`, and presumably for `column_view` as well), even as it's still no general solution.
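For reference, the `array.array` behavior mentioned above: both the signed (`'b'`) and unsigned (`'B'`) char type codes yield integers on indexing:

```python
import array

signed = array.array('b', [-1, 0, 127])     # signed char elements
unsigned = array.array('B', [255, 0, 127])  # unsigned char elements

# Both array kinds index as plain Python integers, never as str.
print(signed[0], unsigned[0])   # -1 255
print(type(signed[0]))          # <class 'int'>
```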
Note that, after this change, if for whatever reason you do want to get a `vector<int8_t>`'s data as a Python `str`, you can do `''.join(v)`.
Hey,
I'm interfacing with C++ code that returns `std::vector<std::int8_t>` or `cudf::column (int8_t)`. When attempting to access the underlying buffers of these objects, instead of getting a `cppyy.ll.LowLevelView`, cppyy attempts to implicitly convert the raw pointer into a Python string. Looking at CPyCppyy, there looks to be support for LowLevelView of `char` and `signed char`. Is there any way to prevent this implicit conversion or force the creation of a LowLevelView?