wlav / cppyy

Other
412 stars 42 forks source link

Stop implicit conversion to string #276

Open hexomethyl opened 4 days ago

hexomethyl commented 4 days ago

Hey,

I'm interfacing with C++ code that returns std::vector<std::int8_t> or cudf::column (int8_t). When attempting to access the underlying buffers of these objects instead of getting a cppyy.ll.LowLevelView. cppyy attempts to implicitly convert the raw pointer into a python string.

Looking at CPyCppyy, there looks to be support for LowLevelView of char and signed char image

Is there anyway to prevent this implicit conversion or force the creation of a LowLevelView ?

wlav commented 2 days ago

int8_t is a char, so consequently, it is represented as a string. Obviously, the actual intent of int8_t is to be an integer type, so I'm trying to catch as many instances as possible when coming from Cling to make int8_t an integer type, but it's not complete (Cling is no help; I've been asking for proper support for ages), especially with templates as is your case. I can probably make a reproducer myself for std::vector<int8_t>, but for cudf::column (int8_t), if different from the std::vector case, you'd have to provide one.

Which version of cppyy is this? If you build master from source, something broke in typedef resolution for int8_t when moving to LLVM16 and I have yet to resolve it (those are my final tests to fix before releasing LLVM16-based cppyy), so more tests are failing with HEAD than with the latest release.

Aside, for LLV, the relevant call is CreateLowLevelView_i8, since obviously, int8_t can't be type matched.

hexomethyl commented 2 days ago

Running cppyy on 3.1.2 (pypi.org)

Both std::vector and cudf::column face the same issue, as the occurs when attempting to obtain a pointer to the containers contiguous buffer via .data(). For primitive types this usually results immediately in a LLV, in the case of int8_t this results immediately in a python str object, I have no way of obtaining the raw address to then possibly cast it.

I may investigate further as it seems cppyy somewhere along the chain converts int8_t or std::int8_t to signed char instead of retaining the original type (?), not sure just attempting to brainstorm if the problems are on my side ?

wlav commented 2 days ago

No, again, it's not a conversion (implicit or otherwise): int8_t is a typedef of signed char. cppyy isn't converting int8_t to signed char: signed char is the actual type. It's just that we don't want the actual type in case of int8_t, but we want to pretend it's an integer type.

At issue is that all types need to resolve to something canonical to make sure there's ever always one type only and typedefs are then simply references to the canonical type. Thus, typedefs are normally completely resolved. For basic use of int8_t, however, it is intercepted to make sure it is not resolved, but if int8_t appears somewhere along the way (in a chain of typedefs, or as is the case here as a template argument), then cppyy never sees the int8_t as LLVM allows jumping directly from a typedef to the canonical type, which is what Cling then does.

That is, something like typedef int8_t mytype; and subsequent use of mytype never involves int8_t as the canonical type of mytype is immediately found to be signed char, rather than finding it through int8_t. Or more specifically to your problem, std::vector<int8_t>::value_type can not be inferred to be int8_t.

Now, it's possible to create a bunch of special cases for common cases and std::vector<int8_t> could reasonably be considered a common enough type to warrant special handling. But not so cudf::column. What I need to know is what the actual use is. If cudf::column is a typedef itself or it's access to the buffer is exposed as a typedef similar to value_type, then it is simply not possible to intercept int8_t. If the top-level use is something like cudf::column<int8_t> then maybe, but only if created directly from Python, not if it is returned from a function.

Note that it's always possible to create a C++ helper functions at run-time, using cppdef to access the buffer, by casting it to a void*, but e.g. std::vector::data is Pythonized to also include the size, so you'd miss out on that.

hexomethyl commented 1 day ago

Ah I see now, sorry for the misunderstanding. This is the function for reference, that's called on a cudf::mutable_column_view (cudf::column::mutable_view()) to obtain the data pointer.

wlav commented 5 hours ago

I tried a few options, but it falls flat: I can't get the backend (Cling) to have two distinct representations for the same type (such as e.g. std::vector<signed char> and std::vector<int8_t>), w/o crashing at some point. Theoretically, there could be two proxies that resolve both to the same type, but then it's still unclear when to resolve to int8_t and when to signed char. For std::vector, it is may be somewhat simple, but it's otherwise perfectly possible for a class templated with int8_t to sometimes return an actual signed char and vv.

That said, I forgot that a return of unsigned char* already is a low level view. This b/c (historically) the probability was higher for unsigned char[] to mean "array of bytes" than to mean "c-string". Thus, std::vector<uint8_t>::data() already returns a low level view. (It also, unfortunately, means that for a vector v = std.vector[uint8_t](range(10), v[0] is not the same as v.data()[0]. That is, the former is a str, the latter an integer.)

I implemented the same for signed char (and thus int8_t), under the same reasoning, and that doesn't break any existing tests. Aside, Python's array.array represents both unsigned char and signed char array's as integers, too, so it certainly isn't a strange thing to do. This fixes this particular problem (definitely for vector, but I presume for column_view as well), even as it's still no general solution.

Note that, after this change, if for whatever reason you do want to get a vector<int8_t>'s data as a Python str, you can do ''.join(v).