Request) Make the rarray SEXP constructor error if incompatible types are passed

DavisVaughan commented 5 years ago

This has been a bug on my part that I struggled with for a few days now, and have just figured out.

I had a simple function that took in a SEXP that is automatically converted to a rarray<double>, and then just returns it. If I passed in c(1, 2, 3) everything worked fine. If I passed in 1:3 it gave garbage results like c(1e-314, 1e-314, 1e-314).

This is a result of me being dumb and not remembering that c(1, 2, 3) is a numeric vector, and 1:3 is an integer vector. So passing along the integer SEXP results in garbage when its converted to rarray<double>.

An example of all of this is in the readme here (just look at the calls to identity_cpp()): https://github.com/DavisVaughan/xtensorfailure

With the identity function here: https://github.com/DavisVaughan/xtensorfailure/blob/master/src/example.cpp#L7

Would it be possible to check that the type T matches up with the type of the R object provided, and throw an error if not? I think you could probably do this in the rarray and rtensor constructors, where you could maybe compare SXP as you have defined it against TYPEOF(SEXP_object_to_convert) and throw an error if they are different?

wolfv commented 5 years ago

That is a good idea. I was struggling with the same problem before.

I think it should definitely be possible.

eddelbuettel commented 5 years ago

We struggle with this in Rcpp land too. The trouble is that we are smushing the lines between compile and run-time. C++ wants a compile time check, but R will only tell us at run-time what the type is. Hence the need for SEXP in interfaces, and dispatch inside the function.

In short, it looks like @DavisVaughan re-invented what we have explained eg in this somewhat classic piece from 2013 by Kevin at the Rcpp Gallery (which is referenced in eg a number of StackOverflow and mailing list answers).

I fear we can't do much better than this.

wolfv commented 5 years ago

Hi Dirk, great to see you in this issue -- I was about to ping you for ideas :)

I think we can do a bit better:

Disallow arrays with data types that are not double, int(32), bool or complex as I understand those are the only data types in R (of interest here). E.g. uint64 and others should all be forbidden
When taking in a R-array we should perform the dynamic type check. Yes, that's costly, but it's probably not that high of a price. When creating an rarray from C++ we don't need to pay the price as we know the type already. I guess pybind11 and Rcpp differ here: pybind11 does a runtime check of the type, and actually selects a working overload (yes, this can be expensive).

If I use the TYPEOF macro on a SEXP do I get the element type of the R vector/matrix?

eddelbuettel commented 5 years ago

I am hopeful we can do better :) So far we haven't.

Type restrictions are fine. The list is what R has. (For int64_t we cheat via "overloaded" interpretation, nothing has been written for uint64_t :-( ).

pybind11 is on my list of things to look at, but as I hardly use Python that list only gets push_back and no pop_front :-( So not sure how it different,

Yup. TYPEOF is your friend. Should be in a number of Rcpp Gallery posts, StackOverflow answers and of course the Rcpp sources. It is an R macros as R started this business with the "union-alike" SEXP.

dselivanov commented 5 years ago

A bit offtopic.

For int64_t we cheat via "overloaded" interpretation

For 32 bit float there is a float pkg which takes same "cheating" approach - use R's integer vectors as storage. I use float pkg with Armadillo "mapped" matrices without any issue. So I believe we can have interoperability of the float and xtensor.

wolfv commented 5 years ago

@dselivanov interesting. This approach would make the proposed type checking completely useless, right? Because we couldn't detect a "float" vector (which is actually a int32 vector to R)?

I think for these advanced R usage we're in need of advanced R people as the xtensor team! It would be awesome if you could help out @dselivanov :)

eddelbuettel commented 5 years ago

No, no. It's a side issue, just like my mention of the integer64 hack.

Fact: R has int32, double, complex, bool. All support NA (and most support NaN) so bool is three valued just to mess with us :)

Fact: They all travel in/out as SEXP and you can use TYPEOF at run-time to inquire about payload.

Fact: Add-on packages (bit64 for integer, float for float) cheat by sticking 64bit ints into a double, and 32bit floats into an int32.

Fact: None of that helps with rarray.

But mentioning these side-hacks shows different approaches in pending the rules a little.

SylvainCorlay commented 5 years ago

I also like pybind11's dynamic dispatch approach, although it is not always ideal because of the overhead.

Maybe we could make a dynamic dispatch in xtensor-r for the data type?

Thanks @eddelbuettel for chiming in!

eddelbuettel commented 5 years ago

Happy to help, particularly as "talk is cheap" :)

Dynamic dispatch may be worth it, definitely for an exploration. I am not really sure if that has been tried (and I am not following Arrow all that closely so I am not sure what they do over there). If it works for pybind11, and as you already put the C++14 marker down (which will "eventually" be less of an issue as all compilers catch up) it may be a good route.

wolfv commented 5 years ago

Hi @eddelbuettel & @DavisVaughan

I've implemented the safe guards in this PR: https://github.com/QuantStack/xtensor-r/pull/61/files

Do you guys want to quickly review the change? @eddelbuettel is this an appropriate way of raising an error to R? It seems to work fine!

Regarding what you've been mentioning above... I had a bit of a talk with a Pandas dev who's interested in Arrow, and he mentioned that R implements sentinel values, correct? Is that how NA values are represented for Ints, and bools? It could be quite cool to support R's way of creating NA values etc. natively using xtensor_optional_assembly and related tools! I was additionally wondering how the character / string array works in R (with regards to memory layout). Is it a bunch of \0 terminated strings, or do they all have the same buffer length (as they do in NumPy). I wonder wether it would be trivial to wrap this data type in xtensor, or not ... :)

Cheers & thanks for the help!

eddelbuettel commented 5 years ago

That looks good to me, and yes, Rcpp::stop() it is as we did a number of iterations on that over the years to get stacks unwounds etc pp. Should "Just Work" (TM).

The (super useful) NA and NaN definitions for types other double are in the R headers.

DavisVaughan commented 5 years ago

Thanks for implementing this, these merged changes look good!

xtensor-stack / xtensor-r

Request) Make the rarray SEXP constructor error if incompatible types are passed #59