Closed RaphaelS1 closed 4 years ago
But you still have to specify naming, type and call conventions for the arguments - what would these be, e.g., for pdf, in the following cases:
for each of which you probably need to decide on adopting one of the options floating around in #137.
I am in favour of an idea which I proposed in that issue that would work as follows and provides good extensibility.
u$pdf(1) # Returns pdf evaluated at 1
u$pdf(c(1,2)) # Returns pdf evaluated at 1 and 2
u$pdf(1, 2) # Errors
b$pdf(1) # Errors
b$pdf(1, 2) # Returns pdf evaluated at 1, 2
b$pdf(c(1,2)) # Errors
b$pdf(c(1,2), c(2,3)) # Returns pdf evaluated at 1, 2 and 2, 3
I don't think we need to consider the matrixvariate case yet as we have no implementations of this nor any use-case. However, I think that this style can be extended to matrixvariate by taking vector valued inputs for a single-evaluate, and matrix-valued for multiple.
EDIT: Better example
u$pdf(a) # Returns pdf evaluated at a
u$pdf(c(a,b)) # Returns pdf evaluated at a and b
u$pdf(a, b) # Errors
b$pdf(a) # Errors
b$pdf(a, b) # Returns pdf evaluated at (a,b)
b$pdf(c(a,b)) # Errors
b$pdf(c(a,b), c(d,e)) # Returns pdf evaluated at (a,d) and (b,e)
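The convention in the example above can be sketched as a small arity check on the dots. This is only an illustration under stated assumptions: `make_pdf`, `u_pdf` and `b_pdf` are hypothetical names, and the toy densities (standard normal, and a product of two standard normals) stand in for real distributions.

```r
# Hypothetical helper: wraps a density `f` of `nvar` variables so that it
# takes one argument per variable and is vectorised over points.
make_pdf <- function(f, nvar) {
  function(...) {
    args <- list(...)
    if (length(args) != nvar) {
      stop(sprintf("expected %d argument(s), got %d", nvar, length(args)))
    }
    if (length(unique(lengths(args))) != 1L) {
      stop("all arguments must have the same length")
    }
    do.call(f, args)
  }
}

# Toy densities, assumed for illustration only.
u_pdf <- make_pdf(function(x) dnorm(x), nvar = 1)
b_pdf <- make_pdf(function(x, y) dnorm(x) * dnorm(y), nvar = 2)

u_pdf(c(1, 2))           # two points: 1 and 2
b_pdf(c(1, 2), c(2, 3))  # two points: (1, 2) and (2, 3)
# u_pdf(1, 2) and b_pdf(c(1, 2)) both raise an error
```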
I almost agree, but the interface is inconsistent.
Why: b$pdf(c(1,2),c(2,3)) and so on has as argument the type "list of vector" (after getting the ellipsis, or using sys.call or match.call).
If this is what you want for multiple evaluation (which I don't disagree with), then all special cases should be admissible, i.e., b$pdf(c(1,2)) should not return an error but exhibit the same behaviour as a list which is longer.
The interface is also inconsistent between univariate and multivariate cases: multiple arguments in the univariate case are vectors, while in the multivariate case they are just arguments.
In addition, I'm not sure whether this choice is good for vectorization (where what you want is vector arguments for multiple arguments, obviously).
> Why: b$pdf(c(1,2),c(2,3)) and so on has as argument the type "list of vector" (after getting the ellipsis, or using sys.call or match.call). If this is what you want for multiple evaluation (which I don't disagree with), then all special cases should be admissible, i.e., b$pdf(c(1,2)) should not return an error but exhibit the same behaviour as a list which is longer.
I don't understand this. To check we are on the same page: if b$pdf(c(1,2)) is called, then we are calling "the bivariate distribution with first variable evaluated at 1 and 2, but the second variable unknown".
> The interface is also inconsistent between univariate and multivariate cases: multiple arguments in the univariate case are vectors, while in the multivariate case they are just arguments.
In the univariate case a vector is passed to evaluate the distribution at a vector of points. In the m-variate case, m single-valued arguments are passed to evaluate at one point, or m vectors of length n are passed to evaluate at n points.
This is in case of the multivariate distribution.
My comment can be formulated as follows:
If
b$pdf(c(1,2), c(2,3))
Returns pdf evaluated at (1,2) and (2,3)
(elements of R², with the evaluations returned as a vector, I presume)
then why does
b$pdf(c(1,2))
return an error, and not the pdf evaluated at (1,2), as would be consistent?
My second comment:
if in the bivariate case, you would expect to evaluate multiple times as
b$pdf(c(1,2), c(2,3))
why isn't it, in the univariate case,
u$pdf(1, 2)
for returning pdf evaluated at 1 and 2 (in R)
as would be consistent?
Ah, I see the confusion. I used a bad example with the numbers, so let me illustrate with letters:
b$pdf(c(a,b), c(d,e))
returns pdf evaluated at (a,d), (b,e)
I've updated my examples earlier with letters
yes, just wanted to say - this is now one of those threads where something is edited so it makes some people in it look stupid because readers assume they commented on the edited post (mental note: never reply with "I completely agree" without mention of what to)
I've made the edit clear, and don't worry, it was my bad example so if anyone looks stupid it will be me. But we digress..
In any case, I think it makes sense now, and it also allows vectorization easily (since the "long" dimension is vectorized).
The other options to consider would be an array, matrix (dim x samplesize), or data frame (cols are dimensions).
Why using a 2D container might be useful: in many cases, multiple pdf evaluation can be expressed easily (and efficiently) as matrix/tensor multiplication - so, in those cases, you might end up converting to array/matrix internally, and you may want to avoid back/forth conversion between the efficient representation and one that is less efficient later on.
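As a rough illustration of the matrix argument, here is a minimal sketch that evaluates a multivariate normal density at many points with a single matrix product. It assumes one point per row (the transpose of the "dim x samplesize" layout mentioned above), and `mvn_pdf` is a hypothetical name, not an existing function.

```r
# Density of a multivariate normal at every row of X (one point per row),
# computed with matrix operations instead of a loop over points.
mvn_pdf <- function(X, mu, Sigma) {
  X <- as.matrix(X)
  d <- ncol(X)
  centred <- sweep(X, 2, mu)  # subtract the mean from each row
  # Squared Mahalanobis distance of every row in one go
  maha <- rowSums((centred %*% solve(Sigma)) * centred)
  exp(-0.5 * maha) / sqrt((2 * pi)^d * det(Sigma))
}

X <- cbind(c(1, 2), c(2, 3))  # points (1, 2) and (2, 3)
mvn_pdf(X, mu = c(0, 0), Sigma = diag(2))
```

With an identity covariance the result should agree with the product of two univariate normal densities, which gives a quick sanity check.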
Separate issue: usability.
A user most likely arrives with the data/inputs in one of the following formats:
Neither is obvious to convert to the input format this would require. Should at least some be accepted by default? E.g., if list, remove one layer of list. If array, 2nd dimension are samples, etc.
I think that the majority of users would be put off with a data.frame type input. Bear in mind that most people will use the univariate case and just want to evaluate at a few points. I think we could consider multiple constructors for the matrixvariate case, but I worry that unifying it might just put off the majority of users.
> I think that the majority of users would be put off with a data.frame type input.
I think you slightly misunderstand - what I listed is the likely formats in which users have the input - that's different from consideration of what the function should accept in its signature, or use internally.
The key question for usability is the "user journey" - assume I have the data in one of the formats described. What is the shortest way to get that to pdf, and how likely is a normal user able to come up with it quickly?
The advantage of the ellipsis method as described is that it can be called via d$pdf(1,2,3) or do.call(d$pdf, list(1,2,3)). As the majority of arrays have coercions to lists, I do believe this is the most flexible approach.
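To make the two call styles concrete, here is a small sketch. The dots-based `b_pdf` is a toy stand-in (an assumption, not the package's method), and the data frame is made up for illustration.

```r
# Toy bivariate density that takes its arguments through the dots.
b_pdf <- function(...) {
  args <- list(...)
  stopifnot(length(args) == 2L)
  dnorm(args[[1]]) * dnorm(args[[2]])
}

pts <- data.frame(x1 = c(1, 2), x2 = c(2, 3))

# Direct call: one argument per variable
direct <- b_pdf(pts$x1, pts$x2)

# A data frame is a list of columns, so it can be handed to do.call;
# unname() drops the column names so arguments are passed by position.
via_list <- do.call(b_pdf, unname(as.list(pts)))

identical(direct, via_list)
```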
Yes, but the average user doesn't know do.call (which I think one should use only in development but not in scripting), so if they start with list(1,2,3,4,etc,1000) they wouldn't know how to pass the arguments to the function.
What about allowing a single-list input alongside the unlisted input?
Okay, in my mind the most advanced use-case for the average user would be a data.frame of points to evaluate, where the number of columns is equal to the number of variables.
e.g.1. univariate:
x |
---|
a |
b |
c |
where x holds the points to evaluate, a to c.
e.g.2 bivariate
x1 | x2 |
---|---|
a | d |
b | e |
c | f |
where x1 is points of first variable and x2 is points of second, such that the evaluated points here are (a,d), (b,e), (c,f).
If you agree then we can add an argument to the constructor called something like data, which is NULL by default, with a conditional that says: if non-NULL then use this data.frame, otherwise use the ellipsis.
Makes sense - though data frame would be low on my priority list, I think list and array are more important. As long as you have one of these supported, the "average user" can convert.
Actually, maybe data frame is more intuitive to "average user" (even if more fiddly).
Exactly, data frame is more intuitive, and anyone who needs a list (and therefore likely more advanced) can use do.call
Well, the "stats 101" (or 201) use case I'm imagining is that "average user" wants to compute the log-likelihood by sending a list or data frame of data to pdf or log-pdf (and then computing arithmetic or geometric mean of the return).
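For reference, a tiny sketch of that use case, with `dnorm` standing in for the distribution's pdf (an assumption) and a made-up sample. It relies on the fact that the arithmetic mean of the log-pdf equals the log of the geometric mean of the pdf.

```r
# Made-up sample, assumed to come from a standard normal distribution.
x <- c(-0.3, 0.8, 1.5)

# Log-likelihood: sum of the log-pdf over the sample
loglik <- sum(dnorm(x, log = TRUE))

# Arithmetic mean of the log-pdf ...
mean_log <- mean(dnorm(x, log = TRUE))
# ... is the log of the geometric mean of the pdf
geo_mean <- prod(dnorm(x))^(1 / length(x))

all.equal(mean_log, log(geo_mean))
```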
Re. list, you can avoid forcing people through do.call if the data argument also (or only) takes a list.
I think I'm being unnecessarily resistant here, as in the long run we don't know all use-cases. I will add a data argument and create a helper that tests whether the supplied data is a data.frame/list/array and converts it to another form as required. The amount of work required to do that is far less than another redesign down the line. Just to confirm, do you envisage a list input looking something like list(x1 = c(a,b), x2 = c(d,e)) which evaluates at points (a,d), (b,e)?
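A minimal sketch of what such a helper and data argument could look like. Everything here is hypothetical: `as_point_list` and the toy `b_pdf` are illustrative names, and the matrix branch assumes columns are variables and rows are points.

```r
# Hypothetical helper: turn a data.frame, list or matrix into a plain list
# with one vector per variable.
as_point_list <- function(data) {
  if (is.data.frame(data)) {
    as.list(data)
  } else if (is.list(data)) {
    data
  } else if (is.matrix(data)) {
    # Assumption: columns are variables, rows are points.
    lapply(seq_len(ncol(data)), function(j) data[, j])
  } else {
    stop("`data` must be a data.frame, list or matrix")
  }
}

# Toy bivariate pdf: `data`, when supplied, takes precedence over the dots.
b_pdf <- function(..., data = NULL) {
  args <- if (is.null(data)) list(...) else as_point_list(data)
  stopifnot(length(args) == 2L)
  dnorm(args[[1]]) * dnorm(args[[2]])
}

# All four calls evaluate at the points (1, 2) and (2, 3)
b_pdf(c(1, 2), c(2, 3))
b_pdf(data = list(x1 = c(1, 2), x2 = c(2, 3)))
b_pdf(data = data.frame(x1 = c(1, 2), x2 = c(2, 3)))
b_pdf(data = cbind(c(1, 2), c(2, 3)))
```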
> do you envisage a list input looking something like list(x1 = c(a,b), x2 = c(d,e)) which evaluates at points (a,d), (b,e)?
Yes - just the thing you would pass to do.call (if you know of do.call).
> as in the long-run we don't know all use-cases
Sure, but one should think of them in advance, and I'm just telling you what my personal educated guesses are. A good architect/programmer has "virtual user profiles" in their head. E.g., "the students from your 2nd year R course" or similar.
I am proposing a major update that will break backward compatibility but will make the code much more efficient and will help in all external interfaces. The proposal is to remove x1, ... and to instead rely only on the dots (...) and the ...elt(2) function. This means that any explicit naming of x arguments will be removed, which will make the code faster to run and also avoids ugly calls with do.call that require a named list. This will affect backwards compatibility and will require a lot of work, but I think in the long run it will be very beneficial. This would be instead of #137, which does not solve the problem.
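For readers unfamiliar with it, here is a minimal sketch of reading arguments positionally from the dots with `...length()` and `...elt()` (both in base R since 3.5.0). The toy `b_pdf` below is an assumption for illustration, not the proposed implementation.

```r
# Toy bivariate pdf that reads its arguments positionally from the dots.
b_pdf <- function(...) {
  if (...length() != 2L) stop("expected exactly 2 arguments")
  dnorm(...elt(1)) * dnorm(...elt(2))
}

b_pdf(c(1, 2), c(2, 3))

# Argument names are ignored, so do.call works with a named or unnamed list.
do.call(b_pdf, list(x1 = c(1, 2), x2 = c(2, 3)))
do.call(b_pdf, list(c(1, 2), c(2, 3)))
```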