A major re-write of x1 - Githubissues

RaphaelS1 commented 4 years ago

I am proposing a major update that will break backward compatibility but will make the code much more efficient and will simplify all external interfaces. The proposal is to remove x1,... and to rely instead only on dots (...) and the ...elt(n) function. This means that any explicit naming of x arguments will be removed, which will make the code faster to run and also avoid ugly calls with do.call that require a named list. This will affect backward compatibility and will require a lot of work, but I think in the long run it will be very beneficial.

This would be instead of #137 which does not solve the problem.
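To make the proposal concrete, here is a minimal sketch of the two styles. The functions `pdf_named` and `pdf_dots` are hypothetical stand-ins (the density of two independent standard normals), not distr6 code; `...length()` and `...elt()` are base R functions available from R 3.5.0.

```r
# Hypothetical stand-ins, not distr6 code: density of two independent
# standard normals, written with named arguments and with dots only.
pdf_named <- function(x1, x2) {
  dnorm(x1) * dnorm(x2)
}

pdf_dots <- function(...) {
  # ...length() counts the arguments; ...elt(i) extracts the i-th one
  if (...length() != 2L) stop("expected exactly 2 arguments")
  dnorm(...elt(1)) * dnorm(...elt(2))
}

# If the list carries names, the named version needs them to match x1/x2
# (names such as a, b would fail as unused arguments).
named <- do.call(pdf_named, list(x1 = 0, x2 = 1))

# The dots version accepts a plain unnamed list, or a list with any names.
dotted <- do.call(pdf_dots, list(0, 1))
```

Both calls give the same value; the difference is only in how strictly the argument list has to be prepared.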

fkiraly commented 4 years ago

But you still have to specify naming, type and call conventions for the arguments - what would these be, e.g., for pdf, in the following cases:

univariate real
multivariate real
categorical
non-primitive domain, e.g., distribution over set6 objects

for each of which you probably need to decide on adopting one of the options floating around in #137.

RaphaelS1 commented 4 years ago

I am in favour of an idea which I proposed in that issue that would work as follows and provides good extensibility.

Univariate Distribution - u

u$pdf(1) # Returns pdf evaluated at 1
u$pdf(c(1,2)) # Returns pdf evaluated at 1 and 2
u$pdf(1, 2) # Errors

Bivariate Distribution - b (Extends to multi)

b$pdf(1) # Errors
b$pdf(1, 2) # Returns pdf evaluated at 1, 2
b$pdf(c(1,2)) # Errors
b$pdf(c(1,2), c(2,3)) # Returns pdf evaluated at 1, 2 and 2, 3

I don't think we need to consider the matrixvariate case yet as we have no implementations of this nor any use-case. However, I think that this style can be extended to matrixvariate by taking vector valued inputs for a single-evaluate, and matrix-valued for multiple.

EDIT: Better example

Univariate Distribution - u

u$pdf(a) # Returns pdf evaluated at a
u$pdf(c(a,b)) # Returns pdf evaluated at a and b
u$pdf(a, b) # Errors

Bivariate Distribution - b (Extends to multi)

b$pdf(a) # Errors
b$pdf(a, b) # Returns pdf evaluated at (a,b)
b$pdf(c(a,b)) # Errors
b$pdf(c(a,b), c(d,e)) # Returns pdf evaluated at (a,d) and (b,e)
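The calling convention above can be sketched as an argument-count check. `make_pdf` is a hypothetical helper, not part of distr6, and the objects `u` and `b` are plain lists standing in for distribution objects, with independent standard normals as the densities.

```r
# Hypothetical helper, not part of distr6: wraps a density `f` of `dim`
# variables so that the number of arguments must equal `dim`.
make_pdf <- function(f, dim) {
  function(...) {
    n_args <- ...length()
    if (n_args != dim) {
      stop(sprintf("expected %d argument(s), got %d", dim, n_args))
    }
    args <- list(...)
    # every variable must supply the same number of points
    if (length(unique(lengths(args))) != 1L) {
      stop("all arguments must have the same length")
    }
    do.call(f, args)
  }
}

u <- list(pdf = make_pdf(function(x) dnorm(x), dim = 1L))
b <- list(pdf = make_pdf(function(x, y) dnorm(x) * dnorm(y), dim = 2L))

u$pdf(c(0, 1))            # two evaluations, at 0 and at 1
b$pdf(0, 1)               # one evaluation, at the point (0, 1)
b$pdf(c(0, 1), c(2, 3))   # two evaluations, at (0, 2) and (1, 3)
# u$pdf(0, 1) and b$pdf(c(0, 1)) both stop with an error
```

The "long" dimension (the points) stays inside each vector, so the underlying density can remain vectorised.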

fkiraly commented 4 years ago

I almost agree, but the interface is inconsistent.

Why: b$pdf(c(1,2),c(2,3)) and so on has as argument the type "list of vector" (after getting the ellipsis, or using sys.call or match.call). If this is what you want for multiple evaluation (which I don't disagree with), then all special cases should be admissible, i.e., b$pdf(c(1,2)) should not return an error but exhibit the same behaviour as a list which is longer.

fkiraly commented 4 years ago

The interface is also inconsistent between univariate and multivariate cases: multiple arguments in the univariate case are vectors, while in the multivariate case they are just arguments.

In addition, I'm not sure whether this choice is good for vectorization (where what you want is vector arguments for multiple arguments, obviously).

RaphaelS1 commented 4 years ago

> Why: b$pdf(c(1,2),c(2,3)) and so on has as argument the type "list of vector" (after getting the ellipsis, or using sys.call or match.call). If this is what you want for multiple evaluation (which I don't disagree with), then all special cases should be admissible, i.e., b$pdf(c(1,2)) should not return an error but exhibit the same behaviour as a list which is longer.

I don't understand this. To check that we are on the same page: if b$pdf(c(1,2)) is called, then we are asking for "the bivariate distribution with the first variable evaluated at 1 and 2, but the second variable unknown".

RaphaelS1 commented 4 years ago

> The interface is also inconsistent between univariate and multivariate cases: multiple arguments in the univariate case are vectors, while in the multivariate case they are just arguments.

In the univariate case a vector is passed to evaluate the distribution at a vector of points. In the m-variate case, m single arguments are passed to evaluate at one point, or m arguments that are each n-vectors are passed to evaluate at n points.

fkiraly commented 4 years ago

This is in the case of the multivariate distribution. My comment can be formulated as follows: if b$pdf(c(1,2), c(2,3)) returns the pdf evaluated at (1,2) and (2,3) (elements of R², with the evaluations returned as a vector, I presume), then why does b$pdf(c(1,2)) return an error, and not the pdf evaluated at (1,2), as would be consistent?

My second comment: if, in the bivariate case, you would expect to evaluate multiple times as b$pdf(c(1,2), c(2,3)), why isn't it, in the univariate case, u$pdf(1, 2) for returning the pdf evaluated at 1 and 2 (elements of R), as would be consistent?

RaphaelS1 commented 4 years ago

Ah, I see the confusion. I used a bad example with the numbers, so let me illustrate with letters: b$pdf(c(a,b), c(d,e)) returns the pdf evaluated at (a,d) and (b,e).

RaphaelS1 commented 4 years ago

I've updated my examples earlier with letters

fkiraly commented 4 years ago

yes, just wanted to say - this is now one of those threads where something is edited so it makes some people in it look stupid because readers assume they commented on the edited post (mental note: never reply with "I completely agree" without mention of what to)

RaphaelS1 commented 4 years ago

I've made the edit clear, and don't worry, it was my bad example so if anyone looks stupid it will be me. But we digress..

fkiraly commented 4 years ago

In any case, I think it makes sense now, and it also allows vectorization easily (since the "long" dimension is vectorized).

The other options to consider would be an array, matrix (dim x samplesize), or data frame (cols are dimensions).

fkiraly commented 4 years ago

Why using a 2D container might be useful: in many cases, multiple pdf evaluation can be expressed easily (and efficiently) as matrix/tensor multiplication - so, in those cases, you might end up converting to array/matrix internally, and you may want to avoid back/forth conversion between the efficient representation and one that is less efficient later on.
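As an illustration of the matrix argument, here is a hedged sketch for one specific case, the multivariate normal. The function `dmvn_rows` is hypothetical and not part of distr6, and it assumes the points are stored as rows of the matrix (the transpose of the "dim x samplesize" layout mentioned above).

```r
# Hypothetical sketch: multivariate normal density at every row of X
# (n points by d dimensions) using matrix products only.
dmvn_rows <- function(X, mu, Sigma) {
  d <- length(mu)
  Z <- sweep(X, 2, mu)                      # subtract the mean from each column
  # Mahalanobis terms z' Sigma^{-1} z for all rows at once
  maha <- rowSums((Z %*% solve(Sigma)) * Z)
  exp(-0.5 * maha) / sqrt((2 * pi)^d * det(Sigma))
}

X <- cbind(c(0, 1), c(2, 3))                # rows are the points (0, 2) and (1, 3)
dens <- dmvn_rows(X, mu = c(0, 0), Sigma = diag(2))
```

With an identity covariance this agrees with the product of two standard normal densities, and no per-point loop or conversion back to separate vectors is needed.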

fkiraly commented 4 years ago

Separate issue: usability.

A user most likely arrives with the data/inputs in one of the following formats:

2D array
data frame
list of vectors

None of these is obvious to convert to the input format this would require. Should at least some be accepted by default? E.g., if the input is a list, remove one layer of list; if it is an array, treat the 2nd dimension as samples, etc.

RaphaelS1 commented 4 years ago

I think that the majority of users would be put off with a data.frame type input. Bear in mind that most people will use the univariate case and just want to evaluate at a few points. I think we could consider multiple constructors for the matrixvariate case, but I worry that unifying it might just put off the majority of users.

fkiraly commented 4 years ago

> I think that the majority of users would be put off with a data.frame type input.

I think you slightly misunderstand: what I listed are the likely formats in which users have the input. That's different from the question of what the function should accept in its signature, or use internally.

The key question for usability is the "user journey" - assume I have the data in one of the formats described. What is the shortest way to get that to pdf, and how likely is a normal user able to come up with it quickly?

RaphaelS1 commented 4 years ago

The advantage of the ellipsis method as described is that it can be called via d$pdf(1,2,3) or do.call(d$pdf, list(1,2,3)). As the majority of arrays have coercions to lists, I do believe this is the most flexible approach.
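A short sketch of what those coercions look like in base R, using a hypothetical dots-based `pdf2` (two independent standard normals) in place of a real distr6 method. One caveat worth noting: `as.list()` on a matrix splits it into single elements, not columns, so a matrix needs a column-wise conversion first.

```r
# Hypothetical stand-in for a bivariate pdf that only uses dots
pdf2 <- function(...) dnorm(...elt(1)) * dnorm(...elt(2))

df <- data.frame(x1 = c(0, 1), x2 = c(2, 3))
m  <- as.matrix(df)

# A data frame is already a list of columns, so do.call accepts it directly
from_df <- do.call(pdf2, df)

# A matrix is converted column-wise first
from_mat <- do.call(pdf2, as.data.frame(m))

# as.list() on a matrix yields one list entry per cell, not per column
n_cells <- length(as.list(m))
```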

fkiraly commented 4 years ago

Yes, but the average user doesn't know do.call (which I think one should use only in development but not in scripting), so if they start with list(1,2,3,4,etc,1000) they wouldn't know how to pass the arguments to the function.

What about allowing a single-list input alongside the unlisted input?

RaphaelS1 commented 4 years ago

Okay, in my mind the most advanced use-case for the average user would be a data.frame of points to evaluate, where the number of columns is equal to the number of variables.

e.g.1. univariate:

x
a
b
c

where x holds the points to evaluate, a to c.

e.g.2 bivariate

x1	x2
a	d
b	e
c	f

where x1 is points of first variable and x2 is points of second, such that the evaluated points here are (a,d), (b,e), (c,f).

If you agree, then we can add an argument to the constructor called something like data, which is NULL by default, with a conditional that says: if non-NULL then use this data.frame, otherwise use the ellipsis.

fkiraly commented 4 years ago

Makes sense - though data frame would be low on my priority list, I think list and array are more important. As long as you have one of these supported, the "average user" can convert.

fkiraly commented 4 years ago

Actually, maybe data frame is more intuitive to "average user" (even if more fiddly).

RaphaelS1 commented 4 years ago

Exactly, data frame is more intuitive, and anyone who needs a list (and therefore likely more advanced) can use do.call

fkiraly commented 4 years ago

Well, the "stats 101" (or 201) use case I'm imagining is that "average user" wants to compute the log-likelihood by sending a list or data frame of data to pdf or log-pdf (and then computing arithmetic or geometric mean of the return).

fkiraly commented 4 years ago

Re. list, you can avoid forcing people through do.call if the data argument also (or only) takes a list.

RaphaelS1 commented 4 years ago

I think I'm unnecessarily resistant here as in the long-run we don't know all use-cases. I will add a data argument and create a helper to test if the supplied data is a data.frame/list/array and this will be converted to another form as required. The amount of work required to do that is far less than another redesign down the line. Just to confirm, do you envisage a list input looking something like list(x1 = c(a,b), x2 = c(d,e)) which evaluates at points (a,d), (b,e)?

fkiraly commented 4 years ago

> do you envisage a list input looking something like list(x1 = c(a,b), x2 = c(d,e)) which evaluates at points (a,d), (b,e)?

Yes - just the thing you would pass to do.call (if you know of do.call).
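The data argument and helper discussed above might look roughly like the following. This is only a sketch under stated assumptions: `as_point_list` and `pdf2` are hypothetical names, the density is two independent standard normals, and a matrix is assumed to hold one point per row.

```r
# Hypothetical helper: normalise `data` to a list with one vector per variable.
as_point_list <- function(data) {
  if (is.data.frame(data)) {
    as.list(data)
  } else if (is.matrix(data)) {
    # assumption: rows are points, columns are variables
    lapply(seq_len(ncol(data)), function(j) data[, j])
  } else if (is.list(data)) {
    data
  } else {
    stop("`data` must be a data.frame, matrix or list")
  }
}

# Hypothetical pdf with the proposed `data` argument (NULL by default):
# use `data` if supplied, otherwise fall back to the ellipsis.
pdf2 <- function(..., data = NULL) {
  args <- if (is.null(data)) list(...) else as_point_list(data)
  if (length(args) != 2L) stop("expected 2 variables")
  dnorm(args[[1]]) * dnorm(args[[2]])
}

pts <- list(x1 = c(0, 1), x2 = c(2, 3))   # the points (0, 2) and (1, 3)
via_list   <- pdf2(data = pts)
via_frame  <- pdf2(data = as.data.frame(pts))
via_matrix <- pdf2(data = cbind(c(0, 1), c(2, 3)))
via_dots   <- pdf2(c(0, 1), c(2, 3))

# Log-likelihood of the points, as in the "stats 101" use case
loglik <- sum(log(via_list))
```

All four call styles evaluate the same two points, so a user holding a list or data frame never has to reach for do.call.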

fkiraly commented 4 years ago

> as in the long-run we don't know all use-cases

Sure, but one should think of them in advance, and I'm just telling you what my personal educated guesses are. A good architect/programmer has "virtual user profiles" in their head. E.g., "the students from your 2nd year R course" or similar.

xoopR / distr6

A major re-write of x1 #179
