xoopR / distr6

R6 object-oriented interface for probability distributions.
https://xoopr.github.io/distr6/

Joint Distributions as wrappers #31

Closed RaphaelS1 closed 5 years ago

RaphaelS1 commented 5 years ago

Our last discussion concluded that conditional distributions should be implemented as wrappers. This seemed like the natural choice, as wrappers inherit methods from DistributionWrapper that get all wrapped models and overload any methods/variables as required. However, we also concluded that joint distributions inherit from Distribution and are not wrappers. This seems counterintuitive, as the two are naturally related.

I would argue that a better option may be to have constructors for JointDistribution and ConditionalDistribution, with methods to get one from the other where appropriate/possible, exploiting properties such as independence where possible.

Thoughts?

fkiraly commented 5 years ago

For clarification, is JointDistribution for the tuple/product distribution, i.e., the joint distribution of two distributions known to be independent?

In this case, I'm not sure how you could reduce one to the other, since distributions defined via conditioning are, in all interesting cases, not independent of the conditioning variable - because conditioning on an independent variable is just as good as not conditioning.

Whereas the joint/product is most interesting in the independent case, because only then is there an obvious, canonical way to carry out the construction (unless you want to create a universal copula-building interface, and I suspect you don't).
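A small numerical check of the claim that conditioning on an independent variable changes nothing (Python for illustration, since the R API is under design here; the toy pmfs are invented):

```python
# Two independent discrete random variables X and Y, given by pmfs.
px = {0: 0.3, 1: 0.7}          # pmf of X
py = {0: 0.5, 1: 0.5}          # pmf of Y

# Joint pmf under independence: p(x, y) = p(x) * p(y).
joint = {(x, y): px[x] * py[y] for x in px for y in py}

# Conditional pmf of X given Y = y0: p(x | y0) = p(x, y0) / p(y0).
y0 = 1
p_y0 = sum(p for (x, y), p in joint.items() if y == y0)
cond = {x: joint[(x, y0)] / p_y0 for x in px}

# Because X and Y are independent, the conditional equals the marginal,
# i.e. conditioning was "just as good as not conditioning".
```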

fkiraly commented 5 years ago

On a subsidiary note, joint distributions - or products, as I would call them, since not every joint distribution is a product of (independent) distributions - are very important in supervised learning.

Supervised probabilistic predictions take the form of product distributions, across vectors and arrays, i.e., the output of the prediction functional called on a data frame of features.

Therefore it may be important to get "vector-shaped products of distributions" and "array-shaped products of distributions" right, especially in cases where the predicted distribution type may differ - though the domain won't.
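A sketch of what a "vector-shaped product of distributions" could look like in a prediction setting (Python with scipy purely for illustration; this structure is an assumption, not distr6's design) - one predicted distribution per observation, possibly of differing types over the same domain:

```python
from scipy import stats

# Hypothetical vector-shaped product: one predicted distribution per row,
# of possibly different types but sharing the same domain (the real line).
predictions = [stats.norm(loc=0, scale=1),
               stats.t(df=3),
               stats.norm(loc=2, scale=0.5)]

# Element-wise pdf evaluation: the i-th observed value is evaluated under
# the i-th predicted distribution, as in probabilistic supervised learning.
y_observed = [0.1, -0.5, 2.0]
densities = [d.pdf(y) for d, y in zip(predictions, y_observed)]

# Under independence across rows, the joint density factorises.
joint_density = 1.0
for dens in densities:
    joint_density *= dens
```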

RaphaelS1 commented 5 years ago

I think this also leads on to a discussion of naming conventions. The suggested separation of 'joint distributions' and 'product distributions' makes sense, and perhaps dropping the term 'joint distribution' altogether is preferable. Instead we have three scenarios:

  1. A user can define a product distribution as the product of independent distribution objects, e.g. Z = X1 * X2 * ... * XN for N independent distribution objects. This ProductDistribution is naturally a wrapper.
  2. A user can construct a MultivariateDistribution class, which inherits from Distribution.
  3. A user can define a conditional distribution.

However, in the case of the conditional distribution, should this really be done as one Distribution class wrapped around another, or should the user be able to supply their own p/d/q/r?
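The factorisation behind scenario 1 can be sketched as a wrapper (Python rather than R, purely for a runnable illustration; the class name mirrors the proposed ProductDistribution, but the API shown is invented):

```python
from scipy import stats

class ProductDistribution:
    """Sketch of a product-of-independent-distributions wrapper:
    the joint pdf and cdf factorise over the wrapped marginals."""

    def __init__(self, *dists):
        self.dists = dists          # the wrapped marginal distributions

    def pdf(self, *xs):
        # f(x1, ..., xn) = f1(x1) * ... * fn(xn) under independence
        out = 1.0
        for d, x in zip(self.dists, xs):
            out *= d.pdf(x)
        return out

    def cdf(self, *xs):
        # F(x1, ..., xn) = F1(x1) * ... * Fn(xn) under independence
        out = 1.0
        for d, x in zip(self.dists, xs):
            out *= d.cdf(x)
        return out

Z = ProductDistribution(stats.norm(), stats.expon())
val = Z.pdf(0.0, 1.0)   # equals norm pdf at 0 times expon pdf at 1
```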

fkiraly commented 5 years ago
  1. Unfortunately, use of * for product distributions clashes with the use of + for convolution. If this were consistent, * would denote the distribution of the value product (i.e., distx*disty should be "the distribution of the random variable X*Y, where X ~ distx and Y ~ disty"), rather than the product in the mathematical category of distributions.

fkiraly commented 5 years ago

Regarding conditionals, I think it makes a lot of sense to handle these as wrappers, or even pipeline-style compositors.

RaphaelS1 commented 5 years ago

Slightly confused about your point regarding *, let's discuss on Tuesday.

fkiraly commented 5 years ago

Sure.

In an example: if X and Y are standard normal, should X*Y be chi-squared with one degree of freedom, or should it be bivariate standard normal?

I'm saying that if X+Y is normal with mean 0 and variance 2, then X*Y should be chi-squared, not bivariate - because if for "plus" you add values (resulting in convolution of distributions), then for "times" you should multiply values.
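The "operate on values" semantics can be checked by simulation - a sketch in Python (Monte Carlo, so the numbers are approximate): summing independent standard normals gives variance 2 via convolution, while multiplying one and the same standard normal by itself (X*X = X^2) gives the chi-squared distribution with one degree of freedom:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)   # X ~ N(0, 1)
y = rng.standard_normal(n)   # independent Y ~ N(0, 1)

# "+" as convolution of independent operands: X + Y is N(0, 2) ...
s = x + y

# ... whereas multiplying the *same* variable by itself gives X^2,
# which is chi-squared with one degree of freedom (mean 1).
q = x * x
```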

fkiraly commented 5 years ago

On a related note: do distr distributions represent distributions, or random variables? I.e., for standard normal X, should X+X be normal with mean 0 and variance 2, or normal with mean 0 and variance 4? @PeterRuckdeschel ?

fkiraly commented 5 years ago

(I assumed above they would represent distributions, so using the same distribution will not default to the result for the "same random variable", but to the result for "independent random variables with that distribution")

RaphaelS1 commented 5 years ago

Ah, I see. Well, the way it is done in distr is that operations are performed on random variables, and there is a difference between 2*X and X+X, and similarly between X^2 and X*X.

PeterRuckdeschel commented 5 years ago

Let me clarify: our arithmetic takes distributions as input and produces distributions as output. Having said this, in between, when interpreting terms like X+Y, an intermediate passage to random variables allows for an easy interpretation.

The key idea is as follows:

  1. In our arithmetic for distributions, in any binary operation s(x1,x2) involving two distribution objects x1 and x2 - where s could be addition, subtraction, division, multiplication, or exponentiation - the operands x1 and x2 are best thought of as stochastically independent random variables distributed according to the given distributions, and the return value of the operation is the image distribution of the new random variable.
  2. In any binary operation s(n,x) where n is numeric and x is a distribution object, we understand n as a random variable with a Dirac distribution at the numeric value n - hence n is always independent of x.

In this sense our arithmetic is "closed" - all operands are thought of as stochastically independent, and so Y1 = X+X and Y2 = 2*X are indeed not the same, as you already noted.

In fact Y1 and Y2 only cover the extreme scenarios of independence and full dependence.
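This arithmetic can be made concrete with a toy sketch (Python for illustration; the class here is invented and is not distr's implementation): binary operations treat both operands as independent, numerics become Dirac distributions, and X+X differs from 2*X exactly as described:

```python
from itertools import product

class Discrete:
    """Toy finite discrete distribution: maps value -> probability.
    Binary ops treat both operands as stochastically independent;
    a numeric operand is promoted to a Dirac (point-mass) distribution."""

    def __init__(self, pmf):
        self.pmf = dict(pmf)

    def _binop(self, other, op):
        if not isinstance(other, Discrete):
            other = Discrete({other: 1.0})   # numeric -> Dirac at that value
        out = {}
        for (v1, p1), (v2, p2) in product(self.pmf.items(), other.pmf.items()):
            v = op(v1, v2)
            out[v] = out.get(v, 0.0) + p1 * p2
        return Discrete(out)

    def __add__(self, other):
        return self._binop(other, lambda a, b: a + b)

    def __rmul__(self, other):
        return self._binop(other, lambda a, b: a * b)

X = Discrete({0: 0.5, 1: 0.5})   # a fair coin
Y1 = X + X      # convolution of independent copies: {0: 0.25, 1: 0.5, 2: 0.25}
Y2 = 2 * X      # scaling by a Dirac at 2: {0: 0.5, 2: 0.5}
```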

fkiraly commented 5 years ago

Thanks for clarifying - I was able to see two logically coherent set-ups: (a) interpreting in-memory references to the same distribution object as references to stochastically identical (that is, not merely identically distributed, but entirely identical) random variables, while interpreting references to different objects as references to stochastically independent ones; or (b) interpreting arithmetic operations on distribution objects as arithmetic operations on independent random variables distributed according to the arguments, no matter whether the arguments refer to the same in-memory distribution object or not.

My gut feeling is that (b) is cleaner and less fiddly, but less powerful - and I was half aware that you were probably doing (b).

I also would say that (a) is outside scope for a package that is about distributions, not collections of random variables (which is, decidedly, one full layer above).

fkiraly commented 5 years ago

@RaphaelS1 : In option (a), there would be no difference between X+X and 2*X - or between X^2 and X*X.

fkiraly commented 5 years ago

(it would be an interesting project in its own right to develop a full-blown calculus of possibly dependent random variables)
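One minimal way such a calculus could look (purely a sketch with invented names, not a proposal for distr6): represent the common probability space by a shared Monte Carlo sample, and random variables as maps applied to it - reused draws give "the same" variable, fresh draws give an independent one:

```python
import numpy as np
from scipy.stats import norm

# Toy "common probability space": omega is a shared Monte Carlo sample and a
# random variable is a measurable map applied to it. Dependence is tracked
# automatically, because reusing the same draws means reusing the same variable.
rng = np.random.default_rng(42)
omega = rng.random(100_000)           # draws from the underlying space

def gaussian_map(u):
    # inverse-cdf transform: uniform draws -> standard normal values
    return norm.ppf(u)

x = gaussian_map(omega)               # one random variable on this space
same = x + x                          # same map: Var(X + X) = 4 Var(X)
indep = x + gaussian_map(rng.random(100_000))   # fresh draws: variance 2
```

This corresponds to option (a): identity of the underlying draws, not of the distribution, decides dependence.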

PeterRuckdeschel commented 5 years ago

@Franz: Hm, I would say that (b) is more powerful than (a), because identical copies would never involve convolutions, and would require both operands to be of exactly the same distribution...

As to the approach for more general dependencies: I agree this is very interesting. But then, contrary to our approach, where no reference to the defining probability space is needed, one would only be allowed to work on one common probability space (Omega, A, P) on which all random variables are defined, and the objects would have to specify the measurable mappings (Omega, A, P) -> (R, B) in order to have access to joint distributions when specifying only the operands X1 and X2 for, say, an expression X1+X2.

In my opinion this reference to a common definition space and the need to specify the actual mappings would make the approach much more involved.

This said, one could think of defining evaluation frames (similar to the R constructs with() and within()) which specify a joint distribution for a set of terminal distribution objects, and then allow operations to generate new distribution objects from these terminal ones. The new objects would all be linked to the joint distribution and could hence, in principle, keep track of dependencies.

This is, of course, also restrictive to some degree: for each new distribution you want to work with, you either have to derive it from the existing terminal distribution objects, or you have to specify its relation to the "big joint distribution" in your frame - definitely not as easy as just defining slots p, d, q, and r...

fkiraly commented 5 years ago

@PeterRuckdeschel thanks for the insight, though this is getting a little off-topic maybe?

For a generic distribution interface, I personally would want to avoid explicit definition of measurable mappings, or specification of the probability space Omega etc. I feel this is more a mathematical crutch to make everything well-specified by saying "it exists", rather than an object with actual meaning.

It's maybe better, hence, to go along the lines of your second suggestion: allow the user to specify (in)dependence relations, or leave dependence unspecified. All the theoretically existing arrows from Omega would remain implicit.

But this is surely something for the far future rather than this year...

RaphaelS1 commented 5 years ago

Summarising conversation with Franz. There are three distinct implementation methods:

  1. Distribution$new(pdf, cdf) - A user supplies a pdf and/or cdf, each taking multiple arguments. This is then a distribution with type dimension > 1 and is therefore a "multivariate" or "joint" distribution.
  2. ArrayDistribution$new(distribution, paramList) - A user states which distribution is being utilised and then supplies a list of lists, such that a list of length n containing n lists, each of length m, corresponds to n distributions each with m parameters.
  3. ProductDistribution$new(dist1, dist2, ...) - A user provides two or more distributions (assumed independent) and the factorisation ProductDistribution$new(dist1, dist2)$pdf(x, y) = dist1$pdf(x) * dist2$pdf(y) is exploited, analogously for the cdf.
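Case 2 can be sketched quickly (Python with scipy for a runnable illustration; the function name is invented and only mirrors the proposed ArrayDistribution idea): one distribution family plus a list of n parameter lists yields n distributions.

```python
from scipy import stats

# Sketch of case 2: a single family and a list of n parameter lists
# (here n = 3 normals, each with m = 2 parameters) give n distributions.
def array_distribution(family, param_list):
    return [family(*params) for params in param_list]

dists = array_distribution(stats.norm, [[0, 1], [1, 2], [5, 0.5]])

# Each element behaves as an ordinary univariate distribution object.
means = [d.mean() for d in dists]
```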

PeterRuckdeschel commented 5 years ago

Thanks for this summary - I think this is a viable and good approach:

This does give you the amount of generality you need to ensure that suitable arithmetics remain closed, in the sense that their results are again representable in this form.

In the discrete case, with only finitely many (relevant) atomic events for each marginal Xi, this is completely fine. In fact you then have a finite sample space, so it suffices to relabel a univariate discrete distribution, the only remaining task being to do this relabelling efficiently.
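The relabelling remark, as a small sketch (Python, with invented toy pmfs): the joint of two finite discrete marginals has finitely many atoms, so it can be stored as an ordinary univariate discrete distribution over relabelled atoms:

```python
from itertools import product

# Two finite discrete marginals (value -> probability), assumed independent.
px = {0: 0.2, 1: 0.8}
py = {"a": 0.5, "b": 0.5}

# The joint has finitely many atoms; relabel each pair (x, y) as a single
# atom to obtain an ordinary univariate discrete distribution over tuples.
joint = {(x, y): px[x] * py[y] for x, y in product(px, py)}
```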

I think the computationally demanding question, in the continuous case, is how to easily get from case 2 to case 1 without too many numeric integrations... Not that I know how to solve this, and of course the approach of cases 1-3 is not to blame for it. Still, as far as I can say, the transfer 2 -> 1 has only been done successfully for elliptically symmetric distributions and, more specifically, for multivariate normal and t distributions in package mvtnorm.

RaphaelS1 commented 5 years ago

I think that we should keep these as distinct cases for now and assume that the user has the knowledge to know which to construct and when (good documentation will help). In version 2 we can look at the more complex (and more interesting) question of transferring between the cases.

PeterRuckdeschel commented 5 years ago

I am fine with that - I just wanted to mention the challenge, as depending on the purpose you may need case 1 when you only have case 2. This challenge is by magnitudes harder than what can be done in the univariate case with our helpers R2DPQ, which, starting from one function, are able to numerically reconstruct the others.
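In the univariate case the reconstruction alluded to is indeed straightforward; a sketch in that spirit (Python with root-finding, not the actual R2DPQ helpers - the function names here are invented), recovering the quantile function and a sampler from a cdf alone:

```python
import numpy as np
from scipy.optimize import brentq

# Given only a univariate cdf, reconstruct the quantile function by
# root-finding and a sampler by inversion - the easy, one-dimensional
# analogue of what is much harder in the multivariate case.
def cdf(x):
    return 1.0 - np.exp(-x) if x > 0 else 0.0   # Exp(1) cdf, for the demo

def quantile(p, lo=0.0, hi=100.0):
    # solve cdf(x) = p on a bracket where the cdf crosses p
    return brentq(lambda x: cdf(x) - p, lo, hi)

def sampler(n, rng=np.random.default_rng(0)):
    # inversion sampling: push uniform draws through the quantile function
    return np.array([quantile(u) for u in rng.random(n)])
```

For example, the reconstructed median of Exp(1) is log(2), matching the closed form.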

fkiraly commented 5 years ago

My 2p on this, in the first iteration: (i) all multivariate joint distributions are explicitly defined via cdf, pdf, etc.; (ii) there is no interface to compute the probability mass of arbitrary event sets, since that would in the first instance require an interface for specifying arbitrary event sets (thus a generic sets interface).

We need to leave something for Raphael's PhD students to work on too...
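For what point (ii) still permits without a generic sets interface: interval probabilities are available from the cdf alone. A minimal sketch (Python for illustration; the helper name is invented):

```python
from scipy import stats

# Even without an event-set interface, the most common event type is
# computable from the cdf alone: P(a < X <= b) = F(b) - F(a).
def interval_prob(dist, a, b):
    return dist.cdf(b) - dist.cdf(a)

p = interval_prob(stats.norm(), -1.96, 1.96)   # roughly 0.95
```

Anything beyond intervals (unions, complements, general Borel sets) would indeed need the generic sets interface deferred above.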