Closed: RaphaelS1 closed this issue 5 years ago.
For clarification, is `JointDistribution` for the tuple/product distribution, i.e., the joint distribution of two distributions known to be independent?
In this case, I'm not sure how you could reduce one to the other: distributions defined via conditioning are, in all interesting cases, not independent of the conditioning variable, because conditioning on an independent variable is just as good as not conditioning.
Whereas the joint/product is most interesting in the independent case, because only then is there an obvious, canonical way to carry out the construction (unless you want to create a universal copula-building interface, and I suspect you don't).
On a subsidiary note, joint distributions - or products, as I would call them, since not every joint distribution is a product of (independent) distributions - are very important in supervised learning.
Supervised probabilistic predictions take the form of product distributions, across vectors and arrays, i.e., the output of the prediction functional called on a data frame of features.
Therefore it may be important to get "vector-shaped products of distributions" and "array-shaped products of distributions" right, especially in cases where the predicted distribution type may differ - though the domain won't.
I think this also leads on to a discussion of naming conventions: the suggested separation of 'joint distributions' and 'product distributions' makes sense, and in fact perhaps dropping the term 'joint distribution' altogether is preferable. Instead we have three scenarios:

1. `Z = X1 * X2 * ... * XN` for N independent distribution objects. This `ProductDistribution` is naturally a wrapper.
2. A `MultivariateDistribution` class, which inherits from `Distribution`.
3. The conditional distribution. However, should this really be done as one `Distribution` class wrapped around another, or should the user be able to supply their own p/d/q/r?
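One way the wrapper option could look - a minimal sketch in Python for illustration (the class names are hypothetical, not the package's API): the wrapper holds a family mapping each conditioning value to a member distribution, and routes pdf(x | y) through that family.

```python
# Hypothetical sketch (not the package API): a conditional distribution as a
# wrapper around a parameterised family of distributions.
import math

class Normal:
    def __init__(self, mean=0.0, sd=1.0):
        self.mean, self.sd = mean, sd

    def pdf(self, x):
        z = (x - self.mean) / self.sd
        return math.exp(-0.5 * z * z) / (self.sd * math.sqrt(2 * math.pi))

class ConditionalDistribution:
    """Wraps a family y -> Distribution; pdf(x | y) delegates to the member."""
    def __init__(self, family):
        self.family = family  # callable: conditioning value -> Distribution

    def pdf(self, x, given):
        return self.family(given).pdf(x)

# X | Y = y ~ Normal(y, 1): the wrapper just routes through the family.
cond = ConditionalDistribution(lambda y: Normal(mean=y, sd=1.0))
```

The alternative - the user supplying their own p/d/q/r directly - would replace the `family` callable with four user functions, at the cost of losing the inherited machinery of the wrapped member.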
`*` for product distributions clashes with the use of `+` for convolution. If this were consistent, `*` would denote the distribution of the value product: `distx*disty` should be "the distribution of the random variable X*Y, where X is distributed as distx and Y as disty", rather than the product in the mathematical category of distributions.

Regarding conditionals, I think it makes a lot of sense to handle these as wrappers, or even pipeline-style compositors.
Slightly confused about your point regarding `*`; let's discuss on Tuesday.
Sure.
In an example: if X and Y are standard normal, should X*Y be chi-squared with one degree of freedom, or should it be bivariate standard normal?
I'm saying that if X+Y is normal with mean 0 and variance 2, then X*Y should be chi-squared, not bivariate - because if for "plus" you add values (resulting in convolution of distributions), then for "times" you should multiply values.
On a related note: do distr distributions represent distributions, or random variables? I.e., for standard normal X, should X+X be normal with mean 0 and variance 2, or normal with mean 0 and variance 4? @PeterRuckdeschel?
(I assumed above they would represent distributions, so using the same distribution will not default to the result for the "same random variable", but to the result for "independent random variables with that distribution")
Ah I see - well, the way it is done in distr is that operations are performed on random variables, and there is a difference between 2*X and X+X, and similarly between X^2 and X*X.
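This convention can be checked numerically with a seeded Monte Carlo sketch (in Python for illustration; distr itself is an R package): treating operands as independent copies makes X+X differ from 2*X, and X*X from X^2.

```python
# Seeded Monte Carlo sketch of the independent-copies convention
# (illustrative Python, not distr code).
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000
x1 = rng.standard_normal(n)   # one independent copy of X
x2 = rng.standard_normal(n)   # another independent copy of X

sum_copies = x1 + x2          # "X+X" under the convention: Normal(0, 2)
scaled = 2 * x1               # "2*X": Normal(0, 4)
prod_copies = x1 * x2         # "X*X" under the convention: mean 0
squared = x1 ** 2             # "X^2": chi-squared(1), mean 1
```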
Let me clarify: our arithmetic takes distributions as input and produces distributions as output. Having said this, in between, when interpreting terms like X+Y, an intermediate passage to random variables allows for an easy interpretation.
The key point is this: in this sense our arithmetic is "closed" - all operands are treated as stochastically independent, and so Y1 = X+X and Y2 = 2*X are indeed not the same, as you already noted.
In fact, Y1 and Y2 only cover the two extreme scenarios - full independence and full dependence.
Thanks for clarifying - I could see two logically coherent set-ups: (a) interpreting in-memory references to the same distribution object as references to stochastically identical random variables (that is, not merely identically distributed, but actually the same variable), while interpreting references to different objects as references to stochastically independent ones; or (b) interpreting arithmetic operations on distribution objects as arithmetic operations on independent random variables distributed according to the arguments, regardless of whether the arguments refer to the same in-memory distribution object or not.
In my gut feeling, (b) is cleaner, less fiddly, but less powerful - and I was half aware that you were probably doing (b).
I also would say that (a) is outside scope for a package that is about distributions, not collections of random variables (which is, decidedly, one full layer above).
@RaphaelS1: in option (a), there would be no difference between X+X and 2*X - or between X^2 and X*X.
(it would be an interesting project in its own right to develop a full-blown calculus of possibly dependent random variables)
@Franz: Hm, I would say that (b) is more powerful than (a), because identical copies would never involve convolutions, and would require both operands to be of exactly the same distribution...
As to the approach for more general dependencies: I agree this is very interesting, but then, contrary to our approach, where no reference to the defining probability space is needed, one would have to work on one common probability space (Omega, A, P) on which all random variables are defined. The objects would then have to specify the measurable mappings (Omega, A, P) -> (R, B) in order to have access to joint distributions when specifying only the operands X1 and X2 for, say, an expression X1+X2.
In my opinion this reference to a common definition space and the need to specify the actual mappings would make the approach much more involved.
That said, one could think of defining evaluation frames (similar to R's with() / within() constructs) which specify a joint distribution for a set of terminal distribution objects, and then allow operations to generate new distribution objects from these terminal ones - all linked to the joint distribution, and hence able, in principle, to keep track of dependencies.
This is, of course, also restrictive to some degree: for each new distribution you want to work with, you either have to derive it from the existing terminal distribution objects, or you have to specify its relation to the "big joint distribution" in your frame - definitely not as easy as just defining slots p, d, q, and r...
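The common-sample-space idea can be caricatured in a few lines (a hypothetical Python sketch, not a proposal for the package): if random variables are stored as mappings on a shared Omega, dependence is tracked automatically, and X+X genuinely differs from a sum of independent copies.

```python
# Hypothetical sketch: random variables as measurable mappings on a common
# sample space Omega (here discretised as a fixed array of indices).
import numpy as np

rng = np.random.default_rng(7)
n = 500_000
omega = np.arange(n)                  # the common sample space, discretised
coords = rng.standard_normal((2, n))  # two coordinates defined on Omega

class RV:
    """A random variable: a mapping Omega -> R."""
    def __init__(self, mapping):
        self.mapping = mapping

    def sample(self):
        return self.mapping(omega)

    def __add__(self, other):
        # The sum is evaluated pointwise on the SAME omega, so dependence
        # between operands is tracked automatically.
        return RV(lambda w: self.mapping(w) + other.mapping(w))

X = RV(lambda w: coords[0][w])
Y = RV(lambda w: coords[1][w])
# X + X has variance ~ 4 (total dependence); X + Y has variance ~ 2.
```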
@PeterRuckdeschel thanks for the insight, though this is getting a little off-topic maybe?
For generic distribution interface, I personally would want to avoid explicit definition of measurable mappings, or specification of the probability space Omega etc. I feel this is more a mathematical crutch to make everything well-specified by saying "it exists" rather than an object which has actual meaning.
It's maybe better, hence, to go along the lines of your second suggestion: allow the user to specify (in)dependence relations, or leave dependence unspecified. Whatever the theoretically existing arrows from Omega are, they would remain implicit.
But this is surely something for the far future rather than this year...
Summarising conversation with Franz. There are three distinct implementation methods:

1. `Distribution$new(pdf, cdf)` - the user supplies a pdf and/or cdf, each taking multiple arguments. This is then a distribution with type dimension > 1 and is therefore a "multivariate" or "joint" distribution.
2. `ArrayDistribution$new(distribution, paramList)` - the user states which distribution is being utilised and then supplies a list of lists, such that a list of length n containing n lists, each of length m, corresponds to n distributions, each with m parameters.
3. `ProductDistribution$new(dist1, dist2, ...)` - the user provides two or more distributions (assumed independent) and the following result is exploited: `ProductDistribution$new(dist1, dist2)$pdf(x, y) = dist1$pdf(x) * dist2$pdf(y)`, and analogously for cdf.
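The factorisation exploited in the product case can be sketched as follows (illustrative Python with hypothetical class names; the thread's actual constructors are R6 classes):

```python
# Sketch of a product of independent distributions: under independence,
# pdf(x, y) = f1(x) * f2(y) and cdf(x, y) = F1(x) * F2(y).
import math

class Exponential:
    def __init__(self, rate=1.0):
        self.rate = rate

    def pdf(self, x):
        return self.rate * math.exp(-self.rate * x) if x >= 0 else 0.0

    def cdf(self, x):
        return 1 - math.exp(-self.rate * x) if x >= 0 else 0.0

class ProductDistribution:
    """Joint distribution of independent components: densities factorise."""
    def __init__(self, *dists):
        self.dists = dists

    def pdf(self, *xs):
        out = 1.0
        for d, x in zip(self.dists, xs):
            out *= d.pdf(x)
        return out

    def cdf(self, *xs):
        out = 1.0
        for d, x in zip(self.dists, xs):
            out *= d.cdf(x)
        return out

joint = ProductDistribution(Exponential(1.0), Exponential(2.0))
```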
Thanks for this summary -- I think this is a viable and good approach:
This does give you the generality you need to guarantee that suitable arithmetic remains closed, in the sense that its results are again representable in this form.
In the discrete case with only finitely many (relevant) atomic events for each marginal Xi, this is completely fine. In that case you in fact have a finite sample space, so it suffices to relabel a univariate discrete distribution; the only remaining task is to do this relabelling efficiently.
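The relabelling for finitely supported marginals amounts to enumerating the Cartesian product of atoms and multiplying probabilities under independence; a small sketch (Python for illustration, helper name hypothetical):

```python
# Sketch of the "relabelling": the product of finitely supported distributions
# is itself a finite discrete distribution over tuples of atoms, with
# probabilities multiplying under independence.
from itertools import product

def product_support(*marginals):
    """Each marginal is a dict {atom: probability}; returns the joint as
    a dict {(atom_1, ..., atom_n): p_1 * ... * p_n}."""
    joint = {}
    for combo in product(*(m.items() for m in marginals)):
        atoms = tuple(a for a, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        joint[atoms] = prob
    return joint

coin = {"H": 0.5, "T": 0.5}
biased_part = {1: 1 / 3, 2: 2 / 3}
joint = product_support(coin, biased_part)  # 4 atoms, probabilities sum to 1
```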
I think the computationally demanding question is, in the continuous case, how to get from case 2 to case 1 without too many numeric integrations... Not that I know how to solve this, and of course the approach of cases 1-3 is not to blame for it. Still, as far as I can say, the transfer 2 -> 1 has only been done successfully for elliptically symmetric distributions - more specifically, for the multivariate normal and t distributions in package mvtnorm.
I think we should keep these as distinct cases for now and assume the user knows which to construct and when (good documentation will help). In version 2 we can look at the more complex (and more interesting) question of transferring between the cases.
I am fine with that -- I just wanted to mention the challenge, as depending on the purpose you may need case 1 when you only have case 2. This challenge is by magnitudes harder than what can be done in the univariate case with our helpers R2DPQ, which, starting from one of the functions, are able to numerically reconstruct the others.
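The univariate reconstruction the R2DPQ helpers perform can be imitated numerically; a sketch (Python with scipy for illustration, starting from the density alone; the names d/p/q merely echo the R convention):

```python
# Univariate sketch in the spirit of the R2DPQ idea: given only a density d(),
# reconstruct the cdf p() by numeric integration and the quantile q() by
# root-finding on the reconstructed cdf.
import math
from scipy.integrate import quad
from scipy.optimize import brentq

def d(x):
    """Density of Exponential(rate=1), chosen so results are checkable."""
    return math.exp(-x) if x >= 0 else 0.0

def p(x, lower=0.0):
    """cdf via numeric integration of the density."""
    return quad(d, lower, x)[0] if x > lower else 0.0

def q(u, lo=0.0, hi=50.0):
    """quantile via root-finding on the reconstructed cdf."""
    return brentq(lambda x: p(x) - u, lo, hi)
```

For Exponential(1), p(log 2) should recover 0.5 and q(0.5) should recover log 2; the hard part the comment above alludes to is that nothing this cheap exists once the integrals are multivariate.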
My 2p on this, in the first iteration: (i) all multivariate joint distributions are explicitly defined via cdf, pdf, etc (ii) there is no interface to compute probability mass of arbitrary event sets, since that would in first instance require an interface to specify arbitrary event sets (thus a generic sets interface)
We need to leave something for Raphael's PhD students to work on too...
Our last discussion concluded that conditional distributions should be implemented as wrappers. This seemed like the natural choice, as wrappers inherit methods from `DistributionWrapper` that get all wrapped models and overload any methods/variables as required. However, we also concluded that joint distributions inherit from `Distribution` and are not wrappers. This seems counterintuitive, as the two are naturally related.

I would argue that a better option may be to have constructors for `JointDistribution` and `ConditionalDistribution`, with methods to get one from the other where appropriate/possible, exploiting properties of independence etc. where possible.

Thoughts?