xoopR / distr6

R6 object-oriented interface for probability distributions.
https://xoopr.github.io/distr6/

Automatic generation of d/p/q/r #7

Closed RaphaelS1 closed 5 years ago

RaphaelS1 commented 5 years ago

One of the most powerful features of distr is the ability to automatically generate the d/p/q/r functions if not all are supplied to a constructor. The questions to consider now are whether the current generating functions can be brought forward to an R6 upgrade, or whether there are other methods for doing so.

RaphaelS1 commented 5 years ago

Having run some test code, I wondered whether all functions for any distribution can be generated from just the pdf/pmf; below is some example code for the Binomial distribution. I suspect that this is non-optimal, but I wonder if it is sufficient to be written into the Distribution super-class and then overloaded when required for more complex distributions:

# Defines the pmf of the Binomial distribution
binomDist = function(k, n, p){
  choose(n, k) * p^k * (1 - p)^(n - k)
}

# Defines the cdf as the cumulative sum of the pmf
cdf_binomDist = function(k, n, p){
  sapply(k, function(x) sum(binomDist(0:x, n, p)))
}

# Defines the quantile as the smallest value whose cdf is at least the given p
# (can likely be vectorised for efficiency)
q_binomDist = function(p, n, prob){
  for(i in 0:n){
    if(cdf_binomDist(i, n, prob) >= p)
      return(i)
  }
}

# Defines simulation via the `sample()` function in base R, using the defined pmf
# to weight the support 0:size and drawing n values
r_binomDist = function(n, size, prob){
  sample(x = 0:size, size = n, prob = binomDist(0:size, size, prob), replace = TRUE)
}

The above could be straightforwardly generalised to any distribution, with an integration analogue for continuous distributions.
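
A rough, editorial sketch of that integration analogue (not from the original comment): the cdf is obtained from a supplied pdf by numerical integration, and the quantile by numerical inversion of that cdf. The standard Normal (via dnorm) is used purely so the results can be checked against pnorm/qnorm.

# pdf supplied by the user -- the standard Normal here, purely as a check
pdf_num = function(x) dnorm(x)

# cdf via numerical integration of the pdf
cdf_num = function(q) sapply(q, function(z) integrate(pdf_num, lower = -Inf, upper = z)$value)

# quantile via numerical inversion of the cdf
quantile_num = function(p) sapply(p, function(z)
  uniroot(function(x) cdf_num(x) - z, interval = c(-10, 10))$root)

cdf_num(1.96)       # ~0.975, compare pnorm(1.96)
quantile_num(0.975) # ~1.96,  compare qnorm(0.975)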

RaphaelS1 commented 5 years ago

See generation_test.R in the code section for an updated idea of how to proceed. The code is roughly hacked together and doesn't make use of R6; instead it just gives a general idea of how these functions are dynamically generated. This may also be a good opportunity to discuss how well @PeterRuckdeschel and Matthias have felt their generating functions perform, and whether any improvements/edits should be made.

Similarly to distr, we use the following methods (a rough sketch of two of these conversions follows the lists):

Given density D only:

  1. D2P = Numerical integration/summation
  2. D2Q = D2P --> P2Q
  3. D2R = D2P --> P2Q --> Q2R

Given distribution P only:

  1. P2D = Numerical differentiation
  2. P2Q = Numerical inversion
  3. P2R = P2Q --> Q2R

Given inverse distribution Q only:

  1. Q2P = Numerical inversion
  2. Q2R = Inverse transform sampling
  3. Q2D = Q2P --> P2D

Given simulation function R only:

  1. R2D = Density estimation
  2. R2P = R2D --> D2P
  3. R2Q = R2D --> D2P --> P2Q
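
To make the chains concrete, here is a rough editorial sketch (not the distr code) of two of the conversions above, again using the standard Normal purely as a check: P2Q by numerical inversion of the cdf, and Q2R by inverse transform sampling.

# P2Q: numerical inversion of a given cdf (pnorm used as a stand-in)
P2Q = function(p_fun, lower = -100, upper = 100){
  function(p) sapply(p, function(z)
    uniroot(function(x) p_fun(x) - z, interval = c(lower, upper))$root)
}

# Q2R: inverse transform sampling from a given quantile function
Q2R = function(q_fun){
  function(n) q_fun(runif(n))
}

q_gen = P2Q(pnorm)
q_gen(0.975)      # ~1.96, compare qnorm(0.975)
r_gen = Q2R(q_gen)
mean(r_gen(1000)) # ~0; note this does one root solve per sample, which is slow
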
PeterRuckdeschel commented 5 years ago

5 minor comments:

(1) the functions D2P and so forth above should be the method of choice if nothing else helps; and then the given order has some justification as to accuracy. Whenever available, you should use more accurate functions -- e.g. it is not a good idea to avoid qnorm and instead use P2Q to compute it from pnorm...

(2) for the automatically generated functions r, d, p, q, one should adhere to the R coding conventions wherever possible, i.e. try to provide special treatment of returns on log scale (see dnorm(x, log=TRUE), pnorm(q, log.p=TRUE)) and of the extreme upper tails (see pnorm(q, lower.tail=FALSE)).
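
A small editorial illustration of why these conventions matter numerically (not part of the original comment): far in the tails, computing the complement or the log after the fact underflows, while the dedicated arguments retain precision.

1 - pnorm(10)                  # 0: the complement underflows in double precision
pnorm(10, lower.tail = FALSE)  # ~7.6e-24: computed directly in the upper tail
log(dnorm(40))                 # -Inf: dnorm(40) underflows before the log is taken
dnorm(40, log = TRUE)          # ~ -800.9: computed on the log scale throughout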

(3) for discrete distributions, or mixture distributions with non-trivial discrete part, it pays off to define left and right continuous variants for the cdf and the quantile function (in distr: p.l, p.r (=p), q.l (=q) and q.r) -- this is helpful when it comes to exact critical values for tests.

(4) as to D2P, for distributions with heavy tails it may pay off to perform the integration on the quantile scale (if available), i.e., essentially use unequally spaced grid points for integration such that between two subsequent grid points there is roughly the same probability mass. This is not yet done in distr, but is used, e.g., in our default integrators for Gamma, Weibull, Pareto, GPareto, GEVD, where \int f(x) P(dx) is computed as \int_{[0,1]} f(Q(s)) ds, where P is the cdf of the rv and Q its quantile function; see file GammaWeibullExpectation.R in package distrEx.
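
As a toy editorial illustration of the substitution \int f(x) P(dx) = \int_{[0,1]} f(Q(s)) ds (the Gamma(2, 1) mean, true value 2, is chosen arbitrarily):

f = function(x) x

# integration on the original scale: \int f(x) dP(x)
integrate(function(x) f(x) * dgamma(x, shape = 2), lower = 0, upper = Inf)$value  # ~2

# integration on the quantile scale: \int_0^1 f(Q(s)) ds
integrate(function(s) f(qgamma(s, shape = 2)), lower = 0, upper = 1)$value        # ~2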

(5) One should also note that the problem behind D2P is not a mere quadrature problem: we are heading for the antiderivative of D (as a function), so sophisticated optimal quadrature points for some fixed integration range may turn out to be sub-efficient when passing to the next integration range. So the grid points should in some sense be "universally useful" for all possible integration ranges. Similar considerations apply for the numerical inversion behind P2Q and Q2P -- using a sophisticated root solver for each inversion point is not necessarily the best thing to do... In this sense a mirroring approach, i.e. interchanging x and y on a suitable xy-grid (this one indeed obtained by individual root solving) and subsequent smoothing, is more efficient.
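
A minimal editorial sketch of the mirroring idea (grid choice arbitrary, pnorm used purely as a check): evaluate the cdf once on a grid, then swap x and y and interpolate to get an approximate quantile function, instead of root-solving for every requested probability.

# cdf evaluated once on a fixed grid
x_grid = seq(-5, 5, length.out = 1000)
p_grid = pnorm(x_grid)

# "mirror" the grid: interpolate p -> x to obtain an approximate quantile function
q_approx = approxfun(x = p_grid, y = x_grid)

q_approx(0.975)  # ~1.96, compare qnorm(0.975)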

PeterRuckdeschel commented 5 years ago

sorry not yet quite familiar with the issue tracker; my comment was not meant as the final word on this...

RaphaelS1 commented 5 years ago

(1) the functions D2P and so forth above should be the method of choice if nothing else helps; and then the given order has some justification as to accuracy. Whenever available, you should use more accurate functions -- e.g. it is not a good idea to avoid qnorm and instead use P2Q to compute it from pnorm...

Agreed, implementing a strict hierarchy to ensure the most efficient function is called will be essential

(2) for the automatically generated functions r, d, p, q, one should adhere to the R coding conventions wherever possible, i.e. try to provide special treatment of returns on log scale (see dnorm(x, log=TRUE), pnorm(q, log.p=TRUE)) and of the extreme upper tails (see pnorm(q, lower.tail=FALSE)).

Agreed

(3) for discrete distributions, or mixture distributions with non-trivial discrete part, it pays off to define left and right continuous variants for the cdf and the quantile function (in distr: p.l, p.r (=p), q.l (=q) and q.r) -- this is helpful when it comes to exact critical values for tests.

Am I correct in thinking that for a given distribution it is true that p.l(x) = p(x, lower = TRUE)? And similarly for q.r?

(4) as to D2P, for distributions with heavy tails it may pay off to perform the integration on the quantile scale (if available), i.e., essentially use unequally spaced grid points for integration such that between two subsequent grid points there is roughly the same probability mass. This is not yet done in distr, but is used, e.g., in our default integrators for Gamma, Weibull, Pareto, GPareto, GEVD, where \int f(x) P(dx) is computed as \int_{[0,1]} f(Q(s)) ds, where P is the cdf of the rv and Q its quantile function; see file GammaWeibullExpectation.R in package distrEx.

Thanks will take a look at this

(5) One should also note that the problem behind D2P is not a mere quadrature problem: we are heading for the antiderivative of D (as a function), so sophisticated optimal quadrature points for some fixed integration range may turn out to be sub-efficient when passing to the next integration range. So the grid points should in some sense be "universally useful" for all possible integration ranges. Similar considerations apply for the numerical inversion behind P2Q and Q2P -- using a sophisticated root solver for each inversion point is not necessarily the best thing to do... In this sense a mirroring approach, i.e. interchanging x and y on a suitable xy-grid (this one indeed obtained by individual root solving) and subsequent smoothing, is more efficient.

I see, and I assume it's much quicker computationally to go for the mirroring approach. So I assume that the functions are generated at construction and stored in the relevant slots as opposed to being computed and evaluated when called (as I wrote in my test code)?

PeterRuckdeschel commented 5 years ago

(1) the functions D2P and so forth above should be the method of choice if nothing else helps; and then the given order has some justification as to accuracy. Whenever available, you should use more accurate functions -- e.g. it is not a good idea to avoid qnorm and instead use P2Q to compute it from pnorm...

Agreed, implementing a strict hierarchy to ensure the most efficient function is called will be essential

(2) for the automatically generated functions r, d, p, q, one should adhere to the R coding conventions wherever possible, i.e. try to provide special treatment of returns on log scale (see dnorm(x, log=TRUE), pnorm(q, log.p=TRUE)) and of the extreme upper tails (see pnorm(q, lower.tail=FALSE)).

Agreed

(3) for discrete distributions, or mixture distributions with non-trivial discrete part, it pays off to define left and right continuous variants for the cdf and the quantile function (in distr: p.l, p.r (=p), q.l (=q) and q.r) -- this is helpful when it comes to exact critical values for tests.

Am I correct in thinking that for a given distribution it is true that p.l(x) = p(x, lower = TRUE)? And similarly for q.r?

No: p.l(x) = P(X < x), p.r(x) = P(X <= x), and respectively for the quantile functions.
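
To make the distinction concrete, an editorial example with base R's pbinom (not from the original comment), for X ~ Binomial(10, 0.5):

# p.r(5) = P(X <= 5), the usual right-continuous cdf
pbinom(5, size = 10, prob = 0.5)  # 0.623...
# p.l(5) = P(X < 5) = P(X <= 4)
pbinom(4, size = 10, prob = 0.5)  # 0.377...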

(4) as to D2P, for distributions with heavy tails it may pay off to perform the integration on the quantile scale (if available), i.e., essentially use unequally spaced grid points for integration such that between two subsequent grid points there is roughly the same probability mass. This is not yet done in distr, but is used, e.g., in our default integrators for Gamma, Weibull, Pareto, GPareto, GEVD, where \int f(x) P(dx) is computed as \int_{[0,1]} f(Q(s)) ds, where P is the cdf of the rv and Q its quantile function; see file GammaWeibullExpectation.R in package distrEx.

Thanks will take a look at this

(5) One should also note that the problem behind D2P is not a mere quadrature problem: we are heading for the antiderivative of D (as a function), so sophisticated optimal quadrature points for some fixed integration range may turn out to be sub-efficient when passing to the next integration range. So the grid points should in some sense be "universally useful" for all possible integration ranges. Similar considerations apply for the numerical inversion behind P2Q and Q2P -- using a sophisticated root solver for each inversion point is not necessarily the best thing to do... In this sense a mirroring approach, i.e. interchanging x and y on a suitable xy-grid (this one indeed obtained by individual root solving) and subsequent smoothing, is more efficient.

I see, and I assume it's much quicker computationally to go for the mirroring approach. So I assume that the functions are generated at construction and stored in the relevant slots as opposed to being computed and evaluated when called (as I wrote in my test code)?

correct,

RaphaelS1 commented 5 years ago

Thanks @PeterRuckdeschel, I think it would be beneficial to add further discussion on this issue to the agenda for our phone call next week.

fkiraly commented 5 years ago

As an alternative suggestion (and as discussed earlier with @RaphaelS1), I would externalize "automatically add in missing functionals" as a wrapper, if at all.

A clean interface should have the property that the different modalities analytically agree, and that a distribution keeps track of which are not implemented.

Otherwise, in advanced modelling, one easily gets into trouble through propagation of numerical inaccuracies in "filling in the missing one". That is, one needs to rely on all the interface functions of the distributions being exact.

RaphaelS1 commented 5 years ago

This wrapper would likely look similar to the distr constructor AbscontDistribution(), which takes as inputs one or more of r/d/p/q as well as other parameters for controlling how these should be generated.

So I believe that in your suggestion, Franz, we would also include parameters for controlling whether unknown functionals should be generated at all? Additionally, where AbscontDistribution() is a constructor that initialises the respective class, do you suggest that when these functionals are unknown we should actually construct a wrapped class around AbscontDistribution (for example)? And if so, is this wrapper simply tracking which functionals were supplied in the first place and if/how the others were generated?

PeterRuckdeschel commented 5 years ago

Externalizing the automatic filling of the r/d/p/q slots is a good point (and almost for free as to computational cost in R6, but not quite so in S4). If I understand correctly, you propose to have wrapped objects consisting of slots on two layers: a starting layer with slots filled with functions like pnorm and dnorm (leaving not-yet-implemented ones as missing), and another wrapping layer with slots automatically generated by D2P and friends, so it remains clear which slots come from the call to AbscontDistribution and which have been generated. This for sure is reasonable.
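
A minimal editorial sketch of such a two-layer object (function and field names are invented for illustration, not distr6 or distr API): the wrapper records which of d/p/q/r were user-supplied and which were filled in automatically.

# hypothetical wrapper: user supplies whichever of d/p/q/r they have
wrap_distribution = function(d = NULL, p = NULL, q = NULL, r = NULL){
  supplied = list(d = d, p = p, q = q, r = r)
  generated = list()
  # as one example, fill a missing cdf from the pdf by numerical integration (D2P)
  if (is.null(supplied$p) && !is.null(supplied$d)) {
    generated$p = function(x)
      sapply(x, function(z) integrate(supplied$d, lower = -Inf, upper = z)$value)
  }
  list(supplied = supplied, generated = generated)
}

dist = wrap_distribution(d = dnorm)
names(Filter(Negate(is.null), dist$supplied))  # "d": user-supplied
names(dist$generated)                          # "p": automatically generated
dist$generated$p(1.96)                         # ~0.975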

I would say as to accuracy, design decisions could be less black and white than this -- r,d,p,q slots provided as arguments to AbscontDistribution may themselves be of heterogeneous accuracy. In addition some of the D2P functions are very accurate (e.g. for discrete distributions) -- even when coming as return values from our convolutions. So saying that slots on the first layer are accurate and those on the second layer less so is not always true. Being able to trace the propagation of inaccuracies would be great -- as even pnorm and dnorm are not exact / analytical.

Another issue is when to do the filling of empty slots: if you leave unknown slots as missing and only fill them with the wrapper when needed (forgetting about this filling afterwards), this could cause redundant computations. I would in any case opt for filling missing slots at generation time, not at calling time (or at least filling them with promises at generation time, so they get permanently filled at the first call).
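
A small editorial sketch of the "fill with promises at generation time" idea using base R's delayedAssign (names invented for illustration): the numerical fill is only evaluated at the first call, and the slot then stays permanently filled.

make_dist = function(d){
  env = new.env()
  env$d = d
  # promise: the numerically generated cdf is built on first access, then cached
  delayedAssign("p", {
    message("generating cdf numerically (runs once)")
    function(x) sapply(x, function(z) integrate(d, lower = -Inf, upper = z)$value)
  }, assign.env = env)
  env
}

dist = make_dist(dnorm)
dist$p(0)     # triggers the one-off generation, returns ~0.5
dist$p(1.96)  # reuses the already-filled slot, ~0.975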

fkiraly commented 5 years ago

@PeterRuckdeschel, regarding

If I understand correctly, you propose [...]

Yes, I think you understand correctly.

Regarding your point no.2, I think it should be close-to-exact whenever possible, otherwise one should, at least, raise an error or be subject to manual control. So filling in missing methods by generics is fine, as long as there are exact generics to do so.

Regarding the "when": I'd be in favour of throwing an error or warning if an exact/analytical fill-in is unavailable - this is similar to base-R which often throws warnings when approximations are unsound or assumptions are incorrect. I'm undecided about whether a result should be returned. One solution could be to suggest the use of an "approximation wrapper" in the error message, so the user knows what to do but is aware of limitations.

PeterRuckdeschel commented 5 years ago

I would strongly opt for lowering the level of the thrown exception -- I am fine with a warning issued when using less accurate proxies, but if at any instance you call a "numeric" version of r/p/d/q you have to wrap it in a try-catch, this is certainly too much and would hinder usage of less accurate (but possibly still acceptably accurate) slots. You may have seen that we do have some warnings issued for such cases but also offer means to exactly suppress these warnings -- see distr::distrARITH().

fkiraly commented 5 years ago

I think we should be clear about what are, in my opinion, three separate issues:

(a) when a warning/error message is thrown
(b) when an "imputer/filler" method is automatically called to fill in a missing r/d/p/q slot
(c) what the status of non-exact filling strategies should be

I think (a) should happen whenever a called method is not analytically exact (unless the user turns it off), (b) only if the fill is guaranteed to be analytically exact, and (c) wrappers/compositors that are explicitly called in class/method construction.

@PeterRuckdeschel, could you outline whether we disagree on any of this (because I'm not sure), and if so, on which points?

PeterRuckdeschel commented 5 years ago

I think we basically agree already.

The really important feature for me is to allow for user-controlled changes to the strategies in (a), (b), and (c), and to whether exceptions are thrown and at which level (warning, error). Once such changes are possible, whatever the defaults in (a), (b), and (c) are, they will not hinder flexible applications with imputed slots, nor annoy the user with exceptions. Then I am with you that, in case of doubt, one should be restrictive by default to prevent the uninformed user from relying on overly inaccurate imputed slots.

Such user-controlled changes should be possible both case by case, with respective function arguments at object generation time (hence encouraging transparent code), and by means of global options. The latter allows for powerful arithmetic with binary operations like +, *, -, /, where case-by-case control is much less self-explanatory.
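
A tiny editorial sketch of the global-options route using base R's options()/getOption() (the option name is invented for illustration and is not an existing distr or distr6 option):

# hypothetical package default: warn whenever an imputed (numeric) slot is used
warn_on_imputed = function() getOption("distr6.warn.imputed", default = TRUE)

# the user switches the warning off globally, e.g. before heavy arithmetic with +, *, -, /
options(distr6.warn.imputed = FALSE)
warn_on_imputed()  # FALSE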

RaphaelS1 commented 5 years ago

Summary from phone call with Peter:

  1. We all appear to be in agreement in regard to how/when to generate functions.
  2. Peter's main concern is the one raised in his comment above regarding flexibility for the user, in particular making sure that automated generation should never stop/break code, even if it isn't the 'best' possible strategy. I agree with this and we will come back to it during implementation.