xoopR / distr6

R6 object-oriented interface for probability distributions.
https://xoopr.github.io/distr6/

"exotic" function-valued properties and extensibility #13

Closed fkiraly closed 5 years ago

fkiraly commented 5 years ago

Just to add this as a discussion point:

for various reasons, one may like to have some more exotic functionals related to distributions, such as n-th (anti-)derivatives of pdf, lp-norm of the various distribution characterizing functions, or mgf/cgf.

It's reasonably easy to plan them in from the start if we wanted to, but given any design, one may ask the question:

* how easy is it to add one more exotic functional
* should there be specific functionality to ask for "l2-norm of" etc
* what about automated conversion functionality

PeterRuckdeschel commented 5 years ago

I agree that one should avoid trying to provide "exotic" functionals for each distribution; r, d, p, q are there in R for good reason, others, like Fourier transforms, mgf, ..., are not. The approach of generic functions might be a good compromise for these functionals -- compute them numerically, and if needed/wanted provide more accurate ones for specific distributions...
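For illustration, a minimal sketch of that compromise using plain S3 generics (the `mgf` generic, the `pdf` field and the `Normal` class are hypothetical names, not distr6 API): a numeric fallback plus a more accurate method for a specific distribution.

```r
# generic "exotic" functional with a numeric default
mgf <- function(d, t, ...) UseMethod("mgf")

# fallback: E[exp(tX)] by numerical integration against the pdf
mgf.default <- function(d, t, ...) {
  sapply(t, function(s)
    integrate(function(x) exp(s * x) * d$pdf(x), -Inf, Inf)$value)
}

# more accurate (here: exact) method for a specific distribution
mgf.Normal <- function(d, t, ...) exp(d$mean * t + 0.5 * d$sd^2 * t^2)

X <- structure(list(mean = 0, sd = 1, pdf = function(x) dnorm(x)),
               class = "Normal")
mgf(X, 1)          # exact: exp(1/2)
mgf.default(X, 1)  # numeric fallback agrees up to integration tolerance
```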

RaphaelS1 commented 5 years ago

I agree with Peter's suggestion.

how easy is it to add one more exotic functional

I assume this just depends on the correct hierarchy/abstraction set-up. If we get it right then adding a method, independent of others, should be straightforward.

should there be specific functionality to ask for "l2-norm of" etc

Given their extensive uses in modelling, I don't think L1/2 norm should be considered exotic and there should be specific functionality.

what about automated conversion functionality

Sorry what does conversion refer to here? As in conversion to mgfs/pgfs/cfs etc?

fkiraly commented 5 years ago

Sorry what does conversion refer to here? As in conversion to mgfs/pgfs/cfs etc?

E.g., you have cdf, but don't have anti-derivative of cdf. Or, you have pdf, but don't have derivative of pdf.
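For example, such conversions could be filled in numerically when no closed form is available; a rough sketch, assuming the cdf/pdf are available as plain R functions (the helper names are made up):

```r
# anti-derivative of a cdf via numerical integration
cdf_antideriv <- function(p) {
  function(x) sapply(x, function(u) integrate(p, -Inf, u)$value)
}

# derivative of a pdf via a central finite difference
pdf_deriv <- function(d, h = 1e-5) {
  function(x) (d(x + h) - d(x - h)) / (2 * h)
}

P <- cdf_antideriv(pnorm)
P(0)                 # ~ 0.3989; the exact value is x * pnorm(x) + dnorm(x) at x = 0
pdf_deriv(dnorm)(1)  # ~ -0.2420; the exact derivative is -x * dnorm(x)
```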

RaphaelS1 commented 5 years ago

Is there any reason why a user can't do this themselves if required? As far as I'm aware these functions (e.g. differentiation) are fairly standard in R.

fkiraly commented 5 years ago

Symbolic differentiation is not easily available for arbitrary expressions. A number of learning strategies - e.g., least-squares-based probabilistic ones - require these "exotic" interface points explicitly.

Regarding how "exotic" "exotic" is: a user who is only interested in distributions may never want to query an l2-norm or similar of a pdf/cdf. However, ML/AI models or workflows may query these regularly, so for those it would be important that most distributions support them.

I.e., a "pure" user may never call things like l2-norm-of-p/cdf, log-pdf, etc directly, but it would be extremely common as a query by "under-the-hood" functionality of modelling modules. In consequence, unlike the user only working with distributions, a user writing modelling packages may want an "obvious" and easily accessible interface point.

RaphaelS1 commented 5 years ago

Well, as we are planning on writing these modelling packages ourselves, shall we come back to these methods then, and for now just ensure that the design allows methods like this to be added in the future?

PeterRuckdeschel commented 5 years ago

fine for me

fkiraly commented 5 years ago

For the internship projects it might be great to have a wishlist though, and it might be more efficient to implement the more important modelling-relevant methods right away.

My personal wishlist:

RaphaelS1 commented 5 years ago

Fine. In which case, can I suggest that we open a second project in this repo entitled something like 'exotic functionals wishlist' and add these to the list in order of priority, in the same way as with the distributions? Then either we can split them, or they can split themselves, into both lists, as the two should be able to work in tandem. Before then we can work out whether any extra design specification is needed for these, or just determine where they are defined in the hierarchy.

PeterRuckdeschel commented 5 years ago

To Franz' wishlist: it may be cheap (in design) but helpful to allow for a general integration measure mu (finite or sigma-finite, discrete or absolutely continuous) with respect to which to compute the l2-norms (e.g. this allows generalizing the Cramér-von Mises distance to the Anderson-Darling distance).
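A rough sketch of what that could look like numerically (function names are made up): a squared L2-type distance between two cdfs, computed against a user-supplied measure with density dmu; taking dmu = dF gives a Cramér-von Mises-type distance, and dmu = dF / (F(1 - F)) an Anderson-Darling-type one.

```r
# squared L2 distance between cdf1 and cdf2 w.r.t. the measure with density dmu
l2_dist2 <- function(cdf1, cdf2, dmu, lower = -Inf, upper = Inf) {
  integrate(function(t) (cdf1(t) - cdf2(t))^2 * dmu(t), lower, upper)$value
}

cdf1 <- pnorm                             # N(0, 1)
cdf2 <- function(t) pnorm(t, mean = 0.5)  # N(0.5, 1)

# Cramér-von Mises-type weighting: dmu = dF
l2_dist2(cdf1, cdf2, dnorm)

# Anderson-Darling-type weighting: dmu = dF / (F(1 - F));
# a finite range avoids 0/0 from underflow far in the tails
l2_dist2(cdf1, cdf2,
         function(t) dnorm(t) / (pnorm(t) * (1 - pnorm(t))),
         lower = -8, upper = 8)
```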

PeterRuckdeschel commented 5 years ago

Coming back to the question of whether these functionals should be implemented as methods/generics (separate from the distribution object) or as functional slots/members in distribution objects.

I have some reluctance to include such functionals in the distribution class, which to a large extent stems from S4 experience, where the necessity to copy these functions (= parsed R code, environment, evaluation frames, ...) meant the resulting objects easily became heavyweights... This paradigm is for sure different with R6.

Still, even with R6, I remain reluctant with respect to strict inheritance relations -- once you combine several such functionals as constituents of new classes you need to set up a precedence: which functional is higher in the hierarchy and hence must be filled with higher priority. As such a precedence is not self-evident, I would rather go for a decorator design pattern in the sense of Gamma et al. In R6 you could do so by implementing these functionals as methods/generics and maintaining an optional, growing list of properties in the corresponding distribution object, storing the computed values of such functionals so that they need not be recomputed once stored there. In particular with reference classes this should also be cheap in terms of storage.

With this strategy you can work without hierarchies, and none of these properties would be obligatory. In addition, this strategy does not imply a specific time at which these properties need to be filled: the default would be an active call of the filling method by the user, but where computation of a property is "cheap" one could just as well fill the respective property at object generation time.
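A rough R6 sketch of this decorator-plus-cache idea (all names are invented for illustration, not the eventual distr6 design): the decorator wraps a distribution object, adds an "exotic" functional with a numeric fallback, and stores the computed value so it is not recomputed.

```r
library(R6)

Norm <- R6Class("Norm", public = list(
  mean = 0, sd = 1,
  initialize = function(mean = 0, sd = 1) { self$mean <- mean; self$sd <- sd },
  pdf = function(x) dnorm(x, self$mean, self$sd)
))

ExoticDecorator <- R6Class("ExoticDecorator",
  public = list(
    dist = NULL,
    initialize = function(dist) self$dist <- dist,
    # numeric fallback, cached after the first call
    squared2Norm = function() {
      if (is.null(private$cache$l2)) {
        private$cache$l2 <- integrate(function(x) self$dist$pdf(x)^2,
                                      -Inf, Inf)$value
      }
      private$cache$l2
    }
  ),
  private = list(cache = list())
)

d <- ExoticDecorator$new(Norm$new())
d$squared2Norm()  # ~ 1 / (2 * sqrt(pi)) for the standard normal
```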

RaphaelS1 commented 5 years ago

This makes a lot of sense to me, enables sufficient flexibility and fully separates out "exotics" that many users may not want to think about. I would probably suggest either not making these available for S3 dispatch, or having the constructor call R62S3 to generate the S3 methods.

RaphaelS1 commented 5 years ago

I don't think log-pdf should be considered 'exotic' as it is currently implemented as standard in base R, e.g. dnorm(x, mean, sd, log = TRUE). (I assume this is identical to the log-pdf anyway.)

RaphaelS1 commented 5 years ago

Summary from phone call with Peter:

  1. Exotics will be treated as decorators, which can be optionally split into different groups (e.g. ML, inference, ...)
  2. Log will be included as standard, as in base R; however, a more accurate version will be included as an exotic
  3. Full wishlist is now under projects

Franz, if you want to add anything to this let me know; if not, I will close this issue.

fkiraly commented 5 years ago

Regarding 3, here's a modified wishlist in descending order of subjective priority:

* anti-derivative of cdf

* integral of squared-cdf, between 0 and x (where x can be negative, or one of +inf, -inf)

* integral of (one minus cdf)-squared, between 0 and x (where x can be negative, or one of +inf, -inf)

* integral of squared-pdf, between 0 and x (where x can be negative, or one of +inf, -inf)

It should be noted that if you have these, you can more easily fill in the defaults for the min of a random variable and a constant, in particular the Huberized version, which is min(b, max(a, X)).

It is therefore worthwhile thinking about whether at least the anti-derivative of cdf should be moved to "standard" rather than sitting in one of the "exotic" classes.

RaphaelS1 commented 5 years ago

I have updated the project with the modified wishlist.

Decorators can still be easily called by internal functions, such as Huberize, so the abstraction mainly affects the user experience. If a user may reasonably often call the anti-derivative of the cdf then it would make sense to make it 'standard'; otherwise, if it will mainly be used internally, I think the 'exotic' decorator still makes more sense.

PeterRuckdeschel commented 5 years ago

Regarding 3, here's a modified wishlist in descending order of subjective priority:

* anti-derivative of cdf

* integral of squared-cdf, between 0 and x (where x can be negative, or one of +inf, -inf)

* integral of (one minus cdf)-squared, between 0 and x (where x can be negative, or one of +inf, -inf)

* integral of squared-pdf, between 0 and x (where x can be negative, or one of +inf, -inf)

It should be noted that if you have these, you can more easily fill in the defaults for the min of a random variable and a constant, in particular the Huberized version, which is min(b, max(a, X)).

It is therefore worthwhile thinking about whether at least the anti-derivative of cdf should be moved to "standard" rather than sitting in one of the "exotic" classes.

The new bullet points are definitely good to have for things like Cramér minimum-distance (parameter) estimation. The latter, for Fn the empirical cdf of the data and F_theta the model cdf at parameter value theta, is the argmin of the function theta mapsto integral (Fn(t) - F_theta(t))^2 dt (provided this integral is finite).

For Huberize and Truncate I would say your bullets are less helpful, but please correct me if I missed something.

My argument is that if p(Xr) is the cdf of a real-valued rv X whose law in distr[6] is implemented as Xr, and for numerics a < b (for simplicity such that p(Xr)(b) - p(Xr)(a) > 0) you want to fill the p / cdf slots of Yr = Huberize(Xr, a, b) and Zr = Truncate(Xr, a, b), then the (already vectorized) cdfs of Yr and Zr are p(Yr) = function(t) (t < b) * (t >= a) * p(Xr)(t) + (t >= b) and p(Zr) = function(t) pmin((t >= a) * (p(Xr)(t) - p(Xr)(a)) / (p(Xr)(b) - p(Xr)(a)), 1), so I see no need to compute anti-derivatives of cdfs or anti-derivatives of squared-cdfs there.

For a clean solution you also need to catch the case p(Xr)(b) - p(Xr)(a) == 0, which is tricky as you are comparing floats/doubles against a target; similarly, new numerical issues may pop up, as it may not be true that (t < b) + (t >= b) == 1... This is where in current distr we use global options to define accuracies, but, as Raphael has nicely pointed out, one might want / should be able to set these accuracies locally by function arguments as well.

Of course, as to accuracy, it also helps to work out the survival functions p(Yr)(t, lower.tail = FALSE), i.e. 1 - p(Yr), and similarly for Zr, by means of the survival function p(Xr)(t, lower.tail = FALSE), provided the latter has been worked out with more accuracy; and similarly for log.p arguments, for more accuracy on the log scale.
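To make the two formulas above concrete, a small runnable sketch (the helper names and the tolerance are made up), including a crude guard for the p(Xr)(b) - p(Xr)(a) == 0 case:

```r
# cdf of the Huberized variable min(b, max(a, X)), given the cdf p of X
huberize_cdf <- function(p, a, b) {
  function(t) (t >= a & t < b) * p(t) + (t >= b)
}

# cdf of X truncated to (a, b], with a tolerance guard on the mass in (a, b]
truncate_cdf <- function(p, a, b, eps = .Machine$double.eps^0.5) {
  mass <- p(b) - p(a)
  if (mass < eps) stop("no probability mass in (a, b]")
  function(t) pmin(pmax((p(t) - p(a)) / mass, 0), 1) * (t >= a)
}

pY <- huberize_cdf(pnorm, -1, 1)  # cdf of min(1, max(-1, X)), X ~ N(0, 1)
pZ <- truncate_cdf(pnorm, -1, 1)  # cdf of X | -1 < X <= 1
pY(c(-2, 0, 2))                   # 0, 0.5, 1
pZ(c(-2, 0, 2))                   # 0, 0.5, 1
```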

fkiraly commented 5 years ago

For Huberize and Truncate I would say your bullets are less helpful, but please correct me if I missed something.

Let's say you wish to populate the "mean" slot of a min(x,X) where X is a non-negative random variable (this happens as part of Huberization).

That is just the anti-derivative of cdf of X, evaluated at x, I believe?
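A quick numeric check of this (a standard identity, assuming X non-negative): E[min(x, X)] = integral from 0 to x of (1 - F(t)) dt = x - integral from 0 to x of F(t) dt, so it is indeed recoverable from the anti-derivative of the cdf.

```r
set.seed(1)
x <- 1.5
cdfX <- function(t) pexp(t, rate = 2)          # X ~ Exp(2), non-negative
mc   <- mean(pmin(x, rexp(1e6, rate = 2)))     # Monte Carlo estimate of E[min(x, X)]
via  <- x - integrate(cdfX, 0, x)$value        # via the anti-derivative of the cdf
c(mc, via)                                     # both ~ 0.475
```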

PeterRuckdeschel commented 5 years ago

got it; fine with this; I was still thinking in our distr world of "only" p,d,q,r slots... Sure, for moments, antiderivatives of cdfs are helpful.

fkiraly commented 5 years ago

Hm, ok, this may raise another point then: are simple moments and related quantities in the "basics"? For me, mean and variance and kurtosis and skewness (for real variables) were always "basic" properties that most users may want.

fkiraly commented 5 years ago

as a random piece of guidance of what people consider "basic", see the right top panel of https://en.wikipedia.org/wiki/Normal_distribution (we of course don't need to implement all of this or only this)

PeterRuckdeschel commented 5 years ago

I agree that moments are not exotic, but this again is my distr "past": the first four moments in general are not measure-determining, whereas p, d, q (and, asymptotically by the LLN, r) are. So these are definitely important "decorations" (and we do have them in distr/distrEx as specific methods for E(), var(), kurtosis() and skewness()), but you would not use them to "nail down" a distribution in a fully non-parametric sense.

fkiraly commented 5 years ago

Yes, I agree - there is a separation between "measure determining" methods, and properties. However, I'd think there are "very important properties" which every distribution (of a certain class) ought to have, and which a user familiar with generic data science software will expect.

fkiraly commented 5 years ago

(I'm happy to discuss whether excess kurtosis is a very important property, but I'd think mean and variance definitely are)

fkiraly commented 5 years ago

However, all moments together are measure-determining (for compactly supported distributions), so would having mgf and "i-th moment" be worth considering?

mean and variance could simply default to "kthmoment(1)" and "kthmoment(2)" etc
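A rough sketch of such a default (names made up), with the small caveat that the variance is the second central moment, so it would combine kthmoment(2) and kthmoment(1):

```r
# k-th raw moment by numerical integration against the pdf
kthmoment <- function(pdf, k, lower = -Inf, upper = Inf) {
  integrate(function(x) x^k * pdf(x), lower, upper)$value
}

m1 <- kthmoment(dnorm, 1)  # mean of N(0, 1): ~ 0
m2 <- kthmoment(dnorm, 2)  # second raw moment: ~ 1
m2 - m1^2                  # variance as the second central moment
```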

fkiraly commented 5 years ago

One other point: in modelling (including predictive modelling), one often encounters incompletely specified distributions: that is, distributions for which you wouldn't have pdq, but only some moments or cumulants. This occurs prominently in applications where one doesn't want to use all statistical power for modelling the entire distribution, but rather to focus on important characteristics such as mean and variance.

All classical regression can be seen as modelling of the conditional mean, and predictive intervals can be seen as modelling of conditional mean and variance.

From this perspective, one may even allow distributions without pdq but with some moments/cumulants/properties specified.

fkiraly commented 5 years ago

(random question to which I don't think I know the answer: which functions are mgf? I'm aware all mgf are log-convex, but there exist log-convex functions which are not mgf, e.g., exp(t^4))

PeterRuckdeschel commented 5 years ago

Definitely, all families which are measure-determining (for at least a sizeable subclass of distributions) could be replacements for d, p, q, r; so Fourier transforms (aka characteristic functions), Laplace transforms (which is probably what you mean by moment generating functions; terminology varies in this), and, for integer-valued distributions, t mapsto E t^X for t in [0,1], and some further variants could qualify, and all of these can be quite helpful...

To your question as to a characterization of when a function is an mgf: if we are talking about Laplace transforms, then passage to complex arguments will give an answer -- for characteristic functions it would be the fact that they are positive definite (see, e.g., Chung).

To wrap up, I think all these properties are extremely helpful as decorators, but should only be obligatory if they could replace r, d, p, q in certain subclasses. This is, e.g., true for characteristic functions, which help to define general alpha-stable distributions (we haven't delved into this in distr so far, though).

As to modelling with only partial constraints, this is what you would probably do in non-parametric or semiparametric approaches. Then you are left very much to the empirical distribution and density estimates, which is fair enough, but w.r.t. your goal of high accuracy this opens quite a box of tuning "opportunities"... So I would not start with this...

RaphaelS1 commented 5 years ago

as a random piece of guidance of what people consider "basic", see the right top panel of https://en.wikipedia.org/wiki/Normal_distribution (we of course don't need to implement all of this or only this)

@fkiraly this is exactly what I used to determine the properties/traits list and which functions I believe should be considered non-exotic. However, I doubt mgf/cf/Fisher information would be used by the non-technical user, and similarly kurtosis.

I think we can consider three primary user types:

  1. Non-technical users (or students) who are using R for basic statistical analysis and probability distributions (mean, variance, d/p/q/r, etc.)
  2. Statisticians/mathematicians (or students in their final years) who are familiar with statistical theory and understand concepts such as parameterization, Fisher information, etc.
  3. Those interested in modelling/inference/estimation/simulation etc.

Then, based on these groups, I believe that for the first group all required properties/traits/functions should be classified as 'basic' and therefore implemented as core methods in the Distribution class hierarchy. Functions for the second group can be grouped under a StatisticalDecorator (or whatever we call it), and the final group under an InferenceDecorator, although in hindsight I think these latter two can generally be combined. Whilst we are at it, we can have a MeasuresDecorator that contains all distance measures. I think it could be interesting to have all decorators available for S3 dispatch as well.

Suggested list as follows (not including measures):

Core: variance, cov, cor, standard deviation, mean, pdf, cdf, quantile, rand, survival, hazard, cumulative hazard, max, min, median, mad, mode, iqr

StatisticalDecorator: mgf, cf, pgf, entropy, skewness, kurtosis, kmoment, fisherInfo

RaphaelS1 commented 5 years ago

Just to add: can I safely assume that when a known closed-form calculation for e.g. the mean exists, we will overload specific implementations for it? For example, whilst the mean() function can have the general expectation formula, we would overload this for the binomial distribution to size * prob.

PeterRuckdeschel commented 5 years ago

The subclasses for decorators are great; just as to the MeasuresDecorator: here it depends on which reference/gauge to measure distances against; usually you would rather determine distances between two distributions, so this would not be a property of a single distribution.

As to the question about mean() -- yes, I would say so, although in the case of the mean I would rather go for E(), as we do, because (my opinion) mean strongly suggests the empirical expectation and not the functional of a distribution. Apart from this, this overloading of methods in distrEx is exactly one of the reasons to have so many different classes, because the selection of the actual formula/method to use can be delegated to the usual method dispatch.

On the other hand, you would usually not use very deep inheritance relations for this, so it should not be too hard to take care of method selection yourself when using decorators; in this case it is rather about finding intelligent numerical fallbacks once exact formulae are not available, and maybe some less general intermediate solutions in some special cases, e.g. for the expectation of an affinely transformed variable, and for distributions with "heavier-than-usual" tails, where you may want to use integration of the form integral from 0 to 1 over qtl(distr)(s) ds.
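As a small illustration of the quantile-based fallback mentioned at the end (a standard identity; the function name is made up): for a real-valued X with quantile function q, E[X] = integral from 0 to 1 of q(s) ds, which avoids integrating x * pdf(x) over an unbounded range.

```r
q_mean <- function(qtl) integrate(qtl, 0, 1)$value
q_mean(qnorm)  # ~ 0, mean of N(0, 1)
q_mean(qexp)   # ~ 1, mean of Exp(1); the endpoint singularity at s = 1 is integrable
```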

RaphaelS1 commented 5 years ago

The subclasses for decorators are great; just as to the MeasuresDecorator: here it depends on which reference/gauge to measure distances against; usually you would rather determine distances between two distributions, so this would not be a property of a single distribution.

I was torn on this decision because on the one hand we could have it as multiple dispatch only (i.e. taking two distributions as the first two arguments), but given that we are making all other methods available as R6 or S4, does it not make sense to do the same for measures? e.g. let X, Y be two distributions and m() some measure method, then the following would produce an identical response:

X$m(Y)
Y$m(X)
m(Y,X)
m(X,Y)

although in the case of the mean I would rather go for E(), as we do

I was just about to open a new issue regarding method names/conventions/notations and will add this to it. I believe there may be some compromises required between R conventions, statistical conventions, and coding conventions.

On the other hand, you would usually not use very deep inheritance relations for this, so it should not be too hard to take care of method selection yourself when using decorators; in this case it is rather about finding intelligent numerical fallbacks once exact formulae are not available, and maybe some less general intermediate solutions in some special cases, e.g. for the expectation of an affinely transformed variable, and for distributions with "heavier-than-usual" tails, where you may want to use integration of the form integral from 0 to 1 over qtl(distr)(s) ds.

I will open an issue next week (hopefully) regarding where to define methods and when to overload; I'll add the above as the opening comment.