I would expressly not locate estimation functionality with the primary distribution object.
Density/distribution estimation is a natural subcase of probabilistic supervised learning (i.e., that of no features), thus should be implemented together with a probabilistic supervised learning interface.
I think this makes particular sense since there are multiple ways to estimate the same kind of distribution from data, e.g., method of moments, maximum likelihood, maximum-a-posteriori given some choice of prior, least squares, etc.
My design of choice would hence be a class for each estimation method which has a `fit` and a `return` method - it is not quite `predict` as there is just a single distribution to return.
I think this is a departure from the distr design, so @PeterRuckdeschel may like to comment.
I am with Franz. Basically, estimators (or more broadly statistical decision functions) build on models and data, so this should be kept as a distinct concept / class structure. It is also helpful to distinguish estimators from estimates (decision functions from decisions), just as you are used to with random variables and their actual realizations. Implemented as a class structure, an estimator would comprise a link to a data object, a link to a model, a function mapping these two to an estimate, and the estimate itself. In addition it may contain informational add-ons (type of the estimator, diagnostic information, distribution of the estimator (if available), asymptotic variance, ...). This does not prevent the estimate from again being a distribution object, e.g. in the case of probabilistic forecasts.
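For illustration, a minimal R6 sketch of this estimator/estimate separation; all names here are hypothetical rather than existing distr6 API, and the method-of-moments fit is only a placeholder:

```r
library(R6)
library(distr6)

# Hypothetical estimator class: links a model (a distribution family) and data,
# and fills in an estimate when fit() is called
Estimator <- R6Class("Estimator",
  public = list(
    model = NULL,    # e.g. a distr6 distribution generator
    data = NULL,
    estimate = NULL, # set by fit(); may itself be a distribution object
    initialize = function(model, data) {
      self$model <- model
      self$data <- data
    },
    fit = function() {
      # method-of-moments for a Normal, as the simplest possible placeholder
      self$estimate <- self$model$new(mean = mean(self$data), sd = sd(self$data))
      invisible(self)
    }
  )
)

est <- Estimator$new(model = Normal, data = rnorm(100, mean = 2, sd = 2))
est$fit()$estimate # a Normal distribution object
```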
Agreed with both and will close this issue for now, we can return to it when discussing the modelling interface.
It would be most convenient if maximum a posteriori (MAP) estimation was implemented and included, with a default prior set to `flat`, effectively doing maximum-likelihood estimation (MLE). MLE is the method most commonly used in classical statistics and statistical learning. In addition to `flat`, the prior could also be `jeffrey` for non-informative priors; `conjugate` for efficient/faster estimation, automatically selecting the conjugate prior given the (likelihood) distribution for the sample; or some arbitrary distribution provided as the selected prior.
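A rough one-parameter sketch of that interface, assuming a hypothetical `mapEstimateRate` function and showing only the flat prior and an arbitrary distribution-object prior (the `jeffrey`/`conjugate` string options are omitted):

```r
library(distr6)

# MAP estimate of an Exponential rate: a flat prior reduces to MLE,
# a distribution object adds -log(prior density) to the objective
mapEstimateRate <- function(data, prior = "flat") {
  negLogPosterior <- function(rate) {
    nll <- -sum(log(Exponential$new(rate = rate)$pdf(data)))
    if (identical(prior, "flat")) nll else nll - log(prior$pdf(rate))
  }
  optimise(negLogPosterior, interval = c(1e-6, 100))$minimum
}

set.seed(42)
x <- rexp(500, rate = 2)
mapEstimateRate(x)                                          # plain MLE, ~2
mapEstimateRate(x, prior = Gamma$new(shape = 2, rate = 1))  # conjugate Gamma prior
```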
This would probably be enough for 90% (sorry for a random number without reference) of parameter estimations. Anything more specific or advanced should then be handled by more advanced and specialized machine learning/statistical inference packages.
Ideas?
Hi, thanks for commenting - this is actually the perfect time to re-open this issue as we are currently designing an ML interface for estimation in `mlr3proba`.
As noted originally by @fkiraly, there are several possible methods that could be used, which is why we originally discussed abstracting all estimation to another package. However, I am still in favour of having one general function for estimation that can cover a lot of use-cases.
I am unsure what this general function should be. Your suggestion of maximum a posteriori with a `flat` default seems sensible, as MLE is probably the most well-known estimation method for users, but we could also just have MLE. Here's an idea I just threw together:
```r
library(distr6); library(stats4); library(data.table)

mleEstimate <- function(distr, data, method = "BFGS", fixed = list(), nobs, ...) {
  # extract the settable parameters and their current values
  p = as.data.table(distr$parameters())
  p = subset(p, subset = p$settable, select = c("id", "value"))

  # build a negative log-likelihood with one formal argument per parameter
  f = vector("list", length(p$id))
  names(f) = p$id
  LL <- function() {}
  formals(LL) = f
  body(LL) = substitute({
    -sum(do.call(distribution$new, pars)$pdf(data, log = TRUE))
  }, list(
    pars = eval(parse(text = paste0("alist(", paste0(p$id, "=", p$id, collapse = ","), ")"))),
    distribution = get(distr$name)
  ))

  # optimise from the current parameter values as starting points
  s = as.list(p$value)
  names(s) = p$id
  mle = stats4::mle(minuslogl = LL,
                    start = s,
                    method = method,
                    fixed = fixed,
                    nobs = nobs,
                    ...)@coef

  # return a new distribution parameterised by the fitted values
  do.call(get(distr$name)$new, as.list(round(mle, 5)))
}

mleEstimate(distr = Normal$new(mean = 2, sd = 2), data = rnorm(1000))
mleEstimate(distr = Exponential$new(rate = 2), data = rexp(1000, rate = 2))
```
It isn't very efficient and there's a huge bottleneck, but it's a flexible implementation of MLE in `distr6`. If the general consensus is that a Bayesian approach is preferred for a generic estimation function then I'm happy to do that, but I think MLE tends to be better known by less technical users, and more technical users should have no problem interfacing with `mlr3proba`, although I agree that could be slight overkill if you just want to estimate a couple of parameters.
Yes, I am sure most users would be perfectly happy with a built-in MLE. MAP with `jeffrey` or `conjugate` priors can wait or be left for specialized libraries.
Does `pdf(, log = TRUE)` return sensibly for densities/masses/probabilities that equal 0 or 1? Besides that, I see no problem in the above implementation.
Although, since `distr6` has a zoo of distributions, a look-up table for `jeffrey` and `conjugate` priors would be truly awesome and a better interface than being an option to the parameter estimator function. Going even more general, it would be even more awesome if there were lookup tables for distribution transformations. A simple example:
Given: I have a Gaussian random variable X1 and another Gaussian random variable X2.
Problem: What's the distribution of X1 + X2?
Interface: `lookup(transformation = sum, operands = c(Gaussian object, Gaussian object))` should return the new Gaussian.
Limitations: There are infinitely many transformations and reparameterizations possible between distributions. But simply including the most common ones, like `sum`, `product`, `min`, `max`, `threshold` (aka X1 > X2), `square`, `root`, `conjugate`, `jeffrey`, etc., should be enough and make working with multiple distributions and complex modeling way more pleasant.
> Does `pdf(, log = TRUE)` return sensibly for densities/masses/probabilities that equal 0 or 1? Besides that, I see no problem in the above implementation.
So this directly relates to #131 and is a big reason why I haven't had a chance to look at this before. `pdf(, log = TRUE)` just calls `log(pdf())`, so it's fairly basic, which means that yes, any edge cases are problematic. Ideally I would like to implement analytical log-pdfs, but I have yet to decide whether these should go in the main distribution class or in a decorator; also, most of these will have to be manually written.
> Although, since `distr6` has a zoo of distributions

I'm taking this as a compliment; one day I hope it will be a universe of distributions.
> Given: I have a Gaussian random variable X1 and another Gaussian random variable X2. Problem: What's the distribution of X1 + X2? Interface: `lookup(transformation = sum, operands = c(Gaussian object, Gaussian object))` should return the new Gaussian. Limitations: There are infinitely many transformations and reparameterizations possible between distributions. But simply including the most common ones, like `sum`, `product`, `min`, `max`, `threshold` (aka X1 > X2), `square`, `root`, `conjugate`, `jeffrey`, etc. should be enough and make working with multiple distributions and complex modeling way more pleasant.
This is something we've discussed before in #153 and #36, and it isn't clear to me how to resolve it. I do quite like your lookup idea, to be honest: we could create a simple data.table of common combinations, and we wouldn't actually need a `lookup` function, just the `data.table::subset` function. This could serve as a good starting point to then decide which of these to implement, and would massively help with convolutions...
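For illustration, a minimal sketch of that table; the column and entry names are illustrative only, though the entries themselves are standard closure results for independent random variables:

```r
library(data.table)

# a tiny lookup table of closed-form transformation results
transforms <- data.table(
  transformation = c("sum", "sum", "product", "min"),
  operand1 = c("Normal", "Poisson", "Lognormal", "Exponential"),
  operand2 = c("Normal", "Poisson", "Lognormal", "Exponential"),
  result   = c("Normal", "Poisson", "Lognormal", "Exponential")
)

# "lookup" is then just a subset call
subset(transforms, transformation == "sum" & operand1 == "Normal" & operand2 == "Normal")
#>    transformation operand1 operand2 result
#> 1:            sum   Normal   Normal Normal
```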
I sense that whether `pdf(..., log = TRUE)` should be a decorator or a distribution property can easily turn into a philosophical debate on the nature of probability [1]. Implementing it inside the Distribution interface/class enables optimization of things like `sum(log(pdf(...)))`. For example, Gaussian log-likelihoods usually use a trick that swaps the sum and the log to avoid losing precision due to floating point numbers. However, that is again, as you said, too specific to each distribution, and is not possible in general. Personally, as an early user, I'd be quite happy with something quick and dirty such as offsetting from the edges by the smallest floating point precision, i.e. `pdf <- .Machine$double.eps + (1 - 2 * .Machine$double.eps) * pdf`.
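To make the precision point concrete, here is the same effect in base R, with `dnorm` standing in for an analytic log-pdf:

```r
# the naive route exponentiates first and underflows to 0 for extreme values
log(dnorm(40))        # -Inf, because dnorm(40) underflows to 0

# the analytic route computes -x^2/2 - log(sqrt(2*pi)) directly
dnorm(40, log = TRUE) # -800.9189
```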
Yes, the "zoo" was a compliment :D
There are infinitely many possible transformations, and most of them are either analytically intractable or need a symbolic computation engine to deal with. Fortunately, those are usually the ones a statistician doesn't really need. The "common" distributions and "common" operations fit into a textbook's index pages, and should be easy to implement using `data.table`: we simply just list them.
As in our initial discussion, I still think that estimators and estimates should be separate, agreeing with @PeterRuckdeschel.
Currently, the `distr6` interface does not support data ingestion, and adding this would require a major interface re-design and change.
However, I can see two ways in which parameter estimation could be supported:
@RaphaelS1, I don't think there should be a function `mleEstimate`, as opposed to an object (in whichever OOP paradigm). This is for the simple reason that there are many ways to compute the MLE, even if it is in theory unique; also, more generally, the "estimators" are objects in their own right, with parameters to set or tune, so they should be API encapsulation points in my opinion.
@hyiltiz, would you want to help with this (design discussions, implementations)? Note that the distr6 development team is one of volunteers (mostly @RaphaelS1), so we can't implement every "nice-to-have" feature, but it also means we're very open to contributions.
@RaphaelS1 got me onto Gitter for discussion so we won't clutter the issue tracker. I do realize that we are all volunteers. Besides, I totally agree that it is better to keep the library design clean and not bloated with every relevant nice feature. I have next to no experience with OO programming in R, but plan to do some in the coming months; I can actively contribute in discussions related to library design where feedback is needed.
Note that `distr6::Normal$new(0, 1)$pdf(5e1, log = TRUE)` returns `-Inf` due to IEEE floating point precision issues, which breaks estimation, although the pdf of that extreme event is well-defined. I added a simple hack so `-Inf` won't show up in the `LL` computation. Another common hack is to add `.Machine$double.eps` to all probabilities (then squish) so `p = 0` never occurs, i.e. map probabilities onto the interval `[eps, 1 - eps]`. Both should give virtually the same result, while the former doesn't sacrifice the precision of the whole interval `[0, 1]` just to fix a floating point problem.
```r
library(distr6); library(stats4); library(data.table)

mleEstimate <- function(distr, data, method = "BFGS", fixed = list(), nobs, ...) {
  # extract the settable parameters and their current values
  p = as.data.table(distr$parameters())
  p = subset(p, subset = p$settable, select = c("id", "value"))

  # build a negative log-likelihood with one formal argument per parameter
  f = vector("list", length(p$id))
  names(f) = p$id
  LL <- function() {}
  formals(LL) = f
  body(LL) = substitute({
    # distr6::Normal$new(0, 1)$pdf(5e1, log = TRUE) produces -Inf, so floor
    # underflowed log-densities at the log of the smallest positive double
    LL.double <- do.call(distribution$new, pars)$pdf(data, log = TRUE)
    infinite.LL <- is.infinite(LL.double)
    LL.double[infinite.LL] <- log(.Machine$double.xmin)
    -sum(LL.double) # nLL
  }, list(
    pars = eval(parse(text = paste0("alist(", paste0(p$id, "=", p$id, collapse = ","), ")"))),
    distribution = get(distr$name)
  ))

  # optimise from the current parameter values as starting points
  s = as.list(p$value)
  names(s) = p$id
  mle = stats4::mle(minuslogl = LL,
                    start = s,
                    method = method,
                    fixed = fixed,
                    nobs = nobs,
                    ...)@coef

  # return a new distribution parameterised by the fitted values
  do.call(get(distr$name)$new, as.list(round(mle, 5)))
}

mleEstimate(distr = Normal$new(mean = 2, sd = 2), data = rnorm(1000))
mleEstimate(distr = Exponential$new(rate = 2), data = rexp(1000, rate = 2))
```
So, @hyiltiz - instead of hacking machine eps, why not help with fixing the `log = TRUE` case for the most common exponential family distributions, where the log cancels with the exp?
Would be much appreciated.
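As a sketch of what "the log cancels with the exp" means in practice (plain R for illustration, not distr6 internals):

```r
# Exponential(rate): pdf(x) = rate * exp(-rate * x), so the log-pdf is
# log(rate) - rate * x, computable without ever forming the raw density
logpdf_exp <- function(x, rate) log(rate) - rate * x

log(2 * exp(-2 * 1000))    # -Inf: the density underflows before the log
logpdf_exp(1000, rate = 2) # -1999.307: exact
```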
Hacking with machine eps is probably needed, since `p = 0` or `p = 1` may not be meaningful in general. This also comes up in Bernoulli (and many other discrete) distributions.
My philosophical take is that the outputs of pdf/pmf should always be as accurate as possible, that includes returning 0 or 1 if this is the case.
So one should avoid the hack if possible - it makes sense as part of an algorithm, of course.
Summarising from gitter:

- `distr6` will include one basic estimation method, `mle`
- an `eps` parameter for controlling when `pdf = 0`, but by default this is turned off
- `mlr3proba` for more accurate and flexible methods

This issue is now on hold until #131, #179, #188 are closed.
What is the best way to implement an Estimate style of method? Some possible suggestions: