I would expressly not locate estimation functionality with the primary distribution object.
Density/distribution estimation is a natural subcase of probabilistic supervised learning (i.e., that of no features), thus should be implemented together with a probabilistic supervised learning interface.
I think this makes particular sense since there are multiple ways to estimate the same kind of distribution from data, e.g., method of moments, maximum likelihood, maximum-a-posteriori given some choice of prior, least squares, etc.
My design of choice would hence be a class for each estimation method which has a `fit` and a `return` method - it is not quite `predict` as there is just a single distribution to return.
I think this is a departure from the distr design, so @PeterRuckdeschel may like to comment.
I am with Franz. Basically, estimators (or more broadly statistical decision functions) build on models and data, so this should be kept as a distinct concept / class structure. It is also helpful to distinguish estimators from estimates (decision functions from decisions), just as you are used to with random variables and their actual realizations. Implemented as a class structure, an estimator would comprise a link to a data object, a link to a model, a function mapping these two to an estimate, and the estimate itself. In addition it may contain informational add-ons (type of the estimator, diagnostic information, distribution of the estimator (if available), asymptotic variance, ...). This does not prevent the estimate from again being a distribution object, e.g. in the case of probabilistic forecasts.
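For illustration, a minimal R6 sketch of this estimator/estimate separation; all names here are hypothetical rather than existing distr6 API, and the method-of-moments fit is only a placeholder:

```r
library(R6)
library(distr6)

# Hypothetical estimator class: links a model (a distribution family) and data,
# and fills in an estimate when fit() is called
Estimator <- R6Class("Estimator",
  public = list(
    model = NULL,    # e.g. a distr6 distribution generator
    data = NULL,
    estimate = NULL, # set by fit(); may itself be a distribution object
    initialize = function(model, data) {
      self$model <- model
      self$data <- data
    },
    fit = function() {
      # method-of-moments for a Normal, as the simplest possible placeholder
      self$estimate <- self$model$new(mean = mean(self$data), sd = sd(self$data))
      invisible(self)
    }
  )
)

est <- Estimator$new(model = Normal, data = rnorm(100, mean = 2, sd = 2))
est$fit()$estimate # a Normal distribution object
```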
Agreed with both and will close this issue for now, we can return to it when discussing the modelling interface.
It would be most convenient if maximum a posteriori (MAP) estimation was implemented and included, with a default prior set to `flat`, effectively doing maximum-likelihood estimation (MLE). MLE is the method most commonly used in classical statistics and statistical learning. In addition to `flat`, the prior could also be `jeffrey` for non-informative priors; `conjugate` for efficient/faster estimation, automatically selecting the conjugate prior given the (likelihood) distribution for the sample; or some arbitrary distribution provided as the selected prior.
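A rough one-parameter sketch of that interface, assuming a hypothetical `mapEstimateRate` function and showing only the flat prior and an arbitrary distribution-object prior (the `jeffrey`/`conjugate` string options are omitted):

```r
library(distr6)

# MAP estimate of an Exponential rate: a flat prior reduces to MLE,
# a distribution object adds -log(prior density) to the objective
mapEstimateRate <- function(data, prior = "flat") {
  negLogPosterior <- function(rate) {
    nll <- -sum(log(Exponential$new(rate = rate)$pdf(data)))
    if (identical(prior, "flat")) nll else nll - log(prior$pdf(rate))
  }
  optimise(negLogPosterior, interval = c(1e-6, 100))$minimum
}

set.seed(42)
x <- rexp(500, rate = 2)
mapEstimateRate(x)                                          # plain MLE, ~2
mapEstimateRate(x, prior = Gamma$new(shape = 2, rate = 1))  # conjugate Gamma prior
```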
This would probably be enough for 90% (sorry for a random number without reference) of parameter estimations. Anything more specific or advanced should then be handled by more advanced and specialized machine learning/statistical inference packages.
Ideas?
Hi, thanks for commenting - this is actually the perfect time to re-open this issue as we are currently designing an ML interface for estimation in `mlr3proba`.
As noted originally by @fkiraly, there are several possible methods that could be used, which is why we originally discussed abstracting all estimation to another package. However, I am still in favour of having one general function for estimation that can cover a lot of use-cases.
I am unsure what this general function should be. Your suggestion of maximum a posteriori with a `flat` default seems sensible, as MLE is probably the most well-known estimation method for users, but we could also just have MLE. Here's an idea I just threw together:
```r
library(distr6); library(stats4); library(data.table)

mleEstimate <- function(distr, data, method = "BFGS", fixed = list(), nobs, ...) {
  # extract the settable parameters and their current values
  p = as.data.table(distr$parameters())
  p = subset(p, subset = p$settable, select = c("id", "value"))

  # build a negative log-likelihood with one formal argument per parameter
  f = vector("list", length(p$id))
  names(f) = p$id
  LL <- function() {}
  formals(LL) = f
  body(LL) = substitute({
    -sum(do.call(distribution$new, pars)$pdf(data, log = TRUE))
  }, list(
    pars = eval(parse(text = paste0("alist(", paste0(p$id, "=", p$id, collapse = ","), ")"))),
    distribution = get(distr$name)
  ))

  # optimise from the current parameter values as starting points
  s = as.list(p$value)
  names(s) = p$id
  mle = stats4::mle(minuslogl = LL,
                    start = s,
                    method = method,
                    fixed = fixed,
                    nobs = nobs,
                    ...)@coef

  # return a new distribution parameterised by the fitted values
  do.call(get(distr$name)$new, as.list(round(mle, 5)))
}

mleEstimate(distr = Normal$new(mean = 2, sd = 2), data = rnorm(1000))
mleEstimate(distr = Exponential$new(rate = 2), data = rexp(1000, rate = 2))
```
It isn't very efficient and there's a huge bottleneck, but it's a flexible implementation of MLE in `distr6`. If the general consensus is that a Bayesian approach is preferred for a generic estimation function then I'm happy to do that, but I think MLE tends to be better known by less technical users, and more technical users should have no problem interfacing with `mlr3proba`, although I agree that could be slight overkill if you just want to estimate a couple of parameters.
Yes, I am sure most users would be perfectly happy with a built-in MLE. MAP with `jeffrey` or `conjugate` priors can wait or be left for specialized libraries.
Does `pdf(, log = TRUE)` return sensibly for densities/masses/probabilities that equal 0 or 1? Besides that, I see no problem in the above implementation.
Although, since `distr6` has a zoo of distributions, a look-up table for `jeffrey` and `conjugate` priors would be truly awesome and a better interface than being an option to the parameter estimator function. Going even more general, it would be even more awesome if there were lookup tables for distribution transformations. A simple example:
Given: I have a Gaussian random variable X1 and another Gaussian random variable X2.
Problem: What's the distribution of X1 + X2?
Interface: `lookup(transformation = sum, operands = c(Gaussian object, Gaussian object))` should return the new Gaussian.
Limitations: There are infinitely many transformations and reparameterizations possible between distributions. But simply including the most common ones, like `sum`, `product`, `min`, `max`, `threshold` (aka X1 > X2), `square`, `root`, `conjugate`, `jeffrey`, etc., should be enough and make working with multiple distributions and complex modeling way more pleasant.
> Does `pdf(, log = TRUE)` return sensibly for densities/masses/probabilities that equal 0 or 1? Besides that, I see no problem in the above implementation.
So this directly relates to #131 and is a big reason why I haven't had a chance to look at this before. `pdf(, log = TRUE)` just calls `log(pdf())`, so it's fairly basic, which means that yes, any edge cases are problematic. Ideally I would like to implement analytical log-pdfs, but I have yet to decide whether these should go in the main distribution class or in a decorator; also, most of these will have to be manually written.
> Although, since `distr6` has a zoo of distributions

I'm taking this as a compliment; one day I hope it will be a universe of distributions.
> Given: I have a Gaussian random variable X1 and another Gaussian random variable X2. Problem: What's the distribution of X1 + X2? Interface: `lookup(transformation = sum, operands = c(Gaussian object, Gaussian object))` should return the new Gaussian. Limitations: There are infinitely many transformations and reparameterizations possible between distributions. But simply including the most common ones, like `sum`, `product`, `min`, `max`, `threshold` (aka X1 > X2), `square`, `root`, `conjugate`, `jeffrey`, etc. should be enough and make working with multiple distributions and complex modeling way more pleasant.
This is something we've discussed before in #153 and #36, and it isn't clear to me how to resolve it. I do quite like your lookup idea, to be honest: we could create a simple data.table of common combinations, and we wouldn't actually need a `lookup` function, just the `data.table::subset` function. This could serve as a good starting point to then decide which of these to implement, and would massively help with convolutions...
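For illustration, a minimal sketch of that table; the column and entry names are illustrative only, though the entries themselves are standard closure results for independent random variables:

```r
library(data.table)

# a tiny lookup table of closed-form transformation results
transforms <- data.table(
  transformation = c("sum", "sum", "product", "min"),
  operand1 = c("Normal", "Poisson", "Lognormal", "Exponential"),
  operand2 = c("Normal", "Poisson", "Lognormal", "Exponential"),
  result   = c("Normal", "Poisson", "Lognormal", "Exponential")
)

# "lookup" is then just a subset call
subset(transforms, transformation == "sum" & operand1 == "Normal" & operand2 == "Normal")
#>    transformation operand1 operand2 result
#> 1:            sum   Normal   Normal Normal
```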
I sense that whether `pdf(..., log = TRUE)` should be a decorator or a distribution property can easily turn into a philosophical debate on the nature of probability [1]. Implementing it inside the Distribution interface/class enables optimization of things like `sum(log(pdf(...)))`. For example, Gaussian log-likelihoods usually use a trick that swaps the sum and the log to avoid losing precision due to floating point numbers. However, that is again, as you said, too specific to each distribution, and is not possible in general. Personally, as an early user, I'd be quite happy with something quick and dirty such as offsetting from the edges by the smallest floating point precision, i.e. `pdf <- .Machine$double.eps + (1 - 2 * .Machine$double.eps) * pdf`.
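To make the precision point concrete, here is the same effect in base R, with `dnorm` standing in for an analytic log-pdf:

```r
# the naive route exponentiates first and underflows to 0 for extreme values
log(dnorm(40))        # -Inf, because dnorm(40) underflows to 0

# the analytic route computes -x^2/2 - log(sqrt(2*pi)) directly
dnorm(40, log = TRUE) # -800.9189
```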
Yes, the "zoo" was a compliment :D
There are infinitely many possible transformations, and most of them are either analytically intractable or need a symbolic computation engine to deal with. Fortunately, those are usually the ones a statistician doesn't really need. The "common" distributions and "common" operations fit into a textbook's index pages, and should be easy to implement using `data.table`: we simply just list them.
As in our initial discussion, I still think that estimators and estimates should be separate, agreeing with @PeterRuckdeschel.
Currently, the `distr6` interface does not support data ingestion, and adding this would require a major interface re-design and change.
However, I can see two ways in which parameter estimation could be supported:
@RaphaelS1, I don't think there should be a function `mleEstimate`, as opposed to an object (in whichever OOP paradigm). This is for the simple reason that there are many ways to compute the MLE, even if it is in theory unique; also, more generally, the "estimators" are objects in their own right, with parameters to set or tune, so they should be API encapsulation points in my opinion.
@hyiltiz, would you want to help with this (design discussions, implementations)? Note that the distr6 development team is one of volunteers (mostly @RaphaelS1), so we can't implement every "nice-to-have" feature, but it also means we're very open to contributions.
@RaphaelS1 got me onto Gitter for discussion so we won't clutter the issue tracker. I do realize that we are all volunteers. Besides, I totally agree that it is better to keep the library design clean and not bloated with every relevant nice feature. I have next to no experience with OO programming in R, but plan to do some in the coming months; I can actively contribute in discussions related to library design where feedback is needed.
Note that `distr6::Normal$new(0, 1)$pdf(5e1, log = TRUE)` returns `-Inf` due to IEEE floating point precision issues, which breaks estimation, although the pdf of that extreme event is well-defined. I added a simple hack so `-Inf` won't show up in the `LL` computation. Another common hack is to add `.Machine$double.eps` to all probabilities (then squish) so `p = 0` never occurs, i.e. map probabilities onto the interval `[eps, 1 - eps]`. Both should give virtually the same result, while the former doesn't sacrifice the precision of the whole interval `[0, 1]` just to fix a floating point problem.
```r
library(distr6); library(stats4); library(data.table)

mleEstimate <- function(distr, data, method = "BFGS", fixed = list(), nobs, ...) {
  # extract the settable parameters and their current values
  p = as.data.table(distr$parameters())
  p = subset(p, subset = p$settable, select = c("id", "value"))

  # build a negative log-likelihood with one formal argument per parameter
  f = vector("list", length(p$id))
  names(f) = p$id
  LL <- function() {}
  formals(LL) = f
  body(LL) = substitute({
    # distr6::Normal$new(0, 1)$pdf(5e1, log = TRUE) produces -Inf, so floor
    # underflowed log-densities at the log of the smallest positive double
    LL.double <- do.call(distribution$new, pars)$pdf(data, log = TRUE)
    infinite.LL <- is.infinite(LL.double)
    LL.double[infinite.LL] <- log(.Machine$double.xmin)
    -sum(LL.double) # nLL
  }, list(
    pars = eval(parse(text = paste0("alist(", paste0(p$id, "=", p$id, collapse = ","), ")"))),
    distribution = get(distr$name)
  ))

  # optimise from the current parameter values as starting points
  s = as.list(p$value)
  names(s) = p$id
  mle = stats4::mle(minuslogl = LL,
                    start = s,
                    method = method,
                    fixed = fixed,
                    nobs = nobs,
                    ...)@coef

  # return a new distribution parameterised by the fitted values
  do.call(get(distr$name)$new, as.list(round(mle, 5)))
}

mleEstimate(distr = Normal$new(mean = 2, sd = 2), data = rnorm(1000))
mleEstimate(distr = Exponential$new(rate = 2), data = rexp(1000, rate = 2))
```
So, @hyiltiz - instead of hacking machine eps, why not help with fixing the `log = TRUE` case for the most common exponential family distributions, where the log cancels with the exp?
Would be much appreciated.
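As a sketch of what "the log cancels with the exp" means in practice (plain R for illustration, not distr6 internals):

```r
# Exponential(rate): pdf(x) = rate * exp(-rate * x), so the log-pdf is
# log(rate) - rate * x, computable without ever forming the raw density
logpdf_exp <- function(x, rate) log(rate) - rate * x

log(2 * exp(-2 * 1000))    # -Inf: the density underflows before the log
logpdf_exp(1000, rate = 2) # -1999.307: exact
```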
Hacking with machine eps is probably needed, since `p = 0` or `p = 1` may not be meaningful in general. This also comes up in Bernoulli (and many other discrete) distributions.
My philosophical take is that the outputs of pdf/pmf should always be as accurate as possible, that includes returning 0 or 1 if this is the case.
So one should avoid the hack if possible - it makes sense as part of an algorithm, of course.
Summarising from gitter:

- `distr6` will include one basic estimation method, `mle`
- an `eps` parameter for controlling when `pdf = 0`, but by default this is turned off
- `mlr3proba` for more accurate and flexible methods

This issue is now on hold until #131, #179, #188 are closed.
What is the best way to implement an Estimate style of method? Some possible suggestions: