zfit / zfit-development

The development repository for zfit with roadmaps, internal docs etc. to clean up the issues

Binned PDFs and fits #46

Open jonas-eschle opened 4 years ago

jonas-eschle commented 4 years ago

We already discussed this in #38; this is a somewhat fresh start with a different approach.

Brought to the point

I think it boils down to the following: Either we have a model offering binned and unbinned methods ("combined model") or we have two kinds of models, a binned and an unbinned ("two models").

I think we should aim for two goals in zfit regarding binned and unbinned fits:

Concerns combined model:

Concerns two models:

I tend towards two models:

Extended thoughts and discussion (for documentation, no need to read)

What is clear

We need two kinds of data: Data (unbinned) and BinnedData (or Hist or something like that). Given a conversion function, they can be converted into each other.
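To make the conversion concrete, here is a minimal 1D sketch in plain Python. The `Data` and `BinnedData` classes and the `to_binned` / `to_unbinned` helpers are hypothetical stand-ins for illustration, not the zfit API. Note that going back from binned to unbinned is lossy: the sketch simply draws uniformly inside each bin.

```python
from bisect import bisect_right
from dataclasses import dataclass
import random


@dataclass
class Data:
    """Unbinned data: one value per event (hypothetical stand-in, 1D only)."""
    values: list


@dataclass
class BinnedData:
    """Binned data: bin edges plus one count per bin (hypothetical stand-in)."""
    edges: list
    counts: list


def to_binned(data, edges):
    """Histogram unbinned events; events outside the edges are dropped."""
    counts = [0] * (len(edges) - 1)
    for x in data.values:
        if edges[0] <= x < edges[-1]:
            counts[bisect_right(edges, x) - 1] += 1
    return BinnedData(list(edges), counts)


def to_unbinned(binned, rng=None):
    """Draw one uniform value per counted event inside its bin (lossy inverse)."""
    rng = rng or random.Random(0)
    values = []
    for lo, hi, n in zip(binned.edges[:-1], binned.edges[1:], binned.counts):
        values.extend(rng.uniform(lo, hi) for _ in range(n))
    return Data(values)


data = Data([0.1, 0.4, 0.45, 1.2, 1.9])
hist = to_binned(data, edges=[0.0, 0.5, 1.0, 2.0])
resampled = to_unbinned(hist)
```

Re-binning `resampled` with the same edges reproduces the counts of `hist`, but the original event positions inside each bin are gone.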

What we need

The discussion is therefore mostly about the PDFs and their methods.

I think we should aim for two goals in zfit regarding binned and unbinned fits:

Mostly the latter, assuming we want to support template PDFs, changes the current ideas from #38 quite a bit.

I can see three possibilities, starting with what I think is the least feasible:

Loss responsible

tl;dr: does not work reasonably with template PDFs (section can be skipped)

All the responsibility could be moved to the loss while we keep everything else as it is, which is basically what was discussed in #38. While this works for simple cases, it fails for others, namely:

One PDF, two methods

tl;dr: seems like having two mostly independent pdfs in one. Complexity too high?; undefined behavior (sampling binned/unbinned?); hard to extend specifically (e.g. integration for continuous, morphing for templates etc.).

We can extend the functionality of the current PDF with a method binned_prob or something similar. This method takes only binned data.

While this works nicely for simpler cases, it has a few shortcomings, namely that it's either ambiguous what a method returns or we need two versions:

My suspicion is that we end up with a pdf that has binned_prob, binned_integral, binned_partial_integral, binned_sampling etc. basically "doubling" the methods (setting flags may be unfeasible since binned methods may take other arguments such as the binsize).
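A rough sketch of what such a two-in-one model could look like, assuming a 1D Gaussian. `CombinedGauss` and all of its method signatures are hypothetical and only serve to illustrate the "doubling"; this is not how zfit implements anything.

```python
import math


class CombinedGauss:
    """Hypothetical two-in-one model: every unbinned method gains a binned twin."""

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def _cdf(self, x):
        return 0.5 * (1.0 + math.erf((x - self.mu) / (self.sigma * math.sqrt(2.0))))

    # Unbinned interface: takes points or limits, works on the density.
    def prob(self, x):
        z = (x - self.mu) / self.sigma
        return math.exp(-0.5 * z * z) / (self.sigma * math.sqrt(2.0 * math.pi))

    def integral(self, lower, upper):
        return self._cdf(upper) - self._cdf(lower)

    # Binned interface: takes bin edges, returns one value per bin.
    def binned_prob(self, edges):
        return [self.integral(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]

    def binned_integral(self, edges, lower, upper):
        # Integrates the histogram (constant within each bin), so it needs the
        # binning as an extra argument; a flag on integral() would not suffice.
        total = 0.0
        bins = zip(edges[:-1], edges[1:])
        for (lo, hi), p in zip(bins, self.binned_prob(edges)):
            overlap = max(0.0, min(hi, upper) - max(lo, lower))
            total += p * overlap / (hi - lo)
        return total


model = CombinedGauss(mu=0.0, sigma=1.0)
edges = [-1.0, 0.0, 1.0]
exact = model.integral(-0.5, 0.5)
from_bins = model.binned_integral(edges, -0.5, 0.5)
```

Here `exact` and `from_bins` differ, because the binned version only knows the content of each bin. That is the kind of ambiguity about what a method returns that worries me.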

PDF and BinnedPDF

Having two distinct classes and a converter (create_binned, create_pdf or similar) solves the above: each class has distinct methods and e.g.

In short, it's the maximum decoupling. If our goals are as stated in the beginning and we want to be maximally efficient with unbinnedPDF + unbinnedData and binnedPDF + binnedData, this architecture seems the most appropriate.
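As a minimal sketch of the two-class layout, assuming an exponential shape: `ExponentialPDF`, this toy `BinnedPDF`, and the exact `create_binned` signature are hypothetical. The binned object keeps a reference to the PDF it was created from, so parameter changes propagate.

```python
import math


class ExponentialPDF:
    """Hypothetical unbinned PDF on [0, inf): only continuous methods."""

    def __init__(self, lam):
        self.lam = lam

    def prob(self, x):
        return self.lam * math.exp(-self.lam * x)

    def integral(self, lower, upper):
        return math.exp(-self.lam * lower) - math.exp(-self.lam * upper)

    def create_binned(self, edges):
        return BinnedPDF(edges, source=self)


class BinnedPDF:
    """Hypothetical binned PDF: only per-bin methods, no pointwise prob."""

    def __init__(self, edges, source):
        self.edges = list(edges)
        self.source = source  # the dependence on the unbinned PDF is kept

    def binned_prob(self):
        # Evaluated on each call, so parameter changes in the source propagate.
        raw = [self.source.integral(lo, hi)
               for lo, hi in zip(self.edges[:-1], self.edges[1:])]
        norm = sum(raw)
        return [r / norm for r in raw]


pdf = ExponentialPDF(lam=1.0)
binned = pdf.create_binned(edges=[0.0, 1.0, 2.0, 4.0])
before = binned.binned_prob()
pdf.lam = 2.0  # e.g. a minimizer updating a parameter
after = binned.binned_prob()
```

Each class exposes only the methods that make sense for it, and the converter is the single place where the two worlds meet.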

Comparison and conclusion

The second and third approaches are actually similar: in the second, "two PDFs" are contained in one (the logic of two), while in the third, they are two distinct PDFs that depend on each other (e.g. if a BinnedPDF is created from a PDF, it keeps a dependence).

Advantage of the two-in-one:

Disadvantage of the two-in-one:

Advantage of the two-pdfs:

Disadvantage:

Conclusion: I am more worried about an overly complex two-in-one PDF and an inefficient implementation of e.g. template PDFs than about code duplication or dependencies on other PDFs.

(N.B.: this does not rule out having a BinnedLoss that uses continuous PDFs and Data; it could e.g. convert them implicitly to a BinnedPDF.)
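Such a loss could look roughly like the following sketch. `BinnedNLL`, its constructor arguments, and the toy `ExponentialPDF` are hypothetical, and the multinomial likelihood is just one possible choice that I picked for illustration.

```python
import math
from bisect import bisect_right


class ExponentialPDF:
    """Hypothetical continuous PDF on [0, inf) with an analytic integral."""

    def __init__(self, lam):
        self.lam = lam

    def integral(self, lower, upper):
        return math.exp(-self.lam * lower) - math.exp(-self.lam * upper)


class BinnedNLL:
    """Hypothetical loss: takes a continuous PDF and unbinned values, bins both."""

    def __init__(self, pdf, values, edges):
        self.pdf = pdf
        self.edges = list(edges)
        # Implicit data conversion: histogram the unbinned values once.
        self.counts = [0] * (len(self.edges) - 1)
        for x in values:
            if self.edges[0] <= x < self.edges[-1]:
                self.counts[bisect_right(self.edges, x) - 1] += 1

    def value(self):
        # Implicit PDF conversion: integrate the continuous PDF over each bin.
        raw = [self.pdf.integral(lo, hi)
               for lo, hi in zip(self.edges[:-1], self.edges[1:])]
        norm = sum(raw)
        # Multinomial negative log-likelihood, constant terms dropped.
        return -sum(n * math.log(p / norm)
                    for n, p in zip(self.counts, raw) if n > 0)


values = [0.1, 0.2, 0.3, 0.6, 0.7, 1.5]
edges = [0.0, 0.5, 1.0, 2.0]
loss_steep = BinnedNLL(ExponentialPDF(lam=1.0), values, edges).value()
loss_flat = BinnedNLL(ExponentialPDF(lam=0.01), values, edges).value()
```

With these toy values, which fall off towards larger x, the steeper exponential gives the smaller loss.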

Your thoughts @apuignav, @rsilvaco?

jonas-eschle commented 4 years ago

I'll extend here with an example:

class UnbinnedPDF(...):  # base class elided
    ...

class BinnedPDF(BaseBinnedPDF):
    # Can be specified; otherwise the conversion is automatic.
    _UnbinnedPDFClass = UnbinnedPDF
    # Needs to be set explicitly, since binned/template PDFs are often extended.
    _yield = False

    def _unnormalized_binned_prob(self, x):
        ...  # implementation

    def _convert_to_unbinned(self, *args, **kwargs):
        ...  # allows customizing the conversion

binned_pdf = BinnedPDF(...)
unbinned_pdf = binned_pdf.create_unbinned(...)  # customization args

This implies extending the PDFs with conversion functionality. But that is needed anyway, either in this way or implemented as internal logic (in the two-in-one case).
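For the binned-to-unbinned direction, a toy version of such a conversion could look like this. `TemplatePDF` and `HistogramDensity` are hypothetical names, and the piecewise-constant density is only one possible default that a `_convert_to_unbinned` override could replace.

```python
from bisect import bisect_right


class HistogramDensity:
    """Hypothetical unbinned PDF: piecewise-constant density from a template."""

    def __init__(self, edges, bin_probs):
        self.edges = list(edges)
        self.bin_probs = list(bin_probs)

    def prob(self, x):
        if not self.edges[0] <= x < self.edges[-1]:
            return 0.0
        i = bisect_right(self.edges, x) - 1
        # Spread the bin probability evenly over the bin width.
        return self.bin_probs[i] / (self.edges[i + 1] - self.edges[i])


class TemplatePDF:
    """Hypothetical binned PDF defined directly by template counts."""

    def __init__(self, edges, counts):
        self.edges = list(edges)
        self.counts = list(counts)

    def binned_prob(self):
        total = sum(self.counts)
        return [c / total for c in self.counts]

    def _convert_to_unbinned(self):
        # Default conversion; a subclass could override it, e.g. to interpolate.
        return HistogramDensity(self.edges, self.binned_prob())

    def create_unbinned(self):
        return self._convert_to_unbinned()


template = TemplatePDF(edges=[0.0, 1.0, 3.0], counts=[30, 10])
density = template.create_unbinned()
```

The public `create_unbinned` stays the same for every binned PDF, while the private hook is where a specific PDF customizes how it turns into a continuous one.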

It may need slightly more code and bookkeeping, in exchange for stronger decoupling.