We already discussed this in #38; this is a somewhat fresh start with a different approach.

The short version

I think it boils down to the following: either we have one model offering both binned and unbinned methods ("combined model"), or we have two kinds of models, a binned one and an unbinned one ("two models").

I think we should aim for two goals in zfit regarding binned and unbinned fits:

having the flexibility to simply mix binned pdfs with unbinned ones, even if that may be inefficient
keeping the efficiency of the "pure" cases high (e.g. only a template pdf with binned data, or a continuous pdf with unbinned data)

Concerns about the combined model:

how do we determine the integration? Do we also have a binned_integration?
it looks like two separate models inside one (binned_prob, binned_sampling, binned_integration etc. are effectively decoupled from the continuous case) -> increases complexity

Concerns about two models:

it means a lot of conversions between the two. We need some code to set up the converters correctly, though once that is done, it's done.
we end up with 2 x 2 conversion possibilities (func/pdf and unbinned/binned), though the combined model has these too; here they are just spread over 2 models

I tend towards two models:

it allows each class to be extended more easily for its use cases (e.g. binned template fits do not need to know anything about the unbinned case, analytic integrals are well defined etc.), with less complexity.
it is clearer what happens.

Extended thoughts and discussion (for documentation, no need to read)

What is clear

we need two kinds of data: Data (unbinned) and BinnedData, Hist or something like that. Given a conversion function, they can be converted into each other.
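As a rough sketch of such a conversion (not zfit's actual API: the Data and BinnedData classes and the to_binned / to_unbinned helpers below are hypothetical, 1D-only stand-ins), binning is a plain histogram, while going back requires an assumption about the shape inside each bin:

```python
from bisect import bisect_right
from dataclasses import dataclass
import random


@dataclass
class Data:
    """Unbinned data: one value per event (hypothetical stand-in, 1D only)."""
    values: list


@dataclass
class BinnedData:
    """Binned data: bin edges plus one count per bin (hypothetical stand-in)."""
    edges: list   # n_bins + 1 ascending edges
    counts: list  # n_bins entries


def to_binned(data, edges):
    """Histogram unbinned events; events outside the edges are dropped."""
    counts = [0] * (len(edges) - 1)
    for x in data.values:
        if edges[0] <= x < edges[-1]:
            counts[bisect_right(edges, x) - 1] += 1
    return BinnedData(list(edges), counts)


def to_unbinned(binned, rng=None):
    """Draw one value per counted event, uniformly inside its bin.

    The uniform shape inside a bin is an assumption of this sketch,
    so the round trip unbinned -> binned -> unbinned is lossy.
    """
    rng = rng or random.Random(0)
    values = []
    for low, high, n in zip(binned.edges[:-1], binned.edges[1:], binned.counts):
        values.extend(rng.uniform(low, high) for _ in range(n))
    return Data(values)


data = Data([0.1, 0.4, 0.45, 1.2, 1.9])
hist = to_binned(data, edges=[0.0, 0.5, 1.0, 1.5, 2.0])
print(hist.counts)  # [3, 0, 1, 1]
```

The asymmetry is the point here: Data to BinnedData only needs the edges, BinnedData to Data needs extra modelling input.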

What we need

The discussion is therefore mostly about the pdf and its methods.

I think we should aim for two goals in zfit regarding binned and unbinned fits:

having the flexibility to simply mix binned pdfs with unbinned ones, even if that may be inefficient
keeping the efficiency of the "pure" cases high (e.g. only a template pdf with binned data, or a continuous pdf with unbinned data)

Mostly the latter, assuming we want to be able to do template pdfs, changes the current ideas from #38 quite a bit.

I can see three possibilities, starting with what I think is the least feasible:

Loss responsible

tl;dr: does not work reasonably with template pdfs (section can be skipped)

All the responsibility could be moved to the loss while we otherwise keep things as they are, which is basically what was discussed in #38. While this works for simple cases, it fails for others, namely:

template fits are impossible to do efficiently, since a template pdf first needs to be converted to a continuous pdf (if we only have the current _unnormalized_pdf method) and is then binned again by the loss. We would stay far below the achievable efficiency.
... there is more, but I won't list it as this seems too unfeasible. Ask if you do not agree with my strong conclusion.
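To illustrate the round trip, here is a minimal sketch under assumptions: TemplateAsContinuous and rebin_in_loss are hypothetical names, and the midpoint rule with 100 steps per bin is just one possible way a loss could re-bin a continuous pdf. The template already knows its bin contents, yet the loss has to recover them through many point evaluations:

```python
from bisect import bisect_right


class TemplateAsContinuous:
    """Template forced behind a point-wise density interface (hypothetical)."""

    def __init__(self, edges, contents):
        self.edges, self.contents = list(edges), list(contents)
        self.n_calls = 0  # counts point evaluations, for illustration only

    def _unnormalized_pdf(self, x):
        # Piecewise-constant density: bin content divided by bin width.
        self.n_calls += 1
        i = bisect_right(self.edges, x) - 1
        return self.contents[i] / (self.edges[i + 1] - self.edges[i])


def rebin_in_loss(pdf, edges, steps=100):
    """What a binning loss would have to do: integrate numerically per bin."""
    result = []
    for low, high in zip(edges[:-1], edges[1:]):
        h = (high - low) / steps
        area = sum(pdf._unnormalized_pdf(low + (k + 0.5) * h) for k in range(steps)) * h
        result.append(area)
    return result


edges = [0.0, 1.0, 2.0, 3.0]
template = TemplateAsContinuous(edges, contents=[2.0, 6.0, 2.0])
rebinned = rebin_in_loss(template, edges)
print(template.n_calls)  # 300 evaluations to recover three known numbers
```

A dedicated binned model could simply return its stored contents instead.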

One PDF, two methods

tl;dr: seems like having two mostly independent pdfs in one. Complexity too high?; undefined behavior (sampling binned/unbinned?); hard to extend specifically (e.g. integration for continuous, morphing for templates etc.).

We can extend the functionality of the current PDF with a method binned_prob or something similar that takes only binned data.

While this works nicely for simpler cases, it has a few shortcomings, namely that either it is ambiguous what a method returns or we need two versions:

what should sample return? A binned or an unbinned object? If the pdf was defined as binned, we cannot just return unbinned data.
what should integrate do? Integrate the binned or unbinned pdf?

My suspicion is that we end up with a pdf that has binned_prob, binned_integral, binned_partial_integral, binned_sampling etc., basically "doubling" the methods (setting flags may be unfeasible since binned methods may take other arguments, such as the bin size).
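A sketch of how this doubling could look, assuming a simple Gaussian (CombinedGauss and all its method names and signatures are hypothetical, not existing zfit code). Note that the binned half needs edges and bin indices rather than values and limits, which is why a simple flag would not be enough:

```python
import math


class CombinedGauss:
    """Hypothetical 'one PDF, two methods' Gaussian: every task exists twice."""

    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def _cdf(self, x):
        return 0.5 * (1.0 + math.erf((x - self.mu) / (self.sigma * math.sqrt(2.0))))

    # --- unbinned half: works on values and limits ---
    def prob(self, x):
        z = (x - self.mu) / self.sigma
        return math.exp(-0.5 * z * z) / (self.sigma * math.sqrt(2.0 * math.pi))

    def integrate(self, low, high):
        return self._cdf(high) - self._cdf(low)

    # --- binned half: same tasks, but with edges and bin indices as arguments ---
    def binned_prob(self, edges):
        raw = [self.integrate(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]
        total = sum(raw)
        return [r / total for r in raw]  # normalized over the binned range

    def binned_integrate(self, edges, first_bin, last_bin):
        return sum(self.binned_prob(edges)[first_bin:last_bin + 1])


model = CombinedGauss(mu=0.0, sigma=1.0)
edges = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(model.integrate(-1.0, 1.0))
print(model.binned_integrate(edges, 1, 2))
```

The two integrals above answer different questions (probability over the full real line versus fraction of the binned range), which is the kind of ambiguity a single integrate method would hide.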

PDF and BinnedPDF

Having two distinct classes and a converter (create_binned, create_pdf or similar) solves the above, since each class has its own distinct methods:

sampling from a BinnedPDF returns a BinnedData, sampling from a PDF returns a Data; the same holds for integration etc.
customization, probably the strongest point: this allows the two kinds of fits to be independent and e.g. to extend the template pdfs with morphing etc. It even allows invoking a completely different library behind the scenes and building an additional "template pdf" section, without worrying about the unbinned case.

In short, it's the maximum decoupling. If our goals are as stated in the beginning and we want to be maximally efficient with unbinnedPDF + unbinnedData and binnedPDF + binnedData, this architecture seems the most appropriate.
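A minimal sketch of the two-class layout, under assumptions: the class bodies, the from_pdf helper and the source attribute are hypothetical illustrations (only the names PDF, BinnedPDF and create_binned come from the discussion above), and plain lists stand in for Data and BinnedData:

```python
import math
import random


class PDF:
    """Continuous model (hypothetical sketch): only unbinned methods."""

    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def integrate(self, low, high):
        def cdf(x):
            return 0.5 * (1.0 + math.erf((x - self.mu) / (self.sigma * math.sqrt(2.0))))
        return cdf(high) - cdf(low)

    def sample(self, n, rng):
        return [rng.gauss(self.mu, self.sigma) for _ in range(n)]  # unbinned values

    def create_binned(self, edges):
        return BinnedPDF.from_pdf(self, edges)


class BinnedPDF:
    """Binned model (hypothetical sketch): only binned methods."""

    def __init__(self, edges, contents, source=None):
        self.edges = list(edges)
        total = sum(contents)
        self.probs = [c / total for c in contents]
        self.source = source  # dependency on the PDF it was converted from, if any

    @classmethod
    def from_pdf(cls, pdf, edges):
        contents = [pdf.integrate(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]
        return cls(edges, contents, source=pdf)

    def integrate(self, first_bin, last_bin):
        return sum(self.probs[first_bin:last_bin + 1])

    def sample(self, n, rng):
        counts = [0] * len(self.probs)
        for i in rng.choices(range(len(self.probs)), weights=self.probs, k=n):
            counts[i] += 1
        return counts  # binned: one count per bin


rng = random.Random(1)
binned = PDF(0.0, 1.0).create_binned([-1.0, 0.0, 1.0])  # converted from a PDF
template = BinnedPDF([0.0, 1.0, 2.0, 3.0], [2, 6, 2])   # pure template, no PDF involved
print(template.integrate(0, 1), sum(binned.sample(100, rng)))
```

The pure template never touches anything unbinned, while the converted object keeps a reference to its source PDF.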

Comparison and conclusion

The second and third approaches are actually similar: in the second, "two pdfs" are contained in one (the logic of two), while in the third, they are two distinct pdfs that can depend on each other (e.g. if a BinnedPDF is created from a PDF, it keeps a dependence on it).

Advantage of the two-in-one:

the dependence is clearer
easier to avoid code duplication

Disadvantage of the two-in-one:

more complex implementation (one always has to think of the other case as well)
possibly mandatory methods that actually only affect the binned or the unbinned case

Advantage of the two-pdfs:

simpler and possibly more efficient implementation.
intention always clear
worst case: if we see too many redundancies, merging into one PDF is still possible

Disadvantage of the two-pdfs:

a converted pdf keeps a pending dependency on the other type (but that's probably similar to Func etc.)
possibly duplicated code (but with good enough inheritance and functions, this should not really be a problem)

Conclusion: I fear an overly complex two-in-one pdf and an inefficient implementation of e.g. template pdfs more than code duplication or the dependencies on other pdfs.

(N.B.: this does not exclude having a BinnedLoss that uses continuous pdfs and Data; it could e.g. convert them implicitly to a BinnedPDF)
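Such an implicit conversion could be sketched as follows. This is an assumption-laden illustration: UniformPDF and binned_nll are hypothetical, a plain function stands in for a BinnedLoss class, and a multinomial negative log-likelihood with constant terms dropped is just one possible choice of binned loss:

```python
import math
from bisect import bisect_right


class UniformPDF:
    """Continuous stand-in model (hypothetical), uniform on [low, high]."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def integrate(self, lo, hi):
        lo, hi = max(lo, self.low), min(hi, self.high)
        return max(hi - lo, 0.0) / (self.high - self.low)


def binned_nll(model, values, edges):
    """Multinomial negative log-likelihood, constant terms dropped.

    Takes a continuous model and unbinned values and bins both internally.
    Assumes every populated bin has non-zero model probability.
    """
    # Implicit model conversion: integrate the continuous pdf over each bin.
    probs = [model.integrate(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Implicit data conversion: histogram the unbinned values.
    counts = [0] * (len(edges) - 1)
    for x in values:
        if edges[0] <= x < edges[-1]:
            counts[bisect_right(edges, x) - 1] += 1
    return -sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)


nll = binned_nll(UniformPDF(0.0, 2.0), [0.2, 0.7, 1.5], edges=[0.0, 1.0, 2.0])
print(nll)
```

The user only ever sees continuous objects here; the binning stays an implementation detail of the loss.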

Your thoughts @apuignav, @rsilvaco?

zfit / zfit-development

Binned PDFs and fits #46