jonas-eschle commented 5 years ago

@abhijitm08 commented on Thu Apr 04 2019

I have the code for simple template fit (need to convert of binned fit) in pure TF. https://colab.research.google.com/drive/13RR8ePrVtpdvaVp5MmyUQHbs0UcflH7i

Need to get this integrated with zfit. However, we need to discuss:

1) How the data is should be handled (just use the output of np.histogramdd? There is a TF function 'tf.histogram_fixed_width_bins' but only for 1D.) 2) If the PDF is continuous and not a template, provide the user the option of evaluating the PDF either at the bin centre or integrating b/w the bin boundaries.

jonas-eschle commented 5 years ago

Moved it here to have the higher level discussion

So we need two things in terms of API: Data and PDF/Func

Data API

we could add a method to the current Data, something like

binned: parameters; bins (number of bins or n-d binning array). What should be the exact return value? (bin count, bin_limits)?

PDF/Func API

binned_pdf resp. binned_func: parameters; bins (somehow determine the binning? limits?), a parameter to define where to take the value that may accepts functions (e.g. for a simple integration).

Implementation

tf does not offer too much in terms of binning. We could for the general case use np and wrap it even with tf.py_func (no problem as the gradient is not needed anyway)

jonas-eschle commented 5 years ago

Doing binned fits with pdfs, we have to "bin" the pdf somehow (e.g. integrate out the bins, take the middle point etc.)

What is unclear to me: what's the best way of doing that or rather: who has the responsibility to do it? The loss (with taking an additional "pdf_binning" function/string)? Or the pdf (binned_pdf with the same function as above?)? Or even the data that has a function "bin_pdf" which takes a pdf and returns it's values using a previously specified binning scheme?

@apuignav thoughts?

apuignav commented 5 years ago

Also pinging @abhijitm08 as binned fit expert.

apuignav commented 5 years ago

However, I am still unsure why we need to bin the pdf. Integrating the bins can be taken care of by a VectorSpace, middle point should be a simple transformation previous to unstack, etc...

abhijitm08 commented 5 years ago

To me the responsibility of binning is the responsibility of the loss function alone. The PDF (defined over continuous variables) and PMF (over discrete variables) are always continuous. So discretising these does not make any sense to me (even implementation wise). The Data class should have an argument of whether to bin the data or not and if binned should return the output of numpy histogramdd.

jonas-eschle commented 5 years ago

@apuignav sure but who does all this?

@abhijitm08 we we're even thinking of a different class, a BinnedData, since the output is quite different. E.g. there are bincounts and there are binning limits. But I also tend to the loss taking the responsibility for which method to choose, but the Data has to return (same as a normal data I think) the x values (maybe middle, maybe differently calculated) in order to be able to do amypdf.pdf(binned_data). While the loss can choose to set that, it can also do an integration.

My idea now: BinnedData can return the right x via BinnedData.value() (equivalent to the Data) or also BinnedData.unstack_x() so that is is fully compatible with pdfs. How to choose the right x can be set (with a method that does it).

Loss takes care of which strategy to use: integrate, or use the middlepoints etc...

abhijitm08 commented 5 years ago

Thinking of it again. I somehow feel that, the responsibility of binning and also which points to evaluate the pdf must be all handled by the loss function. So the user has to only change a single argument in the loss function(binned = True or False).

So internally the loss function must bin the data (returning the bin counts and x values at which the pdf must be evaluated, this can indeed be handled by BinnedData class), evaluate pdf at those x points and return the loss tensor.

jonas-eschle commented 5 years ago

I would not use a flag but rather create an additional loss function (BinnedNLL, Chi2) since binning does not make sense for every loss (or rather unbinned, e.g. chi2).

Furthermore, a bool is not enough to specify the binning since there are different binning strategies (-> integrate, midpoint evaluate, etc.).

Formalize the problem

(my idea so far) To formalize, we need three things: two handling the data, forth and back from binned and data (as "helpers" for the loss), one about the loss.

From the user site it looks like this:

create binned data (from bins or from Data) where "binner" (better name needed) defines how to bin the data (str or callable). Something like:

# from Data
data = zfit.Data.from_numpy(...)
binned_data = data.create_binned(bins=..., binner="same" or callable)
# OR from bins
binned_data = zfit.BinnedData.from_numpy_bins(...)  # or other names, HistData,...

create Dataor Data-like object from BinnedData to be able to give it to pdf, func etc.

data = binned_data.create_data(converter="midpoint" or callable)
# OR
# notice that the following would give an error since it is binned data and not
# unambiguous which value to take, midpoint or whatever
# values = binned_data.unstack_x()  # raises error
binned_data.set_value_converter(converter="midpoint" or callable)
values = binned_data.unstack_x()  # now uses "midpoint" to convert to x

create a loss somehow from binned data and letting the user specify the strategy. Compared to above, this is not just about how to convert a bin to an x but can also e.g. use the bin boundaries to integrate the pdf.
```
# bin_to_value = "midpoint"
# bin_to_value = "integrate"
bin_to_value = lambda data, model: ... whatever
binned_nll = BinnedNLL(pdf, data=binned_data, bin_to_y=bin_to_y)
```

and the loss from the implementation site could be something like:

def midpoints(bin_bounds):
    midpoints = ...  # calculate midpoints from boundaries
    return midpoints

def bin_to_y_midpoint(data, model):
    with data.set_value_converter(midpoints):  # tells how x should be
        y = model.pdf(data)
    return y

def Chi2(BaseLoss)
    def _loss(data, model, bin_to_y):
        y = bin_to_y(data, model)  # let's assume it's bin_to_y_midpoint from above
        chi2 = (y - data.bincount) ** 2  # modulo normalization
        return chi2

as usual, we may take a simple default or allow strings so that the most common used methods are available as strings. On the other hand, it allows for simple flexibility. (and for the crazy stuff -> directly create custom loss)

abhijitm08 commented 5 years ago

You can find the implementation of extended binned likelihood fit using zfit in the colab notebook here, with example (It can evaluate integral of pdf over bin edges or bin centers):

https://colab.research.google.com/drive/1t5LF1xOQvlnt85E2kWBLbWpEmtnjuQbl

I am quite a hypocrite, because I have used three functions (despite telling you to absorb the functionality into loss function) lol:

BinData (which uses np.histogramdd, I really wish there was a TF implemention of this. Maybe we need to come up with our own TF multidimensional histogram implementation?)
BinPDF which takes the bin edges output of BinData to evaluate pdf (either integrating over bin edges or bin centres).
BinnedNLL for loss function which in this case in extended binned NLL.

Can you please integrate this into zfit core or physics libraries? Of course do improve upon this.

jonas-eschle commented 5 years ago

Looks interesting, thanks. I am currently playing around as well. We'll have a branch and then work on it, I'll update you once I have something

zfit / zfit-development

Implementation of multidimensional binned fit - API and specs #38

Data API

PDF/Func API

Implementation

Formalize the problem