PDFs and norm_range - Githubissues

jonas-eschle commented 5 years ago

Thinking about cleaning up API etc. it may be cleaner to force each PDF (that is directly called) to have a defined norm_range. An example is the sampling: it is unclear what the norm_range actually is. And adding a parameter norm_range to it would remove the consistency with Func.

Or: the normalization range is an intrinsic definition of the PDF. It can still be changed with the set_norm_range method (temporarily as well as permanently).

Also, I am unsure how often the norm_range argument (e.g. for pdf.pdf(...)) really will be used.

Thoughts on that, @apuignav @rsilvaco @marinang, @chrisburr?

apuignav commented 5 years ago

I sort of agree. What I have always said is that, while strictly correct, norm_range is quite complicated. I would rather go with a sane/easy default and the possibility to do more complicated things for advanced users, maybe just leaving set_norm_range?

jonas-eschle commented 5 years ago

Yes, this is basically it. So the behavior would be:

obs = zfit.Space('obs1', limits=limits1)
# build pdfs with obs -> same behavior as now
pdf.pdf(x=data)  # as usual
# this would not work though:
# pdf.pdf(x=data, norm_range=...)  # norm_range is no longer a parameter

To get the pdf with a different norm_range (say limits2), one has to do:

with pdf.set_norm_range(limits2):  # or without the context manager
    pdf.pdf(x=data)

On the other hand, without setting a norm_range (one way or the other), the methods raise an error (basically as now). The only difference is that it is stricter: e.g. unnormalized_pdf gets removed, and an integral without the norm_range will also raise an error (since it simply does not make sense for a pdf).

At least so far my idea, if you have a counter-example (for the user, not internal), let me know.

apuignav commented 5 years ago

There is a caveat with raising errors when norm_range is not defined: pdf.pdf will almost always be called to plot, so in the end norm_range will always be needed.

jonas-eschle commented 5 years ago

Yes exactly, but this is anyway true (that norm_range is needed). The difference is just whether it has to be set with set_norm_range (resp. in the init with obs by giving a space with limits) or can be given as an argument to pdf(...).

I think in practical matters, most of the time one obs with the corresponding limits will be used to instantiate the whole pdf(s) anyway.

marinang commented 5 years ago

So norm_range will be the limits of obs? I agree that this should be the default.

But regarding pdf composition: can you add pdfs with different obs (so different norm ranges)? What is then the norm range of the total pdf (are they merged/combined ...)? I had a case where, for instance, I was fitting sumpdf = sumpdf1 + pdf2 + etc ..., where the sumpdf1 parameters and especially the fractions were obtained from an external fit, on simulation for instance, with a different norm range than the fit in data.

@mayou36 I am not sure I have understood the point with raising errors if by default the norm_range is the limits from obs.

jonas-eschle commented 5 years ago

So, already currently, the limits are taken from obs if they are given. There are (in the new way) two possibilities to set the norm_range:

pdf = zfit.pdf.Gauss(obs=zfit.Space('obs1', limits=...))
pdf.pdf(x)  # return the probs

Now the limits are used as norm_range

alternative

pdf = zfit.pdf.Gauss(obs=zfit.Space('obs1'))  # no limits here
pdf.set_norm_range(...)  # or with context manager
pdf.pdf(x)  # returns the probs

What won't work anymore

pdf = zfit.pdf.Gauss(obs=zfit.Space('obs1'))  # no limits here
pdf.pdf(x, norm_range=...)  # raises error, norm_range is not a parameter!

So it's about changing the ZfitPDF.pdf API from pdf(x, norm_range) to pdf(x) (and consequent methods)

Or in other words: norm_range is a "fixed" attribute of a pdf (that can be changed with a setter of course) and not an argument to the function.
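To make the "attribute, not argument" idea concrete, here is a minimal pure-Python sketch. ToyGauss and its helper methods are hypothetical stand-ins written for illustration, not zfit classes, and this toy set_norm_range only covers the temporary (context manager) form.

```python
import math
from contextlib import contextmanager


class ToyGauss:
    """Toy stand-in for a PDF (hypothetical, not the zfit implementation)."""

    def __init__(self, mu, sigma, norm_range=None):
        self.mu = mu
        self.sigma = sigma
        self.norm_range = norm_range  # (lower, upper) or None

    def _unnormalized(self, x):
        return math.exp(-0.5 * ((x - self.mu) / self.sigma) ** 2)

    def _unnormalized_integral(self, lower, upper):
        # Closed-form integral of the Gaussian shape via the error function.
        scale = self.sigma * math.sqrt(math.pi / 2)
        root2 = self.sigma * math.sqrt(2)
        return scale * (
            math.erf((upper - self.mu) / root2) - math.erf((lower - self.mu) / root2)
        )

    @contextmanager
    def set_norm_range(self, norm_range):
        # Temporary change only: the previous value is restored on exit.
        old = self.norm_range
        self.norm_range = norm_range
        try:
            yield self
        finally:
            self.norm_range = old

    def pdf(self, x):
        # No norm_range argument: the range is read from the attribute.
        if self.norm_range is None:
            raise ValueError("norm_range is not set")
        return self._unnormalized(x) / self._unnormalized_integral(*self.norm_range)


pdf = ToyGauss(mu=0.0, sigma=1.0, norm_range=(-1.0, 1.0))
p_default = pdf.pdf(0.0)  # normalized over (-1, 1)
with pdf.set_norm_range((-5.0, 5.0)):
    p_wide = pdf.pdf(0.0)  # normalized over (-5, 5), hence smaller
```

After the with block the pdf is normalized over (-1, 1) again, and a ToyGauss built without a norm_range raises on pdf(...), which mirrors the stricter behavior proposed above.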

apuignav commented 5 years ago

Question: could you set norm range on pdf instantiation?

jonas-eschle commented 5 years ago

Yes, by using obs with limits. Then this is automatically the norm_range.

So the typical workflow would be that you create your obs with limits, instantiate your pdf with it and don't ever touch norm_range again.

jonas-eschle commented 5 years ago

On the composition, @marinang :

Any daughter will use the norm_range from the mother when using anything that requires a norm_range like pdf, integral etc.

About the case with different norm_ranges: you can't do that now. I cannot think of any usecase for that as the norm_range, speaking from probability, is the region the pdf is defined in, but I may very well miss something. I know that RooFit does something like that, but I do not understand what it should mean/be doing.

Do you have a usecase for that? Can you explain it?

(There is some automatic inference, but only a "safe" one and only on instantiation of the mother: e.g. if the daughters have a norm_range and the mother does not, and they all coincide, the mother will have this norm_range as well.)

marinang commented 5 years ago

@mayou36 So my case is a data fit (signal + backgrounds) where, on the left part of the fit range, I have a wide background that is partly cut away. This cut removes other backgrounds that are hard to model or that I don't want to model, and that don't bring any additional information. But part of the wide (bbbar) background is still there and I have to model it.

I use simulated samples to model it, but to get the parameters and fractions (since it's a composition of pdfs in my case) I extend the fit range to the left to get a better description of that background and avoid edge effects (which I observe if I fit using the same fit range as in data). So the fractions are calculated in that range and therefore cannot be used in another range. It is a very easy thing to do in probfit, for instance, because you can tell on which range each pdf is normalised, compose them, and set another normalisation range on the composed pdf.

EDIT: I can overcome that by scaling the fractions to another range using integrals such that

rangeA = (x, z)
sumpdf_A = f * pdf1_A + pdf2_A  # normalized in rangeA

# in the new range
rangeB = (y, z)
sumpdf_B = (f * pdf1_A.integrate(rangeB) / sumpdf_A.integrate(rangeB)) * pdf1_B + pdf2_B  # normalized in rangeB

jonas-eschle commented 5 years ago

I agree, the norm range is just an additional "fraction" scaler. But this does not yet fit anything, and that's where my slight confusion starts: do you also intend to do a "simultaneous fit" by fitting the background pdf separately to the extra part on the left? Or how is that taken into account? And if you do, why not just fit the whole pdf together? (and maybe pre-fit the exp only on the side to have good starting values)

marinang commented 5 years ago

@mayou36 Assume, in the example that I wrote before, that this is a pdf of one background that I model using simulation, so I fit the simulated sample to get the parameters and the fraction f. Then I use that pdf and the parameters extracted from the simulation fit to add it to the model that will be used to fit the data (signal + backgrounds), and those parameters won't be free. No simultaneous fit here. I could fit the whole range in data as well, but that would mean modelling other backgrounds, and I would need way more simulation for that. It is not necessary to include it, so it's better cut away. I had however to extend the fit range to the right for the background that I am talking about, which has a wide Gaussian shape, because the left tail was not properly fitting the data.

jonas-eschle commented 5 years ago

I understand! Just to recap and make sure:

Two regions: a fit region with signal and some background ("sig+bkg") and a "complicated bkg" region, which you only use with the simulation in order to improve the fit of the specific background and to fix its shape. Of interest, though, is only the part of the bkg pdf that extends into the "sig+bkg" region. I see that changing the norm_range would screw up the fraction inside the sumpdf. Also, the overall normalization of the bkg does not really matter (as I assume this is added with the signal pdf, which is then fitted to the data, and therefore the fraction absorbs the overall bkg normalization).

Good usecase, and technically this is absolutely no problem in zfit (it actually means removing two lines of code). The question is how to specify it well, as I think that in general propagating the norm_range (or having the possibility to propagate it) could be useful. But probably the default should be not to propagate.

Any specific opinion on how it should behave in which situations?

my take: there is a flag in set_norm_range like propagate=False. And every other time a norm_range argument is available, it does not propagate.

P.S: on your code-snippet @marinang, I think it should be sumpdf_A -> pdf2_A, or am I mistaken?

sumpdf_B = (f * pdf1_A.integrate(rangeB) / pdf2_A.integrate(rangeB)) * pdf1_B + ...

marinang commented 5 years ago

@mayou36 yes good summary.

However for the snippet. You have to normalise by the integral of the whole pdf, otherwise the total pdf is not normalised.

sumpdf_A = f * pdf1_A + (1-f) * pdf2_A
sumpdf_A.integrate(rangeB) = f * pdf1_A.integrate(rangeB) + (1-f) * pdf2_A.integrate(rangeB)

1.0 = (f * pdf1_A.integrate(rangeB) + (1-f) * pdf2_A.integrate(rangeB)) / sumpdf_A.integrate(rangeB)

so

new_f1 = (f * pdf1_A.integrate(rangeB) / sumpdf_A.integrate(rangeB))
new_f2 = ((1-f) * pdf2_A.integrate(rangeB) / sumpdf_A.integrate(rangeB))
sumpdf_B = new_f1 * pdf1_B + new_f2 * pdf2_B = new_f1 * pdf1_B + (1-new_f1) * pdf2_B

Does it make sense?

jonas-eschle commented 5 years ago

Hm, but where did the pdf2_A.integrate(rangeB) go? This should also be absorbed in the new f, right?

Not normalized with respect to which range? to range A, right? But what would you expect? If you have different norm_ranges, the pdf won't be "normalized" over A nor B, right?

In my snippet, all that is done is to multiply the fraction by the ratio of (pdf1 old norm vs. new norm) and (pdf2 old norm vs. new norm), since pdf_A.integrate(B) is equal to the unnormalized integral over B divided by the integral over A.
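A quick numerical check of that last identity, as a sketch: the shape exp(-x), the two ranges and the helper names are arbitrary choices made for illustration, and nothing here uses zfit.

```python
import math


def shape_integral(lower, upper):
    # Analytic integral of the unnormalized shape exp(-x) over [lower, upper].
    return math.exp(-lower) - math.exp(-upper)


range_a = (0.0, 2.0)  # norm range A
range_b = (0.5, 1.5)  # integration range B


def pdf_a(x):
    # The shape normalized over range A.
    return math.exp(-x) / shape_integral(*range_a)


def integrate(func, lower, upper, steps=100_000):
    # Plain midpoint rule, accurate enough for this smooth function.
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


# Left: integrate the A-normalized pdf over B numerically.
lhs = integrate(pdf_a, *range_b)
# Right: unnormalized integral over B divided by the integral over A.
rhs = shape_integral(*range_b) / shape_integral(*range_a)
```

Both numbers agree up to the precision of the numerical integration, so pdf_A.integrate(B) can be read as "the share of the A-normalized pdf that falls inside B".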

marinang commented 5 years ago

Coming back here. So, as it is now, fractions are still left unchanged for composed pdfs when changing the normalisation range. Is the norm_range of the sub models also changed?

The way the fractions should be updated is the following:

sumpdf_A = fa * pdf1_A + (1-fa) * pdf2_A # _A = normalized over range A

IB = sumpdf_A.integrate(rangeB) = fa * pdf1_A.integrate(rangeB) + (1-fa) * pdf2_A.integrate(rangeB)

fb = fa * pdf1_A.integrate(rangeB) / IB

which gives

sumpdf_B = fb * pdf1_B + (1-fb) * pdf2_B

The question is how things are done in the background: since fa is a zfit.Parameter, should it be (temporarily) updated to the value of fb?
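The fraction update above can be checked with plain Python. This is only a sketch: the two Gaussian components, the ranges and fa are arbitrary illustration values, all helper names are made up here, and the integrals are closed-form error functions rather than zfit calls.

```python
import math


def gauss_shape(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def gauss_integral(lower, upper, mu, sigma):
    # Integral of the unnormalized Gaussian shape over [lower, upper].
    scale = sigma * math.sqrt(math.pi / 2)
    root2 = sigma * math.sqrt(2)
    return scale * (math.erf((upper - mu) / root2) - math.erf((lower - mu) / root2))


def normalized(x, mu, sigma, norm_range):
    return gauss_shape(x, mu, sigma) / gauss_integral(*norm_range, mu, sigma)


range_a = (0.0, 15.0)  # range used for the simulation fit
range_b = (5.0, 15.0)  # narrower range used for the data fit
comp1 = (12.0, 1.0)    # (mu, sigma) of pdf1
comp2 = (7.0, 2.0)     # (mu, sigma) of pdf2
fa = 0.3

# pdfN_A.integrate(rangeB): share of each A-normalized component inside B.
i1 = gauss_integral(*range_b, *comp1) / gauss_integral(*range_a, *comp1)
i2 = gauss_integral(*range_b, *comp2) / gauss_integral(*range_a, *comp2)
ib = fa * i1 + (1 - fa) * i2  # IB = sumpdf_A.integrate(rangeB)
fb = fa * i1 / ib


def sumpdf_a(x):
    return fa * normalized(x, *comp1, range_a) + (1 - fa) * normalized(x, *comp2, range_a)


def sumpdf_b(x):
    return fb * normalized(x, *comp1, range_b) + (1 - fb) * normalized(x, *comp2, range_b)
```

For any x inside range B, sumpdf_b(x) equals sumpdf_a(x) / ib, i.e. the B-normalized sum built with fb has the same shape as the original sum and differs only by the constant 1 / IB.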

jonas-eschle commented 5 years ago

Yes, the submodels range is currently changed as well. But you're right, this should not be the behavior and different norm_ranges in the submodels should be doable.

We started a discussion/overview in #39 about breaking changes to be considered for the next few releases. One of them is to exactly change the norm_range behavior. Please feel free to comment on it, we did not yet internally discuss it really.

It's of course never nice to do that, but it's better to "break" sooner than never. With the unchanged norm ranges, the problem would vanish, right?

marinang commented 5 years ago

"It's of course never nice to do that, but it's better to "break" sooner than never. With the unchanged norm ranges, the problem would vanish, right?"

Let's see if I understood before flooding #39. Example


import zfit
from zfit import Parameter, Space
from zfit.pdf import Gauss, SumPDF

obs = Space('obs', (0, 15))

pdf1 = Gauss(obs=obs, mu=12, sigma=1.0)
pdf2 = Gauss(obs=obs, mu=7, sigma=2.0)
f = Parameter('f', 0.5, 0.0, 1.0)

sumpdf = SumPDF(pdfs=[pdf1, pdf2], fracs=[f])

So the norm_range for sumpdf is obs as well.

Now, if I take what is said in #39 and do

with sumpdf.set_norm_range((0, 10), propagate=False):
    print(zfit.run(sumpdf.integrate((0, 10))))

It prints 1.0, but the integral is computed with the subpdfs still normalised over the obs range and using the frac f. That integral is not equal to 1.0, so sumpdf has to be divided by its integral over the new norm_range. Is this what I should expect? If yes, perfect.

jonas-eschle commented 5 years ago

Nearly. It's true about what is set, but why would you expect this not to be 1.0?

pdf1 and pdf2 are then normalized over obs, yes, and the frac remains the same. SumPDF is then normalized over its own norm_range, which is (0, 10), so "by definition" the integral over (0, 10) has to give 1.

Do you agree? So it's as you said but without the need to normalize again.

marinang commented 5 years ago

There is a misunderstanding, I was not clear, but I think we agree. If pdf1 and pdf2 are normalised over obs, the integral of f*pdf1 + (1-f)*pdf2 over (0, 10) is not 1.0, but it is over obs. sumpdf is now normalised over (0, 10), so its integral is equal to 1.0 by definition I agree. But it has to be forced to be 1.0 since integral of f*pdf1 + (1-f)*pdf2 is not equal to one.
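The point both sides agree on can be reproduced numerically with the values from the example (obs = (0, 15), Gaussians at 12 and 7, f = 0.5). This is a plain-Python sketch with made-up helper names, not zfit code; sumpdf_integral is a hypothetical stand-in for what a non-propagating norm_range would compute.

```python
import math


def gauss_integral(lower, upper, mu, sigma):
    # Integral of the unnormalized Gaussian shape over [lower, upper].
    scale = sigma * math.sqrt(math.pi / 2)
    root2 = sigma * math.sqrt(2)
    return scale * (math.erf((upper - mu) / root2) - math.erf((lower - mu) / root2))


obs = (0.0, 15.0)
new_range = (0.0, 10.0)
f = 0.5


def sub_integral(limits, mu, sigma):
    # Integral over `limits` of a Gaussian that stays normalized over obs.
    return gauss_integral(*limits, mu, sigma) / gauss_integral(*obs, mu, sigma)


def raw_sum_integral(limits):
    # Integral of f*pdf1 + (1-f)*pdf2 with both subpdfs normalized over obs.
    return f * sub_integral(limits, 12.0, 1.0) + (1 - f) * sub_integral(limits, 7.0, 2.0)


def sumpdf_integral(limits, norm_range):
    # The sum normalized over its own norm_range: raw integral / raw norm.
    return raw_sum_integral(limits) / raw_sum_integral(norm_range)


raw_over_obs = raw_sum_integral(obs)        # 1 by construction
raw_over_new = raw_sum_integral(new_range)  # clearly below 1
forced_over_new = sumpdf_integral(new_range, new_range)  # 1 by definition
```

The raw sum integrates to 1 over obs but to noticeably less over (0, 10); only the explicit division by the integral over the new norm_range brings it back to 1.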

jonas-eschle commented 5 years ago

Yes, exactly!

zfit / zfit-development

PDFs and norm_range #36