zfit / zfit

Model manipulation and fitting library based on TensorFlow and optimised for simple and direct manipulation of probability density functions. Its main focus is on scalability, parallelisation and user friendly experience.
http://zfit.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Implement Asimov dataset creation for unbinned models #576

Open ikrommyd opened 2 months ago

ikrommyd commented 2 months ago

There is also a way to create Asimov datasets from unbinned models, either by 1) making a binning on the spot, or 2) generating it from many weighted unbinned events. Check the combine docs for more info: https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/latest/part3/runningthetool/#toy-data-generation
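To make option 1 concrete, here is a minimal sketch of a binned Asimov dataset: the content of each bin is the expected (generally non-integer) count, i.e. the total yield times the model's integral over that bin. It uses a plain Gaussian and only the standard library; the helper names `gauss_cdf` and `binned_asimov` are made up for illustration and are not zfit or combine API.

```python
import math


def gauss_cdf(x, mu, sigma):
    """CDF of a normal distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def binned_asimov(n_expected, mu, sigma, lower, upper, nbins):
    """Expected counts per bin for a Gaussian truncated to [lower, upper].

    Hypothetical helper: each bin holds n_expected times the fraction of
    the (truncated) model that falls inside the bin. Nothing is sampled.
    """
    width = (upper - lower) / nbins
    edges = [lower + i * width for i in range(nbins + 1)]
    # Normalise to the fit range so the counts add up to n_expected.
    norm = gauss_cdf(upper, mu, sigma) - gauss_cdf(lower, mu, sigma)
    counts = [
        n_expected * (gauss_cdf(hi, mu, sigma) - gauss_cdf(lo, mu, sigma)) / norm
        for lo, hi in zip(edges[:-1], edges[1:])
    ]
    return edges, counts


edges, counts = binned_asimov(
    n_expected=1000.0, mu=0.0, sigma=1.0, lower=-5.0, upper=5.0, nbins=50
)
```

The counts are deterministic and sum to the expected yield, which is what distinguishes an Asimov dataset from a randomly sampled toy.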

jonas-eschle commented 2 months ago

That's an interesting issue, because hepstats currently creates a binned Asimov set here, also for unbinned data, which is not optimal and should not happen. But it could be a very high-statistics dataset; that could be a possibility.

The question is a bit conceptual: what is needed? Creating a binned one is probably already possible with to_binned and to_binneddata (I think, right?)

What are the weighted unbinned events, and why weighted? I'm not sure where the weights are coming from.

And it reminds me of another discussion about the "best binning", as we're doing a lot of unbinned fits in LHCb that could, in principle, be binned. So implementing something like this https://arxiv.org/abs/2210.02848 could be useful.

I guess these things are already possible to do today; hepstats, or zfit itself, should have an automatic binning. Modulo that, it isn't as easily accessible to the user as it maybe should be. It is possible, so it's more a question of how to communicate this to the user.

ikrommyd commented 2 months ago

I was just looking at the zfit code (not hepstats). Cool, so maybe a shortcut for to_binned -> to_binneddata would be nice and easy to have for unbinned models. For the weighted unbinned events, look at the "Pseudo-Asimov" dataset in the combine docs I linked above.
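On where the weights could come from, one plausible reading (an assumption on my side, not a statement of what combine implements) is this: sample many more events than expected from the model, then give every event the same weight n_expected / n_sampled. The sum of weights then equals the expected yield, while the statistical fluctuations shrink as the sample grows. A standard-library sketch with a Gaussian model; `pseudo_asimov` is a hypothetical name:

```python
import random


def pseudo_asimov(n_expected, n_events, mu, sigma, seed=0):
    """Weighted unbinned stand-in for an Asimov dataset (illustrative only).

    Draws n_events points from a Gaussian and weights each one by
    n_expected / n_events, so the weights sum to the expected yield.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    events = [rng.gauss(mu, sigma) for _ in range(n_events)]
    weight = n_expected / n_events
    weights = [weight] * n_events
    return events, weights


events, weights = pseudo_asimov(
    n_expected=250.0, n_events=100_000, mu=0.0, sigma=1.0
)

# The weighted mean approaches the model mean as n_events grows.
weighted_mean = sum(w * x for x, w in zip(events, weights)) / sum(weights)
```

Unlike a binned Asimov dataset, this is only approximately fluctuation-free: the residual noise scales like 1 / sqrt(n_events), so the quality depends on how many weighted events are generated.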

When it comes to visibility: if I search the code for the word "Asimov", I find to_binneddata. However, if I didn't search like that, I would expect something like model.create_asimov, or even something on the sampler. Since we do sampler = model.create_sampler() to generate toys, we could have a sampler method besides resample, called make_asimov or something like that.
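A rough sketch of how that could look from the user's side, written as a standalone toy class so it does not pretend to be zfit code: `GaussSampler` and its `make_asimov` method are hypothetical, only the method names `resample` and `make_asimov` are borrowed from the discussion here, and the binning is passed explicitly so the caller decides it.

```python
import random
from statistics import NormalDist


class GaussSampler:
    """Hypothetical toy sampler; names mirror this thread, not zfit's API."""

    def __init__(self, mu, sigma, n_events, seed=0):
        self.dist = NormalDist(mu, sigma)
        self.n_events = n_events
        self._rng = random.Random(seed)
        self.data = []
        self.resample()

    def resample(self):
        """Replace the held data with a fresh random toy."""
        self.data = [
            self._rng.gauss(self.dist.mean, self.dist.stdev)
            for _ in range(self.n_events)
        ]

    def make_asimov(self, lower, upper, nbins):
        """Return deterministic expected counts per bin; nothing is sampled."""
        width = (upper - lower) / nbins
        edges = [lower + i * width for i in range(nbins + 1)]
        norm = self.dist.cdf(upper) - self.dist.cdf(lower)
        return [
            self.n_events * (self.dist.cdf(hi) - self.dist.cdf(lo)) / norm
            for lo, hi in zip(edges[:-1], edges[1:])
        ]


sampler = GaussSampler(mu=0.0, sigma=1.0, n_events=500)
toy_a = list(sampler.data)
sampler.resample()
toy_b = list(sampler.data)  # differs from toy_a: toys fluctuate

asimov_a = sampler.make_asimov(-4.0, 4.0, 20)
asimov_b = sampler.make_asimov(-4.0, 4.0, 20)  # identical to asimov_a
```

The point of the sketch is the contrast: two calls to `resample` give different toys, while two calls to `make_asimov` give the same dataset.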

jonas-eschle commented 2 months ago

> I would expect something like model.create_asimov

I think this is a crucial difference between having a nicely named API and having good enough docs: the problem with adding this is that expectations may differ. Should it be binned or unbinned? What is more crucial, I think, is to have somewhere this is explained.

How did you come across "asimov", just to collect a bit of data?

And agreed, to_binneddata as a shortcut wouldn't hurt!