ziatdinovmax / gpax

Gaussian Processes for Experimental Sciences
http://gpax.rtfd.io
MIT License
205 stars · 27 forks

Set noise prior automatically #87

Closed — ziatdinovmax closed this issue 7 months ago

ziatdinovmax commented 8 months ago

We should set the noise prior automatically based on the y-range of the provided training data. Currently, the default prior is LogNormal(0, 1), which may not always be optimal, especially for data with a normalized y-range (it allows scenarios where the noise is almost an order of magnitude larger than the entire observation range). Instead, we can set it automatically as a HalfNormal(0, v) distribution with a default variance of v = 0.2 * (y_max - y_min). As always, a user will also have the option to pass their own custom prior.
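For illustration, here is a minimal sketch of how such a data-driven prior could be constructed. It uses `scipy.stats` rather than gpax's actual internals, the helper name `make_noise_prior` is hypothetical, and `v` is treated as the half-normal scale parameter:

```python
import numpy as np
from scipy.stats import halfnorm

def make_noise_prior(y, frac=0.2):
    """Build a half-normal noise prior whose scale is a fraction of the
    observed y-range (hypothetical helper, not the actual gpax API)."""
    scale = frac * (y.max() - y.min())
    return halfnorm(scale=scale)

y_train = np.array([0.1, 0.4, 0.9, 0.3])
prior = make_noise_prior(y_train)

# With frac=0.2, the full y-range sits 5 scale units out, so virtually
# all prior mass lies below the observation range:
print(prior.cdf(y_train.max() - y_train.min()))
```

The point of tying the scale to the y-range is that the prior adapts to the units of the data instead of assuming a particular normalization.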

Thoughts? @yongtaoliu, @arpanbiswas52, @SergeiVKalinin, @RichardLiuCoding, @aghosh92?

arpanbiswas52 commented 8 months ago

This looks good; especially in the multi-fidelity case, the choice of noise prior seemed more critical. However, a few questions:

  1. Is it possible to make the value 0.2 a learnable parameter in the future, or to tune it via parameter optimization?
  2. How do you think it will handle an outlier in the training data? It would unnecessarily inflate the variance, which would impact the prediction. If we have a big range of y, we won't know whether a training sample is an outlier or a sample from the region of interest (if we maximize), unless we run a few BO iterations. Would it be a good idea to always provide normalized y for training, then? Let me know your thoughts.
ziatdinovmax commented 8 months ago

@arpanbiswas52 - good points.

> is it possible to make the value 0.2 somewhat learnable param in future or say doing parameter optimization?

Note that 0.2 is used (together with the measured data) to set the variance of the prior distribution for noise, and the noise itself is a learnable parameter. So we can think of 0.2 as an 'initial guess.' But in principle, yes, it can be learned by placing a (hyper)prior on it.
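As a prior-predictive sketch of what placing a hyperprior on the 0.2 fraction might look like (plain NumPy sampling, not gpax code; the Beta(2, 8) choice is an arbitrary assumption picked only because its mean is 0.2):

```python
import numpy as np

rng = np.random.default_rng(0)
y_range = 1.0  # assume a normalized y-range

# Hyperprior on the fraction: Beta(2, 8) has mean 2 / (2 + 8) = 0.2
frac = rng.beta(2.0, 8.0, size=100_000)

# Noise prior conditional on the sampled fraction: HalfNormal(frac * y_range)
noise = np.abs(rng.normal(0.0, frac * y_range))

print(frac.mean())  # close to 0.2 by construction
```

In a full model the fraction would be inferred jointly with the other hyperparameters rather than sampled forward like this; the sketch only shows the hierarchy.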

> How do you think it can handle if any outlier is present in training data? It will unnecessarily put a high variance which will impact the prediction. If we have a big range of y, we wont know whether the training sample is the outlier or a sample from region of interest (if we maximize), unless we run few BO iterations. Will it be good idea to always provide normalized y then in training? Let me know your thoughts

Yes, the assumption (for all models) is that the data went through basic preprocessing, with large or unphysical outliers removed. That said, using the suggested half-normal distribution can give us an advantage, since the noise level will be "pushed" toward zero unless the data strongly suggest otherwise.
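A quick comparison with `scipy.stats` illustrates that shrinkage effect, assuming a normalized y-range of 1 and the proposed 0.2 scale:

```python
from scipy.stats import halfnorm, lognorm

# Prior probability that the noise exceeds the entire observation range:
p_lognormal = lognorm(s=1.0).sf(1.0)        # LogNormal(0, 1): exactly 0.5
p_halfnormal = halfnorm(scale=0.2).sf(1.0)  # HalfNormal(0.2): the range is
                                            # 5 scale units out, so ~1e-6

print(p_lognormal, p_halfnormal)
```

Under LogNormal(0, 1), half the prior mass allows noise larger than the whole normalized y-range, while the half-normal prior makes that regime effectively inaccessible unless the likelihood pulls it there.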

ziatdinovmax commented 7 months ago

We're going to keep the default log-normal priors for now.