Hey @patel-zeel! I wasn't aware of GitHub discussions. I've opened the tab. Please feel free to make use of it!
The easiest way to check the parametrisation of a kernel is to print it:
>>> k = 2 * EQ().stretch(4)
>>> k
2 * (EQ() > 4)
You can pick apart this kernel as follows:
>>> k.factor(0) # Variance
2
>>> k.factor(1) # The stretched kernel
EQ() > 4
We can extract the stretch from the kernel as follows:
>>> k.factor(1).stretches
(4,)
There is currently no method implemented for computing the variance or length scale of a general kernel. We can, however, work towards this.
For the variance, we first require that the kernel is stationary:
>>> k.stationary
True
Note that, for complicated kernels, k.stationary may return False even if the kernel is stationary. (You should interpret k.stationary == False as the statement that Stheno could not prove that k is stationary.) In this case, the kernel is stationary, so we can compute the variance by querying it at any point:
>>> k(0)[0, 0]
2
Therefore, we could do something like this:
from stheno import Kernel
from plum import dispatch

@dispatch
def variance(k: Kernel):
    if not k.stationary:
        raise AssertionError("Can only compute the variance of stationary kernels.")
    # For a stationary kernel, the variance is the kernel evaluated at any point.
    return k(0)[0, 0]
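As a quick usage check, running this sketch on the kernel from above gives:
>>> variance(k)
2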
The case for the length scale is slightly more complicated. To begin with, we require a general definition of a length scale. This definition might involve an integral (e.g. something like the integral of k(x) / k(0) from x = 0 to x = infinity), which you could compute using quadrature. Alternatively, you could implement a recursion like this:
import mlkernels
from plum import dispatch

@dispatch
def length_scale(k: mlkernels.EQ):
    # The standard EQ kernel is defined to have length scale one.
    return 1

@dispatch
def length_scale(k: mlkernels.kernels.scaled.ScaledKernel):
    # Scaling the variance does not affect the length scale.
    return length_scale(k[0])

@dispatch
def length_scale(k: mlkernels.kernels.stretched.StretchedKernel):
    if len(k.stretches) != 1:
        raise RuntimeError(
            "Length scale cannot be computed: arguments of the kernel are "
            "scaled separately."
        )
    # Stretching the inputs multiplies the length scale by the stretch.
    return length_scale(k[0]) * k.stretches[0]
>>> length_scale(k)
4
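For completeness, here is a sketch of the quadrature approach mentioned above. It assumes that k(x, 0)[0, 0] evaluates the kernel at lag x (which holds for stationary kernels); note that this particular definition gives sqrt(pi / 2) * 4 ≈ 5.01 rather than 4 for the kernel above, so you might want to normalise differently.

import numpy as np
from scipy.integrate import quad
from stheno import EQ

def integral_length_scale(k):
    # Normalise by the variance and integrate the correlation from 0 to infinity.
    k0 = k(0)[0, 0]
    return quad(lambda x: k(x, 0)[0, 0] / k0, 0, np.inf)[0]

k = 2 * EQ().stretch(4)
print(integral_length_scale(k))  # ~5.01 = sqrt(pi / 2) * 4 for this EQ kernel.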
If this would be a desirable feature, it would certainly be possible to implement it in Stheno.
Thank you @wesselb for the detailed answer. This was really helpful.
I believe that other libraries (GPy, GPflow) make it easy to check kernel parameters, and, as researchers, we often inspect these to interpret the results.
It would be a valuable feature if implemented in Stheno.
No problem at all!
Could you perhaps describe in a little more detail what you have in mind? Would it be helpful to have a function that automatically breaks down a kernel into its various components and describes, e.g., what the variance and length scale of those components are? Would it be possible to sketch how you would like such a breakdown to look?
Pardon my ignorance of the finer details of Stheno, but I was expecting something like the following.
I construct a GP with the EQ kernel.
prior = Measure()
f = GP(EQ(), measure=prior)
Now, knowing that the EQ kernel has two hyperparameters, (i) variance and (ii) length scale (referring to the equation in the Kernel Cookbook), I would try to query them on the kernel as follows:
K = f.kernel
lengthscale = K.lengthscale
variance = K.variance
Various GP implementations in Python (GPy, GPflow, GPyTorch) use very similar methods for accessing the length scale, but the variance is handled differently. For example, GPy and GPflow include the variance as a kernel parameter, but GPyTorch has ScaleKernel(), which is used in combination with any other kernel to learn the variance hyperparameter.
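For reference, this is roughly what that pattern looks like in GPyTorch (a small illustrative sketch; see GPyTorch's documentation for details):

import gpytorch

# In GPyTorch, ScaleKernel wraps a base kernel and learns the variance,
# which GPyTorch calls the "outputscale".
kernel = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
print(kernel.outputscale)              # The variance hyperparameter.
print(kernel.base_kernel.lengthscale)  # The length scale hyperparameter.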
So, if I am getting it correctly, Stheno uses stretching of the kernel to allow learning of the variance. I would find it useful to have a simple tutorial on learning these hyperparameters in Stheno and then computing the posterior (or please point me to the link if one already exists). Also, let me know what you think about the above comments.
No need to pardon at all. :)
Here's a minimal example of how hyperparameters can be learned and extracted. The example uses Varz and TensorFlow, but you're free to use any other optimisation package.
from stheno import EQ, GP
import tensorflow as tf
from varz.tensorflow import Vars, minimise_l_bfgs_b
import lab.tensorflow as B

# Generate some sample data.
x = B.linspace(tf.float64, 0, 10, 1000)
f = GP(EQ())
y = f(x, 0.1).sample()

def model(vs):
    # Varz handles positivity (and other) constraints.
    kernel = vs.positive(name="variance") * EQ().stretch(vs.positive(name="length_scale"))
    noise = vs.positive(name="noise")
    return GP(kernel), noise

def objective(vs):
    f, noise = model(vs)
    return -f(x, noise).logpdf(y)

vs = Vars(tf.float64)

# Perform optimisation.
minimise_l_bfgs_b(objective, vs, trace=True, jit=True)

# Compute the posterior with the learned hyperparameters.
f_prior, noise = model(vs)
f_posterior = f_prior | (f_prior(x, noise), y)

# Print the learned parameters.
vs.print()

# Extract a particular hyperparameter.
print("This is the learned length scale:", vs["length_scale"])
The output could then look as follows:
Minimisation of "objective":
    Iteration 1/1000:
        Time elapsed: 0.9 s
        Time left: 935.0 s
        Objective value: 581.3
    Iteration 6/1000:
        Time elapsed: 2.0 s
        Time left: 181.3 s
        Objective value: 308.3
    Iteration 12/1000:
        Time elapsed: 3.1 s
        Time left: 175.9 s
        Objective value: 304.9
    Iteration 14/1000:
        Time elapsed: 3.4 s
        Time left: 171.4 s
        Objective value: 304.9
    Done!
    Termination message:
        CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
variance:     1.99
length_scale: 1.028
noise:        0.09931
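After training, the posterior can be used for prediction. As a small follow-up sketch (the test inputs x_new here are an arbitrary choice):

# Predict at new inputs with the learned posterior.
x_new = B.linspace(tf.float64, 0, 10, 50)
pred = f_posterior(x_new)
print(pred.mean)  # Predictive mean.
print(pred.var)   # Predictive covariance.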
Using Varz, you can also provide initial values by writing vs.positive(0.1, name="length_scale").
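For example, the model function above could initialise the hyperparameters like this (the particular initial values here are arbitrary choices):

def model(vs):
    # Initial values are given as the first positional argument.
    kernel = vs.positive(2.0, name="variance") * EQ().stretch(
        vs.positive(1.0, name="length_scale")
    )
    noise = vs.positive(0.1, name="noise")
    return GP(kernel), noise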
Stheno takes a slightly different stance from the other GP packages in that it aims to be a "researcher's GP Swiss Army knife" by being very flexible, which comes at the cost of a more manual experience. I fully agree that the manual could be clearer about how hyperparameters can be learned. There is this example, but that's pretty tucked away.
It would definitely be possible to build abstractions on top of Stheno that provide an experience which is more similar to GPy/GPflow/GPyTorch.
As a quick follow-up, the problem with implementing a generic f.kernel.length_scale and f.kernel.variance is that this must work for every kernel, even non-stationary ones, but the length scale and variance only make sense for stationary kernels. Perhaps there is a better way. I'm very open to suggestions!
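To make this concrete, here's a small illustration with the Linear kernel, which is non-stationary:

from stheno import Linear

# The linear kernel k(x, y) = x * y is non-stationary: its marginal
# variance k(x, x) grows with the input, so no single variance exists.
k = Linear()
print(k.stationary)  # False
print(k(1)[0, 0])    # 1
print(k(2)[0, 0])    # 4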
Amazing! Thanks for providing the minimal working code. I got the point about non-generalisable parameters. I believe that the purpose of this thread has been achieved now :)
How to check the GP hyperparameters (kernel lengthscale, kernel variance, likelihood variance)?
A suggestion: Opening the GitHub discussions tab would be useful for such queries.