Normalization of the input sample in KrigingAlgorithm in OT1.16

Hi!
After a discussion with colleagues about the features of OT 1.16, I mentioned an important fact: the normalization of the input sample has been dropped.

This change might have been overlooked by some users, which is why I explain here what the issue is and how we may solve it.

What this change means is that OT 1.15 and previous versions scaled the input sample so that its mean was zero and its standard deviation was equal to 1:

\boldsymbol{z} = \frac{\boldsymbol{x} - \overline{\boldsymbol{x}}}{sd(\boldsymbol{x})}

where \boldsymbol{x} is the input point, \overline{\boldsymbol{x}} is the sample mean and sd(\boldsymbol{x}) is the sample standard deviation.
This computation was performed by default in the constructor of the GeneralLinearModelAlgorithm class, which does the work behind the KrigingAlgorithm class.
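
For reference, this scaling is easy to reproduce by hand. Below is a minimal sketch with numpy (assuming X_train is the input ot.Sample; this is not the actual OT 1.15 implementation):

import numpy as np
import openturns as ot

x = np.array(X_train)  # convert the ot.Sample to a numpy array
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)  # componentwise centering and scaling
Z_train = ot.Sample(z)  # scaled sample: zero mean, unit standard deviation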

That feature had benefits, because an unscaled input sample makes the learning process of the hyperparameters difficult in some cases. More precisely, it makes it difficult to estimate the parameters by maximum likelihood optimization. It also had poor side effects, however: e.g., the scale parameter of the covariance model was related to the scaled input sample, and not to the original sample anymore.

OpenTURNS 1.16 does not scale the input sample: this is now the responsibility of the user. If nothing is done, this may produce poor kriging metamodels, because the scale parameter may be poorly estimated.

The code below presents one possibility to overcome this problem:

import openturns as ot  # X_train, Y_train and myDistribution are assumed to be defined

dimension = myDistribution.getDimension()
basis = ot.ConstantBasisFactory(dimension).build()
covarianceModel = ot.SquaredExponential([1.0] * dimension, [1.0])
covarianceModel.setScale(X_train.getMax())  # Trick A
algo = ot.KrigingAlgorithm(X_train, Y_train, covarianceModel, basis)
scaleOptimizationBounds = ot.Interval(X_train.getMin(), X_train.getMax())
algo.setOptimizationBounds(scaleOptimizationBounds)  # Trick B
algo.run()

There are two tricks, both of which try to take into account potentially very different magnitudes of the components of the input vector X.

  • Trick A sets the scale parameter of the covariance model before creating the kriging algorithm. Indeed, this value is used as the starting point of the optimization algorithm. Hence, it is much easier for the algorithm to find the optimum, because the starting point now has the correct order of magnitude (instead of the default value of the scale parameter of the covariance model, which may be far away from the optimum).
  • Trick B sets the bounds of the optimization algorithm so that they match the minimum and maximum of the sample. Hence, the optimization algorithm searches for the scale parameter within reliable bounds (a quick check is sketched below).
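
As a sanity check after algo.run(), one can verify that the estimated scale lies strictly inside the bounds set by trick B; an estimate stuck at a bound suggests that the optimum was not reached and that the bounds should be widened:

result = algo.getResult()
print("Estimated scale:", result.getCovarianceModel().getScale())
print("Optimization bounds:", scaleOptimizationBounds)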

Using these tricks is not necessary if the input sample already has a scale close to 1. For example, when approximating the function y=\sin(x) for x\in[0,1], they are not needed. They are mandatory, however, in the cantilever beam example of the documentation, where the first parameter is the Young's modulus, whose order of magnitude is 10^9.

Using this particular normalization may work if the initial sample is large enough to be representative of the potential optimal value of the scale parameter. It may not work if the initial sample size is too small, e.g. lower than 10. This might be an issue for those of us who want to sequentially add new points to the design of experiments.

More details and other implementations of scaling are provided at:

Please share any other scaling method that you commonly use, or any other way to initialize the parameters of the covariance model for the optimization. Please correct me if I did not properly explain the scaling issue here. In any case, I guess that this might be interesting for those who are surprised by changes in their kriging metamodels in OT 1.16.

Best regards,

Michaël


Hi!
There were several discussions after this post. One of them was about “Trick B”, which some users found suspicious.

  • Setting the maximum scale bound to the maximum point of the sample was felt to be a very large upper bound. Furthermore, it may be completely wrong if the input domain lies in the negative half-line: the maximum point of the sample is then negative, while the scale parameter must remain positive. Suppose e.g. that the input domain is [-10, -5]; then the maximum point of the sample is necessarily lower than or equal to -5, while the correlation length must be positive.
  • Setting the minimum scale bound to the minimum point of the sample was felt to be wrong, because there is no straightforward relationship between the two quantities. Suppose e.g. that the input domain is [0.5, 1.0]; then the minimum point of the domain is greater than or equal to 0.5, but the correlation length can perfectly be equal to 0.1. Thanks to @ThibaultDelage for pointing this out.

One option may be to compute the minimum pairwise distance between the points of the sample. The minimum and maximum scale bounds can then be set as multiples of this distance (e.g. with factors 0.1 for the minimum bound and 10 for the maximum bound). Thanks to @JPelamatti for pointing this out. This option requires computing the minimum pairwise distance, which can be done with scipy's pdist, but not with OT, as far as I know.
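
As a sketch, this can be done by converting the sample to a numpy array (X_train, dimension and the factors 0.1 and 10 are reused from the snippets above):

import numpy as np
from scipy.spatial.distance import pdist

d_min = np.min(pdist(np.array(X_train)))  # minimum pairwise Euclidean distance
scaleOptimizationBounds = ot.Interval([0.1 * d_min] * dimension, [10.0 * d_min] * dimension)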

The code presented below uses only OpenTURNS's features and includes two significant modifications. The first change is to compute the componentwise range of the sample, assuming a Monte Carlo sample with more than one point. The second change is to compute the bounds as multiples of the range.

# Trick B, v2
x_range = X_train.getMax() - X_train.getMin()
scale_max_factor = 2.0  # Must be > 1, tune this to match your problem
scale_min_factor = 0.1  # Must be < 1, tune this to match your problem
maximum_scale_bounds = scale_max_factor * x_range
minimum_scale_bounds = scale_min_factor * x_range
scaleOptimizationBounds = ot.Interval(minimum_scale_bounds, maximum_scale_bounds)
algo.setOptimizationBounds(scaleOptimizationBounds)

Using these lines of code is not guaranteed to work in all cases. For example, it may fail if the design of experiments is very small or is weighted in some way (e.g. importance sampling).

I share this script in the hope that it can be useful for some users. Please reply to this message if you have counter-examples or other proposals.

Best regards,

Michaël

Hello, and thanks for continuing this discussion.
Just to clarify what is mentioned in the post above, the suggestion I would propose for initializing the Kriging lengthscale bounds is to rely on the minimum and maximum component-wise pairwise distances. We would then obtain something like:

# Trick B, v3
import numpy as np
from scipy.spatial.distance import pdist

scale_max_factor = 2.0  # Must be > 1, tune this to match your problem
scale_min_factor = 0.1  # Must be < 1, tune this to match your problem
minimum_scale_bounds = []
maximum_scale_bounds = []
for d in range(X_train.getDimension()):
    dist = pdist(np.array(X_train.getMarginal(d)))  # pairwise distances of component d
    minimum_scale_bounds.append(scale_min_factor * np.min(dist))
    maximum_scale_bounds.append(scale_max_factor * np.max(dist))
scaleOptimizationBounds = ot.Interval(minimum_scale_bounds, maximum_scale_bounds)

This kind of bound initialization is obviously not foolproof, but it should be more robust when dealing with variables of different natures, defined over considerably different domains.

PS: please note that I did not come up with this solution myself, as similar bound initialization procedures can be found in a few discussions on the topic.


Hi

We had this kind of discussion with @josephmure last December, but unfortunately there was no time to discuss it in a technical committee or to try an experiment in the API.
Now, having this written down, I wonder whether it was really a good idea to “hard” drop the scaling. I also noticed that Persalys reimplements the normalization. So the question is: why did we do it in Persalys and not in OT?

I think we said the implementation was too complex and inconsistent.

Hey, I missed this.

At least some of these tips should be illustrated in the documentation, or a more complete example should highlight the impact of the normalization and the bounds.

Hi! @MichaelBaudin and I conducted a few numerical experiments to test how much the likelihood depends on the design set.
In the first case, the output is a deterministic function (cosine) of the input.
In the second case, the output comes from actual realizations of a Gaussian process with fixed scale parameter \theta=1.2.
In both cases, the design set is a regular grid. We let the number of points (and therefore the distance between points) vary.

import openturns as ot
from openturns.viewer import View

cos = ot.SymbolicFunction('x', 'cos(0.1 * x)')
mat = ot.MaternModel([1.2], 2.5)
a = 0
b = 200


def scale_function(x):
    # Divide the log-likelihood value by the current sample size n (global variable)
    x = ot.Point(x)
    y = x / n
    return y


sf = ot.PythonFunction(1, 1, scale_function)

n_liste = [10, 20, 40, 80, 160, 320, 640]
graph = ot.Graph("Likelihood with data from the cosine function", "theta", "log-likelihood", True)
for n in n_liste:
    delta_x = (b - a) / n
    grid = ot.RegularGrid(a, delta_x, n)
    inp = ot.Sample.BuildFromPoint(grid.getValues())
    out = cos(inp)
    kri = ot.KrigingAlgorithm(inp, out, mat, ot.Basis(0))
    opt = kri.getReducedLogLikelihoodFunction()
    scaled_loglikelihood = ot.ComposedFunction(sf, opt)
    curve = scaled_loglikelihood.draw(0.01, 1000)
    curve.setLegends(["n=%d" % (n)])
    graph.add(curve)
graph.setColors(ot.Drawable.BuildDefaultPalette(len(n_liste)))
graph.setLegendPosition("topright")
View(graph)



n_liste = [10, 20, 40, 80, 160, 320, 640]
graph = ot.Graph("Likelihood with data from a GP realization", "theta", "log-likelihood", True)
for n in n_liste:
    delta_x = (b - a) / n
    grid = ot.RegularGrid(a, delta_x, n)
    gp = ot.GaussianProcess(mat, grid)
    inp = ot.Sample.BuildFromPoint(grid.getValues())
    out = gp.getRealization()
    kri = ot.KrigingAlgorithm(inp, out, mat, ot.Basis(0))
    kri.run()
    theta_opt = kri.getResult().getCovarianceModel().getScale()
    opt = kri.getReducedLogLikelihoodFunction()
    scaled_loglikelihood = ot.ComposedFunction(sf, opt)
    log_likelihood_at_theta_opt = scaled_loglikelihood(theta_opt)
    curve = scaled_loglikelihood.draw(0.01, 20)
    curve.setLegends(["n=%d $\\hat{\\theta}$=%.2f" % (n, theta_opt[0])])
    graph.add(curve)
    cloud = ot.Cloud([theta_opt], [log_likelihood_at_theta_opt])  # mark the MLE
    graph.add(cloud)
graph.setColors(ot.Drawable.BuildDefaultPalette(2 * len(n_liste)))
graph.setLegendPosition("topright")
View(graph)

[Figure: scaled log-likelihood as a function of theta for the cosine data, one curve per sample size n]

[Figure: scaled log-likelihood as a function of theta for the GP realizations; crosses mark the MLE \hat{\theta} for each n]

In the second graph, the crosses indicate the position of the MLE for every likelihood function. The value of the MLE is given as \hat{\theta} in the legend.

One thing is certain: the lower bound of the optimization box of the scale parameter should always be 0. The likelihood function always converges to a nonzero constant when \theta \to 0. To compute the limit, all we need to do is replace the correlation matrix in the likelihood formula with the identity matrix. OpenTURNS currently does not allow the scale parameter to be null, but it probably should in the future.
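
To make the limit explicit, here is a sketch assuming a zero trend and amplitude \sigma (not the exact reduced likelihood optimized by OT). The Gaussian log-likelihood is

\ell(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2}\log\det R(\theta) - \frac{1}{2\sigma^2}\boldsymbol{y}^T R(\theta)^{-1}\boldsymbol{y}

and, since R(\theta) \to I_n when \theta \to 0 (distinct points become uncorrelated and \log\det I_n = 0), it converges to

\lim_{\theta \to 0} \ell(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\boldsymbol{y}^T\boldsymbol{y}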

Following a discussion with @josephmure, we have come to the conclusion that the simplest solution would be to set the lower bound of the lengthscale optimization to ot.SpecFunc.MinScalar. This would make it possible to accept a candidate lengthscale value extremely close to 0, while avoiding numerical issues caused by invalid arguments to the kernel functions (i.e., the undefined form 0/0 obtained with a zero lengthscale when the distance between two points is zero).
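
In code, this suggestion could look like the following sketch (the upper bound is only illustrative, reusing x_range and scale_max_factor from the Trick B v2 snippet above):

lower_bounds = [ot.SpecFunc.MinScalar] * X_train.getDimension()  # near-zero lower bound
upper_bounds = scale_max_factor * x_range
algo.setOptimizationBounds(ot.Interval(lower_bounds, upper_bounds))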

Hi,
Thanks for the analysis. Agreed that the big deal is the lower bound. We discussed with @josephmure some time ago about implementing a more sophisticated strategy (by component), but for the moment we didn't go for it (something like min(ot.SpecFunc.ScalarEpsilon, data.getMarginal(0).getMin()[0] / 10) for each marginal).
By the way, ot.SpecFunc.MinScalar is probably too small here, no? ot.SpecFunc.ScalarEpsilon is the machine epsilon, whereas the former is around 10^{-308}. Did I miss something?
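
For reference, both constants can be printed directly; they correspond to the usual IEEE 754 double-precision values:

import openturns as ot

print(ot.SpecFunc.MinScalar)      # smallest positive normalized double, about 2.2e-308
print(ot.SpecFunc.ScalarEpsilon)  # machine epsilon, about 2.2e-16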