Unrealistic predictions with PCE

Hi all,
Months ago I used the PCE algorithm to emulate the response of a simple FE model.
Nine input parameters were varied and combined in a Latin Hypercube (LH) sample, an FE analysis was run for each point, and two outputs were stored to be predicted by the PCE (the sample has size N=99, as one FEA had to be discarded). Since the FE model has a low computational cost, a validation sample of the same size was built with the same strategy, and its outcomes were used to validate the developed PCE.
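(For context only, the design was of this general form; the sketch below is purely illustrative, with placeholder uniform marginals and bounds rather than the real parameter ranges.)

import openturns as ot

# Illustrative only: the nine marginals and their bounds are placeholders,
# not the actual parameter ranges of the FE study.
dim = 9
marginals = [ot.Uniform(0.0, 1.0) for _ in range(dim)]
distribution = ot.ComposedDistribution(marginals)
# Latin Hypercube design of 100 points, one FE analysis per point
design = ot.LHSExperiment(distribution, 100).generate()
print(design.getSize(), design.getDimension())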

With basic assumptions (e.g., full truncation strategy, least squares integration), I obtained the following PCE vs FEA predictions

The idea now was to resume work on this case study and improve the quality of the results by varying some of the PCE settings, rather than by reducing the number of inputs (after a sensitivity analysis). However, I cannot reproduce the old results with the original (basic) script: the new PCE outcomes are quite strange, with very high residuals and negative accuracy scores (i.e. Q2).
Here are the PCE vs FEA predictions that I am obtaining now with the same script.

I cannot understand what has happened. Beyond the drop in the Q2 scores, the PCE predictions are now far too poor and seem to make no sense with respect to the training data. I have double checked all the data to be sure I am using the same, correct values. I am also obtaining similar results with larger sample sizes and for other case studies.

I'm attaching the data of the two samples (training and validation) and the outputs predicted for the validation sample by the original script (oldPCE_out1 and oldPCE_out2). I can also provide the code used, but it is very basic.

Hope you can help.

input_train.csv (16.3 KB)
input_val.csv (16.3 KB)
output_train.csv (3.7 KB)
output_val.csv (3.7 KB)
oldPCE_out1.txt (2.5 KB)
oldPCE_out2.txt (2.5 KB)

Hi,

It would have been nice to add your script, even if it is basic :wink:
To make things short, this is an example of overfitting, and the change between the two versions of OT may be due to the default values of some algorithms (impossible to check, as you don't mention which two versions of OT you used).
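A quick way to compare the two environments is to print the installed version and the defaults that drive the PCE algorithm; here is a minimal sketch (filtering the ResourceMap keys by prefix is just a convenient guess at what is relevant):

import openturns as ot

print("OpenTURNS version:", ot.__version__)
# Compare this output between the two installations to spot changed defaults
for key in ot.ResourceMap.GetKeys():
    if key.startswith("FunctionalChaosAlgorithm"):
        print(key, "=", ot.ResourceMap.Get(key))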

In order to analyze the problem I used the following script:

import openturns as ot
import openturns.viewer as otv
import openturns.experimental as otexp

# Load the training data
inTrain = ot.Sample.ImportFromTextFile("input_train.csv", ",")
outTrain = ot.Sample.ImportFromTextFile("output_train.csv", ",")
# Load the validation data
inVal = ot.Sample.ImportFromTextFile("input_val.csv", ",")
outVal = ot.Sample.ImportFromTextFile("output_val.csv", ",")
# Recover the input distribution. It should be known, so I use all
# the available information with no shame ;-)
indata = list(inTrain) + list(inVal)
distribution = ot.MetaModelAlgorithm.BuildDistribution(indata)
# Now the polynomial basis
enumFun = ot.LinearEnumerateFunction(distribution.getDimension())
basis = ot.OrthogonalProductPolynomialFactory([ot.LegendreFactory() for i in range(distribution.getDimension())], enumFun)
# Here is the cause of the burden: results look nice with degree 2,
# and awful with degree=3
for deg in [2, 3]:
    print("#"*50)
    print(f"{deg=}")
    basisSize = enumFun.getBasisSizeFromTotalDegree(deg)
    # Use the new, simplified algo
    algo = otexp.LeastSquaresExpansion(inTrain, outTrain, distribution, basis, basisSize, "SVD")
    algo.run()
    # Check the residuals
    print("residuals=", algo.getResult().getResiduals())
    meta = algo.getResult().getMetaModel()
    valid = ot.MetaModelValidation(inVal, outVal, meta)
    graph = valid.drawValidation()
    q2 = valid.computePredictivityFactor()
    q2_0 = q2[0]
    q2_1 = q2[1]
    graph.setTitle(f"{deg=}, {q2_0=}, {q2_1=}")
    view = otv.View(graph)
    view.save("Result_deg_" + str(deg) + ".png")
    view.close()

and the output is:

##################################################
deg=2
residuals= [0.00783466,0.899184]
##################################################
deg=3
residuals= [2.13319e-16,2.27965e-14]

With a residual essentially equal to zero for deg=3, you may be sure that you have overfitting somewhere.
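You can see it coming without running the regression: with the full total-degree truncation, the basis has C(n+p, p) terms for n = 9 inputs, which already exceeds the 99 training points at p = 3, so the least-squares problem becomes underdetermined. A quick check:

from math import comb

n = 9   # number of inputs
N = 99  # training sample size
for p in [2, 3]:
    # number of terms in the full total-degree basis,
    # same as enumFun.getBasisSizeFromTotalDegree(p)
    size = comb(n + p, p)
    print(f"degree {p}: {size} coefficients for {N} training points")
# degree 2:  55 coefficients -> over-determined least-squares fit
# degree 3: 220 coefficients -> under-determined, training residuals collapse to ~0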

The validation graphs are:


and you see the effect of overfitting.

You get the exact same results using the legacy PCE algorithm:

import openturns as ot
import openturns.viewer as otv

# Load the training data
inTrain = ot.Sample.ImportFromTextFile("input_train.csv", ",")
outTrain = ot.Sample.ImportFromTextFile("output_train.csv", ",")
# Load the validation data
inVal = ot.Sample.ImportFromTextFile("input_val.csv", ",")
outVal = ot.Sample.ImportFromTextFile("output_val.csv", ",")
# Now the polynomial basis
enumFun = ot.LinearEnumerateFunction(inTrain.getDimension())
# Here is the cause of the burden: results look nice with degree 2,
# and awful with degree=3
for deg in [2, 3]:
    print("#"*50)
    print(f"{deg=}")
    basisSize = enumFun.getBasisSizeFromTotalDegree(deg)
    # Adapt the basis size
    ot.ResourceMap.SetAsUnsignedInteger("FunctionalChaosAlgorithm-BasisSize", basisSize)
    algo = ot.FunctionalChaosAlgorithm(inTrain, outTrain)
    algo.run()
    # Check the residuals
    print("residuals=", algo.getResult().getResiduals())
    meta = algo.getResult().getMetaModel()
    valid = ot.MetaModelValidation(inVal, outVal, meta)
    graph = valid.drawValidation()
    q2 = valid.computePredictivityFactor()
    q2_0 = q2[0]
    q2_1 = q2[1]
    graph.setTitle(f"{deg=}, {q2_0=:.3f}, {q2_1=:.3f}")
    view = otv.View(graph)
    view.save("Result_old_deg_" + str(deg) + ".png")
    view.close()

Please send us your script and the versions of OT you used if you want more insight into what happened between these two versions.

Cheers

Régis

Hi Régis

Many thanks for the explanation. I see that it is all about the OT version I was using (as expected). I would say it was OT 1.19, but honestly the package was already installed on the notebook provided to me and I simply started using it without paying attention to the version. Now I'm working with the latest OT 1.21.

This is the simple code used to get the outcomes. All choices were left at their defaults (truncation, polynomial basis, integration strategy), and no control was applied to the maximum degree of the polynomials.

import openturns as ot
import openturns.viewer as otv
import pandas as pd

# load the TRAINING dataset of the FEA: input and output

df_TRAIN = pd.read_csv("input_train.csv")
input_TRAIN = ot.Sample(df_TRAIN.values)
df_out_TRAIN = pd.read_csv("output_train.csv")
output_TRAIN = ot.Sample(df_out_TRAIN.values)

# load the VALIDATION dataset of the FEA: input and output

df_VAL = pd.read_csv("input_val.csv")
df_out_VAL = pd.read_csv("output_val.csv")
input_VAL = ot.Sample(df_VAL.values)
output_VAL = ot.Sample(df_out_VAL.values)

# create the PCE with TRAINING dataset

algo_PCE = ot.FunctionalChaosAlgorithm(input_TRAIN, output_TRAIN)
algo_PCE.run()
mm_PCE = algo_PCE.getResult().getMetaModel()

# validate the PCE metamodels

val_100vs100 = ot.MetaModelValidation(input_VAL, output_VAL, mm_PCE)
Q2_100 = val_100vs100.computePredictivityFactor()
graph_100 = val_100vs100.drawValidation()
graph_100.setTitle("Q2=" + str(Q2_100))
view = otv.View(graph_100)

So basically: since I am using the full (or "standard", as per your documentation) truncation scheme, the previous version of OT truncated the expansion at the maximum polynomial degree allowed by the sample size (i.e. max degree p=2 for a training sample of size N=100 and n=9 variables, in this example). OT 1.21 now tries to push this degree further (up to p=3, as you highlighted), so it is advisable to keep explicit control of it, since it can lead to overfitting (as you clearly showed). A sketch of how I would now do that is below.
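Here is how I would keep the degree under explicit control, based on your script (a sketch only, assuming Legendre polynomials and a total degree fixed to 2, with input_TRAIN/output_TRAIN loaded as above):

import openturns as ot

# Sketch: fix the total degree explicitly instead of relying on defaults
distribution = ot.MetaModelAlgorithm.BuildDistribution(input_TRAIN)
dim = distribution.getDimension()
enumFun = ot.LinearEnumerateFunction(dim)
basis = ot.OrthogonalProductPolynomialFactory([ot.LegendreFactory()] * dim, enumFun)
degree = 2
basisSize = enumFun.getBasisSizeFromTotalDegree(degree)
adaptive = ot.FixedStrategy(basis, basisSize)
projection = ot.LeastSquaresStrategy()
algo_PCE = ot.FunctionalChaosAlgorithm(input_TRAIN, output_TRAIN, distribution, adaptive, projection)
algo_PCE.run()
mm_PCE = algo_PCE.getResult().getMetaModel()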

I cannot check your scripts as I'm out of the office today, but it now seems clear to me what was happening. Please let me know if I have said anything wrong.

Thanks

It looks perfectly correct to me. The main change between versions is the ability to control the basis size directly instead of only through a total degree. Apparently, the check of the basis size against the learning sample size was dropped along the way.
A side note: if you know the exact distribution behind the DOE, you gain a lot in accuracy by giving it to the algorithm.
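For instance, if the nine inputs were actually drawn from independent uniforms, something along these lines (the bounds below are placeholders) skips the estimation step entirely:

import openturns as ot

# Placeholder bounds: replace with the actual ranges used in the DOE
marginals = [ot.Uniform(0.0, 1.0) for _ in range(9)]
distribution = ot.ComposedDistribution(marginals)
# Pass the known distribution instead of letting OT infer it from the sample
algo = ot.FunctionalChaosAlgorithm(inTrain, outTrain, distribution)
algo.run()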

Cheers

Régis