Data from CSV file workflow

Hello,

I have a CSV file that was generated from a DOE using the OpenTURNS GUI of Salome Meca. I have done different things with the study in Salome Meca, but I cannot find the option to change and export the validation results of metamodels. I would like to recreate the workflow in a Python script so I can visualize and modify the graphical representations with matplotlib's help. I have gotten to a certain point but lack the necessary knowledge to proceed, so I am asking for an example explaining the overall workflow, from reading a .csv file to the construction of the metamodel and its visualization.

Thank you!

Hi @GPSalachs! Have you tried exporting your study to a Python file from Salome? Anyway, I do not think we have a complete example detailing everything you want to do, but here are a few pieces:

  • The Sample class has an ImportFromTextFile method that should work on CSV files. You may have to fiddle around with the options (separator, header, etc.), but if Salome was able to load your CSV file, this method should also be able to.
  • Once you have loaded your input Sample, you can have a look at the cantilever beam model Kriging or Polynomial Chaos example.
  • At the end of these examples, you will see that an openturns.viewer.View object is created in order to display a graph. This class has a getFigure method that produces the underlying matplotlib.figure.Figure object, which then gives you full control over the graphical representation (see the sketch after this list).
  • Note that OpenTURNS also provides an interface for graphical representations. It is not as powerful as matplotlib but might be a little easier to use if your graphs are not too complicated.
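To make these pieces concrete, here is a minimal sketch, assuming a hypothetical file name doe.csv and that the first column is one of your variables:

import openturns as ot
import openturns.viewer as viewer

# Import the CSV file (adjust the separator/header options as needed)
sample = ot.Sample.ImportFromTextFile("doe.csv", ",")

# Build an OpenTURNS Graph, e.g. a histogram of the first column
graph = ot.HistogramFactory().build(sample.getMarginal(0)).drawPDF()

# Wrap it in a View to retrieve the underlying matplotlib Figure
view = viewer.View(graph)
fig = view.getFigure()  # matplotlib.figure.Figure
fig.suptitle("First variable of the DOE")
fig.savefig("first_variable.png")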

Hello @josephmure!

Thank you very much for the reply to my question. I have fiddled around with the Sample class and was able to import the .csv into my script. Here is the script, so we can perhaps discuss around it:

from __future__ import print_function
import openturns as ot
import pandas as pd
import openturns.viewer as viewer
from matplotlib import pylab as plt

sample = ot.Sample.ImportFromCSVFile('DOE_v1.csv',',')
#print(sample)

data = pd.DataFrame.from_records(sample, columns = sample.getDescription())

input_name = list(data.columns[0:36])
output_name = list(data.columns[36:46])

print('The input values of the data are:',input_name)
print('The output values of the data are:',output_name)

x = data[input_name]
y = data[output_name]
print(x)

Description = sample.getDescription()
#print("The description of the dataset is:",Description)

Size = sample.getSize()
#print("The size of the dataset is:",Size)

Dimension = sample.getDimension()
#print("The dimension of the dataset is:",Dimension)

Mean = sample.computeMean()
#print('The mean value is',Mean)

Variance = sample.computeVariance()
#print('The variance value is',Variance)

In the examples you have given me, I can see that the “model” is defined as a function of the input, g(inputTrain), where the input itself is taken from the distribution of the sample. Here lies my question on the workflow. I have the .csv generated from the design of experiments, and I have a number of inputs and quite a few outputs. Do I have to create a model for each output? How should I define the model to begin with (g = cb.model)?

In the examples, the data need to be artificially generated, which is why a model is necessary. You, on the other hand, do not need to bother with this since you import your data from CSV files, so the relevant part of the examples starts after Xtrain / inputTrain and Ytrain / outputTrain have been defined.

I have several questions/remarks about your script:

  • from __future__ import print_function could be removed. The purpose of this line was to maintain compatibility with Python 2, but since this is no longer possible, we are progressively removing such lines from the examples.
  • Why do you need to convert your Sample to a pandas.DataFrame?
  • If I understand correctly, sample contains 46 columns representing 36 input variables and 10 output variables. You can create x and y with x = sample.getMarginal(range(36)) and y = sample.getMarginal(range(36,46)) (see the snippet below). Note that 36 input dimensions is quite a lot, so you may want to try some dimension reduction method first.
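For instance, assuming the column layout you describe:

import openturns as ot

sample = ot.Sample.ImportFromCSVFile('DOE_v1.csv', ',')

# Split the 46-column Sample into input and output Samples
x = sample.getMarginal(range(36))      # columns 0..35: the 36 inputs
y = sample.getMarginal(range(36, 46))  # columns 36..45: the 10 outputs

print(x.getDescription())  # the column names are preserved
print(y.getDescription())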

By the way, the forum will render your script better if you surround it with ``` (triple backtick, AltGr + 7) marks.

For example, the following text:

```
# Some script
import openturns as ot
```

is rendered as a formatted code block:

# Some script
import openturns as ot
Hello,

thank you very much for the suggestions. Indeed, 36 variables do seem like quite a lot, and they should be reduced by doing a sensitivity analysis to see which ones could be omitted. What I am trying to do is to create a metamodel using functional chaos or Kriging and to “predict” the necessary outputs. As a validation measure, I also want to pass the inputs again to the FEA model from which I generated the DOE initially and confirm that the parameters give an output with the least margin of error.

I have tried that with the scikit-learn package with different regression models, but the difference between the FEA model and the “metamodel” was of the order of 4%, which I think is a lot, so I want to reproduce the passes with OpenTURNS to see if I can get a better response.

I also need to apologise because my script is all over the place. I wanted to use the Sample class initially, then saw that the data can be passed into OT by utilizing pandas, so I also wrote that line and never corrected it.

Thank you very much for all the help! I will post the code I write here again, formatted correctly as you suggested, for further discussion!

Hello,

I have been encountering an error whose message says:

TypeError: InvalidArgumentException : 
In CovarianceModelImplementation::setScale: 
the component 1 of scale is non positive

What does this mean?

Also, in case I want to reuse the created metamodel to ask for outputs given the inputs, how do I do that?

Re: the InvalidArgumentException. It looks like your code tried to set a nonpositive scale (keep in mind that 0 is nonpositive) for the second input dimension of your CovarianceModel (in Python, the second component is numbered 1 because the first is numbered 0).
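For instance, this artificial snippet (not your code, just an illustration) triggers the same exception:

import openturns as ot

cov = ot.SquaredExponential([1.0, 1.0])
# Component 1 (the second scale parameter) is zero, hence nonpositive:
cov.setScale([1.0, 0.0])  # raises InvalidArgumentException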

Hello,

I think I wrote something that does part of what I want for each output. I haven’t figured out yet how to do it without a loop, but I think it is OK with the loop too. Without a calibration of the trend coefficients the results are not so good, although with the optimization of the coefficients the results are similar to the ones I am getting from the OpenTURNS GUI in Salome Meca 2019. What I want to know now is how to perform a calibration. How can I access the inputs and outputs of the metamodel? Is there a function like “predict” of the scikit-learn LinearRegression?

The code so far is posted here; suggestions are really, really welcome!

Thank you!

algo.setOptimizeParameters(False)

So the metamodel I get doesn’t have optimized trend coefficients (?).

The study so far is:

    import openturns as ot
    import openturns.viewer as viewer
    from matplotlib import pylab as plt

    sample = ot.Sample.ImportFromCSVFile('Test.csv',',')
    print(sample)

    for i in range(0,6):
        Test_sample = 100
        x = sample.getMarginal(range(0,36))[0:Test_sample]
        y = sample.getMarginal(range(36 + i,37 + i))[0:Test_sample]
        #print(x,y)

        #histo = ot.HistogramFactory().build(y).drawPDF()
        #view = viewer.View(histo)

        #METAMODEL CREATION
        dim = 36
        basis = ot.LinearBasisFactory(dim).build()
        covarianceModel = ot.SquaredExponential()
        algo = ot.KrigingAlgorithm(x, y, covarianceModel, basis, False)
        #print("Lower and upper bounds of X_train:")
        #print(x.getMin(), x.getMax())
        algo.setOptimizeParameters(False)

        algo.run()

        result = algo.getResult()
        krigingMetamodel = result.getMetaModel()

        #print(result.getTrendCoefficients())

        Covariance_Model = result.getCovarianceModel()
        #print(Covariance_Model)

        #MetaModel Validation
        X_test = sample.getMarginal(range(0,36))[Test_sample:-1]
        Y_test = sample.getMarginal(range(36 + i,37 + i))[Test_sample:-1]

        val = ot.MetaModelValidation(X_test, Y_test, krigingMetamodel)
        Q2 = val.computePredictivityFactor()
        print(Q2)

        #Residuals Distribution
        # r = val.getResidualSample()
        # graph = ot.HistogramFactory().build(r).drawPDF()
        # graph.setXTitle("Residuals (cm)")
        # graph.setTitle("Distribution of the residuals")
        # graph.setLegends([""])
        # view = viewer.View(graph)

        graph = val.drawValidation()
        graph.setTitle("Q2 = %.2f%%" % (100*Q2))
        view = viewer.View(graph)

        plt.show()

Thanks for the detailed script! Can you also share your Test.csv file so I can run it? If it is not confidential and also not too large, you can post it using the forum’s “upload” button. Otherwise I will generate an artificial sample.

Hi @GPSalachs !
I noticed that you used:

algo.setOptimizeParameters(False)

This means that the trend coefficients are not fit to the data, and that the parameters of the covariance model (the variance and the correlation lengths) are left unoptimized.
This is an unexpected choice to me, because you use the SquaredExponential covariance model directly: it seems to me that a good predictivity coefficient cannot be obtained under these circumstances, can it?
Regards,
Michaël

Here is the CSV file containing the data (shared with @GPSalachs’ permission): Test.csv (66.9 KB)

I have tweaked the script a little bit. The most important changes are the following:

  • Your script uses deprecated options from OT 1.15, so I have made changes to ensure compatibility with OT 1.17, which will soon be released.
  • As @MichaelBaudin wrote earlier, setOptimizeParameters(False) means that CovarianceModel-related hyperparameters are not optimized (but trend coefficients still are), so I have removed this line. Moreover, since the optimization is quite complex, I have implemented a multi-start strategy.

With these changes, I get Q2 values larger than 97.5% for all outputs on your data set (see below).

import openturns as ot
import openturns.viewer as viewer

sample = ot.Sample.ImportFromCSVFile('Test.csv',',')

dim = 36
Test_sample = 100
x = sample.getMarginal(range(0,dim))[0:Test_sample]

for i in range(dim, sample.getDimension()):
    y = sample.getMarginal(i)[0:Test_sample]

    # Create basis
    basis = ot.LinearBasisFactory(dim).build()

    # Create the covariance model
    covarianceModel = ot.SquaredExponential([1.] * dim)

    # Create the KrigingAlgorithm and set up hyperparameter optimization
    algo = ot.KrigingAlgorithm(x, y, covarianceModel, basis)

    # Set up bounds for the optimization of the scale hyperparameters
    bounds = ot.Interval([0.01]*dim, 10.0 * x.getMax())
    algo.setOptimizationBounds(bounds)

    # Define a multistart strategy for scale hyperparameter optimization

    # 1) Define a uniform distribution on the optimization domain
    #    as a ComposedDistribution of Uniform distributions
    lower = bounds.getLowerBound()
    upper = bounds.getUpperBound()
    distribution_list = []
    for j in range(dim):
        distribution_list.append(ot.Uniform(lower[j], upper[j]))
    distribution = ot.ComposedDistribution(distribution_list)

    # 2) Generate points from a LowDiscrepancyExperiment emulating the distribution
    lowdiscrepancy = ot.LowDiscrepancyExperiment(ot.SobolSequence(), distribution, 20)
    starting_sample = lowdiscrepancy.generate()

    # 3) Set up the MultiStart algorithm with these points
    ms = ot.MultiStart(ot.TNC(), starting_sample)
    algo.setOptimizationAlgorithm(ms)

    # Run hyperparameter optimization and get the result
    algo.run()
    result = algo.getResult()
    krigingMetamodel = result.getMetaModel()

    # Define the test sample
    X_test = sample.getMarginal(range(0,dim))[Test_sample:]
    Y_test = sample.getMarginal(i)[Test_sample:]

    # Validate the metamodel
    val = ot.MetaModelValidation(X_test, Y_test, krigingMetamodel)
    Q2 = val.computePredictivityFactor()
    graph = val.drawValidation()
    graph.setTitle("Q2 = %.2f%%" % (100*Q2[0]))
    view = viewer.View(graph)

[Validation plots for each output, each titled with its Q2 value]

I thought setOptimizeParameters(False) only deactivated the optimization of covariance model parameters.

Hello,

I have also tweaked the script a little bit. The bounds should have ranges that correspond to the max and min of the dataset, because the dataset was generated from a DOE with the Monte Carlo option in the OpenTURNS GUI of Salome Meca 2019. The nature of the problem should also be confined to those ranges, since they correspond to the ranges found in the Italian Standards. I have also changed the distributions from uniform to normal ones, because I used normal distributions in the definition of the probabilistic model in the study. The results are also attached as images. Is there any bibliography on which distributions should be used when one deals with material parameters? Also, once the metamodel is defined, how can I utilize it for calibration purposes?

import openturns as ot
import openturns.viewer as viewer

sample = ot.Sample.ImportFromCSVFile('Test.csv',',')

dim = 36
Test_sample = 100
x = sample.getMarginal(range(0,dim))[0:Test_sample]

for i in range(dim, sample.getDimension()):
    y = sample.getMarginal(i)[0:Test_sample]

    # Create basis
    basis = ot.LinearBasisFactory(dim).build()

    # Create the covariance model
    covarianceModel = ot.SquaredExponential([1.] * dim)

    # Create the KrigingAlgorithm and set up hyperparameter optimization
    algo = ot.KrigingAlgorithm(x, y, covarianceModel, basis)

    # Set up bounds for the optimization of the scale hyperparameters
    #bounds = ot.Interval([0.01]*dim, 10.0 * x.getMax())
    #algo.setOptimizationBounds(ot.Interval([0.01]*dim, 10.0 * x.getMax()))

    bounds = ot.Interval(x.getMin(), x.getMax())
    algo.setOptimizationBounds(bounds)

    # Define a multistart strategy for scale hyperparameter optimization

    # 1) Define a normal distribution on the optimization domain
    #    as a ComposedDistribution of Normal distributions
    lower = bounds.getLowerBound()
    upper = bounds.getUpperBound()
    distribution_list = []
    for j in range(dim):
        distribution_list.append(ot.Normal(lower[j], upper[j]))
    distribution = ot.ComposedDistribution(distribution_list)

    # 2) Generate points from a LowDiscrepancyExperiment emulating the distribution
    lowdiscrepancy = ot.LowDiscrepancyExperiment(ot.SobolSequence(), distribution, 20)
    starting_sample = lowdiscrepancy.generate()

    # 3) Set up the MultiStart algorithm with these points
    ms = ot.MultiStart(ot.TNC(), starting_sample)
    algo.setOptimizationAlgorithm(ms)

    # Run hyperparameter optimization and get the result
    algo.run()
    result = algo.getResult()
    krigingMetamodel = result.getMetaModel()

    # Define the test sample
    X_test = sample.getMarginal(range(0,dim))[Test_sample:]
    Y_test = sample.getMarginal(i)[Test_sample:]

    # Validate the metamodel
    val = ot.MetaModelValidation(X_test, Y_test, krigingMetamodel)
    Q2 = val.computePredictivityFactor()
    graph = val.drawValidation()
    graph.setTitle("Q2 = %.2f%%" % (100*Q2[0]))
    view = viewer.View(graph)

[Validation plots for each output attached as images]
Optimization of Kriging parameters

A Kriging algorithm needs to optimize the scale parameters of its covariance model (here squared exponential).

If this optimization problem is difficult, it is helpful to use multiple starting points (i.e. multiple candidate values the optimization algorithm will start its search from).

In my version of your code, I decided to distribute them uniformly across the optimization domain (hence the uniform distribution which appeared). Using a normal distribution makes little sense in this context, because it will concentrate rather than spread the points and make the optimization algorithm’s task harder.

There is a trick, though: we do not actually sample from the uniform distribution, because independent samples can very easily form clusters. Here is an example with 50 points on the domain [-1,1] x [-1,1]:

dist = ot.ComposedDistribution([ot.Uniform()] * 2)
dist.getSample(50)

[Scatter plot: 50 independent uniform points, showing clusters and gaps]

Now with a LowDiscrepancyExperiment based on a SobolSequence:

ot.LowDiscrepancyExperiment(ot.SobolSequence(), dist, 50).generate()

[Scatter plot: 50 points from the Sobol sequence, spread evenly across the domain]

Metamodel and calibration

I think you can extract your metamodel with the KrigingResult.getMetaModel() method and then use OpenTURNS calibration classes. I am no expert, but there are a few examples in the doc:

https://openturns.github.io/openturns/1.17/auto_calibration/index.html

There is something I do not understand, though. The metamodel does not depend on parameters; it is simply a function which takes a given input and produces a (deterministic) output. Which parameters do you want to calibrate?

Hello,

it is as you said: the metamodel is a function that takes inputs and produces outputs. I want to minimize a mean squared error objective function computed from the output of the metamodel and in situ measurements. My thought was to create the metamodel and try to do an automatic calibration based on it.

Something like :

import numpy as np

def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

where the predictions are coming from the metamodel and the targets are measured.
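For instance, a minimal sketch of how this could be wired up, reusing krigingMetamodel and the test sample from the script above (Y_test stands in for the measured targets):

import numpy as np

# An ot.Sample converts cleanly to a numpy array
predictions = np.array(krigingMetamodel(X_test))
targets = np.array(Y_test)  # placeholder for the in situ measurements
print("RMSE:", rmse(predictions, targets))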

Could you explain, with regard to your last post, what the LowDiscrepancyExperiment based on a SobolSequence means?

Another question that came to mind is how to get access to the inputs of the function. Is there something similar to the “.predict” method found in scikit-learn?

Hello!
If I correctly understand the issue you’re trying to solve, you want to create a surrogate model of the function y = f(x) in order to find the input values x^* that minimize ||f(x^*) - y_target||^2, where y_target are measured values different from the ones you used to create the surrogate model.

Is that correct?

I don’t think I understand the question. From my limited experience with scikit-learn, the .predict() method is used to compute the output of a model, i.e. to compute y = f(x). But I’m guessing you would like to do the opposite, which is to find the values of the input x corresponding to a given output value y: x = f^-1(y). Unfortunately, to my knowledge there is no way of doing that directly; you would need to solve an optimization problem similar to the one you presented, minimizing the difference between the predicted output and the observation.
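As an illustration, here is a sketch of such an inversion with the OpenTURNS optimization classes. The names krigingMetamodel, x (the training input sample) and y_target (an ot.Point of measured values) are assumed to come from your study:

import openturns as ot

dim = 36  # input dimension, as in the earlier scripts

def misfit(point):
    # Squared distance between the metamodel prediction and the target
    residual = krigingMetamodel(point) - y_target
    return [residual.normSquare()]

objective = ot.PythonFunction(dim, 1, misfit)
problem = ot.OptimizationProblem(objective)
problem.setBounds(ot.Interval(x.getMin(), x.getMax()))

solver = ot.TNC(problem)
solver.setStartingPoint(x.computeMean())
solver.run()
x_star = solver.getResult().getOptimalPoint()  # input minimizing the misfit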

A Sobol sequence is a type of low-discrepancy sequence that creates a Design of Experiments (DOE) which fills the design space (in this case, the hyperparameter space) better than a standard Monte Carlo approach.
You can find more information on the OpenTURNS theory page: Low Discrepancy Sequence — OpenTURNS 1.21 documentation, or alternatively on the Wikipedia page: Sobol sequence - Wikipedia


Hello, yes that is correct!

I expressed what I was trying to elaborate in a wrong manner: I don’t want to do the opposite, I was simply asking how to change the inputs x of y = f(x) in the metamodel. Once I know how to change these, I can define the minimization function and proceed!

Oh, I see. Basically, the metamodel object that you obtain from the Kriging algorithm result is a function, so it can be directly called as krigingMetamodel(input_sample), where input_sample contains one or several points defined in the search space. If we re-use the example provided by @josephmure, we have:

import openturns as ot

sample = ot.Sample.ImportFromCSVFile('Test.csv',',')

dim = 36
Test_sample = 100
x = sample.getMarginal(range(0,dim))[0:Test_sample]

for i in range(dim, sample.getDimension()):
    y = sample.getMarginal(i)[0:Test_sample]

    # Create basis
    basis = ot.LinearBasisFactory(dim).build()

    # Create the covariance model
    covarianceModel = ot.SquaredExponential([1.] * dim)

    # Create the KrigingAlgorithm and set up hyperparameter optimization
    algo = ot.KrigingAlgorithm(x, y, covarianceModel, basis)

    # Set up bounds for the optimization of the scale hyperparameters
    bounds = ot.Interval([0.01]*dim, 10.0 * x.getMax())
    algo.setOptimizationBounds(bounds)

    # Define a multistart strategy for scale hyperparameter optimization

    # 1) Define a uniform distribution on the optimization domain
    #    as a ComposedDistribution of Uniform distributions
    lower = bounds.getLowerBound()
    upper = bounds.getUpperBound()
    distribution_list = []
    for j in range(dim):
        distribution_list.append(ot.Uniform(lower[j], upper[j]))
    distribution = ot.ComposedDistribution(distribution_list)

    # 2) Generate points from a LowDiscrepancyExperiment emulating the distribution
    lowdiscrepancy = ot.LowDiscrepancyExperiment(ot.SobolSequence(), distribution, 20)
    starting_sample = lowdiscrepancy.generate()

    # 3) Set up the MultiStart algorithm with these points
    ms = ot.MultiStart(ot.TNC(), starting_sample)
    algo.setOptimizationAlgorithm(ms)

    # Run hyperparameter optimization and get the result
    algo.run()
    result = algo.getResult()
    krigingMetamodel = result.getMetaModel()

    # Define the test sample
    X_test = sample.getMarginal(range(0,dim))[Test_sample:]
    
    # Here, you can use the surrogate model to predict the modeled function value at any location of the input space, for example:
    Y_predicted = krigingMetamodel(X_test)

Hello,

while trying to define the function for the error minimization, I asked myself whether it is possible to lose the loop: instead of looping over each output, is it possible to create one metamodel for all the outputs? I would like to know, so I can utilize it in comparing the sum of the predicted outputs against the sum of the expected outputs.

Hello,
Yes, the constructor of the KrigingAlgorithm class allows you to provide output samples with a dimension larger than 1, meaning that you could train a single Kriging surrogate model that predicts all of the outputs simultaneously (see the sketch below).
However, please note that by default a single trend function and a single covariance model are trained and used for all of the outputs, which may result in poor modeling performance in case the various outputs have noticeably different behaviors in terms of local or global variability.
I admit that I do not know whether OpenTURNS provides alternative covariance models that would allow you to bypass this issue.
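A minimal sketch of this multi-output variant, reusing names from the earlier scripts (the hyperparameter optimization settings are omitted for brevity):

import openturns as ot

sample = ot.Sample.ImportFromCSVFile('Test.csv', ',')
dim = 36
Test_sample = 100
x = sample.getMarginal(range(0, dim))[0:Test_sample]
# Take all output columns at once: y now has dimension > 1
y = sample.getMarginal(range(dim, sample.getDimension()))[0:Test_sample]

basis = ot.LinearBasisFactory(dim).build()
covarianceModel = ot.SquaredExponential([1.] * dim)
algo = ot.KrigingAlgorithm(x, y, covarianceModel, basis)
algo.run()
krigingMetamodel = algo.getResult().getMetaModel()

# A single call now predicts every output simultaneously
X_test = sample.getMarginal(range(0, dim))[Test_sample:]
Y_predicted = krigingMetamodel(X_test)  # output dimension = number of outputs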