Generate multivariate joint distribution

Dear all,
I am trying to implement a method that generates a new sample of several parameters linked by a multivariate joint distribution.
A bit more detail: from another process I get an n × m matrix, with n being the length of the sample and m the number of parameters. I'd like to generate a new sample that keeps the correlations between the parameters.
I went through several methods and approaches using Cholesky decomposition. I found this implementation in Matlab (lhsgeneral) that seemed pretty relevant for my case, but I am stuck on the mapping of each uniform variable to the probability distribution of each parameter. Two issues: there is no direct Python equivalent of the Matlab icdf(pd, p), with pd being the distribution type the variable should follow, and I don't know the probability distribution of each parameter, since these are given by the external process.

It is while searching for a way to identify which distribution type fits each parameter, in order to transform my new sample into the shape it should have, that I came across OpenTURNS. Maybe my solution is already embedded in one of its functions!
Is it?
Thanks in advance for your help

Hi!
Welcome to this forum!
Let me reformulate your problem: given a sample, you want to generate a new sample having the same probabilistic structure, that is, the same marginals and dependency. Is that a correct description? Do you know the size of the sample and its dimension?

  • The simplest way to proceed would be by bootstrap, using the BootstrapExperiment class documented here. This generates a new sample of the same size as the original one, by randomly picking points from the original sample. This may work if the original sample is large enough.
  • A more sophisticated way would be to fit a distribution. This is step B in the methodology. There are two ways to do that: parametric and non-parametric. There are several examples on this topic in the documentation. The simplest is to use kernel smoothing, which is available in the library, even for a multivariate sample. What you get is a Distribution, which has a getSample method. The other way is to fit marginal distributions and a parametric model for the dependency. There is an example doing this for a marginal. A full example with dependency is available here. A short sketch of both options follows after this list.
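
Here is a minimal sketch of both options (the toy sample below only stands in for your n × m matrix; the names and sizes are mine):

import openturns as ot

# Toy data standing in for the n x m matrix coming from the external process
sample = ot.Normal([0.0, 0.0], [1.0, 2.0], ot.CorrelationMatrix(2, [1.0, 0.5, 0.5, 1.0])).getSample(1000)

# Option 1: bootstrap, i.e. resampling with replacement from the original sample
bootstrap_sample = ot.BootstrapExperiment(sample).generate()

# Option 2: non-parametric fit by kernel smoothing, then sampling of the fitted distribution
fitted = ot.KernelSmoothing().build(sample)
new_sample = fitted.getSample(5000)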

In OpenTURNS, there is no way to create a LHS with a non trivial dependency (if you try, it generates an exception). This is because, as far as I know, we either lose the dependency structure (and keep a LHS) or lose the LHS property (and keep the dependency). One way to proceed, favoring the dependency over the LHS, is to go into a standardized space (say, with uniform marginals), create a LHS there, and transform back to the physical space. The produced sample is not a LHS, but has the correct dependency.
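
Here is a possible sketch of that idea, going through the standard (normal) space of the distribution rather than a uniform one; the marginals, correlation and sample size are only placeholders:

import openturns as ot

# Placeholder for the target joint distribution (marginals + dependency)
marginals = [ot.Normal(0.0, 20.0), ot.Triangular(0.0, 100.0, 150.0)]
R = ot.CorrelationMatrix(2, [1.0, -0.8, -0.8, 1.0])
dist = ot.ComposedDistribution(marginals, ot.NormalCopula(R))

# LHS in the standard space of the distribution (independent standard normal marginals)
dim = dist.getDimension()
standard_dist = ot.ComposedDistribution([ot.Normal(0.0, 1.0)] * dim)
lhs = ot.LHSExperiment(standard_dist, 1000).generate()

# Transform back to the physical space: the sample keeps the dependency of dist,
# but the LHS property is lost
T_inv = dist.getInverseIsoProbabilisticTransformation()
physical_sample = T_inv(lhs)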

Best regards,

Michaël

Hi,

As Michaël pointed out, there is no way to keep the LHS property on correlated samples, in particular using the algorithm implemented in lhsgeneral. From a mathematical point of view, it essentially amounts to sampling a joint distribution with given marginals and a normal copula parameterized with the expected correlation matrix, which will not give you the expected linear correlation.
BTW the comment given by the author of lhsgeneral above the line:

% this transformation preserves the dependency between the variables
uniform_dependent_samples=normcdf(dependent_samples); 

is true in terms of copulas, not in terms of linear correlation!
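
A quick numerical check of this point (the correlation value and sample size are arbitrary): applying the normal CDF component-wise leaves the rank (Spearman) correlation, hence the copula, unchanged, but changes the linear (Pearson) correlation:

import openturns as ot

# Correlated standard normal sample
dim = 2
R = ot.CorrelationMatrix(dim, [1.0, -0.8, -0.8, 1.0])
normal_sample = ot.Normal([0.0] * dim, [1.0] * dim, R).getSample(100000)

# Apply the normal CDF component-wise (the normcdf step of lhsgeneral)
to_uniform = ot.MarginalTransformationEvaluation([ot.Normal(0.0, 1.0)] * dim, [ot.Uniform(0.0, 1.0)] * dim)
uniform_sample = to_uniform(normal_sample)

# The copula, hence the rank correlation, is preserved...
print("Spearman before:", normal_sample.computeSpearmanCorrelation())
print("Spearman after :", uniform_sample.computeSpearmanCorrelation())
# ...but the linear correlation changes
print("Pearson before :", normal_sample.computeLinearCorrelation())
print("Pearson after  :", uniform_sample.computeLinearCorrelation())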

Here is an implementation of the algorithm using OpenTURNS/Python:

import openturns as ot

def lhsgeneral(pd, correlation, n):
    dim = len(pd)
    RStar = correlation
    # Step 1: independent LHS design with uniform marginals
    unifND = [ot.Uniform(0.0, 1.0)]*dim
    normND = [ot.Normal(0.0, 1.0)]*dim
    lhsDOE = ot.LHSExperiment(ot.ComposedDistribution(unifND), n)
    lhs_sample = lhsDOE.generate()
    # Step 2: map the uniform LHS to independent standard normal variables
    uniform_to_normal = ot.MarginalTransformationEvaluation(unifND, normND)
    independent_sample = uniform_to_normal(lhs_sample)
    # Step 3: impose the target correlation RStar on the normal sample using the
    # Cholesky factors of the target and empirical correlation matrices
    R = independent_sample.computeLinearCorrelation()
    P = RStar.computeCholesky()
    Q = R.computeCholesky()
    M = P * Q.solveLinearSystem(ot.IdentityMatrix(dim))
    lin = ot.LinearEvaluation([0.0]*dim, [0.0]*dim, M.transpose())
    dependent_sample = lin(independent_sample)
    # Step 4: map the correlated normal variables to the target marginals pd
    normal_to_arbitrary = ot.MarginalTransformationEvaluation(normND, pd)
    transformed_sample = normal_to_arbitrary(dependent_sample)
    return transformed_sample
    
if __name__ == '__main__':
    pd = [ot.Normal(0.0, 20.0), ot.Triangular(0.0, 100.0, 150.0)]
    correlation = ot.CorrelationMatrix(2, [1.0, -0.8, -0.8, 1.0])
    n = 10000
    correlatedSamples = lhsgeneral(pd, correlation, n)
    R = correlatedSamples.computeLinearCorrelation()
    print("correlatedSamples=\n", correlatedSamples)
    print("R=\n", R)
    dist = ot.ComposedDistribution(pd, ot.NormalCopula(correlation))
    print("dist.R=\n", dist.getCorrelation())

but you should use a well-defined probabilistic distribution and sample it, instead of generating a bunch of samples from a partially defined joint distribution. You may be interested in this article: An innovating analysis of the Nataf transformation from the copula viewpoint | Semantic Scholar, where things are explained in detail.

Cheers

Régis

Hi, Thank you very much to you both. I'll have a look at this, even though I didn't want to define probabilistic distributions from the beginning. But it is indeed maybe the best way to handle the process…
To be frank about the underlying process, it's a kind of coarse Bayesian calibration method that I'm trying to implement: over a few iterative steps, I try to feed as much as I can a final joint distribution for a set of unknown parameters (if you hadn't guessed it already). The model takes some time to run and I definitely don't want to go into MCMC approaches or fit a metamodel. My priority (even if it may sound pretty weird) is not to find the true marginals of each parameter, but to have enough parameter combinations giving good matches to further use the model to forecast other targets. Unless the true (range of) values of the parameters are required, I guess there is no need to feed the final joint distribution (the final posteriors) with all possible combinations.

An updated version. Here I compare, for a target correlation Rtarget:

  1. the actual correlation R of the sample generated by the pseudo-LHS correlated sampling. I see that R is different from Rtarget
  2. the exact correlation dist.R of the underlying normal-copula based joint distribution, where the shape matrix of the normal copula is taken equal to Rtarget. I see that dist.R is different from Rtarget, but closely matches R
  3. the correlation obtained using a sample of the previous distribution. Once again there is a close match with dist.R but not with Rtarget
  4. the shape matrix Rsol I should have used to get dist.R == Rtarget.
  5. the same as 1 but with Rsol instead of Rtarget. This time R is almost equal to Rtarget
  6. the same as 2 but with Rsol instead of Rtarget. Of course dist.R == Rtarget, by definition of Rsol
  7. the same as 3 but with Rsol instead of Rtarget. The correlation is almost equal to Rtarget

So finding Rsol seems to be the solution; unfortunately, it is not always possible to reach a given correlation using a normal copula (or even any copula), depending on the marginal distributions.
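
A classical illustration of this limitation (a textbook example, independent of the script below): with two LogNormal(0, 1) marginals, no copula can push the linear correlation much below about -0.37, so a target like -0.8 is simply out of reach whatever the shape matrix:

import openturns as ot

# Two standard lognormal marginals with an almost countermonotonic normal copula
pd = [ot.LogNormal(0.0, 1.0), ot.LogNormal(0.0, 1.0)]
r = ot.CorrelationMatrix(2, [1.0, -0.999, -0.999, 1.0])
dist = ot.ComposedDistribution(pd, ot.NormalCopula(r))
# The empirical linear correlation stays around -0.37, far from -0.8
print(dist.getSample(1000000).computeLinearCorrelation())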

I also added a fully data-based solution to your problem: how to generate a new sample of a distribution given an initial sample, when the only things to keep are the marginal distributions and the correlation matrix (regardless of the actual copula). It is a 5-line solution!

Here is the code:

import openturns as ot

ot.ResourceMap.SetAsBool("ComposedDistribution-UseGenericCovarianceAlgorithm", True)

def lhsgeneral(pd, correlation, n):
    dim = len(pd)
    RStar = correlation
    unifND = [ot.Uniform(0.0, 1.0)]*dim
    normND = [ot.Normal(0.0, 1.0)]*dim
    lhsDOE = ot.LHSExperiment(ot.ComposedDistribution(unifND), n)
    #lhsDOE = ot.MonteCarloExperiment(ot.ComposedDistribution(unifND), n)
    lhs_sample = lhsDOE.generate()
    uniform_to_normal = ot.MarginalTransformationEvaluation(unifND, normND)
    independent_sample = uniform_to_normal(lhs_sample)
    R = independent_sample.computeLinearCorrelation()
    P = RStar.computeCholesky()
    Q = R.computeCholesky()
    M = P * Q.solveLinearSystem(ot.IdentityMatrix(dim))
    lin = ot.LinearEvaluation([0.0]*dim, [0.0]*dim, M.transpose())
    dependent_sample = lin(independent_sample)
    normal_to_arbitrary = ot.MarginalTransformationEvaluation(normND, pd)
    transformed_sample = normal_to_arbitrary(dependent_sample)
    return transformed_sample
    
def findR(pd, Rtarget):
    # Define the residual between the target correlation and the actual correlation
    def residualFunction(X):
        n = len(X)
        dim = int(0.5 + 0.5 * (1 + 8 * n)**0.5)
        R = ot.CorrelationMatrix(dim)
        index = 0
        for i in range(dim):
            for j in range(i):
                R[i, j] = X[index]
                index += 1
        try:
            distribution = ot.ComposedDistribution(pd, ot.NormalCopula(R))
            Ractual = distribution.getCorrelation()
            residual = ot.Point(n)
            index = 0
            for i in range(dim):
                for j in range(i):
                    residual[index] = Ractual[i, j] - Rtarget[i, j]
                    index += 1
        except:
            residual = ot.Point(n, 1.0)
        return residual

    dim = Rtarget.getDimension()
    n = (dim * (dim - 1)) // 2
    problem = ot.LeastSquaresProblem(ot.PythonFunction(n, n, residualFunction))
    bounds = ot.Interval([-1.0]*n, [1.0]*n)
    problem.setBounds(bounds)
    startingPoint = ot.Point(n)
    index = 0
    for i in range(dim):
        for j in range(i):
            startingPoint[index] = Rtarget[i, j]
            index += 1
    algo = ot.CMinpack(problem)
    algo.setStartingPoint(startingPoint)
    algo.run()
    result = algo.getResult()
    minimizer = result.getOptimalPoint()
    Rsolution = ot.CorrelationMatrix(dim)
    index = 0
    for i in range(dim):
        for j in range(i):
            Rsolution[i, j] = minimizer[index]
            index += 1
    return Rsolution
        

if __name__ == '__main__':
    pd = [ot.Normal(0.0, 20.0), ot.Triangular(0.0, 100.0, 150.0)]
    correlation = ot.CorrelationMatrix(2, [1.0, -0.8, -0.8, 1.0])
    n = 10000000
    ##################################
    # Solution based on a pseudo-LHS #
    ##################################
    correlatedSamples = lhsgeneral(pd, correlation, n)
    R = correlatedSamples.computeLinearCorrelation()
    print("(LHS) R=\n", R)
    #####################################
    # Solution based on a normal copula #
    #####################################
    # With naive correlation
    dist = ot.ComposedDistribution(pd, ot.NormalCopula(correlation))
    print("dist.R=\n", dist.getCorrelation())
    correlatedSamples = dist.getSample(n)
    R = correlatedSamples.computeLinearCorrelation()
    print("(MC) R=\n", R)
    # With adapted correlation
    Rsol = findR(pd, correlation)
    print("Rsol=\n", Rsol)
    correlatedSamples = lhsgeneral(pd, Rsol, n)
    R = correlatedSamples.computeLinearCorrelation()
    print("(LHS) R=\n", R)
    dist = ot.ComposedDistribution(pd, ot.NormalCopula(Rsol))
    print("dist.R=\n", dist.getCorrelation())
    correlatedSamples = dist.getSample(n)
    R = correlatedSamples.computeLinearCorrelation()
    print("(MC) R=\n", R)
    ###################
    # Export the data #
    ###################
    correlatedSamples.exportToCSVFile("data.csv")
    ##################################################################
    # Load the data as if they were generated by an external process #
    ##################################################################
    data = ot.Sample.ImportFromCSVFile("data.csv")
    # Identify the associated normal copula
    copula = ot.NormalCopulaFactory().build(data)
    # Identify the marginal distributions
    pd = [ot.HistogramFactory().build(data.getMarginal(i)) for i in range(data.getDimension())]
    # Build the joint distribution
    dist = ot.ComposedDistribution(pd, copula)
    # Generate a new sample
    correlatedSamples = dist.getSample(n)
    R = correlatedSamples.computeLinearCorrelation()
    print("(MC) R=\n", R)

Here is the result:

(LHS) R=
 [[  1        -0.793146 ]
 [ -0.793146  1        ]]
dist.R=
 [[  1        -0.793182 ]
 [ -0.793182  1        ]]
(MC) R=
 [[  1        -0.793033 ]
 [ -0.793033  1        ]]
Rsol=
 [[  1        -0.806877 ]
 [ -0.806877  1        ]]
(LHS) R=
 [[  1        -0.799999 ]
 [ -0.799999  1        ]]
dist.R=
 [[  1   -0.8 ]
 [ -0.8  1   ]]
(MC) R=
 [[  1        -0.800029 ]
 [ -0.800029  1        ]]
(MC) R=
 [[  1        -0.800066 ]
 [ -0.800066  1        ]]

Cheers

Régis

Thank you so much for all this! I'll take a close look at all of it in the following days!

Hi,
I've used the script to generate new samples from samples produced by an external code, in which I weighted some combinations in order to help convergence. Three accuracy thresholds are considered to build the sample used to generate new ones: 20%, 10% and 5% or less. These are weighted differently to skew the sample towards the best combinations. The new total set is divided into a skewed part obtained with your script and a new part with wider bounds and no correlation (to possibly bring 'fresh' winners into the next iteration). And if I have enough samples at 10% and 5%, I drop the 20% ones, etc. The results are pretty impressive! I will make an attempt to fully describe what OpenTURNS does in this process and will come back to you.
By the way, I used OpenTURNS almost 10 years ago (ibpsa.org); I am so happy to find it pretty useful in this new case!
Many thanks for all this !

Hi,
The link to the paper seems to be broken: could you share the full reference of the paper you wanted to share?
Best regards,
Michaël

Hi,
indeed, the link seems broken… sorry about that. It's the IBPSA congress link, 2013 in Chambéry.
Here is another link (PDF): Sensitivity analyses on the definition of wind driven natural ventilation potential