Hi!
I noticed a surprising behaviour of the binning method used by kernel smoothing. Since there is an easy solution, I thought I could share it here. In my study, the variable of interest is the maximum pressure within a mechanical bearing. A sample of this pressure is generated from the output of a physical model, using simple Monte Carlo sampling. I present below a simplified case study.
Consider for example the LogNormal distribution with parameters μ = 0 and σ = 2.5.
import openturns as ot
import openturns.viewer as otv
# LogNormal distribution with mu = 0 and sigma = 2.5
distribution = ot.LogNormal(0.0, 2.5)
sample = distribution.getSample(5000)
# Default settings: binning is enabled because the sample has more than 250 observations
fitted = ot.KernelSmoothing().build(sample)
graph = fitted.drawPDF()
graph.setLegends(["Fitted"])
curve = distribution.drawPDF()
curve.setLegends(["LN"])
graph.add(curve)
graph.setColors(ot.Drawable.BuildDefaultPalette(2))
view = otv.View(graph)
Figure 1. The PDF of the true LogNormal distribution and the estimated distribution using kernel smoothing with binning.
The fitted PDF shows strange waves, as if the distribution had been discretized. As far as I understand it, there are two reasons for this:
- the sample is relatively large,
- the distribution is relatively skewed.
Since the sample contains more than 250 observations, binning comes into play. The algorithm discretizes the sample onto a regular grid of 1024 bins, gathering the observations that fall into the same bin and weighting each bin by the number of observations it contains. The problem is that the sample maximum is very large (close to 5322 in my run). Hence, the regular grid cannot represent the data accurately enough: most bins are located at large values of X, far away from where the data actually lies (most observations fall in the [0, 30] interval).
The simplest fix is to just disable binning. The library is fast enough to handle a sample of several thousand observations without binning and without being too CPU consuming.
kernel = ot.Epanechnikov()
# Second argument binned = False disables binning
fitted = ot.KernelSmoothing(kernel, False).build(sample)
graph = fitted.drawPDF()
view = otv.View(graph)
Figure 2. The PDF of the estimated distribution using kernel smoothing without binning.
Another method is to increase the number of bins so that the discretization is more accurate. In the next code, I use 10000 bins.
kernel = ot.Epanechnikov()
# binned = True with 10000 bins instead of the default 1024
fitted = ot.KernelSmoothing(kernel, True, 10000).build(sample)
graph = fitted.drawPDF()
view = otv.View(graph)
Perhaps we could use a discretization better suited to the data at hand, e.g. build the binning grid from the quantiles of the sample instead of a regular grid?
Regards,
Michaël