Hi!
I noticed a surprising behaviour of the binning method used by kernel smoothing. Since there is an easy solution, I thought I could share it here. In my study, the variable of interest is the maximum pressure within a mechanical bearing. A sample of this pressure is generated from the output of a physical model, using simple Monte Carlo sampling. I present below a simplified case study.
Consider for example the LogNormal distribution with parameters μ = 0 and σ = 2.5.
import openturns as ot
import openturns.viewer as otv
# LogNormal distribution with mu = 0 and sigma = 2.5
distribution = ot.LogNormal(0.0, 2.5)
sample = distribution.getSample(5000)
# Default settings: binning is enabled because the sample has more than 250 observations
fitted = ot.KernelSmoothing().build(sample)
graph = fitted.drawPDF()
graph.setLegends(["Fitted"])
curve = distribution.drawPDF()
curve.setLegends(["LN"])
graph.add(curve)
graph.setColors(ot.Drawable.BuildDefaultPalette(2))
view = otv.View(graph)
Figure 1. The PDF of the true LogNormal distribution and the estimated distribution using kernel smoothing with binning.
The fitted PDF shows strange waves, as if the distribution had been discretized. As far as I understand it, there are two reasons for this:
- the sample is relatively large,
- the distribution is relatively skewed.
Since the sample contains more than 250 observations, binning comes into play. The algorithm discretizes the sample onto a regular grid of 1024 bins, gathering the observations that fall into the same bin and weighting each bin by the number of observations it contains. The problem is that the sample maximum is very large (close to 5322 in my run). Hence, the regular grid cannot represent the data accurately enough: most bins are located at large values of X, far away from where the data actually lies (most observations fall in the [0, 30] interval).
The simplest fix is to just disable binning. The library is fast enough to handle a sample of several thousand observations without binning and without being too CPU consuming.
kernel = ot.Epanechnikov()
# Second argument binned = False disables binning
fitted = ot.KernelSmoothing(kernel, False).build(sample)
graph = fitted.drawPDF()
view = otv.View(graph)
Figure 2. The PDF of the estimated distribution using kernel smoothing without binning.
Another method is to increase the number of bins so that the discretization is more accurate. In the next code, I use 10000 bins.
kernel = ot.Epanechnikov()
# binned = True with 10000 bins instead of the default 1024
fitted = ot.KernelSmoothing(kernel, True, 10000).build(sample)
graph = fitted.drawPDF()
view = otv.View(graph)
Perhaps we could use a discretization better suited to the data at hand, e.g. build the binning grid from the quantiles of the sample instead of a regular grid?
Regards,
Michaël