SORM vs PythonFunction

Hi all,

A few months ago I had a question about the difference between PythonFunction and SymbolicFunction behavior in the SORM method (see GitHub issue openturns/openturns#1637, "Strange difference between SymbolicFunction and PythonFunction wrappers in SORM method"). The general idea suggested by @regislebrun works for me, but now I have a few additional questions about it:

  1. How should I define epsilon for CenteredFiniteDifferenceGradient if I have a composed distribution whose marginals have standard deviations of very different scales? How should I compute the decreasing multiplier for epsilon (it was 1e-8 in Régis's notebook from the GitHub ticket)? If the multiplier is too small relative to the standard deviation of the marginal, I get the following exception: Error: impossible to compute Breitung SORM probability, one of the curvatures is < -1/beta. beta=5.32124, curvature=-3.47975.

  2. Sometimes I need to make one of the ComposedDistribution marginals constant. I use a Dirac distribution as that marginal, but epsilon=0 is prohibited for FiniteDifferenceGradient, as expected. Maybe there is another way to do this in OT? The "frozen" value still has to be passed to the PythonFunction for the calculation.

Thanks in advance!

Hi !

Welcome to this forum. This is the right place to discuss this topic.

FORM/SORM methods require sufficiently accurate gradients. The SymbolicFunction provides exact gradients, based on symbolic differentiation; this is a very nice feature. For the PythonFunction, the default implementation relies on finite differences. The FiniteDifferenceGradient class has an input argument step which allows you to set the step size:

[screenshot of the FiniteDifferenceGradient documentation, showing the step argument]

Setting the value is not that easy: a step that is too small or too large leads to a wrong gradient, because of the trade-off between the accuracy of the truncation of Taylor's formula and the rounding errors in the evaluation of the function.
Here is a table with optimal step sizes for several finite difference formulas, extracted from the document linked in the P.S. below:

[table attached as an image]

But the example you use has marginals with very different orders of magnitude. I suggest scaling the step size component by component, according to the optimal step size of each input. The FiniteDifferenceStep class allows this, because its input argument is either a float or a sequence (i.e. a list of floats).


I suggest looking at the following class, which seems to scale the step as required:

http://openturns.github.io/openturns/master/user_manual/_generated/openturns.BlendedStep.html

There are currently too few examples on this topic.
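In the meantime, here is a minimal sketch of how this could be wired up on a PythonFunction, assuming the usual pattern of overriding its gradient with setGradient (the model and the numbers are made up for the illustration):

import openturns as ot

# Toy model standing in for the real PythonFunction (made up)
def model(X):
    x1, x2 = X
    return [x1 + 1.0e-6 * x2]

g = ot.PythonFunction(2, 1, model)

# BlendedStep adapts the finite difference step to the magnitude of each
# component of the point, so inputs with very different scales each get
# a step consistent with their own order of magnitude.
step = ot.BlendedStep([1.0e-5] * 2)
g.setGradient(ot.CenteredFiniteDifferenceGradient(step, g.getEvaluation()))

print(g.gradient([1.0, 2.0e6]))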

On the second topic, what you are looking for is the ParametricFunction:

http://openturns.github.io/openturns/master/user_manual/_generated/openturns.ParametricFunction.html

Good news: this one has good examples!
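For completeness, here is a minimal sketch of the idea (the symbolic model and the frozen value are made up for the illustration):

import openturns as ot

# Full model with three inputs; suppose the third one must be held constant.
g = ot.SymbolicFunction(["x1", "x2", "x3"], ["x1 + 2 * x2 + x3"])

# Freeze x3 at its nominal value instead of using a Dirac marginal:
# the indices give the frozen inputs, the point gives their values.
g_frozen = ot.ParametricFunction(g, [2], [10.0])

# g_frozen is now a function of (x1, x2) only.
print(g_frozen([1.0, 2.0]))  # evaluates g(1.0, 2.0, 10.0) = 15.0

The same wrapping works around a PythonFunction, so the frozen value reaches your Python code while the probabilistic model keeps only the random inputs.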

Regards,

Michaël

P.S. The document on finite differences mentioned above: https://forge.scilab.org/index.php/p/docnumder/downloads/834/

Michaël already gave you a lot of information on the choice of the step size. You may have missed the fact that the step size can be set component by component, because I dumbly defined it as a scalar in the notebook. You just have to write

stdev = distribution.getStandardDeviation()

instead of

stdev = distribution.getStandardDeviation()[0]

and the resulting code works in any dimension, adapting the step to each marginal standard deviation.
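Concretely, the pattern is then (my own sketch with a made-up two-dimensional model and distribution, not the actual notebook; the 1e-5 factor is only an illustrative multiplier):

import openturns as ot

# Made-up model and input distribution, for illustration only
def model(X):
    x1, x2 = X
    return [x1 + 1.0e-6 * x2]

g = ot.PythonFunction(2, 1, model)
distribution = ot.ComposedDistribution([ot.Normal(0.0, 1.0), ot.Normal(0.0, 1.0e6)])

# One step per component, proportional to the marginal standard deviation
stdev = distribution.getStandardDeviation()
epsilon = [1.0e-5 * stdev[i] for i in range(distribution.getDimension())]
g.setGradient(ot.CenteredFiniteDifferenceGradient(epsilon, g.getEvaluation()))

print(g.gradient(distribution.getMean()))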
As Michaël mentioned, the optimal choice is quite complex: you have to ensure that x+h is different from x, which may not be the case due to round-off error, so you have to adapt h to the magnitude of x. But the optimal choice for h is given in Proposition 1 of Michaël's document, and it does not explicitly depend on the magnitude of x. It depends instead on the (unknown) value of the second derivative f'' at x and on the approximation error r(x) made when computing the function at x.
When I defined the default values for the step size nearly ten years ago, I made the assumption that the function was scaled in such a way that f and f'' would be of unit order, and r(x) of the order of the machine precision, leading to the values given in Figure 4. You can check them with ResourceMap:

from openturns import ResourceMap
ResourceMap.GetAsScalar("CenteredFiniteDifferenceGradient-DefaultEpsilon")
ResourceMap.GetAsScalar("CenteredFiniteDifferenceHessian-DefaultEpsilon")
ResourceMap.GetAsScalar("NonCenteredFiniteDifferenceGradient-DefaultEpsilon")

which should give you 1e-5, 1e-4 and 1e-7.
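For reference, these orders of magnitude come from the classical truncation/rounding trade-off; this is my own summary of the argument, under the same unit-scaling assumption, with r the absolute error made when evaluating f and \epsilon the machine precision. For the centered difference, the total error behaves like

E(h) \approx \frac{h^2}{6}|f'''(x)| + \frac{r}{h},

which is minimized at

h^\star = \left(\frac{3r}{|f'''(x)|}\right)^{1/3} \approx \epsilon^{1/3} \approx 10^{-5}.

For the non-centered difference, E(h) \approx \frac{h}{2}|f''(x)| + \frac{2r}{h} is minimized at h^\star = 2\sqrt{r/|f''(x)|} \approx \sqrt{\epsilon}, and for the centered second derivative the analogous computation gives h^\star \approx \epsilon^{1/4} \approx 10^{-4}, which is roughly consistent with the defaults above.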
There is no general rule for the empirical choice of the step size. Linking it to the standard deviation is probably the most counterintuitive choice, as the input distribution has nothing to do with the values of f and f'' or the accuracy at which they are computed; and even if the step size has to be chosen such that (x+h)-x=h despite round-off error, it is rather the mean of the distribution that tells you the typical values of x. But in practice it often works, probably because in typical engineering applications a function writes f(x/x_{ref}), where f has unit variations when its argument has unit variations, and because the variability of x is modeled by a distribution centered at x_{ref} with a dispersion of about 10% around the nominal value.
But it is not a general rule. If your function is f(x)=\exp(-x/10^6) and you want to compute its gradient at x=1.4e6 using centered finite differences, then a step of size 1e-5*1e6=10 should be nearly optimal, which is indeed the case:
[figure RelativeErrorExp: relative error on f'(x) for f(x)=\exp(-x/1e6) at x=1.4e6, as a function of h]
You see the typical behavior of the error: smooth to the right of the optimal value, noisy to the left. It suggests that it is better to choose a step that is too large than one that is too small.
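For the record, the location of that minimum can be checked with a few lines (my own quick scan, not the script behind the figure; same function and same point as above):

import numpy as np

def f(x):
    return np.exp(-x / 1e6)

def df(x):
    return -f(x) / 1e6

x = 1.4e6
h_values = np.logspace(-3, 6, 200)
# Relative error of the centered finite difference for each step size
errors = [abs((f(x + h) - f(x - h)) / (2 * h) - df(x)) / abs(df(x)) for h in h_values]
print("best h ~ %.3g" % h_values[int(np.argmin(errors))])  # around 10, as stated above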
If your function is f(x)=\sin(x), then you should use h=1e-5 regardless of the value of x, if you ignore the round-off error in the computation of x+h. For example, at x=1:
[figure RelativeErrorSin1: relative error on f'(x) for f(x)=\sin(x) at x=1, as a function of h]

For the same function, if x=1e4 you get:
[figure RelativeErrorSin1e4: relative error on f'(x) for f(x)=\sin(x) at x=1e4, as a function of h]
and the choice of h=1e-5 still gives you something reasonable, even if you are in the noisy part of the curve. Taking the round-off error in x+h into account, you would be better off choosing h=1e-. But in fact there is a trick (and I forgot to use it in OpenTURNS, shame on me) to reduce the effect of the round-off errors in the computation of x^+=x+h and x^-=x-h: instead of using \frac{f(x^+)-f(x^-)}{2h}, use \frac{f(x^+)-f(x^-)}{x^+-x^-} as an approximation of f'(x), and you get:
[figure RelativeErrorSin1e4trick: relative error on f'(x) for f(x)=\sin(x) at x=1e4, with the trick]
This time the noisy part of the curve is much nicer.

All in all, you still have to choose the step yourself, but at least you are now better informed about the different compromises to make depending on your application. And I will implement the trick ASAP in OpenTURNS.

Cheers

Régis

Hi!
If you are going to improve the library on this topic, I suggest using the algorithm of R. S. Stepleman and N. D. Winarsky:

Stepleman, R. S., and N. D. Winarsky. “Adaptive numerical differentiation.” Mathematics of Computation 33.148 (1979): 1257-1264.

https://www.jstor.org/stable/2006459

The algorithm uses a decreasing sequence of step sizes. It relies on the property that the absolute difference between successive finite difference approximations decreases monotonically as long as the truncation error dominates; the algorithm stops as soon as this monotonicity fails, i.e. when rounding errors take over.

"""
Adaptive numerical differentiation
Authors: R. S. Stepleman and N. D. Winarsky
Journal: Math. Comp. 33 (1979), 1257-1264 
"""
import numpy as np
import openturns as ot
import openturns.viewer as otv

g = ot.SymbolicFunction(["x"], ["exp(-x / 1.e6)"])

def central_finite_difference(g, x, h):
    x1 = x + h
    x2 = x - h
    y1 = g([x1])
    y2 = g([x2])
    # Divide by the realized step x1 - x2 to absorb the rounding error
    # made when computing the shifted points (the "magic trick")
    g_fd = (y1 - y2) / (x1 - x2)
    # g_fd = (y1 - y2) / (2.0 * h)  # standard centered formula
    return g_fd

x = 1.0
number_of_points = 1000
h_array = ot.Sample([[h_value] for h_value in np.logspace(-7.0, 5.0, number_of_points)])
error_array = ot.Sample(number_of_points, 1)
g_gradient = g.gradient([x])  # exact gradient, computed once
for i in range(number_of_points):
    h = h_array[i, 0]
    g_fd = central_finite_difference(g, x, h)
    error_array[i, 0] = abs(g_fd[0] - g_gradient[0, 0])

graph = ot.Graph("Finite difference", "h", "Error", True)
curve = ot.Curve(h_array, error_array)
graph.add(curve)
graph.setLogScale(ot.GraphImplementation.LOGXY)
otv.View(graph)

# Algorithm to detect h*
h0 = 1.e5
h_previous = h0
g_fd_previous = central_finite_difference(g, x, h_previous)
diff_previous = np.inf
for i in range(53):
    h_current = h_previous / 2.0
    g_fd_current = central_finite_difference(g, x, h_current)
    diff_current = abs(g_fd_current[0] - g_fd_previous[0])
    print("i=%d, h=%.4e, |FD(h_current) - FD(h_previous)=%.4e" % (
        i, h_current, diff_current))
    if diff_previous < diff_current:
        print("Stop!")
        break
    g_fd_previous = g_fd_current
    h_previous = h_current
    diff_previous = diff_current

print("Optimum h=", h_current)

It seems to work quite well on the nontrivial exp example you gave:

i=0, h=5.0000e+04, |FD(h_current) - FD(h_previous)|=1.2508e-09
i=1, h=2.5000e+04, |FD(h_current) - FD(h_previous)|=3.1255e-10
i=2, h=1.2500e+04, |FD(h_current) - FD(h_previous)|=7.8128e-11
i=3, h=6.2500e+03, |FD(h_current) - FD(h_previous)|=1.9531e-11
i=4, h=3.1250e+03, |FD(h_current) - FD(h_previous)|=4.8828e-12
i=5, h=1.5625e+03, |FD(h_current) - FD(h_previous)|=1.2207e-12
i=6, h=7.8125e+02, |FD(h_current) - FD(h_previous)|=3.0518e-13
i=7, h=3.9062e+02, |FD(h_current) - FD(h_previous)|=7.6294e-14
i=8, h=1.9531e+02, |FD(h_current) - FD(h_previous)|=1.9074e-14
i=9, h=9.7656e+01, |FD(h_current) - FD(h_previous)|=4.7689e-15
i=10, h=4.8828e+01, |FD(h_current) - FD(h_previous)|=1.1909e-15
i=11, h=2.4414e+01, |FD(h_current) - FD(h_previous)|=2.9786e-16
i=12, h=1.2207e+01, |FD(h_current) - FD(h_previous)|=7.7307e-17
i=13, h=6.1035e+00, |FD(h_current) - FD(h_previous)|=1.8190e-17
i=14, h=3.0518e+00, |FD(h_current) - FD(h_previous)|=9.0950e-18
i=15, h=1.5259e+00, |FD(h_current) - FD(h_previous)|=1.8190e-17
Stop!
Optimum h= 1.52587890625

The paper describes how to choose the initial step size h_0 and the reduction factor \beta. However, I suggest letting the user choose h_0 and using \beta=2 as the default, since the correct order of magnitude is sufficient in many cases. A \beta closer to 1 would allow a finer estimate of h^\star, but the extra evaluations are not worth it in terms of improvement in the accuracy of the gradient, given the cost of the function evaluations. At most 52 / 2 = 26 evaluations should be required in many situations, provided a rough estimate of h_0 is given.

I tried the trick you presented, but was not able to make it work: did I miss something? Why should it work?

Best regards,

Michaël

Thanks all! I’ve got it.

@MichaelBaudin Many thanks for the pointer: I was not aware of this work, definitely worth knowing! I don’t know if it has to be implemented in a systematic manner, as it involves quite a lot of function evaluations and the added accuracy may be of secondary importance in situations where a good enough step is already known and the number of evaluations is highly constrained. We should discuss this point in a dedicated technical committee.
Here is the quick and dirty script I used to illustrate the influence of the rounding error in the computation of x+h and x-h on the accuracy of the finite difference approximation. You have to switch manually between (f,df) and (f1,df1) to explore things.

from openturns import *
from math import *

def f(x):
    return sin(x)

def df(x):
    return cos(x)

def f1(x):
    return exp(-x/1e6)

def df1(x):
    return -f1(x)/1e6  # note: derivative of f1, not f

h = list()
err = list()
x = 1e4  # switch to 1.4e6 when using (f1, df1)
for n in range(64):
    hn = 2**((-n-10)*0.5)
    xp = x+hn
    xm = x-hn
    err.append(abs(1 - (f(xp) - f(xm)) / (xp-xm) / df(x)))
    h.append(hn)

import matplotlib.pyplot as plt

plt.plot(h, err)
plt.grid()
plt.yscale("log")
plt.xscale("log")
plt.xlabel(r"$h$")
plt.ylabel(r"$E$")
#plt.title(r"Relative error on $f'(x)$ for $f(x)=\exp(-x/1e6)$ at $x=1.4e6$")
plt.title(r"Relative error on $f'(x)$ for $f(x)=\sin(x)$ at $x=1e4$ with trick")
plt.show()

The trick is inspired by the application of compensated summation to Euler's method, as presented in N. Higham, "Accuracy and Stability of Numerical Algorithms", 2nd ed., p. 86.
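For context, compensated summation in its classical (Kahan) form looks like the following; this is just the generic scheme, not the Euler-specific variant discussed in Higham's book:

def kahan_sum(values):
    """Sum floats while compensating for the rounding error of each addition."""
    total = 0.0
    compensation = 0.0  # estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y  # what was lost when adding y to total
        total = t
    return total

# Example: the naive sum drifts away from 1e5, the compensated sum stays much closer
print(kahan_sum([0.1] * 10**6), sum([0.1] * 10**6))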

Cheers

Régis

Hi Régis,
I modified your script a bit:

import openturns as ot
import numpy as np
import matplotlib.pyplot as plt

def relative_error(computed, reference):
    # Ref (!):
    # https://nhigham.com/2017/08/14/how-and-how-not-to-compute-a-relative-error/
    return np.abs((computed - reference) / reference)

def f(x):
    return np.sin(x)
#    return np.exp(-x/1e6)

def df(x):
    return np.cos(x)
#    return -f(x)/1e6

def numdiff_no_trick(x, h, f):
    xp = x + h
    xm = x - h
    approximate_diff = (f(xp) - f(xm)) / (2.0 * h)
    return approximate_diff

def numdiff_trick(x, h, f):
    xp = x + h
    xm = x - h
    approximate_diff = (f(xp) - f(xm)) / (xp - xm)
    return approximate_diff


number_of_steps = 500
h = np.logspace(-1, -12, number_of_steps)
err_with_trick = list()
err_with_no_trick = list()
x = 1e4
for n in range(number_of_steps):
    # 1. No trick
    approximate_diff = numdiff_no_trick(x, h[n], f)
    err_with_no_trick.append(relative_error(approximate_diff, df(x)))
    # 2. Trick
    approximate_diff = numdiff_trick(x, h[n], f)
    err_with_trick.append(relative_error(approximate_diff, df(x)))


plt.plot(h, err_with_trick, label="Magic trick")
plt.plot(h, err_with_no_trick, label="No trick")
plt.grid()
plt.yscale("log")
plt.xscale("log")
plt.xlabel(r"$h$")
plt.ylabel(r"$E$")
plt.title(r"Relative error on $f'(x)$ at $x=1e4$ with trick")
plt.legend()
plt.show()

which produces:

[figure: relative error on f'(x) for f(x)=\sin(x) at x=1e4, with and without the trick]

Impressive, indeed, as the optimal rounding error goes approximately from 10^{-8} to 10^{-11} (with a modification of the optimal step). I cannot, however, see any compensated sum. Furthermore, this works for \sin but does not work for the \exp function: shouldn't the magic always operate, or never?

Regards,

Michaël

Hi Michael,

Nice presentation of the trick ;-)! I wrote that it was inspired by compensated summation: the step is recomputed to take into account the rounding errors made when computing the shifted points. In Euler's algorithm, the compensation comes from the fact that many steps are taken, and one can keep track of these rounding errors to compensate for them.
Concerning the effectiveness of the trick, it all depends on the dynamic of the function. I am not an expert on these topics, but what I found empirically is that it never degrades the accuracy and may improve it a lot. The best candidates seem to be functions with bounded derivatives, but even with polynomials (try x**4/1e8) the improvement is noticeable. You are by far more of an expert than me on these topics, and I have no doubt you will find the explanation soon!
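If you want to try that polynomial case with the comparison script posted above, the swapped-in definitions are simply (the evaluation point x is left to your choice):

# Régis's polynomial example, to be substituted for f and df in the script above
def f(x):
    return x**4 / 1e8

def df(x):
    return 4.0 * x**3 / 1e8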

I implemented this trick in OT for gradients, see https://github.com/openturns/openturns/pull/1788

Cheers

Régis

Hi Régis,
The magic trick is presented in:

J. Dumontet and J. Vignes, « Détermination du pas optimal dans le calcul des dérivées sur ordinateur », R.A.I.R.O. Analyse numérique 11.1 (1977), pp. 13-25.

See eq. 14, p. 16.
Regards,
Michaël

I saw it in your (excellent) ISEP course on numerical analysis. It is always nice to see that such a natural trick was considered worth publishing. And no regrets: I was 4 years old at the time ;-)!

Cheers

Régis