Kolmogorov-Smirnov test in OT

Hello,
This might be a naive question, but I was surprised to find that the KS test in OT does not behave deterministically. For a given sample and a given theoretical DistributionFactory object, the resulting p-value is sometimes fairly large and sometimes very close to 0.

Am I missing something here? Or is it just a numerical approximation effect?

thanks,

sanaa

Hi,
You will get a lot of information here on this topic. To make a long story short, when some parameters are estimated from the sample rather than known in advance, the p-value is estimated with a Monte Carlo method, so it changes from one run to the next unless the random seed is fixed. This was Lilliefors's idea for the normal distribution, and it has been extended (and implemented this way) e.g. in the MATLAB statistics toolbox.
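To illustrate why a Monte Carlo p-value moves between runs, here is a minimal sketch in plain Python (standard library only). It is not the OpenTURNS implementation: the function names, the default of 1000 simulations and the restriction to the normal distribution are assumptions made for the example. It relies on the fact that, when the mean and standard deviation are fitted, the null distribution of the KS statistic does not depend on the true parameters, so simulating from a standard normal is enough.

```python
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic_fitted_normal(sample):
    """KS distance between the sample and a normal fitted to that same sample."""
    fitted = NormalDist(fmean(sample), stdev(sample))
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = fitted.cdf(x)
        # Compare the fitted CDF with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def lilliefors_p_value(sample, n_sim=1000, seed=None):
    """Monte Carlo estimate of the p-value when the parameters are estimated."""
    rng = random.Random(seed)  # seed=None gives a different stream on each call
    n = len(sample)
    d_obs = ks_statistic_fitted_normal(sample)
    exceed = 0
    for _ in range(n_sim):
        # Simulate under the null hypothesis and re-fit the parameters each time.
        simulated = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if ks_statistic_fitted_normal(simulated) >= d_obs:
            exceed += 1
    return exceed / n_sim


# Hypothetical data: 50 draws from a normal distribution.
data_rng = random.Random(42)
data = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

p_a = lilliefors_p_value(data, seed=0)
p_b = lilliefors_p_value(data, seed=0)  # same seed, same estimate
p_c = lilliefors_p_value(data)          # unseeded, may differ from p_a
print(p_a, p_b, p_c)
```

With the same seed the two estimates coincide; without a seed the estimate fluctuates with a spread that shrinks as n_sim grows.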

Cheers

Régis

Great! And sorry for not taking the time to check whether the issue had come up before.

No problem ;-)! This platform is here to exchange information, and it is exactly what we do!

Hi!

Please do not feel sorry about asking a question that has already been answered, as the forum is designed for just this purpose. What is unwanted is unkind messages, a category into which your message does not fall!

The doc has a series of examples that present the topic and the issue that arises when parameters are estimated.

The principles are presented here

The theory is described here http://openturns.github.io/openturns/master/theory/data_analysis/kolmogorov_test.html

Note that one of the examples has a bug (the KS statistic is drawn incorrectly), which is identified and fixed here: https://github.com/openturns/openturns/pull/1701

Best regards,

Michaël

PS
Note that there is a third case which is not presented in the doc: the parameters are estimated from the sample, but the user wrongly uses Kolmogorov, which assumes that the parameters are known:

import openturns as ot
# data is the sample; the parameters of dist are estimated from it
dist = ot.NormalFactory().build(data)
test_result = ot.FittingTest.Kolmogorov(data, dist)

This could be used to improve the speed, as presented in https://github.com/openturns/openturns/issues/1061. Denote by p_1 the p-value evaluated assuming that the parameters are known; p_1 is fast to compute. Denote by p_2 the p-value assuming that the parameters are estimated. We always have p_2 < p_1, so if p_1 is already below the significance level, the hypothesis is rejected without having to compute p_2.
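The screening idea can be sketched as follows, again in plain Python rather than with the OpenTURNS API. Everything here is an assumption made for illustration: the function names are hypothetical, only the normal distribution is handled, and p_1 is approximated by the asymptotic Kolmogorov series, which is not necessarily how OpenTURNS computes it. The Monte Carlo loop for p_2 runs only when the cheap p_1 does not already reject.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic_fitted_normal(sample):
    """KS distance between the sample and a normal fitted to that same sample."""
    fitted = NormalDist(fmean(sample), stdev(sample))
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = fitted.cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def kolmogorov_sf(lam, terms=100):
    """Asymptotic survival function of sqrt(n) * D when the parameters are known."""
    if lam < 0.2:
        return 1.0  # the series converges poorly here and the value is essentially 1
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))


def screened_p_values(sample, alpha=0.05, n_sim=1000, seed=None):
    """Return (p1, p2); p2 is None when p1 alone is enough to reject."""
    n = len(sample)
    d_obs = ks_statistic_fitted_normal(sample)
    p1 = kolmogorov_sf(math.sqrt(n) * d_obs)  # cheap, "known parameters" p-value
    if p1 < alpha:
        return p1, None  # p2 < p1 < alpha, so the Monte Carlo loop is skipped
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sim):
        simulated = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if ks_statistic_fitted_normal(simulated) >= d_obs:
            exceed += 1
    return p1, exceed / n_sim


# Hypothetical samples: one normal, one clearly non-normal (exponential).
gen = random.Random(42)
normal_data = [gen.gauss(10.0, 2.0) for _ in range(50)]
skewed_data = [gen.expovariate(1.0) for _ in range(400)]

p1_normal, p2_normal = screened_p_values(normal_data, seed=0)
p1_skewed, p2_skewed = screened_p_values(skewed_data, seed=0)
print(p1_normal, p2_normal)
print(p1_skewed, p2_skewed)
```

The shortcut only works in one direction: a large p_1 says nothing about p_2, so in that case the Monte Carlo estimate is still needed.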