Help with reading in experimental data (text or CSV file)

Greetings – I am struggling with figuring out how to read in a multi-column dataset (no header line), select a particular column of values, and apply extreme value (GEV, GPD) analysis to that column. Even better, I would like to analyze all the columns and detect trends.
Thanks in advance for any pieces/examples of code that will help with this.
Tony Rollett

Hello!
Would it be possible to share a small example or mock-up of the CSV file you’re working on, so that we can provide better answers?
Otherwise, if I understand your question correctly, the first thing to do is to import the CSV file as an ot.Sample object:

import openturns as ot

sample = ot.Sample.ImportFromCSVFile('mycsvfile.csv') # Additional arguments such as the separator character can be provided
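
If you only need one particular column afterwards, a minimal sketch of extracting it (the index 2 below is just for illustration) would be:

column = sample.getMarginal(2)  # third column (0-based index), returned as a 1-d ot.Sample
print(column.getSize(), column.getDimension())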

Then, for example, you can determine the GPD associated to each column in the following way:

GPDList = []
ncolumns = sample.getDimension()

for i in range(ncolumns):
    GPDistribution = ot.GeneralizedParetoFactory().buildAsGeneralizedPareto(sample[:,i])
    GPDList.append(GPDistribution)

You now have a list of GeneralizedParetoDistribution objects that you can access and analyze with all of the associated methods (see GeneralizedPareto — OpenTURNS 1.17 documentation).
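
For instance, you can inspect each fitted distribution as follows (a minimal sketch; getParameter() returns the scale, shape and threshold of the GPD, and computeQuantile() gives you high quantiles):

for i, gpd in enumerate(GPDList):
    print("Column", i, "parameters (sigma, xi, u):", gpd.getParameter())
    print("  99% quantile:", gpd.computeQuantile(0.99)[0])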

I hope this helps a bit,
Cheers

Hi,
You can import your data using any module you want: numpy, pandas, OpenTURNS and probably others. If reading time matters, here is a little benchmark which may help you choose:

import openturns as ot
import pandas as pd
import numpy as np
from time import time

print("Generate data")
size = 1000000
dim = 10
data = ot.Normal(dim).getSample(size)
filename = "data.csv"
data.exportToCSVFile(filename)
# Using numpy
print("Read data (numpy)")
t0 = time()
sample = np.genfromtxt(filename, delimiter=";")
t1 = time()
print("numpy t=", t1 - t0, "s")
# Using pandas
print("Read data (pandas)")
t0 = time()
sample = pd.read_csv(filename, delimiter=";")
t1 = time()
print("numpy t=", t1 - t0, "s")
# Using OT csv
print("Read data (OT csv)")
t0 = time()
sample = ot.Sample.ImportFromCSVFile(filename)
t1 = time()
print("ot csv t=", t1 - t0, "s")
# Using OT txt
print("Read data (OT txt)")
t0 = time()
sample = ot.Sample.ImportFromTextFile(filename, ";")
t1 = time()
print("ot txt t=", t1 - t0, "s")

resulting in:

Generate data
Read data (numpy)
numpy t= 7.1998796463012695 s
Read data (pandas)
pandas t= 1.2764337062835693 s
Read data (OT csv)
ot csv t= 2.9530608654022217 s
Read data (OT txt)
ot txt t= 1.4340307712554932 s

As you see, Pandas does a very good job, and depending on the size (number of rows) and dimension (number of columns), the OT text parser and Pandas share the lead (OT being more efficient for large dimensions). The OT CSV parser is more robust to malformed files, but most of the time it is better to use the OT text parser. You should avoid numpy as it is significantly slower than the other options.

You have an example here:
https://openturns.github.io/openturns/latest/auto_data_analysis/manage_data_and_samples/plot_import_export_sample_csv.html?highlight=csv
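
One detail worth mentioning, since your file has no header line: pandas treats the first row as a header by default, so pass header=None. A minimal sketch of reading such a file with pandas and turning it into an ot.Sample (the file name is just an illustration):

import pandas as pd

df = pd.read_csv("mycsvfile.csv", header=None)  # no header line in the file
sample = ot.Sample(df.values)                   # build an ot.Sample from the 2-d array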

Once you have your data, you can try to fit a GEV to the block maxima:


print("read data")
sample = ot.Sample.ImportFromTextFile("data.csv", ";")

n = sample.getSize()
k = 100
m = n // k

print("split data in chunks of size", m)
print("Build the sample of max(chunk)")
#correct but SLOW!
#splitter = ot.KFoldSplitter(n, m)
#sample_max = ot.Sample([sample.select(ind).getMax() for _, ind in splitter])
sample_max = ot.Sample([sample[j*k:j*k+k].getMax() for j in range(m)])
print("Compare the GEV estimate to the exact distribution for each marginal")
for i in range(sample.getDimension()):
    fit = ot.GeneralizedExtremeValueFactory().build(sample_max[:,i])
    exact = ot.MaximumDistribution(ot.Normal(k))
    graph = exact.drawPDF()
    graph.add(fit.drawPDF())
    graph.setLegends(["exact", "gev"])
    graph.setColors(["red", "blue"])
    graph.setTitle("Marginal " + str(i))
    ot.Show(graph)

and you will see a pretty good fit (thanks to the 1e6 data points). Of course you will have to play with k to get the best fit; here we happen to know the exact distribution.
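
Beyond overlaying the PDFs, one quick check of a given k (just a sketch, on the first marginal only) is a QQ-plot of the block maxima against the fitted GEV, using ot.VisualTest.DrawQQplot:

fit0 = ot.GeneralizedExtremeValueFactory().build(sample_max[:, 0])
graph = ot.VisualTest.DrawQQplot(sample_max[:, 0], fit0)
graph.setTitle("QQ-plot, marginal 0, k = " + str(k))
ot.Show(graph)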
You can also try to fit a GPD:


print("read data")
sample = ot.Sample.ImportFromTextFile("data.csv", ";")

n = sample.getSize()
tau = 0.95
upper_statistics = int(tau * n)

print("Compare the GPD estimate to the exact distribution for each marginal")
for i in range(sample.getDimension()):
    print("Select the points above a threshold")
    sampleI = sample.getMarginal(i).sort()[upper_statistics:-1]
    threshold = sampleI[0, 0]
    fit = ot.GeneralizedParetoFactory().build(sampleI)
    exact = ot.TruncatedDistribution(ot.Normal(), threshold)
    graph = exact.drawPDF()
    graph.add(fit.drawPDF())
    graph.setLegends(["exact", "gpd"])
    graph.setColors(["red", "blue"])
    graph.setTitle("Marginal " + str(i))
    ot.Show(graph)
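
A common complement (again just a sketch, on the first marginal) is to check how stable the fitted parameters are when the threshold, i.e. tau, varies; if the shape parameter stays roughly constant, the threshold is probably high enough:

marginal = sample.getMarginal(0).sort()
for tau in [0.90, 0.95, 0.99]:
    tail = marginal[int(tau * n):-1]
    fit = ot.GeneralizedParetoFactory().build(tail)
    print("tau =", tau, "fitted parameters (sigma, xi, u):", fit.getParameter())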

Concerning the trend analysis, you don’t give enough details for us to help you. Assuming that you are dealing with time series, what is your time grid? Is it regular? Is it the first column of your data?
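
If it turns out that the first column is a regular time grid, a purely illustrative sketch of a linear trend fit on one of the other columns (here column 1, with ot.LinearModelAlgorithm) could look like this:

time_grid = sample.getMarginal(0)  # assumed to be the time grid
values = sample.getMarginal(1)     # the quantity whose trend we want to detect
algo = ot.LinearModelAlgorithm(time_grid, values)
algo.run()
print("intercept and slope:", algo.getResult().getCoefficients())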

Cheers

Régis

Thank you very much indeed to both. I am starting with the first reply (simpler!) and posting the Jupyter notebook and the data file that I am reading in. It almost works!
thanks again
Tony Rollett
PS. I apologize if I should have posted the python code directly.
GEV-fitting-OT.ipynb (6.3 KB)
misorientation.txt (180.4 KB)

Unfortunately things are not that simple! When you perform extreme value analyses, using either GPD or GEV, you have to prepare your data for these analyses, either by selecting the data above a high threshold (GPD) or by taking block maxima (GEV). The code proposed by Julien is very naive as it is valid only if the data are distributed according to a generalized Pareto distribution, which is almost never the case!
So sorry for you but you can’t escape the hard work ;-)! The good news is that I almost did it for you…

BTW as your data are separated by commas, you have to replace “;” by “,” in all the scripts I posted. There is also a limitation of the GeneralizedParetoFactory class, which assumes that your data have already been shifted by the threshold. The full script is then, removing all the graphical stuff:

sample = ot.Sample.ImportFromTextFile("misorientation.txt", ",")

n = sample.getSize()
tau = 0.8 # To be adapted to your case
upper_statistics = int(tau * n)
eps = 1e-5
for i in range(sample.getDimension()):
    sampleI = sample.getMarginal(i).sort()[upper_statistics:-1]
    delta = sampleI.computeRange()[0]
    threshold = sampleI[0, 0] - eps * delta # eps ensures that all the shifted points (sampleI - threshold) are > 0
    try:
        shifted_GPD = ot.GeneralizedParetoFactory().build(sampleI-threshold) # Don't forget to shift the data
        GPD = shifted_GPD + threshold # Don't forget to shift back the distribution
    except:
        print("Your marginal sample", i, "is constant, no way to fit a GPD")

BTW your sample contains constant columns (0 and 38), so you will not be able to fit a GPD on them.
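
A quick way to spot such columns beforehand, using only calls already in the script above:

for i in range(sample.getDimension()):
    if sample.getMarginal(i).computeRange()[0] == 0.0:
        print("Column", i, "is constant, skip it")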

Cheers

Régis