OpenTURNS in real HPC setup

Does anyone have experience running OpenTURNS on an HPC cluster for larger-scale distributed evaluations? For example, using a batch system such as SLURM/PBS together with otwrapy, PyCOMPSs, or a similar library?

Any guidelines on this?


OpenTURNS tries not to provide scheduler- or cluster-specific code.

However, the otwrapy extension can help you build a Python function plugged into the multithreading/distribution backend of your choice, such as multiprocessing, pathos, ipyparallel, joblib, or dask:
http://openturns.github.io/otwrapy/master/
This function can then be used like any other OT function, for example in any of the available DOE, simulation, or sensitivity algorithms.
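For example, something along these lines (a rough sketch; the toy beam formula stands in for your own simulation wrapper):

import openturns as ot
import otwrapy as otw

# any ot.Function will do; here a toy symbolic beam model stands in for your code
model = ot.SymbolicFunction(['E', 'F', 'L', 'I'], ['F*L^3/(3*E*I)'])

# wrap it with the backend of your choice; the result is still an ot.Function
parallel_model = otw.Parallelizer(model, backend='multiprocessing', n_cpus=4)

# so it plugs into any OT algorithm, e.g. evaluating a design of experiments
X = ot.ComposedDistribution([ot.Uniform(2.8e7, 4.8e7), ot.Uniform(1e4, 1e5),
                             ot.Uniform(250.0, 260.0), ot.Uniform(310.0, 450.0)]).getSample(100)
Y = parallel_model(X)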

In practice we found dask simple enough to set up, as it only requires an SSH connection and the required Python modules installed on the remote nodes, possibly via a shared filesystem.

You will still have to write the scheduler submission though :]

j

Many thanks for your answer, @schueller! It seems that I will be taking the dask path, following your advice.

Hi everyone,

Thanks for raising this very practical topic.
I have recently been trying to use otwrapy on an HPC cluster managed by SLURM.

From my understanding, otwrapy offers different backend solutions to distribute a numerical model, but only dask would be compatible with a SLURM-based submission on an HPC cluster. Let us consider the following simple SLURM submission file and the beam example from otwrapy’s documentation deployed on the HPC. Do you know how to set up the otwrapy.Parallelizer class to distribute the calls to the code on different nodes and CPUs?

#!/bin/sh
#SBATCH --ntasks-per-node=36    # number of tasks per node
#SBATCH -N 4                    # number of nodes

srun python wrapper.py

Alternatively, I have been working with the coupling_tools module from OpenTURNS and the GLOST tool (Greedy Launcher Of Small Tasks) developed by the CEA.

Thank you very much,
Elias

You would pass it via the dask_args keyword of Parallelizer, which should be a dict containing a 'workers' entry with the different hostnames/IPs of your 4 nodes. The n_cpus argument should be set to 36 in your case.
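Something like this, assuming your 4 nodes are named node1 to node4 (placeholders) and Wrapper is the beam wrapper from the otwrapy example:

import otwrapy as otw
from wrapper import Wrapper  # the beam wrapper from the otwrapy example

# placeholder hostnames; replace with the nodes of your allocation
dask_args = {
    'scheduler': 'node1',
    'workers': {'node1': 36, 'node2': 36, 'node3': 36, 'node4': 36},
}

model = otw.Parallelizer(Wrapper(), backend='dask', n_cpus=36, dask_args=dask_args)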

Hi Julien,

Thanks for your answer. I am not an expert in HPC, but the way I understand SLURM, it is the one managing the usage of resources. Therefore, I cannot know on which hostnames/IPs my SLURM job will be running before launching it.

Best,
Elias

Hello,

Yes, I think you just have to use the classical joblib or multiprocessing backend and set the number of CPUs equal to the total number of CPUs you have access to. The scheduler should then dispatch the evaluations on the resources.
If that does not work, you can get access to the node names; it may need a few lines of code to make it work.
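For example (a sketch, assuming the 4 x 36 CPU allocation from your submission file and that Wrapper is your own model wrapper):

import otwrapy as otw
from wrapper import Wrapper  # your own model wrapper

# 4 nodes x 36 CPUs = 144 CPUs in the submission above
model = otw.Parallelizer(Wrapper(), backend='multiprocessing', n_cpus=144)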

Regards,
Antoine

Hi Antoine,
So far, I have used multiprocessing, and the job scheduler doesn’t allow me to use it on more than one node, or on more CPUs than one node contains. I don’t know whether multiprocessing/joblib can distribute workers across more than one node.
Getting access to the node name(s) assigned by SLURM can be useful to check whether something is running on them, but I don’t understand how it can help to launch a job on more than one node.
I will keep you posted if I find a solution.
Regards,
Elias

If you know the node names, you can then use dask as mentioned by Julien. You will have to set the node names in the workers key of the dask_args dictionary.
Let’s say the node names are node35, node2, node5, with 36 CPUs per node. The dask_args should be:

dask_args = {'scheduler': 'node35', 'workers': {'node35': 36, 'node2': 36, 'node5': 36}}

For the scheduler, I am not sure if it can be one of the nodes or if it must be the front end.
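To get the node names from inside a running job, something along these lines might work (untested; it relies on the standard SLURM environment variables and scontrol):

import os
import subprocess

# SLURM_JOB_NODELIST holds a compressed list such as "node[2,5,35]";
# "scontrol show hostnames" expands it to one hostname per line
nodelist = os.environ['SLURM_JOB_NODELIST']
hosts = subprocess.run(['scontrol', 'show', 'hostnames', nodelist],
                       capture_output=True, text=True, check=True).stdout.split()

cpus_per_node = int(os.environ.get('SLURM_CPUS_ON_NODE', 36))
dask_args = {'scheduler': hosts[0],
             'workers': {host: cpus_per_node for host in hosts}}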

Hi Antoine,
Thanks for the example. If I had direct control over the HPC, this would work. With a scheduler, I’m afraid that putting the front-end node as the dask scheduler will force the runs onto it, which the admins don’t recommend :stuck_out_tongue:
Best,
Elias

Turns out dask has a dedicated API for SLURM, maybe we could use that:
https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html
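A minimal sketch of what that could look like (the resource values are placeholders, and I have not checked how the resulting client would plug into otwrapy):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# one SLURM job per group of dask workers; values below are placeholders
cluster = SLURMCluster(cores=36, processes=36, memory='64GB', walltime='01:00:00')
cluster.scale(jobs=4)     # submit 4 such jobs, i.e. ask for 4 nodes
client = Client(cluster)  # dask computations now run on those workers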