OpenTURNS in real HPC setup

Does anyone have experience running OpenTURNS on an HPC cluster for larger-scale distributed evaluations? For example, using a batch system such as SLURM/PBS together with otwrapy, PyCOMPSs, or a similar library?

Any guidelines on this?


OpenTURNS tries not to provide scheduler- or cluster-specific code.

However, the otwrapy extension can help you build a Python function plugged into the multithreading/distribution backend of your choice, such as multiprocessing, pathos, ipyparallel, joblib, or dask:
http://openturns.github.io/otwrapy/master/
This function can then be used like any other OT function, for example in any of the available DOE, simulation, or sensitivity algorithms.
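For example, something along these lines (a rough sketch; the toy beam formula stands in for your own simulation wrapper):

import openturns as ot
import otwrapy as otw

# any ot.Function will do; here a toy symbolic beam model stands in for your code
model = ot.SymbolicFunction(['E', 'F', 'L', 'I'], ['F*L^3/(3*E*I)'])

# wrap it with the backend of your choice; the result is still an ot.Function
parallel_model = otw.Parallelizer(model, backend='multiprocessing', n_cpus=4)

# so it plugs into any OT algorithm, e.g. evaluating a design of experiments
X = ot.ComposedDistribution([ot.Uniform(2.8e7, 4.8e7), ot.Uniform(1e4, 1e5),
                             ot.Uniform(250.0, 260.0), ot.Uniform(310.0, 450.0)]).getSample(100)
Y = parallel_model(X)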

In practice we found dask simple enough to set up, as it only requires an SSH connection and the required Python modules installed on the remote nodes, possibly via a shared filesystem.

You will still have to write the scheduler submission though :]

j

Many thanks for your answer, @schueller! It seems that I will be taking the dask path, following your advice.

Hi everyone,

Thanks for raising this very practical topic.
I have recently been trying to use otwrapy on an HPC cluster managed by SLURM.

From my understanding, otwrapy offers different backend solutions to distribute a numerical model, but only dask would be compatible with a SLURM-based submission on an HPC cluster. Let us consider the following simple SLURM submission file and the beam example from otwrapy’s documentation deployed on the HPC. Do you know how to set up the otwrapy.Parallelizer class to distribute the calls to the code on different nodes and CPUs?

#!/bin/sh
#SBATCH --ntasks-per-node=36    # number of tasks per node
#SBATCH -N 4                    # number of nodes

srun python wrapper.py

Alternatively, I have been working with the coupling_tools module from OpenTURNS and the GLOST tool (Greedy Launcher Of Small Tasks) developed by the CEA.

Thank you very much,
Elias

You would pass it via the dask_args keyword of Parallelizer, which should be a dict containing a 'workers' entry with the different hostnames/IPs of your 4 nodes. The n_cpus argument should be set to 36 in your case.
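Something like this, assuming your 4 nodes are named node1 to node4 (placeholders) and Wrapper is the beam wrapper from the otwrapy example:

import otwrapy as otw
from wrapper import Wrapper  # the beam wrapper from the otwrapy example

# placeholder hostnames; replace with the nodes of your allocation
dask_args = {
    'scheduler': 'node1',
    'workers': {'node1': 36, 'node2': 36, 'node3': 36, 'node4': 36},
}

model = otw.Parallelizer(Wrapper(), backend='dask', n_cpus=36, dask_args=dask_args)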

Hi Julien,

Thanks for your answer. I am not an expert in HPC, but the way I understand SLURM, it is the one managing the usage of resources. Therefore, I cannot know on which hostnames/IPs my SLURM job will be running before launching it.

Best,
Elias

Hello,

Yes, I think you just have to use the classical joblib or multiprocessing backend and set the number of CPUs equal to the total number of CPUs you have access to. The scheduler should then dispatch the evaluations on the resources.
If that does not work, you can get access to the node names; it may need a few lines of code to make it work.
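For example (a sketch, assuming the 4 x 36 CPU allocation from your submission file and that Wrapper is your own model wrapper):

import otwrapy as otw
from wrapper import Wrapper  # your own model wrapper

# 4 nodes x 36 CPUs = 144 CPUs in the submission above
model = otw.Parallelizer(Wrapper(), backend='multiprocessing', n_cpus=144)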

Regards,
Antoine

Hi Antoine,
So far, I have used multiprocessing, and the job scheduler doesn’t allow me to use it on more than one node, or on more CPUs than one node contains. I don’t know whether multiprocessing/joblib can distribute workers across more than one node.
Getting access to the node name(s) assigned by SLURM can be useful to check whether something is running on them, but I don’t understand how it can help to launch a job on more than one node.
I will keep you posted if I find a solution.
Regards,
Elias

If you know the node names, you can then use dask as mentioned by Julien. You will have to set the node names in the workers key of the dask_args dictionary.
Let’s say the node names are node35, node2, node5, with 36 CPUs per node. The dask_args should be:

dask_args = {'scheduler': 'node35', 'workers': {'node35': 36, 'node2': 36, 'node5': 36}}

For the scheduler, I am not sure if it can be one of the nodes or if it must be the front end.
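To get the node names from inside a running job, something along these lines might work (untested; it relies on the standard SLURM environment variables and scontrol):

import os
import subprocess

# SLURM_JOB_NODELIST holds a compressed list such as "node[2,5,35]";
# "scontrol show hostnames" expands it to one hostname per line
nodelist = os.environ['SLURM_JOB_NODELIST']
hosts = subprocess.run(['scontrol', 'show', 'hostnames', nodelist],
                       capture_output=True, text=True, check=True).stdout.split()

cpus_per_node = int(os.environ.get('SLURM_CPUS_ON_NODE', 36))
dask_args = {'scheduler': hosts[0],
             'workers': {host: cpus_per_node for host in hosts}}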

Hi Antoine,
Thanks for the example. If I had direct control over the HPC, this would work. With a scheduler, I’m afraid that putting the front-end node as the dask scheduler will force the runs onto it, which the admins don’t recommend :stuck_out_tongue:
Best,
Elias

Turns out dask has a dedicated API for SLURM, maybe we could use that:
https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html
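A minimal sketch of what that could look like (the resource values are placeholders, and I have not checked how the resulting client would plug into otwrapy):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# one SLURM job per group of dask workers; values below are placeholders
cluster = SLURMCluster(cores=36, processes=36, memory='64GB', walltime='01:00:00')
cluster.scale(jobs=4)     # submit 4 such jobs, i.e. ask for 4 nodes
client = Client(cluster)  # dask computations now run on those workers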