Code Saturne cluster installation

Pablo
Posts: 49
Joined: Thu Sep 04, 2014 11:31 am

Code Saturne cluster installation

Post by Pablo »

Hello everyone:

At our company we have been using CS in our development and engineering stages with such promising results that we are about to launch a small cluster in the R&D department, fully dedicated to CFD simulations with CS for several users.
On the other hand, since we have been developing two PhD projects for a few years in collaboration with two universities and the work has progressed enough, we are about to begin some tests at the University of Cádiz supercomputing department.

The point is that both clusters are (or are going to be) based on Red Hat distributions (Rocks, at first, for the future company cluster), so I would like to gather some experience with CS installations in these environments (if any), or find out whether a distribution expressly developed for CS cluster installations already exists.


Thanks in advance.
Yvan Fournier
Posts: 4070
Joined: Mon Feb 20, 2012 3:25 pm

Re: Code Saturne cluster installation

Post by Yvan Fournier »

Hello,

There is no specific distribution for Code_Saturne on clusters, but the installation procedure tries to be "cluster-friendly."

On a cluster, installation is very similar to that on a workstation, with the addition of a "post-install" step, detailed in the installation manual (the post-install step is already necessary for conjugate heat transfer with Syrthes).

A few important points for clusters:

- Code_Saturne makes heavy use of small collective operations (MPI_Allreduce) to compute dot products in linear solvers. This means network latency is just as important as bandwidth, so performance with InfiniBand or more specialized high-performance networks is good, while performance with gigabit Ethernet is probably not so good (a small illustration follows this list).

- Use the version(s) of MPI recommended by the vendor, or configured with the drivers for the high-speed network. Code_Saturne should be compatible with most MPI libraries, so choose the one that should give the best performance.

- If you have the choice between different batch/resource management systems, please use a decent, modern system like SLURM, and avoid Sun Grid Engine or its descendants. The post-install phase will be much simpler, and you will be in a "tested" configuration.
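
To give a feel for the first point, here is a minimal stand-alone sketch of such a distributed dot product. It uses mpi4py purely for illustration (the actual Code_Saturne solvers are written in C, so this is not Code_Saturne code); the point is that the payload of the Allreduce is a single scalar, so its cost is dominated by latency rather than bandwidth.

Code:

# Illustration only (mpi4py, not Code_Saturne code): a distributed dot product
# of the kind used inside Krylov solvers. Each rank reduces a single scalar,
# so network latency dominates the cost of the collective.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank owns a local chunk of the two vectors (random data here).
n_local = 100000
x = np.random.rand(n_local)
y = np.random.rand(n_local)

local_dot = np.dot(x, y)                             # local partial sum
global_dot = comm.allreduce(local_dot, op=MPI.SUM)   # one small MPI_Allreduce

if rank == 0:
    print("global dot product:", global_dot)

Run with something like "mpirun -np 4 python dot_product.py"; every solver iteration pays the latency of such a collective.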

Regards,

Yvan
Pablo
Posts: 49
Joined: Thu Sep 04, 2014 11:31 am

Re: Code Saturne cluster installation

Post by Pablo »

Hello Yvan:

Some questions about cluster distribution:

- Any preference for a Linux cluster distribution?
- Do you have references for clustering with Rocks (http://www.rocksclusters.org/wordpress/)?
- Is there any issue with Red Hat-based cluster distributions?

Kind regards.
Yvan Fournier
Posts: 4070
Joined: Mon Feb 20, 2012 3:25 pm

Re: Code Saturne cluster installation

Post by Yvan Fournier »

Hello Pablo.

No to all three (no preference, no recent references, and no known issues with Red Hat or CentOS).

Regards,

Yvan
zeph67
Posts: 52
Joined: Tue Oct 23, 2012 5:54 pm

Re: Code Saturne cluster installation

Post by zeph67 »

Hello everyone,

I'm not quite sure that my problem is an installation issue, but I guess it is.

I recently installed Code_Saturne 3.0.7 on the Mésocentre of Aix-Marseille University, which uses OAR.
I did everything mentioned in the installation guide, including the post-install step.

When I launch a parallel job, the resource parameters (number of nodes, cores) I request in runcase are simply ignored, and only the case.n_procs variable in cs_user_scripts.py is taken into account:
- If I specify nothing (i.e. if I leave #case.n_procs = None commented out), a single-processor calculation is launched, no matter what I request in runcase.
- If I uncomment case.n_procs but leave it as None (i.e. case.n_procs = None), the launch is aborted with an error message like "case.n_procs cannot be None".
- If I give case.n_procs a number (e.g. case.n_procs = 4) in cs_user_scripts.py, then the calculation is launched with the correct number of cores, BUT the bigger n_procs is, the slower the calculation runs (see the excerpt after this list).
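
For reference, here is roughly what the relevant lines of my cs_user_scripts.py look like (an excerpt only, not a complete file; only the case.n_procs assignment matters here):

Code:

# Excerpt from cs_user_scripts.py (sketch only); the three situations above
# correspond to the three states of this single line.

# 1) Left commented out: the job falls back to a single process.
#case.n_procs = None

# 2) Uncommented but still None: the launch aborts
#    ("case.n_procs cannot be None").
#case.n_procs = None

# 3) Set to an explicit value: the run uses that many MPI ranks.
case.n_procs = 4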

Some further information:
- I use Open MPI 1.10.1.
- I checked the OAR batch syntax several times; I'm sure it's OK.

So, as I wrote at the top of this message, I guess this is all an installation issue; I must have missed something during that step. Any clue or idea?

Thanks in advance.
Yvan Fournier
Posts: 4070
Joined: Mon Feb 20, 2012 3:25 pm

Re: Code Saturne cluster installation

Post by Yvan Fournier »

Hello,

Yes, this is probably due to OAR not being supported by Code_Saturne.

For supported batch systems, the environment variables related to known batch systems are handled in bin/cs_exec_environment.py (which should be relatively easy to extend to a different batch system once you have its documentation). The GUI also has equivalent code to handle the runcase, but that is more of a "comfort" feature than a necessity.

If you set case.n_procs in cs_user_scripts.py, you force the correct number of cores, but you still have to make sure the run is distributed correctly (which may be automatic if your MPI library handles OAR, or may require defining hostfiles otherwise).
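
As a very rough sketch of the kind of extension I mean (this is not the actual Code_Saturne code, and I am writing the OAR variable names from memory, so check them against the OAR documentation), the detection logic could look like this:

Code:

# Sketch only, not the actual cs_exec_environment.py code: detect an OAR
# allocation from its environment variables and build a host list.
# OAR_JOB_ID and OAR_NODEFILE are assumed names; verify them in the OAR
# documentation.
import os

def detect_oar():
    """Return (job_id, hosts) when running under OAR, (None, None) otherwise."""
    job_id = os.getenv('OAR_JOB_ID')
    nodefile = os.getenv('OAR_NODEFILE')
    if not job_id or not nodefile:
        return None, None
    hosts = []
    with open(nodefile) as f:
        for line in f:
            host = line.strip()
            # OAR typically lists one line per allocated core, so deduplicate.
            if host and host not in hosts:
                hosts.append(host)
    return job_id, hosts

The host list can then be used to build a hostfile or machinefile for mpiexec if the MPI library does not detect OAR by itself.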

What size mesh are you using? Which version of MPI are you using on the cluster? Which type of fast network does it use?

Regards,

Yvan
zeph67
Posts: 52
Joined: Tue Oct 23, 2012 5:54 pm

Re: Code Saturne cluster installation

Post by zeph67 »

Yvan Fournier wrote: Hello,
For supported batch systems, the environment variables related to known batch systems are handled in bin/cs_exec_environment.py (which should be relatively easy to extend to a different batch system once you have its documentation).
Great advice, I'll have a look at it and let you know.
Yvan Fournier wrote: What size mesh are you using?
820,000 cells.
Yvan Fournier wrote: Which version of MPI are you using on the cluster?
Open MPI 1.10.1
Yvan Fournier wrote: Which type of fast network does it use?
InfiniBand.

Thanks a lot !
Yvan Fournier
Posts: 4070
Joined: Mon Feb 20, 2012 3:25 pm

Re: Code Saturne cluster installation

Post by Yvan Fournier »

Hello,

With x86_64/InfiniBand clusters, we usually have reasonably good scaling at least down to about 30,000 cells per MPI rank for small meshes, and 50,000 cells per rank for very large meshes on hundreds of nodes, so a mesh of almost one million cells should provide good scaling at least up to 20 ranks or so (unless your setup includes additional I/O, such as frequent output of many files, or similar "unusual" settings).
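
As a quick back-of-the-envelope check for your mesh, using nothing more than the rule of thumb above:

Code:

# Rough estimate from the ~30,000 cells per rank rule of thumb above.
n_cells = 820000
cells_per_rank = 30000
print(n_cells / float(cells_per_rank))   # ~27 ranks before scaling tails off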

So (at least if you are using the "general" physical modeling of the code), you might have an installation issue, or Open MPI might not be tuned well for this type of application (default settings are usually fine, and most times I have tried, modifying Open MPI mca parameters or their equivalents for other MPI distributions did not lead to better performance, but it can be done if necessary). At least on our own clusters, with Mellanox InfiniBand, making sure the Open MPI library is built with support for specific hardware features (FCA/MXM, if my memory is correct) did have a major impact. I am not a specialist in each vendor's recommended tuning, but this is a good reason to use the vendor-recommended MPI libraries or builds.

Regards,

Yvan
zeph67
Posts: 52
Joined: Tue Oct 23, 2012 5:54 pm

Re: Code Saturne cluster installation

Post by zeph67 »

Dear Yvan,

For the purpose of helping other potential users of OAR-based clusters, I am attaching here the "patched" cs_exec_environment.py that you helped me modify.

Again, thank you very much.
Attachments
cs_exec_environment.py
Modified cs_exec_environment.py, accounting for the OAR resource manager.
(54.25 KiB) Downloaded 504 times
Yvan Fournier
Posts: 4070
Joined: Mon Feb 20, 2012 3:25 pm

Re: Code Saturne cluster installation

Post by Yvan Fournier »

Hello,

Thanks for the attachment. Note for other users: this version of cs_exec_environment.py is for Code_Saturne 3.0. I have integrated this patch in all currently maintained branches (4.2, 4.0, 3.0), so it will be included in the next bugfix/porting releases, which we will try to do before our next user meeting (i.e. before the end of this month).

Regards,

Yvan