Hi Yvan,
I finally set up the whole thing on an AWS cluster using Graviton CPUs, as suggested.
However, I am now facing a challenge launching code_saturne across multiple nodes.
Here is what I have:
- An AWS cluster with a head node and 10 compute nodes of 64 cores each (no hyperthreading). I used C6gn instances since they support EFA, AWS's fast interconnect.
- A Singularity image with code_saturne and all its dependencies installed (it works in a local Singularity environment, without multiple nodes or a job manager).
I am trying to launch my calculation with the following scripts:
cs_job.sh
Code:
#!/bin/bash
#SBATCH --job-name=mpi_cs
#SBATCH --output=csout.out
#SBATCH --nodes=2
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --partition=c6gn
/shared/apps/singularity/4.0.3/bin/singularity exec -e --no-home --bind ../:/TEST,../../:/mnt ../../code_saturne.sif /mnt/cs_compute.sh
exit 0
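My guess (which I have not been able to verify) is that I also need to tell SLURM explicitly how many MPI tasks to reserve, something along the lines of the variant below, but I don't know how this is supposed to interact with the number of processes set in the case's run.cfg (see below):
Code:
#!/bin/bash
#SBATCH --job-name=mpi_cs
#SBATCH --output=csout.out
#SBATCH --nodes=2
#SBATCH --ntasks=128            # guess: reserve 128 MPI slots in total
#SBATCH --ntasks-per-node=64    # guess: 64 physical cores per C6gn node
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --partition=c6gn
/shared/apps/singularity/4.0.3/bin/singularity exec -e --no-home --bind ../:/TEST,../../:/mnt ../../code_saturne.sif /mnt/cs_compute.sh
exit 0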
cs_compute.sh
Code:
#!/bin/bash
shopt -s expand_aliases
alias code_saturne=${cs_path}
code_saturne run
exit 0
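Before the real run, I also thought of a quick sanity check to see whether the SLURM allocation is even visible inside the container, especially since I pass -e to singularity (the variable names are taken from the SLURM documentation, and cs_check.sh is just a throwaway name; I have not confirmed these variables survive the cleaned environment):
Code:
#!/bin/bash
# cs_check.sh - print what the container sees of the SLURM allocation
echo "nodelist : ${SLURM_JOB_NODELIST:-unset}"
echo "ntasks   : ${SLURM_NTASKS:-unset}"
echo "cpus/node: ${SLURM_JOB_CPUS_PER_NODE:-unset}"
command -v mpirun && mpirun --version | head -n 1
exit 0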
My run.cfg file is set up to use 128 MPI processes with 1 thread each.
But I get the following output:
Code:
code_saturne
============
Version: 8.1.0
Path: /opt/code_saturne/8.1.0
Result directory:
/home/ec2-user/TEST/relief/RESU/20240126-2244
Copying base setup data
-----------------------
Compiling and linking user-defined functions
--------------------------------------------
Preparing calculation data
--------------------------
Parallel code_saturne on 128 processes.
Preprocessing calculation
-------------------------
Starting calculation
--------------------
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 128
slots that were requested by the application:
./cs_solver
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
solver script exited with status 1.
Error running the calculation.
Check run_solver.log and error* files for details.
Domain None (code_saturne):
run_solver.log, error*.
Post-calculation operations
---------------------------
Run failed in calculation stage.
I am new to this and I don't really understand how SLURM, Open MPI, and code_saturne work together to set the proper number of CPUs/tasks/nodes, etc.
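For instance, I assume that a bare test like the one below, run from the head node with the same allocation parameters, should succeed if the allocation itself is fine, and that my problem is only in how mpirun inside the container discovers the slots; but I may be completely wrong about that:
Code:
# quick test outside the container: can SLURM itself spread 128 tasks over 2 nodes?
sbatch --nodes=2 --ntasks=128 --partition=c6gn --wrap "srun hostname | sort | uniq -c"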
Can you give me a hint on how to make this work?
Best regards,
Antoine
----------
EDIT 29/01:
----------
I did some digging on Google (and asked ChatGPT), and I got to the point where a calculation runs perfectly on 1 node, with the number of cores set in my SLURM script, after removing the run.cfg file from the code_saturne case's DATA directory.
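For what it is worth, the 1-node job script that works is essentially the following (rewritten here from memory for the post, so the exact values may differ slightly from the real file on the cluster):
Code:
#!/bin/bash
#SBATCH --job-name=mpi_cs
#SBATCH --output=csout.out
#SBATCH --nodes=1
#SBATCH --ntasks=64             # one task per physical core on a single C6gn node
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --partition=c6gn
/shared/apps/singularity/4.0.3/bin/singularity exec -e --no-home --bind ../:/TEST,../../:/mnt ../../code_saturne.sif /mnt/cs_compute.sh
exit 0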
However, when I try to run on 2 nodes, I get the following error:
Code:
code_saturne
============
Version: 8.1.0
Path: /opt/code_saturne/8.1.0
Result directory:
/home/ec2-user/TEST/relief/RESU/20240129-1102_121
Copying base setup data
-----------------------
Preparing calculation data
--------------------------
Parallel code_saturne on 128 processes.
Preprocessing calculation
-------------------------
Starting calculation
--------------------
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
solver script exited with status 1.
Error running the calculation.
Check run_solver.log and error* files for details.
Domain None (code_saturne):
run_solver.log, error*.
Post-calculation operations
---------------------------
Run failed in calculation stage.
The main difference since the first post is that I installed SLURM and munge inside the container, and I used the --bind option when running Singularity to share the SLURM and munge configuration from the host with the container.
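Roughly, the exec line now looks like this (the bind paths below are approximate, written from memory, so the exact ones on the cluster may differ):
Code:
/shared/apps/singularity/4.0.3/bin/singularity exec -e --no-home \
    --bind ../:/TEST,../../:/mnt \
    --bind /etc/slurm:/etc/slurm \
    --bind /etc/munge:/etc/munge,/var/run/munge:/var/run/munge \
    ../../code_saturne.sif /mnt/cs_compute.sh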
Best regards,
Antoine