MPI-OpenMP hybrid support

saintlyknighted
Posts: 18
Joined: Sun Aug 06, 2023 3:40 am

MPI-OpenMP hybrid support

Post by saintlyknighted »

Hello,

I am currently trying to run Code_Saturne in conjunction with another code on my HPC system using MPI and OpenMP, but when I try to use more than one node, the computation fails since it seems to try to generate one results folder per node. I have already managed to use Code_Saturne with multiple cores on one node on my HPC (though on a different cluster), so I would just like to check if Code_Saturne supports MPI and OpenMP hybrid jobs.

I have attached the run_solver.log file. I couldn't upload the PBS error and output files or my job submission script, so they are included below.

Thanks!

Error file:

Code:

The following have been reloaded with a version change:
  1) GCCcore/13.2.0 => GCCcore/11.2.0
  2) bzip2/1.0.8-GCCcore-13.2.0 => bzip2/1.0.8-GCCcore-11.2.0
  3) ncurses/6.4-GCCcore-13.2.0 => ncurses/6.2-GCCcore-11.2.0
  4) zlib/1.2.13-GCCcore-13.2.0 => zlib/1.2.11-GCCcore-11.2.0


The following have been reloaded with a version change:
  1) GCCcore/11.2.0 => GCCcore/13.3.0
  2) binutils/2.40-GCCcore-13.2.0 => binutils/2.42-GCCcore-13.3.0
  3) zlib/1.2.11-GCCcore-11.2.0 => zlib/1.3.1-GCCcore-13.3.0


The following have been reloaded with a version change:
  1) GCCcore/13.3.0 => GCCcore/11.3.0
  2) UCX/1.16.0-GCCcore-13.3.0 => UCX/1.12.1-GCCcore-11.3.0
  3) binutils/2.42-GCCcore-13.3.0 => binutils/2.38-GCCcore-11.3.0
  4) iimpi/2024a => iimpi/2022a
  5) impi/2021.13.0-intel-compilers-2024.2.0 => impi/2021.6.0-intel-compilers-2022.1.0
  6) intel-compilers/2024.2.0 => intel-compilers/2022.1.0
  7) numactl/2.0.18-GCCcore-13.3.0 => numactl/2.0.14-GCCcore-11.3.0
  8) zlib/1.3.1-GCCcore-13.3.0 => zlib/1.2.12-GCCcore-11.3.0


The following have been reloaded with a version change:
  1) GCCcore/13.2.0 => GCCcore/13.3.0
  2) zlib/1.2.13-GCCcore-13.2.0 => zlib/1.3.1-GCCcore-13.3.0


The following have been reloaded with a version change:
  1) GCCcore/13.2.0 => GCCcore/13.3.0
  2) zlib/1.2.13-GCCcore-13.2.0 => zlib/1.3.1-GCCcore-13.3.0


The following have been reloaded with a version change:
  1) GCCcore/13.3.0 => GCCcore/13.2.0
  2) binutils/2.42-GCCcore-13.3.0 => binutils/2.40-GCCcore-13.2.0
  3) zlib/1.3.1-GCCcore-13.3.0 => zlib/1.2.13-GCCcore-13.2.0


The following have been reloaded with a version change:
  1) GCCcore/13.3.0 => GCCcore/13.2.0
  2) binutils/2.42-GCCcore-13.3.0 => binutils/2.40-GCCcore-13.2.0
  3) zlib/1.3.1-GCCcore-13.3.0 => zlib/1.2.13-GCCcore-13.2.0


The following have been reloaded with a version change:
  1) numactl/2.0.18-GCCcore-13.3.0 => numactl/2.0.16-GCCcore-13.2.0


The following have been reloaded with a version change:
  1) numactl/2.0.18-GCCcore-13.3.0 => numactl/2.0.16-GCCcore-13.2.0


The following have been reloaded with a version change:
  1) UCX/1.16.0-GCCcore-13.3.0 => UCX/1.15.0-GCCcore-13.2.0


The following have been reloaded with a version change:
  1) UCX/1.16.0-GCCcore-13.3.0 => UCX/1.15.0-GCCcore-13.2.0


The following have been reloaded with a version change:
  1) GCCcore/13.2.0 => GCCcore/13.3.0
  2) binutils/2.40-GCCcore-13.2.0 => binutils/2.42-GCCcore-13.3.0
  3) zlib/1.2.13-GCCcore-13.2.0 => zlib/1.3.1-GCCcore-13.3.0


The following have been reloaded with a version change:
  1) GCCcore/13.2.0 => GCCcore/13.3.0
  2) binutils/2.40-GCCcore-13.2.0 => binutils/2.42-GCCcore-13.3.0
  3) zlib/1.2.13-GCCcore-13.2.0 => zlib/1.3.1-GCCcore-13.3.0


The following have been reloaded with a version change:
  1) GCCcore/13.3.0 => GCCcore/13.2.0


The following have been reloaded with a version change:
  1) GCCcore/13.3.0 => GCCcore/13.2.0


The following have been reloaded with a version change:
  1) zlib/1.3.1-GCCcore-13.3.0 => zlib/1.2.13-GCCcore-13.2.0


The following have been reloaded with a version change:
  1) zlib/1.3.1-GCCcore-13.3.0 => zlib/1.2.13-GCCcore-13.2.0

Traceback (most recent call last):
  File "/gpfs/home/bt1022/code_saturne/install_path/Code_Saturne/7.0.6/code_saturne-7.0.6/arch/Linux_x86_64/bin/code_saturne", line 89, in <module>
    retcode = cs.execute()
              ^^^^^^^^^^^^
  File "/gpfs/home/bt1022/code_saturne/install_path/Code_Saturne/7.0.6/code_saturne-7.0.6/arch/Linux_x86_64/lib/python3.11/site-packages/code_saturne/cs_script.py", line 96, in execute
    return self.commands[command](options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home/bt1022/code_saturne/install_path/Code_Saturne/7.0.6/code_saturne-7.0.6/arch/Linux_x86_64/lib/python3.11/site-packages/code_saturne/cs_script.py", line 181, in run
    return cs_run.main(options, self.package)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home/bt1022/code_saturne/install_path/Code_Saturne/7.0.6/code_saturne-7.0.6/arch/Linux_x86_64/lib/python3.11/site-packages/code_saturne/cs_run.py", line 690, in main
    return run(argv, pkg)[0]
           ^^^^^^^^^^^^^^
  File "/gpfs/home/bt1022/code_saturne/install_path/Code_Saturne/7.0.6/code_saturne-7.0.6/arch/Linux_x86_64/lib/python3.11/site-packages/code_saturne/cs_run.py", line 661, in run
    retval = c.run(n_procs=r_c['n_procs'],
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home/bt1022/code_saturne/install_path/Code_Saturne/7.0.6/code_saturne-7.0.6/arch/Linux_x86_64/lib/python3.11/site-packages/code_saturne/cs_case.py", line 1845, in run
    self.set_result_dir(force_id)
  File "/gpfs/home/bt1022/code_saturne/install_path/Code_Saturne/7.0.6/code_saturne-7.0.6/arch/Linux_x86_64/lib/python3.11/site-packages/code_saturne/cs_case.py", line 634, in set_result_dir
    os.mkdir(self.result_dir)
FileExistsError: [Errno 17] File exists: '/gpfs/home/bt1022/hx1_subchcfd/SubChCFD/coarse/RESU/iteration_1'
Warning:
  Will try to use 12 processes while resource manager (PBS)
   allows for 32.

[hx1-c08-8-1:735848] mca_base_component_repository_open: unable to open mca_btl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735847] mca_base_component_repository_open: unable to open mca_btl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735844] mca_base_component_repository_open: unable to open mca_btl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735853] mca_base_component_repository_open: unable to open mca_btl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735852] mca_base_component_repository_open: unable to open mca_btl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735854] mca_base_component_repository_open: unable to open mca_btl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735849] mca_base_component_repository_open: unable to open mca_btl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735845] mca_base_component_repository_open: unable to open mca_btl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735851] mca_base_component_repository_open: unable to open mca_btl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735843] mca_base_component_repository_open: unable to open mca_btl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735847] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735844] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735853] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735848] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735852] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735846] mca_base_component_repository_open: unable to open mca_btl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735849] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735854] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735850] mca_base_component_repository_open: unable to open mca_btl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735845] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735851] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735843] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735846] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
[hx1-c08-8-1:735850] mca_base_component_repository_open: unable to open mca_mtl_ofi: libefa.so.1: cannot open shared object file: No such file or directory (ignored)
 solver script exited with status 1.

Error running the calculation.

Check code_saturne log (listing) and error* files for details.

 Error in calculation stage.

Output file:

Code:

                      code_saturne
                      ============

Version:   7.0.6
Path:      /gpfs/home/bt1022/code_saturne/install_path/Code_Saturne/7.0.6/code_saturne-7.0.6/arch/Linux_x86_64

Result directory:
  /gpfs/home/bt1022/hx1_subchcfd/SubChCFD/coarse/RESU/iteration_1

Copying base setup data
-----------------------

Compiling and linking user-defined functions
--------------------------------------------

Preparing calculation data
--------------------------

 Parallel code_saturne on 12 processes.

Preprocessing calculation
-------------------------

Starting calculation
--------------------

initializing
Post-calculation operations
---------------------------


====================================
CPU Time used: 00:00:32
CPU Percent: 88%
Memory usage: 5589608kb
Approx Power usage: 0.0
Walltime usage: 00:00:40

====================================
Job script:

Code:

#!/bin/sh
#PBS -koed
#PBS -lselect=2:ncpus=16:mpiprocs=16:mem=100gb
#PBS -lwalltime=25:00:00

## MODULES TO LOAD @ RUNTIME:
module load tools
module load Python
module load libgd
module load intel
module load SCOTCH

## RUN COMMAND

cd /gpfs/home/bt1022/hx1_subchcfd

mpirun -n 2 /gpfs/home/bt1022/code_saturne/install_path/Code_Saturne/7.0.6/code_saturne-7.0.6/arch/Linux_x86_64/bin/code_saturne run --case /gpfs/home/bt1022/hx1_subchcfd/SubChCFD/coarse -n 12
Attachments
run_solver.log
Yvan Fournier
Posts: 4220
Joined: Mon Feb 20, 2012 3:25 pm

Re: MPI-OpenMP hybrid support

Post by Yvan Fournier »

Hello,

Yes, the code should work fine in hybrid MPI/OpenMP mode. Even though performance is usually slightly better in pure MPI mode, hybrid mode often requires a bit less memory, so it can be useful, and it may improve in the future (current work on GPU performance should also improve OpenMP performance). We have at least one nightly test which runs in hybrid mode.
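
For illustration, a minimal hybrid submission could look like the sketch below. This is only a sketch under assumptions: the ompthreads PBS resource, the rank/thread split (8 ranks x 2 threads per 16-core node), relying on the code_saturne launcher to start MPI itself, and on the OpenMP runtime honouring OMP_NUM_THREADS. Paths are taken from the original post and the site module loads are omitted.

Code:

#!/bin/sh
#PBS -l select=2:ncpus=16:mpiprocs=8:ompthreads=2:mem=100gb
#PBS -l walltime=25:00:00

## MODULES TO LOAD @ RUNTIME (same as in the original job script)

## Threads per MPI rank for the OpenMP runtime (assumed 8 ranks x 2 threads per node)
export OMP_NUM_THREADS=2

cd /gpfs/home/bt1022/hx1_subchcfd

## Call the launcher once and let it start MPI itself, requesting 16 ranks in total
/gpfs/home/bt1022/code_saturne/install_path/Code_Saturne/7.0.6/code_saturne-7.0.6/arch/Linux_x86_64/bin/code_saturne run \
    --case /gpfs/home/bt1022/hx1_subchcfd/SubChCFD/coarse -n 16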

If you only have the issue with more than one node, it is probably related to MPI parameters. Is the MPI library you are using tested on multiple nodes on the same cluster?

Also, the crash in the provided logs seems related to PT-Scotch, not to run directories. PT-Scotch can be pretty fragile when using threading, so I suggest using a more up-to-date PT-Scotch version, and possibly disabling threading. In our recent builds, we had performance issues with PT-Scotch threading, so we needed to use build parameters to limit threading, or at least control thread placement (using the -DCOMMON_PTHREAD_AFFINITY_LINUX flag, as described in the Scotch install documentation).
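
As an illustration of passing that build parameter, the sketch below assumes a CMake-based Scotch 7 source tree; the scotch-src/scotch-build directory names and the install prefix are placeholders, and for the classic Makefile.inc builds the flag would instead be appended to CFLAGS in the chosen Makefile.inc, as described in the Scotch install documentation.

Code:

## The flag is a C preprocessor define, so one way to pass it is via the C compiler flags
cmake -S scotch-src -B scotch-build \
      -DCMAKE_C_FLAGS="-DCOMMON_PTHREAD_AFFINITY_LINUX" \
      -DCMAKE_INSTALL_PREFIX=$HOME/opt/scotch
cmake --build scotch-build -j 4
cmake --install scotch-build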

You can also use code_saturne's built-in space-filling curve (Morton or Hilbert) partitioning, at least for initial tests, and to make sure issues are not solely related to external libraries.

Best regards,

Yvan
saintlyknighted
Posts: 18
Joined: Sun Aug 06, 2023 3:40 am

Re: MPI-OpenMP hybrid support

Post by saintlyknighted »

Thanks Yvan.

Good to know that hybrid mode works. I have tried a few things, but nothing has worked so far, so I will probably ask my HPC support staff for help. There is also a chance I might not actually need to run my simulations in hybrid mode after all, so I will see how things pan out.

Regards,
Bryan
Yvan Fournier
Posts: 4220
Joined: Mon Feb 20, 2012 3:25 pm

Re: MPI-OpenMP hybrid support

Post by Yvan Fournier »

Hello,

Did you try running the code in pure MPI mode over 2 nodes? Do you have the same issues?
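
For reference, a minimal sketch of such a pure-MPI check, reusing the paths and the 2 x 16-core node request from the original post, and assuming the launcher is called once (not under mpirun) so that it starts MPI itself:

Code:

#!/bin/sh
#PBS -l select=2:ncpus=16:mpiprocs=16:mem=100gb
#PBS -l walltime=01:00:00

cd /gpfs/home/bt1022/hx1_subchcfd

## One MPI rank per core, no OpenMP threads
/gpfs/home/bt1022/code_saturne/install_path/Code_Saturne/7.0.6/code_saturne-7.0.6/arch/Linux_x86_64/bin/code_saturne run \
    --case /gpfs/home/bt1022/hx1_subchcfd/SubChCFD/coarse -n 32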

Regards,

Yvan