Multi process issue during coupling

Rodolphe

Multi process issue during coupling

Post by Rodolphe »

Hi,

I am working on a cluster with SLURM as the batch system. I am running the coupling between Code Saturne and Syrthès from the 'Three 2D disks' tutorial on the code_saturne website. I first ran the tutorial for each code separately without problems, but when I try to run the coupled case I get an MPI-related error. The following message is displayed in the output file:

Code:

mpiexec noticed that process rank 0 with PID 38883 on node lm3-w007 exited on signal 11 (Segmentation fault).
I don't really know what it means; I guess it is linked to the multi-process calculation. I attach the compile.log from Saturne and Syrthès, the output files (error and out), the setup files from Saturne and Syrthès, runcase_coupling, the summary, the coupling parameters and the meshes in the zip archive.

Note that I already tried running Code Saturne alone in parallel, and it worked when using METIS rather than Scotch for partitioning.
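
(For reference, the partitioner can also be forced explicitly in cs_user_performance_tuning.c. Below is a minimal sketch, assuming the cs_partition_set_algorithm() call from the reference user examples; the header list and exact arguments should be checked against your 6.0.x install.)

Code:

/* Sketch of a user function in cs_user_performance_tuning.c forcing METIS
   for the main partitioning stage (CS_PARTITION_SCOTCH would select
   PT-SCOTCH/SCOTCH instead). */

#include "cs_defs.h"
#include "cs_partition.h"
#include "cs_prototypes.h"

void
cs_user_partition(void)
{
  cs_partition_set_algorithm(CS_PARTITION_MAIN,   /* main partitioning stage */
                             CS_PARTITION_METIS,  /* ParMETIS or METIS */
                             1,                   /* rank step */
                             false);              /* do not ignore periodicity */
}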

Thanks for your help !

Rodolphe
Attachments
FILES.zip
(1.28 MiB)
Yvan Fournier

Re: Multi process issue during coupling

Post by Yvan Fournier »

Hello,

Do you also have run_solver.log (for the fluid domain) and syrthes.log (or listing, I am not sure of the name) for the solid domain?

That would help determine where the issue appears.

Best regards,

Yvan
Rodolphe

Re: Multi process issue during coupling

Post by Rodolphe »

Hello,

I did find a run_solver file, but I don't know if it is the one you were talking about (see attached file).
The listing file for the solid domain is empty.

Best regards,

Rodolphe
Attachments
run_solver.txt
(2.14 KiB)
Yvan Fournier

Re: Multi process issue during coupling

Post by Yvan Fournier »

Hello,

If you did not find a run_solver.log file, it means the computation crashed before creating it, at initialization.

I suspect an installation issue, probably with code_saturne and Syrthes using different MPI libraries.

Do you have any other messages in the terminal (or, since you use a batch system, in the job log file)?

Otherwise, could you run and post the output of "ldd SOLID/syrthes" and "ldd FLUID/cs_solver", after loading the modules listed in run_solver.txt?
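
As an additional check, a small MPI test program along these lines (just a generic sketch, not something provided by either code), compiled and run once with each code's modules, would also show which MPI implementation each environment actually uses:

Code:

/* Minimal MPI sanity check: prints the MPI library version string
   reported by the implementation the program was built and run with. */

#include <mpi.h>
#include <stdio.h>

int
main(int argc, char *argv[])
{
  char version[MPI_MAX_LIBRARY_VERSION_STRING];
  int len, rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_library_version(version, &len);

  if (rank == 0)
    printf("MPI library: %s\n", version);

  MPI_Finalize();
  return 0;
}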

Best regards,

Yvan
Rodolphe

Re: Multi process issue during coupling

Post by Rodolphe »

Hello,

Since I installed Code Saturne with the semi-automatic installation script, I did not specify the path to the MPI libraries (I kept the default options), whereas during the installation of Syrthès I had to specify it in the setup file. How can I check which MPI libraries are used by Saturne? (Sorry, I'm quite a beginner in this domain.)

I attach the job log files (one with the error messages and one with the output text).

ldd FLUID/cs_solver gives:

Code:

linux-vdso.so.1 =>  (0x00002aaaaaacd000)
	libcs_solver-6.0.so => /home/ucl/tfl/rvanco/Code_Saturne/6.0.6/code_saturne-6.0.6/arch/Linux_x86_64/lib/libcs_solver-6.0.so (0x00002aaaaaccf000)
	libsaturne-6.0.so => /home/ucl/tfl/rvanco/Code_Saturne/6.0.6/code_saturne-6.0.6/arch/Linux_x86_64/lib/libsaturne-6.0.so (0x00002aaaaaed5000)
	libple.so.2 => /home/ucl/tfl/rvanco/Code_Saturne/6.0.6/code_saturne-6.0.6/arch/Linux_x86_64/lib/libple.so.2 (0x00002aaaac749000)
	libcgns.so.3.3 => /home/ucl/tfl/rvanco/cgns/lib/libcgns.so.3.3 (0x00002aaaac95b000)
	libmedC.so.11 => /home/ucl/tfl/rvanco/Code_Saturne/6.0.6/med-4.0.0/arch/Linux_x86_64/lib/libmedC.so.11 (0x00002aaaacf8e000)
	libhdf5.so.100 => /opt/cecisw/arch/easybuild/2016b/software/HDF5/1.10.0-patch1-foss-2016b/lib/libhdf5.so.100 (0x00002aaaad2b6000)
	libmetis.so => /opt/cecisw/arch/easybuild/2016b/software/METIS/5.1.0-foss-2016b/lib/libmetis.so (0x00002aaaaaad6000)
	libmpi.so.12 => /opt/cecisw/arch/easybuild/2016b/software/OpenMPI/1.10.3-GCC-5.4.0-2.26/lib/libmpi.so.12 (0x00002aaaad60b000)
	libz.so.1 => /opt/cecisw/arch/easybuild/2016b/software/zlib/1.2.8-foss-2016b/lib/libz.so.1 (0x00002aaaaab56000)
	libdl.so.2 => /usr/lib64/libdl.so.2 (0x00002aaaad980000)
	libgfortran.so.3 => /opt/cecisw/arch/easybuild/2016b/software/GCCcore/5.4.0/lib64/../lib64/libgfortran.so.3 (0x00002aaaaab7e000)
	libquadmath.so.0 => /opt/cecisw/arch/easybuild/2016b/software/GCCcore/5.4.0/lib64/../lib64/libquadmath.so.0 (0x00002aaaadb84000)
	libm.so.6 => /usr/lib64/libm.so.6 (0x00002aaaadbc3000)
	libgomp.so.1 => /opt/cecisw/arch/easybuild/2016b/software/GCCcore/5.4.0/lib64/../lib64/libgomp.so.1 (0x00002aaaaaca0000)
	libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00002aaaadec5000)
	libc.so.6 => /usr/lib64/libc.so.6 (0x00002aaaae0e1000)
	libhdf5.so.103 => /home/ucl/tfl/rvanco/Code_Saturne/6.0.6/hdf5-1.10.6/arch/Linux_x86_64/lib/libhdf5.so.103 (0x00002aaaae4af000)
	libstdc++.so.6 => /opt/cecisw/arch/easybuild/2016b/software/GCCcore/5.4.0/lib64/../lib64/libstdc++.so.6 (0x00002aaaaea74000)
	libgcc_s.so.1 => /opt/cecisw/arch/easybuild/2016b/software/GCCcore/5.4.0/lib64/../lib64/libgcc_s.so.1 (0x00002aaaaebfb000)
	libsz.so.2 => /opt/cecisw/arch/easybuild/2016b/software/Szip/2.1-foss-2016b/lib/libsz.so.2 (0x00002aaaaec12000)
	/lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
	librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00002aaaaec25000)
	libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00002aaaaee3c000)
	libpsm2.so.2 => /usr/lib64/libpsm2.so.2 (0x00002aaaaf055000)
	libfabric.so.1 => /usr/lib64/libfabric.so.1 (0x00002aaaaf2bb000)
	libopen-rte.so.12 => /opt/cecisw/arch/easybuild/2016b/software/OpenMPI/1.10.3-GCC-5.4.0-2.26/lib/libopen-rte.so.12 (0x00002aaaaf617000)
	libopen-pal.so.13 => /opt/cecisw/arch/easybuild/2016b/software/OpenMPI/1.10.3-GCC-5.4.0-2.26/lib/libopen-pal.so.13 (0x00002aaaaf710000)
	libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00002aaaaf7c0000)
	libpmi2.so.0 => /usr/lib64/libpmi2.so.0 (0x00002aaaaf9c6000)
	librt.so.1 => /usr/lib64/librt.so.1 (0x00002aaaafbde000)
	libutil.so.1 => /usr/lib64/libutil.so.1 (0x00002aaaafde6000)
	libhwloc.so.5 => /opt/cecisw/arch/easybuild/2016b/software/hwloc/1.11.3-GCC-5.4.0-2.26/lib/libhwloc.so.5 (0x00002aaaaffe9000)
	libnuma.so.1 => /opt/cecisw/arch/easybuild/2016b/software/numactl/2.0.11-GCC-5.4.0-2.26/lib/libnuma.so.1 (0x00002aaab0022000)
	libnl-route-3.so.200 => /usr/lib64/libnl-route-3.so.200 (0x00002aaab002d000)
	libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x00002aaab029a000)
	libpsm_infinipath.so.1 => /usr/lib64/libpsm_infinipath.so.1 (0x00002aaab04bb000)
	libslurmfull.so => /usr/lib64/slurm/libslurmfull.so (0x00002aaab0711000)
	libinfinipath.so.4 => /usr/lib64/libinfinipath.so.4 (0x00002aaab0adb000)
	libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00002aaab0cea000)

ldd SOLID/syrthes gives:

Code:

linux-vdso.so.1 =>  (0x00002aaaaaacd000)
	libm.so.6 => /usr/lib64/libm.so.6 (0x00002aaaaaccf000)
	libple.so.2 => /home/ucl/tfl/rvanco/usr/local/lib/libple.so.2 (0x00002aaaaaae5000)
	libmpi.so.40 => /opt/cecisw/arch/easybuild/2018b/software/OpenMPI/3.1.1-GCC-7.3.0-2.30/lib/libmpi.so.40 (0x00002aaaaaaf8000)
	libc.so.6 => /usr/lib64/libc.so.6 (0x00002aaaaafd1000)
	/lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
	libmpi.so.12 => /opt/cecisw/arch/easybuild/2016b/software/OpenMPI/1.10.3-GCC-5.4.0-2.26/lib/libmpi.so.12 (0x00002aaaab39f000)
	libopen-rte.so.40 => /opt/cecisw/arch/easybuild/2018b/software/OpenMPI/3.1.1-GCC-7.3.0-2.30/lib/libopen-rte.so.40 (0x00002aaaaac05000)
	libopen-pal.so.40 => /opt/cecisw/arch/easybuild/2018b/software/OpenMPI/3.1.1-GCC-7.3.0-2.30/lib/libopen-pal.so.40 (0x00002aaaab714000)
	librt.so.1 => /usr/lib64/librt.so.1 (0x00002aaaab7de000)
	libutil.so.1 => /usr/lib64/libutil.so.1 (0x00002aaaab9e6000)
	libhwloc.so.5 => /opt/cecisw/arch/easybuild/2016b/software/hwloc/1.11.3-GCC-5.4.0-2.26/lib/libhwloc.so.5 (0x00002aaaabbe9000)
	libnuma.so.1 => /opt/cecisw/arch/easybuild/2016b/software/numactl/2.0.11-GCC-5.4.0-2.26/lib/libnuma.so.1 (0x00002aaaaacc0000)
	libpciaccess.so.0 => /opt/cecisw/arch/easybuild/2016b/software/X11/20160819-foss-2016b/lib/libpciaccess.so.0 (0x00002aaaabc22000)
	libxml2.so.2 => /opt/cecisw/arch/easybuild/2016b/software/libxml2/2.9.4-foss-2016b/lib/libxml2.so.2 (0x00002aaaabc2b000)
	libdl.so.2 => /usr/lib64/libdl.so.2 (0x00002aaaabd93000)
	libz.so.1 => /opt/cecisw/arch/easybuild/2016b/software/zlib/1.2.8-foss-2016b/lib/libz.so.1 (0x00002aaaabf97000)
	liblzma.so.5 => /opt/cecisw/arch/easybuild/2016b/software/XZ/5.2.2-foss-2016b/lib/liblzma.so.5 (0x00002aaaabfad000)
	libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00002aaaabfd3000)
	librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00002aaaac1f0000)
	libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00002aaaac407000)
	libpsm2.so.2 => /usr/lib64/libpsm2.so.2 (0x00002aaaac620000)
	libfabric.so.1 => /usr/lib64/libfabric.so.1 (0x00002aaaac887000)
	libopen-rte.so.12 => /opt/cecisw/arch/easybuild/2016b/software/OpenMPI/1.10.3-GCC-5.4.0-2.26/lib/libopen-rte.so.12 (0x00002aaaacbe3000)
	libopen-pal.so.13 => /opt/cecisw/arch/easybuild/2016b/software/OpenMPI/1.10.3-GCC-5.4.0-2.26/lib/libopen-pal.so.13 (0x00002aaaaccdc000)
	libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00002aaaacd8d000)
	libpmi2.so.0 => /usr/lib64/libpmi2.so.0 (0x00002aaaacf93000)
	libnl-route-3.so.200 => /usr/lib64/libnl-route-3.so.200 (0x00002aaaad1ac000)
	libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x00002aaaad419000)
	libpsm_infinipath.so.1 => /usr/lib64/libpsm_infinipath.so.1 (0x00002aaaad63a000)
	libgcc_s.so.1 => /opt/cecisw/arch/easybuild/2016b/software/GCCcore/5.4.0/lib64/libgcc_s.so.1 (0x00002aaaad891000)
	libslurmfull.so => /usr/lib64/slurm/libslurmfull.so (0x00002aaaad8a8000)
	libinfinipath.so.4 => /usr/lib64/libinfinipath.so.4 (0x00002aaaadc73000)
	libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00002aaaade82000)

Best regards,

Rodolphe
Attachments
job_69955655.out.log
(2.84 KiB)
job_69955655.err.log
(46.47 KiB)
Rodolphe

Re: Multi process issue during coupling

Post by Rodolphe »

Hello,

As you suspected, the problem was that the MPI libraries were different. I reinstalled Syrthès with the same MPI library as Saturne, and now the run_solver.log file does appear in the RESU_COUPLING directory when I launch a computation.

But I still have a problem: the computation now stops with another error:

Code:

 /home/users/r/v/rvanco/ceci/code_saturne-6.0.6/libple/src/ple_locator.c:2882: Erreur fatale.

Locator trying to use distant space dimension 3
with local space dimension 2



Pile d'appels :
   1: 0x2aaaac74c5c0 <ple_locator_extend_search+0x250> (libple.so.2)
   2: 0x2aaaac753f3e <ple_locator_set_mesh+0x29e>     (libple.so.2)
   3: 0x2aaaab085132 <+0x1b0132>                      (libsaturne-6.0.so)
   4: 0x2aaaab086a41 <cs_syr4_coupling_init_mesh+0x51> (libsaturne-6.0.so)
   5: 0x2aaaab08a292 <cs_syr_coupling_init_meshes+0x22> (libsaturne-6.0.so)
   6: 0x2aaaaacd3942 <cs_run+0x5e2>                   (libcs_solver-6.0.so)
   7: 0x2aaaaacd3225 <main+0x175>                     (libcs_solver-6.0.so)
   8: 0x2aaaae103555 <__libc_start_main+0xf5>         (libc.so.6)
   9: 0x4017d9     <>                               (cs_solver)
Fin de la pile
I attach the listing files for the fluid and solid domains as well as the run_solver.log file.

Best regards,

Rodolphe
Attachments
listing_fluid.txt
(16.57 KiB)
run_solver.log
(16.57 KiB)
listing_solid.txt
(6.44 KiB)
Yvan Fournier

Re: Multi process issue during coupling

Post by Yvan Fournier »

Hello,

If the mesh on the Syrthes side is 3D and not 2D, do not force a projection in the coupling definition on the code_saturne side. The tutorial uses this projection because the solid mesh is 2D.
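
For reference, the declaration in cs_user_coupling-syrthes.c could look roughly like this without projection (the "SOLID" instance name and "wall" face selection below are only placeholders, and the exact argument list should be checked against the reference example shipped with your 6.0 version):

Code:

/* Sketch of cs_user_syrthes_coupling(); a blank projection axis (' ')
   means the solid mesh is used as-is (3D), whereas 'x', 'y' or 'z'
   would project onto a 2D Syrthes mesh as in the tutorial. */

#include "cs_defs.h"
#include "cs_syr_coupling.h"
#include "cs_prototypes.h"

void
cs_user_syrthes_coupling(void)
{
  cs_syr_coupling_define("SOLID",  /* Syrthes instance name (placeholder) */
                         "wall",   /* coupled boundary faces (placeholder) */
                         NULL,     /* no volume coupling */
                         ' ',      /* projection axis: none */
                         false,    /* do not allow nonmatching faces */
                         0.1,      /* geometric tolerance */
                         0,        /* verbosity */
                         1);       /* visualization output */
}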

Best regards,

Yvan