Calculation is stuck but no error reported on a cluster

Ruonan
Posts: 136
Joined: Mon Dec 14, 2020 11:38 am

Post by Ruonan »

Dear developers,

I have run into a problem when running on a cluster. Could you please have a look, in case you have any ideas or experience with this?

I'm using version 7.1.0-patch. I have 3 nodes, each with 20 processors. I set "n_procs:60 n_threads:1", which succeeded before. But now the simulation seems to be stuck, and I don't know why. In the terminal, I can see "Starting calculation", which is fine. But when I go to the result folder, there is no run_solver.log file, and of course nothing in the listing. No error appears. Please see the figure below.
[Figure: 1.PNG]
Interestingly, if I set "n_procs:20 n_threads:1", the simulation runs. I have attached a snippet from performance.log, which shows the characteristics of the nodes.

Code:

Local case configuration:

  Date:                Mon 11 Apr 2022 12:14:59 BST
  System:              Linux 3.10.0-693.5.2.el7.x86_64
  Machine:             galaxy2.swmgmt.eureka
  Processor:           model name       : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
  Memory:              128706 MB
  Directory:           /users/rw00793/cases_20220405/Severac_closed_rotor_stator_y30_r50_rt50/wmles_rot/RESU/20220411-1214
  MPI ranks:           20
  OpenMP threads:      1
  Processors/node:     1

  Compilers used for build:
    C compiler:        gcc (GCC) 4.9.3
    C++ compiler:      g++ (GCC) 4.9.3
    Fortran compiler:  GNU Fortran (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)

  MPI version: 3.0 (MPICH 3.1.2)
  OpenMP version: 4.0

  External libraries:

  I/O read method:     standard input and output, serial access
  I/O write method:    standard input and output, serial access
  I/O rank step:        1

This problem didn't happen before, which is quite confusing. It would be great if you could give me a hint. Thank you!

Best regards,
Ruonan
Yvan Fournier
Posts: 4070
Joined: Mon Feb 20, 2012 3:25 pm

Post by Yvan Fournier »

Hello,

The snippet is from another run, correct? If the computation does not even open run_solver.log, it means it locks up very early on. Given that it hangs on 60 procs (3 nodes) but works on 20 (1 node), I would guess this is an MPI initialization issue.

Is the MPI configuration identical to that of other installs? Was the post-install done the same way?

Regards,

Yvan

Post by Ruonan »

Hello Yvan,

Thanks for your reply! I'm sorry for my unclear post.

Yes, you are right. I was assigned 3 nodes, and I choose whether to use 1 node or 3 nodes in run.cfg. The snippet is from a run on 1 node, where the case runs with no problem. But when I use 3 nodes, the case hangs and does not even open run_solver.log.

For installation, I always follow the same steps, and I have not noticed any installation issues so far.

Would it be possible to change some MPI settings in setup.xml, or to re-install Code_Saturne with a different MPI version, to solve this issue?

Best regards,
Ruonan

Post by Yvan Fournier »

Hello,

You may reinstall with a different MPI, but you may also (more simply) force some MPI settings in the post-install (code_saturne.cfg) file, in case the automatic settings do not work well.

In particular, check that the "mpiexec" command used matches that of the MPI library, and when running on a cluster, check that this MPI installation is configured to work across the cluster (you can test that easily by running "mpiexec -n 60 /usr/bin/hostname", for example).
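
The cross-node launch test suggested above can also be scripted so that the result is easy to read. The following is a minimal sketch, not part of code_saturne: the function names, the 60 s timeout, and the sample host names are assumptions made for illustration. It counts how many ranks report each host name; if every rank lands on a single host, or the command times out, the MPI launcher is probably not set up to span the allocated nodes.

```python
import collections
import subprocess


def ranks_per_host(output: str) -> dict:
    """Count how many lines (one per MPI rank) each host name printed."""
    names = [line.strip() for line in output.splitlines() if line.strip()]
    return dict(collections.Counter(names))


def run_hostname_test(n_ranks: int, mpiexec: str = "mpiexec") -> dict:
    """Launch `<mpiexec> -n <n_ranks> hostname` and summarize its output.

    A timeout is set so that a hanging launcher raises an exception
    instead of blocking forever (60 s is an arbitrary choice).
    """
    result = subprocess.run(
        [mpiexec, "-n", str(n_ranks), "hostname"],
        capture_output=True, text=True, timeout=60, check=True,
    )
    return ranks_per_host(result.stdout)


# Illustration on made-up output: two ranks on node01, one on node02.
sample = "node01\nnode01\nnode02\n"
print(ranks_per_host(sample))
```

Inside the job allocation, calling `run_hostname_test(60)` with the same mpiexec that code_saturne uses should list every node that was assigned to the job.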

Best regards,

Yvan

Post by Ruonan »

Hello Yvan,

Thank you very much for your help. I have been following your suggestions and made some progress, but I still haven't fixed the problem.

I changed to a new cluster with AMD EPYC 7452 32-Core Processors. Each node has 64 processors.

I first compiled Code_Saturne using MPICH/3.4.2. The compilation was ok. But when I ran the code, it hung (locked) even when I used only 1 node.

Then, as you suggested, I switched to OpenMPI/3.1.4 to compile the code. Luckily, the code can run in parallel on 1 node, with up to 64 processors. But when I tried multiple nodes, the calculation hung (locked) again, with no error printed. So for now I can only run the code on 1 node.

Then I uncommented and modified two lines in code_saturne.cfg:

Code:

bindir = /opt/software/pkgs/OpenMPI/3.1.4-GCC-8.3.0/bin
mpiexec = mpirun

I'm not sure whether I modified it correctly. Unfortunately, after modifying these two lines, it still didn't work: when I tried multiple nodes, the calculation still hung (locked).

I would be really grateful if you could share some ideas. The setup file is attached, including the MPI binary used when I compiled Code_Saturne. Thank you!

Best regards,
Ruonan
Attachments: setup.txt

Post by Yvan Fournier »

Hello,

Did you try a simple "hello world" program with those same MPI libraries and similar batch system options?

In v7.1, you may find the batch header used in DATA/run.cfg, or in the generated runcase file in the execution directory (when using "code_saturne submit" or submitting through the GUI).

Could you also run "ldd" on the cs_solver executable in the execution directory so as to check that only one MPI library appears, and that you do not have additional/mixed libraries (possibly pulled in by external libraries)?

Best regards,

Yvan

Post by Ruonan »

Hello Yvan,

Thank you very much for your reply!

Today I had a look at this issue with my cluster administrator. He did a "hello world" test before and there was no problem.

I ran ldd and copied the results below. I have added "xxx" to the three lines related to MPI so you can find them more easily. It seems that only one MPI library, OpenMPI/3.1.4, appears, which is exactly what I used to install (compile) and run the code.

Code:

ldd ./Code_Saturne/7.1.1-openmpi/code_saturne-7.1.1/arch/Linux_x86_64/libexec/code_saturne/cs_solver

	linux-vdso.so.1 (0x00007ffc345f0000)
	libcs_solver-7.1.so => /users/rw00793/Code_Saturne/7.1.1-openmpi/code_saturne-7.1.1/arch/Linux_x86_64/lib/libcs_solver-7.1.so (0x000014e307a3f000)
	libsaturne-7.1.so => /users/rw00793/Code_Saturne/7.1.1-openmpi/code_saturne-7.1.1/arch/Linux_x86_64/lib/libsaturne-7.1.so (0x000014e306870000)
	libple.so.2 => /users/rw00793/Code_Saturne/7.1.1-openmpi/code_saturne-7.1.1/arch/Linux_x86_64/lib/libple.so.2 (0x000014e307a29000)
	libcgns.so.4.2 => /users/rw00793/Code_Saturne/7.1.1-openmpi/cgns-4.2.0/arch/Linux_x86_64/lib/libcgns.so.4.2 (0x000014e307958000)
	libmedC.so.11 => /users/rw00793/Code_Saturne/7.1.1-openmpi/med-4.1.0/arch/Linux_x86_64/lib/libmedC.so.11 (0x000014e30673a000)
	libhdf5.so.103 => /users/rw00793/Code_Saturne/7.1.1-openmpi/hdf5-1.10.6/arch/Linux_x86_64/lib/libhdf5.so.103 (0x000014e306378000)
	libparmetis.so => /users/rw00793/Code_Saturne/7.1.1-openmpi/parmetis-4.0.3/arch/Linux_x86_64/lib/libparmetis.so (0x000014e3078d7000)
	libmetis.so => /users/rw00793/Code_Saturne/7.1.1-openmpi/parmetis-4.0.3/arch/Linux_x86_64/lib/libmetis.so (0x000014e307867000)
	libptscotch.so => /users/rw00793/Code_Saturne/7.1.1-openmpi/scotch-6.1.1/arch/Linux_x86_64/lib/libptscotch.so (0x000014e30632b000)
	libscotch.so => /users/rw00793/Code_Saturne/7.1.1-openmpi/scotch-6.1.1/arch/Linux_x86_64/lib/libscotch.so (0x000014e306293000)
	libz.so.1 => /opt/software/pkgs/zlib/1.2.11-GCCcore-8.3.0/lib64/libz.so.1 (0x000014e30784e000)
	libdl.so.2 => /lib64/libdl.so.2 (0x000014e306082000)
	libgfortran.so.5 => /opt/software/pkgs/GCCcore/8.3.0/lib64/libgfortran.so.5 (0x000014e305e13000)
	libquadmath.so.0 => /opt/software/pkgs/GCCcore/8.3.0/lib64/libquadmath.so.0 (0x000014e305dd2000)
	libm.so.6 => /lib64/libm.so.6 (0x000014e305a50000)
xxx	libmpi.so.40 => /opt/software/pkgs/OpenMPI/3.1.4-GCC-8.3.0/lib64/libmpi.so.40 (0x000014e305947000)
	libgomp.so.1 => /opt/software/pkgs/GCCcore/8.3.0/lib64/libgomp.so.1 (0x000014e305918000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x000014e3056f8000)
	libc.so.6 => /lib64/libc.so.6 (0x000014e305333000)
xxx	libopen-rte.so.40 => /opt/software/pkgs/OpenMPI/3.1.4-GCC-8.3.0/lib/libopen-rte.so.40 (0x000014e305275000)
xxx	libopen-pal.so.40 => /opt/software/pkgs/OpenMPI/3.1.4-GCC-8.3.0/lib/libopen-pal.so.40 (0x000014e3051b0000)
	librt.so.1 => /lib64/librt.so.1 (0x000014e304fa8000)
	libutil.so.1 => /lib64/libutil.so.1 (0x000014e304da4000)
	libhwloc.so.5 => /opt/software/pkgs/hwloc/1.11.12-GCCcore-8.3.0/lib/libhwloc.so.5 (0x000014e304d62000)
	libnuma.so.1 => /opt/software/pkgs/numactl/2.0.12-GCCcore-8.3.0/lib/libnuma.so.1 (0x000014e304d55000)
	libpciaccess.so.0 => /opt/software/pkgs/libpciaccess/0.14-GCCcore-8.3.0/lib/libpciaccess.so.0 (0x000014e304d49000)
	libxml2.so.2 => /opt/software/pkgs/libxml2/2.9.9-GCCcore-8.3.0/lib/libxml2.so.2 (0x000014e304bd8000)
	liblzma.so.5 => /opt/software/pkgs/XZ/5.2.4-GCCcore-8.3.0/lib/liblzma.so.5 (0x000014e304bb0000)
	libstdc++.so.6 => /opt/software/pkgs/GCCcore/8.3.0/lib/../lib64/libstdc++.so.6 (0x000014e304a16000)
	libgcc_s.so.1 => /opt/software/pkgs/GCCcore/8.3.0/lib/../lib64/libgcc_s.so.1 (0x000014e3049fd000)
	/lib64/ld-linux-x86-64.so.2 (0x000014e30781f000)
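
The manual check of the output above (only one MPI installation among the linked libraries) can be automated. The following is a minimal sketch under stated assumptions: the library name prefixes treated as MPI-related and the `<prefix>/lib` or `<prefix>/lib64` layout are guesses that fit this listing, and the function names are made up for illustration. The sample lines are copied from the ldd output above.

```python
import posixpath
import re

# Library name prefixes treated as MPI-related. This short list is an
# assumption that covers the names seen here (Open MPI) and libmpi*/libmpich.
MPI_NAME = re.compile(r"^lib(mpi|open-rte|open-pal)")
# Matches "name => /resolved/path"; unresolved ("not found") entries are skipped.
LDD_LINE = re.compile(r"(\S+)\s+=>\s+(/\S+)")


def mpi_libraries(ldd_output: str) -> dict:
    """Map each MPI-related library name to its resolved path."""
    found = {}
    for line in ldd_output.splitlines():
        match = LDD_LINE.search(line)
        if match and MPI_NAME.match(match.group(1)):
            found[match.group(1)] = match.group(2)
    return found


def mpi_prefixes(ldd_output: str) -> set:
    """Return install prefixes, assuming a <prefix>/lib or <prefix>/lib64 layout."""
    return {
        posixpath.dirname(posixpath.dirname(path))
        for path in mpi_libraries(ldd_output).values()
    }


sample = """\
libm.so.6 => /lib64/libm.so.6 (0x000014e305a50000)
libmpi.so.40 => /opt/software/pkgs/OpenMPI/3.1.4-GCC-8.3.0/lib64/libmpi.so.40 (0x000014e305947000)
libopen-rte.so.40 => /opt/software/pkgs/OpenMPI/3.1.4-GCC-8.3.0/lib/libopen-rte.so.40 (0x000014e305275000)
"""
prefixes = mpi_prefixes(sample)
print(sorted(prefixes))
if len(prefixes) > 1:
    print("More than one MPI install prefix: libraries may be mixed")
```

A single prefix, as in this sample, suggests that no second MPI installation was pulled in by an external library.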

Also, just in case, I have attached the runcase and code_saturne.cfg files (with bindir and mpiexec uncommented).

So unfortunately I still don't know why the code works on 1 node but gets stuck on multiple nodes.

Best regards,
Ruonan
Attachments: runcase.txt, code_saturne.cfg.txt

Post by Yvan Fournier »

Hello,

Did the administrator run the "hello world" interactively or under a batch allocation?

You configured the mpiexec command, but no batch system, so unless you are running the code_saturne script interactively under a batch allocation, the issue might be related to this.
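
One quick way to tell whether a script is running inside a batch allocation is to look for the environment variables that schedulers set for their jobs. The sketch below is only illustrative: the thread does not say which scheduler the cluster uses, so the list of variables (for SLURM, PBS, and LSF) is an assumption, as is the function name.

```python
import os

# Variables that common schedulers define inside a job. Which scheduler
# applies to a given cluster is an assumption to be checked locally.
BATCH_VARS = {
    "SLURM": ("SLURM_JOB_ID", "SLURM_JOB_NUM_NODES", "SLURM_JOB_NODELIST"),
    "PBS": ("PBS_JOBID", "PBS_NODEFILE"),
    "LSF": ("LSB_JOBID", "LSB_HOSTS"),
}


def detect_allocation(environ) -> dict:
    """Return, per scheduler, the batch variables present in `environ`."""
    found = {}
    for scheduler, names in BATCH_VARS.items():
        values = {name: environ[name] for name in names if name in environ}
        if values:
            found[scheduler] = values
    return found


allocation = detect_allocation(os.environ)
if allocation:
    for scheduler, values in allocation.items():
        print(scheduler, values)
else:
    print("No batch allocation detected in the environment")
```

If nothing is detected on the node where the code_saturne script is launched, the run is probably not inside an allocation, and mpiexec may have no list of nodes to start ranks on.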

Best regards,

Yvan