Calculation is stuck but no error reported on a cluster
Posted: Mon Apr 11, 2022 12:39 pm
by Ruonan
Dear developers,
I ran into a problem when running on a cluster. Could you please have a look, in case you have any ideas or experience with this?
I'm using version 7.1.0-patch. I have 3 nodes, each with 20 processors. I set "n_procs: 60, n_threads: 1", which succeeded before. But now the simulation seems to be stuck, and I don't know why. From the terminal I can see "Starting calculation", which looks OK. But when I go to the result folder, there is no run_solver.log file, and of course nothing in the listing, yet no error appeared. Please see the figure below.
Interestingly, if I set "n_procs: 20, n_threads: 1", the simulation runs. I have attached a snippet from performance.log, where you can see the characteristics of the nodes.
Code: Select all
Local case configuration:
Date: Mon 11 Apr 2022 12:14:59 BST
System: Linux 3.10.0-693.5.2.el7.x86_64
Machine: galaxy2.swmgmt.eureka
Processor: model name : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
Memory: 128706 MB
Directory: /users/rw00793/cases_20220405/Severac_closed_rotor_stator_y30_r50_rt50/wmles_rot/RESU/20220411-1214
MPI ranks: 20
OpenMP threads: 1
Processors/node: 1
Compilers used for build:
C compiler: gcc (GCC) 4.9.3
C++ compiler: g++ (GCC) 4.9.3
Fortran compiler: GNU Fortran (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
MPI version: 3.0 (MPICH 3.1.2)
OpenMP version: 4.0
External libraries:
I/O read method: standard input and output, serial access
I/O write method: standard input and output, serial access
I/O rank step: 1
This problem didn't happen before, which is quite confusing. It would be great if you could give me a hint. Thank you!
Best regards,
Ruonan
Re: Calculation is stuck but no error reported on a cluster
Posted: Mon Apr 11, 2022 11:40 pm
by Yvan Fournier
Hello,
The snippet is from another run, correct? If the computation does not even open run_solver.log, it means it locks up very early. Given that it hangs on 60 processes (3 nodes) but works on 20 (1 node), I would guess this is an MPI initialization issue.
Is the MPI configuration identical to that of other installs ? Is the post-install done the same way ?
Regards,
Yvan
Re: Calculation is stuck but no error reported on a cluster
Posted: Tue Apr 12, 2022 11:32 am
by Ruonan
Hello Yvan,
Thanks for your reply! I'm sorry for my unclear post.
Yes, you are right. I was assigned 3 nodes, and I choose in run.cfg whether to use 1 node or 3. The snippet is from a run using 1 node, where the case runs with no problem. But when I use 3 nodes, the case hangs and does not even open run_solver.log.
For installation, I always follow the same steps, so I have not noticed any installation issue so far.
Would it be possible to change some MPI settings in setup.xml, or to reinstall Code_Saturne with a different MPI version, to solve this issue?
Best regards,
Ruonan
Re: Calculation is stuck but no error reported on a cluster
Posted: Tue Apr 12, 2022 1:32 pm
by Yvan Fournier
Hello,
You may reinstall with a different MPI, but you can also (more simply) force some MPI settings in the post-install (code_saturne.cfg) file, in case the automatic settings do not work well.
In particular, check that the "mpiexec" command used matches that of the MPI library, and, when running on a cluster, check that this MPI installation is configured to work across the cluster (you can test that easily by running "mpiexec -n 60 /usr/bin/hostname", for example).
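A quick cross-node sanity check along those lines might look as follows (the module name is an assumption; adapt it to whatever provides the MPI build that code_saturne was compiled against):

```shell
# Load the same MPI used to build code_saturne (module name is hypothetical).
module load mpich/3.1.2

# Each rank prints its host name. With 3 nodes x 20 ranks you should see
# all three hostnames, 20 times each. If this command itself hangs, the
# problem is in the MPI/cluster setup, not in code_saturne.
mpiexec -n 60 /usr/bin/hostname | sort | uniq -c
```

If this hangs, or only one hostname appears, the MPI installation is not set up for multi-node launches.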
Best regards,
Yvan
Re: Calculation is stuck but no error reported on a cluster
Posted: Thu Apr 28, 2022 10:22 pm
by Ruonan
Hello Yvan,
Thank you very much for your help. I followed your suggestions and made some progress, but the problem is still not fixed.
I changed to a new cluster with AMD EPYC 7452 32-core processors. Each node has 64 processors.
I first compiled Code_Saturne using MPICH/3.4.2. The compilation was fine, but when I ran the code, it hung (locked up) even on a single node.
Then, as you suggested, I switched to OpenMPI/3.1.4 to compile the code. Luckily, the code runs in parallel on 1 node, with up to 64 processors. But when I tried multiple nodes, the calculation hung again, with no error printed.
So for now I can only run the code on 1 node.
I then uncommented and modified two lines in code_saturne.cfg:
Code: Select all
bindir = /opt/software/pkgs/OpenMPI/3.1.4-GCC-8.3.0/bin
mpiexec = mpirun
I'm not sure whether I modified them correctly, but unfortunately, after changing these two lines, it still didn't work: with multiple nodes the run still hung.
I would be really grateful if you could share some ideas. The setup file is attached, including the MPI binary used when I compiled Code_Saturne. Thank you!
Best regards,
Ruonan
Re: Calculation is stuck but no error reported on a cluster
Posted: Fri Apr 29, 2022 2:42 pm
by Yvan Fournier
Hello,
Did you try a simple "hello world" program with those same MPI libraries and similar batch system options?
In v7.1, you can find the batch header in DATA/run.cfg, or in the generated runcase file in the execution directory (when using "code_saturne submit" or submitting through the GUI).
Could you also run "ldd" on the cs_solver executable in the execution directory, to check that only one MPI library appears, and that you do not have additional/mixed libraries (possibly pulled in by external libraries)?
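A sketch of that check (the path to cs_solver is taken from the install described in this thread; adapt it as needed):

```shell
# List the MPI-related shared libraries linked into the solver binary.
# A healthy build resolves them all from a single MPI install prefix;
# libraries from two different MPI prefixes would indicate mixed MPI builds.
ldd arch/Linux_x86_64/libexec/code_saturne/cs_solver \
  | grep -Ei 'libmpi|open-rte|open-pal|mpich'
```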
Best regards,
Yvan
Re: Calculation is stuck but no error reported on a cluster
Posted: Fri May 06, 2022 6:31 pm
by Ruonan
Hello Yvan,
Thank you very much for your reply!
Today I looked at this issue with my cluster administrator. He had run a "hello world" test before, and there was no problem.
I ran ldd and copied the results below. I have added "xxx" to the three lines related to MPI so you can find them more easily. It seems only one OpenMPI/3.1.4 appears, which is exactly the one I used to install (compile) and run the code.
Code: Select all
ldd ./Code_Saturne/7.1.1-openmpi/code_saturne-7.1.1/arch/Linux_x86_64/libexec/code_saturne/cs_solver
linux-vdso.so.1 (0x00007ffc345f0000)
libcs_solver-7.1.so => /users/rw00793/Code_Saturne/7.1.1-openmpi/code_saturne-7.1.1/arch/Linux_x86_64/lib/libcs_solver-7.1.so (0x000014e307a3f000)
libsaturne-7.1.so => /users/rw00793/Code_Saturne/7.1.1-openmpi/code_saturne-7.1.1/arch/Linux_x86_64/lib/libsaturne-7.1.so (0x000014e306870000)
libple.so.2 => /users/rw00793/Code_Saturne/7.1.1-openmpi/code_saturne-7.1.1/arch/Linux_x86_64/lib/libple.so.2 (0x000014e307a29000)
libcgns.so.4.2 => /users/rw00793/Code_Saturne/7.1.1-openmpi/cgns-4.2.0/arch/Linux_x86_64/lib/libcgns.so.4.2 (0x000014e307958000)
libmedC.so.11 => /users/rw00793/Code_Saturne/7.1.1-openmpi/med-4.1.0/arch/Linux_x86_64/lib/libmedC.so.11 (0x000014e30673a000)
libhdf5.so.103 => /users/rw00793/Code_Saturne/7.1.1-openmpi/hdf5-1.10.6/arch/Linux_x86_64/lib/libhdf5.so.103 (0x000014e306378000)
libparmetis.so => /users/rw00793/Code_Saturne/7.1.1-openmpi/parmetis-4.0.3/arch/Linux_x86_64/lib/libparmetis.so (0x000014e3078d7000)
libmetis.so => /users/rw00793/Code_Saturne/7.1.1-openmpi/parmetis-4.0.3/arch/Linux_x86_64/lib/libmetis.so (0x000014e307867000)
libptscotch.so => /users/rw00793/Code_Saturne/7.1.1-openmpi/scotch-6.1.1/arch/Linux_x86_64/lib/libptscotch.so (0x000014e30632b000)
libscotch.so => /users/rw00793/Code_Saturne/7.1.1-openmpi/scotch-6.1.1/arch/Linux_x86_64/lib/libscotch.so (0x000014e306293000)
libz.so.1 => /opt/software/pkgs/zlib/1.2.11-GCCcore-8.3.0/lib64/libz.so.1 (0x000014e30784e000)
libdl.so.2 => /lib64/libdl.so.2 (0x000014e306082000)
libgfortran.so.5 => /opt/software/pkgs/GCCcore/8.3.0/lib64/libgfortran.so.5 (0x000014e305e13000)
libquadmath.so.0 => /opt/software/pkgs/GCCcore/8.3.0/lib64/libquadmath.so.0 (0x000014e305dd2000)
libm.so.6 => /lib64/libm.so.6 (0x000014e305a50000)
xxx libmpi.so.40 => /opt/software/pkgs/OpenMPI/3.1.4-GCC-8.3.0/lib64/libmpi.so.40 (0x000014e305947000)
libgomp.so.1 => /opt/software/pkgs/GCCcore/8.3.0/lib64/libgomp.so.1 (0x000014e305918000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000014e3056f8000)
libc.so.6 => /lib64/libc.so.6 (0x000014e305333000)
xxx libopen-rte.so.40 => /opt/software/pkgs/OpenMPI/3.1.4-GCC-8.3.0/lib/libopen-rte.so.40 (0x000014e305275000)
xxx libopen-pal.so.40 => /opt/software/pkgs/OpenMPI/3.1.4-GCC-8.3.0/lib/libopen-pal.so.40 (0x000014e3051b0000)
librt.so.1 => /lib64/librt.so.1 (0x000014e304fa8000)
libutil.so.1 => /lib64/libutil.so.1 (0x000014e304da4000)
libhwloc.so.5 => /opt/software/pkgs/hwloc/1.11.12-GCCcore-8.3.0/lib/libhwloc.so.5 (0x000014e304d62000)
libnuma.so.1 => /opt/software/pkgs/numactl/2.0.12-GCCcore-8.3.0/lib/libnuma.so.1 (0x000014e304d55000)
libpciaccess.so.0 => /opt/software/pkgs/libpciaccess/0.14-GCCcore-8.3.0/lib/libpciaccess.so.0 (0x000014e304d49000)
libxml2.so.2 => /opt/software/pkgs/libxml2/2.9.9-GCCcore-8.3.0/lib/libxml2.so.2 (0x000014e304bd8000)
liblzma.so.5 => /opt/software/pkgs/XZ/5.2.4-GCCcore-8.3.0/lib/liblzma.so.5 (0x000014e304bb0000)
libstdc++.so.6 => /opt/software/pkgs/GCCcore/8.3.0/lib/../lib64/libstdc++.so.6 (0x000014e304a16000)
libgcc_s.so.1 => /opt/software/pkgs/GCCcore/8.3.0/lib/../lib64/libgcc_s.so.1 (0x000014e3049fd000)
/lib64/ld-linux-x86-64.so.2 (0x000014e30781f000)
Also, just in case, I have attached the runcase and code_saturne.cfg files (with bindir and mpiexec uncommented).
So unfortunately, and sorry, I still don't know why the code works on 1 node but gets stuck on multiple nodes.
Best regards,
Ruonan
Re: Calculation is stuck but no error reported on a cluster
Posted: Sun May 08, 2022 11:40 pm
by Yvan Fournier
Hello,
Did the administrator run the "hello world" test interactively or under a batch allocation?
You configured the mpiexec command but no batch system, so unless you are running the code_saturne script interactively inside a batch allocation, the issue might be related to that.
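To illustrate the difference: on a Slurm cluster (an assumption; the same idea applies to PBS or other schedulers), a "hello world" test only reproduces the failing case if it is submitted under a multi-node allocation, e.g.:

```shell
#!/bin/bash
# Hypothetical batch header: 2 nodes x 64 ranks, matching the EPYC nodes above.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --time=00:05:00

# Use the exact mpirun that code_saturne was built against.
/opt/software/pkgs/OpenMPI/3.1.4-GCC-8.3.0/bin/mpirun /usr/bin/hostname
```

Running the same test interactively on a login node would not exercise the multi-node launch path at all.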
Best regards,
Yvan