[ask] on parallel computing
Hi,
Today I encountered a strange phenomenon. I was using a quad-core (Intel Q6600) computer to run a simulation, but only used two of its cores, and the calculation was fine. Today I switched to 3 or 4 cores and found that the calculations run far longer than expected without finishing. Obviously something is wrong, but I don't know what.
When using 3 cores, the calculation hangs at the first iteration step (I checked the listing file):
MAIN CALCULATION
================
===============================================================
===============================================================
INSTANT 0.100000000E+01 TIME STEP NUMBER 1
=============================================================
--- Phase: 1
---------------------------------
Property Min. value Max. value
---------------------------------
Density 0.8835E+03 0.8835E+03
LamVisc 0.1277E-01 0.1277E-01
---------------------------------
--- Diffusivity:
---------------------------------------
Scalar Number Min. value Max. value
---------------------------------------
TempC 1 0.6311E-04 0.6311E-04
---------------------------------------
** INFORMATION ON BOUNDARY FACES TYPE
----------------------------------
Phase : 1
-------------------------------------------------------------------------
Boundary type Code Nb faces
-------------------------------------------------------------------------
Inlet 2 50
Smooth wall 5 500
Rough wall 6 0
Symmetry 4 25000
Free outlet 3 50
Undefined 1 0
SIGINT signal (Control+C or equivalent) received.
--> computation interrupted by user.
Call stack:
1: 0x7f69e5899340 ? (?)
2: 0x7f69eb0af05a <opal_progress+0x5a> (libopen-pal.so.0)
3: 0x7f69e64e7995 ? (?)
4: 0x7f69e47d774f ? (?)
5: 0x7f69eb9cbe8c <PMPI_Allreduce+0x17c> (libmpi.so.0)
6: 0x7f69ece23d0f <parcpt_+0x2f> (libsaturne.so.0)
7: 0x7f69ecf1c5f5 <typecl_+0x16e5> (libsaturne.so.0)
8: 0x7f69ecd5a831 <condli_+0x1301> (libsaturne.so.0)
9: 0x7f69ecf013d1 <tridim_+0x6931> (libsaturne.so.0)
10: 0x7f69ecd4a0d5 <caltri_+0x5085> (libsaturne.so.0)
11: 0x7f69ecd254db <cs_run+0x83b> (libsaturne.so.0)
12: 0x7f69ecd257c5 <main+0x1f5> (libsaturne.so.0)
13: 0x7f69e92c6abd <__libc_start_main+0xfd> (libc.so.6)
14: 0x4007a9 ? (?)
End of stack
When using 4 cores, I could not even reach the first step:
...
Directory: /home/salad/tmp_Saturne/duct_2d.MEI.03181539
MPI ranks: 4
I/O mode: MPI-IO, explicit offsets
===============================================================
CALCULATION PREPARATION
=======================
===========================================================
Reading file: preprocessor_output
SIGINT signal (Control+C or equivalent) received.
--> computation interrupted by user.
Call stack:
1: 0x7fba619a73c0 ? (?)
2: 0x7fba674e3aad <mca_io_base_component_run_progress+0x3d> (libmpi.so.0)
3: 0x7fba66ba205a <opal_progress+0x5a> (libopen-pal.so.0)
4: 0x7fba674aa5f5 ? (?)
5: 0x7fba602ccdb6 ? (?)
6: 0x7fba674bf1b7 <MPI_Alltoall+0x107> (libmpi.so.0)
7: 0x7fba619aca3b ? (?)
8: 0x7fba619ae41c <ADIOI_GEN_ReadStridedColl+0xb8c> (mca_io_romio.so)
9: 0x7fba619c2712 <MPIOI_File_read_all+0x122> (mca_io_romio.so)
10: 0x7fba619c2977 <mca_io_romio_dist_MPI_File_read_at_all+0x27> (mca_io_romio.so)
11: 0x7fba674dc88f <MPI_File_read_at_all+0xff> (libmpi.so.0)
12: 0x7fba6856ef78 <fvm_file_read_global+0x178> (libfvm.so.0)
13: 0x7fba688a1a40 <cs_io_read_header+0x90> (libsaturne.so.0)
14: 0x7fba68931218 <ledevi_+0x158> (libsaturne.so.0)
15: 0x7fba6897c2a7 <iniini_+0xce3> (libsaturne.so.0)
16: 0x7fba6897effa <initi1_+0x16> (libsaturne.so.0)
17: 0x7fba68817d3e <cs_run+0x9e> (libsaturne.so.0)
18: 0x7fba688187c5 <main+0x1f5> (libsaturne.so.0)
19: 0x7fba64db9abd <__libc_start_main+0xfd> (libc.so.6)
20: 0x4007a9 ? (?)
End of stack
When using 2 cores, it works very well.
Any advice on this?
Many thanks.
Best regards,
Wayne
http://code-saturne.blogspot.com/
Re: [ask] on parallel computing
Hello Wayne,
Your problem is quite interesting (sorry...) and quite strange. Someone at EDF encountered the same issue on a quad-core laptop: it works on one or two cores, but not with 3 or 4... and unfortunately, I haven't found a solution yet.
Could you do a couple of tests for me?
- add an option to the kernel: set the ARG_CS_VERIF variable in the runcase to --mpi-io off (a sketch follows below)
- try with another MPI implementation, e.g. MPICH2
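If it helps, here is a minimal sketch of the first test as a runcase edit. Only the --mpi-io off option comes from this thread; the surrounding line is assumed to match your runcase:

    # in the runcase script: extra options passed to the kernel
    ARG_CS_VERIF="--mpi-io off"   # bypass MPI-IO to rule out the parallel I/O layer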
Hope you'll find something!
And let me know either way.

Re: [ask] on parallel computing
Yes, it is interesting :). I remember that when I used 1.4.0 or the first beta of 2.0, I encountered a similar problem where the calculation would not stop when MPI was used. I cannot remember how I dealt with it.
Ok, let's try your suggestions.
For the first suggestion, I added the option to ARG_CS_VERIF, which was empty before.
When I used 4 cores, the calculation still didn't stop. I pressed Ctrl-C and obtained:
Parallel Code_Saturne with partitioning in 4 sub-domains
Code_Saturne is running
***********************
Working directory (to be periodically cleaned) :
/home/salad/tmp_Saturne/duct_2d.MEI.03191421
Kernel version: /usr/local
Preprocessor: /usr/local/bin
********************************************
Preparing calculation
********************************************
********************************************
Starting calculation
********************************************
^C--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
mpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 11518 on node ubuntu exited on signal 2 (Interrupt).
--------------------------------------------------------------------------
4 total processes killed (some possibly by mpirun during cleanup)
mpirun: clean termination accomplished
Error running the calculation.
Check Kernel log (listing) and error* files for details
********************************************
Error in calculation stage.
********************************************
When using 3 cores, the situation is very similar. Both listing files are attached for you.
For MPICH2 I need a bit more time, since I have always used OpenMPI.
Many thanks!
Regards, Wayne
http://code-saturne.blogspot.com/
Attachments: listings.zip (15.44 KiB), downloaded 249 times
Re: [ask] on parallel computing
For more advanced debugging in parallel, you can activate logging for the different processes with the option --logp 1 (variable ARG_CS_OUTPUT in the runcase) or in the Advanced Parameters page of the graphical interface. With this option, you will have several "listing" files named "listing_nXXXX", one for each process. So, if you hit Ctrl-C, you'll see where each process hangs.
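For example, a minimal sketch of the corresponding runcase line (variable name as quoted above; the rest of the script is assumed unchanged):

    # in the runcase script: write one listing_nXXXX file per process
    ARG_CS_OUTPUT="--logp 1"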
Another way would be to attach a debugger and go through the stack like this:
cd ~/tmp_Saturne/$STUDY.$CASE.$DATE
ps aux | grep cs_solver   # note the PIDs of the different solver processes
gdb cs_solver <pid>       # attach the debugger to one of them
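Once attached, the usual gdb commands apply (a generic sketch, nothing Code_Saturne-specific):

    (gdb) bt         # backtrace: shows where this process is blocked
    (gdb) continue   # resume; hit Ctrl-C to interrupt and inspect again
    (gdb) detach     # leave the process running when you are done

If every rank turns out to be blocked inside an MPI call, as in the stacks above, that points at the MPI layer rather than at the solver itself.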
Re: [ask] on parallel computing
The machine has temporarily been rebooted into Windows to finish some calculations, but the Linux environment is still there and I will definitely test what you said later. (Probably a month from now, since I'll take a month's holiday in April.)
I was wondering whether it is caused by my patched version of Metis. I noticed your statement in the SALOME forum saying there are alternatives to Metis, and I might try them first.
What do you think?
Wayne
Re: [ask] on parallel computing
Actually, I don't think Metis is the cause of your problem. I tend to think it is due to the MPI implementation, which is why I advised you to try MPICH2. Nonetheless, if you want to try without Metis, you can disable the partitioning stage (in the interface, under Batch Running/Advanced Parameters, or directly in the runcase script); the solver then uses its internal partitioning algorithm (a space-filling curve, enabled by default in 2.0-rc1), as sketched below.
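Purely as an illustration, the toggle might look like this in the runcase (the exact variable name is not quoted in this thread, so treat the one below as hypothetical and check your own script):

    # hypothetical runcase toggle -- check your runcase for the real name
    EXEC_PARTITION="no"   # skip the external Metis partitioning stage; the solver
                          # then uses its internal space-filling-curve partitioning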
Let us know if something works in the end, and have a nice holiday.
David
Re: [ask] on parallel computing
Hi,
I got exactly the same problem here with the 2.0 beta2 version of Code_Saturne running on a bi-quad Xeon under Fedora 11.
When I start a parallel calculation with only 2 cores, the calculation performs well, but increasing the number of cores to 3 or more makes the calculation hang.
After many tries and many mail exchanges with someone at EDF R&D, I found that the problem is related to the version of OpenMPI used (1.4 as far as I remember).
Downgrading OpenMPI to an older version (I can't remember exactly, but I think it was 1.2.9 or 1.3.x) solves the problem, and it runs with my 8 cores.
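For reference, such a downgrade can be built into its own prefix so the system MPI is left untouched. This is the standard OpenMPI build sequence with an illustrative prefix (it matches the path visible in the ps listings later in this thread); Code_Saturne may also need to be reconfigured against this MPI:

    tar xzf openmpi-1.2.9.tar.gz && cd openmpi-1.2.9
    ./configure --prefix=/home/saturne/openmpi-1.2.9   # separate install prefix
    make && make install
    export PATH=/home/saturne/openmpi-1.2.9/bin:$PATH  # select this MPI at run time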
Hope it helps!
Re: [ask] on parallel computing
Hi,
Thanks a lot, Jimmy, for the feedback!
I hope it will help... though I don't understand why it fails with newer versions of OpenMPI.
David
Re: [ask] on parallel computing
I reread the mail exchange I had some time ago with Yvan Fournier: the errors that make the calculation hang arise with OpenMPI 1.3.3 or 1.3.4, and in my case they are solved by using 1.2.9.
When I list the active processes, here are the results with OpenMPI 1.3.3 (ps -def | grep Saturne):
jimmy 1371 1332  0 08:24 pts/1 00:00:00 /home/saturne/openmpi-1.3.3/arch/Linux_x86_64/bin/mpiexec -n 4 /home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/localexec
jimmy 1372 1371  0 08:24 pts/1 00:00:00 /bin/sh /home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/localexec
jimmy 1373 1371  0 08:24 pts/1 00:00:00 /bin/sh /home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/localexec
jimmy 1374 1371  0 08:24 pts/1 00:00:00 /bin/sh /home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/localexec
jimmy 1375 1371  0 08:24 pts/1 00:00:00 /bin/sh /home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/localexec
jimmy 1376 1372 97 08:24 pts/1 00:01:12 /home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/cs_solver --mpi --log 0 --param case2.xml
jimmy 1377 1373 97 08:24 pts/1 00:01:12 /home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/cs_solver --mpi --log 0 --param case2.xml
jimmy 1378 1374 97 08:24 pts/1 00:01:12 /home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/cs_solver --mpi --log 0 --param case2.xml
jimmy 1379 1375 97 08:24 pts/1 00:01:12 /home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/cs_solver --mpi --log 0 --param case2.xml
I get 100% CPU on the 4 processes, but the calculation listing hangs and the calculation never ends.
Using 1.2.9, here are the results of ps -def (8 processes):
jimmy 9395 2787  0 08:43 pts/1 00:00:00 /bin/sh ./SaturneGUI
jimmy 9396 9395  6 08:43 pts/1 00:00:02 python /home/saturne/cs-2.0-beta2/arch/Linux_x86_64_dbg/bin/cs gui
jimmy 9415    1  0 08:44 pts/1 00:00:00 /bin/sh /home/jimmy/Calculs/FULL_DOMAIN3/CAS2/SCRIPTS/runcase
jimmy 9416    1  0 08:44 pts/1 00:00:00 tee /home/jimmy/Calculs/FULL_DOMAIN3/CAS2/SCRIPTS/batch
jimmy 9445 9415  0 08:44 pts/1 00:00:00 /home/saturne/openmpi-1.2.9/bin/mpiexec -n 8 /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy 9447    1  0 08:44 ?     00:00:00 orted --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename Workstation --universe jimmy@Workstation:default-universe-9445
jimmy 9448 9447  0 08:44 ?     00:00:00 /bin/sh /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy 9449 9447  0 08:44 ?     00:00:00 /bin/sh /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy 9450 9448 66 08:44 ?     00:00:07 /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log 0 --param case2.xml
jimmy 9451 9447  0 08:44 ?     00:00:00 /bin/sh /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy 9452 9449 73 08:44 ?     00:00:08 /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log 0 --param case2.xml
jimmy 9453 9447  0 08:44 ?     00:00:00 /bin/sh /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy 9454 9447  0 08:44 ?     00:00:00 /bin/sh /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy 9455 9451 72 08:44 ?     00:00:08 /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log 0 --param case2.xml
jimmy 9456 9453 72 08:44 ?     00:00:08 /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log 0 --param case2.xml
jimmy 9457 9447  0 08:44 ?     00:00:00 /bin/sh /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy 9458 9447  0 08:44 ?     00:00:00 /bin/sh /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy 9459 9454 68 08:44 ?     00:00:08 /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log 0 --param case2.xml
jimmy 9460 9457 69 08:44 ?     00:00:08 /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log 0 --param case2.xml
jimmy 9461 9458 70 08:44 ?     00:00:08 /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log 0 --param case2.xml
jimmy 9462 9447  0 08:44 ?     00:00:00 /bin/sh /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy 9463 9462 71 08:44 ?     00:00:08 /home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log 0 --param case2.xml
Everything works well and the calculation ends correctly.
Re: [ask] on parallel computing
Thanks for the feedback.
We are using OpenMPI 1.3.1 on our workstations without known problems. When I have time, I'll try with different versions of OpenMPI to see whether I can reproduce the issue (or not...). I'll let you know if I find something.
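When comparing versions, a quick way to confirm which OpenMPI a run actually picks up (standard OpenMPI commands):

    which mpiexec        # which installation is first in PATH
    mpiexec --version    # prints the Open MPI version banner
    ompi_info | head     # fuller build and configuration details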