
[ask] on parallel computing

Posted: Thu Mar 18, 2010 7:32 pm
by salad
Hi,

Today I encountered a strange phenomenon. I have been using a quad-core (Intel Q6600) computer to run simulations, but only on two of its cores, and the calculations were fine. Today I switched to 3 or 4 cores and found that the calculations run for much longer than expected without finishing. Obviously something is wrong, but I don't know what.

When using 3 cores, the calculation hangs at the first time step (I checked the listing file):

                       MAIN CALCULATION
                       ================
===============================================================
===============================================================
 INSTANT    0.100000000E+01   TIME STEP NUMBER               1
 =============================================================

 --- Phase:          1
 ---------------------------------
 Property   Min. value  Max. value
 ---------------------------------
  Density   0.8835E+03  0.8835E+03
  LamVisc   0.1277E-01  0.1277E-01
 ---------------------------------

 --- Diffusivity:
 ---------------------------------------
 Scalar   Number  Min. value  Max. value
 ---------------------------------------
 TempC         1  0.6311E-04  0.6311E-04
 ---------------------------------------


   ** INFORMATION ON BOUNDARY FACES TYPE
      ----------------------------------

   Phase :    1
-------------------------------------------------------------------------
Boundary type          Code    Nb faces
-------------------------------------------------------------------------
Inlet                     2          50
Smooth wall               5         500
Rough wall                6           0
Symmetry                  4       25000
Free outlet               3          50
Undefined                 1           0
SIGINT signal (Control+C or equivalent) received.
--> computation interrupted by user.

Call stack:
   1: 0x7f69e5899340 ?                                (?)
   2: 0x7f69eb0af05a <opal_progress+0x5a>             (libopen-pal.so.0)
   3: 0x7f69e64e7995 ?                                (?)
   4: 0x7f69e47d774f ?                                (?)
   5: 0x7f69eb9cbe8c <PMPI_Allreduce+0x17c>           (libmpi.so.0)
   6: 0x7f69ece23d0f <parcpt_+0x2f>                   (libsaturne.so.0)
   7: 0x7f69ecf1c5f5 <typecl_+0x16e5>                 (libsaturne.so.0)
   8: 0x7f69ecd5a831 <condli_+0x1301>                 (libsaturne.so.0)
   9: 0x7f69ecf013d1 <tridim_+0x6931>                 (libsaturne.so.0)
  10: 0x7f69ecd4a0d5 <caltri_+0x5085>                 (libsaturne.so.0)
  11: 0x7f69ecd254db <cs_run+0x83b>                   (libsaturne.so.0)
  12: 0x7f69ecd257c5 <main+0x1f5>                     (libsaturne.so.0)
  13: 0x7f69e92c6abd <__libc_start_main+0xfd>         (libc.so.6)
  14: 0x4007a9     ?                                (?)
End of stack

When using 4 cores, it does not even reach the first time step:

...

  Directory:         /home/salad/tmp_Saturne/duct_2d.MEI.03181539
  MPI ranks:         4
  I/O mode:          MPI-IO, explicit offsets
===============================================================
                   CALCULATION PREPARATION
                   =======================
 ===========================================================

 Reading file:        preprocessor_output
SIGINT signal (Control+C or equivalent) received.
--> computation interrupted by user.

Call stack:
   1: 0x7fba619a73c0 ?                                (?)
   2: 0x7fba674e3aad <mca_io_base_component_run_progress+0x3d> (libmpi.so.0)
   3: 0x7fba66ba205a <opal_progress+0x5a>             (libopen-pal.so.0)
   4: 0x7fba674aa5f5 ?                                (?)
   5: 0x7fba602ccdb6 ?                                (?)
   6: 0x7fba674bf1b7 <MPI_Alltoall+0x107>             (libmpi.so.0)
   7: 0x7fba619aca3b ?                                (?)
   8: 0x7fba619ae41c <ADIOI_GEN_ReadStridedColl+0xb8c> (mca_io_romio.so)
   9: 0x7fba619c2712 <MPIOI_File_read_all+0x122>      (mca_io_romio.so)
  10: 0x7fba619c2977 <mca_io_romio_dist_MPI_File_read_at_all+0x27> (mca_io_romio.so)
  11: 0x7fba674dc88f <MPI_File_read_at_all+0xff>      (libmpi.so.0)
  12: 0x7fba6856ef78 <fvm_file_read_global+0x178>     (libfvm.so.0)
  13: 0x7fba688a1a40 <cs_io_read_header+0x90>         (libsaturne.so.0)
  14: 0x7fba68931218 <ledevi_+0x158>                  (libsaturne.so.0)
  15: 0x7fba6897c2a7 <iniini_+0xce3>                  (libsaturne.so.0)
  16: 0x7fba6897effa <initi1_+0x16>                   (libsaturne.so.0)
  17: 0x7fba68817d3e <cs_run+0x9e>                    (libsaturne.so.0)
  18: 0x7fba688187c5 <main+0x1f5>                     (libsaturne.so.0)
  19: 0x7fba64db9abd <__libc_start_main+0xfd>         (libc.so.6)
  20: 0x4007a9     ?                                (?)
End of stack

When using 2 cores, it works very well.

Any advice on this?

Many thanks.

Best regards,
Wayne
http://code-saturne.blogspot.com/

Re: [ask] on parallel computing

Posted: Fri Mar 19, 2010 12:46 am
by David Monfort
Hello Wayne,

Your problem is quite interesting (sorry...) and quite strange. Someone at EDF encountered the same issue on a quad-core laptop: it works on 1 or 2 cores, but not with 3 or 4 cores... and unfortunately, I haven't found a solution yet.

Could you do a couple of tests for me?

- pass an extra option to the kernel, i.e. set the ARG_CS_VERIF variable in the runcase to --mpi-io off (see the sketch just below this list)
- try another MPI implementation, e.g. MPICH2
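
For the first point, the runcase edit should look something like this (just a sketch; the exact place of the variable depends on your runcase version):

# In SCRIPTS/runcase, pass the extra option to the kernel
# (the variable is normally empty by default):
ARG_CS_VERIF="--mpi-io off"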

Hope you'll find something!
And let me know anyway ;)

Re: [ask] on parallel computing

Posted: Fri Mar 19, 2010 6:01 pm
by salad
Yes, it is interesting :). I remember that when I used 1.4.0 or the first beta of 2.0, I encountered a similar problem where the calculation would not stop when MPI was used. I cannot remember how I dealt with it.

Ok, let's try your suggestions.

For the first suggestion, I added the option to ARG_CS_VERIF; it was empty before.

When I used 4 cores, the calculation still did not stop. I pressed Ctrl-C and obtained:

Parallel Code_Saturne with partitioning in 4 sub-domains

                      Code_Saturne is running
                      ***********************

 Working directory (to be periodically cleaned) :
     /home/salad/tmp_Saturne/duct_2d.MEI.03191421

 Kernel version:           /usr/local
 Preprocessor:             /usr/local/bin

  ********************************************
             Preparing calculation
  ********************************************

  ********************************************
             Starting calculation
  ********************************************

^C--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
mpirun: killing job...

--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 11518 on node ubuntu exited on signal 2 (Interrupt).
--------------------------------------------------------------------------
4 total processes killed (some possibly by mpirun during cleanup)
mpirun: clean termination accomplished

Error running the calculation.

Check Kernel log (listing) and error* files for details

  ********************************************
         Error in calculation stage.
  ********************************************

When using 3 cores, the situation is very similar. Both listing files are attached for you.

For MPICH2 I need a bit more time, since I have always used Open MPI.

Many thanks!

Regards, Wayne

http://code-saturne.blogspot.com/

Re: [ask] on parallel computing

Posted: Mon Mar 22, 2010 2:03 am
by David Monfort
For more advanced debugging in parallel, you can activate per-process logging with the option --logp 1 (variable ARG_CS_OUTPUT in the runcase) or in the Advanced Parameters page of the graphical interface. With this option you will get several listing files named "listing_nXXXX", one per process, so if you hit Ctrl-C you will see where each process hangs.
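
In the runcase, that would be something like this (a sketch, following the same pattern as the other ARG_CS_* variables):

# In SCRIPTS/runcase: write one listing file per MPI process
ARG_CS_OUTPUT="--logp 1"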

Another way would be to attach a debugger and go through the stack, for example:

cd ~/tmp_Saturne/$STUDY.$CASE.$DATE
ps aux | grep cs_solver      (note the different PIDs)
gdb cs_solver <pid>          (attach the debugger to one of them)
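
Once attached, the usual gdb commands should show where each rank is blocked (nothing Code_Saturne specific here):

(gdb) bt                    # backtrace of the process you attached to
(gdb) thread apply all bt   # same, for every thread if there are several
(gdb) detach                # leave the process running
(gdb) quit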

Re: [ask] on parallel computing

Posted: Tue Mar 23, 2010 2:52 pm
by salad
The machine has temporarily been rebooted into Windows to finish some calculations, but the Linux environment is still there, and I will definitely test what you said later. (Probably a month from now, since I'm taking a month's holiday in April.)

I was wondering whether it is caused by my patched version of Metis. I noticed your statement in the SALOME forum saying there are alternatives to Metis, and I might try those first.

What do you think?

Wayne

Re: [ask] on parallel computing

Posted: Tue Mar 23, 2010 11:36 pm
by David Monfort
Actually, I don't think Metis is the cause of your problem. I tend to think it is due to the MPI implementation, which is why I advised you to try MPICH2. Nonetheless, if you want to try without Metis, you can disable the partitioning stage (in the interface, Batch Running/Advanced Parameters, or directly in the runcase script) and then use the solver's internal partitioning algorithm (a space-filling curve, enabled by default in 2.0-rc1).

Let us know if something works in the end, and have a nice holiday ;)

David

Re: [ask] on parallel computing

Posted: Fri Mar 26, 2010 1:15 pm
by Jimmy Sapède
Hi,
 
I have exactly the same problem here with the 2.0 beta 2 version of Code_Saturne running on a dual quad-core Xeon under Fedora 11.
 
When I start a parallel calculation with only 2 cores, it runs well, but increasing the number of cores to 3 or more makes the calculation hang.
 
After many tries and many mail exchanges with someone at EDF R&D, I found that the problem is related to the version of Open MPI used (1.4 as far as I remember).
 
Downgrading Open MPI to an older version (I can't remember exactly, but I think it was 1.2.9 or 1.3.x) solves the problem, and it runs on my 8 cores.
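
If you want to double-check which Open MPI version a given run actually uses, something like this should do (standard commands; the path is just an example based on my temporary execution directory):

mpirun --version                              # or: ompi_info | head
ldd ~/tmp_Saturne/*/cs_solver | grep libmpi   # which libmpi the solver is linked against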
 
Hope it helps!

Re: [ask] on parallel computing

Posted: Fri Mar 26, 2010 1:58 pm
by David Monfort
Hi,

Thanks a lot, Jimmy, for the feedback!
I hope it will help... though I don't understand why it fails with newer versions of Open MPI.

David

Re: [ask] on parallel computing

Posted: Mon Mar 29, 2010 10:41 am
by Jimmy Sapède
I reread the mail exchange I had some time ago with Yvan Fournier: the errors that make the calculation hang arise with Open MPI 1.3.3 or 1.3.4 and are solved by using 1.2.9 in my case.
 
When I list the active processes, here are the results with Open MPI 1.3.3 (ps -def | grep Saturne):
 
jimmy     1371  1332  0 08:24 pts/1    00:00:00
/home/saturne/openmpi-1.3.3/arch/Linux_x86_64/bin/mpiexec -n 4
/home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/localexec
jimmy     1372  1371  0 08:24 pts/1    00:00:00 /bin/sh
/home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/localexec
jimmy     1373  1371  0 08:24 pts/1    00:00:00 /bin/sh
/home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/localexec
jimmy     1374  1371  0 08:24 pts/1    00:00:00 /bin/sh
/home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/localexec
jimmy     1375  1371  0 08:24 pts/1    00:00:00 /bin/sh
/home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/localexec
jimmy     1376  1372 97 08:24 pts/1    00:01:12
/home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/cs_solver --mpi --log
0 --param case2.xml
jimmy     1377  1373 97 08:24 pts/1    00:01:12
/home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/cs_solver --mpi --log
0 --param case2.xml
jimmy     1378  1374 97 08:24 pts/1    00:01:12
/home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/cs_solver --mpi --log
0 --param case2.xml
jimmy     1379  1375 97 08:24 pts/1    00:01:12
/home/jimmy/tmp_Saturne/FULL_DOMAIN4.CAS2.11100824/cs_solver --mpi --log
0 --param case2.xml

I get 100% CPU on the 4 processes used, but the calculation listing hangs and the calculation never ends.
 
Using 1.2.9, here are the results of ps -def (8 processes):

jimmy     9395  2787  0 08:43 pts/1    00:00:00 /bin/sh ./SaturneGUI
jimmy     9396  9395  6 08:43 pts/1    00:00:02 python
/home/saturne/cs-2.0-beta2/arch/Linux_x86_64_dbg/bin/cs gui
jimmy     9415     1  0 08:44 pts/1    00:00:00 /bin/sh
/home/jimmy/Calculs/FULL_DOMAIN3/CAS2/SCRIPTS/runcase
jimmy     9416     1  0 08:44 pts/1    00:00:00 tee
/home/jimmy/Calculs/FULL_DOMAIN3/CAS2/SCRIPTS/batch
jimmy     9445  9415  0 08:44 pts/1    00:00:00
/home/saturne/openmpi-1.2.9/bin/mpiexec -n 8
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy     9447     1  0 08:44 ?        00:00:00 orted --bootproxy 1
--name 0.0.1 --num_procs 2 --vpid_start 0 --nodename Workstation
--universe jimmy@Workstation:default-universe-9445
jimmy     9448  9447  0 08:44 ?        00:00:00 /bin/sh
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy     9449  9447  0 08:44 ?        00:00:00 /bin/sh
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy     9450  9448 66 08:44 ?        00:00:07
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log
0 --param case2.xml
jimmy     9451  9447  0 08:44 ?        00:00:00 /bin/sh
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy     9452  9449 73 08:44 ?        00:00:08
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log
0 --param case2.xml
jimmy     9453  9447  0 08:44 ?        00:00:00 /bin/sh
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy     9454  9447  0 08:44 ?        00:00:00 /bin/sh
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy     9455  9451 72 08:44 ?        00:00:08
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log
0 --param case2.xml
jimmy     9456  9453 72 08:44 ?        00:00:08
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log
0 --param case2.xml
jimmy     9457  9447  0 08:44 ?        00:00:00 /bin/sh
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy     9458  9447  0 08:44 ?        00:00:00 /bin/sh
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy     9459  9454 68 08:44 ?        00:00:08
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log
0 --param case2.xml
jimmy     9460  9457 69 08:44 ?        00:00:08
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log
0 --param case2.xml
jimmy     9461  9458 70 08:44 ?        00:00:08
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log
0 --param case2.xml
jimmy     9462  9447  0 08:44 ?        00:00:00 /bin/sh
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/localexec
jimmy     9463  9462 71 08:44 ?        00:00:08
/home/jimmy/tmp_Saturne/FULL_DOMAIN3.CAS2.11100844/cs_solver --mpi --log
0 --param case2.xml
 
Everything works well and the calculation ends correctly.

Re: [ask] on parallel computing

Posted: Mon Mar 29, 2010 11:07 pm
by David Monfort
Thanks for the feedback.

We are using OpenMPI 1.3.1 on our workstations without known problems. When I have time, I'll try with different versions of OpenMPI to see whether I can reproduce the issue (or not...). I'll let you know if I find something.