Basic Code_Saturne 3.1 multi-computer openmpi configuration

pklimas

Basic Code_Saturne 3.1 multi-computer openmpi configuration

Post by pklimas »

Yvan,

I am attempting to set up a parallel Code_Saturne 3.1 run using OpenMPI on two computers.

The steps that I have taken so far include:
1. Created my analysis and verified that it works on one computer (single computer multi-core using openmpi)
2. Verified that my openmpi setup utilizing both computers works using mpqc (http://www.mpqc.org/mpqc-html/index.html)
3. Moved "cs_user_scripts.py" into the DATA directory, adding the following lines (roughly as in the sketch after this list):
i) case.n_procs = 12
ii) mpi_env.mpiexec_opts = '-hostfile /etc/openmpi/openmpi-default-hostfile'
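
Roughly, those two lines sit in cs_user_scripts.py like this (just a sketch; I may be misremembering the exact name of the function that holds case.n_procs, only define_mpi_environment comes straight from the template):

Code: Select all

def define_case_parameters(case):
    # Assumed hook name; run the solver on 12 MPI ranks spread across the two machines.
    case.n_procs = 12
    return

def define_mpi_environment(mpi_env):
    # Hand Open MPI the host file listing crunchbang-node1 and crunchbang-node2.
    mpi_env.mpiexec_opts = '-hostfile /etc/openmpi/openmpi-default-hostfile'
    return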

However, when I try to start my solution, I get the error below. It does appear that openmpi finds "crunchbang-node2", but node2 cannot find the runcase xml file. Is there a specific way I need to share the run files over a network drive? Currently I have "synced" home directories on node1 and node2.

Code: Select all

****************************
  Preparing calculation data
 ****************************

 ***************************
  Preprocessing calculation
 ***************************

 **********************
  Starting calculation
 **********************

Code_Saturne: cs_gui_util.c:174: Warning
Unable to open the file: 2D-Rectangular-Bluff-Body1.xml

Call stack:
Code_Saturne: cs_gui_util.c:174: Warning
Unable to open the file: 2D-Rectangular-Bluff-Body1.xml

Call stack:
Code_Saturne: cs_gui_util.c:174: Warning
Unable to open the file: 2D-Rectangular-Bluff-Body1.xml

Call stack:
Code_Saturne: cs_gui_util.c:174: Warning
Unable to open the file: 2D-Rectangular-Bluff-Body1.xml

Call stack:
Code_Saturne: cs_gui_util.c:174: Warning
Unable to open the file: 2D-Rectangular-Bluff-Body1.xml

Call stack:
Code_Saturne: cs_gui_util.c:174: Warning
Unable to open the file: 2D-Rectangular-Bluff-Body1.xml

Call stack:
Error loading parameter file "2D-Rectangular-Bluff-Body1.xml".

Code_Saturne: cs_gui_util.c:174: Warning
Unable to open the file: 2D-Rectangular-Bluff-Body1.xml
Error loading parameter file "2D-Rectangular-Bluff-Body1.xml".
Error loading parameter file "2D-Rectangular-Bluff-Body1.xml".
Error loading parameter file "2D-Rectangular-Bluff-Body1.xml".
Error loading parameter file "2D-Rectangular-Bluff-Body1.xml".
Error loading parameter file "2D-Rectangular-Bluff-Body1.xml".
Error loading parameter file "2D-Rectangular-Bluff-Body1.xml".

Code_Saturne: cs_gui_util.c:174: Warning
Unable to open the file: 2D-Rectangular-Bluff-Body1.xml
Error loading parameter file "2D-Rectangular-Bluff-Body1.xml".

Call stack:

Call stack:
   1: 0x7f8a90d2808b <cs_opts_define+0x36b>           (libsaturne.so.0)
   2: 0x7f8a90cace84 <main+0xa4>                      (libsaturne.so.0)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 SPLIT FROM 0 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
   3: 0x7f8a8db2c995 <__libc_start_main+0xf5>         (libc.so.6)
   4: 0x4007b9     <>                               (cs_solver)
End of stack

--------------------------------------------------------------------------
mpiexec.openmpi has exited due to process rank 4 with PID 5552 on
node crunchbang-node2 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec.openmpi (as reported here).
--------------------------------------------------------------------------
 solver script exited with status 1.

Error running the calculation.

Check Code_Saturne log (listing) and error* files for details.

 ****************************
  Saving calculation results
 ****************************
 Error in calculation stage.
Any help would be greatly appreciated, and thanks for supporting this great code!

Peter
Yvan Fournier
Posts: 4208
Joined: Mon Feb 20, 2012 3:25 pm

Re: Basic Code_Saturne 3.1 multi-computer openmpi configurat

Post by Yvan Fournier »

Hello,

All our scripts assume a shared filesystem, such as NFS, or better (performance-wise, at least in some cases), a parallel filesystem.

As this does not seem to be the case here, you may need to add additional syncing in cs_user_scripts.py (where/once the execution directory is known). You may also have issues with MPI-IO, so you need to configure the code not to use it (at build time or in the performance tuning options).
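
For example, something along these lines could mirror the execution directory to the second node once it is known (a rough sketch only: the domain_prepare_data_add hook name and the exec_dir attribute should be checked against the template in DATA/REFERENCE, and the node name is of course yours):

Code: Select all

import subprocess

def domain_prepare_data_add(domain):
    # Assumed hook, called once the execution directory has been set up;
    # mirror it to the same path on the second node so both machines see
    # the same files.
    subprocess.check_call(['rsync', '-a',
                           domain.exec_dir + '/',
                           'crunchbang-node2:' + domain.exec_dir + '/'])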

In any case, setting up a parallel or shared filesystem would be better, as all serious general-purpose clusters I have seen have one.

Regards,

Yvan
pklimas

Re: Basic Code_Saturne 3.1 multi-computer openmpi configurat

Post by pklimas »

Yvan,

I did as you suggested and exported a directory on "crunchbang-node1" and mounted it at the same location on node2.

The analysis starts momentarily and then fails with the following message:

Code: Select all

Version: 3.1.0
 Path:    /opt/code_saturne-3.1

 Result directory:
   /home/analysis/code_saturne/Bluff-Body/Bluff-Body-2DPlate/AirDomain/RESU/20130826-2114
 Parallel Code_Saturne on 12 processes.

 ****************************
  Preparing calculation data
***************************

 ***************************
  Preprocessing calculation
 ***************************

 **********************
  Starting calculation
 **********************
[crunchbang-node2:3203] *** An error occurred in MPI_Bcast
[crunchbang-node2:3203] *** on communicator MPI COMMUNICATOR 4 DUP FROM 3
[crunchbang-node2:3203] *** MPI_ERR_TRUNCATE: message truncated
[crunchbang-node2:3203] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[crunchbang-node1][[60892,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] [crunchbang-node1][[60892,1],0][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[crunchbang-node1][[60892,1],2][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[crunchbang-node1][[60892,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[crunchbang-node1][[60892,1],2][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[crunchbang-node1][[60892,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpiexec.openmpi has exited due to process rank 10 with PID 3209 on
node crunchbang-node2 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec.openmpi (as reported here).
--------------------------------------------------------------------------
[crunchbang-node1:07317] 7 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[crunchbang-node1:07317] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
solver script exited with status 15.

Error running the calculation.

Check Code_Saturne log (listing) and error* files for details.

Error in calculation stage.

 ****************************
  Saving calculation results
 ****************************
Unfortunately, I don't even know where to set "orte_base_help_aggregate=0" to see more output.

The listing file error states:

Code: Select all

Reading file:        mesh_input
 Finished reading:    mesh_input
 No "partition_input/domain_number_12" file available;
 ----------------------------------------------------------
 Partitioning 208504 cells to 12 domains on 12 ranks
   (SCOTCH_dgraphPart).
  wall-clock time: 1.786792 s

  Number of cells per domain (histogramm):
    [      16680 ;      16776 [ =          1
    [      16776 ;      16872 [ =          0
    [      16872 ;      16968 [ =          0
    [      16968 ;      17064 [ =          0
    [      17064 ;      17161 [ =          1
    [      17161 ;      17257 [ =          1
    [      17257 ;      17353 [ =          2
    [      17353 ;      17449 [ =          1
    [      17449 ;      17545 [ =          2
    [      17545 ;      17642 ] =          4

 Writing file:        partition_output/domain_number_12
SIGTERM signal (termination) received.
--> computation interrupted by environment.

Call stack:
   1: 0x7f10b0d7b180 <__poll+0x10>                    (libc.so.6)
   2: 0x7f10b05feb23 <+0x1fb23>                       (libopen-pal.so.0)
   3: 0x7f10b05fd802 <+0x1e802>                       (libopen-pal.so.0)
   4: 0x7f10b05f23c9 <opal_progress+0xa9>             (libopen-pal.so.0)
   5: 0x7f10b29a8745 <+0x37745>                       (libmpi.so.0)
   6: 0x7f10ab210d52 <+0x1d52>                        (mca_coll_tuned.so)
   7: 0x7f10ab217581 <+0x8581>                        (mca_coll_tuned.so)
   8: 0x7f10b29bc8b9 <MPI_Allgather+0x169>            (libmpi.so.0)
   9: 0x7f10acafb04e <ADIOI_GEN_WriteStridedColl+0x1ae> (mca_io_romio.so)
  10: 0x7f10acb08132 <MPIOI_File_write_all+0x122>     (mca_io_romio.so)
  11: 0x7f10acb08677 <mca_io_romio_dist_MPI_File_write_at_all+0x27> (mca_io_romio.so)
  12: 0x7f10b29e1fc7 <PMPI_File_write_at_all+0x127>   (libmpi.so.0)
  13: 0x7f10b3fc2623 <cs_file_write_block_buffer+0x173> (libsaturne.so.0)
  14: 0x7f10b3fd96e6 <cs_io_write_block_buffer+0x106> (libsaturne.so.0)
  15: 0x7f10b4158478 <+0x368478>                      (libsaturne.so.0)
  16: 0x7f10b415e16e <cs_partition+0x528e>            (libsaturne.so.0)
  17: 0x7f10b3ed0e34 <cs_preprocessor_data_read_mesh+0x274> (libsaturne.so.0)
  18: 0x7f10b3e40268 <cs_run+0x238>                   (libsaturne.so.0)
  19: 0x7f10b3e3ff2a <main+0x14a>                     (libsaturne.so.0)
  20: 0x7f10b0cbf995 <__libc_start_main+0xf5>         (libc.so.6)
  21: 0x4007b9     <>                               (cs_solver)
End of stack
Once again, your insight is greatly appreciated.

Peter
Yvan Fournier
Posts: 4208
Joined: Mon Feb 20, 2012 3:25 pm

Re: Basic Code_Saturne 3.1 multi-computer openmpi configurat

Post by Yvan Fournier »

Hello,

To add OpenMPI options, you need to use cs_user_scripts.py (copying it from DATA/REFERENCE to DATA to activate it) and adapt the define_mpi_environment function so as to add options in the mpi_env.mpiexec_opts field. If you are coupling the code with itself, SYRTHES, or Code_Aster, you need to modify this in runcase_coupling instead.
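
As an illustration (a sketch only, not the full function), extra Open MPI options such as the MCA parameter from the earlier error message can simply be appended to that same string:

Code: Select all

def define_mpi_environment(mpi_env):
    # Host file for the two nodes, plus an MCA parameter so Open MPI does
    # not aggregate its help/error messages (the "orte_base_help_aggregate"
    # setting asked about above).
    mpi_env.mpiexec_opts = ('-hostfile /etc/openmpi/openmpi-default-hostfile '
                            '--mca orte_base_help_aggregate 0')
    return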

Note that this will change a bit in 3.2: general options will be set in $install_prefix/etc/code_saturne/code_saturne.cfg (or the user's ~/.code_saturne.cfg), and options may be redefined for a given case by using the CS_MPIEXEC_OPTIONS environment variable.

Regards,

Yvan
st268
Posts: 64
Joined: Fri May 31, 2013 10:45 am

Re: Basic Code_Saturne 3.1 multi-computer openmpi configurat

Post by st268 »

Yvan,

Can I just ask: is the only way to use OpenMPI through cs_user_scripts.py post-install? Am I right in thinking the following:

Code_Saturne remembers which modules were loaded at installation and then, if you try to use a different MPI environment (for example MPICH2), it will try to purge your newly loaded modules and load the ones given at installation?

thanks

Susan
Yvan Fournier
Posts: 4208
Joined: Mon Feb 20, 2012 3:25 pm

Re: Basic Code_Saturne 3.1 multi-computer openmpi configurat

Post by Yvan Fournier »

Hello,

Yes, Code_Saturne will try to use the MPI variant matching the modules used at installation time, but you may still need or want to add advanced MPI initialization options specific to a given MPI library, hence the possibility of modifying mpiexec_opts one way or the other.

Regards,

Yvan