Problem with parallel calculation with big mesh

Serra Sylvain

Problem with parallel calculation with big mesh

Post by Serra Sylvain »

Hi all,

I have a problem when I use parallelism in a complex geometry.
 
The first test I did was on a biperiodic plane channel flow with few cells (around 150 000). I ran this simulation on one node with 8 procs, and on 2 nodes with 8 procs per node, without any problem.

The problem appears when I run a simulation that contains 4.5 million cells on a complex geometry. This simulation is fine on one proc and on one node with 8 procs... but does not work on 2 nodes.

The runcase is the same; the turbulence model and other settings are the same.
The only difference is the geometry and the mesh.
I do not understand why there is an error.

The error file gives:

SIGTERM signal (termination) received.
--> computation interrupted by environment.

Call stack:
   1: 0x2b36795242f3 ?                                (?)
   2: 0x2b3679528e43 <cs_sles_solve+0xe23>            (libsaturne.so.0)
   3: 0x2b3679529d71 <reslin_+0x161>                  (libsaturne.so.0)
   4: 0x2b3679689926 <invers_+0x256>                  (libsaturne.so.0)
   5: 0x2b3679550f55 <codits_+0x14d5>                 (libsaturne.so.0)
   6: 0x2b3679709b86 <turbkw_+0x5d4e>                 (libsaturne.so.0)
   7: 0x2b36796faf1e <tridim_+0xcf0e>                 (libsaturne.so.0)
   8: 0x2b3679541ccb <caltri_+0x51ab>                 (libsaturne.so.0)
   9: 0x2b3679520644 <cs_run+0x774>                   (libsaturne.so.0)
  10: 0x2b36795209d5 <main+0x1f5>                     (libsaturne.so.0)
  11: 0x3f49c1d8b4 <__libc_start_main+0xf4>         (libc.so.6)
  12: 0x408fe9     ?                                (?)
End of stack

 ------------------------

Could anyone help me?

Thanks in advance

Sylvain
Alexandre Douce

Re: Problem with parallel calculation with big mesh

Post by Alexandre Douce »

Hi,
Which version are you working with?
Yvan Fournier

Re: Problem with parallel calculation with big mesh

Post by Yvan Fournier »

Hello,

The traceback indicates your calculation was interrupted by the environment. This may be due to a batch system interrupting your calculation if you have exceeded the allotted time limit, or to the calculation being interrupted because another processor crashed.

If you see any error.xxx file other than error, check it: it may contain a backtrace with more info, for example a segmentation fault (which would be a bug either in the code or in user subroutines), or a floating-point exception (which could happen either due to a bug as before, or to a diverging calculation).
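For example, from the directory where the run wrote its listing and error files (the exact location depends on your setup), something like the following lists and displays them:

  ls error*
  cat error*    # each per-rank error.xxxx file, if present, holds that rank's backtrace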

In case you really have encountered a bug, are you running on the latest available version?

Best regards,

  Yvan
Serra Sylvain

Re: Problem with parallel calculation with big mesh

Post by Serra Sylvain »

Hello,

I use version 2.0.0-RC1.

I have no error.xxx files.

I tried with a higher number of processors to see if the error is always the same... I ran on 4 nodes with 8 procs per node and the error was different.

In that case the first time step was calculated and the error occurred on the second time step.

-----------------
 INSTANT    0.200000000E-03   TIME STEP NUMBER               2
 =============================================================
   ** BOUNDARY MASS FLOW INFORMATION
      ------------------------------
   Phase :    1
---------------------------------------------------------------
Boundary type          Code    Nb faces           Mass flow
---------------------------------------------------------------
Inlet                     2           0         0.000000000E+00
Smooth wall               5      291309         0.000000000E+00
Rough wall                6           0         0.000000000E+00
Symmetry                  4           0         0.000000000E+00
Free outlet               3           0         0.000000000E+00
Undefined                 1           0         0.000000000E+00
---------------------------------------------------------------
   ** BOUNDARY CONDITIONS FOR SMOOTH WALLS
   ---------------------------------------
------------------------------------------------------------
 Phase      1                            Minimum     Maximum
------------------------------------------------------------
   Rel velocity at the wall uiptn :  0.00000E+00 0.00000E+00
   Friction velocity        uet   :  0.18962E-01 0.63538E+02
   Friction velocity        uk    :  0.00000E+00 0.00000E+00
   Dimensionless distance   yplus :  0.71628E-03 0.89849E-02
   ------------------------------------------------------
   Nb of reversal of the velocity at the wall   :          0
   Nb of faces within the viscous sub-layer     :     291309
   Total number of wall faces                   :     291309
------------------------------------------------------------

SIGTERM signal (termination) received.
--> computation interrupted by environment.

Call stack:
   1: 0x3363ec92a6 <__poll+0x66>                    (libc.so.6)
   2: 0x2b5c1319b9c6 ?                                (?)
   3: 0x2b5c1319a993 ?                                (?)
   4: 0x2b5c1318fc3e <opal_progress+0x9e>             (libopen-pal.so.0)
   5: 0x2b5c12cc5005 ?                                (?)
   6: 0x2b5c182fdaa7 ?                                (?)
   7: 0x2b5c1830701c ?                                (?)
   8: 0x2b5c12cd986c <MPI_Barrier+0x7c>               (libmpi.so.0)
   9: 0x2b5c10d71e42 <cs_halo_sync_var+0x372>         (libsaturne.so.0)
  10: 0x2b5c10cf1d4e <cs_matrix_vector_multiply+0x4e> (libsaturne.so.0)
  11: 0x2b5c10cf5bb5 <cs_sles_solve+0xb95>            (libsaturne.so.0)
  12: 0x2b5c10dea1bc <resmgr_+0x15ec>                 (libsaturne.so.0)
  13: 0x2b5c10e56873 <invers_+0x1a3>                  (libsaturne.so.0)
  14: 0x2b5c10ea0a1b <resolp_+0x6a9b>                 (libsaturne.so.0)
  15: 0x2b5c10e7d3e8 <navsto_+0x2dd0>                 (libsaturne.so.0)
  16: 0x2b5c10ec50b1 <tridim_+0xa0a1>                 (libsaturne.so.0)
  17: 0x2b5c10d0eccb <caltri_+0x51ab>                 (libsaturne.so.0)
  18: 0x2b5c10ced644 <cs_run+0x774>                   (libsaturne.so.0)
  19: 0x2b5c10ced9d5 <main+0x1f5>                     (libsaturne.so.0)
  20: 0x3363e1d8b4 <__libc_start_main+0xf4>         (libc.so.6)
  21: 0x408fe9     ?                                (?)
End of stack
 
---------------

Maybe this one gives you more information?

Thank you for your answers.

Sylvain
Yvan Fournier

Re: Problem with parallel calculation with big mesh

Post by Yvan Fournier »

Hello,

I would suspect that when running your case on more processors, it ran faster and had time to finish a time step before being killed.

Are you using a batch system? Is the time limit set correctly?

There may be an MPI installation/configuration issue. Can you run the small case which worked before on the same number of processors and for an equivalent duration, just to check?

Are you using user subroutines in which there might be a bug? (If so, can you post them here so that I may check.)

Also, I would recommend moving to version 2.0-rc2, but 2.0.1 (2.0 final + some new minor corrections) should be released in a few days, so you may prefer to wait...

Best regards,

  Yvan
Serra Sylvain

Re: Problem with parallel calculation with big mesh

Post by Serra Sylvain »

Hello,

I had already removed all the user subroutines.

I use a batch system:

==============

#PBS -l nodes=4:ppn=8
#PBS -l walltime=23:00:00
#PBS -l mem=1800mb
#
#PBS -j eo
#PBS -N dimplepara
#
#PBS -r n
#PBS -m abe
#PBS -M sylvain.serra@mines-douai.fr

=============

The problem occurred in less than two minutes.

I ran with exactly the same number of nodes, procs and wall time on the small case and there was no problem (the run did not finish, but it did not crash within 2 hours).

In both cases I used periodic boundaries... but they do not cause problems on one proc, or on one node with many procs...

On the cluster, we use openmpi-1.4.2.

We are waiting for the latest version of CS to do a new installation.

Best regards

Sylvain
Yvan Fournier

Re: Problem with parallel calculation with big mesh

Post by Yvan Fournier »

Hello,

This might not be related, but could you try again simply by removing the

#PBS -l mem=1800mb

line from your script? It was recommended in old PBS documentation, but does not seem necessary nowadays, and in the classical parallel case, where you want to have nodes to yourself, it is best not to set it and to let the system choose the defaults. We plan to remove this line in version 2.0.1.
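For reference, with just that line removed, the PBS header of the script you posted would simply become (everything else unchanged):

==============

#PBS -l nodes=4:ppn=8
#PBS -l walltime=23:00:00
#
#PBS -j eo
#PBS -N dimplepara
#
#PBS -r n
#PBS -m abe
#PBS -M sylvain.serra@mines-douai.fr

==============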

Best regards,

  Yvan
Serra Sylvain

Re: Problem with parallel calculation with big mesh

Post by Serra Sylvain »

Hello Yvan, and thank you very much.

Already 21 iterations and no problem...

I think it is working.

Best regards

Sylvain