Some problems on BlueGene/Q

zeph67
Posts: 53
Joined: Tue Oct 23, 2012 5:54 pm

Post by zeph67 »

Good morning,

I'm using Code_Saturne v3.0.1 and I'm encountering some problems on the BlueGene/Q called "Turing" of the IDRIS Center.

Batch directives can only be given in LoadLeveler syntax, whereas the template provided with Code_Saturne is written for SLURM, which does not seem to be interpreted on this Blue Gene. As a consequence, only bg_size (the requested number of nodes) can be specified.

So the only way to specify the number of ranks per node is to pass it to the runjob command in the run_solver.sh file.

My question is: is there a proper way to enforce a --ranks-per-node option in the run_solver.sh file?

In the attached tar archive, there are the runcase, run_solver.sh and listing files for a case with 64 nodes, 1 core per node. For information, the preprocessor stage runs properly, and the calculation crashes after 1 time step, in a way that resembles the bug encountered when running CS on a single process (I have encountered that for years, with several versions of CS).

Thanks in advance !

EDIT: there are also 2 LOADL templates provided with CS, but neither of them is fully satisfactory.
I think the key to my question is to find a way to generate a suitable run_solver.sh myself.
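
A minimal sketch of one way to patch an already generated run_solver.sh by hand, assuming the script contains a line that invokes runjob and that this line does not yet set --ranks-per-node. The helper name add_ranks_per_node is hypothetical (it is not part of Code_Saturne), and the sample script line is only illustrative, not the content of the attached file.

```python
import re

def add_ranks_per_node(script_text, ranks_per_node):
    """Insert '--ranks-per-node N' right after the runjob command name.

    Assumes runjob is the first word of its line (possibly with a path).
    Comment lines and lines already carrying the option are left unchanged.
    """
    out = []
    for line in script_text.splitlines():
        m = re.match(r"^(\s*\S*runjob)(\s.*)?$", line)
        if m and "--ranks-per-node" not in line:
            rest = m.group(2) or ""
            line = f"{m.group(1)} --ranks-per-node {ranks_per_node}{rest}"
        out.append(line)
    return "\n".join(out) + "\n"

if __name__ == "__main__":
    # Illustrative script content only.
    sample = "#!/bin/bash\nrunjob --np 64 : ./cs_solver --mpi\n"
    print(add_ranks_per_node(sample, 1), end="")
```

Patching the generated script has to be redone for every run, so changing how the runjob line is built in the Python scripts is probably the more durable route.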
Attachments
zeph67_rpn.tar
(60 KiB)
Yvan Fournier
Posts: 4251
Joined: Mon Feb 20, 2012 3:25 pm

Post by Yvan Fournier »

Hello,

The templates are just that: templates, which may need to be adapted. Unfortunately, there is no "standard" equivalent to MPI for batch schedulers and resource managers, so it is basically impossible to have a "one size fits all" solution.

You may either adapt the template or, in the code_saturne.cfg file (either $HOME/.code_saturne.cfg or better, $install_prefix/etc/code_saturne.cfg), specify an absolute path to a different template.
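
As a rough sketch of the second option, the snippet below reads a configuration file and reports which batch template it points to. It assumes the setting is a key named batch in an [install] section, as I believe the shipped code_saturne.cfg.template suggests; please check the template of your installed version. The helper name batch_template and the template path are placeholders.

```python
import configparser
import os
import tempfile

def batch_template(cfg_path):
    """Return the 'batch' entry of the [install] section, or None if unset."""
    # interpolation=None so that '%' characters in other entries are harmless
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read(cfg_path)
    return cfg.get("install", "batch", fallback=None)

# Hypothetical configuration content; the template path is a placeholder.
SAMPLE_CFG = "[install]\nbatch = /path/to/my/batch.LOADL_custom\n"

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "code_saturne.cfg")
        with open(path, "w") as f:
            f.write(SAMPLE_CFG)
        value = batch_template(path)
        print("batch template:", value, "| absolute path:", os.path.isabs(value))
```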

We have not had a machine with LoadLeveler since our Blue Gene/P, so the parsing/use of LoadLeveler is not much tested (everything is based on the documentation only, not on real testing), but if you need to fix it, it is handled in bin/cs_exec_environment.py (once installed, $install_prefix/lib/python<version>/site-packages/code_saturne/cs_exec_environment.py), with an additional part for MPMD handling in cs_case.py.

I have not looked at the other issue yet. I'll check.

Also, what Blue Gene/Q driver and e-fix do you have ? On our configuration, with driver 1.2, performance of Code_Saturne is degraded by a factor of 3 to 4 due to higher latencies than with previous drivers.
This is fixed with e-fix 38 (tested on other Blue Gene/Q's; unfortunately, ours is still at e-fix 36...)

Regards,

Yvan

Post by Yvan Fournier »

Hello again,

just a question regarding the error you obtain: do you also have error_* files in your execution directory ? If you do, can you check them also ?

Otherwise, what compiler do you use ? gcc or xlc ? If you use gcc, some functions that are deactivated or signals trapped might not be handled correctly, if the compiler does not define the __bgq__ macro... (both xlc and gcc define it on ours).

In our case, we had to trap signal 6 (SIGABRT) to avoid issues. If you link with additional libraries, you might need to trap signal 5 (SIGTRAP) also (done in src/base/cs_base.c, search for __bgq__).

Regards,

Yvan

Post by zeph67 »

Thank you Yvan for your help.

I totally agree with what you say about templates. But on the Turing BG/Q, parameters such as tasks_per_node are not allowed in the batch directives (I made several tests).
That is why I needed to force the option in run_solver.sh, and your advice helps a lot.

Concerning the Blue Gene/Q driver and e-fix, I don't have this information. How can I get it?

In the results directory, there is a single error file (probably because the error occurs on a single process). Here is the content of the error file (it shows signals 5 and 6):

Code:

Signal 5 intercepted!

Call stack:
   1: 0x1ece000    ?                                (?)
   2: 0x40ac2b4367c3c4fe ?                                (?)
   3: 0x1032320    ?                                (?)
   4: 0x102da28    ?                                (?)
   5: 0x102479c    ?                                (?)
   6: 0x100219c    ?                                (?)
   7: 0x103d71c    ?                                (?)
   8: 0x103d8bc    ?                                (?)
   9: 0x1ae0508    ?                                (?)
  10: 0x1ae0804    ?                                (?)
End of stack

Abort(1) on node 0 (rank 0 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
SIGABRT signal (Abort) intercepted.

Call stack:
   1: 0x1ece000    ?                                (?)
   2: 0x1aea328    ?                                (?)
   3: 0x17346d0    ?                                (?)
   4: 0x16df34c    ?                                (?)
   5: 0x11e9c04    ?                                (?)
   6: 0x1ece000    ?                                (?)
   7: 0x40ac2b4367c3c4fe ?                                (?)
   8: 0x1032320    ?                                (?)
   9: 0x102da28    ?                                (?)
  10: 0x102479c    ?                                (?)
  11: 0x100219c    ?                                (?)
  12: 0x103d71c    ?                                (?)
  13: 0x103d8bc    ?                                (?)
  14: 0x1ae0508    ?                                (?)
  15: 0x1ae0804    ?                                (?)
End of stack

Abort(1) on node 0 (rank 0 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

Finally, concerning the compiler, I use xlc.


Thanks !

Post by Yvan Fournier »

Hello,

Do you have a "listing" file ? It may contain more details on problems leading to the error. It will also tell us what driver and e-fix you have.

Finally, when adapting your LoadLeveler template, add or remove LoadLeveler directive lines as necessary so that the script is accepted by the submission filter (this depends on local administrator choices). The Code_Saturne GUI tries to adapt to several known possibilities, but does not handle all situations, so if you are lucky, the GUI may still allow you to edit the batch parameters; at worst, you might need to do it using an editor.

Regards,

Yvan

Post by zeph67 »

The listing file is contained in the tar archive I attached to the very first message of this topic. Unfortunately, it gives no clear indication of the error encountered.

Concerning the GUI, I just never use it. And I tried several things in the batch directives. My ranks-per-node problem was not due to CS, but to some restrictions on the machine. The ranks-per-node value can only be passed to the runjob command, and that is why I was stuck. Your pointers to cs_exec_environment.py were more than helpful! But the computation still doesn't run further than 1 time step.

I'm starting to suspect a bug of my own... I'll let you know.

Thanks !

Post by Yvan Fournier »

Hello,

Are you running with an optimized or debug build? Although slower, it may be safer to start with a version configured with "--enable-debug", which may help detect upstream errors.

Also, you can try to use the "addr2line" utility to match addresses in your stack trace with files (and lines in debug mode) in the code. Finally, you may try to remove trapping of SIGTRAP (which we do only on Blue Gene/Q), by copying src/base/cs_base.c to your case's SRC directory, and removing all lines involving sigtrap and SIGTRAP in the file (I'm not sure whether it will make things better or worse, but it is easy to test).
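
For the addr2line step, here is a small sketch that collects the addresses from an error file and builds the corresponding command line. The function names are hypothetical helpers, the frame format is assumed to match the "N: 0x... ?" lines of the error file posted earlier, and on a cross-compiled system the addr2line matching the compute-node toolchain may have to be used instead of the front-end one.

```python
import re
import shlex

# Matches stack frame lines such as "   3: 0x1032320    ?    (?)"
FRAME_RE = re.compile(r"^\s*\d+:\s+(0x[0-9a-fA-F]+)\s")

def stack_addresses(error_text):
    """Return the frame addresses in order of appearance, without duplicates."""
    seen, addrs = set(), []
    for line in error_text.splitlines():
        m = FRAME_RE.match(line)
        if m and m.group(1) not in seen:
            seen.add(m.group(1))
            addrs.append(m.group(1))
    return addrs

def addr2line_command(executable, addrs):
    """Build an addr2line call: -e selects the binary, -f also prints function names."""
    return "addr2line -f -e " + shlex.quote(executable) + " " + " ".join(addrs)

if __name__ == "__main__":
    trace = "Call stack:\n   1: 0x1ece000    ?    (?)\n   2: 0x1032320    ?    (?)\nEnd of stack\n"
    print(addr2line_command("./cs_solver", stack_addresses(trace)))
```

File names and line numbers are only resolved if the executable was built with debug information.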

Regards,

Yvan

Post by zeph67 »

Yvan Fournier wrote: Are you running with an optimized or debug build? Although slower, it may be safer to start with a version configured with "--enable-debug", which may help detect upstream errors.
Before this advice, I was running an optimized build. I then installed a debug build, and the error reported was:

Code:

/gpfs5r/workgpfs/rech/iam/riam642/ETUDES/taycoupoi/hybride/RESU/20131107-1405/cs_solver was built without MPI support, so option "--mpi" may not be used.

An interesting point is that I could not build the debug version with bgxlf95_r as the compiler; I had to switch to bgxlf2003_r.
Concerning the error message, it is strange because I followed the cross-compiling instructions of the tutorial. How can I force MPI support to be enabled?

Yvan Fournier wrote: Finally, you may try to remove trapping of SIGTRAP (which we do only on Blue Gene/Q), by copying src/base/cs_base.c to your case's SRC directory, and removing all lines involving sigtrap and SIGTRAP in the file (I'm not sure it will make things better or worse, but it is easy to test).
It got even worse! The abort was somewhat delayed, but there was no new useful information.

Thanks for your help.

I think, at this stage, it is becoming clear that all of this is simply an installation issue...

Post by Yvan Fournier »

Hello,

This is starting to look more and more like an install issue. I recommend re-checking your configure options, and comparing them to those of the installation manual examples (at least for the compute node part). I could also send you our config.log files, but I won't have access to our network until next Tuesday, so a double check might help solve the issue.

The most important parts are :
CC=mpixlc_r (or CC=mpixlc)
CXX=mpixlcxx_r (or CXX=mpixlcxx)
FC=bgf95_r (or FC=bgf95)

Note that we use the MPI wrappers for C (and C++ if you need to link with a library requiring a C++ runtime), but not for Fortran (otherwise libtool may wreak havoc with the additional libraries).
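
To double check what a given build was actually configured with, a small sketch along these lines can extract the compiler settings from its config.log. It relies on the invocation line that Autoconf writes near the top of config.log (prefixed with "  $ "); the helper name compiler_settings is hypothetical, and arguments containing spaces are not handled.

```python
def compiler_settings(config_log_text, names=("CC", "CXX", "FC")):
    """Return the NAME=value compiler assignments from the configure invocation line."""
    for line in config_log_text.splitlines():
        # Autoconf records the invocation as "  $ <configure> <arguments...>"
        if line.startswith("  $ "):
            settings = {}
            for word in line[4:].split():
                key, sep, value = word.partition("=")
                if sep and key in names:
                    settings[key] = value
            return settings
    return {}

if __name__ == "__main__":
    # Illustrative excerpt, not a real log.
    log = "Invocation command line was\n\n  $ ../configure --host=bluegeneq CC=mpixlc_r FC=bgf95_r\n"
    print(compiler_settings(log))
```

Running it on the config.log of the compute build shows at a glance whether CC and CXX point to the MPI wrappers; a missing CC entry means configure fell back to its default compiler search.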

Don't forget to revert to the unmodified version of cs_base.c

Regards,

Yvan

Post by zeph67 »

Thank you Yvan.

Here are my configure commands:

Code:

SRC_PATH='/workgpfs/rech/iam/riam642/projets/Code_Saturne/3.0.1'
INSTALL_PATH='/workgpfs/rech/iam/riam642/projets/Code_Saturne/3.0.1'
$SRC_PATH/code_saturne-3.0.1/configure \
--prefix=$INSTALL_PATH/build/frontend \
--disable-gui

and

Code:

SRC_PATH='/workgpfs/rech/iam/riam642/projets/Code_Saturne/3.0.1'
INSTALL_PATH='/workgpfs/rech/iam/riam642/projets/Code_Saturne/3.0.1'
$SRC_PATH/code_saturne-3.0.1/configure \
--prefix=$INSTALL_PATH/build/compute \
--build=ppc64 --host=bluegeneq \
--disable-gui \
--enable-debug \
CC=mpixlc \
CXX=mpixlcxx \
FC=bgf95

Still the same problem at the beginning of the calculation stage. Also, the --enable-debug option makes the installation fail, with the error "Compilation failed for file stdtcl.f90":

Code:

"/workgpfs/rech/iam/riam642/projets/Code_Saturne/3.0.1/code_saturne-3.0.1/src/base/stdtcl.f90", 1500-004 (U) INTERNAL COMPILER ERROR while compiling stdtcl_.  Compilation ended.  Contact your Service Representative and provide the following information: Internal abort. For more information visit: http://www.ibm.com/support/docview.wss?uid=swg21110810
1586-346 (U) An error occurred during code generation.  The code generation return code was 1.
1501-511  Compilation failed for file stdtcl.f90.
make[3]: *** [stdtcl.lo] Erreur 1
make[3]: quittant le répertoire « /gpfs5r/workgpfs/rech/iam/riam642/projets/Code_Saturne/3.0.1/scriptscf/src/base »
make[2]: *** [all-recursive] Erreur 1
make[2]: quittant le répertoire « /gpfs5r/workgpfs/rech/iam/riam642/projets/Code_Saturne/3.0.1/scriptscf/src »
make[1]: *** [all-recursive] Erreur 1
make[1]: quittant le répertoire « /gpfs5r/workgpfs/rech/iam/riam642/projets/Code_Saturne/3.0.1/scriptscf »
make: *** [all] Erreur 2

Thank you!