
Code Saturne 5.0.5 crash in linear solver

Posted: Wed Feb 14, 2018 10:16 am
by Antech
Hello.
I understand that this is an intermittent problem, so my report alone is unlikely to be enough to track down the bug, but maybe it will be a little helpful... This is a "pure aerodynamic" isothermal case that was calculated successfully with Saturne 5.0.5 on different machines (CentOS 6.5 and Kubuntu 14 + updates) with Catalyst enabled (different setups with ParaView 4.3 and 5.2, OpenMPI 1.8 and 3.0). There were some rare Catalyst-related crashes (about 2 over dozens of runs, with both ParaView versions), but those can clearly be avoided by running without Catalyst. Now I have encountered a bug that seems more important. Saturne 5.0.5 was running in parallel (3 processes) on a laptop with CentOS 6.5, with ParaView 5.2 (Catalyst) and OpenMPI 3.0. Unexpectedly the solver crashed, and the error was related to some linear solver subroutines, not to Catalyst. So I am attaching the case without the mesh (it is not very small and I don't think it can cause the issue; this is an Ansys Meshing tetra mesh of the kind we have used without trouble for years on many cases, although its quality may not be the best).
Here is the error text:

Code: Select all

   ** BOUNDARY CONDITIONS FOR SMOOTH WALLS
   ---------------------------------------
------------------------------------------------------------
                                         Minimum     Maximum
------------------------------------------------------------
   Rel velocity at the wall uiptn : -0.19243E+02 0.15432E+02
   Friction velocity        uet   :  0.10963E-01 0.54561E+01
   Friction velocity        uk    :  0.10960E-01 0.71311E+01
   Dimensionless distance   yplus :  0.58504E-01 0.44998E+02
   ------------------------------------------------------
   Nb of reversal of the velocity at the wall   :         90
   Nb of faces within the viscous sub-layer     :     211277
   Total number of wall faces                   :     219370
------------------------------------------------------------


Incoming flow detained for       5386 outlet faces on      10064
SIGSEGV signal (forbidden memory area access) intercepted!

Call stack:
   1: 0x7f0fcbb22aa3 <+0x360aa3>                      (libsaturne.so.5)
   2: 0x7f0fcbb21dba <cs_matrix_set_coefficients+0xea> (libsaturne.so.5)
   3: 0x7f0fcbb15ce7 <cs_grid_coarsen+0x7f7>          (libsaturne.so.5)
   4: 0x7f0fcbb44ca6 <cs_multigrid_setup_conv_diff+0x3e6> (libsaturne.so.5)
   5: 0x7f0fcbb45ee1 <+0x383ee1>                      (libsaturne.so.5)
   6: 0x7f0fcbb51fa0 <+0x38ffa0>                      (libsaturne.so.5)
   7: 0x7f0fcbb59474 <cs_sles_it_solve+0xf64>         (libsaturne.so.5)
   8: 0x7f0fcbb4be18 <cs_sles_solve+0x278>            (libsaturne.so.5)
   9: 0x7f0fcbed4f08 <__cs_c_bindings_MOD_sles_solve_native+0x1ed> (libsaturne.so.5)
  10: 0x7f0fcb9bd332 <resopv_+0x70b2>                 (libsaturne.so.5)
  11: 0x7f0fcb9a13c0 <navstv_+0x3ffc>                 (libsaturne.so.5)
  12: 0x7f0fcb9ce711 <tridim_+0x4a31>                 (libsaturne.so.5)
  13: 0x7f0fcb865ab4 <caltri_+0x1e94>                 (libsaturne.so.5)
  14: 0x7f0fcb837c2a <cs_run+0x55a>                   (libsaturne.so.5)
  15: 0x7f0fcb837da5 <main+0x135>                     (libsaturne.so.5)
  16: 0x7f0fc9800b45 <__libc_start_main+0xf5>         (libc.so.6)
  17: 0x400959     <>                               (cs_solver)
End of stack
This is almost the basic case setup, except for a 1000-iteration limit on the linear solver for velocity and curvature correction enabled for the SST turbulence model (the flow is swirling, and one must not use 2-equation models without curvature correction for such cases, otherwise the divergence from experiments is dramatic; I was running this test to compare SST with curvature correction against RSM SSG results for my design). Curvature correction was enabled in the usipsu subroutine (irccor=1), and the linear iterations for velocity were limited in the cs_user_linear_solvers function, as recommended on this forum:

Code: Select all

cs_sles_it_t *c = cs_sles_it_define(CS_F_(u)->id,   /* Field identifier */
                                    NULL,            /* Field name (NULL: use field id) */
                                    CS_SLES_JACOBI,  /* Solver type */
                                    0,               /* Preconditioning polynomial degree
                                                        (0: diagonal, -1: non-preconditioned) */
                                    1000);           /* Maximum number of iterations */
The corresponding user sources are in the attached archive.
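For reference, this call goes in the cs_user_linear_solvers function in cs_user_parameters.c. Here is a minimal sketch of the enclosing function (the include list is indicative, based on the standard user example file, and may differ slightly from the attached sources):

Code: Select all

/* Minimal sketch (assumed layout): limit Jacobi iterations for velocity
   in cs_user_linear_solvers(), defined in cs_user_parameters.c. */

#include "cs_defs.h"
#include "cs_field.h"
#include "cs_field_pointer.h"  /* CS_F_(u) macro */
#include "cs_sles.h"
#include "cs_sles_it.h"
#include "cs_prototypes.h"     /* cs_user_linear_solvers() prototype */

void
cs_user_linear_solvers(void)
{
  /* Use a Jacobi solver for the velocity field, capped at 1000 iterations.
     The returned cs_sles_it_t pointer could be kept for further tuning,
     but it is not needed here. */

  cs_sles_it_define(CS_F_(u)->id,    /* field id (velocity) */
                    NULL,            /* name (NULL: use the field id) */
                    CS_SLES_JACOBI,  /* solver type */
                    0,               /* preconditioning polynomial degree
                                        (0: diagonal) */
                    1000);           /* maximum number of iterations */
}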

Re: Code Saturne 5.0.5 crash in linear solver

Posted: Wed Feb 14, 2018 11:25 am
by Yvan Fournier
Hello,

Would it be possible for you to run this case with a debug build? In that case we could get the line information related to the crashes.

Otherwise, could you send me the mesh_input and "partition_output" files from the failed run directory (possibly through some means other than the forum, as they are a bit large) so I can try to reproduce the issue? I am not sure I can reproduce it, as it might be partition dependent, and the multigrid aggregation depends on the matrix coefficients, so even rounding differences with different compilers might lead to slightly different aggregation patterns; but at least I can try.

Best regards,

Yvan

Re: Code Saturne 5.0.5 crash in linear solver

Posted: Wed Feb 14, 2018 1:32 pm
by Antech
Hello, Yvan. Glad to see you helping everyone :)
Here is the mesh archive for this variant: https://cloud.mail.ru/public/B1nb/1CAHRWaCt
Unfortunately there is no partition_output file for this run (I likely deleted it while preparing the case for the forum). But the mesh is exactly the one that was used in the failed run.

In any case, it is unlikely that the bug will arise again at the same point: after I restarted the calculation, it reached iteration 249, although in the previous run it crashed at iteration 222. The only differences are an autosave every 30 iterations and running without Catalyst (the Catalyst writer is still there, but ParaView was not launched because there is nothing to check for now; I already verified that the velocity field does not change significantly, as this is an Upwind => SOLU correction run).

Regarding the linear solver iteration limitation, I cannot say whether it works or not, because there is no "iteration peak" issue in this case. I hope to check it with other cases.

Re: Code Saturne 5.0.5 crash in linear solver

Posted: Thu Feb 15, 2018 12:22 am
by Yvan Fournier
Hello,

OK, there might be a memory error (out-of-bounds access or uninitialized value) if the reproducibility is "semi-random", so I'll try to run your case under AddressSanitizer (the mesh is probably too large to run it under Valgrind).

I'll try to do this in the next few days and will keep you informed.

Best regards,

Yvan

Re: Code Saturne 5.0.5 crash in linear solver

Posted: Mon Feb 19, 2018 9:09 am
by Antech
Hello. Thanks for your attention to this issue. I just want to mention that there was another similar error on another machine, a workstation with Ubuntu 14.04 and Code Saturne 5.0.5 in Catalyst mode with ParaView 5.2. Both installations use OpenMPI 3.0.0 (I used 1.8.4 before but decided to update from such an old MPI version). So the problem may be caused not only by changes in the linear solvers but also by this OpenMPI update. Here is the call stack:

Code: Select all

SIGTERM signal (termination) received.
--> computation interrupted by environment.

Call stack:
   1: 0x7f2d227b0563 <+0x4563>                        (mca_btl_vader.so)
   2: 0x7f2d2c7105bc <opal_progress+0x3c>             (libopen-pal.so.40)
   3: 0x7f2d2c716e35 <sync_wait_mt+0xb5>              (libopen-pal.so.40)
   4: 0x7f2d2e46695a <ompi_request_default_wait_all+0x39a> (libmpi.so.40)
   5: 0x7f2d2e4b35cf <ompi_coll_base_allreduce_intra_recursivedoubling+0x40f> (libmpi.so.40)
   6: 0x7f2d2e4794eb <PMPI_Allreduce+0x18b>           (libmpi.so.40)
   7: 0x7f2d2fac6002 <+0x36c002>                      (libsaturne.so.5)
   8: 0x7f2d2fac84d7 <cs_multigrid_setup_conv_diff+0x417> (libsaturne.so.5)
   9: 0x7f2d2facb851 <+0x371851>                      (libsaturne.so.5)
  10: 0x7f2d2fad850e <cs_sles_it_setup+0x28e>         (libsaturne.so.5)
  11: 0x7f2d2fada87d <cs_sles_it_solve+0x1e2d>        (libsaturne.so.5)
  12: 0x7f2d2facd548 <cs_sles_solve+0x268>            (libsaturne.so.5)
  13: 0x7f2d2fe5f1a6 <__cs_c_bindings_MOD_sles_solve_native+0x1b6> (libsaturne.so.5)
  14: 0x7f2d2f959ba2 <resopv_+0xbe0d>                 (libsaturne.so.5)
  15: 0x7f2d2f937ebf <navstv_+0x44c0>                 (libsaturne.so.5)
  16: 0x7f2d2f96703d <tridim_+0x4b3d>                 (libsaturne.so.5)
  17: 0x7f2d2f8001c2 <caltri_+0x1e62>                 (libsaturne.so.5)
  18: 0x7f2d2f7d190f <cs_run+0x56f>                   (libsaturne.so.5)
  19: 0x7f2d2f7d125f <main+0x12f>                     (libsaturne.so.5)
  20: 0x7f2d2f194f45 <__libc_start_main+0xf5>         (libc.so.6)
  21: 0x400829     <>                               (cs_solver)
End of stack

Re: Code Saturne 5.0.5 crash in linear solver

Posted: Tue Feb 20, 2018 12:49 pm
by Yvan Fournier
Hello,

For that same case, you might also have error_r* or error_n* files which contain different stack traces. If so, could you post one or two of those ?

In any case, things would be more interesting with a debug build, as we could get the matching line number from the stack traces.

I have otherwise run about 350 iterations of that case, on 3 ranks, without Catalyst (as my version of Catalyst is different, the scripts did not work, and I do not have enough space on my machine to keep too many versions of ParaView and Catalyst). I did not experience any crash (and AddressSanitizer did not complain). In theory, the problem should be independent of Catalyst, unless there is a memory overwrite there (none detected for small cases using Valgrind, but one is never 100% sure).

Best regards,

Yvan

Re: Code Saturne 5.0.5 crash in linear solver

Posted: Tue Feb 20, 2018 1:20 pm
by Antech
Hello.
There was one more error file, with a very similar stack. I usually delete the crashed run directory, so I can't upload this file, sorry. I also don't have a debug Saturne build on any machine for now (I must work on my primary tasks and can't compile the debug version at the moment). But I am also performing a kind of experiment: I have several release builds, so I launched the same Saturne 5.0.5 with OpenMPI 1.8.4 and ParaView 4.3 (the "old dependencies edition") on one of my validation cyclone cases. Catalyst is on, but I don't think it affects the bug (when a crash is due to Catalyst there are usually some VTK libraries/routines in the call stack; that happens, but it is not critical because you can always re-run without Catalyst). For now, no errors have appeared (this bug is rare enough; it only appeared once during a 2.5-day weekend calculation).
Thank you for testing with the memory bounds checker.

Re: Code Saturne 5.0.5 crash in linear solver

Posted: Wed Feb 21, 2018 9:09 am
by Antech
Hello.
I compiled a debug Saturne 5.0.5 build with OpenMPI 3.0.0 and ParaView 5.2 (only Saturne itself is built in the debug configuration). I used the --enable-debug option for the configure script; I hope that is correct.
The bug appeared again, twice in a row! This time on the CentOS laptop. But I don't see any line numbers in the error messages, although I verified that it was indeed the debug configuration that was launched. So I am attaching the whole case with the mesh; the only files I deleted from the archive are the restart files, because they are quite large (please request them if you need them). Here it is: https://cloud.mail.ru/public/7WiK/as4NuUfxi

The error messages from the two most recent runs are as follows. First run (the one in the archive):

Code: Select all

SIGSEGV signal (forbidden memory area access) intercepted!

Call stack:
   1: 0x7f255d78ca50 <+0x35a50>                       (libc.so.6)
   2: 0x7f255fc68f59 <+0x530f59>                      (libsaturne.so.5)
   3: 0x7f255fc697cc <+0x5317cc>                      (libsaturne.so.5)
   4: 0x7f255fc6d866 <cs_matrix_set_coefficients+0xd4> (libsaturne.so.5)
   5: 0x7f255fc5e2f8 <cs_grid_coarsen+0x9a8>          (libsaturne.so.5)
   6: 0x7f255fc9945e <cs_multigrid_setup_conv_diff+0x40b> (libsaturne.so.5)
   7: 0x7f255fc9904c <cs_multigrid_setup+0x40>        (libsaturne.so.5)
   8: 0x7f255fc984fe <+0x5604fe>                      (libsaturne.so.5)
   9: 0x7f255fcaf238 <cs_sles_pc_setup+0x5b>          (libsaturne.so.5)
  10: 0x7f255fc9fb8b <+0x567b8b>                      (libsaturne.so.5)
  11: 0x7f255fca7aca <cs_sles_it_setup+0xc8>          (libsaturne.so.5)
  12: 0x7f255fca7c6b <cs_sles_it_solve+0x19a>         (libsaturne.so.5)
  13: 0x7f255fc9cf12 <cs_sles_solve+0x19f>            (libsaturne.so.5)
  14: 0x7f255fc9e98b <cs_sles_solve_native+0x54a>     (libsaturne.so.5)
  15: 0x7f2560623192 <__cs_c_bindings_MOD_sles_solve_native+0x2c9> (libsaturne.so.5)
  16: 0x7f255faaeb6f <resopv_+0x1ebff>                (libsaturne.so.5)
  17: 0x7f255fa3316c <navstv_+0x9710>                 (libsaturne.so.5)
  18: 0x7f255fae233a <tridim_+0xb41e>                 (libsaturne.so.5)
  19: 0x7f255f7ed90c <caltri_+0x3e78>                 (libsaturne.so.5)
  20: 0x7f255f7ae928 <cs_run+0x4ac>                   (libsaturne.so.5)
  21: 0x7f255f7aebce <main+0x165>                     (libsaturne.so.5)
  22: 0x7f255d778b45 <__libc_start_main+0xf5>         (libc.so.6)
  23: 0x400819     <>                               (cs_solver)
End of stack
Second run (not included in the archive, but saved):

Code: Select all

SIGTERM signal (termination) received.
--> computation interrupted by environment.

Call stack:
   1: 0x7fc4140aba50 <+0x35a50>                       (libc.so.6)
   2: 0x7fc408d5c045 <ompi_coll_libnbc_progress+0x25> (mca_coll_libnbc.so)
   3: 0x7fc4136f8210 <opal_progress+0x40>             (libopen-pal.so.40)
   4: 0x7fc4136fe965 <sync_wait_mt+0x155>             (libopen-pal.so.40)
   5: 0x7fc40938836a <mca_pml_ob1_recv+0x42a>         (mca_pml_ob1.so)
   6: 0x7fc414eec697 <ompi_coll_base_allreduce_intra_recursivedoubling+0x7d7> (libmpi.so.40)
   7: 0x7fc414ea45f6 <PMPI_Allreduce+0x1c6>           (libmpi.so.40)
   8: 0x7fc4165b3f92 <+0x55cf92>                      (libsaturne.so.5)
   9: 0x7fc4165b84e2 <cs_multigrid_setup_conv_diff+0x48f> (libsaturne.so.5)
  10: 0x7fc4165b804c <cs_multigrid_setup+0x40>        (libsaturne.so.5)
  11: 0x7fc4165b74fe <+0x5604fe>                      (libsaturne.so.5)
  12: 0x7fc4165ce238 <cs_sles_pc_setup+0x5b>          (libsaturne.so.5)
  13: 0x7fc4165beb8b <+0x567b8b>                      (libsaturne.so.5)
  14: 0x7fc4165c6aca <cs_sles_it_setup+0xc8>          (libsaturne.so.5)
  15: 0x7fc4165c6c6b <cs_sles_it_solve+0x19a>         (libsaturne.so.5)
  16: 0x7fc4165bbf12 <cs_sles_solve+0x19f>            (libsaturne.so.5)
  17: 0x7fc4165bd98b <cs_sles_solve_native+0x54a>     (libsaturne.so.5)
  18: 0x7fc416f42192 <__cs_c_bindings_MOD_sles_solve_native+0x2c9> (libsaturne.so.5)
  19: 0x7fc4163cdb6f <resopv_+0x1ebff>                (libsaturne.so.5)
  20: 0x7fc41635216c <navstv_+0x9710>                 (libsaturne.so.5)
  21: 0x7fc41640133a <tridim_+0xb41e>                 (libsaturne.so.5)
  22: 0x7fc41610c90c <caltri_+0x3e78>                 (libsaturne.so.5)
  23: 0x7fc4160cd928 <cs_run+0x4ac>                   (libsaturne.so.5)
  24: 0x7fc4160cdbce <main+0x165>                     (libsaturne.so.5)
  25: 0x7fc414097b45 <__libc_start_main+0xf5>         (libc.so.6)
  26: 0x400819     <>                               (cs_solver)
End of stack
I've never seen this in Saturne version 4.x. It looks like something in the multigrid solver in version 5.0.5 in combination with OpenMPI 3.0.0 (I used OpenMPI 1.8.4 earlier). I don't think the ParaView version can cause this, because there are no VTK libs in the stack. I have now switched back to Saturne 5.0.4 + OpenMPI 1.8.4 + ParaView 4.3 and restarted the same case (collecting statistics). But I hope you will investigate this issue a bit (it's important to be able to use the 5.x version, because 4.x has a bug with particle inlets and I need particle tracing with statistics in many cases).

Thanks for your attention.

Re: Code Saturne 5.0.5 crash in linear solver

Posted: Wed Feb 21, 2018 10:22 am
by Yvan Fournier
Hello,

The most significant change for multigrid between versions 4 and 5 was switching to using multigrid as a preconditioner rather than as a solver, but there has not been a significant change in the aggregation part itself.

For the debug info, you do not have the line directly in the backtrace, but you can use the "addr2line" utility for this: "addr2line -e ./cs_solver <address in call stack>" in the failed run directory.

It will be interesting to see if the MPI library really has an impact on this (I would expect the compiler version to have more influence on that side, but as long as the bug is not identified/understood, we can't be sure).

Also, as this may be partitioning dependent, I'll try to run this with the partition_output you may have for a case which has crashed (I'll download your files this evening).

Best regards,

Yvan

Re: Code Saturne 5.0.5 crash in linear solver

Posted: Wed Feb 21, 2018 10:44 am
by Antech
Thanks for your response and recommendation.

There is no partition_output file (it is only written for graph-based partitioning by default). But this issue was observed on various machines with various Linux distributions and various numbers of partitions (3, 16, maybe others) in two different cases (both cyclones).

There is also no cs_solver in the run directory (probably because there are no user functions to compile in this case). I tried addr2line on the libexec/code_saturne/cs_solver binary, but without success:

Code: Select all

[antech@AntechPC code_saturne]$ pwd
/Programs/Code_Saturne-5.0.5-PV-5.2.0-openmpi-3.0.0/build-catalyst-debug/libexec/code_saturne
[antech@AntechPC code_saturne]$ addr2line -e ./cs_solver 0x400819
??:0