Mesh_Check SIGSEV signal intercepted

Post by **knewlands** » Tue Sep 18, 2012 1:07 pm

Hello,

I'm having some issues with the mesh pre-processor in Code_Saturne v2.0.5. When I check the mesh I obtain the following error regardless of whether I apply periodicity of translation or not:

===============================================================

CALCULATION PREPARATION
=======================

===========================================================

Reading file: preprocessor_output
SIGSEGV signal (forbidden memory area access) intercepted!

Call stack:
1: 0x35f5430280 ? (?)
2: 0x2b6eb1b3a5b3 <uiusar_+0x182> (libsaturne.so.0)
3: 0x2b6eb1c4ee82 <iniusi_+0x11de> (libsaturne.so.0)
4: 0x2b6eb1c4d5a3 <initi1_+0x23> (libsaturne.so.0)
5: 0x2b6eb1a39b0d <cs_run+0xe1> (libsaturne.so.0)
6: 0x2b6eb1a3a8e5 <main+0x1fa> (libsaturne.so.0)
7: 0x35f541d974 <__libc_start_main+0xf4> (libc.so.6)
8: 0x400739 <main+0x51> (cs_solver)
End of stack

I am connecting to a cluster on which Code_Saturne has been installed and I've attached the installer_log. In an attempt to assess whether there is a problem with the mesh, I checked it through the Code_Saturne installation on my local machine and it seems to work fine. Any idea where the problem could be?

Thank you,

Kristin

Post by **knewlands** » Tue Sep 18, 2012 3:56 pm

Apologies,

I forgot to mention that the mesh is in CGNS format. I'm not sure if this helps determine where the problem might be?

Kind regards,

Kristin

Post by **Yvan Fournier** » Tue Sep 18, 2012 10:16 pm

Hello,

The routine in which the crash occurs is related to reading the XML file to determine the size of work arrays. Strangely, uiusar should be called by memini, called by cs_run after reading the mesh, and not by iniusi, as the backtrace seems to indicate. This suggests the call stack has been corrupted, so we do not really know where the crash occurred. That type of crash is probably related either to an out-of-bounds error in a small fixed-size array (probably not a dynamically-allocated one, which would probably lead to a SIGSEGV but not corrupt the stack, and would be easier to detect with tools such as Valgrind), or to an installation issue (such as involuntarily mixing libraries built with different versions of a compiler, when multiple compilers are available and the build environment is somewhat mixed up, which easily leads to this type of crash and may be a pain to debug).

In any case, the crash is in the kernel, so the preprocessor has already imported the mesh and passed it to the kernel. The fact that your mesh is in CGNS format is therefore very unlikely to be related to the crash.
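The comparison between the reported and the expected caller of uiusar can be checked mechanically. Below is a minimal sketch in Python, assuming the backtrace format shown in the first post. `parse_frames` and `caller_of` are hypothetical helper names, and `memini_` is an assumed symbol name that follows the trailing-underscore convention visible in the backtrace.

```python
import re

# Matches frames such as "2: 0x2b6eb1b3a5b3 <uiusar_+0x182> (libsaturne.so.0)".
# Frames without a resolved symbol, such as "1: 0x35f5430280 ? (?)", are skipped.
FRAME_RE = re.compile(
    r"^\s*(\d+):\s+(0x[0-9a-fA-F]+)\s+<([^+>]+)\+0x[0-9a-fA-F]+>\s+\(([^)]+)\)"
)


def parse_frames(text):
    """Return (index, symbol, library) for every frame with a resolved symbol."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group(1)), m.group(3), m.group(4)))
    return frames


def caller_of(frames, symbol):
    """Return the symbol of the next resolved frame after `symbol`, or None."""
    for i, (_, name, _) in enumerate(frames[:-1]):
        if name == symbol:
            return frames[i + 1][1]
    return None


# First frames of the backtrace from the first post.
backtrace = """\
1: 0x35f5430280 ? (?)
2: 0x2b6eb1b3a5b3 <uiusar_+0x182> (libsaturne.so.0)
3: 0x2b6eb1c4ee82 <iniusi_+0x11de> (libsaturne.so.0)
4: 0x2b6eb1c4d5a3 <initi1_+0x23> (libsaturne.so.0)
5: 0x2b6eb1a39b0d <cs_run+0xe1> (libsaturne.so.0)
"""

frames = parse_frames(backtrace)
reported = caller_of(frames, "uiusar_")
expected = "memini_"  # assumed symbol name for memini
if reported != expected:
    print(f"uiusar_ reported under {reported}, expected {expected}: stack is suspect")
```

A mismatch like this does not locate the bug. It only indicates that the backtrace itself should not be trusted.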

I would suggest 2 types of tests:

- test with a very small mesh on the same machine to see if you reproduce the bug (install issue hypothesis).

- test with the same mesh and a debug build of the code (slower, but with more checks, might detect an error at an earlier stage).

Also, checking mesh quality with a different version of the code is easy if such a version is available, as a minimal data setup is sufficient. I am always interested in knowing whether a possible bug seems to be "still there" or not.

Best regards,

Yvan

Post by **knewlands** » Sun Sep 23, 2012 7:54 pm

Hello Yvan,

Thank you for your reply. I tried running the test with a very small mesh and I still got an error, so the code has been reinstalled with a new openmpi installation and it worked. Or at least, it worked for a RANS calculation I planned to use as restart for the LES calculation with the Synthetic Eddy Method applied through the usiniv.f90 subroutine. However, when I try to run the LES calculation I obtain the following error:

SIGFPE signal (floating point exception) intercepted!

Call stack:
1: 0x44a2cc <usiniv_+0xb68> (cs_solver)
2: 0x2aedb724a6e3 <inivar_+0x3c3> (libsaturne.so.0)
3: 0x407fc4 <caltri_+0x1624> (cs_solver)
4: 0x2aedb70d3c6d <cs_run+0x77d> (libsaturne.so.0)
5: 0x2aedb70d4005 <main+0x1f5> (libsaturne.so.0)
6: 0x35f541d974 <__libc_start_main+0xf4> (libc.so.6)
7: 0x4068e9 ? (?)
End of stack

I tried running the simulation on one processor only, but I get the same error. I have successfully used the usiniv.f90 subroutine in parallel before and the mesh I'm using is the same as for the RANS simulation that worked, so I don't understand why I'm coming across this error now. Any suggestions?

Kindest regards,

Kristin

Post by **knewlands** » Mon Oct 01, 2012 11:51 am

Hello,

Since my last post, I've tried to pinpoint the exact cause of the SIGFPE error, as it didn't seem to me that there should be a division by zero in the usiniv.f90 file. I am running the code through Valgrind and it did not report an error before the start of the iterations (as was the case when running the code without debugging), but as of the first time step it warns about the "non convergence of GRADRC" with:
GRADRC ISWEEP = 100, NORMED RESIDUAL: NaN and NORM: NaN

Could the lack of convergence be causing a SIGFPE error? Or is there a lack of convergence as a result of the SIGFPE error?

I've checked the mesh quality in ParaView and I can't see a problem with it. Also, the RANS calculation with the steady algorithm and the same mesh converged judging by the probe plots and the lack of NaN's in the listing file.

I would be very grateful for suggestions, as I just can't seem to get to the bottom of this.

Kindest regards,

Kristin

Post by **Yvan Fournier** » Mon Oct 01, 2012 12:45 pm

Hello,

Yes, if you have a convergence problem at some step, the appearance of a floating-point exception may be delayed, but will happen at some point, so searching for the first suspect behavior is a good approach.
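Searching for the first suspect behavior can be automated by scanning the listing for the first line that mentions NaN or Inf. This is a minimal sketch in Python. `first_nonfinite` is a hypothetical helper name, and the only real log line used is the GRADRC message quoted earlier in this thread; the other sample line is a placeholder.

```python
import re

# Matches NaN or Inf as standalone tokens (for example "NORMED RESIDUAL: NaN"),
# but not as part of a longer word such as "information".
NONFINITE_RE = re.compile(
    r"(?<![A-Za-z0-9_])[+-]?(nan|inf(inity)?)(?![A-Za-z0-9_])",
    re.IGNORECASE,
)


def first_nonfinite(lines):
    """Return (line_number, text) of the first line containing NaN/Inf, or None."""
    for number, line in enumerate(lines, start=1):
        if NONFINITE_RE.search(line):
            return number, line.rstrip("\n")
    return None


sample = [
    "(placeholder for earlier listing lines without any non-finite value)",
    "GRADRC ISWEEP = 100, NORMED RESIDUAL: NaN and NORM: NaN",
]
hit = first_nonfinite(sample)
if hit is not None:
    print(f"first non-finite value at line {hit[0]}: {hit[1]}")

# With a real run, the same function accepts an open file, for example:
#     with open("listing") as f:   # file name is an assumption
#         print(first_nonfinite(f))
```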

Did you try running without your usiniv.f90 user file? It is possible that the initialization used generates a field which is somewhat "incompatible" with this mesh, and leads to the crash (in which case using the default initialization may require more iterations to reach a converged state, but be safer).

Another test would be to use another gradient reconstruction option (one of the least squares options), to see if things improve, or if you simply move the crash to a later stage...

Best regards,

Yvan

Post by **knewlands** » Mon Oct 01, 2012 1:12 pm

Hello Yvan,

Thank you for your reply. I tried running the calculation using a least-squares method for the gradient reconstruction option and the same SIGFPE crash occurs. I had tried to run a simulation without the usiniv.f90 file in the SRC folder in previous attempts to determine the source of the problem and the calculation runs without errors. So it's highly likely that the usiniv.f90 subroutine is the source of the problem, but I'd used it successfully for a simpler geometry and was hoping to use those results as partial validation for this next phase of my project.

When you say that using the default initialisation may be safer, are you referring to using the original usiniv.f90 file without the SEM method or would it be possible to use the usvort.f90 subroutine with my geometry given that it isn't a simple pipe or channel? Or are there other methods of introducing noise to initiate the turbulence for LES?

Kindest regards,

Kristin

Post by **Yvan Fournier** » Mon Oct 01, 2012 4:42 pm

Hello,

I'm referring to using usiniv.f90 without the SEM method (or not using usiniv.f90, which means defaulting to initialization with all fields at zero).

Otherwise, check that the length scales used by the SEM in usini1.f90 are compatible not only with your original mesh, but also with the mesh with which you have issues.

Beyond that, I do not know if I should recommend usvort.f90: some turbulence specialists recommend the SEM method, some still prefer the vortex method, and I am not knowledgeable enough in this field to form my own opinion (though from a maintenance/parallelism/data setup point of view, I would be happy to get rid of the vortex method, and promote use of the other SEM methods). You might have some of those turbulence specialists nearby...
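One possible way to test such compatibility is to compare the prescribed length scale with a characteristic size of each cell. The sketch below is in Python for illustration only. The criterion (cube root of the cell volume compared with the length scale), the function names, and all numerical values are assumptions, not something taken from Code_Saturne or from this thread.

```python
def cell_size(volume):
    """Characteristic cell length, taken here as the cube root of the cell volume."""
    return volume ** (1.0 / 3.0)


def under_resolved_cells(volumes, length_scale, min_cells_per_eddy=1.0):
    """Return indices of cells larger than length_scale / min_cells_per_eddy."""
    limit = length_scale / min_cells_per_eddy
    return [i for i, v in enumerate(volumes) if cell_size(v) > limit]


# Hypothetical cell volumes (m^3) and length scale (m), for illustration only.
volumes = [1.0e-9, 8.0e-9, 1.0e-6]
sem_length_scale = 5.0e-3

bad = under_resolved_cells(volumes, sem_length_scale)
print(f"{len(bad)} cell(s) larger than the prescribed length scale: {bad}")
```

Cells flagged this way are places where the prescribed eddies would be smaller than the local mesh size, which is one candidate explanation for an initial field that suits one mesh but not another.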

Cheers,

Yvan

Post by **knewlands** » Thu Oct 11, 2012 3:46 pm

Hello,

I have resolved the previous SIGFPE error resulting from the use of the SEM in the usiniv.f90 file by effectively removing the SEM, which was causing the calculation to diverge.

However, I have now come across the following error:

SIGFPE signal (floating point exception) intercepted!

Call stack:
1: 0x2af38ed0fd66 <fvm_convert_array+0x4b86> (libfvm.so.0)
2: 0x2af38ed84b62 <fvm_writer_field_helper_step_e+0xaa2> (libfvm.so.0)
3: 0x2af38ed743f0 <fvm_to_ensight_export_field+0x5f0> (libfvm.so.0)
4: 0x2af38ed82326 <fvm_writer_export_field+0xc6> (libfvm.so.0)
5: 0x2af38da73682 <cs_post_write_var+0x172> (libsaturne.so.0)
6: 0x2af38da7665f <pstev1_+0xff> (libsaturne.so.0)
7: 0x2af38da794f0 <psteva_+0x56> (libsaturne.so.0)
8: 0x44af0a <usvpst_+0x16ea> (cs_solver)
9: 0x2af38da77a4d <pstvar_+0x120d> (libsaturne.so.0)
10: 0x40daff <caltri_+0x721f> (cs_solver)
11: 0x2af38d954c6d <cs_run+0x77d> (libsaturne.so.0)
12: 0x2af38d955005 <main+0x1f5> (libsaturne.so.0)
13: 0x35f541d974 <__libc_start_main+0xf4> (libc.so.6)
14: 0x406829 ? (?)
End of stack

When I ran the LES simulation without calculating time averages and gradients, I had no problems. According to the listing, this error appeared when the calculation was saving the first instantaneous output files, so it never reached the end (at which point the averages would've been written). Is there an obvious cause for this error?

I ran the code through Valgrind, but I received this message:

==14471== Address 0x6d00005 is not stack'd, malloc'd or (recently) free'd
==14471==
==14471== More than 10000000 total errors detected. I'm not reporting any more.
==14471== Final error counts will be inaccurate. Go fix your program!
==14471== Rerun with --error-limit=no to disable this cutoff. Note
==14471== that errors may occur in your program without prior warning from
==14471== Valgrind, because errors are no longer being displayed.

I'm trying to rerun it without the error limit, but I wondered if there is a common cause for the new SIGFPE error I've encountered?

Kindest regards,

Kristin

Post by **Yvan Fournier** » Thu Oct 11, 2012 5:10 pm

Hello,

The Valgrind error seems to indicate a problem. If re-checking all your other user subroutines does not reveal a bug, finding it would require attaching a debugger using Valgrind's --db-attach option. Memory management or overwrite bugs can indeed explain plenty of strange or incorrect behaviors.

Short answer: it might not be related, but as long as the Valgrind error appears, you can't trust your calculation.
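The call stack in the previous post ends in fvm_convert_array, reached from usvpst_. If the array passed from the user subroutine contains uninitialized, NaN, or very large values (an assumption that is not established in this thread), the values could be screened before export. The sketch below shows the idea in Python rather than Fortran. `suspicious_values` is a hypothetical helper and the sample data are invented.

```python
import math

# Largest finite IEEE 754 single-precision value.
FLOAT32_MAX = 3.4028234663852886e38


def suspicious_values(values):
    """Return (index, value) pairs that are non-finite or overflow single precision."""
    return [
        (i, v)
        for i, v in enumerate(values)
        if not math.isfinite(v) or abs(v) > FLOAT32_MAX
    ]


# Hypothetical field values, for illustration only.
field = [0.0, 1.5, float("nan"), 1.0e300, -2.0]

for index, value in suspicious_values(field):
    print(f"element {index}: {value!r}")
```

A check like this only screens for one class of symptoms. It does not replace tracking down the invalid memory access that Valgrind reports.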

Cheers,

Yvan
