Code Saturne 5.0.5 crash in linear solver


Post by Yvan Fournier »

Hello,

If you have no user subroutines, you can indeed use the cs_solver from libexec.

In some cases, you have no info from the lower levels of the stack trace, but you should obtain line info from at least some (most) of the lines in the stack trace. Could you try again?
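For reference, a minimal sketch of how stack-trace addresses can be turned into function and file:line info with the binutils `addr2line` tool. The binary path and the addresses below are hypothetical placeholders, not values from this thread. For frames located in a shared library such as libsaturne.so.5, the address passed usually has to be the offset inside the library rather than the absolute runtime address.

```python
import shlex
import subprocess


def addr2line_command(binary, addresses):
    """Build an addr2line call resolving addresses to function and file:line."""
    # -e: binary to inspect, -f: print function names,
    # -C: demangle names, -p: one readable line per address
    return ["addr2line", "-e", binary, "-f", "-C", "-p", *addresses]


def resolve(binary, addresses):
    """Run addr2line and return its text output.

    '??' entries in the output mean no debug info was found for that address.
    """
    cmd = addr2line_command(binary, addresses)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    # Hypothetical path and addresses, for illustration only.
    cmd = addr2line_command("/path/to/libexec/code_saturne/cs_solver",
                            ["0x4a2f10", "0x4a31c4"])
    print(shlex.join(cmd))
```

If every address comes back as `??`, either the binary has no debug info or the addresses do not belong to that file.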

Best regards,

Yvan

Post by Antech »

I tried the addresses from more than half of the stack trace on cs_solver, and some of them on libsaturne.so.5 from the Debug build (checked with pwd). Same result. Looks like I'm doing something wrong... Configure said that debugging info is enabled, then I ran "make clean", "make -j4" and "make install", which put the compiled and linked binaries in the corresponding debug build directory, which was empty before. Maybe I should check something in the binaries with a hex editor? If yes, what fields should I inspect (offset/format)?
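A hex editor should not be needed for this check: the binutils `readelf -S` command lists the sections of an ELF file, and a build with debug info normally carries `.debug_*` sections (`.debug_line` holds the address to line tables). The sketch below is only an illustration of that check; the helper names are made up for this example.

```python
import re
import subprocess


def debug_sections(readelf_output):
    """Return the sorted .debug_* section names found in `readelf -S` output."""
    return sorted(set(re.findall(r"\.debug_\w+", readelf_output)))


def has_line_info(binary):
    """True if the binary carries DWARF line tables (.debug_line)."""
    # -S: list section headers, -W: do not truncate long lines
    out = subprocess.run(["readelf", "-S", "-W", binary],
                         capture_output=True, text=True, check=True).stdout
    return ".debug_line" in debug_sections(out)
```

Calling `has_line_info()` on both cs_solver and libsaturne.so.5 from the install directory would show whether the installed files are really the debug ones.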

Regarding the compiler version: the laptop runs CentOS, which is very conservative, so it only has gcc 4.4.7. The desktop runs Kubuntu 14.04, which should have a much newer gcc, 4.8 or so (I cannot check right now, so I may be wrong). I don't think this is due to the compiler version. All the other Saturne versions were compiled on this laptop with the same or nearly the same gcc version(s), and there were not so many crashes in the past; it seems to be specific to 5.x, maybe to version 5.0.5.
The Saturne 5.0.4 + OpenMPI 1.8.4 + PV 4.3 calculation is still alive after 100+ iterations, observation is in progress...

Post by Antech »

Hello.

The Saturne 5.0.4 + OpenMPI 1.8.4 + PV 4.3 combination ran for another 9 hours today without issues on the same cyclone case. That makes ~800 iterations in total.

Post by Yvan Fournier »

Hello,

OK, you ran more iterations than I did... So it would seem that OpenMPI 3 is the component causing the issues? Strange, I have been using it on my home laptop for some time (but mostly for smaller cases).

Best regards,

Yvan

Post by Antech »

Hello, thanks for the info about OpenMPI 3 (I was not aware of the version you use).
This weekend I continued the cyclone simulations on a workstation with Kubuntu 14.04. I made a bunch of Saturne builds there... Both Saturne 5.0.5 + ParaView 5.2 + OpenMPI 3.0.0 and the same with OpenMPI 1.8.4 worked well for ~2 days (on the third day the run was only tracking particles, so I don't count it, because the issue is connected with the linear solver). But the OpenMPI 3.0.0 build failed on the same machine the previous weekend. So, because the problem is usually rare, I cannot say whether it is due to the OpenMPI version. I have now switched entirely to OpenMPI 1.8.4 and continue observing. (I now use only my home workstation for these cyclone cases with relatively big meshes, so I cannot provide test results quickly because, for safety, I don't leave it on during business days.)

Post by Antech »

Hello.
Reporting the results of my small investigation of this problem. I used Saturne 5.0.5 compiled with OpenMPI 1.8.4 again for many calculations, and there were no stability problems at all in any of them. So it seems that the problem comes from OpenMPI 3.0.0 or its compatibility with Saturne. If anybody has a similar issue, just avoid certain versions of OpenMPI (3.0.0 in my case); I decided to stick to version 1.8.4 for now.
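With several Saturne builds and two OpenMPI installs on the same machine, it can be worth confirming which MPI library a given solver binary actually resolves at run time. A minimal sketch, assuming the usual `name => path (address)` layout of `ldd` output; the function names are made up for this example.

```python
import re
import subprocess


def linked_mpi_libs(ldd_output):
    """Map each libmpi* name in `ldd` output to the path it resolves to."""
    libs = {}
    pattern = r"\s*(libmpi\S*)\s+=>\s+(.*?)(?:\s+\(0x[0-9a-fA-F]+\))?\s*$"
    for line in ldd_output.splitlines():
        m = re.match(pattern, line)
        if m:
            # The path is "not found" when the library cannot be resolved.
            libs[m.group(1)] = m.group(2)
    return libs


def mpi_libs_of(binary):
    """Run ldd on a binary and report the MPI libraries it resolves."""
    out = subprocess.run(["ldd", binary],
                         capture_output=True, text=True, check=True).stdout
    return linked_mpi_libs(out)
```

The resolved path shows which OpenMPI install directory is picked up, which depends on the environment (for example LD_LIBRARY_PATH) in which the check is run.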

Post by Yvan Fournier »

Hello,

Thanks for this feedback. Did you test OpenMPI 3.0.1?

I have not had issues with OpenMPI 3.0 on my side, but as I said, I only tested it on a workstation, while the issues you encountered are probably driver-related, so they may be different on a cluster...

We have had some stability issues on other clusters, but with different symptoms (i.e. hangs after 4-10 hours with older OpenMPI 1.6-based installs), so comparing experiences is always interesting. It helps us make a good guess as to whether a reported issue is code-related and must be fixed on our side, or MPI library-related and must be fixed elsewhere or worked around using another library or other settings. We have seen MPI stability issues related to driver/firmware interaction, so your experience on one cluster might not match that on another (but feedback is always good, and helps avoid future problems).

Best regards,

Yvan

Post by Antech »

Hello.
I tested it on a workstation and a laptop, not on a cluster. Different Linux distributions, different machines, and the same bug. The only OpenMPI versions I used were 1.8.4 and 3.0.0; I just haven't had the time yet to test other versions, and it works well with 1.8.4.
Successful calculations with OpenMPI 1.8.4 were made on both machines with different Saturne versions, but the recent ones were made on the workstation (Kubuntu 14.04 without updates, Saturne 5.0.5; if hardware matters: AMD Threadripper 1950X CPU @ 3500 MHz, 64 GB RAM @ 2400 MHz, i.e. standard frequencies, boost is disabled in BIOS). Many calculations with different Saturne versions up to 5.0.5 were also performed successfully on the laptop: i7-2670QM @ 2.2 GHz, 16 GB RAM, CentOS 6.7, OpenMPI 1.8.4.

P.S. I noticed a bug in the forum engine. I needed to talk with my colleagues, so some time passed while the reply page was open and I was logged in. Then I sent the reply and was taken to a login page; after logging in, the message was empty. In my case the browser (Firefox) saved the text in the reply form, but if somebody had typed a long post and it was not saved... There is a nice solution on the iXBT forum: there is always a quick reply form at the bottom of each topic page. If you post while not logged in, the forum engine asks you to log in and then your message appears on the board. While typing, you see the topic in the usual order (not reversed), which is very useful. If you want, you may implement this.