Re: Code Saturne 5.0.5 crash in linear solver
Posted: Wed Feb 21, 2018 12:38 pm
by Yvan Fournier
Hello,
If you have no user subroutines, you can indeed use the cs_solver from libexec.
In some cases you get no info for the lower levels of the stack trace, but you should obtain line info for at least some (most) of its lines. Could you try again?
Best regards,
Yvan
Re: Code Saturne 5.0.5 crash in linear solver
Posted: Wed Feb 21, 2018 1:28 pm
by Antech
I tried more than half of the addresses in the stack on cs_solver, and some of them on libsaturne.so.5, from the Debug build (checked with pwd). Same result. Looks like I'm doing something wrong... Configure said that debugging info is enabled, then I ran "make clean", "make -j4" and "make install", which put the compiled and linked results in the corresponding debug build directory, which was empty before. Maybe I should check something in the binaries with a hex editor? If so, which fields should I inspect (offset/format)?
Regarding the compiler version: the laptop runs CentOS, which is very conservative, so it only has gcc 4.4.7. The desktop runs Kubuntu 14.04, which should have a much newer gcc, like 4.8 or so (I cannot check it now, so I may be wrong). I don't think this is due to the compiler version. All other Saturne versions were compiled on this laptop with the same or a similar gcc version, and there were not so many crashes in the past, so it seems to be specific to 5.x, maybe to 5.0.5.
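For readers following the debug-info question, here is a minimal sketch of one way to check a binary without a hex editor. It assumes the binutils tools `readelf` and `addr2line` are installed. The function names (`debug_sections`, `has_line_info`, `resolve_address`) are made up for this illustration and are not part of Code_Saturne.

```python
import subprocess


def debug_sections(readelf_output):
    """Return the DWARF section names found in `readelf -S -W` output."""
    found = set()
    for line in readelf_output.splitlines():
        for token in line.split():
            # DWARF sections are named .debug_* (or .zdebug_* when compressed).
            if token.startswith(".debug_") or token.startswith(".zdebug_"):
                found.add(token)
    return found


def has_line_info(path):
    """True if the ELF file at `path` lists a .debug_line section."""
    out = subprocess.run(
        ["readelf", "-S", "-W", path],
        capture_output=True, text=True, check=True,
    ).stdout
    sections = debug_sections(out)
    return ".debug_line" in sections or ".zdebug_line" in sections


def resolve_address(path, address):
    """Ask addr2line for the function and file:line of one address.

    For a shared library the address usually has to be an offset relative
    to the library's load base, not the absolute address from the trace.
    """
    out = subprocess.run(
        ["addr2line", "-f", "-C", "-e", path, address],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.strip()
```

Calling `has_line_info()` on the installed cs_solver or libsaturne.so.5 tells you whether line information made it into the binary at all. If it returns False, the build that was installed is probably not the debug one, whatever configure reported.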
The Saturne 5.0.4 + OpenMPI 1.8.4 + PV 4.3 calculation is still running after 100+ iterations, observation is in progress...
Re: Code Saturne 5.0.5 crash in linear solver
Posted: Thu Feb 22, 2018 3:00 pm
by Antech
Hello.
The Saturne 5.0.4 + OpenMPI 1.8.4 + PV 4.3 combination worked for another 9 hours today without issues on the same cyclone case. It has done ~800 iterations in total.
Re: Code Saturne 5.0.5 crash in linear solver
Posted: Thu Feb 22, 2018 10:38 pm
by Yvan Fournier
Hello,
Ok, you ran more iterations than I did... So it would seem it is the OpenMPI 3 component which causes the issues? Strange, I have been using it on my home laptop for some time (but mostly for smaller cases).
Best regards,
Yvan
Re: Code Saturne 5.0.5 crash in linear solver
Posted: Mon Feb 26, 2018 9:22 am
by Antech
Hello, thanks for the info about OpenMPI 3 (I was not aware of the version you use).
This weekend I continued the cyclone simulations on a workstation with Kubuntu 14.04. I made a bunch of Saturne builds there... Both Saturne 5.0.5 + ParaView 5.2 + OpenMPI 3.0.0 and the same with OpenMPI 1.8.4 worked well for ~2 days (on the third day it tracked particles, so I don't count it, because the issue is connected with the linear solver). But it failed on the same machine the previous weekend with OpenMPI 3.0.0. So, because the problem is usually rare, I cannot say whether it is due to the OpenMPI version. I have now switched entirely to OpenMPI 1.8.4 and continue observations. (I now use only my home workstation for these cyclone cases with relatively big meshes, so I cannot provide test results quickly because, for safety, I don't leave it on during business days.)
Re: Code Saturne 5.0.5 crash in linear solver
Posted: Thu May 31, 2018 7:40 am
by Antech
Hello.
Reporting the results of my small investigation of this problem. I used Saturne 5.0.5 compiled with OpenMPI 1.8.4 again for many calculations, and in these calculations there were no stability problems at all. So it seems that the problem arises from OpenMPI 3.0.0 or its compatibility with Saturne. If anybody has a similar issue, just avoid using certain versions of OpenMPI; I decided to stick to version 1.8.4 for now.
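If you want to check which OpenMPI is actually picked up before starting a long run, a small sketch like the following may help. It assumes `mpirun --version` prints a line of the form `mpirun (Open MPI) X.Y.Z`, which may differ between builds. The `SUSPECT_VERSIONS` set and all function names are hypothetical. The set contains only 3.0.0 because that is the only version reported as problematic in this thread.

```python
import re
import shutil
import subprocess

# Assumption: only the version reported as problematic in this thread.
SUSPECT_VERSIONS = {(3, 0, 0)}


def parse_open_mpi_version(text):
    """Extract (major, minor, patch) from `mpirun --version` text, or None."""
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_suspect(version):
    """True if `version` is one of the versions flagged above."""
    return version in SUSPECT_VERSIONS


def installed_open_mpi_version():
    """Version of the mpirun found on PATH, or None if it cannot be determined."""
    if shutil.which("mpirun") is None:
        return None
    result = subprocess.run(
        ["mpirun", "--version"], capture_output=True, text=True
    )
    return parse_open_mpi_version(result.stdout + result.stderr)
```

A launch script could call `installed_open_mpi_version()` and print a warning when `is_suspect()` returns True. Note that this only inspects the `mpirun` on PATH, which is not necessarily the library the solver was linked against.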
Re: Code Saturne 5.0.5 crash in linear solver
Posted: Thu May 31, 2018 10:50 am
by Yvan Fournier
Hello,
Thanks for this feedback. Did you test OpenMPI 3.0.1?
I have not had issues with OpenMPI 3.0 on my side, but as I said, I only tested it on a workstation, while the issues you encountered are probably driver-related, so things may be different on a cluster...
We have had some stability issues on other clusters, but with different symptoms (i.e. a hang after 4-10 hours with older OpenMPI 1.6-based installs), so comparing experiences is always interesting. It helps us make a good guess as to whether a reported issue is code-related and must be fixed on our side, or MPI library-related and must be fixed elsewhere or worked around using another library or other settings. We have seen MPI stability issues which were related to driver/firmware interaction, so your experience on one cluster might not match that on another (but feedback is always good, and helps avoid future problems).
Best regards,
Yvan
Re: Code Saturne 5.0.5 crash in linear solver
Posted: Fri Jun 01, 2018 8:24 am
by Antech
Hello.
I tested it on a workstation and a laptop, not on a cluster. Different Linux distributions, different machines and the same bug. The only OpenMPI versions I used were 1.8.4 and 3.0.0; I just haven't had time yet to test other versions, and it works well with 1.8.4.
Successful calculations with OpenMPI 1.8.4 were made on both machines with different Saturne versions, but the recent ones were made on the workstation (Kubuntu 14.04 without updates, Saturne 5.0.5; if hardware matters: AMD Threadripper 1950X CPU @ 3500 MHz, 64 GB RAM @ 2400 MHz, i.e. standard frequencies, boost is disabled in BIOS). Many calculations with different Saturne versions up to 5.0.5 were also performed successfully on the laptop: i7-2670QM @ 2.2 GHz, 16 GB RAM, CentOS 6.7, OpenMPI 1.8.4.
P.S. I noticed that there is a bug in the forum engine. I needed to talk with my colleagues, so some time passed while the reply page was open and I was logged in. Then I sent the reply and was taken to a login page, and after login the message was empty. In my case the browser (Firefox) saved the text in the reply form, but if somebody has typed a long post and it didn't save... There is a nice solution on the iXBT forum: you always have a quick reply form at the bottom of each topic page. If you post while not logged in, the forum engine asks you to log in and then your message appears on the board. While typing, you see the topic in the usual order (not reversed), which is very useful. If you want, you may implement this.