I'm experiencing an issue where the simulations stop exactly halfway through the total number of iterations, without having set anything regarding this. It simply ceases to progress, with no error log. The only strange thing I notice is the following:
What type of machine/compute environment are you running on, how many MPI ranks do you use, and how long has the code been running? This looks like a hang while writing an intermediate checkpoint file. It may be similar to issues we encountered on a cluster after a system update introduced a bug in low-level libraries (at the MOFED level), but it could be something else.
I'm running this on nodes of an HPC cluster. I typically use between 3 and 6 nodes, each with 32 cores. Specifically: Intel Xeon Gold 6226R - 2 CPU X 16 cores | 192 GB RAM | Infiniband Network. More information can be found at https://cimec.org.ar/c3/pirayu/. Here's an example:
It has been running for about a day, but I have had several runs that have lasted even on the order of 10 days. This is the first time this has happened. How could I proceed to detect any possible problem?
I forgot to ask (and you did not mention) which version of the code you are using. There was an issue in the restart code some time ago (more than 2 years ago, if I remember correctly) where a race condition could also lead to the behavior you observe.
Otherwise, if the issue only happened once, troubleshooting it may be more trouble than it is worth. If it appears regularly, it would be useful to have more details on the software stack (MPI library, lower-level drivers, ...) and to check whether any part of that stack was recently upgraded.
With version 8.1, you should not have the code_saturne "race condition" bug. So this might be a problem in the system.
Did you reproduce it a second time? On our machine, we had the issue with OpenMPI 4 and 4.1 (less frequent with 4.1), but this depends on lower-level elements of the stack such as the MOFED version. You will not see this level of detail in the code_saturne_build.cfg file, so you probably need to check with the cluster administrators.
In our case, Intel MPI is less subject to this issue. Open MPI works well with older versions of the stack, and should work well with newer (fixed) versions, but I need to check the versions. This is why I was asking if there has been a recent system upgrade/maintenance on your cluster.
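As a starting point for the stack versions discussed above, here is a minimal Python sketch that collects the first line of version output from a few tools. It assumes `mpirun --version` and `ofed_info -s` are available on the compute nodes (they are the usual ways to query the MPI launcher and the MOFED release, but your cluster may expose them differently or through modules); the helper name `tool_version` is just for illustration. Any tool that is not found is reported as missing rather than raising an error.

```python
import shutil
import subprocess


def tool_version(command):
    """Return the first line printed by `command`, or None if the tool is unavailable."""
    if shutil.which(command[0]) is None:
        return None  # tool not on PATH (e.g. module not loaded)
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some tools print their version on stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    # Assumed commands; adapt to what is installed on your cluster.
    for cmd in (["mpirun", "--version"], ["ofed_info", "-s"]):
        version = tool_version(cmd)
        print(" ".join(cmd), "->", version if version is not None else "not found")
```

Running this once in a working environment and once after a maintenance window gives a simple record to compare, which can then be passed on to the cluster administrators.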
And if this issue appeared only once and is not recurring, you might not want to bother... Also, if this has nothing to do with the system but comes from a bug in a user-defined function, you can probably reproduce it at "half calculation" with fewer time steps. If it is a system issue, it might seem more "random" but appear only after a computation has run for a relatively long time (a few hours at least).
In my case, the version is 8.0.1, not 8.1; nevertheless, I shouldn't have the race condition issue in either version, should I?
This issue has occurred four times already (I'm simulating external flow around a vehicle, and since the first time it happened, no subsequent simulation has gone beyond halfway through the iterations). It takes time to try things out to prevent it from happening because my simulations last quite a while (several days). Ultimately, I had never encountered it before, and now it's happening consistently. I will try, as you suggest, using fewer iterations. On the other hand, I tried running another simple case (flow over a flat plate), and the issue did not arise.
Perhaps there's an issue with my user routines, although I don't have anything that explicitly does anything particular halfway through the simulation.
Still conducting tests, but in the meantime, I wanted to highlight something quite peculiar I've noticed: another run halted at the one-quarter mark of the total iterations. Once again, there's no code specifying any particular action at that moment. Again, a .lock file is generated inside the checkpoint folder.
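To check for leftover lock files like the one mentioned above, a small Python sketch such as the following could be used. It assumes only that the lock files end in `.lock` and sit directly inside the checkpoint directory; the function name and the directory path in the usage line are hypothetical.

```python
import time
from pathlib import Path


def find_lock_files(checkpoint_dir, now=None):
    """Return (file name, age in seconds) for each *.lock file in checkpoint_dir."""
    now = time.time() if now is None else now
    results = []
    for path in sorted(Path(checkpoint_dir).glob("*.lock")):
        # Age is measured from the last modification time of the lock file.
        results.append((path.name, now - path.stat().st_mtime))
    return results


if __name__ == "__main__":
    # Hypothetical path; point it at the checkpoint folder of the run directory.
    for name, age in find_lock_files("checkpoint"):
        print(f"{name}: last modified {age / 60:.1f} minutes ago")
```

A lock file that is much older than the time a checkpoint normally takes to write suggests the run is stuck in the checkpoint stage rather than still writing.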
Does Saturne automatically save and overwrite checkpoint data at 'regular' intervals? (by that, I mean, halfway through the run, a quarter of the run, etc).
Yes, this is the default: except for test runs of 10 iterations or less, there is a checkpoint every 1/4 of the total number of iterations required.
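To illustrate the default described above, here is a minimal Python sketch listing the iterations at which an intermediate checkpoint would be expected. This is not code_saturne's own code: the function name is hypothetical, and the use of integer division for the interval is an assumption about rounding.

```python
def default_checkpoint_iterations(n_total, n_parts=4, min_iterations=10):
    """Return the iterations at which an intermediate checkpoint would be expected."""
    if n_total <= min_iterations:
        return []  # short test runs: no intermediate checkpoint
    interval = n_total // n_parts  # assumed rounding, see note above
    return [k * interval for k in range(1, n_parts)]


if __name__ == "__main__":
    print(default_checkpoint_iterations(20000))  # prints [5000, 10000, 15000]
```

For a run of 20000 iterations this gives 5000, 10000 and 15000, which matches hangs seen at the one-quarter and one-half marks.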
Again, do you know if there was an "upgrade"/maintenance of some libraries on your cluster? Do you have access to other MPI libraries? Assuming it is the same issue we encountered with a MOFED update, Intel MPI/MS MPI seems less sensitive to the issue. Which version of OpenMPI are you using? 4.1.4 is supposedly less sensitive to the bug than 4.0, though when we tested it, hangs were rarer but still occurred.
I managed to solve the issue. Just as you suggested, I delved into investigating the changes made by the cluster administrators. We found that due to a memory restriction implemented on the master node, the amount of available memory was lower than what I typically have. This raised the possibility of hang-ups during the most memory-intensive stages (such as the checkpoint), especially considering these are heavy runs with ~40 million cells. We simply removed that restriction, and now it works correctly.
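For anyone wanting to check for a similar restriction, here is a minimal Python sketch that reports the memory the kernel considers available and the per-process address-space limit on a Linux node. It is only a first check under stated assumptions: it reads `/proc/meminfo` and `RLIMIT_AS`, and it will not show limits imposed through cgroups or the batch scheduler, which is where such restrictions are often set. The helper names are hypothetical.

```python
import resource
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly in kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def format_limit(value):
    """Format a resource limit given in bytes, handling the 'unlimited' marker."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 1024**3:.1f} GiB"


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # Linux only
        mem = parse_meminfo(meminfo.read_text())
        print("MemTotal (kB):    ", mem.get("MemTotal"))
        print("MemAvailable (kB):", mem.get("MemAvailable"))
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    print("Address-space limit (soft/hard):", format_limit(soft), "/", format_limit(hard))
```

Running it on the node that holds rank 0 (the master node in this case) before launching a large case gives a quick indication of whether less memory is available than expected.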