
Strange Halting at Half Total Iterations

Posted: Wed Mar 13, 2024 6:40 pm
by finzeo
Hello,

I'm experiencing an issue where my simulations stop exactly halfway through the total number of iterations, even though I have not configured anything related to this. The run simply ceases to progress, and no error is logged. The only strange thing I notice is the following:

Code:

[finzeo@pirayu checkpoint]$ ll
total 3139004
-rw-r--r-- 1 finzeo cimec 1019553024 Mar 12 11:01 auxiliary.csc
-rw-r--r-- 1 finzeo cimec  227725408 Mar 13 00:44 main.csc
-rw-r--r-- 1 finzeo cimec          8 Mar 13 00:44 main.csc-568852481-7472.lock
-rw-r--r-- 1 finzeo cimec 2007195968 Mar 11 21:42 mesh_input.csm
drwxr-xr-x 2 finzeo cimec         30 Mar 13 00:44 previous_dump_0000
[finzeo@pirayu checkpoint]$ ll previous_dump_0000/
total 518904
-rw-r--r-- 1 finzeo cimec 531357184 Mar 12 11:01 main.csc
I have attached my setup.xml and my user routines in case they are useful.

Thank you in advance.

Re: Strange Halting at Half Total Iterations

Posted: Wed Mar 13, 2024 7:22 pm
by Yvan Fournier
Hello,

What type of machine/compute environment are you running on, how many MPI ranks do you use, and how long has the code been running? This seems to be a hang when writing an intermediate checkpoint file, and may be similar to issues we encountered on a cluster after a system update that introduced a bug in low-level libraries (at the MOFED level), but it could be something else.
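To confirm that kind of hang, one option (a rough sketch; the PIDs and node names are placeholders, and cs_solver is the usual name of the solver binary) is to log in to a compute node while the job is stuck and look at where the ranks are blocked:

Code:

# On a compute node running the stuck job, list the solver processes
pgrep -u $USER cs_solver
# Attach gdb to one or two of the PIDs and dump all thread backtraces
gdb -p <PID> -batch -ex "thread apply all bt"
# If most ranks are blocked in MPI file-writing or collective calls while the
# checkpoint is being written, that points to the parallel I/O layer rather
# than to the physics or the user routines.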

Best regards,

Yvan

Re: Strange Halting at Half Total Iterations

Posted: Wed Mar 13, 2024 9:22 pm
by finzeo
Hi Yvan,

I'm running this on nodes of an HPC cluster. I typically use between 3 and 6 nodes, each with 32 cores. Specifically: Intel Xeon Gold 6226R - 2 CPUs x 16 cores | 192 GB RAM | InfiniBand network. More information can be found at https://cimec.org.ar/c3/pirayu/. Here's an example:

MPI ranks: 192 (appnum attribute: 0)
MPI ranks per node: 32
OpenMP threads: 1
Processors/node: 1

It has been running for about a day, but I have had other runs that lasted on the order of 10 days. This is the first time this has happened. How could I proceed to diagnose the problem?

Re: Strange Halting at Half Total Iterations

Posted: Thu Mar 14, 2024 4:39 pm
by Yvan Fournier
Hello,

I forgot to ask/you forgot to remind me which version of the code you are using. There was an issue in the restart code some time ago (more than 2 years ago, if I remember correctly) where a race condition could also lead to the behavior you observe.

Otherwise, if the issue only happened once, troubleshooting it may be more trouble than it is worth. If it appears regularly, it would be useful to have more details on the software stack (MPI library, lower-level drivers, ...) and to check whether any part of that stack was recently upgraded.
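For example, a few quick commands to gather that kind of information (a sketch; ofed_info is only present where Mellanox OFED is installed, and the module command depends on your cluster environment):

Code:

mpirun --version      # identify the MPI library and its version
ofed_info -s          # MOFED version, if Mellanox OFED is installed
ibv_devinfo | head    # InfiniBand adapter and firmware details
module list           # environment modules currently loaded, if applicable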

Best regards,

Yvan

Re: Strange Halting at Half Total Iterations

Posted: Thu Mar 14, 2024 5:13 pm
by finzeo
Hi Yvan,

I am using version 8.0.1-patch. I have attached the file code_saturne_build.cfg. Additionally, when running a simulation, I have to use this command:

Code:

export LD_PRELOAD=/share/apps/easybuild/software/GCCcore/10.3.0/lib64/libgfortran.so

Re: Strange Halting at Half Total Iterations

Posted: Fri Mar 15, 2024 12:27 pm
by Yvan Fournier
Hello,

With version 8.1, you should not have the code_saturne "race condition" bug. So this might be a problem in the system.
Did you reproduce it a second time? On our machine, we had the issue with OpenMPI 4 and 4.1 (less frequently with 4.1), but this depends on lower-level elements of the stack such as the MOFED version. You will not see this level of detail in the code_saturne_build.cfg file; you will probably need to check with the cluster administrators.

In our case, Intel MPI is less affected by this issue. Open MPI works well with older versions of the stack, and should work well with newer (fixed) versions, but I would need to check the exact versions. This is why I was asking whether there has been a recent system upgrade/maintenance on your cluster.
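If you are unsure which MPI library the solver actually uses at run time, you can also check the shared libraries it links against (the path below is a placeholder; cs_solver is found in the package's libexec directory or in the run's execution directory, depending on the install):

Code:

ldd /path/to/cs_solver | grep -i -E "mpi|fabric|verbs"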

And if this issue appeared only once and is not recurring, you might not want to bother... Also, in case this has nothing to do with the system but comes from a bug in a user-defined function, you can probably reproduce it at "half calculation" with fewer time steps. If it is a system issue, it might seem more "random" and appear only after a computation has run for a relatively long time (a few hours at least).

Best regards,

Yvan

Re: Strange Halting at Half Total Iterations

Posted: Fri Mar 15, 2024 5:43 pm
by finzeo
Hi Yvan,

In my case, the version is 8.0.1, not 8.1; nevertheless, I shouldn't have the race condition issue in either version, should I?

This issue has occurred four times already (I'm simulating external flow around a vehicle, and since the first time it happened, no subsequent simulation has gone beyond halfway through the iterations). It takes time to try things out because my simulations last quite a while (several days). In short, I had never encountered this before, and now it happens consistently. I will try, as you suggest, using fewer iterations. On the other hand, I tried running a simpler case (flow over a flat plate), and the issue did not arise.

Perhaps there is an issue with my user routines, although nothing in them explicitly does anything special halfway through the simulation.

Re: Strange Halting at Half Total Iterations

Posted: Sun Mar 17, 2024 12:45 am
by finzeo
Still conducting tests, but in the meantime, I wanted to highlight something quite peculiar I've noticed: another run halted at the one-quarter mark of the total iterations. Once again, there's no code specifying any particular action at that moment. Again, a .lock file is generated inside the checkpoint folder.
Does Saturne automatically save and overwrite checkpoint data at 'regular' intervals (by that I mean halfway through the run, a quarter of the way through, etc.)?

Re: Strange Halting at Half Total Iterations

Posted: Mon Mar 18, 2024 2:24 am
by Yvan Fournier
Hello,

Yes, this is the default: except for test runs of 10 iterations or less, a checkpoint is written every 1/4 of the total number of requested iterations.
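If you want to confirm when these intermediate checkpoints are written, they are logged by the solver; for example (assuming the solver log is named run_solver.log in the execution directory; the exact wording of the message may differ between versions):

Code:

grep -in checkpoint run_solver.log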

Again, do you know if there was an "upgrade"/maintenance of some libraries on your cluster? Do you have access to other MPI libraries? Assuming it is the same issue we encountered with a MOFED update, Intel MPI/MS MPI seems less sensitive to it. Which version of OpenMPI are you using? 4.1.4 is supposedly less sensitive to the bug than 4.0, though when we tested it, hangs were rarer but still occurred.

Best regards,

Yvan

Re: Strange Halting at Half Total Iterations

Posted: Mon Mar 25, 2024 3:58 pm
by finzeo
Hi Yvan,

I managed to solve the issue. Just as you suggested, I looked into the changes made by the cluster administrators. We found that, due to a memory restriction implemented on the master node, the amount of available memory was lower than what I typically have. This made hangs possible during the most memory-intensive stages (such as writing the checkpoint), especially considering these are heavy runs with ~40 million cells. We simply removed that restriction, and the runs now complete correctly.
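For anyone hitting something similar, a quick way to check the limits and memory actually available to a job (a sketch; the node name is a placeholder and the scontrol command assumes Slurm is the scheduler):

Code:

ulimit -a                                      # per-process limits in effect for the job
free -g                                        # total and available memory on the node, in GB
scontrol show node <node_name> | grep -i mem   # configured vs. allocated memory (Slurm)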

Thank you for your assistance.