jobs stuck when writing checkpoint - CS v7

daniele
Posts: 149
Joined: Wed Feb 01, 2017 11:42 am

jobs stuck when writing checkpoint - CS v7

Post by daniele »

Hello,

I am using version 7.2.0 and observing a problem I have never faced before. The simulation gets stuck whenever CS has to write either a checkpoint or a periodic postprocessing output. No error appears and the job keeps running, but in practice it is stuck in the writing process forever...
I have attached the lsof and gstack outputs: in this case, lsof shows that CS is blocked while writing results_fluid_domain.geo. The gstack trace seems to suggest an MPI communication problem?

If anyone has observed this behavior before and could suggest a solution, I would be very grateful...
By the way, I have already tried to reinstall and recompile CS from scratch, but nothing changed.

Thank you very much in advance for your help.
Kind regards,
Daniele
Attachments
lsof.184527.txt
(17.62 KiB)
gstack.184570.txt
(3.93 KiB)
Yvan Fournier
Posts: 4080
Joined: Mon Feb 20, 2012 3:25 pm

Re: jobs stuck when writing checkpoint - CS v7

Post by Yvan Fournier »

Hello,

On what machine are you running ? We have observed similar behavior on one of our clusters since a software stack update.

The problem occurs with Open MPI, and mostly for cases which have been running for a long time (it is case dependent, but it occurs more often after a day or two than after an hour; in such cases, more frequent checkpoint outputs may help). It seems related to a lower-level "ofed" update, and it occurs less frequently with (but is not solved by) switching to Open MPI 4.1.
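
For reference, if you want to force more frequent checkpoints from the user sources rather than the GUI, a minimal sketch in cs_user_parameters.c could look like the one below. I am assuming the cs_restart_checkpoint_set_interval() function from cs_restart.h, so please check the headers of your own version, as the exact name and signature may differ:

Code: Select all

/* cs_user_parameters.c (user case sources) */

#include "cs_headers.h"

void
cs_user_parameters(cs_domain_t *domain)
{
  CS_UNUSED(domain);

  /* Assumed API from cs_restart.h: request a checkpoint every 500 time
     steps; negative values deactivate the physical-time and wall-clock
     based criteria. Adjust if your version's signature differs. */
  cs_restart_checkpoint_set_interval(500,   /* every n time steps */
                                     -1.,   /* physical time interval (s) */
                                     -1.);  /* wall-clock interval (s) */
}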

Is this similar to your problem? If so, switching to Intel MPI solves the issue (though Intel MPI has its own issues, and for coupling with Syrthes we need to set the MPMD mode to "script" in code_saturne.cfg to work around a less critical bug).
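
From memory, that workaround amounts to an entry along the following lines in the code_saturne.cfg of the installation; the section and key names here are an assumption on my part, so check the commented reference file shipped with your install:

Code: Select all

[mpi]
# Assumed key name: launch MPMD (coupled) runs through a generated
# script instead of mpiexec's native MPMD syntax.
mpmd = script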

Any details/confirmation is welcome, as we need to reproduce this to report the issue upstream and have it fixed.

We also recently fixed a bug on our side when setting the checkpoint frequency based on wall time, but that one is more easily avoided by using the default settings.

Best regards,

Yvan
daniele
Posts: 149
Joined: Wed Feb 01, 2017 11:42 am

Re: jobs stuck when writing checkpoint - CS v7

Post by daniele »

Hello Yvan,

I run on Supermicro nodes with Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz.
For InfiniBand, the OFED version used is MLNX_OFED_LINUX-5.2-1.0.4.0.
The operating system is CentOS 7.9.

It is exactly as you say: the problem appears only for long runs where the first checkpoint is saved after 2 days (for example). I made a test with just 20 time steps and forced checkpoints, and that worked correctly.

CS was compiled with Open MPI 4.0.6; we will try to recompile with v4.1 to see if it helps solve the problem.

I do not set the checkpoint frequency based on wall time. In any case, the hang also occurs when writing results_fluid_domain.geo, so the problem is not related to a checkpoint option.

Please let me know if you need further details. I will let you know if Open MPI 4.1 has an impact.

Thank you very much for your help.
Kind regards,
Daniele
Antech
Posts: 197
Joined: Wed Jun 10, 2015 10:02 am

Re: jobs stuck when writing checkpoint - CS v7

Post by Antech »

Sorry if I'm wrong, but if you have MPI problems you could try to compile against OpenMPI 1.8.4 or 3.1.6. I run Saturne 7.0.2 with OpenMPI 1.8.4 and, for now (after 4.5 months), I haven't noticed any stability problems. However, I don't use OFED on this setup (it is a workstation). The OS version is CentOS 7.5 plus updates.
Yvan Fournier
Posts: 4080
Joined: Mon Feb 20, 2012 3:25 pm

Re: jobs stuck when writing checkpoint - CS v7

Post by Yvan Fournier »

Hello,

Those versions of OpenMPI are obsolete and not maintained anymore. On a workstation, this probably does not make much of a difference, but on a cluster, they probably do not support the latest high performance network drivers and recent versions of batch systems such as SLURM.

In any case, this problem has only been observed on clusters, and usually at a relatively high node count, which makes it difficult to build a "reproducer" case to pass to the network driver / MPI library support teams.

Best regards,

Yvan