Exceeded memory limit

Posted: Tue Feb 05, 2019 11:14 am
by konst
Hello!

I was running CS v5.3 and v5.2 with the RSM LRR turbulence model to calculate turbulent flow around a cylinder. But these calculations stop after ~145000 time steps with the error:

Code:

slurmstepd-atcn451: error: Job 17541583 exceeded memory limit (61472984 > 61440000), being killed
slurmstepd-atcn451: error: Exceeded job memory limit

There is probably a memory leak in the implementation of the RSM model. Is there a way to avoid this problem?

Best regards, Konstantin
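
For readers who want to see how far over the limit the job went, here is a minimal Python sketch that parses the slurmstepd line quoted above. It assumes the two numbers are reported in KiB, which is not stated in the thread; the function names are made up for this illustration.

```python
import re

# The log line quoted in the post above.
SLURM_LINE = (
    "slurmstepd-atcn451: error: Job 17541583 exceeded memory limit "
    "(61472984 > 61440000), being killed"
)

# Matches "Job <id> exceeded memory limit (<used> > <limit>)".
PATTERN = re.compile(r"Job (\d+) exceeded memory limit \((\d+) > (\d+)\)")


def parse_memory_limit(line):
    """Return (job_id, used, limit) as integers, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    job_id, used, limit = (int(group) for group in match.groups())
    return job_id, used, limit


def overshoot_report(line):
    """Summarize the overshoot, ASSUMING both numbers are in KiB."""
    parsed = parse_memory_limit(line)
    if parsed is None:
        return None
    job_id, used_kib, limit_kib = parsed
    over_kib = used_kib - limit_kib
    return {
        "job_id": job_id,
        "limit_mib": limit_kib / 1024,
        "over_kib": over_kib,
        "over_percent": 100.0 * over_kib / limit_kib,
    }


if __name__ == "__main__":
    print(overshoot_report(SLURM_LINE))
```

Under that assumption, the job was 32984 KiB over a 60000 MiB limit, roughly 0.05%. That is consistent with memory use creeping up slowly rather than jumping at once, though it does not prove it.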

Re: Exceeded memory limit

Posted: Tue Feb 05, 2019 9:02 pm
by Yvan Fournier
Hello,

Do you have a small test case ? We could debug this.

In any case the LRR model is not recommended. SSG is a more "correct" RSM model (though the memory leak might appear in both).

I won't be able to check before the end of the week, but I'll check if you can provide a small test case.

Best regards,

Yvan

Re: Exceeded memory limit

Posted: Wed Feb 06, 2019 8:16 pm
by konst
Thank you for your reply, Yvan.

I tried the same test with the k-epsilon model and it gives me the same error, "exceeded memory limit". So it looks like the problem is not in the turbulence model.

I have attached a zip archive with my setup files. I was running on the ATHOS cluster using 3 nodes.

Best regards, Konstantin

Re: Exceeded memory limit

Posted: Tue Feb 26, 2019 1:45 am
by Yvan Fournier
Hello,

A colleague checked your case and did not find a leak using the "classical" instrumentation, so I'll try with more complete tools and keep you updated.

In any case, your mesh is quite small, so running on 3 nodes seems a lot. 1 or 2 ranks on a single node should be enough for 60000 cells (unless you ran on Athos with a bigger mesh).

Best regards,

Yvan

Re: Exceeded memory limit

Posted: Fri Mar 01, 2019 7:10 pm
by konst
Hello,

Yvan, thank you for spending time on my case. I have checked this case with a smaller number of processors, as you recommended, and I still get this error. If there are no memory leaks, my only guess is that the case stops converging at some point.

Thank you again and bon weekend. :)

Re: Exceeded memory limit

Posted: Fri Mar 01, 2019 11:03 pm
by Yvan Fournier
Hello,

Did you at least get similar or better performance with 3 cores? A solution which is not too elegant but should at least work is to checkpoint / stop / restart every 100000 iterations or so.

The fact that I did not reproduce the issue on a small number of time steps does not prove there is no leak, as a leak could be in a function called only in some types of regimes. Memory fragmentation increasing at each time step may be a possibility, though I have only once encountered a case where running out of memory was definitely due to this, and it was on another type of architecture.

Do you have the same error on other machines ? A memory leak could be in the cluster's MPI libraries for example. Since Athos is being retired at the end of this month, results on other machines might be more relevant.

To try another type of "external" instrumentation, I am running the case with version 5.3.2 on a laptop, on 2 MPI ranks, and see no change in the values reported by "top" after about 120 iterations... I'll let it run a bit longer. I detected no issue with gcc's AddressSanitizer, Valgrind's leak check, or the CS_MEM_LOG environment variable, so I am running out of ideas for further tests... (I am testing under a different Linux distribution with recent tool versions, but the probability that the issue depends on this is low, except as regards MPI drivers, which can be capricious on some HPC systems).

Best regards,

Yvan
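
The "top"-based check described above can be left running unattended. The following is only a sketch of that idea in Python, not a tool used in this thread. It assumes a Linux system where /proc/<pid>/status exposes a VmRSS line, and all function names are hypothetical.

```python
import time


def parse_vmrss_kib(status_text):
    """Extract the VmRSS value (reported in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # no VmRSS line found (for example, a kernel thread)


def read_rss_kib(pid):
    """Read the current resident set size of a process; None if it has exited."""
    try:
        with open(f"/proc/{pid}/status") as status_file:
            return parse_vmrss_kib(status_file.read())
    except FileNotFoundError:
        return None


def growth_per_sample(samples):
    """Least-squares slope of the samples against their index (kB per sample)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def monitor(pid, interval_s=60.0, count=10):
    """Collect up to `count` RSS samples, one every `interval_s` seconds."""
    samples = []
    for _ in range(count):
        rss = read_rss_kib(pid)
        if rss is None:
            break  # process ended or RSS unavailable
        samples.append(rss)
        time.sleep(interval_s)
    return samples
```

Calling `growth_per_sample(monitor(pid))` with the process ID of one solver rank gives a rough growth rate. A slope that stays near zero over many time steps matches the "no evolution" observation, while a steady positive slope would point to a leak or to fragmentation in that process or in the libraries it uses.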

Re: Exceeded memory limit

Posted: Fri Mar 08, 2019 3:29 pm
by konst
Yvan, thank you for your help! You were right, it was a problem with ATHOS. I ran this case on EOLE and it works really well.
Thanks again.

Bon weekend,
Konstantin