Stop to avoid exceeding time allocation on cluster system

Questions and remarks about code_saturne usage
Forum rules
Please read the forum usage recommendations before posting.
Rodolphe
Posts: 18
Joined: Sun Mar 14, 2021 12:59 pm

Stop to avoid exceeding time allocation on cluster system

Post by Rodolphe »

Hi,

I'm working on a cluster with a batch system, and when I run a simulation, the calculation stops and returns the following message in the listing file:

Code: Select all

========================================================
   ** Stop to avoid exceeding the allocated time.
      ----------------------------------------------
      maximum time step number set to: 49943
========================================================
When I check the configuration of the cluster, it says that the CPU time is supposed to be unlimited (even though the documentation specifies a two-day time limit). The "ulimit -a" command gives:

Code: Select all

data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 380488
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
What could be the problem here? How can I fix it?

(The listing file is too large to attach here.)

Thanks a lot for your help !

Best regards,

Rodolphe
Yvan Fournier
Posts: 4069
Joined: Mon Feb 20, 2012 3:25 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Yvan Fournier »

Hello,

What batch system is used? How long did the computation run? How much time was allocated in the batch submission?

What is the general information on your case as listed in the forum usage recommendations ?

On a cluster, even if you run ulimit in a submission script, it is often not relevant, depending on the batch system.

Regards,

Yvan
Rodolphe
Posts: 18
Joined: Sun Mar 14, 2021 12:59 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Rodolphe »

Hello,

The cluster uses Slurm. My computation lasts 45 minutes and a few seconds no matter what I change in the inputs of my case (except the number of iterations). To estimate the time required, I ran a small submission (1,000 iterations, compared to 200,000 for the one I'm struggling with), and so I allocated 5:30:00 (hh:mm:ss). Since that did not work, I tried extending the limit to 10:00:00, but without success: the simulation still stops at 45 minutes.

Here is the summary of the completed job:

Code: Select all

State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 24
CPU Utilized: 04:43:54
CPU Efficiency: 25.85% of 18:18:24 core-walltime
Job Wall-clock time: 00:45:46
Memory Utilized: 78.80 MB
Memory Efficiency: 0.32% of 24.00 GB
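(For reference, the core-walltime figure in such reports is just the wall-clock time multiplied by the number of allocated cores; a quick check in Python, using the numbers above:)

```python
# 24 cores * 00:45:46 wall-clock = core-walltime reported by the scheduler
wall_s = 45 * 60 + 46              # wall-clock time in seconds (2746 s)
core_s = 24 * wall_s               # summed over the 24 allocated cores
h, rem = divmod(core_s, 3600)
m, s = divmod(rem, 60)
print(f"{h:02d}:{m:02d}:{s:02d}")  # 18:18:24
```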
I have attached the required files.

Thanks again for your help,

Best regards,

Rodolphe
Attachments
setup.log
(32.57 KiB) Downloaded 120 times
preprocessor.log
(7.83 KiB) Downloaded 119 times
compile.log
(10.33 KiB) Downloaded 107 times
Rodolphe
Posts: 18
Joined: Sun Mar 14, 2021 12:59 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Rodolphe »

I couldn't attach more than three files to my last post, so here are the others (mesh, xml).

I've attached the submission script too (not the same as for the current case, given that I've tried other configurations since then; I only changed the time, memory, and number of threads).

Best regards,

Rodolphe
Attachments
pool.txt
(329 Bytes) Downloaded 114 times
setup.xml
(10.56 KiB) Downloaded 114 times
phenix10.cgns
(665.21 KiB) Downloaded 106 times
Yvan Fournier
Posts: 4069
Joined: Mon Feb 20, 2012 3:25 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Yvan Fournier »

Hello,

I do not recommend running the whole "runcase" script under srun, and in any case, have never tried.

If you install code_saturne correctly, including the post-install step with the "code_saturne.cfg" file (see the installation documentation), you should just run "sbatch runcase", and the "srun" (or mpiexec) command will occur locally, just for the main parallel executable.

Normally, code_saturne tries to determine the remaining available time under Slurm based on the squeue command (see the get_remaining_time function in the bin/cs_exec_environment.py file in the code sources).

What does

Code: Select all

squeue -h -j $SLURM_JOBID -o %L
return if you run it under your batch environment?
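For context, `%L` asks squeue for the job's time left, in Slurm's `[days-]hours:minutes:seconds` notation. A minimal sketch of converting such a string to whole minutes (the helper name below is my own illustration, not the actual get_remaining_time code in cs_exec_environment.py):

```python
def slurm_time_to_minutes(s):
    """Convert a Slurm '[days-]HH:MM:SS' (or 'MM:SS') string to whole minutes."""
    days = 0
    if "-" in s:                    # optional leading 'days-' component
        d, s = s.split("-", 1)
        days = int(d)
    parts = [int(p) for p in s.split(":")]
    while len(parts) < 3:           # pad 'MM:SS' out to 'HH:MM:SS'
        parts.insert(0, 0)
    h, m, sec = parts
    return days * 24 * 60 + h * 60 + m + sec // 60

print(slurm_time_to_minutes("9:49:59"))     # 589
print(slurm_time_to_minutes("2-00:00:00"))  # 2880
```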

Regards,

Yvan
Rodolphe
Posts: 18
Joined: Sun Mar 14, 2021 12:59 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Rodolphe »

Hello,

Indeed, I have not done the post-install step yet. I don't really understand what I need to uncomment in "code_saturne.cfg". Maybe you can help me with that?

So you recommend writing "sbatch runcase" instead of "srun runcase" in the pool script? If not, how can I specify the number of CPUs, memory, ... that I require (which I was doing in the pool script)?

Note that I had to change a line in the parse_wall_time_slurm function in site-packages/code_saturne/cs_batch.py, otherwise the simulation would fail.

Code: Select all

line 60 :  t = 2*24*60 # force the wall time to 2 days
Concerning the squeue -h -j $SLURM_JOBID -o %L command, it returns:

Code: Select all

CLUSTER: lemaitre3
9:49:59
Regards,

Rodolphe
Yvan Fournier
Posts: 4069
Joined: Mon Feb 20, 2012 3:25 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Yvan Fournier »

Hello,

In the post-install step, in code_saturne.cfg, just set (uncomment) batch=SLURM.

You can also replace SLURM with an absolute path name ending in SLURM, if you want to define your own header model rather than the one defined in extras/batch/batch.SLURM in the source tree.

When that is done, if you create a new case, the runcase will contain the SLURM headers (as in your pool script), and the GUI will allow modifying the most common ones (the SLURM entries not known to the GUI are not touched).

The change in parse_wall_time_slurm might explain the problem you have (though I am not sure). What error did you get, and how did you change it? What does your "summary" file contain (it holds a copy of all environment variables, so it can help debugging)?

Regards,

Yvan
Rodolphe
Posts: 18
Joined: Sun Mar 14, 2021 12:59 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Rodolphe »

Hello,

I set "batch = SLURM" in the code_saturne.cfg file. Then I created a new case and updated the runcase script with the specifications of my simulation. Finally, I ran the command "sbatch runcase".

First, I ran with the original parse_wall_time_slurm, but the error came back (see the attached log file). So, at line 60 of site-packages/code_saturne/cs_batch.py, in parse_wall_time_slurm, I changed

Code: Select all

t = th + int(wt[0])*60 + int(wt[1])
with,

Code: Select all

t = 2*24*60 # force the wall time to 2 days
which ran well with "sbatch runcase". But after 45 minutes, the simulation still stops for the same reason as in my first message on this topic.

I've also attached the "summary" file from a run with the original parse_wall_time_slurm (the same simulation as the attached log file).

Regards,

Rodolphe
Attachments
summary.txt
(72.09 KiB) Downloaded 116 times
job_69782064.err.log
(1.99 KiB) Downloaded 110 times
Yvan Fournier
Posts: 4069
Joined: Mon Feb 20, 2012 3:25 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Yvan Fournier »

Hello,

Regarding the parsing of the wall time, I will need to add a test for cases where

Code: Select all

squeue -h -j $SLURM_JOBID -o %L
returns a multi-line answer.
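The extra "CLUSTER: lemaitre3" header shown earlier is printed by squeue on multi-cluster Slurm setups, so a parser that reads only the first line sees a non-time string. A hedged sketch of a workaround (the function name is illustrative, not the actual code_saturne implementation) is to keep only the line that matches a Slurm time pattern:

```python
import re

def remaining_time_line(squeue_output):
    """Return the line of `squeue ... -o %L` output that looks like a
    Slurm remaining-time field, skipping headers such as 'CLUSTER: ...'."""
    time_re = re.compile(r"^\d+(-\d+)?(:\d+){1,2}$")  # [days-]HH:MM:SS or MM:SS
    for line in squeue_output.splitlines():
        line = line.strip()
        if time_re.match(line):
            return line
    return None

out = "CLUSTER: lemaitre3\n9:49:59\n"
print(remaining_time_line(out))  # 9:49:59
```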

Otherwise, it is possible that the margin is not computed correctly.
You can try editing src/base/cs_resource.c and adding a "return;" statement at the beginning of the armtps or cs_resource_get_max_timestep functions, so as to deactivate this check. The risk is that if you plan too many time steps for the allocated time, you will not have a clean exit (unless you place a control_file before the end to stop the computation cleanly).

Best regards,

Yvan
Rodolphe
Posts: 18
Joined: Sun Mar 14, 2021 12:59 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Rodolphe »

Hello,

The squeue -h -j $SLURM_JOBID -o %L command returned the same output as before:

Code: Select all

CLUSTER: lemaitre3
9:49:59
I tried adding a "return;" at the beginning of cs_resource_get_max_timestep, but it changed nothing. The simulation still runs until 45 minutes before being shut down. Maybe a stupid question, but do I need to rerun ./configure after changing this function? Because adding the return and then relaunching the case changes nothing.

Thanks for your precious help !

Best regards,

Rodolphe