Stop to avoid exceeding time allocation on cluster system

Questions and remarks about code_saturne usage
Forum rules
Please read the forum usage recommendations before posting.
Rodolphe
Posts: 18
Joined: Sun Mar 14, 2021 12:59 pm

Stop to avoid exceeding time allocation on cluster system

Post by Rodolphe »

Hi,

I'm working on a cluster with a batch system, and when I run a simulation, the calculation stops and returns the following message in the listing file:

Code: Select all

========================================================
   ** Stop to avoid exceeding the allocated time.
      ----------------------------------------------
      maximum time step number set to: 49943
========================================================
When I check the configuration of the cluster, it says that the CPU time is supposed to be unlimited (even though the documentation specifies a two-day time limit). The "ulimit -a" command gives:

Code: Select all

data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 380488
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
What could be the problem here? How can I fix it?

(The listing file is too large to attach here.)

Thanks a lot for your help !

Best regards,

Rodolphe
Yvan Fournier
Posts: 4069
Joined: Mon Feb 20, 2012 3:25 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Yvan Fournier »

Hello,

What batch system is used? How long did the computation run? How much time was allocated in the batch submission?

What is the general information on your case as listed in the forum usage recommendations ?

On a cluster, even if you run ulimit in a submission script, it is often not relevant, depending on the batch system.

Regards,

Yvan
Rodolphe
Posts: 18
Joined: Sun Mar 14, 2021 12:59 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Rodolphe »

Hello,

The cluster uses Slurm. My computation lasts 45 minutes and a few seconds no matter what I change in the inputs of my case (except the number of iterations). To estimate the time required, I ran a small submission (1,000 iterations, compared to 200,000 for the one I'm struggling with), and so I allocated 5:30:00 (hh:mm:ss). Since that did not work, I tried extending the limit to 10:00:00, but without success: the simulation still stops at 45 minutes.

Here is the summary of the completed job:

Code: Select all

State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 24
CPU Utilized: 04:43:54
CPU Efficiency: 25.85% of 18:18:24 core-walltime
Job Wall-clock time: 00:45:46
Memory Utilized: 78.80 MB
Memory Efficiency: 0.32% of 24.00 GB
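(For reference, the core-walltime figure in such reports is just the wall-clock time multiplied by the number of allocated cores; a quick check in Python, using the numbers above:)

```python
# 24 cores * 00:45:46 wall-clock = core-walltime reported by the scheduler
wall_s = 45 * 60 + 46              # wall-clock time in seconds (2746 s)
core_s = 24 * wall_s               # summed over the 24 allocated cores
h, rem = divmod(core_s, 3600)
m, s = divmod(rem, 60)
print(f"{h:02d}:{m:02d}:{s:02d}")  # 18:18:24
```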
I have attached the required files.

Thanks again for your help,

Best regards,

Rodolphe
Attachments
setup.log
(32.57 KiB) Downloaded 120 times
preprocessor.log
(7.83 KiB) Downloaded 119 times
compile.log
(10.33 KiB) Downloaded 107 times
Rodolphe
Posts: 18
Joined: Sun Mar 14, 2021 12:59 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Rodolphe »

I couldn't attach more than three files to my last post, so here are the others (mesh, xml).

I've attached the submission script too (not the same as for the current case, given that I've tried other configurations since then; I only changed the time, memory, and number of threads).

Best regards,

Rodolphe
Attachments
pool.txt
(329 Bytes) Downloaded 114 times
setup.xml
(10.56 KiB) Downloaded 114 times
phenix10.cgns
(665.21 KiB) Downloaded 106 times
Yvan Fournier
Posts: 4069
Joined: Mon Feb 20, 2012 3:25 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Yvan Fournier »

Hello,

I do not recommend running the whole "runcase" script under srun, and in any case, have never tried.

If you install code_saturne correctly, including the post-install step with the "code_saturne.cfg" file (see the installation documentation), you should just run "sbatch runcase", and the "srun" (or mpiexec) command will occur locally, just for the main parallel executable.

Normally, code_saturne tries to determine the remaining available time under Slurm based on the squeue command (see the get_remaining_time function in the bin/cs_exec_environment.py file in the code sources).

What does

Code: Select all

squeue -h -j $SLURM_JOBID -o %L
return if you run it under your batch environment?
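For context, `%L` asks squeue for the job's time left, in Slurm's `[days-]hours:minutes:seconds` notation. A minimal sketch of converting such a string to whole minutes (the helper name below is my own illustration, not the actual get_remaining_time code in cs_exec_environment.py):

```python
def slurm_time_to_minutes(s):
    """Convert a Slurm '[days-]HH:MM:SS' (or 'MM:SS') string to whole minutes."""
    days = 0
    if "-" in s:                    # optional leading 'days-' component
        d, s = s.split("-", 1)
        days = int(d)
    parts = [int(p) for p in s.split(":")]
    while len(parts) < 3:           # pad 'MM:SS' out to 'HH:MM:SS'
        parts.insert(0, 0)
    h, m, sec = parts
    return days * 24 * 60 + h * 60 + m + sec // 60

print(slurm_time_to_minutes("9:49:59"))     # 589
print(slurm_time_to_minutes("2-00:00:00"))  # 2880
```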

Regards,

Yvan
Rodolphe
Posts: 18
Joined: Sun Mar 14, 2021 12:59 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Rodolphe »

Hello,

Indeed, I have not done the post-install step yet. I don't really understand what I need to uncomment in "code_saturne.cfg". Maybe you can help me with that?

So you recommend writing "sbatch runcase" instead of "srun runcase" in the pool script? If not, how can I specify the number of CPUs, memory, ... that I require (which I was doing in the pool script)?

Note that I had to change a line in the parse_wall_time_slurm function in site-packages/code_saturne/cs_batch.py, otherwise the simulation would fail.

Code: Select all

line 60 :  t = 2*24*60 # force the wall time to 2 days
Concerning the squeue -h -j $SLURM_JOBID -o %L command, it returns:

Code: Select all

CLUSTER: lemaitre3
9:49:59
Regards,

Rodolphe
Yvan Fournier
Posts: 4069
Joined: Mon Feb 20, 2012 3:25 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Yvan Fournier »

Hello,

In the post-install step, in code_saturne.cfg, just set (uncomment) batch=SLURM.

You can also replace SLURM with an absolute path name ending in SLURM, if you want to define your own header model rather than the one defined in extras/batch/batch.SLURM in the source tree.

When that is done, if you create a new case, the runcase will contain the SLURM headers (as in your pool script), and the GUI will allow modifying the most common ones (the SLURM entries not known to the GUI are not touched).

The change in parse_wall_time_slurm might explain the problem you have (though I am not sure). What error did you get, and how did you change it? What does your "summary" file contain (it holds a copy of all environment variables, so it can help debugging)?

Regards,

Yvan
Rodolphe
Posts: 18
Joined: Sun Mar 14, 2021 12:59 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Rodolphe »

Hello,

I set "batch = SLURM" in the code_saturne.cfg file. Then I created a new case and updated the runcase script with the specifications of my simulation. Finally, I ran the command "sbatch runcase".

First, I ran with the original parse_wall_time_slurm, but the error came back (see the attached log file). So, at line 60 of site-packages/code_saturne/cs_batch.py, in parse_wall_time_slurm, I changed

Code: Select all

t = th + int(wt[0])*60 + int(wt[1])
with,

Code: Select all

t = 2*24*60 # force the wall time to 2 days
which ran well with "sbatch runcase". But after 45 minutes, the simulation still stops for the same reason as in my first message on this topic.

I've also attached the "summary" file from a run with the original parse_wall_time_slurm (the same simulation as the attached log file).

Regards,

Rodolphe
Attachments
summary.txt
(72.09 KiB) Downloaded 116 times
job_69782064.err.log
(1.99 KiB) Downloaded 110 times
Yvan Fournier
Posts: 4069
Joined: Mon Feb 20, 2012 3:25 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Yvan Fournier »

Hello,

Regarding the parsing of the wall time, I will need to add a test for cases where

Code: Select all

squeue -h -j $SLURM_JOBID -o %L
returns a multi-line answer.
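The extra "CLUSTER: lemaitre3" header shown earlier is printed by squeue on multi-cluster Slurm setups, so a parser that reads only the first line sees a non-time string. A hedged sketch of a workaround (the function name is illustrative, not the actual code_saturne implementation) is to keep only the line that matches a Slurm time pattern:

```python
import re

def remaining_time_line(squeue_output):
    """Return the line of `squeue ... -o %L` output that looks like a
    Slurm remaining-time field, skipping headers such as 'CLUSTER: ...'."""
    time_re = re.compile(r"^\d+(-\d+)?(:\d+){1,2}$")  # [days-]HH:MM:SS or MM:SS
    for line in squeue_output.splitlines():
        line = line.strip()
        if time_re.match(line):
            return line
    return None

out = "CLUSTER: lemaitre3\n9:49:59\n"
print(remaining_time_line(out))  # 9:49:59
```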

Otherwise, it is possible that the margin is not computed correctly.
You can try editing src/base/cs_resource.c and adding a "return;" statement at the beginning of the armtps or cs_resource_get_max_timestep functions, so as to deactivate this check. The risk is that if you plan too many time steps for the allocated time, you will not have a clean exit (unless you place a control_file before the end to stop the computation cleanly).

Best regards,

Yvan
Rodolphe
Posts: 18
Joined: Sun Mar 14, 2021 12:59 pm

Re: Stop to avoid exceeding time allocation on cluster system

Post by Rodolphe »

Hello,

The squeue -h -j $SLURM_JOBID -o %L command returned the same output as before:

Code: Select all

CLUSTER: lemaitre3
9:49:59
I tried adding a "return;" at the beginning of cs_resource_get_max_timestep, but it changed nothing. The simulation still runs until 45 minutes before being shut down. Maybe a stupid question, but do I need to rerun ./configure after changing this function? Because adding the return and then relaunching the case changes nothing.

Thanks for your precious help !

Best regards,

Rodolphe