parallel batch job

Posted: Tue Aug 28, 2018 12:35 pm
by attene
Dear all,

I am facing an issue regarding a parallel batch job.
The batch system is SGE.
Although my script includes the option to run the job in parallel (#$ -l nodes=2), once the job starts running it is still serial!

I also tried adding --mpi to the command used to run the job; in my case:

/scratch/hpc/25/attene/CS_wave_loads/73_steady_conformal/3.62/RESU/20180808-0022/run_solver --mpi

With or without --mpi, the calculation behaves the same.

Any ideas?

I was thinking this issue may be due to how the batch system was configured during the post-install step...

regards,

FA

Re: parallel batch job

Posted: Tue Aug 28, 2018 10:27 pm
by Yvan Fournier
Hello,

How are you submitting the job? What is in your "run_solver" script (generated in RESU/<run_id> before running)?
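
For reference, run_solver is a plain script whose last lines contain the actual solver launch command, so it shows directly whether MPI was picked up. In a parallel run it should end with something along these lines (a sketch only; the exact command and options depend on your MPI library):

    mpiexec -n 2 ./cs_solver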

Do you have any messages when submitting? Or in the batch log files?

Regards,

Yvan

Re: parallel batch job

Posted: Wed Aug 29, 2018 2:15 pm
by attene
Hi Yvan,

I have attached the run_solver script as well as performance.log and setup.log.
I didn't get any particular messages when submitting, nor in the batch log files.

At the moment I am waiting for a simulation to start: I added mpirun to the cs_solver command line in run_solver.
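
The edited launch line now looks roughly like this (a sketch; the actual line generated in run_solver carries more arguments):

    mpirun -np 2 ./cs_solver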

Regards,

FA

Re: parallel batch job

Posted: Wed Aug 29, 2018 6:47 pm
by Yvan Fournier
Hello,

These are not the type of log files I am referring to. Batch systems usually add files (either in RESU/<run_id> or SCRIPTS, depending on how you submitted the job) with names containing the job number and extensions .out and .err.

Also how do you submit the job? What does your runcase (including the batch header) look like?
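
For reference, SGE normally requests parallel slots through a "parallel environment" rather than a nodes resource, so the batch header may need something along these lines instead of -l nodes=2 (the PE name "mpi" here is only a placeholder; actual PE names are site-specific):

    #$ -pe mpi 2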

Regards,

Yvan

Re: parallel batch job

Posted: Thu Aug 30, 2018 3:35 pm
by attene
Hello,

The files that come directly from the batch system have, in my case, the extensions .e1385884 and .o1385884; they are empty, so I have not attached them.

Anyway, I solved the problem by adding mpirun to the cs_solver command in run_solver, and then calling run_solver from the batch script (see attached).
It looks like cs_solver runs in serial by default when invoked on its own.
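
For anyone reading later, such a batch script is essentially of this shape (a sketch only; see the attachment for the real one, and note that the parallel request line depends on the site's SGE setup):

    #!/bin/bash
    #$ -pe mpi 2
    cd /scratch/hpc/25/attene/CS_wave_loads/73_steady_conformal/3.62/RESU/20180808-0022
    ./run_solver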

Regards,

FA

Re: parallel batch job

Posted: Thu Aug 30, 2018 8:06 pm
by Yvan Fournier
Hello,

Maybe you can automate this better by defining the mpiexec command in the post-install file (code_saturne.cfg), since the automatic/default values do not seem to be adapted to your system.
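
For example, something along these lines in the [mpi] section of that file (a sketch; the exact key names may vary slightly between versions, so check the comments in your own code_saturne.cfg):

    [mpi]
    mpiexec = mpirun
    mpiexec_n = ' -np '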

Details on your system (OS version, MPI library version, the current code_saturne.cfg file, and the summary file) could help us improve the default detection (but SGE is always a mess, with a syntax that is difficult to automate, so I am not surprised).

Best regards,

Yvan