I am attempting to run code_saturne in parallel on my University's cluster. I am getting the following error:
/exports/work/see_ies_marine/sTully/channelOnly/RESU/20130705-1700/run_solver.sh: line 10: module: command not found
which points to the generated run_solver.sh file:
#!/bin/bash
# Detect and handle running under SALOME YACS module.
YACS_ARG=
if test "$SALOME_CONTAINERNAME" != "" -a "$CFDRUN_ROOT_DIR" != "" ; then
YACS_ARG="--yacs-module=${CFDRUN_ROOT_DIR}"/lib/salome/libCFD_RunExelib.so
fi
module purge
# Export paths here if necessary or recommended.
export PATH="/exports/applications/apps/SL6/MPI/mpich2/3.0.4/bin":$PATH
export LD_LIBRARY_PATH="/exports/applications/apps/SL6/MPI/mpich2/3.0.4/lib":$LD_LIBRARY_PATH
cd /exports/work/see_ies_marine/sTully/channelOnly/RESU/20130705-1700
It seems your user environment is not quite the same when submitting a job on the cluster as when you are logged in interactively, which is quite common (if you have multiple .bash* files, for example, the man pages may help you determine which are sourced when, but it can be quite subtle).
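As a quick, hedged check (the file names below are arbitrary), you can submit a minimal job that records its environment and compare it with what you see interactively:

#!/bin/bash
# Minimal diagnostic job script; batch directives omitted, adapt to your scheduler.
type module || echo "module is not defined in the batch environment"
env | sort > batch_env.txt
# Then, from an interactive login: env | sort > login_env.txt ; diff login_env.txt batch_env.txt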
To work around this, as your run_solver.sh only contains "module purge" (and no subsequent "module load"), you simply need to reinstall Code_Saturne, adding "--with-modules=no" to the configure line you used previously, and the problem should go away.
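For reference, a hedged sketch of what that reinstall might look like; the build directory, prefix, and MPI path are placeholders, not your actual configure line (the MPI path is simply taken from the generated run_solver.sh above):

cd code_saturne-build                  # hypothetical build directory
../configure --prefix=$HOME/opt/code_saturne \
             --with-mpi=/exports/applications/apps/SL6/MPI/mpich2/3.0.4 \
             --with-modules=no         # disable environment-module handling in generated scripts
make && make install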
This error has been causing me a lot of problems. Basically, I can only run code_saturne on 12 procs (the number of cores per node on the cluster here) because it is not working correctly in parallel.
I have been in contact with my IT department here about fixing it but they seem to be a little stuck as to how to go about it, at least until the relevant IT person comes back from holiday. I was wondering if anyone could give some advice here, or whether Yvan's advice is still the way to go now that I know more about the problem.
To recap the problem (taken from an email from an engineer working with the cluster): "it would seem that on a shortish run of Susan's, code_saturne runs on 12 and 24 procs, but not on 48 (the job either proceeds very slowly or not at all, until it gets killed).
Having looked at the job script, I see that the openmpi-gcc module is being loaded alongside code_saturne's and a couple of others. However, an error is reported, which originates from:
/exports/work/see_ies_marine/sTully/channelOnly/RESU/20130705-1700/run_solver.sh: line 10: module: command not found
run_solver.sh seems to be created by cs_case.py (in the code_saturne package), and this particular error is due to cs_case.py not inserting a ". /etc/profile.d/modules.sh" line into run_solver.sh ahead of the "module purge" command.
However, the additional material in run_solver.sh concerns me, namely the purging of all loaded modules (I'm not entirely sure of its efficacy in this regard) and the hard-coding of replacement paths to MPICH2. I'd recommend these paths not be hard-coded and, in fact, that the modules be used rather than purged and substituted. Currently the job seems to initially load the OpenMPI stack, then (try to) unload it and replace it with MPICH2; it would be better if the job started out with MPICH2 and did not switch mid-way. (I think that if self.package_compute.env_modules were set to "yes" you could avoid all this, although it may still be necessary to craft the insertion of a ". /etc/profile.d/modules.sh" somewhere.) I also notice that run_solver.sh's invocation of mpiexec makes no reference to the pe_hostfile, which it will almost certainly need (and probably get from cs_case.py); I believe this corroborates the finding below and the behaviour above.
CPU usage from recent jobs (both failed and successful) seems to indicate that the processes may not be leaving the master node, which, after a certain process count, stalls that node (sends it into swap hell) while leaving the other nodes idle. Picking up the pe_hostfile would allow mpiexec to distribute the processes across the relevant hosts in the expected way and avoid this.
For example, a recent 24-process run (successful) shows only one node busy [CPU usage graph omitted].
And a similar 48-core job dies (SIGKILL) at its runtime limit (20 minutes) with no activity on any of the non-master nodes, while the master is very busy, presumably swapping [CPU usage graph omitted].
I believe mpiexec will accept a -machinefile option for the hosts file, and that this can be drawn from the pe_hostfile (perhaps slightly munged) supplied in the parallel job's environment."
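To make those two suggestions concrete, here is a minimal, hedged sketch of what a repaired run_solver.sh preamble and launch might look like on an SGE-style system. The module name, process count, machinefile format, and solver invocation are assumptions, not what cs_case.py actually generates:

#!/bin/bash
# Make the "module" command available in non-interactive shells, then manage modules
# explicitly instead of hard-coding paths (the init-script path below is the usual
# location, assumed here).
. /etc/profile.d/modules.sh
module purge
module load mpich2/3.0.4        # hypothetical module name

# Build an MPICH-style machinefile ("host:slots") from the scheduler's pe_hostfile,
# so ranks are spread across all allocated nodes rather than piling onto the master.
awk '{print $1":"$2}' "$PE_HOSTFILE" > machinefile

# Quick sanity check: each allocated host should appear, roughly with its slot count.
mpiexec -machinefile machinefile -n 48 hostname | sort | uniq -c

# Actual launch (solver invocation assumed; cs_case.py normally generates this line).
mpiexec -machinefile machinefile -n 48 ./cs_solver --param case.xml

The hostname/uniq check is just a cheap way to confirm that ranks are being placed on every allocated host before committing to a full run.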
What batch system are you using? For PBS (Torque or PBS Pro) and OpenMPI combinations, we recently fixed a bug that would lead you to run only on the master node. If that is the case, you can either get bin/cs_exec_environment from the downloads section of this site (http://code-saturne.org/viewvc/saturne/) and patch (reinstall) the code with it, or work around it by redefining the mpiexec options using cs_users_scripts.py, but you have to do that for every case (in 3.2, such options will be defined in the code_saturne.cfg configuration instead).
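If it helps while waiting for the proper fix, a hedged sketch of the per-case setup; the case layout, template location, and hook function name are assumptions about this Code_Saturne version, so check your own install:

cd /exports/work/see_ies_marine/sTully/channelOnly/DATA    # case DATA directory, assuming the usual case layout
cp REFERENCE/cs_user_scripts.py .                          # user-script template; exact name and location may differ in your version
# In the copied script, the MPI-environment hook (e.g. define_mpi_environment()) is where
# you would point mpiexec at the MPICH2 launcher and add options such as -machinefile,
# so that the run uses the hosts allocated by the batch system.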
It seems like fixing this is causing some issues within the IT department here; in the meantime I am trying the workaround you suggested with cs_users_scripts.py. However, I am struggling to understand what I have to do.
Could you explain a bit more what you mean by redefining the mpiexec options?