I am attempting to run code_saturne in parallel on my University's cluster. I am getting the following error:
/exports/work/see_ies_marine/sTully/channelOnly/RESU/20130705-1700/run_solver.sh: line 10: module: command not found
which points to the generated run_solver.sh file:
#!/bin/bash
# Detect and handle running under SALOME YACS module.
YACS_ARG=
if test "$SALOME_CONTAINERNAME" != "" -a "$CFDRUN_ROOT_DIR" != "" ; then
YACS_ARG="--yacs-module=${CFDRUN_ROOT_DIR}"/lib/salome/libCFD_RunExelib.so
fi
module purge
# Export paths here if necessary or recommended.
export PATH="/exports/applications/apps/SL6/MPI/mpich2/3.0.4/bin":$PATH
export LD_LIBRARY_PATH="/exports/applications/apps/SL6/MPI/mpich2/3.0.4/lib":$LD_LIBRARY_PATH
cd /exports/work/see_ies_marine/sTully/channelOnly/RESU/20130705-1700
It seems your user environment is not quite the same when submitting a job on the cluster as when you are logged in interactively, which is quite common (if you have multiple .bash* files, for example, the man pages may help you determine which are sourced when, but it can be quite subtle).
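As a quick, hedged check (the file names below are arbitrary), you can submit a minimal job that records its environment and compare it with what you see interactively:

#!/bin/bash
# Minimal diagnostic job script; batch directives omitted, adapt to your scheduler.
type module || echo "module is not defined in the batch environment"
env | sort > batch_env.txt
# Then, from an interactive login: env | sort > login_env.txt ; diff login_env.txt batch_env.txt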
To work around this, as your run_solver.sh only contains "module purge" (and no subsequent "module load"), you simply need to reinstall Code_Saturne, adding "--with-modules=no" to the configure line you used previously, and the problem should go away.
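For reference, a hedged sketch of what that reinstall might look like; the build directory, prefix, and MPI path are placeholders, not your actual configure line (the MPI path is simply taken from the generated run_solver.sh above):

cd code_saturne-build                  # hypothetical build directory
../configure --prefix=$HOME/opt/code_saturne \
             --with-mpi=/exports/applications/apps/SL6/MPI/mpich2/3.0.4 \
             --with-modules=no         # disable environment-module handling in generated scripts
make && make install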
This error has been causing me a lot of problems. Basically, I can only run code_saturne on 12 procs (the number of cores per node on the cluster here) because it is not working correctly in parallel.
I have been in contact with my IT department here about fixing it but they seem to be a little stuck as to how to go about it, at least until the relevant IT person comes back from holiday. I was wondering if anyone could give some advice here, or whether Yvan's advice is still the way to go now that I know more about the problem.
To recap the problem (taken from an email from an engineer working with the cluster): "it would seem that on a shortish run of Susan's, code_saturne runs on 12 and 24 procs, but not on 48 (the job either proceeds very slowly or not at all, until it gets killed).
Having looked at the job script, I see that the openmpi-gcc module is being loaded alongside code_saturne's and a couple of others. However, an error is reported, which originates from:
/exports/work/see_ies_marine/sTully/channelOnly/RESU/20130705-1700/run_solver.sh: line 10: module: command not found
run_solver.sh seems to be created by cs_case.py (in the code_saturne package), and this particular error is due to cs_case.py not inserting a ". /etc/profile.d/modules.sh" line into run_solver.sh ahead of the "module purge" command.
However, the additional material in run_solver.sh concerns me, namely the purging of all loaded modules (I'm not entirely sure of its efficacy in this regard) and the hard-coding of replacement paths to MPICH2. I'd recommend these paths not be hard-coded and, in fact, that the modules be used rather than purged and substituted. Currently the job seems to initially load the OpenMPI stack, then (try to) unload it and replace it with MPICH2; it would be better if the job started out with MPICH2 and did not switch mid-way. (I think that if self.package_compute.env_modules were set to "yes" you could avoid all this, although it may still be necessary to craft the insertion of a ". /etc/profile.d/modules.sh" somewhere.) I also notice that run_solver.sh's invocation of mpiexec makes no reference to the pe_hostfile, which it will almost certainly need (and probably get from cs_case.py); I believe this corroborates the finding below and the behaviour above.
CPU usage from recent jobs (both failed and successful) seems to indicate that the processes may not be leaving the master node, which, after a certain process count, stalls that node (sends it into swap hell) while leaving the other nodes idle. Picking up the pe_hostfile would allow mpiexec to distribute the processes across the relevant hosts in the expected way and avoid this.
For example, a recent 24-process run (successful) shows only one node busy [CPU usage graph omitted].
And a similar 48-core job dies (SIGKILL) at its runtime limit (20 minutes) with no activity on any of the non-master nodes, while the master is very busy, presumably swapping [CPU usage graph omitted].
I believe mpiexec will accept a -machinefile option for the hosts file, and that this can be drawn from the pe_hostfile (perhaps slightly munged) supplied in the parallel job's environment."
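To make those two suggestions concrete, here is a minimal, hedged sketch of what a repaired run_solver.sh preamble and launch might look like on an SGE-style system. The module name, process count, machinefile format, and solver invocation are assumptions, not what cs_case.py actually generates:

#!/bin/bash
# Make the "module" command available in non-interactive shells, then manage modules
# explicitly instead of hard-coding paths (the init-script path below is the usual
# location, assumed here).
. /etc/profile.d/modules.sh
module purge
module load mpich2/3.0.4        # hypothetical module name

# Build an MPICH-style machinefile ("host:slots") from the scheduler's pe_hostfile,
# so ranks are spread across all allocated nodes rather than piling onto the master.
awk '{print $1":"$2}' "$PE_HOSTFILE" > machinefile

# Quick sanity check: each allocated host should appear, roughly with its slot count.
mpiexec -machinefile machinefile -n 48 hostname | sort | uniq -c

# Actual launch (solver invocation assumed; cs_case.py normally generates this line).
mpiexec -machinefile machinefile -n 48 ./cs_solver --param case.xml

The hostname/uniq check is just a cheap way to confirm that ranks are being placed on every allocated host before committing to a full run.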
What batch system are you using? For PBS (Torque or PBS Pro) and OpenMPI combinations, we recently fixed a bug that would lead you to run only on the master node. If that is the case, you can either get bin/cs_exec_environment from the downloads section of this site (http://code-saturne.org/viewvc/saturne/) and patch (reinstall) the code with it, or work around it by redefining the mpiexec options using cs_users_scripts.py, but you have to do that for every case (in 3.2, such options will be defined in the code_saturne.cfg configuration instead).
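If it helps while waiting for the proper fix, a hedged sketch of the per-case setup; the case layout, template location, and hook function name are assumptions about this Code_Saturne version, so check your own install:

cd /exports/work/see_ies_marine/sTully/channelOnly/DATA    # case DATA directory, assuming the usual case layout
cp REFERENCE/cs_user_scripts.py .                          # user-script template; exact name and location may differ in your version
# In the copied script, the MPI-environment hook (e.g. define_mpi_environment()) is where
# you would point mpiexec at the MPICH2 launcher and add options such as -machinefile,
# so that the run uses the hosts allocated by the batch system.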
It seems like fixing this is causing some issues within the IT department here; in the meantime I am trying the workaround you suggested with cs_users_scripts.py. However, I am struggling to understand what I have to do.
Could you explain a bit more what you mean by redefining the mpiexec options?