HPC parmetis issue

Posted: Fri Dec 19, 2014 6:51 pm
by mkendrick
Hello,

I've been trying to get Code_Saturne running on our new departmental cluster.

My latest attempt was to build all the dependencies individually, creating module files for each, and then to build Code_Saturne 3.0.5 against those dependencies.

The issue occurs at the start of the main code and is the result of an MPI failure. Please see the attached files.
You may notice that the MPI version used here is impi-2013, which is the default MPI on our cluster; the results are the same when using OpenMPI.

The job runs in serial with no errors.

Eagerly awaiting your response, regards,
Martyn

Re: HPC parmetis issue

Posted: Sat Dec 20, 2014 12:42 am
by Yvan Fournier
Hello,

As the errors occur in ParMetis, I suspect an installation issue. More specifically, you may be building and running with different builds of ParMetis, assuming other versions of that tool are already installed.
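
If it helps to pin down which ParMETIS your environment actually picks up, a small standalone test could be compiled with the same mpicc and the same modules loaded as for the Code_Saturne build. This is just an illustrative sketch, not part of Code_Saturne; the version macros are the ones defined in parmetis.h:

#include <stdio.h>
#include <mpi.h>
#include <parmetis.h>

int
main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  /* parmetis.h defines these macros; a 4.0.3 install should print 4.0.3 */
  printf("ParMETIS headers: %d.%d.%d\n",
         PARMETIS_MAJOR_VERSION,
         PARMETIS_MINOR_VERSION,
         PARMETIS_SUBMINOR_VERSION);

  MPI_Finalize();
  return 0;
}

Running "ldd" on the resulting executable would also show which shared ParMETIS and MPI libraries it is linked against (a static ParMETIS will not appear there).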

Could you post your "config.log" file for the installation, so I can try to see how the automatic module detection behaved?

Could you run "ldd ./cs_solver" in your execution directory (assuming you have user subroutines; otherwise, use the cs_solver executable in <install_prefix>/libexec/code_saturne/cs_solver)?

Could you also try running Code_Saturne with the built-in partitioner, using the "Morton curve in bounding box" option (in the GUI, under "Calculation Management/Performance tuning/Partitioning", or in user subroutines, cs_user_performance_tuning.c)?
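
In the user subroutine, that would look something like the sketch below (adapted from the reference cs_user_performance_tuning.c example; the header list is trimmed, so please check the function and enum names against your 3.0.5 sources):

#include "cs_defs.h"
#include "cs_partition.h"
#include "cs_prototypes.h"

void
cs_user_partition(void)
{
  /* Use the Morton space-filling curve in the mesh bounding box
     for the main partitioning stage, instead of an external
     graph partitioner such as ParMETIS. */

  cs_partition_set_algorithm(CS_PARTITION_MAIN,
                             CS_PARTITION_SFC_MORTON_BOX,
                             1,       /* rank_step */
                             false);  /* ignore_perio */
}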

Regards,

Yvan

Re: HPC parmetis issue

Posted: Mon Dec 22, 2014 4:07 pm
by mkendrick
Thanks for your reply Yvan, very helpful again!

I hadn't noticed it was using ParMETIS; in fact, I was having issues with ParMETIS originally, which is why I chose to build CS myself. I hadn't installed ParMETIS (i.e. I didn't configure with '--with-parmetis=...'), but I guess the CS install must have picked it up anyway.
I tried building a copy of ParMETIS (version 4.0.3) myself, but that resulted in the code hanging upon submission.

I am happy to report that with the Morton bounding-box partitioner the code is running. It does seem to hit a limit on the number of 'pipes' when I try to run 256+ processes, so I've reduced this to 128 and it runs nicely; I think this is an MPI issue I'll look into in more detail.

The files/outputs you asked for are attached.

Re: HPC parmetis issue

Posted: Mon Dec 22, 2014 7:04 pm
by Yvan Fournier
Hello,

ldd shows nothing related to ParMETIS, so you probably picked up a version built as a static library.
Also, from the config.log, it seems ParMETIS was found in the default system directories.

The message about the number of pipes is strange. How many nodes are you running on? How many processes per node?

Be careful: the ldd output you sent seems to indicate you installed your own OpenMPI build in "/shared/code_saturne/software/openmpi/1.6.5/".

On a cluster, you usually do not want to do that; you should use the default MPI library (if several are available as modules, any one you want, as long as it was installed by the administrators). Otherwise, the MPI library might not be configured to use the high-speed network (such as InfiniBand) or to interact properly with the resource manager. This could lead to bad performance, and possibly to all instances running on the same node (which could explain the "pipe" message).

Regards,

Yvan