CS v2.1.1 Parallel calculation

Alexandre Guilloux

CS v2.1.1 Parallel calculation

Post by Alexandre Guilloux »

Hi,

I'm installing version 2.1.1 of Code_Saturne on a new machine.

The machine has 2 processors with 8 cores each (16 cores in total).

I have installed Ubuntu 10.04.

For information, I have installed Syrthes 4.0 with its libraries. I have run Syrthes with the test case on 16 cores. There is no problem and no error message with the calculation.

I installed Code_Saturne v2.1.1 and specified the Syrthes libraries in the setup file:

hdf5-1.8.7
med-3.0.3
openmpi-1.4.3

When I run a Code_Saturne calculation with 8 cores (1 processor), there is no problem.
When I run a calculation with 16 cores (2 processors), the calculation fails in a random way.
I get this error message:
"mpiexec noticed that process rank 5 with PID 2061 on node ORDI01 exited on signal 11 (Segmentation fault)"
"solver script exited with status 139"
When I restart the machine, the temporary directory is empty.

I don't know which files you need to debug my problem.

Best Regards

Alexandre
Yvan Fournier

Re: CS v2.1.1 Parallel calculation

Post by Yvan Fournier »

Hello,
Besides a temporary directory, you should have something (listing, error*) in the RESU/<run_id> directory (as Code_Saturne 2.1 only uses a temporary directory if a /scratch directory is present, or if one is indicated in the code_saturne.cfg file).
If present, the stack trace in the error* files is the most useful part.
If the bug seems due to the code itself (i.e. not in or related to a user subroutine), reproducing it would require the "mesh_input" file or directory and the "domain_number_16" file if present, as well as your xml file and user subroutines, but hopefully the stack trace will be enough.
Best regards,
  Yvan
Alexandre Guilloux

Re: CS v2.1.1 Parallel calculation

Post by Alexandre Guilloux »

Hello,
When Code_Saturne crashes, I need to restart the computer.
The temporary directory is empty (I indicate a directory in the code_saturne.cfg file), and when files are present (mesh_input, summary, the .xml file and preprocessor.log) they are empty.
The bug is random: the same basic calculation (also tested on another machine) sometimes works and sometimes crashes, so I think it is not a bug in Code_Saturne.
I will run some tests with Syrthes or a benchmark code to check my memory.

I will let you know the results.
Thanks for the answer
Alexandre
Alexandre Guilloux

Re: CS v2.1.1 Parallel calculation

Post by Alexandre Guilloux »

Hi,

I ran a few calculations with Syrthes 4 and I get the same error message:
"mpiexec noticed that process rank 6 with PID 2419 on node ORDI01 exited on signal 11 (Segmentation fault)"
When I run a calculation with 1 core, there is no problem with Code_Saturne.

So I think the problem is with my Open MPI library.

Do you know of a benchmark to test the Open MPI installation?

Thanks
 
Best regards,

Alexandre
Yvan Fournier

Re: CS v2.1.1 Parallel calculation

Post by Yvan Fournier »

Hello,
There may be test/example programs in the Open MPI or MPICH2 source trees (but those examples are not necessarily installed, so you need the source).
A first "basic" test I usually run with an MPI distribution is a non-MPI system information program, to see if MPI starts correctly, for example:
mpiexec -n 4 /usr/bin/env
or:
mpiexec -n 4 /bin/hostname
Once that is verified to work, you can move to a simple MPI "hello world" program (just Google MPI hello world), which will not exercise all of MPI, but at least the initialization and finalization parts.
Bugs in MPI libraries are usually more complex, so these simple tests may run fine while a full code like Code_Saturne or SYRTHES fails with the same MPI library, but if those tests do fail, you can at least be sure the problem is with your MPI installation.
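For reference, here is a minimal sketch of a slightly more demanding test (not from the original posts; the file name test_comm.c is arbitrary): it passes each rank's id around a ring with MPI_Sendrecv and sums the ranks with MPI_Allreduce, so it exercises actual point-to-point and collective communication rather than only initialization and finalization.

Code: Select all

/* C example: exercise MPI communication, not just init/finalize */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
  int rank, size, left, right, recv_val, sum;

  MPI_Init (&argc, &argv);                 /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);   /* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size);   /* get number of processes */

  /* pass each rank's id around a ring using point-to-point communication */
  left  = (rank + size - 1) % size;
  right = (rank + 1) % size;
  MPI_Sendrecv (&rank, 1, MPI_INT, right, 0,
                &recv_val, 1, MPI_INT, left, 0,
                MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  /* sum all rank ids with a collective; expected result is size*(size-1)/2 */
  MPI_Allreduce (&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

  printf ("rank %d of %d: received %d from %d, sum of ranks = %d\n",
          rank, size, recv_val, left, sum);

  MPI_Finalize ();
  return 0;
}

It can be compiled and run the same way as a "hello world" program (mpicc test_comm.c -o test_comm, then mpiexec -np 16 test_comm).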
Best regards,
  Yvan
Alexandre Guilloux

Re: CS v2.1.1 Parallel calculation

Post by Alexandre Guilloux »

Hello,

I have made different tests.

1.  mpiexec -n 4 /usr/bin/env

This command gives a lot of information; no apparent problem.

2.  MPI Hello world
I made a file hello.c with:

Code: Select all

/* C Example */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
  int rank, size;

  MPI_Init (&argc, &argv);	/* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);	/* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size);	/* get number of processes */
  printf( "Hello world from process %d of %d\n", rank, size );
  MPI_Finalize();
  return 0;
}
 
Then I compile this file with the command: mpicc hello.c -o hello
And I run the test with the commands:
mpiexec -np 8 hello    --> with 8 cores (1 processor), it fails in a random way
mpiexec -np 16 hello    --> with 16 cores (2 processors), it fails in a random way
It is the same message as with Code_Saturne:
"mpiexec noticed that process rank 10 with PID 2215 on node ORDI01 exited on signal 11 (Segmentation fault)"
Then I need to reboot my computer.

My problem is with the MPI library. I have to reinstall it.

I don't know if I need to uninstall the library first, and if so, I don't know how.
 
Thanks again for your answers
 
Best Regards

Alexandre  
Alexandre Guilloux

Re: CS v2.1.1 Parallel calculation

Post by Alexandre Guilloux »

I have opened a new topic specific to Open MPI issues:

"Bug with open MPI"

http://code-saturne.org/forum/viewtopic.php?f=8&t=968