Problems with mpi on Ubuntu 18.04

AndrewH
Posts: 47
Joined: Thu Oct 02, 2014 11:03 am

Problems with mpi on Ubuntu 18.04

Post by AndrewH »

Hello,

I have been trying to install Code_Saturne v4.0.1/v4.0.8 on Ubuntu 18.04 and I have run into some MPI issues. I compiled it with the gcc and openmpi Ubuntu packages along with libxml2. Running cases in serial mode, my Code_Saturne installation works fine. In parallel mode, however, my simulations crash very randomly. Running with 100 iterations, the computation will sometimes finish without any issues, but typically it crashes on a random iteration, with the stack trace originating in different files (e.g. cs_dot, cs_sles, multigrid, partitioning, etc.). I am using the same case that I have run many times on other computers and clusters, and it should not crash randomly, or at all.

I also compiled Code_Saturne with MPICH and get a consistent error involving /usr/lib/x86_64-linux-gnu/libopen-pal.so.20 and /usr/lib/x86_64-linux-gnu/libpthread.so.0 before the computation stage of the simulation starts. Adding more mystery to the problem, I compiled Code_Saturne with the same settings and packages in a VirtualBox VM on my laptop and I have no problems with parallel simulations there; I compared the config files and they are exactly the same. I also modified code_saturne.cfg after compiling Code_Saturne.

Are there any tests I can run to check that my MPI libraries are properly built? I'm hoping this isn't a stupid mistake on my part, but I'm also hoping it isn't a hardware problem.

Thank you,
Andrew
Yvan Fournier
Posts: 4069
Joined: Mon Feb 20, 2012 3:25 pm

Re: Problems with mpi on Ubuntu 18.04

Post by Yvan Fournier »

Hello,

I am not too sure which tools may help check your Open MPI installation, but in any case, if you get an error in libopen-pal when using MPICH, it means you are mixing Open MPI and MPICH (either compiling with one and running with the other, or something more complex).

You can use ldd on the cs_solver executable to check whether multiple versions of an MPI library are linked. This sometimes happens when libraries used by code_saturne are compiled with another MPI version (especially when you have a default library in the path and are trying to use another).
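For example (the install path below is only a placeholder; cs_solver is usually found under the install's libexec/code_saturne directory):

ldd /path/to/install/libexec/code_saturne/cs_solver | grep -i -E 'mpi|open-pal'
# two different MPI library paths in the output would indicate a mix of MPI installations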

Do you use --with-mpi or CC=mpicc in your build/configure options? The latter is often more robust.

Also, make sure the code_saturne.cfg points to the correct mpiexec command in case the automatic detection fails.
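For reference, the entry looks roughly like this (the exact section and key names may differ between versions, so check the comments in the file installed under etc/):

# in <install-prefix>/etc/code_saturne.cfg
[mpi]
mpiexec = /path/to/mpiexec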

Since version 4.0 is quite old, I do not remember which improvements were made in the build to try to avoid issues with multiple MPI libraries.

Regards,

Yvan
AndrewH
Posts: 47
Joined: Thu Oct 02, 2014 11:03 am

Re: Problems with mpi on Ubuntu 18.04

Post by AndrewH »

Dear Yvan,

I will check the code_saturne.cfg file. I tried compiling Code_Saturne with both --with-mpi and CC=mpicc, but the problem occurred with both. When I installed MPICH, I completely purged the openmpi package and recompiled my Code_Saturne installation, but I will double-check that Open MPI was completely removed.
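A quick way to double-check should be something like the following (assuming the MPI installs came from the Ubuntu packages):

dpkg -l | grep -i openmpi    # any Open MPI packages still installed?
which mpicc mpiexec          # which wrappers are actually found on the PATH
mpiexec --version            # should report MPICH, not Open MPI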

Thank you,
Andrew
AndrewH
Posts: 47
Joined: Thu Oct 02, 2014 11:03 am

Re: Problems with mpi on Ubuntu 18.04

Post by AndrewH »

Hello Yvan,

I'm still experiencing solver errors when trying to run a computation. I tried running Code_Saturne on CentOS 8 and ran into the same troubles I had on Ubuntu 18.04. I also upgraded to v6.0.2 with a simpler case, and I still had errors in the solver and preprocessing stages.

With Code_Saturne v6.0.2 on CentOS 8, I used the default gcc compiler (v8.3.1). I compiled both MPICH v3.3.2 and Open MPI v4.0.2. I compiled Code_Saturne with the following configurations:

./configure --prefix=/home/andrew/build/cs-6.0.2_mpich-debug --without-salome --disable-gui --with-libxml2=/home/andrew/build/libxml2-2.9.10 CC=/home/andrew/build/gcc-8.3.1_mpich_3.3.2-debug/bin/mpicc CXX=/home/andrew/build/gcc-8.3.1_mpich_3.3.2-debug/bin/mpicxx FC=/bin/gfortran PYTHON=/home/andrew/build/python-2.7.17/bin/python2.7 --enable-debug CFLAGS=-g CXXFLAGS=-g FCFLAGS=-g

./configure --prefix=/home/andrew/build/cs-6.0.2_openmpi-debug --without-salome --disable-gui --with-libxml2=/home/andrew/build/libxml2-2.9.10 CC=/home/andrew/build/gcc-8.3.1_openmpi-4.0.2/bin/mpicc CXX=/home/andrew/build/gcc-8.3.1_openmpi-4.0.2/bin/mpicxx FC=/bin/gfortran PYTHON=/home/andrew/build/python-2.7.17/bin/python2.7 --enable-debug CFLAGS=-g CXXFLAGS=-g FCFLAGS=-g



I also built a serial debug version:

./configure --prefix=/home/andrew/build/cs-6.0.2_serial-debug --without-salome --disable-gui --with-libxml2=/home/andrew/build/libxml2-2.9.10 CC=/bin/gcc CXX=/bin/g++ FC=/bin/gfortran PYTHON=/home/andrew/build/python-2.7.17/bin/python2.7 --enable-debug


When I tried running with the MPICH library, Code_Saturne successfully ran once for 10 iterations, crashed twice in the solver stage, crashed twice in the partitioning stage, and crashed once while compiling the user source file.

When I tried running with the openmpi library, Code_Saturne crashed 4 times in the preprocessing stage and once in the solver stage.

When I tried running in serial mode, Code_Saturne successfully ran twice for 10 iterations and crashed once in the preprocessing stage.

For all the computations, I converted the mesh to the native mesh format prior to running, to avoid using the cgns/hdf5 libraries. The computation uses only one user source file, to define the inlet boundary; everything else is defined in the xml file. I make sure to export the proper paths at the beginning of each computation so that the correct MPI library is used.

If I start my computer, the first computation will typically run successfully or crash in the solver stage; if I try to launch another computation, it will typically crash in the preprocessing stage or while compiling the one Fortran file. I'm not sure whether this means there is a memory leak. I ran Code_Saturne with valgrind and attached the output.
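For completeness, the exports I do before launching look roughly like this (paths are those of the builds above; lib may be lib64 depending on how MPICH installed itself):

export PATH=/home/andrew/build/gcc-8.3.1_mpich_3.3.2-debug/bin:$PATH
export LD_LIBRARY_PATH=/home/andrew/build/gcc-8.3.1_mpich_3.3.2-debug/lib:$LD_LIBRARY_PATH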


I compiled openmpi with:
./configure --prefix=/home/andrew/build/gcc-8.3.1_openmpi-4.0.2 CC=/bin/gcc CXX=/bin/g++ --disable-mpi-fortran

And I compiled mpich with:
./configure --prefix=/home/andrew/build/gcc-8.3.1_mpich_3.3.2-debug CC=/bin/gcc CXX=/bin/g++ --disable-fortran


Furthermore, I updated the BIOS on my computer to the latest version. I ran hard drive and RAM checks, but both came back with no problems. I don't know what else to test. Could there be an incompatibility with my processor or motherboard? It seems very strange that I have no problems running Code_Saturne on other clusters and computers, only on my new workstation.
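For reference, the RAM can also be retested from within Linux with memtester; the size and pass count below are only examples:

sudo memtester 2048M 3    # lock and test 2 GiB of RAM for 3 passes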

Best regards,
Andrew
AndrewH
Posts: 47
Joined: Thu Oct 02, 2014 11:03 am

Re: Problems with mpi on Ubuntu 18.04

Post by AndrewH »

It didn't attach the files.
Attachments
openmpi-4.0.2.zip
(301.89 KiB) Downloaded 268 times
mpich3.3.2.zip
(333.97 KiB) Downloaded 266 times
config_files.zip
(65.97 KiB) Downloaded 263 times
AndrewH
Posts: 47
Joined: Thu Oct 02, 2014 11:03 am

Re: Problems with mpi on Ubuntu 18.04

Post by AndrewH »

Additional files.
Attachments
valgrind_runs.zip
(652.75 KiB) Downloaded 265 times
serial.zip
(133.21 KiB) Downloaded 257 times
Yvan Fournier
Posts: 4069
Joined: Mon Feb 20, 2012 3:25 pm

Re: Problems with mpi on Ubuntu 18.04

Post by Yvan Fournier »

Dear Andrew,

Since it seems you were able to reproduce a crash even in serial mode, I guess we can leave parallel runs aside and concentrate on the serial case, which will be easier to analyze, especially with Valgrind.

Also, in one of the runs, you seem to be using quite a few OpenMP threads. I do not know what feedback you usually have, but in most cases, we do not observe good speedup beyond 2 threads, except in CDO cases. And even n MPI processes x 2 OpenMP threads is usually slower than 2n MPI processes, except at higher MPI process counts.

For debugging this case/machine, I would suggest at least setting the number of threads to 1, or even installing the code without OpenMP support (--disable-openmp at configure time).
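Concretely, either of the following should do (the configure line is simply your existing one with the extra option appended):

export OMP_NUM_THREADS=1    # force a single OpenMP thread at run time
./configure <your existing options> --disable-openmp    # or rebuild without OpenMP support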

Also, in theory, the use of OpenMP threads leads to face renumbering, which is normally checked at the end of the renumbering, but I am not 100% sure that check is strict enough, given some observed crashes (with a cleaner error message) at higher thread counts.

If you can have a Valgrind run with an error on a single proc, that would probably be quite useful.

A build with Address Sanitizer would also be interesting (not compatible with Valgrind, but both are useful); for Address Sanitizer, gcc 8 is recent enough, so simply add CFLAGS=-fsanitize=address FCFLAGS=-fsanitize=address LIBS=-lasan to the "configure" options. Another sanitizer, UBSan (the undefined behavior sanitizer), can also be of interest.
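For example, starting from your MPICH debug configure line, that would give something along these lines (only the flag-related options change; the prefix is just a suggestion to keep the builds separate):

./configure --prefix=/home/andrew/build/cs-6.0.2_mpich-asan <other options as before> CFLAGS=-fsanitize=address FCFLAGS=-fsanitize=address LIBS=-lasan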

If the issue is very low level and hardware-support related (as we can suspect), possibly forcing a slightly older architecture in the compiler flags might do the trick.

Best regards,

Yvan
AndrewH
Posts: 47
Joined: Thu Oct 02, 2014 11:03 am

Re: Problems with mpi on Ubuntu 18.04

Post by AndrewH »

Dear Yvan,

I'm not sure why Code_Saturne was using multiple threads. I didn't tell it to use multiple threads, and it only appeared to happen when running valgrind. When I recompiled Code_Saturne with --disable-openmp, it only used one thread when running with valgrind.

I ran 6 computations with valgrind and --disable-openmp; Code_Saturne crashed 5 times and ran successfully once. I attached the computation folders.

I also compiled a different CS-6.0.2 build with "CFLAGS=-fsanitize=address FCFLAGS=-fsanitize=address LIBS=-lasan". I have run 4 computations and all 4 ran successfully without any crashes. I'm running a fifth computation with 1000 iterations; it is on iteration 141 and running okay so far.

I also tried running CS-6.0.2 on Ubuntu 18.04 with valgrind, and I believe it corrupted Ubuntu while running. The computation just froze, with no additional output in the listing or the valgrind output files. When I checked the computation this morning, it had been frozen for 4 hours with no output. I was also running a second serial computation overnight on Ubuntu, and it crashed because the file system had become read-only and the results could not be written. Upon restarting Ubuntu, I got a message that the disk was corrupted and I had to run fsck /dev/sda1 to fix it. There were no error messages in Ubuntu's logs and no crash reports.


Also, what do you mean by "possibly forcing a slightly older architecture in the compiler flags might do the trick"?

Best regards,
Andrew
Attachments
valgrind_runs.zip
(652.75 KiB) Downloaded 255 times
Yvan Fournier
Posts: 4069
Joined: Mon Feb 20, 2012 3:25 pm

Re: Problems with mpi on Ubuntu 18.04

Post by Yvan Fournier »

Hello Andrew,

The OpenMPI functions make the Valgrind output somewhat unreadable (as there can be many false positives), which is why reproducing the error on a single process was interesting. I did not expect it to crash the system. This is definitely very strange behavior.

code_saturne works mostly in userspace, except for a few system calls for IO, and should not even be able to crash the system, unless using up all resources.

It is interesting to see that the version compiled with the sanitizer works better (while it should crash earlier, with instrumentation info, in case of a programming error in the code). Unfortunately, sanitizer builds are slower than regular debug builds (not as much overhead as Valgrind, but still not ideal for production runs). So this mainly confirms the issue probably won't be found in the code's sources but in some subtle system/library/possibly hardware interaction.

By "forcing" an older architecture I was thinking of the "-march" gcc options (such as "-march=x86-64 for a generic X88-64 architecture, but also other more recent ones; see here https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html for options for the current GCC).
This can be used for each of CFLAGS, CPPFLAGS, and FCFLAGS.
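As an illustration (the target below is just an example; any slightly older architecture you want to test works the same way):

./configure <other options as before> CFLAGS="-g -march=x86-64" CXXFLAGS="-g -march=x86-64" FCFLAGS="-g -march=x86-64"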

Another possible issue might be some subtle mix between the compiler libraries of the base Ubuntu system and those you installed (for example, if some sub-path is not in the defaults, or some dependency pulls in both).

Running "ldd cs_sover" can help check if some library appears mutliple times with different paths (I don't remember if we already tried this).
Also, following the same logic, I do not remember if you tried running code_saturne using the default Ubuntu compiler (which probably has only a slight impact on performance) instead of the additional one you compiled ? Again, suspecting some subtle mix of system or compiler run-time libraries.

Best regards,

Yvan
AndrewH
Posts: 47
Joined: Thu Oct 02, 2014 11:03 am

Re: Problems with mpi on Ubuntu 18.04

Post by AndrewH »

Hi Yvan,

I think I may have solved my problem. I removed two of the RAM sticks from my computer and received no errors after running several simulations in serial or parallel, with either CS-4.0.1 or CS-6.0.2. When I swapped in the other two sticks instead, I regularly got crashes. I ran a memory test again on the bad pair and the test still showed no errors. I guess the RAM stick(s) have a defect that the memory tests cannot detect. I will let you know if my problems reoccur while using the 'good' RAM sticks.

Best regards,
Andrew