
Error in parallelization with different # of processors

Posted: Fri Oct 03, 2014 6:09 pm
by jmonsa13
Hello,
I'm trying to simulate the operating conditions of a gas oven. The simulation went well until I tried to run it in parallel on several processors. When I use 12 processors, the simulation stops at 1176 iterations with the following error message:

SIGTERM signal (termination) received.
--> computation interrupted by environment.
Call stack:
1: 0x7f543f2f01e0 <opal_progress+0x50> (libmpi.so.1)
2: 0x7f543f23df75 <ompi_request_default_wait_all+0x145> (libmpi.so.1)
3: 0x7f543abf965e <ompi_coll_tuned_sendrecv_actual+0x10e> (mca_coll_tuned.so)
4: 0x7f543ac019aa <ompi_coll_tuned_barrier_intra_bruck+0x9a> (mca_coll_tuned.so)
5: 0x7f543f24b1c2 <PMPI_Barrier+0x72> (libmpi.so.1)
6: 0x7f544024e63d <cs_halo_sync_var_strided+0x78d> (libsaturne.so.0)
7: 0x7f54403e9e52 <+0x25ee52> (libsaturne.so.0)
8: 0x7f54403ee3db <cgdvec_+0x3db> (libsaturne.so.0)
9: 0x7f54404327e9 <grdvec_+0x189> (libsaturne.so.0)
10: 0x7f5440515106 <vissst_+0x246> (libsaturne.so.0)
11: 0x7f54402f6840 <phyvar_+0x1060> (libsaturne.so.0)
12: 0x7f5440324d39 <tridim_+0xe91> (libsaturne.so.0)
13: 0x7f544020b6cd <caltri_+0x27e9> (libsaturne.so.0)
14: 0x7f54401e3725 <cs_run+0xa55> (libsaturne.so.0)
15: 0x7f54401e3885 <main+0x155> (libsaturne.so.0)
16: 0x3fa801ed1d <__libc_start_main+0xfd> (libc.so.6)
17: 0x400809 <> (cs_solver)
End of stack

But when I use 8 processors, the simulation stops at 1428 iterations with the same error message. Also, the simulation reaches the full 2000 iterations when only one processor is used. Does anybody know the reason for this behavior when different numbers of processors are used?

Code_Saturne was compiled using openmpi-1.6.5. I already tried to compile CS with another OpenMPI library (1.4.3), but the results are the same. I don't know what the problem could be. Any help or comments would be appreciated.

My version of Code_Saturne is 3.0.5.

Regards,

Juan Felipe Monsalvo

Re: Error in parallelization with different # of processors

Posted: Sat Oct 04, 2014 6:49 pm
by Yvan Fournier
Hello,

Could you post your mesh ? If you have any user subroutines, could you post them too ?

It appears both cases crash when computing physical properties, due to a floating point error. Looking at your xml file, and your listing, I assume the temperature goes down for some reason (look at its evolution on the time step prior to the crash) and you have a division by zero for the density, but it might be something else.
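
As a purely illustrative sketch (none of the names or values below come from Code_Saturne or from your case), here is how an ideal-gas-type density law breaks down when the temperature drifts toward zero, and how a temperature floor guards against the resulting division by (nearly) zero:

/* Hypothetical illustration (not the solver's actual property routine):
   an ideal-gas density law, rho = p / (r_gas * T), diverges as T -> 0,
   so a crash while computing physical properties can come from a
   temperature that has drifted to (or below) zero. */
#include <stdio.h>

static double gas_density(double p, double t_kelvin)
{
  const double r_gas = 287.0;  /* specific gas constant of air, J/(kg.K) */
  const double t_min = 1.0;    /* safeguard floor (assumption for this sketch) */
  double t = (t_kelvin > t_min) ? t_kelvin : t_min;
  return p / (r_gas * t);
}

int main(void)
{
  /* With the floor, a nonphysical temperature no longer causes a
     division by (nearly) zero; without it, rho would blow up. */
  printf("rho at 300 K : %g kg/m3\n", gas_density(101325.0, 300.0));
  printf("rho at 1e-9 K: %g kg/m3\n", gas_density(101325.0, 1e-9));
  return 0;
}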

Also, your max CFL is a bit high, so reducing the time step a bit might be a good idea, but I doubt this is directly the cause of the crash.
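
As a rough, order-of-magnitude illustration of why reducing the time step lowers the CFL number (the velocity and cell size below are assumptions, not values taken from the listing):

/* Illustrative only: CFL = u * dt / dx, so halving dt halves the CFL.
   The velocity and cell size are assumed, not taken from the case. */
#include <stdio.h>

int main(void)
{
  double u  = 2.0;    /* characteristic velocity, m/s (assumed) */
  double dx = 0.004;  /* characteristic cell size, m (assumed)  */
  double dt = 0.005;  /* time step, s                           */

  printf("CFL with dt   = %g\n", u * dt / dx);         /* 2.5  */
  printf("CFL with dt/2 = %g\n", u * (dt / 2.0) / dx); /* 1.25 */
  return 0;
}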

If your results are very unsteady, the issue might be related to your mesh or time step, and not be a parallelism bug, but a robustness issue. If they are rather steady, it is probably a bug.
With your mesh, I should be able to reproduce the issue, and either provide recommendations, or if it is a bug, fix it.

Regards,

Yvan

Re: Error in parallelization with different # of processors

Posted: Sat Oct 04, 2014 8:19 pm
by jmonsa13
Thank you, Yvan Fournier, for your answer and your interest in my problem.

I don't use any user subroutines. The mesh is too big to attach here, so you can use this link to download it:

http://mecanica.eafit.edu.co/~jmonsa13/partager

The username is : partager
The password is : partager

Anyway, I will reduce my time step by half and relaunch the simulation to see if there is any improvement.

Sincerely,
Juan Felipe Monsalvo

Re: Error in parallelization with different # of processors

Posted: Tue Oct 07, 2014 5:04 pm
by jmonsa13
Reducing the time step by half didn't help. The simulation still stops after some iterations without reaching convergence, and the iteration at which it stops varies with the number of processors used. Have you been able to download the mesh and reproduce the issue? Do you have any other recommendations, or is this perhaps a bug?

Regards,
Juan Felipe Monsalvo

Re: Error in parallelization with different # of processors

Posted: Wed Oct 08, 2014 10:05 am
by Yvan Fournier
Hello,

I downloaded your case and ran a few "reference" time steps, but debugging would be much easier on a smaller mesh (assuming there is a bug).

If you have a coarser version of the mesh (using the same setup), it would be quite interesting, and testing would be much faster. Otherwise, I'll run the tests I can, but as each iteration is slow, I probably won't give you any news before a few days.

Regards,

Yvan

Re: Error in parallelization with different # of processors

Posted: Wed Oct 08, 2014 5:50 pm
by jmonsa13
Hello,
At the same link you can find two files (Mefisto2D_BL3_Coarse.med and Mefisto2D_BL3_Medium.med) that contain coarse meshes, Mefisto2D_BL3_Coarse.med being the coarser of the two. With these new meshes you can increase the time step of the original .xml file.

http://mecanica.eafit.edu.co/~jmonsa13/partager

The username is : partager
The password is : partager

Regards,
Juan Felipe Monsalvo

Re: Error in parallelization with different # of processors

Posted: Fri Oct 10, 2014 11:49 am
by Yvan Fournier
Hello,

I downloaded your meshes and started looking into the case a bit. At the first time step, there is already a small difference in some wall friction terms, but I still need to check whether this is just a parallel logging or synchronization issue, and whether the crash is related to it.

I'll keep you informed (probably not before late next week, as I have a very busy week coming).

Regards,

Yvan

Re: Error in parallelization with different # of processors

Posted: Fri Oct 10, 2014 4:47 pm
by jmonsa13
Hello,
Thanks, Yvan, for keeping me informed. Don't worry and take your time; your interest in my problem is very kind. I will wait for any further advice you can give me.

Regards,
Juan Felipe Monsalvo

Re: Error in parallelization with different # of processors

Posted: Fri Oct 17, 2014 11:21 pm
by Yvan Fournier
Hello,

I ran a few detailed tests on the coarse mesh, but found no issue (actually, I did find a bug, but in the serial case: even with the extended neighborhood, the velocity gradient reconstruction initialization only uses the standard neighborhood: this is now fixed in trunk, and will be fixed in the next 3.0.6 release).

I guess I'll need to run your larger case on enough processors for a large number of iterations, but this means quite a large computation time. How long did your computations take? In debug mode, the code is about 3x slower...

Regards,

Yvan

Re: Error in parallelization with different # of processors

Posted: Mon Oct 20, 2014 4:56 pm
by jmonsa13
Hello,

I was able to run a complete computation of my case using 16 cores (I attach the .xml file used). For this I just reduced the time step to 0.002 and changed the temperature condition of the opening boundary from 273.15 to 293.15. The computation took around 2 days to complete 4000 iterations. However, with this same setup and a time step of 0.005, the simulation stops at 622 iterations; and with this setup, the original opening temperature of 273.15, and a time step of 0.002, the simulation reaches 955 iterations and then fails.

Maybe the problem was not in the parallelization after all, but was indeed a robustness problem of the model. Still, with the initial setup I gave you, the simulation fails at different iterations depending on the number of processors used.

Thank you for the help and support provided. I am glad that this thread helped to find and fix a bug.

Regards,
Juan Felipe Monsalvo