Error in parallelization with different # of processors
Hello,
I'm trying to simulate the operating conditions of a gas oven. The simulation went well until I tried to run it in parallel on several processors. When I use 12 processors, the simulation stops at 1176 iterations with the following error message:
SIGTERM signal (termination) received.
--> computation interrupted by environment.
Call stack:
1: 0x7f543f2f01e0 <opal_progress+0x50> (libmpi.so.1)
2: 0x7f543f23df75 <ompi_request_default_wait_all+0x145> (libmpi.so.1)
3: 0x7f543abf965e <ompi_coll_tuned_sendrecv_actual+0x10e> (mca_coll_tuned.so)
4: 0x7f543ac019aa <ompi_coll_tuned_barrier_intra_bruck+0x9a> (mca_coll_tuned.so)
5: 0x7f543f24b1c2 <PMPI_Barrier+0x72> (libmpi.so.1)
6: 0x7f544024e63d <cs_halo_sync_var_strided+0x78d> (libsaturne.so.0)
7: 0x7f54403e9e52 <+0x25ee52> (libsaturne.so.0)
8: 0x7f54403ee3db <cgdvec_+0x3db> (libsaturne.so.0)
9: 0x7f54404327e9 <grdvec_+0x189> (libsaturne.so.0)
10: 0x7f5440515106 <vissst_+0x246> (libsaturne.so.0)
11: 0x7f54402f6840 <phyvar_+0x1060> (libsaturne.so.0)
12: 0x7f5440324d39 <tridim_+0xe91> (libsaturne.so.0)
13: 0x7f544020b6cd <caltri_+0x27e9> (libsaturne.so.0)
14: 0x7f54401e3725 <cs_run+0xa55> (libsaturne.so.0)
15: 0x7f54401e3885 <main+0x155> (libsaturne.so.0)
16: 0x3fa801ed1d <__libc_start_main+0xfd> (libc.so.6)
17: 0x400809 <> (cs_solver)
End of stack
But when I use 8 processors, the simulation stops at 1428 iterations with the same error message. My simulation also reaches 2000 iterations if only one processor is used. Does anybody know the reason for this behavior when different numbers of processors are used?
Code_Saturne was compiled using openmpi-1.6.5. I already tried to compile CS with another OpenMPI library (1.4.3), but the results are the same. I don't know what the problem could be. Any help or comments would be appreciated.
My version of Code_Saturne is 3.0.5.
Regards,
Juan Felipe Monsalvo
Attachments:
- Oven_parallel.tar: Listing, error message and xml of my simulations (7.2 MiB)
Re: Error in parallelization with different # of processors
Hello,
Could you post your mesh? If you have any user subroutines, could you post them too?
It appears both cases crash when computing physical properties, due to a floating point error. Looking at your xml file, and your listing, I assume the temperature goes down for some reason (look at its evolution on the time step prior to the crash) and you have a division by zero for the density, but it might be something else.
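As an illustration of the kind of failure described above, here is a minimal sketch, assuming an ideal-gas-type density law rho = p0 / (R * T), which may not be the exact property law used in this setup: if the temperature drifts toward 0 K the density blows up, and a zero temperature gives a division by zero in the physical property update.

```c
/* Minimal sketch, NOT actual Code_Saturne code: shows how a non-physical
 * temperature drop can break a density update of the form rho = p0 / (R*T).
 * All values below are purely illustrative. */
#include <math.h>
#include <stdio.h>

int main(void)
{
  const double p0 = 101325.0;   /* reference pressure [Pa] (assumed) */
  const double r  = 287.0;      /* specific gas constant of air [J/(kg.K)] */
  const double temps[] = {293.15, 100.0, 1.0, 0.0};  /* temperatures [K] */

  for (int i = 0; i < 4; i++) {
    double rho = p0 / (r * temps[i]);
    printf("T = %8.2f K -> rho = %12.4g kg/m3%s\n", temps[i], rho,
           (!isfinite(rho) || rho <= 0.0) ? "  <-- invalid" : "");
  }
  return 0;
}
```

With floating-point exception trapping enabled, the division by zero at T = 0 aborts the run instead of silently producing an infinite density.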
Also, your max CFL is a bit high, so reducing the time step a bit might be a good idea, but I doubt this is directly the cause of the crash.
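For reference, with the usual convective Courant number definition CFL = u * dt / dx, the maximum CFL scales linearly with the time step, so halving dt halves the maximum CFL. A small sketch (the velocity and cell size below are purely illustrative):

```c
/* Sketch of the Courant number scaling CFL = u * dt / dx (assumed
 * convective definition): halving the time step halves the CFL. */
#include <stdio.h>

int main(void)
{
  const double u  = 10.0;   /* hypothetical local velocity [m/s] */
  const double dx = 0.01;   /* hypothetical local cell size [m] */

  for (double dt = 0.005; dt > 0.001; dt /= 2.0)
    printf("dt = %.5f s -> CFL = %.2f\n", dt, u * dt / dx);

  return 0;
}
```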
If your results are very unsteady, the issue might be related to your mesh or time step, and not be a parallelism bug, but a robustness issue. If they are rather steady, it is probably a bug.
With your mesh, I should be able to reproduce the issue, and either provide recommendations, or if it is a bug, fix it.
Regards,
Yvan
Re: Error in parallelization with different # of processors
Thank you, Yvan Fournier, for your answer and your interest in my problem.
I don't use any user subroutines. The mesh is too big to attach here, so you can use this link to download it.
http://mecanica.eafit.edu.co/~jmonsa13/partager
The username is : partager
The password is : partager
Anyway, I will reduce my time step by half and relaunch the simulation to see if there is any improvement.
Sincerely,
Juan Felipe Monsalvo
Re: Error in parallelization with different # of processors
Reducing the time step by half didn't help. The simulation still stops at some iteration before convergence is reached, and the iteration at which it stops varies with the number of processors used. Have you been able to download the mesh and reproduce the issue? Do you have any other recommendations, or could this be a bug?
Regards,
Juan Felipe Monsalvo
Re: Error in parallelization with different # of processors
Hello,
I downloaded your case and ran a few "reference" time steps, but debugging would be much easier on a smaller mesh (assuming there is a bug).
If you have a coarser version of the mesh (using the same setup), it would be quite interesting, and testing would be much faster. Otherwise, I'll run the tests I can, but as each iteration is slow, I probably won't have any news for you for a few days.
Regards,
Yvan
Re: Error in parallelization with different # of processors
Hello,
At the same link you can find 2 files (Mefisto2D_BL3_Coarse.med and Mefisto2D_BL3_Medium.med) that contain coarser meshes, one coarser than the other (Mefisto2D_BL3_Coarse.med). With these new meshes you can increase the time step of the original .xml file.
http://mecanica.eafit.edu.co/~jmonsa13/partager
The username is : partager
The password is : partager
Regards,
Juan Felipe Monsalvo
Re: Error in parallelization with different # of processors
Hello,
I downloaded your meshes and started looking into the case a bit. At the first time step, there is already a small difference in some wall friction terms, but I still need to check whether this is just a parallel logging bug or a synchronization issue, and whether the crash is related to it or not.
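One generic way such small differences can appear between serial and parallel runs (not necessarily what is happening here) is that the order of floating-point sums changes with the partitioning, so globally reduced quantities can differ in their last digits. A minimal MPI sketch of this effect:

```c
/* Minimal sketch (not Code_Saturne code): the same global sum computed
 * with different numbers of ranks can differ in the last bits, because
 * floating-point addition is not associative and the summation order
 * depends on the partitioning. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  const int n_global = 1000000;   /* size of a hypothetical global array */
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Each rank sums its own subset of synthetic values. */
  double local_sum = 0.0;
  for (int i = rank; i < n_global; i += size)
    local_sum += 1.0 / (1.0 + i);

  double global_sum = 0.0;
  MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
             0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("n_ranks = %d  sum = %.17g\n", size, global_sum);

  MPI_Finalize();
  return 0;
}
```

Run with different numbers of ranks (for example mpiexec -n 1, -n 8 and -n 12), this typically prints sums that differ in the last few digits.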
I'll keep you informed (probably not before late next week, as I have a very busy week coming).
Regards,
Yvan
Re: Error in parallelization with different # of processors
Hello,
Thanks, Yvan, for keeping me informed. Don't worry, take your time. It's very kind of you to take an interest in my problem. I will wait for any further advice you can give me.
Regards,
Juan Felipe Monsalvo
Re: Error in parallelization with different # of processors
Hello,
I ran a few detailed tests on the coarse mesh, but found no issue (actually, I did find a bug, but in the serial case: even with the extended neighborhood, the velocity gradient reconstruction initialization only uses the standard neighborhood; this is now fixed in trunk and will be included in the upcoming 3.0.6 release).
I guess I'll need to run your larger case on enough processors for a large number of iterations, but this means quite a long computation time. How long did your computations take? In debug mode, the code is about 3x slower...
Regards,
Yvan
Re: Error in parallelization with different # of processors
Hello,
I was able to run a complete computation of my case using 16 cores (I attach the .xml file used). For this I just reduced the time step to 0.002 and changed the temperature condition of the opening boundary from 273.15 to 293.15. The computation took around 2 days to complete 4000 iterations. But if I use this same setup with a time step of 0.005, the simulation stops at 622 iterations; and with the same setup but the original opening temperature of 273.15 and a time step of 0.002, the simulation reaches 955 iterations and fails.
Maybe the problem was not in the parallelization and it was indeed a robustness problem of the model, although with the initial setup I gave you, the simulation fails at different iterations depending on the number of processors used.
Thank you for the help and support provided. I am glad to hear that this thread served to find and fix a bug.
Regards,
Juan Felipe Monsalvo
Attachments:
- New_setup_fail.tar.gz: Files concerning the two simulations that failed (93.62 KiB)
- Main.xml (12.61 KiB)