Error in parallelization with different # of processors
Hello,
I'm trying to simulate the operating conditions of a gas oven. The simulation went well until I tried to run it in parallel on several processors. When I use 12 processors, the simulation stops at 1176 iterations with the following error message:
SIGTERM signal (termination) received.
--> computation interrupted by environment.
Call stack:
1: 0x7f543f2f01e0 <opal_progress+0x50> (libmpi.so.1)
2: 0x7f543f23df75 <ompi_request_default_wait_all+0x145> (libmpi.so.1)
3: 0x7f543abf965e <ompi_coll_tuned_sendrecv_actual+0x10e> (mca_coll_tuned.so)
4: 0x7f543ac019aa <ompi_coll_tuned_barrier_intra_bruck+0x9a> (mca_coll_tuned.so)
5: 0x7f543f24b1c2 <PMPI_Barrier+0x72> (libmpi.so.1)
6: 0x7f544024e63d <cs_halo_sync_var_strided+0x78d> (libsaturne.so.0)
7: 0x7f54403e9e52 <+0x25ee52> (libsaturne.so.0)
8: 0x7f54403ee3db <cgdvec_+0x3db> (libsaturne.so.0)
9: 0x7f54404327e9 <grdvec_+0x189> (libsaturne.so.0)
10: 0x7f5440515106 <vissst_+0x246> (libsaturne.so.0)
11: 0x7f54402f6840 <phyvar_+0x1060> (libsaturne.so.0)
12: 0x7f5440324d39 <tridim_+0xe91> (libsaturne.so.0)
13: 0x7f544020b6cd <caltri_+0x27e9> (libsaturne.so.0)
14: 0x7f54401e3725 <cs_run+0xa55> (libsaturne.so.0)
15: 0x7f54401e3885 <main+0x155> (libsaturne.so.0)
16: 0x3fa801ed1d <__libc_start_main+0xfd> (libc.so.6)
17: 0x400809 <> (cs_solver)
End of stack
But when I use 8 processors, the simulation stops at 1428 iterations with the same error message. My simulation also reaches 2000 iterations if only one processor is used. Does anybody know the reason for this behavior when different numbers of processors are used?
Code_Saturne was compiled using openmpi-1.6.5. I already tried to compile CS with another OpenMPI library (1.4.3), but the results are the same. I don't know what the problem could be. Any help or comments would be appreciated.
My version of Code_Saturne is 3.0.5.
Regards,
Juan Felipe Monsalvo
Attachments:
- Oven_parallel.tar: Listing, error message and xml of my simulations (7.2 MiB)
Re: Error in parallelization with different # of processors
Hello,
Could you post your mesh? If you have any user subroutines, could you post them too?
It appears both cases crash when computing physical properties, due to a floating point error. Looking at your xml file, and your listing, I assume the temperature goes down for some reason (look at its evolution on the time step prior to the crash) and you have a division by zero for the density, but it might be something else.
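As an illustration of the kind of failure described above, here is a minimal sketch, assuming an ideal-gas-type density law rho = p0 / (R * T), which may not be the exact property law used in this setup: if the temperature drifts toward 0 K the density blows up, and a zero temperature gives a division by zero in the physical property update.

```c
/* Minimal sketch, NOT actual Code_Saturne code: shows how a non-physical
 * temperature drop can break a density update of the form rho = p0 / (R*T).
 * All values below are purely illustrative. */
#include <math.h>
#include <stdio.h>

int main(void)
{
  const double p0 = 101325.0;   /* reference pressure [Pa] (assumed) */
  const double r  = 287.0;      /* specific gas constant of air [J/(kg.K)] */
  const double temps[] = {293.15, 100.0, 1.0, 0.0};  /* temperatures [K] */

  for (int i = 0; i < 4; i++) {
    double rho = p0 / (r * temps[i]);
    printf("T = %8.2f K -> rho = %12.4g kg/m3%s\n", temps[i], rho,
           (!isfinite(rho) || rho <= 0.0) ? "  <-- invalid" : "");
  }
  return 0;
}
```

With floating-point exception trapping enabled, the division by zero at T = 0 aborts the run instead of silently producing an infinite density.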
Also, your max CFL is a bit high, so reducing the time step a bit might be a good idea, but I doubt this is directly the cause of the crash.
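For reference, with the usual convective Courant number definition CFL = u * dt / dx, the maximum CFL scales linearly with the time step, so halving dt halves the maximum CFL. A small sketch (the velocity and cell size below are purely illustrative):

```c
/* Sketch of the Courant number scaling CFL = u * dt / dx (assumed
 * convective definition): halving the time step halves the CFL. */
#include <stdio.h>

int main(void)
{
  const double u  = 10.0;   /* hypothetical local velocity [m/s] */
  const double dx = 0.01;   /* hypothetical local cell size [m] */

  for (double dt = 0.005; dt > 0.001; dt /= 2.0)
    printf("dt = %.5f s -> CFL = %.2f\n", dt, u * dt / dx);

  return 0;
}
```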
If your results are very unsteady, the issue might be related to your mesh or time step, and not be a parallelism bug, but a robustness issue. If they are rather steady, it is probably a bug.
With your mesh, I should be able to reproduce the issue, and either provide recommendations, or if it is a bug, fix it.
Regards,
Yvan
Re: Error in parallelization with different # of processors
Thank you, Yvan Fournier, for your answer and your interest in my problem.
I don't use any user subroutines. The mesh is too big to attach here, so you can use this link to download it.
http://mecanica.eafit.edu.co/~jmonsa13/partager
The username is : partager
The password is : partager
Anyway, I will reduce my time step by half and relaunch the simulation to see if there is any improvement.
Sincerely,
Juan Felipe Monsalvo
Re: Error in parallelization with different # of processors
Reducing the time step by half didn't help. The simulation still stops at some iteration before convergence is reached, and the iteration at which it stops varies with the number of processors used. Have you been able to download the mesh and reproduce the issue? Do you have any other recommendations, or could this be a bug?
Regards,
Juan Felipe Monsalvo
Re: Error in parallelization with different # of processors
Hello,
I downloaded your case and ran a few "reference" time steps, but debugging would be much easier on a smaller mesh (assuming there is a bug).
If you have a coarser version of the mesh (using the same setup), it would be quite interesting, and testing would be much faster. Otherwise, I'll run the tests I can, but as each iteration is slow, I probably won't have any news for you for a few days.
Regards,
Yvan
Re: Error in parallelization with different # of processors
Hello,
At the same link you can find 2 files (Mefisto2D_BL3_Coarse.med and Mefisto2D_BL3_Medium.med) that contain coarser meshes, one coarser than the other (Mefisto2D_BL3_Coarse.med). With these new meshes you can increase the time step of the original .xml file.
http://mecanica.eafit.edu.co/~jmonsa13/partager
The username is : partager
The password is : partager
Regards,
Juan Felipe Monsalvo
Re: Error in parallelization with different # of processors
Hello,
I downloaded your meshes and started looking into the case a bit. At the first time step, there is already a small difference in some wall friction terms, but I still need to check whether this is just a parallel logging bug or a synchronization issue, and whether the crash is related to it or not.
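One generic way such small differences can appear between serial and parallel runs (not necessarily what is happening here) is that the order of floating-point sums changes with the partitioning, so globally reduced quantities can differ in their last digits. A minimal MPI sketch of this effect:

```c
/* Minimal sketch (not Code_Saturne code): the same global sum computed
 * with different numbers of ranks can differ in the last bits, because
 * floating-point addition is not associative and the summation order
 * depends on the partitioning. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  const int n_global = 1000000;   /* size of a hypothetical global array */
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Each rank sums its own subset of synthetic values. */
  double local_sum = 0.0;
  for (int i = rank; i < n_global; i += size)
    local_sum += 1.0 / (1.0 + i);

  double global_sum = 0.0;
  MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
             0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("n_ranks = %d  sum = %.17g\n", size, global_sum);

  MPI_Finalize();
  return 0;
}
```

Run with different numbers of ranks (for example mpiexec -n 1, -n 8 and -n 12), this typically prints sums that differ in the last few digits.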
I'll keep you informed (probably not before late next week, as I have a very busy week coming).
Regards,
Yvan
Re: Error in parallelization with different # of processors
Hello,
Thanks, Yvan, for keeping me informed. Don't worry, take your time. It's very kind of you to take an interest in my problem. I will wait for any further advice you can give me.
Regards,
Juan Felipe Monsalvo
Re: Error in parallelization with different # of processors
Hello,
I ran a few detailed tests on the coarse mesh, but found no issue (actually, I did find a bug, but in the serial case: even with the extended neighborhood, the velocity gradient reconstruction initialization only uses the standard neighborhood; this is now fixed in trunk and will be included in the upcoming 3.0.6 release).
I guess I'll need to run your larger case on enough processors for a large number of iterations, but this means quite a long computation time. How long did your computations take? In debug mode, the code is about 3x slower...
Regards,
Yvan
Re: Error in parallelization with different # of processors
Hello,
I was able to run a complete computation of my case using 16 cores (I attach the .xml file used). For this I just reduced the time step to 0.002 and changed the temperature condition of the opening boundary from 273.15 to 293.15. The computation took around 2 days to complete 4000 iterations. But if I use this same setup with a time step of 0.005, the simulation stops at 622 iterations; and with the same setup but the original opening temperature of 273.15 and a time step of 0.002, the simulation reaches 955 iterations and fails.
Maybe the problem was not in the parallelization and it was indeed a robustness problem of the model, although with the initial setup I gave you, the simulation fails at different iterations depending on the number of processors used.
Thank you for the help and support provided. I am glad to hear that this thread served to find and fix a bug.
Regards,
Juan Felipe Monsalvo
Attachments:
- New_setup_fail.tar.gz: Files concerning the two simulations that failed (93.62 KiB)
- Main.xml (12.61 KiB)