Re: How many process can I run on this machine?
Posted: Wed Feb 21, 2018 3:25 pm
Hello Jonas,
Based on your post, I assume you are running the test on a single machine (not a cluster) with 40 physical cores, so there is no overhead from network communication. I ran some tests to measure the scalability of CS, and the flat MPI option gave the best result (lowest run time) compared with the hybrid alternative, despite its memory consumption.
To understand the performance difference between flat MPI and hybrid, I ran the test cases under callgrind, cachegrind and massif from the Valgrind tool suite, and found that the load across the threads was unbalanced: Thread_0 always carries more load than the others (I attach two figs). Looking deeper into the code, the coarsening operations of the multigrid solver are executed only by Thread_0, because the mesh is not partitioned among the threads. When you add more processes, this work is distributed among them, so when you run flat MPI the load is perfectly balanced between the processes.
This is a brief explanation of what I found; I am not 100% sure that this assessment is correct.
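To make the argument concrete, here is a toy model in Python. It is only a sketch under assumptions of mine, not measured data: the function `relative_time` and the value `SERIAL_FRACTION = 0.10` are hypothetical, and the model ignores MPI communication and memory effects. It assumes that a fixed share of each process's work (for example the multigrid coarsening) runs on Thread_0 only, while the rest is shared evenly by all threads.

```python
def relative_time(n_procs, n_threads, serial_fraction):
    """Estimated run time relative to one process with one thread.

    serial_fraction is the (hypothetical) share of each process's work
    that only thread 0 executes, e.g. multigrid coarsening.
    """
    threaded = (1.0 - serial_fraction) / n_threads  # shared by all threads
    unthreaded = serial_fraction                    # done by thread 0 alone
    return (threaded + unthreaded) / n_procs


CORES = 40
SERIAL_FRACTION = 0.10  # assumed value, for illustration only

# Every split below uses all 40 cores: processes x threads per process.
for n_procs, n_threads in [(40, 1), (20, 2), (10, 4), (4, 10), (1, 40)]:
    assert n_procs * n_threads == CORES
    t = relative_time(n_procs, n_threads, SERIAL_FRACTION)
    print(f"{n_procs:2d} procs x {n_threads:2d} threads: relative time {t:.4f}")
```

In this model, any nonzero unthreaded share makes the estimated time grow as cores are moved from processes to threads, which is consistent with flat MPI being fastest in the tests. How large that share really is would have to come from the profiling data.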
Regards,
Luciano
JonasA wrote: Therefore, it seems that it is more efficient to use all the virtual threads than asking Code-Saturne to divide the process by virtual threading. Does this conclusion make sense in general?
Jonas
This is a little confusing. CS can distribute the work using OpenMP (threads), MPI (processes), or hybrid (MPI + OpenMP), and it is the user who tells CS how to parallelize the problem when launching the solver. From your results, the best option is to run CS with flat MPI (process) parallelization.
Regards,
Luciano