code_saturne User's Forum

Posted: **Wed Feb 21, 2018 3:25 pm**

Hello Jonas,

Based on your post, I assume that you are running the test on a single machine (not a cluster) with 40 physical cores, so there is no overhead from network communication. I have run some tests to measure the scalability of CS, and the flat MPI option gives the best result (lowest elapsed time) compared with a hybrid alternative, despite its memory consumption.

To understand the performance difference between flat MPI and hybrid, I ran the test cases under callgrind, cachegrind and massif from the valgrind tool suite, and I found that the load across threads was unbalanced: Thread_0 always carries more load than the others (I attach two figures). Digging into the code, the coarsening operations of the multigrid solver are executed only by Thread_0, because there is no mesh partition at the thread level. When you add more processes, this work is distributed among them, so with flat MPI the load is perfectly balanced between the processes.
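To give a rough feel for how much a Thread_0-only section can cost, here is a minimal sketch of an Amdahl-style toy model. It is not taken from the CS source or from my measurements: the function name and the `serial_fraction` parameter are hypothetical, and the model assumes every rank gets an equal share of work and that nothing else (memory bandwidth, communication) differs between the two runs.

```python
def hybrid_vs_flat_slowdown(serial_fraction: float, threads_per_rank: int) -> float:
    """Ratio of hybrid run time to flat-MPI run time on the same core count.

    Toy model (assumption, not measured):
      - hybrid: each rank runs `serial_fraction` of its work on thread 0 only
        and splits the rest evenly over `threads_per_rank` threads;
      - flat MPI: every core is its own rank, so nothing is serialized.
    With P ranks, T threads and total work W:
      t_hybrid = (W / P) * (s + (1 - s) / T)
      t_flat   = W / (P * T)
    so the ratio is T * s + (1 - s).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if threads_per_rank < 1:
        raise ValueError("threads_per_rank must be at least 1")
    return serial_fraction * threads_per_rank + (1.0 - serial_fraction)


if __name__ == "__main__":
    # Hypothetical fractions, only to show the trend with 2 threads per rank.
    for s in (0.0, 0.05, 0.10, 0.20):
        ratio = hybrid_vs_flat_slowdown(s, threads_per_rank=2)
        print(f"serial fraction {s:.2f}: hybrid/flat time ratio = {ratio:.2f}")
```

In this model, even a small Thread_0-only fraction grows in cost with the number of threads per rank, which would be consistent with flat MPI coming out ahead.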

This is a brief explanation of what I found; I am not 100% sure that this assessment is correct.

JonasA wrote: Therefore, it seems that it is more efficient to use all the virtual threads than asking Code-Saturne to divide the process by virtual threading. Does this conclusion make sense in general?
Jonas

This is a little confusing. CS can distribute the work using OpenMP (threads), MPI (processes) or hybrid (MPI+OpenMP), and it is the user who tells CS how to parallelize the problem when launching the solver. From your results, the best option is to run CS with flat MPI (process) parallelization.
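To make the three options concrete, the sketch below lists every way a given core count can be split into MPI processes times OpenMP threads. The helper name `hybrid_layouts` is hypothetical and this is plain arithmetic, not a CS command; how each layout is actually requested at launch is not shown here.

```python
def hybrid_layouts(total_cores: int) -> list[tuple[int, int]]:
    """Return all (mpi_processes, omp_threads) pairs whose product is total_cores.

    (total_cores, 1) is flat MPI, (1, total_cores) is pure OpenMP,
    and everything in between is a hybrid layout.
    """
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return [
        (total_cores // threads, threads)
        for threads in range(1, total_cores + 1)
        if total_cores % threads == 0
    ]


if __name__ == "__main__":
    for procs, threads in hybrid_layouts(20):
        print(f"{procs:2d} MPI processes x {threads:2d} OpenMP threads")
```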

Regards,

Luciano

Posted: **Wed Feb 21, 2018 3:34 pm**

Hello,

Yes, this is consistent with observations on our side. Usually, using n MPI processes * 2 threads is between 10% faster and 10% slower than n*2 MPI processes * 1 thread, and this may depend on the case and threading library (Intel compilers may have OpenMP runtimes with lower latency and slightly better performance). 15% in your case is not so far from 10%...

When running on a very large number of nodes, there might be a small advantage to the MPI/OpenMP mix, but on smaller runs, this is often not the case...

In any case, did you also try using fewer cores? On Intel Xeon processors, bandwidth limitations often result in full-node performance being only slightly better than half-node performance (though this may depend on the problem size and cache usage).

Best regards,

Yvan

Posted: **Thu Feb 22, 2018 10:02 am**

Thanks for your fast and useful answers. Code_Saturne really has a nice community.

Luciano Garelli wrote: Based on your post I assume that you are running the test in a single machine (not a cluster) with 40 physical cores

Yes, it is tested on a single machine, but it has 2 CPUs with 10 physical cores each, so 20 physical cores in total. With Intel Hyper-Threading, the operating system sees these 20 physical cores as 40 virtual cores.
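The core count can be written out as a small sketch. The function name `core_counts` is hypothetical; `os.cpu_count()` is the standard-library call that reports what the operating system sees, which on a machine like this one would be the virtual (logical) core count rather than the physical one.

```python
import os


def core_counts(sockets: int, cores_per_socket: int, threads_per_core: int) -> tuple[int, int]:
    """Return (physical_cores, logical_cores) for a simple homogeneous machine."""
    physical = sockets * cores_per_socket
    logical = physical * threads_per_core
    return physical, logical


if __name__ == "__main__":
    # 2 CPUs, 10 physical cores each, 2 hardware threads per core.
    physical, logical = core_counts(sockets=2, cores_per_socket=10, threads_per_core=2)
    print(f"physical cores: {physical}, logical cores: {logical}")
    # Logical CPUs reported by the OS on whatever machine runs this script.
    print(f"os.cpu_count() here: {os.cpu_count()}")
```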

Luciano Garelli wrote: From your results the best option is run CS with flat MPI(process) parallelization.

Yes it is. I used the turbomachinery module and 2 meshes.

Yvan Fournier wrote: In any case, did you also try using fewer cores? On Intel Xeon processors, bandwidth limitations often result in full-node performance being only slightly better than half-node performance (but may depend on the problem size and cache usage).

I have tried using fewer cores: 35 and 38 cores in full MPI give roughly equivalent performance (there are more details in the attachment of my previous post). The performance of the 20-core MPI configuration without OpenMP is only a few percent behind 20 cores with OpenMP. But the difference between 20 and 40 cores in MPI is significant: it is ~20%.
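For anyone who wants to turn such timings into a speedup and a parallel efficiency, here is a minimal sketch. The helper names are hypothetical and the elapsed times in the example are made-up placeholders, not my measurements; they only illustrate the case where "~20%" is read as 20% less elapsed time, which is an assumption.

```python
def speedup(t_ref: float, t_new: float) -> float:
    """Speedup of the new run relative to the reference run (elapsed times)."""
    if t_ref <= 0 or t_new <= 0:
        raise ValueError("elapsed times must be positive")
    return t_ref / t_new


def relative_efficiency(t_ref: float, n_ref: int, t_new: float, n_new: int) -> float:
    """Speedup divided by the increase in core count (1.0 means ideal scaling)."""
    if n_ref < 1 or n_new < 1:
        raise ValueError("core counts must be at least 1")
    return speedup(t_ref, t_new) * n_ref / n_new


if __name__ == "__main__":
    # Placeholder timings: 100 s on 20 MPI ranks, 80 s on 40 MPI ranks.
    t20, t40 = 100.0, 80.0
    print(f"speedup 20 -> 40 ranks: {speedup(t20, t40):.2f}")
    print(f"relative efficiency:    {relative_efficiency(t20, 20, t40, 40):.2f}")
```

Under that reading, going from the 20 physical cores to the 40 virtual cores still helps, but far less than doubling the core count would suggest.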

Best regards,

Jonas


How many process can I run on this machine?