What is thread imbalance?

Mohammad
Posts: 114
Joined: Thu Oct 25, 2018 12:18 pm

Post by Mohammad »

Hello,

I'm doing LES on a mesh which contains 3.2M cells.
I use 48 threads for this simulation. I'm using METIS.
When I start the simulation, the following messages appear in the listing file:
Numbering for interior faces:

type: threads
number of threads: 48
number of exclusive groups: 35
number of elements in group 0: 6245600
number of elements in group 1: 357545
number of elements in group 2: 171791
number of elements in group 3: 171162
number of elements in group 4: 171057
number of elements in group 5: 89671
number of elements in group 6: 89580
number of elements in group 7: 89627
number of elements in group 8: 89561
number of elements in group 9: 90464
number of elements in group 10: 89564
number of elements in group 11: 89559
number of elements in group 12: 89623
number of elements in group 13: 89623
number of elements in group 14: 89560
number of elements in group 15: 89558
number of elements in group 16: 89562
number of elements in group 17: 89560
number of elements in group 18: 89625
number of elements in group 19: 89556
number of elements in group 20: 89558
number of elements in group 21: 89620
number of elements in group 22: 89557
number of elements in group 23: 89557
number of elements in group 24: 89560
number of elements in group 25: 89558
number of elements in group 26: 89558
number of elements in group 27: 89616
number of elements in group 28: 89619
number of elements in group 29: 89555
number of elements in group 30: 89558
number of elements in group 31: 89548
number of elements in group 32: 90749
number of elements in group 33: 518
number of elements in group 34: 63
estimated thread imbalance: 2.460

Numbering for boundary faces:

type: threads
number of threads: 48
number of exclusive groups: 1
number of elements in group 0: 45884
estimated thread imbalance: 0.000
As you can see, the estimated thread imbalance equals 2.46.

What is that?

Does it influence the simulation speed? If yes, how can I solve it?

If this is a partitioning of the interior faces, why does group 0 contain 6M faces while the others contain far fewer? Why are the groups not equal in size? Is that OK?

The number of threads is 48 and the number of groups is 35. Does that mean 48 threads is too many for this simulation?

Regards,

Mohammad
Yvan Fournier
Posts: 4221
Joined: Mon Feb 20, 2012 3:25 pm

Post by Yvan Fournier »

Hello,

2 (or at most 4) threads might be OK when you reach MPI scalability issues (in the context of a fixed "total threads = MPI ranks * threads"). Except for the CDO modules, in most cases you get better performance with pure MPI.

I recommend testing and comparing, but you probably have too many threads.

In any case, it is normal that you have more elements in the first groups (which are called sequentially, and each contain the required number of threads), but 34 groups mean 34 thread synchronization steps...

Regards,

Yvan

Post by Mohammad »

Hello,

Thanks a lot dear Yvan,

I'm not very familiar with MPI and OMP.

I'm on a cluster with AMD Opteron(tm) Processor 6386 SE CPUs, 128 GB of RAM, and a maximum of 64 cores. The calculation is very slow: each LES iteration takes about 60 seconds.
I have 3 million cells and am using 48 cores. I'm using the following settings for my job on the cluster:

Code: Select all

#PBS -l nodes=1:ppn=48
#PBS -N NAME
#PBS -q batch
#PBS -l walltime=20:00:00:00

# Number of Threads:
export OMP_NUM_THREADS=2

# Run command:
\code_saturne run -n 48
nodes = Number of nodes
ppn = processor(s) per node
Are these settings OK for parallel processing, or should I change them for better performance? Can you help me with that?

Following your suggestion, I changed the number of threads to 2 (OMP_NUM_THREADS=2) and compared it with another job using 16 threads (OMP_NUM_THREADS=16).
The case with 16 threads was 33% faster than the one with 2 threads.

Best Regards,
Mohammad

Post by Yvan Fournier »

Hello,

It is surprising that you get better performance with more threads, but you are probably already saturating the bandwidth, and the threads may simply force a more efficient renumbering.

Could you compare performance using:

48 MPI processes, 1 thread
24 MPI processes, 2 threads

48 MPI processes on 2 nodes, 1 thread (if available)
96 MPI processes on 2 nodes, 1 thread.

If the numbering due to threads is more effective and you are saturating the bandwidth, it is possible to activate the same renumbering without the threads, which might be a bit more efficient, but that can be done in a later step.

Performance is usually optimal between 20 000 and 80 000 mesh cells per core (below that, latency degrades performance; above that, you saturate the memory bandwidth and have more cache misses).
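
As a rough illustration of that rule of thumb, the cells-per-core figure can be computed directly in a job script. This is only a sketch in POSIX shell: `cells_per_core` and `load_status` are hypothetical helpers (not code_saturne commands), the 3 200 000 cell count is the mesh size quoted in the first post, and the division is an integer division.

```shell
# Hypothetical helper: cells handled by each core.
# $1 = total number of mesh cells, $2 = number of cores in use
cells_per_core() {
    echo $(( $1 / $2 ))
}

# Hypothetical helper: compare a load with the 20 000 to 80 000
# cells-per-core range quoted above.
# $1 = cells per core
load_status() {
    if [ "$1" -lt 20000 ]; then
        echo "below range"
    elif [ "$1" -gt 80000 ]; then
        echo "above range"
    else
        echo "within range"
    fi
}

# Mesh size taken from the first post of this thread.
for cores in 16 48 64; do
    n=$(cells_per_core 3200000 "$cores")
    echo "$cores cores: $n cells per core ($(load_status "$n"))"
done
```

The thresholds are only the guideline given in this post; actual timings remain the deciding factor.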

Best regards,

Yvan

Post by Mohammad »

Hello,

Thank you for the reply.

Unfortunately the cluster does not allow me to use more than 1 node, but for the first two cases that you mentioned the results are as follows:

48 MPI processes, 1 thread* -> 120 seconds per iteration on average
24 MPI processes, 2 threads* -> 84 seconds per iteration on average

I also tested more cases:
48 MPI processes, 16 threads -> 41 seconds per iteration on average
24 MPI processes, 16 threads -> 43 seconds per iteration on average

16 MPI processes, 16 threads -> 41 seconds per iteration on average
16 MPI processes, 8 threads -> 43 seconds per iteration on average
16 MPI processes, 2 threads -> 97 seconds per iteration on average

It seems that the best number of threads is 16. I also tested values larger than 16 and the simulation got slightly slower.
Clearly, in my case, increasing the number of threads from 1 to 16 speeds up the simulation.
It's odd that the case with 16 MPI processes and 16 threads has the same performance as the case with 48 processes and 16 threads!
I also noticed that this CPU has 16 physical cores and 16 threads, which is exactly where the best performance occurs!
You wrote: "If the numbering due to threads is more effective and you are saturating the bandwidth, it is possible to activate the same renumbering without the threads, which might be a bit more efficient, but that can be done in a later step."
I think this is what is happening in my case. What should I do?

Regards,
Mohammad

Post by Yvan Fournier »

Hello,

Check the example in cs_user_performance_tuning-numbering.c

The cs_renumber_set_n_threads function can be used to set a numbering adapted to a given number of threads, without actually using that many threads (the numbering value must be greater than or equal to, and a multiple of, the actual number of threads).

The other renumbering algorithm options might be interesting to test. In most cases they lead to small improvements or degradations (+/- 5 to 10%), but this also depends on the initial mesh numbering.

I forgot to ask what type of network you have on your cluster. If your network is not a fast, low-latency network (such as Infiniband, OmniPath, ...), the MPI exchanges can be costly, which would explain why you get better performance using threads.
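
The constraint in parentheses comes down to simple arithmetic. The sketch below is only an illustration in POSIX shell: `valid_numbering_threads` is a hypothetical helper, not part of code_saturne, and it encodes nothing beyond the rule stated above (a numbering value at least equal to, and a multiple of, the actual thread count).

```shell
# Hypothetical helper: succeeds when a numbering thread count is
# compatible with the number of OpenMP threads actually used.
# $1 = thread count the numbering is built for
# $2 = actual number of OpenMP threads at run time (must be >= 1)
valid_numbering_threads() {
    [ "$1" -ge "$2" ] && [ $(( $1 % $2 )) -eq 0 ]
}

# Which actual thread counts could reuse a numbering built for 16 threads?
for actual in 1 2 3 4 8 16 32; do
    if valid_numbering_threads 16 "$actual"; then
        echo "numbering for 16, running $actual thread(s): allowed"
    else
        echo "numbering for 16, running $actual thread(s): not allowed"
    fi
done
```

Under this rule, a numbering built for 16 threads could be combined with 1, 2, 4, 8 or 16 actual threads, but not with 3 or 32.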

Best regards,

Yvan

Post by Mohammad »

Hello,

Thank you again dear Yvan.

I don't know the network type of the cluster, but it seems to be an old-fashioned one!

As I said, the best number of threads for any number of MPI processes was 16 (any number greater or smaller than this slows down the code), and I need at least 16 MPI processes, which gives about 190,000 cells/core, a high value.

You say that with cs_renumber_set_n_threads we can build the numbering for more threads than the actual number of threads we use.
Do you mean that it is better to use cs_renumber_set_n_threads(16) instead of OMP_NUM_THREADS=16 for the simulation, or that I can use a lower number of MPI processes with a higher number of threads, for example 8 MPI processes with 16 threads, using cs_renumber_set_n_threads(16)?
If the second statement is the answer, does it make any sense for me to use this function when I need at least 16 MPI processes?

Sorry for all the questions. As I said, I'm a bit confused, and I'm a beginner in parallel processing!

Best Regards,
Mohammad

Post by Yvan Fournier »

Hello,

To make things clearer:

- the total physical resources used (number of cores) is:
number of MPI processes * number of OpenMP threads

- inside a given MPI process, the numbering for threads should be a multiple of the number of OpenMP threads used

- the number of groups is a "result" of the renumbering algorithm: imagine a local sub-partitioning of the mesh, where each thread is assigned values inside a partition but not on its boundary. That is group 0. Then we subdivide the remaining cells, assigning them to threads, if they are not adjacent to both threads. That is group 1. Then recurse on remaining cells, with smaller and smaller groups. So 2 or 3 groups is "normal", 35 means that the renumbering algorithm is not so well adapted to such high thread counts and it would not be surprising if this degrades performance.
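
The first point can be checked before submitting a job. The following is a minimal sketch in POSIX shell, assuming the 64-core node described earlier in this thread; `total_cores`, `fits_on_node` and `report` are hypothetical helpers for illustration, not code_saturne commands.

```shell
# Hypothetical helper: cores requested by a run.
# $1 = MPI processes, $2 = OpenMP threads per process
total_cores() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: succeeds when the request fits on the node.
# $1 = MPI processes, $2 = threads per process, $3 = cores on the node
fits_on_node() {
    [ "$(total_cores "$1" "$2")" -le "$3" ]
}

# Print a one-line summary for a given combination.
report() {
    if fits_on_node "$1" "$2" "$3"; then
        echo "$1 MPI x $2 threads = $(total_cores "$1" "$2") cores: fits on $3"
    else
        echo "$1 MPI x $2 threads = $(total_cores "$1" "$2") cores: exceeds $3"
    fi
}

# 64 cores is the node size quoted earlier in this thread.
report 48 1 64
report 24 2 64
report 16 16 64
```

With these inputs, only the last combination asks for more cores than the node provides.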

Hope this explains things better.

And in practice, nothing can replace actual benchmarking, though theory can help in planning and explaining results.

Best regards,

Yvan

Post by Mohammad »

Hello,

Thank you very much Yvan,

I got it!
But you said that: number of cores = number of MPI processes * number of OpenMP threads.
When I run code_saturne in parallel I use this command:
code_saturne run -n X
The question is: what exactly is X? The number of MPI processes or the number of physical cores?
If it is the number of cores, does that mean that with the following commands I only have 1 MPI process?

Code: Select all

export OMP_NUM_THREADS=16
code_saturne run -n 16
I can also use 32 OpenMP threads with X=16 without any errors, which would mean 0.5 MPI processes?!

Or, if it is the number of MPI processes, do the above commands mean that I am using 16*16=256 physical cores?! That's impossible because the node has only 64 cores. Then why does it not give me an error?

Kind Regards,
Mohammad

Post by Yvan Fournier »

Hello,

X is the number of MPI processes.

If you have 16 MPI processes with 16 threads each, yes, you have more threads than physical cores. This does not necessarily lead to an error, but it is called oversubscribing and generally leads to degraded performance (though it may be useful to debug parallelism on a smaller number of physical cores).

So I would expect performance to be best using at most 48 cores, but there may be other combinations to test (for example 8 MPI * 6 threads, ...), and what counts is the actual performance obtained.
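
To list candidate combinations for such a comparison, one option is to enumerate every pair of MPI processes and threads whose product equals the core budget. This is only a sketch in POSIX shell; `list_combos` is a hypothetical helper, and which combination is fastest can only be decided by timing actual runs.

```shell
# Hypothetical helper: print every "MPI processes x threads" pair
# whose product is exactly the given number of cores.
# $1 = number of cores to fill
list_combos() {
    p=1
    while [ "$p" -le "$1" ]; do
        # Keep p only when it divides the core count evenly.
        if [ $(( $1 % p )) -eq 0 ]; then
            echo "$p MPI x $(( $1 / p )) threads"
        fi
        p=$(( p + 1 ))
    done
}

# Candidate layouts for a 48-core budget.
list_combos 48
```

Each printed line corresponds to one `code_saturne run -n <MPI processes>` run with `OMP_NUM_THREADS` set to the matching thread count.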

Best regards,

Yvan