Hardware configuration
======================
CPU: Intel Xeon E5-2678 v3 (LGA 2011-3); HT enabled or disabled (makes no difference)
RAM: 2 x Samsung 32 GB DDR3-1866 ECC REG
Motherboard: Huananzhi X99-8MD3 (LGA 2011-3, X99/C610 chipset, 2 DDR3 ECC/REG slots)
Video: Radeon R5 230 (19 W max)
PSU: Seasonic 300 W (the electrolytic caps are a decent brand)
As you can see, it's a low-cost compute node.
Software
========
OS: CentOS 7.9 (optionally with the kernel from 7.5) or Kubuntu 2022.
OpenMPI: 1.8.4, 1.10.7, 3.1.6, some versions from the 4.x branch; or MPICH 4.0.2.
Saturne: 6.x and 7.x branches.
NUMA: 1 node.
Problem description
===================
When I start a calculation on just a few cores (up to ~4), everything is OK. When I increase the number of processes further, the calculation speed for the same case and initial conditions doesn't increase, and even decreases slightly. This holds for various cases with reasonably large meshes, so a particular case or mesh is not the problem. Also, when I run the same software setup (Kubuntu 2022 with Saturne from a USB drive) on a different PC (Huananzhi X99-TF, Xeon E5-2678 v3, 4 x 32 GB Samsung DDR3-1866 ECC REG), the calculation scales acceptably (although not ideally). The same holds for an AMD Threadripper machine with 16 cores and the same memory size (64 GB DDR4).
What I tried
============
Everything

1. Different Linux, Saturne, and MPI versions, even MPICH, although I usually use OpenMPI.
2. A second (spare) hardware set: CPU + RAM + motherboard (the same models). Unfortunately, I also bricked one motherboard trying to update the BIOS (for Huananzhi this procedure is not as reliable as on top-brand boards), although this board (X99-8MD3) is cheap. (FYI: such boards are commonly used for long, heavy parallel workloads, so it should work well for Saturne.)
3. CPU temperature (core max 65 °C) and frequency (2900-3000 MHz) checks: absolutely OK (a Thermalright cooler rated for 180 W [realistically 140 W] with a Noctua Industrial fan). Tried two thermal pastes: the one from the Thermalright kit and Arctic Cooling MX-4.
4. Lots of benchmarks: single-thread CPU, parallel matrix operations, CPU cache-miss checks, cache size and associativity, real memory frequency and performance with different block sizes.
5. Disabling SELinux, Spectre and other mitigations, and power-saving features.
6. Checked core loading: it's almost uniform over time, with no large "system" load, just "user" load.
7. An MPI benchmark (OSU). It shows up to 5x lower performance on larger blocks (sizes are the OSU defaults) than the other machine (X99-TF, same CPU, 128 GB RAM), although some results are better depending on the test and block size.
8. Running on 5-11 cores to leave some headroom for system services.
9. Playing around with threads per process.
So: if I run CPU benchmarks ("sysbench --test=cpu --threads=12 run", or others with matrix multiplication) on all 12 cores, they scale perfectly (events per second: 951 on 1 thread vs. 10038 on 12). If I run Saturne on all cores, it performs slower than on 4 cores (7-million-cell mesh). I can't get any speed-up on any core count above 4! Moreover, if I start one Saturne calculation on 4 cores and then start another case on another 4 cores (affinity set with "htop"), the slowdown is about 2.5x compared with just one case running. The timer statistics show that the linear solver and gradient reconstruction times are both somewhat larger on 12 cores than on 4, meaning the parallel MPI floating-point work slows down instead of speeding up, while the PC draws more power (up to 190 W) and all cores run at 2900 MHz at full load with no noticeable system load!
It looks very strange. Do OpenMPI/MPICH use some shared resource that saturates beyond 4 processes? We run CFX/Fluent on a cluster with 24 AMD cores per node (CentOS 7.5 + HP-MPI or Platform MPI), and it scales without any problems up to 264 cores (over an InfiniBand interconnect), even with our very old versions that don't run smoothly on CentOS 7.5. I expected the same from OpenMPI + Saturne...
Can anybody suggest what the culprit is? It's quite frustrating to spend so much time and effort to get such a result. Is there some hardware feature that MPI needs?
Thanks for your attention. I've attached some files from a 12-core run, plus the configure script options.