Hardware configuration
======================
CPU: Intel Xeon E5-2678 v3 (LGA 2011-3); HT enabled or disabled (makes no difference)
RAM: 2 x Samsung 32 GB DDR3-1866 ECC REG
Motherboard: Huananzhi X99-8MD3 (LGA 2011-3, X99/C610 chipset, 2 DDR3 ECC/REG slots)
Video: Radeon R5 230 (19 W max)
PSU: Seasonic 300 W (the electrolytic caps are a decent brand)
As you can see, it's a low-cost compute node.
Software
========
OS: CentOS 7.9 (optionally with the kernel from 7.5) or Kubuntu 2022.
OpenMPI: 1.8.4, 1.10.7, 3.1.6, some versions from the 4.x branch; or MPICH 4.0.2.
Saturne: 6.x and 7.x branches.
NUMA: 1 node.
Problem description
===================
When I start a calculation on just a few cores (up to ~4), everything is OK. When I increase the number of processes further, the calculation speed for the same case and initial conditions doesn't increase, and even decreases slightly. This holds for various cases with reasonably large meshes, so a particular case or mesh is not the problem. Also, when I run the same software setup (Kubuntu 2022 with Saturne from a USB drive) on a different PC (Huananzhi X99-TF, Xeon E5-2678 v3, 4 x 32 GB Samsung DDR3-1866 ECC REG), the calculation scales acceptably (although not ideally). The same holds for an AMD Threadripper machine with 16 cores and the same memory size (64 GB DDR4).
What I tried
============
Everything

1. Different Linux, Saturne, and MPI versions, even MPICH, although I usually use OpenMPI.
2. A second (spare) hardware set: CPU + RAM + motherboard (the same models). Unfortunately, I also bricked one motherboard trying to update the BIOS (for Huananzhi this procedure is not as reliable as on top-brand boards), although this board (X99-8MD3) is cheap. (FYI: such boards are commonly used for long, heavy parallel workloads, so it should work well for Saturne.)
3. CPU temperature (core max 65 °C) and frequency (2900-3000 MHz) checks: absolutely OK (a Thermalright cooler rated for 180 W [realistically 140 W] with a Noctua Industrial fan). Tried two thermal pastes: the one from the Thermalright kit and Arctic Cooling MX-4.
4. Lots of benchmarks: single-thread CPU, parallel matrix operations, CPU cache-miss checks, cache size and associativity, real memory frequency and performance with different block sizes.
5. Disabling SELinux, Spectre and other mitigations, and power-saving features.
6. Checked core loading: it's almost uniform over time, with no large "system" load, just "user" load.
7. An MPI benchmark (OSU). It shows up to 5x lower performance on larger blocks (sizes are the OSU defaults) than the other machine (X99-TF, same CPU, 128 GB RAM), although some results are better depending on the test and block size.
8. Running on 5-11 cores to leave some headroom for system services.
9. Playing around with threads per process.
So: if I run CPU benchmarks ("sysbench --test=cpu --threads=12 run", or others with matrix multiplication) on all 12 cores, they scale perfectly (events per second: 951 on 1 thread vs. 10038 on 12). If I run Saturne on all cores, it performs slower than on 4 cores (7-million-cell mesh). I can't get any speed-up on any core count above 4! Moreover, if I start one Saturne calculation on 4 cores and then start another case on another 4 cores (affinity set with "htop"), the slowdown is about 2.5x compared with just one case running. The timer statistics show that the linear solver and gradient reconstruction times are both somewhat larger on 12 cores than on 4, meaning the parallel MPI floating-point work slows down instead of speeding up, while the PC draws more power (up to 190 W) and all cores run at 2900 MHz at full load with no noticeable system load!
It looks very strange. Do OpenMPI/MPICH use some shared resource that saturates beyond 4 processes? We run CFX/Fluent on a cluster with 24 AMD cores per node (CentOS 7.5 + HP-MPI or Platform MPI), and it scales without any problems up to 264 cores (over an InfiniBand interconnect), even with our very old versions that don't run smoothly on CentOS 7.5. I expected the same from OpenMPI + Saturne...
Can anybody suggest what the culprit is? It's quite frustrating to spend so much time and effort to get such a result. Is there some hardware feature that MPI needs?
Thanks for your attention. I've attached some files from a 12-core run, plus the configure script options.