Hello,
I noticed somewhere in the forum that the parallelism efficiency with the legacy FV method is reached with about 32000 cells per CPU (assumed without multi-threading). Is that the same ratio using the CS default CDO_fb space discretization ?
Thank you in advance.
Kind regards.
Parallel Calculation Efficiency with CDO
Re: Parallel Calculation Efficiency with CDO
Hello,
This value is really an average, and depends on the network and processor speeds/balance, and on the number of nodes used. On a single node, scalability may be better, while on billion-cell cases running on thousands of cores, the optimum is often higher.
So for legacy FV, I always recommend starting from 30K to 50K cells per core, and running a few iterations using both half and double that. Then repeat in the direction where scalability seems best, until performance seems about optimal.
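As a rough numeric illustration of that half/double search (a minimal Python sketch, not a code_saturne tool; only the 30K-50K cells-per-core range and the half/double idea come from this post, the 40K midpoint and the function name are just illustrative):

# Minimal sketch of the "start at 30K-50K cells per core, then try half
# and double" approach described above. Only the cells-per-core range
# comes from the post; the 40K midpoint and names are illustrative.

def candidate_core_counts(n_cells, cells_per_core=40_000):
    """Core counts to benchmark: the starting guess, plus runs at
    half and at double the cells-per-core ratio."""
    start = max(1, round(n_cells / cells_per_core))
    finer = max(1, round(n_cells / (cells_per_core / 2)))    # half the cells per core -> more cores
    coarser = max(1, round(n_cells / (cells_per_core * 2)))  # double the cells per core -> fewer cores
    return sorted({coarser, start, finer})

# Example: a 10 million cell mesh
print(candidate_core_counts(10_000_000))  # -> [125, 250, 500]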
For CDO, scalability might be slightly better, though I recommend using the same approach. Using hybrid MPI/OpenMP might slightly improve things at high core counts.
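To give a concrete feel for what a hybrid split changes (a minimal sketch; this is generic MPI/OpenMP arithmetic, not a code_saturne setting):

# Minimal sketch: with OpenMP threads inside each MPI rank, the mesh is
# partitioned over fewer MPI subdomains, so each subdomain is larger and
# there are fewer partition boundaries to exchange. Numbers are illustrative.

def hybrid_partition(n_cells, total_cores, threads_per_rank):
    """Return (MPI ranks, cells per MPI subdomain) for a given split."""
    ranks = max(1, total_cores // threads_per_rank)
    return ranks, n_cells // ranks

# 10 million cells on 512 cores: pure MPI vs. 4 OpenMP threads per rank
print(hybrid_partition(10_000_000, 512, 1))  # -> (512, 19531)
print(hybrid_partition(10_000_000, 512, 4))  # -> (128, 78125)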
Processor architecture is also important. Some years ago, I saw a case lose scalability very quickly on my desktop machine, with almost no improvement from 2 to 8 cores (for 200,000 cells), while the same binary had a speedup of about 7 on 8 cores on a more powerful workstation (both Intel Xeon processors, but different models). So you need to test on your own case to be sure (but do get back to me if performance is completely different from what is expected).
Best regards,
Yvan
Edit: the original post mentioned "30 to 50 cells per core". This was a typo. I meant "30K to 50K cells per core".
Re: Parallel Calculation Efficiency with CDO
On a workstation, you also need to take the memory bottleneck into account. With DDR3 you cannot use more than [2xMemoryChannels] cores; with DDR5 the limit is not as strict, at ~[3.5xMemoryChannels]. For example, if you have a DDR3 Xeon with 4 memory controller channels, all with DIMMs connected, you can utilize up to 8 cores; beyond that, the calculation will not speed up. But if you have 2xEpyc with 12 memory controller channels and all 24 DDR slots in your server fitted with DIMMs, you can utilize all cores without significant scalability degradation. Also, newer DDR5 systems will give you ~3.5 (not 2) usable cores per memory channel. But, in our experience, there may be stability problems with high-core DDR5 servers (we found 64-core DDR5 [Epyc] or 32-core DDR4 [Epyc or Xeon] systems optimal).
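As a rough numeric illustration of this rule of thumb (a minimal sketch; only the ~2 and ~3.5 cores-per-channel factors come from the post above, treating DDR4 like DDR3 is an assumption, and the function and example channel counts are illustrative):

# Minimal sketch of the memory-bandwidth rule of thumb above: estimate
# how many cores can be fed before scaling flattens, using ~2 usable
# cores per populated memory channel for DDR3 (DDR4 assumed similar)
# and ~3.5 for DDR5.

USABLE_CORES_PER_CHANNEL = {"DDR3": 2.0, "DDR4": 2.0, "DDR5": 3.5}

def usable_cores(populated_channels, ddr_generation):
    """Rough estimate of cores usable before the bandwidth ceiling."""
    return int(populated_channels * USABLE_CORES_PER_CHANNEL[ddr_generation])

print(usable_cores(4, "DDR3"))   # 4-channel DDR3 Xeon   -> 8
print(usable_cores(12, "DDR5"))  # 12-channel DDR5 system -> 42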