
Re: Parallel computing on a cluster

Posted: Thu Dec 02, 2021 6:11 pm
by Ruonan
Hello Yvan,

Here are the timer_stats.csv files; I hope they help. Thank you!

Best regards,
Ruonan

Re: Parallel computing on a cluster

Posted: Fri Dec 03, 2021 8:34 pm
by Yvan Fournier
Hello,

Comparison is a bit tricky, since you have 2633 time steps in the 56-core case and 1224 in the 28-core case.

Looking at the averages, I find 0.848 s/time step for the 56-core case, 1.64 s/time step for the 28-core case, and 9.34 s/time step for the 2-core case, so the efficiency seems to match your curve. I see no specific operation being slower (about 2/3 of the time is spent in the linear solvers and gradients in the 2-rank case, and about 3/4 in the 56-rank case), so there is no obvious issue there.
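In case it is useful, here is a minimal sketch of how such per-step averages can be extracted; the column name "total" and the per-case file names are assumptions for illustration, not the exact timer_stats.csv layout, so adjust them to match the actual header:

import csv

def mean_time_per_step(path, column="total"):
    # Average the per-time-step cost over all rows of a timer_stats.csv file.
    # The "total" column name is an assumption; check the file header.
    times = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            times.append(float(row[column]))
    return sum(times) / len(times)

# Hypothetical file names, one per run:
for label, path in [("56 cores", "timer_stats_56.csv"),
                    ("28 cores", "timer_stats_28.csv"),
                    ("2 cores",  "timer_stats_2.csv")]:
    print(f"{label}: {mean_time_per_step(path):.3f} s/time step")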

If you are on a single node, memory bandwidth saturation does not seem to be the issue either, because in that case you would see very little additional speedup from 28 to 56 cores. So I would guess some MPI or network driver aspect comes into play here. Do you have "vanilla" MPICH 3.1 on the machine, or a version with optimized drivers for OmniPath? That could be the cause of the performance loss.
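One quick way to see which MPI library is actually linked at run time is to query it from a small MPI program; the sketch below assumes mpi4py is available on the cluster (an assumption, not something from this thread). Running `mpichversion` or `mpiexec --version` on a login node gives similar information.

# Run with, e.g.: mpiexec -n 2 python check_mpi.py
from mpi4py import MPI

if MPI.COMM_WORLD.Get_rank() == 0:
    # Prints something like "MPICH Version: 3.1 ..." or a vendor-specific
    # string, which shows whether an OmniPath-optimized build is in use.
    print(MPI.Get_library_version())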

Best regards,

Yvan

Re: Parallel computing on a cluster

Posted: Sun Dec 05, 2021 6:08 pm
by Ruonan
Hello Yvan,

Thanks a lot for your comments! I really appreciate your help!

Sorry for not running the same number of time steps for each case. I also used the average time per step, the same as your method.

Actually, not all the cases are on a single node. I have 28 cores per node, so the 2-core and 28-core cases each use one node, while the 56-core case uses two nodes. Could "memory bandwidth saturation" be a problem in that situation?

I will check the MPI and network driver question with my IT support and get back to you soon. As you said, if I can improve the parallel performance by a factor of 2, that would be wonderful.

Best regards,
Ruonan

Re: Parallel computing on a cluster

Posted: Sun Dec 05, 2021 7:30 pm
by Yvan Fournier
Hello,

In that case, the drop in performance may be due to saturating the node's memory bandwidth when you move to 28 ranks.

A test to confirm this would be to try running 14 ranks on a single node, and 28 ranks on 2 nodes, and check the performance in those configurations.
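To compare those runs, a small helper like the one below can be used to compute speedup and parallel efficiency relative to a reference case; the timings shown are the averages quoted earlier in this thread, and the 14-rank and 2-node results would be added once measured (the function name and layout are just an illustration):

def efficiency(timings, ref_cores):
    # timings: {core count: average s/time step}; efficiency is speedup
    # divided by the increase in core count relative to the reference run.
    t_ref = timings[ref_cores]
    for cores, t in sorted(timings.items()):
        speedup = t_ref / t
        eff = speedup / (cores / ref_cores)
        print(f"{cores:3d} cores: {t:.3f} s/step, "
              f"speedup {speedup:.2f}, efficiency {eff:.2f}")

# Averages quoted earlier (2, 28 and 56 cores), relative to the 2-core run:
efficiency({2: 9.34, 28: 1.64, 56: 0.848}, ref_cores=2)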

Best regards,

Yvan