Parallel computing on a cluster
Forum rules
Please read the forum usage recommendations before posting.
Parallel computing on a cluster
Dear developers,
Please could you help me with these questions?
I recently compiled the latest GitHub version on my university's cluster. This time I did not use the GUI. What I did was: copy the mesh, the XML file (generated by the GUI on my desktop PC), and the src files to the corresponding folders on the cluster, edit the DATA/run.cfg file to change the node and CPU numbers, and then run the "code_saturne run" command in the DATA folder. The case runs, but I have three questions:
1. On the cluster I have two nodes, each with 20 CPUs. In the run.cfg file I wrote "n_procs: 2, n_threads: 20". Is this correct? Does "n_threads" mean the total number of CPUs, or the CPUs per node?
2. Is there anything else I need to specify in the run.cfg file, for example the input/output method, the MPI rank step, etc.?
3. When I want to stop the case and save the results, what command should I run in the terminal? In the GUI I can click "Stop now", but in the terminal I do not know how to stop the run.
Many thanks and best regards,
Ruonan
- Posts: 4220
- Joined: Mon Feb 20, 2012 3:25 pm
Re: Parallel computing on a cluster
Hello,
n_procs is the number of MPI processes used, and n_threads the number of OpenMP threads per MPI process.
I do not recommend more than 2 threads per process, as OpenMP is not used everywhere, so
(n_procs = 40, n_threads = 1) or (n_procs = 20, n_threads = 2) are the recommended options.
There have already been similar questions on this forum regarding performance, so you should find more info by searching.
You will find detailed documentation on the run.cfg here: https://www.code-saturne.org/documentat ... rg_run_cfg
To stop the code, look here: https://www.code-saturne.org/documentat ... ntrol_file
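For a 2-node, 40-core allocation like yours, the first recommended option might look like this in run.cfg. This is only a sketch: the section placement is an assumption on my part, so check it against the run.cfg documentation linked above.

```
# Sketch only: verify the section name against the run.cfg documentation.
[run]
n_procs: 40
n_threads: 1
```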
Regards,
Yvan
Re: Parallel computing on a cluster
Hi Yvan,
Thanks for your reply! It is very helpful.
Regarding stopping the code, I tried but failed. I generated a "control_file", added the line "<time_step_number>1000" to it, and put the file in the DATA folder (I also tried the SRC folder). At that moment the case had already run more than 1000 time steps, so I expected the calculation to stop immediately. But the calculation did not stop; nothing happened. Could you please tell me what I did wrong?
Best regards,
Ruonan
Re: Parallel computing on a cluster
Hello,
The control_file must be placed in the execution folder (RESU/<run_id>) to be used.
If it is placed in DATA, it will be copied to RESU/<run_id> for each subsequent run (probably not what you want).
Regards,
Yvan
Re: Parallel computing on a cluster
Hi Yvan,
Thanks for your reply! I tried it, but when I put the control_file in the RESU/<run_id> folder, the control_file is deleted immediately and the calculation does not stop. Did I write the control_file incorrectly? I only have one line in it:
Code: Select all
<time_step_number>100
Thanks for checking!
Best regards,
Ruonan
Re: Parallel computing on a cluster
Hi Yvan,
Sorry, I had written the control_file incorrectly. Just adding the number "1" to the control_file and copying it into the results folder works.
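For convenience, this step can be scripted. The sketch below writes the one-line control_file into the execution directory; the run id in the path is a made-up placeholder, so use your own RESU/<run_id> directory.

```python
# Request a stop by writing a one-line control_file into the
# execution directory. The run id below is a hypothetical placeholder.
from pathlib import Path

run_dir = Path("RESU/20211115-1531")
run_dir.mkdir(parents=True, exist_ok=True)  # already exists during a real run
(run_dir / "control_file").write_text("1\n")
print((run_dir / "control_file").read_text().strip())  # -> 1
```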
Many thanks,
Ruonan
Re: Parallel computing on a cluster
Hello Yvan,
Could you please help me with this error? I ran a test case in parallel on the cluster, using the settings you recommended, but it still fails.
I have 1 node with 27 CPUs. I set "n_procs: 27, n_threads: 1" in run.cfg, but the calculation cannot start. The error is shown below; the run_solver.log file and two error files are attached.
Code: Select all
----------------------------------------------------------
Composing periodicities
Halo construction with standard neighborhood
============================================
Face interfaces creation
Definition of periodic vertices
Vertex interfaces creation
Halo creation
Halo definition
Local halo definition
Distant halo creation
SIGINT signal (Control+C or equivalent) received.
--> computation interrupted by user.
Call stack:
1: 0x7fdbf92d9296 <PMPIDI_CH3I_Progress+0x1146> (libmpi.so.12)
2: 0x7fdbf93e4c29 <MPIC_Wait+0x39> (libmpi.so.12)
3: 0x7fdbf93e526a <MPIC_Recv+0xea> (libmpi.so.12)
4: 0x7fdbf92bdeef <MPIR_Barrier_intra+0x2ff> (libmpi.so.12)
5: 0x7fdbf92bd875 <I_MPIR_Barrier_intra+0x125> (libmpi.so.12)
6: 0x7fdbf92bd6cc <MPIR_Barrier+0xc> (libmpi.so.12)
7: 0x7fdbf92bd5fc <MPIR_Barrier_impl+0x4c> (libmpi.so.12)
8: 0x7fdbf92bf482 <PMPI_Barrier+0x1c2> (libmpi.so.12)
9: 0x7fdbfb56bf5f <+0x5f4f5f> (libsaturne-7.1.so)
10: 0x7fdbfb56e229 <cs_mesh_halo_define+0x1139> (libsaturne-7.1.so)
11: 0x7fdbfb52e817 <cs_mesh_init_halo+0x1cd7> (libsaturne-7.1.so)
12: 0x7fdbfb106aa0 <cs_preprocess_mesh+0x370> (libsaturne-7.1.so)
13: 0x7fdbfc156b96 <main+0x2d6> (libcs_solver-7.1.so)
14: 0x7fdbf89e6c05 <__libc_start_main+0xf5> (libc.so.6)
15: 0x401879 <> (cs_solver)
End of stack
What is strange is that when I decrease the number of processes to "n_procs: 8, n_threads: 1", the calculation runs with no error. The case also runs with no error on my desktop PC, so I think the case setup is fine and the error is related to parallel running.
(I am using the master version from GitHub. When I compiled the code on the cluster, I used the semi-automatic installation method. PT-Scotch and ParMETIS were installed with no errors.)
Could you please guide me on what is wrong here?
Many thanks,
Ruonan
Attachments:
- run_solver.log (18.33 KiB)
- error_r09.log (648 Bytes)
- error.log (1.2 KiB)
Re: Parallel computing on a cluster
Hi Yvan,
Thank you! I still have no idea what to do about the previous error, but I tried other nodes with different features, and all the errors disappeared. I think some nodes on my cluster are not compatible with code_saturne, or perhaps need special settings for some reason.
I tested the parallel performance on the cluster using 2 to 56 cores. Could you tell me, based on your experience, whether this parallel performance is good or not?
Please see the two graphs below. The speedup ratio is calculated as (time using 1 core)/(time using n cores), and the parallel efficiency as (speedup ratio)/(number of cores). I get a parallel efficiency of about 40%. I followed the suggestion of putting 20,000 to 80,000 cells per core; this suggested region is highlighted in green.
I can see you are very experienced in parallel optimization, with many published papers. I really appreciate your help.
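The definitions above can be checked with a small script; the timings used here are made-up placeholders, not my measured values:

```python
# Speedup ratio and parallel efficiency, as defined above.
def speedup(t1, tn):
    return t1 / tn  # (time using 1 core) / (time using n cores)

def efficiency(t1, tn, n):
    return speedup(t1, tn) / n  # (speedup ratio) / (number of cores)

t1, t56 = 1000.0, 44.6  # hypothetical wall-clock times in seconds
print(round(speedup(t1, t56), 1))         # -> 22.4
print(round(efficiency(t1, t56, 56), 2))  # -> 0.4, i.e. about 40 %
```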
Best regards,
Ruonan
Re: Parallel computing on a cluster
Hello,
What type of cluster do you have (processor type, network type, ..., and even MPI library install)? From one of your older logs I would say Intel(R) Xeon(R) Gold 5120, but that does not tell me whether you have a fast network (InfiniBand, for example) or something with higher latency (Gigabit Ethernet, maybe), or whether the MPI drivers make the best of the network (the compilers and system seem old).
Even on our own clusters (the latest is the one described here: https://top500.org/system/179899/), we can observe a factor of 2 in performance depending on the compilers and especially the MPI library configuration used (between optimized libraries and a "generic" workstation-type configuration).
Partitioning quality may play a major role, as well as load balance. Do you have the performance.log files for some of your runs on different numbers of processes?
Also, are other codes running on the same nodes, or do you have exclusive access (such as when using SLURM's --exclusive option, or whatever equivalent option LSF, Torque, or your scheduler/resource manager may have)?
And finally, some specific models might not scale as well as the commonly used ones. The info in timer_stats.csv and performance.log can provide precious feedback, to help see where things are slower.
All these factors can be important.
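As a starting point for that kind of feedback, a sketch like this can summarize a timer_stats-style CSV; the column names and the inline sample here are assumptions, so check them against the header of your own timer_stats.csv.

```python
# Summarize per-time-step timings from a timer_stats-style CSV file.
# The inline sample stands in for a real file; real files have more
# columns and possibly different headers.
import csv
import io

sample = io.StringIO(
    "time step,total,mesh\n"
    "1,0.52,0.10\n"
    "2,0.48,0.00\n"
)
reader = csv.reader(sample)
header = [h.strip() for h in next(reader)]
rows = [[float(v) for v in row] for row in reader]
col = header.index("total")
total_time = sum(r[col] for r in rows)
print(f"{len(rows)} steps, {total_time:.2f} s total")  # -> 2 steps, 1.00 s total
```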
Best regards,
Yvan
Re: Parallel computing on a cluster
Hello Yvan,
Thanks a lot for your very useful comments. I have attached the performance.log files for runs using 2, 28, and 56 cores. I would really appreciate it if you could check them.
Here are some details of my cluster:
Processor type: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
Network type: op
MPI version: 3.0 (MPICH 3.1.2)
I have "gold-5120" nodes as well, but these nodes give me errors described in the previous post. So I can only use "e5-2660" nodes now.
Yes, I use the "--exclusive" option.
Best regards,
Ruonan
Attachments:
- 56cores-performance.log (37.13 KiB)
- 28cores-performance.log (37.1 KiB)
- 2cores-performance.log (37.11 KiB)