user routine parallelization issues - v7

Post by **daniele** » Fri Sep 22, 2023 9:20 am

Hello,

I am having trouble getting a user routine to work correctly when parallelized.
The specific "code" is written inside the routine "cs_boundary_conditions_ale.f90" (subroutine usalcl), in v7.
The routine loops on the faces of one boundary surface, and stores the position of the nodes:

Code:

allocate(lstelt(nfabor))

call getfbr('BC1', nlelt1, lstelt)

if(.not.allocated(y_v)) then
  allocate(y_v(nnod))
endif
if(.not.allocated(y_v_paral)) then
  allocate(y_v_paral(nnod))
endif

! We store inside y_v (or y_v_paral) the y-coordinate of all vertices --> y_v_paral will have a size equal to k_par 
k=1
do ilelt = 1, nlelt1
   ifac = lstelt(ilelt)
   do ii = ipnfbr(ifac), ipnfbr(ifac+1)-1
      inod = nodfbr(ii)
      y_v(k) = (xyzno0(3,inod))
      k=k+1
   enddo
enddo

! Parallelization of y_v to y_v_paral
k_par = k-1
if (irangp.ge.0) then
      call parcpt(k_par)
      call cs_parall_allgather_r(k-1,k_par,y_v,y_v_paral)
endif

Up to 20 CPUs on the same node, the routine works correctly. It fails with a higher CPU count (I tested it on a 40-CPU node), and also when I parallelize the simulation over more than one node (for example, two nodes of 20 CPUs each). The error I get is the following:

Code:

MPI_ABORT was invoked on rank 15 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[node141:106113] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
[node141:106113] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
 solver script exited with status 1.

Error running the calculation.

Moreover, the behavior seems case-dependent: for certain cases, the same routine works without problems even when parallelized over several nodes.
Any suggestions or ideas on what could be causing the problem?

Thank you very much in advance for your help.
Kind regards,
Daniele

Post by **daniele** » Fri Sep 22, 2023 9:54 am

Hello,

I should add that the issue seems to come from "cs_parall_allgather_r()": commenting out this line prevents the problem, and printing "k_par" after the "call parcpt(k_par)" shows the correct result.

Thank you.
Kind regards,
Daniele

Post by **Yvan Fournier** » Fri Sep 22, 2023 10:44 am

Hello,

In general, I recommend avoiding "allgather" operations when possible, unless the volume of associated data remains small.

In your case, the allocation for y_v_paral may be too small, as it is based on the local number of vertices, and not on the total number of boundary vertices using that BC.

You need to sum the local sizes before using allgather (you can check examples for this).
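
To illustrate the sizing rule, here is a minimal serial sketch in C. It does not use MPI or the code_saturne API: the functions `global_count` and `gather_all` are hypothetical names, and each "rank" is simply one entry of an array. In a real run, the sum would come from a parallel reduction (the role played by `parcpt` in the snippet above), and the copy would be done by the allgather call. The point is only that the gathered buffer must be allocated with the summed size, not with a local one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Sum of the per-rank element counts: this is the size the gathered
   array must have. In an MPI run it would be obtained with a sum
   reduction over all ranks. */
size_t global_count(const size_t n_local[], int n_ranks)
{
  size_t total = 0;
  for (int r = 0; r < n_ranks; r++)
    total += n_local[r];
  return total;
}

/* Serial stand-in for a variable-size allgather: concatenates the
   local arrays of all ranks, in rank order, into a buffer allocated
   with the global size. Returns NULL if the global size is zero or
   if the allocation fails; the caller frees the result. */
double *gather_all(const double *local[], const size_t n_local[],
                   int n_ranks, size_t *n_global)
{
  *n_global = global_count(n_local, n_ranks);
  if (*n_global == 0)
    return NULL;

  double *gathered = malloc(*n_global * sizeof(double));
  if (gathered == NULL)
    return NULL;

  size_t shift = 0;  /* start position of the current rank's block */
  for (int r = 0; r < n_ranks; r++) {
    if (n_local[r] > 0)
      memcpy(gathered + shift, local[r], n_local[r] * sizeof(double));
    shift += n_local[r];
  }

  /* Every slot of the gathered buffer has been filled exactly once. */
  assert(shift == *n_global);
  return gathered;
}
```

With three ranks holding 3, 0 and 2 values, `gather_all` returns a buffer of 5 values. A buffer sized from any single rank's count would be too small for some rank layouts, which could explain why the failure depends on the case and on the number of ranks.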

Also note that the Fortran version of ALE boundary conditions was removed after v8.0, so starting from v8.1 (end of 2023) and in the next major release (9.0, in June 2025), you will need to switch to the C version.

Best regards,

Yvan

Post by **daniele** » Tue Sep 26, 2023 1:39 pm

Hello Yvan,

Thank you very much for spotting the issue so quickly. The problem does indeed come from the allocation of y_v_paral: increasing its size solves it.
Thank you also for the suggestion and for the updates about v8.

Kind regards,
Daniele

code_saturne User's Forum
