Problem with a parallel job.

Philippe Parnaudeau

Problem with a parallel job.

Post by Philippe Parnaudeau »

Hello,
I am trying to simulate the flow over a heated circular cylinder at a moderate Reynolds number (around 40,000), with the Sutherland law.
The mesh was generated with gmsh (around 2 x 10^6 hexahedra).
I use Code_Saturne 1.3 on an SGI UV100 computer and I would like to run on 8 CPUs, but the job stops after about 5 hours, like this:
===============================================================
   ** STOP BECAUSE OF TIME EXCEEDED
      -----------------------------
      MAX NUMBER OF TIME STEP SET TO NTCABS:         81
===============================================================
 
 
===============================================================
   ** REMAINING TIME MANAGEMENT
      -------------------------
      REMAINING TIME ALLOCATED TO THE PROCESS   :    0.22790E+04
      ESTIMATED TIME FOR ANOTHER TIME STEP      :    0.24003E+04
        MEAN TIME FOR A TIME STEP               :    0.24151E+03
        TIME FOR THE PREVIOUS TIME STEP         :    0.23660E+03
        SECURITY MARGIN                         :    0.21600E+04
===============================================================
 
 
 CPU TIME FOR THE TIME STEP               81:        0.23777E+03
 
and I don't understand why.
 
I asked for 200 iterations and allocated 8 hours to do them.
 
The job is scheduled by PBS Pro and the directives are:
#PBS -q small_para
#PBS -l ncpus=8
#PBS -l mem=20000mb
#PBS -j eo -N hot_cylinder
 
The results seem to be good, but when I try to restart the job, the results become weird, or to be clear, wrong!
 
I'm sure I did something wrong, but I don't understand where...
 
If someone could help me...
 
Thanks.
Yvan Fournier

Re: Problem with a parallel job.

Post by Yvan Fournier »

Hello,
The heuristic for determining the safety margin may be found in armtps.F, and is very empirical. According to the comments, the margin should amount to 10% of the allocated time for jobs of 1000 iterations or fewer, 100 times the mean cost of an iteration for jobs of 1000 to 10000 iterations, and 1% of the allocated time beyond that. Looking at the code itself, I am not quite convinced that this is what it does (I also observed a strange result with it recently). You may want to add armtps.F to your user subroutines and modify it so that it does not cause the job to stop prematurely.
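From the figures in your log, the remaining allocation (about 2279 s) is already smaller than the estimated cost of one more step (roughly the mean step time of 242 s plus the 2160 s security margin, i.e. about 2400 s), which is why the run stops at step 81. As a sketch only, and assuming your 1.3 installation exposes the tmarus keyword (the CPU-time margin, computed automatically when it is negative; check optcal.h before relying on it), imposing a fixed margin from usini1.F would avoid modifying armtps.F:

c     Sketch: force a fixed 10-minute CPU-time security margin instead
c     of the automatic heuristic. The tmarus keyword is an assumption
c     for version 1.3; verify it in optcal.h, otherwise modify a copy
c     of armtps.F directly as suggested above.
      tmarus = 600.d0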
Otherwise, it is strange that the restart should give you incorrect results. Did you postprocess the results or check the "listing" file to make sure that the calculation's results just before stopping seemed correct?
Best regards,
  Yvan
Philippe Parnaudeau

Re: Problem with a parallel job.

Post by Philippe Parnaudeau »

Hello,
Thanks for your answer.
OK for your suggestion (adding armtps.F to my user subroutines and modifying it so that it does not cause the job to stop prematurely); I will try to do that.
For the restart problem, I investigated a bit and ran a new simulation with more elapsed time.
The elapsed time for this long run covers the two previous runs (i.e. a first run plus a restart), and the results seem to be bad (velocities that are too large)...
Regards 
Philippe Parnaudeau

Re: Problem with a parallel job.

Post by Philippe Parnaudeau »

After a few days spent testing, I think I really do have a problem when I restart a job.
My case (described in my first post) converged in the first run:
   ** INFORMATIONS ON THE CONVERGENCE
      -------------------------------
---------------------------------------------------------------
   Variable    Rhs norm      N_iter  Norm. residual      derive
---------------------------------------------------------------
c  Pressure     0.19328E-02    3929   0.98630E-08   0.71216E+00
c  VelocitU     0.40586E+00     124   0.98757E-08   0.43668E-02
c  VelocitV     0.27390E-01     146   0.93045E-08   0.11009E-02
c  VelocitW     0.47047E-01     136   0.95211E-08   0.10405E-02
c  TurbEner     0.13364E-01     120   0.98569E-08   0.37657E-04
c  omega        0.16050E+06      32   0.89508E-08   0.34588E+05
c  Temp.K       0.12187E+03      72   0.89896E-08   0.11965E-01
---------------------------------------------------------------
 
 
 
   ** INFORMATIONS ON THE VARIABLES
      -----------------------------
---------------------------------------------------------------
   Variable      Min. value    Max. value   Min clip   Max clip
---------------------------------------------------------------
v  Pressure    -0.12868E+01   0.79520E+00         --         --
v  VelocitU    -0.57194E+00   0.16545E+01         --         --
v  VelocitV    -0.91935E+00   0.92499E+00         --         --
v  VelocitW    -0.11094E+01   0.15753E+01         --         --
v  TurbEner     0.12671E-14   0.18247E+00       1804          0
v  omega        0.38811E-01   0.14688E+06         22          0
v  Temp.K       0.28766E+03   0.30765E+03          0          0
v  Lam. vis     0.17818E-04   0.18764E-04          0          0
v  turb. vi     0.35720E-20   0.24151E-01          0          0
v  total_pressu 0.10128E+06   0.10132E+06          0          0
v  Th. cond     0.24781E-04   0.26098E-04          0          0
---------------------------------------------------------------
 
The results are physically acceptable...
But when I try to restart, after a few iterations I get this result:
 
  ** INFORMATIONS ON THE CONVERGENCE
     -------------------------------
--------------------------------------------------------------
  Variable    Rhs norm      N_iter  Norm. residual      derive
--------------------------------------------------------------
  Pressure     0.10919E+00    3920   0.99561E-08   0.99648E+00
  VelocitU     0.39989E+00     133   0.94825E-08   0.44341E+01
  VelocitV     0.47967E-01     145   0.96760E-08   0.57162E+00
  VelocitW     0.20253E+00     147   0.97753E-08   0.16142E+02
  TurbEner     0.13308E-01     122   0.88715E-08   0.19021E-03
  omega        0.16050E+06      68   0.89448E-08   0.47969E+07
  Temp.K       0.11622E+03      78   0.98349E-08   0.20159E+00
--------------------------------------------------------------
 
 
 
  ** INFORMATIONS ON THE VARIABLES
     -----------------------------
--------------------------------------------------------------
  Variable      Min. value    Max. value   Min clip   Max clip
--------------------------------------------------------------
  Pressure    -0.12788E+03   0.47269E+02         --         --
  VelocitU    -0.31855E+02   0.16582E+02         --         --
  VelocitV    -0.66965E+01   0.78254E+01         --         --
  VelocitW    -0.33113E+02   0.29405E+02         --         --
  TurbEner     0.63856E-13   0.51487E+00       1068          0
  omega        0.10912E-03   0.14709E+06       1672          0
  Temp.K       0.28718E+03   0.30745E+03          0          0
  Lam. vis     0.17802E-04   0.18767E-04          0          0
  turb. vi     0.37154E-18   0.23208E-01          0          0
  total_pressu 0.10116E+06   0.10134E+06          0          0
  Th. cond     0.24759E-04   0.26101E-04          0          0
--------------------------------------------------------------
 
 
 
  ** INFORMATIONS ON THE CLIPPINGS
     -----------------------------
--------------------------------------------------------------
  Variable    Min wo clips  Max wo clips   Min clip   Max clip
--------------------------------------------------------------
  TurbEner    -0.18892E-01   0.51487E+00       1068          0
  omega       -0.42985E+05   0.14709E+06       1672          0
  Temp.K       0.28718E+03   0.30745E+03          0          0
--------------------------------------------------------------
 
I'm sure I'm doing something wrong, and I think the problem concerns the boundary conditions.
After the first run, the results are:
   ** BOUNDARY CONDITIONS FOR SMOOTH WALLS
   ---------------------------------------
------------------------------------------------------------
 Phase      1                            Minimum     Maximum
------------------------------------------------------------
   Rel velocity at the wall uiptn : -0.54381E+00 0.74582E+00
   Friction velocity        uet   :  0.30330E-01 0.15756E+05
   Friction velocity        uk    :  0.00000E+00 0.57993E-01
   Dimensionless distance   yplus :  0.46194E-06 0.71605E+02
   ------------------------------------------------------   
   Nb of reversal of the velocity at the wall   :         96
   Nb of faces within the viscous sub-layer     :      43008
   Total number of wall faces                   :      73728
------------------------------------------------------------
 
and after restart:
 ** BOUNDARY CONDITIONS FOR SMOOTH WALLS
   ---------------------------------------
------------------------------------------------------------
 Phase      1                            Minimum     Maximum
------------------------------------------------------------
   Rel velocity at the wall uiptn :  0.00000E+00 0.19862E+01
   Friction velocity        uet   :  0.23463E-01 0.90612E+05
   Friction velocity        uk    :  0.00000E+00 0.10589E+00
   Dimensionless distance   yplus :  0.13361E-05 0.71597E+02
   ------------------------------------------------------  
   Nb of reversal of the velocity at the wall   :          0
   Nb of faces within the viscous sub-layer     :      42952
   Total number of wall faces                   :      73728
 
 
More information:
The symbolic link "SUITE" is correctly set up in the DATA directory.
I have attached the two listing files...
 
Thanks in advance.
Attachments
listing-restart.txt
listing-first.txt
Philippe Parnaudeau

Re: Problem with a parallel job.

Post by Philippe Parnaudeau »

More information:
It seems that another user has the same problem as me:
http://cfd.mace.manchester.ac.uk/twiki/bin/view/Forum/ForumIntro0052
 
but I cannot find any solution...
 
Regards.
Yvan Fournier

Re: Problem with a parallel job.

Post by Yvan Fournier »

Hello,
It is difficult to determine anything with only some elements of the log files. Note that the information on convergence in the log file is that of the linear solvers for a given time step, not a global convergence indicator for the calculation.
The range of values for uiptn at the boundary does seem strange (I would not expect a negative value in the initial calculation, but I am not an expert on the turbulence wall laws).
Otherwise, the range of values for Y+ seems similar.
How are the results "strange" after restart? Do you have a postprocessing view illustrating the problem?
Are you running a steady or an unsteady calculation? The problem reported by the other user is in a steady case, and the steady algorithm is more recent (it was not validated as extensively as the unsteady algorithm at the release of 1.3). Large calculations routinely use restarts with no problems, but I am not sure whether any of those use the steady algorithm (most calculations using the steady algorithm manage to run in one time allocation slot).
If your issue is with a steady calculation, debugging restarts in version 1.3 will certainly not be a priority for the Code_Saturne team, though if you have the same problem with version 2.0.1, we will look into it.
In any case, to debug further, we would need your data setup (user subroutines and/or xml file, and mesh, or a smaller version of your mesh).
Finally, note that although it is not labeled as such, an unsteady calculation with a spatially local time step is actually a form of steady algorithm (as the time step is not global, a solution at a given "time" cannot be interpreted as such, but is an intermediate step towards a converged solution). That variant of the algorithm has been tested more extensively over time, and has been seen to actually be more robust in some cases. So you may wish to switch to an "unsteady with time step varying in space" option instead.
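As a minimal sketch, and assuming you set the calculation options through the user subroutines rather than only through the GUI, this switch corresponds to the idtvar keyword (the GUI exposes the same choice in the time step settings; check the keyword list of your version):

c     Time step option (sketch; values to be checked against your version):
c       idtvar =  0   constant and uniform time step
c       idtvar =  1   variable in time, uniform in space
c       idtvar =  2   variable in time and in space
c       idtvar = -1   steady (relaxation) algorithm
      idtvar = 2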
Best regards,
  Yvan Fournier
 
Philippe Parnaudeau

Re: Problem with a parallel job.

Post by Philippe Parnaudeau »

hello,
"Note that the information on convergence in the log file is that of the
linear solvers for a given time step, not a global convergence indicator
for the calculation."
OK.
"The range of values for uitpn at th boundary does seem strange (I would
not expect a negative value in the initial calculation, but I am not an
expert on the turbulence wall laws."
The first run is not the initial calculation, but results taken after a short run. But anyway, you're right, I have a problem there.
I suspect the mesh cells are too large somewhere, and I am investigating in that direction for the moment, and the same for Y+.
Could you confirm this?
 
"How are the results "strange" after restart ?"
 
A huge jump in the velocity values, like this:
After the first run : VelocitU :   -0.57194E+00   0.16545E+01
Restart after 1 iteration :    VelocitU    -0.31855E+02   0.16582E+02 
I don't understand what happens; to me, it's wrong...
But I'm not an expert in RANS simulation; I only know DNS and LES... So this is my first steady simulation (RANS)...
 
I gave my mesh and my *.xml to D. Monfort last week.
I understand that I am using an old version of Code_Saturne, and I will start installing the new version ASAP.
 
Many thanks for your answers.
 
Kind regards.
Guest

Re: Problem with a parallel job.

Post by Guest »

Hello Yvan
 
Please tell me how to restart from the last time step after 100 iterations,
 
if I give 200 iterations, as the solution is not converged.
 
Yvan Fournier

Re: Problem with a parallel job.

Post by Yvan Fournier »

Hello,
If you use the GUI, check the "restart" section. If you are only using the script, search for "restart" in the runcase. The reference documentation may also provide details, but the GUI should be self-explanatory.
In any case, if you ran 100 iterations in the first run and want to add 200, you need to set the new number of iterations to 300 (and not 200), as the value given represents the total.
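As a sketch only (the same settings are exposed in the GUI, and the keyword names should be checked against the user guide of your version), the corresponding lines in usini1.F would look like:

c     Sketch: restart from the previous calculation and extend the run.
c     isuite = 1 activates the restart; ntmabs is the absolute number
c     of the last time step, i.e. the total over all runs.
      isuite = 1
      ntmabs = 300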
Best regards,
  Yvan