v2.1-rc2 periodicity problem

Questions and remarks about code_saturne usage

Post by Ashton Neil »

Hi,

I'm currently using v2.1-rc2 at the University of Manchester. I'm porting the work I did in v2.0.1 (new DDES subroutines) to v2.1-rc2, as well as implementing a new SA-DDES model. However, I've run into a few problems when using translational periodicity in each direction.

I'm running the decaying isotropic turbulence test case, where periodicity is set in each direction.

To make finding the bug easier, I've limited the subroutines I'm using to the minimum: usclim.f90, usini1.f90, usiniv.f90 and a module incthi.f90, as well as the cs_user_periodicity.c file. (I only use a file to initialise the velocity field.)

When I use periodicity in all directions, the computation hangs before starting the first iteration; however, when I use symmetry in one or two directions, the case runs fine.

I have no problems running this case in v2.0.1 using the exact same settings (although of course the way of specifying periodicity is different in v2.1-rc2).

Are there any issues or bugs with periodicity that the development team is aware of?

I've attached the subroutines, mesh and input files in case you have time to try to reproduce the error I get.

Thanks very much
 
Neil Ashton
Attachments
dit_perio_setup.tgz
(8.65 MiB)

Post by Yvan Fournier »

Hello Neil,

I reproduced your problem, and it seems to be due to a recently fixed bug (so it will not be present in the final 2.1 release).

To work around it, simply deactivate the visualization output of the boundary mesh (which is empty when you have periodicity in 3 directions, and which is where the bug with an infinite loop occurs).
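A toy sketch of why the boundary mesh is empty in this configuration, assuming a structured box of nx x ny x nz cells (the function name and the grid model are illustrative assumptions, not code_saturne's mesh structures): each periodic direction removes the two boundary planes normal to it, so with periodicity in all three directions no boundary faces remain.

```python
def count_boundary_faces(n_cells, periodic):
    """Count boundary faces of a structured box mesh.

    n_cells  -- (nx, ny, nz) cell counts per direction (toy model)
    periodic -- (px, py, pz) booleans, True if that direction is periodic
    """
    nx, ny, nz = n_cells
    # Number of faces on one boundary plane normal to x, y and z.
    faces_per_plane = (ny * nz, nx * nz, nx * ny)
    # A non-periodic direction contributes two boundary planes;
    # a periodic direction contributes none, since its faces become interior.
    return sum(
        2 * n_faces
        for n_faces, is_periodic in zip(faces_per_plane, periodic)
        if not is_periodic
    )


if __name__ == "__main__":
    # Periodic in x and y only: the two z planes remain as boundary.
    print(count_boundary_faces((4, 4, 4), (True, True, False)))
    # Periodic in all three directions: the boundary mesh is empty.
    print(count_boundary_faces((4, 4, 4), (True, True, True)))
```

This matches the observation that replacing periodicity with symmetry in one or two directions avoids the hang: the boundary mesh is then no longer empty.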

To deactivate the output of the boundary mesh, de-associate it from all writers, either using the GUI or using, for example, the attached file (based on the reference example, simply replacing a "false" with "true" on line 250 of cs_user_postprocess.c).

With this change, your test case seems to finish fine.

Best regards,

  Yvan
Attachments
cs_user_postprocess.c
(22.44 KiB)

Post by Ashton Neil »

Hi Yvan,
 
Thanks very much, that solves the problem!
 
Cheers
 
Neil

Post by Ashton Neil »

Hi Yvan,
Thanks for your help with the previous problem. I managed to run this case with no problems (although I need to post-process the data using EnSight, so I look forward to using the release without this bug).
I'm running two further cases: the flow over a 2D hump and the 2D periodic hills. I'm testing out the SA-DDES hybrid RANS/LES model which I've implemented.
I have an error which is produced by both cases when I use the SA-DDES model (but not when I use the SA model on its own).
The error is the following:
 
SIGTERM signal (termination) received.
--> computation interrupted by environment.
 
Call stack:
   1: 0x383dacb65f <__poll+0x2f>                    (libc.so.6)
   2: 0x2aef9124a931 ?                                (?)
   3: 0x2aef912497c9 ?                                (?)
   4: 0x2aef9123dc89 <opal_progress+0x99>             (libopen-pal.so.0)
   5: 0x2aef90d6fda5 ?                                (?)
   6: 0x2aef954d68ea ?                                (?)
   7: 0x2aef90d84bdc <PMPI_Allreduce+0x17c>           (libmpi.so.0)
   8: 0x2aef8ea4eacf <parcpt_+0x2f>                   (libsaturne.so.0)
   9: 0x434a9b     <turbsa_+0x1c3b>                 (cs_solver)
  10: 0x426455     <tridim_+0x4155>                 (cs_solver)
  11: 0x2aef8ea1f148 <caltri_+0x2648>                 (libsaturne.so.0)
  12: 0x2aef8ea18c5b <cs_run+0x71b>                   (libsaturne.so.0)
  13: 0x2aef8ea18e85 <main+0x1e5>                     (libsaturne.so.0)
  14: 0x383da1d994 <__libc_start_main+0xf4>         (libc.so.6)
  15: 0x40e5e9     <main+0x41>                      (cs_solver)
End of stack
I get the same problem in serial and on a different installation of code_saturne v2.1-rc2 on my home PC.
I've been trying to track down what is causing the memory violation. I believe it originates from turbsa.f90, but is there any way to see the exact line or call which caused it? All I can see is "9: 0x434a9b <turbsa_+0x1c3b> (cs_solver)" etc., and I'm not sure how to interpret this.
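A minimal sketch of how such a frame can be read, assuming the layout shown in the call stack above: the first hexadecimal number is the instruction address, `turbsa_+0x1c3b` means 0x1c3b bytes past the start of the `turbsa_` symbol, and the name in parentheses is the binary containing it. The helper names below are hypothetical. The GNU binutils `addr2line` tool can map such an address to a file and line, but only if the binary was built with debug symbols (`-g`); the sketch only builds the command for frames inside the `cs_solver` executable, since addresses in shared libraries would first need the library's load address subtracted.

```python
import re

# Matches frames such as
#   "   9: 0x434a9b     <turbsa_+0x1c3b>                 (cs_solver)"
# and unresolved frames such as
#   "   2: 0x2aef9124a931 ?                                (?)"
FRAME_RE = re.compile(
    r"^\s*(\d+):\s+(0x[0-9a-fA-F]+)\s+"
    r"(?:<([^+>]+)\+(0x[0-9a-fA-F]+)>|\?)\s+\(([^)]+)\)\s*$"
)


def parse_frame(line):
    """Split one call-stack line into its fields, or return None if it does not match."""
    match = FRAME_RE.match(line)
    if match is None:
        return None
    index, address, symbol, offset, module = match.groups()
    return {
        "index": int(index),
        "address": int(address, 16),
        "symbol": symbol,  # None for unresolved ("?") frames
        "offset": int(offset, 16) if offset else None,
        "module": module,
    }


def addr2line_command(frame, executable="cs_solver"):
    """Build an addr2line command for frames located in the main executable."""
    if frame is None or frame["module"] != executable:
        return None
    # -f also prints the function name, -e selects the binary to inspect.
    return "addr2line -f -e {} {:#x}".format(executable, frame["address"])


if __name__ == "__main__":
    frame = parse_frame("   9: 0x434a9b     <turbsa_+0x1c3b>                 (cs_solver)")
    print(frame["symbol"], hex(frame["address"] - frame["offset"]))
    print(addr2line_command(frame))
```

Without debug symbols, the offset still narrows things down: it says how far into the compiled `turbsa` routine the frame is, which can be compared against a disassembly of the executable.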
I've tried to print out the variables into separate files and they all seem ok, nothing to suggest that the code is going to crash.
Thanks very much
 
Neil

Post by Yvan Fournier »

Hello Neil,
You can already use postprocessing with the version you have: deactivating the boundary mesh output but keeping the volume mesh output should not trigger the bug.
When you get a
 
SIGTERM signal (termination) received.
--> computation interrupted by environment.
message, it means the process doing the printing/logging is not the one which caused the crash, but was interrupted because another processor or coupled domain crashed (or because of more mundane issues, such as the user pressing CTRL-C in interactive mode, or the allotted time running out in batch mode, but this does not seem to be the case here).
You may check the error_xxx files (instead of only the default error file of rank 0) to see if there is anything more interesting.
In a coupled case, also check the "listing" and "error" files for both domains, as one probably crashed first (bringing the other down with it), and may have more interesting error messages.
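A small sketch of that check, assuming the per-rank error files sit in the run directory and match the glob pattern `error*` (the pattern, the function name and the directory layout are assumptions and may need adjusting to your installation). It lists the error files that contain something other than a SIGTERM notice, which are the most likely to come from the rank that actually crashed.

```python
from pathlib import Path


def interesting_error_files(run_dir, pattern="error*"):
    """Return names of error files that hold more than a SIGTERM notice.

    run_dir -- directory containing the run's log files (assumed layout)
    pattern -- glob pattern for the error files (assumed naming)
    """
    hits = []
    for path in sorted(Path(run_dir).glob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        # Ranks that were merely interrupted only report SIGTERM;
        # skip those, as well as empty files.
        if text.strip() and "SIGTERM" not in text:
            hits.append(path.name)
    return hits


if __name__ == "__main__":
    for name in interesting_error_files("."):
        print(name)
```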
Also, you should not be getting the exact same error in serial, as the interruption seems to occur while waiting for a parallel sum. If you still get a crash in parcpt in serial, it means either that parcpt is called when it should not be, or that a more serious memory bug a bit earlier confused everything.
If your case is small enough, running it compiled in debug mode and under Valgrind could provide interesting info (20x memory and speed overhead, so limited to cases below 1 million cells, but it may save a huge amount of debugging time).
Cheers,
  Yvan

Post by Ashton Neil »

Hi Yvan,

I ran Valgrind in parallel and, very interestingly, the case runs under Valgrind but not natively without it.

The Valgrind FAQ said that this does sometimes happen and can point to unaddressable memory problems etc. It looks as though Valgrind suppresses some of the memory issues.

I've attached the Valgrind output. I was hoping you could have a quick look through it to see if you recognise any signs of errors you've seen before. I realise the version I have (v2.1-rc2) might be missing some fixes you've made, so I wasn't sure if this was one of them.

I can see errors arising from strmem and some other libraries but as these are binaries I'm finding it a bit tricky to link them to the subroutine or call which is causing them.

I make the same modifications to turbkw.f90 and turbke.f90 as I made to turbsa.f90 (allocating new memory etc.) for other DDES-based models (SST-DDES, v2f-DDES etc.), but the error only affects the SA model, which makes me wonder whether there is some memory conflict related to this model.

Appreciate any guidance you can offer!

Cheers

Neil

P.S. The error on one of the processors is:
Call stack:
   1: 0x42e6bf     <turbsa_+0x186f>                 (cs_solver)
   2: 0x2abcac40749e <tridim_+0x49be>                 (libsaturne.so.0)
   3: 0x2abcac313148 <caltri_+0x2648>                 (libsaturne.so.0)
   4: 0x2abcac30cc5b <cs_run+0x71b>                   (libsaturne.so.0)
   5: 0x2abcac30ce85 <main+0x1e5>                     (libsaturne.so.0)
   6: 0x382541d994 <__libc_start_main+0xf4>         (libc.so.6)
   7: 0x40cc89     <main+0x41>                      (cs_solver)
End of stack
Attachments
valgrind_output_parallel.txt.gz
(88.22 KiB)

Post by Yvan Fournier »

Hello,

The fact that the code runs in the Valgrind environment but not in the native environment without Valgrind is interesting, and usually means it is related to a memory or optimization issue.

In this case, Valgrind doesn't give you much useful info beyond that: there is plenty of output related to MPI initialization, and some warnings in string comparisons when reading the restart file. As a side note, I tried a version of OpenMPI 1.5 compiled with Valgrind support a few months ago, and it had far fewer spurious warnings, though it did give me some false positives, whereas a local build of MPICH2 I have on my Linux machine seems "valgrind-clean". The worst I have seen is OpenMPI with InfiniBand or similar drivers, in which specific memory handling due to "memory pinning" requirements of the fast network drivers leads to plenty of unrelated Valgrind errors.

In any case, the Valgrind output doesn't seem to help much here, unless some of your modifications affect the restart system (in which case we would have to look there first).

You said the problem also affected you in serial mode: perhaps a Valgrind test on a single processor would provide more useful info.

Also, do you get the same crash when running without Valgrind but using a build with --enable-debug? In case of compiler optimization bugs, it may make a difference.

If none of this works, I could check your memory modifications (for both a model with no issue and the SA-DDES).

Cheers,

  Yvan