Tutorial Case2 problem while computing geometric quantities

Marta Garcia

Tutorial Case2 problem while computing geometric quantities

Post by Marta Garcia »

Hello,

I'm working on CASE2 (FULL_DOMAIN) of the tutorial (version 2.0.0-rc2), and apparently I have a problem with the generated mesh, since the code exits after the "Renumbering mesh" section and before the line "Computing geometric quantities".

I'm preprocessing the mesh, partitioning and running with the following lines:
> cs_preprocess -m downcomer.des fdc.des pic.des --join --color 5 24 32
> cs_partition 4
> qsub  ... cs_solver $@ --mpi --param case2.xml
I also tried without the option "--color 5 24 32" but I still have the problem.

I was able to run CASE1 of the tutorial, so I think that I'm doing something wrong in the preprocessing phase, but I don't know what it could be.

Any idea?

Marta
Attachments
listing.txt
Yvan Fournier

Re: Tutorial Case2 problem while computing geometric quantit

Post by Yvan Fournier »

Hello,

No easy answer here. I seem to remember users encountering this type of issue once or twice on Blue Gene machines, but not recently. I would suspect some sort of installation issue, but we need to find which part of the code's tool chain is causing the problem. Your "listing" file or a back trace may help, but there may be other things to try first:
  • If you also built the Kernel on the front-end, you may try to run it using the same preprocessor_output file to check whether it is corrupt or not. Running "cs_solver --quality" in a directory containing the preprocessor_output file allows you to run without setting up boundary conditions and such. If that works, the file can be considered "OK". Otherwise, it is damaged for some reason or another, and reinstalling the preprocessor or running it on another type of machine might help.

  • I noticed that in one of our last installs, I did not use CC=bg_xlc for the configure options of the BFT library, meaning the default front-end compiler, albeit in 32-bit mode, may have been used. It seems the code may work anyhow, but you should check whether you used CC=bg_xlc for BFT's configure line. If it was not used, try it (sorry for giving you an incorrect example in that case).

  • If everything is installed correctly according to the above tests, trying to run without the GUI (preparing the case by adding the --nogui option to cs_create) may be useful. I believe some tests were run with the GUI on BG/P, but most of our users on that machine prefer a minimalist toolchain, and prefer running with user subroutines only. I would rather recommend the GUI, as it is simpler to upgrade a case setup from one version of the code to another, and is more "polished", but the fact is that the XML reader has been less tested on BG/P than on regular clusters. I do not see why it should fail, but something is failing, while other calculations run normally on that sort of machine, so looking first at what is done differently may be useful.

Depending on the results of those tests, we can try to troubleshoot the issue (we're starting on a "bisection" approach, let's just hope that it doesn't require too many iterations).

Best regards,

  Yvan
Marta Garcia

Re: Tutorial Case2 problem while computing geometric quantit

Post by Marta Garcia »

Hello Yvan and Happy New Year,

1) If I run "cs_solver --quality" with the front-end version, I obtain a chr.ensight directory. If I look at the mesh with ParaView, I observe a difference between:

- Figure III.2 of the Tutorial (page 16, "View of the full domain mesh with zoom on the joining regions")

- FULL_DOMAIN_np8.png (see attachment)

I also compared the meshes generated by the cs_preprocess command with and without the "--color 5 24 32" option, and I get the same output.

I'm not sure if the joining option of the preprocessor is dealing correctly with hanging nodes. Is my image normal?

2) I made some tests by adding CC=bglxc and I still have the problem. Then I tried CC=mpixlc, since I use this option in the FVM build, but I get the same problem too.

3) I haven't tried this yet. I will try it tomorrow if you consider that the image in point 1 is correct.

Thank you.

Marta
Attachments
FULL_DOMAIN_np8.png
Yvan Fournier

Re: Tutorial Case2 problem while computing geometric quantities

Post by Yvan Fournier »

Hello Marta, and happy new year to you too!
Actually, the difference you observe is due to ParaView splitting the polyhedra resulting from mesh joining before displaying them (you would not see the same thing using a development version of ParaView (3.9), or the future ParaView 3.10). So everything seems correct at this stage.
Did you try running tutorial 1 (a simpler mesh, no joining) to see if you have the same issue? Also, did you try to run the same case (with the same "cell_domain_32" file) on another machine? Could you check or post the partitioning log? As the case only has 1650 cells, for an average of 51 cells/rank, it is improbable, though not impossible, that some rank has 0 cells, which could explain a crash (METIS usually gives pretty well balanced meshes, SCOTCH slightly less well balanced ones, though possibly with slightly better optimized parallel boundaries).
If everything is OK, I would still be interested in your "preprocessor_output" and "cell_domain_32" files to check if I can reproduce the issue (I cannot reproduce cell_domain_32 if you used METIS, as it uses randomization and will thus produce slightly different partitionings on different machines, but I could use your file).
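The 51 cells/rank average quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the helper name average_cells_per_rank is made up for illustration, the rank counts other than 32 are arbitrary examples, and an average says nothing about how a partitioner actually distributes cells, so a low average only hints at the risk of an empty rank.

```python
def average_cells_per_rank(n_cells: int, n_ranks: int) -> float:
    """Average cell count per MPI rank for an ideally balanced partition."""
    if n_ranks <= 0:
        raise ValueError("n_ranks must be positive")
    return n_cells / n_ranks


# 1650 cells is the mesh size quoted above; the rank counts are examples.
for n_ranks in (4, 8, 32):
    avg = average_cells_per_rank(1650, n_ranks)
    print(f"{n_ranks:3d} ranks: about {int(avg)} cells/rank on average")
```

The partitioning log remains the real reference, since it reports the actual cell counts per domain rather than an average.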
Best regards,
  Yvan
Marta Garcia

Re: Tutorial Case2 problem while computing geometric quantities

Post by Marta Garcia »

Hi Yvan,

Oh! I just downloaded the latest release, 3.8.1, from the web (http://paraview.org/paraview/resources/software.html). But it is good to know that the mesh seems normal.

For TUTORIAL 1: I remember having had a problem while trying to run it on a 64-core partition. I also thought that it could be a problem with the number of cells per core (because there the mesh contains only 700 cells, and 700/64 is pretty low...), so I ran it with 8 cores and found no problems. In the first case (64 cores) it fails during the first iteration, but initialization and partitioning are correct.

For TUTORIAL 2 (CAS2): No, I haven't tried on another machine. I'm supposed to use the BG/P in the future, so I guess it's better to find out where the problems are. By 'np8' I meant 8 cores, sorry. I think 1650/8=206 should be enough for METIS. I have just tried with 32 cores to see what happens, and the code stops later (see the differences between "listing_TUT2_CAS2_8" and "listing_TUT2_CAS2_32"), in the same way as for TUTORIAL 1 with 64 cores (see file "error_n0002_TUT2_CAS2_cores_32"). So it seems that running these simple cases with a high number of cores will not work (at least on BG/P... maybe it works on other platforms?).

Anyway, CAS2 should be able to run partitioned into 4 sub-domains. Please find attached the files "preprocessor_output", "domain_number_4", "domain_number_8", "domain_number_32" and their respective screen outputs "domain_number_*.output". I've observed that for 8 cores (in "domain_number_8.output"), the histogram gives 0 cells per domain in some cases, but I don't understand this histogram information :-(

Thank you again for your time and experience.

Marta
Attachments
CS_TUT2.tar
Yvan Fournier

Re: Tutorial Case2 problem while computing geometric quantities

Post by Yvan Fournier »

Hello Marta,

The error you have for 32 cores is not so bad: it just says that you need to increase the Fortran work array sizes (meaning the automatic default for integers is too small). We are progressively replacing work arrays with dynamic memory, but in the meantime, you may increase that memory size either in usini1.f90 if you are using that (almost at the end), or in the GUI (in the calculation management part). Starting with 30 integers/cell and 150 reals/cell is reasonable, and multiplying those numbers by 1.5 or 2 as long as you get an error in iasize (for integers) or rasize (for reals) should do the trick.
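
The sizing rule above can be sketched as follows. This is only an illustration of the arithmetic: work_array_sizes and its parameters are hypothetical names, not part of code_saturne, and the per-cell starting values and growth factor are simply the ones suggested above. Whether the relevant cell count is the global one or the per-rank one is not settled here, so the 1650 used below is just an example input.

```python
import math


def work_array_sizes(n_cells, ints_per_cell=30, reals_per_cell=150, factor=1.0):
    """Return (integer_size, real_size) for a per-cell budget and growth factor."""
    ia = math.ceil(n_cells * ints_per_cell * factor)
    ra = math.ceil(n_cells * reals_per_cell * factor)
    return ia, ra


n_cells = 1650  # example input: total cell count of this case
factor = 1.0
for attempt in range(4):
    ia, ra = work_array_sizes(n_cells, factor=factor)
    print(f"attempt {attempt}: integers={ia}, reals={ra}")
    factor *= 1.5  # grow by 1.5 (or 2) for as long as the size error persists
```

In practice the loop would stop as soon as the run no longer reports a work array size error; the fixed four attempts only show how quickly the sizes grow.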

It is actually also possible that the issue you had on 64 procs is related to this (normally, Fortran array sizes should be checked everywhere, but a single mistake could be difficult to detect, given the way memory is managed in the Fortran part of the code). So if you get it to work on 32 ranks, you may try again on 64 with bigger work arrays.

Best regards,

  Yvan
Marta Garcia

Re: Tutorial Case2 problem while computing geometric quantities

Post by Marta Garcia »

Hello Yvan,

In the listing file of the 32-core simulation of CASE2, I see "longia=3630 (Number of integers); longra=14520 (Number of reals)". This simulation fails, but at least it provides these values, so I ran the simulation again after putting these values in the case2.xml file (end of line omitted):
 <integer_work_array><ncelet>0</ncelet><nfac>0</nfac><nfabor>0</nfabor><dimless>3630</dimless>
 </integer_work_array><real_work_array><ncelet>0</ncelet><nfac>0</nfac><nfabor>0</nfabor><dimless>14520</dimless>

By doing this, the code fails again, but earlier: in the area initially mentioned, just after "Renumbering mesh".

Analyzing one core file with addr2line, I obtain the following locations to look at:
.../fvm-0.15.1/src/fvm_selector_postfix.c:2730
.../fvm-0.15.1/src/fvm_selector.c:970

I have some questions:
- May I adapt the information on the number of halos per cell, or faces per cell... or could it also work by just modifying the <dimless> tag?
- I have increased the value of <dimless> substantially, but I obtain the same result. Do you think it is still a problem with these integer and real dimensions, or is it something related to the compilation of FVM? I took a look at the lines mentioned, but maybe you have already seen this problem before...

Thank you,

Marta
Yvan Fournier

Re: Tutorial Case2 problem while computing geometric quantities

Post by Yvan Fournier »

Hello Marta,
You may adapt either values relative to cells, faces, or dimless, or a combination thereof.
Usually, we recommend having a multiplier of the number of cells, but in reality, we have a combination of cell, face, and dimless arrays. As in this case the number of cells and faces per rank is very small, the dimless part might be dominant.
So I would just recommend increasing dimless, but don't hesitate to start with a value of 10000 for integers (which amounts to only about 40 Kb), and push it beyond that if necessary. On larger cases, it may be important not to ask for too much memory, but here, we have plenty.
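The "about 40 Kb" figure above follows from simple arithmetic, sketched below under the assumption that a default integer takes 4 bytes (this size is not stated in the thread). The helper name array_bytes is only illustrative.

```python
def array_bytes(n_values: int, bytes_per_value: int) -> int:
    """Memory footprint in bytes of an array holding n_values items."""
    return n_values * bytes_per_value


# Assumption: 4-byte integers (not stated in this thread).
int_bytes = array_bytes(10000, 4)
print(f"10000 integers: {int_bytes} bytes, i.e. about {int_bytes / 1000:.0f} kB")
```

Even pushing the value well beyond 10000 stays negligible for a case of this size.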
If things still fail, I'll try to reproduce your error with the files you posted (I won't have the time before the end of this week, though).
Best regards,
  Yvan
Marta Garcia

Re: Tutorial Case2 problem while computing geometric quantities

Post by Marta Garcia »

Hello Yvan,

Ramesh built his own version on the BG/P and he has the same problem as me, i.e., even if he increases the iasize/rasize variables, the code crashes after the 'Renumbering mesh' lines and points to these fvm_selector_postfix.c subroutines.

In contrast, if he runs CAS2 on his Mac, the code runs! (after adapting the iasize and rasize parameters as done on the BG/P).

We are still looking for the problem, but we would appreciate it if someone could run CASE2 of the FULL_DOMAIN tutorial on your BG/P with the 2.0-rc2 executables, generating the 'preprocessor_output' and 'domain_number_32' files like this:
MESH dir> cs_preprocess -m *.des -j --color 5 24 32
MESH dir> cs_partition 32
DATA dir> qsub..... cs_solver $@ --mpi --param case2.xml
($@ if you use bash, with tcsh it's not needed).

As you suspected, we are considering the possibility of a problem in the installation, but we would just like to know whether you are able to obtain a result.

Thank you.

Marta
Marta Garcia

Re: Tutorial Case2 problem while computing geometric quantities

Post by Marta Garcia »

Hello Yvan,

Problem solved. Someone here observed that the code fails because of an alignment issue, and advised us to run with an additional environment variable passed to the qsub command:
qsub --env "BG_MAXALIGNEXP=-1" ...

Apparently, the default setting on our BG/P handles only 1000 alignment exceptions, and setting the variable to "-1" makes it handle all of them.

This solves the problem.

Thank you again for all your help.

Marta