Has anyone had problems with restarts in v7?
I cannot manage to restart a simulation. The previous simulation was completed with no errors.
The error looks like a memory issue; could it be due to the environment rather than to the code itself?
I did indeed see (before creating this new topic) your forum post about your restart issues.
I had misunderstood the solution you found: I changed only the read method, not the write method. I have just run a test changing both methods, and the restart now seems to work correctly!
Do you have more information on the MPI library and file system on which you encountered this issue?
Also, if you have a small test case on which I could try to reproduce this (in case it is generic and not related to a single system), that would be of interest.
Sorry Yvan, I missed your last post on this topic, which is why I had not replied earlier.
I will try to collect the information about our MPI environment and also try to build a small test case (the actual one is too big to be shared).
Thank you (and happy new year!),
Best regards,
Daniele
Thanks for the info. At the beginning of your "listing"/run_solver.log file, you also have system information. Could you provide that (editing out, if you choose, the line with your login name, which I do not need)?
Local case configuration:
Date: …
System: Linux 3.10.0-1127.19.1.el7.x86_64
Machine: node112
Processor: model name : Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Memory: 191887 MB
User: ...
Directory: …
MPI ranks: 80 (appnum attribute: 0)
MPI ranks per node: 40
OpenMP threads: 1
Processors/node: 20
Compilers used for build:
C compiler: gcc (GCC) 5.5.0
C++ compiler: g++ (GCC) 5.5.0
Fortran compiler: GNU Fortran (GCC) 5.5.0
MPI version 3.1 (Open MPI 2.1.1)
I/O read method: standard input and output, serial access
I/O write method: standard input and output, serial access
I/O rank step: 1
External libraries for partitioning:
ParMETIS 4.0.3
SCOTCH 6.1.0
Hope this answers your question.
Kind regards,
Daniele
Yes, this is what I wanted to check. Open MPI 2.1 is quite old; at the time we used it I did not encounter I/O problems, though such problems can also depend on the filesystem configuration, so they are difficult to reproduce on another system.
I see that your system reports 20 processors per node, but you are using 40 ranks per node. The reporting/detection might be wrong, but otherwise I would expect better performance using only as many ranks per node as there are physical processors, unless perhaps hyperthreading comes into play? I'm interested in feedback on this too.