Restart not taken into account in v7
Posted: Thu Oct 26, 2023 3:33 pm
by daniele
Hello,
In v7.0.2, I have noticed that in some cases I cannot get the code to do a restart.
I select restart in the GUI and select the correct checkpoint directory, but the code seems blind to my setup and always restarts the simulation from zero.
ISUITE=0 appears in the listing, which means that it does not even try to consider the restart.
I checked the path of the checkpoint, even tried to insert the entire path by hand in the setup.xml, but nothing changes.
What can be the problem?
Thank you very much for your help.
Kind regards,
Daniele
Re: Restart not taken into account in v7
Posted: Fri Oct 27, 2023 1:35 pm
by Yvan Fournier
Hello,
Do you also have a cs_user_scripts.py file in DATA ? This can also be used to set the restart path, and takes precedence over the XML file.
Otherwise, do you have any warning in the case's log (not only the run_solver.log, but also the logs appearing in the terminal or batch output) ?
Best regards,
Yvan
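Editor's note: a minimal sketch of what setting the restart path in DATA/cs_user_scripts.py can look like. It assumes the hook is named define_domain_parameters, as in the user-script template shipped with the code (check the template for your version). The run ID in the path is the one quoted later in this thread and must be adapted to your own case.

```python
# DATA/cs_user_scripts.py (sketch, not a complete user script)

def define_domain_parameters(domain):
    """Set the restart directory for this run.

    Assumption: the run scripts call this hook with a domain object
    that exposes a 'restart_input' attribute, and a value set here
    takes precedence over the one stored in setup.xml.
    """
    # Path relative to the case directory; adapt the run ID to your case.
    domain.restart_input = 'RESU/20240215-1025/checkpoint'
```

If an old cs_user_scripts.py left in DATA sets this attribute to None or to another directory, that would be consistent with the GUI setting appearing to be ignored.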
Re: Restart not taken into account in v7
Posted: Fri Feb 16, 2024 11:09 am
by daniele
Hello Yvan,
I forgot to update you about this issue: indeed, imposing the restart through the cs_user_scripts.py works!
So thank you very much for this tip!
However, I sometimes encounter the following error with restart (I am still using v7):
Code:
Reading file: restart/mesh_input.csm
Finished reading: restart/mesh_input.csm
No "partition_input/domain_number_52" file available;
----------------------------------------------------------
SIGTERM signal (termination) received.
--> computation interrupted by environment.
It seems that CS looks for a partition input? In Performance settings, the "Use existing partition input" is set to off...
The number 52 in the domain name corresponds to the total number of CPUs requested: in this example I requested 52 CPUs, and it would show 100 instead if I requested 100.
Do you have an idea of the cause of the problem?
Thank you very much in advance for your help.
Kind regards,
Daniele
Re: Restart not taken into account in v7
Posted: Fri Feb 16, 2024 11:16 pm
by Yvan Fournier
Hello Daniele,
When code_saturne does not find a partition input file matching a given number of MPI ranks, it simply recomputes the partition (so saving a partitioning is useful mostly for debugging, when we want to ensure we are in the same conditions as a failed computation).
So I would guess the crash is due to something else. If the code is killed by the environment, that may be due to a caught error on another rank (leading to MPI_Abort, with this message on rank 0), or a timeout for example.
Do you have any other non-empty error* files, or anything "interesting" in the run_case.log or batch output and/or error files?
Best regards,
Yvan
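Editor's note: a small sketch for spotting non-empty error* files in a run directory, as suggested above. The function name and the run directory argument are hypothetical; only the error* file pattern comes from the thread.

```python
import glob
import os


def nonempty_error_files(run_dir):
    """Return the sorted paths of non-empty files matching error* in run_dir."""
    pattern = os.path.join(run_dir, 'error*')
    return sorted(
        path for path in glob.glob(pattern)
        if os.path.isfile(path) and os.path.getsize(path) > 0
    )
```

Calling it on the RESU/<run_id> directory of the failed run lists the files worth opening first; with several MPI ranks, the rank that actually failed may not be the one whose message shows up in the main listing.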
Re: Restart not taken into account in v7
Posted: Wed Feb 21, 2024 5:19 pm
by daniele
Hello Yvan,
I see what you mean.
I actually found a forbidden memory area access error, reported in the error file (I have two error files, this is reported only in one of the two):
SIGSEGV signal (forbidden memory area access) intercepted!
This could point to the source of the issue, but the setup of the two cases is identical, and I have no time averages that could cause problems at restart...
I will go on investigating...
Thank you very much.
Best regards,
Daniele
Re: Restart not taken into account in v7
Posted: Wed Feb 21, 2024 8:09 pm
by Yvan Fournier
Hello Daniele,
Do you have a stack trace in the file indicating a SIGSEGV ?
Best regards,
Yvan
Re: Restart not taken into account in v7
Posted: Wed Feb 21, 2024 10:56 pm
by daniele
Hello Yvan,
Yes:
Code:
Call stack:
1: 0x2aaad3362040 <+0x261040> (libsaturne-7.0.so)
2: 0x2aaad3364d9f <cs_all_to_all_copy_array+0x137f> (libsaturne-7.0.so)
3: 0x2aaad3717787 <cs_partition+0x4b47> (libsaturne-7.0.so)
4: 0x2aaad328a02a <cs_preprocessor_data_read_mesh+0x1fa> (libsaturne-7.0.so)
5: 0x2aaad3284118 <cs_preprocess_mesh+0x208> (libsaturne-7.0.so)
6: 0x2aaad2efe9c0 <main+0x2d0> (libcs_solver-7.0.so)
7: 0x2aaad7b70555 <__libc_start_main+0xf5> (libc.so.6)
8: 0x402919 <> (cs_solver)
End of stack
Thank you very much.
Best regards,
Daniele
Re: Restart not taken into account in v7
Posted: Thu Feb 22, 2024 1:06 am
by Yvan Fournier
Hello,
This seems to occur in the partitioning stage, before the restart file is read. So I would guess it is independent of the restart.
Which partitioning algorithm are you using? I assume PT-Scotch, if it is installed, as it is the default.
If this is the case, does the crash occur before writing partition_output_domain_number*?
Also, which version are you using ? It should be specified in the run_solver.log file.
Best regards,
Yvan
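Editor's note: a rough sketch of how a call stack in the format posted above can be reduced to its named frames, which is what shows the crash sitting under cs_partition (called from cs_preprocess_mesh) rather than in a restart-reading routine. The function name and the regular expression are illustrative assumptions based only on the trace lines quoted in this thread.

```python
import re

# One frame looks like:
#   2: 0x2aaad3364d9f <cs_all_to_all_copy_array+0x137f> (libsaturne-7.0.so)
# The symbol name may be empty, as in "<+0x261040>" or "<>".
FRAME_RE = re.compile(
    r'^\s*(\d+):\s+0x[0-9a-fA-F]+\s+'
    r'<([^>+]*)(?:\+0x[0-9a-fA-F]+)?>\s+'
    r'\(([^)]+)\)'
)


def named_frames(trace_text):
    """Return (frame_number, symbol, library) for frames that have a symbol."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line)
        if match and match.group(2):
            frames.append((int(match.group(1)), match.group(2), match.group(3)))
    return frames
```

Frames are listed innermost first, so the first named entry is the closest identifiable function to the crash.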
Re: Restart not taken into account in v7
Posted: Thu Feb 22, 2024 11:49 am
by daniele
Hello Yvan,
Yes, I use the Default partitioning. I have just made a test with other partitioning methods, but nothing changes.
Yes, the crash occurs before the partitioning file is written...
I am using v7.0.2.
I see that it does not seem to be linked to the restart, but the calculation actually runs correctly when I just remove the "domain.restart_input = 'RESU/20240215-1025/checkpoint'" line from cs_user_scripts.py.
By the way, the listing indicates that the error occurs when reading the restart/mesh_input.csm:
Code:
Reading file: restart/mesh_input.csm
Finished reading: restart/mesh_input.csm
No "partition_input/domain_number_104" file available;
----------------------------------------------------------
SIGTERM signal (termination) received.
--> computation interrupted by environment.
Call stack:
1: 0x2b858c32a8e3 <+0x1578e3> (libopen-pal.so.20)
2: 0x2b858c210b39 <opal_progress+0xb9> (libopen-pal.so.20)
3: 0x2b858900a31d <ompi_request_default_wait+0x1d> (libmpi.so.20)
4: 0x2b8589050841 <ompi_coll_base_sendrecv_nonzero_actual+0xb1> (libmpi.so.20)
5: 0x2b85890532ae <ompi_coll_base_alltoall_intra_bruck+0x2ae> (libmpi.so.20)
6: 0x2b858901b097 <PMPI_Alltoall+0x177> (libmpi.so.20)
7: 0x2b858635d0d7 <+0x2610d7> (libsaturne-7.0.so)
8: 0x2b858635fd9f <cs_all_to_all_copy_array+0x137f> (libsaturne-7.0.so)
9: 0x2b8586712787 <cs_partition+0x4b47> (libsaturne-7.0.so)
10: 0x2b858628502a <cs_preprocessor_data_read_mesh+0x1fa> (libsaturne-7.0.so)
11: 0x2b858627f118 <cs_preprocess_mesh+0x208> (libsaturne-7.0.so)
12: 0x2b8585ef99c0 <main+0x2d0> (libcs_solver-7.0.so)
13: 0x2b858ab6b555 <__libc_start_main+0xf5> (libc.so.6)
14: 0x402919 <> (cs_solver)
End of stack
Thanks a lot.
Best regards,
Daniele
Re: Restart not taken into account in v7
Posted: Thu Feb 22, 2024 7:25 pm
by Yvan Fournier
Hello Daniele,
Does RESU/20240215-1025/checkpoint contain a "mesh_input.csm" file or is the mesh input re-imported at each restart ? If the failure occurs just with this file, there might be a file corruption problem.
Also, do you check "use unmodified checkpoint mesh in case of restart" in the GUI? Or use a similar setting outside the GUI?
In that case, if you have some preprocessing operation applied, it might mean there is some inconsistency in the mesh, which is minor enough so that it does not cause a crash when running, but leads to an incorrect mesh when saved and re-loaded (which goes through slightly different paths).
In any case, with a debug build, I can help you run this under a debugger so as to obtain more info. With a production build, it will be difficult to say more. If the case runs on a workstation, it should be possible to run it under Valgrind (if small enough) or with an AddressSanitizer build and really pinpoint the cause of the crash.
I do not think we have encountered a similar issue with v7.0 or v8.0 with a crash in the same zone (and I see no fixes in later v7.0.x releases that seem relevant to this), so I have no "obvious" idea. Fixing this will require some sort of instrumentation.
Best regards,
Yvan
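Editor's note: a minimal sketch of the first check suggested above, namely whether the checkpoint directory holds a mesh_input.csm file at all. The function name and return values are hypothetical. A non-zero size does not prove the file is uncorrupted; it only tells you whether the restart reads the mesh from the checkpoint.

```python
import os


def checkpoint_mesh_status(checkpoint_dir, mesh_name='mesh_input.csm'):
    """Report whether checkpoint_dir contains a mesh file.

    Returns 'absent', 'empty' or 'present'.
    """
    path = os.path.join(checkpoint_dir, mesh_name)
    if not os.path.isfile(path):
        return 'absent'
    if os.path.getsize(path) == 0:
        return 'empty'
    return 'present'
```

For the case discussed here, the directory to test would be RESU/20240215-1025/checkpoint.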