Hi all,
I am running Code_Saturne 8.2 coupled with SSH-Aerosol on an HPC system, and I consistently encounter a runtime error at the last time step of the simulation.
The case runs normally for most of the simulation, but it fails during the final time step. The job terminates with an error (see details below). I have attached the relevant error messages and runtime information from the Slurm output file for reference.
Thank you very much for your time and help.
Best regards,
Sophie
Running the model on HPC
Attachments
- slurm-50365070.txt (3.95 MiB)
Yvan Fournier
Re: Running the model on HPC
Hello,
Do you use a specific tool or setting to obtain backtraces in the SLURM output? By default, code_saturne installs its own exception handler, which logs backtraces (with less detail) to error* files, but some libraries (at least PT-Scotch and ParaView, possibly others) can force their own handlers.
In any case, this seems to be a memory corruption issue. It might be in SSH-Aerosol, but I can't be sure of this. I would recommend running a (much) smaller version of your case for a few time steps on a local workstation with a debugging tool such as Valgrind or with an AddressSanitizer build to try to locate the issue.
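As a first triage step before setting up a Valgrind or AddressSanitizer run, a large SLURM output file can be scanned for the usual signatures of memory errors. The following is only a minimal Python sketch: the function name `scan_log` and the pattern list are illustrative assumptions (they are not part of code_saturne or SSH-Aerosol), and the patterns cover only a few common glibc, AddressSanitizer, and Valgrind messages, so an empty result does not rule out memory corruption.

```python
import re
from collections import Counter

# Illustrative, non-exhaustive signatures of memory errors.
# The category names are arbitrary labels chosen for this sketch.
SIGNATURES = {
    # Messages printed by glibc when its heap consistency checks fail
    "glibc heap check": re.compile(
        r"double free or corruption"
        r"|free\(\): invalid (pointer|size)"
        r"|malloc\(\): "
        r"|corrupted size vs\. prev_size"
        r"|munmap_chunk\(\): invalid pointer"
    ),
    # Header line of an AddressSanitizer report
    "AddressSanitizer": re.compile(r"ERROR: AddressSanitizer: [\w-]+"),
    # Valgrind memcheck reports on invalid accesses or frees
    "Valgrind memcheck": re.compile(r"==\d+== Invalid (read|write|free)"),
    # Signals commonly raised after memory corruption
    "signal": re.compile(r"SIGSEGV|SIGABRT|SIGBUS|Segmentation fault"),
}


def scan_log(lines):
    """Count matches per category and record the first matching line.

    `lines` is any iterable of strings, for example an open text file.
    Returns (counts, first_hit), where first_hit maps a category name
    to a (line_number, stripped_line) tuple.
    """
    counts = Counter()
    first_hit = {}
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
                # Keep only the earliest occurrence for each category
                first_hit.setdefault(name, (lineno, line.strip()))
    return counts, first_hit
```

For the attached file, this could be called as `scan_log(open("slurm-50365070.txt", errors="replace"))`. The earliest hit is usually the most informative one, since later messages are often consequences of the first error.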
Also, if you have the possibility of upgrading to v9.0, that is recommended, as v8.2 is an intermediate feature release, which has been retired as of the v9.0 release. So although a bug may still occur in v9.0, its fix would be more relevant.
Best regards,
Yvan