Problem regarding submitting a job to a cluster

Questions and remarks about code_saturne usage
biodc172
Posts: 12
Joined: Sat May 07, 2022 2:59 am

Problem regarding submitting a job to a cluster

Post by biodc172 »

Hi,

I am trying to use code_saturne on my HPC system. Its login nodes and compute nodes use different CPU architectures, so I used the cross-compilation option during installation.

The installation went fine, but I realized that when code_saturne is executed, it first runs the Python scripts and then the solver executable. Unfortunately, the Python part can only run on my login node, and since I cross-compiled, the executable is meant to run on the compute nodes only.

I tried running "code_saturne run" in the case directory, then moving to the RESU directory and submitting the cs_solver executable to a compute node, but I did not get any feedback from the submission, so I am not sure whether I did it correctly.

Is there any way I can separate the Python part from the solver?

Best regards.
Chai
Posts: 4
Joined: Mon Jan 04, 2021 5:54 pm

Re: Problem regarding submitting a job to a cluster

Post by Chai »

Hi,

What version of code_saturne are you using?
If you are using a recent version (>= 6.0), you can use the "submit" command of code_saturne instead of the "run" command.
Typically, the "submit" command runs the Python script on the login/front-end node and handles the creation of the RUN directory, the copying of run files, the compilation of user functions and of the final executable, and the generation of the "run_solver" script, which loads the required environment variables and paths and launches the executable through the mpiexec/mpirun/... command.

"code_saturne submit --help" would also provide you with any additional info :)
biodc172
Posts: 12
Joined: Sat May 07, 2022 2:59 am

Re: Problem regarding submitting a job to a cluster

Post by biodc172 »

Thanks for your advice, it worked. But after the calculation it failed with this error:

cs_file.c:4661: Fatal error.

Error removing file "checkpoint/previous_dump_0000":

Destination address required


Call stack:
1: 0x4ffff08f723c ? (?)
2: 0x4ffff053793c ? (?)
3: 0x4ffff04952f8 ? (?)
4: 0x4ffff041a760 ? (?)
5: 0x4ffff0417770 ? (?)
6: 0x4ffff0c54160 ? (?)
7: 0x4ffff0415f1c ? (?)
End of stack

Do you have any idea what could cause this?

Best regards.
Yvan Fournier
Posts: 4070
Joined: Mon Feb 20, 2012 3:25 pm

Re: Problem regarding submitting a job to a cluster

Post by Yvan Fournier »

Hello,

This might be due to some limited or restricted features of your filesystem; to my knowledge it has not been observed on any other machine.

Could you provide details on your system and build?

Regards,

Yvan
biodc172
Posts: 12
Joined: Sat May 07, 2022 2:59 am

Re: Problem regarding submitting a job to a cluster

Post by biodc172 »

Hi,

I am using gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC); I believe the compute nodes have the same setup.

I built code_saturne 7.1.1 with this configure command:
./configure --prefix=/home/software/code_saturne2/install-parallel CC=mpicc CXX=mpicxx FC=mpifort F9X=mpifort AR=sw9ar NM=swnm RANLIB=sw9ranlib CPP=cpp CXXCPP=cpp --host=alpha cross_compile=maybe --enable-mpi --disable-gui --disable-shared --enable-static --disable-dependency-tracking

I looked at the code, and the error is raised by a call to unlink(), but I did not notice anything on the system side that would explain it.
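
To narrow it down, a minimal test run on a compute node (my own sketch, not code_saturne code) could exercise unlink() and rmdir() directly; "Destination address required" is the message associated with errno EDESTADDRREQ, which is an odd errno to get from file removal:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>     /* unlink(), rmdir() */
#include <sys/stat.h>   /* mkdir() */

int main(void)
{
  /* Create and remove a plain file. */
  FILE *f = fopen("probe_file", "w");
  if (f != NULL) fclose(f);
  if (unlink("probe_file") != 0)
    printf("unlink failed: %s\n", strerror(errno));
  else
    printf("unlink ok\n");

  /* Create and remove an empty directory. */
  if (mkdir("probe_dir", 0755) != 0)
    printf("mkdir failed: %s\n", strerror(errno));
  else if (rmdir("probe_dir") != 0)
    printf("rmdir failed: %s\n", strerror(errno));
  else
    printf("rmdir ok\n");

  return 0;
}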

Best regards.

Yvan Fournier wrote: Thu Jul 28, 2022 10:36 am Hello,

This might be due to some limited or restricted features of your filesystem; to my knowledge it has not been observed on any other machine.

Could you provide details on your system and build?

Regards,

Yvan
biodc172
Posts: 12
Joined: Sat May 07, 2022 2:59 am

Re: Problem regarding submitting a job to a cluster

Post by biodc172 »

Yvan Fournier wrote: Thu Jul 28, 2022 10:36 am Hello,

This might be due to some limited or restricted features of your filesystem; to my knowledge it has not been observed on any other machine.

Could you provide details on your system and build?

Regards,

Yvan
I just found that the compute node does not have the rmdir command.
It works fine now.

Best regards. :D
Yvan Fournier
Posts: 4070
Joined: Mon Feb 20, 2012 3:25 pm

Re: Problem regarding submitting a job to a cluster

Post by Yvan Fournier »

Hello,

Did you need to modify the code, or simply change some settings in your cluster environment?

Regards,

Yvan
biodc172
Posts: 12
Joined: Sat May 07, 2022 2:59 am

Re: Problem regarding submitting a job to a cluster

Post by biodc172 »

Yvan Fournier wrote: Sat Jul 30, 2022 2:23 pm Hello,

Did you need to modify the code, or simply change some settings in your cluster environment?

Regards,

Yvan
Hi,

Somehow it does not work anymore. I found that the error occurs when the code tries to delete the folder
/case1/RESU/20220801-1544/checkpoint/previous_dump_0000
and this folder contains two files (main.csc and auxiliary.csc).

I printed the path variable and found that cs_solver only tries to unlink main.csc and then rmdir the previous_dump_0000 folder while auxiliary.csc still exists in it, which leads to the "Destination address required" error.
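
If that is what happens, I imagine the cleanup would need to empty the directory before removing it. A rough sketch of the idea (my own guess, not the actual code_saturne source):

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <unistd.h>

/* Remove every entry in 'path' (assumed to contain only plain files,
   such as main.csc and auxiliary.csc), then remove the directory. */
static int
remove_checkpoint_dir(const char *path)
{
  DIR *d = opendir(path);
  if (d == NULL)
    return -1;

  struct dirent *e;
  while ((e = readdir(d)) != NULL) {
    if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
      continue;
    char buf[1024];
    snprintf(buf, sizeof(buf), "%s/%s", path, e->d_name);
    unlink(buf);
  }
  closedir(d);

  return rmdir(path);  /* the directory should now be empty */
}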

I am not sure whether that is the right fix for code_saturne itself, or why it worked last week.

Best regards.
Yvan Fournier
Posts: 4070
Joined: Mon Feb 20, 2012 3:25 pm

Re: Problem regarding submitting a job to a cluster

Post by Yvan Fournier »

Hello,

I'll check the code. We have never encountered this issue, but I wonder whether it might be due to latency/delays in the filesystem operation leading to a directory not being empty yet. Not sure, just guessing, but I'll see if I can make the code more robust there.

In the meantime, as a workaround, you can use the advanced checkpoint/restart settings in the GUI so as to have only one checkpoint at the end of the computation. This should avoid going through the path that causes the error.

Best regards,

Yvan
biodc172
Posts: 12
Joined: Sat May 07, 2022 2:59 am

Re: Problem regarding submitting a job to a cluster

Post by biodc172 »

Yvan Fournier wrote: Mon Aug 01, 2022 5:53 pm Hello,

I'll check the code. We have never encountered this issue, but I wonder whether it might be due to latency/delays in the filesystem operation leading to a directory not being empty yet. Not sure, just guessing, but I'll see if I can make the code more robust there.

In the meantime, as a workaround, you can use the advanced checkpoint/restart settings in the GUI so as to have only one checkpoint at the end of the computation. This should avoid going through the path that causes the error.

Best regards,

Yvan
Hi,

It is probably not caused by the filesystem. cs_file_remove() is called in cs_restart_clean_multiwriters_history() by the following code:

int n_files_to_remove
  = mw->n_prev_files - _n_restart_directories_to_write + 1;
for (int ii = 0; ii < n_files_to_remove; ii++) {
  /* ... remove file ii ... */
}

and in my case:

mw->n_prev_files = 1
_n_restart_directories_to_write = 1

so n_files_to_remove = 1 - 1 + 1 = 1, which means only one file removal is attempted before the directory is removed. I'll try the checkpoint/restart settings in the GUI and see if that works.

Best regards.