Running on Cray System
Posted: Mon Dec 15, 2014 1:28 pm
by AndrewH
Hello,
I'm trying to install/run Code Saturne v3.3.2 on a Cray system that has a cross-compiling environment. I more or less successfully built Code Saturne for the compute nodes, but Code Saturne doesn't seem to recognize the allocated time it is given: the batch system has to terminate Code Saturne to stop it. I believe I filled out code_saturne.cfg properly (I attached my .cfg file). I have tried using a control_file in my run folder to specify the wall time, but it doesn't seem to help. Additionally, Code Saturne doesn't seem to recognize the script files that I placed in the src folder. In the error file, I get the following message:
SIGTERM signal (termination) received.
--> computation interrupted by environment.
Call stack:
1: 0x2aaaaaf47fd9 <+0x27bfd9> (libsaturne.so.0)
2: 0x2aaaaaf4d3e3 <cs_gradient_scalar+0xdb3> (libsaturne.so.0)
3: 0x2aaaaaf3a70f <cs_face_diffusion_potential+0x32f> (libsaturne.so.0)
4: 0x2aaaaae62b47 <resopv_+0x4893> (libsaturne.so.0)
5: 0x2aaaaae48143 <navstv_+0x4723> (libsaturne.so.0)
6: 0x2aaaaae74cda <tridim_+0x4036> (libsaturne.so.0)
7: 0x2aaaaad516be <caltri_+0x365e> (libsaturne.so.0)
8: 0x2aaaaad336ac <cs_run+0x43c> (libsaturne.so.0)
9: 0x2aaaaad330f7 <main+0x147> (libsaturne.so.0)
10: 0x2aaaae60ac16 <__libc_start_main+0xe6> (libc.so.6)
11: 0x400c49 <> (s_solver)
End of stack
To run Code Saturne, I used "code_saturne run --initialize --param=test.xml" in my DATA folder to generate a results folder. After creating the results folder, I set up my .pbs file with "cs_solver --param=test.xml". My job runs fine except for the problem described above. I also tried running my job from the DATA folder with the command "code_saturne run --param=test.xml", but I get an error that the RESU folder already exists.
Thank you,
Andrew
Re: Running on Cray System
Posted: Tue Dec 16, 2014 2:09 am
by Yvan Fournier
Hello,
You need to uncomment the "batch = PBS" line (remove the #) in code_saturne.cfg.
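For illustration, the relevant lines would look something like this (the surrounding comments and exact layout may differ between Code_Saturne versions, so treat this as a sketch of your own file rather than an exact copy):

```ini
# Batch system selection: uncomment the line matching your cluster
# (here PBS, as used on your system).
#batch = SLURM
batch = PBS
```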
Did you also test your control_file syntax on a workstation? Otherwise, I'll try to check with other users on Cray to see if they have similar issues with control_file handling. Are you using a Lustre file system, like most Crays I know of? The existence test for control_file in Code_Saturne is actually based on access() rather than stat(), following some older performance recommendations for Lustre, so I expect it to work, but things may have changed...
Regarding your user subroutines, do they work on a workstation? Could you post them?
Regards,
Yvan
Re: Running on Cray System
Posted: Tue Dec 16, 2014 12:00 pm
by AndrewH
Hello,
Yes, I realized that silly mistake soon after I posted my message yesterday. With the corrected code_saturne.cfg file, I can run my job directly from the DATA folder without any errors. It recognizes the specified wall time and script files. However, when cs_solver is called, the number of processors to use isn't passed on to the mpiexec command: my job requests 24 processors, but cs_solver only uses 1. In my job submission file, I'm using the command "code_saturne run --param=test.xml -n 24". Is there something I forgot to configure so that the number of processors I want is used when cs_solver is executed? According to the cluster's practice guide, a command of the form "aprun -n 24 ..." should be used to run a job in parallel.
If I execute cs_solver separately in my RESU folders, the updated code_saturne.cfg doesn't seem to affect it. I can run it in parallel with the command "aprun -n 24 cs_solver --param=test.xml", but it still doesn't recognize the wall time or the script files.
When I try to use the control_file, the listing file notes that the time has been adjusted, but it doesn't affect the running of cs_solver. To make sure I'm using the right syntax: if I write
max_time_step 1000
checkpoint_wall_time 600
in my control_file, the number of iterations should be limited to 1000 and cs_solver should only run for 10 minutes (I assume checkpoint_wall_time is given in seconds), correct?
Yes, the Cray system that I'm using has a parallel Lustre file system. My subroutines are simply modified versions of cs_user_extra_operations-global_efforts.f90 and cs_user_initialization.f90 that create a file to save my forces into. In case it proves helpful, I attached my listing, summary, and edited code_saturne.cfg files.
Thank you,
Andrew
Re: Running on Cray System
Posted: Tue Dec 16, 2014 7:53 pm
by Yvan Fournier
Hello,
At the bottom of the code_saturne.cfg file, you can redefine the mpiexec command to set it to aprun.
I'm in a bit of a hurry now, but I'll try to check the aprun syntax and give you the recommended settings.
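As a sketch, the override could look like the fragment below; the exact key names should be checked against the comments in your own code_saturne.cfg, so take these as assumptions rather than verified settings:

```ini
# Replace the default MPI launcher with Cray's aprun; the option
# used to pass the rank count may also need adjusting.
mpiexec = aprun
mpiexec_n = ' -n '
```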
I have not checked yet with other users on Cray for the control_file, but I'll try to remember to do it tomorrow.
Regards,
Yvan
Re: Running on Cray System
Posted: Wed Dec 17, 2014 11:30 am
by AndrewH
Hello,
I was able to fix the problem of cs_solver not reading my script files: I needed to use the cs_solver executable generated in the results folder instead of the default one from my build directory. But cs_solver still doesn't recognize the wall time. Does cs_solver read code_saturne.cfg when it is executed in a batch submission?
Regarding running Code Saturne with "code_saturne run --param=test.xml" in a submission file, does the batch submission file need to use the format "#PBS -l nodes=1:ppn=12" for the code to read the proper number of nodes/processors? Even though the batch system is PBS, the submission file is slightly different on this cluster: the number of nodes requested is written as "select=1".
Thank you,
Andrew
Re: Running on Cray System
Posted: Sun Dec 21, 2014 7:44 pm
by Yvan Fournier
Hello,
Yes, the solver reads code_saturne.cfg, but it only handles the wall time "correctly" on a few systems: on systems that set the processes' resource limits (as per the C function getrlimit), the code stops automatically before the job manager kills it. The last systems I had access to which did this were clusters at the CCRT using LSF, or a mix of LSF and SLURM, a few years ago.
Otherwise, the resource manager kills the job when the time is over. Actually, several managers, including SLURM, have a soft and a hard limit: for example, a first signal is sent 10% before the maximum job time, and the job is really killed only if it has not stopped by then. The difficulty is then getting the MPI launcher to propagate this signal without simply killing the job, so, so far, we do not handle that.
The batch submission can contain any syntax you like, but only some syntaxes are handled automatically, so the number of ranks may be guessed wrong (which is actually an issue only for coupled calculations). In the worst case, you can add --nprocs=<n_procs> to the "code_saturne run" command in the runcase to force the number of ranks used, in case the script fails to determine it correctly.
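For example, with the 24 ranks you mention, the command in the runcase would become something like:

```shell
# Force the number of MPI ranks explicitly instead of relying on
# detection from the batch directives:
code_saturne run --param=test.xml --nprocs=24
```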
Another solution is to add --initialize to the "code_saturne run" command in the runcase, in which case the computation will simply be set up in the execution directory, but not run. Run the runcase in interactive mode (i.e. no need to submit it), then go to the execution directory and edit the "run_solver" script to add your job parameters and correct the aprun syntax if necessary.
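The steps above can be sketched as follows (directory names other than RESU and run_solver are illustrative placeholders):

```shell
# 1. In the runcase, only set up the computation, do not run it:
code_saturne run --initialize --param=test.xml

# 2. Execute the runcase interactively (no batch submission), then
#    move to the execution directory created under RESU:
cd RESU/<run_id>

# 3. Edit the generated "run_solver" script: add your job parameters
#    and replace the launcher line with the correct aprun syntax, e.g.
#       aprun -n 24 ./cs_solver --param=test.xml

# 4. Submit the edited run_solver script to the batch system.
```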
Also, another user running on the EPCC Cray machine (ARCHER, an XC30) confirmed to me that the control_file works fine for them.
Regards,
Yvan