Restart a simulation without checkpoint directory

Posted: Thu Sep 22, 2016 4:17 pm
by Oscar
Hello,

I am running CS v4.0.5.

I want to restart my simulation from the last saved time step, as I am running on an HPC with time limits. Let's assume the run I want to resume from is called "init_01"; I would then do something like this in DATA/cs_user_scripts.py:

Code:

    if domain.param is None:
        domain.mesh_input = "RESU/init_01/mesh_input"
        domain.partition_input = None
        domain.restart_input = "RESU/init_01/checkpoint"
However I have no directory called "checkpoint" in the init_01 directory. Presumably I have forgotten to specify that I want it somewhere in the source code...
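
A guard against this situation can be sketched as follows. It is only an illustration, not part of Code_Saturne: the helper name `find_restart_dir` is hypothetical, and the `<RESU>/<run>/checkpoint` layout is assumed from the snippet above.

```python
import os


def find_restart_dir(resu_dir, run_id):
    """Return the checkpoint path of a run, or None if it is missing or empty.

    Hypothetical helper; assumes the layout <resu_dir>/<run_id>/checkpoint.
    """
    path = os.path.join(resu_dir, run_id, "checkpoint")
    # An existing but empty directory is treated like a missing one.
    if os.path.isdir(path) and os.listdir(path):
        return path
    return None
```

Calling `find_restart_dir("RESU", "init_01")` before assigning `domain.restart_input` makes a missing checkpoint visible when the run is prepared rather than when it starts. Whether `None` is an acceptable value for `domain.restart_input` in a given version should be checked against that version's script examples.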

Is there a way to continue the simulation from the last time step that is contained in my init_01 despite this, so that I don't have to start over? What is the best practice for ensuring that you have a restart point in a simulation? I know that I could for instance save the following to a file called control_file in the result directory during runtime, but this seems a bit tedious and may be easy to forget to do...

Code:

checkpoint_wall_time_interval <wall time interval>
Edit: I forgot to mention that I also know that setting ntsuit > 0 in cs_user_parameters.f90 is the way to save checkpoint files; however, I clearly forgot to do this...
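
To make the control_file step harder to forget, it could be scripted. The following is a minimal sketch, assuming the file is named `control_file` and holds the single `checkpoint_wall_time_interval` line shown above. The function name `write_control_file` is hypothetical, and the units of the interval are not stated in this thread, so the value is written through unchanged.

```python
import os


def write_control_file(run_dir, wall_time_interval):
    """Write a control_file requesting checkpoints at a wall-time interval.

    Hypothetical helper; run_dir is the directory of the running computation.
    The interval is written as given (check its units in the documentation).
    """
    path = os.path.join(run_dir, "control_file")
    with open(path, "w") as f:
        f.write("checkpoint_wall_time_interval {}\n".format(wall_time_interval))
    return path
```

A job script could call this right after the computation starts, so the request does not depend on remembering to do it by hand.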

Re: Restart a simulation without checkpoint directory

Posted: Fri Sep 23, 2016 1:24 am
by Yvan Fournier
Hello,

This is strange. Checkpoints are enabled by default, though they may be missing if the computation was not interrupted cleanly before creating one.

It is simpler to define checkpoint options using the GUI than with user subroutines.

Regards,

Yvan

Re: Restart a simulation without checkpoint directory

Posted: Fri Sep 23, 2016 2:20 pm
by Oscar
Even when I run with ntsuit=1 (which should save a checkpoint at each time step), no checkpoint directory is saved in my RESU. Is there something else I need to do to ensure checkpointing is happening? I cannot use the GUI in my case. Please find attached my listing and source files for this case. Can you see from these where the problem might be?

Re: Restart a simulation without checkpoint directory

Posted: Fri Sep 23, 2016 3:58 pm
by Yvan Fournier
Hello,

The value of ntsuit might be recomputed internally in a complex way.

Did you try with a "clean" stop (such as ntmabs = 10)?

Why can't you use the GUI? It should be installable on most machines (and libxml2, which is needed on the reader side, on all machines).

Regards,

Yvan

Re: Restart a simulation without checkpoint directory

Posted: Fri Sep 23, 2016 4:15 pm
by Oscar
Hi Yvan,

Thanks for your response. I just tried with ntmabs=10 and it is true that there is now a checkpoint file once the calculation finishes.

However, ntmabs is the total desired number of time steps in my simulation, and since it is large I will need to perform restarts. Having checkpoints in between is essential because I want to save close to the time step I am at when I get kicked off the cluster. I thought ntsuit would control this save interval?

I cannot run the GUI because it is not installed on the HPC I am using. I also prefer the terminal as I am more accustomed to it.

Do you have any suggestions for what I can do to solve the ntsuit issue?

Kind regards,

Oscar

Re: Restart a simulation without checkpoint directory

Posted: Sat Sep 24, 2016 11:30 am
by Yvan Fournier
Hello,

Did you check the documentation for ntsuit? If I remember correctly, it might be the number of checkpoints (4 by default).

You can force a checkpoint at any time using the control_file, or use the control_file to set a restart interval in elapsed (user, not simulation) time, which aligns better with batch systems.

But in practice, we always try to set ntmabs so as to finish in the allocated time, then restart with an increased ntmabs (a user scripts example allows you to automate this).
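
The arithmetic behind choosing such an ntmabs can be sketched as below. This is not the user scripts example referred to above. The function `steps_for_budget`, its parameters, and the safety margin are hypothetical, and the time per step has to be measured from an earlier run.

```python
import math


def steps_for_budget(done_steps, seconds_per_step, wall_budget_s, safety=0.9):
    """Estimate an ntmabs value reachable within a wall-time budget.

    done_steps: time steps already completed by previous runs.
    seconds_per_step: measured wall time per time step (assumed constant).
    wall_budget_s: wall time allocated to the next job, in seconds.
    safety: fraction of the budget to use, leaving room for the final checkpoint.
    """
    if seconds_per_step <= 0:
        raise ValueError("seconds_per_step must be positive")
    extra_steps = math.floor(safety * wall_budget_s / seconds_per_step)
    return done_steps + extra_steps
```

For each restart, the returned value would be used as the new ntmabs, so the run stops cleanly and writes its final checkpoint before the batch limit is reached.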

Regards,

Yvan

Re: Restart a simulation without checkpoint directory

Posted: Sat Sep 24, 2016 3:55 pm
by Oscar
Hi Yvan,

Yes, I have checked the docs for ntsuit: if it is set to > 0, it is the checkpoint period, so I would have expected a checkpoint at every time step when I set it to 1.

I guess I will do as you suggest, run a simulation and then increase the ntmabs for the next one!

Kind regards,

Oscar