When to checkpoint at the end of a fixed-length reservation?

This work considers an application executing for a fixed duration, namely the length of the reservation that it has been granted. The checkpoint duration is a stochastic random variable that obeys some well-known probability distribution law. The question is when to take a checkpoint towards the end of the execution, so that the expectation of the work done is maximized. We address two scenarios. In the first scenario, a checkpoint can be taken at any time; despite its simplicity, this natural problem has not been considered yet (to the best of our knowledge). We provide the optimal solution for a variety of probability distribution laws modeling checkpoint duration. The second scenario is more involved: the application is a linear workflow consisting of a chain of tasks with IID stochastic execution times, and a checkpoint can be taken only at the end of a task. First, we introduce a static strategy where we compute the optimal number of tasks before the application checkpoints at the beginning of the execution. Then, we design a dynamic strategy that decides whether to checkpoint or to continue executing at the end of each task. We instantiate this second scenario with several examples of probability distribution laws for task durations.


INTRODUCTION
Scheduling a job onto a computing platform typically involves making a series of reservations of the required resources. Long-running applications, or applications whose total run-time is hard to predict, usually split their reservation into multiple smaller reservations and use checkpoint-restart [12,23] to save intermediate steps of computation. There are multiple advantages to this approach, but the main one is that it lowers the wait-time of the application, as the job scheduler can more easily place a smaller reservation. On some platforms, a maximum reservation time is imposed on applications, forcing applications that run longer than this maximum time to split their reservation and rely on a form of checkpoint-restart. These scenarios occur in large-scale High Performance Computing (HPC) platforms as well as on the Cloud.
* Also with: University of Tennessee, Knoxville, TN, USA.
For each actual reservation, the job needs to be checkpointed before the reservation time has elapsed, otherwise the progress of the execution during the reservation will be lost.
This work focuses on an application executing for a fixed duration. The size of the application (sequential or parallel) is irrelevant; what matters is the volume of the data that needs to be saved before the end of the execution, or equivalently, the time needed to checkpoint that data. The (very natural) objective is to squeeze the most out of the reservation by executing as much work as possible before checkpointing. Obviously, in a perfect world, with a reservation of duration $R$ and a checkpoint of duration $C$, one should checkpoint exactly $C$ seconds before the end of the reservation, i.e., at time $R - C$ if the execution started at time 0.
While assuming perfect knowledge of the value of $R$ is quite reasonable (you know what you paid for), assuming perfect knowledge of the value of $C$ is more questionable. If the actual value of $C$ exceeds the planned one by, say, a few seconds, all the work executed during the reservation will be lost. In fact, it is very likely that the value of $C$ would vary from one execution to another; the range of variation of the possible values of $C$ would typically depend upon the application. What is the best strategy then? If we know a worst-case value $C_{\max}$ for $C$, should we always use it and checkpoint at time $R - C_{\max}$? Using $C_{\max}$ means taking no risk at all, but this pessimistic approach leads to wasting execution time whenever the actual value of $C$ is significantly smaller than $C_{\max}$.
A natural approach is to assume that a probability distribution law $D_C$ for the values of $C$ is known (instead of just an upper bound). The question then becomes determining the instant to checkpoint that maximizes the expected amount of work that will be saved before the end of the reservation. The probability distribution can be learned from traces of previous checkpoints. A main contribution of this work is to give the solution to this problem for an arbitrary distribution $D_C$, and to determine the solution for a variety of widely-used distributions whose support lies in an interval $[a, b]$, where $a = C_{\min}$ and $b = C_{\max}$ represent the extreme values that $C$ can take. Such distributions include Uniform($[a, b]$), the uniform law in the interval $[a, b]$, and Exponential or Normal laws truncated to $[a, b]$.
So far, we have considered that the execution of the application can be interrupted at any instant to take a checkpoint. This is a very strong hypothesis too. For instance, numerical iterative applications are composed of a set of iterations that are repeated until convergence is reached: checkpoints should be taken only at the end of an iteration, because the data footprint to be saved has a much smaller volume than when the checkpoint is taken in the middle of an iteration. Another example is that of linear workflows, which are composed of a linear chain of tasks. These tasks are black boxes that operate on inputs and deliver outputs; checkpoints can only be taken at the end of a task. A main contribution of this work is to deal with such applications. To complicate matters, the duration of the tasks themselves is likely to vary from one execution to the next, just as the duration of the checkpoint. Assuming that all task durations are Independent and Identically Distributed (IID) and obey the same probability distribution law $D_X$, and still using another probability distribution law $D_C$ for the duration of the checkpoint, we provide the optimal strategy to maximize the expectation of the amount of work executed during the reservation. This optimal strategy comes in two flavors: either we compute the best time to checkpoint statically, at the beginning of the execution, or we dynamically decide either to checkpoint or to continue at the end of each task, accounting for the actual duration of all previously executed tasks.
We point out that this work deals with checkpointing on failure-free platforms! On HPC platforms, checkpoint/restart is the de facto standard to mitigate the impact of fail-stop errors [12]. By nature, fail-stop errors strike at random instants. Here we take checkpoints to save the work at the end of the reservation, which we can interpret as a fail-stop error that will strike at a well-known and fully deterministic instant. Another difference is that reservations are used for all kinds of jobs, sequential or parallel, while checkpoint/restart is used only for very large jobs executing on very large platforms. Hence this work has a wider potential impact than large-scale HPC platforms.
Altogether, the major contributions of this work are the following:
• When the (preemptible) application allows for checkpointing at any time-step: assuming that checkpoint times obey a probability distribution law $D_C$, we compute the optimal time to checkpoint, in order to maximize the expected work done during the reservation.
• When the application consists of a linear chain of tasks and a checkpoint can be taken only at the end of a task: assuming IID stochastic task execution times that obey a probability distribution law $D_X$, and still assuming that checkpoint times obey a probability distribution law $D_C$, we compute the optimal number of tasks after which to checkpoint, in order to maximize the expected work done during the reservation. This optimal number of tasks is computed either statically at the beginning of the execution, or dynamically at the end of each task.
• For both scenarios, we provide several examples with a variety of probability distribution laws for checkpoint and task durations.
The rest of the paper is organized as follows. Section 2 reviews related work. We discuss applications that can checkpoint at any instant in Section 3 and stochastic linear workflows where checkpoints can only be taken at the end of a task in Section 4. Finally, Section 5 provides concluding remarks and directions for future work.

RELATED WORK
We survey related work in this section. First, we point out that most of the literature uses checkpoints to mitigate the impact of fail-stop errors that can strike during the execution of a large-scale parallel application. In such a context, the natural strategy is to checkpoint periodically, and the optimal checkpointing period is given by the Young/Daly formula [4,26]. In our framework, checkpointing is used only to save the application data at the end of the reservation. The application may well be sequential or moderately parallel. The execution is assumed to be safe while progressing during the reservation. In other words, the only catastrophic event is the end of the reservation, but this one is fully known in advance. What is not known is the duration of the final checkpoint at the end of the reservation.
The first part of this work deals with a fully preemptible application executing for $R$ seconds and where checkpoints can be taken at any instant. We assume that the checkpoint time obeys a probability distribution law $D_C$ and investigate what is the optimal instant to checkpoint in order to maximize the expectation of the amount of work executed during the reservation. This is a very natural and important problem because checkpointing at the end of a reservation is routinely used in many scientific fields as a way to save state [22,23]. However, to the best of our knowledge, this work is the first to investigate this problem.
The second part of this work deals with linear workflows made of identical tasks that are repeated until some criterion is met. This framework corresponds to iterative methods that are popular for solving large sparse linear systems, which have a wide range of applications in several scientific and industrial problems. There are many classic iterative methods including stationary iterative methods like the Jacobi method [19], the Gauss-Seidel method [19] and the Successive Overrelaxation method (SOR) [7,25], and nonstationary iterative methods like Krylov subspace methods, including Generalized Minimal Residual method (GMRES) [20], Biconjugate Gradient Stabilized method (BiCGSTAB) [10], Generalized Conjugate Residual method (GCR) [6], together with their ABFT (algorithm-based fault-tolerance) variants [1,15].
The class of iterative applications goes well beyond sparse linear solvers. Uncertainty Quantification (UQ) workflows explore a parameter space in an iterative fashion [16,18]. This class also encompasses much image and video processing software that applies a chain of computational kernels (each being a task) to a sequence of data sets (each corresponding to an iteration). Examples include image analysis [21], video processing [9], motion detection [14], signal processing [3,11], databases [2], molecular biology [17], medical imaging [8], and various scientific data analyses, including particle physics [5], earthquake [13], weather and environmental data analyses [17].
Iterative applications are the primary motivation for the second scenario of this work: we have an unknown number of tasks, whose number depends on the convergence rate. The total execution time is unknown, which calls for a series of fixed-length reservations of duration $R$, where $R$ depends upon many parameters provided both by the user (estimating the order of magnitude of the total execution time) and the resource provider (availability and cost of each reservation). Within each reservation, the execution progresses from one iteration to the next until a checkpoint is taken in the end. If the execution starts with a recovery, this amounts to working with a reservation whose length is $R$ minus the recovery length instead of $R$. Each iteration is a task whose length obeys the same probability distribution law. Another probability distribution law is used for the duration of the final checkpoint. To the best of our knowledge, this work is the first to investigate this important but challenging problem.

CHECKPOINTING AT ANY INSTANT
In this section, we assume that a checkpoint can be taken at any instant during the reservation. Section 3.1 details the framework and provides a general formulation for the expectation $\mathbb{E}(W(X))$ of the work done when checkpointing $X$ seconds before the end of the reservation, when assuming that checkpoint times obey a probability distribution law $D_C$. Section 3.2 shows how to optimize this expectation for several widely used probability laws $D_C$.

Framework
Starting the execution at time 0, we take a checkpoint at time $R - X$, where $0 \leq X \leq R$. The time to checkpoint $C$ is a random variable that obeys a probability distribution law $D_C$ with support $[a, b]$, where $0 < a < b \leq R$. In particular, we always have $a \leq C \leq b$. In fact, the lower bound $a$ of $C$ leads us to refine the range of $X$ as $a \leq X \leq R$: if $X < a$, there is simply not enough time left to checkpoint! We use this range $a \leq X \leq R$ throughout the paper.
How should we choose $X$ to maximize the expectation $\mathbb{E}(W(X))$ of the amount of work $W(X)$ that is saved when checkpointing at time $R - X$? The work saved by a checkpoint at time $R - X$ is $W(X) = R - X$ if $C \in [a, X]$ and 0 otherwise. Indeed, we save $R - X$ if $C \leq X$ and nothing otherwise. This confirms that choosing $X > b$ will never be optimal, but it may well be the case that the optimum is reached for $X < b$.
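Let $Y$ be a random variable with cumulative distribution function (CDF) $F$ and probability density function (PDF) $f$, with possibly an infinite support. The law $D_C$ of $C$ is defined as the law of $Y$ truncated within $[a, b]$. Then we have:
$$P(C \leq x) = \begin{cases} 0 & \text{if } x < a \\ \dfrac{P(a \leq Y \leq x)}{P(a \leq Y \leq b)} & \text{if } a \leq x \leq b \\ 1 & \text{otherwise} \end{cases}$$
This gives the CDF $F_C$ of $C$. Rewriting it as
$$F_C(x) = \frac{F(x) - F(a)}{F(b) - F(a)} \quad \text{for } a \leq x \leq b,$$
we can now derive the expectation of the work saved when checkpointing at time $R - X$:
$$\mathbb{E}(W(X)) = (R - X)\, F_C(X) = \begin{cases} (R - X)\, \dfrac{F(X) - F(a)}{F(b) - F(a)} & \text{if } a \leq X \leq b \\ R - X & \text{if } b \leq X \leq R \end{cases} \qquad (1)$$
In Section 3.2, we use Equation (1) to find the optimal value of $X$ for various probability distribution laws.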
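As an illustration of Equation (1), the following minimal Python sketch evaluates the expected saved work for an arbitrary base law truncated to $[a, b]$ and locates the best lead time by a plain grid search. The function names, the grid resolution, and the numerical instance (an Exponential base law with rate 0.5, $R = 10$, $a = 1$, $b = 7.5$) are assumptions made for this example only; they do not come from the paper.

```python
import math

def truncated_cdf(base_cdf, a, b):
    """Return F_C, the CDF of the base law truncated to [a, b]."""
    Fa, Fb = base_cdf(a), base_cdf(b)

    def F_C(x):
        if x < a:
            return 0.0
        if x >= b:
            return 1.0
        return (base_cdf(x) - Fa) / (Fb - Fa)

    return F_C

def expected_work(X, R, F_C):
    """Equation (1): R - X is saved only if the checkpoint lasts at most X."""
    return (R - X) * F_C(X)

def best_lead_time(R, a, b, base_cdf, steps=100_000):
    """Grid search over X in [a, b]; X > b is never better than X = b."""
    F_C = truncated_cdf(base_cdf, a, b)
    candidates = (a + (b - a) * i / steps for i in range(steps + 1))
    X_opt = max(candidates, key=lambda X: expected_work(X, R, F_C))
    return X_opt, expected_work(X_opt, R, F_C)

# Hypothetical instance: Exponential base law with rate 0.5, truncated to [1, 7.5].
X_opt, E_opt = best_lead_time(R=10.0, a=1.0, b=7.5,
                              base_cdf=lambda x: 1.0 - math.exp(-0.5 * x))
```

A grid search is deliberately naive: it makes no assumption on the shape of the expectation, at the price of a resolution limited by `steps`.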

Solution for several probability distribution laws
3.2.1 Uniform law. For a uniform law in $[a, b]$, there is no need for truncating, and we directly have the PDF and CDF as
$$f_C(x) = \frac{1}{b - a} \quad \text{and} \quad F_C(x) = \frac{x - a}{b - a} \quad \text{for } x \in [a, b].$$
The expectation of the work saved when checkpointing at time $R - X$ is:
$$\mathbb{E}(W(X)) = \begin{cases} \dfrac{(R - X)(X - a)}{b - a} & \text{if } a \leq X \leq b \\ R - X & \text{if } b \leq X \leq R \end{cases} \qquad (2)$$
The trinomial $X \mapsto (R - X)(X - a)$ is maximum for $X = \frac{R + a}{2}$. This is the optimal value $X_{opt}$ of $X$ if $\frac{R + a}{2} < b$; otherwise the maximum of the trinomial is obtained for some $X$ larger than $b$, and then $b$ is optimal in the interval $[a, b]$. Altogether, $X_{opt} = \min\left(\frac{R + a}{2}, b\right)$.
Figure 1 provides an example of each case (optimal reached before $b$ or at $b$). Recall that the range of $X$ is $[a, R]$. When there remains $X = a$ seconds before the end of the reservation, the checkpoint will fail almost surely, and the expectation of the work saved is $\mathbb{E}(W(a)) = 0$. Similarly, if we checkpoint at the very beginning of the reservation, i.e., $X = R$, no work is executed and $\mathbb{E}(W(R)) = 0$. In between, the expectation $\mathbb{E}(W(X))$ of the work done obeys Equation (2). In particular, it decreases linearly from $X = b$ to $X = R$. In Figure 1(a), the maximum of $\mathbb{E}(W(X))$ is reached for $X_{opt} = \frac{R + a}{2} = 5.5$, with $\mathbb{E}(W(X_{opt})) \approx 3.1$; the pessimistic approach would use $X = C_{\max} = b$ and get $\mathbb{E}(W(b)) = 2.5$, reaching only 80% of the optimal work amount on average. On the contrary, in Figure 1(b), the pessimistic approach is optimal since $X_{opt} = b$. The main take-away is that deciding to checkpoint with $X = b$, hence preparing for the worst-case checkpoint duration, is not always a good strategy.
Exponential law.
As for the Uniform law, the main take-away is that preparing for the worst-case checkpoint duration and choosing $X = b$ is not always a good strategy. Contrary to the Uniform law, the Exponential law requires computing a complicated value for $X_{opt}$, but this can be done easily with available tools like [24].
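The contrast between the two laws can be sketched in a few lines of Python: a closed form for the Uniform law, and a numerical root search for the truncated Exponential law. The values $R = 10$, $a = 1$, $b = 7.5$ are hypothetical; they were picked because they reproduce the numbers quoted for Figure 1(a), but the text does not state the actual figure parameters. The bisection on the derivative is a choice made for this sketch (it relies on the Exponential expectation being concave on $[a, b]$) and is not necessarily the tool referred to in [24]; the rate 0.5 is likewise an assumption.

```python
import math

def uniform_optimum(R, a, b):
    """Closed form for Uniform([a, b]): X_opt = min((R + a) / 2, b)."""
    X = min((R + a) / 2.0, b)
    return X, (R - X) * (X - a) / (b - a)

def exponential_optimum(R, a, b, rate, tol=1e-10):
    """Bisection on g'(X) for an Exponential(rate) law truncated to [a, b]."""
    def g_prime(X):
        # g(X) = (R - X) * (exp(-rate*a) - exp(-rate*X)) is proportional to E(W(X)).
        return ((R - X) * rate * math.exp(-rate * X)
                - (math.exp(-rate * a) - math.exp(-rate * X)))

    if g_prime(b) >= 0.0:
        X = b                      # still increasing at b: the pessimistic choice is optimal
    else:
        lo, hi = a, b              # g'(a) > 0 > g'(b): the zero lies in between
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if g_prime(mid) > 0.0:
                lo = mid
            else:
                hi = mid
        X = (lo + hi) / 2.0
    norm = math.exp(-rate * a) - math.exp(-rate * b)
    return X, (R - X) * (math.exp(-rate * a) - math.exp(-rate * X)) / norm

X_u, E_u = uniform_optimum(R=10.0, a=1.0, b=7.5)
print(X_u, round(E_u, 2), round((10.0 - 7.5) / E_u, 2))  # prints: 5.5 3.12 0.8
X_e, E_e = exponential_optimum(R=10.0, a=1.0, b=7.5, rate=0.5)
```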
Normal law.
Writing $g(X)$ for the expectation of the work saved as a function of $X$ on $[a, b]$, the second derivative $g''$ has two zeros $X_1$ and $X_2$, and we see that $X_1 < \mu < R < X_2$, where $\mu$ is the mean of the untruncated Normal law. The function $g$ is thus concave on $[X_1, X_2]$ and convex elsewhere. In particular, when $a \geq X_1$, $g$ is concave on $[a, b]$; hence the zero of $g'$ is a maximum of $g$. We do not have an explicit formula for this zero but we can evaluate it numerically. Figure 3 provides an example of each case (optimal reached before $b$ or at $b$). The main take-away for the Normal law is the same as for the Exponential law.
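The concavity argument can be retraced under one assumption that the surviving text does not confirm: that $g$ denotes the expected work up to its positive normalizing constant, i.e., $g(X) = (R - X)(F(X) - F(a))$ on $[a, b]$, where $F$ and $f$ are the CDF and PDF of the untruncated Normal law of mean $\mu$ and standard deviation $\sigma$, with $\mu < R$. Under that assumption, a sketch of the derivation reads:

```latex
\begin{align*}
g(X)   &= (R - X)\,\bigl(F(X) - F(a)\bigr) \\
g'(X)  &= (R - X)\, f(X) - \bigl(F(X) - F(a)\bigr) \\
g''(X) &= (R - X)\, f'(X) - 2 f(X)
        = \frac{f(X)}{\sigma^2}\,\Bigl[(X - R)(X - \mu) - 2\sigma^2\Bigr]
        \quad \text{since } f'(X) = -\frac{X - \mu}{\sigma^2}\, f(X) \\
X_{1,2} &= \frac{(R + \mu) \mp \sqrt{(R - \mu)^2 + 8\sigma^2}}{2}
\end{align*}
```

The bracket is a convex quadratic in $X$ that equals $-2\sigma^2 < 0$ at both $X = \mu$ and $X = R$, which is why both values lie strictly between the two roots and why $g'' < 0$ on $[X_1, X_2]$. Moreover, since $g'(a) = (R - a) f(a) > 0$ and $g'$ can only increase on a convex part $[a, X_1]$, $g'$ changes sign at most once on $[a, b]$; a simple bisection on $g'$ therefore appears sufficient to locate the optimum numerically, with $X_{opt} = b$ when $g'(b) \geq 0$.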

LogNormal law.
Let $F$ and $f$ be the CDF and PDF of a LogNormal law of parameters $\mu$ and $\sigma$: we have $F(x) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{\ln x - \mu}{\sigma\sqrt{2}}\right)\right)$ and $f(x) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$. Recall that the mean $\mu^*$ and standard deviation $\sigma^*$ of this law are such that $\mu^* = e^{\mu + \sigma^2/2}$ and $\sigma^* = \mu^* \sqrt{e^{\sigma^2} - 1}$. We assume that $C$ obeys the LogNormal law with suitably chosen parameters $\mu$ and $\sigma$, truncated to $[a, b]$. The determination of the maximum of the expectation of the work done is similar to what we have done for a truncated Normal law, therefore we do not detail the derivations. Just as for the truncated Normal law, the maximum can be obtained either for $X_{opt} < b$ or for $X_{opt} = b$. Figure 4 provides an example of each case (optimal reached before $b$ or at $b$). The main take-away for the LogNormal law is the same as for the Exponential and Normal laws.
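When the LogNormal parameters have to be chosen from a target mean and standard deviation, the two moment relations can be inverted in closed form. The sketch below shows this standard inversion; the function names and the target values (mean 5, standard deviation 0.4) are hypothetical, and the moments are those of the untruncated law, so truncating to $[a, b]$ will shift them somewhat. How the paper itself selects the parameters is not recoverable from the text.

```python
import math

def lognormal_moments(mu, sigma):
    """Mean and standard deviation of an untruncated LogNormal(mu, sigma)."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    std = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, std

def lognormal_parameters(mean, std):
    """Invert the moment relations: find (mu, sigma) giving the requested moments."""
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)

# Hypothetical target: checkpoints lasting 5 time units on average, with deviation 0.4.
mu, sigma = lognormal_parameters(mean=5.0, std=0.4)
```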

STOCHASTIC LINEAR WORKFLOWS

Framework
This section addresses a much more challenging problem than the one of Section 3. Now the application consists of a linear chain of tasks. Checkpoints cannot be taken at any instant during the execution but instead must be taken at the end of a task. The objective is to execute as many tasks as possible and to checkpoint successfully before the end of the reservation. We further assume that task execution times are not fully deterministic and can vary from one execution to the next. In the most general setting, we would have a chain where each task $T_i$ is characterized by two probability distributions: $D_X^{(i)}$ to model the execution time of the task, and $D_C^{(i)}$ to model the time to checkpoint at the end of the task. The only requirement is that all the $D_X^{(i)}$ and $D_C^{(i)}$ distributions are independent. We note that if task execution times are deterministic instead of stochastic (in other words, if $D_X^{(i)}$ is constant for all $i$), the problem can be solved using the same approach as in Section 3. Obviously, it is much more realistic to assume that task execution times can change, even moderately, from one execution to another; but this assumption dramatically complicates the problem.
In this work, we restrict to a simpler yet challenging instance of the problem: we assume that the probability distributions are the same for all tasks. More precisely, we assume that the $D_X^{(i)}$ are independent and identically distributed (IID) and obey the same distribution $D_X$; similarly, the $D_C^{(i)}$ are IID and obey the same distribution $D_C$. As mentioned in Section 2, this problem instance with IID stochastic tasks perfectly models the behavior of large-scale numerical iterative solvers for sparse linear systems of equations.
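Because of the difficulty of the problem, we make further technical assumptions. The key argument in the solution of the static strategy described in Section 4.2 is to restrict to distributions $D_X$ such that the sum of $n$ IID random variables $X_i$ obeying $D_X$ will obey a well-known probability distribution, typically of the same type as $D_X$ but with scaled parameters. Recall that each $X_i$ represents the execution time of task $T_i$, so that $S_n = \sum_{i=1}^{n} X_i$ represents the execution time of the first $n$ tasks: this is why we need $S_n$ to obey some well-known probability distribution. We investigate three cases below that match this restriction: when $D_X$ is a Normal law, a Gamma law, or a Poisson law (this last one requires discretization of time). Finally, for simplicity of the derivations, we assume that each distribution $D_C^{(i)}$ obeys the same Normal law $D_C \sim N(\mu_C, \sigma_C^2)$ truncated to positive values (support $[0, \infty)$). It is easy to extend the approach to different distributions $D_C^{(i)}$ of arbitrary types: simply compute the expectation of the work done after $n$ iterations for each value of $n$, using the approach below, and select the best value. To summarize, task execution times are IID and follow a distribution $D_X$ with positive support $[0, +\infty[$, checkpoint times are IID and follow a truncated Normal distribution $D_C \sim N_{[0,+\infty[}(\mu_C, \sigma_C^2)$, and all these distributions are independent.
We investigate two strategies. First we use a static approach: at the beginning of the execution, we compute the value $n_{opt}$ of the number of iterations that should be executed before taking a checkpoint in order to maximize the expectation of the work done. Then, we provide a dynamic strategy that accounts for the work actually done so far and decides at the end of each iteration whether it is better (in expectation) to checkpoint now or to perform another iteration and checkpoint only then.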

Static strategy
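The static strategy is applied before the beginning of the execution, and takes a checkpoint at the end of the same iteration number $n_{opt}$. The goal is to determine the value of $n_{opt}$ which maximizes the expectation $\mathbb{E}(n)$ of the work done when checkpointing after $n$ iterations. Assuming that $D_X$ has positive support $[0, \infty)$, $\mathbb{E}(n)$ can be expressed as follows (the work $x$ done by the $n$ tasks is saved only if the checkpoint fits in the remaining $R - x$ seconds):
$$\mathbb{E}(n) = \int_{0}^{R} x \, f_{S_n}(x) \left( \int_{0}^{R - x} f_C(c)\, dc \right) dx \qquad (3)$$
In Equation (3), $f_C$ is the PDF of $D_C$ and $f_{S_n}$ is the PDF of $S_n = \sum_{i=1}^{n} X_i$. As stated before, this expression is useful only when each random variable $S_n$ obeys some well-known distribution.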
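To make the static computation concrete, here is a minimal Python sketch that integrates the expectation numerically and picks the best task count. It assumes that the expected work has the form $\int_0^R x\, f_{S_n}(x)\, P(C \leq R - x)\, dx$, that task durations are Normal with a negligible probability of being negative (so that $S_n \sim N(n\mu, n\sigma^2)$ can be used without correction), and that checkpoint durations follow a Normal law truncated to $[0, +\infty[$. All names and all numerical values are hypothetical.

```python
from statistics import NormalDist

def make_checkpoint_cdf(mu_c, sigma_c):
    """CDF of N(mu_c, sigma_c^2) truncated to [0, +inf)."""
    base = NormalDist(mu_c, sigma_c)
    F0 = base.cdf(0.0)

    def F_C(x):
        return 0.0 if x <= 0.0 else (base.cdf(x) - F0) / (1.0 - F0)

    return F_C

def expected_work_static(n, R, mu, sigma, F_C, steps=4000):
    """Trapezoidal approximation of the integral, with S_n ~ N(n*mu, n*sigma^2)."""
    s_n = NormalDist(n * mu, sigma * n ** 0.5)
    h = R / steps
    total = 0.0
    for i in range(steps + 1):
        x = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * x * s_n.pdf(x) * F_C(R - x)
    return total * h

def best_task_count(R, mu, sigma, F_C):
    """Try every n up to the point where the mean of S_n exceeds R."""
    n_max = int(R / mu) + 1
    return max(range(1, n_max + 1),
               key=lambda n: expected_work_static(n, R, mu, sigma, F_C))

# Hypothetical instance: R = 30, tasks ~ N(3, 0.5^2), checkpoints ~ truncated N(5, 0.4^2).
F_C = make_checkpoint_cdf(mu_c=5.0, sigma_c=0.4)
n_opt = best_task_count(R=30.0, mu=3.0, sigma=0.5, F_C=F_C)
```

With these hypothetical values the sketch selects seven tasks: an eighth task would raise the work attempted but makes the final checkpoint too likely to overrun the reservation.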

Normal law.
In this section, we assume that task execution times obey a Normal law: $X_i \sim N(\mu, \sigma^2)$. Then $S_n$ is a Normal law too: $S_n \sim N(n\mu, n\sigma^2)$. A non-truncated Normal law to model task execution times is meaningful only if its mean is a large positive number and its standard deviation relatively small, so that the probability for $X_i$ to be negative remains very low, thereby ensuring the coherence of the model. But for correctness, we need to update Equation (3) to account for possible negative values.
We replace $n$ by a real variable $y \in ]0, +\infty[$ to get a continuous function $f$. If $f$ has a maximum $y_{opt}$, then the optimal value $n_{opt}$ will be either $n_{opt} = \lfloor y_{opt} \rfloor$ or $n_{opt} = \lceil y_{opt} \rceil$, whichever gives the larger value for $f$. We provide a numerical example in Figure 5. In this example, $f$ has a maximum $y_{opt} \approx 7.4$. We have $f(7) \approx 20.9$ and $f(8) \approx 17.6$, hence $n_{opt} = 7$.

Gamma law.
In this section, we assume that task execution times obey a Gamma law: $X_i \sim \mathrm{Gamma}(\alpha, \beta)$. We can use Equation (3) as such. Indeed, the sum $S_n$ of $n$ independent $X_i \sim \mathrm{Gamma}(\alpha, \beta)$ is $S_n \sim \mathrm{Gamma}(n\alpha, \beta)$. We replace $n$ by a real variable $y \in ]0, +\infty[$ to get a continuous function $g$. If $g$ has a maximum $y_{opt}$, then the optimal value $n_{opt}$ will be either $n_{opt} = \lfloor y_{opt} \rfloor$ or $n_{opt} = \lceil y_{opt} \rceil$, whichever gives the larger value for $g$. We provide a numerical example in Figure 6. In this example, $g$ has a maximum $y_{opt} \approx 11.8$, so that $n_{opt}$ is either 11 or 12.

Poisson law.
In this section, we consider a Poisson law. As Poisson($\lambda$) has for support the set $\mathbb{N}$ of nonnegative integers, we assume that task execution times are expressed in some discrete unit (e.g., seconds) and take only integer values. We assume that $R$ and the mean of the $X_i$ random variables are large compared to the time unit. We also assume w.l.o.g. that $R$ is an integer. Task execution times obey a Poisson law: $X_i \sim \mathrm{Poisson}(\lambda)$. Recall that Poisson($\lambda$) has for PDF $f(k) = e^{-\lambda} \frac{\lambda^k}{k!}$. The sum $S_n$ of $n$ independent $X_i \sim \mathrm{Poisson}(\lambda)$ is $S_n \sim \mathrm{Poisson}(n\lambda)$. We replace $n$ by a real variable $y \in ]0, +\infty[$ to get a continuous function $h$. If $h$ has a maximum $y_{opt}$, then the optimal value $n_{opt}$ will be either $n_{opt} = \lfloor y_{opt} \rfloor$ or $n_{opt} = \lceil y_{opt} \rceil$, whichever gives the larger value for $h$.
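For the Poisson case, time is discrete, so the expectation becomes a finite sum that can be evaluated exactly rather than approximated. The sketch below assumes the expected work after $n$ tasks is $\sum_{k=1}^{R} k \, P(S_n = k) \, P(C \leq R - k)$ with $S_n \sim \mathrm{Poisson}(n\lambda)$, and keeps a truncated Normal law for the checkpoint duration; it simply enumerates integer values of $n$ instead of maximizing the continuous extension. Names and numerical values are hypothetical.

```python
import math
from statistics import NormalDist

def poisson_pmf(k, mean):
    """P(S = k) for S ~ Poisson(mean), computed in log space to avoid overflow."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def make_checkpoint_cdf(mu_c, sigma_c):
    """CDF of N(mu_c, sigma_c^2) truncated to [0, +inf)."""
    base = NormalDist(mu_c, sigma_c)
    F0 = base.cdf(0.0)

    def F_C(x):
        return 0.0 if x <= 0 else (base.cdf(x) - F0) / (1.0 - F0)

    return F_C

def expected_work_static(n, R, lam, F_C):
    """Sum over k of k * P(S_n = k) * P(C <= R - k), with S_n ~ Poisson(n * lam)."""
    return sum(k * poisson_pmf(k, n * lam) * F_C(R - k) for k in range(1, R + 1))

def best_task_count(R, lam, F_C):
    """Enumerate n up to the point where the mean of S_n exceeds R."""
    n_max = int(R / lam) + 1
    return max(range(1, n_max + 1),
               key=lambda n: expected_work_static(n, R, lam, F_C))

# Hypothetical instance: R = 30 time units, tasks ~ Poisson(3), checkpoints ~ truncated N(5, 0.4^2).
F_C = make_checkpoint_cdf(mu_c=5.0, sigma_c=0.4)
n_opt = best_task_count(R=30, lam=3.0, F_C=F_C)
```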

Dynamic strategy
The static strategy does not account for the actual duration of the tasks during the beginning of the execution and always takes a checkpoint after the same (optimal) number of iterations. This approach is best suited to the scenario where the random variables $X_i$ follow a distribution $D_X$ with a small standard deviation. On the contrary, if $D_X$ has a large standard deviation, there is a risk of checkpointing much too early or much too late, depending upon the values that the $X_i$ have actually taken. In this section, we introduce a dynamic strategy: given the values of the previous $X_i$'s, we decide at the end of each task whether it is better to checkpoint now or to continue with (at least) one additional task. To this purpose, at the end of each task, we compare the expectation $\mathbb{E}(W_C)$ of the work done if we checkpoint now, and the expectation $\mathbb{E}(W_{+1})$ of the work done if we execute one more task before checkpointing. If $\mathbb{E}(W_C) \geq \mathbb{E}(W_{+1})$ we stop the execution and checkpoint; otherwise, we execute one more task and re-apply the algorithm at the end of that new task.
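The decision loop described above can be sketched as follows. The two expectations are passed in as callables so that the loop stays independent of the distribution laws; the function name, the return convention, and the toy expectation rules in the usage lines are all hypothetical and only serve to show the control flow.

```python
def run_dynamic(R, task_durations, expect_if_checkpoint, expect_if_continue):
    """Apply the stop-or-continue rule at the end of each task.

    Returns (number of completed tasks, work done when the checkpoint starts).
    The outcome of the checkpoint itself is not simulated here.
    """
    work = 0.0
    executed = 0
    for duration in task_durations:
        if work + duration >= R:
            return executed, 0.0      # reservation ended mid-task: nothing was saved
        work += duration
        executed += 1
        if expect_if_checkpoint(work) >= expect_if_continue(work):
            return executed, work     # checkpointing now is at least as good in expectation
    return executed, work             # no task left: checkpoint

# Toy rules (hypothetical): work is saved only if at most 25 time units have elapsed.
now = lambda w: w if w <= 25 else 0.0
one_more = lambda w: w + 3 if w + 3 <= 25 else 0.0
result = run_dynamic(30.0, [3.0] * 20, now, one_more)   # (8, 24.0)
```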
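For $n \in \mathbb{N}$, let $W_n$ be the work done after the first $n$ tasks. We have:
• If we checkpoint: $\mathbb{E}(W_C) = W_n \, P(C \leq R - W_n)$
• If we continue execution: $\mathbb{E}(W_{+1}) = \int_{0}^{R - W_n} (W_n + x) \, P(C \leq R - W_n - x) \, f_{X_{n+1}}(x) \, dx$
where $f_{X_{n+1}}$ is the PDF of the random variable $X_{n+1}$. Since $C \sim N_{[0,+\infty[}(\mu_C, \sigma_C^2)$, both expressions can be made explicit using, for $x \geq 0$,
$$P(C \leq x) = \frac{\Phi\left(\frac{x - \mu_C}{\sigma_C}\right) - \Phi\left(-\frac{\mu_C}{\sigma_C}\right)}{1 - \Phi\left(-\frac{\mu_C}{\sigma_C}\right)},$$
where $\Phi$ is the CDF of the standard Normal law.
Technically, the dynamic strategy provides more flexibility than the static one. Because we know the actual value $W_n$ of the work executed after $n$ tasks, we no longer need that $S_n = \sum_{i=1}^{n} X_i$ obeys some well-known probability distribution. In what follows, we instantiate the problem with $D_X$ being a truncated Normal law, a Gamma law or a Poisson law.

Normal law.
When $D_X$ is a truncated Normal law, we can directly compare $\mathbb{E}(W_C)$ and $\mathbb{E}(W_{+1})$. We provide a numerical example in Figure 8. In this example, the two graphs intersect at $W_{int} \approx 20.3$. When $W_n > W_{int}$, it is better to checkpoint right now than to execute another task, while it is the opposite for $W_n < W_{int}$.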

Gamma law.
We assume here that $D_X$ is a Gamma law: $X_i \sim \mathrm{Gamma}(\alpha, \beta)$. Using the PDF of this law for $f_{X_{n+1}}$, we can directly compare $\mathbb{E}(W_C)$ and $\mathbb{E}(W_{+1})$. We provide a numerical example in Figure 9. In this example, the two graphs intersect at $W_{int} \approx 6.4$. When $W_n > W_{int}$, it is better to checkpoint right now than to execute another task, while it is the opposite for $W_n < W_{int}$.

Poisson law.
In this section, similarly to Section 4.2.3, we consider that task execution times are expressed in some discrete unit (e.g., seconds) and take only integer values. We assume that $R$ and the mean of the $X_i$ random variables are large compared to the time unit. We also assume w.l.o.g. that $R$ is an integer. Task execution times obey a Poisson law: $X_i \sim \mathrm{Poisson}(\lambda)$. With the integral over task durations replaced by a sum over integer values, we can directly compare $\mathbb{E}(W_C)$ and $\mathbb{E}(W_{+1})$. We provide a numerical example in Figure 10. In this example, the two graphs intersect at $W_{int} \approx 18.9$. When $W_n > W_{int}$, it is better to checkpoint right now than to execute another task, while it is the opposite for $W_n < W_{int}$.
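The comparison for Poisson task durations can be sketched with exact finite sums. The code assumes $\mathbb{E}(W_C) = W_n \, P(C \leq R - W_n)$ and $\mathbb{E}(W_{+1}) = \sum_{k=0}^{R - W_n} (W_n + k) \, P(X_{n+1} = k) \, P(C \leq R - W_n - k)$, with checkpoint durations following a Normal law truncated to $[0, +\infty[$. Function names and numerical values are hypothetical, so the threshold they produce is unrelated to the one reported for Figure 10.

```python
import math
from statistics import NormalDist

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def make_checkpoint_cdf(mu_c, sigma_c):
    """CDF of N(mu_c, sigma_c^2) truncated to [0, +inf)."""
    base = NormalDist(mu_c, sigma_c)
    F0 = base.cdf(0.0)

    def F_C(x):
        return 0.0 if x <= 0 else (base.cdf(x) - F0) / (1.0 - F0)

    return F_C

def work_if_checkpoint(w, R, F_C):
    """Expected saved work when checkpointing right after w units of work."""
    return w * F_C(R - w)

def work_if_continue(w, R, lam, F_C):
    """Expected saved work when running exactly one more Poisson(lam) task first."""
    return sum((w + k) * poisson_pmf(k, lam) * F_C(R - w - k)
               for k in range(0, R - w + 1))

def should_checkpoint(w, R, lam, F_C):
    """Decision rule applied at the end of a task, w being an integer amount of work."""
    return work_if_checkpoint(w, R, F_C) >= work_if_continue(w, R, lam, F_C)

# Hypothetical instance: R = 30, tasks ~ Poisson(3), checkpoints ~ truncated N(5, 0.4^2).
F_C = make_checkpoint_cdf(mu_c=5.0, sigma_c=0.4)
decisions = {w: should_checkpoint(w, 30, 3.0, F_C) for w in (0, 10, 24)}
```

With these values the rule keeps executing at $W_n = 0$ and $W_n = 10$ and checkpoints at $W_n = 24$, which matches the threshold behavior described above.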

And after the checkpoint?
We conclude this section with a short discussion about using the time left in the reservation, if any, after a checkpoint has been successfully taken. Should we attempt to execute one or several new tasks and take a new checkpoint after these new tasks? Or should we drop the reservation? This question can be raised when there is enough time left in the reservation after a successful checkpoint. Of course there must remain at least $a = C_{\min}$ seconds, the minimum time to checkpoint. Such a scenario is indeed possible; it is more likely with the static approach, which determines when to checkpoint at the beginning of the execution and can therefore overestimate actual task execution times; but it can also happen with the dynamic strategy.
If we decide to continue the execution, we can always re-use both approaches, either static or dynamic, for the time left in the reservation. However, some HPC or cloud systems charge by time actually spent rather than by time reserved. In that case, it may be worth dropping the reservation and saving money on our account. Obviously, the decision involves many parameters, including the urgency of getting application results and the budget of the user!

CONCLUSION
This work has dealt with the problem of maximizing the expectation of the work that can be done during a fixed-length reservation. The key question is when to take a checkpoint at the end of the reservation. We have started with applications where a checkpoint can be taken at any time. For such applications, we have provided the optimal solution when checkpoint time can be modeled as a random variable obeying a probability distribution law $D_C$ with bounded support $[a, b]$. An important result was to assess the gain that can be achieved over the pessimistic (but risk-free) approach, which assumes the highest value $C_{\max} = b$ for the checkpoint duration $C$, using a variety of well-known probability distribution laws $D_C$.
Then, we have focused on the more involved problem where the application is a linear workflow consisting of a chain of tasks with IID stochastic execution times, and a checkpoint can be taken only at the end of a task. We have introduced a static strategy where we compute the optimal number of tasks before the checkpoint at the beginning of the execution. We have also designed a dynamic strategy, which decides whether to checkpoint or to continue execution at the end of each task. We have instantiated this second scenario with several examples of probability distribution laws for task durations. Obviously, the dynamic strategy is to be preferred whenever its use is possible, because it accounts for the actual execution times of all tasks that have been executed so far. But not all applications can be modified on the fly to insert a checkpoint, and the static strategy has a wider potential of applicability.
However, the static strategy requires all task execution times to be IID, while the dynamic strategy does not have this restriction. In fact, it would be easy to extend the dynamic strategy to deal with the general instance of the problem, as described in Section 4.1: in the general instance, each task $T_i$ is characterized by two probability distributions, $D_X^{(i)}$ and $D_C^{(i)}$, and the only requirement is that all these distributions are independent. However, extending the static strategy to find the optimal solution for the general case seems out of reach. Future work will be devoted to the design of efficient heuristics to solve this challenging problem.
This work has laid the foundations for the design of checkpoint strategies within a fixed-length reservation. Further work is needed to experimentally assess the gain provided by such strategies for real-life scientific applications. We expect this gain to be much higher for stochastic linear workflows than for fully preemptible applications: indeed, in the former case (workflows), the pessimistic, risk-free approach needs to account for (and add up) two worst-case durations, namely that of a task and that of a checkpoint; while in the latter case (preemptible applications), only the maximum duration of a checkpoint is required. An experimental campaign, either via simulations using traces or through actual application runs, is needed to quantify the effective gain for both application types.
Finally, as mentioned in the introduction, this work is not related to checkpointing on failure-prone platforms. Dealing with the occurrence of fail-stop errors within fixed-size reservations would be an interesting direction for future work.

Figure 1: Both cases for $X_{opt}$ with a Uniform law.

Figure 2: Both cases for $X_{opt}$ with an Exponential law.

Figure 3: Both cases for $X_{opt}$ with a Normal law.

Figure 4: Both cases for $X_{opt}$ with a LogNormal law.