Online Scheduling of Moldable Task Graphs under Common Speedup Models

The problem of scheduling moldable tasks has been widely studied, in particular when tasks have dependencies (i.e., task graphs), or when tasks are released on-the-fly (i.e., online). However, few studies have focused on both (i.e., online scheduling of moldable task graphs). In this paper, we derive constant competitive ratios for this problem under several common yet realistic speedup models for the tasks (roofline, communication, Amdahl, and a combination of them). We also provide the first lower bound on the competitive ratio of any deterministic online algorithm for the arbitrary speedup model, which is not constant but depends on the number of tasks in the longest path of the graph.


INTRODUCTION
This work investigates the online scheduling of parallel task graphs, where each task in the graph is moldable. In the scheduling literature, a moldable task (or job) is a parallel task that can be executed on an arbitrary but fixed number of processors. The execution time of the task depends upon the number of processors chosen to execute it. This number of processors is chosen once and for all, when the task starts its execution, and cannot be modified later on during execution. This corresponds to a variable static resource allocation, as opposed to a fixed static allocation (rigid tasks) and to a variable dynamic allocation (malleable tasks) [7].
Moldable tasks offer a nice trade-off between rigid and malleable tasks: unlike rigid tasks, they easily adapt to the number of available resources, and unlike malleable tasks, they are easy to design and implement. This explains why many computational kernels in scientific libraries for numerical linear algebra and tensor computations are provided as moldable tasks that can be deployed on a wide range of processor numbers. We assume that the scheduling of each task is non-preemptive and without restarts [8], which is a highly desirable approach to avoid high overheads incurred by checkpointing partial results, context switching, and task migration.
Because of the importance and wide availability of moldable tasks, scheduling algorithms for such tasks have received considerable attention. The scheduling problem, whose objective is to minimize the overall completion time, or makespan, comes in many flavors:

Offline vs. online.
In the offline version of the problem, all tasks are known in advance, before the execution starts. The problem is NP-complete, and the goal is to derive lower bounds and approximation algorithms.
By contrast, in the online version of the problem, tasks are released on the fly, and the objective is to derive competitive ratios [17] for the performance of a scheduling algorithm against an optimal offline scheduler, which knows in advance all the tasks and their dependencies in the graph. The competitive ratio is established against all possible strategies devised by an adversary trying to force the online algorithm to take bad decisions.

Independent tasks vs. task graphs.
There are two versions of the online problem, with independent tasks or with task graphs. For the version with independent tasks, the tasks are released on the fly and the scheduler discovers their characteristics only upon release. For the version with task graphs, the whole graph is released at the start, but the scheduler discovers a new task and its characteristics only when all of its predecessors have completed execution. In other words, the shape of the graph and the nature of the tasks are not known in advance and are revealed only as the execution progresses.
In this work, we investigate the most difficult instance of the problem, namely, the online scheduling of moldable task graphs. Our main contribution resides in several new competitive ratios, which greatly depend upon the speedup model of the tasks. Several common yet realistic speedup models have been introduced and analyzed, including the roofline model, the communication model, Amdahl's model, and a general combination of them (see Section 3.1 for definitions). We provide a constant competitive ratio for each of these four models. In addition, we derive a new lower bound on the competitiveness of any deterministic online algorithm under the arbitrary speedup model. To the best of our knowledge, a competitive ratio was only known for task graphs under the roofline model [8], and we extend the result to several other speedup models.
The rest of this paper is organized as follows. Section 2 surveys related work. The formal model and problem statement are presented in Section 3. Section 4 is the heart of the paper: we introduce the new online algorithm and prove its competitive ratios for the different speedup models. Section 5 is devoted to the lower bound for arbitrary speedup models. Finally, Section 6 concludes the paper and provides hints for future directions.

RELATED WORK
Several prior studies have considered offline scheduling of independent moldable tasks, and derived approximation results. While some results depend on specific speedup models for the tasks, other results hold for the arbitrary model. Turek et al. [18] designed a 2-approximation list-based algorithm for the arbitrary model. Furthermore, when each task only admits a subset of all possible processor allocations, Jansen [10] presented a $(1.5 + \epsilon)$-approximation algorithm, which is tight since it was also shown that the problem cannot have an approximation ratio better than 1.5 unless P = NP [14]. For the monotonic model, where the execution time is non-increasing and the area (processor allocation times execution time) is non-decreasing with the number of processors, Jansen and Land [11] further proposed a polynomial-time approximation scheme (PTAS).
For online scheduling of independent moldable tasks that are released on-the-fly, Ye et al. [21] designed a 16.74-competitive algorithm. They also explained how to transform an algorithm for rigid tasks whose makespan is at most $\rho$ times the lower bound into a $4\rho$-competitive algorithm for moldable tasks. Further, some algorithms designed in the offline setting will also work online if they make scheduling decisions independently for each task; see for instance [6], [9], [15], which studied the communication model.
For offline scheduling of moldable tasks with dependencies, Wang and Cheng [19] showed that the earliest completion time algorithm is a $(3 - 2/P)$-approximation for the roofline model. For the monotonic model, Lepère et al. [16] proposed an algorithm with approximation ratio $3 + \sqrt{5}$, which was later improved to 4.73 by Jansen and Zhang [13]. Chen and Chu [5] further proposed improved approximations for a more restrictive model, where the area is a concave function and the execution time is strictly decreasing with the number of processors.
Feldmann et al. [8] designed an online algorithm for moldable tasks with dependencies, under the roofline model. By keeping the system utilization above a given bound and by carefully tuning this bound, their algorithm achieves 2.618-competitiveness, even when the task execution times and the DAG structure are unknown. Canon et al. [4] focused on hybrid platforms with several types of processors (for instance, CPUs and GPUs), and derived competitive ratios depending on the number of such resources, but they did not consider moldable tasks.
We have recently investigated the problem of scheduling independent moldable tasks subject to failures [3], where tasks need to be re-executed after a failure until a successful completion. This corresponds to a semi-online setting, since all tasks are known at the beginning, but failed tasks are only discovered on-the-fly. Although we do not consider task failures in this paper, but rather focus on the general online scheduling of moldable task graphs (as in [8]), the results can readily carry over to the failure scenario.
Table 1 summarizes the instances of different scheduling problems and the related papers under each instance.

PROBLEM STATEMENT
In this section, we formally present the online scheduling model and the objective function. We also show a simple lower bound on the optimal makespan, against which the performance of our online algorithms will be measured.

Model and Objective
We consider the online scheduling of a directed acyclic graph (DAG) of moldable tasks on a platform with $P$ identical processors. Let $G = (V, E)$ denote the task graph, where $V = \{1, 2, \dots, n\}$ represents a set of $n$ tasks and $E \subseteq V \times V$ represents a set of precedence constraints among the tasks. An edge $(i, j) \in E$ indicates that task $j$ depends on task $i$, and therefore it cannot be executed before task $i$ is completed. Task $i$ is called the predecessor of task $j$, and task $j$ is called the successor of task $i$.
The tasks are assumed to be moldable, meaning that the number of processors allocated to a task can be determined by the scheduling algorithm at launch time, but once the task has started executing, its processor allocation cannot be changed. The execution time $t_j(p_j)$ of a task $j$ is a function of the number $p_j$ of processors allocated to it, and we assume that the processor allocation must be an integer between 1 and $P$. In this paper, we focus on the following execution time function:

$$t_j(p_j) = \frac{w_j}{\min(p_j, \bar{p}_j)} + d_j + c_j (p_j - 1), \tag{1}$$

where $w_j$ denotes the total parallelizable work of the task, $\bar{p}_j$ denotes the maximum degree of parallelism of the task, $d_j$ denotes the sequential work of the task, and $c_j$ denotes the communication overhead when more than one processor is used. The execution time function in Equation (1) generalizes several speedup models commonly observed for parallel applications. In particular, it contains the following well-known models as special cases:
• Roofline Model [20] (with $d_j = 0$ and $c_j = 0$): This model assumes that the task has a linear speedup until a maximum degree of parallelism $\bar{p}_j \le P$.
• Communication Model [9] (with $\bar{p}_j \ge P$ and $d_j = 0$): This model assumes that the work of the task can be perfectly parallelized, but there is a communication overhead when more than one processor is allocated to the task, and that overhead increases linearly with the number of allocated processors.
• Amdahl's Model [1] (with $\bar{p}_j \ge P$ and $c_j = 0$): This model assumes that the task has a perfectly parallelizable fraction with work $w_j$ and an inherently sequential fraction with work $d_j$.
From the execution time function of the task $j$, we can further define the area of the task as a function of the processor allocation as follows: $a_j(p_j) = p_j \times t_j(p_j)$. Intuitively, the area represents the total amount of processor resources utilized over the entire period of task execution.
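As an illustration, the execution time function and the area can be written in a few lines of Python. This is only a sketch: the function names `exec_time` and `area`, the platform size, and the task parameters are hypothetical choices, not taken from the paper.

```python
def exec_time(p, w, p_bar, d=0.0, c=0.0):
    """Execution time t(p) = w / min(p, p_bar) + d + c * (p - 1)."""
    if p < 1:
        raise ValueError("at least one processor is required")
    return w / min(p, p_bar) + d + c * (p - 1)


def area(p, w, p_bar, d=0.0, c=0.0):
    """Area a(p) = p * t(p): processor-time consumed by the task."""
    return p * exec_time(p, w, p_bar, d, c)


P = 16  # hypothetical platform size

# The three special cases, on a hypothetical task with w = 48.
roofline = exec_time(8, w=48, p_bar=4)            # speedup capped at p_bar = 4
communication = exec_time(4, w=48, p_bar=P, c=1)  # 48/4 + 1 * (4 - 1)
amdahl = exec_time(4, w=48, p_bar=P, d=3)         # 48/4 + 3
print(roofline, communication, amdahl)
```

Setting `d` and `c` to zero recovers the roofline model, while setting `p_bar` to at least `P` recovers the communication model (with `d = 0`) or Amdahl's model (with `c = 0`).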
In this work, we consider the online scheduling model, where a task becomes available only when all of its predecessors have been completed. This represents a common scheduling model for dynamic task graphs [4], [8]. Furthermore, when a task $j$ is available, all of its execution time parameters (i.e., $w_j$, $\bar{p}_j$, $d_j$, $c_j$) also become known to the scheduling algorithm. The goal is to find a feasible schedule of the task graph that minimizes its overall completion time or makespan, denoted by $T$. The performance of an online scheduling algorithm is measured by its competitive ratio: the algorithm is said to be $c$-competitive if, for any task graph, its makespan $T$ is at most $c$ times the makespan $T_{\text{OPT}}$ produced by an optimal offline scheduler, i.e., $T \le c \times T_{\text{OPT}}$. Note that the optimal offline scheduler may know all the tasks and their speedup models, as well as all dependencies in the graph, in advance. The competitive ratio is established against all possible strategies by an adversary trying to force the online algorithm to take bad decisions.

Lower Bound on Optimal Makespan
Given the execution time function in Equation (1), let us define $s_j = \sqrt{w_j / c_j}$. We can then compute the maximum number of processors that should be allocated to the task as

$$p_j^{\max} = \min(P, \bar{p}_j, \tilde{p}_j), \tag{5}$$

where $\tilde{p}_j \in \{\lfloor s_j \rfloor, \lceil s_j \rceil\}$ is the integer allocation (at least 1) that minimizes $w_j / p + c_j (p - 1)$; when $c_j = 0$, there is no communication overhead and this term imposes no limit. Indeed, allocating more than $p_j^{\max}$ processors to the task will no longer decrease its execution time while only increasing its area. Thus, we can assume that the processor allocation of the task should never exceed $p_j^{\max}$ under any reasonable algorithm. In addition, we can observe that, when the processor allocation is in the range $[1, p_j^{\max}]$, the task satisfies the following monotonic property [16]:
• The execution time is a non-increasing function of the processor allocation, i.e., $t_j(p) \ge t_j(q)$ for all $1 \le p < q \le p_j^{\max}$;
• The area is a non-decreasing function of the processor allocation, i.e., $a_j(p) \le a_j(q)$ for all $1 \le p < q \le p_j^{\max}$.
Thus, the minimum execution time of the task is $t_j^{\min} = t_j(p_j^{\max})$ and the minimum area of the task is $a_j^{\min} = a_j(1)$. We note that the second point above also shows that the task cannot achieve superlinear speedup, i.e., $\frac{t_j(p)}{t_j(q)} \le \frac{q}{p}$ for all $1 \le p < q \le p_j^{\max}$.
We now define two quantities that can be used as a lower bound of the optimal makespan.
Definition 1. The minimum total area $A_{\min}$ of the task graph is the sum of the minimum area of all tasks in the graph, i.e., $A_{\min} = \sum_{j=1}^{n} a_j^{\min}$.
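The maximum useful allocation and the monotonic property can be checked numerically. The sketch below assumes that the communication-limited allocation is the better of the two integers around $\sqrt{w_j / c_j}$ (ties broken toward fewer processors); the helper name `max_useful_allocation` and the task parameters are hypothetical.

```python
import math


def exec_time(p, w, p_bar, d, c):
    """Execution time t(p) = w / min(p, p_bar) + d + c * (p - 1)."""
    return w / min(p, p_bar) + d + c * (p - 1)


def max_useful_allocation(w, p_bar, d, c, P):
    """Allocation beyond which the execution time no longer decreases."""
    if c > 0:
        s = math.sqrt(w / c)
        # Assumption: pick the better of floor(s) and ceil(s), at least 1.
        candidates = {max(1, math.floor(s)), max(1, math.ceil(s))}
        p_tilde = min(candidates, key=lambda p: (w / p + c * (p - 1), p))
    else:
        p_tilde = P  # no communication overhead, so no limit from this term
    return min(P, p_bar, p_tilde)


P = 16
task = dict(w=50.0, p_bar=16, d=1.0, c=2.0)  # hypothetical task
p_max = max_useful_allocation(P=P, **task)
times = [exec_time(p, **task) for p in range(1, p_max + 1)]
areas = [p * t for p, t in zip(range(1, p_max + 1), times)]
print(p_max, times, areas)
```

For this task, the execution times decrease and the areas increase over the range $[1, p^{\max}]$, as stated by the monotonic property.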
Definition 2. The minimum length $L_{\min}(f)$ of a path¹ $f$ in the graph is the sum of the minimum execution time of all tasks along that path, i.e., $L_{\min}(f) = \sum_{j \in f} t_j^{\min}$. The minimum critical path length $C_{\min}$ of the graph is the longest minimum length of any path in the graph, i.e., $C_{\min} = \max_f L_{\min}(f)$.
Clearly, the optimal makespan cannot be smaller than $A_{\min}/P$ and $C_{\min}$. This follows from the well-known area bound and critical-path bound for scheduling any task graph. The choice of minimum value for both quantities ensures that they can serve as lower bounds on the optimal makespan. The following lemma states this result.
Lemma 1. $T_{\text{OPT}} \ge \max\left(\frac{A_{\min}}{P}, C_{\min}\right)$.
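The two bounds are straightforward to compute once the minimum execution time and minimum area of each task are known. The following sketch assumes these values are given directly; the function name `makespan_lower_bound` and the four-task graph are hypothetical examples.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def makespan_lower_bound(t_min, a_min, preds, P):
    """Return (max(A_min / P, C_min), A_min, C_min) for a task graph.

    t_min, a_min: dicts mapping each task to its minimum time / area.
    preds: dict mapping each task to the list of its predecessors.
    """
    a_total = sum(a_min.values())
    longest = {}  # longest minimum-length path ending at each task
    for j in TopologicalSorter(preds).static_order():
        before = max((longest[i] for i in preds.get(j, ())), default=0.0)
        longest[j] = before + t_min[j]
    c_min = max(longest.values())
    return max(a_total / P, c_min), a_total, c_min


# Hypothetical diamond-shaped graph: 1 -> {2, 3} -> 4.
preds = {1: [], 2: [1], 3: [1], 4: [2, 3]}
t_min = {1: 2.0, 2: 3.0, 3: 1.0, 4: 2.0}
a_min = {1: 8.0, 2: 12.0, 3: 4.0, 4: 6.0}
bound, a_total, c_min = makespan_lower_bound(t_min, a_min, preds, P=4)
print(bound, a_total, c_min)
```

In this example the area bound (30 / 4 = 7.5) dominates the critical-path bound (2 + 3 + 2 = 7).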

ONLINE ALGORITHM
In this section, we present an online scheduling algorithm and derive its competitive ratio for the considered speedup model (Equation (1)) as well as for its three special cases.

Algorithm Description
Algorithm 1 presents the pseudocode of the online scheduling algorithm, which at any time maintains the set of available tasks in a waiting queue $Q$. At time 0 or whenever a running task completes execution and thus releases processors, it checks if new tasks have become available. If so, for each newly available task $j$, it finds a processor allocation $p_j$ for the task (using Algorithm 2) before inserting it into the queue $Q$. Then, it applies the well-known list scheduling strategy by scanning through all the available tasks in $Q$ and executing each one right away if there are enough processors.
Algorithm 2 presents the details of the processor allocation strategy for any task $j$. It consists of two steps. The first step performs an initial allocation for the task, which is inspired by the Local Processor Allocation (LPA) strategy proposed in [2], [3]. Specifically, for each possible allocation $p \in [1, p_j^{\max}]$, we define the ratio between the area of the task and the minimum area to be $\alpha_p = a_j(p) / a_j^{\min}$, and the ratio between the execution time of the task and the minimum execution time to be $\beta_p = t_j(p) / t_j^{\min}$. We then find an allocation that minimizes $\alpha_p$ subject to the constraint $\beta_p \le \frac{1 - 2\mu}{\mu(1 - \mu)}$, where $\mu \le \frac{3 - \sqrt{5}}{2} \approx 0.382$ is a constant whose exact value will be determined based upon the speedup model under consideration. The justification for this strategy as well as for the choice of $\mu$ will be presented in the next section. Since $\alpha_p$ is non-decreasing with $p$ and $\beta_p$ is non-increasing with $p$, the above optimization problem can be efficiently solved in linear time.
In the second step, the algorithm reduces the initial allocation to $\lceil \mu P \rceil$ if it is more than $\lceil \mu P \rceil$; otherwise the allocation will be unchanged. Let $p'_j$ denote the initial allocation for the task and $p_j$ the final allocation. Thus, after the second step, we have:

$$p_j = \begin{cases} \lceil \mu P \rceil & \text{if } p'_j > \lceil \mu P \rceil, \\ p'_j & \text{otherwise.} \end{cases} \tag{6}$$

This step adopts the technique first proposed in [16] and subsequently used in [12], [13]. The purpose is to be able to execute more tasks at any time during the schedule, thus potentially increasing the overall resource utilization of the platform and reducing the makespan.

1. A path $f$ consists of a sequence of tasks with linear dependency, i.e., $f = (j_{\pi(1)}, j_{\pi(2)}, \dots, j_{\pi(v)})$, where the first task $j_{\pi(1)}$ in the sequence has no predecessor in the graph, the last task $j_{\pi(v)}$ has no successor, and, for each $1 \le i < v$, task $j_{\pi(i)}$ is a predecessor of task $j_{\pi(i+1)}$.
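The two-step allocation can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the largest useful allocation is found here by brute force over all allocations (an assumption made for brevity), and the function name `allocate_processors` and the task parameters are hypothetical.

```python
import math


def exec_time(p, w, p_bar, d, c):
    """Execution time t(p) = w / min(p, p_bar) + d + c * (p - 1)."""
    return w / min(p, p_bar) + d + c * (p - 1)


def allocate_processors(w, p_bar, d, c, P, mu):
    """Return (initial allocation, final allocation) for one task."""
    times = {p: exec_time(p, w, p_bar, d, c) for p in range(1, P + 1)}
    # Brute-force stand-in for p_max: smallest allocation of minimum time.
    p_max = min(times, key=lambda p: (times[p], p))
    t_min, a_min = times[p_max], times[1]  # a_min = a(1) = 1 * t(1)

    # Step 1: minimize alpha_p = a(p) / a_min subject to the bound on beta_p.
    bound = (1 - 2 * mu) / (mu * (1 - mu))
    feasible = [p for p in range(1, p_max + 1) if times[p] / t_min <= bound]
    p_init = min(feasible, key=lambda p: (p * times[p] / a_min, p))

    # Step 2: cap the allocation at ceil(mu * P).
    p_final = min(p_init, math.ceil(mu * P))
    return p_init, p_final


# Hypothetical Amdahl-type task on P = 64 processors, with mu = 0.271.
p_init, p_final = allocate_processors(
    w=100.0, p_bar=64, d=1.0, c=0.0, P=64, mu=0.271)
print(p_init, p_final)
```

Here the first step selects 21 processors (the smallest allocation whose execution time stays within the allowed factor of the minimum), and the second step caps it at $\lceil 0.271 \times 64 \rceil = 18$.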

General Analysis Framework
We now outline a general analysis framework, under which the competitive ratio of the proposed online algorithm will be derived for different speedup models.
Recall that $T$ denotes the makespan of the online scheduling algorithm. Since the algorithm allocates and de-allocates processors upon task completions, the schedule can be divided into a set $\mathcal{I} = \{I_1, I_2, \dots\}$ of non-overlapping intervals, where tasks only start (or complete) at the beginning (or end) of an interval, and the number of utilized processors does not change during an interval. For each interval $I \in \mathcal{I}$, let $p(I)$ denote its processor utilization, i.e., the total number of processors used by all tasks running in interval $I$. Following the analysis of [16], we classify the set of intervals into the following categories:
• $\mathcal{I}_1$: intervals with $0 < p(I) \le \lceil \mu P \rceil - 1$;
• $\mathcal{I}_2$: intervals with $\lceil \mu P \rceil \le p(I) \le \lceil (1 - \mu) P \rceil - 1$;
• $\mathcal{I}_3$: intervals with $p(I) \ge \lceil (1 - \mu) P \rceil$.
Let $T_1$, $T_2$ and $T_3$ denote the total durations of the intervals in $\mathcal{I}_1$, $\mathcal{I}_2$ and $\mathcal{I}_3$, respectively, so that $T = T_1 + T_2 + T_3$. The next two lemmas relate these durations to the minimum total area and minimum critical path length of the task graph, given certain conditions on the initial processor allocations of the tasks.
Lemma 2. If there exists a constant $\alpha$ such that, for each task $j$, its initial processor allocation satisfies $a_j(p'_j) \le \alpha \times a_j^{\min}$, then we have:

$$\mu T_2 + (1 - \mu) T_3 \le \alpha \times \frac{A_{\min}}{P}. \tag{8}$$

Proof. As the area of each task $j$ is non-decreasing with its processor allocation and $p_j \le p'_j$, the final area of the task should satisfy $a_j(p_j) \le a_j(p'_j) \le \alpha \times a_j^{\min}$. Thus, the total area $A$ of all tasks after their final allocations will satisfy $A = \sum_j a_j(p_j) \le \alpha \times \sum_j a_j^{\min} = \alpha \times A_{\min}$. Since at least $\lceil \mu P \rceil \ge \mu P$ processors are utilized during $T_2$ and at least $\lceil (1 - \mu) P \rceil \ge (1 - \mu) P$ processors are utilized during $T_3$, we have $\mu T_2 + (1 - \mu) T_3 \le \frac{A}{P} \le \alpha \times \frac{A_{\min}}{P}$.
Lemma 3. If there exists a constant $\beta$ such that, for each task $j$, its initial processor allocation satisfies $t_j(p'_j) \le \beta \times t_j^{\min}$ and $\beta \le \frac{1}{\mu}$, then we have:

$$\frac{T_1}{\beta} + \mu T_2 \le C_{\min}. \tag{9}$$

Proof. During $T_1$ and $T_2$, the processor utilization is at most $\lceil (1 - \mu) P \rceil - 1$, so there are at least $P - (\lceil (1 - \mu) P \rceil - 1) \ge \lceil \mu P \rceil$ available processors. Based on Algorithm 2, any task is allocated at most $\lceil \mu P \rceil$ processors. Thus, there are enough processors to execute any new task (if one is available). This implies that there is no available task in the queue $Q$ during $T_1$ and $T_2$. When a task graph is scheduled by the list scheduling algorithm, it is well known that there exists a path $f$ in the graph such that some task along that path will be running whenever there is no available task in the queue [8], [13], [16].
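For any task $j$ along path $f$ running during $T_1$, its processor allocation must be less than $\lceil \mu P \rceil$, hence is not reduced by Step 2 of Algorithm 2, i.e., $p_j = p'_j$. Thus, its execution time should satisfy $t_j(p_j) = t_j(p'_j) \le \beta \times t_j^{\min}$. For any task $j$ along path $f$ running during $T_2$, its processor allocation may or may not be reduced. If it is not reduced, then similarly we can get $t_j(p_j) \le \beta \times t_j^{\min} \le \frac{1}{\mu} \times t_j^{\min}$. Otherwise, if it is reduced, then based on Equation (6) and on the fact that the task cannot achieve superlinear speedup, the task execution time should satisfy:

$$t_j(p_j) = t_j(\lceil \mu P \rceil) \le \frac{p_j^{\max}}{\lceil \mu P \rceil} \times t_j(p_j^{\max}) \le \frac{P}{\mu P} \times t_j^{\min} = \frac{1}{\mu} \times t_j^{\min}.$$

Now, let $L'_{\min}(f)$ (resp. $L''_{\min}(f)$) denote the minimum length for the portion of path $f$ executed during $T_1$ (resp. $T_2$). The argument above implies that $T_1 \le \beta \times L'_{\min}(f)$ and $T_2 \le \frac{1}{\mu} \times L''_{\min}(f)$. Thus, we have $\frac{T_1}{\beta} + \mu T_2 \le L'_{\min}(f) + L''_{\min}(f) \le L_{\min}(f) \le C_{\min}$.
Based on the results of Lemmas 2 and 3, we can now derive a bound on the makespan of the online scheduling algorithm as shown below.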
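Lemma 4. If there exist two constants $\alpha$ and $\beta$ such that, for each task $j$, its initial processor allocation satisfies $a_j(p'_j) \le \alpha \times a_j^{\min}$ and $t_j(p'_j) \le \beta \times t_j^{\min}$ with $\beta \le \frac{1 - 2\mu}{\mu(1 - \mu)}$, then we have:

$$T \le \frac{\mu \alpha + 1 - 2\mu}{\mu (1 - \mu)} \times T_{\text{OPT}}. \tag{10}$$

Proof. As the makespan is given by $T = T_1 + T_2 + T_3$, we can multiply both sides by $\frac{1 - \mu}{\alpha}$ and apply Equation (8) to remove the $T_3$ term, which gives:

$$\frac{1 - \mu}{\alpha} T \le \frac{1 - \mu}{\alpha} T_1 + \frac{1 - 2\mu}{\alpha} T_2 + \frac{A_{\min}}{P}.$$

We can then multiply both sides of the above inequality by $\frac{\mu \alpha}{1 - 2\mu}$ and use Equation (9) to remove the $T_2$ term (since $\beta \le \frac{1 - 2\mu}{\mu(1 - \mu)} \le \frac{1}{\mu}$). This gives:

$$\frac{\mu (1 - \mu)}{1 - 2\mu} T \le \left( \frac{\mu (1 - \mu)}{1 - 2\mu} - \frac{1}{\beta} \right) T_1 + \frac{\mu \alpha}{1 - 2\mu} \times \frac{A_{\min}}{P} + C_{\min}.$$

As $\beta \le \frac{1 - 2\mu}{\mu(1 - \mu)}$, the first term above becomes nonpositive and hence can be removed without affecting the inequality. By rearranging the factors and applying Lemma 1, we can then obtain the result as shown in Equation (10).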
The result of Lemma 4 shows that the competitive ratio of the online algorithm increases with $\alpha$, for a given $\mu$. This suggests that the initial processor allocation should try to minimize $\alpha$ subject to the constraint $\beta \le \frac{1 - 2\mu}{\mu(1 - \mu)}$.
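The ratio of Lemma 4 and its side constraint are easy to evaluate numerically. The sketch below uses hypothetical function names (`beta_limit`, `competitive_ratio`) and a small floating-point tolerance of our own choosing when checking the constraint.

```python
import math


def beta_limit(mu):
    """Largest beta allowed by Lemma 4 for a given mu."""
    return (1 - 2 * mu) / (mu * (1 - mu))


def competitive_ratio(alpha, beta, mu):
    """Ratio (mu * alpha + 1 - 2 * mu) / (mu * (1 - mu)) of Lemma 4."""
    if not 0 < mu < 0.5:
        raise ValueError("mu must lie in (0, 1/2)")
    if beta > beta_limit(mu) + 1e-12:  # tolerance for rounding errors
        raise ValueError("beta violates the constraint of Lemma 4")
    return (mu * alpha + 1 - 2 * mu) / (mu * (1 - mu))


# With alpha = beta = 1, the largest admissible mu is (3 - sqrt(5)) / 2,
# and the ratio simplifies to 1 / mu.
mu = (3 - math.sqrt(5)) / 2
best_case = competitive_ratio(alpha=1.0, beta=1.0, mu=mu)
print(round(best_case, 3))
```

When $\alpha = \beta = 1$, the ratio equals $1/\mu = \frac{3 + \sqrt{5}}{2} \approx 2.618$; larger values of $\alpha$ increase the ratio for the same $\mu$.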

Competitive Ratio
In this section, we prove the competitive ratio of the online algorithm, which is given by $\frac{\mu \alpha + 1 - 2\mu}{\mu (1 - \mu)}$ subject to $\beta \le \frac{1 - 2\mu}{\mu (1 - \mu)}$, based on Lemma 4. We will show that there exists a processor allocation, parameterized by a parameter $x$, that achieves specific values of $\alpha$ and $\beta$ for any task that follows the considered speedup model. Then, by carefully choosing the values of $x$ and $\mu$, we can minimize the ratio while satisfying the constraint.
In the following, we first consider the three special speedup models (i.e., roofline, communication and Amdahl) before tackling the general model. As the analysis focuses on bounding the ratios $\alpha$ and $\beta$ for each individual task, we drop the task index $j$ for simplicity.

Roofline Model
Recall that a task follows the roofline speedup model if its execution time satisfies $t(p) = \frac{w}{\min(p, \bar{p})}$ for some $\bar{p} \le P$.
Lemma 5. For any task that follows the roofline speedup model, there exists a processor allocation that achieves $\alpha = 1$ and $\beta = 1$.
Proof. Setting the processor allocation to $\bar{p}$ achieves both the minimum execution time and the minimum area for the task, thus giving $\alpha = \beta = 1$.
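Theorem 1. The online algorithm is 2.62-competitive for any graph of tasks that follow the roofline speedup model. This is achieved with $\mu = \frac{3 - \sqrt{5}}{2} \approx 0.382$.
Proof. With $\alpha = \beta = 1$ (Lemma 5), the ratio of Lemma 4 simplifies to $\frac{1}{\mu}$, and the constraint $1 \le \frac{1 - 2\mu}{\mu (1 - \mu)}$ holds if and only if $\mu \le \frac{3 - \sqrt{5}}{2}$. Choosing the largest such $\mu$ gives a ratio of $\frac{3 + \sqrt{5}}{2} \approx 2.618$.
The above ratio matches the result of Feldmann et al. [8]². They also proved a matching lower bound for any online deterministic algorithm under the "non-clairvoyant" setting, where the work $w$ of a task is also unknown to the scheduler.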

Communication Model
Recall that a task follows the communication model if its execution time satisfies $t(p) = \frac{w}{p} + c(p - 1)$. For the ease of analysis, we rewrite the execution time function as: $t(p) = c\left(\frac{w'}{p} + p - 1\right)$ with $w' = \frac{w}{c}$.
Lemma 6. For any task that follows the communication model and for any $x \in \left[\frac{\sqrt{13} - 1}{6}, \frac{1}{2}\right]$, there exists a processor allocation that achieves $\alpha_x = 1 + \frac{x}{3} + x^2$ and $\beta_x = \frac{3}{5x} + \frac{3x}{5}$.
Proof. Recall that $p^{\max}$ denotes the number of processors that minimizes the execution time function $t(p)$, i.e., $t(p^{\max}) = t^{\min}$. Clearly, we have either $p^{\max} = P$ or $\lfloor \sqrt{w'} \rfloor \le p^{\max} \le \lceil \sqrt{w'} \rceil$. Also, the area function is given by $a(p) = p \times t(p) = c(w' + p(p - 1))$, and the minimum area is obtained with one processor, i.e., $a^{\min} = a(1) = c w'$. We consider two cases.
Case 1: $w' \le 9$. In this case, we must have $p^{\max} \le 3$. We further divide this case into three subcases and, for each subcase, we will show that there always exists a processor allocation $p$ that achieves $\alpha \le \frac{4}{3}$ and $\beta \le \frac{3}{2}$.
• If $p^{\max} = 1$, we can set $p = 1$ and get $\alpha = \beta = 1$.
[…] $w'$, then as $x \le \frac{1}{2}$, we must have $\sqrt{w'} > P$ and thus $p = p = P$. In this case, we clearly […]

2. In [8], each task has a parallelism $\bar{p}$, and can be virtualized if $p \le \bar{p}$ processors are used for execution, with a linear slowdown. This is equivalent to the roofline model.
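Theorem 2. The online algorithm is 3.61-competitive for any graph of tasks that follow the communication model. This is achieved with $\mu \approx 0.324$.
Proof. From the result of Lemma 6, we aim to minimize $\alpha_x = 1 + \frac{x}{3} + x^2$ while satisfying the constraint $\beta_x = \frac{3}{5x} + \frac{3x}{5} \le \frac{1 - 2\mu}{\mu (1 - \mu)}$. For a fixed $\mu$, multiplying both sides of the constraint by $x$ and rearranging terms, we get a second-degree inequality:

$$\frac{3}{5} x^2 - \frac{1 - 2\mu}{\mu (1 - \mu)} x + \frac{3}{5} \le 0.$$

The smallest $x$ satisfying this inequality can be computed to be

$$x^*_\mu = \frac{5}{6} \left( \frac{1 - 2\mu}{\mu (1 - \mu)} - \sqrt{\left( \frac{1 - 2\mu}{\mu (1 - \mu)} \right)^2 - \frac{36}{25}} \right).$$

Now, plugging the above expression of $x^*_\mu$ into $\alpha_x = 1 + \frac{x}{3} + x^2$, plugging the result into the competitive ratio $\frac{\mu \alpha_x + 1 - 2\mu}{\mu (1 - \mu)}$, and minimizing the resulting function numerically for $\mu \in \left(0, \frac{3 - \sqrt{5}}{2}\right]$, we can get the optimal competitive ratio to be at most 3.61, which is obtained at $\mu^* \approx 0.324$. This results in the value $x^*_\mu \approx 0.45$, which lies in the range of $x$ required by Lemma 6, thus is a valid choice.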
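The numerical minimization in this proof can be reproduced with a simple grid search. The sketch below is an assumption-laden illustration: the grid step of $10^{-4}$ is our choice, and the lower end `X_LO` of the admissible range for $x$ is taken to be the point where $\alpha_x = 4/3$ (the Case 1 bound), which is an assumption rather than a value confirmed by the text.

```python
import math

X_LO = (math.sqrt(13) - 1) / 6  # assumed lower end: alpha_x = 4/3 here
X_HI = 0.5                      # upper end of the range of Lemma 6


def beta_limit(mu):
    return (1 - 2 * mu) / (mu * (1 - mu))


def smallest_x(mu):
    """Smallest admissible x with beta_x <= beta_limit(mu), or None."""
    b = beta_limit(mu)
    disc = b * b - 36 / 25
    if disc < 0:
        return None
    return max(X_LO, 5 / 6 * (b - math.sqrt(disc)))


def ratio(mu):
    """Competitive ratio for the communication model at a given mu."""
    x = smallest_x(mu)
    if x is None or x > X_HI:
        return math.inf  # no valid x for this mu
    alpha = 1 + x / 3 + x * x
    return (mu * alpha + 1 - 2 * mu) / (mu * (1 - mu))


MU_MAX = (3 - math.sqrt(5)) / 2
grid = [k / 10000 for k in range(1, int(MU_MAX * 10000) + 1)]
best_mu = min(grid, key=ratio)
best_ratio = ratio(best_mu)
print(best_mu, best_ratio, smallest_x(best_mu))
```

The search returns a minimum slightly below 3.61 for $\mu$ close to 0.324, in line with the values stated in the theorem.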

Amdahl's Model
Recall that a task follows Amdahl's model if its execution time function is $t(p) = \frac{w}{p} + d$, so the area function is given by $a(p) = p \times t(p) = w + dp$.
Lemma 7. For any task that follows Amdahl's model and for any $x > 0$, there exists a processor allocation that achieves $\alpha_x = 1 + x$ and $\beta_x = 1 + \frac{1}{x}$.
Proof. The minimum execution time of the task is obtained by allocating all $P$ processors, i.e., $t^{\min} = t(P) = \frac{w}{P} + d$, and the minimum area is obtained with one processor, i.e., $a^{\min} = a(1) = w + d$.
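For any $x > 0$, we can set $p = \min\left(\left\lceil x \frac{w}{d} \right\rceil, P\right)$. This gives $a(p) = w + dp \le w + d\left(x \frac{w}{d} + 1\right) = (1 + x) w + d \le (1 + x)\, a^{\min}$, hence $\alpha \le 1 + x = \alpha_x$. If $p = \left\lceil x \frac{w}{d} \right\rceil$, we get $t(p) = \frac{w}{p} + d \le \frac{d}{x} + d = \left(1 + \frac{1}{x}\right) d \le \left(1 + \frac{1}{x}\right) t^{\min}$ and thus $\beta \le \beta_x$. Otherwise, if $p = P$, we get $t(p) = t^{\min}$ and thus $\beta = 1 < \beta_x$.
Theorem 3. The online algorithm is 4.74-competitive for any graph of tasks that follow Amdahl's model. This is achieved with $\mu \approx 0.271$.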
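Proof. Again, we need to minimize $\alpha_x = 1 + x$ subject to the constraint $\beta_x = 1 + \frac{1}{x} \le \frac{1 - 2\mu}{\mu (1 - \mu)}$. For a fixed $\mu$, the smallest $x$ satisfying the above inequality can be computed as:

$$x^*_\mu = \frac{\mu (1 - \mu)}{\mu^2 - 3\mu + 1}.$$

Plugging $x^*_\mu$ into $\alpha_x = 1 + x$, and then plugging the result into the competitive ratio $\frac{\mu \alpha_x + 1 - 2\mu}{\mu (1 - \mu)}$ and simplifying, we can get the following function:

$$f(\mu) = \frac{1}{\mu} + \frac{\mu}{\mu^2 - 3\mu + 1}.$$

Minimizing this function numerically for $\mu \in \left(0, \frac{3 - \sqrt{5}}{2}\right]$, we can get the optimal competitive ratio to be at most 4.74, which is obtained at $\mu^* \approx 0.271$ (thus $x^*_\mu \approx 0.759$).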
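The minimization for Amdahl's model can be checked in the same way. The sketch below performs a plain grid search (step $10^{-4}$, our choice) over the competitive ratio obtained with the smallest admissible $x$; the function names are hypothetical.

```python
import math


def beta_limit(mu):
    return (1 - 2 * mu) / (mu * (1 - mu))


def amdahl_ratio(mu):
    """Return (competitive ratio, x) using the smallest x with beta_x <= limit."""
    b = beta_limit(mu)
    if b <= 1:
        return math.inf, None  # 1 + 1/x <= b has no solution
    x = 1 / (b - 1)            # from 1 + 1/x = b
    alpha = 1 + x
    return (mu * alpha + 1 - 2 * mu) / (mu * (1 - mu)), x


MU_MAX = (3 - math.sqrt(5)) / 2
grid = [k / 10000 for k in range(1, int(MU_MAX * 10000) + 1)]
best_mu = min(grid, key=lambda m: amdahl_ratio(m)[0])
best_ratio, best_x = amdahl_ratio(best_mu)
print(best_mu, best_ratio, best_x)
```

The minimum found is just above 4.73, for $\mu$ near 0.271 and $x$ near 0.76, consistent with the bound of 4.74 stated in the theorem.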

General Model
We finally consider the general speedup model as given in Equation (1). Again, for the ease of analysis, we rewrite the execution time function as: $t(p) = c\left(\frac{w'}{\min(p, \bar{p})} + d' + p - 1\right)$ with $w' = \frac{w}{c}$ and $d' = \frac{d}{c}$.
Lemma 8. For any task that follows the general model and for any $x > 1$, there exists a processor allocation that achieves $\alpha_x = 1 + \frac{1}{x} + \frac{1}{x^2}$ and $\beta_x = x + 1 + \frac{1}{x}$.
Proof. If we allow the processor allocation to take non-integer values and assume unbounded $\bar{p}$, the execution time function $t(p)$ would be minimized at $p^* = \sqrt{w'}$. Thus, the minimum execution time should satisfy $t^{\min} \ge c(2\sqrt{w'} + d' - 1)$. Note that this bound will hold true regardless of the value of $\bar{p}$: it is obviously true if $\bar{p} \ge p^*$; otherwise $t^{\min}$ is achieved at $\bar{p}$, with a value also higher than $c(2\sqrt{w'} + d' - 1)$. Furthermore, the minimum area is obtained with one processor, i.e., $a^{\min} = a(1) = c(w' + d')$.
Recall that $p^{\max}$ denotes the number of processors that minimizes the execution time, i.e., $t(p^{\max}) = t^{\min}$. Clearly, we have either $p^{\max} = P$, or $\lfloor \sqrt{w'} \rfloor \le p^{\max} \le \lceil \sqrt{w'} \rceil$, or $p^{\max} = \bar{p}$. We consider two cases.
Case 1: $w' \le 1$. In this case, it must be that $p^{\max} = 1$. We can then set the processor allocation to be $p = 1$ and have $\alpha = \beta = 1$.
Case 2: $w' > 1$. In this case, for any $x > 1$, we can set $p = \min($ $w' + d'$ […] The last inequality above comes from $w' > 1$ and $d' > 0$.
Since $w' > 1$, we get $t^{\min} > c($ […] To derive $\beta$, we further consider two subcases. […]
Theorem 4. The online algorithm is 5.72-competitive for any graph of tasks that follow the general speedup model given in Equation (1). This is achieved with $\mu \approx 0.211$.
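By analogy with the proofs of Theorems 2 and 3, the values in Theorem 4 can be checked numerically from Lemmas 4 and 8. The sketch below rests on one assumption of ours: since $\alpha_x = 1 + \frac{1}{x} + \frac{1}{x^2}$ decreases with $x$, it picks the largest $x > 1$ that still satisfies $x + 1 + \frac{1}{x} \le \frac{1 - 2\mu}{\mu(1 - \mu)}$. The function names and the grid step are hypothetical.

```python
import math


def beta_limit(mu):
    return (1 - 2 * mu) / (mu * (1 - mu))


def general_ratio(mu):
    """Return (competitive ratio, x) for the general model at a given mu."""
    b = beta_limit(mu)
    disc = (b - 1) ** 2 - 4        # from x^2 - (b - 1) x + 1 <= 0
    if disc <= 0:
        return math.inf, None      # no x > 1 satisfies the constraint
    x = ((b - 1) + math.sqrt(disc)) / 2   # largest admissible x
    alpha = 1 + 1 / x + 1 / x ** 2
    return (mu * alpha + 1 - 2 * mu) / (mu * (1 - mu)), x


MU_MAX = (3 - math.sqrt(5)) / 2
grid = [k / 10000 for k in range(1, int(MU_MAX * 10000) + 1)]
best_mu = min(grid, key=lambda m: general_ratio(m)[0])
best_ratio, best_x = general_ratio(best_mu)
print(best_mu, best_ratio, best_x)
```

Under this assumption, the search yields a ratio slightly above 5.71 for $\mu$ near 0.211, which agrees with the 5.72 bound of the theorem.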

A LOWER BOUND FOR ARBITRARY SPEEDUP MODEL
In the previous section, we have proven constant competitive ratios of our online algorithm for task graphs under several common speedup models. In this section, we show that the competitive ratio of any deterministic online algorithm can be unbounded for the arbitrary speedup model.
Theorem 5. Any deterministic online algorithm is at least Ω(ln(D))-competitive for scheduling moldable task graphs under the arbitrary speedup model, where D denotes the number of tasks along the longest (critical) path of the graph.
Proof. We fix an arbitrary integer $\ell > 1$ and set $K = 2^\ell$. The instance consists of $n = 2^K - 1$ independent linear task chains organized in groups. Specifically, for any $i \in [1, K]$, group $i$ contains $2^{K-i}$ linear chains, each with exactly $i$ tasks. Thus, the number of tasks along the longest path of the graph is given by $D = K$. Figure 1 shows such an instance for $\ell = 2$, $K = 4$ and $n = 15$. All tasks in the graph are identical, with an execution time function $t(p) = \frac{1}{\lg(p) + 1}$. We set the total number of processors to be $P = K \times 2^{K-1}$.
We show that the optimal offline algorithm completes the above instance with a makespan at most 1, whereas any deterministic online algorithm may produce a makespan at least $\ln(K) - \ln(\ell) - \frac{1}{\ell}$, thus showing the result.
First, the optimal offline algorithm could schedule the tasks as follows: for any group $i \in [1, K]$, it allocates $2^{i-1}$ processors to each linear chain in the group. The total number of required processors is then

$$\sum_{i=1}^{K} 2^{K-i} \times 2^{i-1} = K \times 2^{K-1} = P.$$

Thus, all linear chains could be executed in parallel. Furthermore, they will all be completed at time 1, since each linear chain in group $i$ has $i$ tasks, and each task has an execution time $t(2^{i-1}) = \frac{1}{\lg(2^{i-1}) + 1} = \frac{1}{i}$. Figure 2 illustrates the schedule for our instance with $\ell = 2$.
Now, we establish a lower bound on the makespan of any deterministic online algorithm. For any $i \in [1, K - 1]$, let $L_i$ denote the set of linear chains in all groups $j \le i$, and let $\bar{L}_i$ denote the set of linear chains in all groups $j > i$. Let us define $t_i$ to be the first time a linear chain in $\bar{L}_i$ completes $i$ tasks. We further define $t_0 = 0$ and let $t_K$ denote the makespan of the online algorithm.
Lemma 9. In the worst case, a schedule produced by any deterministic online algorithm could satisfy: $t_i - t_{i-1} \ge \frac{1}{\ell + i}$ for all $i \in [1, K]$.
Proof. Since all tasks are identical, an online algorithm cannot distinguish the linear chains. Thus, for any $i \in [1, K]$, an adversary could make all linear chains that first complete $i$ tasks by the online algorithm be chains from $L_i$. Therefore, at time $t_i$, all linear chains containing exactly $i$ tasks (i.e., the ones from group $i$) are already completed, and at time $t_{i-1}$, no linear chain has started its $i$-th task by definition (this also holds for $t_0$ and $t_K$). Hence, all tasks in the $i$-th position of the linear chains in group $i$ must be entirely processed between $t_{i-1}$ and $t_i$, and the number of such tasks is $2^{K-i}$. For the sake of contradiction, suppose we have $t_i - t_{i-1} < \frac{1}{\ell + i}$. Thus, the execution time of these tasks must satisfy $t(p) = \frac{1}{\lg(p) + 1} \le \frac{1}{\ell + i}$, hence their processor allocation must be at least $p \ge 2^{\ell + i - 1} = K \times 2^{i-1}$. As the area of the task $a(p) = p \times t(p) = \frac{p}{\lg(p) + 1}$ is increasing with the number of processors, the total area of all tasks that needs to be processed between $t_{i-1}$ and $t_i$ is at least

$$2^{K-i} \times a(K \times 2^{i-1}) = \frac{2^{K-i} \times K \times 2^{i-1}}{\lg(K \times 2^{i-1}) + 1} = \frac{K \times 2^{K-1}}{\ell + i} = \frac{P}{\ell + i}.$$

Since we have $P$ processors, the total time required to process this area is at least $\frac{1}{\ell + i}$, which contradicts $t_i - t_{i-1} < \frac{1}{\ell + i}$.
One strategy to cope with the worst-case scenario above is to allocate the same number of processors to each linear chain (or more precisely allocate one more processor to some linear chains in order to utilize all the processors). Figure 3 illustrates this strategy for the same instance with $\ell = 2$.
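The instance and the offline schedule described in this proof can be verified with a few lines of Python. The helper name `build_instance` is hypothetical, and $\lg$ is taken to be the base-2 logarithm.

```python
import math


def build_instance(ell):
    """Return (K, chains, P), where chains maps group i to its chain count."""
    K = 2 ** ell
    chains = {i: 2 ** (K - i) for i in range(1, K + 1)}
    P = K * 2 ** (K - 1)
    return K, chains, P


def t(p):
    """Execution time of every task: 1 / (lg(p) + 1)."""
    return 1 / (math.log2(p) + 1)


K, chains, P = build_instance(2)
n = sum(chains.values())

# Offline schedule: every chain of group i gets 2^(i-1) processors.
procs_needed = sum(count * 2 ** (i - 1) for i, count in chains.items())
finish = {i: i * t(2 ** (i - 1)) for i in chains}
print(K, n, P, procs_needed, finish)
```

For $\ell = 2$, this gives $K = 4$, $n = 15$ and $P = 32$; the offline allocation uses exactly 32 processors, and every chain completes at time 1.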
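Finally, we can use the result of Lemma 9 to lower bound the makespan of an online algorithm, which is given by $t_K = \sum_{i=1}^{K} (t_i - t_{i-1})$. Since $\forall j$, $\ln(j) + \gamma < \sum_{i=1}^{j} \frac{1}{i} < \ln(j) + \gamma + \frac{1}{j}$, where $\gamma$ is the Euler constant, we obtain:

$$t_K \ge \sum_{i=1}^{K} \frac{1}{\ell + i} = \sum_{i=1}^{K + \ell} \frac{1}{i} - \sum_{i=1}^{\ell} \frac{1}{i} > \ln(K + \ell) - \ln(\ell) - \frac{1}{\ell} > \ln(K) - \ln(\ell) - \frac{1}{\ell}.$$

This completes the proof of Theorem 5.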
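The final chain of inequalities can be sanity-checked numerically by comparing the sum given by Lemma 9 with the closed-form bound. The function names below are hypothetical, and the range of $\ell$ values is an arbitrary choice.

```python
import math


def lemma9_sum(ell):
    """Sum of 1 / (ell + i) for i = 1..K, with K = 2^ell."""
    K = 2 ** ell
    return sum(1 / (ell + i) for i in range(1, K + 1))


def closed_form_bound(ell):
    """ln(K) - ln(ell) - 1/ell, with K = 2^ell."""
    K = 2 ** ell
    return math.log(K) - math.log(ell) - 1 / ell


for ell in range(2, 11):
    print(ell, round(lemma9_sum(ell), 4), round(closed_form_bound(ell), 4))
```

For every $\ell$ in this range, the sum stays above the closed-form bound, and both grow roughly linearly with $\ell$, i.e., logarithmically with $K = D$.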

Algorithm 1: Online Scheduling Algorithm
  initialize a waiting queue Q
  when at time 0 or a running task completes execution do
    // Processor Allocation
    for each new task j that becomes available do
      Allocate_Processor(j)
      insert task j into the waiting queue Q
    end
    // List Scheduling
    for each task j in the waiting queue Q do
      if there are enough processors to execute the task then
        execute task j now
      end
    end
  end

Algorithm 2: Allocate_Processor(j)
  // Step 1: Initial Allocation
  Compute $p_j^{\max}$ based on Equation (5)
  Compute $t_j^{\min} = t_j(p_j^{\max})$ and $a_j^{\min} = a_j(1)$
  Find an allocation $p'_j \in [1, p_j^{\max}]$ by solving the following optimization problem:
    minimize $\alpha_p$ subject to $\beta_p \le \frac{1 - 2\mu}{\mu(1 - \mu)}$
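The list scheduling loop of Algorithm 1 can be sketched as a small event-driven simulation. To keep the example short, the processor allocation and the resulting duration of each task are assumed to be already chosen and are passed in directly; the function name `list_schedule` and the four-task graph are hypothetical, and every allocation is assumed to be at most $P$.

```python
import heapq


def list_schedule(tasks, preds, P):
    """Simulate online list scheduling.

    tasks: dict task -> (processors, duration), allocations already fixed.
    preds: dict task -> list of predecessors.
    Returns (start times, makespan).
    """
    succs = {j: [] for j in tasks}
    missing = {j: len(preds.get(j, ())) for j in tasks}
    for j, ps in preds.items():
        for i in ps:
            succs[i].append(j)

    queue = [j for j in tasks if missing[j] == 0]  # waiting queue Q
    running = []                                   # heap of (finish time, task)
    free, now, start = P, 0.0, {}
    while queue or running:
        # List scheduling: start every waiting task that currently fits.
        for j in list(queue):
            procs, duration = tasks[j]
            if procs <= free:
                free -= procs
                start[j] = now
                heapq.heappush(running, (now + duration, j))
                queue.remove(j)
        # Advance to the next completion and reveal newly available tasks.
        now, done = heapq.heappop(running)
        free += tasks[done][0]
        for k in succs[done]:
            missing[k] -= 1
            if missing[k] == 0:
                queue.append(k)
    return start, now


# Hypothetical graph 1 -> {2, 3} -> 4 on P = 4 processors.
tasks = {1: (2, 2.0), 2: (3, 3.0), 3: (2, 1.0), 4: (4, 2.0)}
preds = {2: [1], 3: [1], 4: [2, 3]}
start, makespan = list_schedule(tasks, preds, P=4)
print(start, makespan)
```

In this run, task 3 has to wait until task 2 releases its processors at time 5, which yields a makespan of 8.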

Fig. 1. Lower bound instance for $\ell = 2$, $K = 4$, and $n = 15$ linear task chains. Each circle represents a task and the number inside each circle indicates the ID of the linear chain the task is in (and the number in the parenthesis indicates the task's position in that linear chain).

Fig. 3. The schedule of an online algorithm for the lower bound instance with $\ell = 2$, $K = 4$, and $n = 15$ linear chains. The algorithm allocates (approximately) the same number of processors to all linear chains, producing a makespan $t_4 \approx 1.23$.

Table 1. Instances of the scheduling problem.
Algorithm 2, Step 2: Allocation Adjustment
  if $p'_j > \lceil \mu P \rceil$ then
    $p_j \leftarrow \lceil \mu P \rceil$
  else
    $p_j \leftarrow p'_j$
  end