Efficient Approximation Algorithms for Scheduling Moldable Tasks

We study the problem of scheduling $n$ independent moldable tasks on $m$ processors that arises in large-scale parallel computations. When tasks are monotonic, the best known result is a $(\frac{3}{2}+\epsilon)$-approximation algorithm for makespan minimization with a complexity linear in $n$ and polynomial in $\log m$ and $\frac{1}{\epsilon}$, where $\epsilon$ is arbitrarily small. We propose a new perspective on the existing speedup models: the speedup of a task $T_j$ is linear when the number $p$ of assigned processors is small (up to a threshold $\delta_j$), while it presents monotonicity when $p$ ranges in $[\delta_j, k_j]$; the bound $k_j$ indicates an unacceptable overhead when parallelizing on too many processors. For a given integer $\delta \ge 5$, let $u = \lceil \sqrt{\delta} \rceil - 1$. In this paper, we propose a $\frac{1}{\theta(\delta)}(1+\epsilon)$-approximation algorithm for makespan minimization with a complexity $O(n \log(n/\epsilon) \log m)$, where $\theta(\delta) = \frac{u+1}{u+2}\left(1 - \frac{k}{m}\right)$ ($m \gg k$). As a by-product, we also propose a $\theta(\delta)$-approximation algorithm for throughput maximization with a common deadline with a complexity $O(n^2 \log m)$.


Introduction
Most computations nowadays are done in a parallelized way on large computers containing many processors. Optimizing the use of processors leads to the problem of scheduling parallel tasks based on their characteristics. In certain cases, the number of processors assigned to a task is predefined by its owner, and the task is said to be rigid. However, in many cases, the scheduler can decide this number before the task execution: if this number cannot be changed during the task execution, the task is said to be moldable; otherwise, it is said to be malleable. Moldable tasks are easier to implement and manage than malleable tasks; the latter require additional system support for task migrations and preemptions (Drozdowski, 2004).

General Problem Description
We consider the problem of scheduling $n$ independent moldable tasks $\mathcal{T} = \{T_1, T_2, \ldots, T_n\}$ on $m$ identical processors; all tasks are available at time zero. For every task $T_j \in \mathcal{T}$, its execution time $t_{j,1}$ on one processor is given, as well as the speedup $\eta_{j,p}$ when assigned $p \ge 1$ processors, where $p$ is a positive integer. The execution time of $T_j$ on $p$ processors is $t_{j,p} = \frac{t_{j,1}}{\eta_{j,p}}$; then, its workload is $D_{j,p} = p \times t_{j,p}$. The task $T_j$ can be represented by a rectangle in the processors × time space. Like (Mounié et al., 1999, 2007; Jansen & Land, 2018), given a real number $d$, we define a parameter $\gamma(j,d)$ as the minimum number of processors needed to finish task $T_j$ by time $d$; if $T_j$ cannot be finished by time $d$ on any permissible number of processors, we set by convention $\gamma(j,d) = +\infty$.

We often hope to finish all tasks as soon as possible. Sometimes, a task $T_j$ also has a value $v_j$ that can be obtained if it is finished by a deadline $\tau$; then we hope to finish the most valuable tasks by time $\tau$. We will propose algorithms that generate schedules for different objectives: (i) minimize the makespan, i.e., the maximum completion time of all tasks of $\mathcal{T}$, or (ii) choose a subset of tasks and finish them on the $m$ processors by a deadline $\tau$ to maximize the throughput, i.e., the aggregate value of tasks finished by time $\tau$. For each task to be executed, a schedule defines the number of processors assigned to it and the time interval in which it is executed. An algorithm is a $\rho$-approximation if
• for our minimization problem, it produces a schedule whose makespan is at most $\rho$ times the optimal makespan, where $\rho \ge 1$;
• for our maximization problem, it produces a schedule whose throughput is at least $\rho$ times the optimal throughput, where $\rho \le 1$.
It is always desirable to have a performance bound $\rho$ closer to one, while keeping algorithms simple enough to run efficiently.
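The quantities defined above can be made concrete with a small sketch. The speedup table `eta`, the numeric values, and the helper names are illustrative assumptions and do not come from the paper; `gamma_by_definition` simply applies the definition of $\gamma(j,d)$ by scanning the permissible processor counts.

```python
import math

def exec_time(t1, eta, p):
    """Execution time t_{j,p} = t_{j,1} / eta_{j,p} on p processors."""
    return t1 / eta[p]

def workload(t1, eta, p):
    """Workload D_{j,p} = p * t_{j,p}."""
    return p * exec_time(t1, eta, p)

def gamma_by_definition(t1, eta, d):
    """Minimum permissible p with t_{j,p} <= d, or math.inf if none exists."""
    for p in sorted(eta):
        if exec_time(t1, eta, p) <= d:
            return p
    return math.inf

# Hypothetical task: t_{j,1} = 12 with speedups for p = 1..4 (made-up values).
eta = {1: 1.0, 2: 2.0, 3: 3.0, 4: 3.2}
g = gamma_by_definition(12, eta, 5)   # smallest p finishing within d = 5
```

Here the task needs three processors to finish by $d = 5$, and no permissible allocation finishes it by $d = 3$, in which case the convention $\gamma(j,d) = +\infty$ applies.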

Typical Speedup Models, and Motivation
For moldable tasks, a key aspect that conditions scheduling is the relation between the task execution time $t_{j,p}$ and the number $p$ of assigned processors. We now introduce three typical speedup models in the literature and the most related works, as well as the main motivation of this paper. Our main problem is offline scheduling of independent moldable tasks for makespan minimization. While introducing the related works, if they differ from our main problem, we only clarify the difference; otherwise, they consider the same problem as ours.

Linear-Speedup Model. An ideal speedup model is linear when $p$ does not exceed a threshold $\delta_j$ (Drozdowski, 2004): $t_{j,p} = \frac{t_{j,1}}{p}$ where $\eta_{j,p} = p$; the workload of $T_j$ is independent of $p$ since $D_{j,p} = p\, t_{j,p} = t_{j,1}$. Benoit et al. (2022a) propose a 2-approximation algorithm, called LPA-LIST, for failure-prone platforms with additional constraints in the process of executing jobs. We note that LPA-LIST is applicable to the main problem of this paper by setting the number of job execution failures in its model to zero. Like ours, the other related works are directly for failure-free platforms. When there are precedence constraints among moldable tasks, Wang & Cheng (1992) propose a $(3 - \frac{2}{m})$-approximation algorithm, while Benoit et al. (2022b) give a 2.62-approximation algorithm in the online setting. Besides, the case of scheduling independent malleable tasks has already been studied well, e.g., Drozdowski (1996) gives a polynomial time exact algorithm with a time complexity of $O(n^2)$. Table 1 summarizes the most relevant works under this model, and their differences with our main problem are clarified in the third column; the works whose objective is throughput maximization will be introduced in Section 2.2.

Communication Time Model. The communication time model is defined by

$$t_{j,p} = \frac{t_{j,1}}{p} + (p-1)c_j, \qquad (1)$$

where $c_j$ is a positive real number; the term $(p-1)c_j$ is used to model the communication overhead among different parts of a task. As more processors are assigned, the overhead and the workload $D_{j,p}$ increase; if $p$ is too large, $t_{j,p}$ will not decrease and will even increase as $p$ increases, due to the effect of $(p-1)c_j$. Like Table 1, Table 2 summarizes the related works. Specifically, when all tasks $T_j \in \mathcal{T}$ have the same $c_j = c$, Dutton & Mao (2007) give an online algorithm whose approximation ratio is $2$, $\frac{9}{4}$, and $\frac{20}{9}$ for $m = 2$, $3$, and $4$ respectively, and is $\frac{30}{13}$ when $m \to \infty$. Havill & Mao (2008) propose an online algorithm with an approximation ratio of $\frac{4(m-1)}{m}$ for even $m \ge 2$ and $\frac{4m}{m+1}$ for odd $m \ge 3$. Kell & Havill (2015) improve the work of Dutton & Mao (2007) by giving online algorithms whose approximation ratios are 1.5 and 2 for $m = 2$ and $3$. The following works consider the case that each task $T_j$ has a specific $c_j$. Guo & Kang (2010) give an online algorithm whose approximation ratio is $(1+\sqrt{5})/2$ for $m = 2$, and show that $(1+\sqrt{5})/2$ is a lower bound on the approximation ratio of any online algorithm for the problem with $m \ge 2$. In the offline setting, Benoit et al. (2022a) show that LPA-LIST is a 3-approximation for failure-prone platforms. Benoit et al. (2022b) consider online scheduling of moldable tasks with precedence constraints and give a 3.61-approximation algorithm.

Monotonic Model. To date, the best algorithm for our problem is designed by simply using a general monotonic assumption: $t_{j,p}$ is non-increasing and $D_{j,p}$ is non-decreasing in $p \in [1, m]$, where $\eta_{j,p} \le p$. Fig. 1 illustrates the major algorithm improvements over the past three decades, where $m$ is independent of $n$. Specifically, Belkhale & Banerjee (1990) give a $\frac{2}{1+1/m}$-approximation algorithm. Mounié et al. (1999, 2007) first propose a $(\sqrt{3}+\epsilon)$-approximation algorithm and then a $(\frac{3}{2}+\epsilon)$-approximation algorithm with a complexity $O(mn \log \frac{n}{\epsilon})$, where $\epsilon$ is arbitrarily small. Jansen & Land (2018) achieve an improved complexity polynomial in $\log m$ and $\frac{1}{\epsilon}$ and linear in $n$, although the algorithm is still a $(\frac{3}{2}+\epsilon)$-approximation. Additionally, in the special case where $m \ge 8\frac{n}{\epsilon}$, they give an FPTAS with a complexity $O(n \log^2 m (\log m + \log \frac{1}{\epsilon}))$; the FPTAS thus requires a specific relation between $n$ and $m$. Wu et al. (2023) give a $\frac{3}{2}$-approximation algorithm without $\epsilon$, and its time complexity is $O(mn \log(mn))$ for $m > n$ and $O(n^2 \log n)$ for $m \le n$. As illustrated in Fig. 1, the three recent algorithmic results all have approximation ratios of around 1.5, and it is difficult to lower the best known approximation ratio of 1.5. In this paper, we aim to sacrifice the generality of the monotonic model for a better performance guarantee.
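To relate the communication time model to the monotonic assumption, the following minimal sketch evaluates $t_{j,p} = t_{j,1}/p + (p-1)c_j$ and finds the largest processor count up to which the execution time still strictly decreases. The function names and the values $t_{j,1} = 100$, $c_j = 1$ are illustrative assumptions.

```python
def comm_time(t1, c, p):
    """t_{j,p} = t_{j,1}/p + (p - 1) * c_j under the communication time model."""
    return t1 / p + (p - 1) * c

def last_useful_p(t1, c, m):
    """Largest p in [1, m] reached while adding a processor strictly reduces t_{j,p}."""
    p = 1
    while p < m and comm_time(t1, c, p + 1) < comm_time(t1, c, p):
        p += 1
    return p

# Hypothetical task: t_{j,1} = 100, c_j = 1, on a machine with m = 64 processors.
p_star = last_useful_p(100, 1, 64)
works = [p * comm_time(100, 1, p) for p in range(1, p_star + 1)]
```

For these values the execution time stops improving at $p = 10$, while the workload $p\,t_{j,p} = t_{j,1} + p(p-1)c_j$ grows with $p$ throughout, which is the behavior the monotonic model captures on that range.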
In the case where $n$ is independent of $m$, we hope to develop a $\rho$-approximation algorithm with $\rho < \frac{3}{2}$ and will revisit the related speedup models. Under the monotonic assumption, we have the following bounds of the execution time $t_{j,\gamma(j,d)}$ when a task $T_j$ is assigned $\gamma(j,d)$ processors, which will also hold in this paper:

$$\frac{\gamma(j,d)-1}{\gamma(j,d)}\, d < t_{j,\gamma(j,d)} \le d. \qquad (2)$$

The upper bound is the definition of $\gamma(j,d)$; the lower bound follows for $\gamma(j,d) \ge 2$ because $t_{j,\gamma(j,d)-1} > d$ and $D_{j,\gamma(j,d)-1} \le D_{j,\gamma(j,d)}$. By the definition of $\gamma(j,d)$, $D_{j,\gamma(j,d)}$ is the minimum workload needed to complete $T_j$ by time $d$. Suppose that an algorithm produces a schedule of makespan $d$. We observe that it is a $\frac{1}{\theta}$-approximation to makespan minimization if every task $T_j \in \mathcal{T}$ has the minimum workload and the aggregate workload processed on the $m$ processors in $[0,d]$ is $\ge \theta m d$, where $\theta$ is a lower bound of the processor utilization. Our objective is to make $\theta$ large (e.g., $\theta > \frac{2}{3}$). For each task $T_j$ with large $\gamma(j,d)$ (e.g., $\gamma(j,d) \ge 4$), we have by Inequality (2) that executing it on $\gamma(j,d)$ processors alone makes these processors achieve a high utilization in $[0,d]$. One main challenge comes from tasks with smaller $\gamma(j,d)$. Then, a more precise speedup description than monotonicity could help, which is fortunately available in the literature; it allows quantitatively characterizing the execution time reduction while keeping the workload constant, when the number $p$ of processors assigned to a task $T_j$ changes from $\gamma(j,d)$ to a larger value. We can thus obtain some desired properties and design a schedule under which the $m$ processors achieve a high overall utilization in $[0,d]$ under some additional constraints (see Section 3).

The Proposed Speedup Model. In studying the linear-speedup model, Drozdowski (1996) points out that it is typical of parallel applications that the speedup is linear when $p$ is within a relatively small $\delta_j$; assigning more than $\delta_j$ processors to execute $T_j$ becomes less efficient. This model sets the parallelism bound of $T_j$ to be $\delta_j$, although it may be worth exploring the opportunity of assigning more processors to each task $T_j$ to get better resource efficiency. Complementarily, the function (1) of the communication time model is also tested on the widely used NAS parallel benchmarks and HPLinpack, which embody various computations with typical communication patterns for evaluating the performance of parallel systems (John & Eeckhout, 2018); here, an instance of a type of computation represents a task. The benchmarking results of Dutton et al. (2008) show that the function (1) can well approximate the execution times of tasks and also indicate that the factor $c_j$ is far smaller than $t_{j,1}$:
• when $p$ is small (up to a threshold $\delta_j$), the effect of $(p-1)c_j$ on $t_{j,p}$ is negligible compared with the term $\frac{t_{j,1}}{p}$, and the speedup coincides accurately with the linear-speedup model (Drozdowski, 1996);
• assigning more than $\delta_j$ processors to execute $T_j$ becomes less efficient: its execution time still decreases as $p$ increases but its workload starts to increase, similarly to monotonic tasks;
• finally, there may be a larger threshold $k_j$ such that when $p > k_j$, its execution time no longer decreases and even increases as $p$ increases, since parallelizing on too many processors incurs an unacceptable overhead.
Thus, we associate every task $T_j$ with two thresholds $\delta_j$ and $k_j$, where $\delta_j \le k_j$, to distinguish the speedup modes of $T_j$ when $p$ is in different ranges; then, we make the following definition on which we will base the algorithmic design of this paper.

Definition 1. A task $T_j \in \mathcal{T}$ is $(\delta_j, k_j)$-monotonic if it is moldable and satisfies:
1. When $p \in [1, \delta_j]$, its workload remains constant and the speedup is linear, i.e., $D_{j,1} = D_{j,p} = p \times t_{j,p}$;
2. If $\delta_j < k_j$, its workload is increasing and its execution time is decreasing in $p \in [\delta_j, k_j]$, i.e., $D_{j,p} < D_{j,p+1}$ and $t_{j,p} > t_{j,p+1}$ for $p \in [\delta_j, k_j - 1]$;
3. The parameter $k_j$ is a parallelism bound, i.e., the maximum number of processors allowed to be assigned to $T_j$.
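Definition 1 can be checked mechanically when the execution times of a task are tabulated. The sketch below is one possible reading; the input format (`times[p-1]` holding $t_{j,p}$ for $p = 1, \ldots, k_j$), the tolerance, and the sample values are assumptions made for illustration.

```python
def is_delta_k_monotonic(times, delta_j, k_j, tol=1e-9):
    """Check points 1 and 2 of Definition 1 for execution times t_{j,1..k_j}.

    times[p - 1] is t_{j,p}; point 3 is reflected by only tabulating p <= k_j.
    """
    if len(times) != k_j or not 1 <= delta_j <= k_j:
        return False
    work = [p * t for p, t in enumerate(times, start=1)]  # D_{j,p} = p * t_{j,p}
    # Point 1: the workload stays constant (linear speedup) for p in [1, delta_j].
    if any(abs(work[p - 1] - work[0]) > tol for p in range(1, delta_j + 1)):
        return False
    # Point 2: for p in [delta_j, k_j - 1], D_{j,p} < D_{j,p+1} and t_{j,p} > t_{j,p+1}.
    for p in range(delta_j, k_j):
        if not (work[p - 1] < work[p] and times[p - 1] > times[p]):
            return False
    return True

# Hypothetical task: linear up to 3 processors, then diminishing returns up to 5.
sample = [12.0, 6.0, 4.0, 3.5, 3.2]
```

With these sample values the task is $(3,5)$-monotonic but not $(4,5)$-monotonic, since its workload already grows when moving from three to four processors.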
In Definition 1, the second point implies that assigning more than $\delta_j$ processors to $T_j$ is less efficient, but its execution time is still decreasing in $p \in [\delta_j, k_j]$. The third point reflects that when $p > k_j$, the workload begins to increase to an unacceptable extent such that the execution time does not decrease any more (i.e., $\eta_{j,k_j} \ge \eta_{j,p}$); then, assigning more than $k_j$ processors to $T_j$ cannot bring any benefit. Overall, when $p \in [1, k_j]$, $t_{j,p}$ is non-increasing in $p$ while $D_{j,p}$ is non-decreasing in $p$.

Relations with the Monotonic and Linear-Speedup Models. For each task $T_j \in \mathcal{T}$, the speedup model defines the way that $t_{j,p}$ and $D_{j,p}$ change with the number $p$ of allocated processors, where $t_{j,1}$ is known and $D_{j,p} = p \times t_{j,p}$. We consider the problem of offline scheduling of independent moldable tasks on identical machines, and the objective is either makespan minimization or throughput maximization. By Definition 1, a task whose speedup is linear is also a $(\delta_j, k_j)$-monotonic task with $k_j = \delta_j$. Thus, the linear-speedup model is a special case of the $(\delta_j, k_j)$-monotonic model; for a given objective, any $\rho$-approximation algorithm for the problem under the $(\delta_j, k_j)$-monotonic model of this paper is therefore also a $\rho$-approximation algorithm for the problem under the linear-speedup model, as illustrated in Fig. 2. A problem A is S-reducible to a problem B if any instance of A can be transformed into an instance of B with the same optimal objective function value, and any solution for B can be transformed into a solution for A with the same objective function value (Crescenzi et al., 2016). There exists an S-reduction from the problem under the $(\delta_j, k_j)$-monotonic model to the problem under the monotonic model, which is proved in Appendix A, and thus any $\rho$-approximation algorithm for the monotonic model can be transformed into a $\rho$-approximation algorithm for the $(\delta_j, k_j)$-monotonic model.
Algorithms for scheduling problems with a more general speedup model have more extensive applicability, as illustrated in Fig. 2. However, algorithms under specific models are still important since they may be designed more finely to have better approximation ratios. For example, for online scheduling of moldable task graphs to minimize the makespan, Benoit et al. (2022b) give a 2.62-approximation algorithm for the linear-speedup model and a 3.61-approximation algorithm for the communication time model; they also generalize these speedup models and give a 5.72-approximation algorithm under the generalized model.

Algorithmic Results
Consider a set $\mathcal{T}$ in which each task $T_j$ is $(\delta_j, k_j)$-monotonic. Given a task $T_j$, its parameters $\delta_j$ and $k_j$ are fixed; as reported in (Dutton et al., 2008), $\delta_j$ and $k_j$ typically range in $[25, 150]$ and $[250, 512]$ respectively, depending on the types of computation embodied in the tasks of $\mathcal{T}$. We denote by $\delta$ the minimum linear-speedup threshold of all tasks and by $k$ the maximum parallelism bound of all tasks, i.e.,

$$\delta = \min_{T_j \in \mathcal{T}} \{\delta_j\} \quad \text{and} \quad k = \max_{T_j \in \mathcal{T}} \{k_j\}. \qquad (3)$$

The number $m$ of processors is large since our problem arises in large-scale parallel systems such as supercomputers and cloud computing clusters (Jain et al., 2012; Aridor et al., 2005), e.g., supercomputers can have $m = 2^{16}$ processors (Aridor et al., 2005). Like (Jain et al., 2012), we assume in this paper that $m$ is much larger than the maximum parallelism bound of tasks, i.e., $m \gg k$.
Let $t_m$ denote the maximum execution time of tasks when they are executed on one processor, i.e., $t_m = \max_{T_j \in \mathcal{T}} \{t_{j,1}\}$. In this paper, for any $\delta \ge 5$, the main algorithmic result is a $\frac{1}{\theta(\delta)}(1+\epsilon)$-approximation algorithm for makespan minimization with a complexity $O(n \log m \log(n m t_m/\epsilon))$, where

$$u = \lceil \sqrt{\delta} \rceil - 1 \quad \text{and} \quad \theta(\delta) = \frac{u+1}{u+2}\left(1 - \frac{k}{m}\right).$$

The algorithm achieves an approximation ratio $\frac{1}{\theta(\delta)}$ close to $\frac{u+2}{u+1}$ since $m \gg k$. Typically, the minimum linear-speedup threshold $\delta$ has an effective range of $[25, 150]$ (Dutton et al., 2008). In the worst case that $\delta = 25$, $\frac{1}{\theta(\delta)}$ is close to $\frac{6}{5}$; when $\delta = 150$, $\frac{u+2}{u+1} = \frac{14}{13} \approx 1.077$, which is close to 1. The larger the threshold $\delta$, the better the guarantee of the proposed algorithm. Under mild assumptions, we thus realize our goal of sacrificing the generality of the monotonic model for a better approximation ratio.
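The arithmetic behind these ratios can be reproduced with exact rational numbers. The sketch below assumes $u = \lceil \sqrt{\delta} \rceil - 1$ and $\theta(\delta) = \frac{u+1}{u+2}(1 - \frac{k}{m})$ as stated in the abstract; the function names are illustrative, and the values $k = 512$ and $m = 2^{16}$ are picked from the ranges quoted in the text.

```python
import math
from fractions import Fraction

def u_of(delta):
    """u = ceil(sqrt(delta)) - 1, computed with integer arithmetic only."""
    s = math.isqrt(delta)
    ceil_sqrt = s if s * s == delta else s + 1
    return ceil_sqrt - 1

def theta(delta, k, m):
    """theta(delta) = (u + 1)/(u + 2) * (1 - k/m) as an exact fraction."""
    u = u_of(delta)
    return Fraction(u + 1, u + 2) * (1 - Fraction(k, m))

for delta in (25, 150):
    u = u_of(delta)
    print(delta, u, Fraction(u + 2, u + 1))   # prints "25 4 6/5" then "150 12 14/13"

ratio = 1 / theta(25, 512, 2**16)             # makespan ratio before the (1 + eps) factor
```

With $k = 512$ and $m = 2^{16}$, the factor $1 - \frac{k}{m} = \frac{127}{128}$ moves the ratio for $\delta = 25$ from $\frac{6}{5} = 1.2$ to $\frac{768}{635} \approx 1.209$, illustrating why $m \gg k$ makes the bound close to $\frac{u+2}{u+1}$.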
For throughput maximization with a given deadline $\tau$, we assume that every task $T_j \in \mathcal{T}$ can be finished by time $\tau$, i.e., $\gamma(j, \tau) \in [1, k_j]$. As a by-product, another algorithmic result of this paper is a $\theta(\delta)$-approximation algorithm with a complexity $O(n^2 \log m)$ to maximize the throughput with a deadline $\tau$. To the best of our knowledge, we are the first to address this scheduling objective for moldable tasks, while this objective has been addressed for other types of parallel tasks in the literature of scheduling theory (Jansen & Zhang, 2007; Fishkin et al., 2005).
The rest of this paper is organized as follows. In Section 2, we give more related works. In Section 3, we give an overview of the ideas developed in this paper. The following two sections elaborate these ideas. In particular, in Section 4, we propose a scheduling algorithm Sched that produces a schedule with several features described in Section 3. In Section 5, we show the application of Sched to the objectives of makespan minimization and throughput maximization with a deadline, respectively. Finally, we conclude this paper in Section 6.

Makespan Minimization
The problem of scheduling moldable tasks to minimize the makespan is strongly NP-hard when $m \ge 5$ (Drozdowski, 2004). There is a long history of study with continuous improvements to the approximation ratio or time complexity. Turek et al. (1992) consider moldable tasks without monotonicity and propose a two-phase approach: (i) determine the number of processors assigned to each task and (ii) solve the resulting strip packing problem; the latter has been well studied, e.g., we can directly use the 2-approximation algorithm of Steinberg (1997). Further, the authors show that any $\lambda$-approximation algorithm of complexity $O(f(m,n))$ for strip packing can be transformed into a $\lambda$-approximation algorithm of complexity $O(mnf(m,n))$ for our problem. In the special case of monotonic tasks, Ludwig & Tiwari (1994) improve the transformation complexity to $O(n \log^2 m + f(m,n))$. Jansen & Porkolab (2002) formulate the original problem as a linear program. They propose a polynomial time approximation scheme (PTAS) when the number $m$ of processors is constant; here, the complexity is exponential in $m$. Further, Jansen & Thöle (2010) propose a PTAS when $m$ is polynomially bounded in the number $n$ of tasks. In the case of an arbitrary number of processors, Jansen (2012) also proposes a polynomial time $(\frac{3}{2}+\epsilon)$-approximation algorithm for any fixed $\epsilon$. Barketau et al. (2014) give an optimal enumerative algorithm whose time complexity is O(n 3 2 (2n+m−2)n ). In the special case of $n$ identical tasks, Decker et al. (2006) give a $\frac{5}{4}$-approximation algorithm.

As introduced in Section 1, of great relevance to our work are Mounié et al. (1999, 2007) and Jansen & Land (2018), which use similar techniques for monotonic tasks. For example, Mounié et al. (2007) apply the dual approximation technique (Hochbaum & Shmoys, 1987): it takes a real number $d$ as an input, and either outputs a schedule of makespan $\le \frac{3}{2}d$ or answers correctly that $d$ is a lower bound of the optimal makespan. To realize this, tasks are mainly classified into two subsets $\mathcal{T}_1$ and $\mathcal{T}_2$ whose tasks are respectively assigned $\gamma(j, d)$ and $\gamma(j, \frac{d}{2})$ processors; the classification aims at minimizing the total workload $W$ of $\mathcal{T}_1$ and $\mathcal{T}_2$ while guaranteeing that the total number of processors assigned to $\mathcal{T}_1$ is $\le m$, which is formulated as a knapsack problem. If the optimal $W$ exceeds the processing capacity of the $m$ processors, there exists no schedule with a makespan $< d$. Otherwise, the total number of processors assigned to $\mathcal{T}_2$ may exceed $m$, and a series of reductions to the numbers of processors assigned to the tasks of $\mathcal{T}_1$ and $\mathcal{T}_2$ is taken to get a feasible schedule, in which the tasks are assigned to different parts of the processors in different time intervals.

Finally, our problem has also been studied well when the speedup $\eta_{j,p}$ is a concave or convex function of $p$ (Blazewicz et al., 2004, 2006; Barketau et al., 2014; Ebrahimi et al., 2018), which is less relevant to the speedup model of this paper. We do not discuss these works further.

Throughput Maximization
Several works have considered scheduling other types of parallel tasks to maximize the throughput. Jansen & Zhang (2007) and Fishkin et al. (2005) consider scheduling rigid tasks with a common deadline, e.g., the former apply the theory of the knapsack problem and linear programming to propose a $(\frac{1}{2}+\epsilon)$-approximation algorithm. Jain et al. (2012) consider malleable tasks with individual deadlines. Each task has a linear speedup within a parallelism bound, and there is a parameter $s$ used to characterize the minimum delay-tolerance of all tasks: each $T_j$ has to be finished in a time window $[a_j, d_j]$; it has the minimum execution time $len_j$ when assigned $\delta_j$ processors; $s$ is the minimum ratio of $d_j - a_j$ to $len_j$ among all tasks. For offline scheduling, Jain et al. (2012) propose a greedy $\frac{m-k}{m} \cdot \frac{s-1}{s}$-approximation algorithm, where $k$ is the maximum parallelism bound of all tasks. Wu & Loiseau (2015) prove that the best approximation ratio that the type of greedy algorithms of (Jain et al., 2012) can achieve is $\frac{s-1}{s}$ and propose such an algorithm with a time complexity of $O(n^2)$; they also show a sufficient and necessary condition under which a set of malleable tasks with deadlines can feasibly be scheduled on a fixed number of processors and propose an exact algorithm by dynamic programming that has a time complexity of $O(\max\{n^2, n(mT)^T\})$, where $T$ is the maximum deadline of tasks. Guo & Shen (2017) give a $\frac{m-k}{m}$-approximation algorithm with a time complexity of $O(n^2 + nT)$ and also an exact algorithm with a time complexity of $O(n(mT)^T)$. For online scheduling, Lucier et al. (2013) propose a $\left(2 + O\left(1/(\sqrt[3]{s} - 1)^2\right)\right)$-approximation algorithm. In cloud computing clusters, many applications are delay-tolerant, where $s \gg 1$ and $m \gg k$. Thus, their algorithms achieve good approximation ratios in practical settings.

Overview of the Approaches
Central to our algorithm design is an algorithm Sched that aims to schedule a set $\mathcal{T}$ of tasks on the $m$ processors in a time interval $[0, d]$ and achieves a processor utilization $\ge \theta(\delta)$ on the conditions that (i) each scheduled task $T_j$ has a workload $D_{j,\gamma(j,d)}$, which is the minimum workload to finish $T_j$ by time $d$, and (ii) some task of $\mathcal{T}$ is rejected due to the insufficiency of idle processors (see Section 4). We establish the connection of Sched with our two problems in the following ways.
For makespan minimization, we need to schedule all tasks of $\mathcal{T}$, while the utilization guarantee of Sched applies only when a part of the tasks is scheduled. We apply a binary search procedure to find two parameters $U$ and $L$ such that Sched can schedule all tasks by time $U$ but only a part of the tasks by time $L$, with the relation $U \le L(1+\epsilon)$ (see Section 5.1). Let $d^*$ denote the optimal makespan. We can establish the relation between $U$ and $d^*$ via $L$ and prove $U/d^* \le \frac{1}{\theta(\delta)}(1+\epsilon)$, thus showing that the resulting algorithm is a $\frac{1}{\theta(\delta)}(1+\epsilon)$-approximation. Specifically, in the case that $d^* \in [L, U]$, we have $U/d^* \le U/L \le 1+\epsilon \le (1+\epsilon)/\theta(\delta)$ trivially. In the case that $d^* < L$, the total workload of all tasks of $\mathcal{T}$ in an optimal schedule is $\le m d^*$ but $\ge$ the total workload processed when Sched manages to schedule a part of the tasks of $\mathcal{T}$ by time $L$. Thus, we have $m d^* \ge m\theta(\delta)L \ge m\theta(\delta)U/(1+\epsilon)$ and $U \le d^*(1+\epsilon)/\theta(\delta)$.
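The search for $L$ and $U$ can be sketched as follows. This is only an illustration of the invariant described above, not the procedure of Section 5.1: `sched_all` is a hypothetical oracle that returns `True` exactly when Sched places every task within $[0, d]$, and the initial bounds `lo > 0` and `hi` are assumed to be supplied with `sched_all(lo)` false and `sched_all(hi)` true.

```python
def search_makespan(sched_all, lo, hi, eps):
    """Return (L, U) with sched_all(L) False, sched_all(U) True, U <= (1 + eps) * L.

    sched_all(d): hypothetical oracle, True iff Sched schedules all tasks by time d.
    Assumes lo > 0, sched_all(lo) is False and sched_all(hi) is True on entry.
    """
    L, U = lo, hi
    while U > (1 + eps) * L:
        mid = (L + U) / 2
        if sched_all(mid):
            U = mid      # all tasks fit by time mid: tighten the upper end
        else:
            L = mid      # some task was rejected at mid: raise the lower end
    return L, U

# Toy oracle standing in for Sched: everything fits as soon as d >= 10.
L, U = search_makespan(lambda d: d >= 10, 1.0, 100.0, 0.01)
```

The loop maintains "rejected at $L$, fully scheduled at $U$" regardless of how the oracle behaves between the two ends, and it stops once $U \le L(1+\epsilon)$, which is the relation used in the analysis.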
For throughput maximization, $v_j / D_{j,\gamma(j,d)}$ is the maximum possible value obtained from processing a unit of workload of $T_j$, called its value density. We accept as many tasks as possible in non-increasing order of their value densities until Sched cannot produce a feasible schedule by time $\tau$; then, the utilization guarantee of Sched implies that $\theta(\delta)$ is the approximation ratio of the resulting algorithm (see Section 5.2).
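The greedy acceptance rule can be sketched as below. The tuple layout, the task names, and the `feasible` oracle (standing in for a call to Sched with deadline $\tau$) are hypothetical; the sketch only illustrates ordering by value density and stopping at the first rejection, not the full algorithm of Section 5.2.

```python
def greedy_by_density(tasks, feasible):
    """Accept tasks in non-increasing value density until the oracle rejects one.

    tasks: list of (name, value, min_workload), min_workload = D_{j,gamma(j,tau)}.
    feasible(names): hypothetical oracle, True iff Sched finishes `names` by tau.
    """
    order = sorted(tasks, key=lambda t: t[1] / t[2], reverse=True)
    accepted = []
    for name, _value, _work in order:
        if not feasible(accepted + [name]):
            break                      # first rejection ends the acceptance phase
        accepted.append(name)
    return accepted

# Toy instance: the oracle accepts any set whose total minimum workload is <= 8.
work = {"a": 5, "b": 3, "c": 4}
tasks = [("a", 10, 5), ("b", 9, 3), ("c", 4, 4)]
chosen = greedy_by_density(tasks, lambda names: sum(work[x] for x in names) <= 8)
```

In the toy instance the densities are 3 for `b`, 2 for `a`, and 1 for `c`, so `b` and `a` are accepted and `c` is the first task rejected.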
Finally, the design of Sched relies on the properties of the speedup model in Definition 1 to classify the tasks of $\mathcal{T}$. The threshold $\delta$ in Equation (3) is a fixed parameter and we have the following property by Definition 1.
Property 3.1. If a task $T_j$ is $(\delta_j, k_j)$-monotonic, we have that (i) the workload $D_{j,p}$ is non-decreasing and the execution time $t_{j,p}$ is non-increasing in the number $p$ of assigned processors when $p \in [1, k_j]$ and (ii) the speedup is linear when $p \in [1, \delta]$, i.e., $t_{j,p} = \frac{t_{j,1}}{p}$.

For a task $T_j \in \mathcal{T}$, its execution time on $p$ processors is defined by $t_{j,1}$ and $\eta_{j,p}$. Given the time $d$, $\gamma(j,d) = \min\{p \in [1, k_j] \mid t_{j,p} \le d\}$ is a fixed parameter and can be found by binary search (Jansen & Land, 2018). The classification of tasks for the scheduling process mainly uses three integer variables $\nu$, $H$ and $\delta'$ and is based on the values of $\gamma(j,d)$, $t_{j,\gamma(j,d)}$ and $t_{j,\delta'}$; it attempts to guarantee that the aggregate execution time is in $[rd, d]$ when some tasks in the same class are executed on a group of $\gamma(j,d)$ or $\delta'$ processors. Specifically, $\nu$ and $H$ are for distinguishing tasks with different $\gamma(j,d)$: a task $T_j$ is said to have a large, medium, or small $\gamma(j,d)$ if $\gamma(j,d) \ge H$, $\gamma(j,d) \in [\nu, H-1]$, or $\gamma(j,d) \le \nu - 1$, respectively. Let $r = \frac{H-1}{H}$; we will use $rd$ and $(1-r)d$ to distinguish tasks with different execution times. The first class of tasks, denoted by $A'$, includes every task that has a large execution time $\ge rd$ when assigned a group of $\gamma(j,d)$ processors (see Equation (6)), e.g., every task with large $\gamma(j,d)$ has such a feature by Inequality (2).
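Because $t_{j,p}$ is non-increasing on $[1, k_j]$ by Property 3.1, $\gamma(j,d)$ can be located with $O(\log k_j)$ evaluations of the execution time. The following is a minimal sketch of that binary search; the callable `t` (returning $t_{j,p}$ for a given $p$) and the sample task are assumptions for illustration.

```python
import math

def gamma_binary(t, k_j, d):
    """Smallest p in [1, k_j] with t(p) <= d, assuming t is non-increasing.

    Returns math.inf when the task cannot finish by d even on k_j processors.
    """
    if t(k_j) > d:
        return math.inf
    lo, hi = 1, k_j                # invariant: t(hi) <= d
    while lo < hi:
        mid = (lo + hi) // 2
        if t(mid) <= d:
            hi = mid               # mid processors suffice, look for fewer
        else:
            lo = mid + 1           # mid processors are too few
    return lo

# Hypothetical task with linear speedup, t_{j,1} = 100 and k_j = 64.
linear = lambda p: 100 / p
g = gamma_binary(linear, 64, 7)
```

For this task, 14 processors give $100/14 \approx 7.14 > 7$ while 15 give $100/15 \approx 6.67 \le 7$, so the search returns 15.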
For the remaining tasks with medium or small $\gamma(j,d)$, we will maintain several relations among $\nu$, $H$, $\delta'$ and $\delta$. For example, by letting $H - 1 \le \delta' \le \delta$, the speedup is linear and the workload stays constant when the number $p$ of assigned processors ranges in $[\gamma(j,d), \delta']$. These relations finally enable the following properties:
• The tasks with small $\gamma(j,d)$ whose execution times are $< rd$ when assigned $\gamma(j,d)$ processors are denoted by $B_{\nu-1}$, and their execution times will decrease remarkably (by a factor of at least $\frac{\delta'}{\nu-1}$) to a small value $< (1-r)d$ when assigned $\delta'$ processors (see Equation (7) and Lemma 3). Executing as many such tasks as possible on a group of $\delta'$ processors in $[0, d]$ will lead to an aggregate execution time $\ge rd$.
• Let $h$ be an integer in $[\nu, H-1]$. The tasks with $\gamma(j,d) = h$ whose execution times are $\ge (1-r)d$ when assigned $\delta'$ processors and $< rd$ when assigned $h$ processors are denoted by $A_h$, and there exists a positive integer $x_h$ such that the aggregate execution time is in $[rd, d]$ when $x_h$ such tasks are executed one by one on a group of $\delta'$ processors (see Equation (11) and Proposition 5).
Finally, each group of $\gamma(j,d)$ or $\delta'$ assigned processors described above can achieve a utilization $\ge r$ in $[0, d]$. The overall utilization $\theta(\delta)$ of the $m$ processors is close to $r$ and can be derived when some task is rejected due to the insufficiency of processors, with at most $k - 1$ processors idle. The task classification and maintained relations are formally described in Section 4.1, with other related issues solved.
The scheduling algorithm Sched is given in Section 4.2.

The Algorithm Sched
In this section, we consider the case that every task $T_j \in \mathcal{T}$ can be finished by time $d$, i.e., $\gamma(j,d) \in [1, k_j]$.

Task Classification
Following the high-level ideas in Section 3, we now elaborate the task classification. For ease of reference, we first summarize, as relations (4a)-(5b), the maintained relations between the fixed parameter $\delta$, the integer variables $H, \nu, \delta', x_\nu, \ldots, x_{H-1}$, and the number $r = \frac{H-1}{H}$, where the meanings of these variables and of the number $r$ will be clarified later. As we classify tasks and prove their properties, the underlying reasons why these relations are established to get the desired properties will gradually become apparent.
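At the end of this subsection, we will give a feasible solution of $H, \nu, \delta', x_\nu, \ldots, x_{H-1}$ that satisfies the relations (4a)-(5b). Fig. 3 summarizes how to classify a task $T_j \in \mathcal{T}$ according to its value of $\gamma(j,d)$ and its execution time on $\gamma(j,d)$ or $\delta'$ processors. Specifically, the first class of tasks contains all tasks whose execution times $t_{j,\gamma(j,d)}$ are $\ge rd$ when assigned $\gamma(j,d)$ processors and is defined as

$$A' = \{T_j \in \mathcal{T} \mid t_{j,\gamma(j,d)} \ge rd\}. \qquad (6)$$

By Inequality (2), $A'$ contains every task with large $\gamma(j,d)$; $A'$ also includes a part of the tasks with smaller $\gamma(j,d)$, namely those with $t_{j,\gamma(j,d)} \ge rd$. Except for $A'$, the remaining tasks have medium or small $\gamma(j,d)$ and each has an execution time $t_{j,\gamma(j,d)} < rd$. Among these tasks, let $B_{\nu-1}$ denote all tasks with $\gamma(j,d) \le \nu - 1$, i.e.,

$$B_{\nu-1} = \{T_j \in \mathcal{T} \mid t_{j,\gamma(j,d)} < rd,\ \gamma(j,d) \le \nu - 1\}. \qquad (7)$$

The second class of tasks is defined in Equation (8). For each task $T_j$ with $\gamma(j,d) \le H - 1$, the relation (4a) ensures by Property 3.1 that the speedup is linear when the number of processors assigned to $T_j$ changes from $\gamma(j,d)$ to $\delta'$; when assigned $\delta'$ processors, its execution time $t_{j,\delta'}$ satisfies

$$t_{j,\delta'} = \frac{\gamma(j,d)}{\delta'}\, t_{j,\gamma(j,d)}. \qquad (10)$$

Proof. For every task $T_j \in B_{\nu-1}$, the execution time of $T_j$ satisfies

$$t_{j,\delta'} \overset{(a)}{=} \frac{\gamma(j,d)}{\delta'}\, t_{j,\gamma(j,d)} \overset{(b)}{<} \frac{\nu-1}{\delta'}\, rd \overset{(c)}{\le} (1-r)d,$$

where the above (a), (b) and (c) are due to Equation (10), Equation (7) and the relation (4c) respectively.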
Proof. It follows from Lemma 3 and the definition of $B_{H-1}$ in Equation (8).
Finally, the remaining tasks are all tasks with $\gamma(j,d) \in [\nu, H-1]$, and each has an execution time $t_{j,\gamma(j,d)} < rd$ when assigned $\gamma(j,d)$ processors and $t_{j,\delta'} \ge (1-r)d$ when assigned $\delta'$ processors. For each $h \in [\nu, H-1]$, a single class of tasks $A_h$ is defined in Equation (11) to contain all such tasks with $\gamma(j,d) = h$.
Proposition 5. When a task is assigned $\delta'$ processors, we have that (i) for every task $T_j \in A_h$, its execution time $t_{j,\delta'}$ is $< l_h d$ where $l_h = \frac{h}{\delta'} r$; (ii) the aggregate execution time of any $x_h$ tasks of $A_h$ is in $[rd, d]$.
Proof. The relation (4b) implies that $\nu$ is the maximum possible integer such that the relation (4c) can hold. Let us consider every task $T_j \in A_h$ and by the definition of $A_h$ in Equation (11), we have where $\gamma(j,d) = h$. We have by Lemma 2 that the execution time of this task $T_j$ satisfies Thus, by Equation (10), we have where the above (d) is due to Inequality (12), and (e) is due to Inequality (14). By Inequalities (13), (15) and (16), we have for any task While executing any $x_h$ tasks of $A_h$ one by one on $\delta'$ processors, the relations (5a) and (5b) ensure that their aggregate execution time is in $[rd, d]$. Together with Inequality (15), Proposition 5 thus holds.
Propositions 4 and 5 enable us to design good schedules. Executing as many tasks from $A''$ as possible on $\delta'$ processors by time $d$ ensures that these processors have a utilization $\ge r$ in $[0, d]$. This also holds for the tasks of $A_h$ where $h \in [\nu, H-1]$, since at least $x_h$ tasks can be finished by time $d$.
Proposition 6. For a given linear-speedup threshold $\delta \ge 5$, a feasible solution that satisfies the relations (4a)-(5b) is given by Equation (17), where $r = \frac{u+1}{u+2}$.
Proof. The proof consists of verifying that the setting in Equation (17) satisfies the relations (4a)-(5b); its details can be found in Appendix B.
In the rest of this paper, we will set the parameter values in the way described in Proposition 6. Since $\nu = u$ and $H = u + 2$, the tasks of $\mathcal{T}$ are finally classified as $A'$, $A_{u+1}$, $A_u$, $A''$. Finally, we show the time complexity of classifying the tasks of $\mathcal{T}$. When a task $T_j$ is allocated $p \in [1, m]$ machines, the speedup $\eta_{j,p}$ can be accessed via some oracle in constant time (Jansen & Land, 2018), e.g., the oracle can obtain such information by benchmarking studies (Dutton et al., 2008). Theoretically, the value of $k_j$ or $\delta_j$ is a fixed integer in $[1, m]$ and can be obtained by binary search, leading to the proposition below.
Proposition 7.For each task T j ∈ T , the time complexity of finding the value of k j or δ j is O(log m).
Proposition 8. Given the value of d and the values of k j and δ j of each task T j ∈ T , the time complexity of task classification is O(n log m).
Proof.The time complexity of finding the value of δ = min T j ∈T {δ j } in Equation ( 3) is O(n).We can directly compute the value of δ ′ by Equation ( 17).Afterwards, we classify each task T j where we need to check the value of γ(j, d), t j,γ(j,d) , or t j,δ ′ at most four times, as illustrated in Fig. 3; the time complexities of find these values determine the time complexity of classifying a task.Given the execution time t j,1 on one processor, γ(j, d) = min{p ∈ [1, k j ] | t j,p ≤ d} can be found by binary search with a time complexity of O(log k j )≤ O(log m), where k j ≤ m.Given the values of γ(j, d) and δ ′ , t j,γ(j,d) and t j,δ ′ can directly be computed in time O(1).Thus, the time complexity of classifying the n tasks is O(n log m).
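The binary search for $\gamma(j, d)$ used in the proof above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `gamma`, the callable `speedup` (standing in for the constant-time oracle for $\eta_{j,p}$) and the toy speedup function are hypothetical, and the sketch assumes that the execution time $t_{j,1}/\eta_{j,p}$ is non-increasing in $p$ on $[1, k_j]$.

```python
from typing import Callable, Optional


def gamma(t1: float, speedup: Callable[[int], float], k: int, d: float) -> Optional[int]:
    """Smallest p in [1, k] with t1 / speedup(p) <= d, or None if there is none.

    Assumption: t1 / speedup(p) is non-increasing in p on [1, k].
    `speedup` is a hypothetical stand-in for the speedup oracle.
    """
    if t1 / speedup(k) > d:
        # Even k processors miss the deadline: gamma(j, d) = +infinity.
        return None
    lo, hi = 1, k  # invariant: hi processors always meet the deadline
    while lo < hi:
        mid = (lo + hi) // 2
        if t1 / speedup(mid) <= d:
            hi = mid       # mid is enough, try fewer processors
        else:
            lo = mid + 1   # mid is too few
    return lo


def toy_speedup(p: int) -> float:
    """Hypothetical speedup: linear up to 4 processors, flat afterwards."""
    return float(min(p, 4))
```

Each call makes $O(\log k_j)$ oracle queries, which matches the per-task bound used in the proof.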

Algorithm Description
Now, we give the scheduling algorithm Sched, which is presented in Algorithm 1. Let $m'$ denote the number of idle processors; initially, $m' = m$. $\mathcal{T}$ is partitioned into $A'$, $A_{u+1}$, $A_u$, $A''$; these sets are considered in this order, and the tasks in the same set are chosen in an arbitrary order. Following this order, Sched assigns tasks in the following way until all tasks of $\mathcal{T}$ are assigned or there are not enough idle processors:

(i) For each unassigned task $T_j \in A'$, assign it onto $\gamma(j, d)$ idle processors; then, $m' = m' - \gamma(j, d)$ (lines 3-5).

(ii) If $m' \ge \delta'$, divide the idle processors into $\lfloor m'/\delta' \rfloor$ groups, each with $\delta'$ processors. For each group, get unassigned tasks of $A_{u+1} \cup A_u \cup A''$ such that their aggregate execution time on $\delta'$ processors is $\le d$ (lines 10-21); assign these tasks onto the group of processors (line 22).

Here, $k$ and $\delta'$ are given in Equations (3) and (17). Algorithm 1 ends (i) if there are unassigned tasks but the idle processors are not enough ($m' < k$ in line 6 or $m' < \delta'$ in line 8), or (ii) if all tasks of $\mathcal{T}$ have been assigned.

Algorithm 1: Sched (excerpt)
 1  Set the parameters by Proposition 6 and classify the tasks of T
    // X′, X_{u+1}, X_u, X_{u−1}: the currently unassigned tasks
    T_{δ′} ← ∅, t ← 0   // T_{δ′}: the tasks currently chosen for the δ′ processors; t: the aggregate execution time of T_{δ′}
11  while X_l ≠ ∅ do
12      Get an arbitrary task T_j from X_l
        break   // got enough tasks and go to line 22
        Reset l to the maximum such l   // go to line 11
22  Assign the tasks of T_{δ′} on the δ′ idle processors
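Step (ii) can be illustrated by a simplified sketch of how one group of $\delta'$ processors might be filled. This is not Algorithm 1 itself: the function `fill_group`, the `Task` tuple and the closing rule (a group is closed as soon as the next candidate task would end after $d$) are assumptions made for illustration, and the exact selection rule of lines 10-21 is not reproduced here.

```python
from collections import deque
from typing import Deque, List, Tuple

# (name, execution time on delta' processors); a hypothetical representation
Task = Tuple[str, float]


def fill_group(classes: List[Deque[Task]], d: float) -> List[Task]:
    """Choose tasks for one group of delta' processors.

    `classes` lists the queues of unassigned tasks in priority order
    (e.g. A_{u+1}, A_u, A''). Tasks run one after another on the group.
    Assumed closing rule: stop as soon as the next candidate would make
    the aggregate execution time exceed d.
    """
    chosen: List[Task] = []
    total = 0.0
    for queue in classes:
        while queue:
            _, t = queue[0]
            if total + t > d:
                return chosen  # candidate rejected, the group is closed
            chosen.append(queue.popleft())
            total += t
    return chosen  # no unassigned task is left


# Toy instance with d = 1.0 and two classes of tasks.
toy_classes = [deque([("a", 0.4), ("b", 0.4)]), deque([("c", 0.3), ("d", 0.1)])]
first_group = fill_group(toy_classes, 1.0)
second_group = fill_group(toy_classes, 1.0)
```

Calling `fill_group` once per group of $\delta'$ idle processors mirrors the loop over the $\lfloor m'/\delta' \rfloor$ groups; every task is inspected a constant number of times per call.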

Example
Figure 4: Task assignment when $\delta = 5$, where each colored rectangle represents a task of some type.

We give a toy example where $\delta = 5$ to illustrate the execution of Algorithm 1. By Proposition 6, we have $u = \nu = 2$, $H = 4$, $\delta' = 5$, $x_3 = 2$, $x_2 = 3$, and $r = \frac{3}{4}$; then, $\mathcal{T}$ is divided into 4 subsets $A'$, $A_3$, $A_2$, and $A''$ (line 1). Suppose that we are given $m = 33$. Retrospectively, we get six groups from the $m$ processors. The first group has $\gamma(1, d)$ processors, each of the remaining groups has $\delta'$ processors, and there are also 4 ungrouped processors. As illustrated in Fig. 4, Algorithm 1 assigns tasks in the following way: (1) Assign the only task $T_1$ of $A'$ onto the 1st group (lines 3-5).

By the definition of $A'$ in Equation (6) and Propositions 4 and 5, the 1st-2nd and 4th-6th groups have an execution time in $[rd, d]$. The 3rd group of $\delta'$ processors executes a mix of the tasks of $A_3$ and $A_2$; the aggregate execution time of its tasks is $< rd$ but $\ge (1 - l_2)d$, since the rejected task of $A_2$ has an execution time $\le l_2 d = 3d/10$ by Proposition 5. Finally, there is one unassigned task of $A''$ and the number of idle processors is at most $\delta' - 1$. Counting the processors in each of these cases yields the lower bound on the overall processor utilization in $[0, d]$ given in Equation (18).

Algorithm Analysis
Now, we prove the properties of Sched. The following conclusion is a generalization of Equation (18) in the example above.

Proposition 9. If Sched cannot schedule all tasks of $\mathcal{T}$ on the $m$ processors by time $d$, then Sched achieves a processor utilization of at least

$$\theta(\delta) = r \left(1 - \frac{k}{m}\right),$$

where $r = \frac{u+1}{u+2} \in (0, 1)$.

Proof. The proof is a generalization of the analysis used to derive Equation (18). Please see the detailed proof in Appendix C.
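To get a feel for the bound $\theta(\delta) = \frac{u+1}{u+2}\left(1 - \frac{k}{m}\right)$ stated in the abstract, the following sketch evaluates it exactly with rational arithmetic. The function name `utilization_bound` is hypothetical, and $u$ is read as $\lceil \sqrt{\delta} \rceil - 1$, the reading that agrees with the toy example ($\delta = 5$ gives $u = 2$ and $r = 3/4$).

```python
from fractions import Fraction
from math import isqrt


def utilization_bound(delta: int, k: int, m: int) -> Fraction:
    """theta(delta) = (u + 1) / (u + 2) * (1 - k / m), with u = ceil(sqrt(delta)) - 1.

    delta: minimum linear-speedup threshold (delta >= 5),
    k: maximum parallelism bound, m: number of processors (k <= m).
    """
    if delta < 5 or not 1 <= k <= m:
        raise ValueError("expected delta >= 5 and 1 <= k <= m")
    u = isqrt(delta - 1)  # equals ceil(sqrt(delta)) - 1 for every delta >= 1
    r = Fraction(u + 1, u + 2)
    return r * (1 - Fraction(k, m))
```

For instance, $\delta = 25$, $k = 10$ and $m = 1000$ give $u = 4$ and $\theta(\delta) = \frac{5}{6} \cdot \frac{99}{100} = \frac{33}{40}$, so the makespan guarantee $\frac{1}{\theta(\delta)}(1 + \epsilon)$ is roughly $1.21(1 + \epsilon)$ in that setting.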
Proposition 10. Given the value of $d$ and the values of $k_j$ and $\delta_j$ of each task $T_j \in \mathcal{T}$, the time complexity of Algorithm 1 is $O(n \log m)$.

Proof. The time complexity of task classification is $O(n \log m)$ by Proposition 8 (line 1). Afterwards, the $n$ tasks are assigned to processors one by one (lines 4, 12) and Sched stops when all tasks are assigned or there are not enough processors to assign the remaining tasks, which has a time complexity of $O(n)$. Hence, Algorithm 1 has a time complexity of $O(n \log m)$.

Let $S$ denote the tasks accepted and scheduled by Algorithm 1, where $S \subseteq \mathcal{T}$. Recall that $\gamma(j, d)$ denotes the minimum number of processors needed to complete $T_j$ by time $d$. As illustrated in Fig. 3, in Algorithm 1, the number of processors allocated to a task is either $\gamma(j, d)$ or $\delta'$, which is no larger than $\delta$ by Inequality (4a). By Definition 1 and Property 3.1, we have the following lemma.

Lemma 11. In Algorithm 1, we have for every task $T_j \in S$ that its workload is $D_{j,\gamma(j,d)}$, which is the minimum workload needed to be processed to complete $T_j$ by time $d$.

Proof. Please see the detailed proof in Appendix D.
With Proposition 9 and Lemma 11, we have completed the design of the scheduling algorithm Sched described in Section 3.

Application to Two Objectives
In this section, we apply Sched to respectively minimize the makespan and maximize the throughput with a common deadline τ .

Makespan Minimization
Now, we give the algorithm for makespan minimization, which is formally presented in Algorithm 2 and also referred to as the OMS algorithm (Optimized MakeSpan). Its high-level idea is as follows. Initially, let $U$ and $L$ be such that Sched can produce a feasible schedule of all tasks of $\mathcal{T}$ by time $U$ but fails to do so by time $L$, e.g., $U = n(\delta + 2) \max_{T_j \in \mathcal{T}} \{t_{j,1}\}$ and $L = 0$ (line 1); we explain why such a $U$ is feasible in Appendix E. The OMS algorithm repeats the following steps and stops when $U \le (1 + \epsilon)L$ (line 2):

1. $M \leftarrow \frac{U + L}{2}$ (line 3).
2. Check whether there exists a task $T_j \in \mathcal{T}$ that cannot be completed by time $M$ with the parallelism bound $k_j$ (lines 4-8).
3. If $\gamma(j, M) \in [1, k_j]$ for every task $T_j \in \mathcal{T}$ and Sched can produce a feasible schedule of all tasks of $\mathcal{T}$ by time $M$, set $U \leftarrow M$ (lines 9-11); otherwise, set $L \leftarrow M$ (lines 9, 12-15).
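The binary search described above can be sketched as follows. This is a minimal illustration rather than Algorithm 2 itself: the function name `oms_search` and the callback `feasible` are hypothetical, and `feasible(M)` stands in for the combined test that every task has $\gamma(j, M) \in [1, k_j]$ and that Sched schedules all tasks of $\mathcal{T}$ by time $M$.

```python
from typing import Callable


def oms_search(upper: float, eps: float, feasible: Callable[[float], bool]) -> float:
    """Shrink [L, U] until U <= (1 + eps) * L and return the final U.

    `upper` must be a deadline for which `feasible` holds (the initial U);
    L starts at 0. `feasible` is a hypothetical stand-in for the checks
    performed with Sched in each iteration.
    """
    if upper <= 0 or eps <= 0:
        raise ValueError("expected upper > 0 and eps > 0")
    low, high = 0.0, upper
    while high > (1 + eps) * low:
        mid = (high + low) / 2
        if feasible(mid):
            high = mid  # all tasks fit by time mid
        else:
            low = mid   # some task is left unscheduled by time mid
    return high


# Toy oracle: pretend that exactly the deadlines >= 10 are feasible.
toy_result = oms_search(100.0, 0.01, lambda deadline: deadline >= 10.0)
```

The loop keeps the invariant that `high` is feasible and `low` is not, so the returned deadline is within a factor $(1 + \epsilon)$ of a deadline at which the feasibility test failed.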
Algorithm 2: The OMS($\epsilon$) algorithm
    Flag ← 1

In the rest of this subsection, we analyze the approximation ratio and complexity of the algorithm. As shown below, for a task $T_j$, the larger the value of $d$, the smaller the value of $\gamma(j, d)$.

Lemma 12. For a task $T_j$, if $d' < d''$, then $\gamma(j, d') \ge \gamma(j, d'')$.

Let $d^*$ denote the optimal makespan. In an optimal schedule, let $D^*_j$ denote the workload of a task $T_j$ and $D^*$ denote the total workload of all tasks of $\mathcal{T}$ to be processed on the $m$ processors in $[0, d^*]$, where we have

$$D^* = \sum_{T_j \in \mathcal{T}} D^*_j \le m d^*.$$

When the OMS algorithm ends, if $\gamma(j, L) \in [1, k_j]$ for every task $T_j \in \mathcal{T}$, only a part of the tasks are scheduled by Sched by time $L$ and we have by Proposition 9 that $\theta(\delta)$ is a lower bound of the processor utilization in $[0, L]$; we denote by $D^L_j$ the workload of a scheduled task $T_j$ and by $D^L$ the total workload of all the scheduled tasks; here, we have

$$D^L \ge \theta(\delta)\, m L.$$

Lemma 13. When the OMS algorithm ends, if $d^* < L$, then we have that (i) $D^* \ge D^L$ and (ii) $\gamma(j, L) \in [1, k_j]$ for every task $T_j \in \mathcal{T}$.
Proof.For every T j ∈ T , if d * < L, we have γ(j, L) ∈ [1, k j ] since T j can be finished by d * , with the parallelism bound k j .By Lemma 12, if d ′ < d ′′ , we have γ(j, d ′ ) ≥ γ(j, d ′′ ).Since d * < L, we have in an optimal schedule that the number of processors assigned to a task T j is ≥ γ(j, d * ), which is ≥ γ(j, L).By Property 3.1, we have D * j ≥ D j,γ(j,d * ) ≥ D j,γ(j,L) .By Lemma 11, we have D j,γ(j,L) = D L j .Finally, we have D * j ≥ D L j and D * ≥ D L .
Proposition 14. The OMS algorithm gives a $\frac{1}{\theta(\delta)}(1 + \epsilon)$-approximation to the makespan minimization problem with a complexity of $O(n \log m \log(n m t_m/\epsilon))$ where $t_m = \max_{T_j \in \mathcal{T}} \{t_{j,1}\}$.

Executing the OMS algorithm needs prior knowledge of the values of $k_j$ and $\delta_j$ of all the $n$ tasks of $\mathcal{T}$, which are used for computing the upper bound $U$ and in calling Sched (lines 1, 10, 12); the time complexity of obtaining these values is $O(n \log m)$ by Proposition 7. While executing the OMS algorithm, the initial values of $U$ and $L$ are $n(\delta + 2)t_m$ and 0. The binary search stops when $U \le L(1 + \epsilon)$ and the number of iterations is $O(\log(n \delta t_m/\epsilon)) \le O(\log(n m t_m/\epsilon))$, where $\delta \le m$. At each iteration, the time complexity of computing $\gamma(j, d)$ is $O(\log k_j) \le O(\log m)$, as shown in the proof of Proposition 8; checking whether there exists a task $T_j \in \mathcal{T}$ that cannot be completed by time $M$ (lines 4-8) thus has a time complexity of $O(n \log m)$; then, Sched is run (line 10 or 12) and has a time complexity of $O(n \log m)$ by Proposition 10. The entire execution process has a time complexity of $O(n \log m \log(n m t_m/\epsilon))$, which is also the complexity of the OMS algorithm.

Throughput Maximization with a Common Deadline
Let $v'_j = v_j / D_{j,\gamma(j,\tau)}$; it is the maximum possible value obtained from processing a unit of workload of $T_j$, referred to as the (maximum) value density of $T_j$. We assume without loss of generality that

$$v'_1 \ge v'_2 \ge \cdots \ge v'_n.$$

We propose a greedy algorithm called GreedyAlgo, presented in Algorithm 3: it considers tasks in the non-increasing order of their value densities $v'_j$ and finally finds the maximum $i'$ such that Sched can output a feasible schedule by time $\tau$ for the first $i'$ tasks, denoted by $S_{i'}$, but fails to do so for the first $i' + 1$ tasks. The throughput of GreedyAlgo is $\sum_{j=1}^{i'} v_j$.

Proposition 15. GreedyAlgo gives a $\theta(\delta)$-approximation to the throughput maximization problem with a common deadline and it has a complexity of $O(n^2 \log m)$.

In the rest of this subsection, we give an overview of the proof of Proposition 15. By Proposition 9, $\theta(\delta)$ is a lower bound of the processor utilization when Sched schedules $S_{i'}$ in $[0, \tau]$. Let $OPT$ denote the optimal throughput of our problem. The proof of Proposition 15 has two parts:

(i) We give an upper bound of $OPT$, denoted by $\overline{OPT}$, i.e.,

$$\overline{OPT} \ge OPT, \quad (22)$$

where $\overline{OPT}$ will be specified in Equation (24).

(ii) We show that $\theta(\delta)$ is a lower bound of the ratio of the throughput of GreedyAlgo to the upper bound, i.e.,

$$\frac{\sum_{j=1}^{i'} v_j}{\overline{OPT}} \ge \theta(\delta). \quad (23)$$

Then, we have by Inequalities (22) and (23) that

$$\sum_{j=1}^{i'} v_j \ge \theta(\delta)\, \overline{OPT} \ge \theta(\delta)\, OPT.$$

Thus, the throughput $\sum_{j=1}^{i'} v_j$ of GreedyAlgo is at least $\theta(\delta)$ times the optimal throughput $OPT$ and GreedyAlgo is a $\theta(\delta)$-approximation algorithm.
For the first part, let us consider a fractional knapsack problem (Korte & Vygen, 2018) with a knapsack of size $\tau m$ and $n$ divisible items. With abuse of notation, each item is still denoted by $T_j$, with a fixed size $D_{j,\gamma(j,\tau)}$ and a value $v_j$. Its optimal solution packs into the knapsack the first $\sigma$ items, denoted by $S'$, with the highest value densities such that their total size equals $\tau m$: $\sum_{j=1}^{\sigma-1} D_{j,\gamma(j,\tau)} + \alpha D_{\sigma,\gamma(\sigma,\tau)} = \tau m$, where $\alpha \in (0, 1]$ and the $\sigma$-th item may be partially packed. The following lemma completes the description of the first part.

Lemma 16. An upper bound of $OPT$ is

$$\overline{OPT} = \sum_{j=1}^{\sigma-1} v_j + \alpha v_\sigma, \quad (24)$$

which is the optimal value of the knapsack problem.
Proof. GreedyAlgo chooses a subset of tasks $S_{i'} = \{T_1, T_2, \cdots, T_{i'}\}$ and uses Sched to schedule $S_{i'}$ on the $m$ processors in $[0, \tau]$. We will show that any solution to the problem of this paper corresponds to a feasible solution to the above knapsack problem, where the same tasks/items are chosen and the two solutions have the same total value of tasks/items; the lemma thus holds. Specifically, when a task $T_j \in S_{i'}$ is chosen in our problem and assigned $p_j$ processors, we can correspondingly pack an item $T_j$ with a size $D_{j,\gamma(j,\tau)}$ into the above knapsack. By Lemma 11, $D_{j,p_j} = D_{j,\gamma(j,\tau)}$ and $\sum_{T_j \in S_{i'}} D_{j,\gamma(j,\tau)} \le \tau m$; thus, the items $T_1, T_2, \cdots, T_{i'}$ can successfully be packed into the knapsack.
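The optimal value of the fractional knapsack problem used as the upper bound can be computed greedily by value density. The sketch below is only an illustration of that standard procedure; the function name `knapsack_upper_bound` and the `(value, size)` item representation are hypothetical, with `value` playing the role of $v_j$, `size` the role of $D_{j,\gamma(j,\tau)}$ (assumed positive), and `capacity` the role of $\tau m$.

```python
from typing import List, Tuple


def knapsack_upper_bound(items: List[Tuple[float, float]], capacity: float) -> float:
    """Optimal value of the fractional knapsack with (value, size) items.

    Items are packed in non-increasing order of value density value / size;
    the last packed item may be taken only partially (the factor alpha).
    Assumption: every size is strictly positive.
    """
    total_value = 0.0
    remaining = capacity
    by_density = sorted(items, key=lambda item: item[0] / item[1], reverse=True)
    for value, size in by_density:
        if remaining <= 0:
            break
        fraction = min(1.0, remaining / size)  # alpha for the last packed item
        total_value += fraction * value
        remaining -= fraction * size
    return total_value
```

If the knapsack is large enough to hold every item, the function simply returns the sum of all values.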
For the second part, the detailed proof of (23) is provided in Appendix F. Below, we give the underlying intuition. The workload of each task $T_j \in S_{i'}$ accepted by GreedyAlgo is also $D_{j,\gamma(j,\tau)}$ by Lemma 11 (with $d = \tau$). $S_{i'}$ and $S'$ contain the first $i'$ and $\sigma$ tasks with the highest value densities respectively.

We have $i' \le \sigma$ since in GreedyAlgo the utilization of the $m$ processors in $[0, \tau]$ is $\le 1$. Thus, the average value density of $S_{i'}$ is no smaller than the average value density of $S'$. Combining this with the utilization lower bound $\theta(\delta)$ of Proposition 9 leads to (23).

Executing GreedyAlgo needs prior knowledge of the values of $k_j$ and $\delta_j$ of all the $n$ tasks of $\mathcal{T}$, which are used in calling Sched (line 3); the time complexity of obtaining these values is $O(n \log m)$ by Proposition 7. During its execution, it considers $S_1, S_2, \cdots, S_n$ one by one (line 2). Whenever Sched attempts to schedule the tasks of $S_i$ on the $m$ processors by time $\tau$ (line 3), it has a time complexity of $O(n \log m)$ by Proposition 10. Thus, the entire execution process has a time complexity of $O(n^2 \log m)$, which is also the complexity of GreedyAlgo.
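The prefix search performed by GreedyAlgo can be sketched as follows. This is a simplified illustration, not Algorithm 3 itself: the names `greedy_algo` and `sched_ok` and the `Task` tuple are hypothetical, `sched_ok(prefix)` stands in for running Sched on the prefix with deadline $\tau$, and the sketch assumes the search stops at the first prefix that Sched cannot schedule.

```python
from typing import Callable, List, Sequence, Tuple

# (name, value v_j, workload D_{j, gamma(j, tau)}); a hypothetical representation
Task = Tuple[str, float, float]


def greedy_algo(tasks: Sequence[Task],
                sched_ok: Callable[[List[Task]], bool]) -> List[Task]:
    """Return the longest density-ordered prefix S_{i'} accepted by sched_ok.

    Tasks are sorted by non-increasing value density v_j / D_j; the prefixes
    S_1, S_2, ... are tested one by one (at most n calls to sched_ok).
    Assumption: the search stops at the first prefix that is rejected.
    """
    ordered = sorted(tasks, key=lambda task: task[1] / task[2], reverse=True)
    accepted: List[Task] = []
    for i in range(1, len(ordered) + 1):
        prefix = ordered[:i]
        if not sched_ok(prefix):
            break
        accepted = prefix
    return accepted


# Toy feasibility test: total workload must fit into a capacity of 10.
toy_tasks = [("a", 10.0, 2.0), ("b", 12.0, 4.0), ("c", 5.0, 5.0), ("d", 40.0, 5.0)]
toy_accepted = greedy_algo(toy_tasks, lambda prefix: sum(t[2] for t in prefix) <= 10.0)
```

With at most $n$ calls to the feasibility test, each costing $O(n \log m)$ when the test is Sched, the sketch reflects the $O(n^2 \log m)$ bound discussed above.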

Conclusions
In this paper, we study the problem of scheduling $n$ independent moldable tasks on $m$ processors that arises in large-scale parallel computations. For makespan minimization, the best known result is a $(\frac{3}{2} + \epsilon)$-approximation algorithm with a complexity linear in $n$ and polynomial in $\log m$ and $\frac{1}{\epsilon}$, where $\epsilon$ is arbitrarily small; it is achieved under a monotonic assumption: the execution time of a task $T_j$ is non-increasing and its workload is non-decreasing in the number $p$ of assigned processors. We propose a new perspective of the existing speedup models: the speedup of a task $T_j$ is linear when $p$ is small (up to a threshold $\delta_j$); afterwards, there may be a larger threshold $k_j$ such that the task is strictly monotonic when $p$ ranges in $[\delta_j, k_j]$; the bound $k_j$ indicates an unacceptable overhead when parallelizing the task on too many processors. Let $\delta$ be the minimum linear-speedup threshold of all tasks and $k$ be the maximum parallelism bound of all tasks. For any $\delta \ge 5$, let $u = \lceil \sqrt{\delta} \rceil - 1$. A main algorithmic result of this paper is a $\frac{1}{\theta(\delta)}(1 + \epsilon)$-approximation algorithm for makespan minimization with a complexity of $O(n \log m \log(n m t_m/\epsilon))$, where $\theta(\delta) = \frac{u+1}{u+2}\left(1 - \frac{k}{m}\right)$ ($m \gg k$); typically, $\delta$ can range in $[25, 150]$. As a by-product, we also propose a $\theta(\delta)$-approximation algorithm for throughput maximization with a common deadline with a complexity of $O(n^2 \log m)$.

Appendix C. Proof of Proposition 9
After executing Sched, the $m$ processors may be divided into three parts: (i) the first part executes the tasks of $A'$ (lines 3-5), e.g., the 1st group of processors in the example above, (ii) the second part executes the tasks of $A_{H-1}, \cdots, A_u, A''$ (lines 7-22), e.g., the 2nd-6th groups in the example, (iii) the third part is idle and not assigned any task.

Sched ends in one of two cases: (i) $m' < k$ (line 6), or (ii) $m' < \delta'$ (line 8). Different parts exist in each case. Our analysis proceeds by showing (a) which parts of processors exist in each case and (b) the utilization of each part. $\theta(\delta)$ is a lower bound of the ratio of the total workload processed by the different parts to $md$.
First, we analyze the utilizations of the three parts.The first part of processors has a utilization ≥ r in [0, d] by the definition of A ′ .The utilization of the third part is zero.The second part can be divided into several groups, each with δ ′ processors.Let A u−1 = A ′′ for ease of exposition.For each group, we have (1) it is assigned the tasks purely from a single set A h where h ∈ [u − 1, u + 1] (see the second, fourth and sixth groups in the example), or (2) it is a mix of the tasks of multiple sets A h , A h−1 , • • • , A h ′ where u + 1 ≥ h > h ′ ≥ u − 1 and h ′ ∈ {u, u − 1}.
The total value obtained by GreedyAlgo is Here, in Equation (a), v j = v ′ j D j,γ(j,τ ) ; Inequalities (b) and (d) are due to that

Figure 1: Major Algorithmic Improvements for the Monotonic Model over the Past Three Decades.

Figure 2: Relations Between Different Speedup Models

Algorithm 2 (excerpt, lines 9-13)
 9  if Flag = 1 then   // every task T_j ∈ T can be completed by time M, i.e., γ(j, d) ∈ [1, k_j]
10      if Sched produces a feasible schedule of all tasks of T by time M then
11          U ← M
12      if Sched can only schedule a part of tasks of T by time M then
13          …

Algorithm 3 (excerpt, lines 3-4)
 3  if Sched produces a feasible schedule of all tasks of S_i by time τ then
 4      i′ ← i

Table 1: The Most Relevant Algorithmic Results for the Linear-speedup Model

Table 2: Algorithmic Results for the Communication Time Model