IO-SETS: Simple and efficient approaches for I/O bandwidth management

One of the main performance issues faced by high-performance computing platforms is the congestion caused by concurrent I/O from applications. When this happens, the platform's overall performance and utilization are harmed. Extensive work in this field points to I/O scheduling as the essential solution to this problem. The main drawback of current techniques is the amount of information they need about applications, which compromises their applicability. In this paper, we propose a novel method for I/O management, IO-SETS. We present its potential through a scheduling heuristic called SET-10, which is simple and requires only minimal information. Our extensive experimental campaign shows the importance of IO-SETS and the robustness of SET-10 under various workloads. In particular, in most of the simulated scenarios we improve the I/O slowdown over fair-share by 50%, which corresponds in our scenarios to a platform utilization gain of 2.5%. In the practical scenarios that we ran, the utilization gain varies between 10% and 30%. We also provide insights on using our proposal in practice.


Introduction
As high-performance applications increasingly rely on data, the stress put on the I/O system has become one of the key bottlenecks of supercomputers. Indeed, processing speed has increased at a much faster pace than parallel file system (PFS) bandwidth. For example, the first supercomputer in the Top500 list [1] of November 1999 was ASCI Red, with a peak performance of 3.2 TFlops and 1 GB/s of PFS bandwidth [9]. Twenty years later, in November 2019, the fastest machine was Summit, with 200.8 PFlops (over 62,000× faster) and a PFS capable of 2.5 TB/s (2,500× faster) [16].

The I/O bottleneck becomes a bigger issue when multiple applications access the I/O infrastructure simultaneously. Congestion delays their execution, which means they hold compute resources for longer, hurting the utilization of the platform. Moreover, performance variability increases, since an application's execution time depends not only on what it does but also on the current state of the machine.

In this context, solutions were proposed to manage accesses to the shared I/O system. The main way of doing so, and the topic of this work, is I/O scheduling [12,15,7,8,21]: it consists of deciding algorithmically which applications get priority for this access. In the majority of the I/O scheduling approaches in the related literature, the I/O operations are performed exclusively (i.e. one application at a time), or semi-exclusively when applications cannot use all the available bandwidth by themselves (i.e. some applications run concurrently, using as much bandwidth as they can). An alternative method is fair-sharing: the bandwidth is shared equally by all applications in a best-effort fashion. With fair-sharing, while two applications perform I/O concurrently, each one progresses at half the speed it would reach by itself.
We propose a novel approach built on the intuition that both exclusive access and fair-sharing have benefits. In this respect, we design a two-pronged I/O management approach for HPC systems. First, applications are sorted into sets. If applications from the same set try to perform I/O concurrently, they do so in an exclusive fashion (one at a time). Second, applications from different sets can perform I/O at the same time, in which case they share the available bandwidth.

The main contributions of this paper are the following:
- a novel method for I/O management in HPC systems;
- an instantiation of this method with very simple heuristics that require little information about applications.

The rest of this paper is organized as follows. In Section 2, we detail the problem we studied and our platform and application models, so that we can present our method and studied heuristics in Section 3. Results are presented in Section 4. The paper closes with related work in Section 5 and final remarks in Section 6.

I/O scheduling in HPC systems
In this work, we assume a parallel platform with a separate I/O infrastructure that is shared by all nodes and provides a total bandwidth $B$. There are many ways to share the I/O bandwidth between concurrent I/O accesses [7]. One of the novel ideas that we develop here is to manage I/O bandwidth using a priority-based approach, inspired by a network protocol [20], which consists in assigning priorities to I/O accesses. When an application (or job; we use both terms interchangeably in this paper) performs an I/O phase by itself, it can use the full bandwidth of the system. However, when there are $k$ concurrent requests for I/O, with respective priorities $p_1, p_2, \ldots, p_k$, then for $i \in \{1, \ldots, k\}$, the $i$-th request is allocated a share $x_i$ of the total bandwidth such that:

$$x_i = \frac{p_i}{\sum_{j=1}^{k} p_j} \qquad (1)$$

We consider the execution of workloads composed of $N$ jobs. HPC applications have been observed to alternate between compute and I/O phases [7,11]. Thus, for $j \le N$, job $J_j$ consists of $n_j$ non-overlapping iterations. For $i \le n_j$, iteration $i$ is composed of a computing phase of length $t_{cpu}^{(i)}$ followed by an I/O phase of length $t_{io}^{(i)}$ (see footnote 2). These lengths (in time units) correspond to the job running in isolation on the platform (i.e. with no interference from other jobs). We consider that an I/O operation that would take $t_{io}$ units of time in isolation takes $t_{io}/x$ units of time when using only a share $x$ of the I/O bandwidth.

The precise I/O profile of an application, i.e. all its I/O accesses along with their volumes and start and finish times, can be extremely hard to predict. Obtaining it would require a detailed I/O profiler that imposes a heavy overhead and generates a large amount of data, and the resulting profile would still be inaccurate. I/O profilers used in practice focus on average or cumulative data. For instance, Darshan [17] is able to measure the total volume of I/O performed by an application.
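The priority-based sharing rule and the slowdown model above can be illustrated with a small sketch. It assumes that shares are proportional to priorities, i.e. $x_i = p_i / \sum_j p_j$, which is our reading of Equation (1); the function names are hypothetical and chosen for illustration only.

```python
def bandwidth_shares(priorities):
    """Share of the total bandwidth given to each concurrent I/O request.

    Assumption: shares are proportional to priorities, x_i = p_i / sum_j p_j.
    """
    total = sum(priorities)
    if total <= 0:
        raise ValueError("priorities must be positive")
    return [p / total for p in priorities]


def io_duration(t_io, share):
    """Duration of an I/O phase of isolated length t_io when it only gets
    a fraction `share` of the bandwidth (t_io / x in the text)."""
    return t_io / share


# Two concurrent requests with equal priorities behave like fair-sharing.
shares = bandwidth_shares([1.0, 1.0])
print(shares)                          # [0.5, 0.5]
print(io_duration(10.0, shares[0]))    # 20.0
```

With equal priorities, each request gets half of the bandwidth and its I/O phase takes twice as long as in isolation, which matches the fair-sharing behavior described in the introduction.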
Therefore, we focus on average parameters. The first parameter is the characteristic time, $w_{iter}$, of an application with $n$ iterations. This parameter can be seen as the average time between the beginning of two consecutive I/O phases:

$$w_{iter} = \frac{1}{n} \sum_{i \le n} \left( t_{cpu}^{(i)} + t_{io}^{(i)} \right) \qquad (2)$$

Footnote 2: For simplicity, in the following and where there is no ambiguity, we drop some of the indices and use $n$ for the number of iterations, and $t_{cpu}$ and $t_{io}$ for the length of a compute or I/O phase.

Compas'2022 : Parallélisme / Architecture / Système, MIS/UPJV, Amiens, France, 5-8 juillet 2022

We also define a job's average portion of time spent on I/O. The I/O ratio of a job is:

$$\alpha_{io} = \frac{\sum_{i \le n} t_{io}^{(i)}}{\sum_{i \le n} \left( t_{cpu}^{(i)} + t_{io}^{(i)} \right)}$$

Using this, we can define the average I/O stress of the platform with $N$ applications at a given time:

$$\omega = \sum_{j \le N} \alpha_{io}^{(j)} \qquad (3)$$

The intuition behind this value is that it corresponds roughly to the expected average occupation of the I/O bandwidth. We can make the following important remarks:
- These average values do not depend on the number of nodes used or on their performance. In practice, an application that runs on one node of 1 GFlops for 20 minutes and performs 20 iterations has the same $w_{iter}$ value (1 minute) as another that runs on 2000 nodes at 2 TFlops for 120 minutes and performs 120 iterations.
- The characteristic time and I/O ratio as defined above are average values that we expect could be evaluated easily, or approximated from previous runs. Being averages, they are more robust to variability than individual phases (Ergodic Theorem).
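As an illustration, the three average parameters can be computed from per-iteration phase lengths as follows. This is a minimal sketch that assumes the definitions are plain averages (characteristic time as the mean iteration length, I/O ratio as total I/O time over total time, I/O stress as the sum of the I/O ratios); the function names and sample values are hypothetical.

```python
def characteristic_time(t_cpu, t_io):
    """w_iter: mean length of an iteration (compute phase + I/O phase)."""
    return sum(c + i for c, i in zip(t_cpu, t_io)) / len(t_cpu)


def io_ratio(t_cpu, t_io):
    """alpha_io: fraction of the job's isolated run time spent in I/O."""
    total_io = sum(t_io)
    return total_io / (sum(t_cpu) + total_io)


def io_stress(jobs):
    """omega: sum of the I/O ratios of all jobs on the platform."""
    return sum(io_ratio(t_cpu, t_io) for t_cpu, t_io in jobs)


# Each job is (compute phase lengths, I/O phase lengths), in seconds.
job_a = ([50.0, 54.0, 46.0], [10.0, 10.0, 10.0])
job_b = ([8.0, 8.0], [2.0, 2.0])

print(characteristic_time(*job_a))   # 60.0
print(io_ratio(*job_b))
print(io_stress([job_a, job_b]))
```

Only the sums of the phase lengths and the number of iterations are needed, which is the kind of cumulative information that profilers already collect.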

IO-SETS: a novel I/O management method for HPC systems
In this work we propose the IO-SETS method, which allows exclusive access for some applications, and (not necessarily fair) bandwidth sharing for others. Our proposition is a set-based approach, described below.
- When an application wants to do I/O, it is assigned to a set $S_i \in \{S_0, S_1, \ldots\}$.
- Each set $S_i$ is allocated a bandwidth priority $p_i$.
- At any time, only one application per set is allowed to do I/O (exclusive access within sets). We use the first-come-first-served scheduling strategy within a set (i.e. we pick the application that requested it the earliest).
- If applications from multiple sets request I/O, they are allowed to proceed and their share of the bandwidth is computed using the sets' priorities and Equation (1).

Proposing a heuristic in the IO-SETS method therefore consists of answering two important questions: (i) how do we choose the set to which an application is allocated, and (ii) how do we define the priority of a set.

We can illustrate the instantiation of the IO-SETS method by using it to represent the two reference strategies discussed earlier.

EXCLUSIVE-FCFS: for this heuristic, all applications have exclusive access and are scheduled in a first-come-first-served fashion. This is the case where there is a single set $S_1$ with any priority (for instance $p_1 = 1$). All applications are scheduled in the set $S_1$.

FAIR-SHARE: in this heuristic, the bandwidth is shared equally among all applications requesting I/O access. This can be modeled by the case where there is one set per application ($S_0, \ldots, S_k$), all with the same priority $p_i = 1$. Application $A_{id}$ is scheduled in set $S_{id}$.
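The arbitration rule of IO-SETS can be sketched in a few lines. The code below is an illustrative sketch, not the implementation used in our evaluation: it assumes that pending requests are available as (arrival time, application) pairs, that a request is removed from the list once its I/O phase completes, and that shares are proportional to set priorities; the function name and data layout are hypothetical.

```python
def active_shares(requests, set_of, priority):
    """Decide which applications perform I/O now, and with which share.

    requests: list of (arrival_time, app) for applications waiting for
              or currently performing I/O.
    set_of:   dict mapping each app to its set identifier.
    priority: dict mapping each set identifier to its priority.
    Returns a dict mapping each active app to its bandwidth share.
    """
    active = {}                                # set id -> app allowed to do I/O
    for _, app in sorted(requests):            # earliest request first (FCFS)
        active.setdefault(set_of[app], app)    # keep only the first app per set
    total = sum(priority[s] for s in active)
    return {app: priority[s] / total for s, app in active.items()}


requests = [(3.0, "A"), (1.0, "B"), (2.0, "C")]

# EXCLUSIVE-FCFS: a single set, so only the earliest request proceeds.
print(active_shares(requests, {"A": 1, "B": 1, "C": 1}, {1: 1.0}))

# FAIR-SHARE: one set per application, all with the same priority.
print(active_shares(requests, {"A": 0, "B": 1, "C": 2},
                    {0: 1.0, 1: 1.0, 2: 1.0}))

# A mixed case: A and B share a set, C is alone in a lower-priority set.
print(active_shares(requests, {"A": 0, "B": 0, "C": 1}, {0: 1.0, 1: 0.1}))
```

In the mixed case, B (the earliest request of its set) and C proceed concurrently, while A waits for B to finish.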

SET-10 algorithm
When two jobs have the same characteristic time, it seems more efficient to provide them with exclusive I/O access so that they can synchronize their phases. In this case, one of them incurs, only once, a delay equal to the length of the other's I/O phase; for the remaining iterations, neither of them is delayed. However, exclusive access does not bring benefits when the applications' characteristic times are very different. Based on this observation, we propose the SET-10 heuristic, which builds the sets depending on $w_{iter}$.
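The synchronization argument can be checked on a toy case. The sketch below simulates exclusive first-come-first-served access for periodic jobs and compares it with fair-sharing for two identical jobs. It relies on simplifying assumptions that are ours: both jobs start at time 0, all iterations have the same lengths, ties are broken by job index, and under fair-sharing the two jobs stay synchronized so every I/O phase runs at half bandwidth. Function names are hypothetical.

```python
def exclusive_fcfs_finish_times(jobs, n_iter):
    """Simulate exclusive first-come-first-served I/O for periodic jobs.

    jobs: list of (t_cpu, t_io) pairs, lengths measured in isolation.
    Returns the time at which each job completes its last I/O phase.
    """
    free_at = 0.0                             # when the I/O system is next free
    request = [t_cpu for t_cpu, _ in jobs]    # time of each job's next I/O request
    done = [0] * len(jobs)                    # iterations completed per job
    finish = [0.0] * len(jobs)
    for _ in range(n_iter * len(jobs)):
        pending = [k for k in range(len(jobs)) if done[k] < n_iter]
        j = min(pending, key=lambda k: (request[k], k))   # earliest request first
        start = max(request[j], free_at)      # wait if the I/O system is busy
        free_at = start + jobs[j][1]
        finish[j] = free_at
        done[j] += 1
        request[j] = free_at + jobs[j][0]     # next request follows the compute phase
    return finish


def fair_share_finish_time(t_cpu, t_io, n_iter):
    """Two identical, synchronized jobs: each I/O phase takes twice as long."""
    return n_iter * (t_cpu + 2 * t_io)


jobs = [(8.0, 2.0), (8.0, 2.0)]               # same characteristic time (10 s)
print(exclusive_fcfs_finish_times(jobs, 5))   # [50.0, 52.0]
print(fair_share_finish_time(8.0, 2.0, 5))    # 60.0
```

With exclusive access, the second job is delayed once by 2 s and the phases then interleave without conflict, whereas with fair-sharing both jobs lose 2 s at every iteration.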
Given an application $A_{id}$ with characteristic time $w_{iter}^{id}$, the mapping $\pi$ is:

$$\pi : A_{id} \to S_{\lfloor \log_{10} w_{iter}^{id} \rceil} \qquad (4)$$

where $\lfloor x \rceil$ denotes the nearest integer to $x$.
To define priorities for the sets, we start with the following observation: if two applications start performing I/O at the same time and share the bandwidth equally, the one with the smallest I/O volume finishes first. Now if we increase the bandwidth of the smallest, it will finish earlier, but the largest one will not be delayed. Based on this, we want to provide higher bandwidth to the sets with the smallest I/O accesses. Intuitively, if the characteristic times $w_{iter}$ are of different orders of magnitude, we assume that so are the I/O accesses. An advantage of using $w_{iter}$ instead of the I/O volume is that it is easy to obtain, as previously discussed. Hence we define the priority $p_i$ of set $S_i$ (which corresponds to jobs such that $i = \lfloor \log_{10} w_{iter}^{id} \rceil$):

$$p_i = 10^{-i} \qquad (5)$$

SET-10 is thus defined by:
- the sets $S_i$ and their priorities $p_i$ (Equation (5));
- the $\pi$ function that maps jobs to sets using Equation (4).
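The two ingredients of SET-10 fit in a few lines of code. In this sketch, the set index follows the mapping of Equation (4) (nearest integer to the base-10 logarithm of the characteristic time). The priority function is an assumption: we take $10^{-i}$ for set $S_i$ as our reading of Equation (5), and any function that decreases with the set index would be consistent with the stated intent of favoring the sets with the smallest I/O accesses. Function names are hypothetical.

```python
import math


def set_index(w_iter):
    """Index of the set of a job: nearest integer to log10(w_iter)."""
    return math.floor(math.log10(w_iter) + 0.5)


def set_priority(i):
    """Priority of set S_i. Assumption: 10**(-i), so that sets holding jobs
    with short characteristic times get a larger share of the bandwidth."""
    return 10.0 ** (-i)


for w_iter in (10.0, 30.0, 40.0, 100.0, 1000.0):
    i = set_index(w_iter)
    print(w_iter, "-> set", i, "with priority", set_priority(i))

# Shares when one job from each of the sets S_1, S_2, S_3 does I/O at once.
priorities = [set_priority(i) for i in (1, 2, 3)]
print([p / sum(priorities) for p in priorities])
```

Under this assumption, when jobs from three consecutive sets perform I/O concurrently, the job from the set with the shortest characteristic time receives about 90% of the bandwidth.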

Evaluation and Results
The evaluation in this paper is a first exploratory work to show the importance of the IO-SETS method. For that, we use SimGrid v3.30 [4], an open-source simulator of distributed environments that also simulates I/O operations [13]. We chose simulation because we wanted to test many scenarios, to extensively assess our method's strengths and limitations (the extended version of the paper comprises tests with hundreds of different workloads). Moreover, test platforms are somewhat limited in size, and deploying I/O scheduling strategies in a production system (just to test them) is complex and usually not allowed. All code used for our evaluation, the details of implementation, and the extended version of the paper are available and documented at https://gitlab.com/u693/iosets.

In our experiments, all applications start at the beginning of the simulation. This means that the early and final stages are not representative in terms of arriving I/O requests. To avoid such artifacts, we do our measurements within a time frame $[T_{begin}, T_{end}]$ with $T_{begin} > 0$ and $T_{end} < h$. Specifically, with a horizon $h = 20{,}000$ s, we select $T_{begin} = 1{,}000$ s and $T_{end} = 9{,}000$ s. This ensures that the measurement window starts after a possible de-synchronization of the different applications and stops before any application finishes. Additionally, for each job $J_j$, we define the effective completed work $e_j = e_j^{cpu} + e_j^{io}$ as the sum of the lengths of all compute and I/O phases executed by this job within the interval.

We create 200 workloads, each with $N = 60$ parallel applications, and we consider a saturated system, with an average I/O stress $\omega = 0.80$ (Equation 3). We consider two metrics: (i) the system Utilization (Equation 6), which reflects the proportion of time lost due to I/O congestion; and (ii) the application Stretch (Equation 7), which reflects how long it takes for a user to get a result.
We aim to show that IO-SETS is indeed a relevant method when compared to the reference strategies EXCLUSIVE-FCFS and FAIR-SHARE. In order to focus on the importance of the use of sets in the IO-SETS method, we design a collection of applications that clearly belong to different groups of the heuristic SET-10. By doing that, we temporarily put aside the question of designing an efficient heuristic for IO-SETS. As discussed in Section 2, a job is fully described by its number of iterations $n$ and the lengths of its compute and I/O phases. In order to generate each job, we proceed as detailed below.
1. We decide on a characteristic time $w_{iter}$ for the job. In practice, it is drawn from the normal distribution $\mathcal{N}(\mu, \sigma)$, truncated so that we consider only positive values. $\mu$ and $\sigma$ are selected on a per-experiment basis.

The number of iterations of the job is
The I/O ratio $\alpha_{io}$ is defined as follows: for each application, we draw a value $\alpha_j$ uniformly at random in $[0, 1]$, then we set $\alpha_{io}^{(j)} = \frac{\omega}{\sum_{k \le N} \alpha_k} \cdot \alpha_j$ to guarantee an average I/O stress of $\omega$. $\alpha_{io}$ allows us to define the average values $t_{cpu}$ and $t_{io}$: $t_{cpu} = (1 - \alpha_{io}) \cdot w_{iter}$ and $t_{io} = \alpha_{io} \cdot w_{iter}$.

In Section 3, we explained the intuition that jobs should be grouped based on their characteristic time ($w_{iter}$). Therefore, we define three different job profiles (i.e. values for $\mu$ and $\sigma$):
- $n_H$ jobs with characteristic time $w_{iter} \sim \mathcal{N}(10, 1)$, also called high-frequency jobs in the following (the mean duration of an iteration is 10 s);
- $n_M$ medium-frequency jobs with $w_{iter} \sim \mathcal{N}(100, 10)$;
- $n_L$ low-frequency jobs with $w_{iter} \sim \mathcal{N}(1000, 100)$.

We set $n_H + n_M + n_L = 60$, and specifically we study the impact of the algorithms when $n_M = 20$, $n_H \in \{0, \ldots, 40\}$, and $n_L = 40 - n_H$.

Utilization is presented in Figure 1a. The FAIR-SHARE and SET-10 strategies have a high and stable platform usage, the latter being better by up to 4.4% when there are few high-frequency jobs. This difference disappears when $n_H$ increases. EXCLUSIVE-FCFS has rather poor results compared to the other heuristics, with a higher utilization when I/O phases are mostly large.

Figure 1b presents the Stretch objective for each workload. We can observe the important gains provided by SET-10, up to a 15.3% improvement over FAIR-SHARE, especially when there are mostly low-frequency jobs (larger I/O phases). This difference disappears as $n_H$ increases. Here again, EXCLUSIVE-FCFS has the worst results in all cases, confirming its expected shortcomings. Interestingly, the Utilization is not correlated to the Stretch: indeed, in an exclusive scenario, one of the high-frequency jobs can be penalized many times (which increases its Stretch) without hurting the Utilization significantly.
When there are diverse job profiles, EXCLUSIVE-FCFS is not a good strategy for scheduling I/O. While it is important to share the bandwidth, keeping some exclusivity has a positive impact on performance, showing the importance of the IO-SETS method.
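The job-generation procedure of this section can be sketched as follows. This is an illustrative reconstruction rather than the code from our repository, and it rests on assumptions that we state explicitly: $\sigma$ is treated as a standard deviation, truncation is done by rejecting non-positive draws, and the number of iterations is left out. The function names and the dictionary layout are hypothetical.

```python
import random


def truncated_normal(rng, mu, sigma):
    """Draw from a normal distribution, rejecting non-positive values.
    Assumption: sigma is the standard deviation."""
    while True:
        w = rng.gauss(mu, sigma)
        if w > 0:
            return w


def generate_workload(n_h, n_m, n_l, omega, seed=0):
    """Generate high-, medium- and low-frequency jobs whose I/O ratios
    sum to the target I/O stress omega."""
    rng = random.Random(seed)
    profiles = [(n_h, 10.0, 1.0), (n_m, 100.0, 10.0), (n_l, 1000.0, 100.0)]
    w_iters = [truncated_normal(rng, mu, sigma)
               for count, mu, sigma in profiles
               for _ in range(count)]
    alphas = [rng.random() for _ in w_iters]    # raw weights, uniform in [0, 1)
    scale = omega / sum(alphas)                 # normalize so the ratios sum to omega
    jobs = []
    for w_iter, alpha in zip(w_iters, alphas):
        alpha_io = scale * alpha
        jobs.append({
            "w_iter": w_iter,
            "alpha_io": alpha_io,
            "t_cpu": (1.0 - alpha_io) * w_iter,  # average compute phase length
            "t_io": alpha_io * w_iter,           # average I/O phase length
        })
    return jobs


workload = generate_workload(n_h=10, n_m=20, n_l=30, omega=0.8)
print(len(workload), sum(job["alpha_io"] for job in workload))
```

The normalization step is what ties the individual I/O ratios to the platform-level I/O stress: whatever the random weights are, the ratios sum to the chosen value of $\omega$.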

Related Work
In this paper, we proposed a method for managing applications' accesses to the shared parallel file system in an HPC platform. Other efforts were put into scheduling I/O operations with the aim of mitigating interference. Gainaru et al. [7] and Zhou et al. [22] schedule applications' operations at the I/O node level. In a more recent contribution, Gainaru et al. [8] propose a pattern-based schedule relying on the periodic I/O behavior. Dorier et al. [5] analyze the benefits of either delaying or interrupting an application when it is interfering with another one. ASCAR [14] uses controllers on storage clients to detect and reduce congestion using traffic rules. The scheduler AIS [15] identifies offline the I/O characteristics of the application and uses this information to avoid conflicts by issuing I/O-aware scheduling recommendations. Alves et al. [2], Snyder et al. [18] and Dorier et al. [6] use different models to predict and avoid I/O interference. Our work is motivated by the observation that both exclusive and shared access to the bandwidth have advantages, as observed by Jeannot et al. [12]. To the best of our knowledge, ours is the first work to propose classifying applications into sets and using that classification to manage the bandwidth.

Others have improved the batch scheduler to consider I/O as one of the resources to arbitrate. Grandl et al. [10] and Tan et al. [19] enrich theoretical scheduling problems with additional dimensions such as I/O requirements. Bleuse et al. [3] attempt to schedule applications requesting both compute and I/O nodes. In their model, Herbein et al. [11] consider that jobs perform I/O proportionally to the number of nodes they need. They show that considering I/O helps reduce performance variability. Our approach is complementary to those, because we focus on the situation where a set of applications is already running and requesting I/O.

Conclusion
In this work we have introduced a novel method for I/O management in HPC systems: IO-SETS, intended to be easy to implement and light in overhead. This method combines exclusive and non-exclusive bandwidth sharing. We have proposed the SET-10 algorithm, based on this method, where jobs are categorized depending on the order of magnitude of their characteristic times. The characteristic time is an average value that corresponds to the mean time between two consecutive I/O phases of an application. We have shown the importance of IO-SETS, and then the excellent performance of the SET-10 algorithm.

We believe that the results from this paper open a wide range of avenues for I/O management. From a theoretical standpoint, optimal set mapping and bandwidth partitioning algorithms can be investigated. From a practical point of view, the implementation of IO-SETS will have to adapt to real applications, whose more complex I/O behaviors change over time. An important direction that we will consider is to allow jobs to change sets at each I/O phase. For research on I/O monitoring tools, our results indicate that average behavior may be sufficient information, and it is also cheaper to obtain. Finally, we believe our method could be successful in other parts of the I/O stack, for instance to manage the access to shared burst buffers.