Exponential asymptotic optimality of Whittle index policy

We evaluate the performance of the Whittle index policy for restless Markovian bandits. It is shown in Weber and Weiss (J Appl Probab 27(3):637–648, 1990) that if the bandit is indexable and the associated deterministic system has a global attractor fixed point, then the Whittle index policy is asymptotically optimal in the regime where the arm population grows proportionally with the number of activated arms. In this paper, we show that, under the same conditions, the convergence rate is exponential in the arm population, unless the fixed point is singular (to be defined later), which almost never happens in practice. Our result holds for the continuous-time model of Weber and Weiss (1990) and for a discrete-time model in which all bandits make synchronous transitions. Our proof is based on the nature of the deterministic equation governing the stochastic system: We show that it is a piecewise affine continuous dynamical system inside the simplex of the empirical measure of the arms. Using simulations and numerical solvers, we also investigate the singular cases, as well as how the level of singularity influences the (exponential) convergence rate. We illustrate our theorem on a Markovian fading channel model.


Introduction
A multi-armed bandit (MAB) problem is a sequential allocation problem: At each decision epoch, one or several arms are activated and some observable rewards are obtained. The goal is to maximize the total reward obtained by a sequence of activations. There are at least three fundamental formalizations of the bandit problem depending on the assumed nature of the reward process: stochastic (i.i.d.), adversarial, and Markovian. Each bandit model has its own specific playing strategies and uses distinct techniques of analysis. We focus here on Markovian bandits (for a thorough analysis of the other two types of bandit models, see e.g. Lattimore and Szepesvári (2020)). At each time, a subset of arms is chosen to be activated. Each arm generates an instantaneous reward that depends on its state and on whether it is activated. The state of each activated arm then changes in a Markovian fashion, based on an underlying transition matrix (or a rate matrix in the continuous-time case). Both the reward and the new state are revealed to the decision maker for its next decision. The arms that are not activated change state according to a different transition matrix. When the underlying stochastic transition laws are assumed to be known (see Duff (1995) for a treatment of the case where the transition matrices are unknown), the optimal policy can be computed via dynamic programming, and the problem is essentially of computational nature.
The above Markovian MAB problem has been solved in the rested case (non-activated arms do not change their states) with one active arm at each decision epoch in Gittins (1979) by the Gittins index policy, which is a greedy policy that can be computed efficiently. In Whittle (1988), Whittle generalizes the model in two aspects. Firstly, at each decision epoch more than one arm can be activated, and secondly, the arms that are not activated can also change states (restless bandits), according to a different transition matrix, as mentioned before. Under these generalizations, the problem can no longer be solved by a similar efficient index-type greedy policy, and indeed it has been proven in Papadimitriou and Tsitsiklis (1999) that this problem is PSPACE-hard. In Whittle (1988), however, Whittle conjectures that, under some conditions, the so-called "Whittle index policy" (WIP) should be optimal asymptotically, i.e. when the number of arms goes to infinity with a fixed proportion of active arms.
This conjecture has been proven in the famous paper Weber and Weiss (1990) for the continuous-time model, under several technical conditions (further discussed in Weber and Weiss (1991)), namely when the bandits are indexable and the drift of the Markov system has a fixed point that is a global attractor. These results further reinforce the interest of Whittle index, as restless arm models have been used in a wide range of applications and Whittle index policies turn out to be efficient solutions. Among them one can cite wireless scheduling Aalto et al (2015); Raghunathan et al (2008), queuing systems Ansell et al (2003), crawling optimal content on the web Avrachenkov and Borkar (2016), load-balancing Larranaga et al (2016) and sensors Niño-Mora and Villar (2011). Some partially observable Markov decision processes (POMDPs) also fall into the category of restless Markovian bandits by using a Bayesian approach to construct the transition matrices. One concrete example is the multi-channel wireless scheduling problem of Liu and Zhao (2010); Meshram et al (2018) that we study in Section 5 of this paper. In this system, there are N Gilbert-Elliott channels and the state of a channel is only observed when a transmission is scheduled on this channel. We will show in Section 5 that this example falls in our framework and we will use it to illustrate our convergence results.

Contributions
Despite the well-known asymptotic optimality of WIP (under some conditions) and its empirically good performance on numerous models listed above, as well as its many extensions, there is very limited research on how fast WIP becomes optimal. In this paper we show that the convergence of the performance of WIP to the performance of an optimal policy is exponentially fast in the number N of arms, giving a theoretical explanation for the good performance of WIP in practice, even when the number of arms is small. This result holds in the discrete-time as well as the continuous-time cases, under the same conditions as the asymptotic optimality proven in Weber and Weiss (1990), namely that the bandits are indexable and that the ordinary differential equation driving the dynamics of the mean field approximation has a fixed point that is a global attractor, plus the additional condition that the fixed point is non-singular (which almost always holds). This last condition will be discussed at length in the rest of the paper.
The proof of our main result (i.e. exponential convergence rate in the general case) relies on two main ingredients. The first one comes by noticing that the dynamics of the mean field approximation of the N arms under WIP, each with d states, is piecewise affine and continuous over a finite number of polytopes partitioning the configuration space (the simplex in dimension d). This piecewise linearity of the mean field approximation comes as a mixed blessing when one tries to compute the convergence rate: On the one hand the dynamics is not differentiable at the interface between the polytopes. Therefore, previous approaches based on the smoothness of the drift such as Gast et al (2018a); Gast and Van Houdt (2017); Ying (2017) do not apply here. On the other hand, when the global attracting fixed point falls into the interior of a polytope (i.e. it is non-singular), the dynamics in a small neighborhood around the fixed point is affine and the expected behavior of the system is relatively simple to analyze.
The second ingredient is to divide the analysis of the behavior of the stochastic system into two parts: before it enters a small neighborhood of the fixed point and after it does. Stein's method is used to compare its behavior with its mean field approximation inside the neighborhood. Hoeffding's inequality (in the discrete-time case) or an exponential martingale concentration inequality (in the continuous-time case) is used to control its behavior outside the neighborhood.
To be more precise, we show that under indexability, global attraction of the fixed point of the mean field dynamics and non-singularity of this fixed point, the average performance of a stochastic Markovian bandit system under WIP converges to its mean field limit as b · exp(−cN), where N is the number of arms and b, c are positive constants independent of N. Our result comes with several novelties.
• Firstly, we believe that this is the first example where an exponential convergence to a mean field limit has been obtained. This exponential rate relies crucially on the piecewise affine nature of the deterministic dynamical system, as opposed to most other mean field approximation results that prove convergence rates that are polynomial in 1/√N and for which the deterministic dynamics is smooth everywhere.
• Secondly, although a part of our proof has a large deviation flavor, our result concerns the expected behavior of the stochastic bandit and not its deviations, so that our result cannot be obtained by simply using general results on dynamical systems in the presence of random perturbations, such as the large deviation bounds presented in Section 1.5 in Kifer (1988). As for the part of our proof on concentration bounds that might have been obtained using large deviation principles, we believe that our direct proof, based on concentration inequalities, is simple enough and provides a clearer understanding of the picture.
• The contrast between singular and non-singular attractors has gone unnoticed so far. Our theoretical results (exponential convergence in the non-singular case and possibly only O(1/√N) in the singular case) are backed by numerical experiments showing that for a moderate number of arms (N ranging from 10 to 50), the relative performance of WIP w.r.t. the optimal policy ranges from almost perfect (less than 0.1 % difference) in the non-singular case to simply good (around 4 %) in the singular case.

Related work
Our work can be seen as a natural sequel of the classical paper Weber and Weiss (1990), in which we show the exponential convergence rate of the asymptotic optimality proven there, and discuss its significance and consequences. Another paper that is closely related to our work is Verloop (2016), in which the author proposes a class of priority policies using linear programming, that are all asymptotically optimal, provided that a global attractor condition holds on the model (which is also needed in our result on WIP). The advantage of the policies in Verloop (2016) is that they do not need the indexability assumption. Although the focus of our paper is on Whittle index policy, our exponential convergence rate result proven in Section 6 on the continuous-time model can be adapted to show the exponential convergence rate for the policies of Verloop (2016). This is shown in our followup paper Gast et al (2022b).
WIP and the index policies of Verloop (2016) are asymptotically optimal for the infinite-horizon undiscounted problems. More recently, a series of papers have studied asymptotically optimal policies for the finite-horizon criterion. These papers use a relaxation of the problem that is similar to Whittle's relaxation but adapted to finite horizon. To the best of our knowledge, these ideas first appear in Hu and Frazier (2017) in which the authors show that for any finite-horizon Markovian bandit problem, it is possible to compute a policy whose sub-optimality gap is O(1/√N). A similar result is proven for the discounted infinite-horizon criterion in Zhang and Frazier (2022). For finite-horizon problems, a notion of non-degenerate problem is introduced in Brown and Smith (2020); Zhang and Frazier (2021) that is similar to the notion of non-singular model that we introduce in our paper. In our followup paper Gast et al (2022b), we show that for a non-degenerate problem, it is possible to compute a policy that becomes asymptotically optimal at exponential rate. We should stress that the results developed in these papers are only for finite-horizon or discounted infinite-horizon models whereas our paper focuses on the undiscounted infinite-horizon model. This leads to distinct proof techniques and also different results:
• For finite-horizon, the LP-index policies derived in Hu and Frazier (2017); Brown and Smith (2020); Zhang and Frazier (2021); Gast et al (2022b) are always asymptotically optimal with a rate of convergence of at least O(1/√N). One obtains an exponential rate of convergence only when the problem is non-degenerate. Numerical evidence provided in Gast et al (2022b) shows that a large fraction of problems are degenerate.
• For infinite-horizon (that we study in this paper), WIP or the LP-priority policy studied in Verloop (2016) are only asymptotically optimal when the global attractor property is satisfied.If it is satisfied then the problem is almost always non-singular, which implies that, for almost all models, WIP becomes optimal exponentially fast when it is asymptotically optimal.

Organization of the paper
In Section 2, we introduce the discrete-time restless bandit model: all arms change their state simultaneously in discrete time, according to transition matrix $P^1$ when being activated and $P^0$ when not being activated. We also define the Whittle indices and the main notations used in the paper. We then present the main result of the paper in Section 3, namely exponential convergence of the performance of WIP to the optimal one in the general situation.
In Section 4, we illustrate our results with several examples. We provide simulation and numerical estimations for the performance of WIP in different cases.
In Section 5, we present an application of our result to the Markovian fading channel problem, where we run numerical checks with parameters that fall into the general case framework (non-singular global attracting fixed point). Finally, in Section 6, we extend our result to the continuous-time model (bandits are continuous-time Markov chains, and decisions are made every time one arm changes its state). We show that the exponential convergence rate also holds in the continuous-time case, and highlight the similarities and differences between the two models.

The Discrete-Time Restless Bandit Model
We first describe the restless bandit model in Section 2.1. We then recall the definition of Whittle index in Section 2.2 and its relation with a linear problem in Section 2.3. Note that in our model, all arms are synchronous. This is a discrete-time version of the classical continuous-time model studied in Weber and Weiss (1990). We discuss how to adapt our results to the latter model in Section 6.

Model description
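The synchronous discrete-time restless bandit model with parameters $\{(P^0, P^1, R^0, R^1); \alpha, N\}$ is a Markov decision process (MDP) defined as follows:
1. The model is composed of N identical arms. Each arm evolves in the finite state space {1, . . ., d}. The state of the system at time t is $S(t) = (S_1(t), \dots, S_N(t))$.
2. At each time t ∈ N, the decision maker chooses αN arms to be activated. The decision is encoded by a vector $a(t) = (a_1(t), \dots, a_N(t)) \in \{0, 1\}^N$ with $\sum_{n=1}^{N} a_n(t) = \alpha N$, where $a_n(t) = 1$ means that arm n is activated.
3. Each arm evolves as a Markov chain whose transition matrix depends on the action taken: $\mathbb{P}\big(S_n(t+1) = j \mid S_n(t) = i, a_n(t) = a\big) = P^a_{ij}$. (1) Given a(t) and S(t), the N arms make their transitions independently.
4. For each arm that is in state i and for which action a ∈ {0, 1} is taken, a reward $R^a_i \in \mathbb{R}$ is earned.
We emphasize that the symmetric arms assumption can be relaxed in a straightforward way to a finite number of classes of arms. We then need to specify the initial proportion of arms in each class, and the parameters of each class will be given separately. The transition matrices will be k-block matrices if there are k classes of arms. We will study a 2-class bandit problem in detail later in Section 5.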
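The transition and reward rules of the model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `step` and `total_reward`, the 0-based state labels, and the two-state matrices below are hypothetical and do not come from the paper.

```python
import random

def step(states, actions, P0, P1, rng):
    """Sample S(t+1) from S(t) and a(t); each arm moves independently."""
    next_states = []
    for s, a in zip(states, actions):
        row = P1[s] if a == 1 else P0[s]  # transition row of the chosen action
        next_states.append(rng.choices(range(len(row)), weights=row)[0])
    return next_states

def total_reward(states, actions, R0, R1):
    """Sum of the instantaneous rewards R^a_i over the N arms."""
    return sum(R1[s] if a == 1 else R0[s] for s, a in zip(states, actions))

# Hypothetical two-state example (states are 0-based here, an assumption).
P0 = [[1.0, 0.0], [0.0, 1.0]]  # passive arms keep their state
P1 = [[0.0, 1.0], [1.0, 0.0]]  # active arms switch state
R0, R1 = [0.0, 0.0], [1.0, 0.0]
rng = random.Random(0)
states, actions = [0, 1, 0], [1, 0, 1]
next_states = step(states, actions, P0, P1, rng)
```

With these deterministic toy matrices the two activated arms switch state and the passive arm stays put, which makes the independence of the per-arm transitions easy to check by hand.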
The goal of the decision maker is to compute a decision rule in order to maximize the long-term expected average reward per period. The theory of stochastic dynamic programming Puterman (1994) shows that there exists an optimal policy which is stationary and deterministic (i.e. a(t) can be chosen as a time-independent deterministic function of S(t)). Denote by Π the set of such policies, which are maps from S to a. The optimization problem of the decision maker can be formalized as

$$V^{(N)}_{\mathrm{opt}}(\alpha) := \max_{\pi \in \Pi} \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=0}^{T-1} \sum_{n=1}^{N} R^{a_n(t)}_{S_n(t)}\right] \qquad (2)$$

$$\text{subject to } \sum_{n=1}^{N} a_n(t) = \alpha N, \text{ for all } t \in \mathbb{N}. \qquad (3)$$

In the above formulation and in what follows, the dependence of $a_n(t)$ on $S_n(t)$ based on a policy in Π should be understood. We also assume that the parameters of the model are such that the states of the N-arms bandit form a single aperiodic closed class, regardless of the policy employed. This assumption is mostly to simplify our discussion and is also used in Weber and Weiss (1990) to guarantee that neither the value of the optimization problem (2) nor the optimal policy depends on the initial state S(0) of the system at time 0. We call such a bandit an aperiodic recurrent bandit.

Indexability and Whittle index
In theory, a dynamic programming approach can be used to solve the optimization problem (2)-(3), but this approach is computationally intractable, as the numbers of possible states and actions grow exponentially with N. In fact, such problems have been proven to be PSPACE-hard in Papadimitriou and Tsitsiklis (1999). To overcome this difficulty, Whittle introduces in Whittle (1988) a very efficient heuristic known as the Whittle index policy (WIP). This heuristic is obtained by computing an index $\nu_i$ for each state i. At a given decision epoch, WIP activates the αN arms having currently the highest indices. We describe below how these indices are defined.
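Once the indices are known, the activation rule itself is a simple sort. The sketch below assumes the indices are given as a list `index` indexed by 0-based state labels; the function name `whittle_actions` and the tie-breaking rule (lowest arm number first) are illustrative assumptions, not part of the paper.

```python
def whittle_actions(states, index, budget):
    """Activate the `budget` arms whose current states have the highest index."""
    # Sort arm identifiers by the index of their current state, highest first.
    # sorted() is stable, so ties are broken by arm number (an assumption).
    order = sorted(range(len(states)), key=lambda n: index[states[n]], reverse=True)
    actions = [0] * len(states)
    for n in order[:budget]:
        actions[n] = 1
    return actions

# Hypothetical indices for three states: state 0 has the highest index.
index = [3.0, 1.5, -0.2]
actions = whittle_actions([2, 0, 1, 0], index, budget=2)
```

Here `budget` plays the role of αN, so the returned vector always contains exactly αN ones, as required by constraint (3).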
The index of an arm can be computed by considering each individual arm in isolation. For a given ν ∈ R, we define the ν-subsidized problem as the following MDP. The state space is the one of a single arm. At each time t, the decision maker chooses whether or not to activate this arm. As in the original problem, the arm evolves at time t according to (1). The difference lies in the passive action that is subsidized: If the arm is in state i and action 1 is taken, then as before, a reward $R^1_i$ is earned; if the arm is in state i and action 0 is taken, then a reward $R^0_i + \nu$ is earned. The goal of the decision maker is to maximize the long-term expected average reward per period (including passive subsidies). For a given ν ∈ R, let us denote by ω(ν) the set of states for which there exists an optimal policy of the ν-subsidized MDP such that the passive action is optimal in these states. Whittle indices are defined as follows:
Definition 1 (Indexability and Whittle index) A bandit is indexable if ω(ν) is increasing in ν, i.e. if for all ν ≤ ν′ we have ω(ν) ⊆ ω(ν′). In this case, the Whittle index of a state i, that we denote by $\nu_i$, is defined as the smallest subsidy such that the passive action is optimal in this state: $\nu_i := \inf\{\nu \in \mathbb{R} : i \in \omega(\nu)\}$.
Note that the value $\nu_i$ is finite since the state space is finite.
It should be emphasized that there exist restless bandit problems that are not indexable; we discuss this in more detail in Section 4.1. Note that when $P^0$ is the identity matrix, bandits are rested, i.e. the states of the arms that are not activated do not change. In such a case, a bandit is always indexable and the Whittle index coincides with the classical definition of the Gittins index, see Gittins et al (2011).

Whittle relaxation and asymptotic optimality
An intuition behind the definition of Whittle index is given by considering a relaxation of the original N arms problem (2) where the constraint (3) is replaced by $\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \sum_{n=1}^{N} a_n(t) = \alpha N$. This relaxed constraint imposes the time-averaged number of activated arms to be equal to αN. We denote by $V^{(N)}_{\mathrm{rel}}(\alpha)$ the value of the relaxed optimal control problem. It is given by the following optimization problem:

$$V^{(N)}_{\mathrm{rel}}(\alpha) := \max_{\pi} \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=0}^{T-1} \sum_{n=1}^{N} R^{a_n(t)}_{S_n(t)}\right] \qquad (4)$$

$$\text{subject to } \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \sum_{n=1}^{N} a_n(t) = \alpha N. \qquad (5)$$

While the original problem (2)-(3) is computationally hard to solve, the value $V^{(N)}_{\mathrm{rel}}(\alpha)$ after the relaxation is the solution to the following linear program:

$$V^{(N)}_{\mathrm{rel}}(\alpha) = \max_{x \ge 0} \; N \sum_{s,a} R^a_s\, x_{s,a} \qquad (6a)$$

$$\text{subject to } \sum_{a} x_{s,a} = \sum_{s',a} x_{s',a} P^a_{s's} \text{ for all } s, \quad \sum_{s} x_{s,1} = \alpha \quad \text{and} \quad \sum_{s,a} x_{s,a} = 1, \qquad (6b)$$

where $x_{s,a}$ is the steady state probability for an arm to be in state s and for which action a ∈ {0, 1} is chosen. See Verloop (2016); Gast et al (2022b) for more detailed discussion about this LP and how it is derived. The constraint (5) is weaker than the constraint (3). This shows that $V^{(N)}_{\mathrm{rel}}(\alpha)$ is an upper bound on the value of the original optimization problem (2). In fact, the next result shows that, as the number of arms grows, the value of the original problem converges to this value. This theorem justifies the relaxation (5) by showing that when the number of arms is large, the value of the optimization problem (2) is close to $V^{(N)}_{\mathrm{rel}}(\alpha)$.
Theorem 1 Consider an aperiodic recurrent discrete-time restless bandit model with N identical arms and such that the matrices $P^0$ and $P^1$ are rational. Then there exists a constant c > 0 that depends only on $P^0$, $P^1$, $R^0$, $R^1$ and α, such that for any N with αN being an integer, we have

$$V^{(1)}_{\mathrm{rel}}(\alpha) - \frac{c}{\sqrt{N}} \le \frac{V^{(N)}_{\mathrm{opt}}(\alpha)}{N} \le V^{(1)}_{\mathrm{rel}}(\alpha). \qquad (7)$$

Note that this theorem is the analogue of Theorem 1 of Weber and Weiss (1990), which proves that $\lim_{N \to \infty} V^{(N)}_{\mathrm{opt}}(\alpha)/N = V^{(1)}_{\mathrm{rel}}(\alpha)$ for the continuous-time bandit model that we will discuss in Section 6. To the best of our knowledge, the statement of this theorem in our setting of discrete-time bandit model is new. Moreover, our result shows that the convergence is at least in O(1/√N). For completeness, we provide a proof of Theorem 1 in Appendix A. It is an adaptation of the proof of (Weber and Weiss, 1990, Theorem 1): we use a similar coupling argument, although the coupling has to be adapted to our discrete-time setting, and we also need the additional aperiodic assumption on the model.
While Theorem 1 guarantees that the original optimization problem converges to the relaxation, it does not guarantee any result on the performance of WIP. This leaves one important question: At which speed does WIP become optimal? In the remainder of the paper we will show that, except in rare cases, when WIP is asymptotically optimal, it does so at exponential speed with the number of arms N. This complements Theorem 1 by proving that, under the same conditions, the convergence in (7) occurs at exponential rate.

Main Results
We first show in Section 3.1 that, when N is large, the stochastic system governed by WIP behaves like a piecewise affine deterministic system. We then present the exponential convergence result in Section 3.2. Later in Section 6 we will see how to extend this result to the classical model of continuous-time bandits of Weber and Weiss (1990).

Piecewise affine dynamics and definition of a singular point
To avoid ambiguity in the definition of WIP, we assume that the problem is strictly indexable. By this, we mean that there do not exist two states that have the same Whittle index. This is mostly a technical assumption that guarantees that there is a unique WIP.
Recall that the state space of a single arm is {1, . . ., d}, and assume without loss of generality that the states are already sorted according to their Whittle indices in decreasing order: $\nu_1 > \nu_2 > \dots > \nu_d$. We shall call a configuration of an N-arms system the vector representing the proportion of arms being in each state. Let $\Delta^d := \{m \in [0, 1]^d : \sum_{i=1}^{d} m_i = 1\}$ be the simplex of dimension d. A possible configuration of the system at a given time step can be represented by a point m in $\Delta^d$, where $m_i$ is the proportion of arms in state i ∈ {1, . . ., d}.
Our result on the rate at which WIP becomes asymptotically optimal depends on the property of the iterations of a deterministic map that we define below. Denote by $M^{(N)}(t)$ the N-arms system configuration at time t under WIP. The arms being time-homogeneous Markov chains, we can define a map $\phi : \Delta^d \to \Delta^d$ by $\phi_i(m) := \mathbb{E}\big[M^{(N)}_i(t+1) \mid M^{(N)}(t) = m\big]$ for all i ∈ {1, . . ., d} and $m \in \Delta^d$. It is the expected proportion of arms going to state i at time t+1 under WIP, knowing that the system was in configuration m at time t. This map has the following properties:
Lemma 2 Assume that the bandit is indexable. Then:
(i) The definition of ϕ does not depend on N (as long as αN is an integer) nor on t.
(ii) ϕ is a piecewise affine function, with d affine pieces, and ϕ is Lipschitz continuous.
(iii) ϕ has a unique fixed point: there exists a unique $m \in \Delta^d$ such that ϕ(m) = m.

Sketch of proof
The full details of the proof are provided in Appendix B. We only describe the main ingredients here.
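Proof of (i) and (ii) - For a given configuration $m \in \Delta^d$, define s(m) ∈ {1, . . ., d} to be the state such that $\sum_{i=1}^{s(m)-1} m_i \le \alpha < \sum_{i=1}^{s(m)} m_i$, with the convention that $\sum_{i=1}^{0} m_i = 0$. WIP activates arms by decreasing index order. This means that when the system is in configuration m, WIP will activate all arms that are in states 1 to s(m) − 1, and $N\big(\alpha - \sum_{i=1}^{s(m)-1} m_i\big)$ arms that are in state s(m). The rest of the arms will not be activated. This means that the map ϕ satisfies:

$$\phi(m) = \sum_{i=1}^{s(m)-1} m_i P^1_i + \Big(\alpha - \sum_{i=1}^{s(m)-1} m_i\Big) P^1_{s(m)} + \Big(\sum_{i=1}^{s(m)} m_i - \alpha\Big) P^0_{s(m)} + \sum_{i=s(m)+1}^{d} m_i P^0_i,$$

where $P^a_i$ denotes the i-th row of the matrix $P^a$. The above expression of ϕ implies that this map is affine on each zone $Z_i := \{m \in \Delta^d : s(m) = i\}$, and there are d such zones. Moreover, the value of ϕ coincides on the intersection of zones, hence ϕ is continuous.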
Proof of (iii) - This part of the proof is more involved, and it relies on indexability. The details are given in Appendix B where we show that indexability implies a monotonic property of ϕ that we use to obtain uniqueness. □
In what follows, we will denote by $m^*$ the unique fixed point of ϕ. As we will see in Theorem 3, the rate at which WIP becomes asymptotically optimal depends on: (1) whether the iterations of ϕ converge to $m^*$, (2) whether $m^*$ lies strictly inside a zone $Z_i$. Concerning the second property, we will call a point m singular if there exists i ∈ {1, . . ., d} such that $\sum_{j=1}^{i} m_j = \alpha$. Said otherwise, a fixed point is singular if it is on the boundary of two zones.
Fig. 1: An example with d = 3. When α = 0.4 (Figure 1a) the fixed point is singular, while for α = 0.5 (Figure 1b) it is not singular.
In Figure 1, we illustrate the notion of singular fixed point by an example in dimension d = 3. As $m_1 + m_2 + m_3 = 1$, the simplex $\Delta^3$ can be represented in a 2-dimensional space as $\Delta^2_c$, where $\Delta^d_c$ is the unit d-simplex and its interior. Our convention is that the x-coordinate of a point corresponds to $m_3$ (the proportion of arms in state 3), and the y-coordinate corresponds to $m_2$ (the proportion of arms in state 2). The colored dotted lines of Figures 1a and 1b consist of singular points. These lines partition the different zones $Z_i$. The partition of zones, as well as the position of the unique fixed point, depend on α. For this example, when α = 0.4 (Figure 1a), the fixed point is singular, while for α = 0.5 (Figure 1b), it is non-singular (all the other parameters in these two figures are the same, and are available in our Git repository).
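The map ϕ and the singularity test are easy to evaluate numerically. The following is a minimal sketch, assuming that states are 0-based and already sorted by decreasing Whittle index, and that the transition matrices are given as lists of rows; the names `phi` and `is_singular`, the tolerance, and the two-state matrices are illustrative assumptions.

```python
def phi(m, alpha, P0, P1):
    """Expected next configuration under WIP (states sorted by decreasing index)."""
    d = len(m)
    out = [0.0] * d
    budget = alpha  # fraction of arms that can still be activated
    for i in range(d):
        active = min(m[i], budget)  # mass of state i that WIP activates
        passive = m[i] - active     # mass of state i left passive
        budget -= active
        for j in range(d):
            out[j] += active * P1[i][j] + passive * P0[i][j]
    return out

def is_singular(m, alpha, tol=1e-9):
    """True if some partial sum m_1 + ... + m_i equals alpha (up to tol)."""
    cum = 0.0
    for mass in m:
        cum += mass
        if abs(cum - alpha) <= tol:
            return True
    return False

# Hypothetical two-state model: passive arms stay, active arms switch state.
P0 = [[1.0, 0.0], [0.0, 1.0]]
P1 = [[0.0, 1.0], [1.0, 0.0]]
image = phi([0.3, 0.7], 0.5, P0, P1)
```

Filling the activation budget greedily from the highest-index state downwards reproduces the description of WIP given in the proof of Lemma 2: all states before the threshold state are fully activated, the threshold state is split, and the remaining states are passive. The floating-point tolerance in `is_singular` is a practical choice; an exact test would require rational arithmetic.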

Exponential convergence rate
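We are now ready to state our main theorem. Assume indexability. At a given time t, WIP sorts all arms according to the Whittle indices $\nu_{S_n(t)}$ and activates the αN arms that have the highest indices. We denote the long-term average expected reward of WIP as

$$V^{(N)}_{\mathrm{WIP}}(\alpha) := \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=0}^{T-1} \sum_{n=1}^{N} R^{a_n(t)}_{S_n(t)}\right],$$

where for all t, a(t) is chosen according to WIP.
Let $\Phi_t$ be defined as the t-th iteration of the map ϕ, i.e. $\Phi_0(m) = m$ and $\Phi_{t+1}(m) = \phi(\Phi_t(m))$ for all t ≥ 0.
Recall that $m^*$ is the unique fixed point of ϕ. As stated in the next theorem, the asymptotic optimality of WIP is guaranteed when $m^*$ attracts all trajectories of $\Phi_{t \ge 0}(\cdot)$. In the rest of the paper, unless otherwise specified, we use ∥ • ∥ to denote the $L^\infty$-norm of a vector.
Theorem 3 Consider an aperiodic recurrent discrete-time restless bandit model such that:
(i) the bandit is indexable;
(ii) the fixed point $m^*$ is not singular;
(iii) $m^*$ is a global attractor of ϕ, i.e. for all $m \in \Delta^d$, $\lim_{t \to \infty} \Phi_t(m) = m^*$;
(iv) $m^*$ is locally stable.
Then there exist two constants b, c > 0 that depend only on $P^0$, $P^1$, $R^0$, $R^1$ and α, such that for any N with αN being an integer,

$$0 \le V^{(1)}_{\mathrm{rel}}(\alpha) - \frac{V^{(N)}_{\mathrm{WIP}}(\alpha)}{N} \le b \cdot e^{-cN}. \qquad (9)$$

Recall that $V^{(N)}_{\mathrm{rel}}(\alpha) = N\, V^{(1)}_{\mathrm{rel}}(\alpha)$ is the value of the relaxed problem (4)-(5).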

Sketch of proof
The full details of the proof are given in Appendix C. We first transform the evaluation of the performance to the analysis of the configuration of the bandit system. We then show that in stationary regime the expectation of $M^{(N)}(0)$ concentrates exponentially fast on the fixed point $m^*$. More precisely, there exist constants b′, c′ > 0 such that $\|\mathbb{E}[M^{(N)}(0)] - m^*\| \le b' e^{-c'N}$. In order to show this:
• We first use Hoeffding's inequality in Lemma 10 to show that for any configuration m, the probability that $M^{(N)}(1)$ deviates from ϕ(m) by more than ϵ, given $M^{(N)}(0) = m$, decreases exponentially fast with N.
• By Lipschitz continuity of ϕ, for a time t, we apply Lemma 10 to prove Lemma 11, which bounds $\mathbb{P}\big(\|M^{(N)}(t) - \Phi_t(M^{(N)}(0))\| \ge \epsilon\big)$ by a term that depends on t but decreases exponentially fast with N.
• As $m^*$ is an attractor that is locally stable, this implies that when t is large enough, $M^{(N)}(t)$ is within a neighborhood $\mathcal{N}$ of $m^*$ with very high probability. As $m^*$ is non-singular, this neighborhood can be taken to be within a zone $Z_i$ on which ϕ is affine. We will choose this neighborhood $\mathcal{N}$ carefully and make sure that its choice does not depend on N. We then deduce an exponentially small upper bound for the probability of $M^{(N)}(0)$ in stationary regime being outside $\mathcal{N}$ (see Subsection C.4.3), which allows us to restrict our attention to a zone where ϕ is affine.
• The result then follows by using Stein's method on the process restricted to this affine zone, which shows that conditional on starting inside the neighborhood $\mathcal{N}$, the additive long-term distance between the large N stochastic trajectory and the deterministic trajectory is exponentially small (see Subsection C.4.4). □
We give here some comments on the assumptions of Theorem 3; their practical relevance will be discussed in detail in Section 4.1. To prove that WIP is asymptotically optimal in the continuous-time case, Weber and Weiss (1990) assume two conditions: that the bandit is indexable (Assumption (i)) and that $m^*$ is a global attractor (Assumption (iii)). To prove our result, we require two additional assumptions: (ii) the non-singular condition on $m^*$, which is almost always satisfied (see Section 4.1); and (iv) that $m^*$ is locally stable. We conjecture that condition (iii) implies condition (iv) here, but we leave this question for future work. In conclusion, our conditions are almost identical to the ones needed in Weber and Weiss (1990).
Note that the most difficult assumption to verify is point (iii) that requires $m^*$ to be a global attractor, as there is no general method to exclude cyclic or chaotic behaviors from a dynamical system. It is shown in Blondel et al (2001) that global properties of continuous piecewise affine functions in $\mathbb{R}^n$ are undecidable in general. Note that as the piecewise affine maps Φ induced by WIP form a subclass of piecewise affine functions, this does not imply that testing the global attractor property in our case is undecidable. In fact, there exist special cases for which showing the global attractor property is relatively easy. In general, this is done by finding a Lyapunov function or exhibiting some monotonicity property.
Assuming that a given map satisfies the global attractor property (iii) is in fact quite common in the literature (see for instance Weber and Weiss (1990) and most of the papers that use their results). Even if one cannot show mathematically that the global attractor property holds, it can still be tested numerically in an efficient way. In practice, iterating the map from a sufficiently large number of initial conditions suffices to support (or refute) this assumption. Moreover, this condition is almost a necessary condition in the sense that there exist examples that satisfy all assumptions of Theorem 3 except this one and for which WIP is not asymptotically optimal (more comments on this in Remark 3). As already discussed in Verloop (2016), it is a challenging question how to design (non index-type) policies that are asymptotically optimal without the global attractor property.
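Such a numerical test can be sketched as follows: iterate ϕ from many random initial configurations and check that all trajectories settle on the same point. The code is only an illustration under stated assumptions: the function names, the sampling scheme (uniform on the simplex via normalized exponentials), the iteration counts, the tolerances and the two toy models are hypothetical choices, and a positive answer is numerical evidence, not a proof.

```python
import random

def phi(m, alpha, P0, P1):
    """Expected next configuration under WIP (states sorted by decreasing index)."""
    d = len(m)
    out = [0.0] * d
    budget = alpha
    for i in range(d):
        active = min(m[i], budget)
        passive = m[i] - active
        budget -= active
        for j in range(d):
            out[j] += active * P1[i][j] + passive * P0[i][j]
    return out

def looks_like_global_attractor(alpha, P0, P1, n_starts=100, n_iter=400,
                                tol=1e-9, seed=0):
    """Heuristic check: do all sampled trajectories reach one common fixed point?"""
    rng = random.Random(seed)
    d = len(P0)
    limit = None
    for _ in range(n_starts):
        w = [rng.expovariate(1.0) for _ in range(d)]
        total = sum(w)
        m = [x / total for x in w]  # uniform sample on the simplex
        for _ in range(n_iter):
            m = phi(m, alpha, P0, P1)
        nxt = phi(m, alpha, P0, P1)
        if max(abs(a - b) for a, b in zip(nxt, m)) > tol:
            return False  # trajectory has not settled on a fixed point
        if limit is None:
            limit = m
        elif max(abs(a - b) for a, b in zip(limit, m)) > 100 * tol:
            return False  # two starts end at different points
    return True

# Toy models: uniform transitions (constant map) and a pure swap (2-cycles).
U = [[0.5, 0.5], [0.5, 0.5]]
SWAP = [[0.0, 1.0], [1.0, 0.0]]
uniform_ok = looks_like_global_attractor(0.5, U, U)
swap_ok = looks_like_global_attractor(0.5, SWAP, SWAP)
```

The swap model is periodic and is used here only to show a case where the check fails; slowly converging maps may need a larger `n_iter` before the test becomes informative.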
Remark 2 The singular case. The non-singularity of the fixed point $m^*$ is also necessary in the sense that the following simple example satisfies all the assumptions of Theorem 3 except this one and does not satisfy (9). Consider the following 2-state bandit problem with $P^0 = P^1 = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}$, $R^0 = (0, 0)$, $R^1 = (1, 0)$, and α = 0.5.
It should be clear that $V^{(1)}_{\mathrm{rel}}(\alpha) = 0.5$. In stationary regime, the configuration $M^{(N)}$ of the system of size N is distributed independently from the policy employed. Moreover, WIP will activate in priority the arms in state 1. This implies that the reward of WIP will be $V^{(N)}_{\mathrm{WIP}}(\alpha)/N = \mathbb{E}\big[\min(M^{(N)}_1, 0.5)\big]$. As $N M^{(N)}_1$ follows a binomial distribution of parameter (N, 0.5), the central limit theorem shows that

$$\lim_{N \to \infty} \sqrt{N} \left( V^{(1)}_{\mathrm{rel}}(\alpha) - \frac{V^{(N)}_{\mathrm{WIP}}(\alpha)}{N} \right) = \frac{1}{2}\, \mathbb{E}[\max(G, 0)] = \frac{1}{2\sqrt{2\pi}},$$

where G is a standard normal random variable. This example shows that, in a case where $m^*$ is singular, the convergence in (9) may occur at rate Θ(1/√N) and not at exponential rate. Note on the other hand that if we take instead α ≠ 0.5, then $V^{(N)}_{\mathrm{WIP}}(\alpha)/N$ converges to $V^{(1)}_{\mathrm{rel}}(\alpha) = \min(\alpha, 0.5)$ at exponential rate, due to the fact that almost all the mass of a Gaussian distribution is concentrated around its mean value 0.5 (which is different from α).
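The two regimes of this example can be checked with exact arithmetic, since the number of arms in state 1 is binomial. The sketch below is an illustration only; the function name `wip_gap` and the choice to pass the number of activated arms A = αN as an integer are assumptions made here. For α = 0.5 the rescaled gap √N · gap should approach 1/(2√(2π)) ≈ 0.1995, while for α = 0.4 the gap itself should vanish quickly.

```python
from fractions import Fraction
from math import comb, pi, sqrt

def wip_gap(N, A):
    """Exact V_rel^(1) - V_WIP^(N)/N in the two-state example, A = alpha*N."""
    # The number X of arms in state 1 is Binomial(N, 1/2) in stationary regime,
    # and WIP earns min(X, A) per time step.
    expected_reward = Fraction(
        sum(comb(N, k) * min(k, A) for k in range(N + 1)), 2 ** N)
    relaxed_value = Fraction(min(2 * A, N), 2)  # N * min(alpha, 1/2)
    return float((relaxed_value - expected_reward) / N)

clt_limit = 1 / (2 * sqrt(2 * pi))
for N in (10, 100, 1000):
    singular = sqrt(N) * wip_gap(N, N // 2)   # alpha = 0.5, singular case
    regular = wip_gap(N, 2 * N // 5)          # alpha = 0.4, non-singular case
    print(N, singular, regular)
```

Using `Fraction` keeps the binomial sums exact, so the very small gaps of the non-singular case are not polluted by floating-point cancellation.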
Remark 3 Cyclic and chaotic behaviors.Although the drift ϕ is piecewise affine and has a unique fixed point, the long run behavior of the deterministic dynamical system m(t+1) = ϕ(m(t)) can be cyclic or chaotic.In these cases, the fixed point is no longer a global attractor, and the performance of WIP is in general not asymptotically optimal.
More precisely, when the dynamical system admits a cycle as a global attractor for almost every initial configuration in the simplex, then, as suggested in Weber and Weiss (1990), one can infer a cyclic version of Theorem 3: the performance of WIP converges to the average reward on the cycle. This average reward is in general strictly smaller than $V^{(1)}_{\mathrm{rel}}(\alpha)$, while $V^{(N)}_{\mathrm{opt}}(\alpha)/N$ always converges to $V^{(1)}_{\mathrm{rel}}(\alpha)$, regardless of the behavior of the deterministic system (from Theorem 1). Consequently, when cycles appear, the performance of WIP is asymptotically sub-optimal.
Remark 4 What happens when αN is not an integer. The exponential convergence rate in Theorem 3 assumes that αN is an integer. When this is not the case, a decision maker cannot activate exactly αN arms at each time step. There are three natural ways to define the model in such cases: (1) activate ⌊αN⌋ arms; (2) activate ⌈αN⌉ arms; (3) activate ⌊αN⌋ arms, plus one more arm with probability αN − ⌊αN⌋. As we further discuss in Section 4.3, the convergence rate in the first two solutions is much slower than in the third solution.
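The three rounding rules can be sketched as follows. This is an illustration only: the function name `activation_budget`, the rule labels and the use of `Fraction` for α (to avoid floating-point error in αN) are choices made here, not taken from the paper.

```python
import math
import random
from fractions import Fraction

def activation_budget(alpha: Fraction, n_arms: int, rule: str, rng: random.Random) -> int:
    """Number of arms to activate at one time step when alpha * N may be fractional."""
    target = alpha * n_arms          # exact rational value of alpha * N
    low = math.floor(target)
    if rule == "floor":              # solution (1)
        return low
    if rule == "ceil":               # solution (2)
        return math.ceil(target)
    if rule == "probabilistic":      # solution (3)
        # One extra arm with probability equal to the fractional part of alpha * N,
        # so that the expected number of activated arms is exactly alpha * N.
        return low + (1 if rng.random() < target - low else 0)
    raise ValueError(f"unknown rule: {rule}")
```

Only the probabilistic rule activates αN arms on average. The floor and ceil rules are off by a deterministic amount of at most one arm, that is, by O(1/N) in proportion.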
Remark 5 Finding optimal constants.Theorem 3 claims the existence of constants b and c for which the inequality (9) holds true, but we do not emphasize on the optimality of the constant c, in the sense of finding constant c such that lim sup Our choice of c in the proof of Theorem 3 provided in Appendix C actually depends subtly on the given parameters, and we believe that finding c is, if not impossible, a much more demanding task.Nevertheless, later on in Section 4.2 we shall illustrate via numerical examples that the approximate value of c is affected by the level of singularity of the fixed point, which in turn is affected by the value of α, if all the other parameters P 0 , P 1 , R 0 , R 1 are fixed.

Numerical Experiments
In this section, we first provide statistical results to justify the conditions needed for Theorem 3, and then verify numerically the exponential convergence rate for a general 3-state restless bandit model with non-singular fixed points. We also evaluate numerically the convergence rate for a singular fixed point example. Finally, we investigate the situation when αN is not an integer.

How general is the general case?
The exponential convergence rate for the performance of WIP on a restless bandit problem is very desirable. However, several conditions have to be verified beforehand, listed in order as:
(C1) The restless bandit problem is indexable;
(C2) The unique fixed point is not singular;
(C3) The unique fixed point is a global attractor;
(C4) The unique fixed point is locally stable.
Condition (C1) is mostly verified through the specific structure of the restless bandit problem and by using various techniques that are model dependent; a general method for testing indexability is also presented in Gast et al (2022a). For Condition (C2), checking the singularity condition is straightforward, as it amounts to checking whether the sum of the first $s(m^*)$ coordinates of $m^*$ (after the Whittle index reordering) is α. Moreover, being in an exactly singular situation is improbable (for a given problem, the activation ratio α can only be singular if it satisfies an equality constraint). More generally, we also observe that the "closer" the fixed point is to a singular situation, the smaller the coefficient c of the exponential rate in Theorem 3 tends to be. This point will be made more precise in the next subsection.
As indicated before, Condition (C3) is more complicated to verify. In our implementation, we verify numerically that Condition (C3) holds by (1) testing whether the fixed point is locally stable, and (2) simulating the dynamics from a large number of initial conditions over a long horizon.
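Step (2) can be sketched as below. This is a minimal illustration and not the authors' implementation: the map `phi` is written from the verbal description of WIP (activate states in priority order until a proportion α of arms is active), it assumes that states are already indexed by decreasing Whittle index, and all function names and default parameters are hypothetical. A positive outcome is numerical evidence only, not a proof of global attraction.

```python
import random

def phi(m, P0, P1, alpha):
    """One step of the deterministic map: strict priority to low-numbered states."""
    d = len(m)
    budget = alpha                       # proportion of arms still to be activated
    nxt = [0.0] * d
    for i in range(d):
        active = min(m[i], budget)       # mass of state i that is activated
        budget -= active
        passive = m[i] - active
        for j in range(d):
            nxt[j] += active * P1[i][j] + passive * P0[i][j]
    return nxt

def iterate(m, P0, P1, alpha, horizon):
    for _ in range(horizon):
        m = phi(m, P0, P1, alpha)
    return m

def random_simplex_point(d, rng):
    w = [rng.expovariate(1.0) for _ in range(d)]
    total = sum(w)
    return [x / total for x in w]

def attractor_evidence(P0, P1, alpha, n_starts=200, horizon=500, tol=1e-8, seed=0):
    """True if all sampled trajectories end within tol of one common fixed point."""
    rng = random.Random(seed)
    d = len(P0)
    reference = iterate([1.0 / d] * d, P0, P1, alpha, horizon)
    image = phi(reference, P0, P1, alpha)
    if max(abs(a - b) for a, b in zip(reference, image)) > tol:
        return False                     # the reference trajectory has not settled
    for _ in range(n_starts):
        end = iterate(random_simplex_point(d, rng), P0, P1, alpha, horizon)
        if max(abs(a - b) for a, b in zip(end, reference)) > tol:
            return False                 # at least one trajectory does not converge
    return True
```

The horizon and the tolerance have to be chosen jointly: a slowly contracting system may need a longer horizon before the test above becomes informative.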
As for Condition (C4), local stability is easy to verify numerically when $m^*$ is not singular: indeed, in this case the dynamical system is affine in a neighborhood of $m^*$: $\phi(m) = (m - m^*) \cdot K_{s(m^*)} + m^*$, where $K_{s(m^*)}$ is a matrix of dimension d obtained from (8). The dynamical system is locally stable if $K_{s(m^*)}$ is a stable matrix, i.e. if the modulus of all eigenvalues of $K_{s(m^*)}$ is less than 1. If $K_{s(m^*)}$ is not a stable matrix, then in most cases the fixed point will not be a global attractor and an attracting cycle will appear.
To give an idea of how general these conditions are, we generate a large number of discrete-time restless bandit problems by choosing random parameters $(P^0, P^1, R^0, R^1)$ in dimensions d ∈ {3, 4, 5, 6, 7}, and we estimate how rarely the above conditions are violated. More precisely, for each d, we randomly generate $10^7$ instances of $(P^0, P^1, R^0, R^1)$, using a uniform distribution in [0, 1] for the rewards, and a uniform distribution over the simplex $\Delta_d$ for the probability vectors $P^0_i$ and $P^1_i$. We then count the number of instances that violate conditions (C1) or (C4); the results are reported in Table 1. This table shows that the proportion of models that satisfy the conditions is more than 99.8% for d = 3; when d = 7, all generated models (among $10^7$) satisfy our conditions. In our tests, the number of indexable instances such that $m^*$ is not locally stable means the number of models for which there exists α ∈ (0, 1) such that $m^*$ is not locally stable. This can be determined by testing each of the d matrices $K_i$. Note that for all the locally stable examples in Table 1, the corresponding $m^*$ also appears to be a global attractor (numerically). However, we should point out that it is possible to construct examples for which $m^*$ is locally stable while not being a global attractor. Such examples have special structures and are hard to find if we generate the parameters uniformly.
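The random generation of instances described above can be sketched as follows. The text specifies only the distributions, so the sampler below is an assumption: it uses the standard fact that normalised independent exponential variables are uniformly distributed on the simplex, and the names `uniform_simplex` and `random_bandit` are hypothetical.

```python
import random

def uniform_simplex(d, rng):
    """Uniform sample from the probability simplex of dimension d."""
    # Normalised i.i.d. Exp(1) variables follow the Dirichlet(1, ..., 1) law,
    # which is the uniform distribution on the simplex.
    w = [rng.expovariate(1.0) for _ in range(d)]
    total = sum(w)
    return [x / total for x in w]

def random_bandit(d, rng):
    """Random instance (P0, P1, R0, R1): uniform rows on the simplex, rewards in [0, 1]."""
    P0 = [uniform_simplex(d, rng) for _ in range(d)]
    P1 = [uniform_simplex(d, rng) for _ in range(d)]
    R0 = [rng.random() for _ in range(d)]
    R1 = [rng.random() for _ in range(d)]
    return P0, P1, R0, R1
```

Normalising independent uniform variables instead would not give the uniform distribution on the simplex, which is why exponential variables are used here.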

The influence of how non-singular a fixed point is
To test how the "non-singularity" of the fixed point $m^*$ affects the convergence rate, we consider the example displayed in Figure 1 with varying values of α in the range between 0.20 and 0.50. We emphasize that the fixed point $m^* = m^*(\alpha)$ is then a function of α. Numerically, these fixed points are global attractors for two reasons:
• All matrices $K_i$ are stable, because the eigenvalues of $K_2$ are {1, −0.4..., 0.08...} while $K_1 = P^0$ and $K_3 = P^1$ are always stable matrices.
• For all tested values of α, we simulated $\Phi_t(m)$ from random initial points m and they all converge to the corresponding fixed point $m^*$.
Moreover, as already shown in Figure 1, the fixed point $m^*$ is singular when α = 0.4, and it is non-singular for any other value of α ∈ [0.2, 0.5]. This implies that all assumptions of Theorem 3 are satisfied when α ≠ 0.4. As $V^{(N)}_{\mathrm{rel}}(\alpha)$ depends on the value of α, to make better comparisons, we consider the quantity $V^{(N)}_{\mathrm{WIP}}(\alpha)/V^{(N)}_{\mathrm{rel}}(\alpha)$, which is the normalized performance of WIP with respect to the relaxation upper bound. In Figure 2a, we choose four values of α, namely 0.2, 0.3, 0.4 and 0.5, and plot the normalized performance as a function of the number of arms N, which takes values in the multiples of 10. The values of $V^{(N)}_{\mathrm{WIP}}(\alpha)$ are computed by simulation. We repeat each simulation enough times that the 95% confidence intervals become negligible and cannot be seen in the figures. In Figure 2b, we fix the value of N and plot the normalized performance as a function of α, where α varies over [0.3, 0.5] with a step size of 1/N: α ∈ {0.3, 0.3 + 1/N, 0.3 + 2/N, ..., 0.5} (so that αN is always an integer).
These two figures suggest that the convergence rate is related to how far $m^*$ is from the closest boundary between two zones (i.e. how non-singular it is). Here is an intuitive explanation for this phenomenon: the stochastic system in equilibrium wanders around the fixed point $m^*$ that gives the optimal reward. If $m^*$ is near a boundary, the stochastic trajectory is more likely to jump into a neighboring polytope $Z'$, in which case another affine drift applies, and this may take the trajectory away from $m^*$.
To examine the convergence rate more closely, let us consider the quantity subgap(N) defined in (10). Theorem 3 implies that subgap(N) converges to 0 approximately as $b \cdot e^{-c \cdot N}$, for some constants b, c > 0, in non-singular cases. In Figure 3, we plot in log scale the subgap (10) as a function of N for the same model as in Figure 2 and α = 0.2, 0.3 and 0.5. For each value of α, we also plot the best fit $b' \cdot e^{-c' N}$, which is a straight line in log scale. The fitted constant $c'$ is around 0.03 for α = 0.3 and 0.5, and around 0.125 for α = 0.2. However, in the singular case α = 0.4, we could not find a straight line to fit log subgap(N). But if we plot instead subgap(N) · √N, the curve behaves like a constant. Moreover, this constant behavior is lost as soon as we plot subgap(N) · $N^{\beta}$ with a power β = 0.49 or β = 0.51. This gives numerical evidence for an O(1/√N) convergence rate in this singular case, as for the example given in Remark 2. Actually, we believe that the convergence rate is O(1/√N) for all singular global attractor situations, but proving this claim remains open.
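The two diagnostics used in this subsection (a straight-line fit in log scale, and a rescaling by a power of N) can be sketched as follows. This is an illustration on generic data: the function names are hypothetical, and ordinary least squares is an assumption, since the text does not say how the best fit was computed.

```python
import math

def fit_exponential(ns, gaps):
    """Least-squares fit of log(gap) = log(b) - c * N; returns (b, c)."""
    ys = [math.log(g) for g in gaps]
    n = len(ns)
    mean_x = sum(ns) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in ns)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ns, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def rescaled(ns, gaps, beta):
    """gap(N) * N**beta; roughly flat in N when the gap decays like N**(-beta)."""
    return [g * n ** beta for n, g in zip(ns, gaps)]
```

On data that decay exactly like b·exp(−cN) the fit recovers b and c, while on data that decay like 1/√N only the rescaling with β = 0.5 is flat.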

Non integer values of αN
Our previous analysis rely on the assumption that αN is an integer.Let us briefly discuss in this subsection how to deal with non integer values of αN for the optimization problem ( 2 WIP (α) if αN is an integer, but otherwise are different in general.Numerically, we discover that the average reward when always activating ⌊αN ⌋ arms or always activating ⌈αN ⌉ arms will be at distance Here is an informal explanation: Let ϕ rounding (m) = E M (N ) i (t + 1) | M (N ) (t) = m when any of the three rounding policy among floor, ceil, or probabilistic is used.When the rounding is probabilistic, it is not hard to show that ϕ probabilistic (m) = ϕ(m), where ϕ(•) is defined as in Equation ( 8) of the proof of Lemma 2. In contrast, ϕ f loor (m) = ϕ(m) + O(α − ⌊αN ⌋/N ).This shows that if the map ϕ has a unique non-singular attractor m * , then as N goes to infinity, the maps ϕ rounding also have a unique nonsingular attractor, that is equal to m * for the probabilistic rounding and at distance O(1/N ) of m * for floor or ceil.Moreover, the proof of Lemma 10 and Lemma 11 in the appendix can be adapted to obtain a concentration bound around ϕ rounding for all policies.This guarantees an exponential convergence rate on the performance of WIP to performance on the attractor, for any of these three policies.Consequently we have To further illustrate these points, we consider in Figure 4 the same example as in Section 4.2, with α = 0.3.As in Figure 2, the green curve represents V (N ) WIP (α)/N for N being a multiple of 10.Here, we extend this curve to all N being a multiple of 5, using the three possible rounding.The values of WIP (⌈N α⌉/N )/N are plotted respectively in blue, green and red dots for N ∈ {25, 35, 45, . . 
.}, while their values coincide for N being a multiple of 10 (which explains the zigzag of the orange and blue curves).We observe that the differences when N → ∞ and {N α} ≡ 0.5, i.e.N = 5 • (2k + 1).The behavior is quite different for the probabilistic rounding (green curve).Indeed, in this case we cannot distinguish when αN is an integer or not.This indicates that V

Application: Markovian Fading Channels
The Markovian fading channel is a typical synchronous discrete-time restless bandit model. Strictly speaking, this model has a countably infinite state space, so some approximation is needed, as we discuss later. In Ouyang et al (2012), a two-class channel problem has been studied. Using the same scaling as here, the authors of Ouyang et al (2012) have proven the asymptotic optimality of WIP for this model, after verifying the global attractor property of the deterministic system. In this section, we take a step further: we evaluate numerically the convergence rate of the performance and verify whether it is exponential, as claimed in Theorem 3.
Let us first briefly review this two-class channel model (more details can be found in Ouyang et al (2012)). A Gilbert-Elliott channel is modeled as a two-state Markov chain with a bad state 0 and a good state 1. Two classes of channels are available, with the transition probability matrix for class $k \in \{1, 2\}$ given by
$$\begin{pmatrix} 1 - r_k & r_k \\ 1 - p_k & p_k \end{pmatrix},$$
with rows and columns ordered as (bad state 0, good state 1), where $p_k$ is the probability of a class k channel being in good state at time t + 1 if it was in good state at time t, and $r_k$ is the probability of being in good state if one time step ago it was in bad state.
We assume the channels are positively correlated, namely $p_k > r_k$ for k = 1, 2. We consider a total population of N channels, of which a proportion β are from class 1. Due to limited resources, at each time we can only activate a proportion α of the channels, and only a channel in good state under activation can transmit data. We assume that we can observe the state of a channel only when it is activated. Otherwise, we keep track of the state of a channel by using a belief value $b^k_{s,t}$, where k = 1, 2, s = 0, 1 and t ≥ 1. The value $b^k_{s,t}$ is the probability for a class k channel to be in good state, given that it was last activated (hence observed) t time steps ago and was observed to be in state s.
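The belief values follow from the definitions of $p_k$ and $r_k$ by conditioning on the state one step earlier. The sketch below is derived from those definitions only (it is not the expression given in the paper), and the function names are hypothetical; it computes the belief both by recursion and through the closed form of a two-state chain.

```python
def belief(p, r, s, t):
    """Probability of being in the good state t steps after observing state s.

    p = P(good at t+1 | good at t), r = P(good at t+1 | bad at t).
    """
    b = float(s)                      # observed state: 1 = good, 0 = bad
    for _ in range(t):
        b = b * p + (1.0 - b) * r     # one step of the two-state chain
    return b

def belief_closed_form(p, r, s, t):
    """Same quantity via the geometric convergence to the stationary probability."""
    stationary = r / (1.0 - p + r)
    return stationary + (s - stationary) * (p - r) ** t
```

For positively correlated channels (p > r), the belief after a bad observation increases with t towards the stationary probability, and the belief after a good observation decreases towards it.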

The expression of b
To cast this channel model into a discrete-time restless bandit problem, we treat each channel as an arm, and its state space is the whole set of possible values of b k s,t 's.The transition matrices P 0 , P 1 can then be naturally written down: all other probabilities being 0.
We evaluate the performance by the throughput of the system, hence we obtain a reward of 1 each time we activate a channel and it is in good state.Under the MDP framework, this is equivalent to assuming that state b k s,t gives a reward b k s,t under activation.It is shown in Ouyang et al (2012) that this problem is indexable, and that Whittle index can be calculated explicitly (via techniques due to the specific structure of the model).The index of a state b k s,t is denoted by ν(b k s,t ) and is equal to: We remark that for k = 1, 2, the index value ν(b k 0,t ) is an increasing function of t, and furthermore ν(b k 0,t ) , for any t ′ ≥ 1.We shall also point out that the relative orders of the index values ν b k s,t between two classes k = 1 and k = 2 could be different from the orders of the belief values b k s,t .This indicates an interaction between classes and makes the Whittle indices for this model interesting.
The reader might have noticed that two assumptions of Theorem 3 are violated: first, the restless bandit model we consider here has a countably infinite state space; second, not all arms are identical (there are two classes of arms). The first point might raise some technical difficulties that we have not encountered in our previous finite-state model. However, it can be shown that the states $b^k_{0,t}$ for t large are extremely rarely visited, hence using a threshold $t^*$ and ignoring all states $b^k_{s,t}$ with $t > t^*$ (i.e. treating them as $b^k_{s,t^*}$) makes a negligible difference. Concerning the two classes of arms, we argue that they can be represented by a single class of arms by considering a larger state space: the state of an arm would be $(k, b^k_{s,t})$, where k is its class and $b^k_{s,t}$ is its belief value. Compared to our model, in this new case, the arms are no longer recurrent, as an arm of class k cannot become an arm of class k′ ≠ k. This implies that the quantities $V^{(N)}_{\mathrm{WIP}}(\alpha)$ and $V^{(N)}_{\mathrm{rel}}(\alpha)$ depend on the initial condition of the system, i.e. on the fraction β of arms that are in class 1. Apart from that, our results apply mutatis mutandis to this case.
We can now provide some numerical results. We choose a parameter set that is used in Ouyang et al (2012): β = 0.6, α = 0.3, $(p_1, r_1) = (0.75, 0.2)$, $(p_2, r_2) = (0.8, 0.3)$. It can be shown that, with these parameters, a class 2 channel that has just been activated and has been observed in good state has the highest priority, hence should always be activated. Also, a class 2 channel after 4 time steps of being idle has higher priority than a class 1 channel in any belief state. We can then characterize the fixed point $m^*$ by computing a threshold of activation of class 1 channels so that, in steady state, a proportion α = 0.3 of channels are activated. This gives that all class 1 channels in belief state $b^1_{0,t}$ with t ≤ 20 are kept idle, a fraction 0.89... of the class 1 channels in belief state $b^1_{0,21}$ are activated, and all class 1 channels in belief states $b^1_{0,t}$ with t ≥ 22 are activated. As 0.89... ≠ 1, the fixed point is not singular.
Consequently, all conditions needed for Theorem 3 are satisfied for this model. We then use simulations to evaluate the average throughput, with N ranging from 10 to 300. Figure 5 shows a convergence pattern similar to that of the 3-state model, and it suggests an exponential convergence rate as claimed, with a value of the constant c ≈ 0.0085.

The Continuous-Time Restless Bandit Model
Throughout the paper, we studied a discrete-time restless bandit problem in which all arms synchronously make a transition. In this section, we explain how to adapt the proofs of Section 3 to the continuous-time model studied for example in Weber and Weiss (1990). We start by recalling the model of Weber and Weiss (1990) in Section 6.1. We show how the discrete-time and continuous-time models are related in Section 6.2. Finally, we state the equivalent of our main Theorem 3 for the continuous-time model in Section 6.3.

The continuous-time bandit model
Similarly to Section 2.1, a continuous-time restless bandit problem with parameters $\{(Q^0, Q^1, R^0, R^1); \alpha, N\}$ is a Markov decision process defined as follows:
1. As before, the model is composed of N arms, each evolving in a finite state space. The state of the process at time $t \in \mathbb{R}^+$ is the vector S(t).
2. In continuous time, the decision maker chooses an action $a(t) \in \{0, 1\}^N$. Decisions can be modified only when the process S(t) changes state: at each jump of the process S(t), the decision maker observes S(t) and chooses a new action vector a(t) that will be kept until the next jump of the process. The action vector must satisfy $\sum_{n=1}^{N} a_n(t) = \alpha N$.
3. Given a(t), the evolutions of the N arms are independent.
4. The gain per unit time of the decision maker is $\sum_{n=1}^{N} R^{a_n(t)}_{S_n(t)}$.
As before, the goal of the decision maker is to compute a decision rule in order to maximize the long-term average reward. Using our notation, this problem can be written as (11)-(12). This problem is the continuous-time version of the discrete-time problem (2)-(3). As before, we assume that the matrices $Q^0$ and $Q^1$ are such that each bandit is recurrent regardless of the policy employed.

Whittle index, relaxation and equivalence with the discrete-time model
In this subsection, we briefly recall the definition of the Whittle index and of the relaxation for the continuous-time bandit model. These definitions coincide with the ones of Weber and Weiss (1990). As for the discrete-time case, the Whittle index of continuous-time bandits is defined by considering a subsidized MDP for a single arm n, in which a decision maker that takes the passive action $a_n(t) = 0$ earns an extra reward ν per unit time. The definition of indexability is the same as in the discrete-time case, and the index of a state i, denoted by $\nu_i$, is the smallest subsidy such that the passive action is optimal for state i.
Similarly to the discrete-time problem, the definition of the Whittle index in the continuous-time model can be justified by looking at the Lagrangian of the optimization problem (11) where the constraint (12) is replaced by the relaxed constraint (14). We keep the same notation as in the discrete-time case for the value of this relaxed problem (13). As we show below, when considering arms in isolation, using a discrete-time or a continuous-time model is equivalent via a standard uniformization scaling. In particular, neither the definition of the Whittle index nor the value of the relaxation depends on the synchronization nature of the bandit.
Definition 6 Let (Q 0 , Q 1 , R 0 , R 1 ) be the parameters of a continuous-time bandit.By a standard uniformization scaling, let τ := max i maxa |Q a ii |, and let (P 0 , P 1 , R0 , R1 ) be the matrices defined as follows: for all states i ̸ = j and all action a ∈ {0, 1}: We call (P 0 , P 1 , R0 , R1 ) the discrete-time version of our continuous-time bandit model.
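The uniformization step can be sketched as follows. The formula of the definition is not reproduced here, so the code uses the usual uniformization convention $P^a = I + Q^a/\tau$ as an assumption; the function name `uniformize` is hypothetical, and the rescaling of the rewards is deliberately left out.

```python
def uniformize(Q0, Q1):
    """Uniformization of two rate matrices with a common rate tau.

    Assumed convention (standard uniformization, not quoted from the definition):
    P[i][j] = Q[i][j] / tau for i != j, and P[i][i] = 1 + Q[i][i] / tau.
    """
    tau = max(abs(Q[i][i]) for Q in (Q0, Q1) for i in range(len(Q)))

    def to_prob(Q):
        d = len(Q)
        return [[(1.0 if i == j else 0.0) + Q[i][j] / tau for j in range(d)]
                for i in range(d)]

    return tau, to_prob(Q0), to_prob(Q1)
```

Because τ is the largest exit rate over both actions, every diagonal entry $1 + Q^a_{ii}/\tau$ is non-negative and each row sums to one, which is the property that the matrices $P^0$ and $P^1$ must satisfy to be probability matrices.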
The following lemma states the equivalence of Whittle relaxation between the discrete-time and the continuous-time problems:
Lemma 4 Let $(Q^0, Q^1, R^0, R^1)$ be a continuous-time bandit and consider its discrete-time version defined by (15). Then:
(i) The matrices $P^0$ and $P^1$ are probability matrices.
(ii) The discrete-time version is indexable if and only if the continuous-time bandit $(Q^0, Q^1, R^0, R^1)$ is indexable. In such a case, the indices of both bandits coincide.
(iii) The discrete-time relaxed optimization problem (4) has the same value as its continuous-time counterpart (13).
Lemma 4 is a direct consequence of uniformization: the results rely on the analysis of an arm in isolation and, when focusing on a single arm, Bellman's equation is identical for the discrete-time and continuous-time versions of the MDP.

Exponential convergence in the case of continuous-time model
Lemma 4 uses the fact that the Whittle relaxation is defined for an arm in isolation. Hence, considering discrete-time or continuous-time bandits is equivalent.
For the N arms model, however, the situation is different: in the discrete-time model of Section 2 all arms change states synchronously, while in continuoustime situation, the probability that two arms make a jump at the exact same time is 0. This implies that the reward of WIP for the N arms problem does depend on whether the model is synchronous or not.We denote the later by V It is shown in Weber and Weiss (1990) that the asymptotic optimality depends on the ergodic property of the solution of an ordinary differential equation (ODE) defined in Equation ( 10) of Weber and Weiss (1990).Using our notation, this differential equation can be written as where ϕ is defined as in Lemma 2 for a discrete-time bandit problem (P 0 , P 1 , R0 , R1 ).By applying Lemma 2 on the discrete-time problem, we see that this equation has a unique fixed point, m * .It is then shown in Weber and Weiss (1990) that if all the solutions of the differential equation ( 16) converge to m * , then lim rel (α).In the next theorem, we show that we can adapt the result of Theorem 3 to the continuous-time model.
Then there exists two constants b, c > 0 that depend only on Q 0 , Q 1 , R 0 , R 1 and α, such that for any N with αN being an integer,

Sketch of proof
The proof of this result follows the same structure as the proof of Theorem 3 but needs substantial adaptation; the full details are given in Appendix D. The main ingredients are:
• We first use a result from Darling and Norris (2008) to obtain an analogue of Hoeffding's inequality. This proves that the behavior of the N arms model is close to the dynamics of the ODE (16).
• Using this and the fact that $m^*$ is non-singular, we show that the stochastic system lies with high probability in a neighborhood $\mathcal{N}$ of $m^*$ where ϕ is affine. We again use Stein's method to obtain the exponential convergence result (but this time applied to a continuous-time process). □
This theorem is a refinement of the original asymptotic optimality result of (Weber and Weiss, 1990, Theorem 2), as it provides a bound on the rate of convergence of the performance of WIP to the optimal one. The applicability conditions are essentially similar: (Weber and Weiss, 1990, Theorem 2) also needs the assumption that $m^*$ is an attractor of the ODE. We additionally require that $m^*$ be locally stable and non-singular. Those conditions are also similar to the conditions of Theorem 3. However, we should point out that the behavior of the discrete-time dynamical system $m(t + 1) = \phi(m(t))$ can be quite different from that of its continuous-time counterpart $\dot{m} = \tau(\phi(m) - m)$: there are bandit models for which WIP is asymptotically optimal under the continuous-time model but not under the discrete-time model. One such example is displayed in our Git repository.
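The difference between the two dynamics can be seen on a toy map. The sketch below is not the example from the repository: it uses the coordinate swap on the two-state simplex, which is a hypothetical choice made here because it is the simplest affine map with a 2-cycle (it would correspond to a periodic arm, so it is not claimed to satisfy the assumptions of the theorems). The explicit Euler scheme and its step size are also illustrative choices.

```python
def swap(m):
    """Toy affine map on the 2-state simplex; its unique fixed point is (0.5, 0.5)."""
    return [m[1], m[0]]

def discrete_orbit(m, steps):
    """Iterate m(t+1) = phi(m(t)) with phi = swap."""
    for _ in range(steps):
        m = swap(m)
    return m

def euler_ode(m, tau, horizon, dt=1e-3):
    """Explicit Euler scheme for dm/dt = tau * (phi(m) - m) with phi = swap."""
    for _ in range(int(horizon / dt)):
        image = swap(m)
        m = [x + dt * tau * (y - x) for x, y in zip(m, image)]
    return m
```

Starting from (0.8, 0.2), the discrete-time iteration alternates forever between (0.8, 0.2) and (0.2, 0.8), whereas the ODE trajectory converges to the fixed point (0.5, 0.5).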

Conclusion and Future Work
In this paper, we studied the performance of Whittle index policy (WIP) when there is a large number of arms. We showed that, when WIP becomes asymptotically optimal, it does so at an exponential rate (unless the fixed point is singular, which rarely occurs). This explains why WIP is very efficient in practice, even when the number of arms remains moderate. Our results hold for the classical model of Weber and Weiss (1990) where arms evolve asynchronously in continuous time, as well as for a synchronous discrete-time model in which all arms make their transitions simultaneously.
As for future research, we plan on investigating more closely the singular situations, as well as extending the exponential convergence rate result to those generalizations of Whittle index as in Duran and Verloop (2018); Hodge and Glazebrook (2015); Verloop (2016).

A Proof of Theorem 1
Proof Let m * be the fixed point of ϕ.As P 0 , P 1 are rational, each coordinate of m * is a rational number.Let {N k } k≥0 be a sequence of increasing integers that goes to ∞, such that for all k ≥ 0 and all 1 ≤ i ≤ d, m * i N k and αN k are integers.We then fix an N from this sequence {N k } k≥0 .Recall that m i N is the number of arms in state i in configuration m and that Sn(t) is the state of arm n at time t.We use S(t) to denote the state vector of the N arms system at time t.Let S * be a state vector corresponds to configuration m * with N arms.This is possible as m * i N is an integer for all i ∈ {1, . . ., d}.
Note that in configuration m * (i.e.state vector S * ), an optimal action a * under the relaxed constraint (5) will activate exactly αN arms.As a * is sub-optimal compared to an optimal policy for the original N arms problem (2)-(3), we have , where in the above equation the function V : S → R is the bias of the MDP.The first line corresponds to Bellman's equation (see e.g.Equation 8.4.2 in Chapter 8 of Puterman (1994)), the second line is because a * is a valid action for the N -arms MDP but might not be the optimal action, and the last line is because rel (α).We hence obtain In the following, we bound E a * [V S(1) − h(S * )].This will be achieved in two steps.Exponential Asymptotic Optimality of Whittle Index Policy Step One We define for two state vectors y, z the distance which counts the number (among the N arms) of arms that are in different states between those two vectors.Such distance satisfies the property that for all y and z such that δ(y, z) = k, we can find a sequence of state vectors z 1 , z 2 , ..., In what follows, we show that there exists C > 0 independent of N such that for all state vectors y and z, the bias function h(.) satisfies: In view of the above property of δ, we only need to prove this for δ(y, z) = 1, i.e.
|h(y) − h(z)| ≤ C. Let y, z be two state vectors such that δ(y, z) = 1, and assume without loss of generality that it is arm 1 that are in different states: y 1 ̸ = z 1 and yn = zn for n ∈ {2 . . .N }.We use a coupling argument as follows: We consider two trajectories of the N arms system, Y and Z, that start respectively in state vectors Y(0) = y and Z(0) = z.Let π * be the optimal policy of the N arms MDP, and suppose that we apply π * to the trajectory Z.At time t, the action vector will be π * (Z(t)).We couple the trajectories Y and Z by applying the same action vectors π * (Z(t)) for Y and keeping Yn(t) = Zn(t) for arms n ∈ {2 . . .N }.The Z trajectory follows an optimal trajectory, hence Bellman's equation is satisfied: for any T > 0, we have: (17) Since Y follows a possibly sub-optimal trajectory, we have: Recall that the matrices P 0 , P 1 are such that a bandit is recurrent and aperiodic.This shows that the mixing time of a single arm is bounded (independently of N ): for any policy π ∈ Π max i,j Because of the coupling, for 0 ≤ t ≤ T and 1 ≤ n ≤ N , Yn(t) ̸ = Zn(t) is only possible for n = 1.Furthermore, as the mixing time of an arm is bounded, for T large enough, there is a positive probability, say at least p > 0, that Y 1 (T ) = Z 1 (T ).Hence with probability smaller than 1 − p we have δ y(T ), z(T ) = 1, conditional on Y(0) = y and Z(0 This being true for all y, z with δ(y, z) = 1, it implies that max U,V: δ(U,V)=1 |h(U) − h(V)| ≤ T • r/p, and we can take the constant C := T • r/p.

Step Two
Recall that the state vector S * corresponds to the optimal (relaxed) configuration m * .We now prove that with a constant D independent of N , where S(1) is the random vector conditional on S(0) = S * under action vector a * .Indeed, let x * := m * N , and denote X := m(1)N to be the random d-vector, with m(1) the random configuration corresponds to S(1).For each 1 ≤ i ≤ d, we may write ) where B a i,j ∼ Binomial(x * j,a , P a ji ) for 1 ≤ j ≤ d, a ∈ {0, 1}; and x * j,0 + x * j,1 = x * j , with x * j,a representing the number of arms in state j taking action a, when optimal action vector a * is applied to state vector S * .
By stationarity, we have Consequently, we can bound with a constant D independent of N .
To summarize, we have which implies that rel (α) when N goes to +∞.Moreover, from (19), the convergence rate is at least as fast as O(1/ √ N ).□

B Proof of Lemma 2
In this appendix we prove Lemma 2. We first show the piecewise affine property in Lemma 6, which gives (i) and (ii).We then show the uniqueness of fixed point from a bijective property in Lemma 7, from which we conclude (iii).
Lemma 6 (Piecewise affine) ϕ is a piecewise affine continuous function, with d affine pieces.
Proof Let m ∈ ∆ d be a configuration and recall s(m) ∈ {1, . . ., d} is the state such that i=1 m i .When the system is in configuration m at time t, WIP will activate all arms that are in states 1 to s(m)−1 and not activate any arm in states s(m) + 1 to d.Among the N m s(m) arms in state s(m), N (α − s(m)−1 i=1 m i ) of them will be activated and the rest will not be activated.
This implies that the expected number of arms in state j at time t + 1 will be equal to It justifies the expression (8).Note that (8) can be reorganized to The above expression of ϕ implies that this map is affine on each zone Z i .There are d such zones with 1 ≤ i ≤ d.It is clear from the expression that ϕ(m) is continuous on m. □ Lemma 7 (Bijectivity) Let π(s, θ) ∈ Π be the policy that activates all arms in states 1, . . ., s − 1, does not activate arms in states s + 1, s + 2, . . ., d, and that activates arms in state s with probability θ.Denote by α(s, θ) the proportion of time that the active action is taken using policy π(s, θ).Then, the function (s, θ) → α(s, θ) is a bijective map from {1 . . .d} × [0, 1) to [0, 1).
Proof The following proof is partially adapted from the proof of (Weber and Weiss, 1990, Lemma 1).For a given ν ∈ R, denote by γ(ν) the value of the subsidy-ν problem, i.e.
γ(ν) := max We defined similarly γπ(ν) as the value under policy π for a such subsidy-ν problem.Note that for fixed π, the function γπ(ν) is affine and increasing in ν.

C Proof of Theorem 3
In this appendix, we explain technical details of the proof of our main result Theorem 3. In the following, we denote by B(m * , r) the ball centered at m * with radius r.
Theorem 8 Under the same assumptions as in Theorem 3, assume moreover that $M^{(N)}(0)$ is already in the stationary regime. Then there exist two constants $b, c > 0$ such that
Let us first explain how Theorem 8 implies Theorem 3. To show this, we first prove the following lemma.
Lemma 9 Assume that the bandits are indexable, and let $\rho(m)$ be the instantaneous arm-averaged reward of WIP when the system is in configuration $m$. Then: (i) $\rho$ is piecewise affine, being affine on each of the zones $Z_i$, and for all $m \in \Delta^d$:
Proof Let $m \in \Delta^d$ be a configuration and recall that $s(m) \in \{1, \dots, d\}$ is the state such that $\sum_{i=1}^{s(m)-1} m_i \le \alpha < \sum_{i=1}^{s(m)} m_i$. Similarly to our analysis of Lemma 6, when the system is in configuration $m$, WIP activates all arms that are in states $1$ to $s(m) - 1$. This leads to an instantaneous reward of $\sum_{i=1}^{s(m)-1} N m_i R^1_i$. WIP does not activate arms that are in states $s(m) + 1$ to $d$. This leads to an instantaneous reward of $\sum_{i=s(m)+1}^{d} N m_i R^0_i$. Among the $N m_{s(m)}$ arms in state $s(m)$, $N(\alpha - \sum_{i=1}^{s(m)-1} m_i)$ of them are activated and the rest are not. This shows that $\rho(m)$ is given by (22).
For (ii), recall that $m^*$ is the unique fixed point, and consider a subsidy-$\nu_{s(m^*)}$ MDP, where $\nu_{s(m^*)}$ is the Whittle index of state $s(m^*)$. Denote by $L$ the value of this MDP:
By definition of the Whittle index, any policy of the form $\pi(s(m^*), \theta)$ defined in Lemma 7 is optimal for (23). Moreover, if $\theta^*$ is such that $\alpha(s(m^*), \theta^*) = \alpha$, then such a policy satisfies the constraint (5): lim T→∞ rel (α) and as all arms are identical, we have rel (α), and $\pi(s(m^*), \theta^*)$ is an optimal policy for the relaxed constraint (5).
It remains to show that the reward of policy $\pi(s(m^*), \theta^*)$ is $\rho(m^*)$. This comes from the fact that the steady state of the Markov chain induced by this policy is $m^*$, and $\pi(s(m^*), \theta^*)$ is such that $\alpha N$ arms are activated on average. Indeed, the arm-averaged reward of this policy is:
As the proportion of activated arms is $\alpha$, we have
By linearity of $\rho$ and Theorem 8(i), the first term inside the above expectation is exponentially small; by Theorem 8(ii) and since the rewards are bounded, the second term is also exponentially small.
In the rest of the section, we first prove a few technical lemmas, and then conclude by proving Theorem 8.
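The reward decomposition used in the proof of Lemma 9 can be sketched numerically. The following is a minimal illustration, not the authors' code: the function names, the 0-based lists `r_active` and `r_passive` (standing for $R^1$ and $R^0$), and the convention that $s(m)$ is the first state whose cumulative mass exceeds $\alpha$ are assumptions made for this example. The reward is arm-averaged, i.e. divided by $N$.

```python
def threshold_state(m, alpha):
    """Return the 1-based state s(m): the first state whose cumulative
    mass m_1 + ... + m_s exceeds alpha (assumed convention)."""
    cumulative = 0.0
    for s, mass in enumerate(m, start=1):
        cumulative += mass
        if cumulative > alpha:
            return s
    return len(m)


def wip_reward(m, alpha, r_active, r_passive):
    """Arm-averaged instantaneous reward of the index policy in configuration m."""
    s = threshold_state(m, alpha)
    below = sum(m[: s - 1])              # mass in states 1 .. s(m)-1
    activated_at_s = alpha - below       # share of state-s(m) arms that are activated
    reward = sum(m[i] * r_active[i] for i in range(s - 1))
    reward += activated_at_s * r_active[s - 1]
    reward += (m[s - 1] - activated_at_s) * r_passive[s - 1]
    reward += sum(m[i] * r_passive[i] for i in range(s, len(m)))
    return reward


# Hypothetical 3-state example: 40% of the arms may be activated.
example = wip_reward([0.2, 0.5, 0.3], 0.4, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

On each zone where $s(m)$ is constant, the expression is affine in $m$, which is the content of part (i) of the lemma.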

C.1 Hoeffding's inequality (for one transition)
Lemma 10 (Hoeffding's inequality) For all $t \in \mathbb{N}$, we have
where the random vector $\epsilon^{(N)}(t+1)$ is such that $E[\epsilon^{(N)}(t+1) \mid M^{(N)}(t)] = 0$, and for all $\delta > 0$:
Proof Since the $N$ arms evolve independently, we may apply the following form of Hoeffding's inequality: Let $X_1, X_2, \dots, X_N$ be $N$ independent random variables taking values in the interval $[0, 1]$, and define the empirical mean of these variables by
More precisely, for a fixed $1 \le j \le d$, we have
where for (t), the $U_{i,k}$'s are in total $N$ independent and identically distributed uniform $(0, 1)$ random variables, and $P_{ij}(m)$ is the probability that an arm in state $i$ goes to state $j$ under WIP, when the $N$-arm system is in configuration $m$.
By definition, we have where the last inequality comes from the above form of Hoeffding's inequality.□
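As a sanity check of the concentration step, the sketch below compares the empirical frequency of a large deviation of a Bernoulli empirical mean with the standard two-sided Hoeffding bound $2e^{-2N\delta^2}$ for variables in $[0, 1]$. The two-sided form, the function names, and the parameter values are assumptions chosen for illustration; the exact form used in the lemma may differ.

```python
import math
import random


def hoeffding_bound(n, delta):
    """Standard two-sided Hoeffding bound for the mean of n variables in [0, 1]."""
    return 2.0 * math.exp(-2.0 * n * delta * delta)


def empirical_deviation_frequency(n, p, delta, trials, seed=0):
    """Fraction of trials in which the mean of n Bernoulli(p) draws
    deviates from p by at least delta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(1 for _ in range(n) if rng.random() < p) / n
        if abs(mean - p) >= delta:
            hits += 1
    return hits / trials


# Hypothetical parameters: 200 arms, deviation threshold 0.1.
frequency = empirical_deviation_frequency(n=200, p=0.5, delta=0.1, trials=2000)
bound = hoeffding_bound(200, 0.1)
```

The bound decays exponentially in the number of arms, which is what drives the exponential rate in the main theorem.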

C.2 Hoeffding's inequality (for t transitions)
Lemma 11 There exists a positive constant $K$ such that for all $t \in \mathbb{N}$ and for all $\delta > 0$,
Proof Since $\phi$ is a piecewise affine function with finitely many affine pieces, $\phi$ is in particular $K$-Lipschitz: there is a constant $K > 0$ such that for all $m_1$, m
Let $t \in \mathbb{N}$ and m ∈ be fixed, we have
By iterating the above inequality, we obtain
where for each $0 \le s \le t$, we have by Lemma 10: for all $\delta > 0$,
Hence, using the union bound, we obtain
Proof As $\phi$ is locally stable, for all $\varepsilon > 0$, there exists $\delta > 0$ such that if $\|m - m^*\| \le \delta$, then for all $t \ge 0$: ∥Φ
Let us now show that there exists $T > 0$ such that for all $m \in \Delta^d$, $\Phi_T(m) \in B(m^*, \varepsilon)$. We reason by contradiction: if this is not true, then there exists a sequence of $t \in \mathbb{N}$ that goes to infinity and a corresponding sequence $\{m_t\}_t$ such that $\|\Phi_t(m_t) - m^*\| \ge \varepsilon$. As $\Delta^d$ is a compact space, there exists a subsequence of $\{m_t\}_t$ (denoted again by $\{m_t\}_t$) that converges to an element $\tilde{m}$. On the other hand, as $m^*$ is an attractor, there exists $T_1$ such that $\Phi_{T_1}(\tilde{m}) \in B(m^*, \delta/2)$. Since $\Phi_{T_1}(\cdot)$ is continuous, there exists $\eta > 0$ such that if $\|m - \tilde{m}\| \le \eta$, then $\|\Phi_{T_1}(m) - \Phi_{T_1}(\tilde{m})\| \le \delta/2$. As $\{m_t\}_t$ converges to $\tilde{m}$, there exists $T_2$ such that for $t \ge T_2$, we have $\|m_t - \tilde{m}\| \le \eta$. Consequently for $t \ge T_2$, we have
Hence for $t \ge \max(T_1, T_2)$, by our choice of $\varepsilon$ and $\delta$ from the local stability of $\phi$, we deduce that
This gives a contradiction. Consequently, there exists $T$ such that for all $m \in \Delta^d$, $\Phi_T(m) \in B(m^*, \varepsilon)$. This implies in particular that $K_{s(m^*)}$ is a stable matrix: the moduli of all its eigenvalues are smaller than one. Moreover, we have for all $m \in \Delta^d$ and $t \ge T$:
As $K_{s(m^*)}$ is a stable matrix, this implies that (25) holds for all $m \in \Delta^d$. □

C.4 Proof of Theorem 8
We are now ready to prove the main theorem.
Proof The proof consists of several parts.

C.4.1 Choice of a neighborhood N
The fixed point $m^*$ is in zone $Z_{s(m^*)}$, in which $\phi$ can be written as
As $m^*$ is not singular, let $\mathcal{N}_1$ be a neighborhood of $m^*$ included in $Z_{s(m^*)}$. Since $m^*$ is locally stable, $K_{s(m^*)}$ is a stable matrix. We can therefore choose a smaller neighborhood $\mathcal{N}_2 \subset \mathcal{N}_1$ so that $\Phi_t(\mathcal{N}_2) \subset \mathcal{N}_1$ for all $t \ge 0$; that is, the image of $\mathcal{N}_2$ under the maps $\Phi_t$, $t \ge 0$, remains inside $\mathcal{N}_1$. This is possible by stability of $m^*$. We next choose a neighborhood $\mathcal{N}_3 \subset \mathcal{N}_2$ and a $\delta > 0$ so that $(\phi(\mathcal{N}_3))^{\delta} \subset \mathcal{N}_2$; that is, the image of $\mathcal{N}_3$ under $\phi$ remains inside $\mathcal{N}_2$ and is at a distance at least $\delta$ from the boundary of $\mathcal{N}_2$. We finally fix $r > 0$ so that $B(m^*, r) \cap \Delta^d \subset \mathcal{N}_3$, and we choose our neighborhood $\mathcal{N}$ as
Note that the choice of $r$ and $\delta$ is independent of $N$. From (ii) of Lemma 12, we denote furthermore by $T := T(r/2)$ the finite time such that for all $m \in \Delta^d$, $\Phi_{T+1}(m) \in B(m^*, r/2)$.

C.4.2 Definition and properties of the function G.
We follow the generator approach used for instance in Gast et al (2018b). For m ∈
By using Lemma 12, for all $m \in \Delta^d$ we have
This shows that the function $G$ is well defined and bounded. Denote by $\overline{G} := \sup_{m \in \Delta^d} \|G(m)\| < \infty$.
By our choice of $\mathcal{N}_2$ defined above, for all $t \ge 0$ and $m \in \mathcal{N}_2$ we have:
Hence, for all $m \in \mathcal{N}_2$, we have
where the last equality holds because $K_{s(m^*)}$ is a stable matrix. Hence in $\mathcal{N}_2$, $G(m)$ is an affine function of $m$.
From the definition of the function $G$, we see that for all $m \in \Delta^d$:
In the following, we bound (27) and (28) separately.
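The affine form of $G$ near the fixed point can be checked on a toy example. The sketch below assumes that $G(m)$ is the sum $\sum_{t \ge 0} (\Phi_t(m) - m^*)$, the discrete analogue of the integral definition used in the continuous-time case, and that inside the zone $\Phi_t(m) - m^* = (m - m^*) K^t$ with row vectors. The $2 \times 2$ matrix `k` and the points `m`, `m_star` are hypothetical and are not taken from any bandit model; under these assumptions the series sums to $(m - m^*)(I - K)^{-1}$.

```python
def vec_mat(v, a):
    """Row vector times matrix."""
    return [sum(v[i] * a[i][j] for i in range(len(v))) for j in range(len(a[0]))]


def inverse_2x2(a):
    """Inverse of a 2x2 matrix (assumes a nonzero determinant)."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def g_by_summation(m, m_star, k, horizon=200):
    """Truncated series sum_{t < horizon} (m - m_star) K^t."""
    diff = [x - y for x, y in zip(m, m_star)]
    total = [0.0, 0.0]
    for _ in range(horizon):
        total = [t + d for t, d in zip(total, diff)]
        diff = vec_mat(diff, k)
    return total


def g_closed_form(m, m_star, k):
    """Closed form (m - m_star) (I - K)^{-1}, valid when K is stable."""
    diff = [x - y for x, y in zip(m, m_star)]
    i_minus_k = [[1.0 - k[0][0], -k[0][1]],
                 [-k[1][0], 1.0 - k[1][1]]]
    return vec_mat(diff, inverse_2x2(i_minus_k))


# Hypothetical stable matrix (eigenvalue moduli below one) and points.
k = [[0.5, 0.2], [0.1, 0.3]]
m, m_star = [0.6, 0.4], [0.5, 0.5]
series = g_by_summation(m, m_star, k)
closed = g_closed_form(m, m_star, k)
```

The two values agree up to the truncation error of the series, which is geometric in the horizon because `k` is stable.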

C.4.3 Bound on (27)
As $G$ is bounded by $\overline{G}$, we have
We are left to bound P M 2 , where $K$ is the Lipschitz constant of $\phi$. We have by Lemma 11:
This shows that
where the last equality comes from our choice of $T = T(r/2)$.

C.4.5 Conclusion of the proof
To summarize, we have obtained by (29
Recall that $M^{(N)}(t)$ is the configuration of the system at time $t$, which means that $M^{(N)}_i(t)$ is the fraction of arms that are in state $i$ at time $t$. Let $e_i$ be the $d$-dimensional vector that has all its components equal to $0$ except the $i$th one, which equals $1$. The process $M^{(N)}$ is a continuous-time Markov chain that jumps from a configuration $m$ to a configuration $m + \frac{1}{N}(e_j - e_i)$ when an arm jumps from state $i$ to state $j$. For $i < s(m)$, this occurs at rate $N m_i Q^1_{ij}$ as all of these arms are activated. For $i > s(m)$, this occurs at rate $N m_i Q^0_{ij}$ as these arms are not activated. For $i = s(m)$, this occurs at rate $N(\alpha - \sum_{k=1}^{s(m)-1} m_k) Q^1_{ij} + N(\sum_{k=1}^{s(m)} m_k - \alpha) Q^0_{ij}$. The process $M^{(N)}$ jumps from $m$ to $m + (e_j - e_i)/N$ at rate $N \lambda_{ij}(m)$. This shows that $M^{(N)}$ is a density dependent population process as defined in Kurtz (1978). It is shown in Kurtz (1978) that, for any finite time $t$, the trajectories of $M^{(N)}(t)$ converge to the solution of a differential equation $\dot{m} = f(m)$ as $N$ grows, with $f(m) := \sum_{i \neq j} \lambda_{ij}(m)(e_j - e_i)$. The function $f(m)$ is called the drift of the system. It should be clear that $f(m) = \tau(\phi(m) - m)$, where $\phi$ is defined for the discrete-time version of our continuous-time bandit problem.
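The drift $f(m) = \sum_{i \neq j} \lambda_{ij}(m)(e_j - e_i)$ can be computed directly from the transition rates. The sketch below is an illustration under stated assumptions: states are 0-based, `q_active` and `q_passive` stand for the rate matrices $Q^1$ and $Q^0$, and arms in the threshold state are split between activation and passivity as described in the proof of Lemma 9. The function name and the example matrices are hypothetical.

```python
def wip_drift(m, alpha, q_active, q_passive):
    """Drift f(m) of the mean-field ODE under the index policy (0-based states)."""
    d = len(m)
    # Threshold state: first state whose cumulative mass exceeds alpha.
    s, cumulative = d - 1, 0.0
    for idx, mass in enumerate(m):
        cumulative += mass
        if cumulative > alpha:
            s = idx
            break
    below = sum(m[:s])
    drift = [0.0] * d
    for i in range(d):
        if i < s:
            active, passive = m[i], 0.0       # all arms in state i are activated
        elif i > s:
            active, passive = 0.0, m[i]       # no arm in state i is activated
        else:
            active = alpha - below            # activated share of the threshold state
            passive = m[i] - active
        for j in range(d):
            if j == i:
                continue
            rate = active * q_active[i][j] + passive * q_passive[i][j]
            drift[j] += rate                  # mass arriving in state j
            drift[i] -= rate                  # mass leaving state i
    return drift


# Hypothetical 2-state example with 25% of the arms activated.
q1 = [[-1.0, 1.0], [2.0, -2.0]]
q0 = [[-3.0, 3.0], [1.0, -1.0]]
drift = wip_drift([0.5, 0.5], 0.25, q1, q0)
```

The components of the drift sum to zero, so trajectories of $\dot{m} = f(m)$ remain in the simplex.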
For $t \ge 0$, denote by $\Phi_t m$ the value at time $t$ of the solution of the differential equation that starts in $m$ at time $0$; it satisfies
Following Gast and Van Houdt (2017); Ying (2017), we denote by $L^{(N)}$ the generator of the $N$-arm system and by $\Lambda$ the generator of the differential equation. They associate to each almost-everywhere differentiable function $h$ two functions $L^{(N)} h$ and $\Lambda h$ that are defined as
So, as for the discrete-time case, $G(m)$ is an affine function of $m$, with affine factor $B := \frac{1}{\tau}(K - I)^{-1}$. As $m^*$ is non-singular, it is at a positive distance from the other zones $Z_i \neq Z_{s(m^*)}$, and we therefore define $\delta := \min_{i \neq s(m^*)} d(m^*, Z_i)/2 > 0$, where $d(\cdot, \cdot)$ is the distance under the $\|\cdot\|$-norm. We then choose a neighborhood $\mathcal{N}_1 := B(m^*, \epsilon_1) \cap \Delta^d$ of $m^*$ such that for all $t \ge 0$ and all initial conditions $m \in \mathcal{N}_1$, $\Phi_t(m) \in B(m^*, \delta)$. This is possible by the exponentially stable attractor property of $m^*$. Following Theorem 3.2 of Gast (2017), we have
where $\mathcal{N} := B(m^*, \epsilon_1/2) \cap \Delta^d$. Let $N_0 := \lceil 2/\epsilon_1 \rceil$. For $N \ge N_0$, any $m \in \mathcal{N}$ additionally satisfies $\Phi_t\left(m + \frac{e_j - e_i}{N}\right) \in Z_{s(m^*)}$ for all $1 \le i \neq j \le d$ and $t \ge 0$. Hence, $G$ is locally affine and for all $m \in \mathcal{N}$ and $N \ge N_0$, we have:
This shows that the first term of (30) is equal to zero.
For the second term, note that both $G$ and $\Lambda G$ are continuous functions defined on the compact region $\Delta^d$, hence they are both bounded, while $L^{(N)} G$ grows at most linearly with $N$. Hence we can choose constants $u, v > 0$ independent of $N$ such that:
We are left to bound $P(M^{(N)}(0) \notin \mathcal{N})$ exponentially from above. This could be done by using the (unnamed) proposition on page 644 of Weber and Weiss (1990). Yet, we were not able to find the paper referenced for the proof of this proposition. Hence, we provide below a direct proof. To achieve this, we rely on an exponential martingale concentration inequality, borrowed from Darling and Norris (2008), which in our situation can be stated as follows.
Lemma 14 Fix $T > 0$. Let $K$ be the Lipschitz constant of the drift $f$, denote $\lambda := \max_{i,j} \lambda_{ij}$, and $c_1 := e^{-2KT}/(18T)$. If $\epsilon > 0$ is such that
then we have
The above lemma plays the role of Lemma 11 in the discrete-time case. Note that its original form, stated as Theorem 4.2 in Darling and Norris (2008), is given in a more general framework, which considers a continuous-time Markov chain with countable state space evolving in $\mathbb{R}^d$ and discusses a differential equation approximation to the trajectories of such a Markov chain. As such, the right-hand side of (34) has an additional term $P(\Omega_0^c \cup \Omega_1^c \cup \Omega_2^c)$, with $\Omega_i^c$ being the complement of $\Omega_i$. In our case, $\Omega_0 = \Omega_1 = \Omega$ holds trivially, while the analysis of $\Omega_2$ is more involved. However, as remarked before the statement of Theorem 4.2 in Darling and Norris (2008), if the maximum jump rate (in our case $N\lambda$) and the maximum jump size (in our case $1/N$) of the Markov chain satisfy a certain inequality, which in our situation can be stated as (33), then $\Omega_2 = \Omega$. Note that the constraint (33) is satisfied as long as $\epsilon$ is sufficiently small, and consequently $P(\Omega_0^c \cup \Omega_1^c \cup \Omega_2^c) = 0$.
Now let $\epsilon > 0$ be such that $B(m^*, 2\epsilon) \cap \Delta^d \subset \mathcal{N}$. The uniform global attractor assumption on $m^*$ ensures that there exists $T > 0$ such that for all $m \in \Delta^d$ and $t \ge T$: $\Phi_t m \in B(m^*, \epsilon)$. We take such $T$ and $\epsilon$ in Lemma 14, chosen so that they additionally satisfy (33). This is possible as the right-hand side of (33) converges to $0$ when $\epsilon$ is small and $T$ is large.
We then have:
b) Performance as a function of α.

Fig. 2: Normalized performance of WIP for different values of α and N.

Fig. 3: Estimation of the constants c and b from Theorem 3.

Fig. 4: Performance of the three policies for non-integer values of αN.
(ii) The (unique) fixed point $m^*$ of the ODE $\dot{m} = \tau(\phi(m) - m)$ is not singular.
(iii) $m^*$ is a global attractor of the trajectories of the ODE.

$(L^{(N)} h)(m) := \sum_{i=1}^{d} \sum_{j \neq i} N \lambda_{ij}(m) \left( h\left(m + \frac{e_j - e_i}{N}\right) - h(m) \right)$, $(\Lambda h)(m) := f(m) \cdot Dh(m)$, with $Dh$ being the differential of the function $h$. The function $\Lambda h$ is defined only at points $m$ at which $h$ is differentiable. Remark that if $h$ is an affine function of $m$, i.e. $h(m) = m \cdot B + b$, with $B$ a $d$-dimensional matrix and $b$ a $d$-dimensional vector, then $(L^{(N)} h)(m) = (\Lambda h)(m) = f(m) \cdot B$.
Now the analogue of Theorem 8(i) in the continuous-time case is
Theorem 13 Under the same assumptions as in Theorem 5, assume moreover that $M^{(N)}(0)$ is already in the stationary regime. Then there exist two constants $b, c > 0$ such that $\|E[M^{(N)}(0)] - m^*\| \le b \cdot e^{-cN}$.
Note first that, similarly, Theorem 13 implies Theorem 5.
Proof Define the continuous-time version of the function $G$ as $G(m) := \int_0^{\infty} (\Phi_t m - m^*) \, dt$. As for the discrete-time case, our assumptions imply that the unique fixed point is an exponentially stable attractor, and a result similar to Lemma 12 can be obtained for the continuous-time case. This implies that the function $G$ is well defined, continuous and bounded. Recall that the function $f$ is affine in $Z_{s(m^*)}$: if $m \in Z_{s(m^*)}$, then $\phi(m) = (m - m^*)K + m^*$ where $K$ is a $d \times d$ matrix, and $f(m) = \tau(\phi(m) - m) = \tau(m - m^*)(K - I)$. Now suppose $m \in \Delta^d$ is such that $\Phi_t m$ remains inside $Z_{s(m^*)}$ for all $t \ge 0$; then $\Phi_t m = (m - m^*) \cdot e^{t \tau (K - I)} + m^*$, and $G(m) = \frac{1}{\tau}(m - m^*)(K - I)^{-1}$.

Table 1: Number of randomly generated instances that violate any of the conditions of Theorem 3 out of $10^7$ uniformly generated restless bandit models for each dimension d ∈ {3, 4, 5, 6, 7}.
On an index policy for restless bandits. Journal of