Fast All-Digital Clock Frequency Adaptation Circuit for Voltage Droop Tolerance

Naive handling of supply voltage droops in synchronous circuits results in conservative bounds on clock speeds, and hence in poor performance even if droops are rare. Adaptive strategies detect such potentially hazardous events and either initiate a rollback to a previous state or proactively reduce clock speed in order to prevent timing violations. The performance of such solutions critically depends on a very fast response to droops. However, state-of-the-art solutions incur synchronization delay to keep the clock signal from being affected by metastability. Addressing the challenges discussed by Keith Bowman in his ASYNC 2017 keynote talk, we present an all-digital circuit that can respond to droops within a fraction of a clock cycle. This is achieved by delaying clock signals based on measurement values while these values simultaneously undergo synchronization. We verify our solution by formally proving correctness, complemented by VHDL and Spice simulations of a 65 nm ASIC design that confirm the theoretically obtained results.


I. INTRODUCTION
Correctness of synchronous circuit designs relies on the assumption that signal propagation through the combinational logic is complete before the next active clock edge. Temperature and voltage variations lead to dynamically changing interconnect and transistor delays, and are classically alleviated by decreasing the clock frequency such that a single clock period always provides sufficient time, even in the face of worst-case temperature-voltage conditions. These effects, together with worst-case assumptions on aging and process variation, lead to a large frequency guardband that results in under-utilization of the circuit under normal conditions.
The power supply plays a central role when designing the guardband: the sensitivity of gate propagation delay increases with lower V CC ; a 1 % voltage droop results in up to 4 % delay change in 90 nm technology with V CC = 0.9 V [1]. The trend to decrease V CC suggests that this issue will gain in importance for future chip generations. In [2] it was shown that a major part of the guardband is required to account for power supply noise, with more than 6 % loss in attainable clock frequency for a 130 nm processor. In [3], a 12 % voltage droop at 100 MHz was injected into a 45 nm microprocessor, already requiring a 16 % reduction of clock frequency to account for increased critical path delay.
Several techniques for handling slowly changing environmental conditions have been proposed, ranging from slow temperature-voltage and aging compensation [4] to process variation compensation [5]. However, compensation techniques typically involve significant sensing and response times that prevent their application to fast environmental changes with dynamics on the order of a single clock period. Supply voltage noise, induced by switching activities with high dI/dt, was shown to have its main frequency components in the 100-300 MHz range with amplitudes around 10 % [6], [7]. While ultra-high frequency components on the order of 10-100 GHz are local to the switching circuit, the high-frequency components in the 100-1000 MHz range are due to die and package LC and are global across the chip [8].

A. Tolerating High-Frequency Voltage Droops
In the following, we briefly summarize existing techniques to tolerate global high-frequency voltage droops. Both fully asynchronous and globally asynchronous locally synchronous (GALS) design styles are inherently more robust to voltage droops than their synchronous counterparts. For synchronous designs, desynchronization techniques [9] and elastic synchronous design styles [10], based on local handshaking, have been proposed. For cases where local handshaking incurs too large a circuit overhead, globally adaptive methods have been proposed: in [11], the authors advocate the use of on-die ring oscillators instead of externally generated clock signals, as ring oscillators are shown to have an advantageous correlation between frequency and critical path delays in the presence of droops.
Several approaches discussed in the following propose solutions to adapt a stable, external reference clock signal. An advantage of such an approach over adaptive ring oscillators is that deriving multiple adaptive clocks from the same stable reference clock potentially allows tracking their phase relation.
In [4] the clock frequency adjustment is split into a fast and a slow adjustment. The fast adjustment is performed by switching between three PLLs, while the slow adjustment is performed by adjusting the individual PLL frequencies. As the PLL outputs are not synchronized, switching between them incurs the risk of metastability and short clock cycles. In [12], an adaptive clocking system for a 90 nm processor running at a nominal 2.2 GHz and V CC = 1.2 V is proposed. It senses voltage droops and, via an arbiter, selects a new clock signal with an adjusted clock divisor. This technique is reported to tolerate droops with a slope of up to 30 mV/ns (i.e., 2.5 % per ns, or about 1.1 % per clock cycle) with an average response time of 700 ps (about 1.5 clock cycles). In [13], an adaptive clocking system based on sensing droops and adjusting a fast digitally controlled oscillator (DCO) that triggers a slowly changing frequency correction is presented. The attained response time is 8 to 10 clock cycles for a 45 nm processor with a nominal frequency of about 3.8 GHz.
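The unit conversions quoted above for the system of [12] can be re-derived in a few lines. The following is only a sanity check of the arithmetic; the variable names are ours and do not appear in [12].

```python
# Figures quoted in the text for the adaptive clocking system of [12].
V_CC = 1.2             # nominal supply voltage in volts
F_CLK = 2.2e9          # nominal clock frequency in Hz
SLOPE = 30e-3 / 1e-9   # tolerated droop slope, 30 mV per ns, expressed in V/s
T_RESPONSE = 700e-12   # average response time in seconds

period = 1.0 / F_CLK                                # one clock cycle in seconds
slope_pct_per_ns = SLOPE * 1e-9 / V_CC * 100        # share of V_CC lost per ns
slope_pct_per_cycle = SLOPE * period / V_CC * 100   # share of V_CC lost per cycle
response_cycles = T_RESPONSE / period               # response time in cycles

print(f"slope: {slope_pct_per_ns:.2f} % per ns, "
      f"{slope_pct_per_cycle:.2f} % per cycle")
print(f"response time: {response_cycles:.2f} clock cycles")
```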
In [3], a Dynamic Variation Monitor (DVM) based on mixed gate-interconnect delay line monitoring was proposed to track delay changes in critical paths. It was applied in [14] to tolerate steep voltage droops that require fast adaptations: the authors propose to route the clock signal over delay lines that have similar voltage-delay dependencies as the critical paths. This allows automatic and fast stretching of the clock signal on a negative droop slope. The potentially harmful compression of the clock signal on the successive positive droop slope is prevented by masking the clock output until the droop is over and clock periods are nominal again. Masking is triggered by an error signal that is delayed by 2 clock cycles, one of which is used for synchronization. While this approach is faster than the above approaches, it still results in a control latency with additive synchronization delay, which is likely to be more than 1 cycle for reliable designs. Furthermore, it completely stops the clock (by masking) until the droop and the cycle compression are over.
Likewise, the design in [15] is tailored to tolerate fast, steep voltage droops: its droop detector uses a delay line to detect droops within a clock cycle. The binary detection signal is then synchronized (resulting in a synchronization delay of 2 clock cycles) and shifts the phase by selecting a proper output from a tapped delay-locked loop (DLL). The output clock runs at 3 to 4 GHz in 28 nm CMOS.

B. Contributions
We propose a mechanism that makes it possible to avoid the frequency guardband in ideal conditions while ensuring correct operation even during frequent and steep voltage droops.
The main idea is to remove the additive synchronization delay from the critical path in the control loop by making use of metastability-containing circuit design [16]: we sense V CC by standard means, e.g., voltage comparators [8], and directly "compute" with the potentially metastable or unstable measurement, shifting the phase of the clock signal. After a certain number of clock cycles, chosen such that metastability has ceased with sufficiently high probability, we use the sensor values to adjust a DCO. Synchronization thus occurs in parallel to using the measurement values to shift the clock phase, and hence does not add to the reaction time. This method allows fast reaction to voltage droops by shifting the phase, and fine mid- and long-term adaptation by adjusting the DCO. Note that our approach does not require completely masking the clock signal during the voltage droop; we merely decrease the frequency of the generated clock output by a known (configurable) factor.

C. Structure of the Paper
In Section II, we introduce the formal model and problem statement and prove the correctness of an algorithmic solution within the model. A direct circuit implementation of the proposed control algorithm is presented in Section III. In Section IV, we back our theoretical findings with VHDL and Spice simulations.

II. FORMAL MODEL AND SOLUTION
We start with the problem of implementing a correct frequency adaptation module. We then specify a circuit, called FAM, with submodules Droop Detector (DD), Delay Element, and Phase Accumulator (ϕ), and show that it is a correct (implementation of a) frequency adaptation module.
All module specifications are stated as a list of input assumptions (Ix) and output constraints (Cy). A module is correct if it fulfills all (Cy) whenever all (Ix) hold.

A. The Problem: A Correct Frequency Adaptation Module
The overall frequency adaptation system is formalized by a module with two input ports and one output port. One input signal is a clock signal with a fixed nominal frequency (which can be chosen much higher than the derived system clock), the other is the supply voltage. We model the clock signal by a sequence of times (τ ↑ i ) i∈N , where τ ↑ i corresponds to the time the i th rising input clock edge occurs; analogously, τ ↓ i is the time of the i th falling input clock edge. The supply voltage is given by V CC : R ≥0 → [V min , V max ], where V CC (t) is the voltage at time t. We require that the input is well-behaved:
• (I1) Assumption of well-separated input. The input clock fulfills and ∀i ∈ N : . where T − s and T + s are the minimum and maximum duration of the "short" clock pulses it provides. The above essentially means a 50 % duty cycle of the input clock, although this requirement can be relaxed.
• (I2) Assumption on droops. The supply voltage satisfies that i.e., K bounds how steep a droop can be.
The only output is the clock signal, which during a voltage droop must slow down appropriately. We model the output by the sequence of times (τ ↑ i ) i∈N , where τ ↑ i is the time the i th rising output clock edge occurs. (τ ↓ i ) i∈N is defined analogously. We will also need T − l and T + l , the desired minimum and maximum period of the slowed-down clock, which has "long" periods to accommodate increased switching times during droops. In summary, T − s < T + s < T − l < T + l . The frequency adaptation module is said to be correct if, given (I1) and (I2), it fulfills constraints (C1) and (C2):
• (C1) Guarantee of well-separated output. Output clock edges are well-separated, i.e., We do not require a 50 % duty cycle of the output clock, but will show bounds for our solution later on.
• (C2) Guarantee of well-shifted output. We require that the output clock always runs fast when the supply voltage has been sufficiently high during the previous cycle, and that it runs slow when the supply voltage was too low during the last clock cycle: The voltages V low , V high define what is considered a droop. In summary, V min < V low < V high < V max .
While this specification does not explicitly require it, the proposed system also guarantees an amortized minimum frequency of 1/T + l ; in the absence of metastability in the constructed delay chain, in fact no clock period is longer than T + l , and for a chain of length n, the maximum clock period is bounded as stated in Corollary 1.

B. Components of the Frequency Adaptation Module FAM
Central to our proposed solution are flip-flops with x-masking outputs, for x ∈ {0, 1}: a flip-flop whose output is x if it is internally metastable. Note that such a flip-flop only produces full-swing, fast transitions at its output, but no glitches or long intermediate voltage levels: when metastability resolves to 1 − x, it produces a (possibly arbitrarily late) transition from x to 1 − x; if metastability resolves to x, its output remains at x. Such flip-flops can be realized by successive high/low-threshold inverters; see e.g. [17], [18].
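The digital abstraction of an x-masking output can be written down as a tiny truth function. The sketch below is ours and deliberately ignores timing: it does not model the possibly late transition from x to 1 − x, only the value observed while the flip-flop holds a given internal state. The state label "M" and the function name `masked_output` are illustrative assumptions.

```python
def masked_output(internal_state, x):
    """Value seen at an x-masking output.

    internal_state is 0, 1, or "M" (internally metastable);
    x in {0, 1} is the value the output shows while metastable.
    """
    if internal_state == "M":
        return x
    return internal_state

# Compare a 0-masking and a 1-masking output of the same storage element.
for state in (0, 1, "M"):
    q0 = masked_output(state, 0)  # 0-masking output
    q1 = masked_output(state, 1)  # 1-masking output
    print(f"state={state!s:>1}  0-masking={q0}  1-masking={q1}")
```

In every row either the 1-masking output is 1 or the 0-masking output is 0, so the two outputs never disagree in the "unsafe" direction, which is the kind of property the masking strategy relies on.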
Next, we present an abstract implementation of a frequency adaptation module, called FAM, that consists of (i) a droop detector, (ii) a configurable delay chain comprising n ≥ 1 conditional delay elements, and (iii) a digital phase accumulator. The three modules of FAM are specified and interconnected as follows (see Figure 1 (3) The purely digital Phase Accumulator ϕ takes the oldest delay enable signal, forwarded by the leftmost delay element, at its input ĒI , and accumulates the delay value into its phase offset. This requires that the delay enable input ĒI is metastability-free at the time it arrives at the phase accumulator. The phase accumulator skips (i.e., masks) a clock cycle whenever its accumulated phase offset reaches a full period.
We continue with a detailed specification of the modules. For succinctness and in the interest of readability, in the following we will use single variables instead of intervals for a time range, with the understanding that the timing analysis has to respect the respective upper and lower bounds. For example, we write T s instead of the interval [T − s , T + s ]. We will also need the common timing parameters of the underlying storage elements: t set , t hold , t prop , t ofs , which are the setup, hold, and propagation times of the circuits, as well as the offset between the active clock edge and the time the input is captured.
Module ϕ (Phase Accumulator). We model the behavior of module ϕ, as introduced in (3), in a straightforward way. The component has an internal state (the accumulated phase shift), and two inputs: the single-bit signal ĒI indicating whether to increase the phase offset, and the clock signal Clk in generated by the source clock, e.g., an external free-running quartz oscillator. It outputs a clock signal C O derived from Clk in , whose pulses are phase-shifted appropriately. Specifically, this means that we have to add phase shift values, handle overflow as clock gating, and must be able to complete this within T − s time even during a voltage droop. As we will see in Section III, this can be achieved by a simple and fast circuit.
Formally, let the sequences τ ↑ i , τ ↓ i , τ ↑ i,0 , τ ↓ i,0 be the times of the rising and falling edges of the input and output clock signals, respectively (the 0 indicates that ϕ is the "0 th " element of the delay chain). We assume that (I1) holds for Module ϕ's clock input. By b i,0 we denote the digital interpretation of ĒI around time where ĒI is scaled accordingly). We assume:
• (I3) Assumption of metastability-free input. There always is such a value b, which we will argue to hold with high probability later.
We can now define the total shift count ). We say the Phase Accumulator is correct if, given (I1) and (I3), conditions (C3) and (C4) hold:
• (C3) Guarantee of well-shifted output. Let Q be the quotient of the clock period increase, i.e., T l /T s = 1 + 1/Q, and assume Q is in N. The output clock C O is shifted according to the amount indicated by all previous rounds' b i,0 : where δ ϕ accounts for internal gate and wire delays of the module (like T s , it is shorthand for an interval).
• (C4) Guarantee on high-time. The high-time of each pulse in the output clock signal C O is bounded by O . Distinguishing between the local and forwarded "copy" of the delay enable is relevant only if the input is unstable, a case we carefully handle using metastability masking techniques.
Formally, we require that the input signal at C I is a "clean" clock signal, i.e., it has sharp edges between periods of strong-high and strong-low signals (as we consider unstable inputs, we will have to show that this holds true in our proof of correctness); the module guarantees the same for its clock output C O . Denote by τ ↑ i,j and τ ↓ i,j the sequences of times of the rising and falling output clock edge of the j th delay element, respectively. Therefore, τ * i,j−1 is the occurrence of the respective rising/falling input clock edge. Observe that τ ↑ i,j−1 and τ ↓ i,j−1 fully describe the clock input C I to the j th element, where the first element receives τ ↑ i,0 and τ ↓ i,0 from ϕ. We require:
• Assumption of well-separated input.
i.e., the clock period is at least T − s and the high time is T s /2. Then the same guarantees are ensured for the clock output:
• Guarantee of well-separated output.
It remains to specify how the module responds to the delay enable inputs. To this end, for * ∈ {S, F } we define b * i,j as the digital abstraction of the respective signal at the input port E * I of the j th delay element, using the mapping Also, if the element adds delay, we need the guarantee that the one to the left (providing C I as its clock output) does the same on the next clock pulse, as otherwise we would have to choose T s conservatively, defeating the purpose of our construction. Hence, we also demand: We now use b F i−1,j to decide whether or not to delay the i th clock pulse. b S i−1,j , on the other hand, is used to forward the delay enable. If b F i−1,j = M, we are satisfied with ensuring (C1)-(C3), where (C3) is achieved by guaranteeing that b we guarantee that b F i−1,j = 1 by masking metastability. Both properties together (captured by (C10)) ensure that if a delay enable input causes any delay for a pulse i, then it is guaranteed to delay all following pulses by T s /Q time, which lies at the heart of the correctness proof.
• Guarantee of delayed output and delay propagation.
Module DD (Droop Detector). Finally, we define the Droop Detector, as introduced in (1). It provides a discrete, but potentially unstable or metastable value indicating whether a droop has occurred; see e.g. [8] for an implementation. To enable our masking strategy, however, we use a high and a low output threshold to generate two signals Ē * O , * ∈ {S, F }, which we feed as Ē * I to the rightmost delay element. It is required that (C10) holds for this element; straightforward ways to ensure this are to use two identical detectors with different thresholds, exploiting the assumption that V CC changes at most at rate K, or to use a detector with (at least) three-valued output.
Moreover, the detector's output must indicate whether a voltage droop may be imminent. Accordingly, we require for a correct DD module that if (I2) holds then (C10) (for any i + 1 ∈ N and j − 1 = n), (C11), and (C12) hold:
• Guarantee of droop detection.
The specifics of the implementation of the detector are of no concern to us. However, note that it is crucial that the detector's delay is small, as it adds to the response time of the circuit and thus affects the steepness K of droops that can be tolerated. This suggests favoring simple implementations.

C. Correctness of the FAM
To show that the FAM is a correct implementation of the frequency adaptation module, we first prove that all input requirements of the FAM's submodules are fulfilled. Lemma 1 does so for the delay elements in the FAM.
Lemma 1. Consider the FAM with correct implementations of its submodules and a chain of n ≥ 1 delay elements. If (I1) and (I3) hold, the input requirements (I4), (I5), (I6), and (I7) hold for each delay element.
Induction step (n − 1 → n): Assume the statement of the lemma holds for chains up to size n − 1 ≥ 1. Assume for contradiction that the claim does not hold and consider the causally first violation; we show that such a violation is impossible. Prior to any violation, the n th element satisfies (C10), implying by the induction hypothesis that the first n − 1 delay elements have their input requirements satisfied. Accordingly, (I4), (I5), and (I7) cannot be violated first at element n due to (C5), (C6), and (C8), respectively, for element n − 1. As in the base case, (I6) holds by (C10) of the DD module, which operates correctly unconditionally. We arrive at the contradiction that no input requirement can be violated first, concluding the induction and the proof.
Applying Lemma 1 to the right-most Delay Element yields property (C1), and bounds on the output clock high-times follow (by (C5) and (C6)). An upper bound on the output clock period follows from the fact that a clock transition can be phase shifted by at most an additional T s /Q per delay element. As ϕ drops at most a 1/(Q + 1) fraction of the clock pulses and delay elements never add or remove pulses, the amortized frequency is at least roughly 1/T l .
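To get a feeling for these bounds, the following sketch evaluates them for the parameters used later in the simulation (Q = 4, a chain of n = 7 delay elements, and a nominal period of 1 ns). It uses the single-value shorthand T s rather than the interval [T − s , T + s ]; the function name `fam_clock_bounds` is ours.

```python
def fam_clock_bounds(T_s, Q, n):
    """Period bounds of the FAM output clock, in the units of T_s.

    T_s: nominal (short) clock period
    Q:   quotient of the period increase, T_l / T_s = 1 + 1/Q
    n:   number of delay elements in the chain
    """
    T_l = T_s * (1 + 1 / Q)          # long (slowed-down) period
    max_period = (1 + n / Q) * T_s   # each element adds at most T_s / Q
    amortized = (Q + 1) * T_s / Q    # at most 1 in Q + 1 pulses is dropped
    return T_l, max_period, amortized

T_l, max_period, amortized = fam_clock_bounds(T_s=1.0, Q=4, n=7)
print(f"T_l = {T_l} ns, max period = {max_period} ns, "
      f"amortized period = {amortized} ns "
      f"(amortized frequency >= {1 / amortized} GHz)")
```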
Corollary 1. Consider the FAM with correct implementations of its submodules, and a chain of n ≥ 1 delay elements. If (I1) and (I3) hold, property (C1) holds and the output clock high-time is within [T − s /2, T + s /2]. The output clock period is at most (1 + n/Q)T s and amortized (Q + 1)T + s /Q.
We are now in the position to show that the FAM reacts to voltage droops as required by (C2). From Lemma 1 we already have that all delay elements' input and output requirements are fulfilled; specifically, delay element n's output guarantees hold. It remains to show that the DD module correctly senses a droop and passes on this information to delay element n, which then reacts with a corresponding phase shift.
Lemma 2. Consider the FAM with correct implementations of its submodules, and a chain of n ≥ 1 delay elements. If the delay constraints t ofs ≥ t set and t ofs + t hold ≤ T s /2, (I1), (I2), and (I3) are fulfilled, then property (C2) holds.
Proof. First note that by Lemma 1 the input and output requirements of delay element n are fulfilled.
We begin by showing the first implication in (C2). Assume that for all t ∈ [τ where (4) follows from the delay constraints, and (5) from the delay constraints, and (C5) and (C6) for delay element n. We obtain b * i−1,n = 1. From (C5) and (C7) for delay element n, the first implication follows.
We next show the second implication in (C2). Assume that there exists a t in Thus, (C11) yields that Ē * (t) = 0 during this interval. Further, from the lemma's delay constraints, ] and thus b * i−1,n = 0. Now the implication follows by (C8).
Overall correctness follows from Corollary 1 and Lemma 2 together with (I3), i.e., the chain being long enough to ensure that metastability is always resolved before reaching ϕ.
Theorem 1. If the delay constraints t ofs ≥ t set and t ofs + t hold ≤ T s /2, (I1), (I2), and (I3) hold, then the FAM with correct implementations of its submodules, and a chain of n ≥ 1 delay elements, is correct.
Note that the chain length n does not influence correctness assuming that no metastability occurs, but is of course relevant to ensure (I3) indeed holds. The delay chain achieves this by acting as a synchronizer chain of length n; cf. Section III.

III. CIRCUITS
We next present circuits for the Phase Accumulator ϕ and the Delay Element that fulfill the modules' specifications.
Circuit for Phase Accumulator. The phase accumulator behaves like a phase accumulator in a numerically controlled oscillator (NCO): it internally runs at a higher frequency, periodically adding a constant phase offset (plus an externally provided potential phase shift), thereby generating the output clock. For an implementation consider the circuit in Figure 2. Observe that the 2-bit counter selects from four clock signals that are mutually shifted in phase by 1/4 · 2π = π/2. Increasing the 2-bit counter by one "skips" π/2 phase.
Lemma 3. The circuit in Figure 2 correctly implements Module ϕ for Q = 4.
Proof Sketch. Given (I1), consider first the output of the circuit when the counter value is fixed to some r ∈ {0, 1, 2, 3}. It is easily verified that the resulting output is the input clock guaranteed by (I1) with a phase shift of rT s /4 + δ ϕ , where δ ϕ is the propagation delay of U2.
Accordingly, all we need to verify is that when the counter increases, the low time of the output signal is extended by 1 cycle of the multiplied clock. This holds true, as a phase shift of T s /4 w.r.t. the previous counter value together with feeding the (not multiplied) clock to the D-input of U1 and U2 implies that they capture the value after the falling clock edge again, which is also low. Note that when the counter overflows from 3 to 0, we switch back to the unshifted clock input without additional delay, implying that a clock cycle has been masked: the total phase shift since the previous overflow has reached 2π, yet the clock input is directly forwarded to the clock output. With these observations in place, (C3) follows by an induction on parameter i ≥ 1. (C4) follows from (I1) together with the fact that the circuit applies the same phase shift to rising and successive falling transitions.

Figure 2: Phase accumulator. The input clock is four times the system clock in order to generate exact phase outputs without the need of stringent delay matching requirements. The clock divider produces two output clocks with π phase difference, used by U1 and U2 below. The binary 2-bit upcounter accumulates the phase shift E I , which is latched upon the falling edge of the clock output the phase accumulator produces, and selects the appropriate inputs to U1 and U2.
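The counter behavior described in the proof sketch can be mimicked by a purely behavioral model. The sketch below is an abstraction of ours, not the circuit of Figure 2: it ignores the gate delay δ ϕ , treats the delay request as an active-high boolean per output pulse, and applies the shift to the pulse for which the request is seen. The function name `phase_accumulator` and its parameters are hypothetical.

```python
def phase_accumulator(delay_requests, T_s=1.0, Q=4):
    """Behavioral model of a phase accumulator with a mod-Q counter.

    delay_requests: iterable of booleans, one per output pulse; True means
                    the phase offset is increased by T_s / Q for this pulse.
    Returns (rising-edge times of the output clock, number of masked cycles).
    """
    r = 0             # counter value, selects a phase offset of r * T_s / Q
    masked = 0        # input clock cycles skipped due to counter overflow
    input_cycle = 0   # index of the system-rate input cycle in use
    edges = []
    for delay in delay_requests:
        if delay:
            r += 1
            if r == Q:            # accumulated shift reached a full period
                r = 0             # switch back to the unshifted clock ...
                masked += 1
                input_cycle += 1  # ... which skips (masks) one input cycle
        edges.append((input_cycle + r / Q) * T_s)
        input_cycle += 1
    return edges, masked

edges, masked = phase_accumulator([False, False] + [True] * 9 + [False] * 2)
periods = [b - a for a, b in zip(edges, edges[1:])]
print("periods:", periods)
print("masked cycles:", masked)
```

In this model every pulse with a delay request is preceded by a period of T s (1 + 1/Q), all other periods equal T s , and nine consecutive requests, as in the simulation of Section IV, mask two input cycles.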
An interesting alternative implementation of a phase accumulator is provided in [15]. That design is based on a tapped DLL and a MUX that selects among the taps, thereby applying phase shifts. Such a design has the advantage that it does not need higher internal clock frequencies and thus allows higher output clock frequencies: the design in [15] achieves 3 to 4 GHz in 28 nm technology.
Circuit for Delay Element. Consider the circuit in Figure 3 with the pulse shaping circuit PS as depicted in Figure 4. Concerning the flip-flops, output Q0 is required to be 0-masking (slow-masking), and output Q1 1-masking (fast-masking). We further require that the flip-flop parameters fulfill
Lemma 4. The circuit in Figure 3, with U5 and U6 initialized to 1, correctly implements a Delay Element for Q = 4.
Proof. We prove the claim by induction over the pulse number i, where apart from the properties (C5)-(C10) we claim that U5 and U6 attain states b F i,j and b S i,j when being latched by the falling outgoing clock edge; for the induction anchor at i = 0, we may set b F 0,j = b S 0,j = 1 by the prerequisites of the lemma. Now consider the i th incoming rising clock edge. As U5 is a slow-masking flip-flop, in the absence of a (falling outgoing) clock edge latching it, its output can only transition from 0 to 1. Hence, the rising clock edge incoming at C I is forwarded to the pulse shaper, with a delay between δ and δ + T l − T s = δ + T s /4, where δ denotes the delay from C I through U3 and U4 to the input of the pulse shaper PS. Note that the high time of the signal may be increased by up to T s /4, but it will drop to low again before the next pulse arrives due to (I4). The first stage of the pulse shaper then inverts the signal, transforming the high time into a low time of T s /3. Afterwards, the signal is inverted again and the resulting high time extended to T s /3 + T s /6 = T s /2, guaranteeing (C6). Moreover, the overall delay of the pulse is in the range [δ DE , δ DE + T s /4], where δ DE = δ + δ P S with δ P S being the delay between a rising input clock edge to the pulse shaper and the rising edge at its output. By the earliest time the falling output clock edge occurs, i.e., after δ DE + T s /2 time, we are guaranteed that U3 has low input from C I again. Further, until the new latching output has propagated to U3's input, i.e., at the latest after time δ DE + T s /2 + T s /4 + t ofs + t prop < T s , U3's clock input remains 0, by the delay constraints and (I4). So latching U5 does not cause glitches.
Next, observe that the above considerations also show that the delay of the rising clock edge at the output can be larger than δ DE only if U5 was not in a stable state of 1, i.e., b F i−1,j ≠ 1. By (I6), b F i−1,j ≠ 1 entails that b S i−1,j = 0. Further, b F i,j−1 = 0 is guaranteed if during τ ↓ i,j−1 + t ofs + [−t set , t hold ] register U5 of element j − 1 sees a stable 0. This is guaranteed because the stable b S i−1,j = 0 is driven by U6 of element j since We conclude that, regardless of the state of U5, τ ↑ i,j − τ ↑ i−1,j ≥ T s , proving (C5) for the i th pulse. (C7) and (C8) are immediate consequences of the above considerations for the case of U5 being in a stable state.
Concerning (C9), it follows from the already established (C5), (C6), and the delay constraints that b S i,j = b ∈ {0, 1} entails that the output of U6 is stable during [τ ↓ i+1,j−1 + t ofs − t set , τ ↓ i+1,j−1 + t ofs + t hold ]. Apart from showing (C9), these timing constraints also imply that U6 is not latched during this time. If U6 is internally metastable, this means that it can (begin to) stabilize only in one direction (note that we require an implementation that does not allow for oscillatory metastability). Finally, recalling that ĒF I and ĒS I are fast- and slow-masking, respectively, it is guaranteed that b F i+1,j−1 = 1 or b S i+1,j−1 = 0, proving (C10) for the i th pulse.
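The timing quantities used in the first part of the proof can be checked numerically. The sketch below evaluates the pulse shaper high time and the no-glitch condition δ DE + T s /2 + T s /4 + t ofs + t prop < T s for Q = 4. The numeric delay values are purely hypothetical placeholders chosen by us, not extracted from the design, and the function name `delay_element_timing` is ours.

```python
from fractions import Fraction

def delay_element_timing(T_s, delta_DE, t_ofs, t_prop):
    """Timing quantities from the proof of Lemma 4 (Q = 4).

    All arguments are integers in one common unit (here: picoseconds).
    """
    extra_delay = Fraction(T_s, 4)                    # T_l - T_s for Q = 4
    high_time = Fraction(T_s, 3) + Fraction(T_s, 6)   # after pulse shaping
    # Latest time at which the newly latched enable reaches U3's input.
    latest_latch = delta_DE + Fraction(T_s, 2) + extra_delay + t_ofs + t_prop
    return {
        "high_time": high_time,
        "latest_latch": latest_latch,
        "glitch_free": latest_latch < T_s,
    }

# Hypothetical example values in picoseconds for a 1 GHz output clock.
result = delay_element_timing(T_s=1000, delta_DE=120, t_ofs=30, t_prop=60)
print(result)
```

Rearranged, the no-glitch condition leaves a budget of T s /4 for δ DE + t ofs + t prop , which shows why the delay through the element and its flip-flops must stay small relative to the clock period.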

A. Metastability
Combining Theorem 1 with Lemmas 3 and 4, we obtain correctness of the FAM implementation. Note, however, that correctness relies on requirement (I3). Given our circuit implementation, (I3) corresponds to the fact that the delay enable propagated through the n delay elements from the DD module to the ϕ module is not metastable when it arrives. From the fact that stable register values are propagated correctly, i.e., again result in stable register states of the element to the left, we observe that metastability can only propagate through the chain when the register U6 of delay element j resolves exactly when register U6 of element j − 1 latches its input; i.e., the chain acts as a synchronizer chain of length n. The overall probability of a failure can thus be bounded analogously to the failure of an n-stage synchronizer; see e.g. [17], [18].
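For orientation, the commonly used textbook estimate for the mean time between failures (MTBF) of a synchronizer is exp(t_res/τ) / (T_w · f_clk · f_data), where t_res is the available resolution time, τ the resolution time constant, and T_w the metastability window of the flip-flop. The sketch below applies this estimate to a chain by assuming that each of the n stages contributes the same resolution time. This is a simplified model of ours, not the analysis of [17], [18], and all numeric parameters are hypothetical.

```python
import math

def synchronizer_mtbf(n_stages, t_resolve, tau, t_window, f_clk, f_data):
    """Textbook MTBF estimate (in seconds) for an n-stage synchronizer.

    n_stages:  number of stages, each granting t_resolve of resolution time
    tau:       resolution time constant of the flip-flop (s)
    t_window:  metastability window of the flip-flop (s)
    f_clk:     sampling clock frequency (Hz)
    f_data:    rate of input transitions that may hit the window (Hz)
    """
    total_resolve = n_stages * t_resolve
    return math.exp(total_resolve / tau) / (t_window * f_clk * f_data)

# Hypothetical parameters, for illustration only.
params = dict(t_resolve=0.5e-9, tau=20e-12, t_window=20e-12,
              f_clk=1e9, f_data=100e6)
for n in (1, 2, 7):
    print(f"n = {n}: MTBF ~ {synchronizer_mtbf(n, **params):.3e} s")
```

Under this model each additional stage multiplies the MTBF by exp(t_resolve/τ), which is why a moderately long chain suffices to make (I3) hold with very high probability.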

IV. SIMULATION
We implemented and simulated the circuit, both in a high-level logic simulator using VHDL as well as in Spice, demonstrating that the required design constraints can be met for a 1 GHz clock in 65 nm.
Given the circuit specification and constraints in Section III, the design entry in VHDL followed a standard approach. All sub-circuits used back-annotated gate delays, after synthesis in the UMC 65 nm process, and their constraints were met.
Figure 5: Results of the Spice simulation of the complete circuit, showing the supply voltage "VCC", the droop detector output "E", and pairs of delay enable and clock signals at the boundary between two delay elements. "C1" and "E1" are the clock and the delay enable signals, respectively, between the phase accumulator and the first delay element, "C2" and "E2" those between the first delay element and the second, and so forth up to "C7" and "E7" between the second-to-last and the last delay element. "C out" and "E" are the output and input, respectively, of the last delay element. The graphs show the quick reaction time of the system to droops, well within a single clock cycle from the assertion of the delay enable, both at the start of the droop and at its end. The delay enable and a "zone" of slow clock cycles trickle backward in the chain until they are absorbed by the phase accumulator. Even though the droop visibly affects the circuit elements' operation and output voltage, their timing behavior is still as desired.

For synthesis, all flip-flops and gates were taken from the UMC library. Delay elements were modeled as chains of minimum-sized inverters with small RC elements in between (on the order of 100 Ω and 10 fF, respectively). As expected from the circuits presented in Section III, the critical path is in Module ϕ, the phase accumulator, as this part of the circuit runs at four times the clock frequency of the remaining parts. For maximum speed, the proper alignment of 4 * Clk and 1 * Clk is vital. The delay added on 1 * Clk and 4 * Clk by a naively implemented divide-by-4 circuit would easily consume the slack at the inputs of U1 and U2. In case this is handled properly, the critical path in the circuit is the loop from U2, $C_O$, via the up-counter and its output r[1:0] back to the multiplexer and the inputs of U1 and U2. Our simulations showed that the circuit could be clocked well in excess of 4.5 GHz, resulting in an output clock frequency of over 1.1 GHz. Adding some margin, we decided to use a nominal output clock of 1 GHz (that is, a 4 GHz input clock) for the simulations.
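The clock rates quoted above follow from the phase accumulator running at four times the output clock. The following minimal Python sketch reproduces that arithmetic; the function name `output_clock_hz` and the reported margin are illustrative assumptions, not part of the original design flow.

```python
def output_clock_hz(fast_clock_hz: float, ratio: int = 4) -> float:
    """Output clock rate when the phase accumulator runs `ratio` times faster."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return fast_clock_hz / ratio


# 4.5 GHz is only a lower bound on the fastest simulated accumulator clock.
max_output_hz = output_clock_hz(4.5e9)
# 4 GHz is the input clock used for the simulations.
nominal_output_hz = output_clock_hz(4.0e9)
# Relative frequency margin between the two settings (illustrative only).
margin = 1.0 - nominal_output_hz / max_output_hz

print(f"max output clock:     {max_output_hz / 1e9:.3f} GHz")
print(f"nominal output clock: {nominal_output_hz / 1e9:.3f} GHz")
print(f"frequency margin:     {margin:.1%}")
```

Since 4.5 GHz is a lower bound, the margin computed here is likewise a lower bound on the headroom left by the 1 GHz setting.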
The complete circuit consists of the phase accumulator as shown in Figure 2 and seven delay elements as shown in Figure 3. The input clock ran at 4 GHz, leading to a nominal clock frequency of 1 GHz. We used a sharp and steep voltage droop from the nominal 1.1 V down to 0.95 V, lasting slightly over 10 ns, with fall and rise times of 10 ps, in order to capture a worst-case scenario of a sharp high-frequency droop with a duration of only a few clock cycles. The Spice simulation results are shown in Figure 5, with a zoomed version around the first droop in Figure 6. The topmost graph shows the supply voltage and its drop to 0.95 V. The second graph, "E", shows the simulated output of the droop detector, for which we assumed a delay of 1 ns. The third graph, "C out", is the clock output of our circuit. The remaining graphs are pairs of the delay enable and clock signals passed between the delay elements, with corresponding signals shown in the same color, going backwards from the clock output to the phase accumulator: "E7" and "C7" are the enable and clock signals between the last and second-to-last delay elements, "E6" and "C6" those between the second- and third-to-last, etc. The signals "E1" and "C1" are between the phase accumulator and the first delay element.
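To make the droop stimulus concrete, the sketch below builds the breakpoints of a trapezoidal supply waveform with the voltage levels and edge times stated above. It is only an illustration: the function names, the droop start time, and the interpretation of the duration as the time spent at the low level are assumptions, and the actual stimulus file used for the Spice runs is not given in the text.

```python
def droop_breakpoints(t_start, t_low, v_nom=1.1, v_min=0.95, t_edge=10e-12):
    """Return (time, voltage) breakpoints of a trapezoidal supply droop.

    t_start: time at which the supply starts to fall (hypothetical value)
    t_low:   time spent at the low level (assumed meaning of "duration")
    """
    t_fall_end = t_start + t_edge
    t_rise_start = t_fall_end + t_low
    return [
        (t_start, v_nom),                # supply still at nominal level
        (t_fall_end, v_min),             # end of the 10 ps falling edge
        (t_rise_start, v_min),           # droop ends, supply starts to rise
        (t_rise_start + t_edge, v_nom),  # supply back at nominal level
    ]


def relative_depth(v_nom=1.1, v_min=0.95):
    """Droop depth as a fraction of the nominal supply voltage."""
    return (v_nom - v_min) / v_nom


# Hypothetical timing: droop starts at 10 ns and stays low for 10 ns.
points = droop_breakpoints(t_start=10e-9, t_low=10e-9)
for t, v in points:
    print(f"{t * 1e9:8.3f} ns  {v:.2f} V")
print(f"relative depth: {relative_depth():.1%}")
```

A list of this shape maps directly onto the time/value pairs of a piecewise-linear voltage source in a Spice netlist.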
As can be seen, the output clock frequency adapts to the droop detect signal within a single clock cycle, both at the start and at the end of the droop. The delay enable trickles backward through the chain and is finally absorbed by the phase accumulator. As the droop lasts for approximately 9 clock cycles, this results in two clock cycles being dropped. Note that, because there are seven delay elements in the chain, the phase accumulator has only just seen the delay enable signal by the time the droop is over. Yet the output clock immediately resumes its high-frequency operation, thus minimizing the performance impact of the droop.
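The cycle counts above can be checked with a rough back-of-the-envelope model. The sketch below is hedged accordingly: it assumes that each stretched cycle is longer by a quarter of the nominal period (Q = 4, suggested by the accumulator running at four times the output clock, but not stated in this section) and that the delay enable moves back by one delay element per clock cycle. All function names are hypothetical.

```python
from fractions import Fraction


def accumulated_lag(stretched_cycles: int, q: int = 4) -> Fraction:
    """Phase lag in nominal periods, assuming each stretched cycle is 1/q longer."""
    return Fraction(stretched_cycles, q)


def whole_cycles_dropped(stretched_cycles: int, q: int = 4) -> int:
    """Number of complete nominal clock cycles lost to the accumulated lag."""
    return stretched_cycles // q


def hops_to_accumulator(num_delay_elements: int) -> int:
    """Clock cycles until the enable reaches the accumulator (one hop per cycle, assumed)."""
    return num_delay_elements


droop_cycles = 9  # approximate droop length in clock cycles, from the text
print("accumulated lag (periods):", accumulated_lag(droop_cycles))
print("whole cycles dropped:     ", whole_cycles_dropped(droop_cycles))
print("cycles to reach ϕ:        ", hops_to_accumulator(7))
```

Under these assumptions, nine stretched cycles accumulate a bit more than two nominal periods, and the enable needs roughly as many cycles as there are delay elements to reach the phase accumulator. Both are in line with the behavior described above, but the model is a simplification of the simulated circuit.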

V. CONCLUSION
High-frequency voltage droops consume a significant fraction of the clock guardband. We proposed a circuit that can react to steep, high-amplitude droops without the need to halt the clock. The circuit is based on detecting droops and propagating this information along a delay line, back to a DCO that accounts for the respective phase offset. The clock signal travels in the opposite direction through the delay line. Care had to be taken in handling metastability: we make use of masking flip-flops, ensuring that no glitches are introduced in the clock signal.
We verified our design by correctness proofs and synthesized it in UMC 65 nm, running VHDL and Spice simulations with a 1 GHz nominal output clock (4 GHz input clock), whose results are in accordance with our theoretical predictions.

Figure 1: System architecture of the proposed frequency adaptation module: the DD senses the occurrence of a droop and issues delay enable signals that travel through the pipe from right to left to the ϕ module. Clock signals travel from left to right and are delayed accordingly at the delay elements. Delay enable signals that arrive at ϕ are made permanent by shifting the phase of the input clock.


Figure 6: Zoom in around the first droop. Observe the immediate stretching of "C out" due to the voltage droop. The last delay element samples "E=0" and thus (i) applies the phase shift to "C out" and (ii) sets "E7=0" with the falling transition of "C out" at 12 ns. Delay element 7 then samples "E7=0" and thus applies the phase shift to "C7" with the falling transition of "C7" at 13 ns.
4) Delay Element. The delay element, as introduced in (2), has three inputs $\bar{E}^F_I$, $\bar{E}^S_I$, and $C_I$, and three corresponding outputs $\bar{E}^F_O$, $\bar{E}^S_O$, and $C_O$, connected like a REQ/ACK pipeline. Clock output $C_O$ is the clock input $C_I$, potentially delayed by up to an additional $T_s/Q$ time. Inputs $\bar{E}^*_I$ provide the delay enable, representing the (low-active) decision whether we need to delay the clock or not. Outputs $\bar{E}^*_O$ propagate this delay enable backwards in the chain, at the occurrence of the next local falling edge of $C_I$. We use $\bar{E}^F_I$ for the internal decision whether to add delay, whereas $\bar{E}^S_I$ is propagated to both outputs $\bar{E}^*_O$.

[...] where we scaled $\bar{E}^*_I$ such that 1 represents a strong-high, 0 a strong-low, and M any voltage in between. Intuitively, $b^*_{i,j}$ is the resulting state of a flip-flop with input $\bar{E}^*_I$ latched at time $\tau^{\downarrow}_{i,j} + t_{\mathrm{ofs}}$, where M represents metastability resulting from a setup/hold time violation or an otherwise unclean signal. Note that, as the outputs $\bar{E}^*_O$ are fed to the module to the left, $b^*_{i,j-1}$ is given in terms of $\bar{E}^*_O$ latched at time $\tau^{\downarrow}_{i,j-1} + t_{\mathrm{ofs}}$. With this, we can require: