Exact Fused Dot Product Add Operators

This article explores architectures of exact (correctly rounded) fused dot product and add operators suitable for the FP32 and FP64 binary floating-point representations with subnormal support, and other representations with a wide dynamic range such as bfloat16. The exact summation of terms before rounding requires a full-size accumulator, and this work discusses techniques to compress the identical bits of this accumulator. This requires the computation of the relative shift amounts of the terms, which is formulated as a parallel prefix algorithm, allowing for a low-latency implementation. Architectural options for the exact fused dot product and add operators with up to 16 products for FP32, FP64 and mixed-precision BF16 to FP32 are evaluated using the TSMC 16FFC technology node.


I. INTRODUCTION
This work addresses the floating-point computation of
$$R = \circ\Big(\sum_{i=0}^{N-1} X_i Y_i + Z\Big)$$
with a single rounding $\circ$ of the exact dot product plus addend. Here $(X_i)_{i\in[0,N-1]}$, $(Y_i)_{i\in[0,N-1]}$, $Z$ and $R$ are floating-point numbers in a representation with a wide dynamic range such as IEEE 754 binary64 (FP64), binary32 (FP32), and bfloat16 (BF16). The multiplicands $X_i$ and $Y_i$ have $e_{in}$ exponent bits and $m_{in}$ significand bits. The output $R$ and the addend $Z$ have a possibly wider format with $e_{out}$ and $m_{out}$ bits. The corresponding operator is called FDPNA, possibly suffixed with the format, e.g. FDPNA FP32 for homogeneous operators and FDPNA BF16→FP32 for mixed-precision ones.

A. Motivations
The main applications of FDPNA operators are accumulations of partial dot products in machine learning and linear algebra applications. In addition, the operator where N = 2 (called ExSdotp in [1]) effectively supports complex floating-point arithmetic, as a complex fused multiply and add R = XY + Z is implemented with the best possible accuracy in only two operations. Noting $j^2 = -1$,
$$\Re(R) = \circ\big(\Re(X)\Re(Y) - \Im(X)\Im(Y) + \Re(Z)\big), \qquad \Im(R) = \circ\big(\Re(X)\Im(Y) + \Im(X)\Re(Y) + \Im(Z)\big).$$
An illustration of practical importance is the on-line computation of the twiddle factors of Fast Fourier Transforms (FFT) [2]. The twiddle factors of an N-point DFT are defined as $W_P^k = e^{-2j\pi k/P}$, with P some power of 2 smaller than N and $k \in [0, P-1]$. They can be computed offline with the results stored in a table, or online using the recurrence $e^{j(k+1)\theta} = e^{jk\theta} \times e^{j\theta}$ with $\theta = -\frac{2\pi}{P}$. Such a sequence of n multiplications by $e^{j\theta}$ may lead to O(n) error [3], so a twiddle factor recurrence should be implemented carefully. Rewriting $e^{j(k+1)\theta} = e^{jk\theta} + e^{jk\theta}(e^{j\theta} - 1)$ with $e^{j\theta} - 1 = -2\sin^2\frac{\theta}{2} + j\sin\theta$ avoids the cancellation in $\cos\theta - 1$ [4], since θ is small for large-N FFTs. Computing $e^{j(k+1)\theta}$ then becomes a complex multiply-add that is accurately implemented by the FDP2A operators.
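As an illustration, the following Python sketch compares the two recurrences against an FP64 reference (NumPy complex64 emulates FP32 complex arithmetic; every operation is rounded separately, so it only demonstrates the algebraic rewriting, not the single-rounding FDP2A hardware; the helper names are ours):

```python
import numpy as np

def twiddles_naive(P):
    """Recurrence w_{k+1} = w_k * e^{j*theta}: the error grows with k."""
    theta = -2.0 * np.pi / P
    step = np.complex64(np.cos(theta) + 1j * np.sin(theta))
    w = np.complex64(1.0)
    out = []
    for _ in range(P):
        out.append(w)
        w = np.complex64(w * step)          # one rounded complex multiply per step
    return np.array(out)

def twiddles_rewritten(P):
    """Recurrence w_{k+1} = w_k + w_k*(e^{j*theta} - 1), with
    e^{j*theta} - 1 = -2*sin(theta/2)**2 + j*sin(theta), which avoids the
    cancellation in cos(theta) - 1. In hardware each component of the update
    would be a single FDP2A (rounded once); NumPy rounds every operation,
    so this only illustrates the algebraic rewriting."""
    theta = -2.0 * np.pi / P
    delta = np.complex64(-2.0 * np.sin(theta / 2) ** 2 + 1j * np.sin(theta))
    w = np.complex64(1.0)
    out = []
    for _ in range(P):
        out.append(w)
        w = np.complex64(w + w * delta)
    return np.array(out)

if __name__ == "__main__":
    P = 1 << 16
    ref = np.exp(-2j * np.pi * np.arange(P) / P)       # FP64 reference
    for f in (twiddles_naive, twiddles_rewritten):
        err = np.abs(f(P).astype(np.complex128) - ref)
        print(f.__name__, "max error:", err.max())
```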
Fig. 1 displays the maximum and average errors for the FMA-based and the FDP2A-based twiddle factor recurrences in log10 scale, for N spanning successive powers of two. The reference values are twiddle factors computed using the libm standard cos and sin functions, rounded to FP32 from FP64. The error is defined as the modulus of the difference between a value and the baseline computed using FP64 arithmetic. Since twiddle factors are roots of unity, these errors are both absolute and relative. The FMA-based recurrence errors grow up to two orders of magnitude larger than those of the FDP2A-based recurrence.

B. Related Work
1) Complex Multiply-Add Implementations: The Arm SVE2 Floating-point Complex Multiply Accumulate (FCMLA) instruction [11] multiplies a complex number by the real or imaginary part of the second multiplicand, with optional negation, then adds a complex addend. A full complex multiply-add requires only two FCMLA instructions; however, each FCMLA instruction is implemented as two parallel FMA operations.
2) Inexact Fused Add3 Operators: The fused sum of three floating-point numbers has been studied in [5], [7]. Here the three term significands are aligned in a way similar to the sum of two numbers, and if they do not overlap, the smaller term is shifted out into a sticky bit. In case of a full cancellation of the two larger terms, the result should be the third (smaller) term, but it cannot be recovered, so this approach is not exact.
3) Inexact Fused Dot Product Add Operators: Likewise, the fused floating-point four-term dot product unit of [12] is inexact, as significands of low magnitude are compressed into a sticky bit.
The Intel Nervana Neural Network Processor-T [10] implements an inexact 32-product BF16 dot product FP32 add operator. The products are aligned to the maximum product exponent, then truncated to an internal 37-bit datapath before being summed. A similar operator with up to 32 products is presented in [13], focusing on optimizing the maximum exponent and global alignment computation, which becomes a bottleneck for large numbers of products. Reverse-engineering of the NVIDIA V100 GPU tensor cores [9] reveals that they also adopted a similar approach for dot product add operators.
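As a rough illustration of this align-truncate-sum scheme (a generic Python sketch, not the actual Nervana or V100 datapath: the frexp-based decomposition, the truncation towards zero and the final scaling are all simplifications, and only the 37-bit internal width is taken from the text):

```python
from math import frexp

def inexact_dot(products, width=37):
    """Sum products on a truncated internal datapath (inexact).

    Each product is aligned to the maximum product exponent and truncated to
    `width` bits below that exponent; the truncated values are then summed.
    Bits shifted beyond `width` are simply lost."""
    parts = [frexp(p) for p in products]             # p = m * 2**e with m in [0.5, 1)
    e_max = max(e for _, e in parts)
    acc = 0
    for m, e in parts:
        acc += int(m * 2 ** (width - (e_max - e)))   # align to e_max, then truncate
    return acc * 2.0 ** (e_max - width)

# usage: the tiny middle term disappears, unlike with an exact accumulator
ps = [1.0, 3.0 * 2.0 ** -80, -1.0]
print(inexact_dot(ps), "vs exact", 3.0 * 2.0 ** -80)
```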
4) Exact Fused Dot Product Add Operators: The ExSdotp fused dot product add operator of [1] first sorts the three terms based on their exponents (the exponent of each product being the sum of the multiplicand exponents). The two larger terms are added, and if a full cancellation is detected, the smaller term is restored for the result. This works for the sum of three floating-point numbers. However, with products involving subnormal multiplicands, this approach may produce incorrect results with directed rounding. For example, consider an ExSdotp FP16→FP32 instance with a subnormal Y_1, evaluated with rounding up, for which the correct result is R = 1. The exponent of X_1 Y_1 is larger than that of Z due to Y_1 being subnormal. Therefore the sort, based on the exponents only, will use X_1 Y_1 for the addition instead of Z, and the result returned will be R = 1 + 2^{-23}. This issue also affects the inexact flag in round to nearest.
Kulisch [14] advocated the use of a fixed-point accumulator large enough to hold any exact product of floating-point terms. Many-term exact fused dot product add operators using a Kulisch-like full-size accumulator have been proposed in [8], [15]. This approach will be studied quantitatively in this paper. An alternative is to compress identical bits inside the full-size accumulator. Here we build on Tao et al. [6], improving their work with subnormal support, management of the addend Z, and mixed precision. We also explore alternative techniques for several sub-problems, including sorting networks and a parallel-prefix computation of the significand shifts.

C. Outline
This article is organized as follows. Section II exposes the operating principles of the proposed exact fused dot product add operators. Section III focuses on the key insight, which is how to suitably align the sorted term significands before their exact summation. Section IV discusses implementation and validation details. Section V reports on the synthesis results.
Fig. 2. High-level architecture of the FDPNA operator.
Section VI presents some conclusions with respect to the relevance domain of the techniques studied.

II. OPERATING PRINCIPLES
A. Architecture Overview
Fig. 2 describes the high-level architecture of our dot product add operators. Prior to each multiplication, the implicit bit is added to the fraction of each input, with a value 0 for a subnormal input and 1 otherwise. The significands are multiplied together and negated if needed, while the exponents are added. Each product X_i Y_i is now represented by its exponent E_i and its signed, non-normalized significand M_i. The addend Z is also split into its exponent E_N and signed significand M_N.
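As an illustration, a minimal Python model of this front end for FP32 (the helper names are ours, Inf/NaN are not handled, and real hardware performs this with simple wiring rather than arithmetic):

```python
import struct

def decode_fp32(x):
    """Split an FP32 value into (exponent E, signed significand M).

    The implicit bit is 0 for subnormals and 1 otherwise; E is the unbiased
    exponent of the implicit-bit position. Inf/NaN are not handled here."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = -1 if bits >> 31 else 1
    exp_field = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp_field == 0:                           # subnormal: no implicit 1
        return (1 - 127, sign * frac)
    return (exp_field - 127, sign * (frac | (1 << 23)))

def product_term(x, y):
    """Represent the exact product X*Y as (E_i, M_i): exponents add,
    signed significands multiply (M_i is non-normalized)."""
    ex, mx = decode_fp32(x)
    ey, my = decode_fp32(y)
    return (ex + ey, mx * my)

# usage: 1.5 * 0.75, significands carry 23 fraction bits each -> scale by 2**46
E, M = product_term(1.5, 0.75)
print(M * 2.0 ** (E - 46))                       # 1.125
```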
Then, a component denoted FPΣ aligns the significands and sums the N+1 floating-point terms. Following the summation, a Leading Zero Count (LZC) retrieves the exponent of the result. Finally, the sum is normalized or subnormalized, rounded, and output as a floating-point number. Overflow, underflow, NaN and IEEE 754 flags are managed within FPΣ as well.

B. Full-Size FPΣ Implementation
The FPΣ component can be implemented by first converting all terms into a common two's complement fixed-point representation [14], [8]. To ensure that this conversion is exact, the fixed-point representation is defined by the worst-case Most Significant Bit (MSB) and Least Significant Bit (LSB) of a shifted product significand. These values can be deduced from the parameters e_in, m_in, e_out and m_out of the input/output floating-point formats. For example, in the homogeneous case, $\mathrm{LSB} = -2\times(2^{e_{in}-1}-2+m_{in})$ and $\mathrm{MSB} = 2\times(2^{e_{in}-1}-1)$. This defines the full size in bits w_full of the common fixed-point representation. Values for the floating-point formats of interest are summarized in Table I.
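For instance, a quick Python evaluation of these bounds for the homogeneous formats (a sketch: it assumes w_full = MSB − LSB + 1, before the extra overflow-guard bits discussed below, so Table I values may differ slightly):

```python
def full_accumulator_bounds(e_in, m_in):
    """Homogeneous case: worst-case weights of a shifted product significand.

    LSB = -2*(2^(e_in-1) - 2 + m_in): product of the two smallest subnormals.
    MSB =  2*(2^(e_in-1) - 1)      : product of the two largest finite values."""
    lsb = -2 * (2 ** (e_in - 1) - 2 + m_in)
    msb = 2 * (2 ** (e_in - 1) - 1)
    return msb, lsb, msb - lsb + 1               # assuming w_full = MSB - LSB + 1

for name, e_in, m_in in [("FP16", 5, 10), ("FP32", 8, 23), ("FP64", 11, 52)]:
    msb, lsb, w_full = full_accumulator_bounds(e_in, m_in)
    print(f"{name}: MSB={msb}, LSB={lsb}, w_full={w_full} bits")
```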
Fig. 3. Example of alignment for the sum using a full-size fixed-point format.
This conversion to a common fixed-point format (or alignment) requires shifting M_i left by E_i bits, or alternatively shifting M_i right by MSB − E_i bits, in both cases with sign extension. Fig. 3 uses right-shifting for consistency with later figures. Once converted to fixed point, the terms can be summed using integer addition into a full-size accumulator (Fig. 3), extended with ⌈log2(N + 1)⌉ extra bits on the left to guard against possible overflows. Multiplications, conversions, and additions are all exact. The result may then be normalized and rounded (only once) to the output format.
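As an illustration, a behavioral Python model of this full-size FPΣ (a sketch under simplifying assumptions: Python's unbounded integers stand in for the w_full-bit accumulator, special values are ignored, and the single final rounding is delegated to float(), i.e. round-to-nearest to FP64, rather than to the actual output format):

```python
from fractions import Fraction

LSB = -2 * (2 ** 7 - 2 + 23)      # worst-case LSB for homogeneous FP32 (Section II-B)

def exact_fdpa(xs, ys, z):
    """Behavioral model of the full-size FPsigma.

    Every exact product and the addend are converted to an integer number of
    units of weight 2^LSB and summed exactly (emulating the Kulisch-like
    accumulator); the result is rounded only once at the end."""
    acc = 0
    for v in [Fraction(x) * Fraction(y) for x, y in zip(xs, ys)] + [Fraction(z)]:
        scaled = v / Fraction(2) ** LSB            # exact alignment to the common LSB
        assert scaled.denominator == 1, "term finer than the FP32 worst case"
        acc += scaled.numerator
    exact = Fraction(acc) * Fraction(2) ** LSB
    return float(exact), exact                     # float(): the single rounding, here to FP64

rounded, exact = exact_fdpa([1.0, 2.0 ** -80], [1.0, 3.0], -1.0)
print(rounded)      # 3*2^-80: a tiny term that a truncated datapath would lose
```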
This approach is quite efficient for formats with small full-size accumulators, so it has been used in particular for the exact sequential accumulation of FP16 products [8] and an FDP8A FP16→FP32 operator [15]. In the mixed-precision case, it is better to perform the sum of products separately in an accumulator corresponding to the small format, then perform the last addition of Z as in a classical FMA. As the sum of products is exact, it is associative and can be parallelized without any consideration of the exponent values.

C. Compressed FPΣ Principles
When (as is the case in Fig. 3) the accumulator size is much larger than the sum of the widths of the N+1 significand terms to add, one observes that most of the summation adds zeros (or sign bits, which are similarly easy to manage). Intuitively, these parts of the summation can be saved, and the worst-case size of the actual addition needed should be roughly N times the size of a term significand. What we call compression in this article is the suppression of the columns of predictably identical bits in Fig. 3. It is called realignment in [6].
The core idea is to use a single fixed-point multi-operand adder of size w_compressed < w_full. The shift values needed to align the M_i before their summation are no longer trivially deduced from the E_i as in the full-size case: they now require a more complex computation to skip the compressed bits. The main issue is the combinatorial explosion of alignment situations that must be handled in hardware.
To address this explosion, we first sort the terms by their exponent E_i. Section IV-A discusses our approach for this. Let us denote $(E^*_i, M^*_i)_{i\in[0,N]}$ the (exponent, significand) pairs sorted by decreasing exponent. Our sorting approach requires that the products X_i Y_i and the addend Z be represented in the same way. The significand size w that fits all is the maximum width of all the M_i, i.e. $w = \max(2(1+m_{in}),\, 2+m_{out})$. As illustrated in Fig. 4, the significand M_N of the addend Z is extended with an extra MSB corresponding to the overflow bit of a product. In the homogeneous case, M_N is extended with extra LSBs. In the mixed-precision case, M_N usually also has more precision to the right than the other M_i (e.g. 22 bits for BF16 products versus 24 bits for FP32). In the sequel, all M_i are considered to have the same size w. The exponent E_N of Z must be similarly updated; we do not detail it further. Accordingly, the architecture we propose is sketched in Fig. 5. The (exponent, significand) pairs are first sorted by their exponent, then processed by a component that determines the adjusted shift values S_i. The sorted significands are then shifted before being summed in an adder tree. The sum RM_full is converted back to a (sign, magnitude) representation, normalized, then rounded to produce the final significand.
Compared to the full-size FPΣ, this architecture involves one fewer shift, and if w_compressed < w_full the shifts are smaller, as are the adder tree and the LZC. However, it also requires a sorting component, more exponent pre-processing, and some exponent post-processing as well. This trade-off is evaluated quantitatively in Section V.

D. Introductory Considerations
What is the smallest number of bits w_compressed such that all the N + 1 terms can be added using integer adders of size at most w_compressed and the exact sum can be recovered for rounding as if it were computed on w_full bits? To address this question while introducing notions needed in the sequel, first consider the trivial case N = 1, then the simple case N = 2.
1) Two Terms (N = 1): Fig. 6 shows a fixed-point addition of two terms. There are two cases: (case 1) the two significands are far apart, and the bits between them can be omitted, as the result will be M*_0 or one of its immediate FP neighbours (one round bit must be kept to the right of M*_0); or (case 2) M*_1 overlaps M*_0, and there will be an addition of size at most 2w bits. In both cases the w_full bits of the exact sum can be compressed into 2w + 2 bits. This observation is exploited in classic FP adders, but with an additional trick: the bits in the blue zone of Fig. 6 can be compressed into a sticky bit and a guard bit [16] in a way that keeps enough information for a correctly rounded addition.
2) Three Terms (N = 2): There is now the possibility of a complete cancellation of the two leading terms M*_0 and M*_1. In such a case, the exact result is M*_2, therefore we must keep all its bits, so it is incorrect to compress them into a sticky bit. This will be the case for all N > 1, so we will no longer mention sticky bits. Fig. 7 shows the four alignment cases of fixed-point addition for three terms (N = 2).
In case 1, the three significands are fully separated and all the bits between them can be compressed. We have three colored zones in this figure (from left to right: orange, blue and magenta), and the architecture will need to remember the exponent of each significand and associate it with the corresponding zone to emulate the full-size accumulator. In the sequel, zone i is associated with the exponent E*_i and a priori contains the significand M*_i. In case 2, the two lower significands M*_1 and M*_2 overlap, which means that their sum may overflow by one bit to the left of M*_1. The blue zone, in the compressed view, reserves one more bit for this overflow than in Fig. 6, to the left of M*_1. Besides, there is no longer a magenta zone as in case 1: significand M*_2 belongs to the blue zone; since it must be added to M*_1, it inherits its exponent E*_1. The good news is that in the compressed adder tree, the bits necessary for this addition may recycle those of the magenta zone of case 1.
Case 3 is similar, but with M*_1 now in the orange zone due to M*_1 overlapping M*_0. Note that the magenta zone, associated with the exponent E*_2, exists in this case. In such a case, E*_1 will be used to shift M*_1 to its proper place in the compressed sum, but it is not associated with a zone.
Finally, case 4 shows the situation where we only have one zone, associated with E*_0, which happens as soon as the compressed accumulator can hold the exact sum. Here neither E*_1 nor E*_2 is associated with a zone.

Fig. 6. Compressing the exact addition of two terms.
Fig. 7. Compressing the exact addition of three terms.

III. CONSTRUCTION OF THE COMPRESSED FPΣ
In this section, we first generalize the N = 1 and N = 2 considerations to formally define the sizes and parameters of the N + 1 zones of an FDPNA operator. We then define an architecture to shift all the significands to their proper place in the compressed accumulator. Determining those shift values can be implemented as a parallel prefix computation.

A. Compressed FPΣ Parameters for N Terms
In the case of N terms, the key is to determine the zones in the summation where each significand M*_i should be positioned if it does not overlap with other significands. These zones are specified by their boundaries, denoted d_i. Fig. 8 illustrates the d_i values for w = 3, in the cases N = 2 and N = 4.
The number of bits between d_i and d_{i+1} should be at least the width w of the significand. An extra bit is added to the right of the significand as a placeholder for a round bit. Besides, p_i protection bits are added to the left of the significand (see Fig. 8) to absorb the worst-case overflow of the significands of lower or same magnitude, leading to the recurrence $d_{i+1} = d_i + p_i + w + 1$. This recurrence also defines the total size of the compressed accumulator w_compressed = d_{N+1} − d_0. The p_i and d_i parameters only depend on the value of N, which is set at design time.

Fig. 8. Position of the zone delimiters in cases N = 2 and N = 4.
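As an illustration, the following Python sketch derives these design-time parameters; the protection-bit count p_i = ⌈log2(N+1−i)⌉ and the starting point d_0 = 0 are assumptions (our reading of the worst-case overflow of the N+1−i significands of lower or same magnitude), not values taken verbatim from this section:

```python
from math import ceil, log2

def zone_parameters(N, w):
    """Zone delimiters d_i and protection bits p_i for N products + 1 addend.

    Assumptions: d_0 = 0 and p_i = ceil(log2(N + 1 - i)), the carry bits needed
    to absorb the worst-case overflow of the N + 1 - i significands of lower or
    same magnitude. Each zone then holds p_i protection bits, the w-bit
    significand and 1 round bit."""
    d, p = [0], []
    for i in range(N + 1):
        p_i = ceil(log2(N + 1 - i))
        p.append(p_i)
        d.append(d[-1] + p_i + w + 1)              # d_{i+1} = d_i + p_i + w + 1
    return d, p, d[-1] - d[0]                      # w_compressed = d_{N+1} - d_0

# usage: homogeneous FP32, w = 2 + 2*23 = 48, to compare against w_full = 553
for N in (2, 4, 8, 16):
    _, _, wc = zone_parameters(N, 48)
    print(f"N={N:2d}: w_compressed={wc} (w_full=553)")
```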

B. Definition of the Shift Values S_i
M*_0 is placed in a constant position to the left of the addition: S_0 = d_0 + p_0. Determining the subsequent S_i involves another case analysis, illustrated in Fig. 9 for N = 2: either E*_i is "close to" the exponent of an earlier zone, or it is not.

Specifically, E*_i "close to" E*_j means that M*_i can be placed in the same zone as M*_j. Then it aligns with E*_j, and the bits of zone i are recycled to make space for the addition in zone j. This leads to the following definition for the shifts:
$$S_i = \min\big(d_i + p_i,\; S_{i-1} + E^*_{i-1} - E^*_i\big) \quad \text{for } i > 0.$$

C. Parallel Prefix Computation of the Shifts
The recurrence defining S_i may be evaluated faster by reformulating it into a form suitable for parallel prefix computation.
Fig. 10. Example of parallel prefix computation for N = 4.
First, each level contains dependencies on all the positions of the previous levels, but this is not necessary for a correct result.
The formula for S_2 can be rewritten as
$$S_2 = \min\big(d_2+p_2,\; d_1+p_1+E^*_1-E^*_2,\; S_0+E^*_0-E^*_2\big).$$
Second, observe that the last argument of the min can be rewritten as $(d_0+p_0+E^*_0)-E^*_2$, and likewise each argument as $(d_j+p_j+E^*_j)-E^*_2$. This leads to:
$$S_i = \min_{0\le j\le i}\big(d_j+p_j+E^*_j\big) - E^*_i.$$
Note that all the $d_j + p_j + E^*_j$ can be computed in parallel as soon as the order of the exponents is known. This is also a convenient form for hardware implementation using a Hillis & Steele parallel prefix tree [17], as shown in Fig. 10.
The general formulas for the shifts thus split into a pre-processing step computing $A_j = d_j + p_j + E^*_j$ in parallel for all j, and the prefix computation $S_i = \min_{0\le j\le i} A_j - E^*_i$. The index k_i is computed along the min and is the index of the zone to which M*_i belongs.
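As an illustration, a behavioral Python sketch of this computation (the Hillis & Steele scan is written as a loop over levels; the zone parameters d, p follow the assumptions of the earlier sketch, and the tie-breaking between equal A_j values is an arbitrary choice of this model):

```python
def shift_values(E_sorted, d, p):
    """Shifts S_i = min_{j<=i}(d_j + p_j + E*_j) - E*_i, plus the zone index k_i.

    E_sorted holds the exponents sorted in decreasing order; d and p come from
    the zone-parameter sketch above."""
    n = len(E_sorted)
    a = [d[j] + p[j] + E_sorted[j] for j in range(n)]   # pre-processing, in parallel
    k = list(range(n))
    step = 1
    while step < n:                                     # Hillis & Steele prefix min
        new_a, new_k = a[:], k[:]
        for i in range(step, n):
            if a[i - step] <= a[i]:                     # keep the smaller, remember its zone
                new_a[i], new_k[i] = a[i - step], k[i - step]
        a, k = new_a, new_k
        step *= 2
    S = [a[i] - E_sorted[i] for i in range(n)]
    return S, k

# usage with zone parameters for N = 2, w = 4 (d = [0, 7, 13, 18], p = [2, 1, 0]):
print(shift_values([10, 9, -5], [0, 7, 13, 18], [2, 1, 0]))
# -> ([2, 3, 13], [0, 0, 2]): M*_1 shares zone 0, M*_2 opens zone 2
```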

D. Computation of Final Exponent E
Once the summation is done, its result is converted back into a sign-magnitude representation and then normalized. Normalization starts with a leading zero counter (LZC) which determines the number L of zeros before the leading bit in the (compressed) sum. This L is compared to all the d_j to determine the index i of the zone of the result. Then the index k_i is used to retrieve the exponent E*_{k_i} actually associated with this zone. The exponent associated with the leading position of this zone is E*_{k_i} + 1; here the +1 captures the fact that our M*_i have two bits, not one, to the left of the binary point (see Fig. 4). Finally, the normalized result exponent E is deduced from this value and from the position L of the leading bit relative to the zone delimiter d_i.

E. Subnormal Management
Fig. 11 illustrates a problematic situation to avoid: due to a cancellation, the normalized result contains bits that belong to two different zones. In the full sum there would be more bits between them, so the resulting significand is incorrect.
This situation cannot happen because of partial cancellations with normal inputs, as in this case the zone in which the cancellation occurs has been enlarged by fusing in the bits from the zones to the right. See case 3 of Fig. 7: there are more than w extra bits in the orange zone due to fusing in the blue zone. A cancellation is either full (then the LZC will skip the orange zone), or it cancels at most w−1 bits and there remains more than a significand's worth of bits in the orange zone to the right of the leading one. This is also true if three or more significands share the same zone, as the zone is correspondingly larger.
However, the problematic situation may occur if there is one subnormal input to a product. Managing this properly requires specific considerations on e_in, e_out, m_in, and m_out.
In the following paragraphs, M*_0 will be used to discuss the problematic situations, since they arise when the leading 1 after the sum is positioned near the next zone. In general, those situations may happen in any zone, in case of a complete cancellation in the leading zones.
1) Homogeneous Case (m_in = m_out and e_in = e_out): In this case, remember that the width of the M*_i is w = 2 + 2m_in. If M*_0, the largest-magnitude term, is a product between a large normal and a subnormal, it cannot have more than m_in leading zeros, and the width of the zone is greater than w = 2 + 2m_in. So the problematic situation in Fig. 11 cannot happen.
If M*_0 is a product of two subnormals, then its exponent is the smallest possible, which means that all terms are products of two subnormals, all the following zones will be merged, and there is no problem either.
2) Mixed-Precision when e_in ≤ e_out: In the worst case, the product of the two smallest non-zero subnormals has only one bit left, and nevertheless may be larger than Z. This leads to the problematic situation in Fig. 11.
3) FDPNA BF16→FP32 Special Case: The BF16 format does not support subnormals [18]. The operator FDPNA BF16→FP32 can still support subnormals in its FP32 input Z and output R. A subnormal Z input is not a problem. Even if there is a product of BF16 numbers smaller than Z, the case in Fig. 11 cannot happen as the result will also be subnormal. Subnormalization of the output is managed in all cases by the normalization and rounding of the compressed exact sum.
Since BF16 and FP32 share the same exponent size, it is possible to support BF16 subnormals by taking w = 32 instead of w = max(2+23, 2(1+7)) = 25. This modification ensures that the significand of the product of a normal by a subnormal cannot extend into the next zone: this product has a maximum of 8 leading zeros, therefore to ensure 24 significant bits in the zone, we need w ≥ 24 + 8 = 32.

IV. IMPLEMENTATION AND VALIDATION DETAILS

A. Exponent Sorting Network
To ensure that subnormals sort bigger than 0, a bit is first appended to the LSB of each E i , equal to 0 iff the significand is 0. These modified exponents are the keys in the (key, payload) pairs processed by the 'Sort by exponent' component in Fig. 5. Several variants of this component were explored and implemented in a fork of FloPoCo [19].
First, we have two options for the payloads: either directly use the significands, or use only their indices (which fit on much fewer bits). The second option makes the sorting itself cheaper, but requires an expensive multiplexing step in order to recover the significands from their indices. A detailed evaluation (omitted here for brevity) shows that sorting by indices ends up being almost twice as expensive in terms of area. In isolation, it would also be slower due to the extra multiplexing to recover the significands. However, sorting by indices reduces the overall latency, since the sorting can be performed while the significand products are being computed. It is thus the option used in the sequel.
For the sorting algorithm itself, we considered two alternatives. The first is to use textbook sorting networks [20]. For N = 4 (5 terms to sort), a bitonic sort has a depth of 5, which is optimal [21]. For 9 terms, the network from [22] has a depth of 7, which is proved optimal in [23]. For 17 terms, the sort used is from [24], whose depth of 10 is optimal.
The second alternative is from Tao et al. [6]. To sort n = N + 1 terms, it first performs n(n − 1)/2 parallel key comparisons. The resulting comparison bits or their complements are input to n counters ("population count") that compute in parallel the rank of each key. Finally, an n × n crossbar completes the sort. This technique obviously has a lower latency than a sorting network. In our experiments, even its area is competitive with sorting networks, at least for the values of N considered here, all the more as only the final crossbar moves payloads. This sorting technique is therefore used in the sequel.
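As an illustration, a behavioral Python sketch of this rank-based sort, including the key construction described at the beginning of this subsection (the tie-breaking by original index, which keeps the ranks a permutation, is an assumption about how equal keys are handled):

```python
def rank_sort_by_exponent(E, M):
    """Sort (E_i, M_i) terms by decreasing exponent with the comparison/rank scheme.

    The key appends one LSB to the exponent, 0 iff the significand is zero,
    so that zero terms sort below everything else with the same exponent field."""
    n = len(E)
    keys = [(e << 1) | (0 if m == 0 else 1) for e, m in zip(E, M)]
    rank = [0] * n                      # rank[i] = how many terms sort before term i
    for i in range(n):                  # the n(n-1)/2 parallel comparisons
        for j in range(i + 1, n):
            if keys[j] > keys[i]:
                rank[i] += 1
            else:                       # ties broken by original index (assumption)
                rank[j] += 1
    out = [None] * n                    # the n x n crossbar routes the payloads
    for i in range(n):
        out[rank[i]] = (E[i], M[i])
    return out

# usage: the zero term sorts after the non-zero term with the same exponent
print(rank_sort_by_exponent([5, -10, -10], [12, 0, 7]))
# -> [(5, 12), (-10, 7), (-10, 0)]
```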
The sequential shift computation of [6] is able to recycle the values E*_i − E*_j, ∀i ≥ j, computed in the sort for their sign bit. This saves area in a non-trivial way (it increases the number of multiplexers used at the end of the sort, but reduces the number of adders and saves their latency). The parallel prefix computation of shift values does not allow the same recycling trick to be used.

Fig. 12. Some bits can be omitted from RShift and the summation bit heap.

B. Shifter and Bit Heap Sizes
Due to the structure of the summation in the compressed FPΣ component, the 'RShift' and '+' components of Fig. 5 can be simplified. As the significand M*_i can only be positioned in {Zone_0, . . ., Zone_i}, the actual size of the shifters is max_shift_i = d_i + p_i (Fig. 12).
Similarly, this structure reduces the actual number of bits to be added. The summation is performed using a compressor tree that inputs the bit array (or bit heap) represented in Fig. 12. Note that a full-size architecture always requires a rectangular bit array.

C. Operator Validation
The general FDPNA operators have been validated with the test bench generator of FloPoCo, which compares outputs to the exact results computed using GNU MPFR. The test bench consists of random tests and some directed tests, as well as exhaustive corner-case checks. Special cases and flag behavior are extrapolated from the IEEE standard for fused operations. The directed tests include:
• testing cancellations by forcing all the inputs to have the same exponent after the product, where some products are randomly negative;
• forcing inputs to have pairs of equal exponents after the product, where one product is negative and another is positive;
• forcing M*_0 to be a product between a normal and a subnormal.
The FDP2A FP32 and FDP2A FP64 operators have also been validated with a directed framework which tests against a golden model implemented with Sollya [25]. This framework thoroughly explores catastrophic cancellation cases with different significands that multiply to the same number, cancellation between Z and a product, different multi-sticky issues, and the use of subnormals. This adds an extra half a million tests.
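As a hint of what such directed tests look like, a hypothetical Python generator for the first kind of directed case (the helper names are ours; the actual FloPoCo test bench and the Sollya golden model work differently):

```python
import random
import struct

def fp32(x):
    """Round a Python float to FP32."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def directed_cancellation_case(N, exponent=0):
    """Hypothetical generator for the first directed test above.

    Every product X_i*Y_i has the same exponent (Y_i is a signed power of two
    and X_i lies in [1, 2)), and so does Z; a random subset of the products is
    negative, exercising massive cancellations."""
    xs = [fp32(1.0 + random.random()) for _ in range(N)]
    ys = [random.choice([-1.0, 1.0]) * 2.0 ** exponent for _ in range(N)]
    z = fp32(random.choice([-1.0, 1.0]) * (1.0 + random.random()) * 2.0 ** exponent)
    return xs, ys, z

print(directed_cancellation_case(4))
```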

V. EXPERIMENTAL RESULTS
In this section, we compare three variants of the FDPNA operators: with a Kulisch-like accumulator (denoted Full), our approach as per Section III (denoted Ours), and a reimplementation of Tao et al. [6] adapted to handle subnormals (denoted Tao). The main difference between Tao and Ours is that the former uses a sequential algorithm to compute the significand shift values, whereas Ours uses a parallel prefix computation.
All these FDPNA operators have been synthesized with the Synopsys Design Compiler NXT for the TSMC 16FFC node.

A. Synthesis Without Pipelining
We first set a target frequency of 1 MHz in order to explore the synthesis trade-off between area and combinatorial delay. The results appear in Table II and are plotted in Fig. 13, along with the size of the internal additions. There is always a threshold in N above which there is no compression (w_compressed > w_full). This threshold is 2 in the homogeneous FP16 case, about 10 in the homogeneous FP32 case, and above 16 for the other formats studied.
In full-size operators, the size of the internal adder does not depend on N. The rest of these operators essentially grows linearly with N. In the compressed operators, the adder sizes increase linearly with N, although the extra pre/post-processing may be compensated by the smaller shifters (see Fig. 12).
The parallel prefix computation of shift values is not always beneficial. In particular, it is not justified for N = 2.

B. Synthesis With Pseudo-Pipelining
We then evaluate the best architecture for each case in the context of a pipelined floating-point unit (FPU). For this purpose, we approximate the synthesis constraints of n-stage operator pipelining by using pseudo-pipelining, that is, single-cycle synthesis at 1/n times the target FPU frequency. The approximation is that the reported results do not account for the cost of the pipeline registers. Results are reported in Table III. The number of pseudo-pipelining cycles n in Table III is selected for each configuration as the area/latency trade-off most relevant for FPU design. Wherever two solutions were close in Table II, both were re-synthesized for the same n, and the best one is reported.

VI. CONCLUSIONS
This article explores architectures of exact fused dot product add operators summing N + 1 floating-point terms. Our results show that operators with a compressed accumulator perform better than those with a Kulisch-like full-size accumulator when the floating-point range is large and N is small.
The FP16 format has a large precision with respect to its range, so the full-size approach performs better except for N = 2. The FP32 format has comparatively less precision with respect to its range, so the full-size approach only becomes relevant for N > 8. The FP64 format has a very large range with respect to its precision, therefore the compressed approach performs better even for N = 16.

Fig. 1. Base-10 logarithm of errors for the FMA-based and the FDP2A-based radix-2 twiddle factor FP32 recurrences depending on the FFT size.

Fig. 11. Result using bits from two different zones, giving an incorrect result.

TABLE II
SYNTHESIS RESULTS FOR TSMC 16nm (1 MHz), NO PIPELINING. A IS THE AREA IN µm², D IS THE TIMING IN ns.

TABLE III
SYNTHESIS RESULTS FOR TSMC 16nm OF THE AREA IN µm², WITH PSEUDO-PIPELINING FOR 1 GHz.