Enabling Floating-Point Arithmetic in the Coq Proof Assistant

Floating-point arithmetic is a well-known and extremely efficient way of performing approximate computations over the real numbers. Although it requires some care, floating-point numbers are nowadays routinely used to prove mathematical theorems. Numerical computations have been applied in the context of formal proofs too, as illustrated by the CoqInterval library. But these computations do not benefit from the powerful floating-point units available in modern processors, since they are emulated inside the logic of the formal system. This paper experiments with the use of hardware floating-point numbers for numerically intensive proofs verified by the Coq proof assistant. This gives rise to various questions regarding the formalization, the implementation, the usability, and the level of trust. This approach has been applied to the CoqInterval and ValidSDP libraries, which demonstrates a speedup of at least one order of magnitude.


Introduction
Efficient and reproducible floating-point computations are widely available nowadays, from embedded processors to supercomputers, thanks to the internationally recognized IEEE-754 standard [1]. The primary use of floating-point arithmetic is to perform approximate computations over real numbers. Due to the limited precision and range of floating-point numbers, various numerical issues arise: rounding errors, overflows, underflows, loss of algebraic properties, and so on.
These issues did not stop people from using floating-point arithmetic for serious applications. In fact, given sufficient care in the implementation, intensive floating-point computations can even be used in proofs of mathematical theorems, e.g., the existence of a strange attractor for Lorenz's equations [2]. To do so, the algorithms not only compute a floating-point approximation of the ideal real result but also a bound on the approximation error. This bound might be computed dynamically, as is the case with interval arithmetic [3]. This approach is easy to use and correct by construction, but it is more expensive than traditional floating-point arithmetic. Another approach consists in mathematically bounding the approximation error by performing a comprehensive floating-point analysis. Correctness becomes much more tedious to guarantee (though tools for static analysis might help in some specific cases), but this offers a wide range of efficient yet rigorous algorithms [4].
Both approaches have been used to formally prove theorems with the Coq proof assistant. On the one hand, the use of interval arithmetic can be illustrated by the CoqInterval library [5], which automatically and formally proves properties of real expressions by computing their floating-point enclosures. To do so, it specifies, implements, and verifies an ad-hoc floating-point arithmetic in Coq: arbitrary radix, arbitrary precision, unbounded exponent range, neither infinities nor signed zeros. On the other hand, the use of precomputed error bounds can be illustrated by the formal verification of the rigorous variant of Cholesky's decomposition [6]. This time, the underlying arithmetic complies with the IEEE-754 standard, so one can use the reference implementation provided by the Flocq library to perform the floating-point computations [7]. In either case, computations on floating-point numbers are emulated inside the logic of Coq using integer arithmetic, and thus they do not benefit from the highly efficient floating-point unit of the processor running the proof assistant.
To improve on this unfortunate state of affairs, support for hardware floating-point numbers was added to Coq [8], in a way reminiscent of the support for machine integers [9,10]. In both cases, the process was as follows. One first declares an abstract type as well as some operations over it: addition, multiplication, and so on. Then, one provides some dedicated reduction rules, which delegate the operations to the arithmetic unit of the processor. This makes it possible to formally prove an equality such as 1.0 + 2.0 = 2.0 + 1.0 by reducing it to 3.0 = 3.0. But short of enumerating all the finitely-many pairs of floating-point numbers (which is practically impossible), this is not sufficient to formally prove ∀x, y, x + y = y + x. So, one also has to relate these abstract operations to some concrete definitions that are much slower but exhibit suitable mathematical properties.
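For instance, the following script (assuming a Coq version that ships the Floats module) is accepted, because the kernel reduces both sides to 3.0 using the processor's floating-point unit:

Require Import Floats.
Open Scope float_scope.

Goal 1.0 + 2.0 = 2.0 + 1.0.
Proof. reflexivity. Qed.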
Section 2 focuses on the latter part, that is, how these concrete definitions are formalized in Coq. A peculiarity of this specification of floating-point arithmetic is that it is complete but hardly useful in isolation, as it explains how the numbers are computed but not what their properties are. Our first contribution was to extend this specification using the Flocq library, to obtain a meaningful formalization that bridges the gap between the hardware floating-point numbers provided by Coq and the real numbers.
Section 3 then focuses on the actual implementation. It briefly recalls how the various conversion and reduction engines of Coq were modified to support hardware floating-point computations [8]. It then details various improvements we have contributed since then, to make this support more useful and usable. In particular, it explains some design choices related to rounding modes (which are critical for interval arithmetic) and to parsing and printing. Finally, it discusses the issue of trust, as both the specification and the implementation amount to adding a large number of axioms. In particular, this article analyzes all the soundness bugs that were found (and fixed) during the four years that followed the original implementation.
Even once a consistent system has been recovered, relying on the hardware floating-point unit in a proof assistant would be pointless if it required too much proof effort or if the performance gain was too small. So, we have converted two preexisting, representative Coq developments, so as to evaluate the costs and the benefits. Section 4 describes what kind of work this conversion entailed. It also benchmarks how proofs relying on hardware floating-point arithmetic compare, performance-wise, to those based on emulated computations.

Specification: Coq and Flocq
From the point of view of the logic of Coq, the type float of floating-point numbers is completely abstract. Similarly, basic operations on these numbers are declared as axioms. Except for some hardcoded reduction rules for these operations (which are delegated to the floating-point unit of the processor), no property is known. Thus, a specification describing this arithmetic is needed. Its role is twofold.
First, it should precisely characterize what the floating-point operations compute. In other words, one should be able to prove properties about what is being computed without performing the computation, e.g., commutativity of addition. The axioms describing this part of the specification are shipped with Coq's standard library. It provides an inductive data type spec_float representing floating-point numbers, as well as conversion functions from and to the abstract type float. It also provides some naive implementation of the arithmetic operations over spec_float, and then states as axioms that the conversion function is a morphism from float to spec_float. Section 2.1 gives more details about this specification.
This first specification is complete but useless in practice, as it is purely operational. It is no different from a software floating-point emulator such as SoftFloat. One should not have to look at the implementation to make use of floating-point numbers inside proofs. So, we need higher-level properties about floating-point operations. In particular, the IEEE-754 standard states that an arithmetic operation shall be performed as if it first computed the result with infinite precision and then rounded it to the target floating-point format [1]. By "infinite precision", the standard simply means that, except for the exceptional values, floating-point numbers are just real numbers, and operations behave the same, rounding notwithstanding. With such a specification, it becomes possible to perform floating-point computations to prove properties about real numbers. This higher-level specification is provided by the Flocq library [7]. If not for the large and intricate proofs that relate both specifications, the Flocq specification could also have been shipped with Coq's standard library. Section 2.2 gives more details about it.

On Coq's side
Support for hardware floating-point arithmetic in Coq was inspired by two libraries: the Flocq library [7], which gathers an IEEE-754-compliant executable formalization, formerly in the module IEEE754.Binary, and the formalization of primitive 63-bit integers [10], now part of the standard library as module UInt63. The former provides a precise specification in the form of a reference implementation of floating-point operators, while the latter guided the implementation methodology.
Our aim was to extract a subset of this Flocq theory sufficient to completely specify floating-point numbers and operators, so that Coq does not depend on Flocq at compilation time. We started by porting this definition, just removing the NaN payloads, because they are not fully specified by the IEEE-754 standard (including the distinction between signaling and quiet NaNs), which can lead to implementation discrepancies between hardware vendors. Thus, ignoring these payloads in Coq is paramount to guarantee the portability of computations and proofs performed with the same Coq script on different processors. Values of type spec_float are not normalized, since no bound is enforced on mantissas and exponents. But in practice, all the operators will make sure that the exponent e is bounded, and that the mantissa m contains 53 bits for normalized numbers, and at most 52 for subnormal numbers, thus ensuring that the number belongs to the binary64 format. This property is denoted by the valid_binary predicate.
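For reference, the ported definition appears in Coq's standard library (module SpecFloat) as the following inductive type, where the S754_finite constructor carries a sign, a mantissa, and an exponent:

Inductive spec_float :=
  | S754_zero (s : bool)
  | S754_infinity (s : bool)
  | S754_nan
  | S754_finite (s : bool) (m : positive) (e : Z).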
Note that this definition actually matches Flocq's original formalization of the IEEE-754 standard [11]. But Flocq was later extended to make it suitable for the semantics used in the CompCert C compiler [12]. Indeed, even if nothing can be specified about NaN payloads, they can still be distinguished by a C program and thus need to exist. As part of this work, the original formalization was added back to Flocq, in the module IEEE754.BinarySingleNaN.
Next, following the same methodology as that of the Uint63 formalization, we have declared an abstract type and some arithmetic operations:

Primitive float := #float64_type.
Primitive add := #float64_add.
(* and so on *)

Here, the Primitive vernacular amounts to introducing Parameters (i.e., axiomatic symbols) that the kernel maps to hardware operations whenever they are fully applied to concrete floating-point values, as explained in Section 3. This is different from the original "retroknowledge" mechanism [13], which would have first defined some concrete data types and operations over them, and then hardcoded some reduction rules for these operations (thus sidestepping their original definitions).
Finally, the computational content of all the operators is axiomatized with respect to reference implementations of the algorithms over the spec_float type. For example, in the case of the addition, we first define a naive implementation SFadd for any precision prec and maximal exponent emax (its body is not reproduced here). Then, an axiom relates the axiomatic symbol add to this reference implementation instantiated for the binary64 format (prec = 53 and emax = 1024), as sketched below.
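The axiom in question can be sketched as follows, in the spirit of Coq's Floats.FloatAxioms module (Prim2SF denotes the conversion from float to spec_float; the exact names may differ slightly):

Definition SF64add := SFadd prec emax.  (* with prec = 53 and emax = 1024 *)
Axiom add_spec :
  forall x y, Prim2SF (x + y)%float = SF64add (Prim2SF x) (Prim2SF y).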
Users wanting to load all floating-point operations and axioms just need to Require Import Floats. In addition, to enjoy decimal literals and notations without explicit quoting, users can Open Scope float_scope.
Second, some comparison functions return a Boolean result: "=?", "<?", "<=?". These comparison functions comply with the IEEE-754 standard, that is, comparing to a NaN always yields false. In addition, there is a generic comparison function "?=" that returns a four-valued result: FEq, FLt, FGt, or FNotComparable (in case one or both inputs are NaN). There is also a classify function that tells whether a number is NaN, zero, infinite, normal, or subnormal, and, except for NaN, what its sign is. Note that there is a lot of redundancy between the comparison functions "=?", "<?", "<=?", and "?=", as most could be emulated using the others. It is difficult to choose which ones are more useful than the others, so we ended up providing hardcoded reduction rules for all of them.
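As a small illustration (a sketch: (0 / 0)%float is one way to build a NaN, and the printed results follow the constructors listed above):

Open Scope float_scope.
Eval vm_compute in (0 / 0 ?= 1).      (* FNotComparable: a NaN compares to nothing *)
Eval vm_compute in (0 / 0 =? 0 / 0).  (* false, per IEEE-754 equality *)
Eval vm_compute in (1 <? 2).          (* true *)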
Finally, there are some dedicated operations to manipulate numbers. The function of_int63 rounds a 63-bit unsigned integer to the nearest floating-point number, with ties breaking to even. Conversely, normfr_mantissa takes a number in [0.5, 1.0) and returns its integral mantissa. There are also functions frshiftexp and ldshiftexp that behave like frexp and ldexp from the standard library of the C language. The first one decomposes a floating-point number into a mantissa (i.e., a floating-point number f such that 0.5 ≤ |f| < 1) and an integer exponent e. The second one is the converse operation, which computes f × 2^e. There is a small peculiarity though, as Coq's hardware integers were unsigned at the time hardware floating-point numbers were implemented (signed integers have since been added). So, the exponent e is represented with a bias, to make it non-negative. The functions next_up and next_down return the successor and predecessor of a given floating-point value.
The reference implementation amounts to about 500 lines of Coq code in the standard library. But let us recall that the material provided in the Floats module is purely algorithmic. It neither axiomatizes nor proves any meaningful floating-point properties and is thus basically useless in isolation.

On Flocq's side
Comprehensive formalizations of floating-point arithmetic exist for several proof assistants, e.g., HOL Light [14] and PVS [15,16]. In the case of Coq, the largest formalization is provided by the Flocq library [7]. A whole hierarchy of formats is provided, ranging from real numbers with bounded mantissas but unbounded exponents to computable numbers with all the floating-point special values: signed zeros, infinities, and NaNs. Along with these formats and the links between them, the library contains many classical results about roundings and error-free transformations.
When verifying properties of floating-point algorithms, two families of formats are commonly encountered:
- Numbers with an unbounded exponent range, i.e., without underflow or overflow. Although unrealistic, this model is attractive for its simplicity and commonly used for error bounds [17].
- Numbers with an exponent range that is only bounded from below, i.e., with underflow but without overflow. This is slightly more realistic, since overflows can often be studied separately, while this is usually much harder for underflows [6].
To make Coq's basic floating-point specification useful, we need to establish a link with one of Flocq's formats, namely the binary_float type. This type is basically a dependent product of spec_float and a proof that the mantissa and exponent are bounded. The Coq theory Flocq.IEEE754.PrimFloat provides two functions, Prim2B : float -> binary_float and B2Prim : binary_float -> float, that convert back and forth between values of Coq's abstract type float and Flocq's concrete type binary_float. These conversion functions act as morphisms, as illustrated by the theorem below.
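The morphism theorem for the addition, as stated (up to minor details) in Flocq.IEEE754.PrimFloat, reads:

Theorem add_equiv :
  forall x y, Prim2B (x + y)%float = Bplus mode_NE (Prim2B x) (Prim2B y).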
In the above theorem, the + operator on the left stands for Coq's hardware float addition, whereas Bplus mode_NE on the right stands for Flocq's addition rounded to nearest, with ties breaking to even, which is the default rounding mode of the IEEE-754 standard.
The main purpose of these morphisms is to give access to Flocq's theorems, which state that floating-point operations amount to rounding the corresponding operations in the real field R, as mandated by the IEEE-754 standard. For instance, Flocq provides a formal proof of a correctness theorem for Bplus, paraphrased below. This theorem mainly states that, if x and y are finite floating-point numbers, the real value of their floating-point sum (Bplus m x y) is exactly the rounding z of the mathematical addition (B2R x + B2R y) of x and y seen as real numbers, assuming the addition did not overflow. The actual theorem in Flocq also gives the sign of the result, which is useful to distinguish +0 and −0, but it is omitted here for the sake of conciseness.
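In Coq terms, the statement can be paraphrased as follows; this is a sketch, not the verbatim statement (in particular, the overflow case and the sign information are elided, and names follow Flocq's BinarySingleNaN module):

Theorem Bplus_correct (m : mode) (x y : binary_float) :
  is_finite x = true -> is_finite y = true ->
  let z := round radix2 fexp (round_mode m) (B2R x + B2R y) in
  Rabs z < bpow radix2 emax ->
  B2R (Bplus m x y) = z /\ is_finite (Bplus m x y) = true.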
By relating the addition of real numbers with the addition of floating-point numbers, this theorem brings confidence in the correctness of the nontrivial bit-level specification of floating-point operations described in Section 2.1, at least for finite inputs. For infinite and NaN inputs, exhaustive testing is achievable [12]. Moreover, this theorem gives access to the extensive set of results proved in the Flocq library. This includes cases where floating-point operations are exact, or where the round-off error is representable as a floating-point number or can be bounded. The latter enables the use of the standard model of floating-point arithmetic to derive bounds on the errors of elaborate expressions or algorithms [6,18]. This will be used in Section 4.1 to perform efficient proofs by reflection, combining floating-point computations with proofs bounding their round-off errors.
Building this link between Flocq's formalization and Coq's specification of hardware floating-point numbers provided the opportunity to add several new functions to the IEEE754 layer of Flocq, such as the Boolean comparisons Beqb, Bltb, and Bleb, which complement the already available and more general Bcompare function. Some other added functions provide ways to precisely craft or destruct floating-point values from their integer mantissa and exponent: Bnormfr_mantissa, Bldexp, and Bfrexp. Of particular interest for interval arithmetic are the predecessor and successor functions Bpred and Bsucc, as well as the unit-in-the-last-place function Bulp. Finally, the two constants Bone and Bmax_float are provided for convenience.
In terms of implementation, all the above changes to Flocq required adding 4900 lines and removing 1600 lines.

Implementation
While the reference implementations described in Section 2.1 can effectively perform floating-point operations, they are excruciatingly slow. So, we want to delegate them to hardware units instead. Section 3.1 shows how the kernel of Coq was extended to make this possible.
To minimize the trusted computing base, only the default rounding direction of the IEEE-754 standard is supported by Coq, as it remains the most portable one. Unfortunately, applications such as interval arithmetic depend on the availability of directed rounding modes. Section 3.2 explains how the predecessor and successor functions can be used instead, and how the performance overhead was mitigated.
While floating-point numbers are usually hidden deep inside proofs of theorems about real numbers, it might happen that the user wants to directly manipulate floating-point numbers. To cover this use case, our implementation also provides some facilities for parsing and printing numbers (Section 3.3).
Finally, this whole work would be moot if the implementation did not match the specification described in Section 2.1. So, Section 3.4 discusses the issue of trust and what has been done to offer the highest level of guarantee.

Reduction engines
As explained in Section 2.1, Coq provides an abstract type float as well as operations over it, and Coq's version of λ-calculus is extended with dedicated reduction rules for these operations. This means that these rules have to be implemented in the software. Unfortunately, Coq supports various engines, each one with its own implementation of the rules of Coq's calculus.
First, there is the conversion engine, which is responsible for checking that two λ-terms are equivalent according to the rules of the calculus. The conversion engine works great in general, but it falls short when performing a proof by computational reflection. Indeed, in that case, the archetypal goal to prove is f(x) = true for some given x. This is a proof by reflexivity, which means that Coq has to check that the term f(x) = true is equivalent to the term true = true. Since true is already in normal form, this amounts to computing the normal form of f(x). But the conversion engine tries hard to never normalize terms, as the size of the normal form explodes in the general case. So, it is unable to handle this use case efficiently.
That is why Coq provides two reduction engines that are solely designed to compute normal forms. The first one, invoked by the vm_compute tactic, compiles the λ-term to bytecode and then evaluates it using an interpreter derived from the bytecode interpreter of the OCaml language [19]. The second one, invoked by the native_compute tactic, follows a similar approach. It turns the λ-term into an OCaml function, compiles it to machine code using the OCaml compiler, and then loads and executes it [20].
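As a toy illustration (unrelated to floating-point numbers), the following reflexivity proof forces the normalization of a closed term, which vm_compute performs instantly:

Goal Nat.even 1000 = true.
Proof. vm_compute. reflexivity. Qed.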
For all three engines, the underlying runtime comes from the OCaml language. As a consequence, we chose to represent floating-point numbers the same way as OCaml, if only to please the garbage collector. Thus, floating-point numbers in Coq are boxed, that is, they are represented by a pointer to an allocated memory cell containing a single floating-point number.
The implementation of the arithmetic operations, however, is not as straightforward, as we cannot just follow OCaml's guidance. Indeed, contrary to standard OCaml functions, Coq λ-terms can contain free variables, which are irreducible. In turn, even if the type of a λ-term is float, it does not mean that its reduced value is a floating-point number; it might be an arbitrary irreducible expression. So, we follow the same approach as in previous works on hardware integer support [9,10]. The implementation first checks whether the inputs are actual floating-point numbers. Thanks to the boxing format, this information is readily available. If the inputs are numbers, the floating-point operation is performed. Otherwise, an irreducible term is built from the inputs and the operation.
This causes a discrepancy with the OCaml runtime. Indeed, while floating-point numbers are usually boxed, the runtime optimizes the representation of floating-point arrays, so that they directly store numbers rather than pointers to boxes. This is problematic for Coq, since irreducible terms of floating-point types are not numbers and thus cannot be stored as unboxed values. The original implementation failed to circumvent this runtime optimization, which led to an inconsistency, as explained in Section 3.4.

Rounding directions
As mentioned above, only rounding to nearest is supported by our implementation. Yet, the IEEE-754 standard specifies some other rounding directions, some of which are especially useful for proofs by computation. Let us illustrate this with interval arithmetic. This arithmetic reliably approximates a real expression x with a pair (x̲, x̄) of floating-point numbers that enclose it [3]. Given some enclosures of u and v, an enclosure of u + v is then represented by the pair (u̲ + v̲, ū + v̄). To make sure this is an actual enclosure, the lower bound u̲ + v̲ is computed using a floating-point addition rounded toward −∞, while the upper bound is rounded toward +∞. Indeed, simply rounding to nearest would be unsound; for instance, 1 ∈ [0; 1] and 2^−80 ∈ [0; 2^−80], whereas the real sum 1 + 2^−80 does not lie in [0 ⊕ 0; 1 ⊕ 2^−80] = [0; 1], denoting by ⊕ the floating-point addition rounded to nearest. A similar approach is used for multiplication, division, and so on.
Unfortunately, while rounding to nearest is readily available in floating-point units, this is not the case for directed rounding. And even if it were, there is no foolproof way of performing floating-point operations with directed rounding in programming languages such as OCaml and C. For instance, as of the writing of this article, the GCC compiler still does not handle dynamically changing the rounding mode in a safe way, let alone an efficient one. So, to ensure portability, we had to stop at rounding to nearest.
But since interval arithmetic is such a ubiquitous paradigm when it comes to proving properties about real numbers, we have provided two more primitives to ease its implementation: predecessor and successor. Indeed, interval arithmetic does not really care whether the bounds of the enclosures are correctly rounded toward −∞ and +∞. If the lower bound is even smaller (or the upper bound larger), the enclosure is still valid, though less tight. In other words, this might cause some proofs to fail, but it cannot lead to unsound results. Thus, rather than rounding u̲ + v̲ toward −∞ to compute the lower bound, an interval library can instead round u̲ + v̲ to nearest and then take its floating-point predecessor, as sketched below.
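A minimal sketch of this scheme (not CoqInterval's actual code), representing an interval by a pair of hardware floating-point numbers:

Require Import Floats.
Open Scope float_scope.

(* Enclosure of a sum: round to nearest, then widen the bounds outward
   using the predecessor and successor primitives. *)
Definition iadd (i1 i2 : float * float) : float * float :=
  let '(l1, u1) := i1 in
  let '(l2, u2) := i2 in
  (next_down (l1 + l2), next_up (u1 + u2)).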
The original implementation of these two primitives was a trivial wrapper over the nextafter function of the C standard library [8]. Unfortunately, neither LLVM nor GCC properly optimizes it, even when the second operand is an infinite constant. So, we replaced nextafter with our own specific implementation, which brought a noticeable speedup.
Still, the performance of the bytecode interpreter (vm_compute) was poor, due to the boxing of floating-point numbers. Indeed, any arithmetic operation would cause two memory allocations in a row: one for the result of the floating-point operation rounded to nearest, and another one for the predecessor or successor. A simple way to fix this would have been to provide variants of all the arithmetic operations composed with either the predecessor or the successor, that is, ten more floating-point primitives. The C implementation of these new primitives would then be able to elide the first allocation.
But adding ten more primitives to Coq just for the sake of one single library felt wasteful. So, we explored a different approach. Instead of eliding the first allocation, we elide the second one in some specific cases. Indeed, if the first allocation is no longer needed once the predecessor/successor has been computed (and thus would be collected by the garbage collector), these functions can reuse it to store their own result. This case can easily be detected by the peephole optimizer of the bytecode compiler. As a consequence, immediately computing the predecessor/successor of a floating-point result is now so cheap that removing the calls to these functions from CoqInterval (which would be unsound) does not bring any noticeable speedup, and thus neither would having some dedicated primitives.

Parsing and printing
Parsing and, particularly, printing floating-point constants in Coq turned out to be a non-trivial point. Coq basically offers two levels for printing terms. The most commonly used level applies library- and user-defined notations to display concise terms that are as readable as possible. At the lower level, terms are displayed in a raw form to ensure that no details are hidden by some notation. The user can switch between the two levels with the Set and Unset Printing All commands. The printing of floating-point constants now fully supports both levels.
As suggested by its name, the binary64 floating-point format, implemented by processors and wired into Coq's hardware floating-point numbers, is a binary format. This means that its finite values are the rational numbers m × 2^e for bounded integers m and e. In the binary64 format, the mantissa m is encoded on 53 bits while the exponent e is encoded on 11 bits. Thus, all finite values can be exactly encoded in the standard hexadecimal format [+-]0x⟨m⟩p[+-]⟨e⟩, where ⟨m⟩ is a hexadecimal encoding of m, spanning at most 14 characters, and ⟨e⟩ a decimal encoding of e, taking at most 4 characters. This is the way floating-point values are displayed as raw terms, offering an exact and compact display, with at most 23 characters.
Unfortunately, this hexadecimal printing lacks readability for humans, who are used to working with decimal representations. It is worth noting that, 10 being a multiple of 2, any binary value can be exactly represented as a decimal one. Indeed, when e ≥ 0, the number m × 2^e is an integer, and when e < 0, we have m × 2^e = (m × 5^−e) × 10^e, which is a decimal value since m × 5^−e is an integer. For the binary64 format, |m| ≤ 2^53 − 1 and −1074 ≤ e ≤ 971, which means that, in the first case, the integer m × 2^e can be represented with at most 309 digits and, in the second case, the integer m × 5^−e can be represented with at most 767 digits. Thus, although an exact decimal representation exists, using up to nearly 800 characters to display numeric constants is utterly impractical. However, it is known that printing values in a decimal format with at least 17 significant digits and implementing parsing as a rounding to nearest guarantees that no information is lost [21, Table 3.16]. More precisely, the following theorem holds:

Theorem 1 (17 digits is enough). For any floating-point value x in the binary floating-point format with 53 bits of precision and gradual underflow at some minimal exponent e_min, and for any x′ obtained by rounding x to nearest with at least 17 significant decimal digits, we have ◦(x′) = x, with ◦ any rounding to nearest in the format of x.
Note that any rounding to nearest x′ of any x ∈ R with 17 significant digits satisfies the above hypothesis. We proved this theorem in Coq. It is in fact a corollary of the following more generic lemma.

Lemma 1. Let F be the floating-point format with gradual underflow, radix β, minimal exponent e_min, and precision p. For any radix β′ and any precision p′ such that β^p < β′^(p′−1), and for any floating-point value x ∈ F and any x′ obtained by rounding x to nearest in radix β′ with p′ significant digits, we have ◦(x′) = x, with ◦ any rounding to nearest in F.

Note that 17 is the smallest integer p′ such that 2^53 < 10^(p′−1). Printing with 17 decimal digits is thus the choice made in the default printing mode. This means that one can verify the following script:

Goal (0.99999999999999999 = 1)%float.
Proof. reflexivity. Qed.

Indeed, the constants 0.99999999999999999 and 1 are the same, as clearly seen when displaying the goal in its raw form: @eq float 0x1p+0%float 0x1p+0%float. To avoid any surprise to the user, a warning is displayed by Coq:

The constant 0.99999999999999999 is not a binary64 floating-point value. A closest value 0x1p+0 will be used and unambiguously printed 1.
Regarding the implementation, as of Coq 8.14, parsing relies on the OCaml function float_of_string, and printing on printf "%.17g" for decimal display and printf "%h" for hexadecimal display. All these OCaml functions are themselves mostly wrappers around the corresponding libc functions. An alternative, which may be preferred in the future, would be to use Coq's Number Notation mechanism, which provides a way to write parsers and printers for numeric constants directly as Coq functions. This would make it possible to prove correctness properties of those parsing and printing functions, at the cost of slower parsing and printing, since interpreting Coq functions is many times slower than executing natively compiled OCaml code. As a nice side effect, such a Coq-implemented parser and printer could provide a generic implementation that would work for formats other than binary64, for instance binary32. Andrew Appel offered a proof of concept of such an implementation.

Soundness
When we originally reported on the implementation of hardware floating-point numbers in Coq [8], we had identified three main potential threats to soundness. We recall them below, along with a discussion of the few soundness bugs that have been discovered since the feature was merged into Coq in November 2019. All known bugs are now fixed. It is worth noting that all these bugs pertain more to the usual implementation mishaps than to fundamental issues and that, barring such implementation bugs, the approach is theoretically sound.

Specification issues
A mismatch with respect to the implementation would break soundness. Of course, such errors in the specification can only be harmful when the offending axioms are used (we recall that all the axioms used in a proved theorem explicitly appear in the result of the Coq command Print Assumptions). So far, two such bugs have been reported and fixed: an incorrect specification of PrimFloat.leb, and an inconsistent classification of zeros. Both bugs were due to some typo in the specification, and it should be noted that the former would now be automatically spotted, thanks to a new warning raised by Coq for unused variables in pattern-matching.
As seen in Section 2.1, our Coq specification happens to be executable (although this can be pretty slow). This allowed us to add to Coq's test-suite some consistency checks between the specification and the implementation. These checks, however, obviously cannot be exhaustive.

Incompatible implementations
If evaluation tactics (native_compute, vm_compute, compute, simpl, hnf, and so on) were to evaluate the same term to different results depending on the hardware, this would lead to a proof of False. In particular, the payload of NaNs is not fully specified by the IEEE-754 standard (different hardware can produce different NaNs for the same computation), so we have chosen to consider all NaNs as equal and not distinguish them. Thus, incompatible bit-level implementations remain compatible at the logical level.
Double roundings due to the legacy x87 unit on old 32-bit architectures could also be harmful [22]. The OCaml compiler systematically relies on this unit when it is available, which led us to implement all floating-point operators in C for 32-bit architectures and to use the appropriate compiler flags. To double-check the absence of double roundings, we have also added a runtime test to prevent Coq from running in case of miscompilation.
As with the specification, some tests of consistency between the various evaluation mechanisms have been added to Coq's test-suite.

Incorrect convertibility test
Distinguishing two values that should not be distinguished, or vice versa, would also be a threat. In particular, implementing the convertibility test using the equality test on floating-point values (as defined in the IEEE-754 standard) would be wrong, as not only NaNs cause issues, but also signed zeros. Indeed, the standard equates −0 and +0, while these two values can be distinguished inside the logic (for instance, by the classify function). Fortunately enough, a very simple implementation is feasible; it amounts to the following OCaml code:

let equal f1 f2 =
  if f1 = f2 then f1 <> 0. || sign f1 = sign f2
  else is_nan f1 && is_nan f2

When f1 and f2 compare equal in the guard, they are either nonzero, and then the same floating-point value, or they can be +0 or −0, which are distinguished by the then branch. Otherwise, f1 and f2 are different floating-point values unless they are both NaNs, since nan = nan is false by the IEEE-754 equality. The else branch thus checks for that case.

Interaction with other primitive types in Coq
This can also be a source of soundness issues. To be more precise, unsigned hardware integers were completely reworked in February 2019 (9 months before floating-point numbers); primitive persistent arrays were added in July 2020; finally, signed hardware integers were added in February 2021. First, the fact that the hardware integers used in the specification of hardware floating-point numbers were unsigned did require some care. Next, the way OCaml optimizes arrays of floating-point values did raise a few issues during development, although it seemed unlikely that such bugs could lead to a proof of False. This nonetheless happened with the introduction of primitive persistent arrays, which were developed concurrently with hardware floating-point numbers, and whose interactions had not been properly tested before then. Lastly, a similar bug involving the OCaml binary representation of floating-point arrays and the native_compute mechanism, but independent from primitive arrays, was discovered and fixed.

Applications
A few Coq libraries rely on intensive floating-point computations to formally prove properties involving real numbers. We have adapted two of them to take advantage of the hardware support for floating-point arithmetic. The first one is ValidSDP [23], which uses floating-point numbers to verify positivity certificates. As detailed in Section 4.1, the adaptation was rather straightforward, since ValidSDP was already formalized in a way that made it robust against underflow and overflow.
This was not the case for the second library, CoqInterval [5], which uses pairs of floating-point numbers to enclose real numbers. Indeed, CoqInterval was performing computations using a floating-point format with unbounded exponents and, in numerous places, the proofs were implicitly relying on some of its non-standard properties. Making this formalization compatible with IEEE-754 floating-point numbers required some large and intrusive changes, as explained in Section 4.2.

ValidSDP
The ValidSDP library implements a reflexive tactic to automatically prove inequalities involving multivariate polynomials with real coefficients [23]. It follows the so-called skeptical (or certifying) approach: it calls external Semi-Definite Programming (SDP) solvers and formally proves the considered polynomial positivity claims by relying on the witnesses provided by the solvers.
The implementation relies on the following theorem, formally verified in Coq, where the constants ε and η bound the errors when rounding to nearest. More precisely, the error of a floating-point operator is bounded by the relative bound ε when its result is a normal value, and by the absolute bound η when it is a subnormal value.

Theorem 2 (Corollary 2.4 in [24]). For A ∈ R^{n×n}, let c be the constant defined from ε, η, and n in [24]. If the Cholesky decomposition of A − c·I_n succeeds with floating-point numbers (i.e., without taking the square root of a negative value), then the Cholesky decomposition of A with real numbers would succeed as well, which implies that A is positive definite.
The ValidSDP tactic reifies the positivity goal to obtain a multivariate polynomial p, then calls an SDP solver to obtain a matrix A such that, if A is positive definite, then p ≥ 0. The above theorem is then used to check that A is positive definite [23].
The implementation thus requires a floating-point kernel and, unlike for the proof goals discharged by the CoqInterval tactic, the accuracy of the floating-point operators involved matters, as shown by the use of ε and η in Theorem 2. This floating-point kernel was initially the emulated one of CoqInterval, providing an effective implementation of floating-point operators, using rounding to nearest most of the time and, occasionally, rounding toward +∞.
The ValidSDP formalization gathers several slightly different definitions of the type of floating-point numbers, seen either as a subtype of the real numbers R (an axiomatized type from Coq's standard library) or as a subtype of the rational numbers (a concrete type allowing for effective computation):
- FS_of: a dependent record gathering every v ∈ R satisfying format(v), for a given function format of type R → bool.
- Float_spec: a dependent record gathering every format satisfying the standard floating-point model without overflow. It includes a rounding operator ◦ : R → FS_of format and the bounds ε ∈ R and η ∈ R. For instance, the axiom for the multiplication reads: ∀x, y : FS_of format, ∃d, ∃e, |d| ≤ ε ∧ |e| ≤ η ∧ ◦(x × y) = (x × y)(1 + d) + e.
- Float_infnan_spec: a dependent record gathering an instance of the previous structure, a concrete type FIS (which can contain NaNs and infinities), a projection FIS2FS from FIS to FS_of format, a Boolean predicate finite that holds for finite floating-point numbers of FIS, and concrete arithmetic operators over FIS with their specification, linking them to the axiomatized operators for finite values. Thus, this structure formalizes overflow handling as well as effective computation.
- Float_round_up_infnan_spec: a dependent record gathering an instance of the previous structure, along with fieps and fieta (overapproximations of ε and η), and addition, multiplication, and division with upward rounding.
Once these structures are defined, the following instances are provided:
- flx64: an instance of Float_spec with η = 0, matching Flocq's FLX (prec = 53) format, corresponding to 53-bit floating-point numbers without NaNs, overflows, or underflows.
- binary64_infnan: an instance of Float_infnan_spec extending binary64 with an effective implementation taken from Flocq's IEEE754.BinarySingleNaN.
- coqinterval_infnan: an instance of Float_infnan_spec extending flx64 with an efficient implementation relying on CoqInterval's Float.Specific_ops.
- coqinterval_round_up_infnan: an instance of Float_round_up_infnan_spec, also relying on CoqInterval.
It happens that the already formalized structures were precise enough to account for underflows, overflows, and NaNs. This made the porting to hardware floating-point numbers smooth, since we just needed to implement the following two instances, without altering the Float_*_spec structures:
- primitive_floats_infnan: an instance of Float_infnan_spec extending binary64 with an efficient implementation relying on Coq.Floats.Floats.
- primitive_float_round_up_infnan: an instance of Float_round_up_infnan_spec, also relying on Coq.Floats.Floats.

CoqInterval
Since version 4.0, the CoqInterval library uses hardware floating-point numbers by default. This implementation of interval arithmetic lives alongside the original implementation, which emulates floating-point numbers using arbitrary-precision integers bigZ from the Bignums library [25]. Two such integers are used, one for the mantissa and the other for the exponent. A special value NaN is provided, but infinities are not (thanks to the unbounded exponent, neither overflow nor underflow is possible). Every operation is performed at a given target precision, i.e., number of digits in the resulting mantissa. Let us consider addition as an illustration. The original interface of the operation was as follows in the FloatOps module type.
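Its declaration can be reconstructed as the following sketch (the original listing is not reproduced here):

Parameter add : rounding_mode -> precision -> type -> type -> type.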
The argument of type rounding_mode is chosen among four rounding modes: rnd_UP, rnd_DN, rnd_ZR, and rnd_NE. In the same module type, the addition was equipped with an exact specification:

Parameter add_correct : ∀ mode p x y,
  toX (add mode p x y) = Xround radix mode (prec p) (Xadd (toX x) (toX y)).
The function toX is an injection from the type type of floating-point numbers into a type of real numbers extended with an exceptional value Xnan.
Interval arithmetic relies mostly on the rnd_DN and rnd_UP rounding modes to compute respectively the lower and upper bounds of intervals. The exact rounding of the bounds, however, is not required for correctness. Thus, as explained in Section 3.2, we can instead round to nearest and then pick the predecessor (next_down) or successor (next_up) of floating-point numbers to approximate directed rounding. The specification add_correct, however, no longer holds, as the results might be off by one. So, the add function above has been dropped and replaced in the FloatOps module type by two separate functions:

Parameter add_UP : precision -> type -> type -> type.
Parameter add_DN : precision -> type -> type -> type.
The precision argument has been kept, as it still governs the operations on emulated floating-point numbers. But for hardware floating-point numbers, it is ignored and always assumed to be 53.
We now need to specify those operators. The extended real numbers R ∪ {Xnan} are kept for the specification, in order to minimize changes. Thus, toX maps −∞ and +∞ to Xnan, which remains interpreted as −∞ or +∞, depending on whether it is used as the lower or upper bound of an interval. One would like to write the straightforward specification below, where le_upper is essentially a "less than or equal to" operator on extended reals.
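That is, something along the lines of the following sketch:

Parameter add_UP_correct :
  forall p x y,
  le_upper (Xadd (toX x) (toX y)) (toX (add_UP p x y)).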
However, add_UP (−∞) (−∞) = −Ω (with Ω the largest finite floating-point number) is not an upper bound of Xadd (toX (−∞)) (toX (−∞)) = Xadd Xnan Xnan = Xnan, which plays the role of +∞ when used as an upper bound. We thus need to ensure that add_UP is never called on −∞ (and that add_DN is never called on +∞). Under this assumption, one can specify, and later prove, that add_UP yields an upper bound. We take the opportunity to specify that it never returns −∞, and the specification finally reads as follows.
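A sketch of the resulting specification, assuming a Boolean predicate valid_ub:

Parameter add_UP_correct :
  forall p x y, valid_ub x = true -> valid_ub y = true ->
  valid_ub (add_UP p x y) = true
  /\ le_upper (Xadd (toX x) (toX y)) (toX (add_UP p x y)).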
The predicate valid_ub x means that x can be used as the upper bound of an interval, i.e., that it is not −∞. The other operators (sub, mul, div, and sqrt) receive a similar specification. Adapting the implementations of this FloatOps module type required some easy but tedious work, as many proofs needed to be updated.
In order to guarantee that add_UP is never called on −∞, one could test valid_ub before every call, but that would have a performance impact. Fortunately, add_UP is only used to compute upper bounds of intervals (and, symmetrically, add_DN is used for lower bounds). So, we just need to forbid "ill-formed" intervals [+∞; x] or [x; −∞]. This could be done using dependent types, by adding to every interval a proof that it is well formed. Instead, we chose to modify the interpretation of intervals: originally, Inan contained all the extended real numbers, including Xnan, whereas Ibnd l u encoded the interval [l; u] ⊆ R. With the new interpretation, correctness theorems are vacuously true when ill-formed intervals are passed to interval functions. Although this required many modifications to the FloatInterval module implementing the basic interval operators, only minimal changes had to be applied in the higher layers of the library.
Finally, a few other adaptations were needed, as some functions were incompatible with overflows, or simply with the fixed precision of hardware floating-point numbers:
- Function fromZ was originally an exact injection from Z into floating-point numbers. This is no longer possible, so most of its uses have been replaced with calls to the new functions fromZ_DN and fromZ_UP. The original function fromZ is still available, so as to easily compute constants in algorithms, but it is now specified only on the small interval [−256; 256].
- The scale function was specified as performing exact multiplications or divisions by powers of 2, since neither overflow nor underflow could happen with unbounded exponents. It was mostly used in heuristics, so the function was kept with a best-effort implementation and without any specification.
- A div2 function was added for the few cases where exact divisions by 2 were needed, but it is specified only for inputs larger than 1/256.
- A midpoint function was added, such that midpoint x y returns a value in the interval [x; y]. In practice, the result is close to the middle of the interval, but this property is left unspecified as it is not used in correctness proofs.
- Exact operations (addition, subtraction, and multiplication), using as much precision as needed to exactly encode their result, were used mostly to compute some constants (typically integers). So, the latter were replaced with tight intervals containing the constants.
With this new, coarser FloatOps interface, it was relatively easy to implement hardware floating-point numbers in CoqInterval. Regarding the implementation, 9700 lines were added to the library (including 3300 for the hardware-based implementation of the FloatOps interface), while 4300 were removed.
The use of hardware floating-point numbers provides a speedup of about one order of magnitude, as illustrated by proofs of the following shape:

From Coq Require Import Reals.
From Coquelicot Require Import Coquelicot.
From Interval Require Import Tactic.
(* ... goal and call to the interval tactic ... *)

All the timings below were measured on a machine clocked at 3.60 GHz, with 16 GB of RAM. All the benchmarks have been executed sequentially (namely, without the -j option of make).
Our benchmarks rely on a large set of dependencies that can take about an hour to compile. For greater convenience, we devised some Docker images containing the benchmark environments, based on Debian 11, opam 2 (the OCaml package manager), and OCaml 4.07.1. The source code of all the benchmarks, as well as guidelines to run them, is gathered at the following URL: https://github.com/validsdp/benchs-primitive-floats

We first run the posdef_check tactic on ValidSDP's test-suite and compare its execution time between emulated and hardware floating-point numbers. The results are displayed in Table 1 for vm_compute and in Table 2 for native_compute.
Notice that the measured speedups are far from the three orders of magnitude separating emulated floating-point operations from equivalent OCaml implementations. From the above results, it appears that arithmetic operators constitute most of the computation time when using emulated numbers (at least 95% with vm_compute), but nothing tells us this is still the case with hardware numbers. In fact, with hardware numbers, most of the computation time is spent on manipulating lists, as our matrices are implemented using them [26]. This could be improved using primitive "persistent arrays", now that they have been integrated into Coq.
To get an idea of the time actually devoted to floating-point arithmetic in the total proof time of our reflexive tactic, we use the following simple methodology: replace every arithmetic operator over emulated floating-point numbers with a version that uselessly performs the computation twice, then measure the execution time of both the original program (denoted "Op" in Table 3) and the modified program (denoted "Op×2"). The difference between the two timings gives the cost of performing only the arithmetic operations. In the case of hardware operations, they are performed 1001 times instead, as the difference would otherwise be of the same order of magnitude as the variability of the measured timings.
Note that the redundant computations involved in the modified program are not implemented with a mere additional let-in such as

let m1 := mul a b in
let m2 := mul a b in
m2

because both Coq's bytecode compiler and the OCaml native compiler optimize away the unused local definition. Fortunately, this doubling trick can be enabled by going through an extra function call select m1 m2, with Definition select (a b : F.type) := a. The results are given in Table 3 for vm_compute and native_compute. The tested operations are addition and multiplication, as they constitute the vast majority of the arithmetic computations performed during a Cholesky decomposition.
It is worth noting that these speedups should be taken as coarse orders of magnitude rather than precise measurements. Indeed, the time difference "Op×N − Op" also includes the cost of the duplication machinery itself, i.e., some extra function calls such as select. As a consequence, the speedups are presumably largely underestimated, as this extra cost is far from negligible in the case of hardware floating-point arithmetic ("Op×1001 − Op").

The second benchmark comprises 101 mathematical properties verified using the CoqInterval library. Fig. 1 shows that, except for the shortest examples, hardware floating-point numbers usually offer a 10× to 20× speedup over emulated computations when the latter are performed at the corresponding precision, i.e., 53 bits.
Earlier versions of CoqInterval, however, were using a default precision of 30 bits, which is sufficient for a large number of examples from the benchmark. As shown in Fig. 2, this made proofs up to twice as fast in that case, but nowhere near the speedup achieved using hardware floating-point numbers.
Fig. 3 shows that using the native_compute reduction mechanism instead of vm_compute can bring a 3× speedup, but only for the longer examples. Indeed, native_compute performs an invocation of the OCaml compiler, which incurs a systematic latency. In particular, hardware floating-point numbers bring such a large speedup over emulated ones that using native_compute is often detrimental, except for a handful of examples, as shown in Fig. 4.
Finally, Fig. 5 shows that the 10× to 20× speedup from hardware floating-point numbers observed in Fig. 1 also holds when using native_compute instead of vm_compute, but only for the longest examples. For shorter examples, the native OCaml compilation dominates the timings, so any speedup brought by hardware floating-point numbers goes unnoticed.
In addition to the figures described above, Appendix A provides the raw data for all the timings involved.


Conclusion
Most processors nowadays come with a floating-point unit whose semantics is specified by the IEEE-754 standard [1], so emulating floating-point computations inside the logic of a proof assistant is a waste of computational resources. The same motivation had already led to delegating arithmetic on 31-bit integers (and later 63-bit integers) to hardware units [9]. This work follows a similar approach for floating-point computations: the three conversion/reduction engines of Coq have been extended, so as to use the processor whenever floating-point inputs are not open terms.
While the approach is similar on the implementation side, there is a large difference on the specification side. Indeed, while both integer and floating-point computations are axiomatized using their operational semantics, floating-point arithmetic is so peculiar that one should not blindly believe that the semantics expressed in the logic of Coq matches the behavior of the floating-point hardware. To restore trust in the formal system, this operational semantics has been proved equivalent to that of the Flocq library, which had already been proved to comply with the IEEE-754 standard [7,12]. But, as usual with any software implementation, a few bugs were introduced along the way. Those are now fixed and, barring such implementation bugs, the approach is theoretically sound and does not allow any incorrect proof.
Since the IEEE-754 standard relates floating-point computations to infinitely precise ones, i.e., real numbers, the theorems from Flocq make it easy to use hardware floating-point numbers to formally prove properties on real numbers. There are two main ways to do so. One is to formalize a careful error analysis of floating-point computations, as in the ValidSDP library [6]. The other is to use directed rounding, as in the CoqInterval library [5]. Our work accommodates both approaches. But in the latter case, directed rounding is only approximate, which has forced us to rewrite large parts of CoqInterval to enable the use of hardware floating-point numbers.
Thanks to this work, some proofs by computational reflection have been sped up by a factor of 10. While they would previously necessarily have been run offline, some of them can now be performed in an interactive setting. This makes for a much friendlier user experience, as tactics fail earlier. This speedup comes at the expense of a larger trusted computing base, but the opposite could also be said of this work. Indeed, the speedup is so large that the point of the native_compute mechanism becomes largely moot for this use case, thus making it possible to greatly reduce the trusted computing base.
Hardware floating-point arithmetic is not a panacea though, as the numbers are constrained to a bounded range of exponents and, more importantly, a precision of 53 bits. Whenever a higher precision is needed, the tactics provided by, e.g., CoqInterval have to fall back to emulating floating-point arithmetic. But this problem is nothing new, and whichever solution the scientific computing community comes up with, we hope it can be adapted to a proof assistant.

Benchmark with ValidSDP 1.0.1 and Coq 8.15

Fig. 1 Comparison of proof times between hardware and emulated 53-bit floating-point arithmetic using vm_compute. The graph uses a log-log scale, so diagonals represent equivalent speedups. Out of the 101 examples, 4 proofs fail with hardware numbers due to the pessimistic outward rounding, as explained at the end of Section 4.2. The corresponding points appear at the top of the graph.

Fig. 4 Comparison of proof times for hardware floating-point arithmetic between vm_compute and native_compute. The 4 proofs that fail in both cases appear as a cluster at the top right of the graph.

Fig. 5 Comparison of proof times between hardware and 53-bit emulated floating-point arithmetic using native_compute. The 4 proofs that fail with hardware numbers appear as a cluster at the top of the graph.

Table 1
Proof time for the reflexive tactic posdef_check using vm_compute. Every test is run 5 times. The table indicates the average and relative variability among the timings of the 5 runs.

Table 2
Proof time for the reflexive tactic posdef_check using native_compute. Every test is run 5 times. The table indicates the average and relative variability among the timings of the 5 runs.

Table 3
Computation time for individual operations, obtained by subtracting the CPU time of a normal execution from that of a modified execution where the specified operation is performed twice (resp. 1001 times). Every test is run 5 times. The table indicates the average and relative variability among the timings of the 5 runs.