Models for Storage in Database Backends

This paper describes ongoing work on developing a formal specification of a database backend. We present the formalisation of the expected behaviour of a basic transactional system that calls into a simple store API, and instantiate it in two semantic models. The first one is a map-based, classical versioned key-value store; the second one, journal-based, appends individual transaction effects to a journal. We formalise a significant part of the specification in the Coq proof assistant. This work will form the basis for a formalisation of a full-fledged backend store with features such as caching or write-ahead logging, as variations on maps and journals.


Introduction
A database system manages a collection of digital data. An essential component is the backend, which is in charge of recording the data into some memory or store. Although conceptually simple at a high level, actual backends are complex, due to the demands for fast response, high volume, limited footprint, concurrency, distribution, and reliability. For instance, the open-source RocksDB comprises 350+ kLOC and Redis is approximately 200 kLOC [4,10]. Any such complex software has bugs; and database backend bugs are critical, possibly violating data integrity or security [6,7].
Formal methods have the potential to avoid such bugs, but, given the complexity of a modern backend, fully specifying all the moving pieces is a daunting task. This paper reports on an incremental approach to the rigorous and modular development of such a backend towards an implementation. To this end, we formalise the semantics of atomic transactions above a versioned key-value store; this high-level specification helps to reason about correctness, both informally and formally with the Coq proof tool. Although this paper focuses on a highly-available transaction model (convergent causal consistency or TCC+, a variant of PSI [12]), our results generalise to stronger models such as SI or strong serialisability. The transaction model appeals to a store's specialised book-keeping operations (called doBegin, doUpdate, doCommit, and lookup), implicitly assuming infinite memory and no failures.
Next, we instantiate these semantics with two models of the store. The first variant is a classical map-based, versioned key-value store. As a transaction executes, it eagerly computes new versions, which it copies into a map upon commit, labelled with the transaction's commit timestamp; reading a key searches the map for the most recent corresponding version. We plan to mechanise the proof that the map-based model satisfies the transactional specification, i.e., that in any reachable state, a call to read returns the value expected by the semantics for any key and timestamp pair.
Our second variant uses a journal (or log). A transaction appends individual effects to the journal, tagged with the transaction identifier; committing appends a commit record, sealing it with its commit timestamp. Reading from the journal applies all the relevant effects previously recorded in the journal. Again, we plan to prove mechanically that the journal-based store satisfies the abstract specification.
In this paper, we summarize our work-in-progress on the formal models for the journal- and map-based stores. We define a common interface and show how these stores can be employed in a transactional storage system. Further, we sketch their implementation and the reasoning about their correctness in Coq.
The models presented here lack fault-tolerance and essential features such as sharding, caching, write-ahead logging, etc., which are required for state-of-the-art performance. Our hypothesis for future work is that such features can be described and implemented by composing instances of these basic variants.

System Model and Terminology
Next, we present an informal, high-level overview of the system model and terminology. Table 1 gives an overview of our notation.

Stores and transactions.
At the core of the model is an abstract mutable shared memory, called a store, σ ∈ Σ. A store follows the common API shown in Figure 1. Method lookup returns the value that store σ associates with key k at time t; an absent mapping returns ⊥ (i.e., initially, every key maps to ⊥). Update method doUpdate applies a new effect δ to that key's entry in the store, in order to update the value (see below). Successfully invoking doCommit makes the updates of the current transaction visible with a commit timestamp noted ct.
Updating a key k under timestamp t creates a new version mapped at index (k, t). A mapping is write-once, and remains valid until the next mapping, if any. For example, suppose store σ updates a version of key k at time t = 100 with an assignment of 27. Then, lookup(σ, k, 101) should return 27. If there are no other versions between 100 and 110, lookup(σ, k, 111) should also return 27. If the next mapping is at timestamp 120, to incr10, then lookup(σ, k, 121) should return 37.
A client (left unspecified by our model) accesses the store in the context of a transaction, a sequence of begin, lookup, update and commit/abort actions. A transaction reads from a consistent snapshot of the store, and makes its effects visible in the store by committing the transaction atomically (all-or-nothing). We defer a more detailed discussion of transactions to Section 3.
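The lookup example above can be replayed with a small sketch. It is not the paper's formal model: it assumes integer, totally ordered timestamps (the model only requires a partial order), uses `None` for ⊥, and the names `assign`, `incr`, and the dictionary layout of `store` are illustrative choices.

```python
from typing import Callable, Dict, List, Optional, Tuple

Value = Optional[int]                  # None plays the role of ⊥
Effect = Callable[[Value], Value]


def assign(v: int) -> Effect:
    """Assignment effect: ignores the previous value."""
    return lambda _old: v


def incr(n: int) -> Effect:
    """Increment effect; assumed to leave ⊥ unchanged."""
    return lambda old: None if old is None else old + n


def lookup(store: Dict[str, List[Tuple[int, Effect]]], key: str, t: int) -> Value:
    """Apply, in timestamp order, every effect on `key` versioned strictly before t."""
    value: Value = None
    for version, effect in sorted(store.get(key, []), key=lambda entry: entry[0]):
        if version < t:
            value = effect(value)
    return value


# Assignment of 27 at time 100, then incr10 at time 120.
store = {"k": [(100, assign(27)), (120, incr(10))]}
print(lookup(store, "k", 101), lookup(store, "k", 111), lookup(store, "k", 121))
```

A lookup at exactly time 100 yields ⊥ in this sketch, because only versions strictly before the lookup timestamp are considered.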
Keys, values and timestamps.
Keys, values, and timestamps are opaque types. Keys compare for equality only. Timestamps are partially ordered by ≤; we say that two timestamps t1 and t2 are concurrent if they are not ordered, i.e., neither t1 ≤ t2 nor t2 ≤ t1. Note that we do not assume a global clock. An ordered timestamp pair (OTSP) is of the form (d, v) ∈ Timestamp × Timestamp, where d ≤ v; d and v are called dependence and version respectively. We define a strict partial order relation ≺_OTSP over OTSPs, as follows: (d1, v1) ≺_OTSP (d2, v2) if and only if v1 < d2. Two OTSPs are concurrent if they cannot be ordered by ≺_OTSP, i.e., neither precedes the other.

Effects.
Classically, an update simply assigns a new value to the key, as in k ≔ 27. Such an assignment creates a new version of k with value 27.
Many recent stores [2,3,11] support a more general concept of update, which we call effect. Applying effect δ to a current value v computes a new value δ(v). For instance, applying incr10 to the value 27 yields 37. If a sequence of updates with effects δ, δ′, ..., δ″ has been applied to key k, then a store is expected to return value (δ ⊙ δ′ ⊙ ... ⊙ δ″)(⊥) when queried. We say a sequence of effects is proper if it starts with an assignment and therefore evaluates to a value, or ⊥, when applied to ⊥; i.e., it does not depend on any preceding effects. We assume that every history forms a proper sequence.
An assignment masks any previous effects to the same key: ∀ δ, v: (δ ⊙ assign v) = assign v; therefore effects that precede the last assignment in a sequence can be safely ignored. Conversely, any proper sequence is equivalent to a single assignment. For instance, assign 27 ⊙ incr10 = assign 37. This justifies checkpointing a proper sequence into a single assignment. Figure 2 summarizes the rules of effect composition.
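The composition rules stated above (masking by assignment, and collapsing a proper sequence into one assignment) can be illustrated with a minimal sketch. It covers only two effect kinds, integer assignment and increment; the class names `Assign` and `Incr` and the functions `compose` and `apply_effect` are hypothetical, and the choice that an increment leaves ⊥ (here `None`) unchanged is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Assign:
    value: int


@dataclass(frozen=True)
class Incr:
    amount: int


Effect = Union[Assign, Incr]


def compose(first: Effect, second: Effect) -> Effect:
    """first ⊙ second: one effect equivalent to applying `first`, then `second`."""
    if isinstance(second, Assign):
        return second                                  # an assignment masks what precedes it
    if isinstance(first, Assign):
        return Assign(first.value + second.amount)     # checkpoint into a single assignment
    return Incr(first.amount + second.amount)


def apply_effect(effect: Effect, value: Optional[int]) -> Optional[int]:
    """Apply an effect to a value; None stands for ⊥."""
    if isinstance(effect, Assign):
        return effect.value
    return None if value is None else value + effect.amount


print(compose(Assign(27), Incr(10)))   # the proper sequence collapses to one assignment
```

Because `compose` is associative on these two effect kinds, a store is free to checkpoint any prefix of a proper sequence without changing the value that a later lookup returns.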
Visibility and concurrent effects.
Effects are ordered by the visibility relation δ ≺ δ′ (read "δ is visible to δ′"), defined as follows:
• δ ≺ δ′ if both belong to the same transaction, and δ is before δ′.
• δ ≺ δ′ if they belong to different transactions, τ and τ′ respectively, where τ has committed, and τ is before τ′ in OTSP order, i.e., τ.ct < τ′.st.
Visibility is a strict partial order. Two effects are concurrent if they are not mutually ordered by visibility. Some data types support concurrent effects thanks to a merge operator on effects. To ensure convergence, the merge operator is required to be commutative, associative, and idempotent (CAI) [11].
In the presence of concurrent effects, the value expected of key k results from applying, from the initial ⊥, the visible effects related to k, in visibility order, while merging concurrent effects.
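The cross-transaction case of visibility can be sketched as a predicate over transaction timestamps. This is an illustration only: timestamps are integers here (so timestamps themselves are totally ordered, while visibility between transactions remains partial), and the `Txn` record with fields `st` and `ct` is a hypothetical stand-in for the transaction descriptor introduced later.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Txn:
    name: str
    st: int                 # snapshot timestamp
    ct: Optional[int]       # commit timestamp; None while running or aborted


def visible(a: Txn, b: Txn) -> bool:
    """Effects of `a` are visible to `b`: `a` committed strictly before b's snapshot."""
    return a.ct is not None and a.ct < b.st


def concurrent(a: Txn, b: Txn) -> bool:
    """Neither transaction's effects are visible to the other."""
    return not visible(a, b) and not visible(b, a)


t1 = Txn("t1", st=0, ct=5)
t2 = Txn("t2", st=3, ct=8)   # started before t1 committed
t3 = Txn("t3", st=6, ct=9)   # started after t1 committed, before t2 committed
print(visible(t1, t3), concurrent(t1, t2), concurrent(t2, t3))
```

In this example `t3` sees the effects of `t1` but not those of `t2`, so a later reader that sees both `t2` and `t3` must merge their effects.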

Concurrent data types.
As an aside, note that classical sequential data types generally disallow concurrent updates, leaving merge undefined. These data types require a strong consistency model, where updates occur in some serial order.
Data types that merge concurrent effects do exist [11]. There are also data types designed with non-assignment effects [1].
To provide the CAI properties, the implementation of an effect typically needs to carry metadata, e.g., to provide idempotence or determine causal relationships between updates. For example, the classical last-writer-wins approach supports concurrency by merging concurrent assignments under some deterministic total order (e.g., timestamp order), and retaining only the one with the highest timestamp. Another example is a counter supporting concurrent increment and decrement effects, which uses a vector of sets of effects, with one entry per (concurrent) client [11]. This representation ensures that a given increment or decrement is applied only once.
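As an illustration of the last-writer-wins approach, the sketch below merges two tagged assignments by keeping the greater one under a deterministic total order. The `TaggedAssign` record, its `client` tie-breaker field, and the use of the value as a final tie-breaker are assumptions made so that the order is total; they are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TaggedAssign:
    timestamp: int   # metadata carried by the effect
    client: str      # hypothetical tie-breaker between equal timestamps
    value: int


def merge(a: TaggedAssign, b: TaggedAssign) -> TaggedAssign:
    """Last-writer-wins: keep the assignment that is greatest in the total order."""
    return max(a, b, key=lambda e: (e.timestamp, e.client, e.value))


samples = [
    TaggedAssign(1, "a", 10),
    TaggedAssign(1, "b", 20),
    TaggedAssign(2, "a", 30),
]

# Check the CAI properties exhaustively on the samples.
for x, y, z in product(samples, repeat=3):
    assert merge(x, y) == merge(y, x)                          # commutative
    assert merge(merge(x, y), z) == merge(x, merge(y, z))      # associative
    assert merge(x, x) == x                                    # idempotent
```

Taking the maximum of a total order is what makes the three properties hold; a merge that depended on argument order would break convergence.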

Semantics of Transactions
A transaction τ is a sequence of effects. We associate to a transaction its transaction descriptor (τ, st, R, W, B, ct). It reads from a snapshot timestamped by snapshot timestamp st. Its write set W lists the keys that it dirtied, i.e., modified. It may commit with a commit timestamp noted ct.
However, the effects of a running transaction are not visible from outside (isolation). Within a transaction, an effect is visible to any other effect that executes after it (the "read-your-own-writes" property [14]). The semantics formalise this by staging effects to an effect buffer B. A transaction terminates in an all-or-nothing manner, by either an abort that discards its effect buffer, or by a commit that makes all its effects visible to later transactions at once. Atomicity is formalised by assigning the same, unique, commit timestamp to all its effects. The commit timestamp of a running or aborted transaction is irrelevant and can be arbitrary, marked by _.
Every first read of some key comes from the same snapshot, identified by its snapshot timestamp noted st. Transactions that are visible in the snapshot are those that committed strictly before its snapshot timestamp.
For interested readers, we provide a small-step operational semantics of transactions in the Appendix in Figure 6. Figure 3 illustrates these semantics with an example. The history (Figure 3a) shows a sequence of (atomic) transactional steps; the steps for concurrently executed transactions, like τ1 and τ2, are interleaved. For simplicity, the example updates a single key of integer type. Figure 3b visualizes the transactions in a dependency graph.

Store Models
This section discusses two basic variants implementing the general store API. We aim to model their most essential, primitive properties, abstracting away as much complexity as possible.

Map-based store semantics
The map-based store models a classic versioned key-value store as a random-access map, located either in memory or on disk. It is restricted to contain only values, which (in our model) are represented as assignment effects. Versions of a key are distinguished by their version timestamps. Such a store maps a (key, version timestamp) pair to an (assignment effect, dependency timestamp) pair. Figure 4 summarises its semantics. A map store defers its updates to commit time, and committing atomically copies the transaction's effect buffer into a new version of the corresponding keys. lookup searches for the most recent assign effect directly from the map; both doBegin and doUpdate are no-ops. Figure 3c illustrates the contents of a map store, after the history in Figure 3a.
In more detail, mapping σ[k, v] = (δ, d) associates a versioned key (k, v) with a dependent effect (δ, d). Here, v is a version timestamp, and δ is an assignment (a map store does not support non-assignment effects) associated with metadata d, called dependence timestamp, where d ≤ v. Versions are ordered by their OTSPs (d, v), i.e., σ[k, v1] ≺_MS σ[k, v2] if and only if (d1, v1) ≺_OTSP (d2, v2).
When a transaction commits, method doCommit of a map store eagerly creates a new version for each key modified by the transaction. The version identifier is the transaction's commit timestamp ct, and it is associated with metadata st, the transaction's snapshot timestamp. It returns a store unchanged except for the new versions.
The versions of k visible from the current transaction are the set V = { σ[k, v] | v < t }. Since a map contains only assignments, lookup(σ, k, t) can omit all but the most recent one in this set in visibility order, noted max_≺MS(V). To determine the returned value, any concurrent effects are merged (as explained in Section 2), and the resulting assignment effect is then applied to ⊥ to obtain a value.
The map store defers updates to commit time; therefore doUpdate leaves the store unchanged.
In practice, many existing database backends contain an in-memory map store, for simplicity and fast reads. To persist a map store, it suffices to write it to disk periodically; however, such a large write can be slow and is not natively crash-atomic.
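The map store described in this section can be approximated by the following sketch. It is not the Coq model: timestamps are integers, assignments are stored as plain values, the `merge` argument is a caller-supplied placeholder for the data type's merge operator, and all method names in snake case are illustrative.

```python
from typing import Callable, Dict, Hashable, List, Optional, Tuple


class MapStore:
    """Sketch of a versioned map: (key, version ct) -> (assigned value, dependence st)."""

    def __init__(self, merge: Callable[[List[int]], int]) -> None:
        self.merge = merge   # assumed merge operator for concurrent assignments
        self.versions: Dict[Tuple[Hashable, int], Tuple[int, int]] = {}

    def do_begin(self, st: int) -> None:
        pass                 # no-op in a map store

    def do_update(self, key: Hashable, value: int) -> None:
        pass                 # updates are deferred to commit time

    def do_commit(self, buffer: Dict[Hashable, int], st: int, ct: int) -> None:
        """Copy the effect buffer into new versions labelled with ct."""
        for key, value in buffer.items():
            assert (key, ct) not in self.versions, "a mapping is write-once"
            self.versions[(key, ct)] = (value, st)

    def lookup(self, key: Hashable, t: int) -> Optional[int]:
        """Merge the most recent visible versions of `key`; None stands for ⊥."""
        visible = [(st, ct, value)
                   for (k, ct), (value, st) in self.versions.items()
                   if k == key and ct < t]
        # A version is maximal if no other visible version depends on it (ct < other st).
        maximal = [a for a in visible if not any(a[1] < b[0] for b in visible)]
        if not maximal:
            return None
        return self.merge([value for _, _, value in maximal])


store = MapStore(merge=max)            # max is an arbitrary illustrative merge
store.do_commit({"x": 1}, st=0, ct=2)
store.do_commit({"x": 10}, st=3, ct=5)
store.do_commit({"x": 20}, st=3, ct=6)  # concurrent with the previous commit
print(store.lookup("x", 4), store.lookup("x", 6), store.lookup("x", 11))
```

The last lookup sees two concurrent versions (committed at 5 and 6, both with snapshot 3), which is the case where the dependence timestamp stored alongside each version is needed.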

Journal-based store semantics
An alternative store variant is the journal-based store, which logs its updates incrementally to a sequential file. This design is optimised for fast disk writes, and has good crash-tolerance properties. It is also friendly to non-assignment effects. However, looking up the value of a key can be slow, as its semantics is to read the journal and apply effects sequentially.
Figure 5 gives the formal semantics of a journal store. A journal store is a finite sequence σ = [r1, r2, ...] of records of type beginTxn, update and commit, initially empty. Function doBegin appends a beginTxn record with transaction identifier τ and snapshot timestamp st. doUpdate appends an update record that contains transaction identifier τ, key k, and effect δ. Similarly, doCommit appends a commit record containing the transaction identifier τ, snapshot timestamp st and commit timestamp ct.
The real action is in lookup(σ, k, t), which accumulates the effects to key k that committed strictly before t. Formalising the procedure is somewhat complex.
• Procedure poststate(r, k) computes the state of key k after a record r in journal σ takes effect. Records take effect in ≺_JS order.
• In ≺_JS order, a record of type beginTxn has any number of immediate predecessors; other types of records have a single one.
• The poststate of an update record with key k and effect δ is computed by taking the poststate of its immediately-preceding record, and applying δ.
• The poststate of a beginTxn is the merge of the poststates of its immediate predecessors.
• Otherwise, the poststate is the same as that of the immediate predecessor; i.e., updates for other keys are ignored, as well as commit records.
Note that a poststate can be computed in a single left-to-right pass over the journal, because a commit record always appears before the beginTxn of a transaction that depends on it. Note also that records are single-assigned and that poststate is a function; therefore a practical implementation may use a cache.
Figure 3d illustrates the contents of a journal store after execution of the history in Figure 3a. To obtain the value for lookup(σ, k, 11), we calculate recursively the poststates for transactions 3 and 4 and merge the results. As explained earlier (Section 2), the implementation of effects must ensure that the incr 1 from transaction 1 executes once only in the merged value, even though the history contains two paths to lookup.
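The single left-to-right pass can be sketched for a restricted case. The code below assumes a serial history, i.e., no concurrent transactions, so the merge step of poststate is never needed; it also assumes integer timestamps and that commit records appear in commit-timestamp order. Record and method names are illustrative, not those of the Coq development.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union

Value = Optional[int]                   # None stands for ⊥
Effect = Callable[[Value], Value]


@dataclass(frozen=True)
class BeginTxn:
    txn: str
    st: int


@dataclass(frozen=True)
class Update:
    txn: str
    key: str
    effect: Effect


@dataclass(frozen=True)
class Commit:
    txn: str
    st: int
    ct: int


Record = Union[BeginTxn, Update, Commit]


def lookup(journal: List[Record], key: str, t: int) -> Value:
    """One pass over the journal; applies effects of transactions committed before t."""
    pending: Dict[str, List[Effect]] = {}   # effects on `key`, per uncommitted transaction
    value: Value = None
    for record in journal:
        if isinstance(record, Update) and record.key == key:
            pending.setdefault(record.txn, []).append(record.effect)
        elif isinstance(record, Commit) and record.ct < t:
            for effect in pending.pop(record.txn, []):
                value = effect(value)
    return value


journal: List[Record] = [
    BeginTxn("t1", st=0),
    Update("t1", "k", lambda _old: 1),          # assignment of 1
    Commit("t1", st=0, ct=2),
    BeginTxn("t2", st=3),
    Update("t2", "k", lambda old: old + 10),    # increment by 10
    Commit("t2", st=3, ct=6),
    BeginTxn("t3", st=7),
    Update("t3", "k", lambda old: old + 100),   # never committed
]
print(lookup(journal, "k", 5), lookup(journal, "k", 7))
```

Supporting concurrent transactions would require tracking the immediate predecessors of each beginTxn record and merging their poststates, which this sketch deliberately leaves out.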

Formal model in Coq
So far, we have formalized a major part of the definitions presented in Sections 2, 3 and 4 in the proof assistant Coq, with the goal of formally verifying their correctness. Our Coq codebase currently comprises around 2k LOC and uses no external libraries other than the standard library.
[Figure 5: formal semantics of the journal store. max_≺JS(r) denotes the immediate predecessor(s) of record r in ≺_JS order (beginTxn may have any number of immediate predecessors; update and commit have exactly one immediate predecessor). The figure defines the poststate function, which computes the state of a key after record r takes effect, and the journal operations, where ⊲ represents the append operation.]

There are several reasons for choosing to use a proof tool such as Coq. A positive aspect of the Coq formalization is the high level of abstraction. We can define constructs with desired properties without the need to provide a specific implementation. This is in contrast to traditional programming languages, where interfaces or abstract classes can be defined, but it is usually not possible to restrict the behavior of implementations. As an example of this, in our formalization we defined timestamps to be some arbitrary data type that comes with an ordering, as described in Section 2, and for which equality is decidable. No assumption about the implementation is made, and refining the specification to specific instances can be done independently.
However, the main benefit of reasoning about our system in a formal context is the required level of detail and precision, which is typically not attained by pen-and-paper proofs. It not only makes spotting mistakes easier and earlier, but it also forced us to pay attention to specific corner cases. This already proved to be useful during this initial formalization phase. For example, in earlier versions the journal-based store explicitly maintained both a dependency and commit timestamp, while the map-based store did this implicitly. This oversight led to an inconsistency when merging concurrent effects, since the map-based store carried too little information. Only when trying to formalize the map-based store did we notice this mismatch.
While it is arguably more effort to formalize everything in a proof assistant rather than using pen-and-paper definitions and proofs, we believe that the benefits of obtaining a verified design outweigh the costs.

Discussion and Outlook
Formal methods have been successfully employed for proving (distributed) systems correct [5, 8, 15? ]. The focus of these approaches has been the verification of safety and liveness properties for different types of distributed systems and their implementations. The work presented in [? ] is closest to our approach, as it proves the correctness of a transaction library. However, their work targets the correctness verification of the specific library and all its sophisticated optimisations, while we aim to take a compositional approach to proving a generic database backend.
Typically, developers need to provide a high-level specification that is then refined in one or more steps, while the corresponding proofs are refined accordingly. For example, Verdi [15] is a framework to implement and specify systems under different network semantics in Coq; starting from an idealized system, proof obligations are then transformed for more and more complex fault models. Finally, an implementation can then be generated from the specification. Our approach differs in two major aspects. The focus of our work is the correct design and implementation of a central system component, not a distributed protocol. This component is typically simplified in the aforementioned verification frameworks. However, a correct and efficient implementation is essential in any (distributed) datastore. Further, instead of refinement, we propose a compositional approach to construct more and more complex implementations. This helps system designers to select and incrementally add features such as caching, write-ahead logging, or checkpointing. Reducing these features to their essence helps us extract their actual (and not incidental) requirements and re-purpose metadata in different contexts.
Tools exist to compile a Coq specification to executable code [9,13,15]. We are not taking this path for pragmatic reasons: it is too far from the ordinary programmer's experience and as of today still requires extensive manual intervention. Instead, we manually transcribe the specification to Java, verbatim, resisting the temptation to optimise. We will check through testing that the implementation behaves like the specification, and that variants that were proved equivalent do have the same runtime behaviour.

A Formal semantics of transactions
Figure 6 shows the transition rules for the small-step operational semantics of a transaction system built on a store. The specification is fully formal and unambiguous: we find it invaluable to reason about the system, and it is easily translated to the language of a proof tool such as Coq. Most interestingly, it can be read as pseudocode, as we explain now.
The semantics are written as a set of rules. Each rule represents an indivisible state transition; i.e., there are no intermediate states from a semantic perspective, and any intermediate states in the implementation must not be observable.
The system state is represented as a tuple (σ, X_a, X_c, X_r) consisting of a store, its field, and the sets of aborted, committed, and running transactions' descriptors.
A rule consists of a set of premises above a long horizontal line, and a conclusion below. A premise is a logical predicate referring to state variables. A variable without a prime mark refers to the state before the transition (pre-state); a primed variable refers to the state after the transition (post-state). Thus a premise that uses only non-primed variables is a pre-condition on the pre-state; if it contains a primed variable, it is a post-condition that constrains the post-state.
If the premises are satisfied, the state-change transition described by the conclusion can take place. A label on the transition arrow under the line represents a client API call. Thus, a rule can be seen as terse pseudocode for the computation to be carried out by the API.

A.1 Example
To explain the syntax, consider for example rule BeginTxn. The conclusion describes the transition made by API command beginTxn(st) from pre-state (σ, X_a, X_c, X_r) on the left of the arrow labelled beginTxn(st), to post-state (σ′, X_a, X_c, X′_r) on the right. Note that X_a and X_c are not primed in the right-hand side of this conclusion, indicating that these elements of the state do not change.

A.2 Parameters
The rules describe a transaction system, which is a tuple (σ, X_a, X_c, X_r) consisting of a store σ, its associated field F, and sets of transaction descriptors X_a, X_c, and X_r, which keep track of aborted, committed and running (ongoing) transactions respectively. A transaction descriptor is a tuple (τ, st, R, W, B, ct) of transaction identifier τ, its snapshot timestamp st, its read set R, its write set W, its effect buffer B, and its commit timestamp ct. The two timestamps define visibility between transactions, as defined previously (Section 2). Initially, after rule BeginTxn, the sets and the effect buffer are empty and the commit timestamp is invalid. For each key that is accessed, rule InitKey initialises the buffer, and rule Update updates it. Computation of the actual commit timestamp may be deferred to the Commit rule.
The semantic rules are parameterised by commands lookup, doUpdate, and doCommit, specified in Figure 1. These commands are specialised for each specific store variant: the map-based variant in Section 4.1, the journal-based variant in Section 4.2.

A.3 Transaction begin
We now consider each rule in turn.

Rule BeginTxn describes how API beginTxn() begins a new transaction with snapshot timestamp st. The snapshot of the new transaction is timestamped by st, passed as an argument; remember that a snapshot includes all transactions that committed with a strictly lesser commit timestamp.
The first premise chooses a fresh transaction identifier τ. The last premise ensures that the appropriate transaction descriptor is in the post-state set of running transactions.
As the transition is labelled by τ, multiple instances of BeginTxn are mutually independent and might execute in parallel, as long as each such transition appears atomic.

A.4 Reads and writes
Reads and updates operate on the transaction's effect buffer B, which must contain the relevant key.
Rule InitKey specifies a buffer miss, which initialises the buffer for some key k. As it does not have an API label, it can be called arbitrarily. It modifies only the current transaction's descriptor. Its first premise takes the descriptor of the current transaction from the set of running transactions X_r. The second one checks that k is not already in the read set, ensuring that the effect buffer is initialised once per key. The third reads the store by using lookup (specific to a store variant). Next, a premise updates the read set, and another initialises the effect buffer with the return value of lookup. The final premise puts the transaction descriptor, containing the updated read set and effect buffer, back into the descriptor set of running transactions.
In rule Read, API read(k) returns a value from the effect buffer. It does not modify the store. The first premise is as above. The second one requires that the key is in the read set, thus ensuring that InitKey has been applied. The next two premises extract k's mapping from the effect buffer and compute the corresponding return value.
Note the clause δ ∈ Assign. It requires that, prior to reading, the application has initialised the store with an assignment to k (possibly followed by other effects; such a sequence resolves to an assignment, by associativity, as explained earlier). Otherwise, lookup would return either ⊥ (if the key has not been initialised at all) or a non-assignment (if the application has stored only non-assignments). We leave the burden of initialisation to the application to simplify the semantics; logically, it is an axiom.
In rule Update, API call update(k, δ) applies effect δ to key k. It updates both the store and the transaction descriptor. The first two clauses are similar to Read, and similarly require a buffer miss if the key has not been used before (avoiding blind writes). It updates the effect buffer, ensuring that the transaction will read its own writes, and puts the key in the write set. It calls the variant-specific command doUpdate, discussed later in the context of each variant.
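Read as pseudocode, the three rules of this subsection might look as follows. This is a loose sketch rather than a transcription of Figure 6: the effect buffer holds already-resolved values instead of assignment effects, timestamps are integers, and the names `Descriptor`, `init_key`, `read`, `update`, and the `lookup` and `do_update` callbacks are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set

Value = Optional[int]                   # None stands for ⊥
Effect = Callable[[Value], Value]


@dataclass
class Descriptor:
    st: int                                             # snapshot timestamp
    read_set: Set[str] = field(default_factory=set)     # R
    write_set: Set[str] = field(default_factory=set)    # W
    buffer: Dict[str, Value] = field(default_factory=dict)  # B


def init_key(txn: Descriptor, key: str, lookup: Callable[[str, int], Value]) -> None:
    """Buffer miss: load the snapshot value of `key`, once per key."""
    if key in txn.read_set:
        return
    txn.buffer[key] = lookup(key, txn.st)
    txn.read_set.add(key)


def read(txn: Descriptor, key: str) -> Value:
    """Return the buffered value; the key must have been initialised first."""
    if key not in txn.read_set:
        raise KeyError(f"{key!r} not initialised in this transaction")
    return txn.buffer[key]


def update(txn: Descriptor, key: str, effect: Effect,
           do_update: Callable[[str, Effect], None] = lambda key, effect: None) -> None:
    """Apply `effect` to the buffer, record the write, and notify the store variant."""
    if key not in txn.read_set:
        raise KeyError(f"{key!r} not initialised in this transaction")
    txn.buffer[key] = effect(txn.buffer[key])
    txn.write_set.add(key)
    do_update(key, effect)


snapshot = {"x": 27}                                  # stands in for the store
txn = Descriptor(st=5)
init_key(txn, "x", lambda key, st: snapshot.get(key))
update(txn, "x", lambda old: old + 10)
print(read(txn, "x"), snapshot["x"])                  # own write is read; store untouched
```

The default `do_update` does nothing, which matches a store that defers updates to commit time; a journal-style store would pass a callback that appends an update record.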

A.5 Transaction termination
A transaction terminates, either by aborting without changing the store, or by committing, which applies its effects atomically to the store.
Rule Abort moves the current transaction's descriptor from X_r to X_a, marking it as aborted. It does not make any other change.
API call commit(τ, ct) takes a commit timestamp argument. It is enabled by rule Commit, which modifies the store, the running set, and the committed set. The first premise is as usual. Commit timestamp ct must satisfy the constraints stated in the next three premises: it is unique (it does not appear in X_c); it is greater than or equal to the snapshot timestamp; and the NoInversion(ct) premise ensures that no already-committed or running transaction may depend on this one. The latter premise protects against the case where another transaction has read a value that this transaction has yet to write, because the transactions commit in the wrong order. To understand, consider the following anomalous example: (i) transaction τ1 has commit timestamp 1; (ii) transaction τ2 starts with snapshot timestamp 2 > 1; thus τ2 reads the updates made by τ1; (iii) however, τ1 is slow and its committed effects reach the store only after the read by τ2. Clearly, this would be incorrect. To avoid this issue, no still-running or committed transaction may read from the current transaction, i.e.,
∄ τ′ ∈ X_c ∪ X_r : ct < τ′.st ∧ W ∩ τ′.R ≠ ∅
This expression requires keeping track of the read set of committed transactions; to avoid this, we could check running transactions only, using the slightly stronger expression:
∄ τ′ ∈ X_c ∪ X_r : ct < τ′.st ∧ (τ′ ∈ X_r ⟹ W ∩ τ′.R ≠ ∅)
For simplicity, we choose to use the even stronger premise
NoInversion(ct) ≝ ∄ τ′ ∈ X_c ∪ X_r : ct < τ′.st
at the cost of aborting transactions unnecessarily.
Operation doCommit (specific to a store variant) provides the new state of the store; it should ensure that the effects of the committed transaction become visible in the store, labelled with the commit timestamp. Finally, the transaction descriptor, now containing the commit timestamp, is moved to the set of committed transactions.
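The three constraints on the commit timestamp can be checked by a small predicate. The sketch assumes integer timestamps and a simplified descriptor with only `st` and `ct`; the names `Descriptor`, `no_inversion` and `can_commit` are illustrative and not part of the formal rules.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Descriptor:
    st: int                      # snapshot timestamp
    ct: Optional[int] = None     # commit timestamp, None while running


def no_inversion(ct: int, committed: Iterable[Descriptor],
                 running: Iterable[Descriptor]) -> bool:
    """Strongest variant: no committed or running transaction has a snapshot after ct."""
    return not any(ct < other.st for other in list(committed) + list(running))


def can_commit(txn: Descriptor, ct: int, committed: List[Descriptor],
               running: List[Descriptor]) -> bool:
    """Premises on ct: unique, not before the snapshot, and no inversion."""
    unique = all(other.ct != ct for other in committed)
    return unique and txn.st <= ct and no_inversion(ct, committed, running)


# The anomalous example: t1 tries to commit at 1 while t2 already runs with snapshot 2.
t1 = Descriptor(st=0)
t2 = Descriptor(st=2)
print(can_commit(t1, 1, committed=[], running=[t1, t2]))   # rejected
print(can_commit(t1, 3, committed=[], running=[t1, t2]))   # accepted
```

Rejecting the first commit is conservative: `t2` may never read a key that `t1` wrote, which is the unnecessary abort that the simplified premise accepts as its cost.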

Figure 3. Example of execution trace. The history (a) shows the order in which the transactional operations are executed. The dependency graph (b) visualizes the partial order of the transactions. The map (c) and journal (d) show the different stores after the history executed.

Table 1. Overview of notation.