Boosting Transactional Memory with Stricter Serializability

Transactional memory (TM) guarantees that a sequence of operations encapsulated into a transaction is atomic. This simple yet powerful paradigm is a promising direction for writing concurrent applications. Recent TM designs employ a time-based mechanism to leverage the performance advantage of invisible reads. With the advent of many-core and non-uniform memory access (NUMA) architectures, this technique is however hitting the synchronization wall of the cache coherency protocol. To address this limitation, we propose a novel and flexible approach based on a new consistency criterion named stricter serializability (SSER+). Workloads executed under SSER+ are opaque when the object graph forms a tree and transactions traverse it top-down. We present a matching algorithm that supports invisible reads and lazy snapshots, and that can trade synchronization for more parallelism. Several empirical results against a well-established TM design demonstrate the benefits of our solution.


Introduction
The advent of chip-level multiprocessing in commodity hardware has pushed applications to be more and more parallel in order to leverage the increase in computational power. However, concurrent programming is known to be a difficult task [27], and programmers always look for new paradigms to simplify it. Transactional Memory (TM) is widely considered a promising step in this direction, in particular thanks to its simplicity and programmer-friendliness [11].
The engine that orchestrates concurrent transactions run by the application, i.e., the concurrency manager, is one of the core aspects of a TM implementation. A large number of concurrency manager implementations exist, ranging from pessimistic lock-based implementations [1,21] to completely optimistic ones [22], with [29] or without multi-version support [2]. For application workloads that exhibit a high degree of parallelism, these designs tend to favor optimistic concurrency control. In particular, a widely accepted approach consists in tentatively executing invisible read operations and validating them in the course of the transaction execution to enforce consistency. For performance reasons, another important property is disjoint-access parallelism (DAP) [12]. This property ensures that concurrent transactions operating on disjoint parts of the application do not contend in the concurrency manager. Thus, it is key to ensuring that the system scales with the number of cores.
From a developer's point of view, the interleaving of transactions must satisfy some form of correctness. Strict serializability (SSER) [24] is a consistency criterion commonly encountered in the database literature. This criterion ensures that committed transactions behave as if they were executed sequentially, in an order compatible with real time. However, SSER does not specify the conditions for aborted transactions. To illustrate this point, let us consider history h_1, where transactions T_1 = r(x); r(y) and T_2 = w(x); w(y) are executed respectively by processes p and q. In this history, T_1 aborts after reading inconsistent values for x and y. Yet, h_1 is compliant with SSER.

A new consistency criterion
This section is organized in two parts. The first part presents the elements of our system model (Section 2.1) as well as the notions of contention and binding (Section 2.2). In the second part (Sections 2.3 and 2.4), we formulate our notion of stricter serializability and study its applicability.

System Model
Transactional memory (TM) is a recent paradigm that allows multiple processes to concurrently access a shared memory region. Each process manipulates objects in the shared memory with the help of transactions. When a process starts a new transaction, it calls operation begin. Then, the process executes a sequence of read and write operations on the shared objects according to some internal logic. Operation read(x) takes as input an object x and returns either a value in the domain of x or a flag ABORT to indicate that the transaction aborts. Operation write(x, v) changes x to the value v in the domain of x. This operation does not return any value and it may also abort. At the end of the transaction execution, the process calls tryCommit to terminate the transaction. This call returns either COMMIT, to indicate that the transaction commits, or ABORT if the transaction fails.
A history is a sequence of invocations and responses of TM operations by one or more processes. As illustrated with history h_2 below, a history is commonly depicted as parallel timelines, where each timeline represents the transactions executed by a process. In history h_2, processes p, q and r execute respectively transactions T_1 = w(x), T_2 = w(x) then T_4 = r(y); r(x), and T_3 = r(x); w(y). All the transactions but T_4 complete in this history. For simplicity, a complete transaction that is not explicitly aborted in the history commits immediately after its last operation. We note com(h) the set of transactions that commit during history h. In the case of history h_2, we have com(h_2) = {T_1, T_2, T_3}.
A history induces a real-time order between transactions (denoted ≺_h). The order T_i ≺_h T_j holds when T_i terminates in h before T_j begins. For instance in history h_2, transaction T_1 precedes transaction T_3. When two transactions are not related by real time, they are concurrent.
A version is the state of a shared object as produced by the write of a transaction. This means that when a transaction T_i writes to some object x, an operation denoted w_i(x_i), it creates the version x_i of x. Versions make it possible to uniquely identify the state of an object as observed by a read operation, e.g., r_3(x_1) in h_2. When a transaction T_i reads version x_j, we say that T_i reads from transaction T_j.
Given some history h and some object x, a version order on x for h is a total order over the versions of x in h. By extension, a version order for h is the union of all the version orders for all the objects (denoted ≪_h). For instance, in history h_2 above, we may consider the version order (x_2 ≪_{h_2} x_1).
Consider a history h and some version order ≪_h. A transaction T_i depends on some transaction T_j when T_j precedes T_i, T_i reads from T_j, or such a relation holds transitively. Transaction T_i anti-depends on T_j on object x when T_i reads some version x_k, T_j writes version x_j, and x_k precedes x_j in the version order (x_k ≪_h x_j). An anti-dependency between T_i and T_j on object x is a reverse-commit anti-dependency (for short, RC-anti-dependency) [20] when T_j commits before T_i, and T_i writes some object y ≠ x.
To illustrate the above definitions, consider again history h_2. In this history, transaction T_3 depends on T_1 and T_2. On the other hand, if x_2 ≪_h x_1 holds and T_4 reads x_2, then this transaction exhibits an anti-dependency with T_1. This anti-dependency becomes an RC-anti-dependency if T_4 executes an additional step during which it writes some object z ≠ x.
Over the course of its execution, a transaction reads and writes versions of the shared objects. The set of versions read by the transaction forms its read set (or snapshot). The versions written define the write set.
A transaction observes a strictly consistent snapshot [5] when it never misses the effects of some transaction it depends on. In detail, the snapshot of transaction T_i in history h is strictly consistent when, for every version x_j read by T_i, if T_k writes version x_k, and T_i depends on T_k, then x_k is followed by x_j in the version order.
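The definition above can be checked mechanically on a small history. The following Python sketch is only an illustration: the function name strictly_consistent and the encoding of reads, dependencies and version orders as dictionaries are assumptions of this example, not part of the model. "Followed by" is read here as "at the same or a later position in the version order".

```python
def strictly_consistent(reads, depends_on, writes, version_order):
    """Check the strictly-consistent-snapshot condition for one transaction.

    reads:         object -> id of the transaction whose version was read
    depends_on:    ids of the transactions the reader depends on
    writes:        transaction id -> set of objects it writes
    version_order: object -> list of writer ids, oldest version first
    """
    for obj, j in reads.items():
        order = version_order[obj]
        for k in depends_on:
            # A dependency wrote obj, but its version comes after the one read.
            if obj in writes.get(k, set()) and order.index(k) > order.index(j):
                return False
    return True


# Assumed encoding of history h_2: T_1 and T_2 write x, T_3 writes y,
# and the version order places x_2 before x_1.
writes = {1: {"x"}, 2: {"x"}, 3: {"y"}}
version_order = {"x": [2, 1], "y": [3]}

# T_4 reads y from T_3, hence it depends on T_3 and, transitively, on T_1 and T_2.
consistent = strictly_consistent({"y": 3, "x": 1}, {1, 2, 3}, writes, version_order)
inconsistent = strictly_consistent({"y": 3, "x": 2}, {1, 2, 3}, writes, version_order)
```

In this encoding, reading x_1 after y_3 keeps the snapshot of T_4 strictly consistent, whereas reading x_2 misses the effect of T_1.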

Contention and bindings
Internally, a transactional memory is built upon a set of base objects, such as locks or registers. When two transactions are concurrent, their steps on these base objects interleave. If the two transactions access disjoint objects and the TM is disjoint-access parallel, no contention occurs. However, if they access the same base object, they may slow each other down.
A transactional read is invisible when it does not change the state of the base objects implementing it. With invisible reads, read contention is basically free. From a performance point of view, this property is consequently appealing, since most workloads exhibit a large ratio of read operations.
When two transactions are concurrently writing to some object, it is possible to detect the contention and preemptively abort one of them. On the other hand, when a read-write conflict occurs, there is a race between the reader and the writer. If the read operation takes place after the write, the reader is bound to use the version produced by the writer.
Definition 1 (Binding). During a history h, when a transaction T_i reads some version x_j and T_i is concurrent to T_j, we say that T_i is bound to T_j on x.
When a transaction T_i is bound to another transaction T_j, to preserve the consistency of its snapshot, T_i must read the updates and causal dependencies of T_j that intersect with its read set. This is for instance the case of transaction T_4 in history h_2, where this transaction is bound to T_3 on y. As a consequence, T_4 must return x_1 as the result of its read on x, or its snapshot will be inconsistent.
Tracking this causality relation is difficult for the contention manager as it requires inspecting the read set, relying on a global clock, or using a large amount of metadata. We observe that this tracking is easier if each version read prior to the binding is accessed either by the writer or by one of its dependencies. In this case, we say that the binding is fair.
Definition 2 (Fair binding). Consider that in some history h a transaction T_i is bound to a transaction T_j on some object x. This binding is fair when, for every version y_k read by T_i before x_j in h, T_j depends on T_k.
Going back to history h_2, the binding of T_4 to T_3 on y is fair. Indeed, T_4 did not read any data item before accessing the version of y written by T_3. When the binding is fair, the reader can leverage the metadata left by the writer to check prior versions it has read and ensure the consistency of later read operations. In the next section, we formalize this idea with the notion of stricter serializability.
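Definition 2 translates almost literally into a predicate. The sketch below is a hedged illustration: the name is_fair_binding, the dictionary depends_on and the transaction T_5 are assumptions introduced for the example.

```python
def is_fair_binding(read_before, writer, depends_on):
    """Return True when every version read before the binding read was
    written by a transaction that the writer depends on (Definition 2).

    read_before: ids of the writers of the versions read before the binding
    writer:      id of the transaction T_j the reader is bound to
    depends_on:  transaction id -> set of ids it (transitively) depends on
    """
    return all(k in depends_on.get(writer, set()) for k in read_before)


# Assumed encoding of h_2: T_3 depends on T_1 and T_2.
depends_on = {3: {1, 2}}

fair_empty = is_fair_binding([], 3, depends_on)     # T_4 read nothing before y
fair_after_x = is_fair_binding([2], 3, depends_on)  # T_4 had read x_2 first
unfair = is_fair_binding([5], 3, depends_on)        # hypothetical T_5, unrelated to T_3
```

The first call corresponds to the situation described above: with no prior read, the condition holds vacuously.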

Stricter serializability
In what follows, we introduce and describe in detail SSER+, the stricter serializability consistency criterion that we build upon in the remainder of this paper. Like strict serializability, SSER+ requires that committed transactions form a sequential history that preserves the real-time order. In addition, it prohibits transactions from observing inconsistencies unless one of their bindings is unfair.

Definition 3 (Strict serialization graph).
Consider some version order ≪_h for h. Below, we define a relation < to capture all the relations over com(h) induced by ≪_h.
This relation can be either a partial or a total order over the committed transactions in h. The serialization graph of history h induced by ≪_h, written SSG(h, ≪_h), is defined as (com(h), <).
In the above definition, (1) is a real-time order between T and T′, (2) a read-write dependency, (3) a version ordering, and (4) an anti-dependency.

Definition 4 (Stricter serializability).
A history h is stricter serializable (h ∈ SSER+) when (i) for some version order ≪_h, the serialization graph (com(h), <) is acyclic, and (ii) for every transaction T_i that aborts in h, either T_i observes a strictly consistent snapshot in h, or one of its bindings is unfair.
Opacity (OPA) and strict serializability (SSER) coincide when aborted transactions observe strictly consistent snapshots. As a consequence of the above definition, a stricter serializable history during which all the aborted transactions exhibit fair bindings is opaque.
Proposition 1. For a history h ∈ SSER+, if every transaction T in h exhibits fair bindings then h ∈ OPA holds.
Proposition 1 offers a convenient property on histories that, when it applies, makes it possible to reach opacity. The next section characterizes a class of applications for which this property holds. In other words, we give a robustness criterion [6] against SSER+.
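Condition (i) of Definition 4 is a cycle check on the serialization graph. As a minimal sketch, assuming the edges of < have already been computed and are given as pairs (the function name is_acyclic and this encoding are assumptions of the example), Kahn's topological sort detects whether a cycle exists:

```python
from collections import defaultdict, deque


def is_acyclic(nodes, edges):
    """Return True when the directed graph (nodes, edges) has no cycle."""
    successors = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for a, b in edges:
        successors[a].append(b)
        indegree[b] += 1

    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        n = ready.popleft()
        visited += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    # Every node is visited exactly when no cycle blocks the sort.
    return visited == len(nodes)


# Assumed edges for the committed transactions of h_2: x_2 precedes x_1 in the
# version order (T2 < T1) and T_3 reads x_1 (T1 < T3).
committed = ["T1", "T2", "T3"]
edges = [("T2", "T1"), ("T1", "T3")]
serializable = is_acyclic(committed, edges)
with_cycle = is_acyclic(committed, edges + [("T3", "T2")])
```

Condition (ii) concerns aborted transactions and has to be checked separately, for instance with the snapshot and binding predicates of Section 2.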

Applicability
In what follows, we give some details about the model of application we are interested in. Then, we present our robustness criterion and prove that it applies to SSER+.

Model of application
The state of an object commonly includes references to one or more objects in the shared memory. These references between objects form the object graph of the application.
When performing a computation, a process traverses a path in the object graph. To this end, the process initially knows an immutable root object in the graph. Starting from this root, the process executes a traversal by using the references stored in each object.
For some transaction T, a path is the sequence of versions π that T accesses. It should satisfy that (i) the first object in π corresponds to the immutable root of the object graph, and (ii) for all x_i ∈ π, some y_j <_π x_i includes a reference to x_i.
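The two conditions on a path can be illustrated as follows. This is a simplified sketch that abstracts versions away and works on object names only; the function is_path and the refs dictionary are assumptions of the example. Condition (ii) is applied to every element but the root, which has no predecessor.

```python
def is_path(path, root, refs):
    """path: objects in access order; refs: object -> set of objects it references."""
    if not path or path[0] != root:
        return False  # (i) the traversal must start at the root
    for pos in range(1, len(path)):
        earlier = path[:pos]
        if not any(path[pos] in refs.get(y, set()) for y in earlier):
            return False  # (ii) no earlier object holds a reference to path[pos]
    return True


# A small tree: root -> a, b and a -> c.
refs = {"root": {"a", "b"}, "a": {"c"}}
top_down = is_path(["root", "a", "c"], "root", refs)
skips_parent = is_path(["root", "c"], "root", refs)
```

The second traversal is rejected because c is reached without first reading a, the only object that references it.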
A robustness criterion
To define our criterion, we focus specifically on SSER+ implementations that allow invisible reads. As pointed out earlier, this restriction is motivated by performance since most workloads are read-intensive. In this context, the result of Hans et al. [20] tells us that it is not possible to jointly achieve (i) SSER, (ii) read invisibility, (iii) minimal progressiveness, and (iv) acceptance of RC-anti-dependencies. As a consequence, we remove histories that exhibit such a pattern from our analysis; hereafter, we shall note these histories RCAD.
Let us consider the property P below on a TM application. In what follows, we prove that if P holds and the TM does not accept RC-anti-dependencies, then it is robust against SSER+.
- (P) The object graph forms initially a tree and every transaction maintains this invariant.
Let T be some set of transactions for which property P holds. H_T refers to histories built upon transactions in T. We wish to establish that, absent RC-anti-dependencies, the histories in H_T ∩ SSER+ are opaque. To this end, we note h some history in H_T ∩ SSER+. Since h is serializable, there exists some linearization λ of com(h) equivalent to h. For a transaction T_i in λ, we let π_i and π′_i be the paths (if any) from the root to x before and after transaction T_i. By property P, if such a path exists it is unique, because each transaction preserves the invariant that the object graph is a tree.
Lemma 1. If transaction T_i reaches x in h, then for every y_j in π_i ∪ π′_i, T_i depends on T_j.
Proof. There are two cases to consider:
- (y_j ∈ π_i) Property P implies that either y_j is the root, or T_i reads the version z_k right before y_j in π_i. Hence, by a short induction, transaction T_i reads all the versions in π_i.
- (y_j ∈ π′_i) Assume that T_i accesses y_j and name z_k the version right before y_j in π′_i. Version z_k holds a reference to y_j. If this reference does not exist prior to the execution of T_i, object z was updated. Otherwise, T_i must read z_k prior to accessing y_j.
Lemma 2. If transaction T_i aborts in h, then all its bindings are fair.
Proof. (By induction.) Define x and T_j such that T_i is bound to T_j on x and assume that all the prior bindings of T_i are fair. First, consider that either (π_i = π_j) or (π_i = π′_j) is true. Choose some y_k read before x_j in π_i. By Lemma 1, since y_k ∈ (π_j ∪ π′_j) is true, T_j depends on T_k. Otherwise, by our induction hypothesis, all the bindings of T_i prior to x_j are fair. It follows that transaction T_i observes a strictly consistent snapshot in h up to r_i(x_j). Hence, there exists a committed transaction T_k such that π_i is the path to x after transaction T_k in λ (i.e., π′_k = π_i).
Depending on the relative positions of T_j and T_k in λ, there are two cases to consider. In both cases, some transaction T_l between T_j and T_k modifies the path to x in the object graph.
- (T_j <_λ T_k) Without loss of generality, assume that T_l is the first transaction to modify π_j. Transactions T_l and T_j are concurrent in h and T_l commits before T_j. This comes from the fact that T_l must commit before T_i in h, T_j is concurrent to T_i in h and T_j is before T_l in λ. Then, since T_l modifies π_j and the two transactions are concurrent, T_l must update an object read by T_j. It follows that h exhibits an RC-anti-dependency between T_j and T_l. Contradiction.
- (T_k <_λ T_j) Choose some y_k read before x_j in π_i. If y is still in π_j, then T_j reads at least that version of object y. Otherwise, consider that T_l is the first transaction that removes y from the path to x in the object graph. To preserve property P, T_l updates some object y′ read by T_k that was referring to y. Because h ∉ RCAD, transaction T_k cannot commit after T_l. Hence, T_j depends on T_k.

Algorithm
In this section, we present a transactional memory that attains SSER+. Contrary to several existing TM implementations, our design does not require a global clock. It is weakly progressive, aborting a transaction only if it encounters a concurrent conflicting transaction. Moreover, read operations do not modify the base objects of the implementation (read invisibility).
We first give an overview of the algorithm, present its internals and justify some design choices. A correctness proof follows. We close this section with a discussion on the parameters of our algorithm. In particular, we explain how to tailor it to be disjoint-access parallel.

Overview
Algorithm 1 depicts the pseudo-code of our construction of the TM interface at some process p. Our design follows the general approach of the lazy snapshot algorithm (LSA) [14], replacing the central clock with a more flexible mechanism. Algorithm 1 employs a deferred update scheme that consists of two steps. A transaction first executes optimistically, buffering its updates. Then, at commit time, the transaction is certified and, if it commits, its updates are applied to the shared memory.
During the execution of a transaction, a process checks that the objects accessed so far did not change. Similarly to LSA, this check is lazily executed. Algorithm 1 executes it only if the shared object was recently updated, or when the transaction terminates.

Tracking Time
Algorithm 1 tracks time to compute how concurrent transactions interleave during an execution. To this end, the algorithm makes use of logical clocks. We model the interface of a logical clock with two operations: read() returns a value in N, and adv(v ∈ N) updates the clock with value v. The sequential specification of a logical clock guarantees a single property, that time flows forward:
(Time Monotonicity) A read operation always returns at least the greatest value to which the clock advanced so far. In every sequential history h, (res(read(), v) ∈ h) → (v ≥ max({u : adv(u) ≺_h read()} ∪ {0})).
Algorithm 1 associates logical clocks with both processes and transactions. To retrieve the clock associated with some object x, the algorithm uses function clock(x). Notice that in the pseudo-code, when it is clear from the context, clock(x) is a shorthand for clock(x).read().
The clock associated with a transaction is always local (line 2). In the case of a process, it might be shared or not (line 3). The flexibility of our design comes from this locality choice for clock(p). When the clock is shared, it is linearizable. To implement an (obstruction-free) linearizable clock we employ the following common approach:
(Construction 1) Let x be a shared register initialized to 0. When read() is called, we return the value stored in x. Upon executing adv(v), we fetch the value stored in x, say u. If v > u holds, we execute a compare-and-swap to replace u with v; otherwise the operation returns. If the compare-and-swap fails, the previous steps are retried.
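Construction 1 can be sketched as follows. Python offers no hardware compare-and-swap, so the Register class below emulates one with an internal lock; this emulation, as well as the names Register and SharedClock, are assumptions made for illustration and do not reproduce the obstruction-free behaviour of a real CAS instruction.

```python
import threading


class Register:
    """A shared register with an emulated compare-and-swap (assumption)."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


class SharedClock:
    """Logical clock following Construction 1."""

    def __init__(self):
        self._x = Register(0)

    def read(self):
        return self._x.load()

    def adv(self, v):
        while True:
            u = self._x.load()
            if v <= u:
                return  # the clock is already at least v
            if self._x.compare_and_swap(u, v):
                return  # successfully advanced from u to v
            # CAS failed: another process moved the clock, retry


clock = SharedClock()
clock.adv(5)
clock.adv(3)  # no effect: time never flows backward
```

Because adv only ever replaces the stored value with a larger one, a later read cannot return less than any value to which the clock advanced earlier, which is the Time Monotonicity property.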
Algorithm 1: An SSER+ transactional memory (code at process p)

Internals
In Algorithm 1, each object x has a location in the shared memory, denoted loc(x). This location stores a pair (t, d), where t ∈ N is a timestamp, and d is the actual content of x as seen by transactions. For simplicity, we shall name hereafter a pair (t, d) a version of object x. Since the location of object x is unique, only a single version of object x may exist at a time in the memory. As usual, we assume some transaction T_INIT that initializes for every object x the location loc(x) to (0, ⊥). Furthermore, we consider that each read or write operation to some location loc(x) is atomic.
Algorithm 1 associates a lock to each object. To manipulate the lock-related functions of object x, a process p employs appropriately the functions lock(x), isLocked(x) and unlock(x).
For every transaction T submitted to the system, Algorithm 1 maintains three local data structures: clock(T) is the logical clock of transaction T; rs(T) is a map that contains its read set; and ws(T) is another map that stores the write set of T. Algorithm 1 updates incrementally rs(T) and ws(T) over the course of the execution. The read set serves to check that the snapshot of the shared memory as seen by the transaction is strictly consistent. The write set buffers updates. In more detail, the execution of a transaction T proceeds as follows.
- When T starts its execution, Algorithm 1 initializes clock(T) to the smallest value of clock(q) for any process q executing the TM. Then, both rs(T) and ws(T) are set to ∅.
- When T accesses a shared object x, if x was previously written, its value is returned (line 10). Otherwise, Algorithm 1 atomically fetches the version (t, d), as seen in location loc(x). Then, the algorithm checks that (i) no lock is held on x, and (ii) in case x was previously accessed, that T observes the same version. If one of these two conditions fails, Algorithm 1 aborts transaction T (line 14). The algorithm then checks that the timestamp t associated to the content d is no greater than the clock of T. In case this does not hold (line 15), Algorithm 1 tries extending the snapshot of T by calling function extend(). This function returns true when the versions previously read by T are still valid. In that case, clock(T) is updated to the value t. If Algorithm 1 succeeds in extending (if needed) the snapshot of T, d is returned and the read set of T is updated accordingly; otherwise transaction T is aborted (line 16).
- Upon executing a write request on behalf of T to some object x, Algorithm 1 takes the lock associated with x (line 20), and in case of success, it buffers the update value d in ws(T) (line 25). The timestamp t of x at the time Algorithm 1 takes the lock serves two purposes. First, Algorithm 1 checks that t is no greater than the current clock of T, and if not T is extended (line 23). Second, it is saved in ws(T) to ensure that at commit time the timestamp of the version of x written by T is greater than t.
- When T requests to commit, Algorithm 1 certifies the read set by calling function extend() with the clock of T (line 27). If this test succeeds, transaction T commits (lines 43 to 48). In such a case, clock(T) ticks to reach its final value (line 43). By construction, this value is greater than the timestamps of all the versions read or written by T (lines 14 and 23). Algorithm 1 updates the clock of p with the final value of clock(T) (line 44), then it updates the items written by T with their novel versions (line 46).
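The four steps above can be approximated by the following sequential Python sketch. It is written from the prose description only and is not the pseudo-code of Algorithm 1: the names Obj, Tx and Abort are hypothetical, process clocks are plain integers in a dictionary, and the code is a single-threaded simulation in which transactions are interleaved by hand (locks are ownership flags, not real synchronization).

```python
class Abort(Exception):
    """Signals that the transaction aborted (the ABORT flag)."""


class Obj:
    """A shared object: loc(x) holds (timestamp, content), plus a lock owner."""

    def __init__(self, data=None):
        self.version = (0, data)  # as if written by T_INIT
        self.owner = None         # transaction holding the lock, or None


class Tx:
    def __init__(self, proc_clocks, pid):
        self.proc_clocks, self.pid = proc_clocks, pid
        self.clock = min(proc_clocks.values())  # begin: smallest process clock
        self.rs, self.ws = {}, {}               # obj -> timestamp, obj -> data

    def _abort(self):
        for x in self.ws:                       # release the locks taken so far
            x.owner = None
        raise Abort()

    def extend(self, t):
        """Re-validate every version read so far, then advance clock(T) to t."""
        for x, seen in self.rs.items():
            ts, _ = x.version
            if x.owner not in (None, self) or ts != seen:
                return False
        self.clock = max(self.clock, t)
        return True

    def read(self, x):
        if x in self.ws:                        # previously written by T
            return self.ws[x]
        ts, data = x.version                    # fetch loc(x)
        if x.owner is not None or self.rs.get(x, ts) != ts:
            self._abort()
        if ts > self.clock and not self.extend(ts):
            self._abort()
        self.rs[x] = ts
        return data

    def write(self, x, data):
        if x.owner not in (None, self):         # lock held by another transaction
            self._abort()
        x.owner = self
        self.ws[x] = data                       # buffer the update
        ts, _ = x.version
        if ts > self.clock and not self.extend(ts):
            self._abort()

    def try_commit(self):
        if not self.extend(self.clock):         # certify the read set
            self._abort()
        self.clock += 1                         # final value of clock(T)
        self.proc_clocks[self.pid] = max(self.proc_clocks[self.pid], self.clock)
        for x, data in self.ws.items():         # install versions, release locks
            x.version = (self.clock, data)
            x.owner = None
        return "COMMIT"


clocks = {"p": 0, "q": 0}
x, y = Obj(10), Obj(20)
t1 = Tx(clocks, "p")
first = t1.read(x)                 # T1 reads the initial version of x
t2 = Tx(clocks, "q")
t2.write(x, 11)
t2.write(y, 21)
outcome = t2.try_commit()
try:
    t1.read(y)                     # newer timestamp: the extension re-checks x
    t1_aborted = False
except Abort:
    t1_aborted = True
```

In this hand-made interleaving, T1 reads x, a concurrent T2 overwrites x and y and commits, and the later read of y by T1 triggers an extension that fails because x changed. T1 therefore aborts instead of returning a value of y that is inconsistent with the x it already read.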

Guarantees
In this section, we assess the core properties of Algorithm 1. First, we show that our TM design is weakly progressive, i.e., that the algorithm aborts a transaction only if it encounters a concurrent conflicting transaction. Then, we prove that Algorithm 1 is stricter serializable.
(Weak-progress) A transaction executes under weak progressiveness [19], or equivalently it is weakly progressive, when it aborts only if it encounters a conflicting transaction. By extension, a TM is weakly progressive when it only produces histories during which transactions are weakly progressive. We prove that this property holds for Algorithm 1.
In Algorithm 1, a transaction T aborts either at line 14, 16, 21, 24, or 28. We observe that in such a case either T observes an item x locked, or the timestamp associated with x has changed. It follows that if T aborts then it observes a conflict with a concurrent transaction. From which we deduce that it is executing under weak progressiveness.
(Stricter serializability) Consider some run ρ of Algorithm 1, and let h be the history produced in ρ. In light of its pseudo-code, every function defined in Algorithm 1 is wait-free. As a consequence, we may consider without loss of generality that h is complete, i.e., every transaction executed in h terminates with either a commit or an abort event. In what follows, we let ≪_h be the order in which writes to the object locations are linearized in ρ. We first prove that < is acyclic for this definition of ≪_h. Then, we show that, if a transaction does not exhibit any unfair binding, then it observes a strictly consistent snapshot. For some transaction T_i, we shall note clock(T_i)_f the final value of clock(T_i).
Proposition 3. Consider two transactions T_i and T_{j≠i} in h. If either T_i depends on T_j or x_j ≪_h x_i holds, then clock(T_i)_f ≥ clock(T_j)_f is true. In addition, if transaction T_i commits then the ordering is strict, i.e., clock(T_i)_f > clock(T_j)_f.
Proof. In each of the two cases, we prove that clock(T_i) ≥ clock(T_j)_f holds before transaction T_i commits.
- (T_i depends on T_j) Let x be an object such that r_i(x_j) occurs in h. Since transaction T_i reads version x_j, transaction T_j commits. We observe that T_j writes version x_j together with clock(T_j)_f at loc(x) when it commits (line 46). As a consequence, when transaction T_i returns version x_j at line 18, it has assigned clock(T_j)_f to t beforehand at line 11. The condition at line 15 implies that either clock(T_i) ≥ t holds, or a call to extend(T_i, t) occurs. In the latter case, transaction T_i executes line 35, advancing its clock up to the value of t.
- (x_j ≪_h x_i) By definition, relation ≪_h forms a total order over all versions of x. Thus, we may reason by induction, considering that x_i is immediately after x_j in the order ≪_h. When T_j returns from w_j(x_j) at line 25, it holds a lock on x. This lock is released at line 47 after writing to loc(x). As ≪_h follows the linearization order, T_i executes line 20 after T_j wrote (x_j, clock(T_j)_f) to loc(x). Location loc(x) is not updated between x_j and x_i. Hence, after T_i executes line 23, clock(T_i) ≥ clock(T_j)_f holds.
Since a clock is monotonic, the relation holds forever. Then, if transaction T_i commits, it must execute line 43, leading to clock(T_i)_f > clock(T_j)_f.
Proposition 4. History h does not exhibit any RC-anti-dependencies (h ∉ RCAD).
Proof. Consider T_i, T_j and T_k such that r_i(x_k), w_j(x_j) ∈ h, x_k ≪_h x_j and T_j commits before T_i. When T_j invokes commit, it holds a lock on x. This lock is released at line 47 after version x_j is written at location loc(x). Then, consider the time at which T_i invokes tryCommit. The call at line 27 leads to fetching loc(x) at line 32. Since T_i reads version x_k [...] From the definition of ≪_h, the write of (x_k, clock(T_k)_f) takes place before the write of version (x_j, clock(T_j)_f) in ρ. Hence, loc(x) no longer contains (x_k, clock(T_k)_f). Applying Proposition 3, T_i executes line 34 and aborts at line 29.
Proposition 5. Consider two transactions T_i and T_{j≠i} in com(h). If T_i < T_j holds, transaction T_i invokes commit before transaction T_j in h.
Proof. Assume that T_i and T_j conflict on some object x. We examine in order each of the four cases defining relation <.
- (Read-write dependency) Before committing, T_j invokes extend at line 27. Since T_j commits in h, it should retrieve (x_i, l) from loc(x) when executing line 32. Hence, transaction T_i has already executed line 46 on object x. It follows that T_i invokes commit before transaction T_j in history h.
- (Version ordering, ∃x : x_i ≪_h x_j) By definition of ≪_h, the write of version x_i is linearized before the write of version x_j in ρ. After T_i returns from w_i(x_i), it owns a lock on object x (line 46). The object is then unlocked by transaction T_i at line 47. As a consequence, transaction T_j takes a lock on object x after T_i invokes operation commit. From which it follows that the claim holds.
- (Anti-dependency) Follows from Proposition 4.
Proof. Proposition 5 tells us that if T_i < T_j holds then T_i commits before T_j. It follows that SSG(h, ≪_h) is acyclic.
Let us now turn our attention to the second property of SSER+. Assume that a transaction T_i aborts in h. For the sake of contradiction, consider that T_i exhibits fair bindings and yet that it observes a non-strictly consistent snapshot.
Applying the definition given in Section 2.1, there exist transactions T_j and T_k such that T_i depends on T_j, r_i(x_k) occurs in h and x_k ≪_h x_j. Applying Proposition 5, if T_j ≺_h T_i holds, transaction T_i cannot observe version x_k. Thus, transaction T_j is concurrent to T_i. Moreover, by definition of the dependency of T_i on T_j, there exist a transaction T_l (possibly, T_j) and some object y such that T_i performs r_i(y_l) and T_l depends on T_j. In what follows, we prove that T_i aborts before returning y_l.
To start with, relation < is acyclic, thus x_k ≠ y_l holds. It then remains to investigate the following two cases:
- (T_i reads y before x) From Proposition 5 and the dependency of T_l on T_j, transaction T_j is committed at the time T_i reads object x. Contradiction.
- (T_i reads x before y) We first argue that, at the time T_i executes line 11, the timestamp fetched from loc(y) is greater than clock(T_i).
Proof. First of all, observe that T_j is not committed at the time T_i reads object x (since x_k ≪_h x_j holds). Hence, denoting q the process that executes T_j, clock(q) < clock(T_j)_f is true when T_i begins its execution at line 5. From the pseudo-code at line 5, clock(T_i) < clock(T_j)_f holds at the start of T_i. Because T_j is concurrent to T_i, T_l is also concurrent to T_i by Proposition 5. Thus, as r_i(y_l) occurs, T_i is bound to T_l on y. Now, consider some object z read by T_i before y, and name z_r the version read by T_i. Since the binding of T_i to T_l is fair, T_l depends on T_r. Hence, applying Proposition 3, we have clock [...]
From what precedes, transaction T_i invokes extend at line 15. We know that transaction T_j is committed at that time (since T_l is committed and T_l depends on T_j). Thus, the test at line 33 fails and T_i aborts before returning y_l.

Discussion
Algorithm 1 replaces the global clock usually employed in TM architectures with a more flexible mechanism. For some process p, clock(p) can be local to p, shared across a subset of the processes, or even all of them.
If processes need to synchronize too often, maintaining consistency among the various clocks is expensive. In this situation, it might be of interest to find a compromise between the cost of cache coherency and the need for synchronization. For instance, in a NUMA architecture, Algorithm 1 may assign a clock per hardware socket. Upon a call to clock(p), the algorithm returns the clock defined for the socket in which the processor executing process p resides.
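The three placements of clock(p) mentioned above (local, per socket, global) can be pictured with a small mapping. This sketch is illustrative only: the function assign_clocks, the mode names and the socket_of table are assumptions, and the Clock class is a plain non-thread-safe counter standing in for a linearizable shared clock.

```python
class Clock:
    """A plain logical clock (not thread-safe; for illustration only)."""

    def __init__(self):
        self.value = 0

    def read(self):
        return self.value

    def adv(self, v):
        self.value = max(self.value, v)


def assign_clocks(socket_of, mode):
    """Map each process to its clock.

    socket_of: process id -> socket id
    mode:      'local' (one clock per process), 'socket' (one per socket),
               or 'global' (a single clock for all processes)
    """
    if mode == "local":
        return {p: Clock() for p in socket_of}
    if mode == "socket":
        per_socket = {s: Clock() for s in set(socket_of.values())}
        return {p: per_socket[s] for p, s in socket_of.items()}
    if mode == "global":
        shared = Clock()
        return {p: shared for p in socket_of}
    raise ValueError(f"unknown mode: {mode}")


# Three processes on two sockets (hypothetical placement).
socket_of = {"p0": 0, "p1": 0, "p2": 1}
clock = assign_clocks(socket_of, "socket")
clock["p0"].adv(7)  # visible to p1 (same socket), not to p2
```

With the per-socket placement, an advance by p0 is observed by p1 without any cross-socket traffic, while p2 keeps its own view of time.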
On the other hand, when the processes use a global clock, Algorithm 1 boils down to the original TinySTM implementation [14]. In this case, a read-only transaction always sees a strictly consistent snapshot. As a consequence, it can commit right after a call to tryCommit, i.e., without checking its snapshot at line 38.
A last observation is that our algorithm works even if one of the processes takes no step. This implies that the calls to process clocks (at lines 5 and 44) are strictly speaking not necessary and can be skipped without impacting the correctness of Algorithm 1. Clocks are solely used to avoid extending the snapshot at each step where a larger timestamp is encountered. If process clocks are not used, when two transactions access disjoint objects, they do not contend on any base object of the implementation. As a consequence, such a variation of Algorithm 1 is disjoint-access parallel (DAP).

Evaluation
This section presents a performance study of our SSER+ transactional memory described in Section 3. To conduct this evaluation, we implemented and integrated our algorithm inside TinySTM [14], a state-of-the-art software transactional memory implementation. Our modifications account for approximately 500 SLOC. We run Algorithm 1 in disjoint-access parallel mode. As explained in Section 3.5, in this variation the clocks of the processes are not accessed. A detailed evaluation of the other variations of Algorithm 1 is left for future work.
The experiments are conducted on an AMD Opteron48, a 48-core machine with 256 GB of RAM. This machine has four dodeca-core AMD Opteron 6172 processors and 8 NUMA nodes. To evaluate the performance of our implementation on this multi-core platform, we use the test suite included with TinySTM. This test suite is composed of several TM applications with different transaction patterns. The remainder of this section briefly describes the benchmarks and discusses our results. For comparison, we also present the results achieved with the default TinySTM distribution (v1.0.5).

A bank application
The bank benchmark simulates transfers between bank accounts. A transaction updates two accounts, transferring some random amount of money from one account to another. A thread executing this benchmark performs transfers in a closed loop. Each thread is bound to some branch of the bank, and accounts are spread evenly across the branches. A locality parameter makes it possible to tune which accounts a thread accesses when doing a transfer. This parameter serves to adjust the probability that a thread executes consecutive operations on the same data. More specifically, when locality is set to the value ρ, a thread executes a transfer in its branch with probability ρ and between two random accounts with probability (1 − ρ). When ρ = 1, this workload is fully parallel.
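The account selection governed by the locality parameter can be sketched as follows. This is an illustrative reconstruction, not the benchmark's code: the function name `pick_accounts`, the assignment of account `a` to branch `a % n_branches`, and the default sizes are assumptions made for the example.

```python
import random


def pick_accounts(rng: random.Random, branch: int, rho: float,
                  n_accounts: int = 10_000, n_branches: int = 48):
    """Pick two distinct accounts for one transfer.

    With probability `rho`, both accounts belong to the caller's branch;
    otherwise they are drawn uniformly among all accounts.
    Assumption: account `a` belongs to branch `a % n_branches`.
    """
    if rng.random() < rho:
        # Local transfer: accounts branch, branch + n_branches, ...
        candidates = range(branch, n_accounts, n_branches)
    else:
        # Remote transfer: any two accounts of the bank.
        candidates = range(n_accounts)
    src, dst = rng.sample(candidates, 2)
    return src, dst


def transfer(balances: list, src: int, dst: int, amount: int) -> None:
    """Body of the transaction: move `amount` from `src` to `dst`."""
    balances[src] -= amount
    balances[dst] += amount
```

Since `rng.random()` returns a value in [0, 1), setting `rho` to 1 always selects the local branch and setting it to 0 never does. In the former case, threads bound to different branches touch disjoint accounts, which is the fully parallel situation described above.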
Figure 1 presents the experimental results for the bank benchmark. In Figure 1(a), we execute a base scenario with 10k bank accounts and a locality of 0.8. We measure the number of transfers performed while varying the number of threads in the application. In this figure, we observe that the performance obtained with TinySTM barely improves as the number of threads increases: 48 threads achieve 2.8 million transactions per second (MTPS), scaling up from 2.2 MTPS with a single thread. Our implementation performs better: with 48 threads, Algorithm 1 executes around 68 MTPS, i.e., 31 times more operations than with one thread.
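The scaling factors implied by these figures can be checked with a short calculation. The sketch below only uses the throughputs quoted in the text, which are themselves rounded; the helper name `speedup` is introduced for illustration and the derived values are therefore approximate.

```python
def speedup(throughput_n: float, throughput_1: float) -> float:
    """Ratio between multi-threaded and single-threaded throughput."""
    return throughput_n / throughput_1


# TinySTM: 2.2 MTPS with one thread, 2.8 MTPS with 48 threads.
tinystm_speedup = speedup(2.8, 2.2)

# Algorithm 1: about 68 MTPS with 48 threads, a factor 31 over one thread,
# which implies a single-thread throughput of about 68 / 31 MTPS.
algorithm1_single_thread = 68 / 31

# Throughput ratio between the two designs at 48 threads.
ratio_at_48_threads = 68 / 2.8
```

Under these rounded numbers, TinySTM gains roughly a factor 1.3 from 1 to 48 threads, the single-thread throughput of Algorithm 1 is close to 2.2 MTPS, and Algorithm 1 is roughly 24 times faster than TinySTM at 48 threads.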
To understand the impact of data locality on performance, we vary this parameter for a fixed number of threads. Figure 1(b) presents the speedup obtained when varying locality from 0, i.e., all the accounts are chosen at random, up to 1, where they are all chosen in the local branch. In this experiment, we fix the number of threads to 48, i.e., the maximum number of cores available on our test machine. As shown in Figure 1(b), our TM implementation leverages the presence of data locality in the bank application. This is expected, since we use the disjoint-access parallel (DAP) variation of Algorithm 1. When locality increases, the contention in the application decreases. As a consequence of DAP, threads working on independent data do not contend, which improves performance.

Linked-list
The linked-list benchmark concurrently modifies a sorted linked list of integers. Each thread randomly adds or removes an integer from the list. We run this benchmark for a range of 512 values, i.e., a thread randomly selects a value between −255 and +256 before doing an insertion or a removal. The linked list is initialized to contain 256 integers. We report our results in Figure 2 (left). We observe that TinySTM outperforms our implementation in the linked-list benchmark. This is due to the fact that, without proper clock synchronization, transactions tend to re-validate their reads frequently over their execution paths. In this scenario of high contention, it is (as expected) preferable to rely on a frequent synchronization mechanism such as the global clock used in TinySTM. To alleviate this issue, one could dynamically adjust the clocks used in Algorithm 1 according to contention. Such a strategy could rely on a global lock, similarly to the mechanism used to avoid that long transactions abort. We leave the implementation of this optimization for future work.
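For reference, the workload just described can be sketched sequentially as follows. This is not the benchmark's code: a Python list stands in for the linked list, the names `make_list` and `step` are hypothetical, and both the even split between insertions and removals and the absence of duplicates are assumptions.

```python
import random

LOW, HIGH = -255, 256  # range of 512 values, bounds included


def make_list(rng: random.Random, size: int = 256) -> list:
    """Initial sorted list holding `size` distinct integers."""
    return sorted(rng.sample(range(LOW, HIGH + 1), size))


def step(rng: random.Random, items: list) -> int:
    """Run one operation and return the number of elements traversed.

    Assumption: insertion and removal are equally likely and the list
    holds no duplicates.
    """
    value = rng.randint(LOW, HIGH)
    pos = 0
    # Linear walk from the head, as in a linked list.
    while pos < len(items) and items[pos] < value:
        pos += 1
    present = pos < len(items) and items[pos] == value
    if rng.random() < 0.5:
        if not present:
            items.insert(pos, value)  # add the value, keeping the order
    elif present:
        items.pop(pos)                # remove the value
    return pos
```

Every operation walks the list from its head, so a transactional version reads a long prefix of the list that overlaps with the read sets of most concurrent operations. This is consistent with the frequent re-validations reported above.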

Red-Black Tree
The red-black tree benchmark is similar to the linked-list benchmark, except that the values are stored in a self-balancing binary search tree. We run this benchmark with a range of $10^7$ values and a binary tree initialized with $10^5$ values. Figure 2 (right) reports our results. When using the original TinySTM design, the performance of the application improves linearly up to 12 threads. It then plateaus at approximately 50 MTPS due to contention on the global clock. In this benchmark, the likelihood of having two concurrent conflicting transactions is very low. Leveraging this workload property, our implementation of Algorithm 1 scales the application linearly with the number of threads. Algorithm 1 achieves 176 MTPS with 48 threads, improving performance by a factor of 36 over a single-threaded execution.

Related Work
Transactional memory (TM) makes it possible to design applications as sequences of instructions that run in isolation from one another. This paradigm greatly simplifies the programming of modern highly-parallel computer architectures.
At first glance, it might be of interest that a TM design accepts all correct histories, a property named permissiveness [16]. Such TM algorithms need to track large dependencies [25] and/or acquire locks for read operations [2]. However, both techniques are known to have a significant impact on performance.
Early TM implementations (such as DSTM [23]) validate all the prior reads when accessing a new object. The complexity of this approach is quadratic in the number of objects read along the execution path. A time-based TM avoids this effort by relying on a global clock to timestamp object versions. Zhang et al. [33] compare several such approaches, namely TL2 [8], LSA [31] and GCC [32]. They provide guidelines to reduce unnecessary validations and shorten the commit sequence.
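The quadratic cost mentioned above follows from a simple count: if the k-th read re-validates the k − 1 objects read before it, a transaction reading n objects performs n(n − 1)/2 validations in total. The sketch below makes this count explicit. The function names are hypothetical, and the time-based figure assumes the favorable case where a single timestamp comparison per read suffices, i.e., no snapshot extension is triggered.

```python
def incremental_validation_checks(n_reads: int) -> int:
    """Validations performed when each new read re-checks all prior reads."""
    checks = 0
    for already_read in range(n_reads):
        checks += already_read  # re-validate every object read so far
    return checks


def time_based_checks(n_reads: int) -> int:
    """Timestamp comparisons in the favorable time-based case.

    Assumption: one comparison against the snapshot per read, with no
    snapshot extension.
    """
    return n_reads
```

For a transaction that reads 256 objects, this gives 32640 validations with incremental validation against 256 timestamp comparisons in the favorable time-based case.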
Multi-versioning [10,15] brings a major benefit: allowing read-only transactions to complete. This clearly boosts certain workloads, but managing multiple versions has a non-negligible performance cost on the TM internals. Similarly, invisible reads ensure that read operations do not contend in most cases. However, such a technique limits progress or the consistency criterion satisfied by the TM [3]. In the case of Algorithm 1, both read-only and update transactions are certain to make progress only in the absence of contention.
New challenges arise when considering multicore architectures and cache coherency strategies for NUMA architectures. Clock contention [7] is one of them. To avoid this problem, workloads as well as TM designs should take parallelism into account [28]. Chan et al. [7] propose to group threads into zones, where each zone shares a clock and a clock table. To timestamp a new version, the TL2C algorithm [4] tags it with a local counter together with the thread id. Each thread stores a vector of the latest timestamps it encountered. The algorithm preserves opacity by requiring that a transaction restarts if one of the vector entries is not up to date.

Conclusion
Transactional memory systems must handle a tradeoff between consistency and performance. It is impractical to take into account all possible combinations of read and write conflicts, as it would lead to largely inefficient solutions. For instance, accepting RCAD histories brings only a small performance benefit in the general case [20].
This paper introduces a new consistency criterion, named stricter serializability (SSER+). Workloads executed under SSER+ are opaque when the object graph forms a tree and transactions traverse it top-down. We present an algorithm to attain this criterion together with a proof of its correctness. Our evaluation, based on a fully implemented prototype, demonstrates that such an approach is very efficient in weakly-contended workloads.