0% found this document useful (0 votes)
8 views15 pages

Deuce: Java Noninvasive STM Framework

The document presents Deuce, a noninvasive Software Transactional Memory (STM) framework for Java, designed to facilitate the development of scalable concurrent applications without requiring changes to the Java Virtual Machine (JVM) or language extensions. Deuce dynamically instruments classes at load time and employs a field-based locking strategy to enhance concurrency, demonstrating significant performance improvements over existing Java STM frameworks. The paper highlights Deuce's capabilities, including its pluggable architecture for STM algorithms and empirical results showcasing its scalability with multiple concurrent threads.

Uploaded by

Dnevni Mesečar
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views15 pages

Deuce: Java Noninvasive STM Framework

The document presents Deuce, a noninvasive Software Transactional Memory (STM) framework for Java, designed to facilitate the development of scalable concurrent applications without requiring changes to the Java Virtual Machine (JVM) or language extensions. Deuce dynamically instruments classes at load time and employs a field-based locking strategy to enhance concurrency, demonstrating significant performance improvements over existing Java STM frameworks. The paper highlights Deuce's capabilities, including its pluggable architecture for STM algorithms and empirical results showcasing its scalability with multiple concurrent threads.

Uploaded by

Dnevni Mesečar
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

See discussions, stats, and author profiles for this publication at: [Link]

net/publication/233960814

Deuce: Noninvasive Software Transactional Memory in Java

Article · January 2010

CITATIONS READS
17 207

3 authors:

Guy Korland Nir Shavit


Tel Aviv University Massachusetts Institute of Technology
11 PUBLICATIONS 269 CITATIONS 232 PUBLICATIONS 11,527 CITATIONS

SEE PROFILE SEE PROFILE

Pascal Felber
Université de Neuchâtel
393 PUBLICATIONS 11,870 CITATIONS

SEE PROFILE

All content following this page was uploaded by Guy Korland on 26 July 2016.

The user has requested enhancement of the downloaded file.


Noninvasive concurrency with Java STM
1 1 2
Guy Korland and Nir Shavit and Pascal Felber

1
Computer Science Department
Tel-Aviv University, Israel,
guykorla@[Link], shanir@[Link]
2
University of Neuchâtel, Neuchâtel, Switzerland,
[Link]@[Link]

Abstract. In this paper we present a complete Java STM framework, called


Deuce , intended as a platform for developing scalable concurrent applications
and as a research tool for designing new STM algorithms. It was not clear
if one could build an ecient Java STM without compiler support. Deuce
provides several benets over existing Java STM frameworks: it avoids any
changes or additions to the JVM, it does not require language extensions or
intrusive APIs, and it does not impose any memory footprint or GC overhead.
To support legacy libraries, Deuce dynamically instruments classes at load
time and uses an original eld-based locking strategy to improve concur-
rency. Deuce also provides a simple internal API allowing dierent STMs
algorithms to be plugged in. We show empirical results that highlight the
scalability of our framework running benchmarks with hundreds of concur-
rent threads. This paper shows for the rst time that one can actually design
a Java STM with reasonable performance without compiler support.

1 Introduction
Multicore CPUs have become commonplace, with dual-cores powering almost any
modern portable or desktop computer, and quad-cores being the norm for new
servers. While multicore hardware has been advancing at an accelerated pace, soft-
ware for multicore platforms seems to be at a crossroads.
Currently, two diverging programming methods are commonly used. The rst ex-
ploits concurrency by synchronizing multiple threads based on locks. This approach
is well known to be a two-edged sword: on the one hand, locks give the programmer
a ne-grained way to control the applications critical sections, allowing an expert to
provide great scalability; on the other hand, because of the risk of deadlocks, starva-
tion, priority inversions, etc., they impose a great burden on non-expert programmers,
often leading them to prefer the safer but non-scalable alternative of coarse-grained
locking.
The second method is a shared-nothing model usually common in Web based
architectures. The applications usually contain only the business logic deployed in
a container, while the state is saved in an external multi-versioned control system,
such as database, message queue, or distributed cache. While this method removes
the burden of handling concurrency in the application, it imposes a huge performance
impact on data accesses and is not seamlessly applicable to many types of application.
The hope is that transactional memory (TM) will simplify concurrent program-
ming by combining the desirable features of both methods, providing state-aware
shared memory with simple concurrency control.
In the past several years there has been a urry of software transactional memory
(STM) design and implementation work. The state of the art today however is less
than appealing. With the notable exception of transactional C/C++ compilers [9],
most of STM initiatives have remained academic experiments, applicable only to toy
applications, and though we have learned much from the process of developing them,
they never reached a state that will allow them to be seriously eld tested. There are
several reasons for this. Among them are the problematic handling of many features
of the target language, the large performance overheads, and the lack of support
for legacy libraries. Moreover, many of the published results have been based on
prototypes whose source code is unavailable or poorly maintained, making a reliable
comparison between the various algorithms and designs very dicult.

In this paper, we introduce Deuce, our novel open-source Java framework for
transactional memory. Deuce has several desired features not found in earlier Java
STM frameworks. As we discuss in Section 2, there currently does not exist an ef-
cient Java STM framework that delivers a full feature STM that can be added to
an existing application without changes to its compiler or libraries. It was not clear
if one could build an ecient Java STM without compiler support.

Deuce is intended to ll this gap. It is non-intrusive in the sense that no modica-
tions to the Java virtual machine (JVM) or extensions to the language are necessary.
It uses, by default, an original locking design that detects conicts at the level of in-
dividual elds without a signicant increase in the memory footprint (no extra data
is added to any of the classes) and therefore there is no GC overhead. This locking
scheme provides ner granularity and better parallelism than former object-based
lock designs. In order to avoid performance penalty, Deuce provides weak atomicity,
i.e., it does not guarantee that concurrent accesses to a shared memory location from
both inside and outside a transaction are consistent. It also supports a pluggable
STM back-end and an open easily extendable architecture, allowing researchers to
integrate and test their own STM algorithms.

Deuce has been heavily optimized for eciency and, while there is still room
for improvements, our performance evaluations on several high-end machines (up to
128 hardware threads on a 16-core Sun Niagara-based machine and a 96-core Azul
machine) demonstrate that it scales well. Our benchmarks show that it outperforms
the main competing JVM-independent Java STM, the DSTM2 framework [6], in
many cases by two orders of magnitude, and in general scales well on many workloads.
This paper shows for the rst time that one can actually design a Java STM with
reasonable performance without compiler support.

The Deuce framework has been in development for more than a year, and we
believe it has reached a level of maturity sucient to allow it to be used by developers
with no prior expertise in transactional memory. It is our hope that Deuce can help
democratize the use of STM among developers, and that its open-source nature will
encourage STM them to extend the infrastructure with novel features, as has been
the case, for instance, with the Jikes RVM [1].

The rest of the paper is organized as follows. We rst overview related work
in Section 2. Section 3 describes Deuce from the perspective of the developer of a
concurrent application. In Section 4 we discuss the implementation of the framework,
and we show how it can be extended by the means of the STM backends in Section 5.
Finally, in Section 6, we evaluate the performance of Deuce.
2 Related work

Several tools have been proposed to allow the introduction of STM into Java. They
dier in their programming interface and in the way they are implemented. One of
the rst proposals is Harris and Fraser's CCR [5] that used a C-based STM imple-
mentation underneath the JVM and a programmatic API to use STM in Java.

DSTM [7] is another pioneering Java STM implementation. It used an explicit


API to demarcate transactions and read/write shared data. Transactional objects
had to additionally provide some pre-dened functionality for cloning their state.

More recently, the DSTM2 [6] framework was introduced, proposing a set of
higher-level mechanisms to support transactions in Java. DSTM2 is a pure Java
library that does not require any change to the JVM nor to the compiler. It uses
a special annotation to mark an atomic interface. Only a class that implements
an annotated interface can participate in a transaction. This class must be created
using a special factory and it can only support transactional access to primitive types
or other annotated classes. Transactions must be started using a Callable object.
This design creates an API that requires important refactoring of legacy code and
introduces many limitations, such as the lack of support for existing libraries.

LSA-STM [12] is a Java STM that also relies primarily on annotations for TM
programming. Transactional objects to be accessed in the context of a transaction
must be annotated as @Transactional and methods must be declared as @Atomic.
Additional annotations can be used to indicate that some methods will not modify the
state of the target object (read-only) or to specify the behavior in case an exception
is thrown within a transaction. Transactional objects are implemented in LSA-STM
using inheritance and must support state cloning, as such it does not fully support
legacy code and is not transparent nor non-invasive as Deuce is. Among the limita-
tions of the framework, one can mention that accesses to a eld are only supported
as part of methods from the owner class, therefore public and static elds cannot be
accessed transactionally. Note that this limitation also applies to other Java STM
implementations with object-level conict detection like DSTM and DSTM2.

AtomJava [8] takes a dierent approach to Java STM design by providing a


source-to-source translator based on Polyglot [10], an extensible compiler framework.
AtomJava adds a new atomic keyword to mark atomic blocks, and performs some
major extensions during code transformation, such as adding an extra instance eld
to every single class. AtomJava relies on this eld to maintain an object-based lock
schema. It is shown to reduce lock access overhead but imposes a memory overhead
which impacts not only transactional objects but all the application objects. The
source code translation approach simplies the tuning and verication of STM in-
strumentation, but it makes it harder for the programmer to debug her original code.
Also, by translating source code, AtomJava imposes a major limitation on users in
that it prevents them from using compiled libraries, and renders their source code
with atomic blocks incompatible with regular Java compilers (unlike approaches such
as DSTM2 and LSA-STM that are based on annotations).

In summary, Deuce is the rst ecient full-feature STM framework that can be
added to an existing application without changing its compilation process or libraries.
3 Concurrent Programming with DEUCE

One of the main goals in designing the Deuce API was to keep it simple. A survey
of the past designs (see previous section) reveals three main approaches: (i) adding
a new reserved keyword to mark a block as atomic, e.g., atomic in AtomJava [8];
(ii) using explicit method calls to demarcate transactions and access shared objects,
e.g., DSTM2 [6] and CCR [5]; or (iii) requiring atomic classes to implement a pre-
dened interface or extend a base class, and to implement a set of methods [7]. The
rst approach requires modifying the compiler and/or the JVM, while the others are
intrusive from the programmer's perspective as they require signicant modications
to the code (even though some systems use semi-automatic weaving mechanisms such
as aspect-oriented programming to ease the task of the programmer, e.g., [12]).
In contrast, Deuce has been designed to avoid any addition to the language or
changes to the JVM. In particular, no special code is necessary to indicate which
objects are transactional, no method needs to be implemented to supporting trans-
actional objects (e.g., cloning the state as in [7]). This allows Deuce to seamlessly
support transactional accesses to compiled libraries. The only piece of information
that must be specied by the programmer is, obviously, which part of the code should
execute atomically in the context of a transaction.
To that end, Deuce relies on Java annotations. Introduced as a new feature in
Java 5, annotations allow programmers to mark a method with metadata that can be
consulted at class loading time. Deuce introduces new types of annotations to mark
methods as atomic: their execution will take place in the context of a transaction.
This approach has several advantages, both technically and semantically. First
technically, the smallest code block Java can annotate is a method, which simplies
the instrumentation process of Deuce and provides a simple model for the program-
mer. Second, atomic annotations operate at the same level as synchronized methods,
which execute in a mutually exclusion manner on a given object; therefore, atomic
methods provide a familiar programming model.
From a semantic point of view, implement-
1 public int sum(List list ) {
ing atomic blocks at the granularity of methods 2 int total = 0;
removes the need to deal with local variables
3 atomic {
4 for (Node n : list )
as part of the transaction. In particular, since 5 total += [Link]();
6 }
Java doesn't allow any access to stack variables 7 return total;
outside the current method, the STM can avoid 8 }
logging many memory accesses. For instance, in Fig. 1. Atomic block example.
Figure 1, a ner-granularity atomic block would
require costly logging of the total local variable (otherwise the method would yield
incorrect results upon abort) whereas no logging would be necessary when considering
atomic blocks at the granularity of individual methods.
To illustrate the use and implementation of Deuce, we will consider a well-known
but non-trivial data structure: the skip list [11]. A skip list is a probabilistic structure
based on multiple parallel, sorted linked lists, with ecient search time in O(log n).
Figure 2 shows a partial implementation of skip list, with an inner class represent-
ing nodes and a method to search for a value through the list. The key observation
in this code is that transactifying an application is as easy as adding @Atomic anno-
tations to methods that should execute as transactions. No code needs to be changed
within the method or elsewhere in the class. Interestingly, the linked list directly
manipulates arrays and accesses public elds from outside their enclosing objects,
which would not be possible with DSTM2 or LSA-STM.

1 public class SkipList { 16 @Atomic (retries=64)


2 private static class Node { 17 public boolean contains(int v) {
3 public nal int value; 18 Node node = head;
4 public nal Node[] forward; 19 for (int i = level ; i >= 0; i−−) {
5 // ... 20 Node next = [Link][i];
6 public Node(int level, int v) { 21 while ([Link] < v) {
7 value = v; 22 node = next;
8 forward = new Node[level + 1]; 23 next = [Link][i];
9 } 24 }
10 // ... 25 }
11 } 26 node = [Link][0];
12 private static int MAX_LEVEL = 32; 27 return ([Link] == v);
13 private int level ; 28 }
14 private nal Node head; 29 // ...
15 // Continued in next column... 30 }

Fig. 2. @Atomic method example.


One can also observe that the @Atomic annotation provides one congurable at-
tribute, retries, to optionally limit the amount of retries the transaction attempts
(at most 64 times in the example). A TransactionException is thrown in case this
limit is reached. Alternatively one can envision providing timeout instead (Which we
might add in future versions).
A Deuce application is compiled with a regular Java compiler. Upon execution,
one needs to specify a Java agent that allows Deuce to intercept every class loaded
and manipulate it before it is loaded by the JVM. The agent is simply specied on
the command line as a parameter to the JVM, as:
java -javaagent:[Link] MainClass args...
As will be discussed in the next section, Deuce instruments every class that may be
used from within a transaction, not only classes that have @Atomic annotations. If
it is known that a class will never be used in the context of a transaction, one can
prevent it from being instrumented by providing exclusion lists to the Deuce agent.
This will speed up the application loading time yet should not aect execution speed.

4 DEUCE Implementation DEUCE


agent
JVM

This section describes the implementation of the !!


!
Application

Deuce framework. We rst give a high-level


DEUCE runtime

overview of its main components. Then, we ex- Java TL2 Lock LSA
plain the process of code instrumentation. Fi- classes
nally, we describe various optimizations that en-
hance the performance of transactional code. Fig. 3. Main components of the
Deuce architecture.
4.1 Overview

The Deuce framework is conceptually made of 3 layers, as shown in Figure 3:


1. The application, which consists of user classes written without any relationship
to the STM implementation, except for annotations added to atomic methods.
2. The Deuce runtime that orchestrates the interactions between transactions ex-
ecuted by the application and the underlying STM implementation.
3. The actual STM libraries that implement the Deuce context API (see Section 5),
including a reference single-global-lock implementation (Lock in the Figure).

In addition, the Deuce agent intercepts classes as they are loaded and instru-
ments them before they start executing.

4.2 Instrumentation Framework

Deuce's instrumentation engine is based on ASM [3], an all-purpose Java bytecode


manipulation and analysis framework. It can be used to modify existing classes or to
dynamically generate classes, directly in binary form. The Deuce Java agent uses
ASM to dynamically instrument classes as they are loaded, before their rst instance
is created in the virtual machine. During instrumentation, new elds and methods
are added, and the code of transactional methods is transformed. We now describe
the dierent manipulations performed by the Java agent.

Fields. For each instance eld in any loaded class, Deuce adds a synthetic constant
eld (final static public) that represents the relative position of the eld in the
class. This value, together with the the instance of the class, uniquely identies a
eld and is used by the STM implementation to log eld accesses.
Static elds in Java are
eectively elds of the en- 1 public class SkipList {
2 public static nal long __CLASS_BASE__ = ...
closing class. To designate 3 public static nal long MAX_LEVEL__address__ = ...
static elds, Deuce denes 4
5
public static nal long level__address__ = ...
// ...
for each class a constant 6 private static int MAX_LEVEL = 32;
7 private int level ;
that represents the base 8 // ...
9 }
class and can be used in
combination with the eld Fig. 4. Fields address.
position, instead of the class instance.
For instance, in Figure 4, the level level__ADDRESS__
eld is represented by
while the SkipList base class is represented by __CLASS_BASE__.

Accessors. For each eld of any loaded class, Deuce adds synthetic static acces-
sors used to trigger eld's access events on the local context.
Figure 5 shows the syn-
thetic accessors of class 1 public class SkipList {
2 private int level ;
SkipList. The getter 3 // ...
4
level__Getter$ receives 5 // Synthetic getter
the current eld value and 6 public int level__Getter$(Context c) {
7 [Link](this, level__ADDRESS__);
a context, and triggers two 8 return [Link](this, level, level__ADDRESS__);
events: beforeReadAccess 9 }
10 // Synthetic setter
and onReadAccess; the re- 11 public void level__Setter$(int v, Context c) {
12 [Link](this, v, level__ADDRESS__);
sult of the latter is re- 13 }
turned by the getter. The 14 // ...
15 }
setter receives a context
and yields a single event: Fig. 5. Fields accessors.
onWriteAccess. The reason for having two events in the getter is technical: the
before and after events allow the STM backend to verify that the value of the
eld, which is accessed between both events without using costly reection mech-
anisms, is consistent. This is typically the case in time-based STM algorithms like
TL2 and LSA that ship with Deuce.
Duplicate methods. In 1 private static class Node {
2 public Node[] forward;
order to avoid any per- 3 // ...
4
formance penalty on non 5 // Original method
transactional code and since 6 public void setForward(int level, Node next) {
Deuce provides weak atom- 7
8 }
forward[ level ] = next;

icity, Deuce duplicates each


9
10 // Synthetic duplicate method
method to provide two dis- 11 public void setForward(int level, Node next, Context c) {
12 Node[] f = forward__Getter$(c) {
tinct versions. The rst 13 forward__Setter$(f, level, c)
version is identical to the 14 }
15 // ...
original method: it does 16 }
not trigger an event upon
Fig. 6. Duplicate method.
memory access and, conse-
quently, does not impose any transactional overhead. The second version is a synthetic
method with an extra Context parameter. In this instrumented copy of the original
method, all eld accesses (except for final elds) are replaced by calls to the asso-
ciated transactional accessors. Figure 4 shows the two versions of a method of class
Node. The second synthetic overload replaces forward[level] = next by calls to
the synthetic getter and setter of the forward array. The rst call obtains the ref-
erence to the array object, while the second one changes the specied array element
(note that getters and setters for array elements have an extra index parameter).

1 public class SkipList { 23 // Try to commit


2 // ... 24 if (commit) {
3 25 if ([Link]()) {
4 // Original method instrumented 26 if (throwable == null)
5 public boolean contains(int v) { 27 return result;
6 Throwable throwable = null; 28 // Rethrow application exception
7 Context context = 29 throw (IOException)throwable;
8 [Link](); 30 }
9 boolean commit = true; 31 } else {
10 boolean result; 32 [Link] ();
11 33 commit = true;
12 for (int i = 64; i > 0; −−i) { 34 }
13 context. init (); 35 } // Retry loop
14 try { 36 throw new TransactionException();
15 result = contains(v, context); 37 }
16 } catch(TransactionException ex) { 38
17 // Must rollback 39 // Synthetic duplicate method
18 commit = false; 40 public boolean contains(int v, Context c) {
19 } catch(Throwable ex) { 41 Node node = head__Getter$(c);
20 throwable = ex; 42 // ...
21 } 43 }
22 // Continued in next column... 44 }

Fig. 7. Atomic method.

Atomic methods. The duplication process described above has one exception:
a method annotated as @Atomic does not need an uninstrumented version. Instead,
the original method is replaced by a transactional version that calls the instrumented
version from within a transaction that executes in a loop. The process repeats as long
as the transaction aborts and the number of retires is not reached. Figure 7 shows
the transactional version of method contains.
4.3 Optimizations

During instrumentation, we perform several optimizations to improve the perfor-


mance of Deuce. First, we do not instrument accesses to nal elds as they cannot
be modied after creation. This optimization, together with the declaration of nal
elds whenever possible in the application code, dramatically reduces the overhead.
Second, instead of generating accessor methods, Deuce actually inlines the code
of the getters and setters directly in the transactional code. We have observed a slight
performance improvement from this optimization.
Third, we chose to use the [Link] pseudo-standard internal library
to implement fast reection, as it proved to be vastly more ecient than the stan-
dard Java reection mechanisms. Using [Link] even outperformed the
approach taken in AtomJava [8], which is based on using an anonymous class per
eld to replace reection.
Finally, we tried to limit as much as possible the stress on the garbage collector,
notably by using object pools when keeping track of accessed elds (read and write
sets) in threads. In order to avoid any contention on the pool, we had each thread
keep a separate object pool as part of its context.
Together, the above optimizations helped to signicantly decrease the implemen-
tation overhead, in some of our benchmarks by almost an order of magnitude (almost
ten times faster) as compared to our initial implementation.

5 Customizing Concurrency Control

Deuce was designed to provide a research platform for STM methods. In order to
provide a simple API for researchers to plug in their own STM implementation,
Deuce denes the Context API as shown in Listing 8. The API includes an init
method called once before the transaction starts and then upon each retry, allow-
ing the transaction to initialize its internal data structures. The atomicBlockId
argument allows the transaction to log information about the specic atomic block
(statically assigned in the bytecode).
One of the heuristics we 1 public interface Context {
added to the LSA imple- 2 void init(int atomicBlockId);
3 boolean commit();
mentation is, following [4], 4 void rollback ();
5
that each one of the atomic
6 void beforeReadAccess(Object obj, long eld);
blocks will initially be a 7
8 Object onReadAccess(Object obj, Object value, long eld);
read-only block. It will 9 int onReadAccess(Object obj, int value, long eld );
be converted to become a 10 long onReadAccess(Object obj, long value, long eld );
11 // ...
writable block upon retry 12
13 void onWriteAccess(Object obj, Object value, long eld);
once it encounters a rst 14 void onWriteAccess(Object obj, int value, long eld );
write. Using this method, 15 void onWriteAccess(Object obj, long value, long eld );
16 // ...
read-only blocks can save 17 }
most of the overhead of log-
Fig. 8. Context interface.
ging the elds' access.
Another heuristic is that commit is called in case the atomic block nishes without
a TransactionException and can return false in case the commit fails, which in
turn will cause a retry. A rollback is called when a TransactionException is thrown
during the atomic block (this can be used by the business logic to force a retry).
The rest of the methods are called upon eld access: a eld read event will trigger
a beforeReadAccess event followed by a onReadAccess event. A write eld event
will trigger an onWriteAccess event. Deuce currently includes two Context imple-
mentations, [Link] and [Link], implementing the TL2 [4] and LSA [12]
STM algorithms. Deuce also provides a reference implementation based on a single
global lock. Since a global lock doesn't log any eld access, it doesn't implement the
Context interface and doesn't impose any overhead on elds' access.

TL2 Context. The TL2 context is a straightforward implementation of the TL2


algorithm [4]. The general principle is as follows (many details omitted).
TL2 uses a shared array of revokable versioned locks, with each object eld being
mapped to a single lock. Each lock has a version that corresponds to the commit
timestamp of the transaction that last updated some eld protected by this lock.
Timestamps are acquired upon commit from a global time base, implemented as a
simple counter.
When reading some data, a transaction checks that the associated timestamp is
valid, i.e., not more recent than the time when the transaction started, and keeps
track of the accessed location in its read set. Upon write, the transaction buers the
update in its write set.
At commit time, the transaction acquires (using an atomic compare-and-set op-
eration) the locks protecting all written elds, veries that all entries in the read
set are still valid, acquires a unique commit timestamp, writes the modied elds to
memory, and releases the locks. If locks cannot be acquires or validation fails, the
transaction aborts, discarding buered updates.

LSA Context. The LSA context uses the LSA algorithm [12] that was developed
concurrently with TL2 and is based on a similar design. The main dierences are that
(1) LSA acquires locks as elds are written, instead of at commit time, and (2) it
performs incremental validation to avoid aborting transactions that read data that
has been modied more recently than the start time of the transaction.
Both TL2 and LSA take a simple approach to conict management: they simply
abort and restart (possibly after a delay) the transaction that encounters the con-
ict. This strategy is simple and ecient to implement, because transactions unilat-
erally abort without any synchronization with others, but it can sometimes hamper
progress, e.g., by producing livelocks. Deuce also supports a variant of the LSA
context that provides modular contention management strategies (as rst proposed
in [7]), at the price of some additional synchronization overhead at runtime.

6 Performance Evaluation
We evaluated the performance of Deuce on a Sun UltraSPARC T2 Plus multicore
machine (2 CPUs each with 8 cores at 1.2 GHz, each with 8 hardware threads) and
an Azul Vega2 (2 CPUs each with 48 cores).
6.1 Deuce overheads
We rst briey discuss the overheads introduced by the Deuce agent when instru-
menting Java classes. Table 1 shows the memory footprint and the processing over-
head when processing compiled libraries: [Link] is the runtime library containing
all Java built-in classes for JDK 6; [Link] (version 0.4) is a widely used in-
memory cache for Java objects; JavaGrande (version 1.0) is a well-known collection
of Java benchmarks. Instrumentation times were measured on a x86 Core 2 Duo CPU
running at 1.8 GHz. Note that instrumentation was executed serially, i.e., without
exploiting multiple cores.
Application Memory Instrumentation
Original Instrumented Time
[Link] 50 MB 115 MB 29 s
[Link] 248 KB 473 KB <1 s
JavaGrande 360 KB 679 KB <1 s
Table 1. Memory footprint and processing overhead.
As expected, the size of the code approximately doubles because Deuce du-
plicates each method, and the processing overhead is proportional to the size of the
code. However, because Java uses lazy class loading, only the classes that are actually
used in an application will be instrumented at load time. Therefore, the overheads
of Deuce are negligible considering individual classes are loaded on demand.
In terms of execution time, instrumented classes that are not invoked within a
transaction incur no performance penalty as the original method executes, not the
transactional version.

6.2 Benchmarks

We tested the three built-in Deuce STM options: LSA, TL2, and a simple Global
Lock. Our LSA version captures many of the properties of the LSA-STM [13] frame-
work. The TL2 [4] form provides an alternative STM implementation that acquires
locks at commit time using a write set as opposed to the encounter order and undo
log approach used in LSA.
Our experiments included three classical micro-benchmarks: a red/black tree, a
sorted linked list and a skip list, on which we performed concurrent insert, delete,
and search operations. We controlled the size of the data structures, which remained
constant for the duration of the experiments, and the mix of operations.
We also experimented with two real-world applications, the rst implements the
Lee routing algorithm (as presented in Lee-TM [2]). It is worth noting that adapting
Deuce was trivial, with little more than a change of synchronized
the application to
blocks into atomic methods. The second application called Cache4J, is an LRU cache
implementation, again adapting the application to Deuce was trivial.

Micro-benchmarks. We began by comparing Deuce to DSTM2, the only com-


peting STM framework that does not require compiler support. We benchmarked
DSTM2 and Deuce based on the Red/Black tree benchmark provided with the
DSTM2 framework. We tried many variations of operation combinations and levels
of concurrency on both the Azul and Sun machines. Comparison results for a tree
with 10k elements are shown in Figure 9, we found that in all the benchmarks DSTM2
was about 100 times (two orders of magnitude) slower than Deuce. Our results are
consistent with those published in [6], where DSTM2's throughput in operations per
second is in the thousands while Deuce throughput is in the millions. Based on these
experiments, we believe one can conclude that Deuce is the rst viable compiler and
JVM independent Java STM framework.

RB, 10k elements, 0% updates RB tree, 10k elements, 10% updates RB tree, 10k elements, 20% updates
Throughput (× 106) txs/s

Throughput (× 106 txs/s)

Throughput (× 106) txs/s


14 3 1.6
12 2.5 1.4
10 1.2 DSTM2
2 1
8 Lock
DSTM2 TL2
DSTM2 1.5 Lock 0.8
6 LSA
Lock 1 TL2 0.6
4 TL2 LSA 0.4
2 LSA 0.5 0.2
0 0 0
1 16 32 48 64 80 96 112 128 1 16 32 48 64 80 96 112 128 1 16 32 48 64 80 96 112 128
Number of threads Number of threads Number of threads
Fig. 9. The red/black tree benchmark (Sun).

LL, 256 elements, 0% updates LL, 256 elements, 5% updates LL, 256 elements, 20% updates
Throughput (× 103 txs/s)

Throughput (× 103 txs/s)

Throughput (× 103 txs/s)


20000 5000 5000
TL2
4000 4000 LSA
15000 Lock
TL2 3000 3000
10000 LSA
Lock 2000 TL2 2000
5000 LSA
1000 Lock 1000
0 0 0
1 16 32 48 96 1 16 32 48 96 1 16 32 48 96
Number of threads Number of threads Number of threads
Fig. 10. The linked list benchmark (Sun).

LL, 256 elements, 0% updates LL, 256 elements, 5% updates LL, 256 elements, 20% updates
Throughput (× 103 txs/s)

Throughput (× 103 txs/s)

Throughput (× 103 txs/s)

30000 6000 6000


TL2 TL2
25000 5000 LSA 5000 LSA
Lock Lock
20000 4000 4000
TL2
15000 LSA 3000 3000
10000 Lock 2000 2000
5000 1000 1000
0 0 0
1 10 30 50 70 100 1 10 30 50 70 100 1 10 30 50 70 100
Number of threads Number of threads Number of threads
Fig. 11. The linked list benchmark (Azul).

SL, 16k elements, 0% updates SL, 16k elements, 20% updates SL, 16k elements, 50% updates
Throughput (× 106 txs/s)

Throughput (× 106 txs/s)

Throughput (× 106 txs/s)

100 30 15
TL2 TL2
80 25 LSA LSA
Lock Lock
20 10
60
15
40 TL2 10 5
LSA
20 Lock 5
0 0 0
1 16 48 96 128 196 256 1 16 48 96 128 196 256 1 16 48 96 128 196 256
Number of threads Number of threads Number of threads
Fig. 12. The skiplist benchmark (Sun).

On the other hand we observe that the Deuce STMs scale in an impressive way
even with 20% updates (up to 32 threads). While DSTM2 shows no scalability. When
we investigated deeper we found out that most of DSTM2 overhead rest in two areas.
The most signicant area is the contention manager which acts as a contention point,
the second area is the reection mechanism used by DSTM2 to retrieve and assign
values during the transaction and in commit.
SL, 16k elements, 0% updates SL, 16k elements, 20% updates SL, 16k elements, 50% updates

Throughput (× 106 txs/s)

Throughput (× 106 txs/s)

Throughput (× 106 txs/s)


100 30 4
TL2 TL2 TL2
80 LSA 25 LSA LSA
Lock Lock 3 Lock
20
60
15 2
40
10
20 1
5
0 0 0
1 30 50 70 100 140 180 1 30 50 70 100 140 180 1 30 50 70 100 140 180
Number of threads Number of threads Number of threads
Fig. 13. The skiplist benchmark (Azul).

Then we present the results of the tests with the linked list and the skip list on
the Azul machine. The linked list benchmark is known to produce a large degree of
contention as all threads must traverse the same long sequence of nodes to reach the
target element. Results of experiments are shown in Figure 10 and Figure 11 for a
linked list with 256 elements and dierent update rates. We observe that because
there is little potential for parallelism, the single lock, which performs signicantly
less lock acquisition and tracking work than the STMs, wins in all three benchmarks.
The STMs scale a bit until 30 threads when there are up to 5% updates, and then
drop due to a rise in overhead with no benet in parallelism. With 20% updates
the max is reached at about 5 threads as after then we have statistically at least 2
concurrent updates and the STMs suer from repeated transactional aborts.
Next, we consider results from the skip list benchmark. Skip lists allow signi-
cantly greater potential for parallelism than linked lists because dierent threads are
likely to traverse the data structure following distinct paths. Results for a list with
16k elements are shown in Figure 12 and Figure 13. We observe that the STMs scale
in an impressive way as long as there is an increase in the level of parallelism, and
provide great scalability even with 20% updates as long as the abort rate remains at
a reasonable level. We added a benchmark with 50% updates to show that there is a
limit to their scalability, when we increase the fraction of updates to 50%, the STMs
in Figure 13 reach a melting point much earlier (at about 10 threads versus 100
threads in the 20% benchmark) and overall the lock wins again because the abort
rate is again high and the STMs pay with overhead without a gain in parallelism.
We note that typically search structures have about 10% updates.

Real applications. Our last benchmarks demonstrate how simple it's to replace the
critical sections with transactions. The rst takes a serial version of the Lee routing
algorithm and demonstrates how a simple replacement of the critical sections by
transactions signicantly improves scalability. While the second takes a non-multi-
threaded lock based a LRU cache implementation (Cache4J) and shows that it's very
simple to replace the critical sections but scalability isn't promised.
Circuit routing is the process of automatically producing an interconnection be-
tween electronic components. Lee's routing algorithm is attractive for parallelization
since circuits (as shown in [2]) consist of thousands of routes, each of which can
potentially be routed concurrently.
The graph in Figure 14 shows execution time (not throughput). As can be seen,
Deuce scales well, with the overall time decreasing even with a large board (Mem-
Board).
Cache4J is an LRU lock-based cache implementation. The implementation is
based on two internal data structures, a tree and a hash-map. The tree manages
the LRU while the hash-map holds the data. The Cache4J implementation is based
Lee-TM, MemBoard Cache4j (LSA)

Throughput (× 103 txs/s)


140 16
TL2
120 LSA 14

Execution time
100 12
80 10 0% updates
60 8 10% updates
40 6
20 4
0 2
1 10 15 20 25 30 1 16 32 48 64 80
Number of threads Number of threads

Fig. 14. The Lee Routing benchmark. Fig. 15. The cache4j benchmark.

on a single global lock which naturally leads to zero scalability. The graph in Fig-
ure 15 shows the result of replacing the single lock with transactions. As can be seen
Deuce doesn't scale well, with the overall throughput slightly decreasing. A quick
proling shows that the fact that Cache4J is an LRU cache implies that every get
operation also updated the internal tree. This alone makes every transaction an up-
date transaction, the fact that the total throughput remains almost the same even
with 80 threads encouraging due to the fact that transactions' biggest advantages,
part of scalability, is code simplicity and code robustness.
In conclusion, Deuce shows the typical performance patterns of an STM, good
scalability when there is a potential for parallelism, and unimpressive performance
when parallelism is low. The results, however, are very encouraging as we see that
the fact that Deuce is a pure Java library, with all the implied overheads, does not
prevent it from showing scalability even for small numbers of threads.

7 Conclusion
We introduced Deuce, a novel open-source Java framework for transactional mem-
ory. As we showed, Deuce is the rst ecient fully featured Java STM framework
that can be added to an existing application without changes to its compiler or li-
braries. It's a proof that one can design an ecient pure Java STM without a compiler
support.
Though much work obviously remains to be done, we believe Deuce is ready for
use by developers. It is freely downloadable from [Link]
under the Apache license.
In conclusion Table 2 shows a general overview comparing the two known Java
STMs AtomJava
1 and DTMS2 vs Deuce. This comparison shows that Deuce gets
the best from both worlds and provides new novel capabilities which yell much better
usability and performance than currently exists.
Application AtomJava DSTM2 Deuce
Locks Object based Object based Field based
Instrumentation Source Runtime Runtime
API Non-Intrusive Intrusive Non-Intrusive
Libraries support no-support no-support Runtime & oine support
Fields access Anonymous classes Reection low level Unsafe
Execution overhead Medium High Medium
Extensible Yes Yes Yes
Transaction context Atomic block Need to create a Callable @Atomic annotated method
Table 2. AtomJava vs DSTM2 vs Deuce
1
We don't show comparison benchmark results with AtomJava due technical limitations
and bugs found in its current implementation. Although small benchmarks showed Deuce
outperformed AtomJava
Acknowledgements. This paper was supported in part by grants from Sun Microsys-
tems, Intel Corporation, Microsoft Inc., as well as a grant 06/1344 from the Israeli
Science Foundation, European Union grant FP7-ICT-2007-1 (project VELOX), and
Swiss National Foundation grant 200021-118043.

References
1. B. Alpern, S. Augart, S. M. Blackburn, M. Butrico, A. Cocchi, P. Cheng, J. Dolby,
S. Fink, D. Grove, M. Hind, K. S. McKinley, M. Mergen, J. E. B. Moss, T. Ngo, and
V. Sarkar. The jikes research virtual machine project: building an open-source research
community. IBM Syst. J., 44(2):399417, 2005.
2. M. Ansari, C. Kotselidis, I. Watson, C. Kirkham, M. Luján, and K. Jarvis. Lee-tm:
A non-trivial benchmark suite for transactional memory. In Proc. of the International
Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), pages
196207, Berlin, Heidelberg, 2008. Springer-Verlag.
3. W. Binder, J. Hulaas, and P. Moret. Advanced java bytecode instrumentation. In Proc.
of the International Symposium on Principles and Practice of Programming in Java
(PPPJ), pages 135144, New York, NY, USA, 2007. ACM.
4. D. Dice, O. Shalev, and N. Shavit. Transactional locking ii. In Proc. of the International
Symposium on Distributed Computing (DISC), pages 194208, 2006.
5. T. Harris and K. Fraser. Language support for lightweight transactions. In Proc. of the
ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and
Applications (OOPSLA), pages 388402, New York, NY, USA, 2003. ACM.
6. M. Herlihy, V. Luchangco, and M. Moir. A exible framework for implementing software
Proc. of the ACM SIGPLAN Conference on Object-Oriented
transactional memory. In
Programming Systems, Languages and Applications (OOPSLA), pages 253262, New
York, NY, USA, 2006. ACM.
7. M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III. Software transactional
memory for dynamic-sized data structures. InProc. of the Annual ACM Symposium
on Principles of Distributed Computing (PODC), pages 92101, New York, NY, USA,
2003. ACM.
Proc. of
8. B. Hindman and D. Grossman. Atomicity via source-to-source translation. In
the Workshop on Memory System Performance and Correctness (MSPC), pages 8291,
New York, NY, USA, 2006. ACM.
9. Y. Ni, A. Welc, A.-R. Adl-Tabatabai, M. Bach, S. Berkowits, J. Cownie, R. Geva,
S. Kozhukow, R. Narayanaswamy, J. Olivier, S. Preis, B. Saha, A. Tal, and X. Tian.
Design and implementation of transactional constructs for C/C++. In Proc. of the
ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and
Applications (OOPSLA), pages 195212, New York, NY, USA, 2008. ACM.
10. N. Nystrom, M. R. Clarkson, and A. C. Myers. Polyglot: An extensible compiler frame-
work for java. In Proc. of the International Conference on Compiler Construction (CC),
pages 138152, 2003.
11. W. Pugh. Skip lists: a probabilistic alternative to balanced trees. Commun. ACM,
33(6):668676, 1990.
12. T. Riegel, P. Felber, and C. Fetzer. A lazy snapshot algorithm with eager validation. In
Proc. of the International Symposium on Distributed Computing (DISC), pages 284298,
Sep 2006.
13. T. Riegel, C. Fetzer, and P. Felber. Time-based transactional memory with scalable time
Proc. of the Annual ACM Symposium on Parallel Algorithms and Architectures
bases. In
(SPAA), pages 221228, New York, NY, USA, 2007. ACM.

View publication stats

You might also like