High-Performance Computing in Finance
Problems, Methods, and Solutions

Edited by
M. A. H. Dempster
Juho Kanniainen
John Keane
Erik Vynckier
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks
does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion
of MATLAB® software or related products does not constitute endorsement or sponsorship by The
MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742


© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-1-4822-9966-3 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information stor-
age or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
[Link] ([Link] or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Names: Dempster, M. A. H. (Michael Alan Howarth), 1938- editor. | Kanniainen,
Juho editor. | Keane, John. editor. | Vynckier, Erik. editor.
Title: High-performance computing in finance : problems, methods, and
solutions / [edited by] M.A.H. Dempster [and three others].
Description: Boca Raton, FL : CRC Press, 2018.
Identifiers: LCCN 2017052035| ISBN 9781482299663 (hardback) | ISBN
9781315372006 (ebook)
Subjects: LCSH: Finance--Mathematical models. | Finance--Data processing.
Classification: LCC HG106 .H544 2018 | DDC 332.01/5118--dc23
LC record available at [Link]

Visit the Taylor & Francis Web site at
[Link]

and the CRC Press Web site at
[Link]
Contents

Editors

Contributors

Introduction

I Computationally Expensive Problems in the Financial Industry

1 Computationally Expensive Problems in Investment Banking
Jonathan Rosen, Christian Kahl, Russell Goyder, and Mark Gibbs

2 Using Market Sentiment to Enhance Second-Order Stochastic Dominance Trading Models
Gautam Mitra, Christina Erlwein-Sayer, Cristiano Arbex Valle, and Xiang Yu

3 The Alpha Engine: Designing an Automated Trading Algorithm
Anton Golub, James B. Glattfelder, and Richard B. Olsen

4 Portfolio Liquidation and Ambiguity Aversion
Álvaro Cartea, Ryan Donnelly, and Sebastian Jaimungal

5 Challenges in Scenario Generation: Modeling Market and Non-Market Risks in Insurance
Douglas McLean

II Numerical Methods in Financial High-Performance Computing (HPC)

6 Finite Difference Methods for Medium- and High-Dimensional Derivative Pricing PDEs
Christoph Reisinger and Rasmus Wissmann

7 Multilevel Monte Carlo Methods for Applications in Finance
Michael B. Giles and Lukasz Szpruch

8 Fourier and Wavelet Option Pricing Methods
Stefanus C. Maree, Luis Ortiz-Gracia, and Cornelis W. Oosterlee

9 A Practical Robust Long-Term Yield Curve Model
M. A. H. Dempster, Elena A. Medova, Igor Osmolovskiy, and Philipp Ustinov

10 Algorithmic Differentiation
Uwe Naumann, Jonathan Hüser, Jens Deussen, and Jacques du Toit

11 Case Studies of Real-Time Risk Management via Adjoint Algorithmic Differentiation (AAD)
Luca Capriotti and Jacky Lee

12 Tackling Reinsurance Contract Optimization by Means of Evolutionary Algorithms and HPC
Omar Andres Carmona Cortes and Andrew Rau-Chaplin

13 Evaluating Blockchain Implementation of Clearing and Settlement at the IATA Clearing House
Sergey Ivliev, Yulia Mizgireva, and Juan Ivan Martin

III HPC Systems: Hardware, Software, and Data with Financial Applications

14 Supercomputers
Peter Schober

15 Multiscale Dataflow Computing in Finance
Oskar Mencer, Brian Boucher, Gary Robinson, Jon Gregory, and Georgi Gaydadjiev

16 Manycore Parallel Computation
John Ashley and Mark Joshi

17 Practitioner’s Guide on the Use of Cloud Computing in Finance
Binghuan Lin, Rainer Wehkamp, and Juho Kanniainen

18 Blockchains and Distributed Ledgers in Retrospective and Perspective
Alexander Lipton

19 Optimal Feature Selection Using a Quantum Annealer
Andrew Milne, Mark Rounds, and Peter Goddard

Index
Editors

Michael Dempster is Professor Emeritus, Centre for Financial Research,
University of Cambridge. He has held research and teaching appointments
at leading universities globally and is founding editor-in-chief of Quantitative
Finance. His numerous papers and books have won several awards, and he
is Honorary Fellow of the IFoA, Member of the Accademia dei Lincei, and
managing director of Cambridge Systems Associates.

Juho Kanniainen is Professor of Financial Engineering at Tampere University
of Technology, Finland. He has served as coordinator of two inter-
national EU-programs: HPC in Finance ([Link]fi[Link]) and Big Data
in Finance ([Link]fi[Link]). His research is broadly in quantitative
finance, focusing on computationally expensive problems and data-driven
approaches.

John Keane is Professor of Data Engineering in the School of Computer
Science at the University of Manchester, UK. As part of the UK government’s
Foresight Project, The Future of Computer Trading in Financial Markets,
he co-authored a commissioned economic impact assessment review. He has
been involved in both the EU HPC in Finance and Big Data in Finance
programs. His wider research interests are data and decision analytics and
related performance aspects.

Erik Vynckier is board member of Foresters Friendly Society, partner of
InsurTech Venture Partners, and chief investment officer of Eli Global, following
a career in banking, insurance, asset management, and petrochemical
industry. He co-founded EU initiatives on high performance computing and
big data in finance. Erik graduated as MBA at London Business School and
as chemical engineer at Universiteit Gent.

Contributors

John Ashley
NVIDIA
Santa Clara, California

Brian Boucher

Luca Capriotti
Quantitative Strategies
Investment Banking Division
and
Department of Mathematics
University College London
London, United Kingdom

Álvaro Cartea
Mathematical Institute
Oxford-Man Institute of Quantitative Finance
University of Oxford
Oxford, United Kingdom

Omar Andres Carmona Cortes
Computation Department
Instituto Federal do Maranhão
São Luis, Brazil

M. A. H. Dempster
Centre for Financial Research
University of Cambridge
and
Cambridge Systems Associates
Cambridge, United Kingdom

Jens Deussen
Department of Computer Science
RWTH Aachen University
Germany

Ryan Donnelly
Swiss Finance Institute
École Polytechnique Fédérale de Lausanne
Switzerland

Jacques du Toit
The Numerical Algorithms Group Ltd.
United Kingdom

Christina Erlwein-Sayer
OptiRisk Systems

Georgi Gaydadjiev

Mark Gibbs
Quantitative Research
FINCAD

Michael B. Giles
Mathematical Institute
University of Oxford
Oxford, United Kingdom

James B. Glattfelder
Department of Banking and Finance
University of Zurich
Zurich, Switzerland

Peter Goddard
1QBit
Vancouver, Canada

Anton Golub
Lykke Corporation
Zug, Switzerland

Russell Goyder
Quantitative Research
FINCAD

Jon Gregory

Jonathan Hüser
Department of Computer Science
RWTH Aachen University
Germany

Sergey Ivliev
Lykke Corporation
Switzerland
and
Laboratory of Cryptoeconomics and Blockchain Systems
Perm State University
Russia

Sebastian Jaimungal
Department of Statistical Sciences
University of Toronto
Canada

Mark Joshi
Department of Economics
University of Melbourne
Melbourne, Australia

Christian Kahl
Quantitative Research
FINCAD

Juho Kanniainen
Tampere University of Technology
Tampere, Finland

Jacky Lee
Quantitative Strategies
Investment Banking Division
New York

Binghuan Lin
Tampere University of Technology
Tampere, Finland

Alexander Lipton
Stronghold Labs
Chicago, Illinois
and
MIT Connection Science and Engineering
Cambridge, Massachusetts

Stefanus C. Maree
Centrum Wiskunde & Informatica
Amsterdam, The Netherlands

Juan Ivan Martin
International Air Transport Association

Douglas McLean
Moody’s Analytics
Edinburgh, Scotland, United Kingdom

Elena A. Medova
Centre for Financial Research
University of Cambridge
and
Cambridge Systems Associates
Cambridge, United Kingdom

Oskar Mencer

Andrew Milne
1QBit
Vancouver, Canada

Gautam Mitra
OptiRisk Systems and Department of Computer Science
UCL, London, United Kingdom

Yulia Mizgireva
Lykke Corporation
Switzerland
and
Laboratory of Cryptoeconomics and Blockchain Systems
Perm State University
Russia
and
Department of Mathematics
Ariel University
Israel

Uwe Naumann
The Numerical Algorithms Group Ltd.
United Kingdom

Richard B. Olsen
Lykke Corporation
Zug, Switzerland

Cornelis W. Oosterlee
Centrum Wiskunde & Informatica
Amsterdam, The Netherlands
and
Delft Institute of Applied Mathematics
Delft University of Technology
Delft, The Netherlands

Luis Ortiz-Gracia
Department of Econometrics
University of Barcelona
Barcelona, Spain

Igor Osmolovskiy
Cambridge Systems Associates
Cambridge, United Kingdom

Andrew Rau-Chaplin
Faculty of Computer Science
Dalhousie University
Halifax, NS, Canada

Christoph Reisinger
Mathematical Institute
and
Oxford-Man Institute of Quantitative Finance
University of Oxford
United Kingdom

Gary Robinson

Jonathan Rosen
Quantitative Research
FINCAD

Mark Rounds
1QBit
Vancouver, Canada

Peter Schober
Goethe University Frankfurt
Chair of Investment Portfolio Management and Pension Finance
Frankfurt, Hesse, Germany

Lukasz Szpruch
School of Mathematics
University of Edinburgh
Edinburgh, United Kingdom

Philipp Ustinov
Cambridge Systems Associates
Cambridge, United Kingdom

Cristiano Arbex Valle
OptiRisk Systems
United Kingdom

Rainer Wehkamp
Techila Technologies Limited
Tampere, Finland

Rasmus Wissmann
Mathematical Institute
University of Oxford
Oxford, United Kingdom

Xiang Yu
OptiRisk Systems
United Kingdom
Introduction

As lessons are being learned from the recent financial crisis and unsuccess-
ful stress tests, demand for superior computing power has been manifest in
the financial and insurance industries for reliability of quantitative models
and methods and for successful risk management and pricing. From a prac-
titioner’s viewpoint, the availability of high-performance computing (HPC)
resources allows the implementation of computationally challenging advanced
financial and insurance models for trading and risk management. Researchers,
on the other hand, can develop new models and methods to relax unrealis-
tic assumptions without being limited to achieving analytical tractability to
reduce computational burden. Although several topics treated in these pages
have been recently covered in specialist monographs (see, e.g., the references),
we believe this volume to be the first to provide a comprehensive up-to-date
account of the current and near-future state of HPC in finance.
The chapters of this book cover three interrelated parts: (i) Computation-
ally expensive financial problems, (ii) Numerical methods in financial HPC,
and (iii) HPC systems, software, and data with financial applications. They
consider applications which can be more efficiently solved with HPC, together
with topic reviews introducing approaches to reducing computational costs
and elaborating how different HPC platforms can be used for different finan-
cial problems.
Part I offers perspectives on computationally expensive problems in the
financial industry.
In Chapter 1, Jonathan Rosen, Christian Kahl, Russell Goyder, and Mark
Gibbs provide a concise overview of computational challenges in derivative
pricing, paying special attention to counterparty credit risk management. The
incorporation of counterparty risk in pricing generates a huge demand for com-
puting resources, even with vanilla derivative portfolios. They elaborate pos-
sibilities with different computing hardware platforms, including graphic pro-
cessing units (GPU) and field-programmable gate arrays (FPGA). To reduce
hardware requirements, they also discuss an algorithmic approach, called algo-
rithmic differentiation (AD), for calculating sensitivities.
In Chapter 2, Gautam Mitra, Christina Erlwein-Sayer, Cristiano Arbex
Valle, and Xiang Yu describe a method for generating daily trading signals
to construct second-order stochastic dominance (SSD) portfolios of exchange-
traded securities. They provide a solution for a computationally (NP) hard
optimization problem and illustrate it with real-world historical data for the
FTSE100 index over a 7-year back-testing period.
In Chapter 3, Anton Golub, James B. Glattfelder, and Richard B. Olsen
introduce an event-based approach for automated trading strategies. In their
methodology, in contrast to the usual continuity of physical time, only events
(interactions) make the system’s clock tick. This approach to designing auto-
mated trading models yields an algorithm that possesses many desired features
and can be realized with reasonable computational resources.
In Chapter 4, Álvaro Cartea, Ryan Donnelly, and Sebastian Jaimungal
consider the optimal liquidation of a position using limit orders. They focus
on the question of how the misspecification of a model affects the trading
strategy. In some cases, a closed-form expression is available, but additional
relevant features in the framework make the model more realistic at the cost
of not having closed-form solutions.
In Chapter 5, Douglas McLean discusses challenges associated with economic
scenario generation (ESG) within an insurance context. Under Pillar 1
of the Solvency II directive, insurers who use their own models need to produce
multi-year scenario sets in their asset and liability modeling systems, which is
a computationally hard problem. McLean provides illustrative examples and
discusses the aspects of high-performance computing in ESG as well.
Part II focuses on numerical methods in financial high-performance com-
puting (HPC).
First, in Chapter 6, Christoph Reisinger and Rasmus Wissmann consider
finite difference methods in derivative pricing. Numerical methods with par-
tial differential equations (PDE) perform well with special cases, but compu-
tational problems arise with high-dimensional problems. Reisinger and Wiss-
mann consider different decomposition methods, review error analysis, and
provide numerical examples.
In Chapter 7, Mike Giles and Lukasz Szpruch provide a survey on the
progress of the multilevel Monte Carlo method introduced to finance by
the first author. Multilevel Monte Carlo has now become a widely applied
variance reduction method. Giles and Szpruch introduce the idea of multi-
level Monte Carlo simulation and discuss the numerical methods that can be
used to improve computational costs. They consider several financial applica-
tions, including Monte Carlo Greeks, jump-diffusion processes, and the multi-
dimensional Milstein scheme.
In Chapter 8, Stef Maree, Luis Ortiz-Gracia, and Cornelis Oosterlee discuss
Fourier and wavelet methods in option pricing. First, they review different
methods and then numerically show that the COS and SWIFT (Shannon
wavelet inverse Fourier technique) methods exhibit exponential convergence.
HPC can be used in the calibration procedure by parallelizing option pricing
with different strikes.
In Chapter 9, Michael Dempster, Elena Medova, Igor Osmolovskiy, and
Philipp Ustinov consider a three-factor Gaussian yield curve model that is
used for scenario simulation in derivative valuation, investment modeling, and
asset-liability management. The authors propose a new approximation of the
Black (1995) correction to the model to accommodate nonnegative (or negative
lower bounded) interest rates and illustrate it by calibrating yield curves
for the four major currencies, EUR, GBP, USD, and JPY, using the unscented
Kalman filter and estimating 10-year bond prices both in and out of sample.
Calibration times are comparable to those for the original model and with
cloud computing they can be reduced from a few hours to a few minutes.
In Chapter 10, Uwe Naumann, Jonathan Hüser, Jens Deussen, and Jacques
du Toit review the concept of algorithmic differentiation (AD) and adjoint
algorithmic differentiation (AAD), which has been gaining popularity in com-
putational finance over recent years. With AD, one can efficiently compute the
derivatives of the primal function with respect to a specified set of parame-
ters, and therefore it is highly relevant for the calculation of option sensitivities
with Monte Carlo. The authors discuss aspects of implementation and provide
three case studies.
In Chapter 11, Luca Capriotti and Jacky Lee consider adjoint algorithmic
differentiation (AAD) by providing three case studies for real-time risk man-
agement: interest rate products, counterparty credit risk management, and
volume credit products. AAD is found to be extremely beneficial in these
applications, as it speeds up the computation of price sensitivities by several
orders of magnitude, both in Monte Carlo applications and in applications
involving faster numerical methods.
In Chapter 12, Omar Andres Carmona Cortes and Andrew Rau-Chaplin
use evolutionary algorithms for the optimization of reinsurance contracts.
They use population-based incremental learning and differential evolution
algorithms for the optimization. With a case study they demonstrate the par-
allel computation of an actual contract problem.
In Chapter 13, which ends Part II, Sergey Ivliev, Yulia Mizgireva, and
Juan Ivan Martin consider implementation of blockchain technologies for the
clearing and settlement procedure of the IATA Clearing House. They develop
a simulation model to evaluate the industry-level benefits of the adoption of
blockchain-based industry money for clearing and settlement.
Part III considers different computational platforms, software, and data
with financial applications.
In Chapter 14, Peter Schober provides a summary of supercomputing,
including aspects of hardware platforms, programming languages, and par-
allelization interfaces. He discusses supercomputers for financial applications
and provides case studies on the pricing of basket options and optimizing life
cycle investment decisions.
In Chapter 15, Oskar Mencer, Brian Boucher, Gary Robinson, Jon Gregory,
and Georgi Gaydadjiev describe the concept of multiscale data-flow computing,
which can be used for special-purpose computing on a customized architecture,
leading to increased performance. The authors review the data-flow paradigm,
describe Maxeler data-flow systems, outline the data-flow-oriented program-
ming model that is used in Maxeler systems, and discuss how to develop
data-flow applications in practice, and how to improve their performance.
Additionally, they provide a case study to estimate correlations between a
large number of security returns. Other financial applications including inter-
est rate swaps, value-at-risk, option pricing, and credit value adjustment cap-
ital are also considered.
In Chapter 16, John Ashley and Mark Joshi provide grounding in the
underlying computer science and system architecture considerations needed to
take advantage of future computing hardware in computational finance. They
argue for a parallelism imperative, driven by the state of computer hardware
and current manufacturing trends. The authors provide a concise summary
of system architecture, consider parallel computing design and then provide
case studies on the LIBOR market model and the Monte Carlo pricing of early
exercisable Bermudan derivatives.
In Chapter 17, Binghuan Lin, Rainer Wehkamp, and Juho Kanniainen
review cloud computing for financial applications. The section is written for
practitioners and researchers who are interested in using cloud computing for
various financial applications. The authors elaborate the concept of cloud com-
puting, discuss suitable applications, and consider possibilities and challenges
of cloud computing. Special attention is given to an implementation example
with Techila middleware and to case studies on portfolio optimization.
In Chapter 18, Alexander Lipton introduces blockchains (BC) and dis-
tributed ledgers (DL) and describes their potential applications to money and
banking. He presents historical instances of BC and DL. The chapter introduces
the modern version of the monetary circuit and how it can benefit from
the BC and DL framework. Lipton shows how central bank-issued digital currencies
can be used to move away from the current fractional reserve banking
toward narrow banking.
In Chapter 19, the last chapter in the book which points to the future,
Andrew Milne, Mark Rounds and Peter Goddard consider feature selection for
credit scoring and classification using a quantum annealer. Quantum comput-
ing has received much attention, but is still in its infancy. In this chapter, with
the aid of the 1QBit Quantum-Ready Software Development Kit, the authors
apply quantum computing using the D-Wave quantum simulated annealing
machine to a well-known financial problem involving massive amounts of data
in practice, credit scoring and classification. They report experimental results
with German credit data.
For further in-depth background the reader should consult the
bibliography.

Acknowledgments
This book arose in part from the four-year EU Marie Curie project
High-Performance Computing in Finance (Grant Agreement Number 289032,
[Link]fi[Link]), which was recently completed. We would like to thank
all the partners and participants in the project and its several public events,
in particular its supported researchers, many of whom are represented in these
pages. We owe all our authors a debt of gratitude for their fine contributions
and for enduring a more drawn out path to publication than we had originally
envisioned. We would also like to express our gratitude to the referees and to
World Scientific and Emerald for permission to reprint Chapters 7¹ and 18²,
respectively. Finally, without the expertise and support of the editors and staff
at Chapman & Hall/CRC and Taylor & Francis this volume would have been
impossible. We extend to them our warmest thanks.
Michael Dempster
Juho Kanniainen
John Keane
Erik Vynckier
Cambridge, Tampere, Manchester
August 2017

Bibliography
1. Foresight: The Future of Computer Trading in Financial Markets. 2012.
Final Project Report, Government Office for Science, London.
https://[Link]/government/publications/future-of-computer-trading-in-financial-markets-an-international-perspective

2. De Schryver, C., ed. 2015. FPGA Based Accelerators for Financial Applications.
Springer.

3. Delong, Ł. 2013. Backward Stochastic Differential Equations with Jumps and
Their Actuarial and Financial Applications. London: Springer.

4. Zopounidis, C. and Galariotis, E. 2015. Quantitative Financial Risk Management:
Theory and Practice. New York, NY: John Wiley & Sons.

5. Uryasev, S. and Pardalos, P. M., eds. 2013. Stochastic Optimization: Algorithms
and Applications. Vol. 54. Berlin: Springer Science & Business Media.

6. John, K. 2016. WHPCF’14, Proceedings of the 7th Workshop on High Performance
Computational Finance. Special issue, Concurrency and Computation:
Practice and Experience 28(3).

7. John, K. 2015. WHPCF’15, Proceedings of the 8th Workshop on High Performance
Computational Finance, SC’15 The International Conference for High
Performance Computing, Networking, Storage and Analysis. New York: ACM.

¹ Which appeared originally in Recent Developments in Computational Finance, World
Scientific (2013).
² Which will appear in Lipton, A., 2018. Blockchains and distributed ledgers in retrospective
and perspective. Journal of Risk Finance, 19(1).


Part I

Computationally
Expensive Problems in the
Financial Industry

Chapter 1
Computationally Expensive
Problems in Investment Banking

Jonathan Rosen, Christian Kahl, Russell Goyder, and Mark Gibbs

CONTENTS
1.1 Background
  1.1.1 Valuation requirements
    [Link] Derivatives pricing and risk
    [Link] Credit value adjustment/debit value adjustment
    [Link] Funding value adjustment
  1.1.2 Regulatory capital requirements
    [Link] Calculation of market risk capital
    [Link] Credit risk capital
    [Link] Capital value adjustment
1.2 Trading and Hedging
1.3 Technology
  1.3.1 Hardware
    [Link] Central processing unit/floating point unit
    [Link] Graphic processing unit
    [Link] Field programmable gate array
    [Link] In-memory data aggregation
  1.3.2 Algorithmic differentiation
    [Link] Implementation approaches
    [Link] Performance
    [Link] Coverage
1.4 Conclusion
References

1.1 Background
Financial instruments traded on markets are essentially contractual agree-
ments between two parties that involve the calculation and delivery of quanti-
ties of monetary currency or its economic equivalent. This wider definition of
financial investments is commonly known as financial derivatives or options,
and effectively includes everything from the familiar stocks and bonds to the
most complex payment agreements, which also include complicated mathe-
matical logic for determining payment amounts, the so-called payoff of the
derivative.
In the early days of derivatives, they were thought of more like traditional
investments and treated as such on the balance sheet of a business. Complex-
ity mainly arose in the definition and calculation of the option payoff, and
applying theoretical considerations to price them. It was quickly discovered
that probabilistic models, for economic factors on which option payoffs were
calculated, had to be quite restrictive in order to produce computationally
straightforward problems in pricing the balance sheet fair mark-to-market
value. The development of log-normal models from Black and Scholes was
quite successful in demonstrating not only a prescient framework for deriva-
tive pricing, but also the importance of tractable models in the practical appli-
cation of risk-neutral pricing theory, at a time when computational facilities
were primitive by modern standards. Developments since then in quantitative
finance have been accompanied by simultaneous advancement in computing
power, and this has opened the door to alternative computational methods
such as Monte Carlo, PDE discretization, and Fourier methods, which have
greatly increased the ability to price derivatives with complex payoffs and
optionality.
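
As a rough illustration of the trade-off described above, the following sketch prices the same call option twice: once with a closed-form log-normal formula and once by Monte Carlo simulation. It is a minimal example under stated assumptions, not the authors' implementation: the payoff is undiscounted, the forward is log-normal with constant volatility, and the function names (`black_call`, `monte_carlo_call`) and the numerical inputs are hypothetical choices made for this example.

```python
import math
import random


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_call(forward: float, strike: float, sigma: float, t: float) -> float:
    """Closed-form undiscounted call value under a log-normal forward."""
    d1 = (math.log(forward / strike) + 0.5 * sigma**2 * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return forward * norm_cdf(d1) - strike * norm_cdf(d2)


def monte_carlo_call(forward, strike, sigma, t, n_paths=200_000, seed=42):
    """Monte Carlo estimate of the same undiscounted call value.

    Returns a tuple (estimate, standard_error).
    """
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    drift = -0.5 * sigma**2 * t        # keeps the mean of the terminal value at `forward`
    vol = sigma * math.sqrt(t)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_paths):
        terminal = forward * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        payoff = max(terminal - strike, 0.0)
        total += payoff
        total_sq += payoff * payoff
    mean = total / n_paths
    variance = max(total_sq / n_paths - mean * mean, 0.0)
    return mean, math.sqrt(variance / n_paths)


if __name__ == "__main__":
    # Illustrative inputs only (assumed, not taken from the text).
    exact = black_call(100.0, 105.0, 0.2, 1.0)
    estimate, stderr = monte_carlo_call(100.0, 105.0, 0.2, 1.0)
    print(f"closed form: {exact:.4f}")
    print(f"Monte Carlo: {estimate:.4f} +/- {stderr:.4f}")
```

The closed form costs a handful of floating-point operations, whereas the simulation needs many thousands of paths to reach a comparable accuracy. The simulation, however, carries over to payoffs and models for which no formula exists, which is the flexibility the paragraph above attributes to increased computing power.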
Nevertheless, the crisis of 2008 revealed that the above complexities
are only part of the problem. Creditworthiness was eventually
revealed to be a major influence on the business balance sheet in the event
that a counterparty in a derivatives contract fails to meet the
terms for payment. It was recognized that market events could create such
a scenario due to clustering and tail events. In response to the explosion of
credit derivatives and subsequent global financial crisis, bilateral credit value
adjustments (CVAs) and funding cost adjustments were used to represent
the impact of credit events according to their likelihood in the accounting
balance sheet. Credit value adjustments and funding adjustments introduced
additional complexity into the business accounting for market participants.
While previously simple trades only required simple models and textbook
formulas to value, the CVA is a portfolio derivative problem requiring joint
modeling of many state variables and is often beyond the realm of simple
closed-form computation.
Meanwhile, the controversial decision to use taxpayer money to bail out
financial institutions in the 2008 crisis ignited strong political interest
in introducing regulation that requires the largest investors to maintain capital
holdings that meet appropriate thresholds commensurate with the financial
risk present in their balance sheets. In recent years, capital requirements have
been adopted globally, with regional laws determining the methods and
criteria for the calculation of regulatory capital holdings. The demand placed
on large market participants to apply additional value adjustments for tax
and capital funding costs requires modeling these effects over the lifetime of
the investment, which has introduced a large systemic level of complexity in
accounting for the financial investment portfolio, even when it is composed of
very simple investments.

1.1.1 Valuation requirements


1.1.1.1 Derivatives pricing and risk
In the early days of option trading, accounting principles were completely
decoupled from financial risk measures commonly known today as market risk.
Investors holding derivatives on their balance sheets were mainly concerned
with the ability to determine the fair market value. In the early derivative
markets, the role of central exchanges was quite limited, and most derivative
contracts directly involved two counterparties; these trades were known as
over-the-counter (OTC). As derivative payoffs became more closely tailored
to individual investors, the theoretical pricing of exotic trades was increas-
ingly complex, and there was no possibility for liquid quotes for investor hold-
ings, meaning theoretical risk-neutral considerations became paramount to
the accounting problem of mark-to-market balance sheets.
The innovation in this area began with the seminal work of Black and
Scholes, leading to tractable and computationally efficient theoretical
derivative pricing formulas, which were widely adopted and incorporated into
trading and accounting technology for derivative investors. An example is the
formula for the (at expiry) value $V_t$ of a call option struck at $K$ on a
stock whose expected value at expiry $t$ is $F_t$,

$$V_t = F_t \Phi(d_1) - K \Phi(d_2) \tag{1.1}$$

where $\Phi$ denotes the standard normal cumulative distribution function and

$$d_1 = \frac{1}{\sigma\sqrt{t}} \left[ \ln\frac{F_t}{K} + \frac{\sigma^2}{2}\, t \right], \qquad d_2 = d_1 - \sigma\sqrt{t} \tag{1.2}$$

where $\sigma$ is the volatility parameter in the log-normal distribution of $F_t$
assumed by the model. The number of analytic formulas that could be deduced
was restricted by the complexity of the payoff and the sophistication of the
dynamic model being used. Payoffs that require many looks at underlying fac-
tors and allow periodic or continuous exercise can greatly complicate or even
prohibit the ability to derive a useful analytic formula or approximation for
pricing and risk.
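As an illustration, Equations 1.1 and 1.2 can be evaluated in a few lines. The following is a minimal sketch rather than production pricing code; the function names `norm_cdf` and `black_call` and the sample inputs are chosen here for illustration only.

```python
from math import erf, log, sqrt


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function Phi(x)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def black_call(forward: float, strike: float, sigma: float, t: float) -> float:
    """At-expiry (undiscounted) call value from Equations 1.1 and 1.2."""
    d1 = (log(forward / strike) + 0.5 * sigma**2 * t) / (sigma * sqrt(t))
    d2 = d1 - sigma * sqrt(t)
    return forward * norm_cdf(d1) - strike * norm_cdf(d2)


# Illustrative inputs (assumptions): at-the-money, 20% volatility, one year.
value = black_call(forward=100.0, strike=100.0, sigma=0.2, t=1.0)
print(f"call value: {value:.4f}")
```

The cost of one evaluation is a handful of floating-point operations, which is why closed-form formulas of this kind remain the baseline against which numerical methods are compared.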
Meanwhile, the state-of-the-art in computation was becoming more
advanced, leading to modern numerical techniques being applied to derivative
pricing problems. Considering the fundamental theorem of asset pricing in
conjunction with the Feynman-Kac theorem, one arrives at an equivalence
between the expectation of a functional of a stochastic process and a partial
(integro-)differential equation, which can be further transformed analytically
using Fourier method-based approaches, leading to a system of ordinary
differential equations. All three formulations might allow for further
simplifications leading to closed-form solutions, or need to be solved
numerically.

Partial (integro-)differential equations: Partial differential equations form
the backbone of continuous-time pricing theory, and a numerical
approximation to the solution can be achieved with finite difference methods as
well as many other techniques. However, this approach suffers from the
so-called curse of dimensionality, very similar to the case of multivariate
quadrature, in that the tractability of this approach breaks down for models
with many underlying factors.
Fourier methods: The mathematical formulation of a conjugate variable to
time can be used to greatly simplify convolution integrals that appear in
derivative pricing problems by virtue of the Plancherel equality. The use of
a known characteristic function of a dynamic process allows fast numerical
derivative pricing for options with many looks at the underlying factor.

Monte Carlo simulation: Monte Carlo methods offer a very generic tool
to approximate the functional of a stochastic process, and can also
deal effectively with path-dependent payoff structures. The most general
approach to derivative pricing is based on pathwise simulation of time-
discretized stochastic dynamical equations for each underlying factor. The
advantage is in being able to handle any option payoff and exercise style,
as well as enabling models with many correlated factors. The disadvantage
is the overall time-consuming nature and very high level of complexity in
performing a full simulation.
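The pathwise simulation approach described above can be sketched as follows. This is a minimal example under stated assumptions: the underlying is taken to be a driftless log-normal process whose expected value at expiry is the forward of Equation 1.1, the function name `simulate_call_prices` is hypothetical, and the arithmetic-average (Asian) payoff is included only to show that a path-dependent payoff needs no change to the simulation loop.

```python
import random
from math import exp, sqrt


def simulate_call_prices(forward, strike, sigma, expiry, n_steps, n_paths, seed=42):
    """Estimate European and arithmetic-Asian call values by path simulation.

    Assumption: driftless log-normal dynamics, so the expected terminal
    level equals `forward`, consistent with Equation 1.1.
    """
    rng = random.Random(seed)
    dt = expiry / n_steps
    drift = -0.5 * sigma**2 * dt
    vol = sigma * sqrt(dt)
    european_sum = 0.0
    asian_sum = 0.0
    for _ in range(n_paths):
        level = forward
        running_total = 0.0
        for _ in range(n_steps):
            # Exact log-normal step over one time interval.
            level *= exp(drift + vol * rng.gauss(0.0, 1.0))
            running_total += level
        european_sum += max(level - strike, 0.0)
        asian_sum += max(running_total / n_steps - strike, 0.0)
    return european_sum / n_paths, asian_sum / n_paths


european, asian = simulate_call_prices(
    100.0, 100.0, 0.2, 1.0, n_steps=12, n_paths=20000
)
print(f"European: {european:.3f}, Asian: {asian:.3f}")
```

Even this small example draws 240,000 random numbers to reproduce a single closed-form value to within sampling error, which illustrates the cost disadvantage noted above.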

Besides dealing with the complexity of the option payoff, the Black–Scholes
formula made use of a single risk-free interest rate and demonstrated that in
the theoretical economy, this rate had central importance for the time value
of money and the expected future growth of risk-neutral investment strategies
analogous to single currency derivatives. This means a single discount curve
per currency was all that was needed, which by modern standards led to
a fairly simple approach in the pricing of derivatives. For example, a spot-
starting interest rate swap, which has future cash flows that are calculated
using a floating rate that must be discounted to present value, would use
a single curve for the term structure of interest rates to calculate both the
risk-neutral implied floating rates and the discount rates for future cash flows.
However, the turmoil of 2008 revealed that collateral agreements were of
central importance in determining the relevant time value of money to use for
discounting future cash flows, and it quickly became important to separate
discounting from forward rate projection for collateralized derivatives. The
subsequent computational landscape required building multiple curves in a
single currency to account for institutional credit risk in lending at different
tenors.
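The effect of separating discounting from forward rate projection can be seen in a small sketch. The assumptions here are deliberately crude: both curves are flat and continuously compounded, the notional is 1, and the function name `payer_swap_pv` and all rate levels are hypothetical.

```python
from math import exp


def payer_swap_pv(fixed_rate, discount_rate, projection_rate, maturity, accrual=0.5):
    """PV of paying fixed and receiving floating on unit notional.

    Assumption: flat, continuously compounded curves. Forwards come from the
    projection curve; cash flows are discounted on the discount curve.
    """
    n_periods = round(maturity / accrual)
    pv = 0.0
    for i in range(1, n_periods + 1):
        start, end = accrual * (i - 1), accrual * i
        # Simple forward rate implied by projection-curve pseudo-discount factors.
        p_start = exp(-projection_rate * start)
        p_end = exp(-projection_rate * end)
        forward = (p_start / p_end - 1.0) / accrual
        discount = exp(-discount_rate * end)
        pv += accrual * discount * (forward - fixed_rate)
    return pv


# Single curve: the same 3% curve projects forwards and discounts cash flows.
single_curve_pv = payer_swap_pv(0.02, 0.03, 0.03, 5.0)
# Dual curve: forwards from the 3% curve, discounting at an assumed 1% collateral rate.
dual_curve_pv = payer_swap_pv(0.02, 0.01, 0.03, 5.0)
print(f"single: {single_curve_pv:.5f}, dual: {dual_curve_pv:.5f}")
```

With identical forward rates, the two present values differ only through discounting, which is the separation the multi-curve approach makes explicit.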

On top of this problem is the risk calculation, which requires the sensitivity
of the derivative price to various inputs and calculated quantities inside the
numerical calculation. A universal approach to this is to add a small amount
to each quantity of interest and approximate the risk with a finite difference
calculation, known as bumping. While this can be applied for all numeri-
cal and closed-form techniques, it does require additional calculations which
themselves can be time-consuming and computationally expensive. The modern
approach is to compute analytic risk at the software library level, a
technique known as algorithmic differentiation (AD). Some libraries, such as
our own, produce analytic risk for all numerical pricing calculations, as
described in Section 1.3.2.
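To make the cost of bumping concrete, the sketch below computes two sensitivities of the formula in Equation 1.1 both ways. It is an illustration only: the helper names are hypothetical, the bump size is an arbitrary choice, and the analytic expressions are the standard derivatives of Equation 1.1 with respect to the forward and the volatility.

```python
from math import erf, exp, log, pi, sqrt


def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def norm_pdf(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)


def black_call(forward, strike, sigma, t):
    d1 = (log(forward / strike) + 0.5 * sigma**2 * t) / (sigma * sqrt(t))
    return forward * norm_cdf(d1) - strike * norm_cdf(d1 - sigma * sqrt(t))


def bumped_risk(forward, strike, sigma, t, bump=1e-5):
    """One-sided finite differences: one base valuation plus one per input."""
    base = black_call(forward, strike, sigma, t)
    delta = (black_call(forward + bump, strike, sigma, t) - base) / bump
    vega = (black_call(forward, strike, sigma + bump, t) - base) / bump
    return delta, vega


def analytic_risk(forward, strike, sigma, t):
    """Closed-form derivatives of Equation 1.1 with respect to F and sigma."""
    d1 = (log(forward / strike) + 0.5 * sigma**2 * t) / (sigma * sqrt(t))
    return norm_cdf(d1), forward * norm_pdf(d1) * sqrt(t)


bumped = bumped_risk(100.0, 100.0, 0.2, 1.0)
exact = analytic_risk(100.0, 100.0, 0.2, 1.0)
print("bumped:", bumped, "analytic:", exact)
```

Bumping required three valuations for two sensitivities; for a pricing routine with hundreds of inputs, the number of revaluations grows in proportion, whereas the analytic route does not revalue at all.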

1.1.1.2 Credit value adjustment/debit value adjustment


Since the credit crisis in 2008, counterparty credit risk of derivative posi-
tions has become an increasingly important subject. The IFRS 9 standard
required a fair value option for derivative investment accounting (Ramirez,
2015) and originally proposed to include CVA for application to hedge
accounting, to represent the effect of an entity’s risk management activities
that use financial instruments to manage market risk exposures that could
affect overall profit or loss. IFRS 13 set out requirements intended to account
for the risk that the counterparty of the financial derivative or the entity will
default before the maturity/expiration of the transaction and will be unable
to meet all contractual payments, thereby resulting in a loss for the entity or
the counterparty, which required accounting for CVA and a debit value adjust-
ment (DVA) in the balance sheet as nonperformance risk. However, IFRS 9
does not provide direct guidance on how CVA or DVA is to be calculated,
beyond requiring that the resulting fair value must reflect the credit quality
of the instrument.
There are a lot of choices to be made when computing CVA, including the
choice of model used and the type of CVA to be computed. In addition, there
are further decisions on whether it is unilateral or bilateral (Gregory, 2009),
what type of closeout assumptions to make, how to account for collateral, and
considerations for including first-to-default, which will be discussed later. The
variety of possible definitions and modeling approaches used in CVA calcula-
tion has led to discrepancies in valuation across different financial institutions
(Watt, 2011). There are a variety of ways to determine CVA; however, it
is often a computationally expensive exercise due to the substantial number
of modeling factors, assumptions involved, and the interaction among these
assumptions. More specifically, CVA is a complex derivative pricing problem
on the entire portfolio, which has led to substantially increasing the complex-
ity of derivative pricing.
CVA is a measure of the exposure of a portfolio of products to counterparty
default: if such an event occurs within some time interval, the expected
loss is the positive part of the value of the remainder of the portfolio after
the event. This is still in the realm of the familiar risk-neutral assumption
of derivative pricing, in the form of an adjustment to the fair value. By
conditioning on the credit event explicitly and accumulating the contribution
from each time interval over the life of the portfolio, an amount is obtained
by which the value of the portfolio can be modified in order to account for the
exposure to counterparty default. This can be expressed formally as
$$\mathrm{CVA} = E\left[ \int_0^T \max[\hat{V}(t), 0]\, D(t)\, (1 - R(t))\, I_{\tau \in (t, t+dt)} \right] \tag{1.3}$$

where $R(t)$ is the recovery rate at time $t$, $\hat{V}(t)$ the value at time $t$ of the
remainder of the underlying portfolio whose value is $V(t)$, $D(t)$ the discount
factor at time $t$, $\tau$ the time of default, and $I_{\tau \in (t,t+dt)}$ is a default indicator,
evaluating to 1 if $\tau$ lies between time $t$ and $t + dt$ and 0 otherwise. Note
that Equation 1.3 does not account for default of the issuer, an important
distinction which is in particular relevant for regulatory capital calculation
purposes (Albanese and Andersen, 2014). Including issuer default is commonly
referred to as first-to-default CVA (FTDCVA).
The first step in proceeding with computation of CVA is to define a time
grid over periods in which the portfolio constituents expose either party to
credit risk. Next, the present value of the exposure upon counterparty default
is calculated at each point in time. Often this involves simplifying assump-
tions, such that the risk-neutral drift and volatility are sufficient to model
the evolution of the underlying market factors of the portfolio, and similarly
that counterparty credit, for example, is modeled as a jump-to-default process
calibrated to observable market CDS quotes to obtain corresponding survival
probabilities.
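The steps above (time grid, exposure at each grid point, survival probabilities) can be sketched as a discretized version of Equation 1.3. The example rests on several simplifying assumptions that are not part of the text: a single forward-style position whose value is a driftless log-normal level minus a strike, a constant hazard rate in place of a CDS-calibrated curve, a constant recovery rate and interest rate, and independence between exposure and default. The function name `simulate_cva` is hypothetical.

```python
import random
from math import exp, sqrt


def simulate_cva(spot, strike, sigma, rate, hazard, recovery,
                 maturity, n_steps, n_paths, seed=7):
    """Discretized Equation 1.3 under the simplifying assumptions above."""
    rng = random.Random(seed)
    dt = maturity / n_steps
    # Sum of positive exposures max(V(t_i), 0) across paths, per grid point.
    positive_exposure = [0.0] * n_steps
    for _ in range(n_paths):
        level = spot
        for i in range(n_steps):
            level *= exp(-0.5 * sigma**2 * dt + sigma * sqrt(dt) * rng.gauss(0.0, 1.0))
            positive_exposure[i] += max(level - strike, 0.0)
    cva = 0.0
    for i in range(n_steps):
        t_prev, t = i * dt, (i + 1) * dt
        # Probability of default in (t_prev, t] from the survival curve exp(-hazard * t).
        default_prob = exp(-hazard * t_prev) - exp(-hazard * t)
        expected_exposure = positive_exposure[i] / n_paths
        cva += (1.0 - recovery) * exp(-rate * t) * expected_exposure * default_prob
    return cva


cva = simulate_cva(100.0, 100.0, 0.2, 0.03, 0.02, 0.4,
                   maturity=5.0, n_steps=20, n_paths=5000)
print(f"CVA: {cva:.4f}")
```

Each grid point requires a full revaluation of the position on every path; for a realistic portfolio, that revaluation is itself a pricing problem, which is the source of the computational expense.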
However, in practically all realistic situations, the investment portfolio of
OTC derivatives will contain multiple trades with any given counterparty. In
these situations, the entities typically execute netting agreements, such as the
ISDA master agreement, which aims to consider the overall collection of trades
as a whole, such that gains and losses on individual positions are offset against
one another. In the case that either party defaults, the settlement agreement
considers the single net amount rather than the potentially many individual
losses and gains. One major consequence of this is the need to consider CVA
on a large portfolio composed of arbitrary groups of trades entered with a
given counterparty. As Equation 1.3 suggests, the CVA and DVA on such a
portfolio cannot be simply decomposed into CVA on individual investments,
thus it is a highly nonlinear problem which requires significant computational
facilities in order to proceed.
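The reason netted exposure cannot be decomposed trade by trade is that the positive part of a sum is not the sum of positive parts. A minimal sketch, with made-up scenario values for three trades with one counterparty, shows the difference:

```python
def gross_exposure(trade_values):
    """Exposure without netting: each trade's positive value counts separately."""
    return sum(max(value, 0.0) for value in trade_values)


def netted_exposure(trade_values):
    """Exposure under a netting agreement: positive part of the net value."""
    return max(sum(trade_values), 0.0)


# Hypothetical values of three trades in three equally likely scenarios.
scenarios = [
    [10.0, -4.0, -3.0],
    [-2.0, 5.0, 1.0],
    [-6.0, -1.0, 2.0],
]
expected_gross = sum(gross_exposure(s) for s in scenarios) / len(scenarios)
expected_net = sum(netted_exposure(s) for s in scenarios) / len(scenarios)
print(f"gross: {expected_gross:.3f}, netted: {expected_net:.3f}")
```

Because the netted figure depends on the joint behavior of all trades in the netting set, every trade must be valued in the same scenario before the positive part is taken.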
Quantifying the extent of the risk mitigation benefit of collateral is impor-
tant, and this usually requires modeling assumptions to be made. Collateral
modeling often incorporates several mechanisms which in reality do not lead
to perfect removal of credit default risk, which can be summarized as follows
(Gregory, 2015).

Granularity of collateral posting: The existence of thresholds and minimum
transfer amounts, for example, can lead to over- and under-collateralization,
which can provide deficits and surpluses for funding under certain collateral
agreements or Credit Support Annexes (CSAs).
Delay in collateral posting: The operational complexities of modern CSA
management involve some delay between requesting and receiving collat-
eral, including the possibility of collateral dispute resolution.

Cash equivalence of collateral holdings: In situations where collateral
includes assets or securities that are not simply cash in the currency in
which exposure is assessed, the potential variation of the value of the
collateral holdings must be modeled over the life of the investment.
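The first two mechanisms can be illustrated with a toy collateral account. The sketch assumes a one-way agreement in which only the counterparty posts cash collateral, a single threshold and minimum transfer amount, and a delay of exactly one period between a call and its settlement; all function names and numbers are hypothetical.

```python
def collateral_call(exposure, held, threshold, min_transfer):
    """Amount to request (positive) or return (negative) under the CSA terms."""
    required = max(exposure - threshold, 0.0)
    transfer = required - held
    # Transfers smaller than the minimum transfer amount are not made.
    return transfer if abs(transfer) >= min_transfer else 0.0


def residual_exposures(exposures, threshold, min_transfer):
    """Exposure left uncovered at each date, with a one-period posting delay."""
    held = 0.0
    pending = 0.0
    residuals = []
    for exposure in exposures:
        held += pending  # collateral called in the previous period arrives now
        residuals.append(max(exposure - held, 0.0))
        pending = collateral_call(exposure, held, threshold, min_transfer)
    return residuals


residuals = residual_exposures([0.0, 3.0, 6.0, 6.4, 2.0], threshold=1.0, min_transfer=0.5)
print(residuals)
```

Even with collateral in place, the residual exposure is nonzero whenever the portfolio value moves between a call and its settlement, which is why collateral does not remove counterparty risk entirely.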

Given the arbitrary nature of the investment portfolio’s composition and
the nonlinear nature of CVAs, the computation of CVA typically requires
numerical methods beyond the scope of closed-form computation. Monte Carlo
simulation approaches are common. There is a framework for Monte Carlo
copula simulation described by Gibbs and Goyder (2013a). While simulation
approaches are highly general and well suited to the CVA problem, they are
known to be significantly more complex to define than closed-form methods,
often increasing calculation times by orders of magnitude, not to mention the
difficulty in setting up simulation technology in the most efficient and robust
ways possible for business accounting operations.
DVA, as a balance sheet value adjustment required under IFRS 9, is defined
analogously to CVA except it represents the complementary adjustment for
risk exposure in the financial portfolio to the credit of the entity. DVA reduces
the value of a liability, recognizing reduced losses when an entity’s own credit-
worthiness deteriorates. Some criticism of this balance sheet adjustment points
to the difficulty in realizing profits resulting from an entity’s default when clos-
ing out financial positions, and the possibility of significant volatility in profit
or loss in periods of credit market turmoil.
CVA and DVA introduce significant model dependence and risk into
the derivative pricing problem, as they are heavily affected by dynamic
features such as volatilities and correlations of the underlying portfolio
variables and counterparty credit spreads. Some open-ended problems suggest
that the modeling complexity can increase significantly in times of market
stress. Clustering and tail risk suggest non-Gaussian simulation processes and
copulae of many state variables are necessary to accurately reflect credit
risk in times of market turmoil. Wrong way risk can also be an impor-
tant effect, which is additional risk introduced by correlation of the invest-
ment portfolio and default of the counterparty. Additionally, the incorpora-
tion of first-to-default considerations can have significant effects (Brigo et al.,
2012). The sum of these complexities introduced by IFRS 9 led to a contin-
uous need for computationally expensive credit risk modeling in derivative
pricing.

1.1.1.3 Funding value adjustment


A more recently debated addition to the family of value adjustments is the
funding value adjustment (FVA). Until recently, the conceptual foundations
of FVA were less well understood relative to counterparty risk (Burgard and
Kjaer, 2011), though despite the initial debate, there is consensus that funding
is relevant to theoretical replication and reflects a similar dimension to the
CVA specific to funding a given investment strategy. This is simply a way of
saying that funding a trade is analogous to funding its hedging strategy, and
this consideration leads to the practical need for FVA in certain situations.
Some important realities of modern financial markets that involve funding
costs related to derivative transactions are as follows (Green, 2016).

1. Uncollateralized derivative transactions can lead to balance sheet
investments with effective loan or deposit profiles. Loan transfer pricing within
banks incorporates these exposure profiles in determining funding costs
and benefits for trading units.
2. The details of CSA between counterparties affect the availability of
posted funds for use in collateral obligations outside the netting set. The
presence or absence of the ability to rehypothecate collateral is critical
to properly account for funding costs of market risk hedges.
3. In most cases, derivatives cannot be used as collateral for a
secured loan in the way that a bond can. This reflects the lack of a repo
market for derivatives, which means that funding is decided at the level
of individual investment business rather than in the market, requiring
accounting for detailed FVA.

Central to FVA is the idea that funding strategies are implicit in the
derivatives portfolio (essentially by replication arguments); however, the
funding policy is typically specific to each business, and so incorporating this in
a value adjustment removes the symmetry of one price for each transaction.
This leads to a general picture of a balance sheet value which differs from
the mutually agreed and traded price. This lack of symmetry in the busi-
ness accounting problem is one of the reasons that FVA has yet to formally
be required on the balance sheet through regulation, though IFRS standards
could change in the near future to account for the increasing importance of
FVA (Albanese and Andersen, 2014; Albanese et al., 2014).
The details of funding costs are also linked to collateral modeling, which
also has a strong impact on CVA/DVA; thus, in many cases it becomes impossible
to clearly separate the FVA and CVA/DVA calculations in an additive
manner. Modern FVA models are essentially exposure models which
derive closely from CVA, and fall into two broad classes, namely expectation
approaches using Monte Carlo simulation and replication approaches using
PDEs (Burgard and Kjaer, 2011; Brigo et al., 2013).

1.1.2 Regulatory capital requirements


Legal frameworks for ensuring that financial institutions hold sufficient sur-
plus capital, commensurate with their financial investment risk, have become
increasingly important on a global scale. Historically speaking, regulatory cap-
ital requirements (RCR) were not a major consideration for the front-office
trading desk, because the capital holdings were not included in derivative
pricing. Indeed, the original capitalization rules of the 1988 Basel Capital
Accord lacked sensitivity to counterparty credit risk, which was acknowledged
as enabling the reduction of RCR without actually reducing financial risk-
taking.
A major revision to the Basel framework came in 2006 (Basel II) to
address the lack of risk sensitivity; it also allowed banks, subject to
supervisory approval, to use internal models to calculate counterparty default
risk, based on calibrated counterparty default probability and
loss-given-default. The new capital rules led to a sizable increase in RCR
directly associated with the front-office derivatives trading business, and to
a subsequent need to focus on the risks associated with derivatives trading and
on the careful assessment and modeling of RCR on the trading book. The inflated
importance of regulatory
capital for derivatives trading has led to the creation of many new front-office
capital management functions, whose task is to manage the capital of the
bank deployed in support of trading activities.
One serious shortcoming of Basel II was that it did not take into account
the possibility of severe mark-to-market losses due to CVA alone. This was
acknowledged in the aftermath of the 2008 financial crisis, where drastic
widening of counterparty credit spreads resulted in substantial balance sheet
losses due to CVA, even in the absence of default. The revised Basel III frame-
work proposed in 2010 significantly increases counterparty risk RCR, primarily
by introducing an additional component for CVA, which aims to capitalize the
substantial risk apparent in CVA accounting volatility. The need to accurately
model capital holdings for the derivatives trading book, which in turn requires
accurate modeling of risk-neutral valuation including counterparty credit risk
value adjustments, potentially introduces a much greater level of complexity
into the front-office calculation infrastructure. This has led to a dichotomy
of RCR calculation types with varying degrees of complexity, applicable for
different institutions with respective degrees of computational sophistication.
To address the widening gap between accurate and expensive computation
and less sophisticated investors, a family of simpler calculation methods has
been made available. These approaches only take as input properties of the
trades themselves and thus lack the risk sensitivity that is included in the
advanced calculation methods of RCR for counterparty default and CVA risks.
There is a hurdle for model calculations, involving backtesting on PnL attri-
bution, which is put in place so that special approval is required to make use
of internal model calculations. The benefit of advanced calculation methods is
improved accuracy compared with standard approaches, and greater risk sen-
sitivity in RCR, allowing more favorable aggregation and better recognition
of hedges. These factors typically allow advanced methods to provide capital
relief, resulting in a competitive advantage in some cases.

1.1.2.1 Calculation of market risk capital


Under Basel II, a standardized method for market risk capital was intro-
duced, which is based on a series of formulas to generate the trading book cap-
ital requirement, with different approaches to be used for various categories
of assets. The formulas do not depend on market conditions, only properties
of the trades themselves, with overall calibration put in place by the Basel
Committee to achieve adequate capital levels.
For banks approved to use internal risk models under the internal model
method (IMM) for determining the trading book capital requirement, this is
typically based on Value-at-Risk (VaR) using a one-sided 99% confidence level
based on 10-day interval price shock scenarios with at least 1 year of time-
series data. In Basel III, it is required to include a period of significant market
stress relevant for the bank to avoid historical bias in the VaR calculation.
While VaR calculation can be presented as a straightforward percentile prob-
lem of an observed distribution, it is also nonlinear and generally requires
recalculating for each portfolio or change to an existing portfolio. In addition,
the extensive history of returns that must be calculated and stored, as well
as the definition of suitable macroeconomic scenarios at the level of pricing
model inputs, both represent significant challenges in implementing and
backtesting VaR, relating to computational performance, storage overhead,
and data management.
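As a percentile problem, the VaR calculation itself is short; the expense lies in producing the profit-and-loss history that feeds it. The sketch below uses synthetic daily P&L drawn from a normal distribution purely as a stand-in for revalued historical scenarios, and one common quantile convention among several. The function names are hypothetical.

```python
import random
from math import ceil


def historical_var(pnl, confidence=0.99):
    """One-sided VaR: the loss at the `confidence` quantile of the sample."""
    losses = sorted(-x for x in pnl)
    index = min(ceil(confidence * len(losses)), len(losses)) - 1
    return losses[index]


def overlapping_sums(daily_pnl, horizon=10):
    """P&L over every overlapping window of `horizon` consecutive days."""
    return [sum(daily_pnl[i:i + horizon])
            for i in range(len(daily_pnl) - horizon + 1)]


# Synthetic stand-in for two years of daily portfolio P&L (an assumption).
rng = random.Random(2008)
daily_pnl = [rng.gauss(0.0, 1.0) for _ in range(500)]

var_1d = historical_var(daily_pnl)
var_10d = historical_var(overlapping_sums(daily_pnl))
print(f"1-day 99% VaR: {var_1d:.3f}, 10-day 99% VaR: {var_10d:.3f}")
```

In practice, each entry of `daily_pnl` is the result of revaluing the whole portfolio under one historical shock, so any change to the portfolio invalidates the entire series and forces a recalculation.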
There has been criticism of VaR as an appropriate measure of exposure
since it is not a coherent risk measure due to lack of subadditivity. This dimin-
ished recognition of diversification has been addressed in a recent proposal
titled The Fundamental Review of the Trading Book (FRTB), part of which
is the replacement of VaR with expected shortfall (ES), which is less intuitive
than VaR but does not ignore the severity of large losses. However, theoretical
challenges exist since ES is not elicitable, meaning that backtesting for reg-
ulatory approval is a computationally intensive endeavor which may depend
on the risk model being used by the bank (Acerbi and Székely, 2014).
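The lack of subadditivity can be seen in a standard textbook-style counterexample, sketched here with exact discrete distributions rather than samples. The setup is an assumption made for illustration: two independent positions, each losing 100 with probability 4%, measured at a 95% confidence level. The function names are hypothetical.

```python
def var_and_es(distribution, alpha):
    """VaR and expected shortfall of a discrete loss distribution.

    `distribution` is a list of (loss, probability) pairs.
    """
    outcomes = sorted(distribution)
    cumulative = 0.0
    var = outcomes[-1][0]
    for loss, prob in outcomes:
        cumulative += prob
        if cumulative >= alpha - 1e-12:  # smallest loss with P(L <= loss) >= alpha
            var = loss
            break
    tail_prob = sum(p for loss, p in outcomes if loss > var)
    tail_loss = sum(loss * p for loss, p in outcomes if loss > var)
    es = (tail_loss + var * (1.0 - alpha - tail_prob)) / (1.0 - alpha)
    return var, es


def combine_independent(a, b):
    """Loss distribution of the sum of two independent positions."""
    combined = {}
    for loss_a, prob_a in a:
        for loss_b, prob_b in b:
            total = loss_a + loss_b
            combined[total] = combined.get(total, 0.0) + prob_a * prob_b
    return sorted(combined.items())


position = [(0.0, 0.96), (100.0, 0.04)]
portfolio = combine_independent(position, position)

var_single, es_single = var_and_es(position, 0.95)
var_portfolio, es_portfolio = var_and_es(portfolio, 0.95)
print(var_single, es_single, var_portfolio, es_portfolio)
```

Each position alone has zero VaR because its default probability lies below the 5% tail, yet the combined portfolio has a positive VaR, so diversification appears to add risk. The expected shortfall of the portfolio stays below the sum of the standalone figures.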

1.1.2.2 Credit risk capital


The Basel II framework developed a choice of methods for exposure-at-
default (EAD) for each counterparty netting set, both a standardized method
and an advanced method. A serious challenge was posed to define EAD,
due to the inherently uncertain nature of credit exposure driven by market
risk factors, correlations, and legal terms related to netting and collateral.
Consistent with previous methods, the standardized approach for counterparty
credit risk (SA-CCR) treats the exposure at default as a combination of the
mark-to-market and the potential future exposure. If a bank receives
supervisory authorization, it can use advanced models to determine capital
requirements under the IMM, which is the most risk-sensitive approach for EAD
calculation under the Basel framework. EAD is calculated at the netting set
level and allows full netting and collateral modeling. However, gaining IMM
approval takes significant time and requires a history of backtesting, which
can be very costly to implement.
During the financial crisis, roughly two-thirds of losses attributed to coun-
terparty credit were due to CVA losses and only about one-third were due
to actual defaults. Under Basel II, the risk of counterparty default and credit
migration risk was addressed, for example, in credit VaR, but mark-to-market
losses due to CVA were not. This has led to a Basel III proposal that considers
fairly severe capital charges against CVA, with relatively large multipliers on
risk factors used in the CVA RCR calculation, compared with similar formulas
used in the trading book RCR for market risk.
The added severity of CVA capital requirements demonstrates a lack of
alignment between the trading book and CVA in terms of RCR, which is
intentional to compensate for structural differences in the calculations. As
CVA is a nonlinear portfolio problem, the computational expense is much
greater, often requiring complex simulations, compared with the simpler val-
uations for instruments in the trading book. The standard calculations of risk
involve revaluation for bumping each risk factor. The relative expense of CVA
calculations means that in practice fewer risk factors are available, and thus
capital requirements must have larger multipliers to make up for the lack of
risk sensitivity.
However, one technology in particular has emerged to level the playing
field and facilitate better alignment between market risk and CVA capital
requirements. AD, described in Section 1.3.2, reduces the cost of risk fac-
tor calculations by eliminating the computational expense of adding addi-
tional risk factors. With AD applied to CVA pricing, the full risk sensi-
tivity of CVA capital requirements can be achieved without significantly
increasing the computational effort, although the practical use of AD can
involve significant challenges in the software implementation of CVA pric-
ing.
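The claim that adjoint AD makes additional risk factors nearly free can be illustrated with a toy reverse-mode tape. This is a didactic sketch, not the implementation used by any particular library: the `Var` class, the `var_exp` and `backward` helpers, and the toy valuation function are all hypothetical, and only addition, multiplication, and the exponential are supported.

```python
import math


class Var:
    """A value that records how it was computed, for reverse-mode AD."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(float(other))
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(float(other))
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__


def var_exp(x):
    value = math.exp(x.value)
    return Var(value, ((x, value),))


def backward(output):
    """Propagate adjoints from the output to every input in one sweep."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


def toy_value(factors, weights):
    """Hypothetical valuation depending on many risk factors."""
    return sum(w * x * var_exp(x) for w, x in zip(weights, factors))


weights = [0.1 * (i + 1) for i in range(50)]
factors = [Var(0.01 * i) for i in range(50)]
value = toy_value(factors, weights)
backward(value)
sensitivities = [x.grad for x in factors]
```

One forward evaluation and one backward sweep yield all 50 sensitivities, whereas bumping would need 50 further valuations. The price is that every operation must be recorded, which hints at the software engineering challenges mentioned above.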

1.1.2.3 Capital value adjustment


The ability to account for future capital requirements is a necessity for
the derivatives trading business which aims to stay capitalized over a certain
time horizon. The complexity of the funding and capitalization of the deriva-
tives business presents a further need to incorporate funding benefits associ-
ated with regulatory capital as well as capital requirements generated from
hedging trades into the derivative pricing problem as a capital value adjust-
ment (KVA). As in the case of CVA and FVA, the interconnected nature of
value adjustments means that it is not always possible to fully separate KVA
from FVA and CVA in an additive manner. Recent work (Green, 2016) has
developed several extensions to FVA replication models to incorporate the
effects associated with KVA in single asset models.
The current and future landscape of RCR appears to be driven by calculations
that are risk sensitive, and thus highly connected to the valuation models
and market risk measures. This is a complex dynamic to include in the val-
uation adjustment since it links mark-to-market derivatives accounting with
exogenous costs associated with maintaining sufficient capital reserves that,
in turn, depend on market conditions, analogously to counterparty default
and operational funding costs. Challenges arise due to the use of real-world
risk measures to calculate exposures, which are at odds with arbitrage-free
derivative pricing theory. This requires sophisticated approximation of the
relationship between real-world risk measures used in regulatory EAD
calculation and the market-calibrated exposure profile used in the risk-neutral
CVA pricing (Elouerkhaoui, 2016).
Another significant computational challenge for KVA is related to struc-
tural misalignment of the capital requirement formulas with the hedge
amounts determined by internal models. Under the current regulatory frame-
work, there is still a possibility of reducing capital requirements by increasing
risk as determined by internally approved models. Optimal hedge amounts
must be carefully determined so as to maximize the profitability of the deriva-
tives business within a certain level of market and counterparty risk, while
simultaneously reducing the capital requirements associated with hedging
activity. The overall picture is thus pointing towards a growing need for accu-
racy in KVA as a part of derivatives hedge accounting, despite the current
lack of any formal requirement for KVA in the IFRS standards.

1.2 Trading and Hedging


Historically, counterparty credit risk and capital adjustments played only a
minor role within financial institutions and were often delegated away from
front-office activity to departments such as risk and finance. Given the
relatively low capital requirements prior to the financial crisis, the profit
centers were simply charged at a set frequency, ranging from monthly to
yearly, for their relevant capital usage, and the impact was negligible
for almost all trading activity decisions. This also meant that front-office
model development followed a rapid path of innovation, whilst counterparty
credit risk and capital modeling was considered only a necessary by-product of
operation. Realistically, only CVA mattered, and then only for trades with
counterparties where no collateral agreement was in place, such as corporates,
for which the bank could end up effectively lending a significant amount
of money, as can already happen with a vanilla interest rate swap. Given
credit spreads of almost zero for all major financial institutions, neither DVA
nor FVA was worth considering.
Given the strong emphasis of both accounting and regulation on xVA-related
charges, ranging from CVA/DVA to FVA and even KVA, it is of
fundamental importance to have accurate pre- and post-trade information
available. Both are a matter of computational resources, albeit with a slightly
different emphasis. Whilst calculation speed is critical for decision making in
pretrade situations, post-trade hedging and reporting is predominantly a
matter of overall computational demand, limited only by the frequency of
reporting requirements, which in the case of market risk require daily
availability of fair values, xVA, and their respective risk sensitivities. A
similar situation holds for hedging purposes, where portfolio figures do not
always need to be available on demand and calculations at a much coarser
frequency are sufficient. It is important to point out that in periods of
stress, the ability to generate accurate post-trade analysis becomes more
important, as volatile markets make risk and hedging figures fluctuate more.
These figures may therefore need to be run at a higher frequency in order to
make decisions ranging from individual hedges all the way to unwinding specific
business activity, closing out desks, or raising more regulatory capital. Since
it is not economical to warehouse idle computational resources for the event of
a crisis, one would either have to readjust current usage where possible or
rely on other scaling capabilities, a point we investigate in more detail
in Section 1.3.1.

1.3 Technology
There are a multitude of technology elements supporting derivative pricing
ranging from hardware (see Section 1.3.1) supporting the calculation to soft-
ware technology such as algorithmic adjoint differentiation (see Section 1.3.2).
In the discussion below, we focus predominantly on the computational
demand related to derivative pricing and general accounting and regulatory
capital calculations, and less on the needs related to high-frequency
applications, including the colocation of hardware at exchanges.

1.3.1 Hardware
Choice of hardware is a critical element in supporting the pricing, risk
management, accounting and regulatory reporting requirements of financial
institutions. In the years prior to the financial crisis, pricing and risk
management were typically facilitated by in-house developed pricing libraries
written in an object-oriented programming language (C++ being a popular
choice) and run on regular CPU hardware and the respective distributed
calculation grids.

1.3.1.1 Central processing unit/floating point unit


Utilizing an object-oriented programming language allowed for quick
integration of product, model, and pricing infrastructure innovation into
production, which is a critical element for a structured product business to
stay ahead of the competition. Even with product and model innovation being
of less importance postcrisis, being able to reuse library components aids
the harmonization of pricing across different asset classes and products,
which is particularly important in light of regulatory scrutiny. C++ has
been the programming language of choice because one can take full advantage
of hardware developments in the floating-point unit by writing
performance-critical code very close to the execution unit. This is
particularly important with CPU manufacturers utilizing the increase in
transistor density to allow for greater vectorization of parallel
instructions (SSEx, AVX, AVX2, ...), going hand in hand with the respective
compiler support.

1.3.1.2 Graphic processing unit


Graphic processing units (GPUs) lend themselves very naturally to the
computationally expensive tasks in the area of computational finance. The
design of GPUs offers to take the well-established paradigm of multithreading
to the next level, offering advantages in scale and also importantly energy
usage. In particular, Monte Carlo simulations lend themselves very naturally
to multithreading tasks due to their inherent parallel nature, where sets of
paths can be processed efficiently in parallel. Thus, GPUs offer a very pow-
erful solution for well-encapsulated high-performance tasks in computational
finance. One of the key difficulties utilizing the full capabilities of GPUs is the
integration into the analytics framework as most of the object management,
serialization and interface functionality need to be handled by a higher level
language such as C++, C#, Java, or even Python. All of those offer seamless
integration with GPUs where offloading a task only requires some simple data
transfer and dedicated code running on the GPU. However, this integration
task does make the usage of GPUs a deliberate choice as one typically has
to support more than one programming language and hardware technology.
Additionally, after a GPU upgrade, a reconfiguration or even a code rewrite
may well be needed to fully utilize the increased calculation power, which is
no different from exploiting a greater level of vectorization on CPUs.
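The decomposition of a Monte Carlo run into independently processed sets of paths can be sketched as follows. This is only an illustration in plain Python, not GPU code: the function names, the single-step lognormal model, the European call payoff and the per-batch seeding scheme are all assumptions made for the example. The point is that each batch owns its random stream and shares no state, so batches could be dispatched to any parallel device and their partial sums combined afterwards.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_batch(seed, n_paths, s0, strike, rate, vol, maturity):
    """Return the sum of discounted call payoffs over one independent batch."""
    rng = random.Random(seed)  # each batch owns its generator: no shared state
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    discount = math.exp(-rate * maturity)
    total = 0.0
    for _ in range(n_paths):
        s_t = s0 * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        total += discount * max(s_t - strike, 0.0)
    return total


def price_call(n_batches=8, paths_per_batch=20_000, s0=100.0, strike=100.0,
               rate=0.05, vol=0.2, maturity=1.0):
    """Price a European call by combining independently simulated batches."""
    def run(seed):
        return simulate_batch(seed, paths_per_batch, s0, strike, rate, vol, maturity)

    # The thread pool only illustrates the dispatch pattern; in CPython it
    # gives no real speedup for this CPU-bound loop.
    with ThreadPoolExecutor() as pool:
        batch_sums = list(pool.map(run, range(n_batches)))
    return sum(batch_sums) / (n_batches * paths_per_batch)
```

Because the only data exchanged are the input parameters and one partial sum per batch, the data transfer cost is small relative to the work done per batch, which is the property that makes such tasks attractive to offload.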

1.3.1.3 Field programmable gate array


Field programmable gate arrays (FPGAs) offer high-performance solutions
close to the actual hardware. At the core of FPGAs are programmable logic
blocks which can be “wired” together using hardware description languages.
At the very core, the developer is dealing with simple logic gates such as
AND and XOR. Whilst both CPUs and GPUs have dedicated floating point
units (FPUs) specifically designed to sequentially execute individual tasks at
high speed, FPGAs allow systems to take advantage of not only a high level
of parallelism similar to multithreading, but in particular creating a pipe of
calculation tasks. At any clock cycle (configurable by the FPGA designer),
the FPGA will execute not only a single task but every instruction along the
calculation chain. Monte Carlo simulation on an FPGA can simultaneously
execute the tasks of uniform random number generation, inverse
cumulative function evaluation, path construction, and payoff evaluation—and
every sub-step in this process will also have full parallelization and pipelining
capacity. Therefore even at a far lower clock speed, one can execute many
tasks along a “pipe” with further horizontal acceleration possible. This allows
a far higher utilization of transistors compared to CPUs/GPUs, where most
parts of the FPU are idle as only a single instruction is dealt with. One
of the core difficulties of FPGAs is that one not only needs to communi-
cate data between the CPU and the FPGA but the developer also needs to
switch programming paradigm. Both CPUs and GPUs can handle a new set
of tasks with relative ease by a mere change of the underlying assembly/ma-
chine code, whilst the FPGA requires much more dedicated reconfiguration
of the executing instructions. Whilst FPGAs offer even greater calculation
speedups and, in particular, a significant reduction in energy usage, it
remains to be seen whether they provide a true competitor to traditional
hardware and software solutions. That being said, FPGAs are widely used in
the context of high-frequency trading, in particular colocated on exchange
premises, as they offer the ultimate hardware acceleration for a very
dedicated task.
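The four-stage calculation chain described above can be mimicked in software as a chain of streaming stages. The sketch below uses Python generators purely as an analogy for the dataflow: a real FPGA pipeline evaluates all stages concurrently on every clock cycle, whereas generators are evaluated lazily and sequentially. The stage names, the lognormal model and the call payoff are assumptions of the example.

```python
import math
import random
from statistics import NormalDist


def uniforms(n, seed):
    """Stage 1: uniform random numbers in the open interval (0, 1)."""
    rng = random.Random(seed)
    for _ in range(n):
        u = rng.random()
        yield u if u > 0.0 else 0.5  # guard: the inverse CDF needs 0 < u < 1


def normals(us):
    """Stage 2: inverse cumulative normal applied to each uniform."""
    inverse_cdf = NormalDist().inv_cdf
    for u in us:
        yield inverse_cdf(u)


def terminal_prices(zs, s0, rate, vol, maturity):
    """Stage 3: path construction (a single lognormal step here)."""
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    for z in zs:
        yield s0 * math.exp(drift + diffusion * z)


def payoffs(prices, strike, discount):
    """Stage 4: discounted call payoff for each path."""
    for s_t in prices:
        yield discount * max(s_t - strike, 0.0)


def pipeline_price(n_paths=50_000, seed=1, s0=100.0, strike=100.0,
                   rate=0.05, vol=0.2, maturity=1.0):
    """Wire the stages together; each value flows through all four in turn."""
    stream = uniforms(n_paths, seed)
    stream = normals(stream)
    stream = terminal_prices(stream, s0, rate, vol, maturity)
    stream = payoffs(stream, strike, math.exp(-rate * maturity))
    return sum(stream) / n_paths
```

No stage stores the full set of intermediate values; each consumes its predecessor's output one item at a time, which is the structural property a hardware pipeline exploits.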

1.3.1.4 In-memory data aggregation


A significant increase in in-memory data storage capacity has opened new
capabilities for the calculation and aggregation of risk, counterparty and
regulatory data. Calculation-intensive computations such as Monte Carlo
simulation for counterparty credit risk used to require data aggregation as a
built-in feature for the efficiency of the algorithm: one had to discard a
significant amount of intermediate calculation results, such as the
realization of an individual index at any time-discretization point. Thus,
any change in the structure of the product (or counterparty, or collateral
agreement) required a full rerun of the Monte Carlo simulation, recreating
the same intermediate values. Whilst limited storage capacity made this the
only viable solution a few years ago, the memory addressable on 64-bit
systems allows more efficient alternatives to be considered. In-memory data
aggregation in the context of computational finance provides the distinctive
advantage that more business decisions can be made in real time and that
further high-level optimizations can be deployed, such as choosing the best
counterparty to trade with, considering collateral cost, counterparty credit
risk and regulatory capital requirements, even while keeping the product
features the same.
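A minimal sketch of the idea, under stated assumptions: the simulated values are kept in memory as a "cube" indexed by trade, path and time step, and exposure profiles are then re-aggregated for different netting sets or collateral thresholds without rerunning the simulation. The random-walk "pricing model", the trade identifiers and the simple threshold treatment of collateral are all hypothetical simplifications.

```python
import math
import random


def simulate_cube(trade_ids, n_paths, n_steps, seed=0):
    """Build cube[trade][path][step] of simulated trade values.

    A random walk stands in for a real pricing model (assumption).
    """
    rng = random.Random(seed)
    cube = {}
    for trade in trade_ids:
        paths = []
        for _ in range(n_paths):
            value, path = 0.0, []
            for _ in range(n_steps):
                value += rng.gauss(0.0, 1.0)
                path.append(value)
            paths.append(path)
        cube[trade] = paths
    return cube


def expected_positive_exposure(cube, netting_set, threshold=math.inf):
    """Re-aggregate stored values into an exposure profile per time step.

    Trades in `netting_set` are netted path by path; exposure above the
    collateral `threshold` is assumed to be fully collateralized.
    """
    trades = [cube[trade] for trade in netting_set]
    n_paths, n_steps = len(trades[0]), len(trades[0][0])
    profile = []
    for step in range(n_steps):
        total = 0.0
        for path in range(n_paths):
            netted = sum(trade[path][step] for trade in trades)
            total += min(max(netted, 0.0), threshold)
        profile.append(total / n_paths)
    return profile
```

With the cube held in memory, a question such as "what is the exposure if these two trades are netted under a collateral agreement with a given threshold?" becomes a cheap aggregation pass rather than a new simulation.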

1.3.2 Algorithmic differentiation


While the impact of hardware choices on computational performance is
critical, that of software choices can be overlooked. Within software, opti-
mization efforts tend to fall into at least two categories: first, the application
of profiling tools to optimize at a low level, often very close to the hardware,
leveraging features such as larger processor caches and vectorization within
FPUs, and second, the choice of algorithms to perform the required calcu-
lations. We might consider a third category to be those concerns relevant
to distributed computation: network latency and bandwidth, and marshaling
between wire protocols and different execution environments in a heteroge-
neous system. For optimal performance, a holistic approach is required where
both local and global considerations are taken into account.
The impact of algorithm choice on performance is perhaps nowhere more
dramatic than in the context of calculating sensitivities, or greeks, for a port-
folio. Current proposed regulatory guidance in the form of ISDA’s Standard
Initial Margin Model (SIMM) and the market risk component of the FRTB
both require portfolio sensitivities as input parameters, and sensitivity calcu-
lation is a standard component of typical nightly risk runs for any trading
desk interested in hedging market risk. The widespread approach to calculat-
ing such sensitivities is that of finite difference, or “bumping,” in which each
risk factor (typically a quote for a liquid market instrument) is perturbed,
or bumped, by a small amount, often a basis point, and the portfolio reval-
ued in order to assess the effect of the change. This process is repeated for
each risk factor of interest, although in order to minimize computational cost,
it is common to bump collections of quotes together to measure the effect
of a parallel shift of an entire interest rate curve or volatility surface, for
example.
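The bumping procedure can be sketched as follows for a toy portfolio of fixed cash flows discounted on a zero curve. The curve representation (one continuously compounded zero rate per tenor), the cash flows falling exactly on the curve pillars, and all function names are assumptions made for illustration.

```python
import math

BUMP = 1e-4  # one basis point


def present_value(zero_rates, cash_flows):
    """Value cash flows {tenor: amount} on a curve {tenor: zero rate}."""
    return sum(amount * math.exp(-zero_rates[tenor] * tenor)
               for tenor, amount in cash_flows.items())


def bucketed_deltas(zero_rates, cash_flows):
    """One full revaluation per risk factor: cost grows linearly with them."""
    base = present_value(zero_rates, cash_flows)
    deltas = {}
    for tenor in zero_rates:
        bumped = dict(zero_rates)
        bumped[tenor] += BUMP  # perturb a single quote
        deltas[tenor] = present_value(bumped, cash_flows) - base
    return deltas


def parallel_delta(zero_rates, cash_flows):
    """Bump all quotes together: a single extra revaluation."""
    bumped = {tenor: rate + BUMP for tenor, rate in zero_rates.items()}
    return present_value(bumped, cash_flows) - present_value(zero_rates, cash_flows)
```

For a curve with N pillars, `bucketed_deltas` performs N + 1 valuations, whereas `parallel_delta` performs two, at the price of losing the breakdown by tenor.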
Owing to the linear scaling of the computational burden of applying finite
difference to a diverse portfolio with exposure to many market risk factors,
sensitivity calculation has traditionally been expensive. This expense is mit-
igated by the embarrassingly parallel nature of the problem, which means
long-running calculations can be accelerated through investment in computa-
tional grid technology and the associated software, energy and infrastructure
required to support it.
With an alternative algorithmic approach, however, the need for portfolio
revaluation can be eliminated, along with the linear scaling of cost with num-
ber of risk factors. This approach is called algorithmic differentiation (AD).
While popularized in finance relatively recently (Sherif, 2015), AD is not new.
It has been applied in a wide range of scientific and engineering fields from
oceanography to geophysics since the 1970s (Ng and Char, 1979; Galanti and
Tziperman, 2003; Charpentier and Espindola, 2005). It is based on encod-
ing the operations of differential calculus for each fundamental mathemati-
cal operation used by a given computation, and then combining them using
the chain rule when the program runs (Griewank, 2003). Consider the function
$h : \mathbb{R}^m \to \mathbb{R}^n$ formed by composing the functions
$f : \mathbb{R}^p \to \mathbb{R}^n$ and $g : \mathbb{R}^m \to \mathbb{R}^p$
such that

$$
y = h(x) = f(u), \qquad u = g(x) \tag{1.4}
$$

for $x \in \mathbb{R}^m$, $u \in \mathbb{R}^p$, and $y \in \mathbb{R}^n$. The
chain rule allows us to decompose the $n \times m$ Jacobian $J$ of $h$ as
follows:

$$
J_{ij} = \left[\frac{\partial h(x)}{\partial x}\right]_{ij}
       = \frac{\partial y_i}{\partial x_j}
       = \sum_{k=1}^{p} \frac{\partial y_i}{\partial u_k}\,\frac{\partial u_k}{\partial x_j}
       = \sum_{k=1}^{p} \left[\frac{\partial f(u)}{\partial u}\right]_{ik}
                        \left[\frac{\partial g(x)}{\partial x}\right]_{kj}
       = \sum_{k=1}^{p} A_{ik} B_{kj} \tag{1.5}
$$

where A and B are the Jacobians of f and g, respectively. There is a
straightforward generalization to an arbitrary number of function
compositions, so we choose a single composition here for convenience and
without loss of generality. Suppose that a computer program contains
implementations of f and g explicitly, with h formed implicitly by supplying
the output of g to a call to f.
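As a small worked instance of the decomposition J = AB, the sketch below implements a hypothetical pair of functions f and g together with their hand-written Jacobians and multiplies them. The particular choice of f and g (with m = 2, p = 2, n = 1) is an assumption made only to have concrete numbers to check.

```python
def g(x):
    """Inner function g: R^2 -> R^2."""
    x1, x2 = x
    return [x1 * x2, x1 + x2]


def f(u):
    """Outer function f: R^2 -> R^1."""
    u1, u2 = u
    return [u1 * u1 + 3.0 * u2]


def jacobian_g(x):
    """B: the p x m Jacobian of g evaluated at x."""
    x1, x2 = x
    return [[x2, x1],
            [1.0, 1.0]]


def jacobian_f(u):
    """A: the n x p Jacobian of f evaluated at u."""
    u1, _ = u
    return [[2.0 * u1, 3.0]]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def jacobian_h(x):
    """J = A B, with A evaluated at u = g(x) and B at x."""
    return matmul(jacobian_f(g(x)), jacobian_g(x))
```

At x = (2, 3) this gives J = (39, 27), which agrees with differentiating h(x) = (x1 x2)^2 + 3(x1 + x2) directly.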

1.3.2.1 Implementation approaches


AD defines two approaches to computing J: forward (or tangential) and
reverse (or adjoint) accumulation (Rall and Corliss, 1996). In forward
accumulation, B is computed first, followed by A. In other words, the
calculation tree for the operation performed by h is traversed from its
leaves to the root. The computational cost of such an approach scales
linearly with the number of leaves because the calculation needs to be
“seeded,” that is, repeatedly evaluated with $x = \mathrm{diag}(\delta_{jq})$
on the qth accumulation. This is well suited to problems where $n \gg m$,
because it allows all n rows in J to be computed simultaneously for the qth
accumulation. Such problems are so hard to find in the context of financial
derivative valuation that it is very rare to read of any application of
forward accumulation in the literature. The main exception to this rule may
be in calibration, where the sensitivity of several instrument values to an
often much smaller number of model parameters is useful for gradient descent
optimizers.
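Forward accumulation is commonly illustrated with dual numbers, where every value carries a tangent that is propagated alongside it. The sketch below is a minimal version supporting only addition and multiplication; the class and function names are hypothetical, and the test function h(x, y, z) = 2x + 3yz is chosen only for illustration. One seeded pass is needed per input, which is the linear scaling described above.

```python
class Dual:
    """A value together with its tangent (directional derivative)."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    @staticmethod
    def lift(other):
        """Treat plain numbers as constants with zero tangent."""
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # product rule applied to the tangents
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

    __rmul__ = __mul__


def forward_jacobian(func, point):
    """Build J column by column: one seeded evaluation per input."""
    columns = []
    for q in range(len(point)):
        seeded = [Dual(v, 1.0 if j == q else 0.0) for j, v in enumerate(point)]
        outputs = func(seeded)
        columns.append([out.tangent for out in outputs])
    return [list(row) for row in zip(*columns)]  # transpose columns into rows


def h(args):
    """Illustrative function h(x, y, z) = 2x + 3yz with a single output."""
    x, y, z = args
    return [2.0 * x + 3.0 * y * z]
```

Calling `forward_jacobian(h, [1.0, 4.0, 5.0])` evaluates `h` three times, once per input, and yields the single row (2, 15, 12).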
In contrast, reverse accumulation is most efficient when $n \ll m$. It
consists of two stages: a “forward sweep,” where the relevant partial
derivatives (termed “work variables”) are formed, and then a “backward sweep”
where the relevant products of partial derivatives are added into each
element of A and B, which can then be multiplied to obtain the full Jacobian
J. The result is a single calculation that produces exact derivatives (exact
to machine precision, rather than approximations at an accuracy determined by
the size of the perturbation) with respect to every risk factor to which the
portfolio is exposed. This scales linearly with the complexity of the
portfolio (the number of links in the chain), but is effectively constant
with respect to the number of risk factors. In practice, the cost of
obtaining sensitivities through AD is typically a factor of between 2 and 4
times the cost of valuing the portfolio.
Often, the assembly of differentiation operations into the full chain for a
given calculation is performed automatically by using overloaded operators
on basic data types that record the required operations for later replay, and
so AD is also termed Automatic Differentiation. In the AD literature, one
finds two main approaches to implementation, whether for forward or reverse
accumulation, of the function h. Both approaches emphasize the problem of
adding a differentiation capability to an existing codebase that computes the
value alone, as opposed to designing AD into a codebase from the beginning.
The first method, source code transformation, is based on a static analysis
of the source code for the function h. New source code is generated for a
companion function that computes J, then both are compiled (or interpreted).
The second method, operator overloading, requires a language where basic
types can be redefined and operators can be overloaded (e.g., C++ or C#), so
that existing code that performs these operations will also trigger the
corresponding derivative calculations. Forward accumulation is easier to
implement in an operator overloading approach than reverse. A common
technique for reverse accumulation is to generate a data structure, commonly
termed the “tape,” that records the relevant set of operations, and then
interpret that tape in order to obtain the desired derivatives.
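A minimal sketch of a tape-based reverse accumulation using operator overloading is given below. It supports only addition and multiplication, and the class names `Tape` and `Var` as well as the `gradient` function are hypothetical, not taken from any particular AD tool. The forward sweep records, for each intermediate result, its parents and the local partial derivatives; the backward sweep interprets the tape in reverse to accumulate adjoints.

```python
class Tape:
    """Records, for each variable, its parents and local partial derivatives."""

    def __init__(self):
        self.entries = []

    def record(self, parents=()):
        self.entries.append(tuple(parents))
        return len(self.entries) - 1  # index of the new variable


class Var:
    """A value whose arithmetic is recorded on a tape."""

    def __init__(self, tape, value, parents=()):
        self.tape = tape
        self.value = value
        self.index = tape.record(parents)

    def _lift(self, other):
        return other if isinstance(other, Var) else Var(self.tape, float(other))

    def __add__(self, other):
        other = self._lift(other)
        return Var(self.tape, self.value + other.value,
                   [(self.index, 1.0), (other.index, 1.0)])

    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        return Var(self.tape, self.value * other.value,
                   [(self.index, other.value), (other.index, self.value)])

    __rmul__ = __mul__


def gradient(tape, output, inputs):
    """Backward sweep: interpret the tape in reverse to accumulate adjoints."""
    adjoints = [0.0] * len(tape.entries)
    adjoints[output.index] = 1.0
    for index in reversed(range(len(tape.entries))):
        for parent, partial in tape.entries[index]:
            adjoints[parent] += partial * adjoints[index]
    return [adjoints[v.index] for v in inputs]
```

For example, after `tape = Tape()` and `x, y, z = Var(tape, 1.0), Var(tape, 4.0), Var(tape, 5.0)`, evaluating `out = 2.0 * x + 3.0 * y * z` followed by `gradient(tape, out, [x, y, z])` returns all three sensitivities from a single backward sweep. Note that the tape grows with every elementary operation, which is the storage cost that the fine granularity of operator overloading brings with it.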

1.3.2.2 Performance
Source code transformation is a disruptive technique that complicates build
processes and infrastructure. In practice, operator overloading is more popu-
lar, but it suffers from high storage costs and long run times, with the result
that numerous implementation techniques have been devised to mitigate the
performance challenges inherent in AD, which remains an active area of
research (e.g., Faure and Naumann, 2002). The fundamental barrier to per-
formance in AD is the granularity at which the chain rule is implemented.
When using tools to apply operator overloading to an existing codebase, the
granularity is defined by the operators that are overloaded, which is typically
very fine. Any per-operator overhead is multiplied by the complexity of the
calculation, and for even modestly sized vanilla portfolios, this complexity is
considerable. If instead the Jacobians can be formed at a higher level, so that
each one combines multiple functional operations (multiple links in the chain,
such as f and g together), then per-operator overhead is eliminated for each
group and performance scales differently (Gibbs and Goyder, 2013b).
In addition, it is not necessary to construct an entire Jacobian at any level
in a calculation and hold it in memory as a single object. To illustrate,
suppose Equation 1.4 takes the concrete form

$$
h(x, y, z) = \alpha x + \beta y z = f(x, g(y, z)) \tag{1.6}
$$

where $f(a, b) = \alpha a + \beta b$ and $g(a, b) = ab$, all variables are in
$\mathbb{R}$, and $\alpha$ and $\beta$ are considered constants. By
differentiating,

$$
\begin{aligned}
dh &= \frac{\partial h}{\partial x}\,dx + \frac{\partial h}{\partial y}\,dy
      + \frac{\partial h}{\partial z}\,dz \\
   &= \frac{\partial f}{\partial x}\,dx
      + \frac{\partial f}{\partial g}\frac{\partial g}{\partial y}\,dy
      + \frac{\partial f}{\partial g}\frac{\partial g}{\partial z}\,dz \\
   &= \alpha\,dx + \beta z\,dy + \beta y\,dz.
\end{aligned} \tag{1.7}
$$

The Jacobians are $A = (\alpha, \beta)$ and $B = (b, a)$. However, in the
chain rule above, only factors such as
$(\partial f/\partial g)(\partial g/\partial y)$ are required to determine
the sensitivity to a given variable, such as y. These factors comprise
subsets of each Jacobian and turn out to be exactly those subsets that reside
on the stack when typical implementations of hand-coded AD are calculating
the sensitivity to a given risk factor (Gibbs and Goyder, 2013b).
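A hand-coded adjoint of Equation 1.6 might look like the sketch below. The function name and the `_bar` naming convention for adjoint variables are assumptions; the point is that only scalar local variables are needed, and no Jacobian object is ever assembled.

```python
def h_with_adjoints(x, y, z, alpha, beta):
    """Return h = alpha*x + beta*y*z and its sensitivities to (x, y, z)."""
    # Forward sweep: compute the value, keeping intermediates as locals.
    g = y * z
    h = alpha * x + beta * g

    # Backward sweep: propagate adjoints from the output to the inputs.
    h_bar = 1.0
    x_bar = alpha * h_bar   # df/dx
    g_bar = beta * h_bar    # df/dg: a single factor held in a local variable
    y_bar = g_bar * z       # (df/dg)(dg/dy)
    z_bar = g_bar * y       # (df/dg)(dg/dz)
    return h, (x_bar, y_bar, z_bar)
```

With alpha = 2 and beta = 3, evaluating at (x, y, z) = (1, 4, 5) returns the value 62 together with the sensitivities (2, 15, 12), in agreement with Equation 1.7.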
Implementing AD “by hand” in this way also enables a number of
optimizations which both reduce storage costs and increase speed of
computation. Deeply recursive structures arise naturally in a financial context,
in path-dependent payoffs such as those found in contracts with accumula-
tor features, and in curve bootstrapping algorithms where each maturity of,
say, expected Libor depends on earlier maturities. Naive implementations of
AD suffer from quadratic scaling in such cases, whereas linear performance is
possible by flattening each recursion.
In Monte Carlo simulations, only the payoff and state variable modeling
components of a portfolio valuation change on each path. Curves, volatility
surfaces, and other model parameters do not, and so it is beneficial to calcu-
late sensitivities per path to state variables and only continue the chain rule
through the relationship between model parameters and market data (based
on calibration) once. In contrast, for certain contracts such as barrier options,
it is common to evaluate a volatility surface many times, which would result
in a large Jacobian. Instead, performance will be improved if the chain rule
is extended on each volatility surface evaluation through the parameters on
which the surface depends. This set of parameters is typically much smaller
than the number of volatility surface evaluations.
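The first of these optimizations can be sketched as follows: sensitivities to a model parameter are accumulated path by path, and the chain rule through the calibration is applied once after the loop. The example uses a pathwise estimator of the volatility sensitivity of a European call under a lognormal model; the model, the function names and the calibration Jacobian (the derivative of the model volatility with respect to two market quotes) are hypothetical.

```python
import math
import random


def pathwise_model_vega(n_paths, s0, strike, rate, vol, maturity, seed=7):
    """Average over paths of d(discounted payoff)/d(model volatility)."""
    rng = random.Random(seed)
    discount = math.exp(-rate * maturity)
    sqrt_t = math.sqrt(maturity)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * math.exp((rate - 0.5 * vol * vol) * maturity + vol * sqrt_t * z)
        if s_t > strike:
            # d(payoff)/d(S_T) = 1 in the money; d(S_T)/d(vol) follows from
            # differentiating the path construction above.
            total += discount * s_t * (sqrt_t * z - vol * maturity)
    return total / n_paths


def market_sensitivities(model_vega, calibration_jacobian):
    """Apply the chain rule through calibration once, outside the path loop."""
    return [model_vega * d_vol_d_quote for d_vol_d_quote in calibration_jacobian]
```

For instance, `market_sensitivities(pathwise_model_vega(100_000, 100.0, 100.0, 0.05, 0.2, 1.0), [0.6, 0.4])` performs the per-path work with respect to a single model parameter and only then maps the result onto the two market quotes.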
While the value of implementing AD by hand is high, so is the cost. A good
rule of thumb is that the cost of implementing AD is, on average,
approximately the same as that of implementing the calculation whose
sensitivities are desired.
The prospect of doubling investment in quantitative development resources
is unwelcome for almost any financial business, and consequently implemen-
tations of AD tend to be patchy in their coverage, restricted to specific types
of trade or other bespoke aspects of a given context.

1.3.2.3 Coverage
When evaluating value, risk, and similar metrics, it is important to have a
complete picture. Any hedging, portfolio rebalancing, or other related decision
making that is performed on incomplete information will be exposed to the
slippage due to the difference between the actual position and the reported
numbers used. Whilst some imperfections, such as those imposed by model
choice, are inevitable, there is no need or even excuse to introduce oth-
ers through unnecessary choices such as an incomplete risk implementation.
This can be easily avoided by adopting a generic approach that enables the
necessary computations to be easily implemented at every step of the risk
calculation.
For complex models, the chain rule of differentiation can and does lead
to the mixing of exposures to different types of market risk. For example,
volatility surfaces specified relative to the forward are quite common and
lead to contributions of the form $(\partial C/\partial\sigma)(\partial\sigma(F, K)/\partial F)$ to the delta of
an option trade through the “sticky smile.” This mixing obviously increases
with model complexity and consequently any generic approach has to confront
it directly. Imposing the simple requirement that all first-order derivatives,
rather than just some targeted towards a specific subset of exposures, are
always calculated as part of the AD process allows an implementation to
avoid missing these risk contributions.
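The smile contribution to delta can be made concrete with a small sketch. It assumes an undiscounted Black (1976) call price and a hypothetical smile that is linear in log-moneyness relative to the forward; the parameter values and function names are illustrative only. The total delta is the explicit dependence on F plus vega times the slope of the smile with respect to F.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()


def smile_vol(forward, strike, atm_vol=0.20, skew=-0.10):
    """Hypothetical smile specified relative to the forward."""
    return atm_vol + skew * math.log(strike / forward)


def black_call(forward, strike, vol, maturity):
    """Undiscounted Black (1976) call price."""
    std = vol * math.sqrt(maturity)
    d1 = math.log(forward / strike) / std + 0.5 * std
    return forward * _NORMAL.cdf(d1) - strike * _NORMAL.cdf(d1 - std)


def delta_with_smile(forward, strike, maturity, atm_vol=0.20, skew=-0.10):
    """Return (explicit delta, smile contribution) for the call."""
    vol = smile_vol(forward, strike, atm_vol, skew)
    std = vol * math.sqrt(maturity)
    d1 = math.log(forward / strike) / std + 0.5 * std
    dc_df = _NORMAL.cdf(d1)                                   # vol held fixed
    dc_dsigma = forward * _NORMAL.pdf(d1) * math.sqrt(maturity)  # vega
    dsigma_df = -skew / forward           # derivative of skew * ln(K / F)
    return dc_df, dc_dsigma * dsigma_df
```

An implementation that reported only the first component would miss the second, which is exactly the kind of mixed exposure that calculating all first-order derivatives is meant to capture.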
It is possible to separate valuation and related calculations into a chain
of logical steps with a high level of commonality and reusability for each link
in the chain. In general, the steps are the calibration of model parameters
to input observables such as market data quotes, the setup of simulation or
similar calculations, specific pricing routines for classes of financial contracts,
and portfolio-level aggregation of the values calculated per-trade. These links
in the chain can be separately built up and independently maintained to form
a comprehensive analytic library that can then be used to cover essentially all
valuation, risk, and related calculations.
Within each of these steps, the dependency tree of calculations has to be
checked and propagated through the calculation. By carefully defining how
these steps interact with each other, a high level of reuse can be achieved.
As well as enabling the use of pricing algorithms within model calibration,
a key consideration when setting up complex models, the nested use of such
calculations is important for analyzing the important class of contracts with
embedded optionality. In particular, the presence of multiple exercise oppor-
tunities in American and Bermudan options necessitates some form of approx-
imation of future values when evaluating the choices that can be made at each
of these exercise points. Typically, this approximation is formed by regressing
these future values against explanatory variables known at the choice time.
Standard methods such as maximum likelihood determine the best fit through
minimizing some form of error function; the act of minimization means that
first-order derivatives are by construction zero. Note that this observation
applies on average to the particular values (Monte Carlo paths, or similar)
used in the regression. Subsequent calculations are exposed to numerical noise
when a different set of values are used in an actual calculation, and this noise
extends to the associated risk computations.
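The observation about vanishing first-order derivatives can be checked numerically. The sketch below fits a straight line by ordinary least squares (used here as a stand-in for the regression step, which in practice would use a richer basis) and evaluates the gradient of the mean squared error with respect to the fitted coefficients, first on the sample used for the fit and then on a fresh sample. The data-generating function is a hypothetical noisy payoff.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def mse_gradient(intercept, slope, xs, ys):
    """Gradient of the mean squared error w.r.t. (intercept, slope)."""
    n = len(xs)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return (-2.0 * sum(residuals) / n,
            -2.0 * sum(r * x for r, x in zip(residuals, xs)) / n)


def sample(n, seed):
    """Noisy future values against one explanatory variable (toy data)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.5, 1.5) for _ in range(n)]
    ys = [max(x - 1.0, 0.0) + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys
```

On the fitting sample the gradient vanishes up to rounding; on a different sample it does not, which is the source of the numerical noise mentioned above.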

1.4 Conclusion
This chapter provides a comprehensive overview of computational
challenges in the history of derivative pricing. Before the 2008 credit crisis,
the main driver of computational expense was the complexity of sophisticated
contracts and associated market risk management. In the wake of the crisis,
counterparty credit risk came to the fore, while contracts simplified. Paradox-
ically, incorporating counterparty risk into pricing for even vanilla derivative
portfolios presents computational problems more challenging than for the most
complex precrisis contracts.
In response to this challenge, investment in specialized computing hard-
ware such as GPU and FPGA, along with the rising tide of commodity com-
pute power available in clouds, rose sharply in the last decade. However,
specialized hardware requires specialized software development to realize its
power, which in turn limits the flexibility of systems and the speed at which
they are able to cope with changing market and regulatory requirements.
Commodity hardware, particularly if managed in-house, can incur nontrivial
infrastructure and maintenance costs.
For calculating sensitivities, a key input to both market risk management
and regulatory calculations for initial margin and capital requirements, a soft-
ware technique called AD offers the potential to reduce hardware requirements
dramatically. While popularized only recently, AD is an old idea. Tools for
applying AD to existing codebases exist but suffer from performance draw-
backs such that the full potential of the technique, for both performance and
coverage of contract and calculation types, is only realized when AD is incor-
porated into a system’s architecture from the beginning.

References
Acerbi, C. and Székely, B.: Back-testing expected shortfall. Risk Magazine (December
2014).

Albanese, C. and Andersen, L.: Accounting for OTC Derivatives: Funding Adjust-
ments and the Re-Hypothecation Option. Available at SSRN: [Link]
abstract=2482955 or [Link] accessed March
10, 2017. 2014.

Albanese, C., Andersen, L., and Stefano, I.: The FVA Puzzle: Accounting, Risk
Management and Collateral Trading. Available at SSRN: [Link]
abstract=2517301 or [Link] accessed March
10, 2017. 2014.

Brigo, D., Buescu, C., and Morini, M.: Counterparty risk pricing: Impact of close-
out and first-to-default times. International Journal of Theoretical and Applied
Finance, 15(6): 2012.

Brigo, D., Morini, M., and Pallavicini, A.: Counterparty Credit Risk, Collateral and
Funding. John Wiley & Sons, UK, 2013.

Burgard, C. and Kjaer, M.: Partial differential equation representations of derivatives
with bilateral counterparty risk and funding costs. The Journal of Credit Risk,
7:75–93, 2011.

Charpentier, I. and Espindola, J.M.: A study of the entrainment function in models
of Plinian columns: Characteristics and calibration. Geophysical Journal Inter-
national, 160(3):1123–1130, 2005.

Elouerkhaoui, Y.: From FVA to KVA: Including cost of capital in derivatives pricing.
Risk Magazine (March 2016).

Faure, C. and Naumann, U.: Minimizing the tape size. In Corliss G., Faure C.,
Griewank A., Hascoët L., and Naumann U., editors, Automatic Differentiation
of Algorithms: From Simulation to Optimization, Computer and Information
Science, Chapter 34, pages 293–298. Springer, New York, NY, 2002.

Galanti, E. and Tziperman, E.: A midlatitude-ENSO teleconnection mechanism via
baroclinically unstable long Rossby waves. Journal of Physical Oceanography,
33(9):1877–1888, 2003.

Gibbs, M. and Goyder, R.: Automatic Numeraire Corrections for Generic Hybrid
Simulation. Available at SSRN: [Link] or http://
[Link]/10.2139/ssrn.2311740, accessed August 16. 2013a.

Gibbs, M. and Goyder, R.: Universal Algorithmic Differentiation™ in the
F3 Platform. Available at: [Link]
whitepaper/universal-algorithmic-differentiation-f3-platform, accessed March
10, 2017. 2013b.

Green, A.: XVA: Credit, Funding and Capital Valuation Adjustments. John Wiley
& Sons, UK, 2016.

Gregory, J.: Being two faced over counterparty credit risk. Risk, 22(2):86–90, 2009.

Gregory, J.: The xVA Challenge. John Wiley & Sons, UK, 2015.

Griewank, A.: A mathematical view of automatic differentiation. In Arieh I., editor,
Acta Numerica, volume 12, pages 321–398. Cambridge University Press, 2003.

Ng, E. and Char, B. W.: Gradient and Jacobian computation for numerical appli-
cations. In Ellen Golden V., editor, Proceedings of the 1979 Macsyma User’s
Conference, pages 604–621. NASA, Washington, D.C., June 1979.

Rall, L. B. and Corliss, G. F.: An introduction to automatic differentiation. In
Berz M., Bischof C. H., Corliss G. F., and Griewank A., editors, Computa-
tional Differentiation: Techniques, Applications, and Tools, pages 1–17. SIAM,
Philadelphia, PA, 1996.

Ramirez, J.: Accounting for Derivatives: Advanced Hedging under IFRS 9. John
Wiley & Sons, UK, 2015.

Sherif, N.: Chips off the menu. Risk Magazine, 27(1): 12–17, 2015.

Watt, M.: Corporates fear CVA charge will make hedging too expensive. Risk,
October Issue 2011.
Chapter 2
Using Market Sentiment to Enhance
Second-Order Stochastic Dominance
Trading Models

Gautam Mitra, Christina Erlwein-Sayer, Cristiano Arbex Valle,
and Xiang Yu

CONTENTS
2.1 Introduction and Background
    2.1.1 Enhanced indexation applying SSD criterion
    2.1.2 Revising the reference distribution
    2.1.3 Money management via “volatility pumping”
    2.1.4 Solution methods for SIP models
    2.1.5 Guided tour
2.2 Data
    2.2.1 Market data
    2.2.2 News meta data
2.3 Information Flow and Computational Architecture
2.4 Models
    2.4.1 Impact measure for news
        2.4.1.1 Sentiment score
        2.4.1.2 Impact score
    2.4.2 Long–short discrete optimization model based on SSD
    2.4.3 Revision of reference distribution
2.5 Trading Strategies
    2.5.1 Base strategy
        2.5.1.1 Strategy A: Full asset universe
    2.5.2 Using relative strength index as a filter
        2.5.2.1 Strategy B: Asset filter relative strength index
    2.5.3 Using relative strength index and impact as filters
        2.5.3.1 Strategy C: Asset filter relative strength index and impact
    2.5.4 A dynamic strategy using money management
2.6 Solution Method and Processing Requirement
    2.6.1 Solution of LP and SIP models
    2.6.2 Scale-Up to process larger models
2.7 Results
2.8 Conclusion and Discussions
Acknowledgments
References

2.1 Introduction and Background


We propose a method of computing (daily) trade signals; the method
applies a second-order stochastic dominance (SSD) model to find portfolio
weights within a given asset universe. In our report of the “use cases” the
assets are the constituents of the exchange traded securities of different major
indices such as Nikkei, Hang Seng, Eurostoxx 50, and FTSE 100 (see Sec-
tions 2.6 and 2.7). The novelty and contribution of our research are four-
fold: (1) We introduce a model of “enhanced indexation,” which applies the
SSD criterion. (2) We then improve the reference, that is, the benchmark
distribution using a tilting method; see Valle, Roman, and Mitra (2017) and
Section 2.4.3. This is achieved using impact measurement of market
sentiment; see Section 2.4.1. (3) We embed the static SSD approach within a
dynamic framework of “volatility pumping” with which we control drawdown
and the related temporal risk measures. (4) Finally, we apply a “branch
and cut” technique to speed up the repeated solution of single-period
stochastic integer programming (SIP) model instances. Our models are driven
by two sets of time-series data: first, the market price (returns) data and,
second, the news (meta) data; see Section 2.2 for a full discussion of
these.

2.1.1 Enhanced indexation applying SSD criterion


SSD has a well-recognized importance in portfolio selection due to its con-
nection to the theory of risk-averse investor behavior and tail-risk minimiza-
tion. Until recently, stochastic dominance models were considered intractable,
or at least very demanding from a computational point of view. Computa-
tionally tractable and scalable portfolio optimization models that apply the
concept of SSD were proposed recently (Dentcheva and Ruszczyński 2006;
Roman, Darby-Dowman, and Mitra 2006; Fábián, Mitra, and Roman, 2011a).
These portfolio optimization models assume that a benchmark, that is, a desir-
able “reference” distribution is available and a portfolio is constructed, whose
return distribution dominates the reference distribution with respect to SSD.
Index tracking models also assume that a reference distribution (that of a
financial index) is available. A portfolio is then constructed, with the aim of
replicating, or tracking, the financial index. Traditionally, this is done by min-
imizing the tracking error, that is, the standard deviation of the differences
between the portfolio and index returns. Other methods have been proposed
(for a review of these methods, see Beasley et al. 2003; Canakgoz and Beasley
2008).
The passive portfolio strategy of index tracking is based on the
well-established “Efficient Market Hypothesis” (Fama, 1970), which implies
that financial indices achieve the best returns over time. Enhanced indexa-
tion models are related to index tracking in the sense that they also consider
the return distribution of an index as a reference or benchmark. However,
they aim to outperform the index by generating “excess” return (DiBar-
tolomeo, 2000; Scowcroft and Sefton, 2003). Enhanced indexation is a very
new area of research and there is no generally accepted portfolio construc-
tion method in this field (Canakgoz and Beasley, 2008). Although the idea of
enhanced indexation was formulated as early as 2000, only a few enhanced
indexation methods were proposed later in the research community; for
a review of this topic see Canakgoz and Beasley (2008). These methods
are predominantly concerned with overcoming the computational difficulty
that arises due to restriction on the cardinality of the constituent assets
in the portfolios. Little consideration is given to the question of
whether they attain their stated purpose, that is, achieve return in excess
of the index.
In an earlier paper (Roman, Mitra, and Zverovich, 2013), we have pre-
sented extensive computational results illustrating the effective use of the
SSD criterion to construct “models of enhanced indexation.” The SSD
criterion has long been recognized as a rational criterion of choice between
wealth distributions (Hadar and Russell 1969; Bawa 1975; Levy 1992). Empirical
tests for SSD portfolio efficiency have been proposed in Post (2003) and
Kuosmanen (2004). More recently, the SSD choice criterion has been proposed
for portfolio construction (Dentcheva and Ruszczyński 2003, 2006; Roman
et al. 2006). The approach described
in Dentcheva and Ruszczyński (2003, 2006) first considers a reference (or
benchmark) distribution and then computes a portfolio which dominates the
benchmark distribution by the SSD criterion. In Roman et al. (2006), a
multi-objective optimization model is introduced to achieve SSD dominance. This
model is both novel and usable since, when the benchmark solution itself is
SSD efficient or its dominance is unattainable, it finds an SSD-efficient
portfolio whose return distribution comes close to the benchmark in a satisfying
sense. The topic continues to be researched by academics who have strong
interest in this approach: Dentcheva and Ruszczyński (2010), Post and Kopa
(2013), Kopa and Post (2015), Post et al. (2015), Hodder, Jackwerth, and
Kolokolova (2015), Javanmardi and Lawryshy (2016). Over the last decade,
we have proposed computational solutions and applications to large-scale
applied problems in finance (Fábián et al. 2011a, 2011b; Roman et al. 2013;
Valle et al. 2017).
From a theoretical perspective, enhanced indexation calls for further
justification. The efficient market hypothesis (EMH) is based on the key
assumption that security prices fully reflect all available information. This
hypothesis, however, has been continuously challenged; the simple fact that
academics and practitioners commonly use “active,” that is, non-index-tracking,
strategies bears witness to this. An attempt to reconcile the advocates
and opponents of the EMH is the “adaptive market hypothesis” (AMH) put
forward by Lo (2004). AMH postulates that the market “adapts” to the infor-
mation received and is generally efficient, but there are periods of time when
it is not; these periods can be used by investors to make profit in excess of the
market index. From a theoretical point of view, this justifies the quest for tech-
niques that seek excess return over financial indices. In this sense, enhanced
indexation aims to discover and exploit market inefficiencies. As set out ear-
lier, a common problem with the index tracking and enhanced indexation
models is the computational difficulty which is due to cardinality constraints
that limit the number of stocks in the chosen portfolio. It is well known that
most index tracking models naturally select a very large number of stocks in
the composition of the tracking portfolio. Cardinality constraints overcome
this problem, but they require introduction of binary variables and thus the
resulting model becomes much more difficult to solve. Most of the literature
in the field is concerned with overcoming this computational difficulty. The
good in-sample properties of the return distribution of the chosen portfolios
have been underlined in previous papers: Roman, Darby-Dowman, and Mitra
(2006), using historical data; Fábián et al. (2011c), using scenarios generated
via geometric Brownian motion.
However, it is the actual historical performance of the chosen portfolios
(measured over time and compared with the historical performance of the
index) that provides empirical validation of whether the models achieved their
stated purpose of generating excess return.
We also investigate aspects related to the practical application of portfolio
models in which the asset universe is very large; this is usually the case in
index tracking and enhanced indexation models. It has been recently shown
that very large SSD-based models can be solved in seconds, using solution
methods which apply the cutting-plane approach, as proposed by Fábián et al.
(2011a). Imposing additional constraints that add realism (e.g., cardinality
constraints, normally required in index tracking) increases the computational
time dramatically.

2.1.2 Revising the reference distribution


The reference distribution, that is, the distribution of the financial index
under consideration is “achievable” since there exists a feasible portfolio that
replicates the index. Empirical evidence (Roman et al. 2006, 2013; Post and
Kopa 2013) further suggests that in most cases this distribution can be SSD
dominated; this in turn implies that the distribution is not SSD efficient.
When there is scope for improvement, we compute an improved
distribution as our revised “benchmark.” We achieve this by using a tilting
method; the improvement step is triggered by the impact measure of the
market sentiment. For a full description of this method, see Valle, Roman, and
Mitra (2017; also Section 2.4.3) and Mitra, Erlwein-Sayer, and Roman (2017).

2.1.3 Money management via “volatility pumping”


Like the Markowitz model, SSD works as a “single-period SP,” that is,
within a static or myopic framework. Volatility and tail risks such as value at
risk (VaR) and conditional value at risk (CVaR) can be controlled, but these are
static measures of risk. In contrast, “maximum drawdown” and “days to recovery”
are dynamic risk measures, which the active trading and fund management
community considers to be more important performance measures. To control
drawdown and the related temporal risk measures, we have resorted to money
management via “volatility pumping.” This is based on the “Kelly Criterion”
of John Kelly, extended and much refined by Ed Thorp. Thorp first applied it
in the setting of blackjack (see Thorp 1966) and subsequently in trading and
fund management (see Thorp and Kassouf 1967). The underlying principles are
lucidly explained by David Luenberger (1997; see Chapter 15), who coined the
term “volatility pumping.”

2.1.4 Solution methods for SIP models


Computing “long only” SSD portfolios (strategies) requires the solution of an
LP; this is computationally tractable, yet it may require a large number of
constraints or cuts, which are generated dynamically in the solution process
(see Fábián et al. 2011a, 2011b). Applying a “long–short” strategy leads to
the further computational challenge of solving stochastic integer programs
(SIP) with binary variables. Given n assets, the model requires 2n binary
decision variables and 2n continuous variables (200 binary and 200 continuous
for the components of the FTSE 100, for instance). As such the model is
relatively small; however, if we consider all the cuts that are dynamically
added (see Equation 2.10), the model size grows exponentially, as a huge number
of constraints are generated: one for every subset of {1, . . . , S} with
cardinality s, for s = 1, . . . , S, where S is the number of scenarios. We
refer the readers to the formulation set out in Section 2.4.2 (see Equations
2.4 through 2.11).

2.1.5 Guided tour


In Section 2.2, we discuss the two time-series data sets with which our
models are instantiated. In Section 2.3, we describe the general system
architecture and in Section 2.4, we set out in detail the relevant models and
methods to find SSD optimal portfolios. We present our strategies and the
concepts that underpin these in Section 2.5. The solution method and processing
requirements are described in Section 2.6. The results of out-of-sample back
tests are presented and discussed in Section 2.7.
TABLE 2.1: Description of all the data fields for a company in the market data

Data field   Field name   Description
1            ##RIC        Reuters instrument code individually assigned to each company
2            Date         In the format DD-MM-YYYY
3            Time         In the format hh:mm:ss, given to the nearest minute
4            GMT Offset   Difference from Greenwich Mean Time
5            Type         Type of market data
6            Last         Last prices for the corresponding minute

2.2 Data
Our modelling architecture uses two streams of time-series data: (i) market
data which is given on a daily frequency, and (ii) news metadata as supplied
by Thomson Reuters. A detailed description of these datasets is given below.

2.2.1 Market data


In this set of experiments, daily prices for each asset have been used to
test the trading strategy. The data fields of the market data are set out in
Table 2.1. The trading strategy (in Section 2.5) is tested on the entire asset
universe of FTSE100.

2.2.2 News meta data


News analytics data are presented in the well-established metadata format
whereby a news event is given tags of relevance, novelty and sentiment (scores)
for a given individual asset. The analytical process of computing such scores is
fully automated from collecting, extracting, aggregating, to categorizing and
scoring. The result is a numerical score assigned to each news article for each
of its different characteristics. Although we use intraday news data, the
number of news stories, and hence of data points, for a given asset is variable
and does not match the time frequency of the market data. The attributes of
news stories used in our study are relevance and sentiment.
The news metadata for the chosen assets were filtered by relevance score:
any news item with a relevance score below 100 was ignored and not included in
the data set. This ensures with a high degree of certainty that the sentiment
scores used are indeed focused on the chosen asset, rather than on an item that
merely mentions the asset, for example for comparison purposes.
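The relevance filter described above can be sketched in a few lines. This is a minimal illustration only: the record layout (a dictionary with `asset`, `relevance`, and `sent` keys) and the instrument code `AAA.L` are hypothetical and are not taken from the Thomson Reuters data format.

```python
def filter_by_relevance(news_items, min_relevance=100):
    """Keep only the news items whose relevance score is at least min_relevance."""
    return [item for item in news_items if item["relevance"] >= min_relevance]


# Hypothetical records: one fully relevant item and one passing mention.
sample = [
    {"asset": "AAA.L", "relevance": 100, "sent": 22.5},
    {"asset": "AAA.L", "relevance": 40, "sent": -35.0},
]
kept = filter_by_relevance(sample)
```

With the threshold of 100 used in the study, only the first record survives the filter.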
FIGURE 2.1: System architecture. [Diagram boxes: Market data; News data; Impact measure; Revision of reference distribution; SSD static asset allocation (FortSP); Money management; Trade signals.]

2.3 Information Flow and Computational Architecture


News analytics in finance focus on improving IT-based legacy system
applications. These improvements come through research and development
directed at automated or semi-automated programmed trading, fund rebalancing,
and risk control applications.
The established good practice of applying these analytics in the traditional
manual approach is as follows. News stories and announcements arrive
synchronously and asynchronously. In the market, the prices of assets (stocks,
commodities, FX rates, etc.) move (market reactions). Professionals digest these
items of information and accordingly make trading decisions and recompute
their risk exposures.
The information flow and the (semi-)automation of the corresponding
IS architecture are set out in Figure 2.1. Two streams of information flow
simultaneously: news data and market data. Pre-analysis is applied to the
news data, which are further filtered and processed by classifiers into relevant
metrics. These are consolidated with the market data of prices, and together
they constitute the classical datamart, which feeds into the SSD asset
allocation model. A key aspect of this application is that it sets out to
provide technology-enabled support to professional decision makers.
The SSD asset allocation optimization model is solved with a specialized
solver, and its output feeds the recommended trade signals.
The computational system runs on an Intel(R) Core(TM) i5-3337U
CPU @ 1.80 GHz with 6 GB of RAM and Linux as the operating system. The
back-testing framework was written in R, and the required data are stored in
a SQLite database. The SSD solver, written in C++ with FortSP and CPLEX 12.6
as mixed-integer programming solvers, is invoked by the R framework.
2.4 Models
2.4.1 Impact measure for news
In our analytical model for news we introduce two concepts, namely, (i)
sentiment score and (ii) impact score. The sentiment score is a quantification
of the mood (of a typical investor) with respect to a news event. The impact
score takes into consideration the decay of the sentiment of one or more news
events and how after aggregation these impact the asset behavior.

2.4.1.1 Sentiment score


Thomson Reuters’ news sentiment engine analyzes and processes each news
story that arrives as a machine readable text. Through text analysis and
other classification schemes, the engine then computes for each news event:
(i) relevance, (ii) entity recognition, and (iii) sentiment probabilities, as well
as a few other attributes (see Mitra and Mitra, 2011). The sentiment of a news
event can be positive, neutral, or negative, and the classifier assigns
probabilities such that all three probabilities sum to one.
We turn these three probabilities into a single sentiment score in the range
+50 to −50 using the following equation:

\[
\mathit{Sent} = 100 \left( \mathrm{Prob}(\text{positive}) + \tfrac{1}{2}\,\mathrm{Prob}(\text{neutral}) \right) - 50 \tag{2.1}
\]

where Sent denotes a single transformed sentiment score. We find that such
a derived single score provides a relatively better interpretation of the mood
of the news item. Thus, the news sentiment score is a relative number that
describes the degree of positivity or negativity in a piece of news. During
the trading day, each news item is given a sentiment value as it arrives. Given
that −50 ≤ Sent ≤ 50, for a news item k arriving in time bucket $t_k$ we define
$\mathit{PNews}(k, t_k)$ and $\mathit{NNews}(k, t_k)$ as the sentiment of the kth
news item when it is positive or negative, respectively (see the following section).
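As a quick check of Equation 2.1, the transformation can be written as a small function. This is a minimal sketch; the function name and the input validation are illustrative additions, not part of the Thomson Reuters engine.

```python
def sentiment_score(p_pos: float, p_neu: float, p_neg: float) -> float:
    """Map the three classifier probabilities to a score in [-50, 50] (Equation 2.1)."""
    if abs(p_pos + p_neu + p_neg - 1.0) > 1e-9:
        raise ValueError("the three probabilities must sum to one")
    return 100.0 * (p_pos + 0.5 * p_neu) - 50.0
```

A purely positive item maps to +50, a purely negative item to −50, and a purely neutral item to 0; an item with probabilities (0.6, 0.3, 0.1) maps to 25.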

2.4.1.2 Impact score


It is well known from research studies that news flow affects asset behavior
(Patton and Verardo, 2012; Mitra, Mitra, and diBartolomeo, 2009). Therefore,
the accumulation of news items as they arrive is important. Patton and
Verardo (2012) observed decay in the impact of news on asset prices and their
betas on a daily timescale, and further determined that news effects disappear
completely within 2–5 days. Mitra, Mitra, and diBartolomeo (2009) created a
composite sentiment score in their volatility models after initial experiments
revealed no effect on volatility predictions with sentiment alone; the
decay period in this study was over 7 days.
To compute the impact of news events over time, we first find an expression
which describes the attenuation of the news sentiment score. The impact of
a news item is not confined to the time of its release; it also persists
over the finite periods of time that follow. To
account for this prolonged impact, we have applied an attenuation technique
to reflect the instantaneous impact of news releases and the decay of this
impact over a subsequent period of time. The technique combines exponential
decay and accumulation of the sentiment score over a given time bucket under
observation. We take into consideration the attenuation of positive sentiment
to the neutral value and the rise of negative sentiment also to the neutral
value, and accumulate (sum) these sentiment scores separately. Keeping the
positive and negative sentiment scores separate avoids cancellation effects:
cancellation understates the news flow, and an exact cancellation would be
misinterpreted as no news.
News arrives asynchronously; depending on the nature of the sentiment it
creates, we classify these into three categories, namely: positive, neutral, and
negative. For the purpose of deriving impact measures, we only consider the
positive and negative news items.
Let

POS denote the set of news items with positive sentiment value, Sent > 0;

NEG denote the set of news items with negative sentiment value, Sent < 0;

$\mathit{PNews}(k, t_k)$ denote the sentiment value of the kth positive news item arriving
at time bucket $t_k$, with $1 \le t_k \le 630$ and $k \in \mathrm{POS}$, so that $\mathit{PNews}(k, t_k) > 0$;

$\mathit{NNews}(k, t_k)$ denote the sentiment value of the kth negative news item arriving
at time bucket $t_k$, with $1 \le t_k \le 630$ and $k \in \mathrm{NEG}$, so that $\mathit{NNews}(k, t_k) < 0$.

Let λ denote the exponent which determines the decay rate. We have
chosen λ such that the sentiment value decays to half its initial value in a
90-minute time span. The cumulated positive and negative sentiment scores for
one day are calculated as

\[
\mathit{PImpact}(t) = \sum_{\substack{k \in \mathrm{POS} \\ t_k \le t}} \mathit{PNews}(k, t_k)\, e^{-\lambda (t-1)} \tag{2.2}
\]

\[
\mathit{NImpact}(t) = \sum_{\substack{k \in \mathrm{NEG} \\ t_k \le t}} \mathit{NNews}(k, t_k)\, e^{-\lambda (t-1)} \tag{2.3}
\]

In Equations 2.2 and 2.3 for the intraday PImpact and NImpact, t is in the range
t = 1, . . . , 630. On the other hand, for a given asset all the relevant news
items which arrived in the past have, in principle, an impact on the asset.
Hence, the range of t can be widened to consider past news, that is, news items
which are 2 or more days “old.”
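The accumulation with decay can be sketched as follows. Two assumptions are made here and should be read as such: time buckets are taken to be one minute long, so that a 90-minute half-life gives λ = ln 2 / 90, and the decay of each item is measured from its own arrival bucket, that is, with exponent −λ(t − t_k), whereas the exponent printed in Equations 2.2 and 2.3 reads −λ(t − 1). The function and variable names are illustrative.

```python
import math

HALF_LIFE_BUCKETS = 90.0                         # assumed: one bucket = one minute
LAMBDA = math.log(2.0) / HALF_LIFE_BUCKETS       # decay rate per bucket


def impacts(news, t):
    """Return (PImpact, NImpact) at bucket t.

    news is an iterable of (t_k, sent) pairs: arrival bucket and sentiment score.
    Positive and negative scores are accumulated separately to avoid cancellation.
    Assumption: each item decays from its own arrival bucket t_k.
    """
    p_impact = 0.0
    n_impact = 0.0
    for t_k, sent in news:
        if t_k > t or sent == 0.0:
            continue                             # not yet arrived, or neutral
        decayed = sent * math.exp(-LAMBDA * (t - t_k))
        if sent > 0.0:
            p_impact += decayed
        else:
            n_impact += decayed
    return p_impact, n_impact
```

Under these assumptions, a single item with score 40 arriving in bucket 10 contributes 20 to PImpact in bucket 100, one half-life later.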
A bias exists for companies with a bigger market capitalization
because they are covered more frequently in the news. Smaller companies
within a stock market index receive less press coverage, and therefore there
are fewer data points to work with.

2.4.2 Long–short discrete optimization model based on SSD

Let $X \subset \mathbb{R}^n$ denote the set of feasible portfolios and assume that X is
a bounded convex polytope. A portfolio $x^*$ is said to be SSD-efficient if there
is no feasible portfolio $x \in X$ such that $R_x \succ_{\mathrm{ssd}} R_{x^*}$.
Recently proposed portfolio optimization models based on the concept of
SSD assume that a reference (benchmark) distribution $R^{\mathrm{ref}}$ is available. Let $\hat{\tau}$
be the tails of the benchmark distribution at confidence levels (1/S, . . . , S/S);
that is,

\[
\hat{\tau} = (\hat{\tau}_1, \ldots, \hat{\tau}_S) = \left( \mathrm{Tail}_{1/S}\, R^{\mathrm{ref}}, \ldots, \mathrm{Tail}_{S/S}\, R^{\mathrm{ref}} \right).
\]

Assuming equiprobable scenarios as in Roman et al. (2006, 2013) and
Fábián et al. (2011a, 2011b), the model in Fábián et al. (2011b) optimizes
the worst difference between the “scaled” tails of the benchmark and of the
return distribution of the solution portfolio; the “scaled” tail is defined as
$(1/\beta)\,\mathrm{Tail}_{\beta}(R)$. The quantity

\[
V = \min_{1 \le s \le S} \frac{S}{s} \left( \mathrm{Tail}_{s/S}(R_x) - \hat{\tau}_s \right)
\]

represents the worst partial achievement of the differences between the scaled
tails of the portfolio return and the scaled tails of the benchmark. The scaled
tails of the benchmark are $\frac{S}{1}\hat{\tau}_1, \frac{S}{2}\hat{\tau}_2, \ldots, \frac{S}{S}\hat{\tau}_S$.
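The tails and the worst partial achievement V can be computed directly from scenario returns. The sketch below assumes equiprobable scenarios and takes $\mathrm{Tail}_{s/S}(R)$ to be the unconditional expectation of the s worst outcomes, that is, the sum of the s smallest returns divided by S; this convention follows the cited SSD literature and is stated here as an assumption. Function names are illustrative.

```python
def tails(returns):
    """Tail_{s/S} for s = 1..S: sum of the s smallest outcomes, divided by S."""
    ordered = sorted(returns)
    n_scen = len(ordered)
    out, running = [], 0.0
    for r in ordered:
        running += r
        out.append(running / n_scen)
    return out


def worst_scaled_achievement(portfolio_returns, benchmark_returns):
    """V = min over s of (S/s) * (Tail_{s/S}(R_x) - tau_hat_s)."""
    n_scen = len(portfolio_returns)
    tail_x = tails(portfolio_returns)
    tau_hat = tails(benchmark_returns)
    return min(
        (n_scen / s) * (tail_x[s - 1] - tau_hat[s - 1])
        for s in range(1, n_scen + 1)
    )
```

A nonnegative V means that every tail of the portfolio is at least as good as the corresponding benchmark tail, which is the SSD dominance condition in tail form. For example, a portfolio whose scenario returns exceed the benchmark's by 0.01 in every scenario gives V = 0.01.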
When short-selling is allowed, the amount available for purchases of stocks
in long positions is increased. Suppose we borrow from an intermediary a
specified number of units of asset i (i = 1, . . . , n), corresponding to a
proportion $x_i^-$ of capital. We sell them immediately in the market and hence
have a cash sum of $\left(1 + \sum_{i=1}^{n} x_i^-\right) C$ to invest in long positions, where C is
the initial capital available.
In long–short practice, it is common to fix the total amount of short-selling
to a pre-specified proportion α of the initial capital. In this case, the
amount available to invest in long positions is (1 + α)C. A fund that limits its
short exposure to a proportion α = 0.2 is usually referred to as a 120/20 fund.
To model this situation, to each asset i ∈ {1, . . . , n} we assign two continuous
nonnegative decision variables $x_i^+$, $x_i^-$, representing the proportions invested
in long and short positions in asset i, and two binary variables $z_i^+$, $z_i^-$ that
indicate whether there is investment in long or short positions in asset i.
For example, if 10% of the capital is shorted in asset i, we write this as
$x_i^+ = 0$, $x_i^- = 0.1$, $z_i^+ = 0$, $z_i^- = 1$. We also introduce a decision variable V,
defined as above (worst partial achievement).
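Using a cutting-plane representation (Fábián et al. 2011a), the scaled
long/short formulation of the achievement-maximization problem is written as

\begin{align*}
\max\quad & V \tag{2.4} \\
\text{subject to}\quad & \sum_{i=1}^{n} x_i^+ = 1 + \alpha \tag{2.5} \\
& \sum_{i=1}^{n} x_i^- = \alpha \tag{2.6} \\
& x_i^+ \le (1 + \alpha)\, z_i^+ \quad \forall i \in N \tag{2.7} \\
& x_i^- \le \alpha\, z_i^- \quad \forall i \in N \tag{2.8} \\
& z_i^+ + z_i^- \le 1 \quad \forall i \in N \tag{2.9} \\
& \frac{s}{S} V + \hat{\tau}_s \le \frac{1}{S} \sum_{i=1}^{n} \sum_{j \in J_s} r_{ij} \left( x_i^+ - x_i^- \right)
  \quad \forall J_s \subset \{1, \ldots, S\},\ |J_s| = s,\ s = 1, \ldots, S \tag{2.10} \\
& V \in \mathbb{R}, \quad x_i^+, x_i^- \in \mathbb{R}_+, \quad z_i^+, z_i^- \in \{0, 1\} \quad \forall i \in N \tag{2.11}
\end{align*}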
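To make the role of constraints (2.5) through (2.9) concrete, the sketch below checks a candidate long/short allocation against them. It is an illustrative feasibility check only, not part of the solver described in this chapter; the function name, the tolerance, and the three-asset example are assumptions.

```python
def violated_constraints(x_plus, x_minus, z_plus, z_minus, alpha, tol=1e-9):
    """Return the labels of constraints (2.5)-(2.9) violated by a candidate solution."""
    violated = []
    if abs(sum(x_plus) - (1.0 + alpha)) > tol:
        violated.append("2.5")                   # long side must sum to 1 + alpha
    if abs(sum(x_minus) - alpha) > tol:
        violated.append("2.6")                   # short side must sum to alpha
    for xp, xm, zp, zm in zip(x_plus, x_minus, z_plus, z_minus):
        if xp > (1.0 + alpha) * zp + tol and "2.7" not in violated:
            violated.append("2.7")               # long position requires z+ = 1
        if xm > alpha * zm + tol and "2.8" not in violated:
            violated.append("2.8")               # short position requires z- = 1
        if zp + zm > 1 and "2.9" not in violated:
            violated.append("2.9")               # an asset is not both long and short
    return violated


# A hypothetical 120/20 allocation over three assets.
ok = violated_constraints(
    x_plus=[0.7, 0.5, 0.0], x_minus=[0.0, 0.0, 0.2],
    z_plus=[1, 1, 0], z_minus=[0, 0, 1], alpha=0.2,
)
```

In the example, 120% of capital is held long in the first two assets and 20% is held short in the third, so no constraint is violated.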

2.4.3 Revision of reference distribution


In recent research, the most common approach is to set the benchmark
as the return distribution of a financial index. This is natural since discrete
approximations for this choice can be directly obtained from publicly avail-
able historical data, and also due to the meaningfulness of interpretation; the
common practice is to compare the performance of a portfolio with the perfor-
mance of an index. Applying the SSD criterion, we may construct a portfolio
that dominates a chosen index yet there is no guarantee that this portfolio
will have the desirable properties that an informed investor is looking for. Set
against this background, we propose a method of reshaping a given reference
distribution and computing a synthetic (improved) reference distribution. It may
not be possible to SSD dominate such a reference distribution; in these cases,
the closest SSD efficient portfolio is constructed by our model (see Roman
et al., 2006). To clarify what we mean by improved reference distribution,
let us consider the upper (light grey) density curve of the original reference
distribution in Figure 2.2. In this example, the reference distribution is nearly
symmetrical and has a considerably long left-tail. The lower (black) curve in
Figure 2.2 represents the density curve of what we consider to be an improved
reference distribution. Desirable properties include a shorter left tail (reduced
probability of large losses), and a higher expected return which translates into
higher skewness. A smaller standard deviation is not necessarily desirable, as
it might limit the upside potential of high returns. Instead, we require the
standard deviation of the new distribution to be within a specified range from
the standard deviation of the original distribution.
We would like to transform the original reference distribution into a syn-
thetic reference distribution given target values for the first three statistical
moments (mean, standard deviation, and skewness). In a recent paper, we
developed an approximate method of finding the three moments of the
improved distribution. This method solves a system of nonlinear equations
using Newton–Raphson iterations. For a full description of this approach,
see Valle et al. (2017).
FIGURE 2.2: Density curves for the original and improved reference distributions. [Plot: density (vertical axis, 0.00 to 0.25) against values (horizontal axis, −20 to 20); legend: improved distribution, reference distribution.]

In a recent research project, we have focused on this problem further.
Essentially, our aim is to find a reference distribution with target values
$\mu_T$, $\sigma_T$, and $\gamma_T$ for the mean, standard deviation, and skewness. We
standardize our observed index distribution and perform a shift toward the
target values. Since we wish to increase the upside potential, this can be
stated as maximizing the skewness, that is, the third moment, subject to
constraints. This is formulated as a nonlinear optimization problem (NLP). The
proposed approach (Mitra et al. 2017) solves an NLP in which the constraints
link the three moments of the index distribution (these are known and are
therefore the parameters of the NLP model) to those of the target distribution
(these are the decision variables). The solution of the NLP leads to an
improved reference distribution. This in turn is utilized within the SSD
optimization to achieve an enhanced portfolio performance.
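The standardize-and-shift step can be illustrated as follows. This sketch covers only the first two moments: an affine transformation with a positive scale moves the mean and standard deviation to their targets but leaves the skewness unchanged, which is why the skewness target requires the nonlinear methods of the cited papers. The function names and the use of population (rather than sample) moments are assumptions.

```python
import statistics


def moments(values):
    """Return (mean, population standard deviation, skewness) of a sample."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    skew = sum(((v - mu) / sigma) ** 3 for v in values) / len(values)
    return mu, sigma, skew


def shift_to_targets(values, mu_target, sigma_target):
    """Standardize the observations, then rescale to the target mean and deviation."""
    mu, sigma, _ = moments(values)
    return [mu_target + sigma_target * (v - mu) / sigma for v in values]
```

Applying `shift_to_targets` to a set of index returns yields a distribution with exactly the target mean and standard deviation and the original skewness.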

2.5 Trading Strategies


In this study, we present three different trading strategies. The first strat-
egy will be the base case which takes into account all available assets and forms
the portfolio composition based on the SSD methodology stated above. The
second and third strategies also employ the SSD methodology, but restrict
the available asset universe. In the following, we briefly describe the
strategies used.
2.5.1 Base strategy


2.5.1.1 Strategy A: Full asset universe
This strategy uses the SSD concept directly. Whenever new portfolio
weights are needed, we call the SSD optimization to determine the optimal
portfolio weights. The computation of these weights depends on recent market
data as well as on the market regime.
Generally, all assets in the index are considered potential portfolio
constituents, and the SSD optimization decides which assets will be in the
portfolio (chosen either long or short) and which assets are not considered.

2.5.2 Using relative strength index as a filter


2.5.2.1 Strategy B: Asset filter relative strength index
Our second trading strategy restricts the asset universe for our SSD tool.
We employ the relative strength index (RSI), a technical indicator that is
standard in the technical analysis of financial assets. RSI was introduced
by J. Welles Wilder (1978) and has since gained in popularity with technical
traders. It is a momentum indicator that analyzes the performance of an asset
over a lookback period, typically 14 days, and is based on the ratio of average
gains to average losses. It highlights overbought or oversold assets and can
signal an imminent market reversal.
The RSI takes on values between 0 and 100 and is calculated through

\[
\mathit{RSI} = 100 - \frac{100}{1 + \mathit{RS}}
\]

where RS denotes the relative strength and is computed by dividing the
average gains by the average losses, that is,

\[
\mathit{RS} = \frac{\mathit{EMA}(\mathit{Gain}_t)}{\mathit{EMA}(\mathit{Loss}_t)}
\]

Here, we calculate the averages as exponential moving averages. In fact, the
average loss is computed as

\[
\mathit{EMA}(\mathit{Loss}_t) = \frac{1}{n} \sum_{i=0}^{n} \alpha(i)\, L_{t-i}
\]

where $\alpha(i) = \lambda e^{-\lambda i}$ is the exponential weight, $X_t$ is the price of the asset, and
$L_t = (X_t - X_{t-\mathit{days}})\, I_{\{X_t - X_{t-\mathit{days}} < 0\}}$ is the loss at time t, with days an offset
of a chosen number of days. Analogously, we calculate the average gain as

\[
\mathit{EMA}(\mathit{Gain}_t) = \frac{1}{n} \sum_{i=0}^{n} \alpha(i)\, G_{t-i},
\]

where $G_t = (X_t - X_{t-\mathit{days}})\, I_{\{X_t - X_{t-\mathit{days}} \ge 0\}}$ is the gain at time t. Typical RSI
values are calculated with average gains and losses over a period of 14 days,
often called the lookback period. In the literature, the RSI is considered to
highlight overbought assets, that is, when the RSI is above the threshold of
70, and oversold assets, that is, when it is below the threshold of 30.
In our application, we compute the RSI value for each available asset and
flag assets as potential long candidates, if the RSI is below 30 and as potential
short candidates, if the RSI is above 70. If, on the other hand, the RSI value
is between 30 and 70, the asset is not considered to be a portfolio constituent.
By doing so, we restrict the available asset universe compared to the above
base-case strategy A.
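The RSI filter of strategy B can be sketched as below. Several choices here are assumptions rather than the authors' settings: the decay parameter `lam = 0.1`, the one-day price offset, the use of loss magnitudes (so that RS is nonnegative, as in the conventional RSI definition), and the conventions RSI = 100 when there are no losses and RSI = 50 when prices are flat. The common factor 1/n cancels in the ratio RS and is omitted.

```python
import math


def rsi(prices, n=14, lam=0.1, days=1):
    """RSI at the latest price, using exponentially weighted gains and losses."""
    diffs = [prices[t] - prices[t - days] for t in range(days, len(prices))]
    recent = diffs[::-1][: n + 1]                # i = 0..n, most recent first
    weights = [lam * math.exp(-lam * i) for i in range(len(recent))]
    gain = sum(w * max(d, 0.0) for w, d in zip(weights, recent))
    loss = sum(w * max(-d, 0.0) for w, d in zip(weights, recent))
    if loss == 0.0:
        return 100.0 if gain > 0.0 else 50.0     # assumed conventions
    rs = gain / loss
    return 100.0 - 100.0 / (1.0 + rs)


def candidate_side(rsi_value, low=30.0, high=70.0):
    """Flag an asset as a long or short candidate, or exclude it (None)."""
    if rsi_value < low:
        return "long"                            # oversold
    if rsi_value > high:
        return "short"                           # overbought
    return None
```

An asset whose price has only risen over the lookback period has RSI = 100 and is flagged as a short candidate; one whose price has only fallen has RSI = 0 and is flagged as a long candidate.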

2.5.3 Using relative strength index and impact as filters


2.5.3.1 Strategy C: Asset filter relative strength index and impact
For this strategy, we further restrict the available asset universe. As a first
step, we apply the asset filter as in strategy B, which gives us two potential
sets of stocks out of which we can choose long and short exposures. In the next
step, we remove a potential long asset if the corresponding news impact is
negative. Analogously, we remove a potential short asset if the corresponding
impact is positive. This strategy therefore combines the momentum strategy
with news sentiment. The news impact gives an additional indication of whether
the RSI indicator reflects the current market information. If the current market
sentiment seems to contradict the indicator, the asset is not included in the
potential asset universe. The impact reflects the current state of the market
and therefore further improves the stock filtering step.
Since we only remove assets as potential portfolio constituents, the asset
universe for strategy C is smaller than or equal to the asset universe for
strategy B. The reduction of the asset universe is particularly vital for a
large asset universe, for example, the components of the Nikkei 225. Reducing
the number of assets further accelerates the determination of the portfolio
composition. Further aspects of momentum and value strategies including news
sentiment are discussed in Valle and Mitra (2013).
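The two-step filter of strategy C can be sketched as follows. The inputs are hypothetical: a mapping from asset name to RSI value and a mapping from asset name to a single signed news impact (for instance, the sum of PImpact and NImpact, which is an assumption about how the two scores are netted). Assets with no news are treated as having zero impact and are kept, which is also an assumption.

```python
def strategy_c_universe(rsi_by_asset, impact_by_asset, low=30.0, high=70.0):
    """Return (long_candidates, short_candidates) after the RSI and impact filters."""
    longs, shorts = [], []
    for asset, value in rsi_by_asset.items():
        impact = impact_by_asset.get(asset, 0.0)
        if value < low and impact >= 0.0:
            longs.append(asset)                  # oversold, sentiment not negative
        elif value > high and impact <= 0.0:
            shorts.append(asset)                 # overbought, sentiment not positive
    return longs, shorts
```

Because the second step only removes candidates, the resulting universe is never larger than that of strategy B.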

2.5.4 A dynamic strategy using money management


A major performance criterion in the fund management industry is the
measurement of “maximum drawdown” and “days to recovery.” Thorp and his
associates formally analyzed the problems of gambling (Thorp, 1966) and stock
market trading (Thorp and Kassouf, 1967) and derived strategies which are
“winning propositions.” The central plank of these is the celebrated “Kelly
Strategy (KS).” Luenberger (1997) presents a lucid explanation of this and
coined the term “volatility pumping.” Whereas SSD is an otherwise static
single-period asset allocation model, we have adopted a simplified version of
the KS and apply it to construct our daily trading signals. As a consequence,
we are able to control dynamic risk measures such as drawdown. This
we have termed “money management.” Thus, portfolios are rebalanced using
the SSD criterion and also applying the principle of money management. Since
our SSD model has long–short positions in risky assets, we also determine the
long and short exposures dynamically, while staying within regulatory regimes
such as “Regulation T” stipulated by the US regulators. Thus, if the strategy
is (100 + alpha)/alpha, that is, (100 + alpha) long and (alpha) short
exposures, then we simply control this adaptively by a limit (alpha-max) such
that (alpha) ≤ (alpha-max). The actual settings of the money management
parameter and the parameter “alpha-max” in our experiments are discussed
in Section 2.7, where we report the results of our experiments.
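The idea behind volatility pumping can be illustrated with a stylized two-outcome asset. The numbers below (an asset that doubles or halves each period with equal probability) are a textbook-style illustration and are not the settings used in our experiments; the helper `capped_alpha` is a hypothetical rendering of the limit (alpha) ≤ (alpha-max).

```python
import math


def expected_log_growth(fraction, up=2.0, down=0.5, p_up=0.5):
    """Expected log growth per period when `fraction` of wealth is rebalanced
    into the risky asset each period and the rest is held in cash."""
    growth_up = 1.0 + fraction * (up - 1.0)
    growth_down = 1.0 + fraction * (down - 1.0)
    return p_up * math.log(growth_up) + (1.0 - p_up) * math.log(growth_down)


def capped_alpha(alpha, alpha_max):
    """Keep the short exposure within [0, alpha_max]."""
    return min(max(alpha, 0.0), alpha_max)
```

Holding the asset outright (`fraction = 1.0`) gives zero expected log growth, whereas rebalancing to a 50/50 mix each period gives 0.5 ln 1.125, roughly 0.059 per period; for this asset the 50/50 mix is also the growth-optimal (Kelly) fraction. Holding less than the full position both raises long-run growth and reduces the size of individual losses, which is the mechanism that money management exploits.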

2.6 Solution Method and Processing Requirement


2.6.1 Solution of LP and SIP models
The asset allocation decision involves solving a single-stage integer stochas-
tic program based on a set of S discrete scenarios (the formulation introduced
in Section 2.4.2). Given n assets, the model requires 2n binary decision vari-
ables and 2n continuous variables (200 binary and 200 continuous for the
components of FTSE100, for instance). As such, the model is relatively small.
However, if we consider all the cuts which are dynamically added (see Equation
2.10), the model size grows exponentially, as a huge number of constraints
are generated: one for every subset of S with cardinality s = 1, . . . , S.
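To give a sense of scale, the number of candidate constraints of this type can be counted directly. The short Python sketch below is illustrative only (the function name is ours and is not part of the implementation described in this chapter). It sums the number of scenario subsets of each cardinality and checks the result against the closed form 2^S − 1.

```python
import math


def count_ssd_cuts(num_scenarios: int) -> int:
    """Number of non-empty scenario subsets, i.e. one potential cut per subset."""
    return sum(math.comb(num_scenarios, s) for s in range(1, num_scenarios + 1))


# Small case: the sum over all cardinalities equals 2**S - 1.
assert count_ssd_cuts(10) == 2**10 - 1

# For S = 1200, the value used later in this chapter, the count is far too
# large to enumerate; here we only report how many decimal digits it has.
digits = len(str(2**1200 - 1))
print(digits)
```

Even for a moderate number of scenarios the full constraint set cannot be written out explicitly, which is why the cuts have to be generated on demand.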
We have therefore designed the algorithm that solves this formulation set
out in Equations 2.4 through 2.11 to process the model rapidly and efficiently.
This is particularly important because our trade signals are generated for
real-time streaming data. If the trade signals are delayed, the prices on which
they are based may no longer be available when the trades are executed. Moreover,
in a simulation environment (back testing), we need to rerun the strategy
repeatedly, whenever a rebalance is required, and so a simple simulation can
be very time-consuming.
To overcome the difficulties caused by the exponential number of con-
straints (2.10), we generate these constraints dynamically in situ. This strat-
egy ensures that the actual number of cuts generated to solve the problem
is much less than its theoretical upper limit. We have further implemented
a branch-and-cut algorithm (Padberg and Rinaldi 1991), which extends the
basic branch-and-bound strategy by generating new cuts prior to branching
on a partial solution.
We begin processing the model with a single constraint (2.10), since otherwise
the formulation is unbounded, as V can take an infinite value. At every node
of the branch-and-bound tree, we analyze the linear relaxation solution found
at that node to check whether any constraint (2.10) has been violated. For
that, we employ the (polynomial) separation algorithm proposed by Fábián
et al. (2011a), which works as follows.
Given a candidate solution, we sort the scenarios in ascending order of
portfolio return (from worst to best). Then, for s = 1, . . . , S, we compute the
value of the logical variables corresponding to the constraint (2.10) for a set Js
composed of the s worst scenarios. We then add the most violated constraint (if
any exists) to the model and resolve that node. A solution is only accepted as a
candidate for branching whenever it satisfies all the constraints of type (2.10).
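The separation step just described can be sketched in a few lines of Python. This is a minimal illustration, not the C++ implementation used for the experiments. In particular, the exact form of constraint (2.10) is not reproduced in this section, so the sketch assumes a cut of the form (1/S) Σ_{j∈Js} r_j ≥ τ_s + V, where r_j is the portfolio return in scenario j and τ_s is a benchmark tail value; the names `benchmark_tails` and `most_violated_cut` are hypothetical.

```python
from typing import List, Optional, Tuple


def most_violated_cut(
    scenario_returns: List[float],
    benchmark_tails: List[float],
    v: float,
    tol: float = 1e-9,
) -> Optional[Tuple[float, List[int]]]:
    """Return (violation, scenario indices J_s) of the most violated cut, or None.

    Assumed cut form (a stand-in for constraint (2.10)):
        (1/S) * sum_{j in J_s} return_j >= benchmark_tails[s - 1] + v
    """
    num_scenarios = len(scenario_returns)
    # Sort scenario indices from worst to best portfolio return.
    order = sorted(range(num_scenarios), key=lambda j: scenario_returns[j])
    best = None
    running_sum = 0.0
    for s in range(1, num_scenarios + 1):
        # J_s consists of the s worst scenarios, so a running sum suffices.
        running_sum += scenario_returns[order[s - 1]]
        violation = v + benchmark_tails[s - 1] - running_sum / num_scenarios
        if violation > tol and (best is None or violation > best[0]):
            best = (violation, order[:s])
    return best


# Toy data with four scenarios (illustrative numbers only).
returns = [0.02, -0.01, 0.03, -0.04]
tails = [-0.005, -0.01, -0.01, -0.005]
result = most_violated_cut(returns, tails, v=0.0)
print(result)
```

For a fixed cardinality s, the subset with the smallest total return is the set of the s worst scenarios, which is why a single sort followed by one pass over s = 1, . . . , S is enough and the procedure runs in polynomial time.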
The branch-and-cut algorithm is written in C++ and the back testing
framework is written in R (2016). We have used CPLEX 12.6 (see IBM ILOG
CPLEX 2016) as the mixed-integer programming solver. For the tests reported
in this paper, we impose a time limit of 30 seconds each time we process a
problem instance; that is, if within 30 seconds the algorithm is unable to prove
optimality, we retrieve the best integer solution found so far.
In order to give an indication of the computational effort required, we
report below the processing times of back testing (simulation) runs for two
data sets, FTSE 100 and NIKKEI 225 (containing 100 and 225 assets,
respectively). In each case, the asset
universe is composed of the constituents of the index and the index future
(also included as a tradable asset). The experiments were run on an Intel(R)
Core(TM) i7-3770@3.40 GHz with 8 GB RAM.
The simulation adopts successive rebalancing over time at a specified fre-
quency (for instance every 5 days). We set S = 1200 and we run the simula-
tion throughout a seven-year period and the computational times are shown
in Table 2.2.
The number of trading days differs between FTSE 100 and NIKKEI 225
due to different holidays in each market. For FTSE 100, 307 rebalances are
computed whenever a 5-day rebalancing frequency is selected, and 1531 if the
rebalance frequency is every day, which is one less than the total number of
trading days as we close all portfolio positions in the last day.
For the FTSE 100, the time required to perform the simulation was about
32 minutes for a 5-day rebalancing frequency and about 155 minutes for daily
rebalancing. Naturally, NIKKEI 225 requires more computational time due to
larger models, at roughly 5 hours for daily rebalancing. The vast majority of
rebalances were solved within the time limit set: all for FTSE 100 and about
94% for NIKKEI 225. On average, each FTSE 100 rebalance was solved in 6
seconds and each NIKKEI 225 rebalance in about 11.5 seconds.
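The averages quoted above follow directly from the figures in Table 2.2. The following Python sketch simply redoes that arithmetic; the dictionary layout and variable names are ours.

```python
# (asset universe, rebalancing frequency in days) ->
#     (number of rebalances, CPU time in seconds, rebalances solved within 30 s)
runs = {
    ("FTSE100", 5): (307, 1903.4, 307),
    ("FTSE100", 1): (1531, 9339.1, 1531),
    ("NIKKEI225", 5): (308, 3544.5, 287),
    ("NIKKEI225", 1): (1535, 17624.5, 1445),
}

summary = {}
for key, (rebalances, cpu_seconds, solved) in runs.items():
    summary[key] = {
        "minutes": cpu_seconds / 60.0,
        "seconds_per_rebalance": cpu_seconds / rebalances,
        "share_solved": solved / rebalances,
    }

for (universe, frequency), stats in summary.items():
    print(
        f"{universe:10s} every {frequency} day(s): "
        f"{stats['minutes']:.1f} min in total, "
        f"{stats['seconds_per_rebalance']:.1f} s per rebalance, "
        f"{100 * stats['share_solved']:.1f}% solved within the limit"
    )
```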
Given these results, we believe that the algorithm described here is
appropriate for back testing, virtual trading, and finally live trading.

TABLE 2.2: Backtesting
Back testing period (both asset universes): January 1, 2010 – October 26, 2016

Asset       Number of      Rebalancing        Number of    CPU        Rebalances solved
universe    trading days   frequency (days)   rebalances   time (s)   within 30 s
FTSE100     1532           5                  307          1903.4     307
                           1                  1531         9339.1     1531
NIKKEI225   1536           5                  308          3544.5     287
                           1                  1535         17624.5    1445
2.6.2 Scale-Up to process larger models


In the example above, we have considered a universe of up to 225 assets and
a set of 1200 discrete scenarios. Naturally, the model can grow substantially
if:

• A larger asset universe is considered, such as components of the S&P 500
US or S&P1200 Global indices.

• A higher number of scenarios are taken into account, especially if we
consider higher-frequency trading where Δt is in minutes or seconds instead
of daily.

In a higher-frequency setting, we may also be unable to afford a 30 second
rebalance time as decisions must be made quickly. As higher performance
becomes a priority, a natural alternative is to consider parallelizing the
branch-and-cut algorithm described earlier. Branch-and-cut combines
branch-and-bound and cutting-plane techniques; we describe the challenges
in parallelizing them below.
One of the most critical points in a branch-and-bound algorithm is the
fathoming of tree branches—which aims to reduce the (exponential) search
space of enumerated solutions. Fathoming depends on finding strong lower
and upper bounds. In a maximization problem, a lower bound represents the
best-known solution for the problem so far and an upper bound represents a
limit on the value of the global optimal solution. The upper bound is obtained
by solving linear relaxations of the problem—that is, the original formulation
without integrality constraints.
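The interplay of the two bounds can be illustrated on a much simpler maximization problem than the SSD model. The Python sketch below applies best-first branch-and-bound to a small 0/1 knapsack instance, chosen here purely as a stand-in: the upper bound comes from the linear (fractional) relaxation, the lower bound is the best integer solution found so far, and a node is fathomed when its upper bound cannot beat the lower bound. All names are illustrative.

```python
import heapq
from typing import List, Tuple


def knapsack_branch_and_bound(
    values: List[int], weights: List[int], capacity: int
) -> Tuple[int, int]:
    """Maximize total value within capacity; returns (best value, nodes fathomed)."""
    n = len(values)
    # Items sorted by value density, as required by the fractional relaxation.
    order = sorted(range(n), key=lambda i: values[i] / weights[i], reverse=True)

    def upper_bound(level: int, value: int, room: int) -> float:
        # Linear relaxation: fill greedily, taking the last item fractionally.
        bound = float(value)
        for i in order[level:]:
            if weights[i] <= room:
                room -= weights[i]
                bound += values[i]
            else:
                bound += values[i] * room / weights[i]
                break
        return bound

    best_value = 0  # lower bound: best integer solution known so far
    fathomed = 0
    # Priority queue keyed on the negated upper bound (best-first search).
    queue = [(-upper_bound(0, 0, capacity), 0, 0, capacity)]
    while queue:
        neg_bound, level, value, room = heapq.heappop(queue)
        if -neg_bound <= best_value:
            fathomed += 1  # this subtree cannot improve on the incumbent
            continue
        if level == n:
            continue
        item = order[level]
        if weights[item] <= room:  # branch 1: include the item
            taken = value + values[item]
            best_value = max(best_value, taken)
            left = room - weights[item]
            heapq.heappush(queue, (-upper_bound(level + 1, taken, left), level + 1, taken, left))
        # branch 2: exclude the item
        heapq.heappush(queue, (-upper_bound(level + 1, value, room), level + 1, value, room))
    return best_value, fathomed


best, pruned = knapsack_branch_and_bound([60, 100, 120], [10, 20, 30], 50)
print(best, pruned)
```

The tighter the two bounds, the earlier the test `-neg_bound <= best_value` succeeds and the fewer nodes have to be expanded.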
Nodes in the branch-and-bound tree are placed in a priority queue and
solved sequentially and independently. This is a natural setting for paral-
lelization. The challenge however is sharing, among concurrent nodes, the
information on upper and lower bounds in order to reduce the amount of
redundant work. For instance, an improved lower bound found after solving one
node could allow another node to be fathomed. If that node is already being
solved by a concurrent processor, processing power is wasted. Some redundant
work is unavoidable, but a parallel branch-and-bound must reduce these
inefficiencies.
The generation of cuts in the cutting-plane part of the algorithm is another
critical factor. We remind the reader that the full description of the SSD for-
mulation presented in Section 2.4.2 requires an exponential number of con-
straints (2.10), and it is effectively impossible to have them all represented
in memory. As such we start with a relaxed version of the problem and, at
each node, we run the separation algorithm described earlier to search for any
violated constraint(s) that must be included in the model. Any cut identified
in one node is valid for all other nodes in the priority queue.
When we have concurrency, parallel nodes may waste computational power
in finding redundant cuts; a lower bound cannot be accepted if it violates a
constraint (2.10), as that lower bound is not valid for the original model. So
a parallel node may have to run many executions of the separation algorithm
until it finds no violated cut for that linear relaxation solution. Perhaps, if
that node was executed after a previous one, the cut found before could have
prevented several of these executions.
Thus, in an efficient implementation of a branch-and-cut algorithm, parallel
processors share a collection of information comprising current bounds,
pending node priorities, and violated cuts. This is a suitable setting for the
classical master-slaves, or centralized control, strategy. In this case there would
be a dedicated master process handling the queue, fathoming nodes, updating
priorities and controlling cuts, all based on the arrival of asynchronous infor-
mation. Slave processes would be responsible for solving the highest priority
pending tree nodes. Thus, the master process maintains global knowledge and
controls the entire search, while slave processes receive pending nodes from
the master processor, solve their linear relaxations and attempt to find new
violated cuts. Finally, the slave returns information to the master processor.
The higher the number of cuts in the model, the slower is the resolution
of a linear relaxation in a particular node. In a branch-and-cut setting, the
master processor should also maintain a cut pool with a list of previously
found cuts. The master processor must handle a few tasks, listed below:

• Check the cut pool in order to identify “tight” and “slack” cuts, that
is, the master processor must identify cuts that are likely (unlikely) to
change the value of a linear relaxation solution.

• Choose which cuts should be provided to a slave processor. A properly
implemented cut pool may prevent very large and time-consuming linear
relaxations.

• Receive newly identified cuts from slave processors and add them to the
pool.
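The three tasks listed above can be sketched as a small cut pool data structure. This is a hypothetical illustration in Python rather than a description of any existing implementation: cuts are assumed to be generic linear inequalities a·x ≥ b, and the class and method names (`Cut`, `CutPool`, `select_for_worker`) are ours.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Cut:
    """Linear cut sum_i coeffs[i] * x[i] >= rhs (generic form, assumed here)."""

    coeffs: Tuple[float, ...]
    rhs: float

    def slack(self, x: Sequence[float]) -> float:
        """Left-hand side minus right-hand side; zero means the cut is tight."""
        return sum(a * xi for a, xi in zip(self.coeffs, x)) - self.rhs


class CutPool:
    """Pool of previously found cuts, maintained by the master process."""

    def __init__(self) -> None:
        self._cuts: List[Cut] = []
        self._seen: Set[Cut] = set()

    def add(self, cut: Cut) -> bool:
        """Store a cut reported by a worker; duplicates are ignored."""
        if cut in self._seen:
            return False
        self._seen.add(cut)
        self._cuts.append(cut)
        return True

    def tight(self, x: Sequence[float], tol: float = 1e-9) -> List[Cut]:
        """Cuts that are binding (or violated) at the relaxation solution x."""
        return [c for c in self._cuts if c.slack(x) <= tol]

    def select_for_worker(self, x: Sequence[float], max_cuts: int) -> List[Cut]:
        """Send only the cuts with the least slack, keeping relaxations small."""
        return sorted(self._cuts, key=lambda c: c.slack(x))[:max_cuts]

    def __len__(self) -> int:
        return len(self._cuts)


pool = CutPool()
first = Cut((1.0, 1.0), 1.0)
second = Cut((1.0, 0.0), 0.2)
pool.add(first)
pool.add(second)
duplicate_added = pool.add(Cut((1.0, 1.0), 1.0))  # same cut reported twice
x = (0.5, 0.5)
print(len(pool), pool.tight(x), pool.select_for_worker(x, max_cuts=1))
```

Ranking cuts by slack is only one possible selection policy; an implementation could equally track how often each cut has been binding in recent relaxations.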

Overall, some redundant work is unavoidable: a node may be solved before
another processor realizes it could be fathomed, or a node may search for
violated constraints which would not be necessary if it had information about
other previously identified constraints. However, if a master processor properly
implements the policies described above, the branch-and-cut implementation
could benefit from parallelization and be able to solve larger models. Other
strategies for parallelization may also be considered; a more thorough
discussion can be found in Chapters 1 and 3 of Talbi (2006).

2.7 Results
We use real-world historical daily data (adjusted closing prices) taken
from the universe of assets defined by the Financial Times Stock Exchange
100 (FTSE100) index over the period October 9, 2008 to November 1, 2016
(1765 trading days). The data were collected from Thomson Reuters Data
Stream platform and adjusted to account for changes in index composition.
This means that our models use no more data than was available at the time,
removing susceptibility to the influence of survivor bias. For each asset, we
compute the corresponding daily rates of return. The original benchmark dis-
tribution is obtained by considering the historical daily rates of return of
FTSE100 during the same time period.
The methodology we adopt is successive rebalancing over time with recent
historical data as scenarios. We start from the beginning of our data set.
Given an in-sample duration of S days, we decide a portfolio using data taken
from an in-sample period corresponding to the first S + 1 days (yielding S
daily returns for each asset). The portfolio is then held unchanged for an
out-of-sample period of 5 days. We then rebalance (change) our portfolio, but
now using the most recent S returns as in-sample data. The decided portfolio
is then again held unchanged for an out-of-sample period of 5 days, and the
process repeats until we have exhausted all of the data. We set S = 1200; the
total out-of-sample period spans slightly more than 6 years (October 1, 2010
to October 26, 2016).
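The rolling scheme can be made concrete with a short Python sketch. It is an illustration under our own indexing convention (returns numbered from 0, hypothetical function name), not the back testing framework itself: each rebalance uses the most recent S returns as scenarios and the resulting portfolio is held for the next 5 returns.

```python
from typing import List, Tuple


def rebalance_windows(
    num_returns: int, in_sample: int, holding: int
) -> List[Tuple[int, int, int]]:
    """Return (in_sample_start, decision_index, hold_end) index triples.

    Returns are indexed 0 .. num_returns - 1. The portfolio decided at index t
    uses returns [t - in_sample, t) and is held over returns [t, hold_end).
    """
    windows = []
    for t in range(in_sample, num_returns, holding):
        windows.append((t - in_sample, t, min(t + holding, num_returns)))
    return windows


# S = 1200 in-sample returns followed by 1532 out-of-sample returns,
# with a 5-day holding period (values taken from the text).
windows = rebalance_windows(num_returns=1200 + 1532, in_sample=1200, holding=5)
print(len(windows), windows[0], windows[-1])
```

With these settings the sketch produces 307 windows, which agrees with the number of rebalances reported for the FTSE 100 at a 5-day frequency.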
Once the data have been exhausted we have a time series of 1532 portfolio
return values for out-of-sample performance, here from period 1201 (the first
out-of-sample return value, corresponding to January 1, 2010) until the end
of the data.
Portfolios are rebalanced every 5 days; for each experiment we therefore solve
307 instances of the long–short SSD formulation described in Section 2.4.2, each
corresponding to a single rebalance.
For every experiment, we set α = 0.2, that is, portfolios can have a long–
short exposure of up to 120/20. The strategies below also apply a money
management technique. That is, every day, the percentage of the portfolio
mark-to-market value invested in risky assets is fixed at 75%, the remaining
25% being invested in a risk-free investment of 2% a year. Hence, the SSD
strategy itself is rebalanced every 5 days in order to bring the portfolio to
desired proportions.
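A minimal Python sketch of this money management rule follows. It makes several assumptions that are not stated in the text and should be read as such: the 75/25 split is restored at the start of every day, the 2% annual rate is converted to a daily rate geometrically, and a year has 252 trading days. The function name and parameters are hypothetical.

```python
from typing import List, Sequence


def money_managed_path(
    risky_returns: Sequence[float],
    risky_fraction: float = 0.75,
    rf_annual: float = 0.02,
    days_per_year: int = 252,  # assumption: trading days per year
) -> List[float]:
    """Portfolio value path when a fixed fraction is held in the risky portfolio."""
    # Assumption: geometric conversion of the annual risk-free rate.
    rf_daily = (1.0 + rf_annual) ** (1.0 / days_per_year) - 1.0
    value = 1.0
    path = [value]
    for r in risky_returns:
        # Daily reset to the target split: 75% earns r, 25% earns rf_daily.
        value *= 1.0 + risky_fraction * r + (1.0 - risky_fraction) * rf_daily
        path.append(value)
    return path


# A 4% one-day gain on the risky portfolio followed by a 4% loss.
path = money_managed_path([0.04, -0.04])
print(path)
```

Because only 75% of the portfolio is exposed to the risky assets, both gains and losses of the SSD portfolio are scaled down, which is how the rule dampens drawdowns.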
Figure 2.3 shows portfolio paths for the three different strategies A, B, and
C as well as the Financial Times Stock Exchange 100 Index (FTSE100) over
the period from October 1, 2010 to October 26, 2016. The strategies all invest
in a subset of the companies listed on the FTSE100, where the actual asset
universe is defined by the asset universe filter stated above. The FTSE100
index is shown in solid black; the strategies A, B, and C are shown in dashed
dark-grey, dashed light-grey and solid light-grey, respectively.
All strategies outperform the FTSE100 index in the period considered.
We can also see that strategies B and C, where we filter the asset universe,
outperformed strategy A. Table 2.3 shows selected performance statistics,
namely:

FIGURE 2.3: Portfolio strategies (portfolio values over time, 2012 to 2016,
for FTSE100 and Strategies A, B, and C).


TABLE 2.3: Performance measures

             Final   Excess    Sharpe   Sortino   Max draw-   Max rec.          Av.
Portfolio    value   RFR (%)   ratio    ratio     down (%)    days       Beta   turnover
FTSE100      1.24     1.68     0.10     0.14      22.06       481
Strategy A   1.88     9.01     0.70     1.01      13.40       264        0.44   19.02
Strategy B   2.47    14.03     0.99     1.48      11.40       165        0.57   34.42
Strategy C   2.51    14.37     1.00     1.48      14.24       125        0.58   34.26

• Final value: Normalized final value of the portfolio at the end of the
out-of-sample period.

• Excess over RFR (%): Annualized excess return over the risk-free rate.
For FTSE100, we used a yearly risk-free rate of 2%.

• Sharpe ratio: Annualized Sharpe ratio of returns.

• Sortino ratio: Annualized Sortino ratio of returns.

• Max drawdown (%): Maximum peak-to-trough decline (as percentage of
the peak value) during the entire out-of-sample period.

• Max recovery days: Maximum number of days for the portfolio to recover
to the value of a former peak.

• Beta: Portfolio beta when compared to the FTSE100 index.

• Average turnover: Average turnover per day as a percentage of portfolio
mark-to-market.
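Several of the measures listed above can be computed from a series of portfolio values or daily returns in a few lines. The Python sketch below is one possible reading of the definitions, not the code used to produce Table 2.3. The conventions are assumptions: 252 trading days per year for annualization, the sample standard deviation in the Sharpe ratio, a downside deviation taken relative to the risk-free rate in the Sortino ratio, and recovery periods counted only once the former peak has been regained.

```python
import math
import statistics
from typing import Sequence


def max_drawdown(values: Sequence[float]) -> float:
    """Largest peak-to-trough decline as a fraction of the peak value."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


def max_recovery_days(values: Sequence[float]) -> int:
    """Longest completed spell between a peak and the day it is regained."""
    peak, peak_day, longest, under_water = values[0], 0, 0, False
    for day, v in enumerate(values):
        if v >= peak:
            if under_water:
                longest = max(longest, day - peak_day)
            peak, peak_day, under_water = v, day, False
        else:
            under_water = True
    return longest


def sharpe_ratio(returns: Sequence[float], rf_daily: float = 0.0, days: int = 252) -> float:
    """Annualized mean excess return divided by the sample standard deviation."""
    excess = [r - rf_daily for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(days)


def sortino_ratio(returns: Sequence[float], rf_daily: float = 0.0, days: int = 252) -> float:
    """As the Sharpe ratio, but penalizing only returns below the risk-free rate."""
    excess = [r - rf_daily for r in returns]
    downside = math.sqrt(statistics.mean(min(0.0, e) ** 2 for e in excess))
    return statistics.mean(excess) / downside * math.sqrt(days)


values = [1.0, 1.2, 0.9, 1.0, 1.3, 1.1, 1.3]
daily = [0.01, -0.01, 0.02, 0.0]
print(max_drawdown(values), max_recovery_days(values))
print(sharpe_ratio(daily), sortino_ratio(daily))
```

Other conventions are equally common, for example counting an unrecovered drawdown at the end of the sample towards the recovery period, so figures computed this way need not match Table 2.3 exactly.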
Both strategies B and C had higher returns and quicker recovery rates
when compared to strategy A. Strategies B and C have very similar statistics.
Overall, reducing the asset universe via strategies B and C allows us to improve
returns and reduce our risk exposure. However, that comes at a cost of a higher
correlation to the market itself (a higher beta) and a higher average turnover.
The latter is due to the asset universe in two consecutive rebalances being
potentially different; in such cases we may need to liquidate current positions
in assets that are not included in the current asset universe.
Our back testing results for NIKKEI 225 are in line with the FTSE results
stated above. Strategies B and C lead to improved performance measures; the
combination of technical analysis and news sentiment results in the desired
portfolio properties.

2.8 Conclusion and Discussions


For the FTSE 100 and for the 7-year back testing period under consid-
eration, the three strategies described earlier lead to progressive improve-
ment on the performance of a passive index fund. In our model, we apply
the SSD criterion to find an optimal static portfolio. For this, the actual
data sets of constituents of the FTSE100 index are used as an input to
our optimization system. The SSD optimization finds an optimal portfolio
with regard to a reference distribution derived from the underlying index.
This revision of the return distribution leads to an improved portfolio with
a shorter left tail. Money management-based rebalancing is then applied to
control the maximum drawdown of the derived portfolio. We investigated
three trading strategies: the first gives all available index constituents to the
SSD optimization; in the second strategy, the available asset universe for
SSD optimization in terms of long–short assets was limited to assets with an
RSI value above/below certain thresholds. Furthermore, news sentiment was
introduced in the third strategy to filter out assets whose sentiment
contradicts the RSI choices. The third strategy performed best for most of
our key figures. An enhancement of SSD portfolio choices through sentiment
data feeds is achieved in this combined static and temporal setting. We have
planned further refinement of our strategies which include the detection of
market regimes. We propose to use the identification of bull or bear regime
to derive a market exposure strategy which assigns limits to the long–short
partition.

Acknowledgments
We gratefully acknowledge the contribution of Tilman Sayer to this
work. While working at OptiRisk, Tilman participated in this project and
contributed to the development of some of the trading strategies reported in
this paper.

References
Bawa, V. S. Optimal rules for ordering uncertain prospects. Journal of Financial
Economics, 2(1):95–121, 1975.

Beasley, J. E., Meade, N., and Chang, T. J. An evolutionary heuristic for the index
tracking problem. European Journal of Operational Research, 148(3):621–643,
2003.

Canakgoz, N. A. and Beasley, J. E. Mixed-integer programming approaches for index
tracking and enhanced indexation. European Journal of Operational Research,
196(1):384–399, 2008.

Dentcheva, D. and Ruszczyński, A. Optimization with stochastic dominance
constraints. SIAM Journal on Optimization, 14(2):548–566, 2003.

Dentcheva, D. and Ruszczyński, A. Portfolio optimization with stochastic dominance
constraints. Journal of Banking & Finance, 30(2):433–451, 2006.

Dentcheva, D. and Ruszczyński, A. Inverse cutting plane methods for optimization
problems with second order stochastic dominance constraints. Optimization: A
Journal of Mathematical Programming and Operations Research, 59(3):323–338,
2010.

DiBartolomeo, D. The Enhanced Index Fund as an Alternative to Indexed Equity
Management. Northfield Information Services, Boston, 2000.

Fábián, C. I., Mitra, G., and Roman, D. Processing second-order stochastic domi-
nance models using cutting-plane representations. Mathematical Programming,
130(1):33–57, 2011a.

Fábián, C. I., Mitra, G., and Roman, D. An enhanced model for portfolio choice with
SSD criteria: A constructive approach. Quantitative Finance, 11(10):1525–1534,
2011b.

Fábián, C. I., Mitra, G., and Roman, D. Portfolio choice models based on second-
order stochastic dominance measures: An overview and a computational study.
In: Bertocchi M., Consigli G., Dempster M. (eds). Stochastic Optimization Meth-
ods in Finance and Energy. International Series in Operations Research & Man-
agement Science, Springer, New York, NY. vol. 163 pp. 441–469, 2011c.

Fama, E. F. Efficient capital markets: A review of theory and empirical work. The
Journal of Finance, 25(2):383–417, 1970.

Hadar, J. and Russell, W. R. Rules for ordering uncertain prospects. The American
Economic Review, 59(1):25–34, 1969.
Hodder, J. E., Jackwerth, J. C., and Kolokolova, O. Improved portfolio choice using
second-order stochastic dominance. Review of Finance, 19(1):1623–1647, 2015.

IBM ILOG CPLEX Optimizer. Available from [Link]
commerce/optimization/cplex-optimizer/, last accessed March 5, 2017. 2016.

Javanmardi, L. and Lawryshyn, Y. A new rank dependent utility approach to model
risk averse preferences in portfolio optimization. Annals of Operations Research,
237(1):161–176, 2016.

Kopa, M. and Post, T. A general test for SSD portfolio efficiency. OR Spectrum,
37(1):703–734, 2015.

Kuosmanen, T. Efficient diversification according to stochastic dominance criteria.
Management Science, 50(10):1390–1406, 2004.

Levy, H. Stochastic dominance and expected utility: Survey and analysis. Manage-
ment Science, 38(4):555–593, 1992.

Lo, A. W. The adaptive markets hypothesis. The Journal of Portfolio Management,
30(5):15–29, 2004.

Luenberger, D. G. Investment Science. Oxford University Press, USA, 1997. ISBN:
9780195108095.

Mitra, L. and Mitra, G. Applications of news analytics in finance: A review. In: The
Handbook of News Analytics in Finance, Chapter 1, John Wiley & Sons, 2011.

Mitra, L., Mitra, G., and diBartolomeo, D. Equity portfolio risk (volatility) estimation
using market information and sentiment. Quantitative Finance, 9(8):887–895,
2009.

Mitra, G., Erlwein-Sayer, C., and Roman, D. Revision of benchmark distributions to
enhance portfolio choice by the Second Order Stochastic Dominance criterion.
White paper under preparation, OptiRisk Systems, 2017.

Padberg, M. and Rinaldi, G. A branch-and-cut algorithm for the resolution of
large-scale symmetric traveling salesman problems. SIAM Review, 33:60–100, 1991.

Patton, A. J. and Verardo, M. Does beta move with news? Firm-specific informa-
tion flows and learning about profitability. The Review of Financial Studies,
25(9):2789–2839, 2012.

Post, T. Empirical tests for stochastic dominance efficiency. The Journal of Finance,
58(5):1905–1931, 2003.

Post, T., Fang, Y., and Kopa, M. Linear tests for DARA stochastic dominance.
Management Science, 61(1):1615–1629, 2015.

Post, T. and Kopa, M. General linear formulations of stochastic dominance criteria.
European Journal of Operational Research, 230(2):321–332, 2013.

R Core Team. R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria. Available at [Link]
[Link]/, last accessed March 5, 2017.

Roman, D., Darby-Dowman, K., and Mitra, G. Portfolio construction based on
stochastic dominance and target return distributions. Mathematical
Programming, 108(2):541–569, 2006.

Roman, D., Mitra, G., and Zverovich, V. Enhanced indexation based on second-order
stochastic dominance. European Journal of Operational Research, 228(1):273–
281, 2013.

Scowcroft, A. and Sefton, J. Enhanced indexation. In: Satchell, S. and Scowcroft,
A. (eds.) Advances in Portfolio Construction and Implementation,
Butterworth-Heinemann, London, pp. 95–124, 2003.

Talbi, E. G. Parallel Combinatorial Optimization. John Wiley & Sons, USA, 2006.

Thorp, E. O. Beat the Dealer: A Winning Strategy for the Game of Twenty-One.
Random House, New York, NY, 1966.

Thorp, E. O. and Kassouf, S. T. Beat the Market: A Scientific Stock Market System.
Random House, New York, NY, 1967.

Valle, C. and Mitra, G. News Analytics Toolkit User Manual, OptiRisk Systems,
London, UK, 2004. available online: [Link]
NAToolkit User [Link].

Valle, C. A., Roman, D., and Mitra, G. Novel approaches for portfolio construction
using second order stochastic dominance. Computational Management Science,
DOI 10.1007/s10287-017-0274-9, 2017.

Welles Wilder, J. New Concepts in Technical Trading Systems. Trend Research, UK,
1978.
Chapter 3
The Alpha Engine: Designing an
Automated Trading Algorithm

Anton Golub, James B. Glattfelder, and Richard B. Olsen

CONTENTS
3.1 Introduction
3.1.1 Asset management
3.1.2 The foreign exchange market
3.1.3 The rewards and challenges of automated trading
3.1.4 The hallmarks of profitable trading
3.2 In a Nutshell: Trading Model Anatomy and Performance
3.3 Guided by an Event-Based Framework
3.3.1 The first step: Intrinsic time
3.3.2 The emergence of scaling laws
3.3.3 Trading models and complexity
3.3.4 Coastline trading
3.3.5 Novel insights from information theory
3.3.6 The final pieces of the puzzle
3.4 The Nuts and Bolts: A Summary of the Alpha Engine
3.5 Conclusion and Outlook
Appendix 3A A History of Ideas
Appendix 3B Supplementary Material
References

3.1 Introduction
The asset management industry is one of the largest industries in modern
society. Its relevance is documented by the astonishing amount of assets that
are managed. It is estimated that globally there are 64 trillion USD under
management [1]. This is nearly as big as the world product of 77 trillion
USD [2].

3.1.1 Asset management


Asset managers use a mix of analytic methods to manage their funds.
They combine different approaches from fundamental to technical analysis.
The time frames range from intraday, to days and weeks, and even months.
Technical analysis, a phenomenological approach, is utilized widely as a toolkit
to build trading strategies.
A drawback of all such methodologies is, however, the absence of a con-
sistent and overarching framework. What appears as a systematic approach
to asset management often boils down to gut feeling, as the manager chooses
from a broad blend of theories with different interpretations. For instance, the
choice and configuration of indicators is subject to the specific preference of
the analyst or trader. In effect, practitioners mostly apply ad hoc rules which
are not embedded in a broader context. Complex phenomena such as changing
liquidity levels as a function of time go unattended.
This lack of consensus, or intellectual coherence, in such a dominant and
relevant industry underpinning our whole society is striking. In a day and
age where computational power and digital storage capacities are growing
exponentially, at shrinking costs, and where there exists an abundance of
machine learning algorithms and big data techniques, one would expect a
more unified and comprehensive methodological and theoretical framework.
To illustrate, consider the recent unexpected success of Google’s AlphaGo
algorithm beating the best human players [3]. This is a remarkable feat for a
computer, as the game of Go is notoriously complex and players often report
that they select moves based solely on intuition.
There is, however, one exception in the asset management and trading
industry that relies fully on algorithmic trade generation and automated
execution. Referred to under the umbrella term “high-frequency trading,” this
approach has witnessed substantial growth. These strategies take advantage of
short-term arbitrage opportunities and typically analyze the limit order books
to jump the queue whenever there are large orders pending [4]. While
high-frequency trading results in high trade volumes, the assets managed with
these types of strategies are around 140 billion [5]. This is microscopic
compared to the size of the global assets under management.

3.1.2 The foreign exchange market


For the development of our trading model algorithm, and the evaluation
of the statistical price properties, we focus on the foreign exchange market.
This market can be characterized as a complex network consisting of interact-
ing agents: corporations, institutional and retail traders, and brokers trading
through market makers, who themselves form an intricate web of interde-
pendence. With an average daily turnover of approximately 5 trillion USD
[6] and with price changes nearly every second, the foreign exchange mar-
ket offers a unique opportunity to analyze the functioning of a highly liquid,
over-the-counter market that is not constrained by specific exchange-based
rules. These markets are an order of magnitude bigger than futures or equity
markets [7].
In contrast to other financial markets, where asset prices are quoted in
reference to specific currencies, exchange rates are symmetric: quotes are
currencies in reference to other currencies. The symmetry of one currency
against another neutralizes the effects of trend, which is a significant driver
in other markets, such as stock markets. This property of symmetry makes currency
markets notoriously hard to trade profitably.
We focus on the foreign exchange market for the development of our trad-
ing model algorithm. Its high liquidity and long/short symmetry make it an
ideal environment for the research and development of fully automated and
algorithmic trading strategies. Indeed, any profitable trading algorithm for
this market should, in theory, also be applicable to other markets.

3.1.3 The rewards and challenges of automated trading


During the crisis of 2007 and 2008, the world witnessed how the financial
system destabilized the real economy and destroyed vast amounts of wealth. At
other times, when there are favorable economic conditions, financial markets
contribute to wealth accumulation. The financial system is an integral part of
the real economy with a strong cross dependency. Markets are not a closed
system, where the sum of all profits and losses net out. If investment strategies
contribute to market liquidity, they can help stabilize prices and reduce the
uncertainty in financial markets and the economy at large. For such strategies,
the investment returns can be viewed as a payoff for the value-added provided
to the economy.
Liquid financial markets offer a large profit potential. The length of a
foreign exchange price curve, as measured by the sum of up and down price
movements of increments of 0.05%, during the course of a year, is, on average,
approximately 1600%, after deducting transaction costs [8]. An investor can,
in theory, earn 1600% unleveraged per year, assuming perfect foresight in
exploiting this coastline length. With leverage, the profit potential is even
greater. Obviously, as no investor has perfect foresight, capturing 1600% is
not feasible.
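The notion of a price curve's length can be illustrated with a short Python sketch. It is a simplified reading of the measure described above, not the computation behind the quoted figure: the price path is walked with a moving reference level, every relative move of size δ (up or down) away from that level counts as one increment, and transaction costs are ignored. The function name and the toy prices are ours.

```python
from typing import Sequence


def coastline_length(prices: Sequence[float], delta: float) -> float:
    """Sum of up and down price moves, measured in increments of size delta."""
    if not prices:
        return 0.0
    reference = prices[0]
    steps = 0
    for price in prices[1:]:
        # Count every full upward increment of relative size delta.
        while price >= reference * (1.0 + delta):
            reference *= 1.0 + delta
            steps += 1
        # Count every full downward increment of relative size delta.
        while price <= reference * (1.0 - delta):
            reference *= 1.0 - delta
            steps += 1
    return steps * delta


# Toy path with a 1% threshold: two steps up, one down, one up.
length = coastline_length([100.0, 102.5, 100.0, 103.0], delta=0.01)
print(length)
```

In this toy path the net price change is 3%, yet the coastline measures 4%, because reversals add to the length instead of cancelling out. Applying the same idea with δ = 0.05% to a full year of tick data is what yields lengths of the order quoted in the text.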
However, why do most investment managers have such difficulty in earn-
ing even small returns on a systematic basis, if the profit potential is so big?
Especially as traders can manage their risk with sophisticated money man-
agement rules which have the potential to turn losses into profits. Again, the
question arises as to why hedge funds, who can hire the best talent in the
world, find it so hard to earn consistent annual returns. For instance, the
Barclay Hedge Fund Index,1 measuring the average returns of all hedge funds
(except funds of funds) in their database, reports an average yearly return of
5.035% (±4.752%) for the past 4 years. How can we develop liquidity-providing

1 [Link]/research/indices/ghs/Hedge Fund [Link].



investment algorithms that consistently generate positive and sizable returns?


What is missing in the current paradigm?
Another key criterion of the quality of an investment strategy is the size of
assets that can be deployed without a deterioration of performance. Closely
related to this issue is the requirement that the strategy does not distort
the market dynamics. Trend-following strategies, which are often deployed in
automated trading, are an example of strategies that can distort them. Such
strategies have the disadvantage that the investor does not know for sure how
much his own action of following the trend amplifies it. In effect, the trend
follower can get locked into a position that he cannot close out without
triggering a price dislocation.
Finally, any flavor of automated trading is constrained by the current
computational capacities available to researchers. Although this constraint is
loosening day by day due to the prowess of high-performance computing in
finance, some approaches rely more on number crunching than others. Ide-
ally, any trading model algorithm should be implementable with reasonable
resources to make it useful and applicable in the real world.

3.1.4 The hallmarks of profitable trading


Investment strategies need to be fully automated. For one, the number of
traded instruments should not be constrained by human limitations. Then,
the trading horizons should also include intraday activity, as a condition sine
qua non. Complete automation has its own challenges, because computer code
can go awry and cause huge damage, as witnessed by Knight Capital, which
lost more than 460 million USD in roughly 45 minutes due to an operational error.2
Many modeling attempts fail because developers succumb to curve fitting.
They start with a specific data sample and tweak their model until it makes
money in simulation runs. Such trading models can disappoint from the start
when going live or boast good performance only for some period of time until
a regime shift occurs and the specific conditions the model was optimized for
(i.e., curve fitted) disappear and losses are incurred.
Trading models need to be parsimonious and have a limited set of vari-
ables. If the models have too many variables, the parameter space becomes
vast and hard to navigate. Parsimonious models are powerful, because they
are easier to calibrate and assess, and it is easier to understand why they perform. Moreover,
investment models need to be robust to market changes. For instance, the
models can be adaptive and have their behavior depend on the current mar-
ket regime. Therefore, algorithmic investment strategies have to be developed
on the basis of robust and consistent approaches and methods that provide a
solid framework of analysis.
Financial markets are composed of a large number of traders that take
positions on different time horizons. Agent-based models can mimic the actual

2 [Link]/news/press-release/2013-222.

traders and are therefore well suited to research market behavior [9]. If
agent-based models are fractal, that is, behave in a self-similar manner across
time horizons and only differ with respect to the scaling of their parameters,
the short-term models are a filter for the validity of the long-term models. In
practice, this allows for the short-term agent-based models to be tested and
validated over a huge data sample with a multitude of events. As a result,
the scarcity of data available for the long-term model is not a hindrance to its
acceptance if it is self-similar with respect to the short-term models. In effect,
the validation of the model structure for short-term models also implies a
validation for the long-term models, by virtue of the scaling effects. In con-
trast, most standard modeling approaches are typically devised for one time
horizon only and hence there are no self-similar models that complement each
other.
Moreover, the modeling approach should be modular and enable developers
to combine smaller blocks to build bigger components. In other words, models
are built in a bottom-up spirit, where simple building blocks are assembled into
more complex units. This also implies an information flow between building
blocks.
To summarize, our aim is to develop trading models based on parsimo-
nious, self-similar, modular, and agent-based behavior, designed for multiple
time horizons and not purely driven by trend following action. The intellectual
framework unifying these angles of attack is outlined in Section 3.3. The result
of this endeavor is interacting systems that are highly dynamic, robust, and
adaptive; in other words, a type of trading model that mirrors the dynamic
and complex nature of financial markets. The performance of this automated
trading algorithm is outlined in the following section.
In closing, it should be mentioned that transaction costs can represent real-
world stumbling blocks for trading models. Investment strategies that take
advantage of short-term price movements in order to achieve good performance
have higher transaction volumes than longer term strategies. This obviously
increases the impact of transaction costs on the profitability. As far as possible,
it is advisable to use limit orders to initiate trades. They have the advantage
that the trader does not have to cross the spread to get his order executed, thus
reducing or eliminating transaction costs. The disadvantage of limit orders is,
however, that execution is uncertain and depends on buy and sell interest.

3.2 In a Nutshell: Trading Model Anatomy and Performance

In this section, we provide an overview of the trading model algorithm and
its performance. For all the details on the model, see Section 3.4, and the code
can be downloaded from GitHub [10].

The Alpha Engine is a counter-trending trading model algorithm that provides
liquidity by opening a position when markets overshoot and manages
positions by cascading and de-cascading during the evolution of the long
coastline of prices, until it closes at a profit. The building blocks of the trading
model are as follows:

• An endogenous time scale called intrinsic time that dissects the price
curve into directional changes and overshoots;

• Patterns called scaling laws that hold over several orders of magnitude,
providing an analytical relationship between price overshoots and
directional change reversals;

• Coastline trading agents operating at intrinsic events, defined by the
event-based language;

• A probability indicator that determines the sizing of positions by
identifying periods of market activity that deviate from normal behavior;

• Skewing of cascading and de-cascading designed to mitigate the
accumulation of large inventory sizes during trending markets; and

• The splitting of directional change and, consequently, overshoot
thresholds into upward and downward components, that is, the introduction of
asymmetric thresholds.

The trading model is backtested on historical data comprising 23 exchange rates:

AUD/JPY, AUD/NZD, AUD/USD, CAD/JPY, CHF/JPY, EUR/AUD,
EUR/CAD, EUR/CHF, EUR/GBP, EUR/JPY, EUR/NZD, EUR/USD,
GBP/AUD, GBP/CAD, GBP/CHF, GBP/JPY, GBP/USD, NZD/CAD,
NZD/JPY, NZD/USD, USD/CAD, USD/CHF, USD/JPY.

The chosen time period is from the beginning of 2006 until the beginning
of 2014, that is, 8 years. The trading model yields an unlevered return of
21.3401%, with an annual Sharpe ratio of 3.06, and a maximum drawdown
(computed on a daily basis) of 0.7079%. This drawdown occurs at the beginning
of 2013 and lasts approximately 4 months, as the JPY weakens significantly
following the Quantitative Easing program (“three arrows” of fiscal stimulus)
launched by the Bank of Japan.
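To make the reported statistics concrete, the following sketch shows one common way to compute an annualized Sharpe ratio and a maximum drawdown from a series of daily returns. The function names, the 252-day annualization, the zero risk-free rate, and the additive (non-compounded) profit and loss curve are assumptions made for illustration; the chapter does not state which conventions were used.

```python
import math


def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    var = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


def max_drawdown(daily_returns):
    """Largest peak-to-trough drop of the cumulative (additive) P&L curve."""
    equity = peak = worst = 0.0
    for r in daily_returns:
        equity += r
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


# Made-up daily returns, for illustration only.
returns = [0.001, -0.002, 0.0015, -0.0005, 0.002]
print(annualized_sharpe(returns), max_drawdown(returns))
```

Computing the drawdown on daily values, as the text does, can understate intraday drawdowns, because only end-of-day levels of the curve are compared.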
Figure 3.1 shows the performance of the trading model across all exchange
rates. Table 3A.1 reports the monthly and yearly returns. The difference in
returns among the various exchange rates is explained by volatility: the trading
model reacts only to occurrences of intrinsic time events, which are function-
ally dependent on volatility. Exchange rates with higher volatility will have a
greater number of intrinsic events and hence more opportunities for the model
to extract profits from the market. This behavior can be witnessed during the

[Figure 3.1: two panels with a time axis from 2006 to 2014 and vertical axes in percent (0% to 20% and 0% to 3%); the legend lists the 23 currency pairs.]

FIGURE 3.1: Daily Profit & Loss of the Alpha Engine, across 23 currency
pairs, for 8 years. See details in the main text of this section and Section 3.4.

financial crisis, where its deleterious effects are somewhat counterbalanced by
an overall increase in profitable trading behavior of the model, fueled by the
increase in volatility.
The variability in performance of the individual currency pairs can be
addressed by calibrating the “aggressiveness” of the model with respect to
the volatility of the exchange rate. In other words, the model trades more
frequently when the volatility is low and vice versa. For the sake of simplic-
ity, and to avoid potential overfitting, we have excluded these adjustments to
the model. In addition, we also refrained from implementing cross-correlation
measures. By assessing the behavior of the model for one currency pair, infor-
mation can be gained that could be utilized as an indicator which affects the
model’s behavior for other exchange rates. Finally, we have also not imple-
mented any risk management tools.
In essence, what we present here is a proof of concept. We refrained
from tweaking the model to yield better performance, in order to clearly
establish and outline the model’s building blocks and fundamental behavior.
We strongly believe there is great potential for obvious and straightforward
improvements, which would give rise to far better models. Nevertheless, the
bare-bones model we present here already has the capability of being imple-
mented as a robust and profitable trading model that can be run in real time.
With a leverage factor of 10, the model experiences a drawdown of 7.08%

while yielding an average yearly profit of 10.05% for the last 4 years. This is
still far from realizing the coastline’s potential, but, in our opinion, a crucial
first step in the right direction.
Finally, we conclude this section by noting that, despite conventional wis-
dom, it is in fact possible to “beat” a random walk. The Alpha Engine pro-
duces profitable results even on time series generated by a random walk,
as seen in Figure 3B.1. This unexpected feature results from the fact that
the model is dissecting Brownian motion into intrinsic time events. Now
these directional changes and overshoots yield a novel context, where a cas-
cading event is more likely to be followed by a de-cascading event than
another cascading one. In detail, the probability of reaching the profitable
de-cascading event after a cascade is $1 - e^{-1} \approx 0.63$, while the probability
for an additional cascade is about 0.37. In effect, the procedure of trans-
lating a tick-by-tick time series into intrinsic time events skews the odds in
one’s favor—for empirical as well as synthetic time series. For details, see
Reference 11.
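The 0.63 versus 0.37 split can be checked numerically under a simplifying assumption: a driftless random walk that, starting from a directional change, either extends its running extreme by a further δ (an overshoot) or first retraces δ from that extreme (a reversal). Reading "reversal first" as the de-cascading case and "overshoot first" as the additional cascade is an interpretation of the text, and the function name and parameters below are hypothetical; Reference 11 gives the actual derivation.

```python
import random


def overshoot_before_reversal(steps_per_delta=20, trials=5000, seed=1):
    """Estimate the chance that a driftless random walk extends its running
    extreme by a further delta before retracing delta from that extreme."""
    rng = random.Random(seed)
    overshoots = 0
    for _ in range(trials):
        price = extreme = 0  # measured in steps from the directional change
        while True:
            price += 1 if rng.random() < 0.5 else -1
            extreme = max(extreme, price)
            if extreme >= steps_per_delta:          # trend continued by delta
                overshoots += 1
                break
            if extreme - price >= steps_per_delta:  # retraced delta: reversal
                break
    return overshoots / trials


p = overshoot_before_reversal()
print(f"overshoot first: {p:.3f}, reversal first: {1 - p:.3f}")
```

For a walk with n steps per δ the exact probability of an overshoot is (n/(n+1))^n, which tends to e^{-1} as the step size shrinks, so the estimate should land near 0.37 up to sampling noise and a small discretization bias.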
In the following section, we will embark on the journey that would ulti-
mately result in the trading model described above. For a prehistory of events,
see Appendix 3A.

3.3 Guided by an Event-Based Framework


The trading model algorithm outlined in the last section is the result of a
long journey that began in the early 1980s. Starting with a new conceptual
framework of time, this voyage set out to chart new terrain. The whole history
of this endeavor is described in Appendix 3A. In the following, the key elements
of this new paradigm are highlighted.

3.3.1 The first step: Intrinsic time


We all experience time as a fundamental and unshakable part of reality. In
stark contrast, the philosophy of time and the notion of time in fundamental
physics challenge our mundane perception of it. In an operational definition,
time is simply what instruments measure and register. In this vein, we under-
stand the passage of time in financial time series as a set of events, that is,
system interactions.
In this novel time ontology, time ceases to exist between events. In contrast
to the continuity of physical time, now only interactions, or events, let a
system’s clock tick. Hence, this new methodology is called intrinsic time [12].
This event-based approach opens the door to a modeling framework that yields
self-referential behavior which does not rely on static building blocks and has
a dynamic frame of reference.

Implicit in this definition is the threshold for the measurement of events.
At different resolutions, the same price series reveals different characteristics.
In essence, intrinsic time increases the signal-to-noise ratio in a time series by
filtering out the irrelevant information between events. This dissection of price
curves into events is an operator, mapping a time series x(t) into a discrete
set of events Ω[x(t), δ], given the directional change threshold δ.
We focus on two types of events that represent ticks of intrinsic time:
1. A directional change δ [8,13–16];

2. An overshoot ω [8,13,15].
With these events, every price curve can be dissected into components that
represent a change in the price trend (directional change) and a trend com-
ponent (overshoot). For a directional change to be detected, first an initial
direction mode needs to be chosen. As an example, in an up mode an increas-
ing price move will result in the extremal price being updated and continuously
increased. If the price goes down, the difference between the extremal price
and the current price is evaluated. If this distance (in percent) exceeds the pre-
defined directional change threshold, a directional change is registered. Now
the mode is switched to down and the algorithm continues correspondingly.
If the price then continues to move in the direction of the directional
change by the size of the threshold, an overshoot event is registered. As long
as a trend persists, overshoot events will be registered. See Figure 3.2a for
an illustration. Note that two intrinsic time series will synchronize after one
directional change, regardless of the chosen starting direction.
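The procedure just described can be sketched in a few lines. This is a minimal illustration and not the code published with the chapter: it works on a plain list of mid prices, measures moves as relative changes, and registers overshoots only after the first directional change. The function name `dissect` and the event tuples are hypothetical.

```python
def dissect(prices, delta, start_mode=1):
    """Map a price series to intrinsic events.

    Returns a list of (index, kind, direction) tuples, where kind is "DC"
    (directional change) or "OS" (overshoot) and direction is +1 or -1.
    """
    events = []
    mode = start_mode      # +1: tracking a maximum, -1: tracking a minimum
    extreme = prices[0]
    reference = None       # level of the last event; None before the first DC
    for i, p in enumerate(prices):
        if mode * (p - extreme) > 0:
            extreme = p    # trend continues: update the extremal price
            # register one overshoot for every further move of size delta
            while reference is not None and mode * (p / reference - 1) >= delta:
                reference *= 1 + mode * delta
                events.append((i, "OS", mode))
        elif mode * (extreme - p) / extreme >= delta:
            mode = -mode   # reversal of at least delta: directional change
            extreme = reference = p
            events.append((i, "DC", mode))
    return events


prices = [100, 101, 102, 100.9, 99.8, 98.7, 99.8]
print(dissect(prices, delta=0.01))
```

In this toy series the drop from 102 to 100.9 registers a downward directional change, the two further 1% declines register overshoots, and the rebound to 99.8 registers an upward directional change.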
As a result, a price curve is now comprised of segments, made up of a direc-
tional change event δ and one or more overshoots of size ω. This event-based

[Figure 3.2: (a) schematic of a price move marking a directional change (DC) and the overshoots (OS) that follow; (b) mid price, between roughly 1.34 and 1.40, plotted against event number from 0 to 70.]

FIGURE 3.2: (a) Directional change and overshoot events. (b) A coastline
representation of the EUR USD price curve (2008-12-14 22:10:56 to 2008-12-16
21:58:20) defined by a directional change threshold δ = 0.25%. The triangles
represent directional change and the bullets overshoot events.

[Figure 3.3: the original price curve in physical time and its coastline in intrinsic (event) time for directional change thresholds of 0.2%, 0.4%, and 0.8%, with directional changes (DC) and overshoots (OS) marked.]

FIGURE 3.3: Coastline representation of a price curve for various directional
change thresholds δ.

time series is called the coastline, defined for a specific directional change
threshold. By measuring the various coastlines for an array of thresholds,
multiple levels of event activity can be considered. See Figures 3.2b and 3.3.
This transformed time series is now the raw material for further investigations
[8]. In particular, this price curve will be used as input for the trading model,
as described in Section 3.3.4. With the publication [17], the first decade came
to a close.

3.3.2 The emergence of scaling laws


A validation for the introduction of intrinsic time is that this event-
based framework uncovers statistical properties otherwise not detectable in
the price curves, for instance, scaling laws. Scaling-law relations characterize
an immense number of natural processes, prominently in the form of

1. Scaling-law distributions;
2. Scale-free networks; and

3. Cumulative relations of stochastic processes.

Scaling-law relations display scale invariance because scaling the function’s
argument x preserves the shape of the function f(x) [18]. Measurements of
scaling-law processes yield values distributed across an enormous dynamic
range, and for any section analyzed, the proportion of small to large events
stays constant.
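For the simplest case of a power law, the scale invariance mentioned above can be written out explicitly. The constants a, k, and c are generic symbols introduced here for illustration:

```latex
f(x) = a\,x^{k}
\quad\Longrightarrow\quad
f(c\,x) = a\,(c\,x)^{k} = c^{k} f(x) \propto f(x).
```

Rescaling the argument by c changes only the prefactor, so on a log-log plot f is a straight line of slope k at every scale.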

Scaling-law distributions have been observed in an extraordinarily wide
range of natural phenomena: from physics, biology, earth and planetary
sciences, economics and finance, computer science and demography to the social
sciences [19–22]. Although scaling law distributions imply that small occur-
rences are extremely common, whereas large instances are rare, these large
events occur nevertheless much more frequently compared to a normal proba-
bility distribution. Hence, scaling-law distributions are said to have “fat tails.”
The discovery of scale-free networks [23,24], where the degree distributions
of nodes follow a scaling-law distribution, was a seminal finding advancing the
study of complex networks [25]. Scale-free networks are characterized by high
robustness against random failure of nodes, but are susceptible to coordinated
attacks on the hubs.
Scaling-law relations also appear in collections of random variables. Promi-
nent empirical examples are financial time series, where one finds scaling laws
governing the relationship between various observed quantities [16,17,26]. The
introduction of the event-based framework led to the discovery of a series
of new scaling relations in the cumulative relations of properties in foreign
exchange time series [8]. In detail, of the 18 novel scaling-law relations (of
which 12 are independent), 11 relate to directional changes and overshoots.
One notable observation was that, on average, a directional change δ is
followed by an overshoot ω of the same magnitude

ω ≈ δ. (3.1)

This justifies the procedure of dissecting the price curve into directional change
and overshoot segments of the same size, as seen in Figures 3.2 and 3.3. In
other words, the notion of the coastline is statistically validated.
Scaling laws are a hallmark of complexity and complex systems. They can
be viewed as a universal “law of nature” underlying complex behavior in all
its domains.

3.3.3 Trading models and complexity


A complex system is understood as being comprised of many interacting
or interconnected parts. A characteristic feature of such systems is that the
whole often exhibits properties not obvious from the properties of the indi-
vidual parts. This is called emergence. In other words, a key issue is how
the macro behavior emerges from the interactions of the system’s elements
at the micro level. Moreover, complex systems also exhibit a high level of
resilience, adaptability, and self-organization. Complex systems are usually
found in socio-economic, biological, or physico-chemical domains.
Complex systems are usually very reluctant to be cast into closed-form
analytical expressions. This means that it is generally hard to derive mathe-
matical quantities describing the properties and dynamics of the system under
study. Nonetheless, there has been a long history of attempting to understand
finance from an analytical point of view [27,28].

In contrast, we let our trading model development be guided by the insights
gained by studying complex systems [29]. The single most important feature
is surprisingly subtle:

Macroscopic complexity is the result of simple rules of interaction at
the micro level.

In other words, what looks like complex behavior from a distance turns out
to be the result of simple rules at closer inspection. The profundity of this
observation should not be underestimated, as echoed in the words of Stephen
Wolfram, when he was first struck by this realization [30, p. 9]:

Indeed, even some of the very simplest programs that I looked at had
behavior that was as complex as anything I had ever seen. It took
me more than a decade to come to terms with this result, and to
realize just how fundamental and far-reaching its consequences are.

By focusing on local rules of interactions in complex systems, the system
can be naturally reduced to a set of agents and a set of functions describing
the interactions between the agents. As a result, networks are the ideal formal
representation of the system. Now the nodes represent the agents and the
links describe their relationship or interaction. In effect, the structure of the
network, i.e., its topology, determines the function of the network.
Indeed, this perspective also highlights the paradigm shift away from math-
ematical models towards algorithmic models, where computations and simu-
lation are performed by computers. In other words, the analytical description
of complex systems is abandoned in favor of algorithms describing the inter-
action of the agents. This approach has given rise to the prominent field of
agent-based modeling [31–33]. The validation of agent-based models is given
by their capability to replicate patterns and behavior seen in real-world com-
plex systems by virtue of agents interacting according to simple rules.
Financial markets can be viewed as the epitome of a human-generated
complex system, where the trading choices of individuals, aggregated in a
market, give rise to a stochastic and highly dynamic price evolution. In this
vein, a long or short position in the market can be understood as an agent. In
detail, a position $p_i$ is comprised of the set $\{\bar{x}_i, \pm g_i\}$, where $\bar{x}_i$ is the current
mid (or entry price) and $\pm g_i$ represents the position size and direction.

3.3.4 Coastline trading


In a next step, we combined the event-based price curve with simple rules of
interactions. This means that the agents interact with the coastline according
to a set of trading rules, yielding coastline traders [34–36]. In a nutshell, the
initialization of new positions and the management of existing positions in
the market are clocked according to the occurrence of directional change or
overshoot events. The essential elements of coastline trading are cascading

[Figure 3.4: schematic with columns labeled Rational, Open, and Managing; annotations include directional change, overshoot, extrema, average overshoot length from scaling laws, exposure, and take profit.]

FIGURE 3.4: Simple rules: The elements of coastline trading. Cascading
and de-cascading trades increase or decrease existing positions, respectively.

[Figure 3.5: price chart annotated with long and short trades (open, increase, take profit, and close); each number corresponds to an independent trading agent.]

FIGURE 3.5: Real-world example of coastline trading.

and de-cascading trades. For the former, an existing position is increased by
some increment at a loss, bringing the average closer to the current price. For
a de-cascading event, an existing position is decreased, realizing a profit. It
is important to note that because position sizes are only ever increased by

the same fixed increments, coastline trading does not represent a Martingale
strategy. In Figures 3.4 and 3.5, examples of such trading rules are shown.
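The bookkeeping behind cascading and de-cascading can be sketched with a small position object. The class and method names are hypothetical, and booking profit against the size-weighted average price is an accounting convention assumed here; the chapter's own implementation may track individual units differently.

```python
from dataclasses import dataclass


@dataclass
class Position:
    direction: int          # +1 for long, -1 for short
    size: float = 0.0
    average: float = 0.0    # size-weighted average entry price
    realized: float = 0.0   # profit booked so far

    def cascade(self, price, unit=1.0):
        """Add one fixed unit at the current price (opens if size is zero)."""
        self.average = (self.average * self.size + price * unit) / (self.size + unit)
        self.size += unit

    def decascade(self, price, unit=1.0):
        """Remove up to one unit and book its profit against the average price."""
        unit = min(unit, self.size)
        self.realized += self.direction * (price - self.average) * unit
        self.size -= unit


pos = Position(direction=+1)
pos.cascade(1.3800)    # open long
pos.cascade(1.3700)    # price fell: add a unit, the average moves to 1.3750
pos.decascade(1.3790)  # price recovered: sell one unit above the average
print(pos)
```

Because every cascade adds the same fixed unit, the position grows linearly with the number of adverse events rather than doubling, which is the distinction from a Martingale strategy drawn in the text.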
With these developments, the second decade drew to a close. Led by the
introduction of event-based time, uncovering scaling law relations, the novel
framework could be embedded in the larger paradigm related to the study of
complex systems. The resulting trading models were, by construction,
automated, agent-based, contrarian, parsimonious, adaptive, self-similar, and
modular. However, there was one crucial ingredient missing to render the models
robust and hence profitable in the long term. And so the journey continued.

3.3.5 Novel insights from information theory


In a normal market regime, where no strong trend can be discerned, the
coastline traders generate consistent returns. By construction, this trading
model algorithm is attuned to directional changes and overshoots. As a result,
so long as markets move in gentle fluctuations, this strategy performs well. In
contrast, during times of strong market trends, the agents tend to build up large
positions which they cannot unload. Consequently, each agent’s inventory
increases in size. As this usually happens over multiple threshold sizes, the
overall resulting model behavior derails.
This challenge, related to trends, led to the incorporation of novel elements
into the trading model design. A new feature, motivated by information theory,
was added. Specifically, a probability indicator was constructed. Equipped
with this new tool, the challenges presented by market trends could now be
tackled. In effect, the likeliness of a current price evolution with respect to a
Brownian motion can be assessed in a quantitative manner.
In the following, we will introduce the probability indicator L. This is an
information theoretic value that measures the unlikeliness of the occurrence of
price trajectories. As always, what is actually analyzed is the price evolution
which is mapped onto the discretized price curve, which results from the event-
based language in combination with the overshoot scaling law. Point-wise
entropy, or surprise, is defined as the entropy for a certain realization of a
random variable. Following [37], we understand the surprise of the event-based
price curve as being related to the transition probability from the current state
$s_i$ to the next intrinsic event $s_j$, i.e., $P(s_i \to s_j)$. In detail, given a directional
change threshold δ, the set of possible events is given by directional changes
or overshoots. In other words, a state at “time” i is given by $s_i \in S = \{\delta, \omega\}$.
Given S, we now can understand all possible transitions as happening in the
stylized network of states seen in Figure 3.6. The evolution of intrinsic time
can progress from a directional change to another directional change or an
overshoot, which, in turn, can transit to another overshoot event ω or back to
a directional change δ.
We define the surprise of the transitions from state $s_i$ to state $s_j$ as

\[ \gamma_{ij} = -\log P(s_i \to s_j), \tag{3.2} \]



FIGURE 3.6: The transition network of states in the event-based representation
of the price trajectories. Directional changes δ and overshoots ω are the
building blocks of the discretized price curve, defining intrinsic time.

which, as mentioned, is the point-wise entropy that is large when the
probability of transitioning from state $s_i$ to state $s_j$ is small and vice versa.
Consequently, the surprise of a price trajectory within a time interval [0, T] that
has experienced K transitions is

\[ \gamma_K^{[0,T]} = \sum_{k=1}^{K} -\log P(s_{i_k} \to s_{i_{k+1}}). \tag{3.3} \]

This is now a measure of the unlikeliness of price trajectories. It is a
path-dependent measurement: two price trajectories exhibiting the same volatility
can have very different surprise values.
Following [11], $H^{(1)}$ denotes the entropy rate associated with the state
transitions and $H^{(2)}$ is the second order of informativeness. Utilizing these
building blocks, the next expression can be defined as

\[ \Delta = \frac{\gamma_K^{[0,T]} - K \cdot H^{(1)}}{\sqrt{K \cdot H^{(2)}}}. \tag{3.4} \]
This is the surprise of a price trajectory, centered by its expected value, that
is, the entropy rate multiplied by the number of transitions, and divided by
the square root of its variance, that is, the second order of informativeness
multiplied by the number of transitions. It can be shown that

\[ \Delta \to \mathcal{N}(0, 1), \quad \text{for } K \to \infty, \tag{3.5} \]

by virtue of the central limit theorem [38]. In other words, for large K, Δ
converges to a normal distribution. Equation 3.4 now allows for the introduction
of our probability indicator L, defined as

\[ L = 1 - \Theta\left( \frac{\gamma_K^{[0,T]} - K \cdot H^{(1)}}{\sqrt{K \cdot H^{(2)}}} \right), \tag{3.6} \]

where Θ is the cumulative distribution function of the standard normal
distribution. Thus an unlikely price trajectory, strongly deviating from a
Brownian motion, leads to a large surprise and hence L ≈ 0. We can now quantify
when markets show normal behavior, where L ≈ 1. Again, the reader is referred
to [11] for more details.
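Equations 3.3 and 3.6 translate directly into code once the transition probabilities and the two constants are available. In the sketch below the function names are hypothetical, the entropy rate and second-order informativeness are passed in as plain numbers, and the values used in the example are placeholders rather than figures taken from [11].

```python
import math


def surprise(transition_probs):
    """Sum of the point-wise entropies -log P(s_i -> s_j) along a path (Eq. 3.3)."""
    return sum(-math.log(p) for p in transition_probs)


def probability_indicator(gamma, k, h1, h2):
    """L = 1 - Theta((gamma - K*H1) / sqrt(K*H2)), following Eq. 3.6.

    Theta is evaluated as the standard normal CDF via the error function.
    """
    z = (gamma - k * h1) / math.sqrt(k * h2)
    theta = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 1.0 - theta


h1, h2 = 0.66, 0.066             # placeholder values, for illustration only
path = [0.63, 0.63, 0.37, 0.63]  # hypothetical transition probabilities
print(probability_indicator(surprise(path), len(path), h1, h2))
```

The indicator decreases monotonically in the surprise, so a run of low-probability transitions pushes L toward zero.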
We now assess how the overshoot event ω should be chosen. The standard
framework for coastline trading dictates that an overshoot event occurs in

the price trajectory when the price moves by δ in the overshoots’ direction
after a directional change. In the context of the probability indicator, we
depart from this procedure and define the overshoots to occur when the price
moves by 2.525729 · δ. This value comes from maximizing the second-order
informativeness $H^{(2)}$ and guarantees maximal variability of the probability
indicator L. For details, see Reference 11.
The probability indicator L can now be used to navigate the trading
models through times of severe market stress. In detail, by slowing down the
increase of the inventory of agents during price overshoots, the overall trading
model’s exposure experiences smaller drawdowns and better risk-adjusted
performance. As a simple example, when an agent cascades, that is, increases its
inventory, the unit size is reduced in times where L starts to approach zero.
For the trading model, the probability indicator is utilized as follows. The
default size for cascading is one unit (lot). If L is smaller than 0.5, this sizing
is reduced to 0.5, and finally if L is smaller than 0.1, then the size is set to 0.1.
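The sizing rule stated above is a simple step function of L. A minimal sketch, with a hypothetical function name, could read:

```python
def cascade_unit(L, default=1.0):
    """Unit size for a cascading trade, given the probability indicator L."""
    if L < 0.1:
        return 0.1      # very unlikely price path: smallest increment
    if L < 0.5:
        return 0.5      # moderately unlikely price path: half size
    return default      # normal market behavior: one full unit


for L in (0.9, 0.3, 0.05):
    print(L, cascade_unit(L))
```

The thresholds are strict, so a value of exactly 0.5 still trades the default unit.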
Implementing the above-mentioned measures allowed the trading model
to safely navigate treacherous terrain, where it derailed in the past. However,
there was still one crucial insight missing, before a successful version of the
Alpha Engine could be designed. This last insight revolves around a subtle
recasting of thresholds which has profound effects on the resulting trading
model performance.

3.3.6 The final pieces of the puzzle


Coming back full circle, the focus was again placed on the nature of the
event-based formalism. By allowing for new degrees of freedom, the trading
model puzzle could be concluded. What before were rigid and static thresholds
are now allowed to breathe, giving rise to asymmetric thresholds and fractional
position changes.
In the context of directional changes and overshoots, an innocuous question
to ask is whether the threshold defining the events should depend on the
direction of the current market. In other words, does it make sense to introduce
a threshold that is a function of the price move direction? Analytically,

\[ \delta \to \begin{cases} \delta_{\text{up}} & \text{for increasing prices;} \\ \delta_{\text{down}} & \text{for decreasing prices.} \end{cases} \tag{3.7} \]
These asymmetric thresholds now register directional changes at different val-
ues of the price curve, depending on the direction of the price movement. As
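a consequence, $\omega = \omega(\delta_{\text{up}}, \delta_{\text{down}})$ denotes the length of the overshoot
corresponding to the new upward and downward directional change thresholds. By
virtue of the overshoot size scaling law,

\[ \langle \omega_{\text{up}} \rangle = \delta_{\text{up}}, \qquad \langle \omega_{\text{down}} \rangle = \delta_{\text{down}}. \tag{3.8} \]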
To illustrate, let $P_t$ be a price curve, modeled as an arithmetic Brownian
motion $B_t$ with trend μ and volatility σ, meaning $dP_t = \mu\,dt + \sigma\,dB_t$. Now the

[Figure 3.7: three contour panels titled “No trend (μ/σ² = 0)”, “Positive trend (μ/σ² >> 0)”, and “Negative trend (μ/σ² << 0)”; the axes are δup and δdown, each running from 0.1% to 0.5%.]

FIGURE 3.7: Monte Carlo simulation of the number of directional changes
N, seen in Equation 3.9, as a function of the asymmetric directional change
thresholds δup and δdown, for a Brownian motion, defined by μ and σ. The
left-hand panel shows a realization with no trend, while the other two panels
have an underlying trend.

expected number of upward and downward directional changes during a time


interval [0, T ] is a function

N = N (δup , δdown , μ, σ, [0, T ]). (3.9)

In Figure 3.7, the result of a Monte Carlo simulation is shown. For the situation
with no trend (left-hand panel), the contour lines are perfect circles.
In other words, along any given circle, the same number of directional
changes is found for the corresponding asymmetric thresholds. Details about
the analytical expressions and the Monte Carlo simulation regarding the num-
ber of directional changes can be found in Reference 39.
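The dependence of N on the two thresholds can be explored with a small simulation. The following is a minimal sketch and not the procedure of Reference 39: it assumes thresholds expressed as absolute price moves, a path that starts in an upward mode, and an Euler discretization of the arithmetic Brownian motion. All function names and default values are illustrative.

```python
import math
import random


def count_directional_changes(prices, delta_up, delta_down):
    """Count directional changes on a path with asymmetric thresholds.

    Assumptions (not from the text): thresholds are absolute price moves
    and the path starts in "up" mode, tracking a running maximum.
    """
    mode_up = True
    extreme = prices[0]
    count = 0
    for p in prices:
        if mode_up:
            if p > extreme:
                extreme = p                      # new running maximum
            elif extreme - p >= delta_down:      # drop of delta_down from the maximum
                mode_up, extreme, count = False, p, count + 1
        else:
            if p < extreme:
                extreme = p                      # new running minimum
            elif p - extreme >= delta_up:        # rise of delta_up from the minimum
                mode_up, extreme, count = True, p, count + 1
    return count


def simulate_abm(mu, sigma, horizon, n_steps, rng):
    """Euler path of dP = mu dt + sigma dB on [0, horizon], starting at 0."""
    dt = horizon / n_steps
    path = [0.0]
    for _ in range(n_steps):
        path.append(path[-1] + rng.gauss(mu * dt, sigma * math.sqrt(dt)))
    return path


def expected_directional_changes(delta_up, delta_down, mu, sigma,
                                 horizon=1.0, n_steps=20000, n_paths=20, seed=0):
    """Monte Carlo estimate of N(delta_up, delta_down, mu, sigma, [0, T])."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_paths):
        path = simulate_abm(mu, sigma, horizon, n_steps, rng)
        total += count_directional_changes(path, delta_up, delta_down)
    return total / n_paths
```

Evaluating `expected_directional_changes` on a grid of (δup, δdown) pairs, once with μ = 0 and once with a nonzero μ, gives a rough numerical counterpart to the panels of Figure 3.7. The step count and number of paths trade accuracy against run time.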
This opens up the space of possibilities, as up to now, only the 45-degree
line in all panels of Figure 3.7 was considered, corresponding to symmetric
thresholds δ = δup = δdown . For trending markets, one can observe a shift
in the contour lines, away from the circles. In a nutshell, for a positive trend
the expected number of directional changes is larger if δup > δdown . This
reflects the fact that an upward trend is naturally comprised of longer up-
move segments. The contrary is true for down moves.
Now it is possible to introduce the notion of invariance as a guiding princi-
ple. By rotating the 45-degree line in the correct manner for trending markets,
the number of directional changes will stay constant. In other words, if the
trend is known, the thresholds can be skewed accordingly to compensate. How-
ever, it is not trivial to construct a trend indicator that is predictive and not
only reactive.
A workaround is found by taking the inventory as a proxy for the trend.
In detail, the expected inventory size I for all agents in normal market con-
ditions can be used to gauge the trend: E[I(δup , δdown )] is now a measure of
trendiness and hence triggers threshold skewing. In other words, by taking the
inventory as an invariant indicator, the 45-degree line can be rotated due to
the asymmetric thresholds, counteracting the trend.
A more mathematical justification can be found in the approach of what is
known as “indifference prices” in market making. This method can be trans-
lated into the context of intrinsic time and agent’s inventories. It then man-
dates that the utility (or preference) of the whole inventory should stay the
same for skewed thresholds and inventory changes. In other words, how can
the thresholds be changed in a way that “feels” the same as if the inventory
increased or decreased by one unit? Expressed as equations

\[
U(\delta_{\text{down}}, \delta_{\text{up}}, I) = U(\delta^{*}_{\text{down}}, \delta^{*}_{\text{up}}, I + 1), \tag{3.10}
\]

and

\[
U(\delta_{\text{down}}, \delta_{\text{up}}, I) = U(\delta^{**}_{\text{down}}, \delta^{**}_{\text{up}}, I - 1), \tag{3.11}
\]

where U represents a utility function. The thresholds δ*up , δ*down and δ**up ,
δ**down are “indifference” thresholds.
A pragmatic implementation of such an inventory-driven skewing of thresh-
olds is given by the following equation, corresponding to a long position

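\[
\frac{\delta_{\text{down}}}{\delta_{\text{up}}} =
\begin{cases}
2 & \text{if } I \geq 15; \\
4 & \text{if } I \geq 30.
\end{cases}
\tag{3.12}
\]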

For a short position, the fractions are inverted



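\[
\frac{\delta_{\text{up}}}{\delta_{\text{down}}} =
\begin{cases}
2 & \text{if } I \leq -15; \\
4 & \text{if } I \leq -30.
\end{cases}
\tag{3.13}
\]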

In essence, in the presence of trends, the overshoot thresholds decrease as a
result of the asymmetric directional change thresholds.
This also motivates a final relaxation of a constraint. The final ingredi-
ents of the Alpha Engine are fractional position changes. Recall that coast-
line trading is simply an increase or decrease in the position size at intrinsic
time events. This cascading and de-cascading was done by one unit: for
instance, a short position is increased by one unit if the price rises
and reaches an upward overshoot. To make this procedure compatible with
asymmetric thresholds, the new cascading and de-cascading events resulting
from the asymmetric threshold are now done with a fraction of the original
unit. The fractions are also dictated by Equations 3.12 and 3.13. In effect, the
introduction of asymmetric thresholds leads to a subdivision of the original
threshold into smaller parts, where the position size is changed by subunits
on these emerging thresholds.
An example is shown in Figure 3.8. Assuming that a short position was
opened at a lower price than the minimal price in the illustration, the direc-
tional change will trigger a cascading event. In other words, one (negative)
unit of exposure (symbolized by the large arrows) is added to the existing short
position. The two overshoot events in Figure 3.8a trigger identical cascading
events. In Figure 3.8b, the same events are augmented by asymmetric thresholds.
Now ωup = ωdown /4. As a result, each overshoot length is divided into
four segments. The new cascading regime is as follows: increase the position
by one-fourth of a (negative) unit (small arrow) at the directional change and
another fourth at the first, second, and third asymmetric overshoots each. In
effect, the cascading event is “smeared out” and happens in smaller unit sizes
over a longer period. For the cascading events at the first and second original
overshoots, this procedure is repeated.

FIGURE 3.8: Cascading with asymmetric thresholds. A stylized price curve
is shown in both panels. (a) The original (symmetric) setup with an upward
directional change event (continuous line) and two overshoots (dashed lines).
Short position size increments are shown as downward arrows. (b) The situation
corresponding to asymmetric thresholds, where intrinsic time accelerates
and smaller position size increments are utilized for coastline trading. See
details in text.

This concludes the final chapter in the long history of the trading model
development. Many insights from diverse fields were consolidated and a unified
modeling framework emerged.

3.4 The Nuts and Bolts: A Summary of the Alpha Engine

All the insights gained along this long journey need to be encapsulated and
translated into algorithmic concepts. In this section, we summarize in detail
the trading model behavior and specify the parameters.
The intrinsic time scale dissects the price curve into directional changes and
overshoots, yielding an empirical scaling law equating the length of the over-
shoot ω to the size of the directional change threshold δ, that is, ω ≈ δ. This
scaling law creates an event-based language that clocks the trading model.
In essence, intrinsic time events define the coastline trading behavior with its
hallmark cascading and de-cascading events. In other words, the discrete price
curve with occurrences of intrinsic time events triggers an increase or decrease
in position sizes.
In detail, an intrinsic event is either a directional change or a move
of size δ in the direction of the overshoot. For each exchange rate, we
assign four coastline traders CTi [δup/down (i)], i = 1, 2, 3, 4, that operate
at various scales, with upward and downward directional change thresholds
equaling δup/down (1) = 0.25%, δup/down (2) = 0.5%, δup/down (3) = 1.0%, and
δup/down (4) = 1.5%.
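The event definition above can be sketched as a small state machine. This is an illustrative simplification and not the code referenced in [10]: it assumes a single mid-price instead of bid and ask quotes, relative thresholds, an initial upward mode, and at most one event per price update. The class name and the event codes are hypothetical.

```python
# Directional change thresholds of the four coastline traders (from the text).
COASTLINE_THRESHOLDS = (0.0025, 0.005, 0.01, 0.015)


class IntrinsicEventDetector:
    """Emit intrinsic-time events for one pair of thresholds (mid-price sketch)."""

    def __init__(self, delta_up, delta_down, initial_price):
        self.delta_up = delta_up
        self.delta_down = delta_down
        self.mode = 1                   # +1: up move, -1: down move (assumed start)
        self.extreme = initial_price    # running maximum (up) or minimum (down)
        self.reference = initial_price  # price at the last emitted event

    def on_price(self, price):
        """Return +1/-1 for an upward/downward directional change,
        +2/-2 for an upward/downward overshoot step, and 0 for no event."""
        if self.mode == 1:
            if price <= self.extreme * (1.0 - self.delta_down):
                self.mode, self.extreme, self.reference = -1, price, price
                return -1
            self.extreme = max(self.extreme, price)
            if price >= self.reference * (1.0 + self.delta_up):
                self.reference = price
                return 2
            return 0
        if price >= self.extreme * (1.0 + self.delta_up):
            self.mode, self.extreme, self.reference = 1, price, price
            return 1
        self.extreme = min(self.extreme, price)
        if price <= self.reference * (1.0 - self.delta_down):
            self.reference = price
            return -2
        return 0
```

One detector per entry of `COASTLINE_THRESHOLDS` would then drive the four coastline traders of an exchange rate, each deciding on cascading or de-cascading whenever its detector returns a nonzero code.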
The default size for cascading and de-cascading a position is one unit (lot).
The probability indicator Li , assigned to each coastline trader, is evaluated
on the fixed scale δ(i) = δup/down (i). As a result, its states are directional
changes of size δ(i) or overshoot moves of size 2.525729 · δi . The default unit
size for cascading is reduced to 0.5 if Li is smaller than 0.5. Additionally, if
Li is smaller than 0.1, then the size is further reduced to 0.1.
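The sizing rule is simple enough to state as a function. A minimal sketch follows, assuming the reduced sizes scale with the default unit; the function name is illustrative.

```python
def cascade_unit_size(probability_indicator, default_size=1.0):
    """Unit size for cascading, given the probability indicator L_i.

    The size is the default unit, reduced to 0.5 of it if L_i < 0.5 and
    to 0.1 of it if L_i < 0.1. Scaling by default_size is an assumption.
    """
    if probability_indicator < 0.1:
        return 0.1 * default_size
    if probability_indicator < 0.5:
        return 0.5 * default_size
    return default_size
```

The comparisons are strict, following the wording "smaller than", so an indicator value of exactly 0.5 still trades the full unit.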
In case a coastline trader accumulates an inventory with a long posi-
tion greater than 15 units, the upward directional change threshold δup (i)
is increased to 1.5 of its original size, while the downward directional change
threshold δdown (i) is decreased to 0.75 of its original size. In effect, the ratio
for the skewed thresholds is δup (i)/δdown (i) = 2. The agent with the skewed
thresholds will cascade when the overshoot reaches 0.5 of the skewed threshold,
that is, half of the original threshold size. In case the inventory with long posi-
tion is greater than 30, then the upward directional change threshold δup (i)
is increased to 2.0 of its original size and the downward directional change
threshold δdown (i) is decreased to 0.5. The ratio of the skewed thresholds
now equals δup (i)/δdown (i) = 4. The agent with these skewed thresholds will
cascade when the overshoot extends by 0.25 of the original threshold, with
one-fourth of the specified unit size. This was illustrated in Figure 3.8b. The
changes in threshold lengths and sizing are analogous for short inventories.
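The inventory-driven skewing described in this paragraph can be summarized in a few lines. The sketch below follows the multipliers stated here (1.5 and 0.75 above 15 units, 2.0 and 0.5 above 30 units). Two points are assumptions: the unit fraction of one-half in the first regime, inferred from the statement that the fractions follow the threshold ratios, and the mirrored treatment of short inventories. The names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkewedThresholds:
    delta_up: float       # upward directional change threshold
    delta_down: float     # downward directional change threshold
    unit_fraction: float  # fraction of the unit traded per cascading event


def skew_thresholds(delta, inventory):
    """Skew a symmetric threshold `delta` for a signed inventory (long > 0)."""
    size = abs(inventory)
    if size > 30:
        wide, narrow, fraction = 2.0, 0.5, 0.25
    elif size > 15:
        wide, narrow, fraction = 1.5, 0.75, 0.5   # fraction 0.5 is an assumption
    else:
        return SkewedThresholds(delta, delta, 1.0)
    if inventory > 0:
        # Long inventory: upward threshold widened, downward threshold narrowed.
        return SkewedThresholds(delta * wide, delta * narrow, fraction)
    # Short inventory: mirrored (assumed analogous to the long case).
    return SkewedThresholds(delta * narrow, delta * wide, fraction)
```

For example, `skew_thresholds(0.01, 20)` yields thresholds in the ratio δup/δdown = 2, and `skew_thresholds(0.01, 31)` yields the ratio 4 with a unit fraction of one-fourth.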
This concludes the description of the trading model algorithm and the
motivation of the chosen modeling framework. Recall that the interested
reader can download the code from GitHub [10].

3.5 Conclusion and Outlook


The trading model algorithm described here is the result of a meander-
ing journey that lasted for decades. Guided by an overarching event-based
framework, recasting time as discrete and driven by activity, elements from
complexity theory and information theory were added. In a nutshell, the pro-
posed trading model is defined by a set of simple rules executed at specific
events in the market. This approach to designing automated trading mod-
els yields an algorithm that fulfills many desired features. Its parsimonious,
modular, and self-similar design results in behavior that is profitable, robust,
and adaptive.
Another crucial feature of the trading model is that it is designed to be
counter trend. The coastline trading ensures that positions, which are going
against a trend, are maintained or increased. In this sense, the models provide
liquidity to the market. When market participants want to sell, the invest-
ment strategy will buy and vice versa. This market-stabilizing feature of the
model is beneficial to the markets as a whole. The more such strategies are
implemented, the fewer runaway markets and the healthier market
conditions we expect to see overall. By construction, the trading model only ceases to perform
in low-volatility markets.
It should be noted that the model framework presented here can be realized
with reasonable computational resources. The basic agent-based algorithm
shows profitable behavior for four directional change thresholds, on which
the positions (agents) live. However, by adding more thresholds the model
behavior is expected to become more robust, as more information coming
from the market can be processed by the trading model. In other words, by
increasing the model complexity the need for performant computing becomes
relevant for efficient prototyping and backtesting. In this sense, we expect the
advancements in high-performance computing in finance to positively impact
the Alpha Engine’s evolution.
Nevertheless, with all the merits of the trading algorithm presented here,
we are only at the beginning. The Alpha Engine should be understood as
a prototype. It is, so to speak, a proof of concept. For one, the parameter
space can be explored in greater detail. Then, the model can be improved by
calibrating the various exchange rates by volatility or by excluding illiquid
ones. Furthermore, the model treats all the currency pairs in isolation. There
should be a large window of opportunity for increasing the performance of
the trading model by introducing correlation across currency pairs. This is a
unique and invaluable source of information not yet exploited. Finally, a whole
layer of risk management can be implemented on top of the models.
We hope to have presented a convincing set of tools motivated by a consis-
tent philosophy. If so, we invite the reader to take what is outlined here and
improve upon it.

Appendix 3A A History of Ideas


This section is a personal recount of the historical events that would ulti-
mately lead to the development of the trading model algorithm outlined in this
chapter, told by Richard B. Olsen.
The development of the trading algorithm and model framework dates
back to my studies in the mid-70s and 80s. From the very start, my interests
in economics were influenced by my admiration of the scientific rigor of natural
sciences and their successful implementations in the real world. I argued that
the resilience of the economic and political systems depends on the underlying
economic and political models. Motivated to contribute to the well-being of
society I wanted to work on enhancing economic theory and work on applying
the models.
I first studied law at the University of Zurich and then, in 1979, moved
to Oxford to study philosophy, politics, and economics. In 1980, I attended a
course on growth models by James Mirrlees, who, in 1996, received a Nobel
prize in economics. In his first lecture, he discussed the shortcomings of the
models, such as [40]. He explained that the models are successful in explaining
growth as long as there are no large exogenous shocks. But unanticipated
events are inherent to our lives and the economy at large. I thus started
to search for a model framework that can both explain growth and handle
unexpected exogenous shocks. I spent one year studying the Encyclopedia
Britannica and found my inspiration in relativity theory.
In my 1981 PhD thesis, titled “Interaction Between Law and Society,” at
the University of Zurich, I developed a new model framework that describes
in an abstract language, how interactions in the economy occur. At the core of
the new approach are the concepts of object, system, environment, and event-
based intrinsic time. Every object has its system that comprises all the forces
that impact and influence the object. Outside the system is its environment
with all the forces that do not impact the object. Every object and system
has its own frame of reference with an event-based intrinsic time scale. Events
are interactions between different objects and their systems. I concluded that
there is no abstract and universal time scale applicable to every object. This
motivated me to think about the nature of time and how we use time in our
everyday economic models.
After finishing my studies, I joined a bank working first in the legal depart-
ment, then in the research group, and finally joined the foreign exchange trad-
ing desk. My goal was to combine empirical work with academic research, but
I was disappointed with the pace of research at the bank. In the mid-80s, there
was the first buzz about start-ups in the United States. I came up with a busi-
ness idea: banks have a need for quality information to increase profitability,
so there should be a market for quality real-time information.
I launched a start-up with the name of Olsen & Associates. The goal was
to build an information system for financial markets with real-time forecasts
and trading recommendations using tick-by-tick market data. The product
idea combined my research interest with an information service, which would
both improve the quality of decision making in financial markets and generate
revenue to fund further research. The collection of tick market data began in
January 1986 from Reuters. We faced many business and technical obstacles,
where data storage cost was just one of the many issues. After many setbacks,
we successfully launched our information service and eventually acquired 60
big to mid-sized banks across Europe as customers.
In 1990, we published our first scientific paper [26] revealing the first scaling
law. The study showed that intraday prices have the same scaling law exponent
as longer term price movements. We had expected two different exponents:
one for intraday price movements, where technical factors dictate price dis-
covery, and another for longer term price movements that are influenced by
fundamentals. The result took us by surprise and was evidence that there are
universal laws that dictate price discovery at all scales. In 1995, we organized
the first high-frequency data conference in Zurich, where we made a large
sample of tick data available to the academic community. The conference was
a big success and boosted market microstructure research, which was in its
infancy at that time. In the following years, we conducted exhaustive research
testing all possible model approaches to build a reliable forecasting service and
trading models. Our research work is described in the book [17]. The book
covers data collection and filtering, basic stylized facts of financial market time
series, the modeling of 24-hour seasonal volatility, realized volatility dynam-
ics, volatility processes, forecasting return and risks, correlation, and trading
models. For many years, the book was a standard text for major hedge funds.
The actual performance of our forecasting and trading models was, however,
spurious and disappointing. Our models were best in class, but we had not
achieved a breakthrough.
Back in 1995, we were selling tick-by-tick market data to top banks and
created a spinoff under the name of OANDA to market a currency converter on
the emergent Internet and eventually build a foreign exchange market making
business. The OANDA currency converter was an instant success. At the start
of 2001, we were completing the first release of our trading platform. At the
same time, Olsen & Associates was a treasure store of information and risk
services, but did not have cash to market the products and was struggling for
funding. When the Internet bubble burst and markets froze, we could not pay
our bills and the company went into default. I was able to organize a bailout
with a new investor. He helped to salvage the core of Olsen & Associates with
the aim of building a hedge fund under the name of Olsen Ltd and buying up
the OANDA shares.
In 2001, the OANDA trading platform was a novelty in the financial indus-
try: straight through processing, one price for everyone, and second-by-second
interest payments. At the time, these were true firsts. At OANDA, a trader
could buy literally 1 EUR against USD at the same low spread as a buyer
of 1 million EUR against USD. The business was an instant success. More-
over, the OANDA trading platform was a research laboratory to analyze the
trades of 10,000 traders, all buying and selling at the same terms and condi-
tions, and observe their behavior patterns in different market environments.
I learned hands-on how financial markets really work and discovered that
basic assumptions of market efficiency that we had taken for granted at Olsen
& Associates were inappropriate. I was determined to make a fresh start in
model development.
At Olsen Ltd, I made a strategic decision to focus exclusively on trading
model research. Trading models have a big advantage over forecasting models:
the profit and losses of a trading model are an unambiguous success criterion
of the quality of a model. We started with the forensics of the old model algo-
rithms and discovered that the success and failure of a model depends critically
on the definition of time and how data is sampled. Already at Olsen & Asso-
ciates, we were sensitive to the issue of how to define time and had rescaled
price data to account for the 24-hour seasonality of volatility, but did not suc-
ceed with a more sophisticated rescaling of time. There was one operator that
we had failed to explore. We had developed a directional change indicator and
had observed that the indicator follows a scaling law behavior similar to the
absolute price change scaling law [16]. This scaling law was somehow forgot-
ten and was not mentioned in our book [17]. I had incidental evidence that
this operator would be successful to redefine time because traders use such an
operator to analyze markets. The so-called point and figure chart replaces the
x-axis of physical time with an event scale. As long as a market price moves
up, the price stays frozen in the same column. When the price moves down
by a threshold bigger than the box size, the plot moves to the next column.
A new column is started, when the price reverses its direction.
I also gained another key insight, into the path dependence of market
prices from watching OANDA traders. There was empirical evidence that a
margin call of one trader from anywhere in the world could trigger a whole
cascade of margin calls in the global foreign exchange markets in periods of
herding behavior. Cascades of margin calls wipe out whole cohorts of traders
and tilt the market composition of buyers and sellers and skew the long-term
price trajectory. Traditional time series models cannot adequately model these
phenomena. We decided to move to agent-based models to better incorporate
the emergent market dynamics and use the scaling laws as a framework to cal-
ibrate the algorithmic behavior of the agents. This seemed attractive, because
we could configure self-similar agents at different scales.
I was adamant to build bare-bone agents and not to clutter our model
algorithms with tools of spurious quality. In 2008, we were rewarded with a
major breakthrough: we discovered a large set of scaling laws [8]. I expected
that model development would be plain sailing from there on. I was wrong.
The road of discovery was much longer than anticipated. Our hedge fund had
several significant drawdowns that forced me to close the fund in 2013. At
OANDA, things had also deteriorated. After raising 100 million USD for 20%
of the company in 2007, I had become chairman without executive powers.
OANDA turned into a conservative company and lost its competitive edge. In
2012, I left the board.
In July 2015, I raised the first seed round for Lykke, a new startup.
Lykke builds a global marketplace for all asset classes and instruments on
the blockchain. The marketplace is open source and a public utility. We will
earn money by providing liquidity with our funds and/or customer’s funds,
with algorithms as described in this chapter.
TABLE 3A.1: Monthly performance of the unleveraged trading model
%       Jan    Feb    Mar    Apr    May   June   July    Aug    Sep    Oct    Nov    Dec   Year
2006   0.16   0.15   0.07   0.12   0.22   0.17   0.19   0.20   0.18   0.08  −0.00   0.04   1.58
2007   0.08   0.22   0.14   0.02  −0.05  −0.03   0.32   0.59   0.07   0.11   0.47   0.20   2.03
2008   0.24   0.07   0.05   0.50   0.26   0.09   0.26   0.16   0.66   2.22   1.27   0.98   6.03
2009   1.14   1.41   1.17   1.00   0.75   0.59   0.22   0.19  −0.13   0.28   0.06   0.25   7.70
2010   0.15  −0.34   0.24   0.14   0.30   0.17   0.27  −0.02   0.03   0.06   0.14  −0.31   1.42
2011   0.45   0.13   0.11  −0.16   0.04  −0.06  −0.40   0.43   0.45  −0.03   0.32  −0.03   0.97
2012  −0.08   0.19   0.29   0.08  −0.12   0.15  −0.20   0.23   0.10   0.13   0.12   0.11   0.86
2013  −0.17  −0.01  −0.10  −0.08   0.32   0.52   0.04   0.24  −0.10   0.01  −0.01  −0.16   0.77
Note: The P&L is given in percentages. All 23 currency pairs are aggregated.

Appendix 3B Supplementary Material

[Figure 3B.1, titled "Trading model P&L for geometric random walk". Vertical axis: P&L, from 0% to 35%; horizontal axis: Time.]

FIGURE 3B.1: Profit & Loss for a time series, generated by a geometric
random walk of 10 million ticks with annualized volatility of 25%. The average
of 60 Monte Carlo simulations is shown. In the limiting case, the P&L curve
becomes a smooth increasing line.

References
1. Baghai, P., Erzan, O., and Kwek, J.-H. The $64 trillion question, convergence
in asset management, McKinsey & Company, 2015.

2. World Bank. World development indicators database, 2015.

3. Chen, J.X. The evolution of computing: Alphago. Computing in Science & Engi-
neering 18(4), 2016, pp. 4–7.

4. Bouveret, A., Guillaumie, C., Roqueiro, C.A., Winkler, C., and Nauhaus, S.
High frequency trading activity in EU equity markets, European Securities and
Markets Authority, 2014.

5. Roseen, T. Are quant funds worth another look? Thomson Reuters, 2016.

6. Bank of International Settlement. Triennial Central Bank Survey of Foreign
Exchange, and OTC Derivatives Markets, 2016. Monetary and Economic
Department, Basel, 2016.

7. ISDA. Central clearing in the equity derivatives market, 2014.

8. Glattfelder, J.B., Dupuis, A., and Olsen, R.B. Patterns in high-frequency fx
data: Discovery of 12 empirical scaling laws. Quantitative Finance 11(4), 2011,
pp. 599–614.

9. Doyne Farmer, J. and Foley, D. The economy needs agent-based modelling.
Nature 460(7256), 2009, pp. 685–686.

10. The alpha engine: Designing an automated trading algorithm code. https://
[Link]/AntonVonGolub/Code/blob/master/[Link], 2017. Accessed:
2017-01-04. 2017.

11. Golub, A., Chliamovitch, G., Dupuis, A., and Chopard, B. Multi-scale represen-
tation of high frequency market liquidity. Algorithmic Finance, 5(1), 2016, pp.
3–19.

12. Müller, U.A., Dacorogna, M.M., Davé, R.D., Pictet, O.V., Olsen, R.B., and
Robert Ward, J. Fractals and intrinsic time: A challenge to econometricians. Pre-
sentation at the XXXIXth International AEA Conference on Real Time Econo-
metrics, 14–15 Oct 1993, Luxembourg, 1993.

13. Aloud, M., Tsang, E., Olsen, R.B., and Dupuis, A. A directional-change events
approach for studying financial time series. Economics Discussion Papers, (2012-
36), 6, 2012, pp. 1–17.

14. Ao, H. and Tsang, E. Capturing market movements with directional changes.
Working paper: Centre for Computational Finance and Economic Agents, Univ.
of Essex, 2013.

15. Bakhach, A., Tsang, E.P.K., and Ng, W. Lon. Forecasting directional changes
in financial markets. Working paper: Centre for Computational Finance and
Economic Agents, Univ. of Essex, 2015.

16. Guillaume, D.M., Dacorogna, M.M., Davé, R.R., Müller, U.A., Olsen, R.B.,
and Pictet, O.V. From the bird’s eye to the microscope: A survey of new stylized
facts of the intra-daily foreign exchange markets. Finance and Stochastics, 1(2),
1997, pp. 95–129.

17. Gençay, R., Dacorogna, M., Muller, U.A., Pictet, O., and Olsen, R. An Intro-
duction to High-Frequency Finance. Academic Press, New York, 2001.

18. Mandelbrot, B. The variation of certain speculative prices. Journal of Business
36(4), 1963, pp. 394–419.

19. Newman, M.E.J. Power laws, pareto distributions and Zipf’s law. Contemporary
Physics 46(5), 2005, pp. 323–351.

20. Pareto, V. Cours d’economie politique. 1897.

21. West, G.B., Brown, J.H., and Enquist, B.J. A general model for the origin of
allometric scaling laws in biology. Science 276(5309), 1997, p. 122.

22. Zipf, G.K. Human Behavior and the Principle of Least Effort. Addison-Wesley,
Reading, MA, 1949.

23. Albert, R. and Barabási, A.L. Statistical mechanics of complex networks. Review
of Modern Physics 74(1), 2002, pp. 47–97.

24. Barabási, A.L. and Albert, R. Emergence of scaling in random networks. Science
1999, p. 509.
25. Newman, M.E.J. The structure and function of complex networks. SIAM review
45(2), 2003, pp. 167–256.

26. Müller, U.A., Dacorogna, M.M., Olsen, R.B., Pictet, O.V., Schwarz, M., and
Morgenegg, C. Statistical study of foreign exchange rates, empirical evidence of
a price change scaling law, and intraday analysis. Journal of Banking & Finance
14(6), 1990, 1189–1208.

27. Hull, J.C. Options, Futures and Other Derivative Securities, 9th edition. Pear-
son, London, 2014.

28. Voit, J. The Statistical Mechanics of Financial Markets, 3rd edition. Springer,
Berlin, 2005.

29. Glattfelder, J.B. Decoding Complexity. Springer, Heidelberg, 2013.

30. Wolfram, S. A New Kind of Science. Wolfram Media, Champaign, 2002.

31. Andersen, J.V. and Sornette, D. A mechanism for pockets of predictability in
complex adaptive systems. EPL (Europhysics Letters), 70(5), 2005, p. 697.

32. Helbing, D. Agent-based modeling. In: Social Self-Organization. Springer, 2012,
pp. 25–70.

33. Lux, T. and Marchesi, M. Volatility clustering in financial markets: A
microsimulation of interacting agents. International Journal of Theoretical and Applied
Finance 3(4), 2000, pp. 675–702.

34. Aloud, M., Tsang, E., Dupuis, A., and Olsen, R. Minimal agent-based model for
the origin of trading activity in foreign exchange market. In: 2011 IEEE Sympo-
sium on Computational Intelligence for Financial Engineering and Economics
(CIFEr). IEEE. 2011, pp. 1–8.

35. Dupuis, A. and Olsen, R.B. High Frequency Finance, Using Scaling Laws to
Build Trading Models, John Wiley & Sons, Inc., 2012, pp. 563–584.

36. Glattfelder, J.B., Bisig, T., and Olsen, R.B. R&D Strategy Document. Technical
report, A Paper by the Olsen Ltd. Research Group, 2010.

37. Cover, T.M. and Thomas, J.A. Elements of Information Theory. John Wiley &
Sons, Hoboken, 1991.

38. Pfister, H.D., Soriaga, J.B., and Siegel, P.H. On the achievable information rates
of finite state ISI channels. In Proc. IEEE Globecom. Eds. Kurlander, D., Brown,
M., and Rao, R. ACM Press, November 2001, pp. 41–50.

39. Golub, A., Glattfelder, J.B., Petrov, V., and Olsen, R.B. Waiting Times and
Number of Directional Changes in Intrinsic Time Framework, 2017. Lykke Corp
& University of Zurich Working Paper.

40. Kaldor, N. and Mirrlees, J.A. A new model of economic growth. The Review of
Economic Studies 29(3), 1962, pp. 174–192.
Chapter 4
Portfolio Liquidation and Ambiguity Aversion

Álvaro Cartea, Ryan Donnelly, and Sebastian Jaimungal

CONTENTS
4.1 Introduction
4.2 Reference Model
4.2.1 Dynamics
4.2.2 Optimal liquidation problem
4.2.3 Feedback controls
4.3 Ambiguity Aversion
4.4 Effects of Ambiguity on the Optimal Strategy
4.4.1 Arrival rate
4.4.2 Fill probability
4.4.3 Midprice drift
[Link] Equivalence to inventory penalization
4.5 Closed-Form Solutions
4.6 Inclusion of Market Orders
4.6.1 Feedback controls
4.6.2 The effects of ambiguity aversion on market order execution
4.7 Conclusion
Appendix 4A Proofs
References

4.1 Introduction
This work considers an optimal execution problem in which an agent is
tasked with liquidating a number of shares of an asset before the end of a
defined trading period. The development of an algorithm which quantitatively
specifies the execution strategy ensures that the liquidation is performed with
the perfect balance of risk and profits, and in the modern trading environment
in which human trading activity is essentially obsolete, the execution must
be based on a set of predefined rules. However, the resulting set of rules
depends heavily upon modeling assumptions made by the agent. If the model
is misspecified, then the balance between risk and profits is imperfect and the
strategy may be exposed to risks that are not captured by model features.
The design of optimal execution algorithms began in the literature with
the seminal work by Almgren and Chriss (2001). This approach was general-
ized over time with models that include additional features. Kharroubi and
Pham (2010) take a different approach to modeling the price impact of the
agent’s trades depending on the amount of time between them. A more direct
approach in modeling features of the limit order book and an induced price
impact appears in Obizhaeva and Wang (2013).
The aforementioned papers allow the agent to employ only market orders in
their liquidation strategy. By allowing the agent to use limit orders instead (or
in addition), the profits accumulated by the portfolio liquidation can poten-
tially be increased. Intuitively this is possible due to acquiring better trade
prices by using limit orders instead of market orders. Guéant et al. (2012)
rephrase the problem as one in which all of the agent’s trades are performed
by using limit orders. This work is then continued in Guéant and Lehalle
(2015) where more general dynamics of transactions are considered. A com-
bination of limit order and market order strategy is developed in Cartea and
Jaimungal (2015a). For an overview on these and related problems in algo-
rithmic and high-frequency trading, see Cartea et al. (2015).
The present work is most related to Guéant et al. (2012) and Cartea and
Jaimungal (2015a). The underlying dynamics we use here are identical to
Guéant et al. (2012), but we use a different objective function. Our work con-
siders a risk-neutral agent, whereas the former considers one with exponential
utility. Since the purpose of this work is to investigate the effects of ambigu-
ity aversion, we consider the risk-neutral agent in order to eliminate possible
effects due to risk aversion. In addition, we also consider the inclusion of mar-
ket orders in the agent’s trading strategy toward the end of this work, similar
to what is done in Cartea and Jaimungal (2015a).
If the agent expresses equal levels of ambiguity toward the arrival rate and
fill probability of orders, then her optimal trading strategy can be found in
closed form. This is of benefit to the agent if the strategy is computed and
implemented in real time because there is no need to numerically solve a non-
linear PDE. Incorporating other relevant features in the framework makes the
model more realistic, but this may come at the cost of not having closed-form
solutions. An example is when the admissible strategies include market orders.
In this case, the optimal strategies are obtained by solving a quasi-variational
inequality. This requires sophisticated numerical methods and computer power
to implement the strategies in real time.
The rest of the work is presented as follows. In Section 4.2, we present
the dynamics of price changes and trade arrivals that our agent considers to
be her reference model. The solution to the optimal liquidation is presented
here in brief. In Section 4.3, we introduce the method by which our agent
expresses ambiguity aversion. This amounts to specifying a class of equivalent
candidate measures which she considers and a function which assigns to each
candidate measure a penalization based on its deviation from the reference
measure. In Section 4.4, we show how ambiguity with respect to different
aspects of the reference dynamics affects the trading strategy of the agent.
Section 4.5 presents some closed-form solutions for the value function and
considers asymptotic behavior of the trading strategy, and Section 4.6 develops
the inclusion of market orders and again analyzes how ambiguity affects the
timing of market order placements.

4.2 Reference Model


4.2.1 Dynamics
The reference model is structured in a similar manner to Avellaneda and
Stoikov (2008), Guéant et al. (2013), and Cartea and Jaimungal (2015b),
except that only the sell side of the limit order book (LOB) dynamics is
considered. It is not necessary to consider the intensity or volume
distribution of market sell orders since the agent will only be posting limit
orders (LOs) on the sell side of the LOB.
Midprice: The midprice $S_t$ satisfies

\[ dS_t = \alpha\,dt + \sigma\,dW_t, \]

where $\alpha$ and $\sigma > 0$ are constants and $(W_t)_{0\le t\le T}$ is a standard
Brownian motion.
Market Orders: Let $\mu$ be a Poisson random measure with compensator
given by $\nu_{\mathbb{P}}(dy, dt) = \lambda\,F(dy)\,dt$, where $F(dy) = \kappa\,e^{-\kappa y}\,dy$. The number
of market buy orders that have arrived up to time $t$ is set equal to

\[ M_t = \int_0^t \int_0^\infty \mu(dy, ds). \]

The spatial dimension of the Poisson random measure $\mu$ is
interpreted as the maximum price of execution of the incoming market order
(MO). The form of the compensator $\nu_{\mathbb{P}}$ implies that MOs arrive at a constant
intensity $\lambda$ and that the distribution of the maximum price of execution is
exponential with parameter $\kappa$.
Limit Order Placement: The agent places limit orders only on the sell side
of the limit order book. The distance of the limit order from the midprice
is denoted by $\delta_t$. Since the agent’s LO is only lifted by an MO that has a
maximum execution price greater than $S_t + \delta_t$, the number of LOs which are
filled is equal to

\[ N_t = \int_0^t \int_{\delta_s}^\infty \mu(dy, ds). \]

Given that an MO arrives at time $t$ and the agent has an LO posted at price
$S_t + \delta_t$, the probability that the limit order is filled is $e^{-\kappa \delta_t}$. This is
due to the fact that the LO is only filled if the maximal execution price of
the MO is greater than the price of the LO.
The appropriate filtered probability space to consider is
$(\Omega, \mathcal{F}, \mathbb{F} = \{\mathcal{F}_t\}_{0\le t\le T}, \mathbb{P})$, where $\mathbb{F}$ is the completion of the filtration
generated by the midprice $(S_t)_{0\le t\le T}$ and the process $(P_t)_{0\le t\le T}$, where
$P_t = \int_0^t \int_0^\infty y\,\mu(dy, ds)$.1
The agent uses $\delta_t$, an $\mathcal{F}_t$-predictable process, as her control.
1 Observing the process $P_t$ allows one to detect both the time of arrival and
the maximum execution price of MOs.
Inventory: The agent’s inventory is denoted by $q_t$, and she begins with a
total of $Q$ shares to be liquidated. Thus $q_t = Q - N_t$, which gives

\[ dq_t = -dN_t. \]

Wealth: Immediately after an MO lifts one of the agent’s LOs, her wealth
increases by an amount equal to the trade value $S_t + \delta_t$. The wealth, $X_t$,
therefore, has dynamics

\[ dX_t = (S_t + \delta_{t-})\,dN_t. \]

4.2.2 Optimal liquidation problem


Given that the agent’s goal is to liquidate her position in a way that is
financially optimal, she stops trading when her inventory reaches zero. The
optimization problem under the reference model is to select an LO posting
strategy which maximizes expected terminal wealth

\[ H(t, x, q, S) = \sup_{(\delta_s)_{t\le s\le T}\in\mathcal{A}} \mathbb{E}^{\mathbb{P}}_{t,x,q,S}\left[ X_{\tau\wedge T} + q_{\tau\wedge T}\left(S_{\tau\wedge T} - \ell(q_{\tau\wedge T})\right) \right], \tag{4.1} \]

where $\tau = \inf\{t : q_t = 0\}$, $T$ is the terminal time of the strategy, $q_T$ is the
final inventory, $\mathbb{E}^{\mathbb{P}}_{t,x,q,S}[\,\cdot\,]$ denotes $\mathbb{P}$ expectation conditional on $X_{t-} = x$,
$q_{t-} = q$ and $S_t = S$, and $\mathcal{A}$ denotes the set of admissible strategies, which are
non-negative $\mathcal{F}_t$-predictable processes. Moreover, the function $\ell(q_T)$, with
$\ell(0) = 0$ and $\ell(q)$ increasing in $q$, is a liquidation penalty that consists of
fees and market impact costs when the agent unwinds terminal inventory. For
example, $\ell(q) = \theta q$ represents a linear impact when liquidating $q$ shares,
where $\theta \ge 0$ is a penalty parameter. This liquidation penalty will only come
into effect if the agent does not manage to liquidate her entire inventory
before the end of the trading period. Note that since $q_\tau = 0$, this gives
$X_\tau + q_\tau (S_\tau - \ell(q_\tau)) = X_\tau$, and so $H(t, x, 0, S) = x$.
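
The dynamics and objective above can be illustrated by simulation. The following is a minimal sketch, not the authors' method: it assumes a constant posting depth, the linear penalty $\ell(q) = \theta q$, a small time step in which at most one MO can arrive, and an arbitrary starting midprice of 100. The function name `simulate_liquidation` and its arguments are hypothetical.

```python
import numpy as np

def simulate_liquidation(depth, Q=10, T=15.0, lam=2.0, kappa=15.0,
                         sigma=0.01, alpha=0.0, theta=0.01, dt=0.01, seed=0):
    """Simulate one path of the reference model with a constant depth.

    Returns (terminal objective, final inventory), where the objective is
    X + q * (S - theta * q) evaluated at min(tau, T).
    """
    rng = np.random.default_rng(seed)
    S, X, q = 100.0, 0.0, Q      # S0 = 100 is an arbitrary illustrative choice
    n_steps = int(round(T / dt))
    for _ in range(n_steps):
        if q == 0:               # trading stops at tau = inf{t : q_t = 0}
            break
        # Midprice increment: dS = alpha dt + sigma dW
        S += alpha * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        # Bernoulli approximation: an MO arrives with probability lam * dt
        if rng.random() < lam * dt:
            y = rng.exponential(1.0 / kappa)   # maximum execution price offset
            if y > depth:                      # the LO at S + depth is lifted
                X += S + depth
                q -= 1
    return X + q * (S - theta * q), q
```

Averaging the first return value over many seeds gives a Monte Carlo estimate of the expectation in Equation 4.1 for that one constant-depth strategy, which is a lower bound on the supremum rather than the value function itself.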
By standard results, the HJB equation associated with the value function
$H$ is given by

\[ \partial_t H + \alpha\,\partial_S H + \tfrac{1}{2}\sigma^2\,\partial_{SS} H + \sup_{\delta\ge 0}\left\{ \lambda\,e^{-\kappa\delta}\,DH \right\} \mathbb{1}_{q\neq 0} = 0, \tag{4.2} \]

subject to the terminal and boundary conditions

\[ H(T, x, q, S) = x + q\,(S - \ell(q)), \tag{4.3} \]
\[ H(t, x, 0, S) = x, \tag{4.4} \]

where the operator $D$ acts as

\[ DH = H(t, x + (S + \delta), q - 1, S) - H(t, x, q, S), \tag{4.5} \]

and $\mathbb{1}$ is the indicator function.


The form of the HJB equation 4.2 and the conditions 4.3 and 4.4 allow for
the ansatz $H(t, x, q, S) = x + q\,S + h_q(t)$. Substituting this expression into the
above HJB equation gives

\[ \partial_t h_q + \alpha\,q + \sup_{\delta\ge 0}\left\{ \lambda\,e^{-\kappa\delta}\,(\delta + h_{q-1} - h_q) \right\} \mathbb{1}_{q\neq 0} = 0, \tag{4.6} \]
\[ h_q(T) = -q\,\ell(q), \tag{4.7} \]
\[ h_0(t) = 0. \tag{4.8} \]

4.2.3 Feedback controls


Proposition 4.1 (Optimal Feedback Controls). The optimal feedback
controls of the HJB equation 4.2 are given by

\[ \delta_q^*(t) = \left( \frac{1}{\kappa} - h_{q-1}(t) + h_q(t) \right)_+, \qquad q \neq 0, \tag{4.9} \]

where $(x)_+ = \max(x, 0)$.

Proof. Apply first-order conditions to the supremum term in Equation 4.6 to
give 4.9. Verifying that this gives a maximizer to Equation 4.6 is a simple
exercise.
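
As a worked version of that exercise (a sketch in which $f$ and $c$ are shorthand introduced here and do not appear in the text), differentiate the term inside the supremum of Equation 4.6:

```latex
% Shorthand (not in the original): c = h_{q-1}(t) - h_q(t)
f(\delta) = \lambda\, e^{-\kappa\delta}\,(\delta + c), \qquad
f'(\delta) = \lambda\, e^{-\kappa\delta}\,\bigl(1 - \kappa(\delta + c)\bigr).
% f'(\delta) = 0 has the single root
\delta = \frac{1}{\kappa} - c = \frac{1}{\kappa} - h_{q-1}(t) + h_q(t).
% f' > 0 to the left of the root and f' < 0 to the right, so the root is the
% global maximizer on the real line. If the root is negative, f is decreasing
% on [0, \infty) and the constrained maximizer is \delta = 0, which gives the
% positive part (\,\cdot\,)_+ in Equation 4.9.
```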

As long as the stated feedback controls remain positive, they can be
substituted into Equation 4.6 to obtain a nonlinear system of equations for
$h_q(t)$:

\[ \partial_t h_q + \alpha\,q + \frac{\lambda}{\kappa}\, e^{-\kappa\left(\frac{1}{\kappa} - h_{q-1} + h_q\right)}\, \mathbb{1}_{q\neq 0} = 0, \tag{4.10} \]

along with terminal conditions $h_q(T) = -q\,\ell(q)$ and boundary condition
$h_0(t) = 0$. This equation has a closed-form analytical solution which will
be discussed in Section 4.5.
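
When the depths are not guaranteed to stay positive, the system can instead be integrated numerically from Equation 4.6 with the truncated feedback depth of Proposition 4.1. The sketch below is one possible scheme, not the one used by the authors: it assumes an explicit Euler step backward from $T$, the linear penalty $\ell(q) = \theta q$, and default parameter values taken from the caption of Figure 4.1. The name `solve_reference_h` is hypothetical.

```python
import numpy as np

def solve_reference_h(Q=10, T=15.0, lam=2.0, kappa=15.0, alpha=0.0,
                      theta=0.01, n_steps=15000):
    """Backward explicit-Euler solve of Equation 4.6 on a uniform time grid.

    Returns (t, h, delta): h[k, q] approximates h_q(t[k]) and delta[k, q] is
    the feedback depth of Proposition 4.1 (delta[:, 0] is nan, q = 0 is idle).
    """
    dt = T / n_steps
    t = np.linspace(0.0, T, n_steps + 1)
    q = np.arange(Q + 1)
    h = np.zeros((n_steps + 1, Q + 1))
    h[-1] = -q * (theta * q)              # terminal condition h_q(T) = -q l(q)

    def depth(h_row):
        d = np.full(Q + 1, np.nan)
        d[1:] = np.maximum(1.0 / kappa - h_row[:-1] + h_row[1:], 0.0)
        return d

    for k in range(n_steps, 0, -1):
        d = depth(h[k])
        gain = np.zeros(Q + 1)
        gain[1:] = lam * np.exp(-kappa * d[1:]) * (d[1:] + h[k, :-1] - h[k, 1:])
        # dh/dt = -(alpha q + gain), so one step back in time adds dt * (...)
        h[k - 1] = h[k] + dt * (alpha * q + gain)
        h[k - 1, 0] = 0.0                 # boundary condition h_0(t) = 0
    delta = np.array([depth(row) for row in h])
    return t, h, delta
```

Plotting `delta[:, q]` against `t` for each `q` should reproduce the qualitative shape described for Figure 4.1; the step count is a tuning choice and a higher-order integrator could be substituted.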
In Figure 4.1, the optimal depth to be posted by the agent is plotted for
each value of inventory. There are numerous qualitative features in this figure
which have intuitive explanations. First, we see that for any level of inventory,
the optimal depth decreases as the time approaches maturity. When the time
until the end of the trading period is longer, the agent experiences no sense of
urgency in liquidating her position and would rather be patient to sell shares
for high prices. But when maturity is close, the possibility of having to pay
the liquidation fee for unliquidated shares is imminent. The agent is more
willing to take lower prices to avoid having to pay this cost. Similarly, we also
see that the depth decreases as the current level of inventory increases for the
same reason; with a larger inventory holding comes a higher probability of
having to pay a liquidation fee at the end of the trading period. The agent
is willing to transact at lower prices to avoid this possibility. Lastly, we see
that for times close to maturity and large levels of inventory, the agent posts
a depth of zero indicating that she is willing to make essentially no profit in
order to avoid the risk of terminal penalty. If it were not for the constraint
$\delta_t \ge 0$, the agent would post negative depths in this region if it would mean a
faster rate of order fill. This is an indication of the desire to submit a market
order. A proper formulation of optimal market order submission is provided
in Section 4.6.

[Figure 4.1: depth ($\delta$) plotted against time (secs), one curve per inventory
level; the $q = 1$ curve is the highest and depth falls as $q$ increases.]

FIGURE 4.1: Optimal depth for an ambiguity neutral agent. Parameter
values are $\kappa = 15$, $\lambda = 2$, $\sigma = 0.01$, $\alpha = 0$, $\ell(q) = \theta q$, $\theta = 0.01$, $Q = 10$, and
$T = 15$ seconds.

4.3 Ambiguity Aversion


In this section, we allow the agent to acknowledge that the reference model
is only an approximation of reality, but ultimately is misspecified. Adjusting
one’s strategy due to the desire of avoiding the risks of working with a mis-
specified model is called ambiguity aversion. There are many approaches to
incorporating ambiguity aversion into optimal control problems, but we focus
on one which allows tractable solutions in our algorithmic trading context.
The approach we consider involves the agent ranking alternative dynamics
through the specification of a measure Q equivalent to the reference measure
P and evaluating the performance of a trading strategy within the dynamics
given by Q. The agent also penalizes deviations of the measure Q from P
which represents the cost of rejecting the reference measure in favor of the
candidate. If the agent is highly confident in the specification of dynamics in
the reference measure, then this cost should be large even for small deviations.
On the other hand, if the agent lacks confidence in the reference measure, then
the cost should be small and even large deviations will incur a small penalty.
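The optimization problem as originally posed in Equation 4.1 is altered to
reflect the consideration of new dynamics:

\[ H(t, x, q, S) = \sup_{(\delta_s)_{t\le s\le T}\in\mathcal{A}}\ \inf_{\mathbb{Q}\in\mathcal{Q}}\ \mathbb{E}^{\mathbb{Q}}_{t,x,q,S}\left[ X_{\tau\wedge T} + q_{\tau\wedge T}\left(S_{\tau\wedge T} - \ell(q_{\tau\wedge T})\right) + \mathcal{H}_{t,T}(\mathbb{Q}\,|\,\mathbb{P}) \right], \tag{4.11} \]

where the class of equivalent measures $\mathcal{Q}$ and the penalty $\mathcal{H}$ are specified
below.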
Since the agent may exhibit more confidence in the specifications of the
model with respect to some aspects over others, she penalizes deviations of
the model with respect to different aspects with different magnitudes. In par-
ticular, the agent wants unique penalizations toward specifications of:

• Midprice dynamics other than arithmetic Brownian motion

• MO arrival intensities other than the constant λ


• MO maximal price distributions other than exponential with parameter κ

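We consider the same class of candidate measures and penalty function
introduced in Cartea et al. (2014). Namely, any candidate measure is defined in
terms of a Radon–Nikodym derivative

\[ \frac{d\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g)}{d\mathbb{P}} = \frac{d\mathbb{Q}^{\alpha}(\eta)}{d\mathbb{P}}\ \frac{d\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g)}{d\mathbb{Q}^{\alpha}(\eta)}, \tag{4.12} \]

where the intermediate Radon–Nikodym derivatives are

\[ \frac{d\mathbb{Q}^{\alpha}(\eta)}{d\mathbb{P}} = \exp\left\{ -\frac{1}{2}\int_0^T \left(\frac{\alpha - \eta_t}{\sigma}\right)^2 dt - \int_0^T \frac{\alpha - \eta_t}{\sigma}\,dW_t \right\} \tag{4.13} \]

and

\[ \frac{d\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g)}{d\mathbb{Q}^{\alpha}(\eta)} = \exp\left\{ -\int_0^T\!\!\int_0^\infty \left(e^{g_t(y)} - 1\right)\nu_{\mathbb{P}}(dy, dt) + \int_0^T\!\!\int_0^\infty g_t(y)\,\mu(dy, dt) \right\}. \tag{4.14} \]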

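In denoting the candidate measure $\mathbb{Q}^{\alpha,\lambda,\kappa}$, we are conveying that the drift,
arrival rate of MOs, and distribution of MOs have all been changed. The full
class of new measures considered by the agent is

\[ \mathcal{Q}^{\alpha,\lambda,\kappa} = \Biggl\{ \mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g) : \eta, g \text{ are } \mathcal{F}\text{-predictable},\ \mathbb{E}^{\mathbb{P}}\left[\frac{d\mathbb{Q}^{\alpha}(\eta)}{d\mathbb{P}}\right] = 1,\ \mathbb{E}^{\mathbb{Q}^{\alpha}(\eta)}\left[\frac{d\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g)}{d\mathbb{Q}^{\alpha}(\eta)}\right] = 1, \]
\[ \qquad \text{and } \mathbb{E}^{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g)}\left[\int_0^T\!\!\int_0^\infty y^2\, e^{g_t(y)}\,\nu_{\mathbb{P}}(dy, dt)\right] < \infty \Biggr\}. \tag{4.15} \]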

The constraints imposed on the first two expectations ensure that the
Radon–Nikodym derivatives in Equations 4.13 and 4.14 yield probability measures.
The inequality constraint ensures that in the candidate measure, the profits
earned by the agent have a finite variance. The set of candidate measures is
parameterized by the process $\eta$ and the random field $g$, and the dynamics
of the midprice and MOs can be stated in terms of these quantities. In the
candidate measure $\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g)$, the drift of the midprice is no longer $\alpha$, but
is changed to $\eta_t$. Also in the candidate measure, the compensator of $\mu(dy, dt)$
becomes $\nu_{\mathbb{Q}}(dy, dt) = e^{g_t(y)}\,\nu_{\mathbb{P}}(dy, dt)$; see Jacod and Shiryaev (1987), Chapter
III.3c, Theorem 3.17.
Before introducing the penalty function $\mathcal{H}$, we further decompose the
measure change. First, note that according to the above discussion, if $g_t(y)$ does
not depend on $y$, then the fill probability has not changed in the candidate
measure, only the rate of MO arrival is different. Second, if $g_t(y)$ satisfies the
equality

\[ \int_0^\infty e^{g_t(y)}\,F(dy) = 1, \tag{4.16} \]

for all $t \in [0, T]$, then only the fill probability has changed in the candidate
measure and the intensity remains the constant $\lambda$. We are then able to break
into two steps the measure change of Equation 4.14. Given a random field $g$,
we define two new random fields by

\[ g_t^{\lambda} = \log\left( \int_0^\infty e^{g_t(y)}\,F(dy) \right), \tag{4.17} \]
\[ g_t^{\kappa}(y) = g_t(y) - g_t^{\lambda}. \tag{4.18} \]
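The random field $g^{\lambda}$ does not depend on $y$, and it is easily shown that the
random field $g^{\kappa}$ satisfies Equation 4.16. Thus any measure
$\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g) \in \mathcal{Q}^{\alpha,\lambda,\kappa}$ allows us to uniquely define two measures
$\mathbb{Q}^{\alpha,\lambda}(\eta, g)$ and $\mathbb{Q}^{\alpha,\kappa}(\eta, g)$ via Radon–Nikodym derivatives:

\[ \frac{d\mathbb{Q}^{\alpha,\lambda}(\eta, g)}{d\mathbb{Q}^{\alpha}(\eta)} = \exp\left\{ -\int_0^T\!\!\int_0^\infty \left(e^{g_t^{\lambda}} - 1\right)\nu_{\mathbb{P}}(dy, dt) + \int_0^T\!\!\int_0^\infty g_t^{\lambda}\,\mu(dy, dt) \right\}, \]
\[ \frac{d\mathbb{Q}^{\alpha,\kappa}(\eta, g)}{d\mathbb{Q}^{\alpha}(\eta)} = \exp\left\{ -\int_0^T\!\!\int_0^\infty \left(e^{g_t^{\kappa}(y)} - 1\right)\nu_{\mathbb{P}}(dy, dt) + \int_0^T\!\!\int_0^\infty g_t^{\kappa}(y)\,\mu(dy, dt) \right\}. \]

These measure changes represent decomposing the change from $\mathbb{Q}^{\alpha}(\eta)$ to
$\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g)$ into two subsequent measure changes, one in which only the
intensity of orders changes and one in which only fill probability of orders changes.
This is graphically represented in Figure 4.2.

[Figure 4.2: diagram of measure changes. $\mathbb{P} \to \mathbb{Q}^{\alpha}$ via $\eta$; from $\mathbb{Q}^{\alpha}$ to
$\mathbb{Q}^{\alpha,\lambda,\kappa}$ either directly via $g$, through $\mathbb{Q}^{\alpha,\lambda}$ via $g^{\lambda}$ then $g^{\kappa}$, or through
$\mathbb{Q}^{\alpha,\kappa}$ via $g^{\kappa}$ then $g^{\lambda}$.]

FIGURE 4.2: Three natural alternative routes from the reference measure
$\mathbb{P}$ to a candidate measure $\mathbb{Q}^{\alpha,\lambda,\kappa}$ in which midprice drift, MO intensity, and
execution price distribution of MOs have been altered.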
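
The split of $g$ into an intensity part and a fill-probability part can be checked numerically. The sketch below assumes, purely for illustration, a piecewise-constant field equal to `a` for $y < d$ and `b` for $y \ge d$, so that integrals against the exponential distribution $F$ are available in closed form. The function names are hypothetical.

```python
import math

def decompose_g(a, b, d, kappa=15.0):
    """Split g(y) = a for y < d, b for y >= d into (g_lambda, g_kappa).

    g_lambda is a scalar (Equation 4.17); g_kappa is returned as the pair of
    values it takes below and above d (Equation 4.18).
    """
    p_above = math.exp(-kappa * d)   # F([d, inf)) when F(dy) = kappa e^{-kappa y} dy
    mass = math.exp(a) * (1.0 - p_above) + math.exp(b) * p_above
    g_lambda = math.log(mass)
    return g_lambda, (a - g_lambda, b - g_lambda)

def total_mass(g_below, g_above, d, kappa=15.0):
    """Integral of exp(g) against F for a piecewise-constant g."""
    p_above = math.exp(-kappa * d)
    return math.exp(g_below) * (1.0 - p_above) + math.exp(g_above) * p_above
```

Applying `total_mass` to the returned fill-probability part gives 1 up to rounding, which is the constraint in Equation 4.16, and adding the scalar intensity part back recovers the original field.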
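To solve the robust optimization problem proposed in Equation 4.11, we
consider $g^{\lambda}$ and $g^{\kappa}$ to be defined on their own, so that $g^{\lambda}$ does not depend on
$y$ and $g^{\kappa}$ satisfies Equation 4.16. Then $g_t(y) = g_t^{\lambda} + g_t^{\kappa}(y)$ parameterizes a full
measure change. The penalty function we select is

\[ \mathcal{H}\left(\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g)\,|\,\mathbb{P}\right) =
\begin{cases}
\dfrac{1}{\varphi_\alpha}\log\left(\dfrac{d\mathbb{Q}^{\alpha}(\eta)}{d\mathbb{P}}\right) + \dfrac{1}{\varphi_\lambda}\log\left(\dfrac{d\mathbb{Q}^{\alpha,\lambda}(\eta, g)}{d\mathbb{Q}^{\alpha}(\eta)}\right) + \dfrac{1}{\varphi_\kappa}\log\left(\dfrac{d\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g)}{d\mathbb{Q}^{\alpha,\lambda}(\eta, g)}\right), & \text{if } \varphi_\lambda \ge \varphi_\kappa, \\[2ex]
\dfrac{1}{\varphi_\alpha}\log\left(\dfrac{d\mathbb{Q}^{\alpha}(\eta)}{d\mathbb{P}}\right) + \dfrac{1}{\varphi_\kappa}\log\left(\dfrac{d\mathbb{Q}^{\alpha,\kappa}(\eta, g)}{d\mathbb{Q}^{\alpha}(\eta)}\right) + \dfrac{1}{\varphi_\lambda}\log\left(\dfrac{d\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g)}{d\mathbb{Q}^{\alpha,\kappa}(\eta, g)}\right), & \text{if } \varphi_\lambda \le \varphi_\kappa.
\end{cases} \tag{4.19} \]

A graphical representation of how this choice of penalty function places
different weights on each step of the full measure change is given in Figure 4.3. Note
from the definition 4.19 that when all three ambiguity weights are equal, the
optimization problem 4.11 reduces to using relative entropy as the penalty.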
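
The reduction to relative entropy can be seen by writing out the equal-weight case. In the sketch below, $\varphi$ denotes the common value of the three weights, and the arguments $(\eta, g)$ are suppressed for brevity:

```latex
% Equal weights: \varphi_\alpha = \varphi_\lambda = \varphi_\kappa = \varphi.
\mathcal{H}\bigl(\mathbb{Q}^{\alpha,\lambda,\kappa} \,|\, \mathbb{P}\bigr)
  = \frac{1}{\varphi}\left[
      \log\frac{d\mathbb{Q}^{\alpha}}{d\mathbb{P}}
    + \log\frac{d\mathbb{Q}^{\alpha,\lambda}}{d\mathbb{Q}^{\alpha}}
    + \log\frac{d\mathbb{Q}^{\alpha,\lambda,\kappa}}{d\mathbb{Q}^{\alpha,\lambda}}
    \right]
  = \frac{1}{\varphi}\,\log\frac{d\mathbb{Q}^{\alpha,\lambda,\kappa}}{d\mathbb{P}}.
% The three logarithms telescope. Taking the expectation under the candidate
% measure, as in Equation 4.11, gives 1/\varphi times the relative entropy of
% \mathbb{Q}^{\alpha,\lambda,\kappa} with respect to \mathbb{P}. Both branches
% of Equation 4.19 give the same result in this case.
```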
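[Figure 4.3: the diagram of Figure 4.2 with each step labeled by its ambiguity
weight. $\mathbb{P} \to \mathbb{Q}^{\alpha}$: $\varphi_\alpha$; $\mathbb{Q}^{\alpha} \to \mathbb{Q}^{\alpha,\lambda}$: $\varphi_\lambda$, then $\mathbb{Q}^{\alpha,\lambda} \to \mathbb{Q}^{\alpha,\lambda,\kappa}$: $\varphi_\kappa$;
$\mathbb{Q}^{\alpha} \to \mathbb{Q}^{\alpha,\kappa}$: $\varphi_\kappa$, then $\mathbb{Q}^{\alpha,\kappa} \to \mathbb{Q}^{\alpha,\lambda,\kappa}$: $\varphi_\lambda$.]

FIGURE 4.3: Ambiguity weights associated with each sequential step of the
full measure change.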

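We return now to the optimization problem 4.11 which has an associated
Hamilton–Jacobi–Bellman–Isaacs (HJBI) equation:

\[ \begin{aligned}
\partial_t H &+ \frac{1}{2}\sigma^2\,\partial_{SS} H + \inf_{\eta}\left\{ \eta\,\partial_S H + \frac{1}{2\varphi_\alpha}\left(\frac{\alpha - \eta}{\sigma}\right)^2 \right\} \\
&+ \sup_{\delta\ge 0}\ \inf_{g^{\lambda}}\ \inf_{g^{\kappa}\in\mathcal{G}} \Biggl\{ \lambda \int_{\delta}^{\infty} e^{g^{\lambda} + g^{\kappa}(y)}\,F(dy)\ DH\,\mathbb{1}_{q\neq 0} \\
&\qquad + K^{\varphi_\lambda,\varphi_\kappa}(g^{\lambda}, g^{\kappa})\,\mathbb{1}_{\varphi_\lambda \ge \varphi_\kappa} + K^{\varphi_\kappa,\varphi_\lambda}(g^{\kappa}, g^{\lambda})\,\mathbb{1}_{\varphi_\lambda < \varphi_\kappa} \Biggr\} = 0,
\end{aligned} \tag{4.20} \]

with conditions

\[ H(T, x, q, S) = x + q\,(S - \ell(q)), \tag{4.21} \]
\[ H(t, x, 0, S) = x, \tag{4.22} \]

where the functional $K$ is given by

\[ \begin{aligned}
K^{c,d}(a, b) &= \frac{1}{c}\int_0^\infty \left[ -\left(e^{a(y)} - 1\right) + a(y)\,e^{a(y)+b(y)} \right] \lambda\,F(dy) \\
&\quad + \frac{1}{d}\int_0^\infty \left[ -\left(e^{b(y)} - 1\right) e^{a(y)} + b(y)\,e^{a(y)+b(y)} \right] \lambda\,F(dy),
\end{aligned} \]

and the class of functions $\mathcal{G}$ is defined by an integral constraint:

\[ \mathcal{G} = \left\{ g : \int_0^\infty e^{g(y)}\,F(dy) = 1 \right\}. \tag{4.23} \]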

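In a similar fashion to Equation 4.2, this equation yields the ansatz
$H(t, x, q, S) = x + q\,S + h_q(t)$, which when substituted into Equations 4.20
through 4.22 gives

\[ \begin{aligned}
\partial_t h_q &+ \inf_{\eta}\left\{ \eta\,q + \frac{1}{2\varphi_\alpha}\left(\frac{\alpha - \eta}{\sigma}\right)^2 \right\} \\
&+ \sup_{\delta\ge 0}\ \inf_{g^{\lambda}}\ \inf_{g^{\kappa}\in\mathcal{G}} \Biggl\{ \lambda \int_{\delta}^{\infty} e^{g^{\lambda} + g^{\kappa}(y)}\,F(dy)\ \bigl(\delta + h_{q-1}(t) - h_q(t)\bigr)\,\mathbb{1}_{q\neq 0} \\
&\qquad + K^{\varphi_\lambda,\varphi_\kappa}(g^{\lambda}, g^{\kappa})\,\mathbb{1}_{\varphi_\lambda \ge \varphi_\kappa} + K^{\varphi_\kappa,\varphi_\lambda}(g^{\kappa}, g^{\lambda})\,\mathbb{1}_{\varphi_\lambda < \varphi_\kappa} \Biggr\} = 0,
\end{aligned} \tag{4.24} \]
\[ h_q(T) = -q\,\ell(q), \tag{4.25} \]
\[ h_0(t) = 0. \tag{4.26} \]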

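Proposition 4.2 (Solution to HJBI equation). Equation 4.24 along with
the terminal and boundary conditions has a unique classical solution.
Furthermore, the optimum in Equation 4.24 is achieved, where the optimizers are
given by

\[ \begin{aligned}
\delta_q^*(t) &= \left( \frac{1}{\varphi_\kappa}\log\left(1 + \frac{\varphi_\kappa}{\kappa}\right) - h_{q-1}(t) + h_q(t) \right)_+, \qquad q \neq 0, \\
\eta_q^*(t) &= \alpha - \varphi_\alpha\,\sigma^2\,q, \\
g_q^{\lambda*}(t) &= \frac{\varphi_\lambda}{\varphi_\kappa}\log\left( 1 - e^{-\kappa\,\delta_q^*(t)}\left(1 - e^{-\varphi_\kappa\left(\delta_q^*(t) + h_{q-1}(t) - h_q(t)\right)}\right) \right) \mathbb{1}_{q\neq 0}, \\
g_q^{\kappa*}(t, y) &= -\Bigl[ \log\left( 1 - e^{-\kappa\,\delta_q^*(t)}\left(1 - e^{-\varphi_\kappa\left(\delta_q^*(t) + h_{q-1}(t) - h_q(t)\right)}\right) \right) \\
&\qquad\quad + \varphi_\kappa\left(\delta_q^*(t) + h_{q-1}(t) - h_q(t)\right)\mathbb{1}_{y \ge \delta_q^*(t)} \Bigr]\,\mathbb{1}_{q\neq 0}.
\end{aligned} \tag{4.27} \]

Proof. See Appendix.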
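
The optimizers can be evaluated directly once $h_{q-1}(t) - h_q(t)$ is known. The sketch below assumes $q \ge 1$ and reads $g^{\kappa*}$ as $-[\log(\cdot) + \varphi_\kappa(\cdot)\mathbb{1}_{y\ge\delta^*}]$, which is the reading under which the constraint of Equation 4.16 holds. The function name `robust_controls` and the argument `dh` are hypothetical.

```python
import math

def robust_controls(dh, q, phi_alpha, phi_lambda, phi_kappa,
                    kappa=15.0, alpha=0.0, sigma=0.01):
    """Evaluate the optimizers of Proposition 4.2 for an inventory q >= 1.

    dh is h_{q-1}(t) - h_q(t). Returns (delta, eta, g_lambda, g_kappa_below,
    g_kappa_above); g_kappa is piecewise constant with a jump at y = delta.
    """
    delta = max(math.log1p(phi_kappa / kappa) / phi_kappa - dh, 0.0)
    eta = alpha - phi_alpha * sigma**2 * q
    edge = delta + dh                      # delta + h_{q-1} - h_q
    inner = 1.0 - math.exp(-kappa * delta) * (1.0 - math.exp(-phi_kappa * edge))
    g_lambda = (phi_lambda / phi_kappa) * math.log(inner)
    g_kappa_below = -math.log(inner)                     # value for y < delta
    g_kappa_above = -math.log(inner) - phi_kappa * edge  # value for y >= delta
    return delta, eta, g_lambda, g_kappa_below, g_kappa_above
```

With a positive `edge`, `g_lambda` is negative and `g_kappa_above` lies below `g_kappa_below`, so the worst-case measure both slows MO arrivals and lowers the probability that the posted order is filled. Letting `phi_kappa` tend to zero recovers the depth $1/\kappa - h_{q-1} + h_q$ of Proposition 4.1.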

The feedback expressions above give the pointwise optimizer in the
differential equation 4.24. Since we have a classical solution to this equation, the
function $H$ serves as a candidate value function. Also, by appropriately
substituting the processes for the state variables in Equation 4.27, we get candidate
optimal controls. The verification theorem below guarantees that these
candidates are indeed the value function and optimal controls we seek.

Theorem 4.1 (Verification Theorem). Let $h_q(t)$ be the solution to
Equation 4.24 and let $H(t, x, q, S) = x + q\,S + h_q(t)$. Also let $\delta_t = \delta_{q_t}^*(t)$,
$\eta_t = \eta_{q_t}^*(t)$, $g_t^{\lambda} = g_{q_t}^{\lambda*}(t)$, and $g_t^{\kappa}(y) = g_{q_t}^{\kappa*}(t, y)$ define processes. Then
$\delta$, $\eta$, $g^{\lambda}$, and $g^{\kappa}$ are admissible controls. Further, $H$ is the value function to
the agent’s control problem 4.11 and the optimum is achieved by these controls.

Proof. See Appendix.



4.4 Effects of Ambiguity on the Optimal Strategy


In this section, we investigate how the agent changes her optimal posting
strategy based on her levels of ambiguity toward each source of uncertainty.
For the three types of ambiguity, we compute the optimal strategy and com-
pare it to the optimal strategy presented in Figure 4.1.

4.4.1 Arrival rate


When the agent is ambiguity averse only to the arrival rate of market orders,
the expressions for $\delta^*$, $g^{\lambda*}$, $g^{\kappa*}$ in Equation 4.27 are interpreted in terms of a
limit as $\varphi_\kappa \to 0$ to arrive at

\[ \begin{aligned}
\delta_q^*(t) &= \left( \frac{1}{\kappa} - h_{q-1}(t) + h_q(t) \right)_+, \qquad q \neq 0, \\
g_q^{\lambda*}(t) &= -\varphi_\lambda\, e^{-\kappa\,\delta_q^*(t)}\left(\delta_q^*(t) + h_{q-1}(t) - h_q(t)\right)\mathbb{1}_{q\neq 0}, \\
g_q^{\kappa*}(t, y) &= 0, \\
\eta_t^* &= \alpha.
\end{aligned} \]

When the agent is ambiguity averse only to the market order arrival rate, we
expect $g^{\kappa*} = 0$ and $\eta^* = \alpha$. In other words, the agent is fully confident in her
model of fill probabilities and drift of the midprice.
In Figure 4.4, we show a comparison of the optimal posting level when
the agent is ambiguity averse to only market order arrival versus ambiguity
neutral. Also shown is the market order intensity induced by the optimal
candidate measure at various times and for various inventory levels, $\lambda\,e^{g_q^{\lambda*}(t)}$.
Of particular interest is how the effective market order intensities differ as
the strategy approaches maturity. For a fixed level of inventory, the effective
market order intensity decreases as maturity approaches because that is the
time in which such a change can impair the agent’s performance the most.
Also note that the largest change in the optimal posting occurs for the largest
value of inventory. When the inventory is at its maximum, the agent’s fear of
a misspecified arrival rate can have the most significant impact on her
trading performance.

4.4.2 Fill probability


When the agent is ambiguity averse only to fill probability, the quantities in
Equation 4.27 are exactly as stated, with $g^{\lambda*} = 0$ and $\eta^* = \alpha$. This result can
be explained along the same lines as the reasoning in the previous section.
The change in optimal posting due to ambiguity to the fill probability,
shown in Figure 4.5, is qualitatively different from the case of ambiguity
aversion specific to market order intensity. First, note that the change is smallest
when the inventory position is largest. This is because the agent already has
a natural level of protection to this type of ambiguity when inventory levels
are large. When holding a large level of inventory, the agent will post lower
prices in order to liquidate faster and avoid the terminal liquidation penalty.
But when posting smaller prices, an equal change in the fill probability suffers
a larger penalty (when compared to a similar change for a large price). The
second difference to note is that the magnitude of the changes decreases as
time approaches the trading horizon $T$ rather than increases.

[Figure 4.4: three panels. Market order arrival rate against inventory ($q$) at
$t = 0$, $t = 5$, and $t = 12$; depth ($\delta$) against time (secs); change in depth
against time (secs).]

FIGURE 4.4: Market order arrival rate induced by optimal $g_t$, optimal
depth, and change in depth due to ambiguity for an agent who is ambiguity
averse to MO rate of arrival (dashed lines are ambiguity neutral depths).
Parameter values are $\varphi_\lambda = 6$, $\kappa = 15$, $\lambda = 2$, $\sigma = 0.01$, $\alpha = 0$, $\ell(q) = \theta q$,
$\theta = 0.01$, $Q = 10$, and $T = 15$.

[Figure 4.5: two panels. Depth ($\delta$) against time (secs); change in depth
($\delta - \delta_0$, scale $\times 10^{-3}$) against time (secs).]

FIGURE 4.5: Optimal depth and change in depth due to ambiguity for an
agent who is ambiguity averse to fill probability (dashed lines are ambiguity
neutral depths). Parameter values are $\varphi_\kappa = 3$, $\kappa = 15$, $\lambda = 2$, $\sigma = 0.01$, $\alpha = 0$,
$\ell(q) = \theta q$, $\theta = 0.01$, $Q = 10$, and $T = 15$.

[Figure 4.6: two panels. Depth ($\delta$) against time (secs); change in depth
($\delta - \delta_0$) against time (secs).]

FIGURE 4.6: Optimal depth and change in depth due to ambiguity for an
agent who is ambiguity averse to midprice drift (dashed lines are ambiguity
neutral depths). Parameter values are $\varphi_\alpha = 5$, $\kappa = 15$, $\lambda = 2$, $\sigma = 0.01$, $\alpha = 0$,
$\ell(q) = \theta q$, $\theta = 0.01$, $Q = 10$, and $T = 15$.

4.4.3 Midprice drift


When the agent is ambiguity averse to only midprice drift, market order
dynamics follow that of the reference measure with constant intensity and
exponential fill probabilities. However, the asset drift selected by the mini-
mization aspect of the problem explicitly takes the form of a decreasing func-
tion of the agent’s remaining inventory q.

4.4.3.1 Equivalence to inventory penalization


Here we briefly pose an equivalent interpretation of ambiguity with respect
to midprice drift in terms of an inventory penalization. Once again, consider
an agent with the goal of liquidating a number of shares before some time
$T$, but instead of ambiguity aversion the agent enforces a quadratic inventory
penalization so that her value function is

\[ H^{\phi}(t, x, q, S) = \sup_{(\delta_u)_{t\le u\le T}\in\mathcal{A}} \mathbb{E}^{\mathbb{P}}_{t,x,q,S}\left[ X_{\tau\wedge T} + q_{\tau\wedge T}\left(S_{\tau\wedge T} - \ell(q_{\tau\wedge T})\right) - \frac{1}{2}\,\phi\,\sigma^2 \int_t^{\tau\wedge T} q_u^2\,du \right]. \]
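After the usual ansatz, substitution into the corresponding HJB equation, and
solving for the optimal controls $\delta$, the resulting system of ODEs is

\[ \partial_t h_q^{\phi} + \alpha\,q - \frac{1}{2}\,\phi\,\sigma^2 q^2 + \sup_{\delta\ge 0}\left\{ \lambda\,e^{-\kappa\delta}\left(\delta + h_{q-1}^{\phi} - h_q^{\phi}\right) \right\} \mathbb{1}_{q\neq 0} = 0. \]

This equation is equivalent to Equation 4.24 in the limits $\varphi_\lambda \to 0$ and $\varphi_\kappa \to 0$,
with $\phi = \varphi_\alpha$, and therefore we have

\[ \begin{aligned}
&\sup_{(\delta_u)_{t\le u\le T}\in\mathcal{A}} \mathbb{E}^{\mathbb{P}}_{t,x,q,S}\left[ X_{\tau\wedge T} + q_{\tau\wedge T}\left(S_{\tau\wedge T} - \ell(q_{\tau\wedge T})\right) - \frac{1}{2}\,\phi\,\sigma^2 \int_t^{\tau\wedge T} q_u^2\,du \right] \\
&\quad = \sup_{(\delta_u)_{t\le u\le T}\in\mathcal{A}}\ \inf_{\mathbb{Q}\in\mathcal{Q}^{\alpha}} \mathbb{E}^{\mathbb{Q}}_{t,x,q,S}\left[ X_{\tau\wedge T} + q_{\tau\wedge T}\left(S_{\tau\wedge T} - \ell(q_{\tau\wedge T})\right) + \mathcal{H}_{t,\tau\wedge T}[\mathbb{Q}\,|\,\mathbb{P}] \right].
\end{aligned} \]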

Since we are considering only ambiguity on the midprice drift, the infimum
is taken over equivalent measures where MO intensity and fill probability
remain fixed to the reference measure. Thus a cumulative inventory
penalization is equivalent to considering ambiguity on the drift of the midprice
of the asset.
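
The quadratic inventory term comes from the drift minimization in Equation 4.24, which can be carried out explicitly. A short worked version, using only quantities defined above:

```latex
% Minimize over \eta the drift term of Equation 4.24:
\inf_{\eta}\left\{ \eta\,q + \frac{1}{2\varphi_\alpha}
      \left(\frac{\alpha - \eta}{\sigma}\right)^{2} \right\}.
% First-order condition: q - \frac{\alpha - \eta}{\varphi_\alpha \sigma^{2}} = 0,
% so \eta^{*} = \alpha - \varphi_\alpha \sigma^{2} q, as in Proposition 4.2.
% Substituting \eta^{*} back:
\eta^{*} q + \frac{1}{2\varphi_\alpha}\,\varphi_\alpha^{2}\sigma^{2}q^{2}
  = \alpha\,q - \varphi_\alpha\sigma^{2}q^{2} + \tfrac{1}{2}\varphi_\alpha\sigma^{2}q^{2}
  = \alpha\,q - \tfrac{1}{2}\,\varphi_\alpha\,\sigma^{2}q^{2}.
% This matches the term \alpha q - \tfrac{1}{2}\phi\sigma^{2}q^{2} of the
% inventory-penalized ODE when \phi = \varphi_\alpha.
```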
As shown in Figure 4.6, the change in the optimal depth is smallest when
the inventory is smallest. However, the change in optimal depth consistently
becomes more negative with time to maturity. Both of these behaviors make
sense when midprice ambiguity is interpreted as a cumulative inventory
penalization. The first characteristic is made clear by noting that the larger the
agent’s inventory position is, the larger the accumulation of the inventory
penalty is, and the stronger the agent’s desire to liquidate shares. The second
characteristic is explained by noting that the impact on the strategy’s perfor-
mance due to a misspecified drift is approximately linear in time to maturity.
The effect of ambiguity on the drift of the midprice turns out to be more
significant than ambiguity on the other two factors in the case of the liqui-
dation problem. As shown in Section 4.5, specifically in Proposition 4.4, the
optimal limit order price grows without bound as time to maturity increases
when there is no ambiguity on drift. However, when ambiguity on drift is
considered, all of the optimal posting levels become finite for all time.

4.5 Closed-Form Solutions


Suppose the agent has equal levels of ambiguity to MO arrival rate and
fill probabilities so that $\varphi_\lambda = \varphi_\kappa = \varphi$. Suppose further that the optimal
depths $\delta_q^*(t)$ given by Proposition 4.2 are positive for all $t$ and $q$. Then
substituting all of the optimal controls into Equation 4.24 gives

\[ \partial_t h_q + \alpha\,q - \frac{1}{2}\,\varphi_\alpha\,\sigma^2 q^2 + \frac{\xi}{\kappa}\, e^{-\kappa\,(-h_{q-1} + h_q)} = 0, \tag{4.28} \]

where

\[ \xi = \left(1 + \frac{\varphi}{\kappa}\right)^{-\left(1 + \frac{\kappa}{\varphi}\right)} \lambda, \]

with terminal and boundary conditions $h_q(T) = -q\,\ell(q)$ and $h_0(t) = 0$.
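
The constant $\xi$ can be recovered by carrying out the inner optimizations of Equation 4.24 with equal weights. The following is a sketch of that calculation, in which $A$ and $c$ are shorthand introduced here and the depth is assumed positive:

```latex
% Shorthand (not in the original): c = h_{q-1} - h_q, A = \delta + c.
% With \varphi_\lambda = \varphi_\kappa = \varphi the penalty is entropic, and
% the pointwise minimum over g = g^{\lambda} + g^{\kappa} is attained at
% g(y) = -\varphi A for y \ge \delta and g(y) = 0 for y < \delta, giving
\inf_{g}\{\cdots\} = \frac{\lambda}{\varphi}\, e^{-\kappa\delta}\left(1 - e^{-\varphi A}\right).
% Maximizing over \delta: the first-order condition is
\kappa\left(1 - e^{-\varphi A}\right) = \varphi\, e^{-\varphi A}
\;\Longrightarrow\;
A^{*} = \frac{1}{\varphi}\log\left(1 + \frac{\varphi}{\kappa}\right),
% in agreement with \delta^{*} of Proposition 4.2. Substituting back, with
% e^{-\kappa\delta^{*}} = \left(1 + \varphi/\kappa\right)^{-\kappa/\varphi} e^{\kappa c}:
\frac{\lambda}{\varphi}\, e^{-\kappa\delta^{*}}\, \frac{\varphi}{\kappa + \varphi}
  = \frac{\lambda}{\kappa}\left(1 + \frac{\varphi}{\kappa}\right)^{-\left(1 + \frac{\kappa}{\varphi}\right)}
    e^{-\kappa\,(-h_{q-1} + h_q)}
  = \frac{\xi}{\kappa}\, e^{-\kappa\,(-h_{q-1} + h_q)}.
% As \varphi \to 0, \xi \to \lambda e^{-1}, which recovers Equation 4.10.
```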
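Proposition 4.3 (Solving for $h_q(t)$ in Equation 4.28). Let
$K_q = \alpha\,\kappa\,q - \frac{1}{2}\,\varphi_\alpha\,\sigma^2\,\kappa\,q^2$.

i. Suppose $K_q = 0$ for all $q$ (i.e., $\varphi_\alpha = 0$ and $\alpha = 0$). Then

\[ h_q(t) = \frac{1}{\kappa}\log\left( \sum_{n=0}^{q} C_{q,n}\,(T - t)^n \right), \]

where

\[ C_{q,n} = \frac{\xi^n}{n!}\, e^{-\kappa\,(q-n)\,\ell(q-n)}. \]

ii. Suppose $K_q = 0$ only when $q = 0$. Then

\[ h_q(t) = \frac{1}{\kappa}\log\left( \sum_{n=0}^{q} C_{q,n}\, e^{K_n (T - t)} \right), \]

where

\[ \begin{aligned}
C_{n+j,n} &= (-\xi)^j \left( \prod_{p=1}^{j} \frac{1}{K_{n+p} - K_n} \right) C_{n,n}, \\
C_{q,q} &= -\sum_{n=0}^{q-1} C_{q,n} + e^{-\kappa\,q\,\ell(q)}, \\
C_{0,0} &= 1.
\end{aligned} \]

Proof. See Appendix.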
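
Case (ii) is straightforward to evaluate in code. The sketch below assumes the linear penalty $\ell(q) = \theta q$ and builds the coefficients with the one-step recursion $C_{q,n} = -\xi\,C_{q-1,n}/(K_q - K_n)$, which is equivalent to the product formula. The sum defining $C_{q,q}$ runs from $n = 0$, as the terminal condition $h_q(T) = -q\,\ell(q)$ requires. The function name `closed_form_h` is hypothetical.

```python
import math

def closed_form_h(Q, tau, xi, kappa, alpha, phi_alpha, sigma, theta):
    """h_q at time-to-maturity tau = T - t for q = 0..Q, case (ii).

    Assumes l(q) = theta * q and K_q != K_n whenever q != n.
    """
    K = [alpha * kappa * q - 0.5 * phi_alpha * sigma**2 * kappa * q**2
         for q in range(Q + 1)]
    C = [[0.0] * (Q + 1) for _ in range(Q + 1)]   # C[q][n], used for n <= q
    C[0][0] = 1.0
    for q in range(1, Q + 1):
        for n in range(q):
            # One step of the product formula for C_{n+j,n}
            C[q][n] = -xi * C[q - 1][n] / (K[q] - K[n])
        # Terminal condition: sum_n C_{q,n} = exp(-kappa q l(q))
        C[q][q] = math.exp(-kappa * q * theta * q) - sum(C[q][:q])
    return [math.log(sum(C[q][n] * math.exp(K[n] * tau)
                         for n in range(q + 1))) / kappa
            for q in range(Q + 1)]
```

The coefficients alternate in sign and grow quickly with $q$, so for large inventories the sum suffers from cancellation in double precision; a numerical solve of Equation 4.28 may be more robust in that regime.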
The two cases considered in Proposition 4.3 exclude the case $K_q = 0$ for
exactly two distinct values of $q$ (the proposition only considers that either all
$K_q$ are zero or only $K_0 = 0$). Due to the quadratic dependence of $K_q$ on $q$,
this omitted case is the only one not considered. However, it is very unlikely
that this case would arise in practice, for example, if the model were calibrated
to market data. Even if it did, an arbitrarily small adjustment could
be made to any of the parameters (the most reasonable choice would be $\varphi_\alpha$)
so that the ratio $\alpha / (\varphi_\alpha \sigma^2)$ is irrational.
Proposition 4.3 shows that the value function can have two different
functional forms depending on the possible values of $K_q = \alpha\,\kappa\,q - \frac{1}{2}\,\varphi_\alpha\,\sigma^2\,\kappa\,q^2$.
How does this affect the optimal depth $\delta^*$? The proposition below shows
the difference in behavior of the optimal depths as time to maturity becomes
arbitrarily large when (i) $K_q = 0$, $\forall q > 0$, and (ii) $K_q < 0$, $\forall q > 0$.
Proposition 4.4 (Behavior of optimal depths as $(T - t) \to \infty$). Let
$\tau = (T - t)$.
In case (i) of Proposition 4.3, the optimal depths $\delta_q^*(t)$ grow as
$\frac{1}{\kappa}\log\left(\frac{\xi}{q}\,\tau\right)$ as $\tau \to \infty$.
In case (ii) of Proposition 4.3, if $K_q < 0$, then the optimal depths $\delta_q^*(t)$
approach $\frac{1}{\varphi}\log\left(1 + \frac{\varphi}{\kappa}\right) + \frac{1}{\kappa}\log\left(\frac{-\xi}{K_q}\right)$ as $\tau \to \infty$.

Proof. See Appendix.

4.6 Inclusion of Market Orders


Up to this point, the agent has been restricted to trades using limit orders
only. However, she may improve her trading performance if she also executes
market orders. This improvement in performance would be due to submitting
single trades which would assist in avoiding the large penalty invoked by
submitting a large order at the end of the trading period. Mathematically,
the inclusion of market orders corresponds to the additional use of an impulse
control by the agent.
The effect of market orders on the agent’s inventory and wealth dynamics
is treated as follows: market sell orders executed by the agent are filled at a
price $\Delta/2$ less than the midprice. Denote the number of market orders executed
by the agent up to time $t$ by $J_t$. Then the updated dynamics of the agent’s
inventory and wealth are

\[ dq_t = -dN_t - dJ_t, \]
\[ dX_t = (S_t + \delta_t)\,dN_t + \left(S_t - \frac{\Delta}{2}\right) dJ_t, \]

respectively.
The agent’s additional control consists of the set of times at which the
process J increments.

4.6.1 Feedback controls


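The agent must select a sequence of stopping times $\tau_k$ at which she executes
a market order. The updated optimization problem for the ambiguity neutral
agent is

\[ H(t, x, q, S) = \sup_{\substack{(\delta_s)_{t\le s\le T}\in\mathcal{A} \\ (\tau_k)_{k=1,\dots,Q}}} \mathbb{E}^{\mathbb{P}}_{t,x,q,S}\left[ X_{\tau\wedge T} + q_{\tau\wedge T}\left(S_{\tau\wedge T} - \ell(q_{\tau\wedge T})\right) \right]. \]

The inclusion of market orders changes the equation satisfied by the value
function $H(t, x, q, S)$. Rather than a standard HJB equation, $H$ now satisfies
a quasi-variational inequality of the following form:

\[ \max\left\{ \partial_t H + \alpha\,\partial_S H + \frac{1}{2}\sigma^2\,\partial_{SS} H + \sup_{\delta\ge 0}\left\{ \lambda\,e^{-\kappa\delta}\,DH \right\} \mathbb{1}_{q\neq 0}\ ;\ H\left(t, x + \left(S - \frac{\Delta}{2}\right), q - 1, S\right) - H(t, x, q, S) \right\} = 0. \tag{4.29} \]

From Equation 4.29, it is clear that one of the two terms must be equal to
zero, and the other term must be less than or equal to zero. This allows the
definition of a continuation region and execution region:

\[ \mathcal{C} = \left\{ (t, x, q, S) : \partial_t H + \alpha\,\partial_S H + \frac{1}{2}\sigma^2\,\partial_{SS} H + \sup_{\delta\ge 0}\left\{ \lambda\,e^{-\kappa\delta}\,DH \right\} \mathbb{1}_{q\neq 0} = 0 \right\}; \]
\[ \mathcal{E} = \left\{ (t, x, q, S) : H\left(t, x + S - \frac{\Delta}{2}, q - 1, S\right) - H(t, x, q, S) = 0 \right\}. \]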

Whenever $(t,x,q,S)\in E$, it is beneficial for the agent to execute a market
order. When $(t,x,q,S)\in C$, it is more beneficial for the agent to refrain from
executing a market order and continue with the optimal placement of limit
orders. In the case of using only limit orders, it was the optimal depths $\delta_t$
that were of interest. In this case, the boundary between the continuation and
execution regions is also of interest. The ansatz $H(t,x,q,S) = x + qS + h_q(t)$
still applies, which simplifies the quasi-variational inequality to

$$\max\Big\{\partial_t h_q + \alpha q + \sup_{\delta\ge 0}\lambda e^{-\kappa\delta}\,(\delta + h_{q-1} - h_q)\,\mathbb{1}_{q\neq 0}\ ;\ h_{q-1} - h_q - \frac{\Delta}{2}\Big\} = 0. \qquad (4.30)$$

It is clear that the feedback expression for the optimal depths is of the same
form as in Proposition 4.1. However, the numerical values of these depths
will be different because the values of the functions $h_q(t)$ will be different.
Also of importance is to note that after making the ansatz, the continuation
and execution regions can be redefined in terms of $h_q(t)$ and therefore will
not depend on the state variables $x$ and $S$. The boundary between the two
regions is a curve in the $(t,q)$ plane. Figure 4.7 illustrates the optimal depths
and market order execution boundary for an agent who liquidates a portfolio
of assets with both limit and market orders. The notable difference in the
optimal depths between this case and that of an agent who does not execute
market orders is that presently, they are bounded below by $\max(0, \frac{1}{\kappa} - \frac{\Delta}{2})$,
while without market orders they are bounded below by 0. This is easily seen
from the feedback form of the depths, $\delta^*_q(t) = (\frac{1}{\kappa} - h_{q-1}(t) + h_q(t))_+$, combined
with the inequality $h_{q-1}(t) - h_q(t) - \frac{\Delta}{2} \le 0$.
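One way to see how Equation 4.30 might be solved in practice is the following minimal sketch. It is not the authors' numerical method: it assumes a simple explicit Euler step backward in time followed by a projection onto the execution constraint, it takes $\ell(q) = \theta q$ as in Figure 4.7, and it applies the projection at the terminal time as well. The function name `solve_qvi` and the grid size are hypothetical, and a smaller `Q` than in the figure is used to keep the example quick.

```python
import math


def solve_qvi(kappa=15.0, lam=2.0, alpha=0.0, theta=0.01, Delta=0.01,
              Q=10, T=25.0, n_steps=5000):
    """Explicit backward Euler with projection for the QVI (4.30).

    Returns (h, depths): h[q] approximates h_q(0) and depths[q-1]
    approximates the optimal depth at time 0 for inventory q.
    """
    dt = T / n_steps
    # Terminal values -q*l(q) with l(q) = theta*q, projected so that
    # h_{q-1} - h_q - Delta/2 <= 0 holds (assumption of this sketch).
    h = [0.0] * (Q + 1)
    for q in range(1, Q + 1):
        h[q] = max(-theta * q * q, h[q - 1] - Delta / 2.0)
    for _ in range(n_steps):
        new = [0.0] * (Q + 1)          # h_0 = 0 at all times
        for q in range(1, Q + 1):
            z = h[q - 1] - h[q]
            d = max(1.0 / kappa - z, 0.0)              # feedback depth
            gain = lam * math.exp(-kappa * d) * (d + z)
            cont = h[q] + dt * (alpha * q + gain)      # continuation value
            # Execute a market order when that is worth more.
            new[q] = max(cont, new[q - 1] - Delta / 2.0)
        h = new
    depths = [max(1.0 / kappa - (h[q - 1] - h[q]), 0.0)
              for q in range(1, Q + 1)]
    return h, depths
```

By construction every depth returned by this scheme respects the lower bound $\max(0, \frac{1}{\kappa} - \frac{\Delta}{2})$ discussed above, which gives a quick sanity check on any implementation.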
[Figure 4.7: two panels. Left: Depth ($\delta$) against Time (secs), with curves
starting from $q = 1$ and an arrow marking increasing $q$. Right: Inventory ($q$)
against Time (secs).]

FIGURE 4.7: Optimal depth and market order execution boundary for
ambiguity neutral agent. Parameter values are $\kappa = 15$, $\lambda = 2$, $\sigma = 0.01$,
$\alpha = 0$, $\ell(q) = \theta q$, $\theta = 0.01$, $\Delta = 0.01$, $Q = 60$, and $T = 25$.

4.6.2 The effects of ambiguity aversion on market order execution

This section investigates the effects of ambiguity on the liquidating agent’s
use of market orders. These effects are observed as a modification to the market
order exercise curve in the (t, q) plane depending on the levels of ambiguity.
The appropriate quasi-variational inequality is obtained from Equation 4.24
in a similar manner as Equation 4.30 was obtained from Equation 4.6. The
feedback form of the optimal controls is the same in each case, but they result
in quantitatively different depths due to the change in the value function. The
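appropriate equation is

$$\max\Big\{\partial_t h_q + \inf_{\eta}\Big\{\eta q + \frac{1}{2\varphi_\alpha}\Big(\frac{\alpha-\eta}{\sigma}\Big)^2\Big\} + \sup_{\delta\ge 0}\,\inf_{g^\lambda}\,\inf_{g^\kappa\in\mathcal{G}}\ \lambda\int_\delta^\infty e^{g^\lambda + g^\kappa(y)}\,F(dy)\,\big(\delta + h_{q-1}(t) - h_q(t)\big)\,\mathbb{1}_{q\neq 0}\ ;\ h_{q-1} - h_q - \frac{\Delta}{2}\Big\} = 0, \qquad (4.31)$$

$$h(T,q) = -q\,\ell(q), \qquad h(t,0) = 0.$$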

The effect of ambiguity on the MO execution boundary can be explained
with similar reasoning to the change in the optimal depths previously
discussed. The most notable feature in Figure 4.8 is the magnitude of the change
for an agent who is ambiguity averse to the midprice drift. The significant
change in MO execution strategy can be intuitively understood again by
interpreting this type of ambiguity as a cumulative inventory penalty. The agent
is strongly encouraged to sell shares very quickly with this type of penalty,
much more so than compared to only the terminal inventory penalty.

[Figure 4.8: three panels, each plotting Inventory ($q$) against Time (secs).
(a) curves for $\varphi_\lambda = 0, 6, 12$; (b) curves for $\varphi_\kappa = 0, 10, 30$;
(c) curves for $\varphi_\alpha = 0, 1, 2$.]

FIGURE 4.8: Effect on market order execution boundary for an agent who is
ambiguity averse to different factors of the reference model. Parameter values
are $\kappa = 15$, $\lambda = 2$, $\sigma = 0.01$, $\alpha = 0$, $\ell(q) = \theta q$, $\theta = 0.01$, $Q = 60$, and $T = 25$.
(a) MO arrival rate, (b) fill probability, and (c) mid price drift.

The relatively small change due to ambiguity on fill probability can be
understood as follows: for large inventory positions, the natural inclination of
the agent is to post small depths because she is in a hurry to liquidate the
inventory position before maturity. But as the agent posts smaller depths, the
change in the execution price distribution must be larger to have a significant
impact on the fill probability and hence also the performance of the strategy.
This type of large change is prevented by the entropic penalty. Essentially,
by naturally wanting to post small depths, the agent has already gained a
significant amount of protection against a misspecified fill probability.
Ambiguity with respect to market order arrival rate lies in a middle ground.
On one hand, the significance of this type of ambiguity is less than that of
the midprice drift because changing the arrival rate does not directly
penalize the holding of inventory. On the other hand, posting smaller depths for
larger inventory positions does not provide a natural protection against a
change in the arrival rate as it did for a change in the distribution of trade
executions.

4.7 Conclusion
We have shown how to incorporate ambiguity aversion into the context of
an optimal liquidation problem and have investigated the impact of ambiguity
with respect to different sources of randomness on the optimal trading
strategy. The primary mathematical procedure which allows for the computation
of the optimal strategy is solving a PDE for the agent's value function.
When the agent is only allowed to employ limit orders and when her ambi-
guity aversion levels satisfy a particular symmetry constraint, the solution to
this PDE is known in closed form. This allows any application of the strat-
egy to be implemented very efficiently by precomputing the optimal trading
strategy.
When the ability to submit market orders is added to the model, we no
longer have closed-form solutions for the optimal trading strategy. In this case,
the PDE must be solved numerically, a task which can become computation-
ally complex when the number of traded assets is increased or when additional
features are added to the model. Other additional features which could also
potentially cause a loss of closed-form solutions are the inclusion of a trade
signal which indicates favorable or unfavorable market conditions, or the pro-
cess of updating the agent’s reference model in real time through a learning
procedure.

Appendix 4A Proofs
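Proof of Proposition 4.2. The minimization in $\eta$ is independent of the
optimization in $\delta$, $g^\lambda$, and $g^\kappa$ and so can be done directly. First-order conditions
imply that $\eta^* = \alpha - \varphi_\alpha\sigma^2 q$, as desired. This value of $\eta^*$ is easily seen to be
unique as it is a quadratic optimization. For the optimization over $\delta$, $g^\lambda$, and
$g^\kappa$, first consider $\varphi_\lambda > \varphi_\kappa$. Then the term to be optimized is

$$\begin{aligned}
G(\delta, g^\lambda, g^\kappa) &= \lambda\int_\delta^\infty e^{g^\lambda + g^\kappa(y)}\,F(dy)\,(\delta + h_{q-1} - h_q) + K_{\varphi_\lambda,\varphi_\kappa}(g^\lambda, g^\kappa) \\
&= \lambda\int_\delta^\infty e^{g^\lambda + g^\kappa(y)}\,F(dy)\,(\delta + h_{q-1} - h_q) \\
&\quad + \frac{1}{\varphi_\lambda}\,\lambda\int_0^\infty\Big[-(e^{g^\lambda} - 1) + g^\lambda e^{g^\lambda + g^\kappa(y)}\Big]F(dy) \\
&\quad + \frac{1}{\varphi_\kappa}\,\lambda\int_0^\infty\Big[-(e^{g^\kappa(y)} - 1)\,e^{g^\lambda} + g^\kappa(y)\,e^{g^\lambda + g^\kappa(y)}\Big]F(dy). \qquad (A.1)
\end{aligned}$$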
The remainder of the proof proceeds as follows:

1. Introduce a Lagrange multiplier $\gamma$ corresponding to the constraint on $g^\kappa(y)$
2. Compute first-order conditions for the unconstrained $g^\kappa(y)$ which
   minimizes the Lagrange modified term
3. Compute the value of $\gamma$
4. Verify that the corresponding $g^{\kappa*}(y)$ provides a minimizer of $G(\delta, g^\lambda, g^\kappa)$
   for all functions $g^\kappa \in \mathcal{G}$
5. Compute first-order conditions for $g^\lambda$
6. Verify that the corresponding $g^{\lambda*}$ provides a minimizer of $G(\delta, g^\lambda, g^{\kappa*})$
7. Compute first-order conditions for $\delta$ subject to the constraint $\delta \ge 0$
8. Verify that the corresponding $\delta^*$ provides a maximizer of $G(\delta, g^{\lambda*}, g^{\kappa*})$
9. Prove existence and uniqueness for the solution $h$


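Parts 1 and 2: Solving for $g^\kappa$: The constraint $\int_0^\infty e^{g^\kappa(y)}F(dy) = 1$ is
handled by introducing a Lagrange multiplier $\gamma$ and then minimizing over
unconstrained $g^\kappa$. The optimization with respect to $g^\kappa$ is handled in a pointwise
fashion by minimizing the integrand with respect to $g^\kappa(y)$ for each value of
$y \in [0, \infty)$. For $y \in (\delta, \infty)$, the quantity to be minimized is

$$\begin{aligned}
&\lambda e^{g^\lambda + g^\kappa(y)}(\delta + h_{q-1} - h_q) + \frac{\lambda}{\varphi_\lambda}\Big(-(e^{g^\lambda} - 1) + g^\lambda e^{g^\lambda + g^\kappa(y)}\Big) \\
&\quad + \frac{\lambda}{\varphi_\kappa}\Big(-(e^{g^\kappa(y)} - 1)\,e^{g^\lambda} + g^\kappa(y)\,e^{g^\lambda + g^\kappa(y)}\Big) + \gamma\big(e^{g^\kappa(y)} - 1\big). \qquad (A.2)
\end{aligned}$$

First-order conditions in $g^\kappa(y)$ give

$$g^\kappa(y) = -\frac{\varphi_\kappa}{\varphi_\lambda}\,g^\lambda - \frac{\gamma\varphi_\kappa}{\lambda}\,e^{-g^\lambda} - \varphi_\kappa(\delta + h_{q-1} - h_q). \qquad (A.3)$$

Similarly, first-order conditions in $g^\kappa(y)$ for $y \in [0, \delta]$ give

$$g^\kappa(y) = -\frac{\varphi_\kappa}{\varphi_\lambda}\,g^\lambda - \frac{\gamma\varphi_\kappa}{\lambda}\,e^{-g^\lambda}. \qquad (A.4)$$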

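Combining Equations A.3 and A.4 gives

$$g^{\kappa*}(y) = -\frac{\varphi_\kappa}{\varphi_\lambda}\,g^\lambda - \frac{\gamma\varphi_\kappa}{\lambda}\,e^{-g^\lambda} - \varphi_\kappa(\delta + h_{q-1} - h_q)\,\mathbb{1}_{y>\delta}. \qquad (A.5)$$

Part 3: Solving for $\gamma$: Substituting this expression into the integral constraint
and performing some computations give an expression for $\gamma$:

$$\gamma = -\frac{\lambda}{\varphi_\lambda}\,g^\lambda e^{g^\lambda} + \frac{\lambda e^{g^\lambda}}{\varphi_\kappa}\,\log\big(1 - e^{-\kappa\delta} + e^{-\varphi_\kappa(\delta + h_{q-1} - h_q)}e^{-\kappa\delta}\big).$$

Substituting this into Equation A.5 gives

$$g^{\kappa*}(y) = -\log\big(1 - e^{-\kappa\delta} + e^{-\varphi_\kappa(\delta + h_{q-1} - h_q)}e^{-\kappa\delta}\big) - \varphi_\kappa(\delta + h_{q-1} - h_q)\,\mathbb{1}_{y>\delta}. \qquad (A.6)$$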
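As a quick numerical sanity check on Equation A.6, the sketch below evaluates the integral constraint in closed form. It assumes that $F$ is an exponential distribution with rate $\kappa$, which is consistent with the factor $e^{-\kappa\delta}$ appearing in the expression for $\gamma$; the function names and the parameter values in the usage note are hypothetical.

```python
import math


def g_kappa_star(y, delta, dh, kappa, phi_kappa):
    """Candidate minimizer (A.6); dh stands for h_{q-1} - h_q."""
    w = math.exp(-kappa * delta)
    A = 1.0 - w + math.exp(-phi_kappa * (delta + dh)) * w
    g = -math.log(A)
    if y > delta:
        g -= phi_kappa * (delta + dh)
    return g


def constraint_integral(delta, dh, kappa, phi_kappa):
    """Integral of exp(g_kappa_star) against F, with F ~ Exp(kappa) assumed.

    g_kappa_star is piecewise constant, so the integral is a weighted
    sum over the two regions y <= delta and y > delta.
    """
    p_above = math.exp(-kappa * delta)           # mass of (delta, inf)
    below = math.exp(g_kappa_star(0.0, delta, dh, kappa, phi_kappa))
    above = math.exp(g_kappa_star(delta + 1.0, delta, dh, kappa, phi_kappa))
    return (1.0 - p_above) * below + p_above * above
```

For example, `constraint_integral(0.05, 0.003, 15.0, 10.0)` should return 1 up to rounding error, confirming that the candidate satisfies the constraint under this assumption on $F$.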

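To prove that the expression for $g^{\kappa*}(y)$ is indeed a minimizer, it is
convenient to introduce some shorthand notation:

$$\Delta h_q = h_{q-1} - h_q, \qquad (A.7)$$
$$A = 1 - e^{-\kappa\delta} + e^{-\varphi_\kappa(\delta + \Delta h_q)}e^{-\kappa\delta}, \qquad (A.8)$$
$$\underline{g} = -\log A, \qquad (A.9)$$
$$\overline{g} = -\log A - \varphi_\kappa(\delta + \Delta h_q). \qquad (A.10)$$

It is important to note that these quantities do not depend on $g^\lambda$. Also note
that $\underline{g}$ and $\overline{g}$ are the two possible values that $g^{\kappa*}(y)$ can take depending
on whether $y \le \delta$ or $y > \delta$. Let $f$ be any other function in $\mathcal{G}$ and define
$k(y) = e^{f(y)} - e^{g^{\kappa*}(y)}$. Then define $f_\epsilon(y) = \log(\epsilon\,k(y) + e^{g^{\kappa*}(y)})$. One can
easily check that $f_\epsilon \in \mathcal{G}$ for all $\epsilon \in [0, 1]$ and that $f_0 = g^{\kappa*}$ and $f_1 = f$. Let
$m(\epsilon) = G(\delta, g^\lambda, f_\epsilon)$. We confirm that $g^{\kappa*}$ is the minimizer by showing that

$$G(\delta, g^\lambda, g^{\kappa*}) = m(0) \le m(1) = G(\delta, g^\lambda, f). \qquad (A.11)$$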

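It is sufficient to show that $m'(0) = 0$ and that $m$ has a non-negative second
derivative for all $\epsilon \in [0, 1]$. Substituting expressions for $f_\epsilon$ and Equation A.6
into $G(\delta, g^\lambda, f_\epsilon)$ gives

$$\begin{aligned}
m(\epsilon) &= \lambda\int_\delta^\infty e^{g^\lambda + f_\epsilon(y)}F(dy)\,(\delta + \Delta h_q) \\
&\quad + \frac{1}{\varphi_\lambda}\,\lambda\int_0^\infty\Big[-(e^{g^\lambda} - 1) + g^\lambda e^{g^\lambda + f_\epsilon(y)}\Big]F(dy) \\
&\quad + \frac{1}{\varphi_\kappa}\,\lambda\int_0^\infty\Big[-(e^{f_\epsilon(y)} - 1)e^{g^\lambda} + f_\epsilon(y)e^{g^\lambda + f_\epsilon(y)}\Big]F(dy) \\
&= \lambda\int_\delta^\infty e^{g^\lambda}\big(e^{\overline{g}} + \epsilon k(y)\big)F(dy)\,(\delta + \Delta h_q) \\
&\quad + \frac{1}{\varphi_\lambda}\,\lambda\int_0^\delta\Big[-(e^{g^\lambda} - 1) + g^\lambda e^{g^\lambda}\big(e^{\underline{g}} + \epsilon k(y)\big)\Big]F(dy) \\
&\quad + \frac{1}{\varphi_\lambda}\,\lambda\int_\delta^\infty\Big[-(e^{g^\lambda} - 1) + g^\lambda e^{g^\lambda}\big(e^{\overline{g}} + \epsilon k(y)\big)\Big]F(dy) \\
&\quad + \frac{1}{\varphi_\kappa}\,\lambda\int_0^\delta\Big[-\big(e^{\underline{g}} + \epsilon k(y) - 1\big)e^{g^\lambda} + \log\big(e^{\underline{g}} + \epsilon k(y)\big)e^{g^\lambda}\big(e^{\underline{g}} + \epsilon k(y)\big)\Big]F(dy) \\
&\quad + \frac{1}{\varphi_\kappa}\,\lambda\int_\delta^\infty\Big[-\big(e^{\overline{g}} + \epsilon k(y) - 1\big)e^{g^\lambda} + \log\big(e^{\overline{g}} + \epsilon k(y)\big)e^{g^\lambda}\big(e^{\overline{g}} + \epsilon k(y)\big)\Big]F(dy),
\end{aligned}$$

and taking a derivative with respect to $\epsilon$ gives

$$\begin{aligned}
m'(\epsilon) &= \lambda\int_\delta^\infty e^{g^\lambda}k(y)\,F(dy)\,(\delta + \Delta h_q) + \frac{1}{\varphi_\lambda}\,\lambda\int_0^\infty g^\lambda e^{g^\lambda}k(y)\,F(dy) \\
&\quad + \frac{1}{\varphi_\kappa}\,\lambda\int_0^\delta\Big[-k(y)e^{g^\lambda} + k(y)e^{g^\lambda} + k(y)\log\big(e^{\underline{g}} + \epsilon k(y)\big)e^{g^\lambda}\Big]F(dy) \\
&\quad + \frac{1}{\varphi_\kappa}\,\lambda\int_\delta^\infty\Big[-k(y)e^{g^\lambda} + k(y)e^{g^\lambda} + k(y)\log\big(e^{\overline{g}} + \epsilon k(y)\big)e^{g^\lambda}\Big]F(dy) \\
&= \lambda\int_\delta^\infty e^{g^\lambda}k(y)\,F(dy)\,(\delta + \Delta h_q) \\
&\quad + \frac{1}{\varphi_\kappa}\,\lambda\int_0^\delta k(y)\log\big(e^{\underline{g}} + \epsilon k(y)\big)e^{g^\lambda}\,F(dy) \\
&\quad + \frac{1}{\varphi_\kappa}\,\lambda\int_\delta^\infty k(y)\log\big(e^{\overline{g}} + \epsilon k(y)\big)e^{g^\lambda}\,F(dy),
\end{aligned}$$

where the term with prefactor $1/\varphi_\lambda$ vanishes because $\int_0^\infty k(y)\,F(dy) = 0$, as both
$f$ and $g^{\kappa*}$ satisfy the integral constraint.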

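Evaluating this expression at $\epsilon = 0$ gives

$$\begin{aligned}
m'(0) &= \lambda\int_\delta^\infty e^{g^\lambda}k(y)\,F(dy)\,(\delta + \Delta h_q) + \frac{1}{\varphi_\kappa}\,\lambda\int_0^\delta k(y)\,\underline{g}\,e^{g^\lambda}F(dy) + \frac{1}{\varphi_\kappa}\,\lambda\int_\delta^\infty k(y)\,\overline{g}\,e^{g^\lambda}F(dy) \\
&= \lambda\int_\delta^\infty e^{g^\lambda}k(y)\,F(dy)\,(\delta + \Delta h_q) - \frac{1}{\varphi_\kappa}\,\lambda\int_0^\delta k(y)\log(A)\,e^{g^\lambda}F(dy) \\
&\quad - \frac{1}{\varphi_\kappa}\,\lambda\int_\delta^\infty k(y)\big(\log(A) + \varphi_\kappa(\delta + \Delta h_q)\big)e^{g^\lambda}F(dy) \\
&= 0
\end{aligned}$$

as expected, again using $\int_0^\infty k(y)\,F(dy) = 0$. Continuing by taking a second
derivative with respect to $\epsilon$:

$$\begin{aligned}
m''(\epsilon) &= \frac{1}{\varphi_\kappa}\,\lambda\int_0^\delta\frac{e^{g^\lambda}k^2(y)}{e^{\underline{g}} + \epsilon k(y)}\,F(dy) + \frac{1}{\varphi_\kappa}\,\lambda\int_\delta^\infty\frac{e^{g^\lambda}k^2(y)}{e^{\overline{g}} + \epsilon k(y)}\,F(dy) \\
&= \frac{1}{\varphi_\kappa}\,\lambda\int_0^\infty\frac{e^{g^\lambda}k^2(y)}{e^{g^{\kappa*}(y)} + \epsilon k(y)}\,F(dy) \\
&= \frac{1}{\varphi_\kappa}\,\lambda\int_0^\infty\frac{e^{g^\lambda}k^2(y)}{e^{f_\epsilon(y)}}\,F(dy).
\end{aligned}$$

This expression is non-negative for all $\epsilon \in [0, 1]$, showing that indeed the
expression for $g^{\kappa*}(y)$ in Equation A.6 is a minimizer. This expression is strictly
positive unless $k \equiv 0$, showing that the inequality in Equation A.11 is strict
unless $f = g^{\kappa*}$, therefore $g^{\kappa*}$ is the unique minimizer.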
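Part 5: First-order conditions for $g^\lambda$: After substituting the expression
A.6 into the term to be minimized (see Equation A.1) and performing some
tedious computations, we must minimize the following with respect to $g^\lambda$:

$$\begin{aligned}
&\lambda e^{g^\lambda}e^{\overline{g}}(\delta + \Delta h_q)e^{-\kappa\delta} \\
&\quad + \frac{\lambda}{\varphi_\lambda}\Big(-(e^{g^\lambda} - 1) + g^\lambda e^{g^\lambda}e^{\overline{g}}\Big)e^{-\kappa\delta} + \frac{\lambda}{\varphi_\kappa}\Big(-(e^{\overline{g}} - 1)e^{g^\lambda} + \overline{g}\,e^{\overline{g}}e^{g^\lambda}\Big)e^{-\kappa\delta} \\
&\quad + \frac{\lambda}{\varphi_\lambda}\Big(-(e^{g^\lambda} - 1) + g^\lambda e^{g^\lambda}e^{\underline{g}}\Big)(1 - e^{-\kappa\delta}) \\
&\quad + \frac{\lambda}{\varphi_\kappa}\Big(-(e^{\underline{g}} - 1)e^{g^\lambda} + \underline{g}\,e^{\underline{g}}e^{g^\lambda}\Big)(1 - e^{-\kappa\delta}). \qquad (A.12)
\end{aligned}$$

Applying first-order conditions in $g^\lambda$ and carrying out some tedious
computations gives the candidate minimizer:

$$g^{\lambda*} = \frac{\varphi_\lambda}{\varphi_\kappa}\log A = \frac{\varphi_\lambda}{\varphi_\kappa}\log\big(1 - e^{-\kappa\delta} + e^{-\varphi_\kappa(\delta + h_{q-1} - h_q)}e^{-\kappa\delta}\big). \qquad (A.13)$$

This is the unique root corresponding to the first-order conditions.
Part 6: Verify that $g^{\lambda*}$ is a minimizer: Taking two derivatives of Equation
A.12 with respect to $g^\lambda$ and cancelling terms gives

$$\frac{g^\lambda e^{g^\lambda}}{\varphi_\lambda} + \frac{e^{g^\lambda}}{\varphi_\lambda} - \frac{\log(A)\,e^{g^\lambda}}{\varphi_\kappa}.$$

When the expression A.13 is substituted above, this becomes $A^{(\varphi_\lambda/\varphi_\kappa)}/\varphi_\lambda$, which
is always positive because $A > 0$. Thus this value of $g^{\lambda*}$ provides a minimizer.
Uniqueness of the root corresponding to first-order conditions (and the fact
that it is the only critical value) implies that $g^{\lambda*}$ is the unique minimizer.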
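The claim in Parts 5 and 6 can be checked numerically with a small sketch. The function `a12` below transcribes Equation A.12 with the two values of $g^{\kappa*}$ from A.9 and A.10; the names `a12` and `g_lambda_star` and the parameter values in the usage note are hypothetical and chosen only for illustration.

```python
import math


def _shorthand(delta, dh, kappa, phi_kap):
    """Return (w, A, g_low, g_high) from (A.8)-(A.10); dh = h_{q-1} - h_q."""
    w = math.exp(-kappa * delta)
    A = 1.0 - w + math.exp(-phi_kap * (delta + dh)) * w
    g_low = -math.log(A)                       # value for y <= delta
    g_high = g_low - phi_kap * (delta + dh)    # value for y > delta
    return w, A, g_low, g_high


def a12(g_lam, delta, dh, lam, kappa, phi_lam, phi_kap):
    """Expression (A.12) as a function of g^lambda."""
    w, _, g_low, g_high = _shorthand(delta, dh, kappa, phi_kap)
    e = math.exp(g_lam)

    def penalty(g):
        return (lam / phi_lam * (-(e - 1.0) + g_lam * e * math.exp(g))
                + lam / phi_kap * (-(math.exp(g) - 1.0) * e
                                   + g * math.exp(g) * e))

    return (lam * e * math.exp(g_high) * (delta + dh) * w
            + penalty(g_high) * w + penalty(g_low) * (1.0 - w))


def g_lambda_star(delta, dh, kappa, phi_lam, phi_kap):
    """Candidate minimizer (A.13)."""
    _, A, _, _ = _shorthand(delta, dh, kappa, phi_kap)
    return phi_lam / phi_kap * math.log(A)
```

With, say, `delta=0.05`, `dh=0.003`, `lam=2`, `kappa=15`, `phi_lam=12` and `phi_kap=10`, a central finite difference of `a12` at `g_lambda_star(...)` should be numerically zero, and nearby values of `g_lam` should give a larger value of `a12`.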
Part 7: Solving for $\delta$: Substituting expressions A.7 through A.10 and A.13
into A.12, after some tedious computations we must maximize the following
expression over $\delta$:

$$\frac{\lambda}{\varphi_\lambda}\Big(1 - \exp\Big(\frac{\varphi_\lambda}{\varphi_\kappa}\log\big(1 - e^{-\kappa\delta} + e^{-\varphi_\kappa(\delta + h_{q-1} - h_q)}e^{-\kappa\delta}\big)\Big)\Big).$$

Maximizing this term is equivalent to minimizing

$$\exp\Big(\frac{\varphi_\lambda}{\varphi_\kappa}\log\big(1 - e^{-\kappa\delta} + e^{-\varphi_\kappa(\delta + h_{q-1} - h_q)}e^{-\kappa\delta}\big)\Big),$$

which is equivalent to minimizing

$$1 - e^{-\kappa\delta} + e^{-\varphi_\kappa(\delta + h_{q-1} - h_q)}e^{-\kappa\delta}. \qquad (A.14)$$

Computing first-order conditions for $\delta$ gives

$$\delta^* = \frac{1}{\varphi_\kappa}\log\Big(1 + \frac{\varphi_\kappa}{\kappa}\Big) - h_{q-1} + h_q. \qquad (A.15)$$

If this value is positive, we check that it is a minimizer of Equation A.14
by taking a second derivative. If it is nonpositive, we show that the first
derivative of Equation A.14 is non-negative for all $\delta \ge 0$, meaning that the
desired value of $\delta^*$ is 0.
Part 8: Verify that $\delta^*$ is a minimizer of Equation A.14: Suppose the value
given by Equation A.15 is positive. Taking two derivatives of Equation A.14
with respect to $\delta$ gives

$$-\kappa^2 e^{-\kappa\delta} + (\varphi_\kappa + \kappa)^2 e^{-\varphi_\kappa(\delta + h_{q-1} - h_q)}e^{-\kappa\delta}.$$

Substituting Equation A.15 into this expression gives

$$\kappa\varphi_\kappa e^{-\kappa\delta} > 0,$$

and so the value in Equation A.15 minimizes Equation A.14. Now suppose
the value in Equation A.15 is nonpositive. This means the following inequality
holds:

$$e^{-\varphi_\kappa(h_{q-1} - h_q)} \le \frac{\kappa}{\kappa + \varphi_\kappa}.$$

The first derivative of Equation A.14 with respect to $\delta$ is

$$\big(\kappa - (\kappa + \varphi_\kappa)e^{-\varphi_\kappa(\delta + h_{q-1} - h_q)}\big)e^{-\kappa\delta},$$

and the preceding inequality implies that this is non-negative for all $\delta \ge 0$,
implying that $\delta^* = 0$ is the minimizer of Equation A.14. Thus the value of $\delta$
which maximizes the original term of interest is

$$\delta^* = \Big(\frac{1}{\varphi_\kappa}\log\Big(1 + \frac{\varphi_\kappa}{\kappa}\Big) - h_{q-1} + h_q\Big)_+,$$

as desired. The case of $\varphi_\kappa > \varphi_\lambda$ is essentially identical.
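The closed-form depth can be compared against a brute-force search over Equation A.14. The following is a minimal sketch, with hypothetical function names and a hypothetical grid resolution; it covers both the interior case and the case where the constraint $\delta \ge 0$ binds.

```python
import math


def a14(delta, dh, kappa, phi_kappa):
    """Expression (A.14); dh stands for h_{q-1} - h_q."""
    return (1.0 - math.exp(-kappa * delta)
            + math.exp(-phi_kappa * (delta + dh)) * math.exp(-kappa * delta))


def delta_star(dh, kappa, phi_kappa):
    """Closed-form minimizer: the positive part of (A.15)."""
    return max(math.log(1.0 + phi_kappa / kappa) / phi_kappa - dh, 0.0)


def grid_argmin(dh, kappa, phi_kappa, step=1e-4, upper=1.0):
    """Brute-force minimizer of (A.14) on a uniform grid over [0, upper]."""
    n = int(round(upper / step))
    grid = (i * step for i in range(n + 1))
    return min(grid, key=lambda d: a14(d, dh, kappa, phi_kappa))
```

For instance, with `kappa=15` and `phi_kappa=10`, the two functions should agree to within one grid step for `dh=0.005` (interior minimum), and both should return 0 for `dh=0.1` (binding constraint).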


Part 9: Existence and uniqueness of $h$: Begin by substituting the optimal
feedback controls, $\eta^*$, $g^{\lambda*}$, and $g^{\kappa*}(y)$ into Equation 4.24. This results in

$$\partial_t h_q + \alpha q - \frac{1}{2}\varphi_\alpha\sigma^2 q^2 + \sup_{\delta\ge 0}\frac{\lambda}{\varphi_\lambda}\Big(1 - \exp\Big(\frac{\varphi_\lambda}{\varphi_\kappa}\log\big(1 - e^{-\kappa\delta} + e^{-\kappa\delta - \varphi_\kappa(\delta + h_{q-1} - h_q)}\big)\Big)\Big)\mathbb{1}_{q\neq 0} = 0,$$
$$h_q(T) = -q\,\ell(q). \qquad (A.16)$$

This is a system of ODEs of the form $\partial_t h = \mathcal{F}(h)$. To show existence and
uniqueness of the solution to this equation, the function $\mathcal{F}$ is shown to be
bounded and globally Lipschitz. It suffices to show that the function $f$ is
bounded and globally Lipschitz, where $f$ is given by

$$f(x, y) = \sup_{\delta\ge 0}\frac{\lambda}{\varphi_\lambda}\Big(1 - \exp\Big(\frac{\varphi_\lambda}{\varphi_\kappa}\log\big(1 - e^{-\kappa\delta} + e^{-\kappa\delta - \varphi_\kappa(\delta + x - y)}\big)\Big)\Big).$$

Boundedness and the global Lipschitz property of $f$ implies the same for
$\mathcal{F}$, and so existence and uniqueness follows from the Picard–Lindelöf
theorem. The global Lipschitz property is a result of showing that all directional
derivatives of $f$ exist and are bounded for all $(x, y) \in \mathbb{R}^2$.
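The supremum is attained at $\delta^* = \big(\frac{1}{\varphi_\kappa}\log(1 + \frac{\varphi_\kappa}{\kappa}) - x + y\big)_+$. Thus two
separate domains for $f$ must be considered: $\frac{1}{\varphi_\kappa}\log(1 + \frac{\varphi_\kappa}{\kappa}) > x - y$ and
$\frac{1}{\varphi_\kappa}\log(1 + \frac{\varphi_\kappa}{\kappa}) \le x - y$. First consider $\frac{1}{\varphi_\kappa}\log(1 + \frac{\varphi_\kappa}{\kappa}) > x - y$ so that
$\delta^* = \frac{1}{\varphi_\kappa}\log(1 + \frac{\varphi_\kappa}{\kappa}) - x + y$. Substituting this into the expression for $f$ yields:

$$\begin{aligned}
f(x, y) &= \frac{\lambda}{\varphi_\lambda}\Big(1 - \exp\Big(\frac{\varphi_\lambda}{\varphi_\kappa}\log\Big(1 - e^{-\frac{\kappa}{\varphi_\kappa}\log(1 + \frac{\varphi_\kappa}{\kappa}) + \kappa(x - y)}\big(1 - e^{-\log(1 + \frac{\varphi_\kappa}{\kappa})}\big)\Big)\Big)\Big) \\
&= \frac{\lambda}{\varphi_\lambda}\Big(1 - \exp\Big(\frac{\varphi_\lambda}{\varphi_\kappa}\log\big(1 - Be^{\kappa(x - y)}\big)\Big)\Big),
\end{aligned}$$

where $B = \big(\frac{\kappa}{\varphi_\kappa + \kappa}\big)^{\kappa/\varphi_\kappa}\frac{\varphi_\kappa}{\varphi_\kappa + \kappa} > 0$. Letting $z = Be^{\kappa(x - y)}$, the inequality
$\frac{1}{\varphi_\kappa}\log(1 + \frac{\varphi_\kappa}{\kappa}) > x - y$ implies

$$\begin{aligned}
0 < z &< \Big(\frac{\kappa}{\varphi_\kappa + \kappa}\Big)^{\kappa/\varphi_\kappa}\frac{\varphi_\kappa}{\varphi_\kappa + \kappa}\,e^{\frac{\kappa}{\varphi_\kappa}\log(1 + \frac{\varphi_\kappa}{\kappa})} \\
&= \Big(\frac{\kappa}{\varphi_\kappa + \kappa}\Big)^{\kappa/\varphi_\kappa}\frac{\varphi_\kappa}{\varphi_\kappa + \kappa}\Big(\frac{\varphi_\kappa + \kappa}{\kappa}\Big)^{\kappa/\varphi_\kappa} = \frac{\varphi_\kappa}{\varphi_\kappa + \kappa} < 1.
\end{aligned}$$

Since $z$ is positive we have

$$f(x, y) = \frac{\lambda}{\varphi_\lambda}\Big(1 - \exp\Big(\frac{\varphi_\lambda}{\varphi_\kappa}\log(1 - z)\Big)\Big) < \frac{\lambda}{\varphi_\lambda}.$$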

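Taking partial derivatives of $f$ in this domain gives

$$\begin{aligned}
\partial_x f(x, y) = -\partial_y f(x, y) &= \frac{\lambda}{\varphi_\kappa}\,e^{\frac{\varphi_\lambda}{\varphi_\kappa}\log(1 - Be^{\kappa(x - y)})}\,\frac{B\kappa e^{\kappa(x - y)}}{1 - Be^{\kappa(x - y)}} \\
&= \frac{\lambda}{\varphi_\kappa}\,e^{\frac{\varphi_\lambda}{\varphi_\kappa}\log(1 - z)}\,\frac{\kappa z}{1 - z}. \qquad (A.17)
\end{aligned}$$

This expression is non-negative and continuous for $0 \le z \le \frac{\varphi_\kappa}{\varphi_\kappa + \kappa}$, and
therefore achieves a finite maximum somewhere on that interval. Thus $\partial_x f$ and $\partial_y f$
are bounded in this domain, and so directional derivatives exist and are also
bounded everywhere in the interior of the domain. On the boundary,
directional derivatives exist and are bounded if the direction is toward the interior
of the domain.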
Now consider $\frac{1}{\varphi_\kappa}\log(1 + \frac{\varphi_\kappa}{\kappa}) \le x - y$, which implies $\delta^* = 0$. The expression
for $f(x, y)$ in this domain is

$$f(x, y) = \frac{\lambda}{\varphi_\lambda}\big(1 - e^{-\varphi_\lambda(x - y)}\big),$$

which lies between $\frac{\lambda}{\varphi_\lambda}\big(1 - e^{-\frac{\varphi_\lambda}{\varphi_\kappa}\log(1 + \frac{\varphi_\kappa}{\kappa})}\big)$ and $\frac{\lambda}{\varphi_\lambda}$ and is therefore
bounded. Partial derivatives of $f$ are given by

$$\partial_x f(x, y) = -\partial_y f(x, y) = \lambda e^{-\varphi_\lambda(x - y)}. \qquad (A.18)$$

In this domain, the derivatives $\partial_x f$ and $\partial_y f$ are bounded by $\lambda e^{-\frac{\varphi_\lambda}{\varphi_\kappa}\log(1 + \frac{\varphi_\kappa}{\kappa})}$.
So similarly to the first domain, directional derivatives exist and are bounded
in the interior. On the boundary, they exist and are bounded in the direction
toward the interior of the domain. Thus we have existence and boundedness on
the boundary toward either of the two domains. The directional derivative on
the boundary is zero when the direction is parallel to the boundary. Existence
and boundedness of directional derivatives for all $(x, y) \in \mathbb{R}^2$ allow us to show
the Lipschitz condition easily:

$$|f(x_2, y_2) - f(x_1, y_1)| = \Big|\int_C \nabla f(x, y)\cdot dr\Big| \le \int_C |\nabla f(x, y)|\,ds \le \int_C A\,ds = A\,|(x_2, y_2) - (x_1, y_1)|,$$

where $C$ is the curve that connects $(x_1, y_1)$ to $(x_2, y_2)$ in a straight line and
$A$ is a uniform bound on the gradient of $f$. This proves that there exists a
unique solution $h$ to Equation 4.24.
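The two-domain description of $f$ and the ODE system A.16 can be illustrated with a short sketch. It is only an illustration under stated assumptions: the ambiguity levels `PHI_LAM`, `PHI_KAP` and `phi_alpha` are hypothetical values with $\varphi_\lambda > \varphi_\kappa$, the penalty is $\ell(q) = \theta q$ as in the figures, and the time stepping is a plain explicit Euler scheme rather than any method described in the text.

```python
import math

LAM, KAPPA = 2.0, 15.0          # lambda and kappa as in Figures 4.7 and 4.8
PHI_LAM, PHI_KAP = 12.0, 10.0   # hypothetical ambiguity levels
THR = math.log(1.0 + PHI_KAP / KAPPA) / PHI_KAP
B = (KAPPA / (PHI_KAP + KAPPA)) ** (KAPPA / PHI_KAP) * PHI_KAP / (PHI_KAP + KAPPA)


def f_direct(x, y):
    """f(x, y) evaluated at the maximizer delta* = (THR - x + y)_+."""
    d = max(THR - x + y, 0.0)
    A = 1.0 - math.exp(-KAPPA * d) + math.exp(-KAPPA * d - PHI_KAP * (d + x - y))
    return LAM / PHI_LAM * (1.0 - math.exp(PHI_LAM / PHI_KAP * math.log(A)))


def f_piecewise(x, y):
    """The same function written separately on the two domains."""
    z = x - y
    if z < THR:
        return LAM / PHI_LAM * (1.0 - (1.0 - B * math.exp(KAPPA * z)) ** (PHI_LAM / PHI_KAP))
    return LAM / PHI_LAM * (1.0 - math.exp(-PHI_LAM * z))


def solve_h(Q=5, T=25.0, n_steps=5000, alpha=0.0, sigma=0.01,
            phi_alpha=1.0, theta=0.01):
    """Explicit Euler, backward from h_q(T) = -theta*q^2, for system (A.16)."""
    dt = T / n_steps
    h = [-theta * q * q for q in range(Q + 1)]
    for _ in range(n_steps):
        h = [0.0] + [
            h[q] + dt * (alpha * q - 0.5 * phi_alpha * sigma ** 2 * q * q
                         + f_direct(h[q - 1], h[q]))
            for q in range(1, Q + 1)
        ]
    return h
```

Because $f$ is bounded and globally Lipschitz, a simple scheme of this kind is well behaved for a sufficiently small time step; `f_direct` and `f_piecewise` should agree on both domains.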

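Proof of Theorem 4.1. Let $h$ be the solution to Equation 4.24 with
terminal conditions $h_q(T) = -q\,\ell(q)$, and define a candidate value function by
$\hat{H}(t, x, q, S) = x + qS + h_q(t)$. From Ito's lemma we have

$$\begin{aligned}
\hat{H}(T, X^\delta_{T^-}, S_{T^-}, q^\delta_{T^-}) &= \hat{H}(t, x, S, q) + \int_t^T \partial_t h_{q_s}(s)\,ds + \alpha\int_t^T q_s\,ds + \sigma\int_t^T q_s\,dW_s \\
&\quad + \int_t^T\int_{\delta_s}^\infty\big(\delta_s + h_{q_{s^-}-1}(s) - h_{q_{s^-}}(s)\big)\,\mu(dy, ds).
\end{aligned}$$

Note that for any admissible measure $\mathbb{Q}(\eta, g)$ and admissible control $\delta$, we
have

$$\begin{aligned}
\mathbb{E}^{\mathbb{Q}(\eta,g)}\Big[\int_0^T\int_{\delta_t}^\infty(\delta_t)^2\,\nu_{\mathbb{Q}(\eta,g)}(dy, dt)\Big] &= \mathbb{E}^{\mathbb{Q}(\eta,g)}\Big[\int_0^T\int_{\delta_t}^\infty(\delta_t)^2 e^{g_t(y)}\,\nu_{\mathbb{P}}(dy, dt)\Big] \\
&\le \mathbb{E}^{\mathbb{Q}(\eta,g)}\Big[\int_0^T\int_{\delta_t}^\infty y^2 e^{g_t(y)}\,\nu_{\mathbb{P}}(dy, dt)\Big] \\
&\le \mathbb{E}^{\mathbb{Q}(\eta,g)}\Big[\int_0^T\int_0^\infty y^2 e^{g_t(y)}\,\nu_{\mathbb{P}}(dy, dt)\Big] < \infty.
\end{aligned}$$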

The remainder of the proof proceeds as follows:

1. We show that the feedback forms of $\delta^*$, $\eta^*$, $g^{\lambda*}$, and $g^{\kappa*}$ are admissible.
2. For an arbitrary admissible $\delta = (\delta_t)_{0\le t\le T}$, we define an admissible
   response measure indexed by $M$ which is denoted as $\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta), g_M(\delta))$.
3. We show that $M$ can be taken sufficiently large, independent of $t$ and $\delta$,
   such that the response measure is pointwise (in $t$) $\epsilon$-optimal.
4. We show that the candidate function $\hat{H}$ satisfies $\hat{H}(t, x, S, q) \ge H(t, x, S, q)$.
5. We show that the candidate function $\hat{H}$ satisfies $\hat{H}(t, x, S, q) \le H(t, x, S, q)$.

Step 1: $\delta^*$, $\eta^*$, $g^{\lambda*}$, and $g^{\kappa*}$ are admissible: Since $q$ is bounded between 0 and
$Q$, $\eta^*$ is bounded and therefore admissible. The existence and uniqueness of a
classical solution for $h$ means that it achieves a finite maximum and minimum
for some $q \in \{0, \dots, Q\}$ and $t \in [0, T]$. Thus, from the feedback expressions
for $g^{\lambda*}$ and $g^{\kappa*}$, we see that they are also bounded and therefore admissible.
Admissibility of $\delta^*$ is clear.
Step 2: Defining admissible response measure: Let $\delta = (\delta_t)_{0\le t\le T}$ be an
arbitrary admissible control and define pointwise minimizing response
controls by

$$\eta_t(\delta) = \alpha - \varphi_\alpha\sigma^2 q_t,$$
$$g^\lambda_t(\delta) = \frac{\varphi_\lambda}{\varphi_\kappa}\log\big(1 - e^{-\kappa\delta_t}(1 - e^{-\varphi_\kappa(\delta_t + h_{q_t-1}(t) - h_{q_t}(t))})\big),$$
$$g^\kappa_t(y; \delta) = -\log\big(1 - e^{-\kappa\delta_t}(1 - e^{-\varphi_\kappa(\delta_t + h_{q_t-1}(t) - h_{q_t}(t))})\big) - \varphi_\kappa\big(\delta_t + h_{q_t-1}(t) - h_{q_t}(t)\big)\mathbb{1}_{y\ge\delta_t}.$$

These processes each have the same form as the pointwise minimizers found
in Proposition 4.2, and so for a given $\delta = (\delta_t)_{0\le t\le T}$, these controls achieve the
pointwise infimum in Equation 4.24. Since $h$ is a classical solution to Equation
4.24, it is bounded for $t \in [0, T]$ and $0 \le q \le Q$. Using the boundedness of $h$, we
see that $g^\lambda_t(0)$ is finite and bounded with respect to $t$, and $\lim_{\delta\to\infty} g^\lambda_t(\delta) = 0$,
therefore $g^\lambda_t(\delta)$ is bounded. It is also clear that $\eta_t(\delta)$ is bounded. However,
$g^\kappa_t(y; \delta)$ is only bounded from above, so it is possible that the pair $(\eta_t(\delta), g_t(\delta))$
does not define an admissible measure as per the definition in Equation 4.15.
In order to proceed, we use a modification of $g^\kappa_t$:

$$g^\kappa_{t,M}(y; \delta) = -\log\big(1 - e^{-\kappa\delta_t}(1 - e^{-\varphi_\kappa(\delta_t + h_{q_t-1}(t) - h_{q_t}(t))})\big) - \varphi_\kappa\min\big(\delta_t + h_{q_t-1}(t) - h_{q_t}(t), M\big)\mathbb{1}_{y\ge\delta_t}.$$

Since $g^\kappa_{t,M}$ is bounded, letting $g_{t,M}(y; \delta) = g^\lambda_t(\delta) + g^\kappa_{t,M}(y; \delta)$, the pair
$(\eta_t(\delta), g_M(\delta))$ does define an admissible measure $\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta), g_M(\delta))$. Note
that for a fixed $t$ and $\delta_t$, $g^\kappa_{t,M}(y; \delta) \to g^\kappa_t(y; \delta)$ as $M \to \infty$ pointwise in $y$ and
in $L^1(F(dy))$.
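Step 3: Showing pointwise $\epsilon$-optimality: As in the proof of Proposition 4.2,
consider the functional

$$\begin{aligned}
G(t, \delta, g^\lambda, g^\kappa) &= \lambda\int_\delta^\infty e^{g^\lambda + g^\kappa(y)}F(dy)\,\big(\delta + h_{q-1}(t) - h_q(t)\big) \\
&\quad + K_{\varphi_\lambda,\varphi_\kappa}(g^\lambda, g^\kappa)\mathbb{1}_{\varphi_\lambda>\varphi_\kappa} + K_{\varphi_\kappa,\varphi_\lambda}(g^\kappa, g^\lambda)\mathbb{1}_{\varphi_\kappa>\varphi_\lambda}.
\end{aligned}$$

We now show

$$\lim_{M\to\infty} G\big(t, \delta_t, g^\lambda_t(\delta), g^\kappa_{t,M}(\cdot; \delta)\big) = G\big(t, \delta_t, g^\lambda_t(\delta), g^\kappa_t(\cdot; \delta)\big)$$

uniformly in $t$ and $\delta$. Consider the first term only, and compute the difference
when evaluated at both $g^\kappa_{t,M}(\cdot; \delta_t)$ and $g^\kappa_t(\cdot; \delta_t)$, which we denote by

$$\begin{aligned}
J(t, \delta, M) &= \lambda e^{g^\lambda_t(\delta)}\,\big|\delta_t + h_{q-1}(t) - h_q(t)\big|\,\Big|\int_{\delta_t}^\infty e^{g^\kappa_{t,M}(y;\delta)}F(dy) - \int_{\delta_t}^\infty e^{g^\kappa_t(y;\delta)}F(dy)\Big| \\
&= \lambda e^{(1 - \frac{\varphi_\kappa}{\varphi_\lambda})g^\lambda_t(\delta)}\,\big|\delta_t + h_{q-1}(t) - h_q(t)\big|\,e^{-\kappa\delta_t}\,\Big|e^{-\varphi_\kappa\min(\delta_t + h_{q-1}(t) - h_q(t),\,M)} - e^{-\varphi_\kappa(\delta_t + h_{q-1}(t) - h_q(t))}\Big| \\
&= \lambda e^{(1 - \frac{\varphi_\kappa}{\varphi_\lambda})g^\lambda_t(\delta)}\,\big|\delta_t + h_{q-1}(t) - h_q(t)\big|\,e^{-\kappa\delta_t}e^{-\varphi_\kappa M}\,\Big|1 - e^{-\varphi_\kappa(\delta_t + h_{q-1}(t) - h_q(t) - M)}\Big|\,\mathbb{1}_{\delta_t + h_{q-1}(t) - h_q(t)\ge M} \\
&\le \lambda e^{(1 - \frac{\varphi_\kappa}{\varphi_\lambda})g^\lambda_t(\delta)}\,\big|\delta_t + h_{q-1}(t) - h_q(t)\big|\,e^{-\kappa\delta_t}e^{-\varphi_\kappa M}.
\end{aligned}$$

As previously noted, both $h$ and $g^\lambda$ are uniformly bounded, say by $C$ and $D$
respectively, so clearly $J(t, \delta, M)$ is bounded. For an arbitrary $\epsilon' > 0$, we may
choose $M$ sufficiently large such that

$$J(t, \delta, M) \le \lambda e^{|1 - \frac{\varphi_\kappa}{\varphi_\lambda}|D}\,(\delta_t + 2C)\,e^{-\kappa\delta_t}e^{-\varphi_\kappa M} < \epsilon'$$

for all $\delta_t \ge 0$.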

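Showing uniform convergence of $K_{\varphi_\lambda,\varphi_\kappa}(g^\lambda_t(\delta), g^\kappa_{t,M}(\cdot; \delta))\mathbb{1}_{\varphi_\lambda>\varphi_\kappa}$ and
$K_{\varphi_\kappa,\varphi_\lambda}(g^\kappa_{t,M}(\cdot; \delta), g^\lambda_t(\delta))\mathbb{1}_{\varphi_\kappa>\varphi_\lambda}$ is essentially the same and so the details are omitted.
Let $\epsilon > 0$ be arbitrary and let $M$ be sufficiently large (chosen independently
of $t$ and $\delta$) so that

$$0 < G\big(t, \delta_t, g^\lambda_t(\delta), g^\kappa_{t,M}(\cdot; \delta)\big) - G\big(t, \delta_t, g^\lambda_t(\delta), g^\kappa_t(\cdot; \delta)\big) < \epsilon.$$

Then since $\delta$ is arbitrary and $h$ satisfies Equation 4.24, the following inequality
holds almost surely for every $t$:

$$\begin{aligned}
&\partial_t h_{q_t} + \eta_t(\delta)\,q_t + \frac{1}{2\varphi_\alpha}\Big(\frac{\alpha - \eta_t(\delta)}{\sigma}\Big)^2 \\
&\quad + \lambda\int_{\delta_t}^\infty e^{g^\lambda_t(\delta) + g^\kappa_{t,M}(y;\delta)}F(dy)\,\big(\delta_t + h_{q_t-1}(t) - h_{q_t}(t)\big) \\
&\quad + K_{\varphi_\lambda,\varphi_\kappa}\big(g^\lambda_t(\delta), g^\kappa_{t,M}(\cdot; \delta)\big)\mathbb{1}_{\varphi_\lambda\ge\varphi_\kappa} + K_{\varphi_\kappa,\varphi_\lambda}\big(g^\kappa_{t,M}(\cdot; \delta), g^\lambda_t(\delta)\big)\mathbb{1}_{\varphi_\lambda<\varphi_\kappa} < \epsilon. \qquad (A.19)
\end{aligned}$$

Thus the measure $\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta), g_M(\delta))$ is pointwise (in $t$) $\epsilon$-optimal, uniformly
in $\delta$.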
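Step 4: Showing $\hat{H}(t, x, S, q) \ge H(t, x, S, q)$: Taking an expectation of
$\hat{H}(T, X^\delta_{T^-}, S_{T^-}, q^\delta_{T^-})$ in the measure $\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta), g_M(\delta))$, and using Equation
A.19, gives

$$\begin{aligned}
&\mathbb{E}^{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta),g_M(\delta))}_{t,x,q,S}\Big[\hat{H}(T, X^\delta_{T^-}, S_{T^-}, q^\delta_{T^-})\Big] \\
&\quad = \hat{H}(t, x, S, q) + \mathbb{E}^{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta),g_M(\delta))}_{t,x,q,S}\Big[\int_t^T \partial_t h_{q_s}(s)\,ds + \alpha\int_t^T q_s\,ds + \sigma\int_t^T q_s\,dW_s \\
&\qquad + \int_t^T\int_{\delta_s}^\infty\big(\delta_s + h_{q_s-1}(s) - h_{q_s}(s)\big)\,\nu_{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta),g_M(\delta))}(dy, ds)\Big] \\
&\quad \le \hat{H}(t, x, S, q) + \epsilon(T - t) + \mathbb{E}^{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta),g_M(\delta))}_{t,x,q,S}\Big[-\int_t^T\frac{1}{2\varphi_\alpha}\Big(\frac{\alpha - \eta_s(\delta)}{\sigma}\Big)^2 ds \\
&\qquad - \int_t^T\Big(K_{\varphi_\lambda,\varphi_\kappa}\big(g^\lambda_s(\delta), g^\kappa_{s,M}(\cdot; \delta)\big)\mathbb{1}_{\varphi_\lambda\ge\varphi_\kappa} + K_{\varphi_\kappa,\varphi_\lambda}\big(g^\kappa_{s,M}(\cdot; \delta), g^\lambda_s(\delta)\big)\mathbb{1}_{\varphi_\lambda<\varphi_\kappa}\Big)ds\Big].
\end{aligned}$$

Therefore, the candidate function satisfies

$$\begin{aligned}
\hat{H}(t, x, S, q) + \epsilon(T - t) &\ge \mathbb{E}^{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta),g_M(\delta))}_{t,x,q,S}\Big[\hat{H}(T, X^\delta_{T^-}, S_{T^-}, q^\delta_{T^-}) + \int_t^T\frac{1}{2\varphi_\alpha}\Big(\frac{\alpha - \eta_s(\delta)}{\sigma}\Big)^2 ds \\
&\qquad + \int_t^T\Big(K_{\varphi_\lambda,\varphi_\kappa}\big(g^\lambda_s(\delta), g^\kappa_{s,M}(\cdot; \delta)\big)\mathbb{1}_{\varphi_\lambda\ge\varphi_\kappa} + K_{\varphi_\kappa,\varphi_\lambda}\big(g^\kappa_{s,M}(\cdot; \delta), g^\lambda_s(\delta)\big)\mathbb{1}_{\varphi_\lambda<\varphi_\kappa}\Big)ds\Big] \\
&= \mathbb{E}^{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta),g_M(\delta))}_{t,x,q,S}\Big[\hat{H}(T, X^\delta_T, S_T, q^\delta_T)\Big] + \mathcal{H}_{t,T}\big(\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta), g_M(\delta))\,|\,\mathbb{P}\big) \\
&= \mathbb{E}^{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta),g_M(\delta))}_{t,x,q,S}\Big[X^\delta_T + q^\delta_T\big(S_T - \ell(q^\delta_T)\big)\Big] + \mathcal{H}_{t,T}\big(\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta), g_M(\delta))\,|\,\mathbb{P}\big).
\end{aligned}$$

Since this holds for one particular choice of admissible measure
$\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta(\delta), g_M(\delta))$, we have

$$\hat{H}(t, x, S, q) + \epsilon(T - t) \ge \inf_{\mathbb{Q}\in\mathcal{Q}}\mathbb{E}^{\mathbb{Q}}_{t,x,q,S}\Big[X^\delta_T + q^\delta_T\big(S_T - \ell(q^\delta_T)\big) + \mathcal{H}_{t,T}(\mathbb{Q}\,|\,\mathbb{P})\Big].$$

This inequality holds for the arbitrarily chosen control $\delta$, therefore

$$\hat{H}(t, x, S, q) + \epsilon(T - t) \ge \sup_{(\delta_s)_{t\le s\le T}\in\mathcal{A}}\ \inf_{\mathbb{Q}\in\mathcal{Q}}\mathbb{E}^{\mathbb{Q}}_{t,x,q,S}\Big[X^\delta_T + q^\delta_T\big(S_T - \ell(q^\delta_T)\big) + \mathcal{H}_{t,T}(\mathbb{Q}\,|\,\mathbb{P})\Big] = H(t, x, S, q),$$

and letting $\epsilon \to 0$ we finally obtain

$$\hat{H}(t, x, S, q) \ge H(t, x, S, q). \qquad (A.20)$$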

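Step 5: Showing $\hat{H}(t, x, S, q) \le H(t, x, S, q)$: Now, let $\delta^* = (\delta^*_t)_{0\le t\le T}$ be the
control process defined in the statement of the theorem, and let $\eta_t$, $g^\lambda_t$, and
$g^\kappa_t(y)$ be arbitrary such that they induce an admissible measure $\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g) \in \mathcal{Q}^{\alpha,\lambda,\kappa}$.
Then from Ito's lemma and the fact that $h$ satisfies Equation 4.24

$$\begin{aligned}
&\mathbb{E}^{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta,g)}_{t,x,q,S}\Big[\hat{H}(T, X^{\delta^*}_{T^-}, S_{T^-}, q^{\delta^*}_{T^-})\Big] \\
&\quad = \hat{H}(t, x, S, q) + \mathbb{E}^{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta,g)}_{t,x,q,S}\Big[\int_t^T \partial_t h_{q_s}(s)\,ds + \alpha\int_t^T q^{\delta^*}_s\,ds + \sigma\int_t^T q^{\delta^*}_s\,dW_s \\
&\qquad + \int_t^T\int_{\delta^*_s}^\infty\big(\delta^*_s + h_{q_s-1}(s) - h_{q_s}(s)\big)\,\nu_{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta,g)}(dy, ds)\Big] \\
&\quad \ge \hat{H}(t, x, S, q) + \mathbb{E}^{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta,g)}_{t,x,q,S}\Big[-\int_t^T\frac{1}{2\varphi_\alpha}\Big(\frac{\alpha - \eta_s}{\sigma}\Big)^2 ds \\
&\qquad - \int_t^T\Big(K_{\varphi_\lambda,\varphi_\kappa}(g^\lambda_s, g^\kappa_s)\mathbb{1}_{\varphi_\lambda\ge\varphi_\kappa} + K_{\varphi_\kappa,\varphi_\lambda}(g^\kappa_s, g^\lambda_s)\mathbb{1}_{\varphi_\lambda<\varphi_\kappa}\Big)ds\Big].
\end{aligned}$$

And so the candidate function satisfies

$$\begin{aligned}
\hat{H}(t, x, S, q) &\le \mathbb{E}^{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta,g)}_{t,x,q,S}\Big[\hat{H}(T, X^{\delta^*}_{T^-}, S_{T^-}, q^{\delta^*}_{T^-}) + \int_t^T\frac{1}{2\varphi_\alpha}\Big(\frac{\alpha - \eta_s}{\sigma}\Big)^2 ds \\
&\qquad + \int_t^T\Big(K_{\varphi_\lambda,\varphi_\kappa}(g^\lambda_s, g^\kappa_s)\mathbb{1}_{\varphi_\lambda\ge\varphi_\kappa} + K_{\varphi_\kappa,\varphi_\lambda}(g^\kappa_s, g^\lambda_s)\mathbb{1}_{\varphi_\lambda<\varphi_\kappa}\Big)ds\Big] \\
&= \mathbb{E}^{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta,g)}_{t,x,q,S}\Big[\hat{H}(T, X^{\delta^*}_T, S_T, q^{\delta^*}_T)\Big] + \mathcal{H}_{t,T}\big(\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g)\,|\,\mathbb{P}\big) \\
&= \mathbb{E}^{\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta,g)}_{t,x,q,S}\Big[X^{\delta^*}_T + q^{\delta^*}_T\big(S_T - \ell(q^{\delta^*}_T)\big)\Big] + \mathcal{H}_{t,T}\big(\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g)\,|\,\mathbb{P}\big).
\end{aligned}$$

Since this holds for any arbitrary admissible measure $\mathbb{Q}^{\alpha,\lambda,\kappa}(\eta, g)$, we have

$$\hat{H}(t, x, S, q) \le \inf_{\mathbb{Q}\in\mathcal{Q}^{\alpha,\lambda,\kappa}}\mathbb{E}^{\mathbb{Q}}_{t,x,q,S}\Big[X^{\delta^*}_T + q^{\delta^*}_T\big(S_T - \ell(q^{\delta^*}_T)\big) + \mathcal{H}_{t,T}(\mathbb{Q}\,|\,\mathbb{P})\Big].$$

Therefore,

$$\hat{H}(t, x, S, q) \le \sup_{(\delta_s)_{t\le s\le T}\in\mathcal{A}}\ \inf_{\mathbb{Q}\in\mathcal{Q}^{\alpha,\lambda,\kappa}}\mathbb{E}^{\mathbb{Q}}_{t,x,q,S}\Big[X^\delta_T + q^\delta_T\big(S_T - \ell(q^\delta_T)\big) + \mathcal{H}_{t,T}(\mathbb{Q}\,|\,\mathbb{P})\Big] = H(t, x, S, q). \qquad (A.21)$$

Combining Equations A.20 and A.21 gives

$$\hat{H}(t, x, q, S) = H(t, x, q, S),$$

as desired.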

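Proof of Proposition 4.3. Let $\omega_q(t) = e^{\kappa h_q(t)}$, or equivalently, $h_q(t) = \frac{1}{\kappa}\log\omega_q(t)$.
Substituting this into Equation 4.28 gives

$$\begin{aligned}
\frac{\partial_t\omega_q}{\kappa\omega_q} + \alpha q - \frac{1}{2}\varphi_\alpha\sigma^2 q^2 + \frac{\xi}{\kappa}\frac{\omega_{q-1}}{\omega_q}\mathbb{1}_{q\neq 0} &= 0 \\
\partial_t\omega_q + \alpha\kappa q\,\omega_q - \frac{1}{2}\varphi_\alpha\kappa\sigma^2 q^2\omega_q + \xi\omega_{q-1}\mathbb{1}_{q\neq 0} &= 0 \\
\partial_t\omega_q + K_q\omega_q + \xi\omega_{q-1}\mathbb{1}_{q\neq 0} &= 0, \qquad (A.22)
\end{aligned}$$

together with terminal conditions $\omega_q(T) = e^{-\kappa q\ell(q)}$. To prove Part (i), where
$K_q = 0$ for all $q$, Equation A.22 becomes

$$\partial_t\omega_q = -\xi\omega_{q-1}\mathbb{1}_{q\neq 0}.$$

For $q = 0$, this results in $\omega_0(t) = 1$. For $q > 0$, integrating both sides yields

$$\begin{aligned}
\omega_q(T) - \omega_q(t) &= -\xi\int_t^T\omega_{q-1}(u)\,du \\
\omega_q(t) &= \xi\int_t^T\omega_{q-1}(u)\,du + \omega_q(T). \qquad (A.23)
\end{aligned}$$

Since $\omega_0(t)$ is a constant and each $\omega_q$ results from the integral of $\omega_{q-1}$, $\omega_q(t)$
can be written as

$$\omega_q(t) = \sum_{n=0}^q C_{q,n}(T - t)^n, \qquad (A.24)$$

where each $C_{q,n}$ must be computed. Substituting Equation A.24 into Equation
A.23 gives

$$\begin{aligned}
\omega_q(t) &= \xi\int_t^T\sum_{n=0}^{q-1}C_{q-1,n}(T - u)^n\,du + \omega_q(T) \\
&= \xi\sum_{n=0}^{q-1}C_{q-1,n}\int_t^T(T - u)^n\,du + \omega_q(T) \\
&= \xi\sum_{n=0}^{q-1}C_{q-1,n}\frac{(T - t)^{n+1}}{n + 1} + \omega_q(T) \\
&= \xi\sum_{n=1}^{q}C_{q-1,n-1}\frac{(T - t)^n}{n} + \omega_q(T).
\end{aligned}$$

From here it is easy to see that

$$C_{q,n} = \frac{\xi}{n}C_{q-1,n-1}$$

and

$$C_{q,0} = \omega_q(T).$$

Together, these give the desired result

$$C_{q,n} = \frac{\xi^n}{n!}\,\omega_{q-n}(T) = \frac{\xi^n}{n!}\,e^{-\kappa(q-n)\ell(q-n)}.$$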
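The polynomial solution of Part (i) is easy to evaluate directly. The sketch below assumes $\ell(q) = \theta q$ as in the figures; the function name `omega_poly` and the value of $\xi$ used in the usage note are hypothetical, since $\xi$ is defined with Equation 4.28 rather than here.

```python
import math


def omega_poly(q, t, T, xi, kappa, theta):
    """omega_q(t) = sum_n C_{q,n} (T - t)^n for the case K_q = 0.

    Uses C_{q,n} = xi^n / n! * exp(-kappa*(q-n)*l(q-n)) with the
    assumed penalty l(q) = theta*q.
    """
    total = 0.0
    for n in range(q + 1):
        terminal = math.exp(-kappa * theta * (q - n) ** 2)   # omega_{q-n}(T)
        total += xi ** n / math.factorial(n) * terminal * (T - t) ** n
    return total
```

A finite-difference check of $\partial_t\omega_q = -\xi\omega_{q-1}$, for example at `t=20` with `T=25`, `xi=1.0`, `kappa=15` and `theta=0.01`, is a convenient way to confirm the coefficients, and `omega_poly(q, T, ...)` should reproduce the terminal condition.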
To prove Part (ii), return to Equation A.22. Since $K_0 = 0$, it is easily seen that
$\omega_0(t) = 1$. For $q > 0$, a recursive solution to Equation A.22 can be written as

$$\begin{aligned}
\partial_t\omega_q + K_q\omega_q + \xi\omega_{q-1} &= 0 \\
\partial_t\big(e^{K_q t}\omega_q(t)\big) &= -\xi e^{K_q t}\omega_{q-1} \\
e^{K_q T}\omega_q(T) - e^{K_q t}\omega_q(t) &= -\xi\int_t^T e^{K_q u}\omega_{q-1}(u)\,du \\
\omega_q(t) &= \xi e^{-K_q t}\int_t^T e^{K_q u}\omega_{q-1}(u)\,du + \omega_q(T)e^{K_q(T - t)}. \qquad (A.25)
\end{aligned}$$

With $\omega_0(t) = 1$ and each $\omega_{q-1}(t)$ being integrated against $e^{K_q t}$, the general
form of $\omega_q(t)$ can be written as

$$\omega_q(t) = \sum_{n=0}^q C_{q,n}e^{K_n(T - t)}, \qquad (A.26)$$

where each $C_{q,n}$ must be computed. Substituting Equation A.26 for $\omega_{q-1}$ into
Equation A.25 gives

$$\begin{aligned}
\omega_q(t) &= \xi e^{-K_q t}\int_t^T e^{K_q u}\sum_{n=0}^{q-1}C_{q-1,n}e^{K_n(T - u)}\,du + \omega_q(T)e^{K_q(T - t)} \\
&= \xi e^{-K_q t}\int_t^T\sum_{n=0}^{q-1}C_{q-1,n}e^{K_n T}e^{(K_q - K_n)u}\,du + \omega_q(T)e^{K_q(T - t)} \\
&= \xi e^{-K_q t}\sum_{n=0}^{q-1}C_{q-1,n}e^{K_n T}\int_t^T e^{(K_q - K_n)u}\,du + \omega_q(T)e^{K_q(T - t)} \\
&= \xi e^{-K_q t}\sum_{n=0}^{q-1}C_{q-1,n}e^{K_n T}\,\frac{e^{(K_q - K_n)T} - e^{(K_q - K_n)t}}{K_q - K_n} + \omega_q(T)e^{K_q(T - t)} \\
&= \xi\sum_{n=0}^{q-1}C_{q-1,n}\,\frac{e^{K_q(T - t)} - e^{K_n(T - t)}}{K_q - K_n} + \omega_q(T)e^{K_q(T - t)} \\
&= -\xi\sum_{n=0}^{q-1}\frac{C_{q-1,n}}{K_q - K_n}\,e^{K_n(T - t)} + \Big(\xi\sum_{n=0}^{q-1}\frac{C_{q-1,n}}{K_q - K_n} + \omega_q(T)\Big)e^{K_q(T - t)}. \qquad (A.27)
\end{aligned}$$

From the first summation term above, it is deduced that $C_{q,n} = \frac{-\xi C_{q-1,n}}{K_q - K_n}$ for
$q > n$. This recursive relation leads to

$$C_{n+j,n} = (-\xi)^j\,C_{n,n}\prod_{p=1}^j\frac{1}{K_{n+p} - K_n}.$$

The second summation term in Equation A.27 leads to
$C_{q,q} = \xi\sum_{n=0}^{q-1}\frac{C_{q-1,n}}{K_q - K_n} + \omega_q(T)$, which from the previous recursion is the same as

$$C_{q,q} = -\sum_{n=0}^{q-1}C_{q,n} + \omega_q(T).$$

These equations together with $C_{0,0} = 1$ yield all of the coefficients.
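The recursions of Part (ii) translate directly into code. The following is a minimal sketch that builds the coefficient table and evaluates $\omega_q(t)$; it assumes the $K_q$ are pairwise distinct so that no denominator vanishes, and the function names, together with the parameter values in the usage note, are hypothetical.

```python
import math


def coefficients(Q, K, xi, omega_T):
    """Table C[q][n] for omega_q(t) = sum_n C[q][n] * exp(K[n] * (T - t)).

    Recursions: C[q][n] = -xi * C[q-1][n] / (K[q] - K[n]) for n < q,
    C[q][q] = omega_T[q] - sum_{n<q} C[q][n], and C[0][0] = 1.
    K[0] must be 0 and the entries of K pairwise distinct.
    """
    C = [[0.0] * (Q + 1) for _ in range(Q + 1)]
    C[0][0] = 1.0
    for q in range(1, Q + 1):
        for n in range(q):
            C[q][n] = -xi * C[q - 1][n] / (K[q] - K[n])
        C[q][q] = omega_T[q] - sum(C[q][:q])
    return C


def omega(q, t, T, C, K):
    """Evaluate omega_q(t) from the coefficient table."""
    return sum(C[q][n] * math.exp(K[n] * (T - t)) for n in range(q + 1))
```

As an illustration, one might take `K = [-0.15 * q * q for q in range(Q + 1)]`, `xi = 1.0` and `omega_T = [math.exp(-0.15 * q * q) for q in range(Q + 1)]`, then confirm by finite differences that `omega` satisfies Equation A.22 and matches the terminal condition at `t = T`.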


Proof of Proposition 4.4. Case (i): For case (i) with $K_q = 0$ for all $q$, the
value function is of the form:

$$h_q(t) = \frac{1}{\kappa}\log\Big(\sum_{n=0}^q C_{q,n}(T - t)^n\Big),$$

where each $C_{q,n} > 0$, and the feedback form of the optimal depth is

$$\delta^*_q(t) = \frac{1}{\varphi}\log\Big(1 + \frac{\varphi}{\kappa}\Big) - h_{q-1}(t) + h_q(t).$$

Substituting the value function into the feedback expression gives

$$\delta^*_q(t) = \frac{1}{\varphi}\log\Big(1 + \frac{\varphi}{\kappa}\Big) + \frac{1}{\kappa}\log\Big(\frac{\sum_{n=0}^q C_{q,n}(T - t)^n}{\sum_{n=0}^{q-1}C_{q-1,n}(T - t)^n}\Big).$$

As $T - t \to \infty$, the argument of the logarithm approaches $\infty$ since the
numerator is a polynomial of higher degree than the denominator. The argument of
the logarithm clearly grows as $\frac{C_{q,q}}{C_{q-1,q-1}}(T - t)$. With the expression for $C_{q,n}$
from Proposition 4.3, this means the depth grows as

$$\delta^*_q(t) \approx \frac{1}{\kappa}\log\Big(\frac{\xi}{q}(T - t)\Big).$$

Case (ii): For case (ii) with $K_q < 0$ for all $q > 0$, the value function is of the
form:

$$h_q(t) = \frac{1}{\kappa}\log\Big(\sum_{n=0}^q C_{q,n}e^{K_n(T - t)}\Big),$$

and so the optimal depth is equal to

$$\delta^*_q(t) = \frac{1}{\varphi}\log\Big(1 + \frac{\varphi}{\kappa}\Big) + \frac{1}{\kappa}\log\Big(\frac{\sum_{n=0}^q C_{q,n}e^{K_n(T - t)}}{\sum_{n=0}^{q-1}C_{q-1,n}e^{K_n(T - t)}}\Big).$$

Since each $K_n$ is negative (except for $K_0$ which is equal to 0), this clearly
converges as $(T - t) \to \infty$ to

$$\begin{aligned}
\delta^*_q(t) &\to \frac{1}{\varphi}\log\Big(1 + \frac{\varphi}{\kappa}\Big) + \frac{1}{\kappa}\log\Big(\frac{C_{q,0}}{C_{q-1,0}}\Big) \\
&= \frac{1}{\varphi}\log\Big(1 + \frac{\varphi}{\kappa}\Big) + \frac{1}{\kappa}\log\Big(\frac{-\xi}{K_q}\Big).
\end{aligned}$$

Chapter 5
Challenges in Scenario Generation:
Modeling Market and Non-Market
Risks in Insurance

Douglas McLean

CONTENTS
5.1 Introduction
5.1.1 The challenge of negative nominal interest rates
5.1.2 Objectives
5.1.3 ESG and Solvency 2
5.1.4 Layout
5.2 Economic Scenario Generators
5.2.1 What does an ESG do?
5.2.2 The yield curve
5.2.3 Nominal interest-rate models
5.2.4 Credit models
5.2.5 Equity models
5.2.6 Calibration issues
5.2.7 High-performance computing in ESGs
5.3 Risk-Scenario Generators
5.3.1 Co-dependency structures and simulation
5.3.2 Mortality and longevity risk
5.3.3 Lapse and surrender risk
5.3.4 Operational risk
5.4 Examples of Challenges in Scenario Generation
5.4.1 PEDCP representation
5.4.1.1 Distribution function
5.4.1.2 Density function
5.4.2 Stochastic volatility and jump diffusion representation
5.4.2.1 Distributional representation
5.4.2.2 The SVJD model and the combined equity asset shock
5.4.2.3 Unconditional equity asset shock distribution
5.5 Discussion
References


5.1 Introduction
5.1.1 The challenge of negative nominal interest rates
In August 2014, the euro-denominated German 1-year bond yield dropped
below zero and stubbornly stayed there until the time of writing (late November
2017). Together with many other rates, including, more recently, the German 10-year
rate in June 2016, this heralded an unprecedented era of low interest rates.
Symptomatic of the European Central Bank’s policy of quantitative easing, the
bond market has been reacting to a surfeit of cash. Overnight bank deposits now
incur charges and borrowers would receive interest rather than pay it. Such a
policy was designed to stimulate lending and to penalize those who might otherwise
save. For economic scenario generation, this has exposed a challenge: model risk.
As will be discussed in the sequel, it led to a quandary over the log-normal basis
in which the so-called market implied volatilities are typically quoted. A
log-normal implied volatility does not exist when the underlying rate is negative,
so data vendors could no longer supply these quotes and the Libor Market Model
(e.g., Brigo and Mercurio 2006; Wu and Zhang 2006) could not be calibrated. The
ramification for economic scenario generators (ESGs), with a knock-on effect on
insurance companies’ asset and liability modeling systems (ALM for short), was
that they would no longer be able to make stochastic projections of the most
fundamental quantity in the market: the yield curve. Something needed to be done.
The error had been a lack of foresight: negative nominal interest rates had been
thought inconceivable, yet they had happened. The obvious solution would be to
switch to a different model that could cope with negative rates. But which one?
Fortunately, the investment banking industry had already foreseen a negative
nominal rate scenario and had for some time been using Bachelier’s arithmetic
Brownian motion (1901) as a model for the underlying interest rates. Bachelier’s
model allowed negative nominal rates where Black’s model did not. Furthermore,
the data providers had been offering absolute implied volatilities to match. The
switch was made and, for the time being at least, another ESG challenge had been
resolved.
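
The contrast between the two quoting conventions can be illustrated with a minimal sketch. It is not taken from the chapter: the function names, the forward rate of -20 basis points, the strike and the volatility levels are illustrative assumptions, and discounting is omitted. The Black (log-normal) formula needs the logarithm of the forward-to-strike ratio and is therefore undefined for a negative forward rate, whereas the Bachelier (normal) formula remains well defined.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def black_call(forward, strike, vol, expiry):
    """Undiscounted Black (log-normal) call; requires forward > 0 and strike > 0."""
    if forward <= 0.0 or strike <= 0.0:
        raise ValueError("log-normal model is undefined for non-positive rates")
    sd = vol * math.sqrt(expiry)
    d1 = (math.log(forward / strike) + 0.5 * sd * sd) / sd
    return forward * norm_cdf(d1) - strike * norm_cdf(d1 - sd)

def bachelier_call(forward, strike, abs_vol, expiry):
    """Undiscounted Bachelier (normal) call; any real forward and strike."""
    sd = abs_vol * math.sqrt(expiry)
    d = (forward - strike) / sd
    return (forward - strike) * norm_cdf(d) + sd * norm_pdf(d)

# Hypothetical market: forward rate of -20 bp, strike of +10 bp, 1-year expiry.
forward, strike, expiry = -0.002, 0.001, 1.0

normal_price = bachelier_call(forward, strike, abs_vol=0.006, expiry=expiry)

try:
    black_call(forward, strike, vol=0.30, expiry=expiry)
    black_failed = False
except ValueError:
    black_failed = True   # no log-normal price, hence no log-normal implied volatility
```

Raising an exception in `black_call` is a design choice of this sketch; a production pricer might instead use a shifted log-normal model, which the chapter does not discuss at this point.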

5.1.2 Objectives
I purposely romanticized the situation of negative nominal interest rates
to emphasize the scale of the challenge that faces anyone who sets out to
construct an ESG. It is not easy. For one thing, it involves modeling each of the
major asset classes in the financial markets and then “hanging” them together
within a coherent co-dependency structure. The structure of choice is almost
always the Gaussian copula and this is almost never appropriate within a
risk-management system. It is not by accident that this is referred to as the formula
that killed Wall Street (e.g., Mackenzie and Spears 2012). Its misuse certainly
contributed to the 2007/2008 financial crisis. In simply setting a correlation
matrix (whether it is a Pearson or a rank-based Spearman correlation matrix)
and correlating asset shocks by a Cholesky matrix factorization (e.g., Noble
and Daniel 1977), one makes a de facto choice of the Gaussian copula. In
Section 5.3 I will discuss alternative copulas which do not suffer from the
drawback of permitting full portfolio diversification in times of financial crisis.
One must also be mindful that any ESG must function in a multi-economy
setting: there will almost certainly be more than one economy, perhaps 30 or
40, and one will want to forecast and price both assets and liabilities
consistently within each. A final introductory note concerns model calibration, a
question which arises perpetually. Equivalent to non-convex optimization,
model calibration is one of the most challenging areas of mathematics and
must be tackled if one is to parametrize the models within an ESG realistically.
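
The remark about the Cholesky factorization can be made concrete with a short sketch. It assumes NumPy and a hypothetical 3 x 3 correlation matrix; none of the numbers come from the chapter. Independent standard normal shocks are multiplied by the Cholesky factor, which yields jointly Gaussian shocks, and mapping each margin through the normal distribution function gives a sample from the Gaussian copula.

```python
import math
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical target correlation matrix for three asset shocks.
corr = np.array([[1.0, 0.6, 0.3],
                 [0.6, 1.0, 0.5],
                 [0.3, 0.5, 1.0]])

chol = np.linalg.cholesky(corr)            # lower-triangular factor, chol @ chol.T == corr
z = rng.standard_normal((100_000, 3))      # independent N(0, 1) shocks
shocks = z @ chol.T                        # correlated, jointly Gaussian shocks

sample_corr = np.corrcoef(shocks, rowvar=False)

# Mapping each margin to (0, 1) with the normal CDF exposes the copula itself:
# the dependence structure is Gaussian whatever marginal models are applied afterwards.
normal_cdf = np.vectorize(lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))))
uniforms = normal_cdf(shocks)
```

Any marginal distribution applied to `uniforms` afterwards inherits this dependence structure, which is why choosing the correlation matrix and the Cholesky step amounts to choosing the Gaussian copula.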
The aim of this chapter is therefore to illustrate some of the challenges that
need to be overcome and to set them in the context of the regulatory world
of insurance that they inhabit. It must be said that the Solvency 2 directive
(and other rather onerous regulatory regimes around the world) provides the
demand for ESGs. To begin with, therefore, I will set the scene by focusing
largely on Pillar 1 of the Solvency 2 regulation of insurance companies in the
European Union.

5.1.3 ESG and Solvency 2


Insurance and reinsurance companies rely on ALM systems such as SunGard’s
Prophet Asset Liability Strategy; Willis Towers Watson’s Replica, Igloo, and
MoSes solutions; and Aon Benfield’s ReMetrica to model both the asset and the
liability sides of their balance sheets. In the European Union, the Solvency 1
regime was the first attempt to assess the solvency of pan-European insurance
companies in a coherent way. I do not discuss this regime in any detail except to
say that it was hard to compare insurers precisely on this measure and that it did
not allow for the time value of derivative assets to be fairly reported on insurers’
balance sheets. The Solvency 2 directive, issued by the European Insurance and
Occupational Pensions Authority (EIOPA), was introduced to improve upon the
Solvency 1 directive. Perhaps the greatest change in moving to Solvency 2 is the
enhanced use of stochastic scenarios for valuation and forecasting of insurers’
balance sheets. The transition to Solvency 2 has been challenging for many
continental European insurers who have not been used to operating under this
paradigm, but less onerous for those in the United Kingdom, where such reporting
has already been a regulatory requirement and systems are in place: for example,
in the calculation of derivative-styled market-consistent embedded value. Now,
however, the Solvency 2 directive instructs all European insurers to work on this
basis. It is worthy of note that other regimes like Solvency 2 are in force elsewhere
in the world. The Swiss Solvency Test (SST) has been granted full equivalence to
Solvency 2 while regulatory regimes in Australia, Bermuda, Brazil, Canada,
Mexico, and the United States have been granted Solvency 2 equivalence for
a period of 10 years under Pillar 1. This is largely driven by necessity for
non-EU domiciled insurers who participate in EU markets.
This chapter focuses on the use of an ESG to generate stochastic scenario
sets for use within the context of the European Union’s Solvency 2 directive
under Pillar 1 (and potentially under Pillar 2). Note that by scenario I mean
the market dynamics that can be simulated from a given multivariate probability
distribution via coupled systems of stochastic differential or econometric
equations. Monte-Carlo simulation is the vehicle which permits any practical
number of simulations, each representing a state of the world. Each scenario
unfolds differently given the same current market conditions. Except to say
that they may be used within the context of the own risk and solvency assessment
(ORSA) of Pillar 2, I do not discuss the careful crafting of specific scenarios of
the kind used for stress testing under the Basel II banking regulations. Such
scenario sets are usually small and are aimed at testing whether a business can
withstand a highly stylized specific shock such as a worldwide pandemic, financial
crisis or terrorist attack. Rather, the scenarios which emerge from an ESG are
randomly generated from a well-defined stochastic process and usually number in
the thousands, if not tens or even hundreds of thousands.
The Solvency 2 regulatory regime came into force on January 1, 2016,
following a period of extended industry consultation lasting more than a decade.
It replaces 13 individual regulatory regimes over 27 jurisdictions in Europe
(including the United Kingdom). There are three Pillars of Solvency 2 which
each assess fundamental components of good insurance practices. Pillar 1 is
by far the most prescriptive of the three and requires each insurance com-
pany to report its solvency capital requirement (SCR). This is a quantita-
tive measure of its ability to withstand a severe 1-in-200-year event over
a 1-year horizon: that is, it is the 1-year value-at-risk (VaR) at the 99.5th
percentile.
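
As a rough illustration of the 99.5th-percentile calculation, the sketch below estimates a 1-year VaR from simulated losses by taking the smallest sample loss whose empirical probability of not being exceeded is at least 99.5%. The normally distributed losses, their scale and the function name `empirical_var` are assumptions made purely for illustration; an insurer's loss distribution would come from its ALM system.

```python
import math
import numpy as np

def empirical_var(losses, alpha):
    """Smallest sample loss x whose empirical Pr(L <= x) is at least alpha."""
    ordered = np.sort(np.asarray(losses, dtype=float))
    # Number of observations that must lie at or below the quantile;
    # the small tolerance guards against floating-point round-off in alpha * n.
    k = math.ceil(alpha * ordered.size - 1e-9)
    return ordered[max(k, 1) - 1]

rng = np.random.default_rng(seed=1)
losses = rng.normal(loc=0.0, scale=100.0, size=200_000)   # hypothetical 1-year losses
scr_estimate = empirical_var(losses, alpha=0.995)
```

Because only about 0.5% of the simulated losses lie beyond this quantile, the estimate rests on roughly a thousand of the 200,000 samples here, which is one reason why large scenario budgets are needed.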
The local regulator (such as the Prudential Regulation Authority, PRA,
in the United Kingdom) is then at liberty to add a capital buffer if it judges
the SCR to be too low given the specific nature of an insurer’s business. An
additional metric is computed: the minimum capital requirement (MCR),
which the PRA sets and which lies between 25% and 45% of the SCR (including
any capital buffer). If the regulatory capital held by an insurer falls below this
MCR then it is deemed to be insolvent and its business is put into
administration. If, however, the regulatory capital held is more than the MCR yet
less than the SCR then the regulator issues a requirement to the insurer to
detail how it will increase its regulatory capital above the level required by
the SCR. This plan must be returned within one month, after which the
regulator may directly intervene in the management of the insurance company if
it is not satisfied. The t-year VaR at level α is defined on the (stochastic) loss
variable $L_t$:

$$\mathrm{VaR}_t^{\alpha} := \inf_{x \in \mathbb{R}} \{x : \Pr(L_t \le x) \ge \alpha\} \qquad (5.1)$$
It is worth noting the choice and the stability of the metric that is being
used to define the SCR: the 1-year VaR at 99.5%. The rationale for the 1-year
VaR measure is that it predicts the amount of extra capital that would have
to be raised now and invested at the risk-free rate so that in one year’s time
an insurer would be solvent (McNeil et al., 2015). This argument can be used
to show that the SCR is a quantile of the loss distribution. However, McNeil
et al. (2015) detail the properties a good risk measure should have and go
on to show that the VaR risk measure does not satisfy all of them. The
four properties of a coherent risk measure $\tilde{\rho}(L)$ as a function of the loss $L$
(considered a random variable over some linear space $M$ of losses) are:

1. Monotonicity: for losses $L_1 < L_2$, $\tilde{\rho}(L_1) \le \tilde{\rho}(L_2)$

2. Translational invariance: for all deterministic $\ell \in \mathbb{R}$, $\tilde{\rho}(L + \ell) = \tilde{\rho}(L) + \ell$

3. Sub-additivity: for any $L_1, L_2$, $\tilde{\rho}(L_1 + L_2) \le \tilde{\rho}(L_1) + \tilde{\rho}(L_2)$

4. Positive homogeneity: for any $\lambda > 0$, $\tilde{\rho}(\lambda L) = \lambda \tilde{\rho}(L)$.

The interpretation of these axioms is as follows. Monotonicity shows that
positions that incur higher losses require higher capital. Adding a constant to
a loss means simply adding the same constant to the risk measure. Sub-additivity
shows that the risk measure of two combined losses cannot exceed the sum of
their risk measures when they are measured separately: diversification is
always worthwhile. Positive homogeneity shows that scaling a loss by a positive
factor scales its risk measure by the same factor. McNeil et al. (2015) then detail
how VaR is not sub-additive as a risk measure. An insurance company is typically
composed of several business units and, under Solvency 2, each business unit
must calculate its own VaR, with an aggregated VaR measure over the whole
business finally being submitted to the regulator. Since VaR is not sub-additive,
breaking the business into further units may reduce the VaR measure further.
Risk measures satisfying all four axioms are termed coherent. For example,
there is the coherent expected shortfall measure:

$$\mathrm{ES}_{\alpha} := \frac{1}{1-\alpha}\int_{\alpha}^{1} \mathrm{VaR}_x(L)\,dx; \qquad (5.2)$$

which is arguably a better choice than the simple VaR measure as it averages
the VaR values above the α-quantile of the loss distribution. In any
event, a risk measure is simply a gross summary statistic from an insurer’s
multivariate risk-driver distribution. Given the large dimensionality directly
implicated in this multivariate density (see Sections 5.2 and 5.3), taking such
a gross summary is a severe marginalization of a high-dimensional object.
Since some aspects of the density will be better estimated than others and
some will be based on very little information at all (perhaps only on expert
judgement; e.g., losses incurred by operational risks have little available data),
the sensitivity of the VaR measure to small perturbations ought to be given
scrutiny. A small change in assumptions leading to a slightly different model
parametrization, for example, has the potential to lead to an unpredictable
change in an insurer’s capital requirements, be it for better or for worse.
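
The failure of sub-additivity can be seen in a small worked example, sketched below under stated assumptions: two hypothetical, independent business units each lose 100 with probability 0.4% and nothing otherwise. The helper names are illustrative and the distributions are discrete, so the quantile of (5.1) and the integral of (5.2) can be evaluated exactly rather than by simulation.

```python
def convolve(a, b):
    """Distribution of the sum of two independent discrete losses {loss: probability}."""
    out = {}
    for loss_a, p_a in a.items():
        for loss_b, p_b in b.items():
            out[loss_a + loss_b] = out.get(loss_a + loss_b, 0.0) + p_a * p_b
    return out

def var_discrete(outcomes, alpha):
    """Smallest loss x with Pr(L <= x) >= alpha, as in (5.1)."""
    cumulative = 0.0
    for loss in sorted(outcomes):
        cumulative += outcomes[loss]
        if cumulative >= alpha - 1e-12:      # tolerance for floating-point round-off
            return loss
    return max(outcomes)

def es_discrete(outcomes, alpha):
    """Average of VaR_x over x in [alpha, 1], as in (5.2)."""
    total, lower = 0.0, 0.0
    for loss in sorted(outcomes):
        upper = lower + outcomes[loss]
        # Length of the probability interval (lower, upper] that lies above alpha.
        overlap = max(0.0, min(upper, 1.0) - max(lower, alpha))
        total += loss * overlap
        lower = upper
    return total / (1.0 - alpha)

alpha = 0.995
unit = {0.0: 0.996, 100.0: 0.004}            # hypothetical single business unit
combined = convolve(unit, unit)              # two independent units together

var_unit, var_combined = var_discrete(unit, alpha), var_discrete(combined, alpha)
es_unit, es_combined = es_discrete(unit, alpha), es_discrete(combined, alpha)
```

Each unit alone has a 99.5% VaR of zero because its loss probability is below 0.5%, yet the combined business has a positive VaR, so the sum of the separate VaRs understates the aggregate. Expected shortfall on the same numbers remains sub-additive.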
Compared to Pillar 1, Pillar 2 is a broad instruction to insurers to make
their own risk and solvency assessment (ORSA). The wording is deliberately
vague and is intended to instill a holistic risk management ethos. Insurers
must demonstrate that they have considered, understood, and quantified all
possible risks their businesses face over a long time horizon (beyond the 1-year
horizon of Pillar 1). It is also a chance for an insurer to use its own in-house
modeling approach that accurately represents the risks on its balance sheet.
One approach is to use an ESG in a multi-period projection of market and
non-market risks. This can be rather onerous on the necessary scenario budget,
however. Given the vague wording of Pillar 2, an insurer may opt for a small
number of stylized scenarios. These are like the Basel II stressed scenarios used
in banking regulation. Typically, they test the robustness of an institution to
specific outcomes in the future such as a second financial crisis. In any case,
insurers must demonstrate that they have implemented an appropriate risk
management strategy to mitigate their risks. Ultimately, the regulator reviews
their approach and can choose whether to accept or reject it. In the latter case,
an insurer would be required to update their risk management procedures to
a new acceptable level and resubmit for approval: a task which is certainly
costly.
Pillar 3 focuses on disclosure and transparency, and sets out regulatory
reporting standards. Under Pillar 3, insurers are required to submit two
reports annually: a Solvency and Financial Condition Report and a Regular
Supervisory Report; the former is made public and the latter is private.
Since scenario generation is primarily concerned with Pillar 1 (and increasingly
with Pillar 2), I refer the interested reader to the website of the Bank of
England for more information on Pillar 3 and continue with my discussion
of Pillar 1.
Computation of the SCR may follow one of two paths: either insurance
companies use a prescribed standard formula in their calculations or they
may define and use their own internal model. Being formulaic, the standard
formula model is easier to implement but has the downside of being relatively
insensitive to the specifics of an insurer’s business. For example, if a book
of business contains mainly plain vanilla products, such as standard
annuities or self-invested personal pension schemes, then the standard formula
may be a sufficient and cost-effective way of evaluating the SCR. On the other
hand, if the book of business contains defined benefit schemes or guaranteed
products with derivative-styled payoffs, it will certainly be advantageous to
incur the often substantial cost of building one’s own internal model that will
give a more accurate (and hopefully less punitive) SCR.
The regulatory supervisor recognizes the use of stochastic models of market
dynamics and goes further to say that these should be supplied by an
ESG. Market dynamics may be simulated under one of two paradigms:
the real-world P-measure or the risk-neutral/market-consistent Q-measure.
Valuation of derivative-styled optionality embedded in the insurer’s balance
sheet is computed under the Q-measure, which (typically) guarantees
the absence of arbitrage.6 I say typically since it is not always possible to find
such a Q-measure. For example, when pricing long-dated bonds (or anything
derived from them such as long-dated interest rate hedges), the market may
be quite illiquid with only a few players on the buy and sell sides. Reliable
prices may not exist and so the arbitrage-free Q-measure may be unreliable.
This represents a significant challenge in economic scenario generation and is
the subject of current research (Salahnejhad and Pelsser 2015). Forecasting
an insurer’s assets and liabilities over, for example, a 1-year horizon to
establish a loss distribution must be done under the real-world P-measure. This
is the best forward-looking measure given the state of the world today and
where economic variables may reasonably be expected to fall over time.
Under specific technical conditions (see, e.g., Baxter and Rennie 1996), the
real-world P- and risk-neutral/market-consistent Q-measures may be related
via Girsanov’s theorem and the Radon-Nikodym derivative. It may suffice
simply to postulate the existence of a market price of risk as compensation for
bearing risk in a world that is not risk-neutral. This introduces a drift
correction to the growth rates. Having established the need for two measures,
one must choose the most appropriate stochastic models accordingly and
calibrate their parameters. This represents a challenging problem in
scenario generation: parameter calibration, a challenge to which I will return
in Section 5.2. Typically, one calibrates models under the
risk-neutral/market-consistent Q-measure to a snapshot of the current market.
Models under the real-world P-measure are often fitted to historical data. The
obvious difficulty is then in reconciling the financial theory of measure changes
between the risk-neutral and the real world via a drift correction with the
statistical goodness-of-fit of models under each measure. If the fit is good under
one measure yet poor under the other, one may opt for different models under
each measure which are not related by a measure change. There is no easy way
to reconcile this dichotomy but in practice one needs to be pragmatic.
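
The drift correction mentioned above can be written out for the simplest possible case. The following is only a sketch under assumptions that are not made in the chapter: a single asset following geometric Brownian motion with constant growth rate $\mu$, volatility $\sigma$ and risk-free rate $r$. The symbol $\lambda$ denotes the postulated market price of risk.

```latex
\begin{align*}
  % Assumed real-world (P) dynamics of the asset price S
  dS_t &= \mu S_t \, dt + \sigma S_t \, dW_t^{\mathbb{P}}, \\
  % Market price of risk: excess growth per unit of volatility
  \lambda &= \frac{\mu - r}{\sigma}, \\
  % Girsanov: the shifted process is a Brownian motion under Q
  dW_t^{\mathbb{Q}} &= dW_t^{\mathbb{P}} + \lambda \, dt, \\
  % Radon-Nikodym derivative linking the two measures on [0, T]
  \left.\frac{d\mathbb{Q}}{d\mathbb{P}}\right|_{\mathcal{F}_T}
    &= \exp\!\left(-\lambda W_T^{\mathbb{P}} - \tfrac{1}{2}\lambda^2 T\right), \\
  % Substituting gives the risk-neutral (Q) dynamics with drift r
  dS_t &= r S_t \, dt + \sigma S_t \, dW_t^{\mathbb{Q}}.
\end{align*}
```

In this setting the two measures differ only through the drift, $\mu$ under P and $r$ under Q, while the volatility is unchanged. This is why a model calibrated to market prices and a model fitted to historical data can disagree on $\sigma$ and so fail to be related by a measure change.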
6 For an introduction to the idea of risk-neutral pricing, see Hull (2005) or Baxter and Rennie (1996).

ALM systems model, at the most granular level, the behavior of the
policyholder schemes and the assets backing them. An essential ingredient of
each ALM system is its market model, which is supplied by an ESG. Depending
on the precise flavor of the insurance fund under consideration, these are
often sensitive to equity, foreign exchange and credit risk but are always
sensitive to interest rate and inflation risk. When an ALM system is coupled to
scenarios emerging from an ESG, it becomes a Monte-Carlo simulation engine
that is used either for pricing or for forecasting. When pricing, any liability with
derivative-styled payoffs may be valued consistently with the market by
approximating the expectation of the discounted payoff under the Q-measure
with the sample mean over Monte-Carlo-simulated market scenarios. It is
imperative that the simulations are run under the risk-neutral measure for
this approach to be appropriate. Indeed, market-consistent valuation occurs
on both sides of the balance sheet: assets as well as liabilities. The real-world
P-measure is required under Pillar 1 of the Solvency 2 SCR calculation when
projecting assets and liabilities onto the 1-year horizon. This develops the loss
distribution from which the SCR’s 99.5% 1-year VaR may be taken. However,
the loss distribution will only be completely defined if the values of the variables
in the balance sheet are known at the 1-year horizon. This is not the case for
derivatives such as equity options and interest rate swaptions. Importantly,
if an insurer’s business incorporates policyholder guarantees then these,
having derivative-styled characteristics, will not yet be valued either. For
example, a minimum money-back guarantee in a with-profits fund hides latent
value or moneyness at any point in time if the fund has policies in force. One
must make a further projection of the fund, but this time under the
market-consistent Q-measure, and discount/deflate this value back to the 1-year
horizon. This allows a value to be established that considers the time value of
guarantees (TVOG) or the market-consistent embedded value (MCEV) of the
guarantees.
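
A minimal sketch of market-consistent valuation by sample mean follows. It assumes, purely for illustration, that the fund follows geometric Brownian motion and that the money-back guarantee pays the shortfall of the fund below a guaranteed amount at maturity, so that it behaves like a European put. The parameter values and function names are hypothetical, and the Black–Scholes formula is included only as a benchmark for the Monte-Carlo estimate.

```python
import math
import numpy as np

def bs_put(spot, strike, r, sigma, t):
    """Black-Scholes price of a European put (benchmark only)."""
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    sd = sigma * math.sqrt(t)
    d1 = (math.log(spot / strike) + (r + 0.5 * sigma**2) * t) / sd
    d2 = d1 - sd
    return strike * math.exp(-r * t) * cdf(-d2) - spot * cdf(-d1)

def mc_guarantee_value(spot, guarantee, r, sigma, t, n_paths, rng):
    """Sample mean of the discounted guarantee payoff under the Q-measure."""
    z = rng.standard_normal(n_paths)
    # Risk-neutral dynamics: the fund grows at the risk-free rate r.
    fund_t = spot * np.exp((r - 0.5 * sigma**2) * t + sigma * math.sqrt(t) * z)
    discounted = math.exp(-r * t) * np.maximum(guarantee - fund_t, 0.0)
    std_err = discounted.std(ddof=1) / math.sqrt(n_paths)
    return discounted.mean(), std_err

rng = np.random.default_rng(seed=3)
mc_price, std_err = mc_guarantee_value(100.0, 100.0, 0.02, 0.2, 10.0, 200_000, rng)
exact = bs_put(100.0, 100.0, 0.02, 0.2, 10.0)
```

Replacing the drift `r` by a real-world growth rate would give a forecast of the payoff rather than a market-consistent price, which is the distinction drawn in the text.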
The need for a second projection under the market risk-neutral measure
beyond the 1-year horizon leads to what is known as the nested stochastic
problem and is a significant challenge in economic scenario generation. For
example, if an insurer chooses to build the loss distribution using a thousand
real-world forecasts at 1 year then, for each of these, a similar number of
market-consistent scenarios (say another one thousand, although even this number
may not be enough and is highly problem specific) are needed to capture
any latent value inherent in the derivative-styled behavior of an insurer’s assets
and liabilities. In some instances, for derivative securities such as equity options,
a Black–Scholes formula could theoretically replace the need for the second
(inner) set of market risk-neutral simulations. However, such formulas are not
available for more complicated equity models, or for certain interest rate and
credit models. Whenever derivative securities are on an insurer’s balance sheet
they will need to be valued, and one simple solution is to use Monte-Carlo
simulation to a given time horizon, discount and take a sample mean. For
liabilities, there are certainly no closed-form solutions or numerical methods
for assessing their embedded value, and Monte-Carlo simulation is the only
way they may be estimated (with the possible exception of using a replicating
portfolio of matching assets, although getting such a portfolio to match well
enough is a somewhat intractable problem). An open question is whether such
estimators are biased or have an unacceptably high variance. One may, for
example, settle for a biased estimator with a much lower variance if its bias is
controllably small: for example, by using a weighted estimator where the weights
are chosen optimally by minimizing some out-of-sample metric. Opting for
a better estimator may reduce the number of scenarios required which, in
the naive case of the arithmetic sample mean estimator, is of the order of
a million. Such large numbers of scenarios produce a serious performance
bottleneck: ALM systems are currently not fast enough to process
this number of scenarios in a reasonable amount of time. Indeed, scenarios
from an ESG are typically written out to a CSV (or equivalent) file and then
read into an ALM system. The I/O burden of processing such a large scenario
set file using, say, a monthly time step over multiple risk drivers in multiple
economies is, at least currently, prohibitive.
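
The structure of the nested stochastic calculation can be sketched as follows. The example assumes a single fund following geometric Brownian motion, a hypothetical real-world growth rate `mu`, and a guarantee maturing after 10 years; all names and numbers are illustrative. One thousand outer real-world scenarios at the 1-year horizon each spawn one thousand inner risk-neutral scenarios, giving the budget of a million scenarios referred to above.

```python
import math
import numpy as np

rng = np.random.default_rng(seed=7)

# Hypothetical parameters: fund value, guarantee, risk-free rate, real-world growth, volatility.
s0, guarantee, r, mu, sigma = 100.0, 100.0, 0.02, 0.06, 0.2
horizon, maturity = 1.0, 10.0
n_outer, n_inner = 1_000, 1_000

# Outer step: real-world (P) fund values at the 1-year horizon, drift mu.
z_outer = rng.standard_normal(n_outer)
fund_1y = s0 * np.exp((mu - 0.5 * sigma**2) * horizon + sigma * math.sqrt(horizon) * z_outer)

# Inner step: from every outer node, risk-neutral (Q) paths to maturity, drift r.
tau = maturity - horizon
z_inner = rng.standard_normal((n_outer, n_inner))
fund_T = fund_1y[:, None] * np.exp((r - 0.5 * sigma**2) * tau + sigma * math.sqrt(tau) * z_inner)

# Value of the guarantee at the 1-year horizon in each outer scenario:
# discounted payoff averaged over that scenario's inner paths.
liability_value = math.exp(-r * tau) * np.maximum(guarantee - fund_T, 0.0).mean(axis=1)

scenario_budget = n_outer * n_inner
var_995 = np.quantile(liability_value, 0.995)   # tail of the 1-year liability distribution
```

Even this toy case draws a million random numbers for one risk driver and a single terminal time step; with monthly steps, many risk drivers and many economies the budget multiplies accordingly.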
Mercifully, some solutions to the nested stochastic bottleneck exist. They
include the use of replicating portfolios, curve-fitting and least-squares Monte
Carlo (see Cathcart 2012). The latter technique approximates the asset and/or
liability values with regression functions that have been fitted to noisy
versions of the nested stochastic output. To be more precise, instead of running
one thousand market-consistent simulations for every real-world scenario, a
very small number of market-consistent simulations is run per real-world
scenario. This brings the overall scenario budget down from a million to only a
few thousand: something which is much more manageable for an ALM system.
The challenge is now to produce a few thousand market-consistent
calibrations, one for each of the few thousand states of the world emerging at the
1-year horizon: a non-trivial exercise if this is to be done in any reasonable
time. Functional approximations are typically made to produce calibrations
quickly rather than force them through time-consuming numerical optimization
algorithms. One should exercise caution as, yet again, the parameter
calibration problem appears and care must be taken to ensure that each
calibration is realistic (see Section 5.2). This scenario set is run through the ALM
system and a regression function (such as a multiple polynomial regression
function, or other, in the underlying variables) is fitted to the (discounted)
desired asset or liability payoffs. The idea behind the least-squares Monte-Carlo
technique is that the expectation of the discounted payoff may be approximated
by the linear predictor coming from the polynomial regression function. If one
can show that the conditions of the Gauss–Markov theorem for best linear
unbiased estimators (e.g., Zyskind and Martin 1969; DeGroot and Schervish 2013)
are satisfied, then this approximation is asymptotically rigorous. It is clear that
there will be assets and/or liabilities where this is not the case or where one
is not in the asymptotic limit, and so one must use these estimators only after
they have been validated for bias and variance. How easy or otherwise this is to
validate is problem specific but it seems that in many cases the simple linear
predictor from an ordinary least-squares regression is adequate. In problem
cases, for example, where the data display heteroskedasticity, simultaneously
modeling the mean and the variance with regression functions can help (see
the generalized additive models for location, scale, and shape). For Pillar 2
ORSA, regression functions could be generated at each time period of interest.
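
The least-squares Monte-Carlo idea can be sketched on the same kind of toy problem. The setup is assumed rather than taken from the chapter: a fund following geometric Brownian motion, a put-like guarantee, only two inner risk-neutral paths per outer scenario, and a cubic polynomial in the 1-year fund value as the regression function. The Black–Scholes value serves only as a benchmark against which the fitted regression can be validated.

```python
import math
import numpy as np

def bs_put(spot, strike, r, sigma, tau):
    """Black-Scholes put, vectorised over spot; used only to validate the regression."""
    spot = np.asarray(spot, dtype=float)
    cdf = np.vectorize(lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))))
    sd = sigma * math.sqrt(tau)
    d1 = (np.log(spot / strike) + (r + 0.5 * sigma**2) * tau) / sd
    return strike * math.exp(-r * tau) * cdf(-(d1 - sd)) - spot * cdf(-d1)

rng = np.random.default_rng(seed=11)
s0, guarantee, r, mu, sigma = 100.0, 100.0, 0.02, 0.06, 0.2   # hypothetical parameters
horizon, maturity = 1.0, 10.0
tau = maturity - horizon
n_outer, n_inner = 5_000, 2            # only two inner paths per outer scenario

# Outer real-world scenarios at the 1-year horizon.
fund_1y = s0 * np.exp((mu - 0.5 * sigma**2) * horizon
                      + sigma * math.sqrt(horizon) * rng.standard_normal(n_outer))

# Very noisy inner valuations: the mean of two discounted risk-neutral payoffs.
z = rng.standard_normal((n_outer, n_inner))
fund_T = fund_1y[:, None] * np.exp((r - 0.5 * sigma**2) * tau + sigma * math.sqrt(tau) * z)
noisy_value = math.exp(-r * tau) * np.maximum(guarantee - fund_T, 0.0).mean(axis=1)

# Ordinary least-squares fit of a cubic polynomial in the scaled fund value.
x = fund_1y / s0
coeffs = np.polyfit(x, noisy_value, deg=3)
fitted = np.polyval(coeffs, x)

exact = bs_put(fund_1y, guarantee, r, sigma, tau)
mean_abs_error = float(np.mean(np.abs(fitted - exact)))
raw_abs_error = float(np.mean(np.abs(noisy_value - exact)))
scenario_budget = n_outer * n_inner
```

The regression pools information across outer scenarios, so the fitted values are far closer to the benchmark than the individual two-path estimates. The polynomial degree is a modeling choice here and would itself need validating for bias in the tails of the fund distribution.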
Finally, truly high-performance solutions, such as those offered by Willis
Towers Watson’s MoSes HPC or Aon Benfield’s PathWise(TM) (Phillips
2016), have only very recently become available. It may take some time before
these solutions can be fully absorbed and accepted by the insurance industry
and so, for the moment at least, approximations are de rigueur.

5.1.4 Layout
This chapter is therefore laid out as follows. In Section 5.2, I discuss the
make-up of a typical ESG including the major asset classes that it models.
At a high level, I will illustrate some of the challenges that face the model-
ing exercise such as parameter calibration and discuss some high-performance
computing techniques that can be deployed to accelerate scenario set gener-
ation.10 In Section 5.3, I will discuss copula co-dependency structures intro-
duced by Sklar (1959) and go on to introduce the risk scenario generator
(RSG). This is a natural extension of an ESG to allow for non-market risks
such as policyholder lapse and operational risk. In Section 5.4, I will illus-
trate in some detail two problems that are typical of the degree of complexity
faced by ESGs. There I give a new method to represent the marginal distri-
bution functions of composite variables. Given this representation, one can
simulate correlated instances of these variables through the copula-marginal
factorization. I conclude with a discussion in Section 5.5.

10 Special thanks are due to Colin Carson for many helpful discussions on HPC.

5.2 Economic Scenario Generators

Several commercial and free ESGs exist. Examples of commercially
available ESGs are: Moody’s Analytics’ RiskIntegrity Suite ESG (formerly the
Barrie and Hibbert ESG), Willis Towers Watson’s Star ESG, Numerix’
Oneview ESG, Conning’s GEMS ESG, Ortec’s Dynamic Scenario Generator,
and Deloitte’s XSG. A free ESG (Moudiki and Planchet 2016) is available
from the CRAN website of the R statistical software package, where
there are two relevant packages: ESG and ESGtoolkit. Compared
to its commercially available competitors, the rESG is somewhat limited in
the range of stochastic models it offers and, at the time of writing, it
supports only market-consistent modeling. A real-world ESG could theoretically
be obtained by estimating and inserting a drift correction into the underlying
stochastic differential equations (SDEs) but this often leads to unsatisfactory
real-world probability distributions. It is, however, a genuine achievement to
create and distribute a free scenario generator and credit is due to its creator,
Thierry Moudiki.
A significant risk in scenario generation is that of parameter calibration
of the underlying stochastic models. Not all scenario generators, commercial
or free, offer model calibration toolkits, which is certainly a serious limita-
tion any insurer should consider. As I will describe in Section 5.2.5, good
model calibrations are essential for an ESG but they often come at a cost of
significant analyst intervention. Moody’s Analytics, for example, offers a
calibration toolkit for technically adept users and also a calibration service
(both standard and bespoke) which addresses this challenge. Indeed, many
insurance companies have dedicated teams that provide internal model cal-
ibrations. Together with the problem of model calibrations, significant chal-
lenges to scenario generation also come from discretization and model error.
Discretization error arises from the choice of time-step size in a simulation.
Processes which are manifestly Gaussian are easier to treat (compare a Vasicek
model of interest rates with a Libor market model with stochastic volatility,
Wu and Zhang 2006). However, one should be aware that discretization error
comes not only from the stochastic shock terms but also from the deterministic
drift. The error in the drift is on the order of the time-step size19 and this
is cumulative in the number of time steps taken. If the drift is subject to a
systematic bias coming from discretization error, all probability distributions
will end up located in the wrong place. Smaller time steps certainly alleviate
this issue but are computationally more onerous, leading to increased run-time
and infeasibly large scenario sets. The associated I/O bottleneck as these sets
are uploaded to an ALM system can be punitive, and the temptation to use fewer
Monte-Carlo trials can have a deleterious effect on the standard errors in the
estimates of TVOG, MCEV, and ultimately the SCR.
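The cumulative drift bias can be made concrete for a lognormal asset with constant drift. The Python sketch below is illustrative only: the function names and parameter values are assumptions, not part of any ESG product. It compares the mean implied by an Euler–Maruyama scheme, in which each step multiplies the conditional mean by (1 + μΔt), with the exact continuous-time mean S0 exp(μT).

```python
import math

def euler_mean_gbm(s0: float, mu: float, horizon: float, n_steps: int) -> float:
    """Mean of the Euler-Maruyama scheme for dS = mu*S dt + sigma*S dW.

    Each step multiplies the conditional mean by (1 + mu*dt); the shock
    term has zero mean, so volatility does not enter.
    """
    dt = horizon / n_steps
    return s0 * (1.0 + mu * dt) ** n_steps

def exact_mean_gbm(s0: float, mu: float, horizon: float) -> float:
    """Exact mean E[S_T] = S_0 exp(mu*T) of the continuous-time process."""
    return s0 * math.exp(mu * horizon)

if __name__ == "__main__":
    s0, mu, horizon = 100.0, 0.05, 30.0      # illustrative values only
    exact = exact_mean_gbm(s0, mu, horizon)
    for n_steps in (30, 360, 7560):          # roughly annual, monthly, daily
        bias = euler_mean_gbm(s0, mu, horizon, n_steps) - exact
        print(f"steps={n_steps:5d}  bias in mean={bias:10.4f}")
```

Under these assumptions the bias is negative and shrinks as the step count grows, which is the trade-off against run-time described above.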
Difficulties incurred by model error are much subtler: if the given model of
choice does not fit the data very well one would wish to be able to detect this.
Good practice would be to fit more than one model and compare model diagnostic
plots. For instance, continuing the interest-rate example, perhaps a Libor
market model with stochastic volatility is
the preferred choice of interest rate model. It has many desirable properties: it
can model variations in implied volatility by maturity, tenor, and strike. It also
models market observable forward rates. If rates are expected to be positive,
it may be a very good choice indeed. However, it has a rather large number
of parameters that need adjusting to match to the market data (e.g., to caps
or floors, or to swaptions). The Vasicek model is a short rate model (it does
19 Under a Euler–Maruyama discretization.
not model a market observable quantity) and more limited in its ability to
describe market features such as implied volatility skew. It also admits
negative rates which, until recently, had been considered an undesirable
property for nominal
interest rates. The Vasicek model does have far fewer parameters and as such
is a much simpler model than a Libor market model with stochastic volatility.
If, after optimizing the mean-squared error, both models produce fits of
similar quality (similar optimal objective values and diagnostic plots), or if
the simpler model provides a better fit, then this calls into question the use
of the more complicated model. It is possible that the data simply do not
support the more detailed model even if that is the true generating process. A
simple analogy might be when one tries to regress noisy univariate response
data Y on a predictor X. The true process may be quadratic in X, or perhaps
even cubic or quartic in X; but if the data are insufficient to pin down the
model’s parameters, only noise will be represented by the more complicated
model.
These risks serve to illustrate the complexity and challenges involved in
creating an effective scenario generator. The Moody’s Analytics ESG won
the 2015 [Link] award for the best ESG based on a holistic approach ser-
vicing the insurance sector. Its RiskIntegrity suite included, in addition to
the base ESG, an Automation Module permitting an automatic calibration
and scenario set generation ability to any number of economies and models.
This alleviates a significant burden from insurance companies. In addition,
the Enterprise-level Proxy Generator allows clients to directly address the
nested-stochastic problem inherent in the SCR calculation under Pillar 1 of
Solvency 2.
Before I move on to motivate the basic ESG engine, its asset classes and
challenges, I refer the interested actuarial reader to Varnell (2011) who dis-
cusses scenario generators in the context of Solvency 2 (and to the references
contained therein).

5.2.1 What does an ESG do?


At a high level, an ESG provides a coherent way of evolving the possible
states of a financial market (or markets) through simulated time using the
Monte-Carlo simulation technique (e.g., Glasserman 2003). It will likely be
used to evolve the financial variables from several national economies simulta-
neously under an appropriate co-dependency structure. Care therefore needs
to be taken in how one sets both the intra- and inter-economy correlations.
From an insurance company’s point of view, the role of the ESG is to create
scenario sets that can be uploaded to its ALM system. The scenario sets gen-
erated from an ESG are the missing piece of the jigsaw which allows insurers
to make forecasts under the real-world P-measure or to price their assets and
liabilities under the market-consistent/risk-neutral Q-measure. As such, an
insurer will select the measure and variables so that an output scenario set
reflects the risks it faces on its balance sheet.
Each variable is modeled as a random process in time either using a
stochastic differential equation (SDE) or by an econometric/time-series model.
Typically, SDEs are used in a discretized form to model market risks (e.g.,
interest rate, credit, or equity risk) as there is financial theory underpinning
their dynamics that SDEs have been formulated to reflect. For non-market
risks, such as gross domestic product or an insurer’s mortality risk, an econo-
metric model may be preferred. This is because the economic theory for non-
market risks is much less well developed and more open to statistical modeling
although this partitioning is not strict. One may use an econometric model
for real-world interest rates for example. I will concentrate on SDE models in
this chapter.
A simple example of an SDE is the model of the lognormal asset-price
process sometimes referred to as geometric Brownian motion. Let St be the
price of a stock (or index) at time t ≥ 0. Then its return may be modeled as

$$ \frac{dS_t}{S_t} = \mu(S_t, t)\,dt + \sigma(S_t, t)\,dW_t, \qquad S_0 \text{ given} \tag{5.3} $$

Here μ is the drift which may depend on the price St and time t, and dt is
the deterministic differential in t. The second term on the right-hand side is
the purely random term. The volatility σ, like the drift μ, may depend on the
price St as well as on time t. The random Brownian component Wt is
written in terms of its stochastic differential dWt . Mathematically, Equation
5.3 is shorthand for a given integral representation that I omit here (but see,
e.g., Baxter and Rennie 1996). The properties of Wt are:

1. Initially, the process is zero: W0 = 0 a.s.
2. Increments are independent: Wt+Δt − Wt is independent of the sigma
   algebra generated by {Ws : 0 ≤ s ≤ t}, for each t, Δt ≥ 0.
3. Increments are normally distributed: Wt+Δt − Wt ∼ N(0, Δt).
4. Wt is continuous with probability 1.

The Euler–Maruyama discretized version (Glasserman 2003) of Equation 5.3 at
step iΔt is

$$ S_{i+1} = S_i\,[1 + \mu(S_i, i\Delta t)\,\Delta t] + S_i\,\sigma(S_i, i\Delta t)\,\Delta t^{1/2} Z_i, \qquad i \in \mathbb{Z}_+, \quad S_0 \text{ given} \tag{5.4} $$

where, in an abuse of notation, Si is written for SiΔt and Zi ∼ N(0, 1). To
evolve St one must be able to generate standard normal variables quickly.
The Box–Muller method (see Glasserman 2003) generates two standard normal
variables from two standard uniform variables in a computationally inexpensive
transformation. The generation of other random variables, as shall
be seen in Section 5.4, can be much more onerous. This is particularly true
since correlation of the risk variables is sought. By necessity, therefore, an
ESG designer is obliged to use some random number generator and several of
these exist: for example, Wichmann and Hill (1982, 1984) and the Mersenne
Twister (Matsumoto and Nishimura 1998). These are industry standard meth-
ods for random number generation and they work reasonably well. However,
one should always be cautious of random number generation: if an insuffi-
ciently elegant generator is used, the numbers emerging can eventually show
deterministic patterns.
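As a minimal sketch of the two ingredients just described, the following Python code implements the Box–Muller transform and uses it to drive one Euler–Maruyama path of the lognormal model with constant drift and volatility. The function names and parameter values are illustrative assumptions. Python's standard `random` module is used as the uniform source; it is based on the Mersenne Twister.

```python
import math
import random

def box_muller(u1, u2):
    """Map two independent uniforms (u1 in (0, 1], u2 in [0, 1)) to two N(0, 1) draws."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

def euler_gbm_path(s0, mu, sigma, dt, n_steps, rng):
    """One Euler-Maruyama path of dS/S = mu dt + sigma dW (constant mu, sigma)."""
    path = [s0]
    spare = None                      # second Box-Muller output, kept for the next step
    for _ in range(n_steps):
        if spare is None:
            u1 = 1.0 - rng.random()   # lies in (0, 1], which avoids log(0)
            z, spare = box_muller(u1, rng.random())
        else:
            z, spare = spare, None
        s = path[-1]
        path.append(s * (1.0 + mu * dt) + s * sigma * math.sqrt(dt) * z)
    return path

if __name__ == "__main__":
    rng = random.Random(2018)         # fixed seed so the run is reproducible
    monthly_path = euler_gbm_path(100.0, 0.05, 0.2, 1.0 / 12.0, 24, rng)
    print(len(monthly_path), monthly_path[-1])
```

Each uniform pair yields two normal draws, so the sketch keeps the second one for the following time step rather than discarding it.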
Many introductory textbooks on financial engineering (e.g., Hull 2005;
Baxter and Rennie 1996) show, by means of the Girsanov theorem, how a
simple change-of-measure may relate the Brownian shocks dWtP under the
P-forecasting measure to those under the pricing measure Q:

$$ W_t^{Q} = W_t^{P} + \lambda t \tag{5.5} $$

Here λ is the all-important market-price-of-risk for the lognormal process
(Equation 5.3), modeling the compensation per unit of volatility an investor
should receive for bearing risk in the real world. The change-of-measure often
translates simply into a deterministic correction to an asset’s growth rate
although in more complicated models this correction may itself be stochastic
(e.g., the CIR process Hull 2005). An important consequence of arbitrage-
free-pricing is that under the risk-neutral Q-measure, all assets grow with the
growth rate of the numeraire or reference asset of that class. To see how this
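sets the market price of risk, observe that in the case of a constant interest
rate r, drift μ and volatility σ, Equation 5.3 has the solution (using Ito’s
formula):

\begin{align}
S_t &= S_0 \exp\left[\left(\mu - \tfrac{1}{2}\sigma^2\right)t + \sigma W_t^{P}\right] \notag \\
    &= S_0 \exp\left[\left(\mu - \lambda\sigma - \tfrac{1}{2}\sigma^2\right)t + \sigma W_t^{Q}\right] \tag{5.6} \\
    &= S_0 \exp\left[\left(r - \tfrac{1}{2}\sigma^2\right)t + \sigma W_t^{Q}\right] \tag{5.7}
\end{align}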

where λ is necessarily (μ−r)/σ. This highlights the duality that exists between
the P and Q measures and illustrates, here in a simple setting, the two
paradigms under which an ESG must be able to simulate. One observes the
stock-price process Equation 5.6 and is able, at least in principle, to find an
estimate
for μ. For pricing purposes, however, one must use the growth rate r from the
numeraire asset and simulate using Equation 5.7. If the price today of a 3-
month equity put option is sought, then the numeraire asset is the 3-month
nominal interest rate government bond. I say it is possible to estimate μ “in
principle” because this can be a challenging exercise in robust statistical esti-
mation. One must appeal to data to find a suitable historical period and then
use the method of moments, maximum likelihood or use a Bayesian approach
to estimate μ. The question of precisely which historical period to use becomes
paramount: too short a period and sampling error will dominate an estimate, too
long and the lack of stationarity may bias an estimate. A more robust approach
might be to make an economic assumption about the long-term trend in the
growth rates, taking a holistic view of the entire market.
One may then observe r and compute a risk premium μ∗ = μ − r = λσ. A cal-
ibration approach in the real world is therefore to produce a consistent set of
long-term calibration targets in terms of assets’ risk premia. For more detailed
assets, such as real or nominal interest rates, or credit, the calculation of risk-
premia becomes more challenging as a term structure of targets emerges: one
is liable to make functional assumptions about risk-premia. When making
assumptions on risk premia, one must assess their sensitivity to economic
assumptions. This is a major challenge in economic scenario generation made
more difficult owing to the amount of analyst intervention that is needed to
estimate the risk-premia. Unlike a market-consistent calibration, where
market data are readily available, the situation in real-world modeling is
much more fluid and open to interpretation.
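To see why the choice of historical period matters so much, note that under the lognormal model with constant parameters the annualized mean log-return estimated from a price history of length T years has standard error σ/√T, however finely the history is sampled. The sketch below is illustrative only: the function names are hypothetical and σ is assumed known. It evaluates this standard error together with the market price of risk λ = (μ − r)/σ.

```python
import math

def drift_standard_error(sigma: float, years: float) -> float:
    """Standard error sigma / sqrt(T) of the annualized mean log-return.

    The estimate depends only on the first and last prices of the
    history, so sampling more frequently does not reduce the error.
    """
    return sigma / math.sqrt(years)

def market_price_of_risk(mu: float, r: float, sigma: float) -> float:
    """lambda = (mu - r) / sigma: risk premium per unit of volatility."""
    return (mu - r) / sigma

if __name__ == "__main__":
    sigma = 0.20                              # illustrative equity volatility
    for years in (5, 25, 100):
        se = drift_standard_error(sigma, years)
        print(f"T={years:3d} years: standard error of drift = {se:.4f}")
    print("lambda =", market_price_of_risk(0.07, 0.03, sigma))
```

With a volatility of 20%, even 25 years of data leave a standard error of four percentage points on the drift, which is comparable in size to a typical equity risk premium. This is one motivation for setting risk premia by economic assumption rather than by estimation alone.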
The requirement of assets’ growth to be fixed at the numeraire rate in the
market-consistent setting affords us a particularly useful yet simple valida-
tion technique that is essential for any well-functioning ESG. Specifically, one
may use a simple charting functionality to demonstrate how well a market-
consistent ESG is functioning. As another example, consider a 10-into-20-year
receiver swaption struck at par. This is a derivative contract giving the holder
the right but not the obligation to enter into a 10-into-20-year swap contract in
10 years’ time. If market swap rates fall below the current par-yield 10 years
hence, then the holder may exercise his or her right to enter into the more
valuable swap contract at its option maturity date. Otherwise, the swaption
matures worthless as swap rates are more valuable in the market at maturity.
The reference asset for this swaption is a 20-year fixed term deferred annuity
whose first payment is in 10 years’ time. Arbitrage theory coupled with the
Girsanov theorem tells us that unless the growth rates of the swaption and the
annuity are the same then a risk-free profit may be locked in by constructing
a portfolio which is short one of the assets and long the other in a way that is
known in advance. One way to validate the output from a risk-neutral ESG
is to check that such derivative securities are martingales. Recall that a
process Mt is a martingale with respect to a numeraire Bt if

$$ E_Q\!\left[\left.\frac{M_t}{B_t}\,\right|\,\mathcal{F}_s\right] = \frac{M_s}{B_s}, \qquad 0 \le s \le t; \tag{5.8} $$

in the filtered probability space (Ω, F, {Ft}t≥0, Q). So, one may reasonably
compute the value of the 10-into-20-year swaption Mt over simulated time and
plot the ratio of it with the deferred annuity Bt . If the resulting time-series
plot shows an absence of a trend (i.e., an absence of growth) and wanders
randomly around a value of 1 then one may have confidence that the pricing
measure is sound. This may be repeated for the term structure of swaptions
struck at par (or otherwise), with error bars overlaid, for a compact
validation plot.
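The same validation idea can be sketched in a much simpler setting than the swaption example. The Python code below assumes a single lognormal asset under Q with constant rate and volatility and takes the cash account exp(rt) as numeraire; these are simplifying assumptions, not the swaption/annuity pair discussed above. It reports the sample mean and standard error of the deflated asset at each time step. All names are hypothetical.

```python
import math
import random
import statistics

def martingale_check(r, sigma, dt, n_steps, n_paths, seed=0):
    """Return a list of (mean, standard error) of S_t / (S_0 * B_t), one per step."""
    rng = random.Random(seed)
    log_s = [0.0] * n_paths           # log(S_t / S_0) on each path
    results = []
    for step in range(1, n_steps + 1):
        for p in range(n_paths):
            z = rng.gauss(0.0, 1.0)
            # Exact lognormal step under Q: drift r less the Ito term sigma^2 / 2.
            log_s[p] += (r - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z
        log_b = r * step * dt         # log of the cash account B_t = exp(r t)
        ratios = [math.exp(x - log_b) for x in log_s]
        mean = sum(ratios) / n_paths
        stderr = statistics.stdev(ratios) / math.sqrt(n_paths)
        results.append((mean, stderr))
    return results

if __name__ == "__main__":
    for year, (mean, se) in enumerate(martingale_check(0.03, 0.2, 1.0, 10, 4000), 1):
        print(f"t={year:2d}  mean ratio={mean:.4f}  +/- {1.96 * se:.4f}")
```

Plotting the means with their error bars against time gives the compact validation chart described in the text: a sound pricing measure shows no trend away from 1 beyond sampling error.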
Several challenges face this approach to validation. First, a typical ESG
works in discrete-time over time steps, which are perhaps monthly or annual.
It does not work in the continuous-time framework required by the finan-
cial theory and so one may fail to validate Mt as a martingale when it is
indeed a martingale in discrete time. Indeed, one needs to make a choice
regarding the equivalent measure to be used as a reference: often the dis-
cretely compounded rolled-up cash account is used. A Girsanov transform
is then employed whenever this is not the appropriate numeraire. However,
an additional drift correction will be needed to correct for the discrete-time
approximation to continuous-time that is being made. Fortunately, this does
not present any burden, simply something to be mindful of. Another practical
issue for ESG users is the precise number of ESG sample paths needed: too
many and an ALM system becomes prohibitively slow, too few and sampling
error will dominate to the extent that illusory trends in the drift may be seen.
A smaller number of paths is obviously preferred, but how small? Given a reg-
ulator may require an insurer to offer some evidence that a pricing measure
is indeed arbitrage-free at a given significance level α = 1% (say) there is no
way around the hurdle of having to use a sufficient sample size to demonstrate
the martingale property. One possible work-around may be to generate more
scenarios than can be handled by the ALM system in a reasonable time, verify
its martingale property and then optimally select the best scenario subset of
desired size that minimally disturbs the distribution while constraining the
drift to be the drift of the full set. At the very least, this disturbance would
only be a drift correction from the noisy (incorrect) drift of the small sample
to the more accurate drift of the full sample.
Although scenario generation under the pricing measure Q relies on con-
cepts from theoretical financial engineering, insurance users will wish to extend
the interest rate models over periods of time that naturally cover their lia-
bilities (and assets). This becomes problematic when market data are sparse
or illiquid. To cover their policyholder liabilities which tend to be long-dated,
insurers naturally hold long-dated bonds and contingent claims on these. How-
ever, no market exists for these in any great depth: it is largely illiquid and
insurance companies are, in effect, making the market. Although illiquidity
is not a problem for insurers per se20 the difficulty that emerges is more a
regulatory one. An insurer is required to find an appropriate fair value for
their assets, illiquid or otherwise. In the absence, therefore, of a viable risk-
neutral measure economic theory can be useful. It can be used to establish
unconditional forward rates but these are assumptive in nature and should be
justified (e.g., by a sensitivity analysis). Salahnejhad and Pelsser (2015) give
a theoretical basis for pricing in illiquid markets. In any event, an ESG makes
certain assumptions and insurance companies must be aware of them.

20 Note that it does expose them to policyholder lapse and surrender risk should they need
access to cash quickly in times of market distress.


In the sequel, I will continue my discussion of SDE models by introducing
a taxonomy of asset classes beginning with what is potentially the most
fundamental construct in an ESG: the yield curve. Several techniques exist for
extracting the current yield curve from market data. Under Solvency 2, the
method of choice is due to Smith and Wilson (see EIOPA-BoS-15/035 2015).
I offer an alternative due to Antonio and Roseburgh (2010): a cubic spline
method where there is market data available and an extrapolation beyond
the last market data using the Nelson and Siegel (1987) functional form as is
implemented by Moody’s Analytics. The cubic spline is latent, meaning that
the market data themselves are not interpolated; rather, for an a priori fixed
set of knot points, a given criterion is minimized over the space of cubic
splines satisfying given smoothness conditions. This is often a characteristic
of ESG modeling: the challenge is to extract a quantity of interest that is not
directly observable, of which the instantaneous forward rate representation of
the yield curve is an example.
Following interest rate modeling, I will briefly discuss credit and then equi-
ties before concluding the section with a note on high-performance computing
in insurance and the important considerations of calibrating models to market
data. Although of obvious importance to ESG, I will delay discussion of co-
dependency structures until Section 5.3 on risk scenario generators (RSGs):
co-dependency being of equivalent importance to both economic and market
risks. In the interests of space, I therefore omit discussion of the important
asset classes of foreign exchange and inflation, and focus on market consistent
modeling. I leave the interested reader to follow up these topics with the
references McNeil et al. (2015) and Brigo and Mercurio (2006), and those
contained therein.

5.2.2 The yield curve


The yield curve is the most fundamental ingredient to an ESG. It under-
pins most of the other asset classes. For instance, an equity option price is the
expectation under the risk-neutral measure of the discounted option payoff.
Discount factors naturally come from the nominal yield curve of a given econ-
omy where the option is traded and are expressed using the yield curve spot
rate. Calibration of a yield curve model is the first step in the introduction of a
full interest rate model that I will discuss in the next section. For the moment,
I will describe a method that can be used for modeling the yield curve on a
given date using bond and swap data that is due to Antonio and Roseburgh
(2010). It is relatively robust, allows a mixture of bonds and swaps to be used
where they are liquid and matches to a specialized functional form beyond the
last available maturity. Although my discussion is aimed at capturing nominal
yield curves, there is no reason why this method cannot be applied to model
the real yield curve. One would simply replace the government treasury bonds
and swap data with inflation proofed index-linked gilts and savings certificates
(in the United Kingdom).
This is an alternative approach to that of Smith and Wilson
(EIOPA-BoS-15/035, EIOPA 2015) which is the method of choice under Solvency 2.
As with Smith and Wilson, calibration may be to bonds, swaps, or both; however,
the yield curve data are not interpolated but are instead represented by a
cubic spline with a Nelson–Siegel functional extrapolation. The instantaneous
forward rate curve f (t) is modeled as latent: it is not directly observed;
rather, the parameters of the model are extracted by minimizing a functional.
Assume the relationship between the bond prices and f (t) is, for bonds with
coupons c ≥ 0 and frequency ω per annum:

$$ P(T; c) = \sum_{i=1}^{n} (\delta_{in} + c)\,\exp\left(-\int_0^{t_i} f(t)\,dt\right), \qquad n = \omega T. \tag{5.9} $$
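Equation 5.9 can be evaluated numerically for any candidate forward curve. The following Python sketch makes two assumptions that the text does not state explicitly: payment dates are taken as t_i = i/ω, and c is read as the coupon paid at each payment date per unit notional. The function names are hypothetical and the integral is approximated with the trapezoidal rule.

```python
import math
from typing import Callable

def discount_factor(forward: Callable[[float], float], t: float, n_grid: int = 200) -> float:
    """exp(-integral_0^t f(s) ds), using the composite trapezoidal rule."""
    if t == 0.0:
        return 1.0
    h = t / n_grid
    total = 0.5 * (forward(0.0) + forward(t))
    total += sum(forward(k * h) for k in range(1, n_grid))
    return math.exp(-h * total)

def bond_price(forward: Callable[[float], float], maturity: float,
               coupon: float, freq: int) -> float:
    """Equation 5.9 with payment dates t_i = i / freq (an assumption)."""
    n = int(round(freq * maturity))
    price = 0.0
    for i in range(1, n + 1):
        cash = coupon + (1.0 if i == n else 0.0)   # delta_in adds the redemption at t_n
        price += cash * discount_factor(forward, i / freq)
    return price

if __name__ == "__main__":
    flat_curve = lambda t: 0.03                    # illustrative flat forward curve
    print(bond_price(flat_curve, 5.0, 0.0, 1))     # zero-coupon bond
    print(bond_price(flat_curve, 5.0, 0.02, 2))    # semi-annual coupon bond
```

For a flat forward curve the zero-coupon case reduces to exp(−fT), which gives a quick sanity check on the quadrature.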

The forward rates are represented by a latent cubic spline modeling the
forward rates curve at the K + 1 a priori specified knot points tk for
k = 0, 1, . . . , K (tK = TM , the longest bond maturity) with a Nelson and
Siegel (1987) extrapolation in tK < t ≤ tmax (typically tmax = 120 years):

$$ f_{ns}(t\,|\,\beta) = \beta_1 + [\beta_2 + \beta_3 (t - t_{\max})]\,e^{-\lambda (t - t_{\max})}, \qquad \beta = (\beta_1, \beta_2, \beta_3) \in \mathbb{R}^3; \tag{5.10} $$

as

$$ f(t\,|\,\theta_{1:K}) = \sum_{k=1}^{K} I(t_{k-1} \le t < t_k)\, s(t\,|\,\theta_k) + I(t_K \le t \le t_{\max})\, f_{ns}(t\,|\,\beta); \tag{5.11} $$

where each of the K cubic splines is parametrized by a parameter vector
$\theta = (a, b, c, d)^{\top} \in \mathbb{R}^4$ with $s(t\,|\,\theta) = a + bt + ct^2 + dt^3$.
The Nelson and Siegel rate parameter λ is set to an economically plausible
O(0.01) value and Antonio and Roseburgh (2010) set β1 = ftmax −β2 to fix the
unconditional forward rate21 at ftmax . To ensure a smooth transition to the
extrapolated forward curve from the last available market data point at TM , a
first derivative “gradient” smoothing penalty of magnitude w1 is applied over
the final 20% of the available market data: [T2 , TM ]. To guard against over-
fit, a second derivative curvature penalty of magnitude w2 is applied from the
first 20% of the available market data till TM : [T1 , TM ].
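A direct transcription of the extrapolant in Equation 5.10, with the constraint β1 = ftmax − β2 imposed, might look as follows. This is a sketch only: the function name, argument order and numerical values are assumptions, and the formula is coded as printed.

```python
import math

def nelson_siegel_forward(t, beta2, beta3, lam, f_tmax, t_max=120.0):
    """Equation 5.10 with beta1 = f_tmax - beta2, so that f(t_max) = f_tmax."""
    beta1 = f_tmax - beta2
    x = t - t_max
    return beta1 + (beta2 + beta3 * x) * math.exp(-lam * x)

if __name__ == "__main__":
    # Illustrative parameters only; lam is of order 0.01 as suggested in the text.
    for t in (50.0, 80.0, 120.0):
        print(t, nelson_siegel_forward(t, beta2=0.01, beta3=0.0001, lam=0.01, f_tmax=0.04))
```

Evaluating at t = tmax returns ftmax whatever the values of β2 and β3, which is the role of the constraint on β1.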
The bond price may then be written succinctly as

$$ P(T\,|\,\Theta) = \exp(-\tau_T^{\top}\Theta) \tag{5.12} $$

where $\tau_T \in \mathbb{R}^{4K+3}$ contains the parameter-free knot point detail from the
cubic spline and $\Theta = \operatorname{vec}(\theta_{1:K}, \beta)^{\top} \in \mathbb{R}^{4K+3}$ contains the parameters from
each spline component and the extrapolant. If the modeled forward rates given
in Equation 5.11 evaluated at the knot points $t_{0:K}$ are
$f = (f_0, f_1, \ldots, f_K)^{\top} \in \mathbb{R}^{K+1}$, then the relationship between the spline’s
parameter vector Θ and the
21 The forward rate at the longest modeled maturity tmax .
discrete forward rates vector f is found through the matrix–vector relationship
arising from the cubic spline:

$$ A\Theta = Bf + u, \qquad A \in \mathbb{R}^{(4K+3)\times(4K+3)}, \quad B \in \mathbb{R}^{(4K+3)\times(K+1)}, \quad u \in \mathbb{R}^{4K+3}. \tag{5.13} $$

Here A is a known invertible matrix and B is a known rectangular matrix. The
vector u is defined as $u := f_{t_{\max}}\,\hat{e}_{4K+3}$, where $\hat{e}_k$ is the unit basis vector along
the kth axis, k ∈ [1, . . ., 4K + 3]. Most importantly, the cubic spline
matrix–vector relationship Equation 5.13 shows that the parameter vector Θ can
be eliminated in favor of the forward rates vector f. If the set of bond
price/duration data is denoted by $\mathcal{P} = \{(P_j, D_j) \in \mathbb{R}^2 : j = 1, \ldots, M\}$ and
modeled bond prices by $\hat{P}_j(f)$ using Equation 5.12, then the objective criterion is

$$ g(f\,|\,\mathcal{P}) = \sum_{j=1}^{M}\left(\frac{P_j - \hat{P}_j(f)}{D_j}\right)^{2} + w_1 \int_{T_1}^{T_M}\left(\frac{\partial f}{\partial t}[t\,|\,\Theta(f)]\right)^{2} dt + w_2 \int_{T_2}^{T_M}\left(\frac{\partial^2 f}{\partial t^2}[t\,|\,\Theta(f)]\right)^{2} dt \tag{5.14} $$

As a direct consequence of (5.13) g may be directly minimized over f rather
than over the spline-coefficient vector Θ. Applying a nonlinear optimization
technique such as Levenberg–Marquardt (Transtrum and Sethna 2012; Press
et al. 2007) leads to the solution f̂.
In the case that swap rates are also required to aid in the calibration of
the yield curve, note that the T -year forward swap rate at time t = 0 is
$$ S(T) = \frac{1 - \prod_{j=1}^{n}[1 + f(T_j)/\omega]^{-1}}{\sum_{i=1}^{n}\omega^{-1}\prod_{j=1}^{i}[1 + f(T_j)/\omega]^{-1}} \tag{5.15} $$

and it is noted that the swap rate is a function of the forward rates at the
required knot points: S[Tj |Θ(f)]. The objective Equation 5.14 may be updated
by adding a term:
$$ w_0 \sum_{j=1}^{M}\left(S_j - \hat{S}[T_j\,|\,\Theta(f)]\right)^{2} \tag{5.16} $$

Some adjustment of the relative proportions of the objective function
hyperparameters w0 , w1 , and w2 is needed; this can be handled by 10-fold
cross-validation (Hastie et al. 2009) or another out-of-sample validation
technique.
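The swap rate in Equation 5.15 is straightforward to compute once the forward rates at the payment dates are known. The sketch below is a hypothetical helper, not taken from any ESG. It accumulates the running product of one-period discount factors and the corresponding annuity.

```python
def par_swap_rate(forwards, freq):
    """Equation 5.15 from the forward rates f(T_1), ..., f(T_n) and frequency omega."""
    discount = 1.0
    annuity = 0.0
    for f in forwards:
        discount /= 1.0 + f / freq    # running product of [1 + f(T_j)/omega]^-1
        annuity += discount / freq    # omega^-1 times the product up to date i
    return (1.0 - discount) / annuity

if __name__ == "__main__":
    flat = [0.03] * 20                # 10 years of semi-annual payments, flat at 3%
    print(par_swap_rate(flat, 2))
    rising = [0.01 + 0.001 * k for k in range(20)]
    print(par_swap_rate(rising, 2))
```

When every forward rate equals the same value f, the expression collapses algebraically to f, which provides a simple check on an implementation before it is used inside the objective function.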

5.2.3 Nominal interest-rate models


Dynamic models of the interest rate typically come in one of two flavors:
short rate and forward rate models. Models of the short rate rt are invariably
the simplest attempting only to model the interest rate manifesting itself over
an infinitesimal period of time. This is not a market observable quantity.
However, it can be integrated to produce zero coupon bond prices:


$$ P(t, T) = E_Q\left[\left.\exp\left(-\int_t^T r_s\,ds\right)\,\right|\,\mathcal{F}_t\right] $$

where P (t, T ) is the maturity-T zero coupon bond viewed at time t ≥ 0,
Q is the risk-neutral measure and Ft records the market history up until
time t. From these, market observable coupon bearing bond prices may be
derived. Derivative quantities are valued in a similar way. Once one has spec-
ified dynamics for the short rate, the pricing problem comes down to one of
integration. For a certain class of interest rate models there can be a significant
simplification in the integration, specifically for the affine term structure mod-
els. Here the (exponentially) affine form for the zero-coupon bond is sought:

$$ P(t, T) = \exp[A(t, T) - B(t, T)\,r_t] $$

The challenge is then to find functions A and B consistent with the dynamics
for rt :
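$$ dr_t = \mu(t, r_t)\,dt + \sigma(t, r_t)\,dW_t $$

If one seeks affine functions of the short rate for the drift and (squared)
volatility, that is, μ(t, r) = α(t)r + β(t) and σ²(t, r) = γ(t)r + δ(t), then a
coupled system of ordinary differential equations emerges (maturity T is
considered a parameter):

\begin{align*}
\frac{dA}{dt} - \beta(t)B + \tfrac{1}{2}\delta(t)B^2 &= 0, & A(T, T) &= 0 \\
\frac{dB}{dt} + \alpha(t)B - \tfrac{1}{2}\gamma(t)B^2 &= -1, & B(T, T) &= 0
\end{align*}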
For the choice μ = b − ar with σ constant, so that α = −a, β = b, γ = 0 and
δ = σ², the Vasicek dynamic model emerges:

$$ dr_t = (b - a r_t)\,dt + \sigma\,dW_t $$

with A and B functions

\begin{align*}
A(t, T) &= a^{-2}[B(t, T) - T + t]\left(ab - \tfrac{1}{2}\sigma^2\right) - \tfrac{1}{4}a^{-1}\sigma^2 [B(t, T)]^2 \\
B(t, T) &= a^{-1}[1 - \exp(-a(T - t))]
\end{align*}
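These closed forms translate directly into code. The following Python sketch uses hypothetical function names and illustrative parameter values. It evaluates the Vasicek zero-coupon bond price from A and B as given above.

```python
import math

def vasicek_B(a, t, T):
    """B(t, T) = [1 - exp(-a (T - t))] / a."""
    return (1.0 - math.exp(-a * (T - t))) / a

def vasicek_A(a, b, sigma, t, T):
    """A(t, T) for the dynamics dr = (b - a r) dt + sigma dW."""
    B = vasicek_B(a, t, T)
    return (B - T + t) * (a * b - 0.5 * sigma**2) / a**2 - sigma**2 * B**2 / (4.0 * a)

def vasicek_bond_price(r_t, a, b, sigma, t, T):
    """Zero-coupon bond price P(t, T) = exp[A(t, T) - B(t, T) r_t]."""
    return math.exp(vasicek_A(a, b, sigma, t, T) - vasicek_B(a, t, T) * r_t)

if __name__ == "__main__":
    # Illustrative parameters: long-run mean b / a = 3%, volatility 1%.
    a, b, sigma = 0.1, 0.003, 0.01
    for T in (1.0, 5.0, 10.0, 30.0):
        print(T, vasicek_bond_price(0.02, a, b, sigma, 0.0, T))
```

A useful check on any implementation is that P(T, T) = 1 and that A and B satisfy the two ordinary differential equations above when differentiated numerically in t.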

The conditional distribution for the short rate given any time horizon is normal
and the benefit of such simple dynamics is in its analytical tractability: a
model such as the Vasicek model is easily implemented in an ESG. However, a
drawback of this model is precisely that its conditional distribution is normal:
this implies that short rates can become negative and, until recently, nominal
rates have not
been negative. The Vasicek model still has its place as a real interest rate
model where rates, measuring the purchasing power of a basket of goods, can
realistically be negative. Still, a single-factor model may not produce enough
dispersion; for this one may reasonably take a two-factor Vasicek model, which
introduces a second SDE to model the long-term mean. Typically, one may work
with

\begin{align*}
dr_t &= \alpha_1 (m_t - r_t)\,dt + \Sigma_1\,dW_t^{r} \\
dm_t &= \alpha_2 (\mu - m_t)\,dt + \Sigma_2\,dW_t^{m} \\
\langle dW_t^{r}, dW_t^{m} \rangle &= 0
\end{align*}


While this leads to a more flexible affine term-structure model, it does not
yet allow for the initial yield curve to be modeled. Rather, at time zero some
Vasicek implied yield curve emerges over which there is no control. To obviate
this challenge, one moves to the more flexible framework of the Hull and White
model (see Hull and White 1990; Hull 2005). The two-factor Hull and White
model is

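\begin{align*}
r_t &= \phi(t) + x_t^{1} + x_t^{2} \\
dx_t^{i} &= -\alpha_i^{h}\,x_t^{i}\,dt + \sigma_i\,dW_t^{(i)}, \qquad i = 1, 2, \\
\langle dW_t^{1}, dW_t^{2} \rangle &= \rho\,dt
\end{align*}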

where, crucially, introduction of the function φ(t) allows the initial yield curve
to be matched exactly. If it is chosen to be the Vasicek implied initial yield
curve, then φ(t) ≡ μ and the Vasicek and Hull and White models may then
be related by the simple change of state variables:

\begin{align*}
r_t &= \mu + x_t^{1} + x_t^{2} \\
m_t &= \mu - \frac{\alpha_2 - \alpha_1}{\alpha_1}\,x_t^{2}
\end{align*}

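which precludes α2 = α1 (but this is easily remedied by considering the special
case in isolation). The parameters are related as follows:

$$ \alpha^{h}_{1,2} = \alpha_{1,2}, \qquad \sigma_1 = \sqrt{\Sigma_1^2 + \Sigma_2^2\left(\frac{\alpha_1}{\alpha_2 - \alpha_1}\right)^{2}}, \qquad \sigma_2 = \left|\frac{\alpha_1}{\alpha_2 - \alpha_1}\right|\Sigma_2, \qquad \rho = -\frac{\Sigma_2}{\sigma_1}\left|\frac{\alpha_1}{\alpha_2 - \alpha_1}\right| $$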
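The parameter mapping can be written as a small helper. The sketch below uses a hypothetical name and signature, and simply refuses the excluded case α1 = α2. It returns the Hull and White volatilities and correlation implied by a two-factor Vasicek parameter set.

```python
import math

def vasicek_to_hull_white(alpha1, alpha2, Sigma1, Sigma2):
    """Map two-factor Vasicek volatilities to Hull and White (sigma1, sigma2, rho).

    The mean-reversion speeds carry over unchanged, so only the
    volatilities and the correlation are returned.
    """
    if alpha1 == alpha2:
        raise ValueError("alpha1 == alpha2 must be treated as a special case")
    k = alpha1 / (alpha2 - alpha1)
    sigma1 = math.sqrt(Sigma1**2 + (Sigma2 * k) ** 2)
    sigma2 = abs(k) * Sigma2
    rho = -(Sigma2 / sigma1) * abs(k)
    return sigma1, sigma2, rho

if __name__ == "__main__":
    # Illustrative values only.
    print(vasicek_to_hull_white(alpha1=0.5, alpha2=0.05, Sigma1=0.01, Sigma2=0.008))
```

Because ρ equals −σ2/σ1 and σ1 ≥ σ2 by construction, the implied correlation always lies in [−1, 0], which is consistent with the mapping from uncorrelated Vasicek shocks.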

A consequence, therefore, is that under the Hull and White representation,
the stochastic factors now have a non-zero correlation coefficient. In fact, it
is manifestly negative. The main drawback of the Vasicek and Hull and
White models is that they are fundamentally normal, implying that the short
rate may become negative. Until recently, it was not anticipated that nominal
interest rates could become negative. One final affine model is available:
the multi-factor Cox–Ingersoll–Ross (CIR) model. Here, the interest rates are
guaranteed to be positive since the solution of the CIR SDE is a scaled non-
central χ2 -variable. I defer discussion of the CIR process until the next section
on credit models. The final short rate model I will describe is the two-factor
Black–Karasinski model. Its dynamics are

\begin{align*}
d\ln r_t &= \alpha_1 [\ln m_t - \ln r_t]\,dt + \sigma_1\,dW_1 \\
d\ln m_t &= \alpha_2 [\mu - \ln m_t]\,dt + \sigma_2\,dW_2
\end{align*}

The conditional and unconditional distributions of rt are log-normal, the
benefit of which is that rt can never be negative. This desirable feature is
rather offset by the analytical intractability of this model. Not being of the
affine class, its calibration is hampered by the lack of any analytically
tractable formulae for its bond prices or swaptions. Instead, a two-dimensional
recombinant binomial tree can be used for both calibration and simulation. The
challenge here is to store the tree sensibly so that its time step is
sufficiently small to obviate discretization error of the SDEs.
Heath, Jarrow, and Morton (1990) described a framework for the evolution
of the continuous forward rates term structure. If f (t, T ) is the continuous
compounding T -year forward rate at time t > 0 then under the HJM frame-
work its evolution is simply described by

$$ df(t, T) = \mu(t, T)\,dt + \Sigma(t, T)^{\top} dW_t $$

Here, Σ is a d-dimensional vector of factor loadings for the d-dimensional
vector of Brownian increments dWt . In a market-consistent risk-neutral
setting, the drift term is completely specified in order that f (t, T ) is driftless
with respect to its numeraire. Note that it is typically non-zero as Wt is mea-
sured with respect to a common numeraire for simulation purposes: the cash
account (say). Non-zero drift terms appear in the risk-neutral setting owing
to an application of the Girsanov transform.
Practical use of the HJM framework is hampered by the fact that the state
variables f (t, s) are not market observable quantities. Brace, Gatarek, and
Musiela (1997) mitigated this problem by introducing a model that would
become something of an industry standard: the Libor Market Model (also
referred to as the BGM model). This models the forward rates that are traded
in the market. In London, these are the Libor rates but the model applies
widely to any market (Euribor, for example). If K distinct Libor forward
rates Fk (t, T ), each with maturity Tk (k = 1, . . . , K), are being modeled,
the BGM model is a log-normal model in Fk,t :

$$ \frac{dF_{k,t}}{F_{k,t}} = \sigma^{\top} dW_t $$

for vectors of length p of local volatilities σ and Brownian shocks Wt (each
specified under its own numeraire). Another departure from the HJM framework
is that the BGM model uses discrete compounding rather than continuous
compounding. The relationship between the kth rate and the zero-coupon
bond prices is

$$F_{k,t} = \frac{1}{T_{k+1} - T_k}\left(\frac{P(t, T_k)}{P(t, T_{k+1})} - 1\right)$$

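
The relationship above is simple enough to check numerically. The following
is a minimal sketch in Python; the function name `forward_libor` and the
discount curve are hypothetical and chosen only for illustration.

```python
def forward_libor(p_start: float, p_end: float, accrual: float) -> float:
    """Simply compounded forward rate for the period [T_k, T_{k+1}].

    p_start is P(t, T_k), p_end is P(t, T_{k+1}) and accrual is T_{k+1} - T_k.
    """
    if accrual <= 0.0 or p_start <= 0.0 or p_end <= 0.0:
        raise ValueError("bond prices and accrual period must be positive")
    return (p_start / p_end - 1.0) / accrual


# Hypothetical discount curve: maturity in years -> zero-coupon bond price.
curve = {0.5: 0.990, 1.0: 0.978, 1.5: 0.965}
maturities = sorted(curve)

# One forward rate per adjacent pair of maturities.
forwards = [
    forward_libor(curve[t0], curve[t1], t1 - t0)
    for t0, t1 in zip(maturities, maturities[1:])
]
```

Note that the forward rate is negative whenever the longer-dated bond is worth
more than the shorter-dated one, which is relevant to the discussion of
negative rates that follows.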
A first criticism of the basic BGM model is its insensitivity to skew in
market-implied volatility data: for example, European-style payer or
receiver swaptions struck away from par show implied volatilities that differ
from those struck at par. Market-implied volatility data depend on three parameters:
the maturity of the swaption, the tenor of the underlying swap contract and
the strike relative to the par yield curve at contract inception. An insurer
whose liabilities are exposed to a fall in interest rates may purchase a receiver
swaptions portfolio to hedge this risk away. The insurer may like to strike
the swaptions in the hedge at the guaranteed rate of interest it has promised
to its policyholders. This is unlikely to be at the current par yield and so a
correct price is sought away-from-the-money. The BGM model is insensitive
to this strike; put another way, the BGM implied volatility (hyper-) surface
is constant across strike. This problem may be alleviated by the introduction
of a stochastic volatility process Vt modeled by, perhaps, a CIR process. The
downside here is that the analytical formula for the European swaption price is
lost and one must appeal to semi-analytical formulae such as that found by
Wu and Zhang (2006).
A second criticism of the BGM model is in its fundamental assumption of
log-normality. This precludes rates ever becoming negative but as illustrated in
the introduction nominal rates can become negative. Market-implied volatili-
ties in the log-normal BGM model require the initial yield curve to be every-
where positive, however. Moreover, whenever the yield-curve approaches zero
it becomes computationally onerous to compute log-normal implied volatili-
ties from swaption price data: they become unstable. When the yield curve is
negative it is no longer possible to compute swaption prices using the stan-
dard Black’s formula. Log-normal implied volatilities cease to exist and the
consequences for parameter calibration are obvious. A solution is to use
displaced forward rates: that is, a constant term is added to the forward rates.
The problem then is how to set this displacement term reliably. Furthermore,
the BGM model is quite likely to produce exponentially large forward rates in
finite time. While this is not an issue for market-consistent pricing per se, it
does mean that scenario sets created by an ESG will contain values that are
too large to be represented in floating-point arithmetic: there will be NaNs.
This is unacceptable for an insurer’s ALM system. A solution is to switch from
the log-normal BGM model to a normal equivalent model with a stochastic
CIR process Vt:

$$dF_{k,t} = \sqrt{V_t}\,\sigma^{\top} dW_t$$

One may calibrate this model using absolute implied volatility data that can
be obtained from market data providers such as Markit and SuperDerivatives.
Exponential blow-up is largely mitigated. Rates may now become negative but,
as noted in the introduction, negative nominal rates are nowadays very real.
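
To make the two remedies concrete, the sketch below prices an undiscounted
option on a single forward rate under (i) Black’s formula applied to a
displaced forward and (ii) the normal (Bachelier) formula quoted in absolute
volatility. These are the standard textbook formulae rather than anything
specific to this chapter; the function names, the `shift` parameter and the
numerical inputs are assumptions made for illustration, and annuity or
discount factors are deliberately left out.

```python
import math


def norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def displaced_black(forward, strike, vol, expiry, shift, payer=True):
    """Undiscounted Black price on the displaced rate F + shift."""
    f, k = forward + shift, strike + shift
    if f <= 0.0 or k <= 0.0:
        raise ValueError("shift must make both forward and strike positive")
    s = vol * math.sqrt(expiry)
    d1 = (math.log(f / k) + 0.5 * s * s) / s
    d2 = d1 - s
    call = f * norm_cdf(d1) - k * norm_cdf(d2)
    # Receiver (put) price follows from put-call parity.
    return call if payer else call - (f - k)


def bachelier(forward, strike, normal_vol, expiry, payer=True):
    """Undiscounted normal-model price; works for negative rates unchanged."""
    s = normal_vol * math.sqrt(expiry)
    d = (forward - strike) / s
    call = (forward - strike) * norm_cdf(d) + s * norm_pdf(d)
    return call if payer else call - (forward - strike)


# Hypothetical inputs: a forward rate of -20bp and a strike of 0%.
F, K, T = -0.002, 0.0, 1.0
payer_displaced = displaced_black(F, K, vol=0.30, expiry=T, shift=0.03)
payer_normal = bachelier(F, K, normal_vol=0.006, expiry=T)
```

With `shift=0.0` the displaced formula reduces to the standard Black formula
and raises an error for these inputs, which mirrors the statement above that
log-normal implied volatilities cease to exist when rates are negative.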
Other dynamic models of interest rates exist such as the SABR-LMM model. I
refer the interested reader to the text Brigo and Mercurio (2006) for more
information.

5.2.4 Credit models


Credit models are split into two categories: structural models and
reduced-form models. Structural models attempt to model credit risk at the
enterprise level by treating a firm’s equity as a call option on its assets.
This type of model directly estimates the firm’s risk of downgrade or default.
The Moody’s KMV model (Berndt et al. 2004) is a commercial example of this
type of model. Reduced-form models, such as the Jarrow–Lando–Turnbull model,
use market credit spreads of the broad ratings classes (AAA, AA, . . ., CCC,
et cetera) to measure the risk-neutral default probabilities (with assumptions
about recovery rates) and relate these through coupled systems of SDEs to
model the spreads themselves. An ESG models the credit spread either
through default-only models or through downgrade models in which a
Markov-chain-style transition matrix of probabilities is specified. Whatever
drivers are used to describe the credit spreads, the greatest challenge in
credit modeling is the reliable calibration and estimation of models, which can
quickly become over-parametrized. The output from an ESG is a set of spreads
split by credit class and maturity. The complexity of the calculation is similar
to that of the forward interest rate models. The interested reader is referred
to Lando (2004).
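
As an illustration of the downgrade approach, the sketch below raises a
one-year rating transition matrix to a power to obtain cumulative default
probabilities. The four rating classes and every probability in `ONE_YEAR`
are invented for the example and are not calibrated to any data; a
time-homogeneous Markov chain with an absorbing default state is also an
assumption.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, years):
    """years-step transition matrix of a time-homogeneous Markov chain."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(years):
        result = mat_mul(result, m)
    return result


# Hypothetical classes; "D" is default and is absorbing (last row).
RATINGS = ["A", "B", "C", "D"]
ONE_YEAR = [
    [0.92, 0.06, 0.015, 0.005],
    [0.05, 0.85, 0.07, 0.03],
    [0.01, 0.09, 0.75, 0.15],
    [0.00, 0.00, 0.00, 1.00],
]


def cumulative_default_prob(start_rating: str, years: int) -> float:
    """Probability of having defaulted within `years` from `start_rating`."""
    row = RATINGS.index(start_rating)
    return mat_pow(ONE_YEAR, years)[row][-1]


term_structure = [cumulative_default_prob("B", y) for y in range(1, 6)]
```

Because default is absorbing, the cumulative default probability is
non-decreasing in the horizon; converting these probabilities into spreads
would additionally require a recovery assumption, as noted above.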

5.2.5 Equity models


An important asset class synonymous with financial markets is the equity
market. Models of equity assets reach as far back as Bachelier (1901) who
described the first SDE of an equity. Arithmetic Brownian motion is described
thus
$$dS_t = \mu\,dt + \sigma\,dW_t, \qquad S_0 \text{ given} \tag{5.17}$$
Contrast this for a moment with the SDE for geometric Brownian motion,
simplified from Equation 5.3:
$$\frac{dS_t}{S_t} = \mu\,dt + \sigma\,dW_t, \qquad S_0 \text{ given} \tag{5.18}$$
The single most-important observation is that arithmetic Brownian motion
is unconditionally normally distributed and St can be either positive or neg-
ative. This is unrealistic for an equity process and so Bachelier’s model put
an end to the development of financial engineering for the better part of half
a century. The development and success of Black and Scholes (1973) equity
option pricing theory heralded a new era in financial engineering leading to
such developments as the Libor market model (as discussed earlier). The field
has even turned full circle and embraced once again Bachelier’s arithmetic
model for the purposes of solving the issue of negative nominal interest rates.
For its part, the geometric model Equation 5.18 is lognormally distributed and
non-negative. Non-negativity is certainly sensible for an index or stock price,
but is log-normality always sensible? In the long term, one might envisage a
different unconditional distribution with more or less skew and kurtosis.
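
The contrast between Equations 5.17 and 5.18 can be seen directly by
simulating both from the same Brownian shocks. The sketch below uses the
exact terminal distributions of each model with constant coefficients; the
parameter values and the seed are arbitrary choices made for illustration.

```python
import math
import random


def abm_terminal(s0, mu, sigma, t, z):
    """Exact terminal value of arithmetic Brownian motion (Equation 5.17)."""
    return s0 + mu * t + sigma * math.sqrt(t) * z


def gbm_terminal(s0, mu, sigma, t, z):
    """Exact terminal value of geometric Brownian motion (Equation 5.18)."""
    return s0 * math.exp((mu - 0.5 * sigma ** 2) * t + sigma * math.sqrt(t) * z)


rng = random.Random(2018)
shocks = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

S0, MU, SIGMA, T = 1.0, 0.05, 0.4, 10.0  # hypothetical parameters
abm = [abm_terminal(S0, MU, SIGMA, T, z) for z in shocks]
gbm = [gbm_terminal(S0, MU, SIGMA, T, z) for z in shocks]

share_negative_abm = sum(x < 0.0 for x in abm) / len(abm)
min_gbm = min(gbm)
```

A non-trivial share of the arithmetic paths ends below zero, whereas every
geometric path remains strictly positive by construction.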
For market-consistent pricing of equity derivatives, the lognormal equity
asset model is a practical solution. Its main drawback, however, is that its
price predictions are at odds with the markets: options struck in- and
out-of-the-money trade at prices that the lognormal model does not predict.
The volatility surface (Gatheral 2006) is sensitive to strike as well as option
maturity. Alternatives to the lognormal model are Merton’s jump diffusion model
(Merton 1976):
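$$\begin{aligned} \frac{dS_t}{S_t} &= -\lambda\bar{\mu}\,dt + (\eta_t - 1)\,dN_t \\ \eta_t &\sim \log\text{-}N(\bar{\mu}, \sigma^2) \end{aligned} \tag{5.19}$$

and $dN_t \sim \lim_{\delta t \to 0^+} \mathrm{Po}(\lambda\,\delta t)$; and Heston’s stochastic volatility model (Heston
1993):

$$\begin{aligned} \frac{dS_t}{S_t} &= \mu\,dt + \sqrt{v_t}\,dW_t^{(1)} \\ dv_t &= \alpha(\theta - v_t)\,dt + \xi\sqrt{v_t}\,dW_t^{(2)} \end{aligned} \tag{5.20}$$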
Jump-diffusions perform well when pricing derivatives of short maturities and
can produce volatility skew there, but at longer maturities they do not perform
so well. Stochastic volatility models handle longer maturities well but suffer
at short maturities. The combined stochastic volatility jump diffusion (SVJD)
model of Bates (1996) works well:

$$\begin{aligned} \frac{dS_t}{S_t} &= (\mu - \lambda\bar{\mu})\,dt + \sqrt{v_t}\,dW_t^{(1)} + (\eta_t - 1)\,dN_t \\ dv_t &= \alpha(\theta - v_t)\,dt + \xi\sqrt{v_t}\,dW_t^{(2)} \\ \eta_t &\sim \log\text{-}N(\bar{\mu}, \sigma^2) \end{aligned} \tag{5.21}$$
The drift terms above are each carefully set to ensure the zero drift of any
derived quantities of interest under the appropriate numeraire measure. The
remaining difficulty with the Bates model is in a relative inability to corre-
late Bates processes with other processes within an ESG. Although one can
correlate the Brownian shock by means of a Cholesky factorization of the cor-
relation matrix, it is less than evident how to correlate the combined equity
asset process which involves, in addition to the Brownian shock, the effect of
stochastic volatility and a compound Poisson process. I address how to do this
in Section 5.4.
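
For orientation, a single Bates-type asset can be simulated with a simple
Euler scheme, as sketched below. This is one possible discretization and not
the chapter’s method: it works on the log price, truncates negative variance
at zero, correlates the two Brownian shocks with a parameter `rho`, and
compensates the jumps with `kappa = E[eta - 1]` for a lognormal jump size
with log-mean `jump_mean` and log-standard-deviation `jump_sd`. All parameter
names and values are assumptions made for illustration.

```python
import math
import random


def poisson(rng, mean):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, count, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_bates(s0, v0, mu, alpha, theta, xi, rho, lam,
                   jump_mean, jump_sd, horizon, n_steps, rng):
    """Terminal price from a log-Euler scheme with full variance truncation."""
    dt = horizon / n_steps
    kappa = math.exp(jump_mean + 0.5 * jump_sd ** 2) - 1.0  # E[eta - 1]
    log_s, v = math.log(s0), v0
    for _ in range(n_steps):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        v_pos = max(v, 0.0)
        # Sum of log jump sizes arriving in this step (often zero).
        jumps = sum(rng.gauss(jump_mean, jump_sd)
                    for _ in range(poisson(rng, lam * dt)))
        log_s += ((mu - lam * kappa - 0.5 * v_pos) * dt
                  + math.sqrt(v_pos * dt) * z1 + jumps)
        v += alpha * (theta - v_pos) * dt + xi * math.sqrt(v_pos * dt) * z2
    return math.exp(log_s)


rng = random.Random(1996)
terminal = [
    simulate_bates(s0=100.0, v0=0.04, mu=0.03, alpha=2.0, theta=0.04, xi=0.3,
                   rho=-0.7, lam=0.5, jump_mean=-0.05, jump_sd=0.10,
                   horizon=1.0, n_steps=50, rng=rng)
    for _ in range(2_000)
]
mean_growth = sum(terminal) / len(terminal) / 100.0
```

Working in the log price keeps every simulated price positive, and the
compensator keeps the expected growth rate at `mu`. Correlating this combined
process with other risk drivers is the harder problem raised in the text and
is not attempted here.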
Other popular models that involve stochastic volatility are those of Duffie
et al. (2000), Lévy processes (e.g., Carr et al. 2002) and the SABR mixed local
volatility and stochastic volatility models of Hagan et al. (2002).
5.2.6 Calibration issues


It is now an opportune moment to pause and reflect on the sheer complexity
of the calibration problem, having introduced two very detailed models for
interest rates and equities. In the market-consistent nominal interest rate
setting, one popular approach is to calibrate to market swaption prices (or to
their implied volatilities). One produces modeled prices for each instrument
available in the market and builds an objective criterion such as the
mean-squared error. This metric may be minimized using a numerical optimization
algorithm such as the Levenberg–Marquardt algorithm (e.g., Transtrum and
Sethna 2012; Press et al. 2007). Since the resulting optimization problem is
non-convex, with no directly accessible functional form for the Jacobian or
Hessian, optimization is challenging and potentially onerous.
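
Written out, the objective criterion described above takes the following
form. The notation is introduced here purely for illustration: θ is the
parameter vector, N the number of quoted instruments, and the optional
weights w_i are an assumption that is not part of the text.

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} w_i \left( P_i^{\text{model}}(\theta) - P_i^{\text{market}} \right)^2$$

Setting every w_i = 1 recovers the plain mean-squared error. The prices P_i
may equally be replaced by implied volatilities.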
If pricing derivatives given a set of parameters is the forward problem,
then optimizing a metric given market data to find the best parameter set is
the inverse problem. In such nonlinear optimizations one is not guaranteed
to find the best solution and usually ends up settling for something which is
satisfactory in some sense (if indeed one can find a solution at all). Harder
still, finding the unique global optimum is the fundamental challenge. Indeed,
there may be many local optima and the iterative scheme created by the
Levenberg–Marquardt algorithm is not guaranteed to converge to the global
minimum. Derivative-free and global optimization schemes do exist, notably
Nelder–Mead or differential evolution (Press et al. 2007). However, they are
computationally much more onerous than gradient-based local schemes.
Ultimately, any iterative scheme reduces to a multi-dimensional non-autonomous
discrete dynamical system. Such systems, even when the dynamics are known, are subject
to multiple critical points, limit cycles and, most interesting of all, strange
attractors (Hale and Kocak 1991). It is worthy of note that strange attractors
may be found in the most innocuous of difference equation systems. Their
key defining characteristic is that any solution tends to the trajectory defined
by the strange attractor yet this trajectory never repeats itself nor intersects
its past trajectory. In a very real sense, it is space-filling in some intermedi-
ate sub-space that lies between integer dimensions and every point along this
path is optimal. In continuous time, strange attractors are precluded from
coupled systems of two differential equations by the Poincaré–Bendixson theorem
but no analogous result holds for coupled systems of two or more difference equations.
Seeking an optimal calibration is potentially equivalent to tracing the tra-
jectory laid by a strange attractor as the critical phenomena of a system of
nonlinear difference equations. Care must be taken to ensure that of the crit-
ical phenomena listed, a critical point has indeed been obtained: any of the
phenomena can show good validation to vanilla option prices yet still leave
the analyst unaware of the precise nature of the phenomena realized. By a
naive stopping rule, one may have exited the dynamical difference system at
some point on a limit cycle, or at some large but finite number of steps along
the trajectory of a strange attractor thinking that one has found the unique
critical point.
Luckily there are some remedies to this problem. First, start the optimization
scheme repeatedly from as many different points in the parameter space as is
practical and record the ultimate destinations reached. Examine the index plots of the
optimal metric and if a large proportion of the solutions with the smallest
indices have roughly the same metric, one can be assured that a near global
solution has been reached. Now examine the ultimate parameter values using
the same ranking: do the most optimal calibrations correspond to similar val-
ues of the parameters? If so, this is the best outcome. On the other hand, if
the index plot of the parameters shows a high degree of variability, one may
deduce that the market data does not drive a unique solution. This may be due
to the manifestation of a limit cycle or to a strange attractor. Calculation of
the Lyapunov exponent22 for the system may serve to characterize the conver-
gent phenomena. If one does discover either of these phenomena, the current
market snapshot does not define a unique model. A solution is to complement
the snapshot with more information. If one naively inserts restrictive param-
eter limits, one runs the risk of finding solutions along the boundary. This is
problematic as a calibration found on the boundary is equivalent to reducing
the parameter count by one and grossly simplifies the calibration problem in
an arbitrary way that is difficult to control. This introduces a redundancy to
the difference equation system and one ends up solving an alternative opti-
mization program that was not intended. One needs to add more information
into the system sensibly. One way to achieve this is to weigh the current solu-
tion set with the empirical density function generated from the most recent
well-behaved solution set. This makes it possible to select a solution from the
current data in such a way that is close to the previous data (Hastie et al.
2009). At the very least, when one comes to value one’s liabilities, the change
in parameters will not be unusually profound and the value of the liabilities
today will be consistent with the last time a price was sought.
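
The multi-start remedy can be sketched on a toy problem. The one-dimensional
objective below stands in for a calibration metric and has two local minima;
plain gradient descent stands in for the local optimizer. Both are
assumptions chosen so that the example is self-contained, and neither is the
chapter’s calibration routine.

```python
def objective(x: float) -> float:
    """Toy non-convex metric with one global and one local minimum."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def gradient(x: float) -> float:
    return 4.0 * x * (x * x - 1.0) + 0.3


def local_descent(x0: float, step: float = 0.01, n_iter: int = 5_000) -> float:
    """Fixed-step gradient descent: a stand-in for any local optimizer."""
    x = x0
    for _ in range(n_iter):
        x -= step * gradient(x)
    return x


# Restart the local scheme from a grid of starting points.
starts = [-2.0 + 0.5 * i for i in range(9)]
destinations = [local_descent(x0) for x0 in starts]

# The "index plot": solutions ranked by the value of the metric.
ranked = sorted((objective(x), x) for x in destinations)
best_value, best_x = ranked[0]

# Distinct destinations reveal how many basins of attraction were found.
distinct = sorted({round(x, 3) for x in destinations})
```

Here the ranked list shows two groups of solutions: those sharing the
smallest metric correspond to the global minimum, and the rest are trapped in
the local one. In a real calibration the same ranking would be inspected for
both the metric and the parameter values, as described above.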
Having a robust calibration technique that does not change significantly
across consecutive time periods except by an amount that is, in some sense
“reasonable,” is certainly desirable. However, a model calibrated to one type
of market data need not necessarily lead to a similar calibration of the same
model under another. For example, receiver swaption interest rate data is
better defined where it is in-the-money for low interest rates. Payer swaption
prices tend to be smaller where receiver swaption prices are larger and conse-
quently, owing to a disparity in scales, there is less information available for
calibration under payers in this instance. One tail of the underlying density
may be less well defined when comparing across calibrations and hence lead
to different parametrizations. One may calibrate to market-implied volatil-
ities to obviate this risk, but while receiver and payer swaption generated
implied volatilities are theoretically equal, they aren’t in practice and one
finds oneself in the same quandary as with prices. If one replaces swaptions

22 The Lyapunov exponent measures the rate at which two solutions to a dynamical
system tend to diverge over time (Hale and Kocak 1991) given that they start
from nearby initial conditions.
with calibrations to interest rate derivatives such as caps and/or floors, then
the calibration may change yet again despite the theory implying that the
same risk-neutral density should pervade. One must be pragmatic, therefore,
and test the sensitivity of any calibration to data that is out-of-sample. This
is a rather serious issue, as the sensitivities of the MCEV and TVOG,
and ultimately of the SCR under Solvency 2, depend upon it.
Finally, it is noted that the calibration of an equity SVJD model in eight
parameters necessarily requires the interest rate model to be calibrated first.
The equity model is coupled to the interest rate model through the stochastic
interest rate discount factor. Some approximations can usually be made but
this serves to illustrate the depth and difficulty of the general calibration
problem to market data.

5.2.7 High-performance computing in ESGs


The main use of high-performance computing in an ESG is to distribute
many Monte-Carlo trials across multiple servers so that they may be executed
in parallel. The data are then collated by the client to create the complete
dataset. This principle of task decomposition and distribution can be
applied to many related problems such as RSG, or to scenario generation for
least-squares Monte-Carlo proxy model fitting. It is, however, unclear how
one should apply high-performance computing to best aid in model param-
eter calibration. Essentially, anything that may be linearly decomposed into
a repetitive sequence of like tasks may make use of high-performance com-
puting. Parameter calibrations are almost certainly iterative and nonlinear in
nature. Since reinitializing a calibration to a new point helps obviate the local
nature of some optimizers, then this is one use of HPC. A second use of HPC
is in the calibration of models to exotic derivatives (e.g., callable swaps needed
to hedge portfolios of fixed rate mortgages) where no analytics are available
or in the calibration of a more complicated interest rate model, again one for
which no option pricing formulae are available. Here, option payoffs
may be simulated using the Monte-Carlo technique. While this is computa-
tionally prohibitive for one quad-core desktop PC, a multi-core HPC cluster
may speed up the calculation sufficiently to make it a viable calibration alter-
native. One must overcome the HPC communication bottleneck of collating
results by averaging trials from different cores. Some empirical tuning of the
right balance of trials-to-cores is needed.
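
The decomposition described above (independent batches of trials, followed
by collation on the client) can be sketched as follows. The payoff
max(Z, 0) for a standard normal Z is a placeholder for a real scenario or
pricing task, and a thread pool is used only so that the sketch runs
anywhere: in CPython it gives no speed-up for CPU-bound work, and a process
pool or a cluster scheduler would take its place in practice. Seeding each
batch with `base_seed + i` is a simple convention, not a guarantee of
non-overlapping random streams.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def run_batch(task):
    """One worker's share: returns the payoff sum and the trial count."""
    seed, n_trials = task
    rng = random.Random(seed)  # private generator, nothing shared
    total = 0.0
    for _ in range(n_trials):
        total += max(rng.gauss(0.0, 1.0), 0.0)  # placeholder payoff
    return total, n_trials


def distributed_estimate(n_batches, trials_per_batch,
                         base_seed=12345, max_workers=4):
    tasks = [(base_seed + i, trials_per_batch) for i in range(n_batches)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(run_batch, tasks))  # order is preserved
    # Collation: only two numbers per batch travel back to the client.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


estimate = distributed_estimate(n_batches=8, trials_per_batch=25_000)
exact = 1.0 / math.sqrt(2.0 * math.pi)  # E[max(Z, 0)] for standard normal Z
```

Returning partial sums rather than individual trials keeps the communication
cost small, and changing `n_batches` against `trials_per_batch` is the
trials-to-cores balance that the text suggests tuning empirically.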
The use of Microsoft’s Azure servers in the cloud can speed up scenario set
generation and locally, Digipede Grids23 offer a robust local solution. These
provide an array of servers which can process individual fragments of an ESG
simulation. For the least-squares Monte-Carlo Solvency 2 SCR calculation, the
use of bespoke in-house HPC grids to parallelize ESG execution and regression
calculations is necessary, however. The same Digipede Grids are not useful

23 [Link]
because the market-consistent inner component of the least-squares Monte-Carlo
technique has a very low trial budget. A more efficient approach is to run
whole simulations in parallel rather than breaking jobs up trial-wise. Other
third-party HPC platforms could be used to achieve similar results.
As mentioned, the cloud has become almost synonymous with HPC and there are
two ways of leveraging such platforms. Products may work on an Infrastructure
as a Service (IaaS) paradigm. Here, more servers are added to increase core
and memory capacity in any environment. The move toward Platform as a
Service (PaaS) is newer. The main difference between IaaS and PaaS is the
removal of the notion of “a server”: one works instead with a pool of resources
(cores and memory), dynamically deploying services to perform calculations.
See the NIST Definition of Cloud Computing from the National Institute of
Standards and Technology or Chang et al. (2010). The main players in service-
based computation are Apache Spark, Microsoft’s Service Fabric, and Google’s
App Engine. They offer similar capabilities in the form of a scalable platform
that will allow applications to dynamically burst upwards to a high number
of parallel service instances as demand grows.
In terms of local HPC solutions, NVidia’s GPU computation using OpenCL
or CUDA can yield impressive results. GPU appliances are now available in
Amazon Web Services (AWS) and on Microsoft’s Azure permitting software
engineers access to considerable computing power: see NVidia’s new Pascal-
based servers.24 Maxeler Technologies’ Field Programmable Gate Arrays 25
(FPGA) are programmable compute nodes that can be configured to perform
a specific set calculation very quickly. At each clock cycle, more calculations
can be loaded, leading to impressive throughput. Finally, quantum computation,
if a little speculative at present, works in a similar manner to FPGAs. Quantum
devices are not classic general-purpose processors but, like FPGAs, can
be configured to solve a particular problem very quickly.

5.3 Risk-Scenario Generators


Generally speaking, non-market risks are any risks faced by an insurer that
are not economic and are typified by the fact that deep and liquid markets do
not exist for trading in them. Historical data in non-market risks is difficult
to obtain and maintain. Solvency 2’s Pillar I and II recognize the importance
of these through the potential impact on an insurer’s balance sheet. A non-
exhaustive list of such risks includes industry-wide systemic risks (e.g., the
financial crisis of 2008/2009), mortality and longevity risks, policyholder lapse
and surrender risk, and operational risk. The challenge an insurer faces in
modeling them is precisely this general lack of data.

24 [Link]
25 [Link]
Co-dependencies and tail risk emerging from, for example, a systemic risk
shock lead to a plethora of correlation structures. The copula-marginal
distribution factorization (Sklar, 1959) is a key ingredient of any RSG. In the
multivariate risk setting it extends the idea of simple Gaussian correlation of
Brownian shocks via a Cholesky factorization to all risk factors. The Gaussian
correlation approach still allows diversification of risks in times of crisis,
which is clearly unsatisfactory modeling behavior. During periods of relatively
quiescent and stable economic behavior, diversification seems reasonable, but
when market participants act in concert, as in a run on the banks, it is
not. Changing the co-dependency structure from a Gaussian copula to a T-copula
with few degrees of freedom can model the effects of herding in times of crisis
yet still allow for diversifiable behavior during periods of relative calm.
A RSG, therefore, needs to incorporate an ESG but must also provide a
coherent framework for modeling the non-market risks alongside the market
risks. Since the theory for non-market risks is not as well developed as for
market risks, one generally resorts to statistical models for these. RSGs need to
provide an insurer with a statistical toolbox of marginal distribution functions
and copulas to model the joint distribution of market and non-market risks. At
some point downstream of the RSG, the results will be aggregated to produce
capital charges which monetize the impact of these risks on the business. This
section provides some detail on co-dependency structures and non-market
risks.

5.3.1 Co-dependency structures and simulation


McNeil et al. (2015) discuss in detail the simulation of variables from
multivariate distributions. For many of the processes discussed in the previous
section, the Brownian shock is the key ingredient and over a timestep Δt one
may conveniently approximate the stochastic differential dWt by N(0, Δt). In
the multivariate setting of the ESG one seeks to correlate many such variables
(n ≫ 1, say) to a user-specified positive semi-definite correlation matrix.
Generating a correlated set of normally distributed variables is relatively
easy. One begins by using an industry standard pseudo-random number generator
such as the Mersenne twister (Matsumoto and Nishimura 1998) to
create uniform random variables on the interval [0, 1]. These are grouped into
pairs and, using the algorithm of Box and Muller (1958), are converted into
pairs of independent and identically distributed pseudo-random normal deviates,
which are then used to populate a p-dimensional vector z. Performing
the Cholesky decomposition of the positive semi-definite correlation matrix P
into LL⊤ (where L is a lower triangular matrix) and constructing the vector
x := Lz produces a realization from a multivariate normal distribution with
correlation matrix P and standard normal marginals. The challenge comes
when one needs to correlate non-Gaussian variables.
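
The recipe in the preceding paragraph can be sketched with the Python
standard library alone, whose `random` module is itself based on the Mersenne
twister. The 3 × 3 correlation matrix `P` is hypothetical, and the simple
Cholesky routine assumes a strictly positive definite matrix (a singular
positive semi-definite matrix would need special handling).

```python
import math
import random


def box_muller(u1: float, u2: float):
    """Map two uniforms (u1 in (0, 1]) to two independent standard normals."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def cholesky(p):
    """Lower-triangular L with L L^T = p; assumes p is positive definite."""
    n = len(p)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(p[i][i] - partial)
            else:
                low[i][j] = (p[i][j] - partial) / low[j][j]
    return low


def correlated_normals(low, rng):
    """One realization x = L z with z a vector of independent normals."""
    n = len(low)
    z = []
    while len(z) < n:
        u1 = 1.0 - rng.random()  # in (0, 1], avoids log(0)
        z.extend(box_muller(u1, rng.random()))
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


def sample_corr(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


P = [[1.0, 0.8, 0.3],
     [0.8, 1.0, 0.5],
     [0.3, 0.5, 1.0]]  # hypothetical correlation matrix
L = cholesky(P)
rng = random.Random(42)
sample = [correlated_normals(L, rng) for _ in range(20_000)]
columns = list(zip(*sample))
```

Comparing `sample_corr(columns[0], columns[1])` with the target entry of `P`
is a quick sanity check that the factorization and the shocks are wired
together correctly.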
Sklar’s theorem (Sklar 1959), or rather its converse gives a general and
flexible framework for the factorization of multivariate distribution functions.
Sklar showed that for every multivariate distribution function F with marginal
distribution functions F1:p there exists a function C (the copula), which is
unique when the variables are continuous and unique only up to certain
restrictions when some are discrete, such that

$$F(x_1, \ldots, x_p) = C(F_1(x_1), \ldots, F_p(x_p))$$

Furthermore, if the density function is f with marginals f1:p (and copula
density c) then

$$f(x_1, \ldots, x_p) = c(F_1(x_1), \ldots, F_p(x_p)) \prod_{n=1}^{p} f_n(x_n)$$

If one specifies the copula, C(u1, . . . , up), then one may combine it with any
desired marginal distributions and the resulting multivariate distribution
function is unique (see McNeil et al. 2015). The strength of this result for
scenario generators is that, to simulate from a multivariate distribution, one
need only simulate a multivariate instance from the copula, (u1, . . . , up)⊤,
and then evaluate the quantile function of each of the marginal variables at
these copula values, (q1(u1), . . . , qp(up))⊤, to obtain a desired
(ranks-based) instance of the joint distribution function. Since it is often
easy to simulate from a copula (at least from those of many of the most popular
multivariate distributions: e.g., the normal copula, the t-copula, or the
grouped t-copula), one may use whatever marginal distributions are desired:
they need not be normal nor relate in any way to the copula.
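
As a minimal sketch of this two-step procedure, the code below draws from a
bivariate Gaussian copula and then applies the quantile functions of two
deliberately non-normal marginals. The choice of an exponential and a Pareto
marginal, their parameters, and the clamping of the copula values away from 0
and 1 are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def clamp(u: float) -> float:
    """Keep copula values strictly inside (0, 1) for the quantile functions."""
    return min(max(u, 1e-12), 1.0 - 1e-12)


def gaussian_copula_pair(rho, rng):
    """One draw (u1, u2) from the bivariate Gaussian copula."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    return clamp(STD_NORMAL.cdf(z1)), clamp(STD_NORMAL.cdf(z2))


def exponential_quantile(u, rate):
    return -math.log(1.0 - u) / rate


def pareto_quantile(u, scale, shape):
    return scale * (1.0 - u) ** (-1.0 / shape)


def ranks(values):
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        out[index] = float(position)
    return out


def spearman(xs, ys):
    """Rank correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(7)
claims, losses = [], []
for _ in range(20_000):
    u1, u2 = gaussian_copula_pair(0.8, rng)
    claims.append(exponential_quantile(u1, rate=2.0))
    losses.append(pareto_quantile(u2, scale=1.0, shape=3.0))

rank_correlation = spearman(claims, losses)
```

Because quantile functions are monotone, the rank correlation of the
transformed pairs is exactly that of the copula draws, whatever marginals are
chosen. This is the sense in which the resulting dependence is ranks-based.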
I note that it is not always possible to generate random variables with any
prescribed correlation under the Pearson product-moment coefficient. However,
since the copula approach uses a theoretical ranks-based correlation
between variables, practically any correlation is possible, although it may
not be possible in practice to obtain a good sample correlation estimate in the
presence of degenerate variables containing multiple repeated instances of the
same value (e.g., the PEDCP case to be discussed in Section 5.4).
Since my motivation was to introduce more realistic co-dependency structures
than the simple Gaussian copula, I will now illustrate the benefits of
alternatives to it in modeling tail risk. For illustration, I will compare the
Gaussian copula and the T-copula with few degrees of freedom in the bivariate
case.
A bivariate copula underlying given bivariate data is easily represented by
plotting the normalized ranks of the first variable against the normalized
ranks of the second variable. In the theoretical setting of a known normal or
T-copula, this is achieved by applying the marginal distribution functions to
the realized bivariate data. McNeil et al. (2015) detail how this is achieved.
In this example, n = 2.5 × 10^4 pairs of illustrative data from a bivariate
normal distribution were generated with zero mean
and covariance matrix

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$
[Figure: scatterplot of the rank of variable 2 against the rank of variable 1, both on [0, 1], for the Normal and T(4) copulas.]

FIGURE 5.1: Normal and T(4) copulas represented as scatterplots.

where ρ = 0.8 (Σ is also the correlation matrix P and both marginals are
standard normal). A bivariate T-distribution is created using the same
parameters and an additional degrees-of-freedom parameter ν, here set to 4. The
T-distribution is created from the bivariate normal data by generating a
vector v of length 2.5 × 10^4 containing independent χ²(4) variables. Scaling
the i-th pair of normal data by √(ν/vi) generates bivariate T(ν) data, which is
easily accomplished in any ESG. Since the marginal distributions of a
multivariate T(ν) distribution are univariate tν distributions, T(ν)-copula
data is easily obtained by applying the tν distribution function to each
element of the pair.
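
The construction just described can be sketched as follows. To stay within
the Python standard library, which has no Student-t distribution function,
the sketch represents each copula by the normalized ranks of the simulated
data (the empirical route mentioned earlier) rather than by applying the tν
distribution function. The helper names, the seed and the tail level are
assumptions made for illustration.

```python
import math
import random


def normalized_ranks(values):
    """Ranks scaled into (0, 1): rank / (n + 1)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        out[index] = position / (len(values) + 1)
    return out


def simulate_pairs(n, rho, nu, rng):
    """Bivariate normal pairs and the T(nu) pairs built from the same shocks."""
    normal, student = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        v = rng.gammavariate(nu / 2.0, 2.0)  # chi-squared with nu d.o.f.
        scale = math.sqrt(nu / v)            # common scaling of the pair
        normal.append((z1, z2))
        student.append((scale * z1, scale * z2))
    return normal, student


def joint_upper_tail(pairs, level):
    """Share of pairs whose normalized ranks both exceed `level`."""
    u1 = normalized_ranks([p[0] for p in pairs])
    u2 = normalized_ranks([p[1] for p in pairs])
    hits = sum(a > level and b > level for a, b in zip(u1, u2))
    return hits / len(pairs)


rng = random.Random(4)
normal_pairs, t_pairs = simulate_pairs(n=25_000, rho=0.8, nu=4, rng=rng)
tail_normal = joint_upper_tail(normal_pairs, level=0.99)
tail_t = joint_upper_tail(t_pairs, level=0.99)
```

Comparing `tail_t` with `tail_normal` at increasingly extreme levels gives a
numerical counterpart to the scatterplot comparison in Figure 5.1. The
difference at any single level is subject to sampling noise, so several levels
and seeds should be examined before drawing conclusions.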
Comparing scatterplots of the simulated copula data samples serves to
illustrate the point (see Figure 5.1). The upper right-hand corner of the
scatterplot corresponds to large extremes of both variables: for example, times
of high financial stress. As variable pairs move further into this corner they
are coerced to move closer and closer together under the T(ν) copula than
under the more dispersed normal copula. This goes some way toward modeling
the so-called herding behavior of the markets in times of financial stress. The
T(ν)-copula can be extended to the more versatile grouped T-copula in which
sets of like variables in multivariate data are grouped and allocated a group
degrees-of-freedom parameter. Other copulas, such as the Archimedean and
Gumbel–Hougaard copulas (see McNeil et al. 2015), have been used in finance
but have not, to the best of my knowledge, been applied within the insurance
context.
In the context of ESG and RSG simulations, the copula approach is often
essential in effectively simulating appropriately correlated risk drivers. Finally,
I note that if one intends to simulate the effect of a T (ν)-copula (or other)
over more than one time step, say n steps, then one must decompose the
copula such that after the application of n steps the T (ν)-copula emerges. If
this decomposition is not made and a T (ν)-copula is simulated at each time
step, then a dilution effect occurs which is sometimes referred to as a copula
central limit theorem and a Gaussian copula emerges after n steps for large
enough n. This leads on to the discussion of α-stable distributions and the
reader is referred to McNeil et al. (2015) for more details.
In Section 5.4 where I illustrate challenges in scenario generation, it is as a
direct result of the copula/marginal factorization and Sklar’s converse that I
can simulate correlated instances of some rather challenging variables. In the
remainder of this section I cover some of the major non-market risk drivers.
As a consequence of the converse to Sklar’s theorem, they may be represented
by any univariate distribution function as long as it is possible to simulate
these using the probability integral transform. This is possible for variables
such as the normal, t, chi-squared, and Gamma whose densities are readily
available. For more challenging variables, one solution is to represent them in
some way such as that given for the examples in Section 5.4.

5.3.2 Mortality and longevity risk


Mortality risk is felt by an insurer during the accrual phase of a life
assurance policy. Essentially, if mortality is seen to worsen, then fewer
policyholders will be able to pay insurance premiums and the business can lose
a core proportion of its value. Models of mortality are of the Age-Period-Cohort
type and their generalizations. The model of Lee and Carter is a good starting
point (e.g., Haberman and Renshaw 2008).
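
For orientation, the Lee–Carter model in its usual textbook form decomposes
the log central death rate m_{x,t} at age x in calendar year t as shown
below. The notation is the conventional one and is not taken from this
chapter.

$$\ln m_{x,t} = a_x + b_x \kappa_t + \varepsilon_{x,t}, \qquad \sum_x b_x = 1, \quad \sum_t \kappa_t = 0$$

Here a_x is the average age profile of log mortality, κ_t is a period index
of the overall mortality level, and b_x measures how strongly each age
responds to changes in that index; the two constraints make the
decomposition identifiable. Age-Period-Cohort generalizations add a further
term that depends on the year of birth t − x.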
The Institute and Faculty of Actuaries maintains a Continuous Mortality
Investigation (CMI26) and cites the following as critical to the study of
mortality risk:

1. Annuitant mortality

2. Critical illness and mortality

3. Income protection

4. Self-administered pension scheme (SAPS) mortality.

On the flip side of mortality risk there is longevity risk. Potentially much
more serious than mortality risk, it measures the risk during the payout phase
of an insurance policy. Since life expectancy has been increasing since the
latter half of the twentieth century, insurers are going to have to pay
more people in their retirement for longer. The potential shortfall in funds
could be very severe depending on the composition of an insurer’s business.
Defined benefit schemes have been all but replaced by defined contribution
schemes for this reason. To mitigate longevity risk, insurers have the option
of entering into a longevity swap agreement with an investment bank, say.
Inevitably this will be costly, however.

26 [Link]

5.3.3 Lapse and surrender risk


New government legislation makes it easier for pensioners to take their
pension as a lump sum upon retirement. In bullish markets people surren-
der their policies because they can achieve better returns elsewhere (and vice
versa). In bear markets, surrender risk is minimized but an insurer will gener-
ally find it difficult to honor guarantees on policies during this time. In times
of market distress, lapse may become an important feature and may lead to
a contraction of the market.

5.3.4 Operational risk


Operational risk is often seen as a catch-all category for other risks on an insurer’s balance sheet. An operational risk is any risk arising from causes that range from human error (e.g., logging a trade in millions rather than tens of millions) to a failure to comply with regulatory tax requirements, for example. Essentially, it is anything that arises as an error incurred during the normal operation of a business.
More and more prominence is being attributed to operational risk/loss as
there is a very real chance it could contain the one killer risk rendering an
insurer insolvent. Under Solvency 2’s Pillar I, the standard formula SCRop is
volume driven and based on an insurer’s premiums. This may lead to exces-
sive capital charges and, in turn, drive an insurer toward a partial internal
model. It is true, however, that under Pillar II insurers must in any case develop full and independent risk-monitoring frameworks, including for operational risk, and justify them to the regulator.
Where an internal model is used, it tends to follow the bespoke scenario
stress testing and loss distribution approach familiar in Basel II operational
risk analyses in the banking sector. The procedure as outlined by Neil et al.
(2012) is as follows:

1. Use Pillar II/ORSA risk control infrastructure

2. Create operational loss scenarios

3. Identify the most potent loss-making scenarios

4. Derive theoretical loss distributions for each of these

5. Assess correlations between these scenarios

6. Use a copula co-dependency to aggregate losses and obtain a capital charge

Despite this framework, operational loss capital charges are notoriously hard to estimate as so few data exist either within companies or in the public domain. Given the requirements of Solvency 2 Pillars I and II, insurers are beginning to gather and monitor their own data. Targeted at insurance companies under Solvency 2, the Operational Risk Consortium (ORIC; see footnote 27) was founded in 2005 to provide a focal point for insurers to share their operational risk data anonymously. It also provides industry benchmarks and thought leadership.
Blending an insurer’s own data with the online ORIC resource is certainly one approach. However, calibrations of operational risk models can still be problematic: risk drivers may flip in and out of models frequently, depending on the precise historical period used. Insurers may prefer to resort to independent expert or panel judgement. Neil et al. (2012), for example, suggest an alternative approach based on Bayesian networks that specifically model the links between different risk drivers.
In Section 5.4, I provide a detailed specification for a marginal risk-driver
that can be used to measure operational losses: the truncated Pareto event-
driven compound Poisson distribution. Since the events themselves are of
power-law Pareto type I class, no analytical formula exists for their distri-
bution function (unlike in the case of a Gamma event-driven variable, for
example). However, I derive a new technique that enables a representation of
the distribution functions of this and other compound variables in terms of
polynomial bases. The method is possible when one may simulate a specific
variable without knowledge of its distribution function. The truncated Pareto
event-driven compound Poisson variable can, by its nature, lead to very pecu-
liar distributions but the most challenging aspect of it from a risk-scenario
generator’s perspective is in how to correlate it with other risk drivers. Since
I offer a means to simulate it from its marginal, this can be achieved using a
copula-marginal factorization and the probability integral transform.

5.4 Examples of Challenges in Scenario Generation


This section considers how to derive the distribution function of composite variables where, although it is possible to simulate these variables, their distribution function is unavailable by analytical means. This is problematic for an economic or risk scenario generator (ESG or RSG) because it is not at all obvious how to correlate such variables. If some representation of their distribution functions were available, a scenario generator would be able to correlate them via the probability integral transform. Two variables are considered: first, the truncated Pareto-event-driven compound Poisson (PEDCP) variable and second, the conditional equity asset shock distribution of a stochastic volatility and jump diffusion (SVJD) model. The former is represented by a finite-element basis and the latter by an orthogonal series of Chebyshev polynomials. As I will show, if one can simulate instances of the variables in isolation then the coefficients in their representations may be estimated. The key result is that one need only hold a relatively small number of coefficients to represent a variable’s distribution function well by a function that is at least continuous and often differentiable. This is in stark contrast to using an empirical cumulative distribution function (ECDF) representation where one is obliged to hold a large amount of simulated data. Whenever the ECDF is to be evaluated, one must sift the ranked data until the appropriate quantile has been bounded by two adjacent data values and then interpolate. Besides being computationally demanding, this is also memory intensive compared to accessing a good distribution function representation. The methods are new.

27 [Link]
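To make the cost of the ECDF route concrete, the following is a minimal Python sketch of the lookup described above. The function names and the interpolation convention (linear interpolation between adjacent order statistics at fractional rank u(n − 1)) are assumptions made for illustration and are not taken from the chapter.

```python
import bisect

def ecdf(sorted_sample, x):
    """Empirical distribution function: fraction of the sample that is <= x."""
    return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

def ecdf_quantile(sorted_sample, u):
    """Quantile by linear interpolation between two adjacent order statistics."""
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie in [0, 1]")
    position = u * (len(sorted_sample) - 1)      # fractional rank (assumed convention)
    lower = int(position)
    upper = min(lower + 1, len(sorted_sample) - 1)
    weight = position - lower
    return (1.0 - weight) * sorted_sample[lower] + weight * sorted_sample[upper]

# The whole ranked sample has to stay in memory for every evaluation.
sample = sorted([0.30, 0.10, 0.15, 0.25, 0.20])
median = ecdf_quantile(sample, 0.5)
```

Both functions need the full sorted sample at every call, which is the memory burden that the coefficient-based representations are intended to avoid.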

5.4.1 PEDCP representation


Let the random variable measuring operational loss be $X$, assume that it is an event-driven compound Poisson variable, and write $X \sim \mathrm{PEDCP}(x_m, x^*, \alpha, \lambda)$, where the count $N$ is Poisson distributed with rate parameter $\lambda > 0$. The event distribution is taken to be the truncated Pareto Type I distribution. The events are simulated as Pareto Type I events with a minimum value of $x_m > 0$, but they have an artificial ceiling applied to them at level $x^* > x_m$. Truncation of variables is something that insurers may wish to model as it gives an extra degree of freedom to the event variable. However, it does lead to a more complicated distribution function. Recall the Pareto Type I density and distribution function:

\[
f_{\mathrm{Pa}}(x) = \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}}, \qquad
F_{\mathrm{Pa}}(x) = 1 - \left(\frac{x_m}{x}\right)^{\alpha}, \qquad \forall x \ge x_m .
\]

The Pareto Type I power-law decay is denoted $\alpha > 0$. The effect of truncating such a variable above a certain value is to stack the remaining probability, $1 - F_{\mathrm{Pa}}(x^*)$, at that ceiling. The truncated density and distribution function are

\[
f_{\text{t.-Pa}}(x) = I_{x < x^*}\, f_{\mathrm{Pa}}(x) + \bigl(1 - F_{\mathrm{Pa}}(x^*)\bigr)\,\delta(x - x^*), \qquad
F_{\text{t.-Pa}}(x) = I_{x < x^*}\, F_{\mathrm{Pa}}(x) + I_{x \ge x^*} .
\]

Writing $X_n \sim \text{t.-Pa}(x_m, x^*, \alpha)$ $\forall n \in \mathbb{N}$ and $X_0 = 0$, the conditional compound Poisson variable is $X|N = \sum_{n=0}^{N} X_n$. By the law of total probability, the density of the unconditional PEDCP variable is

\[
f_X(x) = f_{\mathrm{Po}}(0;\lambda)\, f_{X|N=0}(x) + \bigl[1 - f_{\mathrm{Po}}(0;\lambda)\bigr]\, f_{X|N>0}(x) .
\]

Since the distribution of $X|N=0$ is degenerate, with density $\delta(x)$, from here on the distribution of $X$ given $N > 0$ is sought and, to simplify notation, I simply write $X$ in place of $X|N > 0$.
The first step is to simulate instances of this variable in isolation for use in estimating the parameters in a representation of the variable’s distribution function (and, later, its density function). To motivate an interesting example, the following parameters are taken: $\lambda = 1$, $\alpha = 3$, $x_m = 0.1$, and $x^* = 0.15$. First, $5 \times 10^4$ Poisson variables Po(1) are simulated using R’s rpois() function, giving the following summary:

##     0     1    2    3   4   5  6 7
## 18560 18271 9024 3156 818 139 24 8

Note that, of the $N_{\mathrm{reps}} = 5 \times 10^4$ Poisson variates simulated, some 31440 were non-zero, that is, 62.88% of the variates, which approximates well the theoretical value $1 - e^{-\lambda} \approx 63.21\%$. Then, enough Pareto variables were simulated by generating standard uniform variables $u$ using R’s runif() function and inserting them into the Pareto quantile function: $q(u) = x_m (1 - u)^{-1/\alpha}$. After truncating the Pareto variables at $x^* = 0.15$, the requisite numbers given the Poisson counts $N$ were summed to give $X$.

A histogram of the non-zero results (truncating all variates simulated beyond the 99%-ile for exposition only) and its empirical distribution function, as computed by R’s ecdf() function, are given, respectively, in Figures 5.2 and 5.3.

FIGURE 5.2: PEDCP histogram given N > 0 events. [Plot omitted; Frequency against X.]

FIGURE 5.3: Compound Poisson empirical distribution function given N > 0 events. [Plot omitted; ECDF(X) against X.]
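The simulation steps just described can be sketched as follows. This is an illustrative Python translation using only the standard library rather than the R functions named in the text; the function names, the seed and the choice of Knuth’s multiplication method for the Poisson draw are assumptions made here, and the smaller sample size is chosen only to keep the example quick.

```python
import math
import random

def poisson_draw(rng, rate):
    """Knuth's multiplication method; adequate for small rates such as 1."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def truncated_pareto_draw(rng, x_m, x_star, alpha):
    """Pareto Type I quantile q(u) = x_m (1 - u)^(-1/alpha), capped at x_star."""
    u = rng.random()
    return min(x_m * (1.0 - u) ** (-1.0 / alpha), x_star)

def simulate_pedcp(n_reps, rate, alpha, x_m, x_star, seed=0):
    """Sum N truncated Pareto events per replication, with N ~ Po(rate)."""
    rng = random.Random(seed)
    losses = []
    for _ in range(n_reps):
        n_events = poisson_draw(rng, rate)
        losses.append(sum(truncated_pareto_draw(rng, x_m, x_star, alpha)
                          for _ in range(n_events)))
    return losses

# Parameters from the text: lambda = 1, alpha = 3, x_m = 0.1, x* = 0.15.
losses = simulate_pedcp(20000, rate=1.0, alpha=3.0, x_m=0.1, x_star=0.15)
non_zero = [x for x in losses if x > 0.0]
share_non_zero = len(non_zero) / len(losses)   # compare with 1 - exp(-1)
```

The list `non_zero` plays the role of the sample of X given N > 0 that the representations below are fitted to.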

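Before beginning, the support of the PEDCP variable (denoted here by $X \in [x_m, \infty) \subset \mathbb{R}$) is transformed into the finite interval $[0, 1)$. Thus, let the transformed variable be $Y$ and the transformation be $y = \varphi(x;\kappa) := 1 - e^{-\kappa(x - x_m)}$ for some $\kappa \in \mathbb{R}_+$ (here fixed at $\kappa = 0.8$). The distribution function $F_X(x)$ is then:

\[
\begin{aligned}
F_X(x) &= \int_{x_m}^{x} f_X(x')\,dx' \\
&= \int_{0}^{\varphi(x;\kappa)} f_X\bigl(\varphi^{-1}(y';\kappa)\bigr)\,d\varphi^{-1}(y';\kappa) \\
&= \int_{0}^{\varphi(x;\kappa)} \frac{f_X\bigl(\varphi^{-1}(y';\kappa)\bigr)}{\kappa(1 - y')}\,dy' \\
&= \int_{0}^{y} g_Y(y';\kappa)\,dy' =: G_Y(y;\kappa),
\qquad \text{where } g_Y(y;\kappa) := \frac{f_X\bigl(\varphi^{-1}(y;\kappa)\bigr)}{\kappa(1 - y)} .
\end{aligned}
\]

Note that the steps above are possible since the inverse transformation of $\varphi$ is found via

\[
x = x_m + \frac{1}{\kappa}\log\left(\frac{1}{1-y}\right) =: \varphi^{-1}(y;\kappa)
\]

and so

\[
\frac{d}{dy}\varphi^{-1}(y;\kappa) = \frac{1}{\kappa}\frac{d}{dy}\log\left(\frac{1}{1-y}\right) = \frac{1}{\kappa(1-y)} .
\]

In what follows, finite-element representations of both the PEDCP distribution and density functions are given. As will be seen, the latter leads to a more satisfactory representation.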
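As a quick check of the change of variables, the sketch below codes the transformation, its inverse and the transformed density. The function names are illustrative, and the exponential test density is an assumption chosen only because it maps to a uniform density on [0, 1), which makes the formula easy to verify.

```python
import math

def phi(x, kappa, x_m):
    """Map x in [x_m, inf) to y in [0, 1)."""
    return 1.0 - math.exp(-kappa * (x - x_m))

def phi_inverse(y, kappa, x_m):
    """Inverse map: x = x_m + log(1 / (1 - y)) / kappa."""
    return x_m + math.log(1.0 / (1.0 - y)) / kappa

def transformed_density(f_x, y, kappa, x_m):
    """g_Y(y) = f_X(phi^{-1}(y)) / (kappa (1 - y))."""
    return f_x(phi_inverse(y, kappa, x_m)) / (kappa * (1.0 - y))

kappa, x_m = 0.8, 0.1

def shifted_exponential_density(x):
    # Hypothetical test density: X - x_m ~ Exp(kappa), so that Y is uniform.
    return kappa * math.exp(-kappa * (x - x_m))

round_trip = phi_inverse(phi(0.37, kappa, x_m), kappa, x_m)
g_at_point = transformed_density(shifted_exponential_density, 0.6, kappa, x_m)
```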

5.4.1.1 Distribution function


Let the (transformed) distribution function $G_Y$ be given a representation in terms of the linear finite-element basis $B_K$ over a given number $K \in \mathbb{N}$ of elements:

\[
B_K = \bigl\{\phi_k \in C^0([0,1]) : \phi_k(y) = K\max\bigl(\min(y - y_{k-1},\, y_{k+1} - y),\, 0\bigr),\;\; y_k = k/K,\;\; 0 \le k \le K \bigr\} .
\]

Each element of the basis has the form of a “witch’s hat”: it is zero everywhere except over three consecutive nodes, $y_{k-1}$, $y_k$ and $y_{k+1}$, where it interpolates the values 0, 1, and 0, respectively. The representation is

\[
G_Y(y) = \sum_{k=0}^{K} \tilde{G}_k\,\phi_k(y), \qquad y \in [0,1],
\]

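and is equivalent to a continuous linear spline function. Note that $G_Y(y_k) = \tilde{G}_k$ by construction as $\phi_k(y_k) = 1$. To determine the coefficients, observe that for each $j \in [0, K]$:

\[
\begin{aligned}
\sum_{k=0}^{K} \tilde{G}_k \int_0^1 \phi_j(y)\phi_k(y)\,dy &= \int_0^1 G_Y(y)\phi_j(y)\,dy \\
\text{iff}\quad \sum_{k=0}^{K} A_{jk}\tilde{G}_k &= G_Y(y)\Phi_j(y)\Big|_0^1 - \int_0^1 g_Y(y)\Phi_j(y)\,dy \\
&= \Phi_j(1) - E_Y[\Phi_j(Y)] .
\end{aligned}
\]

This is of the form $A\tilde{G} = \Phi(1) - E_Y[\Phi(Y)]$. The elements of $A$ are, for $0 < j = k < K$:

\[
A_{jj} = \int_0^1 \phi_j(y)^2\,dy = \frac{2}{3K}
\]

with $A_{00} = A_{KK} = 1/(3K)$. Elsewhere, for $0 \le j, k \le K$ and $|j - k| = 1$:

\[
A_{jk} = \int_0^1 \phi_j(y)\phi_k(y)\,dy = \frac{1}{6K}
\]

while for $|j - k| > 1$, $A_{jk} = 0$. Finally, observe that $\Phi_j$ is an anti-derivative of $\phi_j$. Thus, for $0 < j < K$:

\[
\Phi_j(y) =
\begin{cases}
0 & \text{if } y < y_{j-1} \\
K(y - y_{j-1})^2/2 & \text{if } y_{j-1} \le y < y_j \\
1/K - K(y_{j+1} - y)^2/2 & \text{if } y_j \le y < y_{j+1} \\
1/K & \text{if } y \ge y_{j+1}
\end{cases}
\]

while for $j = 0$:

\[
\Phi_0(y) =
\begin{cases}
1/(2K) - K(y_1 - y)^2/2 & \text{if } y_0 \le y < y_1 \\
1/(2K) & \text{if } y \ge y_1
\end{cases}
\]

and for $j = K$:

\[
\Phi_K(y) =
\begin{cases}
0 & \text{if } y < y_{K-1} \\
K(y - y_{K-1})^2/2 & \text{if } y_{K-1} \le y \le y_K .
\end{cases}
\]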
Here, $K = 1000$ finite elements were used to represent the information carried in the 31440 non-zero instances of the PEDCP variable that were simulated. The latter figure is the number of doubles that would otherwise be held in memory if the empirical distribution function were to be used for simulation. The estimated distribution function is shown in Figure 5.4.
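A minimal sketch of the estimation just described is given below, with the expectation replaced by a sample mean and the tridiagonal system solved by the Thomas algorithm. The function names, the solver choice and the uniform test sample are assumptions made for illustration; a uniform sample is used because its exact distribution function G(y) = y lies in the span of the basis, so the fitted coefficients should sit close to k/K.

```python
import random

def hat_antiderivative(j, y, K):
    """Phi_j(y): integral from 0 to y of the hat function centred on node j/K."""
    h = 1.0 / K
    left, mid, right = (j - 1) * h, j * h, (j + 1) * h
    total = 0.0
    if j > 0:                                   # rising edge on [left, mid]
        t = min(max(y, left), mid) - left
        total += K * t * t / 2.0
    if j < K:                                   # falling edge on [mid, right]
        t = min(max(y, mid), right)
        total += h / 2.0 - K * (right - t) ** 2 / 2.0
    return total

def solve_tridiagonal(diag, off, rhs):
    """Thomas algorithm for a symmetric tridiagonal system, constant off-diagonal."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = off / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def fe_cdf_coefficients(y_samples, K):
    """Solve A G = Phi(1) - E[Phi(Y)], with expectations taken as sample means."""
    n = len(y_samples)
    rhs = []
    for j in range(K + 1):
        mean_phi = sum(hat_antiderivative(j, y, K) for y in y_samples) / n
        rhs.append(hat_antiderivative(j, 1.0, K) - mean_phi)
    diag = [2.0 / (3.0 * K)] * (K + 1)
    diag[0] = diag[K] = 1.0 / (3.0 * K)
    return solve_tridiagonal(diag, 1.0 / (6.0 * K), rhs)

rng = random.Random(1)
uniform_sample = [rng.random() for _ in range(20000)]
coefficients = fe_cdf_coefficients(uniform_sample, K=10)
```

Nothing in this construction forces the fitted coefficients to be increasing in k, so sampling noise can produce a non-monotone fit.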
Unfortunately, the distribution function is non-monotone and so this rep-
resentation is unsatisfactory: it may not be used well with the probability
integral transform for simulating instances of the PEDCP variable. A repre-
sentation which is monotone is required. A good finite-element representation
of the density function, that is, one which was everywhere non-negative, would
ensure that the resulting distribution function (now composed of monotone
increasing quadratic segments) would be everywhere increasing. This is the
subject of the next section.
FIGURE 5.4: Finite-element representation of the PEDCP distribution function. In (a) the whole distribution modeled is shown and in (b) a zoomed view around the “step” illustrates undesirable non-monotone behavior. [Plot omitted; both panels show F_X against x for the ECDF and the finite-element (FE) representation.]

5.4.1.2 Density function


Give the (transformed) density function a representation in terms of finite elements:

\[
g_Y(y) = \sum_{k=0}^{K} \tilde{g}_k\,\phi_k(y), \qquad y \in [0,1],
\]

and note that $g_Y(y_k) = \tilde{g}_k$ since $\phi_k(y_k) = 1$ (the values in between nodes $y_k$ are linearly interpolated). Note also that $\int_0^1 g_Y(y)\,dy = 1$, and this can be maintained by scaling the coefficients $\tilde{g}_k$ accordingly. If the integral’s value is currently $a$ then:

\[
a = \sum_{k=0}^{K} \tilde{g}_k \int_0^1 \phi_k(y)\,dy
  = \frac{1}{K}\left(\frac{1}{2}\tilde{g}_0 + \sum_{k=1}^{K-1}\tilde{g}_k + \frac{1}{2}\tilde{g}_K\right) .
\]

One may then divide each coefficient by $a$ to obtain a proper density function. It is henceforth assumed that this is done. For $j \in [0, K]$:

\[
\begin{aligned}
\sum_{k=0}^{K} \tilde{g}_k \int_0^1 \phi_j(y)\phi_k(y)\,dy &= \int_0^1 g_Y(y)\phi_j(y)\,dy \\
\text{iff}\quad \sum_{k=0}^{K} A_{jk}\tilde{g}_k &= \int_0^1 g_Y(y)\phi_j(y)\,dy = E_Y[\phi_j(Y)] .
\end{aligned}
\]

This is of the form $A\tilde{g} = e$, where $e = E_Y[\phi(Y)]$, with the same matrix $A$ as in the distribution function representation. The estimated density function is shown in Figure 5.5.
FIGURE 5.5: Finite-element representation of the density function (given N > 0 events). [Plot omitted; f_X against x.]

This solution is noisy and is negative valued for some values of the variable. To mitigate these problems, seek instead a viscosity solution employing a regularization technique. Note that the solution $x = g$ of the linear system described above, denoted by $Ax = e$, is also the solution to a quadratic programming problem. Let $\|v\|_2^2 = \sum_{i=1}^{n} v_i^2$ be the familiar squared 2-norm over all vectors $v \in \mathbb{R}^n$. If $h(x;\lambda) = \|Ax - e\|_2^2 + \lambda\|x\|_2^2$ ($\lambda \in \mathbb{R}_+$) then the equivalent quadratic program is

\[
g = \operatorname*{argmin}_{x \in \mathbb{R}^{K+1}} h(x; 0) .
\]

The viscosity solution requires $\lambda > 0$ and is

\[
g_\lambda = \operatorname*{argmin}_{x \in \mathbb{R}^{K+1}} h(x;\lambda), \qquad \lambda > 0 .
\]

Note that the objective may be written as

\[
\begin{aligned}
h(x;\lambda) &= x^\top A^\top A x - 2e^\top A x + e^\top e + \lambda x^\top x \\
&= x^\top (A^\top A + \lambda I)x - 2e^\top A x + e^\top e .
\end{aligned}
\]

Let the matrix $A^\top A + \lambda I$ have the Cholesky factorization $LL^\top$ for the lower triangular matrix $L \in \mathbb{R}^{n \times n}$, where $n = K + 1$. Then defining $e^* := L^{-1}A^\top e$ (or equivalently, $e = A^{-\top}Le^*$) gives

\[
\begin{aligned}
h(x;\lambda) &= x^\top LL^\top x - 2(A^{-\top}Le^*)^\top A x + (A^{-\top}Le^*)^\top (A^{-\top}Le^*) \\
&= \|L^\top x - L^{-1}A^\top e\|_2^2 + \|e\|_2^2 - \|L^{-1}A^\top e\|_2^2 .
\end{aligned}
\]

To find the $\lambda$-viscosity solution one may not simply minimize $h$ over $x$ and $\lambda$ simultaneously, because that minimization would drive $\lambda \to 0$. Instead, one observes that for fixed $\lambda > 0$ any solution satisfies:

\[
x = L^{-\top}L^{-1}A^\top e = (LL^\top)^{-1}A^\top e = (A^\top A + \lambda I)^{-1}A^\top e .
\]

The $\lambda$ hyper-parameter is estimated by an out-of-sample method using 10-fold cross-validation (Figure 5.6).
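As a cross-check on the fixed-$\lambda$ solution, the same expression follows from the first-order condition of the objective. The short derivation below is a standard ridge-regression argument added here for illustration and uses only the symbols already defined above.

```latex
\[
\nabla_x h(x;\lambda) = 2\,(A^\top A + \lambda I)\,x - 2A^\top e = 0
\quad\Longleftrightarrow\quad
x = (A^\top A + \lambda I)^{-1} A^\top e ,
\]
\[
\nabla_x^2 h(x;\lambda) = 2\,(A^\top A + \lambda I) \succ 0 \quad \text{for } \lambda > 0 ,
\]
```

so the stationary point is the unique minimizer for every fixed $\lambda > 0$.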
FIGURE 5.6: Hyper-parameter λ estimation using 10-fold cross-validation. The y-axis scale is suppressed as the difference between the smallest and largest objective criteria was 2.29 × 10^−9 with the lower grey horizontal line measuring 0.03583. [Plot omitted; optimization criterion against lambda over the range 20 to 100.]

The normalization of the density, which ensures that it integrates to one over its domain, prevents further reduction in the objective beyond a value of around $\lambda \approx 100$, where the objective asymptotes to a constant yet non-zero value: 0.0358326. The regularized solution may therefore be taken as the solution for large $\lambda$ and is therefore:

\[
g^\infty = \frac{A^\top e}{a}
\]

where $a$ is the normalization constant, computed from the unnormalized vector $\hat{g} := A^\top e$:

\[
a = \frac{1}{K}\left(\frac{1}{2}\hat{g}_0 + \sum_{i=1}^{K-1}\hat{g}_i + \frac{1}{2}\hat{g}_K\right) .
\]

Therefore, there is no need to solve the system $(A^\top A + \lambda I)x = A^\top e$ and hence no need to hold the (often prohibitively large) matrix $A^\top A + \lambda I$. The results are shown in Figure 5.7a where the viscosity solution is shown in red superimposed on top of the non-regularized solution.

Prior to seeking a viscosity solution ($\lambda = 0$) the range of values for $g_Y$ was (−66.67, 272.7). Having taken the regularized solution (i.e., where $\lambda$ is assumed sufficiently large) the range of $g_Y$ is (0, 111.3). There is no need to artificially floor values at zero to ensure a valid density function, while the associated distribution function will necessarily be increasing on its domain.
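The large-λ shortcut can be sketched in a few lines. For large λ the matrix $(A^\top A + \lambda I)^{-1}$ behaves like $\lambda^{-1} I$, so the solution is proportional to $A^\top e$ and the factor $\lambda^{-1}$ disappears on normalization. In the sketch below the function names and the uniform test sample are assumptions made for illustration; only the three non-zero bands of A are used, so the matrix is never formed.

```python
import random

def hat(j, y, K):
    """Linear finite-element basis function with peak 1 at node j/K."""
    return max(0.0, 1.0 - abs(K * y - j))

def regularized_density_coefficients(y_samples, K):
    """Large-lambda solution: A^T e divided by its trapezoidal integral."""
    n = len(y_samples)
    # e_j = E[phi_j(Y)], estimated by a sample mean
    e = [sum(hat(j, y, K) for y in y_samples) / n for j in range(K + 1)]
    diag = [2.0 / (3.0 * K)] * (K + 1)
    diag[0] = diag[K] = 1.0 / (3.0 * K)
    off = 1.0 / (6.0 * K)
    # A is symmetric and tridiagonal, so A^T e = A e needs only three bands
    unnormalized = []
    for j in range(K + 1):
        value = diag[j] * e[j]
        if j > 0:
            value += off * e[j - 1]
        if j < K:
            value += off * e[j + 1]
        unnormalized.append(value)
    a = trapezoid_integral(unnormalized, K)
    return [g / a for g in unnormalized]

def trapezoid_integral(g, K):
    """Integral over [0, 1] of the piecewise-linear function with nodal values g."""
    return (0.5 * g[0] + sum(g[1:K]) + 0.5 * g[K]) / K

rng = random.Random(2)
sample = [rng.random() for _ in range(5000)]
density = regularized_density_coefficients(sample, K=20)
```

Because every entry of A and of e is non-negative, every coefficient produced this way is non-negative as well, which is consistent with the observation above that no flooring at zero is needed.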
Given the density function representation for $g_Y$, a distribution function $G_Y$ is sought. This can be obtained by directly integrating the finite-element representation, which reduces to integrating each element $\phi_k$; the result was previously derived as $\Phi_k$:

\[
G_Y(y) = \sum_{k=0}^{K} \tilde{g}_k\,\Phi_k(y) .
\]

FIGURE 5.7: Regularized finite-element representation of the PEDCP (a) density and (b) distribution function. [Plot omitted; panel (a) shows f_X against x for the classic and regularized solutions, panel (b) shows F_X against x for the ECDF and the finite-element representation.]

The results are shown in Figure 5.7b. One may simulate from this distribution function using the probability integral transform, assured that it is continuous, monotone increasing and smooth (being piecewise quadratic). This is unlike the case of the representation of the PEDCP distribution function, which was non-monotone and only continuous. This is made even more remarkable given that it was not necessary to solve a large linear system in achieving this result.
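Simulation by the probability integral transform from the integrated representation might look as follows. This is a sketch under stated assumptions: the chapter does not say how the distribution function is inverted, so bisection is used here as a simple stand-in that relies only on monotonicity, and the constant coefficient vector is a hypothetical test case (a uniform density on [0, 1]) rather than the fitted PEDCP density.

```python
import math

def hat_antiderivative(j, y, K):
    """Phi_j(y): integral from 0 to y of the hat function centred on node j/K."""
    h = 1.0 / K
    left, mid, right = (j - 1) * h, j * h, (j + 1) * h
    total = 0.0
    if j > 0:
        t = min(max(y, left), mid) - left
        total += K * t * t / 2.0
    if j < K:
        t = min(max(y, mid), right)
        total += h / 2.0 - K * (right - t) ** 2 / 2.0
    return total

def fe_cdf(g, y):
    """G_Y(y) = sum_k g_k Phi_k(y) for nodal density values g_0, ..., g_K."""
    K = len(g) - 1
    return sum(g[k] * hat_antiderivative(k, y, K) for k in range(K + 1))

def fe_quantile(g, u, tolerance=1e-12):
    """Invert the monotone distribution function on [0, 1] by bisection."""
    low, high = 0.0, 1.0
    while high - low > tolerance:
        mid = 0.5 * (low + high)
        if fe_cdf(g, mid) < u:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)

def phi_inverse(y, kappa, x_m):
    """Map a draw on the transformed scale back to the loss scale."""
    return x_m + math.log(1.0 / (1.0 - y)) / kappa

uniform_density = [1.0] * 11          # hypothetical coefficients, K = 10
y_draw = fe_quantile(uniform_density, 0.3)
x_draw = phi_inverse(y_draw, kappa=0.8, x_m=0.1)
```

In a scenario generator the argument `u` would be the (possibly correlated) uniform produced by the copula, so the same routine serves both for stand-alone simulation and for correlation with other risk drivers.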

5.4.2 Stochastic volatility and jump diffusion representation

A detailed equity model within an ESG might be the stochastic volatility and jump diffusion (SVJD) model. As previously discussed in Section 5.2.4, it comprises two parts: the first is Heston’s stochastic volatility model and the second is Merton’s jump diffusion model. The aim of this section is to enable the equity model to hit the target correlations that have been set between its returns and any other variables in the ESG (including possibly other instances of SVJD models). A difficulty occurs if one correlates the equity asset shock distribution conditioned upon the size of the stochastic variance and the jump shock with other asset shocks. Although incorrect, this may not seem an unreasonable approach since most asset shocks are simple Brownian shocks. Rather, it is the detailed nature of the SVJD asset’s unconditional shock that causes the oversight, as it is composed of five random components: the equity asset’s Brownian shock, the Merton jump frequency and size, the current level of the stochastic variance and the change in the level of the stochastic variance. In correlating only the Brownian shock, one discovers that the returns correlations emerging from a scenario generator simulation file are subject to a systematic bias compared to their targets. The solution is to correlate the unconditional equity asset shock with any other asset shock, but this has no known distributional form and is, hence, unavailable analytically.
In this section I will solve the SVJD asset returns correlation problem by
giving a representation of the unconditional equity asset shock distribution
at each modeled period. I will then be able to correlate the equity returns
with any other assets using the probability integral transform. As in Section
5.4.1, this is done by first simulating instances of the process in isolation. The
difference between this and the previous section is that an entire process is
simulated in isolation, rather than a single random variable, which leads to a
consistent sequence of conditional marginals indexed by time. Quasi-random
number generation is used to improve the standard error in the representa-
tions’ coefficients. This approach is new.

5.4.2.1 Distributional representation


Suppose the distribution function of the SVJD equity asset shock (for the moment, denote this by $X$) is to be given a representation in terms of the orthogonal Chebyshev polynomial basis of the first kind:

\[
B' = \bigl\{T_n \in C^\infty(\mathbb{R}) : T_n(x) = \cos[n\cos^{-1}x],\; n \in \mathbb{Z}_+ \bigr\} .
\]

Observe that this basis satisfies the orthogonality condition:

\[
\int_{-1}^{1} T_m(x)T_n(x)w(x)\,dx = \frac{1}{2}\pi(1 + \delta_{m0})\delta_{mn}, \qquad
w(x) = \frac{1}{\sqrt{1 - x^2}}, \qquad m, n \in \mathbb{Z}_+ .
\]
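The orthogonality condition can be checked numerically with Gauss-Chebyshev quadrature, which absorbs the weight $w(x)$ into its nodes. The sketch below is illustrative only; the function names and the number of nodes are assumptions.

```python
import math

def chebyshev_t(n, x):
    """Chebyshev polynomial of the first kind on [-1, 1]."""
    return math.cos(n * math.acos(x))

def weighted_inner_product(m, n, n_nodes=64):
    """Gauss-Chebyshev quadrature of T_m T_n against w(x) = 1 / sqrt(1 - x^2)."""
    total = 0.0
    for i in range(1, n_nodes + 1):
        node = math.cos((2 * i - 1) * math.pi / (2 * n_nodes))
        total += chebyshev_t(m, node) * chebyshev_t(n, node)
    return math.pi / n_nodes * total

norm_zero = weighted_inner_product(0, 0)     # expected: pi
norm_three = weighted_inner_product(3, 3)    # expected: pi / 2
cross_term = weighted_inner_product(2, 5)    # expected: 0
```

The two diagonal values are the constants that divide the coefficient formulas derived later in this section.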

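The support of $X$ is the entire real line $\mathbb{R}$ and must be mapped into the finite interval $(-1, 1)$. Let the transformed variable be $Y$ and the transformation be $y = \varphi(x;\kappa) := \tanh(\kappa x)$ for some $\kappa \in \mathbb{R}_+$ (to be chosen presently). The distribution function $F_X(x)$ is then:

\[
\begin{aligned}
F_X(x) &= \int_{-\infty}^{x} f_X(x')\,dx' \\
&= \int_{-1}^{\varphi(x;\kappa)} f_X\bigl(\varphi^{-1}(y';\kappa)\bigr)\,d\varphi^{-1}(y';\kappa) \\
&= \int_{-1}^{\varphi(x;\kappa)} \frac{f_X\bigl(\varphi^{-1}(y';\kappa)\bigr)}{\kappa\bigl(1 - (y')^2\bigr)}\,dy' \\
&= \int_{-1}^{y} g_Y(y';\kappa)\,dy' =: G_Y(y;\kappa),
\qquad \text{where } g_Y(y;\kappa) := \frac{f_X\bigl(\varphi^{-1}(y;\kappa)\bigr)}{\kappa(1 - y^2)} .
\end{aligned}
\]

Note that the steps above are possible since the inverse transformation of $\varphi$ is found via:

\[
x = \frac{1}{2\kappa}\log\left(\frac{1+y}{1-y}\right) =: \varphi^{-1}(y;\kappa)
\qquad\text{so that}\qquad
\frac{d}{dy}\varphi^{-1}(y;\kappa) = \frac{1}{\kappa(1 - y^2)} .
\]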
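Let the (transformed) distribution function $G_Y$ be given by the representation in terms of the orthogonal Chebyshev polynomial basis $B'$ (and truncated after $K \in \mathbb{N}$ terms):

\[
G_Y(y) = \sum_{k=0}^{\infty} \tilde{G}_k T_k(y) \approx \sum_{k=0}^{K-1} \tilde{G}_k T_k(y) .
\]

I now proceed directly to derive expressions for the coefficients $\tilde{G}_k$. First, observe the following intermediate result which holds for Chebyshev polynomials. Since $y \in [-1, 1]$, let $y = \cos u$ so that $dy = -\sin u\,du$; then, $\forall k > 0$:

\[
\int T_k(y)w(y)\,dy = \int \frac{\cos[k\cos^{-1}y]}{\sqrt{1 - y^2}}\,dy
= -\frac{1}{k}\sin[k\cos^{-1}y] + c = -\frac{1}{k}T_k^{\#}(y) + c
\]

having defined $T_k^{\#}(y) = \sin[k\cos^{-1}y]$. Note that $T_k^{\#}(1) = \sin(0) = 0$ and $T_k^{\#}(-1) = \sin(k\pi) = 0$. If $k = 0$ then the above integral is $-\cos^{-1}y + c$. Writing $a_{kk} = \frac{1}{2}\pi(1 + \delta_{k0})$ for the constant in the orthogonality condition, the coefficients $\tilde{G}_k$ may be obtained, for $k = 0$, as

\[
\begin{aligned}
a_{00}\tilde{G}_0 &= \int_{-1}^{1} G_Y(y)w(y)\,dy \\
&= -G_Y(y)\cos^{-1}(y)\Big|_{-1}^{1} + \int_{-1}^{1} g_Y(y)\cos^{-1}y\,dy \\
\text{iff}\quad \tilde{G}_0 &= \frac{1}{a_{00}}E_Y[\cos^{-1}Y]
\end{aligned}
\]

and, $\forall k = 1, \ldots, K - 1$, as:

\[
\begin{aligned}
a_{kk}\tilde{G}_k &= \int_{-1}^{1} G_Y(y)T_k(y)w(y)\,dy \\
&= -\frac{1}{k}G_Y(y)T_k^{\#}(y)\Big|_{-1}^{1} + \frac{1}{k}\int_{-1}^{1} g_Y(y)T_k^{\#}(y)\,dy \\
&= \frac{1}{k}\int_{-1}^{1} g_Y(y)T_k^{\#}(y)\,dy
\qquad\text{iff}\qquad \tilde{G}_k = \frac{1}{a_{kk}k}E_Y[T_k^{\#}(Y)] .
\end{aligned}
\]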

In practice, one can now empirically estimate the coefficients $\tilde{G}_k$ by generating enough samples from the unconditional distribution of $X$, transforming to $Y = \varphi(X;\kappa)$, and approximating the expectations above by their sample means. The objective is now to simulate the correct variable $X$ that will enable, ultimately, recovery of $F_X$.
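The sample-mean estimator just described can be sketched as below. The function names are illustrative, and the uniform sample on (−1, 1) is an assumption chosen purely as a test case: its distribution function (1 + y)/2 has the exact expansion ½T₀ + ½T₁, so the first two coefficients should be close to 0.5 and the rest close to zero. For the SVJD shock one would instead pass in $y = \tanh(\kappa x)$ for each simulated shock $x$.

```python
import math
import random

def chebyshev_cdf_coefficients(y_samples, n_terms):
    """Estimate G_k from sample means of arccos(Y) and sin(k arccos(Y))."""
    n = len(y_samples)
    angles = [math.acos(y) for y in y_samples]
    coefficients = [sum(angles) / n / math.pi]            # k = 0, a_00 = pi
    for k in range(1, n_terms):
        mean_sine = sum(math.sin(k * a) for a in angles) / n
        coefficients.append(mean_sine / (k * math.pi / 2.0))   # a_kk = pi / 2
    return coefficients

def chebyshev_cdf(coefficients, y):
    """Evaluate the truncated series sum_k G_k T_k(y)."""
    angle = math.acos(y)
    return sum(c * math.cos(k * angle) for k, c in enumerate(coefficients))

rng = random.Random(3)
sample = [2.0 * rng.random() - 1.0 for _ in range(20000)]
coefficients = chebyshev_cdf_coefficients(sample, n_terms=8)
cdf_at_point = chebyshev_cdf(coefficients, 0.3)            # exact value is 0.65
```

Only the `n_terms` coefficients need to be stored per distribution once they have been estimated; the simulated sample can then be discarded.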

5.4.2.2 The SVJD model and the combined equity asset shock
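The SVJD model has the continuous time SDE representation:

\[
\begin{aligned}
\frac{dS_t}{S_t} &= (\mu - \lambda\bar{\mu})\,dt + \sqrt{v_t}\,dW_t^{(1)} + (\eta_t - 1)\,dN_t \\
dv_t &= \alpha(\theta - v_t)\,dt + \xi\sqrt{v_t}\,dW_t^{(2)}
\end{aligned}
\]

where $W_t^{(i)}$ ($i = 1, 2$) are correlated Brownian motions: $\mathrm{Cor}[\delta W_t^{(1)}, \delta W_t^{(2)}] = \rho$ in the limit (see footnote 28) as $\delta t \to 0$. The stochastic variable $N_t$ is the standard Poisson counting process with $\delta N_t \sim \mathrm{Po}(\lambda\delta t)$, again in the limit as $\delta t \to 0$. The jump sizes are controlled by the stochastic variable $\eta_t \sim \log\text{-}N(\bar{\mu}, \sigma^2)$.

Consider now the equivalent form:

\[
\begin{aligned}
\frac{dS_t}{S_t} &= (\mu - \lambda\bar{\mu})\,dt + dX_t \\
dX_t &= \sqrt{v_t}\,dW_t^{(1)} + (\eta_t - 1)\,dN_t \\
dv_t &= \alpha(\theta - v_t)\,dt + \xi\sqrt{v_t}\,dW_t^{(2)}
\end{aligned}
\]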
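It is for the distribution function of the variable $\delta X_t | \mathcal{F}_t$ that I seek a representation in terms of orthogonal polynomials. For the equity $S_t$ and equity shock term $\delta X_t$, the Euler–Maruyama discretization from continuous time to discrete time (see footnote 29) $t = i\Delta t$, $i \in \mathbb{N}$, is ($S_0$ given):

\[
\begin{aligned}
\frac{\Delta S_i}{S_{i-1}} &= (\mu - \lambda\bar{\mu})\Delta t + \Delta X_i \\
\Delta X_i &= \sqrt{v_{i-1}\Delta t}\left(\sqrt{1 - \rho^2}\,Z_i^{(1)} + \rho Z_i^{(2)}\right) + \left(e^{\bar{\mu} + \sigma Z_i^{(3)}} - 1\right)\Delta N_i \\
\Delta v_i &= \alpha(\theta - v_{i-1})\Delta t + \xi\sqrt{v_{i-1}\Delta t}\,Z_i^{(2)}
\end{aligned}
\]

Here, $Z_i^{(1:3)} \sim_{\mathrm{iid}} N(0, 1)$ and $\Delta N_i \sim \mathrm{Po}(\lambda\Delta t)$. Since the equity shock depends on the stochastic variance at time index $i - 1$ and upon the stochastic variance’s current normal shock $Z_i^{(2)}$, the equity shock depends upon both $v_{i-1}$ and $\Delta v_i$ as follows:

\[
\begin{aligned}
\Delta X | (v, \Delta v) &:= \sqrt{v\Delta t}\left(\sqrt{1 - \rho^2}\,Z^{(1)} + \rho Z^{(2)}\right) + \left(e^{\bar{\mu} + \sigma Z^{(3)}} - 1\right)\Delta N \\
&= \sqrt{v\Delta t}\left(\sqrt{1 - \rho^2}\,Z^{(1)} + \rho\,\frac{\Delta v - \alpha(\theta - v)\Delta t}{\xi\sqrt{v\Delta t}}\right) + \left(e^{\bar{\mu} + \sigma Z^{(3)}} - 1\right)\Delta N \\
&= \frac{\rho}{\xi}\bigl(\Delta v - \alpha(\theta - v)\Delta t\bigr) + \sqrt{(1 - \rho^2)v\Delta t}\,Z^{(1)} + \left(e^{\bar{\mu} + \sigma Z^{(3)}} - 1\right)\Delta N
\end{aligned}
\]

28 The notation $\delta X_t$, for some stochastic variable $X_t$ indexed by continuous time $t > 0$, is a short-hand for $X_t - X_{t-\delta t}$ where $\delta t \in \mathbb{R}$ s.t. $0 < \delta t \ll 1$.
29 Again, but in discrete time, I write $\Delta X_i = X_i - X_{i-1}$.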
However, in order to be able to correlate two SVJD processes (or an SVJD


process and another asset), a representation is needed for the {unconditional}
distribution of the variable ΔXi marginalizing out the values v for the variance
(corresponding to its value at time index i − 1), to its step change Δv over
the time interval ti−1 = (i − 1)Δt to ti = iΔt, to the equity asset’s Brownian
shock, to its Poisson jump intensity and log-normal jump size. Since this is to
be achieved through simulation, the process of marginalization is trivial: one
simulates each of the quantities at a given time step and once the equity asset
shock has been formed, they are simply discarded. One must be mindful to
observe the correct conditioning on the step change in the variance, but this
is simple to arrange.
For the purposes of exposition, the following parametrization for the SVJD
model was considered: μ = 0, μ̄ = −0.4, λ = 0.1, σ = 0.2767, α = 0.02462, ξ =
0.1088, θ = 0.25, v0 = 0.02206 and ρ = −0.9462. In simulating SVJD paths
over 30 years, a time step of Δt = 1/12 was used leading to 360 time steps
in total. A Spearman ranks-based shock correlation of 0.8 with a second,
identically parametrized yet independent, SVJD process was set. Since the support of the unconditional equity asset shock variable $\Delta X$ is the whole real line, a value of $\kappa = 3.664$ was found to be appropriate (see footnote 30) for the transformation $y = \varphi(\Delta x;\kappa) = \tanh(\kappa\Delta x)$. A representation in Chebyshev polynomials for the distribution function of $\Delta X$ is now sought: $F_{\Delta X}(x) = F_{\Delta X}(\varphi^{-1}(y;\kappa)) =: G_Y(y;\kappa)$.

5.4.2.3 Unconditional equity asset shock distribution


To simulate the unconditional $\Delta X$ shock (which will depend on time horizon $T$ because the scaled non-central chi-squared solution to the CIR process depends on time horizon $T$), the conditional equity asset shock representation of the preceding section was used to marginalize the joint distribution of $(\Delta X, v, \Delta v, Z^{(1)}, Z^{(3)}, \Delta N)$ over the five-tuple $(v, \Delta v, Z^{(1)}, Z^{(3)}, \Delta N)$:

\[
\Delta X = \frac{\rho}{\xi}\bigl(\Delta v - \alpha(\theta - v)\Delta t\bigr) + \sqrt{(1 - \rho^2)v\Delta t}\,Z^{(1)} + \left(e^{\bar{\mu} + \sigma Z^{(3)}} - 1\right)\Delta N
\]

for every time horizon of interest, that is, $t_i = i\Delta t$ for each $i = 1, 2, 3, \ldots, T_{\max} = 360$. At times $t > s \ge 0$ the continuous-time CIR process conditioned on information up until time $s$ is

\[
v_t | v_s \sim k(t - s)\,\chi^2\bigl(\nu, \lambda(t - s, v_s)\bigr)
\]

where

\[
k(t) = \frac{\xi^2(1 - e^{-\alpha t})}{4\alpha}, \qquad
\nu = \frac{4\alpha\theta}{\xi^2}, \qquad
\lambda(t, v) = \frac{4\alpha v}{\xi^2(e^{\alpha t} - 1)}
\]

(here $\lambda(t, v)$ denotes the non-centrality parameter, not the jump intensity). Using these results, variables $v_i \sim k(t_i)\chi^2(\nu, \lambda(t_i, v_0))$, $v_i | v_{i-1} \sim k(\Delta t)\chi^2(\nu, \lambda(\Delta t, v_{i-1}))$ and $Z^{(1:3)} \sim_{\mathrm{iid}} N(0, 1)$ were generated and the variance shock set to $\Delta v_i = (v_i | v_{i-1}) - v_{i-1}$. At each time-index $i$, the unconditional equity asset shock orthogonal polynomial representations were created for each variable $\Delta X_i$.

30 By taking $\kappa = \log(2/\varepsilon - 1)$, with $0 < \varepsilon \ll 1$ (practically taking $\varepsilon = 0.05$), this guarantees the transformation $y = \varphi(x;\kappa) = \tanh(\kappa x)$ will have a value of $-1 + \varepsilon$ at $x = -1/2$. This ensures that the lower tail of the equity asset shock distribution, here dominated by the lognormal variable, is adequately modeled by part of the transform before the asymptotic behavior ensues.
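One draw of the variance transition and of the resulting equity asset shock might be sketched as follows. Several choices here are assumptions rather than the chapter’s implementation: pseudo-random numbers from Python’s standard library stand in for the quasi-random numbers used in the text, the non-central chi-squared variable is generated as a Poisson mixture of central chi-squared variables, and Knuth’s Poisson method is used, which is only suitable for moderate rates (well below roughly 700, beyond which exp(−rate) underflows). All function and dictionary names are illustrative; the parameter values are those quoted in the previous section.

```python
import math
import random

def poisson_draw(rng, rate):
    """Knuth's multiplication method; suitable for moderate rates only."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def cir_transition_draw(rng, v_prev, dt, alpha, theta, xi):
    """Exact draw of v_i | v_{i-1}: a scaled non-central chi-squared variable."""
    k = xi ** 2 * (1.0 - math.exp(-alpha * dt)) / (4.0 * alpha)
    nu = 4.0 * alpha * theta / xi ** 2
    noncentrality = 4.0 * alpha * v_prev / (xi ** 2 * (math.exp(alpha * dt) - 1.0))
    # Non-central chi-squared as a Poisson mixture of central chi-squared variables
    mixing_count = poisson_draw(rng, noncentrality / 2.0)
    return k * rng.gammavariate(nu / 2.0 + mixing_count, 2.0)

def equity_shock_draw(rng, v_prev, dt, p):
    """Form Delta X from (v, Delta v, Z1, Z3, Delta N); the inputs are then discarded."""
    v_next = cir_transition_draw(rng, v_prev, dt, p["alpha"], p["theta"], p["xi"])
    dv = v_next - v_prev
    z1, z3 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    dn = poisson_draw(rng, p["lam"] * dt)
    jump = (math.exp(p["mu_bar"] + p["sigma"] * z3) - 1.0) * dn
    return (p["rho"] / p["xi"] * (dv - p["alpha"] * (p["theta"] - v_prev) * dt)
            + math.sqrt((1.0 - p["rho"] ** 2) * v_prev * dt) * z1 + jump)

PARAMETERS = {"mu_bar": -0.4, "lam": 0.1, "sigma": 0.2767, "alpha": 0.02462,
              "xi": 0.1088, "theta": 0.25, "rho": -0.9462}
V0, DT = 0.02206, 1.0 / 12.0

rng = random.Random(4)
variance_draws = [cir_transition_draw(rng, V0, DT, PARAMETERS["alpha"],
                                      PARAMETERS["theta"], PARAMETERS["xi"])
                  for _ in range(20000)]
mean_variance = sum(variance_draws) / len(variance_draws)
shock = equity_shock_draw(rng, V0, DT, PARAMETERS)
```

A simple sanity check on the transition is that its sample mean should be close to θ + (v₀ − θ)e^{−αΔt}, the known conditional mean of the CIR process.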
To understand precisely how each $\Delta X_i$ was represented, consider the following. First, a $5 \times 10^4 \times 5$ array composed of $5 \times 10^4$ instances of a five-dimensional quasi-random uniform sequence was created using Rmetrics’ R package fOptions (see footnote 31). Quasi-random variables were used rather than pseudo-random variables to avoid clumping or clustering of variables. Their effect is to span the support of the equity asset shocks more homogeneously, and with a smaller number of variables, than would have been possible had pseudo-random numbers been used. I note that quasi-random rather than pseudo-random numbers could have been used in the previous section on PEDCP variables but this was not found to be necessary. It was advantageous when modeling the equity asset process since an entire process of 360 distributions was sought, with much larger memory implications. The function [Link]() was used with a seed value of 1983 and Owen and Faure-Tezuka type scrambling. The steps in generating the data were then: (i) to generate two $5 \times 10^4$-vectors of standard normal deviates and one $5 \times 10^4$-vector of Poisson data using the third, fourth and fifth columns of the Sobol array and the appropriate quantile functions; (ii) to loop over the 360 time periods, constructing the scaled non-central chi-squared level and step-change data using the first and second columns of the Sobol array and appropriate quantile functions, and then marginalizing to determine the equity asset shock data; and (iii) to determine the coefficients in the Chebyshev representations using $K = 100$ terms.
It is important to emphasize that the derivation of the equity asset shock representations is a one-off cost that was done upfront in relatively little time. Knowledge of these distributions permits the evolution of the equity asset from $S_{i-1}$ to $S_i$ through application of the discrete equity asset dynamics for any number of Monte-Carlo trials. It is also not necessary to propagate the stochastic variance process $v_i$ and this can be ignored. However, it can be done if desired in the following way. The variance is known at time zero, $v_0$, and so what is required is the conditional distribution function of $v_i | (\Delta X_i, v_{i-1})$. This is obtained by rearranging the expression for the conditional equity asset shock given in the preceding section:

\[
v_i | (\Delta X_i, v_{i-1}) =
\underbrace{\frac{\xi}{\rho}\Delta X_i + \alpha\theta\Delta t + v_{i-1}(1 - \alpha\Delta t)}_{=:\ \chi(\Delta X_i,\, v_{i-1})\ \text{deterministic}}
\;-\;
\underbrace{\frac{\xi}{\rho}\left[\sqrt{(1 - \rho^2)v_{i-1}\Delta t}\,Z_i^{(1)} + \left(e^{\bar{\mu} + \sigma Z_i^{(3)}} - 1\right)\Delta N_i\right]}_{=:\ u_i | v_{i-1}\ \text{stochastic}}
\]

31 A package for “Pricing and Evaluating Basic Options” maintained on CRAN by the Rmetrics Association Zurich, [Link]

Recovery of vi is now only attainable conditioned upon ΔXi and vi−1 . Using
the quasi-random five-dimensional Sobol set, one must represent the condi-
tional distribution of vi as a function of the tuple (ΔXi , vi−1 ) in a bivariate
Chebyshev expansion at each step of the simulation. Only in this way can the
stochastic variance process vi be consistently realized alongside its equity asset
shock process ΔXi at the time of simulation. Given that a relatively small number of terms was necessary for an accurate representation of the equity asset shock process (around 15 terms were found to be sufficient, see Figure 5.9), one anticipates that roughly the square of this number of terms would be needed to represent the variance process. It is stressed that this step is not necessary in the projection of correlated equity assets. However, it must be done in this way to produce a consistent stochastic variance process if that is desired.
The empirical distribution functions of the equity asset shock distributions
ΔXi at different time horizons are shown in Figure 5.8. The representations
in terms of Chebyshev polynomials are omitted from the plot as they are so
similar the differences are immaterial.
Bar plots of the coefficients at different time steps, namely 1 month, 90 months (7.5 years), 180 months (15 years), and 360 months (30 years), are shown in Figure 5.9. Out of the K = 100 coefficients, only the first 15 are non-negligible.
Two equity asset processes $S_t^{(1)}$ and $S_t^{(2)}$ with the same underlying
parametrization (as given above) are now simulated where the correlation^32
between their marginal shocks is set to $\tilde\rho = 0.8$. This is done by modifying
the standard uniform shock pair $u = (u_1, u_2)^T$ to $v = (v_1, v_2)^T$ that seeds the
equity asset shock quantile function. If the correlation matrix between the two
32 Note: this is different from the correlation between the equity asset and stochastic variance shocks ρ.

FIGURE 5.8: SVJD equity asset shock distributions. (Plot of the distribution function against the asset shock at timesteps 1, 90, 180, and 360.)

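equity asset shocks is (somewhat trivially):

$$R = \begin{pmatrix} 1 & \tilde\rho \\ \tilde\rho & 1 \end{pmatrix}$$

then the Cholesky factorization of $R$ is $LL^T$ where:

$$L = \begin{pmatrix} 1 & 0 \\ \tilde\rho & \eta \end{pmatrix}, \qquad \eta := \sqrt{1 - \tilde\rho^2}.$$

Let $\boldsymbol\Phi = (\Phi, \Phi)^T$, where $\Phi$ is the standard normal distribution function, and
$\mathbf{q} = (\Phi^{-1}, \Phi^{-1})^T$, where $\Phi^{-1}$ is the standard normal quantile function, be the
independent (bivariate) standard normal distribution and quantile functions,
respectively. The modified shock pair $v = (v_1, v_2)^T$ becomes $\boldsymbol\Phi(L\mathbf{q}(u))$:

$$\begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \boldsymbol\Phi\left( \begin{pmatrix} 1 & 0 \\ \tilde\rho & \eta \end{pmatrix} \mathbf{q}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \right) = \begin{pmatrix} u_1 \\ \Phi\left(\tilde\rho\,\Phi^{-1}(u_1) + \eta\,\Phi^{-1}(u_2)\right) \end{pmatrix}$$
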
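The mapping from $u$ to $v$ above can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation used for the figures: it assumes pseudo-random uniforms from the standard library instead of the Sobol set used in the text, and the function names (`correlate_uniforms`, `pearson`, `simulate_pairs`) are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # Phi is .cdf, Phi^{-1} is .inv_cdf


def correlate_uniforms(u1, u2, rho_tilde):
    """Map independent uniforms (u1, u2) to (v1, v2) = Phi(L q(u))."""
    eta = math.sqrt(1.0 - rho_tilde ** 2)
    z1 = STD_NORMAL.inv_cdf(u1)  # q applied componentwise
    z2 = STD_NORMAL.inv_cdf(u2)
    v1 = u1  # first row of L is (1, 0), so Phi(z1) = u1
    v2 = STD_NORMAL.cdf(rho_tilde * z1 + eta * z2)
    return v1, v2


def pearson(xs, ys):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_pairs(n, rho_tilde, seed=12345):
    """Draw n modified shock pairs from pseudo-random uniforms (an assumption)."""
    rng = random.Random(seed)
    tiny = 1e-12  # keep uniforms strictly inside (0, 1) for inv_cdf
    pairs = []
    for _ in range(n):
        u1 = min(max(rng.random(), tiny), 1.0 - tiny)
        u2 = min(max(rng.random(), tiny), 1.0 - tiny)
        pairs.append(correlate_uniforms(u1, u2, rho_tilde))
    return pairs


if __name__ == "__main__":
    v1s, v2s = zip(*simulate_pairs(20000, 0.8))
    # For a Gaussian copula the correlation of the uniforms themselves is
    # (6 / pi) * asin(rho_tilde / 2), slightly below rho_tilde.
    print(pearson(v1s, v2s), 6.0 / math.pi * math.asin(0.4))
```

The pair $(v_1, v_2)$ would then seed the two marginal quantile functions. The correlation measured on the resulting shocks depends on those marginals, which is why it can differ slightly from the target $\tilde\rho$.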
A plot of the equity asset paths is given in Figure 5.10, which also shows a
scatterplot of the simulated equity asset shocks. The Spearman’s correlation
between the two equity asset processes $S_i^{(1)}$ and $S_i^{(2)}$ is 0.2422. The correlation
between the equity asset shock processes $\Delta X_i^{(1)}$ and $\Delta X_i^{(2)}$, which was targeted
at $\tilde\rho = 0.8$, was 0.8026, demonstrating the success of this exercise.
To further verify this modeling, one could compare the behavior
of the Chebyshev asset shock representation with the standard coupled SDE
representations when pricing path-dependent options. This is the subject of
future work.

FIGURE 5.9: Bar plots of the coefficients from the Chebyshev polynomial
representation of the SVJD equity asset shock distribution. (Four panels: time steps 1, 90, 180, and 360.)

5.5 Discussion
In this chapter I have described the role of scenario generators within the
context of insurance and have outlined some challenges that they present.
Motivating the chapter with a high-level overview of the Solvency 2 Euro-
pean directive, I illustrated the use of scenario generators particularly when
an insurer chooses to build its own internal model. This approach is costly but
the cost may be offset against the potentially punitive regulatory capital laid
down by the simpler yet prescriptive standard model. I described how Pillar
1 required a quantitative measure of downside risk: the 1 year VaR from an
insurer’s loss distribution. A scenario generator is then required to upload eco-
nomic and risk scenarios to its asset and liability system to enable modeling
of its market and non-market risks. Moreover, under Pillar 2, an insurer must

FIGURE 5.10: (a) Two correlated SVJD paths and (b) scatterplot of the
SVJD equity asset shocks (correlation = 0.8026). (Panel (a) plots Path 1 and Path 2 against time; panel (b) plots asset shock 2 against asset shock 1.)

go beyond the 1-year horizon and a multi-year scenario generator is potentially
required. The precise wording of Pillar 2 is deliberately vague, however,
and a small set of stylized scenarios could be used instead. Indeed, a diligent
insurer who uses Monte-Carlo trials to forecast its balance sheet must be able
to price under the risk-neutral measure at every time step of the forecast.
Rather than enduring a complexity of $O(N^2)$ under Pillar 1, an insurer would
necessarily endure a cost of $O(p \times N^2)$ where $p > 2$ is the maximum number
of periods projected: a truly daunting task if $p \gg 2$. This is perhaps the single
most-challenging problem facing economic scenario generation: it is not that
scenarios can’t be generated fast enough; it is that they face an I/O bottleneck
when they are uploaded to an ALM system.
The bottleneck raises the question of whether in-built scenario generators
within ALM systems could alleviate this problem. Unfortunately, a paradigm
shift in the way insurers model their assets and liabilities would be required.
One approach is to simplify the problem to one that is still amenable to
current ALM systems. The Least-Squares Monte Carlo method has been pio-
neered by Cathcart (2012) and implemented as a commercial software solution
by Moody’s Analytics complementing its flagship Scenario Generator prod-
uct. It leverages the use of statistical regression functions to proxy market-
consistent Monte-Carlo valuations allowing calculations in O(N )-time. An
adjacent approach is that of curve fitting (see also Cathcart (2012)). Truly
high-performance computing solutions, for their part, are beginning to enter
the sector. Aon Benfield Securities have implemented PathWise(TM), a
large-scale hedging tool for portfolios of variable annuities. Using a farm of GPUs,
gamma losses are hedged effectively in real time, with rebalancing happening
in fractions of a second (Phillips, 2016).

In Section 5.2, I outlined the modeling challenges faced by scenario
generators. Each of the major asset classes requires a model before scenario
generation can begin. Indeed, there is a rich literature in financial and quan-
titative engineering that scenario generators may rely upon. For example, I
described the taxonomy of equity models that led from the simplest arith-
metic Brownian motion model to the most complicated stochastic volatility
jump diffusion (SVJD) model. The great statistician George Box is alleged
to have said that all models are wrong but some are useful and I am inclined
to agree with this sentiment. I would go further and say that: a model is
useful particularly when it is robust and parsimonious. Given the difficulties
of nonlinear and non-convex optimization that one naturally encounters in
model calibration, sometimes less is more: difficult calibrations may serve as
a warning signal that a little bias is better than large over-fit (or no fit at
all when the optimizer ranges over the trajectory of a strange attractor). The
thorny issue of nonlinear and non-convex optimization in multiple dimensions
is a classic unsolved problem of modern mathematics. It must be treated
with great care and often, simpler is not just better, it is the only option for
coherent fits.
In Section 5.3, I introduced the concept of the risk scenario generator as
a generalization of the ESG to include non-market risks such as policyholder
lapse and surrender risk, or operational risk. RSGs typically appeal to statis-
tical distributions to describe these risks in the absence of a robust financial
theory such as that for nominal interest rate modeling. The introduction of
statistical risk drivers tied in to a brief discussion on co-dependency and to the
challenging problem of variable correlation. The converse of Sklar’s theorem
(1959) shows how to represent a multivariate risk distribution by its copula
and marginal distributions: a flexible solution to the specification of full mul-
tivariate distributions. Having access to a risk driver’s marginal distribution
becomes important and problematic when no such analytical form exists for
it. In an approach that is new, I showed how to develop representations of
two challenging marginal risk-driver distributions in the absence of any
analytical formulae: first, in a statistical model for operational risk and,
second, for the SVJD equity asset shock distribution. This type and level of
challenge is typical of those faced in ESG and RSG development. Correlation
is once again possible and the copula-marginal factorization may continue
to be used.
I purposely avoided discussion of the estimation of the term premia in
equity, fixed interest, credit, and other models when working in the real world.
Estimation of these is demanding and if using historical data or noisy data, one
must take care to perform sensitivity analyses to check for robust parameter
estimates. For example, in one approach to corporate credit spread modeling
at a very high level, it may not even be clear from the data to which credit
class a given set of spread data belongs. One may be forced to bucket the data
in a manner that is arbitrary. How would estimates change if the buckets were
modified? It is here, rather than in the fitting of models to market-consistent data,

that one should be mindful to produce standard error analyses. Economic
justification for assumptions such as the unconditional forward interest rate
(i.e., the interest rate applying to very long dated bonds) must be given.
Justification of all assumptions is core to Pillar 3 of the Solvency 2 directive.
Despite the challenges faced by scenario generators, and I have by no
means covered them all in this short chapter (any views expressed are my own,
of course, and not necessarily representative of the views held by Moody’s
Analytics), they will continue to remain core to an insurer’s ability to set its
regulatory and other capital requirements. As to the future, developments in
high-performance computing have been slow to enter the arena. However, Sol-
vency 2 has come into force only relatively recently. When insurers realize that
they can manage their businesses more cost-effectively and more profitably by
making more use of new technologies such as SaaS, HPC, and the cloud, they
will drive the pace of change and Solvency 2 will have been the catalyst.

References
Antonio, D. and Roseburgh, D. Fitting the Yield Curve: Cubic Spline Interpolation and Smooth Extrapolation. Barrie & Hibbert Knowledge Base Article, Edinburgh, UK, 2010.

Bachelier, L. Théorie mathématique du jeu. Annales Scientifiques de l’Ecole Normale Supérieure, Vol 18, pp 143–210, 1901.

Bates, D. S. Jumps and stochastic volatility: Exchange rate processes implicit in deutsche mark options. Rev. Fin. Stud., Vol 9(1), pp 69–107, 1996.

Baxter, M. and Rennie, A. Financial Calculus: An Introduction to Derivative Pricing. Cambridge University Press, Cambridge, 1996.

Berndt, A. R., Douglas, R., Duffie, D., Ferguson, F., and Schranz, D. Measuring Default Risk Premia from Default Swap Rates and EDFs. Preprint, Stanford University, 2004.

Black, F. and Scholes, M. The Pricing of Options and Corporate Liabilities. J. Political Econ., Vol 81(3), pp 637–654, 1973.

Box, G. E. P. and Muller, M. E. A note on the generation of random normal deviates. Annals Math. Stat., Vol 29(2), pp 610–611, 1958.

Brace, A., Gatarek, D., and Musiela, M. The market model of interest rate dynamics. Math. Fin., Vol 7(2), pp 127–154, 1997.

Brigo, D. and Mercurio, F. Interest Rates Models: Theory and Practice. Springer, Berlin, 2006.

Carr, P., Geman, H., Madan, D. and Yor, M. The fine structure of asset returns: An empirical investigation. J. Busin., Vol 75(2), pp 305–332, 2002.

Cathcart, M. J. Monte-Carlo simulation approaches to the valuation and risk management of unit-linked insurance products with guarantees. Doctoral thesis, School of Mathematical and Computer Sciences, Heriot-Watt University, UK, 2012.

Chang, W. Y., Abu-Amara, H., and Sanford, J. F. Transforming Enterprise Cloud Services. Springer, London, 2010.

DeGroot, M. H. and Schervish, M. Probability and Statistics. (4th ed.), Pearson, Cambridge, 2013.

Duffie, D., Pan, J., and Singleton, K. Transform analysis and asset pricing for affine jump-diffusions. Econometrica, Vol 68(6), pp 1343–1376, 2000.

EIOPA. Technical Documentation of the Methodology to Derive EIOPA’s Risk-free Interest Rate Term Structures. EIOPA technical documentation, EIOPA-BoS-15/035. [Link] 2015.

Gatheral, J. The Volatility Surface: A Practitioner’s Guide. Wiley Finance, Hoboken, NJ, 2006.

Glasserman, P. Monte-Carlo Methods in Financial Engineering. Springer, New York, NY, 2003.

Haberman, S. and Renshaw, A. Mortality, longevity and experiments with the Lee-Carter model. Lifetime Data Anal., Vol 14, pp 286, 2008. doi: 10.1007/s10985-008-9084-2.

Hagan, P., Kumar, D., Lesniewski, A. E., and Woodward, D. Managing smile risk. Wilmott Magazine, Vol 1, pp 84–108, 2002.

Hale, J. K. and Kocak, H. Dynamics and bifurcations. Texts Appl. Math., Vol 3, pp 444–494, Springer, 1991.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. (2nd ed.), Springer, New York, NY, 2009.

Heath, D., Jarrow, R., and Morton, A. Bond pricing and the term structure of interest rates: A discrete time approximation. J. Fin. Quant. An., Vol 25, pp 419–440, 1990.

Heston, S. L. A closed-form solution for options with stochastic volatility with applications to bond and currency options. Rev. Fin. Studies, Vol 6(2), pp 327–343, 1993.

Hull, J. C. Options, Futures and Other Derivatives. (6th ed.), Prentice-Hall, Upper Saddle River, NJ, 2005.

Hull, J. and White, A. Pricing interest-rate derivative securities. Rev. Fin. Stud., Vol 3(4), pp 573–592, 1990.

Lando, D. Credit Risk Modelling: Theory and Applications. Princeton University Press, Princeton, 2005.

Mackenzie, D. and Spears, T. The Formula that Killed Wall Street? The Gaussian Copula and the Material Cultures of Modelling. Working Paper, 2012.

Matsumoto, M. and Nishimura, T. Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Mod. Comp. Sim., Vol 8(1), pp 3–30, 1998.

McNeil, A. J., Frey, R., and Embrechts, P. Quantitative Risk Management: Concepts, Techniques and Tools. (2nd ed.), Princeton University Press, Woodstock, UK; New Jersey, USA, 2015.

Merton, R. C. Option pricing when underlying stock returns are discontinuous. J. Fin. Econ., Vol 3, pp 125–144, 1976.

Moudiki, T. and Planchet, F. Economic scenario generators. Book chapter in Laurent, J. P., Ragnar, N., Planchet, F. (eds) Modelling in Life Insurance – A Management Perspective. EAA Series, Springer, 2016. doi: 10.1007/978-3-319-29776-7.

Nelson, C. R. and Siegel, A. F. Parsimonious modeling of yield curves. J. Bus., Vol 60(4), pp 473–489, 1987.

Neil, C., Clark, D., Kent, J., and Verheugen, H. A Brief Overview of Current Approaches to Operational Risk under Solvency II. Milliman White Paper series, 2012.

Noble, B. and Daniel, J. W. Applied Linear Algebra. Prentice-Hall, Englewood Cliffs, NJ, 1977.

Phillips, P. PathWise(TM) High Productivity Computing Platform. Aon Securities Inc., Aon Benfield. [Link] [Link], 2016.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. Numerical Recipes: The Art of Scientific Computation (3rd ed.), Cambridge University Press, New York, NY, 2007.

Salahnejhad Ghalehjooghi, A. and Pelsser, A. Time-Consistent Actuarial Valuations. Available at SSRN: [Link] or [Link] 2015.

Sklar, A. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, Vol 8, pp 229–231, 1959.

Transtrum, M. K. and Sethna, J. P. Improvements to the Levenberg-Marquardt Algorithm for Nonlinear Least-Squares Minimization. arXiv preprint, arXiv:1201.5885, [Link] 2012.

Varnell, E. M. Economic scenario generators and solvency II. BAJ, Vol 16, pp 121–159, 2011. doi: 10.1017/S1357321711000079.

Wichmann, B. and Hill, D. Algorithm AS 183: An efficient and portable pseudo-random number generator. J. Roy. Stat. Soc. C (Appl. Stat.), Vol 31(2), pp 188–190, 1982.

Wichmann, B. and Hill, D. Correction: Algorithm AS 183: An efficient and portable pseudo-random number generator. J. Roy. Stat. Soc. C (Appl. Stat.), Vol 33(1), pp 123, 1984.

Wu, L. and Zhang, F. Libor market model with stochastic volatility. J. Indust. Mgmt. Opt., Vol 2(2), pp 199–227, 2006.

Zyskind, G. and Martin, F. B. On best linear estimation and general Gauss-Markov theorem in linear models with arbitrary nonnegative covariance structure. SIAM J. App. Math., Vol 17(6), pp 1190–1202.
Part II

Numerical Methods in Financial High-Performance Computing (HPC)

Chapter 6
Finite Difference Methods for Medium- and High-Dimensional Derivative Pricing PDEs

Christoph Reisinger and Rasmus Wissmann

CONTENTS
6.1 Introduction
6.2 Finite Difference Schemes
6.3 Decomposition Methods
    6.3.1 Anchored-ANOVA decomposition
    6.3.2 Constant coefficient PDEs
    6.3.3 Variable coefficients: Full freezing
    6.3.4 Partial freezing
    6.3.5 Partial freezing and zero-correlation approximation
6.4 Theoretical Results
    6.4.1 Constant coefficients
    6.4.2 Variable coefficients
6.5 Numerical Examples
    6.5.1 Time-dependent simple correlation
    6.5.2 Time-dependent exponential correlation
    6.5.3 Time-dependent volatilities, simple correlation
    6.5.4 Time-dependent volatilities, exponential correlation
    6.5.5 Asset-dependent correlation
6.6 Conclusion
References

6.1 Introduction
Many models in financial mathematics and financial engineering, particu-
larly in derivative pricing, can be formulated as partial differential equations
(PDEs). Specifically, for the most commonly used continuous-time models of
asset prices the value function of a derivative security, that is the option value
as a function of the underlying asset price, is given by a PDE. This opens


up the possibility to use accurate approximation schemes for PDEs for the
numerical computation of derivative prices.
As the computational domain is normally a box, or can be restricted to
one by truncation, the construction of tensor product meshes and spatial finite
difference stencils is straightforward [1]. Accurate and stable splitting methods
have become standard for efficient time integration [2].
Notwithstanding this, the most common approach in the financial industry
appears to be Monte Carlo methods. This is partly a result of the perception
that PDE schemes, although highly efficient for simple contracts, are less flex-
ible and harder to adapt to more exotic features. In particular, the widespread
belief is that PDE schemes become too slow for practical use if the number of
underlying variables exceeds 3.
Indeed, the increase in computational time and memory requirements of
standard mesh-based methods with the dimension is exponential and has
become known as the “curse of dimensionality.” Various methods, such as
sparse grids [3,4], radial basis functions [5], and tensor approaches ([6] for an
application to finance and [7] for a literature survey), have been proposed
to break this curse. These methods can perform remarkably well for special
cases, but have not been demonstrated to give accurate enough solutions for
truly high dimensions in applications (larger than, say, 5).
In conversations about numerical methods for high-dimensional PDEs
inevitably the question comes up: “How high can you go?” This is a mean-
ingful question if one considers a specific type of PDE with closely defined
characteristics. But even within the fairly narrow class of linear second-order
parabolic PDEs which are most common in finance, the difficulty of solving
them varies vastly and depends on a number of factors: the input data (such
as volatilities and correlations), the boundary data (payoff), and the quantity
of interest (usually the solution of the PDE at a single point).
It is inherent to the methods presented in this chapter that it is not the
nominal dimension of a PDE which matters. A PDE which appears inaccessi-
ble to numerical methods in its raw form may be very easily approximated if
a more adapted coordinate system is chosen. This can be either because the
solution is already adequately described by a low number of principal com-
ponents (it has low “truncation dimension”), or because it can be accurately
represented as the sum of functions of a low number of variables (it has low
“superposition dimension”).
To exploit such features, we borrow ideas from data analysis to represent
the solutions by sums of functions which can be approximated by PDEs with
low effective dimension. More specifically, the method is a “dynamic” version
of the anchored-ANOVA decompositions which were applied to integration
problems in finance in Reference 8. A version which is equivalent in special
cases has been independently derived via PDE expansions in Reference 3; a
detailed error analysis is found in Reference 9 and also in Reference 10; an
efficient parallelization strategy is proposed in Reference 11; and the method
is extended to complex derivatives in Reference 12 and to CVA computations

in Reference 13. The link of these methods to anchored-ANOVA is already
observed in References 14 and 15. We present here a systematic approach
which extends [9] from Black–Scholes to more general models, and analyze the
accuracy of the approximations by way of carefully chosen numerical tests.
In the remainder of this section, we describe the mathematical framework.
Then, in Section 6.2, we describe the standard approximation schemes. In
Section 6.3, we define and explain in detail a dimension-wise decomposition.
Section 6.4 summarizes known theoretical results for the constant coefficient
case and offers a heuristic argument for the accuracy of a variable coefficient
extension. Section 6.5 gives numerical results for test cases. We draw a con-
clusion in Section 6.6.
Throughout this chapter, we study asset price processes of the form

$$dS_t^i = \mu_i(S_t, t)\,dt + \sigma_i(S_t, t)\,dW_t^i, \quad i = 1, \ldots, N, \quad t > 0, \tag{6.1}$$
$$S_0^i = s_i, \quad i = 1, \ldots, N, \tag{6.2}$$

where $W$ is an $N$-dimensional standard Brownian motion, $s \in \mathbb{R}^N$ is a given
initial state, the drift $\mu_i$ and local volatility $\sigma_i$ are functions $\mathbb{R}^N \times [0, T] \to \mathbb{R}$,
and we will allow the correlation between the Brownian drivers also to be
“local,” that is, given $S_t$ at time $t$ the instantaneous correlation matrix is
$(\rho_{ij}(S_t, t))_{1 \le i,j \le N}$. We consider European-style financial derivatives on $S_T$
with maturity $T > 0$ and payoff function $h : \mathbb{R}^N \to \mathbb{R}$, whose value function
$V : \mathbb{R}^N \times [0, T] \to \mathbb{R}$ can be written as

$$V(s, t) = E\left[\exp\left(-\int_t^T \alpha(S_u, u)\,du\right) h(S_T) \,\Big|\, S_t = s\right],$$

where $\alpha$ is a discount factor, possibly stochastic through its dependence on $S$,
and $V$ satisfies the Kolmogorov backward PDE [16]

$$\frac{\partial V}{\partial t} + \sum_{i=1}^N \mu_i \frac{\partial V}{\partial s_i} + \frac{1}{2}\sum_{i,j=1}^N \sigma_i \sigma_j \rho_{ij} \frac{\partial^2 V}{\partial s_i \partial s_j} - \alpha V = 0,$$
$$V(s, T) = h(s).$$

For simplicity, we consider functions defined on the whole of $\mathbb{R}^N$, but it will
become clear how to deal with bounded domains.
Let $p(y, t; s, 0)$ be the transition density function of $S_t$ at $y$ given state $s$
at $t = 0$. Then if $\alpha$ does not depend on $S$, we can write

$$V(s, 0) = \exp\left(-\int_0^T \alpha(u)\,du\right) \int_{\mathbb{R}^N} p(y, T; s, 0)\,h(y)\,dy.$$

Here, $p$ satisfies the Kolmogorov forward equation

$$-\frac{\partial p}{\partial t} - \sum_{i=1}^N \frac{\partial}{\partial y_i}(\mu_i p) + \frac{1}{2}\sum_{i,j=1}^N \frac{\partial^2}{\partial y_i \partial y_j}(\sigma_i \sigma_j \rho_{ij}\, p) = 0,$$
$$p(y, 0; s, 0) = \delta(y - s),$$

where $\delta$ is the Dirac distribution centered at 0.

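Most commonly, one is interested in approximating the value of $V(s_0, 0)$
for a given, fixed $s_0 \in \mathbb{R}^N$, and derivatives of $V$ with respect to $s_0$.
As a first step, we change the time direction to time-to-maturity, $t \to T - t$,
to obtain

$$\frac{\partial V}{\partial t} = \sum_{i=1}^N \mu_i \frac{\partial V}{\partial s_i} + \frac{1}{2}\sum_{i,j=1}^N \sigma_i \sigma_j \rho_{ij} \frac{\partial^2 V}{\partial s_i \partial s_j} - \alpha V, \tag{6.3}$$
$$V(s, 0) = h(s), \tag{6.4}$$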

where we keep the symbols $t$ and $V$ for simplicity. We now transform the
PDE into a standard form by using a rotation and subsequent translation
of the spatial coordinates. For a given orthogonal matrix $Q \in \mathbb{R}^{N \times N}$, define
$\beta : \mathbb{R}^N \times [0, T] \to \mathbb{R}^N$ componentwise by

$$\beta_i(x, t) \equiv \sum_{j=1}^N Q_{ji} \int_0^t \mu_j(x, T - u)\,du \tag{6.5}$$

for $1 \le i \le N$. We then introduce new spatial coordinates $x$ via

$$x(s, t) = Q^T s + \beta(s_0, t) \tag{6.6}$$

and set

$$a = Q^T s_0 + \beta(s_0, T). \tag{6.7}$$

We write $s(x, t) = Q(x - \beta(s_0, t))$ for the inverse transform.


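A simple calculation shows that the PDEs (6.3–6.4) transform into

$$\frac{\partial V}{\partial t} = LV := \sum_{k,l=1}^N \lambda_{kl} \frac{\partial^2 V}{\partial x_k \partial x_l} + \sum_{k=1}^N \kappa_k \frac{\partial V}{\partial x_k} - \alpha V, \tag{6.8}$$
$$V(x, 0) = g(x) := h(s(x, 0)), \tag{6.9}$$

for a function $V : \mathbb{R}^N \times [0, T] \to \mathbb{R}$, $T > 0$, where we still call the transformed
function $V$ by slight abuse of notation, and

$$\lambda_{kl}(x, t) = \frac{1}{2}\sum_{i,j=1}^N Q_{ik} Q_{jl}\, \sigma_i \sigma_j \rho_{ij}, \qquad \kappa_k(x, t) = \sum_{i=1}^N Q_{ik}\left[\mu_i - \mu_i(s_0, T - t)\right], \tag{6.10}$$

where $\sigma_i$ and $\rho_{ij}$ are functions of $(s(x, t), T - t)$.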
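
The "simple calculation" can be sketched with the chain rule. The steps below are a reconstruction from Equations 6.3, 6.5, and 6.6, not the authors' own derivation, and are offered only as a plausible route to Equations 6.8 and 6.10.

```latex
\begin{align*}
x_k &= \sum_{i=1}^N Q_{ik}\, s_i + \beta_k(s_0, t)
  \;\Longrightarrow\;
  \frac{\partial}{\partial s_i} = \sum_{k=1}^N Q_{ik} \frac{\partial}{\partial x_k},
  \qquad
  \frac{\partial \beta_k}{\partial t}(s_0, t) = \sum_{i=1}^N Q_{ik}\, \mu_i(s_0, T - t), \\
\left.\frac{\partial V}{\partial t}\right|_{s}
  &= \left.\frac{\partial V}{\partial t}\right|_{x}
   + \sum_{k=1}^N \frac{\partial \beta_k}{\partial t} \frac{\partial V}{\partial x_k}, \\
\left.\frac{\partial V}{\partial t}\right|_{x}
  &= \sum_{k=1}^N \underbrace{\sum_{i=1}^N Q_{ik}\left[\mu_i - \mu_i(s_0, T - t)\right]}_{\kappa_k}
     \frac{\partial V}{\partial x_k}
   + \sum_{k,l=1}^N \underbrace{\frac{1}{2}\sum_{i,j=1}^N Q_{ik} Q_{jl}\, \sigma_i \sigma_j \rho_{ij}}_{\lambda_{kl}}
     \frac{\partial^2 V}{\partial x_k \partial x_l}
   - \alpha V.
\end{align*}
```

The first line follows from Equations 6.6 and 6.5, the second is the change of variables in the time derivative, and the third results from substituting both into Equation 6.3.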


For a constant (i.e., independent of time and the spatial coordinates),
positive semidefinite coefficient matrix $\Sigma = (\Sigma_{ij})_{1 \le i,j \le N} = (\sigma_i \sigma_j \rho_{ij})_{1 \le i,j \le N}$,

we can choose $Q$ to be the matrix of eigenvectors of $\Sigma$ sorted by eigenvalue
size,^1 that is,

$$Q = (q_1, \ldots, q_N), \quad \frac{1}{2}\Sigma q_i = \lambda_i q_i, \quad \lambda_1 \ge \cdots \ge \lambda_N \ge 0, \tag{6.11}$$

and get $(\lambda_{kl})_{1 \le k,l \le N} = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$ as a constant diagonal matrix.
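As a small illustration of Equation 6.11, the following sketch diagonalizes $\frac{1}{2}\Sigma$ for two assets using the closed-form eigendecomposition of a symmetric 2x2 matrix. The function names and the parameter values (volatilities 0.2 and 0.3, correlation 0.5) are hypothetical and chosen only for the example; a production code would call a linear algebra library for general $N$.

```python
import math


def half_covariance(sigma1, sigma2, rho):
    """Return (1/2) * Sigma for two assets, with Sigma_ij = sigma_i sigma_j rho_ij."""
    off = 0.5 * rho * sigma1 * sigma2
    return [[0.5 * sigma1 ** 2, off], [off, 0.5 * sigma2 ** 2]]


def eig_sym_2x2(m):
    """Eigenvalues (descending) and orthonormal eigenvectors of a symmetric 2x2 matrix."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    lam1, lam2 = mean + radius, mean - radius
    if b == 0.0:
        q1 = (1.0, 0.0) if a >= c else (0.0, 1.0)
    else:
        norm = math.hypot(lam1 - c, b)
        q1 = ((lam1 - c) / norm, b / norm)
    q2 = (-q1[1], q1[0])  # orthogonal complement of q1
    q = [[q1[0], q2[0]], [q1[1], q2[1]]]  # columns are q1 and q2
    return (lam1, lam2), q


def rotate(m, q):
    """Return Q^T M Q, i.e. entries sum_ij Q_ik M_ij Q_jl as in Equation 6.10."""
    return [[sum(q[i][k] * m[i][j] * q[j][l] for i in range(2) for j in range(2))
             for l in range(2)] for k in range(2)]


if __name__ == "__main__":
    m = half_covariance(0.2, 0.3, 0.5)
    (lam1, lam2), q = eig_sym_2x2(m)
    print("lambda:", lam1, lam2)
    print("rotated coefficient matrix:", rotate(m, q))
```

The rotated matrix has $\lambda_1 \ge \lambda_2$ on the diagonal and zeros (up to rounding) elsewhere, which is the diagonal form used in the constant coefficient case.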
If μ does not depend on the spatial coordinates x but only on t, then
the difference under the sum in Equation 6.10 vanishes identically and thus
κ(x, t) ≡ 0.
Moreover, if $\alpha$ is also only a function of $t$, the zero order term can be
eliminated from Equation 6.8 by considering $\exp\left(\int_0^t \alpha(T - u)\,du\right) V$.
If all this is satisfied, then L simplifies to the N -dimensional heat operator
in Equation 6.12. Keeping the symbol V for the transformed value function
and L for the operator for simplicity, we obtain

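$$\frac{\partial V}{\partial t} = LV = \sum_{k=1}^N \lambda_k \frac{\partial^2 V}{\partial x_k^2}, \tag{6.12}$$
$$V(x, 0) = g(x), \tag{6.13}$$

for $x \in \mathbb{R}^N$, $t \in (0, T)$, $\lambda = (\lambda_1, \ldots, \lambda_N) \in \mathbb{R}^N_+$.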
In all other cases, that is, if $\Sigma$ is not constant and $\mu$ depends on $s$, a
transformation to a diagonal diffusion without drift is generally not possible.
By translation to $s = s_0 + \int_0^T \mu(s_0, u)\,du$ and choosing $Q$ as the eigenvectors
of $\Sigma(s, T)$, one obtains $\lambda_{kl}(a, 0) = 0$ for $k \ne l$ and $\kappa_k(a, 0) = 0$, but these
coefficients are nonzero for other $(x, t)$.

6.2 Finite Difference Schemes


In this section, we describe the finite difference schemes used for the
one- and two-dimensional versions of Equations 6.12 and 6.8 which we will
need to construct the dimension-wise splitting introduced in Section 6.3. We
choose the Crank–Nicolson scheme for the one-dimensional equations, Brian’s
scheme [17] for multidimensional PDEs without cross-derivatives, and the
Hundsdorfer–Verwer (HV) scheme [18] for PDEs with cross-derivative terms.
These are established techniques from the literature which are routinely used
in financial institutions for derivative pricing, and can be replaced by a method
of choice. As such, this section can be skipped without loss of continuity.
We follow standard procedure [1] to define a finite difference approximation
$V_h$ to $V$, where $h = (\Delta t, \Delta x_1, \ldots, \Delta x_d)$ contains both the time step size
1 If Σ has eigenvalues with multiplicity larger than 1, then this decomposition is not unique. In that case, we can simply choose any such matrix Q.



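$\Delta t > 0$ and the spatial mesh sizes $\Delta x_i > 0$, $i = 1, \ldots, d$, where $d$ is the
dimension of the PDE. We first define basic finite difference operators

$$\delta_t V_h(\cdot, t) = \frac{V_h(\cdot, t + \Delta t) - V_h(\cdot, t)}{\Delta t},$$
$$\delta_{x_i} V_h(\cdot, t) = \frac{V_h(\cdot + \Delta x_i, t) - V_h(\cdot - \Delta x_i, t)}{2\Delta x_i},$$
$$\delta_{x_i,i} V_h(\cdot, t) = \frac{V_h(\cdot + \Delta x_i, t) - 2V_h(\cdot, t) + V_h(\cdot - \Delta x_i, t)}{\Delta x_i^2},$$
$$\delta_{x_i,j} V_h = \delta_{x_i} \delta_{x_j} V_h, \quad i \ne j,$$

and then an approximation to $L$ by

$$L(t) = \sum_{i=1}^d \kappa_i(\cdot, t)\,\delta_{x_i} + \sum_{i,j=1}^d \lambda_{ij}(\cdot, t)\,\delta_{x_i,j} - \alpha(\cdot, t),$$

where the operator $\kappa_i(\cdot, t_n)\delta_{x_i}$, applied to $V_h$, at a point $x = (x_{j_1}, \ldots, x_{j_d})$ is

$$\left((\kappa_i(\cdot, t_n)\delta_{x_i})V_h\right)_{j_1, \ldots, j_d} = \kappa_i(x, t_n)\,\frac{V_h(x + \Delta x_i e_i, t_n) - V_h(x - \Delta x_i e_i, t_n)}{2\Delta x_i},$$

where $e_i$ is the $i$th unit vector, and similarly for the $\lambda$ and $\alpha$ terms.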
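The central difference operators $\delta_{x_i}$ and $\delta_{x_i,i}$ can be checked numerically in one dimension. The sketch below is an illustration only: the test function $\sin(x)$ on $[0, 1]$ and the helper names are assumptions made for the example. It measures how the maximum error behaves when the mesh size is halved.

```python
import math


def delta_x(values, dx):
    """Central first difference at the interior nodes of a uniform 1D mesh."""
    return [(values[j + 1] - values[j - 1]) / (2.0 * dx)
            for j in range(1, len(values) - 1)]


def delta_xx(values, dx):
    """Central second difference at the interior nodes of a uniform 1D mesh."""
    return [(values[j + 1] - 2.0 * values[j] + values[j - 1]) / dx ** 2
            for j in range(1, len(values) - 1)]


def max_errors(n):
    """Maximum errors of both operators for sin(x) on [0, 1] with n mesh intervals."""
    dx = 1.0 / n
    xs = [j * dx for j in range(n + 1)]
    values = [math.sin(x) for x in xs]
    interior = xs[1:-1]
    err1 = max(abs(d - math.cos(x)) for d, x in zip(delta_x(values, dx), interior))
    err2 = max(abs(d + math.sin(x)) for d, x in zip(delta_xx(values, dx), interior))
    return err1, err2


if __name__ == "__main__":
    for n in (50, 100, 200):
        print(n, max_errors(n))
```

Halving the mesh size should reduce both errors by a factor of about four, consistent with second-order accuracy of the stencils.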
Ignoring spatial boundaries for the time being, $V_h$ is defined for all $(x, t) \in
\mathbb{R}^d \times \{0, \Delta t, \ldots, T\}$ by the scheme

$$\delta_t V_h = \theta L(t + \Delta t) V_h(t + \Delta t) + (1 - \theta) L(t) V_h(t), \tag{6.14}$$
$$V_h(x, T) = \phi(x),$$

where $\theta \in [0, 1]$. Here, $\Delta t = T/N_t$, where $N_t$ is the number of timesteps.

In practice, the scheme and solution need to be restricted to a bounded
domain, and for simplicity we restrict ourselves here to a box where $x_{i,\min} \le
x_i \le x_{i,\max}$. These may be given naturally, for example, $x_{\min} = 0$ if $x$ is a
positive stock price, or by truncation of an infinite interval at suitably large
values, for example, a certain number of standard deviations away from the
spot. Then with $N_i$ the number of mesh intervals in coordinate direction $x_i$,
$\Delta x_i = (x_{i,\max} - x_{i,\min})/N_i$, the mesh points are $x_{i,j} = x_{i,\min} + j\Delta x_i$ for
$j = 0, \ldots, N_i$, $i = 1, \ldots, d$. We denote the numerical solution on this mesh by
$U_n$, this being the vector $(V_h((x_{i,j_i})_{i=1,\ldots,N}, t_n))_{j_i = 0, \ldots, N_i}$.
Let $L^n \equiv L(t_n)$ be the discretization matrix at time step $t_n$, then this
matrix is first decomposed into

$$L^n = L_0^n + L_1^n + \cdots + L_d^n,$$

where the individual $L_i^n$, $1 \le i \le d$, contain the contribution to $L$ stemming
from the first- and second-order derivatives in the $i$th dimension,

$$L_i^n = \kappa_i(\cdot, t_n)\,\delta_{x_i} + \lambda_{ii}(\cdot, t_n)\,\delta_{x_i,i} - \frac{1}{d}\,\alpha(\cdot, t_n),$$

and, following Reference 2, we define one matrix $L_0^n$ which accounts for the
mixed derivative terms,

$$L_0^n = \sum_{i \ne j} \lambda_{ij}(\cdot, t_n)\,\delta_{x_i,j}.$$

For $L_0^n = 0$, which contains the discretization of Equation 6.12 as a special
case, a simple splitting scheme is given by the Douglas scheme [19],

$$Y_0 = U_{n-1} + \Delta t\, L^{n-1} U_{n-1},$$
$$(I - \theta \Delta t\, L_j^n)\, Y_j = Y_{j-1} - \theta \Delta t\, L_j^{n-1} U_{n-1}, \quad j = 1, \ldots, d, \tag{6.15}$$
$$U_n = Y_d.$$

The scheme is unconditionally stable for all θ ≥ 1/2 and of second order in
time for θ = 1/2 (otherwise of first order, see Reference 20).
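
The Douglas scheme can be sketched for the two-dimensional heat equation (Equation 6.12 with $N = 2$) on the unit square. The following is an illustrative implementation under stated assumptions, not the code used for the chapter's numerical tests: zero Dirichlet boundary values, constant $\lambda_1, \lambda_2$, $\theta = 1/2$, and the initial condition $\sin(\pi x_1)\sin(\pi x_2)$, for which the exact solution $e^{-(\lambda_1 + \lambda_2)\pi^2 t}\sin(\pi x_1)\sin(\pi x_2)$ is known. All function names are hypothetical.

```python
import math


def second_diff(row, coeff):
    """coeff * (v[j+1] - 2 v[j] + v[j-1]) with zero values outside the row."""
    m = len(row)
    return [coeff * ((row[j + 1] if j + 1 < m else 0.0) - 2.0 * row[j]
                     + (row[j - 1] if j > 0 else 0.0)) for j in range(m)]


def solve_tridiagonal(off, diag, rhs):
    """Thomas algorithm for the constant matrix tridiag(off, diag, off)."""
    m = len(rhs)
    c, d = [0.0] * m, [0.0] * m
    c[0], d[0] = off / diag, rhs[0] / diag
    for j in range(1, m):
        denom = diag - off * c[j - 1]
        c[j] = off / denom
        d[j] = (rhs[j] - off * d[j - 1]) / denom
    x = [0.0] * m
    x[-1] = d[-1]
    for j in range(m - 2, -1, -1):
        x[j] = d[j] - c[j] * x[j + 1]
    return x


def transpose(u):
    return [list(col) for col in zip(*u)]


def along_axis(func, u, axis):
    """Apply a row-wise function along axis 1 directly, along axis 0 via transposes."""
    if axis == 1:
        return [func(row) for row in u]
    return transpose([func(row) for row in transpose(u)])


def douglas_step(u, dt, h, lams, theta=0.5):
    """One Douglas step for dV/dt = lam_1 V_x1x1 + lam_2 V_x2x2 (no cross terms)."""
    coeffs = [lam / h ** 2 for lam in lams]
    # L_j U for j = 1, 2; coefficients are constant, so L_j^n = L_j^{n-1}.
    l_u = [along_axis(lambda row, c=c: second_diff(row, c), u, axis)
           for axis, c in enumerate(coeffs)]
    m = len(u)
    y = [[u[i][j] + dt * (l_u[0][i][j] + l_u[1][i][j]) for j in range(m)]
         for i in range(m)]  # Y_0
    for axis, c in enumerate(coeffs):
        r = theta * dt * c
        rhs = [[y[i][j] - theta * dt * l_u[axis][i][j] for j in range(m)]
               for i in range(m)]
        # (I - theta dt L_j) is tridiag(-r, 1 + 2r, -r) along the given axis.
        y = along_axis(lambda row, r=r: solve_tridiagonal(-r, 1.0 + 2.0 * r, row),
                       rhs, axis)  # Y_j
    return y  # U_n = Y_d


def heat_error(n=32, steps=20, horizon=0.1, lams=(0.5, 0.25)):
    """Maximum error at t = horizon against the exact separable solution."""
    h, dt = 1.0 / n, horizon / steps
    nodes = [(i + 1) * h for i in range(n - 1)]
    u = [[math.sin(math.pi * x) * math.sin(math.pi * y) for y in nodes] for x in nodes]
    for _ in range(steps):
        u = douglas_step(u, dt, h, lams)
    decay = math.exp(-(lams[0] + lams[1]) * math.pi ** 2 * horizon)
    return max(abs(u[i][j] - decay * math.sin(math.pi * x) * math.sin(math.pi * y))
               for i, x in enumerate(nodes) for j, y in enumerate(nodes))


if __name__ == "__main__":
    print(heat_error(), heat_error(n=64, steps=40))
```

Each implicit stage only requires tridiagonal solves along one coordinate direction, which is what makes the splitting cheap compared with solving the full coupled system.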
A second-order modification of the above scheme was proposed by Brian
[17], where the first two steps are as above with $\theta = 1$ and step size $\Delta t/2$, and
the last step (6.15) is replaced by a Crank–Nicolson-type step

$$\frac{U_n - U_{n-1}}{\Delta t} = \sum_{j=1}^d \frac{1}{2}\left(L_j^n + L_j^{n-1}\right) Y_j.$$

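For $L_0^n \ne 0$, that is, with cross-derivative terms present as in the general
case of Equation 6.14, second order gets lost and an iteration of the idea is
needed. The HV scheme [18],

$$Y_0 = U_{n-1} + \Delta t\, L^{n-1} U_{n-1},$$
$$(I - \theta \Delta t\, L_j^n)\, Y_j = Y_{j-1} - \theta \Delta t\, L_j^{n-1} U_{n-1}, \quad j = 1, 2, 3,$$
$$\tilde{Y}_0 = Y_0 + \frac{1}{2}\Delta t \left(L^n Y_3 - L^{n-1} U_{n-1}\right),$$
$$(I - \theta \Delta t\, L_j^n)\, \tilde{Y}_j = \tilde{Y}_{j-1} - \theta \Delta t\, L_j^n Y_j, \quad j = 1, 2, 3,$$
$$U_n = \tilde{Y}_3,$$

defines a second-order consistent ADI splitting for all $\theta$, and can be shown to
be von Neumann stable for $\theta \in \left[\frac{1}{2} + \frac{1}{6}\sqrt{3},\, 1\right]$ [21]. We use $\theta = \frac{1}{2} + \frac{1}{6}\sqrt{3} \approx 0.789$
in the computations.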
A severe computational difficulty arises for $d$ larger than approximately 3, as the total number of operations is proportional to $N_t N_1 \cdots N_d$, that is, grows exponentially in the dimension. In the numerical tests, we will use $N_1 = N_2 = 800$ and $N_t = 1000$ for the two-dimensional equations. These involve $6.4 \times 10^8$ unknowns. In Reference 9, for a second-order extension, $N_1 = N_2 = N_3 = 500$ and $N_t = 50$ are used for the three-dimensional equations involved, that is, $6.25 \times 10^9$ unknowns. It is clear that within this framework a further increase in the dimension will only be practically feasible by reducing the number of mesh points in each direction and consequently sacrificing accuracy.
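The counts quoted above follow directly from the product $N_t N_1 \cdots N_d$. A short sketch of the arithmetic (the helper name is hypothetical and introduced only for this illustration):

```python
from math import prod

def space_time_unknowns(points_per_dim, time_steps):
    """Number of space-time unknowns N_t * N_1 * ... * N_d on a full tensor grid."""
    return time_steps * prod(points_per_dim)

two_d = space_time_unknowns([800, 800], 1000)
three_d = space_time_unknowns([500, 500, 500], 50)
print(f"{two_d:.3g} {three_d:.3g}")
```

Adding one more dimension with the same resolution multiplies the count by the number of points in that direction, which is the exponential growth referred to in the text.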

6.3 Decomposition Methods


In order to accurately approximate derivative prices with N > 3 fac-
tors, we define an approximate dimension-wise decomposition, in the spirit
of anchored-ANOVA decompositions. Here, the starting point a of the trans-
formed process, from Equation 6.7, serves as an “anchor.” We show the basic
concept in a static setting in Section 6.3.1, and its application to constant and
variable coefficient stochastic processes and PDEs in the subsequent sections.
We assume in this section that a suitable rotation and translation (see the end of Section 6.1) has taken place, so that
\[
\lambda_{ij}(a, 0) = 0, \quad i \ne j, \tag{6.16}
\]
\[
\kappa_i(a, 0) = 0. \tag{6.17}
\]
We then denote for simplicity
\[
\lambda_i(x, t) \equiv \lambda_{ii}(x, t).
\]
For brevity, we set $\alpha = 0$ in this section, but the extension to $\alpha \ne 0$ is straightforward.

6.3.1 Anchored-ANOVA decomposition


We follow here Reference 8 to define the anchored-ANOVA decomposition of a function $g : \mathbb{R}^N \to \mathbb{R}$, with a given "anchor" $a \in \mathbb{R}^N$. For a given index set $u \subseteq \mathcal{N} = \{ i : 1 \le i \le N \}$, denote by $a \backslash x_u$ the $N$-vector
\[
(a \backslash x_u)_i =
\begin{cases}
x_i, & i \in u, \\
a_i, & i \notin u.
\end{cases}
\]
Then $g_u(a; \cdot)$ defined for all $x \in \mathbb{R}^N$ by $g_u(a; x) = g(a \backslash x_u)$ is a projection of $g$, where we make the dependence of $g_u$ on the anchor $a$ explicit in the notation. We proceed to define a difference operator $\Delta$ recursively through $\Delta g_\emptyset = g_\emptyset$ and, for $u \ne \emptyset$,
\[
\Delta g_u = g_u - \sum_{w \subset u} \Delta g_w = \sum_{w \subseteq u} (-1)^{|w| - |u|} g_w .
\]
An exact decomposition of $g$ is then given by the identity
\[
g = \sum_{u \subseteq \mathcal{N}} \Delta g_u = \sum_{k=0}^{N} \sum_{|u| = k} \Delta g_u . \tag{6.18}
\]
This enables the definition in Reference 8 of successive dimension-wise approximations to the integral of $g$ by truncation of the series.
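The decomposition can be checked numerically on a small example. The sketch below evaluates the projections $g_u$, the differences $\Delta g_u$ and the truncated sum pointwise; the test function, the anchor and all helper names are assumptions chosen for illustration, not taken from Reference 8.

```python
import math
from itertools import combinations

def subsets(indices, max_size=None):
    """All subsets of `indices` (as tuples) up to size `max_size`."""
    indices = tuple(indices)
    top = len(indices) if max_size is None else max_size
    for k in range(top + 1):
        yield from combinations(indices, k)

def project(g, anchor, x, u):
    """g_u(a; x): keep x_i for i in u and use the anchor value otherwise."""
    return g([x[i] if i in u else anchor[i] for i in range(len(anchor))])

def delta(g, anchor, x, u):
    """Delta g_u as the signed sum of projections over all w contained in u."""
    return sum((-1) ** (len(u) - len(w)) * project(g, anchor, x, w)
               for w in subsets(u))

def truncated(g, anchor, x, order):
    """Sum of Delta g_u over all u with at most `order` elements."""
    return sum(delta(g, anchor, x, u)
               for u in subsets(range(len(anchor)), order))

def g(x):
    # Illustrative function: a smooth part plus one pairwise interaction.
    return math.exp(0.1 * sum(x)) + x[0] * x[1]

anchor = [0.0, 0.0, 0.0, 0.0]
point = [0.3, -0.2, 0.5, 0.1]
for order in range(5):
    print(order, truncated(g, anchor, point, order) - g(point))
```

With all $N = 4$ orders included the difference vanishes up to rounding, as the identity (6.18) requires; lower orders leave a remainder that depends on how strongly the variables interact.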

6.3.2 Constant coefficient PDEs


We start by considering the $N$-dimensional heat equation
\[
\frac{\partial V}{\partial t} = L V = \sum_{k=1}^{N} \lambda_k \frac{\partial^2 V}{\partial x_k^2}, \tag{6.19}
\]
\[
V(\cdot, 0) = g, \tag{6.20}
\]
with constant $\lambda$.
Given an initial-value problem of the form 6.19 and 6.20, and an index set $u \subseteq \mathcal{N}$, define a differential operator
\[
L_u = \sum_{k \in u} \lambda_k \frac{\partial^2}{\partial x_k^2},
\]
and an approximation $V_u$ of $V$ as the solution to
\[
\frac{\partial V_u}{\partial t} = L_u V_u, \tag{6.21}
\]
\[
V_u(\cdot, 0) = g. \tag{6.22}
\]
The definition in Equation 6.21 is equivalent to saying
\[
\frac{\partial V_u}{\partial t} = L V_u, \qquad V_u(x, 0) = g(a \backslash x_u),
\]
that is, projecting the initial condition, but it is not normally true that $V_u$ from Equation 6.21 is the projection of the solution $V$ of Equation 6.19 in the sense of Section 6.3.1.
From here on, we can proceed as in Section 6.3.1 to set
\[
\Delta V_u = \sum_{w \subseteq u} (-1)^{|w| - |u|} V_w .
\]
To approximate $V$ by lower dimensional functions, we truncate the series in Equation 6.18 and define
\[
V_{0,s} = \sum_{k=0}^{s} \sum_{|u| = k} \Delta V_u = \sum_{k=0}^{s} c_k \sum_{|u| = k} V_u, \tag{6.23}
\]
where $c_k$ are integer constants that depend on the dimensions $N$ and $s$. The point to note is that $V_u$ is essentially a $|u|$-dimensional function as it only depends on the fixed anchor and $|u|$ components of $x$.
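The constants $c_k$ in Equation 6.23 can be obtained by counting how often each $V_w$ with $|w| = k$ appears when all $\Delta V_u$ with $|u| \le s$ are expanded: a set $w$ is contained in $\binom{N-k}{j-k}$ sets $u$ of size $j$, which gives $c_k = \sum_{m=0}^{s-k} (-1)^m \binom{N-k}{m}$. The sketch below (function names are hypothetical) evaluates this expression and cross-checks it against a brute-force expansion.

```python
from itertools import combinations
from math import comb

def anova_coefficients(n, s):
    """c_k for k = 0..s such that V_{0,s} = sum_k c_k * sum_{|u|=k} V_u."""
    return [sum((-1) ** m * comb(n - k, m) for m in range(s - k + 1))
            for k in range(s + 1)]

def anova_coefficients_brute(n, s):
    """Same coefficients, by expanding every Delta V_u with |u| <= s explicitly."""
    weight = {}
    for size in range(s + 1):
        for u in combinations(range(n), size):
            for k in range(size + 1):
                for w in combinations(u, k):
                    weight[w] = weight.get(w, 0) + (-1) ** (size - k)
    # By symmetry the weight depends only on |w|; read it off for w = {0,...,k-1}.
    return [weight[tuple(range(k))] for k in range(s + 1)]

print(anova_coefficients(10, 1), anova_coefficients(10, 2))
```

For $s = 1$ this reproduces the familiar first-order form $V_{0,1} = \sum_k V_{\{k\}} - (N-1) V_\emptyset$.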
In situations where one or several coordinates play a dominant role, it will be useful to consider a generalization of Equation 6.23 to
\[
V_{r,s} = \sum_{k=0}^{s} c_k \sum_{|u| = k} V_{u \cup \{1, \dots, r\}}, \quad r + s \le N. \tag{6.24}
\]
Here, all components $V_{u \cup \{1, \dots, r\}}$ depend on all the $x_1, \dots, x_r$.



6.3.3 Variable coefficients: Full freezing


The simplest way to deal with variable coefficients is to “freeze” them at
a constant value and then apply the methodology from Section 6.3.2. As we
are interested in the PDE solution at the anchor point a, the obvious choice
is to approximate κi and λij by κi (a, 0) and λij (a, 0).
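For a given subset $u \subseteq \mathcal{N}$, we then define (note that in this case $\kappa_i(a, 0) = 0$ and $\lambda_{ij} = 0$, $i \ne j$)
\[
\frac{\partial V_u}{\partial t} = \sum_{i \in u} \lambda_{ii}(a, 0) \frac{\partial^2 V_u}{\partial x_i^2},
\qquad
V_u(x, 0) = g(x).
\]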

6.3.4 Partial freezing


The full freezing approximation in Section 6.3.3 throws away more infor-
mation than needed. In the following extension, we keep as much as possible
of the original dynamics of the process in the low-dimensional cross-section
the process is restricted to.
For a given subset $u \subseteq \mathcal{N}$, we now define
\[
\frac{\partial V_u}{\partial t} = \sum_{i \in u} \kappa_i(a \backslash x_u, t) \frac{\partial V_u}{\partial x_i} + \sum_{i,j \in u} \lambda_{ij}(a \backslash x_u, t) \frac{\partial^2 V_u}{\partial x_i \partial x_j},
\qquad
V_u(x, 0) = g(x).
\]
Given the variability of the coefficients, there is generally no static coordinate transformation that reduces the PDE to the heat equation. The difference from the localized problem in the previous section is that, since the PDE coefficients $\lambda(x, t)$ and $\kappa(x, t)$ change with the spatial and time coordinates, the PDE will in general contain first-order and nondiagonal second-order terms.

6.3.5 Partial freezing and zero-correlation approximation


Here, motivated by $\lambda_{ij}(a, 0) = 0$ for all $i \ne j$, we make the additional approximation that this holds for all $x$ and $t$. So we define now
\[
\frac{\partial V_u}{\partial t} = \sum_{i \in u} \kappa_i(a \backslash x_u, t) \frac{\partial V_u}{\partial x_i} + \sum_{i \in u} \lambda_{ii}(a \backslash x_u, t) \frac{\partial^2 V_u}{\partial x_i^2},
\qquad
V_u(x, 0) = g(x).
\]
This extra approximation in addition to Section 6.3.4 does not give any further dimension reduction, but simplifies the PDEs somewhat: no cross-derivative terms are present, which simplifies the construction of numerical schemes.

6.4 Theoretical Results


In this section, we review the rigorous error analysis from Reference 9 for
the constant coefficient case in Section 6.3.2, and give a novel, more heuristic
extension of this analysis to the variable coefficient setting of Section 6.3.4.
What is essential in the analysis is clearly the size of the diffusion and
drift coefficients in the various directions, as well as the variability of the
initial data jointly with respect to different sets of variables. The relevant
measure of variability is defined in the following.

Definition 6.1. Let
\[
C^{j,k,\mathrm{mix}} = \left\{ g \in C_b : \partial_{i_1}^{j} \cdots \partial_{i_k}^{j} g \in C_b, \ \forall\, 1 \le i_1 < \cdots < i_k \le N \right\},
\]
\[
C_b = \left\{ g : \mathbb{R}^N \to \mathbb{R} \text{ continuous} : |g(x)| \le c \text{ for all } x \text{ for some } c > 0 \right\}.
\]
The spaces of functions in Definition 6.1 allow us to measure whether a function is truly multidimensional by its cross-derivative with respect to sets of variables. The growth condition ensures well-posedness of the PDE.

6.4.1 Constant coefficients


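We follow here Reference 9. Let $\widehat{V}_{r,s} = V_{r,s} - V$ be the approximation error of $V_{r,s}$ from Equation 6.24. Then the following holds.

Theorem 6.1 (Theorems 5 and 14 in Reference 9).

1. Assume $g \in C^{2,2,\mathrm{mix}}$ in Equations 6.19 and 6.20. Then the expansion error $\widehat{V}_{r,1}$ satisfies
\[
\left\| \widehat{V}_{r,1}(\cdot, t) \right\|_\infty \le t^2 \sum_{r < i < j \le N} \lambda_i \lambda_j \left\| \frac{\partial^4 g}{\partial x_i^2 \partial x_j^2} \right\|_\infty . \tag{6.25}
\]

2. Assume $g \in C^{2,3,\mathrm{mix}}$ in Equations 6.19 and 6.20. Then the expansion error $\widehat{V}_{r,2}$ satisfies
\[
\left\| \widehat{V}_{r,2}(\cdot, t) \right\|_\infty \le t^3 \sum_{r < i < j < k \le N} \lambda_i \lambda_j \lambda_k \left\| \frac{\partial^6 g}{\partial x_i^2 \partial x_j^2 \partial x_k^2} \right\|_\infty . \tag{6.26}
\]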

The analysis in Reference 9 derives PDEs for the error itself, and then
makes use of standard maximum principle-type arguments to estimate the
size of the error.

For instance, by using the PDEs satisfied by $V$ and $V_{\{1,\dots,r,i\}}$ for different $i$, it can be shown that
\[
\begin{aligned}
\frac{\partial}{\partial t} \widehat{V}_{r,1}
&= L_{\{1,\dots,r\}} \widehat{V}_{r,1}
 + \sum_{i=r+1}^{N} \left( L_{\{1,\dots,r,i\}} - L_{\{1,\dots,r\}} \right) V_{\{1,\dots,r,i\}}
 + \left( L_{\{1,\dots,r\}} - L \right) V \\
&= \sum_{k=1}^{r} \lambda_k \frac{\partial^2}{\partial x_k^2} \widehat{V}_{r,1}
 + \sum_{k=r+1}^{N} \lambda_k \frac{\partial^2}{\partial x_k^2} \left( V_{\{1,\dots,r,k\}} - V \right).
\end{aligned}
\tag{6.27}
\]
This is an inhomogeneous heat equation for $\widehat{V}_{r,1}$ with zero initial data and a right-hand side which can be shown to be small. As a consequence, the solution itself is small. Informally, the terms $V_{\{1,\dots,r,k\}} - V$ on the right-hand side are of order $O(\lambda_{r+1} + \cdots + \lambda_N - \lambda_k)$, and hence the right-hand side is of order $O\left( \sum_{r < i < j \le N} \lambda_i \lambda_j \right)$. A slightly more careful argument gives the precise bound (Equation 6.25), and a similar but lengthier argument for $V_{r,2}$ gives Equation 6.26.
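The bound (6.25) can be sanity-checked on an example with a closed-form solution. The sketch below assumes $r = 0$, the anchor $a = 0$ and the bounded initial condition $g(x) = \prod_k \cos(b_k x_k)$, for which $V_u(a, t) = \prod_{k \in u} e^{-\lambda_k b_k^2 t}$ and $\| \partial^4 g / \partial x_i^2 \partial x_j^2 \|_\infty = b_i^2 b_j^2$. These choices, the parameter values and the function names are assumptions for illustration only; the check is made at the anchor point and is not a substitute for the analysis in Reference 9.

```python
import math
from itertools import combinations

def v_exact(lam, b, t):
    """Full solution V(a, t) at the anchor a = 0 for g = prod cos(b_k x_k)."""
    return math.prod(math.exp(-l * bk**2 * t) for l, bk in zip(lam, b))

def v_first_order(lam, b, t):
    """V_{0,1}(a, t) = V_empty + sum_k (V_{k} - V_empty), with V_empty = g(a) = 1."""
    return 1.0 + sum(math.exp(-l * bk**2 * t) - 1.0 for l, bk in zip(lam, b))

def bound_first_order(lam, b, t):
    """Right-hand side of (6.25) for r = 0 and this particular g."""
    return t**2 * sum(lam[i] * lam[j] * b[i]**2 * b[j]**2
                      for i, j in combinations(range(len(lam)), 2))

lam = [0.04, 0.02, 0.01, 0.005]   # illustrative diffusion coefficients
b = [1.0, 1.0, 1.0, 1.0]          # illustrative frequencies
t = 1.0
error = abs(v_first_order(lam, b, t) - v_exact(lam, b, t))
print(error, bound_first_order(lam, b, t))
```

For these values the observed error lies just below the bound, which suggests that the estimate is fairly sharp when the products $\lambda_i \lambda_j t^2$ are small.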
A number of comments are in order regarding the smoothness requirements dictated by the error bounds. First, most option payoffs are nonsmooth and have kinks and discontinuities. This would appear to render Equation 6.25 and its higher order versions meaningless. A reworking of the derivation shows that $g$ can actually be replaced by $V_{r,0}$, which is the solution to
\[
\frac{\partial V_{r,0}}{\partial t} = \sum_{k=1}^{r} \lambda_k \frac{\partial^2 V_{r,0}}{\partial x_k^2},
\qquad
V_{r,0}(x, 0) = g(x).
\]
So even if $g$ itself is not smooth, $V_{r,0}$ will be smooth except in degenerate situations which are analyzed in detail in Reference 9. Roughly speaking, as long as the locations of kinks and discontinuities are not parallel to all of the first $r$ coordinate axes, $V_{r,0}$ is smooth enough for the expansion error to be well defined.
The second important point is that as Equation 6.25 contains only mixed
derivative terms, for any payoffs which depend only on, say, x1 and xk for
some k > 1, the decomposition of the option price is exact. Moreover, the
value of any derivative that can be statically replicated by options with such
simple payoffs is found exactly. Again, a more detailed discussion is found in
Reference 9.

6.4.2 Variable coefficients


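The transformation 6.6 with appropriate $Q$ (see the discussion at the end of Section 6.1) ensures Equations 6.16 and 6.17 but this is only true at $t = 0$ and $x = a$. However, using arguments similar to Reference 9 and Section 6.4.1, we can still derive a PDE for the expansion error even for nonconstant coefficients. Straightforward calculus yields an expression similar to Equation 6.27, namely
\[
\frac{\partial}{\partial t} \widehat{V}_{r,1} = \sum_{k,l=1}^{r} \lambda_{kl}(z, t) \frac{\partial^2}{\partial x_k \partial x_l} \widehat{V}_{r,1}
\]
\[
\qquad + \sum_{k=r+1}^{N} \left( \lambda_{kk} \frac{\partial^2}{\partial x_k^2} + \sum_{l=1}^{r} \lambda_{kl} \frac{\partial^2}{\partial x_k \partial x_l} \right) \left( V_{\{1,\dots,r,k\}} - V \right) \tag{6.28}
\]
\[
\qquad - \sum_{\substack{k,l=r+1 \\ k \ne l}}^{N} \lambda_{kl} \frac{\partial^2}{\partial x_k \partial x_l} V \tag{6.29}
\]
\[
\qquad + \sum_{k=r+1}^{N} \kappa_k \frac{\partial}{\partial x_k} \left( V_{\{1,\dots,r,k\}} - V \right). \tag{6.30}
\]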

This equation contains three source terms, which determine the error size:

• The first term (see Equation 6.28) is similar to the source term appearing in the constant coefficient case. It is essentially a restricted differential operator applied to the difference between full and partial solutions.

• The second term (see Equation 6.29) consists of the nondiagonal terms not captured at all in the expansion, applied to the full solution. It contains the full solution rather than the difference between full and partial ones, but the $\lambda_{kl}$ involved are zero for $t = 0$ and $x = a$.

• The third term (see Equation 6.30), where $\kappa_k(a, 0) = 0$, captures the changes in $\kappa$ and again acts on the differences between partial and full solutions.

At $t = 0$ and $x = a$, all three source terms are zero, because
\[
V_{\{1,\dots,r,k\}}(x, 0) - V(x, 0) = 0 \quad \forall x \in \mathbb{R}^N
\qquad \text{and} \qquad
\lambda_{kl}(a, 0) = 0, \quad k \ne l.
\]
Away from these initial coordinates, the terms grow slowly and drive a nonzero error.
Instead of investigating this further theoretically, we give quantitative
examples in the following section.

6.5 Numerical Examples


In this section, we analyze the numerical accuracy of the decomposition from Section 6.3 for the approximation of European basket options, where the model for the underlying stock has variable coefficients. We list six "base" cases of how the PDE coefficients can be varied in Table 6.1.

TABLE 6.1: Different base cases with nonconstant parameters

Nonconstant component          Parameter    Example
Time-dependent drift           μ = μ(t)     Exactly described by Equations 6.5 and 6.6
Time-dependent volatilities    σ = σ(t)     Sections 6.5.3 and 6.5.4
Time-dependent correlation     ρ = ρ(t)     Sections 6.5.1 and 6.5.2
Asset-dependent drift          μ = μ(S)     LIBOR market model in Reference 12
Asset-dependent volatilities   σ = σ(S)     Local vol (not considered)
Asset-dependent correlation    ρ = ρ(S)     Section 6.5.5
Consider assets whose prices $S_t^1, \dots, S_t^N$ have dynamics given by
\[
d(\log S_t^i) = -\frac{1}{2} \sigma_i^2(S_t, t)\, dt + \sigma_i(S_t, t)\, dW_t^i, \quad 1 \le i \le N,
\]
under the risk-neutral measure with zero interest rates. Taking log prices as the primitive variables in Equation 6.1, the PDE coefficients are constant in a Black–Scholes setting, that is, if $\sigma$ and $\rho$ are constant. Generally, the Brownian motions $W^i$ are correlated according to the correlation matrix
\[
\left( \rho_{ij}(S, t) \right)_{1 \le i,j \le N} .
\]
We consider two possible correlation structures:
\[
\rho_{\mathrm{simple}}(\gamma) =
\begin{pmatrix}
1 & \gamma & \gamma & \cdots & \gamma \\
\gamma & 1 & \gamma & \cdots & \gamma \\
\vdots & & \ddots & & \vdots \\
\gamma & \gamma & \gamma & \cdots & 1
\end{pmatrix}
\]
for $\gamma \in (-1, 1)$ and
\[
\rho_{\exp,ij}(\gamma) = \exp(-\gamma |i - j|)
\]
for $\gamma > 0$, where we replace $\gamma$ by a function $\gamma : \mathbb{R}^N \times [0, T] \to \mathbb{R}$, possibly being asset- and time dependent. The covariance matrix $\Sigma(S, t)$ is then fully characterized via $\Sigma_{ij}(S, t) = \sigma_i(S, t) \sigma_j(S, t) \rho_{ij}(S, t)$. Due to the asset- and time dependency of correlations and volatilities, the asset distributions are no longer log-normal and hence a transformation of the pricing PDE to the standard heat equation is generally not possible.
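The two correlation structures and the resulting covariance matrix are straightforward to assemble. The sketch below uses NumPy with hypothetical helper names and illustrative parameter values ($N = 10$, $\sigma_i = 0.2$, $\gamma = 0.7$); it also shows the eigenvalue structure of the simple correlation case with equal volatilities, where one eigenvalue $\sigma^2 (1 + (N-1)\gamma)$ dominates and the remaining $N - 1$ eigenvalues equal $\sigma^2 (1 - \gamma)$.

```python
import numpy as np

def rho_simple(n, gamma):
    """Correlation matrix with ones on the diagonal and gamma elsewhere."""
    return np.full((n, n), gamma) + (1.0 - gamma) * np.eye(n)

def rho_exp(n, gamma):
    """Correlation matrix with entries exp(-gamma * |i - j|)."""
    idx = np.arange(n)
    return np.exp(-gamma * np.abs(idx[:, None] - idx[None, :]))

def covariance(sigma, rho):
    """Sigma_ij = sigma_i * sigma_j * rho_ij."""
    sigma = np.asarray(sigma, dtype=float)
    return np.outer(sigma, sigma) * rho

cov = covariance(np.full(10, 0.2), rho_simple(10, 0.7))
eigenvalues = np.linalg.eigvalsh(cov)   # returned in ascending order
print(eigenvalues)
```

In the asset- or time-dependent cases considered here, `gamma` would be re-evaluated at each state or time step before the matrix is rebuilt.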

As a test case, we choose a European arithmetic basket option with $N = 10$. The payout at maturity $T = 1$ is
\[
h(S) = \max\left( \sum_{i=1}^{N} \omega_i S_i - K,\ 0 \right),
\]
with strike $K = 100$ and weights $\omega_i \in \mathbb{R}$, $i = 1, \dots, N$. We will examine the value at the point $S_{0,i} = 100$ for all $i$. As payout weight vectors $\omega$, we consider

ω^1 = (1/10, 1/10, 1/10, 1/10, 1/10, 1/10, 1/10, 1/10, 1/10, 1/10),
ω^2 = (4/30, 4/30, 4/30, 4/30, 4/30, 2/30, 2/30, 2/30, 2/30, 2/30),
ω^3 = (1/4, 1/4, 1/4, 1/4, 1/4, 1/4, 1/4, −1/4, −1/4, −1/4).

Using $V_{1,1}$ as approximation to $V$, we expect that the accuracy will be best for $\omega^1$ and worst for $\omega^3$, because $\omega^1$ is parallel to the principal component of $\Sigma$ and $\omega^3$ closer to orthogonal.
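The alignment of the three weight vectors with the principal component can be quantified. The sketch below assumes the simple correlation structure with equal volatilities, for which the first eigenvector of $\Sigma$ is proportional to $(1, \dots, 1)$, and computes the cosine of the angle between each $\omega$ and that direction. The function names are hypothetical.

```python
import numpy as np

def basket_call_payoff(s, weights, strike=100.0):
    """Payout max(sum_i w_i S_i - K, 0) of the arithmetic basket call."""
    return max(float(np.dot(weights, s)) - strike, 0.0)

def cosine_to_first_component(weights):
    """Cosine of the angle between the weights and the direction (1, ..., 1)."""
    ones = np.ones_like(weights) / np.sqrt(len(weights))
    return float(weights @ ones / np.linalg.norm(weights))

w1 = np.full(10, 1 / 10)
w2 = np.array([4 / 30] * 5 + [2 / 30] * 5)
w3 = np.array([1 / 4] * 7 + [-1 / 4] * 3)

for w in (w1, w2, w3):
    print(round(cosine_to_first_component(w), 4),
          basket_call_payoff(np.full(10, 110.0), w))
```

The cosines come out as 1, about 0.95 and 0.4, which matches the ordering of expected accuracy stated in the text. All three weight vectors sum to one, so the option is at the money at $S_{0,i} = 100$ in every case.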
The numerical parameters chosen were $N_1 = N_2 = 800$ and $N_t = 1000$, corresponding to a time step of size $\Delta t = 0.001$. For the reference Monte Carlo estimator $V_{MC}$, we used $10^8$ paths. This setup reduces the discretization and simulation errors sufficiently for us to determine a good estimate of the expansion method's accuracy.
We implemented and tested two numerical algorithms for the solution of the PDE problems. One algorithm is the diagonal ADI method from Section 6.3.5 (with results denoted by $V_{PDE}^{diagADI}$), where we updated the diffusion coefficient values at every time step, and the PDE is solved numerically by Brian's scheme. The second method, from Section 6.3.4, does incorporate the off-diagonal terms in the lower dimensional problems (denoted $V_{PDE}^{HV}$), where the numerical PDE solution is based on the HV scheme.
We also compute the results for the fully frozen model from Section 6.3.3, that is, with covariance matrix fixed at $\Sigma(s_0, T)$, both for the expansion ($V_{PDE}^{loc}$) and a full Monte Carlo estimator ($V_{MC}^{loc}$). This allows us to understand what contribution to the error comes from the variability of the coefficients, compared to the decomposition error already present for constant coefficients.
Our primary intention here is to give a proof of concept, rather than an in-depth study of the performance and convergence. We want to demonstrate that, and how, expansion methods can be used for variable coefficients.

6.5.1 Time-dependent simple correlation


For time-dependent simple correlation $\rho(t) = \rho_{\mathrm{simple}}(t)$, the eigenvalues change over time. However, the lower $N - 1$ eigenvalues are identical and the subspace spanned by their eigenvectors does not change.
Table 6.2 shows results for $\sigma_i = 0.2$ and
\[
\rho(t) = \rho_{\mathrm{simple}}\left( 0.8 - 0.8 \cdot (t/T - 0.5)^2 \right) \in [\rho_{\mathrm{simple}}(0.6), \rho_{\mathrm{simple}}(0.8)].
\]

TABLE 6.2: Time-dependent simple correlation

              V_MC      V_PDE^diagADI   V_PDE^HV   V_MC^loc   V_PDE^loc
ω^1           6.9463    6.9451          6.9451     6.3784     6.3715
σ_MC          0.0011                               0.0010
Δ_abs                   −0.0012         −0.0012               −0.0069
Δ_rel                   −0.02%          −0.02%                −0.11%
Δ_abs/σ_MC              −1.06           −1.06                 −6.73
ω^2           6.9602    6.9584          6.9584     6.3991     6.3932
σ_MC          0.0011                               0.0010
Δ_abs                   −0.0018         −0.0018               −0.0059
Δ_rel                   −0.03%          −0.03%                −0.09%
Δ_abs/σ_MC              −1.57           −1.57                 −5.75
ω^3           7.5631    7.5816          7.5816     7.3585     7.4069
σ_MC          0.0012                               0.0012
Δ_abs                   0.0185          0.0185                0.0484
Δ_rel                   0.24%           0.24%                 −0.66%
Δ_abs/σ_MC              14.96           14.96                 −40.58

PDE/ADI and PDE/HV results were almost identical and very close to the MC results. Only in the third case of $\omega^3$ did they even differ in a statistically significant way, that is, relative to the standard error $\sigma_{MC}$, from the MC computation. It is worth noting that the errors are even slightly larger in the fully frozen case, implying that the variable coefficients present no particular problem in this model.

6.5.2 Time-dependent exponential correlation


For a time-dependent exponential correlation $\rho(t) = \rho_{\exp}(t)$, the eigenvalues and eigenvectors change substantially over time, resulting in a significant contribution from nonzero off-diagonal elements in $\lambda(t)$.
Table 6.3 shows results for $\sigma_i = 0.2$ and
\[
\rho(t) = \rho_{\exp}\left( 0.25 - 0.6 \cdot (t/T - 0.5)^2 \right) \in [\rho_{\exp}(0.1), \rho_{\exp}(0.25)].
\]
PDE/ADI results are again close to the MC results. The PDE/HV results differ somewhat more, contrary to expectation, but note that both solutions are significantly more accurate than the constant coefficient approximation. The third case, $\omega^3$, is again the most challenging one for the dimension-wise method.

TABLE 6.3: Time-dependent exponential correlation

              V_MC      V_PDE^diagADI   V_PDE^HV   V_MC^loc   V_PDE^loc
ω^1           6.0662    6.0738          6.0590     6.8534     6.8477
σ_MC          0.0010                               0.0011
Δ_abs                   0.0076          0.0885                −0.0057
Δ_rel                   0.13%           1.46%                 −0.08%
Δ_abs/σ_MC              7.82            90.88                 −5.11
ω^2           6.1646    6.1695          6.1547     6.9109     6.9085
σ_MC          0.0010                               0.0011
Δ_abs                   0.0049          −0.0099               −0.0024
Δ_rel                   0.08%           −0.16%                −0.03%
Δ_abs/σ_MC              4.92            −10.00                −2.15
ω^3           9.6062    9.5346          9.7786     9.2907     9.3279
σ_MC          0.0015                               0.0015
Δ_abs                   −0.0716         0.1724                0.0372
Δ_rel                   −0.75%          1.80%                 −0.40%
Δ_abs/σ_MC              −46.34          111.54                24.44

6.5.3 Time-dependent volatilities, simple correlation


For time-dependent $\sigma_i = \sigma(t)$, that is, the case where all volatilities are time dependent but equal, the eigenvalues $\lambda_1, \dots, \lambda_N$ of $\Sigma$ are simply scaled up or down over time and the matrix of eigenvectors stays constant. This means that all nondiagonal terms of $\lambda$ vanish and the transformation to the heat equation is exact. This case is simple: it merely requires the solution of a heat equation with time-dependent diffusion coefficients.
For time-dependent $\sigma_i = \sigma_i(t)$, that is, the case where the volatilities vary differently over time, the eigenvectors change with $t$. This in general leads to the appearance of nonzero off-diagonal terms. With no dependency on the asset values $S$, the initial PDE transformation means that those terms vanish at time $t = 0$ and then grow over time for $t > 0$.
Table 6.4 shows results for $\rho = \rho_{\mathrm{simple}}(0.7)$ and
\[
\sigma_i(t) = 0.1\,(1 + t/T) \left( 1 + \frac{i-1}{N-1} \right) \in [0.1, 0.2] \left( 1 + \frac{i-1}{N-1} \right).
\]
Both the PDE/diagonal ADI and PDE/HV results are fairly accurate for the first two test cases. They both struggle with the third one, producing errors of 2.42% and 2.66%. Given that a similar error is present in the fully localized case, that is, for the model with constant coefficients, we conclude that this error is primarily due to the expansion method being applied to the challenging payout direction $\omega^3$, rather than the nonconstant nature of the coefficients.

TABLE 6.4: Time-dependent volatilities, simple correlation

              V_MC      V_PDE^diagADI   V_PDE^HV   V_MC^loc   V_PDE^loc
ω^1           7.7987    7.7947          7.8234     5.1128     5.1123
σ_MC          0.0013                               0.0008
Δ_abs                   −0.0040         0.0248                −0.0005
Δ_rel                   −0.05%          0.32%                 −0.01%
Δ_abs/σ_MC              −3.10           19.19                 −0.57
ω^2           7.3183    7.3151          7.3416     4.7972     4.7961
σ_MC          0.0012                               0.0008
Δ_abs                   −0.0032         0.0233                −0.0011
Δ_rel                   −0.04%          0.32%                 −0.02%
Δ_abs/σ_MC              −2.67           19.39                 −1.41
ω^3           6.2074    6.3579          6.3723     4.0555     4.1658
σ_MC          0.0010                               0.0006
Δ_abs                   0.1504          0.1649                0.1103
Δ_rel                   2.42%           2.66%                 −2.72%
Δ_abs/σ_MC              150.26          164.72                174.38

6.5.4 Time-dependent volatilities, exponential correlation


Table 6.5 shows results for
\[
\sigma_i(t) = 0.1\,(1 + t/T) \left( 1 + \frac{i-1}{N-1} \right) \in [0.1, 0.2] \left( 1 + \frac{i-1}{N-1} \right)
\]
and
\[
\rho(t) = \rho_{\exp}\left( 0.25 - 0.6 \cdot (t/T - 0.5)^2 \right) \in [\rho_{\exp}(0.1), \rho_{\exp}(0.25)].
\]

TABLE 6.5: Time-dependent volatilities, exponential correlation

              V_MC      V_PDE^diagADI   V_PDE^HV   V_MC^loc   V_PDE^loc
ω^1           6.9951    7.0905          7.1454     5.1602     5.1595
σ_MC          0.0012                               0.0008
Δ_abs                   0.0955          0.1503                −0.0007
Δ_rel                   1.36%           2.15%                 −0.01%
Δ_abs/σ_MC              83.12           130.87                −0.86
ω^2           6.5570    6.7953          6.7047     4.8380     4.8382
σ_MC          0.0011                               0.0008
Δ_abs                   0.2383          0.1477                0.0002
Δ_rel                   3.63%           2.25%                 0.00%
Δ_abs/σ_MC              223.82          138.71                0.27
ω^3           9.6494    10.0537         9.8383     5.6868     5.7252
σ_MC          0.0015                               0.0009
Δ_abs                   0.4042          0.1889                0.0384
Δ_rel                   4.19%           1.96%                 0.67%
Δ_abs/σ_MC              265.62          124.13                43.29

By combining time-dependent volatilities with time-dependent correlation, we have created a challenging scenario for our method. The PDE/diagonal ADI approach starts to be insufficient for the more complicated cases, differing by more than 4% for $\omega^3$. The PDE/HV algorithm produces a relatively constant error of about 2% in all three test cases.
Contrasting with the fully frozen approximation, it is evident that this is the first scenario in which the variability of the coefficients creates a major contribution to the overall error.

6.5.5 Asset-dependent correlation


Table 6.6 shows results for $\sigma_i = 0.2$ and
\[
\rho(S) = \rho_{\mathrm{simple}}\left( 0.6 + 0.2 \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \frac{|S_i - 100|}{10} \right) \right) \in [\rho_{\mathrm{simple}}(0.6), \rho_{\mathrm{simple}}(0.8)].
\]

TABLE 6.6: Asset-dependent correlation

              V_MC      V_PDE^diagADI   V_PDE^HV   V_MC^loc   V_PDE^loc
ω^1           6.7937    1.4032          6.7393     7.2138     7.2147
σ_MC          0.0108                               0.0010
Δ_abs                   −5.3905         −0.0544               −0.0009
Δ_rel                   −79.35%         −0.80%                −0.01%
Δ_abs/σ_MC              −497.65         −5.02                 −0.75
ω^2           6.7910    6.7910          6.7534     7.2232     7.2239
σ_MC          0.0109                               0.0010
Δ_abs                   −5.3660         −0.0376               −0.0008
Δ_rel                   −79.02%         −0.55%                −0.01%
Δ_abs/σ_MC              −494.23         −3.47                 −0.64
ω^3           7.4977    2.4238          7.3838     7.6708     7.6663
σ_MC          0.0122                               0.0010
Δ_abs                   −5.0739         −0.1139               0.0045
Δ_rel                   −67.67%         −1.52%                0.06%
Δ_abs/σ_MC              −416.90         −9.36                 3.56

Because of the added computational complexity of having to calculate the correlation for every vector of asset values encountered, these calculations were done with $10^6$ Monte Carlo paths, $J = 400$ grid points and $M = 400$ time steps.
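The state dependence of the correlation parameter in this example can be sketched as follows. The function names and default arguments are hypothetical; the formula is the one given above for $\rho(S)$, with the simple correlation matrix rebuilt for each asset vector.

```python
import numpy as np

def gamma_of_assets(s, level=100.0, scale=10.0):
    """Correlation parameter 0.6 + 0.2 * exp(-mean(|S_i - level|) / scale)."""
    s = np.asarray(s, dtype=float)
    return 0.6 + 0.2 * np.exp(-np.mean(np.abs(s - level)) / scale)

def rho_simple(n, gamma):
    """Correlation matrix with ones on the diagonal and gamma elsewhere."""
    return np.full((n, n), gamma) + (1.0 - gamma) * np.eye(n)

at_the_money = gamma_of_assets(np.full(10, 100.0))
far_away = gamma_of_assets(np.full(10, 200.0))
rho_now = rho_simple(10, at_the_money)
print(at_the_money, far_away)
```

The parameter equals 0.8 when all assets sit at 100 and decays toward 0.6 as they move away, which is why the matrix has to be recomputed along every simulated path and at every grid point.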
Clearly, the PDE/diagonal ADI approach is insufficient and the nondiago-
nal PDE terms are necessary for the solution. The PDE/HV approach, which
incorporates them, correspondingly gives fairly accurate results for ω 1 and ω 2 ,
relative to the MC variance. As before, the accuracy decreases for the ω 3 case,
which coincidentally depends only weakly on the chosen correlation dynamics.

6.6 Conclusion
This chapter describes a systematic approach to approximating medium- to high-dimensional PDEs in derivative pricing by a sequence of lower dimensional PDEs, which are then accessible to state-of-the-art finite difference methods. The splitting is accurate especially in situations where the dynamics of the underlying stochastic processes can be described well by a smaller number of components. In such situations, the decomposition can loosely be interpreted as a Taylor expansion with respect to small perturbations in the other directions.
To complement the theoretical analysis of the method in the constant parameter setting in earlier work, we describe here various extensions to variable parameters and analyze their accuracy through extensive numerical tests. Although the examples are necessarily specific, they are chosen to cover a spectrum of effects which occur in derivative pricing applications. As the approximation errors are determined locally by the variability of the solution and the parameters with respect to the different coordinates and time, the examples are to some extent representative of a wider class of situations.
Specifically, we designed test cases where different parameters varied with respect to spatial coordinates and time, and where the payoff varied most rapidly in different directions relative to the principal component of the covariance matrix. Across all cases, the $\omega^1$ case, where the payout vector is parallel to the first eigenvector of $\Sigma$, showed the best accuracy, while the $\omega^3$ case showed the worst. This was expected from the theoretical analysis and the results for constant coefficients; see Section 6.4.1.
Overall, our computations demonstrate that expansion methods can in principle be applied in this fashion to some variable coefficient asset models. Higher order methods or other extensions might be necessary to reduce the error sufficiently for real-world financial applications.

References
1. Tavella, D. and Randall, C. Pricing Financial Instruments: The Finite Difference Method. Wiley, New York, 2000.

2. in 't Hout, K. J. and Foulon, S. ADI finite difference schemes for option pricing. Int. J. Numer. Anal. Mod., 7(2):303–320, 2010.

3. Reisinger, C. and Wittum, G. Efficient hierarchical approximation of high-dimensional option pricing problems. SIAM J. Sci. Comp., 29(1):440–458, 2007.

4. Leentvaar, C. C. W. and Oosterlee, C. W. On coordinate transformation and grid stretching for sparse grid pricing of basket options. J. Comput. Appl. Math., 222:193–209, 2008.

5. Pettersson, U., Larsson, E., Marcusson, G., and Persson, J. Improved radial basis function methods for multi-dimensional option pricing. J. Comput. Appl. Math., 222(1):82–93, 2008.

6. Kazeev, V., Reichmann, O., and Schwab, C. Low-rank tensor structure of linear diffusion operators in the TT and QTT formats. Linear Algebra Appl., 438(11):4204–4221, 2013.

7. Grasedyck, L., Kressner, D., and Tobler, C. A literature survey of low-rank tensor approximation techniques. GAMM-Mitteilungen, 36(1):53–78, 2013.

8. Griebel, M. and Holtz, M. Dimension-wise integration of high-dimensional functions with applications to finance. J. Complex., 26(5):455–489, 2010.

9. Reisinger, C. and Wissmann, R. Error analysis of truncated expansion solutions to high-dimensional parabolic PDEs. arXiv preprint arXiv:1505.04639, 2015.

10. Hilber, N., Kehtari, S., Schwab, C., and Winter, C. Wavelet finite element method for option pricing in high-dimensional diffusion market models. Technical Report 2010–01, SAM, ETH Zürich, 2010.

11. Schröder, P., Schober, P., and Wittum, G. Dimension-wise decompositions and their efficient parallelization. Electronic version of an article published in Recent Developments in Computational Finance, Interdisciplinary Mathematical Sciences, 14:445–472, 2013.

12. Reisinger, C. and Wissmann, R. Numerical valuation of derivatives in high-dimensional settings via PDE expansions. J. Comp. Fin., 18(4):95–127, 2015.

13. de Graaf, C. S. L., Kandhai, D., and Reisinger, C. Efficient exposure computation by risk factor decomposition. arXiv preprint arXiv:1608.01197, 2016.

14. Reisinger, C. Asymptotic expansion around principal components and the complexity of dimension adaptive algorithms. In Garcke, J. and Griebel, M., editors, Sparse Grids and Applications, number 88 in Springer Lecture Notes in Computational Science and Engineering, Springer-Verlag, Berlin, Heidelberg, pages 263–276, 2012.

15. Schröder, P., Gerstner, T., and Wittum, G. Taylor-like ANOVA-expansion for high dimensional problems in finance. Working paper, 2012.

16. Musiela, M. and Rutkowski, M. Martingale Methods in Financial Modelling. Springer, Berlin, 2nd edition, 2005.

17. Brian, P. L. T. A finite-difference method of higher-order accuracy for the solution of three-dimensional transient heat conduction problems. AIChE J., 7:367–370, 1961.

18. Hundsdorfer, W. and Verwer, J. Numerical Solution of Time-Dependent Advection–Diffusion Reaction Equations, volume 33. Springer-Verlag, Berlin, Heidelberg, 2013.

19. Douglas, J. Alternating direction methods for three space variables. Numer. Math., 4(1):41–63, 1962.

20. in 't Hout, K. J. and Mishra, C. Stability of ADI schemes for multidimensional diffusion equations with mixed derivative terms. Appl. Numer. Math., 74:83–94, 2013.

21. Haentjens, T. and in 't Hout, K. J. ADI finite difference schemes for the Heston–Hull–White PDE. J. Comp. Fin., 16(1):83–110, 2012.
Chapter 7
Multilevel Monte Carlo Methods
for Applications in Finance∗

Michael B. Giles and Lukasz Szpruch

CONTENTS
7.1 Introduction
7.2 Multilevel Monte Carlo
    7.2.1 Monte Carlo
    7.2.2 Multilevel Monte Carlo theorem
    7.2.3 Improved multilevel Monte Carlo
    7.2.4 Stochastic differential equations
    7.2.5 Euler and Milstein discretizations
    7.2.6 Multilevel Monte Carlo algorithm
7.3 Pricing with Multilevel Monte Carlo
    7.3.1 Euler–Maruyama scheme
    7.3.2 Milstein scheme
        Lookback options
    7.3.3 Conditional Monte Carlo
    7.3.4 Barrier options
    7.3.5 Digital options
7.4 Greeks with Multilevel Monte Carlo
    7.4.1 Monte Carlo greeks
    7.4.2 Multilevel Monte Carlo greeks
    7.4.3 European call
    7.4.4 Conditional Monte Carlo for pathwise sensitivity
    7.4.5 Split pathwise sensitivities
    7.4.6 Optimal number of samples
    7.4.7 Vibrato Monte Carlo
7.5 Multilevel Monte Carlo for Jump-Diffusion Processes
    7.5.1 A jump-adapted Milstein discretization
        Multilevel Monte Carlo for constant jump rate
        Multilevel Monte Carlo for path-dependent rates
    7.5.2 Lévy processes
7.6 Multidimensional Milstein Scheme
    7.6.1 Antithetic multilevel Monte Carlo estimator
    7.6.2 Clark–Cameron example
    7.6.3 Milstein discretization: General theory
    7.6.4 Piecewise linear interpolation analysis
    7.6.5 Simulations for antithetic Monte Carlo
7.7 Other Uses of Multilevel Method
    7.7.1 Stochastic partial differential equations
    7.7.2 Nested simulation
    7.7.3 Truncated series expansions
    7.7.4 Mixed precision arithmetic
7.8 Multilevel Quasi-Monte Carlo
7.9 Conclusion
Appendix 7A Analysis of Brownian Bridge Interpolation
References

∗Previously published in Recent Developments in Computational Finance: Foundations, Algorithms and Applications, Thomas Gerstner and Peter Kloeden, editors, Copyright © 2013 by World Scientific Publishing Co. Pte. Ltd.

7.1 Introduction
In 2001, Heinrich [1] developed a multilevel Monte Carlo method for
parametric integration, in which one is interested in estimating the value of
$\mathbb{E}[f(x, \lambda)]$, where $x$ is a finite-dimensional random variable and $\lambda$ is a parameter.
In the simplest case in which $\lambda$ is a real variable in the range $[0, 1]$, having
estimated the values of $\mathbb{E}[f(x, 0)]$ and $\mathbb{E}[f(x, 1)]$, one can use $\tfrac{1}{2}(f(x, 0) + f(x, 1))$
as a control variate when estimating the value of $\mathbb{E}[f(x, \tfrac{1}{2})]$, since the variance
of $f(x, \tfrac{1}{2}) - \tfrac{1}{2}(f(x, 0) + f(x, 1))$ will usually be less than the variance of $f(x, \tfrac{1}{2})$.
This approach can then be applied recursively for other intermediate values
of $\lambda$, yielding large savings if $f(x, \lambda)$ is sufficiently smooth with respect to $\lambda$.
Giles’ multilevel Monte Carlo path simulation [2] is both similar and dif-
ferent. There is no parametric integration, and the random variable is infinite-
dimensional, corresponding to a Brownian path in the original paper. However,
the control variate viewpoint is very similar. A coarse path simulation is used
as a control variate for a more refined fine path simulation, but since the
exact expectation for the coarse path is not known, this is in turn estimated
recursively using even coarser path simulations as control variates. The coarsest
path in the multilevel hierarchy may have only one timestep for the entire
interval of interest.
A similar two-level strategy was developed slightly earlier by Kebaier [3],
and a similar multilevel approach was under development at the same time
by Speight [4,5].
In this review article, we start by introducing the central ideas in multilevel
Monte Carlo (MLMC) simulation, and the key theorem from Reference 2
which gives the greatly improved computational cost if a number of conditions
are satisfied. The challenge then is to construct numerical methods which
satisfy these conditions, and we consider this for a range of computational
finance applications.

7.2 Multilevel Monte Carlo


7.2.1 Monte Carlo
Monte Carlo simulations have become an essential tool in the pricing of
derivative securities and in risk management. In the abstract setting, our goal
is to numerically approximate the expected value $\mathbb{E}[Y]$, where $Y = P(X)$ is
a functional of a random variable $X$. In most financial applications, we are
not able to sample $X$ directly and hence, in order to perform Monte Carlo
simulations, we approximate $X$ with $X_{\Delta t}$ such that $\mathbb{E}[P(X_{\Delta t})] \to \mathbb{E}[P(X)]$
as $\Delta t \to 0$. Using $X_{\Delta t}$ to compute $N$ approximation samples produces the
standard Monte Carlo estimate
$$\hat{Y} = \frac{1}{N} \sum_{i=1}^{N} P(X^i_{\Delta t}),$$
where $X^i_{\Delta t}$ is the numerical approximation to $X$ on the $i$th sample path and
$N$ is the number of independent simulations of $X$. By standard Monte Carlo
results, $\hat{Y} \to \mathbb{E}[Y]$ as $\Delta t \to 0$ and $N \to \infty$. In practice, we perform a
Monte Carlo simulation with given $\Delta t > 0$ and finite $N$, which introduces an error in
the approximation of $\mathbb{E}[Y]$. Here we are interested in the mean square error,
that is,
$$MSE \equiv \mathbb{E}\big[(\hat{Y} - \mathbb{E}[Y])^2\big].$$
Our goal in the design of the Monte Carlo algorithm is to estimate $\mathbb{E}[Y]$ to within
a root-mean-square error $\varepsilon$ (that is, $MSE \le \varepsilon^2$) as efficiently as possible,
in other words to minimize the computational complexity required to achieve the
desired mean square error. For standard Monte Carlo simulations, the mean
square error can be expressed as
$$\mathbb{E}\big[(\hat{Y} - \mathbb{E}[Y])^2\big] = \mathbb{E}\big[(\hat{Y} - \mathbb{E}[\hat{Y}] + \mathbb{E}[\hat{Y}] - \mathbb{E}[Y])^2\big] = \underbrace{\mathbb{E}\big[(\hat{Y} - \mathbb{E}[\hat{Y}])^2\big]}_{\text{Monte Carlo variance}} + \underbrace{\big(\mathbb{E}[\hat{Y}] - \mathbb{E}[Y]\big)^2}_{\text{bias of the approximation}}.$$
The Monte Carlo variance is proportional to $1/N$:
$$\mathbb{V}[\hat{Y}] = \frac{1}{N^2}\, \mathbb{V}\Big[\sum_{i=1}^{N} P(X^i_{\Delta t})\Big] = \frac{1}{N}\, \mathbb{V}[P(X_{\Delta t})].$$
For both Euler–Maruyama (EM) and Milstein approximations, typically
$|\mathbb{E}[\hat{Y}] - \mathbb{E}[Y]| = O(\Delta t)$. Hence the mean square error for standard Monte Carlo
simulations is given by
$$\mathbb{E}\big[(\hat{Y} - \mathbb{E}[Y])^2\big] = O\Big(\frac{1}{N}\Big) + O(\Delta t^2).$$
To ensure the root-mean-square error is proportional to $\varepsilon$, we must have
$MSE = O(\varepsilon^2)$ and therefore $1/N = O(\varepsilon^2)$ and $\Delta t^2 = O(\varepsilon^2)$, which means
$N = O(\varepsilon^{-2})$ and $\Delta t = O(\varepsilon)$. The computational cost of a standard Monte
Carlo simulation is proportional to the number of paths $N$ multiplied by the
cost of generating a path, that is, the number of timesteps in each sample
path. Therefore, the cost is $C = O(\varepsilon^{-3})$. In the following section, we will
show that using MLMC we can reduce the complexity of achieving root-mean-square
error $\varepsilon$ to $O(\varepsilon^{-2})$.
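The cost argument above can be made concrete with a small sketch. The following Python code is illustrative only: the function names, the geometric Brownian motion test problem, and the parameter values (`mu`, `sigma`, `K`) are assumptions chosen for the example, and the proportionality constants in $N = O(\varepsilon^{-2})$ and $\Delta t = O(\varepsilon)$ are simply set to one.

```python
import math
import random


def euler_gbm_terminal(x0, mu, sigma, T, n_steps, rng):
    """One Euler-Maruyama path of dx = mu*x dt + sigma*x dw; returns X at time T."""
    dt = T / n_steps
    x = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over one timestep
        x += mu * x * dt + sigma * x * dw
    return x


def standard_mc(eps, x0=1.0, mu=0.05, sigma=0.2, T=1.0, K=1.0, seed=0):
    """Standard Monte Carlo estimate of E[(x(T) - K)^+] for target accuracy eps.

    Assumption for illustration: the constants in N = O(eps^-2) and
    dt = O(eps) are taken to be one.
    """
    rng = random.Random(seed)
    n_paths = math.ceil(eps ** -2)   # N = O(eps^-2) controls the variance
    n_steps = math.ceil(T / eps)     # dt = O(eps) controls the bias
    total = 0.0
    for _ in range(n_paths):
        x_T = euler_gbm_terminal(x0, mu, sigma, T, n_steps, rng)
        total += max(x_T - K, 0.0)
    cost = n_paths * n_steps         # total number of timesteps, O(eps^-3)
    return total / n_paths, cost
```

Halving `eps` in this sketch multiplies the returned `cost` by roughly eight, which is the $O(\varepsilon^{-3})$ scaling described in the text.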

7.2.2 Multilevel Monte Carlo theorem


In its most general form, MLMC simulation uses a number of levels of
resolution, $\ell = 0, 1, \ldots, L$, with $\ell = 0$ being the coarsest, and $\ell = L$ being
the finest. In the context of an SDE simulation, level 0 may have just one
timestep for the whole time interval $[0, T]$, whereas level $L$ might have $2^L$
uniform timesteps $\Delta t_L = 2^{-L} T$.
If $P$ denotes the payoff (or other output functional of interest), and $P_\ell$
denotes its approximation on level $\ell$, then the expected value $\mathbb{E}[P_L]$ on the
finest level is equal to the expected value $\mathbb{E}[P_0]$ on the coarsest level plus a sum
of corrections which give the difference in expectation between simulations on
successive levels,
$$\mathbb{E}[P_L] = \mathbb{E}[P_0] + \sum_{\ell=1}^{L} \mathbb{E}[P_\ell - P_{\ell-1}]. \tag{7.1}$$
The idea behind MLMC is to independently estimate each of the expectations
on the right-hand side of Equation 7.1 in a way which minimizes the overall
variance for a given computational cost. Let $Y_0$ be an estimator for $\mathbb{E}[P_0]$
using $N_0$ samples, and let $Y_\ell$, $\ell > 0$, be an estimator for $\mathbb{E}[P_\ell - P_{\ell-1}]$ using $N_\ell$
samples. The simplest estimator is a mean of $N_\ell$ independent samples, which
for $\ell > 0$ is
$$Y_\ell = N_\ell^{-1} \sum_{i=1}^{N_\ell} \big(P_\ell^i - P_{\ell-1}^i\big). \tag{7.2}$$
The key point here is that $P_\ell^i - P_{\ell-1}^i$ should come from two discrete approximations
for the same underlying stochastic sample [6], so that on finer levels
of resolution the difference is small (due to strong convergence) and so its
variance is also small. Hence very few samples will be required on finer levels
to accurately estimate the expected value.
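The coupling of the fine and coarse paths in Equation 7.2 can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names, the geometric Brownian motion dynamics and the call payoff are assumptions made for the example. The essential point is that the coarse increment is the sum of the two fine increments, so both paths are driven by the same Brownian sample.

```python
import math
import random


def coupled_level_sample(level, rng, x0=1.0, mu=0.05, sigma=0.2, T=1.0, K=1.0):
    """One sample of P_l - P_{l-1} (or of P_0 when level == 0).

    Fine path: 2**level Euler-Maruyama steps. Coarse path: 2**(level-1) steps,
    each driven by the sum of two consecutive fine Brownian increments.
    """
    n_fine = 2 ** level
    dt = T / n_fine
    x_fine = x_coarse = x0
    dw_coarse = 0.0
    for n in range(n_fine):
        dw = rng.gauss(0.0, math.sqrt(dt))
        x_fine += mu * x_fine * dt + sigma * x_fine * dw
        dw_coarse += dw
        if level > 0 and n % 2 == 1:
            # Two fine steps completed: advance the coarse path by one coarse step.
            x_coarse += mu * x_coarse * (2.0 * dt) + sigma * x_coarse * dw_coarse
            dw_coarse = 0.0
    payoff_fine = max(x_fine - K, 0.0)
    if level == 0:
        return payoff_fine
    return payoff_fine - max(x_coarse - K, 0.0)


def level_mean_and_variance(level, n_samples, seed=0):
    """Sample mean Y_l and sample variance of the level-l correction."""
    rng = random.Random(seed)
    samples = [coupled_level_sample(level, rng) for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    var = sum((s - mean) ** 2 for s in samples) / (n_samples - 1)
    return mean, var
```

Calling `level_mean_and_variance` for increasing `level` should show the sample variance of the correction shrinking as the timestep is refined, which is what allows fewer samples to be used on the finer levels.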
The combined MLMC estimator $\hat{Y}$ is
$$\hat{Y} = \sum_{\ell=0}^{L} Y_\ell.$$
We can observe that
$$\mathbb{E}[Y_\ell] = N_\ell^{-1} \sum_{i=1}^{N_\ell} \mathbb{E}[P_\ell^i - P_{\ell-1}^i] = \mathbb{E}[P_\ell^i - P_{\ell-1}^i],$$
and
$$\mathbb{E}[\hat{Y}] = \sum_{\ell=0}^{L} \mathbb{E}[Y_\ell] = \mathbb{E}[P_0] + \sum_{\ell=1}^{L} \mathbb{E}[P_\ell - P_{\ell-1}] = \mathbb{E}[P_L].$$
Although we are using different levels with different discretization errors to
estimate $\mathbb{E}[P]$, the final accuracy depends on the accuracy of the finest level $L$.
Here we recall the theorem which gives the complexity of MLMC estimation;
it is a slight generalization of the original theorem in Reference 2.
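Theorem 7.1. Let $P$ denote a functional of the solution of a stochastic
differential equation (SDE), and let $P_\ell$ denote the corresponding level $\ell$
numerical approximation. If there exist independent estimators $Y_\ell$ based on
$N_\ell$ Monte Carlo samples, and positive constants $\alpha, \beta, \gamma, c_1, c_2, c_3$ such that
$\alpha \ge \tfrac{1}{2} \min(\beta, \gamma)$ and

(i) $|\mathbb{E}[P_\ell - P]| \le c_1 2^{-\alpha \ell}$

(ii) $\mathbb{E}[Y_\ell] = \begin{cases} \mathbb{E}[P_0], & \ell = 0 \\ \mathbb{E}[P_\ell - P_{\ell-1}], & \ell > 0 \end{cases}$

(iii) $\mathbb{V}[Y_\ell] \le c_2 N_\ell^{-1} 2^{-\beta \ell}$

(iv) $C_\ell \le c_3 N_\ell 2^{\gamma \ell}$, where $C_\ell$ is the computational complexity of $Y_\ell$

then there exists a positive constant $c_4$ such that for any $\varepsilon < e^{-1}$ there are
values $L$ and $N_\ell$ for which the multilevel estimator
$$Y = \sum_{\ell=0}^{L} Y_\ell$$
has a mean-square-error with bound
$$MSE \equiv \mathbb{E}\big[(Y - \mathbb{E}[P])^2\big] < \varepsilon^2$$
with a computational complexity $C$ with bound
$$C \le \begin{cases} c_4\, \varepsilon^{-2}, & \beta > \gamma, \\ c_4\, \varepsilon^{-2} (\log \varepsilon)^2, & \beta = \gamma, \\ c_4\, \varepsilon^{-2-(\gamma-\beta)/\alpha}, & 0 < \beta < \gamma. \end{cases}$$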
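To see how the three regimes of Theorem 7.1 compare, the bound can be evaluated directly. The helper below is a hypothetical illustration (the function name and the default `c4 = 1.0` are assumptions; the theorem only asserts that some constant $c_4$ exists), so only the scaling with $\varepsilon$ is meaningful.

```python
import math


def mlmc_cost_bound(eps, alpha, beta, gamma, c4=1.0):
    """Evaluate the complexity bound of Theorem 7.1 for given rates.

    c4 is an unknown problem-dependent constant; 1.0 is a placeholder.
    """
    if not 0.0 < eps < math.exp(-1.0):
        raise ValueError("the theorem requires 0 < eps < 1/e")
    if alpha < 0.5 * min(beta, gamma):
        raise ValueError("the theorem requires alpha >= min(beta, gamma) / 2")
    if beta > gamma:
        return c4 * eps ** -2
    if beta == gamma:
        return c4 * eps ** -2 * math.log(eps) ** 2
    return c4 * eps ** (-2.0 - (gamma - beta) / alpha)
```

For example, with $\alpha = \beta = \gamma = 1$ the bound carries the extra $(\log \varepsilon)^2$ factor, whereas $\beta = 2$, $\gamma = 1$ gives the optimal $\varepsilon^{-2}$ scaling.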

7.2.3 Improved multilevel Monte Carlo


In the previous section, we showed that the key step in MLMC analysis is
the estimation of the variance $\mathbb{V}[P_\ell^i - P_{\ell-1}^i]$. As will become clearer in the
next section, this is related to strong convergence results for approximations
of SDEs, which differentiates MLMC from standard Monte Carlo, where we only
require a weak error bound for approximations of SDEs.
We will demonstrate that in fact the classical strong convergence may not
be necessary for a good MLMC variance. In Equation 7.2, we have used the
same estimator for the payoff $P_\ell$ on every level $\ell$, and therefore Equation 7.1
is a trivial identity due to the telescoping summation. However, in Reference 7
Giles demonstrated that it can be better to use different estimators for the
finer and coarser of the two levels being considered: $P_\ell^f$ when level $\ell$ is the
finer level, and $P_\ell^c$ when level $\ell$ is the coarser level. In this case, we require
that
$$\mathbb{E}[P_\ell^f] = \mathbb{E}[P_\ell^c] \quad \text{for } \ell = 1, \ldots, L, \tag{7.3}$$
so that
$$\mathbb{E}[P_L^f] = \mathbb{E}[P_0^f] + \sum_{\ell=1}^{L} \mathbb{E}[P_\ell^f - P_{\ell-1}^c].$$
The MLMC Theorem is still applicable to this modified estimator. The advantage
is that it gives the flexibility to construct approximations for which
$P_\ell^f - P_{\ell-1}^c$ is much smaller than the original $P_\ell - P_{\ell-1}$, giving a larger value
for $\beta$, the rate of variance convergence in condition (iii) in the theorem. In
the following sections, we demonstrate how suitable choices of $P_\ell^f$ and $P_\ell^c$ can
dramatically increase the rate of convergence of the variance of the MLMC estimator.
A good choice of estimators, as we shall see, often follows from analysis
of the problem under consideration from the distributional point of view. We
will demonstrate that methods that had been used previously to improve the
weak order of convergence can also improve the order of convergence of the
MLMC variance.

7.2.4 Stochastic differential equations


First, we consider a general class of $d$-dimensional SDEs driven by
Brownian motion. These are the primary objects of study in mathematical
finance. In subsequent sections, we demonstrate extensions of MLMC beyond
the Brownian setting.
Let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, \mathbb{P})$ be a complete probability space with a filtration
$\{\mathcal{F}_t\}_{t \ge 0}$ satisfying the usual conditions, and let $w(t)$ be an $m$-dimensional
Brownian motion defined on the probability space. We consider the numerical
approximation of SDEs of the form
$$dx(t) = f(x(t))\, dt + g(x(t))\, dw(t), \tag{7.4}$$
where $x(t) \in \mathbb{R}^d$ for each $t \ge 0$, $f \in C^2(\mathbb{R}^d, \mathbb{R}^d)$, $g \in C^2(\mathbb{R}^d, \mathbb{R}^{d \times m})$, and for
simplicity we assume a fixed initial value $x_0 \in \mathbb{R}^d$. The most prominent example
of SDEs in finance is a geometric Brownian motion
$$dx(t) = \alpha x(t)\, dt + \beta x(t)\, dw(t),$$
where $\alpha, \beta > 0$. Although we can solve this equation explicitly, it is still
worthwhile to approximate its solution numerically in order to judge the
performance of the numerical procedure we wish to apply to more complex
problems. Another interesting example is the famous Heston stochastic volatility
model
$$\begin{cases} ds(t) = r s(t)\, dt + s(t) \sqrt{v(t)}\, dw_1(t) \\ dv(t) = \kappa(\theta - v(t))\, dt + \sigma \sqrt{v(t)}\, dw_2(t) \\ dw_1\, dw_2 = \rho\, dt, \end{cases} \tag{7.5}$$
where $r, \kappa, \theta, \sigma > 0$. In this case, we do not know the explicit form of the
solution and therefore numerical integration is essential in order to price certain
financial derivatives using the Monte Carlo method. At this point, we would
like to point out that the Heston model 7.5 does not satisfy the standard
conditions required for numerical approximations to converge. Nevertheless, in this
paper we always assume that the coefficients of SDEs 7.4 are sufficiently smooth.
We refer to References 8–10 for an overview of the methods that can be applied
when the global Lipschitz condition does not hold. We also refer the reader to
Reference 11 for an application of MLMC to SDEs with additive fractional
noise.

7.2.5 Euler and Milstein discretizations


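The simplest approximation of SDEs 7.4 is an EM scheme. Given any step
size $\Delta t_\ell$, we define the partition $\mathcal{P}_{\Delta t_\ell} := \{n \Delta t_\ell : n = 0, 1, 2, \ldots, 2^\ell\}$ of the
time interval $[0, T]$, $2^\ell \Delta t_\ell = T > 0$. The EM approximation $X_n^\ell \approx x(n \Delta t_\ell)$
has the form [12]
$$X_{n+1}^\ell = X_n^\ell + f(X_n^\ell)\, \Delta t_\ell + g(X_n^\ell)\, \Delta w_{n+1}^\ell, \tag{7.6}$$
where $\Delta w_{n+1}^\ell = w((n+1)\Delta t_\ell) - w(n \Delta t_\ell)$ and $X_0^\ell = x_0$. Equation 7.6 is written
in vector form and its $i$th component reads as
$$X_{i,n+1}^\ell = X_{i,n}^\ell + f_i(X_n^\ell)\, \Delta t_\ell + \sum_{j=1}^{m} g_{ij}(X_n^\ell)\, \Delta w_{j,n+1}^\ell.$$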
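The componentwise form of Equation 7.6 translates almost line for line into code. The sketch below is illustrative only: the function name and the representation of vectors and matrices as plain Python lists are assumptions made to keep the example dependency-free.

```python
def euler_maruyama_step(x, f, g, dt, dw):
    """One Euler-Maruyama step of dx = f(x) dt + g(x) dw in d dimensions.

    x  : list of d floats (current state X_n)
    f  : callable returning the drift as a list of d floats
    g  : callable returning the d x m diffusion matrix as nested lists
    dt : timestep
    dw : list of m Brownian increments over the step
    """
    drift = f(x)
    diffusion = g(x)
    return [
        x[i] + drift[i] * dt + sum(diffusion[i][j] * dw[j] for j in range(len(dw)))
        for i in range(len(x))
    ]
```

For instance, with `f = lambda x: [0.1 * x[0], 0.0]` and `g = lambda x: [[0.2 * x[0], 0.0], [0.0, 0.5]]`, a step from `[1.0, 2.0]` with `dt = 0.5` and `dw = [1.0, -2.0]` adds the drift and the two diffusion contributions component by component.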

In the classical Monte Carlo setting, we are mainly interested in the weak
approximation of SDEs 7.4. Given a smooth payoff $P : \mathbb{R}^d \to \mathbb{R}$, we say that
$X_{2^\ell}^\ell$ converges to $x(T)$ in a weak sense with order $\alpha$ if
$$|\mathbb{E}[P(x(T))] - \mathbb{E}[P(X_T^\ell)]| = O(\Delta t_\ell^\alpha).$$
Rate $\alpha$ is required in condition (i) of Theorem 7.1. However, for MLMC
condition (iii) of Theorem 7.1 is crucial. We have
$$V_\ell \equiv \mathrm{Var}(P_\ell - P_{\ell-1}) \le \mathbb{E}\big[(P_\ell - P_{\ell-1})^2\big]$$
and
$$\mathbb{E}\big[(P_\ell - P_{\ell-1})^2\big] \le 2\, \mathbb{E}\big[(P_\ell - P)^2\big] + 2\, \mathbb{E}\big[(P - P_{\ell-1})^2\big].$$
For Lipschitz continuous payoffs, $(P(x) - P(y))^2 \le L \|x - y\|^2$, we then have
$$\mathbb{E}\big[(P - P_\ell)^2\big] \le L\, \mathbb{E}\big[\|x(T) - X_T^\ell\|^2\big].$$
It is clear now that in order to estimate the variance of the MLMC estimator, we need
to examine the strong convergence properties of the scheme. The classical strong
convergence on the finite time interval $[0, T]$ is defined as
$$\big(\mathbb{E}\,\|x(T) - X_T^\ell\|^p\big)^{1/p} = O(\Delta t_\ell^\xi), \quad \text{for } p \ge 2.$$
For the EM scheme, $\xi = 0.5$. In order to deal with path-dependent options,
we often require measuring the error in the supremum norm:
$$\Big(\mathbb{E}\Big[\sup_{0 \le n \le 2^\ell} \|x(n \Delta t_\ell) - X_n^\ell\|^p\Big]\Big)^{1/p} = O(\Delta t_\ell^\xi), \quad \text{for } p \ge 2.$$
Even in the case of a globally Lipschitz continuous payoff $P$, the EM scheme does not
achieve $\beta = 2\xi > 1$, which is optimal in Theorem 7.1. In order to improve
the convergence of the MLMC variance, the Milstein approximation $X_n^\ell \approx x(n \Delta t_\ell)$
is considered, with $i$th component of the form [12]
$$X_{i,n+1}^\ell = X_{i,n}^\ell + f_i(X_n^\ell)\, \Delta t_\ell + \sum_{j=1}^{m} g_{ij}(X_n^\ell)\, \Delta w_{j,n+1}^\ell + \sum_{j,k=1}^{m} h_{ijk}(X_n^\ell) \big(\Delta w_{j,n+1}^\ell\, \Delta w_{k,n+1}^\ell - \Omega_{jk} \Delta t_\ell - A_{jk,n}^\ell\big), \tag{7.7}$$
where $h_{ijk}(x) = \tfrac{1}{2} \sum_{l=1}^{d} g_{lk}(x)\, \partial g_{ij}(x) / \partial x_l$, $\Omega$ is the correlation matrix
for the driving Brownian paths, and $A_{jk,n}^\ell$ is the Lévy area defined as
$$A_{jk,n}^\ell = \int_{n \Delta t_\ell}^{(n+1)\Delta t_\ell} \Big( \big(w_j(t) - w_j(n \Delta t_\ell)\big)\, dw_k(t) - \big(w_k(t) - w_k(n \Delta t_\ell)\big)\, dw_j(t) \Big).$$
The rate of strong convergence $\xi$ for the Milstein scheme is double the value
we have for the EM scheme and therefore the MLMC variance for Lipschitz
payoffs converges twice as fast. However, this gain does not come without a
price. There is no efficient method to simulate Lévy areas, apart from
dimension 2 [13–15]. In some applications, the diffusion coefficient $g(x)$ satisfies a
commutativity property which gives
$$h_{ijk}(x) = h_{ikj}(x) \quad \text{for all } i, j, k.$$
In that case, because the Lévy areas are antisymmetric (i.e., $A_{jk,n}^\ell = -A_{kj,n}^\ell$),
it follows that $h_{ijk}(X_n^\ell) A_{jk,n}^\ell + h_{ikj}(X_n^\ell) A_{kj,n}^\ell = 0$ and therefore the terms
involving the Lévy areas cancel and so it is not necessary to simulate them.
However, this only happens in special cases. Clark and Cameron [16] proved
for a particular SDE that it is impossible to achieve a better order of strong
convergence than the EM discretization when using just the discrete
increments of the underlying Brownian motion. The analysis was extended by
Müller-Gronbach [17] to general SDEs. As a consequence, if we use the
standard MLMC method with the Milstein scheme without simulating the Lévy
areas, the complexity will remain the same as for EM. Nevertheless, Giles and
Szpruch showed in Reference 18 that by constructing a suitable antithetic
estimator one can neglect the Lévy areas and still obtain a multilevel correction
estimator with a variance which decays at the same rate as the scalar Milstein
estimator.

7.2.6 Multilevel Monte Carlo algorithm


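Here we explain how to implement the MLMC algorithm. Let us
recall that the MLMC estimator $\hat{Y}$ is given by
$$\hat{Y} = \sum_{\ell=0}^{L} Y_\ell.$$
We aim to minimize the computational cost necessary to achieve the desired
accuracy $\varepsilon$. As for standard Monte Carlo, we have
$$\mathbb{E}\big[(\hat{Y} - \mathbb{E}[P(X)])^2\big] = \underbrace{\mathbb{E}\big[(\hat{Y} - \mathbb{E}[\hat{Y}])^2\big]}_{\text{Monte Carlo variance}} + \underbrace{\big(\mathbb{E}[P_L] - \mathbb{E}[P(X)]\big)^2}_{\text{bias of the approximation}}.$$
The variance is given by
$$\mathbb{V}[\hat{Y}] = \sum_{\ell=0}^{L} \mathbb{V}[Y_\ell] = \sum_{\ell=0}^{L} \frac{1}{N_\ell} V_\ell,$$
where $V_\ell = \mathbb{V}[P_\ell - P_{\ell-1}]$. To minimize the variance of $\hat{Y}$ for fixed
computational cost $C = \sum_{\ell=0}^{L} N_\ell \Delta t_\ell^{-1}$, we can treat $N_\ell$ as a continuous variable and
use the Lagrange function to find the minimum of
$$\mathcal{L} = \sum_{\ell=0}^{L} \frac{1}{N_\ell} V_\ell + \lambda \Big( \sum_{\ell=0}^{L} N_\ell \Delta t_\ell^{-1} - C \Big).$$
First-order conditions show that $N_\ell = \lambda^{-1/2} \sqrt{V_\ell \Delta t_\ell}$, therefore
$$\mathbb{V}[\hat{Y}] = \sum_{\ell=0}^{L} \frac{V_\ell}{N_\ell} = \sum_{\ell=0}^{L} \frac{\sqrt{\lambda}}{\sqrt{V_\ell \Delta t_\ell}}\, V_\ell = \sqrt{\lambda} \sum_{\ell=0}^{L} \sqrt{V_\ell / \Delta t_\ell}.$$
Since we want $\mathbb{V}[\hat{Y}] \le \varepsilon^2 / 2$, we can show that
$$\lambda^{-1/2} \ge 2 \varepsilon^{-2} \sum_{\ell=0}^{L} \sqrt{V_\ell / \Delta t_\ell},$$
thus the optimal number of samples for level $\ell$ is
$$N_\ell = \Big\lceil 2 \varepsilon^{-2} \sqrt{V_\ell \Delta t_\ell} \sum_{\ell'=0}^{L} \sqrt{V_{\ell'} / \Delta t_{\ell'}} \Big\rceil. \tag{7.8}$$
Assuming $O(\Delta t_\ell)$ weak convergence, the bias of the overall method is equal
to $c \Delta t_L = c\, T\, 2^{-L}$. If we want the bias to be proportional to $\varepsilon / \sqrt{2}$, we set
$$L_{\max} = \frac{\log \big( \varepsilon / (c T \sqrt{2}) \big)^{-1}}{\log 2}.$$
From here we can calculate the overall complexity. We can now outline the
algorithm:

1. Begin with $L = 0$

2. Calculate the initial estimate of $V_L$ using 100 samples

3. Determine optimal $N_\ell$ using Equation 7.8

4. Generate additional samples as needed for new $N_\ell$

5. If $L < L_{\max}$, set $L := L + 1$ and go to 2

Most numerical tests suggest that $L_{\max}$ is not optimal and that we can
substantially improve MLMC by determining the optimal $L$ from an estimate of the bias. For
more details, see Reference 2.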
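Step 3 of the algorithm and the choice of the finest level can be sketched in a few lines. The helper names below are hypothetical, and rounding $L_{\max}$ up to an integer is an assumption of this sketch (the formula above does not specify the rounding); the variance estimates $V_\ell$ are taken as given inputs.

```python
import math


def optimal_samples(variances, timesteps, eps):
    """Sample counts N_l from Equation 7.8.

    variances : estimates of V_l for levels 0..L
    timesteps : the corresponding timesteps dt_l
    eps       : target root-mean-square error
    """
    total = sum(math.sqrt(v / dt) for v, dt in zip(variances, timesteps))
    return [
        math.ceil(2.0 * eps ** -2 * math.sqrt(v * dt) * total)
        for v, dt in zip(variances, timesteps)
    ]


def max_level(eps, c, T):
    """Finest level for which the bias c*T*2**(-L) is at most eps/sqrt(2).

    Rounding up to an integer level is an assumption of this sketch.
    """
    return math.ceil(math.log(c * T * math.sqrt(2.0) / eps) / math.log(2.0))
```

With these sample counts, the estimator variance $\sum_\ell V_\ell / N_\ell$ is at most $\varepsilon^2 / 2$, so together with a squared bias of at most $\varepsilon^2 / 2$ the mean square error target is met.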

7.3 Pricing with Multilevel Monte Carlo


A key application of MLMC is to compute the expected payoff of financial
options. We have demonstrated that for globally Lipschitz European payoffs,
convergence of the MLMC variance is determined by the strong rate of
convergence of the corresponding numerical scheme. However, in many financial
applications payoffs are not smooth or are path dependent. The aim of
this section is to overview results on mean square convergence rates for EM
and Milstein approximations with more complex payoffs. In the case of EM,
the majority of payoffs encountered in practice have been analyzed by Giles
et al. [19]. Extension of this analysis to the Milstein scheme is far from obvious.
This is due to the fact that the Milstein scheme gives an improved rate of
convergence on the grid points, but this is insufficient for path-dependent options.
In many applications, the behavior of the numerical approximation between
grid points is crucial. The analysis of the Milstein scheme for complex payoffs
was carried out in Reference 20. To understand this problem better, we recall a
few facts from the theory of strong convergence of numerical approximations.
We can define a piecewise linear interpolation of a numerical approximation
within the time interval $[n \Delta t_\ell, (n+1)\Delta t_\ell)$ as
$$X^\ell(t) = X_n^\ell + \lambda_\ell \big(X_{n+1}^\ell - X_n^\ell\big), \quad \text{for } t \in [n \Delta t_\ell, (n+1)\Delta t_\ell), \tag{7.9}$$
where $\lambda_\ell \equiv (t - n \Delta t_\ell) / \Delta t_\ell$. Müller-Gronbach [21] has shown that for the
Milstein scheme with the interpolation 7.9, we have
$$\mathbb{E}\Big[\sup_{0 \le t \le T} \|x(t) - X^\ell(t)\|^p\Big] = O\big(|\Delta t_\ell \log(\Delta t_\ell)|^{p/2}\big), \quad p \ge 2, \tag{7.10}$$
that is, the same as for the EM scheme. In order to maintain the strong order of
convergence, we use Brownian Bridge interpolation rather than basic piecewise
linear interpolation:
$$\tilde{X}^\ell(t) = X_n^\ell + \lambda_\ell \big(X_{n+1}^\ell - X_n^\ell\big) + g(X_n^\ell) \big(w(t) - w(n \Delta t_\ell) - \lambda_\ell\, \Delta w_{n+1}^\ell\big), \tag{7.11}$$
for $t \in [n \Delta t_\ell, (n+1)\Delta t_\ell)$. For the Milstein scheme interpolated with Brownian
bridges, we have [21]
$$\mathbb{E}\Big[\sup_{0 \le t \le T} \|x(t) - \tilde{X}^\ell(t)\|^p\Big] = O\big(|\Delta t_\ell \log(\Delta t_\ell)|^{p}\big).$$
Clearly $\tilde{X}^\ell(t)$ is not implementable, since in order to construct it, knowledge
of the whole trajectory $(w(t))_{0 \le t \le T}$ is required. However, we will
demonstrate that combining $\tilde{X}^\ell(t)$ with conditional Monte Carlo techniques can
dramatically improve the convergence of the variance of the MLMC estimator.
This is due to the fact that for suitable MLMC estimators only distributional
knowledge of certain functionals of $(w(t))_{0 \le t \le T}$ will be required.

7.3.1 Euler–Maruyama scheme


In this section, we demonstrate how to approximate the most common
payoffs using the EM scheme 7.6.
The Asian option we consider has the payoff
$$P = \Big( T^{-1} \int_0^T x(t)\, dt - K \Big)^+.$$
Using the piecewise linear interpolation 7.9, one can obtain the following
approximation:
$$P_\ell \equiv \Big( T^{-1} \int_0^T X^\ell(t)\, dt - K \Big)^+ = \Big( T^{-1} \sum_{n=0}^{2^\ell - 1} \tfrac{1}{2} \Delta t_\ell \big(X_n^\ell + X_{n+1}^\ell\big) - K \Big)^+.$$
Lookback options have payoffs of the form:
$$P = x(T) - \inf_{0 \le t \le T} x(t).$$
A numerical approximation to this payoff is
$$P_\ell \equiv X_T^\ell - \inf_{0 \le n \le 2^\ell} X_n^\ell.$$
For both of these payoffs, it can be proved that $V_\ell = O(\Delta t_\ell)$ [19].
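Given a simulated EM path stored as a list of grid values, the two discrete payoffs above can be evaluated as in the following sketch. The function names are hypothetical, and the Asian payoff applies the strike and positive part to the trapezoidal average, following the payoff definition given for the exact solution.

```python
def asian_call_payoff(path, T, K):
    """Asian call payoff from the piecewise linear interpolant of a path.

    path holds X_0, ..., X_N on a uniform grid over [0, T]; the time integral
    of the interpolant is the trapezoidal sum of the grid values.
    """
    n_steps = len(path) - 1
    dt = T / n_steps
    integral = sum(0.5 * dt * (path[n] + path[n + 1]) for n in range(n_steps))
    return max(integral / T - K, 0.0)


def lookback_payoff(path):
    """Lookback payoff X_T - min_n X_n using the discretely sampled minimum."""
    return path[-1] - min(path)
```

For the path `[1.0, 1.2, 0.9, 1.1, 1.3]` on $[0, 1]$ with strike $K = 1$, the trapezoidal average is 1.0875, so the Asian payoff is 0.0875, and the lookback payoff is $1.3 - 0.9 = 0.4$.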


We now consider a digital option, which pays one unit if the asset at the
final time exceeds the fixed strike price $K$ and pays zero otherwise. Thus the
discontinuous payoff function has the form:
$$P = \mathbf{1}_{\{x(T) > K\}},$$
with the corresponding EM value
$$P_\ell \equiv \mathbf{1}_{\{X_T^\ell > K\}}.$$
Assuming boundedness of the density of the solution to Equation 7.4 in the
neighborhood of the strike $K$, it has been proved in Reference 19 that
$V_\ell = O(\Delta t_\ell^{1/2 - \delta})$, for any $\delta > 0$. This result has been tightened by Avikainen [22]
who proved that $V_\ell = O(\Delta t_\ell^{1/2} \log \Delta t_\ell)$.
An up-and-out call gives a European payoff if the asset never exceeds the
barrier, $B$, otherwise it pays zero. So, for the exact solution we have
$$P = (x(T) - K)^+\, \mathbf{1}_{\{\sup_{0 \le t \le T} x(t) \le B\}},$$
and for the EM approximation
$$P_\ell \equiv (X_T^\ell - K)^+\, \mathbf{1}_{\{\sup_{0 \le n \le 2^\ell} X_n^\ell \le B\}}.$$
A down-and-in call knocks in when the minimum asset price dips below the
barrier $B$, so that
$$P = (x(T) - K)^+\, \mathbf{1}_{\{\inf_{0 \le t \le T} x(t) \le B\}},$$
and, accordingly,
$$P_\ell \equiv (X_T^\ell - K)^+\, \mathbf{1}_{\{\inf_{0 \le n \le 2^\ell} X_n^\ell \le B\}}.$$
For both of these barrier options, we have $V_\ell = O(\Delta t_\ell^{1/2 - \delta})$, for any $\delta > 0$,
assuming that $\inf_{0 \le t \le T} x(t)$ and $\sup_{0 \le t \le T} x(t)$ have bounded density in the
neighborhood of $B$ [19].
As summarized in Table 7.1, numerical results taken from Reference 7
suggest that all of these results are near-optimal.

TABLE 7.1: Orders of convergence for $V_\ell$ as observed numerically and proved
analytically for Euler discretizations; $\delta$ can be any strictly positive constant

                 Euler
Option       Numerical                 Analysis
Lipschitz    $O(\Delta t_\ell)$        $O(\Delta t_\ell)$
Asian        $O(\Delta t_\ell)$        $O(\Delta t_\ell)$
Lookback     $O(\Delta t_\ell)$        $O(\Delta t_\ell)$
Barrier      $O(\Delta t_\ell^{1/2})$  $O(\Delta t_\ell^{1/2 - \delta})$
Digital      $O(\Delta t_\ell^{1/2})$  $O(\Delta t_\ell^{1/2} \log \Delta t_\ell)$
7.3.2 Milstein scheme


In the scalar case of SDEs 7.4 (i.e., with $d = m = 1$), the Milstein scheme
has the form:
$$X_{n+1}^\ell = X_n^\ell + f(X_n^\ell) \Delta t_\ell + g(X_n^\ell) \Delta w_{n+1}^\ell + \tfrac{1}{2}\, g'(X_n^\ell)\, g(X_n^\ell) \big( (\Delta w_{n+1}^\ell)^2 - \Delta t_\ell \big), \tag{7.12}$$
where $g' \equiv \partial g / \partial x$. The analysis of Lipschitz European payoffs and Asian
options with the Milstein scheme is analogous to the EM scheme and it has
been proved in Reference 20 that in both these cases $V_\ell = O(\Delta t_\ell^2)$.
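The scalar Milstein step of Equation 7.12 differs from the EM step only by the correction term involving $g'g$. The sketch below is a hedged illustration: the function names and the geometric Brownian motion test case (for which the exact solution is known) are assumptions made for the example, used here to compare the strong errors of the two schemes along the same Brownian paths.

```python
import math
import random


def euler_step(x, f, g, dt, dw):
    """Scalar Euler-Maruyama step."""
    return x + f(x) * dt + g(x) * dw


def milstein_step(x, f, g, dg, dt, dw):
    """Scalar Milstein step (Equation 7.12); dg is the derivative of g."""
    return x + f(x) * dt + g(x) * dw + 0.5 * dg(x) * g(x) * (dw * dw - dt)


def rms_strong_errors(n_paths, n_steps, mu=0.05, sigma=0.5, x0=1.0, T=1.0, seed=0):
    """RMS terminal errors of both schemes against the exact GBM solution."""
    rng = random.Random(seed)
    f = lambda x: mu * x
    g = lambda x: sigma * x
    dg = lambda x: sigma
    dt = T / n_steps
    sq_err_euler = sq_err_milstein = 0.0
    for _ in range(n_paths):
        x_euler = x_milstein = x0
        w = 0.0
        for _ in range(n_steps):
            dw = rng.gauss(0.0, math.sqrt(dt))
            x_euler = euler_step(x_euler, f, g, dt, dw)
            x_milstein = milstein_step(x_milstein, f, g, dg, dt, dw)
            w += dw
        # Exact solution of dx = mu*x dt + sigma*x dw driven by the same path.
        exact = x0 * math.exp((mu - 0.5 * sigma ** 2) * T + sigma * w)
        sq_err_euler += (x_euler - exact) ** 2
        sq_err_milstein += (x_milstein - exact) ** 2
    return math.sqrt(sq_err_euler / n_paths), math.sqrt(sq_err_milstein / n_paths)
```

Running `rms_strong_errors` for successively doubled `n_steps` should show the Milstein error falling roughly in proportion to the timestep, while the EM error falls roughly in proportion to its square root.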

Lookback options

For clarity of the exposition, we will express the fine timestep
approximation in terms of the coarse timestep, that is
$\mathcal{P}_{\Delta t_\ell} := \{n \Delta t_{\ell-1} : n = 0, \tfrac{1}{2}, 1, 1 + \tfrac{1}{2}, 2, \ldots, 2^{\ell-1}\}$.
The partition for the coarse approximation is given
by $\mathcal{P}_{\Delta t_{\ell-1}} := \{n \Delta t_{\ell-1} : n = 0, 1, 2, \ldots, 2^{\ell-1}\}$. Therefore, $X_n^{\ell-1}$ corresponds to
$X_n^\ell$ for $n = 0, 1, 2, \ldots, 2^{\ell-1}$.
For pricing lookback options with the EM scheme, as an approximation of
the minimum of the process we have simply taken $\min_n X_n^\ell$. This approximation
could be improved by taking
$$X_{\min}^\ell = \min_n \Big( X_n^\ell - \beta^* g(X_n^\ell)\, \Delta t_\ell^{1/2} \Big).$$
Here $\beta^* \approx 0.5826$ is a constant which corrects the $O(\Delta t_\ell^{1/2})$ leading order error
due to the discrete sampling of the path, and thereby restores $O(\Delta t_\ell)$ weak
convergence [23]. However, using this approximation, the difference between
the computed minimum values for the fine and coarse paths is $O(\Delta t_\ell^{1/2})$,
and hence the variance $V_\ell$ is $O(\Delta t_\ell)$, corresponding to $\beta = 1$. In the previous
section, this was acceptable because $\beta = 1$ was the best that could be achieved
in general with the Euler path discretization which was used, but we now aim
to achieve an improved convergence rate using the Milstein scheme.
In order to improve the convergence, the Brownian Bridge interpolant
$\tilde{X}^\ell(t)$ defined in Equation 7.11 is used. We have
$$\min_{0 \le t < T} \tilde{X}^\ell(t) = \min_{0 \le n \le 2^{\ell-1} - \frac{1}{2}} \Big[ \min_{n \Delta t_{\ell-1} \le t < (n + \frac{1}{2}) \Delta t_{\ell-1}} \tilde{X}^\ell(t) \Big] = \min_{0 \le n \le 2^{\ell-1} - \frac{1}{2}} X_{n,\min}^\ell,$$
where the minimum of the fine approximation over the first half of the coarse
timestep is given by [24]
$$X_{n,\min}^\ell = \frac{1}{2} \Big( X_n^\ell + X_{n+\frac{1}{2}}^\ell - \sqrt{ \big(X_{n+\frac{1}{2}}^\ell - X_n^\ell\big)^2 - 2\, g(X_n^\ell)^2\, \Delta t_\ell \log U_n^\ell } \Big), \tag{7.13}$$
and the minimum of the fine approximation over the second half of the coarse
timestep is given by
$$X_{n+\frac{1}{2},\min}^\ell = \frac{1}{2} \Big( X_{n+\frac{1}{2}}^\ell + X_{n+1}^\ell - \sqrt{ \big(X_{n+1}^\ell - X_{n+\frac{1}{2}}^\ell\big)^2 - 2\, g(X_{n+\frac{1}{2}}^\ell)^2\, \Delta t_\ell \log U_{n+\frac{1}{2}}^\ell } \Big), \tag{7.14}$$
where $U_n^\ell$, $U_{n+\frac{1}{2}}^\ell$ are uniform random variables on the unit interval. For the
coarse path, in order to improve the MLMC variance a slightly different
estimator is used (see Equation 7.3). Using the same Brownian increments as we
used on the fine path (to guarantee that we stay on the same path),
Equation 7.11 is used to define $\tilde{X}_{n+\frac{1}{2}}^{\ell-1} \equiv \tilde{X}^{\ell-1}((n + \tfrac{1}{2}) \Delta t_{\ell-1})$. Given this
interpolated value, the minimum value over the interval $[n \Delta t_{\ell-1}, (n+1) \Delta t_{\ell-1}]$
can then be taken to be the smaller of the minima for the two intervals
$[n \Delta t_{\ell-1}, (n + \tfrac{1}{2}) \Delta t_{\ell-1})$ and $[(n + \tfrac{1}{2}) \Delta t_{\ell-1}, (n+1) \Delta t_{\ell-1})$,
$$\begin{aligned} X_{n,\min}^{\ell-1} &= \frac{1}{2} \Big( X_n^{\ell-1} + \tilde{X}_{n+\frac{1}{2}}^{\ell-1} - \sqrt{ \big(\tilde{X}_{n+\frac{1}{2}}^{\ell-1} - X_n^{\ell-1}\big)^2 - 2\, \big(g(X_n^{\ell-1})\big)^2\, \frac{\Delta t_{\ell-1}}{2} \log U_n^\ell } \Big), \\ X_{n+\frac{1}{2},\min}^{\ell-1} &= \frac{1}{2} \Big( \tilde{X}_{n+\frac{1}{2}}^{\ell-1} + X_{n+1}^{\ell-1} - \sqrt{ \big(X_{n+1}^{\ell-1} - \tilde{X}_{n+\frac{1}{2}}^{\ell-1}\big)^2 - 2\, \big(g(X_n^{\ell-1})\big)^2\, \frac{\Delta t_{\ell-1}}{2} \log U_{n+\frac{1}{2}}^\ell } \Big). \end{aligned} \tag{7.15}$$
Note that $g(X_n^{\ell-1})$ is used for both timesteps. This is because we used the
Brownian Bridge with diffusion term $g(X_n^{\ell-1})$ to derive both minima. If we
changed $g(X_n^{\ell-1})$ to $g(\tilde{X}_{n+\frac{1}{2}}^{\ell-1})$ in $X_{n+\frac{1}{2},\min}^{\ell-1}$, this would mean that different
Brownian Bridges were used on the first and second half of the coarse timestep
and as a consequence condition 7.3 would be violated. Note also the reuse of
the same uniform random numbers $U_n^\ell$ and $U_{n+\frac{1}{2}}^\ell$ used to compute the fine
path minimum. The $\min(X_{n,\min}^{\ell-1}, X_{n+\frac{1}{2},\min}^{\ell-1})$ has exactly the same distribution
as $X_{n,\min}^{\ell-1}$, since they are both based on the same Brownian interpolation,
and therefore equality 7.3 is satisfied. Giles et al. [20] proved the following
theorem.

Theorem 7.2. The multilevel approximation for a lookback option which is
a uniform Lipschitz function of $x(T)$ and $\inf_{[0,T]} x(t)$ has $V_\ell = O(\Delta t_\ell^{2-\delta})$ for
any $\delta > 0$.
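The bridge minima in Equations 7.13 through 7.15 can be sketched as follows. The function names are hypothetical; the midpoint of the coarse path is obtained from Equation 7.11 evaluated halfway through the coarse step, which reduces to the average of the endpoints plus $\tfrac{1}{2} g(X_n^{\ell-1})$ times the difference of the two fine Brownian increments. The uniform variates must lie in $(0, 1]$ so that the logarithm is defined.

```python
import math


def bridge_minimum(x_left, x_right, g_left, dt, u):
    """Sampled minimum of a Brownian bridge between x_left and x_right.

    g_left is the (frozen) diffusion coefficient, dt the length of the
    interval, and u a uniform variate in (0, 1] (Equations 7.13 and 7.14).
    """
    spread = (x_right - x_left) ** 2 - 2.0 * g_left ** 2 * dt * math.log(u)
    return 0.5 * (x_left + x_right - math.sqrt(spread))


def coarse_step_minimum(xc_left, xc_right, g_left, dt_coarse,
                        dw_first, dw_second, u_first, u_second):
    """Minimum over one coarse step using the coupled estimator (Equation 7.15).

    dw_first and dw_second are the fine Brownian increments over the two
    halves of the coarse step; u_first and u_second are the same uniforms
    that were used for the fine path minima.
    """
    # Brownian Bridge value at the midpoint of the coarse step (Equation 7.11).
    x_mid = 0.5 * (xc_left + xc_right) + 0.5 * g_left * (dw_first - dw_second)
    half = 0.5 * dt_coarse
    # The same diffusion coefficient g_left is used on both halves.
    m_first = bridge_minimum(xc_left, x_mid, g_left, half, u_first)
    m_second = bridge_minimum(x_mid, xc_right, g_left, half, u_second)
    return min(m_first, m_second)
```

By construction, each sampled minimum is no larger than either endpoint of its interval, and with `u = 1.0` it coincides with the smaller endpoint.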

7.3.3 Conditional Monte Carlo


Giles [7] and Giles et al. [20] have shown that combining conditional Monte
Carlo with MLMC results in superior estimators for various financial payoffs.
To obtain an improvement in the convergence of the MLMC variance for barrier
and digital options, conditional Monte Carlo methods are employed. We
briefly describe the approach here. Our goal is to calculate $\mathbb{E}[P]$. Instead, we can write
$$\mathbb{E}[P] = \mathbb{E}\big[\mathbb{E}[P \,|\, Z]\big],$$
where $Z$ is a random vector. Hence $\mathbb{E}[P \,|\, Z]$ is an unbiased estimator of $\mathbb{E}[P]$.
We also have
$$\mathrm{Var}[P] = \mathbb{E}\big[\mathrm{Var}[P \,|\, Z]\big] + \mathrm{Var}\big[\mathbb{E}[P \,|\, Z]\big],$$
hence $\mathrm{Var}\big[\mathbb{E}[P \,|\, Z]\big] \le \mathrm{Var}(P)$. In the context of MLMC, we obtain a better
variance convergence if we condition on different vectors on the fine
and the coarse level. That is, on the fine level we take $\mathbb{E}[P^f \,|\, Z^f]$, where
$Z^f = \{X_n^\ell\}_{0 \le n \le 2^\ell}$. On the coarse level, instead of taking $\mathbb{E}[P^c \,|\, Z^c]$ with
$Z^c = \{X_n^{\ell-1}\}_{0 \le n \le 2^{\ell-1}}$, we take $\mathbb{E}[P^c \,|\, Z^c, \tilde{Z}^c]$, where
$\tilde{Z}^c = \{\tilde{X}_{n+\frac{1}{2}}^{\ell-1}\}_{0 \le n \le 2^{\ell-1}}$ are
obtained from Equation 7.11. Condition 7.3 trivially holds by the tower property
of conditional expectation
$$\mathbb{E}\big[\mathbb{E}[P^c \,|\, Z^c]\big] = \mathbb{E}[P^c] = \mathbb{E}\big[\mathbb{E}[P^c \,|\, Z^c, \tilde{Z}^c]\big].$$

7.3.4 Barrier options


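The barrier option which is considered is a down-and-out option for which
the payoff is a Lipschitz function of the value of the underlying at maturity,
provided the underlying has never dropped below a value $B \in \mathbb{R}$:
$$P = f(x(T))\, \mathbf{1}_{\{\tau > T\}}.$$
The crossing time $\tau$ is defined as
$$\tau = \inf_t \{x(t) < B\}.$$
This requires the simulation of $(x(T), \mathbf{1}_{\{\tau > T\}})$. The simplest method sets
$$\tau^{\Delta t_\ell} = \inf_n \{X_n^\ell < B\}$$
and as an approximation takes $(X_{2^{\ell-1}}^\ell, \mathbf{1}_{\{\tau^{\Delta t_\ell} > 2^{\ell-1}\}})$.
But even if we could
simulate the process $\{x(n \Delta t_\ell)\}_{0 \le n \le 2^{\ell-1}}$, it is possible for $\{x(t)\}_{0 \le t \le T}$ to cross
the barrier between grid points. Using the Brownian Bridge interpolation, we
can approximate $\mathbf{1}_{\{\tau > T\}}$ by
$$\prod_{n=0}^{2^{\ell-1} - \frac{1}{2}} \mathbf{1}_{\{X_{n,\min}^\ell \ge B\}}.$$
This suggests following the lookback approximation in computing the minimum
of both the fine and coarse paths. However, the variance would be larger
in this case because the payoff is a discontinuous function of the minimum.
A better treatment, which is the one used in Reference 25, is to use the
conditional Monte Carlo approach to further smooth the payoff. Since the process
$X_n^\ell$ is Markovian, we have
$$\begin{aligned} \mathbb{E}\Big[ f(X_{2^{\ell-1}}^\ell) \prod_{n=0}^{2^{\ell-1} - \frac{1}{2}} \mathbf{1}_{\{X_{n,\min}^\ell \ge B\}} \Big] &= \mathbb{E}\Big[ \mathbb{E}\Big[ f(X_{2^{\ell-1}}^\ell) \prod_{n=0}^{2^{\ell-1} - \frac{1}{2}} \mathbf{1}_{\{X_{n,\min}^\ell \ge B\}} \,\Big|\, X_0^\ell, \ldots, X_{2^{\ell-1}}^\ell \Big] \Big] \\ &= \mathbb{E}\Big[ f(X_{2^{\ell-1}}^\ell) \prod_{n=0}^{2^{\ell-1} - \frac{1}{2}} \mathbb{E}\big[ \mathbf{1}_{\{X_{n,\min}^\ell \ge B\}} \,\big|\, X_n^\ell, X_{n+\frac{1}{2}}^\ell \big] \Big] \\ &= \mathbb{E}\Big[ f(X_{2^{\ell-1}}^\ell) \prod_{n=0}^{2^{\ell-1} - \frac{1}{2}} (1 - p_n^\ell) \Big], \end{aligned}$$
where from [24]
$$p_n^\ell = \mathbb{P}\Big( \inf_{n \Delta t_{\ell-1} \le t < (n + \frac{1}{2}) \Delta t_{\ell-1}} \tilde{X}^\ell(t) < B \,\Big|\, X_n^\ell, X_{n+\frac{1}{2}}^\ell \Big) = \exp\Big( \frac{-2\, (X_n^\ell - B)^+ (X_{n+\frac{1}{2}}^\ell - B)^+}{g(X_n^\ell)^2\, \Delta t_\ell} \Big),$$
and
$$p_{n+\frac{1}{2}}^\ell = \mathbb{P}\Big( \inf_{(n + \frac{1}{2}) \Delta t_{\ell-1} \le t < (n+1) \Delta t_{\ell-1}} \tilde{X}^\ell(t) < B \,\Big|\, X_{n+\frac{1}{2}}^\ell, X_{n+1}^\ell \Big) = \exp\Big( \frac{-2\, (X_{n+\frac{1}{2}}^\ell - B)^+ (X_{n+1}^\ell - B)^+}{g(X_{n+\frac{1}{2}}^\ell)^2\, \Delta t_\ell} \Big).$$
Hence, for the fine path this gives
$$P_\ell^f = f(X_{2^{\ell-1}}^\ell) \prod_{n=0}^{2^{\ell-1} - \frac{1}{2}} (1 - p_n^\ell). \tag{7.16}$$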
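The fine-path estimator of Equation 7.16 replaces each barrier indicator by a survival probability. The sketch below is illustrative: the function names are hypothetical, the path is assumed to be stored as a list of fine grid values, and the diffusion coefficient is assumed to be nonzero wherever it is evaluated.

```python
import math


def crossing_probability(x_left, x_right, g_left, dt, barrier):
    """Probability that the Brownian bridge between two grid values dips
    below the barrier within a step of length dt."""
    above_left = max(x_left - barrier, 0.0)
    above_right = max(x_right - barrier, 0.0)
    return math.exp(-2.0 * above_left * above_right / (g_left ** 2 * dt))


def down_and_out_fine_payoff(path, g, dt, barrier, payoff):
    """Conditional Monte Carlo payoff for a down-and-out option (Equation 7.16).

    path   : fine path values X_0, ..., X_N at spacing dt
    g      : callable diffusion coefficient (assumed nonzero along the path)
    payoff : callable applied to the terminal value
    """
    survival = 1.0
    for n in range(len(path) - 1):
        p = crossing_probability(path[n], path[n + 1], g(path[n]), dt, barrier)
        survival *= 1.0 - p
    return payoff(path[-1]) * survival
```

If either endpoint of a step lies at or below the barrier, the crossing probability is exactly one and the payoff vanishes; for a path far above the barrier the survival factor is close to one and the plain terminal payoff is recovered.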

The payoff for the coarse path is defined similarly. However, in order to reduce
the variance, we subsample $\tilde{X}_{n+\frac{1}{2}}^{\ell-1}$, as we did for lookback options, from the
Brownian Bridge connecting $X_n^{\ell-1}$ and $X_{n+1}^{\ell-1}$:
$$\begin{aligned} \mathbb{E}\Big[ f(X_{2^{\ell-1}}^{\ell-1}) \prod_{n=0}^{2^{\ell-1} - 1} \mathbf{1}_{\{X_{n,\min}^{\ell-1} \ge B\}} \Big] &= \mathbb{E}\Big[ \mathbb{E}\Big[ f(X_{2^{\ell-1}}^{\ell-1}) \prod_{n=0}^{2^{\ell-1} - 1} \mathbf{1}_{\{X_{n,\min}^{\ell-1} \ge B\}} \,\Big|\, X_0^{\ell-1}, \tilde{X}_{\frac{1}{2}}^{\ell-1}, \ldots, \tilde{X}_{2^{\ell-1} - \frac{1}{2}}^{\ell-1}, X_{2^{\ell-1}}^{\ell-1} \Big] \Big] \\ &= \mathbb{E}\Big[ f(X_{2^{\ell-1}}^{\ell-1}) \prod_{n=0}^{2^{\ell-1} - 1} \mathbb{E}\big[ \mathbf{1}_{\{X_{n,\min}^{\ell-1} \ge B\}} \,\big|\, X_n^{\ell-1}, \tilde{X}_{n+\frac{1}{2}}^{\ell-1}, X_{n+1}^{\ell-1} \big] \Big] \\ &= \mathbb{E}\Big[ f(X_{2^{\ell-1}}^{\ell-1}) \prod_{n=0}^{2^{\ell-1} - 1} (1 - p_{1,n}^{\ell-1})(1 - p_{2,n}^{\ell-1}) \Big], \end{aligned}$$
where
$$p_{1,n}^{\ell-1} = \exp\Big( \frac{-2\, (X_n^{\ell-1} - B)^+ (\tilde{X}_{n+\frac{1}{2}}^{\ell-1} - B)^+}{g(X_n^{\ell-1})^2\, \Delta t_\ell} \Big)$$
and
$$p_{2,n}^{\ell-1} = \exp\Big( \frac{-2\, (\tilde{X}_{n+\frac{1}{2}}^{\ell-1} - B)^+ (X_{n+1}^{\ell-1} - B)^+}{g(X_n^{\ell-1})^2\, \Delta t_\ell} \Big).$$
Note that the same $g(X_n^{\ell-1})$ is used (rather than using $g(\tilde{X}_{n+\frac{1}{2}}^{\ell-1})$ in $p_{2,n}^{\ell-1}$) to
calculate both probabilities for the same reason as we did for lookback options.
The final estimator can be written as
$$P_{\ell-1}^c = f(X_{2^{\ell-1}}^{\ell-1}) \prod_{n=0}^{2^{\ell-1} - 1} (1 - p_{1,n}^{\ell-1})(1 - p_{2,n}^{\ell-1}). \tag{7.17}$$
Giles et al. [20] proved the following theorem.

Theorem 7.3. Provided $\inf_{[0,T]} |g(B)| > 0$, and $\inf_{[0,T]} x(t)$ has a bounded
density in the neighborhood of $B$, then the multilevel estimator for a
down-and-out barrier option has variance $V_\ell = O(\Delta t_\ell^{3/2 - \delta})$ for any $\delta > 0$.

The reason the variance is approximately $O(\Delta t_\ell^{3/2})$ instead of $O(\Delta t_\ell^2)$
is the following: due to the strong convergence property, the probability of
the numerical approximation being outside the $\Delta t_\ell^{1-\varepsilon}$-neighborhood of the
solution to the SDE 7.4 is arbitrarily small, that is, for any $\varepsilon > 0$
$$\mathbb{P}\Big( \sup_{0 \le n \Delta t_\ell \le T} \|x(n \Delta t_\ell) - X_n^\ell\| \ge \Delta t_\ell^{1-\varepsilon} \Big) \le \Delta t_\ell^{-p + p\varepsilon}\, \mathbb{E}\Big[ \sup_{0 \le n \Delta t_\ell \le T} \|x(n \Delta t_\ell) - X_n^\ell\|^p \Big] = O(\Delta t_\ell^{p\varepsilon}). \tag{7.18}$$
If $\inf_{[0,T]} x(t)$ is outside the $\Delta t_\ell^{1/2}$-neighborhood of the barrier $B$ then by
Equation 7.18 it is shown that so are the numerical approximations. The
probabilities of crossing the barrier in that case are asymptotically either 0 or 1 and
essentially we are in the Lipschitz payoff case. If $\inf_{[0,T]} x(t)$ is within the
$\Delta t_\ell^{1/2}$-neighborhood of the barrier $B$, then so are the numerical approximations.
In that case it can be shown that $\mathbb{E}[(P_\ell^f - P_{\ell-1}^c)^2] = O(\Delta t_\ell^{1-\delta})$, but due
to the bounded density assumption, the probability that $\inf_{[0,T]} x(t)$ is within the
$\Delta t_\ell^{1/2}$-neighborhood of the barrier $B$ is of order $\Delta t_\ell^{1/2 - \delta}$. Therefore the overall
MLMC variance is $V_\ell = O(\Delta t_\ell^{3/2 - \delta})$ for any $\delta > 0$.

7.3.5 Digital options


A digital option has a payoff which is a discontinuous function of the value
of the underlying asset at maturity, the simplest example being

P = 1{x(T )>B} .

Approximating 1{x(T )>B} based only on simulations of x(T ) by Milstein


scheme will lead to an O(Δt ) fraction of the paths having coarse and fine path
approximations to x(T ) on either side of the strike, producing P −P −1 = ±1,
3/2−δ
resulting in V = O(Δt ). To improve the variance to O(Δt ) for all
δ > 0, the conditional Monte Carlo method is used to smooth the payoff (see
Section 7.2.3 in Reference 24). This approach was proved to be successful in
Giles et al. [20] and was tested numerically in Reference 25.
If $X^\ell_{2^{\ell-1}-\frac12}$ denotes the value of the fine path approximation one timestep before maturity, then the motion thereafter is approximated as Brownian motion with constant drift $f(X^\ell_{2^{\ell-1}-\frac12})$ and volatility $g(X^\ell_{2^{\ell-1}-\frac12})$. The conditional expectation for the payoff is the probability that $X^\ell_{2^{\ell-1}} > B$ after one further timestep, which is

$$
P_\ell^f = E\Big[ 1_{\{X^\ell_{2^{\ell-1}} > B\}} \,\Big|\, X^\ell_{2^{\ell-1}-\frac12} \Big]
= \Phi\left( \frac{X^\ell_{2^{\ell-1}-\frac12} + f(X^\ell_{2^{\ell-1}-\frac12})\,\Delta t_\ell - B}{\big|g(X^\ell_{2^{\ell-1}-\frac12})\big| \sqrt{\Delta t_\ell}} \right), \tag{7.19}
$$

where $\Phi$ is the cumulative Normal distribution.

For the coarse path, we note that given the Brownian increment $\Delta w^\ell_{2^{\ell-1}-\frac12}$ for the first half of the last coarse timestep (which comes from the fine path simulation), the probability that $X^{\ell-1}_{2^{\ell-1}} > B$ is

$$
P_{\ell-1}^c = E\Big[ 1_{\{X^{\ell-1}_{2^{\ell-1}} > B\}} \,\Big|\, X^{\ell-1}_{2^{\ell-1}-1},\, \Delta w^\ell_{2^{\ell-1}-\frac12} \Big]
= \Phi\left( \frac{X^{\ell-1}_{2^{\ell-1}-1} + f(X^{\ell-1}_{2^{\ell-1}-1})\,\Delta t_{\ell-1} + g(X^{\ell-1}_{2^{\ell-1}-1})\,\Delta w^\ell_{2^{\ell-1}-\frac12} - B}{\big|g(X^{\ell-1}_{2^{\ell-1}-1})\big| \sqrt{\Delta t_\ell}} \right). \tag{7.20}
$$

The conditional expectation of Equation 7.20 is equal to the conditional expectation of $P^f_{\ell-1}$ defined by Equation 7.19 on level $\ell-1$, and so equality 7.3 is satisfied. A bound on the variance of the multilevel estimator is given by the following result.

Theorem 7.4. Provided $g(B) \ne 0$, and $x(t)$ has a bounded density in the neighborhood of $B$, then the multilevel estimator for a digital option has variance $V_\ell = O(\Delta t_\ell^{3/2-\delta})$ for any $\delta > 0$.

Results of the above section were tested numerically in Reference 7 and are summarized in Table 7.2.

TABLE 7.2: Orders of convergence for V_ℓ as observed numerically and proved analytically for Milstein discretizations; δ can be any strictly positive constant

                        Milstein
Option       Numerical         Analysis
Lipschitz    O(Δt^2)           O(Δt^2)
Asian        O(Δt^2)           O(Δt^2)
Lookback     O(Δt^2)           O(Δt^{2-δ})
Barrier      O(Δt^{3/2})       O(Δt^{3/2-δ})
Digital      O(Δt^{3/2})       O(Δt^{3/2-δ})
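
The two conditional expectations in Equations 7.19 and 7.20 are cheap to evaluate once the penultimate path values are known. The following Python sketch is illustrative only: the function names, the callables `f` and `g` (drift and volatility), and all argument names are assumptions introduced here, not part of the original text.

```python
import math


def norm_cdf(x):
    """Cumulative distribution function of the standard Normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def fine_digital_payoff(x_pen, f, g, dt_fine, barrier):
    """Equation 7.19: P(X_T > B) given the fine value one fine step before T."""
    num = x_pen + f(x_pen) * dt_fine - barrier
    return norm_cdf(num / (abs(g(x_pen)) * math.sqrt(dt_fine)))


def coarse_digital_payoff(x_pen, f, g, dt_fine, dw_first_half, barrier):
    """Equation 7.20: P(X_T > B) given the coarse value one coarse step before T
    and the fine Brownian increment over the first half of that coarse step."""
    dt_coarse = 2.0 * dt_fine
    num = x_pen + f(x_pen) * dt_coarse + g(x_pen) * dw_first_half - barrier
    # Only the second half-step (variance dt_fine) remains random.
    return norm_cdf(num / (abs(g(x_pen)) * math.sqrt(dt_fine)))
```

Averaging `coarse_digital_payoff` over the first-half increment reproduces Equation 7.19 evaluated with the coarse timestep, which is the consistency property needed for equality 7.3.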

7.4 Greeks with Multilevel Monte Carlo


Accurate calculation of prices is only one objective of Monte Carlo simula-
tions. Even more important in some ways is the calculation of the sensitivities
of the prices to various input parameters. These sensitivities, known collec-
tively as the “Greeks,” are important for risk analysis and mitigation through
hedging.
Here we follow the results by Burgos et al. [26] to present how MLMC can
be applied in this setting. The pathwise sensitivity approach (also known as
Infinitesimal Perturbation Analysis) is one of the standard techniques for computing these sensitivities [24]. However, the pathwise approach is not applicable when the financial payoff function is discontinuous. One solution to this problem is to use the Likelihood Ratio Method (LRM), but its weakness is that the variance of the resulting estimator is usually $O(\Delta t_\ell^{-1})$.
Three techniques are presented that improve MLMC variance: payoff
smoothing using conditional expectations [24]; an approximation of the above
technique using path splitting for the final timestep [27]; the use of a
hybrid combination of pathwise sensitivity and the LRM [28]. We discuss the
strengths and weaknesses of these alternatives in different MLMC settings.

7.4.1 Monte Carlo greeks


Consider the approximate solution of the general SDE 7.4 using Euler
discretization 7.6. The Brownian increments can be defined to be a linear
transformation of a vector of independent unit Normal random variables Z.
The goal is to efficiently estimate the expected value of some financial pay-
off function P (x(T )) and numerous first-order sensitivities of this value with
respect to different input parameters such as the volatility or one component
of the initial data x(0). In more general cases, P might also depend on the
values of process {x(t)}0≤t≤T at intermediate times.
The pathwise sensitivity approach can be viewed as starting with the expectation expressed as an integral with respect to $Z$:

$$V \equiv E\big[ P(X_n^\ell(Z,\theta)) \big] = \int P(X_n^\ell(Z,\theta))\, p_Z(Z)\, dZ. \tag{7.21}$$

Here $\theta$ represents a generic input parameter, and the probability density function for $Z$ is

$$p_Z(Z) = (2\pi)^{-d/2} \exp\big( -\|Z\|_2^2 / 2 \big),$$

where $d$ is the dimension of the vector $Z$.
Let $X_n^\ell = X_n^\ell(Z,\theta)$. If the drift, volatility and payoff functions are all differentiable, Equation 7.21 may be differentiated to give

$$\frac{\partial V}{\partial \theta} = \int \frac{\partial P(X_n^\ell)}{\partial X_n^\ell}\, \frac{\partial X_n^\ell}{\partial \theta}\, p_Z(Z)\, dZ, \tag{7.22}$$

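with $\partial X_n^\ell / \partial\theta$ obtained by differentiating Equation 7.6:

$$
\frac{\partial X_{n+1}^\ell}{\partial \theta} = \frac{\partial X_n^\ell}{\partial \theta}
+ \left( \frac{\partial f(X_n^\ell,\theta)}{\partial X_n^\ell} \frac{\partial X_n^\ell}{\partial \theta} + \frac{\partial f(X_n^\ell,\theta)}{\partial \theta} \right) \Delta t_\ell
+ \left( \frac{\partial g(X_n^\ell,\theta)}{\partial X_n^\ell} \frac{\partial X_n^\ell}{\partial \theta} + \frac{\partial g(X_n^\ell,\theta)}{\partial \theta} \right) \Delta w_{n+1}^\ell. \tag{7.23}
$$

We assume that the mapping $Z \to \Delta w_{n+1}^\ell$ does not depend on $\theta$. It can be proved that Equation 7.22 remains valid (i.e., we can interchange integration and differentiation) when the payoff function is continuous and piecewise differentiable, and the numerical estimate obtained by standard Monte Carlo with $M$ independent path simulations

$$M^{-1} \sum_{m=1}^{M} \frac{\partial P(X_n^{\ell,m})}{\partial X_n^\ell}\, \frac{\partial X_n^{\ell,m}}{\partial \theta}$$

is an unbiased estimate for $\partial V / \partial\theta$ with a variance which is $O(M^{-1})$, if $P(x)$ is Lipschitz and the drift and volatility functions satisfy the standard conditions [12].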
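Performing a change of variables, the expectation can also be expressed as

$$V_\ell \equiv E\big[ P(X_n^\ell) \big] = \int P(x)\, p_{X_n^\ell}(x,\theta)\, dx, \tag{7.24}$$

where $p_{X_n^\ell}(x,\theta)$ is the probability density function for $X_n^\ell$, which will depend on all of the input parameters. Since probability density functions are usually smooth, Equation 7.24 can be differentiated to give

$$
\frac{\partial V_\ell}{\partial \theta} = \int P(x)\, \frac{\partial p_{X_n^\ell}}{\partial \theta}\, dx
= \int P(x)\, \frac{\partial (\log p_{X_n^\ell})}{\partial \theta}\, p_{X_n^\ell}\, dx
= E\left[ P(x)\, \frac{\partial (\log p_{X_n^\ell})}{\partial \theta} \right],
$$

which can be estimated using the unbiased Monte Carlo estimator

$$M^{-1} \sum_{m=1}^{M} P(X_n^{\ell,m})\, \frac{\partial \log p_{X_n^\ell}(X_n^{\ell,m})}{\partial \theta}.$$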

This is the LRM. Its great advantage is that it does not require the differen-
tiation of P (Xn ). This makes it applicable to cases in which the payoff is dis-
continuous, and it also simplifies the practical implementation because banks
often have complicated flexible procedures through which traders specify pay-
offs. However, it does have a number of limitations, one being a requirement
of absolute continuity which is not satisfied in a few important applications
such as the LIBOR market model [24].
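
To make the contrast between the two estimators concrete, the following Python sketch estimates the delta of a European call under geometric Brownian motion with both the pathwise and the likelihood ratio method. It is a minimal illustration under stated assumptions: the terminal value is sampled exactly from the lognormal distribution rather than by a time discretization, and the function name and parameters are hypothetical.

```python
import math
import random


def call_delta_pathwise_and_lrm(s0, strike, r, sigma, T, n_samples, seed=0):
    """Return (pathwise, LRM) Monte Carlo estimates of a European call delta.

    Assumption: S_T is sampled exactly, S_T = s0*exp((r - sigma^2/2)T + sigma*sqrt(T)*Z).
    """
    rng = random.Random(seed)
    disc = math.exp(-r * T)
    vol = sigma * math.sqrt(T)
    pathwise_sum = 0.0
    lrm_sum = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)
        s_T = s0 * math.exp((r - 0.5 * sigma ** 2) * T + vol * z)
        # Pathwise: dP/dS_T = 1{S_T > K} and dS_T/dS_0 = S_T / S_0.
        if s_T > strike:
            pathwise_sum += s_T / s0
        # LRM: payoff times the score d(log p)/dS_0 = Z / (S_0 * sigma * sqrt(T)).
        lrm_sum += max(s_T - strike, 0.0) * z / (s0 * vol)
    return disc * pathwise_sum / n_samples, disc * lrm_sum / n_samples
```

The LRM estimator never differentiates the payoff, which is why it also works for discontinuous payoffs, but for the same sample size it is typically noisier than the pathwise estimate.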

7.4.2 Multilevel Monte Carlo greeks


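The MLMC method for calculating Greeks can be written as

$$
\frac{\partial V}{\partial \theta} = \frac{\partial E(P)}{\partial \theta} \approx \frac{\partial E(P_L)}{\partial \theta}
= \frac{\partial E(P_0)}{\partial \theta} + \sum_{\ell=1}^{L} \frac{\partial E(P_\ell^f - P_{\ell-1}^c)}{\partial \theta}. \tag{7.25}
$$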
Therefore extending Monte Carlo Greeks to MLMC Greeks is straightforward.
However, the challenge is to keep the MLMC variance small. This can be
achieved by appropriate smoothing of the payoff function. The techniques
that were presented in Section 7.3.2 are also very useful here.

7.4.3 European call


As an example, we consider a European call $P = (x(T) - B)^+$ with $x(t)$ being a geometric Brownian motion with Milstein scheme approximation given by

$$
X_{n+1}^\ell = X_n^\ell + r X_n^\ell\, \Delta t_\ell + \sigma X_n^\ell\, \Delta w_{n+1}^\ell + \frac{\sigma^2}{2} X_n^\ell \big( (\Delta w_{n+1}^\ell)^2 - \Delta t_\ell \big). \tag{7.26}
$$

We illustrate the techniques by computing delta ($\delta$) and vega ($\nu$), the sensitivities to the asset's initial value $x(0)$ and to its volatility $\sigma$.
Since the payoff is Lipschitz, we can use pathwise sensitivities. We observe that

$$
\frac{\partial}{\partial x} (x - B)^+ = \begin{cases} 0, & \text{for } x < B, \\ 1, & \text{for } x > B. \end{cases}
$$

This derivative fails to exist when $x = B$, but since this event has probability 0, we may write

$$\frac{\partial}{\partial x} (x - B)^+ = 1_{\{x > B\}}.$$

Therefore we are essentially dealing with a digital option.
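
The pathwise recursion of Equation 7.23 applied to the Milstein scheme 7.26 can be sketched in a few lines of Python. This is a plain (single-level) Monte Carlo illustration, not the MLMC estimator; the function name, argument names, and the choice to differentiate with respect to $x(0)$ and $\sigma$ in the same loop are assumptions made here.

```python
import math
import random


def milstein_call_greeks(x0, strike, r, sigma, T, n_steps, n_paths, seed=0):
    """Pathwise delta and vega of a European call under GBM with Milstein steps."""
    rng = random.Random(seed)
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    disc = math.exp(-r * T)
    delta_sum = 0.0
    vega_sum = 0.0
    for _ in range(n_paths):
        x, dx_dx0, dx_dsigma = x0, 1.0, 0.0
        for _ in range(n_steps):
            dw = sqrt_dt * rng.gauss(0.0, 1.0)
            q = dw * dw - dt
            # Milstein step for GBM: X_{n+1} = X_n * growth.
            growth = 1.0 + r * dt + sigma * dw + 0.5 * sigma ** 2 * q
            # Differentiate the step with respect to sigma and x0 (uses the old x).
            dx_dsigma = dx_dsigma * growth + x * (dw + sigma * q)
            dx_dx0 = dx_dx0 * growth
            x = x * growth
        # dP/dX = 1{X > B}, the digital payoff noted in the text.
        if x > strike:
            delta_sum += dx_dx0
            vega_sum += dx_dsigma
    return disc * delta_sum / n_paths, disc * vega_sum / n_paths
```

The indicator in the final step is exactly the discontinuity that motivates the smoothing techniques in the following subsections.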

7.4.4 Conditional Monte Carlo for pathwise sensitivity


Using conditional expectation, the payoff can be smoothed as we did in Section 7.3.2. European calls can be treated in exactly the same way as the digital option in Section 7.3.5; that is, instead of simulating the whole path, we stop at the penultimate step and then on the last step we consider the full distribution of $(X^\ell_{2^\ell} \mid w^\ell_0, \ldots, w^\ell_{2^\ell-1})$.
For digital options, this approach leads to Equations 7.19 and 7.20. For
the call options, we can do analogous calculations. In Reference 26, numerical
results for this approach were obtained, with a scalar Milstein scheme used
to obtain the penultimate step. The results are presented in Table 7.3. For
lookback options, conditional expectations lead to Equations 7.13 and 7.15 and
for barriers to Equations 7.16 and 7.17. Burgos et al. [26] applied pathwise
sensitivity to these smoothed payoffs, with a scalar Milstein scheme used to
obtain the penultimate step, and obtained numerical results that we present
in Table 7.4.

TABLE 7.3: Orders of convergence for V_ℓ as observed numerically and corresponding MLMC complexity

             Call                        Digital
Estimator    β       MLMC complexity     β       MLMC complexity
Value        ≈2.0    O(ε^{-2})           ≈1.4    O(ε^{-2})
Delta        ≈1.5    O(ε^{-2})           ≈0.5    O(ε^{-2.5})
Vega         ≈2      O(ε^{-2})           ≈0.6    O(ε^{-2.4})

TABLE 7.4: Orders of convergence for V_ℓ as observed numerically and corresponding MLMC complexity

             Lookback                    Barrier
Estimator    β       MLMC complexity     β       MLMC complexity
Value        ≈1.9    O(ε^{-2})           ≈1.6    O(ε^{-2})
Delta        ≈1.9    O(ε^{-2})           ≈0.6    O(ε^{-2.4})
Vega         ≈1.3    O(ε^{-2})           ≈0.6    O(ε^{-2.4})

7.4.5 Split pathwise sensitivities


There are two difficulties in using conditional expectation to smooth payoffs in practice in financial applications. The first is that the conditional expectation will often become a multidimensional integral without an obvious closed-form value, and the second is that it requires a change to the often complex software framework used to specify payoffs. As a remedy for these problems, the splitting technique is used to approximate $E\big[P(X^\ell_{2^\ell}) \mid X^\ell_{2^\ell-1}\big]$ and $E\big[P(X^{\ell-1}_{2^{\ell-1}}) \mid X^{\ell-1}_{2^{\ell-1}-1},\, \Delta w^\ell_{2^\ell-2}\big]$. We get numerical estimates of these values by “splitting” every simulated path on the final timestep. At the fine level, for every simulated path, a set of $s$ final increments $\{\Delta w^{\ell,i}_{2^\ell-1}\}_{i\in[1,s]}$ is simulated, and the resulting payoffs are averaged to get

$$
E\big[ P(X^\ell_{2^\ell}) \mid X^\ell_{2^\ell-1} \big] \approx \frac{1}{s} \sum_{i=1}^{s} P\big( X^\ell_{2^\ell-1},\, \Delta w^{\ell,i}_{2^\ell-1} \big). \tag{7.27}
$$

At the coarse level, similar to the case of digital options, the fine increment of the Brownian motion over the first half of the coarse timestep is used,

$$
E\big[ P(X^{\ell-1}_{2^{\ell-1}}) \mid X^{\ell-1}_{2^{\ell-1}-1},\, \Delta w^\ell_{2^\ell-2} \big] \approx \frac{1}{s} \sum_{i=1}^{s} P\big( X^{\ell-1}_{2^{\ell-1}-1},\, \Delta w^\ell_{2^\ell-2},\, \Delta w^{\ell,i}_{2^\ell-1} \big). \tag{7.28}
$$

This approach was tested in Reference 26, with the scalar Milstein scheme used to obtain the penultimate step, and the results are presented in Table 7.5. As expected, the values of $\beta$ tend to the rates offered by conditional expectations as $s$ increases and the approximation gets more precise.

TABLE 7.5: Orders of convergence for V_ℓ as observed numerically and the corresponding MLMC complexity

Estimator    s      β       MLMC complexity
Value        10     ≈2.0    O(ε^{-2})
             500    ≈2.0    O(ε^{-2})
Delta        10     ≈1.0    O(ε^{-2} (log ε)^2)
             500    ≈1.5    O(ε^{-2})
Vega         10     ≈1.6    O(ε^{-2})
             500    ≈2.0    O(ε^{-2})
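
The splitting estimates of Equations 7.27 and 7.28 can be sketched as follows for geometric Brownian motion with a Milstein final step. This Python fragment is only an illustration under assumptions: the model, the helper `milstein_gbm_step`, and all function and argument names are hypothetical choices made here.

```python
import math
import random


def milstein_gbm_step(x, r, sigma, dt, dw):
    """One Milstein step for geometric Brownian motion."""
    return x * (1.0 + r * dt + sigma * dw + 0.5 * sigma ** 2 * (dw * dw - dt))


def split_fine(payoff, x_pen, r, sigma, dt_fine, s, rng):
    """Equation 7.27: average the payoff over s resampled final fine increments."""
    total = 0.0
    for _ in range(s):
        dw = math.sqrt(dt_fine) * rng.gauss(0.0, 1.0)
        total += payoff(milstein_gbm_step(x_pen, r, sigma, dt_fine, dw))
    return total / s


def split_coarse(payoff, x_pen, r, sigma, dt_fine, dw_first_half, s, rng):
    """Equation 7.28: keep the first-half fine increment fixed and resample
    only the second half of the final coarse step."""
    total = 0.0
    for _ in range(s):
        dw_second = math.sqrt(dt_fine) * rng.gauss(0.0, 1.0)
        dw_coarse = dw_first_half + dw_second
        total += payoff(milstein_gbm_step(x_pen, r, sigma, 2.0 * dt_fine, dw_coarse))
    return total / s
```

Because the payoff is passed in as an ordinary callable, this approach leaves the payoff specification untouched, which is the practical advantage over an analytic conditional expectation.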

7.4.6 Optimal number of samples


The use of multiple samples to estimate the value of the conditional expectations is an example of the splitting technique [27]. If $w$ and $z$ are independent random variables, then for any function $P(w,z)$ the estimator

$$Y_{M,S} = M^{-1} \sum_{m=1}^{M} S^{-1} \sum_{i=1}^{S} P\big(w^m, z^{(m,i)}\big)$$

with independent samples $w^m$ and $z^{(m,i)}$ is an unbiased estimator for

$$E_{w,z}\big[P(w,z)\big] \equiv E_w\big[ E_z[P(w,z) \mid w] \big],$$

and its variance is

$$V[Y_{M,S}] = M^{-1}\, V_w\big[ E_z[P(w,z) \mid w] \big] + (MS)^{-1}\, E_w\big[ V_z[P(w,z) \mid w] \big].$$

The cost of computing $Y_{M,S}$ with variance $v_1 M^{-1} + v_2 (MS)^{-1}$ is proportional to

$$c_1 M + c_2 M S,$$

with $c_1$ corresponding to the path calculation and $c_2$ corresponding to the payoff evaluation. For a fixed computational cost, the variance can be minimized by minimizing the product

$$\big(v_1 + v_2 S^{-1}\big)\big(c_1 + c_2 S\big) = v_1 c_2 S + v_1 c_1 + v_2 c_2 + v_2 c_1 S^{-1},$$

which gives the optimum value $S_{\rm opt} = \sqrt{v_2 c_1 / (v_1 c_2)}$.
$c_1$ is $O(\Delta t_\ell^{-1})$ since the cost is proportional to the number of timesteps, and $c_2$ is $O(1)$, independent of $\Delta t_\ell$. If the payoff is Lipschitz, then $v_1$ and $v_2$ are both $O(1)$ and $S_{\rm opt} = O(\Delta t_\ell^{-1/2})$.
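
The trade-off above is easy to check numerically. The short Python sketch below evaluates the variance-cost product and its minimizer; the function names are hypothetical and the inputs $v_1, v_2, c_1, c_2$ are assumed to be known or estimated separately.

```python
import math


def variance_cost_product(s, v1, v2, c1, c2):
    """Product (v1 + v2/s) * (c1 + c2*s) minimized in the text."""
    return (v1 + v2 / s) * (c1 + c2 * s)


def optimal_split_samples(v1, v2, c1, c2):
    """Minimizer S_opt = sqrt(v2*c1 / (v1*c2)) of the variance-cost product."""
    return math.sqrt(v2 * c1 / (v1 * c2))
```

For example, with $v_1 = v_2 = 1$, $c_2 = 1$ and $c_1 = 1/\Delta t_\ell = 100$, the optimum is 10 samples per path, and halving $\Delta t_\ell$ twice doubles it, in line with $S_{\rm opt} = O(\Delta t_\ell^{-1/2})$.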

7.4.7 Vibrato Monte Carlo


The idea of vibrato Monte Carlo is to combine pathwise sensitivity and LRM. Adopting the conditional expectation approach, each path simulation for a particular set of Brownian motion increments $w^\ell \equiv (\Delta w^\ell_1, \Delta w^\ell_2, \ldots, \Delta w^\ell_{2^\ell-1})$ (excluding the increment for the final timestep) computes a conditional Gaussian probability distribution $p_X(X^\ell_{2^\ell} \mid w^\ell)$. For a scalar SDE, if $\mu_{w^\ell}$ and $\sigma_{w^\ell}$ are the mean and standard deviation for given $w^\ell$, then

$$X^\ell_{2^\ell}(w^\ell, Z) = \mu_{w^\ell} + \sigma_{w^\ell} Z,$$

where $Z$ is a unit Normal random variable. The expected payoff can then be expressed as

$$
V_\ell = E\Big[ E\big[ P(X^\ell_{2^\ell}) \mid w^\ell \big] \Big]
= \int \left( \int P(x)\, p_{X^\ell_{2^\ell}}(x \mid y)\, dx \right) p_{w^\ell}(y)\, dy.
$$

The outer expectation is an average over the discrete Brownian motion increments, while the inner conditional expectation is averaging over $Z$.
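To compute the sensitivity to the input parameter $\theta$, the first step is to apply the pathwise sensitivity approach for fixed $w^\ell$ to obtain $\partial \mu_{w^\ell}/\partial\theta$ and $\partial \sigma_{w^\ell}/\partial\theta$. We then apply LRM to the inner conditional expectation to get

$$
\frac{\partial V_\ell}{\partial \theta} = E\left[ \frac{\partial}{\partial \theta} E\big[ P(X^\ell_{2^\ell}) \mid w^\ell \big] \right]
= E\left[ E_Z\left[ P(X^\ell_{2^\ell})\, \frac{\partial (\log p_{X^\ell_{2^\ell}})}{\partial \theta} \,\Big|\, w^\ell \right] \right],
$$

where

$$
\frac{\partial (\log p_{X^\ell_{2^\ell}})}{\partial \theta}
= \frac{\partial (\log p_{X^\ell_{2^\ell}})}{\partial \mu_{w^\ell}} \frac{\partial \mu_{w^\ell}}{\partial \theta}
+ \frac{\partial (\log p_{X^\ell_{2^\ell}})}{\partial \sigma_{w^\ell}} \frac{\partial \sigma_{w^\ell}}{\partial \theta}.
$$

This leads to the estimator

$$
\frac{\partial V_\ell}{\partial \theta} \approx \frac{1}{N} \sum_{m=1}^{N} \left(
\frac{\partial \mu_{w^{\ell,m}}}{\partial \theta}\, E\left[ P(X^\ell_{2^\ell})\, \frac{\partial (\log p_{X^\ell_{2^\ell}})}{\partial \mu_{w^\ell}} \,\Big|\, w^{\ell,m} \right]
+ \frac{\partial \sigma_{w^{\ell,m}}}{\partial \theta}\, E\left[ P(X^\ell_{2^\ell})\, \frac{\partial (\log p_{X^\ell_{2^\ell}})}{\partial \sigma_{w^\ell}} \,\Big|\, w^{\ell,m} \right] \right). \tag{7.29}
$$

We compute $\partial \mu_{w^{\ell,m}}/\partial\theta$ and $\partial \sigma_{w^{\ell,m}}/\partial\theta$ with pathwise sensitivities. With $X^{\ell,m,i}_{2^\ell} = X^\ell_{2^\ell}(w^{\ell,m}, Z^i)$, we substitute the following estimators into Equation 7.29:

$$
E\left[ P(X^\ell_{2^\ell})\, \frac{\partial (\log p_{X^\ell_{2^\ell}})}{\partial \mu_{w^\ell}} \,\Big|\, w^{\ell,m} \right]
\approx \frac{1}{s} \sum_{i=1}^{s} P\big(X^{\ell,m,i}_{2^\ell}\big)\, \frac{X^{\ell,m,i}_{2^\ell} - \mu_{w^{\ell,m}}}{\sigma^2_{w^{\ell,m}}},
$$

$$
E\left[ P(X^\ell_{2^\ell})\, \frac{\partial (\log p_{X^\ell_{2^\ell}})}{\partial \sigma_{w^\ell}} \,\Big|\, w^{\ell,m} \right]
\approx \frac{1}{s} \sum_{i=1}^{s} P\big(X^{\ell,m,i}_{2^\ell}\big) \left( -\frac{1}{\sigma_{w^{\ell,m}}} + \frac{\big(X^{\ell,m,i}_{2^\ell} - \mu_{w^{\ell,m}}\big)^2}{\sigma^3_{w^{\ell,m}}} \right).
$$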

In a multilevel setting, at the fine level we can use Equation 7.29 directly. At the coarse level, as for digital options in Section 7.3.5, the fine Brownian increments over the first half of the coarse timestep are reused to derive Equation 7.29.
Numerical results for the call option with $s = 10$ were obtained in Reference 26, with the scalar Milstein scheme used to obtain the penultimate step:

Estimator    β       MLMC complexity
Value        ≈2.0    O(ε^{-2})
Delta        ≈1.5    O(ε^{-2})
Vega         ≈2.0    O(ε^{-2})

Although the discussion so far has considered an option based on the value of a single underlying asset at the terminal time $T$, it can be shown
that the idea extends very naturally to multidimensional cases, producing a
conditional multivariate Gaussian distribution, and also to financial payoffs
which are dependent on values at intermediate times.
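
As an illustration of Equation 7.29, the following Python sketch computes a single-level vibrato estimate of the delta of a European call under geometric Brownian motion. It is a sketch under assumptions rather than the implementation used in Reference 26: the model, the use of an Euler final step to define the conditional Gaussian, and all names are choices made here, and no variance reduction is applied to the inner average.

```python
import math
import random


def vibrato_call_delta(x0, strike, r, sigma, T, n_steps, n_paths, s, seed=0):
    """Vibrato Monte Carlo delta: pathwise up to the penultimate step, LRM on the last."""
    rng = random.Random(seed)
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    disc = math.exp(-r * T)
    total = 0.0
    for _ in range(n_paths):
        # Milstein path and its pathwise sensitivity dX/dx0 up to the penultimate step.
        x, dx = x0, 1.0
        for _ in range(n_steps - 1):
            dw = sqrt_dt * rng.gauss(0.0, 1.0)
            growth = 1.0 + r * dt + sigma * dw + 0.5 * sigma ** 2 * (dw * dw - dt)
            x, dx = x * growth, dx * growth
        # Conditional Gaussian for the final step: X = mu + sig * Z (x stays positive).
        mu = x * (1.0 + r * dt)
        sig = sigma * x * sqrt_dt
        dmu = dx * (1.0 + r * dt)      # pathwise d(mu)/d(x0)
        dsig = sigma * dx * sqrt_dt    # pathwise d(sig)/d(x0)
        mu_term = 0.0
        sig_term = 0.0
        for _ in range(s):
            z = rng.gauss(0.0, 1.0)
            p = max(mu + sig * z - strike, 0.0)
            mu_term += p * z / sig                 # P * (X - mu) / sig^2
            sig_term += p * (z * z - 1.0) / sig    # P * (-1/sig + (X - mu)^2 / sig^3)
        total += (dmu * mu_term + dsig * sig_term) / s
    return disc * total / n_paths
```

The payoff enters only through evaluations `p`, never through its derivative, so the same structure applies unchanged to a digital payoff.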

7.5 Multilevel Monte Carlo for Jump-Diffusion Processes

Giles and Xia in Reference 29 investigated the extension of the MLMC
method to jump-diffusion SDEs. We consider models with finite rate activity
using a jump-adapted discretization in which the jump times are computed
and added to the standard uniform discretization times. If the Poisson jump
rate is constant, the jump times are the same on both paths and the multi-
level extension is relatively straightforward, but the implementation is more
complex in the case of state-dependent jump rates for which the jump times
naturally differ.
Merton [30] proposed a jump-diffusion process, in which the asset price follows a jump-diffusion SDE:

$$dx(t) = f(x(t-))\, dt + g(x(t-))\, dw(t) + c(x(t-))\, dJ(t), \quad 0 \le t \le T, \tag{7.30}$$

where the jump term $J(t)$ is a compound Poisson process $\sum_{i=1}^{N(t)} (Y_i - 1)$, the jump magnitude $Y_i$ has a prescribed distribution, and $N(t)$ is a Poisson process with intensity $\lambda$, independent of the Brownian motion. Due to the existence of jumps, the process is a càdlàg process, that is, it is right continuous with left limits. We note that $x(t-)$ denotes the left limit of the process while $x(t) = \lim_{s \to t+} x(s)$. In Reference 30, Merton also assumed that $\log Y_i$ has a normal distribution.

7.5.1 A jump-adapted Milstein discretization


To simulate finite activity jump-diffusion processes, Giles and Xia [29] used the jump-adapted approximation from Platen and Bruti-Liberati [31]. For each path simulation, the set of jump times $J = \{\tau_1, \tau_2, \ldots, \tau_m\}$ within the time interval $[0,T]$ is added to a uniform partition $P_{\Delta t_\ell} := \{ n\Delta t_\ell : n = 0, 1, 2, \ldots, 2^\ell \}$. A combined set of discretization times is then given by $\mathcal{T} = \{ 0 = t_0 < t_1 < t_2 < \cdots < t_M = T \}$ and we define the length of the timestep as $\Delta t_\ell^n = t_{n+1} - t_n$. Clearly, $\Delta t_\ell^n \le \Delta t_\ell$.
Within each timestep, the scalar Milstein discretization is used to approximate the SDE 7.30, and then the jump is simulated when the simulation time is equal to one of the jump times. This gives the following numerical method:

$$
X^{\ell,-}_{n+1} = X^\ell_n + f(X^\ell_n)\, \Delta t^n_\ell + g(X^\ell_n)\, \Delta w^\ell_{n+1} + \tfrac12\, g'(X^\ell_n)\, g(X^\ell_n) \big( (\Delta w^\ell_{n+1})^2 - \Delta t^n_\ell \big),
$$

$$
X^\ell_{n+1} = \begin{cases}
X^{\ell,-}_{n+1} + c(X^{\ell,-}_{n+1})(Y_i - 1), & \text{when } t_{n+1} = \tau_i, \\
X^{\ell,-}_{n+1}, & \text{otherwise,}
\end{cases} \tag{7.31}
$$

where $X^{\ell,-}_n = X^\ell_{t_n-}$ is the left limit of the approximated path, $\Delta w^\ell_n$ is the Brownian increment, and $Y_i$ is the jump magnitude at $\tau_i$.
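
A minimal Python sketch of the jump-adapted scheme 7.31 is given below. It assumes that the jump times and magnitudes have already been sampled and are passed in as a dictionary; the function names and the callables `f`, `g`, `dg` (the derivative of `g`) and `c` are hypothetical names introduced for illustration.

```python
import math
import random


def jump_adapted_grid(T, n_uniform, jump_times):
    """Merge the uniform grid {n*T/n_uniform} with the jump times inside (0, T)."""
    times = {T * n / n_uniform for n in range(n_uniform)}
    times.add(T)
    times.update(t for t in jump_times if 0.0 < t < T)
    return sorted(times)


def jump_adapted_milstein(x0, f, g, dg, c, T, n_uniform, jumps, rng):
    """Simulate one path of Equation 7.31; `jumps` maps jump time tau_i to Y_i."""
    grid = jump_adapted_grid(T, n_uniform, jumps.keys())
    x = x0
    for t0, t1 in zip(grid, grid[1:]):
        dt = t1 - t0
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Milstein step up to the left limit at t1.
        x_minus = x + f(x) * dt + g(x) * dw + 0.5 * dg(x) * g(x) * (dw * dw - dt)
        # Apply the jump only if t1 is one of the jump times.
        if t1 in jumps:
            x = x_minus + c(x_minus) * (jumps[t1] - 1.0)
        else:
            x = x_minus
    return x
```

With a constant jump rate, the same `jumps` dictionary would be shared by the coarse and fine paths, so only the uniform part of the grid differs between levels.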

7.5.1.1 Multilevel Monte Carlo for constant jump rate


In the case of the jump-adapted discretization, the telescopic sum 7.1 is written down with respect to $\Delta t_\ell$ rather than to $\Delta t_\ell^n$. Therefore, we have to define the computational complexity as the expected computational cost since different paths may have different numbers of jumps. However, the expected number of jumps is finite and therefore the cost bound in assumption (iv) in Theorem 7.1 will still remain valid for an appropriate choice of the constant $c_3$.
The MLMC approach for a constant jump rate is straightforward. The jump times $\tau_j$, which are the same for the coarse and fine paths, are simulated by setting $\tau_j - \tau_{j-1} \sim \exp(\lambda)$.
Pricing European call and Asian options in this setting is straightforward. For lookback, barrier, and digital options, we need to consider Brownian bridge interpolations as we did in Section 7.3.2. However, due to the presence of jumps some small modifications are required. To improve convergence, we will be looking at Brownian bridges between timesteps coming from the jump-adapted discretization. In order to obtain an interpolated value $\tilde X^{\ell-1}_{n+\frac12}$ for the coarse timestep, a Brownian bridge interpolation over the interval $[k_n, \hat k_n]$ is considered, where

$$
\begin{aligned}
k_n &= \max\big\{ n\Delta t_{\ell-1},\ \max\{ \tau \in J : \tau < (n+\tfrac12)\Delta t_{\ell-1} \} \big\}, \\
\hat k_n &= \min\big\{ (n+1)\Delta t_{\ell-1},\ \min\{ \tau \in J : \tau > (n+\tfrac12)\Delta t_{\ell-1} \} \big\}.
\end{aligned} \tag{7.32}
$$

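Hence

$$
\tilde X^{\ell-1}_{n+\frac12} = X^{\ell-1}_{k_n} + \lambda_{\ell-1} \big( X^{\ell-1}_{\hat k_n} - X^{\ell-1}_{k_n} \big)
+ g(X^{\ell-1}_{k_n}) \Big( w^\ell\big((n+\tfrac12)\Delta t_{\ell-1}\big) - w^\ell(k_n) - \lambda_{\ell-1} \big( w^\ell(\hat k_n) - w^\ell(k_n) \big) \Big),
$$

where $\lambda_{\ell-1} \equiv \big( (n+\tfrac12)\Delta t_{\ell-1} - k_n \big) / (\hat k_n - k_n)$.
In the same way as in Section 7.3.2, the minima over the time-adapted discretization can be derived. For the fine timestep, we have

$$
X^\ell_{n,\min} = \frac12 \left( X^\ell_n + X^{\ell,-}_{n+\frac12} - \sqrt{ \big( X^{\ell,-}_{n+\frac12} - X^\ell_n \big)^2 - 2\, g(X^\ell_n)^2\, \Delta t^n_\ell\, \log U^\ell_n } \right).
$$

Notice the use of the left limits $X^{\ell,-}$. Following the discussion in the previous sections, the minima for the coarse timestep can be derived using the interpolated value $\tilde X^{\ell-1}_{n+\frac12}$. Deriving the payoffs for lookback and barrier options is now straightforward.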
For digital options, due to jump-adapted time grid, in order to find condi-
tional expectations, we need to look at relations between the last jump time
and the last timestep before expiry. In fact, there are three cases:

1. The last jump time $\tau$ happens before the penultimate fixed-time timestep, that is, $\tau < (2^{\ell-1} - 2)\Delta t_\ell$.
2. The last jump time is within the last fixed-time timestep, that is, $\tau > (2^{\ell-1} - 1)\Delta t_\ell$.
3. The last jump time is within the penultimate fixed-time timestep, that is, $(2^{\ell-1} - 1)\Delta t_\ell > \tau > (2^{\ell-1} - 2)\Delta t_\ell$.

With this in mind, we can easily write down the payoffs for the coarse and
fine approximations as we presented in Section 7.3.5.

7.5.1.2 Multilevel Monte Carlo for path-dependent rates


In the case of a path-dependent jump rate λ(x(t)), the implementation of
the multilevel method becomes more difficult because the coarse and fine path
approximations may have jumps at different times. These differences could
lead to a large difference between the coarse and fine path payoffs, and hence
greatly increase the variance of the multilevel correction. To avoid this, Giles
and Xia [29] modified the simulation approach of Glasserman and Merener [32]
which uses “thinning” to treat the case in which $\lambda(x(t), t)$ is bounded. Let us recall the thinning property of Poisson processes. Let $(N_t)_{t \ge 0}$ be a Poisson process with intensity $\lambda$ and define a new process $Z_t$ by “thinning” $N_t$: take all the jump times $(\tau_n, n \ge 1)$ corresponding to $N$, and keep each of them with probability $0 < p < 1$ or delete it with probability $1 - p$, independently of each other. Now order the jump times that have not been deleted, $(\tau'_n, n \ge 1)$, and define

$$Z_t = \sum_{n \ge 1} 1_{\{t \ge \tau'_n\}}.$$

Then the process $Z$ is a Poisson process with intensity $p\lambda$.


In our setting, first a Poisson process with a constant rate λsup (which
is an upper bound of the state-dependent rate) is constructed. This gives
a set of candidate jump times, and these are then selected as true jump
times with probability λ(x(t), t)/λsup . The following jump-adapted thinning
Milstein scheme is obtained.
1. Generate the jump-adapted time grid for a Poisson process with constant
rate λsup .
2. Simulate each timestep using the Milstein discretization.

3. When the endpoint $t_{n+1}$ is a candidate jump time, generate a uniform random number $U \sim U[0,1]$, and if $U < p_{t_{n+1}} = \lambda(x(t_{n+1}-), t_{n+1}) / \lambda_{\sup}$, then accept $t_{n+1}$ as a real jump time and simulate the jump.
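
The candidate generation and acceptance steps above can be sketched in Python as follows. This is a simplified illustration: the acceptance probability is supplied as a callable `accept_prob(t)`, a hypothetical stand-in for $\lambda(x(t-), t)/\lambda_{\sup}$ evaluated along the simulated path, and the Milstein stepping between candidate times is omitted.

```python
import random


def candidate_jump_times(rate_sup, T, rng):
    """Jump times on (0, T) of a Poisson process with constant rate rate_sup."""
    times = []
    t = rng.expovariate(rate_sup)
    while t < T:
        times.append(t)
        t += rng.expovariate(rate_sup)
    return times


def thin(candidates, accept_prob, rng):
    """Keep each candidate time t independently with probability accept_prob(t)."""
    return [t for t in candidates if rng.random() < accept_prob(t)]
```

If fine and coarse paths call `thin` with different acceptance probabilities, some candidates are accepted on one path only, which is the source of the variance problem discussed next.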

In the multilevel implementation, the straightforward application of the above algorithm will result in different acceptance probabilities for fine and coarse levels. There may be some samples in which a jump candidate is accepted for the fine path, but not for the coarse path, or vice versa. Because of the first-order strong convergence, the difference in acceptance probabilities will be $O(\Delta t_\ell)$, and hence there is an $O(\Delta t_\ell)$ probability of coarse and fine paths differing in accepting candidate jumps. Such differences will give an $O(1)$ difference in the payoff value, and hence the multilevel variance will be $O(\Delta t_\ell)$. A more detailed analysis of this is given in Reference 33.
To improve the variance convergence rate, a change of measure is used so that the acceptance probability is the same for both fine and coarse paths. This is achieved by taking the expectation with respect to a new measure $Q$:

$$
E_P\big[ P_\ell^f - P_{\ell-1}^c \big] = E_Q\Big[ P_\ell^f \prod_\tau R_\tau^f - P_{\ell-1}^c \prod_\tau R_\tau^c \Big],
$$

where $\tau$ are the jump times. The acceptance probability for a candidate jump under the measure $Q$ is defined to be $\tfrac12$ for both coarse and fine paths, instead of $p_\tau = \lambda(X(\tau-), \tau) / \lambda_{\sup}$. The corresponding Radon–Nikodym derivatives are

$$
R_\tau^f = \begin{cases} 2 p_\tau^f, & \text{if } U < \tfrac12, \\ 2(1 - p_\tau^f), & \text{if } U \ge \tfrac12, \end{cases}
\qquad
R_\tau^c = \begin{cases} 2 p_\tau^c, & \text{if } U < \tfrac12, \\ 2(1 - p_\tau^c), & \text{if } U \ge \tfrac12. \end{cases}
$$

Since $V[R_\tau^f - R_\tau^c] = O(\Delta t_\ell^2)$ and $V[\widehat P_\ell - \widehat P_{\ell-1}] = O(\Delta t_\ell^2)$, this results in the multilevel correction variance $V_Q\big[ \widehat P_\ell \prod_\tau R_\tau^f - \widehat P_{\ell-1} \prod_\tau R_\tau^c \big]$ being $O(\Delta t_\ell^2)$.
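
The effect of the change of measure can be seen on a single candidate jump. In the Python sketch below (function name and arguments are assumptions made for illustration), the accept decision depends only on $U$, so it is identical for the fine and coarse paths, while the path-specific probability enters only through the weight.

```python
def accept_and_weight(u, p):
    """Under Q a candidate is accepted iff u < 1/2; return (accepted, R).

    R is the Radon-Nikodym weight for a path whose acceptance
    probability under the original measure is p.
    """
    if u < 0.5:
        return True, 2.0 * p
    return False, 2.0 * (1.0 - p)
```

Taking the expectation over the two equally likely outcomes under $Q$ gives $\tfrac12 \cdot 2p = p$ for the weighted acceptance indicator and $\tfrac12 \cdot 2p + \tfrac12 \cdot 2(1-p) = 1$ for the weight itself, so expectations under the original measure are preserved.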

If the analytic formulation is expressed using the same thinning and change of measure, the weak error can be decomposed into two terms as follows:

$$
E_Q\Big[ \widehat P_\ell \prod_\tau R_\tau^f - P \prod_\tau R_\tau \Big]
= E_Q\Big[ (\widehat P_\ell - P) \prod_\tau R_\tau^f \Big] + E_Q\Big[ P \Big( \prod_\tau R_\tau^f - \prod_\tau R_\tau \Big) \Big].
$$

Using Hölder's inequality, the bound $\max(R_\tau, R_\tau^f) \le 2$ and standard results for a Poisson process, the first term can be bounded using weak convergence results for the constant rate process, and the second term can be bounded using the corresponding strong convergence results [33]. This guarantees that the multilevel procedure does converge to the correct value.

7.5.2 Lévy processes


Dereich and Heidenreich [34] analyzed approximation methods for both
finite and infinite activity Lévy-driven SDEs with globally Lipschitz payoffs.
They have derived upper bounds for MLMC variance for the class of path-
dependent payoffs that are Lipschitz continuous with respect to supremum
norm. One of their main findings is that the rate of MLMC variance conver-
gence is closely related to the Blumenthal–Getoor index of the driving Lévy
process that measures the frequency of small jumps. In Reference 34, the
authors considered SDEs driven by the Lévy process
$$s(t) = \Sigma\, w(t) + L(t) + b\, t,$$

where $\Sigma$ is the diffusion coefficient, $L(t)$ is a compensated jump process, and $b$ is a drift coefficient. The simplest treatment is to neglect all the jumps with sizes smaller than $h$. To construct MLMC, they took a level-dependent threshold $h_\ell$, that is, at level $\ell$ they neglected jumps smaller than $h_\ell$. Then, similarly to the previous section, a uniform time discretization $\Delta t_\ell$ augmented with jump times is used. Let us denote by $\Delta L(t) = L(t) - L(t-)$ the jump discontinuity at time $t$. The crucial observation is that for $h' > h > 0$, the jumps of the process $L^{h'}$ can be obtained from those of $L^{h}$ by

$$\Delta L(t)^{h'} = \Delta L(t)^{h}\, 1_{\{|\Delta L(t)^{h}| > h'\}},$$

which gives the necessary coupling to obtain a good MLMC variance. We define a decreasing and invertible function $g : (0,\infty) \to (0,\infty)$ such that

$$\int \left( \frac{|x|^2}{h^2} \wedge 1 \right) \nu(dx) \le g(h) \quad \text{for all } h > 0,$$

where $\nu$ is the Lévy measure, and for $\ell \in \mathbb{N}$ we define

$$\Delta t_\ell = 2^{-\ell} \quad \text{and} \quad h_\ell = g^{-1}(2^\ell).$$

With this choice of $\Delta t_\ell$ and $h_\ell$, the authors of Reference 34 analyzed the standard Euler–Maruyama (EM) scheme for Lévy-driven SDEs. This approach gives good results for a Blumenthal–Getoor index smaller than 1. For a Blumenthal–Getoor index bigger than 1, a Gaussian approximation of small jumps gives better results [35].

7.6 Multidimensional Milstein Scheme


In the previous sections, it was shown that combining a numerical approximation with strong order of convergence $O(\Delta t_\ell)$ with MLMC reduces the computational complexity of estimating expected values of functionals of SDE solutions with a root-mean-square error of $\varepsilon$ from $O(\varepsilon^{-3})$ to $O(\varepsilon^{-2})$. However, in general, obtaining a rate of strong convergence higher than $O(\Delta t^{1/2})$ requires simulation, or approximation, of Lévy areas. Giles and Szpruch [18], through the construction of a suitable antithetic multilevel correction estimator, showed that we can avoid the simulation of Lévy areas and still achieve an $O(\Delta t^2)$ variance for smooth payoffs, and almost an $O(\Delta t^{3/2})$ variance for piecewise smooth payoffs, even though there is only $O(\Delta t^{1/2})$ strong convergence.
In the previous sections, we have shown that it can be better to use different estimators for the finer and coarser of the two levels being considered, $P_\ell^f$ when level $\ell$ is the finer level, and $P_\ell^c$ when level $\ell$ is the coarser level. In this case, we required that

$$E[P_\ell^f] = E[P_\ell^c] \quad \text{for } \ell = 1, \ldots, L, \tag{7.33}$$

so that

$$E[P_L^f] = E[P_0^f] + \sum_{\ell=1}^{L} E\big[ P_\ell^f - P_{\ell-1}^c \big]$$
still holds. For lookback, barrier, and digital options, we showed that we can
obtain a better MLMC variance by suitably modifying the estimator on the
coarse levels. By further exploiting the flexibility of MLMC, Giles and Szpruch
[18] modified the estimator on the fine levels in order to avoid simulation of
the Lévy areas.

7.6.1 Antithetic multilevel Monte Carlo estimator


Based on the well-known method of antithetic variates [24], the idea for the antithetic estimator is to exploit the flexibility of the more general MLMC estimator by defining $P^c_{\ell-1}$ to be the usual payoff $P(X^c)$ coming from a level $\ell-1$ coarse simulation $X^c$, and defining $P^f_\ell$ to be the average of the payoffs $P(X^f)$, $P(X^a)$ coming from an antithetic pair of level $\ell$ simulations, $X^f$ and $X^a$.
$X^f$ will be defined in a way which corresponds naturally to the construction of $X^c$. Its antithetic “twin” $X^a$ will be defined so that it has exactly the same distribution as $X^f$, conditional on $X^c$, which ensures that $E[P(X^f)] = E[P(X^a)]$ and hence Equation 7.3 is satisfied, but at the same time

$$X^f - X^c \approx -(X^a - X^c)$$

and therefore

$$P(X^f) - P(X^c) \approx -\big( P(X^a) - P(X^c) \big),$$

so that $\tfrac12 \big( P(X^f) + P(X^a) \big) \approx P(X^c)$. This leads to $\tfrac12 \big( P(X^f) + P(X^a) \big) - P(X^c)$ having a much smaller variance than the standard estimator $P(X^f) - P(X^c)$.
We now present a lemma which gives an upper bound on the convergence of the variance of $\tfrac12 \big( P(X^f) + P(X^a) \big) - P(X^c)$.
Lemma 7.1. If $P \in C^2(\mathbb{R}^d, \mathbb{R})$ and there exist constants $L_1$, $L_2$ such that for all $x \in \mathbb{R}^d$

$$\left\| \frac{\partial P}{\partial x} \right\| \le L_1, \qquad \left\| \frac{\partial^2 P}{\partial x^2} \right\| \le L_2,$$

then for $p \ge 2$,

$$
E\left[ \Big| \tfrac12 \big( P(X^f) + P(X^a) \big) - P(X^c) \Big|^p \right]
\le 2^{p-1} L_1^p\, E\left[ \Big\| \tfrac12 (X^f + X^a) - X^c \Big\|^p \right]
+ 2^{-(p+1)} L_2^p\, E\left[ \big\| X^f - X^a \big\|^{2p} \right].
$$

In the multidimensional SDE applications considered in finance, the Milstein approximation with the Lévy areas set to zero, combined with the antithetic construction, leads to $X^f - X^a = O(\Delta t^{1/2})$ but $\tfrac12 (X^f + X^a) - X^c = O(\Delta t)$. Hence, the variance $V\big[ \tfrac12 (P_\ell^f + P_\ell^a) - P_{\ell-1}^c \big]$ is $O(\Delta t^2)$, which is the order obtained for scalar SDEs using the Milstein discretization with its first-order strong convergence.

7.6.2 Clark–Cameron example


The paper of Clark and Cameron [16] addresses the question of how accu-
rately one can approximate the solution of an SDE driven by an underly-
ing multidimensional Brownian motion, using only uniformly spaced discrete
Brownian increments. Their model problem is

$$
\begin{aligned}
dx_1(t) &= dw_1(t), \\
dx_2(t) &= x_1(t)\, dw_2(t),
\end{aligned} \tag{7.34}
$$

with $x_1(0) = x_2(0) = 0$, and zero correlation between the two Brownian motions $w_1(t)$ and $w_2(t)$. These equations can be integrated exactly over a time interval $[t_n, t_{n+1}]$, where $t_n = n\,\Delta t$, to give

$$
\begin{aligned}
x_1(t_{n+1}) &= x_1(t_n) + \Delta w_{1,n}, \\
x_2(t_{n+1}) &= x_2(t_n) + x_1(t_n)\, \Delta w_{2,n} + \tfrac12\, \Delta w_{1,n}\, \Delta w_{2,n} + \tfrac12\, A_{12,n},
\end{aligned} \tag{7.35}
$$

where $\Delta w_{i,n} \equiv w_i(t_{n+1}) - w_i(t_n)$, and $A_{12,n}$ is the Lévy area defined as

$$
A_{12,n} = \int_{t_n}^{t_{n+1}} \big( w_1(t) - w_1(t_n) \big)\, dw_2(t) - \int_{t_n}^{t_{n+1}} \big( w_2(t) - w_2(t_n) \big)\, dw_1(t).
$$

This corresponds exactly to the Milstein discretization presented in Equation 7.7, so for this simple model problem the Milstein discretization is exact.
The point of Clark and Cameron's paper is that for any numerical approximation $X(T)$ based solely on the set of discrete Brownian increments $\Delta w$,

$$E\big[ (x_2(T) - X_2(T))^2 \big] \ge \tfrac14\, T\, \Delta t.$$

Since in this section we use superscripts $f$, $a$, $c$ for fine $X^f$, antithetic $X^a$, and coarse $X^c$ approximations, respectively, we drop the superscript $\ell$ for clarity of notation.
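We define a coarse path approximation $X^c$ with timestep $\Delta t$ by neglecting the Lévy area terms to give

$$
\begin{aligned}
X^c_{1,n+1} &= X^c_{1,n} + \Delta w^{\ell-1}_{1,n+1}, \\
X^c_{2,n+1} &= X^c_{2,n} + X^c_{1,n}\, \Delta w^{\ell-1}_{2,n+1} + \tfrac12\, \Delta w^{\ell-1}_{1,n+1}\, \Delta w^{\ell-1}_{2,n+1}.
\end{aligned} \tag{7.36}
$$

This is equivalent to replacing the true Brownian path by a piecewise linear approximation as illustrated in Figure 7.1.
Similarly, we define the corresponding two half-timesteps of the first fine path approximation $X^f$ by

$$
\begin{aligned}
X^f_{1,n+\frac12} &= X^f_{1,n} + \Delta w^\ell_{1,n+\frac12}, \\
X^f_{2,n+\frac12} &= X^f_{2,n} + X^f_{1,n}\, \Delta w^\ell_{2,n+\frac12} + \tfrac12\, \Delta w^\ell_{1,n+\frac12}\, \Delta w^\ell_{2,n+\frac12}, \\
X^f_{1,n+1} &= X^f_{1,n+\frac12} + \Delta w^\ell_{1,n+1}, \\
X^f_{2,n+1} &= X^f_{2,n+\frac12} + X^f_{1,n+\frac12}\, \Delta w^\ell_{2,n+1} + \tfrac12\, \Delta w^\ell_{1,n+1}\, \Delta w^\ell_{2,n+1},
\end{aligned}
$$

where $\Delta w^{\ell-1}_{n+1} = \Delta w^\ell_{n+\frac12} + \Delta w^\ell_{n+1}$. Using this relation, the equations for the two fine timesteps can be combined to give an equation for the increment over the coarse timestep,

$$
\begin{aligned}
X^f_{1,n+1} &= X^f_{1,n} + \Delta w^{\ell-1}_{1,n+1}, \\
X^f_{2,n+1} &= X^f_{2,n} + X^f_{1,n}\, \Delta w^{\ell-1}_{2,n+1} + \tfrac12\, \Delta w^{\ell-1}_{1,n+1}\, \Delta w^{\ell-1}_{2,n+1} \\
&\quad + \tfrac12 \Big( \Delta w^\ell_{1,n+\frac12}\, \Delta w^\ell_{2,n+1} - \Delta w^\ell_{2,n+\frac12}\, \Delta w^\ell_{1,n+1} \Big).
\end{aligned} \tag{7.37}
$$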

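The antithetic approximation $X^a_n$ is defined by exactly the same discretization except that the Brownian increments $\Delta w^\ell_{n+\frac12}$ and $\Delta w^\ell_{n+1}$ are swapped, as illustrated in Figure 7.1. This gives

$$
\begin{aligned}
X^a_{1,n+\frac12} &= X^a_{1,n} + \Delta w^\ell_{1,n+1}, \\
X^a_{2,n+\frac12} &= X^a_{2,n} + X^a_{1,n}\, \Delta w^\ell_{2,n+1} + \tfrac12\, \Delta w^\ell_{1,n+1}\, \Delta w^\ell_{2,n+1}, \\
X^a_{1,n+1} &= X^a_{1,n+\frac12} + \Delta w^\ell_{1,n+\frac12}, \\
X^a_{2,n+1} &= X^a_{2,n+\frac12} + X^a_{1,n+\frac12}\, \Delta w^\ell_{2,n+\frac12} + \tfrac12\, \Delta w^\ell_{1,n+\frac12}\, \Delta w^\ell_{2,n+\frac12},
\end{aligned}
$$

and hence

$$
\begin{aligned}
X^a_{1,n+1} &= X^a_{1,n} + \Delta w^{\ell-1}_{1,n+1}, \\
X^a_{2,n+1} &= X^a_{2,n} + X^a_{1,n}\, \Delta w^{\ell-1}_{2,n+1} + \tfrac12\, \Delta w^{\ell-1}_{1,n+1}\, \Delta w^{\ell-1}_{2,n+1} \\
&\quad - \tfrac12 \Big( \Delta w^\ell_{1,n+\frac12}\, \Delta w^\ell_{2,n+1} - \Delta w^\ell_{2,n+\frac12}\, \Delta w^\ell_{1,n+1} \Big).
\end{aligned} \tag{7.38}
$$

Swapping $\Delta w^\ell_{n+\frac12}$ and $\Delta w^\ell_{n+1}$ does not change the distribution of the driving Brownian increments, and hence $X^a$ has exactly the same distribution as $X^f$.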

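[Figure: a Brownian path W over one coarse timestep together with its piecewise linear approximations W^c (coarse), W^f (fine), and W^a (antithetic).]

FIGURE 7.1: Brownian path and approximations over one coarse timestep.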

Note also the change in sign in the last term in Equation 7.37 compared to
the corresponding term in Equation 7.38. This is important because these two
terms cancel when the two equations are averaged.
These last terms correspond to the Lévy areas for the fine and antithetic
paths, and the sign reversal is a particular instance of a more general result
for time-reversed Brownian motion [36]. If (wt , 0 ≤ t ≤ 1) denotes a Brownian
motion on the time interval [0, 1], then the time-reversed Brownian motion
(zt , 0 ≤ t ≤ 1) defined by
zt = w1 − w1−t (7.39)
has exactly the same distribution, and it can be shown that its Lévy area is
equal in magnitude and opposite in sign to that of wt .
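Lemma 7.2. If $X^f_n$, $X^a_n$, and $X^c_n$ are as defined above, then

$$
X^f_{1,n} = X^a_{1,n} = X^c_{1,n}, \qquad \tfrac{1}{2}\left(X^f_{2,n} + X^a_{2,n}\right) = X^c_{2,n}, \qquad \forall\, n \le N,
$$

and

$$
\mathbb{E}\left[\left(X^f_{2,N} - X^a_{2,N}\right)^4\right] = \tfrac{3}{4}\, T\,(T + \Delta t)\,\Delta t^2 .
$$

In the following section, we will see how this lemma generalizes to the nonlinear
multidimensional SDEs (7.4).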
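The identities in Lemma 7.2 can be checked numerically. The following is a minimal sketch, not the authors' code: the function names `step` and `simulate`, the seed, the path count, and the choice of starting value are all illustrative assumptions, and $\Delta t$ is taken to be the coarse timestep $T/N$.

```python
import numpy as np

def step(x1, x2, dw):
    """One step of the scheme in Eq. 7.36 (Milstein without the Levy area term)."""
    return x1 + dw[:, 0], x2 + x1 * dw[:, 1] + 0.5 * dw[:, 0] * dw[:, 1]

def simulate(n_coarse, n_paths, T=1.0, x0=1.0, seed=0):
    """Return terminal (x1, x2) pairs for the fine, antithetic and coarse paths."""
    rng = np.random.default_rng(seed)
    dt_fine = T / (2 * n_coarse)
    start = np.full(n_paths, x0)
    xf = xa = xc = (start, start)
    for _ in range(n_coarse):
        first = rng.normal(0.0, np.sqrt(dt_fine), size=(n_paths, 2))
        second = rng.normal(0.0, np.sqrt(dt_fine), size=(n_paths, 2))
        xf = step(*step(*xf, first), second)  # fine: half-steps in natural order
        xa = step(*step(*xa, second), first)  # antithetic: half-steps swapped
        xc = step(*xc, first + second)        # coarse: summed increment
    return xf, xa, xc

T, n_coarse = 1.0, 8
xf, xa, xc = simulate(n_coarse, n_paths=100_000, T=T)
dt = T / n_coarse
# Pathwise identity: the mismatch should be at the level of rounding error.
print(np.max(np.abs(0.5 * (xf[1] + xa[1]) - xc[1])))
# Sample fourth moment next to the closed form in the lemma.
print(np.mean((xf[1] - xa[1]) ** 4), 0.75 * T * (T + dt) * dt**2)
```

The first printed value should be close to machine precision, while the two values on the second line should agree up to Monte Carlo sampling error.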

7.6.3 Milstein discretization: General theory


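Using the coarse timestep $\Delta t_{\ell-1}$, the coarse path approximation $X^c_n$ is given
by the Milstein approximation without the Lévy area term:

$$
\begin{aligned}
X^c_{i,n+1} = X^c_{i,n} &+ f_i(X^c_n)\,\Delta t_{\ell-1} + \sum_{j=1}^{m} g_{ij}(X^c_n)\,\Delta w^{\ell-1}_{j,n+1} \\
&+ \sum_{j,k=1}^{m} h_{ijk}(X^c_n)\left(\Delta w^{\ell-1}_{j,n+1}\,\Delta w^{\ell-1}_{k,n+1} - \Omega_{jk}\,\Delta t_{\ell-1}\right).
\end{aligned}
$$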

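The first fine path approximation $X^f_n$ (that corresponds to $X^c_n$) uses the
corresponding discretization with timestep $\Delta t_{\ell-1}/2$,

$$
\begin{aligned}
X^f_{i,n+\frac12} = X^f_{i,n} &+ f_i(X^f_n)\,\Delta t_{\ell-1}/2 + \sum_{j=1}^{m} g_{ij}(X^f_n)\,\Delta w_{j,n+\frac12} \\
&+ \sum_{j,k=1}^{m} h_{ijk}(X^f_n)\left(\Delta w_{j,n+\frac12}\,\Delta w_{k,n+\frac12} - \Omega_{jk}\,\Delta t_{\ell-1}/2\right),
\end{aligned}
\tag{7.40}
$$

$$
\begin{aligned}
X^f_{i,n+1} = X^f_{i,n+\frac12} &+ f_i\big(X^f_{n+\frac12}\big)\,\Delta t_{\ell-1}/2 + \sum_{j=1}^{m} g_{ij}\big(X^f_{n+\frac12}\big)\,\Delta w_{j,n+1} \\
&+ \sum_{j,k=1}^{m} h_{ijk}\big(X^f_{n+\frac12}\big)\left(\Delta w_{j,n+1}\,\Delta w_{k,n+1} - \Omega_{jk}\,\Delta t_{\ell-1}/2\right),
\end{aligned}
\tag{7.41}
$$

where $\Delta w^{\ell-1}_{n+1} = \Delta w_{n+\frac12} + \Delta w_{n+1}$.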
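The antithetic approximation $X^a_n$ is defined by exactly the same discretization
except that the Brownian increments $\Delta w_{n+\frac12}$ and $\Delta w_{n+1}$ are swapped,
so that

$$
\begin{aligned}
X^a_{i,n+\frac12} = X^a_{i,n} &+ f_i(X^a_n)\,\Delta t_{\ell-1}/2 + \sum_{j=1}^{m} g_{ij}(X^a_n)\,\Delta w_{j,n+1} \\
&+ \sum_{j,k=1}^{m} h_{ijk}(X^a_n)\left(\Delta w_{j,n+1}\,\Delta w_{k,n+1} - \Omega_{jk}\,\Delta t_{\ell-1}/2\right), \\
X^a_{i,n+1} = X^a_{i,n+\frac12} &+ f_i\big(X^a_{n+\frac12}\big)\,\Delta t_{\ell-1}/2 + \sum_{j=1}^{m} g_{ij}\big(X^a_{n+\frac12}\big)\,\Delta w_{j,n+\frac12} \\
&+ \sum_{j,k=1}^{m} h_{ijk}\big(X^a_{n+\frac12}\big)\left(\Delta w_{j,n+\frac12}\,\Delta w_{k,n+\frac12} - \Omega_{jk}\,\Delta t_{\ell-1}/2\right).
\end{aligned}
\tag{7.42}
$$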
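It can be shown that [18]

Lemma 7.3. For all integers $p \ge 2$, there exists a constant $K_p$ such that

$$
\mathbb{E}\left[\max_{0\le n\le N} \left\|X^f_n - X^a_n\right\|^p\right] \le K_p\,\Delta t^{p/2}.
$$

Let us denote the average of the fine and antithetic paths as follows:

$$
\overline{X}^f_n \equiv \tfrac{1}{2}\left(X^f_n + X^a_n\right).
$$

The main result of [18] is the following theorem.
Theorem 7.5. For all $p \ge 2$, there exists a constant $K_p$ such that

$$
\mathbb{E}\left[\max_{0\le n\le N} \left\|\overline{X}^f_n - X^c_n\right\|^p\right] \le K_p\,\Delta t^{p}.
$$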

This, together with a classical strong convergence result for the Milstein
discretization, allows us to estimate the MLMC variance for smooth payoffs. In the
case of a payoff which is a smooth function of the final state $x(T)$, taking $p = 2$
in Lemma 7.1, $p = 4$ in Lemma 7.3, and $p = 2$ in Theorem 7.5 immediately
gives the result that the multilevel variance

$$
\mathbb{V}\left[\tfrac{1}{2}\left(P(X^f_N) + P(X^a_N)\right) - P(X^c_N)\right]
$$

has an $O(\Delta t^2)$ upper bound. This matches the convergence rate for the
multilevel method for scalar SDEs using the standard first-order Milstein
discretization, and is much better than the $O(\Delta t)$ convergence obtained with the EM
discretization.

However, very few financial payoff functions are twice differentiable on the
entire domain $\mathbb{R}^d$. A more typical 2D example is a call option based on the
minimum of two assets,
$$P(x(T)) \equiv \max\left(0, \min(x_1(T), x_2(T)) - K\right),$$
which is piecewise linear, with a discontinuity in the gradient along the three
lines $(s, K)$, $(K, s)$, and $(s, s)$ for $s \ge K$.
To handle such payoffs, an assumption which bounds the probability of
the solution of the SDE having a value at time T close to such lines with
discontinuous gradients is needed.

Assumption 7.1. The payoff function $P \in C(\mathbb{R}^d, \mathbb{R})$ has a uniform Lipschitz
bound, so that there exists a constant $L$ such that
$$|P(x) - P(y)| \le L\,|x - y|, \qquad \forall\, x, y \in \mathbb{R}^d,$$
and the first and second derivatives exist, are continuous and have uniform
bound $L$ at all points $x \notin K$, where $K$ is a set of zero measure, and there
exists a constant $c$ such that the probability of the SDE solution $x(T)$ being
within a neighborhood of the set $K$ has the bound
$$\mathbb{P}\left(\min_{y\in K} \|x(T) - y\| \le \varepsilon\right) \le c\,\varepsilon, \qquad \forall\, \varepsilon > 0.$$

In a 1D context, Assumption 7.1 corresponds to an assumption of a locally
bounded density for $x(T)$.
Giles and Szpruch in Reference 18 proved the following result.
Theorem 7.6. If the payoff satisfies Assumption 7.1, then

$$
\mathbb{E}\left[\left(\tfrac{1}{2}\left(P(X^f_N) + P(X^a_N)\right) - P(X^c_N)\right)^2\right] = o(\Delta t^{3/2-\delta})
$$

for any $\delta > 0$.

7.6.4 Piecewise linear interpolation analysis


The piecewise linear interpolant $X^c(t)$ for the coarse path is defined within
the coarse timestep interval $[t_k, t_{k+1}]$ as
$$X^c(t) \equiv (1-\lambda)\,X^c_k + \lambda\,X^c_{k+1}, \qquad \lambda \equiv \frac{t - t_k}{t_{k+1} - t_k}.$$
Likewise, the piecewise linear interpolants $X^f(t)$ and $X^a(t)$ are defined on the
fine timestep $[t_k, t_{k+\frac12}]$ as
$$X^f(t) \equiv (1-\lambda)\,X^f_k + \lambda\,X^f_{k+\frac12}, \qquad X^a(t) \equiv (1-\lambda)\,X^a_k + \lambda\,X^a_{k+\frac12}, \qquad \lambda \equiv \frac{t - t_k}{t_{k+\frac12} - t_k},$$
and there is a corresponding definition for the fine timestep $[t_{k+\frac12}, t_{k+1}]$. It
can be shown that [18]

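Theorem 7.7. For all $p \ge 2$, there exists a constant $K_p$ such that

$$
\sup_{0\le t\le T} \mathbb{E}\left[\left\|X^f(t) - X^a(t)\right\|^p\right] \le K_p\,\Delta t^{p/2},
\qquad
\sup_{0\le t\le T} \mathbb{E}\left[\left\|\overline{X}^f(t) - X^c(t)\right\|^p\right] \le K_p\,\Delta t^{p},
$$

where $\overline{X}^f(t)$ is the average of the piecewise linear interpolants $X^f(t)$ and $X^a(t)$.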

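For an Asian option, the payoff depends on the average

$$x_{ave} \equiv T^{-1} \int_0^T x(t)\,dt.$$

This can be approximated by integrating the appropriate piecewise linear
interpolant, which gives

$$
\begin{aligned}
X^c_{ave} &\equiv T^{-1}\int_0^T X^c(t)\,dt = N^{-1}\sum_{n=0}^{N-1} \tfrac{1}{2}\left(X^c_n + X^c_{n+1}\right), \\
X^f_{ave} &\equiv T^{-1}\int_0^T X^f(t)\,dt = N^{-1}\sum_{n=0}^{N-1} \tfrac{1}{4}\left(X^f_n + 2X^f_{n+\frac12} + X^f_{n+1}\right), \\
X^a_{ave} &\equiv T^{-1}\int_0^T X^a(t)\,dt = N^{-1}\sum_{n=0}^{N-1} \tfrac{1}{4}\left(X^a_n + 2X^a_{n+\frac12} + X^a_{n+1}\right).
\end{aligned}
$$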

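Due to Hölder's inequality,

$$
\mathbb{E}\left[\left\|X^f_{ave} - X^a_{ave}\right\|^p\right]
\le T^{-1}\int_0^T \mathbb{E}\left[\left\|X^f(t) - X^a(t)\right\|^p\right] dt
\le \sup_{[0,T]} \mathbb{E}\left[\left\|X^f(t) - X^a(t)\right\|^p\right],
$$

and similarly

$$
\mathbb{E}\left[\left\|\tfrac{1}{2}\left(X^f_{ave} + X^a_{ave}\right) - X^c_{ave}\right\|^p\right]
\le \sup_{[0,T]} \mathbb{E}\left[\left\|\overline{X}^f(t) - X^c(t)\right\|^p\right].
$$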

Hence, if the Asian payoff is a smooth function of the average, then we obtain
a second-order bound for the multilevel correction variance.
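As a small illustration of the averaging formulas above, the sketch below computes the coarse and fine time averages from stored path values. The function names and the array layout (fine values stored at $t_n, t_{n+\frac12}, t_{n+1}, \ldots$) are assumptions made for this example only.

```python
import numpy as np

def coarse_average(xc):
    """Time average of the coarse interpolant; xc holds the N+1 values X_0..X_N."""
    return np.mean(0.5 * (xc[:-1] + xc[1:]))

def fine_average(xf):
    """Time average of the fine (or antithetic) interpolant.

    xf holds 2N+1 values at t_0, t_{1/2}, t_1, ..., t_N (assumed layout).
    """
    return np.mean(0.25 * (xf[0:-2:2] + 2.0 * xf[1::2] + xf[2::2]))

# Example: the interpolant of x(t) = t on [0, 1] is exact, so both averages are 0.5.
print(coarse_average(np.linspace(0.0, 1.0, 5)), fine_average(np.linspace(0.0, 1.0, 9)))
```

Both functions are just the trapezoidal rule applied to the piecewise linear interpolant, which is why no additional quadrature error is introduced beyond that of the path approximation itself.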

This analysis can be extended to include payoffs which are a smooth function
of a number of intermediate variables, each of which is a linear functional
of the path $x(t)$ of the form

$$\int_0^T g^T(t)\,x(t)\,\mu(dt),$$

for some vector function g(t) and measure μ(dt). This includes weighted aver-
ages of x(t) at a number of discrete times, as well as continuously weighted
averages over the whole time interval.
As with the European options, the analysis can also be extended to pay-
offs which are Lipschitz functions of the average, and have first and second
derivatives which exist, and are continuous and uniformly bounded, except for
a set of points K of zero measure.

Assumption 7.2. The payoff $P \in C(\mathbb{R}^d, \mathbb{R})$ has a uniform Lipschitz bound,
so that there exists a constant $L$ such that

$$|P(x) - P(y)| \le L\,|x - y|, \qquad \forall\, x, y \in \mathbb{R}^d,$$

and the first and second derivatives exist, are continuous and have uniform
bound $L$ at all points $x \notin K$, where $K$ is a set of zero measure, and there exists
a constant $c$ such that the probability of $x_{ave}$ being within a neighborhood of
the set $K$ has the bound

$$\mathbb{P}\left(\min_{y\in K} \|x_{ave} - y\| \le \varepsilon\right) \le c\,\varepsilon, \qquad \forall\, \varepsilon > 0.$$

Theorem 7.8. If the payoff satisfies Assumption 7.2, then

$$
\mathbb{E}\left[\left(\tfrac{1}{2}\left(P(X^f_{ave}) + P(X^a_{ave})\right) - P(X^c_{ave})\right)^2\right] = o(\Delta t^{3/2-\delta})
$$

for any $\delta > 0$.

We refer the reader to Reference 18 for more details.

7.6.5 Simulations for antithetic Monte Carlo


Here we present numerical simulations for a European option for a process
simulated by X f , X a , and X c defined in Section 7.6.2 with initial condi-
tions x1 (0) = x2 (0) = 1. The results in Figure 7.2 are for a European call
option with terminal time 1 and strike K = 1, that is, P = (x(T ) − K)+ .
[Figure 7.2 has four panels. Top left: log2 variance of $P_\ell$ and $P_\ell - P_{\ell-1}$ against level $\ell$, with reference slopes. Top right: log2 |mean| of $P_\ell$ and $P_\ell - P_{\ell-1}$ against level $\ell$, with a reference slope. Bottom left: $\varepsilon^2$ cost against accuracy $\varepsilon$ for standard MC and MLMC. Bottom right: log2 variance of $X^f - X^c$ against level $\ell$, with a reference slope.]

FIGURE 7.2: Call option.

The top left plot shows the behavior of the variance of both $P_\ell$ and $P_\ell - P_{\ell-1}$.
The superimposed reference slope with rate 1.5 indicates that the variance
$V_\ell = \mathbb{V}[P_\ell - P_{\ell-1}] = O(\Delta t_\ell^{1.5})$, corresponding to $O(\varepsilon^{-2})$ computational
complexity of antithetic MLMC. The top right plot shows that $\mathbb{E}[P_\ell - P_{\ell-1}] =
O(\Delta t_\ell)$. The bottom left plot shows the computational complexity $C$ (as
defined in Theorem 7.1) with desired accuracy $\varepsilon$. The plot is of $\varepsilon^2 C$ versus $\varepsilon$,
because we expect to see that $\varepsilon^2 C$ is only weakly dependent on $\varepsilon$ for MLMC.
For standard Monte Carlo, theory predicts that $\varepsilon^2 C$ should be proportional
to the number of timesteps on the finest level, which in turn is roughly
proportional to $\varepsilon^{-1}$ due to the weak convergence order. For accuracy $\varepsilon = 10^{-4}$, the
antithetic MLMC is approximately 500 times more efficient than the standard
Monte Carlo. The bottom right plot shows that $\mathbb{V}[X_{1,\ell} - X_{1,\ell-1}] = O(\Delta t_\ell)$.
This corresponds to the standard strong convergence of order 0.5. We have
also tested the algorithm presented in Reference 18 for approximation of Asian
options. Our results were almost identical to those for the European options. In order
to treat lookback, digital, and barrier options, we found that a suitable
antithetic approximation to the Lévy areas is needed. For a suitable modification
of the antithetic MLMC estimator, we performed numerical experiments
in which we obtained $O(\varepsilon^{-2} \log(\varepsilon)^2)$ complexity for estimating barrier, digital,
and lookback options. Currently, we are working on a theoretical justification
of our results.
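The variance decay reported for the antithetic estimator can be explored with a short experiment. This is a sketch under stated assumptions rather than a reproduction of Figure 7.2: it uses the model of Section 7.6.2 with $x_1(0) = x_2(0) = 1$, assumes the call payoff is applied to the second component, and the function names, level range, path count, and seed are all illustrative choices.

```python
import numpy as np

def step(x1, x2, dw):
    """One step of the scheme in Eq. 7.36 (Milstein without the Levy area term)."""
    return x1 + dw[:, 0], x2 + x1 * dw[:, 1] + 0.5 * dw[:, 0] * dw[:, 1]

def correction_variance(level, n_paths, T=1.0, K=1.0, seed=0):
    """Sample variance of (P(X^f) + P(X^a))/2 - P(X^c) with 2**level fine steps."""
    rng = np.random.default_rng(seed + level)
    dt_fine = T / 2**level
    start = np.ones(n_paths)              # x1(0) = x2(0) = 1
    xf = xa = xc = (start, start)
    for _ in range(2 ** (level - 1)):     # loop over coarse timesteps
        first = rng.normal(0.0, np.sqrt(dt_fine), size=(n_paths, 2))
        second = rng.normal(0.0, np.sqrt(dt_fine), size=(n_paths, 2))
        xf = step(*step(*xf, first), second)
        xa = step(*step(*xa, second), first)
        xc = step(*xc, first + second)
    payoff = lambda x: np.maximum(x[1] - K, 0.0)  # assumption: call on x2(T)
    y = 0.5 * (payoff(xf) + payoff(xa)) - payoff(xc)
    return y.var()

levels = np.arange(2, 7)
variances = [correction_variance(int(l), 100_000) for l in levels]
slope = np.polyfit(levels, np.log2(variances), 1)[0]
print(slope)  # fitted decay rate of log2 variance per level
```

A fitted slope near $-1.5$ would be consistent with the $O(\Delta t_\ell^{1.5})$ behavior described above, although the exact value depends on the sample size and the levels used.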

7.7 Other Uses of Multilevel Method


7.7.1 Stochastic partial differential equations
The multilevel method has been used for a number of parabolic and elliptic
SPDE applications [37–39] but the first use for a financial SPDE is in a new
paper by Giles and Reisinger [40].
This paper considers an unusual SPDE which results from modelling credit
default probabilities,

$$
dp = -\mu\,\frac{\partial p}{\partial x}\,dt + \frac{1}{2}\,\frac{\partial^2 p}{\partial x^2}\,dt - \sqrt{\rho}\,\frac{\partial p}{\partial x}\,dM_t, \qquad x > 0,
\tag{7.43}
$$

subject to the boundary condition $p(0, t) = 0$. Here $p(x, t)$ represents the
probability density function for firms being a distance $x$ from default at time $t$. The
diffusive term is due to idiosyncratic factors affecting individual firms, while
the stochastic term due to the scalar Brownian motion $M_t$ corresponds to the
systemic movement due to random market effects affecting all firms.
Using a Milstein time discretization with uniform timestep $k$, and a central
space discretization of the spatial derivatives with uniform spacing $h$ gives the
numerical approximation

$$
\begin{aligned}
p^{n+1}_j = p^n_j &- \frac{\mu k + \sqrt{\rho k}\,Z_n}{2h}\left(p^n_{j+1} - p^n_{j-1}\right) \\
&+ \frac{(1-\rho)\,k + \rho\,k\,Z_n^2}{2h^2}\left(p^n_{j+1} - 2p^n_j + p^n_{j-1}\right),
\end{aligned}
\tag{7.44}
$$

where $Z_n$ are standard Normal random variables so that $\sqrt{k}\,Z_n$ corresponds
to an increment of the driving scalar Brownian motion.
The paper shows that the requirement for mean-square stability as the
grid is refined and $k, h \to 0$ is $k/h^2 \le (1 + 2\rho^2)^{-1}$, and in addition the accuracy
is $O(k, h^2)$. Because of this, the multilevel treatment considers a sequence of
grids with $h_\ell = h_{\ell-1}/2$, $k_\ell = k_{\ell-1}/4$.
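To make the update in Equation 7.44 concrete, here is a minimal sketch of a single timestep on a truncated grid. The function names are hypothetical, and setting the far boundary value to zero is an assumption made for the example (the source only specifies $p(0, t) = 0$).

```python
import numpy as np

def spde_step(p, mu, rho, k, h, z):
    """One explicit step of Eq. 7.44 for grid values p[0..J] and Normal draw z.

    Assumption: both boundary values are set to zero; only p(0, t) = 0 is
    given in the text, the far boundary is a truncation of the half-line.
    """
    a = (mu * k + np.sqrt(rho * k) * z) / (2.0 * h)
    b = ((1.0 - rho) * k + rho * k * z**2) / (2.0 * h**2)
    new = np.zeros_like(p)
    new[1:-1] = p[1:-1] - a * (p[2:] - p[:-2]) + b * (p[2:] - 2.0 * p[1:-1] + p[:-2])
    return new

def is_mean_square_stable(k, h, rho):
    """Check the stability requirement k / h^2 <= 1 / (1 + 2 rho^2)."""
    return k / h**2 <= 1.0 / (1.0 + 2.0 * rho**2)

# Usage: a few steps starting from a smooth bump (illustrative parameters).
h, k, mu, rho = 0.1, 0.005, 0.1, 0.2
x = h * np.arange(101)
p = np.exp(-((x - 3.0) ** 2))
p[0] = p[-1] = 0.0
rng = np.random.default_rng(0)
for _ in range(10):
    p = spde_step(p, mu, rho, k, h, rng.normal())
print(is_mean_square_stable(k, h, rho), h * p.sum())
```

Halving $h$ and quartering $k$ from one level to the next keeps the ratio $k/h^2$ fixed, so a grid that satisfies the stability check on one level satisfies it on all of them.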
The multilevel implementation is very straightforward, with the Brownian
increments for the fine path being summed pairwise to give the corresponding
Brownian increments for the coarse path. The payoff corresponds to different
tranches of a credit derivative that depends on a numerical approximation of
the integral

$$\int_0^\infty p(x, t)\,dx.$$

The computational cost increases by a factor of 8 on each level, and numerical
experiments indicate that the variance decreases by a factor of 16. The MLMC
Theorem still applies in this case, with $\beta = 4$ and $\gamma = 3$, and so the overall
computational complexity to achieve an $O(\varepsilon)$ RMS error is again $O(\varepsilon^{-2})$.

7.7.2 Nested simulation


The pricing of American options is one of the big challenges for Monte
Carlo methods in computational finance, and Belomestny and Schoenmakers
have recently written a very interesting paper on the use of MLMC for this
purpose [41]. Their method is based on Andersen and Broadie's dual simulation
method [42] in which a key component at each timestep in the simulation
is to estimate a conditional expectation using a number of subpaths.
In their multilevel treatment, Belomestny and Schoenmakers use the same
uniform timestep on all levels of the simulation. The quantity which changes
between different levels of simulation is the number of subsamples used to
estimate the conditional expectation.
To couple the coarse and fine levels, the fine level uses $N_\ell$ subsamples, and
the coarse level uses $N_{\ell-1} = N_\ell/2$ of them. Similar research by N. Chen∗ found
that the multilevel correction variance is reduced if the payoff on the coarse
level is replaced by an average of the payoffs obtained using the first $N_\ell/2$ and
second $N_\ell/2$ samples. This is similar in some ways to the antithetic approach
described in Section 7.6.
In future research, Belomestny and Schoenmakers intend to also change
the number of timesteps on each level, to increase the overall computational
benefits of the multilevel approach.
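The effect of averaging the two half-sample coarse payoffs can be seen in a toy nested simulation. The sketch below is purely illustrative and is not the method of [41]: the inner samples are hypothetical i.i.d. Normal draws, the outer payoff is an arbitrary smooth function $f$ of the inner sample mean, and the function name is an assumption.

```python
import numpy as np

def nested_corrections(f, n_fine, n_outer, seed=0):
    """Multilevel corrections for f(inner mean), with and without averaging.

    Hypothetical setup: inner samples are i.i.d. N(1, 1) draws standing in
    for subpath payoffs; n_fine is assumed to be even.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(1.0, 1.0, size=(n_outer, n_fine))
    half = n_fine // 2
    fine = f(z.mean(axis=1))                      # uses all n_fine subsamples
    coarse_first = f(z[:, :half].mean(axis=1))    # first half only
    coarse_second = f(z[:, half:].mean(axis=1))   # second half only
    plain = fine - coarse_first
    averaged = fine - 0.5 * (coarse_first + coarse_second)
    return plain, averaged

plain, averaged = nested_corrections(np.square, n_fine=64, n_outer=20_000)
print(plain.var(), averaged.var())
```

Both corrections have the same expectation, since the two halves are identically distributed, but for a smooth $f$ the averaged version cancels the leading-order term in the difference, which is the same mechanism as in the antithetic estimator of Section 7.6.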

7.7.3 Truncated series expansions


Building on earlier work by Broadie and Kaya [43], Glasserman and Kim
have recently developed an efficient method [44] of simulating the Heston
stochastic volatility model exactly [45].
The key to their algorithm is a method of representing the integrated
volatility over a time interval $[0, T]$, conditional on the initial and final values,
$v_0$ and $v_T$, as

$$
\left(\int_0^T V_s\,ds \;\Big|\; V_0 = v_0,\ V_T = v_T\right) \overset{d}{=} \sum_{n=1}^{\infty} x_n + \sum_{n=1}^{\infty} y_n + \sum_{n=1}^{\infty} z_n,
$$

where $x_n$, $y_n$, $z_n$ are independent random variables.
In practice, they truncate the series expansions at a level which ensures
the desired accuracy, but a more severe truncation would lead to a tradeoff
between accuracy and computational cost. This makes the algorithm a candidate
for a multilevel treatment in which the level $\ell$ computation performs the
truncation at $N_\ell$ (taken to be the same for all three series, for simplicity).
To give more details, the level $\ell$ computation would use

$$\sum_{n=1}^{N_\ell} x_n + \sum_{n=1}^{N_\ell} y_n + \sum_{n=1}^{N_\ell} z_n$$

∗Unpublished, but presented at the MCQMC12 conference.



while the level $\ell - 1$ computation would use

$$\sum_{n=1}^{N_{\ell-1}} x_n + \sum_{n=1}^{N_{\ell-1}} y_n + \sum_{n=1}^{N_{\ell-1}} z_n$$

with the same random variables $x_n$, $y_n$, $z_n$.


This kind of multilevel treatment has not been tested experimentally, but it
seems that it might yield some computational savings even though Glasserman
and Kim typically retain only 10 terms in their summations through the use
of a carefully constructed estimator for the truncated remainder. In other
circumstances requiring more terms to be retained, the savings may be larger.

7.7.4 Mixed precision arithmetic


The final example of the use of multilevel is unusual, because it concerns
the computer implementation of Monte Carlo algorithms.
In the latest CPUs from Intel and AMD, each core has a vector unit
which can perform 8 single precision or 4 double precision operations with one
instruction. Combined with the fact that double precision variables are
twice as big as single precision variables, and so take twice as long
to transfer in bulk, this makes single precision computations roughly twice as
fast as double precision computations. On GPUs (graphics processors), the
difference in performance can be even larger, up to a factor of 8 in the most
extreme cases.
This raises the question of whether single precision arithmetic is sufficient
for Monte Carlo simulations. In general, our view is that the errors due to
single precision arithmetic are much smaller than the errors due to
• Statistical error due to Monte Carlo sampling
• Bias due to SDE discretization
• Model uncertainty

We have just two concerns with single precision accuracy:
• There can be significant errors when averaging the payoffs unless one
uses binary tree summation to perform the summation.
• When computing Greeks using “bumping,” the single precision inaccuracy
can be greatly amplified if a small bump is used.
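The first concern can be illustrated with a short experiment comparing sequential accumulation with binary tree (pairwise) summation in single precision. This is a sketch with hypothetical helper names; it relies on `np.cumsum` accumulating sequentially in the requested dtype, and the constant input is chosen only to make the effect easy to see.

```python
import numpy as np

def sequential_sum32(x):
    """Left-to-right accumulation in single precision."""
    return np.cumsum(np.asarray(x, dtype=np.float32), dtype=np.float32)[-1]

def tree_sum32(x):
    """Binary tree summation in single precision (assumes a non-empty input)."""
    x = np.asarray(x, dtype=np.float32)
    while x.size > 1:
        if x.size % 2:                       # pad odd lengths with a zero
            x = np.append(x, np.float32(0.0))
        x = x[0::2] + x[1::2]                # add neighbouring pairs
    return x[0]

payoffs = np.full(10**6, 0.1, dtype=np.float32)
reference = float(np.float32(0.1)) * 10**6   # double precision reference value
print(abs(float(sequential_sum32(payoffs)) - reference))
print(abs(float(tree_sum32(payoffs)) - reference))
```

In the sequential version each small term is added to an ever larger running total, so rounding errors build up, whereas the tree version only ever adds numbers of comparable magnitude.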

Our advice would be to always use double precision for the final accumu-
lation of payoff values and pathwise sensitivity analysis as much as possible
for computing Greeks, but if there remains a need for the path simulation to
be performed in double precision then one could use a two-level approach in
which level 0 corresponds to single precision and level 1 corresponds to double
precision.

On both levels one would use the same random numbers. The multilevel
analysis would then give the optimal allocation of effort between the single
precision and double precision computations. Since it is likely that most of
the calculations would be single precision, the computational savings would
be a factor of 2 or more compared to standard double precision calculations.

7.8 Multilevel Quasi-Monte Carlo


In Theorem 7.1, if β > γ, so that the rate at which the multilevel vari-
ance decays with increasing grid level is greater than the rate at which the
computational cost increases, then the dominant computational cost is on the
coarsest levels of approximation.
Since coarse levels of approximation correspond to low-dimensional numer-
ical quadrature, it is quite natural to consider the use of quasi-Monte Carlo
techniques. This has been investigated by Giles and Waterhouse [46] in the
context of scalar SDEs with a Lipschitz payoff. Using the Milstein approxima-
tion with a doubling of the number of timesteps on each level gives β = 2 and
γ = 1. They used a rank-1 lattice rule to generate the quasi-random numbers,
randomization with 32 independent offsets to obtain confidence intervals, and
a standard Brownian Bridge construction of the increments of the driving
Brownian process.
Their empirical observation was that MLMC on its own was better than
QMC on its own, but the combination of the two was even better. The QMC
treatment greatly reduced the variance per sample for the coarsest levels,
resulting in significantly reduced costs overall. In the simplest case of a Euro-
pean call option, shown in Figure 7.3, the top left plot shows the reduction in
the variance per sample as the number of QMC points is increased. The benefit
is much greater on the coarsest levels than on the finest levels. In the bottom
two plots, the number of QMC points on each level is determined automatically
to obtain the required accuracy; see Reference 46 for the precise details.
Overall, the computational complexity appears to be reduced from $O(\varepsilon^{-2})$ to
approximately $O(\varepsilon^{-1.5})$.
Giles and Waterhouse interpreted the fact that the variance is not reduced
on the finest levels as being due to a lack of significant low-dimensional content,
that is, the difference in the two payoffs due to neighboring grid levels is due to
the difference in resolution of the driving Brownian path, and this is inherently
of high dimensionality. This suggests that in other applications with β < γ,
which would lead to the dominant cost being on the finest levels, then the use
of quasi-Monte Carlo methods is unlikely to yield any benefits.
Further research is needed in this area to investigate the use of other low
discrepancy sequences (e.g., Sobol) and other ways of generating the Brownian
increments (e.g., PCA). We also refer the reader to Reference 47 for some
results for randomized multilevel quasi-Monte Carlo.
[Figure 7.3 has four panels. Top left: log2 variance against level $l$ for 1, 16, 256, and 4096 QMC points. Top right: log2 |mean| of $P_l$ and $P_l - P_{l-1}$ against level $l$. Bottom left: $N_l$ against level $l$ for $\varepsilon$ = 0.00005, 0.0001, 0.0002, 0.0005, and 0.001. Bottom right: $\varepsilon^2$ cost against $\varepsilon$ for standard QMC and MLQMC.]

FIGURE 7.3: European call option (From Giles M.B. and Waterhouse B.J.
Advanced Financial Modelling, Radon Series on Computational and Applied
Mathematics, pages 165–181. de Gruyter, 2009.)

7.9 Conclusion
In the past 6 years, considerable progress has been achieved with the
MLMC method for financial options based on underlying assets described
by Brownian diffusions, jump diffusions, and more general Lévy processes.
The multilevel approach is conceptually very simple. In essence it is a
recursive control variate strategy, using a coarse path simulation as a control
variate for a fine path simulation, relying on strong convergence properties to
ensure a very strong correlation between the two.
In practice, the challenge is to couple the coarse and fine path simulations
as tightly as possible, minimizing the difference in the payoffs obtained for
each. In doing this, there is considerable freedom to be creative, as shown in
the use of Brownian Bridge constructions to improve the variance for lookback
and barrier options, and in the antithetic estimators for multidimensional
SDEs which would require the simulation of Lévy areas to achieve first-order
strong convergence. Another challenge is avoiding large payoff differences due
to discontinuous payoffs; here one can often use either conditional expectations
to smooth the payoff or a change of measure to ensure that the coarse and
fine paths are on the same side of the discontinuity.
Overall, multilevel methods are being used for an increasingly wide range
of applications. The biggest savings are in situations in which the coarsest
approximation is very much cheaper than the finest. If the finest level of
approximation has only 32 timesteps, then there are very limited savings to
be achieved, but if the finest level has 256 timesteps, then the potential savings
are much larger.
Looking to the future, exciting areas for further research include:
• More research on multilevel techniques for American and Bermudan
options
• More investigation of multilevel quasi-Monte Carlo methods
• Use of multilevel ideas for completely new financial applications, such as
Gaussian copula and new SPDE models.

Appendix 7A Analysis of Brownian Bridge Interpolation

Let $w^\alpha(t) = \alpha t + w(t)$ denote a Brownian motion with drift $\alpha$. Its running
minimum we denote by $m^\alpha(t) = \min_{0\le u\le t} w^\alpha(u) = \min_{0\le u\le t} \{\alpha u + w(u)\}$.
Using the Girsanov theorem and the reflection principle of Brownian
motion, we can derive the joint distribution of $(w^\alpha(t), m^\alpha(t))$ and, as a
consequence, the conditional distribution of $m^\alpha(t)$ given $w^\alpha(t)$ [48]:

$$
P\left(m^\alpha(t) \le y \mid w^\alpha(t) = z\right) = \exp\left(\frac{2y(z - y)}{t}\right).
\tag{7A.1}
$$

The Brownian bridge (7.11) for $t \in [s, u]$,

$$
x(t) = x(s) + \frac{t-s}{u-s}\,(x(u) - x(s)) + g(s)\left(w(t) - w(s) - \frac{t-s}{u-s}\,(w(u) - w(s))\right),
\tag{7A.2}
$$

can be obtained by considering arithmetic Brownian motion on the time interval
$[s, t]$,

$$
\begin{aligned}
x(t) &= x(s) + f(s)(t - s) + g(s)(w(t) - w(s)) \\
&= x(s) + g(s)\left(\frac{f(s)}{g(s)}\,(t - s) + (w(t) - w(s))\right) \\
&= x(s) + g(s)\,w^{\alpha(s)}_{t-s},
\end{aligned}
$$

with $\alpha(s) = f(s)/g(s)$. Similarly, the minimum $y_{s,t}$ of the process $(x(t))$ on
the time interval $[s, t]$ is given by

$$y_{s,t} = x(s) + g(s)\,m^{\alpha(s)}_{t-s}.$$

Hence, by Equation 7A.1,

$$
\begin{aligned}
P\left[y_{s,t} \le y \mid x(s), x(t)\right]
&= P\left[x(s) + g(s)\,m^{\alpha(s)}_{t-s} \le y \;\Big|\; x(s) + g(s)\,w^{\alpha(s)}_{t-s} = x(t)\right] \\
&= P\left[m^{\alpha(s)}_{t-s} \le \frac{y - x(s)}{g(s)} \;\Big|\; w^{\alpha(s)}_{t-s} = \frac{x(t) - x(s)}{g(s)}\right] \\
&= \exp\left(-\frac{2\,(x(s) - y)(x(t) - y)}{(g(s))^2\,(t - s)}\right).
\end{aligned}
\tag{7A.3}
$$

Now imagine we want to derive these probabilities over the time interval $[s, u]$,
where $t \le u$, conditioned on $x(s)$ and $x(u)$. The first strategy would be to
take Equation 7A.2 connecting $x(s)$ and $x(u)$ and calculate the conditional
distribution as we did in Equation 7A.3. The second strategy is as follows: (a)
first we sample a point $x(t)$ from the Brownian bridge connecting $x(s)$ and $x(u)$; (b) we
calculate the conditional distribution of the minimum of the Brownian bridge (Equation 7A.2)
conditioned first on $x(s)$, $x(t)$, and then on $x(t)$, $x(u)$. However, in order to
make sure both strategies give us results that are equivalent in distribution,
we are only allowed to use the same Brownian bridge as we used in the first
strategy. This has a consequence for calculating the conditional distribution of the
minimum given $x(t)$ and $x(u)$:

$$
\begin{aligned}
P\left[y_{t,u} \le y \mid x(t), x(u)\right]
&= P\left[x(t) + g(s)\,m^{\alpha(s)}_{u-t} \le y \;\Big|\; x(t) + g(s)\,w^{\alpha(s)}_{u-t} = x(u)\right] \\
&= \exp\left(-\frac{2\,(x(t) - y)(x(u) - y)}{(g(s))^2\,(u - t)}\right).
\end{aligned}
\tag{7A.4}
$$

Notice that we have not changed $g(s)$ to $g(t)$ and hence we have used the
same Brownian bridge for both strategies.
Another implication of the conditional distribution (7A.1) is that we can sample
the minimum $y_{s,t}$ explicitly. If we know the distribution function $F(z)$ of a
continuous random variable $Z$, we can generate $Z$ using a uniformly
distributed random variable $U$. Let $U \sim U([0, 1])$; then $F^{-1}(U) = Z$ in distribution,
where $F^{-1}$ is the inverse function. It is straightforward to see that from
Equation 7A.1 we have

$$m^\alpha_t = \tfrac{1}{2}\left(z - \sqrt{z^2 - 2t \log U}\right) \quad \text{in distribution}.$$

Now

$$
\begin{aligned}
y_{s,t} &= x(s) + g(s)\,m^{\alpha(s)}_{t-s} \\
&= x(s) + g(s)\,\tfrac{1}{2}\left(\frac{x(t) - x(s)}{g(s)} - \sqrt{\left(\frac{x(t) - x(s)}{g(s)}\right)^2 - 2\,(t - s) \log U}\right) \\
&= \tfrac{1}{2}\left(x(t) + x(s) - \sqrt{(x(t) - x(s))^2 - 2\,(g(s))^2\,(t - s) \log U}\right).
\end{aligned}
$$

In order to find minima in the case where we additionally subsample from the
Brownian bridge, we just need to invert the appropriate probabilities.
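The inverse-transform formula for the minimum can be checked against Equation 7A.3 with a few lines of code. This is a minimal sketch; the function name and the numerical values of $x(s)$, $x(t)$, $g(s)$, and $t - s$ are illustrative assumptions.

```python
import numpy as np

def sample_bridge_minimum(x_s, x_t, g, dt, u):
    """Sample the minimum over an interval of length dt, given both end values.

    Applies the closed-form inverse of Eq. 7A.3 to uniform draws u in (0, 1].
    """
    return 0.5 * (x_s + x_t - np.sqrt((x_t - x_s) ** 2 - 2.0 * g**2 * dt * np.log(u)))

rng = np.random.default_rng(0)
u = 1.0 - rng.random(100_000)               # uniform on (0, 1], avoids log(0)
mins = sample_bridge_minimum(1.0, 1.2, g=0.5, dt=0.25, u=u)
y = 0.9
empirical = np.mean(mins <= y)
theoretical = np.exp(-2.0 * (1.0 - y) * (1.2 - y) / (0.5**2 * 0.25))
print(empirical, theoretical)
```

Because $\log U \le 0$, the square root is never smaller than $|x(t) - x(s)|$, so every sampled minimum lies at or below both end values, as it must.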

References
1. Heinrich, S. Multilevel Monte Carlo Methods, volume 2179 of Lecture Notes in
Computer Science, pages 58–67. Springer-Verlag, 2001.

2. Giles, M.B. Multilevel Monte Carlo path simulation. Operations Research,
56(3):607–617, 2008.

3. Kebaier, A. Statistical Romberg extrapolation: A new variance reduction
method and applications to options pricing. Annals of Applied Probability,
15(4):2681–2705, 2005.

4. Speight, A.L. A multilevel approach to control variates. Journal of
Computational Finance, 12:1–25, 2009.

5. Speight, A.L. Multigrid techniques in economics. Operations Research,
58(4):1057–1078, 2010.

6. Pagès, G. Multi-step Richardson–Romberg extrapolation: Remarks on variance
control and complexity. Monte Carlo Methods and Applications, 13(1):37–70,
2007.

7. Giles, M.B. Improved multilevel Monte Carlo convergence using the Milstein
scheme. In Keller, A., Heinrich, S., and Niederreiter, H., editors, Monte Carlo
and Quasi-Monte Carlo Methods 2006, pages 343–358. Springer-Verlag, 2008.

8. Kloeden, P. and Neuenkirch, A. Convergence of numerical methods for
stochastic differential equations in mathematical finance. In Recent Developments in
Computational Finance: Foundations, Algorithms and Applications, pp. 49–80,
2013.

9. Mao, X. and Szpruch, L. Strong convergence rates for backward Euler–Maruyama
method for nonlinear dissipative-type stochastic differential equations
with super-linear diffusion coefficients. Stochastics: An International
Journal of Probability and Stochastic Processes, 85(1):144–171, 2013.

10. Szpruch, L., Mao, X., Higham, D.J., and Pan, J. Numerical simulation of a
strongly nonlinear Ait-Sahalia-type interest rate model. BIT Numerical
Mathematics, 51(2):405–425, 2011.

11. Kloeden, P.E., Neuenkirch, A., and Pavani, R. Multilevel Monte Carlo for
stochastic differential equations with additive fractional noise. Annals of
Operations Research, 189(1):255–276, 2011.

12. Kloeden, P.E. and Platen, E. Numerical Solution of Stochastic Differential
Equations. Springer, Berlin, 1992.

13. Gaines, J.G. and Lyons, T.J. Random generation of stochastic integrals. SIAM
Journal on Applied Mathematics, 54(4):1132–1146, 1994.

14. Rydén, T. and Wiktorsson, M. On the simulation of iterated Itô integrals.
Stochastic Processes and their Applications, 91(1):151–168, 2001.

15. Wiktorsson, M. Joint characteristic function and simultaneous simulation of
iterated Itô integrals for multiple independent Brownian motions. Annals of
Applied Probability, 11(2):470–487, 2001.

16. Clark, J.M.C. and Cameron, R.J. The maximum rate of convergence of discrete
approximations for stochastic differential equations. In Grigelionis, B., editor,
Stochastic Differential Systems Filtering and Control, pp. 162–171. Springer,
Berlin, Heidelberg, 1980.

17. Müller-Gronbach, T. Strong Approximation of Systems of Stochastic Differential
Equations. Habilitation thesis, TU Darmstadt, 2002.

18. Giles, M.B. and Szpruch, L. Antithetic Multilevel Monte Carlo estimation
for multi-dimensional SDEs without Lévy area simulation. Arxiv preprint
arXiv:1202.6283, 2012.

19. Giles, M.B., Higham, D.J., and Mao, X. Analysing multilevel Monte Carlo for
options with non-globally Lipschitz payoff. Finance and Stochastics,
13(3):403–413, 2009.

20. Debrabant, K., Giles, M.B., and Rossler, A. Numerical analysis of multilevel
Monte Carlo path simulation using Milstein discretization: Scalar case. Technical
Report, 2011.

21. Müller-Gronbach, T. The optimal uniform approximation of systems of
stochastic differential equations. The Annals of Applied Probability, 12(2):664–690,
2002.

22. Avikainen, R. On irregular functionals of SDEs and the Euler scheme. Finance
and Stochastics, 13(3):381–401, 2009.

23. Broadie, M., Glasserman, P., and Kou, S. A continuity correction for discrete
barrier options. Mathematical Finance, 7(4):325–348, 1997.

24. Glasserman, P. Monte Carlo Methods in Financial Engineering. Springer, New
York, 2004.

25. Giles, M.B. Monte Carlo evaluation of sensitivities in computational finance.
Technical Report NA07/12, 2007.

26. Burgos, S. and Giles, M.B. Computing Greeks using multilevel path simulation.
In Plaskota, L. and Woźniakowski, H., editors, Monte Carlo and Quasi-Monte
Carlo Methods 2010, pp. 281–296. Springer, Berlin, Heidelberg, 2012.

27. Asmussen, S. and Glynn, P. Stochastic Simulation. Springer, New York, 2007.

28. Giles, M.B. Multilevel Monte Carlo for basket options. In Winter Simulation
Conference, pp. 1283–1290. Winter Simulation Conference, 2009.

29. Xia, Y. and Giles, M.B. Multilevel path simulation for jump-diffusion SDEs.
In Plaskota, L. and Woźniakowski, H., editors, Monte Carlo and Quasi-Monte
Carlo Methods 2010, pp. 695–708. Springer, Berlin, Heidelberg, 2012.

30. Merton, R.C. Option pricing when underlying stock returns are discontinuous.
Journal of Financial Economics, 3:125–144, 1976.

31. Platen, E. and Bruti-Liberati, N. Numerical Solution of Stochastic Differential
Equations with Jumps in Finance. Springer, 2010.

32. Glasserman, P. and Merener, N. Convergence of a discretization scheme for
jump-diffusion processes with state-dependent intensities. Proceedings of the
Royal Society of London A, 460:111–127, 2004.

33. Xia, Y. Multilevel Monte Carlo method for jump-diffusion SDEs. Arxiv preprint
arXiv:1106.4730, 2011.

34. Dereich, S. and Heidenreich, F. A multilevel Monte Carlo algorithm for
Lévy-driven stochastic differential equations. Stochastic Processes and their
Applications, 121(7):1565–1587, 2011.

35. Dereich, S. Multilevel Monte Carlo algorithms for Lévy-driven SDEs with
Gaussian correction. Annals of Applied Probability, 21(1):283–311, 2011.

36. Karatzas, I. and Shreve, S.E. Brownian Motion and Stochastic Calculus.
Graduate Texts in Mathematics, Vol. 113. Springer, New York, 1991.

37. Barth, A., Schwab, C., and Zollinger, N. Multi-level Monte Carlo finite element
method for elliptic PDEs with stochastic coefficients. Numerische Mathematik,
119(1):123–161, 2011.

38. Cliffe, K.A., Giles, M.B., Scheichl, R., and Teckentrup, A. Multilevel Monte
Carlo methods and applications to elliptic PDEs with random coefficients.
Computing and Visualization in Science, 14(1):3–15, 2011.

39. Graubner, S. Multi-level Monte Carlo Methode für stochastische partielle
Differentialgleichungen. Diplomarbeit, TU Darmstadt, 2008.

40. Giles, M.B. and Reisinger, C. Stochastic finite differences and multilevel Monte
Carlo for a class of SPDEs in finance. SIAM Journal on Financial Mathematics,
3:572–592, 2012.

41. Belomestny, D. and Schoenmakers, J. Multilevel dual approach for pricing Amer-
ican style derivatives. Preprint 1647, WIAS, 2011.

42. Andersen, L. and Broadie, M. A primal–dual simulation algorithm for pric-


ing multi-dimensional American options. Management Science, 50(9):1222–1234,
2004.

43. Broadie, M. and Kaya, O. Exact simulation of stochastic volatility and other
affine jump diffusion processes. Operations Research, 54(2):217–231, 2006.

44. Glasserman, P. and Kim K.-K. Gamma expansion of the Heston stochastic
volatility model. Finance and Stochastics, 15(2):267–296, 2011.

45. Heston, S.I. A closed-form solution for options with stochastic volatility with
applications to bond and currency options. Review of Financial Studies, 6:327–
343, 1993.

46. Giles, M.B. and Waterhouse, B.J. Multilevel quasi-Monte Carlo path simulation.
In Advanced Financial Modelling, Radon Series on Computational and Applied
Mathematics, pages 165–181, 2009.

47. Gerstner, T. and Noll, M. Randomized multilevel quasi-Monte Carlo path sim-
ulation. In Recent Developments in Computational Finance: Foundations, Algo-
rithms and Applications, pp. 349–369, 2013.

48. Shreve, S.E. Stochastic Calculus for Finance: Continuous-Time Models, Vol. 2.
Springer, Berlin, Heidelberg, 2004.
Chapter 8
Fourier and Wavelet Option Pricing
Methods

Stefanus C. Maree, Luis Ortiz-Gracia, and Cornelis W. Oosterlee

CONTENTS
8.1 Introduction
    8.1.1 European option pricing problem
8.2 COS Method
    8.2.1 Density coefficients
    8.2.2 Plain vanilla payoff coefficients
    8.2.3 Domain truncation
    8.2.4 Pricing multiple strikes
8.3 Wavelet Series
8.4 WA[a,b] Method
    8.4.1 Density coefficients
    8.4.2 Plain vanilla payoff coefficients
8.5 SWIFT Method
    8.5.1 Density coefficients
    8.5.2 Payoff coefficients
    8.5.3 Pricing multiple strikes
8.6 Numerical Results
    8.6.1 Computational time
    8.6.2 Robustness of WA[a,b]
    8.6.3 Rate of convergence
    8.6.4 Multiple strike pricing
References

8.1 Introduction
In this overview chapter, we discuss exponentially converging techniques for
option valuation. We focus on the pricing of European options, since they are
the basic instruments within a calibration procedure when fitting the
parameters of the asset dynamics. The pricing problem amounts to computing the
discounted expectation of the payoff function. For the computation of the
expectation, we require knowledge of the corresponding probability density
function, which is typically not available for relevant stochastic asset price
processes. Many publications regarding highly efficient pricing of these
contracts are available, where computation often takes place in Fourier space.
Methods based on quadrature and the Fast Fourier Transform (FFT) [1–3] and
methods based on Fourier cosine expansions [4,5] have been developed because,
for relevant log-asset price processes, the characteristic function is
available. The characteristic function is defined as the Fourier transform of
the density function. Here, we extend the overview by discussing the recently
presented and highly promising class of wavelet option pricing techniques,
based on either B-splines or Shannon wavelets.
Cosines in the corresponding Fourier cosine expansions form a global basis,
which comes with disadvantages: for long maturity options, round-off errors may
accumulate, and for short maturity options, many cosine terms are needed for an
accurate representation of a highly peaked density function. Local wavelet
bases, relying on Haar and B-spline wavelets, have been considered by
Ortiz-Gracia and Oosterlee [6,7]. These local bases give flexibility and
enhance robustness when treating long maturity options and heavy-tailed asset
processes, but at the cost of a more involved computation and a certain loss of
accuracy for the standard cases, where the COS method [4] exhibits exponential
convergence. Employing these local wavelet bases, Kirkby [8] computes the
density coefficients by means of Parseval’s identity, instead of relying on
Cauchy’s integral theorem as in Ortiz-Gracia and Oosterlee [6], and uses an FFT
algorithm to speed up the method. Shannon wavelets, based on the cardinal sine
(sinc) function, are very interesting alternatives: we benefit from the local
features of the approximation of the density function, while the convergence of
the method is exponential due to the regularity of the employed wavelet basis.
Long and short maturity options are priced robustly and accurately, as are
options under fat-tailed asset price distributions. The resulting option
pricing method is called SWIFT (Shannon Wavelet Inverse Fourier Technique) and
also relies heavily on the use of the FFT.
We will confirm by numerical experiments that the COS and SWIFT meth-
ods exhibit exponential convergence. It is our opinion that the fastest converg-
ing methods should be implemented on high-performance computing (HPC)
platforms. It is well known that mainly with inferior methods a tremendous
speedup can be obtained on these platforms. It is much harder to speed up
highly efficient computational methods governed by low operation counts.
The required HPC speedup can also be achieved by the methods advocated
here, mainly in the context of the calibration exercise. During the calibra-
tion, European options need to be valued for many different strike prices,
and these multiple strike computations can be performed simultaneously and
independently. So, rather than reducing the computation time of an individual
option valuation, highly efficient parallelization should take place in the “strike
direction,” see also [9], which can easily be done under COS and SWIFT
pricing.

8.1.1 European option pricing problem


The pricing of European options in computational finance is governed by
the numerical solution of partial (integro-)differential equations. The corre-
sponding solution, being the option value at time t, can also be found by
means of the Feynman–Kac formula as a discounted expectation of the option
value at final time t = T , the so-called payoff function. Here we consider the
risk-neutral option valuation formula,

$$
v(x,t) = e^{-r(T-t)}\,\mathbb{E}^{Q}\left[v(y,T)\,|\,x\right]
       = e^{-r(T-t)} \int_{\mathbb{R}} v(y,T)\,f(y|x)\,dy, \tag{8.1}
$$

where $v$ denotes the option value, $T$ the maturity, $t$ the initial date,
$\mathbb{E}^{Q}$ the expectation operator under the risk-neutral measure $Q$,
$x$ and $y$ the state variables at time $t$ and $T$, respectively, $f(y|x)$ the
probability density function of $y$ given $x$, and $r$ the deterministic
risk-neutral interest rate.
Whereas $f$ is typically not known, the characteristic function is often
available. We represent the option values as functions of the scaled log-asset
prices, and denote these prices by,

$$
x = \ln(S_t/K) \quad\text{and}\quad y = \ln(S_T/K),
$$

where $S_t$ is the underlying price at time $t$ and $K$ the strike price.
The payoff $v(y,T)$ for European options in log-asset space is then given by,

$$
v(y,T) = \left[\alpha \cdot K(e^{y}-1)\right]^{+},
\quad\text{with}\quad
\alpha = \begin{cases} 1, & \text{for a call},\\ -1, & \text{for a put}. \end{cases} \tag{8.2}
$$

To determine the price of the option, we approximate the density function $f$
by a series expansion whose coefficients can be efficiently recovered from the
characteristic function.
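As a sanity check of Equation 8.1, the following sketch evaluates the discounted expectation by direct quadrature for a case in which the density is known in closed form. It assumes geometric Brownian motion, which is an assumption made here for illustration (the text does not fix a model at this point), so that y given x is normally distributed. All function names and parameter values are hypothetical.

```python
import math
import numpy as np

def payoff(y, K, alpha):
    """European payoff in log-asset space, Equation 8.2 (alpha=+1 call, -1 put)."""
    return np.maximum(alpha * K * (np.exp(y) - 1.0), 0.0)

def gbm_density(y, x, r, sigma, tau):
    """f(y|x) for y = ln(S_T/K) under geometric Brownian motion (assumed model)."""
    mean = x + (r - 0.5 * sigma**2) * tau
    var = sigma**2 * tau
    return np.exp(-(y - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def price_by_quadrature(S, K, r, sigma, tau, alpha, n=20001, width=12.0):
    """Equation 8.1 with the real line cut to mean +/- width standard deviations."""
    x = math.log(S / K)
    mean = x + (r - 0.5 * sigma**2) * tau
    sd = sigma * math.sqrt(tau)
    y = np.linspace(mean - width * sd, mean + width * sd, n)
    integrand = payoff(y, K, alpha) * gbm_density(y, x, r, sigma, tau)
    h = y[1] - y[0]
    # composite trapezoidal rule
    integral = h * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return math.exp(-r * tau) * integral

def black_scholes(S, K, r, sigma, tau, alpha):
    """Closed-form Black-Scholes price, used only as a reference value."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return alpha * (S * Phi(alpha * d1) - K * math.exp(-r * tau) * Phi(alpha * d2))
```

Direct quadrature needs the density itself. The methods in this chapter avoid that requirement by working with the characteristic function.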

8.2 COS Method


The COS method for European options, introduced by Fang and Oosterlee [4], is
based on the insight that the Fourier cosine series coefficients of $f(y|x)$
are closely related to its characteristic function. Since the density function
$f(y|x)$ decays rapidly as $y \to \pm\infty$, we can truncate the infinite
integration range in the risk-neutral valuation formula without losing
significant accuracy. Suppose that we have, with $[a,b] \subset \mathbb{R}$,

$$
\int_{\mathbb{R}\setminus[a,b]} f(y|x)\,dy < \mathrm{TOL},
$$

for some given tolerance TOL. Then we can approximate $v(x,t)$ in Equation 8.1
by,

$$
v(x,t) \approx v_1(x,t) = e^{-r(T-t)} \int_a^b v(y,T)\,f(y|x)\,dy. \tag{8.3}
$$

(The intermediate terms $v_i$ are used to distinguish approximation errors.) As
a second step, we replace the (unknown) density function $f(y|x)$ by its
Fourier cosine expansion over $[a,b]$,

$$
f(y|x) = {\sum_{k=0}^{\infty}}^{\prime} D_k(x) \cos\left(k\pi\,\frac{y-a}{b-a}\right),
\quad\text{where}\quad
D_k(x) = \frac{2}{b-a}\int_a^b f(y|x)\cos\left(k\pi\,\frac{y-a}{b-a}\right)dy, \tag{8.4}
$$

where the prime (′) after the summation sign denotes that the first term of the
summation is divided by 2. We will refer to $D_k(x)$ as the (Fourier cosine)
density coefficients.
Inserting the Fourier cosine expansion of $f(y|x)$ into Equation 8.3, using
Fubini’s Theorem, gives,

$$
v_1(x,t) = e^{-r(T-t)} {\sum_{k=0}^{\infty}}^{\prime} D_k(x)
\left[\int_a^b v(y,T)\cos\left(k\pi\,\frac{y-a}{b-a}\right)dy\right], \tag{8.5}
$$

where we note that the integral on the right-hand side is equal to the Fourier
cosine coefficient of $v(y,T)$ in $y$ (except for a constant). We therefore
define the payoff coefficients $V_k$ as the Fourier cosine series coefficients
of $v(y,T)$,

$$
V_k := \frac{2}{b-a}\int_a^b v(y,T)\cos\left(k\pi\,\frac{y-a}{b-a}\right)dy, \tag{8.6}
$$

and obtain,

$$
v_1(x,t) = \frac{b-a}{2}\,e^{-r(T-t)} {\sum_{k=0}^{\infty}}^{\prime} D_k(x)V_k.
$$

Due to the rapid decay of the payoff and density coefficients, we can further
truncate the series summation to obtain,

$$
v(x,t) \approx v_2(x,t) = \frac{b-a}{2}\,e^{-r(T-t)} {\sum_{k=0}^{N-1}}^{\prime} D_k(x)V_k.
$$

8.2.1 Density coefficients


The strength of the COS method is the insight that the Fourier cosine
coefficients Dk (x) are closely related to the conditional characteristic function.
Let the Fourier transform of a function $f$, whenever it exists, be given by,

$$
\hat f(\omega) := \int_{\mathbb{R}} f(y)\,e^{-i\omega y}\,dy,
$$

then we can define the conditional characteristic function $\check f(\omega;x)$
related to the density function $f(y|x)$ as
$\check f(\omega;x) := \hat f(-\omega;x)$.
Since we assume that the density function $f(y|x)$ is an
$L^2(\mathbb{R})$-function, the characteristic function $\check f(\omega;x)$ is
also in $L^2(\mathbb{R})$, and it can be approximated well by,

$$
\check f(\omega;x) = \int_{\mathbb{R}} f(y|x)\,e^{i\omega y}\,dy
\approx \int_a^b f(y|x)\,e^{i\omega y}\,dy.
$$

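Using the approximation of the characteristic function, we can derive an
approximation for the density coefficients,

$$
\begin{aligned}
D_k(x) &= \frac{2}{b-a}\int_a^b f(y|x)\cos\left(k\pi\,\frac{y-a}{b-a}\right)dy \\
&= \frac{2}{b-a}\,\mathrm{Re}\left\{ e^{-ik\pi\frac{a}{b-a}}
   \int_a^b f(y|x)\exp\left(i\,\frac{k\pi}{b-a}\,y\right)dy \right\} \\
&\approx \frac{2}{b-a}\,\mathrm{Re}\left\{ e^{-ik\pi\frac{a}{b-a}}
   \int_{\mathbb{R}} f(y|x)\exp\left(i\,\frac{k\pi}{b-a}\,y\right)dy \right\} \\
&= \frac{2}{b-a}\,\mathrm{Re}\left\{ \check f\left(\frac{k\pi}{b-a};x\right)
   e^{-ik\pi\frac{a}{b-a}} \right\} =: D_k^*(x).
\end{aligned} \tag{8.7}
$$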

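In a final step, we replace $D_k(x)$ by its approximation $D_k^*(x)$ in
$v_2(x,t)$ to obtain the general COS pricing formula,

$$
v(x,t) \approx v_3(x,t) = e^{-r(T-t)} {\sum_{k=0}^{N-1}}^{\prime}
\mathrm{Re}\left\{ \check f\left(\frac{k\pi}{b-a};x\right)
e^{-ik\pi\frac{a}{b-a}} \right\} V_k, \tag{8.8}
$$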

where the payoff coefficients Vk depend on the option type.
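Equation 8.7 can be checked numerically whenever both the density and the characteristic function are known. The sketch below assumes a normal density as a test case (an assumption for illustration only), computes the coefficients D_k^* from the characteristic function, and rebuilds the density from the truncated cosine series of Equation 8.4. Function names are hypothetical.

```python
import numpy as np

def normal_cf(omega, mu, sd):
    """Characteristic function E[exp(i*omega*y)] of a normal law (assumed test case)."""
    return np.exp(1j * omega * mu - 0.5 * (sd * omega) ** 2)

def cos_density_coefficients(cf, a, b, N):
    """D_k^* of Equation 8.7 for k = 0, ..., N-1."""
    k = np.arange(N)
    omega = k * np.pi / (b - a)
    return 2.0 / (b - a) * np.real(cf(omega) * np.exp(-1j * omega * a))

def cos_density(y, coeffs, a, b):
    """Truncated cosine expansion of Equation 8.4; the k = 0 term is halved."""
    k = np.arange(len(coeffs))
    weights = coeffs.copy()
    weights[0] *= 0.5
    return np.cos(np.outer(y - a, k) * np.pi / (b - a)) @ weights

mu, sd = 0.1, 0.25
a, b = mu - 10 * sd, mu + 10 * sd          # wide interval, tails are negligible
D = cos_density_coefficients(lambda w: normal_cf(w, mu, sd), a, b, N=64)
y = np.linspace(mu - 3 * sd, mu + 3 * sd, 7)
exact = np.exp(-(y - mu) ** 2 / (2 * sd**2)) / (sd * np.sqrt(2 * np.pi))
max_error = np.max(np.abs(cos_density(y, D, a, b) - exact))
```

For this smooth density the coefficients decay very quickly, so a few dozen terms already reproduce the density to high accuracy.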



8.2.2 Plain vanilla payoff coefficients


The payoff coefficients $V_k$, as defined in Equation 8.6, for a European call
(or put) with payoff function as in Equation 8.2 are given by,

$$
V_k = \frac{2}{b-a}\int_a^b \left[\alpha \cdot K(e^{y}-1)\right]^{+}
\cos\left(k\pi\,\frac{y-a}{b-a}\right)dy.
$$

Let us consider a European call option, $\alpha = 1$. For a put, the steps are
similar. We distinguish two different cases. If $a < b < 0$, the integral
equals zero, and $V_k = 0$ for all $k$. In the other case, set
$\bar a = \max(0, a)$. We can then rewrite $V_k$ as,

$$
V_k = K\left[ \frac{2}{b-a}\int_{\bar a}^{b} e^{y}\cos\left(k\pi\,\frac{y-a}{b-a}\right)dy
- \frac{2}{b-a}\int_{\bar a}^{b} \cos\left(k\pi\,\frac{y-a}{b-a}\right)dy \right],
$$

where the first term within the brackets represents the Fourier cosine
coefficient of the function $e^{y}$ and the second term the Fourier cosine
coefficient of the constant function 1, both restricted to $[\bar a, b]$. Both
integrals can be evaluated analytically using basic calculus, and for a proof
the reader is referred to [4].
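The two integrals have elementary antiderivatives, which can be verified by differentiation. The sketch below writes them out and assembles the call coefficients. The helper names chi and psi are labels chosen here for the two integrals, and the interval and strike in the usage lines are arbitrary illustrative values.

```python
import numpy as np

def chi(k, a, b, c, d):
    """Integral of exp(y) * cos(k*pi*(y-a)/(b-a)) over [c, d], in closed form."""
    w = np.asarray(k, dtype=float) * np.pi / (b - a)
    return (np.cos(w * (d - a)) * np.exp(d) - np.cos(w * (c - a)) * np.exp(c)
            + w * np.sin(w * (d - a)) * np.exp(d)
            - w * np.sin(w * (c - a)) * np.exp(c)) / (1.0 + w**2)

def psi(k, a, b, c, d):
    """Integral of cos(k*pi*(y-a)/(b-a)) over [c, d], in closed form."""
    k = np.asarray(k, dtype=float)
    w = k * np.pi / (b - a)
    safe_w = np.where(k == 0, 1.0, w)   # avoid 0/0; the k = 0 entry is replaced below
    value = (np.sin(w * (d - a)) - np.sin(w * (c - a))) / safe_w
    return np.where(k == 0, d - c, value)

def call_payoff_coefficients(K, a, b, N):
    """V_k of Equation 8.6 for a European call, k = 0, ..., N-1."""
    k = np.arange(N)
    if b <= 0.0:
        return np.zeros(N)              # payoff vanishes on the whole interval
    a_bar = max(0.0, a)
    return 2.0 / (b - a) * K * (chi(k, a, b, a_bar, b) - psi(k, a, b, a_bar, b))

V = call_payoff_coefficients(K=100.0, a=-1.0, b=1.5, N=8)
```

The put coefficients follow in the same way by integrating K(1 - e^y) over the part of [a, b] below zero.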

8.2.3 Domain truncation


A next step in the derivation of the COS pricing formula is to determine the
truncation interval $[a,b] \subset \mathbb{R}$. It is important that the
interval contains almost all “mass” of the density function $f$, that is,
$\int_a^b f(y|x)\,dy \approx 1$. This interval might be hard to determine for
distributions with fat tails or when little is known about the underlying
distribution.
A heuristic solution proposed by Fang and Oosterlee [4] is to make use of the
cumulants of the underlying distribution. Let $c_n$ denote the $n$th cumulant
of $y = \ln(S_T/K)$ and let $L$ be a scaling parameter, then,

$$
[a,b] := \left[\, c_1 - L\sqrt{c_2 + \sqrt{c_4}},\;\; c_1 + L\sqrt{c_2 + \sqrt{c_4}} \,\right], \tag{8.9}
$$

where $L$ is suggested to be chosen in the range $[7.5, 10]$. Cumulants for a
number of common underlying models are given by Fang and Oosterlee [4].
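Putting Equations 8.7 to 8.9 together gives a complete pricer. The following is a minimal sketch, assuming geometric Brownian motion as the example model (so that y is normal and the fourth cumulant is zero). It is not the authors' Matlab implementation, and all names are hypothetical.

```python
import math
import numpy as np

def truncation_range(c1, c2, c4, L=10.0):
    """Equation 8.9: [a, b] from the first, second and fourth cumulants."""
    half_width = L * math.sqrt(c2 + math.sqrt(c4))
    return c1 - half_width, c1 + half_width

def chi_psi(k, a, b, c, d):
    """Closed-form integrals of exp(y)*cos(.) and cos(.) over [c, d]."""
    w = k * np.pi / (b - a)
    chi = (np.cos(w * (d - a)) * np.exp(d) - np.cos(w * (c - a)) * np.exp(c)
           + w * (np.sin(w * (d - a)) * np.exp(d) - np.sin(w * (c - a)) * np.exp(c))
           ) / (1.0 + w**2)
    psi = np.empty_like(w)
    psi[0] = d - c
    psi[1:] = (np.sin(w[1:] * (d - a)) - np.sin(w[1:] * (c - a))) / w[1:]
    return chi, psi

def cos_price_gbm(S, K, r, sigma, tau, alpha, N=128, L=10.0):
    """COS price (Equation 8.8) of a European call (alpha=1) or put (alpha=-1)
    under geometric Brownian motion, an assumed example model."""
    x = math.log(S / K)
    # cumulants of y = ln(S_T/K): normal law, hence c4 = 0
    c1 = x + (r - 0.5 * sigma**2) * tau
    c2 = sigma**2 * tau
    a, b = truncation_range(c1, c2, 0.0, L)
    k = np.arange(N)
    w = k * np.pi / (b - a)
    cf = np.exp(1j * w * c1 - 0.5 * c2 * w**2)   # characteristic function of y given x
    if alpha == 1:
        lo, hi = max(0.0, a), max(0.0, b)        # call pays for y > 0
    else:
        lo, hi = min(0.0, a), min(0.0, b)        # put pays for y < 0
    chi, psi = chi_psi(k, a, b, lo, hi)
    V = 2.0 / (b - a) * alpha * K * (chi - psi)
    terms = np.real(cf * np.exp(-1j * w * a)) * V
    terms[0] *= 0.5                              # primed sum: first term halved
    return math.exp(-r * tau) * terms.sum()

call = cos_price_gbm(100.0, 100.0, 0.05, 0.2, 1.0, alpha=1)
put = cos_price_gbm(100.0, 100.0, 0.05, 0.2, 1.0, alpha=-1)
```

For other models only the cumulants and the characteristic function change. The payoff coefficients and the summation stay the same.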

8.2.4 Pricing multiple strikes


It is worth mentioning that Equation 8.8 is greatly simplified for the Lévy and
Heston models, so that options for many strike prices can be computed
simultaneously. We denote vectors with bold-faced characters. For a vector of
strikes $\mathbf{K}$, the $V_k$ formulas for European options can be factored
as $\mathbf{V}_k = \mathbf{K}U_k$, where $U_k$ is, independent of the strike,
given by,

$$
U_k = \left[ \frac{2}{b-a}\int_{\bar a}^{b} e^{y}\cos\left(k\pi\,\frac{y-a}{b-a}\right)dy
- \frac{2}{b-a}\int_{\bar a}^{b} \cos\left(k\pi\,\frac{y-a}{b-a}\right)dy \right].
$$

For Lévy processes, whose characteristic functions can be represented by,

$$
\check f(\omega;x) = \check f_{\mathrm{levy}}(\omega)\,e^{i\omega x},
\quad\text{with}\quad \check f_{\mathrm{levy}}(\omega) := \check f(\omega;0),
$$

the pricing formula is simplified to,

$$
v(\mathbf{x},t) \approx e^{-r(T-t)}\,\mathbf{K} {\sum_{k=0}^{N-1}}^{\prime}
\mathrm{Re}\left\{ \check f_{\mathrm{levy}}\left(\frac{k\pi}{b-a}\right)
e^{ik\pi\frac{\mathbf{x}-a}{b-a}} \right\} U_k, \tag{8.10}
$$

where the summation can be written as a matrix–vector product if $\mathbf{K}$
(and therefore $\mathbf{x}$) is a vector. We see that the evaluation of the
characteristic function is independent of the strike. In general, the
evaluation of the characteristic function is more expensive than the other
computations.
In the section with numerical results, we show that with very small $N$ we
can achieve highly accurate results.
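The matrix–vector form of Equation 8.10 can be sketched as follows for European puts. Geometric Brownian motion is assumed as a simple Lévy example, and the single interval [a, b] shared by all strikes is a choice made for this sketch (the cumulant range of ln(S_T/S_t), widened by the spread of x). Function names are hypothetical.

```python
import numpy as np

def put_U(a, b, N):
    """Strike-independent put coefficients U_k = V_k / K on [a, min(b, 0)]."""
    k = np.arange(N)
    w = k * np.pi / (b - a)
    c, d = min(a, 0.0), min(b, 0.0)
    chi = (np.cos(w * (d - a)) * np.exp(d) - np.cos(w * (c - a)) * np.exp(c)
           + w * (np.sin(w * (d - a)) * np.exp(d) - np.sin(w * (c - a)) * np.exp(c))
           ) / (1.0 + w**2)
    psi = np.empty(N)
    psi[0] = d - c
    psi[1:] = (np.sin(w[1:] * (d - a)) - np.sin(w[1:] * (c - a))) / w[1:]
    return 2.0 / (b - a) * (psi - chi)

def cos_puts_all_strikes(S, strikes, r, sigma, tau, N=256, L=10.0):
    """Equation 8.10 for a vector of strikes under GBM (assumed example model)."""
    strikes = np.asarray(strikes, dtype=float)
    x = np.log(S / strikes)                       # one x per strike
    c1 = (r - 0.5 * sigma**2) * tau               # cumulants of ln(S_T/S_t)
    c2 = sigma**2 * tau
    # one common range covering all strikes (a choice made for this sketch)
    a = x.min() + c1 - L * np.sqrt(c2)
    b = x.max() + c1 + L * np.sqrt(c2)
    w = np.arange(N) * np.pi / (b - a)
    cf_levy = np.exp(1j * w * c1 - 0.5 * c2 * w**2)   # evaluated once for all strikes
    weights = cf_levy * put_U(a, b, N)
    weights[0] *= 0.5                             # primed sum: first term halved
    # (number of strikes x N) matrix times N-vector: all strikes in one product
    phase = np.exp(1j * np.outer(x - a, w))
    return np.exp(-r * tau) * strikes * np.real(phase @ weights)

prices = cos_puts_all_strikes(100.0, [80.0, 100.0, 120.0], 0.05, 0.2, 1.0)
```

The characteristic function is evaluated N times in total, regardless of the number of strikes. The rows of the phase matrix are independent, which is what makes parallelization in the strike direction straightforward.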

8.3 Wavelet Series


As for the COS method, the wavelet methods discussed in the following sections
use a series expansion, this time in terms of wavelet bases, to approximate the
density function. In this section, we briefly introduce Multiresolution
Analysis (MRA), a general framework for wavelets. Extensive theory on MRA and
wavelets in general can be found in the work by Daubechies [10]. In the next
two sections, we first describe the WA[a,b] method [6], which is a
wavelet-based method using B-splines, and then the SWIFT method [11], a similar
approach based on Shannon wavelets.
The starting point of MRA is a family of closed nested subspaces,

$$
\cdots \subset V_{-2} \subset V_{-1} \subset V_0 \subset V_1 \subset V_2 \subset \cdots,
$$

in $L^2(\mathbb{R})$, where,

$$
\bigcap_{j\in\mathbb{Z}} V_j = \{0\}, \qquad
\overline{\bigcup_{j\in\mathbb{Z}} V_j} = L^2(\mathbb{R}),
$$

and,

$$
f(x) \in V_j \iff f(2x) \in V_{j+1}.
$$

If these conditions are met, then a function $\phi \in V_0$ exists, such that
$\{\phi_{j,k}\}_{k\in\mathbb{Z}}$ forms an orthonormal basis of $V_j$, where,

$$
\phi_{j,k}(x) = 2^{j/2}\,\phi(2^{j}x - k).
$$

In other words, the function $\phi$, called the scaling function or father
wavelet, generates an orthonormal basis for each subspace $V_j$.
Let us define $W_j$ in such a way that $V_{j+1} = V_j \oplus W_j$. That is,
$W_j$ is the orthogonal complement of $V_j$ in $V_{j+1}$, and so
$L^2(\mathbb{R}) = \bigoplus_{j\in\mathbb{Z}} W_j$. Then a function
$\psi \in W_0$ exists, the mother wavelet, such that by defining,

$$
\psi_{j,k}(x) = 2^{j/2}\,\psi(2^{j}x - k),
$$

the wavelet family $\{\psi_{j,k}\}_{k\in\mathbb{Z}}$ gives rise to an
orthonormal basis of $W_j$ and $\{\psi_{j,k}\}_{j,k\in\mathbb{Z}}$ is a wavelet
basis of $L^2(\mathbb{R})$.
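For any $f \in L^2(\mathbb{R})$, a projection map
$P_m : L^2(\mathbb{R}) \to V_m$ is defined by means of,

$$
P_m f(x) = \sum_{j=-\infty}^{m-1}\sum_{k\in\mathbb{Z}} d_{j,k}\,\psi_{j,k}(x)
         = \sum_{k\in\mathbb{Z}} c_{m,k}\,\phi_{m,k}(x), \tag{8.11}
$$

where

$$
d_{j,k} = \int_{\mathbb{R}} f(x)\,\psi_{j,k}(x)\,dx, \qquad
c_{m,k} = \int_{\mathbb{R}} f(x)\,\phi_{m,k}(x)\,dx.
$$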

Note that the first part in Equation 8.11 is a truncated wavelet series. If j were
allowed to go to +∞, we would have the full wavelet series. The second part
in Equation 8.11 gives us an equivalent sum in terms of the scaling functions
φm,k . When m tends to infinity, by the theory of MRA, the truncated wavelet
series converges to f .
As opposed to the Fourier series in Equation 8.4, wavelets can be translated
(by means of k) and stretched or compressed (by means of j) to accurately
represent local properties of a function.
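The second form of Equation 8.11 is easy to try out with the simplest scaling function, the indicator of [0, 1). The sketch below is an illustration under that assumption: the coefficients c_{m,k} are approximated by a midpoint rule (a numerical shortcut chosen here), and the test function is a standard normal density. Names are hypothetical.

```python
import numpy as np

def haar_projection(f, m, x, lo=-8.0, hi=8.0, samples=64):
    """Evaluate P_m f at points x for the scaling function phi = 1 on [0, 1).

    c_{m,k} = int f(t) phi_{m,k}(t) dt is approximated by the midpoint rule on
    the support [k/2^m, (k+1)/2^m) of phi_{m,k}. Only supports inside [lo, hi]
    are kept, which assumes that f is negligible outside that range."""
    scale = 2.0**m
    k = np.floor(np.asarray(x) * scale).astype(int)   # the one k with x in the support
    offsets = (np.arange(samples) + 0.5) / samples    # midpoints of the sub-cells
    t = (k[:, None] + offsets[None, :]) / scale
    c = scale**0.5 * f(t).mean(axis=1) / scale        # c_{m,k}
    inside = (k / scale >= lo) & ((k + 1) / scale <= hi)
    return np.where(inside, c * scale**0.5, 0.0)      # c_{m,k} * phi_{m,k}(x)

gauss = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)
x = np.linspace(-3.0, 3.0, 601)
errors = [np.max(np.abs(haar_projection(gauss, m, x) - gauss(x))) for m in (2, 4, 6)]
```

With this basis the projection is a piecewise-constant cell average, so the maximum error shrinks roughly in proportion to the cell width 2^(-m). Smoother scaling functions, discussed next, converge faster.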

8.4 WA[a,b] Method


The WA[a,b] method is based on a wavelet series expansion using the $j$th order
cardinal B-splines as the scaling function for $j = 0$ and $j = 1$. The
B-spline of order zero is defined as,

$$
\phi^{0}(x) := \begin{cases} 1, & \text{if } x \in [0,1),\\ 0, & \text{otherwise}, \end{cases}
$$

and it is the scaling function of the Haar wavelet system.
Higher order B-splines are defined recursively by a convolution,

$$
\phi^{j}(x) = \int_0^1 \phi^{j-1}(x-t)\,dt, \qquad j \ge 1,
$$

but the resulting wavelet family $\{\phi^{j}_{m,k}\}_{m,k\in\mathbb{Z}}$ does
not form an orthonormal basis of $L^2(\mathbb{R})$. However, they do form a
Riesz basis, a relaxation of orthonormality, which still allows us to apply
MRA. For details about Riesz bases, see [10].
Following the MRA framework, we define the wavelet family
$\{\phi^{j}_{m,k}\}_{m,k\in\mathbb{Z}}$ with wavelets,

$$
\phi^{j}_{m,k}(x) = 2^{m/2}\,\phi^{j}(2^{m}x - k),
$$

for a fixed wavelet scale $m$. We discuss the choice of $m \in \mathbb{N}$ in
the numerical section at the end of this chapter.
Cardinal B-spline functions are compactly supported, with support
$[0, j+1]$, and their Fourier transform is,

$$
\hat\phi^{j}(\omega) = \left(\frac{1 - e^{-i\omega}}{i\omega}\right)^{j+1}.
$$

Since splines are only piecewise polynomial functions (see Figure 8.1), they
are very easy to implement.
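The convolution recursion and the Fourier transform formula can be cross-checked numerically. The sketch below tabulates the B-spline of order j on a grid by repeated numerical integration (a simple approach chosen for illustration, accurate to roughly the grid step) and provides the closed-form transform for comparison. Function names are hypothetical.

```python
import numpy as np

def cardinal_bspline(j, n=1000):
    """Tabulate phi^j on [0, j+1] with n grid points per unit interval, using
    phi^j(x) = int_0^1 phi^{j-1}(x - t) dt = F(x) - F(x - 1),
    where F is the antiderivative of phi^{j-1}."""
    h = 1.0 / n
    x = np.arange((j + 1) * n + 1) / n
    phi = np.where(x < 1.0, 1.0, 0.0)            # order 0: indicator of [0, 1)
    for _ in range(j):
        # antiderivative by the cumulative trapezoidal rule
        F = np.concatenate(([0.0], np.cumsum(0.5 * h * (phi[1:] + phi[:-1]))))
        # F(x - 1) on the same grid, with F = 0 to the left of the origin
        shifted = np.concatenate((np.zeros(n), F[:-n]))
        phi = F - shifted
    return x, phi

def bspline_ft(j, omega):
    """Closed-form Fourier transform of phi^j for omega != 0."""
    return ((1.0 - np.exp(-1j * omega)) / (1j * omega)) ** (j + 1)

x2, phi2 = cardinal_bspline(2)
h2 = x2[1] - x2[0]
g = phi2 * np.exp(-1j * 1.7 * x2)
numeric_ft = h2 * (g.sum() - 0.5 * (g[0] + g[-1]))   # trapezoidal rule at omega = 1.7
```

Here numeric_ft should agree with bspline_ft(2, 1.7) up to the discretization error of the grid.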
In Ortiz-Gracia and Oosterlee [6], two methods for approximating the den-
sity function are described. We focus on the method that applies a Wavelet
Approximation on a bounded interval [a, b], the WA[a,b] method.

FIGURE 8.1: Cardinal B-splines of orders j = 0, 1, 2, 3. For j = 0, we have
the scaling function of the Haar wavelet system.

We assume that the density function $f(y|x)$ is an $L^2(\mathbb{R})$-function,
and thus the mass in the tails tends to zero when $y \to \pm\infty$, so that it
can be well approximated in a finite interval $[a,b]$ by,

$$
f^{c}(y|x) = \begin{cases} f(y|x), & \text{if } y \in [a,b],\\ 0, & \text{otherwise}. \end{cases}
$$

Following the theory of MRA on a bounded interval¹, we can approximate
$f^{c} \approx f^{c}_{m,j}$ for all $y \in [a,b]$, where,

$$
f^{c}_{m,j}(y|x) = \sum_{k=0}^{(j+1)\cdot(2^{m}-1)} D^{j}_{m,k}(x)\,
\phi^{j}_{m,k}\left((j+1)\cdot\frac{y-a}{b-a}\right), \qquad j \ge 0, \tag{8.12}
$$

with convergence in the $L^2(\mathbb{R})$-norm, and where $D^{j}_{m,k}(x)$ are
the wavelet density coefficients.
We use the cumulants method of Section 8.2.3 to determine the interval $[a,b]$
and obtain the truncated risk-neutral option valuation formula for $v_1(x,t)$
as in Equation 8.3. Then, by substituting $f$ by $f^{c}_{m,j}$, we obtain by
interchange of integration and summation,

$$
\begin{aligned}
v_2(x,t) &= e^{-r(T-t)} \int_a^b v(y,T)\,f^{c}_{m,j}(y|x)\,dy \\
&= e^{-r(T-t)} \sum_{k=0}^{(j+1)\cdot(2^{m}-1)} D^{j}_{m,k}(x)\,V^{j}_{m,k},
\end{aligned} \tag{8.13}
$$

where we defined the payoff coefficients as,

$$
V^{j}_{m,k} = \int_a^b v(y,T)\,\phi^{j}_{m,k}\left((j+1)\cdot\frac{y-a}{b-a}\right)dy. \tag{8.14}
$$

8.4.1 Density coefficients


¹ Scaling functions on a bounded interval are discussed in detail by Chui [12].

As in Ortiz-Gracia and Oosterlee [6], we use Cauchy’s integral formula to find
an expression for the density coefficients $D^{j}_{m,k}(x)$. An alternative
approach, based on Parseval’s identity, is described by Kirkby [8].
The main idea behind the Wavelet Approximation method is to approximate
$\hat f$ by $\hat f^{c}_{m,j}$ and then to compute the coefficients
$D^{j}_{m,k}$ by inverting the Fourier transform. Proceeding this way, we have,

$$
\begin{aligned}
\hat f(\omega;x) &= \int_{\mathbb{R}} f(y|x)\,e^{-i\omega y}\,dy
\approx \int_{\mathbb{R}} f^{c}_{m,j}(y|x)\,e^{-i\omega y}\,dy \\
&= \sum_{k=0}^{(j+1)\cdot(2^{m}-1)} D^{j}_{m,k}(x)
\left[\int_{\mathbb{R}} \phi^{j}_{m,k}\left((j+1)\cdot\frac{y-a}{b-a}\right) e^{-i\omega y}\,dy\right].
\end{aligned}
$$

Introducing a change of variables, $u = (j+1)\cdot\frac{y-a}{b-a}$, gives us,
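$$
\begin{aligned}
\hat f(\omega;x) &\approx \frac{b-a}{j+1}\cdot e^{-ia\omega}
\sum_{k=0}^{(j+1)\cdot(2^{m}-1)} D^{j}_{m,k}(x)
\left[\int_{\mathbb{R}} \phi^{j}_{m,k}(u)\,e^{-i\omega\frac{b-a}{j+1}u}\,du\right] \\
&= \frac{b-a}{j+1}\cdot e^{-ia\omega}
\sum_{k=0}^{(j+1)\cdot(2^{m}-1)} D^{j}_{m,k}(x)\,
\hat\phi^{j}_{m,k}\left(\frac{b-a}{j+1}\cdot\omega\right).
\end{aligned}
$$

Taking into account that
$\hat\phi^{j}_{m,k}(\omega) = 2^{-\frac{m}{2}}\,\hat\phi^{j}\left(\frac{\omega}{2^{m}}\right) e^{-i\frac{k}{2^{m}}\omega}$
and performing a change of variables,
$z = e^{-i\frac{b-a}{2^{m}(j+1)}\omega}$, we find by rearranging terms,

$$
P^{j}_{m}(z;x) \approx Q^{j}_{m}(z;x), \tag{8.15}
$$

where,

$$
P^{j}_{m}(z;x) := \sum_{k=0}^{(j+1)\cdot(2^{m}-1)} D^{j}_{m,k}(x)\,z^{k},
\qquad
Q^{j}_{m}(z;x) := \frac{2^{\frac{m}{2}}(j+1)\,z^{-\frac{2^{m}(j+1)a}{b-a}}\,
\hat f\left(\frac{2^{m}(j+1)}{b-a}\,i\cdot\log(z);x\right)}
{(b-a)\,\hat\phi^{j}\left(i\cdot\log(z)\right)}.
$$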
Since $P^{j}_{m}(z;x)$ is a polynomial (in $z$), it is (in particular) analytic
inside a disc of the complex plane $\{z \in \mathbb{C} : |z| < \rho\}$ for
$\rho > 0$. We can obtain expressions for the coefficients $D^{j}_{m,k}(x)$ by
means of Cauchy’s integral formula. That is,

$$
D^{j}_{m,k}(x) = \frac{1}{2\pi i}\oint_{\gamma} \frac{P^{j}_{m}(z;x)}{z^{k+1}}\,dz,
\qquad k = 0, \ldots, (j+1)\cdot(2^{m}-1),
$$

where $\gamma$ denotes a circle of radius $\rho$, $\rho > 0$, about the origin.
We set $\rho = 0.9995$ [13]. Considering now the change of variables
$z = \rho e^{iu}$, and the approximation
$P^{j}_{m}(z;x) \approx Q^{j}_{m}(z;x)$, gives us,

$$
D^{j}_{m,k}(x) \approx \frac{1}{2\pi\rho^{k}} \int_0^{2\pi} Q^{j}_{m}(\rho e^{iu};x)\,e^{-iku}\,du. \tag{8.16}
$$

We approximate the above integral with the trapezoidal rule over the grid
points $u_n = n\frac{2\pi}{N}$ for $N = 2^{m}(j+1)$ and
$n = 0, 1, 2, \ldots, N-1$. Thus the final approximation for the density
coefficients is,

$$
D^{j}_{m,k}(x) \approx D^{j,*}_{m,k}(x) := \frac{1}{N\rho^{k}}\,
\mathrm{Re}\left\{ \sum_{n=0}^{N-1} Q^{j}_{m}(\rho e^{iu_n};x)\,e^{-i\frac{2\pi}{N}kn} \right\}. \tag{8.17}
$$

Note that we can directly apply the FFT algorithm to compute the whole vector
of coefficients $\{D^{j}_{m,k}\}_{k=0}^{N-1}$ with a computational complexity
of just $O(N \cdot \log_2 N)$.
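The structure of Equation 8.17, a trapezoidal rule on a circle evaluated with one FFT, can be illustrated independently of the pricing problem. In the sketch below the function Q is a plain polynomial with known real coefficients, which is an assumption made here so that the result can be checked. The function name is hypothetical.

```python
import numpy as np
from numpy.polynomial.polynomial import polyval

def coefficients_by_cauchy_fft(Q, N, rho=0.9995):
    """Approximate the first N Taylor coefficients of Q about the origin with
    the trapezoidal rule on the circle |z| = rho, evaluated by a single FFT."""
    u = 2.0 * np.pi * np.arange(N) / N          # grid points u_n
    samples = Q(rho * np.exp(1j * u))           # Q(rho * e^{i u_n})
    k = np.arange(N)
    # numpy's fft computes sum_n samples[n] * exp(-2*pi*i*k*n/N)
    return np.real(np.fft.fft(samples)) / (N * rho**k)

true_coeffs = np.array([0.3, -1.2, 0.5, 2.0, 0.0, 0.7, -0.4, 1.1])
recovered = coefficients_by_cauchy_fft(lambda z: polyval(z, true_coeffs), N=8)
```

For a polynomial of degree below N the rule is exact up to rounding. If the function has Taylor coefficients beyond index N - 1, the coefficient with index k + N is aliased onto index k with weight rho^N, which is one reason to take the radius slightly below one.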
The resulting B-spline wavelet pricing formula for general European options is,

$$
v(x,t) \approx e^{-r(T-t)} \sum_{k=0}^{(j+1)\cdot(2^{m}-1)} D^{j,*}_{m,k}(x)\,V^{j}_{m,k}, \tag{8.18}
$$

where the density coefficients $D^{j,*}_{m,k}(x)$ are given by Equation 8.17.
The payoff coefficients depend on the type of contract, which we discuss in the
following section for plain vanilla options.

8.4.2 Plain vanilla payoff coefficients


We derive the payoff coefficients for a European call (or put) with payoff as
in Equation 8.2. The payoff coefficients, as defined in Equation 8.14, are
given by,

$$
V^{j}_{m,k} = \int_a^b \left[\alpha \cdot K(e^{y}-1)\right]^{+}
\phi^{j}_{m,k}\left((j+1)\cdot\frac{y-a}{b-a}\right)dy. \tag{8.19}
$$

Let us consider a European call option, $\alpha = 1$. For a put, the steps are
similar. We distinguish two different cases. If $a < b < 0$, the integral
equals zero, and $V^{j}_{m,k} = 0$ for all $k$. In the other case, set
$\bar a = \max(0, a)$. We can then rewrite $V^{j}_{m,k}$ as,

$$
V^{j}_{m,k} = K\left[ \int_{\bar a}^{b} e^{y}\,\phi^{j}_{m,k}\left((j+1)\cdot\frac{y-a}{b-a}\right)dy
- \int_{\bar a}^{b} \phi^{j}_{m,k}\left((j+1)\cdot\frac{y-a}{b-a}\right)dy \right]. \tag{8.20}
$$

Both of the integrals can be evaluated analytically using basic calculus, and
for a proof the reader is referred to [6].

8.5 SWIFT Method


The SWIFT method, introduced by Ortiz-Gracia and Oosterlee [11] and extended by
Colldeforns-Papiol et al. [14] and Maree et al. [15], is similar to the WA[a,b]
method in methodology. The main novelty is the type of wavelet: the so-called
Shannon wavelet is used, which gives the method its name, the “Shannon Wavelet
Inverse Fourier Technique” (SWIFT). Shannon wavelet approximations are
appealing due to their exponential convergence for the smooth density functions
that occur in finance.
The Shannon scaling function is given by,

$$
\phi(x) = \mathrm{sinc}(x) := \begin{cases} \dfrac{\sin(\pi x)}{\pi x}, & \text{if } x \neq 0,\\ 1, & \text{if } x = 0, \end{cases}
$$

and the mother wavelet is,

$$
\psi(x) = \frac{\sin\left(\pi\left(x - \tfrac{1}{2}\right)\right) - \sin\left(2\pi\left(x - \tfrac{1}{2}\right)\right)}{\pi\left(x - \tfrac{1}{2}\right)},
$$

and following the theory of MRA, we define the wavelet family
$\{\phi_{m,k}\}_{m,k\in\mathbb{Z}}$ with wavelets,

$$
\phi_{m,k}(x) = 2^{m/2}\,\phi(2^{m}x - k).
$$

The Shannon scaling function and mother wavelet are shown in Figure 8.2.
FIGURE 8.2: Shannon scaling function φ(x) (phi, thick line) and wavelet
ψ(x) (psi, dashed line).

In terms of time–frequency analysis, the Shannon scaling function is the
opposite of the Haar scaling function $\phi^{0}$ we saw in the previous
section. The Shannon wavelet, a regular function in the time domain, is a
compactly supported rectangle in the frequency domain, given by,

$$
\hat\phi_{m,k}(\omega) = 2^{-m/2}\,e^{-i\frac{k}{2^{m}}\omega}\,
\mathrm{rect}\left(\frac{\omega}{2^{m+1}\pi}\right),
$$

where the rectangle function is defined as,

$$
\mathrm{rect}(x) = \begin{cases} 1, & \text{if } |x| < 1/2,\\ 1/2, & \text{if } |x| = 1/2,\\ 0, & \text{if } |x| > 1/2. \end{cases}
$$
Following the MRA framework, the truncated Shannon wavelet expansion of the
density function $f(y|x)$ is given by,

$$
f(y|x) \approx P_m f(y|x) = \sum_{k\in\mathbb{Z}} D_{m,k}(x)\,\phi_{m,k}(y),
\quad\text{with}\quad
D_{m,k}(x) := \int_{\mathbb{R}} f(y|x)\,\phi_{m,k}(y)\,dy, \tag{8.21}
$$

where the scaling functions are defined from $\phi(x) = \mathrm{sinc}(x)$ for a
fixed wavelet scale $m \in \mathbb{N}$.
Since Shannon wavelets have infinite support, we take a different approach in
truncating the wavelet series. We note that for $h \in \mathbb{Z}$,

$$
f\left(\frac{h}{2^{m}}\,\Big|\,x\right) \approx P_m f\left(\frac{h}{2^{m}}\,\Big|\,x\right)
= 2^{\frac{m}{2}} \sum_{k\in\mathbb{Z}} D_{m,k}(x)\,\delta_{k,h} = 2^{\frac{m}{2}} D_{m,h}(x).
$$

Now, since $f \in L^2(\mathbb{R})$ and it is nonnegative, and if we assume that
$\lim_{x\to\pm\infty} f(x) = 0$, then we conclude that $D_{m,k}$ vanishes as
well as $k \to \pm\infty$. We therefore approximate the infinite series in
Equation 8.21 by a finite summation without considerable loss of density mass,

$$
f(y|x) \approx f_m(y|x) := \sum_{k=k_1}^{k_2} D_{m,k}(x)\,\phi_{m,k}(y), \tag{8.22}
$$

for conveniently chosen integers $k_1 < k_2$. When setting
$I_m = \left[\frac{k_1}{2^{m}}, \frac{k_2}{2^{m}}\right]$, the option pricing
formula becomes,

$$
\begin{aligned}
v(x,t) &= e^{-r(T-t)} \int_{\mathbb{R}} v(y,T)\,f(y|x)\,dy \\
&\approx e^{-r(T-t)} \sum_{k=k_1}^{k_2} D_{m,k}(x)\,V_{m,k},
\end{aligned} \tag{8.23}
$$

where the payoff coefficients are defined as,

$$
V_{m,k} := \int_{I_m} v(y,T)\,\phi_{m,k}(y)\,dy. \tag{8.24}
$$

We define the truncation parameters $k_1$ and $k_2$ as the smallest integers
(in absolute sense), such that when $a$ and $b$ are the truncation parameters
of Section 8.2.3, determined by the cumulants of the density function, we have,

$$
\frac{k_1}{2^{m}} \le a < b \le \frac{k_2}{2^{m}},
$$

which implies that $[a,b] \subset I_m$.

8.5.1 Density coefficients


In Ortiz-Gracia and Oosterlee [11], two methods are presented to compute the
density coefficients $D_{m,k}$ in Equation 8.21. We discuss the approach based
on Vieta’s formula, as this approach allows us to control the numerical error.

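Theorem 8.1. For $J \in \mathbb{N}$, we can approximate the sinc function by,

$$
\mathrm{sinc}(t) \approx \mathrm{sinc}^{*}(t) := \frac{1}{2^{J-1}} \sum_{j=1}^{2^{J-1}}
\cos\left(\frac{2j-1}{2^{J}}\,\pi t\right), \tag{8.25}
$$

where the absolute error is bounded by,

$$
|\mathrm{sinc}(t) - \mathrm{sinc}^{*}(t)| \le \frac{(\pi c)^{2}}{2^{2(J+1)} - (\pi c)^{2}}, \tag{8.26}
$$

for $t \in [-c, c]$, where $c \in \mathbb{R}$, $c > 0$, and
$J \ge \log_2(\pi c)$.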

Proof. We show how to find the expression for $\mathrm{sinc}^{*}(t)$. The proof
of the error bound is Lemma 2 in Ortiz-Gracia and Oosterlee [11]. As shown by
Vieta, and described by Gearhart and Shultz [16], the sinc function can be
written as an infinite product,

$$
\mathrm{sinc}(t) = \prod_{j=1}^{\infty} \cos\left(\frac{\pi t}{2^{j}}\right), \tag{8.27}
$$

and by truncating the infinite product to a finite product with $J$ factors,
we can apply the cosine product-to-sum identity described by Quine and
Abrarov [17]. This gives the desired result,

$$
\mathrm{sinc}(t) \approx \prod_{j=1}^{J} \cos\left(\frac{\pi t}{2^{j}}\right)
= \frac{1}{2^{J-1}} \sum_{j=1}^{2^{J-1}} \cos\left(\frac{2j-1}{2^{J}}\,\pi t\right)
=: \mathrm{sinc}^{*}(t).
$$
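Theorem 8.1 is straightforward to check numerically. The sketch below evaluates the cosine sum of Equation 8.25 and compares the observed error on [-c, c] with the bound of Equation 8.26. The value c = 4 and the chosen values of J are arbitrary illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def sinc_vieta(t, J):
    """sinc*(t) of Equation 8.25: the average of 2^(J-1) cosines."""
    j = np.arange(1, 2 ** (J - 1) + 1)
    return np.cos(np.outer(t, (2 * j - 1) * np.pi / 2**J)).mean(axis=1)

def error_bound(c, J):
    """Right-hand side of Equation 8.26, valid for J >= log2(pi * c)."""
    return (np.pi * c) ** 2 / (2.0 ** (2 * (J + 1)) - (np.pi * c) ** 2)

c = 4.0                                   # log2(pi * c) is about 3.65, so J >= 4
t = np.linspace(-c, c, 2001)
results = []
for J in (4, 6, 8):
    # np.sinc is the normalized sinc, sin(pi t) / (pi t), as used in the text
    observed = np.max(np.abs(np.sinc(t) - sinc_vieta(t, J)))
    results.append((J, observed, error_bound(c, J)))
```

Each increment of J doubles the number of cosine terms and reduces the bound by roughly a factor of four.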

If we write out the definition of the coefficients $D_{m,k}$ in Equation 8.21,
we get,

$$
D_{m,k}(x) = 2^{\frac{m}{2}} \int_{\mathbb{R}} f(y|x)\,\phi(2^{m}y - k)\,dy.
$$

Approximating $\mathrm{sinc}$ by $\mathrm{sinc}^{*}$ from Theorem 8.1 gives us,

$$
D_{m,k}(x) \approx D^{*}_{m,k}(x) := \frac{2^{m/2}}{2^{J-1}} \sum_{j=1}^{2^{J-1}}
\int_{\mathbb{R}} f(y|x)\cos\left(\frac{2j-1}{2^{J}}\,\pi(2^{m}y - k)\right)dy.
$$

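We now note the resemblance between the integral in the right-hand side of the
equation above and the integral in the COS method in Equation 8.7. In a similar
way, we replace the integral over the unknown density function by its Fourier
transform,

$$
\begin{aligned}
\int_{\mathbb{R}} f(y|x)\cos\left(\frac{2j-1}{2^{J}}\,\pi(2^{m}y - k)\right)dy
&= \mathrm{Re}\left\{ \int_{\mathbb{R}} f(y|x)\exp\left(-i\,\frac{2j-1}{2^{J}}\,\pi(2^{m}y - k)\right)dy \right\} \\
&= \mathrm{Re}\left\{ e^{i\frac{2j-1}{2^{J}}\pi k}
   \int_{\mathbb{R}} f(y|x)\exp\left(-i\,\frac{(2j-1)\pi 2^{m}}{2^{J}}\,y\right)dy \right\} \\
&= \mathrm{Re}\left\{ e^{i\frac{2j-1}{2^{J}}\pi k}\,
   \hat f\left(\frac{(2j-1)\pi 2^{m}}{2^{J}};x\right) \right\}.
\end{aligned} \tag{8.28}
$$

Inserting this into the density coefficients gives us the expression,

$$
D^{*}_{m,k}(x) = \frac{2^{m/2}}{2^{J-1}} \sum_{j=1}^{2^{J-1}}
\mathrm{Re}\left\{ \hat f\left(\frac{(2j-1)\pi 2^{m}}{2^{J}};x\right)
e^{\frac{ik\pi(2j-1)}{2^{J}}} \right\}. \tag{8.29}
$$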

A strategy for choosing $J$ follows from Theorem 8.1: with
$M_{m,k} := \max(|2^{m}a - k|, |2^{m}b - k|)$ and
$M_m := \max_{k_1<k<k_2} M_{m,k}$, we set
$J := \lceil \log_2(\pi M_m) \rceil$, where $\lceil x \rceil$ denotes the
smallest integer greater than or equal to $x$. For a proof, the reader is
referred to [11].
Although a different $J$ could be chosen for every $k$, we fix one value for
all $k$, such that we can benefit from the efficiency of the FFT algorithm to
compute the vector of density coefficients $\{D_{m,k}(x)\}_{k=k_1}^{k_2}$ at
once, as described by Ortiz-Gracia and Oosterlee [11].
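Equation 8.29 can be tried out on a density with a known Fourier transform. The sketch below assumes a normal density as a test case and evaluates the sum by direct matrix multiplication rather than by the FFT described in [11], which is a simplification made here for readability. It then uses the sampling relation stated earlier in this section, f(k/2^m) ≈ 2^(m/2) D_{m,k}, as a check. All names and parameter values are hypothetical.

```python
import numpy as np

def swift_density_coefficients(f_hat, m, k, J):
    """D^*_{m,k} of Equation 8.29 for an integer array k (direct summation)."""
    j = np.arange(1, 2 ** (J - 1) + 1)
    theta = (2 * j - 1) * np.pi / 2**J
    values = f_hat(theta * 2**m)                  # f_hat at (2j-1)*pi*2^m / 2^J
    phases = np.exp(1j * np.outer(k, theta))      # e^{i k pi (2j-1) / 2^J}
    return 2.0 ** (m / 2) / 2 ** (J - 1) * np.real(phases @ values)

mu, sd, m = 0.05, 0.25, 4
# Fourier transform int f(y) e^{-i w y} dy of a normal density (assumed test case)
f_hat = lambda w: np.exp(-1j * w * mu - 0.5 * (sd * w) ** 2)
a, b = mu - 10 * sd, mu + 10 * sd
k1, k2 = int(np.floor(2**m * a)), int(np.ceil(2**m * b))
k = np.arange(k1, k2 + 1)
# choice of J from the largest sinc argument that can occur
M = max(np.max(np.abs(2**m * a - k)), np.max(np.abs(2**m * b - k)))
J = int(np.ceil(np.log2(np.pi * M)))
D = swift_density_coefficients(f_hat, m, k, J)
density_at_nodes = 2.0 ** (m / 2) * D             # approximates f(k / 2^m)
exact = np.exp(-((k / 2**m - mu) ** 2) / (2 * sd**2)) / (sd * np.sqrt(2 * np.pi))
max_error = np.max(np.abs(density_at_nodes - exact))
```

The coefficients are therefore scaled samples of the density on the grid k/2^m, which makes it easy to inspect whether k1 and k2 capture enough mass.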

8.5.2 Payoff coefficients


We show how to compute the payoff coefficients for a European call (or put),
based on Ortiz-Gracia and Oosterlee [11]. In contrast to the COS and WA[a,b]
methods, we do not have an analytic expression for the payoff coefficients, but
we can once more benefit from the FFT algorithm for an efficient approximation.
We look for an expression for the payoff coefficients of a European call. The
steps for deriving a formula for the European put are similar. Recall that the
payoff coefficients for a European call are defined by,

$$
V_{m,k} = K\left[ \int_{I_m} e^{y}\,\phi_{m,k}(y)\,dy - \int_{I_m} \phi_{m,k}(y)\,dy \right], \tag{8.30}
$$

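as in Equation 8.24. Let us define,

$$
I^{1}_{j,k}(a,b) := \int_a^b e^{y}\cos\left(\omega_j(2^{m}y - k)\right)dy
$$

and,

$$
I^{2}_{j,k}(a,b) := \int_a^b \cos\left(\omega_j(2^{m}y - k)\right)dy,
$$

where $\omega_j := \frac{2j-1}{2^{J}}\pi$. Note that these integrals are just a
change of variables of the integrals we had to solve for the COS payoff
coefficients. For a proof the reader is referred to [11].
When $k_2 \le 0$, the payoff coefficients vanish, that is, $V_{m,k} = 0$ for
every $k$. In case $0 < k_2$, we set $\bar k_1 := \max(k_1, 0)$ and write
Equation 8.30 as,

$$
V_{m,k} \approx V^{*}_{m,k} := K\,\frac{2^{m/2}}{2^{J-1}} \sum_{j=1}^{2^{J-1}}
\left[ I^{1}_{j,k}\left(\frac{\bar k_1}{2^{m}}, \frac{k_2}{2^{m}}\right)
- I^{2}_{j,k}\left(\frac{\bar k_1}{2^{m}}, \frac{k_2}{2^{m}}\right) \right]. \tag{8.31}
$$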

Remark 8.1. As in the case of the density coefficients, we consider a constant
$J$ for all $k$, which we call $\bar{\jmath}$, defined by
$\bar{\jmath} := \lceil \log_2(\pi N) \rceil$, where
$N := \max_{k_1<k<k_2} N_k$ and $N_k := \max(|\bar k_1 - k|, |k_2 - k|)$. This
allows us to compute the whole vector of payoff coefficients
$\{V_{m,k}\}_{k=k_1}^{k_2}$ with the help of the FFT algorithm [11].

The resulting SWIFT pricing formula for general European options is,

$$
v(x,t) \approx e^{-r(T-t)} \sum_{k=k_1}^{k_2} D^{*}_{m,k}(x)\,V^{*}_{m,k}, \tag{8.32}
$$

where the density coefficients are given by Equation 8.29, and the payoff
coefficients $V^{*}_{m,k}$ for a European call are given by Equation 8.31.

8.5.3 Pricing multiple strikes


As for the COS method, both the WA[a,b] and SWIFT methods can be applied to
efficiently price multiple strikes simultaneously. We describe the approach for
the SWIFT method when the underlying process is a Lévy process or the Heston
model. For these processes, the density function can be written as,

$$
f(y|x) = \frac{1}{2\pi}\int_{\mathbb{R}} \hat f(\omega;x)\,e^{i\omega y}\,d\omega
= \frac{1}{2\pi}\int_{\mathbb{R}} \hat f(\omega;0)\,e^{-i\omega x}e^{i\omega y}\,d\omega
= f(y - x|0).
$$

Applying the Shannon wavelet expansion to the density function $f(y|0)$,
instead of $f(y|x)$, gives,

$$
f(y|x) = f(y - x|0) \approx \sum_{k=k_1}^{k_2} D_{m,k}(0)\,\phi_{m,k}(y - x),
$$

where $D_{m,k}(0)$ are the density coefficients, approximated as in
Equation 8.29 and evaluated at $x = 0$. The SWIFT option pricing formula then
becomes,

$$
v(x,t) \approx e^{-r(T-t)} \sum_{k=k_1}^{k_2} D^{*}_{m,k}(0)\,V^{*}_{m,k}(x),
\quad\text{where}\quad
V_{m,k}(x) := \int_{I_m} v(y,T)\,\phi_{m,k}(y - x)\,dy.
$$

Compared to the original SWIFT pricing formula in Equation 8.32, the depen-
dence on x has been moved from the density coefficients to the payoff coef-
ficients. The density coefficients have to be computed only once. The payoff
coefficients now depend on x, and they are generally cheaper to compute,
especially for the WA[a,b] method.

8.6 Numerical Results


In this section, we give examples of pricing options by the COS, WA[a,b] ,
and SWIFT methods. All examples are implemented in Matlab and run on
an Intel Core i5-4250U CPU @ 1.30 GHz × 4 with 8 GB of memory.
In BENCHOP [18], option pricing methods based on Monte Carlo, Fourier,
finite difference, and radial basis functions are compared for different option
styles and underlying processes. When the characteristic function of the
underlying process is known, the Fourier methods in general, and the COS
method in particular, are shown to be extremely competitive in terms of both
accuracy and computational time.

TABLE 8.1: CPU times (in milliseconds) for a European put on CGMY
dynamics at different scales, with the corresponding absolute price error

       COS                      SWIFT                      Haar
N      Error      Time      Scale  Error      Time      Scale  Error      Time
32     1.36e-02   0.41      1      1.58e-01   0.36      5      8.59e-02   0.42
64     3.32e-05   0.34      2      2.27e-04   0.42      7      1.41e-04   0.49
128    3.42e-09   0.39      3      5.61e-07   0.48      9      5.13e-07   1.10
256    4.44e-14   0.53      4      5.80e-11   0.70      11     1.73e-08   2.16

Note: Reference price by the COS method.

8.6.1 Computational time


The COS, SWIFT, and WA[a,b] methods all follow the same approach.
First, the density coefficients are recovered from the characteristic function.
For the COS method, this is a single evaluation of the characteristic function
(see Equation 8.7). The SWIFT and WA[a,b] methods require the application
of the FFT for an efficient evaluation of the density coefficients with a compu-
tational complexity of O(N log2 N ). Thus the computation of the coefficients
of the SWIFT and WA[a,b] methods is generally more involved, compared to
the COS method. In Table 8.1, we price a European call with strike K = 100
and T = 1 on Carr-Geman-Madan-Yor (CGMY) dynamics [19] with parame-
ters (C, G, M, Y ) = (1, 5, 5, 1.5) and S0 = 100, r = 0.1. We compare the three
methods, COS, SWIFT, and WA[a,b] (j = 0, Haar basis), and observe that all
of the methods are capable of reaching engineering accuracy in milliseconds.
An advantage of the WA[a,b] method over the COS method is its robust-
ness, as described in the following section.

8.6.2 Robustness of WA[a,b]


The WA[a,b] method cannot compete against the COS method in terms of
computational time, but it is attractive due to its robustness. Since wavelet
series represent functions locally, we can easily adjust coefficients to match
local difficulties.
We demonstrate this robustness by pricing a very long maturity T = 100
call option, which arises in economics and real options. Due to the unbounded
payoff of a call option, extreme payoffs occur on the right-hand side of the
computation domain.
Since the COS method uses a global basis, all coefficients are affected
by this extreme payoff and, as shown in Figure 8.3a, payoff coefficients
of magnitude $10^{12}$ are multiplied by density coefficients of magnitude
$10^{-15}$, causing such round-off errors that the final option value has an
absolute error of $10^{4}$.
We price the same long maturity call using WA[a,b] with the Haar basis and
scale m = 5 (32 coefficients), and the coefficients are shown in Figure 8.3b.

FIGURE 8.3: Coefficients for COS (N = 32) and WA[a,b] (Haar, m = 5)
methods arising from the pricing of a long maturity call on GBM with S0 =
100, K = 100, r = 0.1, q = 0, σ = 0.25, T = 100, and L = 10. Reference price
by the Black–Scholes formula. Panels: (a) COS coefficients; (b) WA[a,b]
coefficients. Each panel plots the density and payoff coefficient values
(log scale) against the coefficient index k.

Wavelets form a local basis and, as can be seen from the figure, each coefficient
$D^{j,*}_{m,k}$ only affects the density locally, in the interval
$[k/2^m, (k+1)/2^m]$. We can therefore avoid large round-off errors by removing
the payoff coefficients at the right-hand side of the domain that cause them.
This leads us to consider the truncated series

\[ v_{\kappa_m}(x,t) := e^{-r(T-t)} \sum_{k=0}^{\kappa_m} D^{j,*}_{m,k}(x)\, V^{j}_{m,k}, \tag{8.33} \]

with $D^{j,*}_{m,k}$ as in Equation 8.17. By choosing $\kappa_m$ such that
$v_{\kappa_m}(x,t) < S_0$, using that $S_0$ is an upper bound for the value of a
call, we find an error of about $10^{-1}$.
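
The truncation rule behind Equation 8.33 can be sketched in a few lines. The
code is a hedged illustration: it reads the rule as "keep the longest leading
run of terms whose discounted partial sum stays below the upper bound S0",
which is an interpretation rather than the authors' stated algorithm, and the
coefficient values are made-up toy numbers, not output of the WA[a,b] method.

```python
def truncated_price(density_coeffs, payoff_coeffs, discount, upper_bound):
    """Return (kappa, value) for the longest partial sum staying below upper_bound.

    kappa is the last index kept; it is -1 if even the first term breaches
    the bound. All inputs are plain sequences of floats.
    """
    partial = 0.0
    kappa, value = -1, 0.0
    for k, (d, v) in enumerate(zip(density_coeffs, payoff_coeffs)):
        partial += d * v
        candidate = discount * partial
        if candidate >= upper_bound:
            break  # this term would push the price above the no-arbitrage bound
        kappa, value = k, candidate
    return kappa, value

# Toy data (hypothetical): the last payoff coefficient is huge, mimicking the
# extreme payoffs at the right-hand side of the domain.
density = [0.2, 0.3, 0.3, 0.15, 1e-15]
payoff = [0.0, 10.0, 40.0, 90.0, 1e18]
kappa, value = truncated_price(density, payoff, discount=0.9, upper_bound=100.0)
print(kappa, value)
```

In this toy setting the final term alone would contribute far more than S0, so
the rule discards it and keeps the first four coefficients.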

8.6.3 Rate of convergence


The COS method and the SWIFT method are extremely powerful when
the underlying asset is driven by a geometric Brownian motion. In that case,
the COS and SWIFT methods converge exponentially, while the WA[a,b]
method for j = 0 (Haar) and j = 1 (linear B-splines) converges algebraically.
The rate of convergence for a European call option under geometric Brownian
motion is shown in Figure 8.4.
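
The exponential convergence of a Fourier cosine expansion for a Gaussian
density can be observed directly. The sketch below is not the chapter's Matlab
code; it uses the standard cosine-series density recovery of Fang and
Oosterlee [4] with the assumed convention that the characteristic function is
E[exp(iuY)], a standard normal density, and a hypothetical truncation range
[-10, 10].

```python
import cmath
import math

def cos_density(char_fn, y, a, b, n_terms):
    """Fourier-cosine approximation on [a, b] of a density with given characteristic function."""
    width = b - a
    total = 0.0
    for k in range(n_terms):
        u = k * math.pi / width
        coeff = 2.0 / width * (char_fn(u) * cmath.exp(-1j * u * a)).real
        if k == 0:
            coeff *= 0.5  # the first term of a cosine series is halved
        total += coeff * math.cos(u * (y - a))
    return total

def normal_cf(u, mu=0.0, sigma=1.0):
    """Characteristic function E[exp(i u Y)] of a normal random variable."""
    return cmath.exp(1j * u * mu - 0.5 * sigma**2 * u**2)

def normal_pdf(y, mu=0.0, sigma=1.0):
    """Exact normal density, used as the reference."""
    return math.exp(-0.5 * ((y - mu) / sigma)**2) / (sigma * math.sqrt(2.0 * math.pi))

errors = {
    n: abs(cos_density(normal_cf, 0.3, -10.0, 10.0, n) - normal_pdf(0.3))
    for n in (16, 32, 64)
}
for n, err in errors.items():
    print(n, err)
```

Doubling the number of terms reduces the error by many orders of magnitude
rather than by a fixed factor, which is the behavior described above for the
COS method under geometric Brownian motion.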
The probability density function corresponding to an asset driven by
geometric Brownian motion is approximated very well by a Fourier cosine
expansion and a Shannon wavelet expansion. The rate of convergence deteriorates
to algebraic convergence, for example, when we price a long maturity option
under the Heston model [20], as shown in Figure 8.5. Long maturity
optionalities occur in economics and real option pricing. In that case, the
WA[a,b] method is competitive with the SWIFT and COS methods. Note that we can
further improve the accuracy of the WA[a,b] method by using the truncated
series from Equation 8.33.

FIGURE 8.4: A European call with strike K = 110 on an asset driven by
geometric Brownian motion with parameters S0 = 100, σ = 0.15, r = 0.1, and
maturity T = 1. Reference price by the Black–Scholes formula. The plot
(Black–Scholes European call) shows the absolute price error on a log scale
against the number of coefficients for the COS, Haar, linear B-spline, and
SWIFT methods.

FIGURE 8.5: A long maturity T = 45 call option under Heston dynamics.
Parameters as in Ortiz-Gracia and Oosterlee [6]. All methods show a similar
rate of convergence. The plot (Heston EU call, T = 45) shows the absolute
price error on a log scale against the number of coefficients for the COS,
Haar, linear B-spline, and SWIFT methods.

FIGURE 8.6: Short-maturity (T = 0.01) European put options on CGMY
dynamics for a range of strikes K = 95, . . . , 105. Left panel (error):
absolute error on a log scale against strike for COS (n = 512), SWIFT
(m = 8), and Haar (m = 9). Right panel (CPU time): time in milliseconds
against the number of strikes priced simultaneously for the same three methods.

8.6.4 Multiple strike pricing


Generally, asset models are calibrated such that the resulting option prices
match quoted option prices. Therefore, a whole range of options with different
strike prices has to be priced for each parameter set.
When the underlying asset is driven by an exponential Lévy process, all of
the discussed methods can be reformulated such that the density coefficients
only have to be computed once and the payoff coefficients once for every
option.
As mentioned before, the COS method has a computational complexity
of O(N ), while the WA[a,b] method has a complexity of O(N log2 N ). However,
when pricing M options simultaneously, the complexity of the COS method is
O(M N ), while the WA[a,b] method has a complexity of O(N log2 N + M N ),
so when M is large compared to N , the two methods have essentially the
same computational complexity. The WA[a,b] method then benefits from the
very simple representation of its payoff coefficients.
In the following example, we price a range of put options on an asset driven
by the CGMY model [19] with parameters (C, G, M, Y ) = (1, 5, 5, 0.5) with a
very short maturity T = 0.01, initial asset price S0 = 100, and strikes K =
95, 80.1, . . . , 119.9, 105. Thus, in total, 100 options are priced simultaneously.
In Figure 8.6, on the left, we see the price errors corresponding to each strike
price.

Remark 8.2. The SWIFT method has a slight disadvantage in multiple strike
pricing, as both the payoff and density coefficients are computed using the FFT,
and thus the resulting computational complexity is O((M + 1)N log2 N ).
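
The complexity estimates quoted in this section can be compared with a small
arithmetic sketch. It evaluates only the leading-order operation counts with
every constant set to one, which is an assumption made for illustration; real
timings (such as those in Figure 8.6) depend on constants that these counts
ignore.

```python
import math

def cost_cos(n_coeffs, n_strikes):
    """Leading-order count O(M N) for the COS method."""
    return n_strikes * n_coeffs

def cost_wa(n_coeffs, n_strikes):
    """Leading-order count O(N log2 N + M N) for the WA[a,b] method."""
    return n_coeffs * math.log2(n_coeffs) + n_strikes * n_coeffs

def cost_swift(n_coeffs, n_strikes):
    """Leading-order count O((M + 1) N log2 N) for the SWIFT method (Remark 8.2)."""
    return (n_strikes + 1) * n_coeffs * math.log2(n_coeffs)

for m_strikes in (1, 10, 100, 1000):
    wa_ratio = cost_wa(256, m_strikes) / cost_cos(256, m_strikes)
    swift_ratio = cost_swift(256, m_strikes) / cost_cos(256, m_strikes)
    print(m_strikes, round(wa_ratio, 3), round(swift_ratio, 3))
```

With N = 256 the WA[a,b]-to-COS ratio equals (log2 N + M)/M and tends to one
as M grows, while the SWIFT-to-COS ratio stays near log2 N, in line with the
remark above.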

Concluding this chapter on Fourier pricing methods, we can state that
Fourier as well as wavelet-based option pricing methods are extremely fast
and thus very efficient for European option pricing under Lévy models and
generally for log-asset models from the affine jump-diffusion class, like the
Heston model. With wavelets, and in particular with the SWIFT method, we can
enhance the robustness of Fourier methods when dealing with fat-tailed
distributions or very long (and very short) maturity options. Parallelization
techniques may be employed in the context of the calibration framework, where
option pricing computations need to be performed for multiple strike prices.
It is the strike direction that lends itself well to parallelization, leading to
a truly high-performance calibration.

References
1. Carr, P.P. and Madan, D.B. Option valuation using the fast Fourier transform.
Journal of Computational Finance, 2:61–73, 1999.

2. Lee, R.W. Option pricing by transform methods: Extensions, unification, and
error control. Journal of Computational Finance, 7:51–86, 2004.

3. Lindström, E., Ströjby, J., Brodén, M., Wiktorsson, M., and Holst, J. Sequential
calibration of options. Computational Statistics & Data Analysis, 52:2877–2891,
2008.

4. Fang, F. and Oosterlee, C.W. A novel option pricing method based on Fourier-
cosine series expansions. SIAM Journal on Scientific Computing, 31(2):826–848,
2008.

5. Ruijter, M. and Oosterlee, C.W. Two-dimensional Fourier cosines series
expansion method for pricing financial options. SIAM Journal on Scientific
Computing, 34:642–671, 2012.

6. Ortiz-Gracia, L. and Oosterlee, C.W. Robust pricing of European options with
wavelets and the characteristic function. SIAM Journal on Scientific Computing,
35(5):B1055–B1084, 2013.

7. Ortiz-Gracia, L. and Oosterlee, C.W. Efficient VaR and expected shortfall com-
putations for nonlinear portfolios within the delta-gamma approach. Applied
Mathematics and Computation, 244:16–31, 2014.

8. Kirkby, J.L. Efficient option pricing by frame duality with the fast Fourier trans-
form. SIAM Journal on Financial Mathematics, 6(1):713–747, 2016.

9. Zhang, B. and Oosterlee, C.W. Acceleration of option pricing technique on
graphics processing units. Concurrency and Computation: Practice and
Experience, 29(9):1626–1639, 2014.

10. Daubechies, I. Ten Lectures on Wavelets. Society for Industrial and Applied
Mathematics, Philadelphia, PA, USA, 1992.

11. Ortiz-Gracia, L. and Oosterlee, C.W. A highly efficient Shannon wavelet inverse
Fourier technique for pricing European options. SIAM Journal on Scientific
Computing, 38(1):B118–B143, 2016.

12. Chui, C.K. An Introduction to Wavelets. Academic Press, Cambridge,
Massachusetts, USA, 1992.

13. Ortiz-Gracia, L. and Masdemont, J.J. Peaks and jumps reconstruction with
B-splines scaling functions. Journal of Computational and Applied Mathematics,
272:258–272, 2014.

14. Colldeforns-Papiol, G., Ortiz-Gracia, L., and Oosterlee, C.W. Two-dimensional
Shannon wavelet inverse Fourier technique for pricing European options. Applied
Numerical Mathematics, 117:115–138, 2017.

15. Maree, S.C., Ortiz-Gracia, L., and Oosterlee, C.W. Pricing early-exercise and
discrete barrier options by Shannon wavelet expansions. Numerische Mathe-
matik, 136(4):1035–1070, 2017.

16. Gearhart, W.B. and Shultz, H.S. The function sin(x)/x. The College Mathemat-
ics Journal, 21(2):90–99, 1990.

17. Quine, B.M. and Abrarov, S.M. Application of the spectrally integrated Voigt
function to line-by-line radiative transfer modelling. Journal of Quantitative
Spectroscopy & Radiative Transfer, 244:37–48, 2013.

18. von Sydow, L. et al. BENCHOP – The BENCHmarking project in option
pricing. International Journal of Computer Mathematics, 92(12), 2015.

19. Carr, P.P., Geman, H., Madan, D.B., and Yor, M. The fine structure of asset
returns: An empirical investigation. Journal of Business, 75:305–332, 2002.

20. Heston, S. A closed-form solution for options with stochastic volatility with
applications to bond and currency options. The Review of Financial Studies,
6:327–343, 1993.
Chapter 9
A Practical Robust Long-Term Yield Curve Model

M. A. H. Dempster, Elena A. Medova, Igor Osmolovskiy, and Philipp Ustinov

CONTENTS
9.1 Introduction
9.2 Multifactor Yield Curve Models and Their Drawbacks
9.2.1 Requirements for model development
9.2.2 Available multifactor yield curve models
9.2.3 Classification of three-factor affine short rate models
9.2.4 Difficulties with Gaussian affine models
9.3 Nonlinear Black Correction for the EFM Model
9.3.1 Three-factor basic EFM model
9.3.2 EFM model calibration
9.3.3 Black correction for negative rates
9.3.4 Stylized properties of Black models
9.4 HPC Approaches to Calibrating Black Models
9.4.1 Three-factor Black model calibration
9.4.2 Monte Carlo bond pricing
9.4.3 PDE bond pricing
9.4.4 Black model calibration progress
9.5 UKF EM Algorithm HPC Implementation
9.5.1 UKF for the Black EFM model
9.5.2 Quasi-maximum likelihood estimation
9.5.3 Technical implementation
9.5.4 HPC implementation
9.6 Empirical Evaluation of the Model In- and Out-of-Sample
9.6.1 Data
9.6.2 Yield curve bootstrapping
9.6.3 In-sample goodness-of-fit
9.6.4 Out-of-sample Monte Carlo projection
9.7 Conclusion
Acknowledgments
Appendix 9A
References

9.1 Introduction
Since the 2007–2008 financial crisis, low interest rates have prevailed in
all the world’s major developed economies and were presaged by more than a
decade in Japan. This has posed a problem for the widespread use of diffusion-
based yield curve models for derivative and other structured financial product
pricing and for forward rate simulation for systematic investment, asset lia-
bility management (ALM), and economic forecasting. Indeed, while Gaussian
models remained sufficiently accurate for pricing and discounting in relatively
high-rate environments, their tendency to produce an unacceptable proportion
of negative forward rates at short maturities with Monte Carlo scenario simu-
lation from initial conditions in low-rate environments has called their current
use into question. The implications for this question of negative nominal rates
in deflationary regimes remain to be seen, as does the necessity for currently
fashionable multicurve models. Be that as it may, beginning with work in the
Bank of Japan in the early 2000s, there has recently been a flurry of research in
universities, central banks, and financial services firms to develop yield curve
models whose simulation produces nonnegative rate scenarios.
All this work is based on a posthumously published suggestion of Fisher
Black (1995) to apply a call option payoff with zero strike to the model
instantaneous short rate which leads to a piecewise linear nonlinearity in
standard Gaussian affine yield curve model formulae for zero coupon (dis-
count) bond prices and the corresponding yields, and precludes their explicit
closed-form solution. As a result, most of the published solutions to Black-
corrected yield curve models to date are approximations and even these require
high-performance computing (HPC) techniques for numerical solution. Here we
study an obvious approximation which works extremely well and is
amenable to cloud computing for speedup.
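
Black's suggestion can be illustrated with a one-factor sketch. The code is
not the EFM model of this chapter: it assumes a hypothetical Vasicek-type
Gaussian shadow rate with made-up parameters, applies the zero-strike call
payoff max(X, 0) to obtain the short rate, and prices a zero-coupon bond by
plain Monte Carlo, since the corrected model no longer has a closed-form bond
price.

```python
import math
import random

def simulate_bond_prices(x0, lam, theta, sigma, maturity,
                         n_steps=250, n_paths=1000, seed=0):
    """Monte Carlo bond prices under the shadow rate and under its Black correction."""
    rng = random.Random(seed)
    dt = maturity / n_steps
    sqrt_dt = math.sqrt(dt)
    sum_shadow = 0.0
    sum_black = 0.0
    for _ in range(n_paths):
        x = x0
        integral_shadow = 0.0
        integral_black = 0.0
        for _ in range(n_steps):
            integral_shadow += x * dt
            integral_black += max(x, 0.0) * dt  # Black (1995): r_t = max(X_t, 0)
            # Euler step of the Gaussian (Vasicek-type) shadow rate
            x += lam * (theta - x) * dt + sigma * sqrt_dt * rng.gauss(0.0, 1.0)
        sum_shadow += math.exp(-integral_shadow)
        sum_black += math.exp(-integral_black)
    return sum_shadow / n_paths, sum_black / n_paths

maturity = 5.0
p_shadow, p_black = simulate_bond_prices(
    x0=0.001, lam=0.3, theta=0.01, sigma=0.02, maturity=maturity)
yield_shadow = -math.log(p_shadow) / maturity
yield_black = -math.log(p_black) / maturity
print(p_shadow, p_black, yield_shadow, yield_black)
```

Because the corrected rate is never below the shadow rate on any path, the
corrected bond price cannot exceed the Gaussian one and the corrected yield is
nonnegative. The nested simulation loop also hints at why calibration of such
models is computationally demanding.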
In practice, there are a variety of approaches to yield curve modeling which
are driven by the intended use of the models. The literature is devoted
predominantly to the needs of investment banks in pricing and hedging fixed
income derivative and other structured products. Model calibration is short
term, to current forward market data at pricing and hedging time, and is
updated for rehedging. A specific model is evaluated by its realized hedging
profit and loss.
A second approach is that of central bank forecasting for monetary pol-
icy making. Here calibration uses long-term historical data for medium-term
forecasts and is updated for the next forecast. Model accuracy appears from
the open literature to be mainly evaluated by in-sample fit to the historical
data employed, with little out-of-sample forecasting evaluation reported.
The approach of interest here supports consultants’ and fund managers’
advice to institutional and individual clients regarding product pricing,
investment, and asset liability management over long horizons. It involves
long-term calibration to historical market data, often using filtering
techniques, which is updated at decision points such as restructurings or
portfolio rebalances. Models are evaluated by the consistency of their
forecasts with out-of-sample market data, for example, prices and returns.
This chapter describes the preliminary development of a robust long-term
yield curve model which is a (mildly) nonlinear version of our workhorse
Gaussian three-factor affine yield curve model, the Economic Factor Model
(EFM), which we have used for Monte Carlo scenario generation over many
years in practical structured derivative pricing, investment, and asset
liability management. We have employed the EFM model in the past using time
steps from daily to semiannual in the five major currency areas: EUR, CHF,
GBP, USD, and JPY. Its 14 parameters are calibrated to market data using
the expectation–maximization (EM) algorithm (Dempster et al. 1977), which
alternates the linear Kalman filter (KF) with maximum likelihood parameter
estimation (MLE) to convergence. Here we implement a Black (1995) corrected
version (Black EFM) of the EFM model using the nonlinear unscented
Kalman filter (UKF) (Julier et al. 1995; Julier and Uhlmann 1997) in place
of the ordinary KF. This represents an approximation to the mathematical
Black model in the presence of Black’s piecewise linear 0-strike call option
nonlinearity which suppresses negative rates. Indeed, while we use the EFM
affine closed-form expressions for yields of all maturities, it should be noted
that the zero lower bound (ZLB) will be inactive for all but those of short,
but not necessarily the shortest, maturities.
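
How an unscented transform copes with the zero-strike nonlinearity can be
seen in a scalar toy case. The sketch below is not the chapter's UKF: it
assumes a single Gaussian state, the basic Julier-Uhlmann sigma points with a
hypothetical tuning value kappa = 2, and compares the unscented estimate of
E[max(X, 0)] with its exact value for a normal variable.

```python
import math

def unscented_mean_of_floor(mean, std, kappa=2.0):
    """Unscented-transform estimate of E[max(X, 0)] for scalar X ~ N(mean, std^2)."""
    n = 1  # state dimension
    spread = math.sqrt(n + kappa) * std
    sigma_points = [mean, mean + spread, mean - spread]
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]
    return sum(w * max(x, 0.0) for w, x in zip(weights, sigma_points))

def exact_mean_of_floor(mean, std):
    """Exact E[max(X, 0)] = mean * Phi(mean/std) + std * phi(mean/std)."""
    z = mean / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return mean * cdf + std * pdf

for mean in (0.05, 0.01):
    approx = unscented_mean_of_floor(mean, 0.01)
    exact = exact_mean_of_floor(mean, 0.01)
    print(mean, approx, exact)
```

When the mean is several standard deviations above zero, all sigma points are
positive and the floor is inactive, so the transform is exact up to rounding.
Near zero the three-point rule is only approximate, which is consistent with
the remark above that the lower bound matters mainly at short maturities.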
Our approach to calibrating the Black-corrected EFM model is promising
both in calibration and forecasting accuracy relative to market data, even at
short maturities, but further work is necessary in both empirical testing and,
particularly, in the theoretical development of the UKF.
The remainder of this chapter is structured as follows. After enunciating
our guiding design requirements, Section 9.2 briefly summarizes recent yield
curve (term structure) models developed for both pricing and forecasting. It
illustrates the nature of the difficulties encountered with each model in terms
of our requirements and due to the current low-rate environment. In Section
9.3, the basic EFM model is set out and the Black correction is defined. Section
9.4 surveys recent contributions to approximating the Black-corrected affine
yield curve models whose calibrations are all (too) heavily computationally
intensive. These models are often (in our view mis-) designated “shadow rate
models,” a term introduced by Black for the underlying Gaussian model which
is to be corrected for nonnegative rates. Our approximate, but accurate, Black
EFM model is described in Section 9.5, which outlines the UKF, including
our HPC implementation. Section 9.6 presents the empirical evaluation of the
model in terms of both in-sample calibration and out-of-sample scenario gen-
eration, using data for the five major currency areas. Although there remains
room for deeper understanding and improvement, the Black EFM model is
found to be sufficiently accurate for both purposes. Conclusions are presented
in Section 9.7, in which it is noted that on average the runtime of its computa-
tionally intensive calibration is only double that of the original EFM model—a
very significant improvement on the current alternatives in the literature. An
appendix gives a step of the EM algorithm for the EFM model in pseudo code.

9.2 Multifactor Yield Curve Models and Their Drawbacks
The range of yield curve models discussed in the literature is vast. The
task of choosing a suitable model for a variety of purposes, including trading,
systematic investment, asset liability management, and structured product
valuation, is nontrivial. However, there are relatively few papers in the litera-
ture that measure (as opposed to just discussing) the comparative advantages
of more than a handful of different models, as the implementation of the more
complex ones is a time-consuming process. Therefore in developing a suit-
able new model, it is important to start with a clear formulation of model
requirements, so that the set of possible suitable choices is limited.

9.2.1 Requirements for model development


The principal applications of the model we envisage are the following:

• Scenario simulation for a diverse set of (predominantly long-term) ALM
  problems for multiple currencies
• Valuation of complex structured derivatives and loans (which often have
  embedded derivatives) in multiple currencies
• Risk assessment of portfolios and structured products

The problem of model selection has been discussed in Dempster et al.
(2010, 2014). We borrow some of the requirements for the new yield curve
model from these works and extend them here:

• A continuous-time framework to allow the flexibility of using different
  time steps, including uneven time steps
• Mean-reverting behavior
• Allow both pricing and dynamic evolution under the market (real world)
  measure, that is, the model should reflect the market risk premium1
• Reproduce a wide range of yield curve shapes and dynamics (to allow for
  realistic risk assessment, for example), including steepening, flattening,2
  inverted and humped yield curves
• Incorporate realistic modeling of the ZLB empirically observed for
  zero-coupon bond yields
• Feasible and efficient bond price calculation in closed form or numerically
• Parameter estimation by efficient model calibration to market data
• Allow estimation and use of the model for multiple correlated yield curves
  and currency exchange rates
• Parsimony in parameters
• Time homogeneity

1 As argued in Nawalkha and Rebonato (2011), this is especially relevant for the
buy-side practitioner. For sell-side banks, it usually suffices to do pricing and
initial hedging calculations under the risk-neutral measure from the forward
market data on the day. Having an exact fit to the observed yields is thus more
important for the sell side.
2 Products based on these properties of the yield curve are traded on the NYSE,
for example, US Treasury Flattener ETN (ticker: FLAT) and iPath US Treasury
Steepener ETN (ticker: STPP), although admittedly these are not very popular.

Clearly, the requirements related to ease and speed of calculations
contradict the requirements for model realism, so a compromise is obviously
necessary. The choice of model made when prevailing short-term yields are near-zero
could be different from that in a high-rate environment, but ideally the chosen
model should cover all rate environments. However, the present global low-
rate environment is the principal motivation for our work to improve on our
workhorse affine yield curve model (Dempster et al. 2010). The most impor-
tant enhancement required for our existing EFM model (see Section 9.3) is a
better way of dealing with the ZLB for initial low short rates.3 If we assume
that both normal and low-rate environments are probable in the medium- to
long-term future, and that once the situation reverts to long-term nominal
levels away from zero it will be reasonably similar to the nominal rate envi-
ronment before the crisis, a prudent decision would be to try to construct a
model that is suitable for both environments.

9.2.2 Available multifactor yield curve models


In order to briefly survey available model choices, we can divide yield curve
models into three broad overlapping classes:

• Short rate models
• Models in the Heath–Jarrow–Morton (HJM) framework
• Market models

Some authors also categorize stochastic volatility models (such as SABR)
separately. Short rate models are based on modeling a process for the
instantaneous interest rate which is then used to derive zero-coupon bond prices,
that is, discount factors, or their yields at different maturities. This class of
models is the oldest and probably the most extensively researched.
The HJM models start from modeling forward rates directly. A feature
of this framework is using the no-arbitrage property to derive constraints on
the structure of forward rates. This framework is very general and convenient
for studying arbitrage-free properties in theory. However, some of the models
in this framework may be non-Markovian and most practical models coming
from the HJM framework are either well-known short rate models or market
models.

3 We have previously used penalties added to the model likelihood function.
The class of market models is focused on describing the dynamics of the
observable quantities (e.g., LIBOR and SWAP market models). They are
especially useful for derivative pricing. However, under the currently
prevailing low interest rate conditions, the popular LIBOR Market Model (see,
e.g., de Jong et al. 2001) may require parameter estimates that are unrealistic
(e.g., simulated cash returns more volatile than actual equity returns, with
significant probability assigned to interest rates of more than 10,000%).
Model and computational complexity considerations, as well as the appli-
cations envisaged, suggest that short rate models are the most suitable class
for our needs.
There are other factors influencing our considerations. First, we have had
long successful experience with utilization of the EFM model (see Section 9.3)
in situations in which the rate ZLB is not binding. We have confidence in the
performance of the EFM yield curve model in these situations, so we would
prefer our new model not to deviate too far from it.
Secondly, most of the current research on ZLBs is done in the framework
of short rate models. Having a way of estimating the level of “shadow” rates
may be useful, not least because some policy makers appear to monitor them.
There have been attempts in the research literature to use the “shadow” short
rate and its distance to zero as a forecast of the estimated time until the
low-rate regime is lifted, see Ueno et al. (2006) and Wu and Xia (2014). The
Federal Reserve Bank of Atlanta publishes the Wu-Xia Shadow Federal Funds
Rate based on the Wu and Xia paper. It should be noted, however, that the
level of the shadow rate is not a very reliable indicator, as it is strongly model
dependent (see Bauer and Rudebusch 2013; Christensen and Rudebusch 2013).
A review of the literature on short rate models shows that the most popular
subclass of short rate models in empirical research and applications is
that of the Affine Term Structure Models (ATSMs), due to their analytical
tractability, flexibility, and empirical efficiency. This class of models
includes Vasicek (1977), Dothan (1978), Cox-Ingersoll-Ross (1985), Ho-Lee
(1986), Hull-White (1990), and many other one- and multifactor models.
To illuminate the analysis that we undertake below for the more complex
multifactor models, we first discuss the characteristics of the simpler one-factor
models. The stochastic differential equations (SDEs) governing the evolution
of the short rate under the risk-neutral or pricing measure Q for the respective
models are:4

1. Vasicek (1977)
\[ d\mathbf{X}_t = \lambda(\theta - X_t)\,dt + \sigma\,d\mathbf{W}_t \tag{9.1} \]

2. Dothan (1978)
\[ d\mathbf{X}_t = -\lambda X_t\,dt + \sigma X_t\,d\mathbf{W}_t \tag{9.2} \]

3. Cox-Ingersoll-Ross (1985)
\[ d\mathbf{X}_t = \lambda(\theta - X_t)\,dt + \sigma\sqrt{X_t}\,d\mathbf{W}_t \tag{9.3} \]

4. Ho-Lee (1986)
\[ d\mathbf{X}_t = \theta_t\,dt + \sigma\,d\mathbf{W}_t \tag{9.4} \]

5. Hull-White (1990)
\[ d\mathbf{X}_t = \lambda(\theta_t - X_t)\,dt + \sigma_t\,d\mathbf{W}_t. \tag{9.5} \]

4 We use boldface type in the sequel to denote stochastic entities, here conditionally.

It is easy to see that Hull–White (also called extended Vasicek) is the
most general of these models. It can fit any term structure exactly because
of the time-dependent equilibrium drift coefficients θt . However, having time-
dependent parameters, as in the Ho–Lee and Hull–White models, contradicts
our requirements of parsimony and time homogeneity. The Ho–Lee model
also lacks the desired mean-reversion property and the Vasicek, Ho–Lee, and
Hull–White diffusion models can all produce negative yields. The Dothan and
Cox–Ingersoll–Ross models produce positive yields, but the short rate in these
models never hits the ZLB. In other words, the ZLB in these models is repelling
instead of absorbing. This is not consistent with the historical data recently
observed in developed countries.5 The Vasicek, Dothan, and Cox–Ingersoll–
Ross models do not satisfy our requirement on yield curve shapes in that
the shapes attainable with these models are constrained. On the other hand,
choosing a time-homogeneous structure for our model, by not using nonsta-
tionary parameters in the corresponding SDE, means that exact matching of
the yield curve is not possible with a small number of factors.
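
The sign behavior of the Vasicek and Cox–Ingersoll–Ross short rates can be
sketched by simulation. The code below is a rough illustration with made-up
parameters and a simple Euler discretization; the floor at zero in the CIR
step is a numerical device chosen here so that the square root stays defined,
and it is a discretization artifact rather than a property of the
continuous-time model.

```python
import math
import random

def simulate_short_rate(model, x0, lam, theta, sigma,
                        horizon=10.0, n_steps=2500, seed=1):
    """Euler path of a one-factor short rate for the Vasicek or CIR dynamics."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    path = [x0]
    x = x0
    for _ in range(n_steps):
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)
        if model == "vasicek":
            x = x + lam * (theta - x) * dt + sigma * dw
        elif model == "cir":
            # Truncated Euler step: floor at zero keeps sqrt(x) well defined.
            x = max(x + lam * (theta - x) * dt + sigma * math.sqrt(x) * dw, 0.0)
        else:
            raise ValueError("unknown model: " + model)
        path.append(x)
    return path

# Low-rate starting point for Vasicek; CIR parameters satisfy 2*lam*theta >= sigma^2.
vasicek_path = simulate_short_rate("vasicek", x0=0.001, lam=0.5, theta=0.001, sigma=0.02)
cir_path = simulate_short_rate("cir", x0=0.03, lam=0.5, theta=0.03, sigma=0.1)
negative_share = sum(x < 0.0 for x in vasicek_path) / len(vasicek_path)
print(min(vasicek_path), min(cir_path), negative_share)
```

Started near zero, the Gaussian path spends a substantial share of the time
in negative territory, whereas the square-root diffusion stays nonnegative,
matching the qualitative discussion above.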
The number of factors necessary for adequate modeling of the whole term
structure has been analyzed in Litterman and Scheinkman (1991). Their prin-
cipal component analysis of U.S. Treasury data showed that 99% of the vari-
ance can be captured by three factors.
It is well known (see, e.g., Nawalkha and Rebonato 2011) that single-factor
and two-factor time-homogeneous models deviate significantly from the
initially observed bond prices. However, three to five factors produce a close fit.
The Nelson-Siegel (1987) model widely employed in central banks uses three
factors to estimate the entire yield curve but has time-inhomogeneous
parameters, except in the Diebold–Rudebusch arbitrage-free version of the model
(see Diebold and Rudebusch 2013; Rebonato 2015). Rebonato and Cooper
(1995) argued that a two-factor affine or quadratic model cannot reproduce a
realistic correlation structure of interest rate changes, but that three to five
factors are sufficient for this purpose. So it seems that a reasonable choice
(taking into account additional computational complexities connected with
introducing the ZLB property) would be an affine short rate model with three
factors.
5 Note that the multifactor square root (CIR) model and quadratic Gaussian models
(QGMs) are also unable to reproduce the absorbing ZLB.

9.2.3 Classification of three-factor affine short rate models


Duffie and Kan (1996) derive necessary and sufficient conditions on the
SDEs to have an affine representation and Dai and Singleton (2000) analyze
the different subfamilies of ATSMs. The analysis of Dai and Singleton for the
three-factor cases shows that some affine subfamilies explain historical interest
rate behavior better than others.
The SDEs for their factors are of the form
$$d\mathbf{X}_t = \Lambda(\Theta - X_t)\,dt + \Sigma\sqrt{S_t}\,d\mathbf{W}_t, \tag{9.6}$$
with X the K-dimensional state vector; W a K-dimensional Brownian motion;
Θ a fixed point in K-dimensional space; Λ, $S_t$ and Σ K × K matrices; and $S_t$
a diagonal matrix with diagonal elements satisfying
$$[S_t]_{ii} = \alpha_i + \beta_i' X_t, \quad i = 1, \ldots, K, \tag{9.7}$$
where prime denotes transpose.


Zero coupon bond prices in terms of expectations of the instantaneous
short rate $\mathbf{r}_t$ under the risk-neutral Q measure are given by
$$P_t(\tau) = E_t^Q\left[\exp\left(-\int_t^{t+\tau} \mathbf{r}_s\,ds\right)\right]. \tag{9.8}$$
For an admissible parametrization, the bond prices can be calculated as
$$P_t(\tau) = e^{A(\tau) + B(\tau)' X_t}, \tag{9.9}$$
where A and B are solutions of certain ODEs (see, e.g., James and Webber
2000).
The instantaneous short rate is also an affine function of the state
$$\mathbf{r}_t = \phi_0^Q + \phi_X^{Q\prime}\,\mathbf{X}_t. \tag{9.10}$$
Zero coupon bond yields to maturity, termed rates, are linked with the
bond prices by
$$y_t(\tau) = -\log P_t(\tau)/\tau. \tag{9.11}$$
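As a concrete instance of Equations 9.9 and 9.11, the sketch below uses the well-known closed-form solution for the one-factor Vasicek model (9.1), assuming that the short rate equals the single factor and that the parameters are those of the Q dynamics. The function names and parameter values are illustrative assumptions.

```python
import math


def vasicek_A_B(tau, lam, theta, sigma):
    """Return (A, B) such that P = exp(A + B * x), the K = 1 case of (9.9)."""
    b_pos = (1.0 - math.exp(-lam * tau)) / lam
    a = (theta - sigma**2 / (2.0 * lam**2)) * (b_pos - tau) \
        - sigma**2 * b_pos**2 / (4.0 * lam)
    return a, -b_pos


def bond_price(x, tau, lam, theta, sigma):
    a, b = vasicek_A_B(tau, lam, theta, sigma)
    return math.exp(a + b * x)


def bond_yield(x, tau, lam, theta, sigma):
    # Equation (9.11), written on the log price to avoid underflow for long maturities
    a, b = vasicek_A_B(tau, lam, theta, sigma)
    return -(a + b * x) / tau


# Hypothetical Q-measure parameters
for tau in (1.0, 5.0, 30.0):
    print(tau, round(bond_yield(0.02, tau, 0.2, 0.03, 0.01), 5))
```

In this sketch the yield converges to theta - sigma^2 / (2 lam^2) as the maturity grows, so the long end of the curve is pinned by the parameters, one illustration of why the attainable curve shapes are constrained in a one-factor model.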
There are models that lack affine structure (and thus forfeit simple formulae
for bond prices) but a vector of K rates $R_t$ of specified maturities may
sometimes still be recovered as the numerical solution of the Riccati equation
$$\frac{\partial R_t(\tau)}{\partial \tau} = \Lambda R_t(\tau) - \frac{1}{2} R_t(\tau)\,\Sigma S \Sigma'\,R_t(\tau)' + r_t \mathbf{1}, \tag{9.12}$$
where $\mathbf{1}$ is the K vector of ones (Dempster et al. 2014).
Dai and Singleton (2000) denote different affine subfamilies by Am (n) with
n the number of factors and m ≤ n the number of bounded factors. They
perform empirical tests on the different subclasses for n equal to 3. Dempster
et al. (2014) also analyzed various three-factor affine models with requirements
similar to ours to uncover a variety of shortcomings with the models evaluated.
In particular, they studied the three-factor extended Vasicek model specified
under the market measure P in Equation 9.6 by
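$$\begin{aligned}
\Lambda &:= \begin{pmatrix} \lambda_{11} & 0 & 0 \\ \lambda_{21} & \lambda_{22} & 0 \\ \lambda_{31} & \lambda_{32} & \lambda_{33} \end{pmatrix}, \qquad
\Theta := \begin{pmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \end{pmatrix}, \\
\Sigma &:= \begin{pmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ 0 & 0 & \sigma_3 \end{pmatrix}, \qquad
S := \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \\
r(t) &:= \delta_0 + \delta_1 y_1(t) + \delta_2 y_2(t) + \delta_3 y_3(t).
\end{aligned} \tag{9.13}$$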
This Dai and Singleton A0 (3) model with 16 parameters (also known as
the Hull–White model) is not econometrically identified under P (i.e., different
values of Θ can give the same paths of the factor process X ) unless Θ := 0,
which is only appropriate to the pricing measure Q, and has other difficulties
as well.
Dempster et al. (2014) were led to introduce a Black-corrected affine model
which always produces nonnegative rates.6 This was based on the recent (2011)
Joslin–Singleton–Zhu (JSZ) three-factor affine Gaussian yield curve model
whose continuous evolution of the three factors Y is given by
$$d\mathbf{Y}(t) = \Lambda\left(\theta - Y(t) + \Pi(t)\right)dt + \Sigma\sqrt{S(t)}\,d\mathbf{W}(t), \tag{9.14}$$
where
$$\Pi(t) = k_0 + K_1 Y(t) \tag{9.15}$$
is the affine state-dependent market price of risk (excess factor return) vector.
JSZ estimate the parameters of the discrete time version of their model with
three observed yield curve points (rates) fit exactly and a few extra rates
fit approximately by least squares by means of two standard econometric
vector autoregression (VAR) models given, respectively, under the market
(real-world) and pricing (risk-neutral) measures P and Q. Dempster et al.
(2014) use only a single extra yield curve point fit by least squares.
6 We shall describe the Black correction in the following section.
[Figure 9.1 plot: "Model parameters fitted using JSZ representation, optimization starting point 9"; end-of-month Bank of England data, January 1994–December 2011. Series: fitted, historical, and stationary yield curves. Vertical axis: monthly continuously compounded yield. Horizontal axis: maturity (months), with the 6-month, 5-year, and 15-year points fitted exactly and the 25-year point fitted approximately.]

FIGURE 9.1: Joslin–Singleton–Zhu affine model fit numerical instability.

9.2.4 Difficulties with Gaussian affine models


Besides the problem of negative simulated rates, Dempster et al. (2014)
encountered a number of unexpected difficulties with existing models that are
not discussed in the literature. For example, the JSZ affine model calibration
implementation was initially subject to numerical instability of the in-sample
fit, as illustrated in Figure 9.1, which shows extreme oscillating monthly
fitting errors dwarfing a typical UK yield curve up to 25-year maturity.
While this difficulty was relatively easily overcome by bounding the parameters
in the calibration, Figure 9.2 shows a much more fundamental difficulty
with the JSZ model. Namely, while the properly calibrated model fits a
selection of historical yield curves well in-sample, out-of-sample yield curve
extrapolation projects that all such curves decline monotonically and
eventually reach negative nominal rates within about a 100-year maturity, on
average at about 80 years.
Figure 9.3 illustrates the worrying difficulty with three-factor models of
the Gaussian A0 (3) affine class in low-rate environments with the EFM model
described in the following section. The figure shows the quantiles of 10,000
30-year forward simulations of 10-year euro bond rates for the EFM yield
curve model which has been calibrated to daily data from January 2, 2001, to
January 2, 2012. This model gives a 25% probability of future 10-year negative
rates within 3.5 years starting from an initial value of about 3.2% which it
predicts will remain the median value over the 30-year forecast horizon.

[Figure 9.2 plot: "Yield curve, historical versus modeled"; model fitted using JSZ representation; end-of-month Bank of England data from January 1994 to December 2011. Series: fitted, historical, and stationary yield curves. Vertical axis: monthly continuously compounded yield. Horizontal axis: maturity (months), estimated using an exact fit to the 6-month, 5-year, and 15-year data points and an approximate fit to the 25-year point.]

FIGURE 9.2: Joslin–Singleton–Zhu 25-year UK yield curve out-of-sample
projections.

In summary, we have stated a number of desirable requirements for a practical
long-term yield curve model and have briefly surveyed the range of models
available in the literature. We determined that the short rate class is the most
suitable for our needs and that, within this class, the most reasonable
decision a priori is to evaluate a model with three factors, in particular, within
the A0 (3) affine class. We have illustrated some of the potential drawbacks of
such models when they have no correction to keep simulated rates nonnegative.

9.3 Nonlinear Black Correction for the EFM Model


Around the turn of the last century, a famous Austrian economist, Eugen
von Bohm-Bawerk (1851–1914), declared that the cultural level of a nation is
mirrored by its rate of interest: the higher a people’s intelligence and moral
strength, the lower the rate of interest (Homer and Sylla 2005).

[Figure 9.3 plot: quantiles Q_0.01, Q_0.05, Q_0.25, Q_0.5, Q_0.75, Q_0.95, and Q_0.99 of the simulated 10-year rate together with market data. Horizontal axis: dates from Jan-12 to Jul-40. Vertical axis: rate, from –0.15 to 0.2.]

FIGURE 9.3: EFM model Euro 10-year rate forecast for 30 years.

As a low-rate environment has prevailed in most major developed countries
since 2008, and in Japan since the early 1990s, it is crucial to realistically
model rates behavior in these circumstances. We will present a Black-corrected
version of the EFM model discussed in Medova et al. (2005), Yong (2007), and
Dempster et al. (2010).

9.3.1 Three-factor basic EFM model


We first describe the original EFM model of the yield curve, which we
have used previously in a variety of applications in the five principal currency
areas with various time steps from daily to quarterly.7
The evolution under the risk-neutral measure Q of the three unobservable
(i.e., latent) factors of the model is governed by the SDEs

$$\begin{aligned}
d\mathbf{X}_t &= \lambda_X(\theta_X - X_t)\,dt + \sigma_X\,d\mathbf{W}^X_t \\
d\mathbf{Y}_t &= \lambda_Y(\theta_Y - Y_t)\,dt + \sigma_Y\,d\mathbf{W}^Y_t \\
d\mathbf{R}_t &= k(X_t + Y_t - R_t)\,dt + \sigma_R\,d\mathbf{W}^R_t,
\end{aligned} \tag{9.16}$$
with fixed pairwise correlations of the standard Brownian motion innovations
given by
$$(\rho_{XY}\,dt,\ \rho_{XR}\,dt,\ \rho_{YR}\,dt). \tag{9.17}$$
7 It is interesting to note that this model originated at Long-Term Capital Management
and was first brought to our attention by Lehman Brothers under the auspices of Pioneer
Investments of UniCredit Bank.
The stochastic evolution of the three factors under the market (i.e., real-world)
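measure P involving the market prices of risk of the factors is governed by
$$\begin{aligned}
d\mathbf{X}_t &= \lambda_X\left(\theta_X + \frac{\gamma_X \sigma_X}{\lambda_X} - X_t\right)dt + \sigma_X\,d\mathbf{W}^X_t \\
d\mathbf{Y}_t &= \lambda_Y\left(\theta_Y + \frac{\gamma_Y \sigma_Y}{\lambda_Y} - Y_t\right)dt + \sigma_Y\,d\mathbf{W}^Y_t \\
d\mathbf{R}_t &= k\left(X_t + Y_t + \frac{\gamma_R \sigma_R}{k} - R_t\right)dt + \sigma_R\,d\mathbf{W}^R_t.
\end{aligned} \tag{9.18}$$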
The first factor X t represents the long rate and the third factor Rt the
short rate. The second factor Y t represents minus the slope of the yield curve
between the long rate and the unobservable instantaneous short rate. Thus
the sum of the first two factors X t + Y t represents the unobservable stochas-
tic instantaneous short rate about which the observable short rate Rt mean
reverts.
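The factor dynamics in Equations 9.16 and 9.18 can be sketched as a single Euler step in which the market prices of risk are switched on under P and set to 0 under Q. The class and function names, the parameter values, and the assumption that the three standard normal shocks are supplied already correlated are illustrative only.

```python
import math
from dataclasses import dataclass


@dataclass
class EFMParams:
    lam_x: float
    theta_x: float
    sigma_x: float
    gamma_x: float
    lam_y: float
    theta_y: float
    sigma_y: float
    gamma_y: float
    k: float
    sigma_r: float
    gamma_r: float


def efm_euler_step(state, p, dt, shocks, measure="P"):
    """One Euler step of (9.18) for measure 'P' or of (9.16) for measure 'Q'.

    `shocks` holds three standard normal draws, assumed to be correlated already.
    """
    x, y, r = state
    g = 1.0 if measure == "P" else 0.0  # market prices of risk vanish under Q
    root_dt = math.sqrt(dt)
    x_level = p.theta_x + g * p.gamma_x * p.sigma_x / p.lam_x
    y_level = p.theta_y + g * p.gamma_y * p.sigma_y / p.lam_y
    r_level = x + y + g * p.gamma_r * p.sigma_r / p.k
    return (
        x + p.lam_x * (x_level - x) * dt + p.sigma_x * root_dt * shocks[0],
        y + p.lam_y * (y_level - y) * dt + p.sigma_y * root_dt * shocks[1],
        r + p.k * (r_level - r) * dt + p.sigma_r * root_dt * shocks[2],
    )


# Hypothetical parameter values; with zero shocks the factors settle at their
# mean-reversion levels, so R converges to theta_x + theta_y under Q.
params = EFMParams(0.3, 0.04, 0.01, 0.2, 0.5, -0.01, 0.012, -0.1, 1.0, 0.008, 0.05)
state = (0.05, -0.02, 0.02)
for _ in range(5000):
    state = efm_euler_step(state, params, 0.1, (0.0, 0.0, 0.0), measure="Q")
print(state)
```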
Note that as the X and Y equations have the same form, the factor
dynamics under Q given in Equation 9.16 are not econometrically identified,
that is, the parameters are not uniquely determined in that different sets will
generate the same factor dynamics. However, the factor dynamics under P
given in Equation 9.18 are identified by virtue of expressing the fixed factor
market prices of risk in volatility units. Also note that the rates of mean
reversion of the three factors are identical under P and Q. As a result, the
parameters of the dynamics must be estimated from market data under the P
measure and the resulting market price of risk estimates set to 0 to generate
the dynamics of the Q measure for pricing.
The SDEs (9.16 and 9.18) have explicit solutions. Substituting the explicit
solutions of the SDEs (9.16) into the sum of the first two factors and using
Equations 9.10 and 9.11 produces closed-form formulae for bond prices and
yields in terms of affine functions of the SDE parameters (see, e.g., Medova
et al. 2005; Yong 2007 for the factor covariance matrix). In particular, denoting
the three latent factors at time t in vector form by $x_t := (X_t, Y_t, R_t)'$, the yields
of the K different maturity zero coupon bonds are given in affine form by
$$y_t = B x_t + d, \tag{9.19}$$
where B and d are closed-form deterministic affine functions (matrix or
vector-valued, respectively) of the SDE parameters.

9.3.2 EFM model calibration


Calibration of the EFM model is a nontrivial task even without the Black
correction. The parameters of the model are estimated using a version of the
EM algorithm (Dempster et al. 1977), which alternates two steps until the
parameters converge: the Kalman filter (KF) generates sample paths of the
factors, and maximum likelihood estimation (MLE) fits the parameters to each
path. Given a fixed set of parameters, the KF produces estimates for the
unobserved states of the factors and predictions for the yields from
Equation 9.19. These are then used as the observed sample for the next
numerical parameter optimization step of MLE.
Note that for this version of the EFM model in state–space form, MLE is
trying to fit all of the observed rates approximately, in contrast to some other
approaches discussed previously which fit a small number of rates (equal to
the number of factors) exactly.
KF transition equation
Taking the discretization time step Δt := 1, the Euler approximation of the
SDEs for the three-factor state variables becomes the state variable transition
equation
$$\mathbf{x}_{t+1} = A x_t + c + \boldsymbol{\eta}_t, \tag{9.20}$$
where $\boldsymbol{\eta}_t \sim N(0, G)$ is the Gaussian innovation, with A, c, and G deterministic
matrix or vector-valued functions of the SDE coefficients.
KF measurement equation
The corresponding measurement equation is
$$\mathbf{y}^{obs}_t = B x_t + d + \varepsilon_t, \tag{9.21}$$
where $\mathbf{y}^{obs}_t$ corresponds to the yields observed in the market and B and d are
defined above. The centered measurement error process $\varepsilon_t$ is a K-vector of
serially independent Gaussian noise with covariance matrix H.8
Given a data series for the observed yields $y^{obs}_t$, the KF generates an
estimated expected path of the Gaussian state variables and their conditional
covariance matrix $\Sigma_{t|t-1}$.
Initialization
The filter is initialized using unconditional moments, which following
Harvey (1993) gives the initial values
$$\begin{aligned}
\hat{x}_0 &:= (I - A)^{-1} c \\
\operatorname{vec}(\Sigma_0) &:= (I - A \otimes A)^{-1} \operatorname{vec}(G),
\end{aligned} \tag{9.22}$$
where ⊗ is the Kronecker product, vec(.) is the operation of writing out a
matrix as a vector, and G is the covariance matrix of the factor dynamics
innovations η. The matrix A and the vector c are the entities in the transition
equation 9.20 and the elements of $\Sigma_0$ can be computed analytically.
8 But see Dempster and Tang (2011) regarding handling measurement error serial
correlation, which we intend to implement in future research.
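The covariance in Equation 9.22 is the fixed point of the recursion Σ = AΣA' + G, so when all eigenvalues of A lie inside the unit circle it can also be obtained by simple iteration rather than by forming the Kronecker product. The sketch below swaps in that iteration for the analytic solution; the helper names and the 2 × 2 matrices are hypothetical.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def unconditional_covariance(A, G, tol=1e-15, max_iter=100_000):
    """Iterate Sigma <- A Sigma A' + G; the limit satisfies (9.22)."""
    n = len(G)
    sigma = [row[:] for row in G]
    for _ in range(max_iter):
        propagated = mat_mul(mat_mul(A, sigma), transpose(A))
        nxt = [[propagated[i][j] + G[i][j] for j in range(n)] for i in range(n)]
        change = max(abs(nxt[i][j] - sigma[i][j])
                     for i in range(n) for j in range(n))
        sigma = nxt
        if change < tol:
            break
    return sigma


# Hypothetical stable transition matrix and innovation covariance
A = [[0.9, 0.0], [0.1, 0.5]]
G = [[0.04, 0.01], [0.01, 0.09]]
sigma0 = unconditional_covariance(A, G)
print(sigma0)
```

For a three-factor model the direct solve in Equation 9.22 involves only a 9 × 9 linear system, so the iteration is mainly useful as an independent check.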
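KF prediction
$$\begin{aligned}
\hat{x}_{t|t-1} &= A \hat{x}_{t-1} + c \\
\hat{y}_{t|t-1} &= B \hat{x}_{t|t-1} + d \\
\Sigma_{t|t-1} &= A \Sigma_{t-1} A' + G.
\end{aligned} \tag{9.23}$$
KF update
$$\begin{aligned}
v_t &:= y^{obs}_t - \hat{y}_{t|t-1} = y^{obs}_t - B \hat{x}_{t|t-1} - d \\
F_t &= B \Sigma_{t|t-1} B' + H \\
\hat{x}_t &= \hat{x}_{t|t-1} + \Sigma_{t|t-1} B' F_t^{-1} v_t \\
\Sigma_t &= \Sigma_{t|t-1} - \Sigma_{t|t-1} B' F_t^{-1} B \Sigma_{t|t-1}.
\end{aligned} \tag{9.24}$$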
Quasi-maximum likelihood parameter estimation
Let Θ denote the 14 SDE model parameters of the transition equation and
define ψ := {Θ, H}. Then the log-likelihood is given by
$$\log L(\Theta, H) = -\frac{TK}{2} \log 2\pi - \frac{1}{2} \sum_{t=1}^{T} \log \det F_t - \frac{1}{2} \sum_{t=1}^{T} v_t' F_t^{-1} v_t, \tag{9.25}$$
where K is the total number of maturities used, T is the number of time steps,
and v and F are computed using Equation 9.24.
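To make the recursion concrete, the following sketch specializes Equations 9.22 to 9.25 to a single factor and a single observed yield (K = 1), so that every matrix becomes a scalar. It is an illustration under that simplifying assumption rather than the three-factor implementation; the function name and all numbers are hypothetical.

```python
import math


def scalar_kalman_loglik(y_obs, a, c, g, b, d, h):
    """Gaussian log-likelihood (9.25) for one factor and one yield (K = 1)."""
    x_hat = c / (1.0 - a)        # unconditional mean, scalar form of (9.22)
    sigma = g / (1.0 - a * a)    # unconditional variance, scalar form of (9.22)
    loglik = 0.0
    for y in y_obs:
        # prediction step (9.23)
        x_pred = a * x_hat + c
        sigma_pred = a * sigma * a + g
        # update step (9.24)
        v = y - (b * x_pred + d)
        f = b * sigma_pred * b + h
        x_hat = x_pred + sigma_pred * b * v / f
        sigma = sigma_pred - sigma_pred * b * b * sigma_pred / f
        # contribution of this time step to (9.25)
        loglik += -0.5 * (math.log(2.0 * math.pi) + math.log(f) + v * v / f)
    return loglik, x_hat, sigma


# Hypothetical transition and measurement coefficients and three observations
ll, x_last, sigma_last = scalar_kalman_loglik(
    [1.2, 1.1, 0.9], a=0.9, c=0.1, g=0.04, b=1.0, d=0.0, h=0.01
)
print(ll, x_last, sigma_last)
```

An optimizer would call such a function repeatedly with trial parameter values, which is why the cost of a single filter pass matters for calibration time.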
The maximization of the log-likelihood is performed in two steps,
alternately optimizing Θ and H to convergence. There are two phases of the
numerical optimization: a global phase using the DIRECT global optimization
algorithm (Jones et al. 1993) to locate the region of the maximum, followed
by a local phase using an approximate conjugate gradient algorithm (Powell
1964) to locate the maximum itself.
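The two-phase idea can be illustrated on a toy problem. The sketch below is not DIRECT or Powell's method: it substitutes a coarse grid search for the global phase and a golden-section search for the local phase, applied to a hypothetical one-dimensional objective with two local minima that stands in for a negative log-likelihood.

```python
import math


def grid_search(f, lo, hi, n):
    """Global phase stand-in: return the grid cell around the best grid point."""
    step = (hi - lo) / n
    best_i = min(range(n + 1), key=lambda i: f(lo + i * step))
    x = lo + best_i * step
    return max(lo, x - step), min(hi, x + step)


def golden_section(f, lo, hi, tol=1e-10):
    """Local phase stand-in: derivative-free minimization on a bracket."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)


def neg_loglik(x):
    # Hypothetical multiextremal objective with minima near x = -1 and x = +1
    return (x * x - 1.0) ** 2 + 0.3 * x


bracket = grid_search(neg_loglik, -2.0, 2.0, 40)
x_star = golden_section(neg_loglik, *bracket)
print(bracket, x_star)
```

A purely local search started near x = +1 would stop at the inferior minimum, which is the situation the global phase is meant to avoid.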

9.3.3 Black correction for negative rates


The distribution of the instantaneous short rate r t is Gaussian in most
yield curve models. Therefore it is easy to see that it can become negative
when initialized at a low level. Black (1995) suggested a way of solving this
problem. He argued that nominal rates cannot become negative, because there
always exists the option of investing in the 0-yielding currency. Black started
from a process s t which can take negative values, which he called the shadow
short rate, and the nominal short rate is then defined as
$$\mathbf{r}_t := \max(0, \mathbf{s}_t) = 0 \vee \mathbf{s}_t, \tag{9.26}$$
where ∨ denotes join (maximum) in the natural lattice order of the real line. This
modification makes all the yields calculated through the bond price nonnegative.
Unfortunately, the shadow short rates implied by affine models lose
their linearity when modified using this idea. This makes the resulting models
difficult to calibrate. We discuss different approaches to calibration in the
following section.

FIGURE 9.4: Black JSZ model 10-year gilt rate 50-year prediction.

9.3.4 Stylized properties of Black models


Most recent papers considering Black-corrected models have been based
on the A0 (3) class. The EFM three-factor model which has proven itself in
a variety of different applications also belongs to this class. Dempster et al.
(2014) summarized in a table the stylized features satisfied by the alternative
models they evaluated, which is reproduced here as Table 9.1.
Figure 9.4, also reproduced from their paper, shows for the Black-corrected
JSZ model the quantiles of the 50-year predicted distribution for the 10-year
UK gilt rate from December 30, 2011, based on 10,000 scenarios with no
negative rates.
Recent literature has also investigated the implications of Black-corrected
affine yield curve models for the long-term reversion from near-zero rates to
more normal levels (see Swanson and Williams 2013; Krippner 2014; Chris-
tensen 2015) as well as nonlinear models (Feldhutter et al. 2015).

9.4 HPC Approaches to Calibrating Black Models


Although Black’s idea was proposed in the 1990s, the first implementation
only followed 9 years later in the work of Gorovoi and Linetsky (2004). Active
work on extensions to multifactor models started only after the crisis of 2008.
There are two main reasons for such a timeline. First, the ZLB was not
observed in the United States from the Great Depression until 2008. Only
in Japan from the mid-1990s did rates come near zero in a major economy.
Second, and perhaps more importantly, the implementation of the Black
correction is considerably more difficult (both theoretically and
computationally) than implementation of the usual ATSMs. The main problem is
the lack of a closed-form formula for the bond price given by
$$P_t(\tau) = E_t^Q\left[\exp\left(-\int_t^{t+\tau} (0 \vee \mathbf{s}_u)\,du\right)\right]. \tag{9.27}$$

TABLE 9.1: Properties of evaluated yield curve models with regard to stylized facts
Yield curve model               | CIR   | BDFS  | Vasicek | HW/JSZ | HW/JSZ/BRW | Black
Stylized fact properties        | A3(3) | A3(3) | A3(3)   | A1(3)  | A0(3)      | A0(3)
Mean reverting rates            | Yes   | Yes   | Yes     | Yes    | No         | Yes
Nonnegative rates               | Yes   | No    | No      | No     | No         | Yes
Stochastic rate volatility      | Yes   | Yes   | No      | No     | No         | Yes (a)
Closed-form bond prices         | Yes   | Yes   | Yes     | Yes    | Yes        | No
Replicates all observed curves  | No    | Yes   | Yes     | Yes    | Yes        | Yes
Good for long-term simulations  | No    | No    | No      | No     | No         | Yes
State-dependent risk premia     | No    | No    | No      | Yes    | Yes        | Yes
+ve Rate/volatility correlation | No    | No    | No      | No     | No         | Yes
Effective in low-rate regimes   | No    | No    | No      | No     | No         | Yes
(a) Rate volatilities are piecewise constant punctuated by random jumps to 0 at rate 0 boundary hitting points.

9.4.1 Three-factor Black model calibration


Calibration of nonlinear Black models with any underlying three-factor
Gaussian shadow rate is much more computationally intensive than for
the underlying affine shadow rate model and requires HPC facilities to
be undertaken successfully. This is because even linear Kalman filtering
is not computationally insignificant, likelihood functions are multiextremal
and discount bond prices or yields must be calculated for all maturities
at each discrete time point, for example, daily, over the model simulation
horizon.
The various current approaches to calibrating Black models based on affine
three-factor shadow rate models may be categorized in terms of how they
handle the three steps crucial to the process. These are:

• Method of inferring states of the latent three factors from observed mar-
ket rates
– Inverse mapping or least squares
– Extended or iterated extended Kalman filter (EKF or IEKF) using
piecewise linearization
– Unscented Kalman filter (UKF) using averaged multiple displaced KF
paths

• Method of parameter estimation for a given factor path


– Method of moments
– Maximum likelihood estimation (MLE) or quasi-maximum likelihood
estimation (QMLE)

• Method of bond price calculation


– Monte Carlo simulation
– PDE (partial differential equation) solution
– Approximation
Filtering
Christensen and Rudebusch (2013), Bauer and Rudebusch (2014), and
Lemke and Vladu (2014) use the EKF for parameter estimates. However,
Krippner (2013a) uses IEKF to fit his shadow rate approximation for the case
of two and three factors, because he found that using the EKF was not robust.
Priebsch (2013) uses the UKF. Christoffersen et al. (2014) perform a series of
comparisons of EKF with UKF and the particle filter. They conclude that the
UKF significantly outperforms the EKF and performs well compared to the
significantly more computationally expensive particle filter.
Likelihood maximization
Most of the papers on shadow rate models omit discussion of the opti-
mization methods used for this purpose. Richard (2013) mentions that he
maximizes the likelihood function by using Nelder-Mead (1965) global search
combined with Powell (1964) local search.

9.4.2 Monte Carlo bond pricing


Dempster et al. (2014) use cloud facilities for bond pricing in a three-
factor Black-corrected JSZ 4 yield curve point model using a combination
of analytical closed-form calculations of yields in the affine form given by
Equations 9.9 and 9.11 for short maturities (Joslin et al. 2011) and Monte
Carlo simulation (10,000 paths) for longer maturities. In more detail, the aver-
ages of Monte Carlo forward simulated paths are used for long rates, which
automatically take account of the convexity adjustments otherwise required
for this model. They suggest using the full unconditional likelihood function,
cf. (9.25), and multiple starting points for parameter optimization, as there
will be numerous local optima for the likelihood. With this approach filter-
ing a multicurrency EFM for over the counter (OTC) structured derivative
valuation becomes very computationally intensive.
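The Monte Carlo approach to the Black bond price (9.27) can be sketched in one dimension. The code below assumes a single Vasicek-type shadow rate simulated by an Euler scheme, which is far simpler than the three-factor models discussed here; the function name, parameters, and path counts are illustrative assumptions.

```python
import math
import random


def mc_bond_prices(s0, lam, theta, sigma, tau, n_steps, n_paths, seed=0):
    """Monte Carlo estimates of the Black price (9.27) and the shadow bond price."""
    rng = random.Random(seed)
    dt = tau / n_steps
    sum_black = 0.0
    sum_shadow = 0.0
    for _ in range(n_paths):
        s = s0
        int_black = 0.0    # running integral of max(0, s_u) du
        int_shadow = 0.0   # running integral of s_u du
        for _ in range(n_steps):
            int_black += max(0.0, s) * dt
            int_shadow += s * dt
            s += lam * (theta - s) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        sum_black += math.exp(-int_black)
        sum_shadow += math.exp(-int_shadow)
    return sum_black / n_paths, sum_shadow / n_paths


# Hypothetical low-rate starting point, 10-year maturity, monthly steps
p_black, p_shadow = mc_bond_prices(0.002, 0.2, 0.01, 0.015, 10.0, 120, 2000)
y_black = -math.log(p_black) / 10.0
print(p_black, p_shadow, y_black)
```

Because the floored rate is at least as large as the shadow rate on every path, the Black price never exceeds the shadow price computed from the same paths, and the resulting yield is nonnegative. The cost of repeating such a simulation for every maturity, date, and trial parameter vector is what makes calibration computationally intensive.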
Bauer and Rudebusch (2014) use Monte Carlo simulations (circa 500 paths
of the shadow short rate) to calculate the bond prices. They employ the EKF
to infer the states from the observed yields. However, they report using the
same workaround as Bomfim (2003), that is, estimating the parameters of
the model on the subset of data where the ZLB is not important, to com-
pare the shadow rate affine and Black models practically. Lemke and Vladu
(2014) have applied the Monte Carlo bond price calculation method and
the EKF to construct yield curves in the Eurozone. Krippner (2013b) sug-
gests using his framework results as control variates in Monte Carlo simula-
tions for calculating true Black model bond prices. Christensen and Rude-
busch (2013, 2015) apply the Krippner framework to estimate a three-factor
shadow rate model. They argue that the divergence of the Krippner approach
from the fully arbitrage-free Black approach is not very significant and is
for example well compensated for by much greater tractability. Wu and Xia
(2014) apply an approach equivalent to the Krippner framework in discrete
time.
9.4.3 PDE bond pricing


A possible key to calibration of, for example, both the JSZ and EFM mod-
els is the efficient solution for discount bond prices Pt (τ ) of all maturities τ at
each time t of a three-dimensional (3D) parabolic quasi-linear PDE of the form
$$\frac{\partial P_t(\tau)}{\partial \tau} = \sum_{i,j=1}^{3} a_{ij} \frac{\partial^2 P_t(\tau)}{\partial y_i\,\partial y_j} + \sum_{i=1}^{3} b_i \frac{\partial P_t(\tau)}{\partial y_i} + c\,P_t(\tau). \tag{9.28}$$
Kim and Singleton’s 2D alternating direction implicit (ADI) solution
method will not cope well with the 3D case (Kim and Singleton 2011; Lipton
et al. 2014). In future, we intend to investigate applying a fast robust 3D
PDE solver based on an interpolating wavelet-specified irregular mesh implicit
method that we have developed for complex derivative valuation (Jameson
1998; Carton de Wiart and Dempster 2011) and which is expected to form a
part of the Numerical Algorithms Group library in future.
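For intuition only, a one-dimensional analogue of Equation 9.28 can be solved with a basic explicit finite-difference scheme. The sketch assumes a single Vasicek-type shadow rate x, so that a = σ²/2, b = λ(θ − x), and c = −max(0, x); the grid bounds, the zero-slope edge conditions, and all parameter values are assumptions, and the scheme is unrelated to the wavelet-based solver mentioned above.

```python
import math


def black_bond_price_pde(x0, tau, lam, theta, sigma,
                         x_min=-0.10, x_max=0.20, n_x=121, n_t=100):
    """Explicit scheme for P_tau = 0.5 sigma^2 P_xx + lam (theta - x) P_x - max(0, x) P."""
    dx = (x_max - x_min) / (n_x - 1)
    dt = tau / n_t
    if sigma ** 2 * dt > dx ** 2:
        raise ValueError("time step too large for the explicit scheme")
    xs = [x_min + i * dx for i in range(n_x)]
    p = [1.0] * n_x  # initial condition P(tau = 0, x) = 1
    for _ in range(n_t):
        new = p[:]
        for i in range(1, n_x - 1):
            p_xx = (p[i + 1] - 2.0 * p[i] + p[i - 1]) / dx ** 2
            p_x = (p[i + 1] - p[i - 1]) / (2.0 * dx)
            rhs = (0.5 * sigma ** 2 * p_xx
                   + lam * (theta - xs[i]) * p_x
                   - max(0.0, xs[i]) * p[i])
            new[i] = p[i] + dt * rhs
        new[0], new[-1] = new[1], new[-2]  # zero-slope edges (an assumption)
        p = new
    # linear interpolation of the grid solution at x0
    pos = (x0 - x_min) / dx
    i = min(max(math.floor(pos), 0), n_x - 2)
    w = pos - i
    return (1.0 - w) * p[i] + w * p[i + 1]


# Hypothetical parameters: one-year bond, shadow rate well above and below zero
price_high = black_bond_price_pde(0.05, 1.0, 0.2, 0.03, 0.01)
price_low = black_bond_price_pde(-0.05, 1.0, 0.2, 0.03, 0.01)
print(price_high, price_low)
```

The number of grid nodes grows with the power of the dimension, so a scheme of this kind that is cheap in one dimension becomes expensive in three, which is one reason why efficient 3D solvers are of interest.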

9.4.4 Black model calibration progress


The investigations of Black models in the literature naturally began with
one-factor models and proceeded to the more generally accepted three-factor
models.
Gorovoi and Linetsky (2004) showed that for a shadow short rate following
a 1D diffusion process, the zero-coupon bond price can be calculated as
the Laplace transform (at the unit value of the transform parameter) of the
area functional of the shadow rate process. They applied the method of
eigenfunction expansions (see Linetsky 2002; Davydov and Linetsky 2003;
Linetsky 2004) to derive the quasi-analytic formulae (relying on Weber-Hermite
parabolic cylinder functions) for the bond prices in the Vasicek and shifted
CIR process cases. Unfortunately their method only works in the scalar case.
Gorovoi and Linetsky applied their method to estimating yield curve models
for only a single time point. However, their method was used for the Japanese
market by Ueno et al. (2006) who applied the method to a dynamic model
with a market price of risk.
The shadow rate $\mathbf{s}_t$ in these models is given by a diffusion, therefore in
state–space form the single discretized transition equation takes the form
$$\mathbf{x}_{t+1} = a x_t + c + \boldsymbol{\eta}_t. \tag{9.29}$$
The mapping that links observed yields and the shadow rate is no longer
linear but takes the piecewise linear form
$$y^{obs}_t = h(x_t). \tag{9.30}$$
Ueno et al. (2006) applied the KF with conditional linearization of
Equation 9.30 to calibrate the model. However, it was clear from their results that
further work in developing the shadow rate models would be needed. For
example, the shadow rates in their analysis reach the implausibly low levels
of −15%, which suggests model misspecification (see Ichiue and Ueno, 2006,
2007).
Both Bomfim (2003) and Kim and Singleton (2011) relied on a numerical
(finite-difference) method for solving a 2D parabolic quasi-linear bond price
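PDE given by
$$\frac{\partial P_t}{\partial \tau} - \frac{1}{2} \operatorname{tr}\left[\Sigma\Sigma' \frac{\partial^2 P_t}{\partial x\,\partial x'}\right] - \left[K(\theta - x)\right]' \frac{\partial P_t}{\partial x} + \max[0, s(x)]\,P_t = 0 \tag{9.31}$$
with boundary condition P (τ = 0, x) = 1. Here the shadow short rate s(x) is an
affine function of the 2D state x. Bomfim (2003) estimated the parameters of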
his model on the subset of data where rates were safely above zero, using an
analytical approximation, that is, the usual affine model. Kim and Singleton
(2011) used the EKF with quasi-maximum likelihood to estimate the param-
eters. Ueno et al. (2006) performed a sensitivity analysis of the two-factor
Black-corrected model without estimating the parameters.
Kim and Singleton and Ueno et al. report superior performance of the full
shadow rate models compared to their standard affine term structure equiva-
lents (used for the shadow rates). Two-factor models also produce more plau-
sible levels of the shadow rate. However, the analysis of Section 9.2 suggests
that three factors would be preferred for realistic modeling. Unfortunately,
the ADI finite difference scheme used by Kim and Singleton cannot easily be
extended to the corresponding PDE in three dimensions as noted above.
Krippner (2013b) applies a different method, which can be seen as an
approximation to the Black model. The advantage of his method is that the
forward rates have closed-form formulae. In the Black model, the price of a
bond can be expressed as
$$P_t(\tau) = P^S_t(\tau) - C^A_t(\tau, \tau; 1), \tag{9.32}$$
where $P^S_t(\tau)$ is the shadow bond price (i.e., the price of a bond in a market
where currency is not available) and $C^A_t(\tau, \tau; 1)$ is the value of an American
call option at time t with maturity in τ years and strike 1, written on the
shadow bond maturing in τ years. There is no analytic formula for $C^A_t(\tau, \tau; 1)$,
but Krippner argues that the American option can be approximated by an
analytically tractable European one and introduces an auxiliary bond price
equation
$$P^{aux}_t(\tau, \tau + \delta) = P^S_t(\tau + \delta) - C^E_t(\tau, \tau + \delta; 1), \tag{9.33}$$
where $C^E_t(\tau, \tau + \delta; 1)$ is the value of a European call option at time t with
maturity at time t + τ and strike 1 written on a shadow bond maturing at
t + τ + δ. Krippner then takes the limit with δ → 0 to obtain the nonnegative
(due to future currency availability immediately after maturity) instantaneous
forward rate as
$$f_t(\tau) = \lim_{\delta \to 0} \left\{ -\frac{\partial}{\partial \delta} \ln P^{aux}_t(\tau, \tau + \delta) \right\}. \tag{9.34}$$
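The nonnegative yield with maturity τ in Krippner’s framework is calculated as
$$y_t(\tau) = \frac{1}{\tau} \int_t^{t+\tau} f_t(s)\,ds = y^S_t(\tau) + \frac{1}{\tau} \int_t^{t+\tau} \lim_{\delta \to 0} \frac{\partial}{\partial \delta} \left[ \frac{C^E_t(\tau, \tau + \delta; 1)}{P_t(s + \delta)} \right] ds. \tag{9.35}$$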
Here ytS (τ ) are the shadow bond yields. Unfortunately, closed-form ana-
lytic expressions for the bond prices and yields are still not available, but
they can be evaluated through calculating integrals that are numerically
tractable. More importantly, Krippner’s approach is not fully arbitrage-free.
The short rates are identical under the market measure P in the Black and
Krippner frameworks, but different under the risk-neutral measure Q. Kripp-
ner’s approach is extendible to three factors. Priebsch (2013) proposes to view
the quantity
$$\log P_t(\tau) = \log E_t^Q\left[\exp\left(-\int_t^{t+\tau} (0 \vee \mathbf{s}_u)\,du\right)\right] \tag{9.36}$$
as the value at −1 of the conditional cumulant-generating function of the
random variable $\mathbf{S}_t(\tau) = \int_t^{t+\tau} \max(0, \mathbf{s}_u)\,du$ under Q. It can be expanded as
$$\log E_t^Q\left[\exp\left(-\mathbf{S}_t(\tau)\right)\right] = \sum_{j=1}^{\infty} (-1)^j \frac{\kappa^Q_j}{j!}, \tag{9.37}$$
where $\kappa^Q_j$ is the jth cumulant of $\mathbf{S}_t(\tau)$ under Q and an approximation can
be computed by taking a finite number of terms in this series. The method
of Ichiue and Ueno (2013) is equivalent to using the first-term approximation
in Equation 9.37. Priebsch (2013) evaluates both one- and two-term
approximations by analytically deriving the expression for the first two moments of
$\mathbf{S}_t(\tau)$ (see also Kim and Priebsch 2013). He shows that this technique is
sufficiently fast and accurate to fit the term structure within a half basis point
for a single time step. Priebsch notes that the Krippner (2012) approximation
tends to underestimate the arbitrage-free yields of the Black model, while
the first-order cumulant approximation tends to overestimate these yields,
suggesting a systematic error. The errors of second-order cumulant approxi-
mation do not appear to have a discernible bias in any direction. Andreasen
and Meldrum (2014) compare cumulant approximation to shadow rate models
with quadratic term structure to find that shadow rate models are better at
out-of-sample forecasts.
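The first-term truncation of the cumulant expansion can be illustrated for a single Gaussian shadow rate. The sketch assumes a one-factor Vasicek-type shadow rate, for which the conditional mean and variance are known in closed form, uses the standard formula for the mean of a floored normal variable, and integrates over the horizon with a midpoint rule; the names and parameter values are hypothetical.

```python
import math


def expected_floor(mean, std):
    """E[max(0, Z)] for Z ~ N(mean, std^2)."""
    if std <= 0.0:
        return max(0.0, mean)
    z = mean / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return mean * cdf + std * pdf


def first_order_yield(s0, lam, theta, sigma, tau, n=2000):
    """First-term truncation of the expansion: yield ~ E[S_t(tau)] / tau."""
    du = tau / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * du  # midpoint of the i-th subinterval
        mean = theta + (s0 - theta) * math.exp(-lam * u)
        var = sigma ** 2 * (1.0 - math.exp(-2.0 * lam * u)) / (2.0 * lam)
        total += expected_floor(mean, math.sqrt(var)) * du
    return total / tau


# Hypothetical Q-measure parameters with a negative starting shadow rate
y_low = first_order_yield(-0.02, 0.2, 0.0, 0.01, 5.0)
y_high = first_order_yield(0.02, 0.2, 0.0, 0.01, 5.0)
print(y_low, y_high)
```

By Jensen's inequality the first-term approximation cannot fall below the exact Black yield, which is consistent with the overestimation noted above; the second cumulant supplies the leading correction.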
Using HPC techniques, Richard (2013) estimates a full Black-corrected
shadow rate model by solving the 3D PDE (9.28) for bond prices using
an implicit numerical scheme and notes that calibration “requires a long
time, literally a month, on a large and fast computer to estimate (model
parameters).” In the latest version of his paper, the search time has been
reduced to 3 days on the Cornell supercomputer, but the methods used to
achieve this are not specified.
In summary, implementing the Black correction leads to nonlinearity of
the measurement equation, that is, of the mapping of factors/states to yields,
so that the classical KF is no longer applicable.

9.5 UKF EM Algorithm HPC Implementation


Taking account of the information in the literature reviewed above, we will
use the UKF (Julier et al. 1995; Julier and Uhlmann 1997) for our shadow rate
model. To calculate the bond yields, we will use the measurement equation
approximation
$$y_t^{\text{obs}} = 0 \vee (B x_t + d) + \varepsilon_t, \qquad (9.38)$$
where ∨ now denotes coordinate-wise maximum at each step of the UKF
dynamics. It will be demonstrated in the sequel that the computational times
for the HPC implementation of our approach are very acceptable relative to
those of the basic linear Kalman filtering algorithm and the nonlinear KF
alternatives.

9.5.1 UKF for the Black EFM model


We initialize the filter at the unconditional mean and variance of the state
variables under the P measure in the EFM model. This can be justified by
the fact that most of our datasets start before the onset of low-rate regimes.
Since only the measurement equation is nonlinear, the state prediction step is
the same as that of the linear Kalman filter in Equation 9.23.
For the factor path update step of the UKF, the state is first augmented
with the expected measurement error (here 0) of the linear KF to give
$$x^a_{t|t-1} = \left[\hat{x}_{t|t-1}^\top,\; E[\varepsilon_t]^\top\right]^\top, \qquad (9.39)$$
and the state innovation conditional covariance matrix is augmented with the
measurement error covariance matrix to give
$$\Sigma^a_{t|t-1} = \begin{bmatrix} \Sigma_{t|t-1} & 0 \\ 0 & H \end{bmatrix}. \qquad (9.40)$$
Next, a set of perturbed sigma points is constructed as
$$\begin{aligned}
\chi^0_{t|t-1} &= \hat{x}_{t|t-1} \\
\chi^j_{t|t-1} &= \hat{x}_{t|t-1} + \left(\sqrt{(L+\lambda)\,\Sigma^a_{t|t-1}}\right)_j, & j &= 1, \dots, L \qquad (9.41) \\
\chi^j_{t|t-1} &= \hat{x}_{t|t-1} - \left(\sqrt{(L+\lambda)\,\Sigma^a_{t|t-1}}\right)_{j-L}, & j &= L+1, \dots, 2L,
\end{aligned}$$
where $\sqrt{\cdot}$ denotes the matrix square root of the symmetric positive definite
augmented matrix (9.40), whose jth column augments the conditional state
vector to give the augmented conditional state vector. Here L is the dimension
of the augmented state and the scalar parameter λ is defined as
$$\lambda := \alpha^2 (L + \kappa) - L, \qquad (9.42)$$
where α and κ control the spread of the sigma points in an elliptical config-
uration around the conditional augmented state vector. The choice of these
parameters is very important for the results of the calibration. We used a
NAG code which sets α equal to 1 and κ equal to 0, but we shall see that
this is probably not the best choice.
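The sigma point construction of Equations 9.41 and 9.42 can be sketched as follows. This is an illustrative sketch only, not the NAG routine: it assumes that the mean passed in already has the augmented dimension L, and it uses the lower Cholesky factor as one admissible matrix square root. The function name `sigma_points` is hypothetical.

```python
import numpy as np


def sigma_points(x_hat, cov_aug, alpha=1.0, kappa=0.0):
    """Return the 2L+1 sigma points (rows) and lambda of Equation 9.42.

    x_hat   : (L,) mean vector (assumed here to have the augmented dimension)
    cov_aug : (L, L) symmetric positive definite augmented covariance
    """
    L = x_hat.shape[0]
    lam = alpha**2 * (L + kappa) - L  # Equation 9.42
    # Lower Cholesky factor S with S @ S.T == (L + lam) * cov_aug.
    # Its columns play the role of the matrix square root columns
    # (assumption: any valid square root could be used instead).
    S = np.linalg.cholesky((L + lam) * cov_aug)
    points = np.empty((2 * L + 1, L))
    points[0] = x_hat
    for j in range(L):
        points[1 + j] = x_hat + S[:, j]      # j = 1, ..., L
        points[1 + L + j] = x_hat - S[:, j]  # j = L+1, ..., 2L
    return points, lam
```

With κ = 0 the scaling factor L + λ equals α²L, so the displacement of the sigma points from the mean shrinks linearly in α.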
Next, the (here piecewise linear) measurement equation is evaluated at
the 2L (= 34) sigma points to obtain 2L estimates of the augmented observations as
$$\gamma^j_{t|t-1} = h(\chi^j_{t|t-1}) = 0 \vee (B\chi^j_{t|t-1} + d), \qquad j = 1, \dots, 2L. \qquad (9.43)$$

These 2L sigma point results are then combined to obtain the predicted
(here yield) measurements, measurements covariance matrix, and predicted
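state-measurement cross-covariance matrix
$$\begin{aligned}
\hat{y}_{t|t-1} &= \sum_{j=0}^{2L} W_s^j \gamma_t^j \\
\Sigma_{y_t y_t} &= \sum_{j=0}^{2L} W_c^j \left(\gamma_t^j - \hat{y}_{t|t-1}\right)\left(\gamma_t^j - \hat{y}_{t|t-1}\right)^\top \qquad (9.44) \\
\Sigma_{x_t y_t} &= \sum_{j=0}^{2L} W_c^j \left(\chi^j_{t|t-1} - \hat{x}_{t|t-1}\right)\left(\gamma_t^j - \hat{y}_{t|t-1}\right)^\top,
\end{aligned}$$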

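where the weights $W^j$ for combining sigma point estimates (predictions) are
potentially different for the state vector and the covariance matrices. They
are given by
$$\begin{aligned}
W_s^0 &:= \frac{\lambda}{L+\lambda}, & W_c^0 &:= \frac{\lambda}{L+\lambda} + (1 - \alpha^2 + \beta), \\
W_s^j &:= W_c^j := \frac{1}{2(L+\lambda)}, & j &= 1, \dots, 2L.
\end{aligned} \qquad (9.45)$$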

Here β is related to the higher moments of the state vector distribution
and is usually set to 2, which is optimal for Gaussian innovations.
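To make the role of α in the weights concrete, the following sketch evaluates Equation 9.45 directly. The function name `ukf_weights` is hypothetical, and the default values are only illustrative.

```python
import numpy as np


def ukf_weights(L, alpha=1.0, kappa=0.0, beta=2.0):
    """Mean weights W_s and covariance weights W_c of Equation 9.45."""
    lam = alpha**2 * (L + kappa) - L           # Equation 9.42
    w_s = np.full(2 * L + 1, 1.0 / (2.0 * (L + lam)))
    w_c = w_s.copy()
    w_s[0] = lam / (L + lam)                   # weight of the central point
    w_c[0] = lam / (L + lam) + (1.0 - alpha**2 + beta)
    return w_s, w_c


# Example: L = 17 as in the text (2L = 34), kappa = 0, a small alpha.
w_s, w_c = ukf_weights(17, alpha=0.0025)
```

The mean weights always sum to one, but for small α the central weight becomes large and negative (with κ = 0 it equals 1 − 1/α², about −159,999 for α = 0.0025) and is offset by large positive weights on the remaining points. This is one reason why the choice of α deserves care.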
These results are used to compute the UKF Kalman gain
$$K_t := \Sigma_{x_t y_t} \Sigma_{y_t y_t}^{-1}, \qquad (9.46)$$

which, defining $v_t := y_t^{\text{obs}} - \hat{y}_{t|t-1} = y_t^{\text{obs}} - B\hat{x}_{t|t-1} - d$, gives the updated
state estimate in observation prediction error feedback form as
$$\hat{x}_t := \hat{x}_{t|t-1} + K_t v_t \qquad (9.47)$$
with updated state covariance matrix
$$\Sigma_t = \Sigma_{t|t-1} - K_t \Sigma_{y_t y_t} K_t^\top. \qquad (9.48)$$
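The update step above can be put together in a few lines. The sketch below is not the NAG implementation and deviates from the text in one labeled respect: instead of augmenting the state with the measurement error, it uses the common additive-noise variant, in which sigma points are drawn for the state only and H is added to the measurement covariance. The function name `ukf_update` and all argument names other than B, d, and H are hypothetical.

```python
import numpy as np


def ukf_update(x_pred, P_pred, y_obs, B, d, H, alpha=1.0, kappa=0.0, beta=2.0):
    """One UKF measurement update for y = max(0, B x + d) + noise.

    Assumption: additive-noise form (no state augmentation), so H enters
    the measurement covariance directly.
    """
    n = x_pred.shape[0]
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * P_pred)
    # Rows of chi are the 2n+1 sigma points (rows of S.T are columns of S).
    chi = np.vstack([x_pred, x_pred + S.T, x_pred - S.T])
    w_s = np.full(2 * n + 1, 0.5 / (n + lam))
    w_c = w_s.copy()
    w_s[0] = lam / (n + lam)
    w_c[0] = w_s[0] + (1.0 - alpha**2 + beta)
    # "Hockey stick" measurement function applied to each sigma point.
    gamma = np.maximum(0.0, chi @ B.T + d)
    y_pred = w_s @ gamma
    dy = gamma - y_pred
    dx = chi - x_pred
    S_yy = (w_c[:, None] * dy).T @ dy + H      # measurement covariance
    S_xy = (w_c[:, None] * dx).T @ dy          # state-measurement covariance
    K = np.linalg.solve(S_yy, S_xy.T).T        # K = S_xy S_yy^{-1}
    v = y_obs - y_pred                         # prediction error
    x_new = x_pred + K @ v
    P_new = P_pred - K @ S_yy @ K.T
    return x_new, P_new, v, S_yy
```

When the floor is inactive at every sigma point the measurement map is linear and this update coincides with the linear Kalman filter update. When the floor binds at every sigma point the gain is zero and the observation carries no information about the state.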
As noted above, the choice of parameters (α, κ, β) for the UKF is very
important and is not usually detailed in the literature (but see Turner and
Rasmussen 2012). Some nonlinear models are known to exhibit UKF algorithm
divergence with certain parameter values. The other problem is inefficient
estimation because of excessive spread of the sigma points. The first issue is
not a problem here, but we begin to address the second issue in the following
section.

9.5.2 Quasi-maximum likelihood estimation


Parameter estimates in the approximate Black-corrected EM algorithm are
calibrated from the current UKF data path prediction as before by maximizing
the log-likelihood function (9.21)

$$\log L(\Theta, H) = -\frac{TK}{2}\log 2\pi - \frac{1}{2}\sum_{t=1}^{T} \log\det F_t - \frac{1}{2}\sum_{t=1}^{T} v_t^\top F_t^{-1} v_t,$$

by alternating between the parameters Θ and H, except that now the mea-
surement prediction errors in the last term of the log-likelihood are those of
the UKF and the calculation of the Ft terms from Equations 9.24 and 9.25
uses the UKF state covariance matrices Σt of Equation 9.48.
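Given the prediction errors and their covariance matrices from a filter pass, the log-likelihood above is a simple accumulation. The sketch below assumes these quantities have already been produced by the filter and are passed in as sequences; the function name `quasi_log_likelihood` is hypothetical.

```python
import numpy as np


def quasi_log_likelihood(innovations, innovation_covs):
    """Sum the Gaussian log-likelihood terms over t = 1, ..., T.

    innovations     : sequence of (K,) prediction error vectors v_t
    innovation_covs : sequence of (K, K) covariance matrices F_t
    """
    total = 0.0
    for v, F in zip(innovations, innovation_covs):
        K = v.shape[0]
        # slogdet and solve avoid forming det(F) and inv(F) explicitly.
        _, logdet = np.linalg.slogdet(F)
        quad = v @ np.linalg.solve(F, v)
        total += -0.5 * (K * np.log(2.0 * np.pi) + logdet + quad)
    return total
```

Using `slogdet` and `solve` rather than an explicit determinant and inverse is a numerical design choice of this sketch, not something specified in the text.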

9.5.3 Technical implementation


The Numerical Algorithms Group Ltd (NAG) routine g13ejc with key
parameter setting flexibility was used for the UKF implementation. The EM
calibration code was implemented in C++ with quasi-MLE optimization using
global search with DIRECT (Jones et al. 1993) followed by a local (approx-
imate) conjugate gradient algorithm (Powell 1964) which does not require
derivatives.
The Black EFM model calibration run times for currencies with 12 years
of daily data are around 4.5 hours (scaling linearly with data length) on the
system described next. This is approximately twice the time the basic KF
takes for EFM model calibration to the same data on this system.

9.5.4 HPC implementation


The development was coded in C and C++ under Linux with the use of
MPI functionality. Calculations are performed on 5 compute nodes with 32
cores in total and the following hardware:

• Node 1:
Memory: 16 GB
4 x CPU Xeon (X5550) 2.66 GHz quad core
OS: CentOS 5.7

• Nodes 2 to 5:
4 x CPU Xeon (TM) 3 GHz
Memory: 16 GB
OS: CentOS 5.7

FIGURE 9.5: Parallelization schema. [Diagram omitted: a master thread (DIRECT, Powell optimization, slaves synchronizing) coordinates slave threads 1 to 31, each running a Kalman filter, through a shared space with common variables: objective function value, predictions.]

The DIRECT global optimization algorithm to cope with the nonunimodal
likelihood function was implemented in parallel in a master–slave configuration
with synchronicity. The Powell local optimization algorithm is not
parallelizable.
The functionality actually parallelized was the full linear KF algorithm
path estimation which is the basis of the iteration step of the EM algorithm for
the EKF algorithm. The master thread controls the optimization calculation
synchronizing 31 slave threads (number of cores = number of threads = 32),
that is, providing them at each optimization step with the current best step
values of the log-likelihood objective function and calculated filter predictions
(Figure 9.5).
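The synchronous master–slave pattern described above can be illustrated in miniature. The sketch below is not the MPI implementation used for the results in this chapter: it substitutes a thread pool from the Python standard library, and the function name `evaluate_candidates` and the toy objective are hypothetical. In CPython, threads do not run pure-Python numerical code in parallel, so a process pool or MPI would be needed for real speed-up; the sketch shows only the control flow.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_candidates(objective, candidates, n_workers=31):
    """Master step: one objective evaluation (filter pass) per candidate.

    The master blocks until every worker has returned, which mirrors the
    synchronization in the schema, and then selects the best value.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        values = list(pool.map(objective, candidates))
    best = max(range(len(candidates)), key=values.__getitem__)
    return candidates[best], values[best], values


# Toy stand-in for the log-likelihood of one filter pass (assumption).
def toy_objective(theta):
    return -(theta[0] - 1.0) ** 2 - (theta[1] + 2.0) ** 2


best_theta, best_value, all_values = evaluate_candidates(
    toy_objective, [(0.0, 0.0), (1.0, -2.0), (2.0, 2.0)]
)
```

Each candidate corresponds to one parameter vector proposed by the global search; the master then uses the collected objective values to decide on the next set of candidates.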
When the Black EFM model UKF implementation is fully developed, we
will migrate it to the cloud. We have already experimented with using the
Amazon cloud for this model and have consulted previously on a compute
intensive earlier Black model yield curve commercial development which uses
this cloud (Dempster et al. 2014).

9.6 Empirical Evaluation of the Model In- and Out-of-Sample
This section contains the preliminary empirical evaluation of our approach
to developing a robust long-term nonnegative yield curve model from the
EFM Gaussian model using the Black correction. We will evaluate the Black-
corrected EFM yield curve model against the original EFM model for the
market data used to calibrate both models, both in-sample, for goodness-of-
fit, and out-of-sample, for prediction accuracy.9

9.6.1 Data
We use a combination of LIBOR data and fixed interest rate swap rates
(the ISDA fix) for each of 4 currency areas (EUR, GBP, USD, JPY) to boot-
strap the yield curve daily for 14 maturities:

3 months, 6 months, 1 year, 2 years, 3 years, 4 years, 5 years, 6 years,
7 years, 8 years, 9 years, 10 years, 20 years, 30 years

In the case of the Swiss franc (CHF), only 12 maturities are available:

3 months, 6 months, 1 year, 2 years, 3 years, 4 years, 5 years, 6 years,
7 years, 8 years, 9 years, 10 years

The calibration periods used for these 5 currencies are the following:

EUR: 02.01.2001 to 02.01.2012
CHF: 02.01.2001 to 31.05.2013
GBP: 07.10.2008 to 31.05.2013
USD: 02.01.2001 to 31.05.2013
JPY: 30.03.2009 to 31.05.2013

After the 2012 Libor scandals, ICAP (formerly InterCapital Brokers) lost
to ICE Benchmark Administration Limited its role as administrator for the
ISDA fix rates, data collection, and calculation. Major reforms in the calcu-
lation methodology are being implemented (changing sources from polls of
contributing banks to actual transaction quotes). This transfer process is not
without difficulties for data providers.
The data was obtained from Bloomberg (indices US000**, BP000**,
EE000**, JY000**, SF000** for LIBOR rates and USISDA**, BPISDB**,
JYISDA**, SFISDA** for ISDAFIX rates).

9 Longer term out-of-sample yield curve prediction has recently been independently found
to be superior to the arbitrage-free Nelson–Siegel model of Christensen et al. (2011) widely
used by central banks.

TABLE 9.2: Comparative model in-sample goodness-of-fit

Currency  Observations  Calibration             Log-likelihood  Sample fit MSE (vol) (bp)
EUR       2817          EFM                     232,652         15
                        EFM UKF                 252,500         17
                        Black EFM α := 0.0025   259,436         15
CHF       3100          EFM                     232,100          8
                        EFM UKF                 250,391         10
                        Black EFM α := 1.0      253,095          8
GBP       1171          EFM                      98,021         16
                        EFM UKF                 103,529         15
                        Black EFM α := 0.0001   105,368         14
USD       3093          EFM                     279,114         15
                        EFM UKF                 280,745         25
                        Black EFM α := 0.001    292,954         22
JPY        950          EFM                      91,014          6
                        EFM UKF                  84,564         28
                        Black EFM α := 0.006    102,544          6

9.6.2 Yield curve bootstrapping


Given the vector of parameters θ the Gaussian EFM model has rates (zero
coupon bond yields) for maturity τ := T − t of the form

$$y(t, T) = \tau^{-1}\left[A(\tau, \theta)R_t + B(\tau, \theta)X_t + C(\tau, \theta)Y_t + D(\tau, \theta)\right]. \qquad (9.49)$$

For each currency jurisdiction, we interpolate the appropriate market swap
curve linearly to obtain swap rates at all maturities and then use 1-, 3-, and
6-month LIBOR rates and the swap curve to recursively back out a zero
coupon bond yield curve for each day from the basic swap pricing equation
(Ron 2000). This gives the input data for model calibration to give the param-
eter estimates θ̂.
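The recursive step of such a bootstrap can be sketched as follows. This is a simplified illustration and not the procedure of Ron (2000) in full: it assumes annual fixed payments with unit accrual, par swap rates quoted at whole-year tenors, continuously compounded zero yields, and it ignores the LIBOR-based short end and all day-count conventions. The function name `bootstrap_zero_curve` is hypothetical.

```python
import numpy as np


def bootstrap_zero_curve(swap_tenors, swap_rates, max_years):
    """Back out discount factors and zero yields from par swap rates.

    Par condition for an n-year swap with annual payments (assumption):
        s_n * (P_1 + ... + P_n) + P_n = 1,
    which is solved recursively for P_n.
    """
    years = np.arange(1, max_years + 1)
    # Linear interpolation of the swap curve to every annual maturity;
    # np.interp holds the end values flat outside the quoted tenors.
    par = np.interp(years, swap_tenors, swap_rates)
    discount = np.empty(max_years)
    annuity = 0.0  # running sum P_1 + ... + P_{n-1}
    for i, s in enumerate(par):
        discount[i] = (1.0 - s * annuity) / (1.0 + s)
        annuity += discount[i]
    zero_yields = -np.log(discount) / years
    return years, discount, zero_yields


# Usage with illustrative (made-up) quotes at 1, 5, and 10 years.
years, discount, zero_yields = bootstrap_zero_curve(
    [1, 5, 10], [0.010, 0.020, 0.025], 10
)
```

For a flat par curve at rate s the recursion reproduces the discount factors (1 + s)^(−n), which is a convenient sanity check on any implementation.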

9.6.3 In-sample goodness-of-fit


First, let us consider statistics for overall goodness-of-fit across the entire
sample period for the five currency areas EUR, CHF, GBP, USD, and JPY,
ordered by average rates in the data period from highest to lowest average
rate. Table 9.2 shows the comparative goodness-of-fit, in terms of optimal
log-likelihood and standard deviation (vol) of the sample measurement errors
across all yields at the data maturities and all observations, of three models:
the affine EFM estimated with both the KF and UKF and the Black EFM
estimated with the UKF.
The table allows an overall comparison of the fitting errors of the original
and Black-corrected models and also of the size of the fitting error relative
to the average level of rates in the currency area. From Table 9.2 and Figures
9.6 through 9.10, we can see that the total measurement error volatility
of the best model fit is very respectably small for all currency areas, EUR,
GBP, and USD being the highest and CHF and JPY the lowest. Moreover, for
all but USD the Black-corrected EFM model goodness-of-fit equals or exceeds
that of the EFM shadow rate model.

FIGURE 9.6: In-sample EUR yield curves on August 12, 2008. [Plot omitted; series: Black, EFM, Data.]

FIGURE 9.7: In-sample CHF yield curves on August 20, 2001. [Plot omitted; series: Black, EFM, Data.]

FIGURE 9.8: In-sample GBP yield curves on February 18, 2013. [Plot omitted; series: Black, EFM, Data.]

FIGURE 9.9: In-sample USD yield curves on October 14, 2008. [Plot omitted; series: Black, EFM, Data.]
Although the three models in Table 9.2 all have the same parameter set,
their likelihoods are not generally comparable as the models are not nested
in the statistical sense. However, the likelihoods of the affine EFM model
estimated with the KF and the UKF are comparable and in all cases, except
for Japan, the UKF likelihood exceeds the KF likelihood, a reflection of the
general power of the UKF widely attested to in the literature.
We may nevertheless compare the likelihoods achieved with the UKF for
the affine EFM and nonlinear Black EFM models. Here the Black EFM like-
lihood exceeds that of the EFM in all cases. In terms of total measurement
error standard deviation, the two models are also close, but the Black EFM
again gives the lowest values.
We found that the NAG UKF code we used required careful tuning of the
α parameter of Equation 9.42 which adjusts the displacement of sigma points
from the central path for each data set. The calibration column of Table 9.2
shows that for all currencies except CHF small α parameter values closer to the
generally recommended $10^{-3}$ are appropriate for the simple piecewise linear
option “hockey stick” nonlinearity being handled here with the UKF. We are
currently at a loss to explain the anomalous case with α = 1, particularly since
JPY has an even better overall fit than CHF. This suggests that in future it
may be fruitful to consider the reparametrization of sigma point displacement
and more generally to study the properties of the algorithm theoretically,
which appears to be a lacuna in the literature to date. To this end, further
experimentation with the current algorithm is called for. In these experiments,
we should probably also consider using a nonzero β parameter in Equation
9.45 to reflect the positive skew in the Black EFM yield distributions, cf.
Figures 9.7 through 9.11.

FIGURE 9.10: In-sample JPY yield curves on November 12, 2012. [Plot omitted; series: Black, EFM, Data.]
Turning to yield curve fits on specific days, Figures 9.6 through 9.10 show
yield curve fits of the Black EFM model on representative days throughout
the data period for all five currency areas relative to both the data and the
original linear EFM alternative. (We have in fact developed software that can
show these yield curve fits stepping through every (daily) observation in the
data.) Each figure shows a typical good fit of the Black EFM model for a
single currency. Root mean square error (in basis points) based on quarterly
evaluation of the yield curve rates over 30 years (10 for CHF) is calculated
using the model expression for the yield at each quarterly maturity in terms of
the estimated parameters. These individual day figures are much smaller than
the overall figures for the Black EFM model in Table 9.2 due to relatively
few very bad fits (for all three models) on certain days. Note however the
generally nontextbook shapes of the yield curves in Figures 9.6 through 9.10,
where the Black EFM outperforms the affine EFM by following yield curve
distortions more closely.
Overall, the Black EFM model is broadly an improvement on the origi-
nal EFM model in terms of in-sample yield curve fit in all five jurisdictions.
However, USD is in general fit worse than EUR and GBP by both models
and much worse than CHF and JPY. Nevertheless, for GBP, USD, and JPY, the
Black model fits the short-end kinks (see Figures 9.8 through 9.10) signifi-
cantly better than the original model (although both models are based on
three factors). It should be noted that such nontextbook yield curve shapes in
the data period may reflect a behavioral market excess demand for short-term
bonds or have resulted from market manipulation, or both.
FIGURE 9.11: Out-of-sample EUR 10-year rate 30 year projections on May 31, 2013. [Plots omitted. Panels: EUR 10-year rate EFM and EUR 10-year rate Black EFM (Black UKF alpha = 0.0025), each with market data up to 15 July 2015. Series: quantiles Q_0.01, Q_0.05, Q_0.25, Q_0.5, Q_0.75, Q_0.95, Q_0.99 and market data. EUR 10-year rate forecast RMSE over 44 months: Black median 0.89%, EFM median 1.13%.]

9.6.4 Out-of-sample Monte Carlo projection


Figures 9.11 through 9.14 show the results of monthly out-of-sample Monte
Carlo scenario projection over a 30-year horizon using the EFM and Black
EFM models calibrated to the last day of the data period, May 31, 2013. These
are for 10-year rates in all 5 currency areas. The figures show the evolution
of the paths of the quartiles and 1% and 5% tails of the 10,000 scenario distribution.
The actual market data evolution is also plotted on these figures
up to a more recent date for each currency area: EUR and CHF, December
11, 2014; GBP and USD, January 15, 2015; and JPY, January 24, 2014.10
The out-of-sample 10-year rate median forecast root mean square error relative
to the market realization is also shown in the figures for EUR, CHF,
GBP, and USD, which are naturally higher than the comparable in-sample
figures. Surprisingly these are best for the poorest in-sample fitting USD,
for which, as for CHF, the Black EFM model prediction is superior to that
of EFM.11

10 After which date, Bloomberg dropped JPY CMS swap data.
11 The data series for JPY was deemed too short at 8 months to be significant.

FIGURE 9.12: Out-of-sample CHF 10-year rate 30 year projections on May 31, 2013. [Plots omitted. Panels: CHF 10-year rate EFM and CHF 10-year rate Black EFM (Black UKF alpha = 1.0), each with market data up to 11 December 2014. Series: quantiles Q_0.01, Q_0.05, Q_0.25, Q_0.5, Q_0.75, Q_0.95, Q_0.99 and market data. CHF 10-year rate forecast RMSE over 20 months: Black median 0.40%, EFM median 0.45%.]

FIGURE 9.13: Out-of-sample GBP 10-year rate 30 year projections on May 31, 2013. [Plots omitted. Panels: GBP 10-year rate EFM and GBP 10-year rate Black EFM (Black UKF alpha = 0.0001), each with market data up to 15 July 2015. Series: quantiles Q_0.01, Q_0.05, Q_0.25, Q_0.5, Q_0.75, Q_0.95, Q_0.99 and market data. GBP 10-year rate forecast RMSE over 27 months: Black median 0.56%, EFM median 0.68%.]

FIGURE 9.14: Out-of-sample USD 10-year rate 30 year projections on May 31, 2013. [Plots omitted. Panels: USD 10-year rate EFM and USD 10-year rate Black EFM (Black UKF alpha = 0.001), each with market data up to 15 July 2015. Series: quantiles Q_0.01, Q_0.05, Q_0.25, Q_0.5, Q_0.75, Q_0.95, Q_0.99 and market data. USD 10-year rate forecast RMSE over 27 months: Black median 0.44%, EFM median 0.51%.]

These figures demonstrate the basic negative scenario generation problem
with the Gaussian EFM model (cf. Dempster et al. 2014) and the primary
effectiveness of the nonnegative Black correction for the long-standing low-rate
Japanese economy (Figure 9.15). By way of comparison of the two-model
stochastic predictions, it should be noted that the 10-year rate scenario
distributions of the Black EFM model develop a wider, perhaps more realistic,
spread as time evolves than those of the diffusion-based affine EFM
over a 30-year horizon for all economies except the Japanese.
Intuitively, this is likely due to the pushing up to nonnegativity of the negative
scenarios of the shadow EFM model in the Black EFM model, but a deeper
theoretical understanding is left to future research.
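The quantile paths plotted in the projection figures can be computed from simulated scenarios along the following lines. The sketch is deliberately a toy: it simulates a one-factor mean-reverting Gaussian shadow short rate with made-up parameters, not the three-factor EFM or its 10-year rate, and it only illustrates how the floor at zero acts on the lower quantiles. It makes no claim about the relative spread of the two models. The function name `quantile_fan` and all parameter values are hypothetical.

```python
import numpy as np


def quantile_fan(n_paths=10_000, n_months=360, s0=0.01, mean=0.03,
                 speed=0.1, sigma=0.01, seed=0):
    """Monthly quantile paths of a toy shadow rate and its floored version."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / 12.0
    levels = [0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]
    s = np.full(n_paths, s0)
    shadow_q = np.empty((n_months, len(levels)))
    floored_q = np.empty_like(shadow_q)
    for t in range(n_months):
        shock = rng.standard_normal(n_paths)
        # Euler step of a mean-reverting Gaussian diffusion (toy dynamics).
        s = s + speed * (mean - s) * dt + sigma * np.sqrt(dt) * shock
        shadow_q[t] = np.quantile(s, levels)
        # Black correction applied scenario by scenario: max(0, shadow rate).
        floored_q[t] = np.quantile(np.maximum(0.0, s), levels)
    return levels, shadow_q, floored_q
```

In this toy setting the lower quantile paths of the shadow rate go negative while those of the floored rate are held at zero, which is the scenario-level effect of the Black correction.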
FIGURE 9.15: Out-of-sample JPY 10-year rate 30 year projections on May 31, 2013. [Plots omitted. Panels: JPY 10-year rate EFM and JPY 10-year rate Black EFM (Black UKF alpha = 0.006). Series: quantiles Q_0.01, Q_0.05, Q_0.25, Q_0.5, Q_0.75, Q_0.95, Q_0.99.]

9.7 Conclusion
This chapter explains the initial development and evaluation of a new
approximation of the Black (1995) correction to ensure nonnegative nominal
rates at all maturities for a practically effective Gaussian three-factor affine
yield curve model—the EFM model. Perhaps the most important feature of
this novel approach is the demonstrated fact that the HPC calibration of the
Black EFM model can be effected in only about double the runtime of that of
the underlying shadow rate EFM model. Although some issues with using the
UKF code have been identified here for further work, the results presented
in this paper are promising, both in- and out-of-sample. We are confident
that addressing the identified issues in future research can result in a deeper
understanding of both the Black correction and the UKF.

Acknowledgments
The research leading to these results has received funding from the Euro-
pean Union Seventh Framework Programme (FP7/2007–13) under grant
agreement no. 289032 (HPC Finance). The authors wish to acknowledge
support and helpful comments from John Holden and Martyn Byng of the
Numerical Algorithms Group, Grigorios Papamanousakis of Aberdeen Asset
Management and, particularly, Giles Thompson, senior associate of Cambridge
Systems Associates.

Appendix 9A
Given an initial set of parameters (Θ0 , H0 ), the EM algorithm for
estimating the parameters of the EFM model from market data using the
Kalman filter alternates between generating paths with the filter for the log-
likelihood function and optimizing this function in the model parameters.
Defining O(Θ, H) := log L(Θ, H), a single step of the EM algorithm for
quasi-MLE can be presented in pseudo code as follows:
Calculation of the log-likelihood function

1. Input (Θ0 , H0 )
2. for t = 1 to T do
3. KF predictions (9.19)
4. KF update (9.20)
5. Calculate a term of the log-likelihood function (9.21)
6. end for
7. Compute O(Θ, H) (9.21)
8. Output O(Θ, H)

Optimization of the log-likelihood function


The two-phase optimization algorithm is as follows:

1. Initialize parameters (Θ, H) from previous EM algorithm step
2. while |ΔO(Θ, H)| ≥ tolerance1 do
3. DIRECT optimization step of O(Θ, H) with H fixed
4. DIRECT optimization step of O(Θ, H) with Θ fixed
5. end while
6. while |ΔO(Θ, H)| ≥ tolerance2 do
7. Powell optimization of O(Θ, H) with H fixed
8. Powell optimization of O(Θ, H) with Θ fixed
9. end while
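The control flow of the two-phase algorithm can be sketched as below. The DIRECT and Powell steps themselves are not implemented here: they are represented by caller-supplied functions (`step_theta`, `step_h`), which are hypothetical stand-ins, and the toy objective in the usage example is made up so that each block update has a closed form.

```python
def alternating_maximize(objective, theta, h, step_theta, step_h,
                         tol=1e-8, max_iter=100):
    """Alternate block updates until the objective change is below tol."""
    value = objective(theta, h)
    for _ in range(max_iter):
        theta = step_theta(theta, h)   # improve theta with h fixed
        h = step_h(theta, h)           # improve h with theta fixed
        new_value = objective(theta, h)
        converged = abs(new_value - value) < tol
        value = new_value
        if converged:
            break
    return theta, h, value


def two_phase(objective, theta, h, global_steps, local_steps,
              tol1=1e-3, tol2=1e-8):
    """Coarse (global) phase followed by a fine (local) phase."""
    theta, h, _ = alternating_maximize(objective, theta, h, *global_steps, tol=tol1)
    return alternating_maximize(objective, theta, h, *local_steps, tol=tol2)


# Toy concave objective with exact block maximizers (assumption).
def toy(theta, h):
    return -(theta - 1.0) ** 2 - (h - 2.0) ** 2 - 0.5 * theta * h


steps = (lambda theta, h: 1.0 - h / 4.0, lambda theta, h: 2.0 - theta / 4.0)
theta_opt, h_opt, value_opt = two_phase(toy, 0.0, 0.0, steps, steps, tol2=1e-14)
```

In the toy example the same block updates are used in both phases purely for brevity; in the algorithm above the first phase would use a global search and the second a local derivative-free search.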

References
Andreasen, M. A. and Meldrum, A. 2014. Dynamic term structure models: The best
way to enforce the zero lower bound. CREATES Research Paper 2014–47.

Bauer, M. D. and Rudebusch, G. D. 2013. The shadow rate, Taylor rules and
monetary policy lift-off. Working Paper, Federal Reserve Bank of San Francisco,
February 2013. [Link]
[Link]

Bauer, M. D. and Rudebusch, G. D. 2014. Monetary Policy Expectations at the
Zero Lower Bound: Federal Reserve Bank of San Francisco Working Paper
2013–18.

Black, F. 1995. Interest rates as options. Journal of Finance 50, 1371–1376.

Bomfim, A. N. 2003. Interest rates as options: Assessing the markets view of the liq-
uidity trap. Working Paper 2003–45, Finance and Economics Discussion Series,
Federal Reserve Board, Washington, DC.

Carton de Wiart, B. and Dempster, M. A. H. 2011. Wavelet optimized valuation of
financial derivatives. International Journal of Theoretical and Applied Finance
14, 1113–1137.

Christensen, J. H. E. 2015. A regime-switching model of the yield curve at the zero
bound. Federal Reserve Bank of San Francisco Working Paper 2013–34.

Christensen, J., Diebold, F., and Rudebusch, G. D. 2011. The affine arbitrage-free
class of Nelson–Siegel term structure models. Working Paper, Federal Reserve
Bank of San Francisco. Journal of Econometrics 164, 4–20.

Christensen, J. H. E. and Rudebusch, G. D. 2013. Modeling Yields at the Zero Lower
Bound: Are Shadow Rates the Solution. Federal Reserve Bank of San Francisco,
working paper, December 2013, [Link]
[Link]

Christensen, J. H. E. and Rudebusch, G. D. 2015. Estimating shadow-rate term
structure models with near-zero yields. Journal of Financial Econometrics 13,
226–259.

Christoffersen, P., Dorion, C., Jacobs, K., and Karoui, L. 2014. Nonlinear Kalman fil-
tering in affine term structure models: CREATES Research Paper 14–04, Aarhus
University, January 2014.

Cox, J., Ingersoll, J., and Ross, S. 1985. A theory of the term structure of interest
rates. Econometrica 53, 363–384.

Dai, Q. and Singleton, K. J. 2000. Specification analysis of affine term structure
models. Journal of Finance 55, 1943–1978.

Davydov, D. and Linetsky, V. 2003. Pricing options on scalar diffusions: An eigenfunction
expansion approach. Operations Research 51, 185–209.

Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood for
incomplete data via the EM algorithm. Journal of the Royal Statistical Society
39, 1–38.

Dempster, M. A. H., Evans, J., and Medova, E. A. 2014. Developing a practical
yield curve model: An odyssey. In New Developments in Macro-Finance Yield
Curves. Chadha, J., Durre, A., Joyce, M., and Sarno, L., eds., Cambridge
University Press, Cambridge, pp. 251–290.

Dempster, M. A. H., Medova, E. A., and Villaverde, M. 2010. Long-term interest
rates and consol bond valuation. Journal of Asset Management 11, 113–135.

Dempster, M. A. H. and Tang, K. 2011. Estimating exponential affine models with
correlated measurement errors: Applications to fixed income and commodities.
Journal of Banking and Finance 35(3), 639–652.

Diebold, F. X. and Rudebusch, G. D. 2013. Yield Curve Modelling and Forecasting:
The Dynamic Nelson–Siegel Approach. Princeton University Press, Princeton.

Dothan, M. 1978. On the term structure of interest rates. Journal of Financial
Economics 7, 229–264.

Duffie, J. D. and Kan, R. 1996. A yield-factor model of interest rates. Mathematical
Finance 6, 379–406.

Feldhutter, P., Heyerdahl-Larsen, C., and Illeditsch, P. 2015. Risk premia and volatil-
ities in a non-linear term structure model. Working Paper, The Wharton School,
University of Pennsylvania. [Link]

Gorovoi, V. and Linetsky, V. 2004. Black’s model of interest rates as options, eigen-
function expansions and Japanese interest rates. Mathematical Finance 14(1),
49–78.

Harvey, A. C. 1993. Time Series Models, Second edition. Harvester-Wheatsheaf,
Hemel Hempstead.

Ho, T. and Lee, S. 1986. Term structure movements and pricing interest rate con-
tingent claims. Journal of Finance 41, 1011–1029.

Homer, S. and Sylla, R. 2005. A History of Interest Rates. Wiley, Hoboken, NJ, p.1.

Hull, J. and White, A. 1990. Pricing interest rate derivative securities. Review of
Financial Studies 3, 573–592.

Ichiue, H. and Ueno, Y. 2006. Monetary policy and the yield curve at zero inter-
est: The macro-finance model of interest rates as options. Working Paper 06-
E-16, Bank of Japan. [Link] rev/wps 2006/
data/[Link]

Ichiue, H. and Ueno, Y. 2007. Equilibrium interest rates and the yield curve in a
low interest rate environment. Working Paper 07-E-18, Bank of Japan.

Ichiue, H. and Ueno, Y. 2013. Estimating term premia at zero bound: An analysis
of Japanese, US and UK yields. Working Paper 13-E-8, Bank of Japan.

Jameson, L. 1998. A wavelet-optimized, very high order adaptive grid and order
numerical method. SIAM Journal on Scientific Computing 19, 1980–2013.

James, J. and Webber, N. 2000. Interest Rate Modelling. Wiley, Chichester.

Jones, D. R., Perttunen, C. D., and Stuckman, B. E. 1993. Lipschitzian optimization
without the Lipschitz constant. Journal of Optimization Theory and Applications
79, 157–181.

Jong, F. de, Driessen, J., and Pelsser, A. 2001. Libor market models versus swap
market models for pricing interest rate derivatives: An empirical analysis. Euro-
pean Finance Review 5, 201–237.

Joslin, S., Singleton, K. J., and Zhu, H. 2011. A new perspective on Gaussian
dynamic term structure models. Review of Financial Studies 24, 926–970.

Julier, S. J., Uhlmann, J. K., and Durrant-Whyte, H. 1995. A new approach for
filtering nonlinear systems. In Proceedings of the American Control Conference,
1628–1632.

Julier, S. J. and Uhlmann, J. K. 1997. A new extension of the Kalman filter to non-
linear systems. International Symposium on Aerospace/Defense Sensing, Sim-
ulation and Control, Signal Processing, Sensor Fusion, and Target Recognition
VI 3, 182–193.

Kim, D. H. and Priebsch, M. A. 2013. Estimation of multi-factor shadow-rate term
structure models. Working Paper, Federal Reserve Board, October 9, 2013.

Kim, D. H. and Singleton, K. J. 2011. Term structure models and the zero bound: An
empirical investigation of Japanese yields. Journal of Econometrics 170, 32–49.

Krippner, L. 2012. Modifying Gaussian term structure models when interest rates
are near the zero lower bound. Discussion Paper 2012/02, Reserve Bank of New
Zealand.

Krippner, L. 2013a. A tractable framework for zero lower bound Gaussian term
structure models. CAMA Working Paper No. 49/2013, Australian National
University.

Krippner, L. 2013b. Faster solutions for Black zero lower bound term structure
models. CAMA Working Paper No. 66/2013, Australian National University.
A Practical Robust Long-Term Yield Curve Model 313

Krippner, L. 2014. Measuring the stance of monetary policy in conventional and


unconventional environments. CAMA Working Paper No. 6/2014, Australian
National University.

Lemke, W. and Vladu, A. L. 2014. A shadow-rate term structure for the Euro Area.
[Link]/events/pdf/conferences/140908/lemke [Link].

Linetsky, V. 2002. Exotic spectra. RISK, April 2002, 85–89.

Linetsky, V. 2004. The spectral decomposition of the option value. International


Journal of Theoretical and Applied Finance 7, 337–384.

Lipton, A., Gal, A., and Lasis, A. 2014. Pricing of vanilla nad first-generation exotic
options in the local stochastic volatility framework: Survey and results. Quan-
titative Finance 14, 1899–1922.

Litterman, R. and Scheinkman, J. A. 1991. Common factors affecting bond returns.


Journal of Fixed Income 1, 54–61.

Medova, E. A., Rietbergen, M. I., Villaverde, M., and Yong, Y. S. 2005. Modelling
the long term dynamics of yield curves. Working Paper 09/2005, Centre for
Financial Research, Judge Business School, University of Cambridge.

Nawalkha, S. K. and Rebonato, R. 2011. What interest rate models to use? Buy side
versus sell side. SSRN Electronic Journal, 01/2011.

Nelder, J. A. and Mead, R. 1965. A simplex method for function minimization.


Computer Journal 7, 308–313.

Nelson, C. R. and Siegel, A. F. 1987. Parsimonious modeling of yield curves. Journal


of Business 60, 473–489.

Powell, J. D. 1964. An efficient method of finding the minimum of a function of


several variables without calculating derivatives. Computer Journal 11, 302–
304.

Priebsch, M. A. 2013. Computing arbitrage-free yields in multi-factor Gaussian


shadow-rate term structure models. Working Paper 2013–63, Finance and Eco-
nomics Discussion Series, Federal Reserve Board.

Rebonato, R. 2015. Review of Yield Curve Modelling and Forecasting—The


Dynamic Nelson–Siegel Approach by Diebold, F. X. and Rudebusch, G. D.
Quantitative Finance 15, 1609–1612.

Rebonato, R. and Cooper, I. 1995. The limitations of simple two-factor interest rate
models. Journal of Financial Engineering 5, 1–16.

Richard, S. F. 2013. A non-linear macroeconomic term structure model. Working


Paper, University of Pennsylvania.

Ron, U. 2000. A practical guide to swap curve construction. Working Paper 2000–17,
Bank of Canada, August 2000.
314 High-Performance Computing in Finance

Swanson, E. T. and Williams, J. C. 2013. Measuring the effect of the zero lower
bound on medium- and longer-term interest rates. Working Paper, Federal
Reserve Bank of San Francisco, January 2013. [Link]
research/files/[Link]

Turner, R. and Rasmussen, C. E. 2012. Model based learning of sigma points in


unscented Kalman filtering. Neurocomputing 80, 47–53.

Ueno, Y., Baba, N., and Sakurai, Y. 2006. The use of the Black model of interest
rates as options for monitoring the JGB market expectations. Working Paper
06E15, Bank of Japan. [Link] rev/wps 2006/
data/[Link]

Vasicek, O. 1977. An equilibrium characterization of the term structure. Journal of


Financial Economics 5, 177–188.

Wu, J. C. and Xia, F. D. 2014. Measuring the macroeconomic impact of mone-


tary policy at the zero lower bound. Working Paper, Booth School of Busi-
ness, University of Chicago, July 2014. [Link]
wu/research/pdf/[Link]

Yong, Y. S. 2007. Scenario Generation for Dynamic Fund Management. PhD Thesis,
Centre for Financial Research, Judge Business School, University of Cambridge.
Chapter 10
Algorithmic Differentiation

Uwe Naumann, Jonathan Hüser, Jens Deussen, and Jacques du Toit

CONTENTS
10.1 Motivation
10.2 Introduction
10.2.1 Tangent mode AD
10.2.2 Adjoint mode AD
10.2.3 Second derivatives
10.2.4 Review of AD in finance
10.3 Implementation
10.3.1 Checkpointing
10.3.2 Implicit functions
10.3.2.1 Linear systems
10.3.2.2 Nonlinear systems
10.3.2.3 Convex optimization
10.3.3 Smoothing
10.3.4 Preaccumulation
10.3.5 Further issues
10.4 Case Studies
10.4.1 European option
10.4.2 American option pricing
10.4.3 Nearest correlation matrix
10.5 Summary and Conclusion
References

10.1 Motivation
Inspired by Giles and Glasserman [1], Algorithmic Differentiation (AD)
[2,3] has been gaining popularity in computational finance over recent years.
Adjoint AD (AAD) in particular facilitates a paradigm shift in financial mod-
elling through provision of first-order sensitivities at a relative computational
cost which is independent of the number of sensitivities asked for.
For illustration, we consider a simple European call option written on an
underlying driven by a local volatility process. Let $S = (S_t)_{t \ge 0}$ be the solution
to the stochastic differential equation (SDE)

\[ dS_t = r S_t \, dt + \sigma(\log(S_t), t) \, S_t \, dW_t, \tag{10.1} \]

where $W = (W_t)_{t \ge 0}$ is a standard Brownian motion, $r > 0$ is the risk-free interest
rate, and $\sigma$ is the local volatility function. The price of the call option is
then given by

\[ V = e^{-rT} \, \mathbb{E}(S_T - K)^+ \tag{10.2} \]

for maturity $T > 0$ and strike $K > 0$. The local volatility $\sigma = \sigma(x, t)$ is
computed from the market observed implied volatility surface using bicubic
spline interpolation.
To compute the call price $V$ from Equation 10.2, we apply a simple Euler–Maruyama
scheme to the log process $X_t = \log(S_t)$, which satisfies the SDE

\[ dX_t = \left( r - \tfrac{1}{2} \sigma^2(X_t, t) \right) dt + \sigma(X_t, t) \, dW_t. \tag{10.3} \]

Setting $\Delta = T/M$ for some integer $M$ and defining a sequence of Monte Carlo
time steps $t_i = i\Delta$ for $i = 1, 2, \ldots, M$, we set

\[ X_{t_{i+1}} = X_{t_i} + \left( r - \tfrac{1}{2} \sigma^2(X_{t_i}, t_i) \right) \Delta + \sigma(X_{t_i}, t_i) \sqrt{\Delta} \, Z_i, \tag{10.4} \]

where each $Z_i$ is a standard normal random number and $X_{t_0} = \log(S_0)$.


N sample paths of the log process are generated and used in a Monte Carlo
integral of Equation 10.2 to estimate the price V and to obtain a confidence
interval.
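The scheme of Equation 10.4 can be sketched in a few lines of C++. This is only an illustration under simplifying assumptions: the function names `local_vol` and `mc_call_price` are hypothetical, the local volatility is a constant placeholder instead of the bicubic spline used in the chapter, and no confidence interval is computed.

```cpp
#include <cmath>
#include <cstddef>
#include <random>

// Hypothetical local volatility sigma(x, t); a constant placeholder here
// (the chapter interpolates a market implied volatility surface instead).
inline double local_vol(double /*x*/, double /*t*/) { return 0.2; }

// Monte Carlo estimate of V = exp(-r T) E(S_T - K)^+ using the
// Euler-Maruyama scheme of Equation 10.4 applied to X_t = log(S_t).
double mc_call_price(double S0, double K, double T, double r,
                     std::size_t N, std::size_t M, unsigned seed = 42) {
  std::mt19937_64 gen(seed);
  std::normal_distribution<double> normal(0.0, 1.0);
  const double dt = T / static_cast<double>(M);
  const double sqrt_dt = std::sqrt(dt);
  double payoff_sum = 0.0;
  for (std::size_t p = 0; p < N; ++p) {      // N sample paths
    double X = std::log(S0);                 // X_{t_0} = log(S_0)
    for (std::size_t i = 0; i < M; ++i) {    // M time steps
      const double t = static_cast<double>(i) * dt;
      const double sig = local_vol(X, t);
      X += (r - 0.5 * sig * sig) * dt + sig * sqrt_dt * normal(gen);
    }
    const double ST = std::exp(X);
    payoff_sum += (ST > K) ? (ST - K) : 0.0; // call payoff
  }
  return std::exp(-r * T) * payoff_sum / static_cast<double>(N);
}
```

With the constant placeholder volatility the estimate can be checked against the Black–Scholes price, which is a convenient sanity test before a real volatility surface is plugged in.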
In Figure 10.1, we compare run times of central finite difference (CFD)
approximation and AAD for the computation of sensitivities of $V$ with respect
to the input parameters $K$, $T$, $r$, $S_0$, and a set of $n - 4$ market observed implied
volatilities for $N = 10^5$, $M = 360$, and growing values of $n$. The cost of CFD
relative to the cost of a single price calculation scales with $n$. The AAD solution
exhibits an essentially constant relative cost of less than 10 price calculations
independent of the size of $n$. The AD tool dco/c++ [4] is used throughout
this chapter.

[FIGURE 10.1: Motivation. Run time (s) plotted against n for FD and AAD.]

10.2 Introduction
AD is a semantic transformation¹ of a given computer code called the
primal code or primal function. In addition to computing the primal function
value, the transformed code also computes the derivatives of the primal
function with respect to a specified set of parameters.
Consider a computer implementation of a function $F$ mapping $\mathbb{R}^n \times \mathbb{R}^{\tilde{n}}$
into $\mathbb{R}^m \times \mathbb{R}^{\tilde{m}}$. We are interested in derivatives of the first $m$ outputs of $F$
(the active outputs) with respect to the first $n$ inputs (the active inputs). The
second $\tilde{m}$ outputs and second $\tilde{n}$ inputs of $F$ are termed the passive outputs and
passive inputs, respectively. For example, an active output may be the Monte
Carlo price of an option while a passive output may be the corresponding
confidence interval. An active input may be the initial asset price $S_0$, while
a passive input may be the set of random numbers used in the Monte Carlo
simulation. Without loss of generality and to keep the notation simple, we
restrict the discussion to scalar active outputs, that is, $m = 1$.² We therefore
consider multivariate functions of the type

\[ F : \mathbb{R}^n \times \mathbb{R}^{\tilde{n}} \to \mathbb{R} \times \mathbb{R}^{\tilde{m}}, \quad (y, \tilde{y}) = F(x, \tilde{x}), \tag{10.5} \]

where we assume that $F$ and its computer implementation are differentiable up
to the required order.³ We are interested in (semi-)automatic ways to generate
the vector of all partial derivatives of the active output $y$ with respect to the
active inputs $x$, that is, the gradient

\[ \nabla F = \nabla F(x, \tilde{x}) \equiv \left( \frac{\partial y}{\partial x_i} \right)^T_{i=0,\ldots,n-1} \in \mathbb{R}^{1 \times n}, \tag{10.6} \]

along with the values of all active and passive outputs as functions of the
active and passive inputs. Similarly, we look for the Hessian of all second
partial derivatives of $y$ with respect to $x$, that is,

\[ \nabla^2 F = \nabla^2 F(x, \tilde{x}) \equiv \left( \frac{\partial^2 y}{\partial x_j \partial x_i} \right)_{j,i=0,\ldots,n-1} \in \mathbb{R}^{n \times n}. \tag{10.7} \]

¹ That is, changes the meaning.
² The modifications for $m > 1$ are straightforward.
³ Differentiability is a crucial prerequisite for (algorithmic) differentiation. Nondifferentiability
at selected points can be handled by smoothing techniques as well as by combinations
of AD with local finite difference approximations (see Section 10.3.3).

10.2.1 Tangent mode AD


In the following, we use the notation from Reference 3. Tangent (also:
forward) mode AD yields the tangent function

\[ F^{(1)} : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^{\tilde{n}} \to \mathbb{R} \times \mathbb{R} \times \mathbb{R}^{\tilde{m}}. \]

A corresponding tangent code implements $(y, y^{(1)}, \tilde{y}) := F^{(1)}(x, x^{(1)}, \tilde{x})$,
where

\[ y^{(1)} := \nabla F(x, \tilde{x}) \cdot x^{(1)} \tag{10.8} \]

is the directional derivative of $F$ in direction $x^{(1)}$. We use superscripts $(1)$
to denote first-order tangents. The operator := represents the assignment of
imperative programming languages, not to be confused with equality = in
the mathematical sense. The entire gradient can be calculated entry by entry
with $n$ runs of the tangent code through a process known as seeding and
harvesting. The vector $x^{(1)}$ in Equation 10.8 is successively set equal to each
of the Cartesian basis vectors in $\mathbb{R}^n$ (it is seeded), the tangent code is run,
and the corresponding gradient entry is harvested from $y^{(1)}$. The gradient
is computed with machine accuracy while the computational cost is
$O(n) \cdot \mathrm{Cost}(F)$, the same as that of a finite difference approximation.
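Seeding and harvesting can be illustrated with a minimal overloaded tangent type. The sketch below is not the dco/c++ interface; the type `Tangent`, the sample function `f`, and the driver `gradient_by_tangents` are hypothetical names chosen for this example, and only the three operations needed by $y = e^{\sin(x_0 \cdot x_1)}$ are overloaded.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal tangent-mode active type: value v and directional derivative t.
struct Tangent {
  double v;
  double t;
};

inline Tangent operator*(const Tangent& a, const Tangent& b) {
  return {a.v * b.v, a.t * b.v + a.v * b.t};  // product rule
}
inline Tangent sin(const Tangent& a) {
  return {std::sin(a.v), std::cos(a.v) * a.t};
}
inline Tangent exp(const Tangent& a) {
  const double e = std::exp(a.v);
  return {e, e * a.t};
}

// Sample primal function written against the active type:
// y = exp(sin(x0 * x1)); expects exactly two inputs.
inline Tangent f(const std::vector<Tangent>& x) { return exp(sin(x[0] * x[1])); }

// Gradient by seeding and harvesting: one tangent run per active input.
std::vector<double> gradient_by_tangents(const std::vector<double>& x) {
  const std::size_t n = x.size();
  std::vector<double> g(n);
  for (std::size_t i = 0; i < n; ++i) {
    std::vector<Tangent> xt(n);
    for (std::size_t j = 0; j < n; ++j)
      xt[j] = {x[j], (i == j) ? 1.0 : 0.0};  // seed x^(1) with basis vector e_i
    g[i] = f(xt).t;                          // harvest dy/dx_i from y^(1)
  }
  return g;
}
```

The loop over `i` makes the $O(n) \cdot \mathrm{Cost}(F)$ complexity visible: every gradient entry costs one full evaluation of the overloaded function.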

10.2.2 Adjoint mode AD


Adjoint (also: reverse) mode AD yields the adjoint function

\[ F_{(1)} : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^{\tilde{n}} \times \mathbb{R} \to \mathbb{R} \times \mathbb{R}^{\tilde{m}} \times \mathbb{R}^n. \]

A corresponding adjoint code implements $(y, \tilde{y}, x_{(1)}) := F_{(1)}(x, x_{(1)}, \tilde{x}, y_{(1)})$,
where

\[ x_{(1)} := x_{(1)} + \nabla F(x, \tilde{x})^T \cdot y_{(1)}. \tag{10.9} \]

The adjoint code therefore increments given adjoints $x_{(1)}$ of the active inputs
with the product of the gradient and a given adjoint $y_{(1)}$ of the active output
(see Reference 3 for details). We use subscripts $(1)$ to denote first-order
adjoints. Initializing (seeding) $x_{(1)} = 0$ and $y_{(1)} = 1$ yields the gradient in $x_{(1)}$
from a single run of the adjoint code. Again, the gradient is computed with
machine accuracy. The computational cost no longer depends on the size of
the gradient $n$.
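The incremental form of Equation 10.9 and the seeding $x_{(1)} = 0$, $y_{(1)} = 1$ can be made concrete with a hand-written adjoint of the sample function $y = e^{\sin(x_0 \cdot x_1)}$. The function name `f_adjoint` and its signature are assumptions made for this sketch only.

```cpp
#include <array>
#include <cmath>

// Hand-written adjoint of y = exp(sin(x0 * x1)).
// On exit, xa has been incremented by (dy/dx)^T * ya, as in Equation 10.9.
double f_adjoint(const std::array<double, 2>& x, std::array<double, 2>& xa,
                 double ya) {
  // forward section: compute and keep the intermediate values
  const double u = x[0] * x[1];
  const double w = std::sin(u);
  const double y = std::exp(w);
  // reverse section: propagate adjoints in reverse order of the statements
  const double wa = y * ya;            // adjoint of y = exp(w)
  const double ua = std::cos(u) * wa;  // adjoint of w = sin(u)
  xa[0] += x[1] * ua;                  // adjoint of u = x0 * x1
  xa[1] += x[0] * ua;
  return y;
}
```

A single call with `xa = {0, 0}` and `ya = 1` leaves the full gradient in `xa`, regardless of how many inputs the function has; calling it with nonzero `xa` shows the incremental semantics.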

10.2.3 Second derivatives


The second derivative is the first derivative of the first derivative. Second
derivative code can therefore be obtained by any of the four combinations of
tangent and adjoint modes. In tangent-over-tangent mode AD tangent mode
is applied to Equation 10.8 yielding the second-order tangent function

\[ F^{(1,2)} : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^{\tilde{n}} \to \mathbb{R} \times \mathbb{R} \times \mathbb{R} \times \mathbb{R} \times \mathbb{R}^{\tilde{m}}. \]

A corresponding second-order tangent code implements

\[ (y, y^{(2)}, y^{(1)}, y^{(1,2)}, \tilde{y}) := F^{(1,2)}(x, x^{(2)}, x^{(1)}, x^{(1,2)}, \tilde{x}), \]

where

\[ \begin{aligned} y^{(2)} &:= \nabla F(x, \tilde{x}) \cdot x^{(2)} \\ y^{(1,2)} &:= {x^{(1)}}^T \cdot \nabla^2 F(x, \tilde{x}) \cdot x^{(2)} + \nabla F(x, \tilde{x}) \cdot x^{(1,2)}. \end{aligned} \tag{10.10} \]

The computational cost of Hessian accumulation is of the same order as that of
a second-order finite difference approximation. Obviously, the accuracy of the
latter is typically far from satisfactory, particularly for calculations performed
in single precision floating-point arithmetic.
Symmetry of the Hessian implies mathematical equivalence of the three
remaining second-order adjoint modes (see Reference 3 for details). We therefore
focus on one of them. In tangent-over-adjoint mode AD tangent mode is
applied to Equation 10.9 yielding the second-order adjoint function

\[ F_{(1)}^{(2)} : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^{\tilde{n}} \times \mathbb{R} \times \mathbb{R} \to \mathbb{R} \times \mathbb{R} \times \mathbb{R}^{\tilde{m}} \times \mathbb{R}^n \times \mathbb{R}^n. \]

A corresponding second-order adjoint code implements

\[ (y, y^{(2)}, \tilde{y}, x_{(1)}, x_{(1)}^{(2)}) := F_{(1)}^{(2)}(x, x^{(2)}, x_{(1)}, x_{(1)}^{(2)}, \tilde{x}, y_{(1)}, y_{(1)}^{(2)}), \]

where

\[ \begin{aligned} y^{(2)} &:= \nabla F(x, \tilde{x}) \cdot x^{(2)} \\ x_{(1)}^{(2)} &:= x_{(1)}^{(2)} + y_{(1)} \cdot \nabla^2 F(x, \tilde{x}) \cdot x^{(2)}. \end{aligned} \tag{10.11} \]

The full Hessian can be obtained by $n$ runs of the second-order adjoint code.
The reduction in computational complexity due to the initial application of
adjoint mode to the primal code is therefore carried over to the second-order
adjoint. Sparsity of the Hessian can and should be exploited [5].
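Tangent-over-tangent mode can be sketched by nesting a templated tangent type inside itself. The names `Dual`, `sin_`, `cos_`, `exp_`, `f`, and `hessian_entry` are hypothetical and the sketch supports nesting to second order only; it illustrates Equation 10.10 with $x^{(1,2)} = 0$, so that the harvested value is ${x^{(1)}}^T \cdot \nabla^2 F \cdot x^{(2)}$.

```cpp
#include <cmath>

// Tangent type templated on its base type so that it can be nested.
template <typename T>
struct Dual {
  T v;  // value
  T t;  // tangent
};

template <typename T>
Dual<T> operator+(const Dual<T>& a, const Dual<T>& b) {
  return {a.v + b.v, a.t + b.t};
}
template <typename T>
Dual<T> operator*(const Dual<T>& a, const Dual<T>& b) {
  return {a.v * b.v, a.t * b.v + a.v * b.t};
}

// Elemental functions for double and, recursively, for Dual<T>.
inline double sin_(double a) { return std::sin(a); }
inline double cos_(double a) { return std::cos(a); }
inline double exp_(double a) { return std::exp(a); }
template <typename T>
Dual<T> cos_(const Dual<T>& a) {  // only instantiated with T = double here
  return {cos_(a.v), -sin_(a.v) * a.t};
}
template <typename T>
Dual<T> sin_(const Dual<T>& a) {
  return {sin_(a.v), cos_(a.v) * a.t};
}
template <typename T>
Dual<T> exp_(const Dual<T>& a) {
  const T e = exp_(a.v);
  return {e, e * a.t};
}

// Sample function y = exp(sin(x0 * x1)), generic in the active type.
template <typename T>
T f(const T& x0, const T& x1) { return exp_(sin_(x0 * x1)); }

// Hessian entry d^2 y / (dx_i dx_j): seed the outer tangent with e_i and the
// inner tangent with e_j, leave the second-order seed at zero, and harvest
// the second-order tangent of the result.
double hessian_entry(double x0, double x1, int i, int j) {
  using D2 = Dual<Dual<double>>;
  D2 a{{x0, j == 0 ? 1.0 : 0.0}, {i == 0 ? 1.0 : 0.0, 0.0}};
  D2 b{{x1, j == 1 ? 1.0 : 0.0}, {i == 1 ? 1.0 : 0.0, 0.0}};
  return f(a, b).t.t;
}
```

Accumulating all entries this way takes one run per pair $(i, j)$, which matches the cost order of second-order finite differences stated above while avoiding their truncation error.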

10.2.4 Review of AD in finance


Starting with the paper of Giles and Glasserman [1] in 2006, AAD has
increasingly been adopted in computational finance applications. Since then
there have been several contributions to the literature, utilizing AAD for both
the calculation of Greeks and for calibration.
Leclerc et al. extended the pathwise approach of Giles and Glasserman
to Greeks for Bermudan-style derivatives [6]. Denson and Joshi [7] applied
the pathwise AAD method for Greeks to a LIBOR market model and
Joshi has several other contributions in the field with different collaborators
[8–11]. Capriotti et al. applied AAD for fast Greeks in various different applica-
tions including PDEs, credit risk, Bermudan-style options, and XVA [12–16].
Antonov also used AAD for XVA and callable exotics [17,18]. Contributions
in the area of calibration include those of Turinici [19], Kaebe et al. [20],
Schlenkrich [21], and Henrard [22]. For Greeks in the context of discontinuous
payoffs, Giles introduced the Vibrato Monte Carlo method [23]. The problem
of discontinuity is also treated in works related to second-order Greeks [24,25].
The above is not meant as a complete review of AAD applications in
finance. Further related work can be found in the references within the cited
publications.

10.3 Implementation
AD constitutes a set of rules for deriving first- and higher-order tangent
and adjoint versions of a primal numerical simulation code. Two fundamental
approaches to implementing AD are distinguished: source transformation
and operator overloading.
Source transformation rewrites the given primal code yielding a corresponding
derivative code usually in the same programming language. For
example, consider an implementation of $f : \mathbb{R}^2 \to \mathbb{R}$, $y = f(x) = e^{\sin(x_0 \cdot x_1)}$
as

    #include <cmath>  // exp, sin, cos
    void f(double x, double& y) { x *= y; y = exp(sin(x)); }

where $x \mathrel{\hat{=}} x_0$ and $y \mathrel{\hat{=}} x_1$ on input and with output $y \mathrel{\hat{=}} y$. The first-order
tangent version returns the directional derivative

\[ y^{(1)} := \frac{\partial y}{\partial x} \cdot x^{(1)} \]

in addition to the function value, for example,

    void tf(double x, double& tx, double& y, double& ty) {
      tx = tx * y + x * ty; x *= y;           // tangent of the product, then the product
      y = exp(sin(x)); ty = y * cos(x) * tx;  // value and tangent of exp(sin(x))
    }

Each arithmetic statement is augmented with its directional derivative (also
tangent).
A first-order adjoint version returns the adjoint directional derivative

\[ x_{(1)} := x_{(1)} + \left( \frac{\partial y}{\partial x} \right)^T \cdot y_{(1)} \]

in addition to the function value as

    void af(double x, double& ax, double& y, double& ay) {
      // augmented forward section
      double s = x; x *= y; double v = exp(sin(x));
      // reverse section
      double av = ay * v * cos(x); ax += av * y; ay = av * s; y = v;
    }

Adjoints of all arithmetic statements are executed in reverse order (see reverse
section of the adjoint code). They require intermediate values computed in the
augmented forward section of the adjoint code. Some of these values may need
to be recorded prior to getting lost due to overwriting, for example, s. An in-
depth discussion of adjoint code generation rules is beyond the scope of this
article. Refer to Reference 3 for a more detailed description. Manual source
transformation turns out to be tedious, error-prone, and hard to maintain from
a software evolution perspective. Preprocessing tools have been developed for
many years providing reasonable coverage of Fortran and C in addition to
various simpler special-purpose scripting languages [26].
Currently, there is no mature source transformation tool for C++. The
method of choice for implementing AD for C++ programs is based on
operator and function overloading, typically combined with advanced metaprogramming
techniques using the static (compile-time) polymorphism provided by
C++ templates. Instead of parsing the primal source code followed by unparsing
a differentiated version, the semantics of operators and intrinsic functions
are redefined. Overloading for a custom active data type yields augmented
operations. For example, in basic tangent mode the active data type consists
of a value and a directional derivative component. All operations are overloaded
for such pairs yielding, for example, $(y, y^{(1)}) := (\sin(x), \cos(x) \cdot x^{(1)})$.
In basic adjoint mode, the operations are overloaded to record a tape of the pri-
mal computation. Conceptually, the tape can be regarded as a directed acyclic
graph with vertices representing the inputs to the program as well as all oper-
ations performed to compute its outputs. Edges represent data dependencies.
They can be labeled with the local partial derivative of the value represented
by its target vertex with respect to the value represented by its source. An
example is shown in Figure 10.2a. Adjoints are computed by interpretation of
the tape. The tape interpreter eliminates all intermediate vertices in reverse
topological order by introducing new edges connecting all predecessor vertices
with all successors. New edges are labeled with the product of the local partial
derivatives on the corresponding incoming and outgoing edges. Parallel edges
are merged by adding their labels. The vertex to be eliminated is removed
together with its incident edges; see Figure 10.2b and c for elimination of ver-
tices 3 and 2 from the tape in Figure 10.2a. Tape interpretation amounts to
a sequence of fused multiply–add (fma; the elemental operation of the chain
rule) operations whose length is of the order of the number of edges in the
tape. The resulting bipartite graph contains only edges representing nonzero
Jacobian/gradient entries.
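A tape and its interpreter can be sketched in well under a hundred lines. The following is a deliberately minimal illustration, not the design of dco/c++ or any other tool: the names `Node`, `Tape`, `Var`, `tape`, `input`, and `interpret` are hypothetical, a single global tape is assumed, and the interpreter propagates adjoints backward through the recorded edges, which performs one fma per edge as described above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One tape entry per vertex: up to two incoming edges, each given by the
// index of its source vertex and the local partial derivative as its label.
struct Node {
  int arg[2];
  double partial[2];
};

struct Tape {
  std::vector<Node> nodes;
  int push(int a0, double p0, int a1 = -1, double p1 = 0.0) {
    Node n = {{a0, a1}, {p0, p1}};
    nodes.push_back(n);
    return static_cast<int>(nodes.size()) - 1;
  }
};

inline Tape& tape() { static Tape t; return t; }  // global tape (assumption)

// Active type: a value plus the index of its vertex on the tape.
struct Var {
  double v;
  int idx;
};

inline Var input(double v) { return {v, tape().push(-1, 0.0)}; }
inline Var operator*(const Var& a, const Var& b) {
  return {a.v * b.v, tape().push(a.idx, b.v, b.idx, a.v)};
}
inline Var sin(const Var& a) {
  return {std::sin(a.v), tape().push(a.idx, std::cos(a.v))};
}
inline Var exp(const Var& a) {
  const double e = std::exp(a.v);
  return {e, tape().push(a.idx, e)};
}

// Tape interpretation: visit vertices in reverse recording order and
// perform one fused multiply-add per edge.
std::vector<double> interpret(const Var& y, double ya = 1.0) {
  const std::vector<Node>& nodes = tape().nodes;
  std::vector<double> adj(nodes.size(), 0.0);
  adj[y.idx] = ya;  // seed the adjoint of the output
  for (int i = y.idx; i >= 0; --i)
    for (int k = 0; k < 2; ++k)
      if (nodes[i].arg[k] >= 0)
        adj[nodes[i].arg[k]] += nodes[i].partial[k] * adj[i];
  return adj;
}
```

Recording `exp(sin(x0 * x1))` produces the five vertices of Figure 10.2a; after interpretation the entries of the returned vector at the input indices hold the gradient.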
[FIGURE 10.2: AAD by overloading and tape interpretation. Panel (a): tape with vertices 0: x0, 1: x1, 2: ∗, 3: sin, 4: y (exp) and edge labels [x1], [x0], [cos(v2)], [y]. Panel (b): after elimination of vertex 3, with edge label [y ⋅ cos(v2)]. Panel (c): after elimination of vertex 2, with edge labels [x1 ⋅ (y ⋅ cos(v2))] and [x0 ⋅ (y ⋅ cos(v2))].]


Basic adjoint mode exploits neither structural nor mathematical
properties of the given primal code. It automatically records the sequence of
operations including data dependencies and required values on a tape followed
by purely sequential interpretation of the tape. Real-world adjoints feature
sophisticated extensions to be discussed in further detail in the following.
The efficiency (or relative cost) of AD codes is usually measured as the
ratio

\[ R = \frac{\text{Run time of given AD code}}{\text{Run time of primal code}}. \tag{10.12} \]

For basic adjoint mode, the ratio R typically ranges between 5 and ∞ (out of
memory) depending on the level of optimization applied to the primal code,
its computational complexity, the amount of tape memory available, and the
speed of tape accesses. Compared to tangent mode AD or finite differences, R
gives an indication of how large the gradient must be before adjoint methods
become attractive.
Basic adjoint mode is likely to yield infeasible tape memory requirements
for real-world applications. Checkpointing techniques have been proposed to
overcome this problem. They can be regarded as special cases of a more general
approach to handling gaps in the tape. Let therefore the primal function

\[ (y, \tilde{y}) = F(x, \tilde{x}) = (F^q \circ \cdots \circ F^1)(x, \tilde{x}) \]

be decomposable into a sequence of elemental function⁴ evaluations $x^i = F^i(x^{i-1})$
for $i = 1, \ldots, q$ and where $x^0 = x$ and $y = x^q$. Adjoint AD computes

\[ x_{(1)} := x_{(1)} + (\nabla F^1)^T \cdot \Big( \ldots \cdot \overbrace{\Big( (\nabla F^k)^T \cdot \underbrace{\big( (\nabla F^{k+1})^T \cdot \ldots \cdot \big( (\nabla F^q)^T \cdot x^q_{(1)} \big) \ldots \big)}_{x^k_{(1)}} \Big)}^{x^{k-1}_{(1)}} \ldots \Big) \]

assuming availability of adjoint elementals $x^{i-1}_{(1)} = (\nabla F^i)^T \cdot x^i_{(1)}$ for $i = q, \ldots, 1$.
Let $x^{k-1}_{(1)} = (\nabla F^k)^T \cdot x^k_{(1)}$ for some $k \in \{1, \ldots, q\}$ not be treated in basic adjoint
mode. A gap is induced in the tape to be filled by some alternative approach to
computing $x^{k-1}_{(1)}$ given $x^k_{(1)}$. Potential scenarios include checkpointing, adjoints
of implicit functions, smoothing of locally nondifferentiable functions, and
coupling with hand-written adjoint code potentially running on accelerators
such as GPUs. Brief discussions of these topics including references for further
reading are presented in the following sections.

⁴ In basic adjoint mode, elemental functions are the arithmetic operations built into the
given programming language. In general, elemental functions can be arbitrarily complex
subfunctions of $F$, such as single Monte Carlo path evaluations, individual time steps in an
integration scheme, numerical methods for solving systems of linear or nonlinear equations,
or calls to black-box routines.

[FIGURE 10.3: Adjoint of evolution: tape (a); store-all (b); recompute-all (c); and checkpointing (d).]

10.3.1 Checkpointing
Consider Figure 10.3 for motivation. Basic adjoint mode applied to $x := F^i(x)$
for $i = 1, \ldots, q$ ($q = 3$ in Figure 10.3) uses a store-all strategy. It
generates a tape of size $q$ assuming unit tape size for the individual $F^i$. The
total primal operations count⁵ adds up to $q$ assuming unit primal operations
count per $F^i$. A tape similar to Figure 10.3a is generated for all $F^i$ by running
$\overrightarrow{F^i}$ for $i = 1, \ldots, q$ followed by its interpretation by $\overleftarrow{F^j}$ for $j = q, \ldots, 1$ (see
Figure 10.3b). The tape memory requirement reaches its minimum $1 + \epsilon$ in
a recompute-all strategy by checkpointing the original inputs of size $\epsilon \ll 1$
($\downarrow F^1$) followed by the evaluation of $F^i$ for $i = 1, \ldots, q - 1$, the generation
of the tape for $F^q$ ($\overrightarrow{F^q}$) and its interpretation ($\overleftarrow{F^q}$). Repeated accesses to the
checkpoint ($\uparrow F^1$) enable the recursive application of this data flow reversal
scheme for $i = q - 1, \ldots, 1$ (see Figure 10.3c). The primal operations count
grows quadratically with $q$ yielding 6 for $q = 3$. Figure 10.3d illustrates a data
flow reversal scheme built on two checkpoints. It reduces the primal operations
count to 5 at the expense of additional memory required to store the second
checkpoint yielding a total memory requirement of $1 + 2\epsilon$.

⁵ Number of evaluations of the primal function. The adjoint operations count is invariant
with respect to different data flow reversal schemes as the adjoint of each primal operation
is evaluated exactly once.

[FIGURE 10.4: Adjoint of ensemble: tape (a) and pathwise adjoints (b).]

Single- and multilevel checkpointing schemes have been proposed in the
literature [27]. The fundamental combinatorial optimization problem of min-
imizing the primal operations count given an upper bound on the available
memory for storing tape and checkpoints is known to be NP-complete [28].
Efficient algorithms for its solution exist for relevant special cases such as
evolutions [29] similar to the previous example. They form the core of many
iterative algorithms including Crank–Nicolson schemes used in the context of
finite difference methods for solving parabolic partial differential equations
(see also Section 10.4).
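The trade-off between the store-all and recompute-all strategies of Figure 10.3 can be reproduced on a toy evolution. In this sketch every step is the same hypothetical elemental function $F(x) = \sin(x)$, the names `store_all` and `recompute_all` are illustrative, and the taping run of a step is counted as one primal evaluation so that the counts match those quoted in the text.

```cpp
#include <cmath>
#include <vector>

struct Result {
  double x_adjoint;  // d x^q / d x^0 obtained by seeding the output adjoint with 1
  int primal_evals;  // primal operations count
};

inline double F(double x) { return std::sin(x); }  // one evolution step
inline double F_adjoint(double x_in, double xa) {  // adjoint of one step
  return std::cos(x_in) * xa;
}

// Store-all: record the input of every step, then interpret in reverse.
Result store_all(double x0, int q) {
  std::vector<double> stored;
  double x = x0;
  for (int i = 1; i <= q; ++i) { stored.push_back(x); x = F(x); }
  double xa = 1.0;
  for (int i = q; i >= 1; --i) xa = F_adjoint(stored[i - 1], xa);
  return {xa, q};
}

// Recompute-all: keep only the checkpoint x0 and recompute the input of
// step i from it each time it is needed.
Result recompute_all(double x0, int q) {
  double xa = 1.0;
  int evals = 0;
  for (int i = q; i >= 1; --i) {
    double x = x0;                                      // restore the checkpoint
    for (int j = 1; j < i; ++j) { x = F(x); ++evals; }  // advance to step i
    ++evals;                                            // taping run of step i
    xa = F_adjoint(x, xa);                              // interpretation
  }
  return {xa, evals};
}
```

Both strategies return the same adjoint; for $q = 3$ the counts are 3 and 6, respectively, that is $q$ versus $q(q+1)/2$. Schemes with several checkpoints sit between these two extremes.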
A second special case with particular relevance to finance is Monte Carlo
sampling for solving SDEs as, for example, in Section 10.1 (see also Section 10.4).
Refer to Figure 10.4 for illustration. Adjoints of such ensembles
can be computed very efficiently by exploiting the absence of data dependencies
among the individual paths ($F^i$) drawing from a common set of active
inputs $x$. Their results are typically computed in parallel followed by a reduction
to a (often scalar) value $y$ by some function $R$. A gap in the tape is
induced by checkpointing $x$ (with size in memory equal to $\epsilon$) as an input to
the $F^i$ (e.g., $\downarrow F^1$) followed by a passive evaluation of the primal ensemble and
the generation and interpretation of the tape for $R$. Adjoints can be computed
individually for each path after recovering $x$ (e.g., $\uparrow F^1$). The maximum memory
requirement is limited to $1 + \epsilon$ under the assumption that a single path has
unit memory requirement exceeding the memory occupied by the tape of $R$.
The primal operations count is roughly doubled, that is, approximately equal
to $2q$, where again $q = 3$ in Figure 10.4. Parallelization of pathwise adjoint
computation turns out to be straightforward. Potential conflicts need to be
handled when writing to $x_{(1)}$.
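The pathwise scheme can be sketched as follows. The path function $F^i(x) = e^{x_0 + x_1 z_i}$ with passive random numbers $z_i$, the choice of the arithmetic mean as reduction $R$, and all function names are assumptions made for this example; the adjoint of each path is written by hand instead of being taped.

```cpp
#include <cmath>
#include <vector>

// Hypothetical single path F^i(x) = exp(x[0] + x[1] * z_i); z_i is passive.
inline double path(const std::vector<double>& x, double z) {
  return std::exp(x[0] + x[1] * z);
}

// Adjoint of one path: increments xa by (dF^i/dx)^T * pa.
inline void path_adjoint(const std::vector<double>& x, double z, double pa,
                         std::vector<double>& xa) {
  const double p = path(x, z);  // path is re-evaluated (primal count ~ 2q)
  xa[0] += p * pa;
  xa[1] += p * z * pa;
}

// y = R(F^1(x), ..., F^q(x)) with R the arithmetic mean. Returns y and
// accumulates the gradient in xa, treating one path at a time.
double ensemble_adjoint(const std::vector<double>& x,
                        const std::vector<double>& z,
                        std::vector<double>& xa) {
  const double q = static_cast<double>(z.size());
  double y = 0.0;
  for (double zi : z) y += path(x, zi) / q;  // passive ensemble and reduction
  const double ya = 1.0;                     // seed the adjoint of y
  for (double zi : z)                        // adjoint of R: dR/dF^i = 1/q
    path_adjoint(x, zi, ya / q, xa);         // x plays the role of the checkpoint
  return y;
}
```

In a parallel version the second loop would be distributed over threads, and the increments of `xa` are exactly the writes to $x_{(1)}$ that need to be protected or reduced.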

10.3.2 Implicit functions


In this section, we consider implementations of $F : \mathbb{R}^n \to \mathbb{R}$ as
$y = F(x) = F^3(F^2(F^1(x)))$, where $F^2$ is a numerical method for evaluating an implicit
function defined, for example, by systems of linear or nonlinear equations or
as an unconstrained convex nonlinear optimization problem.

10.3.2.1 Linear systems

Let $F^1 : \mathbb{R}^n \to \mathbb{R}^{n \times n} \times \mathbb{R}^n : (A, b) = F^1(x)$, $F^2 : \mathbb{R}^{n \times n} \times \mathbb{R}^n \to \mathbb{R}^n : s = F^2(A, b)$
such that $s$ is the solution of the system of linear equations
$A \cdot s = b$, and $F^3 : \mathbb{R}^n \to \mathbb{R} : y = F^3(s)$. Symbolic adjoint differentiation
of the solution $s$ with respect to $A$ and $b$ in direction $s_{(1)}$ yields the linear
system

\[ A^T \cdot b_{(1)} = s_{(1)}; \tag{10.13} \]

see, for example, Reference 30. The adjoint system matrix turns out to have
unit rank,

\[ A_{(1)} = -b_{(1)} \cdot s^T. \tag{10.14} \]

Equation 10.14 uses the primal solution $s$. Hence, its evaluation needs to be
preceded by a single run of the primal solver to obtain (a sufficiently precise
approximation of) $s$ and a factorization of the system matrix $A$ if a direct
solver is used. This factorization (LU, QR, . . .) can be reused for the solution of
Equation 10.13 [31]. Consequently, both memory requirement and operations
count can be reduced from $O(n^3)$ when using basic adjoint AD to $O(n^2)$ when
differentiating the linear system symbolically.

10.3.2.2 Nonlinear systems

Consider $F^1 : \mathbb{R}^n \to \mathbb{R}^n \times \mathbb{R}^k : (s^0, \lambda) = F^1(x)$, $F^2 : \mathbb{R}^n \times \mathbb{R}^k \to \mathbb{R}^n : s = F^2(s^0, \lambda)$
with $s$ being the solution of the system of parameterized
nonlinear equations $N(s, \lambda) = 0$, and $F^3 : \mathbb{R}^n \to \mathbb{R} : y = F^3(s)$. Let $F^2$
be implemented by Newton's method with initial estimate of the solution $s^0$,
parameter vector $\lambda \in \mathbb{R}^k$, and $\nu$ denoting the number of Newton iterations
performed. Basic adjoint AD applied to $F^2$ yields both memory requirement
and operations count of $O(\nu \cdot n^3)$; a direct solver is assumed to be used for
the solution of the Newton system, which can be differentiated symbolically as
shown in Section 10.3.2.1. Alternatively, according to Reference 32 the adjoint
Newton solver needs to compute a solution to the linear system

\[ \frac{\partial N}{\partial s}(s, \lambda)^T \cdot z = -s_{(1)} \]

followed by a single call of the adjoint of $N$ seeded with the solution $z$ which
gives

\[ \lambda_{(1)} := \lambda_{(1)} + \frac{\partial N}{\partial \lambda}(s, \lambda)^T \cdot z. \]

Again, the (sufficiently accurate) primal solution is required. Both memory
requirement ($O(n^2)$) and operations count ($O(n^3)$) of computing $\lambda_{(1)}$ are
reduced significantly. The derivative of $s$ with respect to $s^0$ vanishes identically
at the solution.
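A scalar example shows how the adjoint of a Newton solver can be obtained from the converged solution alone. The residual $N(s, \lambda) = s^3 + s - \lambda$ and all function names below are hypothetical; with $n = k = 1$ the linear system for $z$ reduces to a division.

```cpp
#include <cmath>

// Hypothetical residual N(s, lambda) = s^3 + s - lambda and its partials.
inline double N(double s, double lambda) { return s * s * s + s - lambda; }
inline double dN_ds(double s, double /*lambda*/) { return 3.0 * s * s + 1.0; }
inline double dN_dlambda(double /*s*/, double /*lambda*/) { return -1.0; }

// Primal solver: Newton's method for N(s, lambda) = 0 started at s0.
inline double newton(double s0, double lambda, int max_iter = 50,
                     double tol = 1e-14) {
  double s = s0;
  for (int it = 0; it < max_iter; ++it) {
    const double step = N(s, lambda) / dN_ds(s, lambda);
    s -= step;
    if (std::fabs(step) < tol) break;
  }
  return s;
}

// Adjoint of the solver via the implicit function theorem. Returns the
// increment of lambda_(1) for a given adjoint s1 of the solution; the
// Newton iterations themselves are never recorded.
inline double newton_adjoint(double s, double lambda, double s1) {
  const double z = -s1 / dN_ds(s, lambda);  // (dN/ds)^T * z = -s1
  return dN_dlambda(s, lambda) * z;         // (dN/dlambda)^T * z
}
```

For $\lambda = 2$ the solution is $s = 1$ and the returned increment for $s_{(1)} = 1$ equals $ds/d\lambda = 1/(3s^2 + 1) = 0.25$, independent of the number of Newton iterations and of the initial estimate.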

10.3.2.3 Convex optimization

Consider $F^1 : \mathbb{R}^n \to \mathbb{R}^n \times \mathbb{R}^k : (s^0, \lambda) = F^1(x)$, $F^2 : \mathbb{R}^n \times \mathbb{R}^k \to \mathbb{R}^n : s = F^2(s^0, \lambda)$
with $s$ being the solution of the unconstrained convex nonlinear
optimization problem $\arg\min_{s \in \mathbb{R}^n} G(s, \lambda)$, and $F^3 : \mathbb{R}^n \to \mathbb{R} : y = F^3(s)$.
$F^2$ can be regarded as root finding for the first-order optimality condition
$\frac{\partial G}{\partial s}(s, \lambda) = 0$. Consequently, the computation of $\lambda_{(1)}$ amounts to solving the
linear system

\[ \frac{\partial^2 G}{\partial s^2}(s, \lambda)^T \cdot z = -s_{(1)} \]

followed by a single call of the second-order adjoint version of $F$ at the solution
$z$ to obtain

\[ \lambda_{(1)} := \lambda_{(1)} + \frac{\partial^2 G}{\partial \lambda \partial s}(s, \lambda)^T \cdot z. \]

Savings in computational complexity are obtained as in Section 10.3.2.2. Similar
comments apply. See Reference 33 for a discussion of this method in the
context of calibration.

10.3.3 Smoothing
AD is based on the assumption that the given implementation of the target
function $F : \mathbb{R}^n \to \mathbb{R}^m$ is continuously differentiable at all points of interest.
This prerequisite is likely to be violated in many practical applications.
Generalized derivatives have been proposed to overcome this problem; see, for
example, References 34 and 35 for recent work in this area.
In pricing of financial derivatives, nonsmoothness is often induced by
branches in the flow of control depending, for example, on a strike price. An
option may be exercised or not. Any data flow dependence of the predicted
price or payoff $p$ on strike $K$ is lost suggesting independence of the sensitivity
of $p$ from $K$, which is obviously false. Local finite differencing as well as various
smoothing techniques can be used to potentially overcome this problem.
For example, in sigmoidal smoothing [36] the nonsmooth function

\[ f(S) = \begin{cases} f_1(S), & S < K, \\ f_2(S), & \text{otherwise} \end{cases} \]

is replaced by $f(S) = (1 - \sigma_s(S)) f_1(S) + \sigma_s(S) f_2(S)$, where

\[ \sigma_s(S) = \frac{1}{1 + e^{-\frac{S - K}{\alpha}}} \]

and the width of the transition $\alpha$ controls the quality of the approximation.
A case study is discussed in Section 10.4.2.
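The effect of sigmoidal smoothing on the strike sensitivity can be seen on a digital payoff, chosen here purely as an illustration ($f_1 = 0$, $f_2 = 1$). The function names are hypothetical. A branch `S < K ? 0.0 : 1.0` carries no data flow from `K` to the result, whereas the smoothed version has a well-defined derivative with respect to $K$.

```cpp
#include <cmath>

// Logistic transition centred at K with width alpha > 0.
inline double sigmoid(double S, double K, double alpha) {
  return 1.0 / (1.0 + std::exp(-(S - K) / alpha));
}

// Smooth blend of the branch values f1 (taken for S < K) and f2 (otherwise).
inline double smooth_branch(double f1, double f2, double S, double K,
                            double alpha) {
  const double w = sigmoid(S, K, alpha);
  return (1.0 - w) * f1 + w * f2;
}

// Example: smoothed digital payoff, which equals the sigmoid itself.
inline double smoothed_digital(double S, double K, double alpha) {
  return smooth_branch(0.0, 1.0, S, K, alpha);
}

// Derivative of the smoothed digital payoff with respect to the strike K.
inline double smoothed_digital_dK(double S, double K, double alpha) {
  const double w = sigmoid(S, K, alpha);
  return -w * (1.0 - w) / alpha;
}
```

Smaller values of `alpha` approximate the original branch more closely but make the derivative more sharply peaked around $S = K$, which is the trade-off referred to as the quality of the approximation.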

10.3.4 Preaccumulation
Preaccumulation is a technique for speeding up the computation of adjoints
while at the same time reducing the size of the tape. It comes in various flavors.
As an example, we consider adjoint versions of implementations of a simulation
$F : \mathbb{R}^n \to \mathbb{R}^m$ as $y = F(x) = F^3(F^2(F^1(x)))$, where $F^2 : \mathbb{R}^k \to \mathbb{R}^l$ is
assumed to yield a tape with $q$ edges (local partial derivatives). Without loss
of generality, the Jacobian $\nabla F^2 \in \mathbb{R}^{l \times k}$ is assumed to be dense (see
footnote 6). Hence, the number of scalar fma operations required for its
evaluation in tangent mode AD is $k \cdot q$. Accumulation of the overall Jacobian
$\nabla F \in \mathbb{R}^{m \times n}$ in adjoint mode induces a local computational cost of $m \cdot q$
due to $m$ interpretations of the tape of $F^2$. Preaccumulation of $\nabla F^2$ in
tangent mode yields a tape with $k \cdot l$ edges. The contribution of $F^2$ to the
total cost of accumulating $\nabla F$ without preaccumulation ($m \cdot q$) potentially
exceeds the cumulative cost of preaccumulation of $\nabla F^2$ ($k \cdot q$) followed by
interpretation of the compressed tape (adding $m \cdot k \cdot l$). Moreover, no tape
is required for preaccumulating $\nabla F^2$, yielding a reduction of the overall
tape size by $q - k \cdot l$, assuming unit memory size per tape entry.
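
The operation counts in this paragraph can be compared directly. The following sketch only evaluates the formulas quoted above; the function names and the sample sizes are hypothetical and chosen for illustration.

```python
def cost_without_preaccumulation(m, q):
    """m interpretations of the q-edge tape of F^2."""
    return m * q


def cost_with_preaccumulation(m, k, l, q):
    """Tangent-mode preaccumulation (k * q) plus m interpretations
    of the compressed tape with k * l edges."""
    return k * q + m * k * l


def tape_size_reduction(k, l, q):
    """Tape entries saved, assuming unit memory size per tape entry."""
    return q - k * l


# Hypothetical sizes: many outputs, a narrow but expensive middle section F^2.
m, k, l, q = 100, 2, 3, 1000
plain = cost_without_preaccumulation(m, q)
compressed = cost_with_preaccumulation(m, k, l, q)
saved = tape_size_reduction(k, l, q)
```

Preaccumulation pays off when k · q + m · k · l is smaller than m · q, which is typical when F^2 has few inputs and outputs relative to the number of operations it performs.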
Alternative scenarios for preaccumulation include the repeated use of a
local Jacobian as part of an iteration in the enclosing adjoint computation. In
this case, the local Jacobian should be preaccumulated (cached) potentially
yielding significant savings in terms of tape size and run time. An exponential
number (in terms of the size of the tape) of different scenarios for preaccumu-
lation result from the associativity of the chain rule. Determining the optimal
method turns out to be computationally intractable [37]. Further theory is
discussed in Reference 38.

10.3.5 Further issues


Variety and complexity of available hardware platforms and of correspond-
ing software yield a large number of further issues to be taken into account
for the design of adjoint solutions. Relevant topics include the integration of
(hand-written) adjoint source code into tape-based solutions, AD of parallel
code, AD on accelerators (e.g., GPUs), adjoint numerical libraries, and the
handling of black-box code in the context of AD. In any case, the augmen-
tation of a large simulation software package with AD capabilities remains
a challenging conceptual as well as software engineering task. AD tools can
simplify this process tremendously depending on their levels of robustness and
efficiency and the flexibility of their application programming interface. Once
started, AD needs to become an element of the overall software development
strategy.

6 Jacobian compression methods based on coloring techniques enter the scene in case of sparsity [5].

10.4 Case Studies


The selection of case studies presented in this section is by no means
complete. Our intention is to give the reader some impression of potential use
cases for AAD in computational finance.

10.4.1 European option


The material in this section is based on a more extensive discussion in
Reference 4. We consider the computation of first- and second-order Greeks
for the simple option pricing problem described in Section 10.1. In addition
to the previously outlined Monte Carlo approach, we investigate a Crank–
Nicolson method for the solution of the corresponding initial and boundary
value problem for a partial differential equation (PDE). To recast the pricing
problem in a PDE setting, we extend the value V into a value function
$V : \mathbb{R} \times [0, T] \to \mathbb{R}^+$ given by

\[ V(x, t) = e^{-r(T-t)} \, E_{x,t}\left[ \left( e^{X_T} - K \right)^+ \right], \tag{10.15} \]

where $E_{x,t}$ denotes expectation with respect to the measure under which the
Markov process X starts at time $t \in [0, T]$ at the value $x \in \mathbb{R}$. Standard
results from the theory of Markov processes then show that V satisfies the
parabolic PDE
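\[ 0 = \frac{\partial}{\partial t} V(x, t) + \left( r - \tfrac{1}{2} \sigma^2(x, t) \right) \frac{\partial}{\partial x} V(x, t) + \tfrac{1}{2} \sigma^2(x, t) \frac{\partial^2}{\partial x^2} V(x, t) - r V(x, t) \quad \text{for } (x, t) \in \mathbb{R} \times [0, T), \tag{10.16} \]

\[ (e^x - K)^+ = V(x, T) \quad \text{for all } x \in \mathbb{R}. \tag{10.17} \]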

To these we add the asymptotic boundary conditions

\[ \lim_{x \to -\infty} V(x, t) = 0 \quad \text{for all } t \in [0, T], \tag{10.18} \]

\[ \lim_{x \to \infty} V(x, t) = e^{-r(T-t)} (e^x - K) \quad \text{for all } t \in [0, T]. \tag{10.19} \]

The system is solved by a Crank–Nicolson scheme as described in Reference 39.
We use AAD to compute the same sensitivities as before.
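
To make the PDE setting concrete, the following is a minimal Crank–Nicolson sketch for Equations 10.16 through 10.19. It is not the scheme of Reference 39 and not the code benchmarked in this section. It assumes a constant volatility σ, a uniform grid in x truncated at a hypothetical `width` multiple of σ√T around log(S0), central differences in space, and a dense linear solve per time step; all names and default parameters are illustrative.

```python
import numpy as np


def crank_nicolson_call(s0, strike, r, sigma, T, n_x=400, n_t=200, width=6.0):
    """European call value at (x, t) = (log(s0), 0); constant sigma assumed."""
    x0 = np.log(s0)
    half = width * sigma * np.sqrt(T)
    x = np.linspace(x0 - half, x0 + half, n_x + 1)  # x[n_x // 2] is x0 for even n_x
    dx, dt = x[1] - x[0], T / n_t
    mu = r - 0.5 * sigma**2

    # Central-difference coefficients of L V = mu V_x + 0.5 sigma^2 V_xx - r V.
    lower = 0.5 * sigma**2 / dx**2 - 0.5 * mu / dx
    diag = -sigma**2 / dx**2 - r
    upper = 0.5 * sigma**2 / dx**2 + 0.5 * mu / dx
    n_int = n_x - 1
    L = (np.diag(np.full(n_int, diag))
         + np.diag(np.full(n_int - 1, lower), -1)
         + np.diag(np.full(n_int - 1, upper), 1))
    A = np.eye(n_int) - 0.5 * dt * L  # implicit half of the step
    B = np.eye(n_int) + 0.5 * dt * L  # explicit half of the step

    def upper_bc(tau):
        # Boundary condition (10.19) at the right edge, with tau = T - t.
        return np.exp(-r * tau) * (np.exp(x[-1]) - strike)

    v = np.maximum(np.exp(x) - strike, 0.0)  # terminal condition (10.17)
    for step in range(n_t):
        tau_old, tau_new = step * dt, (step + 1) * dt
        rhs = B @ v[1:-1]
        # Contribution of the known right boundary values; the left one is zero (10.18).
        rhs[-1] += 0.5 * dt * upper * (upper_bc(tau_old) + upper_bc(tau_new))
        v[1:-1] = np.linalg.solve(A, rhs)
        v[0], v[-1] = 0.0, upper_bc(tau_new)
    return v[n_x // 2]
```

A production solver would use a tridiagonal solver instead of the dense one, and an adjoint of the time loop would need the evolution checkpointing discussed in Section 10.3.1 to keep the tape size manageable.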
Table 10.1 compares peak memory requirements and elapsed run times
for the Monte Carlo solutions without and with (ensemble) checkpointing for
$N = 10^5$ sample paths at $M = 360$ time steps each with central finite
differences (CFDs) on our reference computer with 3 GB of main memory
available and swapping to disk disabled. Basic adjoint mode (mc/a1s) yields
infeasible memory requirements for gradient sizes n ≥ 22. Exploiting the
ensemble property in mc/a1s_ensemble yields adjoints with machine precision
at a relative cost R of about 4 to 11 primal function evaluations over the
gradient sizes considered.

TABLE 10.1: Run times and peak memory requirements as a function of gradient size n for dco/c++ first-order adjoint code vs. central finite differences for the Monte Carlo code from Section 10.1

n     mc/primal   mc/cfd      mc/a1s          mc/a1s_ensemble    R
10    0.3 s       6.1 s       1.8 s (2 GB)    1.3 s (1.9 MB)     4.3
22    0.4 s       15.7 s      (>3 GB)         2.3 s (2.2 MB)     5.7
34    0.5 s       29.0 s      (>3 GB)         3.0 s (2.5 MB)     6.0
62    0.7 s       80.9 s      (>3 GB)         5.1 s (3.2 MB)     7.3
142   1.5 s       423.5 s     (>3 GB)         12.4 s (5.1 MB)    8.3
222   2.3 s       1010.7 s    (>3 GB)         24.4 s (7.1 MB)    10.6

Note: Basic adjoint mode fails for n ≥ 22 due to prohibitive memory requirement. The relative computational cost R is given for mc/a1s_ensemble. Although theoretically constant, R is sensitive to specifics such as compiler flags, memory hierarchy and cache sizes, and level of optimization of the primal code.

TABLE 10.2: Accuracy of selected (ith) forward and central finite difference gradient entries vs. AD for the Monte Carlo code with scenario n = 10

i    mc/ffd                  mc/cfd
0    0.982097033091          0.982097084181
7    −0.0716705322265        −0.071666955947
4    0.346174240112          0.346131324768

i    mc/t1s                  mc/a1s_ensemble
0    0.982097083159485       0.982097083159484
7    −0.0716660568246482     −0.0716660568246484
4    0.346126820239318       0.346126820239324

Note: Top row shows best case (rel. err. ≈ 1e−8), bottom row worst case (rel. err. ≈ 6e−5) while middle row is a representative value (rel. err. ≈ 2e−5).

Table 10.2 shows the accuracy of gradient entries computed via finite dif-
ferences (forward and central) compared with AD. Figures are for the smallest
problem (gradient size n = 10).
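
The kind of accuracy gap reported in Table 10.2 can be reproduced on a toy problem. The sketch below is not the Monte Carlo code of this chapter: it compares forward and central finite differences with a known exact derivative for a hypothetical smooth test function, using an illustrative step size.

```python
import math


def forward_difference(f, x, h):
    """First-order accurate: truncation error proportional to h."""
    return (f(x + h) - f(x)) / h


def central_difference(f, x, h):
    """Second-order accurate: truncation error proportional to h**2."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def relative_error(approx, exact):
    return abs(approx - exact) / abs(exact)


# Hypothetical test: f = exp, whose exact derivative at x is exp(x).
x, h = 1.0, 1e-4
exact = math.exp(x)
err_forward = relative_error(forward_difference(math.exp, x, h), exact)
err_central = relative_error(central_difference(math.exp, x, h), exact)
```

Both finite difference variants carry a truncation error that depends on the step size h, whereas tangent and adjoint AD evaluate the derivative of the program itself and are accurate up to rounding.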
Table 10.3 compares peak memory requirements and elapsed run times
for the PDE solutions without and with (evolution) checkpointing for
$N = 10^4$ spatial grid points and $M = 360$ time steps on our reference computer.
Basic adjoint mode (pde/a1s) yields infeasible memory requirements for all
gradients of size n ≥ 10.

TABLE 10.3: Run time and peak memory requirements as a function of the gradient size n of the basic and checkpointed adjoint codes vs. central finite differences

n     pde/primal   pde/cfd     pde/a1s    pde/a1s_checkpointing   R
10    0.3 s        6.5 s       (>3 GB)    5.2 s (205 MB)          17.3
22    0.5 s        19.6 s      (>3 GB)    8.3 s (370 MB)          16.6
34    0.6 s        37.7 s      (>3 GB)    11.6 s (535 MB)         19.3
62    1.0 s        119.5 s     (>3 GB)    18.7 s (919 MB)         18.7
142   2.6 s        741.2 s     (>3 GB)    39 s (2 GB)             15.0
222   4.1 s        1857.3 s    (>3 GB)    60 s (3 GB)             14.6

Note: The checkpointing used is equidistant (every 10th time step). The basic adjoint ran out of memory even for the smallest problem size. The relative computational cost R is given for pde/a1s_checkpointing. Again, this theoretically constant value is typically rather sensitive to specifics of the target computer architecture and software stack.

10.4.2 American option pricing

The following case study is presented in further detail in Reference 40.
An American-put option on a single asset priced by the Longstaff–Schwartz
algorithm [41] is considered. For the generation of the stock price paths under
the risk neutral measure and without a dividend yield, Equation 10.4 for a
constant volatility is used:

\[ X_t = X_0 + \left( r - \tfrac{1}{2} \sigma^2 \right) t \cdot \Delta + \sigma \sum_{i=0}^{t} Z_i. \tag{10.20} \]

First-order sensitivities are obtained by AAD for five active inputs, the
stock price, volatility, time to maturity, risk-free interest rate, and strike price,
respectively. Second-order sensitivities are computed for Δ with respect to the
five active inputs.
A computation of the test case with the number of paths $N_P$ equal to $10^6$
and $10^2$ exercise opportunities $N_T$ yields enormous memory requirements in
basic adjoint mode. Moreover, it can be seen that some of the second-order
sensitivities including Γ are computed to be zero by using the AD technique,
due to its inability to capture control flow dependencies. After each regression,
a decision occurs whether to exercise the current option or to hold it (see
line 11 of Algorithm 10.1). This decision leads to zero adjoints of the local
cash flow with respect to the exercise boundary.
For the memory reduction, evolution checkpointing introduced in
Section 10.3.1 is applied to each iteration in line 2 of Algorithm 10.1. A checkpoint
consists of the values of the inputs, the local cash flow, and the time of the
respective loop cycle.
To capture the control flow dependency of the exercise decision, sigmoidal
smoothing is used as introduced in Section 10.3.3. The exercise boundary is
chosen to be the center of the transition. Then, the “if” statement
in line 11 and the payoff function in line 12 of the algorithm are replaced by
the assignment (10.21), in which $\sigma_s$ is the sigmoid function and $v_p$ denotes the
payoff of the current path:

\[ v_p := (1 - \sigma_s) \cdot (K - S_p) + \sigma_s \cdot v_p. \tag{10.21} \]

Algorithm 10.1 Longstaff–Schwartz algorithm for put options

In:
→ initial stock price $S_0 \in \mathbb{R}$, strike price $K \in \mathbb{R}$, time to maturity $T \in \mathbb{R}$,
  volatility $\sigma \in \mathbb{R}$, risk-free interest rate $r \in \mathbb{R}$, number of paths $N_P \in \mathbb{N}$,
  number of time steps $N_T \in \mathbb{N}$, accumulated random numbers $Z \in \mathbb{R}^{N_P \times N_T}$
→ implementation of the stock price generation for a given time and path:
  $h : \mathbb{R}^4 \times \mathbb{N}^2 \times \mathbb{R} \to \mathbb{R}$, $S_{p,t} = h(S_0, T, \sigma, r, t, N_T, Z_{p,t})$
→ implementation of the regression for the set of paths in the money I, the
  vector of stock prices for a given time $S_t$ and the vector of discounted
  cash flows v:
  $R : \mathbb{N}^i \times \mathbb{R}^i \times \mathbb{R}^i \to \mathbb{R}$, $b = R(I, S_t, v)$
Out:
← option price: $V \in \mathbb{R}$

Algorithm:
 1: Initialization
 2: for $t = N_T - 2$ to 1 do
 3:     $I \leftarrow \{\}$
 4:     for $p = 1$ to $N_P$ do
 5:         $v_p \leftarrow v_p \cdot \exp(-r \cdot T / N_T)$
 6:         $S_{p,t} = h(S_0, T, \sigma, r, t, N_T, Z_{p,t})$
 7:         if $S_{p,t} < K$ then
 8:             $I \leftarrow I \cup \{p\}$
 9:     $b = R(I, S_t, v)$
10:     for all $p \in I$ do
11:         if $K - S_{p,t} > b$ then
12:             $v_p \leftarrow K - S_{p,t}$
13: $V \leftarrow \sum_p v_p \cdot \exp(-r \cdot T / N_T) / N_P$

Checkpointing yields a reduction in memory requirement by approximately
85% compared to basic adjoint mode. Larger test cases can be computed at a
slightly higher computational cost. Run times and memory requirements are
shown in Table 10.4.
Those second-order sensitivities which could not be calculated satisfac-
torily due to the missing control flow dependencies are approximated by the
smoothing approach for the exercise decision. All other sensitivities are similar
to the values that are computed with the AAD method without smoothing.
The option price as well as the sensitivities are given in Table 10.5. The
quality of the smoothing depends on the transition parameter α, which in
practice is often determined experimentally.

TABLE 10.4: Run times and required tape memory for a single pricing calculation and the basic and the checkpointed adjoint methods

        Run time (s)                                  Memory requirement (GB)
NT      Pricer   Basic adjoint   Checkpointed adjoint   Pricer   Basic adjoint   Checkpointed adjoint
100     22       192 (8.7)       228 (10.4)             0.80     84.78           10.93
500     113      –               1011 (8.9)             3.78     >100            49.90
1000    226      –               2245 (9.9)             7.51     >100            98.47

Note: Relative run times are given in brackets.

TABLE 10.5: Value and sensitivities of the test cases for the algorithmic differentiation methods applied to the basic and smoothed version (subscript s) of the Longstaff–Schwartz algorithm for S0 = 1, K = 1, T = 1, r = 4%, σ = 20%, and α = 0.005

NT      V         Vs        Δ          Δs         ν         νs        Γ    Γs
100     0.06361   0.06353   −0.41962   −0.41761   0.37653   0.37921   0    0.82227
500     0.06378   0.06345   −0.42414   −0.41749   0.37688   0.38275   0    0.86813
1000    0.06369   0.06318   −0.42465   −0.41869   0.37611   0.38515   0    0.73878

Note: In Reference 44, analytical reference values for V and Δ are given as Vref = 0.064 and Δref = −0.416.

By assuming the missing control flow dependencies to be negligible, the
time and the path loop can be switched and the algorithm for the sensitivity
computation can be simplified. This pathwise adjoint approach reduces the
memory requirement further and it enables parallelization [6].

10.4.3 Nearest correlation matrix


Monte Carlo simulation of multiple correlated underlyings requires the
sampling of jointly distributed random variables. The n-dimensional Gaussian
copula model can be used to sample jointly normal random variables
$z = (z_1, \ldots, z_n)^T$ by starting from a sample $\tilde{z} = (\tilde{z}_1, \ldots, \tilde{z}_n)^T$ of independent
standard normal variates. The Cholesky decomposition $G = A A^T$ of the
correlation matrix $G \in \mathbb{R}^{n \times n}$ is then used to correlate the components of $\tilde{z}$ by
performing the matrix vector product $z = A \cdot \tilde{z}$.
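
The correlation step can be sketched with NumPy as follows. The 3 × 3 correlation matrix, the sample count and the function name are hypothetical values chosen for illustration; only `numpy.linalg.cholesky`, `numpy.random.default_rng` and `numpy.corrcoef` are used.

```python
import numpy as np


def correlate(z_tilde, corr):
    """Map independent standard normals (one sample per column) to
    jointly normal variates with correlation matrix `corr`."""
    a = np.linalg.cholesky(corr)  # lower triangular, corr = a @ a.T
    return a @ z_tilde


# Hypothetical positive definite correlation matrix.
corr = np.array([[1.0, 0.6, 0.2],
                 [0.6, 1.0, 0.3],
                 [0.2, 0.3, 1.0]])

rng = np.random.default_rng(0)
z = correlate(rng.standard_normal((3, 200_000)), corr)
sample_corr = np.corrcoef(z)  # close to corr for a large sample
```

`numpy.linalg.cholesky` raises `numpy.linalg.LinAlgError` when the input is not positive definite, which is the failure mode that motivates the preprocessing step discussed next.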
Real-world correlation data are often not consistent in the sense that the
matrix G of pairwise correlations is not positive definite as required by the
Cholesky decomposition. To arrive at a positive definite matrix $C \in \mathbb{R}^{n \times n}$
that is a correlation matrix with a given lower bound on eigenvalues and that
is closest to G in the Frobenius norm, we can solve a variation of the nearest
correlation matrix (NCM) problem as a preprocessing step [42].
Due to Qi and Sun [43], a generalized Newton method can be used to find
the root of the first-order optimality condition for the unconstrained convex
optimization problem formulation of the NCM

\[ h(y, G) = \mathrm{Diag}\left( (G + \mathrm{diag}(y))_+ \right) - e = 0. \tag{10.22} \]



Notation is as follows: Diag(·) gives the diagonal of a matrix, diag(·) gives a
matrix with the given vector as diagonal, $(\cdot)_+$ is the projection onto the set
of positive semidefinite matrices, and $e = (1 \ldots 1)^T \in \mathbb{R}^n$. To get the NCM,
we solve Equation 10.22 for $y = y^* \in \mathbb{R}^n$ and set

\[ C := (G + \mathrm{diag}(y^*))_+ . \tag{10.23} \]
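
The building blocks of Equations 10.22 and 10.23 can be sketched as follows. The function names are hypothetical, the projection is computed through a symmetric eigendecomposition as described in the notation above, and the generalized Newton iteration that actually finds y* is deliberately not shown.

```python
import numpy as np


def project_psd(a):
    """(A)_+ : set all negative eigenvalues of a symmetric matrix to zero."""
    w, v = np.linalg.eigh(a)
    return (v * np.maximum(w, 0.0)) @ v.T


def ncm_residual(y, g):
    """h(y, G) = Diag((G + diag(y))_+) - e from Equation 10.22."""
    return np.diag(project_psd(g + np.diag(y))) - 1.0


def ncm_from_root(y_star, g):
    """C = (G + diag(y*))_+ from Equation 10.23, given a root y* of h."""
    return project_psd(g + np.diag(y_star))
```

For a matrix G that is already a valid correlation matrix, y = 0 is a root of h and the NCM is G itself, which gives a simple sanity check for an implementation.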

AAD has been introduced as an efficient way of quantifying correlation
risk in a copula-based Monte Carlo setting by Capriotti and Giles [13]. If we
want to compute correlation risk through the NCM step to get sensitivities
with regard to the actual model inputs, an adjoint of the NCM algorithm is
required.
As described in Section [Link], basic AAD applied to a Newton algorithm
yields a tape memory requirement of $O(\nu \cdot n^3)$, where $\nu$ is the number of
Newton iterations. We examine how known derivatives and the implicit function
theorem can be used to arrive at a symbolic adjoint for the NCM algorithm
with much better performance. Since the involved (·)+ operator is not dif-
ferentiable where the argument has an eigenvalue that is zero, a smoothed
version can be used to get sensitivities comparable to those obtained by finite
differences.
A matrix is projected onto the set of positive semidefinite matrices by
setting all negative eigenvalues to zero. We can compute both tangents and
adjoints of the $(\cdot)_+$ operator implicitly by using the symbolic matrix
derivative results of the eigen decomposition provided in Reference 30 and a
derivative of the max function. Given $C_{(1)}$, the adjoints $G_{(1)}$ and $y^*_{(1)}$ for the
assignment (10.23) follow directly from the adjoint of the $(\cdot)_+$ operator.
Since Equation 10.22 already gives the first-order optimality condition, the
adjoint $G_{(1)}$ of the function $y^* = y^*(G)$ implicitly defined by $h(y^*, G) = 0$
can be computed by first solving the linear system

\[ \frac{\partial h}{\partial y}(y^*, G)^T \cdot z = -y^*_{(1)} \tag{10.24} \]

and then setting

\[ G_{(1)} := G_{(1)} + \frac{\partial h}{\partial G}(y^*, G)^T \cdot z, \tag{10.25} \]

where the adjoint of h(y, G) again follows from the adjoint of the $(\cdot)_+$ operator.
With the above method, the adjoint $G_{(1)}$ as a function of $C_{(1)}$ can be
computed with a memory complexity of only $O(n^2)$ and a run time complexity
of $O(n^3)$, but its accuracy now depends on the Newton residual, that is, the
quality of the NCM solution itself.
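
The two steps of Equations 10.24 and 10.25 can be illustrated on a small stand-in problem. The sketch below does not use the NCM function h: it assumes a hypothetical smooth toy system h(y, p) = A y + y³ − p, with a parameter vector p playing the role of G, so that both partial Jacobians are available in closed form. All names are illustrative.

```python
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 3.0]])  # hypothetical toy problem data


def h(y, p):
    """Toy residual; y*(p) is defined implicitly by h(y*, p) = 0."""
    return A @ y + y**3 - p


def solve_root(p, iterations=50):
    """Plain Newton iteration for h(y, p) = 0."""
    y = np.zeros_like(p)
    for _ in range(iterations):
        jac = A + np.diag(3.0 * y**2)
        y = y - np.linalg.solve(jac, h(y, p))
    return y


def implicit_adjoint(y_star, y_bar):
    """Adjoint of p -> y*(p) without taping the Newton iterations."""
    dh_dy = A + np.diag(3.0 * y_star**2)
    dh_dp = -np.eye(len(y_star))
    z = np.linalg.solve(dh_dy.T, -y_bar)  # analogue of Equation 10.24
    return dh_dp.T @ z                    # analogue of Equation 10.25


p = np.array([1.0, 2.0])
y_bar = np.array([1.0, -2.0])
y_star = solve_root(p)
p_bar = implicit_adjoint(y_star, y_bar)
```

Only the converged solution and one linear solve are needed, which is why the memory requirement no longer grows with the number of Newton iterations. As noted above, the result is only as accurate as the root itself.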
In Table 10.6, we give the run time results of the primal NCM computation
and adjoints acquired by basic AAD, finite differences and the symbolic adjoint
approach described earlier. We also give the memory requirements for the
basic adjoint tape. Test matrices are correlation matrices with a few perturbed
elements resulting in small negative eigenvalues.

TABLE 10.6: Run time results for computing the NCM and three different adjoint routines and the required tape memory for the basic adjoint

       Run time (s)                                                    Tape memory (GB)
n      NCM primal   Basic adjoint   Symbolic adjoint   Finite differences   Basic adjoint
20     0.0014       0.12            0.0015             0.85                 0.038
50     0.0078       1.6             0.0094             38                   0.53
100    0.045        12              0.056              890                  4.4

The basic adjoint run times clearly show the advantages of AAD for
computing Greeks as opposed to using a finite-difference-based approach. The
symbolic adjoint of the NCM presents a significant further improvement: in
Table 10.6 it is about two orders of magnitude faster than the basic adjoint,
and it is not restricted by memory even for large or hard problem instances.

10.5 Summary and Conclusion


This chapter presented AD and its adjoint mode in particular as one of the
fundamental elements of the high-performance computational finance toolbox.
Improvements in computational cost by an order of complexity over classical
finite difference approximation of first-order Greeks motivate the widespread
use of adjoint AD. An introduction to its basic principles was followed by a
discussion of implementation and related conceptual challenges. Illustration
was provided by three case studies.
Getting started with basic adjoint AD on relatively simple problems is typ-
ically rather straightforward. A growing set of AD tools is available to support
this process; see, for example, [Link]. Much more challenging is
the application of AD to large financial libraries including the need for check-
pointing, symbolic differentiation of implicit functions, smoothing, preaccumu-
lation, integration of adjoint source code, support for MPI and/or OpenMP,
acceleration using GPUs, and so on. Second- and higher-order adjoints might
be needed. We expect large-scale high-performance adjoints to require a sub-
stantial amount of user interaction for the foreseeable future. AD tools need to
expose the corresponding flexibility through an intuitive and well-documented
user interface.

References
1. Giles, M. and Glasserman, P. Smoking adjoints: Fast Monte Carlo greeks. Risk,
19:88–92, 2006.

2. Griewank, A. and Walther, A. Evaluating Derivatives. Principles and Techniques
of Algorithmic Differentiation (2nd Edition). SIAM, Philadelphia, 2008.

3. Naumann, U. The Art of Differentiating Computer Programs. An Introduction to
Algorithmic Differentiation. Number 24 in Software, Environments, and Tools.
SIAM, Philadelphia, 2012.

4. Naumann, U. and Du Toit, J. Adjoint algorithmic differentiation tool support
for typical numerical patterns in computational finance. NAG Technical Report
No. TR3/14, 2014. Journal of Computational Finance (to appear).

5. Gebremedhin, A., Manne, F., and Pothen, A. What color is your Jacobian?
Graph coloring for computing derivatives. SIAM Review, 47:629–705, 2005.

6. Leclerc, M., Liang, Q., and Schneider, I. Fast Monte Carlo Bermudan greeks.
Risk, 22(7):84–88, 2009.

7. Denson, N. and Joshi, M. Flaming logs. Wilmott Journal, 1:259–262, 2009.

8. Joshi, M. and Pitt, D. Fast sensitivity computations for Monte Carlo valuation
of pension funds. Astin Bulletin, 40:655–667, 2010.

9. Joshi, M. and Yang, C. Fast and accurate pricing and hedging of long-dated
CMS spread options. International Journal of Theoretical and Applied Finance,
13:839–865, 2010.

10. Joshi, M. and Yang, C. Algorithmic hessians and the fast computation of cross-
gamma risk. IIE Transactions, 43:878–892, 2011.

11. Joshi, M. and Yang, C. Fast delta computations in the swap-rate market model.
Journal of Economic Dynamics and Control, 35:764–775, 2011.

12. Capriotti, L. Fast greeks by algorithmic differentiation. Journal of
Computational Finance, 14:3–35, 2011.

13. Capriotti, L. and Giles, M. Fast correlation greeks by adjoint algorithmic differ-
entiation. Risk, 23:79–83, 2010.

14. Capriotti, L. and Giles, M. Adjoint greeks made easy. Risk, 25:92, 2012.

15. Capriotti, L., Jiang, Y., and Macrina, A. Real-time risk management: An AAD–
PDE approach. International Journal of Financial Engineering, 2:1550039,
2015.

16. Capriotti, L., Jiang, Y., and Macrina, A. AAD and Least Squares Monte Carlo:
Fast Bermudan-style options and XVA greeks. Algorithmic Finance, 6(1–2):35–
49, 2017.

17. Antonov, A. Algorithmic differentiation for callable exotics. 2017. Available at
SSRN: [Link] or [Link]
2839362

18. Antonov, A., Issakov, S., Konikov, M., McClelland, A., and Mechkov, S. PV and
XVA greeks for callable exotics by algorithmic differentiation. 2017. Available at
SSRN: [Link] or [Link]
2881992

19. Turinici, G. Calibration of local volatility using the local and implied instanta-
neous variance. Journal of Computational Finance, 13(2):1, 2009.

20. Käbe, C., Maruhn, J. H., and Sachs, E. W. Adjoint-based Monte Carlo calibra-
tion of financial market models. Finance and Stochastics, 13(3):351–379, 2009.

21. Schlenkrich, S. Efficient calibration of the Hull–White model. Optimal Control
Applications and Methods, 33:352–362, 2012.

22. Henrard, M. Calibration in finance: Very fast greeks through algorithmic dif-
ferentiation and implicit function. Procedia Computer Science, 18:1145–1154,
2013.

23. Giles, M. Vibrato Monte Carlo sensitivities. In L’Ecuyer, P. and Owen, A. edi-
tors. Monte Carlo and Quasi Monte Carlo Methods, 369–382. Springer, 2009.

24. Capriotti, L. Likelihood ratio method and algorithmic differentiation: Fast sec-
ond order greeks. Algorithmic Finance, 4:81–87, 2015.

25. Pironneau, O. et al. Vibrato and automatic differentiation for high order deriva-
tives and sensitivities of financial options. arXiv preprint arXiv:1606.06143 ,
2016.

26. Utke, J., Naumann, U., Fagan, M., Tallent, N., Strout, M., Heimbach, P., Hill,
C., and Wunsch, C. OpenAD/F: A modular open-source tool for automatic
differentiation of Fortran codes. ACM Transactions on Mathematical Software,
34:18:1–18:36, July 2008.

27. Stumm, P. and Walther, A. Multi-stage approaches for optimal offline check-
pointing. SIAM Journal of Scientific Computing, 31:1946–1967, 2009.

28. Naumann, U. DAG reversal is NP-complete. Journal of Discrete Algorithms,
7:402–410, 2009.

29. Griewank, A. and Walther, A. Algorithm 799: Revolve: An implementation of
checkpoint for the reverse or adjoint mode of computational differentiation.
ACM Transactions on Mathematical Software, 26(1):19–45, 2000. Also appeared
as Technical Report IOKOMO-04-1997, Technical University of Dresden.

30. Giles, M. Collected matrix derivative results for forward and reverse mode algo-
rithmic differentiation. In Bischof, C., Bücker, M., Hovland, P., Naumann, U.,
and Utke, J., editors. Advances in Automatic Differentiation, Volume 64 of Lec-
ture Notes in Computational Science and Engineering, pages 35–44. Springer,
2008.

31. Naumann, U. and Lotz, J. Algorithmic differentiation of numerical methods:
Tangent-linear and adjoint direct solvers for systems of linear equations.
Technical Report AIB-2012-10, LuFG Inf. 12, RWTH Aachen, June 2012.

32. Naumann, U., Lotz, J., Leppkes, K., and Towara, M. Algorithmic differentiation
of numerical methods: Tangent and adjoint solvers for parameterized systems
of nonlinear equations. ACM Transactions on Mathematical Software, 41:26:1–
26:21, 2015.

33. Henrard, M. Adjoint algorithmic differentiation: Calibration and implicit
function theorem. Journal of Computational Finance, 17(4):37–47, 2014.

34. Griewank, A. On stable piecewise linearization and generalized algorithmic
differentiation. Optimization Methods and Software, 28:1139–1178, 2013.

35. Khan, K. and Barton, P. A vector forward mode of automatic differentiation for
generalized derivative evaluation. Optimization Methods and Software, 30(6):1–
28, 2015.

36. Schneider, J. and Kirkpatrick, S. Stochastic Optimization. Springer, 2006.

37. Naumann, U. Optimal Jacobian accumulation is NP-complete. Mathematical
Programming, 112:427–441, 2008.

38. Naumann, U. Optimal accumulation of Jacobian matrices by elimination
methods on the dual computational graph. Mathematical Programming, 99:399–421,
2004.

39. Andersen, L. B. G. and Brotherton-Ratcliffe, R. The equity option volatility
smile: An implicit finite difference approach. Journal of Computational Finance,
2000.

40. Deussen, J., Mosenkis, V., and Naumann, U. Fast Estimates of Greeks from
American Options: A Case Study in Adjoint Algorithmic Differentiation. Tech-
nical Report AIB-2018-02, RWTH Aachen University, January 2018.

41. Longstaff, F. A. and Schwartz, E. S. Valuing American options by simulation: A
simple Least-Squares approach. Review of Financial Studies, 14:113–147, 2001.

42. Higham, N. J. Computing the nearest correlation matrix—A problem from
finance. IMA Journal of Numerical Analysis, 22(3):329–343, 2002.

43. Qi, H. and Sun, D. A quadratically convergent Newton method for comput-
ing the nearest correlation matrix. SIAM Journal Matrix Analysis Applications,
28:360–385, 2006.

44. Geske, R. and Johnson, H. E. The American put option valued analytically.
Journal of Finance, 39(5):1511–1524, 1984.
Chapter 11
Case Studies of Real-Time Risk Management via Adjoint Algorithmic Differentiation (AAD)

Luca Capriotti and Jacky Lee

CONTENTS
11.1 Introduction
11.2 Adjoint Algorithmic Differentiation: A Primer
    11.2.1 Adjoint design paradigm
        [Link] A simple example
11.3 Real-Time Risk Management of Interest Rate Products
    11.3.1 Pathwise derivative method
        [Link] Libor market model simulation
11.4 Real-Time Counterparty Credit Risk Management
    11.4.1 Counterparty credit risk management
    11.4.2 Adjoint algorithmic differentiation and the counterparty credit risk management
        [Link] Rating transition risk
    11.4.3 Results
11.5 Real-Time Risk Management of Flow Credit Products
    11.5.1 Pricing of credit derivatives
        [Link] Calibration step
    11.5.2 Challenges in the calculation of credit risk
    11.5.3 Adjoint calculation of risk
    11.5.4 Implicit function theorem
    11.5.5 Adjoint of the calibration step
    11.5.6 Results
        [Link] Credit default swaps
        [Link] Credit default index swaptions
11.6 Conclusion
References

11.1 Introduction
The renewed emphasis of the financial industry on quantitatively sound
risk management practices comes with formidable computational challenges.
In fact, standard approaches for the calculation of risk require repeating the
calculation of the P&L of the portfolio under hundreds of market scenar-
ios. As a result, in many cases these calculations cannot be completed in a
practical amount of time, even employing a vast amount of computer power,
especially for risk management problems requiring computationally intensive
Monte Carlo (MC) simulations. Since the total cost of the through-the-life risk
management can determine whether it is profitable to execute a new trade,
solving this technology problem is critical to allow a securities firm to remain
competitive.
Following the introduction of adjoint methods in Finance [1], a compu-
tational technique dubbed adjoint algorithmic differentiation (AAD) [2–4]
has recently emerged as tremendously effective for speeding up the calcu-
lation of sensitivities in MC in the context of the so-called pathwise derivative
method [5].
Algorithmic differentiation (AD) [6] is a set of programming techniques for
the efficient calculation of the derivatives of functions implemented as com-
puter programs. The main idea underlying AD is that any such function—no
matter how complicated—can be interpreted as a composition of basic arith-
metic and intrinsic operations that are easy to differentiate. What makes AD
particularly attractive, when compared to standard (finite-difference) meth-
ods for the calculation of derivatives, is its computational efficiency. In fact,
AD exploits the information on the structure of the computer code in order
to optimize the calculation. In particular, when one requires the derivatives
of a small number of outputs with respect to a large number of inputs, the
calculation can be highly optimized by applying the chain rule through the
instructions of the program in opposite order with respect to their original
evaluation [6]. This gives rise to the adjoint (mode of) algorithmic differenti-
ation (AAD).
Surprisingly, even though AD has been an active branch of computer science
for several decades, its impact in other research fields has been fairly limited
until recently. Interestingly, in a reversal of the usual situation in which
well-established ideas in Applied Maths or Physics are “borrowed” by
quants, AAD was introduced in MC applications in Natural Science [7]
only after its “rediscovery” in Quantitative Finance.
In this chapter, we discuss three particularly significant applications of
AAD to risk management: interest rate products, counterparty credit risk
management (CCRM), and volume credit products. Together they illustrate
the power and generality of this groundbreaking numerical technique.

11.2 Adjoint Algorithmic Differentiation: A Primer


Reference 6 contains a detailed discussion of the computational cost of
AAD. Here we will only recall the main results in order to clarify how this
technique can be beneficial for the applications discussed in this chapter. The
interested reader can find in References 2–4 several examples illustrating the
intuition behind these results.
To this end, consider a function

Y = FUNCTION(X) (11.1)

mapping a vector $X \in \mathbb{R}^n$ to a vector $Y \in \mathbb{R}^m$ through a sequence of steps

X → · · · → U → V → · · · → Y. (11.2)

Here, each step can be a distinct high-level function or even an individual
instruction.
The adjoint mode of AD results from propagating the derivatives of
the final result with respect to all the intermediate variables—the so-called
adjoints—until the derivatives with respect to the independent variables are
formed. Using the standard AD notation, the adjoint of any intermediate
variable $V_k$ is defined as

\[ \bar{V}_k = \sum_{j=1}^{m} \bar{Y}_j \frac{\partial Y_j}{\partial V_k}, \tag{11.3} \]

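where $\bar{Y}$ is a vector in $\mathbb{R}^m$. In particular, for each of the intermediate variables
$U_i$, using the chain rule we get

\[ \bar{U}_i = \sum_{j=1}^{m} \bar{Y}_j \frac{\partial Y_j}{\partial U_i} = \sum_{j=1}^{m} \bar{Y}_j \sum_{k} \frac{\partial Y_j}{\partial V_k} \frac{\partial V_k}{\partial U_i}, \]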

which corresponds to the adjoint mode equation for the intermediate function
$V = V(U)$,

$$\bar{U}_i = \sum_{k} \bar{V}_k \frac{\partial V_k}{\partial U_i}, \tag{11.4}$$

namely a function of the form Ū = V̄ (U, V̄ ). Starting from the adjoint of the
outputs, Ȳ , we can apply this to each step in the calculation, working from
right to left,
X̄ ← · · · ← Ū ← V̄ ← · · · ← Ȳ (11.5)
until we obtain X̄, that is, the following linear combination of the rows of the
Jacobian of the function X → Y :

$$\bar{X}_i = \sum_{j=1}^{m} \bar{Y}_j \frac{\partial Y_j}{\partial X_i}, \tag{11.6}$$

with i = 1, . . . , n.

In the adjoint mode, the cost does not increase with the number of inputs,
but it is linear in the number of (linear combinations of the) rows of the
Jacobian that need to be evaluated independently. In particular, if the full
Jacobian is required, one needs to repeat the adjoint calculation m times,
setting the vector Ȳ equal to each of the elements of the canonical basis in
Rm . Furthermore, since the partial (branch) derivatives depend on the values
of the intermediate variables, one generally first has to compute the original
calculation storing the values of all of the intermediate variables such as U
and V , before performing the adjoint mode sensitivity calculation.
One particularly important theoretical result [6] is that given a computer
program performing some high-level function 11.1, the execution time of its
adjoint counterpart

$$\bar{X} = \text{FUNCTION\_b}(X, \bar{Y}) \tag{11.7}$$

(with suffix b for “backward” or “bar”) calculating the linear combination
11.6 is bounded by approximately four times the cost of execution of the
original one, namely

$$\frac{\text{Cost}[\text{FUNCTION\_b}]}{\text{Cost}[\text{FUNCTION}]} \leq \omega_A, \tag{11.8}$$

with $\omega_A \in [3, 4]$. Thus one can obtain the sensitivity of a single output, or of
a linear combination of outputs, to an unlimited number of inputs for a little
more work than the original calculation.
As also discussed at length in References 2–4, AAD can be straightfor-
wardly implemented by starting from the output of an algorithm and pro-
ceeding backwards applying systematically the adjoint composition rule 11.4
to each intermediate step, until the adjoints of the inputs 11.6 are computed.
As already noted, the execution of such backward sweep requires information
that needs to be computed and stored by executing beforehand the steps of
the original algorithm—the so-called forward sweep.
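The forward and backward sweeps described above can be sketched on a toy two-step composition X → U → Y. The particular functions chosen here (a product, a sine, a sum) are illustrative assumptions and do not come from the text; the names `function`, `function_b`, and `jacobian` are likewise hypothetical. The sketch shows the adjoint composition rule 11.4 applied step by step in reverse order, and the full Jacobian recovered by repeating the backward sweep m times with the canonical basis vectors.

```python
import math

def function(x):
    """Forward sweep X -> U -> Y; also returns the stored intermediates U."""
    u = [x[0] * x[1], math.sin(x[0])]      # step X -> U (toy choice)
    y = [u[0] + u[1], u[0] * u[1]]         # step U -> Y (toy choice)
    return y, u

def function_b(x, y_bar):
    """Backward sweep: X_bar_i = sum_j Y_bar_j dY_j/dX_i (Equation 11.6)."""
    _, u = function(x)                     # forward sweep stores U
    # Adjoint of U -> Y, Equation 11.4 applied to the last step
    u_bar = [y_bar[0] + y_bar[1] * u[1],
             y_bar[0] + y_bar[1] * u[0]]
    # Adjoint of X -> U, Equation 11.4 applied to the first step
    x_bar = [u_bar[0] * x[1] + u_bar[1] * math.cos(x[0]),
             u_bar[0] * x[0]]
    return x_bar

def jacobian(x, m=2):
    """Full Jacobian: one backward sweep per canonical basis vector of R^m."""
    return [function_b(x, [1.0 if j == k else 0.0 for k in range(m)])
            for j in range(m)]
```

A single call to `function_b` returns one linear combination of the Jacobian rows regardless of the number of inputs, which is the property exploited throughout the chapter.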

11.2.1 Adjoint design paradigm


The propagation of the adjoints according to the steps 11.5, being mechanical
in nature, can be automated. Several AD tools are available¹ that, given a
procedure of the form 11.1, generate the adjoint function 11.7. Unfortunately,
applying such automatic AD tools to large inhomogeneous computer
codes, like those used in financial practice, is challenging. Indeed, pricing
applications are rarely available as self-contained packages that
can be easily parsed by an automatic AD tool. In practice, even simple option
pricing software is almost never implemented as a self-contained component.
In fact, to ensure consistency and leverage reusability, it generally relies on
auxiliary objects representing the relevant market information, for example,
a volatility surface or an interest rate curve, that are shared across different
pricing applications.

¹ An excellent source of information can be found at [Link]ff.org.
Fortunately, the principles of AD can be used as a programming paradigm
for any algorithm. An easy way to illustrate the adjoint design paradigm is
to consider again the arbitrary computer function in Equation 11.1 and to
imagine that this represents a certain high-level algorithm that we want to
differentiate. By appropriately defining the intermediate variables, any such
algorithm can be abstracted in general as a composition of functions as in
Equation 11.2. In the following section, we give a very simple example illus-
trating this idea. The interested reader can find in Reference 3 a practical
step-by-step guide.

11.2.1.1 A simple example


As a simple example of AAD implementation, we consider an algorithm
mapping a set of inputs (θ1 , . . . , θn ) into a single output P , according to the
following steps:

Step 1 Set $X_i = \exp(-\theta_i^2/2 + \theta_i Z)$, for $i = 1, \ldots, n$, where $Z$ is a constant.

Step 2 Set $P = \left(\sum_{i=1}^{n} X_i - K\right)^+$, where $K$ is a constant.

The corresponding adjoint algorithm consists of Steps 1–2 (forward sweep),
plus a backward sweep consisting of the adjoint Steps $\bar{2}$ and $\bar{1}$, in that order:

Step $\bar{2}$ Set $\bar{X}_i = \bar{P}\, I\left(\sum_{j=1}^{n} X_j - K\right)$, for $i = 1, \ldots, n$. Here $I(x)$ is the Heaviside
function.

Step $\bar{1}$ Set $\bar{\theta}_i = \bar{X}_i X_i (-\theta_i + Z)$, for $i = 1, \ldots, n$.

It is immediate to verify that the output of the adjoint algorithm above
gives, for $\bar{P} = 1$, the full set of sensitivities with respect to the inputs,
$\bar{\theta}_i = \partial P/\partial \theta_i$. As already noted, the backward sweep requires
information that is computed during the execution of the forward sweep, Steps
1 and 2, for example, to compute the indicator $I\left(\sum_{i=1}^{n} X_i - K\right)$ and the value
of $X_i$. Finally, simple inspection shows that both the forward and backward
sweeps have a computational complexity $O(n)$, that is, all the components of
the gradient of $P$ can be obtained at a cost that is of the same order as the
cost of computing $P$, in agreement with the general result 11.8. It is easy to
recognize in this example a stylized representation of the calculation of the
pathwise estimator for vega (volatility sensitivity) of a call option on a sum
of lognormal assets.
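The two forward steps and their adjoints can be written out directly. The following is a minimal sketch of the simple example above; the function names `payoff` and `payoff_b` are assumptions introduced for illustration. The backward sweep reuses the values of $X_i$ and the indicator computed during the forward sweep.

```python
import math

def payoff(theta, z, k):
    """Forward sweep: Steps 1 and 2."""
    x = [math.exp(-t * t / 2.0 + t * z) for t in theta]    # Step 1
    p = max(sum(x) - k, 0.0)                               # Step 2
    return p, x

def payoff_b(theta, z, k, p_bar=1.0):
    """Forward sweep followed by the adjoint Steps 2-bar and 1-bar."""
    p, x = payoff(theta, z, k)                             # forward sweep
    indicator = 1.0 if sum(x) - k > 0.0 else 0.0           # Heaviside function
    x_bar = [p_bar * indicator for _ in x]                 # Step 2-bar
    theta_bar = [xb * xi * (-t + z)                        # Step 1-bar
                 for xb, xi, t in zip(x_bar, x, theta)]
    return p, theta_bar
```

Both sweeps loop once over the n inputs, so the whole gradient costs O(n), in contrast with the O(n²) cost of bumping each input in turn.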

11.3 Real-Time Risk Management of Interest Rate Products
An important application of AAD is the efficient implementation of the
so-called pathwise derivative method [5] in MC applications. We begin by
briefly recalling the main ideas underlying this method. Then we discuss its
application to the simulation of the Libor Market Model [8] in the context
of risk management of interest rate products. This example is of particular
significance being the first financial application of adjoint methods, due to the
seminal paper by Giles and Glasserman [1], inspiring the subsequent develop-
ment of AAD [2,3] in a financial context.

11.3.1 Pathwise derivative method


Option pricing problems can be typically formulated in terms of the cal-
culation of expectation values of the form

$$V = \mathbb{E}_Q\left[P(X(T_1), \ldots, X(T_M))\right]. \tag{11.9}$$

Here $X(t)$ is an $N$-dimensional vector and represents the value of a set of
underlying market factors (e.g., stock prices, interest rates, foreign exchange pairs,
and so on) at time $t$. $P(X(T_1), \ldots, X(T_M))$ is the discounted payout function
of the priced security and depends in general on $M$ observations of those
factors. In the following, we will indicate the collection of such observations with
a $d = N \times M$-dimensional state vector $X = (X(T_1), \ldots, X(T_M))^t$.
The expectation value in Equation 11.9 can be estimated by means of
MC by sampling a number NMC of random replicas of the underlying state
vector X[1], . . . , X[NMC ], sampled according to the distribution Q(X), and
evaluating the payout P (X) for each of them. This leads to the estimate of
the option value V as

$$V \simeq \frac{1}{N_{MC}} \sum_{i_{MC}=1}^{N_{MC}} P\left(X[i_{MC}]\right). \tag{11.10}$$

The pathwise derivative method allows the calculation of the sensitivities
of the option price $V$ (Equation 11.9) with respect to a set of $N_\theta$ parameters
$\theta = (\theta_1, \ldots, \theta_{N_\theta})$, with a single simulation. This can be achieved by first
expressing the expectation value in Equation 11.9 as being over $\mathcal{P}(Z)$, the
distribution of the independent random numbers $Z$ used in the MC simulation
to generate the random samples of $Q(X)$, so that

$$V = \mathbb{E}_Q[P(X)] = \mathbb{E}_{\mathcal{P}}[P(X(Z))]. \tag{11.11}$$

The point of this subtle change is that $\mathcal{P}(Z)$ does not depend on the parameters
$\theta$ whereas $Q(X)$ does. Indeed, whenever the payout function is regular enough,
for example, Lipschitz-continuous, and under additional conditions that are
often satisfied in financial pricing (see, e.g., [9]), one can write the sensitivity
$\langle\bar{\theta}_k\rangle \equiv \partial V/\partial \theta_k$ as

$$\langle\bar{\theta}_k\rangle = \mathbb{E}_Q\left[\frac{\partial P_\theta(X)}{\partial \theta_k}\right]. \tag{11.12}$$
In general, the calculation of Equation 11.12 can be performed by applying
the chain rule and averaging on each MC path the so-called pathwise derivative
estimator
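$$\bar{\theta}_k \equiv \frac{\partial P_\theta(X)}{\partial \theta_k} = \sum_{j=1}^{d} \frac{\partial P_\theta(X)}{\partial X_j} \times \frac{\partial X_j}{\partial \theta_k} + \frac{\partial P_\theta(X)}{\partial \theta_k}. \tag{11.13}$$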

For nonpath-dependent options in the context of Libor Market Models,
Giles and Glasserman [1] have shown that the pathwise derivative method
can be efficiently implemented by expressing the calculation of the estimator
11.13 in terms of linear algebra operations, and utilizing adjoint methods to
reduce the computational complexity by rearranging appropriately the order
of the calculations.
Unfortunately, the algebraic adjoint approach requires in general a fair amount
of analytical work and is difficult to generalize to both path-dependent payouts
and multiasset simulations. Instead, as we discuss in the following section,
AAD overcomes these difficulties.
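As a concrete instance of the estimator 11.13, the sketch below computes the price, Delta, and Vega of a European call in a single simulation. The single-asset lognormal (Black-Scholes) model, the parameter names, and the function name `pathwise_call_greeks` are assumptions made for this illustration and are not the chapter's test case. Each sensitivity is the payout derivative ∂P/∂X multiplied by the derivative of the state X = S_T with respect to the parameter, averaged over the paths.

```python
import math
import random

def pathwise_call_greeks(s0, k, r, sigma, t, n_mc, seed=0):
    """Price, Delta and Vega of a call by the pathwise derivative method."""
    rng = random.Random(seed)
    disc = math.exp(-r * t)
    sqrt_t = math.sqrt(t)
    value = delta = vega = 0.0
    for _ in range(n_mc):
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * math.exp((r - 0.5 * sigma ** 2) * t + sigma * sqrt_t * z)
        if s_t > k:                       # the payout is differentiable away from the strike
            value += disc * (s_t - k)
            dp_dx = disc                  # dP/dX for X = S_T
            delta += dp_dx * s_t / s0                       # dX/dS0
            vega += dp_dx * s_t * (-sigma * t + sqrt_t * z)  # dX/dsigma
    return value / n_mc, delta / n_mc, vega / n_mc
```

The same random numbers drive the value and every sensitivity, which is what "with a single simulation" means in the text.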

11.3.1.1 Libor market model simulation


In order to make the connection with previous algebraic implementations
of adjoint methods [1,10,11], we discuss the implementation of AAD in the
Libor Market Model. Here we indicate with $T_i$, $i = 1, \ldots, N + 1$, a set of $N + 1$
bond maturities, with spacings $\delta = T_{i+1} - T_i$, assumed constant for simplicity.
The dynamics of the Libor rate as seen at time $t$ for the interval $[T_i, T_{i+1})$,
$L_i(t)$, takes the form

$$\frac{dL_i(t)}{L_i(t)} = \mu_i(L(t))\, dt + \sigma_i(t)^T dW_t, \tag{11.14}$$

$0 \leq t \leq T_i$, and $i = 1, \ldots, N$, where $W_t$ is a $d_W$-dimensional standard Brownian
motion, $L(t)$ is the $N$-dimensional vector of Libor rates, and $\sigma_i(t)$ the
$d_W$-dimensional vector of volatilities, at time $t$. Here the drift term in the spot
measure, as imposed by the no-arbitrage conditions [8], reads

$$\mu_i(L(t)) = \sum_{j=\eta(t)}^{i} \frac{\sigma_i^T \sigma_j\, \delta L_j(t)}{1 + \delta L_j(t)}, \tag{11.15}$$

where $\eta(t)$ denotes the index of the bond maturity immediately following
time $t$, with $T_{\eta(t)-1} \leq t < T_{\eta(t)}$. As is common in the literature, to keep this
example as simple as possible, we take each vector $\sigma_i$ to be a function of time
to maturity

$$\sigma_i(t) = \sigma_{i-\eta(t)+1}(0) = \lambda(i - \eta(t) + 1). \tag{11.16}$$
Equation 11.14 can be simulated by applying an Euler discretization to
the logarithms of the forward rates, for example, by dividing each interval
[Ti , Ti+1 ) into Ns steps of equal width, h = δ/Ns . This gives

$$\frac{L_i(t_{n+1})}{L_i(t_n)} = \exp\left[\left(\mu_i(L(t_n)) - \|\sigma_i(t_n)\|^2/2\right) h + \sigma_i^T(t_n) Z(t_n) \sqrt{h}\right], \tag{11.17}$$

for $i = \eta(nh), \ldots, N$, and $L_i(t_{n+1}) = L_i(t_n)$ if $i < \eta(nh)$. Here $Z$ is a
$d_W$-dimensional vector of independent standard normal variables.
In a recent paper, Denson and Joshi [10] extended the original adjoint
implementation to the propagation of the Libor under the predictor–corrector
drift approximation, consisting in replacing the drift in Equation 11.15 with

$$\mu_i^{pc}(L(t_n)) = \frac{1}{2} \sum_{j=\eta(nh)}^{i} \left[\frac{\sigma_i^T \sigma_j\, \delta L_j(t_n)}{1 + \delta L_j(t_n)} + \frac{\sigma_i^T \sigma_j\, \delta \hat{L}_j(t_{n+1})}{1 + \delta \hat{L}_j(t_{n+1})}\right], \tag{11.18}$$

where L̂j (tn+1 ) is calculated from Lj (tn ) using the evolution 11.17, that is,
with the simple Euler drift 11.15.
The pseudocode for the propagation of the Libor rates for dW = 1, cor-
responding to a function PROP implementing the Euler step 11.17, is shown
in Figure 11.1. Here, as discussed in Reference 1, the computational cost of
implementing Equation 11.17 is minimized by first evaluating


$$v_i(t_n) = \sum_{j=\eta(nh)}^{i} \frac{\sigma_j\, \delta L_j(t_n)}{1 + \delta L_j(t_n)}, \tag{11.19}$$

as a running sum for $i = \eta(nh), \ldots, N$, so that $\mu_i = \sigma_i^T v_i$.
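The running-sum evaluation of the drift can be sketched as follows for $d_W = 1$. This is a minimal illustration of the plain Euler step 11.17 with the volatility parameterization 11.16, not a transcription of Figure 11.1 (which uses the predictor-corrector drift); the function name `prop_euler`, the 0-based indexing, and the argument `eta` (the 0-based index of the first rate still alive) are assumptions.

```python
import math

def prop_euler(libor, lam, delta, h, z, eta):
    """One Euler step (Equation 11.17) for d_W = 1.

    libor : current Libor rates L_i(t_n), 0-based list
    lam   : volatilities by time-to-maturity index, lam[i - eta] (Equation 11.16)
    eta   : 0-based index of the first rate that is still alive
    """
    new = list(libor)                     # rates with i < eta are left unchanged
    v = 0.0
    for i in range(eta, len(libor)):
        sigma_i = lam[i - eta]
        # running sum of Equation 11.19: O(N) instead of O(N^2) overall
        v += sigma_i * delta * libor[i] / (1.0 + delta * libor[i])
        mu_i = sigma_i * v                # drift of Equation 11.15
        new[i] = libor[i] * math.exp(
            (mu_i - 0.5 * sigma_i ** 2) * h + sigma_i * z * math.sqrt(h))
    return new
```

Accumulating `v` inside the loop avoids recomputing the inner sum of Equation 11.15 for every rate.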


The algebraic formulation discussed in Reference 10 comes with a significant
analytical effort. Instead, as illustrated in Figure 11.2, the AAD
implementation is quite straightforward. According to the general design of AAD,
this simply consists of the adjoints of the instructions in the forward sweep
executed in reverse order. In this example, the information computed by PROP
that is required by PROP_b is stored in the vectors scra and hat_scra. By
inspecting the structure of the pseudocode, it also appears clear that the
computational cost of PROP_b is of the same order as evaluating the original
function PROP.
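The "adjoint of each instruction, in reverse order" recipe can be illustrated on the plain Euler step 11.17 for $d_W = 1$. The sketch below is a simplified stand-in for Figure 11.2, which treats the predictor-corrector drift; the names `prop`, `prop_b`, `v_store`, and the 0-based index `eta` are assumptions made for this illustration. The list `v_store` plays the role of the scratch vectors that carry forward-sweep information to the backward sweep.

```python
import math

def prop(libor, lam, delta, h, z, eta):
    """Forward Euler step; also stores the running sums v_i for the adjoint."""
    new, v_store = list(libor), [0.0] * len(libor)
    v = 0.0
    for i in range(eta, len(libor)):
        sig = lam[i - eta]
        v += sig * delta * libor[i] / (1.0 + delta * libor[i])
        v_store[i] = v
        new[i] = libor[i] * math.exp(
            (sig * v - 0.5 * sig * sig) * h + sig * z * math.sqrt(h))
    return new, v_store

def prop_b(libor, lam, delta, h, z, eta, new_bar):
    """Adjoint of prop: maps new_bar to (libor_bar, lam_bar)."""
    new, v_store = prop(libor, lam, delta, h, z, eta)   # forward sweep
    libor_bar = list(new_bar)        # rates with i < eta pass through unchanged
    lam_bar = [0.0] * len(lam)
    v_bar = 0.0
    for i in reversed(range(eta, len(libor))):
        sig = lam[i - eta]
        # adjoint of: new[i] = libor[i] * exp(arg)
        libor_bar[i] = new_bar[i] * new[i] / libor[i]
        arg_bar = new_bar[i] * new[i]
        # adjoint of: arg = (sig * v - sig^2 / 2) * h + sig * z * sqrt(h)
        sig_bar = arg_bar * ((v_store[i] - sig) * h + z * math.sqrt(h))
        v_bar += arg_bar * sig * h
        # adjoint of: v += sig * delta * L / (1 + delta * L)
        sig_bar += v_bar * delta * libor[i] / (1.0 + delta * libor[i])
        libor_bar[i] += v_bar * sig * delta / (1.0 + delta * libor[i]) ** 2
        lam_bar[i - eta] += sig_bar
    return libor_bar, lam_bar
```

The backward loop visits each rate once, so its cost is of the same order as the forward step, as stated in the text.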
As a standard test case in the literature, here we have considered contracts
with expiry $T_n$ to enter into a swap with payment dates $T_{n+1}, \ldots, T_{N+1}$, with
the holder of the option paying a fixed rate $K$

$$V(T_n) = \sum_{i=n+1}^{N+1} B(T_n, T_i)\, \delta\, (S_n(T_n) - K)^+, \tag{11.20}$$
FIGURE 11.1: Pseudocode implementing the propagation method PROP_n
for the Libor Market Model of Equation 11.14 for $d_W = 1$, under the
predictor-corrector Euler approximation 11.18, and the volatility parameterization
11.16.

where $B(T_n, T_i)$ is the price at time $T_n$ of a bond maturing at time $T_i$

$$B(T_n, T_i) = \prod_{l=n}^{i-1} \frac{1}{1 + \delta L_l(T_n)}, \tag{11.21}$$

and the swap rate reads

$$S_n(T_n) = \frac{1 - B(T_n, T_{N+1})}{\delta \sum_{l=n+1}^{N+1} B(T_n, T_l)}. \tag{11.22}$$
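Equations 11.20 to 11.22 translate into a few lines of code. The sketch below is an illustration under assumed conventions: the function name `swaption_payoff` is hypothetical, and the input list `rates` is taken to hold $L_n(T_n), \ldots, L_N(T_n)$ in order.

```python
def swaption_payoff(rates, delta, strike):
    """Swaption value at expiry T_n (Equations 11.20-11.22).

    rates : [L_n(T_n), ..., L_N(T_n)], the Libor rates alive at expiry
    """
    bonds = []                       # bonds[k] = B(T_n, T_{n+1+k}), Equation 11.21
    b = 1.0
    for libor in rates:
        b /= 1.0 + delta * libor
        bonds.append(b)
    annuity = delta * sum(bonds)
    swap_rate = (1.0 - bonds[-1]) / annuity               # Equation 11.22
    value = annuity * max(swap_rate - strike, 0.0)        # Equation 11.20
    return value, swap_rate
```

For a flat curve the swap rate reduces to the common Libor rate, which gives a quick sanity check of the bond and annuity calculations.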
FIGURE 11.2: Adjoint of the propagation method PROP_b_n for the Libor
Market Model of Equation 11.14 for $d_W = 1$, under the predictor-corrector
Euler approximation 11.18, and the volatility parameterization 11.16. The
corresponding forward method is shown in Figure 11.1. The commented
instructions are the forward counterpart to the adjoint instructions immediately
after.
[Plot: RCPU (vertical axis) versus Tn (horizontal axis); curves labeled "Delta and Vega bumping", "Delta bumping", and "Delta and Vega AAD" (the last at approximately 2.2).]

FIGURE 11.3: Ratio of the CPU time required for the AAD calculation of
the Delta and Vega and the time to calculate the option value for the swaption
in Equation 11.20 as a function of the option expiry Tn . The time to calculate
Delta and Vega using bumping is also shown. Lines are guides for the eye.

Here we consider European style payouts. The extension to the Bermudan options
of Leclerc and collaborators [11] can be obtained with a simple modification
of the original algorithm.
The remarkable computational efficiency of the implementation discussed
earlier is illustrated in Figure 11.3. Here we plot the execution time for the
calculation of all the Deltas, $\partial V/\partial L_i(0)$, and Vegas, $\partial V/\partial \sigma_i(n)$, relative to the
calculation of the swaption value, as obtained with the implementation above
and by finite differences. As the maturity of the swaption increases, the number
of risks to compute also increases. For typical applications, AAD results in
orders of magnitude speedups with respect to bumping.

11.4 Real-Time Counterparty Credit Risk Management


One of the most active areas of risk management today is CCRM. Manag-
ing counterparty risk is particularly challenging because it requires the simul-
taneous evaluation of all the trades facing a given counterparty. For multiasset
portfolios, this typically comes with extraordinary computational challenges.
Indeed, with the exclusion of the simplest portfolios of vanilla instruments,
computationally intensive MC simulations are often the only practical tool
available for this task. Standard approaches for the calculation of risk require
repeating the calculation of the P&L of the portfolio under hundreds of market
scenarios. As a result, in many cases these calculations cannot be completed in
a practical amount of time, even employing a vast amount of computer power.
Since the total cost of the through-the-life risk management can determine
whether it is profitable to execute a new trade, solving this technology problem
is critical to allow a securities firm to remain competitive.
In this section, we demonstrate how AAD can be used
for a highly efficient computation of price sensitivities in the context of
CCRM [12].

11.4.1 Counterparty credit risk management


As a typical task in the day-to-day operation of a CCRM desk, here we
consider the calculation of the credit valuation adjustment (CVA) as the main
measure of a dealer’s counterparty credit risk. For a given portfolio of trades
facing the same investor or institution, the CVA aims to capture the expected
loss associated with the counterparty defaulting in a situation in which the
position, netted for any collateral agreement, has a positive mark-to-market
for the dealer. This can be evaluated at time $T_0 = 0$ as

$$V_{CVA} = \mathbb{E}\left[I(\tau_c \leq T)\, D(0, \tau_c)\, L_{GD}(\tau_c) \left(NPV(\tau_c) - C(R(\tau_c^-))\right)^+\right], \tag{11.23}$$

where τc is the default time of the counterparty, N P V (t) is the net present
value of the portfolio at time t from the dealer’s point of view, C(R(t)) is the
collateral outstanding, typically dependent on the rating R of the counter-
party, LGD (t) is the loss given default, D(0, t) is the discount factor for the
interval [0, t], and I(τc ≤ T ) is the indicator that the counterparty’s default
happens before the longest deal maturity in the portfolio, $T$. Here, for simplicity
of notation, we consider the unilateral CVA; the generalization to bilateral
CVA [13] is straightforward. The quantity in Equation 11.23 is typically computed
on a discrete time grid of “horizon dates” $T_0 < T_1 < \cdots < T_{N_O}$ as, for
instance,

$$V_{CVA} \simeq \sum_{i=1}^{N_O} \mathbb{E}\left[I(T_{i-1} < \tau_c \leq T_i)\, D(0, T_i)\, L_{GD}(T_i) \left(NPV(T_i) - C\left(R(T_i^-)\right)\right)^+\right]. \tag{11.24}$$
In general, the quantity above depends on several correlated random market
factors, including interest rate, counterparty’s default time and rating, recov-
ery amount, and all the market factors the net present value of the portfolio
depends on. As such, its calculation requires an MC simulation.
To simplify the notation and generalize the discussion beyond the small
details that might enter in a dealer’s definition of a specific credit charge, here
we consider expectation values of the form

$$V = \mathbb{E}_Q[P(R, X)], \tag{11.25}$$

with “payout” given by

$$P = \sum_{i=1}^{N_O} P(T_i, R(T_i), X(T_i)), \tag{11.26}$$

where

$$P(T_i, R(T_i), X(T_i)) = \sum_{r=0}^{N_R} \tilde{P}_i(X(T_i); r)\, \delta_{r, R(T_i)}. \tag{11.27}$$

Here the rating of the counterparty entity including default, $R(t)$, is
represented by an integer $r = 0, \ldots, N_R$ for simplicity; $X(t)$ is the realized
value of the $M$ market factors at time $t$. $Q = Q(R, X)$ represents a
probability distribution according to which $R = (R(T_1), \ldots, R(T_{N_O}))^t$ and
$X = (X(T_1), \ldots, X(T_{N_O}))^t$ are distributed; $\tilde{P}_i(\cdot\,; r)$ is a rating-dependent payout at
time $T_i$.²
The expectation value in Equation 11.25 can be estimated by means of
MC by sampling a number NMC of random replicas of the underlying rating
and market state vector, R[1], . . . , R[NMC ] and X[1], . . . , X[NMC ], according
to the distribution Q(R, X), and evaluating the payout P (R, X) for each of
them.
In the following, we will make minimal assumptions on the particular
model employed to describe the dynamics of the market factors. In particular,
we will only assume that for a given MC sample the value at time $T_i$ of
the market factors can be obtained from their value at time $T_{i-1}$ by means
of a mapping of the form $X(T_i) = F_i(X(T_{i-1}), Z^X)$ where $Z^X$ is an
$N_X$-dimensional vector of correlated standard normal random variates, $X(T_0)$ is
today’s value of the market state vector, and $F_i$ is a mapping regular enough
for the pathwise derivative method to be applicable [9], as it is generally the
case for practical applications.
As an example of a counterparty rating model generally used in practice,
here we consider the rating transition Markov chain model of Jarrow et al.
[14] in which the rating at time $T_i$ can be simulated as

$$R(T_i) = \sum_{r=1}^{N_R} I\left(\tilde{Z}_i^R > Q(T_i, r)\right), \tag{11.28}$$

where $\tilde{Z}_i^R$ is a standard normal variate and $Q(T_i, r)$ is the quantile-threshold
corresponding to the transition probability from today’s rating to a rating $r$
at time $T_i$. Note that the discussion below is not limited to this particular
model, and it could be applied with minor modifications to other commonly
used models describing the default time of the counterparty, and its rating [15].
Here we consider the rating transition model 11.28 for its practical utility, as
well as for the challenges it poses in the application of the pathwise derivative
method, because of the discreteness of its state space.

² The discussion below applies also to the case in which the payout at time $T_i$ depends
on the history of the market factors $X$ up to time $T_i$.
In this setting, MC samples of the payout estimator in Equation 11.10
can be generated according to the following standard algorithm. For i =
1, . . . , NO :

Step 1 Generate a sample of $N_X + 1$ jointly normal random variables
$(Z_i^R, Z_i^X) \equiv (Z_i^R, Z_{i,1}^X, \ldots, Z_{i,N_X}^X)^t$ distributed according to
$\phi(Z_i^R, Z_i^X; \rho_i)$, an $(N_X + 1)$-dimensional standard normal probability
density function with correlation matrix $\rho_i$, for example, with the first
row and column corresponding to the rating factor.

Step 2 Iterate the recursion: $X(T_i) = F_i(X(T_{i-1}), Z^X)$.

Step 3 Set $\tilde{Z}_i^R = \sum_{j=1}^{i} Z_j^R/\sqrt{i}$ and compute $R(T_i)$ according to
Equation 11.28.³

Step 4 Compute the time $T_i$ payout estimator $P(T_i, R(T_i), X(T_i))$ in
Equation 11.27, and add this contribution to the total estimator in
Equation 11.26.

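The calculation of risk can be obtained in a highly efficient way by
implementing the pathwise derivative method [5] according to the principles of AAD
[2–4]. In particular, the pathwise derivative estimator reads in this case

$$\bar{\theta}_k \equiv \frac{\partial P_\theta(R, X)}{\partial \theta_k} = \sum_{i=1}^{N_O} \sum_{l=1}^{M} \frac{\partial P_\theta(R, X)}{\partial X_l(T_i)} \times \frac{\partial X_l(T_i)}{\partial \theta_k} + \frac{\partial P_\theta(R, X)}{\partial \theta_k}, \tag{11.29}$$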

where we have allowed for an explicit dependence of the payout on the model
parameters. Due to the discreteness of the state space of the rating factor, the
pathwise estimator for its related sensitivities is not well defined. However,
as we will show below, one can express things in such a way that the rating
sensitivities are incorporated in the explicit term ∂Pθ (R, X)/∂θk .
In the following, we will show how the calculation of the pathwise derivative
estimator 11.29 can be implemented efficiently by means of AAD.

³ Here we have used the fact that the payout 11.27 depends on the outturn value of the
rating at time $T_i$ and not on its history.

11.4.2 Adjoint algorithmic differentiation and the counterparty credit risk management

When applied to the pathwise derivative method, AAD allows the simultaneous
calculation of the pathwise derivative estimators for an arbitrarily
large number of sensitivities at a small fixed cost. Here we describe in detail
the AAD implementation of the pathwise derivative estimator 11.29 for the
CCRM problem 11.23.
As noted above, the sensitivities with respect to parameters affecting the
rating dynamics need special care due to the discrete nature of the state space.
However, setting these sensitivities aside for the moment, the AAD
implementation of the pathwise derivative estimator consists of Steps 1–4 described
earlier plus the following steps of the backward sweep. For $i = N_O, \ldots, 1$:

Step $\bar{4}$ Evaluate the adjoint of the payout,

$$(\bar{X}(T_i), \bar{\theta}) = \bar{P}(T_i, R(T_i), X(T_i), \theta, \bar{P}),$$

with $\bar{P} = 1$.

Step $\bar{3}$ Nothing to do: the parameters $\theta$ do not affect this nondifferentiable
step.

Step $\bar{2}$ Evaluate the adjoint of the propagation rule in Step 2,

$$(\bar{X}(T_{i-1}), \bar{\theta}) \mathrel{+}= \bar{F}_i(X(T_{i-1}), \theta, Z^X, \bar{X}(T_i), \bar{\theta}),$$

where += is the standard addition assignment operator.

Step $\bar{1}$ Nothing to do: the parameters $\theta$ do not affect this step.

A few comments are in order. In Step $\bar{4}$, the adjoint of the payout function
is defined while keeping the discrete rating variable constant. This provides
the derivatives $\bar{X}_l(T_i) = \partial P_\theta/\partial X_l(T_i)$ and $\bar{\theta}_k = \partial P_\theta/\partial \theta_k$. In defining the
adjoint in Step $\bar{2}$, we have taken into account that the propagation rule in
Step 2 is explicitly dependent on both $X(T_{i-1})$ and the model parameters $\theta$. As
a result, its adjoint counterpart produces contributions to both $\bar{\theta}$ and $\bar{X}(T_{i-1})$.
Both the adjoint of the payout and of the propagation mapping can be imple-
mented following the principles of AAD as discussed in References 2 and 3.
In many situations, AD tools can be also used as an aid or to automate the
implementation, especially for simpler, self-contained functions. In the back-
ward sweep above, Steps 1̄ and 3̄ have been skipped because we have assumed
for simplicity of exposition that the parameters θ do not affect the correlation
matrices ρi , and the rating dynamics. If correlation risk is instead required,
Step 2̄ produces also the adjoint of the random variables Z X , and Step 1̄ con-
tains the adjoint of the Cholesky decomposition, possibly with the support of
the binning technique, as described in Reference 4.

11.4.2.1 Rating transition risk


The risk associated with the rating dynamics can be treated by noting
that Equation 11.27 can be expressed more conveniently as

$$P\left(T_i, \tilde{Z}_i^R, X(T_i)\right) = \tilde{P}_i(X(T_i); 0) + \sum_{r=1}^{N_R} \left(\tilde{P}_i(X(T_i); r) - \tilde{P}_i(X(T_i); r-1)\right) I\left(\tilde{Z}_i^R > Q(T_i, r; \theta)\right), \tag{11.30}$$

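so that the singular contribution to the pathwise derivative estimator reads

$$\partial_{\theta_k} P\left(T_i, \tilde{Z}_i^R, X(T_i)\right) = -\sum_{r=1}^{N_R} \left(\tilde{P}_i(X(T_i); r) - \tilde{P}_i(X(T_i); r-1)\right) \delta\left(\tilde{Z}_i^R = Q(T_i, r; \theta)\right) \times \partial_{\theta_k} Q(T_i, r; \theta). \tag{11.31}$$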

This estimator cannot be sampled in this form with MC. Nevertheless, it can
be integrated out using the properties of Dirac’s delta along the lines of [16],
giving after straightforward computations,

$$\bar{\theta}_k = -\sum_{r=1}^{N_R} \frac{\phi(Z^\star, Z_i^X, \rho_i)}{\sqrt{i}\, \phi(Z_i^X, \rho_i^X)}\, \partial_{\theta_k} Q(T_i, r; \theta) \left(\tilde{P}_i(X(T_i); r) - \tilde{P}_i(X(T_i); r-1)\right), \tag{11.32}$$

where $Z^\star$ is such that $(Z^\star + \sum_{j=1}^{i-1} Z_j^R)/\sqrt{i} = Q(T_i, r; \theta)$ and $\phi(Z_i^X, \rho_i^X)$ is
an $N_X$-dimensional standard normal probability density function with
correlation matrix $\rho_i^X$ obtained by removing the first row and column of $\rho_i$; here
$\partial_{\theta_k} Q(T_i, r; \theta)$ is not stochastic and can be evaluated (e.g., using AAD) once
per simulation. The final result is rather intuitive as it is given by the
probability weighted sum of the discontinuities in the payout.
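The rewriting of the payout from the Kronecker-delta form 11.27 to the telescoped form 11.30 can be checked numerically. The sketch below assumes the thresholds $Q(T_i, r)$ are increasing in $r$, so that the indicators in Equation 11.28 are nested; the function names and the sample numbers used to exercise them are hypothetical.

```python
def rating(z_tilde, thresholds):
    """Equation 11.28: number of thresholds Q_1 < ... < Q_NR exceeded by z_tilde."""
    return sum(1 for q in thresholds if z_tilde > q)

def payout_kronecker(payouts, z_tilde, thresholds):
    """Equation 11.27: pick the payout for the realized rating."""
    return payouts[rating(z_tilde, thresholds)]

def payout_telescoped(payouts, z_tilde, thresholds):
    """Equation 11.30: base payout plus the jumps at each crossed threshold."""
    total = payouts[0]
    for r, q in enumerate(thresholds, start=1):
        if z_tilde > q:
            total += payouts[r] - payouts[r - 1]
    return total
```

In the telescoped form, the dependence on the thresholds is isolated in the indicators, and each term carries the size of one payout discontinuity, which is what Equations 11.31 and 11.32 weight.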

11.4.3 Results
As a numerical test, we present here results for the calculation of risk on
the CVA of a portfolio of swaps on commodity Futures. For the purpose of this
illustration, we consider a simple one-factor lognormal model for the Futures
curve of the form
$$\frac{dF_T(t)}{F_T(t)} = \sigma_T \exp(-\beta(T - t))\, dW_t, \tag{11.33}$$

where $W_t$ is a standard Brownian motion; $F_T(t)$ is the price at time $t$ of
a Futures contract expiring at $T$; $\sigma_T$ and $\beta$ define a simple instantaneous
volatility function that increases approaching the contract expiry, as empirically
observed for many commodities. The value of the Futures’ price $F_T(t)$
can be simulated exactly for any time $t$ so that the propagation rule in Step 2
reads for $T_i \leq T$

$$F_T(T_i) = F_T(T_{i-1}) \exp\left[\sigma_i \sqrt{\Delta T_i}\, Z - \frac{1}{2} \sigma_i^2 \Delta T_i\right], \tag{11.34}$$

where $\Delta T_i = T_i - T_{i-1}$, and

$$\sigma_i^2 = \frac{\sigma_T^2}{2\beta \Delta T_i}\, e^{-2\beta T} \left(e^{2\beta T_i} - e^{2\beta T_{i-1}}\right)$$
is the outturn variance. In this example, we will consider deterministic interest
rates. As underlying portfolio for the CVA calculation, we consider a set of
commodity swaps, paying on a strip of Futures (e.g., monthly) expiries $t_j$,
$j = 1, \ldots, N_e$, the amount $F_{t_j}(t_j) - K$. The time $t$ net present value for this
portfolio reads

$$NPV(t) = \sum_{j=1}^{N_e} D(t, t_j) \left(F_{t_j}(t) - K\right). \tag{11.35}$$

Note that although we consider here for simplicity of exposition a linear portfo-
lio, the method proposed applies to an arbitrarily complex portfolio of deriva-
tives, for which in general the N P V will be a nonlinear function of the market
factors Ftj (t) and model parameters θ.
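For this example, the adjoint propagation rule in Step $\bar{2}$ simply reads

$$\bar{F}_T(T_{i-1}) \mathrel{+}= \bar{F}_T(T_i) \exp\left[\sigma_i \sqrt{\Delta T_i}\, Z - \frac{1}{2} \sigma_i^2 \Delta T_i\right],$$

$$\bar{\sigma}_i = \bar{F}_T(T_i)\, F_T(T_i) \left(\sqrt{\Delta T_i}\, Z - \sigma_i \Delta T_i\right),$$

with $\bar{\sigma}_i$ related to this step’s contribution to the adjoint of the Futures’
volatility $\bar{\sigma}_T$ by

$$\bar{\sigma}_T \mathrel{+}= \frac{\bar{\sigma}_i}{\sqrt{2\beta \Delta T_i}} \sqrt{e^{-2\beta T} \left(e^{2\beta T_i} - e^{2\beta T_{i-1}}\right)}.$$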
At the end of the backward path, F̄T (0) and σ̄T contain the pathwise derivative
estimator 11.29 corresponding, respectively, to the sensitivity with respect to
today’s price and volatility of the Futures contract with expiry T .
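The forward propagation 11.34 and its adjoint can be sketched for a single Futures contract as follows. The function names, the argument `path_bar` (standing for the payout adjoints $\partial P/\partial F_T(T_i)$ produced by Step $\bar{4}$ at each horizon date), and the grid layout are assumptions made for this illustration; rating and discounting are left out.

```python
import math

def outturn_vol(sigma_t, beta, expiry, t_prev, t_next):
    """sigma_i: square root of the outturn variance over [t_prev, t_next]."""
    dt = t_next - t_prev
    var = (sigma_t ** 2 * math.exp(-2.0 * beta * expiry)
           * (math.exp(2.0 * beta * t_next) - math.exp(2.0 * beta * t_prev))
           / (2.0 * beta * dt))
    return math.sqrt(var)

def simulate(f0, sigma_t, beta, expiry, grid, zs):
    """Forward sweep: Futures price on the horizon dates (Equation 11.34)."""
    path = [f0]
    for i in range(1, len(grid)):
        dt = grid[i] - grid[i - 1]
        s = outturn_vol(sigma_t, beta, expiry, grid[i - 1], grid[i])
        path.append(path[-1] * math.exp(s * math.sqrt(dt) * zs[i - 1] - 0.5 * s * s * dt))
    return path

def simulate_b(f0, sigma_t, beta, expiry, grid, zs, path_bar):
    """Backward sweep: returns the adjoints of f0 and sigma_t."""
    path = simulate(f0, sigma_t, beta, expiry, grid, zs)    # forward sweep
    f_bar = path_bar[-1]
    sigma_t_bar = 0.0
    for i in range(len(grid) - 1, 0, -1):
        dt = grid[i] - grid[i - 1]
        s = outturn_vol(sigma_t, beta, expiry, grid[i - 1], grid[i])
        sigma_i_bar = f_bar * path[i] * (math.sqrt(dt) * zs[i - 1] - s * dt)
        sigma_t_bar += sigma_i_bar * s / sigma_t            # d sigma_i / d sigma_T
        # propagate to T_{i-1} and add that date's own payout adjoint
        f_bar = path_bar[i - 1] + f_bar * path[i] / path[i - 1]
    return f_bar, sigma_t_bar
```

One backward pass yields both the Delta and the Vega of the path estimator; with many contracts the same loop would run once per contract, at a cost comparable to the forward simulation.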
[Plot: Speedup/RCPU (vertical axis) versus Nrisks (horizontal axis, 0 to 600); the RCPU values stay at approximately 3.8.]

FIGURE 11.4: Speedup in the calculation of risk for the CVA of a portfolio
of 5 commodity swaps over a 5-year horizon, as a function of the number
of risks computed (empty dots). The full dots are the ratio of the CPU time
required for the calculation of the CVA, and its sensitivities, and the CPU time
spent for the computation of the CVA alone. Lines are guides for the eye.

TABLE 11.1: Variance reduction (VR) on the sensitivities with respect
to the thresholds Q(1, r) (NR = 3) for a call option with a rating-dependent
strike

δ        VR[Q(1,1)]   VR[Q(1,2)]   VR[Q(1,3)]
0.1      24           16           12
0.01     245          165          125
0.001    2490         1640         1350

Note: δ indicates the perturbation used in the finite-difference estimators of the
sensitivities.⁴

The remarkable computational efficiency of the AAD implementation is
clearly illustrated in Figure 11.4. Here we plot the speedup produced by AAD
with respect to the standard finite-difference method. On a fairly typical trade
horizon of 5 years, for a portfolio of 5 swaps referencing distinct commodity
Futures with monthly expiries, the CVA bears nontrivial risk to over 600
parameters: 300 Futures prices ($F_T(0)$), 300 at-the-money volatilities ($\sigma_T$),
(say) 10 points on the zero rate curve, and 10 points on the CDS curve of
the counterparty used to calibrate the transition probabilities of the rating
transition model 11.28. As illustrated in Figure 11.4, the CPU time required
for the calculation of the CVA, and its sensitivities, is less than 4 times the
CPU time spent for the computation of the CVA alone, as predicted by
Equation 11.8. As a result, even for this very simple application, AAD produces
risk over 150 times faster than finite differences; that is, for a CVA evaluation
taking 10 seconds, AAD produces the full set of sensitivities in less than 40
seconds, while finite differences require approximately 1 hour and 40 minutes.
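The timings quoted above follow from simple arithmetic, sketched here under stated assumptions: 600 sensitivities, one extra valuation per bumped parameter (one-sided finite differences, base valuation not counted), and a cost ratio of 3.8 for the adjoint run, read off Figure 11.4. The function name `risk_timings` is hypothetical.

```python
def risk_timings(t_value, n_risks, aad_ratio):
    """Compare AAD and one-sided finite-difference risk run times (in seconds)."""
    t_aad = aad_ratio * t_value     # value and all sensitivities in one adjoint run
    t_fd = n_risks * t_value        # one revaluation per bumped parameter
    return t_aad, t_fd, t_fd / t_aad

t_aad, t_fd, speedup = risk_timings(10.0, 600, 3.8)
minutes_fd = t_fd / 60.0            # 6000 seconds is 100 minutes, i.e. 1 hour 40 minutes
```

The finite-difference cost grows linearly with the number of risks while the AAD cost stays fixed, so the speedup itself grows linearly with the number of sensitivities.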
Moreover, as a result of the analytic integration of the singularities
introduced by the rating process, the risk produced by AAD is typically less noisy
than the one produced by finite differences. This is clearly illustrated in
Table 11.1 showing the variance reduction on the sensitivities with respect
to the thresholds $Q(T_i, r)$ for a simple test case. Here we have considered the
calculation of a call option of the form $(F_T(T_i) - C(R(T_i)))^+$ with a strike
$C(R(T_i))$ linearly dependent on the rating, and $T_i = 1$. The variance reduction
displayed in the table can be thought of as a further speedup factor because
it corresponds to the reduction in the computation time for a given statistical
uncertainty on the sensitivities. This diverges as the perturbation in the
finite-difference estimators $\delta$ tends to zero and may be very significant even
for a fairly large value of $\delta$.

⁴ The specification of the parameters used for this example is available upon request.
In conclusion, these numerical results illustrate how AAD allows an
extremely efficient calculation of counterparty credit risk valuations in MC.
In fact, AAD allows one to perform in minutes risk runs that would take oth-
erwise several hours or could not even be performed overnight without large
parallel computers.

11.5 Real-Time Risk Management of Flow Credit Products
The aftermath of the recent financial crisis has seen a dramatic shift in the
credit derivatives markets, with a conspicuous reduction of demand for com-
plex, capital intensive products, like bespoke collateralized debt obligations
(CDO), and a renewed focus on simpler and more liquid derivatives, such as
credit default indices and swaptions [17].
Against this background, dealers are quickly adapting to a business model geared
toward high-volume, lower margin products for which managing efficiently
the trading inventory is of paramount importance. As a result, the ability to
produce risk in real time is rapidly becoming one of the keys to running a
successful trading operation.
In this section, we demonstrate how AAD can be extremely effective also
for simpler credit products, typically valued by means of faster semianalytical
techniques. We will show how AAD provides orders of magnitude savings in
computational time and makes the computation of risk in real time—with no
additional infrastructure investment—a concrete possibility [18].

11.5.1 Pricing of credit derivatives


The key concept for the valuation of credit derivatives, in the context of
the models generally used in practice, is the hazard rate, λu , representing the
probability of default per unit time of the reference entity between times u
and u + du, conditional on survival up to time u. By modeling the default
event of a reference entity i as the first arrival time of a Poisson process with
deterministic intensity $\lambda^i_u$, the survival probability, $Q^i(t, T)$, is given by

$$Q(t, T; \lambda^i) = \exp\left[-\int_t^T du\, \lambda^i_u\right]. \tag{11.36}$$

In the hazard rate framework, the price of a credit derivative can be
expressed mathematically as

$$V(\theta) = V(\lambda(\theta), \theta), \tag{11.37}$$
where λ = (λ1 , . . . , λN ) are the hazard rate functions for N credit entities
referenced in a given contract. Here we have indicated generically with θ =
(θ1 , . . . , θNθ ) the vector of model parameters, for example, credit spreads,
recovery rates, volatilities, correlation and the market prices of the interest
rate instruments used for the calibration of the discount curve.
In general, the valuation of a credit derivative can be separated into a
Calibration Step:
θ → λ(θ)
for the construction of the hazard rate curve given liquidly traded CDS prices,
a term structure of recoveries and a given discount curve, and a
Pricing Step:
θ → V (λ(θ), θ)
mapping the hazard rate curves, and the other parameters on which the pricing
model explicitly depends, to the price of the credit derivative.
The pricing step is obviously specific to the particular credit derivative under
valuation. Instead, the calibration step is the same for any derivative priced
within the hazard rate framework. For the purpose of the discussion below, it
is useful to recall the main steps involved in the calibration of a hazard rate
curve.

11.5.1.1 Calibration step

The hazard rate function λu in Equation 11.36 is commonly parameterized as
piecewise constant with M knot points at times (t1 , . . . , tM ),
λ = (λ1 , . . . , λM ), such that

$$\lambda_u = \lambda_{n-1} = \frac{1}{t_n - t_{n-1}} \ln\left(\frac{Q(t, t_{n-1}; \lambda)}{Q(t, t_n; \lambda)}\right)$$

for tn−1 ≤ u < tn and t0 equal to the valuation date. In the calibration step,
the hazard rate knot points are calibrated from the price, or equivalently the
credit (par) spreads (s1 , . . . , sM ), of a set of liquidly traded CDS with matu-
rities T1 , . . . , TM , for example, using the standard bootstrap algorithm [17].
Such calibration can be expressed mathematically as solving a system of
M equations
$$G_j(\lambda, \theta) = 0, \tag{11.38}$$

j = 1, . . . , M , with

$$G_j(\lambda, \theta) = s_j - \frac{L(t, T_j; \lambda, \theta)}{A(t, T_j; \lambda, \theta)}, \tag{11.39}$$

where L(t, Tj ; λ, θ) and A(t, Tj ; λ, θ) are, respectively, the expected loss and
credit risky annuity for a Tj maturity CDS contract starting at time t.⁵ These
are defined as

$$L(t, T; \lambda, \theta) = \int_t^T du\, Z(t, u; \theta)\,(1 - R_u)\left(-\frac{dQ(t, u; \lambda)}{du}\right), \tag{11.40}$$

and (e.g., for continuously paid coupons)

$$A(t, T; \lambda, \theta) = \int_t^T du\, Z(t, u; \theta)\, Q(t, u; \lambda). \tag{11.41}$$

Here, Z(t, u; θ) is the discount factor from time t to time u, Q(t, u; λ) (resp.
−dQ(t, u; λ)) is the probability that the reference entity survives up to (resp.
defaults in an infinitesimal interval around) time u, and Ru is the expected
percentage recovery upon default at time u. The latter is generally expressed
as a piecewise constant function with the same discretization of the hazard
rate function, say R = (R1 , . . . , RM ).
The calibration Equations 11.38 and 11.39 are based on the definition of
the par spread si as the break-even coupon c that makes the value of a CDS,

$$V_{CDS}(t, T; \theta) = L(t, T; \lambda, \theta) - c\, A(t, T; \lambda, \theta), \tag{11.42}$$

worth zero.⁶ Since both the expected loss and the risky annuity at time Ti
depend on hazard rate points λj with j ≤ i, the calibration equations can be
solved iteratively starting from i = 1, by keeping fixed the hazard rate knot
points λj with j < i, and solving for λi .
Through the calibration process, the system of M equations 11.38 defines
implicitly the function λ = λ(θ), linking the hazard rate to the credit spreads,
the term structure of expected recovery and the discount factors. These are
in turn a function of the market instruments that are used for the calibration
of the discount curve.
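
The iterative bootstrap described above can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not taken from the text: a flat, continuously compounded discount rate `r`, a constant recovery, continuously paid coupons, the valuation date at time 0, and plain bisection as the one-dimensional solver. The function names and the convention that `hazards[n]` applies on the interval ending at `knots[n]` are hypothetical.

```python
import math

def annuity_and_loss(hazards, knots, maturity, r, recovery):
    """Risky annuity A (Eq. 11.41) and expected loss L (Eq. 11.40) from time 0."""
    annuity = loss = 0.0
    q, a = 1.0, 0.0  # survival probability and time at the start of each interval
    for h, b in zip(hazards, knots):
        b = min(b, maturity)
        if b <= a:
            break
        k = r + h
        # Closed form of the integral of exp(-k*v) over [0, b - a)
        width = -math.expm1(-k * (b - a)) / k if k > 0.0 else b - a
        piece = q * math.exp(-r * a) * width
        annuity += piece
        loss += (1.0 - recovery) * h * piece  # -dQ/du = h * Q on this interval
        q *= math.exp(-h * (b - a))
        a = b
    return annuity, loss

def par_spread(hazards, knots, maturity, r, recovery):
    """Model par spread s_j(lambda, theta) = L / A."""
    annuity, loss = annuity_and_loss(hazards, knots, maturity, r, recovery)
    return loss / annuity

def bootstrap(spreads, knots, r=0.02, recovery=0.4):
    """Solve G_j = s_j - L/A = 0 one knot at a time, j = 1, ..., M."""
    hazards = []
    for j, (s, maturity) in enumerate(zip(spreads, knots)):
        lo, hi = 0.0, 10.0  # bracket for the new hazard rate (assumed wide enough)
        for _ in range(200):  # bisection; previously solved knots stay fixed
            mid = 0.5 * (lo + hi)
            model = par_spread(hazards + [mid], knots[: j + 1], maturity, r, recovery)
            if model < s:
                lo = mid
            else:
                hi = mid
        hazards.append(0.5 * (lo + hi))
    return hazards

knots = [1.0, 3.0, 5.0]
flat = bootstrap([0.012, 0.012, 0.012], knots)    # flat spread curve
sloped = bootstrap([0.010, 0.015, 0.020], knots)  # upward-sloping spread curve
```

With a flat spread curve the sketch recovers a flat hazard rate equal to s/(1 − R), which is a convenient sanity check; each perturbation of an input in a bump and reval scheme would require rerunning `bootstrap` in full.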

5 Note that although the credit spreads sj are contained in the model parameter vector
θ, the risky annuity and the expected loss do not depend explicitly on them.
6 Note that since the standardization of CDS contracts in 2008, liquidly traded CDS are
characterized by a standard coupon and are generally quoted in terms of upfronts or quote
spreads. Both mark types can be mapped to a dollar value of a CDS contract by means of a
market standard parameterization [19], and hazard rates can be equivalently bootstrapped
from these marks using Equation 11.42. Credit (par) spreads remain nonetheless commonly
used in the market practice as risk factors for credit derivatives. The analysis of this paper
can be easily formulated in terms of quote spreads or upfronts.

11.5.2 Challenges in the calculation of credit risk

The computation of the sensitivities of the price of the credit derivative
11.37 with respect to the model parameters θ can be performed by means of
the chain rule

$$\frac{dV}{d\theta_k} = \frac{\partial V}{\partial \theta_k} + \sum_{j=1}^{M} \frac{\partial V}{\partial \lambda_j} \frac{\partial \lambda_j}{\partial \theta_k}, \tag{11.43}$$

where the first term captures the explicit dependence on the model param-
eters θ through the pricing step and the second term captures the implicit
dependence via the calibration step.
The computation of the calibration component of the price sensitivities
with standard bump and reval approaches is particularly onerous because it
involves repeating the calibration step for each perturbation. Especially for
portfolios of simple credit derivatives, like CDS, this can easily represent the
bulk of the computational burden. In addition, finite size perturbations of
credit spreads, recovery, or interest rates often correspond to inputs that do
not admit an arbitrage-free representation in terms of a nonnegative hazard
rate curve, thus making the robust and stable computation of sensitivities
challenging.

11.5.3 Adjoint calculation of risk


Both the computational costs and stability of the calculation of credit
risk can be effectively addressed by means of the AAD implementation of the
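chain rule 11.43. In particular, the adjoint of the algorithm consisting of the
Calibration Step and Pricing Step described earlier reads

Pricing Step:

$$\bar\theta_k = \bar V \frac{\partial V}{\partial \theta_k}, \qquad \bar\lambda_j = \bar V \frac{\partial V}{\partial \lambda_j}, \tag{11.44}$$

Calibration Step:

$$\bar\theta_k = \bar\theta_k + \sum_{j=1}^{M} \bar\lambda_j \frac{\partial \lambda_j}{\partial \theta_k}. \tag{11.45}$$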

Although in the following we will give explicit examples of the adjoint of
the pricing step for portfolios of CDS and credit default index swaptions, here
we focus our discussion on the adjoint of the calibration step in Equation 11.45
which is a time-consuming and numerically challenging step common to all
pricing applications within the hazard rate framework.
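
The two adjoint steps in Equations 11.44 and 11.45 can be illustrated on a deliberately small toy problem. The sketch below is an assumption-laden stand-in, not the chapter's implementation: θ = (s, R), the calibration step is the flat-curve relation λ = s/(1 − R), and the pricing step is the expected loss (1 − R)(1 − exp(−λT)) of a unit notional with zero interest rates. All names are hypothetical.

```python
import math

T = 5.0  # maturity in years (illustrative)

def calibrate(s, R):
    """Calibration step theta -> lambda for a flat hazard rate curve."""
    return s / (1.0 - R)

def price(lam, R):
    """Pricing step: expected loss of unit notional, zero interest rates."""
    return (1.0 - R) * (1.0 - math.exp(-lam * T))

def value(s, R):
    """Full valuation V(lambda(theta), theta)."""
    return price(calibrate(s, R), R)

def adjoint_risk(s, R, V_bar=1.0):
    """Return (dV/ds, dV/dR) by running the two adjoint steps in reverse order."""
    lam = calibrate(s, R)
    q = math.exp(-lam * T)
    # Adjoint of the pricing step (Eq. 11.44): explicit dependence only
    s_bar = 0.0                          # the price does not depend explicitly on s
    R_bar = V_bar * (-(1.0 - q))         # dV/dR at fixed lambda
    lam_bar = V_bar * (1.0 - R) * T * q  # dV/dlambda
    # Adjoint of the calibration step (Eq. 11.45): implicit dependence
    s_bar += lam_bar / (1.0 - R)           # dlambda/ds
    R_bar += lam_bar * s / (1.0 - R) ** 2  # dlambda/dR
    return s_bar, R_bar

s0, R0 = 0.012, 0.4
s_bar, R_bar = adjoint_risk(s0, R0)

# Central finite differences of the full valuation, for comparison only
h = 1e-6
fd_s = (value(s0 + h, R0) - value(s0 - h, R0)) / (2 * h)
fd_R = (value(s0, R0 + h) - value(s0, R0 - h)) / (2 * h)
```

Both sensitivities come out of a single backward sweep, whereas the finite-difference check needs two full revaluations per parameter.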

11.5.4 Implicit function theorem


The adjoint of the calibration step θ → λ(θ) can be produced following
the general rules of AAD. The associated computational cost can be generally
expected to be of the order of the cost of performing the bootstrap algorithm
a few times (but approximately less than 4 according to the general result of
AAD quoted above). This in itself is generally a very significant improvement
with respect to bump and reval approaches, which involve repeating the
bootstrap algorithm as many times as there are sensitivities required. However,
following the suggestions of [20,21], a much better performance can be obtained
by exploiting the so-called implicit function theorem, as described below.
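By differentiating with respect to θ the calibration identity 11.38, we get

$$\frac{\partial G_i}{\partial \theta_k} + \sum_{j=1}^{M} \frac{\partial G_i}{\partial \lambda_j} \frac{\partial \lambda_j}{\partial \theta_k} = 0,$$

for i = 1, . . . , M , and k = 1, . . . , Nθ , or equivalently

$$\frac{\partial \lambda_i}{\partial \theta_k} = -\left[\left(\frac{\partial G}{\partial \lambda}\right)^{-1} \frac{\partial G}{\partial \theta}\right]_{ik}. \tag{11.46}$$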

This relation allows the computation of the sensitivities of λ(θ), locally defined
in an implicit fashion by Equations 11.38 and 11.39, in terms of the sensitivities
of the function 11.39.
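In the specific case when θk ≠ sj for j = 1, . . . , M , that is, when
considering sensitivities with respect to market risk factors other than the
credit spreads, Equation 11.46 can be expressed in turn as

$$\frac{\partial \lambda_i}{\partial \theta_k} = -\left[\left(\frac{\partial s(\lambda, \theta)}{\partial \lambda}\right)^{-1} \frac{\partial s(\lambda, \theta)}{\partial \theta}\right]_{ik}. \tag{11.47}$$

Here we have used that θk is not a credit spread so that ∂G/∂θk =
−∂s(λ, θ)/∂θk , where the par spread functions

$$s(\lambda, \theta) = (s_1(\lambda, \theta), \ldots, s_M(\lambda, \theta)), \qquad s_j(\lambda, \theta) = \frac{L(t, T_j; \lambda, \theta)}{A(t, T_j; \lambda, \theta)} \tag{11.48}$$

are defined by Equations 11.38 and 11.39.
In the case of credit spread sensitivities, θk = sk , Equation 11.46 simplifies
as follows:

$$\frac{\partial \lambda_i}{\partial s_k} = \sum_{j=1}^{M} \left[\left(\frac{\partial s(\lambda, \theta)}{\partial \lambda}\right)^{-1}\right]_{ij} \frac{\partial G_j}{\partial s_k} = \sum_{j=1}^{M} \left[\left(\frac{\partial s(\lambda, \theta)}{\partial \lambda}\right)^{-1}\right]_{ij} \frac{\partial s_j}{\partial s_k} = \left[\left(\frac{\partial s(\lambda, \theta)}{\partial \lambda}\right)^{-1}\right]_{ik}, \tag{11.49}$$

where we have used that the par spread functions do not explicitly depend on
the credit spreads sk .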
Equations 11.47 and 11.49 express the implicit function theorem in the
context of hazard rate calibration. These allow the computation of the
sensitivities ∂λi /∂θk by (i) evaluating the sensitivities of the par spread
functions with respect to the model parameters, ∂sj (λ, θ)/∂θk , and the hazard
rates, ∂sk (λ, θ)/∂λi , and (ii) solving a linear system, for example, by Gaussian
elimination. This method is significantly more stable and efficient than the
naive approach of calculating the derivatives of the implicit functions
θ → λ(θ) by differentiating directly the calibration step either by bump and
reval or by applying AAD to the calibration step. This is because s(λ, θ) in
Equation 11.48 are explicit functions of the hazard rate and the model
parameters that are easy to compute and differentiate.
Combining the implicit function theorem with adjoint methods results in
extremely efficient risk computations, as we will demonstrate later.
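
As a minimal worked illustration of Equations 11.47 and 11.49, consider a special case chosen here for simplicity rather than taken from the text: a single knot (M = 1), a flat hazard rate λ, and a constant recovery R treated as the only non-spread parameter. In this case the par spread function is available in closed form, so the implicit function theorem can be checked against direct differentiation.

```latex
% Single knot (M = 1), flat hazard rate \lambda, constant recovery R.
% From Eqs. 11.40-11.41 with -dQ/du = \lambda Q:  L = (1 - R) \lambda A.
\begin{align*}
s(\lambda, R) &= \frac{L}{A} = (1 - R)\,\lambda, \\
\frac{\partial s}{\partial \lambda} &= 1 - R, \qquad
\frac{\partial s}{\partial R} = -\lambda, \\
\frac{\partial \lambda}{\partial s}
  &= \left(\frac{\partial s}{\partial \lambda}\right)^{-1} = \frac{1}{1 - R}
  && \text{(Eq. 11.49)}, \\
\frac{\partial \lambda}{\partial R}
  &= -\left(\frac{\partial s}{\partial \lambda}\right)^{-1}
      \frac{\partial s}{\partial R} = \frac{\lambda}{1 - R}
  && \text{(Eq. 11.47)}.
\end{align*}
% Check: solving s = (1 - R)\lambda gives \lambda = s / (1 - R), so
% \partial\lambda/\partial R = s / (1 - R)^2 = \lambda / (1 - R), as above.
```

Only derivatives of the explicit function s(λ, R) were needed; the calibration itself was never differentiated.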

11.5.5 Adjoint of the calibration step


All the sensitivities necessary to compute Equations 11.47 and 11.49 can
be obtained through the adjoint of the function

$$s_j = s_j(\lambda, \theta)$$

defined by Equation 11.48, namely, using the definitions 11.1 and 11.6,

$$(\bar\lambda, \bar\theta) = \bar s_j(\lambda, \theta, \bar s_j),$$

where the scalar s̄j is the adjoint of the jth par spread with j = 1, . . . , M . By
applying the rules of AAD, this can be implemented as

$$\begin{aligned}
\bar A_j &= -\bar s_j\, \frac{L(t, T_j; \lambda, \theta)}{A(t, T_j; \lambda, \theta)^2}, \\
\bar L_j &= \bar s_j\, \frac{1}{A(t, T_j; \lambda, \theta)}, \\
(\bar\lambda, \bar\theta) &\mathrel{+}= \bar A(t, T_j; \lambda, \theta, \bar A_j), \\
(\bar\lambda, \bar\theta) &\mathrel{+}= \bar L(t, T_j; \lambda, \theta, \bar L_j),
\end{aligned}$$

where Ā(t, Tj ; λ, θ, Āj ) and L̄(t, Tj ; λ, θ, L̄j ) are the adjoints of
A(t, Tj ; λ, θ) and L(t, Tj ; λ, θ), respectively.
Combining AAD and the implicit function theorem results therefore in the
following algorithm for the adjoint of the calibration routine, θ̄ = λ̄(θ, λ̄):

1. Execute (λ̄, θ̄) = s̄j (λ, θ, s̄j ) with s̄j = 1 for j = 1, . . . , M . This gives
   the derivatives:

   $$\bar\lambda_{ij} = \frac{\partial s_j}{\partial \lambda_i}, \qquad \bar\theta_{kj} = \frac{\partial s_j}{\partial \theta_k},$$

   for i = 1, . . . , M and k = 1, . . . , Nθ .

2. Find the matrix ∂λ/∂θ by solving the linear system

   $$\frac{\partial s}{\partial \lambda} \frac{\partial \lambda}{\partial \theta} = -\frac{\partial s}{\partial \theta}.$$

3. Return:

   $$\bar\theta_k = \sum_{i=1}^{M} \bar\lambda_i \frac{\partial \lambda_i}{\partial \theta_k},$$

   for k = 1, . . . , Nθ .

FIGURE 11.5: Cost of computing the sensitivities with respect to the
credit spreads and interest rate instruments, relative to the cost of a single
valuation, as a function of the number of sensitivities. (Plot: ratio of CPU
time versus number of risks, from 2 to 36. Plot labels: AAD (calibration),
AAD (total), Bumping, Calibration, Pricing.)

The adjoint of the calibration algorithm described earlier is extremely effi-
cient. Indeed, as illustrated in Figure 11.5, the sensitivities of the hazard rate
with respect to the credit spreads, and interest rate instruments can be com-
puted in ∼25% less time than performing a single bootstrap.
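
Steps 2 and 3 of the algorithm above can be sketched in a few lines. The example assumes that Step 1 has already produced the Jacobians ∂s/∂λ and ∂s/∂θ; the numerical values below are made up purely for illustration. It also uses the fact stated earlier that the par spread at maturity Tj depends only on hazard rate knots up to j, so that ∂s/∂λ is lower triangular and forward substitution can stand in for general Gaussian elimination. Function names and the storage convention `ds_dlam[j][i]` = ∂sj/∂λi are hypothetical.

```python
def forward_solve(lower, rhs):
    """Solve lower @ X = rhs by forward substitution (lower is lower triangular)."""
    m, n = len(lower), len(rhs[0])
    x = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for k in range(n):
            acc = rhs[i][k] - sum(lower[i][j] * x[j][k] for j in range(i))
            x[i][k] = acc / lower[i][i]
    return x

def calibration_adjoint(ds_dlam, ds_dtheta, lam_bar):
    """theta_bar_k = sum_i lam_bar_i * dlam_i/dtheta_k for non-spread parameters."""
    m, n = len(ds_dlam), len(ds_dtheta[0])
    # Step 2: solve (ds/dlam) (dlam/dtheta) = -(ds/dtheta)
    minus_rhs = [[-ds_dtheta[j][k] for k in range(n)] for j in range(m)]
    dlam_dtheta = forward_solve(ds_dlam, minus_rhs)
    # Step 3: contract with the incoming hazard rate adjoints
    return [sum(lam_bar[i] * dlam_dtheta[i][k] for i in range(m)) for k in range(n)]

# Illustrative (made-up) inputs: two knots, one non-spread parameter
ds_dlam = [[0.6, 0.0],
           [0.2, 0.4]]
ds_dtheta = [[-0.02],
             [-0.03]]
lam_bar = [1.0, 2.0]
theta_bar = calibration_adjoint(ds_dlam, ds_dtheta, lam_bar)
```

For these inputs the linear solve gives ∂λ/∂θ = (1/30, 7/120), so the returned adjoint is 1/30 + 2 · 7/120 = 0.15. No bootstrap is rerun at any point.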

11.5.6 Results
11.5.6.1 Credit default swaps
As a first example, we consider the calculation of price sensitivities for a
(portfolio of) CDS. In this case, the adjoint of the pricing step simply reads,
from Equation 11.42,

$$\begin{aligned}
\bar L &= \bar V, \\
\bar A &= -\bar V c, \\
(\bar\lambda, \bar\theta) &= \bar A(t, T; \lambda, \theta, \bar A), \\
(\bar\lambda, \bar\theta) &\mathrel{+}= \bar L(t, T; \lambda, \theta, \bar L),
\end{aligned}$$

where the risky annuity and expected loss (and their adjoint counterparts) are
those of the CDS in the portfolio. In this case, as illustrated in Figure 11.5,
the cost of the pricing step is a small portion (∼10%) of the overall cost
of computing the sensitivities which is instead dominated by the cost of the
calibration step. As a result, all the sensitivities can be obtained by means of
AAD for ∼15% less than the cost of performing a single valuation. In typical
applications, where computing sensitivities with respect to 18 spread tenors
and interest rate instruments is commonplace, this results in a reduction of
the computational cost by a factor of 50 or more.

11.5.6.2 Credit default index swaptions

As a second example, also of significant practical relevance, we consider
credit default index swaptions. The value of these instruments at time t is
given by

$$V_t = Z(t, T_E; \theta)\, E_t\left[\max\left(\zeta\left(V_{iCDS}(T_E, T_M) + L(T_E) - P_E\right), 0\right)\right], \tag{11.50}$$

where ζ = 1 for a payer and ζ = −1 for a receiver option, ViCDS (TE , TM ) is the
value at time TE of the underlying credit default index swap (long protection)
with standard coupon rate and maturity TM , PE is the exercise fee, and L(TE )
is the value at time TE of the loss given default associated to the names that
have defaulted before expiry,

$$L(T_E) = \sum_{i=1}^{N} I(\tau^i < T_E)\, N^i\, (1 - R^i_{\tau}),$$

where N is the number of names in the index, I is the indicator function, and
$N^i$, $\tau^i$, and $R^i_u$ are the notional, default time, and recovery function of the
ith name in the portfolio.⁷
7 Here for simplicity of exposition, we assume that no names in the index have defaulted
at valuation time.

According to the de facto market standard model [22], the value at time
TE of the random quantity given by the sum of the loss amount, L(TE ), and
the value of the credit default index swap, ViCDS (TE , TM ), is modeled in
terms of a single state variable, the default adjusted forward spread sTE , as

$$V_{iCDS}(T_E, T_M) + L(T_E) = N_{tot}\, A_{isda}(s_{T_E}, T_E, T_M)\,(s_{T_E} - c), \tag{11.51}$$

where c is the fixed rate in the underlying credit default index swap and
$N_{tot} = \sum_{i=1}^{N} N^i$ is the total notional of the index. Here Aisda (s, t, T ) is the
standardized risky annuity of Equation 11.41 calculated assuming a flat term
structure of the credit spread s, according to the standard ISDA conventions
[19]. In the simplest setting, the default adjusted forward spread is assumed
lognormally distributed,

$$s_{T_E} = F_{T_E} \exp\left[-\frac{1}{2}\sigma_{T_E}^2 (T_E - t) + \sigma_{T_E} \sqrt{T_E - t}\, \tilde Z\right], \tag{11.52}$$

where σTE is the volatility of the default adjusted forward spread, Z̃ is a
standard normal random variable and the forward FTE can be determined by
taking the expectation of both sides of Equation 11.51 giving

$$G_F(F_{T_E}, \lambda, \theta) \equiv V^{adj}_{iCDS}(T_E, T_M; \lambda, \theta) - V^{isda}_{iCDS}(T_E, T_M; F_{T_E}, \theta) = 0. \tag{11.53}$$

The first term in the equation above,

$$V^{adj}_{iCDS}(T_E, T_M; \lambda, \theta) = E_t\left[V_{iCDS}(T_E, T_M) + L(T_E)\right],$$

can be computed according to the standard hazard rate model using the time
t default and recovery curves of the index constituents:

$$V^{adj}_{iCDS}(T_E, T_M; \lambda, \theta) = \tilde L(t, T_E; \lambda, \theta) + Z(t, T_E; \theta) \sum_{i=1}^{N} N^i \left[L(T_E, T_M; \lambda^i, \theta) - c\, A(T_E, T_M; \lambda^i, \theta)\right], \tag{11.54}$$

with

$$\tilde L(t, T_E; \lambda, \theta) = Z(t, T_E; \theta) \sum_{i=1}^{N} N^i\, \tilde L^i(t, T_E; \lambda^i, \theta),$$

where $\tilde L^i(t, T; \lambda^i, \theta)$ is defined by setting in Equation 11.40 Z(t, u; θ) → 1 to
reflect that the loss amounts occurred before option expiry are settled at TE .
The second term can be computed instead by numerical integration over the
distribution of sTE , Equation 11.52,

$$V^{isda}_{iCDS}(T_E, T_M; F_{T_E}, \theta) = E_t\left[N_{tot}\, A_{isda}(s_{T_E}, T_E, T_M)\,(s_{T_E} - c)\right]. \tag{11.55}$$

The calibration Equation 11.53 defines implicitly the loss adjusted forward
spread, FTE , as a function of its volatility σTE , the hazard rates and expected
recoveries of the index constituents, and the risk parameters of the discount
curve, in short

FTE = FTE (λ; θ). (11.56)

For a given set of input parameters θ and the calibrated hazard rates
for the index constituents λ, the pricing algorithm consists of the following
steps:

Step 1 Calibrate the forward by solving the calibration Equation 11.53. This
       involves computing Equation 11.54 using the hazard rate model and
       Equation 11.55 by numerical integration for each trial value of FTE .

Step 2 Compute the option value 11.50 using Equation 11.51, for example,
       using Gaussian quadrature

       $$V_t = Z(t, T_E) \sum_{k=1}^{L} w_k\, \phi(x_k; F_{T_E}, \theta)\, P_k, \tag{11.57}$$

       where φ(xk ; FTE , θ) is the probability density function of sTE ,

       $$P_k = \left(\zeta\left(N_{tot}\, A_{isda}(x_k, T_E, T_M)\,(x_k - c) - P_E\right)\right)^{+}, \tag{11.58}$$

       L is the number of quadrature points, and wk the quadrature weights.
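
Step 2 can be sketched as below, with two substitutions that are assumptions of this example and not the chapter's method: the ISDA annuity is replaced by a constant `annuity`, and the Gaussian quadrature over the spread is replaced by a simple trapezoid rule over the standard normal variable of Equation 11.52. Holding the annuity constant is what makes a Black-type closed form available as a check. All names and parameter values are hypothetical.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def swaption_value(F, sigma, tau, c, fee, notional, annuity, Z, zeta,
                   n=4001, zmax=10.0):
    """Discounted expected payoff in the spirit of Eqs. 11.57-11.58."""
    dz = 2.0 * zmax / (n - 1)
    total = 0.0
    for k in range(n):
        z = -zmax + k * dz
        w = dz * (0.5 if k in (0, n - 1) else 1.0)  # trapezoid weight
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        # Lognormal spread of Eq. 11.52 evaluated at this node
        x = F * math.exp(-0.5 * sigma**2 * tau + sigma * math.sqrt(tau) * z)
        payoff = max(zeta * (notional * annuity * (x - c) - fee), 0.0)
        total += w * density * payoff
    return Z * total

def black_payer(F, sigma, tau, c, fee, notional, annuity, Z):
    """Closed form, valid only because the annuity is held constant here."""
    K = c + fee / (notional * annuity)  # effective strike including the fee
    d1 = (math.log(F / K) + 0.5 * sigma**2 * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return Z * notional * annuity * (F * norm_cdf(d1) - K * norm_cdf(d2))

params = dict(F=0.012, sigma=0.4, tau=0.5, c=0.01, fee=0.001,
              notional=1.0, annuity=4.5, Z=0.99)
payer = swaption_value(zeta=1, **params)
receiver = swaption_value(zeta=-1, **params)
closed_form = black_payer(**params)
```

The payer value should agree with the closed form up to the integration error, and payer minus receiver should equal the discounted forward value Z (N A (F − c) − PE), which follows from taking the expectation of the payoff without the positive part.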


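The adjoint of the implicit forward function 11.56,

$$(\bar\lambda, \bar\theta) = \bar F_{T_E}(\lambda, \theta, \bar F_{T_E}), \tag{11.59}$$

can be computed by means of the implicit function theorem, similarly to what
we described for the adjoint of the hazard rate calibration. More explicitly,
one first computes the adjoint of the calibration function 11.53

$$(\bar F_{T_E}, \bar\lambda, \bar\theta) = \bar G_F(F_{T_E}, \lambda, \theta, \bar G_F)$$

with

$$\bar G_F = \bar V^{adj}_{iCDS}(T_E, T_M; \lambda, \theta, \bar G_F) - \bar V^{isda}_{iCDS}(T_E, T_M; F_{T_E}, \theta, \bar G_F). \tag{11.60}$$

Here

$$(\bar\lambda, \bar\theta) = \bar V^{adj}_{iCDS}(T_E, T_M; \lambda, \theta, \bar V^{adj}_{iCDS})$$

and

$$(\bar F_{T_E}, \bar\theta) = \bar V^{isda}_{iCDS}(T_E, T_M; F_{T_E}, \theta, \bar V^{isda}_{iCDS})$$

are the adjoints of Equations 11.54 and 11.55, respectively. For ḠF = 1,
Equation 11.60 gives F̄TE = ∂GF /∂FTE , λ̄ij = ∂GF /∂λij , and θ̄k = ∂GF /∂θk ,
for i = 1, . . . , N , j = 1, . . . , M , k = 1, . . . , Nθ . Applying the implicit function
theorem to the function GF , one finally obtains the outputs of the function
in Equation 11.59:

$$\bar\lambda_{ij} = \bar F_{T_E} \frac{\partial F_{T_E}}{\partial \lambda_{ij}} = -\bar F_{T_E} \left(\frac{\partial G_F}{\partial F_{T_E}}\right)^{-1} \frac{\partial G_F}{\partial \lambda_{ij}},$$

$$\bar\theta_k = \bar F_{T_E} \frac{\partial F_{T_E}}{\partial \theta_k} = -\bar F_{T_E} \left(\frac{\partial G_F}{\partial F_{T_E}}\right)^{-1} \frac{\partial G_F}{\partial \theta_k}.$$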
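
The scalar version of the implicit function theorem used above can be sketched as follows, taking the volatility as the parameter of interest. Several simplifications are assumptions of this example only: unit notional, a flat-spread annuity built from the relation λ = s/(1 − R) as a stand-in for the ISDA annuity, a trapezoid rule in place of Gaussian quadrature, bisection as the root finder, and central differences of the explicit function in place of its adjoint. The target value `v_adj` is synthetic, generated from a known forward so that the calibration can be checked.

```python
import math

R, r, TENOR, C, TAU = 0.4, 0.02, 5.0, 0.01, 0.5  # illustrative inputs

def flat_annuity(s):
    """Risky annuity for a flat spread s (stand-in for A_isda)."""
    k = r + s / (1.0 - R)
    return -math.expm1(-k * TENOR) / k

def v_isda(F, sigma, n=801, zmax=8.0):
    """Eq. 11.55 for unit notional, by a trapezoid rule over the normal variable."""
    dz = 2.0 * zmax / (n - 1)
    total = 0.0
    for i in range(n):
        z = -zmax + i * dz
        w = dz * (0.5 if i in (0, n - 1) else 1.0)
        s = F * math.exp(-0.5 * sigma**2 * TAU + sigma * math.sqrt(TAU) * z)
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += w * density * flat_annuity(s) * (s - C)
    return total

def calibrate_forward(v_adj, sigma):
    """Solve G_F = v_adj - v_isda(F, sigma) = 0 for F by bisection (Eq. 11.53)."""
    lo, hi = 1e-6, 1.0  # bracket assumed to contain the root
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if v_isda(mid, sigma) < v_adj:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def dF_dsigma_implicit(F, sigma, h=1e-5):
    """dF/dsigma = -(dG_F/dF)^(-1) dG_F/dsigma, with no recalibration."""
    dG_dF = -(v_isda(F + h, sigma) - v_isda(F - h, sigma)) / (2 * h)
    dG_dsigma = -(v_isda(F, sigma + h) - v_isda(F, sigma - h)) / (2 * h)
    return -dG_dsigma / dG_dF

sigma0 = 0.4
v_adj = v_isda(0.012, sigma0)  # synthetic target with known forward 0.012
F0 = calibrate_forward(v_adj, sigma0)
implicit = dF_dsigma_implicit(F0, sigma0)

# Bump-and-recalibrate benchmark: two extra root searches
bump = 1e-4
recalibrated = (calibrate_forward(v_adj, sigma0 + bump)
                - calibrate_forward(v_adj, sigma0 - bump)) / (2 * bump)
```

The implicit route needs a handful of evaluations of the explicit function at the calibrated point, while the benchmark repeats the whole root search for each bump.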

The adjoint of the pricing algorithm consists therefore of the following
steps:

Step 2̄ Set:

       $$\bar Z = \bar V\, \frac{V_t}{Z(t, T_E; \theta)}$$

       and

       $$\bar\theta = \bar Z(t, T_E; \theta, \bar Z),$$

       where Z̄(t, T ; θ, Z̄) is the adjoint of the discount function. Then
       compute the adjoint of the Gaussian quadrature Equations 11.57 and
       11.58, namely set F̄TE = 0, and

       $$\bar\phi_k = \bar V\, Z(t, T_E; \theta)\, w_k\, P_k,$$
       $$(\bar F_{T_E}, \bar\theta) \mathrel{+}= \bar\phi(x_k; F_{T_E}, \theta, \bar\phi_k),$$

       for k = 1, . . . , L, where φ̄(xi ; FTE , θ, φ̄i ) is the adjoint of the
       probability density function. Note that due to the linearity of the
       adjoint function with respect to the adjoint input, these instructions
       can be re-expressed in terms of a numerical integration of the form

       $$(\bar F_{T_E}, \bar\theta) = Z(t, T_E; \theta) \sum_{k=1}^{L} w_k\, \bar\phi(x_k; F_{T_E}, \theta, \bar V)\, P_k,$$

       that is, the adjoint of a Gaussian quadrature can be expressed in
       terms of the quadrature of the adjoint of the integrand.

Step 1̄ Set λ̄ = 0 and execute the adjoint of the implicit forward function
       11.56,

       $$(\bar\lambda, \bar\theta) \mathrel{+}= \bar F_{T_E}(\lambda, \theta, \bar F_{T_E}),$$

       computed as described earlier. Note that the adjoint function in
       Equation 11.55 can also be expressed in terms of a Gaussian quadrature.
Steps 2̄ and 1̄ provide the outputs of the adjoint of the pric-
ing step in Equation 11.44. Performing the adjoint of the calibra-
tion step 11.45 as previously described generates the full set of
sensitivities.
The remarkable computational efficiency achievable for swap-
tions is illustrated in Figure 11.6. Here we plot the cost of comput-
ing the sensitivities with respect to the volatility, the constituents’
credit spreads and interest rate instruments—relative to the cost
of performing a single valuation—for different numbers of index
constituents, ranging from 10 (e.g., for iTraxx SOVX Asia Pacific)
to 125 (e.g., for iTraxx Europe or [Link]). Combining AAD
with the implicit function theorem allows the computation of inter-
est rate and (constituents) credit spread risk in 20% less than the
cost of computing the option value, resulting in up to 3 orders
of magnitude savings (note the logarithmic scale) in computational
time.

FIGURE 11.6: Cost of computing the sensitivities with respect to the
volatility, the constituents’ credit spreads and interest rate instruments,
relative to the cost of performing a single valuation, as a function of the
number of index constituents. (Plot: ratio of CPU time on a logarithmic scale
versus number of index constituents, from 10 to 125, for AAD and bumping.)

11.6 Conclusion
In conclusion, we have shown how AAD is extremely beneficial for the risk
management of financial derivatives by discussing three examples: (i) interest
rate products; (ii) counterparty credit risk management; and (iii) flow credit
products. These examples illustrate how AAD is effective in speeding up, by
several orders of magnitude, the computation of price sensitivities both in the
context of MC applications and for applications involving faster numerical
methods. In particular, we have shown how by combining adjoint ideas with
the implicit function theorem one can avoid the necessity of repeating
multiple times the calibration step of a financial model which, especially for
flow products, often represents the bottleneck in the computation of risk. A
recent publication [23] illustrates the application of these ideas to the
calculation of risk for partial differential equation applications.
These examples illustrate how AAD allows one to perform in minutes risk
runs that would otherwise take several hours or could not even be performed
overnight without large parallel computers. AAD therefore makes possible
real-time risk management on an industrial scale without onerous investments
in calculation infrastructure, allowing investment firms to hedge their posi-
tions more effectively, actively manage their capital allocation, reduce their
infrastructure costs, and ultimately attract more business.
The opinions and views expressed in this chapter are uniquely those of the
authors and do not necessarily represent those of Credit Suisse.

References
1. Giles, M. and Glasserman, P. Smoking adjoints: Fast Monte Carlo greeks. Risk, 19:88–92, 2006.
2. Capriotti, L. Fast greeks by algorithmic differentiation. Journal of Computational Finance, 3:3–35, 2011.
3. Capriotti, L. and Giles, M. Algorithmic differentiation: Adjoint greeks made easy. Risk, 25:92–98, 2010.
4. Capriotti, L. and Giles, M. Fast correlation risk by adjoint algorithmic differentiation. Risk, 23:79–85, 2010.
5. Broadie, M. and Glasserman, P. Estimating security price derivatives using simulation. Management Science, 42:269–285, 1996.
6. Griewank, A. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Frontiers in Applied Mathematics, Philadelphia, 2000.
7. Sorella, S. and Capriotti, L. Algorithmic differentiation and the calculation of forces by quantum Monte Carlo. Journal of Chemical Physics, 133:234111, 2010.
8. Brace, A., Gatarek, D., and Musiela, M. The market model of interest rate dynamics. Mathematical Finance, 7:127–155, 1997.
9. Glasserman, P. Monte Carlo Methods in Financial Engineering. Springer, New York, 2004.
10. Denson, N. and Joshi, M. Fast and accurate greeks for the Libor Market Model. Journal of Computational Finance, 14:115–125, 2011.
11. Leclerc, Q. M. and Schneider, I. Fast Monte Carlo Bermudan greeks. Risk, 22:84–88, 2009.
12. Capriotti, L., Peacock, M., and Lee, J. Real time counterparty credit risk management in Monte Carlo. Risk, 24:86–90, 2011.
13. Brigo, D. and Capponi, A. Bilateral counterparty risk with application to CDSS. Risk, 22:85–90, 2010.
14. Jarrow, R., Lando, D., and Turnbull, S. A Markov model for the term structure of credit risk spreads. Review of Financial Studies, 10:481–523, 1997.
15. Schonbucher, P. Credit Derivatives Pricing Models: Models, Pricing, Implementation. Wiley Finance, London, 2003.
16. Joshi, M. and Kainth, D. Rapid computation of prices and deltas of nth to default swaps in the li model. Quantitative Finance, 4:266–275, 2004.
17. O’Kane, D. Modelling Single-name and Multi-name Credit Derivatives, volume 573. John Wiley & Sons, 2011.
18. Capriotti, L. and Lee, J. Adjoint credit risk management. Risk, 27:90–96, 2014.
19. ISDA. ISDA CDS standard model. Lehman Brothers Quantitative Credit Research, 2003.
20. Christianson, B. Reverse accumulation and implicit functions. Optimization Methods and Software, 9:307–322, 1998.
21. Henrard, M. Adjoint algorithmic differentiation: Calibration and implicit function theorem. Open Gamma Quantitative Research, 2011.
22. Pedersen, C. M. Valuation of portfolio credit default swaptions. Lehman Brothers Quantitative Credit Research, 2003.
23. Capriotti, L., Jiang, Y., and Macrina, A. Real-time risk management: An AAD–PDE approach. International Journal of Financial Engineering, 2:1550039, 2015.

Chapter 12
Tackling Reinsurance Contract Optimization by Means of
Evolutionary Algorithms and HPC

Omar Andres Carmona Cortes and Andrew Rau-Chaplin

CONTENTS
12.1 Introduction
12.2 Modeling the RCO Problem
    12.2.1 Reinsurance costs
    12.2.2 Reinsurance recoveries
    12.2.3 The risk value and optimization problem
12.3 Evolutionary Algorithms
    12.3.1 Population-based incremental learning (PBIL)
    12.3.2 Differential evolution (DE)
12.4 Case Study
    12.4.1 Metrics
    12.4.2 Results
        12.4.2.1 Parallel version
References

12.1 Introduction
Risk hedging strategies are at the heart of prudent risk management. Indi-
viduals often hedge risks to their property, particularly from infrequent but
expensive events such as fires, floods, and robberies, by entering into risk
transfer contracts with insurance companies. Insurance companies collect pre-
miums from those individuals with the expectation that at the end of the year
they will have taken in more money than they have had to pay out in losses
and overhead, and therefore remain profitable or at least solvent. Perhaps
not surprisingly, insurance companies themselves try to hedge their risks, par-
ticularly from the potentially enormous losses often associated with natural
catastrophes such as earthquakes, hurricanes, and floods. Much of this hedging
is facilitated by the global “property cat” reinsurance market [1], where
reinsurance companies insure primary insurance companies against the massive
claims that can occur due to natural catastrophes. Figure 12.1 illustrates how
this flow works.

FIGURE 12.1: Risk and premium flow. (Diagram: Consumer, Primary insurer, and
Reinsurers, linked by flows labeled Premium/risk, Claims, and Claims payments.)

Analytics in the reinsurance market is becoming increasingly complex for
at least three reasons. First, factors like climate change are skewing the data
in ways that are not fully understood, making experience less useful for deci-
sion making. Second, the global distribution of economic activity is changing
rapidly with key supply chains now having significant presence in parts of
the world where catastrophic risk is not as well understood. For example, few
in 2011 understood that a Thailand flood event could cost US$47 billion in
property losses and cause a global shortage of hard disk drives that lasted
throughout 2012. Lastly, there is a tendency for risk transfer contracts to
become ever more complex, in large part by increasing the number of
subcontracts (called layers) that make up a contract. This in turn makes it
increasingly important to have good computational tools that can help
underwriters understand the interaction between layers and decide on placement
percentages, that is, which layers to buy and how large a share or percentage
of each to buy in order to maximize the risk hedging and the expected return.
From the perspective of an insurance company, the problem is known as
Reinsurance Contract Optimization (RCO). In this problem, we can iden-
tify a reinsurance contract consisting of a fixed number of layers and a set
of expected loss distributions (one per layer) as produced by a Catastrophe
Model [2], plus a model of current costs in the global reinsurance market.
The main difficulty in this problem is to identify the optimal combinations of
placements (percent shares of subcontracts).
In order to solve the RCO problem, an enumeration method can be used;
however, this approach presents two main problems: (i) it has to be discretized,
demanding some changes in numerical algorithms and (ii) it is only applica-
ble in small problem instances ranging from two to four layers, whereas real
instances of the RCO problem can have seven or more layers. For instance, a
seven-layered problem can take several weeks to be solved with a 5% level of
discretization on the search space using the enumeration method as presented
in Reference 3. Thus we discard the enumerative search in favor of an
evolutionary heuristic search.
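
To give a feel for why enumeration does not scale, the size of the discretized search space can be counted directly. The sketch below assumes a regular grid of placements from 0% to 100% inclusive in 5% steps, which is one plausible reading of a "5% level of discretization" and not a detail confirmed by the text; the function name is hypothetical.

```python
def grid_size(layers, step_percent=5):
    """Number of placement combinations on a regular grid from 0% to 100%."""
    levels = 100 // step_percent + 1  # 0%, 5%, ..., 100% gives 21 levels
    return levels ** layers

# Search-space size for two, four, and seven layers
sizes = {n: grid_size(n) for n in (2, 4, 7)}
```

Under this assumption two layers give 441 candidates and four layers give 194,481, while seven layers give 1,801,088,541, each of which must be evaluated against the loss distributions of all layers.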

12.2 Modeling the RCO Problem


As previously stated, insurance organizations, with the help of the global
reinsurance market, look to hedge their risk against potentially large claims
or losses. This transfer of risk is done in a manner similar to how a consumer
cedes part of the risk associated with their private holdings. Unlike the
consumer, who is usually given options as to the type of insurance structures
to choose from, the insurer has the ability to set its own structures and offer
them to the reinsurance market. This process involves decisions about the type
and magnitude of financial structures, such as deductibles and limits, as well
as the amount of risk the insurer wishes to
maintain. The deductible describes the amount of loss that the insurer must
incur before being able to claim a loss to the reinsurance contract. The limit
describes the maximum amount in excess of the deductible that is claimable.
Lastly, the placement describes the percentage of the claimed loss that will be
covered by the reinsurer.
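
The three contractual terms just described can be combined into a single recovery function. This is a sketch of one natural reading of the definitions above (loss in excess of the deductible, capped at the limit, scaled by the placement); the chapter's own recovery formula may include further terms. The function name is hypothetical, and the numbers in the usage lines are the Layer 1 terms of the first solution in Figure 12.2.

```python
def layer_recovery(loss, limit, deductible, placement):
    """Amount recovered from one layer for a given loss."""
    # Loss in excess of the deductible, capped at the limit
    claimable = min(max(loss - deductible, 0.0), limit)
    # Only the placed share of the claimable amount is covered
    return placement * claimable

# Layer 1 of Solution 1 in Figure 12.2: limit 200, deductible 300, placement 60%
below_deductible = layer_recovery(250.0, 200.0, 300.0, 0.60)
partial = layer_recovery(400.0, 200.0, 300.0, 0.60)
exhausted = layer_recovery(1000.0, 200.0, 300.0, 0.60)
```

A loss of 250 recovers nothing, a loss of 400 recovers 60% of the 100 in excess of the deductible, and any loss of 500 or more recovers the maximum of 120, matching the "Max recovery" column of the figure.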
Typically, companies try to hedge their risk by placing multiple layers at
once as illustrated in Figure 12.2, that is, they may have multiple sets of limit
and deductible combinations. These different layers may also have different
placement amounts associated with them. At the same time, insurers are price
takers in terms of the compensation paid to reinsurers for assuming risk. This
compensation, or premium, depends on both the amount of risk associated
with a layer and the placement amount of the layer. For this reason, it is
important for insurers to choose placements carefully when seeking to buy
multiple layers. An optimal placement ensures that the insurer is able to
maximize its returns on reinsurance contracts for potentially large future events.
In order to simplify the problem description, we focus on the primary
contractual terms. Secondary terms, such as the contractual costs associated
with brokerage fees and contractual expenses, as well as provisions such as
reinstatement premiums, are straightforward to add. As is typically done in
reinsurance markets, contracts are assumed to be in force for a 1-year period.

12.2.1 Reinsurance costs


The basic cost of reinsurance to an insurer comes in the form of premium
payments. As mentioned previously, the amount of premium paid for a layer
can vary with the amount of the layer being placed in the market. In general,
premiums are stated per unit of limit, also known as a rate on line. The cost
of the reinsurance layer can then be expressed as
p = πμ(π, l, d) × l (12.1)
where p is the monetary value of the premium, μ is the rate on line, π is the
placement, d is the deductible, and l is the limit.

Solution  Layer    Limit  Deductible  Placement  Max recovery  Premium
1         Layer 1  200    300         60%        120           $40
1         Layer 2  200    100         20%        40            $40
2         Layer 1  200    300         30%        60            $25
2         Layer 2  200    100         95%        190           $100

FIGURE 12.2: An example two-layer reinsurance contract optimization
problem with two sample solutions.

For contracts with multiple layers, Equation 12.1 can be generalized to
Equation 12.2,

$$p = \mu^{T} L \pi \qquad (12.2)$$

where L is an n × n diagonal matrix of limits, μ is an n × 1 vector of rates
on line, π is an n × 1 vector of placements, and n is the number of layers
being placed. This expression defines a model of expected reinsurance costs.
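
As a minimal sketch of Equation 12.2, the premium can be computed as the sum over layers of rate on line × limit × placement. The limits, placements, and premiums below are those of Solution 1 in Figure 12.2. The rates on line are not given in the figure; they are back-computed here from the quoted premiums, which is an assumption made only for illustration, and the function name is hypothetical.

```python
def total_premium(rates_on_line, limits, placements):
    """Equation 12.2: p = mu^T L pi, with L the diagonal matrix of limits."""
    return sum(mu * l * pi for mu, l, pi in zip(rates_on_line, limits, placements))

# Solution 1 of Figure 12.2 (limits and placements taken from the figure).
limits = [200.0, 200.0]
placements = [0.60, 0.20]

# Assumed rates on line, back-computed from the $40 premium of each layer:
# mu_j = premium_j / (placement_j * limit_j).
rates_on_line = [40.0 / (0.60 * 200.0), 40.0 / (0.20 * 200.0)]

premium = total_premium(rates_on_line, limits, placements)
print(round(premium, 2))
```

Because the rate on line generally depends on the placement, as in Equation 12.1, a fuller model would look up μ for each candidate π rather than hold it fixed.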

12.2.2 Reinsurance recoveries


Losses affecting an insurer can be defined as a random variable X, such
that

$$X \sim f_X(x) \qquad (12.3)$$

where f_X is some distribution that represents the severity of X. These losses,
once claimed, are subject to the financial terms associated with the contract
they are being claimed against. Any one instance of X, x_i, then results in a
claim of

$$c_i = \max\{0, \min\{l, x_i - d\}\}\,\pi \qquad (12.4)$$



where c_i is the value of the claim for the ith instance of X. Equation 12.4
can then be extended to contracts with multiple layers as follows:

$$c_i = \sum_{j=1}^{n} \max\{0, \min\{l_j, x_i - d_j\}\}\,\pi_j \qquad (12.5)$$

where l_j, d_j, and π_j are the limit, deductible, and placement of the jth
layer, respectively. In addition, many contracts allow for multiple claims in
any given contractual year. Assuming no financial terms that impose a maximum
amount claimable, the yearly contractual loss is then simply the sum of all
individual claims in a given contractual year,

$$y_j = \sum_{i=1}^{n} c_{ij} \qquad (12.6)$$

where y_j is the annual amount claimed for the jth layer. The annual return
for the reinsurance contract is then defined as

$$\begin{aligned} r &= y^{T}\pi - p \\ &= y^{T}\pi - \mu^{T} L \pi \\ &= (y^{T} - \mu^{T} L)\,\pi \end{aligned} \qquad (12.7)$$

where y is an n × 1 vector of annual claims for each layer.
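
The recovery logic of Equations 12.4 through 12.7 can be sketched as follows. The layer terms and the total premium of 80 come from Solution 1 in Figure 12.2; the three loss amounts are hypothetical, and the function names are illustrative only. In this sketch the placement is applied once, inside the claim, and the return is taken as recoveries minus premium, which is one reading of Equation 12.7.

```python
def layer_claim(loss, limit, deductible, placement):
    """Equation 12.4: recovery from one layer for a single loss."""
    return max(0.0, min(limit, loss - deductible)) * placement

def event_claim(loss, layers):
    """Equation 12.5: recovery summed over all layers for a single loss."""
    return sum(layer_claim(loss, l, d, pi) for (l, d, pi) in layers)

def annual_claim(losses, layers):
    """Sum of all claims in one contractual year (no annual cap assumed)."""
    return sum(event_claim(x, layers) for x in losses)

# Solution 1 of Figure 12.2: (limit, deductible, placement) per layer.
layers = [(200.0, 300.0, 0.60), (200.0, 100.0, 0.20)]
losses = [50.0, 250.0, 600.0]   # hypothetical losses in one contractual year
premium = 80.0                  # total premium of Solution 1 in Figure 12.2

recovered = annual_claim(losses, layers)
annual_return = recovered - premium
print(round(recovered, 2), round(annual_return, 2))
```

The loss of 50 is below both deductibles and recovers nothing, the loss of 250 only reaches the second layer, and the loss of 600 exhausts both layers, so the recoveries are capped at the maximum recoveries listed in Figure 12.2.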

12.2.3 The risk value and optimization problem


Given a fixed number of layers and loss distributions, the insurer is then
faced with selecting an optimal combination of placements. As with most
financial structures, the problem faced involves selecting an optimal propor-
tion, or placement, of each layer such that, for a given expected return on the
contracts, the associated risk is minimized. This is generally done by using a
risk value such as a variance, Value at Risk (VaR), or a Tail-Value at Risk
(TVaR). The TVaR is also referred to as a conditional Value at Risk (CVaR)
or a conditional tail expectation (CTE). Unlike the traditional finance port-
folio problem, in the insurance context a claim made, or loss, to the contract
is income to the buyer of the contract. This means, from the perspective of
the insurer, there is a desire to maximize the amount claimable for a given
risk value. In doing so, they minimize the amount of loss the insurer may face
in a year.
Equation 12.7 can be rewritten in matrix format such that

R = (Y − ML)π (12.8)

where R is an m × 1 vector of recoveries, Y is an m × n matrix of annual
claims, and M is an m × n matrix of rates on line (ROL), which holds the
rates-on-line values by layer and share percentage, that is, a model of the
cost of placing risk in the marketplace. Since the same year is being
simulated, each row of M is the same. This formulation leads to the following
optimization problem:

$$\begin{aligned} &\text{maximize} && \mathrm{VaR}_{\alpha}(R(\pi)) \\ &\text{s.t.} && E(R(\pi)) = a \end{aligned} \qquad (12.9)$$
Given that the expected return a is not specified, Equation 12.9 can be
rewritten as a Pareto frontier problem such that

$$\text{maximize} \quad \mathrm{VaR}_{\alpha}(R(\pi)) - q\,E(R(\pi)) \qquad (12.10)$$

where q is a risk tolerance factor greater than zero.
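
A minimal sketch of Equations 12.8 and 12.10 on simulated data is given below. It assumes that VaR_α is estimated as the empirical α-quantile of the simulated annual returns, which is only one of several conventions; the claims matrix, rates on line, limits, and placements are hypothetical values chosen for illustration.

```python
def returns_per_year(claims, rates_on_line, limits, placements):
    """Equation 12.8: one return per simulated year, R = (Y - M L) pi."""
    return [
        sum((y - mu * l) * pi
            for y, mu, l, pi in zip(year, rates_on_line, limits, placements))
        for year in claims
    ]

def empirical_var(returns, alpha):
    """Assumed convention: VaR is the empirical alpha-quantile of the returns."""
    ordered = sorted(returns)
    index = min(len(ordered) - 1, int(alpha * len(ordered)))
    return ordered[index]

def objective(returns, alpha, q):
    """Equation 12.10: VaR_alpha(R) - q * E(R)."""
    mean = sum(returns) / len(returns)
    return empirical_var(returns, alpha) - q * mean

# Hypothetical data: 4 simulated years (rows) by 2 layers (columns).
claims = [[0.0, 0.0], [0.0, 100.0], [200.0, 200.0], [0.0, 50.0]]
rates_on_line = [0.3, 0.5]
limits = [200.0, 200.0]
placements = [0.5, 0.5]

R = returns_per_year(claims, rates_on_line, limits, placements)
score = objective(R, alpha=0.25, q=1.0)
print([round(r, 2) for r in R], round(score, 2))
```

In practice the number of simulated years m is large, and the same R vector can be reused to evaluate any other risk value, such as TVaR.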


This problem can be approached using a number of different methods.
Mistry et al. [4] use an enumeration approach by discretizing the search space
for each layer’s placements. The discretization of the placements may be desir-
able for practical reasons (i.e., a placement with more than two decimal places
may be invalid in negotiations) and the full enumeration method lends itself
well to parallel computation. However, the computational time to evaluate
all possible combinations increases exponentially as the number of layers and
the resolution of the discretization increases. This renders the enumeration
approach infeasible for many practically sized problems.
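
To make the exponential growth concrete, the following sketch counts the placement combinations that a full enumeration has to evaluate. It assumes that each layer's placement is discretized on a grid from 0% to 100% inclusive, so a 5% step gives 21 values per layer; the function name is hypothetical.

```python
def enumeration_size(num_layers, step):
    """Number of placement combinations on the grid 0, step, ..., 1 per layer."""
    values_per_layer = round(1.0 / step) + 1
    return values_per_layer ** num_layers

for n in (2, 4, 7):
    print(n, enumeration_size(n, 0.05))
```

Under this assumption a seven-layer problem at a 5% step has 21^7, roughly 1.8 billion, combinations, each requiring a full evaluation of the risk value.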
Mitschele et al. introduced the use of heuristic methods for addressing
reinsurance optimization problems [5]. They show the power of two multiob-
jective evolutionary algorithms (EAs) in finding nondominated combinations,
in comparison to the true nondominated set of points. However, their work
is done exclusively in continuous space and focuses on algorithms that change
the limit and deductible aspects of a reinsurance contract. Their methods are
therefore not directly applicable here.

12.3 Evolutionary Algorithms


EAs search for the fittest, or optimal, solution to a specific problem by
employing a population of individuals, where each individual represents a
possible solution. The population undergoes genetic operators over many
iterations until some stopping criterion is met: (i) a given number of
iterations has been performed, (ii) the population no longer evolves, or
(iii) the algorithm cannot find a better solution. Figure 12.3 shows this
process.
In the first step of an EA, the population is initialized at random, normally
using a uniform distribution. Each member of the population is then evaluated
to determine its fitness as a solution. The better an individual's fitness,
the stronger the individual is within that population. Consequently, stronger
members of the population have a higher probability of being selected for
genetic operators or of passing to the next iteration (generation).

[Figure 12.3 flowchart: Initial population, Evaluation, Stop criteria (Yes: Final population; No: Genetic operators).]

FIGURE 12.3: Structure of an EA.

Typically, the genetic operators are selection, cross-over, and mutation.
Selection is the process of choosing an individual to undergo genetic operators
or to go to the next generation. In cross-over operators, parents exchange
information (genes) between themselves in order to create one or more
offspring. Ideally, when two strong individuals exchange their genes, the
offspring tends to be stronger than its parents [6] and then spreads its genes
to later generations. On the other hand, this behavior can lead to premature
convergence because the population can become trapped in a local optimum. The
mutation operator is intended to avoid premature convergence by applying
modifications to one or more genes. Overall, the use of genetic operators tends
to improve solution quality as new generations are produced [7].
Taking these operators into account, we can observe that several EAs share
similar features. For instance, genetic algorithms and evolutionary strategies
may use all of the genetic operators mentioned above. However, the order in
which these operators are applied can differ. Evolutionary strategies apply the
other genetic operators before selection, whereas genetic algorithms select the
individuals and then apply those operators. Moreover, evolutionary strategies
need two vectors to represent an individual instead of only one, as is typical
for genetic algorithms.

12.3.1 Population-based incremental learning (PBIL)


PBIL was first proposed by Baluja [8]. The algorithm's populations are
encoded using binary vectors and an associated probability vector, which is
then updated based on the best members of a population. Unlike other EAs,
which transform the current population into new populations, PBIL generates a
new population at random from the updated probability vector in each
generation. Baluja describes his algorithm as a “combination of evolutionary
optimization and hill-climbing” [8].
Since Baluja’s work, extensions to the algorithm have been proposed for
continuous and base-n represented search spaces [9–11]. The extensions to
continuous search spaces using histograms (PBIL_H) and real coding (RPBIL)
split the search space into intervals, each with its own probability [9,11].
For multivariate cases, the probability vector is replaced by a probability
matrix, such that each row or column of the matrix represents the probability
vector of one independent variable. While those methods support continuous
spaces, a similar idea was used in Reference 3 to extend PBIL to a discrete
approach that deals with the reinsurance problem constraints, that is, the
intervals are replaced by equidistant increments between the lower and upper
bounds of the search space.

Algorithm 12.1: DiPBIL

Initialization: P_ij = 1/I, LR_N, NLR_N, x^best_G = {}, x^best_i = {}
for i = 1 to N_G do
    Generate a population X of size n from P_ij
    Evaluate f = fun(X)
    Find x^best_G from the current and previous populations
    Find x^best_i for top q − 1 members of the current population
    Update P_ij based on x^best_G ∪ x^best_i using LR_N and NLR_N
end

In the same spirit as canonical PBIL, the probability matrix is initialized
with all increments having equal probability. This matrix is then updated
after every generation using the best members. The updating of each vector in
the matrix, however, is done using the base-n method, with an adjusted
learning rate and updating function [10]. RPBIL suggests its own updating
function, which exponentially increases the probabilities as you move toward
intervals that are closer to the best individuals [9]. The base-n updating
method, however, has the side effect of slightly increasing the probability of
increments further from the best individuals in the search space, which may
allow for more population diversity [10]. To ensure more population diversity
across generations, the probability matrix is updated with the best member
from previous generations as well as the top q − 1 members from the current
generation, as in Algorithm 12.1. This modifies the updating equation slightly
to the one presented in Equation 12.11

$$p_{ij}^{NEW} = p_{ij}^{OLD} \sum_{k=1}^{q} \frac{LF_{ijk}}{q} \qquad (12.11)$$

where LF_ijk is the ith learning factor, as described in Reference 10, for the
kth best result for the jth variable.
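
The overall loop of Algorithm 12.1 can be sketched as below. This is only an illustration under stated assumptions: the probability update is a simple shift of probability mass toward the increments chosen by the guiding individuals, not the base-n learning factors of Reference 10; the negative learning rate is omitted; and the objective, parameter values, and function names are hypothetical.

```python
import random

def dipbil(fun, n_vars, n_levels, n_generations, pop_size, q, learn_rate, seed=0):
    """Maximize fun over a grid of n_levels equidistant values in [0, 1] per variable."""
    rng = random.Random(seed)
    levels = [k / (n_levels - 1) for k in range(n_levels)]
    # prob[j][k]: probability that variable j takes increment k (uniform at start).
    prob = [[1.0 / n_levels] * n_levels for _ in range(n_vars)]
    best_f, best_x = float("-inf"), None
    for _ in range(n_generations):
        population = [
            [rng.choices(range(n_levels), weights=prob[j])[0] for j in range(n_vars)]
            for _ in range(pop_size)
        ]
        scored = sorted(
            ((fun([levels[k] for k in ind]), ind) for ind in population),
            reverse=True,
        )
        if scored[0][0] > best_f:
            best_f, best_x = scored[0]
        # Guides: best member so far plus the top q - 1 of the current population.
        guides = [best_x] + [ind for _, ind in scored[:q - 1]]
        for guide in guides:
            for j, k in enumerate(guide):
                # Assumed update rule (not the base-n rule of Reference 10):
                # shrink the row and move the freed mass to the guide's increment.
                prob[j] = [(1.0 - learn_rate) * p for p in prob[j]]
                prob[j][k] += learn_rate
    return [levels[k] for k in best_x], best_f, prob

def toy_objective(x):
    # Hypothetical single objective with its maximum at 0.7 in every variable.
    return -sum((v - 0.7) ** 2 for v in x)

best, best_value, prob = dipbil(toy_objective, n_vars=3, n_levels=21,
                                n_generations=40, pop_size=30, q=3, learn_rate=0.1)
print(best, best_value)
```

Each row of the probability matrix stays normalized after every update, so a new population can always be sampled directly from it.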
The main drawback of the discrete PBIL is that it requires all of the
objective functions to be combined into a single function. To address this
issue and compute a better Pareto frontier, a true multiobjective optimization
approach, called MOPBIL, is presented in Algorithm 12.2.

Algorithm 12.2: MOPBIL (Sketch)


Input: NG = number of generations; nbest = number of best individuals;
while (NG not reached) do
Create the population using probability matrix;
Evaluate objective function on each member of the population;
Merge archive and the new population;
Determine the nondominated set;
Cluster the nondominated set into k clusters;
Select k representative individuals;
Insert the k individuals into the best population;
Update and mutate the probability matrix;
end
Determine the final Pareto frontier;

MOPBIL creates a random population by using the probability matrix as
proposed in Reference 3. Then, the nondominated solutions are found and the
resulting set is clustered. Finally, k representative solutions are chosen to
update the probability matrix. The clustering process is introduced to help
maintain diversity across generations, that is, the algorithm tries to choose
representatives from the k most different solutions, keeping them in the
population and using them to update the probability matrix. The individual
with the best risk value is chosen to represent each cluster. In other words,
we try to drive the search toward the optimal risk values and retain diversity
in the population at the same time. When the number of generations is reached,
the last results are combined with the archive in order to obtain the final
Pareto frontier. Figure 12.4 illustrates the high-level structure of this
process.
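
The "determine the nondominated set" step of Algorithm 12.2 can be sketched as follows. The sketch assumes that every objective is to be maximized and uses made-up objective vectors; the clustering and representative-selection steps are not shown.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def nondominated(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(other, p) for other in points)]

# Hypothetical (objective 1, objective 2) pairs: archive merged with a new population.
archive = [(1, 5), (2, 4), (2, 2)]
population = [(3, 3), (1, 1), (3, 1)]

front = nondominated(archive + population)
print(front)
```

Merging the archive with the new population before filtering ensures that a nondominated solution found in an earlier generation is kept until a later solution dominates it.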

[Figure 12.4 panels, repeated for n iterations: 1. Generate population; 2. Evaluate population; 3. Find nondominated set; 4. Cluster (k = 4); 5. Update population; 6. Mutate and update probability matrix.]

FIGURE 12.4: How the MOPBIL creates points and updates the probability
matrix.

12.3.2 Differential evolution (DE)


The DE algorithm was proposed by Storn and Price [12] in 1995. It is based
on the difference between two individuals, which is added to a third one. The
process is sketched in Algorithm 12.3. As in Particle Swarm Optimization, the
first step is to create a population at random. Then, while the stop criterion
is not reached, the vector of differences is calculated according to the
equation v = Pop_idx3 + F × (Pop_idx1 − Pop_idx2), where idx is a vector of
three randomly chosen individuals and F is a multiplication factor, normally
between 0 and 1. This strategy is called DE/Rand/1 because Pop_idx3 is
randomly chosen. When Pop_idx3 is the best individual in the population, the
strategy is called DE/Best/1.
The process of computing v is called mutation. Afterwards, a new individual
is created in a way similar to the discrete cross-over of genetic algorithms,
that is, for each dimension d, a gene is chosen from the vector of differences v
with a probability of CR, or from the target individual i with a probability of
1 − CR. Finally, if the fitness of the new individual is better than the fitness
of the target one, then the new individual replaces the individual i.
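
The mutation, cross-over, and replacement steps described above can be sketched for a single-objective minimization problem as follows. Choosing the three individuals to be distinct from each other and from the target, and clipping the mutant to [0, 1], are assumptions of this sketch; the test function and parameter values are hypothetical.

```python
import random

def de_rand_1(fun, dim, pop_size, generations, F=0.5, CR=0.9, seed=0):
    """Minimize fun over [0, 1]^dim with DE/Rand/1 and discrete cross-over."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    fit = [fun(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct individuals, all different from the target i (assumed).
            i1, i2, i3 = rng.sample([k for k in range(pop_size) if k != i], 3)
            # Mutation: v = Pop_idx3 + F * (Pop_idx1 - Pop_idx2).
            v = [pop[i3][j] + F * (pop[i1][j] - pop[i2][j]) for j in range(dim)]
            # Clipping to [0, 1] is an assumption (placements are shares).
            v = [min(1.0, max(0.0, g)) for g in v]
            # Cross-over: take each gene from v with probability CR.
            trial = [v[j] if rng.random() < CR else pop[i][j] for j in range(dim)]
            trial_fit = fun(trial)
            # Replacement: keep the trial only if it improves on the target.
            if trial_fit < fit[i]:
                pop[i], fit[i] = trial, trial_fit
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]

def shifted_sphere(x):
    # Hypothetical test function with its minimum at 0.3 in every dimension.
    return sum((g - 0.3) ** 2 for g in x)

best, value = de_rand_1(shifted_sphere, dim=3, pop_size=20, generations=100)
print(best, value)
```

Replacing the index i3 by the index of the best individual in the population turns this sketch into the DE/Best/1 strategy.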
The canonical DE has the same limitation as single-objective PBIL, that is,
it has to aggregate all objective functions into a single evaluation function.
Thus, a multiobjective version, called DEMO, is shown in Algorithm 12.4, where
we can observe that it is similar to the canonical version of DE with the
DE/Rand/1 strategy [13]. The differences start in line 16, at the dominance
test, where the new population is selected for the next iteration. If a new
individual (indiv) dominates the target one (Pop_i), then the new one is added
to the new population; if the target individual dominates the new one, then the
target individual is added to the new population; otherwise, both individuals
go to the new population. The dominance process builds a new population whose
size ranges from pop_size to 2 × pop_size. Finally, if the size of the new
population is larger than pop_size, then the individuals that go to the next
iteration are selected by crowding distance (the select_cdistance function).
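
A truncation by crowding distance can be sketched as follows. This is the crowding distance commonly used in NSGA-II style algorithms, applied here to a single set of objective vectors; whether select_cdistance also ranks the population by nondomination first is not stated in the text, so that step is left out as an assumption. The example points are made up.

```python
def crowding_distance(front):
    """Crowding distance of each objective vector in front (larger is less crowded)."""
    n = len(front)
    distance = [0.0] * n
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        lo, hi = front[order[0]][m], front[order[-1]][m]
        # Boundary points are always kept.
        distance[order[0]] = distance[order[-1]] = float("inf")
        if hi == lo:
            continue
        for pos in range(1, n - 1):
            gap = front[order[pos + 1]][m] - front[order[pos - 1]][m]
            distance[order[pos]] += gap / (hi - lo)
    return distance

def select_by_crowding(front, size):
    """Keep the `size` least crowded points, preserving their original order."""
    distance = crowding_distance(front)
    keep = sorted(range(len(front)), key=lambda i: distance[i], reverse=True)[:size]
    return [front[i] for i in sorted(keep)]

# Hypothetical objective vectors; (1, 3) and (1.2, 2.8) are close together.
front = [(0, 4), (1, 3), (1.2, 2.8), (3, 1), (4, 0)]
kept = select_by_crowding(front, 4)
print(kept)
```

In this example the point (1, 3) has the smallest crowding distance and is the one discarded, which spreads the surviving points more evenly along the frontier.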
Algorithm 12.3: DE

Pop ← generate_pop(n, d)
fit ← evaluate(Pop)
while (Stop Criteria is FALSE) do
    for i = 1 to #pop_size do
        idx ← select_indiv(3)
        v ← Pop_idx3 + F ∗ (Pop_idx1 − Pop_idx2)
        for j = 1 to dimension do
            n_j = rand()
            if (n_j < CR) then
                pop'_ij ← v_j
            else
                pop'_ij ← pop_ij
            end
        end
        fit'_i ← evaluate(pop'_i)
        if fit'_i < fit_i then
            pop_i ← pop'_i
            fit_i ← fit'_i
        end
    end
end

The main drawback of the original DEMO is that it does not maintain an
archive, thereby losing good solutions when the number of nondominated points
exceeds the size of the population. Taking this into account, we changed the
original algorithm in two ways. First, we introduce an archive in the
algorithm (after line 31 of Algorithm 12.4, i.e., it is updated on each
iteration) so that nondominated solutions are not lost from one iteration to
the next due to the crowding distance selection in line 30. Second, with the
archive in place, we are able to use different mutation operators, such as
those presented in Equations 12.12 through 12.15. These strategies are called
DE/ND/Rand/1, DE/ND/Rand/RF/1, DE/Arch/Rand/1, and DE/Arch/Rand/RF/1,
respectively. Equation 12.12 uses a random individual from the set of
nondominated ones. To do so, it is necessary to compute the nondominated set
between lines 3 and 4, that is, before starting the loop over the population.
Equation 12.13 is similar to the previous one; however, F is a random number
between 0 and 1. Equations 12.14 and 12.15 both choose the first individual
from the archive. The difference between them is that F is randomly chosen in
Equation 12.15.

v ← non_dominated_idx + F × (Pop_idx1 − Pop_idx2)          (12.12)
v ← non_dominated_idx + Rand() × (Pop_idx1 − Pop_idx2)     (12.13)
v ← archive_idx + F × (Pop_idx1 − Pop_idx2)                (12.14)
v ← archive_idx + Rand() × (Pop_idx1 − Pop_idx2)           (12.15)

Algorithm 12.4: DEMO

Pop ← generate_pop(n, d)
fit ← evaluate(Pop)
while (Stop Criteria is FALSE) do
    for i = 1 to #pop_size do
        idx ← select_indiv(3)
        v ← Pop_idx3 + F ∗ (Pop_idx1 − Pop_idx2)
        for j = 1 to dimension do
            n_j = rand()
            if (n_j < CR) then
                indiv_j ← v_j
            else
                indiv_j ← Pop_ij
            end
        end
        fit' ← evaluate(indiv)
        if (fit' dominates fit_i) then
            pop' ← indiv
            nf ← fit'
        else if (fit_i dominates fit') then
            pop' ← Pop_i
            nf ← fit_i
        else
            add indiv and Pop_i into pop'
            add fit' and fit_i into nf
        end
    end
    if (nrow(pop') == nrow(Pop)) then
        Pop ← pop'
    else
        [Pop, fit] ← select_cdistance(pop', nf)
    end
end

12.4 Case Study


12.4.1 Metrics
In this section, we discuss the experimental evaluation of the MODE
algorithm. First, the average number of nondominated points (number of
solutions) found in the Pareto frontier was determined. Second, the average
hypervolume was measured. The hypervolume is the volume of the dominated
portion of the objective space, as presented in Equation 12.16: for each
solution i ∈ Q a hypercube v_i is constructed, and the final hypervolume is
the volume of the union of all v_i. The final number of solutions after all
trials is shown as well.

$$hv = \mathrm{volume}\left(\bigcup_{i=1}^{|Q|} v_i\right) \qquad (12.16)$$
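
For two objectives, the union of hypercubes in Equation 12.16 reduces to a union of rectangles, which can be computed with a simple sweep. The sketch below assumes that both objectives are maximized and that each rectangle spans from a chosen reference point to the solution; the reference point and the example frontier are hypothetical.

```python
def hypervolume_2d(front, reference):
    """Area dominated by front and bounded below by reference (both objectives maximized)."""
    rx, ry = reference
    # Sweep from the largest first objective; each point adds one rectangular strip.
    pts = sorted((p for p in front if p[0] > rx and p[1] > ry), reverse=True)
    area, covered_y = 0.0, ry
    for x, y in pts:
        if y > covered_y:
            area += (x - rx) * (y - covered_y)
            covered_y = y
    return area

frontier = [(1, 5), (2, 4), (3, 3)]
hv = hypervolume_2d(frontier, reference=(0, 0))
print(hv)
```

Adding a dominated point such as (2, 2) leaves the result unchanged, since its rectangle lies entirely inside the union of the others.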

Third, the dominance relationship between Pareto frontiers obtained with
different mutation operators was calculated as depicted in Equation 12.17.
Roughly speaking, C(A, B) is the percentage of solutions in B that are
dominated by at least one solution in A [14], with a ≻ b denoting that a
dominates b. Therefore, C(A, B) = 1 means that every solution in B is
dominated by some solution in A, whereas C(A, B) = 0 means that no solution in
B is dominated by any solution in A. It is important to notice that this
metric is neither complementary nor symmetric, that is, in general
C(A, B) ≠ 1 − C(B, A) and C(A, B) ≠ C(B, A), making it important to compute it
in both directions: C(A, B) and C(B, A).

$$C(A, B) = \frac{|\{b \in B \mid \exists a \in A : a \succ b\}|}{|B|} \qquad (12.17)$$
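
A direct implementation of the coverage metric is short enough to sketch here. It assumes that all objectives are maximized; the two frontiers are made-up examples chosen so that the two directions give different values.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def coverage(A, B):
    """Equation 12.17: fraction of B dominated by at least one point of A."""
    return sum(1 for b in B if any(dominates(a, b) for a in A)) / len(B)

A = [(3, 3), (1, 5)]
B = [(2, 2), (2, 4), (0, 6)]

c_ab = coverage(A, B)
c_ba = coverage(B, A)
print(round(c_ab, 3), round(c_ba, 3))
```

Here only (2, 2) in B is dominated by a point of A, while no point of A is dominated by B, so the two values are neither equal nor complementary.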

Finally, the resulting frontiers can be reviewed by experts for reasonability.


For further details about the use of these metrics, see Reference 15.
In terms of parallelism, we calculated the speedup according to Equation
12.18,

$$\mathrm{speedup} = \frac{T_s}{T_p} \qquad (12.18)$$

where T_s is the execution time using one thread and T_p is the execution time
in parallel using p threads. This kind of metric is called weak speedup and
was suggested in Reference 16, because the code is exactly the same regardless
of the number of threads. Thus it is not necessary to guarantee that the
serial version is the best one.
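
As a small worked example of Equation 12.18, the sketch below computes the speedup and the parallel efficiency (speedup divided by the number of threads) from the mean execution times reported in Table 12.2 for 7 layers and 1000 iterations. The function names are illustrative only.

```python
def speedup(t_serial, t_parallel):
    """Equation 12.18: ratio of one-thread time to p-thread time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, threads):
    """Speedup per thread; 1.0 would be perfectly linear scaling."""
    return speedup(t_serial, t_parallel) / threads

# Mean times for 7 layers and 1000 iterations, taken from Table 12.2.
times = {1: 626.3866667, 2: 323.8888, 4: 181.6333667,
         8: 103.0467333, 16: 67.123, 32: 66.7494}

for p, t in times.items():
    print(p, round(speedup(times[1], t), 2),
          round(100 * efficiency(times[1], t, p), 1))
```

With these times the speedup at 32 threads is about 9.38, and the efficiency falls from about 96.7% at two threads to about 29.3% at 32 threads.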

12.4.2 Results
All tests were conducted using R version 3.2.1 on a 64-bit Red Hat Linux
operating system, on a machine with two Intel Xeon E5-2650 processors running
at 2.0 GHz, each with 8 cores and hyperthreading, and 256 GB of memory. We
considered 250 and 500 generations with a population size of 50. The
following parameters were used for MOPBIL:

• Population size = 50
• Slice size = 0.05

• Number of generations = 250, 500

• Best population = 3

• [Link] = 0.1
• [Link] = 0.075

• [Link] = 0.02

• [Link] = 0.05

In terms of DEMO, the following parameters were used:

• Population size = 50
• Strategy = DE/Arch/Rand/RF/1

Table 12.1 shows the metrics for 250, 500, and 1000 iterations, and Figure
12.5 shows the corresponding Pareto frontiers. As expected, results tend to
improve as the number of iterations increases. Moreover, DEMO is the better of
the two algorithms in terms of both number of solutions and hypervolume.

[Link] Parallel version


Because DEMO tends to present better results, we only included it in the
parallel tests. To parallelize the code we used the snow [17] package for R,
which is a package for simple parallel computing. We parallelized the
iteration loop, which gave better results than parallelizing the population
loop. The mutation operator we used is M5 with 1000 iterations because it
presented better results than the other ones.

TABLE 12.1: Metrics for 7 layers as we increase iterations

Iterations  Algorithm  NS           HV           Time
250         MOPBIL     119.7666667  1.79E+15     162.5251333
                       20.24451111  2.66E+14     2.625853618
250         DEMO       235.1        2.17E+15     164.4676
                       12.38004512  2.81E+14     1.362389321
500         MOPBIL     122.6        1.91E+15     325.8486333
                       16.31500262  2.03E+14     6.554302917
500         DEMO       289.4666667  2.15E+15     318.5643
                       11.34941388  2.80E+14     1.866340697
1000        MOPBIL     136.8        1.95871E+15  671.1731333
                       18.55541561  2.44412E+14  99.93907566
1000        DEMO       337.3666667  2.25E+15     626.3866667
                       12.60673967  2.14E+14     3.61856901

[Figure 12.5: three panels, each plotting Risk ($) against Expected return ($) for MOPBIL and DEMO.]

FIGURE 12.5: Final Pareto frontier for DEMO and MOPBIL using 7 layers,
250, 500, and 1000 iterations.

Figures 12.6 and 12.7 show the time and speedup reached on the Xeon
architecture when varying the thread count. Regardless of the number of
layers, the best efficiency is reached using two threads, with an efficiency
of 96.7% and 98.2% for 7 and 15 layers, respectively. The speedup is almost
linear up to four threads. The best speedup is reached using 32 threads,
namely 9.38 and 8.33 for 7 and 15 layers, respectively; however, 32 threads
correspond to an efficiency of only 29.3% and 26% for 7 and 15 layers.
Moreover, the best speedups are reached with 7 layers, saturating at
approximately 16 threads.

[Figure 12.6: execution time in seconds (s) for 1T, 2T, 4T, 8T, 16T, and 32T, for 7 layers and 15 layers.]

FIGURE 12.6: Time for 7 and 15 layers and 1000 iterations on Xeon.

[Figure 12.7: speedup for 2T, 4T, 8T, 16T, and 32T, for 7 layers and 15 layers.]

FIGURE 12.7: Speedup for 7 and 15 layers and 1000 iterations on Xeon.

Figure 12.8 presents the Pareto frontier obtained by varying the thread
count for 1000 iterations and 7 layers, where we can observe that visually all
Pareto frontiers seem to be the same. Table 12.2 shows the averages of the
metrics. Even though the number of solutions decreases as we increase the
number of threads, the final number of solutions is not affected. Moreover,
the hypervolume is quite stable across thread counts; therefore, the faster
the execution the better. In fact, the small numbers in Table 12.3, which
represent the coverage, mean that the Pareto frontiers are very similar
regardless of the number of threads.

[Figure 12.8: Pareto frontiers for T1, T2, T4, T8, T16, and T32.]

FIGURE 12.8: Pareto frontier varying thread count for 1000 iterations and
7 layers.

TABLE 12.2: Metrics for 7 layers and 1000 iterations

Threads  #NS          Hypervolume  Time         #NS final
1T       337.3666667  2.25E+15     626.3866667  403
         12.60673967  2.14E+14     3.61856901
2T       336.6333333  2.30E+15     323.8888     398
         11.60999371  1.59E+14     1.890259066
4T       329.6333333  2.34E+15     181.6333667  390
         10.49296425  1.05E+14     0.808078286
8T       315.8666667  2.35E+15     103.0467333  406
         12.01359383  1.28E+13     0.340365719
16T      288.9666667  2.35E+15     67.123       403
         9.86628999   3.00E+13     0.711079801
32T      246.9        2.35E+15     66.7494      390
         13.47628824  1.45E+13     2.711422067
TABLE 12.3: Coverage for 7 layers and 1000 iterations

      T1     T2     T4     T8     T16    T32
T1    –      0.028  0.03   0.08   0.16   0.20
T2    0.04   –      0.04   0.09   0.16   0.215
T4    0.03   0.03   –      0.07   0.14   0.20
T8    0.030  0.035  0.028  –      0.14   0.17
T16   0.019  0.015  0.015  0.057  –      0.16
T32   0.017  0.022  0.026  0.02   0.086  –

Figure 12.9 shows the Pareto frontier obtained by varying the thread count
for 1000 iterations and 15 layers, where we can observe that, visually, the
difference between the Pareto frontiers obtained with different thread counts
is not meaningful. On the other hand, Table 12.4 shows how the number of
solutions decreases as we increase the number of threads; nonetheless, the
hypervolume indicates that this decrease is worthwhile up to eight threads,
because the hypervolume is larger. The time saved using eight threads is also
substantial. Table 12.5 reinforces that using eight threads is an attractive
option, because the resulting frontier dominates 52% and 70% of the solutions
from 16 and 32 threads, respectively.

[Figure 12.9: Pareto frontiers for T1, T2, T4, T8, T16, and T32.]

FIGURE 12.9: Pareto frontier varying thread count for 1000 iterations and
15 layers.

TABLE 12.4: Metrics for 15 layers and 1000 iterations

Threads  #NS     Hypervolume  Time     #NS final
1T       290.73  3.39E+15     1426.90  517
         22.97   8.93E+14     6.33
2T       296.03  3.82E+15     726.32   515
         15.83   7.22E+14     2.16
4T       280.37  4.03E+15     387.39   470
         12.70   5.74E+14     1.13
8T       237.40  4.22E+15     232.96   450
         13.63   3.72E+14     1.12
16T      201.00  4.12E+15     182.87   378
         13.43   3.30E+14     9.25
32T      164.00  3.99E+15     171.36   336
         12.71   4.05E+14     16.34

TABLE 12.5: Coverage for 15 layers and 1000 iterations

      T1     T2     T4     T8     T16    T32
T1    –      0.50   0.57   0.68   0.79   0.87
T2    0.40   –      0.32   0.54   0.65   0.77
T4    0.30   0.14   –      0.46   0.61   0.75
T8    0.20   0.09   0.15   –      0.52   0.70
T16   0.13   0.05   0.10   0.16   –      0.55
T32   0.059  0.017  0.04   0.08   0.20   –

References
1. Cai, J., Tan, K. S., Weng, C., and Zhang, Y. Optimal reinsurance under VAR
and CTE risk measures. Insurance: Mathematics and Economics, 43(1):185–196,
2008.

2. Grossi, P. and Kunreuther, H. Catastrophe Modeling: A New Approach to
Managing Risk, volume 25 of Catastrophe Modeling. Springer, US, 2005.

3. Cortes, A. C., Rau-Chaplin, A., Wilson, D., Cook, I., and Gaiser-Porter, J.
Efficient optimization of reinsurance contracts using discretized PBIL. In The
Third International Conference on Data Analytics, pp. 18–24, Porto, Portugal,
2013.

4. Mistry, S., Gaiser-Porter, J., McSharry, P., and Armour, T. Parallel computation
of reinsurance models, 2012. (Unpublished)

5. Mitschele, A., Oesterreicher, I., Schlottmann, F., and Seese, D. Heuristic opti-
mization of reinsurance programs and implications for reinsurance buyers. In
Operations Research Proceedings, pp. 287–292, 2006.

6. Herrera, F., Lozano, M., and Verdegay, J. L. Tackling real-coded genetic algo-
rithms: Operators and tools for behavioural analysis. Artificial Intelligence
Review, 12(4):265–319, 1998.

7. Michalewicz, Z. Genetic Algorithms + Data Structures = Evolution Programs,
3rd edition, Springer-Verlag, Berlin, Heidelberg, New York, 1999.

8. Baluja, S. Population based incremental learning. Technical report, Carnegie
Mellon University, Pittsburgh, Pennsylvania, 1994.

9. Bureerat, S. Improved Population-Based Incremental Learning in Continuous
Spaces, pp. 77–86. Number 96. Springer, Berlin, Heidelberg, 2011.

10. Servais, M.P., de Jager, G., and Greene, J. R. Function optimisation using
multiple-base population based incremental learning. In The Eighth Annual
South African Workshop on Pattern Recognition, Rhodes University, Graham-
stown, South Africa, 1997.

11. Yuan, B. and Gallagher, M. Playing in continuous spaces: Some analysis and
extension of population-based incremental learning. In IEEE Congress on
Evolutionary Computation, IEEE, pp. 443–450, Canberra, Australia, 2003.

12. Storn, R. and Price, K. Differential evolution: A simple and efficient heuristic
for global optimization over continuous spaces. Journal of Global Optimization,
12(4):341–359, 1997.

13. Qin, A. K., Huang, V. L., and Suganthan, P. N. Differential evolution
algorithm with strategy adaptation for global numerical optimization. IEEE
Transactions on Evolutionary Computation, 13(2):398–417, 2009.

14. Zhang, Q. and Li, H. MOEA/D: A multiobjective evolutionary algorithm based
on decomposition. IEEE Transactions on Evolutionary Computation,
11(6):712–731, 2007.

15. Deb, K. Multi-Objective Optimization Using Evolutionary Algorithms. John
Wiley & Sons LTDA, 2001.

16. Alba, E. Parallel evolutionary algorithms can achieve super-linear
performance. Information Processing Letters, 82(1):7–13, 2002.

17. Tierney, L., Rossini, A. J., Li, N., and Sevcikova, H. Snow, [Link]
[Link]/web/packages/snow/[Link], 2017.
Chapter 13
Evaluating Blockchain
Implementation of Clearing
and Settlement at the IATA
Clearing House

Sergey Ivliev, Yulia Mizgireva, and Juan Ivan Martin

CONTENTS
13.1 ICH (IATA Clearing House)’s Current Clearing Procedure and Potential for Improving Its Efficiency
    13.1.1 ICH’s clearance procedure
    13.1.2 Improving ICH’s clearance procedure with blockchain technology
13.2 Data Simulation Description and Model Assumptions
13.3 Mathematical Formulation of the Model
    13.3.1 Discrete-time optimal control model
    13.3.2 The variants of the objective function
        [Link] Reducing the transaction costs
        [Link] Improving the liquidity profile
    13.3.3 How to choose the values of the control variables?
        [Link] The control variable a ∈ [0, 1]
        [Link] The control variables ui ∈ {0; 1}, i = 1, N
    13.3.4 The final mathematical formulation of the model
13.4 Practical Implementation of the Model
    13.4.1 Results representation
        [Link] Before IATA Coin adoption
        [Link] After IATA Coin adoption
    13.4.2 The model results
        [Link] Example of one simulation
        [Link] Increasing the number of simulations
13.5 Summary
References


13.1 ICH (IATA Clearing House)’s Current Clearing Procedure and Potential for Improving Its Efficiency
IATA (The International Air Transport Association) is a trade association
in the airline industry. In 2016, it had more than 260 airline members, which
together represent about 83% of total air traffic.
IATA defines its mission as “to be the force for value creation and inno-
vation driving a safe, secure, and profitable air transport industry that sus-
tainably connects and enriches our world.” In other words, IATA intends to
represent, lead, and serve the airline industry. It aims to simplify processes
by developing the commercial standards for the airline business, to increase
passenger convenience, and to reduce costs by improving efficiency.
The ICH is an organization serving the air transport industry which pro-
vides clearing and settlement services for its members. ICH was founded
in 1947, 2 years after the foundation of IATA. Now it has about 275
airline members (both IATA and non-IATA members) and 75 nonairline
members.
The ICH members work in close collaboration with each other, and typically
they bill one another in both directions. By applying the principles of
set-off/netting, ICH significantly reduces the amount of cash required to
settle these billings.

13.1.1 ICH’s clearance procedure


The ICH is responsible for a clearing and settlement procedure which is held
on a weekly basis and takes 2 weeks to complete. During the week before the
closure day, each member sends its invoices to ICH (an invoice contains
information about the payer, the sum of money, the currency of payment, and
so on). After the closure day, ICH offsets the invoices and, on the advice
day, sends confirmation messages of the final balances to the members. A week
after this, on the call day, the net debtors must pay their balances to ICH,
and 2 days later (on the settlement day) the net creditors receive their
payments. A clearance cycle covers a single week, and a year consists of 48
cycles. The payments are therefore received about 2 weeks after the day on
which the last invoice of the current clearance cycle reached ICH.
A schematic illustration of ICH’s current clearing process is shown in
Figure 13.1. This clearing process has both advantages and disadvantages.
The main advantages are the following:

• IATA guarantees that all invoices will be submitted for payment and this
means that ICH’s clearing and settlement procedure reduces the credit
risk.

• An offset ratio of about 64% (the share of the billing volume that is
netted out and therefore does not have to be settled in cash). For example,
of a total billing volume of $54.3 B, the cash-out is about $19.8 B.

[FIGURE 13.1: Example of an airline member’s cash flow (1 clearing cycle): how it is now. Panels: Invoices (with the clearance cycle marked); Cash flow, ICH; Cash flow, Other. Time axis ticks: t, t + 7, t + 21.]
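As a quick check of the figures quoted in the list of advantages, the offset ratio can be computed as one minus the ratio of cash-out to total billings. This is only a minimal sketch; the function name `offset_ratio` is illustrative and does not come from the source.

```python
def offset_ratio(total_billings: float, cash_out: float) -> float:
    """Share of the billing volume that is netted out and never moves as cash."""
    if total_billings <= 0:
        raise ValueError("total_billings must be positive")
    return 1.0 - cash_out / total_billings


# Figures quoted in the text, in billions of US dollars.
ratio = offset_ratio(total_billings=54.3, cash_out=19.8)
print(f"{ratio:.1%}")  # 63.5%, i.e. about 64%
```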

There are also disadvantages for both sides—for ICH and its members.
They are:

• Slow payment procedure (a huge delay between sending the invoices and
receiving the money)

• High banking fees

• Potentially not an optimal offset ratio

These disadvantages can be overcome with the help of a blockchain technology implementation.

13.1.2 Improving ICH’s clearance procedure with blockchain technology
Blockchain is a novel development in the world of computing science that
has potentially far-reaching implications for many aspects of global commerce.
Over the coming decade, this technology is expected to fundamentally disrupt
established modes of operation in many sectors, including banking, logistics,
and government, to name just a few. Similarly, blockchain has the potential
to be a positive transformative force within the airline industry.
A distributed ledger is essentially an asset database that is shared across
a network of multiple sites, geographies, or institutions. The assets can be
financial, legal, physical, or virtual in nature. Selected participants within the
network have copies of the ledger and any changes to the ledger are reflected
in all copies within minutes, or in some cases, seconds. The security and accu-
racy of the transactions stored in the ledger are ensured cryptographically
through the use of keys and signatures to control who can do what within the
shared ledger. Entries can also be updated by one, some, or all of the par-
ticipants according to rules agreed by the network. Distributed ledger tech-
nologies (DLTs) hold the potential to redefine a number of industries and
various aspects of society. From reducing the transaction costs experienced by
large companies to providing greater possibilities for distributed economies
and business models, they hold great disruptive potential. DLT can bring
multiple benefits to its users, including but not limited to:

• High availability due to structural multiple redundancy

• Cryptographically secured records

• Multiple layers of cryptographically enforced access control


• Configurable trust levels depending on the degree of decentralization

A distributed ledger allows untrusting parties with common interests to
cocreate a permanent, unchangeable, and transparent record of exchange and
processing. This aim aligns well with the need of the airline industry to have
some means of cross-organizational value exchange. The degree to which a
high-throughput low-cost solution can be realized in the context of any given
use case depends primarily on the openness of the system in terms of who
is permitted to submit, validate, and process transactions. A cryptocurrency
solution will help streamline the conversion of funds by allowing fast, secure,
low-cost movement of value. Cryptocurrencies do not require long clearing
times, while their cryptographic mechanisms provide heightened protection
from fraudulent transactions [1].
By creating a new industry cryptocurrency, IATA would give members the
chance to submit invoices immediately and exclude them from the clearing
process. This could help to overcome the disadvantages of the current
clearing procedure and:

• Accelerate cash flows (the payments will become significantly faster with
settlement and confirmation on a distributed ledger)

• Reduce transaction costs (the amount of SWIFT transfers will decrease)

• Increase the offset ratio


• Provide additional liquidity for members inside the current clearance cycle

[FIGURE 13.2: Example of an airline member’s cash flow (1 clearing cycle): how it will be. Panels: Invoices (with the clearance cycle marked); Cash flow, ICH (annotated "Submitting invoices in IATA Coins" and "And exchanging IATA Coins for fiat currency"); Cash flow, Other (annotated "Can provide liquidity when it’s necessary"). Time axis ticks: t, t + 7, t + 21.]

IATA would be the liquidity provider, guaranteeing the retro-convertibility
of IATA Coins into fiat currency. Airlines will be able to present the coins to
IATA to cash out in fiat currencies.
A schematic illustration of ICH’s clearing process with the opportunity
to submit the invoices in IATA Coins is shown in Figure 13.2.

13.2 Data Simulation Description and Model Assumptions
To analyze the potential benefits of implementing blockchain technology at
IATA, a simulation model will be developed. Initially, the model will focus
on only one company. The main assumptions underlying the model are as follows.

1. There are two main agents whose interests are not the same:
• ICH, which tries to increase the offset ratio and to accelerate the
payment procedure
• The company, which has its own view (and its own criteria of opti-
mality) in choosing the dates when invoices should be submitted
2. The company is acting rationally and trying to reduce transaction costs.

3. ICH’s current clearance cycle remains in operation, but all invoices can
be submitted in IATA Coins and consequently excluded from ICH’s clearing
procedure.

4. When the company sends an invoice to ICH, it can mark whether it wants to
receive the money in IATA Coins or in fiat currency (as the company is a
rational agent, let us assume that for all sent invoices, submission in IATA
Coins is desirable, because this can help to decrease transaction costs).

5. The company can choose the times when it wants to exchange IATA
Coins for fiat money (according to the liquidity profile of the company).
6. On the settlement day of the clearance cycle, the payments are effected
not in IATA Coins, but in fiat currency.
7. For calculating the offset ratio, only fiat currency payments will be taken
into account.

8. Due to the absence of historical data, the sample for analysis will be
generated according to the following distributional assumptions:
• The number of invoices per day for a company will be considered as a
random variable with a Poisson distribution (with an arbitrary value
of the parameter)
• The volume of invoices will be considered as a random variable with
a lognormal distribution (with arbitrary values of the parameters)

9. For simplicity we assume:


• Only one fiat currency
• The exchange rate is not taken into account
• The company has enough IATA Coins to cover the invoices and enough
money in fiat currency to cover the after-clearing payments (borrowing
activities are not taken into account)

So, the sample for constructing the model in the simplest case will contain
the following attributes:

• Date

• Value (positive value: paid by the company; negative value: paid to the company)

• Dummy variable indicating whether the observation is an invoice sent to IATA or a payment which does not go through IATA
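The sampling scheme described by the distributional assumptions and the attribute list above can be sketched as follows. This is only an illustration using the Python standard library: the names `Payment`, `poisson`, and `simulate_cycle` are hypothetical, the default parameter values are the ones reported later in Section 13.4.2, and the random choice of payment direction (the sign of the value) is an assumption that the source does not specify.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Payment:
    day: int        # day index within the clearing cycle, 0..6
    value: float    # > 0: paid by the company, < 0: paid to the company
    via_iata: bool  # True: invoice sent to IATA, False: payment outside IATA


def poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_cycle(lam=10.0, mu=8.190488, sigma=2.118524, days=7, seed=0):
    """Simulate one clearing cycle of invoices and outside payments."""
    rng = random.Random(seed)
    sample = []
    for day in range(days):
        for via_iata in (True, False):
            for _ in range(poisson(lam, rng)):
                volume = rng.lognormvariate(mu, sigma)
                # Assumption: the payment direction is a fair coin flip.
                sign = rng.choice((1.0, -1.0))
                sample.append(Payment(day, sign * volume, via_iata))
    return sample


sample = simulate_cycle(seed=1)
print(len(sample), "observations in one simulated cycle")
```

A fixed seed makes the sample reproducible, which is convenient when comparing strategies on the same simulated cycle.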

13.3 Mathematical Formulation of the Model


As noted above, ICH and every company counterparty have their own
goals and their own opinions about the dates when the invoices should be
submitted. From ICH’s point of view, the main criterion is a higher offset
ratio, whereas a company intends to minimize its transaction costs and to
improve its liquidity profile. In fact, each company in the network, if it is a
rational agent, will have its own strategy for choosing the invoices to submit
in IATA Coins.
The situation can be considered from the game theoretic point of view,
wherein the decision of each company will influence the strategy chosen by
every other company. Due to the complexity of this financial system (there are
more than 260 members in ICH), it is almost impossible to formalize the model
as a game model and it seems that there would not be a clear equilibrium.
But the problem can also be considered as involving an optimization model
from the point of view of one agent of the system—either ICH or a typical
company.
In this chapter, the second model form, namely the optimization model,
will be considered. In fact, the only decision makers in this financial system
are the companies, and ICH cannot influence the decisions that will be made.
That is why we will focus on the construction of the model from a typical
company’s point of view.
It also should be mentioned that the model will take as an input not the
whole clearing year, but only the current clearing cycle. This means that in
each clearing cycle a company is trying to find an optimal strategy disregarding
the information from other clearing cycles.
The mathematical formulation of the model is presented in the
following sections.

13.3.1 Discrete-time optimal control model


The notation and main equations will correspond to Figures 13.1 and 13.2.
We will consider the clearance cycle for the time period [t, t + 7].
Let the volume of the invoice paid to (or received from) the company A
be denoted by $\mathrm{Inv}_i$, where i is the invoice number in the period [t, t + 7] and
$i = \overline{1, N}$ (that is, i = 1, ..., N). The fiat currency payment which does not go through IATA will
be denoted by $R_j$, where j is the payment number in the period [t, t + 7] and
$j = \overline{1, M}$.
Then the after-clearing sum of money for the period [t, t + 7] which
the company A should pay to (or receive from) IATA can be calculated
from Equation 13.1:

$$\mathrm{Cl}_{t+21} = \sum_{i=1}^{N} \mathrm{Inv}_i (1 - u_i), \tag{13.1}$$

where
$\mathrm{Cl}_{t+21}$ is the after-clearing sum of money for company A and the time period [t, t + 7];
$\mathrm{Inv}_i$ is the volume of the invoice i of company A;
$u_i$ is the control variable defined by Equation 13.2 as

$$u_i := \begin{cases} 0, & \text{if the company decides to include } \mathrm{Inv}_i \text{ into clearing;} \\ 1, & \text{if the company decides to submit } \mathrm{Inv}_i \text{ in IATA Coins and exclude it from clearing.} \end{cases} \tag{13.2}$$

The company has the opportunity to exchange IATA Coins for fiat currency.
Let the proportion of IATA Coins exchanged for fiat currency be
denoted by a. The sum of money for the period [t, t + 7] which will be paid
to (or received from) the company A in fiat currency can be calculated by
Equation 13.3:

$$P_{t+7} = a \sum_{i=1}^{N} \mathrm{Inv}_i u_i + \sum_{j=1}^{M} R_j, \tag{13.3}$$

where
$P_{t+7}$ is the total volume of payments of company A in the time period [t, t + 7]
which are not included into clearing and are submitted in fiat currency;

$R_j$ is the volume of fiat currency payment j which does not go through IATA.

The control variable a can take values in [0, 1]. If a = 1, the full volume
of the invoices submitted in IATA Coins will be exchanged for fiat
money. If 0 < a < 1, only some IATA Coins will be exchanged, and the
remainder, equal to $(1 - a) \sum_{i=1}^{N} \mathrm{Inv}_i u_i$, will be kept in IATA Coins.
It is easy to see that

$$\mathrm{Cl}_{t+21} + P_{t+7} + (1 - a) \sum_{i=1}^{N} \mathrm{Inv}_i u_i = \sum_{i=1}^{N} \mathrm{Inv}_i + \sum_{j=1}^{M} R_j. \tag{13.4}$$

This means that the actual cash flow (in fiat currency) can be reduced by the
sum $(1 - a) \sum_{i=1}^{N} \mathrm{Inv}_i u_i$, which will be held in IATA Coins.
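Identity (13.4) can be checked numerically on a small example. The sketch below implements Equations 13.1 and 13.3 directly; the function names and the invoice data are hypothetical and chosen only for illustration.

```python
def cleared_amount(inv, u):
    """Equation 13.1: invoices left in the clearing procedure."""
    return sum(x * (1 - ui) for x, ui in zip(inv, u))


def fiat_payments(inv, u, r, a):
    """Equation 13.3: exchanged IATA Coins plus payments outside IATA."""
    return a * sum(x * ui for x, ui in zip(inv, u)) + sum(r)


def coins_held(inv, u, a):
    """Part of the IATA Coin balance that is not exchanged for fiat."""
    return (1 - a) * sum(x * ui for x, ui in zip(inv, u))


# Hypothetical data: four invoices, two payments outside IATA.
inv = [120.0, -45.0, 80.0, -30.0]
r = [15.0, -60.0]
u = [1, 0, 1, 0]   # invoices 1 and 3 are submitted in IATA Coins
a = 0.25           # a quarter of the coins is exchanged for fiat

lhs = cleared_amount(inv, u) + fiat_payments(inv, u, r, a) + coins_held(inv, u, a)
rhs = sum(inv) + sum(r)
print(lhs, rhs)  # both sides equal 80.0
```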
The following section presents the possible variants of the objective functions.

13.3.2 The variants of the objective function


The objective function from the company’s point of view can be formulated
in two ways:

• Reducing the transaction costs


• Improving the liquidity profile

[Link] Reducing the transaction costs


This objective function can be formulated in terms of the decision variables
$a, u_1, \ldots, u_N$ as follows:

$$\frac{\left|\sum_{i=1}^{N} \mathrm{Inv}_i (1 - u_i)\right| + \sum_{j=1}^{M} |R_j| + |\mathrm{Cl}_{t+7}|}{\sum_{i=1}^{N} |\mathrm{Inv}_i| (1 - u_i) + \sum_{j=1}^{M} |R_j| + |\mathrm{Cl}_{t+7}|} \to \min. \tag{13.5}$$

Expression (13.5) is equivalent to maximizing the offset ratio. This rational,
multiextremal function is nonlinear and nonconvex, so in general its global
minimum can be guaranteed only by exhaustive search; in practice, it may be
approximated by a suitable global optimization procedure.
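For a handful of invoices, the exhaustive search mentioned above is feasible, because there are only 2^N binary vectors u to try. The sketch below evaluates objective (13.5) for each of them. The names `cost_ratio` and `exhaustive_search` are illustrative, `cl` stands for the term Cl_{t+7} and is treated as a given constant, and the constraints introduced later in the chapter are deliberately left out. Returning 1.0 for a zero denominator is an assumption made here to keep the function total.

```python
from itertools import product


def cost_ratio(inv, r, cl, u):
    """Objective (13.5): netted cash flow over gross cash flow."""
    fixed = sum(abs(x) for x in r) + abs(cl)
    net = abs(sum(x * (1 - ui) for x, ui in zip(inv, u)))
    gross = sum(abs(x) * (1 - ui) for x, ui in zip(inv, u))
    denominator = gross + fixed
    # Assumption: an empty cash flow is scored as "no netting benefit".
    return (net + fixed) / denominator if denominator > 0 else 1.0


def exhaustive_search(inv, r, cl):
    """Try all 2**N binary vectors u and keep the one with the smallest ratio."""
    return min(product((0, 1), repeat=len(inv)),
               key=lambda u: cost_ratio(inv, r, cl, u))


# Hypothetical data: three invoices and one payment outside IATA.
inv = [100.0, -90.0, 40.0]
r = [10.0]
cl = 0.0

best_u = exhaustive_search(inv, r, cl)
print(best_u, cost_ratio(inv, r, cl, best_u))  # (0, 0, 1) with ratio 0.1
```

In this example, moving the third invoice into IATA Coins leaves two invoices that almost cancel each other in clearing, which is why the ratio drops from 0.25 (all invoices cleared) to 0.1. The cost of the search doubles with every additional invoice, which motivates the heuristic optimizer used later in the chapter.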

[Link] Improving the liquidity profile


This objective function can be formulated as follows:

$$|P_{t+7} + \mathrm{Cl}_{t+7}| = \left| a \sum_{i=1}^{N} \mathrm{Inv}_i u_i + \sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7} \right| \to \min. \tag{13.6}$$

In other words, the company tries to decrease the absolute value of the sum
of all the fiat currency payments over the time period [t, t + 7]. This can
potentially help to smooth the liquidity profile by receiving additional
liquidity from exchanging IATA Coins for fiat currency.
In order to give this biobjective model only a single optimization criterion,
the second objective (13.6) will be implemented as a constraint:

$$\left| a \sum_{i=1}^{N} \mathrm{Inv}_i u_i + \sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7} \right| \le \left| \sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7} \right|. \tag{13.7}$$

13.3.3 How to choose the values of the control variables?


In this section, some recommendations for a company on how to find an
optimal strategy will be presented. Each control variable will be considered
separately.

[Link] The control variable a ∈ [0, 1]


There are four main situations that may influence the decision on how
to choose a. They depend on the aggregated volume of the invoices in
IATA Coins $\left(\sum_{i=1}^{N} \mathrm{Inv}_i\right)$ and the total volume of fiat currency payments
$\left(\sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7}\right)$ in the current clearance cycle.

1. $\sum_{i=1}^{N} \mathrm{Inv}_i > 0$ and $\sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7} > 0$: It does not make sense to
exchange IATA Coins for fiat currency ⇒ a = 0.

2. $\sum_{i=1}^{N} \mathrm{Inv}_i > 0$ and $\sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7} < 0$: It makes sense to exchange
IATA Coins for fiat currency ⇒ 0 < a ≤ 1.
There are two cases which may appear:
   a. $\sum_{i=1}^{N} \mathrm{Inv}_i u_i > -\left(\sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7}\right)$: The sum of IATA Coins is
   greater than the value of the negative fiat currency balance. In this
   case, the company will intend only to cover the value of the negative
   balance: $a \sum_{i=1}^{N} \mathrm{Inv}_i u_i = -\left(\sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7}\right)$ ⇒ 0 < a < 1.
   b. $\sum_{i=1}^{N} \mathrm{Inv}_i u_i \le -\left(\sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7}\right)$: The sum of IATA Coins is
   less than or equal to the value of the negative fiat currency balance. In
   this case, the company will intend to exchange all IATA Coins that
   it possesses ⇒ a = 1.

3. $\sum_{i=1}^{N} \mathrm{Inv}_i < 0$ and $\sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7} > 0$: The balance in IATA Coins
is negative, there is nothing to exchange ⇒ a = 0.
It does not make sense to exchange fiat currency for IATA Coins,
because it seems to be more rational to include the invoices in clearing
and later to pay for them in fiat currency.

4. $\sum_{i=1}^{N} \mathrm{Inv}_i < 0$ and $\sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7} < 0$: The balance in IATA Coins
is negative, there is nothing to exchange ⇒ a = 0.
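The four situations above reduce to a short decision rule. The following sketch is one possible reading of it; the function name `choose_a` and its argument names are hypothetical. It uses a single IATA Coin balance (the sum of Inv_i u_i) for all cases, and it assigns a = 0 when either balance is exactly zero, a boundary case the text does not cover.

```python
def choose_a(coin_balance: float, fiat_balance: float) -> float:
    """Share of IATA Coins to exchange for fiat currency.

    coin_balance: sum of Inv_i * u_i (invoices submitted in IATA Coins)
    fiat_balance: sum of R_j plus Cl_{t+7} (fiat payments in the cycle)
    """
    # Situations 1, 3 and 4: nothing to exchange or no fiat shortfall.
    # Assumption: exact zeros are treated the same way.
    if coin_balance <= 0 or fiat_balance >= 0:
        return 0.0
    # Situation 2: cover the negative fiat balance, capped at all coins.
    return min(1.0, -fiat_balance / coin_balance)


print(choose_a(100.0, 50.0))    # situation 1  -> 0.0
print(choose_a(100.0, -40.0))   # situation 2a -> 0.4
print(choose_a(100.0, -250.0))  # situation 2b -> 1.0
print(choose_a(-100.0, 50.0))   # situation 3  -> 0.0
print(choose_a(-100.0, -50.0))  # situation 4  -> 0.0
```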

[Link] The control variables $u_i \in \{0; 1\}$, $i = \overline{1, N}$

The choice of $u_i$ can be considered from both the company’s and
ICH’s points of view.

1. The company’s opinion: Using IATA Coins, the company will try to
decrease the cash amount which is circulated in the network. This can
be formulated as the following constraints:
   a. Total cash payments after IATA Coin adoption should be not greater
   than before it:

$$\left|\sum_{i=1}^{N} \mathrm{Inv}_i (1 - u_i)\right| + \left|a \sum_{i=1}^{N} \mathrm{Inv}_i u_i\right| + \sum_{j=1}^{M} |R_j| + |\mathrm{Cl}_{t+7}| \le \left|\sum_{i=1}^{N} \mathrm{Inv}_i\right| + \sum_{j=1}^{M} |R_j| + |\mathrm{Cl}_{t+7}| \tag{13.8}$$

   or

$$\left|\sum_{i=1}^{N} \mathrm{Inv}_i (1 - u_i)\right| + \left|a \sum_{i=1}^{N} \mathrm{Inv}_i u_i\right| \le \left|\sum_{i=1}^{N} \mathrm{Inv}_i\right| \tag{13.9}$$

   b. The same inequality for the absolute volumes of invoices:

$$\sum_{i=1}^{N} |\mathrm{Inv}_i| (1 - u_i) + \left|a \sum_{i=1}^{N} \mathrm{Inv}_i u_i\right| \le \sum_{i=1}^{N} |\mathrm{Inv}_i| \tag{13.10}$$

   c. Liquidity profile is improved by using IATA Coins:

$$\left|a \sum_{i=1}^{N} \mathrm{Inv}_i u_i + \sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7}\right| \le \left|\sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7}\right|. \tag{13.11}$$

2. Counterparty’s opinion: It should be mentioned that the decision about
the submission of a particular invoice in IATA Coins also depends on the
opinion of the company’s counterparty. Let us assume that both sides
of the invoice have the same criterion for choosing the invoices that will
be submitted in IATA Coins. Let us also suppose that the invoice
will be submitted in IATA Coins only when both sides have made
this decision.

3. ICH’s criterion: As noted above, ICH’s main goal is to increase the
offset ratio. In fact, ICH is not a decision maker in the model, so it
can only hope that the decisions made by the companies will improve
the offset ratio. Actually, it can be taken for granted (according to the
formulation of the model) that no matter what decisions are made by
the companies (of course, if they are rational), they will not decrease
the initial value of the offset ratio.
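The company-side constraints (13.9) to (13.11) and the agreement rule between counterparties can be expressed as two small checks. This is a sketch under stated assumptions: the function names, the tolerance `tol`, and the example data are all hypothetical, and `cl` again stands for the term Cl_{t+7}, treated as a given constant.

```python
def satisfies_constraints(inv, r, cl, u, a, tol=1e-9):
    """Check constraints (13.9), (13.10) and (13.11) for one company."""
    cleared = sum(x * (1 - ui) for x, ui in zip(inv, u))
    coins = sum(x * ui for x, ui in zip(inv, u))
    gross_cleared = sum(abs(x) * (1 - ui) for x, ui in zip(inv, u))
    gross_total = sum(abs(x) for x in inv)
    fiat_other = sum(r) + cl

    net_ok = abs(cleared) + abs(a * coins) <= abs(sum(inv)) + tol        # (13.9)
    gross_ok = gross_cleared + abs(a * coins) <= gross_total + tol       # (13.10)
    liquidity_ok = abs(a * coins + fiat_other) <= abs(fiat_other) + tol  # (13.11)
    return net_ok and gross_ok and liquidity_ok


def agreed_in_coins(u_payer: int, u_payee: int) -> int:
    """An invoice goes to IATA Coins only if both counterparties choose it."""
    return u_payer & u_payee


inv = [100.0, -90.0, 40.0]
print(satisfies_constraints(inv, r=[-30.0], cl=0.0, u=(0, 0, 1), a=0.5))  # True
print(satisfies_constraints(inv, r=[30.0], cl=0.0, u=(0, 0, 1), a=0.5))   # False
print(agreed_in_coins(1, 0))  # 0: the invoice stays in clearing
```

In the second call the exchanged coins add to an already positive fiat balance, so the liquidity constraint (13.11) fails.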

13.3.4 The final mathematical formulation of the model


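Combining all of the above, the final optimization model becomes

$$
\begin{aligned}
& \frac{\left|\sum_{i=1}^{N} \mathrm{Inv}_i (1 - u_i)\right| + \sum_{j=1}^{M} |R_j| + |\mathrm{Cl}_{t+7}|}{\sum_{i=1}^{N} |\mathrm{Inv}_i| (1 - u_i) + \sum_{j=1}^{M} |R_j| + |\mathrm{Cl}_{t+7}|} \to \min, \\
& \left|\sum_{i=1}^{N} \mathrm{Inv}_i (1 - u_i)\right| + \left|a \sum_{i=1}^{N} \mathrm{Inv}_i u_i\right| \le \left|\sum_{i=1}^{N} \mathrm{Inv}_i\right|, \\
& \sum_{i=1}^{N} |\mathrm{Inv}_i| (1 - u_i) + \left|a \sum_{i=1}^{N} \mathrm{Inv}_i u_i\right| \le \sum_{i=1}^{N} |\mathrm{Inv}_i|, \\
& \left|a \sum_{i=1}^{N} \mathrm{Inv}_i u_i + \sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7}\right| \le \left|\sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7}\right|, \\
& a \in [0, 1], \quad u_i \in \{0; 1\}, \quad i = \overline{1, N}.
\end{aligned}
\tag{13.12}
$$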

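There are three variants of simplification depending on the value of the control variable a:

1. For the situation where a = 0, the model will be

$$
\begin{aligned}
& \frac{\left|\sum_{i=1}^{N} \mathrm{Inv}_i (1 - u_i)\right| + \sum_{j=1}^{M} |R_j| + |\mathrm{Cl}_{t+7}|}{\sum_{i=1}^{N} |\mathrm{Inv}_i| (1 - u_i) + \sum_{j=1}^{M} |R_j| + |\mathrm{Cl}_{t+7}|} \to \min, \\
& \left|\sum_{i=1}^{N} \mathrm{Inv}_i (1 - u_i)\right| \le \left|\sum_{i=1}^{N} \mathrm{Inv}_i\right|, \\
& \sum_{i=1}^{N} |\mathrm{Inv}_i| (1 - u_i) \le \sum_{i=1}^{N} |\mathrm{Inv}_i|, \\
& a \in [0, 1], \quad u_i \in \{0; 1\}, \quad i = \overline{1, N}.
\end{aligned}
\tag{13.13}
$$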

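2. For the situation where 0 < a < 1, the model will be

$$
\begin{aligned}
& \frac{\left|\sum_{i=1}^{N} \mathrm{Inv}_i (1 - u_i)\right| + \sum_{j=1}^{M} |R_j| + |\mathrm{Cl}_{t+7}|}{\sum_{i=1}^{N} |\mathrm{Inv}_i| (1 - u_i) + \sum_{j=1}^{M} |R_j| + |\mathrm{Cl}_{t+7}|} \to \min, \\
& \left|\sum_{i=1}^{N} \mathrm{Inv}_i (1 - u_i)\right| + \left|\sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7}\right| \le \left|\sum_{i=1}^{N} \mathrm{Inv}_i\right|, \\
& \sum_{i=1}^{N} |\mathrm{Inv}_i| (1 - u_i) + \left|\sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7}\right| \le \sum_{i=1}^{N} |\mathrm{Inv}_i|, \\
& a \in [0, 1], \quad u_i \in \{0; 1\}, \quad i = \overline{1, N}.
\end{aligned}
\tag{13.14}
$$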
3. For the situation where a = 1, the model will be

$$
\begin{aligned}
& \frac{\left|\sum_{i=1}^{N} \mathrm{Inv}_i (1 - u_i)\right| + \sum_{j=1}^{M} |R_j| + |\mathrm{Cl}_{t+7}|}{\sum_{i=1}^{N} |\mathrm{Inv}_i| (1 - u_i) + \sum_{j=1}^{M} |R_j| + |\mathrm{Cl}_{t+7}|} \to \min, \\
& \left|\sum_{i=1}^{N} \mathrm{Inv}_i (1 - u_i)\right| + \left|\sum_{i=1}^{N} \mathrm{Inv}_i u_i\right| \le \left|\sum_{i=1}^{N} \mathrm{Inv}_i\right|, \\
& \sum_{i=1}^{N} |\mathrm{Inv}_i| (1 - u_i) + \left|\sum_{i=1}^{N} \mathrm{Inv}_i u_i\right| \le \sum_{i=1}^{N} |\mathrm{Inv}_i|, \\
& \left|\sum_{i=1}^{N} \mathrm{Inv}_i u_i + \sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7}\right| \le \left|\sum_{j=1}^{M} R_j + \mathrm{Cl}_{t+7}\right|, \\
& a \in [0, 1], \quad u_i \in \{0; 1\}, \quad i = \overline{1, N}.
\end{aligned}
\tag{13.15}
$$

In the following section, the optimization model described above will be evaluated experimentally.

13.4 Practical Implementation of the Model


Before presenting the results of the model, some words should be said
about how the results of the clearing procedure can be represented.

13.4.1 Results representation


All the payments which are included in the clearing cycle can be
represented as a payment matrix in the forms that are shown in the following
sections.

[Link] Before IATA Coin adoption


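In the current clearing procedure, all the invoices can be combined into
one matrix:

$$
\begin{array}{c|cccc|c|l}
 & 1 & 2 & \cdots & n & \text{Total } (\Sigma) & \text{Offset} \\ \hline
1 & - & a_{12} & \cdots & a_{1n} & a_{1\cdot} & a_1 = a_{1\cdot} - a_{\cdot 1} \\
2 & a_{21} & - & \cdots & a_{2n} & a_{2\cdot} & a_2 = a_{2\cdot} - a_{\cdot 2} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
n & a_{n1} & a_{n2} & \cdots & - & a_{n\cdot} & a_n = a_{n\cdot} - a_{\cdot n} \\ \hline
\Sigma & a_{\cdot 1} & a_{\cdot 2} & \cdots & a_{\cdot n} & &
\end{array}
$$

where
$a_{ij}$ is the sum of money that airline i needs to pay to airline j in the current
clearing cycle (the sum of invoices from airline i to airline j during the clearing
cycle, where $N_{ij}$ is the number of invoices from airline i to airline j):

$$a_{ij} = \sum_{k=1}^{N_{ij}} \mathrm{Inv}_{ijk}, \tag{13.16}$$

$a_{i\cdot}$ is the sum of money that airline i needs to pay to all of the members of
ICH:

$$a_{i\cdot} = \sum_{j=1}^{n} a_{ij}, \tag{13.17}$$

$a_{\cdot j}$ is the sum of money that airline j should receive from all of the members
of ICH:

$$a_{\cdot j} = \sum_{i=1}^{n} a_{ij}. \tag{13.18}$$

Then the offset ratio will be

$$\mathrm{Offset}_1 = 1 - \frac{\sum_{i=1}^{n} |a_i|}{\sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{N_{ij}} |\mathrm{Inv}_{ijk}|}. \tag{13.19}$$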

[Link] After IATA Coin adoption


After the adoption of IATA Coins, it will be necessary to consider separately
the payments in IATA Coins and in fiat currency and, instead of one
payment matrix, there will be two.

In IATA Coins:

$$
\begin{array}{c|cccc|c}
 & 1 & 2 & \cdots & n & \Sigma \\ \hline
1 & - & b_{12} & \cdots & b_{1n} & b_{1\cdot} \\
2 & b_{21} & - & \cdots & b_{2n} & b_{2\cdot} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
n & b_{n1} & b_{n2} & \cdots & - & b_{n\cdot} \\ \hline
\Sigma & b_{\cdot 1} & b_{\cdot 2} & \cdots & b_{\cdot n} &
\end{array}
$$

In fiat currency:

$$
\begin{array}{c|cccc|c|l}
 & 1 & 2 & \cdots & n & \Sigma & \text{Offset} \\ \hline
1 & - & \tilde{a}_{12} & \cdots & \tilde{a}_{1n} & \tilde{a}_{1\cdot} & \tilde{a}_1 = \tilde{a}_{1\cdot} - \tilde{a}_{\cdot 1} \\
2 & \tilde{a}_{21} & - & \cdots & \tilde{a}_{2n} & \tilde{a}_{2\cdot} & \tilde{a}_2 = \tilde{a}_{2\cdot} - \tilde{a}_{\cdot 2} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
n & \tilde{a}_{n1} & \tilde{a}_{n2} & \cdots & - & \tilde{a}_{n\cdot} & \tilde{a}_n = \tilde{a}_{n\cdot} - \tilde{a}_{\cdot n} \\ \hline
\Sigma & \tilde{a}_{\cdot 1} & \tilde{a}_{\cdot 2} & \cdots & \tilde{a}_{\cdot n} & &
\end{array}
$$

where
$b_{ij}$ is the sum of payments in IATA Coins that airline i will pay to airline j;

$\tilde{a}_{ij}$ is the sum of cash that airline i needs to pay to airline j in the current
clearing cycle (invoices submitted in IATA Coins are excluded).
Generally, the sum of billings both in cash and IATA Coins is equal to the
volume of billings in the case when airlines do not use IATA Coins:

$$b_{ij} + \tilde{a}_{ij} = a_{ij}. \tag{13.20}$$

The offset ratio for fiat payments can be calculated in the same way as in
Equation 13.19. For invoices in IATA Coins, however, the offset (in the same
sense as for fiat currency invoices) is equal to 0,
because the invoices submitted in IATA Coins are not aggregated.
The total offset ratio for the system is

$$\mathrm{Offset}_2 = 1 - \frac{\sum_{i=1}^{n} |\tilde{a}_i| + \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{N_{ij}} |\mathrm{Inv}_{ijk}|}. \tag{13.21}$$

But if the offset ratio is understood as the relation between the volume of billings
and the amount of cash required to settle them, and IATA Coins are
not cash, then in that sense the offset ratio for the second
case will be

$$\mathrm{Offset}_2^{*} = 1 - \frac{\sum_{i=1}^{n} |\tilde{a}_i|}{\sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{N_{ij}} |\mathrm{Inv}_{ijk}|}. \tag{13.22}$$

13.4.2 The model results


The final version of the model was implemented on invoice
samples simulated under the following assumptions:

• The number of invoices per day for a company was treated as a random
variable with a Poisson distribution. The parameter λ was fitted on a
sample of the invoices between two airlines which was provided by IATA.
It is approximately equal to 10, meaning that for each pair of counterparties
there are 10 invoices per day on average. The model was also
implemented on sample invoices with other values of λ.

• Other payments which do not go through IATA are also included in
the sample (for simplicity, there is only one counterparty that is not a
member of ICH, and λ is set to the same value, 10).

• The volumes of invoices (and other payments) were treated as a random
variable with a lognormal distribution. The parameters were fitted on the
sample of the invoices between two airlines which was provided by IATA,
giving μ̂ = 8.190488 and σ̂ = 2.118524.

The solution of the optimization problem was found via a genetic algorithm
[2–4]. This is a rather resource-intensive optimization method, which explains
why, in order to decrease the time needed for calculations, the number of
simulations was only 1000 and the number of companies in the sample was 3.
The model described above is implemented for one clearing cycle and for
each company. This means that every company chooses its optimal strategy
independently of the others. Then, for each invoice in the sample, the
opinions of both counterparties (the payer and the payee) are compared. If both
companies decide that the submission of the invoice in IATA Coins is more
profitable for them, then the invoice will be submitted in IATA Coins;
otherwise it will be included in the clearing procedure.
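To illustrate the kind of search involved, the following is a minimal genetic algorithm over binary vectors u for objective (13.5). It is not the R package cited in the references and not the authors' implementation; the operators (truncation selection, one-point crossover, bit-flip mutation), all parameter values, and the names `cost_ratio` and `genetic_search` are assumptions made for this sketch. The constraints of the model are not enforced here, and `fixed` stands for the sum of |R_j| plus |Cl_{t+7}|, which must be positive.

```python
import random


def cost_ratio(inv, fixed, u):
    """Objective (13.5); `fixed` stands for sum(|R_j|) + |Cl_{t+7}| > 0."""
    net = abs(sum(x * (1 - ui) for x, ui in zip(inv, u)))
    gross = sum(abs(x) * (1 - ui) for x, ui in zip(inv, u))
    return (net + fixed) / (gross + fixed)


def genetic_search(inv, fixed, pop_size=40, generations=60,
                   mutation_rate=0.05, seed=0):
    """Heuristic minimization of cost_ratio over u in {0, 1}^N."""
    n = len(inv)
    if n < 2 or fixed <= 0:
        raise ValueError("need at least two invoices and fixed > 0")
    rng = random.Random(seed)

    def fitness(u):
        return cost_ratio(inv, fixed, u)

    # Random bit strings plus the "no IATA Coins" strategy as a baseline.
    population = [tuple(rng.randint(0, 1) for _ in range(n))
                  for _ in range(pop_size - 1)] + [(0,) * n]
    for _ in range(generations):
        population.sort(key=fitness)
        elite = population[: pop_size // 2]       # truncation selection
        children = []
        while len(elite) + len(children) < pop_size:
            mother, father = rng.sample(elite, 2)
            cut = rng.randint(1, n - 1)           # one-point crossover
            child = mother[:cut] + father[cut:]
            child = tuple(bit ^ 1 if rng.random() < mutation_rate else bit
                          for bit in child)       # bit-flip mutation
            children.append(child)
        population = elite + children
    return min(population, key=fitness)


inv = [100.0, -90.0, 40.0, -35.0, 60.0, -10.0]
best_u = genetic_search(inv, fixed=10.0)
print(best_u, round(cost_ratio(inv, 10.0, best_u), 4))
```

Because the better half of each generation is carried over unchanged, the best ratio found never gets worse from one generation to the next, and it can never be worse than the baseline strategy of clearing every invoice. A constrained version could, for example, add a penalty to the fitness of infeasible vectors.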

[Link] Example of one simulation


The results of the implementation of the model calculated on one simulated
sample are given in the tables below (row i, column j is the amount airline i pays to airline j; Σ denotes totals).

1. Total cash flow before the IATA Coins adoption

|   | 1 | 2 | 3 | Σ | Offset |
|---|---|---|---|---|---|
| 1 |  | 1,269,504 | 1,636,788 | 2,906,292 | 1,149,624 |
| 2 | 779,289 |  | 920,626 | 1,699,915 | −725,878 |
| 3 | 977,379 | 1,156,289 |  | 2,133,668 | −423,746 |
| Σ | 1,756,668 | 2,425,793 | 2,557,414 |  | 2,299,248 |

2. Cash flow in IATA Coins

|   | 1 | 2 | 3 | Σ |
|---|---|---|---|---|
| 1 |  | 226,544 | 13,102 | 239,646 |
| 2 | 0 |  | 0 | 0 |
| 3 | 0 | 0 |  | 0 |
| Σ | 0 | 226,544 | 13,102 | 239,646 |

3. Cash flow in fiat currency

|   | 1 | 2 | 3 | Σ | Offset |
|---|---|---|---|---|---|
| 1 |  | 1,042,960 | 1,623,686 | 2,666,646 | 909,978 |
| 2 | 779,289 |  | 920,626 | 1,699,915 | −499,334 |
| 3 | 977,379 | 1,156,289 |  | 2,133,668 | −410,644 |
| Σ | 1,756,668 | 2,199,249 | 2,544,312 |  | 1,819,956 |

We see that using IATA Coins can potentially help to reduce the cash flow:

• Total cash flow without IATA Coins is equal to 2,299,248

• Total cash flow using IATA Coins is equal to 239,646 + 1,819,956 = 2,059,602
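The totals in this example, and the offset ratios defined in Equations 13.19, 13.21, and 13.22, can be reproduced from the payment matrices. The sketch below uses the aggregated amounts from the tables of this example as the billing volume, which assumes that all individual invoice amounts are non-negative; the helper name `net_positions` is illustrative.

```python
def net_positions(matrix):
    """Row total minus column total for each airline (a_i = a_i. - a_.i)."""
    n = len(matrix)
    return [sum(matrix[i]) - sum(matrix[k][i] for k in range(n))
            for i in range(n)]


# Payment matrices from the example (row pays column, diagonal is zero).
before = [[0, 1_269_504, 1_636_788],
          [779_289, 0, 920_626],
          [977_379, 1_156_289, 0]]
coins = [[0, 226_544, 13_102],
         [0, 0, 0],
         [0, 0, 0]]
# Equation 13.20: fiat payments are what remains after the coin invoices.
fiat = [[a - b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(before, coins)]

billings = sum(map(sum, before))  # total invoice volume
cash_before = sum(abs(x) for x in net_positions(before))
cash_fiat = sum(abs(x) for x in net_positions(fiat))
cash_coins = sum(map(sum, coins))

offset_1 = 1 - cash_before / billings               # Equation 13.19
offset_2 = 1 - (cash_fiat + cash_coins) / billings  # Equation 13.21
offset_2_star = 1 - cash_fiat / billings            # Equation 13.22

print(cash_before, cash_fiat + cash_coins)    # 2299248 2059602
print(f"{offset_1:.1%} -> {offset_2:.1%}")    # 65.9% -> 69.4%
```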
[FIGURE 13.3: Example of model implementation. Histogram of the cash flow (IATA Coins + cash) over the random scenarios; x-axis: Cashflow (1,500,000 to 4,500,000), y-axis: Density. Vertical lines mark the current clearing (2,299,248), the model (2,059,602), and the random average (2,751,709).]

In order to check the quality of the model, its results were
compared with random behavior, in which the invoices to be submitted
in IATA Coins were selected at random.
The results of the comparison are shown in Figure 13.3. For this example,
a large number (namely, 100,000) of scenarios of random invoice selection
were generated, and for each scenario the value of the after-clearing net cash
flow was calculated. These values are shown as a histogram,
on which three values are marked with vertical lines:

• The value of net cash flow without using IATA Coins (as in the current
clearing procedure)

• The modeled value of net cash flow

• The average value of the cash flow in the randomly generated scenarios

We see that the current clearing procedure gives a better result than the
average of the random scenarios, but the model can help to improve this
result.
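The random benchmark can be sketched as follows. The invoice data are hypothetical, and selecting each invoice for IATA Coins independently with probability 0.5 is an assumption, since the text only states that the invoices were selected at random. As in the example above, the cash flow through ICH is taken to be the sum of the absolute net fiat positions plus the gross volume settled in IATA Coins.

```python
import random


def ich_cash_flow(invoices, in_coins, n):
    """Sum of |net fiat positions| plus the gross volume paid in IATA Coins."""
    net = [0.0] * n
    coin_volume = 0.0
    for (payer, payee, amount), coin in zip(invoices, in_coins):
        if coin:
            coin_volume += amount  # coin invoices are settled one by one
        else:
            net[payer] += amount   # fiat invoices are netted per airline
            net[payee] -= amount
    return sum(abs(x) for x in net) + coin_volume


def random_baseline(invoices, n, scenarios=1000, seed=0):
    """Average cash flow when coin invoices are picked at random."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(scenarios):
        # Assumption: each invoice goes to IATA Coins with probability 0.5.
        in_coins = [rng.random() < 0.5 for _ in invoices]
        total += ich_cash_flow(invoices, in_coins, n)
    return total / scenarios


# Hypothetical invoices (payer, payee, amount) among three airlines.
invoices = [(0, 1, 500.0), (1, 0, 300.0), (0, 2, 200.0),
            (2, 1, 400.0), (1, 2, 100.0)]
current = ich_cash_flow(invoices, [False] * len(invoices), n=3)
average_random = random_baseline(invoices, n=3)
print(current, average_random)
```

Comparing `current` with `average_random`, and both with the cash flow of an optimized selection, mirrors the three vertical lines in the histogram.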
Table 13.1 lists the main metrics calculated for these three cases:
the values of cash flow (going through ICH and externally), the sum of the
invoices, and the offset ratio. In this example, the offset ratio has increased
from 65.9% to 69.4%, a relative increase of 5.3% (3.5 percentage points).

[Link] Increasing the number of simulations


In order to check whether the results are sensitive to the number of simu-
lations or not, the model was also run on 1000 simulations. The average values
of all the metrics are given in Table 13.2.

TABLE 13.1: Metrics calculated on one example simulation

| Metric | Current clearing procedure | Model | Random (averaged over 100,000 simulations) |
|---|---|---|---|
| 1. Cash flow | 20,899,899 | 20,660,253 | 21,352,360 |
| 1.1. Total (ICH) | 2,299,248 | 2,059,602 | 2,751,709 |
| 1.1.1. In fiat | 2,299,248 | 1,819,956 | 2,078,041 |
| 1.1.2. In IATA Coins | 0 | 239,646 | 673,668 |
| 1.2. External | 18,600,651 | 18,600,651 | 18,600,651 |
| 2. Sum of invoices | 6,739,875 | 6,739,875 | 6,739,875 |
| 3. Offset (ICH) | 65.9% | 69.4% | 59.2% |
| 4. Offset (including external cash flow) | 17.5% | 18.5% | 15.7% |

TABLE 13.2: Metrics averaged over 1000 example simulations

| Metric | Current clearing procedure | Model | Random (averaged over 100,000 simulations) |
|---|---|---|---|
| 1. Cash flow | 20,720,171 | 20,382,077 | 21,726,810 |
| 1.1. Total (ICH) | 6,270,014 | 5,931,921 | 7,276,653 |
| 1.1.1. In fiat | 6,270,014 | 5,245,561 | 5,909,896 |
| 1.1.2. In IATA Coins | 0 | 686,360 | 1,366,757 |
| 1.2. External | 14,450,157 | 14,450,157 | 14,450,157 |
| 2. Sum of invoices | 13,662,411 | 13,662,411 | 13,662,411 |
| 3. Offset (ICH) | 56.8% | 59.5% | 49.2% |
| 4. Offset (including external cash flow) | 27.3% | 28.6% | 23.6% |

Owing to the small size of the samples drawn from the lognormal invoice volume
distribution, the averaged offset ratio is lower than in the example from the
previous subsection, but the model result is still better than the current
procedure result by 4.8% in relative terms (59.5% versus 56.8%).

13.5 Summary
This chapter covers the issues of the implementation of blockchain tech-
nologies in the clearing and settlement procedure of the ICH. We have devel-
oped an approach to estimate the industry level benefits of adoption of the
blockchain-based industry money (IATA Coin) for clearing and settlement.
The potential system-wide offsetting benefits are mainly driven by the follow-
ing factors:

• Immediate settlement of the value on the blockchain

• Shortened cycle of expense recognition

• More flexible liquidity management enabled by ad hoc retroconvertibility of IATA Coin

• Increased adoption of the IATA Coin as a means of settlement for the supply chain, as well as for institutional and retail customers

To evaluate the magnitude of these benefits, we have developed a model


that allows simulation of the current cycle and of the proposed innovations
and their comparison in terms of offset ratio.
The main result gives practical evidence that using IATA Coins can help
to increase the offset ratio, reduce the transaction costs, and improve the
liquidity profile for a company, assuming that all members of the financial
network act rationally and intend to decrease their transaction costs and to
improve their liquidity profile.

References
1. Committee on Payments and Market Infrastructures (CPMI), 2017. Distributed
Ledger Technology in Payment, Clearing and Settlement: An Analytical Frame-
work. Basel. [Link]

2. Lucasius, C.B. and Kateman, G., 1993. Understanding and using genetic
algorithms—Part 1. Concepts, properties and context. Chemometrics and Intel-
ligent Laboratory Systems, 19:1–33.

3. Lucasius, C.B. and Kateman, G., 1994. Understanding and using genetic
algorithms—Part 2. Representation, configuration and hybridization. Chemo-
metrics and Intelligent Laboratory Systems, 25:99–145.

4. Willighagen, E. and Ballings M., 2015. Package “genalg”: R Based Genetic Algo-
rithm. [Link]
Part III

HPC Systems: Hardware, Software, and Data with Financial Applications
Chapter 14
Supercomputers

Peter Schober

CONTENTS
14.1 Introduction
14.2 History, Current Landscape, and Upcoming Trends
14.3 Programming Languages and Parallelization Interfaces
14.4 Advantages and Disadvantages
14.5 Supercomputers for Financial Applications
    14.5.1 Suitable financial applications
    14.5.2 Performance measurement
    14.5.3 Access to and costs of supercomputers
14.6 Case Study: Pricing Basket Options Using C++ and MPI
    14.6.1 Problem
    14.6.2 Approach
        Decomposition
        Combination technique
        Parallelization
        Implementation
    14.6.3 Results
14.7 Case Study: Optimizing Life Cycle Investment Decisions Using MATLAB
    14.7.1 Problem
    14.7.2 Approach
        Discrete time dynamic programming
        Parallelization
        Implementation
    14.7.3 Results
Further Reading
References


14.1 Introduction
Generally speaking, a supercomputer is one of the fastest computers of its
time. Usually, a computer's performance is measured in Floating Point
Operations Per Second (FLOPS) when running a certain benchmark program.
Since 1993, the TOP500 list [1] has ranked the top 500 supercomputers
worldwide according to their maximal performance achieved when running the
LINPACK benchmark.¹ As of November 2016, the fastest supercomputer in the
world is the Sunway TaihuLight at the National Supercomputing Center in Wuxi,
China. Its maximal performance is 93 Peta FLOPS, that is, 93 × 10^15 FLOPS.
All supercomputers on the current TOP500 list are massively parallel com-
puters designed either in a computer cluster or in a Massively Parallel Process-
ing (MPP) architecture. A computer cluster consists of many interconnected,
independent computers that are set up in a way that they are virtually a sin-
gle system. The computers in the cluster (often called compute nodes) usually
are standalone systems and are configured to work on a parallel job, which is
distributed over the nodes and controlled by specific job scheduling software.
In contrast, an MPP computer consists of many individual compute nodes
that do not qualify as standalone computers and are connected by a custom
network. For example, in IBM's Blue Gene series, the compute nodes comprise
multiple CPUs and a shared RAM, but no hard disk, and are interconnected
via a multidimensional torus network. In MPP architectures, the complexity
of a single CPU (e.g., the number of transistors or the clock frequency) is
often reduced to allow for a higher number of parallel cores. Both
architectures are frequently combined with certain hardware acceleration methods
such as coprocessors or general-purpose GPUs (GPGPUs).
An example of a supercomputer with MPP architecture is the Blue Gene/Q
supercomputer JUQUEEN, currently ranked 19th on the TOP500 list. It
combines compute nodes which have 16 cores with 1.6 GHz clock frequency per
node sharing 16 GB RAM and run the lightweight Compute Node Kernel
(CNK). These are stacked together in 28 racks, share a main memory, and
are interconnected by InfiniBand with a five-dimensional torus structure.² The
supercomputer can be accessed via the SSH protocol on separate log-in nodes
that run Linux. Jobs can be sent to the compute nodes via a central job
manager called LoadLeveler. A simple example of a computer cluster is a
collection of conventional servers, each with multiple cores, a large amount
of RAM, and its own hard drive, running Windows HPC Server. These
servers are put in one rack and are connected via Gigabit Ethernet. Central
software on a dedicated head node, the HPC Cluster Manager, schedules the
jobs to the compute nodes.
¹ Theoretical maximum performance can also be reported and is calculated as follows:
maximum number of floating point operations per clock cycle × maximum frequency of one
core × number of cores of the computer.
² The full specifications can be found in Reference 2.
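
The formula in footnote 1 can be sketched in a few lines of Python. This is only an illustration: the function name and the machine parameters below are hypothetical and are not taken from the TOP500 list.

```python
def theoretical_peak_flops(flops_per_cycle, clock_hz, n_cores):
    """Theoretical peak performance as in footnote 1:
    FLOPs per clock cycle x clock frequency of one core x number of cores."""
    return flops_per_cycle * clock_hz * n_cores


# Hypothetical cluster (illustrative values only): 1,024 cores at 2.5 GHz,
# each able to complete 16 floating point operations per clock cycle.
peak = theoretical_peak_flops(flops_per_cycle=16, clock_hz=2.5e9, n_cores=1024)
print(f"{peak / 1e12:.2f} Tera FLOPS")
```

The measured LINPACK performance of a real machine is lower than this theoretical figure, which is why the TOP500 list reports both values.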
[Figure 14.1: plot of performance (logarithmic axis, 100 MFlop/s to 10 EFlop/s)
against TOP500 lists (1995 to 2020), with the series Sum, #1, and #500.]

FIGURE 14.1: Exponential performance development and projection.
(Adapted from the Top500 homepage. Available: [Link] 2017.)

A reason for the development of massively parallel computers is that
Moore's Law of doubling the number of transistors in an integrated circuit
(and with it the speed of CPUs) roughly every 2 years is considered to be
outdated due to the physical limits in transistor size and difficulties with heat
dissipation. Today's supercomputers cluster a large number of interconnected
CPUs with multiple cores each, and code has to be run in parallel on these
machines. The Sunway TaihuLight, for example, sources its computing power
from 10,649,600 cores with a clock frequency of only 1.45 GHz.³ Supercomputers
at the end of the TOP500 list still have several thousand cores and around
350 Tera FLOPS LINPACK performance. The tendency toward parallelization can
also be observed in personal computer and embedded systems development:
the total number of cores per device now roughly doubles every 2 years. This
way, the performance of (super)computers in terms of FLOPS is still
growing exponentially; see Figure 14.1.
One of the main advantages of working with supercomputers is that
applications can be programmed using a standard programming language (such as
C++) combined with a standardized parallel programming interface (such as
MPI). This leads to easy portability of code as well as independence of the
application from the actual hardware. Essentially, this opens the door to
running an application developed for a multicore desktop computer on
small supercomputers or even on the top supercomputers worldwide. In
addition, if exponential growth of computing power persists, the same
applications as implemented for supercomputers today may be suitable to run
on personal computers in a few years' time. Having said this, the following
sections can be applied to any hardware that is similar to the architectural
approaches of today's TOP500 supercomputers, even though the reader's
“supercomputer” might have far fewer cores and “only” a performance of a
few Tera FLOPS.

³ For comparison: today, common desktop computers comprise multicore CPUs with two,
four, or eight cores with clock frequencies around 3 GHz.

14.2 History, Current Landscape, and Upcoming Trends
In 1837, computer pioneer Charles Babbage proposes the Analytical
Engine, which can be considered the first concept of a mechanical
general-purpose computer functioning in the same manner as modern computers. The
Analytical Engine stays a concept and is never actually built by Babbage.
Around one century later, in 1941, Konrad Zuse unveils the Zuse Z3, the first
universally programmable fully functional digital computer operating at 5.3
Operations Per Second (OPS). Another 38 years later, in 1979, the Cray-1⁴ at
the Los Alamos National Laboratory, New Mexico, USA, leads the first
compiled LINPACK performance list with ca. 250 Mega FLOPS. Jack Dongarra’s
LINPACK software library [4] to solve a dense system of linear equations
soon becomes established as an important benchmark for the measurement
of supercomputer performance. At the 8th Mannheim Supercomputer Semi-
nar in 1993, the first TOP500 list is published. At its top, Thinking Machines
Corporation’s CM-5, also located at the Los Alamos National Laboratory, per-
forms at 59.7 Giga FLOPS when running the LINPACK benchmark on 1024
cores. In 2008, IBM’s Roadrunner is the first supercomputer to break the bar-
rier of one Peta FLOPS, which is more than twice as fast as the number one
system on the TOP500 list at that time. Since 2010, hybrid systems have also
made it to the top of the TOP500 list. First, the Tianhe-1A at the National
Supercomputing Center, Tianjin, China, with 14,336 CPUs and 7168 GPUs
used as accelerators makes it to the top. In 2012, when the 40th TOP500
list is published, the Titan supercomputer at Oak Ridge National Labora-
tory, Tennessee, USA, combines GPUs and traditional CPUs to become the
world’s fastest supercomputer. Titan’s performance of 17.6 Peta FLOPS on
the LINPACK benchmark is facilitated by 18,688 nodes that contain a GPU
and a CPU with 16 cores, which totals 560,640 cores. Lately, China has drawn
level with the USA (171 systems each in the TOP500 list as of November 2016;
Germany is third with 32 systems) and asserts its dominance at the top
of the list. In June 2016, the Chinese Sunway TaihuLight, developed by the
National Research Center of Parallel Computer Engineering & Technology,
with 93 Peta FLOPS and over 10 million cores, took the position as the
world's fastest supercomputer from the likewise Chinese Tianhe-2 (33.9 Peta
FLOPS). Thus the top two Chinese systems alone account for about 127 Peta
FLOPS of the total aggregate of 672 Peta FLOPS on the current TOP500
list.

⁴ The Cray-1 was built by Cray Research, Inc., founded by Seymour Cray, who is broadly
considered the “father of supercomputing” [3].

The focus on speed as the only measure for performance leads to increas-
ing power consumption of the top supercomputers during the last decade.
Half of the costs of a conventional cluster can be attributed to cooling of the
facilities, the other half to the actual computations. In 2006, an awareness of
social responsibility regarding climate change and global warming, but also
cost efficiency within the High Performance Computing community, initiates
the announcement of the Green500 list [5]: the Green500 aims at ranking the
most energy efficient supercomputers worldwide (measured by Mega FLOPS
per Watt). Since then, architecture develops toward hybrid supercomputers
combining CPUs with GPGPU and coprocessor accelerators, which are connected
with low-latency network technologies like InfiniBand. Besides the good
suitability of GPGPUs and coprocessors for data flow tasks, a reason for the
increased use of these technologies is that their power and cost efficiency is
about 10 times higher than for conventional CPUs. In November 2014, the
L-CSC at the GSI Helmholtz Center claims the number one position in the
Green500 list with 5271 Mega FLOPS per Watt. It is developed at the Frank-
furt Institute for Advanced Studies (FIAS) [6], a public-private partnership
of the Goethe University Frankfurt and private sponsors. L-CSC reaches its
top position by massive use of GPU acceleration and an especially efficient
cooling system. Since November 2016, the in-house supercomputer DGX Sat-
urnV of the GPU manufacturer NVIDIA based in Santa Clara, Calif., USA,
is at the top of the Green500 list. It operates at 9462 Mega FLOPS per Watt
and is closely followed by the Swiss Piz Daint (7453 Mega FLOPS per Watt),
which also ranks 8th in the TOP500 list. Recently, with the advent of
energy efficient supercomputers, the European Mont-Blanc project [7] aims at
building power efficient supercomputers based on low-power embedded system
technology as used in smart phones or tablet computers. Within this project
funded by the European Commission, Systems on Chip (SoC) are stacked
together in blades and are interconnected by 10 Gigabit Ethernet. The first
Mont-Blanc prototype has a total of 2160 CPUs and 1080 GPUs and runs
parallel applications from physics and engineering using conventional pro-
gramming languages and parallelization approaches.
Whichever technology is going to prevail in the future, if computing power
doubles every 2 years, supercomputers will be operating at Exa scale FLOPS
by 2020.
14.3 Programming Languages and Parallelization Interfaces
When developing a parallel application, the following two questions have
to be answered:

1. Which programming language should be used? A compiled programming
   language, such as Fortran, C or C++, or an interpreted language, such
   as MATLAB or Python?

2. Which parallelization interface is appropriate? Shared memory
   parallelization, such as OpenMP or POSIX threads (Pthreads), or distributed
   memory parallelization with the Message Passing Interface (MPI)?

Table 14.1 depicts features of programming languages that are widely used by
scientific researchers and practitioners for writing applications for
supercomputers. Of course, this overview makes no claim to completeness, and
various other programming languages exist that can be used on supercomputers in
combination with parallelization. Table 14.2 briefly summarizes the
characteristics of the most important parallelization interfaces. Usually, these
are extended by task-specific (e.g., I/O of data) or hardware-specific
parallelization interfaces (such as CUDA for the use of GPUs).
Fortran and C, followed by C++, are still widely used programming languages
in scientific computing, recently more often combined with Python.

TABLE 14.1: Overview on programming languages

Programming language          Features
Compiled
  C                           General-purpose, imperative programming
                              language that is available on all computer
                              systems. C supports structured programming
                              and has a static type system
  C++                         Object-oriented extension of C
  Fortran (and its variants)  General-purpose, imperative programming
                              language that is especially suited to numeric
                              computation and scientific computing.
                              Fortran 2003 interoperates with C and C++
Interpreted
  MATLAB                      High-level, interpreted, interactive
                              programming language that supports object
                              orientation. Commercial software
  Python                      Interpreted, interactive, object-oriented,
                              extensible programming language.
                              Open source

TABLE 14.2: Overview on parallelization interfaces

Parallelization interface     Features
MPI                           Standardized interface for message passing
                              between computing units on distributed
                              (and shared) memory (super)computers.
                              Parallelization is organized by MPI processes
                              that communicate with each other using
                              interface commands. Parallelizing is the
                              duty of the programmer
Pthreads                      Thread-level parallelism using the POSIX
                              library. Only limited support for data-parallel
                              operations
OpenMP                        Library and set of compiler directives for
                              Fortran, C, and C++. The compiler handles
                              the parallel thread creation, which makes
                              OpenMP far simpler than MPI or
                              Pthreads programming

Parallelization of code is mostly done with MPI, because it can handle the
distributed memory of the systems, but is often combined with shared memory
parallelization using OpenMP, Pthreads, or more recently, GPUs. Table 14.3
provides an overview on programming languages and parallelization interfaces
used by codes that run on all 458,752 cores of the supercomputer JUQUEEN.
Double counting is possible as nearly all codes use combinations of multiple
programming languages and parallelization interfaces. However, all codes use
MPI for parallelization, mostly in combination with Fortran, closely followed
by C and C++. OpenMP is frequently used on node level, because all cores
on one node share the node's memory. Pthreads are rarely used. While
interpreted programming languages such as Python or MATLAB play a minor role in
scientific computing,⁵ practitioners often use interpreted, interactive
programming languages which already include parallel capabilities.

TABLE 14.3: Overview on programming languages and parallelization
interfaces of codes that run on all 458,752 cores of the supercomputer
JUQUEEN (the so-called High-Q Club)

                          C     C++   Fortran   CUDA/OpenCL
MPI                       16    10    18        2
OpenMP                    13    9     14        –
Pthreads                  4     –     3         –
Other (mostly for I/O)    6     3     9         –

Source: High-Q Club – Highest Scaling Codes on JUQUEEN. [Online]. Available:
[Link] [Link], 2017.

As pointed out in the introduction, an application that can be implemented
on your desktop computer is easily portable to a supercomputer, provided that
it sticks with standard programming languages and parallelization interfaces.
Usually, the trade-off in choosing a combination of programming language
and parallelization interface is between programming flexibility and
application performance on the one hand and implementation time and ease of use
on the other. For example, C++ in combination with MPI, that is, a
parallelization approach that can handle distributed memory, can be run on
multicore desktops, smaller supercomputers, and TOP500 supercomputers with
nearly no code changes (although recompiling might be required). On the
downside, programming in C++ requires a deeper understanding of memory
management, compilation, and features of high-level programming languages
(such as object orientation). Using MPI as the parallelization interface is
flexible but comes at the cost of higher programming effort to go from a serial
to a parallel version of the application. Additionally, debugging is a
cumbersome task.⁶
In contrast, using an interpreted interactive programming language such as
MATLAB together with MATLAB's Parallel Computing Toolbox provides a
convenient way to parallelize independent computations. MATLAB wraps the
MPI library and abstracts basic MPI functionalities to high-level, user-friendly
functionalities, such as parallel for-loops or distributed arrays. In addition,
certain MPI functions that enable communication are exposed to the user.
MATLAB can easily be configured to use the multicore architecture of
desktop computers and even of supercomputers (which is a more difficult task).
On the downside, MATLAB requires licenses for as many cores as are used
by the computation (which is costly⁷), and MATLAB has to be installed in
the appropriate version on the supercomputer. Besides that, MATLAB's
user-friendliness comes at the cost of loss of control over the parallelization,
meaning that it is hardly possible for experienced programmers to overcome
shortcomings of the high-level parallel interfaces MATLAB provides. This is
further aggravated by MATLAB being closed source. Python, with its many
easy-to-use frameworks for parallel computing, provides capabilities very
similar to MATLAB's parallel for-loops and beyond. Also, Python supports native
process-level parallelism with its multiprocessing module, and MPI extensions
exist.⁸ However, Python is often only used as a wrapper for parallel,
high-performance modules, which are in turn programmed in a compiled
programming language like Fortran or C.
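
As a minimal sketch of what such a "parallel for-loop" over independent computations can look like with Python's standard multiprocessing module, the following example prices a European call option by Monte Carlo simulation. The option and its parameters, the function names, and the choice of four worker processes are all hypothetical and serve only as illustration; the paths are split into chunks, each chunk uses its own seeded random number generator, and the chunks are mapped to worker processes.

```python
import math
import random
from multiprocessing import Pool


def simulate_chunk(args):
    """Return the discounted sum of call payoffs over one chunk of paths."""
    seed, n_paths = args
    rng = random.Random(seed)  # separately seeded generator per chunk
    # Hypothetical Black-Scholes parameters, for illustration only.
    s0, strike, rate, vol, t = 100.0, 100.0, 0.01, 0.2, 1.0
    drift = (rate - 0.5 * vol ** 2) * t
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * math.exp(drift + vol * math.sqrt(t) * z)
        total += max(s_t - strike, 0.0)
    return math.exp(-rate * t) * total


def price(n_paths, n_chunks, map_fn=map):
    """Average the chunk sums; map_fn decides whether chunks run in parallel."""
    chunk = n_paths // n_chunks
    tasks = [(seed, chunk) for seed in range(n_chunks)]
    return sum(map_fn(simulate_chunk, tasks)) / (chunk * n_chunks)


if __name__ == "__main__":
    # The chunks are independent, so they can be handed to worker processes.
    with Pool(processes=4) as pool:
        print(price(200_000, 8, map_fn=pool.map))
```

Because each chunk depends only on its seed, the serial call `price(n, k)` and the parallel call with `pool.map` return the same estimate, which makes it easy to check the parallel version against the serial one.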

⁵ Python is sometimes used to write wrappers for pre- and postprocessing of the results
or similar tasks.
⁶ There is (commercial) software that facilitates parallel debugging with MPI.
⁷ However, these licenses are part of a special license for computer clusters called
Distributed Computing Server and are cheaper than full MATLAB licenses.
⁸ Most MPI extensions for Python, similarly to MATLAB, provide (a subset of) the MPI
functionalities of the wrapped MPI library.


14.4 Advantages and Disadvantages


As already addressed, an important advantage of the use of supercomput-
ers is the extensive hardware independence and easy portability of applica-
tions. Another advantage is that parallelization using standard programming
languages and parallelization interfaces scales well not only for data flow
problems but also for control flow problems. That is, for example, not only can
floating point operations on vectors of data be parallelized easily, but also
complex function calls including conditional branches. This makes supercom-
puters especially suitable for coarse-grain parallelization, where large parts of
the application can be run independently and, optimally, without communica-
tion between the cores. Of course, this makes proper load balancing of utmost
importance, as the overall runtime of a parallel application is governed by the
runtime of the slowest independently run part.
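
The effect of load balancing on overall runtime can be illustrated with a small sketch. The task runtimes below are hypothetical, and the greedy "longest task first" assignment is just one simple balancing heuristic chosen for illustration, not a method proposed in this chapter. In both cases the overall runtime is the load of the busiest computing unit.

```python
import heapq


def makespan_round_robin(runtimes, n_units):
    """Assign tasks to units in input order; return the load of the busiest unit."""
    loads = [0.0] * n_units
    for i, t in enumerate(runtimes):
        loads[i % n_units] += t
    return max(loads)


def makespan_longest_first(runtimes, n_units):
    """Greedy heuristic: give the longest remaining task to the least loaded unit."""
    loads = [0.0] * n_units
    heapq.heapify(loads)
    for t in sorted(runtimes, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + t)
    return max(loads)


# Hypothetical runtimes (in seconds) of 12 independent tasks on 4 computing units.
tasks = [8, 1, 1, 1, 7, 1, 1, 1, 6, 1, 1, 1]
print(makespan_round_robin(tasks, 4))    # the three long tasks land on one unit
print(makespan_longest_first(tasks, 4))  # long tasks are spread across units
```

With these numbers the total work is 30 seconds, so no assignment on four units can finish in less than 7.5 seconds; the unbalanced assignment is far from that bound because one unit receives all three long tasks.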
At the same time, this is also one of the major disadvantages of
supercomputers: little or no problem-specific optimization can be done. For example,
the use of GPUs for data flow problems can be highly favorable over CPUs. A
contemporary GPU has several thousand cores at a lower frequency (around
500 MHz) for stream processing of data, whereas a high-end CPU has only
eight cores at about 3 GHz.⁹ In addition, GPUs are more power efficient in
terms of FLOPS per Watt. Other hardware that can be tailored to specific
financial applications (such as Field Programmable Gate Arrays) promises an
even better performance. However, there is a high awareness of this
disadvantage within the supercomputing community, and recently more and more
supercomputers are complemented with problem-specific hardware such as
GPGPUs and coprocessors. The application programmer can then decide
which calculations are suitable to be run on a GPGPU (potentially using
GPU-specific code parts) and which bigger building blocks are parallelized
on a CPU level using standard programming languages and parallelization
interfaces. Modern coprocessors constitute a hybrid solution: like GPUs, they
have a large number of cores and are especially suitable for data flow
problems, but the CPU automatically decides which calculation tasks it
offloads to the coprocessor. Besides configuration of the application's runtime
environment, there is no additional problem-specific programming necessary.
In addition, the OpenCL standard further aids in overcoming the obstacles
of programming for hybrid hardware approaches as it intends to provide a
unified framework for programming applications that run on heterogeneous
platforms consisting of CPUs, GPUs, and coprocessors.
Another disadvantage of the use of supercomputers is the limited ease of
access and the associated costs. Computing time of many supercomputers cannot
simply be bought. Also, many of the supercomputers are only accessible for
research purposes.¹⁰ Setting up, operating, and maintaining a proprietary
supercomputer is costly due to high initial costs, high energy costs (especially
for appropriate cooling), and high costs for qualified staff.¹¹

⁹ Usually, GPUs have many more cores and threads with a lower clock frequency than
standard CPUs.

14.5 Supercomputers for Financial Applications


Besides many financial research institutions that lease computing time
on supercomputers from the TOP500 list to efficiently solve finance prob-
lems using parallelization, there are several (undisclosed) institutions from the
finance industry operating supercomputers that are ranked in several TOP500
lists. When considering how costly it is to set up and maintain such super-
computers, there obviously is demand within the financial industry for super-
computing.

14.5.1 Suitable financial applications


There is a vast set of financial applications from various fields: pricing of
financial products, risk management, computation of solvency capital require-
ments, portfolio optimization, and many more. However, the different numer-
ical methods employed can be roughly clustered:
• Monte Carlo simulations

• Numerical integration
• Finite differences and finite element methods

Some of these numerical methods are inherently parallel, such as Monte Carlo
simulations; others are inherently coupled, like finite difference methods.
The problems that are inherently parallel are often called embarrassingly
parallel problems. This term usually applies if one big problem can be disas-
sembled into a set of decoupled problems. Examples include:

• Simulation of multiple paths in a Monte Carlo simulation for determining
  solvency capital requirements of an insurer

• Variance decompositions of partial differential equations (PDEs), which
  can be used for the pricing of derivatives using finite difference methods
  (this is only possible under certain assumptions)

• Portfolio optimization over the life cycle with time discrete dynamic
  programming

¹⁰ Due to the high costs of implementing and maintaining, supercomputers are often run by
government financed nonprofit organizations, huge corporations, or industry consortiums.
¹¹ An exemplary cost calculation is presented in Section 14.5.


The latter two are discussed as case studies in Sections 14.6 and 14.7. For
embarrassingly parallel problems, parallelization is straightforward and, using
the right combination of programming language and parallelization interface,
often a “quick win” when using supercomputers.
In the case of coupled problems, parallelization is usually not so straight-
forward and harder to implement, for example, for finite differences or finite
element methods for general pricing of derivatives.

14.5.2 Performance measurement


To assess how well a parallel application speeds up when using an increas-
ing number of computing units (that is, CPUs, cores, processes or threads,
and so on), scaling efficiency can be measured. Depending on the application
at hand, there are two ways to measure the parallel performance: strong and
weak scaling.
When measuring strong scaling, usually the overall runtime of the
application is the limiting factor. In this case, the problem size (e.g., the number of
paths in a Monte Carlo simulation) stays fixed and the number of computing
units is increased until the whole problem is solved in parallel (e.g., every
path is simulated on a single core). Let the time needed to solve the problem
on the minimum number of available computing units $P_{\min}$ be $t_{\max}$ (in
reference to the maximum runtime of the application). When scaling strongly,
an application is considered to scale linearly if the speedup
$S_{\text{strong}} = t_{\max}/t$ is proportional to the number of computing
units $P$ used: $S_{\text{strong}} \sim P$. The optimally achievable speedup is
thus $S^{\text{opt}}_{\text{strong}} = P/P_{\min}$. A measure of overall
parallel efficiency is then defined by

$$E_{\text{strong}} = \frac{S_{\text{strong}}}{S^{\text{opt}}_{\text{strong}}}. \tag{14.1}$$

For example, if the runtime on one computing unit, $P_{\min} = 1$, is
$t_{\max} = 80$ seconds, then one expects the runtime on $P = 8$ cores to be
$t = 10$ seconds and the application would scale linearly with speedup
$S_{\text{strong}} = 80/10 = 8 = S^{\text{opt}}_{\text{strong}}$.
Parallel efficiency is $E_{\text{strong}} = 1$. Embarrassingly parallel problems often
scale strongly, but, in general, it is harder to achieve high strong scaling
efficiencies at larger numbers of computing units since, depending on the
parallelization approach, the communication overhead for many applications also
increases proportionally to the number of computing units used. In addition,
ideally the problems should be distributed in a way that the variance of the
runtime distribution of the atomic components (e.g., the paths) is minimal.
Otherwise, the longest running computing unit governs the overall solution
time. As already indicated, an example is a Monte Carlo simulation where
the development of a single path is decoupled from the development of all
other paths and hence all paths can be computed in parallel. If a fraction of
the code cannot run in parallel, the realized speedup $S_{\text{strong}}$ cannot achieve
the optimal speedup $S^{\text{opt}}_{\text{strong}}$. Consider the following example: the sequential
part of the code takes 1 second and the parallelizable part of the code takes
4 seconds. With four computing units available, the optimal speedup would
be 4, that is, a runtime of 1.25 seconds. However, since there is a part of the
code with runtime of 1 second that is not parallelizable, the best achievable
runtime is 2 seconds, that is, at best a speedup of 2.5. Amdahl's law
connects the fraction $\alpha$ of the code that can run in parallel with the theoretically
achievable speedup given this fraction by

$$S^{\text{theo}}_{\text{strong}} = \frac{1}{(1 - \alpha) + \dfrac{\alpha}{S^{\text{opt}}_{\text{strong}}}} \tag{14.2}$$

and the theoretically achievable parallel efficiency given Amdahl's law is

$$E^{\text{theo}}_{\text{strong}} = \frac{S_{\text{strong}}}{S^{\text{theo}}_{\text{strong}}}. \tag{14.3}$$
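
The strong scaling measures of Equations 14.1 and 14.2 can be checked against the two numerical examples in the text with a short Python sketch. The function and parameter names are illustrative choices, not part of the original chapter.

```python
def strong_speedup(t_max, t):
    """Realized speedup: runtime on P_min units divided by runtime on P units."""
    return t_max / t


def strong_efficiency(t_max, t, p, p_min=1):
    """Equation 14.1: realized speedup relative to the optimal speedup P / P_min."""
    s_opt = p / p_min
    return strong_speedup(t_max, t) / s_opt


def amdahl_speedup(alpha, p, p_min=1):
    """Equation 14.2: theoretical speedup if only a fraction alpha is parallel."""
    s_opt = p / p_min
    return 1.0 / ((1.0 - alpha) + alpha / s_opt)


# First example from the text: 80 seconds on 1 unit, 10 seconds on 8 units.
print(strong_speedup(80, 10))
print(strong_efficiency(80, 10, p=8))

# Second example: 1 second serial plus 4 seconds parallelizable work,
# so alpha = 4 / 5, run on 4 computing units.
print(round(amdahl_speedup(alpha=4 / 5, p=4), 6))
```

The last call reproduces the speedup of 2.5 from the example, compared with the optimal speedup of 4 that would be reached if the whole code were parallelizable.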

For an application that scales weakly, usually the system or node level
resources like the RAM are the limiting resource. That means, the problem size
per computing unit stays constant and additional computing units are used
to solve an altogether bigger problem. For example, when parallelizing the
solution routine to a PDE using a finite difference grid, the grid is partitioned
into pieces and each computing unit works on a piece of the grid that just
fits in its memory. If the total number of grid points is to be increased, more
computing units have to be used. In the case of weak scaling, optimal scaling
is achieved if the run time stays constant while the problem size is increased
proportionally to the number of computing units, that is, $S^{\text{opt}}_{\text{weak}} = 1$. The
realized speedup is then given by $S_{\text{weak}} = t_{\max}/t$ and parallel efficiency is
measured by

$$E_{\text{weak}} = \frac{S_{\text{weak}}}{S^{\text{opt}}_{\text{weak}}} = S_{\text{weak}}. \tag{14.4}$$

For example, if the runtime on one computing unit, $P_{\min} = 1$, is $t_{\max} = 10$ seconds,
then one expects the runtime of an eight times larger problem on $P = 8$
cores to be $t = 10$ seconds, and the application would scale with speedup
$S_{\text{weak}} = 10/10 = 1 = S^{\text{opt}}_{\text{weak}}$. In contrast to embarrassingly parallel problems,
coupled problems often scale weakly. Most such applications scale well to larger
numbers of computing units because they typically employ nearest-neighbor
communication. That is, in this example the pieces of the grid only need to know
the values at the grid points of their direct neighbors. Thus the communication
overhead per computing unit is constant regardless of the number of computing units used.
For weak scaling, Gustafson’s law describes the theoretically possible speedup
in presence of inherently serial code parts:

$$S^{\text{theo}}_{\text{weak}} = \frac{(1-\alpha)P_{\min}}{P} + \alpha. \tag{14.5}$$

The theoretically achievable parallel efficiency given Gustafson's law is

$$E^{\text{theo}}_{\text{weak}} = \frac{S_{\text{weak}}}{S^{\text{theo}}_{\text{weak}}}. \tag{14.6}$$
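
Equations (14.5) and (14.6) can be evaluated in the same way. The sketch below is illustrative only: the parallel fraction of 0.95 and the measured runtime of 10.8 seconds are hypothetical values, and the function names are not taken from the chapter. Note that for $P_{\min} = 1$, multiplying (14.5) by $P$ gives the familiar scaled speedup $(1-\alpha) + \alpha P$.

```python
def weak_speedup_realized(t_max: float, t: float) -> float:
    """Realized weak-scaling speedup S_weak = t_max / t."""
    return t_max / t


def weak_speedup_theo(alpha: float, p: int, p_min: int = 1) -> float:
    """Theoretical weak-scaling speedup (14.5) for parallel fraction alpha."""
    return (1.0 - alpha) * p_min / p + alpha


# Hypothetical measurement: 10 s on one unit, 10.8 s for an 8x problem on 8 units.
alpha, p = 0.95, 8
s_real = weak_speedup_realized(10.0, 10.8)
s_theo = weak_speedup_theo(alpha, p)
e_theo = s_real / s_theo               # efficiency relative to (14.5), see (14.6)

print(f"S_weak = {s_real:.4f}, S_theo = {s_theo:.5f}, E_theo = {e_theo:.4f}")
```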

The following best practice rules should be taken into consideration when
measuring parallel code performance:

1. Optimally, start the scaling efficiency measurement at one computing
unit. Then always double the number of computing units (and the problem
size, when measuring weak scaling efficiency). Keep in mind that
oftentimes computing resources can only be allocated node-wise or even
rack-wise, meaning that a certain minimum number of computing units
is always allocated and the allocation step size is fixed.12
2. Report average runtimes and standard deviations to account for load
imbalances of the network or other supercomputer-specific architectural
shortcomings. The architectures of large supercomputers usually set
great store by homogeneity of the compute nodes and their interconnects,
which leads to low standard deviations of runtimes when running
the same code multiple times. However, that might not hold true for
simple supercomputers, especially when heterogeneous servers are connected
via standard Ethernet to form a computer cluster.

3. Calculate the speedup for all scaling stages on basis of the average run-
times and compare it to the respective ideal linear speedup in your
setting.

4. Calculate the nonparallel fraction of your code and report the theoreti-
cally achievable speedup given by Amdahl’s or Gustafson’s law, respec-
tively.

5. Use an application configuration that reflects your production run of the
application.
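
Rules 1 to 3 can be turned into a small reporting routine. The following Python sketch assumes that wall-clock runtimes have already been collected for each scaling stage; the runtimes shown are invented for illustration and the function name `scaling_report` is hypothetical.

```python
from statistics import mean, stdev

# Hypothetical wall-clock runtimes in seconds, seven repetitions per stage.
# Keys are the numbers of computing units, doubled from stage to stage.
runtimes = {
    1: [120.4, 119.8, 120.1, 120.6, 119.9, 120.2, 120.0],
    2: [62.1, 61.8, 62.5, 62.0, 61.9, 62.3, 62.2],
    4: [33.0, 32.7, 33.4, 32.9, 33.1, 33.2, 32.8],
}


def scaling_report(samples):
    """Return (P, mean, std, speedup, ideal speedup) for every scaling stage."""
    base_p = min(samples)
    base_mean = mean(samples[base_p])
    rows = []
    for p in sorted(samples):
        mu = mean(samples[p])
        sigma = stdev(samples[p])      # sample standard deviation
        rows.append((p, mu, sigma, base_mean / mu, p / base_p))
    return rows


rows = scaling_report(runtimes)
for p, mu, sigma, s, s_opt in rows:
    print(f"P={p:2d}  mean={mu:7.2f} s  std={sigma:5.2f} s  "
          f"speedup={s:5.2f}  ideal={s_opt:5.2f}")
```

Reporting the speedup relative to the smallest measured stage mirrors the case where resources can only be allocated in fixed steps and a single computing unit is not available as a baseline.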

12 The allocation step size is oftentimes a power of 2.

14.5.3 Access to and costs of supercomputers

Due to the high costs and special knowledge needed for developing, setting
up, and maintaining a supercomputer, access to the largest supercomputers in
the world is mostly open only to governmental or research facilities and granted via
centrally organized application processes. However, there are supercomputers
that are also open for commercial use. For example, the High Performance Computing
Center (HLRS) in Stuttgart, Germany, which operates Hazel Hen (ranked 14
on the current TOP500 list), offers access and a variety of services to the industry
and small- and medium-sized enterprises. In Germany, computing time
at the tier 0 supercomputers Hazel Hen, JUQUEEN (currently ranked 19),
and SuperMUC (rank 36) can be applied for at the Gauss Centre for Supercomputing.
In Europe, the Partnership for Advanced Computing in Europe
(PRACE), an international nonprofit association consisting of 25 member
countries, provides access to supercomputers for large-scale scientific and
engineering applications. Though access to the supercomputing infrastructure
can be granted to the industry, it is foremost for research purposes such as
simulations in the automotive or aerospace sector. The costs that commercial
institutions have to bear are not publicly disclosed. Moreover, computing
time is distributed among the users using a job queuing system. This means
that the job execution start time is conditional on the queue priority, which
is determined by the total runtime of the job (the so-called walltime) and the
availability of the requested computing resources, that is, the number of cores
and potentially special accelerators. Altogether, having access to a shared
supercomputer is not sufficient for common operational tasks in financial
institutions, such as overnight portfolio valuation or pricing of derivatives at the
trading desk.
Proprietary supercomputers, run and maintained by an institution or
shared among different entities in a bigger institution or consortium, are an
often chosen but costly alternative to shared access to the largest supercomputers
of the world. Table 14.4 depicts the costs associated with setting
up and maintaining a small proprietary computer cluster of 128 cores with
an estimated LINPACK performance of 6.4 Tera FLOPS. A medium-sized
supercomputer with roughly 3200 cores has total costs of approximately
700,000 EUR.13

TABLE 14.4: Total costs of a small computer cluster consisting of
8 servers with two 3.30 GHz CPUs and 192 GB RAM each

Total costs of a small computer cluster (128 cores)
Estimated purchase price                  85,000 EUR
Per year
  Staff                                   30,000 EUR
  Power                                     3687 EUR
  Cooling                                   5091 EUR
  Linear depreciation                     21,250 EUR
Total costs over lifetime (4 years)      240,111 EUR
Source: Own calculations and Fujitsu, Fujitsu PRIMERGY Servers Performance Report
PRIMERGY RX200 S8, Fujitsu K.K., Tech. Rep., 2013.
Note: The systems total up to 128 cores and approximately 6.4 Tera FLOPS LINPACK
performance.

13 Based on calculations presented by Bull Atos Technologies at the ISC 2015 [9].
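
The lifetime total in Table 14.4 can be reproduced with simple arithmetic. The sketch below assumes that the total is the purchase price plus four years of staff, power, and cooling costs, and that linear depreciation is the purchase price spread over the lifetime rather than an additional expense; this reading is an assumption, although it matches the published total up to 1 EUR, presumably because the per-year figures are rounded.

```python
# Figures from Table 14.4 (EUR).
purchase_price = 85_000
yearly_running_costs = {"staff": 30_000, "power": 3_687, "cooling": 5_091}
lifetime_years = 4

# Linear depreciation: purchase price spread evenly over the lifetime.
depreciation_per_year = purchase_price / lifetime_years

# Assumed composition of the total: purchase plus running costs over 4 years.
running_per_year = sum(yearly_running_costs.values())
total_lifetime_costs = purchase_price + lifetime_years * running_per_year

print(f"Depreciation per year: {depreciation_per_year:,.0f} EUR")
print(f"Running costs per year: {running_per_year:,} EUR")
print(f"Total over {lifetime_years} years: {total_lifetime_costs:,} EUR")
```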

14.6 Case Study: Pricing Basket Options Using C++ and MPI
This case study demonstrates how to efficiently calculate the price of an
arithmetic average basket put option on the S&P 500 index using the supercomputer
JUQUEEN. Since every constituent of the S&P 500 contributes
one dimension to the PDE representation of the pricing problem, the solution
to the problem suffers from the curse of dimensionality, which refers to
the exponential growth of the resulting linear equation system with
the dimensionality of the PDE. However, supercomputers can
be utilized to calculate solutions despite the curse of dimensionality. The problem,
approach, and results of this case study are discussed in detail in [11], on which
this section is closely based.

14.6.1 Problem
In higher dimensions, the development of $d$ correlated stocks over time can
be described by a vector process $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, where the component
$i$ follows a geometric Brownian motion with drift

$$dx_i(t) = x_i(t)\,\mu_i\,dt + x_i(t)\sum_{j=1}^{d} \sigma_{ij}\,dW_j(t), \quad t \ge 0, \quad x_i(0) = x_{0,i}. \tag{14.7}$$

Here, $\mu$ is a $d$-dimensional mean vector, $\sigma_{ij} = \sigma_i \sigma_j \rho_{ij}$ with correlation $\rho_{ij}$
and standard deviation $\sigma \in \mathbb{R}^d$ is an entry of the $d \times d$ covariance matrix
$\{\sigma\}_{i,j}$, and $W$ is a $d$-dimensional vector of independent Wiener processes.
Using the Feynman–Kac formula, the arbitrage-free price can be expressed as
the solution $u$ to the multidimensional Black–Scholes equation

$$\frac{\partial u}{\partial t} - r\sum_{i=1}^{d} x_i \frac{\partial u}{\partial x_i} - \frac{1}{2}\sum_{i,j=1}^{d} \sigma_i \sigma_j \rho_{ij}\, x_i x_j \frac{\partial^2 u}{\partial x_i \partial x_j} + ru = 0, \tag{14.8}$$

where $r$ denotes the risk-free rate. The subject of this case study is a European
put option with strike $K$ on the S&P 500 index with an arithmetic average
payoff. Hence, the initial condition for the PDE (14.8) is

$$u(x, 0) = \Bigl(K - \sum_{i=1}^{500} \gamma_i x_i\Bigr)_{+} \quad \forall x \in \mathbb{R}^d_{+}. \tag{14.9}$$

The sample basket comprises the S&P 500 index constituents as of
June 2013 with uniformly distributed weights $\gamma_i$ and estimated volatilities
as well as correlations on the basis of daily logarithmic increments of the last
360 days.14
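
To make the payoff in the initial condition (14.9) concrete, the following Python sketch evaluates it for a small hypothetical basket. The four asset values are invented, and the weights are assumed to be equal ($\gamma_i = 1/d$), which is one possible reading of "uniformly distributed weights".

```python
def basket_put_payoff(x, weights, strike):
    """Arithmetic average basket put payoff (K - sum_i gamma_i * x_i)_+."""
    basket_value = sum(g * xi for g, xi in zip(weights, x))
    return max(strike - basket_value, 0.0)


# Hypothetical basket of d = 4 assets with equal weights (an assumption).
x = [1.0, 0.9, 1.1, 0.8]
d = len(x)
weights = [1.0 / d] * d

payoff_in_the_money = basket_put_payoff(x, weights, strike=1.0)
payoff_out_of_the_money = basket_put_payoff([1.2] * d, weights, strike=1.0)
print(payoff_in_the_money, payoff_out_of_the_money)
```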

14.6.2 Approach
Firstly, a decomposition technique is employed to decompose the high-
dimensional PDE into a linear combination of low-dimensional PDEs. For
these low-dimensional PDEs, the curse of dimensionality is then broken by use
of the so-called combination technique on sparse grids, which in turn allows
for straightforward parallelization. This parallelization approach significantly
reduces the overall runtime of the solution routine for the decomposed high-
dimensional PDE.

14.6.2.1 Decomposition
The Taylor-like ANOVA decomposition is a special form of an anchored
ANOVA-type decomposition of the solution $u(x)$, $x \in \mathbb{R}^d$, to a $d$-dimensional
PDE. In addition to the anchor point $a$, all contributing terms also depend
on the first $r \ge 1$ coordinates. With $s \le d - r$, the solution can be written as

$$u(x) = u^{(r)}_{0}(a; x_1, \ldots, x_r) + \sum_{p=1}^{s}\ \sum_{\{i_1,\ldots,i_p\} \subseteq \{r+1,\ldots,d\}} u^{(r)}_{i_1,\ldots,i_p}(a; x_1, \ldots, x_r; x_{i_1}, \ldots, x_{i_p}). \tag{14.10}$$

For $u$ being a function of the eigenvalues $(\lambda_1, \ldots, \lambda_d)$ of the covariance matrix
of the vector process $x$, the first-order approximation, $s = 1$, $r = 1$, is given by

$$\begin{aligned}
u(x) &\approx u^{(1)}_{0}(a; x_1) + \sum_{j=2}^{d}\left[u^{(1)}_{j}(a; x_1; x_j) - u^{(1)}_{0}(a; x_1)\right] \\
&\approx \sum_{j=2}^{d} u^{(1)}_{j}(a; x_1; x_j) - (d-2)\,u^{(1)}_{0}(a; x_1),
\end{aligned} \tag{14.11}$$

where $u^{(1)}_{1,j}(a; x_1; x_j)$ is a solution to the heat equation

$$\frac{\partial u}{\partial t} - \frac{1}{2}\lambda_1 \frac{\partial^2 u}{\partial x_1^2} - \frac{1}{2}\lambda_j \frac{\partial^2 u}{\partial x_j^2} = 0 \quad \forall (t, x_1, x_j) \in [0, T] \times \mathbb{R} \times \mathbb{R} \tag{14.12}$$

with appropriate initial conditions.

14 As there were seven stocks without correlations, actually the solution to a basket option
on the "S&P 493" is calculated.

14.6.2.2 Combination technique
For a given level $n \in \mathbb{N}$, the combination technique combines solutions
$u_l(x)$ to the PDE on multiple, mostly anisotropic full grids with mesh widths
$2^{-l}$ for multi-index $l \in \mathbb{N}^d$ by the combination formula:

$$u_n(x) = \sum_{n \le |l|_1 \le n+d-1} (-1)^{n+d-|l|_1-1} \binom{d-1}{|l|_1 - n}\, u_l(x). \tag{14.13}$$

Here, $|l|_1 = \sum_{i=1}^{d} l_i$.
Computing the approximated solution $u_n(x)$ to the (high-dimensional)
PDE thus boils down to computing each of the $O(dn^{d-1})$
full grid solutions $u_l(x)$, where each of these grids has only $O(2^n)$ grid points.
This solution approach breaks the curse of dimensionality as it only involves
$O(2^n dn^{d-1})$ grid points compared to $O(2^{nd})$ grid points of the full grid solution
with mesh width $2^{-n}$. It can be shown that the error of the approximation
deteriorates only slightly, by a logarithmic factor, if $u$ fulfills certain smoothness
conditions. In addition, the combination technique facilitates straightforward
parallelization as the up to $O(dn^{d-1})$ full grid solutions can be computed
in parallel.
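
The structure of the combination formula (14.13) can be explored by enumerating the contributing multi-indices together with their coefficients. The Python sketch below assumes that every level component satisfies $l_i \ge 1$, which is a convention chosen here and not stated in the text; the function name `combination_coefficients` is likewise illustrative. Each key of the returned dictionary corresponds to one full grid problem that can be solved independently.

```python
from itertools import product
from math import comb


def combination_coefficients(n: int, d: int) -> dict:
    """Map each multi-index l with n <= |l|_1 <= n+d-1 to its coefficient."""
    coefficients = {}
    # Assumption: every level component is at least 1, hence at most n.
    for l in product(range(1, n + 1), repeat=d):
        norm = sum(l)
        if n <= norm <= n + d - 1:
            sign = (-1) ** (n + d - norm - 1)
            coefficients[l] = sign * comb(d - 1, norm - n)
    return coefficients


# Two-dimensional example: grids with |l|_1 = n + 1 enter with +1,
# grids with |l|_1 = n enter with -1.
coeffs_2d = combination_coefficients(n=4, d=2)
for level, c in sorted(coeffs_2d.items()):
    print(level, c)
print("number of full grids:", len(coeffs_2d))
print("sum of coefficients:", sum(coeffs_2d.values()))
```

The coefficients add up to one, which is a quick sanity check that a constant function is reproduced exactly by the combined solution.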

14.6.2.3 Parallelization
The Taylor-like ANOVA decomposition (14.11) generates a set of $d = 493$
independent problems (1 one-dimensional, 492 two-dimensional). To solve the
two-dimensional PDEs in parallel, computing units are grouped and each
group of computing units solves one (or more) of the two-dimensional PDEs
on sparse grids using the combination technique. Depending on the level
$n$ used in the combination technique, let $P_{\max}$ be the maximum number of
computing units for a fully parallel solution. When $P_{\max}$ computing
units are available, every computing unit calculates the solution to exactly one full
grid. If more than $P_{\max}$ computing units were available, the additional
computing units would be idle. If fewer than $P_{\max}$ computing
units are available, it is favorable to split the available computing units into as
many groups as there are independent problems. Figure 14.2 sketches the idea.

14.6.2.4 Implementation
The calculations are run on JUQUEEN. For the combination technique,
the PDEs are solved on the full grids using a Crank–Nicolson time stepping
scheme combined with a finite difference discretization in space. The resulting
set of linear equations is solved using a BiCGSTAB solver to a residual norm
tolerance of $10^{-12}$. The parallelization approach is implemented
in C++ with MPI, and the code is compiled using the
IBM XL compiler.
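
The time stepping can be illustrated on a much smaller problem. The following Python sketch applies a Crank–Nicolson scheme with central finite differences to a one-dimensional analogue of the heat equation (14.12), $\partial u/\partial t = \tfrac{1}{2}\lambda\,\partial^2 u/\partial x^2$ on $[0, 1]$ with zero boundary values. It is not the chapter's implementation: it is written in Python rather than C++ with MPI, it is one-dimensional rather than two-dimensional, and a direct tridiagonal (Thomas) solve is swapped in for the BiCGSTAB solver. All parameter values and the sine initial condition are hypothetical and chosen so that an analytic solution is available for comparison.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] do not affect the result."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def crank_nicolson_heat(u0, lam, dx, dt, steps):
    """Advance u_t = 0.5 * lam * u_xx with zero Dirichlet boundary values."""
    n = len(u0) - 2                 # number of interior grid points
    r = 0.5 * lam * dt / dx ** 2    # diffusion number
    lower = [-0.5 * r] * n
    diag = [1.0 + r] * n
    upper = [-0.5 * r] * n
    u = list(u0)
    for _ in range(steps):
        # Explicit half of the scheme forms the right-hand side.
        rhs = [0.5 * r * u[i - 1] + (1.0 - r) * u[i] + 0.5 * r * u[i + 1]
               for i in range(1, n + 1)]
        # Implicit half: solve the tridiagonal system for the interior values.
        u[1:-1] = solve_tridiagonal(lower, diag, upper, rhs)
    return u


# Toy setup (all values hypothetical): sine initial condition on [0, 1].
lam, T, cells, steps = 0.04, 1.0, 50, 50
dx, dt = 1.0 / cells, T / steps
u0 = [math.sin(math.pi * i * dx) for i in range(cells + 1)]
u0[0] = u0[-1] = 0.0
u = crank_nicolson_heat(u0, lam, dx, dt, steps)

# Analytic solution at x = 0.5 for this initial condition.
exact_mid = math.exp(-0.5 * lam * math.pi ** 2 * T)
print(f"numerical: {u[cells // 2]:.6f}  analytic: {exact_mid:.6f}")
```

In higher dimensions the system matrix is no longer tridiagonal, which is one reason an iterative Krylov solver such as BiCGSTAB is a natural choice there.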

FIGURE 14.2: Schematic overview on solving multiple Taylor-like ANOVA
terms in parallel using sparse grid parallelization. (Adapted from Schober, P.,
Schröder, P., and Wittum, G., In revision at the Journal of Computational
Finance, available at SSRN 2591254, 2015.) [Diagram: the terms of the
approximation (14.11) are distributed over groups of CPUs.]

14.6.3 Results
For an at-the-money basket put option with strike $K = 1$, the price of the
basket option computes to $u(x_0, 0) = 0.0276$ using the Taylor-like ANOVA
decomposition of first order (14.11) and the combination technique (14.13) on level
$n = 10$. Since there is no analytical solution, a benchmark result was calculated
using a Monte Carlo simulation with 100,000 paths ($u(x_0, 0) = 0.0281$). To
test for strong scaling, the number of computing units is repeatedly doubled
until $P_{\max} = 13{,}285$ computing units are reached; because computing units are
allocated in round lots on JUQUEEN, full parallelization is achieved
when utilizing 13,296 computing units. The code runs seven times for every scaling stage and average
runtimes $\mu$ and standard deviations $\sigma$ are reported to account for jitter due to
the job scheduler of the cluster and the cluster's network. Table 14.5 depicts
realized mean runtimes, standard deviations, and speedups. Figure 14.3 plots
the realized speedup against the theoretically achievable speedup. The parallel
efficiency is $E_{\text{strong}} = 63.85\%$ at 13,296 computing units with respect to 1024
computing units. Looking at the realized mean runtime $\mu$, the basket option
was priced within 3 minutes using massive parallelization.

TABLE 14.5: S&P 500 realized mean runtime μ, standard deviation σ,
and speedup S_strong as well as S_strong^opt

P         μ [s]      σ [s]    S_strong    S_strong^opt
1024      1471.71    0.44     1.00        1.00
2048       778.63    1.89     1.89        2.00
4096       433.48    1.92     3.40        4.00
8192       265.88    3.55     5.54        8.00
13,296     170.97    3.77     8.61        12.98
Source: Schober, P., Schröder, P., and Wittum, G., In revision at the Journal of Computational
Finance, available at SSRN 2591254, 2015.

FIGURE 14.3: Strong scaling results for the arithmetic basket option on
the S&P 500. (Adapted from Schober, P., Schröder, P., and Wittum, G., In
revision at the Journal of Computational Finance, available at SSRN 2591254,
2015.) [Plot: realized and theoretical speedup (1 to 16) against the number of
cores (1024 to 16,384).]

14.7 Case Study: Optimizing Life Cycle Investment Decisions Using MATLAB
How an individual can maximize her utility from consumption over her
lifetime is the subject of this case study. To this end, the individual can dynamically
decide (i.e., adapt her decisions as time evolves) how much to consume
now and how to allocate her remaining wealth to certain financial
products to finance future consumption. This section demonstrates that the
resulting optimization problem can be solved efficiently by parallelizing a discrete
time dynamic programming approach for an exemplary dynamic portfolio
choice problem. A small proprietary computer cluster with 97 cores and
standard MATLAB is used to reduce the total runtime of the numerical
solution routine from roughly 12.5 hours on a single core to about 8 minutes.
References 12 and 13 provide all details omitted here for the
sake of brevity.

14.7.1 Problem
Formally, the problem of maximizing the individual's expected utility from
her choices $p_t \in \mathbb{R}^k$ over all time periods $t \in \{0, \ldots, T\}$ needs to be solved:

$$\max_{p_t} \sum_{t=0}^{T} \rho^t\, E_0\left[u(p_t)\right]. \tag{14.14}$$

Here, $u$ denotes a Constant Relative Risk Aversion utility function and
$\rho < 1$ is the time discount factor. At every point in time $t$, her choices
are how much to consume, $C_t$, to invest in stocks $S_t$ yielding a risky return
$r^S_{t+1} \sim LN(\mu_S, \sigma_S^2)$, to invest in bonds $B_t$ yielding a risk-free return $r_B$, and
how much annuity claims $A_t$ to buy, giving a yearly, lifelong income stream of
$1/\ddot{a}_t$ dollars; $p_t = (C_t, S_t, B_t, A_t)^T$. Here, $\ddot{a}_t$ is the age-dependent actuarial
premium charged by an insurance company in exchange for a lifelong payment
of 1 dollar yearly to an individual in time period $t$ starting in $t + 1$,
that is, the annuity factor. Her decision is conditional on her state $s_t \in \mathbb{R}^d$,
where current wealth $W_t$, multiples of average permanent labor income $P_t$,
and amount of yearly annuity payments $L_t$ are tracked; $s_t = (W_t, P_t, L_t)^T$.
The individual stays alive with probability $\pi_t$. Uncertainty in the labor income
is introduced via a permanent risk factor $\nu$ and a transitory risk factor $\vartheta$ that
are uncorrelated and iid lognormally distributed with $E[\nu_{t+1}] = E[\vartheta_{t+1}] = 1$.
That is, $\nu \sim LN(-\sigma_\nu^2/2, \sigma_\nu^2)$ and $\vartheta \sim LN(-\sigma_\vartheta^2/2, \sigma_\vartheta^2)$. The deterministic
component of labor income is given by $G_t$, which reflects the dollar amount
associated with $P_t = 1$. After retirement, income is deterministic and a
fraction $\lambda$ of the last income $P_t G_t$. Since the stochastic risk factors in the
model are all lognormal, the random variable

$$\omega_{t+1} = (r^S_{t+1}, \nu_{t+1}, \vartheta_{t+1})^T \sim LN\left((\mu_S, -\sigma_\nu^2/2, -\sigma_\vartheta^2/2)^T,\ (\sigma_S^2, \sigma_\nu^2, \sigma_\vartheta^2)^T\right)$$

is multidimensionally lognormally
distributed. With these definitions, the state space dynamics from $t$ to $t + 1$,
which is also a random variable $f_{t+1} : \mathbb{R}^k \times \mathbb{R}^d \times \Omega \to \mathbb{R}^d$, can be defined as

$$\begin{aligned}
f^1_{t+1} &:= W_{t+1} = S_t r^S_{t+1} + B_t r_B + \mathbb{1}_{\{t<t_R\}} G_t P_t \nu_{t+1} \vartheta_{t+1} + \mathbb{1}_{\{t \ge t_R\}} \lambda G_t P_t \\
f^2_{t+1} &:= P_{t+1} = \mathbb{1}_{\{t<t_R\}} P_t \nu_{t+1} + \mathbb{1}_{\{t \ge t_R\}} P_t \\
f^3_{t+1} &:= L_{t+1} = L_t + \frac{1}{\ddot{a}_t}
\end{aligned} \tag{14.15}$$

with $\mathbb{1}$ being the indicator function and $t_R$ the retirement age.


Problem (14.14) can be solved using a dynamic programming approach by
setting up the Bellman equation for the value function $j_t$ at discrete points in
time $t = 0, \ldots, T - 1$:

$$j_t(s_t) = \max_{p_t} \left\{ u(p_t) + \rho\, \pi_t\, E_t\left[ j_{t+1}\left( f_{t+1}(p_t, s_t, \omega_{t+1}) \right) \right] \right\} \tag{14.16}$$

$$j_T(s_T) = v(s_T) \tag{14.17}$$

subject to

$$S_t, B_t, A_t \ge 0 \tag{14.18}$$
$$C_t - \varepsilon \ge 0 \tag{14.19}$$
$$W_t - C_t - S_t - B_t - A_t = 0. \tag{14.20}$$
Here, v is a known function of the final state and a minimal consumption of
ε > 0 is assumed.

14.7.2 Approach
A general way to solve problem 14.16–14.20 is discrete time dynamic pro-
gramming stepping backwards in time, which allows for a simple paralleliza-
tion easily implemented in MATLAB.

14.7.2.1 Discrete time dynamic programming
The state space is discretized by a mesh of a finite number of equidistant
grid points $n \in \mathbb{N}^d$ with mesh size $h = ((u_1 - l_1)/n_1, \ldots, (u_d - l_d)/n_d)^T \in \mathbb{R}^d$,
where $l \in \mathbb{R}^d$ is the lower and $u \in \mathbb{R}^d$ the upper boundary for every
dimension of the state space. A grid point $s_t^i$ can then be represented by
a multi-index $i \in I = \{i \in \mathbb{N}^d \mid i \le n \text{ element-wise}\}$ and is addressed by
$s_t^i = (l_1 + i_1 h_1, \ldots, l_d + i_d h_d)^T$. On this grid, let the interpolating function be

$$j_t(s_t) \approx \sum_{i \in I} c_t^i\, \phi^i(s_t), \tag{14.21}$$

where the basis functions $\phi^i$ can be global polynomials or Ansatz functions
with local support (such as B-splines). The same grid is chosen for every time
step $t$ and the coefficients $c_t^i$ are determined in such a way that the approximation
fits the known function values at all grid points at a given time. The
last period's optimal value function (14.17) is given by $j_T(s_T^i) = v(s_T^i)$, $\forall i \in I$.
To determine the optimal solution of the next-to-last period $j_{T-1}(s_t^i)$ and all
earlier periods $t \in \{0, \ldots, T - 2\}$ at all grid points $s_t^i$, $i \in I$, an optimization
routine which maximizes (14.16) over the real-valued vector $p_t$ is used (subject
to the boundary and budget constraints (14.18)–(14.20)). Since within the
optimization routine $f_{t+1}$ does not generally correspond to a grid point, the
value function is evaluated via the interpolating function (14.21), and the
expectation is approximated by a Gaussian quadrature rule with $q = 1, \ldots, m$
nodes $\omega^q$ and weights $w^q$:

$$E_t\left[ j_{t+1}\left( f_{t+1}(p_t, s_t, \omega_{t+1}) \right) \right] \approx \sum_{q=1}^{m} \sum_{i \in I} c_t^i\, \phi^i\left( f_{t+1}(p_t, s_t^i, \omega^q_{t+1}) \right) w^q_{t+1}. \tag{14.22}$$
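
The quadrature step can be illustrated in one dimension. The Python sketch below approximates an expectation over a single lognormal risk factor with a hard-coded three-point Gauss–Hermite rule for the standard normal distribution (nodes $0, \pm\sqrt{3}$ with weights $2/3, 1/6, 1/6$). It is a simplified stand-in for (14.22), not the chapter's code: the model uses three risk factors and the interpolated value function, whereas here the integrand `g` is an arbitrary function and the volatility of 0.1 is a hypothetical value.

```python
import math

# Three-point Gauss-Hermite rule for a standard normal variable z:
# exact for polynomials in z up to degree 5.
NODES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)


def expect_lognormal(g, mu, sigma):
    """Approximate E[g(X)] for X ~ LN(mu, sigma^2) by Gaussian quadrature."""
    return sum(w * g(math.exp(mu + sigma * z)) for z, w in zip(NODES, WEIGHTS))


# A risk factor with mean one, as assumed for the labor income shocks:
# nu ~ LN(-sigma^2 / 2, sigma^2). The value of sigma is hypothetical.
sigma_nu = 0.1
mean_nu = expect_lognormal(lambda x: x, -sigma_nu ** 2 / 2.0, sigma_nu)
print(f"quadrature estimate of E[nu]: {mean_nu:.8f}")
```

For several independent risk factors, a tensor product of such one-dimensional rules gives the nodes $\omega^q$ and weights $w^q$; the number of nodes then grows as the product of the per-dimension node counts.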

14.7.2.2 Parallelization
The dynamic programming approach yields a set of independent optimization
problems in every period $t$: for every grid point in the state space grid,
the maximum of the continuation value of the Bellman equation has to be
computed. As the number of grid points is fixed, the maximum number of
computing units that can be employed equals the number of grid points $\#I$
(i.e., the cardinality of the multi-index set $I$):

$$P_{\max} = \#I. \tag{14.23}$$

Note that this is a fixed-size parallelization problem that should scale strongly
until $P_{\max}$ is reached.
For the parallelization, a master–worker pattern is employed: one of the $P$
computing units is the designated master, the other $P - 1$ computing units
become the workers. The master first generates a (possibly random) execution
order over all multi-indexes $i$ from the index set $I$, and then assigns to
each worker an optimization problem associated with a grid point indexed by $i$.
Every time any computing unit idles, the master dispatches the next problem
from the predefined order to this idle computing unit until all problems
are solved. Figure 14.4 contains a schematic representation of the approach.
If there is no a priori knowledge about the runtime of the single problems
that are dispatched in parallel, the master–worker pattern is a simple
load-balancing solution that in practice achieves very high speedups.
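
The dispatch logic described above can be sketched in a few lines of Python. This is an illustration of the pattern only, not the MATLAB implementation used in the case study: threads from the standard library stand in for the workers of a cluster, and `solve_grid_point` is a hypothetical placeholder for the constrained optimization at one grid point. The executor keeps the submitted problems in a queue and hands the next one to whichever worker becomes idle.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product


def solve_grid_point(index):
    """Hypothetical stand-in for the optimization problem at one grid point."""
    return index, sum(k * k for k in index)


def master(grid_shape, n_workers, seed=0):
    """Dispatch one problem per grid point to a pool of workers."""
    indices = list(product(*(range(n) for n in grid_shape)))
    random.Random(seed).shuffle(indices)   # (possibly random) execution order
    results = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Problems are queued in the predefined order; idle workers pull the next one.
        futures = [pool.submit(solve_grid_point, i) for i in indices]
        for future in as_completed(futures):
            index, value = future.result()
            results[index] = value
    return results


# 4 x 4 x 4 = 64 independent problems handled by three workers.
results = master(grid_shape=(4, 4, 4), n_workers=3)
print(len(results), "problems solved")
```

Because problems are pulled one at a time, a worker that receives a slow problem does not hold up the others, which is the load-balancing property the text refers to.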

14.7.2.3 Implementation
The computations are performed using MATLAB version R2012b on a
proprietary computer cluster consisting of heterogeneous high-performance
servers with CPU speed varying from 2 GHz to 3.3 GHz, number of cores
from 12 to 16 and RAM size from 48 GB to 128 GB per server. All servers are
stacked in one air-cooled rack and are connected via Gigabit Ethernet. The
servers run on Windows HPC Server 2008 with Microsoft HPC Pack as a job
scheduler.
The parallelization is done using MATLAB’s parfor command that imple-
ments a master–worker pattern. For the optimization, the gradient-based con-
strained optimization solver fmincon from the MATLAB Optimization Tool-
box is used. Besides standard MATLAB, the Parallel Computing Toolbox
and MATLAB Distributed Computing Server are required to use the parfor
construct for parallelization across MATLAB sessions running on multiple
servers.

14.7.3 Results
To evaluate the optimal choices obtained from the optimization, a Monte
Carlo simulation over the life cycle from age 20 to age 99 for 100,000 paths is
performed. In every path and every time step, the optimal choices from the
numerical solution are evaluated to determine the path’s evolution. Finally,
labor income, stock and bond investment, consumption, and annuity purchase
profiles are generated by averaging over all paths, see Figure 14.5.

FIGURE 14.4: Figurative state space decomposition and parallel solution
of decomposed, independent optimization problems: (a) state space decomposition
and (b) distribution of independent optimization problems. (Adapted
from Horneff, V., Maurer, R., and Schober, P., Efficient parallel solution methods
for dynamic portfolio choice models in discrete time, available: SSRN
2665031, 2016.) [Diagram: a master controls and communicates with the
computing units P1 to P4; in panel (b) the queued problems 64, 63, ... are
dispatched to the computing units.]

FIGURE 14.5: Average life cycle profile for consumption, stock and bond
holdings, annuity purchases, and labor income in total US dollar values.
[Plot: averages in 1000 US dollar (0 to 70) against age (20 to 100) for
consumption, stock holdings, bond holdings, annuity purchases, and labor income.]

To test for strong scalability, the number of computing units is doubled
repeatedly until the limits of the cluster are reached ($P = 97$). The code is
run seven times for every scaling stage to account for jitter due to the job
scheduler of the cluster and the cluster's network. Table 14.6 and Figure 14.6
depict realized mean runtimes, standard deviations, and realized speedup as
well as optimal speedup. Parallel efficiency is nearly optimal at 96.86%. Note
that the realized speedup exceeds the optimal speedup at 32 and 64 cores.
This is due to the heterogeneity of the cluster: the first 16 allocated cores only
operate at 2 GHz clock speed, whereas all cores allocated afterwards operate at 3 GHz
and more.

TABLE 14.6: Realized mean runtime μ, standard deviation σ,
and speedup S_strong as well as S_strong^opt

P     μ [s]      σ [s]    S_strong    S_strong^opt
8     5659.60    21.55    1.00        1.00
16    3283.04     5.71    1.72        2.00
32    1321.91     7.28    4.28        4.00
64     669.96    10.22    8.45        8.00
97     481.90    11.30    11.74       12.13
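
The speedup columns of Table 14.6 and the reported parallel efficiency follow directly from the mean runtimes. The short Python sketch below recomputes them, taking the run on 8 computing units as the baseline, as the table does.

```python
# Mean runtimes in seconds from Table 14.6, keyed by number of computing units.
mean_runtime = {8: 5659.60, 16: 3283.04, 32: 1321.91, 64: 669.96, 97: 481.90}

base_p = min(mean_runtime)
rows = []
for p in sorted(mean_runtime):
    speedup = mean_runtime[base_p] / mean_runtime[p]   # realized speedup
    optimal = p / base_p                               # optimal (linear) speedup
    rows.append((p, speedup, optimal, speedup / optimal))

for p, speedup, optimal, efficiency in rows:
    print(f"P={p:3d}  S={speedup:6.2f}  S_opt={optimal:6.3f}  E={efficiency:7.2%}")
```

The efficiency above 100% at 32 and 64 computing units reflects the faster cores that are allocated after the first 16, as explained in the text.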

Further Reading
Section 14.1 [1] contains the current TOP500 list, various statistics, his-
torical data, and an extensive FAQ regarding the TOP500. In Reference 4, the
LINPACK benchmark is published and available for download. For detailed
information on the LINPACK benchmark, see also [14]. Supercomputers as a
platform for heavily parallel applications are discussed in Reference 15.

FIGURE 14.6: Strong scaling results for the parallelization approach.
[Plot: realized and theoretical speedup (1 to 16) against the number of cores
(8 to 128).]

Section 14.2 [16] contains an overview on the evolution of supercomputers.
The homepage of the TOP500 [1] has a timeline depicting the milestones
of supercomputing. The homepage of the Green500 [5] provides the
current Green500 list and information as well as historical data regarding the
Green500. Details on the construction of L-CSC can be found in Reference 17.
About the use of SoCs in supercomputing, the homepage of the Mont-Blanc
project [7] has various information about the project itself and applications.
The accompanying article [18] compares the use and performance of SoC with
similar trends in the 1990s.
Section 14.3 [15] gives an alternative overview on parallel programming
languages (and, additionally, parallel compilers). A comprehensive discussion
of MATLAB's parallel programming features, design aspects, and implemented
parallelization paradigms is given in Reference 19. Reference 20 is one
recommended library (of many) for using MPI from Python. The MPI standard
itself is documented in Reference 21. A good compilation of pros and
cons for the usage of OpenMP and MPI can be found in Reference 22.
Section 14.4 Advantages and disadvantages of supercomputers and the
associated programming and parallelization approaches are also discussed in
Reference 15.
Section 14.5 [23] provides a comprehensive overview on quantitative models
for finance and common numerical methods. Various numerical applications
in finance, including dynamic stochastic programming and Quasi Monte
Carlo methods, are covered in Reference 15. Reference 24 introduces
strong and weak scaling measurement. Amdahl's law is based on his conference
contribution [25], and [26] contains Gustafson's response. The cost example
was estimated on the basis of the technical report [10].
Section 14.6 [11] is the basis for the case study. Reference 27 covers the
special class of ANOVA decomposition used. Reference 28 is a comprehensive
reference for sparse grids. Reference 29 proposes the parallelization approach
applied.
Section 14.7 [13] comprises and extends this case study for the same life
cycle model, which is taken from Reference 12.

References
1. Top500 homepage. Available: [Link] 2017.

2. About JUQUEEN—Jülich Blue Gene/Q Supercomputer. [Online]. Available:
[Link] Configuration/Configuration [Link], 2017.

3. IEEE Computer Society, Tribute to Seymour Cray. [Online]. Available:
https://[Link]/web/awards/about-cray, 1996.

4. Petitet, A., Whaley, C., Dongarra, J., and Cleary, A., LINPACK. [Online].
Available: [Link] 2008.

5. Green500 homepage. [Online]. Available: [Link] 2017.

6. Frankfurt Institute for Advanced Studies. [Online]. Available: [Link]
[Link]/en/, 2017.

7. Mont-Blanc project homepage. [Online]. Available: [Link]
[Link]/, 2017.

8. High-Q Club—Highest Scaling Codes on JUQUEEN. [Online]. Available:
[Link] [Link], 2017.

9. Warschko, T. M., Advanced cluster computing, In Presentation at the ISC
Conference, Frankfurt, 2015.

10. Fujitsu, Fujitsu PRIMERGY servers performance report PRIMERGY RX200
S8, Fujitsu K.K., Tech. Rep., 2013.

11. Schober, P., Schröder, P., and Wittum, G., Efficient parallel solution methods
for high-dimensional option pricing problems, In revision at the Journal of
Computational Finance, available at SSRN 2591254, 2015.

12. Horneff, W. J., Maurer, R. H., and Stamos, M. Z., Life-cycle asset allocation
with annuity markets, Journal of Economic Dynamics and Control, vol. 32, no.
11, pp. 3590–3612, 2008.

13. Horneff, V., Maurer, R., and Schober, P., Efficient parallel solution methods
for dynamic portfolio choice models in discrete time, Available: SSRN 2665031,
2016.

14. Dongarra, J., Luszczek, P., and Petitet, A., The LINPACK benchmark: Past,
present and future, University of Tennessee, Tech. Rep., 2002.

15. Vajteršic, M., Zinterhof, P., and Trobec, R., Overview—Parallel Computing:
Numerics, Applications, and Trends. Springer, London, United Kingdom, 2009.

16. Kaufmann, W. J. and Smarr, L. L., Supercomputing and the Transformation of
Science. WH Freeman & Co., New York, NY, 1992.

17. Lindenstruth, V., The L-CSC construction and its applications, In Presentation
at the ISC Conference, Frankfurt, 2015.

18. Rajovic, N., Carpenter, P. M., Gelado, I., Puzovic, N., Ramirez, A., and Valero,
M., Supercomputing with commodity CPUs: Are mobile SoCs ready for HPC?
In High Performance Computing, Networking, Storage and Analysis (SC), 2013
International Conference for IEEE, pp. 1–12, 2013.

19. Sharma, G. and Martin, J., MATLAB®: A language for parallel computing,
International Journal of Parallel Programming, vol. 37, pp. 3–36, 2009.

20. Dalcin, L. MPI for Python. [Online]. Available: [Link] 2017.

21. Message Passing Interface Forum, MPI: A Message passing interface standard,
Version 2.2. [Online]. Available: [Link]
[Link], 2009.

22. Pros and Cons of OpenMP/MPI. [Online]. Available: [Link]
edu/rc/classes/intro mpi/parallel prog [Link], 2011.

23. Wilmott, P., On Quantitative Finance, 2nd ed. John Wiley & Sons, Ltd.,
Chichester, United Kingdom, vol. I–III, 2006.

24. Measuring Parallel Scaling Performance. [Online]. Available: [Link]
[Link]/help/[Link]/Measuring Parallel Scaling Performance, 2017.

25. Amdahl, G. M., Validity of the single processor approach to achieving large scale
computing capabilities, In Proceedings of the April 18–20, 1967, Spring Joint
Computer Conference. ACM, pp. 483–485, 1967.

26. Gustafson, J. L., Reevaluating Amdahl's law, Communications of the ACM,
vol. 31, no. 5, pp. 532–533, 1988.

27. Schröder, P., Gerstner, T., and Wittum, G., Taylor-like ANOVA expansion for
high-dimension PDEs in finance, Working paper, 2013.

28. Bungartz, H.-J. and Griebel, M., Sparse grids, Acta Numerica, vol. 13, pp. 147–
269, 2004.

29. Schröder, P., Mlynczak, P., and Wittum, G., Dimension-Wise Decompositions
and Their Efficient Parallelization. World Scientific, ch. 13, pp. 445–472,
2013, [Online]. Available: [Link]
9789814436434 0013.

Chapter 15
Multiscale Dataflow Computing in Finance

Oskar Mencer, Brian Boucher, Gary Robinson, Jon Gregory, and Georgi Gaydadjiev

CONTENTS
15.1 Introduction
15.2 The Dataflow Paradigm
15.3 Maxeler Dataflow Systems
  15.3.1 DFEs in the cloud
15.4 Dataflow Programming Principles
15.5 Development Process and Design Optimization
15.6 A Case Study: Correlation
15.7 Financial Application Examples
  15.7.1 Maxeler RiskAnalytics platform
  15.7.2 Interest rate swap pricing
  15.7.3 Value-at-risk
  15.7.4 Exotic interest rate pricing
  15.7.5 Credit value adjustment capital
  15.7.6 Standard initial margin model
15.8 Conclusion
References

15.1 Introduction
Computer technology has become an essential driver in almost every area of
the financial industry. Advances in hardware and software, novel numerical
methods, financial models, and algorithms have made computing indispensable
to all financial institutions. High-performance computing (HPC) systems are
widely used to price financial products and to calculate the risk of complex
portfolios quickly. Often, the available computational power determines the
types of problems that can be solved in practice. Being able to handle a more
complex problem, or to obtain results faster than competing organizations,
translates directly into a competitive advantage.

Conventional computer architectures used in many areas of everyday life,
including mobile devices, desktop computers, and HPC systems, generally
follow the basic concepts of general-purpose processing [1]. Such processors
perform calculations by executing a sequence of instructions that carry out
arithmetic, control, or input/output (IO) operations. This model of execution
is generic and hence extremely flexible; however, it is also inherently
sequential. Over many decades, the performance of processors has been
improved by increasing clock rates and by extending the basic processor
architecture with complex structures that deal with issues like control
divergence and main memory access penalties and that recover
instruction-level parallelism. Many microarchitectural innovations such as
caches, branch prediction, out-of-order execution, and Single Instruction
Multiple Data (SIMD) extensions were developed to alleviate the fundamental
drawbacks of the general-purpose processor’s inherently sequential paradigm.
This has led to modern, complex processor microarchitectures in which only a
tiny part of the chip area is dedicated to useful calculations at very high
speeds, while the rest of the device is used for auxiliary functions such as
caching of instructions and data. With the end of the clock frequency
improvements offered by CMOS technology scaling, additional performance can
now be obtained only by exploiting parallelism. Multithreaded implementations
on multiple cores and SIMD extensions are just two examples. However, the
individual cores (or threads) still rely on a fundamentally sequential
computing principle, that is, performing a sequence of instructions. In
addition, legacy applications have to be analyzed, rewritten, and optimized
in order to achieve satisfactory performance levels. Attempting to compute
larger and larger problems by simply scaling out existing processor
technology is no longer practical for many current and future HPC
applications [2]. Even if performance requirements can be met by using a
large number of machines, the cost, area, and power requirements may exceed
practical limits.
These limitations have led to an increased interest in special-purpose
computing, where an algorithm or parts of it are targeted onto a customized
architecture, leading to both increased performance and improved power
efficiency. A special-purpose architecture can be tailored to the unique
requirements of the application, resulting in a combination of increased
performance, reduced power, smaller area, and lower cost as compared to its
general-purpose counterparts. Nowadays, special-purpose units are added to
many processors to perform specific, frequently used, demanding tasks such
as encryption or digital video decoding. However, these special-purpose units
are available only for a limited set of common functions, and those are
“frozen” during the design of the processor. HPC can also benefit from
special-purpose processing, but because of the vast space of possible
applications with different characteristics, prefabricated (and hence fixed)
accelerators are not practical. Instead, a flexible computing substrate is
required that can be customized on demand by the designer according to the
application requirements. Reconfigurable devices, such as Field-Programmable
Gate Arrays (FPGAs), offer such a substrate technology, and significant
speed-ups over conventional computing systems have been reported for a wide
range of applications [3,4]. Public cloud computing has recently started to
provide reconfigurable solutions for the high-end computing market. However,
the downside of this highly capable technology is often a complex, very
low-level programming model that requires highly specialized knowledge of
hardware design.
Maxeler Technologies is pioneering a novel approach to dataflow-oriented
supercomputers. Maxeler computing systems are a type of special-purpose
system that can be customized to the unique requirements of an application.
At the heart of a Maxeler system are one or several Dataflow Engines (DFEs)
that combine a large and powerful reconfigurable device with a significant
amount of DRAM. DFEs are programmed using a simple dataflow model that
enables domain experts to optimize both their algorithms and the underlying
architecture simultaneously, cutting through the typical layered approach of
custom libraries, standard libraries, operating systems, and hardware
organization. This approach has led to orders-of-magnitude higher
performance, lower power consumption, and significantly less data center
space as compared to traditional approaches. A wide range of applications,
from 3D finite-difference partial differential equation solvers [5] to Monte
Carlo simulations, has been successfully accelerated in commercial products
[6]. In addition, speedups can also enable completely new computational
models that were previously not feasible under hard timing constraints. For
example, computing a 24-hour forecast is not practical if the computation
takes 48 hours to complete. If, however, the same computation can be
completed in 2 hours, then running 24-hour forecasts becomes a realistic
scenario.

15.2 The Dataflow Paradigm


Maxeler’s dataflow-oriented computing paradigm differs fundamentally from
conventional processors, which are control-flow centric. This approach is
illustrated in Figure 15.1 and represents an evolution of dataflow and
systolic array concepts [7,8]. A conventional processor operates by reading
and decoding an instruction, loading the required data, performing an
operation on the data, and returning the result to memory. This process is
iterative in nature and requires complex control mechanisms that manage the
operation of the processor. The dataflow execution model is greatly
simplified in comparison. Data is streamed from memory into the chip, where
arithmetic operations are performed by chains of functional units (dataflow
cores) statically interconnected in a structure corresponding to the
implemented functionality. It should be noted that the dataflow structure
performs computations entirely without instructions. Data flows from one
functional unit directly to the next without the need for complex control
mechanisms. Data simply arrives when it is needed, and the final results are
streamed back into memory. Each dataflow core performs only a simple
operation such as an addition or a multiplication; hence, thousands of
operations fit within a given chip area.

[Figure 15.1, panel (a) “Control-flow system”: .c/.java source, compiler, and a CPU (function unit, memory controller) exchanging data and instructions with memory. Panel (b) “Data-flow system”: MaxCompiler and an array of dataflow cores, with data streamed from and back to memory.]

FIGURE 15.1: Conventional control-flow oriented processor (a) as compared
to Maxeler’s DFE (b).

Unlike the control-flow based processor, where operations are computed on
a time-shared functional unit (“computing in time”), the complete dataflow
computation is laid out in space over the entire chip (“computing in space”).
Dependencies in the dataflow are resolved statically at compile time, and
because there is no data-dependent behavior at run time, the entire DFE can
be deeply pipelined. Every stage of the pipeline computes in parallel, and
the dataflow architecture maintains an overall throughput of one result per
clock cycle. A simple analogy is an assembly line in a car factory. The most
efficient way to realize large-scale production (computation) is a
specialized assembly line (pipeline) where parts (data) move from storage
(memory) to a chain of dedicated workstations (custom functional units),
where they are assembled (data is processed) and moved forward in lock step
to produce complete cars (final results) at the end. There are no
instructions, and hence instruction decoding logic is not required. A static
dataflow model also does not require control-flow techniques such as branch
prediction and out-of-order execution, because there are no branches to
predict and no instruction stream to reorder. General-purpose caches are
likewise unnecessary: data is kept on chip with the minimum amount of
buffering memory for intermediate results. By eliminating these extraneous
functions, all of the chip’s resources can be dedicated to performing useful
computations instead of managing the execution.
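
To make the throughput argument concrete, the following sketch counts clock cycles for the two styles. It is a simplified model written in Python purely for illustration; the function names, the assumption that a time-shared unit retires one operation per cycle, and the example workload are assumptions and not properties of any particular DFE.

```python
def time_shared_cycles(n_items: int, ops_per_item: int) -> int:
    """Cycles on one time-shared functional unit ("computing in time").

    Assumes every operation of every data item occupies the unit for one cycle.
    """
    return n_items * ops_per_item


def pipelined_cycles(n_items: int, pipeline_depth: int) -> int:
    """Cycles on a spatial pipeline ("computing in space").

    The pipeline is filled once; after that, one result leaves per cycle.
    """
    if n_items == 0:
        return 0
    return pipeline_depth + (n_items - 1)


# Hypothetical workload: one million data items, four operations each.
n, ops = 1_000_000, 4
print(time_shared_cycles(n, ops))              # 4000000
print(pipelined_cycles(n, pipeline_depth=ops))  # 1000003
```

For long streams the fill time of the pipeline becomes negligible, which is why the model approaches one result per clock cycle regardless of how many operators the data path contains.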
Maxeler realizes this dataflow-oriented computing approach by mapping an
application described in its dataflow model onto a DFE. DFEs are highly
efficient for large-scale computations with a static execution model because
sequential execution and control are eliminated and memory access is reduced
to a simple feed-forward model. However, DFEs are inefficient for
small-scale computations with control-dominated dynamic behavior. The key to
effective dataflow computing systems is therefore the combination of DFEs
with conventional processors. The DFE carries out the compute-intensive part
of the application, while host processors handle control-intensive tasks and
set up and control the computation on the DFE. Depending on the nature of
the problem, one can also adopt a combined processing approach where the
processor computes the less demanding part of the application while the DFE
targets the performance-critical part. This results in a codesign approach
in which we develop a conventional processor application together with a
customized DFE implementation. In the following, we first cover Maxeler
dataflow systems, followed by programming principles and custom
optimizations.

15.3 Maxeler Dataflow Systems


At the center of Maxeler’s dataflow systems sits its proprietary DFE
hardware. In state-of-the-art MAX4 generation systems, DFEs are based on
large Altera Stratix-V FPGAs that provide the reconfigurable computing
substrate for dataflow cores. This device is surrounded by large amounts of
DRAM (currently between 48 and 96 GB), providing the very high memory
capacity needed for large computational problems. This memory is called
Large Memory (LMem). In addition, the FPGA itself provides embedded on-chip
memories, which are spread throughout the chip’s fabric and can be used to
hold local values of the computation. These embedded memories are called
Fast Memory (FMem), as they can be accessed with a total bandwidth of
several terabytes per second. This is an important factor for the efficiency
of DFE computations because data can be kept locally where it is needed and
accessed at very high speed. This is in contrast to CPU caches, where data
is kept on a speculative basis and replicated several times, with only the
smallest (L1) cache providing very high speed to the computational unit.
As previously mentioned, DFEs are not intended to fully replace conven-
tional CPUs; instead, they are integrated into an HPC system consisting of
CPUs, DFEs, storage, and networking. Various system architectures are possi-
ble and the overall balance of components can be tailored to the requirements
of the user application. As a key feature, DFEs always contain large amounts
of DRAM to facilitate the previously described model of dataflow processing.
Various configurations between DFEs and CPUs, as well as between multiple
DFEs, are possible. In the following, we give a brief overview of the current
Maxeler MPC-C, MPC-X, and MPC-N series.
Maxeler MPC-C systems couple x86 server-grade CPUs with up to 4 DFE cards
(see Figure 15.2). Each DFE card contains 48 GB of DRAM as LMem and is
connected to the CPUs via a PCI Express (PCIe) bus. DFE cards are also
directly connected to each other through a dedicated high-speed, low-latency
link called MaxRing. This provides fast communication between neighboring
DFEs, enabling larger applications to scale across multiple DFEs without the
PCIe link becoming a communication bottleneck. The system also includes
storage and networking, and it is integrated into a dense 1U¹
industry-standard rack unit. Such a system supports simple stand-alone
deployment of DFE technology, tightly coupled with high-end CPUs. The
architecture is beneficial for high-performance applications that run on a
fixed number of CPU cores and continuously use one or multiple DFEs.

¹ U is the abbreviation of rack unit, the standard unit of measure for vertical usable data center space (height), defined as 1.75 inches (44.45 mm). A typical full-size rack cage is 42U high and is populated with multiple 1U, 2U, or 4U boxes.

[Figure 15.2: MPC-C node. x86 CPU cores with main memory, connected via PCIe Gen 2 to up to 4 data-flow engines (data-flow cores with 48 GB LMem) that are linked by MaxRing.]

FIGURE 15.2: MPC-C series architecture. A single node contains both x86
CPUs and 4 DFEs connected via PCIe and MaxRing.

The MPC-X series enables a more heterogeneous system architecture that
supports dynamic balancing of CPU and DFE resources. MPC-X series systems
are pure DFE nodes without any CPUs (see Figure 15.3). An MPC-X system
combines 8 DFE cards in a 1U chassis, directly connected through MaxRing.
The DFEs are also linked through Infiniband to a cluster of CPU nodes. The
system can dynamically allocate arbitrary (often large) numbers of DFEs,
providing good scalability and flexibility for applications with changing
behavior, for example, when the computation has several stages that vary in
their characteristics. The ratio of CPUs to DFEs can be matched to the
requirements of the user application.

[Figure 15.3: a standard server (x86 CPU cores, memory) connected through an Infiniband switch fabric to an MPC-X node with 8 data-flow engines (data-flow cores with 48 GB LMem) linked by MaxRing.]

FIGURE 15.3: MPC-X series architecture with eight DFEs inside a node.

[Figure 15.4: MPC-N node. x86 CPU cores with main memory and up to 4 data-flow engines. Labels: PCIe Gen 2, 2 × 10 G links, 24 GB LMem, 72 MB QMem, timing (PTP, PPS, IRIG), Ethernet 3 × 40 G links per DFE card.]

FIGURE 15.4: MPC-N series architecture with four DFEs inside a node.

Maxeler’s MPC-N series systems (see Figure 15.4) are a network-oriented
platform that provides Ethernet connections directly to the DFEs, supporting
ultra-low-latency line-rate processing of multiple 10–40 Gbit data streams.
A single MPC-N node contains up to 4 DFE cards, similar to the MPC-C series
architecture. However, each DFE card also supports up to 3 QSFP+ 40 Gbit
Ethernet connections, where each 40 Gbit port can be split into 4 × 10 Gbit
ports. Providing fast Ethernet connections directly to the DFE enables
network processing with minimal latency. The memory architecture of the DFE
also differs from that of the two previous system architectures: in addition
to 24 GB of DRAM available as LMem, the DFE integrates 72 MB of QDR SRAM
(QMem), supporting very low latency off-chip data access. The system
contains additional 10 Gbit connections to the CPU. MPC-N series systems are
well suited for a range of networking applications including gateways,
aggregators, and endpoints.
Maxeler systems are provided with a compilation and simulation environ-
ment (called MaxCompiler) for application development, and the MaxelerOS
system management environment. MaxelerOS coordinates the use of DFE
resources at run time, and manages the scheduling and data movement within
Maxeler systems. MaxCompiler provides a high-level programming environment
to express dataflow structures, and produces the necessary binaries for both
the CPU and the DFE.

15.3.1 DFEs in the cloud


Public cloud computing has been taking the world by storm, with over 32,000
attendees at the Amazon AWS re:Invent conference in Las Vegas in November
2016. Why is cloud computing becoming so popular? A public cloud appeals to
both the senior management and the IT departments of many organizations.
For senior management, a public cloud creates a second source for IT
services, removing the lock-in with large internal IT teams and expensive
infrastructure. For the new generation of IT professionals, the public cloud
offers the power to manage a massive amount of resources with much less
effort and much less expertise.
In finance, the public cloud had a difficult start. Data confidentiality
issues, the negotiating power of current IT infrastructure owners, and the
service levels (e.g., liability contract clauses) of public cloud providers
made it prohibitive for many years for the senior management of banks to use
public cloud services. This is now changing rapidly, with cloud providers
offering service guarantees and addressing critical business continuity
services, which are desperately needed by many financial institutions. Cloud
providers are also starting to look into high-performance services: 2016 saw
the integration of Altera into Intel, on the back of a large deployment of
Altera products at Microsoft, and more recently Amazon announced the new AWS
EC2 F1 instance as its top-end performance offering [9].
This Amazon EC2 F1 instance (not to be mistaken for a Formula 1 event)
offers raw computer hardware that is fully compatible with the MAX5
generation of DFEs. F1 instances come in single and eight-way configurations
(the latter with eight DFE-compatible units), each unit with 64 GB of DRAM.
Each MAX5 DFE-compatible F1 unit has about 3× the compute performance of a
MAX4 card. The instances come with 480 GB and 960 GB of SSD² storage,
respectively. The instances are therefore similar to the Maxeler single-DFE
MPC-C appliance and the 8-way MPC-X appliance.
With the availability of Amazon F1 instances, DFEs now have a high-volume
second source, and the Amazon Marketplace provides a commercial platform on
which vendors can build applications. For large-scale deployment, combining
an on-premise private DFE cloud with elastic expansion onto Amazon F1
instances further enhances the efficiency and operational cost reduction
provided by DFEs while significantly reducing the operational risks
associated with mixed technology stacks. A bank would thus use DFEs on
premise, sized for the predicted average load, and expand elastically into
the cloud during peak times. It could also build a solid business continuity
solution in which, if a bank data center goes down, the Amazon cloud picks
up seamlessly without service interruption.
On top of all the cloud advantages, the Multiscale Dataflow programming
ecosystem offers a simple software development process together with a debug
and optimization process that has been fine-tuned for over a decade to
enable scientists and domain experts to achieve Maximum Performance
Computing for ultra-large-scale software packages and workloads. Previously,
this was available only to a few top computing experts and typically worked
only for tiny compute examples such as a simple matrix multiply. Moreover,
Multiscale Dataflow programming has proven through its five generations to
be truly independent of the underlying technology, with performance
portability towards the next DFE generation built in by design. These two
properties are both of great value in the context of cloud deployment. With
all of the above, DFEs offer a manageable migration path from standard
solutions to cloud-enhanced ultra-high-end computing.

² Solid-state drive (SSD): a type of mass storage device similar to a hard disk drive (HDD) but with no moving parts and hence faster and more predictable read and write times.

15.4 Dataflow Programming Principles


In the following, we outline the dataflow-oriented programming model that
is used in Maxeler systems. As described in the previous section, Maxeler
dataflow systems are based on a combination of DFEs and CPUs. The basic
logical architecture of such a system is illustrated in Figure 15.5. The CPU
is responsible for setting up and controlling the computation on the DFE.
The DFE contains one or multiple dataflow kernels that perform the
accelerated arithmetic and logical computations. Each DFE also contains a
manager that is responsible for the connections between kernels, DFE memory,
and the various interconnects such as PCIe, Infiniband, and MaxRing.

[Figure 15.5: a CPU running the CPU application on top of SLiC and MaxelerOS, connected through an interconnect to a DFE whose configurable logic holds the kernels (e.g., + and * operators) and the manager; both the CPU and the DFE have attached memory.]

FIGURE 15.5: Logical architecture of a dataflow computing system with one
CPU and one DFE.

[Figure 15.6: the CPU application (Matlab, Python, C, ...) is compiled with a conventional compiler; the code to be accelerated is rewritten in MaxIDE and compiled by MaxCompiler into a MAX file (simulation or DFE); the linker combines both outputs with the MaxelerOS and SLiC libraries into an executable.]

FIGURE 15.6: Compiling a dataflow application with MaxCompiler.

Separating computation and communication into kernels and managers is
beneficial because it allows the data path inside the kernels to be deeply
pipelined without any synchronization issues. When developing a kernel, a
designer simply focuses on achieving high degrees of pipelining and
parallelism without worrying about scheduling or synchronization. The
scheduling of operations inside the kernel is performed automatically by the
compiler. The manager code describes how kernels are connected to memory and
other IO interfaces; the necessary synchronization logic is also generated
by the compiler.
Developing an application for a DFE-based system therefore includes three
parts:

1. A CPU application, typically written in C/C++, Matlab, Python, or FORTRAN

2. One or multiple dataflow kernels written in extended Java³

3. A manager configuration, also written in extended Java³

³ Maxeler provides extensions to the Java language, referred to as MaxJ.

The compilation flow of a Maxeler dataflow design is illustrated in
Figure 15.6. The design typically starts with a CPU application in which a
performance-critical part needs to be accelerated. This part of the
application is targeted at a DFE. Designing a DFE application involves
describing one or multiple kernels and a manager in MaxJ. MaxJ is a
Java-based metalanguage that describes dataflow. It is important to note
that executing the MaxJ program does not perform the computations described
within the program. Instead, it triggers the generation of a configuration
file for the DFE (the so-called .max file). The computation is performed
later by loading the .max configuration file into the DFE and streaming the
data through it. Before we can do this, we need to modify the CPU
application to invoke the DFE. To simplify this process, MaxCompiler
generates the necessary function prototypes and header files. The CPU code
is then compiled as usual and linked with the .max file and Maxeler’s Simple
Live CPU (SLiC) interface library. The result is a single executable file
that contains all the binary code to run on both the conventional CPUs and
the DFEs in a system.
Let us focus on the principles of dataflow programming in MaxJ. As men-
tioned previously, MaxJ is a metalanguage that describes dataflow computing
structures; it uses Java syntax but is in principle different from regular Java
programming (or other imperative programming paradigms that describe com-
putations by changing state). The most important principle in MaxJ is that
we describe a fixed spatial dataflow structure that can perform computations
by simply streaming through data, and not a sequence of instructions to be
executed on a traditional processor.
To illustrate these principles, we show how a simple loop computation can
be transformed into a dataflow description using MaxJ. Let us assume we
want to calculate y = x² + 3x + 17 over a data set. Even though there is
nothing inherently sequential in this computation, a conventional C program
would require a for loop. This is illustrated in Figure 15.7. The calculation is
repeated for the number of data elements in a loop. Within the loop body, all
operations also run sequentially.
In contrast, a dataflow implementation focuses on identifying the core part
of the computation and creating a data path for it. Figure 15.8 illustrates
such a dataflow implementation. The same computation that is described
inside the loop body can be performed by a fixed data path that contains two
multipliers and two adders. A key feature of dataflow computing is that
several operators are present at the same time and run concurrently, instead
of sharing a single functional unit inside a processor over time. A
practical dataflow implementation can have thousands of operators in a data
path, all running concurrently. Another important principle is the absence
of control and instructions. The data path is fixed and the computation is
performed by streaming data from memory directly into the data path.
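
The concurrency of the operators can be illustrated with a small cycle-by-cycle simulation. The sketch below is written in Python for illustration only and is not MaxJ; the three-stage register layout and the name simulate are assumptions, since in a real design the pipeline depth is chosen by the compiler. In every simulated clock cycle, the multipliers and the two adders each work on a different data item.

```python
from typing import List, Optional, Tuple


def simulate(stream: List[float]) -> Tuple[List[float], int]:
    """Clock a three-stage pipeline for y = x*x + 3*x + 17.

    Returns the output stream and the number of clock cycles used.
    """
    stage1: Optional[Tuple[float, float]] = None  # register after the two multipliers
    stage2: Optional[float] = None                # register after the first adder
    outputs: List[float] = []
    pending = list(stream)
    cycles = 0
    while pending or stage1 is not None or stage2 is not None:
        # All stages are active in the same cycle, each on a different item.
        if stage2 is not None:
            outputs.append(stage2 + 17)                   # adder 2
        stage2 = stage1[0] + stage1[1] if stage1 is not None else None  # adder 1
        if pending:
            x = pending.pop(0)
            stage1 = (x * x, 3 * x)                       # multipliers 1 and 2
        else:
            stage1 = None
        cycles += 1
    return outputs, cycles


print(simulate([0.0, 1.0, 2.0]))  # ([17.0, 21.0, 27.0], 5)
```

Three items leave a three-stage pipeline after five cycles: three cycles to fill it, then one result per cycle.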
Figure 15.9 depicts the MaxJ kernel description that generates the data
path shown in Figure 15.8. A MaxJ description begins by extending the Kernel
class (line 1). The Kernel class is part of the Maxeler Java extensions, and
users develop their own kernels through inheritance. Next, we define a
constructor for the SimpleCalc class (line 2). It is important to remember
that this MaxJ program runs only once, to build the DFE configuration; the
constructor builds the dataflow implementation. To create the streaming
inputs and outputs for the kernel, the methods io.input (line 3) and
io.output (line 5) are used. Streaming inputs and outputs replace the for
loop in the original C code that iterates over the data. The input method
takes two arguments: the name of the input, which is used by the manager to
connect the kernel, and the data type of the input. In this case, we use the
standard single precision floating point format (8-bit exponent and 24-bit
mantissa), but MaxJ also supports custom data types that can be defined by
the user. This is useful when optimizing numerical behavior and performance,
which will be covered later.

    for (i = 0; i < numDataElements; i++) {
        float x = input[i];
        float y = x * x + 3 * x + 17;
        output[i] = y;
    }

FIGURE 15.7: C code of a simple computation inside a loop.

[Figure 15.8: data path with two multipliers (X) and two adders (+), including the constant 17.]

FIGURE 15.8: A dataflow implementation for the computation inside the loop
body.

    1  class SimpleCalc extends Kernel {
    2      SimpleCalc() {
    3          DFEVar x = io.input("x", dfeFloat(8, 24));
    4          DFEVar y = x * x + 3 * x + 17;
    5          io.output("y", y, dfeFloat(8, 24));
    6      }
    7  }

FIGURE 15.9: A MaxJ description that generates the dataflow implementation
shown in Figure 15.8.

    int x = 10;
    DFEVar y;
    DFEVar z;
    y = x; // ok, assign constant to run-time variable
    x = y; // compiler error, cannot read run-time variable into compile-time Java variable
    z = y; // ok, both handle run-time data

FIGURE 15.10: DFEVars handle run-time data; Java constants are evaluated
only at compile time.

The output method takes three arguments: the name of the output to be used
by the manager, the variable to connect to the output, and the data format.
The computation itself (see Figure 15.9) is expressed in much the same way
as in the original C code (line 4).
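
The effect of the mantissa width in a format such as dfeFloat(8, 24) can be explored on the CPU before committing to a custom type. The following Python sketch rounds a value to a given number of significant bits. It is an illustration under stated assumptions: the helper name quantize is hypothetical, the exponent range is ignored, and ties are rounded to even; it does not reproduce the exact behavior of MaxJ data types.

```python
import math


def quantize(x: float, mantissa_bits: int) -> float:
    """Round x to the nearest value with `mantissa_bits` significant bits.

    Only the mantissa width is modelled; overflow and underflow of a
    limited exponent field are not.
    """
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** mantissa_bits
    return math.ldexp(round(m * scale) / scale, e)


print(quantize(0.1, 24))  # 0.10000000149011612, the single precision value of 0.1
print(quantize(0.1, 11))  # a much coarser approximation of 0.1
```

Comparing the results of a computation for different mantissa widths gives a first indication of how much precision an application actually needs.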
In MaxJ, the DFEVar object is used to handle run-time data. Since MaxJ
describes a dataflow graph rather than a procedure, we have to distinguish
between run-time values and compile-time values. Regular Java variables,
such as those of type int, are evaluated and fixed at compile time. Such
variables can be used as constants for improved code readability or to
control the construction of the dataflow graph. The values of DFEVars are
known only at run time, when data is streamed through the kernel. This means
that assigning a Java variable to a DFEVar results in a constant in the
dataflow graph. However, it is not possible to read a DFEVar and assign its
value to a Java variable (see Figure 15.10).
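
The distinction between build-time and run-time values can be mimicked in a few lines of Python. This is only an analogy: the Node class and the constant helper are hypothetical and not part of MaxJ, and where MaxJ reports a compiler error, the sketch raises an exception while the graph is being built.

```python
class Node:
    """A vertex of the dataflow graph; its value exists only when data is streamed."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def __int__(self):
        # Mirrors the rule that run-time data cannot be read into a build-time variable.
        raise TypeError("run-time value is not available while the graph is being built")


def constant(value):
    """Freeze a build-time Python number into the graph as a constant node."""
    return Node("const", value)


x = 10               # build-time value, like the Java int in Figure 15.10
y = constant(x)      # fine: the graph now contains the constant 10
z = y                # fine: both names refer to run-time data

try:
    x = int(y)       # not possible: y has no value at build time
except TypeError as err:
    print("rejected:", err)
```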
The principle described above means that we can use Java variables and
control constructs to shape the structure of our dataflow graph. Let us consider
an example of a nested loop as shown in Figure 15.11. We observe that the
outer for loop performs an iteration over data, while the inner for loop
describes a computation with a cyclic dependency of v from one loop iteration
to another.
This example can be effectively transformed into a dataflow description
as illustrated in Figure 15.12. Again, the outer loop is replaced by streaming
inputs and outputs. The inner loop is described with the same for for loop
statement in Java, but the compilation of this loop will result in an unrolled
implementation of the loop body in space, as depicted in Figure 15.13. Unlike

1 f o r ( i = 0 ; i < numDataElements ; i ++) {


2 f l o a t d = input [ i ] ;
3 f l o a t v = 2.91 − 2.0 ∗ d ;
4 f o r ( i t e r a t i o n = 0 ; i t e r a t i o n < 4 ; i t e r a t i o n ++) {
5 v = v ∗ (2.0 − d ∗ v) ;
6 }
7 output [ i ] = v ;
8 }

FIGURE 15.11: C code of a nested loop with dependency.


454 High-Performance Computing in Finance

1 c l a s s Loop e x t e n d s K e r n e l {
2 Loop ( ) {
3 DFEVar d = i o . i n p u t ( ” d ” , d f e F l o a t ( 8 , 2 4 ) ) ;
4 DFEVar v = 2 . 9 1 − 2 . 0 ∗ d ;
5 f o r ( i n t i t e r a t i o n = 0 ; i t e r a t i o n < 4 ; i t e r a t i o n +=
1) {
6 v = v ∗ (2.0 − d ∗ v) ;
7 }
8 i o . output ( ” output ” , v , d f e F l o a t ( 8 , 2 4 ) ) ;
9 }
10 }

FIGURE 15.12: A MaxJ implementation of the inner loop will be statically


evaluated resulting in spatial replication.

2.0 d

2.91 – X
v
Iteration 0
X

2.0 –

X
v
Iteration 1
X

2.0 –

X
v
Iteration 2
X

2.0 –

X
v
Iteration 3
X

2.0 –

FIGURE 15.13: The result of the MaxJ loop is an unrolled and pipelined
data path.
Multiscale Dataflow Computing in Finance 455

    DFEVar x = io.input("x", dfeFloat(9, 31));
    DFEVar a = io.input("y", dfeFloat(9, 31));

    DFEVar y1 = x * 5;
    DFEVar y2 = x - 7;

    DFEVar y = a > 3 ? y1 : y2;

    io.output("y", y, dfeFloat(9, 31));

FIGURE 15.14: Data-dependent control with the ternary operator and use
of a custom number format.

the original loop in C, the for loop in MaxJ does not carry out four iterations
at run time. Instead, the compiler can resolve the dependency of v from one
loop iteration to another and construct an unrolled, acyclic data path where
the calculation inside the loop body is replicated four times, and each v is
connected to the result from the previous iteration.
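The effect of static unrolling can be imitated in ordinary software. The following Python sketch is not MaxJ and does not use any Maxeler API; the names build_unrolled_pipeline and reciprocal are ours. It builds a chain of four copies of the loop body once, and then pushes each stream element through that fixed chain. The loop body is the Newton-Raphson update for the reciprocal 1/d, so for inputs near the range where the starting estimate 2.91 - 2.0 * d is reasonable, four stages should give a close approximation of 1/d.

```python
def build_unrolled_pipeline(stages):
    """Return a function made of `stages` chained copies of the loop body."""

    def stage(d, v):
        # One copy of the loop body from Figure 15.11.
        return v * (2.0 - d * v)

    # This runs once, at "build time", and only fixes how many stage copies
    # are chained. It plays the role of the Java for loop in Figure 15.12.
    chain = [stage] * stages

    def pipeline(d):
        v = 2.91 - 2.0 * d      # starting estimate, as in the figures
        # In hardware the stages sit side by side in space; Python can only
        # walk through them in order.
        for s in chain:
            v = s(d, v)
        return v

    return pipeline


reciprocal = build_unrolled_pipeline(4)
stream = [0.5, 0.75, 1.0]
outputs = [reciprocal(d) for d in stream]
print(outputs)
```

The point of the sketch is the separation of roles: the stage count is decided when the pipeline is built, while the data only ever flows through a fixed structure.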
The previous example has shown how a Java for loop can be used to
control the replication of statements inside the loop body into an unrolled
data path. Likewise, it is possible to use Java conditionals such as if or switch
to control the construction of the dataflow graph. The Java if condition is
evaluated at compile time, and the block of code inside the conditional state-
ment will be added into the dataflow graph only if the condition is evaluated
as true.
However, we cannot use a Java conditional on DFEVars because their
values will be known only at run time. As previously mentioned, run-time
dependent behavior is undesirable as it is against the principles of static
dataflow computing. If a data-dependent decision needs to be made, then
this can be expressed using the ternary operator ? : (see Figure 15.14).
This example results in data-dependent control, but in the data path, both
y1 and y2 will be computed concurrently. At the output, we simply select
one of the two results, depending on the value of a. This switching will be
very fast and will not delay or stall the stream processing. However, it also
means that we require resources for both computations on the DFE chip even
though only one of the two outputs will be used at any time. This makes
this type of control effective for fast, small-scale switching. For switching
between larger blocks of computation, it might be more effective to imple-
ment separate DFE kernels and handle the switching and control from the
CPU host.
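The behavior described above can be mimicked in a few lines of Python. This is only an emulation of the data path in Figure 15.14 (the function name select_kernel is ours): both branch results are computed for every element, and the condition merely picks one of them.

```python
def select_kernel(x_stream, a_stream):
    """Emulate Figure 15.14: compute both branches, then select per element."""
    out = []
    for x, a in zip(x_stream, a_stream):
        y1 = x * 5      # branch 1 is always computed
        y2 = x - 7      # branch 2 is always computed
        # The multiplexer forwards one of the two finished results.
        out.append(y1 if a > 3 else y2)
    return out


selected = select_kernel([1.0, 2.0, 3.0], [4.0, 0.0, 10.0])
print(selected)
```

Unlike a software branch, neither computation is skipped, which is why the resource cost is the sum of both branches.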
Figure 15.14 also illustrates that custom number formats other than con-
ventional single or double-precision floating point can be used. In this exam-
ple, we use a 9-bit exponent and a 31-bit mantissa, which offers better scaling
and precision than single precision (8, 24 bit) but less than double precision
(11, 53 bit). Likewise, it is possible to use any arbitrary fixed-point or integer
format. The application developer can use such custom number formats to

    DFEVar x = io.input("x", dfeFloat(8, 24));
    DFEVar prev = stream.offset(x, -1);
    DFEVar next = stream.offset(x, 1);
    DFEVar y = (prev + x + next) / 3;
    io.output("y", y, dfeFloat(8, 24));

FIGURE 15.15: Using stream offsets to access values with relative offsets
in the stream.

tailor the implementation to the numerical requirements of the application,
and using such custom formats will yield better resource utilization and
performance than relying on the next larger standard format.
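To get a feel for what a custom mantissa width means numerically, the following Python sketch rounds a double-precision value to a chosen number of significant bits. It is a simplified model under our own assumptions: the mantissa width is counted as in the text (24 for single precision, 53 for double precision), only the mantissa is modelled, and the exponent range (overflow, subnormals) is ignored. The function name round_to_mantissa is ours.

```python
import math
import struct


def round_to_mantissa(value, mantissa_bits):
    """Round `value` to `mantissa_bits` significant bits (nearest, ties to even).

    The exponent range is not modelled, so overflow and subnormals are ignored.
    """
    if value == 0.0:
        return 0.0
    fraction, exponent = math.frexp(value)   # value == fraction * 2**exponent
    scale = 2 ** mantissa_bits
    return math.ldexp(round(fraction * scale) / scale, exponent)


x = 0.1
errors = {bits: abs(round_to_mantissa(x, bits) - x) for bits in (24, 31, 53)}
for bits, err in errors.items():
    print(bits, err)

# Cross-check: 24 significant bits should reproduce IEEE single precision.
ieee_single = struct.unpack("f", struct.pack("f", x))[0]
print(round_to_mantissa(x, 24) == ieee_single)
```

A model like this can be used to check whether an algorithm tolerates a narrower format before any hardware is built.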
All previous examples have considered operations where the output is a
function of inputs with the same array index within the stream, for example:

$$z_i = 5x_i + y_i, \quad z_{i+1} = 5x_{i+1} + y_{i+1}, \quad \ldots \qquad (15.1)$$

However, in some cases we need to access values that are ahead or behind
the current element in the data stream. For example, in a moving average
filter we need to compute:
$$y_i = \frac{x_{i-1} + x_i + x_{i+1}}{3} \qquad (15.2)$$
In dataflow computing, x is a stream rather than an indexed array, and we
need a way of accessing elements of the same stream at indices other than
the current one. This can be achieved with the stream.offset method, which
accesses values at a relative offset from the current value in the stream. In
the moving average example (see Figure 15.15), we need the previous value
(−1) and the next value (+1).
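The moving average of Equation 15.2 can be emulated in Python as follows. This is a software sketch, not the MaxJ kernel: the helper offset is ours, and the treatment of the stream boundaries (reading 0.0 outside the stream) is an assumption made only so that the example is complete; the text does not specify what the hardware returns there.

```python
def moving_average_kernel(x_stream):
    """Three-point moving average using relative offsets -1 and +1."""

    def offset(stream, i, k):
        # Emulates reading the stream at relative offset k from position i.
        # Assumption: positions outside the stream read as 0.0.
        j = i + k
        return stream[j] if 0 <= j < len(stream) else 0.0

    out = []
    for i, x in enumerate(x_stream):
        prev = offset(x_stream, i, -1)
        nxt = offset(x_stream, i, +1)
        out.append((prev + x + nxt) / 3)
    return out


averaged = moving_average_kernel([3.0, 6.0, 9.0, 12.0])
print(averaged)
```

Only the interior elements are true three-point averages; the first and last outputs depend on the boundary assumption.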
Figure 15.16 illustrates how a DFE application interacts with the CPU host
application. On the right side, we see the moving average kernel MAVKernel
from our last example. As previously mentioned, we also create a manager to
describe the connectivity between the kernel and the available DFE interfaces.
In Figure 15.16, the kernel is connected directly to the CPU, and all of the
communication will be facilitated via PCIe. The manager also makes visible
to the CPU application all the names of the kernel streaming inputs and
outputs. Compiling the manager and kernel will produce a .max file that can
be included in the host application code. In the host application, running
the moving average calculation will be performed with a simple function call
to MAVKernel(). In this example, the host application is written in C but
MaxCompiler can also generate bindings for a variety of other languages such
as MATLAB or Python.
MaxelerOS and the SLiC library provide a software layer that facilitates
the execution and control of the DFE applications. The SLiC Application
Programming Interface (API) is used to invoke the DFE and process data
on it. In the example in Figure 15.16, we use a simple SLiC interface and

[Figure: block diagram and three code panels. The diagram shows the CPU
(host code in MATLAB, Python, C..., SLiC, MaxelerOS) and main memory
connected over PCI Express to the DFE, where the manager routes stream x
into the moving average kernel and stream y back to the host. Host code (.c):
includes the generated header, declares float x[n], y[n], and calls
MAVKernel(x, y, n). Manager (.maxj): creates a Manager m and a Kernel
k = new MAVKernel(), attaches k, and links the streams "x" and "y" to the
CPU. MAVKernel (.maxj): the moving average kernel of Figure 15.15.]

FIGURE 15.16: Interaction among host code, manager, and kernel in a
dataflow application.

the simple function call MAVKernel() will carry out all DFE control functions
such as loading the binary configuration file and streaming data in and out
over PCIe. More advanced SLiC interfaces are also available that provide the
user with additional control over the DFE behavior. For example, in many
cases it is beneficial to transfer the data to DFE memory (LMem) first and
then start the computation. This is one of many performance optimizations,
which we will briefly cover in the following section.

15.5 Development Process and Design Optimization


In the previous section, we have introduced the principles of dataflow pro-
gramming. We now outline how to develop dataflow applications in practice
and how to improve their performance. In traditional software design, a devel-
oper usually targets a given platform and optimizes the application based on
available libraries that reflect the capabilities and architectural characteristics
of the targeted platform. Developing a dataflow implementation fundamen-
tally differs in that we codesign the application and architecture. Instead of
mapping a problem to preexisting APIs and data types, we enable domain
experts, for example, physicists, mathematicians, and engineers, to create a
solution all the way from the formulation of the computational problem down
to the design of the best possible dataflow architecture. A developer would
therefore optimize the scientific algorithm to match the capabilities of the dataflow

Analysis -> Transformation -> Partitioning -> Implementation

FIGURE 15.17: Process for developing and optimizing dataflow applications.

architecture while at the same time optimizing the dataflow structure to match
the requirements of the algorithm. Another key difference to traditional soft-
ware design is the implementation and optimization cycle. In software design,
a developer would typically implement a design, go on to profile and evaluate
the performance of the current implementation, and then tweak the implemen-
tation. In dataflow design, we adopt a different approach where the design is
optimized before it is implemented. The behavior inside a DFE is very pre-
dictable and we can therefore plan and precisely predict the performance of a
possible solution without even implementing it. This means the design will be
analyzed and optimized with simple spreadsheet calculations before we create
the final implementation.
This development process is illustrated in Figure 15.17. The first step con-
sists of an application analysis phase. The purpose of this step is to establish
an understanding of the application, the data set, the algorithms used, and
the potential performance-critical parts. Since we will codesign an algorithm
and its dataflow architecture, this analysis should cover all parts of the com-
putational problem, from the mathematical formulation and algorithm to the
architecture and implementation details. Typical considerations are the type
and regularity of the computation, the ratio between computation and mem-
ory accesses, the ratio of computation to disk IO or network communication,
and the balance between recomputation and storage of precomputed results.
All these aspects can have a significant impact on the performance of the
final implementation. If, for instance, an application is limited by the speed
at which data can be read from disk, then optimizing the throughput of the
compute kernel beyond that limit will have no benefit.
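The kind of spreadsheet-style reasoning described above can be sketched in a few lines of Python. This is a deliberately crude model under our own assumptions: all stages are fully overlapped, so the slowest stage sets the run time, and every number below is hypothetical and chosen only to illustrate the argument.

```python
def predicted_runtime_s(data_bytes, compute_cycles, disk_bw, link_bw, clock_hz):
    """Predict run time for a fully overlapped streaming design.

    Bandwidths are in bytes per second. With all stages overlapped, the
    slowest stage determines the run time.
    """
    stage_times = {
        "disk": data_bytes / disk_bw,
        "link": data_bytes / link_bw,
        "kernel": compute_cycles / clock_hz,
    }
    bottleneck = max(stage_times, key=stage_times.get)
    return stage_times[bottleneck], bottleneck


# Hypothetical figures: 8 GB of input, 1e9 kernel cycles.
baseline = predicted_runtime_s(8e9, 1e9, disk_bw=0.5e9, link_bw=2e9, clock_hz=200e6)
# Doubling the kernel clock does not help while the disk is the limit.
faster_kernel = predicted_runtime_s(8e9, 1e9, disk_bw=0.5e9, link_bw=2e9, clock_hz=400e6)
print(baseline, faster_kernel)
```

In this example the predicted run time is unchanged by the faster kernel, which is exactly the situation in which further kernel optimization has no benefit.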
The second step involves algorithmic transformations. A designer could
attempt to choose a different algorithm to solve the problem, or transform
the code, data access patterns, or number representations. A typical example
of an algorithmic transformation is to change the number format: choosing
a smaller number representation can support more IO bandwidth and
higher computational performance, but its numerical effects on the algorithm
have to be well understood. The reconfigurable technology used inside the
DFEs supports far greater flexibility in the available number formats than
conventional processors. Instead of choosing between single- and
double-precision floating point, a design can exploit a custom format with
arbitrary bit-widths for its exponent and mantissa. Another common optimization is the

reordering of data access patterns to support better dataflow. The impact
of algorithmic transformations has to be evaluated through iterative analysis
of the design.
The third step is to partition the application between the CPU and the
DFE. This partitioning covers program code as well as data. For the program
code, we can choose whether the code should run on the CPU or the DFE.
Large-scale applications typically involve multiple DFEs and this also involves
partitioning DFE code over multiple DFEs. Furthermore, it is often beneficial
to follow a coprocessing approach where the CPU and DFE work on different
parts of the computation at the same time. For instance, the CPU can perform
lightweight precalculations or more control-intensive parts of the application.
For this purpose, the SLiC library provides nonblocking functions to control
the DFEs. Another consideration is the partitioning of data. The example in
Figure 15.16 showed DFE data being streamed from main CPU memory. For
processing larger data sets, it is usually beneficial to locate the data in the
large DFE memory (LMem). Coefficients or frequently accessed values can be
kept inside the DFE reconfigurable substrate in fast memory (FMem).
A high-level performance model is used to evaluate the design as it under-
goes various transformations, code, and data partitionings. The process of
analysis and optimization is repeated iteratively as additional possibilities are
explored. Only when the design is fully optimized will the designer proceed
to step 4: the implementation of the design.

15.6 A Case Study: Correlation


To illustrate the principles of Section 15.5, we consider the following
real-world example: given a collection of price time series $\{S_t^i\}$ for
times $t$ and stocks $i = 1, \ldots, N$, with $N$ large, we wish to compute
the correlations $\rho_t^{i,j}$ for each of the $N(N-1)/2$ pairs
$1 \le i < j \le N$ over a sliding time interval from $t - T$ to $t$. These
correlation values are important in portfolio management and trading
strategies such as statistical arbitrage, where a statistical relationship
between a pair of stocks can be used to predict price movements in one from
the other. Typical values for this example are $N = 6000$ and $T = 100$.
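The formula for correlation is

$$\rho_t^{i,j} = \frac{T \sum_\tau r_\tau^i r_\tau^j - \sum_\tau r_\tau^i \sum_\tau r_\tau^j}{\sqrt{T \sum_\tau (r_\tau^i)^2 - \left(\sum_\tau r_\tau^i\right)^2} \cdot \sqrt{T \sum_\tau (r_\tau^j)^2 - \left(\sum_\tau r_\tau^j\right)^2}} \qquad (15.3)$$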

where $t - T \le \tau < t$ and $r_t^i = \ln(S_{t+1}^i / S_t^i)$ is the return
on stock $i$ in period $t$. We begin our analysis step by writing a CPU
program to perform this

calculation, and evaluating it for performance bottlenecks using well-known
code analysis tools.
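A minimal CPU reference for one pair of stocks might look like the following Python sketch. It evaluates Equation 15.3 directly from the five sums over one window; the function names are ours, and this is an illustration rather than the program used by the authors.

```python
import math


def log_returns(prices):
    """r_t = ln(S_{t+1} / S_t) for consecutive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def correlation_from_sums(ri, rj):
    """Correlation of two equal-length return windows via Equation 15.3."""
    T = len(ri)
    s_i, s_j = sum(ri), sum(rj)
    s_ii = sum(a * a for a in ri)
    s_jj = sum(b * b for b in rj)
    s_ij = sum(a * b for a, b in zip(ri, rj))   # the pairwise product term
    numerator = T * s_ij - s_i * s_j
    denominator = math.sqrt(T * s_ii - s_i ** 2) * math.sqrt(T * s_jj - s_j ** 2)
    return numerator / denominator


window_i = [0.01, -0.02, 0.03, 0.00]
same_direction = correlation_from_sums(window_i, [2 * r for r in window_i])
opposite = correlation_from_sums(window_i, [-r for r in window_i])
print(same_direction, opposite)
```

Note that only s_ij depends on the pair; the other four sums depend on a single stock each, which is what the partitioning step later exploits.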
We can transform the problem by noting that our goal is to calculate
the sequences $\{\rho_t\}$ over time as new price data $\{S_t\}$ becomes
available. In a single time step increment from $t$ to $t + 1$, we can update
each of the sums in Equation 15.3 by adding a single term for $\tau = t$ and
subtracting a term for $\tau = t - T$. In particular, this greatly reduces the
number of multiplications needed to calculate the $\sum_\tau r_\tau^i r_\tau^j$
terms, from $N(N-1)(T/2)$ to $N(N-1)$. This type of “streaming” calculation is
particularly well-suited to DFEs, which consume data at a consistent and
predictable rate.
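The incremental update can be sketched in Python for a single pair of stocks. The class name SlidingProductSum is ours, and the choice to recompute the leaving product from stored returns (two multiplications per pair and time step, which matches our reading of the N(N − 1) count) is an assumption about the intended scheme.

```python
from collections import deque


class SlidingProductSum:
    """Maintain the sum of r_i * r_j over the last `window` periods."""

    def __init__(self, window):
        self.window = window
        self.history = deque()   # returns currently inside the window
        self.total = 0.0

    def update(self, r_i, r_j):
        self.history.append((r_i, r_j))
        self.total += r_i * r_j                 # term entering the window
        if len(self.history) > self.window:
            old_i, old_j = self.history.popleft()
            self.total -= old_i * old_j         # term leaving the window
        return self.total


ri = [1.0, 2.0, 3.0, 4.0, 5.0]
rj = [2.0, 0.0, -1.0, 3.0, 1.0]
tracker = SlidingProductSum(window=3)
totals = [tracker.update(a, b) for a, b in zip(ri, rj)]
print(totals)
```

In floating point, a long-running add-and-subtract update can accumulate rounding error, so a production design would need to consider periodic recomputation or a wider accumulator.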
To partition the problem between CPUs and DFEs, we note that the most
computationally intensive components are the pairwise product terms
$\sum_\tau r_\tau^i r_\tau^j$, which must be calculated for nearly 18 million
pairs in our example, while the other terms need only be calculated once for
each of the 6000 stocks and the values reused. Accordingly, we first calculate
the single stock terms on the CPU and store the values in LMem to be consumed
by the DFE later; only the pairwise multiplications are implemented on the
DFE. We also add logic to the DFE to calculate the final correlation value and
store the pairs with the highest correlation.
Turning to the implementation stage, we need to determine the number of
multiplications we will perform simultaneously on the DFE. This is the
product of the number of multiplier units, or pipes, that can physically fit
on the chip and the number of clock cycles required for each multiplication
[Figure: 12 parallel pipes fed by an LMem load.]

FIGURE 15.18: Layout of the correlation calculation.



(the pipeline depth). These values will vary between DFE generations and
can only be truly determined by performing place-and-route using the
reconfigurable chip vendor tools, which takes a significant amount of compute
time (typically tens of hours). Figure 15.18 shows the layout of the DFE
correlation with 12 parallel pipes, each with a pipeline depth of 12 stages, so
that 144 data elements can be simultaneously processed with a throughput of
12 data elements per clock cycle. Even when assuming a low clock frequency,
this design is able to perform all 36 million required multiplications in less
than 30 ms.
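The figures quoted above can be checked with a short back-of-the-envelope calculation. In the Python sketch below, the values of N and the pipe count come from the text, while the 100 MHz clock is our assumption for what a "low" clock frequency might be; pipeline fill latency and memory effects are ignored.

```python
N = 6000                # number of stocks (from the text)
pipes = 12              # parallel multiplier pipes (Figure 15.18)
clock_hz = 100e6        # assumed "low" clock frequency; not from the text

pairs = N * (N - 1) // 2
multiplications = N * (N - 1)      # our reading: two per pair and time step
cycles = multiplications / pipes   # 12 results per clock cycle
runtime_ms = 1e3 * cycles / clock_hz
print(pairs, multiplications, runtime_ms)
```

Under these assumptions the estimate lands just under 30 ms, which is consistent with the claim in the text.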

15.7 Financial Application Examples


Maxeler dataflow technology has been deployed in a number of areas
including finance [6,10], oil and gas exploration [11,12], atmospheric mod-
elling [13], and DNA sequence alignment [14]. The range of applications
includes Monte Carlo, finite difference, and irregular tree-based partial dif-
ferential equations, to name a few. Maxeler provides a number of products
and solutions in the financial domain, including financial analytics and trad-
ing applications, particularly for low-latency/high frequency electronic trading
on organized exchanges.

15.7.1 Maxeler RiskAnalytics platform


Maxeler RiskAnalytics is a financial valuation and risk management platform
designed from the ground up, where the core analytic algorithms are
accelerated on Maxeler dataflow systems. The platform goes beyond simply
providing highly efficient computational finance capabilities: it aims to
provide a complete, vertically integrated application stack containing all
the components necessary for streamlined front-to-back portfolio risk
management, including:

• Front-end, pretrade valuation, and risk checking
• Exchange-based, electronic trade execution, portfolio valuation, and risk
management
• Front-end trade booking, portfolio management, model and risk reporting
and analysis
• Posttrade model and risk metric selection and verification
• Rapid and flexible transaction analysis and reporting
• Application layer in software for quick and flexible functional
reconfiguration
• Large memory to enable rapid and flexible in-memory portfolio risk
analysis
• Regulatory reporting for Basel III, EMIR, Dodd-Frank, Volcker Rule,
Solvency II, and so on
• Adaptive load balancing
• Database integration

All core RiskAnalytics components have been implemented both in software
and on Maxeler DFE-based systems, which required integrating the DFE
technology with the expertise of quantitative analysts with extensive investment
banking experience. The platform has been designed in a modular fashion to
maximize flexibility and performance. Each module realizes a core analytics
component, such as curve bootstrapping or Monte Carlo path generation. To
support flexible hardware/software coprocessing and to enable ease of inte-
gration with existing systems, each module is available as both a CPU and
DFE library component. As outlined in Section 15.5, achieving an efficient
implementation depends on the overall system composition, architecture, and
application structure. Making use of preexisting CPU and DFE library com-
ponents greatly simplifies this process. In the following, we show the use of
Maxeler’s RiskAnalytics library in several commercial use cases.

15.7.2 Interest rate swap pricing


An interest rate swap is a financial derivative with high liquidity that is
commonly used for hedging. Such a swap involves exchanging interest rate
cashflows based on a specified notional amount from one interest rate to
another, for example, exchanging fixed interest-rate flows for floating interest-
rate flows. Figure 15.19 illustrates a typical module configuration for pricing

[Figure: pipeline of four stages.]
Curve bootstrap (OIS, LIBOR)
  -> Generate and interpolate discount factors and swap rates
  -> Generate cash flow schedules
  -> Calculate value and risk for swap portfolio

FIGURE 15.19: A typical swap pricing pipeline.



TABLE 15.1: Possible configurations for swap pricing pipeline

Application characteristics   OIS   LIBOR   Cashflow   Pricing
Many curves, few swaps        DFE   DFE     CPU        CPU
Few curves, many swaps        CPU   CPU     DFE        DFE

interest rate swaps, involving bootstrapping the Overnight Index Swap (OIS)
curve and the London Interbank Offered Rate (LIBOR) curve, followed by gen-
erating swap cashflow schedules, valuing swaps, and calculating swap portfolio
risk. Each stage is available as either a CPU or a DFE library component and
can be accessed via a number of convenient APIs. The implementation provides
construction of and access to all intermediate and final objects.
Depending on the characteristics of the swap pricing application, DFE
acceleration can be beneficial at one or more stages of the computa-
tion. Table 15.1 illustrates two possible module configurations where the
performance-critical DFE acceleration can be carried out at different stages of
the pipeline. The modular design of Maxeler’s RiskAnalytics allows the user
application to dynamically load balance between CPUs and DFEs, and to target
heavy compute load to DFEs, leaving CPUs to support application logic and
lighter compute loads. DFE functionality can be switched in real time by using
MaxelerOS SLiC API functions. Fully pipelined, a Maxeler DFE-equipped 1U
MPC-X node can value a portfolio of 10-year interest rate swaps at a rate of
over 2 billion per second—including bootstrapping of the underlying interest
rate curves.

15.7.3 Value-at-risk
Value-at-risk, or VaR, is a measure widely used to evaluate the risk of loss on
a portfolio over a given time period. VaR defines the loss amount that a
portfolio is not expected to exceed for a specified level of confidence over a
given time frame. VaR can be calculated in a number of ways (e.g., using fixed
historical scenarios, arbitrarily specified scenarios, a delta-based approach,
or Monte Carlo generated scenarios). Irrespective of the method chosen,
the VaR computation involves evaluating many possible market scenarios, which
is computationally very demanding. On conventional technology, the computation
is frequently slow and often inaccurate, as well as being unstable in the tail
of the loss distribution, resulting in uncertainty in risk attribution and
difficulty in optimizing against portfolio VaR targets. This is illustrated in
Figure 15.20a, where the tail of the loss distribution for a mixed portfolio of
interest rate swaps exhibits a stepwise profile, making it extremely difficult
to accurately manage portfolio VaR.
Mitigating these problems requires a massively increased number of scenarios
in order to provide higher resolution in the tail of the loss distribution
and thereby significantly improve stability for risk attribution and/or provide

[Figure: two panels, (a) and (b).]

FIGURE 15.20: Value-at-Risk with 10,000 scenarios (a) and 500,000
scenarios (b).

greater visibility of the impact of market and portfolio changes. This is clearly
illustrated when comparing Figure 15.20a and b. In the second case, the num-
ber of Monte Carlo scenarios is increased by a factor of 50, resulting in far
greater granularity in the tail of the loss distribution leading to improved accu-
racy of portfolio risk management. Fully pipelined, a Maxeler DFE-equipped
1U MPC-X node can compute full revaluation VaR on a portfolio of 250,000
10-year interest rate swaps (equivalent to a rate of over 2 billion swaps per
second)—including bootstrapping of the underlying interest rate curves, as
well as scenario construction.
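Why more scenarios give a better-resolved tail can be illustrated with a small Python experiment. This is a toy sketch under our own assumptions: losses are drawn from a standard normal distribution as a stand-in for a full portfolio revaluation, and VaR is taken as a simple empirical quantile. The function names are ours.

```python
import math
import random


def value_at_risk(losses, confidence):
    """Loss level not exceeded with probability `confidence` (empirical quantile)."""
    ordered = sorted(losses)
    index = min(len(ordered) - 1, max(0, math.ceil(confidence * len(ordered)) - 1))
    return ordered[index]


def simulate_losses(n, rng):
    # Placeholder loss model: standard normal losses, not a real portfolio.
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


rng = random.Random(0)
results = {}
for n in (10_000, 500_000):
    losses = simulate_losses(n, rng)
    var_99 = value_at_risk(losses, 0.99)
    tail_points = sum(1 for loss in losses if loss > var_99)
    results[n] = (var_99, tail_points)
    print(n, round(var_99, 3), tail_points)
```

With 50 times as many scenarios, roughly 50 times as many observations fall beyond the 99% level, so the tail of the empirical distribution is described by far more points.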
Increasing the number of Monte Carlo scenarios as suggested above obvi-
ously increases the computational requirements, but with DFE acceleration,
the extra scenarios can be easily and practically achieved. When the accu-
racy of computation is increased, several new approaches to VaR can become
feasible:

• Prehorizon cashflow generation and dynamic portfolio hedging

• Sensitivity metrics for enhanced risk explanation and attribution

• Stable and efficient portfolio optimization

15.7.4 Exotic interest rate pricing


Another use case for DFEs is pricing an exotic product such as a Bermudan
swaption, which is an option to enter into an interest rate swap on any one of
a number of predetermined dates. One of the industry standard approaches
to this pricing problem is to use the LIBOR market model (LMM) which
employs a high-dimensional Monte Carlo simulation with complex dynamics
and a large state space. Pricing involves a multistage algorithm with forward
and backward cross-sectional (Longstaff–Schwartz) computations across the
full path space. Here, the challenge is to manage large path data sets, typically
several gigabytes, across multiple stages.

FIGURE 15.21: Bermudan swaptions computation on a DFE.

Figure 15.21 illustrates the RiskAnalytics DFE implementation, including
cashflow generation and Longstaff–Schwartz backward regression. By closely
coordinating between multiple DFE stages and DRAM memory, 6666 quarterly
30-year Bermudan swaptions can be priced per second on a Maxeler 1U
MPC-X node. This represents a 23× improvement over a 1U CPU node.
Table 15.2 provides a comparison of different instruments priced per second
for a range of instrument types supported in RiskAnalytics. As can be seen,

TABLE 15.2: Performance comparison of 1U CPU and DFE nodes

Instruments priced     Conventional 1U   Maxeler 1U
per second             CPU node          MPC-X node      Comparison
European swaptions     848,000           35,544,000      42×
American options       38,400,000        720,000,000     19×
European options       32,000,000        7,080,000,000   221×
Bermudan swaptions     296               6666            23×
CDS                    432,000           13,904,000      32×
CDS bootstrap          14,000            872,000         62×

a single 1U MPC-X node can replace between 19 and 221 conventional CPU-
based units. The power efficiency advantage due to the dataflow nature of the
implementation also ranges between 1 and 2 orders of magnitude.

15.7.5 Credit value adjustment capital


Nowadays, banks are ever more focused on returns measured against their
regulatory capital requirements for OTC (over-the-counter) derivatives. This
has even led to the introduction of a valuation adjustment to represent such
costs, namely capital value adjustment (KVA) alongside other similar adjust-
ments such as credit value adjustment (CVA) and funding value adjustment
(FVA).
Future regulatory changes will increase the amount of regulatory capital that
banks are required to hold against counterparty credit risk (CCR) and CVA. Such
changes will make OTC derivative businesses yet more expensive and make
return on capital targets even harder to achieve. Many banks may decide
to minimize their activity in derivative businesses, only participating in key
business areas and those that service the needs of key clients (see Figure 15.22).
However, an alternative route to downsizing is to manage CVA more
actively. By doing this, banks can save capital and thus increase profitabil-
ity in two ways. First, by using more risk sensitive capital methodologies, it
will be possible to benefit from smaller regulatory capital charges. Second, by
actively managing CVA, further capital savings will be possible thanks to the
capital reducing benefit of CVA-related hedges allowed under these more risk
sensitive approaches. There will likely also be further intangible benefits of
taking a more proactive approach, such as improving the risk culture within
a firm. Banks pursuing this active management route will not only enhance

[Figure: capital (vertical axis, 0.0% to 3.0%) for the categories Current,
SA-CCR, BA-CVA, SA-CVA, and SA-CVA (managed), with legend entries 2017
and 2019.]

FIGURE 15.22: Estimated impact of future regulatory change on the capital
requirements for an uncollateralized 10-year interest rate swap denominated
in Euros.

their return on capital but should also be able to expand their OTC derivative
franchises relative to competitors who are more capital (and leverage ratio)
constrained.
A limiting factor for market participants is the computational complex-
ity of the CVA calculation, which requires an entire subportfolio of deriva-
tive assets to be projected through multiple scenarios to determine the
potential exposure to counterparty default. Since large portfolios of trades
between two parties can extend to tens of thousands of assets, each asset
must be projected until its maturity date (which can run as long as 30
years) and a large number of scenarios must be used to appropriately capture
tail risk, this quickly becomes a problem too large in scale for traditional
CPU-based implementations (including state-of-the-art and overclocked
systems).
Utilizing the DFE-accelerated components of the Maxeler RiskAnalytics
library, CVA calculations can be made practical at enterprise scale for large
financial institutions. Additionally, banks are also becoming more rigorous
about pricing capital via KVA. This requires the simulation of future capi-
tal requirements rather than just the calculation of spot capital. Under the
SA-CVA (footnote 4) approach, for example, this is particularly time-consuming because
this methodology is sensitivity based and so these sensitivities would need
to be calculated in all possible future states. Nevertheless, due to the con-
vexity of capital requirements (i.e., capital can go up more than it can go
down) such a calculation is extremely important to accurately quantify KVA.
Figure 15.23 shows such a capital projection run on Maxeler DFEs. The
bold red line is the expected value, often known as the expected capital
profile (ECP).
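Once capital has been simulated along each scenario, the expected capital profile is simply the average across scenarios at each point in time. The Python sketch below shows this final step only; the capital paths are invented numbers, and the expensive part (recomputing capital, and for SA-CVA the sensitivities, in every future state) is not modelled.

```python
def expected_capital_profile(paths):
    """Average capital across scenarios at each time step.

    `paths` is a list of scenarios; each scenario is a list of capital
    values on a common time grid.
    """
    n_scenarios = len(paths)
    return [sum(step) / n_scenarios for step in zip(*paths)]


# Three hypothetical capital paths (percent of notional) on a 4-point grid.
paths = [
    [2.0, 3.0, 5.0, 4.0],
    [2.0, 2.5, 3.0, 2.0],
    [2.0, 4.0, 7.0, 6.0],
]
ecp = expected_capital_profile(paths)
print(ecp)
```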

15.7.6 Standard initial margin model


Prior to the 2008 financial crisis, it was common for many derivative
contracts to be traded bilaterally without significant margin or collateral
requirements. Since the crisis, new regulations such as Dodd–Frank in the
United States and EMIR in Europe require many trades to be cleared
through a central clearing house, and impose significant margin require-
ments on uncleared trades. The methodologies for calculating these margin
requirements have not yet been fully standardized, which has led to dis-
agreements between counterparties on exactly how much margin to post. In
response, the International Swaps & Derivatives Association (ISDA) has cre-
ated the Standard Initial Margin Model (SIMM), with the first draft released
in September 2016.
The SIMM framework is based on first-order greeks to make it more com-
putationally tractable than methodologies such as Expected Historical VaR.
Trades are broken down into risk categories, and the sensitivities of each trade

Footnote 4: Standardized approach for credit valuation adjustment.



[Figure: capital (vertical axis, 0% to 16%) against time in years
(horizontal axis, 0 to 10).]

FIGURE 15.23: Capital evolution scenarios.

to risk factors, such as interest rates or equity prices, are computed and then
aggregated using a prescribed set of weights and correlations. One key
drawback of the framework as currently drafted is that the methodology to
calculate these sensitivities is not similarly prescribed, so counterparties may not
agree on the inputs to the model. Exposures may be netted within a single
counterparty portfolio, but are grossed up between counterparties, so firms
have a strong incentive to reduce initial margin requirements through careful
matching of trades and portfolio compression.
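The aggregation step can be illustrated with a generic weighted, correlated sum of sensitivities. The Python sketch below is a simplification written for this discussion: it shows only a single square-root-of-quadratic-form aggregation, the weights and correlations are invented, and the actual SIMM specification contains further layers (such as bucketing and concentration adjustments) that are not modelled here.

```python
import math


def aggregate_margin(sensitivities, risk_weights, correlations):
    """sqrt(sum_i sum_j rho_ij * WS_i * WS_j), with WS_k = weight_k * sensitivity_k."""
    weighted = [w * s for w, s in zip(risk_weights, sensitivities)]
    total = 0.0
    for i, ws_i in enumerate(weighted):
        for j, ws_j in enumerate(weighted):
            total += correlations[i][j] * ws_i * ws_j
    return math.sqrt(max(total, 0.0))   # guard against tiny negative rounding


# Two hypothetical risk factors with offsetting weighted sensitivities.
sens = [1.0, -0.5]
weights = [2.0, 4.0]
fully_correlated = aggregate_margin(sens, weights, [[1.0, 1.0], [1.0, 1.0]])
uncorrelated = aggregate_margin(sens, weights, [[1.0, 0.0], [0.0, 1.0]])
print(fully_correlated, uncorrelated)
```

The example shows why matching trades matters: with perfectly correlated risk factors the offsetting sensitivities net to zero, whereas with no correlation they do not offset at all.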
In order to calculate the initial margin requirement for a new trade, every
trade already existing between two counterparties must be revalued under
both a base scenario and a number of shocked scenarios, so that any risk off-
sets between the new trade and the existing portfolio are taken into account.
If this trading portfolio is already large, a significant performance cost can
be incurred simply transferring it from disk storage to memory—in a DFE-
accelerated solution, the portfolio can instead be stored in LMem and accessed
directly as needed. Combining this memory model with high-performance pric-
ing engines like that described in Section 15.7.2 allows for margin calculations
to occur at the speed of trading.
Maxeler’s SIMM calculation product on Amazon Web Services (AWS)
combines the ease of deployment of a cloud service with an industry-proven
risk analytics infrastructure. Our SIMM product splits naturally into the cal-
culation of sensitivities, and the application of risk weights and aggregation.
The Maxeler RiskAnalytics library provides the framework for calculating
greeks on CPUs as well as on Maxeler DFEs. Initial margin requirements can
be calculated directly from a portfolio of trades supplied in FpML format, or

indirectly by supplying sensitivity values from external models (e.g., a liability
exposure to be hedged).
The additional performance provided by DFEs also enables more complex
use cases, such as:

• Leveraging the cloud to allow a pair of counterparties to view the same
SIMM calculation before agreeing to post margin
• For a given trade, automatically selecting the counterparty with the most
offsetting risk for minimal margin requirements
• Identifying trades for portfolio compression
• Evaluating “what–if” trades and hedging strategies for margin impacts

Maxeler’s competitive advantage is twofold: (i) a performance
advantage of 10–50× for complex risk calculations, giving real-time results before
trading, and (ii) a transparent methodology, with the option of a source code
license for participating clients.

15.8 Conclusion
Cutting-edge applications in computational finance depend on highly capa-
ble computational systems while scaling over current computing technology is
becoming increasingly problematic. Maxeler has pioneered a novel, vertically
integrated, dataflow-oriented approach that can deliver orders-of-magnitude
improvements in performance, data center space, and power consumption for
a wide range of real, highly demanding applications. DFEs provide a highly
efficient computational model for data- and compute-intensive parts of an
application. In addition, DFE-based systems can be balanced with all other
computing system resources, for example, CPUs and storage, according to the
specific requirements of the application. Elastic scaling of these systems to the
public cloud to cope with peak performance demands is another key advan-
tage to be mentioned. Maxeler supports a high-level programming model that
allows application experts to harness the computational power of dataflow
systems and focus on optimizing their applications all the way from the for-
mulation of the algorithm down to the design of the best possible dataflow
architecture for its solution. This dataflow technology is key to many finance
applications where a more complex model, larger data sets, or more fre-
quent recomputation often directly translates into monetisable, competitive
advantage. A number of DFE-based products for analytics and trading are
available from Maxeler, and we described several practical application scenar-
ios that could not be achieved with conventional, general-purpose computing
technology.

References
1. Godfrey, M., Hendry, D.: The computer as von Neumann planned it. Annals of
the History of Computing, IEEE 15(1), 11–21, 1993.

2. Pell, O., Mencer, O.: Surviving the end of frequency scaling with reconfigurable
dataflow computing. SIGARCH Computer Architecture News 39(4), 60–65, 2011.

3. Chau, T.C.P., Niu, X., Eele, A., Maciejowski, J., Cheung, P.Y.K., Luk, W.:
Mapping adaptive particle filters to heterogeneous reconfigurable systems. ACM
Transactions on Reconfigurable Technology and Systems 7(4), 36:1–36:17, 2014.

4. Thomas, D.B., Luk, W.: Multiplierless algorithm for multivariate gaussian ran-
dom number generation in FPGAs. IEEE Transactions on VLSI Systems 21(12),
2193–2205, 2013.

5. Lindtjorn, O., Clapp, R., Pell, O., Fu, H., Flynn, M., Mencer, O.: Beyond tradi-
tional microprocessors for geoscience high-performance computing applications.
IEEE Micro 31(2), 41–49, 2011.

6. Weston, S., Spooner, J., Racanière, S., Mencer, O.: Rapid computation of value
and risk for derivatives portfolios. Concurrency and Computation: Practice and
Experience 24(8), 880–894, 2012.

7. Dennis, J.B.: Data flow supercomputers. Computer 13(11), 48–56, 1980.

8. Kung, H.T.: Why systolic architectures? Computer 15(1), 37–46, 1982.

9. Amazon Web Services: [Link]

10. Jin, Q., Dong, D., Tse, A.H.T., Chow, G.C.T., Thomas, D.B., Luk, W.,
Weston, S.: Multi-level customisation framework for curve based Monte Carlo
financial simulations. In: Reconfigurable Computing: Architectures, Tools and
Applications—8th International Symposium, ARC, pp. 187–201. Springer, 2012.

11. Fu, H., Gan, L., Clapp, R.G., Ruan, H., Pell, O., Mencer, O., Flynn, M.J.,
Huang, X., Yang, G.: Scaling reverse time migration performance through recon-
figurable dataflow engines. IEEE Micro 34(1), 30–40, 2014.

12. Pell, O., Bower, J., Dimond, R., Mencer, O., Flynn, M.J.: Finite-difference wave
propagation modeling on special-purpose dataflow machines. IEEE Transactions
on Parallel Distributed Systems 24(5), 906–915, 2013.

13. Gan, L., Fu, H., Yang, C., Luk, W., Xue, W., Mencer, O., Huang, X., Yang,
G.: A highly-efficient and green data flow engine for solving Euler atmospheric
equations. In: 24th International Conference on Field Programmable Logic and
Applications, FPL 2014, Munich, Germany, September 2–4, 2014, pp. 1–6. IEEE,
2014.

14. Arram, J., Luk, W., Jiang, P.: Ramethy: Reconfigurable acceleration of bisulfite
sequence alignment. In: Proceedings of the International Symposium on Field-
Programmable Gate Arrays (FPGA), pp. 250–259. ACM, 2015.
Chapter 16
Manycore Parallel Computation

John Ashley and Mark Joshi

CONTENTS
16.1 Introduction
16.1.1 What a practitioner needs to know
16.1.2 Outline of the chapter
16.2 The Parallelism Imperative
16.2.1 Moore’s Law
16.2.2 Dennard Scaling
16.2.3 Performance then and now
16.3 Systems Architecture
16.3.1 Building blocks
16.3.2 Basic CPU architecture
16.3.3 Basic GPU architecture
16.4 Parallelism and Performance
16.4.1 Amdahl’s Law
16.4.2 Gustafson’s Law
16.4.3 Little’s Law
16.4.4 Task parallel
16.4.5 Instruction parallel
16.4.6 Data parallel
16.5 Parallelism and Execution
16.5.1 Logical threading models
16.5.2 Physical execution models
16.5.3 Data structures
16.6 Putting It All Together
16.7 The LIBOR Market Model
16.7.1 The LMM in discrete time
16.7.2 Multiple regression
16.7.3 Packages and hardware
16.7.4 Design overview
16.7.5 Memory use, threads, and blocks
16.7.6 Path generation
16.7.7 Product specification and design
16.7.8 Least-squares and multiple regressions on the GPU
16.7.9 The data collection phase
16.7.10 The pricing phase
16.7.11 Speed comparisons and numerical results
16.8 Conclusion
References

16.1 Introduction
In this chapter, we will argue that there is a parallelism imperative: quants
must learn to write effective parallel code in order to take advantage of future
computing hardware. We provide a grounding in the basic computer science
and hardware considerations needed to explore effective parallelism in more
depth. These are, fortunately, far simpler than the typical financial mathemat-
ics encountered in computational finance. We spend the bulk of the chapter
applying parts of these foundational points in working a detailed example of
coding a nontrivial early exercise LMM problem on the GPU, which highlights
the key issues of writing highly parallel code. The results clearly highlight the
parallelism imperative—quants who leverage parallel execution well can gain
a significant advantage over competitors who do not.

16.1.1 What a practitioner needs to know


A practitioner of the art of computational finance (more usually known as
a quantitative analyst, financial engineer, or quant) could once have counted
on a solid knowledge of mathematical finance and a cursory knowledge of
programming to be successful. In today’s environment, however, very few
practitioners will thrive with anything less than solid programming skills. In
the future, the ability to efficiently use the compute resources available will
be a critical factor in judging the performance of practitioners—and so the
perspectives the practitioner needs to adopt will expand, to include that of
the computer engineer, the computer scientist, and the computational (tradi-
tional supercomputing or technical computing) scientist. The future of com-
putational finance will be written by professionals with a solid grounding in
mathematical finance, who also have the complete programming toolkit of the
modern HPC specialist at their fingertips.

16.1.2 Outline of the chapter


This chapter is divided into background and practice. It is possible to dive
directly into the practice section, but we hope that even those with a strong
existing background in parallel computing will still find some benefits in the
earlier sections. As background, we first present our argument, in Section 16.2,
based on current computer architecture and manufacturing trends, for the
existence of a parallelism imperative. We look at computer architecture in
Section 16.3. In Sections 16.4 and 16.5, we start to fill our parallel computing
design considerations toolbox. We switch gears in Section 16.6, and begin to
work an actual example, which we detail in Section 16.7.

16.2 The Parallelism Imperative


From a computer hardware engineering perspective, parallelism is the
future of computing. Power and its ever-present companion, heat, currently
conspire to limit the power that can practically be delivered to a single chip in commodity hardware to
around 300 W. Current materials, design, and manufacturing are beginning to
run into the physical limits of the materials and processes as we understand
them today. We will soon run into the theoretical physical limits of what can
be done with electrons. Engineers have been unable to break through the
power and frequency walls; instead they have circumvented them, lowering or
holding constant clock rates but adding more parallelism to build newer and
better chips. Quants need to understand the trends in the computer hardware
industry to design code that will be performant for more than one product
generation.

16.2.1 Moore’s Law


Moore’s Law takes its name from an observation by Gordon Moore (1965) and was, for many years,
taken to mean that because of underlying advances in hardware, software
would run twice as fast every 18–24 months. The true point of Moore’s Law,
however, is that the number of transistors on a single die will roughly double
every 18–24 months. There was a long period of time where both statements were
effectively equivalent, but that time has now passed.
Moore’s Law was primarily predicated on process improvements in the
manufacture of transistors. As the size of the building blocks of transistors
shrinks (process shrink), the number of transistors per unit area increases.
Additional process improvements have resulted, over time, in our being able
to economically produce larger silicon dies. Taken together these two manu-
facturing improvements have given us Moore’s Law.
Quantum effects dictate that there is a lower bound to the size of a tran-
sistor. Current public vendor roadmaps run out to features in the 7–10 nm
range; the quantum limit is normally quoted as being around 1 nm—this is on
the order of the scale of the wavelength of an electron. Process shrink, while
not yet over, cannot continue forever.
While transistor counts are still following Moore’s Law, the performance
of most computer programs is increasing much more modestly—unless those
programs can exploit the new levels of parallelism being introduced with each
chip generation to keep pace with Moore’s Law. Why is that?

16.2.2 Dennard Scaling


Another benefit of smaller feature sizes is that smaller features inherently
have lower electrical capacitance. Lower capacitance helps allow higher fre-
quencies at the same power level. Additionally, improvements in materials
science and transistor design, coupled with the ability to fit more complex
circuits in the same area, meant that chip operating voltages could also be
lowered. Again, lower voltage equals lower power dissipation, which can be
used to enable higher frequency (Dennard and Gaensslen, 1974).
This Dennard Scaling, coupled with Moore’s Law, meant that for a long
time, programs just got faster with every new microprocessor generation. Tun-
ing code to be efficient was frequently not required, as it would simply run
twice as fast next year anyway.
Unfortunately, physics giveth, and physics taketh away. Materials science,
fabrication technology, and transistor design have reached a point where
significant voltage reduction is no longer possible. This is due to leakage current:
no transistor is perfect, and maintaining correct operation in the face of leakage
current requires a certain minimum voltage plus some safety margin. Without
voltage scaling, the energy savings from the lower capacitance
of a process shrink do not outweigh the increase in energy dissipation
per unit area caused by packing in more transistors. To hold total
power usage constant, frequency had to be reduced (or at least could not be
increased). Code suddenly ran slower on the newest hardware. What could
the industry do?
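The argument above can be made concrete with the standard first-order model of CMOS switching power, in which power is proportional to capacitance times voltage squared times frequency. This formula is not stated in the text; it is the usual textbook approximation, and the function names and normalised units below are our own assumptions. The sketch compares relative power per unit area after a shrink of linear dimensions by a factor 1/k, with and without voltage scaling.

```python
def dynamic_power(capacitance, voltage, frequency):
    """First-order CMOS switching power, up to a constant activity factor."""
    return capacitance * voltage ** 2 * frequency


def power_density_after_shrink(k, voltage_scales, frequency_scale=None):
    """Relative power per unit area after shrinking linear dimensions by 1/k.

    All baseline quantities are normalised to 1 (an assumption made for
    illustration). Capacitance scales by 1/k and transistor area by 1/k**2.
    Voltage scales by 1/k only if voltage_scales is True. Frequency scales
    by k unless an explicit frequency_scale is supplied.
    """
    capacitance = 1.0 / k
    voltage = 1.0 / k if voltage_scales else 1.0
    frequency = k if frequency_scale is None else frequency_scale
    area = 1.0 / k ** 2
    return dynamic_power(capacitance, voltage, frequency) / area


if __name__ == "__main__":
    k = 2.0
    # Classic Dennard scaling: voltage shrinks with the feature size.
    print(power_density_after_shrink(k, voltage_scales=True))
    # Voltage stuck at its floor, frequency still raised by k.
    print(power_density_after_shrink(k, voltage_scales=False))
    # Voltage stuck at its floor, frequency held constant.
    print(power_density_after_shrink(k, voltage_scales=False, frequency_scale=1.0))
```

With k = 2 the three cases give relative power densities of 1, 4, and 2: under this simple model, once voltage stops scaling, even holding frequency constant is not enough to keep power per unit area flat.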

16.2.3 Performance then and now


In the single-core, single-threaded era of computing, the benefits of Moore’s
Law and Dennard Scaling were generally spent in three ways—increasing fre-
quency, increasing cache sizes, and adding new hardware functions.
Frequency increases had obvious benefits for any code that was spending
most of its time executing instructions. This was the rising tide that lifted all
boats.
Increasing the number and size of caches was important—as core compute
frequencies increased, memory, disk, and IO subsystems couldn’t keep up.
Caches enabled the CPU to hide some of these growing imbalances in IO and
continue to compute.
Adding new hardware functions had dramatic impacts on codes that had
previously used software emulation or multiple software steps to perform the
same task. Floating point math is an obvious example: originally, there was
no hardware math acceleration on x86 architecture chips; then the math coprocessor
was introduced. This was originally on a separate chip, but Moore’s
Law meant that those transistors could be moved onto the main CPU die.
Today, video encode and decode, cryptographic, and graphics features are all
routinely included as new hardware functions.
When frequency could no longer scale, the obvious things to do were to
make bigger caches and add more functions. But beyond a certain point,
caches give diminishing performance returns for many codes; and unless spe-
cial function hardware is doing something ubiquitous, the additional benefits
of new features apply to smaller and smaller portions of a smaller number of
codes—again, diminishing returns.
The answer was twofold: provide more general-purpose cores and, by using
vector instructions, make those cores “wider” for some math and logic operations.
CPUs became dual core, then quad core. SSE and AVX evolved to allow
SIMD-style operations on adjacent data in a vector (hence vector units). Current
CPUs can have a dozen or more cores with vector units up to eight
double precision values wide (eight lanes); current GPUs have thousands of
cores (which are a bit like a cross between a core and a vector lane). Cache
sizes continue to grow but more slowly when measured on a per core basis.
Today, achieving maximum performance on a modern architecture requires
careful use of parallelism and caches to ensure that all the cores are being fed
data as fast as they can process it. And as every new generation becomes more
parallel at a hardware level, software will see a performance gain only if it
is written with more parallelism in mind than today’s hardware
delivers; otherwise it will be obsolete in one or two hardware generations.

16.3 Systems Architecture


From a computer system engineer’s perspective, the early days of the
computing industry were dominated by bespoke hardware; with advances in
hardware and software, more homogeneous architectures and systems became
dominant. Today, driven by the commodity availability of compute accel-
erators and generally increasing levels of parallelism, the architectures that
designers need to be able to exploit are again becoming more varied and more
parallel. Quants need to know enough about the organization of the hardware
to ensure they are using it effectively.

16.3.1 Building blocks


There are many courses in computer architecture, and many levels of detail
can be used to analyze various products. This section presents a simplified
set of components that can be used to help us think about performance on
multiple architectures.
Logically, we will switch between the viewpoint of a program, a task, and
a thread as it suits us. For our purposes, a program, abstractly, is a collection
of partially ordered tasks that transform data. A task takes a collection of
input data and maps it to output data—for our purposes, a task doesn’t
share intermediate results unless it does so explicitly (via message or shared
memory region). A task can be implemented by one or more threads. We will
see that CPUs and GPUs map this logical model to hardware differently.
Threads are executed on a core, which may have one or more vector lanes.
All lanes perform the same mathematical or logical operation, each on a single
operand of a multioperand vector. Cores may be organized into groups that
share some resources (like caches).
We define a number of properties of data storage—proximity to compute
resources (latency), rate of data movement (bandwidth), organization of data
movement, and size. Typically, modern systems will have layers of storage
that increase in latency and size, while decreasing in bandwidth, and sharing
a common data movement organization through the majority of the layers.
The layer nearest to the cores is usually called the registers and the layer
furthest from the cores is called system memory. Layers in between are called
caches and are usually distinguished by a number indicating distance in layers
from the cores (2–4 layers are typical today). A useful rule of thumb is to
expect nearer layers to have an order of magnitude less latency and a factor
of 1–5 more bandwidth than further layers. The relationship of size of a layer
to location relative to the cores is architecture specific. In most CPUs today
it is a strict pyramid, whereas in many GPUs it is an hourglass: there can
be as much or more memory in the register file than in the nearest cache layer
(with lower latency and more bandwidth).
Until it moves into registers, data is usually requested and moves through
the system in chunks of memory called cache lines—these are generally multi-
ple operands (many bytes) in size. To simplify the circuitry involved in memory
management, cache lines always start on some aligned memory boundary—for
example, if a cache line is 128 bytes long, the starting address of all cache lines
will be an integer multiple of 128. This “chunkiness” of memory access can be
exploited to gain more performance.
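As a small illustration of this “chunkiness”, the sketch below counts how many 128-byte cache lines a set of 8-byte operand reads would touch. The helper names, the base address, and the access patterns are hypothetical choices made for illustration; only the 128-byte line size comes from the example in the text.

```python
CACHE_LINE_BYTES = 128  # line size used in the text's example


def line_start(address, line_bytes=CACHE_LINE_BYTES):
    """Starting address of the aligned cache line containing this byte."""
    return address - (address % line_bytes)


def lines_touched(addresses, operand_bytes=8, line_bytes=CACHE_LINE_BYTES):
    """Number of distinct cache lines needed to read one operand at each address."""
    lines = set()
    for address in addresses:
        # An operand may straddle two lines, so check its first and last byte.
        lines.add(line_start(address, line_bytes))
        lines.add(line_start(address + operand_bytes - 1, line_bytes))
    return len(lines)


BASE = 1024  # an aligned (multiple of 128) hypothetical base address

# Sixteen adjacent doubles starting on a line boundary.
contiguous = [BASE + 8 * i for i in range(16)]
# Sixteen doubles, each 128 bytes apart.
strided = [BASE + 128 * i for i in range(16)]
# Sixteen adjacent doubles starting 4 bytes past a line boundary.
misaligned = [BASE + 4 + 8 * i for i in range(16)]

if __name__ == "__main__":
    print(lines_touched(contiguous), lines_touched(strided), lines_touched(misaligned))
```

The contiguous, aligned pattern is served by a single cache line, the strided pattern needs sixteen, and the misaligned pattern spills into a second line. The same sixteen operands can therefore cost very different amounts of memory traffic depending on layout.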
Specialized hardware determines if a memory request can be served from
a nearby cache (cache hit) or has to come from more distant memory (cache
miss). The memory subsystem is coherent (all CPU and some GPU memory)
if multiple threads always see consistent values for a single memory address.
Disk and network IO can be thought of as additional layers beyond sys-
tem memory with multiple orders of magnitude more latency and 1 or 2
orders of magnitude lower bandwidth. They aren’t terribly relevant to this
discussion.

16.3.2 Basic CPU architecture


A typical Intel CPU can have a dozen or more cores per chip (socket),
and the server will have two or sometimes four sockets. Each core has a small
local set of registers (capable of supporting one or two active threads at a time)
and a local L1 cache. L2 caches are shared between a subset of the cores, and
there is a chip wide shared L3 cache. These caches will be coherent—the mem-
ory subsystem ensures consistency across all layers of the cache and between
sockets. Each core will have a dedicated set of 4/2 or 8/4 single/double
precision vector lanes.
CPUs based on other architectures such as ARM or IBM Power will
have different characteristics, but these are currently less commonly used in
finance and are out of the scope of this chapter, as are systems based on
AMD’s x86 chips.

16.3.3 Basic GPU architecture


From a compute perspective, the majority of GPUs deployed are from
NVIDIA, and so our discussion will focus on NVIDIA hardware. The details
of hardware from AMD and other vendors would be quite different,
but many of the concepts would remain the same.
Each core has a large local set of registers (capable of supporting 16 or
more active threads per core). Each group of 32 cores shares some instruc-
tion control hardware. While the number varies by hardware generation, the
Maxwell (2014+) generation of NVIDIA processors will have sets of 128 cores
sharing a private, scratchpad memory space, and a coherent L1 cache; from
one to several dozen of these groups (called Streaming Multiprocessors or
SMs) form a single GPU and share L2 cache and system memory. Each GPU
card has a global memory area called device memory, usually between 4 and
12 GB in size.

16.4 Parallelism and Performance


A computer scientist’s perspective has evolved with computer hardware.
Increasing performance meant that some performance could be traded off
against other goals like system portability and code maintainability. Even-
tually, the abstractions and isolation of the programmer from the hardware
meant that a significant amount of performance could be regained by writing
more performance efficient code. This code can maintain much of the abstrac-
tion, portability, and maintainability of existing code but fundamentally needs
to be better adapted to the increasing parallelism of systems and ensure that
it more effectively uses any hardware features that are present. A quant needs
to be able to think about how the organization of code and data will interact
with hardware and other layers of software, today and tomorrow, to build
durable code that performs well.

16.4.1 Amdahl’s Law


Amdahl’s Law (Amdahl, 1967) is effectively the law of diminishing returns
for parallelization of code. If the parallelizable fraction of the execution
time of the code is $P$, and the total execution time is $T_{\text{old}}$, then if you can
infinitely accelerate the parallel portion of the code, or equivalently, drive
the execution time of the parallel portion of the code to zero, then the new
execution time is given by

$$T_{\text{new}} = (1 - P)\, T_{\text{old}},$$

and the maximum achievable parallel speedup is

$$T_{\text{old}} / T_{\text{new}} = 1/(1 - P).$$

This is closely related to the concept of strong scaling, which looks at the ratio
of improvement in compute time to the number of processing units engaged in
the computation. Normally, there is a point where strong scaling breaks down,
as communications and coordination costs overwhelm the actual computation.
The strong scaling properties of a code on a system limit our ability to achieve
the theoretical Amdahl limit. Combined with Gustafson’s Law, below, this is
extremely useful in defining expectations for performance on existing and future
hardware.
From a parallel speedup perspective, if a code spends 90 seconds in poten-
tially parallel sections and 30 seconds in serial sections, a 3× speedup of the
parallel code reduces execution time from 2 minutes to 1 minute (30 + 90/3 = 60 seconds). If in a second
phase of the performance tuning the parallel section is accelerated
by 9× relative to the original, the overall elapsed time goes to 40 seconds (30 + 90/9). This highlights the law of
diminishing returns from a performance perspective.
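The arithmetic of this example can be checked with a few lines of code. This is a minimal sketch; the function names are ours, and the model assumes the serial and parallel sections do not overlap in time.

```python
def elapsed_time(serial_seconds, parallel_seconds, parallel_speedup):
    """Elapsed time when only the parallel section is accelerated."""
    return serial_seconds + parallel_seconds / parallel_speedup


def amdahl_limit(parallel_fraction):
    """Maximum speedup when the parallel section takes no time at all."""
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    serial, parallel = 30.0, 90.0
    for speedup in (1, 3, 9):
        print(speedup, elapsed_time(serial, parallel, speedup))
    # Parallel fraction of the original run time is 90 / 120 = 0.75.
    print(amdahl_limit(parallel / (serial + parallel)))
```

For this code the parallel fraction is 0.75, so no amount of parallel hardware can deliver more than a 4× overall speedup: the 30 serial seconds remain however fast the parallel section becomes.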
From a strong scaling perspective, if the problem at hand is to evaluate
32,000 paths, and the parallelism of the problem is at the path level, then
adding threads up to 32,000 could result in some performance improvement—
any threads/cores added after that have no useful work to do and contribute
no additional performance. Strong scaling performance cliffs are not always
this dramatic.
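The cliff in the path-level example can be sketched with an idealised model in which every path costs one unit of time and threads process paths in lockstep rounds. Both assumptions are ours, made for illustration, and the model ignores communication and coordination costs entirely.

```python
def path_level_speedup(n_paths, n_threads):
    """Idealised speedup over one thread when each thread handles whole paths.

    Each round, every thread evaluates at most one path, so the run takes
    ceil(n_paths / n_threads) rounds instead of n_paths.
    """
    rounds = (n_paths + n_threads - 1) // n_threads  # integer ceiling
    return n_paths / rounds


if __name__ == "__main__":
    for threads in (1_000, 16_000, 24_000, 32_000, 64_000):
        print(threads, path_level_speedup(32_000, threads))
```

Under this model the speedup is 1,000 with 1,000 threads and 32,000 with 32,000 threads, but 64,000 threads do no better than 32,000. Going from 16,000 to 24,000 threads also gains nothing, because two rounds are still needed.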

16.4.2 Gustafson’s Law


Gustafson’s Law (Gustafson, 1988) is related to Amdahl’s Law, but
whereas Amdahl’s Law focuses on the limit of acceleration of a problem of
constant size, Gustafson’s Law addresses how much larger a problem can be
solved in the same unit of time via parallelization. This is related to the con-
cept of weak scaling—how much more work can we do with more parallel
processing units. Eventually, communications and coordination overhead will
overwhelm any increase in efficiency of a code on a system and prevent us
from reaching the Gustafson limit.
From a theoretical perspective, evolution of Monte Carlo paths has perfect
weak scaling in the number of cores (more cores, more paths) until some other
system limit is hit (memory or disk space or bandwidth) that prevents full
utilization of additional cores.
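The contrast between the two laws can be sketched numerically. The text does not state Gustafson’s formula; the version below is the standard formulation, in which the serial fraction is measured on the parallel system, and the function names are our own.

```python
def gustafson_speedup(serial_fraction, n_processors):
    """Scaled speedup: the problem grows so run time stays fixed."""
    return serial_fraction + (1.0 - serial_fraction) * n_processors


def amdahl_speedup(serial_fraction, n_processors):
    """Fixed-size speedup: the same problem is spread over more processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)


if __name__ == "__main__":
    for n in (4, 64, 1024):
        print(n, amdahl_speedup(0.25, n), gustafson_speedup(0.25, n))
```

With a serial fraction of 0.25, the fixed-size (Amdahl) speedup never exceeds 4, while the scaled (Gustafson) speedup keeps growing roughly linearly with processor count, because the extra processors are used to do more work rather than the same work faster.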

16.4.3 Little’s Law


Little’s Law (Little, 1961) comes from queuing theory. Our interest lies
in maximizing our use of bandwidth (or throughput)—if a memory system
has a request (cache line) size of C bytes, latency to respond to a request of
L seconds, and a usable bandwidth W in B/s, then the number of memory
requests outstanding to maximize bandwidth used is WL/C. Relative to CPU
memory, GPUs tend to have higher values of W, similar or larger values of C,
and longer (higher) latency L—meaning that more threads need to have more
outstanding memory requests at any point in time to saturate the memory
system. Fortunately, GPUs support many more active threads than CPUs and
so this is usually not a problem. This same law can apply to any pipelined
compute element.
For example, assume a math pipeline 4 clock cycles deep. If the code only
has one math operation to do, it submits it to the math processor at time t = 0,
and gets a result back at time t = 3, using only 1/4 of the available pipeline. If,
on the other hand, it can submit math operations at times t = 0, 1, 2, 3, . . . , n
then it will get results at t = 3, 4, 5, 6, . . . , n + 3, getting full usage of the
pipeline from t = 3 to t = n.
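Both uses of Little’s Law above reduce to simple arithmetic, sketched here. The function names and the memory-system figures in the usage lines are hypothetical values chosen for illustration, not measurements of any particular device.

```python
def outstanding_requests(bandwidth_bytes_per_s, latency_s, request_bytes):
    """Requests that must be in flight to saturate bandwidth: W * L / C."""
    return bandwidth_bytes_per_s * latency_s / request_bytes


def pipeline_utilization(n_ops, depth):
    """Fraction of pipeline slots used when n_ops are issued back to back.

    Following the convention in the text, an operation issued at cycle t
    completes at cycle t + depth - 1, so the run occupies n_ops + depth - 1
    cycles in total.
    """
    return n_ops / (n_ops + depth - 1)


if __name__ == "__main__":
    # Hypothetical memory system: 256 GB/s, 500 ns latency, 128-byte lines.
    print(outstanding_requests(256e9, 500e-9, 128))
    # One isolated operation versus a long stream in a 4-deep pipeline.
    print(pipeline_utilization(1, 4), pipeline_utilization(1000, 4))
```

A single operation uses a quarter of the 4-deep pipeline, while a stream of 1,000 back-to-back operations uses almost all of it. The hypothetical memory system needs on the order of a thousand requests in flight to run at full bandwidth.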

16.4.4 Task parallel


In our definitions above, it is obvious that two programs can be run inde-
pendently and in parallel—we ignore for now higher level data flow dependen-
cies between jobs. We also define a task in such a way that multiple threads
may coordinate to execute the task, with only explicit coordination required
between those threads. Any section where multiple threads can execute large
sections of code between any required communication or coordination can be
thought of as task parallel. This is the most commonly used multicore CPU
programming paradigm, but as the number of tasks grows, managing the
coordination and communications between threads can become either very
complicated or must be hidden behind abstractions that result in lower over-
all performance. Map and reduce style algorithms, for example, identify task
parallel steps and then synchronize all threads expensively on disk (as in
Hadoop) or in memory (as in Spark). Traditional HPC codes might use the
Message Passing Interface (MPI) to coordinate many hundreds or thousands
of threads, but finance code rarely requires such large processor sets to cooperate
so closely on a single task. Indeed, many production finance codes are
single threaded today.
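The map-then-synchronize structure described above can be sketched with the Python standard library. This is an illustration of the pattern only: the payoff, the seeds, and the function names are our own assumptions, and because of Python’s global interpreter lock a thread pool shows the structure of the coordination rather than any real speedup for compute-bound work.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_chunk(job):
    """One independent task: sum max(Z, 0) over a chunk of normal draws."""
    seed, n_paths = job
    rng = random.Random(seed)  # private generator, so tasks share no state
    return sum(max(rng.gauss(0.0, 1.0), 0.0) for _ in range(n_paths))


def estimate(n_tasks=4, paths_per_task=10_000, workers=4):
    """Map the tasks over a pool, then reduce the partial sums once."""
    jobs = [(seed, paths_per_task) for seed in range(n_tasks)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The only coordination point: wait for every task to finish.
        partial_sums = list(pool.map(simulate_chunk, jobs))
    return sum(partial_sums) / (n_tasks * paths_per_task)


if __name__ == "__main__":
    print(estimate())
```

Each task runs a large section of code with no communication, and the threads meet only once, at the reduction. Because every task owns its seed and results are combined in a fixed order, the estimate does not depend on the number of workers.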

16.4.5 Instruction parallel


Instruction Level Parallelism (ILP) is very common in today’s comput-
ing environment, but is rarely directly seen by programmers, as it is usually
exploited by compilers and dedicated hardware functions. When the compiler
builds an internal representation of the data and instruction dependencies of
a section of code, it can identify generated instructions that can be performed
in any order so long as they are performed prior to a certain point. This can
result in the hardware being able to execute multiple instructions in different
units simultaneously or may allow scheduling instructions more tightly into
pipelined instruction units.

16.4.6 Data parallel


Data Parallelism can be exploited when the same instruction sequence
needs to be applied to a large amount of data with limited or well-structured
dependencies between output data elements. For example, calculating the sum
of two long vectors, the product of two dense matrices, or the membership
function of the Mandelbrot set all exhibit data parallelism. The additional
structure of a data parallel problem can be exploited, allowing many more
processing units to contribute to the work with a reduction in communication
and coordination overheads. Pure data parallelism leads to “embarrassingly
parallel” problems which generally have good strong and weak scaling prop-
erties. In finance, Monte Carlo codes are the poster child for embarrassing
parallelism, but PDE and even adjoint methods can be implemented in a
highly data parallel way.
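Two of the examples named above are sketched here in plain Python to show what makes them data parallel: every output element is produced by the same instruction sequence and depends only on its own inputs. The function names and the iteration cap are our own choices. On a GPU each element would typically be assigned to its own thread; here a list comprehension stands in for that mapping.

```python
def vector_sum(x, y):
    """Element-wise sum: output i depends only on x[i] and y[i]."""
    return [a + b for a, b in zip(x, y)]


def in_mandelbrot(c, max_iter=50):
    """Approximate membership test: iterate z -> z*z + c and watch for escape."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return False  # escaped, so c is outside the set
    return True  # did not escape within max_iter iterations


def membership(points, max_iter=50):
    """Apply the same test independently to every point."""
    return [in_mandelbrot(c, max_iter) for c in points]


if __name__ == "__main__":
    print(vector_sum([1, 2, 3], [10, 20, 30]))
    print(membership([0j, -1 + 0j, 1 + 0j, 2 + 2j]))
```

No element needs a result from any other, so the work can be split across as many processing units as there are elements without any communication between them. The Mandelbrot test also shows a common wrinkle: elements may take different numbers of iterations even though they run the same code.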

16.5 Parallelism and Execution


In the traditional supercomputing space, the computational scientist has
an evolving role that sits between a pure domain scientist and a pure program-
mer. The computational scientist’s perspective includes an understanding of
how the discrete mathematics of computers interacts with the pure mathemat-
ics of models, and also understanding how algorithms and code are mapped to
hardware and executed. Both these perspectives are used to tune programs to
study some of the largest scientific and engineering questions. A quant needs
to understand issues of algorithmic and discrete mathematical accuracy, sta-
bility, and performance, just like a computational scientist.

16.5.1 Logical threading models


The logical threading model on the CPU is likely familiar to readers—but
we present a broad summary here in order to contrast it with the model on a
GPU. A CPU thread may be independent or may be part of a group of several
threads under a single process. Generally, a CPU thread has a relatively large
startup and teardown cost. It can be actively resident on a core in the register
set, or it can be swapped out, with its state stored in main memory. Swapping
in and out from main memory is expensive; swapping between threads resident
in the register file is fast. Because of the limit on the number of resident
threads, the compiler and the hardware need to ensure that a given thread
completes as much work as possible, and keeps the CPU core as busy as
possible, while it is active. Logically, threads have great flexibility in passing
messages and sharing memory, with a few more options for threads within a
given process than for threads between processes. CPU threads thrive on task
parallelism, have dedicated hardware to take advantage of ILP, and if written
to take advantage of the vector units, have some ability to leverage them for
data parallelism as well. Highly performant code must take advantage of all
three types of parallelism.
A logical thread on a GPU is quite different. A GPU thread is always part
of a hierarchy that includes a local thread block and a set of thread blocks
called a grid (which in our definitions here is a task). A GPU thread is a very
lightweight construct, with the majority of startup and teardown managed in
hardware. GPU threads have three main states—waiting to be active, active
and in the register set of a core, and completed. All the threads in a local
thread block are in the same main state; and threads within an active thread
block can communicate and synchronize through a variety of mechanisms.
Threads that are not in the same thread block can only communicate through
the main device memory. GPUs excel at data parallelism, although they can
take limited advantage of ILP and, particularly with more recent GPUs, can
exploit task parallelism as well. The logical threading model for the GPU does
not allow the programmer to assume or force any particular ordering for the
execution of thread blocks.
Whereas efficient CPU code will normally have one or two active threads
per core, efficient GPU code will have many active threads per core—and
with many more cores. Cooperative CPU codes may use dozens of threads,
but GPUs will not function efficiently without thousands of threads, and rou-
tinely are used with millions of logical threads. This is very much due to
differences in the complexity of the large CPU cores versus the smaller but
more numerous GPU cores (bringing Amdahl’s and Gustafson’s Laws into
play) and the latency and bandwidth of the memory subsystems (bringing in
Little’s Law).
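The three laws named above have simple closed forms, sketched here in Python. The function names are illustrative, and any numbers fed to them (parallel fraction, latency, bandwidth) are hypothetical inputs rather than measurements of a particular device.

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Amdahl's Law: speedup of a fixed-size problem on n_units processors."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_units)


def gustafson_speedup(parallel_fraction, n_units):
    """Gustafson's Law: scaled speedup when the problem grows with n_units."""
    return (1.0 - parallel_fraction) + parallel_fraction * n_units


def bytes_in_flight(latency_seconds, bandwidth_bytes_per_second):
    """Little's Law: concurrency required = latency x throughput."""
    return latency_seconds * bandwidth_bytes_per_second
```

With a hypothetical 95% parallel fraction and 1000 units, Amdahl's Law caps the speedup below 20 while Gustafson's scaled speedup is about 950, which is one way to see why many small cores pay off mainly on large problems. Little's Law explains the need for thousands of threads: the longer the memory latency at a given bandwidth, the more independent requests must be outstanding to keep the memory system busy.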

16.5.2 Physical execution models


On a CPU, the physical execution model of a thread is familiar—a process
is launched, it spawns whatever threads it needs, and each thread starts to
do its work. It swaps in and out of the core, and may execute on many dif-
ferent cores over its lifetime (unless limited by the operating system). It will
swap out whenever it is bumped by another thread that is now entitled to a
timeslice on the core either by virtue of priority or elapsed time. Hardware
and software will aggressively attempt to use branch prediction, speculative
execution, and cache prefetching to avoid allowing idle time while the thread
is resident on a core.
When a core needs data that isn’t already in the local register file, it will
normally begin a cascade of checks down the cache hierarchy until the memory
management subsystem finds the data it is looking for; this is then hoisted
through the various cache layers into the local register file where it can be
used as an operand in instructions. Writes that spill out of the register file or
are directed at memory proceed to invalidate caches in other cores until they
reach system memory, causing cache synchronization traffic.
On a GPU, things are less familiar but in many ways simpler. When a
task (kernel) is launched onto the GPU by a CPU side thread, a grid of GPU
threads is created. The threads are grouped into thread blocks, and each
thread block is independently scheduled onto a single SM. Depending on the
hardware and resources required, several thread blocks may be resident on
an SM at a given time and many SMs can be servicing thread blocks from
a grid at the same time. Once a thread block is active, it will remain active
on that SM until all its threads have exited, having processed their section of
the overall dataset. As a programmer, you can neither control nor assume the
order in which thread blocks execute, nor can you communicate, coordinate,
or synchronize with other thread blocks except through global device memory.
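The arithmetic behind this decomposition can be sketched in Python. The index formula mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`; the helper names and the block size of 256 used in the usage note are illustrative assumptions.

```python
def grid_size(n_items, threads_per_block):
    """Number of thread blocks needed so that every item gets a thread."""
    return (n_items + threads_per_block - 1) // threads_per_block


def global_index(block_idx, thread_idx, threads_per_block):
    """Flat index of a thread within a one-dimensional grid."""
    return block_idx * threads_per_block + thread_idx


def items_for_block(block_idx, n_items, threads_per_block):
    """Items handled by one block; the result does not depend on block order."""
    start = global_index(block_idx, 0, threads_per_block)
    return range(start, min(start + threads_per_block, n_items))
```

For example, 327,680 items with 256 threads per block need `grid_size(327680, 256)` = 1280 blocks. Since each block derives its slice of the data from its own index alone, the blocks can run in any order and still cover every item exactly once.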
Memory management on the GPU presents several more options that the
programmer can consider. Reads and writes to and from device global mem-
ory behave similarly to CPU reads and writes. Alternatively, however, a thread
block can declare a small area of memory to be shared memory. This creates
a private allocation of memory, physically drawn from the same pool as the L1
cache, that is visible to all the threads in the thread block and to no threads
outside that thread block. Because this data is very near to the cores in the
memory hierarchy, it is extremely fast. And because it is not coherent with
any other memory in the system, it pays no synchronization or more distant
memory management costs. It is effectively a private working space for the
threads in a thread block to share data.
Drawing on their graphics heritage, GPUs also have additional paths from
global device memory to the cores that bypass some or all of the conventional
caches—the most commonly used are the constant cache and texture cache.
The constant cache is now well leveraged by the compiler and, as its name
suggests, is well suited to delivering a very small number of values to all the
threads in the SM. The texture cache is another path to memory that has
slightly different caching behavior from the standard caches. The usefulness
of the texture cache to the programmer changes from generation to generation
on the GPU. Some algorithms and data access patterns can make good use of
the texture cache.

16.5.3 Data structures


Earlier in Section 16.5 we discussed cache lines and alignment, and noted
that there are potential performance advantages to be gained by being aware
of these points. On a CPU, these considerations are most important when
using the vector units but also can help performance in general; on a GPU
they are critical to gaining maximum performance because of the data parallel
nature of the device.
There are two interesting and common cases where this can be observed.
In the first case, a container or array of objects or structures is stored. In
a simple example, assume a trade has fields of symbol (4 bytes), trade date
(8 bytes), trade time (8 bytes), exchange (2 bytes), quantity (4 bytes), price
(8 bytes), fees (8 bytes)—a total of 42 bytes. Assume we have a large number
of trades, and a cache line size of 64 bytes. We need to get the net total
position and value by symbol by day, and we assign each thread to a symbol.
The first thread requests the first record, and assuming the array of structures
started on a cache line boundary (is 64-byte aligned), it gets the first record
plus 22 bytes of the second record delivered to it. Of the bytes delivered in the
first cache line, it uses the symbol, trade date, quantity, price, and fees—32
bytes. If 64 bytes were delivered and only 32 bytes were used, the effective
bandwidth is 1/2 what could have been achieved with a more efficient data
structure and assignment of work to threads.
If there were many threads in parallel sharing the data request, as there are
in the GPU, then the second record’s bytes would be partially used and the
efficiency might rise to 32/42 on average. Arranging the data as a structure
of arrays—an array of symbols, then an array of trade dates, and so forth—
means that in the case where there are parallel threads, the unused fields are
never read and every byte moved is used by a thread—maximally efficient.
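The byte counts in this example can be checked with a short Python sketch. The field type codes are assumptions chosen only to reproduce the sizes stated above; the text does not specify the actual types.

```python
import struct

# Packed (unpadded) layout of the trade record described in the text.
# The concrete type codes are illustrative assumptions.
FIELDS = [
    ("symbol", "4s"),     # 4 bytes
    ("trade_date", "q"),  # 8 bytes
    ("trade_time", "q"),  # 8 bytes
    ("exchange", "h"),    # 2 bytes
    ("quantity", "i"),    # 4 bytes
    ("price", "d"),       # 8 bytes
    ("fees", "d"),        # 8 bytes
]
USED = ("symbol", "trade_date", "quantity", "price", "fees")


def field_size(code):
    return struct.calcsize("<" + code)  # "<" disables alignment padding


RECORD_BYTES = sum(field_size(code) for _, code in FIELDS)
USED_BYTES = sum(field_size(code) for name, code in FIELDS if name in USED)


def aos_efficiency(bytes_delivered):
    """Fraction of the delivered bytes that the calculation actually uses."""
    return USED_BYTES / bytes_delivered
```

Here `aos_efficiency(64)` gives the 1/2 of the single-thread case and `aos_efficiency(RECORD_BYTES)` gives the 32/42 of the shared-request case. With a structure of arrays only the five used arrays are read, so the corresponding ratio is 1.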
Consider another case: two- or higher dimensional arrays are commonly
used in financial mathematics. For simplicity, assume we have a 2D path array,
V , indexed by path and time—V [p, t]. If we have paths A, B, C and times 0, 1,
2, in C compatible languages these are stored in row major order in memory.
So in memory, we would see A0, A1, A2, B0, B1, B2, C0, C1, C2. If each
element is 16 bytes and the cache line size is 64 bytes, then the first cache line
contains A0, A1, A2, B0, the second contains B1, B2, C0, C1, and the third
contains C2, junk, junk, junk.
If each thread is processing one path, for instance in path generation, then
the first thread needs the first cache line, the second thread needs the first
and second cache lines, and the third thread needs the second and third cache
lines—3 threads generate 5 requests to memory and only use 0.75, 0.25, 0.5,
0.5, 0.25 of the returned data.
If each thread is processing a time step, for instance doing a potential
future exposure calculation, then the first thread needs the first and second
cache lines, the second thread needs the first and second cache lines, and the
third thread needs the first, second, and third cache lines. This is 7 requests
to memory, using 0.5, 0.25, 0.25, 0.5, 0.25, 0.25, 0.25 of the returned data.
With array padding, rather than declaring V [3, 3], you could declare it to
be V [3, 3 + 1]. What happens then? We get memory of A0, A1, A2, pad, B0,
B1, B2, pad, C0, C1, C2, pad. From our first example, each thread requests a
cache line and uses 3/4 of it. Effective bandwidth goes from 0.45 to 0.75 of
peak, a huge improvement. In the second example, each thread needs all three
cache lines: 9 requests to memory, with an efficiency of 0.25 as opposed to
0.32, which is only slightly worse.
Good data structure design needs to balance usability and performance.
As these toy examples show, data layout can have a huge impact on how
efficient a code can be on any given hardware.
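The request counts in the path array example can be reproduced with a small Python model. It is a sketch under the same simplifying assumptions as the text (16-byte elements, 64-byte cache lines, one request per cache line touched by a thread); the function names are illustrative.

```python
def line_usage(n_paths, n_times, row_stride, by_path, elem_bytes=16, line_bytes=64):
    """Fraction of each requested cache line that the requesting thread uses.

    V[p][t] is stored row major at element offset p * row_stride + t.
    One thread per path if by_path is True, otherwise one thread per time step.
    """
    per_line = line_bytes // elem_bytes
    fractions = []
    n_threads = n_paths if by_path else n_times
    for thread in range(n_threads):
        if by_path:
            offsets = [thread * row_stride + t for t in range(n_times)]
        else:
            offsets = [p * row_stride + thread for p in range(n_paths)]
        counts = {}
        for off in offsets:
            line = off // per_line
            counts[line] = counts.get(line, 0) + 1
        fractions.extend(counts[line] / per_line for line in sorted(counts))
    return fractions


def efficiency(fractions):
    """Average useful fraction over all cache line requests."""
    return sum(fractions) / len(fractions)
```

Calling `line_usage(3, 3, 3, True)` returns the five fractions of the unpadded one-thread-per-path case, and `line_usage(3, 3, 4, True)` the three requests of the padded case; passing `False` for `by_path` gives the one-thread-per-time-step figures.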

16.6 Putting It All Together


The advent of multicore graphics cards places the possibility of a huge
amount of computing power in the desktop or server room at a reasonably
low cost. Such cards are known as graphics processing units or GPUs. How-
ever, they can be used for many purposes other than graphics. In particular,
they are well suited to Monte Carlo simulations since their design is highly
parallel. They work best when threads perform the same operations on varying
initial data (data parallelism).
To gain some idea of the level of computing power, consider one of
NVIDIA’s recent GPUs, the Tesla K20. This has a peak performance of
3.52 teraflops, that is, it can perform 3.52 trillion single precision floating
point operations per second. However, possession of raw computing power is
not enough. Different architectures require different coding designs and the
question is therefore what level of performance can one achieve for realis-
tic practical problems? For example, Aldrich, Fernandez-Villaverde, Gallant,
and Rubio-Ramirez (2011) demonstrate the effectiveness of GPUs for solving
dynamic equilibrium problems in economics using iterative methods.
The pricing of non-early-exercisable derivatives using GPUs is straight-
forward and large speedups can be obtained. The case of Asian options was
studied by Joshi (2010). An early piece of work on pricing exotic interest rate
derivatives using the LIBOR market model was produced by Giles and Xiaoke
(2008), where a 120-fold speedup was achieved. Whilst this was an interesting
pilot, the case studied was a little simple in that a one-factor model was used.
The questions we address here are:
• Can a complex multifactor displaced diffusion LIBOR market model be
implemented in such a way as to achieve large speedups whilst maintain-
ing genericity?

• Is it possible to implement an effective and fast Monte Carlo pricer for
early exercisable Bermudan derivatives using the GPU?
We answer both questions in the affirmative and achieve speedups of over a
hundred times relative to single-threaded CPU C++ code. The second problem is much
tougher than the first in that the pricing of early exercisable derivatives via
Monte Carlo simulation is much more complex than pricing path-dependent
derivatives lacking this feature. The fundamental reason is that an exercise
strategy has to be developed and the development of the strategy requires
interaction between paths. Thus one cannot set many threads going, let each
one handle a different path and accumulate results at the end. Unlike many
Monte Carlo simulation problems, the method is not “embarrassingly paral-
lel,” and it is a challenge to design an effective implementation.
The problem of pricing Bermudan (or American) derivatives by Monte
Carlo simulation is well known and used to be regarded as very hard. However,
much progress has been made in recent years. We focus on lower bounds in
this example. The use of regression to develop estimates of continuation values
and therefore to make exercise decisions has, in particular, proven popular.
This was introduced by Carrière (1996) and popularized by Longstaff and
Schwartz (2001). Whilst these methods work reasonably well for finding lower
bounds in many cases, they are not always successful. Continued work has
therefore focussed on enhancing these techniques (Beveridge, Joshi, and Tang,
2013; Broadie and Cao, 2008; Kolodko and Schoenmakers, 2006). Whilst these
improved techniques have proven effective, they are very much designed for
the virtues and constraints of a CPU. For example, early termination of a
subsimulation if an inaccurate number is sufficient makes sense on the CPU,
but running lots of subsimulations with differing numbers of paths is unnatural
on a GPU. We therefore introduce a new approach to early exercise based on
using a cascade of multiple regressions. This builds on previous work (Broadie
and Cao, 2008) on double regressions.
In this portion of this chapter, we present and discuss solutions to the
challenges of pricing early exercisable exotic interest rate derivatives on the GPU
using NVIDIA’s CUDA language. In particular, we discuss in detail the design
choices made in the open-source project Kooderive, which contains an example
of the pricing of a 40-rate cancellable swap using a five-factor displaced
diffusion LIBOR market model. The code is fully available for download under
the GNU General Public License, version 3.0 (Joshi, 2014). The project is a collection of C++
and CUDA code targeting the Kepler architecture (compute capability 3.5) and the NVIDIA K20c
graphics card. It will, however, run on other cards with earlier or later archi-
tectures. It is designed for use with Windows operating systems and comes
with project files for Visual Studio which allows immediate building.
Whilst we focus on the LIBOR market model (LMM), we emphasize that
the approaches used are generic and they could equally well be applied to
other models. In particular, the multidimensional Black–Scholes model is from
the point of view of implementation really just a simplification of the LMM in
which drifts and discounting are much easier. The implementation of the early
exercise code is done in generic fashion and does not use specific features of the
model and product. In fact, a key part of the design is that the early exercise
strategy is generated in a different component that only interacts with the
path generator via the paths produced and so will be very broadly applicable.
We focus here on cancellable swaps because their pricing is mathematically
equivalent to that of Bermudan swaptions and callable fixed rate bonds. These
products are the most common exotic interest rate derivatives. However, no
special features of the product are used and other products can be handled
by modifying the functions defining the coupons with no changes elsewhere.
We note that there has been previous work on the use of GPUs for Bermu-
dan/American options. Dang, Christara, and Jackson (2010) proceed using a
PDE approach. Abbas-Turki and Lapeyre (2009) use a least-squares Monte
Carlo approach. However, the case they study is four-dimensional and indica-
tive rather than realistic so it is difficult to know how their techniques would
translate to the high-dimensional interest rate case. Most other work appears
to be focussed on binomial trees and/or the one-dimensional case.
In this example, we focus on lower bounds. However, another part of the
Bermudan pricing problem is upper bounds: one cannot be sure a lower bound
is good without a nearby upper bound. To do a regression-based method such
as that in Joshi and Tang (2014) would not be a particularly hard extension of
the work here. First, one would have to compute Deltas along each path using
adjoint differentiation techniques. Second, one would have to use similar tech-
niques to those here to compute regression estimates of their value. Third, one
would run a hedging simulation using these estimates. To do a method such
as Andersen and Broadie (2004) would be more challenging since it involves
running many subsimulations. One could put each subsimulation on the GPU
as a separate kernel; however, that would result in a very large number of
kernel launches and possibly not much of a relative speedup unless large num-
bers of paths were being used. We defer the problem of designing a smarter
implementation to future work.
We review the LIBOR market model in Section 16.7. We discuss its algo-
rithmic implementation in a discretized setting in Section 16.7.1. We develop
new ideas for early exercise in Section 16.7.2. We discuss the software and
hardware used in Section 16.7.3. We outline the design of the code in
Section 16.7.4. The intricacies of memory use are examined in Section 16.7.5. We
study how to evolve the LIBOR market model on the GPU in Section 16.7.6.
The specification of products is done in Section 16.7.7. We go into the imple-
mentation details of regression on the GPU in Section 16.7.8. Section 16.7.9
covers methodologies for data collection to prepare for least-squares. Pricing
is discussed in Section 16.7.10. We present timings and numerical results in
Section 16.7.11 and we conclude in Section 16.8.
Mark would like to thank Oh Kang Kwon for his assistance with coding a
Brownian bridge and with a skipping Sobol generator. Mark is also grateful
to Jacques Du Toit for his comments on an earlier version of this work.

16.7 The LIBOR Market Model


In this section, we briefly review the displaced diffusion LIBOR market
model. This is standard material. We refer the reader to standard texts such
as Andersen and Piterbarg (2010), Brace (2007), Brigo and Mercurio (2006),
and Joshi (2011) for more details.
Since it was given a firm theoretical base in the fundamental papers by
Brace, Gatarek, and Musiela (1997), Musiela and Rutkowski (1997), and
Jamshidian (1997), the LIBOR market model has become a very popular
method for pricing interest rate derivatives. It is based on the idea of evolving
the yield curve directly through a set of discrete market observable forward
rates, rather than indirectly through the use of a single nonobservable quantity
which is assumed to drive the yield curve.
Suppose we have a set of tenor dates, $0 = T_0 < T_1 < \cdots < T_{n+1}$, with
corresponding forward rates $f_0, \ldots, f_n$. Let $\delta_j = T_{j+1} - T_j$, and let $P(t, T)$
denote the price at time $t$ of a zero-coupon bond paying one at its maturity,
$T$. Using no-arbitrage arguments,

$$ f_j(t) = \frac{\dfrac{P(t, T_j)}{P(t, T_{j+1})} - 1}{\delta_j}, $$

where $f_j(t)$ is said to reset at time $T_j$, after which point it is assumed that it
does not change in value. We work solely in the spot LIBOR measure, which
corresponds to using the discretely compounded money market account as
numeraire, because this has certain practical advantages (Joshi, 2003a). This
numeraire is made up of an initial portfolio of one zero-coupon bond expiring
at time $T_1$, with the proceeds received when each bond expires being reinvested
in bonds expiring at the next tenor date, up until $T_n$. More formally, the value
of the numeraire portfolio at time $t$ will be

$$ N(t) = P(t, T_{\eta(t)}) \prod_{i=1}^{\eta(t)-1} \bigl(1 + \delta_i f_i(T_i)\bigr), $$

where $\eta(t)$ is the unique integer satisfying

$$ T_{\eta(t)-1} \le t < T_{\eta(t)}, $$

and thus gives the index of the next forward rate to reset.
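The definitions of the forward rate, the index of the next rate to reset, and the numeraire can be sketched directly in Python. The function names and argument layout are illustrative assumptions, and the numeraire helper assumes $t$ lies before the final tenor date.

```python
from bisect import bisect_right


def forward_rate(p_start, p_end, delta):
    """f_j(t) from the bond prices P(t, T_j) and P(t, T_{j+1})."""
    return (p_start / p_end - 1.0) / delta


def eta(t, tenors):
    """Index of the next forward rate to reset: T[eta-1] <= t < T[eta]."""
    return bisect_right(tenors, t)


def numeraire(t, tenors, bond_to_next, reset_rates):
    """Spot LIBOR numeraire N(t), for t before the final tenor date.

    bond_to_next is P(t, T_eta(t)); reset_rates[i] holds f_i(T_i) for the
    rates that have already reset (index 0 is unused).
    """
    value = bond_to_next
    for i in range(1, eta(t, tenors)):
        delta_i = tenors[i + 1] - tenors[i]
        value *= 1.0 + delta_i * reset_rates[i]
    return value
```

As a consistency check, at $t = T_1$ the single bond in the portfolio has just paid one unit, and the formula returns $P(T_1, T_2)(1 + \delta_1 f_1(T_1)) = 1$.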
Under the displaced diffusion LIBOR market model, the forward rates
that make up the state variables of the model are assumed to be driven by
the following process:

$$ df_i(t) = \mu_i(f, t)(f_i(t) + \alpha_i)\,dt + \sigma_i(t)(f_i(t) + \alpha_i)\,dW_i(t), \tag{16.1} $$

where the $\sigma_i(t)$ are deterministic functions of time, the $\alpha_i$ are constant
displacement coefficients, the $W_i$ are standard Brownian motions under the spot
LIBOR martingale measure, and the $\mu_i$ are uniquely determined by no-arbitrage
requirements. It is assumed that $W_i$ and $W_j$ have correlation $\rho_{i,j}$, and throughout
$\{\mathcal{F}_t\}_{t \ge 0}$ will be used to denote the filtration generated by the driving
Brownian motions. In addition, all expectations will be taken in the spot LIBOR
probability measure. The requirement that the discounted price processes of
the fundamental tradable assets, that is the zero-coupon bonds associated to
each tenor date, be martingales in the pricing measure, dictates that the drift
term is uniquely given by

$$ \mu_i(f, t) = \sum_{j=\eta(t)}^{i} \frac{(f_j(t) + \alpha_j)\,\delta_j}{1 + f_j(t)\,\delta_j}\, \sigma_i(t)\,\sigma_j(t)\,\rho_{i,j}; $$

see Brigo and Mercurio (2001).


Displaced diffusion is used as a simple way to allow for the skews seen
in implied caplet volatilities that have long persisted in interest rate mar-
kets (Joshi, 2003a). In particular, the use of displaced diffusion allows for the
wealth of results concerning calibrating and evolving rates in the standard
LIBOR market model to be carried over with only minor changes. The model
presented collapses to the standard LIBOR market model when $\alpha_i = 0$ for all
values of $i$.

16.7.1 The LMM in discrete time


In a computer program, we discretize time into a number of steps. Each
step is typically from one reset date to the next. Thus it is a time-discretized
version of the LMM that is important when implementing. Our philosophy as
expounded in Joshi (2011) is that calibration should be done post discretiza-
tion. We thus have a finite strictly increasing sequence of reset and payment
times for LIBOR rates $T_j$, and a similar sequence of evolution times as inputs
to our calibration. We will take these to be equal in what follows for simplicity,
although this is certainly not a necessity. We assume $N$ time steps and an
$F$-factor model.
The effective calibration of the LIBOR market model is the subject of
many papers. We will not address it here but instead will assume that a CPU
routine has already produced a calibration which we use as an input for our
pricing routine. The output of our calibrator is the following:

• The initial value of each forward rate, $f_r(0)$

• The displacement for each of these rates, $\alpha_r$

• The pseudo-square root, $A_{j-1}$, of the covariance matrix of the log forward
rates for each time step from $T_{j-1}$ to $T_j$

Note that one could equivalently specify the covariance matrix, $C_j$, for the
time step instead of the pseudo-square root. The pseudo-square root uniquely
determines the covariance matrix, of course, and it is the covariance matrix
that determines the drifts. However, since we are working with reduced-factor
models, a pseudo-square root is a more natural object. See Joshi (2011) for
discussion of this approach to calibration. In addition, when working with
low-discrepancy numbers, it is generally believed that working with a spectral
pseudo-square root can improve convergence (Giles, Kuo, Sloan, and Waterhouse,
2008; Jäckel, 2001), so it is convenient to specify this explicitly.
We use log rates $x_r = \log(f_r + \alpha_r)$ and the rates $f_r$ as convenient. We need
to compute drifts; these are state dependent so only the drift at the start of
the first step is known in advance. We use a predictor–corrector algorithm so
they have to be computed twice per step. The discretized drift of a log forward
rate $x_r$ across step $j$ is

$$ \mu_{j,r} = -0.5\,C_{j,rr} + \sum_{l=0}^{r} \frac{(f_l(t) + \alpha_l)\,\delta_l}{1 + f_l(t)\,\delta_l}\, C_{j,rl}, $$

where $t = T_{j-1}$ when predicting and $T_j$ when correcting. Whilst this expression
is correct, it is inefficient from a computational perspective; an algorithm
giving the same numbers with lower computational order is presented
in Joshi (2003b), and we use it here.
Our evolution algorithm for the rates on each path is therefore as follows:
1. Draw $NF$ uncorrelated standard normals from a quasi-random generator.

2. Use a Brownian bridge to develop these into $F$ Brownian motion paths.

3. Take the successive differences of these paths to get a vector, $Z_j$, of $F$
standard normals for each step $j$.

4. Compute drifts $\mu_{j,r}$ for step $j$ using $(f_r(T_{j-1}))$. (For the initial step, use
the stored values instead.)

5. Let $(\hat{x}_r(T_j)) = (x_r(T_{j-1})) + A_j Z_j + (\mu_{j,r})$.

6. Compute drifts $\hat{\mu}_{j,r}$ for step $j$ using $e^{\hat{x}_r(T_j)} - \alpha_r$.

7. For each $r$, let $x_r(T_j) = \hat{x}_r(T_j) + \frac{1}{2}(\hat{\mu}_{j,r} - \mu_{j,r})$.

8. Let $f_r(T_j) = \exp(x_r(T_j)) - \alpha_r$.

9. Unless at end of path go back to 3.

Once the forward rate path has been developed, other ancillary quantities
such as discount factors are easy to compute.
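Steps 4 to 8 of the evolution algorithm for a single time step can be sketched in plain Python as follows. This is an illustration only: it evaluates the drift with the direct double loop, taking the summand over the rates $l \le r$, rather than the lower-order algorithm of Joshi (2003b), and the function names and list-based data layout are assumptions, not the Kooderive interface.

```python
import math


def drifts(f, alpha, delta, cov):
    """mu_r = -0.5*C_rr + sum over l <= r of (f_l+alpha_l)*delta_l/(1+f_l*delta_l) * C_rl."""
    g = [(fl + al) * dl / (1.0 + fl * dl) for fl, al, dl in zip(f, alpha, delta)]
    return [
        -0.5 * cov[r][r] + sum(g[l] * cov[r][l] for l in range(r + 1))
        for r in range(len(f))
    ]


def evolve_step(f, alpha, delta, a, z):
    """One predictor-corrector step; a is the (rates x factors) pseudo-square root."""
    n, k = len(f), len(z)
    cov = [[sum(a[r][m] * a[l][m] for m in range(k)) for l in range(n)]
           for r in range(n)]
    shock = [sum(a[r][m] * z[m] for m in range(k)) for r in range(n)]
    x = [math.log(fr + ar) for fr, ar in zip(f, alpha)]

    mu = drifts(f, alpha, delta, cov)                                  # step 4
    x_hat = [x[r] + shock[r] + mu[r] for r in range(n)]                # step 5
    f_hat = [math.exp(x_hat[r]) - alpha[r] for r in range(n)]
    mu_hat = drifts(f_hat, alpha, delta, cov)                          # step 6
    x_new = [x_hat[r] + 0.5 * (mu_hat[r] - mu[r]) for r in range(n)]   # step 7
    return [math.exp(x_new[r]) - alpha[r] for r in range(n)]           # step 8
```

A rate whose row of the pseudo-square root is zero, for instance one that has already reset, receives neither shock nor drift and is left unchanged by the step.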

16.7.2 Multiple regression


In this section, we discuss a new algorithm for developing the exercise
strategy. This algorithm is designed to be simple but requires a large number of
paths, making it suited to GPU programming. We first recall the least-squares
method for cancellable products (Amin, 2003; Carrière, 1996; Longstaff and
Schwartz, 2001). The method generally works in three phases. In the first
phase, a set of paths is generated. In the second phase, regression coefficients
are estimated, and in the third these are used to develop a lower bound price.
We refer the reader to Joshi (2011) for further discussion.
The main choice in the algorithm regards which basis functions are used to
regress continuation values against, and this can have a great effect on results
(Beveridge, Joshi, and Tang, 2013; Brace, 2007). Here we make the distinction
between basis functions and basis variables. We make the former polynomials
in the latter. The crucial point is that the functions are easily generated from
the variables. A variable would typically be a stock price, a forward rate, a
swap rate, or a discount bond. We do not investigate the question of basis
function choice here but instead refer the reader to the extensive analysis in
Beveridge, Joshi, and Tang (2013).
In the second phase, a backwards inductive algorithm is used. First, at
the final exercise time, the remaining discounted cash flows for each path are
regressed against the basis function values for that point on that path. The
regression coefficients then yield an estimate of the continuation value. This
estimate is compared to the exercise value for the path and the value of the
product at this final exercise time is set to the exercise value if it is greater,
and to the discounted value of the remaining cash flows if it is not.
We then step back to the previous exercise time. We discount the value
at the succeeding exercise time and add on the discounted value of any cash
flows that occur in between on each path. We then regress again and repeat
all the way back to zero.
Whilst in simple cases, the method works well, in complicated ones it can
do unacceptably poorly or be highly basis function dependent. Many tech-
niques have been developed for improving the method, for example, Beveridge
and Joshi (2008), Kolodko and Schoenmakers (2006), Broadie and Cao (2008),
and Beveridge, Joshi, and Tang (2013). However, their methodologies tend to
be better suited to CPU codes. For example, the use of sub-simulations with
varying numbers of paths appears difficult to code effectively on the GPU.
However, we have the advantage that using large numbers of paths is prac-
tical. We therefore adapt and extend an idea from Broadie and Cao (2008)
and Beveridge, Joshi, and Tang (2013). They suggested using double regres-
sion: perform a least-squares regression and then perform a second regression
for paths for which the absolute difference between the estimated continua-
tion value and the exercise value is below some threshold such as 3%. The
idea is that the second regression yields greater accuracy in the area where it
is most needed. However, the approach does implicitly require that the first
regression be sufficiently accurate that the exercise boundary is within this
truncated domain.
We adapt this approach by picking a fraction $\theta \in (0, 1)$ and a regression
depth $d$ equal to say 5. We regress using least-squares, discard the fraction
$1 - \theta$ of paths farthest from the estimated boundary, and then repeat. We do
this until we reach the depth $d$. If we initially have $N_1$ paths, we finish with

$$ N_d = N_1 \theta^d $$

paths. Typically, we would only require the proportion discarded to be
approximately correct rather than exact for efficiency. The value $\theta$ would be chosen
to make $N_d$ of the size required. For example, we might let

$$ \theta = 0.1^{1/5} $$

when using 327,680 paths so that the final regression has roughly 32,768 paths.
The advantage of this approach is that by only discarding a small fraction of
paths that are very far from the money at each stage, we are less likely to be
affected by a substantial misestimation of the boundary.
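A toy version of the cascade at a single exercise date can be sketched in Python. The sketch makes several simplifying assumptions that are not from the text: a single basis variable with quadratic basis functions, exact sorting rather than an approximate cut, and a reading of the depth in which each of the $d$ regressions is followed by a discard and a final regression is run on the surviving paths. All function names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def basis(v):
    return [1.0, v, v * v]  # quadratic polynomial in one basis variable


def fit(variables, targets):
    """Least-squares coefficients via the normal equations."""
    rows = [basis(v) for v in variables]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(ata, atb)


def cascade(variables, continuation, exercise, theta, depth):
    """Regress, keep the fraction theta of paths nearest the boundary, repeat."""
    keep = list(range(len(variables)))
    coeffs = []
    for _ in range(depth):
        beta = fit([variables[i] for i in keep], [continuation[i] for i in keep])
        coeffs.append(beta)

        def gap(i):
            estimate = sum(b * phi for b, phi in zip(beta, basis(variables[i])))
            return abs(estimate - exercise[i])

        n_keep = max(len(beta), int(theta * len(keep)))
        keep = sorted(keep, key=gap)[:n_keep]
    # Final regression on the surviving paths (an assumption of this sketch).
    coeffs.append(fit([variables[i] for i in keep], [continuation[i] for i in keep]))
    return coeffs, keep
```

With `theta = 0.1 ** (1.0 / 5)` and a depth of 5, the surviving set is roughly one tenth of the original paths, in line with the 327,680 to 32,768 example above. On a GPU the exact sort would more naturally be replaced by an approximate threshold, consistent with the remark that the discarded proportion need only be approximately correct.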
Many authors use second-order polynomials in forward rates and swap
rates as basis functions following Piterbarg (2004). We will take the first for-
ward rate, the adjoining coterminal swap rate and the final discount factor
as our basis variables. The basis functions are then quadratic polynomials in
these with or without cross-terms.

16.7.3 Packages and hardware


In this section, we briefly describe the software and hardware used. The
principal software tool is the CUDA 5.5 toolkit available from NVIDIA for
free download. This is used in conjunction with Visual Studio Professional
Edition 2012 as IDE and as a C++ compiler. The Thrust open-source library
for developing algorithms on the GPU is used extensively both for algorithms
and memory allocation. The Thrust library now ships as part of the CUDA
toolkit.
The CUDA code discussed in this chapter is all part of the Kooderive
open-source library, which has a number of components. In particular, “gold”
projects are written in C++ and run on the CPU. The main static library
project is “kooderive” and the examples discussed here are in the project
“kooexample.” Project files for Visual Studio are included and allow
immediate building after download.
The hardware used as a GPU is a single K20c Tesla card.¹ It is programmed
purely in CUDA. The compute capability of the device is 3.5 and the code
assumes that such a GPU is available. The CUDA code is built as 64-bit.
The CPU used is an Intel(R) Xeon(R) E5-2643 CPU at 3.30 GHz. We
use routines from the QuantLib open-source library as a comparison. This
is compiled using Visual C++ in 32-bit mode since that is the configuration
specified by the current release.
¹ We thank NVIDIA for providing this hardware.

16.7.4 Design overview


The pricing of a Bermudan contract by Monte Carlo can be divided into
three phases:

1. A number of paths, N1, are generated and the relevant aspects of these
paths for developing an exercise strategy are stored.

2. A backwards induction is performed, generating regression coefficients
and updating continuation values at each exercise date.

3. A second Monte Carlo simulation is run using the exercise strategy
generated in the second phase using N2 paths. This is equivalent to pricing a
path-dependent derivative with no optionality since the exercise strategy
has been fixed.

It is the first phase that requires most care, because the data
generated has to be stored until the end of the second phase. Thus we will
require memory proportional to N1 . For our design, we keep all this data on
the GPU at all times and so there must be sufficient memory to store it. If the
GPU’s total global memory is M and we store m bytes per path, we have the
immediate constraint
N1 m < M.
A Tesla K20c has 4800 megabytes of global memory so if we run 327,680
paths, the maximum storage per path is 15,360 bytes. In practice, since other
data must be stored the maximum would be lower. A float takes 4 bytes so
we have storage for less than 3840 floats per path.
If our forward rate evolution has N rates and n steps, to store the entire
evolution for a path will take N n rates. If we take N = n, we will run out of
memory for some N < 62. Whilst one could squeeze some more memory by
discarding already reset rates, it would complicate accesses and there would
be very little left for other computations. In practice, one will often want to
develop many pieces of auxiliary data for all paths simultaneously, such as all
the implied discount factors, which multiplies the memory requirements.
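The memory arithmetic above can be checked with a few lines of host-side C++. This is only a sketch: the helper names are ours (hypothetical, not part of Kooderive), and a megabyte is assumed to mean 2^20 bytes, which is the convention that reproduces the 15,360-byte figure.

```cpp
#include <cassert>
#include <cstdint>

// Bytes available per path when every path is stored at once.
// Assumes 1 MB = 2^20 bytes and ignores memory needed for anything else.
std::uint64_t bytes_per_path(std::uint64_t global_megabytes, std::uint64_t paths)
{
    assert(paths > 0);
    return global_megabytes * 1024ULL * 1024ULL / paths;
}

// Largest N such that N * N floats fit in the per-path budget, i.e. the
// largest square evolution (N rates, N steps) that could be stored in full.
std::uint64_t max_square_evolution(std::uint64_t bytes_budget)
{
    const std::uint64_t floats = bytes_budget / sizeof(float);
    std::uint64_t n = 0;
    while ((n + 1) * (n + 1) <= floats)
        ++n;
    return n;
}
```

With 4800 MB and 327,680 paths this gives 15,360 bytes per path, and a 62 by 62 evolution already needs 3844 floats, more than the 3840 available.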
We therefore adopt an approach based on batching. Thus rather than
storing everything about 327,680 paths, we divide into say 10 batches and only
store the aspects of the paths that are required for the backwards induction.
So what must be stored? First, we note that we only need data at exercise
times. We use “exercise step” to mean the step from one exercise date to the
next. This may or may not be the same as the step from one reset date to the
next.

• The sum of the discounted values of any cash flows generated by the
product during the exercise step. This yields 1 float per exercise step.

• The value of the numeraire at the start of the exercise step again yields
1 float.

• The discounted value of any rebate generated on exercise gives 1 more
float.

• The basis variables for the exercise time are an input according to choices
of basis functions but typically 3 is enough.

Here “discounted” means discounted to the start of the exercise step. We
therefore typically have 6 data points per exercise step per path. If we have
40 exercise dates, and 327,680 paths, this requires 300 MB, leaving plenty of
space for more dates, paths, or other needs.
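As a check on the 300 MB figure, assuming 4-byte floats and 1 MB = 2^20 bytes (both assumptions on our part):

\[
6 \times 4\ \text{bytes} \times 40 \times 327{,}680 = 314{,}572{,}800\ \text{bytes} = 300 \times 2^{20}\ \text{bytes} = 300\ \text{MB}.
\]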
Our second step processes the data from the first step and outputs regres-
sion coefficients. An interesting feature of the algorithm is that the second
step uses no specific features of the model or product other than these out-
puts. This means that it is flexible and generic. Although developed for can-
cellable swaps in the LIBOR market model, it could equally well be used
for Bermudan max options in a multidimensional Black–Scholes model or,
indeed, for any Bermudan derivatives pricing problem priced using martingale
techniques.
For the third phase, we templatize the cash-flow generation on the exercise
strategy using the coefficients produced during the second phase. This means
that the exercise strategy is simply an input that could be changed drastically
without changing this phase’s design. Thus one could use a parametric strategy
instead of a least-squares one whilst only making changes to the second step.
Note that the third phase is again batched, and we only need to store a single
number from the output of each batch: the mean value of the product. In
consequence, there are no memory constraints on how many batches we use
for this phase. Thus our only constraint on the size of N2 is time.
In all phases, we divide the algorithm into a sequence of steps. Each step
is performed on all paths in a batch by a single GPU routine. Thus almost
everything is done for all these paths before any of them are complete. This is
quite different from a typical CPU program such as QuantLib where the first
path is complete before anything is done for the second path. Each of these
small steps in Kooderive will generally correspond to a single call to the GPU.

16.7.5 Memory use, threads, and blocks


A CUDA program consists of a C++ or C program together with various
calls to kernels which run on the GPU. Each kernel is configured as a number
of blocks (e.g., 64) and each block has the same number of threads (e.g., 512).
Threads in a block are divided into groups of 32 called warps. Threads in
the same warp are constrained to perform the same instructions, and if the
program requires them to do otherwise, idling of some threads occurs as the
branches are evaluated serially.
A significant difference between GPU programming and CPU program-
ming is the importance of how memory is used. A GPU program has to
explicitly use many different sorts of memory and how this is done can have
drastic effects on the speed of a program. The principal sorts of memory are
(with the amount on a K20c in parentheses):

• Host: the computer’s ordinary memory that the CPU uses

• Global: the graphics card’s main memory (4800 MB), large and plentiful
but must be accessed correctly or slowness occurs

• Shared: a small amount of memory shared between threads in the same
block (49,152 bytes), very fast but amount is very limited

• Constant: read-only for the GPU but writable by the CPU (65,536
bytes), again fast but use is very constrained

• Textures: a way of placing global memory in a read-only cache, fast for
read without constraints, but the memory cannot be changed within a
kernel

Memory transfer between host and global memory is typically slow for
large data sets and is often the main bottleneck in GPU programs. Kooderive
avoids this issue by simply not using host memory after the set-up phase.
Thus the model calibration and product specifications are passed to the GPU
initially but the only data passed back thereafter is the mean values for paths.
The layout of how data is stored in global memory greatly affects speed.
This is a consequence of the fact that threads do not access global memory
independently. Each thread has a thread number, t, and a block number, b.
We will call the total number of threads the width, w. The number of
threads per block we denote s for size. If the code is written so that in a warp,
thread t accesses location l + t for some l, then the access is coalesced and
occurs quickly. However, if the mapping is more complicated and each thread
accesses f (t) for some nontrivial function f such as f (t) = l + αt for some
α > 1, memory access is slower because more cache lines must be fetched to
service all the memory requests. A common approach throughout Kooderive
is that thread t in block b is responsible for the path

t + bs + kw,

for all k such that t + bs + kw is less than the total number of data points
(typically the paths in a batch). Data is then stored with the path in the
smallest dimension, the time step in the largest dimension, and any other
index in the middle. Thus if there are R rates, N steps, and P paths, forward
rate r on time step s for path p would be stored at location

p + rP + sRP.

With this layout, coalescing occurs naturally. This is in contrast to a typical
CPU program where all the data for each path would be stored together, and
one might use location

r + sR + pRN.
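The two layouts and the assignment of paths to threads can be illustrated on the host in plain C++. This is a sketch for intuition only: the function names are hypothetical, and the loop in `paths_for_thread` simply enumerates t + bs + kw rather than running on a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// GPU-friendly layout: the path index varies fastest, so consecutive
// threads (consecutive paths) touch consecutive addresses.
std::size_t gpu_index(std::size_t p, std::size_t r, std::size_t s,
                      std::size_t P, std::size_t R)
{
    assert(p < P && r < R);
    return p + r * P + s * R * P;
}

// Typical CPU layout: all data for one path is contiguous.
std::size_t cpu_index(std::size_t p, std::size_t r, std::size_t s,
                      std::size_t R, std::size_t N)
{
    assert(r < R && s < N);
    return r + s * R + p * R * N;
}

// Paths handled by thread t of block b: t + b*s + k*w for k = 0, 1, ...,
// with s threads per block and w = s * blocks threads in total.
std::vector<std::size_t> paths_for_thread(std::size_t t, std::size_t b,
                                          std::size_t threads_per_block,
                                          std::size_t blocks,
                                          std::size_t total_paths)
{
    assert(t < threads_per_block && b < blocks);
    const std::size_t width = threads_per_block * blocks;
    std::vector<std::size_t> paths;
    for (std::size_t p = t + b * threads_per_block; p < total_paths; p += width)
        paths.push_back(p);
    return paths;
}
```

In the GPU layout, neighbouring paths are one element apart for a fixed rate and step; in the CPU layout they are RN elements apart, which is the strided pattern that defeats coalescing.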
Coalescing is more important for writes than reads, since textures provide
a fast route to memory access provided there are no writes to that part of
memory during the kernel. Thus for pieces of data that do not change during
the kernel, textures are widely used in Kooderive to speed up memory access
and to avoid coalescing constraints. The textures provide a cache that also
speeds up access. Note that this cache can also be accessed explicitly using the
__ldg() function on the Kepler architecture, and this is sometimes done.
An alternative approach, also used in Kooderive, is to copy constant data into shared
memory at the start of a kernel. However, this relies on the amount of data
being small and we therefore use this technique only a little. Whilst constant
memory provides an additional alternative, the advantages do not seem suffi-
cient to justify its unwieldiness and it is not used in Kooderive.
Shared memory is also useful as a fast workspace and this is done by the
main path generation kernel.

16.7.6 Path generation


For both the first and third phases, we want to develop large numbers of
LMM paths rapidly. We develop the paths in batches of say 32,768 in size.
The batch creation is divided into a number of kernels:

• Generation of Sobol numbers as integers

• Scrambling

• Conversion to normals

• Brownian bridging
• Path generation

The generation of Sobol numbers is done using a modification of the example
in the CUDA SDK. The main differences are that a skip method has been added
to allow the Sobol sequence to start at an arbitrary point, and that unsigned
integers are returned rather than floats or doubles. The first facilitates batching:
each batch simply skips to the end of the previous batch. The second allows
scrambling. Here we input a fixed vector of unsigned integers for the batch and
apply a bitwise exclusive or (XOR) with it to each vector of Sobol draws. If the
scrambling vector is drawn randomly, we can view the batch average of quantities
as an unbiased estimate of any expectations. This is similar to randomized QMC
(Giles, Kuo, Sloan, and Waterhouse, 2008).
To transform to normals, we use the Shaw–Brickman algorithm (Shaw and
Brickman, 2009). This is performed using the “transform” algorithm from the
Thrust library. We use one call to do both the conversion to uniforms from
unsigned ints and to take the inverse cumulative normal.
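The scrambling and the conversion to uniforms can be sketched on the host as follows. The function names and the point-by-point storage order are assumptions made for readability (Kooderive itself stores data with the path index fastest), and the inverse cumulative normal is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// XOR every point of the batch with one fixed scrambling vector.
// `draws` holds shift.size() integers per point, stored point by point.
void scramble(std::vector<std::uint32_t>& draws,
              const std::vector<std::uint32_t>& shift)
{
    const std::size_t dimensions = shift.size();
    assert(dimensions > 0 && draws.size() % dimensions == 0);
    for (std::size_t i = 0; i < draws.size(); ++i)
        draws[i] ^= shift[i % dimensions];
}

// Map a 32-bit integer to the open interval (0,1), so that an inverse
// cumulative normal can be applied afterwards without hitting 0 or 1.
double to_uniform(std::uint32_t x)
{
    return (static_cast<double>(x) + 0.5) / 4294967296.0;
}
```

Because XOR is its own inverse, applying the same scrambling vector twice recovers the original draws; a fresh random vector per batch gives each batch an independent randomization.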
The Brownian bridge is performed using a two-dimensional grid of blocks.
This reflects the fact that the Sobol paths will be of dimension NF, with N
the number of steps and F the number of factors. We effectively have to create F
paths of N steps from these for each overall path. The x-coordinate of the
block and the thread ID determine the overall path. The y-coordinate determines
the factor. For each block, we first copy all the auxiliary data needed for
the bridge into shared memory. Each thread then develops the bridge for one
factor for one path without further interaction with other threads. If the total
number of threads across all blocks is less than the total number of paths,
then a thread will do multiple paths each separated by the global width. The
paths generated by each thread are successively differenced at the end so that
the outputs would be independent standard N (0, 1) random variables if the
inputs were. We refer the reader who is interested in high performing Brownian
bridges to du Toit (2011) for discussion of an alternative approach.
The main kernel that does the forward rate evolution and computes the
implied discount factors is:
LMM_evolver_all_steps_pc_kernel_devicelog_discounts.
This implements the Hunter, Jäckel, and Joshi (2001) algorithm for predictor–
corrector evolutions. A large fraction (roughly 70%) of the total compute time
for phases 1 and 3 is spent in this kernel, and so its efficiency is very important.
We therefore discuss it in detail. Its inputs are:

• The pseudo-square roots of the covariance matrices (texture)

• The accruals of the forward rates (texture)

• The displacements (texture)

• The fixed part of the drifts (texture)


• The value of the state-dependent drifts for the first step (texture)

• Indices that denote which rates are not yet reset for each step (texture)
• The initial logs of the rates (texture)
• The quasi-random variates (texture)

• The numbers of paths, rates, and steps (integers)

So all data is passed in as either a texture or an integer. It outputs the full
forward rate evolutions, the log displaced rates, and the implied discount
factors which are all stored in global memory.
Each path is handled by a single thread. If there are more paths than
threads, a thread will handle multiple paths with an index separated by the
total width in a serial manner.
The steps are done one at a time. (The first step is handled differently since
the drifts are already known.) The use of the fast drift computation algorithm
from Joshi (2003b) requires as many floats as there are factors to store the
partially computed sums. We therefore use (number of factors) × (block size) floats
of shared memory in each block, so that every thread has its own storage. Given these facts, the evolution
for each path is then straightforward and the coding of the algorithm is little
different from that of a C program.
For step 0, we multiply the variates by the pseudo-square root and add
them to the log rates. We add on the drifts. We then compute the state-
dependent drifts at the end of the step. We correct the values of the log
rates and the rates. We store their values in the output data and we use this
location to retrieve them when needed during this kernel. We then use the
rates to compute the discount factors implied by these forward rates for the
step. For the other steps, we first compute the drifts at the start of the step,
and then do as for step zero.
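In symbols, one predictor–corrector step for the log of a displaced rate can be sketched as below. This is our reading of the description rather than a transcription of the Kooderive code. We assume the spot measure and introduce our own notation: $C$ is the covariance matrix of the log rates over the step with pseudo-square root $A$, $\tau_j$ are the accruals, $\alpha_j$ the displacements, $Z$ the vector of normal variates for the step, and $\eta$ the index of the first rate not yet reset.

\[
\mu_i(f) = \sum_{j=\eta}^{i} \frac{\tau_j (f_j + \alpha_j)}{1 + \tau_j f_j}\, C_{ij},
\]
\[
\log(\hat f_i + \alpha_i) = \log(f_i + \alpha_i) + \mu_i(f) - \tfrac{1}{2} C_{ii} + \sum_{k} A_{ik} Z_k,
\]
\[
\log(f_i^{\mathrm{new}} + \alpha_i) = \log(f_i + \alpha_i) + \tfrac{1}{2}\bigl(\mu_i(f) + \mu_i(\hat f)\bigr) - \tfrac{1}{2} C_{ii} + \sum_{k} A_{ik} Z_k.
\]

Here $-\tfrac{1}{2} C_{ii}$ plays the role of the fixed part of the drift and $\mu_i$ is the state-dependent part. Since $C = AA^{T}$, the sum over $j$ can be accumulated factor by factor, which is why one partial sum per factor has to be kept for each path.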
Whilst the code is robust against variation in block and grid size, the com-
bination of 128 threads per block and 256 blocks proved effective when using
32,768 paths per batch. Note that this implies that each thread does precisely
one path. A slight further optimization could be obtained by rewriting the
code not to handle other cases, but the gains do not seem sufficient for this
to be worthwhile.

16.7.7 Product specification and design


It is not enough just to be able to price a single product using hand-
crafted code. One wants a code design that allows flexibility and changes.
One approach advocated in Joshi (2008) is to use an object encapsulating
the product’s termsheet with virtual functions providing the necessary data.
However, this does not seem well adapted to working on the GPU with CUDA
(although it is supported). We therefore decompose the product into a number
of components which can be written independently and then slotted together.
First, we regard the product as generating a pair of cash flows at each of a
set of evolution times. For each generation, the product is passed three rates
as well as the current discount curve and forward rates. The computation or
extraction of the three distinguished rates is performed independently and so
the product knows nothing of its origin or meaning. Thus they could be swap
rates or forward rates or something more complicated. In addition, if one
wished to incorporate OIS discounting, then a distinction between LIBOR
rates and OIS rates could be made at this point by adding a spread.
Second, the product simply generates two flows but does not specify their
timing. Instead, separate payment schedules are specified and these are passed
to a discounting routine. This allows changes, for example, from in-advance to
in-arrears with minimal changes to the code simply by changing these timings.
The actual cash-flow generation routine which turns the rates into the cash-
flow sizes is done using a templatized kernel. The template parameter is the
product. It steps through the evolution times passing the rates and discount
curves to the product and storing the cash flows as they are generated. The
path is terminated when the product indicates to do so. The design is set up
so that the product is able to store auxiliary data, such as a running coupon, if
necessary, which allows the possibility of path dependence. Alternatively, one
of the rates passed in could be made path dependent.
Exercise values are generated independently of the product, again allowing
maximum flexibility.
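The idea of a cash-flow generator templatized on the product can be sketched in host-side C++. Everything here is hypothetical: `PayerSwaplet` and `generate_flows` are illustrative names, the product receives only the three distinguished rates (the discount curve and forward rates are omitted for brevity), and the loop stands in for what a GPU thread would do for one path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical product: its first flow is (rate1 - strike) * accrual, its
// second flow is zero, and it never terminates the path early.
struct PayerSwaplet
{
    double strike;
    double accrual;
    // Returns true if the path should terminate after this step.
    bool operator()(double rate1, double /*rate2*/, double /*rate3*/,
                    double& flow1, double& flow2) const
    {
        flow1 = (rate1 - strike) * accrual;
        flow2 = 0.0;
        return false;
    }
};

// Steps through the evolution times of one path, letting the product turn
// its three rates into two undiscounted flows per step. `rates` holds three
// values per step. Payment timing and discounting are handled elsewhere.
// Returns the number of steps actually processed.
template <class Product>
std::size_t generate_flows(const Product& product,
                           const std::vector<double>& rates,
                           std::vector<double>& flows1,
                           std::vector<double>& flows2)
{
    assert(rates.size() % 3 == 0);
    const std::size_t steps = rates.size() / 3;
    flows1.assign(steps, 0.0);
    flows2.assign(steps, 0.0);
    for (std::size_t s = 0; s < steps; ++s)
    {
        const bool done = product(rates[3 * s], rates[3 * s + 1],
                                  rates[3 * s + 2], flows1[s], flows2[s]);
        if (done)
            return s + 1;
    }
    return steps;
}
```

Because the product is a template parameter rather than an object with virtual functions, a different product only requires a different struct with the same call signature; the generator itself is unchanged.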

16.7.8 Least-squares and multiple regressions on the GPU


The second phase of the least-squares algorithm is to perform regressions
on estimated continuation values. Here we enact multiple regressions for each
step. The least-squares algorithm works backwards in time. On each step, we
first have to find the values of the basis functions. Note here the distinction
between basis functions and basis variables. The latter are extracted in phase
1 and would typically be a forward rate, x, the adjoining swap rate, y, and
the final discount factor, z. The former are polynomials in these. We focus on
quadratic polynomials. We can work with or without cross terms, so we can
use
\[ 1,\; x,\; y,\; z,\; xy,\; yz,\; zx,\; x^2,\; y^2,\; z^2 \]

and have 10 basis functions, or use

\[ 1,\; x,\; y,\; z,\; x^2,\; y^2,\; z^2, \]

and have 7. The code is templatized on the algorithm for turning variables
into functions to allow flexibility.
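As a minimal illustration of turning basis variables into basis functions, the sketch below builds either set of quadratics for one path. The function name is hypothetical, and a run-time flag is used here for brevity where the text describes a template parameter.

```cpp
#include <cassert>
#include <vector>

// Quadratic basis functions of the three basis variables x, y, z.
// With cross terms: 1, x, y, z, xy, yz, zx, x^2, y^2, z^2 (10 functions).
// Without:          1, x, y, z, x^2, y^2, z^2             (7 functions).
std::vector<double> quadratic_basis(double x, double y, double z,
                                    bool cross_terms)
{
    std::vector<double> f{1.0, x, y, z};
    if (cross_terms)
    {
        f.push_back(x * y);
        f.push_back(y * z);
        f.push_back(z * x);
    }
    f.push_back(x * x);
    f.push_back(y * y);
    f.push_back(z * z);
    assert(f.size() == (cross_terms ? 10u : 7u));
    return f;
}
```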
Thus at the start of each step, we first use a kernel to generate the basis
functions for the step. Note that we only ever store the full basis functions for
one step at a time to reduce memory usage.
Once the basis functions are known, we have to find the minimal least-
squares error solution of a highly overdetermined rectangular system

Ax = y,

where A has N1 rows. Each row consists of the values of the basis functions
for one path for the step. The target y is the discounted future cash flows
for the path. Typically, N1 = 327,680 and there are 10 basis functions so the
system is very overdetermined. We solve in two phases. First, we write

\[ (A^{T}A)x = A^{T}y, \]

reducing to a 10 × 10 (or similar) system and then we solve this system.


For the computation of $A^{T}A$, we use two kernels. The first kernel computes
for each j and k with j ≤ k,

\[ \sum_{i} a_{ij} a_{ik}, \]

with the sum taken over a subset of i. The subsets for different blocks partition
the paths, and we use 1024 blocks. So at the end, we have 1024 numbers for
each matrix entry still to be summed. These sums are performed by a second
kernel which uses a different block for each entry. The computation is then
done by copying the data into shared memory and then using a repeated
binary summation so that each thread in the first half adds the value for the
corresponding thread in the second half. Once a thread falls in the second half
it does nothing further. Eventually only thread zero remains and it contains
the value of the sum. Note that this approach minimizes the length of the
computation chain, which is an important consideration when working with
floats to avoid round-off error. The computation of $A^{T}y$ is done similarly.
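The two-stage summation can be mimicked serially in C++ to make the data flow concrete. This is a sketch under stated assumptions: the names are hypothetical, the matrix is stored row by row for readability (not the path-fastest layout used on the GPU), and the inner loops stand in for threads that would run concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums `values` by repeated halving: on each pass, slot t in the first half
// adds the value held by slot t + half. This mirrors what the threads of
// one block do in shared memory. The size must be a power of two.
float tree_sum(std::vector<float> values)
{
    const std::size_t n = values.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    for (std::size_t half = n / 2; half > 0; half /= 2)
        for (std::size_t t = 0; t < half; ++t)   // one GPU thread per t
            values[t] += values[t + half];
    return values[0];
}

// Entry (j,k) of A^T A. Stage one: each of `blocks` subsets of the rows
// produces a partial sum. Stage two: the partial sums are tree-summed.
// A is row-major with `cols` basis functions per path; `blocks` must be a
// power of two because of tree_sum.
float ata_entry(const std::vector<float>& A, std::size_t cols,
                std::size_t j, std::size_t k, std::size_t blocks)
{
    assert(cols > 0 && A.size() % cols == 0 && j < cols && k < cols);
    const std::size_t rows = A.size() / cols;
    std::vector<float> partial(blocks, 0.0f);
    for (std::size_t i = 0; i < rows; ++i)
        partial[i % blocks] += A[i * cols + j] * A[i * cols + k];
    return tree_sum(partial);
}
```

With n partial sums, the tree needs only log2(n) additions along any chain, compared with n − 1 for a running total, which is the round-off argument made above.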
For the second part of solving the small system, we copy the problem to
the CPU and solve there. The fact that it is only a 10 × 10 system means
that this is fast. The solution of the small system is the vector of regression
coefficients.
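The text does not say which method is used for the small CPU solve. As one possibility, the sketch below uses Gaussian elimination with partial pivoting; the function name is hypothetical, and a Cholesky factorization would be an equally natural choice since $A^{T}A$ is symmetric.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Solves M x = b for a small dense n x n system (row-major M) by Gaussian
// elimination with partial pivoting. M and b are taken by value and are
// overwritten locally. Singular systems are not handled.
std::vector<double> solve_small(std::vector<double> M, std::vector<double> b)
{
    const std::size_t n = b.size();
    assert(M.size() == n * n);
    for (std::size_t c = 0; c < n; ++c)
    {
        // Use the row with the largest entry in column c as the pivot row.
        std::size_t pivot = c;
        for (std::size_t r = c + 1; r < n; ++r)
            if (std::fabs(M[r * n + c]) > std::fabs(M[pivot * n + c]))
                pivot = r;
        assert(M[pivot * n + c] != 0.0);
        if (pivot != c)
        {
            for (std::size_t k = 0; k < n; ++k)
                std::swap(M[c * n + k], M[pivot * n + k]);
            std::swap(b[c], b[pivot]);
        }
        // Eliminate column c from the rows below.
        for (std::size_t r = c + 1; r < n; ++r)
        {
            const double factor = M[r * n + c] / M[c * n + c];
            for (std::size_t k = c; k < n; ++k)
                M[r * n + k] -= factor * M[c * n + k];
            b[r] -= factor * b[c];
        }
    }
    // Back substitution.
    std::vector<double> x(n);
    for (std::size_t c = n; c-- > 0;)
    {
        double sum = b[c];
        for (std::size_t k = c + 1; k < n; ++k)
            sum -= M[c * n + k] * x[k];
        x[c] = sum / M[c * n + c];
    }
    return x;
}
```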
However, the step is not yet done since we are doing multiple regressions.
The regression coefficients yield an implied continuation value for every path.
Our multiple regression algorithm requires us to discard the fraction 1 − θ
of paths which are furthest from the exercise boundary, that is the ones that
yield the largest absolute value for discounted continuation value minus dis-
counted exercise value. First, a cut-off level is found which gives the threshold
above which paths are discarded. This is done by repeated bisection with the
counting being done using the Thrust transform and count algorithms. Second,
the data for the remaining paths are moved to be contiguous in memory. This
is performed using Thrust’s scatter_if algorithm. The process is repeated on
the remaining data until the preset regression depth is reached (e.g., 5) or
too few paths remain according to a preset cut-off such as 2048. For subsequent
estimates of continuation values, cascading through the coefficients is
performed until a threshold level is reached or the maximum depth is obtained.
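The cut-off search and the selection of retained paths can be sketched serially with the standard library standing in for Thrust. The function names, the iteration count, and the use of indices rather than moved data are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Finds, by repeated bisection, a cut-off c such that roughly the fraction
// theta of paths satisfies |distance| <= c. distance[p] is the estimated
// continuation value minus the exercise value for path p.
double find_cutoff(const std::vector<double>& distance, double theta,
                   int iterations = 60)
{
    assert(!distance.empty() && theta > 0.0 && theta <= 1.0);
    double lo = 0.0, hi = 0.0;
    for (double d : distance)
        hi = std::max(hi, std::fabs(d));
    const double target = theta * static_cast<double>(distance.size());
    for (int it = 0; it < iterations; ++it)
    {
        const double mid = 0.5 * (lo + hi);
        // On the GPU this count would be a transform followed by a count.
        const auto kept = std::count_if(distance.begin(), distance.end(),
            [mid](double d) { return std::fabs(d) <= mid; });
        if (static_cast<double>(kept) < target)
            lo = mid;
        else
            hi = mid;
    }
    return hi;
}

// Indices of the retained paths, in their original order, so that their
// data can then be moved into contiguous storage.
std::vector<std::size_t> retained_paths(const std::vector<double>& distance,
                                        double cutoff)
{
    std::vector<std::size_t> keep;
    for (std::size_t p = 0; p < distance.size(); ++p)
        if (std::fabs(distance[p]) <= cutoff)
            keep.push_back(p);
    return keep;
}
```

Bisection is attractive here because each iteration is a single parallel count over the paths, so no sort of the full data set is needed.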
Once the continuation values have been estimated, the next stage of the
algorithm is to set the paths’ stepwise values to be either the discounted future
flows for the path if exercise does not occur, or to the discounted exercise value
if it does. This is straightforwardly performed by a simple kernel. The final
action for the step is to deflate to the previous exercise time using the ratio
of the numeraire values at the two times. This is again straightforward.
We then simply repeat back to step 0. The average value after doing step
0 yields the first pass estimate of the discounted cash flows on or after the
first exercise time.

16.7.9 The data collection phase


As discussed earlier, the first phase of the pricing is to collect data for the
regression and backward induction. The generation is divided into a number
of batches (e.g., 10) and in each one the important data is stored. Each batch
is straightforward. The following operations are carried out:

• The paths are generated as in Section 16.7.6.


• The coterminal swap rates and their annuities are computed, that is,
the swap rates with the same final date but the first date varying (cf.
Jamshidian 1997).

• The just reset forward rate at each step is extracted as is the adjoining
coterminal swap rate.

• The final discount factor for each step is also extracted.

• The basis variables are computed from the three previous extracted val-
ues and stored.

• The rates underlying the product are also extracted.


• The numeraires along the paths are computed.

• The cash flows along the paths are generated and discounted to exercise
dates. They are then aggregated to each exercise date and stored.

• The exercise values are discounted and stored.

• The numeraire values at exercise dates are stored.

Each of these is done by a dedicated kernel. We have already discussed the
cash-flow generation. The other kernels are straightforward.

16.7.10 The pricing phase


The third and final phase is the actual pricing. At this point, the exercise
strategy has been generated during the second phase and so we are pricing
conditionally on a set of regression coefficients. Most of the phase is very
similar to the first one. The main differences are that there is no need to
store data for use after the batch has been processed, and that the cash-flow
generation takes the exercise strategy into account.

• The paths are generated as in Section 16.7.6.

• The coterminal swap rates and their annuities are computed, that is,
the swap rates with the same final date but the first date varying (cf.
Jamshidian 1997).
• The just reset forward rate at each step is extracted as is the adjoining
coterminal swap rate.
• The final discount factor for each step is also extracted.

• The basis variables are computed from the three previous extracted val-
ues and stored.

• The rates underlying the product are also extracted.


• The numeraires along the paths are computed.

• The cash flows along the paths are generated up to the exercise time. If
exercise occurs, the exercise value is taken into account.
• The cash flows are deflated.

• The cash flows are summed for each path.


• The pathwise values are averaged using the reduce algorithm from thrust.

For each of these, a simple dedicated kernel is used. The only one of much
interest is the cash-flow generation kernel. For maximum flexibility, this is
templatized on three parameters: the product, the exercise value computer,
and the exercise strategy. The auxiliary data for the strategy is moved into
shared memory for rapid access. Once this has been done, the routine is very
similar to that for the cash-flow generation in the first phase.
Note that each batch produces one number which is the mean pay-off
value. Different batches can be achieved either by making the quasi-random
generator skip or by using scrambling.

16.7.11 Speed comparisons and numerical results


We focus on the case of a 40-rate cancellable swap studied in Beveridge,
Joshi, and Tang (2013). We choose this as it is challenging enough to be
interesting without being esoteric. We also want to study a tough example
already in the literature to demonstrate the nonartificiality of our results.
The product has 40 underlying rates. The first one starts in 0.5 years. Coupon
payments occur at 0.5 + 0.5j for j = 1, 2, . . . , 40. It is cancellable at time 1.5
and every 0.5 years thereafter. No rebate is paid on cancellation. The swap
pays a fixed rate of 0.04 and receives floating.
The calibration is that we set the forward rate from 0.5j to 0.5(j + 1) to
be
0.008 + 0.002j for j = 0, 1, 2, . . . , 40.
The well-used “abcd” time-dependent volatility structure is used, with

\[
\sigma_i(t) =
\begin{cases}
0, & t > T_i; \\
\bigl(0.05 + 0.09(T_i - t)\bigr) \exp\bigl(-0.44(T_i - t)\bigr) + 0.2, & \text{otherwise},
\end{cases}
\]

and the instantaneous correlation between the driving Brownian motions is
assumed to be of the form

\[ \rho_{i,j} = \exp\bigl(-\phi\,|t_i - t_j|\bigr), \]

with $\phi = 2 \times 0.0669$. Displacements for all forward rates are assumed to be
equal, with

\[ \alpha_j = 1.5\%, \]

for all values of j.
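For readers who want to reproduce the calibration, the two functions above translate directly into code. The function names are ours; the default arguments are the parameter values quoted in the text.

```cpp
#include <cassert>
#include <cmath>

// Instantaneous "abcd" volatility at time t of the rate resetting at T_i:
// (a + b (T_i - t)) exp(-c (T_i - t)) + d before reset, zero afterwards.
double abcd_volatility(double t, double T_i,
                       double a = 0.05, double b = 0.09,
                       double c = 0.44, double d = 0.2)
{
    assert(t >= 0.0);
    if (t > T_i)
        return 0.0;
    const double tau = T_i - t;
    return (a + b * tau) * std::exp(-c * tau) + d;
}

// Instantaneous correlation between the drivers of two rates with
// times t_i and t_j.
double correlation(double t_i, double t_j, double phi = 2.0 * 0.0669)
{
    return std::exp(-phi * std::fabs(t_i - t_j));
}
```

At the reset time itself the volatility is a + d = 0.25, and the correlation of a rate with itself is 1.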
We use a five-factor model. To obtain the reduced pseudo-square root
matrices, we proceed as follows:

• Compute the full-factor covariance matrix for the step. This involves
integrating the product of the volatility functions and multiplying by
the instantaneous correlation for each entry.

• Perform a principal components analysis to obtain eigenvalues and
eigenvectors, $\lambda_j$ and $e_j$, with $\lambda_j$ decreasing.

• Form a column matrix $B = (\sqrt{\lambda_j}\, e_j)$.

• Scale the rows of B so that the variances of the log rates are the same
as before factor reduction.
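The final scaling step is simple enough to show in full. In this sketch (hypothetical function name, row-major storage assumed), each row of B is rescaled so that its sum of squares matches the corresponding diagonal entry of the full covariance matrix; the eigen-decomposition itself is assumed to have been done elsewhere.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rescales each row of the reduced pseudo-square root B (rows x factors,
// row-major) so that the row's sum of squares equals the matching diagonal
// entry of the full covariance matrix, supplied in `variances`.
void rescale_rows(std::vector<double>& B, std::size_t factors,
                  const std::vector<double>& variances)
{
    assert(factors > 0 && B.size() == factors * variances.size());
    for (std::size_t i = 0; i < variances.size(); ++i)
    {
        double sum_sq = 0.0;
        for (std::size_t f = 0; f < factors; ++f)
            sum_sq += B[i * factors + f] * B[i * factors + f];
        assert(sum_sq > 0.0);
        const double scale = std::sqrt(variances[i] / sum_sq);
        for (std::size_t f = 0; f < factors; ++f)
            B[i * factors + f] *= scale;
    }
}
```

After rescaling, the diagonal of $BB^{T}$ agrees with that of the full covariance matrix, so the factor reduction changes correlations but not the variance of any individual log rate.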

Our motivations for using five factors are:

• This seems to be as many as practitioners commonly use in industry.


• It is more than enough to encompass the major modes of deformation of
yield curves.

• We wish to study an example already in the literature rather than devel-


oping a new one.

We note, however, that there is nothing special about 5 from the implementa-
tion perspective and the Kooderive code will function for a differing number
of factors.
Beveridge, Joshi, and Tang (2013) achieve a price of 1088 with a standard
error of 2.5 using double regression and the exclusion of suboptimal points.
They use policy iteration to get an increase of 6.5 with a standard error of
2. Thus their lower bound price is 1094.5 with a standard error of 3.2. Their
upper bound has the slightly lower value of 1094 with a standard error of 3.
We present results on timings and price for varying regression depths in
Table 16.1. We use 10 batches of 32,768 paths for the first pass and 32 of
them for the second. We separate the time actually spent doing regressions
from that spent on doing other parts of the exercise strategy building. The
price increases substantially when we increase from single regression to double
regression. Another slight increase occurs from double to triple and it is stable
thereafter. The fraction of paths retained after each regression is given by
$0.1^{1/d}$, where d is the total regression depth. The implied price whilst slightly
lower than that in Beveridge, Joshi, and Tang (2013) is within one standard
error and so can be regarded as accurate.
TABLE 16.1: Timings and prices for the 40-rate cancellable swap with
varying numbers of regressions

Regression depth                     1         2         3         4         5
Time taken for first pass paths      0.207     0.206     0.207     0.206     0.207
Time taken for regression set up     0.206     0.169     0.172     0.179     0.179
Time taken for regression            0.094     0.191     0.254     0.32      0.39
Time taken for second pass           0.697     0.71      0.729     0.747     0.765
Total time                           1.204     1.276     1.362     1.452     1.541
Second pass price                    0.10740   0.10911   0.10924   0.10927   0.10925
Note: All standard errors are between 0.5 and 0.6 basis points.

TABLE 16.2: Timing comparison for QuantLib versus Kooderive for
a 38 rate cancellable swap with 38 call dates

                      Time QL     Time Kooderive   Ratio
First pass            34.958      0.195            179.2718
Strategy building     11.037      0.379            29.12137
Second pass           122.013     0.648            188.2917
Total                 168.008     1.222            137.4861
Note: Times in seconds; 327,680 first pass paths; 1,048,576 second pass paths; single regression.

In Table 16.2, we present timing comparisons for Kooderive versus
QuantLib. For simplicity in running the QuantLib (QL) code, we do not
consider the first two noncall coupons. The value of these is analytic in any case
and a simulation pricing of them is not needed. We see that the first pass of
path storage is 179 times faster in Kooderive. Similarly, the second pass pricing
is 188 times faster. This is despite the fact that QuantLib terminates path
generation early when appropriate and Kooderive does not. The timing ratio
for the computation of regression coefficients is not quite so impressive as
only a 29 times speedup is achieved. Note, however, that the division between
stages in Kooderive is slightly different from QuantLib. The generation of
basis functions from basis variables is done in the first pass in QuantLib and
during strategy building in Kooderive. This means that our numbers overstate
the speed of the first pass and the slowness of strategy building. The overall
ratio, which is what really matters, at 137 is very large. We can run numbers
of paths in seconds that would previously have been regarded as silly in a live
environment.
Of course, a CPU implementation could also be multithreaded and the first
and third parts should scale well since they are embarrassingly parallel. For
the second part, one would have to solve similar challenges to that presented
for the GPU. We do not explore how to carry out such an implementation
here. However, we note that if the CPU used t threads, the best we could
hope for is a t-times speed up, and so we would still expect the GPU to be 137/t
times faster. It would take a very large number of CPU cores for the CPU to
be competitive against a single GPU.

16.8 Conclusion
We spent the first part of this chapter arguing that:

• There is a parallel performance imperative, driven by the state of
computer hardware and manufacturing
• To maintain efficiency and performance, software has to adapt to avail-
able parallelism

• Quants need to be able to write efficient parallel code

Given those arguments, we presented a very basic primer on the hardware
features that are relevant to quants, on both the more traditional CPU and on
modern compute accelerators such as GPUs from NVIDIA. We covered the key
concepts in parallel acceleration, hardware, thread, and memory management.
We then shifted gears and presented a realistic example of a challenging
computation. The calculations were described in sufficient detail that they
can be understood as they are mapped for execution on a GPU. The key
implementation choices were explained. Pointers to sample code online were
provided.
Ultimately, we have demonstrated that it is possible to develop a
full-featured, powerful, and flexible displaced diffusion LIBOR market model that
runs in a highly parallel fashion on the GPU. The speedup obtained is over
100 times versus a comparable nonparallel code, and this is using a single
GPU. This speedup is achievable with a basic knowledge of modern hardware
architectures and features modern, maintainable code. Such speedups mean
that it is now possible to routinely run numbers of paths that would have
been regarded as far too large in the past.
A natural extension of the work here would be to consider a more com-
plex model incorporating OIS discounting and smiles. These methods are well
covered in the literature and following the example of this code, incorporat-
ing them into parallel code is straightforward. Alternatively, following this
example, other models can also be adapted for highly parallel execution.

References
Abbas-Turki, L., and Lapeyre, B. 2009. American options pricing on multi-core
graphic cards. 2009 International Conference on Business Intelligence and Finan-
cial Engineering, 307, Beijing, China.
Manycore Parallel Computation 505

Aldrich, E. M., Fernandez-Villaverde, J., Gallant, A. R., and Rubio-Ramirez,


J. F. 2011. Tapping the supercomputer under your desk: Solving dynamic
equilibrium models with graphics processors. Journal of Economic Dynamics
and Control, 35 (3), 386–393. [Link]
S0165188910002216 doi:10.1016/[Link].2010.10.001

Amdahl, G. M. 1967. Validity of the single processor approach to achieving large-


scale computing capabilities. AFIPS Conference Proceedings, 30 , 483–485.

Amin, A. 2003. Multi-factor cross currency Libor market models: Implementation,


calibration and examples. Calibration and Examples (May 1, 2003 ). [Link]
[Link]/10.2139/ssrn.1214042

Andersen, L. and Broadie, M. 2004. A primal–dual simulation algorithm for pricing


multi-dimensional American options. Management Science, 50 , 1222–1234.

Andersen, L. and Piterbarg, V. V. 2010. Interest rate modelling. London, New York:
Atlantic Financial Press.

Beveridge, C. J. and Joshi, M. S. 2008. Juggling snowballs. Risk, December , 100–


104.

Beveridge, C. J., Joshi, M. S., and Tang, R. 2013. Practical policy iteration: Generic
methods for obtaining rapid and tight bounds for Bermudan exotic derivatives
using Monte Carlo simulation. Journal of Economic Dynamics and Control, 37 ,
1342–1361.

Brace, A. 2007. Engineering BGM. Sydney: Chapman and Hall.

Brace, A., Gatarek, D., and Musiela, M. 1997. The market model of interest rate
dynamics. Mathematical Finance, 7 , 127–155.

Brigo, D. and Mercurio, F. 2001. Interest Rate Models: Theory and Practice. Hei-
delberg: Springer Verlag.

Brigo, D. and Mercurio, F. 2006. Interest Rate Models—Theory and Practice: With
Smile, Inflation and Credit. Springer.

Broadie, M. and Cao, M. 2008. Improved lower and upper bound algorithms for
pricing American options by simulation. Quantitative Finance, 8 , 845–861.

Carrière, J. F. 1996. Valuation of the early-exercise price for options using simu-
lation and nonparametric regression. Insurance: Mathematics and Economics, 19 ,
19–30.

Dang, D. M., Christara, C. C., and Jackson, K. R. 2010. Pricing multi-asset Amer-
ican options on graphics processing units using a PDE approach. 2010 IEEE
Workshop on High Performance Computational Finance (WHPCF) (pp. 1–8), New
Orleans, Louisiana, USA.

Dennard, R. H., Gaensslen, F. H., Rideout, V. L., Bassous, E., LeBlanc, A. R., and
Gaensslen, R. H. 1974. Design of ion-implanted mosfets with very small physical
dimensions. IEEE Journal of Solid-State Circuits, SC-9 (5), 256–268.
506 High-Performance Computing in Finance

du Toit, J. 2011. A high-performance Brownian bridge for GPUS: Lessons for band-
width bound applications.

Giles, M., Kuo, F. Y., Sloan, I. H., and Waterhouse, B. J. 2008. Quasi-Monte Carlo
for finance applications. ANZIAM Journal, 50 , C308–C323.

Giles, M. and Xiaoke, S. 2008. Notes on using the nVidia 8800 GTX graphics card.
([Link] old/[Link])

Gustafson, J. L. 1988. Reevaluating Amdahl’s law. Communications of the ACM,


31 , 532–533.

Hunter, C., Jäckel, P., and Joshi, M. S. 2001. Getting the drift. Risk, July, 81–84.

Jäckel, P. 2001. Monte Carlo Methods in Finance. New York: John Wiley & Sons
Ltd.

Jamshidian, F. 1997. LIBOR and swap market models and measures. Finance and
Stochastics, 1 , 293–330.

Joshi, M. 2003a. The Concepts and Practice of Mathematical Finance. London:


Cambridge University Press.

Joshi, M. 2003b. Rapid drift computations in the LIBOR market model. Wilmott
Magazine, May, 84–85.

Joshi, M. 2008. C++ Design Patterns and Derivatives Pricing (2nd edition).
London: Cambridge University Press.

Joshi, M. 2010. Graphical Asian options. Wilmott Journal, 2 , 97–107.

Joshi, M. 2011. More Mathematical Finance. Melbourne: Pilot Whale Press.

Joshi, M. 2014. Kooderive version 0.3. [Link]

Joshi, M. and Tang, R 2014. Effective sub-simulation-free upper bounds


for the Monte Carlo pricing of callable derivatives and various improve-
ments to existing methodologies. Journal of Economic Dynamics and Control,
40 ,25–45 [Link]
doi:10.1016/[Link].2013.12.001

Kolodko, A. and Schoenmakers, J. 2006. Iterative construction of the optimal


Bermudan stopping time. Finance and Stochastics, 10 , 27–49.

Little, J. D. C. 1961. A proof for the queuing formula: L = λw. Operations Research,
9.3 , 383–387.

Longstaff, F. A. and Schwartz, E. S. 2001. Valuing American options by simulation:


A simple least squares approach. The Review of Financial Studies, 14 , 113–147.

Moore, G. E. 1965. Cramming more components onto integrated circuits. Electron-


ics, April 19 , 114–117.
Manycore Parallel Computation 507

Musiela, M. and Rutkowski, M. 1997. Continuous-time term structure models:


forward-measure approach. Finance and Stochastics, 1 , 261–292.

Piterbarg, V. 2004. A practitioner’s guide to pricing and hedging callable LIBOR


exotics in forward LIBOR models. Journal of Computational Finance, 8 , 65–119.

Shaw, W. T. and Brickman, N. 2009. Differential equations for Monte Carlo recy-
cling and a GPU-optimized normal quantile (Tech. Rep.). Citeseer.
Chapter 17
Practitioner’s Guide on the Use of
Cloud Computing in Finance

Binghuan Lin, Rainer Wehkamp, and Juho Kanniainen

CONTENTS
17.1 What Is Cloud Computing?
    17.1.1 Why cloud computing and why now?
17.2 Background
    17.2.1 The taxonomy of parallel computing
    17.2.2 Glossary
17.3 Financial Applications of Cloud Computing
    17.3.1 Derivative valuation and pricing
    17.3.2 Risk management and reporting
    17.3.3 Quantitative trading
    17.3.4 Credit scoring
17.4 The Nature of Challenges
17.5 Implementation and Practices
    17.5.1 Implementation example: Techila middleware with MATLAB
    17.5.2 Computational needs
        Do you have a computational bottleneck?
        Where is your computational bottleneck?
    17.5.3 Solution selection
    17.5.4 Algorithm design
17.6 Case Studies
    17.6.1 Portfolio backtesting
        Potential computing bottleneck
        Computing environment and architecture
        Experiment design and test result
    17.6.2 Distributed portfolio optimization
        Challenges in large-scale portfolio construction
        Algorithm design for large-scale mean-variance optimization problem
17.7 Cloud Alpha: Economics of Cloud Computing
    17.7.1 Cost analysis
    17.7.2 Risks
References

17.1 What Is Cloud Computing?


It took mankind centuries to learn how to make use of electricity. In
the early days, factories and corporations were powered by small-scale on-site
power plants. Maintaining such power plants was expensive because of the
additional labor cost. Nowadays, with the help of large-scale power plants and
efficient transmission networks, electricity powers modern industrial society
for transportation, heating, lighting, communications, and so on. Electricity
is at everyone’s disposal at a reasonable price.
Cloud computing shares many similarities with electricity. By connecting
end-users via the Internet to data centers, where powerful computing
hardware is located, cloud computing makes computation available to everyone.
The core concept of cloud computing is resource sharing. To harvest the
computing power, cloud computing is also an exercise in operations research,
namely the optimization of computing resources. Such technology enables the
processing of massively parallel computations by using shared computing
resources. The computing resources usually consist of large numbers of
networked computing nodes. The word “cloud” is used to depict such networked
computing resources.
The formal definition of cloud computing given by the National Institute
of Standards and Technology (NIST) of the U.S. Department of Commerce is

a model for enabling ubiquitous, convenient, on-demand network
access to a shared pool of configurable computing resources (e.g.,
networks, servers, storage, applications, and services) that can be
rapidly provisioned and released with minimal management effort or
service provider interaction.

Mell and Grance (2009), NIST

The following five essential characteristics differentiate cloud computing
from other computing solutions, such as on-premises servers:

• On-demand self-service: A consumer can provision computing capacity
as needed without interaction with the service provider.
• Broad network access: Computing resources are available to the consumer
through the network and can be accessed from mobile phones, tablets,
laptops, and workstations.
• Resource pooling: Resources are dynamically assigned according to
customers’ needs.
• Rapid elasticity: Capacities can be reconfigured automatically to scale
rapidly in response to changing demand.
• Measured service: Resource utilization is automatically controlled and
optimized by the cloud systems. The utilization is monitored, measured,
and reported.

17.1.1 Why cloud computing and why now?


The first reason is the increasing computing demand from industry, especially
from the financial industry. International Data Corporation (IDC)
(Joseph et al., 2014) reported the demand for high-performance computing (HPC)
from 13 sectors. They predict an 8.7% yearly growth in HPC spending
in the economics/financial sector from 2013 to 2018, which is among the top 3
of all 13 sectors studied, as shown in Figure 17.1. The second reason is the
pervasiveness of cloud computing. Cloud computing has deep roots dating
back to utility computing in the 1960s:
If computers of the kind I have advocated become the computers of
the future, then computing may someday be organized as a public
utility just as the telephone system is a public utility... The computer
utility could become the basis of a new and important industry.

John McCarthy at MIT Centennial in 1961

The technology has developed since then. To provide an overview of the
technology developments, Figure 17.2 shows the advances in cloud-computing-related
technology alongside the innovations in financial engineering. While

FIGURE 17.1: HPC spending by sector 2013 versus 2018. (Adapted from
Joseph, E. et al., 2014. IDC HPC update at ISC’14.)

[Timeline from the 1900s to 2015.
Financial engineering: 1900, theory of speculation (Louis Bachelier); 1952,
portfolio selection (Harry F. Markowitz); 1973, Black–Scholes–Merton model
(F. Black, M. Scholes, R. Merton); 1990–2000, stochastic volatility model and
local volatility model; from 2000, jump diffusion model, Lévy model, etc.;
from 2013, Basel III.
Computing: large-scale mainframe computers; time-sharing service and the IBM
VM operating system; virtual private network (VPN); 2006, Amazon introduces
Elastic Compute Cloud; 2008, Microsoft Azure.]

FIGURE 17.2: History of financial engineering and cloud computing.

the innovation of more complex models in financial engineering increases the
demand for HPC technology, the supply of HPC technology increases with the
development of technologies such as cloud computing. There is also evidence
of increasing public awareness of cloud computing. Figure 17.3 shows
an increasing search trend for cloud computing using Google Trends.
The ongoing development and commercialization of cloud computing
are significantly boosted by the increase in computation demands in the real
world. According to Gartner’s report (Smith, 2008):

By 2012, 80% of Fortune 1000 enterprises will pay for some cloud
computing service and 30% of them will pay for cloud computing
infrastructure. Through 2010, more than 80% of enterprise use of
cloud computing will be devoted to very large data queries,
short-term massively parallel workloads, or IT use by startups with little
to no IT infrastructure.

FIGURE 17.3: Google search trend of cloud computing (since 2004).

Modern commercial cloud computing has also brought industrial computing
practice to a new level, especially in the financial industry, where
the computing need is massive. Our focus is on how to utilize the enormous
computing resources of the cloud to meet the massive computing
challenges posed by the financial industry. We start by introducing parallel
computing problems and massively parallel computing tasks in the finance
industry. We then compare cloud computing with alternative solutions in terms
of performance, cost, and elasticity. To improve understanding of
what kinds of problems users may face in practice and how they can be solved,
we also walk readers through a complete implementation procedure in the
financial industry, with case studies and an implementation of the Techila®
Middleware Solution.

17.2 Background
17.2.1 The taxonomy of parallel computing
Computer processors process instructions sequentially. Thus traditional
computing problems are serial problems by design. The advent of
multiprocessors created a new type of computing problem: how to utilize the
parallel structure.
A parallel computing problem, in contrast to a serial computing problem,
is a computing problem that can be divided into subproblems
to be processed simultaneously. Based on the dependency structure of the
subproblems, parallel problems can be further classified as embarrassingly
parallel or nonembarrassingly parallel. If the processing of each subproblem
is independent of the other subproblems, the problem is embarrassingly
parallel; otherwise it is nonembarrassingly parallel.
Figure 17.4 illustrates the structure of embarrassingly parallel
and nonembarrassingly parallel problems. There is no communication between
jobs in the embarrassingly parallel case (Figure 17.4a), while communication
is required in the nonembarrassingly parallel case (Figure 17.4b).
By the nature of the underlying problem, a parallel problem can also be
classified as a data-parallel problem or a task-parallel problem. While data
parallelism focuses on distributing data across different processors, task
parallelism focuses on distributing execution processes (subtasks) to
different processors.
FIGURE 17.4: Parallel computing structure: (a) embarrassingly parallel
computing and (b) nonembarrassingly parallel computing. [Diagram labels in
both panels: input data, jobs, information, results.]

Another important aspect of parallel computing is whether the
problem is scalable. A scalable problem has either a
scalable problem size or scalable parallelism: either the solution time
decreases as parallelism increases, or the performance of the solution
improves with the problem size. The elasticity of the computing architecture
is the key to processing scalable problems successfully.
Here we provide two examples from the finance industry:

Example 17.1: Monte Carlo Option Pricing

Monte Carlo simulation is a typical embarrassingly parallel and
task-parallel computing problem.
By the fundamental theorem of arbitrage pricing, the option price is
equal to the expected payoff V discounted by a discount factor D. The
expectation can be evaluated via the Monte Carlo method. The Monte Carlo
estimator of the option price is given by

\[
C_0 = D \, \frac{1}{N} \sum_{\omega \in \text{sample set}} V(\omega)
\]

where N is the number of sample paths.

1. The simulation of each price path is independent of the other paths.
Thus it is easy to process the simulation of different paths in parallel
on different computing nodes.

2. It is task-parallel in the sense that the simulation of each path is a
small task. However, it is not data-parallel since there is no data
set to be distributed to different computing nodes.

3. There is a benefit in scaling the computation. The accuracy of the
price estimator improves as N increases. (The error converges at rate
$O(1/\sqrt{N})$.)
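The three points above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the chapter: it assumes a European call under geometric Brownian motion, and the function names, the job layout, and the use of a thread pool as a stand-in for worker nodes are all our own assumptions. Each job draws from its own random stream and returns only a partial sum of payoffs, so no job needs anything from another job.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_chunk(seed, n_paths, s0, strike, rate, sigma, maturity):
    """Return the sum of call payoffs V(omega) over n_paths simulated paths."""
    rng = random.Random(seed)  # each job owns its own random stream
    drift = (rate - 0.5 * sigma ** 2) * maturity
    vol = sigma * math.sqrt(maturity)
    total = 0.0
    for _ in range(n_paths):
        s_t = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        total += max(s_t - strike, 0.0)
    return total


def mc_call_price(n_jobs, paths_per_job, s0, strike, rate, sigma, maturity):
    """Estimate C0 = D * (1/N) * sum of V(omega), splitting the N paths into jobs."""
    args = [(seed, paths_per_job, s0, strike, rate, sigma, maturity)
            for seed in range(n_jobs)]
    # Threads stand in for worker nodes here; the jobs never communicate.
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_sums = list(pool.map(lambda a: simulate_chunk(*a), args))
    discount = math.exp(-rate * maturity)
    return discount * sum(partial_sums) / (n_jobs * paths_per_job)


def black_scholes_call(s0, strike, rate, sigma, maturity):
    """Closed-form benchmark used only to check the estimator."""
    d1 = (math.log(s0 / strike) + (rate + 0.5 * sigma ** 2) * maturity) / (
        sigma * math.sqrt(maturity))
    d2 = d1 - sigma * math.sqrt(maturity)

    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return s0 * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)
```

Calling `mc_call_price(8, 20000, 100.0, 100.0, 0.05, 0.2, 1.0)` and comparing the result with `black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)` gives a quick sanity check. In CPython, threads do not speed up CPU-bound work; on a real cluster each call to `simulate_chunk` would be shipped to a separate process or node.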

To illustrate the difference between task-parallel and data-parallel
implementations, we use the following example:

Example 17.2: Backtesting Investment Strategy

Depending on how you implement the computation tasks, backtesting
can be either task-parallel or data-parallel.
Suppose you need to backtest a basket of different investment strategies
to identify the optimal ones. The processing of each investment
strategy is independent of the others, so the strategies can be run
simultaneously. Distributing the processing of different strategies to
different computing nodes is an embarrassingly parallel and task-parallel
implementation.

Algorithm 17.1: Task-Parallel Backtesting

Input: A set of investment strategies, historical data sample;
Output: PnL, Portfolio Attribution, Risk Exposures, etc.;
for i ← 1 to number of strategies do
    PnL[i] ← ProfitAndLoss(strategy i, data sample);
    PA[i] ← PortfolioAttribution(PnL[i]);
    RE[i] ← RiskExposure(PnL[i]);
end
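A minimal Python sketch of the task-parallel loop is given below. The three toy strategies, the `profit_and_loss` rule, and the thread pool standing in for computing nodes are hypothetical choices made for illustration; attribution and risk exposure are omitted to keep the sketch short. Each strategy is one job, and the jobs share only the read-only data sample.

```python
from concurrent.futures import ThreadPoolExecutor


def profit_and_loss(strategy, returns):
    """Daily PnL: position chosen from yesterday's return, applied to today's."""
    return [strategy(returns[t - 1]) * returns[t] for t in range(1, len(returns))]


def momentum(prev_return):
    """Long after an up day, short otherwise."""
    return 1.0 if prev_return > 0 else -1.0


def contrarian(prev_return):
    """Opposite position to the momentum rule."""
    return -momentum(prev_return)


def buy_and_hold(prev_return):
    """Always long."""
    return 1.0


def backtest_all(strategies, returns, max_workers=4):
    """Run one job per strategy and return the total PnL of each strategy."""
    names = list(strategies)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pnls = pool.map(lambda name: profit_and_loss(strategies[name], returns), names)
        return {name: sum(pnl) for name, pnl in zip(names, pnls)}
```

A call such as `backtest_all({"momentum": momentum, "contrarian": contrarian}, returns)` returns a dictionary keyed by strategy name, which can then be ranked to pick the best performers.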

Backtesting of one strategy can also be implemented as data-parallel.
By generating subsamples from the test data set (e.g., by bootstrapping),
the strategy can be processed on different subsamples simultaneously. The
results on the different subsamples are then aggregated to generate the
performance and risk report of the strategy.

Algorithm 17.2: Data-Parallel Backtesting

Input: Investment strategy, data subsamples;
Output: PnL, Portfolio Attribution, Risk Exposures, etc.;
for i ← 1 to number of data subsamples do
    PnL[i] ← ProfitAndLoss(strategy, data subsample i);
end
PnL_total ← aggregate(PnL);
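The data-parallel variant can be sketched in the same way. The block bootstrap, the toy momentum rule, and the choice of mean and standard deviation as the aggregation step are assumptions made for this illustration only; the chapter does not prescribe a particular resampling scheme.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def momentum_pnl(returns):
    """Total PnL of a toy rule: long after an up day, short otherwise."""
    return sum((1.0 if returns[t - 1] > 0 else -1.0) * returns[t]
               for t in range(1, len(returns)))


def block_bootstrap(returns, block_len, seed):
    """Resample contiguous blocks so that short-range dependence is kept."""
    rng = random.Random(seed)
    sample = []
    while len(sample) < len(returns):
        start = rng.randrange(len(returns) - block_len + 1)
        sample.extend(returns[start:start + block_len])
    return sample[:len(returns)]


def data_parallel_backtest(returns, n_subsamples, block_len=5, max_workers=4):
    """Run one job per subsample, then aggregate the per-sample PnLs."""
    subsamples = [block_bootstrap(returns, block_len, seed)
                  for seed in range(n_subsamples)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pnls = list(pool.map(momentum_pnl, subsamples))
    return {"mean": statistics.mean(pnls),
            "stdev": statistics.stdev(pnls),
            "n": len(pnls)}
```

Here the same strategy runs on every job and only the data differ, which is the defining feature of the data-parallel layout; the spread of the per-sample PnLs gives a rough measure of how sensitive the strategy is to the sample.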

17.2.2 Glossary
• Computing instance: A (virtual) server instance that is linked
to a computing network to provide computing resources. To offer flexibility
to their customers, cloud vendors offer different types of nodes
that comprise various combinations of CPU (central processing unit),
memory, storage, and networking capacity. (A list of different computing
nodes is provided in Section 17.5.)
• Data center: A data center comprises a large number of computing nodes,
the network that connects these nodes, and the facilities necessary to house
the computer system.
• Server–worker nodes: Server–worker nodes are a typical mechanism to
coordinate the computation between different computing nodes. The
server nodes assign computing tasks to worker nodes and collect results
from worker nodes. Worker nodes receive instructions from server nodes,
execute the computations, and send the results back to the server.
• Middleware: Middleware is computer software that “glues” software
applications to computing hardware. In cloud computing, middleware
is used to enable communications and management of data.
• Job scheduler and resource manager: Software that optimizes the usage of
computing resources based on the resources available and job priority.
Commercial solutions usually package job scheduling and resource
management with middleware.
• Virtualization: Using computer resources to imitate other computer
resources. With virtualization, users are not locked into specific operating
systems, CPU architectures, and so on. Thus middleware and virtualization
are particularly important to ensure on-demand self-service of cloud
computing.
• Cloud bursting: Cloud computing offers on-demand service. Cloud bursting
refers to the process of dynamic deployment of software applications.
• Elastic computing: Elastic computing is a computing service that has
the ability to scale resources to meet requirements.
• Public cloud: A public cloud is a cloud computing service that is available
to the public and can be accessed through the Internet.
• Private cloud: A private cloud, in contrast to a public cloud, is not
available to the public. The computing resources are dedicated to select users.
• Hybrid cloud: A hybrid cloud is a cloud computing service that combines
different types of services, for example, public and private clouds, and
allows workloads to move between them. This flexibility allows users to
optimize the allocations to reduce cost while still having direct control of
their environments.
• Wall clock time (WCT): Wall clock time is the human perception of the
passage of time from the start to the completion of a task.
• CPU time: The amount of time for which a CPU was used for processing
instructions of a computer program or operating system, as opposed to,
for example, waiting for input/output (I/O) operations or entering
low-power (idle) mode.
• Workload: In cloud computing, workload is measured by the amount of
CPU time and memory consumption.
• CPU efficiency: CPU efficiency is measured as the CPU time used for
computation divided by the sum of CPU time and I/O time used for data
transfer. Thus CPU efficiency measures the overhead of parallelizing a
computation. A low CPU efficiency, in general, indicates a high overhead.
• Acceleration factor: The acceleration factor is the wall clock time of
running the program locally on the end user’s computer divided by the
wall clock time of running it on the cloud. In an ideal case, the
acceleration factor is linear in the number of cores used for computation.
• Total cost of ownership (TCO): TCO measures both direct and indirect
costs of deploying the solution. In cloud computing and alternative
computing solutions, TCO includes the cost of hardware, software, operating
expenses (such as infrastructure, electricity, outage cost, and so on),
and long-term expenses (such as replacement, upgrade and scalability
expenses, decommissioning, and so on).
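The two performance measures in the glossary translate directly into code. The sketch below follows the definitions above; the third function, which expresses the acceleration factor as a share of the ideal linear case, is our own addition for illustration, and all numbers in the usage note are hypothetical.

```python
def cpu_efficiency(cpu_seconds, io_seconds):
    """CPU time divided by the sum of CPU time and I/O time for data transfer."""
    return cpu_seconds / (cpu_seconds + io_seconds)


def acceleration_factor(local_wall_seconds, cloud_wall_seconds):
    """Local wall clock time divided by wall clock time on the cloud."""
    return local_wall_seconds / cloud_wall_seconds


def share_of_ideal_acceleration(acceleration, n_cores):
    """Fraction of the ideal (linear in cores) acceleration actually achieved."""
    return acceleration / n_cores
```

For example, a job with 540 s of CPU time and 60 s of I/O time has a CPU efficiency of 0.9. A program that takes 3600 s locally and 75 s on 64 cloud cores has an acceleration factor of 48, which is 75% of the ideal linear case.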

17.3 Financial Applications of Cloud Computing


17.3.1 Derivative valuation and pricing
One of the core businesses of the front office is derivative pricing. Even
though the market for exotic derivatives shrank after the 2008–2009 crisis,
the exoticization of vanilla products has increased the complexity of
the valuation process. Moreover, numerical methods are needed with certain
models, even for vanilla options, such as nonaffine variance models,
infinite-activity jump models, and so on.
The valuation process usually requires high-performance numerical solutions,
as well as a high-performance technology platform. The size of the
book and the time criticality require a platform that is suitable both for
handling large data and for massive computing. A recent paper by
Kanniainen et al. (2014) evaluates the computational performance of
Monte Carlo methods for option pricing. With the aid of cloud computing
and the Techila Middleware Solution (for more information, please visit:
[Link]), the time needed to value option contracts using Monte Carlo
methods is comparable with that of other numerical methods:

. . . valuate once the 32,729 options in Sample A using the Heston–Nandi
(HN) model was 25 s with the HN quasi-closed-form solution
and 249 s with the Monte Carlo methods. Moreover, with cloud com-
puting with the Techila middleware on 173 Azure extra small virtual
machines (173 × 1 GHz CPU, 768 MB RAM) and the task divided
into 775 jobs according to 775 sample dates, the overall wall clock
time was 55 s and the total CPU time 44 min and 33 s. The Monte
Carlo running times were approximately the same for GJR and
NGARCH. Substantially shorter wall clock times can be recorded if
more workers (virtual machines) are available on the cloud or if the
workers are larger (more efficient). Then the wall clock time differs
very little between the HN model with the quasi-closed-form solu-
tion on a local computer and the HN model or some other GARCH
model (such as GJR or NGARCH) with the Monte Carlo methods
on a cloud computing platform. Consequently, with modern accel-
eration technologies closed-form solutions are no longer a critical
requirement for option valuation.
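Reading these figures against the glossary definitions gives a rough sense of scale. The sketch below assumes that the 249 s figure is the wall clock time of the Monte Carlo run on a local computer, which the quotation suggests but does not state explicitly. The utilization ratio in the last line is our own illustrative measure, not one reported in the paper.

\[
\text{acceleration factor} \approx \frac{249\ \text{s}}{55\ \text{s}} \approx 4.5,
\qquad
\text{total CPU time} = 44 \times 60 + 33 = 2673\ \text{s},
\]
\[
\frac{\text{total CPU time}}{\text{workers} \times \text{wall clock time}}
= \frac{2673}{173 \times 55} = \frac{2673}{9515} \approx 0.28 .
\]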

Another key postcrisis trend is the proliferation of XVAs (Funding Valuation
Adjustment [FVA], Credit Valuation Adjustment [CVA], and so on). XVA
books are usually huge and the time constraints on the valuation are tight.
An industry success is the award-winning in-house system of
Danske Bank. The combination of advanced numerical techniques and a modern
computing platform allows real-time pricing of derivative counterparty risk.

17.3.2 Risk management and reporting


The financial crisis also reshaped the business of risk management in the
financial industry. The implementation of Solvency II for insurance and
Basel III for banking poses new challenges to financial computing.
First, the computations are highly resource-intensive. Second, the
computation needs are dynamic rather than static. Risk reports (Solvency II,
Basel III, and so on) are required at monthly or quarterly frequency.
Computation needs are therefore periodic, peaking before each reporting
deadline. Building and managing a dedicated data center to meet the
computation need at its peak significantly increases the cost, while most of
the computing resources are wasted during the less intensive periods.
Cloud computing has the advantage of being scalable, which allows it to
meet the dynamic computation needs of the financial industry. Google
search volume for cloud computing increased rapidly after 2009, and so did
the search volumes for Solvency II and Basel III. We are not suggesting any
causality between the increasing attention to cloud computing and that to
risk regulation. However, such trends show that cloud computing became
popular at the right time to serve as a potential solution
for regulation-oriented computation needs.
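The cost argument for periodic workloads can be made concrete with a small calculation. The demand profile and the prices below are purely hypothetical and are chosen only to show the mechanics: a dedicated installation is sized for the peak and paid for in every hour, whereas on-demand capacity is paid for only when used, typically at a higher unit price.

```python
def dedicated_cost(hourly_demand, cost_per_node_hour):
    """Capacity is sized for the peak and paid for in every hour."""
    return max(hourly_demand) * len(hourly_demand) * cost_per_node_hour


def on_demand_cost(hourly_demand, price_per_node_hour):
    """Only the nodes actually used in each hour are paid for."""
    return sum(hourly_demand) * price_per_node_hour


def utilization(hourly_demand):
    """Average demand divided by peak demand."""
    return sum(hourly_demand) / (max(hourly_demand) * len(hourly_demand))


# Hypothetical 30-day month: 20 nodes as a baseline, 500 nodes in the
# last 48 hours before the reporting deadline.
demand = [20] * 672 + [500] * 48
```

With this profile, `utilization(demand)` is about 0.10, so roughly 90% of a peak-sized installation would sit idle. At a hypothetical 0.05 per node-hour for dedicated capacity and 0.20 for on-demand capacity, `dedicated_cost(demand, 0.05)` is about 18,000 while `on_demand_cost(demand, 0.20)` is about 7,488, despite the fourfold higher unit price.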
The financial industry has started to embrace cloud solutions, especially when
they are integrated to support the need for effective and timely risk
management. IBM’s survey (reference) on the implementation of cloud computing
for Solvency II in the insurance industry points out the trend of adopting cloud
computing as part of the implementation strategy for risk management. Of
the 19 firms, 27% either have successfully implemented cloud solutions or are
in the process of implementing cloud solutions. Another 23.8% have started
considering cloud solutions.
One of the key questions is whether a cloud solution is cost-efficient. Little
(2011) from Moody’s Analytics analyzes the potential usage of the cloud for
economic scenario generation and Solvency II in general, and concludes:

Building a Solvency II platform on the cloud is a realistic and
cost-effective option, especially when scenario generation and Asset
Liability Modeling are both performed on a cloud.

17.3.3 Quantitative trading


Lower barriers to market participation drive the development of more
complicated trading strategies as well as a technology race.
Quantitative trading, especially high-frequency trading, requires a quick
time-to-production as well as a quick time-to-market.
Fast R&D of trading strategies and timely backtesting
significantly shorten time-to-market. Firms, by taking advantage of cloud
computing, are generating alpha even before trading strategies are actually
implemented in the market.
The quick prototyping and backtesting of strategies requires close-to-data
computing as well as adaptability to the heterogeneity of developer tools, such
as different end-user applications, different data storage types, and different
programming languages. Adaptability to different end-user software is one of
the key features of mature cloud computing platforms.

17.3.4 Credit scoring


Cloud computing is arguably the solution for big data problems in finance.
One typical big data problem in the finance industry is credit scoring. Credit
scoring is the procedure by which lenders, such as banks and credit card
companies, evaluate the potential risk posed by lending money to consumers.
It has been widely treated as a classification problem in the machine learning
literature; see Hand and Henley (1997), West (2000), Baesens et al. (2003),
and many others.
The large number of consumers and the variety of credit report formats
create a big data problem. To solve the classification problem over the massive
data set of consumers’ credit histories, an efficient data storage and
processing system is required.
In summary, modern financial computing requires:

• Adaptability to the heterogeneity of end-user software

• Processing large data and close-to-data computing

• Massive computing

• Data security

In this chapter, we introduce details of how modern cloud computing can help to solve these problems. There are also innovative cloud-supported business models built on the concept of sharing, such as cooperative quantitative strategy development platforms. While we focus on the massive computation part of cloud computing for financial engineering, we refer readers who are interested in those platforms to Internet resources.

17.4 The Nature of Challenges


Integrating massive computing power into existing IT systems may face several challenges. The finance industry poses certain specific requirements on cloud computing.
System needs to be multitenant: The system needs to support multiple users accessing the computing resource at the same time. Meanwhile, the system has to be smart enough to allocate computing resources based on the priority and needs of the tasks. These requirements arise from the heterogeneous and dynamic nature of financial computing. Computing from different desks has different priorities and uneven demand for resources.
Compliance requirement and cybersecurity: The finance industry operates with public data/information as well as business-critical private data/information. Due to compliance requirements, a hybrid system needs to keep computing with sensitive data in-house while allowing external computing resources to be used with nonconfidential data.
A unified platform for quality assurance: Large financial organizations have teams supporting local business operations across the world. A unified platform makes life easier for quantitative support and model validation/review teams by guaranteeing the consistency and coherency of data and models for users.
IT legacy: Maintaining a massive legacy code base is a huge task for IT departments. Any change that requires a complete rewrite of the code would be a nightmare. Thus IT systems need a high level of adaptability so that they integrate effortlessly with existing libraries.

17.5 Implementation and Practices


In the previous section, we reviewed the technical and nontechnical challenges in the integration process of cloud computing in the finance industry. Luckily, with the development of commercial cloud computing services, the complexity of cloud computing is hidden behind user-friendly interfaces.
To ease the use of cloud computing, many software programs and computing frameworks were developed during the past decade. To mention a few, popular computing frameworks include Hadoop+MapReduce and Apache Spark. Middleware solutions such as Techila Middleware and Sun Grid Engine also help commercial users to distribute computing tasks to computing nodes. In this section, we use Techila Middleware as an example to show how the challenges in Section 17.4 can be handled by a cloud computing solution. Then we walk readers through the procedure of implementing cloud computing in practice.

17.5.1 Implementation example: Techila middleware with MATLAB

Techila Middleware Solution, developed by Techila Technologies Ltd,3 is a commercial software solution that aims to provide user-friendly integration of cloud computing. The service structure of Techila is shown in Figure 17.5. The design of the service structure allows access to both on-premise and external computing resources, so that compliance requirements are met when necessary. The system is multitenant: users can assign different priorities to computing jobs sent to the Techila system through a secured gateway. The jobs are scheduled according to the availability of resources and their priorities. The solution hides the complexity of integration with heterogeneous end-user software behind a user-friendly interface. To illustrate this, we provide an example of using Techila with MATLAB. For more information on the programming languages and software that Techila supports, please refer to the company’s website at: [Link].
Before using the Techila solution, the end user needs to perform some minor configuration. Readers are referred to Techila’s online documents for more details ([Link]).

3 For more information, please visit: [Link]


FIGURE 17.5: Techila high-level architecture. (Diagram labels: company’s trusted IT infrastructure; external data sources; time-critical business; users; secure gateway; fast computing; company’s on-premises datacenter; resources in company’s trusted cloud(s).)

Once Techila is successfully installed, using cloud computing with existing code developed in MATLAB is straightforward.
Suppose an end user has code that contains the following for-loop structure:

    function result = local_loops(loops)

    result = zeros(1, loops);
    for counter = 1:loops
        result(counter) = counter * counter;
    end

To parallelize the computation inside the for loop on cloud computing resources, the end user only needs to make a minor change to the code:

    function result = run_loops_dist(loops)
    % the only change is replacing for-end with cloudfor-cloudend
    result = zeros(1, loops);
    cloudfor counter = 1:loops
        result(counter) = counter * counter;
    cloudend
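
For readers who want to experiment with the same loop-to-parallel-map pattern without a Techila installation, a minimal sketch in Python follows. It is not Techila’s API: the standard-library thread pool is only a local stand-in for remote worker nodes, and the helper name `square` and the parameter `max_workers` are introduced here purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def square(counter):
    """Loop body: the work done for a single iteration."""
    return counter * counter


def local_loops(loops):
    """Sequential version: iterations run one after another."""
    return [square(counter) for counter in range(1, loops + 1)]


def run_loops_dist(loops, max_workers=4):
    """Parallel version: iterations are handed to a pool of workers.

    The thread pool is only a local stand-in for remote worker nodes.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(square, range(1, loops + 1)))


if __name__ == "__main__":
    print(local_loops(5))
    print(run_loops_dist(5))
```

As in the MATLAB example, the loop body is unchanged; only the construct that drives the iterations differs.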

17.5.2 Computational needs


Before making a decision to adopt any HPC solution, a key step is to
understand your computational need and usage pattern. The best choice of
solution depends on the answers to the following questions:

• Do you have a computational bottleneck?


• Where is the computational bottleneck?

[Link] Do you have a computational bottleneck?


The question may seem trivial at first sight. A computational bottleneck exists when the current computing resource cannot process the computing tasks within given time constraints. In many cases, however, the computational bottleneck arises from another dimension, namely the time required to upgrade computing resources to meet increased demand.
The point we would like to emphasize is that the planning of computing needs to be forward-looking. While quants, researchers, and developers are aware of the computational bottleneck, it is usually the IT department’s decision whether to expand current IT resources. The procedure may take some time. Thus forward-looking planning is critical in order to ensure an efficient and effective response to the computational need.
The scalability and elasticity of cloud computing may offer an alternative solution by providing computing resources on demand.

[Link] Where is your computational bottleneck?


A typical computational bottleneck from the finance industry poses one of
the following types of challenges:

1. Massive computational time exceeds time constraints

2. Massive memory consumption exceeds limited memory and storage


3. Dynamic usage pattern meets nonscalable computing resource.

There are several solutions for the type 1 challenge. In a production sce-
nario, for example, when implementing a high frequency trading algorithm,
hardware accelerators, such as FPGA and GPU, may be better alternatives
to cloud computing.4 The reasons are:

4 In practice, however, some firms implement their algorithms in the cloud to benefit from lower latency in their connection to the exchange when colocation is not possible or too costly to implement.

1. Execution time is critical. Hardware acceleration may be the only way to accelerate the algorithm.

2. No frequent reconfiguration is needed. The cost of configuring and adapting the algorithm for hardware acceleration is less than the profit gained from shorter execution time.

In an R&D scenario, by contrast, time-to-market is more important. The easy implementation and massive parallelism provided by cloud computing enable researchers and developers to quickly prototype and backtest algorithms and models.
Cloud computing is also an economical solution to type 2 and type 3 challenges. Distributed storage and memory built from relatively cheap hardware significantly reduce the hardware investment compared with expensive local instances that have adequate memory and storage. The elasticity of cloud computing provides users with on-demand service. In a type 3 challenge, an investment in machines that can handle computing demand at its peak is a waste of resources during periods when computing demand is less intensive.

17.5.3 Solution selection


Both performance and cost should be taken into consideration when choosing a cloud vendor. Amazon Elastic Compute Cloud (EC2, offered through Amazon Web Services, AWS), Google Compute Engine (GCE), and Microsoft Azure (Azure) are three popular cloud computing platforms.
Vendors provide a variety of instance types. For example, the cloud
instances used in a recent benchmark by Techila include four different
instances from AWS and Azure and two instances from GCE as listed in
Table A.1 of (Techila, 2015, Appendix A: Cloud Platform Specifications).
These instances are optimized for different purposes: CPU, memory, I/O, cost,
storage, and so on.
Vendors adopt different pricing models. The cost of using cloud comput-
ing is affected by the pricing model. Table 17.1 provides an overview of pric-
ing models adopted by vendors for data centers based in Europe. The table
reports price per instance (PPI), price per CPU core (PPC) and the billing
granularity for AWS, Azure, and GCE.5 Generally speaking, a minute-based
(even seconds-based) pricing model offers more flexibility for the utilization of
cloud computing. However, the difference among pricing models is relatively
insignificant when the computation is massive.
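
The effect of billing granularity can be illustrated with a small calculation. The sketch below is our own illustration, not part of Techila’s benchmark: the function name `billed_cost`, its parameters, and the 3-minute job are hypothetical, while the prices are the Windows PPI values from Table 17.1 and the 10-minute minimum for GCE follows footnote 5.

```python
import math


def billed_cost(runtime_minutes, price_per_hour, granularity_minutes, minimum_minutes=0):
    """Cost billed for one instance under a given billing granularity.

    runtime_minutes     -- actual running time of the instance
    price_per_hour      -- price per instance (PPI) in USD/h
    granularity_minutes -- billing increment (60 for hourly, 1 for per-minute)
    minimum_minutes     -- minimum charge, if any
    """
    billable = max(runtime_minutes, minimum_minutes)
    # Round the billable time up to a whole number of billing increments
    increments = math.ceil(billable / granularity_minutes)
    return increments * granularity_minutes / 60 * price_per_hour


# A hypothetical 3-minute job on one Windows instance of each platform
aws = billed_cost(3, 1.667, granularity_minutes=60)                    # c4.4xlarge
azure = billed_cost(3, 3.5, granularity_minutes=1)                     # A11
gce = billed_cost(3, 1.52, granularity_minutes=1, minimum_minutes=10)  # n1-standard-16

if __name__ == "__main__":
    print(round(aws, 3), round(azure, 3), round(gce, 3))
```

For the short job, the hourly model bills a full hour even though the instance has the lowest hourly price after GCE. For a job of 601 minutes, the same function bills 660 minutes under the hourly model, so rounding adds less than 10% to the billed time.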
Depending on the vendor, instance types and the operating systems,
instances have varying time consumption for configuration and deployment.

5 GCE machine types are charged a minimum of 10 minutes. After 10 minutes, instances are charged in 1 minute increments, rounded up to the nearest minute.


TABLE 17.1: Pricing model and cost

| Cloud platform | Instance type | PPI (USD/h) | PPC (USD/h) | Billing granularity |
|---|---|---|---|---|
| AWS | c4.4xlarge (Win) | 1.667 | 0.1041875 | Hour |
| AWS | c4.4xlarge (Linux) | 1.003 | 0.0626875 | Hour |
| AWS | c3.8xlarge (Win) | 3.008 | 0.094 | Hour |
| AWS | c3.8xlarge (Linux) | 1.912 | 0.06 | Hour |
| Azure | A11 (Win) | 3.5 | 0.219 | Minute |
| Azure | A11 (Linux) | 2.39 | 0.149 | Minute |
| Azure | D14 (Win) | 2.372 | 0.148 | Minute |
| Azure | D14 (Linux) | 1.542 | 0.096 | Minute |
| GCE | n1-standard-16 (Win) | 1.52 | 0.095 | Minute |
| GCE | n1-standard-16 (Linux) | 0.88 | 0.055 | Minute |

FIGURE 17.6: Configuration time. (Plot of configuration time in seconds, on an axis from 0 to 300, for Azure-A8 (Win), Azure-extra large (Win), AWS-c3.8xlarge (Win), AWS-c3.8xlarge (Linux), and GCE-n1-standard-8 (Linux); the plot carries the annotation “What can cause this difference?”)

According to Techila’s 2014 benchmark report (Techila, 2014), the differences are significant, as shown in Figures 17.6 and 17.7. These are nonnegligible factors for the elasticity of cloud computing. As a rule of thumb, the operating system, instance type, and vendor should be chosen according to the end user’s version of software applications.
We cite the test results from two recent benchmark reports (Techila, 2014,
2015) from Techila to provide readers an impression of the cost of utilizing
cloud computing. The test cases simulate real-world financial applications in
many areas including: portfolio analytics, machine learning, option pricing,
backtesting, model calibration, and so on. However, readers should be aware
of the different nature of these applications, whether they are high I/O, high
memory consumption, or high CPU consumption.
Figure 17.8 summarizes the cost of computing for different vendors and instances versus the performance of computing. Table 17.2 provides the cost in the portfolio simulation case.6 The simplified cost provides the cost per unit of computation after correcting for the difference in pricing model. For example, for portfolio simulation, the simplified cost ranges from under 0.6 USD (for example, 0.58 USD for GCE with an n1-standard-16 instance on the Linux Debian 7 operating system) to 1.99 USD (Azure A11 on Windows Server 2012 R2). The difference is significant (a factor of roughly 3.5). However, if we take into consideration the pricing model, the real cost (that is, the billing from the vendor) differs even more. In Table 17.2, for the same operating system, the cost of using AWS is roughly 6 to 14 times the cost of using Azure or GCE. This is because AWS uses an hour-based pricing model. Users should plan their computation in units of whole hours to reduce the cost of computation with AWS.

6 For more information on other user scenarios, please refer to Techila’s benchmark report “Cloud HPC in Finance” (Techila, 2015).

FIGURE 17.7: Deployment time. (Plot of CPU cores online, in %, against time in seconds, on an axis from 0 to 1200, for GCE-n1-standard-8 (Linux), AWS-c3.8xlarge (Linux), Azure-extra large (Win), AWS-c3.8xlarge (Win), and Azure-A8 (Win).)

FIGURE 17.8: Cost versus performance. (Plot titled “Normalized price-performance”: normalized price against normalized average execution time on reference servers, both on axes from 0 to 1, for the ten instance and operating system combinations listed in Table 17.2.)

TABLE 17.2: Cost of cloud computing case: Portfolio simulation

| Cloud platform | Instance type | PPC (USD/hour) | Cost (USD) | Simplified (USD) |
|---|---|---|---|---|
| AWS | c4.4xlarge (Win) | 0.104 | 26.672 | 0.926 |
| AWS | c4.4xlarge (Linux) | 0.063 | 16.048 | 0.566 |
| AWS | c3.8xlarge (Win) | 0.094 | 24.064 | 1.324 |
| AWS | c3.8xlarge (Linux) | 0.060 | 15.296 | 0.701 |
| Azure | A11 (Win) | 0.219 | 2.8 | 1.991 |
| Azure | A11 (Linux) | 0.149 | 1.275 | 1.073 |
| Azure | D14 (Win) | 0.148 | 1.898 | 1.550 |
| Azure | D14 (Linux) | 0.096 | 1.234 | 0.829 |
| GCE | n1-standard-16 (Win) | 0.095 | 4.053 | 1.486 |
| GCE | n1-standard-16 (Linux) | 0.055 | 2.347 | 0.583 |
The report provides valuable insight about the effect of pricing models
on the cost of cloud computing. Together with the benchmarks on instance
performances, this should provide readers some information on how to choose
cloud vendors.

17.5.4 Algorithm design


Designing an algorithm that is well suited to a specific problem can significantly boost performance and the benefit gained from cloud computing.
We use the following simple example to illustrate how the design of the algorithm can change the performance.

Example 17.3: Distributed Matrix Multiplication


M is a matrix of size $d \times n$ and N is another matrix of size $n \times d$. The matrix product $G = MN$ can be computed via two schemes.

• Scheme 1: Inner Product. The entry in the ith row and jth column of matrix G is $G_{i,j} = \sum_{r=1}^{n} M_{i,r} N_{r,j}$.

    % Scheme 1: via inner product
    cloudfor i = 1:d
        cloudfor j = 1:d
            G(i,j) = M(i,:) * N(:,j);
        cloudend
    cloudend

• Scheme 2: Outer Product. Alternatively, we can return the matrix G as the sum of the outer products between corresponding columns of M and rows of N, $G = \sum_{r=1}^{n} M_{:,r} N_{r,:}$.

    % Scheme 2: via outer product
    cloudfor r = 1:n
        %cf:sum=G
        G = M(:,r) * N(r,:);
    cloudend

The %cf:sum=G command is used to sum the return value from each worker node.
The two schemes differ in both storage complexity and computational complexity.

1. To send the data, scheme 1 requires a distributed storage of $O(nd^2)$ while scheme 2 requires $O(nd)$.
2. To return the result, scheme 1 requires a local storage of $O(1)$ and a total of $O(d^2)$ while scheme 2 requires a local storage of $O(d^2)$ and a total of $O(nd^2)$.
3. Both schemes require a distributed computation of $O(nd^2)$. Scheme 1 requires a local computation of $O(n)$ on each worker node while scheme 2 requires $O(d^2)$.

Depending on the relative values of n and d, one of the schemes outperforms the other. In terms of computation, the outer product scheme parallelizes the computation along n and is thus preferred when n is large, while the inner product scheme is preferred when d is large.
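
The two decompositions can be checked against each other on a small example. The following Python sketch is an illustration under our own assumptions rather than Techila code: the loops run sequentially, the comments mark what would be one job in a distributed run, and the function names are hypothetical.

```python
def inner_product_scheme(M, N):
    """Scheme 1: one job per entry G[i][j]; a job needs one row of M and one column of N."""
    d, n = len(M), len(N)
    G = [[0.0] * d for _ in range(d)]
    for i in range(d):              # each (i, j) pair would be one job
        for j in range(d):
            G[i][j] = sum(M[i][r] * N[r][j] for r in range(n))
    return G


def outer_product_scheme(M, N):
    """Scheme 2: one job per index r; a job returns a full d-by-d outer product."""
    d, n = len(M), len(N)
    G = [[0.0] * d for _ in range(d)]
    for r in range(n):              # each r would be one job
        partial = [[M[i][r] * N[r][j] for j in range(d)] for i in range(d)]
        # Summing the partial results plays the role of the reduction step
        for i in range(d):
            for j in range(d):
                G[i][j] += partial[i][j]
    return G


M = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]   # d = 2, n = 3
N = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]        # n = 3, d = 2

if __name__ == "__main__":
    print(inner_product_scheme(M, N))
    print(outer_product_scheme(M, N))
```

Scheme 1 creates d² small jobs that each return one number, whereas scheme 2 creates n jobs that each return d² numbers, which is the trade-off summarized in the list above.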

17.6 Case Studies


17.6.1 Portfolio backtesting
In this section, we demonstrate how portfolio backtesting can be accelerated with a distributed computing technique, in particular the Techila Middleware solution. Backtesting is widely used in the financial industry to estimate the performance of a trading strategy or a predictive model using historical data. Instead of gauging the performance going forward, which may take many years, traders/portfolio managers can measure the effectiveness of their strategies and understand the potential risks by backtesting on a prior time period using the datasets that are available today. Computer simulation of the strategy/model is the main part of a modern backtesting procedure. It can be very time-consuming due to a few computing issues that arise during the procedure. Thus it is necessary to seek acceleration using modern techniques and to shorten time-to-market in the rapidly changing financial world.

[Link] Potential computing bottleneck


Backtesting requires three main components: historical datasets, a trading strategy/model, and a performance measure. The following computing issues might arise for each of the components:

1. The datasets used for testing might be huge while the requested output (performance measure) is relatively small. For example, suppose a portfolio consists of N assets. Its historical return data series over the past T time periods has size N × T, and the covariance matrix is of size N × N. For N = 75,000, the full covariance matrix takes 45 GB of memory (double precision).

2. The simulation of the strategies/models can be computationally intensive. The intensity of computing increases with the complexity of the strategies/models. Complex logic branching operations can also be involved.

3. The evaluation of the performance measure can also be time-consuming. Measures that are based on the Monte Carlo approach require the simulation of thousands of paths or more.

Cloud computing has a natural advantage in processing large data. In general, CPU threads perform better than GPU threads, especially in handling complex logic branching operations. Thus cloud computing seems to be a suitable technique for accelerating the backtesting procedure. To illustrate how to use cloud computing for backtesting, we ran some experiments in the Microsoft Windows Azure Cloud, as well as on a local cluster. The results are presented in the following sections.

[Link] Computing environment and architecture


By installing the Techila SDK on their computer, end users (traders/portfolio managers) can use Techila-enabled computing tools (MATLAB, R, Python, Perl, C/C++, Java, FORTRAN, and so on) to access the computing resources managed by Techila Server.
Techila Server works as a resource manager, as well as a job scheduler. Computational jobs are distributed through Techila Server to Techila Workers, which are machines in the cloud (Azure, Google Compute Engine, Amazon EC2, and so on) or in a local cluster. When the computation on a worker node has finished, the requested results are sent back to the end user through Techila Server. In our experiment, we use the Techila environment on Windows Azure Cloud. The testing code is written in MATLAB. The computational jobs (optimization and evaluation of each data set) are sent to each of the worker nodes (virtual machines in Azure).
When dealing with large data sets that exceed a single machine’s capability, there are two solutions: (1) data sets can be stored in a common storage that can be accessed by each of the workers (Blob on Azure, for example); (2) under a specific license, workers can directly access data sources such as Bloomberg, Thomson Reuters, and so on.

[Link] Experiment design and test result


The callback feature of Techila enables streaming of results as soon as a computational job is finished. It can also be used to monitor intermediate results of a computational job. This enables us to update the visualization of the result whenever a computational job is finished. To visualize the result, we plot the time evolution of the efficient frontier over the backtesting period as a 3D surface. We also plot the maximum Sharpe ratio portfolio as a 3D line. In the mean-variance optimization framework of Markowitz, this portfolio is the tangency portfolio. Thus we should expect the line to lie on the surface7 as shown in Figure 17.9.
We first perform a small-scale test using 20 stocks. The results based on cloud computing are consistent with the results generated from a local run on a laptop. Using weekly returns from 2000 to 2013, we perform several tests using different numbers of stocks and different lengths of backtesting period. We set the historical estimation window length to 60 weeks, and the strategy is re-estimated every 3 weeks. The weekly return data from February 26, 2001, to October 7, 2013, are separated into 220 windows. A straightforward way of distributing the computational load is to treat the backtesting for each window as an independent job.
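
The window bookkeeping described above can be sketched in a few lines of Python. This is our own illustration, not the code used in the experiment: the function name `rolling_windows` and the index convention are hypothetical, and the count of 719 weekly observations is an assumption chosen to be compatible with a 60-week estimation window followed by the backtesting period quoted in the text.

```python
def rolling_windows(n_obs, window, step):
    """List the rolling backtest windows as (est_start, est_end, eval_end) index triples.

    The strategy is estimated on observations [est_start, est_end) and then
    evaluated on [est_end, eval_end) before the window moves forward by `step`.
    """
    windows = []
    start = 0
    while start + window < n_obs:
        est_end = start + window
        eval_end = min(est_end + step, n_obs)  # the last window may be shorter
        windows.append((start, est_end, eval_end))
        start += step
    return windows


# Assumed: 719 weekly observations, 60-week estimation window, 3-week step
windows = rolling_windows(n_obs=719, window=60, step=3)

if __name__ == "__main__":
    print(len(windows), windows[0], windows[-1])
```

Under these assumptions the function yields 220 windows, in line with the count reported above; each triple could then be submitted as one independent job.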
By default, Techila Middleware automatically distributes the computing project such that each job is long enough to reduce the overhead caused by data transfer. A user can also set the job length (iterations per job) using the job specification parameter.
We ran tests for 50, 100, and 500 stocks. As the number of stocks increases, the optimizer takes longer to find the portfolio that maximizes the Sharpe ratio. In fact, when the number of stocks is too large, the optimization problem might become ill-posed. However, the performance of the optimizer is not the concern of this report. Compared with simply setting stepsperworker = 1, Techila’s default setting significantly improved the CPU efficiency (CPU time/Wall clock time) for 50 and 100 stocks, while for 500 stocks the automatic scheme also assigns one step per job, as shown in Table 17.3.
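
How the job length translates into the number of jobs can be sketched as follows. This Python fragment is only an illustration: the function name `split_into_jobs` is hypothetical, and the job lengths of 4 and 3 iterations are inferred here from the job counts 55 and 74 in Table 17.3; they are our assumption and are not reported by Techila.

```python
def split_into_jobs(n_iterations, steps_per_job):
    """Group iteration indices into consecutive jobs of at most steps_per_job iterations."""
    if steps_per_job < 1:
        raise ValueError("steps_per_job must be at least 1")
    return [range(start, min(start + steps_per_job, n_iterations))
            for start in range(0, n_iterations, steps_per_job)]


if __name__ == "__main__":
    for steps in (1, 3, 4):
        jobs = split_into_jobs(220, steps)
        print(steps, len(jobs))
```

Longer jobs mean fewer transfers of code and data per unit of computation, which is one plausible reason for the higher CPU efficiency of the automatic scheme.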

7 The visualization code is adapted from Portfolio Demo by Bob Taylor at

[Link]
portfolios-with-financial-toolbox
FIGURE 17.9: Time evolution of efficient frontier. (3D plot of portfolio returns against portfolio risk over the years 2000 to 2014.)


TABLE 17.3: CPU efficiency in portfolio backtesting

| NoS | NoJ (step = 1) | ACE (step = 1) (%) | NoJ (Auto) | ACE (Auto) (%) |
|---|---|---|---|---|
| 50 | 220 | 88.14 | 55 | 96.57 |
| 100 | 220 | 90.49 | 74 | 112.79 |
| 500 | 220 | 114.13 | 220 | 114.13 |

NoS is number of stocks; NoJ is number of jobs; Auto refers to Techila’s automated job distribution scheme; step = 1 refers to assigning 1 step to each job.

17.6.2 Distributed portfolio optimization


In this section, we demonstrate how a large-scale portfolio optimization
problem can be solved with a distributed computing technique with specific
algorithm design.

[Link] Challenges in large-scale portfolio construction


Constructing an optimal portfolio consists of two steps. The first step
is to construct future belief of the return distribution, which is essentially
an inference and prediction problem. The second step is to find the optimal
portfolio weights, which is an optimization problem that deals with the trade-
off between portfolio risk and portfolio return.
On one hand, this problem is a statistical challenge. Most portfolio optimization and risk minimization approaches require estimation of the covariance matrix of the return series or of its inverse. When the sample variance is used as the expected variance, the estimation error can be large. To achieve a reasonable accuracy, as stated in DeMiguel et al. (2009), an in-sample period of 3000 months is needed for a portfolio of 25 assets to beat the naive 1/N strategy. The problem becomes even more significant when the portfolio size is large. As noted in Fan et al. (2011), estimating the moments of a high-dimensional distribution is challenging. One crucial problem is the spurious correlation that arises with the curse of dimensionality.
On the other hand, the problem is also challenging numerically. First, when the number of degrees of freedom is large, finding the optimum in a high-dimensional parameter space is almost impossible to achieve in reasonable time with general optimizers. Additionally, we need to take good care of the properties of the matrices to retain feasibility. It is also a data-intensive problem from a hardware perspective. Suppose we are dealing with 75,000 assets (data of the universe); the covariance matrix then has 2,812,537,500 parameters. That means it takes more than 20 GB of memory if we are using double precision. Last but not least, the computational cost of matrix operations on a matrix of size M × N increases linearly with the number of columns.
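
The memory figures follow from simple arithmetic, shown here as a short Python sketch. The function name and the convention 1 GB = 10^9 bytes are our own choices; a symmetric N × N covariance matrix has N(N + 1)/2 distinct entries, and a double-precision value occupies 8 bytes.

```python
def covariance_memory_gb(n_assets, bytes_per_value=8, symmetric=True):
    """Memory needed for an n_assets x n_assets covariance matrix, in GB (1e9 bytes)."""
    if symmetric:
        # Distinct entries only: upper triangle including the diagonal
        n_values = n_assets * (n_assets + 1) // 2
    else:
        n_values = n_assets * n_assets
    return n_values * bytes_per_value / 1e9


if __name__ == "__main__":
    print(75_000 * 75_001 // 2)                           # distinct parameters
    print(covariance_memory_gb(75_000))                   # distinct entries only
    print(covariance_memory_gb(75_000, symmetric=False))  # full matrix
```

Storing only the distinct entries takes about 22.5 GB, while the full dense matrix takes 45 GB.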

[Link] Algorithm design for large-scale mean-variance optimization problem

In the classical Markowitz mean-variance framework, the portfolio optimization problem is to minimize the variance for a given expected return $b = w^T \mu$. The optimum $w^*$ is a solution to

$$\min_w \; w^T C w \quad \text{s.t.} \quad w^T \mu = b, \quad w^T \mathbf{1}_N = 1.$$

This optimization problem is equivalent to solving

$$\min_w \; \mathrm{E}\left[\,|\rho - w^T r_t|^2\,\right] \tag{17.1}$$

with the same restrictions, where $\rho = \mathbf{1}_T b$. By replacing the expectation in Equation 17.1 with its sample average, the problem can be considered as a least-squares regression.
Regularization methods are introduced to address the problem that arises with estimation error via shrinkage and to achieve stability and/or sparsity. The regularization can be achieved by adding an $\ell_n$-penalty term $r(x)$ to the objective function:

$$r(x) = \lambda \|x\|_n$$

where $\lambda$ is a constant that scales the penalty term. When n = 1, the objective is a LASSO regression; when n = 2, it is a ridge regression.
In order to find a solution to this penalized problem, and to utilize modern computing environments such as computer clusters and clouds, we solve the problem using a distributed optimization technique, namely the alternating direction method of multipliers (ADMM), together with block splitting. A detailed introduction to this optimizer can be found in Boyd et al. (2011).
Notice that we can transform the constrained optimization problem into its consensus form:

$$\min_w \; \|b\mathbf{1}_T - Rw\|_2^2 + \lambda \|w\|_1 + I_C(w) \tag{17.2}$$

where $I_C$ is the indicator function, that is,

$$I_C(w) = \begin{cases} 0, & \text{if } w \in C \\ \infty, & \text{if } w \notin C, \end{cases} \tag{17.3}$$

where C is the constraint set $C = \{w \mid w^T \mu = b,\; w^T \mathbf{1}_N = 1\}$.
Here we rewrite the problem in ADMM form (denote $b\mathbf{1}_T := B$):

$$w_1^{k+1} := \operatorname{argmin}_w \left( \tfrac{1}{2}\|Rw - B\|_2^2 + \tfrac{\rho}{2}\|w - z^k + \mu_1^k\|_2^2 \right) \tag{17.4}$$
$$w_2^{k+1} := \Pi_C(z^k - \mu_2^k) \tag{17.5}$$
$$z^{k+1} := \tfrac{1}{2}\left( S_{\lambda/\rho}(w_1^{k+1} + \mu_1^k) + S_{\lambda/\rho}(w_2^{k+1} + \mu_2^k) \right) \tag{17.6}$$
$$\mu_1^{k+1} := \mu_1^k + w_1^{k+1} - z^{k+1} \tag{17.7}$$
$$\mu_2^{k+1} := \mu_2^k + w_2^{k+1} - z^{k+1}. \tag{17.8}$$

Here $\Pi_C$ denotes the projection onto C and $S_{\lambda/\rho}$ the soft-thresholding operator. The update of $w_1$ is a Tikhonov-regularized least-squares problem, which has the analytical solution:

$$w_1^{k+1} := (R^T R + \rho I)^{-1}\left(R^T B + \rho(z^k - \mu_1^k)\right)$$

Via block splitting, we can utilize a distributed computing environment and solve the problem for small data blocks of R and B in parallel, depending on the data structure. If the number of assets is larger than the length of the historical time series, allocation is preferred. In the other case, consensus is the preferable alternative, and it is also easier to implement since it is consistent with the previous decomposition:

$$w_i^{k+1} := (R_i^T R_i + \rho I)^{-1}\left(R_i^T B_i + \rho(z^k - \mu_i^k)\right).$$
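
To make the iteration concrete, the following is a minimal single-block sketch in Python. It is a simplification under stated assumptions and not the implementation used in this chapter: it omits the constraint set C and the projection step, so it solves only the unconstrained LASSO problem $\tfrac{1}{2}\|Rw - B\|_2^2 + \lambda\|w\|_1$; the function names, the dense Gaussian-elimination solver, and the default values of `rho` and `n_iter` are hypothetical choices made for illustration.

```python
def soft_threshold(v, kappa):
    """Soft-thresholding operator S_kappa applied to a scalar."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def admm_lasso(R, B, lam, rho=1.0, n_iter=200):
    """Minimize 0.5*||R w - B||^2 + lam*||w||_1 by ADMM (constraint set omitted)."""
    T, N = len(R), len(R[0])
    # R^T R + rho*I and R^T B do not change between iterations
    gram = [[sum(R[t][i] * R[t][j] for t in range(T)) + (rho if i == j else 0.0)
             for j in range(N)] for i in range(N)]
    rtb = [sum(R[t][i] * B[t] for t in range(T)) for i in range(N)]
    z = [0.0] * N
    mu = [0.0] * N
    for _ in range(n_iter):
        # w-update: Tikhonov-regularized least squares in closed form
        rhs = [rtb[i] + rho * (z[i] - mu[i]) for i in range(N)]
        w = solve_linear(gram, rhs)
        # z-update: soft-thresholding produces sparsity
        z = [soft_threshold(w[i] + mu[i], lam / rho) for i in range(N)]
        # dual update: accumulate the disagreement between w and z
        mu = [mu[i] + w[i] - z[i] for i in range(N)]
    return z


if __name__ == "__main__":
    print(admm_lasso([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0))
```

With R equal to the identity matrix, the LASSO solution is the soft-thresholded target, so the example shrinks the first weight from 3 to 2 and sets the second weight exactly to zero. In a distributed setting, the w-update would be carried out per data block on the worker nodes, while the z-update and dual updates require only the averaged results.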

17.7 Cloud Alpha: Economics of Cloud Computing


In their review of cloud computing, Armbrust et al. (2010) proposed a formula to evaluate the economic value of cloud computing by comparing it to alternative solutions8:

$$\mathit{UserHours}_{cloud} \times (\mathit{revenue} - \mathit{Cost}_{cloud}) \ge \mathit{UserHours}_{datacenter} \times \left(\mathit{revenue} - \frac{\mathit{Cost}_{datacenter}}{\mathit{Utilization}}\right).$$

The cost of cloud computing or of alternative solutions can be summarized into TCO.
Cloud computing and alternative solutions may have different risks of IT failure. The effect on the risk measure should be taken into account when evaluating the potential benefit of cloud computing. Thus we derive the following formula for the benefit as the change in revenue plus the cost reduction and the benefit of risk control:

$$\mathit{Benefit}_{cloud} = \Delta(\mathit{Revenue}) - \Delta(\mathit{TCO}) - \gamma\,\Delta(\mathit{Risk}) \tag{17.9}$$

where $\Delta(\mathit{Revenue}) = \mathit{Revenue}_{cloud} - \mathit{Revenue}_{alternative}$ is the profit difference between the cloud and the alternative solution, $\Delta(\mathit{TCO}) = \mathit{TCO}_{cloud} - \mathit{TCO}_{alternative}$ is the negative of the cost reduction, $\Delta(\mathit{Risk}) = \mathit{Risk}_{cloud} - \mathit{Risk}_{alternative}$ measures the change in risk, and $\gamma$ is the risk premium.
The optimal choice of computing solution is simply the optimum of the following Markowitz-style objective:

$$\max_{s \in S} \; \mathit{Revenue}_s - \mathit{TCO}_s - \gamma\,\mathit{Risk}_s$$

where S is the set containing all feasible computing solutions.
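
The selection rule can be written down directly. The short Python sketch below is an illustration only: the function names are hypothetical, and the revenue, TCO, and risk figures and the risk premium are invented numbers in arbitrary units, not estimates for any real system.

```python
def objective(revenue, tco, risk, gamma):
    """Markowitz-style objective for one computing solution."""
    return revenue - tco - gamma * risk


def best_solution(solutions, gamma):
    """Return the name of the solution with the highest objective value.

    solutions maps a name to a (revenue, tco, risk) tuple.
    """
    return max(solutions, key=lambda name: objective(*solutions[name], gamma))


# Hypothetical figures in arbitrary units
solutions = {
    "cloud": (100.0, 40.0, 10.0),
    "datacenter": (100.0, 55.0, 20.0),
}
gamma = 0.5

# Benefit of the cloud relative to the alternative, as in Equation 17.9
benefit_cloud = (objective(*solutions["cloud"], gamma)
                 - objective(*solutions["datacenter"], gamma))

if __name__ == "__main__":
    print(best_solution(solutions, gamma), benefit_cloud)
```

In this made-up case revenue is unchanged, so the benefit of 20 units comes from a cost reduction of 15 units plus a risk reduction of 10 units weighted by the risk premium of 0.5.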


Quantitative measuring of revenue, cost, and risk is a difficult task and
is beyond the scope of this book. Thus, in the following subsections, we only
provide qualitative analysis of cost, revenue, and risk to give some intuition
of the economics of cloud computing.

17.7.1 Cost analysis


Financial markets reduce transaction costs. As an example, asset managers issue ETFs and ETNs to investors, offering them a lower cost of diversification and exposure to risks and markets that may be costly for an individual investor to access. Similarly, cloud computing, by pooling computing resources, offers clients a lower TCO and access to up-to-date hardware. Cloud computing may offer cost reduction along the following dimensions:

• The first dimension of cost reduction is the lower cost of hardware maintenance and upgrades.

• The second dimension of cost reduction is the elasticity of cloud computing.

• The third dimension of cost reduction is the lower cost of human resources.

8 Here they compare cloud computing with a dedicated data center.


17.7.2 Risks
The risk of IT system failure is nonnegligible in the finance industry. The following two examples give some idea of the importance of having backup IT systems and highly reliable IT systems.

Example 17.4: NYSE and Bloomberg

The New York Stock Exchange crashed at 11:32 am ET on July 8, 2015. The exchange was down for 3 hours and 38 minutes. According to NYSE (reference online document), this was due to a software update to the IT system.
In the same year, Bloomberg terminals suffered a widespread outage on April 17, 2015, affecting more than 325,000 terminals worldwide.

IT failure can be costly, so what is the best way to manage the risk of IT systems? Cloud computing can be viewed as insurance for IT. Diversification is a widely accepted concept in the finance industry, and cloud computing may be an easy way for the industry to diversify its IT failure risk. Distributed file systems, either in-house or in cloud vendors’ data centers, protect data from hardware failures. Cloud vendors also offer access to computing in data centers located at various places around the world. Such a scheme provides a constant supply of computing resources in case of catastrophic tail events, such as earthquakes, tsunamis, and so on.

References
Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R., Konwinski, A., Lee, G.,
Patterson, D., Rabkin, A., Stoica, I. et al., 2010. A view of cloud computing.
Communications of the ACM 53, 50–58.

Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., Vanthienen, J.,
2003. Benchmarking state-of-the-art classification algorithms for credit scoring.
Journal of the Operational Research Society 54, 627–635.

Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., 2011. Distributed optimiza-
tion and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning 3, 1–122.

DeMiguel, V., Garlappi, L., Uppal, R., 2009. Optimal versus naive diversification:
How inefficient is the 1/n portfolio strategy? Review of Financial Studies 22,
1915–1953.

Fan, J., Lv, J., Qi, L., 2011. Sparse high dimensional models in economics. Annual
Review of Economics 3, 291.

Hand, D.J., Henley, W.E., 1997. Statistical classification methods in consumer credit
scoring: A review. Journal of the Royal Statistical Society. Series A (Statistics
in Society) 160, 523–541.

Joseph, E., Conway, S., Dekate, C., Cohen, L., 2014. IDC HPC update at ISC’14.

Kanniainen, J., Lin, B., Yang, H., 2014. Estimating and using GARCH models with
VIX data for option valuation. Journal of Banking & Finance 43, 200–211.

Kanniainen, J., Piché, R., 2013. Stock price dynamics and option valuations under
volatility feedback effect. Physica A: Statistical Mechanics and its Applications
392, 722–740.

Little, M., 2011. ESG and Solvency II in the cloud. Moody’s Analytics Insights.
Published in Barrie+Hibbert (later Moody’s Analytics) magazine, see http://
[Link]/[Link].

Mell, P., Grance, T., 2009. The NIST definition of cloud computing. National Insti-
tute of Standards and Technology 53, 50.

Smith, D.M., 2008. Cloud computing scenario.

Techila, T., 2014. Cloud benchmark—round 1.

Techila, T., 2015. Cloud HPC in finance, cloud benchmark report with real-world
use-cases.

West, D., 2000. Neural network credit scoring models. Computers & Operations
Research 27, 1131–1152.

Yang, H., Kanniainen, J., 2017. Jump and volatility dynamics for the S&P 500:
Evidence for infinite-activity jumps with non-affine volatility dynamics from
stock and option markets. Review of Finance 21, 811–844.
Chapter 18
Blockchains and Distributed Ledgers
in Retrospective and Perspective

Alexander Lipton

CONTENTS
18.1 Introduction
18.2 Blockchains and Distributed Ledgers
18.3 Historical Examples of BCs and DLs
18.3.1 Genealogical trees
18.3.2 Land titles
18.4 The Bitcoin Ecosystem
18.5 Potential Usages of DLT in Banking
18.5.1 Banking X-Road
18.5.2 Trade execution, clearing, settlement
18.5.3 Global payments, trade finance, rehypothecation
18.6 Monetary Circuit and Money Creation
18.6.1 Monetary circuit
18.6.2 General aspects of money creation
18.6.3 Money creation by individual banks
18.6.4 Money creation by the banking system
18.6.5 Bank lending versus bitcoin and P2P lending
18.7 CBDCs and Negative Interest Rates
18.7.1 Why CBDCs?
18.7.2 How CBDCs can be issued?
18.7.3 How CBDCs can be used to implement the Chicago Plan?
18.8 Conclusion
Acknowledgments
References

PARATOV. The madness of passion soon passes, and what remains are chains
and common sense that tells us that these chains are unbreakable.
LARISA. Unbreakable chains!

Alexander Ostrovsky
Without a Dowry, A drama in four acts


18.1 Introduction
In this chapter, we discuss blockchains (BCs) and distributed ledgers (DLs)
in retrospect and in prospect, with an emphasis on their applications to
money and banking in the twenty-first century. Additional aspects are
discussed in References 1 through 3.
Civilization is not possible without money, and, by extension, banking,
and vice versa. Through the ages, money existed in many forms, stretching
from the exquisite electrum coins of the Phrygian King Midas, giant stones of
Polynesia, cowry shells, the paper money of Kublai Khan and other rulers
who came after him, to digital currencies, and everything in between. The
meaning of money has preoccupied rulers and their tax collectors, traders,
entrepreneurs, laborers, economists, philosophers, writers, stand-up
comedians, and ordinary folks alike. It is universally accepted that money has
several important functions, such as a store of value, a means of payment in
general and of taxes in particular, and a unit of account. The author shares
the view of Aristotle formulated in his Ethics: “Money exists not by nature
but by law” [4]. Thus money is linked to government and government to money.
In fact, anything taken in lieu of tax eventually becomes money.
For the last five centuries, money has gradually assumed the form of records
in various ledgers. This aspect of money is all-important in the modern world.
At present, money is nothing more than a sequence of transactions, organized
in ledgers maintained by various private banks, and by central banks, which
provide the means (central bank cash) and tools (various money transfer
systems) used to reconcile these ledgers. In addition to their
ledger-maintaining functions, private banks play a very important role, which
central banks are not equipped to perform. They are the system gatekeepers,
who provide know your customer (KYC) services, and system policemen, who
provide anti-money laundering (AML) services. We argue that, in addition to
the more obvious areas of application of distributed ledger technology (DLT),
for instance, digital currencies (DCs), including central bank issued digital
currencies (CBDCs), DLT can be used to solve such complex issues as trust and
identity, with an emphasis on the KYC and AML aspects [5]. Further, given that
all banking activities boil down to maintaining a ledger, judicious
applications of DLT can facilitate the trading, clearing, and settlement
triad, payments, trade finance, and so on.
The chapter is organized as follows. We introduce DLs and briefly discuss
their different types in Section 18.2. We present historical instances of BCs
and DLs in Section 18.3, and describe what happened when they underwent
hard forking. Bitcoin, the most popular current application of DLT, is covered
in Section 18.4, where a few less well-known facts about bitcoin are presented.
Potential applications of DLT to banking are discussed in Section 18.5. As an
interesting potential area of applications of BC/DLT, we introduce a modern
version of the monetary circuit in Section 18.6 and show that it can benefit
from the BC/DL framework because money moves in a gigantic circle (or
several circles if the world economy as a whole is considered). In addition, in
the process of money creation by the banking system as a whole, individual
banks become naturally interconnected, so that DLs are particularly suitable
for describing their interactions. We discuss topics related to CBDCs in
Section 18.7, where we explain the rationale for their issuance and discuss
practical aspects. In particular, we show that CBDCs can be used to implement
the famous Chicago Plan [6,7] of moving away from fractional reserve banking
toward narrow banking. We articulate the differences between Chaum’s
and Nakamoto’s approaches to DCs and consider their respective pros and
cons. Conclusions are drawn in Section 18.8.

18.2 Blockchains and Distributed Ledgers


Databases with joint writing access have been known for decades. Several
typical examples are worth mentioning: the Concurrent Versions System (CVS),
Wikipedia, and distributed databases used on board naval ships [8].
We start with articulating differences between centralized and distributed
databases. In a centralized database, storage devices are all connected to a
common processor; in a distributed database, they are independent. Further-
more, in a centralized database, writing access is tightly controlled; in a dis-
tributed database, many actors have writing privileges. In the latter case, each
storage device maintains its own growing list of ordered records, which, for the
sake of efficiency, can be organized in blocks, hence, the name Blockchain. To
put it differently, in a traditional centralized ledger, the gatekeeper collects,
verifies, and performs the write requests of multiple parties, tasks which are
distributed in the DL. It should not be taken as fact that these tasks are best
distributed: the considerations of efficiency and specialization are relevant as
well.
The integrity of the distributed database is cryptographically ensured at
two levels. First, only users possessing private keys can make updates to
“their” part of the ledger. Second, notaries (also called miners) verify that
users’ updates are legitimate. Once the updates are notarized, they are broad-
cast to the whole network, thus ensuring that all copies of the distributed
database are in sync. There are several types of distributed databases or
ledgers. We list them in increasing order of complexity:

1. Traditional centralized ledger
2. Permissioned private DL (R3 CEV, DAH, and other similar projects)
3. Permissioned public DL (Ripple, and so on)
4. Unpermissioned public DL (Bitcoin, Ethereum, and the myriad others)

To control the integrity of a DL, a variety of mechanisms can be used: proof of
work (PoW), proof of stake (PoS), third party verification, and so on.
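
The idea of ordered records organized in linked blocks can be illustrated with a minimal sketch. It assumes a toy in-memory ledger in which each block stores the SHA-256 hash of its predecessor; the function names (block_hash, append_block, is_valid) and the record strings are hypothetical and chosen for illustration only. The sketch shows the linking alone and does not model private keys, notaries, PoW, or PoS.

```python
import hashlib
import json


def block_hash(index, prev_hash, records):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "prev_hash": prev_hash, "records": records},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, records):
    """Append a block of records, linking it to the current tip of the chain."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    index = len(chain)
    stored = list(records)
    block = {
        "index": index,
        "prev_hash": prev_hash,
        "records": stored,
        "hash": block_hash(index, prev_hash, stored),
    }
    chain.append(block)
    return block


def is_valid(chain):
    """Recompute every hash and check each link to the preceding block."""
    prev_hash = "0" * 64
    for index, block in enumerate(chain):
        if block["index"] != index or block["prev_hash"] != prev_hash:
            return False
        if block["hash"] != block_hash(index, prev_hash, block["records"]):
            return False
        prev_hash = block["hash"]
    return True


ledger = []
append_block(ledger, ["A pays B 5", "B pays C 2"])
append_block(ledger, ["C pays A 1"])
print(is_valid(ledger))                    # True

ledger[0]["records"][0] = "A pays B 500"   # tamper with an old record
print(is_valid(ledger))                    # False
```

Because every block commits to the hash of the one before it, altering an old record invalidates the stored hash of that block, which is how a copy of the ledger can detect that it has been tampered with.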
Which ledger should be used? It largely depends on the context. If no joint
writing access is required, as is the case with most legacy banking applications,
a centralized ledger can be used. If participants do need joint writing access,
but know each other in advance, have aligned interests, and can be trusted,
as is the case in clearing and settlement, a permissioned private DL can be
employed. More details are given in Reference 9.
The best known application of BC/DL is the famous bitcoin, which exists
on an unpermissioned public DL whose integrity is maintained by anonymous
miners via the PoW mechanism. BC/DL can also be used for issuing CBDCs.
However, the sheer scale of the economy precludes the use of an unpermissioned
public ledger in the spirit of Nakamoto [10] for this purpose, due to the
enormous computational effort required for PoW. Resurrecting digicash,
proposed by Chaum [11], is an exciting possibility.
In many instances, building a DL just to be au courant with the times
might not be worth the effort.

18.3 Historical Examples of BCs and DLs


18.3.1 Genealogical trees
The idea of a BC is certainly not new. BCs naturally occur whenever power,
land, or property change hands. Some of the earliest examples of BC are the
genealogical trees of royal (or, more generally, aristocratic or property own-
ing) families. In such a tree (or BC), the transfer of power from one sovereign
to the next is governed by well-defined rules and in most cases, occurs with-
out commotion. However, when these rules become ambiguous and open to
interpretation the tree can undergo a hard fork.
In addition to being a chain, a genealogical tree is a DL. In order to agree
on their respective legitimacy and marriage eligibility, royal houses had to
inform each other about births, deaths, marriages, and other life events, thus
keeping their versions of BCs in sync. In Figure 18.1, we show the genealogical
tree of the House of Habsburg engraved by A. Dürer. It was distributed to
other royal houses, as well as to all imperial cities in the Holy Roman Empire.
Usually, forking of a succession tree is associated with wars and other acts
of violence. This is a cautionary tale for proponents of ubiquitous applications
of DLs without a possibility of resolving disputes outside of the ledger itself.
Here are two (of many) examples.
FIGURE 18.1: Albrecht Dürer, The Triumphal Arch of Maximilian
(Ehrenpforte Maximilians I): the House of Habsburg genealogical tree. It was
distributed to all imperial cities in the Holy Roman Empire. (Albrecht Dürer,
National Gallery of Art, online database, entry 1991.200.1. Public Domain.)

In Figure 18.2, we show a simplified genealogical tree of the House of Capet.
For 10 generations, starting with Hugh Capet, the transfer of power from
father to son was smooth. However, ambiguity arose when all three
sons of Philip IV died without surviving male issue, thus creating a power
vacuum. In order to resolve it, the peers of France applied the Salic law of
succession, by which persons descended from a previous sovereign only through
a woman are not eligible to occupy the throne. The House of Plantagenet did
not accept this outcome and started the Hundred Years’ War (1337–1453), a
dynastic conflict for control of the Kingdom of France, against the House of
Valois, a cadet branch of the Capetian dynasty. In the end, the Valois
established themselves as Kings of France at the expense of the Plantagenets.
Similar conflicts occurred with regularity and for very similar reasons
throughout history. For example, the War of the Austrian Succession (1740–
1748), which involved all major powers of Europe, was fought to settle the
question of the Pragmatic Sanction and to decide whether the Habsburg
hereditary possessions could be inherited by a woman. It was finally resolved
in favor of Maria Theresa, who became the only female ruler of the Habsburg
dominions.
Closer to our times, an interesting example of hard forking occurred in
Ethereum in July of 2016, as a result of reversing a theft of ether worth
60 Mil USD from the DAO. Buterin [12] described the situation as follows:

The foundation has committed to support the community consensus on the
admittedly difficult hard fork decision. . . . That said, we
recognize that the Ethereum code can be used to instantiate other
blockchains with the same consensus rules, including testnets, consortium
and private chains, clones and spinoffs, and have never been
opposed to such instantiations.

Once again, we see that ambiguity within a BC cannot be resolved via its
intrinsic mechanisms.

FIGURE 18.2: Genealogical chart (chain) of the House of Capet. The hard fork
was resolved in favor of the House of Valois at the expense of the House of
Plantagenet by inventing the Salic law. The Hundred Years’ War commenced as a
result. (Adapted from Wikipedia.)

18.3.2 Land titles


In more recent times, land registry title deeds are more relevant examples
of BCs. As per Land Registry,

Title deeds are paper documents showing the chain of ownership for
land and property. They can include: conveyances, contracts for sale,
wills, mortgages and leases.

It is clear that titles are BCs currently held in a central repository; instead
of miners, succession is verified by notaries. Titles are meaningful candidates
for being treated on a DL. However, there are still some issues which need
to be resolved before this can be done. The recent lawsuits by Mark
Zuckerberg, seeking to force hundreds of Hawaiians to sell to him small plots
of land located within the external boundaries of his 700-acre beachfront
property on the island of Kauai, are a good case in point. They illustrate
that in some instances it is not possible to identify the first owner of land
and then build a chain of ownership from the original owner to the present,
resulting in an ambiguous and potentially vulnerable BC.

18.4 The Bitcoin Ecosystem


Bitcoin is not the first digital currency by a long shot, and very likely is
not the last major one either. The astute reader will recognize that apart from
intriguing technical innovations, bitcoin does not differ that much from the
fabled tally sticks used in the Middle Ages. Its precursors include e-cash and
digicash invented by D. Chaum [11], and bitgold invented by N. Szabo [13].1

1 There is a heated debate about the true identity of Satoshi Nakamoto. Nick
Szabo is often mentioned as a potential inventor of bitcoin. Here is a small
piece of evidence, which might be of interest. Nakamoto’s initials are SN,
while Szabo’s are NS. However, Szabo is originally a Hungarian name, and in
Hungarian the last name comes first, so his initials would be SN. An
interesting coincidence.

FIGURE 18.3: A typical BTC block. (Adapted from [Link].)

All building blocks of the bitcoin ecosystem have been known for some
time, including two of the most important cryptographic techniques, a one-way
hash function and the Elliptic Curve Digital Signature Algorithm (ECDSA); see
References 14 through 16. Proof of work, based on cryptographic hash
functions, specifically SHA-256, is similar to hashcash invented by Back [17],
while Merkle trees were introduced in the seminal paper by Merkle [18].
Ignoring such nuances as wallets, and so on, we can describe the basic setup
as follows. Participants of the system are represented by their public/private
key pairs. The main control variable is the number of bitcoins belonging to
a particular public key. This number is known to all participants at all times
(in theory). The owner of a particular public key broadcasts their intent to
send a certain quantity of bitcoins to another public key. Miners aggregate
individual transactions into blocks, verify them to ensure that there is no
double spend by competitively providing proof of work, and receive mining
rewards in bitcoins. A transaction is confirmed if there are at least six new
blocks built on top of the block to which it belongs. A typical block is
shown in Figure 18.3.
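
The two hashing steps a miner performs, aggregating transactions into a Merkle root and searching for a proof of work, can be sketched as follows. This is a toy illustration under stated assumptions, not bitcoin's actual block format: the header is a simple delimited string, the transaction strings and the difficulty_bits parameter are hypothetical, and no signatures or double-spend checks are modeled. The use of double SHA-256 and the duplication of the last hash on odd levels follow bitcoin's conventions.

```python
import hashlib


def sha256d(data: bytes) -> bytes:
    """Double SHA-256, the hash construction bitcoin applies to block headers."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def merkle_root(transactions):
    """Pairwise-hash transaction hashes until a single root remains.

    Expects a non-empty list of transaction strings.
    """
    level = [sha256d(tx.encode("utf-8")) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])   # duplicate the last hash on odd levels
        level = [sha256d(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0].hex()


def mine(prev_hash: str, root: str, difficulty_bits: int):
    """Search for a nonce whose header hash falls below the target."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        header = f"{prev_hash}|{root}|{nonce}".encode("utf-8")
        digest = sha256d(header)
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
        nonce += 1


# Hypothetical transactions between public keys.
txs = ["key_A -> key_B: 1.5", "key_C -> key_A: 0.2", "key_B -> key_D: 0.7"]
root = merkle_root(txs)
nonce, digest = mine("00" * 32, root, difficulty_bits=16)
print(nonce, digest)
```

Each additional difficulty bit doubles the expected number of hash evaluations, which is why finding a nonce is costly while checking one takes a single hash.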
The size of mining rewards is halved at regular intervals so that the total
number of bitcoins in circulation converges to 21 Mil. Currently there are
about 16 Mil bitcoins in circulation. It is believed that at least one Mil are
irretrievably lost or stolen. Some 450,000 blocks have been mined so far; a
new block is mined every 10 minutes on average. Due to the fact that mining
rewards are paid with new bitcoins, transaction costs are claimed to be very
low. This is a nifty bit of sleight of hand, however, because the value of
existing bitcoins is constantly diluted. Some representative bitcoin
statistics are given in Figure 18.4.
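
The convergence to 21 Mil can be checked with a short calculation. The sketch below assumes the protocol parameters that are not stated in the text, namely an initial reward of 50 bitcoins and a halving every 210,000 blocks, with rewards counted in whole satoshis (1 bitcoin = 100,000,000 satoshis). The function name supply_at_height is hypothetical.

```python
HALVING_INTERVAL = 210_000            # blocks between reward halvings (assumed)
SATOSHI = 100_000_000                 # satoshis per bitcoin
INITIAL_REWARD = 50 * SATOSHI         # 50 BTC per block at launch (assumed)


def supply_at_height(height: int) -> int:
    """Total satoshis issued after `height` blocks have been mined."""
    total = 0
    reward = INITIAL_REWARD
    remaining = height
    while remaining > 0 and reward > 0:
        blocks = min(remaining, HALVING_INTERVAL)
        total += blocks * reward
        remaining -= blocks
        reward //= 2                  # integer halving of the block reward
    return total


print(supply_at_height(450_000) / SATOSHI)     # 16125000.0
print(supply_at_height(10_000_000) / SATOSHI)  # just under 21 million
```

Under these assumptions, the 450,000 blocks mentioned above correspond to about 16.1 Mil bitcoins issued, in line with the figure of about 16 Mil in circulation, and the geometric series of rewards sums to slightly less than 21 Mil.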
FIGURE 18.4: Some representative bitcoin statistics: total BTC, transactions
in the last 24 h, market cap, and BTC price. (Adapted from [Link].)

Bitcoin promises are grand. Its proponents expect it to become a
supranational currency eventually supplanting national currencies, which, in
their minds, can be easily manipulated. Many even believe that bitcoin is the
modern digital version of gold, due to the effort required for PoW [13]. Whilst
bitcoin is clearly an impressive breakthrough, reality is much less grand than
perception, and is quite telling (at the time of writing this paper):

1. A new block is created on average every 10 minutes.
2. The number of transactions per second (TpS) is approximately 7, compared
to 2000 TpS on average handled by VISA.
3. In monetary terms, the amount of transactions is about 100 Mil USD/day.
4. Current real (not nominal!) transaction costs are 1.5 Mil USD/day, or 1.5%
of total volume; in 2012 they were a whopping 8%, and in 2014, 6%.
5. Mining is a cost of electricity game. In high energy cost countries miners
go bust: Swedish KnCMiner recently declared bankruptcy ahead of the halving
of the miner’s reward. While exact numbers are not known, it is believed that
bitcoin consumes as much electricity as EBay, Facebook, and Google combined.
6. Miners are arranged in gigantic pools (so much for peer to peer (P2P)
mining!): AntPool (18.7%), F2Pool (17.7%), BitFury (7.7%), BTCC Pool (7.4%),
[Link] (7.3%). Thus a 51% attack becomes possible! There is a very high
probability that six consecutive blocks will be mined by the same actor (so
much for checks and balances!). Most of these pools are Chinese, partly due
to low electricity cost, partly due to high-tech advances. Not only are the
miners predominantly Chinese, so are the players: 91% CNY, 7% USD, 1% EUR.
7. At the moment, the main purpose of using bitcoin is for speculation and
circumvention of capital controls in China.
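
The arithmetic behind items 3, 4, and 6 can be reproduced in a few lines. The sketch uses only the figures quoted in the list; the function name smallest_majority is hypothetical, and the fifth pool is keyed by the placeholder that appears in the source because its name is elided there. The calculation concerns a coalition of pools, which is an assumption about how a 51% attack would be mounted, since no single listed pool holds a majority.

```python
daily_volume_usd = 100e6       # item 3: about 100 Mil USD/day
daily_cost_usd = 1.5e6         # item 4: about 1.5 Mil USD/day
cost_ratio = daily_cost_usd / daily_volume_usd

pool_shares = {                # item 6: shares of mining power, in percent
    "AntPool": 18.7,
    "F2Pool": 17.7,
    "BitFury": 7.7,
    "BTCC Pool": 7.4,
    "[Link]": 7.3,             # pool name is elided in the source
}


def smallest_majority(shares):
    """Add the largest pools one by one until their share exceeds 50%."""
    coalition, combined = [], 0.0
    ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    for name, share in ranked:
        coalition.append(name)
        combined += share
        if combined > 50.0:
            return coalition, combined
    return None, combined


coalition, combined = smallest_majority(pool_shares)
print(f"cost ratio: {cost_ratio:.1%}")
print(coalition, round(combined, 1))
```

With the quoted shares, the four largest pools together control 51.5% of mining power, and the five listed pools together control 58.8%.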

It is truly amazing to see how miners are prepared to perform socially useless
tasks, as long as they are paid for it. A telling historical analogy comes to
mind. During the contest for the design of the dome of Santa Maria del Fiore,
it was suggested that dirt mixed with small coins be used as scaffolding.
After the dome’s completion, the dirt was to be cleared away for free by the
profit-seeking citizens of Florence (proto-miners). It is clear that BC/DL is
still awaiting its Brunelleschi, who figured out how to build the dome without
scaffolding [19].
T. J. Dunning, quoted by Karl Marx in Das Kapital [20], put it
succinctly:

With adequate profit, capital is very bold. A certain 10 per cent will
ensure its employment anywhere; 20 per cent certain will produce
eagerness; 50 per cent, positive audacity; . . .

18.5 Potential Usages of DLT in Banking


18.5.1 Banking X-Road
No bank, however big, is an island; banks can only operate as a group. In
the process of their day-to-day activities, they become naturally interlinked.
Due to these linkages between banks, DLT can provide a useful tool for facili-
tating, reconciling, and reporting their interactions. Given that internal tech-
nology is bank specific, it is impractical to standardize bank infrastructure.
However, it is possible to bring them to a common denominator by emulating
the success of the Estonian X-Road and creating a DL solution for banking
operations, which, by analogy, can be called the e-bank X-Road. In this regard,
DL will serve as an adapter, not dissimilar to an electrical adapter.
In 1997, Estonia started to move to digital government. In 2001, Ansper
proposed a suitable design in his master’s thesis [21]. He developed a
distributed P2P secure information system called the e-Estonia X-Road based on
the idea of an adapter. X-Road is the digital environment which links various
heterogeneous public and private databases and enables them to operate in
sync. A small company, Cybernetica, implemented this design for around 60
Mil EUR.2
Let us describe a possible design for the e-bank X-Road. Given the
nonscalable nature of PoW, and the unclear security properties of PoS, X-Road
has to be controlled by trusted notaries or validators. Two financial
institutions, represented by their public keys, use their respective adapters
to agree on the common terms of a deal. They digitally sign and execute a
smart contract, hash it, and broadcast the hashed version to the X-Road
participants. A quorum of notaries digitally signs the hash (“laminates” it)
and reposts the signed hash in the common X-Road layer. Validators are paid
for their services, similarly to central securities depositories.3
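
The sign, hash, and laminate sequence described above can be sketched in a few lines. This is an illustrative sketch only: the Python standard library has no public-key signatures, so HMAC-SHA256 tags under shared secret keys stand in for digital signatures, which means that here verification needs the secret key, unlike a real signature scheme. All names (sign, verify, laminate, has_quorum), keys, and contract terms are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, message: bytes) -> str:
    """Stand-in for a digital signature: an HMAC-SHA256 tag (assumption)."""
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, message: bytes, tag: str) -> bool:
    """Check a tag in constant time."""
    return hmac.compare_digest(sign(secret, message), tag)


def laminate(contract_hash: str, signing_notaries: dict) -> dict:
    """Each participating notary signs the broadcast hash of the contract."""
    message = contract_hash.encode("utf-8")
    return {name: sign(key, message) for name, key in signing_notaries.items()}


def has_quorum(contract_hash, tags, notary_keys, quorum) -> bool:
    """Count valid notary tags over the hash and compare with the quorum."""
    message = contract_hash.encode("utf-8")
    valid = sum(
        1 for name, tag in tags.items()
        if name in notary_keys and verify(notary_keys[name], message, tag)
    )
    return valid >= quorum


# Hypothetical deal between two institutions and three registered notaries.
contract = b"Bank A sells 100 units of bond X to Bank B for 1,000,000 EUR"
bank_keys = {"bank_a": b"secret-a", "bank_b": b"secret-b"}
notary_keys = {"n1": b"k1", "n2": b"k2", "n3": b"k3"}

bank_tags = {name: sign(key, contract) for name, key in bank_keys.items()}
contract_hash = hashlib.sha256(contract).hexdigest()   # only this is broadcast
tags = laminate(contract_hash, {"n1": b"k1", "n2": b"k2"})
print(has_quorum(contract_hash, tags, notary_keys, quorum=2))   # True
```

Only the hash of the contract reaches the common layer, so the terms of the deal stay between the two institutions while the notaries' tags attest that the hashed contract existed and was accepted.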
It is worth noting that a BC does not by itself guarantee unambiguous
ownership: steps are required to identify and resolve any ambiguities before
moving to a BC, and in addition, tools and mechanisms to resolve ambigu-
ities are only discovered when the BC is already well established. Both of
these requirements are underemphasized in current discussions of BC/DLT
applications.
There are several smaller areas in which DLT can be used to reduce trans-
action costs and other frictions in the conventional system. Such areas include
but are not limited to:

1. Post-trade processing
2. Global payments
3. Trade finance
4. Rehypothecation
5. Syndicated loans
6. Real estate transactions

2 Other countries tried to follow suit but not all attempts were unqualified successes.
3 Corda, recently described in a white paper by R3, might be a step in this direction [22].
18.5.2 Trade execution, clearing, settlement


The all-important triad of capital markets is trade execution, clearing, and
settlement. While initial public offering of stock is an important rite of pas-
sage for a new company, secondary trading is a mechanism for continually
reallocating ownership and control in a more or less optimal fashion. In addi-
tion to stocks, many other products, such as equity derivatives, interest rate
swaps, commodities, and so on, are traded on public exchanges. Moving many
over-the-counter (OTC) products to exchanges is an important regulatory
imperative [23].
Currently, there are three necessary steps required to trade public securi-
ties:

1. Buyers and sellers have to be matched
2. The transaction has to be cleared, that is, novated to a central clearing
counterparty (CCP)
3. The transaction has to be settled, that is, delivery versus payment
(DvP) has to take place, so that title and money can be transferred
as expected.4

These steps are characterized by vastly different time scales: trading often
takes place in milliseconds, while clearing and settlement take 1–3 days!
Although the proverbial T + 2, T + 3 irritate many people, they might be
a bit too fast to push for the T + 15 solution. The actual process is very
involved and includes investors, custodial banks, exchanges, brokers (general
clearing members of CCPs), CCPs, central securities depositories, regulators,
and so on.
It is natural to ask if a different design of exchanges can improve the
overall process and make it more stable and less costly. The answer is yes and
no. On the pros side, there are several issues which the current set-up solves
very well:

1. Counterparty credit risk management
2. Netting
3. DvP and credit risk more generally, which is addressed by collecting
Initial Margin, Variation Margin, and Guarantee Fund contribution from
clearing members
4. Anonymity
5. Ability to borrow stocks

4 The thriller “Ronin,” which deals with DvP, is not critically acclaimed
[24]. In the author’s view, it takes the difficult challenges of transactions
among many untrustworthy parties which underlie many great thrillers and
brings them to the fore, arguably making “Ronin” one of the greatest thrillers
ever (Perhaps the ending would have been different had the characters known
about DLT).

On the cons side, numerous issues are rather disconcerting:

1. Cost

2. Speed

3. Need for reconciliation and failures

It is clear that straightforward attempts to apply DLT to clearing and
settlement (thankfully, to the best of the author’s knowledge, nobody wants
to use it in trading per se) cannot be successful. The reasons are simple:
instantaneous settlement obliterates all the aforementioned advantages of the
current system. It increases the money sloshing around by at least an order of
magnitude. Thus slow clearing and settlement is not so much a consequence
of the technological backwardness of exchanges and CCPs (although they are
not always using cutting edge technology), but rather a result of their modus
operandi.
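
The effect of netting on the amount of cash that has to move can be illustrated with a small worked example. The obligations below are entirely hypothetical and were chosen so that trades largely offset each other; the size of the reduction in practice depends on how much offsetting flow there is, so the ratio obtained here should not be read as an estimate for real markets.

```python
from collections import defaultdict

# Hypothetical cash obligations from one trading day: (payer, payee, amount)
obligations = [
    ("A", "B", 100), ("B", "A", 90),
    ("B", "C", 80), ("C", "B", 70),
    ("C", "A", 60), ("A", "C", 55),
]


def gross_flow(items):
    """Cash that moves if every obligation settles individually and at once."""
    return sum(amount for _, _, amount in items)


def net_positions(items):
    """Multilateral net position of each member against the CCP."""
    position = defaultdict(int)
    for payer, payee, amount in items:
        position[payer] -= amount
        position[payee] += amount
    return dict(position)


def net_flow(items):
    """Cash that moves if only net debtors pay in at the end of the cycle."""
    return sum(-p for p in net_positions(items).values() if p < 0)


print(gross_flow(obligations))      # 455
print(net_positions(obligations))   # {'A': -5, 'B': 0, 'C': 5}
print(net_flow(obligations))        # 5
```

In this example, settling each obligation immediately moves 455 units of cash, whereas settling net positions at the end of the cycle moves only 5.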
By using permissioned private ledger(s), one can certainly cut costs, some-
what increase speed of clearing and settlement, and reduce the number of
failures and hence the need for reconciliation. In particular, smart contracts,
if they can be legally enforced, can solve a part of the DvP conundrum, which
will require that both securities and cash are parts of the same ledger. While
smart contracts cannot solve all problems, they represent a step in the right
direction. A potential evolution of the trading–clearing–settlement triad is
illustrated in Figure 18.5.
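
Why DvP becomes tractable when securities and cash sit on the same ledger can be shown with a minimal sketch. It assumes a toy in-memory ledger in which both legs of a trade are applied to a working copy and committed together, so that either both happen or neither does. The class, method, account, and asset names are hypothetical, and questions of legal enforceability are outside the scope of the sketch.

```python
class InsufficientBalance(Exception):
    """Raised when an account cannot cover one leg of the transfer."""


class Ledger:
    """A single ledger holding both cash and securities balances."""

    def __init__(self, balances):
        # balances: {account: {asset: quantity}}
        self.balances = {acct: dict(assets) for acct, assets in balances.items()}

    @staticmethod
    def _move(balances, sender, receiver, asset, quantity):
        if balances[sender].get(asset, 0) < quantity:
            raise InsufficientBalance(f"{sender} lacks {quantity} {asset}")
        balances[sender][asset] -= quantity
        balances[receiver][asset] = balances[receiver].get(asset, 0) + quantity

    def deliver_versus_payment(self, seller, buyer, security, quantity,
                               cash, price):
        """Apply both legs to a working copy; commit only if both succeed."""
        working = {acct: dict(assets) for acct, assets in self.balances.items()}
        self._move(working, seller, buyer, security, quantity)   # delivery leg
        self._move(working, buyer, seller, cash, price)          # payment leg
        self.balances = working                                  # commit both


ledger = Ledger({"seller": {"XYZ": 100}, "buyer": {"USD": 500}})
ledger.deliver_versus_payment("seller", "buyer", "XYZ", 10, "USD", 200)
print(ledger.balances)

try:
    # The buyer cannot pay 1000, so neither leg is applied.
    ledger.deliver_versus_payment("seller", "buyer", "XYZ", 10, "USD", 1000)
except InsufficientBalance as err:
    print("rejected:", err)
```

If cash and securities lived on separate ledgers, the two legs could not be committed in one step, and one party would be exposed between delivery and payment.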

18.5.3 Global payments, trade finance, rehypothecation


Global payments is another area where DLT can be potentially useful. It
is important to note that, in spite of claims to the contrary, the payment
system is not broken; it is, however, rather expensive. For instance, the
Real-Time Gross Settlement system works well for domestic transactions but is
inefficient and expensive for foreign transactions. Thus some synergies can be
gained if a DL, which supports several national currencies at once, is developed
to replace the legacy system.
FIGURE 18.5: (a) Current organization of share trading. (b) First improvement
of stock trading setup: CSD and CCP are replaced by BC/DL. (c) Second
improvement of stock trading setup: in addition to CSD and CCP, custodians
and stock transfer agents are replaced by BC/DL.

For trade finance, there is the potential to use BC/DL to simplify the flow
of information among all participants and smart contracts to partially solve
the DvP problem.
In the rehypothecation setup, it is possible to use BC/DL to untangle
the ownership of the collateral. However, this is more of an accounting tool,
rather than a comprehensive solution, because in many instances the actual
legal ownership of collateral cannot be established with certainty.

18.6 Monetary Circuit and Money Creation


18.6.1 Monetary circuit
For centuries, the origins, properties, and functions of money have been
debated in countless expositions. In the fourteenth century, the sagacious
French abbot Gilles li Muisis lamented [25]:

Money and currency are very strange things; They keep on going
up and down and no one knows why; If you want to win, you lose,
however hard you try.

In the twentieth century, the great British economist John Maynard
Keynes shrewdly observed [26]:

For the importance of money essentially flows from it being a link
between the present and the future.

As was mentioned earlier, money is inherently linked with banking, which,
over many centuries, gradually evolved from full reserve toward fractional
reserve banking. For instance, the Bank of England founded in 1694 already
operated as a fractional reserve bank.5
In modern societies, commercial banks are almost exclusively fractional
and produce money “out of thin air” [27–29]. This important fact is
thoroughly misunderstood in modern macroeconomic thinking, which
incorrectly overemphasizes the intermediation aspect of banking and assigns the
money creation role to central banks instead of commercial banks. In reality,
commercial banks are not constrained by their deposits and can and do issue
money at will. At the same time, their ability to do so is restricted by banking
regulations, which impose floors on the amount of banks’ capital and liquidity,
so that money creation cannot go on ad infinitum.
To understand the role played by money in the economy, one needs to
follow its flow and to account for nonfinancial and financial stocks (cumulative
amounts), and flows (changes in these amounts). Here is how Michal Kalecki,
the great Polish economist, summarizes the complexity of the issues at hand
with his usual flair and penchant for hyperbole [30]:

Economics is the science of confusing stocks with flows.


5 The Bank of England was characterized by Marx [20], as follows:
At their birth the great banks, decorated with national titles, were only associations
of private speculators, who placed themselves by the side of governments, and,
thanks to the privileges they received, were in a position to advance money to the
State. Hence the accumulation of the national debt has no more infallible measure
than the successive rise in the stock of these banks, whose full development dates
from the founding of the Bank of England in 1694.

[Figure 18.6: diagram of the monetary circuit; node labels B1, B2, B3, B4, PB,
CB, G, H, F, and NFA.]

FIGURE 18.6: A sketch of the monetary circuit.

In the author’s opinion, the functioning of the economy and the role of
money are best described by monetary circuit theory (MCT), which provides
a unifying framework for specifying how money lubricates and facilitates
production and consumption cycles in society. MCT describes the dynamics of
the economy in the most precise way and explains how and by whom money is
created. More specifically, it describes the interactions among five sectors:
government, the central bank, private banks, firms, and households. As part
of the monetary circuit, private banks play an outstanding role as creators
of credit money. In this framework, central banks do not create money
directly, but rather accelerate or slow down the process of money creation by
private banks by providing a unique universal medium, in the form of
electronic cash, with which different banks control their inventories of assets
and liabilities. A schematic representation of the monetary circuit is given in
Figure 18.6, which shows money flowing among these five sectors of the
economy.

18.6.2 General aspects of money creation


Currently, there are three theories explaining money creation: the credit
creation theory, the fractional reserve theory, and the financial theory of
intermediation [27–29]. The author firmly believes that only the credit
creation theory, advocated by Macleod, Hahn, Wicksell, and Keen, among others,
correctly reflects the mechanics of linking credit and money creation. Credit
creation theory was
popular in the nineteenth century, but, unfortunately, gradually lost ground
and was overtaken by the fractional reserve theory of banking, which, in turn,
was supplanted by the financial theory of intermediation. In the author’s view,
the latter theory severely underemphasizes the unique and special role of the
banking sector in the process of money creation, and cannot rationally explain
things like the global financial crisis of 2007–2008 and other similar events,
which happen with disconcerting regularity. This aspect is particularly impor-
tant because currently there is a profound lack of appreciation on the part of
the conventional economic paradigm of the special role of banks. For example,
banks are excluded from widely used dynamic stochastic general equilibrium
models, which are influential in contemporary macroeconomics and popular
among central bankers, in spite of the fact that they systematically fail to
produce any meaningful results [31]. It is clear that a vibrant financial system
cannot operate without banks, and that the banking system is very complex
and difficult to regulate because banks become interconnected as a part of
their regular lending activities. In addition to their money creation role, banks
regulate access to the monetary system, by providing KYC and AML services.

18.6.3 Money creation by individual banks


We start with the simplest situation, and consider a single bank, which
lends money to a borrower who immediately deposits it with the same bank.
Thus the bank simultaneously creates assets and liabilities. The size of the loan
is limited solely by regulations and the bank’s own risk appetite. The full cycle
from money creation to money annihilation is shown in Figure 18.7. Money is
pumped into the system (created) when it is lent out by the bank and pumped
out (annihilated) when it is repaid. If the borrower repays, the principal is
destroyed, but the interest stays in the system. If the borrower defaults, the
money stays in the system indefinitely. The chain of money transfers from one
owner to the next is naturally described by a BC, ideally residing on DL.

[Figure 18.7, panels (a) Assets, (b) Liabilities, (c) Assets, (d) Liabilities:
bar charts over steps 1 to 3, vertical axis from 0 to 140. Legend: CB Cash,
IB Assets, Assets, Equity, IB Liabilities, Deposits.]

FIGURE 18.7: Money creation by a single bank. (a and b) The case of no
borrower’s default. (c and d) The case of borrower’s default. In the case of no
default, capital and CB cash increase; in the case of default, capital and CB
cash decrease.
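
The single-bank cycle described in this section can be sketched as simple balance-sheet bookkeeping. This is a minimal illustration under stated assumptions, not the model behind Figure 18.7: the `Bank` class and its method names are hypothetical, interbank items are left out, and CB cash is held fixed, so only loans, deposits, and equity move.

```python
from dataclasses import dataclass


@dataclass
class Bank:
    """Hypothetical one-bank balance sheet; all amounts in the same currency."""
    cb_cash: float   # central bank cash (asset), held fixed in this sketch
    loans: float     # loans outstanding (asset)
    deposits: float  # customer deposits (liability)

    @property
    def equity(self):
        # Equity is the balancing item: assets minus liabilities.
        return self.cb_cash + self.loans - self.deposits

    def lend(self, amount):
        # The loan (asset) and the matching deposit (liability) appear together.
        self.loans += amount
        self.deposits += amount

    def repay(self, principal, interest):
        # Repayment is made out of deposits: the principal cancels the loan,
        # while the interest ends up as additional equity of the bank.
        self.loans -= principal
        self.deposits -= principal + interest

    def write_off(self, amount):
        # Default: the loan is written off, but the deposit it created remains.
        self.loans -= amount


# No default: the deposit created by the loan is extinguished on repayment.
good = Bank(cb_cash=30, loans=0, deposits=10)
good.lend(10)
good.repay(principal=10, interest=1)

# Default: the created deposit stays in the system and equity absorbs the loss.
bad = Bank(cb_cash=30, loans=0, deposits=10)
bad.lend(10)
bad.write_off(10)
```

In the first case equity rises by the interest received; in the second it falls by the amount written off, which is the direction of change stated in the caption of Figure 18.7 for capital.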

18.6.4 Money creation by the banking system


A more complex case of asset creation by one bank and liabilities by a
second bank is illustrated in Figures 18.8 and 18.9. Linkages between these
two banks occur because the first one has to borrow cash from the second, so
that their central bank cash holdings reach suitable levels. In this setup, it is
clear that central banks do not generate money themselves; instead, they play
the role of liquidity providers (if, e.g., the second bank does not want to lend
money to the first) and system stabilizers (similar to Watt’s centrifugal
governor). Thus central banks are the glue which keeps the financial system
together. It is clear that BC is even more relevant in the case in question.
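
A minimal sketch of the two-bank case follows. It assumes that the borrower’s funds end up as a deposit at the second bank and that the payment settles in CB cash; the dictionary keys and function names are hypothetical. Under these assumptions, deposits grow while the total amount of CB cash stays constant, which is one way to see that the central bank supplies liquidity rather than the newly created money itself.

```python
def new_bank(cb_cash, deposits):
    # Hypothetical balance sheet with interbank (IB) assets and liabilities.
    return {"cb_cash": cb_cash, "loans": 0, "ib_assets": 0,
            "deposits": deposits, "ib_liabilities": 0}


def equity(bank):
    assets = bank["cb_cash"] + bank["loans"] + bank["ib_assets"]
    liabilities = bank["deposits"] + bank["ib_liabilities"]
    return assets - liabilities


def lend_and_transfer(lender, receiver, amount):
    # The first bank creates the asset; the deposit lands at the second bank.
    lender["loans"] += amount
    receiver["deposits"] += amount
    # The payment between the banks settles in central bank cash.
    lender["cb_cash"] -= amount
    receiver["cb_cash"] += amount


def interbank_loan(borrower, lender, amount):
    # The first bank restores its CB cash by borrowing from the second.
    lender["cb_cash"] -= amount
    lender["ib_assets"] += amount
    borrower["cb_cash"] += amount
    borrower["ib_liabilities"] += amount


bank1 = new_bank(cb_cash=20, deposits=10)
bank2 = new_bank(cb_cash=20, deposits=10)
lend_and_transfer(bank1, bank2, 15)
interbank_loan(bank1, bank2, 10)
total_cb_cash = bank1["cb_cash"] + bank2["cb_cash"]
total_deposits = bank1["deposits"] + bank2["deposits"]
```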

18.6.5 Bank lending versus bitcoin and P2P lending


In view of the above, the key distinction between bank money creation and
bitcoin mining, P2P lending, and so on is evident. Banks create money “out of
thin air.” Since bitcoin transactions are not based on credit, they simply move
existing money around. The same is true for P2P transactions: P2P operators
are strictly intermediaries; they do not create money at all! Therefore, banks
and P2P operators lend on different scales: banks lend money they don’t have,
whereas P2P operators lend only money they have. Hence, the P2P impact on
the financial system as a whole is very limited.

[Figure 18.8, panels (a)-(d): bar charts over steps 1 to 5; vertical axis from
0 to 140 in panels (a) and (b) and from 0 to 120 in panels (c) and (d). Legend:
CB Cash, IB Assets, Assets, Equity, IB Liabilities, Deposits.]

FIGURE 18.8: Money creation by two banks. The case of no borrower’s
default. (a and b) Assets and liabilities of the first bank. (c and d) Assets
and liabilities of the second bank. Capital and CB cash of both banks
increase.

[Figure 18.9, panels (a)-(d): bar charts over steps 1 to 4; vertical axis from
0 to 140 in panels (a) and (b) and from 0 to 120 in panels (c) and (d). Legend:
CB Cash, IB Assets, Assets, Equity, IB Liabilities, Deposits.]

FIGURE 18.9: Money creation by two banks. The case of borrower’s default.
(a and b) Assets and liabilities of the first bank. (c and d) Assets and liabilities
of the second bank. Capital and CB cash of the first bank decrease, while
capital and CB cash of the second bank increase.

18.7 CBDCs and Negative Interest Rates


18.7.1 Why CBDCs?
Can and should central banks issue DCs? Recently, discussion of this question
has been invigorated by the introduction of bitcoin [10] and the persistence of
negative interest rates, which plagued Medieval Europe in the form of demurrage,
the Brakteaten system, and numerous variations of the same tune for cen-
turies. Recall that demurrage was a tax on monetary wealth and required a
massive apparatus of coercion to be imposed efficiently. Today, even the best-
in-class economists seem to be unsure of its true nature; for instance, Rogoff
equates it with currency debasement, which is a very different mechanism,
see Reference 32. The idea of scrip money, that is, money which requires the
paying of a periodic tax to stay in circulation, thus emulating demurrage,
was proposed by S. Gesell, the German–Argentinian entrepreneur and self-
taught economist, in the febrile post-WWI atmosphere [33]. Subsequently, it
was regurgitated by Irving Fisher during the Great Depression [34].
In the author’s view, it is a sad reflection of the present state of economic
affairs, and the level of economic insight, that the current low interest rate
environment has prevailed for such a long time, in spite of it being such an
ineffective tool. Moreover, in some economies, such as Switzerland and
Denmark, interest rates have reached seriously negative levels.6
Negative interest rates can be used to simulate inflation; the crucial dif-
ference between these two regimes is that physical cash is very valuable under
the former, and highly undesirable under the latter. The last line of defense
between us and meaningfully negative rates is paper currency. However, in
many societies, particularly in Scandinavia, cash is relegated to the far cor-
ners of the economy already. It is not hard to imagine that in a few years’ time
instead of banknotes, we shall have CBDCs [36–38]. Once cash is abolished,
interest can be made as negative as desired by central bankers.
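
The difference between the two regimes can be made concrete with a small worked calculation. The figures below (a 2% rate, a five-year horizon, a balance of 100) are illustrative assumptions, and `real_value` is a hypothetical helper. A depositor loses roughly the same purchasing power in both regimes, but a holder of physical cash is shielded only under negative rates, which is why cash is the last line of defense.

```python
def real_value(nominal, nominal_rate, inflation, years):
    """Purchasing power after `years`, compounding annually.

    The balance grows at `nominal_rate` while prices grow at `inflation`.
    """
    return nominal * ((1 + nominal_rate) / (1 + inflation)) ** years


# Regime A: deposits pay -2%, prices are stable; banknotes pay 0%.
deposit_a = real_value(100.0, -0.02, 0.0, 5)
cash_a = real_value(100.0, 0.0, 0.0, 5)

# Regime B: deposits pay 0%, prices rise by 2% a year; banknotes pay 0%.
deposit_b = real_value(100.0, 0.0, 0.02, 5)
cash_b = real_value(100.0, 0.0, 0.02, 5)

# In regime A cash keeps its value while deposits shrink;
# in regime B cash loses value exactly as deposits do.
cash_advantage_a = cash_a - deposit_a
cash_advantage_b = cash_b - deposit_b
```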

18.7.2 How can CBDCs be issued?


Currently, there are two approaches to creating digital currencies on a large
scale. The first one, which has gained popularity since the invention of Bitcoin,
is based on unpermissioned DL, whose integrity is maintained by notaries (or
miners) [39]. Participants in this BC are pseudo-anonymous since they are
hidden behind their public keys. However, in principle, they can be identified
by various inversion techniques applied to old recorded transactions [40].
An earlier approach was developed by Chaum, who introduced a blind
signature procedure for converting bank deposits into anonymous cash [11].
Chaum’s approach is much cheaper, faster, and more efficient than the Bitcoin-
style one. However, it heavily relies on the integrity of the cash-issuing bank
rather than on trustless integrity of Bitcoin secured by computational efforts
of miners. Central banks can follow either avenue for issuing digital cash. By
doing so, central banks will be indirectly providing access to their balance
sheets to the general public. However, in either eventuality, central banks
won’t be able to perform KYC and AML functions and would still have to
rely on commercial banks, directly or indirectly, for doing so.
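
The blind signature idea can be sketched with textbook RSA. This is a toy illustration only: the primes are tiny, there is no padding or hashing, and the function names and the serial number are hypothetical. It shows the essential property that the bank signs a blinded value without seeing the serial number, yet the unblinded result is an ordinary valid signature on that number.

```python
from math import gcd

# Toy RSA key of the cash-issuing bank (illustrative values, not secure).
p, q = 61, 53
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)
e = 17                       # public exponent
d = pow(e, -1, phi)          # private exponent (requires Python 3.8+)


def blind(message, r):
    # The customer hides the message with a random factor r coprime to n.
    assert gcd(r, n) == 1
    return (message * pow(r, e, n)) % n


def sign(value):
    # The bank signs whatever it is given; it never sees the message itself.
    return pow(value, d, n)


def unblind(blinded_signature, r):
    # The customer removes the blinding factor to obtain a normal signature.
    return (blinded_signature * pow(r, -1, n)) % n


def verify(message, signature):
    # Anyone can check the signature with the bank's public key (e, n).
    return pow(signature, e, n) == message % n


serial = 1234                # hypothetical serial number of a digital coin
r = 7                        # blinding factor chosen by the customer
blinded = blind(serial, r)
signature = unblind(sign(blinded), r)
```

The scheme works because signing the blinded value gives `serial**d * r` modulo `n`, so dividing by `r` leaves the signature on `serial` alone.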
One possibility is as follows: a central bank issues numbered currency units
into DL, whose trust is maintained by designated notaries receiving payments
for their services. Thus, at any moment, there is an immutable record showing
which public key is the owner of a specific currency unit. Given that notary
efforts are significantly cheaper and faster than those of bitcoin miners, this
construct is easily scalable to satisfy the needs of the whole economy. Moreover,
since the records of transactions are immutable, it is possible to deanonymize
transactions thus maintaining AML requirements.
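
The construct described above can be sketched as a small append-only ledger. This is a minimal illustration under stated assumptions: the `UnitLedger` class and its methods are hypothetical, public keys are represented by plain strings, and the digital signatures that a notary would check before accepting a transfer are omitted. Each record is linked to the previous one by a hash, so the full ownership history of a numbered unit can be read back and any later alteration is detectable.

```python
import hashlib
import json


class UnitLedger:
    """Hypothetical notary-maintained ledger of numbered currency units."""

    GENESIS = "0" * 64

    def __init__(self):
        self.owner = {}  # unit serial number -> current owner's public key
        self.log = []    # append-only list of hash-linked records

    @staticmethod
    def _digest(record, prev_hash):
        body = json.dumps(record, sort_keys=True) + prev_hash
        return hashlib.sha256(body.encode()).hexdigest()

    def _append(self, record):
        prev_hash = self.log[-1]["hash"] if self.log else self.GENESIS
        entry = dict(record, prev=prev_hash, hash=self._digest(record, prev_hash))
        self.log.append(entry)

    def issue(self, serial, pubkey):
        # Only the central bank would be allowed to call this in practice.
        if serial in self.owner:
            raise ValueError("unit already issued")
        self.owner[serial] = pubkey
        self._append({"op": "issue", "unit": serial, "to": pubkey})

    def transfer(self, serial, sender, receiver):
        # A real notary would verify the sender's signature here.
        if self.owner.get(serial) != sender:
            raise ValueError("sender does not own this unit")
        self.owner[serial] = receiver
        self._append({"op": "transfer", "unit": serial,
                      "from": sender, "to": receiver})

    def history(self, serial):
        # Full trail for one unit, e.g. for AML review.
        return [entry for entry in self.log if entry["unit"] == serial]

    def verify_chain(self):
        prev_hash = self.GENESIS
        for entry in self.log:
            record = {k: v for k, v in entry.items() if k not in ("prev", "hash")}
            if entry["prev"] != prev_hash or entry["hash"] != self._digest(record, prev_hash):
                return False
            prev_hash = entry["hash"]
        return True
```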
In summary, modern technology makes it possible to abolish paper cur-
rency and introduce CBDCs, which can also be used to address some of the
societal ills, such as crime, drug trafficking, illegal immigration, and so on,
and eliminate the costs of handling physical cash, which are of the order of 1%
of a country’s GDP; see, e.g., [38]. It will smooth the motion of the wheels of
commerce and help the unbanked to become participants in the digital economy,
thus positively affecting society at large.

6 One cannot help but notice, with a modicum of satisfaction, that critics of the celebrated
Vasicek model for interest rates [35], who vigorously attacked him for allowing short rates
to become negative, proved to be completely wrong.

18.7.3 How can CBDCs be used to implement the Chicago Plan?
Moreover, CBDCs make the execution of the celebrated Chicago Plan of
1933, originally proposed by Ricardo in 1824, for introducing narrow (full-
reserve) banking entirely possible—both firms and ordinary citizens can have
accounts directly with central banks, thus negating the need of having deposits
with commercial banks [6,7,41–43]. In this case, banks will lose their central
position in the economy and become akin to utility providers. They would
have to maintain the amount of central bank cash equal to the amount of
time deposits. Such narrow banks would in essence become the guardians
of the system by providing KYC and AML services and executing simple
transactions. In fact, in the wake of the global financial crisis, many central
banks massively increased their balance sheets, while commercial banks have
chosen to keep enormous quantities of nonmandatory deposits with them.
Thus the system de facto has moved toward narrow banking.

18.8 Conclusion
While the idea of BC/DLs is not new, modern technology gives it a new
lease of life. DLT opens new possibilities for making conventional banking and
trading activities less expensive and more efficient by removing unnecessary
frictions. Moreover, if built with skill, knowledge, and ambition, it has the
potential for restructuring the whole financial system on new principles. We
emphasize that achieving this goal requires overcoming not only technical but
also political obstacles.
While DLT has numerous applications, it is not entirely clear which finan-
cial applications should be handled first. Exchanges, payments, trade finance,
rehypothecation, syndicated loans, and other similar areas, where frictions
are particularly high, are attractive candidates. DCs, including CBDCs, are
another very promising avenue.
Currently, many applications of DL and related technology appear to be
misguided. In some cases, they are driven by a desire to apply these tools for
their own sake, rather than because the result would be clearly superior. In
other cases, they are driven by a failure to appreciate that the current systems
may not be as they are because of technological reasons, but rather because
of business and other considerations.

So far, practical applications of DLT in finance have been limited and a
lot remains to be done in order to achieve real breakthroughs.

Acknowledgments
The invaluable help of Marsha Lipton from Numeraire Financial in think-
ing about and preparing this chapter cannot be overestimated. I am grateful to
several colleagues, including Alex Pentland and David Shrier from MIT, Damir
Filipovic from EPFL, Matheus Grasselli from McMaster, Julian Phillips from
Standard Chartered Bank, and Paolo Tasca from UCL for their help and
suggestions. As CEO of StrongHold Labs, I am currently working on a new type
of a digital bank, which will be utilizing some of the ideas presented in this
chapter. This chapter is reprinted with permission from the Journal of Risk
Finance, 19(1), 2018.

References
1. Lipton, A., 2016, Banks must embrace their digital destiny, Risk Magazine, Vol.
29, No. 8.

2. Lipton, A., Shrier, D., and Pentland, A., 2016, Digital Banking Manifesto: The
End of Banks? in Frontiers of Financial Technology, Visionary Future, pp. 117–
140.

3. Tasca, P., Aste, T., Pelizzon, L., and Perony, N. (Eds.) 2016, Banking Beyond
Banks and Money: A Guide to Banking Services in the Twenty-First Century,
Springer, Switzerland.

4. Aristotle, Aristotle’s Nicomachean Ethics, R.C. Bartlett and S.D. Collins (trans-
lators), University of Chicago Press, Reprint edition.

5. Zyskind, G., Nathan, O., and Pentland, A., 2015, Enigma: Decentralized com-
putation platform with guaranteed privacy, MIT Working Paper.

6. Allen, W.R., 1993, Irving Fisher and the 100 percent reserve proposal, The
Journal of Law and Economics, Vol. 36, No. 2, pp. 703–717.

7. Beneš, J. and Kumhof, M., 2012, The Chicago plan revisited, IMF Working
Paper.

8. Miller, S.J., 1993, A fully replicated distributed database system, Research Note
ERL-0719-RN, Electronics Research Laboratory.

9. Greenspan, G., 2015, Avoiding pointless blockchain project, Working Paper.



10. Nakamoto, S., 2008, Bitcoin: A peer-to-peer electronic cash system, Working
Paper.

11. Chaum, D., 1983, Blind signatures for untraceable payments, in Advances in
Cryptology, Springer, US, pp. 199–203.

12. Buterin, V., 2016, Blog post, [Link] from the hard fork/.

13. Popper, N., 2015, Digital Gold: The Untold Story of Bitcoin, Penguin, UK.

14. Diffie, W., and Hellman, M., 1976, New directions in cryptography, IEEE Trans-
actions on Information Theory, Vol. 22, No. 6, pp. 644–654.

15. Miller, V.S., 1986, Use of Elliptic Curves in Cryptography. In: Williams H.C.
(Eds). Advances in Cryptology—CRYPTO ’85 Proceedings, Lecture Notes in
Computer Science, Vol. 218. Springer, Berlin, Heidelberg, pp. 417–426.

16. Koblitz, N., 1987, Elliptic curve cryptosystems, Mathematics of Computation,
Vol. 48, pp. 203–209.

17. Back, A., 2002, Hashcash—A denial of service counter-measure, Working Paper.

18. Merkle, R.C., 1987, A digital signature based on a conventional encryption func-
tion, in Conference on the Theory and Application of Cryptographic Techniques,
Springer, Berlin, Heidelberg, pp. 369–378.

19. King, R., 2013, Brunelleschi’s Dome: How a Renaissance Genius Reinvented
Architecture, Walker & Company, New York, NY.

20. Marx, K., 1867, Das Kapital: Kritik der Politischen Ökonomie, Verlag von Otto
Meissner, Germany.

21. Ansper, A., Buldas, A., Freudenthal, M., and Willemson, J., 2003, Scalable
and efficient PKI for inter-organizational communication, in Computer Security
Applications, Proceedings of 19th Annual Conference, IEEE, pp. 308–318.

22. Brown, R.G., Carlyle, J., Grigg, I., and Hearn, M., 2016, Corda: An Introduc-
tion, R3 CEV Working Paper.

23. Skeel, D., 2010, The New Financial Deal: Understanding the Dodd–Frank Act
and Its (Unintended) Consequences, John Wiley & Sons, Hoboken, NJ.

24. Turan, K., 2004, Never Coming to a Theater Near You: A Celebration of a
Certain Kind of Movie, PublicAffairs, New York.

25. Bloch, M., 1953, Mutations monétaires dans l’ancienne France: Première Partie,
Annales Economies, Societes, Civilisations, Vol. 8, No. 2, pp. 145–158.

26. Keynes, J.M., 1936, General Theory of Employment, Interest and Money,
Macmillan, London.

27. Keen, S., 2001, Debunking Economics: The Naked Emperor of the Social Sci-
ences, Zed Books, London & New York.

28. Werner, R.A., 2014, Can banks individually create money out of nothing?—The
theories and the empirical evidence, International Review of Financial Analysis,
Vol. 36, pp. 1–19.

29. Lipton, A., 2016, Modern monetary circuit theory, stability of interconnected
banking network, and balance sheet optimization for individual banks, Interna-
tional Journal of Theoretical and Applied Finance, Vol. 19, No. 6, pp. 1650034-1–
1650034-57.

30. Robinson, J., 1977, Michal Kalecki on the economics of capitalism, Oxford Bul-
letin of Economics and Statistics, Vol. 39, No. 1, pp. 7–17.

31. Buiter, W.H., 2009, The unfortunate uselessness of most “state of the art”
academic monetary economics, MPRA Working Paper.

32. Rogoff, K.S., 2016, The Curse of Cash, Princeton University Press, Princeton
and Oxford.

33. Ilgmann, C., 2015, Silvio Gesell: “A strange, unduly neglected” monetary theo-
rist, Journal of Post Keynesian Economics, Vol. 38, No. 4, pp. 532–564.

34. Fisher, I., Cohrssen, H.R., and Fisher, H.W., 1933, Stamp Scrip, Adelphi Com-
pany, New York, NY.

35. Vasicek, O., 1977, An equilibrium characterization of the term structure, Journal
of Financial Economics, Vol. 5, No. 2, pp. 177–188.

36. Barrdear, J. and Kumhof, M. 2016, The macroeconomics of central bank issued
digital currencies, Bank of England, Working Paper.

37. Broadbent, B. 2016, Central banks and digital currencies, Speech at London
School of Economics.

38. Lipton, A., 2016, The decline of the cash empire, Risk Magazine, Vol. 29, No.
11, p. 53.

39. Danezis, G. and Meiklejohn, S., 2015, Centrally banked cryptocurrencies, UCL
Working Paper.

40. Reid, F. and Harrigan, M., 2013, An analysis of anonymity in the bitcoin system,
in Security and Privacy in Social Networks, Springer, New York, pp. 197–223.

41. Baynham-Herd, X., 2016, Banking Balance Sheets and Blockchain: A Path to
100% Digital Money, UBS Discussion Paper.

42. King, M., 2016, The End of Alchemy: Money, Banking, and the Future of the
Global Economy, WW Norton & Company, New York, NY.

43. Dwyer, J., 2016, Central Bank-Issued Digital Currency: Assessing Central Bank
Perspectives of DLT and Implications for Fiat Currency and Policy Stimulus,
Celent Working Paper.
Chapter 19
Optimal Feature Selection Using
a Quantum Annealer

Andrew Milne, Mark Rounds, and Peter Goddard

CONTENTS
19.1 Introduction
19.2 Credit Scoring and Classification as a Business Problem
19.3 Quadratic Unconstrained Binary Optimization as an Established Approach
19.4 Formulation of the Credit Scoring and Classification Problem
19.5 Feature Selection
19.6 Classification
19.7 Binarizing, Scaling, and Correlating the German Credit Data
19.8 Coding the Feature Selector
19.8.1 An inspiring simplicity
19.8.2 QUBO feature selection in the 1QBit SDK
19.8.3 What happens in the call to minimize()
19.9 Evaluation Metrics
19.10 Experimental Results
19.10.1 Establishing the zero-rule and other baseline properties
19.10.2 QUBO feature selection with logistic regression
19.10.3 Recursive feature elimination with logistic regression
19.10.4 Comparison of QUBO feature selection and recursive feature elimination
19.10.5 Comparison with potentially missed subsets
19.11 Comparison with Previously Reported Results
19.12 Conclusion
Acknowledgments
References


19.1 Introduction
Quantum computing is still in its infancy. Its potential is sensed, but not
yet widely applied. Part of this is due to its specialized nature, and the small
size of the problems that can currently be handled. However, small does not
mean zero, and with the aid of software like the 1QBit Quantum-Ready Soft-
ware Development Kit, machines like the D-Wave quantum annealer can be
used to solve small but useful problems.
The software development kit (SDK) forms an abstraction layer between
the quantum hardware and the financial application program. In the specific
case of D-Wave, the SDK provides the objects needed to represent the objec-
tive function for a quadratic unconstrained binary optimization (QUBO) prob-
lem. The SDK also provides tools for translating constrained problems into
unconstrained problems, integer problems into binary problems, and so on.
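
To make the QUBO form concrete, here is a minimal sketch in plain Python. It does not use or imitate the 1QBit SDK or any D-Wave interface; the function names are hypothetical, and the exhaustive search stands in for the annealer only on very small problems. The second function illustrates one standard way of turning a constrained problem into an unconstrained one: the constraint "exactly one variable equals 1" is added as the penalty P(x_1 + ... + x_n - 1)^2, whose constant term is dropped.

```python
from itertools import product


def qubo_energy(Q, x):
    """Objective value x^T Q x for a binary vector x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def brute_force_minimize(Q):
    """Try all 2^n binary vectors; feasible only for small n."""
    n = len(Q)
    best = min(product((0, 1), repeat=n), key=lambda x: qubo_energy(Q, x))
    return best, qubo_energy(Q, best)


def add_exactly_one_penalty(Q, penalty):
    """Return a copy of Q with penalty * (sum(x) - 1)^2 folded in.

    Because x_i^2 = x_i for binary variables, the expansion contributes
    -penalty on the diagonal and +2 * penalty on each pair i < j.
    The constant term (+penalty) does not affect the minimizer and is dropped.
    """
    n = len(Q)
    R = [row[:] for row in Q]
    for i in range(n):
        R[i][i] -= penalty
        for j in range(i + 1, n):
            R[i][j] += 2 * penalty
    return R


# Each variable lowers the objective, so the unconstrained optimum selects all.
Q = [[-1, 0, 0],
     [0, -2, 0],
     [0, 0, -3]]
unconstrained = brute_force_minimize(Q)
constrained = brute_force_minimize(add_exactly_one_penalty(Q, penalty=10))
```

With the penalty in place, the minimizer selects only the third variable, the single most valuable one.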
Optimization is a computational paradigm that follows naturally from
the physics of quantum annealing. However, other types of hardware can
have other paradigms. The abstraction layer is designed to dispatch its high-
level problem representations to the appropriate physical solver. 1QBit refers
to the SDK as quantum ready, and to its overall architecture as hardware
agnostic.
The practical details of abstracting from multiple-state qubits to conven-
tional ones and zeros are outside the scope of this chapter. Suffice it to say
that human beings have been doing experiments with quantum mechanical
systems for over a century now. A lot is known about how to distinguish
between energy states, and how to accumulate observations until some level
of certainty has been reached. Some of this knowledge is encoded into the
software made by the hardware manufacturers themselves. However, for the
quantitative analyst or software developer working on a business problem, it is
easier to work with a consistent set of abstract entities that map more closely
onto the problem domain. The goal of the SDK is to provide these.
In the rest of this chapter, we will approach the problem of optimal feature
selection for credit scoring and classification as a perfectly ordinary problem
from the literature, that we just happen to solve with a quantum computer.
Only at a few select points will we pull back the curtain to reveal the hardware
being used.

19.2 Credit Scoring and Classification as a Business Problem
Credit scoring and classification is a significant problem. The total amount
of money loaned globally is difficult to measure [1,2]. If we focus on the
United States, household debt alone is estimated to be around $14 trillion
[3]. The Federal Reserve also reports that approximately 2% of these loans
are nonperforming [4]. Superficially, this indicates that lenders are making
good decisions 98% of the time. However, the U.S. Federal Deposit Insurance
Corporation [5] publishes yearly summaries of bank failures, and in the 15
years since 2001, there have been 547 failed institutions. Nonperforming loans
have been a major cause.
According to the 2016 Credit Access Survey by the U.S. Federal Reserve
Bank of New York [6], approximately 40% of U.S. credit applications are
rejected. Moreover, between 20% and 40% of consumers expect their applica-
tions to be rejected (it depends on the type of credit), and many do not even
apply. Yet, among these people, there may well be qualified customers for the
right kind of lender.
According to a literature survey by Huang [7], the academic approach to credit
scoring is typically one of bigger data and bigger models. However, in a recent article
in Forbes magazine, consumer lending veteran Matt Harris [8] takes a different
view:

Most start-up originators focus on the opportunity to innovate in
credit decisioning. The headline appeal of new data sources and new
data science seem too good to be true as a way to compete with
stodgy old banks. And, in fact, they are indeed too good to be true.

The recipe for success here starts with picking a truly underserved
segment. Then figure out some new methods for sifting the gold from
what everyone else sees as sand; this will end up being a combination
of data sources, data science and hard won credit observations.

...[P]rogressive and thoughtful traditional lenders like Capital One
have mined most of the segments large enough to build a business
around. The only way to build a durable competitive moat based on
Customer Acquisition is to become integrated into a channel that is
proprietary, contextual and data-rich. (original emphasis)

If Harris is correct, the thing to look for in new credit scoring and classifi-
cation tools will not be their success rate in large-scale applications for which
tools like FICO [9] already exist, but in their flexibility and ease of integration
into specialized applications.
Feature selection has a natural role in this. More and more data is available
all the time, and although there are various complex schemes for using it, the
idea of finding a small set of key features is simple and easy to grasp. Ideally, we
would want these features to be both influential and independent, but here too
there are nuances. People lie. Data can mislead [10]. The “redundant” feature
might actually be the corroborative feature. The correct balance of influence
and independence will depend on the “hard-won credit observations” that
Harris sees as crucial, and on the lender’s confidence that the model genuinely
includes them.
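
The balance between influence and independence can be sketched as a small binary optimization. Everything in this example is an assumption made for illustration: the relevance and redundancy numbers are invented, the weighting parameter `alpha` and the function names are hypothetical, and the exhaustive search is practical only for a handful of features. Influence is rewarded on the diagonal, pairwise redundancy is penalized off the diagonal, and `alpha` sets how much the lender trusts influence over independence.

```python
from itertools import product


def feature_qubo(relevance, redundancy, alpha):
    """Upper-triangular matrix: reward relevance, penalize redundant pairs."""
    n = len(relevance)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = -alpha * relevance[i]
        for j in range(i + 1, n):
            Q[i][j] = (1 - alpha) * redundancy[i][j]
    return Q


def best_subset(Q):
    """Exhaustive search over all feature subsets (small n only)."""
    n = len(Q)

    def energy(x):
        return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))

    return min(product((0, 1), repeat=n), key=energy)


# Invented numbers: features 0 and 1 are both influential but strongly
# correlated with each other; feature 2 is weaker but nearly independent.
relevance = [0.8, 0.7, 0.3]
redundancy = [[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.1],
              [0.1, 0.1, 0.0]]

balanced = best_subset(feature_qubo(relevance, redundancy, alpha=0.5))
influence_heavy = best_subset(feature_qubo(relevance, redundancy, alpha=0.9))
```

With equal weighting the second feature is dropped as redundant; with a heavier weight on influence it is kept, which is the "corroborative feature" situation described above.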

It should be noted that even the largest and most elaborate system
for credit scoring has to begin somewhere. Moreover, the addition of new
credit instruments to an existing system needs its own stage of analysis and
validation. Thus the work of a quantitative analyst may involve both small
feature sets and large feature sets. The development of new instruments is
not so very different from the development of new markets as Harris pictures
them.

19.3 Quadratic Unconstrained Binary Optimization as an Established Approach
QUBO has been applied to the credit scoring problem by several
researchers, beginning with Demirer in 1998 [11], and taken up more recently
by Huang [7] and Waad [12]. For comparative purposes, the technique is
usually applied to a widely available data set, such as the “German Credit
Data,” that is, the Hofmann data in the Statlog Data Set, as published by
the Machine Learning Repository at the University of California, Irvine (UCI)
[13]. However, the optimization step has often been seen as time-consuming,
and the technique has not yet made it into the most popular (i.e., free) soft-
ware toolkits.
For convenience, we will split the credit scoring problem into two parts:
feature selection and classification (i.e., classification using the selected fea-
tures instead of the full feature set). Sometimes, these are “wrapped” together
so that the parameters for the selection algorithm and the parameters for the
classifier can be optimized holistically for the best overall accuracy scores.
However, in order to focus on the feature selection part, we will keep them
separate and use the same classifier throughout.
In order to compare QUBO Feature Selection with other techniques, we
will also show the accuracy scores from feature subsets obtained from recursive
feature elimination (RFE), a widely used technique available in packages such
as scikit-learn [14]. We will look at two variants:

• Stand-alone RFE, where the desired number of features is set explicitly.
The least influential features are eliminated one at a time until the desired
number of features is left.
• RFE wrapped with cross-validation (RFECV), where the program evalu-
ates the performance of its classifier after every feature removal and ter-
minates when the highest accuracy score has been found. Cross-validation
is discussed later in Section 19.9. Broadly speaking, it involves using a
part of the data set for training, and the remaining part for evaluation.
The roles are then switched so that every data point is used an equal
number of times in each role.
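
The two variants can be sketched side by side as follows. This is a minimal illustration on synthetic data, not the experiment reported in this chapter; the data set, the target of four features, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 200 samples, 10 features, 3 informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)

estimator = LogisticRegression(max_iter=1000)

# Stand-alone RFE: the target number of features is set explicitly.
rfe = RFE(estimator, n_features_to_select=4, step=1).fit(X, y)
rfe_features = np.where(rfe.get_support())[0]

# RFECV: the number of features is chosen by cross-validated accuracy.
rfecv = RFECV(estimator, step=1, cv=3, scoring="accuracy").fit(X, y)
rfecv_features = np.where(rfecv.get_support())[0]

print("RFE kept:", rfe_features.tolist())
print("RFECV kept:", rfecv_features.tolist(), "count:", rfecv.n_features_)
```

With stand-alone RFE the analyst has to supply the cardinality; with RFECV the cardinality is an output of the procedure.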

The main difference between QUBO Feature Selection and RFE lies in how
aggressively each tries to reduce the number of feature variables in the fea-
ture subset. QUBO Feature Selection considers both the independence and
influence of the features under consideration. RFE focuses on eliminating
features that are less influential. Both approaches yield good results on the
German Credit Data.
For the classifier, we use logistic regression from scikit-learn. Unlike linear
regression, where one can imagine two continuous variables and fitting a line
to a scattered set of points, logistic regression assumes that the dependent
variable is a category, for example, the zero and one of a binary classifier. The
fitted line no longer predicts the value of the dependent variable, but rather
the probability that the dependent variable will have a specific value. It is a
well-established technique with a long pedigree [15].
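
As a small illustration of the difference, the sketch below fits a logistic regression to a made-up one-feature data set and prints the predicted class probabilities rather than a predicted value. The data points are assumptions chosen only to make the behavior visible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (assumption): one feature, class 1 becomes more likely as it grows.
X = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# predict_proba returns P(class 0) and P(class 1) for each row.
new_points = np.array([[-1.0], [0.0], [1.0]])
proba = clf.predict_proba(new_points)
labels = clf.predict(new_points)

print(proba.round(3))
print(labels)
```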
To provide a benchmark, we apply logistic regression to the full fea-
ture set. Out of the box (i.e., without tuning), the logistic regression class
from scikit-learn gives a 75% success rate on the German Credit Data.
This is comparable to other methods reported in the literature, such as
support vector machines (SVMs), decision trees, neural networks, k-Nearest
Neighbors (k-NN) classification schemes, and so forth, as in Waad [12] or
Huang [7].
Standing on the shoulders of giants, then, we will now approach the optimal
feature selection problem in a perfectly ordinary way.

19.4 Formulation of the Credit Scoring and Classification Problem
Assume that we have some data on past credit applicants that we believe
will be useful in predicting the creditworthiness of new applicants. The data
is composed of features, where each feature may be:

• An integer, where the order (lower to higher) has potential meaning, for
example, bank balances, years of education, and so on.

• A category, such as a geographic region code, where higher and lower
values are arbitrary. It may also include values such as “missing” or
“refused to answer.”
• A decimal number, representing, for example, age, dollar amounts, inter-
est rates, and so on, where the order has meaning. Data such as the
latitude and longitude of the applicant’s home would be better repre-
sented as a category.
• A Boolean (yes/no) value, which may be considered as an integer or a
category.

A credit observation, or sample, consists of an observation for each feature,
some means of identifying the applicant, and the outcome. The raw data may
come from many sources.
In practice, we clean the raw data to create a vector of “feature variables”
for each observation. For example, we may convert some or all of the cate-
gorical variables to binary indicators, a step described later, in Section 19.7.
We may also scale the data and (in some cases) replace missing values with
inferences. In the description that follows, we will assume that these steps
have been taken and the data is in a form suitable for input to our feature
selector and classifier.
For convenience, we organize the clean data as a matrix of m rows and
n columns. Each column represents a feature, and each row represents the
specific data values for a specific past credit applicant.
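\[
U = \begin{bmatrix}
u_{11} & u_{12} & u_{13} & \cdots & u_{1n} \\
u_{21} & u_{22} & u_{23} & \cdots & u_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
u_{m1} & u_{m2} & u_{m3} & \cdots & u_{mn}
\end{bmatrix}.
\]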
Our goal is to determine how the data on past applicants can inform us
on the creditworthiness of new applicants. For this, we need a record of the
decisions that were made. We represent these as the m-element vector
\[
V = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix}.
\]
The $v_i$ will be constrained to take on the values 0 and 1, where 0 represents
the acceptance and 1 represents the rejection of credit application $i$.
Conceptually, the classifier will be a “credit risk detector” that signals
when an applicant should not be granted credit. This is consistent with the
U.S. Fed data showing that rejections are less common than acceptances.
Acceptance is the rule, and rejection is the exception. The credit risk detector
has many analogues in other fields, and a well-established terminology exists
in the literature [16].

19.5 Feature Selection


Assume that from the original set of n features, we want to select a subset
of K features to use in making a credit decision. Data costs money, and we may
want to experiment with different sources. We may be prevented from using
certain data in certain jurisdictions. Or we may simply be curious, looking for
the insights that we won’t know until we find them.

We want to search broadly, without prejudging the data. However, the
number of possible subsets for each K is given by the combinatorial function
C(n, K). Even if we can eliminate some candidates at the outset, the search
space will typically be very large. We want to focus our search on areas where
a good subset is likely to be found.
Mathematically, our goal will be to find the columns of U that are cor-
related with V , but not correlated with each other. We have not yet defined
what correlation means here, but we assume that such a calculation is possible
and that the value of the correlation coefficient can take on values from −1
to 1. Note that we can interpret “correlation” quite liberally: a “hard-won
credit observation” might appear as a “hard requirement” that an attribute
be present. Taking this a step further, the conversion of categorical variables
to binary indicators (see below) can be expanded to include corroborations
that the lender sees as being necessary.
Let $\rho_{ij}$ represent the correlation between column $i$ and column $j$ of
the matrix $U$, and let $\rho_{Vj}$ represent the correlation between column
$j$ of $U$ and the single column of $V$.
To find the “best” subset, we introduce $n$ binary variables $x_j$, which have
the property
\[
x_j = \begin{cases}
1, & \text{if feature } j \text{ is in the subset} \\
0, & \text{otherwise.}
\end{cases}
\]
We refer to these collectively as the vector X, where
\[
X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.
\]
We will associate the best subset with the value of X that minimizes an
objective function, which we construct from two components.
The first component of the objective function represents the influence that
features have on the marked class, shown here in a form that increases as
more terms are included:
\[
\sum_{j=1}^{n} x_j |\rho_{Vj}|.
\]

The second component of the objective function represents the independence
of the features. The form shown below increases as more of the
cross-correlated terms are included, which is the opposite of independence.

\[
\sum_{j=1}^{n} \sum_{\substack{k=1 \\ k \neq j}}^{n} x_j x_k |\rho_{jk}|.
\]

To obtain an objective function that is maximized at the optimum, we will
need to subtract the second term from the first term.

We perform this subtraction with the aid of a parameter α (0 ≤ α ≤ 1),
which represents the relative weighting of independence (greatest at α = 0)
and influence (greatest at α = 1).
Finally, we negate the expression overall to optimize at the minimum,
which yields the objective function
\[
f(x) = -\left[ \alpha \sum_{j=1}^{n} x_j |\rho_{Vj}|
       - (1 - \alpha) \sum_{j=1}^{n} \sum_{\substack{k=1 \\ k \neq j}}^{n}
         x_j x_k |\rho_{jk}| \right].
\]

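We can make use of the property that $x_j x_j = x_j$ for binary variables,
which allows us to rewrite the summation as a vector product:

\[
f(x) = -x^T Q x,
\]

where the diagonal entries of $Q$ are $Q_{jj} = \alpha |\rho_{Vj}|$ and the
off-diagonal entries are $Q_{jk} = -(1 - \alpha) |\rho_{jk}|$ for $j \neq k$.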

From this, we can express the problem in terms of the argmin operator,
which returns the vector x∗ for which its function argument is minimized:
\[
x^* = \operatorname*{argmin}_{x} \left( -x^T Q x \right).
\]
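
The construction can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the Q matrix is taken to hold α|ρ_Vj| on its diagonal and −(1 − α)|ρ_jk| off the diagonal (which reproduces the summation above term by term), the correlation values are made up, and the exhaustive search is only a stand-in for a real QUBO solver on a problem this small. The function names are hypothetical.

```python
import itertools
import numpy as np

def build_q(rho_v, rho, alpha):
    """Q with alpha*|rho_Vj| on the diagonal and -(1-alpha)*|rho_jk| elsewhere."""
    q = -(1.0 - alpha) * np.abs(np.asarray(rho, dtype=float))
    np.fill_diagonal(q, alpha * np.abs(rho_v))
    return q

def objective(x, q):
    """f(x) = -x^T Q x for a 0/1 vector x."""
    x = np.asarray(x, dtype=float)
    return -(x @ q @ x)

def brute_force_argmin(q):
    """Enumerate all 2^n binary vectors; only feasible for small n."""
    n = q.shape[0]
    best = min(itertools.product([0, 1], repeat=n),
               key=lambda x: objective(x, q))
    return np.array(best)

# Made-up correlations for n = 4 features (assumption, for illustration only).
rho_v = np.array([0.30, 0.28, 0.10, 0.05])
rho = np.array([[1.0, 0.9, 0.1, 0.0],
                [0.9, 1.0, 0.2, 0.1],
                [0.1, 0.2, 1.0, 0.3],
                [0.0, 0.1, 0.3, 1.0]])

q = build_q(rho_v, rho, alpha=0.7)
x_star = brute_force_argmin(q)
print(x_star, objective(x_star, q))
```

Features 0 and 1 in this toy example are both influential but strongly correlated with each other, which is the situation the independence term is meant to penalize.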

19.6 Classification
The classification problem may be stated as follows: given a row vector u of
new observations from a new applicant, calculate whether the vector belongs
to the creditworthy class. More specifically, find a function f (u) that returns
0 for acceptance and 1 for rejection.
One of the premises of machine learning is that such a function can be
derived from a programmatic analysis of existing data. The existing data is
divided into a training set and a test set. A candidate function is derived
from the training set, and its performance is measured on the test set. Much
has been written on the best way to define such functions, the best way to
divide the data, how to adapt to new data, and so on. For example, see
the citation lists at the UCI Machine Learning Repository [13] or Chen [17].
For a cautionary note, however, the well-known Anscombe’s Quartet is worth
revisiting as a problem in dividing points as opposed to fitting lines through
them [18].
In our example here, we use a simple classifier based on logistic regression.
However, in the code examples we will see that other classifiers could easily be
used in its place. One might also imagine a classifier with tunable parameters,
and searching through these to obtain the settings for best overall perfor-
mance. In the future, the speed of QUBO Feature Selection on a quantum
annealer might enable searches on quite large spaces to be done interactively,
as opposed to being spread out over hours or even days.

19.7 Binarizing, Scaling, and Correlating the German Credit Data
The German Credit Data under consideration was originally published in
1994 by Hans Hofmann at the Institute for Statistics and Econometrics, the
University of Hamburg. It has been studied extensively.
The data consists of 20 features (7 numerical, 13 categorical) and a binary
classification (good credit or bad credit). There are 1000 rows, of which 700
are “good” and 300 are “bad.” The data is intended for use with a cost matrix,
where giving credit to a bad applicant is five times as bad as not giving credit
to a good applicant. In this example, however, we are concerned mainly with
the relative “predictive power” of the feature subsets, so the cost matrix was
not used.
We prepare the data as follows:

• The german.data file from UCI is imported into a Jupyter (IPython)
notebook as a pandas DataFrame and given column headers with names
from the accompanying german.doc file.

• The categorical variables are converted to “one-hot” binary indicators
using the DictVectorizer class from scikit-learn.
• The first binary indicator in each group is removed (for k indicators, only
k − 1 are independent).

• All of the numerical features are scaled to mean zero and variance one.
• The classification variable is transformed to 0 = good, 1 = bad.
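
The preparation steps listed above can be sketched on a toy table as follows. The rows, the column names "ChqStat", "Duration", and "Class", and the assumption that the raw class coding is 1 = good and 2 = bad are all illustrative choices for this example, not a copy of the notebook used for the chapter.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

# Toy rows (assumption) imitating two German Credit Data attributes.
raw = pd.DataFrame({
    "ChqStat": ["A11", "A12", "A14", "A13", "A11", "A14"],
    "Duration": [6, 48, 12, 42, 24, 36],
    "Class": [1, 2, 1, 1, 2, 1],          # assumed raw coding: 1 = good, 2 = bad
})
categorical, numerical = ["ChqStat"], ["Duration"]

# One-hot encode the categorical columns; names come out as "ChqStat=A11", ...
vec = DictVectorizer(sparse=False)
onehot = vec.fit_transform(raw[categorical].to_dict(orient="records"))
features = pd.DataFrame(onehot, columns=vec.feature_names_)

# Drop the first indicator of each group (k indicators, k - 1 independent).
for col in categorical:
    group = [c for c in features.columns if c.startswith(col + "=")]
    features = features.drop(columns=group[0])

# Scale the numerical columns to mean zero and variance one.
scaled = StandardScaler().fit_transform(raw[numerical].astype(float))
for i, col in enumerate(numerical):
    features[col] = scaled[:, i]

# Recode the class variable to 0 = good, 1 = bad.
target = raw["Class"].map({1: 0, 2: 1})
print(features.columns.tolist())
```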

The subsequent correlation step is not so straightforward. In preparing
this example, we looked at a variety of correlation methods. These led to a
small difference in the feature subsets, but no real difference in the accuracy
of the classifications. It was noted, however, that methods with a “smooth”
distribution of coefficients (Spearman, Pearson, and so on) worked better with
the quadratic objective function than correlation methods with sharper jumps,
such as “mutual information” scores, as seen in Pedregosa [14] or Rosenberg
[19]. In the end, the Spearman method was chosen as being simple and easy
to reproduce. However, this is an area where more research is needed.
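
As a rough sketch of the correlation step, the Spearman coefficients between features, and between each feature and the class, can be read from a single pandas correlation matrix. The synthetic columns below are assumptions standing in for the 48 feature variables; "f1" is deliberately made nearly redundant with "f0".

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 4 feature variables and a 0/1 class vector.
n = 500
f0 = rng.normal(size=n)
features = pd.DataFrame({
    "f0": f0,
    "f1": f0 + 0.1 * rng.normal(size=n),     # nearly redundant with f0
    "f2": rng.normal(size=n),
    "f3": rng.integers(0, 2, size=n),        # a binary indicator
})
target = (f0 + 0.5 * rng.normal(size=n) > 0).astype(int)

# One Spearman matrix over features plus class, then split it in two.
rho_all = features.assign(cls=target).corr(method="spearman")
rho = rho_all.drop(index="cls", columns="cls")   # feature-feature correlations
rho_v = rho_all["cls"].drop("cls")               # feature-class correlations

print(rho.abs().round(2))
print(rho_v.abs().sort_values(ascending=False).round(2))
```

Passing method="pearson" or method="kendall" to the same call gives the other coefficients mentioned above.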
The binarization and scaling procedures transform the 20 features in the
German Credit Data into 48 feature variables. For example, Attribute 1 (“Sta-
tus of existing checking account”), which has four possible values in the
original data, gets converted into three binary indicators. (In the resulting
DataFrame, these appear as columns “ChqStat=A12,” “ChqStat=A13,” and
“ChqStat=A14.”) These 48 feature variables are the input to the feature selec-
tion algorithm. The feature subset at the output of the feature selector then
forms the input to the classifier.

19.8 Coding the Feature Selector


The 1QBit SDK [20] provides a toolkit for solving QUBO problems. The
details of the underlying quantum hardware are abstracted away, so that the
code used by the analyst is no more complex than the code needed to use
machine learning packages such as Weka [21] or scikit-learn [14].

19.8.1 An inspiring simplicity


The RFECV class in scikit-learn shows just how easy a good package can
make things for the analyst. There are other ways to do RFE, but RFECV
“wraps” RFE with a classifier chosen by the user, as shown below with Logistic
Regression,

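import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# featureFrame and classSeries are assumed names for the prepared pandas objects
featureMatrix = featureFrame.values
classVector = classSeries.values
estimator = LogisticRegression()
selector = RFECV(estimator, step=1, cv=3)    # remove one feature per step, 3-fold CV
selector = selector.fit(featureMatrix, classVector)
indexList = selector.get_support()           # Boolean mask over the columns
featureList = np.where(indexList)[0]         # integer indices of the kept columns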

At the end of this simple code block, the analyst has a feature list that can
be used to select columns from the feature matrix. The accuracy scores for
the classifier (a.k.a. the estimator) can then be computed using testing and
scoring classes such as ShuffleSplit or StratifiedShuffleSplit.
Strictly speaking, RFECV is not directly comparable with QUBO Feature
Selection, since the feature list is calculated independently of the accuracy
scoring. The simpler variant of RFE with a cardinality target is closer in
terms of program flow. A list of candidate subsets (of varying cardinalities) is
returned by the selector and subsequently tested for performance. However,
the simplicity of the RFECV code block is both an example and an inspiration.

19.8.2 QUBO feature selection in the 1QBit SDK


QUBO Feature Selection involves searching for the value of α that yields
the feature subset with the highest accuracy score (or similar performance
metric). Huang [7] has found that the QUBO method returns a list of can-
didate subsets, corresponding to an objective function that consists of flat
regions and discrete jumps. This is a reasonable behavior for an objective
function that should change as features are added or removed from a candi-
date subset.
The remarkable thing about QUBO Feature Selection is that the number
of candidate subsets is roughly equivalent to the number of features overall,
that is, there is usually only one subset at each possible cardinality. There
is no a priori reason why this should be so, and the authors of this chapter
are hopeful that the wider use of QUBO Feature Selection may lead to some
deeper insights.
In the code example below, we construct a Q matrix using code that can
be easily written from the equations shown in the previous section. We can
assign our value of α at the outer level of a grid search, or at the core of a
bisection search that looks for jumps in the objective function, but in each
case we must somewhere solve the optimization problem to obtain a candidate
feature subset.
To do this, we use the SDK’s QuadraticBinaryPolynomialBuilder class,
which returns a polynomial object representing the objective function seen
earlier. The poly object is passed to a solver and the solution is returned as a
list of ones and zeros, referred to as a configuration. We convert this to a list of
integer indices that can be used to extract columns from a pandas DataFrame
or NumPy array, which is typically how feature matrices and class vectors are
passed to machine learning methods.
In the example below, the D-Wave solver is assigned explicitly.

builder = [Link]()
## ... Some code to construct the Q matrix...
poly = builder.build_polynomial()
solver = [Link](HWDWaveSolver(url, token))
solutionList = solver.minimize(poly)
lowEnergySolution = solutionList.get_minimum_energy_solution()
config = [Link]
featureList = [Link]([Link]()).flatten().
tolist()
featureMatrix = [Link][:, featureList].values
classVector = [Link]
estimator = LogisticRegression() # and so on...

In the experimental study that led to this chapter being written, we fixed
the parameters for the classifier and focused on studying how the value of
α affected the feature subset returned by QUBO Feature Selection. A full
holistic optimization across all of the available parameters is a topic for future
work.
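
The grid search over α can be sketched without any quantum hardware. In the example below an exhaustive search stands in for the solver, which is only practical for a handful of features; the correlation values are random numbers and the function names are hypothetical. The Q matrix is assumed to hold α|ρ_Vj| on its diagonal and −(1 − α)|ρ_jk| elsewhere.

```python
import itertools
import numpy as np

def solve_qubo(q):
    """Exhaustive stand-in for the annealer: argmin of -x^T Q x over 0/1 vectors."""
    configs = (np.array(c) for c in itertools.product([0, 1], repeat=len(q)))
    return min(configs, key=lambda x: -(x @ q @ x))

def select_features(rho_v, rho, alpha):
    """Build Q for the given alpha and return the selected feature indices."""
    q = -(1.0 - alpha) * np.abs(rho)
    np.fill_diagonal(q, alpha * np.abs(rho_v))
    return tuple(np.flatnonzero(solve_qubo(q)).tolist())

# Made-up correlations for 5 features (assumption, for illustration only).
rng = np.random.default_rng(1)
rho_v = rng.uniform(0.05, 0.35, size=5)
a = rng.uniform(-0.5, 0.5, size=(5, 5))
rho = (a + a.T) / 2
np.fill_diagonal(rho, 1.0)

# Grid search over alpha: record each distinct subset and where it first appears.
candidates = {}
for alpha in np.linspace(0.0, 1.0, 21):
    subset = select_features(rho_v, rho, alpha)
    candidates.setdefault(subset, round(float(alpha), 2))

for subset, first_alpha in candidates.items():
    print(f"first seen at alpha = {first_alpha:.2f}: features {list(subset)}")
```

Each distinct subset found this way would then be passed to the classifier and scored, and the α with the best score retained.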

19.8.3 What happens in the call to minimize()


The physical D-Wave machine is located at a URL. Access is controlled by
a security token assigned to the calling program, as well as by various network
management techniques.
The D-Wave machine is based on magnetic effects that occur at very low
temperatures [22]. The chip containing the qubits is cooled to 15 mK (about
180 times colder than interstellar space). It is shielded from electric and mag-
netic fields inside a metal enclosure. To maintain the low temperature, the
chip operates in a high vacuum environment at a pressure some 10 billion
times lower than atmospheric pressure. Getting the signal from the outside
world to the chip is an engineering accomplishment in itself.
The qubits may be thought of as small circulating currents governed by
superconducting effects. The direction of the current may be thought of as
representing a one or a zero. The qubits can interact with each other, with
the exact degree of interaction controlled by electric and magnetic fields that
can be externally applied. Collectively, the qubits form an ensemble whose
energy is determined by the signs of the magnetic fields, and their coupling
with the imposed fields. The overall arrangement corresponds to an Ising
model, a concept from statistical mechanics that has been studied for close to
a century.
The quantum annealing process consists of

• Initialization. The chip containing the qubits is prepared so that it
represents a trivial problem, and is in the ground state of that problem,
which is an equally weighted superposition of all possible states.

• Adiabatic Transformation. The system is then transformed continuously
to the point that it represents the optimization problem that we want
to solve. If this process is done slowly enough, the adiabatic theorem
guarantees that the system will remain in the ground state, as long as
external disturbances are absent.

• Readout. The state of the system is then read, and in the ideal case it
would correspond to the optimal solution of the optimization problem
we wish to solve.

The adiabatic transformation is accomplished by slowly applying electric
and magnetic fields whose magnitudes correspond to the coefficients in the Q
matrix. In a real device, however, external interference is always present, so
the result is probabilistic, and annealing the same problem multiple times will
increase the probability of finding the optimum. The number of annealings to
perform is determined by the controlling software.
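
The value of repeating the anneal can be illustrated with a purely classical analogue. The sketch below is ordinary simulated annealing with single-bit flips, not a model of the D-Wave hardware; the cooling schedule, the number of reads, and the small Q matrix are assumptions chosen for the example.

```python
import numpy as np

def anneal_once(q, rng, sweeps=200, t_start=1.0, t_end=0.01):
    """One classical simulated-annealing run on f(x) = -x^T Q x."""
    n = len(q)
    x = rng.integers(0, 2, size=n)
    energy = -(x @ q @ x)
    for t in np.geomspace(t_start, t_end, sweeps):
        for j in rng.permutation(n):
            trial = x.copy()
            trial[j] ^= 1                      # flip one bit
            trial_energy = -(trial @ q @ trial)
            # Metropolis rule: always accept improvements, sometimes accept worse.
            if (trial_energy <= energy
                    or rng.random() < np.exp((energy - trial_energy) / t)):
                x, energy = trial, trial_energy
    return x, energy

def anneal_many(q, num_reads=20, seed=0):
    """Repeat the anneal and keep the lowest-energy configuration seen."""
    rng = np.random.default_rng(seed)
    reads = [anneal_once(q, rng) for _ in range(num_reads)]
    return min(reads, key=lambda read: read[1]), reads

# Small made-up Q matrix (assumption): diagonal rewards, off-diagonal penalties.
q = np.array([[0.20, -0.30, -0.02],
              [-0.30, 0.18, -0.05],
              [-0.02, -0.05, 0.10]])

(best_x, best_energy), reads = anneal_many(q)
print(best_x, best_energy)
```

Any single read may end in a local minimum; taking the minimum over many reads is what raises the chance of reporting the true optimum.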
The internal structure of the D-Wave chip is outside the scope of this
chapter. However, to give an idea of the challenges facing the abstraction layer,
recall that the qubits have to be fabricated and connected using structures that
can be formed on the surface of a semiconducting chip. At the time of writing,
the physical layout is described by a square Chimera graph, composed of a
lattice of bipartite unit cells containing eight qubits, as shown in Figure 19.1.
If we label the number of unit cells along an edge as $s$, then the total
number of qubits is $q = 8s^2$. The hardware graph is sparse and in general
does not match the problem graph, which is defined by the adjacency matrix
of the problem matrix Q. In order to solve problems that are denser than
the hardware graph, we identify multiple physical qubits with a single logical
qubit (a problem known as “minor embedding”), at the cost of using many
more physical qubits.

FIGURE 19.1: An example hardware graph, showing the connectivity of the
qubits for a Chimera graph with $s = 4$ unit cells in each row/column, giving
a total of $q = 128$ qubits.

For square Chimera hardware graphs, the size $V$ of the largest fully dense
problem that can be embedded on a chip with $q$ qubits is
$V = \sqrt{2q} + 1 = 4s + 1$, assuming no faulty qubits or couplers. For
example, for a chip with $s = 12$ unit cells along each side (giving
$q = 1152$ qubits), we find $V = 49$. Lower density problems of significantly
larger size can be embedded. For example, experiments by 1QBit on annealers
of this size have shown successful embeddings with $V \approx 140$ and a
density of $\approx 0.1$.
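
The sizing relations quoted above are easy to tabulate. The helper below is a hypothetical convenience function that simply evaluates q = 8s² and V = √(2q) + 1 for a given number of unit cells per side.

```python
import math

def chimera_capacity(s):
    """Qubit count and largest fully dense problem for an s-by-s Chimera graph.

    Uses the relations quoted in the text: q = 8 s^2 and
    V = sqrt(2 q) + 1 = 4 s + 1, assuming no faulty qubits or couplers.
    """
    q = 8 * s * s
    v = math.isqrt(2 * q) + 1
    return q, v

for s in (4, 12):
    q, v = chimera_capacity(s)
    print(f"s = {s}: q = {q} qubits, largest fully dense problem V = {v}")
```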
It should be emphasized that users of the 1QBit SDK do not have to
concern themselves with the size of the problem or the details of the
embedding. For problems that exceed the capacity of the available D-Wave machine,
the SDK can route the calculation to a simulated annealer (software) or a
simulated Ising system (hardware).
The performance of simulated quantum annealers is now very good, and
to anticipate our conclusions slightly, there is no real limit on the size of the
feature selection problems that the SDK can handle, save that below a certain
(but ever increasing) size, the solution will be passed through a quantum
computer and be very fast as opposed to merely fast.

19.9 Evaluation Metrics


The evaluation of QUBO Feature Selection and RFE was performed by
wrapping them with the LogisticRegression model from scikit-learn. The
evaluation metric was defined as the unweighted accuracy, that is, the number
of correct classifications divided by the total number of classifications made.
Other metrics from scikit-learn were attempted, but they always led to the
same optimal alpha or RFE feature set. Unweighted accuracy was kept for
compatibility with other work and ease of understanding, as in Reference 7.
Testing and scoring were performed using the StratifiedShuffleSplit cross-
validation class from scikit-learn. Given the feature matrix, this class returns
sets of row indices that can be used to divide the matrix rows into a training
set and a test set. The separation is done in folds, with the number of folds
set by an argument. For example, a shuffle and split with 5 folds will take
80% of the matrix for the training set and 20% for the test set, and repeat
this process until all 5 of the possible 20% folds have been used as test sets.
The accuracy score for each split is slightly different. Since these reflect a
random selection of data for training and testing, it is conventional to report
the mean of the individual accuracy scores. We follow the convention used by
scikit-learn and calculate the error bars for the 95% confidence level.
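
A minimal sketch of this scoring procedure is shown below. It uses synthetic data with a 70/30 class balance rather than the German Credit Data, and 100 shuffles rather than the larger numbers used in the chapter; the error bar is taken as two standard deviations of the individual scores, which is an assumption about the convention being followed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Synthetic stand-in (assumption): 1000 samples with a 70/30 class balance.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           weights=[0.7, 0.3], random_state=0)

# 100 stratified shuffles, each holding out 20% of the rows for testing.
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=splitter, scoring="accuracy")

# Report the mean with an error bar of two standard deviations (about 95%).
print(f"Accuracy: {scores.mean():.3f} +/- {2 * scores.std():.3f}")
```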

19.10 Experimental Results


19.10.1 Establishing the zero-rule and other baseline properties
The German Credit Data has 700 class 0 samples (“good credit”) and 300
class 1 samples (“bad credit”). A zero-rule classifier that assigns all of the
samples to class 0 will therefore achieve a success rate of 70%.
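
The zero-rule baseline can be reproduced with a class vector alone. The sketch below uses scikit-learn's DummyClassifier with the most-frequent strategy on a 700/300 label vector; the placeholder feature column is an assumption needed only to satisfy the estimator interface.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class vector with the same balance as the German Credit Data: 700 good, 300 bad.
y = np.array([0] * 700 + [1] * 300)
X = np.zeros((len(y), 1))          # features are ignored by the zero-rule

zero_rule = DummyClassifier(strategy="most_frequent").fit(X, y)
accuracy = zero_rule.score(X, y)
print(f"Zero-rule accuracy: {accuracy:.2f}")
```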
We want our proposed feature selection and classification scheme to do
better than the zero-rule. We want the feature selection component to choose
subsets that are better than randomly selected subsets and (for that matter)
better than no selection at all. We begin this section by establishing a baseline
against which QUBO Feature Selection can be compared.
Feature selection is motivated by the intuitive concept that not all fea-
tures are equally important. For example, in Figure 19.2, we see the feature
variables from the binarized German Credit Data ranked by the Spearman cor-
relation coefficient between the feature and the classification variable. There
are four relatively important features at the left-hand end, followed by a grad-
ual, almost linear decline. It turns out that the four features on the left are
not enough to form a predictive subset on their own, so the problem for the
feature selector is to find where on the line “enough is enough.”
It also turns out that the smooth decline of the coefficients works well with
the quadratic objective function used in QUBO Feature Selection, as was seen
in comparisons of Pearson, Spearman, and Kendall correlations (readily
available in the pandas package). In contrast, the use of a mutual information
score in place of a correlation coefficient was not as successful. The integer
features were binned into ordered categories and generally behaved as expected.
However, the more arbitrary binarized categories (e.g., loan purpose) had
uniformly low mutual information scores, and the objective function tended to
cycle among features. A plot of the ranked mutual information scores is shown
in Figure 19.3.

FIGURE 19.2: Spearman correlation of features with class (absolute value of
the Spearman correlation versus feature number, sorted by influence). The most
influential features involve the status of the applicant’s checking account,
savings account, and loans at other institutions, all of which are included in
all practical feature subsets. As we move toward the right, however, the
influence of each new feature declines.

The results shown in this chapter were all calculated using the Spearman
correlation coefficient. As mentioned previously, this is an area where more
research is needed, for example, using other data sets with a broader mix of
feature variables.
Before we select any features, however, we first examine how well the
logistic regression classifier performs on the full feature set, that is, all 48
feature variables. We do this using the StratifiedShuffleSplit() class from scikit-
learn. Figure 19.4 shows that the mean accuracy depends on how many times
the data is shuffled, and on how the data is split between the training set and
the test set.
The combination of 1000 shuffles and 20% test share was chosen arbitrarily
as the standard for initial performance comparisons, being much more con-
venient than the larger numbers. It avoids the fluctuations found below 500
shuffles, and is close to the converged scores at 3000 samples and above. For
the definitive score comparison, however, the full 3000 shuffles were used. The
results for 10%, 15%, and 20% share were always very close and typically in
the median position. The 20% test share was therefore used throughout.

FIGURE 19.3: Mutual information between features and class (mutual
information score versus feature number, sorted by influence). The age, term,
and credit amount fields were binned into categories. However, the correlation
matrix based on this technique led to fluctuations in the accuracy scores, a
lower mean accuracy at the “best” feature subset, and a larger “best” feature
subset cardinality of 34 elements.

FIGURE 19.4: Mean accuracies measured for various numbers of shuffles
(10–3000) and for different fractions of the data assigned to the test set
(test shares of 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30), with no feature
selection (all 48 elements present). The mean accuracy reported for 1000
shuffles and a 20% share is 0.753 +/– 0.049.

FIGURE 19.5: Accuracy distribution (count versus accuracy score) for all 48
features in the subset, using 1000 shuffles with 20% test share. The mean
accuracy is 0.753 +/– 0.049, with the error bar running from 0.703 to 0.802.
In the work described in this chapter, this distribution was typical of the
German Credit Data, regardless of the feature subset, number of shuffles,
number of shares, classifier parameters, and the like.

It can be seen in Figure 19.4 that the mean accuracy increases as the test
share is reduced, that is, a bigger training set yields a more accurate predictor.
However, the difference is small in relation to the dispersion of scores from
different shuffles, as can be seen in Figure 19.5.
Note that the 30% share curve does not fluctuate as much as the curves
for smaller shares. It turns out that scikit-learn chose a threefold default for
RFECV, and this may be one reason why it can attain a good (although not
optimal) result in relatively little time. It is also important to keep track of
absolute numbers as well as percentages: a 5% test share of 1000 samples
consists of only 50 samples, which stratification on the German Credit Data
will constrain to 35 good credit samples and 15 bad samples. The 5% curve
therefore fluctuates widely at the outset and converges slowly.
In Figure 19.5, we see the distribution of accuracy scores for all 48 features
at a 20% test share (800/200 train/test split) counted over 1000 shuffles.
Stratification forces the training set and the test set to have the same 70/30
distribution of good and bad credit samples, so that a 200-sample test set
will contain 60 bad credit samples chosen from 300 in the set overall. In
a large number of shuffles, there will inevitably be some repetition (a point
highlighted by scikit-learn in its documentation). Thus although the data looks
“Gaussian” and fits into the Gaussian overlay, in practice there are certain
scores that occur more frequently, and extreme values are not observed above
a certain limit. Note that in comparison to the spread of accuracy scores from
0.7 to 0.8 seen in Figure 19.5, the (converged) spread of mean scores from
0.75 to 0.76 in Figure 19.4 is relatively small. So long as we avoid small test
shares and low numbers of shuffles, the error bars from the accuracy scores
will dominate the uncertainty overall.
We now take a moment to examine the behavior of logistic regression on
feature subsets with fewer than 48 features.
The number of possible subsets is given by the combinatorial function
C(48, K), where K is the cardinality of the subset. The largest number of
possible subsets occurs when 24 feature variables are selected, and is approx-
imately 32 trillion. There are some 280 trillion subsets possible overall.
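
These counts are easy to reproduce. The following sketch uses only the
Python standard library; the variable names are ours.

```python
import math

N_FEATURES = 48

# Number of feature subsets at each cardinality K = 0..48.
subsets_by_cardinality = [math.comb(N_FEATURES, k) for k in range(N_FEATURES + 1)]

# Cardinality with the most subsets, and the total over all cardinalities.
largest_k = max(range(N_FEATURES + 1), key=lambda k: subsets_by_cardinality[k])
total_subsets = sum(subsets_by_cardinality)

print(largest_k, subsets_by_cardinality[largest_k])  # 24 32247603683100
print(total_subsets)                                 # 281474976710656, i.e. 2**48
```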
It is not possible to test these trillions of subsets systematically. However,
we can gain an idea of how they behave from random sampling. For example,
Figure 19.6 shows the accuracy of logistic regression for up to 10,000 randomly
selected subsets at each possible cardinality (fewer where a cardinality has
fewer than 10,000 distinct subsets), which examines
432,354 feature subsets out of 281,474,976,710,656 possibilities. If we record
the best of the mean accuracies for each group of 10,000 subsets, and plot
them separately (the triangle markers in Figure 19.6), we can identify a “best
detected” subset at cardinality 35, with an accuracy of 0.76 ± 0.05. We then
examine how this mean was calculated from the accuracy scores for 1000 shuf-
fles with a 20% test share. In Figure 19.7, we see that the variance is large,
and comparable to what we saw with all 48 features present. Feature selection
looks to be a search for small improvements in collections of very noisy test
results.
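
The size of the random search can be sketched as follows. The function names
are hypothetical, and the sampling scheme (distinct subsets, capped at the
number available) is our assumption about how such a search might be set up.
As an arithmetic observation only, the quoted total of 432,354 is reproduced
when the empty subset is counted alongside cardinalities 1 to 48.

```python
import math
import random

N_FEATURES = 48
SAMPLES_PER_CARDINALITY = 10_000

def subsets_examined(n_features, cap, include_empty):
    """Count subsets seen when drawing up to `cap` distinct subsets per cardinality."""
    start = 0 if include_empty else 1
    return sum(min(math.comb(n_features, k), cap)
               for k in range(start, n_features + 1))

def sample_subsets(n_features, k, n_samples, rng):
    """Draw up to n_samples distinct random subsets of size k (as sorted tuples)."""
    target = min(n_samples, math.comb(n_features, k))
    seen = set()
    while len(seen) < target:
        seen.add(tuple(sorted(rng.sample(range(n_features), k))))
    return sorted(seen)

count_nonempty = subsets_examined(N_FEATURES, SAMPLES_PER_CARDINALITY, include_empty=False)
count_with_empty = subsets_examined(N_FEATURES, SAMPLES_PER_CARDINALITY, include_empty=True)
print(count_nonempty, count_with_empty)

rng = random.Random(0)
pairs = sample_subsets(N_FEATURES, 2, SAMPLES_PER_CARDINALITY, rng)   # only 1128 exist
triples = sample_subsets(N_FEATURES, 3, 100, rng)                     # 100 of 17296
print(len(pairs), len(triples))
```

Each sampled subset would then be scored with the same shuffles and test
share as the full feature set, which is where almost all of the run time goes.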

[Figure 19.6 plot: “Accuracy for random subsets at each cardinality.”
Horizontal axis: feature subset cardinality (1 to 48). Vertical axis: mean
accuracy (0.60 to 0.90). Legend: samples (10,000 plotted for each
cardinality), means, detected maxima.]

FIGURE 19.6: Sample means plotted as nearly transparent black circles. The
largest mean found for each cardinality group is shown by the triangle
markers. The “maximum” at cardinality 35 is somewhat arbitrary, but given
the values of the maxima at the endpoints, there is clearly a maximum at
least somewhere between 1 and 48.

[Figure 19.7 plot: “Accuracy distribution best random feature subset.”
Histogram. Horizontal axis: mean accuracies for the splits (using
StratifiedShuffleSplit), 0.60 to 0.90. Vertical axis: count (0 to 100).
Annotation: overall accuracy reported as 0.762 ± 0.049.]

FIGURE 19.7: Distribution of accuracy scores for the best feature subset
found through random search.

It must be stated at this point that accuracy scores reported by other
researchers on the German Credit Data are typically in the 70%–75% range,
with standard deviations around 5%. For example, see Chen [17], Rao [23], or
Huang [7]. Our 0.76 ± 0.05 baseline accuracy puts QUBO Feature Selection
squarely in the mainstream.
We can now summarize our baseline requirements as follows:

• Select a feature subset with 35 features or fewer.

• Deliver accuracy equal to or better than 0.76 ± 0.05.

• Calculate the feature subset in an efficient way that can scale to larger
initial feature sets.

19.10.2 QUBO feature selection with logistic regression


QUBO Feature Selection and a logistic regression classifier were “wrapped”
together. Practically speaking, this means that the feature selector and the
classifier were placed together at the interior of the loops used to optimize the
selection and classification parameters. For simplicity, the only optimization
parameter was α, which determines the relative weighting of independence
(greatest at α = 0) and influence (greatest at α = 1).
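
The role of α can be illustrated with a toy objective. The sketch below is
not the chapter's QUBO formulation or the 1QBit SDK interface: it assumes a
generic form in which α weights a reward for each feature's influence on the
class label and 1 − α weights a penalty for pairwise dependence between
selected features, and it minimizes by brute force, which is feasible only
for a handful of features. All numbers are made up.

```python
from itertools import product

def qubo_value(x, influence, dependence, alpha):
    """Assumed objective: reward influence, penalise pairwise dependence."""
    n = len(x)
    reward = sum(influence[i] * x[i] for i in range(n))
    penalty = sum(dependence[i][j] * x[i] * x[j]
                  for i in range(n) for j in range(i + 1, n))
    return -alpha * reward + (1.0 - alpha) * penalty

def select_features(influence, dependence, alpha):
    """Exhaustively minimise the objective over all 2**n binary vectors."""
    n = len(influence)
    best = min(product((0, 1), repeat=n),
               key=lambda x: qubo_value(x, influence, dependence, alpha))
    return [i for i, bit in enumerate(best) if bit]

# Made-up |correlation| of 4 features with the class label, and made-up
# |correlation| between feature pairs. Features 0 and 1 are nearly redundant.
influence = [0.50, 0.45, 0.30, 0.10]
dependence = [
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.2, 0.1],
    [0.2, 0.2, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]

cardinality_by_alpha = {}
for alpha in (0.0, 0.5, 0.7, 1.0):
    chosen = select_features(influence, dependence, alpha)
    cardinality_by_alpha[alpha] = len(chosen)
    print(alpha, chosen)
```

In this toy case the redundant feature 1 is admitted only once α is large
enough for influence to outweigh the dependence penalty, so the subset grows
with α.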
It was found that the cardinality of the feature subset tended to increase
with α, as had been noted by other researchers [7,11]. However, an advantage
of the wrapper model is that optimization can be done “at the α level” without
looking at the details of the subsets.

[Figure 19.8 plot: “QUBO feature selection: accuracy and feature set
cardinality vs. alpha.” Horizontal axis: alpha, from 0.0 (“feature
independence”) to 1.0 (“feature influence”). Left vertical axis: accuracy
means (0.60 to 0.90). Right vertical axis: number of elements in the feature
subset (0 to 50). Legend: accuracy, cardinality.]

FIGURE 19.8: Increase in accuracy as influential features are chosen over
independent ones.

Figure 19.8 shows the full range of α from 0 to 1. On the left-hand side,
where α is close to zero, the emphasis is on feature independence. This favors
small subsets, and since their regression coefficients are often not large enough
to “push” the classifier across the cutoff point of p ≥ 0.5, the predicted class
is 0. They classify almost all of the samples as “good credit” and achieve the
zero-rule’s 70% success rate.
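
The zero-rule effect follows directly from the decision rule. In the sketch
below the intercept and feature effect are made-up values chosen so that no
sample reaches the cutoff; the labels are synthetic with the same 70/30 mix.

```python
import math

def predict_class(logit, cutoff=0.5):
    """Logistic regression decision rule: class 1 when p >= cutoff."""
    probability = 1.0 / (1.0 + math.exp(-logit))
    return 1 if probability >= cutoff else 0

# 700 good-credit samples (class 0) and 300 bad-credit samples (class 1).
labels = [0] * 700 + [1] * 300

# Made-up logits: a negative intercept plus a feature term that points the
# right way but is too weak to lift any probability to the 0.5 cutoff.
intercept = -0.85
weak_logits = [intercept + 0.1 * (1 if label == 1 else -1) for label in labels]

predictions = [predict_class(z) for z in weak_logits]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(sum(predictions), accuracy)  # 0 0.7
```

Every sample is assigned class 0, so the accuracy equals the share of good
credit samples regardless of which weak features were selected.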
In Figure 19.9, we look more closely at the region between α = 0.9 and
α = 1. Here, the emphasis is on feature influence, and the subsets eventually
grow to include all 48 available feature variables.
It is interesting that accuracy increases with the size of the subset, reaches
a peak at α = 0.977 with 24 elements, and then declines gradually as more
features are added. The drop in accuracy to the left of α = 0.977 is quite
sharp, and although it is encouraging to see a global maximum so clearly
defined, this may be due to the data, and should be further investigated.

[Figure 19.9 plot: “Accuracy vs. Alpha for QUBO Feature Sel. (zoomed).”
Horizontal axis: alpha (zoomed in on 0.9 to 1.0). Left vertical axis: accuracy
means (0.60 to 0.90). Right vertical axis: number of elements in the feature
subset (0 to 50). Legend: accuracy, cardinality.]

FIGURE 19.9: A closer look at QUBO Feature Selection in the α = 0.9 to
α = 1 region.

19.10.3 Recursive feature elimination with logistic regression

RFE is a technique for pruning features from a feature list. The procedure
begins with a set of available features. It fits the logistic regression model
and eliminates the feature with the lowest weight. The fitting and elimination
are repeated until the desired number of features is reached. In RFECV, the
fitting is accompanied by testing, using a training set and test set chosen
according to a folding parameter.
Conceptually, RFE is like starting the QUBO objective function at α = 1
and working downward toward α = 0, except that RFE is recursive and treats
each iteration as a new feature set (see footnote 1). Unlike QUBO Feature Selection, RFE
does not test explicitly for feature independence, nor does it allow a feature
to “come back” after it has been eliminated.
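
The elimination loop itself is short. The experiments in this chapter used
scikit-learn's RFE and RFECV; the sketch below is only a pure-Python
illustration of the idea, with a toy gradient-descent solver standing in for
the logistic regression fit. The function names, solver settings, and data
are all assumptions made for the example.

```python
import math
import random

def fit_logistic(X, y, n_iter=200, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent (toy solver)."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, target in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-z)) - target
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        bias -= lr * grad_b / n_samples
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
    return weights

def rfe(X, y, n_features_to_select):
    """Refit repeatedly, each time dropping the feature with the smallest |weight|."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_features_to_select:
        X_sub = [[row[j] for j in remaining] for row in X]
        weights = fit_logistic(X_sub, y)
        weakest = min(range(len(remaining)), key=lambda i: abs(weights[i]))
        eliminated.append(remaining.pop(weakest))  # never reconsidered later
    return remaining, eliminated

# Made-up data: feature 0 is strongly informative, feature 1 weakly
# informative, feature 2 is constant (no information), feature 3 is noise.
rng = random.Random(0)
y = [rng.randint(0, 1) for _ in range(200)]
X = [[(2 * t - 1) + rng.gauss(0, 0.5),
      0.5 * (2 * t - 1) + rng.gauss(0, 1.0),
      0.0,
      rng.gauss(0, 1.0)] for t in y]

selected, eliminated = rfe(X, y, n_features_to_select=2)
print("kept:", selected, "eliminated in order:", eliminated)
```

The constant feature receives a zero weight and is removed first. Because
each pass refits on the surviving features only, an eliminated feature has no
route back into the subset, which is the behavior contrasted with QUBO
Feature Selection above.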
We began with the direct version of RFE, where we specified the desired
number of features and then measured the performance of the returned feature
subset.
The accuracies were measured with the same 1000 shuffles and 20% test
share that were used with the other methods. The results are shown in
Figure 19.10.
RFECV was very fast, although it converged to different feature subsets as
the cross-validation settings and random seeds were varied. In practice, how-
ever, it was easy to search these (and faster than running RFE for thousands
of shuffles). RFECV ultimately delivered a 31-element feature subset with an
accuracy of 0.76 ± 0.05, comparable with the other methods.

1 One could imagine the recursive elimination of features as α is iterated from 1 down
to 0. A recursive version of QUBO Feature Selection would make an interesting topic for
future work.

[Figure 19.10 plot: “Accuracy for RFE with cardinality targets.” Horizontal
axis: feature subset cardinality (1 to 48). Vertical axis: mean accuracy
(0.60 to 0.90). Annotation: max mean accuracy is at 28 features.]

FIGURE 19.10: RFE for cardinality targets from 1 to 48, using 1000 shuffles
with 20% test share. Error bars represent a confidence level of 95%. The
maximum mean accuracy is 0.77 ± 0.05.

19.10.4 Comparison of QUBO feature selection and recursive feature elimination

Figure 19.11 shows the mean accuracies for QUBO Feature Selection and
RFE (and, for comparison, a random search of 10,000 sample subsets). The
differences between the methods are smaller than the error bars.
In Figure 19.12, a closeup view of the region between 20 features and 49
features shows that the results never differ by more than the spread of mean
accuracy scores in Figure 19.4.
When the best subsets from each method are compared at 3000 shuffles, we
obtain the accuracies shown in Table 19.1, ranked by the number of features
in the subset.
These are equivalent accuracies, with QUBO Feature Selection giving the
smallest feature subset. However, random feature selection with 10,000 sam-
ples per cardinality had the lowest score of all, which shows that 10,000 sam-
ples (out of several trillion) cannot tell us much about extreme values. There
may be other subsets “out there” that were missed by both QUBO Feature
Selection and RFE. To assess this possibility, we examine some of the feature
subsets that are “near” the QUBO Feature Selection and RFE results, in the
sense that they differ by one or two features.
[Figure 19.11 plot: “Accuracy for QUBO, RFE, and random search max.”
Horizontal axis: feature subset cardinality (1 to 48). Vertical axis: mean
accuracies (StratifiedShuffleSplit), 0.60 to 0.90. Legend: QUBO, RFE, random
max.]

FIGURE 19.11: Comparison of QUBO Feature Selection, RFE, and random
search over all 48 cardinalities.

[Figure 19.12 plot: “Accuracy for QUBO, RFE, and random search max on
expanded scales and “inside” the error bars.” Horizontal axis: feature subset
cardinality (region with the best accuracy scores), ticks from 20 to 40.
Vertical axis: mean accuracies (StratifiedShuffleSplit), 0.740 to 0.780.
Legend: QUBO, RFE, random max.]

FIGURE 19.12: Comparison (using a different vertical scale) of QUBO Feature
Selection, RFE, and random search.

TABLE 19.1: Comparison of accuracy scores at 3000 shuffles

Method                   Accuracy        F1-Score     No. of features
QUBO Feature Selection   0.764 ± 0.049   0.54 ± 0.1   24
RFE28                    0.767 ± 0.050   0.55 ± 0.1   28
RFECV                    0.764 ± 0.050   0.54 ± 0.1   31
Rand10k                  0.762 ± 0.050   0.54 ± 0.1   35

The F1-Score was computed using the f1_score() function from the metrics
module of scikit-learn. It is defined as the harmonic mean of the precision
and the recall, where precision is the ratio of true positives to all
predicted positives, and recall is the ratio of true positives to all actual
positives [16]. The above values reflect the classification of the German Credit
Data without the application of a cost matrix and can be compared with
results in the literature that were calculated in the same way.
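
As a worked illustration of the definition, the sketch below computes
precision, recall, and F1 from a made-up confusion matrix for a 200-sample
test set. Treating bad credit (class 1) as the positive class is our
assumption, and the counts are invented for the example rather than taken
from the experiments.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and their harmonic mean (F1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)

# Made-up outcome on a 140 good / 60 bad test set:
# 120 good predicted good, 20 good predicted bad,
# 30 bad predicted bad, 30 bad predicted good.
y_true = [0] * 140 + [1] * 60
y_pred = [0] * 120 + [1] * 20 + [1] * 30 + [0] * 30

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, round(f1, 3), accuracy)  # 0.6 0.5 0.545 0.75
```

An accuracy near 0.75 can therefore coexist with an F1-Score near 0.55 when
only about half of the minority class is detected.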

19.10.5 Comparison with potentially missed subsets


In Figure 19.13, we see the mean accuracy results from feature subsets
that have the same cardinality as the best subsets, but which differ in one
feature, two features, and so on. Gaussian curves showing the distribution of
the accuracy scores for the best subsets have been overlaid on the histograms.

[Figure 19.13 panels: QUBO FS, 1-feat. change; QUBO FS, 2-feat. change;
RFECV, 1-feat. change; RFECV, 2-feat. change.]

FIGURE 19.13: Comparison of mean accuracies from the “best” QUBO
Feature Selection (dark gray) and “best” RFE (light gray) with mean accu-
racies from feature sets of the same cardinality but differing by one or two
features. Gaussian distributions with the corresponding “best” standard devi-
ations have been overlaid. The horizontal scale on all four graphs is the same,
in both this figure and the next.

[Figure 19.14 panels: QUBO FS, 1 more feature; QUBO FS, 1 less feature;
RFECV, 1 more feature; RFECV, 1 less feature.]

FIGURE 19.14: Comparison of mean accuracies from the “best” QUBO
Feature Selection (dark gray) and “best” RFE (light gray) with mean accu-
racies from feature sets with one more feature and one less feature. Gaussian
distributions with the corresponding “best” standard deviations have been
overlaid.

The dashed vertical lines represent the means of the best subsets. The
vertical bars represent the counted occurrences of mean accuracies for the
perturbed feature subsets. QUBO Feature Selection does a better job than
RFE in finding a subset that is better than its neighbors. This reflects the
behavior of the quadratic objective function, which considers all of the feature
sets simultaneously (especially in the quantum annealer implementation). In
contrast, once a feature has been eliminated by RFE, there is no possibility
of bringing it back on a later iteration.
The same behavior is observed when adding or subtracting a feature from
the best subsets. In Figure 19.14, we see that the accuracy of the best QUBO
subset is again better than the perturbed subsets.
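
The perturbed subsets can be enumerated as in the following sketch. The
helper names are hypothetical, the placeholder subset stands in for the real
24-feature result, and we do not know whether the figures were produced from
every neighbor or from a sample of them.

```python
from itertools import combinations

def swap_neighbours(subset, n_features, n_swaps=1):
    """Same-cardinality subsets that differ from `subset` in exactly n_swaps features."""
    base = frozenset(subset)
    outside = [f for f in range(n_features) if f not in base]
    return [(base - set(removed)) | set(added)
            for removed in combinations(sorted(base), n_swaps)
            for added in combinations(outside, n_swaps)]

def grow_neighbours(subset, n_features):
    """Subsets obtained by adding one feature."""
    base = frozenset(subset)
    return [base | {f} for f in range(n_features) if f not in base]

def shrink_neighbours(subset):
    """Subsets obtained by removing one feature."""
    base = frozenset(subset)
    return [base - {f} for f in sorted(base)]

N_FEATURES = 48
best_subset = set(range(24))  # placeholder for a 24-feature "best" subset

one_swap = swap_neighbours(best_subset, N_FEATURES, n_swaps=1)
two_swap = swap_neighbours(best_subset, N_FEATURES, n_swaps=2)
one_more = grow_neighbours(best_subset, N_FEATURES)
one_less = shrink_neighbours(best_subset)
print(len(one_swap), len(two_swap), len(one_more), len(one_less))  # 576 76176 24 24
```

Each neighbor would then be scored with the same shuffles and test share as
the best subset, giving the histograms of mean accuracies described above.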

19.11 Comparison with Previously Reported Results


The German Credit Data was originally published in 1994 by Hans
Hofmann at the Institute for Statistics and Econometrics at the University
of Hamburg. Since that time, it has been studied extensively. At the time of
writing, the most thorough survey of how quadratic optimization compares
with other methods is given by Waad [12]. Waad also used a publicly available
machine learning package, Weka 3.7.0 [21], and created a “three-stage feature
selection fusion” technique with QUBO Feature Selection as the first stage,
which yielded very good accuracies. Waad’s results are not directly compara-
ble to the results in this chapter, since they were given in terms of precision
and recall, which are affected by whether the 5:1 cost ratio prescribed by
Hofmann has been applied in the training set. In this chapter, we did not use
the cost ratio and Waad does not mention it. Also, the Weka package does
not have a function like scikit-learn’s StratifiedShuffleSplit(), and in Waad’s
reported experimental procedure, the division of the samples into a training
set and a test set was done at an early stage with 10 folds but no shuffling.
Taken as a whole, however, Waad provides strong motivation for studying
how QUBO Feature Selection might be used as part of a larger procedure. For
example, Figure 19.12 shows that, although QUBO Feature Selection found a
very good subset with 24 elements, there were other subsets nearby that were
slightly better. Additional searching could uncover more.
For feature selection overall, Chen and Li [17] were able to achieve good
results with 12 features from the original German Credit Data with its 20 cat-
egorical and integer (or fixed decimal) features. These were manually selected
after comparing various correlation measures between the features and the
classification. Chen and Li did not use the programmatic binarization of cat-
egorical variables that came into greater popularity between 1998 and the
present day. Their results are primarily for SVMs and do not deliver notable
accuracy, especially since Chen and Li also report performing only a single
10-fold cross-validation. The real lesson from this early work is that correlation
makes a difference, and that automatic binarization might not always be a
good idea.

19.12 Conclusion
Our objective in this chapter has been to show that quantum computing is
now within the reach of everyone. We took an old problem and an old method
that people used to think was slow, and implemented it using an SDK that
can route the problem to either a quantum solver or an advanced classical
solver. If the reader has forgotten the term “quantum annealer” by this point,
that is perhaps a sign of success.
On the binarized German Credit Data, QUBO Feature Selection delivered
a smaller feature set (24 features) than either RFE (28 features) or RFECV
(31 features). All three methods showed comparable accuracy. A priority for
future work is to study the behavior of the QUBO method on different data
sets, with different correlation methods, and with a multistage approach that
could improve its performance.
Also of interest is the unusually small number of candidate feature sets
returned to the selector, and the possibility of applying the technique to
much larger initial feature sets. The method performed best when the fea-
ture correlation coefficients fell off smoothly. One could imagine broadening
the “wrapper” concept to include the selection of a correlation algorithm, and
using the speed of the QUBO Feature Selector to explore this space more
systematically.
The author hopes that the availability of QUBO Feature Selection
via 1QBit’s quantum-ready SDK will make this method more accessible to
researchers interested in studying its possibilities, and to practitioners want-
ing to add a new (and quantum-ready) tool to their metaphorical toolboxes.

Acknowledgments
The author would like to thank Majid Dadashi and the 1QBit Software
Development Team for creating the SDK, and for their assistance with its
use. Anna Levit provided useful comments on the draft. Jaspreet Oberoi con-
tributed the idea that led to recursive QUBO Feature Selection, a concept
whose possibilities we have yet to explore.
The author wishes to express his gratitude to Gili Rosenberg and his many
collaborators on the 1QBit research paper, Solving the Optimal Trading Tra-
jectory Problem Using a Quantum Annealer, and to 1QBit for permission to
quote at length from their description of the D-Wave quantum annealer.

References
1. Organization for Economic Cooperation and Development (OECD). Household
debt (indicator), December 2016.

2. World Bank. International Debt Statistics 2017, 2017.

3. U.S. Federal Deposit Insurance Corporation. Statistics at a Glance, 2017. From
[Link].

4. World Bank. Bank Non-Performing Loans to Gross Loans for United States,
2016. Data series DDSI02USA156NWDB retrieved from FRED, Federal Reserve
Bank of St. Louis.

5. U.S. Federal Deposit Insurance Corporation. Bank Failures in Brief, 2017. From
[Link].

6. Center for Microeconomic Data U.S. Federal Reserve Bank of New York. SCE
Credit Access Survey, 2016. From [Link].

7. Huang, J. Feature Selection in Credit Scoring—A Quadratic Programming
Approach Solving with Bisection Method Based on Tabu Search. PhD thesis,
Texas A&M International University, 2014.

8. Harris, M. The Short History and Long Future of the Online Lending Industry.
Forbes Valley Voices, 2017.

9. FICO (formerly the Fair Isaac Company). FICO Website www.fi[Link], 2017.

10. Goel, V. Russian Cyberforgers steal millions a day with fake sites. New York
Times Online, 2016.

11. Demirer, R. and B. Eksioglu. Subset Selection in Multiple Linear Regres-
sion: A New Mathematical Programming Approach (working paper). Technical
report, University of Kansas, School of Business, 1998. A later version of the
paper appeared in Computers and Industrial Engineering, vol. 49, August 2005.

12. N’Cir Waad, B. B. On Feature Selection Methods for Credit Scoring. PhD thesis,
Université de Tunis, Institut Supérieur de Gestion, École Doctorale Sciences de
Gestion LARODEC, 2016.

13. Lichman, M. Machine Learning Repository, School of Information and Computer
Science, University of California, January 2017.

14. Pedregosa, F., Varoquaux, G., and Gramfort, A. Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. Note:
This is the citation requested at the [Link] website, accessed January
12, 2017.

15. Cox, D. The regression analysis of binary sequences (with discussion). Journal
of Royal Statistical Society B., 20:215–242, 1958.

16. Powers, D. M. W. Evaluation: From precision recall and F-Measure to ROC,
informedness, markedness and correlation. Journal of Machine Learning Tech-
nologies, 2(1):37–63, 2011.

17. Chen, Fei-Long and Li, Feng-Chia. Combination of feature selection approaches
with SVM in credit scoring. Expert Systems with Applications, 37:4902–4909,
2010.

18. Anscombe, F. J. Graphs in statistical analysis. The American Statistician,
27(1):17–21, 1973. See also the nicely illustrated Anscombe’s Quartet article
on Wikipedia.

19. Rosenberg, A. and Hirschberg, J. V-Measure: A conditional entropy-based exter-
nal cluster evaluation measure. Technical report, Department of Computer Sci-
ence, Columbia University, January 2007.

20. 1QBit Inc. [Link], 2017.

21. Bouckaert, R. R., Frank, E., Hall, M., Kirkby, R., Reutemann, P., Seewald, A.,
and Scuse, D. Weka Manual (3.7.1), 2016. This is the citation given by Waad
(see [12]) and describes the Weka features available at the time.

22. D-Wave Systems Inc., [Link], 2017.

23. Rao, M. How to Evaluate Bank Credit Risk Prediction Accuracy based on SVM
and Decision Tree Models, Capgemini “Capping IT Off” Blog, November 2,
2016. [Link], accessed January 18, 2017.
Index

Note: Page numbers followed by “n” indicate notes.

A Algorithmic differentiation (AD), 7, 17,


AAD, see Adjoint algorithmic 18, 315, 317, 340
differentiation adjoint mode, 318
Acceleration factor, 517 adjoint of evolution, 323
Acceptance probability, 225 American option pricing, 329–332
Active inputs, 317 case studies, 328
Active outputs, 317 checkpointing, 323–324
Actual pricing, 500 coverage, 21–22
AD, see Algorithmic differentiation; European option, 328–329
Automatic differentiation implementation, 19–20, 320
Adaptive market hypothesis (AMH), 28 implicit functions, 325–326
Ad hoc rules, 50 issues, 327
Adiabatic transformation system, 572 motivation, 316
Adjacency matrix, 572 NCM, 332–334
Adjoint algorithmic differentiation performance, 20–21
(AAD), 315, 340, 341, 352–354; preaccumulation, 327
see also Algorithmic review in finance, 319–320
differentiation (AD) second derivatives, 318–319
adjoint design paradigm, 342–343 smoothing, 326
by overloading and tape tangent mode, 318
interpretation, 322 Algorithmic investment strategies, 52
Adjoint(s), 321, 341 ALM, see Asset liability management
calculation of risk, 360 ALM systems, see Asset and liability
of calibration step, 362–363 modeling systems
of ensemble, 324 Alpha engine, 54, 56, 67–68
of evolution, 323 asset management, 50
mode, 342 daily profit & loss, 55
mode AD, 318 foreign exchange market, 50–51
Admissable parametrization, 280 guided by event-based framework,
Admissible response measure, 105, 106 56–67
ADMM, see Alternative direction hallmarks of profitable trading,
method of multiplier 52–53
Affine function, 280, 285, 293 monthly performance of unleveraged
Affine term structure models (ATSMs), trading model, 73
134, 278 rewards and challenges of automated
Agent-based models, 52–53 trading, 51–52
Age-Period-Cohort type, 147 trading model anatomy and
Algebraic adjoint approach, 345 performance, 53–56

589
590 Index

“Alpha-max” parameter, 39 Arithmetic Brownian motion, 138


Altera Stratix-V FPGAs, 445 Array padding, 483–484
Alternative direction method of Arrival rate of MOs, 83, 88, 89, 91,
multiplier (ADMM), 533 96–97
Amazon EC2 F1 instance, 448 Asian payoff, 234
Amazon Elastic Compute Cloud Asset-dependent correlation, 193
(AWS), 524 Asset and liability modeling systems
Amazon Web Services (AWS), 143, 468 (ALM systems), 116, 121, 123
Ambiguity aversion, 78, 82 Asset filter relative strength index,
ambiguity effects on optimal 37–38
strategy, 88–91 Asset liability management (ALM), 274
ambiguity weights, 86 Asset management, 49–50
closed-form solutions, 91–93 Asset managers, 50
inclusion of market orders, 93–97 ATSMs, see Affine term structure
natural alternative routes, 85 models
optimization problem, 83 Augmented conditional state vector, 296
penalty function, 84 Automated trading
verification theorem, 87 performance, 53
Ambiguity effects on optimal rewards and challenges of, 51–52
strategy, 88 Automatic differentiation (AD), 19
arrival rate, 88 Automation Module, 126
fill probability, 88–89 Average gain, 37
market order arrival rate, 89 Average loss, 37
midprice drift, 90–91 AWS, see Amazon Elastic Compute
Cloud; Amazon Web Services
optimal depth and change in depth,
89, 90
Amdahl’s law, 424, 477–478 B
American option pricing, 329–332 Bachelier’s model, 116
AMH, see Adaptive market hypothesis Backtesting, 40, 528
AML services, see Antimoney investment strategy, 515
laundering services Backwards inductive algorithm, 490
Analytical Engine, 416 Backward sweep, 19, 342
Anchored-ANOVA decomposition, Banking regulations, 551
176–177, 182–183 Banking system, money creation by, 554
Annuity factor, 432 Banking X-road, 547–548
Ansatz functions, 433 Bank lending, bitcoin vs., 554–555
Antimoney laundering services (AML Barrie and Hibbert ESG, see Moody’s
services), 538 Analytics’ RiskIntegrity Suite
Antithetic Monte Carlo, simulations for, ESG
235–236 Barrier options, 211–214
Antithetic multilevel Monte Carlo Basel Capital Accord (1988), 11
estimator, 227–228 Basel Committee, 12
Aon Benfield’s PathWise(TM) , 124 Basel II framework, 11, 12
Aon Benfield’s ReMetrica, 117 Base strategy, 37
Apache Spark, 143, 521 Basic adjoint mode, 321, 322
Application Programming Interface Basis variables, 490, 491, 493, 498,
(API), 456–457 500, 503
Approximation to Black model, 293 Bayesian approach, 128
Arbitrage free-pricing, 128 BCs, see Blockchains
Index 591

Bermudan pricing problem, 486 Brownian bridge, 17, 211, 496


Bermudan swaption, 464–465, 488 construction, 240
Bespoke in-house HPC grids, 142 interpolation, 207, 212, 223, 242–244
BGM model, see Libor Market Model Brownian diffusions, 241
BiCGSTAB solver, 429 Brownian increments, 216
Binarization and scaling procedures Brownian motion, 62, 63, 202
transform, 569 Brownian path and approximations over
Binarizing, scaling, and correlating one coarse timestep, 230
German credit data, 569 Brownian shocks, 128, 139, 144
Binary classification, 569 B-splines, 250
Bitcoin, 540 Building blocks, 475–476
bank lending vs., 554–555 Bumping, 7, 239
Bitcoin ecosystem, 543–546 Business units, 119
bitcoin statistics, 545
BTC block, 544 C
Bivariate T -distribution, 146
C (programming language), 418
Black and Scholes seminal work, 5, 6
C code of nested loop with
Black correction, 285–286
dependency, 453
EFM model, 275
C code simple computation inside
for negative rates, 287–288
loop, 451
shadow rate model, 294–295
C++ (programming language), 16,
Black EFM model
415, 418
prediction, 306
AD for, 321
UKF for, 295–297 approach, 428
Black model calibration progress, combination technique, 428–429
292–295 decomposition, 428
Black–Scholes formula/model, 122, 268 implementation, 429
Blockchains (BCs), 538–540 parallelization, 429
genealogical trees, 540–543 pricing basket options using, 427
historical examples, 540–543 problem, 427–428
implementation, 395 results, 429–431
improving ICH’s clearance procedure Cache lines, 476
with, 393–395 Caching of instructions and data, 442
land titles, 543 Càdlàg process, 222
Blocks, 493–495 Calibration
Block splitting technique, 533 adjoint of, 362–363
Bloomberg, 299, 530, 535 issues, 140–142
Bloomberg, NYSE and, 535 of nonlinear Black models, 290
Blue Gene/Q supercomputer step, 358–357
JUQUEEN, 414 of yield curve model, 131
Bond price, 132 Callback feature, 530
calculation method, 290 Call option, 236, 267
Boolean (yes/no) value, 565 Candidate function, 106, 108
Box–Muller method, 127 Candidate measure, 83
Brakteaten system, 555 Capetian dynasty, 541
Branch-and-bound tree, 39, 40 Capital buffer, 118
“Branch and cut” technique, 26, 41 Capital requirements, 4
Brian’s scheme, 179, 189 Capital value adjustment (KVA),
Broad network access, 510 13–14, 466
592 Index

Cardinal B-spline functions, 257 Closed-form solutions, 91–93


Cardinality constraints, 28 Cloud, 510
Cascading events, 66 bursting, 516
Cash equivalence of collateral DFEs in, 447–449
holdings, 9 providers, 448
Catastrophic tail events, 535 vendors, 535
Catch-all category, 148 Cloud computing, 510–513, 524, 529
Cauchy’s integral formula, 258 algorithm design, 527–528
CBDCs, see Central bank issued digital Cloud Alpha, 533–535
currencies computational needs, 523–524
CCP, see Central clearing counterparty computing environment and
CCR, see Counterparty credit risk architecture, 529–530
CCRM, see Counterparty credit risk computing instance, 515–516
management distributed portfolio optimization,
CDO, see Collateralized debt 531–533
obligations economics of, 533–535
Central bank issued digital currencies experiment design and test result,
(CBDCs), 538 530–531
approaches, 556 financial applications of, 517–520
implementing Chicago Plan, 557 financial engineering and, 512
and negative interest rates, 555–557 implementation and practices, 521
Central clearing counterparty (CCP), nature of challenges, 520–521
548 portfolio backtesting, 528–529
Central finite difference (CFD), 316 potential computing bottleneck, 529
Centralized control strategy, 42 solution selection, 524–527
Central processing unit (CPU), taxonomy of parallel computing,
15–16, 516 513–515
architecture, 476–477 TCO, 517
efficiency, 517, 530 Techila middleware with MATLAB,
programming, 493–494 521–522
thread, 480 Clumping of variables, 162
time, 517 Clustering, 9, 379
CFD, see Central finite difference of variables, 162
Chain rule, 18, 19, 22 CMI, see Continuous Mortality
Checkpointing, 322–324, 330–331 Investigation
Chicago Plan, 557 CNK, see Compute Node Kernel
Chinese Tianhe-2, 417 Coalescing constraints, 495
Cholesky factorization, 155, 164 Coastline, 58
of correlation matrix, 139 Coastline trading, 60–62
matrix factorization, 117 agents, 54
CIR model, see Cox–Ingersoll–Ross Co-dependency structures and
model simulation, 144–147
Clark–Cameron example, 228–231 Coding feature selector, 570
Classical Markowitz’s mean-variance inspiring simplicity, 570
framework, 532 minimize() function, 571–573
Classical master-slaves strategy, 42 QUBO feature selection in 1QBit
Classification SDK, 570
as business problem, 562–564 Coherent expected shortfall
problem formulation, 565–566 measure, 119
Index 593

Collateral holdings, cash equivalence Convex optimization, 326


of, 9 Cooperative quantitative strategy
Collateralized debt obligations development platforms, 520
(CDO), 357 Coordinate-wise maximum, 295
Collateral modeling, 8, 10 Coprocessors, 414
Collateral posting, delay in, 9 Copula central limit theorem, 146–147
Collateral posting, granularity, 9 Copula-marginal distribution
Combination technique, 428–429 factorization, 144
Complex systems, 59, 60 Correlations, 176, 459–461, 567
Compliance requirement, 520 matrix, 188
Compound Poisson process, 222 structures, 188
Computational architecture, 31 COS method, 251, 267, 268; see also
Computational bottleneck, 523–524 Multilevel Monte Carlo
Computational time, 28, 40, 176, 267, methods (MLMC methods)
295, 357, 367, 376 computational complexity, 270
Compute Node Kernel (CNK), 414 density coefficients, 253
Compute nodes, 414 domain truncation, 254
Computer cluster, 414 plain vanilla payoff coefficients, 254
Computer technology, 441 pricing multiple strikes, 254–255
Computing framework, 521 Cost analysis, 534–535
Concurrent versioning system Counterparty credit risk (CCR), 466
(CVS), 539 Counterparty credit risk management
Conditional expectation, 211, 215, (CCRM), 340, 350–354
216, 218 Counterparty default, 7, 8
difficulties in using, 219 Covariance matrix factor, 285
for payoff, 214 Cox–Ingersoll–Ross model (CIR model),
use of multiple samples, 220 135
in vibrato Monte Carlo, 220–221 SDE, 136
Conditional Monte Carlo approach, 207, CPLEX 12. 6, 40
211, 212, 214 CPU, see Central processing unit
for pathwise sensitivity, 218–219 Crank–Nicolson time stepping scheme,
Conditional tail expectation (CTE), see 179, 328, 429
Tail-Value at Risk (TVaR) CRAN website, 124
Conditional Value at Risk (CVaR), see Cray-1, 416
Tail-Value at Risk (TVaR) Credit
Configuration time, 525 creation theory, 552
Conning’s GEMS ESG, 124 models, 138
Constant coefficients, 185–186 observation, 566
PDEs, 183 Credit default
Constant jump rate, multilevel Monte index swaptions, 364–368
Carlo for, 223–224 risk, 8
Constant Relative Risk Aversion utility swaps, 363–364
function, 432 Credit derivatives, pricing of, 357
Continuation region, 94 calibration step, 358–359
Continuous Mortality Investigation Credit risk
(CMI), 147 annuity, 359
Continuous-time CIR process, 162 capital, 12–13
Control-flow based processor, 444 challenges in calculation, 359–360
Control variables, 400–401 Credit risk detector, 566
Credit scoring, 519–520, 562–564 De-cascading events, 66
formulation of, 565–566 Debit value adjustment (DVA), 7–9
Credit Support Annexes (CSAs), 9 Debugging, 420
Credit value adjustment (CVA), 4, 7–9, Decision trees, 565
350, 466, 518 Decomposition methods, 182
capital, 466–467 anchored-ANOVA
Cryptocurrency solution, 394 decomposition, 182
CSAs, see Credit Support Annexes constant coefficient PDEs, 183
CUDA program, 143, 418, 493 partial freezing, 184
CUDA 5.5 toolkit, 491 variable coefficients, 184
CUDA code, 491 zero-correlation approximation, 184
Cumulative normal distribution, 215 de facto market standard model, 364
Currency debasement, 555 Delivery vs. payment (DvP), 548
Curse of dimensionality, 6, 176, 427 conundrum, 549
Cut pool, 42 Deloitte’s XSG, 124
Cutting-plane approach, 28 DEMO, 380, 382, 384–385
CVA, see Credit value adjustment DE/ND/Rand/1 strategy, 381
CVS, see Concurrent versioning system DE/ND/Rand/RF/1 strategy, 381
Cybersecurity, 520 Dennard scaling, 474
Density coefficients, 252, 253, 258–260,
D 263–264
Density function, 154–157
Das Kapital, 546
Deployment time, 526
Data, 30, 299, 565, 569
DE/Rand/1 strategy, 380
center, 516
Derivative securities, 122
collection phase, 499–500
market, 30 Derivative valuation, 517–518
news meta, 30 DFE memory (LMem), 457
parallel, 480 DFEs, see Dataflow Engines
simulation description and model DFEVars handle run-time data, 453
assumptions, 395–397 Diebold–Rudebusch arbitrage-free
sources, 530 version, 279–280
structures, 482–484 Differentiability, 317n3
Dataflow Engines (DFEs), 443, 444–445 Differential evolution (DE), 379–382
in cloud, 447–449 Digipede Grids, 142
DFE-accelerated components of Digital currencies (DCs), 538
Maxeler RiskAnalytics Digital options, 208, 214–215, 218
library, 467 DiPBIL, 378
Dataflow programming principles, Discontinuous payoff function, 208
449–457 Discount factors, 131, 277
Data-parallel backtesting algorithm, 515 Discrete time
Data-parallel problem, 513 dynamic programming, 433
Datasets, 528 LMM in, 488–489
“Days to recovery” dynamic risk optimal control model, 397–399
measures, 29, 38 Discretization error, 125
DCs, see Digital currencies Discretized price curve, 62
DE, see Differential evolution Disk, 476
DE/Arch/Rand/1 strategy, 381 Displaced diffusion, 488
DE/Arch/Rand/RF/1 strategy, 381 Distributed Computing Server, 420n7
DE/Best/1 strategy, 380 Distributed databases, 539
Distributed file systems, 535 ECDSA, see Elliptic Curve Digital
Distributed ledgers (DLs), 394, 538–540 Signature Algorithm
genealogical trees, 540–543 Economic Factor Model (EFM), 275
historical examples, 540–543 calibration, 285–287
land titles, 543 three-factor basic, 284–285
Distributed ledger technology (DLT), Economics, 551
394, 538 Economic scenario generators (ESG),
banking X-road, 547–548 116, 117–124, 126; see also
global payments, trade finance, Risk scenario generator (RSG)
rehypothecation, 549–550 calibration issues, 140–142
potential usages in banking, 547 credit models, 138
stock and share trading, 549, 550 equity models, 138–139
trade execution, clearing, settlement, Girsanov theorem, 128, 130
548–549 high-performance computing in,
Distributed matrix multiplication, 142–143
527–528 nominal interest-rate models,
Distributed optimization technique, 533 133–138
Distributed portfolio optimization, 531 rESG, 125
algorithm design for large-scale
SDEs, 127
mean-variance optimization
yield curve, 131–133
problem, 532–533
Economics of cloud computing, 533
challenges in large-scale portfolio
cost analysis, 534–535
construction, 531–532
Distributional representation, 158–160 evaluation, 533–534
Distribution function, 152–153 risks, 535
Diversification, 12, 535 ECP, see Expected capital profile
DLs, see Distributed ledgers Edges, 321
DLT, see Distributed ledger technology Efficient market hypothesis (EMH),
Domain truncation, 254 27, 28
Dual simulation method, 238 e-Estonia X-Road, 547
DVA, see Debit value adjustment EFM, see Economic Factor Model
DvP, see Delivery vs. payment EIOPA, see European Insurance and
D-Wave machine, 571, 573 Occupational Pensions
Dynamic programming approach, Authority
433–434 EKF, see Extended Kalman filter
Dynamic Scenario Generator, 124 Elastic computing, 516
Dynamic stochastic general equilibrium Electricity, 510
models, 553 Elemental function, 322
Dynamic strategy using money Elliptic Curve Digital Signature
management, 38–39 Algorithm (ECDSA), 544
EM algorithm, see
Expectation–maximization
E algorithm
EAD, see Exposure-at-default Embarrassingly parallel computing
Earthquakes, 535 problem, 513, 514
EAs, see Evolutionary algorithms Embarrassingly parallel problems, 422,
E-bank X-Road, 547 423, 480, 485
ECDF, see Empirical cumulative Emergence of scaling laws, 58–59
distribution function EMH, see Efficient market hypothesis
Empirical cumulative distribution real-world example of coastline
function (ECDF), 150 trading, 61
Empirical evaluation of model in- and simple rules, 61
out-of-sample, 299 trading models and complexity,
data, 299 59–60
in-sample goodness-of-fit, 300–303 transition network of states, 63
out-of-sample Monte Carlo Evolutionary algorithms (EAs), 376
projection, 304–308 DE, 379–382
yield curve bootstrapping, 300 PBIL, 377–379
EM scheme, see Euler–Maruyama Execution region, 94
scheme Exotic interest rate pricing, 464–466
“Enhanced indexation” model, 26, 27 Expectation–maximization algorithm
applying SSD criterion, 26–28 (EM algorithm), 275
Expected capital profile (ECP), 467
Enterprise-level Proxy Generator, 126
Expected Historical VaR, 467–468
Equity models, 138–139
Expected shortfall (ES), 12
Equity prices, 468
Exponential Lévy process, 270
ES, see Expected shortfall
Exposure-at-default (EAD), 12
ESG, see Economic scenario generators Exposures, 14
ESGtoolkit, 124 Extended Kalman filter (EKF), 290
Estonian X-Road, 547 Extended Vasicek, see Hull–White
Ethereum code, 543 models
Ethereum hard forking, 541
Euler and Milstein discretizations,
F
203–205
Euler–Maruyama scheme (EM scheme), f1_score() method, 582
207–209 Fast Fourier transform (FFT), 250
approximation, 199 Fast Memory (FMem), 445, 549
discretization, 127, 160 Father wavelet, 256
European call option, 218, 241 Fathoming of tree branches, 41
European Insurance and Occupational Feature selection, 566–568
Pensions Authority Feature variables, 566
(EIOPA), 117 Feedback controls, 81–82, 93–95
European Mont-Blanc project, 417 Feynman–Kac theorem, 5
FFT, see Fast Fourier transform
European option, 235, 250, 260, 328–329
FIAS, see Frankfurt Institute for
numerical simulations for, 235–236
Advanced Studies
pricing problem, 251
Field programmable gate arrays
SWIFT pricing formula, 265
(FPGAs), 16–17, 143, 421,
Evaluation metrics, 573–574 442–443, 523
Event-based framework, guided by, 56 Fill probability, 88–89
cascading with asymmetric Filtering, 274, 291
thresholds, 67 Filter initialization, 286–287
coastline trading, 60–62 Final mathematical formulation of
emergence of scaling laws, 58–59 model, 402–403
final pieces of puzzle, 64–67 Finance, AD review in, 319–320
intrinsic time, 56–58 Financial applications, 242, 461
Monte Carlo simulation, 65 credit value adjustment capital,
novel insights from information 466–467
theory, 62–64 exotic interest rate pricing, 464–466
interest rate swap pricing, 462–463 time-dependent volatilities, simple
Maxeler Risk Analytics platform, correlation, 190–191, 192
461–462 Finite-dimensional random variable, 198
standard initial margin model, First-order conditions, 98, 101–102
467–469 First-order cumulant
value-at-risk, 463–464 approximation, 294
program, 562 First Mont Blanc prototype, 417
Financial applications of cloud First-to-default CVA (FTDCVA), 7, 8
computing, 517 Floating point math, 474
credit scoring, 519–520 Floating Point Operations Per Second
derivative valuation and pricing, (FLOPS), 414
517–518 Floating point units (FPUs), 15–16
quantitative trading, 519 FLOPS, see Floating Point Operations
risk management and reporting, Per Second
518–519 Flow credit products, real-time risk
Financial applications, supercomputers management of, 357
for, 422 adjoint calculation of risk, 360
access to and costs of adjoint of calibration step, 362–363
supercomputers, 425–426 challenges in calculation of credit
performance measurement, 423–425 risk, 359–360
credit default index swaptions,
suitable financial applications,
364–368
422–423
credit default swaps, 363–364
Financial computing, 520
implicit function theorem, 360–362
Financial crisis, 274, 518
pricing of credit derivatives, 357–359
Financial derivatives, see Financial
results, 363
investments
Foreign exchange market, 50–51
Financial engineer, 472
Fortran, 418
Financial investments, 3
Forward accumulation, 20
Financial markets, 52, 60, 534 Forward mode AD, see Tangent mode
Financial options, see Financial AD
investments Forward rate models, 133
Financial payoff function, 216 Forward sweep, 342
Financial system, 51 Fourier and wavelet option pricing
Financial Times Stock Exchange 100 methods; see also Multilevel
Index (FTSE100), 42–43 Monte Carlo methods (MLMC
Finite difference approximation, 179 methods)
Finite-difference method, 179–181, 293 computational time, 267
asset-dependent correlation, 193 COS method, 251–255, 267
decomposition methods, 182–184 European option pricing
different base cases with nonconstant problem, 251
parameters, 188 multiple strike pricing, 270–271
time-dependent exponential numerical results, 266
correlation, 190, 191 rate of convergence, 268–269
time-dependent simple correlation, robustness of WA(a,b), 267–268
189–190 SWIFT method, 261–266, 267
time-dependent volatilities, WA(a,b) method, 256–260, 267
exponential correlation, wavelet series, 255–256
191–192 Fourier cosine expansions, 250
Fourier methods, 4, 6 Global polynomials, 433
Fourier transform, 250 Google Compute Engine (GCE), 524
FPGAs, see Field programmable gate Google’s AlphaGo algorithm, 50
arrays Google’s App Engine, 143
FPUs, see Floating point units Google search volume, 518
Fractional reserve theory of Google Trends, 512
banking, 552 GPGPUs, see General-purpose GPUs
Frankfurt Institute for Advanced GPUs, see Graphic processing units
Studies (FIAS), 417 Granularity of collateral posting, 9
FRTB, see Fundamental Review of Graphic processing units (GPUs), 16,
Trading Book 482, 484, 523
FTDCVA, see First-to-default CVA architecture, 477
FTSE100, see Financial Times Stock least-squares and multiple
Exchange 100 Index regressions on, 498–499
FTSE 100, 26, 40, 45 memory management, 482
Fubini’s theorem, 252 programming, 493–494
Full asset universe strategy, 37 thread, 481
Full freezing, 184 Graphics processors, 239
Fundamental Review of Trading Book Grid, 481
(FRTB), 12 Gustafson’s law, 424, 478
Funding adjustments, 4
Funding value adjustment, see Fund H
Valuation Adjustment (FVA)
Fund Valuation Adjustment (FVA), 10, Haar basis, 267
466, 518 Haar wavelets, 256
Hadoop+MapReduce, 521
Hamilton–Jacobi–Bellman equation
G
(HJB equation), 80, 81, 91, 94
GARCH model, 518 Hamilton–Jacobi–Bellman–Isaacs
Gaussian affine models, difficulties with, equation (HJBI equation), 86,
282–283 87
Gaussian copula, 116, 147 Hard-won credit observations, 563, 567
Gaussian EFM model, 300, 306–307 Hard disk drive (HDD), 448n2
Gaussian models, 274, 275 Hardware, 15, 491
Gauss–Markov theorem, 123 agnostic, 562
GCE, see Google Compute Engine CPU/FPU, 15–16
Genealogical trees, 540–543 FPGAs, 16–17
General-purpose GPUs (GPGPUs), GPUs, 16
414, 417 graph, 572–573
Geometric Brownian motion, 127, 268 in-memory data aggregation, 17
German credit data, 564, 569 Harvesting, 318
Gigabit Ethernet, 414 Hazard rate, 357
Girsanov theorem, 128 function, 358
Global “property cat” reinsurance Hazel Hen supercomputer, 425–426
market, 371–372 HDD, see Hard disk drive
Global distribution of economic Heat equation, 191
activity, 372 Heath–Jarrow–Morton framework (HJM
Global Lipschitz property, 103 framework), 136, 277–278
Global low-rate environment, 277 Hedging, 14–15
Global payments, 549–550 Herding behavior, 146
Hessian accumulation, 319 HPC, see High-performance computing
Heston model, 266 Hull–White models, 135, 279, 281
Heston–Nandi model (HN model), 518 120/20 fund, 34
Heston’s stochastic volatility model, Hundsdorfer–Verwer scheme (HV
139, 203 scheme), 179
Higher dimensional arrays, 483 Hybrid cloud, 516
Highest Scaling Codes on JUQUEEN
(High-Q club), 419 I
High-frequency trading, 50
algorithm, 523 IaaS, see Infrastructure as a Service
High-performance computing (HPC), IATA, see International Air Transport
250, 511; see also Reinsurance Association
Contract Optimization (RCO) IATA Clearing House (ICH), 392
clearance procedure, 392–393
approaches to calibrating Black
data simulation description and
models, 288, 290
model assumptions, 395–397
Black model calibration progress,
example of one simulation, 406–407
292–295
before IATA Coin adoption, 403–404
community, 417
after IATA Coin adoption, 404–405
in ESGs, 142–143
improving ICH’s clearance procedure
implementation, 297–298
with blockchain technology,
Monte Carlo bond pricing, 291
393–395
PDE bond pricing, 292
increasing number of simulations,
systems, 441 407–408
techniques, 274 mathematical formulation of model,
three-factor Black model calibration, 397–403
290–291 model results, 405–406
High Performance Computing Center practical implementation of
(HLRS), 425–426 model, 403
High-Q club, see Highest Scaling Codes results representation, 403
on JUQUEEN IATA Coins, 401, 407
Histograms, 378 before IATA Coin adoption, 403–404
HJB equation, see after IATA Coin adoption, 404–405
Hamilton–Jacobi–Bellman IBM’s Blue Gene series, 414
equation IBM’s Roadrunner, 416
HJBI equation, see Hamilton–Jacobi- ICE Benchmark Administration
Bellman–Isaacs Limited, 299
equation ICH, see IATA Clearing House
HJM framework, see IDC, see International Data
Heath–Jarrow–Morton Corporation
framework Idiosyncratic factors, 237
HLRS, see High Performance IEKF, see Iterated extended Kalman
Computing Center filter
HN model, see Heston–Nandi model IFRS 9 standard, 7, 9
Hölder’s inequality, 226, 234 ILP, see Instruction Level Parallelism
Ho–Lee model, 279 IMM, see Internal model method
Hour-based pricing model, 527 Impact measure for news, 32
House of Capet, 540, 542 impact score, 32–34
House of Habsburg, 540, 541 sentiment score, 32
House of Valois, 541 Impact score, 32–34
Implementation approaches of AD, coastline representation of price
19–20 curve, 58
Implicit functions, 325 events, 57
convex optimization, 326 Inventory, 80
linear systems, 325 penalization, 90–91
nonlinear systems, 325–326 Investment, 5
theorem, 360–362 banking industry, 116
Improved MLMC method, 202 strategies, 53
Indifference prices, 66 Investment banking, computationally
Individual banks, money creation by, expensive problems in
553–554 AD, 17–22
Inferring states method, 290 financial investments, 3
Infiniband, 449 hardware, 15–17
Infinitesimal perturbation analysis, see hedging, 14–15
Pathwise sensitivity approach RCR, 11–14
Information flow, 31 technology, 15–22
Information service, 70 trading, 14–15
Information theory, novel insights from, valuation requirements, 5–10
62–64 I/O operations, see Input/output
Infrastructure as a Service (IaaS), 143 operations
Inhomogeneous heat equation, 186 ISDA, see International Swaps &
In-memory data aggregation, 17 Derivatives Association
Input/output operations (I/O Iterated extended Kalman filter
operations), 442, 517 (IEKF), 290
Input parameters, 216 IT legacy, 521
In-sample goodness-of-fit, 300–303
Instantaneous short rate, 280
J
Instruction Level Parallelism (ILP),
479–480 Jacobians, 20, 21
Instruction parallel, 479–480 Jarrow–Landau–Turnbull model, 138
Intel(R) Xeon(R) CPU E5-2643, 491 Joslin–Singleton–Zhu model (JSZ
Intel CPU, 476 Joslin–Singleton–Zhu model (JSZ
“Interaction Between Law and Society” model), 281, 282
PhD thesis, 70 Jump-adapted discretization, 222
Interest rate products, 340 Jump-adapted Milstein discretization,
pathwise derivative method, 344–349 223
real-time risk management, 344 MLMC for constant jump rate,
Interest rates, 468 223–224
swap pricing, 462–463 MLMC for path-dependent rates,
Internal model, 120 224–226
Internal model method (IMM), 12, 13 Jump-adapted thinning, 225
International Air Transport Association Jump-diffusions, 222, 223, 241
(IATA), 392 Lévy processes, 226–227
International Data Corporation Merton’s jump diffusion model, 139
(IDC), 511 MLMC for, 222
International Swaps & Derivatives SDE, 222
Association (ISDA), 467 Jump-to-default process, 8
master agreement, 8 JUQUEEN supercomputer, 419,
Intrinsic time, 54, 56 427, 429
K multiple regression, 489–491
packages and hardware, 491
K20c NVIDIA graphics card, 485
path generation, 495–497
Kalman filter (KF), 275, 309
pricing phase, 500–501
measurement equation, 286
transition equation, 286 product specification and design,
497–498
Kelly Criterion, 29
Kelly Strategy (KS), 38 simulation, 345–349
Kernels, 493 speed comparisons and numerical
KF, see Kalman filter results, 501–504
Knight Capital, 52 with stochastic volatility, 125
k-Nearest Neighbors classification Life cycle investment decisions
schemes (k-NN classification optimization using
schemes), 565 MATLAB, 431
Know your customer (KYC), 538 approach, 433
Kolmogorov forward equation, 177 discrete time dynamic
Kooderive, 502, 503 programming, 433
Krippner approach, 291 implementation, 434
KS, see Kelly Strategy parallelization, 433–434, 437
KVA, see Capital value adjustment problem, 432–433
KYC, see Know your customer results, 434–436
Likelihood maximization, 291
Likelihood ratio method (LRM), 216
L Limiting factor, 423
λ hyper-parameter, 155, 156 Limit order placement, 79
Land titles, 543 Linear systems, 325
Lapse, 148 LINPACK performance list, 416
Large DFE memory (LMem), 445, 549 Lipschitz condition, 105
LASSO regression, 532 Lipschitz functions, 235
Layers, 372 Liquidation, 77
L-CSC, 417, 437 Liquid financial markets, 51
Least-squares on GPU, 498–499 Liquidity-providing investment
Lee and Carter model, 147 algorithms, 51–52
Legal frameworks, 11 Liquidity profile improving, 399
Levenberg–Marquardt technique, Little’s law, 479
133, 140 LMM, see LIBOR market model
Lévy area, 204, 229 LoadLeveler, 414
Lévy processes, 226–227, 255, 266 Local volatility process, 315–316
LIBOR, see London Interbank Offered Log-asset price processes, 250
Rate Logical threading
LIBOR market model (LMM), 116, 136, logical thread on GPU, 481
217, 320, 344, 464, 484, models, 480–481
485, 486 Logistic regression, 565
data collection phase, 499–500 QUBO feature selection with,
design overview, 492–493 579–580
in discrete time, 488–489 recursive feature elimination with,
least-squares and multiple 580–582
regressions on GPU, 498–499 Log-likelihood, 287
memory use, threads, and blocks, function calculation, 309–310
493–495 objective function, 298
Lognormal asset-price process, 127 Market risk, 5
Lognormal equity asset model, 139 capital calculation, 12
Log-normal models, 4 Markit, 137
London Interbank Offered Rate Markowitz model, 29
(LIBOR), 463; see also LIBOR Markowitz-style objective, 534
market model (LMM) Massively Parallel Processing (MPP),
data, 299 414
Long-Term Capital Management, 284n7 Master–worker pattern, 434
Long-term models, 53 Mathematical formulation of model, 397
Longevity risk, 147 discrete-time optimal control model,
“Long only” SSD portfolios, 29 397–399
“Long–short” strategy, 29 final mathematical formulation of
Long–short discrete optimization model, model, 402–403
34–35 values of control variables, 400–401
Longstaff–Schwartz algorithm for put variants of objective function, 399
options, 331 MATLAB, 418, 420
Lookback options, 209–211 approach, 433
Low truncation dimension, 176 discrete time dynamic
LP model solution, 39–40 programming, 433
LRM, see Likelihood ratio method implementation, 434
Lyapunov exponent, 141n22 life cycle investment decisions
optimization using, 431
M parallelization, 433–434, 437
problem, 432–433
Machine learning packages, 570
Macroscopic complexity, 60 results, 434–436
Magnetic effects, 571 version R2012b, 434
Manycore parallel computation MaxCompiler, 447
computer architecture, 472–473 compiling dataflow application
LIBOR market model, 486–504 with, 450
NVIDIA’s recent GPUs, 484 Maxeler
parallelism and execution, 480–484 computing systems, 443
parallelism and performance, dataflow oriented computing
477–480 paradigm, 443
parallelism imperative, 473–475 MPC-C systems couple x86
practitioner, 472 server-grade CPUs, 445
systems architecture, 475–477 MPC-N series systems, 447
Market SLiC, 451
data, 30 Maxeler dataflow systems, 445, 461
Market dynamics, 121 DFEs in cloud, 447–449
Market-consistent embedded value MPC-C series architecture, 446
(MCEV), 122 MPC-N series architecture, 447
Market orders (MO), 79 MPC-X series architecture, 446
ambiguity aversion effects on MO MaxelerOS, 447, 456
execution, 95–97 Maxeler RiskAnalytics
feedback controls, 93–95 DFE-accelerated components, 467
inclusion of, 93 platform, 461–462
optimal depth and MO execution Maxeler Technologies, 443
boundary, 95 “Maximum drawdown” dynamic risk
Market-price-of-risk, 121, 128, 285 measures, 29, 38
Maximum likelihood, 22 Metrics, 382–383
Maximum likelihood estimation (MLE), Micro-architectural innovations, 442
275, 290 Microsoft Azure, 524
Maximum Performance Computing, 448 Microsoft’s Azure, 142
MaxJ, 450–451, 452, 453, 454, 455 Microsoft’s Service Fabric, 143
MaxRing, 445, 449 Microsoft Windows Azure Cloud, 529
MCEV, see Market-consistent Middleware, 516
embedded value solutions, 521
MCR, see Minimum capital requirement Midprice, 79
MCT, see Monetary circuit theory Midprice drift, 90
Mean-variance optimization equivalence to inventory
framework, 530 penalization, 90–91
Measured service, 511 Milstein discretization, 231–233
Measurement Milstein scheme, 209–211; see also
covariance matrix, 296 Multidimensional Milstein
error process, 286 scheme
Medium- and high-dimensional Miners, see Notaries
derivative pricing PDEs minimize() function, 571–573
asset-dependent correlation, 193 Minimum capital requirement
decomposition methods, 182–184 (MCR), 118
different base cases with nonconstant Mining, 546
parameters, 188 Minor embedding, 572
finite difference schemes, 179–181 Minute-based pricing model, 524
theoretical results, 185–187 Mixed precision arithmetic, 239–240
time-dependent exponential MLE, see Maximum likelihood
correlation, 190, 191 estimation
time-dependent simple correlation, MLMC methods, see Multilevel Monte
189–190 Carlo methods
time-dependent volatilities, MO, see Market orders
exponential correlation, Model calibration toolkits, 125
191–192 Model risk, 116
time-dependent volatilities, simple Modern FVA models, 10
correlation, 190–191, 192 Modern macroeconomic thinking, 551
Memory management on GPU, 482 Monetary circuit, 551–552
Memory use, 493–495 Monetary circuit theory (MCT), 552
Merkle trees, 544 Money creation
Mersenne twister, 144 by banking system, 554
Merton’s jump diffusion model, 139 general aspects, 552–553
Mesh-based methods, 176 by individual banks, 553–554
Message Passing Interface (MPI), by two banks, 554, 555
415–416, 418, 419, 479 Money management
approach, 428 dynamic strategy using, 38–39
combination technique, 428–429 volatility pumping via, 29
decomposition, 428 Monotonicity, 119
implementation, 429 Monte Carlo (MC)
parallelization, 429 bond pricing, 291
pricing basket options using, 427 greeks, 216–217
problem, 427–428 methods, 4, 517, 518, 529
results, 429–431 option pricing, 514
Monte Carlo (MC) (Continued ) conditional Monte Carlo for pathwise
simulations, 6, 9, 16, 118, 122, sensitivity, 218–219
199–200, 332, 340, 422, 423, European call, 218
430, 443, 484–485, 514 Monte Carlo Greeks, 216–217
trials, 125, 142 optimal number of samples, 220
Monte Carlo scenario split pathwise sensitivities, 219–220
generation, 275 vibrato Monte Carlo, 220–222
simulation, 274 Multilevel Monte Carlo methods
Moody’s Analytics, 125, 519 (MLMC methods), 198, 199,
RiskIntegrity Suite ESG, 124, 126 202; see also Fourier and
Moody’s KMV model, 138 wavelet option pricing methods
Moore’s law, 415, 473 algorithm, 205–206
MOPBIL, 379, 380, 383–385 Brownian bridge interpolation,
Mortality risk, 147 analysis of, 242–244
MPC-C series architecture, 446 Euler and Milstein discretizations,
MPC-N series architecture, 447 203–205
MPC-X series architecture, 446 improved multilevel Monte Carlo,
MPI, see Message Passing Interface 202
MPP, see Massively Parallel Processing for jump-diffusion processes, 222–227
MRA, see Multi resolution analysis Monte Carlo, 199–200
Multidimensional Black–Scholes multidimensional Milstein scheme,
model, 485 227–236
Multidimensional Milstein scheme, 227 multilevel method, other uses of, 237
antithetic multilevel Monte Carlo multilevel Monte Carlo algorithm,
estimator, 227–228 205–206
Clark–Cameron example, 228–231 multilevel Quasi-Monte Carlo, 240
Milstein discretization, 231–233 pricing with, 206–215
piecewise linear interpolation SDE, 202–203
analysis, 233–235 theorem, 200–201
simulations for antithetic Monte Multilevel Quasi-Monte Carlo, 240
Carlo, 235–236 Multilevel treatment, 238
Multifactor yield curve models and Multiple regression, 489–491
drawbacks, 276 Multiple regressions on GPU,
availability, 277–280 498–499
classification of three-factor affine Multiple strike pricing, 270–271
short rate models, 280–281 Multi resolution analysis (MRA), 255
difficulties with Gaussian affine Multiscale dataflow computing in
models, 282–283 finance
requirements for model development, conventional control-flow oriented
276–277 processor, 444
Multilevel checkpointing schemes, 324 correlation, 459–461
Multilevel implementation, 225 dataflow paradigm, 443–445
Multilevel method, 237 dataflow programming principles,
mixed precision arithmetic, 239–240 449–457
nested simulation, 238 development process and design
stochastic partial differential optimization, 457–459
equations, 237 financial application examples,
truncated series expansions, 238–239 461–469
Multilevel Monte Carlo Greeks, 216, 218 Maxeler dataflow systems, 445–449
Multiscale dataflow programming, Nonlinear black correction for EFM
449 model, 283
ecosystem, 448–449 black correction for negative rates,
Multithreaded implementations, 442 287–288
Mutual information, 569 EFM model calibration, 285–287
scores, 576 stylized properties of Black models,
288, 289
three-factor basic EFM model,
N 284–285
NAG, see Numerical Algorithms Group Nonlinear black correction for EFM
Ltd model, 283
National Institute of Standards and EFM model calibration, 285–287
Technology (NIST), 510 three-factor basic EFM model,
NCM, see Nearest correlation matrix 284–285
N-dimensional heat equation, 183 Nonlinear equations, 35
Nearest correlation matrix (NCM), Non-linear optimization problem (NLP
332–334 problem), 36
Negative interest rates, 555–557 Nonlinear problem, 8
Negative nominal interest rates, Nonlinear systems, 325–326
challenge of, 116 Normal copula, 145, 146
Nelson and Siegel rate parameter, Non-market risks, 143
132 Non-negativity, 139
Nelson–Siegel functional extrapolation, Notaries, 539, 546
132 Numerical Algorithms Group Ltd
(NAG), 297
Nested simulation, 238
library, 292
Nested stochastic problem, 122
Numerix’ Oneview ESG, 124
Network IO, 476
NumPy array, 571
Neural networks, 565
NVidia’s GPU computation, 143
News
NYSE, see New York Stock Exchange
analytics data, 30
meta data, 30
sentiment score, 32 O
Newton method, 332 OANDA trading platform, 71
Newton–Raphson iterations, 35 Objective function, 570
New York Stock Exchange (NYSE), 535 variants, 399
NIKKEI 227, 40 OIS, see Overnight Index Swap
NIKKEI 252 components, 38, 40 On-demand self-service, 510
NImpact, 33 1D diffusion process, 292
NIST, see National Institute of OpenMP, 418, 419
Standards and Technology Operational risk, 148–149
NLP problem, see Non-linear Operational Risk Consortium
optimization problem (ORIC), 149
Nominal interest-rate models, 133–138 Operations Per Second (OPS), 416
Nominal short rate, 287 Operator overloading, 20
Nonconstant parameters, base cases OPS, see Operations Per Second
with, 188 Optimal execution algorithms, 77, 78
Nonembarrassingly parallel computing Optimal feedback controls, 81–82
problem, 513, 514 Optimal liquidation problem, 80–81
Non-Gaussian simulation processes, 9 Optimal number of samples, 206, 220
Optimization, 562; see also Reinsurance Amdahl’s law, 477–478
Contract Optimization (RCO) data parallel, 480
convex, 326 Gustafson’s law, 478
distributed portfolio, 531–533 instruction parallel, 479–480
model, 397 Little’s law, 479
portfolio optimization models, 26, task parallel, 479
34, 422 Parallelism imperative, 473
problem, 83 Dennard scaling, 474
QUBO, 562, 564 Moore’s law, 473
robust optimization problem, 85 performance, 474–475
Option payoffs, 4 Parallelization, 421, 429, 433–434, 437
Option pricing problems, 344 of code, 419
Option value, 175, 251, 267, 344 interfaces, 418–420
Ordered records, 539 schema, 298
Orders of convergence, 209, 215, Parallel version, 384–389
219–220 Parameter calibrations, 121, 142
ORIC, see Operational Risk Consortium Parameter estimation method, 290
ORSA, see Own risk and solvency Parametric integration, 198
assessment Pareto-event-driven compound Poisson
Orthogonal Chebyshev polynomial (PEDCP), 149, 150
basis, 158, 159 compound Poisson empirical
Orthonormality, 257 distribution function, 151
OTC, see Over-the-counter density function, 154–157
Out-of-sample Monte Carlo projection,
distribution function, 152–153
304–308
finite-element representation,
Over- and under-collateralization, 9
154, 155
Overloading for custom active data
histogram, 151
type, 321
Pareto Type I events, 150
Overnight Index Swap (OIS), 463
regularized finite-element
Over-the-counter (OTC), 5, 291, 466,
representation, 157
548
Pareto frontier, 387, 388
Own risk and solvency assessment
Parseval’s identity, 258
(ORSA), 120
Parsimonious models, 52
Partial differential equations (PDEs),
P 175, 251, 290, 328, 422; see also
P2P lending, 554–555 Medium- and high-dimensional
PaaS, see Platform as a Service derivative pricing PDEs
Packages, 491 bond pricing, 292
Parallel application, 418 discretization method, 4
Parallel computing PDE/HV approach, 193
problem, 513 Partial freezing, 184
taxonomy of, 513–515 Partial-integro differential equation, 5–6
Parallel edges, 321 Partnership for Advanced Computing in
Parallelism, 383 Europe (PRACE), 426
Parallelism and execution, 480 Passive inputs, 317
data structures, 482–484 Passive outputs, 317
logical threading models, 480–481 Path-dependent jump rate, 224
physical execution models, 481–482 Path-dependent rates, MLMC for,
Parallelism and performance, 477 224–226
Path generation, 495–497 PPC, see Price per CPU core
Pathwise derivative method, 344 PPI, see Price per instance
LMM simulation, 345–349 PRA, see Prudential Regulatory
Pathwise sensitivity Authority
analysis, 239 PRACE, see Partnership for Advanced
approach, 216 Computing in Europe
conditional Monte Carlo for, 218–219 Preaccumulation, 327
Payoff, 4, 5, 211, 212 Pre-analysis, 31
coefficients, 252, 254, 260, 264–265 Precision, 582, 586
function, 217, 233, 251 Price per CPU core (PPC), 524
PBIL, 377–379 Price per instance (PPI), 524
PCI Express (PCIe), 445, 449, 456 Pricing, 517–518
PDEs, see Partial differential equations formulas, 5
PEDCP, see Pareto-event-driven model and cost, 525
compound Poisson multiple strikes, 254–255, 266
Penalty function, 84, 85 phase, 500–501
Per unit of limit, 373 Step, 358, 360
Pervasiveness of cloud computing, 511 Pricing basket options using C++ and
Peta FLOPS, 416 MPI, 427
P-forecasting measure, 128 approach, 428
Physical execution models, 481–482 combination technique, 428–429
Piecewise linear interpolation, 207 decomposition, 428
analysis, 233–235 implementation, 429
PImpact, 33 parallelization, 429
Pipeline depth, 460–461 problem, 427–428
Plain vanilla payoff coefficients, 254, 260 results, 429–431
Plain vanilla products, 120 Pricing of credit derivatives, 5–7, 357
Plancharel equality, 6 calibration step, 358–359
Platform as a Service (PaaS), 143 Pricing with MLMC, 206
P-measure, 121 barrier options, 211–214
PnL attribution, 11 conditional Monte Carlo, 211
Point-wise entropy, 62 digital options, 214–215
Pointwise ε-optimality, 106–107 Euler–Maruyama scheme, 207–209
Poisson process, 224, 225 Milstein scheme, 209–211
Polynomial separation algorithm, 39 orders of convergence, 209, 215
Portfolio backtesting, 528–529 Primal code, 317
Portfolio liquidation, 78 Primal function, 317
optimal depth for ambiguity neutral Private banks, 538, 552
agent, 82 Private cloud, 516
reference model, 79–82 Probability density function, 217,
Portfolio optimization models, 26, 34, 250, 268
422 Probability indicator, 54, 62–64
Portfolio simulation, 527 Processors performance, 442
Portfolio strategies, 43, 44 Product specification and design,
PoS, see Proof of stake 497–498
Positive homogeneity, 119 Profitable trading, hallmarks of, 52–53
POSIX threads (Pthreads), 418, 419 Programming languages, 418–420
Potential computing bottleneck, 529 Proof of stake (PoS), 540
PoW, see Proof of work Proof of work (PoW), 540
Proprietary supercomputers, 426
Prudential Regulatory Authority (PRA), 118
Pseudo-random normal deviates, 144
Pseudo-square root, 488–489
Pthreads, see POSIX threads
Public cloud, 516
    computing, 443, 447–448
Python, 418, 420

Q
1QBit, 562
1QBit SDK, 570
QGMs, see Quadratic Gaussian models
QL code, see QuantLib code
Q-measure, 121
QMLE, see Quasi-maximum likelihood estimation
Quadratic Gaussian models (QGMs), 279n5
Quadratic inventory penalization, 90
Quadratic objective function, 574, 585
Quadratic unconstrained binary optimization (QUBO), 562, 564
    as established approach, 564–565
    feature selection, 570, 579–580, 582–584
    feature selection in 1QBit SDK, 570
Quality assurance, unified platform for, 520–521
Quantitative analyst, 472
Quantitative Easing program, 54
Quantitative trading, 519
Quantity of interest, 7, 131, 176
QuantLib code (QL code), 502, 503
Quantum annealer, 586
    binarizing, scaling, and correlating German credit data, 569
    classification, 568
    coding feature selector, 570–573
    comparison of QUBO feature selection and recursive feature elimination, 582–584
    credit scoring and classification as business problem, 562–564
    credit scoring and classification problem formulation, 565–566
    establishing zero-rule and other baseline properties, 574–579
    evaluation metrics, 573–574
    experimental results, 574
    feature selection, 566–568
    implementation, 585
    potentially missed subsets, comparison with, 584–585
    previously reported results, comparison with, 585–586
    QUBO as established approach, 564–565
    QUBO feature selection with logistic regression, 579–580
    recursive feature elimination with logistic regression, 580–582
Quantum annealing process, 572
Quantum effects, 473
Quantum hardware, 562, 570
Quantum ready SDK, 562, 587
Quasi-maximum likelihood estimation (QMLE), 290, 297
    parameter estimation, 287
Quasi-Monte Carlo methods, 240
QUBO, see Quadratic unconstrained binary optimization

R
Rack Unit (U), 446n1
Radial basis functions, 176, 266
Radon–Nikodym derivative, 83, 85
Random fields, 84
Random market effects, 237
Rank-1 lattice rule, 240
Rapid elasticity, 511
Rate of convergence, 268–269
Rate on line, see Per unit of limit
Rates on line (ROL), 375–376
Rating transition risk, 354
RCO, see Reinsurance Contract Optimization
RCR, see Regulatory capital requirements
Real-code (RPBIL), 378
Real-time counterparty credit risk management, 349
    adjoint algorithmic differentiation and, 352–354
    counterparty credit risk management, 350–352
    results, 354–357
Real-Time Gross Settlement system, 549
Real-time risk management
    AAD, 341–343
    of flow credit products, 357–368
    of interest rate products, 344–349
    real-time counterparty credit risk management, 349–357
Real-world ESG (rESG), 125
Recall, 582, 586
Recursive feature elimination (RFE), 564, 580–584
Recursive feature elimination cross-validation (RFECV), 564
Red Hat Linux 64-bit operating system, 383
Reference distribution, revision of, 28–29, 35–36
Reference model, 79
    dynamics, 79–80
    feedback controls, 81–82
    optimal liquidation problem, 80–81
Registers, 476
Regression-based method, 486
Regularization methods, 532
“Regulation T”, 39
Regulator Supervisory Report, 120
Regulatory capital requirements (RCR), 11, 13
    calculation of market risk capital, 12
    capital value adjustment, 13–14
    credit risk capital, 12–13
Regulatory supervisor, 121
Rehypothecate collateral, 10
Rehypothecation, 549–550
Reinsurance
    costs, 373–374
    recoveries, 374–375
Reinsurance Contract Optimization (RCO), 372
    case study, 382
    EAs, 376–382
    metrics, 382–383
    modeling RCO problem, 373
    parallel version, 384–389
    reinsurance costs, 373–374
    reinsurance recoveries, 374–375
    results, 383
    risk value and optimization problem, 375–376
Relative predictive power, 569
Relative strength index (RSI), 37
    asset filter relative strength index, 37–38
Replicating portfolio of matching assets, 122n8
Repo market for derivatives, 10
Reporting, 518–519
rESG, see Real-world ESG
Resource manager, 516
Resource pooling, 510
Reverse accumulation, 19, 20
Reverse mode AD, see Adjoint(s): mode AD
Revised Basel III framework (2010), 11
RFE, see Recursive feature elimination
RFECV, see Recursive feature elimination cross-validation
R framework, 31
Riccati equation, 280
Risk, 535
    assessment, 276
    calculation, 7
    of derivatives, 5–7
    hedging strategies, 371
    management, 340, 518–519
    and premium flow, 372
    risk transfer contracts, tendency for, 372
    value and optimization problem, 375–376
RiskAnalytics DFE implementation, 465
RiskIntegrity suite, 126
Risk-neutral
    option valuation formula, 251
    pricing, 121n6
Risk scenario generator (RSG), 124, 131, 143; see also Economic scenario generators (ESG)
    co-dependency structures and simulation, 144–147
    lapse and surrender risk, 148
    mortality and longevity risk, 147
    operational risk, 148–149
Rmetrics’ R package fOptions, 162
Robust calibration technique, 141
Robust long-term yield curve model, 275
    calculation of log-likelihood function, 309–310
    EFM, 275
    empirical evaluation of model in- and out-of-sample, 299–308
    HPC approaches to calibrating Black models, 288, 290–295
    multifactor yield curve models and drawbacks, 276–283
    nonlinear Black correction for EFM model, 283–288, 289
    UKF EM algorithm HPC implementation, 295–298
Robust optimization problem, 85
ROL, see Rates on line
Root mean square error, 303
RPBIL, see Real-code
R’s ecdf() function, 151
RSG, see Risk scenario generator
RSI, see Relative strength index
R’s runif() function, 151
[Link]() function, 162

S
S&P 502 realized mean runtime, 430
SABR-LMM model, 138
SA-CCR approaches, 11, 12
SA-CVA approaches, 11
SAPS, see Self-administered pension scheme
Scalable problem, 513
“Scaled” tail, 34
Scaled long/short formulation of achievement-maximization problem, 34
Scale-free networks, 59
Scale-up to process larger models, 41–42
Scaling function, 256
Scaling laws, 54, 72
    distributions, 59
    emergence of, 58–59
Scenario generation, challenges in
    challenge of negative nominal interest rates, 116
    ESG, 124–143
    ESG and Solvency 2, 117–124
    examples, 149
    layout, 124
    objectives, 116–117
    PEDCP representation, 150–157
    RSG, 143–149
    SVJD model, 157–165
Scenario simulation, 276
SCR, see Solvency capital requirement
SDEs, see Stochastic differential equations
SDK, see Software development kit
Secondary trading, 548
Second derivatives, 318–319
Second-order cumulant approximation, 294
Second-order polynomials, 491
Second-order stochastic dominance model (SSD model), 26, 29
    asset allocation optimization model, 31
    backtesting, 40
    data, 30
    density curves, 36
    enhanced indexation applying SSD criterion, 26–28
    guided tour, 29
    impact measure for news, 32–34
    information flow and computational architecture, 31
    long–short discrete optimization model, 34–35
    models, 32–36
    money management via “volatility pumping”, 29
    performance measures, 44
    results, 42–45
    revising reference distribution, 28–29
    revision of reference distribution, 35–36
    scale-up to process larger models, 41–42
    solution method and processing requirement, 39–42
    solution methods for SIP models, 29
    solution of LP and SIP models, 39–40
    system architecture, 31
    trading strategies, 36–39
Seeding, 318
Self-administered pension scheme (SAPS), 147
Sentiment score, 32
Separation algorithm, 41
Server–worker nodes, 516
SHA-256, 544
Shadow rates, 278
    models, 275
    process, 292
Shadow short rate, 287, 292
Shannon scaling function, 261
Shannon wavelet inverse Fourier technique (SWIFT), 250, 261, 267; see also Multilevel Monte Carlo methods (MLMC methods)
    density coefficients, 263–264
    payoff coefficients, 264–265
    pricing multiple strikes, 266
Shannon wavelets, 250, 255, 261
Shared computing resources, 510
Shaw–Brickman algorithm, 495
Short-term models, 53
Short rate models, 133
ShuffleSplit, 570
Sign reversal, 231
SIMD, see Single Instruction Multiple Data
SIMM, see Standard initial margin model
Simple Live CPU (SLiC), 451
    library, 456
Single bank, 553
Single Instruction Multiple Data (SIMD), 442
Single-level checkpointing schemes, 324
Single period SP, 29
SIP model, see Stochastic integer programming model
Sklar’s theorem, 144, 145
SLiC, see Simple Live CPU
Small fixed cost, 352–353
Smooth density functions, 261
Smoothing, 326
SMs, see Streaming Multiprocessors
Sobol numbers generation, 495
SoC, see Systems on Chip
Software development kit (SDK), 562
    1QBit, 570
    QuadraticBinaryPolynomialBuilder class, 571
Solid-State Drive, 448n2
Solvency 2, 117–124
Solvency and Financial Condition Report, 120
Solvency capital requirement (SCR), 118
Source code transformation, 20
Sparse grids, 176
SPDE applications, 237
Spearman correlation, 575
Speed comparisons and numerical results, 501–504
Speedup, 383, 386
Split pathwise sensitivity, 219–220
Splitting, 219, 220
Spot LIBOR measure, 487
Spot-starting interest rate swap, 6
Square Chimera graph, 572, 573
SSD model, see Second-order stochastic dominance model
SST, see Swiss Solvency Test
Standard initial margin model (SIMM), 18, 467–469
Standard methods, 22
State-measurement cross-covariance matrix, 296
State–space form, 286, 292
Stochastic asset price process, 250
Stochastic differential equations (SDEs), 125, 127, 201, 202–203, 278
Stochastic integer programming model (SIP model), 26
    solution, 39–40
    solution methods for, 29
Stochastic partial differential equations, 237
Stochastic volatility and jump diffusion model (SVJD model), 149, 157
    bar plots of coefficients, 165
    and combined equity asset shock, 160–161
    distributional representation, 158–160
    unconditional equity asset shock distribution, 161–165
Stochastic volatility process, 137
StratifiedShuffleSplit, 570, 575, 586
Streaming Multiprocessors (SMs), 477, 482
Structural models, 138
Stylized properties of Black models, 288, 289
Sub-additivity, 119
Sun-Gard’s Prophet Asset Liability Strategy, 117
Sun Grid Engine, 521
Sunway TaihuLight, 414, 415
Supercomputers, 414
    advantages and disadvantages, 421–422
    current landscape, and upcoming trends, 416–417
    exponential performance development and projection, 415
    for financial applications, 422–426
    with MPP architecture, 414
    optimizing life cycle investment decisions, 431–436
    pricing basket options using C++ and MPI, 427–431
    programming languages and parallelization interfaces, 418–420
SuperDerivatives, 137
SuperMUC supercomputer, 426
Superposition dimension, 176
Supplementary material, 74
Support vector machines (SVMs), 565
Surprise of event-based price curve, 62
Surrender risk, 148
SVJD model, see Stochastic volatility and jump diffusion model
SVMs, see Support vector machines
Swap rate, 133
SWIFT, see Shannon wavelet inverse Fourier technique
Swiss Piz Daint, 417
Swiss Solvency Test (SST), 118
System memory, 476
Systems on Chip (SoC), 417

T
Tail-Value at Risk (TVaR), 29, 375
Tangent-over-adjoint mode AD tangent mode, 319
Tangent-over-tangent mode AD tangent mode, 318–319
Tangent mode AD, 318
Task parallel, 479
    backtesting algorithm, 515
    problem, 513
Taxonomy of parallel computing, 513–515
Taylor-like ANOVA decomposition, 428–430
TCO, see Total cost of ownership
T-copula, 144–146
Techila-enabled computing tools, 529
Techila high-level architecture, 522
Techila Middleware, 521, 530
    with MATLAB, 521–522
Techila SDK, 529
Tensor approach, 176
Tesla K20, 484
Tesla K20c, 492
Textures, 494
    cache, 482
Thinning, 224
Thomson Reuters, 530
Thomson Reuters Data Stream platform, 43
Threads, 476, 493–495
Three-dimension (3D)
    equations, 181
    parabolic quasi-linear PDE, 292
Three-factor
    affine short rate models, 280–281
    basic EFM model, 284–285
    Black model calibration, 290–291
    extended Vasicek model, 281
    state variables, 286
“Three-stage feature selection fusion” technique, 586
Threshold, 64
Tianhe-1A, 416
Tikhonov-regularized least squares, 533
Time-consuming process, 276
Time-dependent exponential correlation, 190, 191
Time-dependent simple correlation, 189–190
Time-dependent volatilities
    exponential correlation, 191–192
    simple correlation, 190–191, 192
Time frequency analysis, 261
Time-reversed Brownian motion, 231
Time-series data, 26
Time-shared functional unit, 444
Time value of guarantees (TVOG), 122
Titan’s performance of 17.6 Peta FLOPS, 416–417
Title deeds, 543
TOP500 supercomputers, 420
Top right plot, 236
Total cost of ownership (TCO), 517
Total offset ratio, 405
TpS, see Transactions per second
Tracking error, 26–27
Trade/trading, 14–15
    clearing, 548–549
    execution, 548–549
    finance, 549–550
    settlement, 548–549
Trading models, 52, 53, 72
    algorithm, 50, 51
    anatomy and performance, 53–56
    and complexity, 59–60
Trading strategies, 36
    base strategy, 37
    dynamic strategy using money management, 38–39
    using RSI and impact as filters, 38
    using RSI as filter, 37–38
Transaction costs reducing, 399
Transactions per second (TpS), 546
Transition equation, 286
Translational invariance, 119
Truncated Pareto event–driven compound Poisson distribution, 149
Truncated series expansions, 238–239
Tsunamis, 535
TVaR, see Tail-Value at Risk
TVOG, see Time value of guarantees
Two-dimension (2D)
    arrays, 483
    equations, 181
Two-level approach, 239
t-year VaR, 119

U
UCI, see University of California, Irvine
UKF, see Unscented Kalman filter
UKF EM algorithm HPC implementation, 295–298
    HPC implementation, 297–298
    quasi-maximum likelihood estimation, 297
    technical implementation, 297
    UKF for Black EFM model, 295–297
Uncollateralized derivative transactions, 10
Unconditional equity asset shock distribution, 161–165
Unified platform for quality assurance, 520–521
University of California, Irvine (UCI), 564
Unscented Kalman filter (UKF), 275, 290
    for Black EFM model, 295–297
    likelihood, 302

V
Valuation, 276, 517–518
    CVA, 7–9
    derivatives pricing and risk, 5–7
    DVA, 7–9
    FVA, 10
    requirements, 5
Value-at-Risk measurement (VaR measurement), 12, 29, 118–120, 375, 463–464
Variable coefficients, 184, 186–187
Variance reduction (VR), 356
VaR measurement, see Value-at-Risk measurement
VAR models, see Vector autoregression models
Vasicek dynamic model, 134, 135
Vasicek implied yield curve, 135
Vasicek model, 125–126
Vector autoregression models (VAR models), 281
Verification theorem, 87
Vibrato Monte Carlo, 220–222
Vieta’s formula, 263
Virtualization, 516
Viscosity solution, 155
Visualization code, 530n7
Volatility, 54, 176
Volatility pumping, 26, 29, 38
    money management via, 29
Volume credit products, 340
VR, see Variance reduction

W
WA(a,b) method, 256, 267
    density coefficients, 258–260
    plain vanilla payoff coefficients, 260
    robustness of, 267–268
Wall clock time (WCT), 517
Walltime, 426
War of the Austrian Succession, 541
Warps, 493
Wavelet series, 255–256
WCT, see Wall clock time
Weak speedup, 383
Wealth, 80
Willis Towers Watson’s MoSes HPC, 124
Willis Towers Watson’s Replica, Igloo, and MoSes solutions, 117
Willis Towers Watson’s Star ESG, 124
Winning propositions, 38
Workload, 517
Work variables, 19
Wrong way risk, 9

X
X-Road, 547
xVA, 14–15, 320, 518

Y
Yield curve, 131–133
    bootstrapping, 300

Z
Zero-correlation approximation, 184
Zero coupon bond prices, 280, 292
Zero lower bound (ZLB), 275
Zero-rule classifier, 574
Zero-rule establishment, 574–579