1 Randomized Algorithms
This is a field that became large in the past 15 years, as a subfield of Linear
Algebra, Scientific Computing and especifically, big data. This field did not
receive much attention because of its lack of precision.
The main application of Randomized algorithms is matrix operations. In
many big–data tasks we are dealing with large matrices
• matrix multiplication (ex. covariance matrices),
• solving linear systems of equations. Furthermore, nonlinear system of
equations are usually solved iteratively by linear approximation,
• computing eigenvectors, singular vectors, eigenvalues, etc. ([Link],
PCA), and
• matrix factorization
One of the shortcomings on the computation of standard linear algebra
is that it scales poorly for large matrices (roughly, cubic complexity)
1.1 Matrix Multiplication
Let A, B be two n × n matrices. The cost for computing the product AB is
O(n3 ).
1.2 Strassen’s algorithm (1969)
Let’s decompose the matrix multiplication of A and B into blocks in the
following way
A11 A12 B11 B12 C11 C12
=
A21 A22 B21 B22 C21 C22
with each block of size n2 × n2 . Following the standard way of multiplication, it
requires 8 n2 × n2 matrix multiplications and 4 additions of n2 × n2 matrices.
Strassen’s algorithm can compute this block matrix multiplication using
only seven n2 × n2 matrix multiplication and 18 n2 × n2 additions.
3
• Cost for multipliying n2 × n2 is O n2
1
n n n 2
• Cost for adding 2
× 2
is O 2
Now, one can design a recursive procedure with blocks of size n2 , n4 , . . . , (up
to blocks of size 16) and end up with Strassen’s algorithm.
The total cost now is O(nlog2 7 ) ∼ O(n2.8074 ). Strassen’s algorithm has
been improved to O(n2.37 ), but the hidden constant increases significantly.
The stability of the algorithm is normal, but the communication cost is
higher, thus, these algorithms are not (yet) used in practice. Strassen’s
also show that matrix multiplication actually reduces the complexity of the
Gaussian elimination procedure.
Big question Can we construct an algorithm that multiplies two arbitrary
n × n matrices in O(n2+ ) or O(n2 log n) operations?
1.3 Power iteration
Let A be an n × n symmetric matrix, with largest eigenvalue λ1 , |λ1 | ≥
|λ2 | ≥ . . ., with corresponding eigenvector v1 . Choose x0 ∈ Rn randomly,
and compute
Axk
xk+1 = , µk+1 = xTk+1 Axk+1 , k = 0, . . . ,
kAxk k
then xk → v1 and µk → λ1 as k → ∞, with a rate of convergence given by
! !
2 2
λ2 λ2
|µk − λ1 | = O , kxk − v1 k = O
λ1 λ1
(if hx0 , v1 i =
6 0), which in large independent random data sets occurs with
probability one). These methods are called Lanczos-type methods, and in
matlab can be called by eigs. One major application for this is given by the
page-rank problem.
1.4 Randomized algorithms
Matrix multiplication
Let A be an m × n matrix, and B an n × p matrix. We want to approximate
AB at a cost smaller than O(mnp). Note that
n
X
AB = A(i) B(i) ,
i=1
2
where A(i) is the i-th row of A, and B(i) is the i-th column of B. Then, AB
is the sum of rank-1 matrices A(i) B(i) . P
The idea is to approximate AB = ni=1 A(i) B(i) by randomly sampling
rank-1 components, according to their size, by given probabilities {pi }ni=1 ,
where
kA(i) k2 kB(i) k2
pi = n
X
kA(i) k2 kB(i) k2
i=1
Note that, as A(i) B(i) are rank-1 matrices, it is easy to check that
kA(i) B(i) kop = kA(i) k2 kB(i) k2
We select a total number of c such rank-1 components A(it ) B(it ) with
probability P{ik = k} = pk , with replacement, such that
A(it ) B (it )
C (t) = √ , R(t) = √ ,
cpit cpit
and get
c
1 X 1 (it )
AB ≈ CR = A B(it )
C t=1 pit
One can show that in this case, Drineas-Mahoney
(2011)
[], and Shynin-
T kAkF kAk2F
Rudelson (2007) []: If B = A , set c ≈
log 2 √δ . Then, for any
0 < < 1, with probability at least 1 − δ,
kAAT − CC T kop ≤
The proof uses a matrix concentration inequality, for example
Theorem 1 (Tropp). Let {Ak } be a finite sequence of fixed self-adjoint ma-
trices of size d×d, and let {ξk } be a sequence of Rademacher random variables
(i.e., ξ ∈ {−1, 1} with equal probability). Then, for all t ≥ 0,
( ! )
X t2
P λmax ξk Ak ≥ t ≤ d e− 2σ2 ,
k
where σ 2 = k k A2k kop .
P
3
This result can be related to the sum of zero-one variables. One needs to
consider the Symmetrization result:
X X
E (Xi − E(Xi )) ≤ 2E ξi Xi
i i
2 Randomized SVD
Given an m × n matrix A. SVD of A is given by A = UΣV∗ , where
U = [u1 , . . . , um ], V = [v1 , . . . , vn ], and Σ = diag(σ1 , . . . , σmin(m,n) ). Best
rank-k approximation to A is Ak = kj=1 σj uj vj∗ , and kA − Ak kop = σk+1 ,
P
which serves as the benchmark for approximation error.
Randomized SVD algorithm:
Step 1: (i) Let Ω be an n × 2k Gaussian random matrix. (ii) Form Y =
(AA∗ )q AΩ for some q ∈ N. Y is a matrix of size m × 2k. (iii) Construct a
matrix Q whose columns form an orthonormal basis for the range of Y.
Step 2: (i) Form B = Q∗ A, where B is a 2k × n matrix. (ii) Compute SVD
of B: B = Ũ ˜ Σ̃Ṽ∗ . (iii) Set Ũ = QŨ ˜ and get A ≈ ŨΣ̃Ṽ∗ .
Cost: Let Tmult be the cost of a matrix-vector multiplication by A or A∗ .
and then total cost of randomized SVD is Tmult (2q + 2)k + θ(k 2 (m + n)). In
contrast, full SVD costs θ(min(m, n) · m · n).
Theorem 2. Let A be an m × n matrix. Select an exponent q and a target
number of k singular vectors, where 2 ≤ k ≤ min(m, n)/2. Let ŨΣ̃Ṽ∗ be the
rank-2k factorization produced by randomized SVD. Then
" r 1
# 2q+1
2 · min(m, n)
E kA − ŨΣ̃Ṽ∗ kop ≤ 1 + 4 σk+1 .
k−1
In practice, we can truncate the approximate SVD and keep only the best
k singular values Σ̃k and corresponding singular vectors Ũk , Ṽk . The error
bound is
" r 1
# 2q+1
min(m, n)
E kA − Ũk Σ̃k Ṽk∗ kop ≤ 2 + 4 σk+1 .
k−1
Typically, q = 1 or 2 suffices. If singular values decay fast, often q = 0 is
okay. For practical implementation, the n × 2k Gaussian random matrix can
4
be replaced byp an n × ` randomly subsampled DFT or Hadamard matrix of
the form Ω = n` DFR, where D is an n × n diagonal matrix with random
±1 or random entries from the complex unit circle, F is the n×n unitary DFT
or Hadamard matrix, and R is an n × ` matrix that samples ` coordinates
from {1, 2, . . . , n} uniformly at random (without replacement). In theory,
instead of 2k, we need ` ∼ (2k + log n) · log k, but in practice, ` ∼ 2k suffices.
Full proof is omitted, but the key to the proof is an estimate for kA −
QQ∗ Akop . Concentration-of-measure bounds yield that kA − QQ∗ Akop con-
centrates tightly around E kA − QQ∗ Akop .
3 Randomized Kaczmarz Algorithm
Given Ax = y, if A is too large, we cannot load it into computer memory.
Instead, we can take “row action” methods that apply to rows of A individ-
ually rather than access to the entire matrix A. Let A be an m × n matrix.
Denote the rows of A by a∗1 , . . . , a∗m .
Kaczmarz method: Iteratively project orthogonally onto solution hyper-
plane of hai , xi = yi . Given an initial guess x0 , compute
yi − hai , xk i
xk+1 = xk + ai ,
kai k22
where i = k mod (m + 1). Convergence of Kaczmarz method is easy to
establish, but rate of convergence is not clear (m > 2). Moreover, in many
cases, empirical convergence can be very slow. Remedy is to pick up the
hyperplane onto which we project in some random order instead of the linear
order.
Randomized Kaczmarz method: Iterative projection onto solution hyper-
plane in some random order
yr(i) − har(i) , xk i
xk+1 = xk + ar(i) ,
kar(i) k22
where r(i) is chosen from {1, . . . , m} at random with probability proportional
to kar(i) k22 .
Theorem 3. Let A be an m × n matrix with m ≥ n and rank(A) = n.
Let x be the solution of Ax = y. Denote A† = (A∗ A)−1 A∗ . Then ran-
domized Kaczmarz method converges in expectation and E kxk − xk22 ≤ (1 −
5
κ(A)−2 )k kxo − xk22 , where κ(A) = kAkF kA† kop is scaled condition num-
ber of A. Expected number of iterations achieve accuracy is E[k ] ∼
2κ(A)2 log(1/).
Full proof can be found in [1]. Note that randomized Kaczmarz method
is not useful if A has a lot of structures so that matrix-vector multiplication
is cheap.
References
[1] Strohmer, Thomas; Vershynin, Roman (2009), “A randomized Kacz-
marz algorithm for linear systems with exponential convergence”, Jour-
nal of Fourier Analysis and Applications 15: 262-278.