0% found this document useful (0 votes)
21 views90 pages

Student Solutions Manual: Linear Algebra & ML

The document is a Student's Solutions Manual for Linear Algebra, Data Science, and Machine Learning, authored by Jeff Calder and Peter J. Olver. It contains various chapters covering topics such as vectors, inner products, matrices, optimization, machine learning, and neural networks, along with exercises and solutions. The manual is intended to assist students in understanding and applying concepts in linear algebra and its applications in data science and machine learning.

Uploaded by

Rithish Barath
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views90 pages

Student Solutions Manual: Linear Algebra & ML

The document is a Student's Solutions Manual for Linear Algebra, Data Science, and Machine Learning, authored by Jeff Calder and Peter J. Olver. It contains various chapters covering topics such as vectors, inner products, matrices, optimization, machine learning, and neural networks, along with exercises and solutions. The manual is intended to assist students in understanding and applying concepts in linear algebra and its applications in data science and machine learning.

Uploaded by

Rithish Barath
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Student’s Solutions Manual

Linear Algebra, Data Science, and


Machine Learning
Springer, 2025

Jeff Calder Peter J. Olver


May 15, 2025
Contents

1 Vectors 1

2 Inner Product, Orthogonality, Norm 4

3 Matrices 10

4 How Matrices Interact with Inner Products and Norms 16

5 Eigenvalues and Singular Values 24

6 Basics of Optimization 33

7 Introduction to Machine Learning and Data 43

8 Principal Component Analysis 52

9 Graph Theory and Graph-based Learning 58

10 Neural Networks and Deep Learning 75

11 Advanced Optimization 82
Chapter 1

Vectors

1.1. Plot the following vectors in R 2 .


T T T T T
(a ) ♡ ( −2, 2 ) , (b ) ♡ ( 0, −1 ) , (c ) ♢ 3 ( 1, 1 ) , (d ) ( −2, 3 ) − ( −5, 3 ) .
Solution: (a) (b )

T T
1.2. Suppose v = ( 1, 2, −1 ) and w = ( 0, −1, 2 ) . Determine the following vectors:
(a ) ♡ − v, (b) 3 v, (c) ♡ −5 w, (d ) ♢ v + w, (e) v − w, (f) 2 v − 3 w.
T T
Solution: (a ) ( −1, −2, 1 ) , (c ) ( 0, 5, −10 ) .
1.3. Prove the arithmetic properties (a) ♡, (b) ♢, (c) ♡, (d), (e), (f ) ♡, (g) for vectors in R n .
T T T
Solution: Given u = ( u1 , u2 , . . . , un ) , v = ( v1 , v2 , . . . , vn ) , w = ( w1 , w2 , . . . , wn ) ,
the i-th entry of each side of the equation are equal by the laws of arithmetic of real numbers:
(a ) vi +wi = wi +vi ; (c ) (c+d) vi = c vi +d vi and c (vi +wi ) = c vi +c wi ; (f ) vi +0 = vi = 0+vi ,
vi + (− vi ) = 0 = (− vi ) + vi . If ui = vi + wi then wi = vi − ui = vi + (− ui ).

T
2.1. ♡ (a) Prove that the set of all vectors ( x, y, z ) such that x − y + 4 z = 0 forms a
subspace of R 3 . (b) Explain why the set of all vectors that satisfy x − y + 4 z = 1 does not
form a subspace.
T T
Solution: (a) If v = ( x, y, z ) satisfies x − y + 4 z = 0 and v e = (x e, ye, ze ) also satisfies
T
e − ye + 4 ze = 0, so does v + v
x e = (x + x e, y + ye, z + ze ) since
(x + xe) − (y + ye) + 4 (z + ze) = (x − y + 4 z) + (e x − ye + 4 ze) = 0,
T
as does c v = ( c x, c y, c z ) since
(c x) − (c y) + 4 (c z) = c (x − y + 4 z) = 0. ■
T
(b) For instance, the zero vector 0 = ( 0, 0, 0 ) does not satisfy the equation.
2 Chapter 1. Vectors

2.2. Which of the following are subspaces of R 3 ? Justify your answers! (a) ♡ The set of all
T T
vectors ( x, y, z ) satisfying x + y + z + 1 = 0. (b)♢ The set of vectors of the form ( t, − t, 0 )
T
for t ∈ R. (c ) ♡ The set of vectors of the form ( r − s, r + 2 s, − s ) for r, s ∈ R. (d ) The
set of vectors whose first component equals 0. (e ) The set of vectors whose last component
T
equals 1. (f ) ♡ The set of all vectors ( x, y, z ) with x ≥ y ≥ z. (g ) ♡ The set of all solutions
to the equation z = x − y. (h ) ♢ The set of all solutions to the equation z = x y. (i ) The set
of all solutions to the equation x2 + y 2 + z 2 = 0. (j ) The set of all solutions to the system
x y = y z = x z.
Solution: (a ) Not a subspace; (c ) subspace; (f ) not a subspace; (g ) subspace.
T
2.3. Determine which of the following sets of vectors x = ( x1 , x2 , . . . , xn ) are subspaces of
R n : (a) ♡ all equal entries x1 = · · · = xn ; (b ) ♡ all positive entries: xi ≥ 0; (c ) ♢ first and
last entries equal to zero: x1 = xn = 0; (d ) the entries add up to zero: x1 + · · · + xn = 0;
(e ) first and last entries differ by one: x1 − xn = 1.
Solution: (a ) Subspace; (b) not a subspace.
2.6. Show that if V and W are subspaces of R n , then (a) ♡ their intersection V ∩ W is a
subspace; (b ) their sum V + W = { v + w | v ∈ V, w ∈ W } is a subspace; but (c) ♢ their
union V ∪ W is not a subspace, unless V ⊂ W or W ⊂ V .
Solution: (a) If v, w ∈ W ∩ Z, then v, w ∈ W , so c v + d w ∈ W because W is a subspace,
and v, w ∈ Z, so c v + d w ∈ Z because Z is a subspace, hence c v + d w ∈ W ∩ Z. ■


    
−1 2 5
3.1. ♡ Show that  2  belongs to the subspace of R 3 spanned by  −1 ,  −4  by
3 2 1
writing it as a linear combination of the spanning vectors.
     
−1 2 5
Solution:  2  = 2 −1  −  −4 .
3 2 1
     
1 2 1
3.3. Which of the following sets of vectors span all of R 2 ? (a)♡ ; (b )♡ , ;
−1 −1 3
                   
6 −4 2 −1 1 2 4 0 1 3
(c ) ♢ , ; (d ) , ; (e ) ♡ , , ; (f) , , .
−9 6 −1 2 2 4 8 0 2 4

Solution: (a ) No, (b ) yes, (e ) no.


Chapter 1. Vectors 3

3.4. Determine whether the given vectors are linearly independent or linearly dependent:
   
              1 0
1 2 1 −2 2 −1 5
(a)♡ , , (b )♡ , , (c ) , , , (d)♡  3 ,  2 ,
2 1 3 −6 1 3 2
−2 −1
               
4 −6
0 1 1 1 1 0
 2   −3 
(e) ♢  1  ,  −1  ,  1  , (f ) ♢  1  ,  0  ,  1  , (g )  ,  .
0 0
1 0 2 0 1 1
−6 9
Solution: (a ) Linearly independent; (b) linearly dependent; (d ) linearly independent.
3.7. ♡ Prove or give a counterexample to the following statement: If v1 , . . . , vk do not span
R n , then v1 , . . . , vk are linearly independent.
Solution: False. For example, any set containing the zero vector that does not span R n is
linearly dependent. ■

 
2
4.1. Determine which of the following sets of vectors are bases of R 2 : (a ) ♡ ;
1
                 
1 −1 1 2 3 0 2 −1 0
(b) ♡ , ; (c) ♢ , ; (d) ♡ , ; (e ) , , .
−1 1 2 1 5 0 0 2 −1
Solution: (a ) Not a basis; (b ) basis; (d) not a basis.
     
2 1 0
4.2. Determine which of the following are bases of R 3 : (a) ♡  1 ,  5 ; (b) ♡  1 ,
5 2 −5
                 
−1 1 −1 0 1 2 −1 0 −1
 3 ,  3 ; (c ) ♢  0 ,  4 ,  −4 ; (d)  0 ,  2 ,  −1 ,  2 .
0 0 1 −1 0 −2 −1 0 1
Solution: (a ) Not a basis; (b ) basis.
4.4. Find a basis for the following planes in R 3 :
(a ) ♡ the x y plane; (b) z − 2 y = 0; (c ) ♢ 4 x + 3 y − z = 0.
   
1 0
Solution: (a )  0  ,  1 .
0 0
4.5. ♡ Show, by computing an example, how the uniqueness result in Proposition 1.18 fails
if one has a linearly dependent set of vectors.
Solution: For instance, if
       
1 0 1 2
v1 = , v2 = , v3 = , then = 2 v1 + v2 = v1 + v3 .
0 1 1 1

In fact, there are infinitely many different ways of writing this vector as a linear combination
of v1 , v2 , v3 . ■
Chapter 2

Inner Product, Orthogonality,


Norm

1.1. Which of the following formulas for ⟨ v, w ⟩ define inner products on R 2 ?


(a ) ♡ 2 v1 w1 + 3 v2 w2 , (b ) ♡ v1 w2 + v2 w1 , (c ) (v1 + v2 )(w1 + w2 ), (d ) v12 w12 + v22 w22 ,
(e ) ♢ 2 v1 w1 + (v1 − v2 ) (w1 − w2 ), (f ) 4 v1 w1 − 2 v1 w2 − 2 v2 w1 + 4 v2 w2 .

Solution: (a ) Yes; (b) no — not positive definite.

1.3. Prove that each of the following formulas for ⟨ v, w ⟩ defines an inner product on R 3 .
Verify all the inner product axioms in careful detail:
(a) ♡ v1 w1 + 2 v2 w2 + 3 v3 w3 , (b ) 4 v1 w1 + 2 v1 w2 + 2 v2 w1 + 4 v2 w2 + v3 w3 ,
(c ) ♢ 2 v1 w1 − 2 v1 w2 − 2 v2 w1 + 3 v2 w2 − v2 w3 − v3 w2 + 2 v3 w3 .

Solution: (a) Bilinearity:

⟨ c u + d v, w ⟩ = (c u1 + d v1 ) w1 + 2 (c u2 + d v2 ) w2 + 3 (c u3 + d v3 ) w3
= c (u1 w1 + 2 u2 w2 + 3 u3 w3 ) + d (v1 w1 + 2 v2 w2 + 3 v3 w3 )
= c ⟨ u, w ⟩ + d ⟨ v, w ⟩,
⟨ u, c v + d w ⟩ = u1 (c v1 + d w1 ) + 2 u2 (c v2 + d w2 ) + 3 u3 (c v3 + d w3 )
= c (u1 v1 + 2 u2 v2 + 3 u3 v3 ) + d (u1 w1 + 2 u2 w2 + 3 u3 w3 )
= c ⟨ u, v ⟩ + d ⟨ u, w ⟩.

Symmetry: ⟨ v, w ⟩ = v1 w1 + 2 v2 w2 + 3 v3 w3 = w1 v1 + 2 w2 v2 + 3 w3 v3 = ⟨ w, v ⟩.
T
Positivity: ⟨ v, v ⟩ = v12 + 2 v22 + 3 v32 > 0 for all v = ( v1 , v2 , v3 ) ̸= 0, because it is a sum of
nonnegative terms, at least one of which is strictly positive. ■

1.4. Prove that the following quadratic forms on R 3 are positive definite by writing each as
a sum of squares. Then write down the corresponding inner product.
(a) ♡ x2 + 4 x z + 3 y 2 + 5 z 2 , (b) ♢ x2 + 3 x y + 3 y 2 − 2 x z + 8 z 2 ,
(c) 2 x21 + x1 x2 − 2 x1 x3 + 2 x22 − 2 x2 x3 + 2 x23 .

Solution: (a) (x + 2 z)2 + 3 y 2 + z 2 , ⟨ v, w ⟩ = v1 w1 + 2 v1 w2 + 2 v2 w1 + 3 v2 w2 + 5 v3 w3 .


Chapter 2. Inner Product, Orthogonality, Norm 5

1.6. (a) ♡ Prove that ⟨ x, v ⟩ = 0 for all v ∈ R n if and only if x = 0. (b ) ♢ Prove that
⟨ x, v ⟩ = ⟨ y, v ⟩ for all v ∈ R n if and only if x = y. (c) Let v1 , . . . , vn be a basis for R n .
Prove that ⟨ x, vi ⟩ = ⟨ y, vi ⟩ for all i = 1, . . . , n if and only if x = y.
Solution: (a) Choosing v = x, we have 0 = ⟨ x, x ⟩ = ∥ x ∥2 , and hence x = 0. ■
1.7. Let ⟨ ·, · ⟩ be an inner product on R n and let ∥ · ∥ be the induced norm.
(a) ♡ Show that the norm satisfies the parallelogram identity

∥ v + w ∥2 + ∥ v − w ∥2 = 2 ∥ v ∥2 + 2 ∥ w ∥2 for all v, w ∈ R n . (2.24)

(b) ♢ Prove the identity



⟨ v, w ⟩ = 1
4 ∥ v + w ∥2 − ∥ v − w ∥2 , (2.25)

which allows one to reconstruct an inner product from its norm.


(c) Use (2.25) to find the inner product on R 2 corresponding to the norm
q
∥ v ∥ = v12 − 3 v1 v2 + 5 v22 .

Solution: (a) ∥ v + w ∥2 + ∥ v − w ∥2 = ⟨ v + w, v + w ⟩ + ⟨ v − w, v − w ⟩
 
= ⟨ v, v ⟩ + 2 ⟨ v, w ⟩ + ⟨ w, w ⟩ + ⟨ v, v ⟩ − 2 ⟨ v, w ⟩ + ⟨ w, w ⟩ = 2 ∥ v ∥2 + 2 ∥ w ∥2 .

T
2.1. Verify the Cauchy–Schwarz and triangle inequalities for the vectors v = ( 1, 2 ) and
T
w = ( 1, −3 ) using (a) ♡ the dot product; (b) ♢ the weighted inner product
⟨ v, w ⟩ = v1 w1 + 2 v2 w2 ; (c) the inner product (2.10).
√ √
Solution: (a ) | v · w | = 5 ≤ 7.0711 = 5 10 = ∥ v ∥ ∥ w ∥.
2.2. Verify the Cauchy–Schwarz and triangle inequalities for each of the following pairs of
vectors v, w, using the standard dot product, and then determine the angle between them:
T T T T T T
(a ) ♡ ( 1, 2 ) , ( −1, 2 ) , (b) ♢ ( 1, −1, 0 ) , ( −1, 0, 1 ) , (c) ( 1, −1, 1, 0 ) , ( −2, 0, −1, 1 ) .
√ √
Solution: (a ) | v1 · v2 | = 3 ≤ 5 = 5 5 = ∥ v1 ∥ ∥ v2 ∥; angle: cos−1 35 ≈ .9273.
2.4. ♡ Given an inner product on R n , define the corresponding (non-Euclidean) angle θ
between two nonzero vectors 0 ̸= v, w ∈ R n by the formula ⟨ v, w ⟩ = ∥ v ∥ ∥ w ∥ cos θ. Prove
that the Law of Cosines holds in general:

∥ v − w ∥2 = ∥ v ∥2 + ∥ w ∥2 − 2 ∥ v ∥ ∥ w ∥ cos θ. (2.31)

Solution:
∥ v − w ∥2 = ⟨ v − w, v − w ⟩ = ∥ v ∥2 − 2 ⟨ v, w ⟩ + ∥ w ∥2 = ∥ v ∥2 + ∥ w ∥2 − 2 ∥ v ∥ ∥ w ∥ cos θ.

T T
3.1. ♡ (a ) Find a ∈ R such that ( 2, a, −3 ) is orthogonal to ( −1, 3, −2 ) .
T T
(b) Is there any value of a for which ( 2, a, −3 ) is parallel to ( −1, 3, −2 ) ?
Solution: (a ) a = − 34 ; (b) no.
6 Chapter 2. Inner Product, Orthogonality, Norm

T T
3.2. ♡ Find all vectors in R 3 that are orthogonal to both ( 1, 2, 3 ) and ( −2, 0, 1 ) .
T
Solution: All scalar multiples of ( 2, −7, 4 ) .

3.5. Using the dot product, classify the following pairs of vectors in R 2 as
(i ) basis, (ii) orthogonal basis, and/or (iii) orthonormal basis:
   
           
−1 2 √1 − √1
−1 2 2 1
(a )♡ , ; (b)♢  2  ,  2 
; (c ) , ; (d )♡ , ;
2 1 √1 √1 −1 2 3 −6
2 2
    ! !
−1 0 3
− 4
(e ) ♢ , ; (f ) 5 , 5 .
0 3 4 3
5 5

Solution: (a ) Orthogonal basis; (d) basis.

3.6. Repeat Exercise 3.5, but use the weighted inner product ⟨ v, w ⟩ = v1 w1 + 19 v2 w2 instead
of the dot product.
Solution: (a ) Basis; (d ) orthogonal basis.
3.8. ♡ Suppose that u1 , . . . , un form an orthonormal basis of R n . Prove that the inner
product between two vectors v = c1 u1 + · · · + cn un and w = d1 u1 + · · · + dn un is equal to
the dot product of their coordinates: ⟨ v, w ⟩ = c1 d1 + · · · + cn dn .
Solution: Using bilinearity,
* n +
X X
n X
n X
n
⟨ v, w ⟩ = ci ui , dj uj = ci dj ⟨ ui , uj ⟩ = ci di . ■
i=1 i=1 i,j = 1 i=1

T
4.1. Using the dot product on R 3 , given v = ( 1, 1, 1 ) find its orthogonal projection onto
 T
and distance to the following subspaces: (a ) ♡ the line in the direction − √13 , √13 , √13 ;
T T T
(b ) the line spanned by ( 2, −1, 3 ) ; (c) ♢ the plane spanned by ( 1, 1, 0 ) , ( −2, 2, 1 ) .
 
− 31 √
  2 2
 
Solution: (a) projection:  1 , distance: .
3 3
1
3

4.2. Redo Exercise 4.1 using the weighted inner product ⟨ v, w ⟩ = 2 v1 w1 + 2 v2 w2 + 3 v3 w3 .


 
− 73 r
  10

Solution: (a) projection:  3  , distance: 2 .
7  7
3
7
Chapter 2. Inner Product, Orthogonality, Norm 7

5.1. Use the first version of the Gram–Schmidt process to determine an orthonormal basis
for R 3 with the dot product starting with the following sets of vectors:
                 
1 1 −1 1 0 1 1 4 2
(a)♡  0 ,  1 ,  2 ; (b)♡  1 ,  1 ,  0 ; (c)♢  2 ,  5 ,  3 .
1 1 1 0 −1 −1 3 0 −1
T T T
Solution: (a ) √1
2
( 1, 0, 1 ) , ( 0, 1, 0 ) , √12 ( −1, 0, 1 ) ;
T T T
(b ) √1
2
( 1, 1, 0 ) , √1
6
( −1, 1, −2 ) , √1
3
( 1, −1, −1 ) .

5.2. Apply the Gram–Schmidt process to the following sets of vectors using the dot product
on R 4 . Which produce an orthonormal basis?
T T T T
(a) ♡ ( 1, 0, 1, 0 ) , ( 0, 1, 0, −1 ) , ( 1, 0, 0, 1 ) , ( 1, 1, 1, 1 ) ;
T T T T
(b) ( 1, 0, 0, 1 ) , ( 4, 1, 0, 0 ) , ( 1, 0, 2, 1 ) , ( 0, 2, 0, 1 ) ;
T T T T
(c) ♢ ( 1, −1, 0, 1 ) , ( 0, −1, 1, 2 ) , ( 2, −1, −1, 0 ) , ( 2, 2, −2, 1 ) .
 T  T T T
Solution: (a) √12 , 0, √12 , 0 , 0, √12 , 0, − √12 , 12 , 12 , − 21 , 12 , − 12 , 12 , 12 , 12 .

5.4. Use the Gram–Schmidt process to construct an orthonormal basis under the dot product
T T
for the following subspaces of R 3 : (a ) ♡ the plane spanned by ( 0, 2, 1 ) , ( 1, −2, −1 ) ;
(b )♢ the plane defined by the equation 2 x − y + 3 z = 0; (c ) the set of all vectors orthogonal
T
to ( 1, −1, −2 ) .
 T
T
Solution: (a ) 0, √25 , √15 , ( 1, 0, 0 ) .

5.5. Redo Exercises 5.1 and 5.4 using the weighted inner product
⟨ v, w ⟩ = 3 v1 w1 + 2 v2 w2 + v3 w3 .
T T T
Solution: 5.1 (a ) 1
2
√1 ( 0, 1, 0 ) , √
( 1, 0, 1 ) ,
2
1
2 3
( −1, 0, 3 ) ;
T T T
(b) √15 ( 1, 1, 0 ) , √155 ( −2, 3, −5 ) , √166 ( 2, −3, −6 ) .
T T T T
5.4 (a ) 1
3 ( 0, 2, 1 ) ,
√1
130
( 4, 3, −8 ) ; (b) √111 ( 1, 2, 0 ) , √7151
( −12, 9, 11 ) .
T
5.6. ♡ Using the dot product on R 3 , find the orthogonal projection of the vector ( 1, 3, −1 )
T T
onto the plane spanned by ( −1, 2, 1 ) , ( 2, 1, −3 ) by first using the Gram–Schmidt process
to construct an orthonormal basis.

Solution:  T  T T
Orthonormal basis: − √16 , √26 , √16 3
, 5√ 4
, √
2 5 2
, − √12 ; projection: 15 , 15 , − 3
8 44 4
.
8 Chapter 2. Inner Product, Orthogonality, Norm

6.1. Using the dot product on R 3 , find the orthogonal complement V ⊥ of the subspaces
V ⊂ R 3 spanned by the indicated vectors. What is the dimension of V ⊥ in each case?
               
3 1 2 1 2 1 1 0
(a) ♡  −1 , (b ) ♡  2 ,  0 , (c )  2 ,  4 , (d ) ♢  1 ,  0 ,  1 .
1 3 1 3 6 0 1 1
 
Solution: 1  1 − 1
3 −3  2 
(a ) V ⊥ has basis  1 ,  0 , dim V ⊥ = 2; (b ) V ⊥ has basis  5  ⊥
 − 4 , dim V = 1.
0 1
1
6.2. Use the dot product to decompose each of the following vectors with respect to the
indicated subspace as b = p + q, where p ∈ V, q ∈ V ⊥ .
        
0 x 1 −3
(a ) ♡ b = , V = 3 x + 2 y = 0 ; (b ) ♢ b = , V = span ;
1 y 2 1
       
1  1  2 1 
(c ) b =  0 , V = x − y + z = 0 ; (d) b =  2 , V = span  2 ,  0  .
 
0 1 1 1
! !
− 13
6 6
13
Solution: (a ) p = , q= .
9 4
13 13

6.3. Find an orthonormal basis under the dot product for the orthogonal complement of the
following subspaces of R 3 : (a ) ♡ the plane 3 x + 4 y − 5 z = 0; (b) the plane spanned by
T T T
( 1, −1, 3 ) , ( 2, 0, −1 ) ; (c ) ♢ the line in the direction ( −2, 1, 3 ) .
 
3
1
Solution: (a ) √  4 .
5 2 −5

6.5. ♡ Prove that if V1 ⊂ V2 ⊂ R n are subspaces, then V1⊥ ⊃ V2⊥ .


Solution: If z ∈ V2⊥ then ⟨ z, w ⟩ = 0 for every w ∈ V2 . In particular, since V1 ⊂ V2 , every
w ∈ V1 belongs to V2 , and hence z is orthogonal to every vector w ∈ V1 . Thus, z ∈ V1⊥ ,
proving V2⊥ ⊂ V1⊥ . ■

7.1. Compute the 1, 2, 3, and ∞ norms of the following vectors, and then verify the triangle
inequality in each case.
       
        1 −1 1 2
1 0 2 1
(a) ♡ , ; (b ) , ; (c )  0  ,  1 ; (d) ♢  −2  ,  −1 .
0 1 −1 −2
−1 0 −1 −3
Solution: (a ) ∥ v + w ∥1 = 2 ≤ 1 + 1 = ∥ v ∥1 + ∥ w ∥1 ,

∥ v + w ∥2 = 2 ≤ 1 + 1 = ∥ v ∥2 + ∥ w ∥2 ,

∥ v + w ∥3 = 3 2 ≤ 1 + 1 = ∥ v ∥3 + ∥ w ∥3 ,
∥ v + w ∥∞ = 1 ≤ 1 + 1 = ∥ v ∥∞ + ∥ w ∥∞ .
Chapter 2. Inner Product, Orthogonality, Norm 9

T
7.2. Find a unit vector in the same direction as v = ( 1, 2, −3 ) for (a ) ♡ the Euclidean
norm, (b )♢ the weighted norm ∥ v ∥2 = 2 v12 + v22 + 13 v32 , (c )♡ the 1 norm, (d) the ∞ norm.
 T T
√1 , √2 , − √3
6, 3, −2
1 1 1
Solution: (a ) 14 14 14
; (c) .

T T T
7.3. Which two of the vectors u = ( −2, 2, 1 ) , v = ( 1, 4, 1 ) , w = ( 0, 0, −1 ) are closest
in distance for (a) ♡ the Euclidean norm? (b) ♢ the 1 norm? (c ) the ∞ norm?
√ √ √
Solution: (a ) ∥ u − v ∥2 = 13, ∥ u − w ∥2 = 12, ∥ v − w ∥2 = 21, so u, w are closest.
7.5. Prove that the following formulas define norms on R 2 :
p p
(a) ♡ ∥ v ∥ = 2 v12 + 3 v22 , (b ) ∥ v ∥ = 2 v12 − v1 v2 + 2 v22 , (c) ♡ ∥ v ∥ = 2 | v1 | + | v2 |,
 
(d) ∥ v ∥ = max 2 | v1 |, | v2 | , (e ) ♢ ∥ v ∥ = max | v1 − v2 |, | v1 + v2 | .
Solution: (a) Comes from weighted inner product ⟨ v, w ⟩ = 2 v1 w1 + 3 v2 w2 .
(c) Clearly positive; ∥ c v ∥ = 2 | c v1 | + | c v2 | = | c | 2 | v1 | + | v2 | = | c | ∥ v ∥;
∥ v + w ∥ = 2 | v1 + w1 | + | v2 + w2 | ≤ 2 | v1 | + | v2 | + 2 | w1 | + | w2 | = ∥ v ∥ + ∥ w ∥.
p
7.6. Which of the following formulas define norms on R 3 ? (a) ♡ ∥ v ∥ = 2 v12 + v22 + 3 v32 ,
p
(b) ♡ ∥ v ∥ = v12 + 2 v1 v2 + v22 + v32 , (c) ♢ ∥ v ∥ = max{ | v1 |, | v2 |, | v3 | },
(d) ∥ v ∥ = | v1 − v2 | + | v2 − v3 | + | v3 − v1 |, (e ) ∥ v ∥ = | v1 | + max{ | v2 |, | v3 | }.
T
Solution: (a ) Norm; (b) not a norm since, for instance, ∥ ( 1, −1, 0 ) ∥ = 0.
7.7. ♡ Prove that any norm on R n satisfies the reverse triangle inequality

∥x + y∥ ≥ ∥x∥ − ∥y∥ for all x, y ∈ R n . (2.79)

Solution: The regular triangle inequality yields ∥ x ∥ = ∥ x + y − y ∥ ≤ ∥ x + y ∥ + ∥ y ∥,


which implies ∥ x + y ∥ ≥ ∥ x ∥ − ∥ y ∥. Swapping x and y, we also have ∥ x + y ∥ ≥ ∥ y ∥ − ∥ x ∥.
Combining these proves (2.79). ■
7.10. ♡ True or false: If ∥ v + w ∥ = ∥ v ∥ + ∥ w ∥, then v, w are parallel vectors.
Solution: True for an inner product norm, but false in general. For example,
∥ e1 + e2 ∥1 = 2 = ∥ e1 ∥1 + ∥ e2 ∥1 .
7.11. ♡ How many unit vectors are parallel to a given vector v ̸= 0? (a ) 0, (b ) 1, (c ) 2,
(d) 3, (e) ∞, (f ) depends on the norm. Explain your answer.
Solution: 2 vectors, namely u = v/∥ v ∥ and − u = − v/∥ v ∥.
7.14. Check the validity of the inequalities (2.74) for the particular vectors
T T T
(a) ♡ ( 1, −1 ) , (b ) ♢ ( 1, 2, 3 ) , (c ) ( 1, 1, 1, 1 ) .
√ √ √
Solution: (a ) ∥ v ∥2 = 2 , ∥ v ∥∞ = 1, and √12 2 ≤ 1 ≤ 2 .

7.17. Compute the cosine distance between the pairs of vectors in Exercise 7.1.
Solution: (a ) 1.
Chapter 3

Matrices

   
1 −1 3   2 3
−6 0 3
1.2. Let A =  −1 4 −2 , B = , C =  −3 −4 .
4 2 −1
3 0 6 1 2

Compute the indicated combinations where possible. (a ) ♡ 3 A − B, (b ) A B, (c ) ♡ B A,


(d ) ♡ (A + B) C, (e ) A + B C, (f ) ♢ A + 2 C B , (g ) A2 − 3 A + I , (h) (B − I ) (C + I ).
 
3 6 0
Solution: (a ) Undefined, (c ) , (d) undefined.
−1 4 2

1.3. Which of the following pairs of matrices commute under matrix multiplication?
         
−1 1 2 2 3 1 2 3 −2
(a) ♡ , ( 4 3 ), (b) , , (c) ♡ , ,
1 −2 1 5 0 2 1 −2 3
     
3 −1   3 0 −1 2 0 −1
  4 2 −2
(d) ♢ 0 2 , , (e)  −2 −1 2 ,  1 1 −1 .
5 2 4
1 4 2 0 0 2 0 −1

Solution: (a ) Do not commute; (c) commute.

1.6. ♡ Find a nonzero matrix A ̸= O such that A2 = O.


 
0 1
Solution: For example, A = .
0 0

1.8. (a ) ♡ Let A be an m × n matrix. Let ej ∈ R n denote the j-th standard basis vector.
Explain why the product A ej equals the j-th column of A. (b ) ♢ Similarly, let bei ∈ R m be
the i-th standard basis vector. Explain why the triple product b
eTiA ej = aij equals the (i, j)
entry of the matrix A.

Solution: (a) The i-th entry of A ej is the product of the i-th row of A with ej . Since all
the entries in ej are zero except the j-th entry the product will be equal to aij , i.e., the (i, j)
entry of A, which are the entries of its j-th column. ■
Chapter 3. Matrices 11

1.9. ♡ Prove that A v = 0 for every vector v (with the appropriate number of entries) if and
only if A = O is the zero matrix.
Solution: Let v = ej . Then A v is the same as the j-th column of A. Thus, the hypothesis
implies all columns of A are 0 and hence A = O. ■

1.13. Write out the following diagonal matrices: (a ) ♡ diag (1, 0, −1), (b ) diag (2, −2, 3, −3).
 
1 0 0
Solution: (a )  0 0 0 .
0 0 −1
1.15. The trace of a square matrix A ∈ Mn×n is defined to be the sum of its diagonal entries:

tr A = a11 + a22 + · · · + ann . (3.12)

Let A, B, C be n × n matrices. Prove that the trace satisfies the following identities:
(a ) ♡ tr (A + B) = tr A + tr B; (b) ♡ tr (A B) = tr (B A); (c ) ♢ tr (A B C) = tr (CA B) =
tr (B CA). On the other hand, find an example where tr (A B C) ̸= tr (A CB). (d) Is part
(b) valid if A has size m × n and B has size n × m?
X
n X
n X
n
Solution: (a ) tr (A + B) = (aii + bii ) = aii + bii = tr A + tr B. ■
i=1 i=1 i=1
X
n X
n X
n
(b ) The diagonal entries of A B are aij bji , so tr (A B) = aij bji ; the diagonal
j =1 i=1 j =1
X
n n X
X n
entries of B A are bji aij , so tr (B A) = bji aij . These are clearly equal. ■
i=1 i=1 j =1

1.17. The naïve way to “multiply” matrices is known as the Hadamard product, and is oc-
casionally useful. More specifically, given two m × n matrices A, B, necessarily of the same
size, their Hadamard product is the m × n matrix C = A ◦ B whose (i, j) entry is merely the
product of the (i, j) entries of A and B, so cij = aij bij .
(a) ♡ Prove that the Hadamard product is commutative: A ◦ B = B ◦ A.
(b) Which of the matrix arithmetic properties does the Hadamard product satisfy?
(c) ♡ What is the multiplicative identity for the Hadamard product?
(d) Let D = diag d be a diagonal matrix. Show that D x = d ◦ x.
(e) ♢ Let x, y, z ∈ Rn . Prove the Hadamard product vector identities
(i ) (x ◦ y) · z = (x ◦ z) · y, (ii) (x xT ) ◦ (y yT ) = (x ◦ y) (x ◦ y)T .
Solution:
(a) The (i, j) entry of A ◦ B is aij bij = bij aij , which equals the (i, j) entry of B ◦ A. ■
(c) The multiplicative identity for the Hadamard product is the m × n matrix E that has
all 1’s as entries: A ◦ E = A = E ◦ A.
12 Chapter 3. Matrices

2.1. Write down the transpose of the following matrices:    


      1 2 1 2 −1
1 1 2 1 2 −1
(a) ♡ , (b ) , (c ) ♡ , (d)  3 4 , (e ) ♢  0 3 2 .
5 2 1 2 0 2
5 6 1 1 5
 
1 2
Solution: (a) ( 1 5 ), (c )  2 0 .
−1 2
2.4. ♡ Let A be a square matrix. Prove that A + AT is symmetric.
Solution: (A + AT )T = AT + (AT )T = AT + A = A + AT . ■
2.6. If v, w are column vectors with the same number of entries, does
(a) ♡ vT w = wT v? (b ) ♢ v wT = w vT ?
Solution: (a) Yes, since the expression is a scalar, so vT w = (vT w)T = wT (vT )T = wT v.
2.8. ♡ Let A = ( v1 . . . vn ) be an m × n matrix with the indicated columns. Prove that
the trace (see Exercise 1.15) of the symmetric matrix AT A equals the sum of the squared
Euclidean norms of the columns of A, i.e., tr(AT A) = ∥ v1 ∥22 + · · · + ∥ vn ∥22 .
Solution: The diagonal entries of AT A are viT vi = ∥ vi ∥22 . Summing them from i = 1, . . . , n
produces the formula. ■
2.9. Suppose R, S are symmetric matrices. Prove that (a ) ♡ their sum R + S is symmetric;
(b ) ♢ their product R S is symmetric if and only if R and S commute: R S = S R.
Solution: (a ) (R + S)T = RT + S T = R + S.

3.1. For each of the following linear systems, write down the coefficient matrix A and the
vectors x and b.
p + 3 q − 3 r = 0,
2 u − v + 2 w = 2,
x − y = 7, 6 u + v = 5, q − r = 1,
(a )♡ (b ) (c )♢ (d)♡ − u − v = 1,
x + 2 y = 3; 3 u − 2 v = 5; 2 p − q + 3 r = 3,
3 u − 2 w = 1.
2 p − 5 r = −1;
     
1 −1 x 7
Solution: (a) A = , x= , b=
1 2 y 3
     
2 −1 2 u 2
(d) A =  −1 −1 0 , x =  v , b =  1 .
3 0 −2 w 1
3.2. Write out and solve the linear systems corresponding to the indicated matrix, vector of
     
1 −1 x −1
unknowns, and right-hand side. (a ) ♡ A = , x= , b= ;
2 3 y −3
   
      1 0 1 u
0 −4 c 4
(b) ♢ A = , x= , b= ; (c ) ♡ A =  1 1 0  , x =  v  ,
5 1 d 1
0 1 1 w
       
−1 3 0 −1 x1 1
b =  −1  ; (d ) A =  −2 −1 0  , x =  x2  , b =  0 .
2 0 −3 0 x3 1
Chapter 3. Matrices 13

Solution: (a) x − y = −1, 2 x + 3 y = −3. The solution is x = − 56 , y = − 15 .


(c) u + w = −1, u + v = −1, v + w = 2. The solution is u = −2, v = 1, w = 1.
3.3. Write out the linear system that determines whether the following sets of vectors are
linearly independent or dependent. Then determine which of the two possibilities holds.
   
              1 0
1 2 1 −2 2 −1 5
(a) ♡ , , (b ) , , (c ) ♡ , , , (d)  3 ,  2 ,
2 1 3 −6 1 3 2
              −2 −1
0 1 3 2 1 2 0
(e) ♢  1  ,  −1  ,  −1  , (f)  1  ,  −2  ,  −3  ,  −1 .
1 0 2 3 1 0 4
     
1 −1 x1 0
Solution: (a ) A = , x= , b= , linearly independent;
1 2 x2  0
  x1  
2 −1 5   0
(c ) A = , x = x2 , b = , linearly dependent.
1 3 2 0
x3
3.4. For each of the corresponding sets of vectors in Exercise 3.3, write out the linear system
that determines whether the indicated vector lies in their span. Then determine whether or
not this holds.      
      1 1 2
1 2 −1
(a) ♡ , (b) , (c ) , (d ) ♡  0 , (e) ♢  1  , (f )  −2  .
0 1 −1
0 1 4
     
1 −1 x1 1
Solution: (a ) A = , x= , b= , in span;
1 2 x2 0
   
1 0   1
x1
(d) A =  3 2 , x = , b =  0 , not in span.
x2
−2 −1 0

4.1. Find a basis, if it exists, of the image and the kernel of the following matrices:
 
    1 2 3
8 −4 1 −1 2
(a) ♡ ( 2 −1 5 ), (b) , (c ) ♡ , (d) ♢  0 4 5 .
−6 3 −2 2 −4
0 0 6
      1
1 0   1 2
1
Solution: (a ) Image: (1); kernel:  2  ,  5 . (c ) Image: ; kernel  1  ,  0 .
−1
0 1 0 1
4.5. ♡ True or false: If A is a square matrix, then ker A ∩ img A = {0}.
   
1 1 1
Solution: False. For example, if A = then is in both ker A and img A.
−1 −1 −1
4.8. ♡ True or false: If ker A = ker B, then rank A = rank B.
Solution: True. If ker A = ker B ⊂ R n , then both matrices have n columns, and so
n − rank A = dim ker A = dim ker B = n − rank B.
14 Chapter 3. Matrices

     
1 2 −1 5 1
5.2. ♡ Let A =  2 5 −1 . Given that x⋆1 =  −1  solves A x = b1 =  3  and
1 3 2 2 6
     
−11 0 2
x⋆2 =  5  solves A x = b2 =  4 , find a solution to A x = 2 b1 + b2 =  10 .
−1 2 14
T
Solution: x⋆ = 2 x⋆1 + x⋆2 = ( −1, 3, 3 ) .

6.1. Verify by direct multiplication that the following matrices are inverses:
   
    2 1 1 3 −1 −1
2 3 −1 −3
(a) ♡ , ; (b)  3 2 1 ,  −4 2 1 .
−1 −1 1 2
2 1 2 −1 0 1
       
2 3 −1 −3 1 0 −1 −3 2 3
Solution: (a) = = .
−1 −1 1 2 0 1 1 2 −1 −1

6.5. ♡ Prove that a diagonal matrix D = diag (d1 , . . . , dn ) is invertible if and only if all its
diagonal entries are nonzero, in which case D−1 = diag (1/d1 , . . . , 1/dn ).
Solution: If all the diagonal entries are nonzero, then D−1 D = I . On the other hand,
if one of diagonal entries is zero, then all the entries in that row are zero, and so D is not
invertible. ■
6.7. ♡ (a ) Prove that the inverse transpose operation (3.53) respects matrix multiplication:
   
−T −T −T 1 −1 2 1
(A B) = A B . (b) Verify this identity for A = , B= .
1 0 1 1
−T T −1 T T −1 T −1 T −1 −T −T
Solution: (a)  = ((A B) ) = (B A ) = (A ) (B ) = A B .
 (A B) ■
1 0 1 −2
(b) A B = , so (A B)−T = ,
2 1 0 1
     
−T 0 −1 −T 1 −1 −T −T 1 −2
while A = ,B = , so A B = . ■
1 1 −1 2 0 1

7.1. ♡ (a ) Show that the function R : R 2 → R 2 that rotates vectors in the plane by 90◦ is
linear and find its matrix representative.
(b) Answer the same question for rotation by a specified angle θ.
Solution: Rotations preserve parallelograms (vector addition)
 and stretching
 of vectors

0 −1 cos θ − sin θ
(scalar multiplication) and hence define linear maps. (a ) ; (b) .
1 0 sin θ cos θ
7.5. ♡ Prove that an affine function maps parallel lines to parallel lines.
Solution: Given parallel lines parametrized by c+t v and d+s v, and an affine map (3.63),
the image lines are parametrized by (A c + b) + t A v and (A d + b) + s A v, which are also
parallel. ■
Chapter 3. Matrices 15

7.7. Suppose we identify Mm×n ≃ R m n . (a) ♡ Show that, for a fixed k × m matrix B,
matrix multiplication L[ A ] = BA defines a linear function L : R m n → R k m . (b ) Show that,
2
similarly, the trace tr A of a square matrix A ∈ Mn×n defines a linear function tr : R n → R.

Solution: (a ) For any scalars c, d and matrices A, C, we have

L[ c A + d C ] = B (c A + d C) = c BA + d BC = c L[ A ] + d L[ C ],

proving linearity. ■
Chapter 4

How Matrices Interact with


Inner Products and Norms

1.1. Are the following matrices are positive definite? In the positive definite cases, write
down its Cholesky factorization and the formula for the associated inner product.
 
      1 1 2
1 0 1 1 1 −1
(a) ♡ , (b) ♡ , (c ) ♢ , (d)  1 2 1 ,
0 2 1 1 −1 3
2 1 1
     
2 1 1 1 −1 1 1 1
1 1 1
1 2 1 1  1 −1 1 1 
(e ) ♡  1 2 −2 , (f )  , (g) ♢  .
1 1 2 1 1 1 −1 1
1 −2 4
1 1 1 2 1 1 1 −1
Solution: (a ) Positive definite; (b) not positive definite; (e) not positive definite.
 
1 2
1.3.♡ Let C = . Prove that the associated quadratic form q(x) = xT C x is indefinite
2 3
by finding a point x+ where q(x+ ) > 0 and a point x− where q(x− ) < 0.
Solution: For instance, q(1, 0) = 1, while q(2, −1) = −1. ■

1.5. (a ) ♡ Prove that the sum of two positive definite matrices is positive definite.
(b) More generally, prove that the sum of a positive definite matrix and a positive semidef-
inite matrix is positive definite.
(c) ♢ Can the sum of two positive semidefinite matrices be positive definite?
(d) Give an example of two matrices that are not positive definite or semidefinite, but
whose sum is positive definite.

Solution: (a) If H, K > 0, then xT H x > 0 and xT K x > 0 for all x ̸= 0, hence
xT (H + K)x = xT H x + xT K x > 0 for all x ̸= 0. ■
Chapter 4. How Matrices Interact with Inner Products and Norms 17

1.9. ♡ (a ) Show that every diagonal entry of a positive definite matrix must be strictly
positive. (b) Write down a symmetric matrix with all positive diagonal entries that is not
positive definite. (c ) Find a nonzero matrix with one or more zero diagonal entries that is
positive semidefinite.
 
1 2
Solution: (a ) cii = eTi C ei > 0. ■ (b) For example, C = is not positive definite
  2 1
1 0
or even semidefinite. (c ) For example, C = .
0 0

1.13. ♡ Let A be an n × n matrix. Prove that xT A x = xT S x, where S = 12 (A + AT ) is a


symmetric matrix. Therefore, we do not lose any generality by restricting our discussion to
quadratic forms that are constructed from symmetric matrices.
Solution: Since q(x) = xT A x is a scalar q(x) = q(x)T = (xT A x)T = xT AT x, and hence
q(x) = 21 (xT A x + xT AT x) = xT S x. ■

2.1. Find the Gram matrix corresponding to each of the followingsets of  vectors
 using the
−1 0
Euclidean dot product on R . Which are positive definite? (a) ♡
n
, ,
3 2
         
      2 −3 1 1 0
1 −2 −1
(b ) ♢ , , , (c) ♡  1 ,  0 , (d ) ♢  1 ,  0 ,  1 ,
2 3 −1
−1 2 0 1 1
               
1 −1 1 −2 −1
1 2 −1
 0   1  2  1   3 
(e ) ♡  −2 ,  −1 ,  −1 , (f)  ,  , (g)  ,  ,  .
−1 0 3 −4 −1
2 1 1
0 1 4 3 −2
   
10 6 6 −8
Solution: (a ) ; positive definite. (c ) ; positive definite. (e )
  6 4 −8 13
9 6 3
 6 6 0 ; positive semidefinite.
3 0 3
2.2. Recompute the Gram matrices for cases (c–e ) in the previous exercise using the weighted
inner product ⟨ x, y ⟩ = x1 y1 + 2 x2 y2 + 3 x3 y3 . Does this change their positive definiteness?
 
  21 12 9
9 −12
Solution: (c) ; (e)  12 9 3 .
−12 21
9 3 6
Positive definiteness doesn’t change, since it only depends upon the linear independence of
the vectors.
18 Chapter 4. How Matrices Interact with Inner Products and Norms

2.3. Express the following as Gram matrices or explain why this is not possible.
   
      1 1 1 9 3 3
2 3 4 −1 3 2
(a) ♡ , (b) ♡ , (c ) ♢ , (d)  1 0 1 , (e) ♢  3 2 2 .
3 4 −1 1 1 4
1 1 1 3 2 6
  
2 √0 −1/2
2 √
Solution: (a ) Not positive (semi)definite; (b ) .
−1/2 3/2 0 3/2

2.5. ♡ (a ) Prove that if C is a positive definite matrix, then C 2 is also positive definite.
(b) More generally, if S is symmetric and nonsingular, then S 2 is positive definite.
Solution: (a) is a special case of (b) since positive definite matrices are symmetric.
(b) By Theorem 4.12 if S is any symmetric matrix, then S T S = S 2 is always positive
semidefinite, and positive definite if and only if ker S = {0}, i.e., S is nonsingular. In partic-
ular, if S = H > 0, then ker H = {0} and so H 2 > 0. ■

3.1. Choose one from the following list of inner products on R 3 for both the domain and
 
1 1 0
codomain, and find the adjoint of A =  −1 0 1  : (a) ♡ the Euclidean dot product;
0 −1 2
(b ) ♡ the weighted inner product ⟨ v, w ⟩ = v1 w1 + 2 v2 w2 + 3 v3 w3 ; (c) ♢ the inner product
 
2 1 0
⟨ v, w ⟩ = vT C w defined by the positive definite matrix C =  1 2 1 .
  0 1 2
  1 −2 0
1 −1 0  
Solution: (a )  1 0 −1 , (b )  2
1
0 − 32 ,.
0 1 2 2
0 3 2
3.4. ♡ Consider the weighted inner product ⟨ v, w ⟩ = v1 w1 + 12 v2 w2 + 13 v3 w3 on R 3 .
(a ) What are the conditions on the entries of a 3 × 3 matrix A in order that it be self-adjoint?
(b ) Write down an example of a non-diagonal self-adjoint matrix.
 
0 1 1
Solution: (a ) a12 = 21 a21 , a13 = 13 a31 , 12 a23 = 13 a32 , (b)  2 1 2 .
3 3 2

3.6. Prove the following adjoint identities: (a )♡ (A+B)∗ = A∗ +B ∗ , (b )♢ (A B)∗ = B ∗ A∗ ,


(c) (c A)∗ = c A∗ for c ∈ R, (d) ♡ (A∗ )∗ = A, (e) ♢ (A−1 )∗ = (A∗ )−1 .
Solution: (a) ⟨ x, (A + B)∗ y ⟩C = ⟨ (A + B) x, y ⟩K = ⟨ A x, y ⟩K + ⟨ B x, y ⟩K
= ⟨ x, A∗ y ⟩C + ⟨ x, B ∗ y ⟩C = ⟨ x, (A∗ + B ∗ ) y ⟩C .
Since this holds for all x, y, we conclude that (A + B)∗ = A∗ + B ∗ . ■
(d) ⟨ (A∗ )∗ x, y ⟩ = ⟨ x, A∗ y ⟩ = ⟨ A x, y ⟩ .
K C K ■
Chapter 4. How Matrices Interact with Inner Products and Norms 19

3.8. ♡ Let C, K be positive definite matrices defining inner products on R n and R m , respec-
tively. Let A be an m × n matrix with adjoint A∗ . Prove that x solves the inhomogeneous
linear system A x = b if and only if

⟨ x, A∗ y ⟩C = ⟨ b, y ⟩K for all y ∈ Rm. (4.27)

Remark: Equation (4.27) is known as the weak formulation of the linear system. Its gener-
alizations plays an essential role in the analysis of differential equations and their numerical
approximations, [180, 192, 225].
Solution: By the definition of the adjoint, ⟨ x, A∗ y ⟩C = ⟨ A x, y ⟩K = ⟨ b, y ⟩K . Since this
holds for all y ∈ R m , it is equivalent to the system A x = b. ■

4.1. For each of the following matrices find bases (when they exist) for the
(i ) image, (ii ) coimage, (iii ) kernel, and (iv) cokernel.
   
    0 0 −8 1 1 3 1
1 −3 1 −3 1
(a) ♡ , (b ) ♡ , (c ) ♢  1 2 −1 , (d )  1 1 0 1 .
2 −6 0 2 0
2 4 6 0 0 3 0
       
1 1 3 −2
Solution: (a ) image: ; coimage: ; kernel: ; cokernel: .
2 −3 1 1
     
    1 0 −1
1 −3
(b ) image: , ; coimage:  −3 ,  2 ; kernel:  0 ; cokernel: {0} (no basis).
0 2
1 2 1
 
1 1 0
4.3. Find bases for the coimage and cokernel of the matrix  1 0 −1  using
0 1 1
(a ) ♡ the dot product; (b) ♢ the weighted inner product ⟨ v, w ⟩ = v1 w1 + 2 v2 w2 + 3 v3 w3 .
Make sure that the dimensions satisfy the formulas in the Fundamental Theorem 4.24.
     
1 1 −1
Solution: (a ) coimage:  1 ,  0 ; cokernel:  1 .
0 −1 1

4.5. ♡ True or false: nullity A = nullity A∗ .


Solution: True if A is square; otherwise false.

 
1 1
5.1. Determine which of the following are orthogonal matrices: (a ) ♡ ,
−1 1
   
!   − 13 2 2 1 1 1
12 5 0 1 0  3 3   2 3 4 
(b) 13 13 , (c) ♡  −1 0 0 , (d)   3
2
− 13 2 , (e) ♢ 
 
1 1 1 .

3 3 4 5
− 135 12
13 0 0 −1 2
3
2
3 − 31 1
4
1
5
1
6

Solution: (a ) Not orthogonal; (c ) orthogonal.


20 Chapter 4. How Matrices Interact with Inner Products and Norms

5.4. True or false:


(a) ♡ A matrix whose columns form an orthogonal basis of R n is an orthogonal matrix.
(b) ♢ A matrix whose rows form an orthonormal basis of R n is an orthogonal matrix.
(c) An orthogonal matrix is symmetric if and only if it is a diagonal matrix.

Solution: (a ) False — they must be an orthonormal basis.


5.5. Which of the indicated maps define isometries of the Euclidean plane?
       
y x−y+1 1 x+y−3 1 3x + 4y
(a ) ♡ , (b ) ♢ , (c ) ♡ √ , (d ) .
−x x+2 2 x+y−2 5 −4x + 3y + 1
Solution: (a ) Isometry, (c) not an isometry.
5.6. Which of the following matrices are Euclidean
norm-preserving?
  
  1 1 2 √1
  1 0 3
1 0 0  3 3   
2
(a ) ♡ , (b ) ♡  0 1 , (c) ♢  2 ,
 3 −3 
1 (d) 
−3
1
0 
.
0 1 0
0 0 1
3
1
3
− 23 √1
2

Solution: (a ) Not norm-preserving, because it maps from R 3 to R 2 ; (b) norm-preserving.

6.1. Using the dot product, write out the projection matrix corresponding to the subspaces
spanned by  1   1   1 
   1   1 
! 2 √ √ 2
 1   1
2 2
 1 
√1  3   6
  3
    −   
(a ) ♡ 2
, (b )  2   2   1   2   2
 − 3 , (c ) ♡  − √6 ,  √3 , (d ) ♢  1 ,  1 ,
 2 .
 1
√1
2
1
 2   2   −2 
√1 √1
3 6 3 − 12 1
2
1
2
 
! 1
0 1
1 1 2 2
Solution: (a ) 2 2 ; (c ) 
0 1 0
.
1 1
2 2 1 1
2 0 2

6.2. Write out the projection matrices onto the orthogonal complements of the subspaces in
Exercise 6.1.  
! 1
0 − 12
1
−2 1  2 
Solution: (a ) 2 ; (c)  0 0 0 .

−2 1 1
2
−2 0
1 1
2

6.3. ♡ Prove that a projection matrix is positive semidefinite. When is it positive definite?

Solution: Since P = U U T where U is a k × n matrix with orthonormal columns, it is the


Gram matrix corresponding to U T , and hence positive semidefinite. It is positive definite if
and only if rank U = rank U T = n which implies U has size n × n and hence is an orthogonal
matrix. Thus, P = U U T = I is the only positive definite projection matrix. ■
Chapter 4. How Matrices Interact with Inner Products and Norms 21

   
1 −3 4 3
7.1. Find the Q R factorization of the following matrices: (a ) ♡ , (b) ,
2 1 3 2
       
1 1 1 1
2 1 −1 0 1 2 0 0 2
1 2 1 0
(c ) ♡  0 1 3 , (d)  −1 1 1 , (e) ♢  0 4 1 , (f )  .
1 1 2 1
−1 −1 1 −1 1 3 −1 0 1
1 0 1 1
!    √ 
1 −3 √1 − √25 5 − √15
Solution: (a) =  5   ,
2 1 √2 √1 0 √7
5 5
   2  5√ 
2 1 −1 √ − √1 √1 5 √3 − √3
 
5 q 30 6   q5 q5 

(c)  =  0 5 
√1   0 6
7 2 
.
 0 1 3   q6 q6   5
q15 
− 1 −1 1 − √15 − 15 2 2
3 0 0 2 23

7.2. For each of the following linear systems, find the Q R factorization of the coefficient
    
1 2 x −1
matrix, and then use your factorization to solve the system: (a )♡ = ,
−1 3 y 2
         
2 1 −1 x 2 1 1 0 x 0
(b) ♢  1 0 2   y  =  −1 , (c )  −1 0 1   y  =  1 .
2 −1 3 z  0  0 −1 1 z 0
√   !
√1 √1
2 − √2 1
x − 75
Solution: (a)  2 2   , = .
− √12 √12 0 √5 y 1
2 5

7.3. Determine the rank of the following matrices using the extended Q R method:
   
    1 2 3 4
2 −5 −1
1 2 3 −1 −2  3 6 4 7
(a) ♡ , (b ) , (c) ♡  1 −6 −4 , (d ) ♢  .
−2 −4 −6 2 1 1 2 2 3
3 −4 2
  3 6 5 8
  √1 √ 
1 2 √
Solution: (a) = 5 
5 2 5 , rank = 1;
−2 −4 −√ 2
5
 
  √4 − √121 !
2 −5 −1 √ √
 14
 14 −2 14 0
(c)  1 −6 −4  = 
 √1
− √421 
 √ √ , rank = 2.
14
3 −4 2 0 21 21
√3 √2
14 21
22 Chapter 4. How Matrices Interact with Inner Products and Norms

7.4. Use the Q R method to compute the least squares solution to the linear system A x = b
       
1 1 1 0 1
when (a ) ♡ A =  2  , b =  1 ; (b) ♢ A =  2 −1 , b =  1 ;
1 0 3 2 0
       
2 1 1 0
2 1 −1 1
 1 −2 −1  0
(c ) ♡ A =  1 −2 0 , b =  0 ; (d) A =  , b =  .
1 0 −1 1
3 −1 −1 −1
5 0 1 0
 1 

 6
   q
Solution: (a ) Q =   , R = √6 ; 3
for x = 12 .
− √4
6  solve R x = QT b = 2

√1
  6
√4 − √442 √ 
 14
 14 − √314 − √514
(c ) Q = 
 − √542 
√1
, R =  ;
14
0 √15 − √342
− 42
√3√1 42
14  
  2
z
− √1  5 
solve R x = QT b =  14 
for x =  1 + 1 z , where z is a free variable, in
√5
3 5 
42
z
accordance with nullity A = 1.

Solution: The missing code in the Python notebook from this section is below.
R[0,0] = [Link](A[:,0])
Q[:,0] = A[:,0] / R[0,0]
for k in range(1,m):
#First orthogonalization
s = Q[:,:k].T@A[:,k]
v = A[:,k] - Q[:,:k]@s
#Re-orthogonalization
t = Q[:,:k].T@v
xk = v - Q[:,:k]@t

8.1. Compute (i ) the 1 matrix norm, (ii) the ∞ matrix norm, and (iii ) the Frobenius norm
of the following matrices:    
  !
5 4 1 −2 3 0 .2 .8
2 −2
(a )♡ , (b )♢ 3 3 , (c)♡  −1 0 1 ,(d )  −.3 0 .1 .
−3 5 − 76 − 56 2 −1 1 −.4 .1 0
√ √
Solution: (a ) (i ) 7, (ii) 8, (iii ) 42 ≈ 6.4807 (c) (i ) 5, (ii) 6, (iii ) 22 ≈ 4.6904.
Chapter 4. How Matrices Interact with Inner Products and Norms 23

 
1 1
8.2. Let A = . Compute the natural matrix norm ∥ A ∥ using the following norms
1 −2
on R 2 : (a)♡ the 1 norm; (b)♢ the ∞ norm; (c )♡ the weighted 1 norm ∥ v ∥ = 2 | v1 |+3 | v2 |;

(d ) the weighted ∞ norm ∥ v ∥ = max 2 | v1 |, 3 | v2 | .
Solution: (a ) 3, (c) ∥ A ∥ = 38 . The unit sphere for this norm is the diamond with
T T T
corners ± 12 , 0 , ± 0, 13 . It is mapped to the parallelogram with corners ± 21 , 12 ,
T
± 31 , − 23 , with respective norms 52 and 38 , and so ∥ A ∥ = max{ ∥ A v ∥ | ∥ v ∥ = 1 } = 38 .

8.3. ♡ Find a matrix A such that ∥ A2 ∥∞ ̸= ∥ A ∥2∞ .


   
1 1 1 2
Solution: Example: A = has ∥ A ∥∞ = 2, while A2 = has ∥ A2 ∥∞ = 3.
0 1 0 1

8.7. ♡ True or false: If ∥ A ∥F < 1, then Ak → O as k → ∞.

Solution: True, since by (4.88), ∥ Ak ∥F ≤ ∥ A ∥kF → 0 as k → ∞.


Chapter 5

Eigenvalues and Singular Values

1.1. Find the eigenvalues and eigenvectors of the following 2 × 2 matrices:


    !  
0 1 1 −2 1 − 23 3 1
(a) ♡ , (b ) ♢ , (c) , (d ) ♡ .
1 0 −2 1 1
−1 −1 1
2 6
   
1 −1
Solution: (a ) Eigenvalues: 1, −1; eigenvectors: , .
1 1
 
−1
(d) Eigenvalue: 2; eigenvector: .
1
T
1.2. Write down (a) ♡ a 2 × 2 matrix that has 0 as one of its eigenvalues and ( 1, 2 ) as a
T
corresponding eigenvector; (b) a 3 × 3 matrix that has ( 1, 2, 3 ) as an eigenvector for the
eigenvalue −1; (c ) ♡ a 4 × 4 matrix that has −4 as an eigenvalue with multiplicity 2.
 
  −4 0 0 0
0 0  0 −4 0 0 
Solution: Examples: (a ) , (c)  .
0 0 0 0 1 0
0 0 0 0
1.3. Find all eigenvalues and eigenvectors of (a ) ♡ the n × n zero matrix O; (b) the n × n
identity matrix I ; (c ) ♢ the n × n matrix E = 1 1T with every entry equal to 1..

Solution: (a) Since O v = 0 = 0 v, we conclude that 0 is the only eigenvalue; all nonzero
vectors v ̸= 0 are eigenvectors.

1.6. Let A be a square matrix. (a ) ♡ Explain in detail why every nonzero scalar multiple
of an eigenvector of A is also an eigenvector. (b ) ♡ Show that every nonzero linear combi-
nation of two eigenvectors v, w corresponding to the same eigenvalue is also an eigenvector.
(c )♢ Prove that a linear combination c v + d w, with c, d ̸= 0, of two eigenvectors correspond-
ing to different eigenvalues is never an eigenvector.

Solution: (a) If A v = λ v, then A(c v) = c A v = c λ v = λ (c v) and so c v satisfies the


eigenvector equation for the eigenvalue λ. Moreover, since v ̸= 0, also c v ̸= 0 for c ̸= 0, and
so c v is a bona fide eigenvector. ■
Chapter 5. Eigenvalues and Singular Values 25

(b) If A v = λ v, A w = λ w, then A(c v+d w) = c A v+d A w = c λ v+d λ w = λ (c v+d w).


Since v.w are linearly independent, we have c v + d w ̸= 0, and hence it is an eigenvector for
the eigenvalue λ. ■
1.8. ♡ (a ) Show that if λ is an eigenvalue of A, then λ2 is an eigenvalue of A2 . (b ) Is the

converse valid: if µ is an eigenvalue of A2 , then µ is an eigenvalue of A?
Solution: (a ) If A v = λ v then A2 v = λ A v = λ2 v, and hence v is also an eigenvector of

A with eigenvalue λ2 . ■ (b) Not necessarily; possibly − µ is the only eigenvalue.
2

1.11. True or false:


(a) ♡ If λ is an eigenvalue of both A and B, then it is an eigenvalue of the sum A + B.
(b) ♢ If v is an eigenvector of both A and B, then it is an eigenvector of A + B.
(c) If λ is an eigenvalue of A and µ is an eigenvalue of B, then λ µ is an eigenvalue of the
matrix product C = A B.
   
0 1 0 0
Solution: (a) False. For example,
 0 is an eigenvalue of both and , but
0 0 −1 0
0 1
the eigenvalues of A + B = are ±1.
−1 0

2.1. Which of the following matrices are complete? For those that are, exhibit an eigenvector
basis of R 2 . For those thatare not,
 what is the dimension
  of R spanned
of the subspace 2
  by
1 3 1 3 1 3 3 3
the eigenvectors? (a ) ♡ , (b ) ♢ , (c) ♡ , (d) .
3 1 −3 1 0 1 1 5
   
−1 1
Solution: (a) Complete: eigenvalues: −2, 4; the eigenvectors , form a basis.
1 1
(c) Not complete:
  the only eigenvalue is 1, and there is only one independent eigenvector
1
v1 = spanning a one-dimensional subspace of R 2 .
0
2.2. Find the spectral radius of the following matrices.
 Which are convergent?

! −1.7  
  .3 2.2 0 1 −1
.3 −.4 0 45  
(a) ♡ , (b ) ♢ , (c ) ♡ 
 0 −.6 4.1 , (d)  0 1
 0 .
−.2 .6 3 2
5 3 1 0 2
0 0 .5
Solution: (a ) 0.7702, convergent (c ) 0.6, convergent.
2.3. Use (5.19) to write down an explicit formula
 for the k-thpower of the
following
matrices:
    1 −1 0 1 1 2
5 2 4 1
(a ) ♡ , (b ) , (c ) ♡  0 0 1 , (d ) ♢  1 2 1 .
2 2 −2 1
0 0 −1 2 1 1
  k  !
2 1
2 −1 6 0 5 5
Solution: (a) ,
1 2 0 1 − 1 2
 5 5 
   1
−1 1 1 (−1)k 0 0  0 0 2 
(c)  −2 0 1  0 1 0   1 −1 − 2 ,
1 

2 0 0 0 0 0
  0 1 1  
1 −1 0 1 −1 −1
which is  0 0 1  if k is odd and  0 0 −1  if k > 0 is even.
0 0 −1 0 0 1
26 Chapter 5. Eigenvalues and Singular Values

2.5. ♡ Prove that if A is a complete matrix, then so is c A + d I , where c, d are any scalars.
Solution: According to Exercise 1.7, every eigenvector of A is an eigenvector of c A + d I
with eigenvalue c λ + d, and hence if A has an eigenvector basis, so does c A + d I . ■

2.9. True or false: (a ) ♡ ρ(c A) = c ρ(A), (b ) ρ(V −1 A V ) = ρ(A), (c ) ♡ ρ(A2 ) = ρ(A)2 ,


(d ) ♡ ρ(A−1 ) = 1/ρ(A), (e ) ♢ ρ(A + B) = ρ(A) + ρ(B), (f ) ρ(A B) = ρ(A) ρ(B).
Solution: (a) False: ρ(c A) = | c | ρ(A).
2
(c) True, since the eigenvalues
  of A are the squares of theeigenvalues
 of A.
1
2 0 −1 0
(d) False, e.g., if A = then ρ(A) = 2, but A = 2 and ρ(A) = 1.
0 1 0 1

3.1. Find the eigenvalues and an orthonormal eigenvector basis for the following symmetric
matrices, and then write out their spectral factorization. Use  this to determine
 which are
    1 0 4
2 6 5 −2
positive definite. (a) ♡ (b ) , (c) ♢  0 1 3 .
6 −7 −2 5
4 3 1
   
1 2 1 1
Solution: (a ) Eigenvalues: 5, −10; eigenvectors: √ ,√ ,
  5 1 5 −2 
! 2
√2 √1 5 0 √ √1
spectral factorization:  5 5   5 5 
.
√1
5
− √2
5
0 −10 √ 1
5
− √2
5

3.2. Construct a symmetric matrix that has the following eigenvalues and associated eigen-
vectors, or explain why none exists:
T T T
(a ) ♡ λ1 = −2, v1 = ( 1, −1 ) , λ2 = 1, v2 = ( 1, 1 ) , (b) ♢ λ1 = 3, v1 = ( 2, −1 ) ,
T T T
λ2 = −1, v2 = ( −1, 2 ) , (c ) λ1 = 2, v1 = ( 2, 1 ) , λ2 = 2, v2 = ( 1, 2 ) .
!
− 21 3
2
Solution: (a ) .
3
2 − 1
2

3.5. Prove the properties (b) ♡ , (c) ♢ , (d) listed on page 140.

Solution: (b) According to (4.24), Se is self-adjoint with respect to the inner product de-
fined by C provided SeT C = C S. e Thus S T = C −1/2 SeT C 1/2 = C 1/2 Se C −1/2 = S. Moreover,
if x = C 1/2 y, then x ̸= 0 if and only if y ̸= 0, hence yT S y = xT Se x > 0 for all y ̸= 0. ■

3.8. ♡ Given an inner product on R n , let u1 , . . . , un be an orthonormal basis. Prove that


they form an eigenvector basis for some self-adjoint n × n matrix S. Can you characterize all
such matrices? Under what conditions can you construct such an S that is positive definite?
Solution: The simplest is S = I . More generally, any matrix of the form S = U ∗ Λ U ,
where U = ( u1 u2 . . . un ) and Λ is any real diagonal matrix. The matrix is positive definite
if and only if Λ has all positive diagonal entries. ■
Chapter 5. Eigenvalues and Singular Values 27

3.9. ♡ Find a non-symmetric 2 × 2 matrix S with real eigenvalues that does not satisfy the
inequalities (5.62).
Solution:  
2 1
For example, S = has only the eigenvalue λ1 = 1, but Se = (2) has eigenvalue 2.
−1 0
3.11. ♡ Write down two self adjoint positive definite matrices whose Hadamard product is
not positive definite.
  1  1    
2 0 0 0 2 1 1 12
Solution: Let C = , so C −1 = 2 . Then R = 2 =
0 1 0 1 0 1 1 1 1 1
1    
0 2 −1 1 − 1
and S = 2 = 2 are both self adjoint and positive definite, but
0 1 −1 1  −1 1 
1 − 14
their Hadamard product R ◦ S = is not self adjoint for the C inner product.
−1 1
3.12. Compute the generalized eigenvalues and eigenvectors for the following matrix pairs.
Verify orthogonality of the eigenvectors under the appropriate inner product.
       
3 −1 2 0 3 1 2 0
(a) ♡ K = , C= ; (b) K = , C= ;
−1 2 0 3 1 1 0 1
   
    1 2 0 1 1 0
2 −1 2 −1
(c ) K = , C= ; (d ) ♢ K =  2 8 2  , C =  1 3 1 .
−1 4 −1 1
0 2 1 0 1 1
  1
−3
Solution: (a ) Eigenvalues: 53 , 12 ; eigenvectors: , 2 .
1 1

4.2. Write down and solve optimization principles characterizing the largest and the smallest
eigenvalues of the following positive definite matrices:
   
    3 0 −1 4 −1 −2
2 −1 4 1
(a )♡ , (b)♢ , (c )♡  0 3 0 , (d )  −1 4 −1 .
−1 3 1 4
−1 0 3 −2 −1 4

5+ 5 
Solution: (a ) = max 2 x2 − 2 x y + 3 y 2 x2 + y 2 = 1 ,
2√
5− 5 
= min 2 x2 − 2 x y + 3 y 2 x2 + y 2 = 1 ;
2

(c) 4 = max 3 x2 − 2 x z + 3 y 2 + 3 z 2 x2 + y 2 + z 2 = 1 ,
n o
2 = min 3 x2 − 2 x z + 3 y 2 + 3 z 2 x2 + y 2 + z 2 = 1 .

4.3. Write down and solve a maximization principle that characterizes the middle eigenvalue
of the matrices in parts (c) and (d) of Exercise 4.2.

Solution: (c) 3 = max 3 x2 − 2 x z + 3 y 2 + 3 z 2 x2 + y 2 + z 2 = 1, x − z = 0 .
28 Chapter 5. Eigenvalues and Singular Values

4.6. Write out optimization principles for the largest and smallest generalized eigenvalues of
the matrix pairs in Exercise 3.12.

Solution: (a) 53 = max 3 x2 − 2 x y + 2 y 2 2 x2 + 3 y 2 = 1 ,

2 = min 3 x − 2 x y + 2 y
2 2
1
2 x2 + 3 y 2 = 1 .

5.1. Find the explicit formula for the solution to the following linear iterative systems:
(a ) ♡ xk+1 = xk − 2 yk , yk+1 = − 2 xk + yk , x0 = 1, y0 = 0;
(b) ♢ xk+1 = xk − 23 yk , yk+1 = 12 xk − 16 yk , x0 = −2, y0 = 3;
(c ) xk+1 = xk − yk , yk+1 = − xk + 5 yk , x0 = 1, y0 = 0.

3k + (−1)k − 3k + (−1)k
Solution: (a ) xk = , yk = .
2 2
5.2. Use your answers from Exercise 2.3 to solve the following iterative systems:
(a ) ♡ xk+1 = 5 xk + 2 yk , yk+1 = 2 xk + 2 yk , x0 = 1, y0 = −1;
(b) xk+1 = 4 xk + yk , yk+1 = −2 xk + yk , x0 = 1, y0 = −1;
(c) ♡ xk+1 = xk − yk , yk+1 = zk , zk+1 = − zk , x0 = 1, y0 = 3, z0 = 2;
(d ) ♢ xk+1 = xk + yk + 2 zk , yk+1 = xk + 2 yk + zk , zk+1 = 2 xk + yk + zk ,
x0 = 1, y0 = 0, z0 = 1.
 
    k
    k  xk −1 1 1  (−1) 
xk 2 −1 6 /5
Solution: (a ) = ; (c )  yk  =  −2 0 1   − 3 ,

yk 1 2 − 35
zk 2 0 0
    0
−2 −4
for k ≥ 1, which equals  2  if k is odd and  −2  if k > 0 is even.
−2 2
(k)
5.3. ♡ Explain why the j-th column cj of the matrix power Ak satisfies the linear iterative
(k+1) (k) (0)
system cj = A cj with initial data cj = ej , the j-th standard basis vector.
Solution: Since matrix multiplication acts column-wise, as per formula (3.8), the j-th
(k+1) (k)
column of the matrix equation Ak+1 = A Ak is cj = A cj . Moreover, A0 = I has j-th
(0)
column cj = ej . ■

5.6. True or false: (a ) ♡ If A is convergent, then A2 is convergent.


(b ) ♢ If A is convergent, then ATA is convergent.
Solution: (a ) True by part (c) of Exercise 2.9.

5.8. Determine if the following matrices are regular transition matrices. If so, find the asso-
ciated probability eigenvector.  
! ! ! 0 1 0  
1 1 1 2 1   .3 .5 .2
0
(a)♡ 2 3 , (b )♡ 4 3 , (c ) 5 , (d)♢  1 0 0 , (e)  .3 .2
 .5 .
3 2 3 1
4 3 4 3 1 45 .4 .3 .3
0 0 1
8 9 T

Solution: (a ) Not a transition matrix; (b) regular transition matrix: 17 , 17 .
Chapter 5. Eigenvalues and Singular Values 29

 
0 1
5.9. ♡ Explain why the irregular Markov process with transition matrix A = does
1 0
not reach a steady state.
T
Solution: If x0 = ( a, b ) is the initial state vector, then the subsequent state vectors
T T
switch back and forth between ( b, a ) and ( a, b ) . At each step in the process, all of the
population in state 1 goes to state 2 and vice versa, so the system never settles down.
5.10. ♡ A certain plant species has either red, pink, or white flowers, depending on its
genotype. If you cross a pink plant with any other plant,
 the probability
 distribution of the
.5 .25 0
offspring is prescribed by the transition matrix A =  .5 .5 .5 . On average, if you
0 .25 .5
continue crossing with only pink plants, what percentage of the three types of flowers would
you expect to see in your garden?
Solution: 25% red, 50% pink, 25% pink.

6.1. Use the power method to approximate the dominant eigenvalue and associated eigen-
vector of the following matrices. Write your code in Python and compare to the output of
[Link].
     
  2 −1 0 0
3 −1 0 −2 0 1
−1 −2  −1 2 −1 0 
(a )♡ , (b)♢  −1 2 −1 , (c )♡  −3 −2 0 , (d )  .
3 4 0 −1 2 −1
0 −1 3 −2 5 4
0 0 −1 2
Solution: In all cases, we use the normalized version (5.86) starting with x0 = e1 ; the
answers are correct to 4 decimal places.
T
(a) After 17 iterations, λ = 2.00002, u = ( −.55470, .83205 ) .
T
(c) After 121 iterations, λ = −3.30282, u = ( .35356, .81416, −.46059 ) .

6.3. ♡ Prove that, for the normalized iterative method (5.86), ∥ A yk ∥ → | λ1 |. Assuming λ1
is real, explain how to deduce its sign.
Solution: Since xk → λk1 c1 v1 as k → ∞, we have

xk c1 λk1 u1 u1 , λ1 > 0, v1
yk = −→ = where u1 = (sign c1 )
∥ xk ∥ | c1 | | λ1 | ∥ v1 ∥
k k
(−1) u1 , λ1 < 0, ∥ v1 ∥

λ1 u1 , λ1 > 0,
is one of the two real unit eigenvectors. Moreover, A yk → k so
(−1) λ1 u1 , λ1 < 0,
∥ A yk ∥ → | λ1 |. If λ1 > 0, the iterates yk → u1 converge to one of the two dominant unit
eigenvectors, whereas if λ1 < 0, the iterates yk → (−1)k u1 switch back and forth between
the two unit eigenvectors. ■
30 Chapter 5. Eigenvalues and Singular Values

6.5.♡ The Inverse Power Method. Let A be a nonsingular matrix. (i ) Show that the eigenval-
ues of A−1 are the reciprocals 1/λ of the eigenvalues of A. How are the eigenvectors related?
(ii ) Show how to use the power method on A−1 to produce the smallest (in modulus) eigen-
value of A. (iii ) What is the rate of convergence of the algorithm? (iv) Design a practical
iterative algorithm based on the QR decomposition of A. (v) Apply your algorithm to find
the smallest eigenvalues and associated eigenvectors of the matrices in Exercise 6.1.
Solution: (i) First note that 0 cannot be an eigenvalue because A is nonsingular. If
A v = λ v then A−1 v = λ−1 v, and so v is also the eigenvector of A−1 .
(ii) If λ1 , . . . , λn are the eigenvalues of A, and we assume that | λ1 | > | λ2 | > · · · >
| λn | > 0 (otherwise, see Exercise 6.4), then λ−1 −1
1 , . . . , λn are the eigenvalues of A
−1
, and
−1
1/| λn | > 1/| λn−1 | > · · · > 1/| λ1 | and so 1/λn is the dominant eigenvalue of A . Thus,
applying the power method to A−1 will produce the reciprocal of the smallest (in absolute
value) eigenvalue of A and its corresponding eigenvector.
(iii) The rate of convergence of the algorithm is the ratio | λn /λn−1 | of the moduli of the
smallest two eigenvalues.
(iv) Once we factor A = Q R, we can solve the iteration equation A u(k+1) = xk by
rewriting it in the form R xk+1 = QT xk , and then using Back Substitution to solve for xk+1 .
As we know, this is much faster than computing A−1 .
T
(a) After 15 iterations, we obtain λ = .99998, u = ( .70711, −.70710 ) .
T
(c) After 6 iterations, we obtain λ = .30277, u = ( .35355, −.46060, .81415 ) .

6.10. Apply orthogonal iteration to the following symmetric matrices to find their eigenvalues
and eigenvectors to 2 decimal places:
   
    2 1 0 2 5 0
1 2 3 −1
(a) ♡ , (b) , (c ) ♡  1 2 3 , (d) ♢  5 0 −3 ,
2 6 −1 5
0 3 1 0 −3 3
   
3 −1 0 0 6 1 −1 0
 −1 3 −1 0   1 8 1 −1 
(e) ♢  , (f )  .
0 −1 3 −1 −1 1 4 1
0 0 −1 3 0 −1 1 3
   
.3310 .9436
Solution: (a ) Eigenvalues: 6.7016, .2984; eigenvectors: , .
.9436 −.3310
     
.2726 .9454 −.1784
(c ) Eigenvalues:4.7577, 1.9009, −1.6586; eigenvectors:  .7519 ,  −.0937 ,  .6526 .
.6003 −.3120 −.7364
Chapter 5. Eigenvalues and Singular Values 31

 
4 −1 1
6.13.♡ Show that applying orthogonal iteration to the matrix A =  −1 7 2 , starting
1 2 7
with the initial matrix S0 = I , eventually results in a diagonal matrix with the eigenvalues
on the diagonal, but not in decreasing order. Explain why. Try changing the initial condition
S0 ; does that produce the eigenvalues in the correct order?
 
6 0 0
Solution: The iterates converge to the diagonal matrix  0 9 0 . The eigenvalues
0 0 3
appear on along the diagonal, but not in decreasing order, because, when the eigenvalues
 are
0 1 1
listed in decreasing order, the corresponding eigenvector matrix S =  1 −1 1  does
−2 −1 1
not satisfy the regularity condition (5.94), and so Theorem 5.71 does not apply.
6.14.♡ Assume that orthogonal iteration applied to a symmetric positive semidefinite matrix
A converges to an n × k matrix Q, whose columns are orthonormal, and a k × k upper
triangular matrix R, whose diagonal entries are positive. Then Q and R satisfy A Q = Q R.
Show that the columns of Q are eigenvectors of A, and R is a diagonal matrix containing the
corresponding eigenvalues.
Solution: We start with the first two columns of A Q = Q R, which read
A q1 = r11 q1 , A q2 = r12 q1 + r22 q2 .
Thus, q1 is an eigenvector with eigenvalue r11 . Taking the dot product of the second equation
with q1 and using orthogonality produces
r12 = q1 · (r12 q1 + r22 q2 ) = q1 · (A q2 ) = (A q1 ) · q2 = r11 q1 · q2 = 0.
Therefore A q2 = r22 q2 and hence q2 is an eigenvector with eigenvalue r22 . For the inductive
step, in general we have
A qk = r1k q1 + r2k q2 + · · · + rk−1,k qk−1 + rkk qk .
Assuming that q1 , . . . , qk−1 are eigenvectors with eigenvalues r11 , . . . , rk−1,k−1 , we take the
dot product with qj with j < k to obtain rjk = qj · A qk = (A qj ) · qk = rjj qj · qk = 0.
Therefore A qk = rkk qk , which completes the proof. ■

7.1. Find the singular values of the following matrices and then write out their singular value
     
0 1 1 1 1 −2
decomposition: (a) ♡ ( 2, −1, 3 ), (b ) ♡ , (c ) , (d) ♢ ,
−1 0 0 2 −3 6
   
  0 1   1 −1 0
2 0 0  1 −1  2 1 0 −1
(e ) ♡ , (f ) ♢  , (g ) ♢ , (h )  −1 2 −1 .
0 3 0 −1 0 0 −1 1 1
0 −1 1
1 −1
     
 √  2 
1 0 1 0 0 1
Solution: (a ) 1 14 √ − √14 √14 , (b)
1 3 ,
 
14 0 −1 0 1 1 0
   
− √110 √  
0 1 3 0 0 1 0
(d)   5 2 − √5 √5 , (e )
1 2 .
√3 1 0 0 2 1 0 0
10
32 Chapter 5. Eigenvalues and Singular Values

7.4. ♡ True or false: If A is a symmetric matrix, then its singular values are the same as its
eigenvalues.
Solution: False. The singular values are the absolute values of the nonzero eigenvalues.
7.6. ♡ Suppose Q is an orthogonal n × n matrix. What are its singular values?
Solution: σ1 = · · · = σn = 1.
7.9. Use the power method to find the largest singular value of the following matrices:
 
      3 1 −1
1 2 2 1 −1 2 2 1 −1
(a )♡ , (b )♢ , (c )♡ , (d)  1 −2 2 .
−1 3 −2 3 1 1 −2 0 1
2 −1 1
Solution: To find the dominant singular value of a matrix A, we apply the power method
to H = ATApand take the square root of its dominant eigenvalue to find the dominant singular
value σ1 = λ1 of A.
 
2 −1
(a) H = ; after 11 iterations, λ1 = 13.0902 and σ1 = 3.6180;
−1 13
 
5 2 2 −1
 2 8 2 −4 
(c) H =  ; after 16 iterations, λ1 = 11.6055 and σ1 = 3.4067.
2 2 1 −1
−1 −4 −1 2
7.10. Compute the Euclidean matrix norm of the following matrices.
! ! ! !
1 1 5 4 2
− 72 1 3
(a ) ♡ 2 4 , (b ) ♢ 3 3 , (c ) ♡ 7 , (d ) 4 2 .
1
3
1
6 − 76 − 56 − 27 6
7 − 1
2
5
4
Solution: (a ) 0.671855, (c ) 0.9755.
7.12. ♡ True or false: The minimum value of the quantity in (5.107) is the smallest singular
value of A.
Solution: True if A is nonsingular, in view of (5.50). If A is singular, the minimum value
is 0, which, by our convention, is not a singular value.
7.14. ♡ Prove that the Euclidean matrix norm is bounded by the Frobenius norm, so that
∥ A ∥2 ≤ ∥ A ∥F . When are they equal?
Solution: Using Exercise 7.13, ∥ A ∥2F = σ12 + · · · + σr2 ≥ σ12 = ∥ A ∥22 . They are equal if
and only if A has only one singular value if and only if rank A = 1. ■
7.17. Find the condition number of the following matrices. Which would you characterize as
ill-conditioned?    
    −1 3 4 72 96 103
2 −1 1 2
(a)♡ , (b)♢ , (c )♡  2 10 6 , (d )  42 55 59 .
−3 1 1.001 1.9997
1 2 −3 67 95 102
Solution: (a) The singular values are 3.8643, .2588, and so the condition number is
3.86433 / .25878 = 14.9330.
(c) The singular values are 12.6557, 4.34391, .98226, and so the condition number is
12.6557 / .98226 = 12.88418.
Chapter 6

Basics of Optimization

1.1. Find all local and global extremizers on R of the following scalar functions:
x x2 − 3 x + 5
, (d )♢ ex −2 x , (e ) sin x+ 12 cos 2 x.
2 4
(a )♡ x3 −2 x+1, (b)♢ 2
, (c )♡ 2
1+x x +1
q q 

Solution: (a) Local minimizer: x = 3 ≈ .8165, F
2 2
3 = 1 − 49 6 ≈ −.0887;
q  q 

local maximizer: x = − 3 ≈ −.8165, F − 23 = 1 + 49 6 ≈ 2.0887;
2


(c) global minimizer: x = 3, F (3) = 21 ; global maximizer: x = − 31 , F − 13 = 11
2 .

1.2. Minimize and maximize the following objective functions on the indicated domains:
(a) ♡ x3 − 2 x2 + x, −1 ≤ x ≤ 1; (b ) ♢ x5 − 2 x3 + x − 3, 0 ≤ x ≤ 2;
x2 − x
(c) ♡ 2 , −3 ≤ x ≤ 3; (d) sin(x2 + 1), 0 ≤ x ≤ 2.
x +1
Solution: 
(a) Minimum at x = −1 with F (−1) = −4; maximum at x = 13 with F 13 = 4
27

√ √  2−2
(c) minimum at x = 2 − 1 with F 2 −1 = √ ≈ −.20711;
2 √
√ √  2+2
maximum at x = − 2 − 1 with F − 2 − 1 = √ ≈ 1.20711.
2

2.1. For each of the following quadratic functions, determine whether there is a minimizer.
If so, find the minimizer and the minimum value. (a ) ♡ x2 − 2 x y + 4 y 2 + x − 1,
(b ) 3 x2 + 3 x y + 3 y 2 − 2 x − 2 y + 4, (c ) ♡ x2 + 5 x y + 3 y 2 + 2 x − y,
(d ) ♡ x2 + y 2 + y z + z 2 + x + y − z, (e ) x2 + x y − y 2 − y z + z 2 − 3,
(f ) ♢ x2 + 5 x z + y 2 − 2 y z + z 2 + 2 x − z − 3, (g) x2 + x y + y 2 + y z + z 2 + z w + w2 − 2 x − w.

Solution: (a ) Minimizer: x = − 32 , y = − 16 ; minimum value: − 43 . (c ) No minimizer.


(d ) Minimizer: x = − 21 , y = −1, z = 1; minimum value: − 54 .
34 Chapter 6. Basics of Optimization

2.3. For each matrix H, vector f , and scalar c, write out the quadratic function P (x) given
by (6.10). Then either find the minimizer x⋆ and minimum value P (x⋆ ), or explain why there
is none.    1    
4 −12 −2 3 2 4
(a ) ♡ H = , f= , c = 3; (b ) ♡ H = , f= , c = 0;
−12 45 2 2 1 1
     
3 −1 1 0 1 1 1
(c ) ♡ H =  −1 2 −1  , f =  4 , c = 6; (d) ♢ H =  1 2 −1  ,
1 −1 3 −4 1 −1 1
     
1 1 0 0 −1
−3
1 2 1 0  2 
f =  −1 , c = 1; (e) H =  , f =  , c = 0.
0 1 3 1 −3
2
0 0 1 4 4
Solution: (a) P (x) = 2 x2 − 12 x y + 45 2 y + 2 x − 2 y + 3;
2 1 7

1 T T
minimizer: x⋆ = 24 1
, 18 ≈ ( .0417, .0556 ) ; minimum value: P (x⋆ ) = 288
851
≈ 2.95486.
(b) P (x) = 2 x + 2 x y + y − 4 x − y; no minimizer since H is not positive definite.
3 2 2

(c) P (x) = 32 x2 − x y + x z + y 2 − y z + 32 z 2 − 4 y + 4 z + 6;
T
minimizer: x⋆ = ( 1, 2, −1 ) ; minimum value: P (x⋆ ) = 0.
2.6.♡ Show that the quadratic function P (x, y) = x2 +y has a positive semidefinite coefficient
matrix, but no minimum.
 
2 0
Solution: The coefficient matrix is H = and xT H x = 2 x2 ≥ 0. However,
0 0
P (0, y) = y can be made as large negative as desired.

2.9.♡ Under what conditions does a quadratic function (6.10) have a finite global maximum?
Explain how to find the maximizer and maximum value.
Solution: P (x) has a maximum if and only if − P (x) has a minimum. Thus, we require
either H is negative definite, or negative semidefinite with f ∈ img H. The maximizer x⋆ is
obtained by solving H x⋆ = f , and the maximum value P (x⋆ ) is given as before by any of the
expressions in (6.14).
2.10. Find the maximum value of the quadratic functions
(a) ♡ − x2 + 3 x y − 5 y 2 − x + 1, (b ) − 2 x2 + 6 x y − 3 y 2 + 4 x − 3 y.
10 3 T
 16
Solution: (a) maximizer: x⋆ = 11 , 11 ; maximum value: p(x⋆ ) = 11 .

2.13. Find the minimizer and minimum value of the following quadratic functions when
subject to the indicated constraint. (a ) ♡ x2 − 2 x y + 6 y 2 , x + y = 1,
(b ) ♢ x2 + y 2 + 2 y z + 4 z 2 , x + 2 y − z = 3, (c ) x2 + x y − y 2 − y z + z 2 , x − y − z = 1.
Solution: (a) Minimizer: x⋆ = 97 , y ⋆ = 29 ; minimum value = 59 .
Chapter 6. Basics of Optimization 35

2.15. ♡ Let H be a symmetric matrix. Suppose V is a subspace spanned by one or more


eigenvectors of H having positive eigenvalues. Show that the restriction of the quadratic
function (6.10) to V has a unique global minimum. Write down the linear system the minimum
must satisfy.

Solution: Let u1 , . . . , un be an orthonormal basis of eigenvectors of H such that u1 , . . . , uk


have positive eigenvalues λ1 , . . . , λk > 0. Let x = y1 v1 + · · · + yk vk ∈ V . Then,

1 X X X
k k k
xT H x = yi yj uTi H uj = yi yj λj uTi uj = λi yi2 > 0
2 i,j = 1 i,j = 1 i=1

T
for y = ( y1 , y2 , . . . , yk ) ̸= 0, and hence the restricted quadratic function is positive definite.
Further, write f = b1 u1 + · · · + bn vn in terms of the eigenvector basis, so when x ∈ V , the
restricted quadratic function
1X X
k k
P (x) = 21 xT H x − xT f + c = λi yi2 − bi fi + c
2 i=1 i=1
has a unique global minimum when yi = bi /λi . ■

3.1. Find the standard gradient, where it exists, of the following functions.:
(a) ♡ x1 x22 , (b) ♢ log(x21 + x22 ), (c) ♡ ex1 −2 x2 , (d) tan−1 (x1 /x2 ).
T T
Solution: (a ) x22 , 2 x1 x2 ; (c) ex1 −2 x2 , −2 ex1 −2 x2 .

3.2. Repeat Exercise 3.1 using the inner products


(i) ⟨ x, y ⟩ = 3 x1 y1 + 2 x2 y2 ; (ii) ⟨ x, y ⟩ = x1 y1 − x1 y2 − x2 y1 + 4 x2 y2 .
1
T 2
T
Solution: (a) (i ) 3 x22 , x1 x2 , (ii) 3x1 x2 + 43 x22 , 23 x1 x2 + 13 x22
1 x1 −2 x2
T T
(c) (i ) 3e , − ex1 −2 x2 , (ii ) 2 x1 −2 x2
3e , − 13 ex1 −2 x2 .

3.3. Find the critical points of the following objective functions:


(a ) ♡ x4 + y 4 − 4 x y, (b) ♡ x y (1 − x − y), (c ) ♢ x y e− 2 x −2 y ,
2 2
(d ) (x − y) cos y.
T T T T
Solution: (a ) ( 0, 0 ) , ( 1, 1 ) , ( −1, −1 ) ; (b ) 1 1
3, 3 , (0, 0), (0, 1), (1, 0).

3.6. Let y = f (x) and z = g(y) be continuously differentiable scalar functions, and let
h(x) = g ◦ f (x) denote their composition. True or false:
(a) ♡ A critical point of f (x) is a critical point of h(x).
(b) ♢ A local minimizer of f (x) is a local minimizer of h(x).
(c) ♡ A critical point of h(x) is a critical point of f (x).
(d) A local minimizer of h(x) is a local minimizer of f (x).

Solution: (a) True, since if f ′ (x) = 0, then h′ (x) = g ′ f (x) f ′ (x) = 0.

(c) False: if g ′ (y⋆ ) = 0 for y⋆ = f (x⋆ ) with f ′ (x⋆ ) ̸= 0, then h′ (x⋆ ) = g ′ f (x⋆ ) f ′ (x⋆ ) = 0,
so x⋆ is a critical point for h(x) but not for f (x).
36 Chapter 6. Basics of Optimization

4.1.♡ Write Python code to implement gradient descent on the functions F1 (x, y) = x2 +2 y 2 ,
F2 (x, y) = x2 + 10 y 2 and F3 (x, y) = sin x sin y and numerically investigate the rates of
convergence. You will need to choose the time step α by hand in each case to get the fastest
convergence rate. For which function does gradient descent converge the most quickly?

Solution:

Python Notebook: Solution to Exercise 4.1 (.ipynb)

4.4. ♡ Prove that, provided ∇F (xk ) ̸= 0, the inequality (6.40) holds when αk > 0 is
sufficiently small.

Solution: Set g(t) = F xk − t ∇F (xk ) . By the chain rule, g ′ (0) = − ∥ ∇F (xk ) ∥2 < 0 by
our assumption. Thus, g(t) < g(0) for t > 0 sufficiently small. Replacing t by αk completes
the proof. ■

4.5. ♡ (a ) Show that the system x2 + y 2 = 1, x + y = 2, does not have a solution.


(b) Use gradient descent to construct a “least squares solution” by minimizing the scalar
valued function F (x, y) = (x2 + y 2 − 1)2 + (x + y − 2)2 .
Solution: (a) Since a solution must satisfy 4 = (x + y)2 = x2 + 2 x y + y 2 = 1 + 2 x y and
so x y = 32 , but this is not possible since | x |, | y | ≤ 1.

(b) The least squares solution is x = y = 1/ 3 2 ≈ .7937.

5.1. Solve the following linear systems by the conjugate gradient method, keeping track of
the residual vectors and solution approximations as you iterate.
   
        6 2 1 1
3 −1 2 2 1 −3
(a)♡ x= , (b) x= , (c)♡  2 3 −1  x =  0 ,
−1 5 1 1 1 1
1 −1 2 −2
       
6 −1 −1 5 1 5 1 1 1 4
 −1 7 1 −1   2  1 5 1 1 0
(d) ♢  x =  , (e )  x =  .
−1 1 3 −3 0 1 1 5 1 0
5 −1 −3 6 −1 1 1 1 5 0
Solution: In each, the fibal xk is the actual solution, with residual rk = b − H xk ≈ 0. 
        1
2 .76923 .07692 .78571
(a ) r0 = , x1 = , r1 = , x2 = ; (c ) r0 =  0 ,
1 .38462 −.15385 .35714
         −2 
.5 −1 .51814 1.28497 1.
x1 =  0 , r1 =  −2 , x2 =  −.72539 , r2 =  −.80311 , x3 =  −1.4 .
−1 −.5 −1.94301 .64249 −2.2
Chapter 6. Basics of Optimization 37

5.5. ♡ Use the conjugate gradient method to solve the system A u = e5 with coefficient
matrix  
4 −1 0 −1 0 0 0 0 0
 −1 4 −1 0 −1 0 0 0 0 
 
 0 −1 4 0 0 −1 0 0 0 
 
 −1 0 0 4 −1 0 −1 0 0 
 
A =  0 −1 0 −1 4 −1 0 −1 0 .
 
 0 0 −1 0 −1 4 0 0 −1 
 
 0 0 0 −1 0 0 4 −1 0 
 
0 0 0 0 −1 0 −1 4 −1
0 0 0 0 0 −1 0 −1 4
How many iterations do you need to obtain the solution that is accurate to 2 decimal places?
Remark: This matrix arises in the numerical discretization of the two-dimensional Laplace
partial differential equation, of great importance in many applications, [180].
Solution: Remarkably, after only two iterations, the method finds the exact solution:
T
x3 = x⋆ = ( .0625, .125, .0625, .125, .375, .125, .0625, .125, .0625 ) .
6.1. When possible, use Theorem 6.30 to determine the status of the critical points you found
in Exercises 3.3 and 3.4.
T T T
Solution: Exercise 3.3: (a) ( 1, 1 ) , ( −1, −1 ) are local minimizers; ( 0, 0 ) is a saddle
T T T T
point; (c ) 12 , − 12 , − 21 , 12 are local minimizers; 12 , 21 , − 12 , − 12 are local
T
maximizers; ( 0, 0 ) is a saddle point.
6.2. Let f (x) ∈ C4 be a scalar function. (a) ♡ Suppose that f ′ (x⋆ ) = f ′′ (x⋆ ) = 0, but
f ′′′ (x⋆ ) ̸= 0. Prove that x⋆ cannot be a local minimizer or maximizer of f (x).
(b ) ♢ Suppose that f ′ (x⋆ ) = f ′′ (x⋆ ) = f ′′′ (x⋆ ) = 0, while f ′′′′ (x⋆ ) > 0. Is x⋆ necessarily
a local (i) maximizer, (ii) minimizer, (iii ) neither, or (iv) cannot tell with this information
alone?
Solution: (a) The third order Taylor formula says f (x) = f (x⋆ ) + 16 f ′′′ (x⋆ ) h3 + R(h),
where h = x − x⋆ , with R(h)/h3 → 0 as h → 0. Thus, if x − x⋆ is sufficiently small, then
f (x) − f (x⋆ ) is positive on one side of x⋆ and negative on the other side, and hence t⋆ cannot
be a local minimizer or maximizer. ■

6.4. ♡ Give an example of a quadratic function Q(x, y) of two variables that has no critical
points. If your answer is an affine function, try harder. What can you say about the graph
of Q(x, y)?
T
Solution: Let x = ( x, y ) . Any function of the form Q(x) = xT H u−2 xT f with det H = 0
and f ̸∈ img H. For instance, Q(x, y) = x2 − y. The cross-sections of the graph parallel to
the x axis are parabolas with progressively lower minima as y increases.
7.2. ♡ Show that − log x is strictly convex when x > 0. Use this to prove log x ≤ x − 1 for
all x > 0, with equality if and only if x = 1.

Solution: Its second derivative is x−2 > 0, proving strict convexity. Moreover, it lies
strictly above its tangent line at x = 1, which is 1 − x ≤ − log x, which proves the inequality
with equality only holding when x = 1. ■
38 Chapter 6. Basics of Optimization

7.4. Determine whether the following functions are (i ) convex; (ii ) strictly convex:
(a) ♡ x, (b ) ♡ x2 , (c ) ♡ x3 , (d ) | x |, (e) ♢ | x |3 , (f ) 1/(1 + x2 ).
Solution: (a ) Convex; (b ) strictly convex; (c ) not convex.

7.6. ♡ Let ∥ · ∥ be any norm on R n . Show that the norm function F (x) = ∥ x ∥ is convex.
Solution: Let 0 ≤ t ≤ 1 and x, y ∈ Rn . By the triangle inequality,

F (1 − t) x + t y = ∥ (1 − t) x + t y ∥ ≤ ∥ (1 − t) x ∥ + ∥ t y ∥ = (1 − t) F (x) + t F (y).

Therefore F is convex. ■

7.8. (a ) ♡ Prove that if F : R n → R is continuously differentiable, and both F and − F are


convex, then F is an affine function. (b) ♢ Is the result also valid for general functions F ? If
so, prove it. If not, find an explicit counterexample.
Solution: (a) If both F and − F are convex, then the convexity inequality (6.91) holds for
both F and − F , which implies that it is an equality: F (y) = F (x) + ⟨ ∇F (x), y − x ⟩ for all
x, y ∈ Rn . Setting x = 0 and replacing y by x yields F (x) = F (0)+⟨ ∇F (0), x ⟩ = ⟨ a, x ⟩+b,
which is an affine function with a = ∇F (0) and b = F (0). ■
7.10. ♡ Prove Lemma 6.37.
Solution: Let us write H(x) = a F (x) + b G(x). Let 0 ≤ t ≤ 1 and x, y ∈ Rn . Then, since
F and G are convex, and a, b ≥ 0, we have

H((1 − t) x + t y) = a F ((1 − t) x + t y) + b G((1 − t) x + t y)


≤ a ((1 − t) F (x) + t F (y)) + b ((1 − t) G(x) + t G(y))
= (1 − t) (a F (x) + b G(x)) + t (a F (y) + b G(y)) = (1 − t) H(x) + t H(y),

which completes the proof. ■


7.13. ♡ Let F : [ 0, ∞ ) → R be a convex function satisfying F (0) = 0. Show that F is
superadditive, which means that F (x) + F (y) ≤ F (x + y) for all x, y ≥ 0.
x y
Hint: Use (6.88) to show that F (x) ≤ F (x + y) and F (y) ≤ F (x + y).
x+y x+y
Solution: If either x or y is zero, the inequality is trivial, so we may assume x + y > 0.
Since F (0) = 0 and F is convex we have
 
x y x
F (x) = F (x + y) + 0 ≤ F (x + y),
x+y x+y x+y

where we used definition of convexity with t = x/(x + y). The inequality


y
F (y) ≤ F (x + y)
x+y

follows by swapping x and y. Adding these two inequalities produces


x y x+y
F (x) + F (y) ≤ F (x + y) + F (x + y) = F (x + y) = F (x + y). ■
x+y x+y x+y
Chapter 6. Basics of Optimization 39

7.17. ♡ Prove Hölder’s inequality (6.99) when p = ∞ and q = 1.


!
X
n X
n X
n
Solution: x · y = xi yi ≤ |xi | |yi | ≤ |xi | max |yj | = ∥ x ∥1 ∥ y ∥∞ . ■
1≤j≤n
i=1 i=1 i=1

7.20. ♡ Show that F : Rn → R is µ-strongly convex if and only if



F (1 − t) x + t y + 12 µ t (1 − t) ∥ x − y ∥2 ≤ (1 − t) F (x) + t F (y) (6.112)

holds for all x, y ∈ R n and 0 ≤ t ≤ 1.


Solution: By definition, F is µ-strongly convex if and only if G(x) = F (x) − 21 µ ∥ x ∥2 is
convex. Then, for x, y ∈ R n and 0 ≤ t ≤ 1, the convexity inequality (6.87) for G says
 
F (1 − t) x + t y − 12 µ ∥ (1 − t) x + t y ∥2 = G (1 − t) x + t y ≤ (1 − t) G(x) + t G(y)
= (1 − t) F (x) + t F (y) − 21 µ (1 − t) ∥ x ∥2 − 12 µ t ∥ y ∥2 .

Replacing ∥ (1 − t) x + t y ∥2 = (1 − t)2 ∥ x ∥2 + 2 t (1 − t) x · y + t2 ∥ y ∥2 , and then rearranging


terms yields
  
F (1 − t) x + t y ≤ (1 − t) F (x) + t F (y) − 21 µ t (1 − t) ∥ x ∥2 − 2 x · y + ∥ y ∥2
= (1 − t) F (x) + t F (y) − 21 µ t (1 − t) ∥ x − y ∥2 ,

which proves (6.112). Conversely, if F satisfies (6.112), then


 
G (1 − t) x + t y = F (1 − t) x + t y − 12 µ ∥ (1 − t) x + t y ∥2

= F (1 − t) x + t y − 12 µ (1 − t)2 ∥ x ∥2 − t (1 − t) x · y − 21 µ t2 ∥ y ∥2

= F (1 − t) x + t y + 12 µ t (1 − t) ∥ x − y ∥2 − 12 µ (1 − t) ∥ x ∥2 − 12 µ t ∥ y ∥2
≤ (1 − t) F (x) + t F (y) − 12 µ (1 − t) ∥ x ∥2 − 12 µ t ∥ y ∥2 = (1 − t) G(x) + t G(y),

and hence G is convex. ■

8.1. Determine whether or not the following scalar functions are Lipschitz continuous on R.
If so, find their Lipschitz constant.
(d ) ♡ ex , (e ) ♢ e−x , (f ) tanh x.
2
(a) ♡ | x | + | x − 1 |, (b) x2/3 , (c ) ♡ sign x,
Solution: (a ) Lipschitz constant = 2; (c ) not continuous; (d ) not Lipschitz
8.2. Do your answers to Exercise 8.1 change if the domain is restricted to [ − 1, 1 ]?

Solution: (a ) Lipschitz constant = 2; (c ) not continuous; (d ) Lipschitz constant = e

 following functions2 on R
2
8.5. Determine whether or not the are Lipschitz continuous.
(a ) ♡ | x − y |, (b) max | x |, | y | , (c ) ♡ x − y , (d) ♢ exp(− x2 − y 2 ).
2

Solution: (a ) Lipschitz continuous; (c ) not Lipschitz continuous


40 Chapter 6. Basics of Optimization

8.8. ♡ Suppose the scalar-valued functions F1 , F2 : R 2 → R are both Lipschitz continuous.


T
(a) Prove that the vector-valued function F (x, y) = ( F1 (x, y), F2 (x, y) ) is Lipschitz con-
tinuous. (b) Is the converse to part (a) valid?
Solution: (a) By the equivalence of norms, we can use the 1 norm on R 2 to prove this.
Let λ1 , λ2 be their respective Lipschitz constants. Then, given x, y ∈ R 2 ,
∥ F (x) − F (y) ∥1 = | F1 (x) − F1 (y) | + | F2 (x) − F2 (y) |

≤ λ1 ∥ x − y ∥ + λ2 ∥ x − y ∥ = λ ∥ x − y ∥, where λ = λ1 + λ2 .
(b) Yes, since for i = 1, 2, we have | Fi (x) − Fi (y) | ≤ ∥ F (x) − F (y) ∥1 ≤ λ ∥ x − y ∥.
8.10. ♡ Prove that the property of Lipschitz continuity does not depend on the underlying
norm on R n .
Solution: Let ∥ · ∥a and ∥ · ∥b be any two norms on R n . If F is Lipschitz continuous on
Ω ⊂ R n under the second norm, then
| F (x) − F (y) | ≤ λb ∥ x − y ∥b ≤ λb R⋆ ∥ x − y ∥a for all x, y ∈ Ω,

where R > 0 is the constant appearing in the equivalence of norms inequality (2.71). Thus,
F is Lipschitz continuous for the first norm, with Lipschitz constant λa = R⋆ λb . ■
8.12. True or false: (a ) ♡ A convex scalar function is Lipschitz continuous.
(b) ♢ A strictly convex scalar function is Lipschitz continuous.
(c) A Lipschitz continuous scalar function is convex.
Solution:
(a) False. The function f (x) = x2 is convex, but not Lipschitz continuous on all of R.
8.14.♡ Give an example to show that (6.117) does not hold in general on connected domains.
Hint: Take the domain to be a disk in R2 centered at the origin with the negative x axis
removed.
Solution: As suggested, let Ω = { x2 + y 2 ≤ 1 }\{ (x, 0) | −1 ≤ x ≤ 0 }. Let F (x, y) = θ be
the polar angle, so − π < θ < π on Ω. The polar angle function is continuously differentiable
on Ω, and even has bounded gradient if one omits a small disk centered at the origin. Let
T T
ε > 0 be small. Setting x = − 12 , ε , y = − 12 , − ε , we have F (x) − F (y) ≈ 2 π while
∥ x − y ∥ = 2 ε. Thus there is no constant β > 0 such that (6.117) holds for all x, y ∈ Ω.

9.1. ♡ For F1 and F2 from Exercise 4.1, compute the Lipschitz constant of the gradient and
determine the rate of convergence of gradient descent, according to Theorem 6.66, with the
optimal choice of time step. Use the Euclidean norm and dot product. Do the theoretical
convergence rates match up with the experimental rates determined in Exercise 4.1?
Solution: All gradients, Hessians, and norms are Euclidean in this solution. We compute
       
2x 2x 2 0 2 0
∇F1 (x, y) = , ∇F2 (x, y) = , ∇2 F1 (x, y) = , ∇2 F2 (x, y) = .
4y 20 y 0 4 0 20
By Remark 6.60, the Lipschitz constant of the gradient for each function is the largest eigen-
value of the Hessian matrix, which is 4 and 20 for F1 and F2 , respectively. Since the optimal
choice of the time step in Theorem 6.66 is α = Lip(∇F )−1 , the optimal convergence rate is
O(Lip(∇F )/k), which is significantly faster for F1 compared to F2 (5 times faster). This is
the same phenomenon we observed numerically in Exercise 4.1.
Chapter 6. Basics of Optimization 41

9.2. ♡ Repeat Exercise 9.1, except this time compare the linear convergence rates provided
by Theorem 6.68. Use the Euclidean norm and dot product.
Solution: For simplicity, assume that all gradients, Hessians, and norms are Euclidean. By
(6.104) and the Hessians computed in Exercise 9.1, each Fi is µ-strongly convex with µ = 2.
By Remark 6.70 the convergence rate is 1 − τ where τ is given by (6.146). Thus, τ1 = 24 = 12
2 1
and τ2 = 20 = 10 , yielding again a much faster rate for F1 as compared to F2 .
9.3. ♡ Find a preconditioning matrix C so that preconditioned gradient descent on F2 from
Exercise 4.1 is equivalent to ordinary gradient descent on F1 (from the same exercise), and
thus admits the same convergence rate.
Solution: The preconditioned gradient is ∇C F = C −1 ∇F , so in order to make the pre-
conditioned
 C-gradient
 of F2 the
 same as the Euclidean gradient of F1 , we need to set
−1 1 0 1 0
C = , or C = .
0 15 0 5

10.2. Use Newton’s Method to find all points of intersection of the following pairs of plane
curves: (a ) ♡ x2 + y 2 = 1, x y = 12 , (b) ♡ x3 + y 3 = 1, x2 − y 2 = 1,
(c ) ♢ x2 + 13 y 2 = 1, x2 + 14 x + 2 y 2 − 14 y = 5, (d ) y = x2 − 3 x − 5, x = −2 y 2 + 6 y.
Hint: Sketching the curves will help you decide where to start the iterations.
Solution:
   
       
√1
0.70710 − √12 −0.70710 0.92416 1.42503
(a)  2 ≈ ,   ≈ . (b) , .
√1 0.70710 − √12 −0.70710 0.59505 −1.23722
2

10.5. ♡ Prove that (6.155) is equivalent to (6.156).


Solution: Assuming ∇22 F (xk ) is positive definite, Theorem 6.7 implies that the minimum
of the quadratic function in (6.155) is achieved when ∇22 F (xk ) (xk+1 − xk ) = − ∇2 F (xk ).
Inverting ∇22 F (xk ) produces the Newton iteration (6.156). ■
10.8. ♡ Given a descent direction v for an optimization method — for gradient descent
−1
v = − ∇F (x), while for Newton’s method v = − ∇22 F (x) ∇2 F (x) — a backtracking line
search aims to choose the best time step α to minimize the function F along the descent
direction, that is, to minimize F (x + α v) over α. The backtracking line search has two
parameters 0 < γ ≤ 21 and 0 < β < 1, and chooses α = β k , where k ≥ 0 is the smallest
nonnegative integer such that
F (x + β k v) ≤ F (x) + γ β k ⟨ ∇F (x), v ⟩. (6.166)
In practice, one starts with k = 0, and then iteratively increases k = 1, 2, . . . until the
inequality (6.166) holds.
(a) Assume ∇F is Lipschitz continuous, and the descent direction is v = − ∇F (x). Show
that there exists an integer k ≥ 0 such that (6.166) holds. That is, the backtracking line
search will eventually terminate. Hint: Use Lemma 6.64.
(b) Implement the backtracking line search in Python when F (x) = x31 + 10 x22 . Try gradi-
−1
ent descent, where v = − ∇F (xk ), and Newton’s method where v = − ∇22 F (xk ) ∇2 F (xk ).
In both cases, after conducting the backtracking line search, the update is xk+1 = xk + β k v.
T
Starting from x0 = ( 1, 1 ) , you should observe faster convergence with the backtracking line
search with good choices of parameters: γ = 0.5 and β = 0.9 are reasonable.
42 Chapter 6. Basics of Optimization

Solution: (a) By Lemma 6.64,

F (x + β k v) = F (x − β k ∇F (x)) ≤ F (x) − 12 β k ∥ ∇F (x) ∥2 ,

provided k is large enough so that β k ≤ 1/ Lip(∇F ). In contrast, the right hand side of
(6.166) is
F (x) + γ β k ⟨ ∇F (x), v ⟩ = F (x) − γ β k ∥ ∇F (x) ∥2 .
The backtracking line search condition is met provided γ β k ≤ 1
2 β k , or γ ≤ 12 , which is exactly
the condition on γ. ■
(b)

Python Notebook: Backtracking Line Search (.ipynb)


Chapter 7

Introduction to Machine
Learning and Data

1.1. Find the mean, the variance, and the standard deviation of the following data sets. You
can set ν = 1 when computing the latter.
(a)♡ 1.1, 1.3, 1.5, 1.55, 1.6, 1.9, 2, 2.1; (b) 2., .9, .7, 1.5, 2.6, .3, .8, 1.4; (c)♡ −2.9, −.5, .1, −1.5,
−3.6, 1.3, .4, −.7; (d) ♢ 1.1, .2, .1, .6, 1.3, −.4, −.1, .4; (e ) .9, −.4, −.8, .2, 1., −1.6, −1.2, −.7.
Solution: (a ) mean = 1.63125; variance = .84469; standard deviation = 0.91907.
(c ) mean = −0.925; variance = 19.375; standard deviation = 4.4017.
1.2. Show that the centering matrix J is (a) ♡ positive semi-definite, (b ) ♡ idempotent, so
J 2 = J, (c ) ♢ has one-dimensional kernel spanned by 1, and hence is not positive definite,
and (d) has rank m − 1.
1 T
Solution: (a) Using the definition (7.5) to write J x = x − (1 x) 1 = x − x 1, and hence
m
positive semidefiniteness follows from
1 1
xT J x = xT x − (1 · x)2 ≥ ∥ x ∥2 − ∥ 1 ∥2 ∥ x ∥2 = ∥ x ∥2 − ∥ x ∥2 = 0,
m m
where we used Cauchy–Schwarz (2.27) for the inequality, and then 1T 1 = ∥ 1 ∥2 = m. ■
 2
1 2 1 1
(b) J 2 = I − 1 1T = I − 1 1T + 2 1 1T 1 1T = I − 1 1T = J. ■
m m m m
1.6. ♡ Suppose we have a collection of data points x1 , . . . , xm lying along a line spanned by
the unit vector u, that is each xi = si u for some si ∈ R. Show that the covariance matrix of
this data is SX = σs2 u uT , where σs2 is the variance of the weights s = (s1 , . . . , sm ).
1 X
m
Solution: Let s̄ = s and note that x = s u. Thus, by (7.22) we have
m i=1 i
X
m
SX =ν (si − s̄ )2 u uT = σs2 u uT , since xi − x = (si − s̄) u. ■
i=1
44 Chapter 7. Introduction to Machine Learning and Data

2.2. ♡ This exercise compares the 1 norm and 2 norm in regression.


(a) Show that the solution of the optimization problem min{ ∥ w ∥2 | 1 · w = 1 } is given by
w = 1/n. This shows that under a constraint on the total mass of the weights, i.e., w · 1 = 1,
the 2 norm prefers to assign weights equally across all features. Hint : Write z = w − 1/n
and convert it into the equivalent optimization problem min{ ∥ z + 1/n ∥2 | 1 · z = 0 } whose
optimal solution is z = 0.
(b) Show that the solution of the optimization problem min{ ∥ w ∥1 | 1 · w = 1 } is any
vector w satisfying the constraint 1 · w = 1 that has nonnegative entries, i.e., wi ≥ 0 for all
i. Hence, the 1 norm does not place any preference on how the mass is distributed among
features, and the sparse solution w = e1 is as equally preferred as the nonsparse solution
w = 1/n. This fact allows lasso and elastic net to find sparser solutions when they exist.
Solution: (a) First note that
X
n X
2
1 = (1 · w)2 = wi2 + wi wj .
i=1 i<j = 1

Since 0 ≤ (wi − wj )2 = wi2 + wj2 − 2 wi wj , we have 2 wi wj ≤ wi2 + wj2 , with equality if and
only if wi = wj . Using this in the first formula yields

X
n
1 1
2
1≤n wi2 = n ∥ w ∥2 . Thus, ∥ w ∥2 ≥ = whenever 1 · w = 1.
i=1
n n

Moreover, equality is achieved if and only if all entries of w are equal, so w = 1/n. ■
(b) We have 1 = w1 + · · · + wn ≤ | w1 | + · · · + | wn | = ∥ w ∥1 , with equality if and only
if all wi ≥ 0, and hence any such w minimizes the one norm. ■

2.3.♡ Consider the general ridge regression problem minw ∥ Xw − y ∥2 + λ ∥ Bw ∥2 . Show
that the solution is unique and is given by (7.41) when X T X + λ B T B is nonsingular. Explain
how this solves the ridge regression problem (7.42).
Solution: Expanding,

∥ Xw − y ∥2 + λ ∥ B w ∥2 = wT (X T X + λ B T B) w − 2 wT X T y + ∥ y ∥2 .

This is a quadratic function of w, and, if λ > 0, the coefficient matrix X T X + λ B T B is a


combination of Gram matrices, and hence positive semidefinite. Thus, provided this matrix
is nonsingular and hence positive definite, Theorem 6.7 implies that it has a unique global
minimizer, which is given by (7.41). ■
Solution: (a) Let X = (v1 . . . vn ). The assumption in the problem is that v1 = · · · = vk .
In this case, the least squares part of the ridge regression loss can be written as
2 2
X
n X
n
∥ Xw − y ∥ = 2
wi vi − y = (w1 + · · · + wk ) v1 + wi vi − y ,
i=1 i=k+1

while the regularizer is λ ∥ w ∥2 = λ (w12 +· · ·+wn2 ), where λ > 0. Let w be the minimizer of the
ridge regression problem, and set w e = (w, . . . , w, wi+1 , . . . , wn ), where w = (w1 + · · · + wk )/k.
Then w e1 + · · · + w
ek = w1 + · · · + wk , which implies that ∥ Xw − y ∥2 = ∥ X w e − y ∥2 . Note
that, by the Cauchy-Schwarz inequality,

k 2 w2 = (w1 + · · · + wk )2 = (1 · w)2 ≤ ∥ 1 ∥2 ∥ w ∥2 = k (w12 + · · · + wk2 ).


Chapter 7. Introduction to Machine Learning and Data 45

It follows that
e12 + · · · + w
w ek2 = k w2 ≤ w12 + · · · + wk2 ,
and hence ∥ we ∥2 ≤ ∥ w ∥2 . Thus, w
e is also a minimizer of ridge regression problem. Since the
minimizer is unique we have w e = w, and so the minimizer must set w1 = · · · = wk = w. ■
(b) We again write the least squares part of the lasso loss in the above form. Without loss
of generality, we may assume that w1 + · · · + wk ≥ 0, the proof being similar in the negative
case. We define w e in the same way as in part (a), and note that w e1 = · · · = w
ek = w ≥ 0.
Furthermore,

|w
e1 | + · · · + |w
ek | = w + · · · + w = k w = w1 + · · · + wk ≤ |w1 | + · · · + |wk |,

from which it follows that ∥ w e ∥1 ≤ ∥ w ∥1 . Therefore w e is also a minimizer of the lasso


regression problem, and its first k weights all have the same sign (they are all nonnegative).
They are also all exactly the same, as in part (a), except in this case the solution of lasso
is not unique, since the loss function is not strongly or strictly convex, and so we cannot
conclude that w = w. e ■
(c) We have essentially already shown this in the previous parts. By the displayed equation
e The
in part (a), the least squares portion of the lasso regression loss is the same for w and w.
regularizer term is also the same, since

|w1 | + · · · + |wk | = w1 + · · · + wk = w
e1 + · · · + w
ek = |w
e1 | + · · · + |w
ek |. ■

2.6. ♡ Prove that (7.53) holds.


Solution: For any i = 1, . . . , n,

∥ Xw − y ∥2 + λ ∥ w ∥1 = wi2 ∥ vi ∥2 − 2 wi bi + λ |wi | + ∥ y ∥2 ,

where  
X
bi = vi · y − wj vj  . ■
j̸=i
46 Chapter 7. Introduction to Machine Learning and Data

2.9. ♡ Write Python code that solves the lasso regression problem using the iterative
shrinkage–thresholding algorithm (ISTA). Test your code on the diabetes data set.
Solution:

Python Notebook: Lasso via ISTA (.ipynb)

2.11. Let x1 , . . . , xk+1 ∈ R be distinct real numbers, i.e., xi ̸= xj when i ̸= j. (a) ♡ Prove
that the corresponding (k + 1) × (k + 1) Vandermonde matrix (7.56) is nonsingular. Hint :
Prove that ker Z = {0}, by using the fact that a nonzero polynomial of degree k can have
at most k roots. (b ) ♢ Given data points y1 , . . . , yk+1 ∈ R, an interpolating polynomial p(x)
satisfies p(xi ) = yi for all i = 1, . . . , k + 1. Prove that there exists a unique interpolating
polynomial of degree ≤ k for any collection of data points. Hint : Write the interpolation
conditions in vectorial form using the Vandermonde matrix.
T
Solution: (a) Suppose c = ( c0 , c1 , . . . , ck ) ∈ ker Z. Define the degree k polynomial
p(x) = c0 + c1 x + · · · + ck xk . Then the condition Z c = 0 implies p(xi ) = 0 for i = 1, . . . , k + 1
which, by the hint, implies p(x) ≡ 0 and hence c0 = c1 = · · · = ck = 0. Thus ker Z = {0}. ■

T
3.1. ♡ Let F : Rn → R2 , so F (x) = ( F1 (x), F2 (x) ) , be the output of a binary classifier,
which we assume to be a probability vector. Explain why we only need to learn the scalar-
valued function F1 (x) and threshold at F1 (x) = 0.5 to perform binary classification.
Solution: Since F (x) is a probability vector, we have F1 (x) + F2 (x) = 1 and Fi (x) ≥ 0
for i = 1, 2. Given F1 (x), we can deduce F2 (x) = 1 − F1 (x). Moreover, F (x) is closer to e1
if 0 ≤ F1 (x) < 0.5 and closer to e2 if 0.5 < F1 (x) < 1.
   
0 0 −1
3.2. ♡ Given the data matrix X =  1 0  and labels y =  1  find three separating
0 1 1
hyperplanes, and find the maximal margin SVM classifier.
Solution: Any line of the form x1 + x2 = c for 0 < c < 1 is a separating hyperplane. The
maximum margin classifier is obtained by setting c = 0.5, which makes the line equidistant
from all points.
3.3. ♡ Use sklearn in Python to train a soft-margin SVM to be applied to the data matrix
   
0 0 0 −1
 1 −1 1   1 
X=  and labels y =  . What are the values for w and b? The Python
0 1 1 −1
2 −2 3 1
notebook for this section will be helpful, but keep in mind that, in Python, the labels must
be nonnegative, so use y = (0, 1, 0, 1)T and adapt your result.
Chapter 7. Introduction to Machine Learning and Data 47

Solution: A Python notebook is provided below, and the found values, which perfectly
separate the data, are w = (0.6664919, −0.66675405, 0.66622976)T and b = 0.99965047.

Python Notebook: SVM Toy Homework Problem (.ipynb)

3.8. ♡ Consider a linearly separable data set, where there exists a solution (w0 , b0 ) of the
hard-margin SVM problem (7.62). Let (wλ , bλ ) be a solution of the soft-margin problem (7.64)
for λ > 0. (a ) Show that ∥ wλ ∥ ≤ ∥ w0 ∥. (b ) Show that yi (xi · wλ − bλ ) + λ m ∥ w0 ∥2 ≥ 1.
Therefore, in the linearly separable case, the soft-margin SVM problem with small λ provides
a good approximation to the solution of the hard-margin problem.
Solution: We’ll prove both parts at the same time. Since (wλ , bλ ) minimizes the soft-
margin SVM problem,

1 X  1 X 
m m
λ∥ wλ ∥2 + 1 − yi (xi · wλ − bλ ) + ≤ λ ∥ w0 ∥2 + 1 − yi (xi · w0 − b0 ) + .
m i=1 m i=1

Since (w0 , b0 ) solves the hard-margin SVM problem, yi (xi · w0 − b0 ) ≥ 1 for all i, and so

1 X
m
[1 − yi (xi · w0 − b0 )]+ = 0.
m i=1

Therefore
1 X 
m
λ ∥ wλ ∥2 + 1 − yi (xi · wλ − bλ ) + ≤ λ ∥ w0 ∥2 .
m i=1

Since all terms on the left hand side are nonnegative, we deduce that λ ∥ wλ ∥2 ≤ λ ∥ w0 ∥2 ,
which proves part (a), and
1 
1 − yi (xi · wλ − bλ ) + ≤ λ ∥ w0 ∥2 ,
m
which upon rearranging, proves part (b). ■

4.1. ♡ Let x1 , . . . , xm ∈ R be a collection of one dimensional data points. Show that we


can compute the nearest neighbor of every point xi in the data set in O(m log m) operations,
compared to the O(m2 ) operations it would take to compare distances between every pair of
data points | xi − xj |.
Hint: Recall that the computational complexity of sorting m numbers is O(m log m) [47].

Solution: We first sort the numbers in ascending order, which takes O(m log m) operations,
and we can henceforth assume x1 ≤ x2 ≤ · · · ≤ xm . Then the nearest neighbor of xi is either
xi−1 or xi+1 , unless i = 1 or i = m, in which case there is only one choice. Thus, we only
need to check two points for each xi , which is an additional 2 m operations, yielding a total
of O(m log m) operations. ■
48 Chapter 7. Introduction to Machine Learning and Data

4.5. ♡ Let x1 , . . . , xm ∈ Rn and y1 , . . . , ym ∈ Rc denote the training data for a k-nearest


neighbor classifier. Define 
1, ∥ x ∥ ≤ 1,
G(x) =
0, otherwise.
Show that there exists a function H : R n → R, which may depend on the training data, such
that the classification decision of a uniformly weighted k-nearest neighbor classifier using the
norm ∥ · ∥ can be deduced from the function

X
m  
x − xi
F (x) = G yi . (7.72)
i=1
H(x)

How does the classification decision relate to F (x)? You can assume the k-th nearest neighbor
of x is unique, so no ties have to be broken.
Solution: If we set H(x) to be the distance from x to its k-th nearest neighbor, then F (x)
is the sum of the label vectors corresponding to each of these neighbors, which, up to a factor
of 1/k, is the uniformly weighted k-nearest neighbor classifier (provided no ties have to be
broken). The normalization by 1/k does not affect the label decision, which is simply taken
as the largest component of F (x). ■

5.1. ♡ Given a data set with m points, prove that there are 2m−1 − 1 possible ways to cluster
the data into 2 nonempty clusters. Remark: The generalization of this result to k clusters is
provided by the Stirling numbers of the second kind, cf. [92].
Solution: Without loss of generality, we can assume the first data point is in the first
cluster; otherwise just relabel the clusters. Each of the other m − 1 data points can be either
in cluster 1 or 2, leading to 2m−1 different possible clusterings. Except one of them has all
the data points in the first cluster while the second cluster is empty, so we must subtract 1
from this total. ■
5.4. ♡ Consider Exercise 5.3 in dimension n = 1 and assume c1 < c2 . Show that
 
C1 = x x ≤ 12 (c1 + c2 ) , C2 = x x > 21 (c1 + c2 ) .

Solution: In this case the inequality for C1 reduces to

x (c2 − c1 ) = x w ≤ b = 1
2 (c22 − c21 ) = 1
2 (c2 − c1 ) (c2 + c1 ).

Dividing by c2 − c1 > 0 completes the proof. The proof for C2 is the same calculation with
> replacing ≤. ■

5.6. ♡ Complete the proof of Lemma 7.11 by showing that C1t+1 is nonempty.
Solution: Assume that C1t+1 = ∅, which means that xi · wt+1 > bt+1 for all i = 1, . . . , m.
2 ·w
Then we would have that ct+1 2 ·w
> bt+1 , which contradicts the fact that ct+1 ≤ bt+1 ,
t+1 t+1

which holds by definition of w t+1


and b . Thus, we must have C2 ̸= ∅.
t+1 t+1

Chapter 7. Introduction to Machine Learning and Data 49

5.9. ♡ (Robust k-means clustering) The exercise is focused on the robust k-means algorithm,
which is guided by minimizing (7.82). We start with distinct randomized initial values for the
means c01 , c02 , . . . , c0k chosen from the data set, and iterate the steps below until convergence.
(i) Update the clusters as in (7.76).
(ii) Update the cluster centers
X
ct+1
j ∈ argmin ∥ x − c ∥. (7.85)
c
x∈Cjt
(a) Show that the robust k-means algorithm descends on the energy Erobust .
(b) The cluster center ct+1
j does not admit a closed form expression and is sometimes
inconvenient to work with in practice. Consider changing the Euclidean norm in (7.82) to the
1 norm and redefine Erobust as
X
m
Erobust (c1 , c2 , . . . , ck ) = min ∥ cj − xi ∥1 . (7.86)
1≤j≤k
i=1

This is called k-medians clustering. Formulate both steps of the k-medians algorithm so
that it descends on the k medians clustering energy (7.86). In particular, show that the
cluster centers ct+1
j are the coordinatewise medians of the points x ∈ Cjt , which are simple to
compute.
(c) Can you think of any reasons why the Euclidean norm would be preferred over the 1
norm in the k-means energy?
(d) Challenge : Implement the robust k-medians algorithm in Python.
Solution: (a) The proof is very similar to Theorem 7.13. We compute

X
k X X
k X
Erobust (ct1 , . . . , ctk ) = ∥ ctj − x∥ = ∥ ct+1
j − x∥
j =1 x∈Cjt j =1 x∈Cjt

X
k X
≥ min ∥ ct+1
ℓ − x ∥ = Erobust (ct+1 t+1
1 , . . . , ck ),
1≤ℓ≤k
j = 1 x∈Cjt

and hence the energy decreases at each step. ■


(b) The only change in the k-medians algorithm is the cluster centers step (7.85), which
in this case amounts to minimizing
X X X
n X
n X
∥ x − c ∥1 = | x i − ci | = | xi − ci |.
x∈Cjt x∈Cjt i=1 i=1 x∈Cjt

T
over c = ( c1 , c2 , . . . , cn ) . Hence, the coordinate ci should be chosen to minimize
X
| xi − ci |.
x∈Cjt

We claim that an optimal ci is the median of the i-th coordinates of points in Cjt . This is
equivalent to showing that a solution c of
X
m
min f (c) := | ai − c |
c
i=1
50 Chapter 7. Introduction to Machine Learning and Data

for any given real numbers ai ∈ R is the median of a1 , . . . , am . To see this, we may, without
loss of generality, assume that a1 ≤ · · · ≤ am . We can then differentiate f at any aj < c < aj+1
 
d X
j Xm
f ′ (c) =  (c − ai ) + (ai − c)  = j − (m − j) = 2 j − m.
dc i=1 i=j+1

Thus, f is decreasing when j ≤ m/2 and increasing when j ≥ m/2, which implies that any
median is a minimizer of f . ■
(c) Unlike the Euclidean norm, the 1 norm is not invariant under rotations of Euclidean
space, so you may get different clusterings upon rotating the data, which may not be desirable.
(d)

Python Notebook: k medians solution (.ipynb)

6.1. ♡ Consider the inner product feature map kernel Kϕ (x, y) = ⟨ ϕ(x), ϕ(y) ⟩. Show that
there exists a feature map ψ such that Kϕ (x, y) = ψ(x) · ψ(y), so that there is no loss of
generality in using the dot product in the definition of feature map kernels.
Solution: If C is the symmetric positive definite matrix that defines the inner product,
then we have Kϕ (x, y) = ⟨ ϕ(x), ϕ(y) ⟩ = ϕ(x)T C ϕ(y). Let S = C 1/2 be its square root,
which is also symmetric positive definite. This allows us to write
   
Kϕ (x, y) = ϕ(x)T S 2 ϕ(y) = S ϕ(x) T S ϕ(y) = S ϕ(x) · S ϕ(y) = ψ(x) · ψ(y),

where ψ(x) = S ϕ(x). ■


6.4. ♡ Let K be the sigmoid kernel function. (a ) Find x so that K(x, x) < 0 when κ < 0 or
c < 0. (b ) Show that K is not a Mercer kernel even when c > 0 and κ > 0. Hint: Look for
an m = 2 counterexample.
Solution: (a ) If either κ < 0 orc < 0, then we can find x such that κ ∥ x ∥2 + c < 0 and
hence K(x, x) = tanh κ ∥ x ∥2 + c < 0. (b ) Let x1 be a unit vector, ∥ x1 ∥ = 1, and let
x2 = a x1 . Then
   
K(x1 , x1 ) K(x1 , x2 ) tanh(κ + c) tanh(κ a + c)
det = det
K(x1 , x2 ) K(x2 , x2 ) tanh(κ a + c) tanh(κ a2 + c)
= tanh(κ + c) tanh(κ a2 + c) − tanh(κ a + c)2 .

Since tanh t < 1 for all t, and limt→∞ tanh t = 1, in the limit as a → ∞, the above determinant
has limiting value tanh(κ + c)−1 < 0, which implies it is negative for a sufficiently large. This
implies that the corresponding 2×2 sigmoid kernel matrix cannot be positive semidefinite. ■
Chapter 7. Introduction to Machine Learning and Data 51

6.7. ♡ Show that c defined by (7.102) is the minimal Euclidean norm solution of (7.104)
when y ∈ img K. What happen when y ̸∈ img K?
Solution: By the discussion leading to (7.102), any solution of (7.104) is given by

c = (K + λ I )−1 y + v,

where v ∈ ker K. If y ∈ img K, then y = Kz and so

(K + λ I )−1 y = (K + λ I )−1 Kz = K(K + λ I )−1 z,

and so the first term in the definition of c above belongs to img K as well. Since img K and
ker K are orthogonal complements, we have

∥ c ∥2 = ∥ (K + λ I )−1 y ∥2 + ∥ v ∥2 ,

hence the minimal Euclidean norm solution is obtained by setting v = 0. When y ̸∈ img K,
the first term (K + λ I )−1 y has a non-zero (orthogonal) projection p onto ker K. Setting
v = −p results in a smaller Euclidean norm, and so c given by (7.102) is not the minimal
Euclidean norm solution in this case. ■
Chapter 8

Principal Component Analysis

1.1. ♡ Construct the 5 × 5 covariance matrix for the data set from Exercise 1.1 in Chapter 7,
and find its principal variances, principal standard deviations, and principal directions. What
do you think is the dimension of the subspace the data lies in?
Solution: Covariance matrix:
 
0.844687 −0.62375 2.09625 −0.71 −1.48875
 −0.62375 3.995 −8.625 3.01 4.675 
 
SX =  2.09625 −8.625 19.375 −6.74 −10.865 
 
−0.71 3.01 −6.74 2.36 3.77
−1.48875 4.675 −10.865 3.77 6.295

Principal variances: 1017.39, 0.8656, 0.00086, 0.000178, 0.0;


Principal standard deviations: 31.8966, 0.93037, 0.02938, 0.01335, 0.0;
Principal directions:
         
0.08677 −0.80181 0.46356 −0.067789 0.36067
 −0.34556   −0.44715   −0.057289   0.13724   −0.81150 
         
 0.77916 ,  0.086880 ,  0.31273 ,  −0.29037 ,  −0.45084 .
         
−0.27110 −0.056296 −0.18453 −0.94302 0.0
−0.43873 0.38267 0.80621 −0.05448 −0.09017

The data lies close to a two-dimensional subspace spanned by the first two principal directions,
and is precisely on the four-dimensional subspace spanned by the first four principal directions.
Chapter 8. Principal Component Analysis 53

1.2. ♡ Using the Euclidean norm, compute a fairly dense sample of points on the unit sphere
S = { x ∈ R 3 | ∥ x ∥ = 1 }. (a ) Set µ = .95 in (8.3), and then find the principal components
of your data set. Do they indicate the two-dimensional nature of the sphere? If not, why not?
(b ) Now look at the subset of your data that is within a distance r > 0 of the north pole,
T
i.e., ∥ x − ( 0, 0, 1 ) ∥ ≤ r, and compute its principal components. How small does r need to
be to reveal the actual dimension of S? Interpret your calculations.

Solution:

Python Notebook: Solution to Exercise 1.2. (.ipynb)

1.4. ♡ Show that the first principal direction q1 can be characterized as the direction of the
line that minimizes the sums of the squares of its distances to the data points. Hint : Use
Theorem 5.43.

Solution: Given a line in the direction of a unit vector u, the sum of the squares of its
distances to the data points is

X
m X
m
  Xm
∥ xi − (xi · u) u ∥2 = ∥ xi ∥2 − (xi · u)2 = ∥ xi ∥2 − ∥ X u ∥2 .
i=1 i=1 i=1

This is minimized by maximizing the second term,

∥ X u ∥2 = uT (X T X) u,

over all unit vectors u. By the optimization principle in Theorem 5.43, the maximum is the
largest eigenvalue of X T X which, assuming X ̸= O, is the square of the largest singular value
of X. Moreover, the maximizing unit vector is one of the unit eigenvectors for this eigenvalue,
i.e., one of the dominant unit singular vectors, so u = ± q1 . ■

1.5. ♡ Prove Proposition 8.3.

Solution: Let J denote the m × m centering matrix (7.5). Then,

SY = Y T Y = (J Y )T (J Y ) = (J XW )T (J XW )

= W T (J X)T (J X)W = W T (X T X)W = W T SX W.

2.1. ♡ Verify equation (8.21).

Solution: The second equality is immediate since S = Y T Y . Moreover, according to


Exercise 2.8, tr (U T Y T Y U ) = tr ((Y U )T Y U ) is the sum of the squared Euclidean norms of
the columns of Y U , which are simply Y uj . Thus

X
k X
k
tr (U T Y T Y U ) = ∥ Y uj ∥2 = uTj Y T Y uj . ■
j =1 j =1
54 Chapter 8. Principal Component Analysis

2.3. ♡ This exercise considers the problem of fitting the best subspace in a general inner
p p
product norm ∥ x ∥C = ⟨ x, x ⟩C = xT C y, where C is symmetric, positive definite. Given
T
points x1 , . . . , xm ∈ R n , let X = ( x1 , . . . , xm ) be the corresponding data matrix. Then,
given a subspace V ⊂ R , define the distance and squared energy
n

X
m
distC (x, V ) = min { ∥ x − y ∥C | y ∈ V } , EC (V ; x1 , . . . , xm ) = distC (xi , V )2 .
i=1

(a) Show that the k-dimensional subspace minimizing EC is the one spanned by the top
k eigenvectors q1 , . . . , qk of the matrix S = X T XC.
(b) What happens if we minimize over affine subspaces W = a + V ? What choice of a is
optimal?
(c) Formulate equivalent optimization principles as was done in Remark 8.12.

Solution: (a) Since ∥ x ∥C = xT C x = ∥ C 1/2 x ∥, we can set zi = C 1/2 xi and W = C 1/2 V ,
so
EC (V ; x1 , . . . , xm ) = E(W ; z1 , . . . , zm ).
Note that the corresponding data matrix for Z is Z = XC 1/2 . By Theorem 8.9, the best
subspace is the space W spanned by the top k eigenvectors of Z T Z = C 1/2 X T XC 1/2 , which
we denote by p1 , . . . , pk , so C 1/2 X T XC 1/2 pi = λi pi . The optimal subspace V is thus
V = C −1/2 W , which is spanned by the vectors qi = C −1/2 pi , which are the top k eigenvectors
of S = X T XC.
(b) With an offset, so W = a + V , we again have

distC (x, W ) = distC (x − a, V ) = dist(C 1/2 x, C 1/2 a + C 1/2 V ),


1 X 1/2
m
so by Lemma 8.8 an optimal value for C 1/2 a is the mean C 1/2 a = C xi . Multiplying
m i=1
by C −1/2 , we see that the mean vector x is again an optimal choice for a.
(c) The optimization principles can be formulated by part (a) by using Z = XC 1/2 . The
equivalent to (8.20) is
n o
min ∥ XC 1/2 − XC 1/2 U U T ∥2F U T U = I .

Make the change of variables U = C 1/2 V to find


n o
min ∥ (X − XC V V T ) C 1/2 ∥2F V TCV = I ,

which is subject to the more natural orthogonality condition in the C inner product. Notice
the C 1/2 appear at the end inside the Frobenius norm amounts to taking the sum of the
squared C norms of each row of the matrix X − C V V T .
The equivalent to (8.22) is
n o
max tr (U T C 1/2 X T XC 1/2 U ) U T U = I .

Again, the change of variables U = C 1/2 V yields



max tr (V T CX T XCV ) V T CV = I .
Chapter 8. Principal Component Analysis 55

Recalling that the adjoint of V in the C inner product is V ∗ = C −1 V T C, we can use properties
of the trace to write this as

max tr (V ∗ S V C) V T C V = I , where S = X T XC.

3.1. ♡ Generate a plot of the singular values for the rows versus blocks of the image in this
section. Which ones decay faster?

Solution: The block-wise singular values decay faster.

Python Notebook: Solution to Exercise 3.1 (.ipynb)

3.5. ♡ In this exercise, you will extend the PCA-based compression algorithm from this
section to audio compression. Complete the parts (a) through (c); the notebook below will
help you get started.

Python Notebook: Audio Compression (.ipynb)

(a) Use the block-based image compression algorithm described in this section for audio
compression. You can use any audio file you like; thePython notebook linked above
downloads a classical music sample from the textbook GitHub website. A stereo audio
signal is an array of size n × 2, where n is the number of samples. Use blocks of size
N × 2 for compression.

(b) Plot the top k = 10 or so principal components. They should look suspiciously like
sinusoids.

(c) When you play back the compressed audio file, you will likely hear some static noise
artifacts, even at very low compression rates. These are caused by blocking artifacts,
where the signals do not match up on the edges of the blocks used for compression, which
introduces discontinuities into the signal. This is similar to the blocking artifacts we
observed in image compression in this section, however, the artifacts are more noticeable
in audio than in images.
To fix this, audio compression algorithms use overlapping blocks, and apply a windowing
function in order to smoothly patch together the audio in each block. The blocks are
structured so that half of the first block overlaps with half of the second block, and
so on. To implement this in Python, just shift the signal by half of the block width,
and apply the image_to_patches function on the original and shifted signals. Then
compress and decompress both signals. After decompressing, and before converting
back from the block format to the audio signal, you’ll need to multiply by a windowing
function to smooth the transition between blocks. If the block size is N × 2, then each
56 Chapter 8. Principal Component Analysis

channel should be multiplied by a window function wi , i = 0, 1, . . . , N − 1. A common


window function that is used, for example, in mp3 compression, is
  
2 1 π
wi = sin i+ .
2 N
After you decompress and apply the window, undo the shift and add the signals together
to get the decompressed audio. Does this improve the audio quality? As a note, in order
to make sure the shifted signals add up correctly, we need that
wi + wi+N/2 = 1.
As an exercise, the reader should check that the window function above satisfies this
condition, which is called the Princen-Bradley condition.
Solution:

Python Notebook: Audio Compression Solutions (.ipynb)

4.4. ♡ Write Python code to implement the version of LDA where a singular within class
covariance is handled according to Exercise 4.2, and compare to covariance shrinkage on
MNIST data. Hint: Instead of trying to figure out exactly which singular values are zero, a
more numerically stable approach is to truncate all singular values less than a threshold ε > 0
to zero. In this problem, it works well to, for example, just take the top 100 eigenvectors of
Sw on MNIST.
Solution:

Python Notebook: Solution to Exercise 4.4. (.ipynb)

4.5. ♡ Show that the solution of (8.47) is the matrix U = Qk whose columns are the top k
discriminating directions. Hint : See Remark 8.12.
1/2
Solution: Let us write V = Sw U , so that U T Sw U = V T V . Then
−1/2 −1/2
 −1/2 −1/2

tr (U T Sb U ) = tr (Sw V )T Sb Sw V = tr V T Sw Sb Sw V .
Hence, in terms of the matrix V , the trace LDA problem (8.47) is
−1/2 −1/2

max V T Sw Sb Sw V .
VTV=I

According to Remark 8.12, the solution is the m × k matrix V = ( q1 . . . qk ) whose columns


−1/2 −1/2
are the top k eigenvectors of the symmetric positive semidefinite matrix Sw Sb Sw . Then
−1/2 −1/2 1/2
pi = Sw qi satisfy Sb pi = Sb Sw qi = λi Sw qi = λi Sw pi and so are the LDA discrim-
−1/2
inating directions. We then set U = Qk = ( p1 . . . pk ) = Sw V . ■
Chapter 8. Principal Component Analysis 57

C
5.3.♡ Let C be a positive definite symmetric
√ matrix and define the distance matrix DX to be
the distance matrix in the norm ∥ x ∥C = x Cx, with entries dij = ∥ xi − xj ∥C . Generalize
T 2

Proposition 8.21 and Theorem 8.22 to this setting. In particular, how do you construct the
optimal isometric embedding in this case?

Solution: The key is to note that ∥ x ∥C = ∥ C 1/2 x ∥, from which it follows that DX
C
= DY
1/2 1/2 −1/2
where yi = C X, so Y = XC and X = Y C . Thus, (8.51) becomes

− 12 J DX
C
J = X C XT .
C
From Theorem 8.22, the requirement is again that J DX J is negative semidefinite. The
1/2 −1/2
optimal isometric embedding is Xk = Qk Λk C , where

− 21 J DX
C
J = Qk Λk QTk

is the reduced spectral decomposition. ■


5.5. ♡ Show that there do not exist three points z1 , z2 , z3 ∈ R that satisfy
|z1 − z2 | = |z1 − z3 | = |z2 − z3 | = 1.
Solution: Let us assume z1 < z2 < z3 , which would necessitate z2 − z1 = 1 and z3 − z2 = 1,
and so z3 − z1 = z3 − z2 + z2 − z1 = 2 =
̸ 1. ■
Chapter 9

Graph Theory and Graph-based


Learning

1.1. Sketch the graphs corresponding to the following adjacency matrices.


 
  0 1 0 1 0 0
  0 1 1 1 1
1 0 0 0 1 0
0 1 0 1 0 1 1 1  
  0 0 0 1 0 0
(a ) ♡  1 0 1 ; (b)  1 1 0 1 1 ; (c ) ♡  .
  1 0 1 0 1 0
0 1 0 1 1 1 0 1  
0 1 0 1 0 1
1 1 1 1 0
0 0 0 0 1 0

Solution: (a ) (c )

1.2. Sketch the digraphs corresponding to the following adjacency matrices.


 
      0 0 1 0 0
0 1 0 1
0 1 0 0 1 0 0 0 1 1 0
0 0 1 0  
(a) ♡  0 0 1 ; (b) ♢  1 0 1 ; (c ) ♡  ; (d)  1 0 0 0 1 .
1 0 0 0  
1 0 0 0 1 0 0 0 0 0 0
0 1 1 0
1 0 1 0 0

Solution: (a ) (c)
Chapter 9. Graph Theory and Graph-based Learning 59

1.3. Write out an adjacency matrix for the following digraphs.

(a ) ♡ (b ) (c ) ♡

(d) (e ) ♢ (f )

Solution: We label the nodes in order from top to bottom and, when at the same height,
from left to right.
 
  0 0 0 0 0
0 0 0 0
1 0 0 0 0
0 0 0 1  
(a )  ; (c)  1 0 0 1 0 .
1 1 0 1  
0 1 0 0 1
0 0 0 0
0 1 0 0 0
1.4. Write out an adjacency matrix for graphs given by the edges of the Platonic solids:
(a)♡ tetrahedron, (b)♡ cube, (c)♢ octahedron, (d) dodecahedron, and (e ) icosahedron.
 
0 1 1 1
1 0 1 1
Solution: (a ) Tetrahedron:  ;
1 1 0 1
1 1 1 0
 
0 1 1 0 1 0 0 0
1 0 0 1 0 1 0 0
 
1 0 0 1 0 0 1 0
 
0 1 1 0 0 0 0 1
(b) Cube:  .
1 0 0 0 1 1 0 0
 
0 1 0 0 1 0 0 1
 
0 0 1 0 1 0 0 1
0 0 0 1 0 1 1 0
1.6. ♡ True or false: Let A be the adjacency matrix for an unweighted digraph. Then the
underlying unweighted graph has adjacency matrix A = A b+A bT.

Solution: True if there is only one edge between any pair of nodes, but false if there are
b+A
two edges since then the corresponding entry of A b T is 2.
60 Chapter 9. Graph Theory and Graph-based Learning

1.11. A connected graph is called a tree if it has no circuits. (a) Find an adjacency matrix
for each of the following trees:

(i ) ♡ (ii) ♡ (iii ) ♢ (iv )

(b ) ♢ Draw all distinct trees with 5 nodes, and write down the corresponding adjacency
matrices. (c ) Prove that any two nodes in a tree are connected by one and only one path.
 
  0 0 1 0 0
0 0 1 0
0 0 1 0 0
0 0 1 0  
Solution: (a) (i )  , (ii)  1 1 0 1 0 .
1 1 0 1  
0 0 1 0 1
0 0 1 0
0 0 0 1 0

2.2. Verify Euler’s formula for each of the Platonic solids of Exercise 1.4.
Solution: (a) Tetrahedron: number of nodes = 4, number of edges = 6, number of circuits
= 3, Euler’s formula: 4 + 3 = 6 + 1;
(b) Cube: number of nodes = 8, number of edges = 12, number of circuits = 5, Euler’s
formula: 8 + 5 = 12 + 1.
2.3. Draw the digraph represented by the following incidence matrices:
     
−1 0 1 0 1 0 −1 0 0 1 0 0 −1
 1 0 0 −1   0 1 0 −1   −1 0 1 0 0 
(a ) ♡  , (b)  , (c ) ♡  ,
0 −1 1 0 −1 1 0 0 0 0 0 −1 1
0 1 0 −1 0 0 1 −1 0 −1 1 0 0
 
  0 1 −1 0 0 0 0
−1 0 1 0 0
 −1 0 0 1 0 0 0 
 0 −1 0 1 0  
   0 0 0 −1 1 0 0 
(d ) ♢  1 −1 0 0 0 , (e )  .
   0 −1 0 0 0 0 1 
0 0 0 −1 1  
0 0 −1 0 0 1 0
0 0 −1 0 1
0 0 0 0 0 1 −1

Solution: (a ) or, equivalently,

(c )
Chapter 9. Graph Theory and Graph-based Learning 61

2.4. For each of the digraphs in Exercise 1.3, see if you can determine a collection of indepen-
dent circuits of the underlying graph. Verify your answer by writing out the incidence matrix
and constructing a suitable basis of its cokernel.
   
−1 0 1 0 0
 0 −1 1 0   −1 
Solution: (a)  , 1 circuit:  ;
0 1 0 −1 −1
0 0 1 −1 1
     
−1 0 1 0 0 −1 0
 −1 1 0 0 0   1   0 
     
 0 −1 0 1 0   1   −1 
(c)  , 2 circuits:  ,  .
 0 −1 0 0 1   0   1 
     
0 0 1 −1 0 1 0
0 0 0 1 −1 0 1
2.5. ♡ A complete graph Gm on m nodes has one edge joining every distinct pair of nodes.
(a ) Draw G3 , G4 and G5 . (b ) Choose an orientation for each edge and write out the result-
ing incidence matrix of each digraph. (c) How many edges does Gn have? (d) How many
independent circuits? (e) Find a spanning tree and the corresponding basic circuits.
Solution:

(a )

 
1 −1 0 0 0
 1 0 −1 0 0 
   
1 −1 0 0 1 0 0 −1 0 
   
 1 0 −1 0  1 0 0 0 −1 
1 −1 0    
 1 0 −1 , 1 0 0 −1   0 1 −1 0 0 
(b )  ,  .
 0 1 −1 0  0 1 0 −1 0 
0 1 −1    
0 1 0 −1 0 1 0 0 −1 
 
0 0 1 −1 0 0 1 −1 0 
 
0 0 1 0 −1
0 0 0 1 −1
(c ) 21 m (m−1); (d) 12 (m−1)(m−2); (e ) One option is to choose all m−1 edges starting at a
single node, say node 1. The basic circuits are the triangles with nodes 1, i, j for 1 < i < j ≤ m.
62 Chapter 9. Graph Theory and Graph-based Learning

e are incidence matrices of the same size and


2.11. ♡ True or false: If N and N
e , then the corresponding digraphs are equivalent.
coker N = coker N

Solution: False. For example, any two inequivalent trees, cf. Exercise 1.11, with the same
number of nodes have incidence matrices of the same size, both with trivial cokernels, so
coker N = coker Ne = {0}. As another example, the incidence matrices
   
1 −1 0 0 0 1 −1 0 0 0
 0 1 −1 0 0   0 1 −1 0 0 
  e = 
N =  −1 0 1 0 0  and N  −1 0 1 0 0 
   
1 0 0 −1 0 1 0 0 −1 0
1 0 0 0 −1 0 1 0 0 −1
T
both have cokernel basis ( 1, 1, 1, 0, 0 ) , but do not represent equivalent digraphs.

3.2. Determine the graph Laplacian and its spectrum for the graphs with adjacency matrices
listed in Exercise 1.1.
 
1 −1 0
Solution: (a ) L =  −1 2 −1 , eigenvalues = 3, 1, 0;
0 −1 1
 
2 −1 0 −1 0 0
 −1 2 0 0 −1 0 
 
 0 0 1 −1 0 0 
(c ) L =  , eigenvalues = 4.8136, 3., 2.5293, 1., 0.65708, 0.
 −1 0 −1 3 −1 0 
 
0 −1 0 −1 3 −1
0 0 0 0 −1 1

3.5. ♡ In Proposition 9.21, assume that N is the incidence matrix for a weighted digraph G,b
without the restriction that each pair of nodes (i, j) has at most one directed edge between
them. Show that L = N T CN is the graph Laplacian for the underlying weighted graph G.

Solution: Referring to the displayed formula in the proof of Proposition 9.21, the weight
wij in the second summation will equal the sum of the weights wik jk of all edges with ik = i
and jk = j or ik = j and jk = i, i.e., of all the edges connecting node i and node j in
either direction, and hence equals the weight of the edge connecting i and j in G. With this
interpretation, the formula continues to identify N T CN = L. ■

3.6. ♡ Let G be a connected graph with m nodes and with graph Laplacian matrix L. Let
P = ( I − 1) be the (m − 1) × m matrix whose first m − 1 columns form the (m − 1) × (m − 1)
identity matrix and whose last column has all −1 entries.
(a) Show that the (m − 1) × (m − 1) matrix P L P T is positive definite.
(b) Let b ∈ Rm satisfy b · 1 = 0, and let y ∈ Rm be the unique solution of P L P T y = P b.
Show that x = P T y solves L x = b and x · 1 = 0.
(c) Suppose b · 1 ̸= 0 in part (b). What equation does x = P T y satisfy?

Solution: (a) Note that if y ∈ R m−1 , then


T
x = P T y = y1 , . . . , ym−1 , − y1 − · · · − ym−1 ∈ 1⊥
lies in the orthogonal complement to the ones vector. Moreover, if y ̸= 0, then x ̸= 0. Since G
Chapter 9. Graph Theory and Graph-based Learning 63

is connected, the kernel of L is spanned by 1. Thus, by Corollary 4.9, yT P L P T y = xT L x > 0


for all x ̸= 0.
(b) Part (a) shows x · 1 = 0. Moreover, P (L x − b) = 0, hence L x − b ∈ ker P , which is
spanned by 1. Together these imply L x − b = 0.
(c) We have L x = b − b 1 where b = b · 1/m = (b1 + · · · + bm )/m is the mean of b. The
right hand side is the centered version of b, with mean 0. Indeed, P T (L x − b) = 0, and hence
L x − b = c 1 ∈ ker P T = ker L. Thus, 1T (L x − b) = − 1T b = m c, hence c = − 1T b/m.

4.1. ♡ Show that the quantity (m − k) k is maximized over integers 0 ≤ k ≤ m by setting


k = 21 m when m is even, and k = 12 (m − 1) or k = 12 (m + 1) when m is odd.
Solution: Complete the square: (m − k) k = −(k − 12 m)2 + 14 m2 . Thus the left hand side
is maximized by minimizing | k − 21 m | over all integers k, which is achieved in the indicated
manner.
4.4. ♡ Suppose we allow loops in a graph, by permitting one or more diagonal entries of the
weight matrix to be positive: wii > 0. Recall from Exercise 3.3 that the Laplacian matrix is
unchanged by loops. Is this also true for the modularity matrix?
Solution: No, because adding in loops at the nodes with weights wii changes dbi = di + wii ,
c become
and hence, in view of (9.42), the entries of the new modularity matrix M
(di + wii ) (dj + wjj )
b ij = wij −
m ,
d1 + · · · + dm + w11 + · · · + wmm
which are clearly not the same as mij . For a simple example, consider the graph with 2 nodes
 
w 1
with weight matrix W = where w ≥ 0. Then d = (w + 1, 1)T and dT 1 = w + 2.
1 0
d d 1
Thus, for example, the lower right entry of M is m22 = 0 − 2T 2 = − , which clearly
d 1 w+2
depends on the value of the diagonal entry w.
Intuitively, the reason is that modularity counts how many edges are in a given subset
of the graph, and adding self loops affects this count. In contrast, the graph Laplacian is
connected to graph cuts, which do not see loops, since you would never cut a loop when
clustering a graph.
4.7. ♡ Implement spectral modularity optimization for community detection in Python via
the power method and detect communities in Zachary’s karate club graph and Krebs’ political
books graph to reproduce the results from this section. Try other graphs in the graphlearning
package as well. Hint: You will have to use a spectral shift of the modularity matrix, as we
did for finding the Fiedler vector with the power iteration; see Exercise 4.5(a).
Solution:

Python Notebook: Solution to Exercise 4.7. (.ipynb)


64 Chapter 9. Graph Theory and Graph-based Learning

5.2. ♡ Let x1 , . . . , xm ∈ Rn , and let T be the m × m matrix with entries tij = ∥ xi − xj ∥.


Show that T satisfies the triangle inequality tik ≤ tij + tjk for all 1 ≤ i, j, k ≤ m.
Solution: This is immediate from the triangle inequality for the norm:

tik = ∥ xi − xk ∥ = ∥ (xi − xj ) + (xj − xk ) ∥ ≤ ∥ xi − xj ∥ + ∥ xj − xk ∥ = tij + tjk . ■

5.3. ♡ Show that the dynamic programming principle (9.52) can be reformulated as

ti = 0 for i ∈ D, and max wji (ti − tj ) = 1 for i ∈ Dc . (9.56)


j

Solution: Note that we can rewrite (9.52) to read


−1

min wji wji (tj − ti ) + 1 = 0.
j

−1
Indeed, we simply subtract ti from both sides of (9.52) and factor out wji . Now, the term
−1
wji can be removed, while still keeping the minimum value at zero; indeed, this weight is
positive so the term in parentheses must vanish at the minimum. Therefore,
 
min wji (tj − ti ) + 1 = 0
j

for all nodes i in the graph. We can further rewrite this by multiplying by −1 on both sides
to obtain (9.56). ■

5.4. ♡ Let G = (N , E) be a disconnected graph and D ⊂ N . Suppose there is a connected


component H ⊂ N for which H ∩ D = ∅. Show that there is no solution t ∈ Rm to the
dynamic programming principle (9.52). Hint : Suppose a solution t exists, and consider the
node i that minimizes ti over i ∈ H. It is easier to work with the reformulation (9.56).
Solution: If i ∈ H is a point where ti is minimal over i ∈ H, then ti − tj ≤ 0 for all j ∈ H,
and so wji (ti − tj ) ≤ 0 for all j, which violates (9.56). ■
5.5. ♡ Let G = (N , E) be a connected graph with m nodes, and let D ⊂ N . Assume that
u, v ∈ Rm satisfy the reformulated dynamic programming principle (9.56) with inequalities

max wji (ui − uj ) ≤ 1 ≤ max wji (vi − vj )


j j

for all i ∈ Dc , and suppose that ui ≤ 0 ≤ vi for i ∈ D. Show that u ≤ tD ≤ v.


Hint: Use a similar argument to the proof of Lemma 9.41.
Solution: Let t = tD , and suppose to the contrary that ui > ti for some i ∈ Dc . Then
there exists λ > 1 such that ui > λ ti . Let i be the index where ui − λ ti is largest. Since the
difference is positive, i ∈ Dc . Therefore ui − λ ti ≥ uj − λ tj for all j = 1, . . . , m. Rearranging
the inequality and multiplying the result by wji on both sides yields wji (ui −uj ) ≥ λ wji (ti −tj )
for all j = 1, . . . , m. Maximizing both sides of the latter inequality over j implies that λ ≤ 1,
which is a contradiction. The proof that tD ≤ v is similar. ■
Chapter 9. Graph Theory and Graph-based Learning 65

6.1. ♡ Let xk satisfy the diffusion equation (9.60) and set zk = D−1 xk . Show that zk solves
the adjoint equation (9.62).

Solution: Since W = W T is symmetric, so is the graph Laplacian LT = L. Thus

zk+1 = D−1 xk+1 = D−1 xk − D−1 LTrw xk



= D−1 xk − D−1 LT D−1 xk = zk − D−1 L zk = zk − Lrw zk .

6.3. ♡ The symmetric normalized graph Laplacian (9.65) defines another type of diffusion
on graphs, which is given by
yk+1 = yk − Lsym yk . (9.82)

Let xk satisfy the diffusion equation (9.60) and set yk = D−1/2 xk . Show that yk solves the
symmetric diffusion equation (9.82).

Solution:

yk+1 = D−1/2 xk+1 = D−1/2 xk − D−1/2 LTrw xk = yk − D−1/2 L D−1 xk


 ■
= yk − D−1/2 L D−1/2 D−1/2 xk = yk − Lsym yk .

6.4.♡ Let xk satisfy the diffusion equation (9.60). Give an example of a graph where x0 = 1,
but xk ̸= 1 when k ≥ 1.

Solution: An example is In this case,

   
1 0 0 0 0 1 −1 0 0 0
0 2 0 0 0  −1 2 −1 0 0 
   
D = 0 0 3 0 0 , L= 0 −1 3 −1 −1 ,
   
0 0 0 1 0 0 0 −1 1 0
0 0 0 0 1 0 0 −1 0 1
 
1 − 21 0 0 0
 
 −1 1 − 21 0 0 
 
 
and so LTrw = 0 − 12 1 −1 −1 . Here are the first few iterates:
 
 
 0 0 − 31 1 0 
0 0 − 13 0 1
  1 2  2   5   13 
1 2 3 3 9 18
  4 4 1   13   28 
1      9    
  3 3 0   9   27 
           
x0 =  1 , x1 =  5 , x2 =  4 , x3 =  7 , x4 =  13 , x5 =  41 .
  2 3  3   9   18 
  1 5  4   7   13 
1 3 6  9   9   27 
1 5 4 7 13
1 3 6 9 9 27

The graph is not aperiodic, and the diffusion process has a periodic limiting behavior, switch-
T T
ing back and forth between 43 , 1, 94 , 12 , 21 and 21 , 32 , 32 , 34 , 34 .
66 Chapter 9. Graph Theory and Graph-based Learning

6.5. ♡ Show that the solution xk of the diffusion equation (9.60) is given by
X
m
xk = D (1 − λi )k (x0 · pi ) pi , (9.83)
i=1

where p1 , . . . , pm are the orthonormal eigenvector basis of Lrw with eigenvalues λ1 , . . . , λm .


Solution: Using Exercise 6.1 and (9.68),
X
m
x k = D zk = D (1 − λi )k ⟨ pi , z0 ⟩D pi .
i=1

Furthermore, ⟨ pi , z0 ⟩D = pTi D z0 = pTi x0 = x0 · pi . Substituting this into the preceding


equation produces the result. ■
6.7. ♡ State and prove Theorem 9.61 for xk and yk , the respective solutions of the diffusion
equations (9.60) and (9.82).
Solution: According to Exercises 6.1 and 6.3, xk = D zk and yk = D1/2 zk . Thus, the
limiting distributions are
X 1 ·x X (D1/2 1 ) · y
H H
x∞ = D 0
1 , y∞ = D1/2 0
1H .
∥ 1H ∥2D H ∥ 1H ∥2D
H H

Moreover,

∥ xk − x∞ ∥D−1 ≤ τ k ∥ x0 ∥D−1 , ∥ yk − y∞ ∥ ≤ τ k ∥ y0 ∥, k ≥ 0,

where τ = min | 1 − λ |, | 1 − λm | . In particular, if G is aperiodic, then τ < 1 and hence
xk → x∞ and yk → y∞ as k → ∞. ■
6.10.♡ Let x∞ solve (9.80) with W not necessarily symmetric. Let v be a probability vector.
Show that ∥ x∞ − v ∥1 ≤ 2 α.
Solution:

∥ x∞ − v ∥1 = ∥ α P x∞ − α v ∥1 ≤ α ∥ P x∞ ∥1 + α ∥ v ∥1 ≤ α ∥ x∞ ∥1 + α ∥ v ∥1 = 2 α,
because both v and x∞ are probability vectors. ■
6.15. ♡ Let G be a connected weighted graph. Show that there exists C > 0, depending only
on G, such that
Xm
C X
m
u2i ≤ wij (ui − uj )2 , (9.86)
i=1
2 i,j=1

for all u ∈ Rm satisfying u · 1 = 0. The result is known as a Poincaré inequality.


Solution: By Proposition 9.16, the inequality (9.86) is equivalent to
uT L u
∥ u ∥2 ≤ C uT L u or ≥ C −1 ,
∥ u ∥2
for all u ∈ Rm such that u · 1 = 0. Since 1 ∈ ker L, which is one-dimensional, the result
follows from the optimization principle in Theorem 5.47, upon setting C = λ−1
2 > 0. ■
Chapter 9. Graph Theory and Graph-based Learning 67

7.1. ♡ Prove that the diffusion distance satisfies the triangle inequality (9.88).

Solution: By the triangle inequality for the C norm:


(k)
di ℓ = ∥ P k (ei − eℓ ) ∥C = ∥ P k (ei − ej ) + P k (ej − eℓ ∥C
(k) (k)

≤ ∥ P k (ei − ej ) ∥C + ∥ P k (ej − eℓ ) ∥C = dij + dj ℓ .

7.4. ♡ Let S be a self-adjoint matrix for the inner product ⟨ x, y ⟩C = xT C y, where C is


symmetric positive definite, and let S = Q Λ QT C be its spectral decomposition. Follow the
proof of Theorem 9.69 to show that ∥ S k (ei − ej ) ∥C = ∥ xi − xj ∥, where x1 , . . . , xm ∈ Rm
are the rows of the matrix X = C Q Λk .

Solution: Since S k = Q Λk QT C and QT C Q = I , we have

∥ S k (ei − ej ) ∥2C = (ei − ej )T C Q Λk QT C Q Λk QT C (ei − ej )



= (ei − ej )T XX T (ei − ej ) = ∥ X T (ei − ej ) ∥2 = ∥ xi − xj ∥2 .

7.6.♡ Use Remark 8.12 to show that the first k eigenvectors q1 , . . . , qk of the graph Laplacian
L solve the optimization problem

X
k
min uTi L ui = tr (U T L U ), (9.97)
i=1

where the minimum is over all m × k matrices U = ( u1 . . . uk ) with orthonormal columns.


What is the minimum value of this optimization problem?

Solution: This is immediately deduced from Remark 8.12 and Theorem 8.9, by replacing
Y T Y with − L. The minimum value is the sum λ1 + · · · + λk of the first k eigenvalues of L.

8.2. ♡ Show that equation (9.106) holds.

Solution: Note first that, using the formula for γ in (9.100),


 
" #
X m
−1  1 X m
1
∇zℓ log 
 1 + ∥ zi − zj ∥2 =
 γ ∇zℓ
i,j=1 i,j=1
1 + ∥ zi − zj ∥2
i̸=j i̸=j ■
2 X −2 (zℓ − zj ) X
m m
=  = −4 γ 2
qℓj (zℓ − zj ).
γ j =1 1 + ∥ zℓ − zj ∥2 2
j =1

T
8.4. ♡ Consider a probability vector pλ = ( λ, 1 − λ ) ∈ R 2 , where 0 < λ < 1, and let
h(λ) be its entropy. (a) Write down a formula for h(λ). (b) Show that h(λ) is increasing for
0 < λ < 21 and decreasing for 12 < λ < 1.

Solution: (a) h(λ) = − λ log λ − (1 − λ) log(1 − λ).


(b) h′ (λ) = − log λ + log(1 − λ) = log(λ−1 − 1), which is positive for 0 < λ < 1
2 and
negative for 12 < λ < 1.
68 Chapter 9. Graph Theory and Graph-based Learning

8.7. ♡ Suppose we generalize the definition of Q in (9.100) to read



 1 f ∥z − z ∥2 , i ̸= j Xm

qij = γ i j
where γ= f ∥zi − zj ∥2 , (9.112)

0 i = j, i,j=1
i̸=j

where f : R → R. Show that the gradient of the corresponding Kullback–Leibler divergence


(9.101) is given by
!
4 X pij 
∇z i E = − 1 g ∥ zi − zj ∥2 (zi − zj ), (9.113)
γ i,j=1 qij
i̸=j

where g(t) = − f ′ (t). Verify that this gives the correct t-SNE gradient when f (t) = 1/(1 + t).

Solution: In this case, (9.102) becomes


 
X
m X
m
  X
m

b ,...,z ) = −
E(z pij log qij − pij log f ∥ zi − zj ∥2 − log  f ∥zi − zj ∥2 
1 m  ,
i,j=1 i,j=1 i,j=1
i̸=j

and hence, using (9.104) and mimicking the computations in (9.105) and (9.106),

Xm
f ′ ∥zi − zj ∥2 4 X ′
m

b
∇zi E = ∇zi E = − pij  ∇zℓ ∥ zi − zj ∥2 − f ∥zi − zj ∥2 (zℓ − zj )
i̸=j
f ∥zi − zj ∥ 2 γ j=1
!
4 X pℓj 
= − 1 g ∥zℓ − zj ∥2 (zℓ − zj ).
γ qℓj
ℓ,j=1
j̸=ℓ

Replacing ℓ by i completes the demonstration. 


If f (t) = 1/(1 + t), then g(t) = 1/(1 + t)2 , so g ∥ zi − zj ∥2 = γ 2 qij
2
for i ̸= j, as in
(9.100), and hence (9.113) reduces to (9.107).
8.9. ♡ Implement the SNE algorithm in Python and compare the results to t-SNE on small
toy data sets. The Python notebook from this section will be helpful.
Solution:

Python Notebook: Solution to Exercise 8.9. (.ipynb)


Chapter 9. Graph Theory and Graph-based Learning 69

8.10.♡ This exercise explores the uniform manifold approximation (UMAP) algorithm [160],
which is a variant of t-SNE that can often give better visualizations. The algorithm starts
with a weight matrix W constructed over the high dimensional data for which the maximum
entry in every row is exactly one; that is maxj wij = 1 for all i.
(a) The UMAP algorithm uses the symmetrization pij = wij + wji − wij wji , which in
particular is not normalized to be a probability matrix. Show that 0 ≤ pij ≤ 1.
(b) For the embedded data points, UMAP uses the weights

1
qij = , (9.114)
1 + a ∥ zi − zj ∥2b

for parameters a, b, which are also not normalized. The loss function for UMAP is defined by
" ! !#
Xm X m
pij 1 − pij
E(z1 , . . . , zm ) = pij log + (1 − pij ) log , (9.115)
i=1 j=1
qij 1 − qij
j̸=i

which is referred to as fuzzy cross-entropy. Take a = b = 1, and show that

X
m
∇zi E = 4 (1 − qij )−1 qij (pij − qij )(zi − zj ). (9.116)
j=1

Solution: (a) Note that pij = 1 − (1 − wij )(1 − wji ) ∈ [ 0, 1 ].


(b) For the purposes of computing the gradient, we may write
m X
X m
 
E=− pij log(qij ) + (1 − pij ) log(1 − qij ) .
i=1 j=1
j̸=i

Since 

−2qij (zi − zj ), if ℓ = i ̸= j
2

∇zℓ qij = −2qij


2
(zj − zi ), if ℓ = j ̸= i


0, ̸ i, j,
if ℓ =
we have
m X
X m
 
−1
∇zℓ E = − pij qij − (1 − pij )(1 − qij )−1 ∇zℓ qij
i=1 j=1
j̸=i
Xm
  2
−1
=2 pℓj qℓj − (1 − pℓj )(1 − qℓj )−1 qℓj (zℓ − zj )
j=1
j̸=ℓ X
m
  2
−1
+2 piℓ qiℓ − (1 − piℓ )(1 − qiℓ )−1 qiℓ (zℓ − zi )
i=1
i̸=ℓ
X
m
=4 (1 − qℓj )−1 qℓj (pℓj − qℓj ) (zℓ − zj ).
j=1
70 Chapter 9. Graph Theory and Graph-based Learning

9.5. ♡ Implement the following binary semi-supervised learning algorithm in Python. Let
y ∈ Rm be the label vector encoding the given labels, with yi = +1 if i is in one class, yi = −1
if i is in the other class, and yi = 0 otherwise. Assume the number of labels in each class is
the same. Start from the zero vector u0 = 0, and run the iterations

uk+1 = D−1 W uk + D−1 y (9.135)

until convergence. This is essentially the power iteration for spectral clustering that we say
previously in Exercise 4.3, except we are adding the labels as a source term. After convergence,
threshold the vector uk at zero to produce the two classes. Try your algorithm on some simple
binary classification problems, like two moons, circles, and pairs of MNIST digits.

Solution:

Python Notebook: Solution to Exercise 9.5. (.ipynb)

9.6. ♡ Under the assumptions of Exercise 9.5, show that lim uk = u, where u is the unique
k→∞
solution of Lu = y that satisfies ⟨ u, 1 ⟩D = 0.
Solution: First, since there are equal numbers of points in both classes, we have 1 · y = 0,
so y ∈ (ker L)⊥ = img L. This implies that the equation L u = y has a solution u which
is unique up to the addition of a term c 1 ∈ ker L for some c ∈ R. We choose c so that
⟨ 1, u ⟩D = 0. Note that the resulting u satisfies u = D−1 W u + D−1 y. Thus, zk = uk − u
satisfies the adjoint diffusion equation zk+1 = D−1 W zk . Since u0 = 0, we have ⟨ 1, z0 ⟩ = 0.
Then Theorem 9.61 tells us that zk → 0 as k → ∞. ■
9.7. ♡ Under the assumptions of Exercises 9.5, 9.6, show that u is the unique minimizer of
E(u) = 12 xT L x − yT u satisfying ⟨ u, 1 ⟩D = 0. Hint: Use Theorem 9.61.
Solution: Since y · 1 = 0 we have y ∈ (ker L)T = img L, and so, by Theorem 6.9, E admits
a minimizer, which solves L u = y. As in Exercise 9.6, we select the unique solution to this
system satisfying the mean zero constraint.
9.8.♡ For very large sparse graphs, direct methods struggle to solve the Laplacian regularized
learning problem (9.120). Implement a spectrally truncated solver, as outlined in Exercise
3.7 in Chapter 5, and apply it to graph-based semi-supervised binary classification on the
MNIST data set.
Solution:

Python Notebook: Solution to Exercise 9.8. (.ipynb)


Chapter 9. Graph Theory and Graph-based Learning 71

9.12. ♡ Let G be an unweighted graph with adjacency matrix A. Let k ≥ 0 such that for
each pair of adjacent nodes i, j — so that aij = 1 — there are at least k other nodes that are
adjacent to both i and j; that is the number of indices ℓ ̸= i, j for which aiℓ = 1 = ajℓ is at
least k. Show that s
2 uT L u
| ui − uj | ≤ (9.136)
k+2
holds for all u ∈ Rm and all adjacent nodes i, j. Note : The inequality (9.136) allows us to
estimate the difference in the label predictions between two adjacent nodes in terms of the
size of the regularization term uT L u.
Solution: Fix any adjacent pair i, j, and let ℓ be any other index. Then, by the triangle
inequality and Cauchy’s inequality (6.98),

| ui − uj |2 ≤ | ui − uℓ | + | uℓ − uj | 2
 
= (ui − uℓ )2 + (uj − uℓ )2 + 2 | ui − uℓ | | uℓ − uj | ≤ 2 (ui − uℓ )2 + (uj − uℓ )2 .

Let J ⊂ {1, . . . , m} be the set of indices corresponding to the adjacent pair i, j, so, by the
assumption, # J ≥ k. By Proposition 9.16 and the computation above,

X
m
 
uT Lu ≥ (ui − uj )2 + aiℓ (ui − uℓ )2 + ajℓ (uj − uℓ )2
ℓ∈J
X  1X
≥ (ui − uj ) +
2
(ui − uℓ )2 + (uj − uℓ )2 ≥ (ui − uj )2 + (ui − uj )2 ■
2
ℓ∈J ℓ∈J
 
k k+2
≥ 1+ (ui − uj )2 = (ui − uj )2 .
2 2

10.1. ♡ Find a formula for the the reciprocal 1/z of a nonzero complex number z ̸= 0 in
terms of its real and imaginary parts. Then write out the formula for complex division z/w.
Hint: Use the formula z z = | z |2 .
Solution:
1 z 1 x − iy
= , z ̸= 0, or, equivalently, = 2 .
z | z |2 x + iy x + y2

The general formula for complex division is

w wz u + iv (x u + y v) + i (x v − y u)
= or = .
z | z |2 x + iy x2 + y 2
72 Chapter 9. Graph Theory and Graph-based Learning

10.3. ♡ Use the complex orthogonality formulas (9.179) to prove that the real vectors xk , yk
in (9.141) are mutually orthogonal, and, moreover, the Euclidean norm formulas (9.144)
hold. Remark: A purely real proof, based on trigonometric identities, is doable, but more
challenging.

Solution: Using (9.178),


1 1 
1 = zk · zk = zTk zk = (xk + i yk )T (xk − i yk ) = ∥ xk ∥2 + ∥ yk ∥2 ,
m m
and hence ∥ xk ∥2 +∥ yk ∥2 = m. In particular, since y0 = 0 = yn , the latter holding only when

m = 2 n is even, we have ∥ x0 ∥ = ∥ xn ∥ = m under the same condition on m. Furthermore,
if 1 ≤ k < 12 m, we have

1 1  2
0 = zk · z−k = zTk zk = (xk + i yk )T (xk + i yk ) = ∥ xk ∥2 − ∥ yk ∥2 + (x · y ) i ,
m m m k k
which, by the previous result, implies ∥ xk ∥2 = ∥ yk ∥2 = 1
2 m, and, moreover, xk · yk = 0. ■

10.7. ♡ Prove the unnormalized DFT convolution formulas in (9.211).


Solution: In view of (9.184), (9.185), the formulas in (9.210) become
√ 
Gm (z ∗ w) = m Fm (z ∗ w) = m Fm z ◦ Fm w = Gm z ◦ Gm w.

10.8. Use the fast Fourier transform to find the discrete Fourier coefficients for the following
functions using the indicated number of sample points. Carefully indicate each step in your
analysis.
x
(a ) ♡ , n = 4; (b) ♢ sin x, n = 8; (c) ♡ | x − π |, n = 8; (d) sign(x − π), n = 16.
π
     1 
0 0 2  3 
1     4
  (0)  1  (1)  − 1  − + i
 2 , c = c(2) = 
1 1
Solution: (a ) f =  2   
 , c =  1 , c =    4 1 4 ;
1 2  1  −4
− 14 + 41 i
3 3
− 12  
  
2
 2
    1
π π 1
π 1
π π
     2   2   √2 
 3π  0   1π   1π   2+1 √ π
4     2   4   
1  1   1     8 2

 π  π  π   0   0 
2  2   2     
1  1     1  √ 
 π  (0)  π  (1)  0  (2)  π   2−1 
         8 √2 π 
(c) f =  4
 , c = 2
3  , c =  1  , c =  1
4
 , c = c(3)
=  .
 0   4π  2π   2π   0 
         
1  1   1   1+ i   √2−1 
 4π  4π  4π   8 π  
         8 √2 π 
 1π  1π  1π   0   
2  4   2     0 
√ 
1− i
3
4 π 3
4 π − 1
4 π 8 π 2+1
√ π
8 2
Chapter 9. Graph Theory and Graph-based Learning 73

10.9. Use the inverse fast Fourier transform to reassemble the sampled function data corre-
sponding to the following discrete Fourier coefficients. Carefully indicate each step.
(a ) ♡ c0 = c2 = 1, c1 = c3 = −1, (b ) ♢ c0 = c1 = c4 = 2, c2 = c6 = 0, c3 = c5 = c7 = −1.
       
1 1 2 0
 −1   1   0  0
Solution: (a ) c =  , f (0) = , f (1) = , f = f (2) =  .
1 −1 −2 4
−1 −1 0 0

10.14.♡ Let G and Ge be two weighted graphs with m and m e nodes, and with weight matrices
W and W f , respectively. The Cartesian product graph G × G, e whose nodes are indexed by pairs
(i, j) of nodes in each graph, so 1 ≤ i ≤ m and 1 ≤ j ≤ m, e and whose weight matrix is denoted
by W × W f , with edge weights

wik , if j = ℓ

(W × Wf) = wejℓ , if i = k
(i,j),(k,ℓ)


0, otherwise.

Note that W × W f is an m m e × mm e matrix, and the nodal set for G × Ge is the Cartesian
product of the nodal sets of each graph. Let L and L e denote the graph Laplacian matrices for
e respectively. For u ∈ Rm and u
G and G, e ∈ Rme e ∈ Mm×m
, let u × u e be the matrix with (i, j)
ej which we identify in the usual manner with a vector in Rm m
entry ui u e e denote
. Let L × L
e × mm
the m m e graph Laplacian matrix on G × G. e
(a) Show that if u is an eigenvector of L with eigenvalue λ, and u e
e is an eigenvector of L
e e e
e is an eigenvector of L × L with eigenvalue λ + λ.
with eigenvalue λ, then u × u
e
(b) Do the eigenvectors obtained this way form a basis for Rm m ?

Solution: (a) This is just a matter of unwrapping definitions. Note that the neighbors of
the node (i, j) in G × Ge are the nodes (k, ℓ) where either k is a neighbor of node i in G and
j = ℓ, or ℓ is a neighbor of node j in Ge with i = k. Thus, the degree of node (i, j) in the
Cartesian product graph is simply the sum of the degrees di + dej of node i and node j in their
respective graphs. Following this same logic,

e (u × u
(L × L) e ) = (Lu) × u e u.
e + u × Le

The eigenvector condition yields

e (u × u
(L × L) e) = λ u × u eu × u
e+λ e (u × u
e = (λ + λ) e ). ■

(b) The answer is yes. To see this, we just need to show that the m m e eigenvectors of
the form ui × u eu
e j , where L ui = λi ui and L ej = λ e u
e , are linearly independent. To see this
j j
assume that
X m X e
m
e j ) = 0,
cij (ui × u
i=1 j=1

which can be arranged to read

X
m e
X
m
ui × pi = 0, where pi = ej .
cij u
i=1 j=1
74 Chapter 9. Graph Theory and Graph-based Learning

e
In the more familiar matrix notation, where we identify a vector v × w ∈ Rm m with the
m×m e matrix v w , this reads
T
Xm
ui pTi = O.
i=1
T
Writing pi = pi,1 , . . . , pi,m
e , the j-th column of the preceding equation is

X
m
pi,j ui = 0.
i=1

Since the ui are linearly independent, we deduce that pij = 0 for all i, j, and so pi = 0 for all
e j , we conclude that cij = 0
i. Using the definition of pi and the linear independence of the u
for all i, j, which establishes the claim. ■
10.16. ♡ Write a Python notebook to extend the signal denoising example with the DFT to
two dimensional signals (i.e., images) and use it to denoise natural images corrupted by noise.
The Python notebook from this section will be helpful to start out.
Solution:

Python Notebook: Solution to Exercise 10.16. (.ipynb)


Chapter 10

Neural Networks and Deep


Learning

1.1. ♡ Explain why we can, without loss of generality, assume that all but the last activation
function in a neural network can be taken to be non-affine.

Solution: If σk is an affine function, then, since a composition of affine functions is also


an affine function, cf. (3.64), Fk would be an affine function of zk−1 , and hence pk+1 would
also be an affine function of zk−1 . But we could then omit the (k + 1)-st layer and generate
the (k + 2)-nd preactivation directly from the kth by an affine function.
1.2.♡ Consider a bias-free ResNet, as in (10.15) but where all the bias vectors bk are removed,
i.e., we set bk = 0. Explain why any  ResNet with input x ∈ R n has a bias-free version that
x
gives the same outputs by taking ∈ Rn+1 as input.
1
     
x f Wk bk f Wk x + bk
Solution: Set xe = and W k = e =
so that W k x . Thus,
 
1 0T 1 1
Vk 0
setting Ve k = , and using (10.15), the bias-free ResNet layer
0T 1
   
e e f x + Vk σ(Wk x + bk ) Fk (x)
F k (e e + V k σ(W k x
x) = x e) = =
1 1

coincides with the biased version (10.15) once the last entry of the vector is ignored.

1.3.♡ Modify the Python notebooks from this section to use the loss (10.20) for classification
instead of the softmax and negative log likelihood loss. Try different norms for the ∥ F (xi ) ∥
term. In particular, try the p norm with different values for p; in particular, try p > 2 (for
example, p = 5). Are you able to get similar accuracy to the softmax composed with negative
log likelihood loss?
76 Chapter 10. Neural Networks and Deep Learning

Solution: We experimented with 2 ≤ p ≤ 5. The accuracy is similar to softmax/cross


entropy loss, though for large values, like p = 5, the results are actually very slightly better.

Python Notebook: Solution to Exercise 1.3. (.ipynb)

2.1. ♡ Show that the Jacobian of the function F (x) = σ(W x + b) equals DF (x) = D W ,
where D = diag σ ′ (W x + b).
!
Xn
Solution: The i-th component of F is σ wik xk + bi , and so its derivative with
! k = 1
X
n
respect to xj is wij σ ′ wik xk + bi = wij dii , which is the (i, j) entry of D W . ■
k=1

2.4. ♡ Prove Theorem 10.5. Hint: Mimic the proof of Theorem 10.3.

Solution: For k = 0, . . . , L, let Gk = ℓ ◦ FL ◦ · · · ◦ Fk+1 , so that ℓ F (x) = Gk (zk ), where
zk = Fk (zk−1 ; wk ) and z0 = x. Since Gk ◦ Fk = Gk−1 , the chain rule yields
 
vk−1 = ∇zk−1 ℓ F (x) = ∇zk−1 Gk Fk (zk−1 ; wk )
= Dz Fk (zk−1 , wk )T ∇zk Gk = Dz Fk (zk−1 ; wk )T vk ,

which establishes the first formula in (10.32). Similarly,


 
∇wk ℓ F (x) = ∇wk Gk Fk (zk−1 ; wk ) = Dw Fk (zk−1 , wk )T ∇zk Gk = Dw Fk (zk−1 ; wk )T vk ,

proving the second formula. ■


2.5.♡ Implement a one hidden layer neural network in Python using only the numpy package.
Use the formulas from Example 10.4 to compute the gradients of the network and implement
plain gradient descent for training. Try your network on some of the basic classification
examples from Section 10.1, including MNIST.
Solution: The notebook below implements a 1-hidden layer ReLU neural network with
64 hidden nodes in numpy for classification of MNIST digits using softmax and negative log
likelihood loss. The code is written to look like PyTorch, though everything is explicitly
written in numpy. The training is with vanilla stochastic gradient descent with a learning rate
of α = 1 and batch size of 128. Convergence is slower than with advanced optimizers like
Adam, and the backpropagation runs slower as well. If you wait for about 50 epochs, the
testing accuracy is around 95%, and continues to increase after that.

Python Notebook: Solution to Exercise 2.5. (.ipynb)


Chapter 10. Neural Networks and Deep Learning 77

3.1. ♡ Modify the Python notebook from this section to use the Adam optimizer instead of
Adadelta and see if you can obtain similar results. You may have to adjust the learning rate
to ensure the loss decreases during training.
Solution: Change the optimizer line to read
optimizer = [Link]([Link](), lr=learning_rate)
You will also have to change the learning rate to
learning_rate = 0.01
You will get an accuracy above 98%, but it may not be as high as the original notebook with
the Adadelta optimizer.

3.2. ♡ How many parameters are in the convolutional part and the fully connected part of
the CNN used for MNIST digit recognition in this section? Count just the weight matrices
and ignore the biases, for simplicity.
Solution: The first convolutional layer has 32 3 × 3 filters applied to grayscale (single
channel) images, which is 9 × 32 = 288 parameters. The second layer has 64 3 × 3 × 32 filters
(since they are applied to 32 channel images from the first layer), which is 64 × 9 × 32 = 18432
parameters. Thus, the convolutional layers have 18720 parameters. The fully connected
network has two layers and 1024 hidden nodes, with 1600 inputs and 10 outputs. Thus, the
number of parameters in the weight matrices is 1600 × 1024 + 1024 × 10 = 1648640. Thus,
there are about 88 times as many parameters in the fully connected part of the network, as
compared to the convolutional part.

3.4. ♡ Modify the Python notebook from this this section so that the CNN can be trained
for fully supervised classification on the CIFAR-10 data set, which has color images of size
32 × 32 × 3, and train the network in a similar way as we did for MNIST (in particular, keep
the size of the network roughly the same). What accuracy do you get?
Solution: The main thing we need to do is modify the architecture so the input layer allows
three channels, and keep in mind that the images are 32 × 32, so after the two convolutional
and max pooling layers, we’ll have 6 × 6 images, not 5 × 5 as with MNIST, so we’ll have
6 × 6 × 64 = 2304 convolutional features. After 20 epochs of training, the accuracy will be
around 74%, and it does not increase much upon training for longer. The best results on
CIFAR-10 are above 99%, using either much larger CNN models, or transformers (see Section
10.5) [62]. The solution notebook is below.

Python Notebook: Solution to Exercise 3.4. (.ipynb)

4.1. ♡ What conditions could you place on w in Lemma 10.12 so that the result (10.42)
remains true even when the eigenvalues of Lsym are not distinct?
Solution: If λi = λj then you need wi = wj in order define an interpolating polynomial
(which now has smaller degree) satisfying p(λi ) = wi for all i.
78 Chapter 10. Neural Networks and Deep Learning

4.2. ♡ Show that the spectral graph convolution in Definition 10.11 can be written as

X
m
x ∗G y = (qi · x) (qi · y) qi . (10.48)
i=1

Solution: The i-th entry of QT x is qi · x and similarly for QT y. Thus the i-th entry of
their Hadamard product is (qi · x) (qi · y). The result thus follows from the column-based
formula (3.21) for multiplying a matrix by a vector. ■
4.5. ♡ Implement a diffusion GCN, which uses diffusion (10.44) as the convolution in each
layer. This has tunable parameters wj that need to be learned. Write your code where the
degree of the convolution k can be arbitrarily chosen. You may want to use [Link]
to define the weights wj in the neural network. For simplicity, use the same convolution in
each layer. Try this on the PubMed data set, and compare to the Python notebook from this
section.

Solution: The Python notebook is below. You will generally find the result worse at low
label rates for the more expressive diffusion convolutions.

Python Notebook: Solution to Exercise 4.5. (.ipynb)

5.1. ♡ Explain how the initial vector embedding of tokens into R n can be viewed as a linear
function, i.e., matrix multiplication, of the M × n token embedding matrix E with a matrix
containing the one-hot encoding ei of each token in the context window.
Solution: The token matrix E is the M × n matrix whose i-th row represents the vector
embedding in Rn of the i-th token in the dictionary. Suppose we have a context with tokens
indexed by i1 , . . . , im ∈ {1, . . . , M }. The corresponding tokens as row vectors are given by
eTi1 E, . . . , eTim E, which are simply the corresponding rows of E. Thus, the matrix X whose
rows are the embeddings of the m tokens is the matrix X = (ei1 . . . eim )T E.

5.3. ♡ Let π be a permutation of the integers 1, . . . , m. (a ) Show that if π(i) = i, then the
output zi of the attention layer (10.52) is unchanged by permuting the tokens with π, that is

1 X
m
zi = aπ(i) π(j) V xπ(j) .
dπ(i) j=1

(b ) Show that the same is true for masked self-attention described in Remark 10.19 provided
π(i) = i and π(j) ≤ i for all j ≤ i.
Chapter 10. Neural Networks and Deep Learning 79

Solution: (a) If π(i) = i then

1 X 1 X
m m
zi = aij V xj = a V xπ(j) ,
di j=1 di j=1 i π(j)

where we simply re-indexed the sum in the second inequality.


(b) The condition on π implies that π restricted to {1, . . . , i − 1} is a permutation of the
first i − 1 tokens. Thus, we can apply the same argument as in (a) to find that masked self
attention is
1 X 1 X
zi = aij V xj = ai π(j) V xπ(j) .
di di
j≤i j≤i

6.1. ♡ Prove that the space of piecewise affine functions A forms a vector space. Explain in
detail why it is also affine invariant.
Solution: Any constant multiple of an affine function is also affine: c (a x+b) = (c a) x+(c b).
Consequently, a constant multiple of a piecewise affine function is also piecewise affine with
the same nodes. Similarly, the sum of two affine functions is affine: (a x + b) + (c x + d) =
(a + c) x + (b + d). Thus, the sum of two piecewise affine functions that have the same nodes
is also piecewise affine. If the two piecewise affine functions have different nodes, then the
nodes of their sum will be the union of their nodes. Further, they are both piecewise affine on
the union of nodes (adding one of more new nodes does not alter the status of the function
being piecewise affine) and hence, by the preceding observation, the sum is piecewise affine
on the union of the nodes.
To prove affine invariance, first if x = c y +d with c ̸= 0, then a x+b = a c y +(a d+b) is also
affine. We conclude that applying such an affine transformation to a piecewise affine function
with nodes xi produces a piecewise affine function with nodes yi satisfying xi = c yi + d.
6.4. ♡ Prove that if u ∈ Lip(R), there exists a 2 π-periodic function u
e ∈ Lip(R) such that
e(x) for all 0 ≤ x ≤ 1.
u(x) = u
Solution: Let a = u(0+ ) and b = u(1− ). Set u
e be the 2 π periodic extension of the function

u(x), 0 ≤ x ≤ 1, a−b
e(x) =
u where µ= ,
b + (a − b) (x − 1)/(2 π − 1), 1 ≤ x ≤ 2 π, 2π − 1
e is continuous on all of R. If 0 ≤ y ≤ x ≤ 1, then
which is defined so that u
|u
e(x) − u
e(y) | = | u(x) − u(y) | ≤ λ | x − y |,
where λ is the Lipschitz constant of u. If 0 ≤ y ≤ 1 ≤ x ≤ 2 π, then
|u
e(x) − u
e(y) | ≤ | u
e(x) − u
e(1) | + | u
e(1) − u
e(y) | ≤ µ (x − 1) + λ (1 − y) ≤ ν | x − y |,
where ν = max{ λ, | µ | }. A similar proof works if 1 − 2 π ≤ y ≤ 0 ≤ x ≤ 1. If 1 ≤ y ≤ x ≤ 2 π,
then | u
e(x) − u
e(y) | = | µ | | x − y |. Finally, for any x
e, ye ∈ R, we can find j, k ∈ Z such that
x = xe − 2 j π and y = ye − 2 k π satisfy one of the preceding inequalities and, moreover,
|x − y| ≤ |xe − ye |. Then
|u
e(e
x) − u
e(e
y) | = | u
e(x) − u
e(y) | ≤ ν | x − y | ≤ ν | x
e − ye |,
completing the proof. ■
80 Chapter 10. Neural Networks and Deep Learning

6.5. ♡ Prove that, for each k ≥ 1, the function (10.85) is continuous, piecewise affine, and
satisfies (10.86).
Solution: The composition of two affine functions is affine; see (3.64). Therefore, the
composition of two continuous piecewise affine functions is continuous piecewise affine. Finally,
we prove (10.86) by induction on k, the case k = 1 being obvious from the construction of
the hat function g(x). First note that

    2,
1
j odd,
j
gk = 0, j ≡ 0 mod 4,
2k+1 

1, j ≡ 2 mod 4,

where the first case follows since the value of an affine function at the midpoint of an interval
is the average of its values at the ends. Thus, by the definition of the hat function,
 
      1
 g 2 , j odd, 
j j 1, j odd,
gk+1 = g g = g(0), j ≡ 0 mod 4, = ■
2k+1 k
2 k+1 
 0, j even.
g(1), j ≡ 2 mod 4,

6.6. ♡ (a ) Use polar coordinates to prove that, for any a > 0,


ZZ
π
e− a (x +y ) dx dy = .
2 2
(10.99)
R2 a

(b ) Explain why r
Z ∞
− a x2 π
e dx = . (10.100)
−∞ a
Solution: (a) Setting x = r cos θ, y = r sin θ, we have dx dy = r dr dθ, and hence the
double integral is
ZZ Z π Z ∞
e− a (x +y ) dx dy = r e− a r dr dθ
2 2 2

R2 −π 0
Z ∞ ■
π 2 ∞ π
r e− a r dr = − e− a r
2
= 2π = .
0 a r=0 a

(b) By part (a),


ZZ Z ∞ Z ∞
π
e− a (x +y ) dx dy = e− a x e− a y dy dx
2 2 2 2
=
a R2 −∞ −∞
Z ∞ Z ∞  Z ∞ 2
e− a x dx e− a y dy = e− a x dx
2 2 2
= .
−∞ −∞ −∞

Taking square roots of both sides proves the identity. ■


Chapter 10. Neural Networks and Deep Learning 81

6.8. ♡ Let u ∈ Lip(R), and let g be a piecewise affine interpolant of u on an interval [ a, b ].


Prove that Lip[ a,b ] (g) ≤ Lip[ a,b ] (u).
Solution: Suppose ai < ai+1 are two consecutive nodes of g, so g(ai ) = u(ai ) and g(ai+1 ) =
u(ai+1 ), and hence

u(ai+1 ) − u(ai )
g(x) = u(ai ) + (x − ai ) for ai ≤ x ≤ ai+1 .
ai+1 ) − ai

Applying the Lipschitz estimate on u, we bound

| u(ai+1 ) − u(ai ) |
| g ′ (x) | = ≤ Lip[ a,b ] (u) for ai ≤ x ≤ ai+1 .
ai+1 ) − ai

Since this holds on all subintervals of [ a, b ], and a piecewise affine function is piecewise
continuously differentiable, Proposition 6.59 implies Lip[ a,b ] (g) ≤ Lip[ a,b ] (u). ■
6.9. ♡ Suppose u : [ a, b ] → [ c, d ]. Let g be a piecewise affine interpolant of u on [ a, b ]. Prove
that g : [ a, b ] → [ c, d ].

Solution: On each interval where g is affine, it is bounded from above and below by its
values at the endpoints, where it has the same values as u. Thus the minimum and maximum
values of g on [ a, b ] are bounded by the minimum and maximum values of u, which are
bounded from below and above by c and d, respectively.
6.10. ♡ Write down the hat functions corresponding to unequally spaced nodes, and prove
that they still form a partition of unity.
Solution: Let a0 , . . . , an denote the nodes. Then θi (x) is the piecewise affine function
interpolating θi (ai ) = 1 and θi (aj ) = 0 for j ̸= i. Thus, on the interval ai ≤ x ≤ ai+1 , we
have
x − ai+1 x − ai
θi (x) = , θi+1 (x) = , θj (x) = 0, j ̸= i, i + 1,
ai − ai+1 ai+1 − ai
and hence, for such x,
X
n
x − ai+1 x − ai
θj (x) = θi (x) + θi+1 (x) = + = 1. ■
j =0
ai − ai+1 ai+1 − ai
Chapter 11

Advanced Optimization

1.1. ♡ Prove the inequality (11.6).

Solution: First note that for x ∈ [ 0, 1 ], we have F (x) = 0 = F ′ (x), so (11.6) clearly holds.
For x < 0, we have F (x) = 12 x2 and F ′ (x)2 = x2 , so (11.6) is satisfied there. For x > 1, we
have F (x) = 21 (x − 1)2 and F ′ (x) = x − 1, therefore (11.6) holds again. ■

1.3. ♡ Prove Theorem 11.2.


Hint : The proof is nearly identical to Theorem 6.68. Use Lemma 6.64 and Exercise 1.2.

Solution: Using the inequality (6.129) and then the PL inequality (11.2), we find
α  
F (xk+1 ) − F ⋆ ≤ F (xk ) − F ⋆ − ∥ ∇F (xk ) ∥2 ≤ (1 − α µ) F (xk ) − F ⋆
2
By Exercise 1.2 and the restriction α ≤ Lip(∇F )−1 we have 1 − α µ ≥ 0. This allows us to
iterate this final inequality to produce (11.3). ■

1.4. ♡ Show that the function F defined in (11.5) is not strongly convex.

Solution: Set G(x) = F (x) − µ x2 . If 0 < x < 1 and 0 < t < 1, then 0 < (1 − t) x < 1 and
(1 − t)2 < (1 − t). Thus,
 
G t · 0 + (1 − t) x = G (1 − t) x = − µ (1 − t)2 x2 > − µ (1 − t) x2 = t G(0) + (1 − t) G(x),

thus violating convexity. Alternatively, we can check that F ′ (x) = 0 for 0 < x < 1 so F is
not strictly nor strongly convex. ■

2.1. ♡ Suppose β > 1 in (11.15). For which initial conditions x0 , x1 do the iterates not
become unbounded as k → ∞?

Solution: When x0 = x1 since then xk = x0 for all k. Indeed, the solution remains bounded
T T
if and only if the initial vector c1 = x0 ( 1, 1 ) is a multiple of the eigenvector ( 1, 1 ) for the
eigenvalue λ1 = 1.
Chapter 11. Advanced Optimization 83

2.3. ♡ Suppose that H is only positive semidefinite in Theorem 11.10. Assume b ∈ img H
so that there exists x⋆ such that H x⋆ = b. Let U U T be the orthogonal projection matrix
onto img H. Show that for any choice of x⋆ we have
√ p 
∥ U U T (xk − x⋆ ) ∥ ≤ 2 β + ε k ∥ U U T (x0 − x⋆ ) ∥ (11.21)

using all the same conditions as Theorem 11.10, except that λmin (H) is replaced by the
smallest positive eigenvalue in (11.17). Can you prove anything about the convergence of xk
to x⋆ ?
Solution: Let U be the n × k matrix whose columns are the orthonormal eigenvectors of
H corresponding to non-zero eigenvalues. Then U U T is the projection matrix onto img H.
The k × k matrix U T H U has the same non-zero eigenvalues as H, except the eigenvectors are
e1 , . . . , ek ; indeed U T H U ei = λi U T U ei = λi ei , since U ei is the i-th column of U , which
is an eigenvalue of H whose eigenvalue we denote by λi . Thus, U T H U is positive definite
symmetric, and λmin (U T H U ) is the smallest nonzero eigenvalue of H, while λmax (U T H U ) =
λmax (H). We can take the heavy ball method for H, multiply by U T on both sides, and use
H = H U U T to obtain

U T xk+1 = U T xk − α (U T H U U T xk − U T b) + β (U T xk − U T xk−1 ).

Writing S = U T H U and zk = U T xk we have

zk+1 = zk − α (S xk − U T b) + β (zk − zk−1 ).

This is the heavy ball method for S and U T b, so we can apply Theorem 11.10 to obtain
√ p 
∥ zk − U T x⋆ ∥ ≤ 2 β + ε k ∥ z0 − U T x⋆ ∥,

where in the condition (11.18) we note that the argument above shows that λmin (S) is the
smallest nonzero eigenvalue of H, while λmax (S) = λmax (H). We now use the fact that
∥ U U T x ∥ = ∥ U T x ∥ (square both sides and expand using U T U = I ) to obtain
√ p 
∥ U U T (xk − x⋆ ) ∥ ≤ 2 β + ε k ∥ U U T (x0 − x⋆ ) ∥.

We cannot prove anything about the convergence of xk → x⋆ without selecting a particular


x , since it is not unique. We can choose x⋆ to be the orthogonal projection of x0 onto the

solution space {x | H x = b}, which satisfies x0 − x⋆ ∈ (ker H)⊥ = img H. It follows that
xk − x⋆ ∈ img H for all k, and so we can remove the projection matrix U U T and obtain
√ p 
∥ xk − x⋆ ∥ ≤ 2 β + ε k ∥ x0 − x⋆ ∥,

and thus xk → x0 . ■
84 Chapter 11. Advanced Optimization

3.1. ♡ Prove by induction that if the Krylov subspaces satsify Vj+1 = Vj for some j ≥ 0,
then Vk = Vj for all k ≥ j.
Solution: By assumption, Aj v ∈ Vj , and thus can be written as a linear combination
Aj v = c1 v + c2 A v + · · · + cj−1 Aj−2 v + cj Aj−1 v ∈ Vj
for some scalars c1 , . . . , cj . Thus,
Aj+1 v = c1 A v + c2 A2 v + · · · + cj−1 Aj−1 v + cj Aj v
= cj c1 v + (c1 + cj c2 )A v + · · · + (cj−2 + cj cj−1 )Aj−2 v + (cj−1 + c2j )Aj−1 v ∈ Vj
also, which implies Vj+2 = Vj . The general induction step follows. ■
3.2. ♡ Let Tk (x) denote the k-th order Chebyshev polynomial (11.28).
(a) Show that both formulas in (11.28) satisfy the recursive formula (10.46). Hint : Use
the trigonometric identity cos(x + y) + cos(x − y) = 2 cos x cos y.
(b) Explain why both formulas define the same polynomial of degree k.
(c) Write down expressions for the Chebyshev " polynomials T3 (x) and T4 (x).
  √ k  √ k #
κ+1 1 κ+1 κ−1
(d) For κ > 1, show that Tk = √ + √ .
κ−1 2 κ−1 κ+1
Solution: (a) By the indicated trigonometric identity
 
cos(k y) = 2 cos y cos (k − 1) y − cos (k − 2) y .
Setting y = arccos x, this becomes the recursive formula (10.46):
 
cos(k arccos x ) = 2 x cos (k − 1) arccos x − cos (k − 2) arccos x .
Similarly,
p k−1 p  p  p 
2x x ± x2 − 1 − x± x2 − 1 k−2 = 2 x2 ± 2 x x2 − 1 − 1 x ± x2 − 1 k−2
p 
= 2 x ± x2 − 1 k .
Summing the + and the − equations and dividing by 2 establishes the recursive formula (10.46)
in this case.
(b) Both formulas give T0 (x) = 1 and T1 (x) = x, and hence, in view of (10.46), define
the same function Tk (x) for all k. Moreover, by induction, Tk−1 (x) is a polynomial of degree
k − 1, which implies 2 x Tk−1 (x) is a polynomial of degree k, while Tk−2 (x) is a polynomial of
degree k − 2, and hence the sum Tk (x) = 2 x Tk−1 (x) − Tk−2 (x) is a polynomial of degree k.
(c) T3 (x) = 4 x3 − 3 x, T4 (x) = 8 x4 − 8 x2 + 1.
(d) The result is easily seen to be valid for k = 0, 1. We prove the general case by induction
using the recursive formula (10.46) with x = (κ + 1)/(κ − 1). Assuming the validity of the
formula for k − 1 and k − 2, the right hand side of (10.46), namely,
     
κ+1 κ+1 κ+1
2 Tk−1 − Tk−2
κ−1 κ−1 κ−1
is the sum of the following two terms:
  √   √ k−2 √ k
κ+1 κ+1 1 κ+1 1 κ+1
√ − √ = √ ,
κ−1 κ−1 2 κ−1 2 κ−1
  √   √ k−2 √ k
κ+1 κ−1 1 κ−1 1 κ−1
√ − √ = √ .
κ−1 κ+1 2 κ+1 2 κ+1

Adding them produces Tk (κ + 1)/(κ − 1) , and thus completes the induction step. ■
Chapter 11. Advanced Optimization 85

3.5. ♡ Choose a polynomial q(λ) in the proof of Theorem 11.15 to show that if H has at
exactly k distinct eigenvalues then xk = x⋆ .
(λ − λ1 ) · · · (λ − λk )
Solution: Let λ1 , . . . , λk be the distinct eigenvalues. Set q(λ) = , so
(−1)k λ1 · · · λk
that q(λ) has degree k, and q(λi ) = 0 for all i = 1, . . . , k, while q(0) = 1. Thus q(λ) satisfies
the conditions for (11.27) to be valid. However, the left hand side of the inequality is 0, and
hence F (xk ) = F (x⋆ ), which, by uniqueness of the minimizer, implies xk = x⋆ .

4.2. ♡ Consider Nesterov’s accelerated gradient descent (11.33) where the gradient ∇F is
defined with respect to an inner product ⟨ ·, · ⟩ on Rn . Formulate and prove a version of
Theorem 11.18 in this setting.
Solution: There are two ways to do this problem. One is to go through the proof of
Theorem 11.18 and replace all dot products by inner products, and Euclidean norms by the
induced inner product norm. Everything goes through the same way, except the Lipschitz
constant of ∇F is interpreted in the induced norm and the norm on the right hand side in
the rate also uses the induced norm.
Alternatively, we can make a change of variables in Theorem 11.18, which we do here.
Nesterov’s accelerated gradient descent in the inner product ⟨ x, y ⟩ = xT C y is

xk+1 = xk + βk (xk − xk−1 ) − α C −1 ∇F xk + βk (xk − xk−1 ) .
Let G(x) = F (A x), where A is a symmetric and nonsingular matrix to be determined. Then
∇G(x) = A ∇F (A x) and so

xk+1 = xk + βk (xk − xk−1 ) − α C −1 A−1 ∇G A−1 (xk + βk (xk − xk−1 )) .
Setting zk = A−1 xk and multiplying by A−1 on both sides yields

zk+1 = zk + βk (zk − zk−1 ) − α A−1 C −1 A−1 ∇G zk + βk (zk − zk−1 ) .

Now we see a good choice is A = C −1/2 , in which case G(x) = F (C −1/2 x), zk = C 1/2 xk , and

zk+1 = zk + βk (zk − zk−1 ) − α ∇G zk + βk (zk − zk−1 )
is the usual Nesterov method in the Euclidean dot product. Thus, by Theorem 11.18,
2 ∥ x0 − x⋆ ∥2C
F (xk ) − F (x⋆ ) ≤ for 0 < α ≤ Lip(∇G)−1 .
α (k − 1)2
4.3. ♡ Implement Nesterov’s accelerated gradient descent on a convex, but not strongly
convex, function F in numpy in Python. Compare it with ordinary gradient descent and with
the heavy ball method.
Solution:

Python Notebook: Solution to Exercise 4.3. (.ipynb)


86 Chapter 11. Advanced Optimization

5.1. ♡ Show that (11.54) implies (11.57).


Solution: We compute
 
Ek ξk · ∇Fik (xk ) = Ek ∥ ∇Fik (xk ) ∥2 − ∇F (xk ) · ∇Fik (xk )
= Ek ∥ ∇Fik (xk ) ∥2 − ∇F (xk ) · Ek ∇Fik (xk ) ≤ ∥ ∇F (xk ) ∥2 + σ 2 − ∥ ∇F (xk ) ∥2 = σ 2 .

5.4. ♡ Let αk = α/(k + 1)p for 0 < p < 12 . Show that

X
k−1
α2 (k 1−2 p − 2 p)
αj2 ≤ .
j=0
1 − 2p

What happens when p = 21 ?


Solution: We compute

X
k−1 X
k−1 Z k−1
1 1 α2
αj2 = α2 ≤ α2 + α2 dx ≤ α2 + (k 1−2 p − 1).
j=0 j=0
(j + 1)2p 0 (x + 1) 2p 1 − 2p

When p = 12 ,
X
k−1 Z k−1
1
αj2 ≤ α2 + α2 dx ≤ α2 (log k + 1) . ■
j=0 0 x+1

5.6. ♡ Let αk = α/(k + 1)p where 0 < p ≤ 1. Use the previous three exercises and Theorem
11.22 to establish convergence rates for SGD with this choice of time step (use big O notation
for simplicity). For which values of p is the method convergent? What happens when p > 1?
Solution: By Theorem 11.22, the convergence rate is the larger of the two quantities
O(1/(k αk )) and O(εk ). When 0 < p < 12 the rate is O(k −p ). When p = 12 the rate is

O(log k/ k) as was worked out in this section. When 21 < p < 1 the rate is O(k 1−p ). When
p = 1 the rate is O(1/ log k). In all cases 0 < p ≤ 1, SGD is convergent, with the fastest
rate occurring when p = 12 . When p > 1, the quantity k αk is bounded and does not grow to
infinity, so SGD is not convergent. ■
Chapter 11. Advanced Optimization 87

6.1. ♡ (a ) Solve the initial value problem

c′′ (t) + 2 b c′ (t) + λ c(t) = 0, c(0) = c0 , c′ (0) = 0,

where b > 0, λ > 0, and c0 ∈ R.


(b) Show that | c(t) | ≤ | c0 | (1 + b t) e− b t whenever b2 ≤ λ. Hint: Use sin x ≤ x.
(c) What can you say about the solution if b2 > λ?
Solution: (a) The solution is
  p p 
  b 

 c e −bt
cos t λ − b 2 + √ sin t λ − b 2 , b2 < λ,

 0
λ − b2

c(t) = c0 (1 + b t) e− b t b2 = λ,

     

 c0 b √ b √

 1+ √ e− (b− b −λ) t + 1 − √
2
e− (b+ b −λ) t ,
2
b2 > λ.
2 b2 − λ b2 − λ
(b) When b2 = λ, the inequality is an equality. When b2 < λ, bounding cos x ≤ 1 and
sin x ≤ x in the solution formula in part (a) proves the bound.

(c) When b2 > λ, we have | c(t) | ≤ | c0 | e− (b− b −λ ) t , so the decay rate is slower.
2

√ √ 
6.5. ♡ Define βk = 1 − α F k α . Prove that βk is independent of α if and only if
F (t) = r/t for some r ∈ R. Hint : Differentiate the expression for βk with respect to α.
Solution: If βk does not depend on α, then
√  √  √ √  √ 
dβk F k α kF′ k α k αF′ k α + F k α
0= =− √ − =− √ .
dα 2 α 2 2 α

Setting t = k α, we find that the function F (t) must satisfy the ordinary differential equation
d 
0 = t F ′ (t) + F (t) = t F (t) , and hence t F (t) = r is a constant. ■
dt
6.6. ♡ Let x(t) be twice continuously differentiable and satisfy (11.90). Assume that ∇F is
Lipschitz continuous. Show that x′ (0) = 0 and x′′ (0) = − 14 ∇F (x(0)).
Hint: For x′′ , use (11.91).
Solution: By (11.90) we have, for 0 ≤ t ≤ 1,
Z
′ 1 t 3 M
∥x (t)∥ ≤ 3 s ∥∇F (x(s))∥ ds ≤ t t, where M = max ∥∇F (x(t))∥.
t 0 4 0≤t≤1

Therefore x′ (0) = lim x′ (t) = 0. Keeping track of the error terms in (11.91),
t→0

t 3 ′ 3
x′ (t) = − ∇F (x(t)) + O(t2 ), and so x (t) = − ∇F (x(t)) + O(t),
4 t 4
for t near zero. Plugging this into (11.89) yields

x′′ (t) − 3
4 ∇F (x(t)) + O(t) = − ∇F (x(t)).

Thus, sending t → 0 produces x′′ (0) = − 41 ∇F (x(t)). ■


88 Chapter 11. Advanced Optimization

7.1. ♡ Given O ̸= X ∈ Mm×n , show that ∥ X T Xu ∥ ≥ σmin (X) ∥ Xu ∥ for all u ∈ Rn .


Solution:
Xr X
r
Let X = σi pi qTi be its singular value decomposition, so that X T = σi qi pTi is the
i=1 i=1
singular value decomposition of X T . Since Xu ∈ img X we can write
!T  r 
Xr Xr X X
r
Xu = cj pj , and hence X T X = σi qi pi  cj pj  = σi ci qi ,
j=1 i=1 j=1 i=1

where we used orthonormality of the pi in the last equality. Since the qi are also orthonormal,

X
r X
r
∥ X T Xu ∥2 = σi2 c2i ≥ σr2 c2i = σmin (X)2 ∥ Xu ∥2 . ■
i=1 i=1

7.2. ♡ Suppose F has the form


X
m
F (x; w) = wk Φk (x), (11.110)
k=1

where each Φk : Rn → R is a scalar-valued function. Assume that the functions Φk can


distinguish the training points, in the sense that Φk (xℓ ) = 0 for k ̸= ℓ and Φk (xk ) = 1. Show
that the corresponding kernel matrix K defined in (11.99) is the identity matrix.

Solution: In this case,

∇w F (x; w) = (Φ1 (x), . . . , Φm (x))T

and so the kernel matrix K has entries


X
m
kij = Φk (xi ) Φk (xj ).
k=1

By the assumption that Φk distinguishes the training points, kij = 1 if i = j and kij = 0
otherwise — that is, K = I is the identity matrix. ■

You might also like