ISyE 3013: Optimization for Machine Learning
Lecture #04 2025-09-11
Instructor: Swati Padmanabhan, Teaching Assistant: Zhenyi Zhang
We are now on the cusp of learning some classical optimization algorithms that, to this day,
power processes around us and remain objects of study. Before getting there, we need to finish our
current module on some essential theory underpinning these algorithms: optimality conditions (in
unconstrained minimization) and convexity. To this end, we continue our study of convex functions.
4.1 Equivalent Characterizations of Convex Functions
Recall that a function f : K → R (where K ⊆ Rd is a convex set) is convex if and only if for all
x, y ∈ K,
f (λy + (1 − λ)x) ≤ λf (y) + (1 − λ)f (x), for all λ ∈ [0, 1]. (4.1.1)
Two important points to note:
1. We say λa + (1 − λ)b for λ ∈ [0, 1] is a “convex combination” of a and b. Thus, a convex
combination is a weighted average with nonnegative weights that add to one.
2. Any point x ∈ [a, b] (i.e., x in the line segment between a and b) can be equivalently represented
as x = λa + (1 − λ)b for some λ ∈ [0, 1]. As we increase λ from zero to one, we move from b
to a. (You will need to use this representation in your very first homework problem.)
In the following two subsections, we cover two more equivalent definitions of convex functions when
we have additional assumptions on higher-order derivatives of f .
4.1.1 First-Order Taylor Approximator as Global Underestimator
Theorem 4.T1 (Repeating Theorem 3.T2). Let f : K → R be a continuously differentiable function
on convex set K ⊆ Rd . Then
f is convex on K if and only if f (y) ≥ f (x) + ∇f (x)⊤ (y − x) for any x, y ∈ K.
Proof. We showed one direction in the previous lecture, so in this lecture, we cover only the other
direction: Assume that f is convex. Then, by rearranging Inequality 4.1.1, we have:
f (λy + (1 − λ)x) − f (x) ≤ λ(f (y) − f (x)), for all λ ∈ [0, 1].
Next, for λ ∈ (0, 1], the above can safely be divided throughout by λ (and without the inequality
being flipped); since the inequality holds for all λ ∈ (0, 1], it is valid even in the limit as λ → 0:
f (x + λ(y − x)) − f (x)
lim ≤ f (y) − f (x).
λ→0 λ
The left-hand side is exactly ∇f (x)⊤ (y − x), which when plugged into the above inequality then
finishes the proof. We explain this last step in more detail in Theorem 4.T2.
These notes have not been subjected to the usual scrutiny reserved for formal peer-reviewed publications. Thank
you for reporting any typos! These notes closely follow the textbooks and lecture notes acknowledged at the end.
1
What Theorem 4.T1 is saying is that given a convex function, no matter where you draw a tangent
to it, that tangent lies under the function everywhere (and conversely, if for a function on a convex
set, all tangents lies below the function everywhere, then the function is convex). In other words,
if there exists even a single point such that the tangent to the function at that point lies above the
function somewhere, then the function most certainly is not convex. See Example 4.E1.
Example 4.E1. Satisfaction and violations of the global underestimator property:
Note, f (x) = .6x2 − 1.2x + 1.6 is convex and
has tangents under it everywhere (of course, it
is not physically possible to check this at every
point, but we know this to be true and illustrate
this only for three tangents). For the function x
4 3 2
f (x) = x8 − x6 − x2 , we see at least one tan-
gent that lies above the function at some points, f (x) = x4 − x3 − x2 f (x) = .6x2 − 1.2x + 1.6
8 6 2
which is evidence to show that it is non-convex.
We now provide1 a proof for the claim
f (x + t · p) − f (x)
lim = ∇f (x)⊤ p,
t→0 t
critically used in the last step when proving Theorem 4.T1. This is a crucial statement, which
shows up whenever simplifying an expression about function values to one in terms of the gradient.
Theorem 4.T2. Let f : Rd → R. Fix x ∈ Rd and a direction p = (p1 , . . . , pd ). If f is C 1 in
∂f
a neighborhood of x (i.e., all partials ∂xi
exist and are continuous near x), then the directional
derivative of f at x along p exists and
d
′ f (x + t · p) − f (x) ⊤
X ∂f
f (x; p) := lim = ∇f (x) p = pi (x).
t→0 t ∂xi
i=1
As a warm-up, note that if d = 1 (i.e., we are operating only on scalars), then the above reduces
to the chain rule. We see that the proof essentially performs such a chain rule in every coordinate.
Proof. We will generate a sequence of iterates by “walking” from x to x + t · p, moving along one
coordinate axis at each step. We denote the vector we are at at the ith iterate by x(i) , i.e.,
i
X
x(0) := x, x(i) := x + t pk ek (i = 1, . . . , d), (4.1.2)
k=1
so that x(d) = x + t · p. Then, the difference between the initial (at x) and final (at x + t · p)
function values may be decomposed into the sum of differences between these successive iterates:
d
X
f (x(i) ) − f (x(i−1) ) .
f (x + t · p) − f (x) =
i=1
1
Thank you very much for asking this question in class!
2
We now write x(i) = x(i−1) + tpi ei and divide throughout by t ̸= 0 and later take the limit as t → 0,
which will make the left-hand side identical to that of the statement of the theorem:
d
f (x + t · p) − f (x) X f x(i−1) + tpi ei − f (x(i−1) )
= . (4.1.3)
t t
i=1
For each index i, if pi = 0, the ith summand in the right-hand side above is 0. If pi ̸= 0, we have
f x(i−1) + tpi ei − f (x(i−1) ) f x(i−1) + tpi ei − f (x(i−1) )
= pi · . (4.1.4)
t tpi
By the one–variable Mean Value Theorem applied to s 7→ f (x(i−1) + sei ) on the interval from 0 to
tpi , there exists a scalar θi (t) ∈ (0, tpi ) such that
f x(i−1) + tpi ei − f (x(i−1) )
∂f (i−1)
= x + θi (t)ei . (4.1.5)
tpi ∂xi
Hence, for all t with pi ̸= 0, plugging the right-hand side of Equation (4.1.5) into that of Equa-
tion (4.1.4) and Equation (4.1.4) into each term of the right-hand side of Equation (4.1.3) yields:
d
f (x + t · p) − f (x) X ∂f (i−1)
= pi x + θi (t)ei . (4.1.6)
t ∂xi
i=1
Now taking the limit t → 0, we have x(i−1) → x (by definition of x(i) in Equation (4.1.2)) and
∂f
θi (t) → 0 (since θi (t) ∈ (0, tpi )). Consequently, x(i−1) + θi (t)ei → x. By continuity of ∂xi
at x,
∂f (i−1) ∂f
x + θi (t)ei −→ (x).
∂xi ∂xi
Therefore, plugging this into Equation (4.1.6) yields the claim.
4.1.2 Positive Semidefiniteness of Hessian
We now circle back to a second-order characterization from the second lecture, which showed global
minimality under global positive semidefiniteness of the Hessian. As we remarked then, for twice
continuously differentiable functions, this completely defines convex functions2 .
Theorem 4.T3. Let K ⊆ Rd be an open convex set, and let f : K → R be twice continuously
differentiable. Then f is convex if and only if ∇2 f is positive semidefinite at all x ∈ K.
Proof. Consider two points x, y ∈ K. Then, by the second-order Taylor approximation of f , there
exists some z ∈ [x, y] such that:
1
f (y) = f (x) + ∇f (x)⊤ (y − x) + (y − x)⊤ ∇2 f (z)(y − x). (4.1.7)
2
2
Note, this definition holds only for convex functions that are also twice continuously differentiable; the function
f (x) = |x| is convex, but not differentiable at x = 0.
3
Suppose ∇2 f (u) ⪰ 0 for all u ∈ K. Since ∇2 f (z) ⪰ 0 (since z ∈ K too, by convexity of K), the
third term on the right-hand side above is non-negative, which implies the linear underestimator
inequality from Theorem 4.T1 for all x, y ∈ K, thus implying convexity of f .
We now show the other direction. Suppose instead that f is convex on K. Fix x ∈ K and a
direction h ∈ Rd . Because K is open, there exists r > 0 such that x + th ∈ K for all t ∈ (−r, r).
By Theorem 4.T1, for all |t| < r,
f (x + th) ≥ f (x) + t ∇f (x)⊤ h. (4.1.8)
By the second–order Taylor theorem (Lemma 1.L1), there exists zt ∈ [x, x + th] ⊆ K such that
f (x + th) = f (x) + t ∇f (x)⊤ h + 21 t2 · h⊤ ∇2 f (zt )h. (4.1.9)
Comparing Inequality 4.1.8 and Equation (4.1.9) and cancelling the common terms gives, for t ̸= 0,
1 2
2t · h⊤ ∇2 f (zt )h ≥ 0.
Note that because t2 /2 > 0, the above inequality is the same as:
h⊤ ∇2 f (zt )h ≥ 0. (4.1.10)
Since zt ∈ [x, x + th], if we let t → 0, we have zt → x. We therefore have:
h⊤ ∇2 f (x)h = lim h⊤ ∇2 f (zt )h ≥ 0, .
t→0
where the first step is by continuity of ∇2 f (since we assume f is twice continuously differentiable),
and the second step is by Equation (4.1.10). The above inequality, as of now, holds only for the
vector h that we fixed at the start of this proof. However, note that the proof does not make any
special assumptions on h, which means the above inequality holds for any arbitrary h. In other
words, we have that
h⊤ ∇2 f (x)h ≥ 0, for all h.
This is precisely the definition of ∇2 f (x) ⪰ 0, which finishes the proof of the claim.
To conclude this section, we have three equivalent tests for whether a twice continuously differ-
entiable function is convex or not (one that involves only the function values, one involving the first
derivative, and this one involving the second derivative). Depending on the function at hand, one
test might be significantly easier to perform than the others. For example, it is almost immediate
to see the convexity of f (x) = ex and f (x) = 5x2 via the second-derivative test; however, these
are not all that trivial using Inequality 4.1.1. As we will see in the next few lectures, each of these
definitions turns out to be useful also when designing fast algorithms.
Problem 4.P1. Based on Theorem 4.T3, prove the convexity of:
1. f (x) = eax ,
2. f (x) = x⊤ Px when P ⪰ 0,
3. f (x) = − log(x), x > 0,
4. f (x) = x · log(x) on x > 0,
4
4.1.3 A Brief Summary of Our Current Toolkit
Assumptions Implication Description in plain English
⋆ ⋆
x local minimum ∇f (x ) = 0 First-Order Necessary Condition
x⋆ local minimum, f ∈ C 2 ∇2 f (x⋆ ) ⪰ 0 Second-Order Necessary Condition
f ∈ C 2 , ∇f (x⋆ ) = 0, ∇2 f (x⋆ ) ≻ 0 x⋆ is a strict local minimum Second-Order Sufficient Condition
f ∈ C 2 , ∇f (x⋆ ) = 0, ∇2 f (x) ⪰ 0 ∀ x x⋆ is a global minimum Convexity
f is convex f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) Secant inequality
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) f is convex Secant inequality
f ∈ C 1 , f convex f (y) ≥ f (x) + ∇f (x)⊤ (y − x) Global linear underestimator
f ∈ C 1 , f (y) ≥ f (x) + ∇f (x)⊤ (y − x) f convex Global linear underestimator
4.2 Iterative Algorithms for Optimization
The algorithms we study in this class are all iterative. Suppose the problem is minx f (x). We denote
the problem variable at the k th iteration by xk . The key design principle is to find a direction pk
and a step size αk such that moving in that direction improves the function value:
f (xk + αk · pk ) < f (xk ). (Descent Property)
One method to achieve this is to find the (unit) direction pk that minimizes the first-order Taylor
approximation of f at xk :
pk := arg min f (xk ) + p⊤ ∇f (xk ). (Descent Candidate)
p:∥p∥≤1
Question 4.Q1. What norm should we choose in (Descent Candidate), and what step size should
we choose to enable (Descent Property)?
We will examine these questions in the next few lectures. For now, we touch upon two norm choices.
4.2.1 Linear Minimization Over the ℓ2 -Ball y
(0, 1)
We consider the problem maxp:∥p∥2 ≤1 p⊤ g. Note, this is a constrained
optimization over the unit ℓ2 -norm ball. The projection of p onto g is g
maximized when we choose p to be a vector that directly aligns with g
hitting the boundary of the ball. (Try to draw any other vector in the x
adjacent blue circle and see its projection onto g.) This can be seen (0, 0) (1, 0)
g
formally via the Cauchy-Schwarz inequality. Thus p⋆ = ∥g∥ 2
(when g ̸=
0). This gives an optimal value of ∥g∥2 for the problem in consideration.
4.2.2 Linear Minimization Over the ℓ1 -Ball y
The unit ℓ1 -ball has quite a different geometry than the unit ℓ2 -ball. If (0, 1)
we therefore instead consider the problem maxp:∥p∥1 ≤1 p⊤ g, then the g
vector p⋆ that attains this largest projection is, up to a sign factor, the
indicator vector corresponding to the component of g with the largest x
absolute value. Thus, imposing a constraint on the ℓ1 -norm promotes (0, 0) (1, 0)
sparsity in solutions, a fact which also nicely generalizes to the matrix
setting [Faz] and finds use in recommendation systems for personalized
user experiences across, e.g., e-commerce and entertainment.
5
Readings
The material in these notes is based on the following excellent sources: [BV04, Chapter 2, Chapter
3, Section 9.4, Appendix A.4] and [NW06, Chapter 2].
References
[BV04] Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university
press, 2004 (cit. on p. 6).
[NW06] Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 2006 (cit. on
p. 6).
[Faz] Maryam Fazel. “Matrix rank minimization with applications”. PhD thesis (cit. on p. 5).