0% found this document useful (0 votes)
10 views9 pages

Understanding Convex Functions

The document discusses convex functions, providing three characterizations of convexity: zeroth-order, first-order, and second-order definitions. It also covers properties of convex functions, examples, and operations that preserve convexity, along with concepts of smoothness, strong convexity, and strict convexity. Additionally, it presents optimality conditions for minimizing convex functions over convex sets.

Uploaded by

alypaty
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views9 pages

Understanding Convex Functions

The document discusses convex functions, providing three characterizations of convexity: zeroth-order, first-order, and second-order definitions. It also covers properties of convex functions, examples, and operations that preserve convexity, along with concepts of smoothness, strong convexity, and strict convexity. Additionally, it presents optimality conditions for minimizing convex functions over convex sets.

Uploaded by

alypaty
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

10-725: Convex Optimization Spring 2023

Lecture 2: January 19
Lecturer: Siva Balakrishnan

2.1 Convex Functions


There are three characterizations of convexity that you should be familiar with:

1. No Assumptions (Zeroth-Order): This is the definition we discussed last time,


i.e. f is convex if its domain is a convex set and, for any x, y ∈ dom(f ),

f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y).

2. Differentiable (First-Order): Suppose our function f has a derivative (at all points
in its domain) then, f is convex if its domain is a convex set and, for any x, y ∈ dom(f ),

f (y) ≥ f (x) + h∇f (x), y − xi.

3. Twice Differentiable (Second-Order): A function f is convex, if its domain is a


convex set and, for any x ∈ dom(f ),

∇2 f (x)  0.

In your HW you will explore some connections between these definitions (in particular,
showing that (3) =⇒ (2) =⇒ (1)). You can find the proof that (2) =⇒ (1) in the BV
textbook (but it will still likely be a part of your HW).
It is also worth noting that there is a definition analogous to (2) above in the case when the
function is not differentiable everywhere.

2’ Non-Smooth: A function f is convex if its domain is a convex set, and if at every


point x ∈ dom(f ), there exists a vector gx such that, for any y ∈ dom(f ),

f (y) ≥ f (x) + hgx , y − xi.

It is worth noting that if f is differentiable at x, then there is only one vector which will
satisfy the above definition and it will coincide with the usual gradient, i.e. gx = ∇f (x).
Any gx which satisfies the above property is called a subgradient of f at x. The set
of all subgradients at a point x is called the subdifferential of f at x and it will be
denoted as ∂f (x).

2-1
2-2 Lecture 2: January 19

Except for some very pathological functions (and only at the boundary of their domain)
subgradients always exist. Formally, one can for instance show that a subgradient gx
of a convex function f at x exists if x is in the interior of their domain.

Notational Note: I will often stop adding the qualifiers “for x, y ∈ dom(f )”. One way to
make this precise (I, and most textbooks do this implicitly) is to allow f to be whats called
an extended function, and define it to be ∞ outside its (effective) domain. This won’t change
any of its convexity properties, and things like the first and zeroth-order characterizations
will now make sense for any x, y ∈ Rd .

2.1.1 An example

Let us consider the quadratic function f (x) = 21 xT Qx + aT x + b where Q  0.


Applying definition (3) is easiest, since ∇2 f (x) = Q and this is PSD.
Now, let us try to apply definition (2). It is a differentiable function, with gradient ∇f (x) =
Qx + a. So we need to verify if,
1 T ? 1
y Qy + aT y + b ≥ xT Qx + aT x + b + hQx + a, y − xi.
2 2
Re-arranging we obtain that we need to check if,
1
(y − x)T Q(y − x) ≥ 0,
2
which is certainly the case since Q  0.
Finally, let us try to apply definition (1). We see (after cancelling some terms) that we need
to verify if for 0 ≤ θ ≤ 1,
1 ? θ 1−θ T
(θx + (1 − θ)y)T Q (θx + (1 − θ)y) ≤ xT Qx + y Qy.
2 2 2
Now, use the fact (you should see how you might prove this fact) that, xT Qy ≤ 12 xT Qx + y T Qy
 

for PSD Q (this is the matrix analogue of the simple fact that a × b ≤ (a2 + b2 )/2), to verify
that the desired inequality holds.

2.2 More Examples of Convex Functions


Here are a few examples of convex functions:

1. exp(ax) is convex for any a over R.


Lecture 2: January 19 2-3

2. log x is concave on R++ .

3. aT x + b is convex (and concave).

4. The least squares loss kAx − bk2 is convex (for any A, b).

5. Any norm is convex, i.e. kxk is a convex function.

6. The spectral norm, and the trace norm of a matrix are convex, i.e. kXkop = σ1 (X),
kXktr = di=1 σi (X) where σi (X) denotes the i-th singular value of X.
P

7. Convex Indicators: If C is a convex set, then the indicator function (which is


defined on the extended reals):
(
0 x∈C
IC (x) =
∞ x∈/ C.

is convex.

2.3 Convexity and Monotonicity


One nice property of convex functions is that their gradients are monotone.
In 1D this is a simple thing to interpret, a monotone function is order preserving. A function
which is monotone increasing has the property that if x ≥ y then f (x) ≥ f (y). One way to
write this mathematically is to say that for any x, y, (x − y) × (f (x) − f (y)) ≥ 0.
The (sub)gradient of a convex function satisfies a multivariate analogue of this property.
Particularly for any x, y ∈ dom(f ), if f is convex we have that for any gx ∈ ∂f (x) and
gy ∈ ∂f (y),

(x − y)T (gx − gy ) ≥ 0.

To see this we observe that by the first-order characterization:

f (y) ≥ f (x) + gxT (y − x),


f (x) ≥ f (y) + gyT (x − y),

and summing these inequalities gives our desired result.


It turns out that there is a converse to the above characterization. If you have a differentiable
function whose gradient is monotone, then it must be convex. This idea will likely be useful
in your HW for verifying some of the equivalences.
2-4 Lecture 2: January 19

2.4 Properties of Convex Functions


Here are a few properties of convex functions that will be useful:

1. A function is convex iff the univariate functions g(t) = f (x + tv) are convex for any
v ∈ Rd , and for any x ∈ dom(f ).

2. A function is convex iff its epigraph,

epi(f ) = {(x, t) ∈ dom(f ) × R : f (x) ≤ t}

is a convex set.

3. Convex functions satisfy Jensen’s inequality. If f is convex, then for any random
variable X supported on dom(f ), f (E[X]) ≤ Ef (X).

2.5 Operations which Preserve Convexity


1. Non-negative
Pm Linear Combination: Suppose f1 , . . . , fm are convex, then so is
i=1 ai fi for any a1 , . . . , am ≥ 0.

2. Pointwise Max: If the collection of functions fs for s ∈ S are convex, then so is


g(x) = sups∈S fs (x).

3. Partial Minimization: If g(x, y) is a convex function, and C is a convex set, then


f (x) = miny∈C g(x, y) is a convex function.

An Example:

1. Suppose C is an arbitrary set, consider f (x) = maxy∈C kx − yk. f is convex. To see


this, we can view f as a maximum of convex functions fy (x) = kx − yk.

2. Let C be a convex set, then f (x) = miny∈C kx − yk is a convex function. We can


view this as a partial minimization of the function g(x, y) = kx − yk which is a convex
function (in (x, y)).

Function compositions:

1. Affine Composition: If f is convex then so is g(x) = f (Ax + b).


Lecture 2: January 19 2-5

2. General Composition: Suppose that f = h ◦ g, where g : Rd 7→ R, h : R 7→ R,


f : Rd 7→ R. Then one can ask when f is convex. There are many cases to cover
(see BV) but we’ll simply study one, and try to understand where it comes from: f is
convex if h is convex and nondecreasing, g is convex.
To see this: imagine everything was twice differentiable, then by the chain rule

f 00 (x) = h00 (g(x))(g 0 (x))2 + h0 (g(x))g 00 (x).

When h is convex and non-decreasing, h00 and h0 are positive, and when g is convex,
g 00 is positive, so f 00 is positive.

2.6 Smooth, Strongly Convex and Strictly Convex Func-


tions
For this section, we will switch back to thinking about differentiable convex functions.

2.6.1 Smoothness

In optimization smoothness has a very particular meaning (it has a slightly different meaning
in stats, and other areas of math). A function f is β-smooth, if its gradient is Lipschitz
continuous with parameter β, i.e. for any x, y ∈ dom(f ),

k∇f (x) − ∇f (y)k ≤ βkx − yk.

There are several useful implications of smoothness that you will show in your HW, but we
will briefly discuss now:

1. If f is β-smooth then the function β2 kxk2 − f (x) is convex. Typically, we would not
expect −f (x) to be convex (except when f is affine).

2. Another implication of smoothness, is that it implies a quadratic upper bound on the


function, i.e. if f is β-smooth then,

β
f (y) ≤ f (x) + ∇f (x)T (y − x) + ky − xk2 .
2

To interpret this fix a point x. Convex functions always lie above their tangent lines.
Smooth convex functions always lie below a parabola which passes through the point
(x, f (x)) (defined by the RHS above).
2-6 Lecture 2: January 19

3. Finally, if f is twice differentiable, then β-smoothness is equivalent to the condition


that,

∇2 f (x)  βId .

Examples: It is worth briefly considering two examples (canonical examples of non-smooth


and smooth convex functions):

1. Absolute value: Here we consider f (x) = |x|, and observe that at x = 0, it’s
impossible to seat a parabola at the origin which is always above the function. Roughly,
a parabola must have close to zero derivative near its minimum, but the absolute value
function has constant derivative near its minimum.

2. Quadratic function: Suppose we consider f (x) = xT Qx + aT x + b where Q  0.


Its now easy to see that this function has Hessian 2Q, and consequently it satisfies
smoothness for any β ≥ 2λmax (Q) (i.e. twice the largest eigenvalue of Q).

2.6.2 Strong Convexity

The twin assumption to smoothness is strong convexity. A function f is α-strongly convex, if


the function g(x) = f (x) − α2 kxk2 is convex. As with smoothness there are several important
implications of strong convexity that you will explore in your HW.

1. If f is strongly convex then an equivalent definition is that it satisfies the following


inequality for any x, y ∈ dom(f ),
α
f (y) ≥ f (x) + ∇f (x)T (y − x) + ky − xk2 .
2

Again to interpret this, fix a point x, and observe that this expression tells us that a
strongly-convex function is above a parabola which passes through the point (x, f (x)).

2. If f is twice differentiable, an equivalent characterization is that,

∇2 f (x)  αId .

Examples:

1. Absolute value: Consider the same function as before. It is not strongly convex.
For instance, if we consider x = 1, y = 2, then f (y) − (f (x) + ∇f (x)T (y − x)) is 0, so
the definition can only hold with α = 0.
Lecture 2: January 19 2-7

2. Quadratic function: Once again using the second-order characterization of strong


convexity we see that the quadratic function satisfies the definition of strong convexity
for any α ≤ 2λmin (Q).

It is possible to have strongly convex functions which are not smooth and vice versa, and it
is worth trying to “draw” some examples to convince yourself of this.

2.6.3 Strict Convexity

Strict convexity is a “weakening” of strong convexity (we won’t use it so much in this course
but it’s a useful concept to be aware of). A function f is strictly convex if either:

1. f (θx + (1 − θ)y) < θf (x) + (1 − θ)f (y) for 0 < θ < 1.

2. f (y) > f (x) + ∇f (x)T (y − x), for any x 6= y.

It is worth noting the second-order characterization doesn’t work in the expected way, i.e.
you can have twice-differentiable, strictly convex functions which don’t satisfy the condition
that ∇2 f (x)  0. (As an example, think about the function x4 at x = 0.)

2.7 Optimality Conditions


Here we will revisit some things we discussed briefly in the previous lecture. Here is the
basic question. We are interested in solving a problem:

min f (x),
x∈C

where f is a convex function, and C is a convex set. What can I say about a solution x∗ to
this problem?

1. Unconstrained Case: Suppose first that C = Rd , and that dom(f ) = Rd then our
characterization should be familiar to us from usual calculus classes.

Theorem 2.1 x∗ is optimal, if (and only if ) 0 ∈ ∂f (x∗ ).

Proof: If 0 ∈ ∂f (x∗ ), then from the first-order condition we know that,

f (y) ≥ f (x∗ ) + 0T (y − x∗ ) = f (x∗ ).


2-8 Lecture 2: January 19

Conversely, if x∗ is optimal, then we know that, f (y) ≥ f (x∗ ) + gxT∗ (y − x∗ ) for all y,
when gx∗ = 0 and so we know that 0 is valid subgradient at x∗ .
Notice an interesting aspect of this result – it does not require convexity, i.e. for any
function f the condition that 0 ∈ ∂f (x∗ ) is necessary and sufficient for x∗ to be a
minimizer.

2. Constrained, Differentiable Case: A feasible point x∗ is optimal, if and only if


∇f (x∗ )T (y − x∗ ) ≥ 0 for all y ∈ C.
We will only verify one direction of this (the other direction requires a bit of analysis
to check). Suppose that, ∇f (x∗ )T (y − x∗ ) ≥ 0 for all y ∈ C, then from the first-order
condition we have that,

f (y) ≥ f (x∗ ) + ∇f (x∗ )T (y − x∗ ) ≥ f (x∗ ),

so x∗ is optimal. If you recall the definition of the normal cone from last lecture, then
you will see that this condition says that,

−∇f (x∗ ) ∈ NC (x∗ ).

3. General, Constrained Case: A feasible point x∗ is optimal, if and only if 0 ∈


∂f (x∗ ) + NC (x∗ ). Here we are adding two sets, i.e. C + D = {y : y = u + v, u ∈ C, v ∈
D}.
Again it’s only easy to verify one direction of this, i.e. suppose that 0 ∈ ∂f (x∗ ) +
NC (x∗ ), this means that there are two vectors u ∈ ∂f (x∗ ) and v ∈ NC (x∗ ) such that,

u + v = 0.

Now, we know that for any y which is feasible,

f (y) ≥ f (x∗ ) + uT (y − x∗ )
= f (x∗ ) − v T (y − x∗ ).

Since v ∈ NC (x∗ ) we know that v T (y − x∗ ) ≤ 0 for every feasible y, and so we conclude


that f (y) ≥ f (x∗ ).

2.7.1 Optimality Conditions for Projection

Here is a very basic/important problem. It arises in signal processing and statistics as a


basic denoising scheme. For some convex set K, and observation y we would like to solve
the constrained minimization problem,
1
min ky − xk2 .
x∈K 2
Lecture 2: January 19 2-9

This finds the closest point in K to y, and is called the projection of y onto K. We will
denote the solution x∗ to the above program by PK (y).
Let us first write out the optimality conditions, and then use them to show a nice property
of this projection operation. Since f is differentiable we have that,

0 ∈ x∗ − y + NC (x∗ ).

Equivalently, this means that, (y − x∗ )T (a − x∗ ) ≤ 0 for all a ∈ K. This can be easily


understood with a picture.

Theorem 2.2 Projection onto a convex set is a contraction, i.e. for any pair of points a, b,

kPK (a) − PK (b)k ≤ ka − bk.

Proof: From the optimality conditions we have that for any x ∈ K,

(a − PK (a))T (x − PK (a)) ≤ 0
(b − PK (b))T (x − PK (b)) ≤ 0.

As a consequence we can see that,

(a − PK (a))T (PK (b) − PK (a)) ≤ 0


(b − PK (b))T (PK (a) − PK (b)) ≤ 0.

Adding these inequalities we obtain that,

(b − a + (PK (a) − PK (b)))T (PK (a) − PK (b)) ≤ 0.

Now, re-arranging and applying the Cauchy-Schwarz inequality, we see that,

kPK (a) − PK (b)k2 ≤ (a − b)T (PK (a) − PK (b)) ≤ ka − bkkPK (a) − PK (b)k,

which is our desired conclusion.

You might also like