0% found this document useful (0 votes)
16 views6 pages

Convex Optimization in Deep Learning

The document discusses convex optimization problems relevant to signal and image recovery, presenting mathematical formulations and examples such as `1-minimization and machine learning model fitting. It defines key concepts including convex sets, functions, and optimality conditions, emphasizing the efficiency of convex optimization. Additionally, it introduces algorithms like gradient descent and proximal gradient descent, highlighting their applications in solving optimization problems effectively.

Uploaded by

Subhosri Basu
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views6 pages

Convex Optimization in Deep Learning

The document discusses convex optimization problems relevant to signal and image recovery, presenting mathematical formulations and examples such as `1-minimization and machine learning model fitting. It defines key concepts including convex sets, functions, and optimality conditions, emphasizing the efficiency of convex optimization. Additionally, it introduces algorithms like gradient descent and proximal gradient descent, highlighting their applications in solving optimization problems effectively.

Uploaded by

Subhosri Basu
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Deep Learning and Inverse Problems

Summer 2022
Reinhard Heckel

3 Solving convex optimization problems numerically


Many problems arising in signal and image recovery can be formulated as an optimization problem:

minimize f (x) subject to x ∈ C, (1)

where f is a cost function and C is a set.


A concrete example is the `1 -minimization problem discussed in the previous section, where
f (x) = kxk1 and C = {x : Ax = y}.
Another example introduced later is a machine learning setup, where gθ (x) is a machine learning
model parameterized by θ, and our goal is to fit the model to a set of examples {(xi , yi )} by
minimizing:
n
X
f (θ) = kgθ (xi ) − yi k22 .
i=1

In this case, the function f models goodness of fit in machine learning (we’ll see many concrete
examples later), in general it models utility or cost.
The set C can incorporate constraints such as a limited budget (e.g., kxk ≤ B) or priors (e.g.,
x is non-negative).
We say that x∗ is a (global) solution to the optimization problem (1) if x∗ ∈ C, and f (x∗ ) ≤ f (x)
for all x ∈ C. We say that x∗ is a local solution to the optimization problem if in a neighborhood
N around x∗ , f (x∗ ) ≤ f (x) for all x ∈ C ∩ N . Optimality can be very hard to check—even if f is
differentiable and C = Rn .

3.1 Convex optimization problems


An important class of optimization problems for which we can check optimality rather easily, and
which we can solve efficiently are convex optimization problems. A standard reference for convex
optimization problems is the book [BV04].

Definition 1. A convex set C is any set such that for all x, y ∈ C and all θ ∈ (0, 1),

θx + (1 − θ)y ∈ C.

Figure 1 shows examples of convex and non-convex sets. Intersections of convex sets are convex.
Important examples of convex sets are the following:

• Subspaces {x = Ua : a ∈ Rd },

• Affines spaces {x = Ua + b : a ∈ Rd },

• Half-spaces {x : ha, xi ≤ b}.

1
(a) Convex set (b) Non-convex set

Figure 1: Graphical interpretation of convex sets.

Definition 2. for a set of points x1 , . . . , xn ∈ Rp , a convex combination of them is defined as the


set
n n
( )
X X
x= θi xi : θi = 1, θi ≥ 0 .
i=1 i=1

A set C is convex if and only if it contains all convex combinations of its elements. For any
given set S, the convex hull of S is defined as the set of all convex combinations of points in S.
Intuitively, it is the smallest convex set that contains S. As an example, consider the non-convex
set {x : kxk0 ≤ 1, kxk1 ≤ 1}. Its convex hull is the `1 -norm ball {x : kxk1 ≤ 1}.
We are now ready to define a convex function.
Definition 3. A function f : Rn → R is convex if for all x, y ∈ Rn , and all θ ∈ (0, 1),

θf (x) + (1 − θ)f (y) ≥ f (θx + (1 − θ)y).

A function is strictly convex, if the inequality above is strict whenever x 6= y. A function f is


concave if −f is convex.
Common examples of convex functions are:
• Linear functions: f (x) = ha, xi + b,
• Quadratics: f (x) = 12 xT Qx + bT x + c, where Q is positive semidefinite. The function f is
convex if and only if Q is positive semidefinite.
• Any norm f (x) = kxk is convex. This follows from the triangle inequality and homogene-
ity/scalability.
An important property of convex functions is that local minima are global.
Proposition 1. Any local minimum of a convex function f : Rn → R is also a global minimum.
We now turn to first order optimality conditions. To this end, we first state a proposition stating
that a function is convex if and only if it is lower bounded by its first order Taylor expansion at
any point.
Proposition 2. A differentiable function f is convex if and only if for all x, y ∈ Rn ,

f (y) ≥ f (x) + hy − x, ∇f (x)i .

It is strictly convex if and only if the inequality holds strictly for all x 6= y.

2
An important consequence is the following optimality condition.

Corollary 1. If a differentiable function f is convex and ∇f (x∗ ) = 0, then x∗ is a global minimizer


of f .

3.2 Computational aspects of optimization algorithms


Suppose we want to minimize a differentiable function f : Rn → R over Rn :

minimize
n
f (x).
x∈R

For simplicity assume that f (x) = kAx − yk22 , where A ∈ Rm×n is a matrix with full column
rank. So the optimization problem above is a standard least squares problem. If the columns are
−1
linearly independent, we can simply obtain a closed form solution as x∗ = (AT A) AT y. However,
for many practically relevant functions f , we cannot compute a closed form solution. Even if we
can, we might not want to because computing the closed form solution can be computationally
−1
expensive, e.g., for x∗ = (AT A) AT y we have to compute the inverse of a possibly large matrix.
We consider algorithms for finding approximate solutions of the minimization problem above.
In practice we are almost always content with an approximate solution (a computer can only give
us a solution up to machine precision anyways). We assume that we can evaluate the function f
at a point x as well as compute a derivative (or a sub-differential) ∇f (x).

3.3 Gradient descent


Gradient descent is a simple iterative algorithm for minimizing a differentiable function f : Rn → R
on Rn . Starting from an initial point x0 , gradient descent iterates:

xk+1 = xk − αk ∇f (xk ),

where αk is a step size parameter. Gradient descent converges to a local minimum, and provided
that f is convex, it converges to a global minimum. The idea behind this algorithm is to make
small steps in the direction that minimizes the local first order approximation of f .

3.4 Convergence for quadratic functions


In order to understand the gradient method better, let us start with a simple class of functions,
namely quadratic functions:
1
f (x) = xT Qx − bT x,
2
where Q is strictly positive definite. A closed form solution to the minimization problem minimizex f (x)
is x∗ = Q−1 b.
We want to understand what gradient descent yields for this problem. We consider gradient
descent with a fixed stepsize:
xk+1 = xk − α∇f (xk ).

3
Using that the gradient is given by ∇f (x) = Qx − b and that the optimal solution obeys Qx∗ = b,
the difference of the (k + 1)-st iteration to the optimum is

xk+1 − x∗ = xk − η∇f (xk ) − x∗


= xk − η(Qxk − b) − x∗
= xk − η(Qxk − Qx∗ ) − x∗
= (I − ηQ)(xk − x∗ ).

It follows that
xk+1 − x∗ ≤ kI − ηQk xk − x∗ .
2 2
Since I − ηQ is symmetric (the first equality below can be checked by taking the singular value
decomposition of the matrix)

kI − ηQk = max(λmax (I − ηQ), −λmin (I − ηQ)) = max(ηM − 1, 1 − ηm),

where M and m are the largest and smallest singular values of the matrix Q. For the second
inequality, we used that, due to (I − ηQ)vi = λi (I − ηQ)vi , the eigenvalues of I − ηQ and Q are
2
related as λi (I − ηQ) = 1 − ηλi (Q)1 . The right hand side above is minimized for η = M +m . For
this choice, kI − ηQk = 1−1/κ
1+1/κ < 1, where κ = M/m is the condition number of the matrix Q.
To summarize, we have proven the following proposition:
2
Proposition 3. Gradient descent with η = M +m applied to a quadratic function obeys
 k
∗ 1 − 1/κ
k
x −x ≤ x0 − x∗ 2
.
2 1 + 1/κ
Thus the rate of convergence is geometric/linear. Suppose we want to find a solution that is
-close to the original solution. It follows from the proposition that we require

1 + 1/κ −1
 
N = log log( x0 − x∗ 2 /))
1 − 1/κ
many iterations to reach an -accurate solution. Due to log(1 + x) ≈ x for small x, for large
condition numbers we have that N = O(κ log(1/)).

3.5 Proximal gradient descent


The gradient descent algorithm introduced in the previous section is only one example of an iterative
algorithm for solving a convex optimization problem, and many more exist that can be faster for a
given application.
A concrete example is an algorithm called proximal gradient descent that performs well for the
sparse recovery problem discussed in the last section. As discussed, solving

arg min kxk1 subject to kAx − yk22 ≤ ξ,


x
1
The singular value decomposition and the eigenvalue decomposition of a symmetric matrix are identical. Therefore
λ can be used for singular values and eigenvalues at the same time. λi (A) means the i’th eigenvalue of matrix A.

4
provably recovers a sparse vector from an observation y (provided the sparsity is sufficiently low,
and A is appropriately chosen). It can be shown that this constraint problem is equivalent to the
un-constrained `1 -regularized least-squares problem
1
arg min kAx − yk22 +λkxk1 , (2)
x 2
| {z }
f (x)

i.e., for each ξ there is a regularization parameter λ so that the two problems yield an equivalent
solution.
First suppose we are only interested in minimizing the smooth function f . The gradient descent
iterates for optimizing f are given by
xk+1 = xk − η∇f (xk ).
We can write those iterates in so-called proximal form as a solution to a simple optimization
problem:
1 2
xk+1 = arg min x − (xk − η∇f (xk )) .
x 2 2
However, our goal is to minimize f (x) + λkxk1 instead of f , so it seems natural to solve
1 2
xk+1 = arg min x − (xk − η∇f (xk )) + ηλkxk1 . (3)
x 2 2
Those iterates are the so-called proximal gradient descent updates. There are more formal ways
to arrive at the proximal gradient descent algorithm, and it can be shown that proximal gradient
provably converges for the loss function above.
Proximal gradient descent is very popular for `1 -regularized least squares, because the optimiza-
tion problem in (3) for obtaining the iterates is separable and thus simple to solve. Specifically,
since λkxk1 is separable, the optimization problem corresponding to the i-th coordinate is given by
1
min (x − xi )2 + ηλ|x|, xi = [xk − η∇f (xk )]i .
x∈R 2

This coordinate-wise optimization problem has the closed-form solution



1 z + λ, if z < −λ,

2
arg min (x − y) + λ|y| = τλ (y), τλ (z) = 0, if z ∈ [−λ, λ], (4)
y 2 
z − λ, if z > λ.

The operator τλ is called the soft-thresholding operator. The resulting algorithm is called Iterative
Shrinkage-Thresholding Algorithm (ISTA).
Equation (4) follows by analyzing the one-dimensional problem
1
arg min (x − y)2 + λ|y|.
y 2

Specifically, the optimality condition is that


 
1 2
0∈∂ (x − y) + λ|y| ,
2
which is equivalent to
0 = y − x + λsign(y).

5
References
[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

You might also like