Convex Optimization in Deep Learning

The document discusses convex optimization problems relevant to signal and image recovery, presenting mathematical formulations and examples such as $\ell_1$-minimization and machine learning model fitting. It defines key concepts including convex sets, functions, and optimality conditions, emphasizing the efficiency of convex optimization. Additionally, it introduces algorithms like gradient descent and proximal gradient descent, highlighting their applications in solving optimization problems effectively.

Uploaded by

Subhosri Basu


Deep Learning and Inverse Problems

Summer 2022
Reinhard Heckel

3 Solving convex optimization problems numerically

Many problems arising in signal and image recovery can be formulated as an optimization problem:

minimize f (x) subject to x ∈ C, (1)

where f is a cost function and C is a set.

A concrete example is the $\ell_1$-minimization problem discussed in the previous section, where
$f(x) = \|x\|_1$ and $C = \{x : Ax = y\}$.
Another example introduced later is a machine learning setup, where $g_\theta(x)$ is a machine learning
model parameterized by $\theta$, and our goal is to fit the model to a set of examples $\{(x_i, y_i)\}$ by
minimizing:

\[ f(\theta) = \sum_{i=1}^{n} \|g_\theta(x_i) - y_i\|_2^2. \]

In this case, the function $f$ models goodness of fit (we'll see many concrete
examples later); in general, it models utility or cost.
The set $C$ can incorporate constraints such as a limited budget (e.g., $\|x\| \le B$) or priors (e.g.,
$x$ is non-negative).
We say that $x^*$ is a (global) solution to the optimization problem (1) if $x^* \in C$ and $f(x^*) \le f(x)$
for all $x \in C$. We say that $x^*$ is a local solution to the optimization problem if $x^* \in C$ and there is
a neighborhood $N$ of $x^*$ such that $f(x^*) \le f(x)$ for all $x \in C \cap N$. Optimality can be very hard
to check, even if $f$ is differentiable and $C = \mathbb{R}^n$.

3.1 Convex optimization problems

An important class of optimization problems for which we can check optimality rather easily, and
which we can solve efficiently, is the class of convex optimization problems, that is, problems of the
form (1) in which $f$ is a convex function and $C$ is a convex set. A standard reference for convex
optimization problems is the book [BV04].

Definition 1. A convex set C is any set such that for all x, y ∈ C and all θ ∈ (0, 1),

θx + (1 − θ)y ∈ C.

Figure 1 shows examples of convex and non-convex sets. Intersections of convex sets are convex.
Important examples of convex sets are the following:

• Subspaces $\{x = Ua : a \in \mathbb{R}^d\}$,

• Affine spaces $\{x = Ua + b : a \in \mathbb{R}^d\}$,

• Half-spaces $\{x : \langle a, x \rangle \le b\}$.

[Figure 1: Graphical interpretation of convex sets. Panels: (a) convex set, (b) non-convex set.]

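Definition 2. For a set of points $x_1, \ldots, x_n \in \mathbb{R}^p$, a convex combination of them is any point of the form

\[ x = \sum_{i=1}^{n} \theta_i x_i, \quad \text{where } \sum_{i=1}^{n} \theta_i = 1 \text{ and } \theta_i \ge 0 \text{ for all } i. \]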

A set $C$ is convex if and only if it contains all convex combinations of its elements. For any
given set $S$, the convex hull of $S$ is defined as the set of all convex combinations of points in $S$.
Intuitively, it is the smallest convex set that contains $S$. As an example, consider the non-convex
set $\{x : \|x\|_0 \le 1, \|x\|_1 \le 1\}$. Its convex hull is the $\ell_1$-norm ball $\{x : \|x\|_1 \le 1\}$.
We are now ready to define a convex function.
Definition 3. A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^n$ and all $\theta \in (0, 1)$,

\[ \theta f(x) + (1 - \theta) f(y) \ge f(\theta x + (1 - \theta) y). \]

A function is strictly convex if the inequality above is strict whenever $x \ne y$. A function $f$ is
concave if $-f$ is convex.
Common examples of convex functions are:
• Linear functions: $f(x) = \langle a, x \rangle + b$,
• Quadratics: $f(x) = \frac{1}{2} x^T Q x + b^T x + c$, where $Q$ is a symmetric matrix. The function $f$ is
convex if and only if $Q$ is positive semidefinite.
• Any norm $f(x) = \|x\|$ is convex. This follows from the triangle inequality and homogeneity
(absolute scalability).
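The examples above can be spot-checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `is_convex_quadratic` and `jensen_gap` are hypothetical and introduced only for illustration. It tests positive semidefiniteness through the smallest eigenvalue and evaluates the inequality of Definition 3 for the $\ell_1$-norm at random points. A random check like this can refute convexity but cannot prove it.

```python
import numpy as np

def is_convex_quadratic(Q, tol=1e-10):
    """True if x -> 0.5 x^T Q x + b^T x + c is convex, i.e. the symmetric part of Q is PSD."""
    Q_sym = 0.5 * (Q + Q.T)  # only the symmetric part of Q contributes to x^T Q x
    return bool(np.linalg.eigvalsh(Q_sym).min() >= -tol)

def jensen_gap(f, x, y, theta):
    """theta*f(x) + (1-theta)*f(y) - f(theta*x + (1-theta)*y); nonnegative if f is convex."""
    return theta * f(x) + (1 - theta) * f(y) - f(theta * x + (1 - theta) * y)

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 3))
Q_psd = B.T @ B                      # B^T B is always positive semidefinite
Q_indef = np.diag([1.0, -1.0, 2.0])  # has a negative eigenvalue, so not PSD

convex_flags = (is_convex_quadratic(Q_psd), is_convex_quadratic(Q_indef))

# Spot-check Definition 3 for the l1-norm at random points and random theta.
l1_norm = lambda v: np.abs(v).sum()
gaps = [
    jensen_gap(l1_norm, rng.standard_normal(3), rng.standard_normal(3), rng.uniform())
    for _ in range(100)
]
min_gap = min(gaps)  # should be nonnegative up to rounding
```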
An important property of convex functions is that local minima are global.
Proposition 1. Any local minimum of a convex function $f : \mathbb{R}^n \to \mathbb{R}$ is also a global minimum.
We now turn to first-order optimality conditions. To this end, we first state that a differentiable
function is convex if and only if it is lower bounded by its first-order Taylor expansion at
any point.
Proposition 2. A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if for all $x, y \in \mathbb{R}^n$,

\[ f(y) \ge f(x) + \langle y - x, \nabla f(x) \rangle. \]

It is strictly convex if and only if the inequality holds strictly for all $x \ne y$.

An important consequence is the following optimality condition.

Corollary 1. If a differentiable function $f$ is convex and $\nabla f(x^*) = 0$, then $x^*$ is a global minimizer
of $f$.
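The corollary follows in one line from Proposition 2. Written out, with $x = x^*$ and an arbitrary $y \in \mathbb{R}^n$:

```latex
f(y) \;\ge\; f(x^*) + \langle y - x^*, \nabla f(x^*) \rangle
     \;=\; f(x^*) + \langle y - x^*, 0 \rangle
     \;=\; f(x^*)
\qquad \text{for all } y \in \mathbb{R}^n .
```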

3.2 Computational aspects of optimization algorithms

Suppose we want to minimize a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ over $\mathbb{R}^n$:

\[ \underset{x \in \mathbb{R}^n}{\text{minimize}} \; f(x). \]

For simplicity, assume that $f(x) = \|Ax - y\|_2^2$, where $A \in \mathbb{R}^{m \times n}$ is a matrix with full column
rank, so the optimization problem above is a standard least-squares problem. Because the columns of $A$ are
linearly independent, $A^T A$ is invertible and we obtain the closed-form solution $x^* = (A^T A)^{-1} A^T y$. However,
for many practically relevant functions $f$, we cannot compute a closed-form solution. Even if we
can, we might not want to, because computing the closed-form solution can be computationally
expensive; e.g., for $x^* = (A^T A)^{-1} A^T y$ we have to invert the $n \times n$ matrix $A^T A$ (or solve the
corresponding linear system), which is costly when $n$ is large.
We consider algorithms for finding approximate solutions of the minimization problem above.
In practice we are almost always content with an approximate solution (a computer can only give
us a solution up to machine precision anyway). We assume that we can evaluate the function $f$
at a point $x$ as well as compute its gradient $\nabla f(x)$ (or a subgradient, if $f$ is not differentiable).
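As a small illustration of the closed-form solution, here is a sketch assuming NumPy and a random, purely illustrative problem instance. It solves the normal equations $A^T A x = A^T y$ with a linear solver rather than forming the inverse, and compares the result with NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 5                      # illustrative sizes, not taken from the notes
A = rng.standard_normal((m, n))   # a Gaussian matrix has full column rank with probability one
y = rng.standard_normal(m)

# Normal equations A^T A x = A^T y, solved without forming the inverse explicitly.
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# NumPy's least-squares solver, for comparison.
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

# At the minimizer, the gradient 2 A^T (A x - y) of ||A x - y||_2^2 vanishes.
grad_norm = np.linalg.norm(2 * A.T @ (A @ x_normal - y))
```

Solving the linear system is usually preferred over computing $(A^T A)^{-1}$ explicitly, but its cost still grows quickly with $n$, which motivates the iterative methods that follow.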

3.3 Gradient descent

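Gradient descent is a simple iterative algorithm for minimizing a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$
on $\mathbb{R}^n$. Starting from an initial point $x_0$, gradient descent iterates:

\[ x_{k+1} = x_k - \alpha_k \nabla f(x_k), \]

where $\alpha_k$ is a step size parameter. Under suitable conditions on $f$ and the step sizes (e.g., a
Lipschitz-continuous gradient and sufficiently small steps), gradient descent converges to a stationary
point, and provided that $f$ is convex and has a minimizer, it converges to a global minimum. The idea
behind this algorithm is to make small steps in the direction of the negative gradient, which is the
direction in which the local first-order approximation of $f$ decreases fastest.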
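The iteration above can be sketched in a few lines. This is a minimal illustration, assuming NumPy, a fixed step size, and a fixed iteration budget; the function name `gradient_descent` and the least-squares test problem are assumptions made for this example. The step size $1/L$, with $L$ the Lipschitz constant of the gradient, is one common choice and not one prescribed by the notes.

```python
import numpy as np

def gradient_descent(grad, x0, step_size, num_iters):
    """Iterate x_{k+1} = x_k - step_size * grad(x_k) with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - step_size * grad(x)
    return x

# Example: f(x) = ||A x - y||_2^2 with gradient 2 A^T (A x - y).
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
grad_f = lambda x: 2 * A.T @ (A @ x - y)

# L = 2 * sigma_max(A)^2 is the Lipschitz constant of grad_f; we use step size 1/L.
L = 2 * np.linalg.norm(A, 2) ** 2
x_gd = gradient_descent(grad_f, np.zeros(4), 1.0 / L, 2000)

# Reference solution from a direct least-squares solver.
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
```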

3.4 Convergence for quadratic functions

In order to understand the gradient method better, let us start with a simple class of functions,
namely quadratic functions:
\[ f(x) = \frac{1}{2} x^T Q x - b^T x, \]
where $Q$ is symmetric and strictly positive definite. A closed-form solution to the minimization problem
$\text{minimize}_x \, f(x)$ is $x^* = Q^{-1} b$.
We want to understand what gradient descent yields for this problem. We consider gradient
descent with a fixed step size $\eta$:
\[ x_{k+1} = x_k - \eta \nabla f(x_k). \]

Using that the gradient is given by $\nabla f(x) = Qx - b$ and that the optimal solution obeys $Qx^* = b$,
the difference of the $(k+1)$-st iterate to the optimum is

\[
\begin{aligned}
x_{k+1} - x^* &= x_k - \eta \nabla f(x_k) - x^* \\
&= x_k - \eta (Q x_k - b) - x^* \\
&= x_k - \eta (Q x_k - Q x^*) - x^* \\
&= (I - \eta Q)(x_k - x^*).
\end{aligned}
\]

It follows that
\[ \|x_{k+1} - x^*\|_2 \le \|I - \eta Q\| \, \|x_k - x^*\|_2. \]
Since $I - \eta Q$ is symmetric (the first equality below can be checked by taking the singular value
decomposition of the matrix),

\[ \|I - \eta Q\| = \max(\lambda_{\max}(I - \eta Q), -\lambda_{\min}(I - \eta Q)) = \max(\eta M - 1, 1 - \eta m), \]

where $M$ and $m$ are the largest and smallest singular values of the matrix $Q$. For the second
equality, we used that, due to $(I - \eta Q) v_i = (1 - \eta \lambda_i(Q)) v_i$ for every eigenvector $v_i$ of $Q$, the
eigenvalues of $I - \eta Q$ and $Q$ are related as $\lambda_i(I - \eta Q) = 1 - \eta \lambda_i(Q)$ (see footnote 1). The right-hand
side above is minimized for $\eta = \frac{2}{M + m}$. For this choice, $\|I - \eta Q\| = \frac{1 - 1/\kappa}{1 + 1/\kappa} < 1$, where
$\kappa = M/m$ is the condition number of the matrix $Q$.
To summarize, we have proven the following proposition:
Proposition 3. Gradient descent with $\eta = \frac{2}{M + m}$ applied to a quadratic function obeys
\[ \|x_k - x^*\|_2 \le \left( \frac{1 - 1/\kappa}{1 + 1/\kappa} \right)^{k} \|x_0 - x^*\|_2. \]
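The bound of Proposition 3 can be checked numerically. The sketch below assumes NumPy and a randomly generated positive definite $Q$; the problem size and random seed are arbitrary choices for illustration. It runs gradient descent with $\eta = 2/(M+m)$ and verifies at every iteration that the error stays below the geometric bound, up to a small tolerance for floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
B = rng.standard_normal((n, n))
Q = B.T @ B + 0.5 * np.eye(n)       # symmetric and positive definite by construction
b = rng.standard_normal(n)
x_star = np.linalg.solve(Q, b)      # closed-form minimizer x* = Q^{-1} b

eigs = np.linalg.eigvalsh(Q)        # eigenvalues in ascending order
m_eig, M_eig = eigs[0], eigs[-1]
eta = 2.0 / (M_eig + m_eig)         # step size from Proposition 3
kappa = M_eig / m_eig               # condition number of Q
rate = (1 - 1 / kappa) / (1 + 1 / kappa)

x = np.zeros(n)
err0 = np.linalg.norm(x - x_star)
bound_holds = True
for k in range(1, 201):
    x = x - eta * (Q @ x - b)       # gradient step, using grad f(x) = Q x - b
    err = np.linalg.norm(x - x_star)
    # Compare with rate^k * ||x_0 - x*||, allowing slack for rounding errors.
    bound_holds = bound_holds and bool(err <= rate**k * err0 * (1 + 1e-9) + 1e-12)
```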
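Thus the rate of convergence is geometric/linear. Suppose we want to find a solution that is
$\epsilon$-close to the optimal solution. It follows from the proposition that we require

\[ N = \left( \log \frac{1 + 1/\kappa}{1 - 1/\kappa} \right)^{-1} \log\left( \|x_0 - x^*\|_2 / \epsilon \right) \]

many iterations to reach an $\epsilon$-accurate solution. Due to $\log(1 + x) \approx x$ for small $x$, we have
$\log \frac{1 + 1/\kappa}{1 - 1/\kappa} \approx 2/\kappa$ for large condition numbers, and therefore $N = O(\kappa \log(1/\epsilon))$.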
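To get a feeling for these numbers, the iteration count can be evaluated directly. This is a small sketch using only the Python standard library; the helper name `iterations_needed` and the chosen values of $\kappa$ and $\epsilon$ are assumptions for illustration, and the helper requires $\kappa > 1$. It compares the exact count from the formula above with the approximation $(\kappa/2) \log(\|x_0 - x^*\|_2 / \epsilon)$.

```python
import math

def iterations_needed(kappa, initial_error, eps):
    """Smallest integer N with rate**N * initial_error <= eps (requires kappa > 1)."""
    rate = (1 - 1 / kappa) / (1 + 1 / kappa)
    return max(0, math.ceil(math.log(initial_error / eps) / math.log(1 / rate)))

kappas = (10, 100, 1000)
# Exact iteration counts for initial error 1 and target accuracy 1e-6.
counts = {kappa: iterations_needed(kappa, 1.0, 1e-6) for kappa in kappas}
# Large-kappa approximation: log((1 + 1/kappa) / (1 - 1/kappa)) is roughly 2 / kappa.
approx = {kappa: 0.5 * kappa * math.log(1.0 / 1e-6) for kappa in kappas}
```

The count grows roughly linearly in $\kappa$, which is why badly conditioned problems are slow for plain gradient descent.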

3.5 Proximal gradient descent

The gradient descent algorithm introduced in the previous section is only one example of an iterative
algorithm for solving a convex optimization problem, and many more exist that can be faster for a
given application.
A concrete example is an algorithm called proximal gradient descent, which performs well for the
sparse recovery problem discussed in the last section. As discussed, solving

\[ \arg\min_x \|x\|_1 \quad \text{subject to} \quad \|Ax - y\|_2^2 \le \xi \]

provably recovers a sparse vector from an observation $y$ (provided the vector is sufficiently sparse
and $A$ is appropriately chosen). It can be shown that this constrained problem is equivalent to the
unconstrained $\ell_1$-regularized least-squares problem

\[ \arg\min_x \underbrace{\tfrac{1}{2} \|Ax - y\|_2^2}_{f(x)} + \lambda \|x\|_1, \tag{2} \]

i.e., for each $\xi$ there is a regularization parameter $\lambda$ so that the two problems have the same
solution.

[Footnote 1: The singular value decomposition and the eigenvalue decomposition of a symmetric positive
semidefinite matrix are identical. Therefore $\lambda$ can be used for singular values and eigenvalues at the
same time. $\lambda_i(A)$ denotes the $i$-th eigenvalue of the matrix $A$.]
First suppose we are only interested in minimizing the smooth function $f$. The gradient descent
iterates for optimizing $f$ are given by
\[ x_{k+1} = x_k - \eta \nabla f(x_k). \]
We can write those iterates in so-called proximal form as a solution to a simple optimization
problem:
\[ x_{k+1} = \arg\min_x \frac{1}{2} \left\| x - (x_k - \eta \nabla f(x_k)) \right\|_2^2. \]
However, our goal is to minimize $f(x) + \lambda \|x\|_1$ instead of $f$, so it seems natural to solve
\[ x_{k+1} = \arg\min_x \frac{1}{2} \left\| x - (x_k - \eta \nabla f(x_k)) \right\|_2^2 + \eta \lambda \|x\|_1. \tag{3} \]
Those iterates are the so-called proximal gradient descent updates. There are more formal ways
to arrive at the proximal gradient descent algorithm, and it can be shown that proximal gradient
descent provably converges for the loss function above.
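Proximal gradient descent is very popular for $\ell_1$-regularized least squares, because the optimization
problem in (3) for obtaining the iterates is separable and thus simple to solve. Specifically,
since both the squared $\ell_2$-norm and $\lambda \|x\|_1$ are separable across coordinates, the optimization
problem corresponding to the $i$-th coordinate is given by
\[ \min_{x \in \mathbb{R}} \frac{1}{2} (x - z_i)^2 + \eta \lambda |x|, \qquad z_i = [x_k - \eta \nabla f(x_k)]_i. \]

This coordinate-wise optimization problem has a closed-form solution, which follows from

\[
\arg\min_y \frac{1}{2} (x - y)^2 + \lambda |y| = \tau_\lambda(x), \qquad
\tau_\lambda(z) =
\begin{cases}
z + \lambda, & \text{if } z < -\lambda, \\
0, & \text{if } z \in [-\lambda, \lambda], \\
z - \lambda, & \text{if } z > \lambda.
\end{cases}
\tag{4}
\]

The operator $\tau_\lambda$ is called the soft-thresholding operator. Applied to the coordinate-wise problem
with threshold $\eta \lambda$, it gives the update $x_{k+1} = \tau_{\eta \lambda}(x_k - \eta \nabla f(x_k))$, where $\tau_{\eta \lambda}$ acts
entry-wise. The resulting algorithm is called the Iterative Shrinkage-Thresholding Algorithm (ISTA).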
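The ISTA update can be sketched as follows. This is a minimal illustration, assuming NumPy, $f(x) = \frac{1}{2}\|Ax - y\|_2^2$ as in (2), and the fixed step size $\eta = 1/\|A\|^2$ (the inverse Lipschitz constant of $\nabla f$), which is a common choice and an assumption here. The function names, the synthetic sparse recovery instance, and the value of $\lambda$ are likewise chosen only for this example.

```python
import numpy as np

def soft_threshold(z, tau):
    """Entry-wise soft-thresholding: shrink each entry of z toward zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, y, lam, num_iters=500):
    """Proximal gradient descent for 0.5 * ||A x - y||_2^2 + lam * ||x||_1."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2   # assumed step size: 1 / Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - y)            # gradient of the smooth part f
        x = soft_threshold(x - eta * grad, eta * lam)
    return x

def objective(A, y, lam, x):
    """Value of the l1-regularized least-squares objective in (2)."""
    return 0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x))

# Synthetic sparse recovery instance (sizes and sparsity are illustrative assumptions).
rng = np.random.default_rng(3)
m, n, s = 40, 100, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.choice([-1.0, 1.0], s)
y = A @ x_true

lam = 0.01
x_hat = ista(A, y, lam)
```

Each iteration costs two matrix-vector products plus an entry-wise threshold, which is what makes the method attractive for large problems.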
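Equation (4) follows by analyzing the one-dimensional problem
\[ \arg\min_y \frac{1}{2} (x - y)^2 + \lambda |y|. \]

Specifically, the optimality condition is that

\[ 0 \in \partial \left( \frac{1}{2} (x - y)^2 + \lambda |y| \right), \]
which is equivalent to
\[ 0 \in y - x + \lambda \, \partial |y|, \]
where $\partial |y| = \{\mathrm{sign}(y)\}$ for $y \ne 0$ and $\partial |y| = [-1, 1]$ for $y = 0$. For $y \ne 0$, this reads
$0 = y - x + \lambda \, \mathrm{sign}(y)$.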
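Working through the three possible signs of $y$ in this optimality condition recovers the three cases of the soft-thresholding operator in (4). A short derivation, using that $\partial |y| = [-1, 1]$ at $y = 0$:

```latex
\begin{aligned}
y > 0: &\quad 0 = y - x + \lambda \;\Rightarrow\; y = x - \lambda,
        \quad \text{consistent with } y > 0 \text{ iff } x > \lambda, \\
y < 0: &\quad 0 = y - x - \lambda \;\Rightarrow\; y = x + \lambda,
        \quad \text{consistent with } y < 0 \text{ iff } x < -\lambda, \\
y = 0: &\quad 0 \in -x + \lambda [-1, 1] \;\Leftrightarrow\; x \in [-\lambda, \lambda].
\end{aligned}
```

Hence the minimizer is $y = \tau_\lambda(x)$.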

References
[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
