Subgradients
Hoàng Nam Dũng
Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội
Last time: gradient descent
Consider the problem
min f (x)
x
for f convex and differentiable, dom(f ) = Rn .
Gradient descent: choose initial x (0) ∈ Rn , repeat
x (k) = x (k−1) − tk · ∇f (x (k−1) ),
k = 1, 2, 3, . . .
Step sizes tk chosen to be fixed and small, or by backtracking line
search
If ∇f Lipschitz, gradient descent has convergence rate O(1/ε)
Downsides:
Requires f differentiable
Can be slow to converge
1
Outline
Today:
Subgradients
Examples
−1
first-order
approximation
f at xlower
is abound
global lower
• the The
first-order
approximation
of f at x isof
a global
bound.
∇f defines a non-vertical supporting hyperplane to epi(f ) at
T
∇f (x)
y
x
(x, f (x))
−
≤ 0 ∀(y, t) ∈ epi f
• ∇f (x) defines a non-vertical supporting hyperplane to epi f at (x, f (x)):
Subgradients
∇f
−1
4
Subgradients
A subgradient of a convex function f at x is any g ∈ Rn such that
f (y ) ≥ f (x) + g T (y − x), ∀y ∈ dom(f ).
Always exists (on the relative
interior of dom(f ))
Subgradient
If f differentiable at x, then g = ∇f (x) uniquely
g is a subgradient of a convex function f at x ∈ dom f if
Same
definition works for nonconvex f (however, subgradients
need not exist).f (y) ≥ f (x) + gT (y − x) ∀y ∈ dom f
f (y)
f (x1) + g1T (y − x1)
f (x1) + g2T (y − x1)
f (x2) + g3T (y − x2)
x1
x2
g1 and g2 are subgradients at x1 , g3 is subgradient at x2 .
g1, g2 are subgradients at x1; g3 is a subgradient at x2
4
2
x
• For x = 0, unique subgradient g = sign(x)
For x = 0, unique subgradient g = sign(x)
For x = 0, subgradient g is any element of [−1, 1].
• For x = 0, subgradient g is any element of [−1, 1]
5
5
Examples of subgradients
Considerf f: :RRnn→
→R,
R,ff (x)
(x) =
= xx
Consider
22
f(x)
x2
x1
x1
For x = 0, unique ith component g = sign(x )
For xi = 0, ith component gi is any element of [−1, 1].
• For xi = 0, ith component gi is any element of [−1, 1]
• For xi i = 0, unique ith component gi i = sign(xii)
7
Examples of subgradients
0
5
f(x)
10
15
n
Considerf f(x)
(x)==max{f
max{f11(x),
(x), ff22(x)}, for
Consider
for ff11,,ff2 2: R
= ∇f
∇f22(x)
(x)
• For
Forf1f1(x)
(x)==ff22(x),
(x), subgradient
subgradient gg isis any
any point
point on
on line
line segment
segment
• For
between∇f
∇f11(x)
(x)and
and∇f
∇f22(x)
(x).
between
8
Subdifferential
Set of all subgradients of convex f is called the subdifferential:
∂f (x) = {g ∈ Rn : g is a subgradient of f at x}.
Monotonicity
Theorem
The subdifferential of a convex function f is a monotone operator
(u − v )T (x − y ) ≥ 0, ∀u ∈ ∂f (x), v ∈ ∂f (y ).
Chứng minh.
By definition we have
f (y ) ≥ f (x) + u T (y − x) and f (x) ≥ f (y ) + v T (x − y ).
Combining them get shows monotonicity.
Question: Monotonicity for differentiable convex function?
10
Monotonicity
Theorem
The subdifferential of a convex function f is a monotone operator
(u − v )T (x − y ) ≥ 0, ∀u ∈ ∂f (x), v ∈ ∂f (y ).
Chứng minh.
By definition we have
f (y ) ≥ f (x) + u T (y − x) and f (x) ≥ f (y ) + v T (x − y ).
Combining them get shows monotonicity.
Question: Monotonicity for differentiable convex function?
(∇f (x) − ∇f (y ))T (x − y ) ≥ 0,
which follows directly from the first order characterization of
convex functions.
10
Examples of non-subdifferentiable functions
IC (y ) ≥ IC (x) + g T (y − x) for all y .
For y ∈ C , IC (y ) = ∞
For y ∈ C , this means 0 ≥ g T (y − x).
12
●
●
●
●
13 11
Subgradient calculus
Basic rules for convex functions:
Scaling: ∂(af ) = a · ∂f provided a > 0.
Addition: ∂(f1 + f2 ) = ∂f1 + ∂f2 .
Affine composition: if g (x) = f (Ax + b), then
∂g (x) = AT ∂f (Ax + b).
Finite pointwise maximum: if f (x) = maxi=1,...m fi (x), then
∂f (x) = conv
∂fi (x)
i:fi (x)=f (x)
= max z T x.
z
q ≤1
And
∂f (x) = argmax
z
q ≤1
z T x.
15
Why subgradients?
Subgradients are important for two reasons:
Convex analysis: optimality characterization via subgradients,
monotonicity, relationship to duality.
Convex optimization: if you can compute subgradients, then
you can minimize any convex function.
16
Optimality condition
min f (x) subject to x ∈ C
x
is solved at x if and only if
∇f (x)T (y − x) ≥ 0 for all y ∈ C .a
a
Direct proof see, e.g., />Teaching/ORF523/S16/ORF523_S16_Lec7_gh.pdf. Proof using subgradient
next slide.
Intuitively: says that gradient increases as we move away from x.
Note that for C = Rn (unconstrained case) it reduces to ∇f = 0.
18
Derivation of first-order optimality
Chứng minh.
First recast problem as
min f (x) + IC (x).
x
Now apply subgradient optimality: 0 ∈ ∂(f (x) + IC (x)).
Observe
0 ∈ ∂(f (x) + IC (x)) ⇔ 0 ∈ {∇f (x)} + NC (x)
⇔ −∇f (x) ∈ NC (x)
⇔ −∇f (x)T x ≥ −∇f (x)T y for all y ∈ C
⇔ ∇f (x)T (y − x) ≥ 0 for all y ∈ C
as desired.