
Proximal Gradient and Mirror Descent

References

[1] https://www.youtube.com/watch?v=0uIL-ADxvmw&list=PLsQI3WOaT8hgnhaqobr8ny41Pk5Q5bu7D&index=21

[2] A. Beck, First-Order Methods in Optimization, SIAM, 2017.

 

Proximal Gradient

Gradient descent can be viewed as minimizing a quadratic approximation of $g$ at each step:

$x_{t+1}=x_t - \gamma \nabla g(x_t) = arg\min_y g(x_t) + \nabla g(x_t)^T(y-x_t) + \frac{1}{2\gamma}||y-x_t||^2$

If $f(x)=g(x) + h(x)$ where $h(x)$ is convex but non-differentiable, we can still minimize $f$ with a gradient-descent-like update as below; this is called the proximal gradient method.

\begin{align}x_{t+1} &= arg\min_y g(x_t) + \nabla g(x_t)^T(y-x_t) + \frac{1}{2\gamma}||y-x_t||^2 + h(y)\\ &= arg\min_y \frac{1}{2\gamma} ||y-(x_t-\gamma \nabla g(x_t))||^2 +h(y)  \end{align}

Proximal mapping

$prox_{h,\gamma} (z) = arg\min_y \frac{1}{2\gamma} ||y-z||^2 + h(y)$

Example 1 (prox mapping of the $L_1$ norm) [2]

Let $g(x) = \lambda ||x||_1$.

For $x\in \mathbb{R}$, $prox_g (x) = [|x|-\lambda]_+ sgn(x)$. In other words, if $-\lambda < x < \lambda$, the result is 0.
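This soft-thresholding operator is easy to check numerically; a minimal sketch (the function name `prox_l1` is my own):

```python
import numpy as np

def prox_l1(z, lam):
    """prox of g(x) = lam * ||x||_1: componentwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# entries with |z| < lam are mapped to 0; larger entries shrink toward 0 by lam
print(prox_l1(np.array([-2.0, -0.5, 0.3, 1.5]), 1.0))
```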

Generalized Gradient

$G_{h,\gamma} (x_t) = \frac{1}{\gamma} \big(x_t - prox_{h,\gamma} ( x_t-\gamma \nabla g(x_t))\big)$

$x_{t+1} = x_t - \gamma G_{h,\gamma}(x_t)$

We now introduce an identity satisfied by the generalized gradient; here the step size is $\gamma = \frac{1}{\beta}$ and we abbreviate $G = G_{h,\gamma}$.

$G(x_t) = \nabla h(x_{t+1}) + \nabla g(x_t)$

$x_{t+1} = arg\min_y \frac{\beta}{2} ||y-(x_t - \frac{1}{\beta}\nabla g(x_t)) ||^2 + h(y)$

Therefore $x_{t+1}$ must satisfy $\beta ( x_{t+1} - (x_t - \frac{1}{\beta} \nabla g(x_t))) +\nabla h(x_{t+1})=0$. That is,

$G(x_t) = \nabla h(x_{t+1}) +\nabla g(x_t).$
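This identity can be verified numerically when $h$ happens to be differentiable; a sketch with $g(x)=\frac{1}{2}||Ax-b||^2$ and $h(x)=\frac{\mu}{2}||x||^2$ (both problem choices are mine, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(5, 3)), rng.normal(size=5)
mu = 0.5
beta = np.linalg.norm(A.T @ A, 2)          # smoothness constant of g
gamma = 1.0 / beta

grad_g = lambda x: A.T @ (A @ x - b)       # g(x) = 0.5 * ||Ax - b||^2
grad_h = lambda x: mu * x                  # h(x) = (mu/2) * ||x||^2
prox_h = lambda z: z / (1.0 + gamma * mu)  # closed-form prox of h with step gamma

x_t = rng.normal(size=3)
x_next = prox_h(x_t - gamma * grad_g(x_t))
G = (x_t - x_next) / gamma                 # generalized gradient

# G(x_t) = grad h(x_{t+1}) + grad g(x_t)
assert np.allclose(G, grad_h(x_next) + grad_g(x_t))
```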

Convergence rate [1]

Assume $g$ is $\beta$-smooth and take step size $\gamma = \frac{1}{\beta}$, so that $G(x_t) = \beta (x_t-x_{t+1})$.

By $\beta$-smoothness,

$$g(x_{t+1}) \leq g(x_t) - \frac{1}{\beta}\nabla g(x_t)^T G(x_t) + \frac{1}{2\beta} || G(x_t)||^2$$

Therefore,

\begin{align} f(x_{t+1}) &\leq g(x_t) -\frac{1}{\beta} \nabla g(x_t)^T G(x_t) + \frac{1}{2\beta} ||G(x_t)||^2 + h(x_{t+1}) \\ &\leq g(z) + \nabla g(x_t)^T (x_t-z) -\frac{1}{\beta} \nabla g(x_t)^T G(x_t) + \frac{1}{2\beta} ||G(x_t)||^2 + h(z) +\nabla h(x_{t+1})^T (x_{t+1}-z) \\ &=  f(z) + \nabla g(x_t)^T (x_t-z) -\frac{1}{\beta} \nabla g(x_t)^T G(x_t) + \frac{1}{2\beta} ||G(x_t)||^2  +\nabla h(x_{t+1})^T (x_t-z) + \nabla h(x_{t+1})^T (x_{t+1}-x_t) \end{align}

Using $G(x_t) = \nabla h(x_{t+1}) +\nabla g(x_t)$, we can derive

$f(x_{t+1}) \leq f(z) + G(x_t)^T (x_t - z)- \frac{1}{2\beta} || G(x_t)||^2.$

Setting $z=x_t$ gives $f(x_t) - f(x_{t+1}) \geq \frac{1}{2\beta} ||G(x_t)||^2$, which is the sufficient-decrease property.

Setting $z=x^*$ gives $f(x_{t+1}) - f(x^*) \leq \frac{\beta}{2} ( ||x_t-x^*||^2 - ||x_{t+1} - x^* ||^2 )$; summing (telescoping) and using monotonicity of $f(x_t)$ yields the $O(1/T)$ convergence rate of the last iterate. To obtain this inequality, use $2(x_t-x_{t+1})^T(x_t-x^*) = ||x_t - x_{t+1}||^2+||x_t-x^*||^2- ||x_{t+1}-x^*||^2$.
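The sufficient-decrease inequality can be observed on a small lasso problem; a sketch (the instance is my own, with the exact smoothness constant $\beta = \lambda_{\max}(A^TA)$):

```python
import numpy as np

rng = np.random.default_rng(1)
A, b, lam = rng.normal(size=(20, 8)), rng.normal(size=20), 0.1
beta = np.linalg.norm(A.T @ A, 2)        # smoothness constant of g

g = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
f = lambda x: g(x) + lam * np.sum(np.abs(x))
prox = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

x = np.zeros(8)
for _ in range(50):
    # proximal gradient step with gamma = 1/beta
    x_new = prox(x - (1 / beta) * A.T @ (A @ x - b), lam / beta)
    G = beta * (x - x_new)
    # sufficient decrease: f(x_t) - f(x_{t+1}) >= (1/(2*beta)) * ||G(x_t)||^2
    assert f(x) - f(x_new) >= G @ G / (2 * beta) - 1e-9
    x = x_new
```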

 

Bregman Divergence

Lemma 1. Generalized Pythagorean theorem for the Bregman divergence $D_f(x,y) = f(x) - f(y) - \nabla f(y)^T (x-y)$:

$$ D_f(x,y) + D_f(z,x) - D_f (z,y) = ( \nabla f(x) - \nabla f(y) )^T ( x - z ) $$ 
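As a sanity check, the identity holds numerically, e.g. for the negative entropy $f(x)=\sum_i x_i \log x_i$, whose Bregman divergence is the generalized KL divergence; a sketch:

```python
import numpy as np

# f(x) = sum_i x_i log x_i (negative entropy); D_f is the generalized KL divergence
def bregman(x, y):
    return np.sum(x * np.log(x / y) - x + y)

grad = lambda x: np.log(x) + 1.0

rng = np.random.default_rng(2)
x, y, z = rng.uniform(0.1, 1.0, size=(3, 4))  # three positive vectors

# D_f(x,y) + D_f(z,x) - D_f(z,y) = (grad f(x) - grad f(y))^T (x - z)
lhs = bregman(x, y) + bregman(z, x) - bregman(z, y)
rhs = (grad(x) - grad(y)) @ (x - z)
assert np.isclose(lhs, rhs)
```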

Mirror Descent

The mirror descent algorithm is as follows.

1. Map $x_t$ to $\nabla \Phi (x_t)$.

2. Take a gradient step in the dual space: $\nabla \Phi (x_t) - \eta \nabla f(x_t)$.

3. Find $y_{t+1}$ such that $\nabla \Phi (y_{t+1}) = \nabla \Phi (x_t) - \eta\nabla f(x_t)$.

4. Project back: $x_{t+1} = \Pi^{\Phi}_{\mathcal{X}}(y_{t+1}) = arg\min_{x\in\mathcal{X}} D_{\Phi}(x,y_{t+1})$
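For the negative-entropy mirror map $\Phi(x)=\sum_i x_i \log x_i$ on the simplex, steps 1-4 collapse into the exponentiated-gradient update, since $\nabla\Phi(x) = \log x + 1$ inverts to an exponential and the Bregman projection onto the simplex is a renormalization. A sketch (the test problem is my own):

```python
import numpy as np

def mirror_descent_simplex(grad_f, x0, eta, steps):
    """Mirror descent with Phi(x) = sum x_i log x_i on the simplex."""
    x = x0.copy()
    for _ in range(steps):
        # steps 1-3: grad Phi(x) = log x + 1, so the dual-space update
        # inverts to a multiplicative update
        y = x * np.exp(-eta * grad_f(x))
        # step 4: the Bregman projection onto the simplex is renormalization
        x = y / y.sum()
    return x

# minimize f(x) = c^T x over the simplex; the optimum puts all mass on argmin(c)
c = np.array([0.3, 0.1, 0.7])
x = mirror_descent_simplex(lambda x: c, np.ones(3) / 3, eta=0.5, steps=200)
assert np.argmax(x) == np.argmin(c)
```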

The relations below show that when the Bregman divergence is the squared Euclidean distance, mirror descent coincides with projected gradient descent.

\begin{align} x_{t+1} &= arg\min_{x\in\mathcal{X}} D_{\Phi}(x,y_{t+1})\\ &= arg\min_{x\in\mathcal{X}} \Phi(x) - \nabla \Phi(y_{t+1})^Tx\\ &= arg\min_{x\in\mathcal{X}} \Phi (x) - (\nabla \Phi(x_t) - \eta \nabla f(x_t))^Tx \\ &= arg\min_{x\in\mathcal{X}} \eta \nabla f(x_t)^Tx + D_{\Phi}(x,x_t)  \end{align}


Theorem 4.2 (from Bubeck). Convergence rate of mirror descent

Let $\Phi$ be a mirror map that is $\rho$-strongly convex on $\mathcal{X}\cap \mathcal{D}$ w.r.t. $||\cdot||$.

Let $R^2 = \sup_{x\in\mathcal{X}\cap \mathcal{D}} \Phi(x) - \Phi(x_1)$ and let $f$ be convex and $L$-Lipschitz w.r.t. $||\cdot||$.

Then mirror descent with $\eta = \frac{R}{L} \sqrt{\frac{2\rho}{t}}$ satisfies

$$ f(\bar{x}_t) - f(x^*) \leq RL \sqrt{\frac{2}{\rho t}}$$
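The bound can be checked numerically for the entropy mirror map on the simplex; a sketch (the instance and the constants $\rho = 1$ w.r.t. $||\cdot||_1$ by Pinsker's inequality and $R^2 = \log n$ for a uniform start are my own choices):

```python
import numpy as np

# f(x) = c^T x on the simplex; L-Lipschitz w.r.t. ||.||_1 with L = ||c||_inf
c = np.array([0.3, 0.1, 0.7])
n, t = len(c), 500
L, rho, R = np.max(np.abs(c)), 1.0, np.sqrt(np.log(n))  # R^2 = log n, rho = 1
eta = (R / L) * np.sqrt(2 * rho / t)

x = np.ones(n) / n          # x_1 = uniform distribution
avg = np.zeros(n)
for _ in range(t):
    avg += x / t            # running average of x_1, ..., x_t
    y = x * np.exp(-eta * c)  # entropy mirror map: multiplicative update
    x = y / y.sum()

gap = c @ avg - np.min(c)
# Theorem 4.2: f(avg) - f(x*) <= R * L * sqrt(2 / (rho * t))
assert gap <= R * L * np.sqrt(2 / (rho * t))
```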

 

$$\begin{align}f(x_s) - f(x) & \leq g^T_s ( x_s-x)\\& =\frac{1}{\eta}( \nabla \Phi (x_s) - \nabla \Phi (y_{s+1}))^T (x_s-x)\\&= \frac{1}{\eta} ( D_{\Phi}(x,x_s) + D_{\Phi}(x_s,y_{s+1})  -  D_{\Phi} (x,y_{s+1}) ) \\&\leq \frac{1}{\eta} ( D_{\Phi}(x,x_s) + D_{\Phi}(x_s,y_{s+1})  -  D_{\Phi} (x,x_{s+1})-  D_{\Phi} (x_{s+1},y_{s+1}) )   \end{align}$$

We prove $D_{\Phi} (x^*,x_{s+1}) + D_{\Phi}(x_{s+1},y_{s+1}) \leq D_{\Phi}(x^*,y_{s+1})$

\begin{align} D_{\Phi} (x^*,x_{s+1}) + D_{\Phi}(x_{s+1},y_{s+1}) &= \Phi(x^*)-\Phi(y_{s+1}) - \nabla \Phi(y_{s+1})^T (x^* - y_{s+1}) - ( \nabla \Phi(x_{s+1}) - \nabla \Phi(y_{s+1}))^T (x^*-x_{s+1})\\ &= D_{\Phi}(x^*,y_{s+1}) - \nabla_x D_{\Phi} (x_{s+1},y_{s+1})^T (x^*-x_{s+1}) \leq D_{\Phi} (x^*,y_{s+1}) \end{align}

where the last inequality follows from the first-order optimality condition of $x_{s+1} = arg\min_{x\in\mathcal{X}} D_{\Phi}(x,y_{s+1})$.

Inequality related to strong convexity

\begin{align} D_{\Phi}(x_s,y_{s+1}) - D_{\Phi}(x_{s+1},y_{s+1}) &= \Phi(x_s) - \Phi(x_{s+1}) - \nabla \Phi (y_{s+1})^T (x_s - x_{s+1}) \\ &\leq (\nabla \Phi (x_s) - \nabla \Phi (y_{s+1}))^T (x_s - x_{s+1}) - \frac{\rho}{2} ||x_s - x_{s+1}||^2\\ &= \eta g^T_s (x_s - x_{s+1}) - \frac{\rho}{2} ||x_s - x_{s+1}||^2 \\ &\leq \eta L ||x_s - x_{s+1}|| - \frac{\rho}{2} ||x_s - x_{s+1}||^2 \\ &\leq \frac{(\eta L)^2}{2\rho}\end{align}

where the final step maximizes $\eta L z - \frac{\rho}{2} z^2$ over $z \geq 0$.

 
