References
[1] https://www.youtube.com/watch?v=0uIL-ADxvmw&list=PLsQI3WOaT8hgnhaqobr8ny41Pk5Q5bu7D&index=21
[2] Amir Beck, First-Order Methods in Optimization
Proximal Gradient
Gradient descent can be viewed as minimizing a quadratic approximation of the objective:
$x_{t+1}=x_t - \gamma \nabla g(x_t) = arg\min_y g(x_t) + \nabla g(x_t)^T(y-x_t) + \frac{1}{2\gamma}||y-x_t||^2$
If we add a convex but non-differentiable $h(x)$, so that $f(x)=g(x)+h(x)$, we can still minimize $f$ in a way similar to gradient descent, as shown below; this is called the proximal gradient method.
\begin{align}x_{t+1} &= arg\min_y g(x_t) + \nabla g(x_t)^T(y-x_t) + \frac{1}{2\gamma}||y-x_t||^2 + h(y)\\ &= arg\min_y \frac{1}{2\gamma} ||y-(x_t-\gamma \nabla g(x_t))||^2 +h(y) \end{align}
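As a minimal sketch of this iteration in code (the helper names `grad_g` and `prox_h` are my own placeholders, not from the references):

```python
import numpy as np

def proximal_gradient(grad_g, prox_h, x0, gamma, num_iters):
    """Sketch of the proximal gradient iteration
    x_{t+1} = prox_{h,gamma}(x_t - gamma * grad_g(x_t)).
    grad_g(x): gradient of the smooth part g.
    prox_h(z, gamma): solves argmin_y (1/(2*gamma))*||y - z||^2 + h(y)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = prox_h(x - gamma * grad_g(x), gamma)  # gradient step on g, then prox step on h
    return x
```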
Proximal mapping
$prox_{h,\gamma} (z) = arg\min_y \frac{1}{2\gamma} ||y-z||^2 + h(y)$
Example 1 (prox mapping of L1) [2]
Let $h(x) = \lambda ||x||_1$.
For $x\in \mathbb{R}$, $prox_h (x) = [|x|-\lambda]_+ sgn(x)$, where $[a]_+ = \max(a,0)$. That is, the output is exactly $0$ whenever $-\lambda < x < \lambda$ (soft-thresholding).
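This prox is the elementwise soft-thresholding operator. A small sketch (the helper name `prox_l1` is mine), which can be plugged into the `proximal_gradient` sketch above:

```python
import numpy as np

def prox_l1(z, gamma, lam):
    """prox_{h,gamma}(z) for h(x) = lam * ||x||_1:
    elementwise soft-thresholding [|z_i| - gamma*lam]_+ * sgn(z_i)."""
    return np.maximum(np.abs(z) - gamma * lam, 0.0) * np.sign(z)

# Entries with |z_i| <= gamma*lam are set exactly to zero.
print(prox_l1(np.array([-2.0, 0.3, 1.5]), gamma=1.0, lam=0.5))  # [-1.5  0.   1. ]
```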
Generalized Gradient
$G_{h,\gamma} (x_t) = \frac{1}{\gamma} (x_t - prox_{h,\gamma} ( x_t-\gamma \nabla g(x_t)))$
$x_{t+1} = x_t - \gamma G_{h,\gamma}(x_t)$
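As a quick sanity check (my own, not from the references): when $h \equiv 0$, the prox is the identity map, so the generalized gradient reduces to $\nabla g$ and the update recovers plain gradient descent.
\begin{align} prox_{0,\gamma}(z) &= arg\min_y \frac{1}{2\gamma}||y-z||^2 = z\\ G_{0,\gamma}(x_t) &= \frac{1}{\gamma}(x_t - (x_t - \gamma \nabla g(x_t))) = \nabla g(x_t)\\ x_{t+1} &= x_t - \gamma G_{0,\gamma}(x_t) = x_t - \gamma \nabla g(x_t) \end{align}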
We now derive an identity for the generalized gradient (when $h$ is non-differentiable, $\nabla h(x_{t+1})$ below denotes a subgradient of $h$ at $x_{t+1}$):
$G(x_t) = \nabla h(x_{t+1}) + \nabla g(x_t)$
$x_{t+1} = arg\min_y \frac{\beta}{2} ||y-(x_t - \frac{1}{\beta}\nabla g(x_t)) ||^2 + h(y) $
Therefore the first-order optimality condition $\beta ( x_{t+1} - (x_t - \frac{1}{\beta} \nabla g(x_t))) +\nabla h(x_{t+1})=0$ must hold. That is,
$G(x_t) = \nabla h(x_{t+1}) +\nabla g(x_t)$.
Convergence rate [1]
Assume $g$ is $\beta$-smooth, and recall $G(x_t) = \frac{1}{\gamma} (x_t-x_{t+1})$. With step size $\gamma = \frac{1}{\beta}$, $\beta$-smoothness gives
$$g(x_{t+1}) \leq g(x_t) - \frac{1}{\beta}\nabla g(x_t)^T G(x_t) + \frac{1}{2\beta} || G(x_t)||^2$$
Therefore, for any $z$, using convexity of $g$ and $h$,
\begin{align} f(x_{t+1}) &\leq g(x_t) -\frac{1}{\beta} \nabla g(x_t)^T G(x_t) + \frac{1}{2\beta} ||G(x_t)||^2 + h(x_{t+1}) \\ &\leq g(z) + \nabla g(x_t)^T (x_t-z) -\frac{1}{\beta} \nabla g(x_t)^T G(x_t) + \frac{1}{2\beta} ||G(x_t)||^2 + h(z) +\nabla h(x_{t+1})^T (x_{t+1}-z) \\ &= f(z) + \nabla g(x_t)^T (x_t-z) -\frac{1}{\beta} \nabla g(x_t)^T G(x_t) + \frac{1}{2\beta} ||G(x_t)||^2 +\nabla h(x_{t+1})^T (x_t-z) + \nabla h(x_{t+1})^T (x_{t+1}-x_t) \end{align}
Using $G(x_t) = \nabla h(x_{t+1}) +\nabla g(x_t)$, we can derive
$f(x_{t+1}) \leq f(z) + G(x_t)^T (x_t - z)- \frac{1}{2\beta} || G(x_t)||^2 .$
Setting $z=x_t$ gives $f(x_t) - f(x_{t+1}) \geq \frac{1}{2\beta} ||G(x_t)||^2$, which shows sufficient decrease.
Setting $z=x^*$ and using $G(x_t) = \beta (x_t - x_{t+1})$ together with the identity $ 2(x_t-x_{t+1})^T(x_t-x^*) = ||x_t - x_{t+1}||^2+||x_t-x^*||^2- ||x_{t+1}-x^*||^2$ gives $f(x_{t+1}) - f(x^*) \leq \frac{\beta}{2} ( ||x_t-x^*||^2 - ||x_{t+1} - x^* ||^2 )$. Summing this over $t$ and using the monotone decrease of $f$ yields the $O(1/T)$ convergence rate for the last iterate.
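Both consequences are easy to check numerically. Below is a self-contained sketch (the lasso problem data, dimensions, and iteration count are my own choices for illustration) that runs proximal gradient with step size $1/\beta$ and asserts the sufficient-decrease inequality at every step:

```python
import numpy as np

# Sufficient-decrease check for proximal gradient on a small lasso problem:
# f(x) = 0.5*||Ax - b||^2 + lam*||x||_1, with g the smooth quadratic part and h = lam*||.||_1.
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((20, 5)), rng.standard_normal(20), 0.1
beta = np.linalg.norm(A.T @ A, 2)  # smoothness constant of g (largest eigenvalue of A^T A)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

x = np.zeros(5)
for _ in range(50):
    z = x - (1.0 / beta) * (A.T @ (A @ x - b))                     # gradient step on g
    x_next = np.maximum(np.abs(z) - lam / beta, 0.0) * np.sign(z)  # prox step on h (soft-thresholding)
    G = beta * (x - x_next)                                        # generalized gradient with step 1/beta
    assert f(x) - f(x_next) >= G @ G / (2 * beta) - 1e-10          # sufficient decrease
    x = x_next
```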
Bregman Divergence
Recall that the Bregman divergence of a differentiable convex function $f$ is $D_f(x,y) = f(x) - f(y) - \nabla f(y)^T (x-y)$.
Lemma 1. Generalized Pythagorean identity for the Bregman divergence
$$ D_f(x,y) + D_f(z,x) - D_f (z,y) = ( \nabla f(x) - \nabla f(y) )^T ( x - z ) $$
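A quick numerical check of Lemma 1 (my own sketch; the concrete choice $f(x)=\sum_i x_i \log x_i$ and the random test points are assumptions for illustration):

```python
import numpy as np

# Numerical check of the generalized Pythagorean identity for the Bregman divergence
# of f(x) = sum_i x_i*log(x_i) on the positive orthant.
f      = lambda x: np.sum(x * np.log(x))
grad_f = lambda x: np.log(x) + 1.0
D      = lambda p, q: f(p) - f(q) - grad_f(q) @ (p - q)  # Bregman divergence D_f(p, q)

rng = np.random.default_rng(0)
x, y, z = rng.random(4) + 0.1, rng.random(4) + 0.1, rng.random(4) + 0.1
lhs = D(x, y) + D(z, x) - D(z, y)
rhs = (grad_f(x) - grad_f(y)) @ (x - z)
assert np.isclose(lhs, rhs)
```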
Mirror Descent
The mirror descent algorithm is as follows; a concrete sketch is given after the derivation below.
1. Map $x_t$ to $\nabla \Phi (x_t)$.
2. Take a gradient step in the dual space: compute $\nabla \Phi (x_t) - \eta \nabla f(x_t)$.
3. Find $y_{t+1}$ such that $\nabla \Phi (y_{t+1}) = \nabla \Phi (x_t) - \eta\nabla f(x_t)$.
4. $x_{t+1} = \Pi^{\Phi}_{\mathcal{X}}(y_{t+1}) = arg\min_{x\in\mathcal{X}} D_{\Phi}(x,y_{t+1})$
From the relation below, we can see that when the Bregman divergence is the squared Euclidean distance, mirror descent coincides with projected gradient descent.
\begin{align} x_{t+1} &= arg\min_{x\in\mathcal{X}} D_{\Phi}(x,y_{t+1})\\ &= arg\min_{x\in\mathcal{X}} \Phi(x) - \nabla \Phi(y_{t+1})^Tx\\ &= arg\min_{x\in\mathcal{X}} \Phi (x) - (\nabla \Phi(x_t) - \eta \nabla f(x_t))^Tx \\ &= arg\min_{x\in\mathcal{X}} \eta \nabla f(x_t)^Tx + D_{\Phi}(x,x_t) \end{align}
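A minimal sketch of steps 1-4 (my own, assuming the mirror map $\Phi(x)=\sum_i x_i \log x_i$ and $\mathcal{X}$ the probability simplex; with this choice steps 1-3 become a multiplicative update and the Bregman projection in step 4 is just normalization):

```python
import numpy as np

def entropic_mirror_descent(grad_f, x0, eta, num_iters):
    """Mirror descent on the probability simplex with the negative-entropy
    mirror map Phi(x) = sum_i x_i*log(x_i), so that nabla Phi(x) = 1 + log(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        y = x * np.exp(-eta * grad_f(x))  # steps 1-3: log(y) = log(x) - eta * grad_f(x)
        x = y / y.sum()                   # step 4: Bregman projection onto the simplex
    return x

# Usage: minimize f(x) = <c, x> over the simplex; the mass concentrates on argmin(c).
c = np.array([0.3, 0.1, 0.6])
print(entropic_mirror_descent(lambda x: c, np.ones(3) / 3, eta=0.5, num_iters=200))
```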
Theorem 4.2 (from Bubeck). Convergence rate of mirror descent.
Let $\Phi$ be a mirror map that is $\rho$-strongly convex on $\mathcal{X}\cap \mathcal{D}$ w.r.t. $||\cdot||$.
Let $R^2 = \sup_{x\in\mathcal{X}\cap \mathcal{D}} \Phi(x) - \Phi(x_1)$, and let $f$ be convex and $L$-Lipschitz w.r.t. $||\cdot||$.
Then mirror descent with $\eta = \frac{R}{L} \sqrt{\frac{2\rho}{t}}$ satisfies
$$ f(\bar{x}_t) - f(x^*) \leq RL \sqrt{\frac{2}{\rho t}}$$
Proof sketch. For $g_s \in \partial f(x_s)$,
$$\begin{align}f(x_s) - f(x) & \leq g^T_s ( x_s-x)\\& = \frac{1}{\eta}( \nabla \Phi (x_s) - \nabla \Phi (y_{s+1}))^T (x_s-x)\\&= \frac{1}{\eta} ( D_{\Phi}(x,x_s) + D_{\Phi}(x_s,y_{s+1}) - D_{\Phi} (x,y_{s+1}) ) \\&\leq \frac{1}{\eta} ( D_{\Phi}(x,x_s) + D_{\Phi}(x_s,y_{s+1}) - D_{\Phi} (x,x_{s+1})- D_{\Phi} (x_{s+1},y_{s+1}) ) \end{align}$$
Here the second line uses $\nabla \Phi (y_{s+1}) = \nabla \Phi (x_s) - \eta g_s$, the third line is Lemma 1, and the last line uses the projection inequality proved next.
We now prove the projection inequality $D_{\Phi} (x^*,x_{s+1}) + D_{\Phi}(x_{s+1},y_{s+1}) \leq D_{\Phi}(x^*,y_{s+1})$:
\begin{align} D_{\Phi} (x^*,x_{s+1}) + D_{\Phi}(x_{s+1},y_{s+1}) &= \Phi(x^*)-\Phi(y_{s+1}) - \nabla \Phi(y_{s+1})^T (x^* - y_{s+1}) - ( \nabla \Phi(x_{s+1}) - \nabla \Phi(y_{s+1}))^T (x^*-x_{s+1})\\ &= D_{\Phi}(x^*,y_{s+1}) - \nabla D_{\Phi} (x_{s+1},y_{s+1})^T (x^*-x_{s+1}) \leq D_{\Phi} (x^*,y_{s+1}) \end{align}
The last inequality uses the first-order optimality condition of $x_{s+1} = arg\min_{x\in\mathcal{X}} D_{\Phi}(x,y_{s+1})$, namely $\nabla D_{\Phi} (x_{s+1},y_{s+1})^T (x^*-x_{s+1}) \geq 0$.
Finally, we bound the remaining term $D_{\Phi}(x_s,y_{s+1}) - D_{\Phi}(x_{s+1},y_{s+1})$ using the strong convexity of $\Phi$:
\begin{align} D_{\Phi}(x_s,y_{s+1}) - D_{\Phi}(x_{s+1},y_{s+1}) &= \Phi(x_s) - \Phi(x_{s+1}) - \nabla \Phi (y_{s+1})^T (x_s - x_{s+1}) \\ &\leq (\nabla \Phi (x_s) - \nabla \Phi (y_{s+1}))^T (x_s - x_{s+1}) - \frac{\rho}{2} ||x_s - x_{s+1}||^2\\ &= \eta g^T_s (x_s - x_{s+1}) - \frac{\rho}{2} ||x_s - x_{s+1}||^2 \\ &\leq \eta L ||x_s - x_{s+1}|| - \frac{\rho}{2} ||x_s - x_{s+1}||^2 \leq \frac{(\eta L)^2}{2\rho}\end{align}
Summing the resulting per-step bound over $s$, telescoping the $D_{\Phi}(x,x_s)$ terms, bounding $D_{\Phi}(x^*,x_1)$ by $R^2$, and choosing $\eta$ as in the theorem gives the stated rate for the averaged iterate $\bar{x}_t$.
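As a sanity check of the theorem (my own sketch; the linear objective, dimension, and horizon are assumptions): for entropic mirror descent on the simplex, $\Phi$ is $\rho=1$ strongly convex w.r.t. $||\cdot||_1$, $R^2 = \log n$ for the uniform starting point, and $f(x)=c^Tx$ is $L$-Lipschitz with $L = ||c||_\infty$, so the averaged iterate should satisfy the stated bound:

```python
import numpy as np

# Check the Theorem 4.2 bound for entropic mirror descent on the probability simplex.
rng = np.random.default_rng(0)
n, T = 10, 500
c = rng.random(n)                                    # f(x) = <c, x>, so grad f = c everywhere
rho, L, R = 1.0, np.max(np.abs(c)), np.sqrt(np.log(n))
eta = (R / L) * np.sqrt(2 * rho / T)                 # step size from the theorem

x, avg = np.ones(n) / n, np.zeros(n)
for _ in range(T):
    avg += x / T                                     # running average of the iterates x_1, ..., x_T
    y = x * np.exp(-eta * c)                         # steps 1-3
    x = y / y.sum()                                  # step 4: Bregman projection = normalization

gap = c @ avg - np.min(c)                            # f(x_bar) - f(x*)
assert gap <= R * L * np.sqrt(2 / (rho * T)) + 1e-12
```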