Fokker-Planck Equation

The probability density function $\rho_t$ of $X_t$ (the Langevin diffusion $dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t$, with stationary measure $\nu \propto e^{-f}$) evolves according to the Fokker-Planck equation: $$ \frac{\partial \rho_t}{\partial t} = \nabla \cdot \left(\rho_t \nabla \log \frac{\rho_t}{\nu}\right) = \nabla \cdot (\rho_t \nabla f) + \Delta \rho_t. $$
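This can be checked numerically: simulate the Langevin SDE $dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t$ by Euler-Maruyama and the empirical distribution of the particles should settle near $\nu \propto e^{-f}$. A minimal sketch, assuming the illustrative 1D potential $f(x) = x^2/2$ (so $\nu$ is the standard Gaussian):

```python
import math
import random

random.seed(0)

def grad_f(x):
    # gradient of the assumed potential f(x) = x^2 / 2
    return x

# Euler-Maruyama for dX = -grad_f(X) dt + sqrt(2) dB
eps, n_steps, n_particles = 0.01, 500, 20000
xs = [0.0] * n_particles
for _ in range(n_steps):
    xs = [x - eps * grad_f(x) + math.sqrt(2 * eps) * random.gauss(0, 1)
          for x in xs]

mean = sum(xs) / n_particles
var = sum((x - mean) ** 2 for x in xs) / n_particles
# stationary law nu ∝ e^{-x^2/2} has mean 0 and variance 1
print(mean, var)
```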

The Fokker-Planck equation is the gradient flow of the relative entropy $D_{\mathrm{KL}}(\cdot | \nu)$ on the space of probability distributions equipped with the Wasserstein metric $W_2$.

Logarithmic Sobolev Inequality (LSI)

The Logarithmic Sobolev Inequality (LSI) establishes a connection between the relative entropy and the Fisher information. For a target measure $\pi \propto e^{-V}$, the inequality is given by:

$$ D_{\mathrm{KL}}(\rho | \pi) \leq \frac{1}{2 \lambda} I(\rho | \pi), $$

where the relative entropy is defined as:

$$ D_{\mathrm{KL}}(\rho | \pi) = \int_{\mathbb{R}^d} \rho(x) \log \frac{\rho(x)}{\pi(x)} \, dx, $$

and the relative Fisher information is:

$$ I(\rho | \pi) = \int_{\mathbb{R}^d} \rho(x) \left| \nabla \log \frac{\rho(x)}{\pi(x)} \right|^2 dx. $$

Here, $\lambda > 0$ is the logarithmic Sobolev constant. LSI implies exponential convergence, in KL divergence, of the Wasserstein gradient flow of the Fokker-Planck equation.
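The exponential rate can be made explicit in two lines, using the entropy dissipation identity for the Fokker-Planck flow:

```latex
% Entropy dissipates along the flow at the Fisher information rate
% (de Bruijn identity), and LSI bounds the rate from below:
\frac{d}{dt} D_{\mathrm{KL}}(\rho_t | \pi)
  = -I(\rho_t | \pi)
  \leq -2\lambda \, D_{\mathrm{KL}}(\rho_t | \pi),
% so Gronwall's inequality yields exponential decay:
D_{\mathrm{KL}}(\rho_t | \pi) \leq e^{-2\lambda t} \, D_{\mathrm{KL}}(\rho_0 | \pi).
```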

Poincaré Inequality (PI)

The Poincaré Inequality (PI), which is weaker than LSI, bounds the variance of functions under $\pi$: $$ \operatorname{Var}_\pi(g) \leq C_P \, \mathbb{E}_\pi\left[ |\nabla g|^2 \right]. $$
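A quick Monte Carlo sanity check of PI, assuming the standard Gaussian target (for which the Poincaré constant is $C_P = 1$) and the illustrative test function $g(x) = \sin x$:

```python
import math
import random

random.seed(0)

# Standard Gaussian target pi; its Poincare constant is C_P = 1.
# Test function g(x) = sin(x), so |grad g|^2 = cos(x)^2.
n = 100000
samples = [random.gauss(0, 1) for _ in range(n)]

g_vals = [math.sin(x) for x in samples]
mean_g = sum(g_vals) / n
var_g = sum((v - mean_g) ** 2 for v in g_vals) / n
grad_sq = sum(math.cos(x) ** 2 for x in samples) / n

# Closed forms: Var(sin X) = (1 - e^{-2})/2 ~ 0.432,
#               E[cos^2 X] = (1 + e^{-2})/2 ~ 0.568, so PI holds.
print(var_g, grad_sq)
```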

Talagrand Inequality (TI)

Talagrand’s inequality refines these analyses by connecting the Wasserstein distance $\mathcal{W}_2$ to entropy. By the Otto-Villani theorem, LSI with constant $\lambda$ implies: $$ \mathcal{W}_2^2(\rho, \pi) \leq \frac{2}{\lambda} D_{\mathrm{KL}}(\rho | \pi). $$

This bridges the gap between convergence in KL divergence and geometric convergence in Wasserstein distance. Additionally, along the Fokker-Planck flow it leads to: $$ \frac{d}{dt} \mathcal{W}_2^2\left(\rho_t, \pi\right) \leq -2\lambda \, \mathcal{W}_2^2\left(\rho_t, \pi\right). $$

LSI $\implies$ TI $\implies$ PI (the Otto-Villani chain of implications).

Pinsker’s inequality relates the $L^1$ norm to the relative entropy: $$ \|\rho - \pi\|_{L^1} \leq \sqrt{2 D_{\mathrm{KL}}(\rho | \pi)}. $$
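Pinsker’s inequality is easy to check numerically on discrete distributions; a sketch with two made-up distributions on three points:

```python
import math

# Two example distributions on {0, 1, 2}
rho = [0.5, 0.3, 0.2]
pi = [0.2, 0.5, 0.3]

l1 = sum(abs(r - p) for r, p in zip(rho, pi))        # L1 distance
kl = sum(r * math.log(r / p) for r, p in zip(rho, pi))  # KL divergence

# Pinsker: l1 <= sqrt(2 * kl)
print(l1, math.sqrt(2 * kl))
```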

Latała-Oleszkiewicz Inequality

(interpolating between PI and LSI) Let $a \in [0,1]$ and $r \in [1,2]$ satisfy the relation $r = 2/(2-a)$. Let $\mu(dx) = c_r^n \exp\left(-\left(\left|x_1\right|^r + \left|x_2\right|^r + \ldots + \left|x_n\right|^r\right)\right) dx_1 dx_2 \ldots dx_n$ be a probability measure on the Euclidean space $\left(\mathbb{R}^n, |\cdot|\right)$. Latała and Oleszkiewicz prove that there exists a universal constant $C$ such that for any smooth real function $f$ on $\mathbb{R}^n$ and any $p \in [1,2)$:

$$ E_\mu f^2-\left(E_\mu|f|^p\right)^{2 / p} \leq C(2-p)^a E_\mu|\nabla f|^2 $$

Wasserstein Metric

$P_2(\mathbb{R}^d)$ denotes the probability measures on $\mathbb{R}^d$ with finite second moments. The Wasserstein distance $W_2$ is defined as: \begin{equation} W_2^2(\nu, \mu) = \inf_{\gamma \in \Pi(\nu, \mu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x - y|^2 \, d\gamma(x, y), \end{equation} where $\Pi(\nu, \mu)$ is the set of all couplings of $\nu$ and $\mu$.
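In one dimension the optimal coupling is monotone, so $W_2$ between two equal-size empirical measures reduces to matching sorted samples. A minimal sketch (assuming 1D data and equal sample sizes):

```python
import math

def w2_empirical_1d(xs, ys):
    # In 1D the optimal transport plan pairs the sorted samples
    # (quantile coupling), so W_2^2 is a mean of squared gaps.
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Translating a point cloud by c moves it exactly W_2 = |c|
a = [0.0, 1.0, 2.0, 5.0]
b = [x + 3.0 for x in a]
print(w2_empirical_1d(a, b))  # 3.0
```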

Geodesically Convex

A functional $F : P_2(\mathbb{R}^d) \to \mathbb{R}$ is geodesically convex if: \begin{equation} F(\mu_t) \leq (1 - t)F(\mu_0) + tF(\mu_1), \quad \forall t \in [0, 1], \end{equation} where $(\mu_t)$ is the constant-speed $W_2$ geodesic from $\mu_0$ to $\mu_1$ (displacement interpolation), not the linear mixture $(1-t)\mu_0 + t\mu_1$.

Itô’s Lemma

For an SDE of the form: \begin{equation} dX_t = b(X_t)dt + \sigma(X_t)dB_t, \end{equation} where $b : \mathbb{R}^d \to \mathbb{R}^d$ is the drift term and $\sigma : \mathbb{R}^d \to \mathbb{R}^{d \times m}$ is the diffusion term, Itô’s Lemma states that for twice continuously differentiable $f : \mathbb{R}^d \to \mathbb{R}$: \begin{equation} df(X_t) = \left( \nabla f \cdot b + \frac{1}{2}\text{Tr}(\sigma^\top \nabla^2 f \sigma) \right) dt + \nabla f \cdot \sigma \, dB_t. \end{equation}
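For the special case $b = 0$, $\sigma = 1$, $f(x) = x^2$, Itô’s lemma gives $d(B_t^2) = dt + 2 B_t \, dB_t$, i.e. $B_T^2 = 2\int_0^T B_s \, dB_s + [B]_T$ with quadratic variation $[B]_T \to T$. A discretized path check of this identity:

```python
import math
import random

random.seed(0)

# Path check of Ito's lemma for f(x) = x^2 with X = B (b = 0, sigma = 1):
# B_T^2 = 2 * int B dB + [B]_T, and the quadratic variation [B]_T -> T.
T, steps = 1.0, 200000
dt = T / steps
b = 0.0
ito_integral = 0.0   # left-endpoint (Ito) sum of 2 B dB
quad_var = 0.0       # sum of (dB)^2, which should approach T
for _ in range(steps):
    db = random.gauss(0, math.sqrt(dt))
    ito_integral += 2 * b * db
    quad_var += db * db
    b += db

print(b * b - (ito_integral + quad_var))  # discrete identity, ~0
print(quad_var)                           # ~= T
```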

Girsanov’s theorem

Girsanov’s theorem provides a change of measure for probability laws of SDEs. Let $X_t$ and $\tilde{X}_t$ satisfy: $$ \begin{aligned} dX_t &= b\left(X_t\right) dt + \sigma \, dB_t, \\ d\tilde{X}_t &= \tilde{b}\left(\tilde{X}_t\right) dt + \sigma \, dB_t. \end{aligned} $$

Then, assuming $\sigma$ is constant and invertible, the Radon-Nikodym derivative of the measure $P$ induced by $X_t$ with respect to the measure $\tilde{P}$ induced by $\tilde{X}_t$ is: $$ \frac{dP}{d\tilde{P}} = \exp\left( \int_0^T \left( \sigma^{-1}\left( b\left(X_s\right) - \tilde{b}\left(X_s\right) \right) \right)^{\top} dB_s - \frac{1}{2} \int_0^T \left| \sigma^{-1}\left( b\left(X_s\right) - \tilde{b}\left(X_s\right) \right) \right|^2 ds \right). $$
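For constant drifts the stochastic integral collapses: with $\tilde{b} = 0$, $b = c$, $\sigma = 1$, the weight is $\exp(c B_T - c^2 T/2)$, which must have mean one under the reference measure. A Monte Carlo sketch of this consistency check (all parameter values are illustrative):

```python
import math
import random

random.seed(0)

# Girsanov weight for constant drift c versus zero drift, sigma = 1:
# dP/dP~ = exp(c * B_T - c^2 * T / 2), a mean-one random variable
# under the reference (driftless) law, where B_T ~ N(0, T).
c, T, n = 0.5, 1.0, 100000
weights = [math.exp(c * random.gauss(0, math.sqrt(T)) - 0.5 * c * c * T)
           for _ in range(n)]
mean_weight = sum(weights) / n
print(mean_weight)  # ~= 1.0
```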

Nesterov’s Method

Nesterov’s accelerated gradient method is described by the second-order ODE: $$ \ddot{x} + \gamma \dot{x} + \nabla f(x) = 0, $$ where $\gamma$ is a damping coefficient (in the Su-Boyd-Candès continuous-time limit of Nesterov’s method, $\gamma = 3/t$).
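The discrete counterpart is a gradient step plus momentum extrapolation; a minimal sketch on the quadratic $f(x) = x^2/2$ (step size and iteration count are illustrative choices):

```python
# Nesterov's accelerated gradient on f(x) = x^2 / 2, so grad f(x) = x
def grad_f(x):
    return x

s = 0.1              # step size (illustrative)
x_prev, x = 1.0, 1.0
for k in range(1, 301):
    y = x + (k - 1) / (k + 2) * (x - x_prev)  # momentum extrapolation
    x_prev, x = x, y - s * grad_f(y)          # gradient step at y

print(0.5 * x * x)  # f(x) should be near 0
```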

Optimization on Riemannian Manifolds

We aim to minimize $f(x)$ on a smooth Riemannian manifold $M$: $$ x^* = \arg\min_{x \in M} f(x). $$ Gradient flow: $\dot{x} = -\nabla_x f$. Exponential convergence holds under strong convexity, $\operatorname{Hess} f \succeq \alpha I$, or under gradient domination: $\left|\nabla_x f\right|^2 \geq 2\alpha (f(x) - \min f)$.

Exponential Convergence (Gradient Flow)

Gradient Flow Dynamics: $$ \frac{d}{dt} f(x) = -\left|\nabla_x f\right|^2. $$ Under gradient domination: $$ \frac{d}{dt}(f(x) - \min f) \leq -2\alpha (f(x) - \min f). $$ Exponential convergence: $$ f\left(x_t\right) - \min f \leq e^{-2\alpha t}\left(f\left(x_0\right) - \min f\right). $$
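Discretizing the flow by gradient descent preserves the exponential rate; a sketch on $f(x) = x^2/2$, which satisfies gradient domination with $\alpha = 1$, so $f(x_t) \leq e^{-2t} f(x_0)$ (step size is an illustrative choice):

```python
import math

# Gradient descent as a discretization of the gradient flow on f(x) = x^2/2.
# Here f satisfies gradient domination with alpha = 1 and min f = 0.
eps, steps = 0.01, 500   # total time t = eps * steps = 5
x = 1.0
for _ in range(steps):
    x -= eps * x         # x_{k+1} = x_k - eps * grad f(x_k)

t = eps * steps
f_x = 0.5 * x * x
bound = math.exp(-2 * t) * 0.5  # e^{-2 alpha t} * f(x_0)
print(f_x, bound)
```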

Wasserstein Metric

Definition: For probability measures $\rho, \nu$,

$$ W_2^2(\rho, \nu) = \inf_{X \sim \rho, \, Y \sim \nu} \mathbb{E}\left[|X - Y|^2\right] $$

Here the Wasserstein gradient of a functional $F(\rho)$ is $\nabla_{W_2} F = -\nabla \cdot \left(\rho \nabla \frac{\delta F}{\delta \rho}\right)$, where $\frac{\delta F}{\delta \rho}$ is the first variation of $F$, and the geodesic is the displacement interpolation along the optimal transport map.

Expected Value Functional

$$ F(\rho) = \mathbb{E}_\rho[f] = \int_{\mathbb{R}^n} \rho(x) f(x) \, dx $$

Gradient domination:

$$ \mathbb{E}_\rho\left[|\nabla f|^2\right] \geq 2\alpha \, \mathbb{E}_\rho[f - \min f] $$

Gradient flow:

$$ \frac{\partial \rho}{\partial t}=\nabla \cdot(\rho \nabla f) $$
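This continuity equation is pure transport: it is solved by pushing each particle along $\dot{x} = -\nabla f(x)$, so $\mathbb{E}_\rho[f]$ decays as in the finite-dimensional case. A particle sketch on $f(x) = x^2/2$ (particle count and step size are illustrative):

```python
import random

random.seed(0)

# Particle discretization of d(rho)/dt = div(rho * grad f) for f(x) = x^2/2:
# each particle follows the deterministic flow x' = -grad f(x) = -x.
eps, steps, n = 0.01, 200, 1000
xs = [random.gauss(0, 1) for _ in range(n)]

f0 = sum(0.5 * x * x for x in xs) / n
for _ in range(steps):
    xs = [x - eps * x for x in xs]
fT = sum(0.5 * x * x for x in xs) / n

print(f0, fT)  # E_rho[f] decreases along the flow
```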

Forward-Backward Algorithm

(from Wibisono, 2018)

  1. Forward step: gradient descent for $f$. $$ x_{k+\frac{1}{2}} = x_k - \epsilon \nabla f(x_k). $$
  2. Backward step: proximal step for $g$. $$ x_{k+1} = \arg\min_x \left\{ g(x) + \frac{1}{2\epsilon} \left| x - x_{k+\frac{1}{2}} \right|^2 \right\}. $$
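When $g(x) = |x|$ the backward (proximal) step has the closed form of soft-thresholding. A 1D sketch of the two steps, with the illustrative choice $f(x) = \frac{1}{2}(x-2)^2$, for which the minimizer of $f + g$ is $x^* = 1$ (stationarity: $x - 2 + \operatorname{sign}(x) = 0$):

```python
def soft_threshold(z, t):
    # prox of t * |.|: argmin_x { |x| + (1 / (2t)) * (x - z)^2 }
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

eps = 0.1
x = 0.0
for _ in range(200):
    x_half = x - eps * (x - 2.0)       # forward: gradient step on f
    x = soft_threshold(x_half, eps)    # backward: proximal step on g = |.|

print(x)  # ~= 1.0
```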