This tutorial series is based on the Flow Matching Guide and Code
arXiv: 2412.06264
Thank you, META
Flow Models
Before flow matching, flow models were hard to train: training entails solving ODEs, which slows down the training process and can introduce numerical errors. Still, it’s important to understand what came before flow matching.
Diffeomorphisms and push-forward maps
Let’s denote the collection of functions $f: \mathbb{R}^m \to \mathbb{R}^n$ with continuous partial derivatives of order $r$ as $C^r(\mathbb{R}^m, \mathbb{R}^n)$, where the $r$’th order partial derivatives are
$$\frac{\partial^r f_k}{\partial x_{i_1} \cdots \partial x_{i_r}}, \qquad k \in \{1, \dots, n\},\ i_j \in \{1, \dots, m\}$$
We simply denote $C^r(\mathbb{R}^n) := C^r(\mathbb{R}^n, \mathbb{R})$; e.g., $C^1(\mathbb{R}^m)$ denotes continuously differentiable scalar functions (maps from $\mathbb{R}^m$ to $\mathbb{R}$).
Definition. If a function $\psi \in C^r(\mathbb{R}^n, \mathbb{R}^n)$ is invertible with $\psi^{-1} \in C^r(\mathbb{R}^n, \mathbb{R}^n)$, we say $\psi$ is a $C^r$ diffeomorphism.
So a diffeomorphism poses two strong conditions: (1) the function has continuous $r$’th order derivatives and maps $\mathbb{R}^n$ to $\mathbb{R}^n$; (2) the function is invertible, and the inverse function also has continuous $r$’th order derivatives.
Now let’s consider $Y = \psi(X)$, where $X \sim p_X$ is a random vector and $\psi: \mathbb{R}^d \to \mathbb{R}^d$ is a $C^1$ diffeomorphism. We can calculate the PDF of $Y$, which is $p_Y$, by the change of variables formula
$$p_Y(y) = p_X(\psi^{-1}(y))\left|\det \partial_y \psi^{-1}(y)\right|$$
where $\partial_y \psi^{-1}(y)$ denotes the Jacobian matrix of $\psi^{-1}$. We further denote the push-forward operator with the symbol $\sharp$:
$$[\psi_\sharp p_X](y) := p_X(\psi^{-1}(y))\left|\det \partial_y \psi^{-1}(y)\right|$$
The reason we call it push-forward is that, in the context of flow models, $\psi$ describes the position after some period of time, so this function pushes the distribution of $X$ forward (in the time dimension) to the distribution of $Y$, the distribution after that period of time.
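As a sanity check, here is a minimal numerical sketch (our own toy example, not from the guide): we push a standard Gaussian through the affine diffeomorphism $\psi(x) = 2x + 1$ and compare the push-forward density against the known answer $\mathcal{N}(1, 4)$.

```python
import numpy as np

# Toy example: verify the change-of-variables formula for the C^1
# diffeomorphism psi(x) = 2x + 1 applied to X ~ N(0, 1).
# Then Y = psi(X) ~ N(1, 4), which we can check analytically.

def normal_pdf(x, mean=0.0, std=1.0):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def psi(x):          # the diffeomorphism
    return 2.0 * x + 1.0

def psi_inv(y):      # its inverse
    return (y - 1.0) / 2.0

def push_forward_pdf(y):
    # [psi_# p_X](y) = p_X(psi^{-1}(y)) * |det d/dy psi^{-1}(y)|
    jac = 0.5        # d/dy psi_inv(y) = 1/2, constant for an affine map
    return normal_pdf(psi_inv(y)) * abs(jac)

print(push_forward_pdf(1.7), normal_pdf(1.7, mean=1.0, std=2.0))
```

The two printed values agree: the push-forward formula reproduces the analytic density of $\mathcal{N}(1, 4)$.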
Generative Flow Models
Definition. A $C^r$ flow is a time-dependent mapping $\psi: [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ implementing $\psi: (t, x) \mapsto \psi_t(x)$.
A flow is a $C^r([0,1] \times \mathbb{R}^d, \mathbb{R}^d)$ function such that $\psi_t(x)$ is a $C^r$ diffeomorphism in $x$ for all $t \in [0,1]$.
Definition. A flow model is a continuous-time Markov process $(X_t)_{0 \le t \le 1}$ defined by applying a flow $\psi_t$ to the random vector $X_0$:
$$X_t = \psi_t(X_0)$$
We can prove that $X_t$ is Markov: let $0 \le t < s \le 1$; then
$$X_s = \psi_s(X_0) = \psi_s(\psi_t^{-1}(\psi_t(X_0))) = \psi_{s|t}(X_t)$$
where $\psi_{s|t} := \psi_s \circ \psi_t^{-1}$, so the state at time $s > t$ depends only on the state at time $t$.
In summary, the goal of generative flow modeling is to find a flow $\psi_t$ such that
$$X_1 = \psi_1(X_0) \sim q$$
where $q$ is the target distribution.
Probability Path and the Continuity Equation
Definition. A time-dependent probability density $(p_t)_{0 \le t \le 1}$ is called a probability path.
We can say $X_t \sim p_t$, and $p_t$ can be obtained with the push-forward operator
$$p_t(x) = [\psi_{t\sharp} p](x)$$
which pushes the initial distribution $p$ of $X_0$ to the distribution at time $t$. In practice, a flow is induced by a velocity field $u_t: \mathbb{R}^d \to \mathbb{R}^d$ through the ODE
$$\frac{d}{dt}\psi_t(x) = u_t(\psi_t(x)), \qquad \psi_0(x) = x$$
We say that $u_t$ generates $p_t$ if $X_t = \psi_t(X_0) \sim p_t$ for all $t \in [0,1)$; the right end is open to handle velocity fields that are not defined precisely at $t=1$.
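To make the ODE view concrete, here is a small sketch (the names and the toy field $u_t(x) = -x$ are our own choices, not from the guide): integrating the flow ODE with explicit Euler recovers the exact flow $\psi_t(x) = x e^{-t}$.

```python
import numpy as np

# Sketch: given a velocity field u_t, the flow psi_t is obtained by
# integrating the ODE
#   d/dt psi_t(x) = u_t(psi_t(x)),  psi_0(x) = x.
# Here u_t(x) = -x, so the exact flow is psi_t(x) = x * exp(-t).

def u(t, x):
    # toy time-independent velocity field
    return -x

def integrate_flow(x0, n_steps=1000):
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * u(i * dt, x)   # explicit Euler step
    return x                        # approximates psi_1(x0)

x0 = 2.0
print(integrate_flow(x0), x0 * np.exp(-1.0))
```

The Euler approximation converges to the exact value as `n_steps` grows; in real models this integration is what a numerical ODE solver performs.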
If $u_t$ really generates $p_t$, they satisfy the following partial differential equation, known as the Continuity Equation:
$$\frac{d}{dt} p_t(x) + \mathrm{div}(p_t u_t)(x) = 0 \tag{1}$$
where we define the divergence as
$$\mathrm{div}(v)(x) = \sum_{i=1}^d \partial_{x_i} v_i(x)$$
for $v(x) = (v_1(x), \dots, v_d(x))$; this is just the conventional divergence from vector field theory. With the continuity equation, we have a theorem named mass conservation:
Theorem. Let $p_t$ be a probability path and $u_t$ a locally Lipschitz integrable vector field. Then the following two statements are equivalent:
1. The continuity equation (1) holds for $t \in [0,1)$.
2. $u_t$ generates $p_t$.
The continuity equation actually has a straightforward physical meaning; recall Gauss’s law, i.e., the Divergence Theorem:
$$\int_D \mathrm{div}(u)(x)\,dx = \int_{\partial D} \langle u(y), n(y)\rangle\, ds_y$$
which states that the integral of the divergence over a volume $D$ equals the flux leaving $D$ by orthogonally crossing its boundary $\partial D$.
So the continuity equation is saying that the rate of change of the total probability mass in the volume $D$ equals the negative of the probability flux leaving the domain, where the probability flux is defined as $j_t(y) = p_t(y) u_t(y)$.
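Here is a quick numerical check of Equation (1) on a toy path of our own choosing: the 1-d Gaussian path $p_t = \mathcal{N}(t, 1)$ is generated by the constant velocity field $u_t(x) = 1$, and the continuity-equation residual vanishes up to finite-difference error.

```python
import numpy as np

# Check the continuity equation  d/dt p_t(x) + div(p_t u_t)(x) = 0
# for the 1-d Gaussian path p_t = N(t, 1), which is generated by the
# constant velocity field u_t(x) = 1 (the whole density translates right).

def p(t, x):
    return np.exp(-0.5 * (x - t) ** 2) / np.sqrt(2 * np.pi)

def residual(t, x, h=1e-5):
    dpdt = (p(t + h, x) - p(t - h, x)) / (2 * h)      # time derivative
    flux = lambda xx: p(t, xx) * 1.0                  # j_t = p_t * u_t, u_t = 1
    div_flux = (flux(x + h) - flux(x - h)) / (2 * h)  # spatial divergence
    return dpdt + div_flux

print(residual(0.3, 0.7))  # ~0: the continuity equation holds
```

For this path $\partial_t p_t(x) = (x - t) p_t(x)$ and $\partial_x p_t(x) = -(x - t) p_t(x)$, so the two terms cancel exactly; the code confirms this numerically.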
Computing the Target p1
Let’s consider the log-likelihood $\log p_t(\psi_t(x))$. Differentiating in time and using the continuity equation, we have
$$\frac{d}{dt}\log p_t(\psi_t(x)) = -\mathrm{div}(u_t)(\psi_t(x))$$
and integrating from $t=0$ to $t=1$ gives
$$\log p_1(\psi_1(x)) = \log p_0(x) - \int_0^1 \mathrm{div}(u_t)(\psi_t(x))\,dt \tag{5}$$
This gives us access to the final distribution at $t=1$. The divergence term is hard to compute directly, since it requires the trace of the full Jacobian; instead we can employ unbiased estimators such as Hutchinson’s trace estimator
$$\mathrm{div}(u_t)(x) = \mathrm{tr}\left[\partial_x u_t(x)\right] = \mathbb{E}_Z\left[Z^T \partial_x u_t(x) Z\right]$$
where $Z \in \mathbb{R}^d$ is any random vector with $\mathbb{E}[Z] = 0$ and $\mathrm{Cov}(Z, Z) = I$.
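A short sketch of Hutchinson’s estimator in isolation (the matrix `A` is a hypothetical stand-in for the Jacobian $\partial_x u_t(x)$, not a real model's): Rademacher probes, a common choice, satisfy the mean-zero, identity-covariance conditions.

```python
import numpy as np

# Hutchinson's estimator: tr(A) = E[Z^T A Z] for a random vector Z with
# E[Z] = 0 and Cov(Z, Z) = I. Rademacher probes (entries +-1) are a
# common low-variance choice.

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))   # stand-in for the Jacobian of u_t at some x

n_samples = 200_000
Z = rng.choice([-1.0, 1.0], size=(n_samples, d))       # Rademacher probes
estimates = np.einsum('ni,ij,nj->n', Z, A, Z)          # Z^T A Z per sample

print(estimates.mean(), np.trace(A))                   # close to each other
```

In a model, one never forms $\partial_x u_t(x)$ explicitly: $Z^T \partial_x u_t(x) Z$ is computed with a single vector-Jacobian product per probe via automatic differentiation.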
To estimate $\log p_1(x)$ at a point $x$, define $f(t)$ and $g(t)$ by the initial conditions
$$f(1) = x, \qquad g(1) = 0 \tag{8}$$
and the ODE system
$$\frac{d}{dt}\begin{bmatrix} f(t) \\ g(t) \end{bmatrix} = \begin{bmatrix} u_t(f(t)) \\ -\mathrm{div}(u_t)(f(t)) \end{bmatrix} \tag{9}$$
We solve this system backwards in time, from $t=1$ to $t=0$, to obtain $f(0)$ and $g(0)$; then Equation (5) can be estimated by
$$\log p_1(x) = \log p_0(f(0)) - g(0) \tag{10}$$
Here, the initial conditions in Equation (8) say that $g$ is $0$ at $t=1$ and that $f(1) = x$ is the point, e.g., a sample from the target distribution, whose likelihood we want.
Training Flow Models
To train a flow model, we can still parameterize the velocity field as $u_t^\theta$, and the model should learn $\theta$ such that
$$p_1^\theta \approx q$$
which means the distribution at t=1 is the target distribution.
We can achieve this simply by minimizing the KL divergence
$$\mathcal{L}(\theta) = D_{\mathrm{KL}}(q \,\|\, p_1^\theta) = -\mathbb{E}_{Y \sim q}\log p_1^\theta(Y) + C$$
where $C$ is a constant independent of $\theta$.
To obtain $\log p_1^\theta(Y)$, we can use (9) and (10) by setting $u_t = u_t^\theta$, our neural network, and $x = Y$, a sample from the target distribution.
As you can see, computing the loss requires simulating Equation (9); this is inefficient, and solving the ODE numerically introduces errors that bias the gradients.
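To see this inefficiency concretely, here is a minimal simulation-based training sketch. Everything in it is our own toy choice, not the guide's method: a 1-d linear field $u_t^\theta(x) = \theta x$, synthetic Gaussian data, and a finite-difference gradient standing in for backprop. The point is structural: every single loss evaluation re-simulates the backward ODE.

```python
import numpy as np

# Toy simulation-based training: with u_t^theta(x) = theta * x and
# p_0 = N(0, 1), the model distribution is p_1^theta = N(0, e^{2*theta}).
# We minimize the NLL  -E_{Y~q} log p_1^theta(Y)  where q = N(0, 4),
# so the optimal parameter is theta = ln 2. The likelihood is computed by
# simulating the backward ODE (9); the gradient is a finite difference.

rng = np.random.default_rng(0)
data = rng.normal(0.0, 2.0, size=512)          # samples from q = N(0, 4)

def log_p1(x, theta, n_steps=200):
    dt = 1.0 / n_steps
    f, g = x, 0.0                              # conditions (8) at t = 1
    for _ in range(n_steps):                   # backward in time, t = 1 -> 0
        f = f - dt * theta * f                 # f' = u = theta * f
        g = g - dt * (-theta)                  # g' = -div(u) = -theta
    return -0.5 * f ** 2 - 0.5 * np.log(2 * np.pi) - g   # (10), log p_0 Gaussian

def nll(theta):
    return -np.mean(log_p1(data, theta))       # full ODE solve per evaluation!

theta, eps, lr = 0.0, 1e-4, 0.1
for _ in range(200):                           # gradient descent on the NLL
    grad = (nll(theta + eps) - nll(theta - eps)) / (2 * eps)
    theta -= lr * grad
print(theta, np.log(2.0))                      # theta approaches ln 2
```

The fitted $\theta$ lands near $\ln 2 \approx 0.693$ (up to sampling and discretization error), but note that each of the 400 loss evaluations above runs a 200-step ODE solve. This per-step simulation cost and bias is exactly what motivates flow matching.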