02-Flow model, Everything Before Flow Matching

Last updated: June 9, 2025

This tutorial series is based on Flow Matching Guide and Code

  • arXiv: 2412.06264
  • Thank you, META

Flow Models

Before flow matching, flow models were hard to train: training entails solving ODEs, which slows the process down and can introduce numerical error. Still, it's important to understand what came before flow matching.

Diffeomorphisms and push-forward maps

Let's denote the collection of functions $f:\mathbb{R}^m\rightarrow \mathbb{R}^n$ with continuous partial derivatives of order $r$ as $C^r(\mathbb{R}^m, \mathbb{R}^n)$, where the $r$-th order partial derivative is defined as

$$\frac{\partial^r f^k}{\partial x^{i_1}\cdots \partial x^{i_r}}\quad k\in [1,2,\dots,n],\; i_j\in [1,2,\dots,m]$$

We simply denote $C^r(\mathbb{R}^n):=C^r(\mathbb{R}^n,\mathbb{R})$; e.g., $C^1(\mathbb{R}^m)$ denotes continuously differentiable scalar functions (maps from $\mathbb{R}^m$ to $\mathbb{R}$).

Definition. If a function $\psi\in C^r(\mathbb{R}^n,\mathbb{R}^n)$ has an inverse $\psi^{-1}\in C^r(\mathbb{R}^n,\mathbb{R}^n)$, we say $\psi$ is a $C^r$ diffeomorphism.

So a diffeomorphism poses two strong conditions: (1) the function has continuous $r$-th order derivatives and maps $\mathbb{R}^n$ to $\mathbb{R}^n$; (2) the function is invertible, and the inverse also has continuous $r$-th order derivatives.

Now let's consider $Y = \psi(X)$, where $X\sim p_X$ is a random vector and $\psi: \mathbb{R}^d\rightarrow \mathbb{R}^d$ is a $C^1$ diffeomorphism. We can compute the PDF of $Y$, denoted $p_Y$, by the change of variables formula

$$p_Y(y) = p_X(\psi^{-1}(y))\,|\det \partial_y\psi^{-1}(y)|$$

where $\partial_y\psi^{-1}(y)$ denotes the Jacobian matrix. We further denote the push-forward operator with the symbol $\sharp$

$$[\psi_{\sharp}p_X](y) := p_X(\psi^{-1}(y))\,|\det \partial_y\psi^{-1}(y)|$$

The reason we call it push-forward is that, in the context of flow models, $\psi$ describes the position after some period of time, so this operator pushes the distribution forward (in time) from $X$ to $Y$; $Y$ follows the distribution after that period of time.
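As a quick sanity check (my own sketch, not from the guide), the change of variables formula can be verified numerically for a hypothetical affine diffeomorphism $\psi(x)=2x+1$ applied to a standard normal source:

```python
import numpy as np

def push_forward_density(p_X, psi_inv, jac_det_inv, y):
    """Change of variables: [psi_# p_X](y) = p_X(psi^{-1}(y)) |det d psi^{-1}(y)|."""
    return p_X(psi_inv(y)) * np.abs(jac_det_inv(y))

# Standard normal source density
p_X = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Affine diffeomorphism psi(x) = 2x + 1, so psi^{-1}(y) = (y - 1) / 2
psi_inv = lambda y: (y - 1.0) / 2.0
jac_det_inv = lambda y: 0.5  # d psi^{-1} / dy is constant here

y = np.linspace(-4.0, 6.0, 101)
p_Y = push_forward_density(p_X, psi_inv, jac_det_inv, y)

# Y = 2X + 1 ~ N(1, 4); compare against the closed-form density
p_Y_exact = np.exp(-(y - 1.0)**2 / 8.0) / np.sqrt(8.0 * np.pi)
assert np.allclose(p_Y, p_Y_exact)
```

The same three ingredients (inverse map, Jacobian determinant, source density) appear in every flow-based likelihood computation below.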

Generative Flow Models

Definition. A $C^r$ flow is a time-dependent mapping $\psi:[0,1]\times \mathbb{R}^d\rightarrow \mathbb{R}^d$ implementing $\psi:(t,x)\mapsto \psi_t(x)$.

The above flow is a $C^r([0,1]\times \mathbb{R}^d,\mathbb{R}^d)$ function such that $\psi_t(x)$ is a $C^r$ diffeomorphism in $x$ for all $t\in[0,1]$.

Definition. A flow model is a continuous-time Markov process $(X_t)_{0\leq t\leq 1}$ defined by applying a flow $\psi_t$ to the random vector $X_0$:

$$X_t = \psi_t(X_0)$$

We can prove that $X_t$ is Markov: let $0\leq t<s\leq 1$, then

$$X_s = \psi_s(X_0)=\psi_s(\psi_t^{-1}(\psi_t(X_0))) = \psi_{s|t}(X_t)$$

where $\psi_{s|t}:=\psi_s\circ \psi_t^{-1}$, so the state at time $s$, which is later than $t$, depends only on the state at time $t$.
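The composition $\psi_{s|t}=\psi_s\circ\psi_t^{-1}$ is easy to check numerically; here is a minimal sketch using a hypothetical linear flow $\psi_t(x)=e^t x$ (my own example, not from the text):

```python
import numpy as np

# Hypothetical linear flow psi_t(x) = e^t x and its inverse
psi = lambda t, x: np.exp(t) * x
psi_inv = lambda t, x: np.exp(-t) * x

x0, t, s = 1.7, 0.3, 0.8
x_t = psi(t, x0)                         # state at time t
x_s_direct = psi(s, x0)                  # flow straight from 0 to s
x_s_two_step = psi(s, psi_inv(t, x_t))   # psi_{s|t}(X_t) = psi_s(psi_t^{-1}(X_t))
assert np.isclose(x_s_direct, x_s_two_step)
```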

In summary, the goal of generative flow modeling is to find a flow $\psi_t$ such that

$$X_1 = \psi_1(X_0)\sim q$$

where $q$ is the target data distribution.

Probability Path and the Continuity Equation

Definition. A time-dependent probability density $(p_t)_{0\leq t\leq 1}$ is called a probability path.

We can say $X_t\sim p_t$, and $p_t$ can be obtained via the push-forward operator

$$p_t(x) = [\psi_{t\sharp}p_0](x)$$

which pushes the distribution from its initial form $p_0$ to the distribution at time $t$. Letting $u_t$ be the velocity field that defines the flow via $\frac{d}{dt}\psi_t(x) = u_t(\psi_t(x))$, we formally say that $u_t$ generates $p_t$ if $X_t=\psi_t(X_0)\sim p_t$ for all $t\in [0,1)$; the right end is open to handle velocity fields not defined precisely at $t=1$.

If $u_t$ indeed generates $p_t$, they satisfy the following partial differential equation, known as the Continuity Equation

$$\frac{\mathrm{d}}{\mathrm{d}t}p_t(x)+\mathrm{div}(p_tu_t)(x)=0\tag{1}$$

we define the divergence here as

$$\mathrm{div}(v)(x) = \sum_{i=1}^d \partial_{x^i}v^i(x)$$

for $v(x)=(v^1(x),\dots, v^d(x))$; this is just the conventional divergence from vector calculus. With the continuity equation, we have a theorem named mass conservation.

Theorem. Let $p_t$ be a probability path and $u_t$ a locally Lipschitz integrable vector field. Then the following two statements are equivalent:

  1. The continuity equation holds for $t\in [0,1)$.
  2. $u_t$ generates $p_t$.

The continuity equation actually has a straightforward physical meaning. Recall Gauss's theorem, also known as the Divergence Theorem:

$$\int_D \mathrm{div}(u)(x)dx = \int_{\partial D}\langle u(y),n(y)\rangle ds_y$$

which states that the integral of the divergence over a volume $D$ equals the flux leaving $D$ by orthogonally crossing its boundary $\partial D$.

Now let’s rewrite Equation (1)

$$\frac{\mathrm{d}}{\mathrm{d}t}\int_D p_t(x)dx = -\int_D\mathrm{div}(p_tu_t)(x)dx = -\int_{\partial D}\langle p_t(y)u_t(y),n(y) \rangle ds_y$$

So the continuity equation says that the rate of change of the total probability mass in the volume $D$ is the negative of the probability flux leaving the domain, where the probability flux is defined as $j_t(y)=p_t(y)u_t(y)$.
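To see the continuity equation in action, here is a small numerical check (my own sketch) for the probability path $p_t=\mathcal{N}(t,1)$, a unit Gaussian translated at constant speed, which is generated by the constant velocity field $u_t(x)=1$:

```python
import numpy as np

# Probability path p_t = N(t, 1): a unit Gaussian translated at constant speed,
# generated by the velocity field u_t(x) = 1.
p = lambda t, x: np.exp(-(x - t)**2 / 2) / np.sqrt(2 * np.pi)
u = lambda t, x: np.ones_like(x)

t, x = 0.3, np.linspace(-3, 4, 201)
h = 1e-5

# d/dt p_t(x) by central finite difference
dpdt = (p(t + h, x) - p(t - h, x)) / (2 * h)
# div(p_t u_t)(x) = d/dx [p_t(x) * 1] in one dimension
div_pu = (p(t, x + h) * u(t, x + h) - p(t, x - h) * u(t, x - h)) / (2 * h)

residual = dpdt + div_pu  # the continuity equation says this vanishes
assert np.max(np.abs(residual)) < 1e-5
```

Here $\partial_t p_t = -\partial_x p_t$ because $p_t(x)$ depends only on $x-t$, so the two finite-difference terms cancel, as Equation (1) predicts.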

Computing the Target $p_1$

Let's consider the log-likelihood $\log p_t(\psi_t(x))$. We have

$$\frac{\mathrm{d}}{\mathrm{d}t}\log p_t(\psi_t(x)) = \frac{d\log p_t(\psi_t(x))}{dt} + \nabla_{z}\log p_t(z)\big|_{z=\psi_t(x)}\cdot\frac{d\psi_t(x)}{dt}\tag{2}$$

This is because the function is a multi-variable function of $t$: time enters both through the density $p_t$ and through the argument $\psi_t(x)$, so the total derivative splits into these two terms; the first term differentiates only the explicit time dependence of $p_t$.

For the continuity equation, let's apply the identity $\nabla\cdot (fg)=(\nabla f)\cdot g+f(\nabla\cdot g)$:

$$\frac{dp_t(x)}{dt} = -(\nabla p_t(x))\cdot u_t(x) - p_t(x)(\nabla \cdot u_t(x))$$

Let's divide both sides by $p_t(x)$:

$$\begin{aligned} \frac{d\log p_t(x)}{dt} &= -\frac{\nabla p_t(x)}{p_t(x)}\cdot u_t(x) - \nabla\cdot u_t(x) \\ &=-(\nabla \log p_t(x))\cdot u_t(x) - \nabla\cdot u_t(x) \end{aligned}$$

Substituting $x=\psi_t(x)$, we obtain the first term in (2):

$$\frac{d\log p_t(\psi_t(x))}{dt} = -(\nabla \log p_t(\psi_t(x)))\cdot u_t(\psi_t(x)) - \nabla\cdot u_t(\psi_t(x))\tag{3}$$

From the flow ODE we know that

$$\frac{d\psi_t(x)}{dt}= u_t(\psi_t(x))$$

Substituting this back into (2) together with (3), we find

$$\begin{aligned} \frac{\mathrm{d}}{\mathrm{d}t}\log p_t(\psi_t(x)) &= -(\nabla \log p_t(\psi_t(x)))\cdot u_t(\psi_t(x)) - \nabla\cdot u_t(\psi_t(x)) + (\nabla\log p_t(\psi_t(x)))\cdot u_t(\psi_t(x))\\ &=-\nabla\cdot u_t(\psi_t(x)) \end{aligned}$$

This gives us a new relation called the Instantaneous Change of Variables

$$\frac{\mathrm{d}}{\mathrm{d}t}\log p_t(\psi_t(x)) = -\mathrm{div}(u_t)(\psi_t(x))\tag{4}$$

Integrating both sides from 0 to 1, we obtain

$$\log p_1(\psi_1(x))=\log p_0(\psi_0(x)) - \int_0^1 \mathrm{div}(u_t)(\psi_t(x))dt$$

This gives us access to the final distribution at $t=1$. The divergence term is hard to compute directly, so we can employ an unbiased estimator such as Hutchinson's trace estimator:

$$\mathrm{div}(u_t)(x) = \mathrm{tr}[\partial_x u_t(x)] = \mathbb{E}_Z\,\mathrm{tr}[Z^T\partial_xu_t(x)Z]$$

where $Z\in \mathbb{R}^{d}$ is any random vector with $\mathbb{E}[Z]=0$ and $\mathrm{Cov}(Z,Z)=I$ (e.g., standard Gaussian or Rademacher); unbiasedness follows from $\mathbb{E}[ZZ^T]=I$.
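A minimal sketch of Hutchinson's estimator (my own example): it uses Rademacher probe vectors ($\pm 1$ entries, zero mean, identity covariance) and a hypothetical linear field $u(x)=Ax$, so the Jacobian is $A$ and the true divergence $\mathrm{tr}(A)$ is known exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_divergence(jac, x, n_samples=20000):
    """Estimate div(u)(x) = tr[J(x)] via E_Z[Z^T J(x) Z] with E[Z]=0, Cov(Z)=I."""
    d = x.shape[0]
    J = jac(x)
    # Rademacher probes: entries +-1, zero mean, identity covariance
    Z = rng.choice([-1.0, 1.0], size=(n_samples, d))
    estimates = np.einsum('ni,ij,nj->n', Z, J, Z)  # Z^T J Z per sample
    return estimates.mean()

# Hypothetical velocity field u(x) = A x with a fixed matrix A
A = np.array([[1.0, 2.0], [0.5, -3.0]])
jac = lambda x: A

x = np.array([0.7, -1.2])
est = hutchinson_divergence(jac, x)
assert abs(est - np.trace(A)) < 0.1
```

With Rademacher probes the diagonal terms of $Z^T J Z$ are exact (since $Z_i^2=1$); only the off-diagonal entries of $J$ contribute variance, which is why they are a popular probe choice.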

We now obtain an unbiased estimator version

$$\log p_1(\psi_1(x)) = \log p_0(\psi_0(x)) - \mathbb{E}_Z\int_0^1\mathrm{tr}[Z^T\partial_xu_t(\psi_t(x))Z]dt \tag{5}$$

To obtain $\log p_1(\psi_1(x))$, we can leverage two ODEs. One is the flow ODE, telling us

$$\frac{d}{dt}\psi_t(x) = u_t(\psi_t(x)) \tag{6}$$

and the other is an ODE for the second term in (5):

$$\frac{d}{dt} g(t) = -\mathrm{tr}[Z^T\partial_xu_t(\psi_t(x))Z] \tag{7}$$

which follows from differentiating

$$g(t) = \int_t^1\mathrm{tr}[Z^T\partial_xu_s(\psi_s(x))Z]ds\tag{8}$$

Taking (6) and (7) together as an ODE system, we can solve

$$\frac{d}{dt} \begin{bmatrix} f(t) \\ g(t) \end{bmatrix}= \begin{bmatrix} u_t(f(t))\\ -\mathrm{tr}[Z^T\partial_x u_t(f(t))Z] \end{bmatrix},\quad \begin{bmatrix} f(1)\\ g(1) \end{bmatrix}= \begin{bmatrix} x \\ 0 \end{bmatrix}\tag{9}$$

backwards in time, from $t=1$ to $t=0$, to obtain $f(0)$ and $g(0)$. Then Equation (5) can be estimated by

$$\widehat{\log p_1}(x) = \log p_0(f(0)) - g(0) \tag{10}$$

Here, the initial conditions follow Equation (8): $g(1)=0$, and $f(1)=x$ is the point at which we evaluate the density at $t=1$.
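To make the backward solve concrete, here is a sketch (a toy example of mine, not the guide's implementation) that integrates the system with plain Euler steps for the hypothetical linear field $u_t(x)=x$; its flow is $\psi_t(x_0)=e^t x_0$ and its divergence is exactly 1, so the Hutchinson estimator is replaced by the exact trace:

```python
import numpy as np

def log_p1_hat(x, u, trace_jac, log_p0, n_steps=1000):
    """Solve (f, g) backwards from t=1 to t=0 with Euler steps,
    then return the estimate log p0(f(0)) - g(0), as in Equation (10)."""
    f, g = x, 0.0
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt
        # Backward Euler step: f' = u_t(f), g' = -tr[d_x u_t(f)]
        f, g = f - dt * u(t, f), g + dt * trace_jac(t, f)
    return log_p0(f) - g

# Toy linear field u_t(x) = x; divergence is exactly 1 everywhere
u = lambda t, x: x
trace_jac = lambda t, x: 1.0
log_p0 = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)  # standard normal

x = 1.5
est = log_p1_hat(x, u, trace_jac, log_p0)
# Closed form: X1 = e * X0 ~ N(0, e^2)
exact = -0.5 * (x / np.e)**2 - 0.5 * np.log(2 * np.pi * np.e**2)
assert abs(est - exact) < 1e-2
```

In practice one would use an adaptive ODE solver and the Hutchinson estimate of the trace instead of the exact divergence, but the backward-in-time structure is the same.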

Training Flow Models

To train a flow model, we can parameterize the velocity field as $u_t^\theta$, and the model should learn $\theta$ such that

$$p_1^\theta \approx q$$

which means the distribution at $t=1$ is the target distribution.

We can achieve this simply by minimizing the KL divergence:

$$\mathcal{L}(\theta)=D_{KL}(q,p_1^\theta) = -\mathbb{E}_{Y\sim q}\log p_1^\theta (Y) + C$$

where $C$ is a constant.

To obtain $\log p_1^\theta (Y)$, we can use (9) and (10) by setting $u_t = u_t^\theta$, which is our neural network, and $x=Y$, a sample from the target distribution.

As you can see, computing the loss requires simulating Equation (9). This is inefficient, and solving the ODE numerically introduces errors that bias the gradients.
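To illustrate the maximum-likelihood objective itself, here is a sketch with a hypothetical one-parameter field $u_t^\theta(x)=\theta x$, for which the flow is $\psi_1(x_0)=e^\theta x_0$ and thus $p_1^\theta=\mathcal{N}(0, e^{2\theta})$ in closed form; this sidesteps the ODE simulation that makes real training expensive, but the loss being minimized is exactly $-\mathbb{E}_{Y\sim q}\log p_1^\theta(Y)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-parameter model: u_t^theta(x) = theta * x => p_1^theta = N(0, e^{2 theta})
def neg_log_lik(theta, y):
    var = np.exp(2 * theta)
    return np.mean(0.5 * y**2 / var + 0.5 * np.log(2 * np.pi * var))

# Target q = N(0, 4); the optimal parameter is theta = log 2
y = rng.normal(0.0, 2.0, size=10000)

theta, lr, eps = 0.0, 0.5, 1e-5
for _ in range(200):
    # Finite-difference gradient of the negative log-likelihood
    grad = (neg_log_lik(theta + eps, y) - neg_log_lik(theta - eps, y)) / (2 * eps)
    theta -= lr * grad

assert abs(theta - np.log(2.0)) < 0.05
```

A real flow model replaces the closed-form `neg_log_lik` with the backward ODE solve of Equation (9), evaluated for every training batch; that is precisely the cost that flow matching later removes.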


02-Flow model, Everything Before Flow Matching
https://jesseprince.github.io/2025/06/08/visual_gen/flow_match/02_flow_model/
Author: 林正 · Posted on June 8, 2025