This tutorial series is based on the Flow Matching Guide and Code
arXiv: 2412.06264
Thank you, META
Flow Models
Before flow matching, flow models were hard to train: training entails solving ODEs, which slows down the training process and can introduce numerical errors. Still, it’s important to understand what came before flow matching.
Diffeomorphisms and push-forward maps
Let’s denote the collection of functions $f: \mathbb{R}^m \to \mathbb{R}^n$ with continuous partial derivatives of order $r$ as $C^r(\mathbb{R}^m, \mathbb{R}^n)$, where the $r$’th order partial derivatives are
$$\frac{\partial^r f_k}{\partial x_{i_1} \cdots \partial x_{i_r}}, \qquad k \in \{1, \dots, n\},\ i_j \in \{1, \dots, m\}$$
We simply denote $C^r(\mathbb{R}^n) := C^r(\mathbb{R}^n, \mathbb{R})$; e.g., $C^1(\mathbb{R}^m)$ denotes continuously differentiable scalar functions (maps from $\mathbb{R}^m$ to $\mathbb{R}$).
Definition. If a function $\psi \in C^r(\mathbb{R}^n, \mathbb{R}^n)$ is invertible with $\psi^{-1} \in C^r(\mathbb{R}^n, \mathbb{R}^n)$, we say $\psi$ is a $C^r$ diffeomorphism.
So a diffeomorphism poses two strong conditions: (1) the function has continuous $r$’th order derivatives and maps $\mathbb{R}^n$ to $\mathbb{R}^n$; (2) the function is invertible, and the inverse function also has continuous $r$’th order derivatives.
Now let’s consider $Y = \psi(X)$, where $X \sim p_X$ is a random vector and $\psi: \mathbb{R}^d \to \mathbb{R}^d$ is a $C^1$ diffeomorphism. We can calculate the PDF of $Y$, which is $p_Y$, by the change of variables formula
$$p_Y(y) = p_X(\psi^{-1}(y))\left|\det \partial_y \psi^{-1}(y)\right|$$
where $\partial_y \psi^{-1}(y)$ denotes the Jacobian matrix of $\psi^{-1}$. We further denote the push-forward operator with the symbol $\sharp$:
$$[\psi_\sharp p_X](y) := p_X(\psi^{-1}(y))\left|\det \partial_y \psi^{-1}(y)\right|$$
The reason we call it push-forward is that, in the context of flow models, $\psi$ describes the position after some period of time, so this function pushes the distribution of $X$ forward (in the time dimension) to the distribution of $Y$, the distribution after that period of time.
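As a sanity check, here is a minimal numerical sketch (our own toy example, not from the guide): we push a standard Gaussian through the affine diffeomorphism $\psi(x) = 2x + 1$ and compare the push-forward density against the known answer $\mathcal{N}(1, 4)$.

```python
import numpy as np

# Toy example: verify the change-of-variables formula for the C^1
# diffeomorphism psi(x) = 2x + 1 applied to X ~ N(0, 1).
# Then Y = psi(X) ~ N(1, 4), which we can check analytically.

def normal_pdf(x, mean=0.0, std=1.0):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def psi(x):          # the diffeomorphism
    return 2.0 * x + 1.0

def psi_inv(y):      # its inverse
    return (y - 1.0) / 2.0

def push_forward_pdf(y):
    # [psi_# p_X](y) = p_X(psi^{-1}(y)) * |det d/dy psi^{-1}(y)|
    jac = 0.5        # d/dy psi_inv(y) = 1/2, constant for an affine map
    return normal_pdf(psi_inv(y)) * abs(jac)

print(push_forward_pdf(1.7), normal_pdf(1.7, mean=1.0, std=2.0))
```

The two printed values agree: the push-forward formula reproduces the analytic density of $\mathcal{N}(1, 4)$.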
Generative Flow Models
Definition. A $C^r$ flow is a time-dependent mapping $\psi: [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ implementing $\psi: (t, x) \mapsto \psi_t(x)$.
A flow is a $C^r([0,1] \times \mathbb{R}^d, \mathbb{R}^d)$ function such that $\psi_t(x)$ is a $C^r$ diffeomorphism in $x$ for all $t \in [0,1]$.
Definition. A flow model is a continuous-time Markov process $(X_t)_{0 \le t \le 1}$ defined by applying a flow $\psi_t$ to the random vector $X_0$:
$$X_t = \psi_t(X_0)$$
We can prove that $X_t$ is Markov: let $0 \le t < s \le 1$; then
$$X_s = \psi_s(X_0) = \psi_s(\psi_t^{-1}(\psi_t(X_0))) = \psi_{s|t}(X_t)$$
where $\psi_{s|t} := \psi_s \circ \psi_t^{-1}$, so the state at time $s > t$ depends only on the state at time $t$.
In summary, the goal of generative flow modeling is to find a flow $\psi_t$ such that
$$X_1 = \psi_1(X_0) \sim q$$
where $q$ is the target distribution.
Probability Path and the Continuity Equation
Definition. A time-dependent probability density $(p_t)_{0 \le t \le 1}$ is called a probability path.
We can say $X_t \sim p_t$, and $p_t$ can be obtained with the push-forward operator
$$p_t(x) = [\psi_{t\sharp} p](x)$$
which pushes the initial distribution $p$ of $X_0$ to the distribution at time $t$. In practice, a flow is induced by a velocity field $u_t: \mathbb{R}^d \to \mathbb{R}^d$ through the ODE
$$\frac{d}{dt}\psi_t(x) = u_t(\psi_t(x)), \qquad \psi_0(x) = x$$
We say that $u_t$ generates $p_t$ if $X_t = \psi_t(X_0) \sim p_t$ for all $t \in [0,1)$; the right end is open to handle velocity fields that are not defined precisely at $t=1$.
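To make the ODE view concrete, here is a small sketch (the names and the toy field $u_t(x) = -x$ are our own choices, not from the guide): integrating the flow ODE with explicit Euler recovers the exact flow $\psi_t(x) = x e^{-t}$.

```python
import numpy as np

# Sketch: given a velocity field u_t, the flow psi_t is obtained by
# integrating the ODE
#   d/dt psi_t(x) = u_t(psi_t(x)),  psi_0(x) = x.
# Here u_t(x) = -x, so the exact flow is psi_t(x) = x * exp(-t).

def u(t, x):
    # toy time-independent velocity field
    return -x

def integrate_flow(x0, n_steps=1000):
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * u(i * dt, x)   # explicit Euler step
    return x                        # approximates psi_1(x0)

x0 = 2.0
print(integrate_flow(x0), x0 * np.exp(-1.0))
```

The Euler approximation converges to the exact value as `n_steps` grows; in real models this integration is what a numerical ODE solver performs.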
If $u_t$ really generates $p_t$, they satisfy the following partial differential equation, known as the Continuity Equation:
$$\frac{d}{dt} p_t(x) + \mathrm{div}(p_t u_t)(x) = 0 \tag{1}$$
where we define the divergence as
$$\mathrm{div}(v)(x) = \sum_{i=1}^d \partial_{x_i} v_i(x)$$
for $v(x) = (v_1(x), \dots, v_d(x))$; this is just the conventional divergence from vector field theory. With the continuity equation, we have a theorem named mass conservation:
Theorem. Let $p_t$ be a probability path and $u_t$ a locally Lipschitz integrable vector field. Then the following two statements are equivalent:
1. The continuity equation (1) holds for $t \in [0,1)$.
2. $u_t$ generates $p_t$.
The continuity equation actually has a straightforward physical meaning; recall Gauss’s law, i.e., the Divergence Theorem:
$$\int_D \mathrm{div}(u)(x)\,dx = \int_{\partial D} \langle u(y), n(y)\rangle\, ds_y$$
which states that the integral of the divergence over a volume $D$ equals the flux leaving $D$ by orthogonally crossing its boundary $\partial D$.
So the continuity equation is saying that the rate of change of the total probability mass in the volume $D$ equals the negative of the probability flux leaving the domain, where the probability flux is defined as $j_t(y) = p_t(y) u_t(y)$.
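Here is a quick numerical check of Equation (1) on a toy path of our own choosing: the 1-d Gaussian path $p_t = \mathcal{N}(t, 1)$ is generated by the constant velocity field $u_t(x) = 1$, and the continuity-equation residual vanishes up to finite-difference error.

```python
import numpy as np

# Check the continuity equation  d/dt p_t(x) + div(p_t u_t)(x) = 0
# for the 1-d Gaussian path p_t = N(t, 1), which is generated by the
# constant velocity field u_t(x) = 1 (the whole density translates right).

def p(t, x):
    return np.exp(-0.5 * (x - t) ** 2) / np.sqrt(2 * np.pi)

def residual(t, x, h=1e-5):
    dpdt = (p(t + h, x) - p(t - h, x)) / (2 * h)      # time derivative
    flux = lambda xx: p(t, xx) * 1.0                  # j_t = p_t * u_t, u_t = 1
    div_flux = (flux(x + h) - flux(x - h)) / (2 * h)  # spatial divergence
    return dpdt + div_flux

print(residual(0.3, 0.7))  # ~0: the continuity equation holds
```

For this path $\partial_t p_t(x) = (x - t) p_t(x)$ and $\partial_x p_t(x) = -(x - t) p_t(x)$, so the two terms cancel exactly; the code confirms this numerically.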
Computing the Target p1
Let’s consider the log-likelihood $\log p_t(\psi_t(x))$. Differentiating in time and using the continuity equation, we have
$$\frac{d}{dt}\log p_t(\psi_t(x)) = -\mathrm{div}(u_t)(\psi_t(x))$$
and integrating from $t=0$ to $t=1$ gives
$$\log p_1(\psi_1(x)) = \log p_0(x) - \int_0^1 \mathrm{div}(u_t)(\psi_t(x))\,dt \tag{5}$$
This gives us access to the final distribution at $t=1$. The divergence term is hard to compute directly, since it requires the trace of the full Jacobian; instead we can employ unbiased estimators such as Hutchinson’s trace estimator
$$\mathrm{div}(u_t)(x) = \mathrm{tr}\left[\partial_x u_t(x)\right] = \mathbb{E}_Z\left[Z^T \partial_x u_t(x) Z\right]$$
where $Z \in \mathbb{R}^d$ is any random vector with $\mathbb{E}[Z] = 0$ and $\mathrm{Cov}(Z, Z) = I$.
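A short sketch of Hutchinson’s estimator in isolation (the matrix `A` is a hypothetical stand-in for the Jacobian $\partial_x u_t(x)$, not a real model's): Rademacher probes, a common choice, satisfy the mean-zero, identity-covariance conditions.

```python
import numpy as np

# Hutchinson's estimator: tr(A) = E[Z^T A Z] for a random vector Z with
# E[Z] = 0 and Cov(Z, Z) = I. Rademacher probes (entries +-1) are a
# common low-variance choice.

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))   # stand-in for the Jacobian of u_t at some x

n_samples = 200_000
Z = rng.choice([-1.0, 1.0], size=(n_samples, d))       # Rademacher probes
estimates = np.einsum('ni,ij,nj->n', Z, A, Z)          # Z^T A Z per sample

print(estimates.mean(), np.trace(A))                   # close to each other
```

In a model, one never forms $\partial_x u_t(x)$ explicitly: $Z^T \partial_x u_t(x) Z$ is computed with a single vector-Jacobian product per probe via automatic differentiation.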
To estimate $\log p_1(x)$ at a point $x$, define $f(t)$ and $g(t)$ by the initial conditions
$$f(1) = x, \qquad g(1) = 0 \tag{8}$$
and the ODE system
$$\frac{d}{dt}\begin{bmatrix} f(t) \\ g(t) \end{bmatrix} = \begin{bmatrix} u_t(f(t)) \\ -\mathrm{div}(u_t)(f(t)) \end{bmatrix} \tag{9}$$
We solve this system backwards in time, from $t=1$ to $t=0$, to obtain $f(0)$ and $g(0)$; then Equation (5) can be estimated by
$$\log p_1(x) = \log p_0(f(0)) - g(0) \tag{10}$$
Here, the initial conditions in Equation (8) say that $g$ is $0$ at $t=1$ and that $f(1) = x$ is the point, e.g., a sample from the target distribution, whose likelihood we want.
Training Flow Models
To train a flow model, we can still parameterize the velocity field as $u_t^\theta$, and the model should learn $\theta$ such that
$$p_1^\theta \approx q$$
which means the distribution at t=1 is the target distribution.
We can achieve this simply by minimizing the KL divergence
$$\mathcal{L}(\theta) = D_{\mathrm{KL}}(q \,\|\, p_1^\theta) = -\mathbb{E}_{Y \sim q}\log p_1^\theta(Y) + C$$
where $C$ is a constant independent of $\theta$.
To obtain $\log p_1^\theta(Y)$, we can use (9) and (10) by setting $u_t = u_t^\theta$, our neural network, and $x = Y$, a sample from the target distribution.
As you can see, computing the loss requires simulating Equation (9); this is inefficient, and solving the ODE numerically introduces errors that bias the gradients.
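To see this inefficiency concretely, here is a minimal simulation-based training sketch. Everything in it is our own toy choice, not the guide's method: a 1-d linear field $u_t^\theta(x) = \theta x$, synthetic Gaussian data, and a finite-difference gradient standing in for backprop. The point is structural: every single loss evaluation re-simulates the backward ODE.

```python
import numpy as np

# Toy simulation-based training: with u_t^theta(x) = theta * x and
# p_0 = N(0, 1), the model distribution is p_1^theta = N(0, e^{2*theta}).
# We minimize the NLL  -E_{Y~q} log p_1^theta(Y)  where q = N(0, 4),
# so the optimal parameter is theta = ln 2. The likelihood is computed by
# simulating the backward ODE (9); the gradient is a finite difference.

rng = np.random.default_rng(0)
data = rng.normal(0.0, 2.0, size=512)          # samples from q = N(0, 4)

def log_p1(x, theta, n_steps=200):
    dt = 1.0 / n_steps
    f, g = x, 0.0                              # conditions (8) at t = 1
    for _ in range(n_steps):                   # backward in time, t = 1 -> 0
        f = f - dt * theta * f                 # f' = u = theta * f
        g = g - dt * (-theta)                  # g' = -div(u) = -theta
    return -0.5 * f ** 2 - 0.5 * np.log(2 * np.pi) - g   # (10), log p_0 Gaussian

def nll(theta):
    return -np.mean(log_p1(data, theta))       # full ODE solve per evaluation!

theta, eps, lr = 0.0, 1e-4, 0.1
for _ in range(200):                           # gradient descent on the NLL
    grad = (nll(theta + eps) - nll(theta - eps)) / (2 * eps)
    theta -= lr * grad
print(theta, np.log(2.0))                      # theta approaches ln 2
```

The fitted $\theta$ lands near $\ln 2 \approx 0.693$ (up to sampling and discretization error), but note that each of the 400 loss evaluations above runs a 200-step ODE solve. This per-step simulation cost and bias is exactly what motivates flow matching.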