Differentiable Permutation Layer
Last updated: May 12, 2025
Introduction
Early in 2024, when I was still working on computer vision (CV), Mamba had just been introduced, and I saw many attempts to apply it to images. However, models like Mamba and other RNNs/SSMs inherently carry causal and positional inductive biases. Splitting an image into patches and feeding them to the model sequentially (as in ViT) imposes a top-left to bottom-right reading order on the image, a prior that doesn't naturally exist in images. Many papers at the time were exploring better scanning methods.
A natural question arises:
Does an optimal scanning order exist? And if so, can the model learn it automatically?
Method
For sequence modeling, the input is typically a tensor $X \in \mathbb{R}^{B \times L \times D}$. Ignoring the batch dimension, an image can be represented as a matrix $X \in \mathbb{R}^{L \times D}$, where each row vector $x_i \in \mathbb{R}^{D}$ corresponds to the embedding of a patch. After passing through a patch embedding layer, we obtain an ordered sequence $(x_1, x_2, \dots, x_L)$, where the order is determined by the position of each row in the matrix.
To find an optimal scanning order, we need to find an elementary row transformation (permutation) matrix $P \in \{0, 1\}^{L \times L}$ such that:

$$X' = P X$$

where the rows of $X'$ are the patches in the desired scanning order.
Thus, we need a function that can predict this transformation matrix. Suppose we have a function $f$ that outputs an "importance score" $s_i = f(x_i)$ for each patch, collected into a score vector $s \in \mathbb{R}^{L}$. Using `Tensor.sort()`, we can obtain a sorted index $\pi$, where $\pi_i$ is the original position of the patch that should be placed at position $i$. One-hot encoding this index yields a matrix $P$ with $P_{i,\pi_i} = 1$, which can serve as the row transformation matrix.
Example

For instance, with $L = 3$ and sorted index $\pi = (2, 3, 1)$, one-hot encoding gives

$$P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}$$

It's easy to verify that $P X = (x_2, x_3, x_1)^{\top}$, i.e., the rows of $X$ are rearranged into the desired order.
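To make this concrete, here is a minimal PyTorch sketch of the hard version described above (the descending sort order and the toy shapes are just illustrative choices):

```python
import torch
import torch.nn.functional as F

def hard_permutation(x, scores):
    """Build the row transformation matrix from Tensor.sort() and apply it.

    x:      (L, D) patch embeddings
    scores: (L,)   importance score per patch, e.g. produced by a scoring function f
    """
    # indices[i] = original position of the patch that goes to position i
    _, indices = scores.sort(descending=True)
    # One-hot encode the index vector -> (L, L) permutation matrix P
    P = F.one_hot(indices, num_classes=x.shape[0]).to(x.dtype)
    # (P @ x)[i] == x[indices[i]]: rows of x rearranged into the new order
    return P @ x, P

# Toy usage
x = torch.randn(4, 8)      # 4 patches, embedding dim 8
scores = torch.randn(4)    # stand-in for f(x)
x_sorted, P = hard_permutation(x, scores)
```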
However, a major issue is that the indices returned by `Tensor.sort()` are non-differentiable: gradients cannot flow through the one-hot matrix back to the scores. Is there a fully differentiable way to obtain the transformation matrix?
Differentiable Approximation
Consider a simple input $X = (x_1, x_2, x_3)^{\top}$ with $L = 3$. Applying $f$ gives importance scores $s = (s_1, s_2, s_3)^{\top}$. After sorting, suppose we get $\hat{s} = (s_2, s_3, s_1)^{\top}$, corresponding to the index $\pi = (2, 3, 1)$ and:

$$P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}$$
Note that:

$$\hat{s}\, s^{\top} = \begin{pmatrix} s_2 \\ s_3 \\ s_1 \end{pmatrix} \begin{pmatrix} s_1 & s_2 & s_3 \end{pmatrix} = \begin{pmatrix} s_2 s_1 & s_2^2 & s_2 s_3 \\ s_3 s_1 & s_3 s_2 & s_3^2 \\ s_1^2 & s_1 s_2 & s_1 s_3 \end{pmatrix}$$

The squared terms ($s_1^2, s_2^2, s_3^2$) indicate the positions of the "1"s in the elementary matrix $P$.
A naive way to approximate $P$ is:

$$P \approx \operatorname{softmax}\!\left( -\frac{\big|\, \hat{s}\, s^{\top} - (\hat{s} \odot \hat{s})\, \mathbf{1}^{\top} \big|}{\tau} \right)$$

where the softmax is applied row-wise. Here, subtracting $(\hat{s} \odot \hat{s})\, \mathbf{1}^{\top}$ zeros out the squared terms; taking absolute values of the cross terms, negating, and scaling by a temperature $\tau$ makes each row of the softmax peak at its zero entry, approximating the elementary matrix. However, optimization might be tricky due to the softmax.
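A rough PyTorch sketch of this naive approximation, assuming the outer-product construction above (the default temperature is an arbitrary choice):

```python
import torch

def soft_permutation(scores, tau=0.01):
    """Differentiable approximation of the permutation matrix.

    scores: (L,) importance scores s = f(x)
    tau:    temperature; smaller -> closer to a hard permutation
    """
    s_sorted, _ = scores.sort(descending=True)       # sorted values are differentiable w.r.t. scores
    outer = s_sorted[:, None] * scores[None, :]      # entry (i, j) = s_sorted[i] * s[j]
    # Subtracting the squared terms makes the "1" positions exactly zero;
    # |.| keeps the cross terms positive, so after negating and scaling by tau
    # the row-wise softmax peaks at those zero entries.
    diff = (outer - s_sorted[:, None] ** 2).abs()
    return torch.softmax(-diff / tau, dim=-1)

# Sanity check: gradients reach the scores through the soft matrix
scores = torch.tensor([0.3, 0.9, 0.1], requires_grad=True)
P_soft = soft_permutation(scores)
x = torch.randn(3, 4)
(P_soft @ x).sum().backward()
```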
Alternatively, if we treat the outer product $\hat{s}\, s^{\top}$ as the carrier of the learnable information, we can use the stop-gradient trick. Let $P_{\text{hard}}$ be the one-hot encoded matrix obtained from `Tensor.sort()`. Then:

$$P = P_{\text{hard}} + \hat{s}\, s^{\top} - \operatorname{sg}\!\left(\hat{s}\, s^{\top}\right)$$

where $\operatorname{sg}(\cdot)$ stops gradients. In the forward pass the last two terms cancel, so $P$ equals the exact permutation matrix $P_{\text{hard}}$; in the backward pass, gradients flow through $\hat{s}\, s^{\top}$ back to the scores and hence to $f$.
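A rough PyTorch sketch of this trick, with `Tensor.detach()` playing the role of $\operatorname{sg}$; using the outer product $\hat{s}\, s^{\top}$ as the gradient carrier is just one possible choice:

```python
import torch
import torch.nn.functional as F

def ste_permutation(scores):
    """Hard permutation in the forward pass, soft gradients in the backward pass.

    scores: (L,) importance scores s = f(x)
    """
    L = scores.shape[0]
    s_sorted, indices = scores.sort(descending=True)
    P_hard = F.one_hot(indices, num_classes=L).to(scores.dtype)  # exact permutation, no gradient
    carrier = s_sorted[:, None] * scores[None, :]                # differentiable function of the scores
    # Numerically P == P_hard (the last two terms cancel), but gradients
    # w.r.t. P flow into `carrier` and hence back to the scoring function.
    return P_hard + carrier - carrier.detach()

scores = torch.tensor([0.3, 0.9, 0.1], requires_grad=True)
x = torch.randn(3, 4)
P = ste_permutation(scores)
(P @ x).sum().backward()
print(scores.grad)   # non-zero: the scorer receives a training signal
```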
Conclusion
This was a rough idea I came up with to let models learn their own scanning order. I originally planned to experiment further, but later shifted focus to LLMs and text-to-video work, leaving Mamba/SSM behind. While preliminary, I wanted to document it—maybe someone else will find it useful.