The Widely Used RoPE

Last updated on March 15, 2026


1 RoPE Position Encoding

The theory behind RoPE and its derivation: https://kexue.fm/archives/8265

Suppose the query vector $q$ at position $m$ has dimension $d$. RoPE transforms $q$ into

$$
\begin{bmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{bmatrix}\otimes \begin{bmatrix} \cos m\theta_0 \\ \cos m\theta_0 \\ \cos m\theta_1 \\ \cos m\theta_1 \\ \vdots \\ \cos m\theta_{d/2-1}\\ \cos m\theta_{d/2-1} \end{bmatrix} + \begin{bmatrix} -q_1 \\ q_0 \\ -q_3 \\ q_2 \\ \vdots \\ -q_{d-1} \\ q_{d-2} \end{bmatrix} \otimes \begin{bmatrix} \sin m\theta_0 \\ \sin m\theta_0 \\ \sin m\theta_1 \\ \sin m\theta_1 \\ \vdots \\ \sin m\theta_{d/2-1}\\ \sin m\theta_{d/2-1} \end{bmatrix}
$$

where

$$
\theta_i = \text{base}^{-2i/d}
$$

The key vector $k$ at position $n$ is transformed in the same way. Both transformations are applied before computing the attention score $QK^T$.
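As a sanity check, the interleaved formula above can be sketched directly in PyTorch. This is an illustration only; `rope_interleaved` is a hypothetical helper name, not from any library:

```python
import torch

def rope_interleaved(q: torch.Tensor, m: int, base: float = 10000.0) -> torch.Tensor:
    """Apply the interleaved RoPE transform above to one vector q at position m."""
    d = q.shape[-1]
    # θ_i = base^{-2i/d}, each angle repeated twice to cover dims (2i, 2i+1)
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = (m * theta).repeat_interleave(2)
    # second operand of the formula: [-q_1, q_0, -q_3, q_2, ...]
    rot = torch.stack((-q[..., 1::2], q[..., 0::2]), dim=-1).flatten(-2)
    return q * ang.cos() + rot * ang.sin()
```

At $m = 0$ all angles vanish and the transform is the identity; for any $m$ it is a norm-preserving rotation of each adjacent pair of dimensions.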

2 Code Implementation

```python
@staticmethod
def compute_default_rope_parameters(
    config: Qwen2Config | None = None,
    device: Optional["torch.device"] = None,
    seq_len: int | None = None,
) -> tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies according to the original RoPE implementation
    Args:
        config ([`~transformers.PreTrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
    """
    base = config.rope_parameters["rope_theta"]
    dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads

    attention_factor = 1.0  # Unused in this type of RoPE

    # Compute the inverse frequencies
    inv_freq = 1.0 / (
        base ** (torch.arange(0, dim, 2, dtype=torch.int64).to(device=device, dtype=torch.float) / dim)
    )
    return inv_freq, attention_factor
```

This function computes

$$
\omega = \frac{1}{\text{base}^{2i/d}}
$$

Here `torch.arange(0, dim, 2)` generates an even-valued array of length $\text{dim}/2$,

$$
[0, 2, 4, \ldots, \text{dim}-2] \in \mathbb{R}^{\text{dim}/2}
$$

which supplies the $2i$ in the formula; dividing by the model dimension `dim` then gives the exponent $2i/\text{dim}$.
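For instance, with an illustrative `dim = 8` and the common default `base = 10000` (both values chosen here for demonstration), the computation reduces to:

```python
import torch

dim, base = 8, 10000.0
idx = torch.arange(0, dim, 2, dtype=torch.int64)   # [0, 2, 4, 6] — the 2i terms, length dim/2
inv_freq = 1.0 / (base ** (idx.float() / dim))     # base^{-2i/dim}
print(inv_freq)  # frequencies decay from 1.0 toward base^{-(dim-2)/dim}
```

The first frequency is exactly 1.0 (since $\text{base}^0 = 1$), and each subsequent one shrinks geometrically, so lower dimensions rotate fast and higher dimensions rotate slowly.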

```python
@torch.no_grad()
@dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
def forward(self, x, position_ids):
    # D/2 -> 1, D/2, 1 -> B, D/2, 1
    inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
    # B, T -> B, 1, T
    position_ids_expanded = position_ids[:, None, :].float()

    device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
    with maybe_autocast(device_type=device_type, enabled=False):  # Force float32
        # B, D/2, 1 @ B, 1, T -> B, D/2, T -> B, T, D/2
        freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
        # B, T, D/2 -> B, T, D
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos() * self.attention_scaling
        sin = emb.sin() * self.attention_scaling

    return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
```

Here `inv_freq_expanded.float() @ position_ids_expanded.float()` performs an outer product:

$$
\begin{bmatrix} \frac{1}{\text{base}^{0/D}} \\ \frac{1}{\text{base}^{2/D}} \\ \frac{1}{\text{base}^{4/D}} \\ \vdots \\ \frac{1}{\text{base}^{(D-2)/D}} \end{bmatrix}\times [0, 1, 2, 3, ..., T-1]= \begin{bmatrix} \frac{0}{\text{base}^{0/D}} & \frac{1}{\text{base}^{0/D}} & \cdots & \frac{T-1}{\text{base}^{0/D}}\\ \frac{0}{\text{base}^{2/D}} & \frac{1}{\text{base}^{2/D}} & \cdots & \frac{T-1}{\text{base}^{2/D}}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{0}{\text{base}^{(D-2)/D}} & \frac{1}{\text{base}^{(D-2)/D}} & \cdots & \frac{T-1}{\text{base}^{(D-2)/D}} \end{bmatrix}\in \mathbb{R}^{D/2\times T}
$$

Transposing gives

$$
\begin{bmatrix} \frac{0}{\text{base}^{0/D}} & \frac{0}{\text{base}^{2/D}} & \cdots & \frac{0}{\text{base}^{(D-2)/D}} \\ \frac{1}{\text{base}^{0/D}} & \frac{1}{\text{base}^{2/D}} & \cdots & \frac{1}{\text{base}^{(D-2)/D}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{T-1}{\text{base}^{0/D}} & \frac{T-1}{\text{base}^{2/D}} & \cdots & \frac{T-1}{\text{base}^{(D-2)/D}} \end{bmatrix}\in \mathbb{R}^{T\times D/2}
$$

The subsequent concatenation gives

$$
\begin{bmatrix} \frac{0}{\text{base}^{0/D}} & \frac{0}{\text{base}^{2/D}} & \cdots & \frac{0}{\text{base}^{(D-2)/D}} & \frac{0}{\text{base}^{0/D}} & \frac{0}{\text{base}^{2/D}} & \cdots & \frac{0}{\text{base}^{(D-2)/D}} \\ \frac{1}{\text{base}^{0/D}} & \frac{1}{\text{base}^{2/D}} & \cdots & \frac{1}{\text{base}^{(D-2)/D}} & \frac{1}{\text{base}^{0/D}} & \frac{1}{\text{base}^{2/D}} & \cdots & \frac{1}{\text{base}^{(D-2)/D}} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \frac{T-1}{\text{base}^{0/D}} & \frac{T-1}{\text{base}^{2/D}} & \cdots & \frac{T-1}{\text{base}^{(D-2)/D}} & \frac{T-1}{\text{base}^{0/D}} & \frac{T-1}{\text{base}^{2/D}} & \cdots & \frac{T-1}{\text{base}^{(D-2)/D}} \end{bmatrix}\in \mathbb{R}^{T\times D}
$$

Finally, sin and cos are applied element-wise to this matrix.
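The shape bookkeeping in `forward` can be reproduced standalone. This is a minimal sketch with arbitrarily chosen sizes (`B`, `T`, `D`), not the library code itself:

```python
import torch

B, T, D = 2, 5, 8
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, D, 2).float() / D))  # (D/2,)
position_ids = torch.arange(T).expand(B, T)                         # (B, T)

inv_freq_expanded = inv_freq[None, :, None].expand(B, -1, 1)        # (B, D/2, 1)
position_ids_expanded = position_ids[:, None, :].float()            # (B, 1, T)
# batched outer product: (B, D/2, 1) @ (B, 1, T) -> (B, D/2, T)
freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2) # (B, T, D/2)
emb = torch.cat((freqs, freqs), dim=-1)                             # (B, T, D)
cos, sin = emb.cos(), emb.sin()
```

Entry `freqs[b, t, i]` equals $t \cdot \omega_i$, and the concatenation duplicates the first $D/2$ columns into the second half, exactly as in the matrices above.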

```python
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


@use_kernel_func_from_hub("rotary_pos_emb")
def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`torch.Tensor`): The query tensor.
        k (`torch.Tensor`): The key tensor.
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
    Returns:
        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
    # B, 1, T, D
    cos = cos.unsqueeze(unsqueeze_dim)
    # B, 1, T, D
    sin = sin.unsqueeze(unsqueeze_dim)
    # (B, n_head, T, D) * (B, 1, T, D)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

The `unsqueeze` at the top exists so that cos/sin broadcast across the head dimension. `rotate_half` implements

$$
\begin{bmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_{d-1} \end{bmatrix}\overset{\text{rotate}}{\rightarrow} \begin{bmatrix} -q_{d/2} \\ -q_{d/2+1} \\ \vdots \\ -q_{d-1} \\ q_{0} \\ q_{1} \\ \vdots \\ q_{d/2-1} \end{bmatrix}
$$
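A concrete call makes the split explicit (tiny self-contained example, with `rotate_half` copied from the snippet above):

```python
import torch

def rotate_half(x):
    """Negate the second half of the last dim and move it in front of the first half."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

q = torch.tensor([0.0, 1.0, 2.0, 3.0])
print(rotate_half(q))  # tensor([-2., -3.,  0.,  1.])
```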

The result is then multiplied with the previously computed sin and cos and summed. Note that this does not match the original RoPE exactly: the original interleaves adjacent pairs, $[-q_1, q_0, -q_3, q_2, \ldots]$, whereas here the negated half sits in front and the positive half behind. This is merely a permutation of dimensions, so it does not affect the model's ability to learn positional information, but this layout is faster to implement.

Looking closely, each output pair is now $q_0\cos(m\theta_0) - q_{d/2}\sin(m\theta_0)$ and $q_{d/2}\cos(m\theta_0) + q_0\sin(m\theta_0)$: dimensions $i$ and $i + d/2$ are rotated together by the angle $m\theta_i$.
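This pairing can be verified numerically. The sketch below (illustrative sizes `D = 8`, `m = 3`; `rotate_half` copied from above) applies the half-split transform to a random vector and exposes the rotated pair for checking:

```python
import torch

def rotate_half(x):
    # same as above: negate the second half, move it in front of the first half
    return torch.cat((-x[..., x.shape[-1] // 2 :], x[..., : x.shape[-1] // 2]), dim=-1)

torch.manual_seed(0)
D, m = 8, 3
theta = 1.0 / (10000.0 ** (torch.arange(0, D, 2).float() / D))  # θ_i, shape (D/2,)
ang = torch.cat((m * theta, m * theta))                          # angle mθ_i for every dim
q = torch.randn(D)
q_embed = q * ang.cos() + rotate_half(q) * ang.sin()
# dimension pair (i, i + D/2) is a plain 2D rotation by mθ_i,
# so q_embed[i]       = q[i]        * cos(mθ_i) - q[i + D//2] * sin(mθ_i)
# and q_embed[i+D//2] = q[i + D//2] * cos(mθ_i) + q[i]        * sin(mθ_i)
```

Since each pair undergoes a pure rotation, the overall norm of $q$ is preserved, which is one reason RoPE plays well with dot-product attention.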


Source: https://lynx-li.github.io/2026/03/15/llms/position_encodes/rope/
Author: Lynx Li
Posted on March 15, 2026