The Widely Used RoPE

Last updated on March 15, 2026


1 RoPE Position Encoding

The theory behind RoPE and its derivation: https://kexue.fm/archives/8265

Suppose the query vector $q$ at position $m$ has dimension $d$. RoPE transforms $q$ into

$$
\begin{bmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{bmatrix}\otimes \begin{bmatrix} \cos m\theta_0 \\ \cos m\theta_0 \\ \cos m\theta_1 \\ \cos m\theta_1 \\ \vdots \\ \cos m\theta_{d/2-1}\\ \cos m\theta_{d/2-1} \end{bmatrix} + \begin{bmatrix} -q_1 \\ q_0 \\ -q_3 \\ q_2 \\ \vdots \\ -q_{d-1} \\ q_{d-2} \end{bmatrix} \otimes \begin{bmatrix} \sin m\theta_0 \\ \sin m\theta_0 \\ \sin m\theta_1 \\ \sin m\theta_1 \\ \vdots \\ \sin m\theta_{d/2-1}\\ \sin m\theta_{d/2-1} \end{bmatrix}
$$

where

$$
\theta_i = \text{base}^{-2i/d}
$$

The key vector $k$ at position $n$ is transformed in the same way. Both transformations are applied before computing the attention score $QK^T$.
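As a sanity check, the interleaved formula above can be sketched directly in PyTorch. This is an illustration only; `rope_interleaved` is a hypothetical helper name, not from any library:

```python
import torch

def rope_interleaved(q: torch.Tensor, m: int, base: float = 10000.0) -> torch.Tensor:
    """Apply the interleaved RoPE transform above to one vector q at position m."""
    d = q.shape[-1]
    # θ_i = base^{-2i/d}, each angle repeated twice to cover dims (2i, 2i+1)
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = (m * theta).repeat_interleave(2)
    # second operand of the formula: [-q_1, q_0, -q_3, q_2, ...]
    rot = torch.stack((-q[..., 1::2], q[..., 0::2]), dim=-1).flatten(-2)
    return q * ang.cos() + rot * ang.sin()
```

At $m = 0$ all angles vanish and the transform is the identity; for any $m$ it is a norm-preserving rotation of each adjacent pair of dimensions.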

2 Code Implementation

```python
@staticmethod
def compute_default_rope_parameters(
    config: Qwen2Config | None = None,
    device: Optional["torch.device"] = None,
    seq_len: int | None = None,
) -> tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies according to the original RoPE implementation
    Args:
        config ([`~transformers.PreTrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
    """
    base = config.rope_parameters["rope_theta"]
    dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads

    attention_factor = 1.0  # Unused in this type of RoPE

    # Compute the inverse frequencies
    inv_freq = 1.0 / (
        base ** (torch.arange(0, dim, 2, dtype=torch.int64).to(device=device, dtype=torch.float) / dim)
    )
    return inv_freq, attention_factor
```

This function computes

$$
\omega = \frac{1}{\text{base}^{2i/d}}
$$

Here `torch.arange(0, dim, 2)` generates an even-valued array of length $\text{dim}/2$,

$$
[0, 2, 4, \ldots, \text{dim}-2] \in \mathbb{R}^{\text{dim}/2}
$$

which supplies the $2i$ in the formula; dividing by the model dimension `dim` then gives the exponent $2i/\text{dim}$.
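For instance, with an illustrative `dim = 8` and the common default `base = 10000` (both values chosen here for demonstration), the computation reduces to:

```python
import torch

dim, base = 8, 10000.0
idx = torch.arange(0, dim, 2, dtype=torch.int64)   # [0, 2, 4, 6] — the 2i terms, length dim/2
inv_freq = 1.0 / (base ** (idx.float() / dim))     # base^{-2i/dim}
print(inv_freq)  # frequencies decay from 1.0 toward base^{-(dim-2)/dim}
```

The first frequency is exactly 1.0 (since $\text{base}^0 = 1$), and each subsequent one shrinks geometrically, so lower dimensions rotate fast and higher dimensions rotate slowly.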

```python
@torch.no_grad()
@dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
def forward(self, x, position_ids):
    # D/2 -> 1, D/2, 1 -> B, D/2, 1
    inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
    # B, T -> B, 1, T
    position_ids_expanded = position_ids[:, None, :].float()

    device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
    with maybe_autocast(device_type=device_type, enabled=False):  # Force float32
        # B, D/2, 1 @ B, 1, T -> B, D/2, T -> B, T, D/2
        freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
        # B, T, D/2 -> B, T, D
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos() * self.attention_scaling
        sin = emb.sin() * self.attention_scaling

    return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
```

Here `inv_freq_expanded.float() @ position_ids_expanded.float()` performs an outer product:

$$
\begin{bmatrix} \frac{1}{\text{base}^{0/D}} \\ \frac{1}{\text{base}^{2/D}} \\ \frac{1}{\text{base}^{4/D}} \\ \vdots \\ \frac{1}{\text{base}^{(D-2)/D}} \end{bmatrix}\times [0, 1, 2, 3, ..., T-1]= \begin{bmatrix} \frac{0}{\text{base}^{0/D}} & \frac{1}{\text{base}^{0/D}} & \cdots & \frac{T-1}{\text{base}^{0/D}}\\ \frac{0}{\text{base}^{2/D}} & \frac{1}{\text{base}^{2/D}} & \cdots & \frac{T-1}{\text{base}^{2/D}}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{0}{\text{base}^{(D-2)/D}} & \frac{1}{\text{base}^{(D-2)/D}} & \cdots & \frac{T-1}{\text{base}^{(D-2)/D}} \end{bmatrix}\in \mathbb{R}^{D/2\times T}
$$

Transposing gives

$$
\begin{bmatrix} \frac{0}{\text{base}^{0/D}} & \frac{0}{\text{base}^{2/D}} & \cdots & \frac{0}{\text{base}^{(D-2)/D}} \\ \frac{1}{\text{base}^{0/D}} & \frac{1}{\text{base}^{2/D}} & \cdots & \frac{1}{\text{base}^{(D-2)/D}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{T-1}{\text{base}^{0/D}} & \frac{T-1}{\text{base}^{2/D}} & \cdots & \frac{T-1}{\text{base}^{(D-2)/D}} \end{bmatrix}\in \mathbb{R}^{T\times D/2}
$$

The subsequent concatenation gives

$$
\begin{bmatrix} \frac{0}{\text{base}^{0/D}} & \frac{0}{\text{base}^{2/D}} & \cdots & \frac{0}{\text{base}^{(D-2)/D}} & \frac{0}{\text{base}^{0/D}} & \frac{0}{\text{base}^{2/D}} & \cdots & \frac{0}{\text{base}^{(D-2)/D}} \\ \frac{1}{\text{base}^{0/D}} & \frac{1}{\text{base}^{2/D}} & \cdots & \frac{1}{\text{base}^{(D-2)/D}} & \frac{1}{\text{base}^{0/D}} & \frac{1}{\text{base}^{2/D}} & \cdots & \frac{1}{\text{base}^{(D-2)/D}} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \frac{T-1}{\text{base}^{0/D}} & \frac{T-1}{\text{base}^{2/D}} & \cdots & \frac{T-1}{\text{base}^{(D-2)/D}} & \frac{T-1}{\text{base}^{0/D}} & \frac{T-1}{\text{base}^{2/D}} & \cdots & \frac{T-1}{\text{base}^{(D-2)/D}} \end{bmatrix}\in \mathbb{R}^{T\times D}
$$

Finally, sin and cos are applied element-wise to this matrix.
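The shape bookkeeping in `forward` can be reproduced standalone. This is a minimal sketch with arbitrarily chosen sizes (`B`, `T`, `D`), not the library code itself:

```python
import torch

B, T, D = 2, 5, 8
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, D, 2).float() / D))  # (D/2,)
position_ids = torch.arange(T).expand(B, T)                         # (B, T)

inv_freq_expanded = inv_freq[None, :, None].expand(B, -1, 1)        # (B, D/2, 1)
position_ids_expanded = position_ids[:, None, :].float()            # (B, 1, T)
# batched outer product: (B, D/2, 1) @ (B, 1, T) -> (B, D/2, T)
freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2) # (B, T, D/2)
emb = torch.cat((freqs, freqs), dim=-1)                             # (B, T, D)
cos, sin = emb.cos(), emb.sin()
```

Entry `freqs[b, t, i]` equals $t \cdot \omega_i$, and the concatenation duplicates the first $D/2$ columns into the second half, exactly as in the matrices above.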

```python
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


@use_kernel_func_from_hub("rotary_pos_emb")
def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`torch.Tensor`): The query tensor.
        k (`torch.Tensor`): The key tensor.
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
    Returns:
        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
    # B, 1, T, D
    cos = cos.unsqueeze(unsqueeze_dim)
    # B, 1, T, D
    sin = sin.unsqueeze(unsqueeze_dim)
    # (B, n_head, T, D) * (B, 1, T, D)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

The `unsqueeze` at the top exists so that cos/sin broadcast across the head dimension. `rotate_half` implements

$$
\begin{bmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_{d-1} \end{bmatrix}\overset{\text{rotate}}{\rightarrow} \begin{bmatrix} -q_{d/2} \\ -q_{d/2+1} \\ \vdots \\ -q_{d-1} \\ q_{0} \\ q_{1} \\ \vdots \\ q_{d/2-1} \end{bmatrix}
$$
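A concrete call makes the split explicit (tiny self-contained example, with `rotate_half` copied from the snippet above):

```python
import torch

def rotate_half(x):
    """Negate the second half of the last dim and move it in front of the first half."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

q = torch.tensor([0.0, 1.0, 2.0, 3.0])
print(rotate_half(q))  # tensor([-2., -3.,  0.,  1.])
```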

The result is then multiplied with the previously computed sin and cos and summed. Note that this does not match the original RoPE exactly: the original interleaves adjacent pairs, $[-q_1, q_0, -q_3, q_2, \ldots]$, whereas here the negated half sits in front and the positive half behind. This is merely a permutation of dimensions, so it does not affect the model's ability to learn positional information, but this layout is faster to implement.

Looking closely, each output pair is now $q_0\cos(m\theta_0) - q_{d/2}\sin(m\theta_0)$ and $q_{d/2}\cos(m\theta_0) + q_0\sin(m\theta_0)$: dimensions $i$ and $i + d/2$ are rotated together by the angle $m\theta_i$.
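This pairing can be verified numerically. The sketch below (illustrative sizes `D = 8`, `m = 3`; `rotate_half` copied from above) applies the half-split transform to a random vector and exposes the rotated pair for checking:

```python
import torch

def rotate_half(x):
    # same as above: negate the second half, move it in front of the first half
    return torch.cat((-x[..., x.shape[-1] // 2 :], x[..., : x.shape[-1] // 2]), dim=-1)

torch.manual_seed(0)
D, m = 8, 3
theta = 1.0 / (10000.0 ** (torch.arange(0, D, 2).float() / D))  # θ_i, shape (D/2,)
ang = torch.cat((m * theta, m * theta))                          # angle mθ_i for every dim
q = torch.randn(D)
q_embed = q * ang.cos() + rotate_half(q) * ang.sin()
# dimension pair (i, i + D/2) is a plain 2D rotation by mθ_i,
# so q_embed[i]       = q[i]        * cos(mθ_i) - q[i + D//2] * sin(mθ_i)
# and q_embed[i+D//2] = q[i + D//2] * cos(mθ_i) + q[i]        * sin(mθ_i)
```

Since each pair undergoes a pure rotation, the overall norm of $q$ is preserved, which is one reason RoPE plays well with dot-product attention.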


Source: https://lynx-li.github.io/2026/03/15/llms/position_encodes/rope/
Author: Lynx Li
Posted on March 15, 2026