This paper proposes RoPE (Rotary Position Embedding) as a form of position embedding.

Formulation

\[q_m=f_q(x_m, m), k_n=f_k(x_n, n), v_n=f_v(x_n, n)\]

where $m, n$ are the indices of tokens in the sequence.

Previous Methods

Absolute Position Embedding

A position embedding $p$ is added to the input $x$, and the sum is passed through linear layers to obtain q, k, v. This is the classic form of PE.
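The absolute scheme can be sketched in a few lines. This is only an illustration: it assumes the sinusoidal table from the original Transformer as $p$ (the notes do not say which $p$ is meant), and the weights, the input, and the helper names `sinusoidal_pe`, `matvec`, and `query` are all made up for the example.

```python
import math

def sinusoidal_pe(pos, d):
    """Sinusoidal position vector of even length d (an assumed choice of p)."""
    pe = []
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe += [math.sin(angle), math.cos(angle)]
    return pe

def matvec(W, x):
    """Plain matrix-vector product for lists of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

d = 4
# Toy projection weights, chosen by hand for the example.
W_q = [[1.0, 0.0, 0.5, 0.0],
       [0.0, 1.0, 0.0, 0.5],
       [0.5, 0.0, 1.0, 0.0],
       [0.0, 0.5, 0.0, 1.0]]
x = [0.1, -0.2, 0.3, 0.4]

def query(x, m):
    """q_m = W_q (x_m + p_m): add the position vector, then project."""
    p = sinusoidal_pe(m, d)
    return matvec(W_q, [xi + pi for xi, pi in zip(x, p)])

q0 = query(x, 0)  # same token content at position 0
q5 = query(x, 5)  # same token content at position 5
```

The same token yields different queries at different positions, so position enters the attention score through the absolute indices $m$ and $n$ rather than through their difference.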

Others (relative PE)

Many relative variants were proposed afterwards, though none of them is as well known.


The one reported to work best is:

\[q_m^T k_n = x_m^TW^T_qW_kx_n + x_m^TW^T_qW_k\tilde p_{m-n} + \tilde p_{m-n}^T W^T_qW_kx_n\]

Method: RoPE

We want the attention score to depend on position only through the relative offset $m-n$.

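In the 2D case this can be solved directly. Writing 2D vectors as complex numbers, the general form (as given in the RoFormer paper, with $g$ denoting the attention score) is

\[f_q(x_m, m) = (W_qx_m)e^{im\theta}, \quad f_k(x_n, n) = (W_kx_n)e^{in\theta}, \quad g(x_m, x_n, m-n) = \mathrm{Re}\left[(W_qx_m)(W_kx_n)^* e^{i(m-n)\theta}\right]\]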
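The 2D case is easy to check numerically by treating each 2D vector as a complex number and encoding position as a rotation by `pos * theta`. The sketch below is only illustrative: `theta`, the two complex values standing in for the projected query and key, and the helper names `encode` and `score` are assumptions made for the example.

```python
import cmath

theta = 0.3  # assumed rotation frequency, arbitrary for the demo

def encode(z, pos):
    """Rotate the complex number z by the angle pos * theta."""
    return z * cmath.exp(1j * pos * theta)

def score(q, k):
    """Real inner product of two 2D vectors written as complex numbers."""
    return (q * k.conjugate()).real

q = complex(0.8, -0.1)  # stands in for W_q x_m
k = complex(0.3, 0.5)   # stands in for W_k x_n

s1 = score(encode(q, 5), encode(k, 2))      # m - n = 3
s2 = score(encode(q, 103), encode(k, 100))  # m - n = 3, shifted by 98
s3 = score(encode(q, 5), encode(k, 3))      # m - n = 2
```

`s1` and `s2` agree up to floating-point error because both pairs have the same offset, while `s3` differs because the offset changed.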

Building on this, the method is obtained as follows.

Define

\[R^d_{\Theta, m}=\begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \ldots & 0 & 0 \\ \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \ldots & 0 & 0 \\ 0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \ldots & 0 & 0 \\ 0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \ldots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\ 0 & 0 & 0 & 0 & \ldots & \sin m\theta_{d/2} & \cos m\theta_{d/2} \\ \end{pmatrix}\]

Let $q_m = R^d_{\Theta, m}W_qx_m, \quad k_n = R^d_{\Theta, n}W_kx_n$, i.e. rotate the linearly projected vectors and then take their dot product.

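The dot product then depends on position only through $m-n$: since $(R^d_{\Theta, m})^T R^d_{\Theta, n} = R^d_{\Theta, n-m}$, we get $q_m^Tk_n = x_m^TW_q^TR^d_{\Theta, n-m}W_kx_n$. In practice $\theta_i=10000^{-2(i-1)/d}$ for $i = 1, \ldots, d/2$, where $d$ is the dimension of the q, k, v vectors.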
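The relative-position property can be verified by building the block-diagonal rotation matrix explicitly. The following is a minimal sketch under stated assumptions: the vectors `q` and `k` stand in for $W_qx_m$ and $W_kx_n$ and are made up, the frequencies use the 0-based form `base ** (-2 * j / d)` for `j = 0..d/2-1`, and the helper names are illustrative.

```python
import math

def thetas(d, base=10000.0):
    """Rotation frequencies, one per 2D pair (0-based index j)."""
    return [base ** (-2 * j / d) for j in range(d // 2)]

def rotation_matrix(m, d):
    """Block-diagonal d x d matrix made of 2x2 rotations by m * theta_j."""
    R = [[0.0] * d for _ in range(d)]
    for j, theta in enumerate(thetas(d)):
        c, s = math.cos(m * theta), math.sin(m * theta)
        R[2 * j][2 * j], R[2 * j][2 * j + 1] = c, -s
        R[2 * j + 1][2 * j], R[2 * j + 1][2 * j + 1] = s, c
    return R

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

d = 4
q = [0.5, -1.0, 0.25, 2.0]   # stands in for W_q x_m
k = [1.5, 0.5, -0.75, 1.0]   # stands in for W_k x_n

def score(m, n):
    """Dot product of q rotated to position m and k rotated to position n."""
    return dot(matvec(rotation_matrix(m, d), q),
               matvec(rotation_matrix(n, d), k))

same_offset_a = score(7, 3)      # m - n = 4
same_offset_b = score(104, 100)  # m - n = 4
other_offset = score(7, 4)       # m - n = 3
```

Shifting both positions by the same amount leaves the score unchanged up to floating-point error, whereas changing the offset changes it.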

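Because $R^d_{\Theta, m}$ is sparse, the rotation is usually computed elementwise rather than as a matrix product:

\[R^d_{\Theta, m}x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix} \otimes \begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \\ \cos m\theta_{d/2} \\ \cos m\theta_{d/2} \end{pmatrix} + \begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix} \otimes \begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \\ \sin m\theta_{d/2} \\ \sin m\theta_{d/2} \end{pmatrix}\]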
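The rotation never needs to be materialized as a matrix: each pair $(x_{2j-1}, x_{2j})$ is combined with its swapped, sign-flipped partner. Below is a small sketch of that elementwise form for a single vector. The function name `rope` and the `base` parameter are illustrative, and the frequencies use the 0-based form `base ** (-2 * j / d)`.

```python
import math

def rope(x, m, base=10000.0):
    """Rotate x (even length d) to position m using the elementwise form."""
    d = len(x)
    cos, sin = [], []
    for j in range(d // 2):
        angle = m * base ** (-2 * j / d)
        # Each angle is shared by the two coordinates of its pair.
        cos += [math.cos(angle)] * 2
        sin += [math.sin(angle)] * 2
    # Partner vector (-x2, x1, -x4, x3, ...).
    x_rot = []
    for j in range(0, d, 2):
        x_rot += [-x[j + 1], x[j]]
    return [xi * c + ri * s for xi, c, ri, s in zip(x, cos, x_rot, sin)]

x = [0.5, -1.0, 0.25, 2.0]
y = rope(x, 9)
unit = rope([1.0, 0.0], math.pi / 2)  # d = 2, so the only frequency is 1
```

Since every pair undergoes a plain 2D rotation, the norm of `y` equals the norm of `x`, and rotating `[1.0, 0.0]` by a quarter turn lands on `[0.0, 1.0]` up to floating-point error.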

Pseudocode

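import tensorflow as tf

# Inputs assumed to exist already:
# sinusoidal_pos: [1, seq_len, head_size], sinusoidal position embeddings
#   (sin at even indices, cos at odd indices, as the slicing below assumes)
# qw: [batch_size, seq_len, num_heads, head_size], query hiddens
# kw: [batch_size, seq_len, num_heads, head_size], key hiddens

# Repeat each cos/sin value twice so it covers both coordinates of a pair,
# and add a broadcast axis for the heads.
cos_pos = tf.repeat(sinusoidal_pos[..., None, 1::2], 2, axis=-1)
sin_pos = tf.repeat(sinusoidal_pos[..., None, ::2], 2, axis=-1)

# Build the partner vector (-x2, x1, -x4, x3, ...) and apply the rotation.
qw2 = tf.stack([-qw[..., 1::2], qw[..., ::2]], axis=4)
qw2 = tf.reshape(qw2, tf.shape(qw))
qw = qw * cos_pos + qw2 * sin_pos
kw2 = tf.stack([-kw[..., 1::2], kw[..., ::2]], axis=4)
kw2 = tf.reshape(kw2, tf.shape(kw))
kw = kw * cos_pos + kw2 * sin_pos

# Attention scores: [batch_size, num_heads, seq_len, seq_len]
a = tf.einsum('bjhd,bkhd->bhjk', qw, kw)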

SQA-033

[Paper] Roformer: Enhanced transformer With Rotary Position Embedding
