Rethinking Positional Encoding in Language Pre-training#

Authors: Guolin Ke, Di He, Tie-Yan Liu
Affiliations: Microsoft Research
ICLR, 2021
Links: arXiv, notes

Summary#

In Transformers such as BERT, the positional encoding is usually added to the word embedding, which mixes the correlations between two heterogeneous information resources (words and positions). The authors identify two problems with this design:

The attention includes two unwanted terms involving the dot product between positional embeddings and word embeddings.
The class embedding [CLS] will be biased towards the word tokens nearby.

The authors propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE). Experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness of this method.

../_images/rethinking_positional_encoding_in_language_pretraining-1.png — Figure 1: The architecture of TUPE.#

Key Ideas#

Untie correlations between positions and words. Consider the absolute positional encoding in the transformer. A learnable real-valued vector \(p_i \in \mathbb{R}^d\) is added to the word embedding \(w_i\) at position \(i\). The self-attention is calculated as follows:

\[\begin{split}\alpha_{ij}^\text{Abs} = & \frac{1}{\sqrt{d}} ((w_i + p_i)W^{Q, 1}) ((w_j + p_j)W^{K, 1})^\top \\ = & \frac{1}{\sqrt{d}} [ (w_iW^{Q, 1}(w_jW^{K, 1})^\top) + (w_iW^{Q, 1}(p_jW^{K, 1})^\top) \\ & + (p_iW^{Q, 1}(w_jW^{K, 1})^\top) + (p_iW^{Q, 1}(p_jW^{K, 1})^\top) ]\end{split}\]

which includes four terms: word-to-word, word-to-position, position-to-word, and position-to-position correlations. The authors argue that the second and third terms are unwanted. Moreover, the figure below suggests that the word-to-position and position-to-word correlations are nearly uniform across positions, i.e., words and absolute positions are only weakly correlated.

../_images/rethinking_positional_encoding_in_language_pretraining-2.png — Figure 2: Visualizations of four correlations: word-to-word, word-to-position, position-to-word, and position-to-position.#
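The four-term expansion above follows from bilinearity, and it can be checked numerically. The sketch below uses random toy matrices (the sizes `n`, `d` and the helper name `corr` are chosen here for illustration, not taken from the paper) and confirms that the full score matrix equals the sum of the four correlation matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8                      # toy sequence length and model width
w = rng.normal(size=(n, d))      # word embeddings w_i
p = rng.normal(size=(n, d))      # absolute positional embeddings p_i
W_q = rng.normal(size=(d, d))    # stands in for W^{Q,1}
W_k = rng.normal(size=(d, d))    # stands in for W^{K,1}

def corr(a, b):
    """Scaled scores (a_i W^Q)(b_j W^K)^T / sqrt(d) for all pairs (i, j)."""
    return (a @ W_q) @ (b @ W_k).T / np.sqrt(d)

# Attention scores with the usual "add positions to words" input.
full = corr(w + p, w + p)

# The same scores split into the four terms of the expansion.
terms = {
    "word-to-word": corr(w, w),
    "word-to-position": corr(w, p),
    "position-to-word": corr(p, w),
    "position-to-position": corr(p, p),
}
decomposed = sum(terms.values())

print(np.allclose(full, decomposed))
```

Plotting each entry of `terms` for a trained model, rather than for random matrices, would give heat maps of the kind shown in Figure 2.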

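Therefore, the authors propose to remove the second and third terms. The position-to-position correlation is computed with its own projection matrices \(U^Q\) and \(U^K\), separate from the word projections, and is added to the attention scores of every layer as a bias.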

../_images/rethinking_positional_encoding_in_language_pretraining-3.png — Figure 3: Untie correlations between positions and words.#
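Up to notation, the untied score in [1] has the form

\[\alpha_{ij} = \frac{1}{\sqrt{2d}} (x_i^l W^{Q, l}) (x_j^l W^{K, l})^\top + \frac{1}{\sqrt{2d}} (p_i U^Q) (p_j U^K)^\top\]

where \(x_i^l\) is the input of layer \(l\), which no longer contains the positional embedding. The following is a minimal single-head NumPy sketch of this formula, not the authors' implementation; the function names `positional_bias` and `tupe_a_logits` are hypothetical. Because the positional term does not depend on the layer input, it can be computed once and reused.

```python
import numpy as np

def positional_bias(p, U_q, U_k):
    """Position-to-position term (p_i U^Q)(p_j U^K)^T / sqrt(2d)."""
    d = p.shape[-1]
    return (p @ U_q) @ (p @ U_k).T / np.sqrt(2 * d)

def tupe_a_logits(x, W_q, W_k, pos_bias):
    """Word-to-word term of one layer plus the shared positional bias."""
    d = x.shape[-1]
    return (x @ W_q) @ (x @ W_k).T / np.sqrt(2 * d) + pos_bias

rng = np.random.default_rng(0)
n, d = 5, 8                                  # toy sizes for illustration
x = rng.normal(size=(n, d))                  # layer input WITHOUT positions added
p = rng.normal(size=(n, d))                  # absolute positional embeddings
W_q, W_k, U_q, U_k = (rng.normal(size=(d, d)) for _ in range(4))

bias = positional_bias(p, U_q, U_k)          # computed once, reused by every layer
logits = tupe_a_logits(x, W_q, W_k, bias)    # pre-softmax attention scores
print(logits.shape)
```

The \(\sqrt{2d}\) scaling replaces the usual \(\sqrt{d}\) because two terms are summed. In a multi-head model, `d` would be the per-head dimension.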

Untie the class embedding from positions. In BERT, the class embedding [CLS] is treated just like other tokens. However, regular words often have strong local dependencies and many visualizations [2, 3] show that the attention distributions of some heads concentrate locally. Therefore, treating [CLS] like other tokens will make [CLS] biased to focus on the first several words instead of the whole sentence.

Therefore, the authors propose to replace the positional correlation scores that involve [CLS] with a learnable parameter \(\theta\) that does not depend on position.

../_images/rethinking_positional_encoding_in_language_pretraining-5.png — Figure 4: Illustration of untying `[CLS]`.#
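The reset operation can be sketched as an overwrite of one row and one column of the positional bias matrix. The sketch assumes that [CLS] sits at index 0 and, following my reading of [1], uses two separate scalars, one for [CLS] attending to other tokens and one for other tokens attending to [CLS]. The names `untie_cls`, `theta_from_cls`, and `theta_to_cls` are hypothetical; passing the same value for both gives the single-\(\theta\) description above.

```python
import numpy as np

def untie_cls(pos_bias, theta_from_cls, theta_to_cls):
    """Replace the positional scores that involve [CLS] (assumed at index 0)."""
    out = pos_bias.copy()
    out[:, 0] = theta_to_cls      # any token attending to [CLS]
    out[0, :] = theta_from_cls    # [CLS] attending to any token, itself included
    return out

rng = np.random.default_rng(0)
pos_bias = rng.normal(size=(4, 4))            # toy positional bias matrix
reset = untie_cls(pos_bias, theta_from_cls=0.5, theta_to_cls=-0.5)
print(reset)
```

In training, the two scalars would be learned jointly with the rest of the model. Here they are fixed numbers only to show which entries change.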

Technical Details#

Implementations. Two variants of TUPE are developed: TUPE-A with untied absolute positional encoding, and TUPE-R with an additional relative positional encoding.
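A rough sketch of the extra term in TUPE-R: a learnable scalar indexed by the relative offset \(j - i\) is added to the attention scores on top of the absolute positional bias. This sketch clips the offset to a window of size `max_dist` for simplicity. The relative encoding used in the paper may bucket distances differently, so the indexing scheme, the function name `relative_bias`, and the parameter names are assumptions.

```python
import numpy as np

def relative_bias(n, table, max_dist):
    """Return an (n, n) matrix whose entry (i, j) is table[clip(j - i) + max_dist]."""
    idx = np.arange(n)
    offset = np.clip(idx[None, :] - idx[:, None], -max_dist, max_dist)  # j - i
    return table[offset + max_dist]

max_dist = 2
table = np.arange(2 * max_dist + 1, dtype=float)   # stands in for learned scalars
rel = relative_bias(4, table, max_dist)
print(rel)
```

Every diagonal of `rel` holds a single value, so the bias depends only on the distance between the two tokens and not on their absolute positions.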

Efficiency. The projection matrices \(U^Q\) and \(U^K\) for the positional embeddings are shared across layers, so they add only about 1% to the 110M parameters of the model.
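The 1% figure can be reproduced with simple arithmetic. The sketch assumes a BERT-base hidden size of 768 (not stated in these notes) and counts only the two weight matrices, ignoring any bias vectors.

```python
d_model = 768                    # assumed BERT-base hidden size
base_params = 110_000_000        # parameter count quoted above

# U^Q and U^K are each d_model x d_model and are shared by all layers.
extra_params = 2 * d_model * d_model

print(extra_params, f"{extra_params / base_params:.2%}")
# prints: 1179648 1.07%
```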

Absolute and relative positional encodings are not redundant to each other.

Experimental results.

../_images/rethinking_positional_encoding_in_language_pretraining-4.png — Figure 5: GLUE scores on dev set.#

Visualizations of learned positional correlations by TUPE-A.

../_images/rethinking_positional_encoding_in_language_pretraining-6.png — Figure 6: Visualizations of learned positional correlations by TUPE-A.#

Notes#

The way TUPE unties positional encodings from word embeddings does not exactly match the explanation given in the paper. In BERT, the positional encodings are fused into the features, so subsequent layers may learn very complex correlations between positions and content. In TUPE, however, the positional correlations are the same in every layer. This works well in practice, but not all of the paper's explanations necessarily hold.

References#

[1] G. Ke, D. He, T. Liu. “Rethinking positional encoding in language pre-training”. In ICLR, 2021.

[2] K. Clark, U. Khandelwal, O. Levy, C. D. Manning. “What does BERT look at? An analysis of BERT’s attention”. In ACL, 2019.

[3] L. Gong, D. He, Z. Li, T. Qin, L. Wang, T. Liu. “Efficient training of BERT by progressively stacking”. In ICML, 2019.