Paper 2
=======

* Authors: Guolin Ke, Di He, Tie-Yan Liu
* Affiliations: Microsoft Research

Key Ideas
---------

In transformers such as BERT, the positional encoding is usually added to the word embeddings, which mixes correlations between two heterogeneous sources of information. This causes two problems:

1. The attention score includes two unwanted terms, each a dot product between a positional embedding and a word embedding.
2. The class embedding :code:`[CLS]` is biased towards nearby word tokens.
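
The first point can be made concrete by expanding the attention score. The following is a sketch under assumed notation that does not appear in the notes above: single-head attention, :math:`w_i` for the word embedding at position :math:`i`, :math:`p_i` for its absolute positional embedding, :math:`W^Q` and :math:`W^K` for the query and key projections, and :math:`d` for the head dimension.

.. math::

   \begin{aligned}
   \alpha_{ij}
   &= \frac{\big((w_i + p_i) W^Q\big)\big((w_j + p_j) W^K\big)^\top}{\sqrt{d}} \\
   &= \frac{(w_i W^Q)(w_j W^K)^\top}{\sqrt{d}}
    + \frac{(w_i W^Q)(p_j W^K)^\top}{\sqrt{d}}
    + \frac{(p_i W^Q)(w_j W^K)^\top}{\sqrt{d}}
    + \frac{(p_i W^Q)(p_j W^K)^\top}{\sqrt{d}}
   \end{aligned}

Under these assumptions, the first term is the word-to-word correlation and the last is the position-to-position correlation. The second and third terms are the word-to-position and position-to-word cross terms, which correspond to the two unwanted terms in item 1.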