Transformer architectures / pretraining losses

Lite Transformer with Long-Short Range Attention

Long-Short Range Attention runs lower-dimensional global attention in parallel with convolutions that capture local context. The approach is more parameter-efficient and more robust to hyper-parameter search than the baseline all-attention transformer, and the attention does seem to focus more on distant inputs.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

ALBERT, making BERT lighter. Token embeddings: small dimension plus a projection up to the model dimension. Attention layers: parameters shared across layers. Dropout removed (works with more data, 10x BERT's). Adds a sentence-order objective (detecting swapped sentences) > increased performance. Reducing parameters leads to a small loss of performance, but it is recovered with the new objective + more data. ALBERT-large > BERT for fewer params, ALBERT-xxlarge >> BERT-large.
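The embedding factorization is where most of the parameter savings come from; a rough count, using illustrative sizes (vocabulary V = 30k, embedding dim E = 128, hidden dim H = 768 — not necessarily the paper's exact configuration):

```python
# Embedding parameter count: BERT-style (direct) vs. ALBERT-style (factorized).
# V, E, H are illustrative values, not the exact paper configuration.
V, E, H = 30_000, 128, 768

bert_embedding_params = V * H              # tokens embedded directly at model dim
albert_embedding_params = V * E + E * H    # small embedding + projection to model dim

print(bert_embedding_params)    # → 23040000
print(albert_embedding_params)  # → 3938304
```

Roughly a 6x reduction in embedding parameters, on top of the cross-layer sharing.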

StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

StructBERT: during pre-training, the model must unshuffle word and sentence trigrams (the latter is harder than the original Next Sentence Prediction task). Significant boost over BERT with the same amount of pre-training data, and still a notable improvement over RoBERTa.

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

ELECTRA: identifying strategically chosen switched-out words is a lot more data-efficient than masked word prediction. Could easily be combined with StructBERT.

Encoding word order in complex embeddings

A promising early foray into complex-valued transformers, which allow word order to be modeled naturally through better position embeddings.

Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention

Transformer with extra-hop attention: models several documents simultaneously, with each [CLS] token attending over all other documents’ [CLS] tokens. Leads to improvements in multi-hop QA.

Depth-Adaptive Transformer
Adaptive computation: in a seq2seq model, the model predicts the next token at each layer (not just the last), and learns to “stop computation” when confidence is high enough. More a proof of concept than an actual efficiency optimization given hardware constraints, but it provides some insight into model behavior (e.g. more layers are used at the start of decoding).

Generalization through Memorization: Nearest Neighbor Language Models

kNN-LM: a cache model over all of Wikipedia. The language model is trained once without the cache, and the interpolation parameter between the original prediction and the 1024-nearest-neighbor distribution is tuned on a validation set. The index over 3B contexts (all of Wikipedia) is 400GB.
A FAISS index is then created using 1M randomly sampled keys to learn 4096 cluster centroids. For efficiency, keys are quantized to 64 bytes. During inference, the model retrieves k = 1024 neighbors, and the index looks up 32 cluster centroids while searching for the nearest neighbors.
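The interpolation step itself is simple; a minimal pure-Python sketch (the function name, the softmax-over-negative-distances weighting, and the example λ are illustrative — the paper tunes λ on validation data):

```python
import math
from collections import defaultdict

def knn_lm_interpolate(p_lm, neighbors, lam=0.25, temperature=1.0):
    """Interpolate a base LM distribution with a distribution built from
    retrieved (token, distance) neighbors, kNN-LM style.
    p_lm: dict token -> probability from the base language model.
    neighbors: list of (token, squared_distance) pairs from the index.
    lam: interpolation weight, tuned on a validation set (illustrative value).
    """
    # Softmax over negative distances, aggregating mass per token.
    weights = [math.exp(-d / temperature) for _, d in neighbors]
    z = sum(weights)
    p_knn = defaultdict(float)
    for (tok, _), w in zip(neighbors, weights):
        p_knn[tok] += w / z
    # Final distribution: lam * kNN + (1 - lam) * LM.
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn[t] + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}
```

Note that the kNN distribution can put mass on tokens the LM considered unlikely, which is exactly where the memorization helps.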


TabFact: A Large-scale Dataset for Table-based Fact Verification

Learning to read tables: 118k human-annotated statements with supporting table data; the model must predict “ENTAILED” or “REFUTED”. Baselines: 1) TableBERT > linearizes the table (concatenating cells, or templates that turn rows into natural language) and uses it as the premise in NLI; 2) Latent Program Algorithm > semantic parsing (program synthesis) in a pre-defined execution language over the table. Neither works very well.

Measuring Compositional Generalization: A Comprehensive Method on Realistic Data
“Compositionality”: 200k questions on Freebase. The authors introduce a measure of “compound divergence” which lets them create train/test splits that probe a certain notion of “modularity” in models. SOTA models that do well on random splits still do poorly on these, even when they have seen all of the entities / verbs in training.

Abductive Commonsense Reasoning
Abductive reasoning, defined as “identifying the most likely explanation”. 20K crowd-sourced examples from short (5-sentence) stories. Standard BERT has room to improve; pre-training on the dataset helps on other common-sense reasoning tasks. This seems to be a more focused version of the StructBERT re-ordering task.

Learning The Difference That Makes A Difference With Counterfactually-Augmented Data

Counterfactually augmented sentiment classification and NLI datasets based on pre-identified “spurious patterns”. Could be a good tool to evaluate pre-trained models, but I wouldn’t advise training on the augmented data.

Environmental drivers of systematicity and generalization in a situated agent

DeepMind’s situated language-conditioned agent in a game environment (Unreal Engine); shows some zero-shot transfer on some tasks, e.g. by changing the object to be moved.


Pre-training Tasks for Embedding-based Large-scale Retrieval

Adds two pre-training tasks to the Inverse Cloze Task of Lee et al. to pre-train a dense retriever. ICT works best by itself, but Body First Selection (query: random sentence from the intro section; document: random passage from the same page) and Wiki Link Prediction (query: intro sentence; document: passage from an article that links to the query’s page) help a tiny bit. Trains against a sampled softmax over examples from the current batch, with batch size 8192 on TPU.

Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring

Poly-encoders: pre-computes each candidate document representation as a single vector, but uses it while computing the query representation: first encodes N tokens into N vectors, then uses M attention codes to get M query representations, then computes one final query representation per candidate with 1-by-M attention, then takes a dot product. More expensive than a plain dot product, but much cheaper than full cross-attention while performing about as well.
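The scoring pipeline above can be sketched in a few lines of pure Python (function names and the single-head, unscaled attention are simplifications for illustration; the real model uses learned multi-head attention over transformer outputs):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attend(query, keys, values):
    """Single-head dot-product attention: one query vector over keys/values."""
    scores = softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(values[0])
    return [sum(s * v[i] for s, v in zip(scores, values)) for i in range(dim)]

def poly_encoder_score(query_tokens, codes, candidate_vec):
    """Poly-encoder scoring sketch.
    query_tokens: N encoded query token vectors.
    codes: M learned code vectors attending over the query tokens.
    candidate_vec: a single pre-computed candidate document vector.
    """
    # M global query representations: each code attends over the query tokens.
    query_reps = [attend(c, query_tokens, query_tokens) for c in codes]
    # The candidate attends over the M query representations (1-by-M attention)...
    final_query = attend(candidate_vec, query_reps, query_reps)
    # ...and the score is a plain dot product, so candidates stay pre-computable.
    return sum(q * c for q, c in zip(final_query, candidate_vec))
```

The key design point is that the only candidate-dependent computation is the cheap 1-by-M attention plus a dot product, so millions of candidates can be pre-encoded offline.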

Minimizing FLOPs to Learn Efficient Sparse Representations

Learned sparse index: learns to map pre-trained document representations (images in the paper) to high-dimensional sparse representations and an inverted index that minimize FLOPs. The main contribution is a relaxed loss function reflecting the number of FLOPs; the trained model is better at retrieval on 1M face images (85k classes, 1024-d sparse embeddings).

Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering

Multi-hop reasoning: learns reasoning paths over the Wikipedia hyperlink graph for multi-hop QA. Starts with a TF-IDF match list, then an RNN over paragraph representations of linked articles, until End Of Evidence is predicted. Supervised training: “We first derive a ground-truth reasoning path g = [p1, . . . , p|g|] using the available annotated data in each dataset. p|g| is set to [EOE] for the termination condition. To relax and stabilize the training process, we augment the training data with additional reasoning paths – not necessarily the shortest paths – that can derive the answer. In particular, we add a new training path gr = [pr, p1, . . . , p|g|] by adding a paragraph pr ∈ C1 that has a high TF-IDF score and is linked to the first paragraph p1 in the ground-truth path g.”

Differentiable Reasoning over a Virtual Knowledge Base
Multi-hop reasoning: first extracts entities from documents and builds a co-occurrence graph with pre-computed dense representations of mentions. During QA, extracts the question mentions and a question representation, and uses the dot product between the question representation and mention embeddings to navigate the co-occurrence graph. Mention embeddings need to be pre-trained on artificial data; BERT doesn’t work off-the-shelf.


BERTScore: Evaluating Text Generation with BERT

BERTScore: computes BERT embeddings of the candidate and gold sequences, gets word-to-word matching scores by taking dot products of embeddings, then greedily matches in both directions. Computes precision and recall from the greedy assignments, then an F-score. Evaluated on single sentences (MT, text simplification); shows stronger human correlation than BLEU (image captioning), although it is still not perfect and the F-score sometimes works a lot worse than the recall-only score.
Used for summarization in 
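The greedy matching itself is a few lines; a minimal sketch, assuming pre-computed unit-norm token embeddings (the real metric also supports IDF weighting and a baseline rescaling, omitted here):

```python
def bertscore(cand_vecs, ref_vecs):
    """Greedy-matching precision/recall/F1 over token embeddings.
    cand_vecs, ref_vecs: lists of unit-norm vectors (lists of floats)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # Recall: each reference token matched to its most similar candidate token.
    recall = sum(max(dot(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # Precision: each candidate token matched to its most similar reference token.
    precision = sum(max(dot(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because the match is a per-token max rather than a one-to-one alignment, several reference tokens can match the same candidate token, which is part of why precision and recall can diverge.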

BERTology and Visualization

On Identifiability in Transformers

Identifiability in attention: combines analysis of kernel space of the projection matrix in the attention head and a gradient-based measure of token importance to provide better insights into model behavior.

Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models
Provides a “non-contextual measure of importance” of a phrase in a model decision by combining methods that look at the prediction difference when the phrase is replaced by padding tokens, and when tokens around the phrase are re-sampled. Computationally expensive and only demonstrated on binary decisions, but shows some good compositionality.

Text Generation

Neural Text Generation With Unlikelihood Training

Neural text degeneration: adds a list of discouraged predictions at each time step; reduces repetition at limited cost to perplexity.
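The token-level objective can be sketched as follows (function name and inputs are illustrative; the paper typically uses previous context tokens as the discouraged set to penalize repetition):

```python
import math

def unlikelihood_loss(probs, target, negatives):
    """Token-level unlikelihood training objective (sketch).
    probs: dict token -> model probability at this step.
    target: gold next token.
    negatives: discouraged tokens, e.g. tokens from the previous context.
    """
    # Standard cross-entropy term on the gold token.
    likelihood = -math.log(probs[target])
    # Push probability of discouraged tokens toward zero via log(1 - p).
    unlikelihood = -sum(math.log(1.0 - probs[t]) for t in negatives)
    return likelihood + unlikelihood
```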

The Curious Case of Neural Text Degeneration

Introduces nucleus / top-p sampling for text generation, with a good analysis of current issues and metrics for text diversity.
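Top-p sampling is easy to state in code: keep the smallest set of most-probable tokens whose cumulative mass reaches p, renormalize, and sample from that set (function names are illustrative):

```python
import random

def top_p_filter(probs, p=0.9):
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalize."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, pr in items:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    z = sum(pr for _, pr in kept)
    return {tok: pr / z for tok, pr in kept}

def nucleus_sample(probs, p=0.9):
    filtered = top_p_filter(probs, p)
    toks, prs = zip(*filtered.items())
    return random.choices(toks, weights=prs)[0]
```

Unlike top-k, the size of the kept set adapts to the shape of the distribution: peaked distributions keep few tokens, flat ones keep many.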

Decoding As Dynamic Programming For Recurrent Autoregressive Models

RNN-specific work. Treats decoding as a constrained optimization problem over both the hidden states and the predicted tokens, and alternates between optimizing over tokens (dynamic programming, complexity O(N×V²)) and hidden states (either closed-form or with an inner loop of gradient descent). Good results on text in-filling with a unidirectional model thanks to the DP.

Residual Energy-Based Models for Text Generation

Combines a causal language model with a natural/artificial discriminator to obtain a globally normalized model. Leads to better perplexity. Can sample from the globally normalized model with importance sampling, using the causal LM as proposal distribution. Costly for generation: generates 10,000 sentence completions at each time step!
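The sampling scheme reduces to resampling LM proposals with weights exp(−E); a minimal sketch (function name and inputs are illustrative, and the real system scores full completions with the learned discriminator energy):

```python
import math
import random

def ebm_resample(lm_sample, energy, n=100):
    """Sample from a residual EBM P(x) ∝ P_LM(x) · exp(-E(x)) by
    importance sampling, with the causal LM as proposal distribution.
    lm_sample: () -> a candidate sequence drawn from the base LM.
    energy: sequence -> float, the discriminator's energy score.
    """
    candidates = [lm_sample() for _ in range(n)]
    # Importance weights: low-energy (more "natural") candidates get more mass.
    weights = [math.exp(-energy(c)) for c in candidates]
    return random.choices(candidates, weights=weights)[0]
```

This makes the generation cost explicit: every emitted sample requires n full proposals from the base LM plus n energy evaluations.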

Other Domains

Residual Energy-Based Models for Text Generation

Used to rate the likelihood of specific protein configurations (Rob Fergus’ group at FAIR/NYU)
