1 of 22

AER: Autoregressive Entity Retrieval

The Natural Language Processing Reading Group Reviews:

2 of 22

Housekeeping

  • Anything the mod team needs to mention ahead of time

3 of 22

Reviewer

Adithya

4 of 22

Paper Metadata

5 of 22

Problem Statement

Encyclopedias like Wikipedia are structured around entities. We need to retrieve entities given a query, which makes this a knowledge-intensive task.

Current approaches are classifiers over the entity set and have some disadvantages (a sketch of this baseline appears at the end of this slide):

  • Query and entity interact only through a dot product, so there is no fine-grained interaction between them
  • The memory footprint grows linearly with the number of entities
  • A hard set of negative examples has to be subsampled at training time

Aim of the paper:

  • Capture the relation between context and entity name directly
  • Make the memory footprint scale with the vocabulary size rather than the number of entities
  • Remove the need for subsampling negative data
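
For contrast, here is a minimal sketch of the dense-retrieval baseline criticized above, assuming a bi-encoder with a precomputed entity vector index (sizes and vectors below are made up):

    import numpy as np

    # Illustrative sizes; at Wikipedia scale (~6M entities, 768-dim vectors)
    # the precomputed entity index alone takes tens of gigabytes.
    num_entities, dim = 10_000, 768
    rng = np.random.default_rng(0)

    # One dense vector per entity: memory grows linearly with the entity set.
    entity_index = rng.standard_normal((num_entities, dim)).astype(np.float32)
    query_vec = rng.standard_normal(dim).astype(np.float32)

    # The query interacts with each entity only through a single dot product,
    # so there is no fine-grained (token-level) interaction between them.
    scores = entity_index @ query_vec
    top_k = np.argsort(-scores)[:10]
    print(top_k)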

6 of 22

Terms to know

  • Entity set ε is the set of Wikipedia articles
  • KB - Knowledge Base = Wikipedia
  • Each entity e in ε is assigned a unique sequence of tokens (its Wikipedia article title)
  • Entity disambiguation: the input x is annotated with a mention, and we need to retrieve the entity it refers to based on the context of x
  • Document retrieval: x is a query and entities are documents identified by their unique titles

7 of 22

Formulation

  • Each entity is ranked with an autoregressive score: score(e | x) = p_θ(y | x) = ∏_t p_θ(y_t | y_<t, x), where y is the token sequence of the entity name
    • Trained by maximizing the log-likelihood of the target entity name
    • No negative sampling is needed to approximate the loss normalizer
  • To avoid expensive scoring of every element in the entity set, they use beam search: the top-k entities are found with k beams
    • Decoding is constrained to the set of valid entity identifiers using a trie (prefix tree), as sketched below
    • Because only beam search is used, the time cost is independent of the entity set size; entity names average ~6 tokens and the beam width is 10
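
A minimal sketch of the prefix trie and the "allowed next tokens" lookup used to constrain beam search; the token ids and the EOS id below are made up, and the real implementation plugs an equivalent lookup into the seq2seq decoder's beam search:

    # Build a trie over the token sequences of all entity names.
    EOS = 0  # hypothetical end-of-sequence token id

    def build_trie(entity_token_ids):
        """entity_token_ids: iterable of token-id lists, one per entity name."""
        trie = {}
        for ids in entity_token_ids:
            node = trie
            for tok in ids + [EOS]:
                node = node.setdefault(tok, {})
        return trie

    def allowed_next_tokens(trie, prefix):
        """Tokens that may legally follow `prefix`, so that decoding stays
        inside the set of valid entity identifiers."""
        node = trie
        for tok in prefix:
            if tok not in node:
                return []          # prefix is not part of any entity name
            node = node[tok]
        return list(node.keys())

    # Example with made-up token ids for three entity names.
    trie = build_trie([[5, 7], [5, 8, 2], [9, 3]])
    print(allowed_next_tokens(trie, []))      # [5, 9]  -> valid first tokens
    print(allowed_next_tokens(trie, [5]))     # [7, 8]
    print(allowed_next_tokens(trie, [5, 7]))  # [0]     -> only EOS: "5 7" is a complete name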

8 of 22

Formulation

  • End-to-end entity linking - detect entity mentions and link them to KB entities
    • Span boundaries are annotated with special tokens in the generated output
    • Generation follows this diagram (a sketch of parsing the markup follows below)
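
A hedged illustration of the annotated output format: mention spans wrapped in bracket tokens with the linked entity title in parentheses. The exact delimiter tokens are an assumption based on the paper's examples:

    import re

    # Hypothetical generated output for end-to-end entity linking.
    generated = "In 1503, [ Leonardo ] ( Leonardo da Vinci ) began painting the Mona Lisa."

    # Recover (mention, entity) pairs from the generated markup.
    pairs = re.findall(r"\[ (.*?) \] \( (.*?) \)", generated)
    print(pairs)  # [('Leonardo', 'Leonardo da Vinci')]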

9 of 22

Types of Tasks

10 of 22

Results

11 of 22

🏺 Archaeologist

Adithya

12 of 22

Prior Paper 3

Title: CHOLAN: A Modular Approach for Neural Entity Linking on Wikipedia and Wikidata

Authors: Manoj Prabhakar Kannan Ravi, Kuldeep Singh, Isaiah Onando Mulang, Saeedeh Shekarpour, Johannes Hoffart, and Jens Lehmann

Publication Venue: European Chapter of the Association for Computational Linguistics

Date: April, 2021

Publication: https://aclanthology.org/2021.eacl-main.40.pdf

13 of 22

Approach

Three step approach for end-to-end neural entity linking:

  1. Mention detection using BERT (mentions are detected automatically rather than given as input)
  2. Candidate generation
  3. Entity disambiguation

14 of 22

Approach - Stage 1 and 2

  • A logistic-regression-based classifier on top of fine-tuned BERT output handles the mention detection stage. Similar to the regression stage in the main paper.
  • For the candidate generation stage, they use:
    • DCA candidates: prior probabilities for the candidate entities of each mention are calculated. In the probabilistic entity map, each entity mention has 30 potential entity candidates. DCA also provides the associated Wikipedia description of each entity
    • Falcon candidates: a local index of KG items built from Wikidata entities and expanded with entity aliases. The local KG index is used to generate entity candidates for each entity mention in the employed datasets. The index is queried with the BM25 algorithm and results are ranked by the BM25 score; 30 candidates are generated per mention (a small retrieval sketch follows below)
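
A rough sketch of Falcon-style candidate generation with BM25 over a local label index, using the rank_bm25 package; the labels and the mention are made up:

    from rank_bm25 import BM25Okapi

    # Local index of KB labels expanded with aliases (toy example).
    kb_labels = [
        "Thor (film)",
        "Thor (Marvel Comics)",
        "Thor Heyerdahl",
        "Asgard (comics)",
    ]
    tokenized = [label.lower().split() for label in kb_labels]
    bm25 = BM25Okapi(tokenized)

    # Query the index per mention and keep the top 30 hits as candidates
    # (fewer here, since the toy index only has 4 labels).
    mention = "thor"
    candidates = bm25.get_top_n(mention.split(), kb_labels, n=30)
    print(candidates)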

15 of 22

Approach - Stage 3

  • Token embedding: the embedding of the corresponding token. The entity-mention tokens are prepended to the local context sequence S1 and separated from the sentence context tokens by a single vertical-bar token |; likewise, for the entity context sequence S2, the entity title tokens from the KB are prepended before the description
  • Segment embedding: each of the two sequences receives a single representation; ELC => local context (S1), EEC => extended context (S2)
  • Position embedding: represents the position of the token
  • Negative sampling is used to turn disambiguation into a binary classification task over (mention context, candidate entity) pairs (a sketch of this pairing follows below)
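
A sketch of the stage-3 setup, using Hugging Face transformers with an off-the-shelf (not CHOLAN's fine-tuned) BERT checkpoint; the mention, sentence and candidate strings are illustrative:

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    # Hypothetical mention in its sentence context and one candidate entity with
    # its KB description; each of the 30 candidates would be scored this way and
    # the top-scoring one selected.
    mention = "Thor"
    sentence = "The first Thor was all about introducing Asgard."
    candidate_title = "Thor (film)"
    candidate_desc = "2011 superhero film based on the Marvel Comics character."

    # Local context S1: mention tokens prepended, separated by a vertical bar.
    s1 = f"{mention} | {sentence}"
    # Extended context S2: entity title prepended to its KB description.
    s2 = f"{candidate_title} {candidate_desc}"

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)   # binary: candidate matches or not

    # Encoding S1 and S2 as a sentence pair supplies the segment embeddings;
    # token and position embeddings are added by the model itself.
    inputs = tokenizer(s1, s2, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    print(logits.softmax(-1))                # [no-match, match] probabilities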

16 of 22

🏺 Archaeologist

Aiswarya

17 of 22

  • Background
    • End-to-end entity linking systems consist of 3 steps:
      • mention detection
      • candidate generation
      • entity disambiguation

    • The paper investigates the following:
      • Can all of those steps be learned jointly with a model for contextualized text representations?
      • How much entity knowledge is already contained in pretrained BERT?
      • Does additional entity knowledge improve BERT’s performance in downstream tasks?

18 of 22

Motivation and Model

  • The goal of entity linking is - given a knowledge base (KB) and unstructured data, detect mentions of the KB’s entities in the unstructured data and link them to the correct KB entry

  • The entity linking task is generally defined through the following steps

    • mention detection (MD) - text spans of potential entity mentions are identified
    • candidate generation (CG) - entity candidates for each mention are retrieved from the KB
    • entity disambiguation (ED) - a mix of useful coreference and coherence features together with a classifier determines the entity link

19 of 22

Motivation and Model

  • Can BERT’s architecture learn all entity linking steps jointly?
    • Per-token classification over the entire entity vocabulary, thus solving MD, CG and ED simultaneously (see the sketch after this list)
    • The entity vocabulary is based on the 700K most frequent entities in English Wikipedia texts
    • This worked surprisingly well for entity linking even without any supervision on mention spans
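
A minimal sketch of the joint formulation, assuming a BERT-style encoder is given; the layer names and the extra "no entity" class are my assumptions about how the classification head is wired:

    import torch.nn as nn

    class PerTokenEntityLinker(nn.Module):
        """Per-token classification over the entity vocabulary on top of a
        BERT-style encoder, covering MD, CG and ED in a single prediction."""

        def __init__(self, encoder: nn.Module, hidden_size: int, num_entities: int):
            super().__init__()
            self.encoder = encoder                        # e.g. a pretrained BERT
            # One extra class for tokens that are not part of any mention.
            self.classifier = nn.Linear(hidden_size, num_entities + 1)

        def forward(self, input_ids, attention_mask):
            hidden = self.encoder(input_ids, attention_mask=attention_mask)[0]
            # (batch, seq_len, |entity vocab| + 1): one entity distribution per token.
            return self.classifier(hidden)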

20 of 22

Training Data

  • The entity vocabulary and training data are derived from English Wikipedia texts
  • WikiExtractor is used to extract the text spans associated with an internal Wikipedia link and use them as annotations
    • The first Thor was all about introducing Asgard
      • The text span “Thor” links to the Wikipedia article for Thor
    • BERT is originally trained with sentences; for entity linking, however, a larger context can help disambiguate entity mentions, which is why text fragments long enough to span multiple sentences are selected
    • We collect (m, e) tuples of entities e and their mentions m
    • This yields a set M of potentially linkable strings and also lets us compute the conditional probability p(e|m) from the #(m, e) counts (see the sketch after this list)
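
A small sketch of how p(e|m) can be estimated from the collected (m, e) link counts; the example pairs are made up:

    from collections import Counter, defaultdict

    # (mention, entity) pairs harvested from Wikipedia link anchors (made up here).
    pairs = [
        ("Thor", "Thor (film)"),
        ("Thor", "Thor (film)"),
        ("Thor", "Thor (Marvel Comics)"),
        ("Asgard", "Asgard (comics)"),
    ]

    pair_counts = Counter(pairs)                    # #(m, e)
    mention_counts = Counter(m for m, _ in pairs)   # #(m)

    p_e_given_m = defaultdict(dict)
    for (m, e), c in pair_counts.items():
        p_e_given_m[m][e] = c / mention_counts[m]

    print(p_e_given_m["Thor"])   # {'Thor (film)': 0.67, 'Thor (Marvel Comics)': 0.33} (approx.)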

21 of 22

Experiments

  • Data
    • Wikipedia
      • Keeps the 700K most frequent entities out of the ~6M entities in Wikipedia
    • Training
      • Uses multi-class classification over the entity vocabulary - the label vector y for a token v_i is the entity that token is annotated with (or a no-entity class if the token is not part of a mention)

  • Computing the loss over the whole entity vocabulary is infeasible because the entity vocabulary is very large
  • Negative sampling is used to improve memory efficiency and increase convergence speed (a sketch follows below)
  • After sampling the text fragments for a batch b, the set N⁺_b of all true entities that occurred in those text fragments is collected as the positives
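
A rough sketch of this batch-level shortcut, assuming negatives are drawn uniformly at random from the entity vocabulary (the sampling scheme and sizes here are illustrative):

    import random

    entity_vocab = [f"entity_{i}" for i in range(700)]   # stand-in for the ~700K entities

    def batch_label_space(batch_true_entities, num_negatives=64, seed=0):
        """Positives N+_b are all true entities in the batch's text fragments;
        the loss is computed over these plus sampled negatives instead of the
        full entity vocabulary."""
        rng = random.Random(seed)
        positives = set(batch_true_entities)
        negatives = set()
        while len(negatives) < num_negatives:
            candidate = rng.choice(entity_vocab)
            if candidate not in positives:
                negatives.add(candidate)
        return sorted(positives) + sorted(negatives)

    labels = batch_label_space(["entity_3", "entity_42", "entity_3"])
    print(len(labels))   # |N+_b| + 64 = 66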

22 of 22

Results