1 of 21

On Leveraging Encoder-only Pre-trained LMs for Effective Keyphrase Generation

Di Wu, Wasi Uddin Ahmad, Kai-Wei Chang

Department of Computer Science

UCLA

2 of 21

Motivation

  • Keyphrase extraction (KPE) and keyphrase generation (KPG) are important NLP tasks that can benefit
    • Information retrieval
    • Text summarization
    • Clustering
    • Classification

3 of 21

Motivation

  • Pre-trained language models (LMs) have revolutionized the space of KPE and KPG research, showing promising results in
    • Unsupervised KPE [1, 2]
    • Zero-shot or low-resource KPG [3, 4]
    • Cross-lingual and cross-domain KPG [5, 6]

4 of 21

Motivation

  • A dilemma arises for KPG:
    • Strong empirical support for using in-domain LMs [6, 7].
    • Strong inductive bias for using sequence-to-sequence LMs.
    • Yet in-domain versions of these pre-trained LMs are very limited.

5 of 21

Motivation

  • Can we leverage BERT-like domain experts for KPG instead?
    • much more widely available
    • yet much less investigated

  • In this paper, we aim to provide a thorough investigation.

6 of 21

Evaluation Setup

  • We evaluate on two standard KPG benchmark suites
    • Scientific KPG [8]: KP20k, Inspec, Krapivin, NUS, and SemEval
    • News KPG [9]: KPTimes

  • We compare with well-established KPG methods
    • CatSeq [10]
    • ExHiRD-h [11]
    • Transformer [12]
    • SetTrans [12]

7 of 21

Evaluation Setup

  • Considered LMs
    • We consider both encoder-only and encoder-decoder pre-trained LMs.
    • We consider in-domain, cross-domain, and general domain LMs.
    • Inspired by [7], we train our own news domain LMs.

                     General Domain   Science        News
    Encoder-Only     BERT             SciBERT [13]   NewsBERT
    Encoder-Decoder  BART             SciBART [7]    NewsBART

8 of 21

Modeling

  • How to effectively leverage encoder-only models?
  • We consider three approaches
    • Sequence labeling for KPE
    • BERT2BERT for KPG
    • Prefix-LM for KPG

9 of 21

Modeling

  • How to effectively leverage encoder-only models?
  • Approach 1: sequence labeling for KPE
    • We use a token-level classification formulation.
    • We consider the use of conditional random fields (CRF).
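The token-level classification formulation can be illustrated with BIO tagging, where each token is labeled as beginning ("B"), inside ("I"), or outside ("O") a keyphrase. A minimal sketch of decoding such tags into keyphrases (the helper below is illustrative, not the exact implementation from the paper):

```python
def extract_keyphrases(tokens, tags):
    """Recover keyphrase spans from per-token BIO tags.

    "B" starts a keyphrase, "I" continues one, "O" is outside any phrase.
    """
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                  # a new keyphrase begins
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:    # continue the open keyphrase
            current.append(token)
        else:                           # "O" (or a stray "I") closes any open span
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tokens = ["deep", "keyphrase", "generation", "with", "pretrained", "models"]
tags   = ["B",    "I",         "I",          "O",    "B",          "I"]
print(extract_keyphrases(tokens, tags))
# ['deep keyphrase generation', 'pretrained models']
```

A CRF layer on top of the per-token logits would additionally model transitions between tags (e.g., disallowing "I" after "O") instead of classifying each token independently.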

10 of 21

Modeling

  • How to effectively leverage encoder-only models?
  • Approach 2: BERT2BERT
    • We initialize a sequence-to-sequence model using various-sized BERT checkpoints distributed by [14].
    • We investigate two variations
      • Varying the depth of the encoder and the decoder.
      • Randomly initializing the encoder or the decoder.
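In Hugging Face transformers, such a pairing can be sketched with EncoderDecoderModel. The tiny dimensions below are illustrative only, and the model here is randomly initialized; the paper instead loads the compact BERT checkpoints of [14]:

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Illustrative tiny dimensions; encoder and decoder depths can differ.
enc_cfg = BertConfig(hidden_size=128, num_hidden_layers=4,
                     num_attention_heads=4, intermediate_size=256)
dec_cfg = BertConfig(hidden_size=128, num_hidden_layers=2,
                     num_attention_heads=4, intermediate_size=256,
                     is_decoder=True, add_cross_attention=True)

config = EncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
model = EncoderDecoderModel(config=config)  # random init in this sketch
```

To start from pre-trained weights on one or both sides, `EncoderDecoderModel.from_encoder_decoder_pretrained(...)` accepts separate encoder and decoder checkpoint names, which also covers the random-encoder or random-decoder ablation.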

11 of 21

Modeling

  • How to effectively leverage encoder-only models?
  • Approach 3: Prefix-LM
    • We follow [15] to fine-tune encoder-only LMs using a sequence-to-sequence attention mask pattern.
    • The loss is based on masking and recovering the target sequence.
    • During inference, autoregressive decoding is used.
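The key ingredient is the attention mask: source (prefix) tokens attend bidirectionally within the source, while target tokens attend to the full source plus earlier target tokens. A minimal sketch of building this mask (plain Python for clarity; in practice it would be a tensor passed to the model):

```python
def prefix_lm_mask(src_len, tgt_len):
    """Sequence-to-sequence attention mask for prefix-LM fine-tuning.

    Rows are query positions, columns are key positions; 1 = may attend.
    """
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < src_len:
                mask[q][k] = 1 if k < src_len else 0  # prefix: bidirectional, source only
            else:
                mask[q][k] = 1 if k <= q else 0       # target: causal over all earlier tokens
    return mask

for row in prefix_lm_mask(3, 2):
    print(row)
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 1, 0]
# [1, 1, 1, 1, 1]
```

With this mask, masking and recovering target tokens trains the model to generate conditioned on the source, and autoregressive decoding at inference extends the target one token at a time.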

12 of 21

Results

  • Our experiments address the following research questions
    • Is the KPG formulation less suitable for encoder-only pre-trained LMs compared to KPE?
    • Can encoder-only pre-trained LMs generate better keyphrases than encoder-decoder pre-trained LMs?
    • What is the best parameter allocation strategy for using encoder-decoder pre-trained LMs to balance the performance and computational cost?

13 of 21

Results: KPE vs. KPG

  • CRF improves KPE performance.
  • KPG with prefix-LM can achieve the same level of present keyphrase performance as KPE methods.

14 of 21

Results: BERT for KPG – prefix-LM

  • The prefix-LM approach achieves strong performance.
  • With in-domain BERT models, prefix-LM outperforms general-domain BART.

15 of 21

Results: BERT for KPG – prefix-LM

  • Compared to various non-pre-trained baselines, prefix-LM is far more data-efficient.

  • Prefix-LM trained on 2k examples performs on par with SetTrans trained on 100k.

16 of 21

Results: BERT for KPG – BERT2BERT

  • We also examine the key design factors behind the strong performance of BERT2BERT.

  • Our first observation is that depth should be prioritized over width in parameter allocation.

17 of 21

Results: BERT for KPG – BERT2BERT

  • Finally, we investigate how to allocate the layer budget across the encoder and the decoder.

  • The deep-encoder-shallow-decoder approach improves over the other parameter allocation strategies.
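A back-of-the-envelope parameter count illustrates one side effect of this split (a simplified sketch assuming BERT-base-like dimensions; embeddings and final norms omitted):

```python
def layer_params(d_model=768, d_ff=3072, cross_attention=False):
    """Approximate parameter count of one Transformer layer."""
    attn = 4 * d_model * d_model + 4 * d_model   # Q/K/V/output projections
    ffn = 2 * d_model * d_ff + d_ff + d_model    # two feed-forward maps
    norms = 2 * 2 * d_model                      # two LayerNorms
    if cross_attention:                          # decoder layers also attend to the encoder
        attn *= 2
        norms += 2 * d_model
    return attn + ffn + norms

def seq2seq_params(enc_layers, dec_layers):
    return (enc_layers * layer_params()
            + dec_layers * layer_params(cross_attention=True))

# Same 12-layer budget, different splits: the deep-encoder-shallow-decoder
# split is also slightly smaller, since decoder layers carry cross-attention.
print(seq2seq_params(10, 2), seq2seq_params(6, 6))
```

So under a fixed layer budget, shifting layers from the decoder to the encoder trades expensive decoder layers for cheaper encoder layers.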

18 of 21

Results: BERT for KPG – BERT2BERT

  • The deep-encoder-shallow-decoder approach also achieves much lower inference latency than the other designs.
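The latency gap follows from autoregressive decoding. A crude cost proxy (illustrative only, not the paper's measurements; it counts sequential layer passes, assuming encoder states are computed once and cached):

```python
def sequential_layer_passes(enc_layers, dec_layers, tgt_len=32):
    """The encoder runs once over the source, but the decoder runs once
    per generated token, so decoder depth dominates latency."""
    return enc_layers + dec_layers * tgt_len

print(sequential_layer_passes(10, 2))  # deep encoder, shallow decoder: 74
print(sequential_layer_passes(6, 6))   # balanced split: 198
```

Even with more encoder layers, the shallow decoder pays its cost only once per source document rather than once per generated token.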

19 of 21

Summary

  • We unveil the potential of encoder-only pre-trained LMs for KPG.
  • Two formulations are introduced and systematically compared
    • Prefix-LM
    • BERT2BERT
  • Experiments show their strong performance, resource efficiency, and competitive inference latency.
  • We also analyze their key design choices in depth.

20 of 21

Thank you for listening!

Paper: https://arxiv.org/abs/2402.14052

Code and data: https://github.com/uclanlp/DeepKPG/

Models:

21 of 21

References

[1] Unsupervised Keyphrase Extraction by Jointly Modeling Local and Global Context (Liang et al., EMNLP 2021)

[2] PromptRank: Unsupervised Keyphrase Extraction Using Prompt (Kong et al., ACL 2023)

[3] Learning Rich Representation of Keyphrases from Text (Kulkarni et al., Findings 2022)

[4] Representation Learning for Resource-Constrained Keyphrase Generation (Wu et al., Findings 2022)

[5] Retrieval-Augmented Multilingual Keyphrase Generation with Retriever-Generator Iterative Training (Gao et al., Findings 2022)

[6] General-to-Specific Transfer Labeling for Domain Adaptable Keyphrase Generation (Meng et al., Findings 2023)

[7] Rethinking Model Selection and Decoding for Keyphrase Generation with Pre-trained Sequence-to-Sequence Models (Wu et al., EMNLP 2023)

[8] Deep Keyphrase Generation (Meng et al., ACL 2017)

[9] KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents (Gallina et al., INLG 2019)

[10] One Size Does Not Fit All: Generating and Evaluating Variable Number of Keyphrases (Yuan et al., ACL 2020)

[11] Exclusive Hierarchical Decoding for Deep Keyphrase Generation (Chen et al., ACL 2020)

[12] One2Set: Generating Diverse Keyphrases as a Set (Ye et al., ACL-IJCNLP 2021)

[13] SciBERT: A Pretrained Language Model for Scientific Text (Beltagy et al., EMNLP-IJCNLP 2019)

[14] Well-read students learn better: On the importance of pre-training compact models (Turc et al., 2019)

[15] Unified Language Model Pre-training for Natural Language Understanding and Generation (Dong et al., NeurIPS 2019)