1 of 21

On Leveraging Encoder-only Pre-trained LMs for Effective Keyphrase Generation

Di Wu, Wasi Uddin Ahmad, Kai-Wei Chang

Department of Computer Science

UCLA

2 of 21

Motivation

  • Keyphrase extraction (KPE) and keyphrase generation (KPG) are important NLP tasks that can benefit
    • Information retrieval
    • Text summarization
    • Clustering
    • Classification

3 of 21

Motivation

  • Pre-trained language models (LMs) have revolutionized the space of KPE and KPG research, showing promising results in
    • Unsupervised KPE [1, 2]
    • Zero-shot or low-resource KPG [3, 4]
    • Cross-lingual and cross-domain KPG [5, 6]

4 of 21

Motivation

  • A dilemma arises for KPG:
    • Strong empirical support for using in-domain LMs [6, 7].
    • Strong inductive bias for using sequence-to-sequence LMs.
    • Yet in-domain versions of these pre-trained LMs are very limited.

5 of 21

Motivation

  • Can we leverage BERT-like domain experts for KPG instead?
    • much more widely available
    • yet much less investigated

  • In this paper, we aim to provide a thorough investigation.

6 of 21

Evaluation Setup

  • We evaluate on two standard KPG benchmark suites
    • Scientific KPG [8]: KP20k, Inspec, Krapivin, NUS, and SemEval
    • News KPG [9]: KPTimes

  • We compare with well-established KPG methods
    • CatSeq [10]
    • ExHiRD-h [11]
    • Transformer [12]
    • SetTrans [12]

7 of 21

Evaluation Setup

  • Considered LMs
    • We consider both encoder-only and encoder-decoder pre-trained LMs.
    • We consider in-domain, cross-domain, and general domain LMs.
    • Inspired by [7], we train our own news domain LMs.

                     General Domain   Science        News
    Encoder-Only     BERT             SciBERT [13]   NewsBERT
    Encoder-Decoder  BART             SciBART [7]    NewsBART

8 of 21

Modeling

  • How to effectively leverage encoder-only models?
  • We consider three approaches
    • Sequence labeling for KPE
    • BERT2BERT for KPG
    • Prefix-LM for KPG

9 of 21

Modeling

  • How to effectively leverage encoder-only models?
  • Approach 1: sequence labeling for KPE
    • We use a token-level classification formulation.
    • We consider the use of conditional random fields (CRF).
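The token-level classification formulation can be illustrated with BIO tagging, where each token is labeled as beginning ("B"), inside ("I"), or outside ("O") a keyphrase. A minimal sketch of decoding such tags into keyphrases (the helper below is illustrative, not the exact implementation from the paper):

```python
def extract_keyphrases(tokens, tags):
    """Recover keyphrase spans from per-token BIO tags.

    "B" starts a keyphrase, "I" continues one, "O" is outside any phrase.
    """
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                  # a new keyphrase begins
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:    # continue the open keyphrase
            current.append(token)
        else:                           # "O" (or a stray "I") closes any open span
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tokens = ["deep", "keyphrase", "generation", "with", "pretrained", "models"]
tags   = ["B",    "I",         "I",          "O",    "B",          "I"]
print(extract_keyphrases(tokens, tags))
# ['deep keyphrase generation', 'pretrained models']
```

A CRF layer on top of the per-token logits would additionally model transitions between tags (e.g., disallowing "I" after "O") instead of classifying each token independently.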

10 of 21

Modeling

  • How to effectively leverage encoder-only models?
  • Approach 2: BERT2BERT
    • We initialize a sequence-to-sequence model using various-sized BERT checkpoints distributed by [14].
    • We investigate two variations
      • Varying the depth of the encoder and the decoder.
      • Randomly initializing the encoder or the decoder.
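In Hugging Face transformers, such a pairing can be sketched with EncoderDecoderModel. The tiny dimensions below are illustrative only, and the model here is randomly initialized; the paper instead loads the compact BERT checkpoints of [14]:

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Illustrative tiny dimensions; encoder and decoder depths can differ.
enc_cfg = BertConfig(hidden_size=128, num_hidden_layers=4,
                     num_attention_heads=4, intermediate_size=256)
dec_cfg = BertConfig(hidden_size=128, num_hidden_layers=2,
                     num_attention_heads=4, intermediate_size=256,
                     is_decoder=True, add_cross_attention=True)

config = EncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
model = EncoderDecoderModel(config=config)  # random init in this sketch
```

To start from pre-trained weights on one or both sides, `EncoderDecoderModel.from_encoder_decoder_pretrained(...)` accepts separate encoder and decoder checkpoint names, which also covers the random-encoder or random-decoder ablation.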

11 of 21

Modeling

  • How to effectively leverage encoder-only models?
  • Approach 3: Prefix-LM
    • We follow [15] to fine-tune encoder-only LMs using a sequence-to-sequence attention mask pattern.
    • The loss is based on masking and recovering the target sequence.
    • During inference, autoregressive decoding is used.
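The key ingredient is the attention mask: source (prefix) tokens attend bidirectionally within the source, while target tokens attend to the full source plus earlier target tokens. A minimal sketch of building this mask (plain Python for clarity; in practice it would be a tensor passed to the model):

```python
def prefix_lm_mask(src_len, tgt_len):
    """Sequence-to-sequence attention mask for prefix-LM fine-tuning.

    Rows are query positions, columns are key positions; 1 = may attend.
    """
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < src_len:
                mask[q][k] = 1 if k < src_len else 0  # prefix: bidirectional, source only
            else:
                mask[q][k] = 1 if k <= q else 0       # target: causal over all earlier tokens
    return mask

for row in prefix_lm_mask(3, 2):
    print(row)
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 1, 0]
# [1, 1, 1, 1, 1]
```

With this mask, masking and recovering target tokens trains the model to generate conditioned on the source, and autoregressive decoding at inference extends the target one token at a time.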

12 of 21

Results

  • Our experiments address the following research questions
    • Is the KPG formulation less suitable for encoder-only pre-trained LMs compared to KPE?
    • Can encoder-only pre-trained LMs generate better keyphrases than encoder-decoder pre-trained LMs?
    • What is the best parameter allocation strategy for using encoder-decoder pre-trained LMs to balance the performance and computational cost?

13 of 21

Results: KPE vs. KPG

  • CRF improves KPE performance.
  • KPG with prefix-LM can achieve the same level of present keyphrase performance as KPE methods.

14 of 21

Results: BERT for KPG – prefix-LM

  • The prefix-LM approach achieves strong performance.
  • With in-domain BERT models, prefix-LM outperforms general-domain BART.

15 of 21

Results: BERT for KPG – prefix-LM

  • Compared to various non-pre-trained baselines, prefix-LM is far more data-efficient.

  • Prefix-LM trained on 2k examples performs on par with SetTrans trained on 100k.

16 of 21

Results: BERT for KPG – BERT2BERT

  • We also examine the key design factors behind the strong performance of BERT2BERT.

  • Our first observation is that depth should be prioritized over width in parameter allocation.

17 of 21

Results: BERT for KPG – BERT2BERT

  • Finally, we investigate how to allocate the layer budget across the encoder and the decoder.

  • The deep-encoder-shallow-decoder approach improves over the other parameter allocation strategies.
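A back-of-the-envelope parameter count illustrates one side effect of this split (a simplified sketch assuming BERT-base-like dimensions; embeddings and final norms omitted):

```python
def layer_params(d_model=768, d_ff=3072, cross_attention=False):
    """Approximate parameter count of one Transformer layer."""
    attn = 4 * d_model * d_model + 4 * d_model   # Q/K/V/output projections
    ffn = 2 * d_model * d_ff + d_ff + d_model    # two feed-forward maps
    norms = 2 * 2 * d_model                      # two LayerNorms
    if cross_attention:                          # decoder layers also attend to the encoder
        attn *= 2
        norms += 2 * d_model
    return attn + ffn + norms

def seq2seq_params(enc_layers, dec_layers):
    return (enc_layers * layer_params()
            + dec_layers * layer_params(cross_attention=True))

# Same 12-layer budget, different splits: the deep-encoder-shallow-decoder
# split is also slightly smaller, since decoder layers carry cross-attention.
print(seq2seq_params(10, 2), seq2seq_params(6, 6))
```

So under a fixed layer budget, shifting layers from the decoder to the encoder trades expensive decoder layers for cheaper encoder layers.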

18 of 21

Results: BERT for KPG – BERT2BERT

  • The deep-encoder-shallow-decoder approach also achieves much lower inference latency than the other designs.
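The latency gap follows from autoregressive decoding. A crude cost proxy (illustrative only, not the paper's measurements; it counts sequential layer passes, assuming encoder states are computed once and cached):

```python
def sequential_layer_passes(enc_layers, dec_layers, tgt_len=32):
    """The encoder runs once over the source, but the decoder runs once
    per generated token, so decoder depth dominates latency."""
    return enc_layers + dec_layers * tgt_len

print(sequential_layer_passes(10, 2))  # deep encoder, shallow decoder: 74
print(sequential_layer_passes(6, 6))   # balanced split: 198
```

Even with more encoder layers, the shallow decoder pays its cost only once per source document rather than once per generated token.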

19 of 21

Summary

  • We unveil the potential of encoder-only pre-trained LMs for KPG.
  • Two formulations are introduced and systematically compared
    • Prefix-LM
    • BERT2BERT
  • Experiments show their strong performance, resource efficiency, and competitive inference latency.
  • We also analyze their key design choices in depth.

20 of 21

Thank you for listening!

Paper: https://arxiv.org/abs/2402.14052

Code and data: https://github.com/uclanlp/DeepKPG/

Models:

21 of 21

References

[1] Unsupervised Keyphrase Extraction by Jointly Modeling Local and Global Context (Liang et al., EMNLP 2021)

[2] PromptRank: Unsupervised Keyphrase Extraction Using Prompt (Kong et al., ACL 2023)

[3] Learning Rich Representation of Keyphrases from Text (Kulkarni et al., Findings 2022)

[4] Representation Learning for Resource-Constrained Keyphrase Generation (Wu et al., Findings 2022)

[5] Retrieval-Augmented Multilingual Keyphrase Generation with Retriever-Generator Iterative Training (Gao et al., Findings 2022)

[6] General-to-Specific Transfer Labeling for Domain Adaptable Keyphrase Generation (Meng et al., Findings 2023)

[7] Rethinking Model Selection and Decoding for Keyphrase Generation with Pre-trained Sequence-to-Sequence Models (Wu et al., EMNLP 2023)

[8] Deep Keyphrase Generation (Meng et al., ACL 2017)

[9] KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents (Gallina et al., INLG 2019)

[10] One Size Does Not Fit All: Generating and Evaluating Variable Number of Keyphrases (Yuan et al., ACL 2020)

[11] Exclusive Hierarchical Decoding for Deep Keyphrase Generation (Chen et al., ACL 2020)

[12] One2Set: Generating Diverse Keyphrases as a Set (Ye et al., ACL-IJCNLP 2021)

[13] SciBERT: A Pretrained Language Model for Scientific Text (Beltagy et al., EMNLP-IJCNLP 2019)

[14] Well-read students learn better: On the importance of pre-training compact models (Turc et al., 2019)

[15] Unified Language Model Pre-training for Natural Language Understanding and Generation (Dong et al., NeurIPS 2019)