1 of 8

Towards a Unified View of Parameter-Efficient Transfer Learning

TTIC & UChicago NLP Seminar

Chenghao (Alan) Yang

AWS AI

(incoming UChicago PhD student)

2 of 8

Background: Parameter-Efficient Transfer Learning (PET)

  • Full fine-tuning of Pre-trained Language Models (PLMs) often yields strong performance on multiple benchmarks; however:
    • We need a separate copy of the parameters for each task, which is expensive to train and store.
    • Full fine-tuning is not even feasible for large models given limited resources.
  • Therefore, lightweight alternatives are desired, where we only update a small proportion of parameters for each task.
  • Examples include prompt tuning[1], prefix tuning[2], and adapters[3]; a minimal adapter sketch follows below.
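To make the adapter example concrete, here is a minimal PyTorch sketch of a Houlsby-style bottleneck adapter in the spirit of [3]. The class name, dimensions, and activation are illustrative assumptions, not the exact published implementation; the point is that only the small down/up projections are trained while the PLM stays frozen.

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        """Minimal bottleneck adapter: down-project, nonlinearity, up-project,
        with a residual connection. Only these few parameters are trained."""
        def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
            super().__init__()
            self.down = nn.Linear(hidden_dim, bottleneck_dim)
            self.up = nn.Linear(bottleneck_dim, hidden_dim)
            self.act = nn.ReLU()

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # Sequential (Houlsby-style) insertion: the adapter reads the
            # sublayer output h and adds its modification back onto it.
            return h + self.up(self.act(self.down(h)))

    # During PET, the PLM weights stay frozen and only adapter parameters update:
    #     for p in plm.parameters():
    #         p.requires_grad = False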

[1] Brian Lester, Rami Al-Rfou, and Noah Constant. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of EMNLP, 2021.

[2] Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of ACL, 2021.

[3] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-Efficient Transfer Learning for NLP. In Proceedings of ICML, 2019.

3 of 8

Research Questions

  • How are PET methods connected?
  • Which ingredients or design elements are important for PET methods?
  • Can these design elements be transferred across methods to yield more effective variants?

4 of 8

Recap: Transformer Architecture
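For reference, the two sublayers that the PET methods above modify are multi-head attention and the position-wise feed-forward network. This is the standard Transformer formulation (with per-head key dimension d_k), not anything specific to this talk; each sublayer is additionally wrapped in a residual connection and layer normalization.

    \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,
    \qquad
    \mathrm{FFN}(x) = \mathrm{ReLU}(x W_1 + b_1)\, W_2 + b_2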

5 of 8

Different PET Formulations
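As a condensed summary (notation paraphrased from the cited papers; C denotes the context the attention layer attends over, E(x) the input embeddings, and f a nonlinearity):

    \text{Prompt tuning [1]:}\quad [P;\, E(x)] \ \text{(trainable prompt embeddings prepended at the input layer)}
    \text{Prefix tuning [2]:}\quad \mathrm{head} = \mathrm{Attn}\big(x W_q,\ [P_k;\, C W_k],\ [P_v;\, C W_v]\big) \ \text{(trainable prefixes at every attention layer)}
    \text{Adapters [3]:}\quad h \leftarrow h + f(h W_{\text{down}})\, W_{\text{up}} \ \text{(trainable bottleneck modules inserted after sublayers)}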

 

6 of 8

Prefix Tuning as Adapters
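A sketch of the key derivation behind this slide (following the unified-view paper this talk presents, up to notation): attention over the concatenated prefix splits into attention over the original context plus a prefix-only term,

    \mathrm{Attn}\big(x W_q,\ [P_k;\, C W_k],\ [P_v;\, C W_v]\big)
      = (1 - \lambda(x))\, \mathrm{Attn}(x W_q,\ C W_k,\ C W_v)
      + \lambda(x)\, \mathrm{softmax}(x W_q P_k^{\top})\, P_v

where \lambda(x) is the share of attention mass assigned to the prefix positions. With W_1 = W_q P_k^{\top}, W_2 = P_v, and f = softmax, the second term has exactly the adapter form f(x W_1) W_2, so prefix tuning behaves like a gated adapter applied in parallel to the attention sublayer.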

7 of 8

Unified Framework

  • General PET Form (sketched below)

  • Design Dimensions
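A sketch of the general form, paraphrased from the paper: every method computes a modification \Delta h from some input x (either the sublayer input or its output) via a down-projection, a nonlinearity f, and an up-projection, and then composes it with the hidden representation h:

    \Delta h = f(x\, W_{\text{down}})\, W_{\text{up}},
    \qquad
    h \leftarrow h + s \cdot \Delta h

(with s = 1 recovering plain addition, and gated variants also possible). The design dimensions then cover: which representation is modified (attention vs. FFN), the insertion form (sequential vs. parallel), the functional form f, and the composition function (plain, scaled, or gated addition).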

8 of 8

Transferring Design Elements

Detailed experimental results are in Section 4 of the paper, but here are the quick takeaways (a sketch combining them follows after this list):

  1. Parallel adapters outperform sequential adapters.
  2. FFN modifications use extra parameters more effectively than attention modifications, regardless of the functional form or composition function – the FFN learns task-specific textual patterns. Attention does not require large capacity to adapt to new tasks (so it is more economical to modify attention if the parameter budget is tight!).
  3. The scaled composition function outperforms vanilla additive composition while remaining easy to apply.
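Putting the three findings together, here is a minimal PyTorch sketch of a scaled parallel adapter placed at the FFN sublayer. The class name, dimensions, and the fixed scalar are illustrative assumptions rather than the paper's exact configuration.

    import torch
    import torch.nn as nn

    class ScaledParallelFFNAdapter(nn.Module):
        """Parallel adapter at the FFN sublayer with scaled additive composition:
        output = FFN(x) + s * f(x W_down) W_up, with the pretrained FFN frozen."""
        def __init__(self, ffn: nn.Module, hidden_dim: int = 768,
                     bottleneck_dim: int = 512, scale: float = 4.0):
            super().__init__()
            self.ffn = ffn                  # frozen pretrained FFN sublayer
            self.down = nn.Linear(hidden_dim, bottleneck_dim)
            self.up = nn.Linear(bottleneck_dim, hidden_dim)
            self.act = nn.ReLU()
            self.scale = scale              # scaled composition (finding 3)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Parallel insertion (finding 1): the adapter reads the sublayer
            # input x rather than the FFN output, and its result is added on top.
            delta = self.up(self.act(self.down(x)))
            return self.ffn(x) + self.scale * delta   # FFN placement (finding 2)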