1 of 18

LLM + External Tools

Chenghao Yang

University of Chicago

2 of 18

Research Questions (before we start)

  • Should a language model handle all tasks on its own?
    • From a theoretical perspective? (e.g., computability)
    • From an application perspective? (e.g., efficiency)
  • If we want to delegate/decouple certain functionalities from the LLM to specialized software, how should we decide which ones?
  • How should we teach an LLM to work with external tools?
  • What is the ideal implementation form for “future AI” / “AGI”: an elegant single model or a complex system?

3 of 18

WebGPT: Browser-assisted question-answering with human feedback

https://twitter.com/samcharrington/status/1610365059540066304

Webpage references help improve truthfulness!

Browsing webpages does provide up-to-date information!

[Screenshots: one cited page last updated a week ago, another last updated two months ago]

4 of 18

WebGPT: High-Level Design

  • GPT-3 Model interacts with a text-based Web-browsing environment
    • Improve retrieval and synthesis end-to-end, using RL/Imitation Learning
  • GPT-3 Model generates answers with references
    • Crucial for allowing labelers to judge the factual accuracy of answers
  • Using human feedback to directly optimize answer quality
    • Collect human demonstrations on web-browsing
    • Collect comparison judgments between model-generated responses (factuality, coherence, usefulness)
    • Use the human-provided data for: 1) behavior cloning (SFT); 2) reward model training; 3) RL against the reward model; 4) rejection sampling against the reward model

5 of 18

WebGPT: Environment Design

  • Search Engine: Microsoft Bing Web Search API
  • Simplify Web Pages via Readability.js @ Mozilla
  • Use html2text, pdfminer.six, and similar tools to convert rich text to plain text, and convert images to [Image: <alt text>]
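As a rough illustration of this conversion step, here is a stdlib-only sketch (our own toy code, not WebGPT's actual pipeline) that strips markup and replaces images with `[Image: <alt text>]`:

```python
# Toy page-to-text conversion in the spirit of WebGPT's environment preprocessing.
from html.parser import HTMLParser

class PageToText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Replace each image with a textual placeholder built from its alt text.
        if tag == "img":
            alt = dict(attrs).get("alt", "")
            self.parts.append(f"[Image: {alt}]")

    def handle_data(self, data):
        # Keep visible text, dropping whitespace-only runs between tags.
        if data.strip():
            self.parts.append(data.strip())

def page_to_text(html):
    parser = PageToText()
    parser.feed(html)
    return " ".join(parser.parts)

print(page_to_text('<p>Hello <img src="x.png" alt="a cat"> world</p>'))
# → Hello [Image: a cat] world
```

The real environment additionally uses Readability.js to discard navigation chrome before this kind of flattening.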

6 of 18

WebGPT: Environment Interface

7 of 18

WebGPT: Data Collection

  • The vast majority of questions come from the ELI5 (“Explain Like I’m Five”) dataset.
  • For diversity, some other sources are also mixed in:
    • TriviaQA, ARC
    • Hand-written
    • ELI5 fact-check (using InstructGPT to generate answers for ELI5 questions)
  • In total, collected 6,000 demonstrations and 21,500 comparisons.
  • 92% of demonstrations and 98% of comparisons come from ELI5.

Fan, A., Jernite, Y., Perez, E., Grangier, D., Weston, J., & Auli, M. (2019, July). ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3558-3567).

8 of 18

WebGPT: Training

  • Behavior Cloning (BC): Supervised fine-tuning on the demonstration data.
  • Reward Modeling (RM): Starting from the BC model, remove the final layer and train it to output a scalar reward, learning human preferences from the comparison data.
  • Reinforcement Learning (RL): Starting from the BC model, fine-tune in the environment using PPO.
  • Rejection Sampling (best-of-n): Sample 4/16/64 outputs from the BC/RL model, re-rank with the RM, and pick the best one.
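The best-of-n step is simple enough to sketch in a few lines (toy stand-ins for the policy and reward model; none of this is WebGPT's actual code):

```python
def best_of_n(question, generate, reward, n=16):
    """Sample n candidate answers and return the one the reward model scores highest."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins: a fake sampler and a fake reward model that prefers longer answers.
answers = iter(["short answer", "a longer, cited answer", "off-topic"])
generate = lambda q: next(answers)
reward = len

print(best_of_n("Why is the sky blue?", generate, reward, n=3))
# → a longer, cited answer
```

Note that best-of-n trades extra inference-time compute for quality, while BC/RL spend that compute at training time.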

9 of 18

WebGPT: Evaluation on ELI5

  • Compare model-generated answers to demonstrator-written answers.
  • Compare model-generated answers to reference answers from ELI5.

10 of 18

WebGPT: Other Experiment Findings

Also evaluated WebGPT models on TruthfulQA and TriviaQA; all outperform GPT-3, and larger models consistently do better.

RL does not provide a significant benefit; BC + rejection sampling is enough and achieves good efficiency.

For rejection sampling, sampling more outputs consistently provides extra benefit.

More demonstration data and comparison data give further gains.

11 of 18

WebGPT: Limitations

  • Still, WebGPT does not fully resolve the truthfulness problem, as we sometimes see with the New Bing: it can still quote highly unreliable sources.
  • WebGPT produces answers that appear more authoritative, partly because of the use of citations.
  • This can lead to users’ over-reliance on WebGPT’s answers (“automation bias”).
  • It makes more mistakes on out-of-distribution questions.
  • Reinforcement of social bias (e.g., unfairly excluding some materials)
  • Risks of live web access (e.g., edit Wikipedia)

12 of 18

Toolformer: Language Models Can Teach Themselves to Use Tools

  • Motivation: LMs struggle with truthful generation, low-resource languages, and precise calculation, and lack awareness of the progression of time.
  • Ideas:
    • Teach the LM to use external tools in a self-supervised way, without requiring human annotation (cf. Self-Instruct [1], a technique also used by Alpaca).
    • The LM should decide for itself when and how to use which tools, not tied to specific tasks.
  • In this paper they experiment with GPT-J (6.7B) and outperform GPT-3 on zero-shot evaluation.

[1] Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., & Hajishirzi, H. (2022). Self-Instruct: Aligning Language Model with Self Generated Instructions. arXiv preprint arXiv:2212.10560.

13 of 18

Toolformer: Data Collection Method (Using QA Tool as Example)

Pipeline (done using the LM itself):

  • Sample candidate API calls, using a few-shot prompt
  • Execute the calls against external tools
  • Filter the calls, using the LM loss
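The data-collection pipeline on this slide can be sketched as a single loop (all function names here are hypothetical, not the paper's; the real method also samples call positions jointly with the calls):

```python
def annotate(text, sample_calls, execute, loss_reduction, tau=1.0):
    """Keep only the sampled API calls whose response reduces LM loss by >= tau."""
    kept = []
    for call in sample_calls(text):                      # 1) few-shot-prompted sampling
        response = execute(call)                         # 2) run the external tool
        if loss_reduction(text, call, response) >= tau:  # 3) filter by LM loss
            kept.append((call, response))
    return kept

# Toy example: two candidate calls, only the first reduces the loss enough.
calls = lambda t: ["[QA(Who wrote Hamlet?)]", "[Calendar()]"]
run = lambda c: "Shakespeare" if c.startswith("[QA") else "today"
gain = lambda t, c, r: 2.0 if "QA" in c else 0.1

print(annotate("Hamlet was written by ...", calls, run, gain))
# → [('[QA(Who wrote Hamlet?)]', 'Shakespeare')]
```

The surviving (call, response) pairs are spliced back into the training text, and the LM is fine-tuned on the augmented corpus.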

14 of 18

Toolformer: Sample API call and Filtering

Sample API call via Prompting

Filtering via LM Loss — compare three losses:

  • Loss when inserting the API call and its response
  • Loss when NOT making an API call at this position
  • Loss when making the API call but inserting NO response

Only keep an API call if it reduces the LM loss by at least a certain margin.
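The filtering rule can be written out (notation reproduced from the Toolformer paper as best I recall; treat the symbols as a sketch). With $L_i(z)$ the weighted LM loss over the tokens following position $i$ when prefix $z$ is inserted, an API call $c_i$ with response $r_i$ is kept iff:

```latex
L_i^{+} = L_i\bigl(e(c_i, r_i)\bigr), \qquad
L_i^{-} = \min\bigl(L_i(\varepsilon),\; L_i(e(c_i, \varepsilon))\bigr), \qquad
L_i^{-} - L_i^{+} \ge \tau_f
```

where $\varepsilon$ is the empty sequence (no call, or a call with no response), $e(\cdot)$ serializes the call into text, and $\tau_f$ is the filtering threshold.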

15 of 18

Toolformer: External Tools

  • Question Answering: Atlas system (retrieval-augmented LM on NQ)
  • Calculator
  • Wikipedia Search (BM25 retriever, KILT Wiki dump)
  • Machine Translation (600M NLLB)
  • Calendar
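The two simplest tools can be sketched end-to-end in the paper's `[Tool(input) -> result]` text format (the dispatcher and function bodies below are our toy versions, not Toolformer's actual implementations):

```python
import datetime
import re

def calculator(expr):
    # Toy arithmetic evaluator; never eval untrusted text in real code.
    result = eval(expr, {"__builtins__": {}})
    return f"{round(result, 2):g}"

def calendar(_arg=""):
    return datetime.date.today().strftime("Today is %A, %B %d, %Y.")

TOOLS = {"Calculator": calculator, "Calendar": calendar}

def execute_call(text):
    """Turn a '[Tool(input)]' call into '[Tool(input) -> result]'."""
    m = re.fullmatch(r"\[(\w+)\((.*)\)\]", text)
    name, arg = m.group(1), m.group(2)
    return f"[{name}({arg}) -> {TOOLS[name](arg)}]"

print(execute_call("[Calculator(400/1400)]"))
# → [Calculator(400/1400) -> 0.29]
```

The heavier tools (Atlas QA, BM25 Wikipedia search, NLLB translation) plug into the same call-and-response format but wrap full retrieval or translation models.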

16 of 18

Toolformer: Benchmark

Baseline: fine-tuned on the same corpus, but without API calls

Baseline: Toolformer, but with API calls disabled during decoding

17 of 18

Toolformer: Other Experimental Findings

Adding API calls comes at no cost in perplexity for language modeling without any API calls.

The ability to leverage the provided tools only emerges at around 775M parameters. (Except the Wikipedia search engine, used mostly for QA, as that API might be comparatively easy to use.)

Moving away from greedy decoding increases the model’s tendency to make API calls.

18 of 18

GPT+Plugins

  • How about we just watch their demo together for fun?
  • https://openai.com/blog/chatgpt-plugins