1 of 18

LLM + External Tools

Chenghao Yang

University of Chicago

2 of 18

Research Questions (before we start)

  • Should a language model handle all tasks on its own?
    • From a theoretical perspective? (e.g., computability)
    • From an application perspective? (e.g., efficiency)
  • If we want to delegate/decouple certain functionalities from the LLM to specialized software, how should we decide which ones?
  • How should we teach an LLM to work with external tools?
  • What is the ideal implementation form for “future AI” / “AGI”: an elegant single model or a complex system?

3 of 18

WebGPT: Browser-assisted question-answering with human feedback

https://twitter.com/samcharrington/status/1610365059540066304

Webpage references help improve truthfulness!

Browsing webpages does provide up-to-date information!

[Screenshots: one cited page last updated a week ago, another last updated two months ago]

4 of 18

WebGPT: High-Level Design

  • GPT-3 Model interacts with a text-based Web-browsing environment
    • Improve retrieval and synthesis end-to-end, using RL/Imitation Learning
  • GPT-3 Model generates answers with references
    • Crucial for allowing labelers to judge the factual accuracy of answers
  • Using human feedback to directly optimize answer quality
    • Collect human demonstrations on web-browsing
    • Collect comparison judgments between model-generated responses (factuality, coherence, usefulness)
    • Use the human-provided data for: 1) behavior cloning (SFT); 2) reward model training; 3) RL against the reward model; 4) rejection sampling against the reward model

5 of 18

WebGPT: Environment Design

  • Search Engine: Microsoft Bing Web Search API
  • Simplify Web Pages via Readability.js @ Mozilla
  • Use html2text, pdfminer.six, and similar tools to convert rich text to plain text, and convert images to [Image: <alt text>]
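As a rough illustration of this conversion step, here is a stdlib-only sketch (our own toy code, not WebGPT's actual pipeline) that strips markup and replaces images with `[Image: <alt text>]`:

```python
# Toy page-to-text conversion in the spirit of WebGPT's environment preprocessing.
from html.parser import HTMLParser

class PageToText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Replace each image with a textual placeholder built from its alt text.
        if tag == "img":
            alt = dict(attrs).get("alt", "")
            self.parts.append(f"[Image: {alt}]")

    def handle_data(self, data):
        # Keep visible text, dropping whitespace-only runs between tags.
        if data.strip():
            self.parts.append(data.strip())

def page_to_text(html):
    parser = PageToText()
    parser.feed(html)
    return " ".join(parser.parts)

print(page_to_text('<p>Hello <img src="x.png" alt="a cat"> world</p>'))
# → Hello [Image: a cat] world
```

The real environment additionally uses Readability.js to discard navigation chrome before this kind of flattening.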

6 of 18

WebGPT: Environment Interface

7 of 18

WebGPT: Data Collection

  • The vast majority of questions come from the ELI5 (“Explain Like I’m Five”) dataset.
  • For diversity, some other sources are also mixed in:
    • TriviaQA, ARC
    • Hand-written
    • ELI5 fact-check (using InstructGPT to generate answers for ELI5 questions)
  • In total, collected 6,000 demonstrations and 21,500 comparisons.
  • 92% of demonstrations and 98% of comparisons come from ELI5.

Fan, A., Jernite, Y., Perez, E., Grangier, D., Weston, J., & Auli, M. (2019, July). ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3558-3567).

8 of 18

WebGPT: Training

  • Behavior Cloning (BC): Supervised fine-tuning on the demonstration data.
  • Reward Modeling (RM): Starting from the BC model, remove the final layer and train it to output a scalar reward, learning human preferences from the comparison data.
  • Reinforcement Learning (RL): Starting from the BC model, fine-tune in the environment using PPO.
  • Rejection Sampling (best-of-n): Sample 4/16/64 outputs from the BC/RL model, re-rank with the RM, and pick the best one.
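The best-of-n step is simple enough to sketch in a few lines (toy stand-ins for the policy and reward model; none of this is WebGPT's actual code):

```python
def best_of_n(question, generate, reward, n=16):
    """Sample n candidate answers and return the one the reward model scores highest."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins: a fake sampler and a fake reward model that prefers longer answers.
answers = iter(["short answer", "a longer, cited answer", "off-topic"])
generate = lambda q: next(answers)
reward = len

print(best_of_n("Why is the sky blue?", generate, reward, n=3))
# → a longer, cited answer
```

Note that best-of-n trades extra inference-time compute for quality, while BC/RL spend that compute at training time.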

9 of 18

WebGPT: Evaluation on ELI5

  • Compare model-generated answers to demonstrator-written answers.
  • Compare model-generated answers to reference answers from ELI5.

10 of 18

WebGPT: Other Experiment Findings

Also evaluated WebGPT models on TruthfulQA and TriviaQA; all outperform GPT-3, and larger models consistently do better.

RL does not provide a significant benefit; BC + rejection sampling is enough and achieves good efficiency.

For rejection sampling, sampling more outputs consistently provides extra benefit.

More demonstration data and comparison data give further gains.

11 of 18

WebGPT: Limitations

  • Still, WebGPT does not fully resolve the truthfulness problem, as we sometimes see with the New Bing: it can still quote highly unreliable sources.
  • WebGPT produces answers that appear more authoritative, partly because of the use of citations.
  • This can lead to users’ over-reliance on WebGPT’s answers (“automation bias”).
  • It makes more mistakes on out-of-distribution questions.
  • Reinforcement of social bias (e.g., unfairly excluding some materials)
  • Risks of live web access (e.g., edit Wikipedia)

12 of 18

Toolformer: Language Models Can Teach Themselves to Use Tools

  • Motivation: LMs struggle with truthful generation, low-resource languages, and precise calculation, and lack awareness of the progression of time.
  • Ideas:
    • Teach the LM to use external tools in a self-supervised way, without requiring human annotation (cf. Self-Instruct [1], a technique also used by Alpaca).
    • The LM should decide for itself when and how to use which tools, not tied to specific tasks.
  • In this paper they experiment with GPT-J (6.7B) and outperform GPT-3 on zero-shot evaluation.

[1] Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., & Hajishirzi, H. (2022). Self-Instruct: Aligning Language Model with Self Generated Instructions. arXiv preprint arXiv:2212.10560.

13 of 18

Toolformer: Data Collection Method (Using QA Tool as Example)

Pipeline (done using the LM itself):

  • Sample candidate API calls, using a few-shot prompt
  • Execute the calls against external tools
  • Filter the calls, using the LM loss
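The data-collection pipeline on this slide can be sketched as a single loop (all function names here are hypothetical, not the paper's; the real method also samples call positions jointly with the calls):

```python
def annotate(text, sample_calls, execute, loss_reduction, tau=1.0):
    """Keep only the sampled API calls whose response reduces LM loss by >= tau."""
    kept = []
    for call in sample_calls(text):                      # 1) few-shot-prompted sampling
        response = execute(call)                         # 2) run the external tool
        if loss_reduction(text, call, response) >= tau:  # 3) filter by LM loss
            kept.append((call, response))
    return kept

# Toy example: two candidate calls, only the first reduces the loss enough.
calls = lambda t: ["[QA(Who wrote Hamlet?)]", "[Calendar()]"]
run = lambda c: "Shakespeare" if c.startswith("[QA") else "today"
gain = lambda t, c, r: 2.0 if "QA" in c else 0.1

print(annotate("Hamlet was written by ...", calls, run, gain))
# → [('[QA(Who wrote Hamlet?)]', 'Shakespeare')]
```

The surviving (call, response) pairs are spliced back into the training text, and the LM is fine-tuned on the augmented corpus.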

14 of 18

Toolformer: Sample API call and Filtering

Sample API call via Prompting

Filtering via LM Loss — compare three losses:

  • Loss when inserting the API call and its response
  • Loss when NOT making an API call at this position
  • Loss when making the API call but inserting NO response

Only keep an API call if it reduces the LM loss by at least a certain margin.
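The filtering rule can be written out (notation reproduced from the Toolformer paper as best I recall; treat the symbols as a sketch). With $L_i(z)$ the weighted LM loss over the tokens following position $i$ when prefix $z$ is inserted, an API call $c_i$ with response $r_i$ is kept iff:

```latex
L_i^{+} = L_i\bigl(e(c_i, r_i)\bigr), \qquad
L_i^{-} = \min\bigl(L_i(\varepsilon),\; L_i(e(c_i, \varepsilon))\bigr), \qquad
L_i^{-} - L_i^{+} \ge \tau_f
```

where $\varepsilon$ is the empty sequence (no call, or a call with no response), $e(\cdot)$ serializes the call into text, and $\tau_f$ is the filtering threshold.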

15 of 18

Toolformer: External Tools

  • Question Answering: Atlas system (retrieval-augmented LM on NQ)
  • Calculator
  • Wikipedia Search (BM25 retriever, KILT Wiki dump)
  • Machine Translation (600M NLLB)
  • Calendar
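The two simplest tools can be sketched end-to-end in the paper's `[Tool(input) -> result]` text format (the dispatcher and function bodies below are our toy versions, not Toolformer's actual implementations):

```python
import datetime
import re

def calculator(expr):
    # Toy arithmetic evaluator; never eval untrusted text in real code.
    result = eval(expr, {"__builtins__": {}})
    return f"{round(result, 2):g}"

def calendar(_arg=""):
    return datetime.date.today().strftime("Today is %A, %B %d, %Y.")

TOOLS = {"Calculator": calculator, "Calendar": calendar}

def execute_call(text):
    """Turn a '[Tool(input)]' call into '[Tool(input) -> result]'."""
    m = re.fullmatch(r"\[(\w+)\((.*)\)\]", text)
    name, arg = m.group(1), m.group(2)
    return f"[{name}({arg}) -> {TOOLS[name](arg)}]"

print(execute_call("[Calculator(400/1400)]"))
# → [Calculator(400/1400) -> 0.29]
```

The heavier tools (Atlas QA, BM25 Wikipedia search, NLLB translation) plug into the same call-and-response format but wrap full retrieval or translation models.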

16 of 18

Toolformer: Benchmark

Baseline: fine-tuned on the same corpus, but without API calls

Baseline: Toolformer, but with API calls disabled during decoding

17 of 18

Toolformer: Other Experimental Findings

Adding API calls comes at no cost in perplexity for language modeling without any API calls.

The ability to leverage the provided tools only emerges at around 775M parameters. (Except the Wikipedia search engine, used mostly for QA, as that API might be comparatively easy to use.)

Moving away from greedy decoding increases the model’s tendency to make API calls.

18 of 18

GPT+Plugins

  • How about we just watch their demo together for fun?
  • https://openai.com/blog/chatgpt-plugins