LLM + External Tools
Chenghao Yang
University of Chicago
Research Questions (before we start)
WebGPT: Browser-assisted question-answering with human feedback
https://twitter.com/samcharrington/status/1610365059540066304
Webpage references help improve truthfulness!
Browsing webpages does provide up-to-date information!
[Screenshot captions: "Last Updated: A week ago" vs. "Last Updated: Two Months Ago"]
WebGPT: High-Level Design
WebGPT: Environment Design
WebGPT: Environment Interface
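WebGPT gives the model a text-only browser driven by a small command set (search, click, quote, scroll, back, end). A minimal sketch of such an interface is below; the command names follow the paper's action space, but the handlers are stubs and everything else is an assumption, not WebGPT's actual implementation.

```python
# Minimal sketch of a WebGPT-style text-browser environment.
# Command names follow the paper's action space; handlers are stubs.

class TextBrowserEnv:
    def __init__(self):
        self.history = []   # commands issued so far (for Back)
        self.quotes = []    # passages collected as references for the answer

    def step(self, command: str) -> str:
        """Parse one model-issued command and return the next observation."""
        self.history.append(command)
        if command.startswith("Search "):
            query = command[len("Search "):]
            return f"[search results for: {query}]"   # stub: would query Bing
        if command.startswith("Click "):
            return "[contents of the clicked page]"   # stub: would fetch the page
        if command.startswith("Quote:"):
            self.quotes.append(command[len("Quote:"):].strip())
            return "[quote saved]"
        if command in ("Scroll down", "Scroll up", "Back", "Top"):
            return f"[{command.lower()} executed]"
        if command.startswith("End:"):
            return "[browsing finished; compose answer from quotes]"
        return "[unrecognized command]"
```

The collected quotes become the references that the final answer cites, which is what drives the truthfulness gains above.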
WebGPT: Data Collection
Fan, A., Jernite, Y., Perez, E., Grangier, D., Weston, J., & Auli, M. (2019, July). ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3558-3567).
WebGPT: Training
WebGPT: Evaluation on ELI5
WebGPT: Other Experiment Findings
Also evaluated WebGPT models on TruthfulQA and TriviaQA; all outperform GPT-3. Larger models always work better.
RL models simply do not provide a significant benefit. BC + rejection sampling is enough and achieves good efficiency.
For rejection sampling, sampling more outputs always provides extra gains.
More demonstration data and comparison data give more benefits.
WebGPT: Limitations
Toolformer: Language Models Can Teach Themselves to Use Tools
[1] Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., & Hajishirzi, H. (2022). Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv preprint arXiv:2212.10560.
Toolformer: Data Collection Method (Using QA Tool as Example)
Done using LM itself
Using Few-shot Prompt
Using LM loss
External Tools
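The "done using the LM itself" step works by few-shot prompting: the LM is shown examples of text annotated with [QA(...)] calls and asked to annotate new text the same way. A sketch of building such an annotation prompt; the wording below is paraphrased from the paper's QA prompt, not the exact template.

```python
# Few-shot prompt (paraphrased) asking the LM to insert [QA(...)] calls.
FEW_SHOT_TEMPLATE = """Your task is to add calls to a Question Answering API to a piece of text.
You can call the API by writing "[QA(question)]".
Here is an example:
Input: Joe Biden was born in Scranton, Pennsylvania.
Output: Joe Biden was born in [QA(Where was Joe Biden born?)] Scranton, [QA(In which state is Scranton?)] Pennsylvania.
Input: {text}
Output:"""

def build_annotation_prompt(text: str) -> str:
    """Fill the few-shot template with the corpus text to be annotated."""
    return FEW_SHOT_TEMPLATE.format(text=text)
```

The LM's completions then contain candidate API calls, which the loss-based filter on the next slide accepts or rejects.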
Toolformer: Sample API call and Filtering
Sample API call via Prompting
Filtering via LM Loss
LM Loss
Loss for inserting API responses
Loss for NOT using API at this position
Loss for using API but NO Response at this position
Only keep an API call if it reduces the LM loss by a certain margin
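The three losses above combine into one filtering rule: keep a call only if inserting it together with its response lowers the LM loss on the following tokens by at least a margin, compared with the better of (no call) and (call without response). A sketch, where `lm_loss(prefix, continuation)` stands in for the real LM's weighted cross-entropy:

```python
def keep_api_call(lm_loss, prefix, continuation,
                  call_text, response_text, tau=1.0):
    """Toolformer's filter: keep the call only if the call *with* its
    response helps the LM predict the continuation by at least `tau`,
    relative to the best of (no call) and (call with empty response)."""
    loss_with_response = lm_loss(prefix + call_text + response_text, continuation)
    loss_no_call       = lm_loss(prefix, continuation)
    loss_no_response   = lm_loss(prefix + call_text, continuation)
    baseline = min(loss_no_call, loss_no_response)
    return baseline - loss_with_response >= tau

# Toy stand-in loss: low only if the answer token is already in the prefix.
def toy_loss(prefix, continuation):
    return 0.5 if "1912" in prefix else 3.0
```

With `toy_loss`, the call "[QA(When did the Titanic sink?)" is kept when its response contains "1912" (loss drops 3.0 → 0.5) and rejected when the response is useless.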
Toolformer: External Tools
Toolformer: Benchmark
Fine-tuned on the same corpus, but without API calls
Does not allow calling the API during decoding
Toolformer: Other Experimental Findings
Training with API calls does not hurt perplexity on plain language modeling (text without any API calls).
The ability to leverage the provided tools only emerges at around 775M parameters. (Except the Wikipedia search engine, used mostly for QA, as that API might be comparatively easy to use.)
Changing the decoding strategy away from greedy decoding can increase the model's tendency to use more API calls.
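At inference time, Toolformer decodes normally until the model emits the API-result marker, then pauses generation, executes the call, splices the response into the context, and resumes. A sketch with assumed token names ("[" opens a call, "->" requests the result, "]" closes it); `generate_token` and `call_api` are stand-ins for the real LM and tool:

```python
def extract_call(tokens):
    """Return the call text between the last "[" and the trailing "->"."""
    start = len(tokens) - 1 - tokens[::-1].index("[")
    return " ".join(tokens[start + 1:-1])

def decode_with_tools(generate_token, call_api, max_tokens=50):
    """Interleave LM decoding with tool execution, Toolformer-style."""
    out = []
    for _ in range(max_tokens):
        tok = generate_token(out)
        out.append(tok)
        if tok == "->":                       # model asks for a tool result:
            result = call_api(extract_call(out))
            out.extend([result, "]"])         # splice response in, then resume
        if tok == "<eos>":
            break
    return out
```

Because the decision to open a call is just a token probability, switching from greedy decoding to sampling makes the "[" token win more often, which is the increased API-call tendency noted above.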
GPT+Plugins