1 of 109

From LLMs to Agents

Grigory Sapunov

Oxford LLMs / 2024.09.24

gs@inten.to

2 of 109

Prompting is not enough

  • You can do many things with advanced prompting, from zero-shot instructions to many-shot examples and involved schemes like Chain-of-Thought
  • You can also fine-tune an LLM on your own data
  • But that’s not enough.

3 of 109

Prompting is not enough

You may need more complicated workflows with external orchestration, for example, involving:

  • multiple LLM runs (Self-Consistency)
  • controlling multi-prompt sequences (steps of a pipeline/workflow)
  • advanced prompting techniques (Tree-of-Thought)
  • using LLM as a part of a more complex system (LLM Programs)
  • using external datasources (RAG)
  • applying style-guides, policies, constitutions
  • processing large documents (longer than context size)
  • using tools (Function calling)
  • estimating the quality of the answer and applying specific logic (retry/reformulate/…)
  • reflecting on the LLM output with another LLM (reflection)
  • using multi-agent workflows with multiple LLM-based (and/or code-based) agents and static or dynamic communication graphs

4 of 109

Advanced LLM features

5 of 109

* Embedding models

6 of 109

Word Embeddings (Word2Vec)

7 of 109

Sentence Embeddings

8 of 109

Embedding models

  • Text embeddings measure the relatedness (semantic similarity) of text strings.
  • Embeddings are commonly used for:
    • Search (where results are ranked by relevance to a query string)
    • Clustering (where text strings are grouped by similarity)
    • Recommendations (where items with related text strings are recommended)
    • Anomaly detection (where outliers with little relatedness are identified)
    • Diversity measurement (where similarity distributions are analyzed)
    • Classification (where text strings are classified by their most similar label)
  • An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
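
As a sketch of how this works in code (assuming the OpenAI Python SDK and its text-embedding-3-small model; the documents and query here are invented for illustration), search reduces to cosine similarity over vectors:

```python
# Minimal semantic-search sketch. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; any embedding model works the same way.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Return a matrix with one embedding vector per input string."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

docs = ["How do I reset my password?",
        "Shipping takes 3-5 business days.",
        "Our office is closed on weekends."]
doc_vecs = embed(docs)
q = embed(["I forgot my login credentials"])[0]

# Cosine similarity: a small angle (large cosine) suggests high relatedness.
sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
print(docs[int(np.argmax(sims))])  # expected: the password-reset document
```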

9 of 109

Embedding models

10 of 109

MTEB: Massive Text Embedding Benchmark

11 of 109

Topic Modeling: BERTopic + Embeddings

12 of 109

* Multimodality

13 of 109

Multimodality

  • Some LLMs support modalities other than text, e.g. images (then they are VLMs -- Visual Language Models), video, speech, etc.
  • It is especially useful with a large context (say, Gemini 1.5 with 1M tokens)
  • You can use all the techniques you used with LLMs: few-shot learning, Chain-of-Thought, structured outputs, etc.

14 of 109

GPT-4v

15 of 109

Gemini 1.5 Pro

16 of 109

Gemini 1.5 Pro

17 of 109

* Tools / Function calling

18 of 109

Using Tools / Function calling

  • Function calling allows you to connect models to external tools and systems.
  • This is useful for many things such as empowering AI assistants with capabilities, or building deep integrations between your applications and the models.
  • For example, enabling assistants to:
    • fetch data from an internal system
    • take actions, say, schedule meetings based on user preferences and calendar availability.
    • perform computations or run external code.
    • build rich workflows with data extraction, conversion, saving
    • modify your applications' UI: updating on user input, for example, rendering a pin on a map.
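
A minimal sketch of the round trip, assuming the OpenAI Chat Completions API; the get_weather tool and its schema are invented for illustration:

```python
# Function-calling sketch: the model decides to call a tool, we execute it,
# and feed the result back so the model can produce the final answer.
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city):
    return f"Sunny, 21C in {city}"  # stub; a real tool would call a weather API

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Oxford?"}]
resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = resp.choices[0].message

if msg.tool_calls:  # the model chose to call the tool
    call = msg.tool_calls[0]
    result = get_weather(**json.loads(call.function.arguments))
    messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": result}]
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```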

19 of 109

Using Tools / Function calling

20 of 109

* Dealing with non-English content

21 of 109

Dealing with non-English content

  • Datasets are dominated by English, and LLMs are trained mostly on English data.
  • So, not surprisingly, models solve tasks better in English than in other languages, especially low-resource ones.
  • A solution for non-English content: translate it first into English (with MT or even an LLM), solve the task in English, then translate the result back.
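
A minimal sketch of this translate-solve-translate pattern; the chat() helper wraps the OpenAI SDK for concreteness, but any LLM API would do, and the prompts are illustrative:

```python
# Translate-solve-translate sketch: work in English, answer in the source language.
from openai import OpenAI

client = OpenAI()

def chat(prompt):
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def solve_via_english(task_text, source_lang):
    english = chat(f"Translate this {source_lang} text to English:\n{task_text}")
    answer = chat(f"Solve the following task:\n{english}")
    return chat(f"Translate this text to {source_lang}:\n{answer}")
```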

22 of 109

Google PaLM 2

23 of 109

Meta LLaMa 2

24 of 109

LLMs may work better with translation

“Do Multilingual Language Models Think Better in English?“ https://arxiv.org/abs/2308.01223

25 of 109

Multilingual GenAI Summarization task

26 of 109

> Additional considerations: Token economy

27 of 109

All languages are NOT (tokenized) equal

Because of tokenization, the same text may require fewer tokens in English than in, say, Korean.
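
A quick way to see the gap yourself, using the tiktoken library with the cl100k_base encoding; the sample sentences are illustrative translations of the same phrase:

```python
# Count tokens for the same phrase in different languages.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "English": "Hello, how are you today?",
    "Korean": "안녕하세요, 오늘 기분이 어떠세요?",
}
for lang, text in samples.items():
    print(lang, len(enc.encode(text)))  # non-English typically needs more tokens
```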

28 of 109

English has the shortest median token length

29 of 109

Tokenization in different languages

Prompts and responses in non-English languages consume up to 18x more tokens, resulting in higher costs and delays.

Example:

English - 19 tokens

Korean - 97 tokens (5x)

Malayalam - 339 tokens (18x)

(Source: Intento Inc. AI Hub, September 2022)

30 of 109

Tokenization: conclusion

Because of tokenization, language has a significant influence on:

  • Price (because of per-token pricing)
  • Response time (because of iterative, token-by-token generation)
  • Ability to process long documents (because the effective context length is smaller with poor tokenization)

31 of 109

Advanced single-LLM uses

32 of 109

Improving quality - incorporating custom data

  • As we’ve seen, providing lots of context in the input may not work well because attention degrades over long inputs (it’s also very expensive!).
  • Solution: store all the context in a database and pull in only the relevant pieces of information.
  • Examples:
    • Generate a draft answer to a customer query by pulling the 3 most similar answers from the history
    • Generate a translation by pulling similar translations from Translation Memory
  • “Retrieval Augmented Generation” or RAG
  • Also helps to incorporate recent data (overcoming the knowledge cutoff date), or data unknown to the LLM
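
A minimal RAG sketch, reusing the embed() and chat() helpers from the earlier sketches; the prompt wording is illustrative:

```python
# RAG sketch: retrieve the top-k most similar documents, then ground the
# answer in them. Assumes embed() and chat() as defined in earlier sketches.
import numpy as np

def retrieve(query, docs, doc_vecs, k=3):
    q = embed([query])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

def rag_answer(query, docs, doc_vecs):
    context = "\n---\n".join(retrieve(query, docs, doc_vecs))
    return chat(f"Answer using only the context below.\n"
                f"Context:\n{context}\n\nQuestion: {query}")
```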

33 of 109

RAG: Retrieval-Augmented Generation

34 of 109

Improving quality - advanced techniques

  • Humans use two types of thinking: System 1 (quick, intuitive), and System 2 (takes time, thoughtful).
  • Today, LLMs have only System 1: they can’t deliberate for a while, they immediately produce an answer. With OpenAI’s o1 this is changing!
  • Solution - make them think aloud, first generating steps of the solution, and then generating the final result step by step.
  • “Chain of Thought” (CoT) prompting
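
In its simplest zero-shot form, using the chat() helper sketched earlier (the question is a stock example):

```python
# Zero-shot Chain-of-Thought: ask the model to reason before answering.
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")
print(chat(f"{question}\nLet's think step by step, "
           f"then give the final answer on the last line."))
```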

35 of 109

“Chain of Thought” prompting

[Diagram: the prompt “Build a plan for task X and solve step-by-step” goes to the Large Language Model; the completion contains a plan, steps 1-3, and the final answer.]

36 of 109

Chain of Thought (CoT)

“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, https://arxiv.org/abs/2201.11903

37 of 109

“To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning” https://arxiv.org/abs/2409.12183

38 of 109

Self-Consistency (SC)

“Self-Consistency Improves Chain of Thought Reasoning in Language Models”, https://arxiv.org/abs/2203.11171
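
A sketch of Self-Consistency, assuming the OpenAI client from the earlier sketches; the "Answer:" extraction convention is an illustrative choice:

```python
# Self-Consistency: sample several CoT answers at nonzero temperature,
# then take a majority vote over the final answers.
from collections import Counter

def sample_cot(question, temperature=0.7):
    resp = client.chat.completions.create(
        model="gpt-4o", temperature=temperature,
        messages=[{"role": "user", "content":
                   f"{question}\nThink step by step. End with 'Answer: <x>'."}])
    return resp.choices[0].message.content

def self_consistent(question, n=5):
    finals = [sample_cot(question).rsplit("Answer:", 1)[-1].strip()
              for _ in range(n)]
    return Counter(finals).most_common(1)[0][0]  # majority vote
```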

39 of 109

LLM Programs

  • LLMs are embedded into a program or algorithm.
  • The LLM is NOT responsible for maintaining the current state of the program (i.e. its context)
  • At each step of the program, the LLM is presented only with a step-specific prompt and context.

“Large Language Model Programs”, https://arxiv.org/abs/2305.05364
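
A minimal sketch of the idea; the summarize-critique-rewrite pipeline is invented for illustration, and chat() is the helper sketched earlier:

```python
# LLM Program sketch: the surrounding code, not the LLM, owns the program
# state; each step sees only a step-specific prompt.
def llm_program(document):
    state = {"doc": document}                       # program state lives in code
    state["summary"] = chat(f"Summarize:\n{state['doc']}")
    state["critique"] = chat(f"List flaws in this summary:\n{state['summary']}")
    return chat(f"Rewrite the summary fixing these flaws.\n"
                f"Summary:\n{state['summary']}\nFlaws:\n{state['critique']}")
```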

40 of 109

Tree-of-Thought (ToT)

“Large Language Model Guided Tree-of-Thought”, https://arxiv.org/abs/2305.08291

41 of 109

Graph of Thoughts (GoT)

“Graph of Thoughts: Solving Elaborate Problems with Large Language Models”, https://arxiv.org/abs/2308.09687

42 of 109

Everything of Thoughts (XoT)

43 of 109

“The Prompt Report: A Systematic Survey of Prompting Techniques”, https://arxiv.org/abs/2406.06608

44 of 109

Agents

45 of 109

What is an agent?

46 of 109

What is an agent?

  • There are many different definitions.
  • {LLM/AI} Agents are systems that use an LLM as a reasoning engine to determine which actions to take and what the inputs to those actions should be. The results of those actions can then be fed back into the agent as observations, and it determines whether more actions are needed or whether it is okay to finish.
  • Typically, an Agent is an LLM equipped with Tools, Memory, and some orchestration code (say, to call tools, manage memory, run the agentic loop, etc.)
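
A minimal sketch of such an agentic loop, assuming the OpenAI client from earlier sketches; the JSON protocol and the stub tools are invented for illustration (a production agent would use native function calling, as shown earlier):

```python
# Agentic loop sketch: the LLM decides between calling a tool and finishing.
import json

TOOLS = {"search": lambda q: f"results for {q}",          # stub tool
         "calculator": lambda expr: str(eval(expr))}      # unsafe outside a sandbox

SYSTEM = ("Decide the next step. Reply with JSON only: "
          '{"tool": "search"|"calculator", "input": "..."} or {"final": "..."}')

def run_agent(task, max_steps=5):
    history = [{"role": "system", "content": SYSTEM},
               {"role": "user", "content": task}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(model="gpt-4o", messages=history)
        step = json.loads(resp.choices[0].message.content)  # may need retries
        if "final" in step:
            return step["final"]
        observation = TOOLS[step["tool"]](step["input"])    # act
        history.append({"role": "assistant", "content": json.dumps(step)})
        history.append({"role": "user", "content": f"Observation: {observation}"})
    return "No answer within max_steps."
```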

47 of 109

What is an agent?

48 of 109

ReAct framework

  • ReAct (Reason + Act) is a paradigm where LLMs are used to generate both reasoning traces and task-specific actions in an interleaved manner.
  • The ReAct framework can allow LLMs to interact with external tools to retrieve additional information that leads to more reliable and factual responses.
  • However, see “On the Brittle Foundations of ReAct Prompting for Agentic Large Language Models”, https://arxiv.org/abs/2405.13966

“ReAct: Synergizing Reasoning and Acting in Language Models”, https://arxiv.org/abs/2210.03629
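
A compact sketch of a ReAct-style loop in the Thought/Action/Observation format, using the chat() helper sketched earlier; the search action and the parsing convention are illustrative:

```python
# ReAct sketch: the model interleaves Thought and Action; the code executes
# actions and appends Observations to the growing trace.
import re

def react(question, search, max_steps=5):
    trace = (f"Answer by interleaving Thought, Action, Observation.\n"
             f"Actions: search[<query>] or finish[<answer>].\n"
             f"Question: {question}\nThought:")
    for _ in range(max_steps):
        step = chat(trace)                        # model emits Thought + Action
        trace += step
        m = re.search(r"(search|finish)\[(.*?)\]", step)
        if m and m.group(1) == "finish":
            return m.group(2)
        if m:                                     # act, then append Observation
            trace += f"\nObservation: {search(m.group(2))}\nThought:"
    return "No answer within max_steps."
```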

49 of 109

Why agents?

LLMs are not enough for some cases:

  • When they do not know some data
    • Need an external source, a database or search
  • When they need to do some multi-step reasoning and execution
    • Need separate runs with different prompts
  • When they need to use some external tools, e.g.
    • LLMs are bad at counting and arithmetic, so even a simple calculator can help
    • An LLM may need to run Python code
    • Some domain-specific tools (e.g. chemistry- or biology-related)
  • When they need to maintain context
    • Need to have memory

Some future LLMs (GPT-5+) may be better in some of these aspects, but that is still not enough (they will not have access to some specific tools).

50 of 109

Example: a conversational agent with search

An LLM-based conversational agent equipped with a generic search tool. You can ask the agent questions and have conversations with it. The agent uses an LLM to decide which tools to use and when.

Components:

  • General LLM, say, OpenAI GPT-4o or Anthropic Claude 3.5
  • Tools (each consists of: the name of the tool, a description of what the tool does, a JSON schema defining the inputs to the tool, and a function to call):
    • Web search
    • Weather API
  • Memory to store previous interactions (just history of messages)
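
A sketch of how these components might map to code; the weather tool is a stub and the field names mirror the list above:

```python
# One tool definition, structured exactly as the slide describes.
def get_weather(location):
    return f"Sunny, 21C in {location}"  # stub; call a real weather API here

weather_tool = {
    "name": "get_weather",                            # the name of the tool
    "description": "Current weather for a location",  # what the tool does
    "parameters": {                                   # JSON schema of inputs
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
    "function": get_weather,                          # the function to call
}
memory = []  # just the history of messages, appended after every turn
```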

51 of 109

Example: ChemCrow

ChemCrow is an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery, and materials design.

  • Uses OpenAI GPT-4
  • Integrates 18 expert-designed tools
  • Workflow reflects the ReAct framework
  • Implemented in LangChain (https://github.com/ur-whitelab/chemcrow-public; some tools are missing in the repo)

“ChemCrow: Augmenting large-language models with chemistry tools”, https://arxiv.org/abs/2304.05376

52 of 109

“ChemCrow: Augmenting large-language models with chemistry tools”, https://arxiv.org/abs/2304.05376

53 of 109

Multi-Agents

54 of 109

Agent vs. Multi-agent

  • We can define an agent as an entity that can reason, act, communicate and adapt to solve problems.
  • By extension, we define a multi-agent system as an application that involves a group of agents, each with diverse roles (personalities, skills) and capabilities (language models, tools, and objectives), collaborating to solve tasks.

55 of 109

Why multi-agents?

Some complex tasks have properties that lend themselves really well to a multi-agent solution approach:

  • Task Decomposition. Requires planning, which involves breaking the task down into steps that must be completed successfully.
  • Instructional Complexity. Lengthy instructions or context present a significant challenge for single-agent systems due to known difficulties with long or complex instructions.
  • Diverse Expertise. Benefits from the collaboration of multiple entities, each bringing diverse skills/expertise, tools, or perspectives to the task.
  • Separation of Concerns: Separate prompts can give better results.
  • Emergent Solutions. Problems where the solution arises from the interactions among multiple agents, with the exposure to the outputs from previous steps and the state and process of each agent enhancing the solution.
  • Autonomous Exploratory Problem-Solving. The task solution is not known beforehand, may take various forms, and results from exploring a vast search space (optimization).

56 of 109

57 of 109

58 of 109

59 of 109

* Translation Agent

60 of 109

Example: Agentic translation with reflection

A multi-agent system for machine translation. It contains two agents: a translator and a reviewer.

3-step workflow:

  1. Translator translates a text from source_language to target_language
  2. Reviewer reflects on the translation to come up with constructive suggestions for improving it
  3. Translator uses the suggestions to improve the translation.

These steps can be repeated if necessary, as in the sketch below.
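
A sketch of this workflow, using the chat() helper from the earlier sketches; the prompts are illustrative:

```python
# Two-agent translation with reflection: translate, review, improve.
def agentic_translate(text, src, tgt, rounds=1):
    translation = chat(f"Translate this {src} text to {tgt}:\n{text}")
    for _ in range(rounds):
        suggestions = chat(
            f"You are a {tgt} reviewer. Give constructive suggestions to "
            f"improve this translation of the {src} source.\n"
            f"Source:\n{text}\nTranslation:\n{translation}")
        translation = chat(
            f"Improve the translation using the suggestions.\n"
            f"Source:\n{text}\nTranslation:\n{translation}\n"
            f"Suggestions:\n{suggestions}")
    return translation
```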

61 of 109

Example: Reflection agent

62 of 109

Source: English

<article> In the world of softwarre development, you have to be on your toes all the time. The CEO, who has been leading the team for years, said, <q>"We need to push the envelope and think outside the box!"</q> This means our team must work smarter, not harder. Sometimes, it's like trying to find a needle in a haystack. <b>Can you imagine how challenging that is?</b> Our developers often feel like they are walking on eggshells. But, if we nail it, the sky's the limit. Remember, there is no "I" in "team." The project manager comanted, "We have to stay agile and responsive." <mark>With the new deadline looming, the presure is on.</mark> </article>

63 of 109

Translation: EN-RU (DeepL)

<article>В мире софтварной разработки нужно постоянно быть начеку. Генеральный директор, который руководит командой уже много лет, сказал: <q>"Нам нужно расширять границы и мыслить нестандартно!"<q> Это значит, что наша команда должна работать умнее, а не сложнее. Иногда это похоже на попытку найти иголку в стоге сена. <Представляете, как это сложно? Нашим разработчикам часто кажется, что они ходят по яичной скорлупе. Но если мы справимся с этой задачей, то небо - это предел. Помните, что в слове "команда" нет слова "я" Руководитель проекта сказал: "Мы должны оставаться проворными и оперативными" <марка> В связи с приближением нового дедлайна давление возрастает.</марка> </article

64 of 109

Reflection: LLM (GPT-4o)

1. Correct the spelling of "софтварной" to "программной" to use the more accurate and commonly accepted term in Russian.
2. Replace "расширять границы" with "выходить за рамки" to better capture the idiomatic expression "push the envelope."
3. Change "сложнее" to "тяжелее" to better match the meaning of "harder" in this context.
4. Add a closing quotation mark after "нестандартно!" to correct the punctuation.
5. Change "<Представляете, как это сложно?" to "<b>Представляете, как это сложно?</b>" to match the bold formatting in the source text.
6. Add a period after "я" in "в слове 'команда' нет слова 'я'" to correct the punctuation.
7. Change "Руководитель проекта сказал" to "Руководитель проекта прокомментировал" to better match the source text's "commented."
8. Add a closing quotation mark after "оперативными" to correct the punctuation.
9. Replace "<марка>" with "<mark>" and "</марка>" with "</mark>" to match the HTML tag in the source text.
10. Correct the spelling of "давление возрастает" to "давление усиливается" to better convey the increasing pressure.

65 of 109

Improved translation: LLM (GPT-4o)

<article>В мире программной разработки нужно постоянно быть начеку. Генеральный директор, который руководит командой уже много лет, сказал: <q>"Нам нужно выходить за рамки и мыслить нестандартно!"</q> Это значит, что наша команда должна работать умнее, а не тяжелее. Иногда это похоже на попытку найти иголку в стоге сена. <b>Представляете, как это сложно?</b> Нашим разработчикам часто кажется, что они ходят по яичной скорлупе. Но если мы справимся с этой задачей, то небо - это предел. Помните, что в слове "команда" нет слова "я". Руководитель проекта прокомментировал: "Мы должны оставаться проворными и оперативными." <mark>В связи с приближением нового дедлайна давление усиливается.</mark> </article>

66 of 109

Agent Design Patterns

67 of 109

“Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents”, https://arxiv.org/abs/2405.10467

68 of 109

69 of 109

70 of 109

71 of 109

72 of 109

73 of 109

74 of 109

Agentic Frameworks

75 of 109

LangChain + LangGraph

76 of 109

LangGraph Studio

77 of 109

AutoGen

AutoGen is an open-source programming framework for building AI agents and facilitating cooperation among multiple agents to solve tasks. AutoGen aims to provide an easy-to-use and flexible framework for accelerating development and research on agentic AI, like PyTorch for Deep Learning. It offers features such as agents that can converse with other agents, LLM and tool use support, autonomous and human-in-the-loop workflows, and multi-agent conversation patterns.
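
A minimal two-agent sketch based on the classic pyautogen API; treat the exact configuration keys as assumptions and check the current docs:

```python
# Two AutoGen agents conversing: an LLM-backed assistant and a user proxy
# running fully autonomously (no human input, no code execution).
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o"}]}  # API key from environment
assistant = AssistantAgent("assistant", llm_config=llm_config)
user = UserProxyAgent("user", human_input_mode="NEVER",
                      code_execution_config=False)
user.initiate_chat(assistant, message="Summarize the ReAct paper in 3 bullets.")
```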

78 of 109

AutoGen Studio

79 of 109

CrewAI

80 of 109

Case studies

81 of 109

* Generative Agents

82 of 109

Example: Generative Agents

Researchers from Stanford and Google have developed generative agents that simulate human behaviors. These agents lead their own lives, going through daily routines such as waking up, preparing breakfast, heading to work, and so on. They engage with other agents, retain memories, and strategize for the upcoming day.

“Generative Agents: Interactive Simulacra of Human Behavior”, https://arxiv.org/abs/2304.03442
https://gonzoml.substack.com/p/generative-agents-interactive-simulacra

83 of 109

Example: Generative Agents

The agents are hosted in a sandbox environment named Smallville, modeled after a small town.

This town comprises houses, a college, shops, parks, cafes, and more. Inside the houses, individual rooms are equipped with items like tables and wardrobes. The entire world is depicted as a tree structure, with its leaves representing individual objects.

84 of 109

Agent Architecture

85 of 109

Agent Architecture

The agent itself is an event-memory-augmented LLM.

The agent's architecture is built upon three �primary components:

  • Memory Stream: This stores the agent's life experiences.
  • Reflection: Synthesizes memories into higher-level conclusions.
  • Planning: Translates the agent's reasoning and the environment's state into high-level plans and detailed behaviors.

86 of 109

Memory Stream

The Memory Stream is the agent's experience database. Both the records and the agent's reasoning over them are in natural language, making use of the LLM.

A challenge arises when an agent partakes in numerous events. Retrieving the most pertinent memories becomes vital, or they could surpass the prompt's capacity. Summarization might not always achieve the desired outcome as crucial details could be overlooked. Moreover, the significance of information can vary depending on the situation. Hence, extracting relevant data is crucial.

Each memory object contains textual content, its creation date, and the date of the last access.

The most elementary memory type is an 'observation,' representing what the agent directly perceives. This could be the actions of the agent itself or the behaviors observed in other agents or objects.

87 of 109

Memory Stream

Here is an example memory stream from Isabella Rodriguez, who works at a cafe:

(1) Isabella Rodriguez is setting out the pastries
(2) Maria Lopez is studying for a Chemistry test while drinking coffee
(3) Isabella Rodriguez and Maria Lopez are conversing about planning a Valentine’s day party at Hobbs Cafe
(4) The refrigerator is empty.

88 of 109

Memory Stream

A special retrieval function takes an agent's current situation as input and returns a subset of events from its memory to pass to the LLM. The score of each memory item depends on three factors:

  1. Recency: This is an exponentially decaying freshness score, calculated based on the last time an item was accessed.
  2. Importance: This score comes from the LLM, rated on a scale of 1 to 10. Here, 1 indicates something mundane and minor (like brushing teeth), while 10 represents something significant (like getting a divorce or entering college).
  3. Relevance: It's determined using cosine similarity between the embeddings (obtained from the LLM) of the query and the memory item.

All three factors are normalized to a range of [0, 1] and are weighted and summed (with equal weights in this study). Only the best memory candidates that fit within the prompt are retained.
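
A sketch of this scoring, following the description above; the decay rate and the normalization choices are illustrative:

```python
# Retrieval score = recency + importance + relevance, each normalized to
# [0, 1] and summed with equal weights, as in the paper.
import time
import numpy as np

def retrieval_score(memory, query_vec, now=None, decay=0.995):
    now = now or time.time()
    hours_since_access = (now - memory["last_access"]) / 3600
    recency = decay ** hours_since_access                  # already in (0, 1]
    importance = (memory["importance"] - 1) / 9            # map 1..10 -> 0..1
    v = memory["embedding"]
    relevance = float(v @ query_vec /
                      (np.linalg.norm(v) * np.linalg.norm(query_vec) + 1e-9))
    relevance = (relevance + 1) / 2                        # map [-1, 1] -> [0, 1]
    return recency + importance + relevance                # equal weights

def retrieve_memories(memories, query_vec, k=10):
    return sorted(memories, key=lambda m: retrieval_score(m, query_vec),
                  reverse=True)[:k]  # keep only what fits within the prompt
```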

89 of 109

Memory Stream

90 of 109

Reflection - A Higher-level Memory

After observations, reflections are a second, more abstract type of memory generated by the agent. These reflections are also stored in the memory stream and are retrieved in the same manner.

Reflections are created periodically, especially when the combined importance scores of the agent's recent events exceed a certain threshold. In practical terms, this happens about two or three times a day.

91 of 109

Reflection - A Higher-level Memory

To create a reflection via the LLM, the 100 most recent memories of the agent are taken and sent to the LLM with the prompt: "Given only the information above, what are the 3 most salient high-level questions we can answer about the subjects in the statements?".

Based on the retrieved questions, relevant memories are recalled, and insights are generated using the prompt: "What 5 high-level insights can you infer from the above statements? (example format: insight (because of 1, 5, 3))".

For instance, the outcome might be something like: "Klaus Mueller is dedicated to his research on gentrification (because of 1, 2, 8, 15)."

Reflections can also be generated based on previous reflections.
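
A sketch of the reflection step, following the prompts quoted above and assuming the chat() helper from the earlier sketches; the paper retrieves relevant memories per generated question, which is simplified here:

```python
# Reflection sketch: question the recent memories, then distill insights.
def reflect(memory_texts):
    recent = "\n".join(memory_texts[-100:])  # the 100 most recent memories
    questions = chat(f"{recent}\nGiven only the information above, what are "
                     "the 3 most salient high-level questions we can answer "
                     "about the subjects in the statements?")
    insights = chat(f"{recent}\n{questions}\nWhat 5 high-level insights can "
                    "you infer from the above statements? (example format: "
                    "insight (because of 1, 5, 3))")
    return insights.splitlines()  # stored back into the memory stream
```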

92 of 109

Reflection - A Higher-level Memory

93 of 109

Planning and Reacting

Planning ensures consistency and believable behavior. Without plans, if you constantly ask the LLM what the agent should do, there's a chance it might suggest having lunch at noon, then again half an hour later, and once more afterward.

Plans are also stored in the memory stream and are retrieved when relevant.

The approach to generating plans is top-down and recursive. Initially, a broad plan for the day is generated with 5-8 points, based on the agent's general description and a summary of the previous day's experience. The plan points are then detailed at an hourly resolution and later refined to intervals of 5-15 minutes.

94 of 109

Planning and Reacting

The agent constantly perceives the world, saves it in the memory stream, and decides via the LLM whether to continue following the plan or react to something.

When querying the LLM, relevant context about the observed entity is also considered. This is achieved using prompts like “What is [observer]’s relationship with the [observed entity]?” and “[Observed entity] is [action status of the observed entity]” to recall and summarize pertinent memories.

If the LLM suggests a reaction, the plan is regenerated accordingly. If an interaction between two agents is implied, their dialogue is generated, influenced by their memories of each other.

95 of 109

Planning and Reacting

96 of 109

Simulation

In the town of Smallville, there are 25 agents. The identity of each agent is described by a single paragraph of text in English.

Agents can perform actions and communicate with other agents. At every time tick of the sandbox, an agent produces a text that describes its current action.

Agents communicate with each other in English. They're aware of the presence of other agents nearby and decide whether to walk past or engage in conversation. An agent can move around the world, enter buildings, and approach other agents. If an agent's model directs it to a specific place in Smallville, the environment calculates the route, and the agent follows it.

The simulation operator can engage in conversation with an agent, as well as issue directives through an "internal voice". Users can enter Smallville as agents, either as new agents or as existing ones.

97 of 109

98 of 109

Behaviours: agent coordination

99 of 109

* Translation Agency

100 of 109

Example: Translation Agency

In this study, the authors assembled a virtual (for now) multi-agent company called TransAgents, specializing in literary translation.

Roles in the company: CEO, Senior Editor, Junior Editor, Translator, Localization Specialist, Proofreader.

“(Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts”, https://arxiv.org/abs/2405.11804
https://gonzoml.substack.com/p/perhaps-beyond-human-translation

101 of 109

102 of 109

Agent profiles

The profiles include attributes related to translation (language knowledge, experience, education, rate per word) as well as non-translation-related attributes (hobbies, persona, age, gender, nationality).

“(Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts”, https://arxiv.org/abs/2405.11804
https://gonzoml.substack.com/p/perhaps-beyond-human-translation

103 of 109

The workflow: Preparation phase

The company's process is two-tiered, consisting of preparation and execution phases.

1. Preparation phase: A human client comes with a request for literary text translation. At this moment, the appointed CEO agent selects (using self-reflection) the Senior Editor suitable for the client's specific request.

For translation, a translation guideline of five components needs to be prepared, consisting of a glossary, summary, tone, style, and target audience. The Senior Editor coordinates a Junior Editor and compiles the guideline.

The guideline is used as an essential part of the prompts for all roles involved in the subsequent translation, ensuring consistency and coherence throughout the work.

104 of 109

The workflow: Execution phase

2. Execution phase: This phase is divided into four sub-stages:

  • Translation
  • Cultural adaptation
  • Proofreading
  • Final review

The first three operate in Trilateral Collaboration mode, with the Translator, Localization Specialist, and Proofreader as the Action agents, and the Junior Editor and Senior Editor as the Critique and Judgment agents. The actor produces the result; the critic critiques it and adds to the interaction history, which the actor can take into account when generating an updated response in the next iteration. The judge, at the end of each iteration after the first, reviews the response and decides whether to continue the discussion or conclude it.

105 of 109

The workflow: Execution phase

2. Execution phase

The final stage is the final review, where the Senior Editor evaluates the quality of each chapter and how one chapter transitions to the next.

The Judgment agent is considered crucial for maintaining translation quality because models do not handle long contexts well, and the meaning of translations starts to deviate from the original after several review iterations. The judge does not use the entire review history, thus helping maintain the final quality.

106 of 109

Results

TransAgents scored significantly lower on automatic metrics (d-BLEU), but won in evaluations based on human and LLM preferences.

107 of 109

* Resources

108 of 109

Resources

The field is highly dynamic and evolves very quickly. There are many other interesting multi-agent examples. Just a few relevant ones:

109 of 109

Thanks!