1 of 109

From LLMs to Agents

Grigory Sapunov

Oxford LLMs / 2024.09.24

gs@inten.to

2 of 109

Prompting is not enough

  • You can do many things with advanced prompting, from zero-shot instructions to many-shot examples and involved schemes like Chain-of-Thought
  • You can also fine-tune an LLM on your own data
  • But that’s not enough.

3 of 109

Prompting is not enough

You may need more complicated workflows with external orchestration, for example, involving:

  • multiple LLM runs (Self-Consistency)
  • controlling multi-prompt sequences (steps of a pipeline/workflow)
  • advanced prompting techniques (Tree-of-Thought)
  • using LLM as a part of a more complex system (LLM Programs)
  • using external datasources (RAG)
  • applying style-guides, policies, constitutions
  • processing large documents (longer than context size)
  • using tools (Function calling)
  • estimating the quality of the answer and applying specific logic (retry/reformulate/…)
  • reflecting on the LLM output with another LLM (reflection)
  • using multi-agent workflows with multiple LLM-based (and/or code-based) agents and static or dynamic communication graphs

4 of 109

Advanced LLM features

5 of 109

* Embedding models

6 of 109

Word Embeddings (Word2Vec)

7 of 109

Sentence Embeddings

8 of 109

Embedding models

  • Text embeddings measure the relatedness (semantic similarity) of text strings.
  • Embeddings are commonly used for:
    • Search (where results are ranked by relevance to a query string)
    • Clustering (where text strings are grouped by similarity)
    • Recommendations (where items with related text strings are recommended)
    • Anomaly detection (where outliers with little relatedness are identified)
    • Diversity measurement (where similarity distributions are analyzed)
    • Classification (where text strings are classified by their most similar label)
  • An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
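
As a sketch of how this works in code (assuming the OpenAI Python SDK and its text-embedding-3-small model; the documents and query here are invented for illustration), search reduces to cosine similarity over vectors:

```python
# Minimal semantic-search sketch. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; any embedding model works the same way.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Return a matrix with one embedding vector per input string."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

docs = ["How do I reset my password?",
        "Shipping takes 3-5 business days.",
        "Our office is closed on weekends."]
doc_vecs = embed(docs)
q = embed(["I forgot my login credentials"])[0]

# Cosine similarity: a small angle (large cosine) suggests high relatedness.
sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
print(docs[int(np.argmax(sims))])  # expected: the password-reset document
```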

9 of 109

Embedding models

10 of 109

MTEB: Massive Text Embedding Benchmark

11 of 109

Topic Modeling: BERTopic + Embeddings

12 of 109

* Multimodality

13 of 109

Multimodality

  • Some LLMs support modalities other than text, e.g. images (then they are VLMs -- Visual Language Models), video, speech, etc.
  • It is especially useful with a large context (say, Gemini 1.5 with 1M tokens)
  • You can use all the techniques you used with LLMs: few-shot learning, Chain-of-Thought, structured outputs, etc.

14 of 109

GPT-4v

15 of 109

Gemini 1.5 Pro

16 of 109

Gemini 1.5 Pro

17 of 109

* Tools / Function calling

18 of 109

Using Tools / Function calling

  • Function calling allows you to connect models to external tools and systems.
  • This is useful for many things such as empowering AI assistants with capabilities, or building deep integrations between your applications and the models.
  • For example, enabling assistants to:
    • fetch data from an internal system
    • take actions, say, schedule meetings based on user preferences and calendar availability.
    • perform computations or run external code.
    • build rich workflows with data extraction, conversion, saving
    • modify your applications' UI: updating on user input, for example, rendering a pin on a map.
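
A minimal sketch of the round trip, assuming the OpenAI Chat Completions API; the get_weather tool and its schema are invented for illustration:

```python
# Function-calling sketch: the model decides to call a tool, we execute it,
# and feed the result back so the model can produce the final answer.
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city):
    return f"Sunny, 21C in {city}"  # stub; a real tool would call a weather API

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Oxford?"}]
resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = resp.choices[0].message

if msg.tool_calls:  # the model chose to call the tool
    call = msg.tool_calls[0]
    result = get_weather(**json.loads(call.function.arguments))
    messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": result}]
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```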

19 of 109

Using Tools / Function calling

20 of 109

* Dealing with non-English content

21 of 109

Dealing with non-English content

  • Datasets are dominated by English, and LLMs are trained mostly on English data.
  • So, not surprisingly, models solve tasks better in English than in other languages, especially low-resource ones.
  • A solution for non-English content: translate it first into English (with MT or even an LLM), solve the task in English, then translate the result back.
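
A minimal sketch of this translate-solve-translate pattern; the chat() helper wraps the OpenAI SDK for concreteness, but any LLM API would do, and the prompts are illustrative:

```python
# Translate-solve-translate sketch: work in English, answer in the source language.
from openai import OpenAI

client = OpenAI()

def chat(prompt):
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def solve_via_english(task_text, source_lang):
    english = chat(f"Translate this {source_lang} text to English:\n{task_text}")
    answer = chat(f"Solve the following task:\n{english}")
    return chat(f"Translate this text to {source_lang}:\n{answer}")
```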

22 of 109

Google PaLM 2

23 of 109

Meta LLaMa 2

24 of 109

LLMs may work better with translation

“Do Multilingual Language Models Think Better in English?“ https://arxiv.org/abs/2308.01223

25 of 109

Multilingual GenAI Summarization task

26 of 109

> Additional considerations: Token economy

27 of 109

All languages are NOT (tokenized) equal

Because of tokenization, the same text may require fewer tokens in English than in, say, Korean.
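
A quick way to see the gap yourself, using the tiktoken library with the cl100k_base encoding; the sample sentences are illustrative translations of the same phrase:

```python
# Count tokens for the same phrase in different languages.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "English": "Hello, how are you today?",
    "Korean": "안녕하세요, 오늘 기분이 어떠세요?",
}
for lang, text in samples.items():
    print(lang, len(enc.encode(text)))  # non-English typically needs more tokens
```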

28 of 109

English has the shortest median token length

29 of 109

Tokenization in different languages

Prompts and responses in non-English languages consume up to 18x more tokens, resulting in higher costs and delays.

Example:

English - 19 tokens

Korean - 97 tokens (5x)

Malayalam - 339 tokens (18x)

(Source: Intento Inc. AI Hub, September 2022)

30 of 109

Tokenization: conclusion

Because of tokenization, language has a significant influence on:

  • Price (because of per-token pricing)
  • Response time (because of iterative, token-by-token generation)
  • Ability to process long documents (because the effective context length is smaller with poor tokenization)

31 of 109

Advanced single-LLM uses

32 of 109

Improving quality - incorporating custom data

  • As we’ve seen, providing lots of context in the input may not work well because attention degrades over long inputs (it’s also very expensive!).
  • Solution: store all the context in a database and pull in only the relevant pieces of information.
  • Examples:
    • Generate a draft answer to a customer query by pulling the 3 most similar answers from the history
    • Generate a translation by pulling similar translations from Translation Memory
  • “Retrieval Augmented Generation” or RAG
  • Also helps to incorporate recent data (overcoming the knowledge cutoff date), or data unknown to the LLM
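
A minimal RAG sketch, reusing the embed() and chat() helpers from the earlier sketches; the prompt wording is illustrative:

```python
# RAG sketch: retrieve the top-k most similar documents, then ground the
# answer in them. Assumes embed() and chat() as defined in earlier sketches.
import numpy as np

def retrieve(query, docs, doc_vecs, k=3):
    q = embed([query])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

def rag_answer(query, docs, doc_vecs):
    context = "\n---\n".join(retrieve(query, docs, doc_vecs))
    return chat(f"Answer using only the context below.\n"
                f"Context:\n{context}\n\nQuestion: {query}")
```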

33 of 109

RAG: Retrieval-Augmented Generation

34 of 109

Improving quality - advanced techniques

  • Humans use two types of thinking: System 1 (quick, intuitive), and System 2 (takes time, thoughtful).
  • Today, LLMs have only System 1: they can’t deliberate for a while, they immediately produce an answer. With OpenAI’s o1 this is changing!
  • Solution - make them think aloud, first generating steps of the solution, and then generating the final result step by step.
  • “Chain of Thought” (CoT) prompting
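
In its simplest zero-shot form, using the chat() helper sketched earlier (the question is a stock example):

```python
# Zero-shot Chain-of-Thought: ask the model to reason before answering.
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")
print(chat(f"{question}\nLet's think step by step, "
           f"then give the final answer on the last line."))
```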

35 of 109

“Chain of Thought” prompting

[Diagram: the prompt “Build a plan for task X and solve step-by-step” goes to the Large Language Model; the completion contains a plan, steps 1-3, and the final answer.]

36 of 109

Chain of Thought (CoT)

“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, https://arxiv.org/abs/2201.11903

37 of 109

“To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning” https://arxiv.org/abs/2409.12183

38 of 109

Self-Consistency (SC)

“Self-Consistency Improves Chain of Thought Reasoning in Language Models”, https://arxiv.org/abs/2203.11171
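
A sketch of Self-Consistency, assuming the OpenAI client from the earlier sketches; the "Answer:" extraction convention is an illustrative choice:

```python
# Self-Consistency: sample several CoT answers at nonzero temperature,
# then take a majority vote over the final answers.
from collections import Counter

def sample_cot(question, temperature=0.7):
    resp = client.chat.completions.create(
        model="gpt-4o", temperature=temperature,
        messages=[{"role": "user", "content":
                   f"{question}\nThink step by step. End with 'Answer: <x>'."}])
    return resp.choices[0].message.content

def self_consistent(question, n=5):
    finals = [sample_cot(question).rsplit("Answer:", 1)[-1].strip()
              for _ in range(n)]
    return Counter(finals).most_common(1)[0][0]  # majority vote
```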

39 of 109

LLM Programs

  • LLMs are embedded into a program or algorithm.
  • The LLM is NOT responsible for maintaining the current state of the program (i.e. its context)
  • At each step of the program, the LLM is presented only with a step-specific prompt and context.

“Large Language Model Programs”, https://arxiv.org/abs/2305.05364
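
A minimal sketch of the idea; the summarize-critique-rewrite pipeline is invented for illustration, and chat() is the helper sketched earlier:

```python
# LLM Program sketch: the surrounding code, not the LLM, owns the program
# state; each step sees only a step-specific prompt.
def llm_program(document):
    state = {"doc": document}                       # program state lives in code
    state["summary"] = chat(f"Summarize:\n{state['doc']}")
    state["critique"] = chat(f"List flaws in this summary:\n{state['summary']}")
    return chat(f"Rewrite the summary fixing these flaws.\n"
                f"Summary:\n{state['summary']}\nFlaws:\n{state['critique']}")
```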

40 of 109

Tree-of-Thought (ToT)

“Large Language Model Guided Tree-of-Thought”, https://arxiv.org/abs/2305.08291

41 of 109

Graph of Thoughts (GoT)

“Graph of Thoughts: Solving Elaborate Problems with Large Language Models”, https://arxiv.org/abs/2308.09687

42 of 109

Everything of Thoughts (XoT)

43 of 109

“The Prompt Report: A Systematic Survey of Prompting Techniques”, https://arxiv.org/abs/2406.06608

44 of 109

Agents

45 of 109

What is an agent?

46 of 109

What is an agent?

  • There are many different definitions.
  • {LLM/AI} Agents are systems that use an LLM as a reasoning engine to determine which actions to take and what the inputs to those actions should be. The results of those actions can then be fed back into the agent as observations, and it determines whether more actions are needed or whether it is okay to finish.
  • Typically, an Agent is an LLM equipped with Tools, Memory, and some orchestration code (say, to call tools, manage memory, run the agentic loop, etc.)
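
A minimal sketch of such an agentic loop, assuming the OpenAI client from earlier sketches; the JSON protocol and the stub tools are invented for illustration (a production agent would use native function calling, as shown earlier):

```python
# Agentic loop sketch: the LLM decides between calling a tool and finishing.
import json

TOOLS = {"search": lambda q: f"results for {q}",          # stub tool
         "calculator": lambda expr: str(eval(expr))}      # unsafe outside a sandbox

SYSTEM = ("Decide the next step. Reply with JSON only: "
          '{"tool": "search"|"calculator", "input": "..."} or {"final": "..."}')

def run_agent(task, max_steps=5):
    history = [{"role": "system", "content": SYSTEM},
               {"role": "user", "content": task}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(model="gpt-4o", messages=history)
        step = json.loads(resp.choices[0].message.content)  # may need retries
        if "final" in step:
            return step["final"]
        observation = TOOLS[step["tool"]](step["input"])    # act
        history.append({"role": "assistant", "content": json.dumps(step)})
        history.append({"role": "user", "content": f"Observation: {observation}"})
    return "No answer within max_steps."
```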

47 of 109

What is an agent?

48 of 109

ReAct framework

  • ReAct (Reason + Act) is a paradigm where LLMs are used to generate both reasoning traces and task-specific actions in an interleaved manner.
  • The ReAct framework can allow LLMs to interact with external tools to retrieve additional information that leads to more reliable and factual responses.
  • However, see “On the Brittle Foundations of ReAct Prompting for Agentic Large Language Models”, https://arxiv.org/abs/2405.13966

“ReAct: Synergizing Reasoning and Acting in Language Models”, https://arxiv.org/abs/2210.03629
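
A compact sketch of a ReAct-style loop in the Thought/Action/Observation format, using the chat() helper sketched earlier; the search action and the parsing convention are illustrative:

```python
# ReAct sketch: the model interleaves Thought and Action; the code executes
# actions and appends Observations to the growing trace.
import re

def react(question, search, max_steps=5):
    trace = (f"Answer by interleaving Thought, Action, Observation.\n"
             f"Actions: search[<query>] or finish[<answer>].\n"
             f"Question: {question}\nThought:")
    for _ in range(max_steps):
        step = chat(trace)                        # model emits Thought + Action
        trace += step
        m = re.search(r"(search|finish)\[(.*?)\]", step)
        if m and m.group(1) == "finish":
            return m.group(2)
        if m:                                     # act, then append Observation
            trace += f"\nObservation: {search(m.group(2))}\nThought:"
    return "No answer within max_steps."
```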

49 of 109

Why agents?

LLMs are not enough for some cases:

  • When they do not know some data
    • Need an external source, a database or search
  • When they need to do some multi-step reasoning and execution
    • Need separate runs with different prompts
  • When they need to use some external tools, e.g.
    • LLMs are bad at counting and arithmetic, so even a simple calculator can help
    • An LLM may need to run Python code
    • Some domain-specific tools (e.g. chemistry- or biology-related)
  • When they need to maintain context
    • Need to have memory

Some future LLMs (GPT-5+) may be better in some of these aspects, but that is still not enough (they will not have access to some specific tools).

50 of 109

Example: a conversational agent with search

An LLM-based conversational agent equipped with a generic search tool. You can ask the agent questions and have conversations with it. The agent uses an LLM to decide which tools to use and when.

Components:

  • General LLM, say, OpenAI GPT-4o or Anthropic Claude 3.5
  • Tools (each consists of: the name of the tool, a description of what the tool does, a JSON schema defining the inputs to the tool, and a function to call):
    • Web search
    • Weather API
  • Memory to store previous interactions (just history of messages)
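
A sketch of how these components might map to code; the weather tool is a stub and the field names mirror the list above:

```python
# One tool definition, structured exactly as the slide describes.
def get_weather(location):
    return f"Sunny, 21C in {location}"  # stub; call a real weather API here

weather_tool = {
    "name": "get_weather",                            # the name of the tool
    "description": "Current weather for a location",  # what the tool does
    "parameters": {                                   # JSON schema of inputs
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
    "function": get_weather,                          # the function to call
}
memory = []  # just the history of messages, appended after every turn
```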

51 of 109

Example: ChemCrow

ChemCrow is an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery, and materials design.

  • Uses OpenAI GPT-4
  • Integrates 18 expert-designed tools
  • Workflow reflects the ReAct framework
  • Implemented in LangChain (https://github.com/ur-whitelab/chemcrow-public; some tools are missing in the repo)

“ChemCrow: Augmenting large-language models with chemistry tools”, https://arxiv.org/abs/2304.05376

52 of 109

“ChemCrow: Augmenting large-language models with chemistry tools”, https://arxiv.org/abs/2304.05376

53 of 109

Multi-Agents

54 of 109

Agent vs. Multi-agent

  • We can define an agent as an entity that can reason, act, communicate and adapt to solve problems.
  • By extension, we define a multi-agent system as an application that involves a group of agents, each with diverse roles (personalities, skills) and capabilities (language models, tools, and objectives), collaborating to solve tasks.

55 of 109

Why multi-agents?

Some complex tasks have properties that lend themselves really well to a multi-agent solution approach:

  • Task Decomposition. Requires planning, which involves breaking the task down into steps that must be completed successfully.
  • Instructional Complexity. Lengthy instructions or context present a significant challenge for single-agent systems due to known difficulties with long or complex instructions.
  • Diverse Expertise. Benefits from the collaboration of multiple entities, each bringing diverse skills/expertise, tools, or perspectives to the task.
  • Separation of Concerns: Separate prompts can give better results.
  • Emergent Solutions. Problems where the solution arises from the interactions among multiple agents, with the exposure to the outputs from previous steps and the state and process of each agent enhancing the solution.
  • Autonomous Exploratory Problem-Solving. The task solution is not known beforehand, may take various forms, and results from exploring a vast search space (optimization).

56 of 109

57 of 109

58 of 109

59 of 109

* Translation Agent

60 of 109

Example: Agentic translation with reflection

A multi-agent system for machine translation. It contains two agents: a translator and a reviewer.

3-step workflow:

  1. Translator translates a text from source_language to target_language
  2. Reviewer reflects on the translation to come up with constructive suggestions for improving it
  3. Translator uses the suggestions to improve the translation.

These steps can be repeated if necessary, as in the sketch below.
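
A sketch of this workflow, using the chat() helper from the earlier sketches; the prompts are illustrative:

```python
# Two-agent translation with reflection: translate, review, improve.
def agentic_translate(text, src, tgt, rounds=1):
    translation = chat(f"Translate this {src} text to {tgt}:\n{text}")
    for _ in range(rounds):
        suggestions = chat(
            f"You are a {tgt} reviewer. Give constructive suggestions to "
            f"improve this translation of the {src} source.\n"
            f"Source:\n{text}\nTranslation:\n{translation}")
        translation = chat(
            f"Improve the translation using the suggestions.\n"
            f"Source:\n{text}\nTranslation:\n{translation}\n"
            f"Suggestions:\n{suggestions}")
    return translation
```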

61 of 109

Example: Reflection agent

62 of 109

Source: English

<article> In the world of softwarre development, you have to be on your toes all the time. The CEO, who has been leading the team for years, said, <q>"We need to push the envelope and think outside the box!"</q> This means our team must work smarter, not harder. Sometimes, it's like trying to find a needle in a haystack. <b>Can you imagine how challenging that is?</b> Our developers often feel like they are walking on eggshells. But, if we nail it, the sky's the limit. Remember, there is no "I" in "team." The project manager comanted, "We have to stay agile and responsive." <mark>With the new deadline looming, the presure is on.</mark> </article>

63 of 109

Translation: EN-RU (DeepL)

<article>В мире софтварной разработки нужно постоянно быть начеку. Генеральный директор, который руководит командой уже много лет, сказал: <q>"Нам нужно расширять границы и мыслить нестандартно!"<q> Это значит, что наша команда должна работать умнее, а не сложнее. Иногда это похоже на попытку найти иголку в стоге сена. <Представляете, как это сложно? Нашим разработчикам часто кажется, что они ходят по яичной скорлупе. Но если мы справимся с этой задачей, то небо - это предел. Помните, что в слове "команда" нет слова "я" Руководитель проекта сказал: "Мы должны оставаться проворными и оперативными" <марка> В связи с приближением нового дедлайна давление возрастает.</марка> </article

64 of 109

Reflection: LLM (GPT-4o)

1. Correct the spelling of "софтварной" to "программной" to use the more accurate and commonly accepted term in Russian.
2. Replace "расширять границы" with "выходить за рамки" to better capture the idiomatic expression "push the envelope."
3. Change "сложнее" to "тяжелее" to better match the meaning of "harder" in this context.
4. Add a closing quotation mark after "нестандартно!" to correct the punctuation.
5. Change "<Представляете, как это сложно?" to "<b>Представляете, как это сложно?</b>" to match the bold formatting in the source text.
6. Add a period after "я" in "в слове 'команда' нет слова 'я'" to correct the punctuation.
7. Change "Руководитель проекта сказал" to "Руководитель проекта прокомментировал" to better match the source text's "commented."
8. Add a closing quotation mark after "оперативными" to correct the punctuation.
9. Replace "<марка>" with "<mark>" and "</марка>" with "</mark>" to match the HTML tag in the source text.
10. Correct the spelling of "давление возрастает" to "давление усиливается" to better convey the increasing pressure.

65 of 109

Improved translation: LLM (GPT-4o)

<article>В мире программной разработки нужно постоянно быть начеку. Генеральный директор, который руководит командой уже много лет, сказал: <q>"Нам нужно выходить за рамки и мыслить нестандартно!"</q> Это значит, что наша команда должна работать умнее, а не тяжелее. Иногда это похоже на попытку найти иголку в стоге сена. <b>Представляете, как это сложно?</b> Нашим разработчикам часто кажется, что они ходят по яичной скорлупе. Но если мы справимся с этой задачей, то небо - это предел. Помните, что в слове "команда" нет слова "я". Руководитель проекта прокомментировал: "Мы должны оставаться проворными и оперативными." <mark>В связи с приближением нового дедлайна давление усиливается.</mark> </article>

66 of 109

Agent Design Patterns

67 of 109

“Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents”, https://arxiv.org/abs/2405.10467

68 of 109

69 of 109

70 of 109

71 of 109

72 of 109

73 of 109

74 of 109

Agentic Frameworks

75 of 109

LangChain + LangGraph

76 of 109

LangGraph Studio

77 of 109

AutoGen

AutoGen is an open-source programming framework for building AI agents and facilitating cooperation among multiple agents to solve tasks. AutoGen aims to provide an easy-to-use and flexible framework for accelerating development and research on agentic AI, like PyTorch for Deep Learning. It offers features such as agents that can converse with other agents, LLM and tool use support, autonomous and human-in-the-loop workflows, and multi-agent conversation patterns.
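
A minimal two-agent sketch based on the classic pyautogen API; treat the exact configuration keys as assumptions and check the current docs:

```python
# Two AutoGen agents conversing: an LLM-backed assistant and a user proxy
# running fully autonomously (no human input, no code execution).
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o"}]}  # API key from environment
assistant = AssistantAgent("assistant", llm_config=llm_config)
user = UserProxyAgent("user", human_input_mode="NEVER",
                      code_execution_config=False)
user.initiate_chat(assistant, message="Summarize the ReAct paper in 3 bullets.")
```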

78 of 109

AutoGen Studio

79 of 109

CrewAI

80 of 109

Case studies

81 of 109

* Generative Agents

82 of 109

Example: Generative Agents

Researchers from Stanford and Google have developed generative agents that simulate human behaviors. These agents lead their own lives, going through daily routines such as waking up, preparing breakfast, heading to work, and so on. They engage with other agents, retain memories, and strategize for the upcoming day.

“Generative Agents: Interactive Simulacra of Human Behavior”, https://arxiv.org/abs/2304.03442
https://gonzoml.substack.com/p/generative-agents-interactive-simulacra

83 of 109

Example: Generative Agents

The agents are hosted in a sandbox environment named Smallville, modeled after a small town.

This town comprises houses, a college, shops, parks, cafes, and more. Inside the houses, individual rooms are equipped with items like tables and wardrobes. The entire world is depicted as a tree structure, with its leaves representing individual objects.

84 of 109

Agent Architecture

85 of 109

Agent Architecture

The agent itself is an event-memory-augmented LLM.

The agent's architecture is built upon three �primary components:

  • Memory Stream: This stores the agent's life experiences.
  • Reflection: Synthesizes memories into higher-level conclusions.
  • Planning: Translates the agent's reasoning and the environment's state into high-level plans and detailed behaviors.

86 of 109

Memory Stream

The Memory Stream is the agent's experience database. Both the records and the agent's reasoning over them are in natural language, making use of the LLM.

A challenge arises when an agent partakes in numerous events. Retrieving the most pertinent memories becomes vital, or they could surpass the prompt's capacity. Summarization might not always achieve the desired outcome as crucial details could be overlooked. Moreover, the significance of information can vary depending on the situation. Hence, extracting relevant data is crucial.

Each memory object contains textual content, its creation date, and the date of the last access.

The most elementary memory type is an 'observation,' representing what the agent directly perceives. This could be the actions of the agent itself or the behaviors observed in other agents or objects.

87 of 109

Memory Stream

Here is an example memory stream from Isabella Rodriguez, who works at a cafe:

(1) Isabella Rodriguez is setting out the pastries
(2) Maria Lopez is studying for a Chemistry test while drinking coffee
(3) Isabella Rodriguez and Maria Lopez are conversing about planning a Valentine’s day party at Hobbs Cafe
(4) The refrigerator is empty.

88 of 109

Memory Stream

A special retrieval function takes an agent's current situation as input and returns a subset of events from its memory to pass to the LLM. The score of each memory item depends on three factors:

  1. Recency: This is an exponentially decaying freshness score, calculated based on the last time an item was accessed.
  2. Importance: This score comes from the LLM, rated on a scale of 1 to 10. Here, 1 indicates something mundane and minor (like brushing teeth), while 10 represents something significant (like getting a divorce or entering college).
  3. Relevance: It's determined using cosine similarity between the embeddings (obtained from the LLM) of the query and the memory item.

All three factors are normalized to a range of [0, 1] and are weighted and summed (with equal weights in this study). Only the best memory candidates that fit within the prompt are retained.
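
A sketch of this scoring, following the description above; the decay rate and the normalization choices are illustrative:

```python
# Retrieval score = recency + importance + relevance, each normalized to
# [0, 1] and summed with equal weights, as in the paper.
import time
import numpy as np

def retrieval_score(memory, query_vec, now=None, decay=0.995):
    now = now or time.time()
    hours_since_access = (now - memory["last_access"]) / 3600
    recency = decay ** hours_since_access                  # already in (0, 1]
    importance = (memory["importance"] - 1) / 9            # map 1..10 -> 0..1
    v = memory["embedding"]
    relevance = float(v @ query_vec /
                      (np.linalg.norm(v) * np.linalg.norm(query_vec) + 1e-9))
    relevance = (relevance + 1) / 2                        # map [-1, 1] -> [0, 1]
    return recency + importance + relevance                # equal weights

def retrieve_memories(memories, query_vec, k=10):
    return sorted(memories, key=lambda m: retrieval_score(m, query_vec),
                  reverse=True)[:k]  # keep only what fits within the prompt
```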

89 of 109

Memory Stream

90 of 109

Reflection - A Higher-level Memory

After observations, reflections are a second, more abstract type of memory generated by the agent. These reflections are also stored in the memory stream and are retrieved in the same manner.

Reflections are created periodically, especially when the combined importance scores of the agent's recent events exceed a certain threshold. In practical terms, this happens about two or three times a day.

91 of 109

Reflection - A Higher-level Memory

To create a reflection via the LLM, the 100 most recent memories of the agent are taken and sent to the LLM with the prompt: "Given only the information above, what are the 3 most salient high-level questions we can answer about the subjects in the statements?".

Based on the retrieved questions, relevant memories are recalled, and insights are generated using the prompt: "What 5 high-level insights can you infer from the above statements? (example format: insight (because of 1, 5, 3))".

For instance, the outcome might be something like: "Klaus Mueller is dedicated to his research on gentrification (because of 1, 2, 8, 15)."

Reflections can also be generated based on previous reflections.
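
A sketch of the reflection step, following the prompts quoted above and assuming the chat() helper from the earlier sketches; the paper retrieves relevant memories per generated question, which is simplified here:

```python
# Reflection sketch: question the recent memories, then distill insights.
def reflect(memory_texts):
    recent = "\n".join(memory_texts[-100:])  # the 100 most recent memories
    questions = chat(f"{recent}\nGiven only the information above, what are "
                     "the 3 most salient high-level questions we can answer "
                     "about the subjects in the statements?")
    insights = chat(f"{recent}\n{questions}\nWhat 5 high-level insights can "
                    "you infer from the above statements? (example format: "
                    "insight (because of 1, 5, 3))")
    return insights.splitlines()  # stored back into the memory stream
```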

92 of 109

Reflection - A Higher-level Memory

93 of 109

Planning and Reacting

Planning ensures consistency and believable behavior. Without plans, if you constantly ask the LLM what the agent should do, there's a chance it might suggest having lunch at noon, then again half an hour later, and once more afterward.

Plans are also stored in the memory stream and are retrieved when relevant.

The approach to generating plans is top-down and recursive. Initially, a broad plan for the day is generated with 5-8 points, based on the agent's general description and a summary of the previous day's experience. The plan points are then detailed at an hourly resolution and later refined to intervals of 5-15 minutes.

94 of 109

Planning and Reacting

The agent constantly perceives the world, saves it in the memory stream, and decides via the LLM whether to continue following the plan or react to something.

When querying the LLM, relevant context about the observed entity is also considered. This is achieved using prompts like “What is [observer]’s relationship with the [observed entity]?” and “[Observed entity] is [action status of the observed entity]” to recall and summarize pertinent memories.

If the LLM suggests a reaction, the plan is regenerated accordingly. If an interaction between two agents is implied, their dialogue is generated, influenced by their memories of each other.

95 of 109

Planning and Reacting

96 of 109

Simulation

In the town of Smallville, there are 25 agents. The identity of each agent is described by a single paragraph of text in English.

Agents can perform actions and communicate with other agents. At every time tick of the sandbox, an agent produces a text that describes its current action.

Agents communicate with each other in English. They're aware of the presence of other agents nearby and decide whether to walk past or engage in conversation. An agent can move around the world, enter buildings, and approach other agents. If an agent's model directs it to a specific place in Smallville, the environment calculates the route, and the agent follows it.

The simulation operator can engage in conversation with an agent, as well as issue directives through an "internal voice". Users can enter Smallville as agents, either as new agents or as existing ones.

97 of 109

98 of 109

Behaviours: agent coordination

99 of 109

* Translation Agency

100 of 109

Example: Translation Agency

In this study, the authors assembled a virtual (for now) multi-agent company called TransAgents, specializing in literary translation.

Roles in the company: CEO, Senior Editor, Junior Editor, Translator, Localization Specialist, Proofreader.

“(Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts”, https://arxiv.org/abs/2405.11804
https://gonzoml.substack.com/p/perhaps-beyond-human-translation

101 of 109

102 of 109

Agent profiles

The profiles include attributes related to translation (language knowledge, experience, education, rate per word) as well as non-translation-related attributes (hobbies, persona, age, gender, nationality).

“(Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts”, https://arxiv.org/abs/2405.11804
https://gonzoml.substack.com/p/perhaps-beyond-human-translation

103 of 109

The workflow: Preparation phase

The company's process is two-tiered, consisting of preparation and execution phases.

1. Preparation phase: A human client comes with a request for literary text translation. At this moment, the appointed CEO agent selects (using self-reflection) the Senior Editor suitable for the client's specific request.

For translation, a translation guideline of five components needs to be prepared, consisting of a glossary, summary, tone, style, and target audience. The Senior Editor coordinates a Junior Editor and compiles the guideline.

The guideline is used as an essential part of the prompts for all roles involved in the subsequent translation, ensuring consistency and coherence throughout the work.

104 of 109

The workflow: Execution phase

2. Execution phase: This phase is divided into four sub-stages:

  • Translation
  • Cultural adaptation
  • Proofreading
  • Final review

The first three operate in Trilateral Collaboration mode, with the Translator, Localization Specialist, and Proofreader as the Action agents, and the Junior Editor and Senior Editor as the Critique and Judgment agents. The actor produces the result; the critic critiques it and adds to the interaction history, which the actor can take into account when generating an updated response in the next iteration. The judge, at the end of each iteration after the first, reviews the response and decides whether to continue the discussion or conclude it.

105 of 109

The workflow: Execution phase

2. Execution phase

The final stage is the final review, where the Senior Editor evaluates the quality of each chapter and how one chapter transitions to the next.

The Judgment agent is considered crucial for maintaining translation quality because models do not handle long contexts well, and the meaning of translations starts to deviate from the original after several review iterations. The judge does not use the entire review history, thus helping maintain the final quality.

106 of 109

Results

TransAgents scored significantly lower on automatic metrics (d-BLEU), but won in evaluations based on human and LLM preferences.

107 of 109

* Resources

108 of 109

Resources

The field is highly dynamic and evolves very quickly. There are many other interesting multi-agent examples. Just a few relevant ones:

109 of 109

Thanks!