From LLMs to Agents
Grigory Sapunov
Oxford LLMs / 2024.09.24
gs@inten.to
Prompting is not enough
You may need more complex workflows with external orchestration, involving, for example:
Advanced LLM features
* Embedding models
Word Embeddings (Word2Vec)
Sentence Embeddings
Embedding models
MTEB: Massive Text Embedding Benchmark
Topic Modeling: BERTopic + Embeddings
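A minimal sketch of how embedding-based retrieval works: documents and a query are mapped to vectors, and the nearest document is found by cosine similarity. The four-dimensional vectors below are toy values (real embedding models emit hundreds of dimensions).

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 4-dimensional "embeddings" (illustrative numbers, not model output).
doc_vectors = {
    "cat sat on the mat": [0.9, 0.1, 0.0, 0.2],
    "quarterly revenue grew": [0.0, 0.8, 0.6, 0.1],
}
query = [0.85, 0.15, 0.05, 0.25]  # pretend embedding of "a cat on a rug"

# Rank documents by similarity to the query and take the best match.
best = max(doc_vectors, key=lambda d: cosine_similarity(query, doc_vectors[d]))
print(best)  # the semantically closest document
```

The same scoring underlies semantic search, clustering, and topic modeling; benchmarks like MTEB compare models on exactly these tasks.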
* Multimodality
Multimodality
GPT-4v
Gemini 1.5 Pro
* Tools / Function calling
Using Tools / Function calling
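A sketch of the function-calling pattern: the application declares a tool schema, the model replies with a structured tool call, and the application executes it. The schema shape follows the OpenAI-style tools format; the weather tool and city are made-up examples, and the model response is stubbed.

```python
import json

# An OpenAI-style tool schema (the field names follow the function-calling
# convention; the weather tool itself is an illustrative example).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub; a real tool would call a weather API

# Suppose the model responded with this tool call (normally parsed from the
# API response rather than hard-coded):
tool_call = {"name": "get_weather", "arguments": json.dumps({"city": "Oxford"})}

# Dispatch the call and feed the result back to the model on the next turn.
registry = {"get_weather": get_weather}
result = registry[tool_call["name"]](**json.loads(tool_call["arguments"]))
print(result)
```

The key point is that the model never executes anything itself: it only emits a structured request, and the orchestration layer decides whether and how to run it.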
* Dealing with non-English content
Dealing with non-English content
Google PaLM 2
Meta LLaMa 2
LLMs may work better with translation
“Do Multilingual Language Models Think Better in English?”, https://arxiv.org/abs/2308.01223
Multilingual GenAI Summarization task
Additional considerations: Token economy
All languages are NOT (tokenized) equal
Because of tokenization, the same text may require fewer tokens in English than in, say, Korean.
English has the shortest median token length
Tokenization in different languages
Prompts and responses in non-English languages consume up to 18x more tokens, resulting in higher costs and delays.
Example:
English - 19 tokens
Korean - 97 tokens (5x)
Malayalam - 339 tokens (18x)
Source: Intento Inc. AI Hub, September 2022
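Exact token counts require the model's own tokenizer (e.g. the tiktoken library for OpenAI models), but UTF-8 byte length already hints at the gap, since byte-level BPE tokenizers start from UTF-8 bytes. A rough, tokenizer-free illustration:

```python
# Rough proxy: byte-level BPE tokenizers operate on UTF-8 bytes, so byte
# counts hint at why non-Latin scripts cost more tokens. Real counts must
# come from the model's tokenizer (e.g. tiktoken); the sentences are
# illustrative.
samples = {
    "English": "My hovercraft is full of eels.",
    "Korean": "내 호버크라프트는 장어로 가득 차 있어요.",
}
for lang, text in samples.items():
    print(lang, len(text), "chars,", len(text.encode("utf-8")), "bytes")
```

Each Hangul syllable takes three UTF-8 bytes versus one per ASCII letter, and the tokenizer's vocabulary covers English far better, compounding the difference into the multiples shown above.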
Tokenization: conclusion
Because of tokenization, language has a significant influence on token counts, and therefore on cost and latency.
Advanced single-LLM uses
Improving quality - incorporating custom data
RAG: Retrieval-Augmented Generation
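A minimal RAG sketch: retrieve the most relevant document, stuff it into the prompt, and generate. Retrieval here is naive keyword overlap and the LLM is a stub that echoes the retrieved context; a real pipeline would use embedding search (as above) and an actual model call.

```python
# Toy document store (illustrative content).
docs = [
    "Our refund policy allows returns within 30 days.",
    "The cafeteria opens at 8am on weekdays.",
]

def retrieve(query: str) -> str:
    # Naive keyword-overlap retrieval; production systems use embeddings.
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return max(docs, key=overlap)

def llm_stub(prompt: str) -> str:
    # Stand-in for a model call; here it just echoes the retrieved context.
    return prompt.split("Context: ", 1)[1].split("\n", 1)[0]

def rag_answer(query: str) -> str:
    context = retrieve(query)
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    return llm_stub(prompt)

print(rag_answer("what is the refund policy"))
```

The pattern is always the same three steps: retrieve, augment the prompt, generate; only the retriever and the model vary.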
Improving quality - advanced techniques
“Chain of Thought” prompting
[Diagram: the prompt “Build a plan for task X and solve step-by-step” goes to a large language model; the completion contains a plan, steps 1-3, and a final answer.]
Chain of Thought (CoT)
“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, https://arxiv.org/abs/2201.11903
“To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning” https://arxiv.org/abs/2409.12183
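The simplest zero-shot CoT variant just appends a reasoning trigger to the question. A sketch with a stubbed model (the trigger phrase comes from the CoT literature; `call_llm` stands in for any LLM API):

```python
def call_llm(prompt: str) -> str:
    # Stub so the sketch runs; a real implementation calls a model API here.
    return "Plan: multiply. Step 1: 6 * 7 = 42. Final answer: 42"

def cot(question: str) -> str:
    # Appending a reasoning trigger is the simplest zero-shot CoT variant;
    # few-shot CoT would instead prepend worked examples with reasoning.
    return call_llm(f"{question}\nLet's think step by step.")

answer = cot("What is 6 * 7?")
# Parse the final answer out of the reasoning trace.
final = answer.rsplit("Final answer:", 1)[-1].strip()
print(final)
```

The parsing step matters in practice: the model's useful output is buried at the end of the trace, so the orchestration code must extract it reliably.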
Self-Consistency (SC)
“Self-Consistency Improves Chain of Thought Reasoning in Language Models”, https://arxiv.org/abs/2203.11171
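Self-consistency samples several reasoning paths (at temperature > 0) and takes a majority vote over their final answers. A sketch with a deterministic stub standing in for stochastic sampling:

```python
from collections import Counter
from itertools import cycle

# Deterministic stand-in for stochastic CoT samples: mostly correct, with
# occasional wrong paths, as temperature sampling would produce.
_samples = cycle(["42", "42", "41", "42", "43"])

def sample_cot_answer(question: str) -> str:
    # A real system would run one temperature-sampled CoT completion here
    # and parse out its final answer.
    return next(_samples)

def self_consistency(question: str, n: int = 5) -> str:
    """Sample n reasoning paths and return the majority final answer."""
    votes = Counter(sample_cot_answer(question) for _ in range(n))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # three of five paths agree on 42
```

The vote marginalizes over reasoning paths: individual chains can go wrong in different ways, but the correct answer tends to recur.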
LLM Programs
“Large Language Model Programs”, https://arxiv.org/abs/2305.05364
Tree-of-Thought (ToT)
“Large Language Model Guided Tree-of-Thought”, https://arxiv.org/abs/2305.08291
Graph of Thoughts (GoT)
“Graph of Thoughts: Solving Elaborate Problems with Large Language Models”, https://arxiv.org/abs/2308.09687
Everything of Thoughts (XoT)
“The Prompt Report: A Systematic Survey of Prompting Techniques”, https://arxiv.org/abs/2406.06608
Agents
What is an agent?
ReAct framework
“ReAct: Synergizing Reasoning and Acting in Language Models”, https://arxiv.org/abs/2210.03629
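A bare-bones ReAct loop: the model alternates Thought/Action lines, the orchestrator executes each action and appends an Observation, and the loop ends when the model emits a final answer. The scripted model and the search tool are illustrative stubs; a real agent would call an LLM each turn.

```python
def search(query: str) -> str:
    return "Oxford is a city in England."  # stub search tool

TOOLS = {"search": search}

# Scripted model turns standing in for real LLM completions.
SCRIPTED = iter([
    "Thought: I should look this up.\nAction: search[where is Oxford?]",
    "Thought: I have the answer.\nFinal Answer: Oxford is in England.",
])

def fake_llm(transcript: str) -> str:
    return next(SCRIPTED)

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = fake_llm(transcript)
        transcript += "\n" + step
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            # Parse "tool[argument]" and run the tool.
            action = step.split("Action:", 1)[1].strip()
            name, arg = action.split("[", 1)
            observation = TOOLS[name](arg.rstrip("]"))
            transcript += f"\nObservation: {observation}"
    return "No answer within step budget."

answer = react("Where is Oxford?")
print(answer)
```

The step budget is essential in practice: without it, a confused model can loop on actions indefinitely.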
Why agents?
LLMs are not enough for some cases:
Future LLMs (GPT-5 and beyond) may be better in some respects, but that is still not enough: they will not have access to specific external tools.
Example: a conversational agent with search
An LLM-based conversational agent equipped with a generic search tool. You can ask the agent questions and have conversations with it. The agent uses an LLM to decide which tools to use and when.
Components:
Example: ChemCrow
ChemCrow is an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery, and materials design.
“ChemCrow: Augmenting large-language models with chemistry tools”, https://arxiv.org/abs/2304.05376
Multi-Agents
Agent vs. Multi-agent
Why multi-agents?
Some complex tasks have properties that lend themselves well to a multi-agent solution approach:
* Translation Agent
Example: Agentic translation with reflection
A multi-agent system for machine translation. It contains two agents: a translator and a reviewer.
3-step workflow:
Can repeat these steps if necessary.
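The three-step translate → reflect → improve workflow can be sketched as three chained prompts. The LLM is stubbed here by prompt prefix; a real system would send each prompt to a model such as GPT-4o and could loop the reflect/improve steps.

```python
def llm(prompt: str) -> str:
    # Stub keyed on the prompt prefix; a real call goes to an LLM API.
    if prompt.startswith("Translate"):
        return "<draft translation>"
    if prompt.startswith("Review"):
        return "1. Fix terminology. 2. Fix punctuation."
    return "<improved translation>"

def agentic_translate(source: str, src="English", tgt="Russian") -> str:
    # Step 1: the translator agent drafts a translation.
    draft = llm(f"Translate this {src} text to {tgt}:\n{source}")
    # Step 2: the reviewer agent lists concrete, actionable fixes.
    critique = llm(f"Review this {tgt} translation and list concrete fixes:\n"
                   f"Source: {source}\nTranslation: {draft}")
    # Step 3: the translator applies the fixes.
    final = llm(f"Rewrite the translation applying these fixes:\n{critique}\n"
                f"Source: {source}\nTranslation: {draft}")
    return final

print(agentic_translate("The pressure is on."))
```

Separating critique from generation is the whole trick: the model is asked to find faults in an existing draft, which is an easier task than producing a flawless translation in one pass.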
Example: Reflection agent
Source: English
<article> In the world of softwarre development, you have to be on your toes all the time. The CEO, who has been leading the team for years, said, <q>"We need to push the envelope and think outside the box!"</q> This means our team must work smarter, not harder. Sometimes, it's like trying to find a needle in a haystack. <b>Can you imagine how challenging that is?</b> Our developers often feel like they are walking on eggshells. But, if we nail it, the sky's the limit. Remember, there is no "I" in "team." The project manager comanted, "We have to stay agile and responsive." <mark>With the new deadline looming, the presure is on.</mark> </article>
Translation: EN-RU (DeepL)
<article>В мире софтварной разработки нужно постоянно быть начеку. Генеральный директор, который руководит командой уже много лет, сказал: <q>"Нам нужно расширять границы и мыслить нестандартно!"<q> Это значит, что наша команда должна работать умнее, а не сложнее. Иногда это похоже на попытку найти иголку в стоге сена. <Представляете, как это сложно? Нашим разработчикам часто кажется, что они ходят по яичной скорлупе. Но если мы справимся с этой задачей, то небо - это предел. Помните, что в слове "команда" нет слова "я" Руководитель проекта сказал: "Мы должны оставаться проворными и оперативными" <марка> В связи с приближением нового дедлайна давление возрастает.</марка> </article
Reflection: LLM (GPT-4o)
1. Correct the spelling of "софтварной" to "программной" to use the more accurate and commonly accepted term in Russian.
2. Replace "расширять границы" with "выходить за рамки" to better capture the idiomatic expression "push the envelope."
3. Change "сложнее" to "тяжелее" to better match the meaning of "harder" in this context.
4. Add a closing quotation mark after "нестандартно!" to correct the punctuation.
5. Change "<Представляете, как это сложно?" to "<b>Представляете, как это сложно?</b>" to match the bold formatting in the source text.
6. Add a period after "я" in "в слове 'команда' нет слова 'я'" to correct the punctuation.
7. Change "Руководитель проекта сказал" to "Руководитель проекта прокомментировал" to better match the source text's "commented."
8. Add a closing quotation mark after "оперативными" to correct the punctuation.
9. Replace "<марка>" with "<mark>" and "</марка>" with "</mark>" to match the HTML tag in the source text.
10. Correct the spelling of "давление возрастает" to "давление усиливается" to better convey the increasing pressure.
Improved translation: LLM (GPT-4o)
<article>В мире программной разработки нужно постоянно быть начеку. Генеральный директор, который руководит командой уже много лет, сказал: <q>"Нам нужно выходить за рамки и мыслить нестандартно!"</q> Это значит, что наша команда должна работать умнее, а не тяжелее. Иногда это похоже на попытку найти иголку в стоге сена. <b>Представляете, как это сложно?</b> Нашим разработчикам часто кажется, что они ходят по яичной скорлупе. Но если мы справимся с этой задачей, то небо - это предел. Помните, что в слове "команда" нет слова "я". Руководитель проекта прокомментировал: "Мы должны оставаться проворными и оперативными." <mark>В связи с приближением нового дедлайна давление усиливается.</mark> </article>
Agent Design Patterns
“Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents”, https://arxiv.org/abs/2405.10467
Agentic Frameworks
LangChain + LangGraph
LangGraph Studio
AutoGen
AutoGen is an open-source programming framework for building AI agents and facilitating cooperation among multiple agents to solve tasks. AutoGen aims to provide an easy-to-use and flexible framework for accelerating development and research on agentic AI, much as PyTorch does for deep learning. It offers features such as agents that can converse with other agents, LLM and tool-use support, autonomous and human-in-the-loop workflows, and multi-agent conversation patterns.
AutoGen Studio
CrewAI
Case studies
* Generative Agents
Example: Generative Agents
Researchers from Stanford and Google have developed generative agents that simulate human behaviors. These agents lead their own lives, going through daily routines such as waking up, preparing breakfast, heading to work, and so on. They engage with other agents, retain memories, and strategize for the upcoming day.
“Generative Agents: Interactive Simulacra of Human Behavior”, https://arxiv.org/abs/2304.03442
https://gonzoml.substack.com/p/generative-agents-interactive-simulacra
Example: Generative Agents
The agents are hosted in a sandbox environment named Smallville, modeled after a small town.
This town comprises houses, a college, shops, parks, cafes, and more. Inside the houses, individual rooms are equipped with items like tables and wardrobes. The entire world is depicted as a tree structure, with its leaves representing individual objects.
Agent Architecture
Agent Architecture
The agent itself is an event-memory-augmented LLM.
The agent's architecture is built upon three primary components:
Memory Stream
Memory Stream is the agent's experience database. The records and the agent's reasoning are in natural language, making use of the LLM.
A challenge arises when an agent participates in numerous events: retrieving the most pertinent memories becomes vital, or they would exceed the prompt's capacity. Summarization does not always achieve the desired outcome, as crucial details can be overlooked. Moreover, the significance of a piece of information can vary with the situation, so extracting the relevant data is crucial.
Each memory object contains textual content, its creation date, and the date of the last access.
The most elementary memory type is an 'observation,' representing what the agent directly perceives. This could be the actions of the agent itself or the behaviors observed in other agents or objects.
Memory Stream
Here is an example memory stream from Isabella Rodriguez, who works at a cafe:
(1) Isabella Rodriguez is setting out the pastries,
(2) Maria Lopez is studying for a Chemistry test while drinking coffee,
(3) Isabella Rodriguez and Maria Lopez are conversing about planning a Valentine’s day party at Hobbs Cafe,
(4) The refrigerator is empty.
Memory Stream
A special retrieval function takes an agent's current situation as input and returns a subset of events from its memory to pass to the LLM. The score of each memory item depends on three factors:
All three factors are normalized to a range of [0, 1] and are weighted and summed (with equal weights in this study). Only the best memory candidates that fit within the prompt are retained.
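The scoring recipe above can be sketched directly: recency via exponential decay, importance from a model-assigned 1-10 rating rescaled to [0, 1], and relevance from embedding similarity to the query, summed with equal weights. All the concrete numbers below (decay rate, scores, similarities) are illustrative, not from the paper.

```python
def score(memory, now_hours, query_relevance):
    # Recency: exponential decay since last access (0.99 per hour is an
    # illustrative decay rate).
    recency = 0.99 ** (now_hours - memory["last_access_h"])
    # Importance: a 1-10 rating from the LLM, rescaled to [0, 1].
    importance = memory["importance"] / 10.0
    # Equal weights, as in the study.
    return recency + importance + query_relevance

memories = [
    {"text": "Isabella is setting out the pastries",
     "last_access_h": 1.0, "importance": 2},
    {"text": "Planning a Valentine's day party at Hobbs Cafe",
     "last_access_h": 20.0, "importance": 8},
]
# Hypothetical embedding similarities between the query and each memory:
relevances = [0.2, 0.9]

now = 24.0
ranked = sorted(zip(memories, relevances),
                key=lambda mr: score(mr[0], now, mr[1]), reverse=True)
print(ranked[0][0]["text"])  # the party memory wins on importance + relevance
```

In the full system, only the top-scoring memories that fit within the prompt budget are passed to the LLM.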
Memory Stream
Reflection - A Higher-level Memory
After observations, Reflection serves as a secondary, more abstract type of memory generated by the agent. These reflections are also stored in the memory stream and are retrieved in a similar manner.
Reflections are created periodically, especially when the combined importance scores of the agent's recent events exceed a certain threshold. In practical terms, this happens about two or three times a day.
Reflection - A Higher-level Memory
To create a reflection via the LLM, the 100 most recent memories of the agent are taken and sent to the LLM with the prompt: "Given only the information above, what are the 3 most salient high-level questions we can answer about the subjects in the statements?".
Based on the retrieved questions, relevant memories are recalled, and insights are generated using the prompt: "What 5 high-level insights can you infer from the above statements? (example format: insight (because of 1, 5, 3))".
For instance, the outcome might be something like: "Klaus Mueller is dedicated to his research on gentrification (because of 1, 2, 8, 15)."
Reflections can also be generated based on previous reflections.
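The two-prompt reflection pipeline described above, sketched with a stubbed LLM. The prompts are quoted from the paper; the stub's responses and the memory contents are illustrative, and the intermediate memory-retrieval step is elided.

```python
RECENT_N = 100  # the paper uses the 100 most recent memories

def llm(prompt: str) -> str:
    # Stub keyed on the prompt text; a real call goes to an LLM API.
    if "salient high-level questions" in prompt:
        return "1. What is Klaus working on?\n2. ...\n3. ..."
    return ("Klaus Mueller is dedicated to his research on gentrification "
            "(because of 1, 2, 8, 15)")

def reflect(memory_stream):
    recent = "\n".join(memory_stream[-RECENT_N:])
    # Prompt 1: elicit salient questions about the recent memories.
    questions = llm(f"{recent}\nGiven only the information above, what are "
                    "the 3 most salient high-level questions we can answer "
                    "about the subjects in the statements?")
    # (The full system retrieves memories relevant to each question here.)
    # Prompt 2: generate insights, citing the supporting memories.
    insights = llm(f"{questions}\n{recent}\nWhat 5 high-level insights can "
                   "you infer from the above statements? (example format: "
                   "insight (because of 1, 5, 3))")
    return insights

print(reflect(["Klaus read a paper on gentrification", "Klaus wrote notes"]))
```

The "(because of 1, 5, 3)" format matters: it keeps each reflection traceable back to the concrete memories that support it.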
Reflection - A Higher-level Memory
Planning and Reacting
Planning ensures consistency and believable behavior. Without plans, if you constantly ask the LLM what the agent should do, there's a chance it might suggest having lunch at noon, then again half an hour later, and once more afterward.
Plans are also stored in the memory stream and are retrieved when relevant.
The approach to generating plans is top-down and recursive. Initially, a broad plan is generated for the day with 5-8 points, which is then refined and detailed. This primary plan is built based on the agent's general description and a summary of the previous day's experience. These elements are first detailed at an hourly resolution and later refined to intervals of 5-15 minutes.
Planning and Reacting
The agent constantly perceives the world, saves it in the memory stream, and decides via the LLM whether to continue following the plan or react to something.
When querying the LLM, relevant context about the observed entity is also considered. This is achieved using prompts like “What is [observer]’s relationship with the [observed entity]?” and “[Observed entity] is [action status of the observed entity]” to recall and summarize pertinent memories.
If the LLM suggests a reaction, the plan is regenerated accordingly. If an interaction between two agents is implied, their dialogue is generated, influenced by their memories of each other.
Planning and Reacting
Simulation
In the town of Smallville, there are 25 agents. The identity of each agent is described by a single paragraph of text in English.
Agents can perform actions and communicate with other agents. At every time tick of the sandbox, an agent produces a text that describes its current action.
Agents communicate with each other in English. They're aware of the presence of other agents nearby and decide whether to walk past or engage in conversation. An agent can move around the world, enter buildings, and approach other agents. If an agent's model directs it to a specific place in Smallville, the environment calculates the route, and the agent follows it.
The simulation operator can engage in conversation with an agent, as well as issue directives through an "internal voice". Users can enter Smallville as agents, either as new agents or as existing ones.
Behaviours: agent coordination
* Translation Agency
Example: Translation Agency
In this study, the authors assembled a virtual (for now) multi-agent company called TransAgents, specializing in literary translation.
Roles in the company: CEO, Senior Editor, Junior Editor, Translator, Localization Specialist, Proofreader.
“(Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts”, https://arxiv.org/abs/2405.11804
https://gonzoml.substack.com/p/perhaps-beyond-human-translation
Agent profiles
The profiles include attributes related to translation (language knowledge, experience, education, rate per word) as well as non-translation-related attributes (hobbies, persona, age, gender, nationality).
The workflow: Preparation phase
The company's process is two-tiered, consisting of preparation and execution phases.
1. Preparation phase: A human client comes with a request for literary text translation. At this moment, the appointed CEO agent selects (using self-reflection) the Senior Editor suitable for the client's specific request.
For the translation, a guideline with five components must be prepared: a glossary, a summary, the tone, the style, and the target audience. The Senior Editor coordinates a Junior Editor and compiles the guideline.
The guideline is used as an essential part of the prompts for all roles involved in the subsequent translation, ensuring consistency and coherence throughout the work.
The workflow: Execution phase
2. Execution phase: This phase is divided into four sub-stages:
The first three operate in Trilateral Collaboration mode: the Translator, Localization Specialist, and Proofreader act as Action agents, while the Junior Editor and Senior Editor serve as Critique and Judgment agents. The Action agent produces a result; the Critique agent critiques it and adds the critique to the interaction history, which the Action agent can take into account when generating an updated response in the next iteration. At the end of each iteration after the first, the Judgment agent reviews the response and decides whether to continue the discussion or conclude it.
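The Trilateral Collaboration loop can be sketched as an actor/critic/judge iteration. All three agents are stubs here (in the real system each is an LLM role with its own prompt), and the judge's acceptance rule is an arbitrary placeholder.

```python
def action_agent(task, history):
    # Stub: a real Action agent (Translator etc.) would generate a draft
    # conditioned on the task and the accumulated critique history.
    return f"draft v{len(history) + 1} of {task}"

def critique_agent(response):
    # Stub: a real Critique agent returns concrete comments.
    return f"critique of ({response})"

def judgment_agent(response, iteration):
    # Stub rule: accept from the second round on. The real judge reviews
    # the response itself, not the iteration count.
    return iteration >= 2

def trilateral(task, max_iters=5):
    history = []
    response = ""
    for i in range(1, max_iters + 1):
        response = action_agent(task, history)
        # The judge reviews at the end of each iteration after the first.
        if i > 1 and judgment_agent(response, i):
            return response
        history.append(critique_agent(response))
    return response

print(trilateral("chapter 1 translation"))
```

Note that the judge sees only the current response, not the whole history, mirroring the paper's point that a bounded context keeps long revision chains from drifting.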
The workflow: Execution phase
2. Execution phase
The final stage is the final review, where the Senior Editor evaluates the quality of each chapter and how one chapter transitions to the next.
The Judgment agent is considered crucial for maintaining translation quality because models do not handle long contexts well, and the meaning of translations starts to deviate from the original after several review iterations. The judge does not use the entire review history, thus helping maintain the final quality.
Results
TransAgents scored significantly lower on automatic metrics (d-BLEU) but won when evaluated by human or LLM preference.
* Resources
Resources
The field is highly dynamic and evolves very quickly. There are many other interesting multi-agent examples. Just a few relevant ones:
Thanks!