Ember
In collaboration with the following early users, contributors, and reviewers:
Jared Quincy DavisF,S, Marquita EllisI, Diana ArroyoI, Pravein Govindan KannanI, Paul CastroI, Siddharth SharmaF,S, Parth AsawaB, Alan ZhuB, Connor ChowB, Jason LeeB, Jay Adityanag TipirneniB, Chad FergusonB, Kathleen GeB, Kunal AgrawalB, Rishab BhatiaB, Rohan PenmatchaB, Sai KolasaniB, Théo Jaffrelot InizanB, Lingjiao ChenMS, Omar KhattabD,MT, Deepak NarayananN, Long FeiF, Aparajit RaghavanF, Eyal CidonF, Jacob ScheinF, Prasanth SomasundarF, Boris HaninF,P, James ZouS, Alex DimakisB, Joey GonzalezB, Peter BailisG,S, Ion StoicaA,B,D, Matei ZahariaD,B
Foundry (MLFoundry)F, DatabricksD, IBM ResearchI, Stanford UniversityS, UC BerkeleyB, MITMT, NVIDIAN, MicrosoftMS, AnyscaleA, GoogleG, PrincetonP
Ember is a compositional framework for building and deploying large inference-time scaling architectures and strategies. We call these architectures Networks of Networks, or NONs for short. These architectures can be employed to eclipse the quality or reliability frontier available via today’s frontier LLMs, or to achieve comparable quality at 1/1000th of the cost or less[1].
The cost-per-token of intelligence is falling precipitously. The cost to achieve a given level of model performance has fallen by 10x every year for the past 3 years. Over this same timespan, the model ecosystem has expanded dramatically. There are now hundreds of models to choose from, yielding a rich but perilous model market: latency aside, the monetary cost dispersion between the most expensive “systems” and the cheapest model calls is now nearly 10,000x (4 orders of magnitude) and is only widening.
The confluence of these factors has motivated work on compound AI systems that allow practitioners to navigate and interpolate along the Pareto frontier more effectively, or that make expanded frontiers accessible. Simple constructs, like best-of-N graphs, verifier-prover structures, and ensembles with “voting-based” aggregation, work surprisingly well in many regimes.
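To make the flavor of these constructs concrete, below is a minimal sketch of an ensemble with majority-vote aggregation in plain Python. The `call_model` function is a hypothetical stand-in for whatever provider API you use; none of these names are Ember APIs.

```python
# Minimal sketch: fan out one prompt to N models, aggregate by plurality vote.
# `call_model` is a hypothetical placeholder for a provider-specific LLM call.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder: issue `prompt` to `model_name` and return its answer."""
    raise NotImplementedError

def majority_vote_ensemble(prompt: str, models: list[str]) -> str:
    # LLM calls are IO-bound, so a thread pool suffices for parallel fan-out.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        answers = list(pool.map(lambda m: call_model(m, prompt), models))
    # "Voting-based" aggregation: return the most common answer.
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

# Hypothetical usage:
# majority_vote_ensemble("What is 17 * 23?", ["model-a", "model-b", "model-c"])
```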
We’ve been exploring this emerging and vibrant architecture space.
To conduct this inference-time scaling and architecture work, and research beyond it, we began building a framework to facilitate the composition of larger and richer architectures. We call this framework Ember.
Ember is attempting to contribute to the networks of networks (NON) era what PyTorch and JAX contributed to the neural networks (NN) era.
Ember programs are, to quote early Spark literature, “fast to run, and fast to write”.
Ember confers the compositional, eager-mode, and Pythonic UX of PyTorch while offering the native graph-IR compatibility of TensorFlow and JAX, derived from XLA.
We are open-sourcing Ember in the hope that others find it useful. We also aspire to galvanize further work and exploration into these themes, which we have found are more approachable for modest-budget academic inquiry than endeavors in pre-training.
We wanted to be able to compose DAG and network structures like those described above out of LLM inference calls.
Many users have found that Ember complements their existing “agentic” or “workflows” stacks. It is relatively modular: one can easily swap in existing infrastructure for cross-provider LLM API wrangling in place of Ember’s models module.
Users can also leverage Ember alongside frameworks designed to turn LLMs into agentic systems by adding memory, tool-use capabilities, retries, and more. These modules can be employed “inside” Ember operators or used to wrap Ember operators.
Ember is intended as a straightforward, general, and extensible programming model. It is aspirationally minimal, inspired by JAX, of which several of us were early users at Google Brain. All that Ember’s core Operator programming model does is (I) mandate a specific delineation of an operator’s input/output specification and (II) allow users to write arbitrary logic in the operator’s forward method.
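As a rough illustration of that shape, here is a sketch using plain dataclasses. The actual Ember base classes and signatures may differ; this mirrors only the two requirements above, not the real API.

```python
# Sketch of the Operator shape: explicit I/O specification + a forward method.
# These class names are illustrative, not Ember's actual API.
from dataclasses import dataclass

@dataclass
class QAInput:   # (I) explicit input specification
    question: str

@dataclass
class QAOutput:  # (I) explicit output specification
    answer: str

class AnswerOperator:
    """An operator couples the I/O spec with arbitrary logic in `forward`."""
    def forward(self, inputs: QAInput) -> QAOutput:
        # (II) arbitrary user logic, e.g., one or many LLM calls, lives here.
        return QAOutput(answer=f"(stub answer to: {inputs.question})")

op = AnswerOperator()
print(op.forward(QAInput(question="Why compose networks of networks?")))
```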
From there, each Operator is registered as a PyTree, allowing it to be composed into larger structures, wrangled via arbitrary transforms (jitted, vmapped, etc.), and natively converted into Ember’s xcs graph IR, represented as a “plan” that can then be optimized. Given an IR, users can apply sophisticated or custom scheduling logic. The simplest scheduler takes a large xcs graph, flattens it, and runs it via topological sort with parallel dispatch at the API-call/IO level. Beyond that, we’ve seen more frontier work on custom scheduling logic that is query-aware (with native router integration), rate-vs-token-limit aware, hardware-resource aware, and more. Contributors plan to submit these in the near future.
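For intuition, here is a toy version of that simplest scheduler, built only on the Python standard library: it flattens a dependency graph via topological sort and dispatches each ready wave of IO-bound nodes in parallel. This is illustrative only, not Ember’s xcs implementation.

```python
# Toy scheduler: topological sort with parallel dispatch of ready nodes.
# Illustrative only; not Ember's xcs implementation.
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

def run_graph(deps: dict, tasks: dict) -> dict:
    """deps: node -> set of predecessors; tasks: node -> callable(results)."""
    results = {}
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorter.get_ready()  # all nodes whose deps are satisfied
            futures = {n: pool.submit(tasks[n], results) for n in ready}
            for n, fut in futures.items():
                results[n] = fut.result()  # collect this wave's outputs
                sorter.done(n)             # unlock downstream nodes
    return results

# Toy usage: two parallel "model calls" feeding an aggregator node.
deps = {"agg": {"m1", "m2"}, "m1": set(), "m2": set()}
tasks = {
    "m1": lambda r: "answer-from-model-1",
    "m2": lambda r: "answer-from-model-2",
    "agg": lambda r: f"aggregated({r['m1']}, {r['m2']})",
}
print(run_graph(deps, tasks))
```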
For detailed documentation and examples for building compound AI systems with Ember, check out the Ember codebase on GitHub.
We look forward to seeing what you build with it!
Addendum: What about AI Agents?
The connections between Ember and the theme of “AI agents”
Some early contributors have called Ember an AI agent framework. We have found that this term carries many disparate meanings, so we want to explain our definition.
In our minds, an AI agent is simply characterized by its agency, i.e., its capacity to take action. This is closer to the canonical RL definition of an agent[2], which characterizes an agent by its possession of goals, its ability to perceive or sense aspects of its environment, and its capacity to choose actions that influence that environment.
The definition of an agent can be separated from the properties/desiderata that we seek in a useful or effective agent.
Under this definition, an agent's action determination can be undergirded by a single monolithic text-to-action inference call from a Large Action Model, or it can be constituted as a compound AI system empowered by explicit tool-use modules, enhanced by access to a memory-system, and/or fortified by many calls arrayed in static or dynamic NON structures.
The emerging science around compound AI systems is thereby deeply entangled with, but definitionally orthogonal to, the emerging engineering discipline around AI agents.
[1] via routing and using networks of small models
[2] distinct from the useful definitions put forth by Anthropic, Lilian Weng, and others.