Ember

In collaboration with the following early users, contributors, and reviewers:

Jared Quincy DavisF,S, Marquita EllisI, Diana ArroyoI, Pravein Govindan KannanI, Paul CastroI, Siddharth SharmaF,S, Parth AsawaB, Alan ZhuB, Connor ChowB, Jason LeeB, Jay Adityanag TipirneniB, Chad FergusonB, Kathleen GeB, Kunal AgrawalB, Rishab BhatiaB, Rohan PenmatchaB, Sai KolasaniB, Théo Jaffrelot InizanB, Lingjiao ChenMS, Omar KhattabD,MT, Deepak NarayananN, Long FeiF, Aparajit RaghavanF, Eyal CidonF, Jacob ScheinF, Prasanth SomasundarF, Boris HaninF,P, James ZouS, Alex DimakisB, Joey GonzalezB, Peter BailisG,S, Ion StoicaA,B,D, Matei ZahariaD,B

Foundry (MLFoundry)F, DatabricksD, IBM ResearchI, Stanford UniversityS, UC BerkeleyB, MITMT, NVIDIAN, MicrosoftMS, AnyscaleA, GoogleG, PrincetonP

Ember: an inference-time scaling architecture framework

Ember is a compositional framework for building and deploying large inference-time scaling architectures and strategies. We call these architectures Networks of Networks, or NONs for short. These architectures can be employed to eclipse the quality or reliability frontier available via today’s frontier LLMs, or to achieve comparable quality at 1/1000th or less the cost[1].

Why inference-time scaling architecture research

The cost-per-token of intelligence is falling precipitously. The cost to achieve a given level of model performance has been falling by 10x every year for the past 3 years. Over this same timespan, the model ecosystem has expanded dramatically. There are now hundreds of models to choose from, yielding a rich but perilous model market landscape: latency aside, the monetary cost dispersion between the most expensive “systems” and the cheapest model calls is nearly 10000x (4 orders of magnitude) now and is only widening. 

The confluence of these factors has motivated work on compound AI systems that allow practitioners to navigate and interpolate along the Pareto frontier more effectively, or that can make expanded frontiers accessible. Simple constructs, like best-of-N graphs, verifier-prover structures, and ensembles with “voting-based” aggregation, work surprisingly well in many regimes.
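As an illustration of the simplest such construct, here is a minimal sketch of a depth-1 voting ensemble in Python. The `sample_answer` stub below stands in for a real LLM call and its reliability parameters are invented for illustration:

```python
import random
from collections import Counter

def sample_answer(prompt: str, rng: random.Random) -> str:
    # Stand-in for one LLM call: a noisy answerer that returns the correct
    # answer ("42") 70% of the time and a random digit otherwise.
    # In a real system this would be a provider API call.
    return "42" if rng.random() < 0.7 else str(rng.randint(0, 9))

def majority_vote(prompt: str, n: int, seed: int = 0) -> str:
    # Depth-1 voting-based aggregation: draw n independent samples,
    # then return the most common answer (the mode).
    rng = random.Random(seed)
    answers = [sample_answer(prompt, rng) for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```

With independent errors spread across many wrong answers, the mode concentrates on the correct answer as N grows; as discussed below, this improvement is not guaranteed to be monotonic in every regime.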

We’ve been exploring this emerging and vibrant architecture space.

  1. In Are More LLM Calls All You Need, we explored the behavior of depth-1, voting-based aggregation systems and uncovered that inference-time scaling can actually hurt, producing worse performance at exponentially higher cost, depending on your setting. In other circumstances, it is neutral or helps performance. The non-monotonic scaling dynamics we observed have prompted much further work.
  2. In Networks of Networks: Complexity Class Principles applied to Compound AI Systems Design, we explored the dynamics of judge-based compound AI systems, and uncovered that, via inference-time scaling, you can bootstrap to arbitrary levels of performance with a fixed LLM at inference time depending on the complexity class dynamics of the regime.
  3. In LLMSelector, we explored the extent to which we can compose multi-stage, multi-step LLM pipeline systems and optimize performance of the compound system by selecting the optimal model for each stage. We found that simply tuning the model for each stage can confer a 5-70% gain, depending on the setting.  

To conduct this inference-time scaling and architecture work and beyond, we began building a framework to facilitate the composition of larger and richer architectures. We call this framework Ember.

Why Ember

Ember is attempting to contribute to the networks of networks (NON) era what PyTorch and JAX contributed to the neural networks (NN) era.

Ember programs are, to quote early Spark literature, “fast to run, and fast to write”.

Ember confers the compositional, eager-mode, and Pythonic UX of PyTorch while offering the native graph IR compatibility of TensorFlow and JAX, derived from XLA.

We are open-sourcing Ember in hopes that others find it useful. We also aspire to galvanize further work and exploration into these themes, which we have found are more approachable for modest-budget academic probing than endeavors in pre-training.

We wanted to be able to compose DAG / network structures like the above out of LLM inference calls.

Integration and how to use Ember

Many users have found that Ember complements their existing “agentic” or “workflows” stack. It is relatively modular, allowing users to easily swap in their existing infrastructure for cross-LLM-provider API wrangling in place of Ember’s models module.


Users can also leverage Ember in parallel with frameworks designed for turning LLMs into agentic systems, enhancing them with memory, tool-use capabilities, retries, and more. These modules can be employed “inside” Ember operators or to wrap Ember operators.

Ember is intended as a straightforward, general, and extensible programming model. It’s aspirationally minimal, inspired by JAX, of which several of us were early users at Google Brain. All that Ember’s core Operator programming model does is (I) mandate a specific delineation of the input/output specification and (II) allow users to write arbitrary logic in the forward method of an operator.
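As a conceptual sketch of these two requirements, consider the following; the class names, specs, and method signatures below are illustrative, not Ember’s actual API:

```python
from dataclasses import dataclass

# (I) Explicit input/output specifications, here modeled as dataclasses.
@dataclass
class QAInput:
    question: str

@dataclass
class QAOutput:
    answer: str

class Operator:
    """Minimal operator base: validates I/O specs, delegates to forward."""
    input_spec = None
    output_spec = None

    def __call__(self, inputs):
        assert isinstance(inputs, self.input_spec), "input must match the declared spec"
        outputs = self.forward(inputs)
        assert isinstance(outputs, self.output_spec), "output must match the declared spec"
        return outputs

    def forward(self, inputs):
        raise NotImplementedError

class EchoAnswer(Operator):
    # (I) mandated delineation of the input/output specification
    input_spec = QAInput
    output_spec = QAOutput

    # (II) arbitrary user logic in forward (here, a trivial echo;
    # in practice this could be an LLM call or a whole sub-network)
    def forward(self, inputs: QAInput) -> QAOutput:
        return QAOutput(answer=f"You asked: {inputs.question}")
```

Because each operator declares its boundary types explicitly, larger structures can be composed from operators with matching specs.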

From there, each Operator is registered as a PyTree, allowing it to be composed into larger structures, wrangled via arbitrary transforms (jitted, vmapped, etc.), and natively converted into Ember’s xcs graph IR, represented as a “plan” that can then be optimized. Given an IR, users can apply sophisticated or custom scheduling logic. The simplest scheduler takes a large xcs graph, flattens it, and runs it via topological sort with parallel dispatch at the API-call / IO level. Beyond that, we’ve seen more frontier work on custom scheduling logic that is query-aware (with native router integration), rate-vs-token-limit aware, hardware-resource aware, and beyond. Contributors plan to submit these in the near future.
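The simplest scheduling scheme described above, topological sort with parallel dispatch, can be sketched as follows. The graph representation and function names here are illustrative stand-ins, not Ember’s xcs IR:

```python
from concurrent.futures import ThreadPoolExecutor

def run_graph(nodes, deps, fns):
    """Run a DAG of zero-arg callables with wave-parallel dispatch.

    nodes: list of node ids
    deps:  {node: set of prerequisite node ids}
    fns:   {node: zero-arg callable (e.g. wrapping an API call)}
    """
    results = {}
    remaining = set(nodes)
    with ThreadPoolExecutor() as pool:
        while remaining:
            # Kahn-style wave: every node whose prerequisites are done.
            wave = [n for n in remaining if deps.get(n, set()).issubset(results)]
            if not wave:
                raise ValueError("cycle detected in graph")
            # Dispatch the whole wave in parallel; IO-bound calls overlap.
            futures = {n: pool.submit(fns[n]) for n in wave}
            for n, fut in futures.items():
                results[n] = fut.result()
            remaining -= set(wave)
    return results
```

For IO-bound LLM calls, independent nodes in the same wave overlap their network latency, which is where most of the wall-clock savings come from.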

For detailed documentation and examples for building compound AI systems with Ember, check out the Ember codebase on GitHub.

Conclusion

Ember is attempting to contribute to the networks of networks (NON) era what PyTorch and JAX contributed to the neural networks (NN) era. We look forward to seeing what you build with it!


Addendum: What about AI Agents?

The connections between Ember and the theme of “AI agents”

Some early contributors have called Ember an AI agent framework. We have found that this term possesses many disparate meanings, so we seek to explain our definition.

In our minds, an AI agent is simply characterized by its agency, or capacity to take action. This is closer to the canonical RL definition of an agent[2], which defines an agent by its possessing goals, its ability to perceive/sense aspects of its environments, and its capacity to choose actions to influence its environments.

The definition of an agent can be separated from the properties/desiderata that we seek in a useful or effective agent.

Under this definition, an agent's action determination can be undergirded by a single monolithic text-to-action inference call from a Large Action Model, or it can be constituted as a compound AI system empowered by explicit tool-use modules, enhanced by access to a memory-system, and/or fortified by many calls arrayed in static or dynamic NON structures.

The emerging science around compound AI systems is thereby deeply entangled with, but definitionally orthogonal to, the emerging engineering discipline around AI agents.


[1] via routing and using networks of small models

[2] distinct from the useful definitions put forth by Anthropic, Lilian Weng, and others.