Kuzco

sam@kuzco.xyz

Global demand for GPUs has risen sharply over the last 18 months and is expected to continue growing more than 33% YoY. Increased demand has primarily been driven by AI model training and inference. As models have improved and unlocked new use-cases, developers are building new applications that will require even more compute capacity to serve at scale. Despite growing demand, the market for compute remains wildly inefficient, with roughly 50% of global capacity sitting idle at any given time.

The largest GPU producer, NVIDIA, sold an estimated 30MM non-datacenter GPU units in 2023. These are primarily RTX 4080 and 4090 chips being sold to prosumers, like gamers and graphic artists. These chips are less capable than the H100 and A100 datacenter chips, but still very capable of delivering inference for models like Llama 7B at acceptable rates of throughput. Apple has sold 43MM M-series Macs since the start of 2022. With up to 128GB of effective VRAM, these machines are also quite capable of running LLM inference on many of the top open source models.

All chips produced by NVIDIA and Apple are capable of providing at least 40 tokens per second of inference for Llama2 7B. Assuming these chips sit idle for 16 hours per day, this amounts to an average of 1.94B tokens per second of inference capacity that is currently going unused. At the current price offered by together.ai of $0.20/million tokens, this idle capacity is worth roughly $33,600,000 per day, or $12.2B per year. This is a conservative estimate given that M3 Max chips, for example, can deliver inference at over 100 tokens per second.
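A quick back-of-the-envelope sketch of that arithmetic, using the per-chip throughput, idle hours, and price cited above (all inputs are the rough estimates stated in this section, not measurements):

```python
# Rough estimate of idle inference capacity, using the figures cited above.
nvidia_consumer_gpus = 30_000_000   # non-datacenter units sold in 2023 (estimate)
apple_m_series_macs  = 43_000_000   # M-series Macs sold since early 2022 (estimate)
tokens_per_second    = 40           # conservative Llama2 7B throughput per chip
idle_hours_per_day   = 16           # assumed idle time per device
price_per_million    = 0.20         # $/1M tokens (together.ai list price)

total_chips = nvidia_consumer_gpus + apple_m_series_macs
# Average unused throughput across the day (16 of 24 hours are idle).
idle_tokens_per_second = total_chips * tokens_per_second * (idle_hours_per_day / 24)

idle_tokens_per_day = idle_tokens_per_second * 86_400
value_per_day  = idle_tokens_per_day / 1e6 * price_per_million
value_per_year = value_per_day * 365

print(f"{idle_tokens_per_second / 1e9:.2f}B idle tokens per second")
print(f"${value_per_day / 1e6:.1f}M per day, ${value_per_year / 1e9:.1f}B per year")
# Prints roughly 1.9B tokens/second, ~$34M/day, ~$12B/year; small differences
# from the figures above are rounding.
```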

GPU providers like Paperspace, Lambda Labs, RunPod, and others purchase large blocks of GPU time from data center operators like CoreWeave, mark it up, and sell it by the minute to developers, primarily for model fine-tuning and inference. This business model works when utilization is high, but it often leaves unused resources sitting idle, waiting to be rented. Even the most popular GPUs, like H100s and A100s, may sit idle for 10 minutes between rentals. Lower-end GPUs like the RTX 4080 can sit idle for several hours at a time. Without a way to effectively monetize small periods of unused capacity, these businesses must factor unused resources into their financial models and accept that they will pay for hardware that goes underutilized.

While resources sit idle, global demand for LLM inference and the compute capacity that it requires have spiked. OpenAI is relying on CoreWeave-owned hardware to serve ChatGPT. NVIDIA datacenter revenue tripled in Q4 2023, of which AI inference made up 40%. Devices capable of providing inference are being underutilized while the need for such devices has risen dramatically, keeping inference prices unnecessarily high. High inference prices are ultimately felt by developers (and end-users) and limit the types of products they can sustainably afford to build and operate.

Our solution to this problem is Kuzco, an aggregator for idle GPU compute capacity that combines unused resources into a single logical system which can be used to run inference tasks for open source LLMs. Kuzco Worker software can be run on GPU-equipped computers to process inference tasks for models like Llama2 and Mistral. Kuzco Workers earn money for their owners by watching for periods of low resource utilization and joining the network to provide inference capacity. Workers are paid per token of inference they deliver.

The pitch to GPU owners and operators is simple: if you ever have under-utilized hardware, run our software and we’ll pay you for the inference requests you handle. When you need your hardware back, give us a ten-second warning and we’ll shut down gracefully. The start-up and shutdown process can be automated to require no manual input from the operator. If an H100 operator who primarily services model training finds themselves with a few days of unreserved capacity, Kuzco provides a way for them to easily sell this capacity that would otherwise go unused. We estimate that running Kuzco for a total of 36 days per year will generate an additional $3,000 per GPU. Revenue increases simply by running our software.
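As a rough illustration of how this automation might look, the sketch below wraps a Worker process and implements the ten-second graceful-shutdown window described above. The "kuzco-worker" command is a hypothetical placeholder, not a documented interface:

```python
# Illustrative sketch of the ten-second graceful-shutdown handshake described
# above. The "kuzco-worker" command is a hypothetical placeholder; the real
# Worker software may expose a different interface.
import signal
import subprocess
import sys

# Start the Worker while the GPU is not otherwise needed.
worker = subprocess.Popen(["kuzco-worker", "start"])

def reclaim_gpu(signum, frame):
    """The operator (or a scheduler) wants the GPU back: give the Worker
    ten seconds to wrap up its in-flight task, then force-stop it."""
    worker.terminate()              # ask the Worker to stop gracefully
    try:
        worker.wait(timeout=10)     # the ten-second warning window
    except subprocess.TimeoutExpired:
        worker.kill()               # hard stop if it somehow overruns
    sys.exit(0)

signal.signal(signal.SIGTERM, reclaim_gpu)
signal.pause()                      # block until the operator reclaims the GPU
```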

Managing inference infrastructure is difficult and expensive. Most developers who build at the application layer do not have the skills required to manage this infrastructure and opt to use hosted solutions instead.

Kuzco provides an OpenAI-compatible API to developers who want to run inference tasks on our network, which allows developers to swap their inference provider to Kuzco by changing a single line of code. These developers want simple APIs, minimal configuration, and well thought-out developer ergonomics. The atomic unit of work in the Kuzco Network is inference tokens. Customers pay for tokens, and inference providers are paid per token delivered.
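Because the API is OpenAI-compatible, the switch usually amounts to pointing the official SDK at a different base URL. A minimal sketch follows; the endpoint URL, API key placeholder, and model identifier are illustrative assumptions, not documented values:

```python
from openai import OpenAI

# The base_url and model name below are illustrative assumptions; consult the
# Kuzco docs for the actual endpoint and supported model identifiers.
client = OpenAI(
    base_url="https://api.kuzco.xyz/v1",   # the single line that changes
    api_key="YOUR_KUZCO_API_KEY",
)

response = client.chat.completions.create(
    model="llama2-7b",
    messages=[{"role": "user", "content": "Summarize this transcript: ..."}],
)
print(response.choices[0].message.content)
```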

This is Kuzco’s north star as we move toward building the most developer-friendly LLM inference API on the market. We take developer experience very seriously, and in this we have found inspiration in Vercel’s business philosophy: take a complex piece of infrastructure and make it easy for developers to use, with all the accoutrements and nice-to-haves they have grown to expect from top-tier developer platforms.

The market for selling compute capacity to developers who need LLM inference exists at three layers. The first, lowest layer involves purchasing or renting hardware directly from hyperscalers and building custom inference infrastructure on top of it; this option is the most expensive and challenging. The second layer is similar to the first, but instead of a long-term commitment, the hardware is rented on an hourly basis, which offers more flexibility but is still difficult to manage. The third layer focuses on selling model inference itself.

Kuzco is focused solely on the final, most abstract layer: token output. By doing so, Kuzco enables developers to concentrate on their applications and users without the burden of managing infrastructure.

Kuzco Network

The Kuzco Network is fundamentally a real-time spot market for LLM inference. As more Workers join the network and capacity rises, the price developers pay per token of inference comes down. The reverse is true when demand rises, which incentivizes more Workers to contribute capacity. Participants on both sides can configure the minimum and maximum price at which they are willing to transact. Kuzco makes money by charging a small fee on inference jobs that are completed successfully.
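One way to picture the matching logic: a developer’s maximum price is matched against the cheapest Worker whose minimum price it clears. The sketch below is purely illustrative and is not Kuzco’s actual pricing engine:

```python
# Purely illustrative spot-match sketch, not Kuzco's actual pricing engine.
from dataclasses import dataclass

@dataclass
class Worker:
    id: str
    min_price: float        # lowest $/1M tokens this Worker will accept

def match(max_price: float, workers: list[Worker]) -> Worker | None:
    """Route to the cheapest Worker willing to serve at or below the bid."""
    eligible = [w for w in workers if w.min_price <= max_price]
    return min(eligible, key=lambda w: w.min_price) if eligible else None

pool = [Worker("h100-datacenter", 0.25), Worker("rtx4090-desktop", 0.15)]
print(match(0.20, pool))    # Worker(id='rtx4090-desktop', min_price=0.15)
print(match(0.10, pool))    # None: no Worker accepts this price
```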

The network is designed to handle Workers joining or leaving in as little as ten seconds; when a Worker needs to leave, all we ask is that it finishes the task it is currently working on. For datacenter chips that may have other immediate obligations, this shutdown process will always take less than ten seconds. As a result, even 30 seconds of idle GPU capacity can join the network and provide value. This type of on-demand contribution is made possible by focusing solely on model inference, as model training requires processes that may need to run uninterrupted for days or weeks at a time.

Developers who run their own inference infrastructure know that they must reserve more total capacity than they use on average in order to account for the possibility of a spike in demand. As such, they nearly always have excess capacity sitting idle, waiting for more traffic. When GPUs are billed by the hour, developers spend money on resources they never use. With Kuzco, Workers are ranked based on the quality of the hardware they provide. Fast datacenter hardware will be ranked higher than an RTX 4090 running on someone’s desktop. Inference requests are routed to higher-ranked Workers before spilling over onto consumer-grade hardware, ensuring that requests are always handled by the most qualified Worker. This helps datacenters, as profit-seeking entities, maximize their GPU utilization while also allowing high-end consumer devices to contribute to the network when demand is high. Developers pay only for what they use, and always receive the best quality inference available.
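A minimal sketch of that routing order, assuming a simple numeric rank per Worker (the ranks and fields below are illustrative, not Kuzco’s actual scoring system):

```python
# Illustrative rank-based routing: higher-quality hardware is tried first,
# spilling over to consumer devices only when better Workers are busy.
from dataclasses import dataclass

@dataclass
class RankedWorker:
    id: str
    rank: int               # lower value = better hardware (e.g. datacenter H100)
    busy: bool = False

def route(workers: list[RankedWorker]) -> RankedWorker | None:
    """Pick the best-ranked idle Worker for the next inference request."""
    for w in sorted(workers, key=lambda w: w.rank):
        if not w.busy:
            w.busy = True
            return w
    return None             # every Worker is busy: queue or reject the request

pool = [RankedWorker("h100-coreweave", 0), RankedWorker("rtx4090-desktop", 2)]
print(route(pool).id)       # h100-coreweave
print(route(pool).id)       # rtx4090-desktop (spillover)
```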

In the medium term, Kuzco Worker rankings will serve as a reputation system for GPU suppliers. This will provide a source of long-term defensibility, as the reputation system will allow us to offer stronger guarantees to customers who require SLAs or higher levels of service (e.g. guarantees around response time, guarantees around location such as USA vs. China, SOC 2 compliance, etc.). In a few years’ time, we expect to be able to claim not only that we have the largest pool of GPUs in the world, but that we have segmented and ranked those GPUs such that customers can purchase inference tokens with more options and better precision.

Our Mission

Our mission is to increase the total amount of LLM inference performed worldwide, primarily by driving down the cost to developers and providing intuitive access to the tools they need. We believe creating a global, managed marketplace for inference is the most effective way to accomplish this. If we offer a great developer experience and the best prices, we will quickly become the platform on which developers choose to build.

As incumbents like NVIDIA and newcomers like Groq scale chip production and increase total global inference capacity, idle capacity will grow with it unless we build a way to combine unused resources into a single network. We expect that, at the right price, demand for LLM inference is effectively unlimited for the foreseeable future. Building this network now will get us closer to the incredible AI-powered future we all hope to see.

The Kuzco Network is live in early alpha, with over 1,700 nodes providing a total of 1B tokens of inference per day. Try it out for yourself at kuzco.xyz. Please reach out to sam@kuzco.xyz with questions or comments.

Applications

One question we frequently receive is: what kind of applications will run on lower-end hardware? Clearly there will be many models that run locally, and at the high end, frontier models like Gemini Ultra, GPT-4, and Claude 3 will run on hyperscalers. What is left in the middle?

We believe that open source models like Llama2, Mistral, and their fine-tunes will become even more powerful over time and require fewer computational resources. Both of these trends have been clearly observable since the launch of ChatGPT at the end of 2022. Consumer hardware will also continue to improve as it has over the last two decades.

As such, we believe the opportunity set for non-A100 (and above) chips is increasing, and will in fact accelerate over the next few years. Latency-insensitive applications that require a large number of tokens are well suited to run on the Kuzco Network. Below are a few example applications that fit these criteria.

Ambient Summarization - An always-on, organized summary of all public discourse from court proceedings, public hearings, earnings calls, and more.

Codebase Analysis - LLM-powered security or performance audits that run as part of continuous integration systems.

Extraction Pipelines - Pass a schema and a document to an LLM to extract useful information in a specific format. Useful in RAG pipelines.

Financial Analysis - Analyze company quarterly and annual financial reports and create summaries of business performance.

Translation Services - Translate books and large documents from one language to another.

Content Generation - Repurpose transcripts from videos and podcasts into books, blog posts, and other written content.

Content Moderation - Moderate online forums by monitoring all activity for disallowed content.

SEO Optimization - Suggest improvements for all pages on a website based on SEO best practices.

Legal Analysis - Analyze complex legal documents, including pieces of legislation, to suggest improvements or answer questions.

Education Content - Create personalized educational resources, quizzes, and learning modules.