1 of 19

Async Batch Processing Solution for Cloud-LLMs

Yuhan Chen, Noah Robitshek, Sergio Rodriguez,

Andrew Sasamori, Rayan Syed, Bennett Taylor

In collaboration with Two Sigma

2 of 19

Data Transfer

  • Challenge:
    • Handling large-scale dataset uploads
  • Approaches:
    • Asynchronous Batch Processing
    • Partitioning Data Efficiently
    • Enhancements with Kafka and Flink

3 of 19

Data Transfer

Diagram: Dataset → Data Batches → Data Partition → Kafka Topic

4 of 19

Data Transfers

  • Each cloud provider has its own SDK
    • Google: Google Cloud Storage (google-cloud-storage)
    • AWS: Boto3
    • Azure: Blob Storage (azure-storage-blob)
  • Other helpful libraries
    • Dask: a parallel computing library that can handle large datasets and process them in chunks
    • PyArrow: provides efficient, columnar storage formats that are well suited to large datasets, minimizing transfer times
      • Natively supports S3, GCS, and Azure Blob Storage file systems (a minimal upload sketch follows this list)
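As one concrete example of the SDK route, here is a minimal upload sketch using the google-cloud-storage client; the bucket and object names are placeholders, and Boto3 or azure-storage-blob would play the same role on AWS and Azure.

```python
# Minimal upload sketch using the Google Cloud Storage SDK.
# Bucket and object names are placeholders; swap in boto3 or
# azure-storage-blob equivalents for the other providers.
from google.cloud import storage

def upload_dataset(local_path: str, bucket_name: str, blob_name: str) -> None:
    """Upload one local file to a GCS bucket."""
    client = storage.Client()              # uses ambient credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)  # streams the file to GCS

if __name__ == "__main__":
    upload_dataset("batches/part-0000.parquet",
                   "example-llm-jobs",             # hypothetical bucket
                   "datasets/part-0000.parquet")   # hypothetical object path
```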

5 of 19

Data Formats and Partitioning

  • File Format
    • Parquet (Columnar Format): For larger DataFrames or PyArrow tables, Parquet is highly efficient due to its columnar structure and built-in compression. It’s optimized for storage size and I/O performance.
    • Feather: An alternative to Parquet, Feather is a lightweight binary format that supports fast reading and writing for in-memory operations.
    • CSV (Text Format): Use CSV when simplicity is more important than performance, especially for smaller datasets.
  • Data Partitioning
    • We will employ cloud tools for partitioning datasets:
      • Google BigQuery, Amazon Redshift, Apache Hive, Apache Spark, etc. (a partitioned Parquet write is sketched below)
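A minimal sketch of how a partitioned Parquet dataset could be written with PyArrow; the partition column (batch_id) and the output path are illustrative assumptions, not the final schema.

```python
# Sketch: write a DataFrame as a partitioned Parquet dataset with PyArrow.
# The partition column and output path are assumptions for illustration.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "batch_id": [0, 0, 1, 1],
    "prompt": ["a", "b", "c", "d"],
})

table = pa.Table.from_pandas(df)

# Writes one subdirectory per batch_id value, e.g. dataset/batch_id=0/...
pq.write_to_dataset(table, root_path="dataset", partition_cols=["batch_id"])
```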

6 of 19

API Frameworks

  • Django
    • Feature-rich but more complex than other frameworks; powerful admin dashboard and ecosystem
  • Flask
    • Lightweight and flexible but less feature-rich than other frameworks
  • FastAPI
    • Very fast; optimized for high-performance APIs but has a smaller ecosystem and less documentation (a minimal sketch follows this list)
  • Pyramid
    • Highly flexible, modular, and versatile, but has a steep learning curve and requires more manual setup
  • Falcon
    • Extremely fast and lightweight; optimized for HTTP requests but lacks features and a support ecosystem
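A minimal sketch of what a batch-submission endpoint could look like if FastAPI were chosen; the routes, job model, and field names are placeholders, not a committed API design.

```python
# Hypothetical batch-job API sketched with FastAPI (one option from the list).
from uuid import uuid4
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class BatchJobRequest(BaseModel):
    dataset_uri: str   # e.g. a GCS/S3 path to the uploaded batch
    model: str         # which LLM endpoint to target

@app.post("/jobs")
async def submit_job(req: BatchJobRequest):
    job_id = str(uuid4())
    # In the real system this would publish to a Kafka "incoming jobs" topic.
    return {"job_id": job_id, "status": "queued"}

@app.get("/jobs/{job_id}")
async def job_status(job_id: str):
    # Placeholder: status would come from the job-progress topic / data store.
    return {"job_id": job_id, "status": "unknown"}
```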

7 of 19

API Hosting

8 of 19

System Design (API)

Diagram: Users ↔ API (WebSocket or HTTP communication) ↔ Apache Backend

9 of 19

Kafka Queueing/PubSub

  • Kafka will be the main form of communication between microservices within our cloud solution
    • Potential topics include incoming jobs, job progress, completed jobs, job errors, a rate-limit control topic, and many others
    • To move information from one microservice (such as the API) to another, the sending microservice runs a Kafka producer and the receiving microservice runs a Kafka consumer (see the sketch after this list)
  • To host Kafka we will need to allocate a number of nodes (at least 3) from our Google Kubernetes Engine cluster as Kafka nodes, with one likely running ZooKeeper
    • We can use Kubernetes StatefulSets to help deploy and manage these nodes
  • We can also use Kafka for our pub/sub system: the user-facing API subscribes to notification topics in Kafka and sends notifications back to the user via a webhook or WebSocket
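A sketch of the producer/consumer pattern between two microservices using the kafka-python client; the topic names, bootstrap address, and message shape are assumptions for illustration.

```python
# Sketch of Kafka communication between microservices (kafka-python client).
# Topic names and the bootstrap address are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

BOOTSTRAP = "kafka:9092"  # assumed in-cluster service address

# API-side producer: publish a newly submitted job.
producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)
producer.send("incoming-jobs",
              {"job_id": "123", "dataset_uri": "gs://example/part-0000.parquet"})
producer.flush()

# Worker-side consumer: receive jobs for processing.
consumer = KafkaConsumer(
    "incoming-jobs",
    bootstrap_servers=BOOTSTRAP,
    group_id="batch-workers",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    job = message.value
    print("received job", job["job_id"])
```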

10 of 19

Flink and Redis Data Store

  • Flink jobs will be responsible for handling rate-limiting logic and communication with LLM API endpoints
  • Flink jobs should be partitioned by LLM API endpoint, with each job having its own endpoint
    • Each job can then manage a rate-limiting algorithm like Token Bucket for its corresponding API endpoint
    • Jobs will use Kafka topics to gather data to be processed and to communicate with our pub/sub system
  • The Redis data store can be used to manage the data needed by Flink jobs, such as the number of tokens in each job’s bucket (a sketch of atomic token consumption follows this list)
    • The advantage of this system is that it keeps the data used by Flink jobs consistent and helps prevent race conditions
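A sketch of how per-endpoint token counts could be kept in Redis and consumed atomically with a small Lua script so that concurrent jobs do not race; the key names, host, and refill policy are assumptions.

```python
# Sketch: per-endpoint token counts stored in Redis, consumed atomically
# via a Lua script so concurrent jobs cannot race. Key names are placeholders.
import redis

r = redis.Redis(host="redis", port=6379)

# Returns 1 and decrements if a token is available, otherwise returns 0.
CONSUME_TOKEN = """
local tokens = tonumber(redis.call('GET', KEYS[1]) or '0')
if tokens > 0 then
    redis.call('DECR', KEYS[1])
    return 1
end
return 0
"""
consume_token = r.register_script(CONSUME_TOKEN)

def try_consume(endpoint: str) -> bool:
    """True if a token was consumed for this endpoint, False if the job must queue."""
    return consume_token(keys=[f"bucket:{endpoint}"]) == 1

def refill(endpoint: str, tokens: int, capacity: int) -> None:
    """Refill step, run at a fixed interval per endpoint (capacity caps the bucket)."""
    key = f"bucket:{endpoint}"
    new_total = min(capacity, int(r.get(key) or 0) + tokens)
    r.set(key, new_total)
```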

11 of 19

Rate Limiting

  • Prevent abuse: Rate limits protect the API from excessive requests, preventing overloads or disruptions
  • Manage infrastructure load: Rate limits help maintain consistent performance, even during peak demand

Logic:

  • Token Bucket Algorithm: LLM API tokens/available requests are added to a ‘bucket’ at fixed intervals
    • If tokens are available, requests/jobs can be processed immediately
    • When a token is consumed successfully, the system sends a request to the LLM API
    • The API response contains rate-limit information (e.g. ‘RateLimit-Remaining’, ‘RateLimit-Reset’)
      • Stop adding tokens if approaching the rate limit
    • If no tokens are available, requests/jobs are queued
    • Advantages: dynamic; allows bursts of traffic as long as tokens are present
    • The bucket will likely be a Python class (a minimal sketch follows this list)
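A minimal in-memory Token Bucket class, as suggested in the last bullet; the real bucket state would likely live in Redis (previous slides), and the capacity and refill values are placeholders.

```python
# Minimal in-memory Token Bucket sketch; capacity and refill rate are placeholders.
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity          # max tokens the bucket can hold
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        """Add tokens in proportion to elapsed time, capped at capacity."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_consume(self, tokens: int = 1) -> bool:
        """Consume tokens if available; otherwise the caller queues the job."""
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
```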

12 of 19

Potential Rate Limit Error Handling

  • If the LLM responds with a ‘too many requests’ error, implement exponential backoff
    • Retry strategy: wait some time and then try again (e.g. 1, 2, 4, 8, 16 seconds)
    • Continue until the max retry limit or success
    • Advantages: reduces server overload, efficient recovery
    • No tokens should be added to the bucket during the wait
  • A shared token bucket across batches/Kafka jobs can lead to race conditions
    • Solution: Redis
      • Provides distributed locks for shared resources in a mutually exclusive way
      • Simpler and significantly lower latency than its Apache counterpart (ZooKeeper); a combined backoff-and-lock sketch follows this list
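A sketch combining exponential backoff on a ‘too many requests’ (HTTP 429) response with a redis-py lock around the shared bucket; the endpoint URL, lock name, and retry limit are placeholders, and send_request stands in for the real LLM call.

```python
# Sketch: exponential backoff on HTTP 429, with a Redis lock guarding the
# shared token bucket. URL, lock name, and retry limit are placeholders.
import time
import redis
import requests

r = redis.Redis(host="redis", port=6379)

def send_request(payload: dict) -> requests.Response:
    # Placeholder for the real LLM API call.
    return requests.post("https://llm.example.com/v1/complete", json=payload)

def call_with_backoff(payload: dict, max_retries: int = 5) -> requests.Response:
    delay = 1.0                                    # 1, 2, 4, 8, 16 seconds
    for attempt in range(max_retries):
        # redis-py's Lock gives mutually exclusive access to the shared bucket.
        with r.lock("token-bucket-lock", timeout=10):
            response = send_request(payload)
        if response.status_code != 429:            # not "too many requests"
            return response
        time.sleep(delay)                          # no tokens added during the wait
        delay *= 2                                 # exponential backoff
    raise RuntimeError("max retry limit reached")
```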

13 of 19

LLM APIs

The choice of LLM API/model is based primarily on token usage

Data going into an API request should already be parsed into a format that follows the specific parameters defined by the LLM of choice

LLM nodes:

Individual instances or endpoints of the language model that handle the processing of requests. Each LLM node represents a separate API call to a different language model (a request sketch is below)
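A sketch of a single LLM node's request, assuming an OpenAI-style chat-completions payload as one possible format; the endpoint URL, model name, and header names are placeholders for whichever provider is chosen.

```python
# Sketch of one LLM node's request. The endpoint, model name, API-key
# environment variable, and header names are placeholders.
import os
import requests

def call_llm(prompt: str) -> dict:
    response = requests.post(
        "https://api.example-llm.com/v1/chat/completions",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {os.environ['LLM_API_KEY']}"},
        json={
            # Payload already parsed into the parameters the chosen LLM expects.
            "model": "example-model",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=30,
    )
    # Rate-limit headers (if provided) feed back into the token bucket.
    remaining = response.headers.get("RateLimit-Remaining")
    print("rate limit remaining:", remaining)
    return response.json()
```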

14 of 19

Future Plans

  • Simplify the project and build a proof of concept
    • Define the scope of the proof of concept and how it will be split between group members
    • Build a simple project with one LLM, one data type, and no batch processing
  • Build a bare-bones prototype that shows the end-to-end capability of the system
  • Discover issues with the current implementation plan and perform research to resolve them
  • Start defining our API, development process, and team dynamic

15 of 19

Burndown Chart

16 of 19

Async Batch Processing Solution for Cloud-LLMs

Yuhan Chen, Noah Robitshek, Sergio Rodriguez,

Andrew Sasamori, Rayan Syed, Bennett Taylor

In collaboration with Two Sigma

17 of 19

Appendix

18 of 19

System Design (TEMPLATE)

Diagram: Users → API → Load Balancer → Batch Processing → LLM Nodes

19 of 19

Major Roadblocks