1 of 19

Async Batch Processing Solution for Cloud-LLMs

Yuhan Chen, Noah Robitshek, Sergio Rodriguez,

Andrew Sasamori, Rayan Syed, Bennett Taylor

In collaboration with Two Sigma

2 of 19

Data Transfer

  • Challenge:
    • Handling large-scale dataset uploads
  • Approaches:
    • Asynchronous Batch Processing
    • Partitioning Data Efficiently
    • Enhancements with Kafka and Flink

3 of 19

Data Transfer

Diagram: Dataset → Data Batches → Data Partition → Kafka Topic

4 of 19

Data Transfers

  • Each cloud provider has its own SDK
    • Google: Google Cloud Storage (google-cloud-storage)
    • AWS: Boto3
    • Azure: Blob Storage (azure-storage-blob)
  • Other helpful libraries
    • Dask: a parallel computing library that can handle large datasets and process them in chunks
    • PyArrow: provides efficient, columnar storage formats that are well suited to large datasets, minimizing transfer times
      • Natively supports S3, GCS, and Azure Blob Storage file systems (a minimal upload sketch follows this list)
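As one concrete example of the SDK route, here is a minimal upload sketch using the google-cloud-storage client; the bucket and object names are placeholders, and Boto3 or azure-storage-blob would play the same role on AWS and Azure.

```python
# Minimal upload sketch using the Google Cloud Storage SDK.
# Bucket and object names are placeholders; swap in boto3 or
# azure-storage-blob equivalents for the other providers.
from google.cloud import storage

def upload_dataset(local_path: str, bucket_name: str, blob_name: str) -> None:
    """Upload one local file to a GCS bucket."""
    client = storage.Client()              # uses ambient credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)  # streams the file to GCS

if __name__ == "__main__":
    upload_dataset("batches/part-0000.parquet",
                   "example-llm-jobs",             # hypothetical bucket
                   "datasets/part-0000.parquet")   # hypothetical object path
```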

5 of 19

Data Formats and Partitioning

  • File Format
    • Parquet (Columnar Format): For larger DataFrames or PyArrow tables, Parquet is highly efficient due to its columnar structure and built-in compression. It’s optimized for storage size and I/O performance.
    • Feather: An alternative to Parquet, Feather is a lightweight binary format that supports fast reading and writing for in-memory operations.
    • CSV (Text Format): Use CSV when simplicity is more important than performance, especially for smaller datasets.
  • Data Partitioning
    • We will employ cloud tools for partitioning datasets:
      • Google BigQuery, Amazon Redshift, Apache Hive, Apache Spark, etc. (a partitioned Parquet write is sketched below)
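A minimal sketch of how a partitioned Parquet dataset could be written with PyArrow; the partition column (batch_id) and the output path are illustrative assumptions, not the final schema.

```python
# Sketch: write a DataFrame as a partitioned Parquet dataset with PyArrow.
# The partition column and output path are assumptions for illustration.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "batch_id": [0, 0, 1, 1],
    "prompt": ["a", "b", "c", "d"],
})

table = pa.Table.from_pandas(df)

# Writes one subdirectory per batch_id value, e.g. dataset/batch_id=0/...
pq.write_to_dataset(table, root_path="dataset", partition_cols=["batch_id"])
```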

6 of 19

API Frameworks

  • Django
    • Feature-rich but more complex than other frameworks; powerful admin dashboard and ecosystem
  • Flask
    • Lightweight and flexible but less feature-rich than other frameworks
  • FastAPI
    • Very fast; optimized for high-performance APIs but has a smaller ecosystem and less documentation (a minimal sketch follows this list)
  • Pyramid
    • Highly flexible, modular, and versatile, but has a steep learning curve and requires more manual setup
  • Falcon
    • Extremely fast and lightweight; optimized for HTTP requests but lacks features and a support ecosystem
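A minimal sketch of what a batch-submission endpoint could look like if FastAPI were chosen; the routes, job model, and field names are placeholders, not a committed API design.

```python
# Hypothetical batch-job API sketched with FastAPI (one option from the list).
from uuid import uuid4
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class BatchJobRequest(BaseModel):
    dataset_uri: str   # e.g. a GCS/S3 path to the uploaded batch
    model: str         # which LLM endpoint to target

@app.post("/jobs")
async def submit_job(req: BatchJobRequest):
    job_id = str(uuid4())
    # In the real system this would publish to a Kafka "incoming jobs" topic.
    return {"job_id": job_id, "status": "queued"}

@app.get("/jobs/{job_id}")
async def job_status(job_id: str):
    # Placeholder: status would come from the job-progress topic / data store.
    return {"job_id": job_id, "status": "unknown"}
```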

7 of 19

API Hosting

8 of 19

System Design (API)

Diagram: Users ↔ API (WebSocket or HTTP communication) ↔ Apache Backend

9 of 19

Kafka Queueing/PubSub

  • Kafka will be the main form of communication between microservices within our cloud solution
    • Potential topics include incoming jobs, job progress, completed jobs, job errors, a rate-limit control topic, and many others
    • To move information from one microservice (such as the API) to another, the sending microservice runs a Kafka producer and the receiving microservice runs a Kafka consumer (see the sketch after this list)
  • To host Kafka we will need to allocate a number of nodes (at least 3) from our Google Kubernetes Engine cluster as Kafka nodes, with one likely running ZooKeeper
    • We can use Kubernetes StatefulSets to help deploy and manage these nodes
  • We can also use Kafka for our pub/sub system: the user-facing API subscribes to notification topics in Kafka and sends notifications back to the user via a webhook or WebSocket
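A sketch of the producer/consumer pattern between two microservices using the kafka-python client; the topic names, bootstrap address, and message shape are assumptions for illustration.

```python
# Sketch of Kafka communication between microservices (kafka-python client).
# Topic names and the bootstrap address are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

BOOTSTRAP = "kafka:9092"  # assumed in-cluster service address

# API-side producer: publish a newly submitted job.
producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)
producer.send("incoming-jobs",
              {"job_id": "123", "dataset_uri": "gs://example/part-0000.parquet"})
producer.flush()

# Worker-side consumer: receive jobs for processing.
consumer = KafkaConsumer(
    "incoming-jobs",
    bootstrap_servers=BOOTSTRAP,
    group_id="batch-workers",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    job = message.value
    print("received job", job["job_id"])
```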

10 of 19

Flink and Redis Data Store

  • Flink jobs will be responsible for handling rate-limiting logic and communication with LLM API endpoints
  • Flink jobs should be partitioned by LLM API endpoint, with each job having its own endpoint
    • Each job can then manage a rate-limiting algorithm like Token Bucket for its corresponding API endpoint
    • Jobs will use Kafka topics to gather data to be processed and to communicate with our pub/sub system
  • The Redis data store can be used to manage the data needed by Flink jobs, such as the number of tokens in each job’s bucket (a sketch of atomic token consumption follows this list)
    • The advantage of this system is that it keeps the data used by Flink jobs consistent and helps prevent race conditions
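A sketch of how per-endpoint token counts could be kept in Redis and consumed atomically with a small Lua script so that concurrent jobs do not race; the key names, host, and refill policy are assumptions.

```python
# Sketch: per-endpoint token counts stored in Redis, consumed atomically
# via a Lua script so concurrent jobs cannot race. Key names are placeholders.
import redis

r = redis.Redis(host="redis", port=6379)

# Returns 1 and decrements if a token is available, otherwise returns 0.
CONSUME_TOKEN = """
local tokens = tonumber(redis.call('GET', KEYS[1]) or '0')
if tokens > 0 then
    redis.call('DECR', KEYS[1])
    return 1
end
return 0
"""
consume_token = r.register_script(CONSUME_TOKEN)

def try_consume(endpoint: str) -> bool:
    """True if a token was consumed for this endpoint, False if the job must queue."""
    return consume_token(keys=[f"bucket:{endpoint}"]) == 1

def refill(endpoint: str, tokens: int, capacity: int) -> None:
    """Refill step, run at a fixed interval per endpoint (capacity caps the bucket)."""
    key = f"bucket:{endpoint}"
    new_total = min(capacity, int(r.get(key) or 0) + tokens)
    r.set(key, new_total)
```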

11 of 19

Rate Limiting

  • Prevent abuse: Rate limits protect the API from excessive requests, preventing overloads or disruptions
  • Manage infrastructure load: Rate limits help maintain consistent performance, even during peak demand

Logic:

  • Token Bucket Algorithm: LLM API tokens/available requests are added to a ‘bucket’ at fixed intervals
    • If tokens are available, requests/jobs can be processed immediately
    • When a token is consumed successfully, the system sends a request to the LLM API
    • The API response contains rate-limit information (e.g. ‘RateLimit-Remaining’, ‘RateLimit-Reset’)
      • Stop adding tokens if approaching the rate limit
    • If no tokens are available, requests/jobs are queued
    • Advantages: dynamic; allows bursts of traffic as long as tokens are present
    • The bucket will likely be a Python class (a minimal sketch follows this list)
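A minimal in-memory Token Bucket class, as suggested in the last bullet; the real bucket state would likely live in Redis (previous slides), and the capacity and refill values are placeholders.

```python
# Minimal in-memory Token Bucket sketch; capacity and refill rate are placeholders.
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity          # max tokens the bucket can hold
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        """Add tokens in proportion to elapsed time, capped at capacity."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_consume(self, tokens: int = 1) -> bool:
        """Consume tokens if available; otherwise the caller queues the job."""
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
```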

12 of 19

Potential Rate Limit Error Handling

  • If the LLM responds with a ‘too many requests’ error, implement exponential backoff
    • Retry strategy: wait some time and then try again (e.g. 1, 2, 4, 8, 16 seconds)
    • Continue until the max retry limit or success
    • Advantages: reduces server overload, efficient recovery
    • No tokens should be added to the bucket during the wait
  • A shared token bucket across batches/Kafka jobs can lead to race conditions
    • Solution: Redis
      • Provides distributed locks for shared resources in a mutually exclusive way
      • Simpler and significantly lower latency than its Apache counterpart (ZooKeeper); a combined backoff-and-lock sketch follows this list
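A sketch combining exponential backoff on a ‘too many requests’ (HTTP 429) response with a redis-py lock around the shared bucket; the endpoint URL, lock name, and retry limit are placeholders, and send_request stands in for the real LLM call.

```python
# Sketch: exponential backoff on HTTP 429, with a Redis lock guarding the
# shared token bucket. URL, lock name, and retry limit are placeholders.
import time
import redis
import requests

r = redis.Redis(host="redis", port=6379)

def send_request(payload: dict) -> requests.Response:
    # Placeholder for the real LLM API call.
    return requests.post("https://llm.example.com/v1/complete", json=payload)

def call_with_backoff(payload: dict, max_retries: int = 5) -> requests.Response:
    delay = 1.0                                    # 1, 2, 4, 8, 16 seconds
    for attempt in range(max_retries):
        # redis-py's Lock gives mutually exclusive access to the shared bucket.
        with r.lock("token-bucket-lock", timeout=10):
            response = send_request(payload)
        if response.status_code != 429:            # not "too many requests"
            return response
        time.sleep(delay)                          # no tokens added during the wait
        delay *= 2                                 # exponential backoff
    raise RuntimeError("max retry limit reached")
```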

13 of 19

LLM APIs

The choice of LLM API/model is based primarily on token usage

Data going into an API request should already be parsed into a format that follows the specific parameters defined by the LLM of choice

LLM nodes:

Individual instances or endpoints of the language model that handle the processing of requests. Each LLM node represents a separate API call to a different language model (a request sketch is below)
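A sketch of a single LLM node's request, assuming an OpenAI-style chat-completions payload as one possible format; the endpoint URL, model name, and header names are placeholders for whichever provider is chosen.

```python
# Sketch of one LLM node's request. The endpoint, model name, API-key
# environment variable, and header names are placeholders.
import os
import requests

def call_llm(prompt: str) -> dict:
    response = requests.post(
        "https://api.example-llm.com/v1/chat/completions",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {os.environ['LLM_API_KEY']}"},
        json={
            # Payload already parsed into the parameters the chosen LLM expects.
            "model": "example-model",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=30,
    )
    # Rate-limit headers (if provided) feed back into the token bucket.
    remaining = response.headers.get("RateLimit-Remaining")
    print("rate limit remaining:", remaining)
    return response.json()
```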

14 of 19

Future Plans

  • Simplify the project and build a proof of concept
    • Define the scope of the proof of concept and how it will be split between group members
    • Build a simple project with one LLM, one data type, and no batch processing
  • Build a bare-bones prototype that shows the end-to-end capability of the system
  • Discover issues with the current implementation plan and perform research to resolve them
  • Start defining our API, development process, and team dynamic

15 of 19

Burndown Chart

16 of 19

Async Batch Processing Solution for Cloud-LLMs

Yuhan Chen, Noah Robitshek, Sergio Rodriguez,

Andrew Sasamori, Rayan Syed, Bennett Taylor

In collaboration with Two Sigma

17 of 19

Appendix

18 of 19

System Design (TEMPLATE)

Diagram: Users → API → Load Balancer → Batch Processing → LLM Nodes

19 of 19

Major Roadblocks