1 of 28

Celery and Kubernetes for fast, scalable, and robust workflow orchestration

Param Rajani

SDE2 @ GoDaddy

2 of 28

About Me

Working at GoDaddy
3 years of experience in data engineering
Built products in Fintech and GenAI domains
Designed and maintained product architectures on both Serverless and Kubernetes

3 of 28

Pretext: Let’s Talk Orchestration

Define Logical Steps
Break Down Complex Use Cases
Error Handling, Retries
Build Data Pipelines and Logical Journeys

Let’s start by talking about orchestration.

I think everyone here has used Step Functions or a similar tool. These tools are supposed to help you orchestrate workflows: define steps, make sure A runs before B, and make sure C runs only if B passes.

They are really powerful tools that let you logically break down complex use cases.

But this power comes with a cost.

These costs become noticeable when your systems start scaling, especially when the system is customer-facing. The example here is a customer’s stock buying and selling journey built with Step Functions and Lambdas.

This system becomes a problem when a lot of customers swarm into it.

Now factor in every customer, every state transition, and every Lambda invocation. Yet this is one of the most common system patterns companies use when they start out, because it scales beautifully and is really great for building MVPs. Eventually, though, these patterns hit one failure point or another.

🤔 “So this got me thinking: is there a lighter, faster, Python-native alternative to this?”

✅ “And that’s how I ended up rediscovering Celery.”

4 of 28

Celery: A Pythonic Take On Workflow Orchestration

Topics we’ll explore in this talk:

Intro to celery
Celery Canvas Workflows
Compute and scaling with K8s
Observability

Celery has been famous across the Python community for years; I remember PyCon 2024 had a talk about it as well. And this is where it comes in.

“Yes, Celery. The thing we all used to handle background tasks… turns out, it can actually do a whole lot more.”

“This talk is based on a real-world use case, where I led the transition of a product from AWS Lambda + Step Functions to a Celery + Kubernetes based system.”

Through this talk, I want to share Celery as a strong open-source alternative and also draw some parallels along the way.

“So here’s what we’ll explore in this talk:”

“We’ll start with a quick intro to Celery.”

“Then we’ll look at complex orchestration patterns using Canvas workflows.”

“We will then talk about compute and scaling workflows with K8s.”

“Finally, we’ll cover observability and logging.”

“Let’s jump in.”

5 of 28

What is Celery?

"Let’s kick things off by understanding what Celery actually is, and this image captures its essence perfectly."

"Celery is a distributed task queue system. It allows Python applications to send background jobs to run asynchronously."

"On the left, we have the Producer. This is usually your web app or backend service. When the app needs to perform a time-consuming task, say, file or image processing, it sends that request to the Broker."

"The Broker, commonly Redis or RabbitMQ, is a message queue. It holds the task until it can be consumed."

"The Consumer, or Worker, subscribes to the broker, pulls tasks from the queue, and executes them."

"This simple flow, Producer → Broker → Consumer, is the backbone of Celery’s task execution model. And it’s surprisingly powerful when scaled right."

6 of 28

Here’s an Example: Classic Fire and Forget

This is a good example to show the producer and the worker.

At the top, you have an application that wants to do a heavy task, like ghiblifying this image.

We know it takes an annoyingly long time to generate that.

The app passes this message to an SQS queue. At the bottom, tasks.py is a worker subscribed to the same queue: it consumes the message, takes the image arguments, such as the path, and processes the image.

And this is the very definition of a Celery worker.

"A worker is the actual machine or process that executes the task. It knows what the task is, what it’s called, and how to run it. It’s constantly listening to the queue, ready to pick up new tasks as soon as they arrive."

"This is one of the most common use cases for Celery: sending an email in the background."

"On the left, you have the part of your app that wants to send the email. It just says, ‘hey, send this email’ and pushes that message to the queue."

"Celery takes that message and passes it to the broker, like Redis, which holds the message temporarily."

"On the right, there’s a worker that's always listening. As soon as it sees that message, it picks it up and sends the email."

"The beauty here is that the sender doesn’t have to wait. The app can continue instantly, with no loading screen and no delay, and the work just happens in the background."

"This is the classic fire-and-forget pattern. Perfect for anything that’s time-consuming but doesn’t need an immediate response: emails, file uploads, notifications, you name it."

7 of 28

The Working Of The Worker

8 of 28

Celery Is Much More Than A Background Worker

Celery can orchestrate complex workflows
Supports Chained, Parallel and Conditional Execution
Tasks can be routed to different workers based on resource needs

9 of 28

Celery Canvas Workflows

Celery’s Canvas workflows give us the power to define complex orchestration logic.

The key difference compared to Step Functions is that with Celery, you’re writing your workflows directly in Python.

So instead of building a DAG in YAML or JSON, or dealing with UI-based editors, you express the logic as code.

This example describes linear task execution using chain: task A, then task B, then task C, one after the other.

You simply import chain from celery and pass in each task with its arguments, and they run one after the other. Celery also handles retries if necessary. Features like these are what make Celery production-ready: it lets you handle errors and failures and decide what to do next. So Canvas is not just about chaining tasks; it’s about building resilient, dynamic, and scalable distributed workflows in code.


Linear task sequences with chain()

Dynamic branching with map() or shared tasks

But where Celery really shines is in its production-readiness:

Retries: Each task can be configured to retry on specific failures.

Exponential Backoff: Celery supports retry delays and backoff strategies, so you're not slamming a failing system repeatedly.

Timeouts: Both soft and hard timeouts are available to control runaway tasks.

Routing: You can route different task types to specific queues — just like assigning tasks to different worker pools in Airflow.

Error Handling: You can define failure callbacks or alerting hooks on a per-task basis.

In essence — just like Step Functions lets you build fault-tolerant state machines, and Airflow gives you DAGs with retries and scheduling — Celery gives you the same robustness, but in a lightweight, Pythonic way, without leaving your application context.

10 of 28

11 of 28

12 of 28

Task Routing

Now you might be asking: “Hey Param, you told us a Celery worker subscribes to a queue to process tasks, and then you showed that tasks can be chained and grouped to build a workflow. So how does Celery handle this? After it performs task A, how does it get to task B?”

Say task A is the ghiblify task and task B is the notify task, and I want separate workers to run them.

Of course, this should happen by itself, and of course we want to connect these dots as well.

This is where Celery’s task routing and exchanges come into play.

Walkthrough of the Diagram (Left Side):
Let’s look at this diagram. On the left, you see a Producer; this is our Celery client. It creates tasks and pushes them into an exchange. The exchange is part of our message broker, RabbitMQ in this case.

From the exchange, tasks are directed into different queues: queue1, queue2, queue3.

Code Explanation (Right Side):
On the right side, you can see how we configure task routing in code:

Here, we’re saying: route heavy_task to a high_cpu queue, and quick_task to a low_latency queue. Chains and groups still run these tasks in the order the workflow defines, but each task message first goes through the exchange, which decides which queue it lands in.

13 of 28

14 of 28

CELERY + K8S

COMPUTE FLEXIBILITY AND SCALING

15 of 28

PAIRED WITH KUBERNETES

Deploy Each Worker as a separate POD
Flexibility with CPU and RAM Requirements
Flexibility with scaling Policies

Let’s talk about how pairing Celery with Kubernetes gives us incredible flexibility and control for orchestrating large-scale distributed workloads.

Real-World Architecture (Diagram Explanation)
This diagram is from one of the production systems I worked on while migrating from an AWS Step Functions setup to Celery running on Kubernetes.

The whole workflow runs here using Celery canvas workflows. The way it works is:

We start with a chain of three steps: Process Initiator → Dataset Initiator → Service Tasks

Each step has its own logic and runs sequentially, with its own internal workflows.

Let’s take the dataset task — this step at a higher level is responsible for generating some datasets — could be Pandas, ML, etc.

It internally uses group() which fans out multiple tasks and waits for all of them to finish before moving to the next

There could be decision-making involved in what tasks need to be run and skipped

At a high level, this shows how orchestration happens using Celery canvases, and how tasks are distributed across worker types that listen to different queues.

Each worker here runs in a separate pod and can be scaled independently with its own CPU and RAM requirements.
For example:

The Dataset Worker runs ML and uses Pandas, which is CPU- and RAM-intensive.

The Analytics and Service Task Workers below do lightweight aggregations or API calls; they finish quickly and require far fewer resources.

Kubernetes lets us tune each worker’s pod accordingly:

Set CPU/memory to match task load

Run multiple pods if task volume increases

Scale in/out automatically using HPAs and queue length

Choose appropriate pool type (prefork, gevent) based on the workload

For those who don’t know: prefork and gevent are different ways for a worker to run tasks concurrently in Python. Prefork suits CPU-intensive work; gevent suits I/O- or API-bound work. Choosing the right worker pool is really important. Take the dataset initiator here: as I mentioned, internally it only delegates work to other workers for the actual dataset tasks, so it can use the gevent pool. The actual dataset generation tasks must use prefork because they are compute-intensive. This matters because earlier in the process I ran the initiator with prefork, which made the pipeline really slow, and I had to write separate scaling policies for it. Changing it to gevent increased concurrency while using fewer pods.

This architecture gave us massive flexibility and performance at scale, while keeping costs predictable and logic modular.


➤ Deploy Each Worker as a Separate Pod
Each Celery worker runs in its own Kubernetes pod. This separation allows fine-grained control: you can independently scale, restart, and monitor each worker type without affecting others.

➤ Flexibility with CPU and RAM Requirements
Tasks vary widely in their compute and memory demands. With Kubernetes, we can assign CPU and RAM requests and limits on a per-pod basis, so each worker type gets the resources it needs. This avoids over-provisioning and ensures that memory-hungry or CPU-intensive tasks have the headroom they require. A dataset generation pod may need 4 CPUs and 6 GB of RAM, while a simple fetch worker might run on a fraction of that.

➤ Flexibility with Scaling Policies
This is where Kubernetes and Celery really shine together. You can autoscale worker pods based on metrics like CPU, memory, or number of messages in the queue. Using Celery queues + Kubernetes HPAs, we scale specific task types only when demand increases.

➤ Flexibility with Worker Types
Celery lets you choose how tasks are executed inside a worker by selecting different execution pool types. This flexibility allows you to tailor workers to the nature of the workload they handle:

Use prefork (the default) for CPU-bound or parallel tasks — this forks multiple processes, ideal for heavy computation and tasks like ML or Pandas processing.

Use gevent for I/O-bound tasks — it uses greenlets to handle concurrent tasks with minimal thread overhead. Great for API calls or lightweight network operations.

By choosing the right worker pool for the job, you can squeeze out optimal throughput and reduce resource wastage. Combined with Kubernetes, this allows you to deploy separate worker types using different pools, CPUs, memory, and concurrency — all optimized for their unique workloads.

For example:
A dataset_worker pod might use prefork with high CPU and RAM for ML tasks.
A fetch_worker pod might use gevent and low resources to make lightweight API calls.
This gives us total flexibility to optimize cost, performance, and throughput, without one worker affecting another.

16 of 28

Interesting Design Decisions

Group Multiple Tasks Into a Single worker

Less Control on Scaling
Easy to Manage Scaling, focused on a single queue
More RAM Consumption

Each Task Independent Worker

More Controlled Scaling

“One of the biggest advantages of using Celery workers on Kubernetes is the freedom it gives us to make design decisions tailored to our workload.”

“For instance, we can isolate each task type to its own dedicated worker. This way, we can scale each task independently based on demand. If one task starts spiking, we simply increase the replicas for just that worker. This isolates failure domains and keeps resource usage predictable.”

“On the other hand, grouping multiple task types into a single worker simplifies deployment and reduces the operational overhead — fewer containers to manage. This also works well when the tasks are lightweight and don’t interfere with each other’s performance. While we lose some granularity, we gain in manageability.”

“What makes this even more powerful is the ability to scale based on queue metrics — like the number of pending messages or message lag. We can hook this into Kubernetes Horizontal Pod Autoscaler or use tools like KEDA.”

“Celery also gives us fine-grained control inside each worker. Using Celery’s concurrency settings, such as the number of worker processes via --concurrency, we can control parallelism per pod. For example, if we’re using the prefork pool, we can increase the number of processes to take advantage of multi-core machines. For I/O-bound workloads, we can switch to eventlet or gevent worker pools to handle more tasks concurrently without spawning OS-level processes.”

“So altogether, between task isolation, dynamic queue-based scaling, and pool-type-specific concurrency, Celery on Kubernetes gives us a lot of options.”
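
Queue-based scaling boils down to a proportional rule: pending messages divided by a per-pod target, clamped to a range. The sketch below shows that arithmetic only; the function name, the target of 50 messages per pod, and the replica bounds are hypothetical values, not settings from the production system.

```python
import math


def desired_replicas(
    queue_length: int,
    target_per_pod: int,
    min_replicas: int = 1,
    max_replicas: int = 20,
) -> int:
    """Return how many worker pods a queue backlog calls for."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    wanted = math.ceil(queue_length / target_per_pod)
    # Never scale below the floor or above the ceiling.
    return max(min_replicas, min(max_replicas, wanted))


idle = desired_replicas(queue_length=0, target_per_pod=50)
busy = desired_replicas(queue_length=120, target_per_pod=50)
burst = desired_replicas(queue_length=5000, target_per_pod=50)
```

Here `idle` stays at the floor of 1, `busy` rounds 2.4 up to 3, and `burst` is capped at the ceiling of 20.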

17 of 28

KUBERNETES DEPLOYMENT FILE

This is how a typical Celery deployment looks on Kubernetes.

We deploy multiple specialized workers, each with its own resource profile and execution model:

Heavy Worker:

Uses the prefork pool (process-based, great for CPU-bound work).

Allocated more CPU and memory (2 CPUs, 2Gi RAM).

Listens only to the heavy_tasks queue using --queues=heavy_tasks.

Example tasks: PDF parsing, video processing, large data transformations.

Light Worker:

Uses the gevent pool (event-based, ideal for I/O-bound work).

Runs with fewer resources (0.5 CPU, 512Mi RAM).

Listens to the light_tasks queue.

Example tasks: API calls, webhook dispatch, small database fetches.

Thanks to Kubernetes:

Each worker runs in its own pod, isolated and independently scalable.

You can tune concurrency, resource requests, and task queues per worker.

This architecture gives us fine-grained control, cost efficiency, and the ability to scale each type of task independently.
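
As a rough illustration, the two worker profiles above can be generated as Deployment manifests from Python and serialized to JSON, which Kubernetes accepts as well as YAML. The image name, the `tasks` module, the labels, and the concurrency values are assumptions; the queue names, pools, and resource figures follow the slide.

```python
import json


def worker_deployment(name, queue, pool, concurrency, cpu, memory, replicas=1):
    """Build an apps/v1 Deployment dict for one Celery worker type."""
    labels = {"app": name}
    command = [
        "celery", "-A", "tasks", "worker",
        f"--pool={pool}",
        f"--concurrency={concurrency}",
        f"--queues={queue}",
    ]
    resources = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    container = {
        "name": name,
        "image": "registry.example.com/celery-app:latest",  # hypothetical image
        "command": command,
        "resources": resources,
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


heavy = worker_deployment("heavy-worker", "heavy_tasks", "prefork", 2, "2", "2Gi")
light = worker_deployment("light-worker", "light_tasks", "gevent", 100, "500m", "512Mi")
heavy_json = json.dumps(heavy, indent=2)
```

Keeping one Deployment per worker type is what lets each one get its own replica count and autoscaling policy.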

18 of 28

OBSERVABILITY BEST PRACTICES

19 of 28

Choosing The Right Concurrency Model

Prefork Worker Pool

Celery’s default pool is prefork. It forks separate OS processes for each worker.
This is great for CPU-bound tasks like:
Running pandas data transformations
ML model inference
Anything that eats up memory and processor cycles.

gevent: The I/O-bound multitasker

On the other hand, gevent is built for I/O-bound tasks. It uses greenlets: lightweight, cooperative threads that can handle many tasks at once, as long as they aren’t CPU-heavy.
This works best for workers that:
Just make decisions
Fetch data from APIs
Push messages around
Wait on queues or databases

“When deploying Celery in production, choosing the right concurrency model — or worker pool — makes a huge difference in performance, resource usage, and debugging complexity. Celery supports multiple options, but the two most common ones are prefork and gevent.

Prefork is the default. It uses separate OS-level processes for each worker. That makes it ideal for CPU-bound tasks like pandas-heavy data transformations, ML inference, or anything computationally intensive. Because each task runs in its own process, they don’t interfere with each other — no GIL issues — but this comes at the cost of higher memory usage.

On the other hand, gevent uses greenlets, which are lightweight, cooperative threads. This model is excellent for I/O-bound tasks, for example orchestrators that just talk to APIs, queue managers, or tasks that perform routing decisions. These don’t need multiple CPUs; they need concurrent handling of multiple I/O calls without blocking.

Going back to the previous diagram:

In our architecture, this distinction became super important. Some of our workers, like the Distributor or the Service Initiator, weren’t doing any real computation. They were just figuring out what to trigger next — classic I/O-bound work. Initially, I made the mistake of running them with prefork. What happened was — they waited for downstream tasks to finish before accepting the next message. This led to massive backlogs and I had to keep scaling up pods just to keep up.

But once I switched to gevent, things turned around. The distributor could now handle many users concurrently, without blocking on downstream work. The number of pods dropped to single digits, and performance improved dramatically.

And this is where Kubernetes really shines. By playing with the concurrency limit and the number of pods I can scale up to, I can mathematically compute how many messages I can handle at once, or how many users can enter the pipeline in parallel.
It becomes a game of control knobs: gevent gives you concurrency, Kubernetes gives you elasticity, and combined, you get precise scaling and cost-efficiency.”
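
The "compute it mathematically" point is simple multiplication: in-flight capacity is pods times per-pod concurrency. The numbers below are made up to show the shape of the calculation and are not the figures from the production system.

```python
def pipeline_capacity(pods: int, concurrency_per_pod: int) -> int:
    """Messages the worker fleet can process at the same time."""
    return pods * concurrency_per_pod


def pods_needed(target_in_flight: int, concurrency_per_pod: int) -> int:
    """Smallest pod count that reaches the target (ceiling division)."""
    return -(-target_in_flight // concurrency_per_pod)


# Hypothetical target: 400 users in the pipeline at once.
prefork_pods = pods_needed(400, concurrency_per_pod=4)    # 100 pods
gevent_pods = pods_needed(400, concurrency_per_pod=100)   # 4 pods
```

This only holds for I/O-bound stages; a CPU-bound task does not get faster by raising gevent concurrency.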

20 of 28

Task Queue Routing

Let’s talk about a best practice in Celery — pattern-based task routing.

When you have a large number of tasks — especially in microservice setups or large monolithic services with multiple domains — naming tasks strategically becomes really important.

In this example, we’re using the task_routes configuration with an ordered list of pattern matches.
This format ensures the order of matching is preserved, which can be crucial if you have overlapping patterns.

Here’s what this does:

Any task whose name starts with feed.tasks. goes to the feeds queue.

Tasks under web.tasks. are routed to the web queue.

And tasks like video.tasks.encode or image.tasks.resize — which match the regex — go to the media queue.

This is a clean, scalable way to route tasks based on namespace or functional domain.

Why is this important?

It simplifies queue management. You know exactly which queues handle which types of workloads.

It improves observability. You can quickly identify bottlenecks — like if your media queue is growing, maybe image processing is too slow.

It’s scalable. Adding a new task type? Just follow the naming convention and plug it into the routing logic.

It allows fine-grained tuning. You can assign different worker types or resource profiles to each queue.

So, a key takeaway:
Name your tasks wisely, use consistent prefixes, and route them using pattern matching.
This gives you powerful control over how your system behaves under load and helps you scale gracefully as you add more functionality.

21 of 28

Caching ML Models in Celery

Celery workers cache models on startup → no per-task reload
Eliminates redundant I/O → saves seconds per task
Fully asynchronous + parallel on Kubernetes
Similar to avoiding cold starts in Lambda-style systems

“In this stage of the pipeline, our tasks were heavily data and compute-intensive — we're talking pandas transformations, filtering, joins, and ML inference per dataset. These weren't just background jobs — they had tight latency requirements and ran in real time as users interacted with our system.

To optimize performance, we made a crucial design decision: instead of loading ML models for every task, we loaded them once during the Celery worker startup. So when a worker pod boots up, it loads the model into memory and keeps it available. This saved us a few seconds per task, which might not sound like a lot — but across thousands of datasets per user, it added up fast.

This approach is similar to avoiding cold starts in AWS Lambda. In Lambda, you often have to reinitialize everything unless you use global scope tricks. But with Kubernetes and Celery, I could treat each worker pod like a persistent microservice: load once, serve fast.

The other advantage was parallel execution. Since these ML tasks were CPU-bound, we used prefork workers, so model inference tasks didn’t block each other. And since they were isolated by pod, each could carry its own version of the model without interference.

Kubernetes again helped here — I could assign memory-heavy pods for model inference, and lightweight ones for orchestration. This gave us a perfect balance of speed, reliability, and control — while staying fully asynchronous and horizontally scalable.”

22 of 28

Observability With Sentry Real-time Error Alerts

23 of 28

Task Signals Example

24 of 28

Custom EFK Stack

25 of 28

Error Detection + Queue Monitoring = Reliable Systems

Sentry tracks real-time Celery task failures
SQS metrics reveal queue health: backlog, delays, consumption
Dashboards justify autoscaling decisions
Combined view enables proactive debugging

We used both Sentry and SQS metrics to monitor how our system was doing in real time.

Sentry was mainly used to track task failures.
It gave us details like:

Task name

Arguments passed

Stack trace

Which worker it failed on

This helped us debug issues quickly and figure out which part of the system was breaking.

On the other hand, SQS metrics told us how healthy our queue was.

For example:

If the age of the oldest message was increasing, it meant tasks were waiting too long.

If number of messages visible was high, we had a backlog.

If number of empty receives was high, we were scaling too much without actual work to do.

And messages deleted helped track tasks that finished successfully.

These dashboards gave us justification for autoscaling.
For example, if the queue was stuck or messages were building up, it made sense to spin up more pods.
So even though autoscaling was automatic, these metrics helped us understand why it was needed and whether the thresholds were correct.

In short, this setup helped us move from being reactive to being proactive with monitoring.
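
The queue-health rules above can be written down as a small check. The metric keys are the CloudWatch names for SQS; the function, the thresholds, and the sample values are hypothetical and would need tuning against real traffic.

```python
def diagnose_queue(metrics: dict, max_age_seconds: int = 60,
                   max_backlog: int = 1000, max_empty_receives: int = 500) -> list:
    """Turn raw SQS metric values into plain-language findings."""
    findings = []
    if metrics.get("ApproximateAgeOfOldestMessage", 0) > max_age_seconds:
        findings.append("tasks are waiting too long")
    if metrics.get("ApproximateNumberOfMessagesVisible", 0) > max_backlog:
        findings.append("backlog is building: scale out")
    if metrics.get("NumberOfEmptyReceives", 0) > max_empty_receives:
        findings.append("workers are polling an empty queue: scale in")
    return findings


# Hypothetical snapshot of a queue that is falling behind.
snapshot = {
    "ApproximateAgeOfOldestMessage": 240,
    "ApproximateNumberOfMessagesVisible": 3500,
    "NumberOfEmptyReceives": 12,
}
report = diagnose_queue(snapshot)
```

For this snapshot the report contains the first two findings, which is the situation where spinning up more pods makes sense.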

26 of 28

Coming back to this — we applied all the tricks and techniques we discussed earlier.

We have full control over the compute: the type of nodes, CPU, and RAM.

We have scaling policies that can handle bursty traffic patterns.
With the right design, this workflow can be made very dynamic:

you can add more components to it,

and make a conditional journey for each customer.

The best part: the same workflow took 30 seconds on Step Functions but about 8 seconds here. Of course, this involved a lot of optimization at the code level and the Celery configuration level. But as a solution, when many customers enter a journey at once, this can handle a considerably higher RPS than the Step Functions and Lambda approach.

This helped us build a system that was truly production-ready. In fact, the same workload that previously took around 30 seconds for end-to-end processing was now brought down to under 8 seconds.

Some components were deployed as AWS Lambdas, which were outside our direct control. But once we moved to a managed infrastructure, we had full control over resource allocation — especially CPU and RAM — which also allowed us to optimize the code further. That significantly boosted performance as well.

The workflow itself was highly dynamic and responsive to varying needs.
When a customer entered the system, the workflow would assess what kind of datasets were required, how many needed to be generated, and accordingly, at the dataset initiator stage, route those tasks to different workers. Based on demand, we could also scale these workers dynamically.

We also supported different types of data aggregation, which could be triggered and scaled independently, depending on the use case.

Unlike Lambda, we had full control over the CPU and RAM for each worker. We weren’t tied down by serverless constraints. We used different worker pool types — like prefork for CPU-bound tasks and gevent for I/O-heavy stages — to accelerate specific parts of the pipeline.

The dataset initiator knob controlled how many customers the system could onboard concurrently — based on the number of pods and worker concurrency.

But here's the key point: the real load was not just about the number of customers coming in. The actual pressure on the system was driven by the number and complexity of datasets that had to be generated. While customer volume might correlate, the true driver of scaling was the dataset processing load.

So when it came to scaling, we didn’t scale based on user activity alone — we scaled based on the number of messages waiting in the queue, which reflected the real system pressure.

27 of 28

Feature	Celery + Kubernetes	SFN + Lambdas
1. Cost Drivers	EC2, EKS services, network management, DevOps engineer	Lambda execution cost, SFN state transition cost
2. Concurrency	Virtually unlimited	Default 1,000 concurrent Lambdas; can be increased to an extent
3. Scaling Behavior	Define scaling policies; analyze load tests; good for high RPS	Effortless scaling; good for most MVPs
4. Control Over Compute	Fine-grained control over RAM and CPU	Limited to Lambda specs
Ideal For	Teams needing full control and high throughput; DevOps-savvy setups	Rapid prototyping; event-driven workflows; teams prioritizing ease over control

28 of 28

KEY TAKEAWAYS & SUMMARY

Celery is a powerful tool for workflows
K8s controls Compute and Scaling
Observability in systems ensures reliability and error tracking.