1 of 128

Building Resilient Multi-Agent System with LangGraph: Lessons from Production

Building Multi-Agent Systems That Don't Die with LangGraph: A Real-World Agent Survival Story (Korean title)

오성우

PyAI Symposium 2025

All about AI and Python

2 of 128

Speaker

오성우 Sungwoo Oh

  • LinkedIn linked.in/sackoh
  • GitHub github.com/sackoh

PyCon Korea 2019 - An Effective Diet Strategy for the Fat and Sluggish Pandas

PyCon Korea 2020 - A Deep Dive into ALBERT Developed for Financial Language Understanding

PyCon Korea 2023 - Strategies for Effectively Processing and Analyzing Large Data in a Local Environment

A believer in the power of open source and sharing…


3 of 128

Contents

Introduction ... 4
Timeline of Pandas ... 14
What's New in Pandas 2.0 & Features ... 22
Validation of 2019's Efficient Pandas Strategies ... 35
  I. Memory Optimization ... 38
  II. Performance Enhancement ... 48
  III. Method Chaining ... 60
If Data Size Larger than Local Hard Disk? ... 65
Hugging Face Datasets ... 66


4 of 128

Do we really need a framework for agents? 🤔

  • LLM ecosystem today
    • Dozens of orchestration frameworks
    • New “agent platforms” every month
  • But many experts say:
    • “Just start with the raw API”
    • “Frameworks are optional, not mandatory”


5 of 128

Anthropic: “Start simple. Use direct API calls first”

  • Suggested Approach
    • Begin with basic loops or scripts
    • Use raw outputs (JSON / tool responses) directly
    • Avoid unnecessary abstraction early on
    • Measure what actually hurts
    • Only then consider a framework
  • Core idea:
    • Simple first, complexity later
    • Frameworks are optional, not a prerequisite
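The "basic loops or scripts" idea fits in a few lines of plain Python. Below is a minimal sketch: `call_llm` and the tool registry are stand-ins invented to keep the example self-contained, not a real provider SDK.

```python
import json

def call_llm(messages):
    # Stand-in for a raw provider API call (e.g. a chat-completions
    # endpoint). Returns plain JSON: either a tool request or an answer.
    last = messages[-1]["content"]
    if "weather" in last.lower():
        return json.dumps({"tool": "get_weather", "args": {"city": "Seoul"}})
    return json.dumps({"answer": last.upper()})

TOOLS = {"get_weather": lambda city: f"Sunny in {city}"}  # hypothetical tool

def run_agent(user_input, max_steps=5):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):                  # the loop
        reply = json.loads(call_llm(messages))
        if "tool" in reply:                     # the tools
            result = TOOLS[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": result})
        else:
            return reply["answer"]
    return "step budget exhausted"
```

No framework is involved: the control flow is a visible ten-line loop, which is exactly what makes this starting point easy to measure and debug.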


Introduction

Sources: https://www.anthropic.com/engineering/building-effective-agents


6 of 128

Anthropic: “Start simple. Use direct API calls first”

“Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.”

...

“We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what's under the hood are a common source of customer error.”


Sources: https://www.anthropic.com/engineering/building-effective-agents


7 of 128

Other voices: Karpathy, Hamel, …

  • Andrej Karpathy
    • “LLM apps = prompt + loop + tools”
    • Prefer transparent, debuggable code early on
  • Hamel Husain & Others
    • Ship v0 with minimal orchestration
    • Focus on logging, prompts, evals > fancy graphs
  • Shared message:
    • Don’t start with heavy abstractions
    • Own the control flow yourself first
    • Add frameworks only when pain is proven

(References to be added)

8 of 128

Framework? Just use raw Python code…

  • Common anti-abstraction claims
    • “Why use LangGraph? Just write a few lines of Python on top of the LLM API.”
    • “More abstraction → less control”
    • “We’ll get locked into the framework”
  • These concerns are reasonable
  • Let’s apply the same logic somewhere else…


9 of 128

Why we don't hand-roll LLM serving

  • We use vLLM not because we're incapable
  • We use it because we can't spend all our time on:
    • KV cache, batching, scheduling
    • Glue code for every new open-source model
    • Model rollout / rollback
    • Monitoring, logging, autoscaling
  • We delegate infra complexity to a serving framework
  • This lets us focus on:
    • "What should we serve to users?"
    • Not "How do we reinvent a serving stack?"


10 of 128

The same logic applies to agent frameworks

  • Agent systems have their own “infra”:
    • Multi-agent orchestration
    • State management & checkpoints
    • Retry, pause/resume
    • HITL, tool calling, long-running workflows
    • Observability, debugging, testing, deployment
  • Saying “we’ll code all of this from scratch” is like saying:
    • “Don’t use vLLM, build your entire serving stack yourself.”
  • Where do we want to spend our limited engineering time?
    • Frameworks don’t remove control
    • They let us delegate low-leverage complexity so we can focus on product & value


11 of 128

Small Teams: Not Desperate, Strategic

  • 👤 Common narrative:
    • “We’re a small team, we have no choice but to use frameworks.”
  • 📍 My claim:
    • Small teams are not making desperate choices
    • They are making strategic choices
  • Key questions:
    • With limited people & time
    • Where do we intentionally invest our engineering effort?
    • What do we consciously delegate to a framework?


12 of 128

(Diagram in progress)

13 of 128

The Stack of Doom for Small Agent Teams ☠️

  • Prompt & Agent Design
    • Per-agent / per-model prompt tuning
    • System prompts, roles, tools schema
  • Workflow & Orchestration
    • Complex branching logic
    • State management across steps
    • Error handling & retries
  • RAG & Data Layer
    • Vector DB / indices
    • Retrieval quality, ranking
    • Data refresh & governance

  • Tooling & Integration
    • MCP servers / tool APIs
    • Internal microservices as tools
    • Auth, rate limits, quotas
  • Ops & Platform
    • Deployment, CI/CD
    • Monitoring, logging, tracing
    • Cost & quota management


14 of 128

3-4 Engineers, 3 Months: What’s Realistic?

  • Small-team reality check 🧮
    • Team size: 3-4 engineers
    • Timeline: ~3 months to MVP/launch
  • Key questions:
    • Can we implement the entire Stack of Doom in raw Python?
    • Can we reliably maintain it as requirements & models change?
    • Can we handle prod incidents, regressions, and feature requests on top?
  • Strategic view:
    • Using frameworks is not a sign of weakness
    • Concentrating effort on differentiating layers
    • Delegating generic plumbing to a runtime/framework
  → With this in mind, let's see where LangGraph fits into this stack.


15 of 128

Why we chose LangGraph (not settled for it)

  • Strategic lens for LangGraph
    • 1️⃣ Business - Focus on what the agent does
    • 2️⃣ Financial - Custom agents can be cheaper
    • 3️⃣ Ops - Agents must be safe and resilient
  • Not “we’re small, we have no choice”
  • But: “Given our constraints, what’s the smartest stack?”


16 of 128

Business First: What should the agent actually do?

  • Business-first mindset 💼
    • Less: “How do we implement multi-agent orchestration?”
    • More: "What exact workflow should this agent execute?"
  • With LangGraph, we could spend more time on
    • Task decomposition & domain logic
    • UX of failure / clarification / handoff
    • Guardrails aligned with business rules
  • The framework handled
    • Wiring between nodes
    • State passing, retries, branching
  → We optimized for user value, not graph plumbing


17 of 128

Costs Matter: When custom agents are actually cheaper

  • General-purpose agents are convenient… but 💸
    • Claude Code, generic “do-everything” agents
    • Great DX, but:
      • Longer reasoning chains
      • More tool calls & loops
      • Unpredictable token usage
    • Custom agent on LangGraph:
      • Narrow, task-specific flows
      • Hard limits on: depth of reasoning, number of tool invocations
      • Predictable + lower per-run cost
  • Trade off:
    • Higher dev cost upfront, but lower operational cost at scale
  → For us, custom agents were the cheaper choice over time


18 of 128

Ops view: Agents that don't die or go rogue

  • Ops reality: shipping != done 🚨
    • Real work starts after deployment
    • We care about
      • Not just "does it run?"
      • But "does it recover?" and "does it avoid bad actions?"
  • Operational needs for agents
    • State & checkpointing
    • Automatic retries & backoff
    • Pause / resume long-running workflows
    • Safe rollbacks on failures
  • LangGraph gave us a runtime built around these features, plus hooks for observability
    • → This made "don't die, don't go rogue" a design property, not an afterthought


19 of 128

Recap: LangGraph as a Strategic Bet

  • Using an agent framework like LangGraph is a strategic bet
    • We want to optimize across…
      • Business - focus on behavior & value
      • Finance - control run-time cost with custom flows
      • Ops - build agents that don’t die or go rogue
  • LangGraph acts as a
    • Controlled runtime for multi-agent workflows
    • Platform for survivability features (state, retries, HITL)
    • → A strategic decision balancing product, cost, and reliability


20 of 128

Designing "Hard to Kill" Multi-Agent Systems with LangGraph

  • This talk is not
    • A LangGraph 101 tutorial
    • A feature-by-feature walkthrough
  • This talk is
    • A field report on designing resilient multi-agent systems
    • On top of LangGraph, in real production environments


21 of 128

(Screenshot: LangGraph 1.0 release announcement)

22 of 128

LangGraph 1.0: Perfect Timing

  • Timing ⏲️
    • While mapping our “Stack of Doom”
    • LangGraph 1.0 was officially released
  • This raised a key question:
    • “Is this still a demo tool?”
    • Or “is this ready for production?”


23 of 128

LangChain vs LangGraph: Who does What?

Use LangChain to shape the agent; use LangGraph to keep the agent alive

LangChain
  • Linear flow
  • Built on top of LangGraph
  • Standard agent patterns
  • Tool calling & model wrappers
  • Great for: "build something quickly"

LangGraph
  • Graph-based flow
  • State-based graph runtime
  • Long-running workflows
  • Checkpoints, interrupt/resume
  • Multi/hierarchical agent orchestration

Sources: https://medium.com/data-science-collective/langchain-vs-langgraph-simple-comparison-9798d8c8a95c


24 of 128

LangGraph 1.0: From Toy to Real Runtime

  • v1.0 felt like a phase shift
    • From: “Cool research / demo tool”
    • To: “Potential production runtime”
  • Why it mattered to us
    • Stability for long-running workflows
    • Durable (parallel) execution + checkpointing
    • Practical interrupt / resume
    • Tight integration with LangChain & LangSmith + other Lang-ecosystem
  • “Can we run production on this?”


25 of 128

What changed in 1.0


Sources: https://docs.langchain.com/oss/python/langgraph/overview


26 of 128

What changed in 1.0 (That we actually care about)

  • Key aspects we cared about in 1.0 ✅
    • Long-running workflow stability
    • Durable execution & checkpoints
    • Interrupt / resume at practical level
    • Full integration with 🦜⛓️LangChain Runnables
    • 🪢 Langfuse integration (+ LangSmith) for observability
    • Self-hosted agent deployment
  • More “real orchestration engine for prod”


27 of 128

LangGraph Application Structure


Sources: https://docs.langchain.com/oss/python/langgraph/overview


28 of 128


29 of 128


The multi-agent system stack, revisited


30 of 128

① LLMs - multiple sources by design

  • Cloud providers, or LLM Providers
    • Azure + OpenAI
    • AWS Bedrock + Claude
    • GCP Vertex AI + Gemini
  • On-prem OSS serving
    • E.g., gpt-oss-120b, Qwen3, …
  • Local / fine-tuned models
    • Task-specific fine-tuned small LLMs
  • Assumption: Agents will route across heterogeneous LLM backends


31 of 128

The real issue behind multi-model strategy

  • Our LLM question was not:
    • “How can we use many models?”
  • Our real question was:
    • “How do we make sure the system doesn’t die when one model dies?”
    • "How do we switch to a newer model without any workflow code changes?"
  • Multi-model reality:
    • Azure OpenAI, AWS Bedrock, GCP Gemini
    • On-prem OSS
    • Local fine-tuned LLMs
  • Goal: Swap / fail models without taking the graph down


32 of 128

Why we didn't hand-code every provider

  • Why not implement per-model logic?
    • Different SDKs
    • Different parameters
    • Different error types

→ Managing all of this per-agent = 💀

  • Approach
    • Wrap everything behind LangChain’s `BaseChatModel`
    • Serve OSS via vLLM OpenAI-compatible API

→ Swapping models = config change, not code surgery


33 of 128

In Code: Unifying Models with `BaseChatModel`

(Code: the pre-1.0 loading style)

(Code: how each provider model is loaded in 1.0)
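Since the actual snippets are not in this export, here is a sketch of the pattern they illustrate. In LangChain the concrete classes would all share the `BaseChatModel` interface (and can be created via `init_chat_model`); the stub classes below are invented stand-ins that show why swapping providers becomes a config change rather than code surgery.

```python
from dataclasses import dataclass

@dataclass
class AzureOpenAIChat:          # stand-in for a LangChain chat model
    model: str
    def invoke(self, prompt: str) -> str:
        return f"[azure:{self.model}] {prompt}"

@dataclass
class BedrockClaudeChat:        # stand-in for another provider
    model: str
    def invoke(self, prompt: str) -> str:
        return f"[bedrock:{self.model}] {prompt}"

PROVIDERS = {"azure": AzureOpenAIChat, "bedrock": BedrockClaudeChat}

def load_chat_model(config: dict):
    # Every provider hides behind the same invoke() interface, so
    # graph nodes never know which backend they are talking to.
    return PROVIDERS[config["provider"]](model=config["model"])

model = load_chat_model({"provider": "azure", "model": "gpt-4o"})
```

Switching from Azure to Bedrock is then a one-line change in the config dict, with no edits to any graph node.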


34 of 128

Keeping the graph alive with `.with_fallbacks()`

  • Principle
    • Never expose raw provider errors to the user
    • Always return some controlled response
  • Fallback chain example

Primary: Azure OpenAI
  ↓ on failure
Fallback 1: AWS Bedrock Claude
  ↓ on failure
Fallback 2: GPT-OSS via vLLM
  ↓ on failure
Final: safe "service unavailable" message


35 of 128

Coding the Fallback Chain, Not Try-Except Hell

LangGraph nodes stay simple; the resilience lives in the model layer
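A pure-Python sketch of the chain (provider names and the `ProviderError` type are illustrative). In LangChain the same behavior comes from composing models with `.with_fallbacks()`; the point here is that no raw provider error ever escapes.

```python
class ProviderError(Exception):
    pass

def make_model(name, healthy=True):
    # Stand-in for one provider in the chain; healthy=False
    # simulates an outage.
    def call(prompt):
        if not healthy:
            raise ProviderError(f"{name} unavailable")
        return f"{name}: {prompt}"
    return call

SAFE_MESSAGE = "Service is temporarily unavailable. Please try again."

def invoke_with_fallbacks(models, prompt):
    for call in models:            # primary first, then each fallback
        try:
            return call(prompt)
        except ProviderError:
            continue               # never leak raw provider errors
    return SAFE_MESSAGE            # final controlled response

chain = [
    make_model("azure-openai", healthy=False),    # primary is down
    make_model("bedrock-claude", healthy=False),  # fallback 1 is down
    make_model("gpt-oss-vllm"),                   # fallback 2 answers
]
```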

(Example code: chaining models with .with_fallbacks())

36 of 128

② Context - everything around the LLM

  • Tools
    • Weather, exchange rates, calculator, …
    • Custom internal tools
  • Databases
    • Vector DBs for retrieval / RAG
    • Legacy DBMS for transactional data
  • External APIs
    • Third-party services
    • Partner/internal microservices
  • In LangGraph: Nodes = “when/how to call which tool with which state”


37 of 128

Context Engineering: Feeding LLMs the Right Context

Agents need carefully chosen context to act effectively.


Sources: https://blog.langchain.com/context-engineering-for-agents/


38 of 128

Context Engineering: Feeding LLMs the Right Context

Context engineering is the art and science of filling an LLM’s limited context window with just the right information at each step.


Sources: https://blog.langchain.com/context-engineering-for-agents/


39 of 128

Context as the Agent’s shield

  • Context = not just “extra prompt text”
  • It’s a defensive layer for agent survival
    1. Pre-context layer: Clean & sanitize input before LLM
    2. Execution context layer: Keep model/tool calls from exploding
    3. Post-context layer: Validate & adapt output for the product
  • We focus mostly on the pre-context layer


40 of 128

3-Layer Context Middleware (Pre / Exec / Post)

Pre-context layer (before LLM)
  • Input cleaning & normalization
  • Length control (summaries / trims)
  • PII removal
  • Intent classification / safety filter
  • Context shaping (decide if RAG is needed)

Execution context layer (during calls)
  • Limit model call count
  • Limit tool calls & recursion
  • Timeouts / circuit breakers
  • Fallback model routing
  • HITL intercepts

Post-context layer (after LLM)
  • Output normalization
  • Safety filtering
  • Quality threshold → redirect / degrade
  • Summarized or downgraded responses


41 of 128

LangChain’s Agent Loop

Middleware handles the pre-, execution-, and post-context layers seamlessly


Sources: https://docs.langchain.com/oss/python/langchain/middleware/overview#the-agent-loop


42 of 128

Pre-Context I: Shrink the input, Save the run

  • Problem
    • Long chat history + raw tool/DB dumps → token explosion
    • Context length errors and failed runs
  • Strategy in LangGraph
    • Use `pre_model_hook` (now middleware in 1.0)
    • Insert summary / trim nodes before LLM calls
  • Preserve the minimal context needed for the current decision
  • Token limit errors are handled at the system level, not left to chance



44 of 128

Pre-Context I: Shrink the input, Save the run


Sources: https://community.openai.com/t/this-models-maximum-context-length-is-8193-tokens-does-not-make-sense/288627


45 of 128

Pre-Context I: Shrink the input, Save the run


46 of 128

Pre-Context I: Shrink the input, Save the run
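A sketch of the trim step, assuming a simple character budget as a stand-in for real token counting (LangChain ships utilities such as `trim_messages` for the production version):

```python
def trim_history(messages, max_chars=2000):
    # Keep the system prompt plus the most recent turns that fit in
    # the budget; everything older is dropped before the LLM call.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, used = [], 0
    for m in reversed(rest):                 # walk newest-first
        if used + len(m["content"]) > max_chars:
            break
        kept.append(m)
        used += len(m["content"])
    return system + list(reversed(kept))
```

The same hook is also the natural place to swap "drop" for "summarize" when older turns still carry decisions the agent needs.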

(Example code: shrinking the input before the LLM call)

47 of 128

Pre-Context II: Mask PII at the door, Not after the fact

  • PII examples
    • Phone numbers, credit card numbers
    • Email addresses, national IDs
  • Design choice
    • Strip / mask PII before LLM sees anything
    • Implement as a dedicated PII-masking node at the graph entry
  • This is about security & compliance, not prompts
  • Post-incident, we can confidently say:
    "No raw PII ever entered the LLM context"


48 of 128

Pre-Context II: Mask PII at the door, Not after the fact

  • In practice
    • Regex + validators in a LangGraph node
    • Replace with placeholders (<|PHONE|>, <|CARD|>, …)
    • Optional mapping table stored on a secure side-channel if needed


49 of 128

Pre-Context II: Mask PII at the door, Not after the fact

  • In practice
    • Regex + validators in a LangGraph node
    • Replace with placeholders (<|PHONE|>, <|CARD|>, …)
    • Optional mapping table stored on a secure side-channel if needed
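A minimal version of such a node; the regex patterns are illustrative only (production needs locale-specific rules and checksum validators on top):

```python
import re

# Hypothetical patterns for illustration, applied in order
PII_PATTERNS = [
    (re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b"), "<|PHONE|>"),
    (re.compile(r"\b(?:\d{4}[- ]){3}\d{4}\b"), "<|CARD|>"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<|EMAIL|>"),
]

def mask_pii(state: dict) -> dict:
    # Entry node: replace PII with placeholders before anything
    # reaches the LLM context.
    text = state["user_input"]
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return {**state, "user_input": text}
```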

(Example code for the above)

50 of 128

Pre-Context III: RAG only when needed

  • Naive RAG pattern
    • Query → retrieve docs → send all docs + query to LLM
  • Problems in production
    • Unnecessary retrieval cost for simple queries
    • Irrelevant docs polluting the prompt
    • Higher latency & token usage
  • Context shaping layer
    • Pre-LLM router decides: direct LLM vs RAG, which domain retriever to use, how many docs
    • Based on query type, intent, and risk
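A sketch of that router; the keyword rules and retriever names are placeholders for whatever cheap classifier (rules, a small model, …) actually makes the decision:

```python
def route_query(query: str) -> dict:
    # Pre-LLM routing: decide whether retrieval is needed at all,
    # which domain index to hit, and how many docs to pull.
    q = query.lower()
    if any(w in q for w in ("refund", "policy", "contract")):
        return {"use_rag": True, "retriever": "legal", "top_k": 8}
    if any(w in q for w in ("price", "plan", "billing")):
        return {"use_rag": True, "retriever": "billing", "top_k": 4}
    return {"use_rag": False, "retriever": None, "top_k": 0}  # direct LLM
```

Simple queries skip retrieval entirely, which is where most of the latency and token savings come from.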


51 of 128

③ Workflow - where LangGraph shines

  • Prompts
    • System / task prompts per agent
  • Source code
    • Routing logic, transformations, validation, …
  • State
    • Conversation state, intermediate results
  • Authorization / policies
    • Who can trigger what, with which tools
  • This is modeled as a LangGraph:
    • Nodes = agents / tools / control logic
    • Edges = transitions based on state & outcomes


52 of 128

How LangGraph keeps agents alive at runtime

  • Agent-level survival features in LangGraph
    • Recursion limit → stop infinite loops
    • Errors as state, not exceptions
    • Multi-level timeouts
    • Interrupts / breakpoints
  • Combined goal: prevent both infinite loops and infinite waiting


53 of 128

Recursion Limit: Cutting infinite loops with recursion limits

  • Common failure mode: infinite loops
    • Agent A → Agent B → Agent A → …
    • Misplanned “try again” loops that never end
  • LangGraph’s help
    • Pregel-style engine with recursion limit
    • Stop execution after N graph steps / hops
  • LLM can still make bad plans
    • But the runtime prevents bad plans from killing the system


54 of 128

Recursion Limit: Cutting infinite loops with recursion limits
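In LangGraph the cap is passed through the run config (the `recursion_limit` key), and a run that exceeds it raises `GraphRecursionError`. The pure-Python sketch below reproduces the guard so the mechanics are visible:

```python
class RecursionLimitError(RuntimeError):
    pass

def run_graph(start_node, state, recursion_limit=25):
    # Pregel-style step loop: each node returns (next_node, state);
    # the runtime caps the number of hops so a bad plan cannot
    # loop forever.
    node, steps = start_node, 0
    while node is not None:
        if steps >= recursion_limit:
            raise RecursionLimitError(f"stopped after {steps} steps")
        node, state = node(state)
        steps += 1
    return state

def ping(state):        # ping and pong hand off to each other forever
    return pong, state

def pong(state):
    return ping, state

hit_limit = False
try:
    run_graph(ping, {}, recursion_limit=10)
except RecursionLimitError:
    hit_limit = True    # the runtime, not the LLM, broke the loop
```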

(Example code for the above)

55 of 128

Errors as State, Not Exceptions

  • Traditional view: Error = throw exception → 500
  • Our graph view: Error = part of state
  • State schema example:
    • status: “ok” | “error” | “degraded”
    • error: error code / message
    • attempts: number of retries
  • Errors are handled inside the graph, instead of killing the entire service
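A sketch of that schema, with a node that records failures in state instead of raising (the flaky tool is simulated):

```python
from typing import Optional, TypedDict

class AgentState(TypedDict):
    status: str              # "ok" | "error" | "degraded"
    error: Optional[str]
    attempts: int
    result: Optional[str]

def flaky_tool_node(state: AgentState) -> AgentState:
    # Failures become state, so downstream edges can route on
    # `status` instead of the whole service returning a 500.
    try:
        if state["attempts"] < 1:       # simulate a first-try failure
            raise TimeoutError("tool timed out")
        return {**state, "status": "ok", "error": None, "result": "done"}
    except Exception as exc:
        return {**state, "status": "error", "error": str(exc),
                "attempts": state["attempts"] + 1}
```

A conditional edge can then send `status == "error"` back for a retry and `status == "ok"` onward, all inside the graph.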


56 of 128

Timeouts: Don’t let one slow node freeze the graph

  • Multi-level timeouts
    • Model call timeout
    • Tool call timeout
    • Overall graph execution timeout
  • Per-node timeouts, e.g.
    • Retrieval + justification node: up to 5s
    • Simple reformatting node: abort after 2s, then fallback
  • On timeout
    • Route to a summarized / partial-answer node, or a user-facing apology + background-queue node
    • Users rarely see bare "I can't find anything" / "no result" non-answers
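One way to sketch a per-node hard deadline with the standard library (timings scaled down for the example; real graphs would use the runtime's own timeout hooks):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def run_node_with_timeout(node, state, timeout_s):
    # Hard deadline around one node; on timeout we degrade
    # gracefully instead of freezing the whole graph.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(node, state)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return {**state, "status": "degraded",
                    "result": "Partial answer; full result is being prepared."}

def slow_retrieval(state):
    time.sleep(0.2)     # stand-in for a slow retrieval call
    return {**state, "status": "ok", "result": "full answer"}
```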


57 of 128

Stopping both infinite loops and infinite waiting

  • Defending against two big enemies:
    • 1️⃣ Infinite loops
    • 2️⃣ Infinite waiting
  • LangGraph survival combo:
    • recursion_limit → cap steps / loops
    • timeouts → cap waiting per node / graph
    • interrupts → inject humans at critical points
    • errors-as-state → avoid global 500s
  • Together, these turn LangGraph from "a nice way to draw graphs" into "a runtime that keeps your multi-agent system alive in production"

(Needs work)

58 of 128

Wait… We were already using safety features

  • Safety features were already on by default
  • Built-in features
    • Default retries (max_retries)
    • Exponential backoff
    • Fail-safe model calls
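Under the hood this boils down to a retry loop with exponential backoff, roughly like the sketch below (delays shortened for the example; LangChain chat models expose the equivalent via their `max_retries` setting):

```python
import time

def call_with_retries(fn, max_retries=3, base_delay=0.01):
    # Retry transient failures with exponential backoff; re-raise
    # only once the retry budget is spent.
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}

def flaky():
    # Fails twice, then succeeds -- a typical transient error.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"
```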


59 of 128

④ Ops & Assets - keeping agents alive in prod

  • Ops
    • Deployment, CI/CD
    • Monitoring, logging, alerting
    • Cost & quota control
  • Assets
    • Config, secrets
    • Prompt & model versions
    • Policy definitions
  • LangGraph’s role here
    • Exposes traces, checkpoints, execution logs
    • Makes the workflow observable by external Ops tooling


60 of 128

Observability: You can’t fix what you can’t see


61 of 128


62 of 128

Limitations of Pandas


High memory usage

Performance degradation

Low scalability

Memory mapping issue

Garbage collection

Not thread safe

Introduction


63 of 128

High Memory Usage

  • Object overhead
    • Every object carries extra metadata: data type, index, and more
    • Elements of object-dtype columns are full Python objects
  • Data type handling
    • Data is stored in a memory-intensive way
  • Indexing


Introduction


64 of 128

Degraded Performance

  • Single-threaded operations
    • Python's GIL (Global Interpreter Lock) underlies this issue
  • Inefficient data types
    • The default backend is NumPy, which was not designed for the Pandas DataFrame
    • Performance depends on NumPy's characteristics
  • Overhead of Python objects
    • Type checking and correctness validation before operations add overhead
  • Lack of optimization
    • Unexpected object assignment and copying can blow up memory
    • Operations that accept the inplace parameter show this issue most clearly


Introduction


65 of 128

Low Scalability

  • Designed for single-machine
  • No built-in distributed computing support
  • In-memory requirement


Introduction


66 of 128

Other Issues

  • Pandas memory-mapping issues
  • Garbage not being collected
  • Not thread safe


Introduction


67 of 128

How to scale & Alternatives to Pandas

  • Buy more RAM…
  • Ray, Dask, Modin
  • Numba, cuDF
  • Polars


Introduction


68 of 128

Pandas Ecosystem


Sources: https://se.ewi.tudelft.nl/desosa2019/chapters/pandas/

Introduction


69 of 128

Frequently Asked Questions about Pandas (1)

Q: Why does my machine slow down when I apply a function across a large Pandas DataFrame?

Q: Can Pandas handle data streaming for real-time processing on large datasets?

Q: Why is saving a large DataFrame to a CSV file with Pandas so slow?


Introduction


70 of 128

Frequently Asked Questions about Pandas (2)

Q: I tried to replace all the NaN values in a large DataFrame using the fillna() function in Pandas and it crashed. How can I handle this?

Q: How can I handle a very large dataset with Pandas when it cannot even fit on the local hard disk? (e.g., over 10 terabytes or 1 petabyte)


Introduction


71 of 128

Pandas 2.0 Released


Timeline of Pandas


72 of 128

2008-2009 Beginning

  • Created by Wes McKinney while working at AQR Capital Management in 2008
  • Inspired by R's data.frame
  • Designed to handle time-series and structured data in Python
  • Released initial version as BSD license in January 2009
  • Received significant contributions from open-source community


Timeline of Pandas


73 of 128

2012 Growth with Book

  • Published the book "Python for Data Analysis"
  • Compensated for the lack of Pandas documentation at the time
  • Well-written and easy to understand, with a wealth of practical examples and exercises


Sources: www.amazon.com

Timeline of Pandas


74 of 128

2013 Integration with Scientific Ecosystem

  • Became a core component of the scientific Python ecosystem with other libraries such as NumPy, SciPy, and Matplotlib
  • Solidified its position as attention on machine learning and scikit-learn grew


Timeline of Pandas


75 of 128

2014 Wes McKinney Disappears (from the contributor graph)


Sources: https://github.com/pandas-dev/pandas/graphs/contributors

Timeline of Pandas


76 of 128

2019 PyCon KR 2019: Conference Session

  • Proposed three strategies to use Pandas in more memory-efficient and performant ways


Timeline of Pandas


77 of 128

2020 Pandas 1.0 Release

  • I remember it as a release with a lot of symbolism
  • Became influential in the data science field and important to data processing/analysis tools
  • Widely used in both academia and industry


Timeline of Pandas


78 of 128

2023 Pandas 2.0 Release

  • Unlike Pandas 1.0, added many performance and functionality improvements without hurting the user experience
  • Increased integration with Apache Arrow across many features and areas, since the first integration in Pandas 1.5


Timeline of Pandas


79 of 128

What’s New in Pandas 2.0 & Features

  1. Enhanced Performance and Memory Efficiency
  2. Improved Support for Time-Series Data
  3. Introduction to Nullable Data Types (ref. upcasting issue)
  4. CoW Improvements
  5. Apache Arrow Integration


What’s New in Pandas 2.0


80 of 128

CoW Improvement

  • A resource-management technique that shares resources and creates a new copy only when the shared resource is modified
  • In a single-threaded context, copy and write happen as separate steps
  • In a multi-threaded context, copy and write can happen concurrently
  • Example: storing snapshots of data


What’s New in Pandas 2.0


81 of 128

Copy-on-Write

(Diagram: Process1 and Process2 share physical memory pages A, B, and C)

Sources: https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/9_VirtualMemory.html

What’s New in Pandas 2.0


82 of 128

Copy-on-Write

(Diagram: after Process1 writes to Page C, it receives a private Copy of Page C)

Sources: https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/9_VirtualMemory.html

What’s New in Pandas 2.0


83 of 128

CoW in Pandas 2 & 3

  • The technique mitigates accidental alteration or deletion of data in a Pandas DataFrame
  • Through integration with PyArrow, it is expected to become more robust
  • This is also why method chaining (rather than chained assignment) is recommended
    (+ Python's more precise error messaging is beneficial)


What’s New in Pandas 2.0


84 of 128

Official Documentation about CoW in Pandas

  • Introduced in version 1.5.0, Copy-on-Write (CoW) now supports most possible optimizations since version 2.0.
  • CoW is likely to be default from version 3.0.
  • By preventing multiple objects from being updated in a single statement, CoW provides more predictable results and eliminates side effects of chained indexing. Furthermore, by delaying copies, it improves performance and memory usage.


Sources: https://pandas.pydata.org/docs/user_guide/copy_on_write.html

What’s New in Pandas 2.0


85 of 128

Example of CoW in Pandas


Sources: https://pandas.pydata.org/docs/user_guide/copy_on_write.html
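The canonical behavior from the linked user guide, runnable as-is (CoW is opt-in on pandas 2.x and the default from 3.x):

```python
import pandas as pd

# CoW is opt-in in pandas 2.x and default behavior from pandas 3.x
if int(pd.__version__.split(".")[0]) < 3:
    pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
subset = df["foo"]        # no eager copy: data is shared for now
subset.iloc[0] = 100      # the first write triggers a real copy,
                          # so the parent DataFrame stays untouched
```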

What’s New in Pandas 2.0


86 of 128

PyArrow Functionalities in Pandas 2.0

  • More extensive data types compared to NumPy
  • Missing data support (NA) for all data types
  • Performant IO reader integration
  • Facilitate interoperability with other dataframe libraries based on the Apache Arrow specification (e.g. polars, cuDF)


Sources: https://pandas.pydata.org/docs/user_guide/pyarrow.html

What’s New in Pandas 2.0


87 of 128

Apache Arrow

Cross-language development framework for in-memory data

  • Language-independent columnar memory format
  • Organized for efficient analytic operations on CPUs and GPUs
  • Supports zero-copy reads for lightning-fast data access without serialization overhead
  • Available for C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust


Sources: https://arrow.apache.org

What’s New in Pandas 2.0


88 of 128

Pandas with PyArrow

  • Enhances memory efficiency and performance
  • Performs faster computations and manages larger datasets


Sources: https://pandas.pydata.org/docs/user_guide/pyarrow.html


89 of 128

Working with text data in Pandas 2.0

  • It is recommended to use the string dtype rather than object for text data
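A small sketch of the difference (assuming pandas ≥ 1.0, where the dedicated string dtype was introduced):

```python
import pandas as pd

# object dtype can hold anything; "string" is a dedicated text dtype
obj = pd.Series(["foo", "bar", None])                  # dtype: object
txt = pd.Series(["foo", "bar", None], dtype="string")  # dtype: string

print(obj.dtype, txt.dtype)
print(txt.isna().tolist())        # missing text is pd.NA, not np.nan
print(txt.str.upper().tolist())   # .str accessor keeps the string dtype
```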


Sources: https://pandas.pydata.org/docs/user_guide/text.html


90 of 128

Experiment on Pandas 2.0

Experiment by Marc Garcia, a pandas core contributor


| Operation | Time with NumPy | Time with Arrow | Speed up |
|---|---|---|---|
| read parquet (50Mb) | 141 ms | 87 ms | 1.6x |
| mean (int64) | 2.03 ms | 1.11 ms | 1.8x |
| mean (float64) | 3.56 ms | 1.73 ms | 2.1x |
| endswith (string) | 471 ms | 14.9 ms | 31.6x |

Sources: https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i


91 of 128

Improved on Text, but on Others?

  • Be cautious when migrating from legacy pandas to 2.0
  • Not all operations and data types are fully supported by the PyArrow integration
  • Unexpected errors might occur


Sources: https://medium.com/@santiagobasulto/pandas-2-0-performance-comparison-3f56b4719f58


92 of 128

Recap of 2019’s Efficient Pandas Strategies [Link]


Strategy 1. Memory Optimization
  • 1-1 Coding
  • 1-2 Convert Dtypes
  • 1-3 Category type
  • 1-4 IO File Format

Strategy 2. Performance Enhancement
  • 2-1 Vectorization
  • 2-2 Efficient Algorithm
  • 2-3 Beyond apply()

Strategy 3. Adopting Conventions
  • 3-1 Method Chaining
  • 3-2 .inplace parameter
  • 3-3 Deprecations

Datasets: Health Check Data (건강검진 데이터), Medical History Data (진료내역 데이터)


93 of 128

Reproduction of 2019 Strategies

  • Open data provided by the NHIS (National Health Insurance Service, 국민건강보험공단)
  • The 2019 strategies were re-tested on a larger dataset


| Dataset | Metric | PyCon KR 2019 | PyCon KR 2023 |
|---|---|---|---|
| Health Check Data (건강검진 데이터) | Range | 3y (2015 ~ 2017) | 19y (2002 ~ 2020) |
| Health Check Data | Rows | 3 M | 19 M |
| Health Check Data | Size | 0.3 GB | 1.9 GB |
| Medical History Data (진료내역 데이터) | Range | 3y (2015 ~ 2017) | 19y (2002 ~ 2020) |
| Medical History Data | Rows | 40 M | 185 M |
| Medical History Data | Size | 3.1 GB | 14 GB |

Data Source1: https://www.data.go.kr/data/15007115/fileData.do

Data Source2: https://www.data.go.kr/data/15007122/fileData.do


94 of 128

Comparison between Pandas Versions

  • PyCon KR 2019 → Pandas 0.24.2
  • PyCon KR 2023 → Pandas 2.0.3, + [pyarrow]



95 of 128

I. Memory Optimization

  • Purpose: to read and manipulate large datasets efficiently
  • Testing
    • Specifying data types to reduce time and memory usage
      • Default: read and manipulate with default settings
      • Codebook: replace string values with fixed-size int dtypes using a codebook
      • Dtypes: specify the data type for each column when reading a file
    • Comparison between file formats: csv, pickle, feather, parquet
      • Time to write/read a file
      • Disk size of the saved file


96 of 128

[Appendix] How to specify data types


Codebook

Dtypes
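The two labeled approaches might look like this; a sketch with made-up column names and values (the actual NHIS columns differ):

```python
import io
import pandas as pd

# Tiny stand-in for the NHIS CSV; real column names differ.
csv = io.StringIO("SEX,SIDO,HEIGHT\nM,Seoul,170\nF,Busan,162\n")

# Dtypes approach: declare a dtype per column at read time
df = pd.read_csv(csv, dtype={"SIDO": "category", "HEIGHT": "int16"})

# Codebook approach: replace string values with fixed-size integer codes
codebook = {"M": 0, "F": 1}
df["SEX_CODE"] = df["SEX"].map(codebook).astype("int8")

print(df.dtypes)
```

Both approaches shrink per-column memory compared with letting pandas default to int64/object.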


97 of 128

1-1. Time to read a CSV file

Reading a CSV file with default settings


98 of 128

1-2. Memory usage of pd.DataFrame object


99 of 128

1-3. Results of specifying data types

Pandas 2.0.3 [pyarrow] shows significant improvements across all settings


| Setting | Metric | Pandas 0.24.2 | Pandas 2.0.3 | Pandas 2.0.3 [pyarrow] | Speed-up |
|---|---|---|---|---|---|
| Default | time | 33.4 s | 28.2 s | 2.4 s | 13.9x |
| Default | increment memory | 8,847 Mb | 8,847 Mb | 6,741 Mb | 1.3x |
| Default | memory usage | 10,443 Mb | 10,443 Mb | 3,393 Mb | 3.1x |
| Codebook | time | 15.4 s | 23.6 s | 12.2 s | 1.2x |
| Codebook | increment memory | 3,668 Mb | 2,796 Mb | 755 Mb | 4.9x |
| Codebook | memory usage | 10,443 Mb | 8,006 Mb | 2,847 Mb | 3.7x |
| Dtypes | time | 34.7 s | 26 s | 5.5 s | 6.3x |
| Dtypes | increment memory | 821 Mb | 2,263 Mb | 4,347 Mb | 0.2x |
| Dtypes | memory usage | 3,393 Mb | 1,063 Mb | 1,092 Mb | 3.1x |


100 of 128

[Appendix] New feature: dtype_backend


NumPy

PyArrow
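A sketch of the `dtype_backend` parameter (added in pandas 2.0); the CSV content is made up, and the PyArrow variant is shown only as a comment since it requires the pyarrow package:

```python
import io
import pandas as pd

data = "a,b\n1,x\n,y\n"  # note the missing value in column "a"

# NumPy-backed nullable dtypes (no extra dependency)
df_np = pd.read_csv(io.StringIO(data), dtype_backend="numpy_nullable")
print(df_np.dtypes)  # "a" becomes Int64, which supports pd.NA

# PyArrow-backed dtypes (requires the pyarrow package):
# df_pa = pd.read_csv(io.StringIO(data), dtype_backend="pyarrow")
# print(df_pa.dtypes)
```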


101 of 128

2-1. Time to write a file

csv <<<< parquet < pickle, feather


102 of 128

2-2. Disk size of the saved file

pickle < csv << feather << parquet


103 of 128

2-3. Time to read a file

csv <<<< pickle < feather, parquet


104 of 128


105 of 128

II. Performance Enhancement

  • Purpose: to enhance performance when manipulating and analyzing data
  • Testing
    • Methods to score health check rows (19M rows)
      • Iteration: pd.DataFrame.iterrows(), pd.DataFrame.apply()
      • Vectorization: pd.Series, np.array, pa.array
    • Comparison of operations that return the same result (19M rows)
      • sort_values().head() vs. nlargest()
    • Comparison of operations that extract a subset satisfying specific conditions (185M rows)
      • Conditionally matching 940 people's medical history out of 185M rows


106 of 128

[Appendix] Scoring Health Check Data

  • To test the performance of arithmetic operations
  • Focused on execution time and increment memory


107 of 128

1-1. Time to score health check


108 of 128

1-2. Increment Memory while scoring


109 of 128

1-3. Result of Performance by methods

Vectorization is still much faster than Iteration


| Method | Example Code | Time |
|---|---|---|
| Iteration with pd.DataFrame.iterrows() | `scores = [scoring_health(row) for _, row in df.iterrows()]` | 12 min 18 s |
| Iteration with pd.DataFrame.apply() | `scores = df.apply(scoring_health, axis=1)` | 6 min 38 s |
| Vectorization with pd.Series | `scores = scoring_health(df)` | 1.54 s |
| Vectorization with np.array | `scores = scoring_health_np(df)` | 1.50 s |
| Vectorization with pa.array | `scores = scoring_health_pa(df)` | 1.29 s |

570x Speed up


110 of 128

2-1. Time to get same result

sort_values().head() <<< nlargest()
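Both calls return the same rows; a quick equivalence sketch (column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"id": np.arange(1_000), "score": rng.random(1_000)})

top_sort = df.sort_values("score", ascending=False).head(5)
top_nl = df.nlargest(5, "score")   # avoids fully sorting all rows

assert top_sort["id"].tolist() == top_nl["id"].tolist()
```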


111 of 128

2-2. Increment Memory while getting result

Pandas 2.0.3 is more memory-efficient than the previous version


112 of 128

[Appendix] Extracting Subset Specific Conditions

  • Compare methods that extract the same subset of data satisfying specific conditions
  • Methods to extract same subset
    • List Comprehension
    • pd.DataFrame.apply
    • pd.DataFrame.isin
    • pd.DataFrame.query
    • pd.DataFrame.merge
    • np.isin
    • pa.compute.is_in
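A small sketch showing that several of the listed methods return the same subset; `PEOPLE` and `IDV_ID` mirror the slide's names, but the data here is synthetic:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"IDV_ID": np.arange(10_000) % 1_000,
                   "COST": np.arange(10_000)})
PEOPLE = [1, 7, 42]

by_isin = df[df["IDV_ID"].isin(PEOPLE)]
by_query = df.query("IDV_ID in @PEOPLE")
by_np = df[np.isin(df["IDV_ID"].to_numpy(), PEOPLE)]

assert by_isin.equals(by_query) and by_isin.equals(by_np)
```

All three produce identical results; only the machinery doing the membership test differs, which is what the benchmark measures.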


113 of 128

3-1. Extracting same subset of Medical History

  • Extracting a subset of 940 people's medical history from 185M rows


114 of 128

3-2. Extracting same subset


| Method | Example Code | Pandas 0.24.2 | Pandas 2.0.3 |
|---|---|---|---|
| List Comprehension | `subset = df[[x in PEOPLE for x in df["IDV_ID"]]]` | 30min 1s | 26min 29s |
| pd.DataFrame.apply | `subset = df[df["IDV_ID"].apply(lambda x: x in PEOPLE)]` | 136min 33s | 26min 27s |
| pd.DataFrame.isin | `subset = df[df.isin({"IDV_ID": PEOPLE})]` | 49s | 37.9s |
| pd.DataFrame.query | `subset = df.query("IDV_ID in @PEOPLE")` | 24.5s | 6.0s |
| pd.DataFrame.merge | `subset = df.merge(pd.Series(PEOPLE, name="IDV_ID"), how="inner", on="IDV_ID")` | 7.86s | 12.9s |
| np.isin | `subset = df[np.isin(df["IDV_ID"].values, PEOPLE)]` | 21.2s | 1.9s |
| pa.compute.is_in | `subset = df[pa.compute.is_in(pa.array(df["IDV_ID"]), value_set=pa.array(PEOPLE)).to_pandas()]` | - | 2.9s |

836x Speed up


115 of 128

3-3. Extracting same subset

  • pd.DataFrame.merge was the fastest method in Pandas 0.24.2
  • Overall improvement on all methods in Pandas 2.0.3


116 of 128

3-4. Increment Memory while extraction

Pandas’ built-in operations became memory-stable in version 2.0.3


117 of 128

III. Method Chaining

  • A pattern that allows methods to be chained by returning an object


Conventional approach:

    jack_jill = JackAndJill()
    on_hill = went_up(jack_jill, 'hill')
    with_water = fetch(on_hill, 'water')
    fallen = fell_down(with_water, 'jack')
    broken = broke(fallen, 'crown')
    after = tumble_after(broken, 'jill')

Method Chaining:

    jack_jill = JackAndJill()
    after = (jack_jill
        .went_up("hill")
        .fetch("water")
        .fell_down("jack")
        .broke("crown")
        .tumble_after("jill")
    )


118 of 128

Sample Code to Return Mean Scores by Groups

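A sketch of what such a "mean scores by groups" pipeline might look like, first step-by-step and then as one chain; the columns and grouping are illustrative, not the talk's actual code:

```python
import pandas as pd

df = pd.DataFrame({"sex": ["M", "F", "M", "F"],
                   "age": [34, 41, 57, 62],
                   "score": [3, 1, 4, 2]})

# Step-by-step with intermediate variables
adults = df[df["age"] >= 40]
adults = adults.assign(age_group=(adults["age"] // 10) * 10)
means = adults.groupby(["sex", "age_group"])["score"].mean()

# The same pipeline as a single method chain
means_chained = (
    df.query("age >= 40")
      .assign(age_group=lambda d: (d["age"] // 10) * 10)
      .groupby(["sex", "age_group"])["score"]
      .mean()
)

assert means.equals(means_chained)
```

The chained form avoids the intermediate variables, which is what makes it easier for CoW to drop temporaries early.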


119 of 128

Applying Method Chaining


120 of 128

Method Chaining is Faster & More Memory-Efficient

After applying method chaining,

  • The whole pipeline runs 9x faster
  • Peak and increment memory are significantly lower and more stable


121 of 128

Review - Reproduction of 2019 Strategies

  • Most strategies are still applicable in Pandas 2
  • Overall performance of pandas built-in methods has improved
  • With CoW and the Arrow integration, method chaining becomes memory-stable and easy to debug
  • Significant performance improvements with small API changes


122 of 128

If Data Size Larger than Local Hard Disk?


123 of 128

Hugging Face Datasets

  • Library for easily accessing and sharing datasets
    • Audio, Computer Vision, and Natural Language Processing tasks
    • Tabular datasets are also supported
  • Load a dataset in a single line of code
  • Backed by the Apache Arrow format: process large datasets with zero-copy reads, without memory constraints, for optimal speed and efficiency
  • Integrated with the Hugging Face Hub, making it easy to load and share datasets


Sources: https://huggingface.co/docs/datasets/index


124 of 128

Health Check Data on the Hugging Face Hub


125 of 128

Stream & Process Partial Data


126 of 128


127 of 128

Next Match Up: Pandas vs. Polars


128 of 128

Thank you

for your time and consideration
