2 of 128

Speaker

오성우 Sungwoo Oh

링크드인 linked.in/sackoh
깃헙 github.com/sackoh

PyCon Korea 2019 - 뚱뚱하고 굼뜬 판다(Pandas)를 위한 효과적인 다이어트 전략

PyCon Korea 2020 - 금융 언어 이해를 위해 개발 ALBERT 톺아보기

PyCon Korea 2023 - 로컬 환경에서 사이즈가 큰 데이터를 효과적으로 처리/분석하기 위한 전략

오픈소스와 공유의 힘을 믿는 ,,,

�“All about AI and Python”

PyAI Symposium 2025

3 of 128

Contents

Introduction

Timeline of Pandas

What’s New in Pandas 2.0 & Features

Validation of 2019’s Efficient Pandas Strategies

I. Memory Optimization

II. Performance Enhancement

III. Method Chaining

If Data Size Larger than Local Hard Disk?

Hugging Face Datasets

�“All about AI and Python”

PyAI Symposium 2025

4 of 128

Do we really need a framework for agents? 🤔

LLM ecosystem today

Dozens of orchestration frameworks
New “agent platforms” every month

But many experts say:

“Just start with the raw API”
“Frameworks are optional, not mandatory”

�“All about AI and Python”

PyAI Symposium 2025

5 of 128

Anthropic: “Start simple. Use direct API calls first”

Suggested Approach

Begin with basic loops or scripts
Use raw outputs (JSON / tool responses) directly)�Avoid unnecessary abstraction early on
Measure what actually hurts
Only then consider a framework

Core idea:

Simple first, complexity layer
Frameworks are optional, not a prerequisite

Introduction

Sources: https://www.anthropic.com/engineering/building-effective-agents

�“All about AI and Python”

PyAI Symposium 2025

6 of 128

Anthropic: “Start simple. Use direct API calls first”

“Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.”

...

“We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what's under the hood are a common source of customer error.”

Sources: https://www.anthropic.com/engineering/building-effective-agents

�“All about AI and Python”

PyAI Symposium 2025

7 of 128

Other voices: Karpathy, Hamel, …

Andrej Karpathy

“LLM apps = prompt + loop + tools”
Prefer transparent, debuggable code early on

Hamel Husain & Others

Ship v0 with minimal orchestration
Focus on logging, prompts, evals > fancy graphs

Shared message:

Don’t start with heavy abstractions
Own the control flow yourself first
Add frameworks only when pain is proven

Reference 추가

�“All about AI and Python”

PyAI Symposium 2025

8 of 128

Framework? Just use raw Python conde…

Common anti-abstraction claims

“Why use LangGraph? Just write a few lines of Python on top of the LLM API.”
“More abstraction → less control”
“We’ll get locked into the framework”

These concerns are reasonable
Let’s apply the same logic somewhere else…

�“All about AI and Python”

PyAI Symposium 2025

9 of 128

Why we don’t hand-roll LLM serving?

We use vLLM not because we’re incapable
We use it because we can’t spend all our time on:

KV cache, batching, scheduling
Glud code for every new open-source model
Model rollout / rollback
Monitoring, logging, autoscaling

We delegate infra complexity to a serving framework
This let us focus on:

“What should we serve to users?”
Not “How do we reinvent a serving stack?”

�“All about AI and Python”

PyAI Symposium 2025

10 of 128

Same logics can apply to Agent Framework

Agent systems have their own “infra”:

Multi-agent orchestration
State management & checkpoints
Retry, pause/resume
HITL, tool calling, long-running workflows
Observability, debugging, testing, deployment

Saying “we’ll code all of this from scratch” is like saying:

“Don’t use vLLM, build your entire serving stack yourself.”

Where do we want to spend our limited engineering time?

Frameworks don’t remove control
They let us delegate low-leverage complexity so we can focus on product & value

�“All about AI and Python”

PyAI Symposium 2025

11 of 128

Small Teams: Not Desperate, Strategic

👤 Common narrative:

“We’re a small team, we have no choice but to use frameworks.”

📍 My claim:

Small teams are not making desperate choices
They are making strategic choices

Key questions:

With limited people & time
Where do we intentionally invest our engineering effort?
What do we consciously delegate to a framework?

�“All about AI and Python”

PyAI Symposium 2025

12 of 128

그림 그리는 중

�“All about AI and Python”

PyAI Symposium 2025

13 of 128

The stack of Doom for Small Agent Teams ☠️

Prompt & Agent Design

Per-agent / per-model prompt tuning
System prompts, roles, tools schema

Workflow & Orchestration

Complex branching logic
State management across steps
Error handling & retries

RAG & Data Layer

Vector DB / indices
Retrieval quality, ranking
Data refresh & governance

Tooling & Integration

MCP servers / tool APIs
Internal microservices as tools
Auth, rate limits, quotas

Ops & Platform

Deployment, CI/CD
Monitoring, logging, tracing
Cost & quota management

�“All about AI and Python”

PyAI Symposium 2025

14 of 128

3-4 Engineers, 3 Months: What’s Realistic?

Small-team reality check 🧮

Team size: 3-4 engineers
TImeline: ~3 months to MVP/launch

Key questions:

Can we implement the entire Stack of Doom in raw Python?
Can we reliably maintain it as requirements & models change?
Can we handle prod incidents, regressions, and feature requests on top?

Strategic view:

Using frameworks is not a sign of weakness
Concentrating effort on differentiating layers
Delegating generic plumbing to a runtime/framework�→ With this in mind, let’s see where LangGraph fits into this stack.

�“All about AI and Python”

PyAI Symposium 2025

15 of 128

Why we chose LangGraph (Not settled for it)

Strategic lens for LangGraph

1️⃣ Business - Focus on what the agent does
2️⃣ Financial - Custom agents can be cheaper
3️⃣ Ops - Agents must be safe and resilient

Not “we’re small, we have no choice”
But: “Given our constraints, what’s the smartest stack?”

�“All about AI and Python”

PyAI Symposium 2025

16 of 128

Business First: What should the agent actually do?

Business-first mindset 💼

Less: “How do we implement multi-agent orchestration?”
More: “what exact workflow should this agent execute?”

With LangGraph, we could spend more time on

Task decomposition & domain logic
UX of failure / clarification / handoff
Guardrails aligned with business rules

The framework handled

Wiring between nodes
State passing, retries, branching�→ We optimized for user value, not graph plumbing

�“All about AI and Python”

PyAI Symposium 2025

17 of 128

Costs Matter: When custom agents are actually cheaper

General-purpose agents are convenient… but 💸

Claude Code, generic “do-everything” agents
Great DX, but:

Longer reasoning chains
Morel tool calls & loops
Unpredictable token usage

Custom agent on LangGraph:

Narrow, task-specific flows
Hard limits on: depth of reasoning, number of tool invocations
Predictable + lower per-run cost

Trade off:

Higher dev cost upfront, but Lower operational cost at scale�→ For us, custom agents were the cheaper choice over time

�“All about AI and Python”

PyAI Symposium 2025

18 of 128

Ops view: Agents that don’t die or go rogue

Ops reality: shipping =! done 🚨

Real work starts after deployment
We care about

Not just “does it run?”
BUt “does it recover?” and “does it avoid bad actions?”

Operational needs for agents

State & checkpointing
Automatic retries & backoff
Pause / resume long-running workflows
Safe rollbacks on failures

LangGraph gave us A runtime build around these features and hooks for observability

→ This made “don’t die, don’t go rogue” a design property, not an afterthought

�“All about AI and Python”

PyAI Symposium 2025

19 of 128

Recap: LangGraph as a Strategic Bet

Using an agent framework like LangGraph is

We want to optimize across…

Business - focus on behavior & value
Finance - control run-time cost with custom flows
Ops - build agents that don’t die or go rogue

Lang Graph act as a

Controlled runtime for multi-agent workflows
Platform for survivability features (state, retries, HITL)
→ A strategic decision balancing product, cost, and reliability

�“All about AI and Python”

PyAI Symposium 2025

20 of 128

Designing “Hard to Kill” multi-agent with LangGraph

This talk is not

A LangGraph 101 tutorial
A feature-by-feature walkthrough

This talk is

A field report on designing resilient multi-agent systems
On top of LangGraph, in real production environments

�“All about AI and Python”

PyAI Symposium 2025

21 of 128

랭그래프 1.0 릴리즈 캡처

�“All about AI and Python”

PyAI Symposium 2025

22 of 128

LangGraph 1.0: Perfect Timing

Timing ⏲️

While mapping our “Stack of Doom”
LangGraph 1.0 was officially released

This raised a key question:

“Is this still a demo tool?”
Or “is this ready for production?”

�“All about AI and Python”

PyAI Symposium 2025

23 of 128

LangChain vs LangGraph: Who does What?

Use LangChain to shape the agent, Use LangGraph to keep the agent alive

Framework	Details
LangChain	Linear flow Built on top of LangGraph Standard agent patterns Tool calling & model wrappers Great for: “build something quickly”
LangGraph	Graph-based flow State-based graph runtime Long-running workflows Checkpoints, interrupt/resume Multi/hierarchical agent orchestration

Sources: https://medium.com/data-science-collective/langchain-vs-langgraph-simple-comparison-9798d8c8a95c

�“All about AI and Python”

PyAI Symposium 2025

24 of 128

LangGraph 1.0: From Toy to Real Runtime

v1.0 felt like a phase shift

From: “Cool research / demo tool”
To: “Potential production runtime”

Why it mattered to us

Stability for long-running workflows
Durable (parallel) execution + checkpointing
Practical interrupt / resume
Tight integration with LangChain & LangSmith + other Lang-ecosystem

“Can we run production on this?”

�“All about AI and Python”

PyAI Symposium 2025

25 of 128

What changed in 1.0

Sources: https://docs.langchain.com/oss/python/langgraph/overview

�“All about AI and Python”

PyAI Symposium 2025

26 of 128

What changed in 1.0 (That we actually care about)

Key aspects we cared about in 1.0 ✅

Long-running workflow stability
Durable execution & checkpoints
Interrupt / resume at practical level
Full integration with 🦜⛓️LangChain Runnables
🪢 Langfuse integration (+ LangSmith) for observability
Self-hosted agent deployment

More “real orchestration engine for prod”

�“All about AI and Python”

PyAI Symposium 2025

27 of 128

LangGraph Application Structure

📌

Sources: https://docs.langchain.com/oss/python/langgraph/overview

�“All about AI and Python”

PyAI Symposium 2025

28 of 128

간지

�“All about AI and Python”

PyAI Symposium 2025

29 of 128

Revisited multi-agent system stack

�“All about AI and Python”

PyAI Symposium 2025

30 of 128

① LLMs - multiple sources by design

Cloud providers, or LLM Providers

Azure + OpenAI
AWS Bedrock + Claude
GCP Vertexai + Gemini

On-prem OSS serving

E.g., gpt-oss-120b, Qwen3, …

Local / fine-tuned models

Task-specific fine-tuned small LLMs

Assumption: Agents will route across heterogeneous LLM backends

�“All about AI and Python”

PyAI Symposium 2025

31 of 128

The real issue behind multi-model strategy

Our LLM question was not:

“How can we use many models?”

Our real question was:

“How do we make sure the system doesn’t die when one model dies?”
“How can we shift the model for the recent model without any workflow code changes?”

Multi-model reality:

Azure OpenAI, AWS Bedrock, GCP Gemini
On-prem OSS
Local fine-tuned LLMs

Goal: Swap / fail models without taking the graph down

�“All about AI and Python”

PyAI Symposium 2025

32 of 128

Why we didn’t hand-code every provider?

Why not implement per-model logic?

Different SDKs
Different parameters
Different error types

→ Managing all of this per-agent = 💀

Approach

Wrap everything behind LangChain’s `BaseChatModel`
Serve OSS via vLLM OpenAI-compatible API

→ Swapping models = config change, not code surgery

�“All about AI and Python”

PyAI Symposium 2025

33 of 128

In Code: Unifying Models with `BaseChatModel`

<1.0 코드 방식

1.0에서의 불러오는 방식 각각 코드로 보여주기

�“All about AI and Python”

PyAI Symposium 2025

34 of 128

Keeping the graph alive with `.with_fallbacks()`

Principle

Never expose raw provider errors to the user
Always return some controlled response

Fallback chain example

Primary: Azure OpenAI

→ on failure

Fallback1: AWS Bedrock Claude

→ on failure

Fallback2: GPT=OSS via vLLM

→ on failure

Final: safe “service unavailable” message

�“All about AI and Python”

PyAI Symposium 2025

35 of 128

Coding the Fallback Chain, Not Try-Except Hell

LangGraph nodes stay simple; the resilience lives in the model layer

With_fallbacks 예시 코드

�“All about AI and Python”

PyAI Symposium 2025

36 of 128

② Context - everything around the LLM

Tools

Weather, exchange rates, calculator, …
Custom internal tools

Databases

Vector DBs for retrieval / RAG
Legacy DMBS for transactional data

External APIs

Third-party services
Partner/internal microservices

In LangGraph: Nodes = “when/how to call which tool with which state”

�“All about AI and Python”

PyAI Symposium 2025

37 of 128

Context Engineering: Feeding LLMs the Right Context

Agents need carefully chosen context to act effectively.

Sources: https://blog.langchain.com/context-engineering-for-agents/

�“All about AI and Python”

PyAI Symposium 2025

38 of 128

Context Engineering: Feeding LLMs the Right Context

Context engineering is the art and science of filling an LLM’s limited context window with just the right information at each step.

Sources: https://blog.langchain.com/context-engineering-for-agents/

�“All about AI and Python”

PyAI Symposium 2025

39 of 128

Context as the Agent’s shield

Context = not just “extra prompt text”
It’s a defensive layer for agent survival

Pre-context layer: Clean & sanitize input before LLM
Execution context layer: Keep model/tool calls from exploding
Post-context layer: Validate & adapt output for the product

We focus mostly on the pre-context layer

�“All about AI and Python”

PyAI Symposium 2025

40 of 128

3-Layer Context Middleware (Pre / Exec / Post)

Layer	Operation
Pre-context layer (before LLM)	Input cleaning & normalization Length control (summaries / trims) PII removal Intent classification / safety filter Context shaping (decide if RAG is needed)
Execution context layer (during calls)	Limit model call count LImit tool calls & recursion Timeouts / circuit breakers Fallback model routing HITL intercepts
Post-context layer (after LLM)	Output normalization Safety filtering Quality threshold → redirect / degrade Summarized or downgraded responses

�“All about AI and Python”

PyAI Symposium 2025

41 of 128

LangChain’s Agent Loop

Middleware provides to handle pre, exec, post context layers seamlessly

Sources: https://docs.langchain.com/oss/python/langchain/middleware/overview#the-agent-loop

�“All about AI and Python”

PyAI Symposium 2025

42 of 128

Pre-Context I: Shrink the input, Save the run

Problem

Long chat history + raw tool/DB dumps → token explosion
Context length errors and failed runs

Strategy in LangGraph

Use `pre_model_hook` (now middleware in 1.0)
Insert summary / trim nodes before LLM calls

Preserve the minimal context needed for the current decision
Token limit errors are handled at the system level, not left to chance

�“All about AI and Python”

PyAI Symposium 2025

43 of 128

Pre-Context I: Shrink the input, Save the run

Problem

Long chat history + raw tool/DB dumps → token explosion
Context length errors and failed runs

Strategy in LangGraph

Use `pre_model_hook` (now middleware in 1.0)
Insert summary / trim nodes before LLM calls

Preserve the minimal context needed for the current decision
Token limit errors are handled at the system level, not left to chance

�“All about AI and Python”

PyAI Symposium 2025

44 of 128

Pre-Context I: Shrink the input, Save the run

Sources: https://community.openai.com/t/this-models-maximum-context-length-is-8193-tokens-does-not-make-sense/288627

�“All about AI and Python”

PyAI Symposium 2025

45 of 128

Pre-Context I: Shrink the input, Save the run

�“All about AI and Python”

PyAI Symposium 2025

46 of 128

Pre-Context I: Shrink the input, Save the run

예시 코드

�“All about AI and Python”

PyAI Symposium 2025

47 of 128

Pre-Context II: Mask PII at the door, Not after the fact

PII examples

Phone numbers, credit card numbers
Email addresses, national IDs

Design choice

Strip / mask PII before LLM sees anything
Implement as a dedicated PII-masking node at the graph entry

This is about security & compliance, not prompts
Post-incident, we can confidently say:�“No raw PII ever entered the LLM context”

�“All about AI and Python”

PyAI Symposium 2025

48 of 128

Pre-Context II: Mask PII at the door, Not after the fact

In practice

Regex + validators in a LangGraph node
Replace with place holders (<|PHONE|>, <|CARD|>, …)
Optional mapping table stored on a secure side-channel if needed

�“All about AI and Python”

PyAI Symposium 2025

49 of 128

Pre-Context II: Mask PII at the door, Not after the fact

In practice

Regex + validators in a LangGraph node
Replace with place holders (<|PHONE|>, <|CARD|>, …)
Optional mapping table stored on a secure side-channel if needed

위에 대한 예시 코드

�“All about AI and Python”

PyAI Symposium 2025

50 of 128

Pre-Context III: RAG only when needed

Native RAG pattern

Query → retrieve docs → send all docs + query to LLM

Problems in production

Unnecessary retrieval cost for simple queries
Irrelevant docs polluting the prompt
Higher latency & token usage

Context shaping layer

Pre-LLM router decides: direct LLM vs RAG, which domain retriever to use, how many docs
Based on query type, intent, and risk

�“All about AI and Python”

PyAI Symposium 2025

51 of 128

③ Workflow - where LangGraph shines

Prompts

System / task prompts per agent

Source code

Routing logic, transformations, validation, …

State

Conversation state, intermediate results

Authorization / policies

Who can trigger what, with which tools

This is modeled as a LangGraph:

Nodes = agents / tools / control logic
Edges = transitions based on state & outcomes

�“All about AI and Python”

PyAI Symposium 2025

52 of 128

How LangGraph keeps agents alive at runtime

Agent-level survival features in LangGraph

Recursion limit → stop infinite loops
Errors as state, not exceptions
Multi-level timeouts
Interrupts / breakpoints

Combined goal: infinite loop and infinite waiting

�“All about AI and Python”

PyAI Symposium 2025

53 of 128

Recursion Limit: Cutting infinite loops with recursion limits

Common failure mode: infinite loops

Agent A → Agent B → Agent A → …
Misplanned “try again” loops that never end

LangGraph’s help

Pregel-style engine with recursion limit
Stop execution after N graph steps / hops

LLM can still make bad plans

But the runtime prevents bad plans from killing the system

�“All about AI and Python”

PyAI Symposium 2025

54 of 128

Recursion Limit: Cutting infinite loops with recursion limits

위에 대한 예시 코드

�“All about AI and Python”

PyAI Symposium 2025

55 of 128

Errors as State, Not Exceptions

Traditional view: Error = throw exception → 500
Our graph view: Error = part of state
State schema example:

status: “ok” | “error” | “degraded”
error: error code / message
attempts: number of retries

Errors are handled inside the graph, instead of killing the entire service

�“All about AI and Python”

PyAI Symposium 2025

56 of 128

Timeouts: Don’t let one slow node freeze the graph

Multi-level timeouts

Model call timeout
Tool call timeout
Overall graph execution timeout

Per-node timeouts, e.g.

Retrieval + justification node: up to 5s
Simple reformatting node: abort after 2s, then fallback

On timeout

Route to summarized / partial answer node, user-facing apology + background queue node
Users rarely see “I can’t find anything” / “no result” non-answers

�“All about AI and Python”

PyAI Symposium 2025

57 of 128

Stopping both infinite loops and infinite waiting

Defending against two big enemies:

1️⃣ Infinite loops

2️⃣ Infinite waiting

LangGraph survival combo:

recursion_limit → cap steps / loops

timeouts → cap waiting per node / graph

interrupts → inject humans at critical points

errors-as-state → avoid global 500s

Together, these turn LangGraph from:

“a nice way to draw graphs”

into “a runtime that keeps your multi-agent system alive in production”

작업 필요;?

�“All about AI and Python”

PyAI Symposium 2025

58 of 128

Wait… We were already using safety features

safety feature were used by default
Built-in features

Default retries (max_retries)
Exponential backoff
Fail-safes model calls
…

�“All about AI and Python”

PyAI Symposium 2025

59 of 128

④ Ops & Assets - keeping agents alive in prod

Deployment, CI/CD
Monitoring, logging, alerting
Cost & quota control

Assets

Config, secrets
Prompt & model versions
Policy definitions

LangGraph’s role here

Exposes traces, checkpoints, execution logs
Makes the workflow observable by external Ops tooling

�“All about AI and Python”

PyAI Symposium 2025

60 of 128

Observability: You can’t fix what you can’t see

�“All about AI and Python”

PyAI Symposium 2025

61 of 128

�“All about AI and Python”

PyAI Symposium 2025

62 of 128

Limitations of Pandas

High memory usage

Performance degradation

Low scalability

Memory mapping issue

Garbage collection

Not thread safe

Introduction

�“All about AI and Python”

PyAI Symposium 2025

63 of 128

High Memory Usage

Object overhead

Every objects has additional metadata about data type, index and extra.
All python objects are python objects

Data type handling

Data in a more memory intensive way

Indexing

Introduction

�“All about AI and Python”

PyAI Symposium 2025

64 of 128

Degraded Performance

Single-threaded operations

Python GIL(Global Interpreter Lock) underlies this issue

Inefficient data types

Default backend data type is Numpy , not designed for Pandas DataFrame
Performance dependent to the factors of Numpy

Overhead of python objects

Before operations, type-checking or data correctness cause overhead

Lack of optimization

Unexpected object assignment and copy cause memory overflow
Operations especially allowing inplace parameter show this issue

Introduction

�“All about AI and Python”

PyAI Symposium 2025

65 of 128

Low Scalability

Designed for single-machine
No built-in distributed computing support
In-memory requirement

Introduction

�“All about AI and Python”

PyAI Symposium 2025

66 of 128

Other Issues

Pandas memory mapping issue
Not collected garbage
Not thread safe

Introduction

�“All about AI and Python”

PyAI Symposium 2025

67 of 128

How to scale & Alternatives to Pandas

Buy more RAM…….
Ray, Dask, Modin
Numba, cuDF
Polars

Introduction

�“All about AI and Python”

PyAI Symposium 2025

68 of 128

Pandas Ecosystem

Sources: https://se.ewi.tudelft.nl/desosa2019/chapters/pandas/

Introduction

�“All about AI and Python”

PyAI Symposium 2025

69 of 128

Frequently Asked Questions about Pandas (1)

Q: Why does my machine slow down when I try to apply a function across large Pandas DataFrame?

Q: Can pandas handle data streaming for real-time data processing on large datasets?

Q: Why is saving large DataFrame to a CSV file using Pandas so slow?

Introduction

�“All about AI and Python”

PyAI Symposium 2025

70 of 128

Frequently Asked Questions about Pandas (2)

Q: I tried to replace all the NaN values in large DataFrame using the fillna() function in Pandas and it crashed. How can handle this?

Q: How can I handle very very large dataset even not being saved on local hard disk with Pandas? (ex. over 10 TeraByte or 1 PetaByte)

Introduction

�“All about AI and Python”

PyAI Symposium 2025

71 of 128

Pandas 2.0 Released

Timeline of Pandas

�“All about AI and Python”

PyAI Symposium 2025

72 of 128

2008-2009 Beginning

Created by Wes McKinney while working at AQR Capital Management in 2008
Motivated from programming R data frame
Designed to handle time-series and structured data in Python
Released initial version as BSD license in January 2009
Received significant contributions from open-source community

Timeline of Pandas

�“All about AI and Python”

PyAI Symposium 2025

73 of 128

2012 Growth with Book

Published the book ⌜Python for Data Analysis⌟
Complement to the lack of Pandas documentation
Well-written, easy to understand with a wealth of practical examples and exercises

Sources: www.amazon.com

Timeline of Pandas

�“All about AI and Python”

PyAI Symposium 2025

74 of 128

2013 Integration with Scientific Ecosystem

Became a core component of the scientific Python ecosystem with other libraries such as NumPy, SciPy, and Matplotlib
Solidified the position with the increased attention on Machine Learning and the library Scikit-learn

Timeline of Pandas

�“All about AI and Python”

PyAI Symposium 2025

75 of 128

2014 Wes Mckinney Disappear

2017, Blog “10 Things I Hate About pandas ”
2016, GitHub repo. Pandas2
2015, Blog “Joining Forces for an �Arrow-Native Future”

Sources: https://github.com/pandas-dev/pandas/graphs/contributors

Timeline of Pandas

�“All about AI and Python”

PyAI Symposium 2025

76 of 128

2019 PyCon KR 2019: Session speaking

Propose three strategies to use Pandas more memory-efficient and performance-enhancing ways

Timeline of Pandas

�“All about AI and Python”

PyAI Symposium 2025

77 of 128

2020 Pandas 1.0 Release

I remember it was a version deployment with a lot of symbolism
Became influential in the data science field and important to data processing/analysis tools
Widely used in both academia and industry

Timeline of Pandas

�“All about AI and Python”

PyAI Symposium 2025

78 of 128

2023 Pandas 2.0 Release

Unlike Pandas 1.0, added a lot of performance and functionalities improvement without hurting user experience
Increased integration with Apache Arrow in many features and areas since first integration with Pandas 1.5

Timeline of Pandas

�“All about AI and Python”

PyAI Symposium 2025

79 of 128

What’s New in Pandas 2.0 & Features

Enhanced Performance and Memory Efficiency
Improved Support for Time-Series Data
Introduction to Nullable Data Types (ref. upcasting issue)
CoW Improvements
Apache Arrow Integration

What’s New in Pandas 2.0

�“All about AI and Python”

PyAI Symposium 2025

80 of 128

CoW Improvement

Resource management technique that shares resources but creates new resources only when a copy has been modified, without the need to create new resources
In a single-thread, Copy & Write are performed separately
In a multi-threaded, Copy & Write are performed simultaneously
Example: Storing snapshots of data

What’s New in Pandas 2.0

�“All about AI and Python”

PyAI Symposium 2025

81 of 128

Copy-on-Write

Process1

Process2

Physical

Memory

Page A

Page B

Page C

Sources: https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/9_VirtualMemory.html

What’s New in Pandas 2.0

�“All about AI and Python”

PyAI Symposium 2025

82 of 128

Copy-on-Write

Process1

Process2

Physical

Memory

Page A

Page B

Page C

Copy of Page C

Sources: https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/9_VirtualMemory.html

What’s New in Pandas 2.0

�“All about AI and Python”

PyAI Symposium 2025

83 of 128

CoW in Pandas 2 & 3

The technique mitigates the arbitrary alteration or deletion of resources in Pandas DataFrame
Through integration with PyArrow, it’s expected to become more robust
This is also why the use of Chained Assignment is recommended�(+ Python’s more precise error messaging is beneficial)

What’s New in Pandas 2.0

�“All about AI and Python”

PyAI Symposium 2025

84 of 128

Official Documentation about CoW in Pandas

Introduced in version 1.5.0, Copy-on-Write (CoW) now supports most possible optimizations since version 2.0.
CoW is likely to be default from version 3.0.
By preventing multi-object updates in single statements, CoW provides more predictable results and eliminates side effects like indexing. Furthermore, by delaying copies, it enhances performance and memory usage.

Sources: https://pandas.pydata.org/docs/user_guide/copy_on_write.html

What’s New in Pandas 2.0

�“All about AI and Python”

PyAI Symposium 2025

85 of 128

Example of CoW in Pandas

Sources: https://pandas.pydata.org/docs/user_guide/copy_on_write.html

What’s New in Pandas 2.0

�“All about AI and Python”

PyAI Symposium 2025

86 of 128

PyArrow Functionalities in Pandas 2.0

More extensive data types compared to NumPy
Missing data support (NA) for all data types
Performant IO reader integration
Facilitate interoperability with other dataframe libraries based on the Apache Arrow specification (e.g. polars, cuDF)

Sources: https://pandas.pydata.org/docs/user_guide/pyarrow.html

What’s New in Pandas 2.0

�“All about AI and Python”

PyAI Symposium 2025

87 of 128

Apache Arrow

Cross-language development framework for in-memory data

Language-independent columnar memory format
Organized for efficient analytic operations on CPUs and GPUs
Supports zero-copy reads for lightning-fast data access without serialization overhead
Available for C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust

Sources: https://arrow.apache.org

What’s New in Pandas 2.0

�“All about AI and Python”

PyAI Symposium 2025

88 of 128

Pandas with PyArrow

Enhances memory efficiency and performance
Performs faster computations and Manages larger datasets

Sources: https://pandas.pydata.org/docs/user_guide/pyarrow.html

What’s New in Pandas 2.0

�“All about AI and Python”

PyAI Symposium 2025

89 of 128

Working with text data in Pandas 2.0

Recommended to use string rather than object dtype for text data

Sources: https://pandas.pydata.org/docs/user_guide/text.html

What’s New in Pandas 2.0

�“All about AI and Python”

PyAI Symposium 2025

90 of 128

Experiment on Pandas 2.0

Experiment by Marc Garcia who is main contributor of Pandas

Operation	Time with NumPy	Time with Arrow	Speed up
read parquet (50Mb)	141 ms	87 ms	1.6x
mean (int64)	2.03 ms	1.11 ms	1.8x
Mean (float64)	3.56 ms	1.73 ms	2.1x
endswith (string)	471 ms	14.9 ms	31.6x

Sources: https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i

What’s New in Pandas 2.0

�“All about AI and Python”

PyAI Symposium 2025

91 of 128

Improved on Text, but on Others?

Be cautious to migrate to Pandas 2.0 from legacy
Not all operations and data types are fully supported and integrated with PyArrow
Unexpected Errors might occur

Sources: https://medium.com/@santiagobasulto/pandas-2-0-performance-comparison-3f56b4719f58

What’s New in Pandas 2.0

�“All about AI and Python”

PyAI Symposium 2025

92 of 128

Recap of 2019’s Efficient Pandas Strategies [Link]

건강검진 데이터

진료내역 데이터

1-2�Convert Dtypes

1-3�Category type

1-4�IO File Format

2-2�Efficient Algorithm

2-1�Vectorization

2-3�Beyond apply()

3-1 Method Chaining

3-2 .inplace parameter

3-3 Deprecations

전략3. Adopting Conventions

전략1. Memory Optimization

전략2. Performance Enhancement

1-1�Coding

Validation of 2019’s Efficient Pandas Strategies

�“All about AI and Python”

PyAI Symposium 2025

93 of 128

Reproduction of 2019 Strategies

Open Data provided by NHIS(국민건강보험공단)
2019 Strategies were tested on Larger dataset

		PyCon KR 2019	PyCon KR 2023
Health Check Data�건강검진 데이터	Range	3y (2015 ~ 2017)	19y (2002 ~ 2020)
	Rows	3 M	19 M
	Size	0.3 GB	1.9 GB
Medical History Data�진료내역 데이터	Range	3y (2015 ~ 2017)	19y (2002 ~ 2020)
	Rows	40 M	185 M
	Size	3.1 GB	14 GB

Data Source1: https://www.data.go.kr/data/15007115/fileData.do

Data Source2: https://www.data.go.kr/data/15007122/fileData.do

Validation of 2019’s Efficient Pandas Strategies

�“All about AI and Python”

PyAI Symposium 2025

94 of 128

Comparison between Pandas Versions

PyCon KR 2019 ⇥ Pandas 0.24.2
PyCon KR 2023 ⇥ Pandas 2.0.3, + [pyarrow]

Pandas 0.24.2

Pandas 2.0.3

Pandas 2.0.3 [pyarrow]

Validation of 2019’s Efficient Pandas Strategies

�“All about AI and Python”

PyAI Symposium 2025

95 of 128

I. Memory Optimization

Purpose

: to read and manipulate large dataset efficiently

Testing
Specifying data types to reduce time and memory usage
Default: read and manipulate with default setting
Codebook: replace string values with fixed-size int dtype using codebook
Dtypes: specify data type for each column when reading a file
Comparison between file formats - csv, pickle, feather, parquet
Time to write/read a file
Disk size of the saved file

I. Memory Optimization

�“All about AI and Python”

PyAI Symposium 2025

96 of 128

[Appendix] How to specify data types

Codebook

Dtypes

I. Memory Optimization

�“All about AI and Python”

PyAI Symposium 2025

97 of 128

1-1. Time to read a CSV file

Read a csv file on Default setting

I. Memory Optimization

�“All about AI and Python”

PyAI Symposium 2025

98 of 128

1-2. Memory usage of pd.DataFrame object

I. Memory Optimization

�“All about AI and Python”

PyAI Symposium 2025

99 of 128

1-3. Results of specifying data types

Pandas 2.0.3 [pyarrow] shows significant improvement on overall settings

Setting	Metric	Pandas 0.24.2	Pandas 2.0.3	Pandas 2.0.3 [pyarrow]	Speed-up
Default	time	33.4 s	28.2 s	2.4 s	13.9x
	increment memory	8,847 Mb	8,847 Mb	6,741 Mb	1.3x
	memory usage	10,443 Mb	10,443 Mb	3,393 Mb	3.1x
Codebook	time	15.4 s	23.6 s	12.2 s	1.2x
	increment memory	3,668 Mb	2,796 Mb	755 Mb	4.9x
	memory usage	10,443 Mb	8,006 Mb	2,847 Mb	3.7x
Dtypes	time	34.7 s	26 s	5.5 s	6.3x
	increment memory	821 Mb	2,263 Mb	4,347 Mb	0.2x
	memory usage	3,393 Mb	1,063 Mb	1,092 Mb	3.1x

I. Memory Optimization

�“All about AI and Python”

PyAI Symposium 2025

100 of 128

[Appendix] New feature: dtype_backend

100

NumPy

PyArrow

I. Memory Optimization

�“All about AI and Python”

PyAI Symposium 2025

101 of 128

2-1. Time to write a file

csv <<<< parquet < pickle, feather

101

I. Memory Optimization

�“All about AI and Python”

PyAI Symposium 2025

102 of 128

2-2. Disk size of the saved file

pickle < csv << feather << parquet

102

I. Memory Optimization

�“All about AI and Python”

PyAI Symposium 2025

103 of 128

2-3. Time to read a file

csv <<<< pickle < feather, parquet

103

I. Memory Optimization

�“All about AI and Python”

PyAI Symposium 2025

104 of 128

[Appendix]

104

I. Memory Optimization

�“All about AI and Python”

PyAI Symposium 2025

105 of 128

II. Performance Enhancement

Purpose

: to enhance performance on manipulating and analyzing data

Testing
Methods to score health check rows (19M rows)
Iteration: pd.DataFrame.iterrows(), pd.DataFrame.apply()
Vectorization: pd.Series, np.array, pa.array
Comparison of operations to get same result (19M rows)
sort_values().head() VS. nlargest()
Comparison of operations to extract subset satisfied specific conditions (185M rows)
Conditional matching 940 people’s medical history from 185M rows

105

II. Performance Enhancement

�“All about AI and Python”

PyAI Symposium 2025

106 of 128

[Appendix] Scoring Health Check Data

To test performance of Arithmetic operation
Focused on the performance of time and increment memory

106

II. Performance Enhancement

�“All about AI and Python”

PyAI Symposium 2025

107 of 128

1-1. Time to score health check

107

II. Performance Enhancement

�“All about AI and Python”

PyAI Symposium 2025

108 of 128

1-2. Increment Memory while scoring

108

II. Performance Enhancement

�“All about AI and Python”

PyAI Symposium 2025

109 of 128

1-3. Result of Performance by methods

Vectorization is still much faster than Iteration

109

Methods	Example Code	Time
Iteration with� pd.DataFrame.iterrows()	scores = [] for _, row in df.iterrows(): scores.append(scoring_health(row))	12 min 18 s
Iteration with� pd.DataFrame.apply()	scores = df.apply(scoring_health, axis=1)	6 min 38 s
Vectorization with� pd.Series	scores = scoring_health(df)	1.54 s
Vectorization with� np.array	scores = scoring_health_np(df)	1.50 s
Vectorization with� pa.array	scores = scoring_health_pa(df)	1.29 s

570x Speed up

II. Performance Enhancement

�“All about AI and Python”

PyAI Symposium 2025

110 of 128

2-1. Time to get same result

sort_values().head() <<< nlargest()

110

II. Performance Enhancement

�“All about AI and Python”

PyAI Symposium 2025

111 of 128

2-2. Increment Memory while getting result

Pandas 2.0.3 is memory-efficient than the previous one

111

II. Performance Enhancement

�“All about AI and Python”

PyAI Symposium 2025

112 of 128

[Appendix] Extracting Subset Specific Conditions

Compare methods to extract same subset data which satisfied specific conditions
Methods to extract same subset

List Comprehension
pd.DataFrame.apply
pd.DataFrame.isin
pd.DataFrame.query
pd.DataFrame.merge
np.isin
pa.compute.is_in

112

II. Performance Enhancement

�“All about AI and Python”

PyAI Symposium 2025

113 of 128

3-1. Extracting same subset of Medical History

Extracting subset of 940 people’s medical history from 185M rows

113

II. Performance Enhancement

�“All about AI and Python”

PyAI Symposium 2025

114 of 128

3-2. Extracting same subset

114

Methods	Example Code	Pandas�0.24.2	Pandas�2.0.3
List Comprehension	subset = df[[x in PEOPLE for x in df[“IDV_ID”]]]	30min 1s	26min 29s
pd.DataFrame.apply	subset = df[df[“IDV_ID”].apply(lambda x: x in PEOPLE)]	136min 33s	26min 27s
pd.DataFrame.isin	subset = df[df.isin({“IDV_ID”: PEOPLE})]	49s	37.9s
pd.DataFrame.query	subset = df.query(“IDV_ID in @PEOPLE”)	24.5s	6.0s
pd.DataFrame.merge	subset = df.merge(pd.Series(PEOPLE, name=”IDV_ID”), � how=’inner’, on=”IDV_ID”)	7.86s	12.9s
np.isin	subset = df[np.isin(df[“IDV_ID”].values, PEOPLE)]	21.2s	1.9s
pa.compute.is_in	subset = df[pa.compute.is_in(pa.array(df[“IDV_ID”]), � pa.array(PEOPLE)).to_pandas()]	-	2.9s

836x Speed up

II. Performance Enhancement

�“All about AI and Python”

PyAI Symposium 2025

115 of 128

3-3. Extracting same subset

pd.DataFrame.merge was the fastest method in Pandas 0.24.2
Overall improvement on all methods in Pandas 2.0.3

115

II. Performance Enhancement

�“All about AI and Python”

PyAI Symposium 2025

116 of 128

3-4. Increment Memory while extraction

Pandas’ built-in operations became memory-stable in version 2.0.3

116

II. Performance Enhancement

�“All about AI and Python”

PyAI Symposium 2025

117 of 128

III. Method Chaining

A pattern that allows methods to be chained by returning an object

117

jack_jill = JackAndJill()

on_hill = went_up(jack_jill, 'hill')

with_water = fetch(on_hill, 'water')

fallen = fell_down(with_water, jack')

broken = broke(fallen, 'jack')

after = tmple_after(broken, 'jill')

일반적인 방법

Method Chaining

jack_jill = JackAndJill()

after = (jack_jill

.went_up("hill")

.fetch("water")

.fell_down("jack")

.broke("crown")

.tumble_after("jill")

)

III. Method Chaining

�“All about AI and Python”

PyAI Symposium 2025

118 of 128

Sample Code to Return Mean Scores by Groups

118

III. Method Chaining

�“All about AI and Python”

PyAI Symposium 2025

119 of 128

Applying Method Chaining

119

III. Method Chaining

�“All about AI and Python”

PyAI Symposium 2025

120 of 128

Chained Assignment is More Faster & Efficient

After applying method chaining,

Time to operate all methods is 9x faster
Peak Memory & Increment memory are significantly lower and stable

120

III. Method Chaining

�“All about AI and Python”

PyAI Symposium 2025

121 of 128

Review - Reproduction of 2019 Strategies

Most strategies are still feasible in Pandas 2
More performance-enhanced to overall Pandas built-in methods
With CoW & Arrow Integration, the Method Chaining is getting memory-stable and easily debuggable
Significant performance improvements with Small API changes

121

III. Method Chaining

�“All about AI and Python”

PyAI Symposium 2025

122 of 128

If Data Size Larger than Local Hard Disk?

122

�“All about AI and Python”

PyAI Symposium 2025

123 of 128

Hugging Face Datasets

Library for easily accessing and sharing datasets

Audio, Computer Vision and Natural Language Processing tasks
also supports Tabular dataset

Load a dataset in a single line of code
Backed by Apache Arrow format, process large datasets with zero-copy reads without memory constraints for optimal speed and efficiency
With the Hugging Face Hub, allowing easily load & share a dataset

123

Sources: https://huggingface.co/docs/datasets/index

�“All about AI and Python”

PyAI Symposium 2025