Data Engineering Applications in Online Advertising
Peter Sujan, Senior Software Engineer, Microsoft
December 5, 2024
Data 101, Fall 2024 @ UC Berkeley
Lisa Yan, Michael Ball https://data101.org/fa24/
1
LECTURE 27
Join at slido.com�#bing
ⓘ
Click Present with Slido or install our Chrome extension to display joining instructions for participants while presenting.
Background/why am I here?
Education
Research
Background/why am I here?
Met Michael (Prof. Ball) as CS10 lab assistants, Spring 2012
Roommates during Master’s, 2015-2016
Career
Online/Search Advertising
Bing Ads (later -> Microsoft Ads)
Anatomy of an Ad Campaign
Lifetime of an ad in a request
Lifetime of an ad in a request
Auctions and Auto-bidding
Manual Bidding - Advertiser Specified Bids
Managing Performance
Aside: Conversions
Managing Performance
Managing Performance
Automated Bidding
Automated Bidding
Ads System Design
Scale of the Problem
Data Pipeline Design
Offline Batch Pipelines
Near-real-time/streaming Pipelines
Real-time Request Processing
Platform Comparison
System design - Push Model
Offline Batch Pipelines
ADLA – Job optimization
ADLA U-SQL Example
A = SELECT CampaignId, Clicks FROM (SSTREAM “/home/Clicks.ss”);
B = SELECT CampaignId, Spend FROM (SSTREAM “/home/Spend.ss”);
Joined =
SELECT A.CampaignId, A.Clicks, B.Spend
FROM A
INNER JOIN B
ON A.CampaignId == B.CampaignId;
OUTPUT Joined TO SSTREAM “/home/JoinedResult.ss”
CLUSTERED BY CampaignId; // essentially like a key
ADLA - Extensions
FinalResult = SELECT CampaignId,
Utils.ComputeCPC(Spend, Clicks) AS CostPerClick
FROM Joined;
#CS
public static class Utils
{
public static double ComputeCPC(double spend, double clicks)
{
return clicks > 0 ? spend / clicks : 0;
}
}
#ENDCS
ADLA - Reducers
Reducer Example – Curve Generation
Raw Counterfactual Data
Request ID | CampaignId | Bid | Spend | Conversions |
abc123 | 12345 | $0.50 | $0.40 | 0 |
abc123 | 12345 | $1.00 | $0.60 | 1 |
abc123 | 56789 | $0.25 | $0.00 | 0 |
abc123 | 56789 | $0.50 | $0.25 | 0 |
xyz789 | 12345 | $0.60 | $0.00 | 0 |
xyz789 | 12345 | $1.00 | $0.75 | 0 |
xyz789 | 56789 | $0.25 | $0.25 | 1 |
xyz789 | 56789 | $0.50 | $0.45 | 1 |
Reduced Per-campaign Curve
CampaignId | Bid | Spend | Conversions |
12345 | $0.60 | $0.40 | 0 |
12345 | $1.00 | $1.15 | 1 |
56789 | $0.25 | $0.25 | 1 |
56789 | $0.50 | $0.70 | 1 |
Reducer Example Code
ReducedCurve = REDUCE RawCounterfactualData
ON CampaignId
PRODUCE CampaignId, Bid, Spend, Conversions
USING CurveReducer;
Reducer Example Code
ReducedCurve = REDUCE RawCounterfactualData // Input data
ON CampaignId // Grouping key
PRODUCE CampaignId, Bid, Spend, Conversions
USING CurveReducer;
Reducer Example - group
Request ID | CampaignId | Bid | Spend | Conversions |
abc123 | 12345 | $0.50 | $0.40 | 0 |
abc123 | 12345 | $1.00 | $0.60 | 1 |
abc123 | 56789 | $0.25 | $0.00 | 0 |
abc123 | 56789 | $0.50 | $0.25 | 0 |
xyz789 | 12345 | $0.60 | $0.00 | 0 |
xyz789 | 12345 | $1.00 | $0.75 | 0 |
xyz789 | 56789 | $0.25 | $0.25 | 1 |
xyz789 | 56789 | $0.50 | $0.45 | 1 |
Reducer Example - shuffle
Request ID | CampaignId | Bid | Spend | Conversions |
abc123 | 12345 | $0.50 | $0.40 | 0 |
abc123 | 12345 | $1.00 | $0.60 | 1 |
xyz789 | 12345 | $0.60 | $0.00 | 0 |
xyz789 | 12345 | $1.00 | $0.75 | 0 |
abc123 | 56789 | $0.25 | $0.00 | 0 |
abc123 | 56789 | $0.50 | $0.25 | 0 |
xyz789 | 56789 | $0.25 | $0.25 | 1 |
xyz789 | 56789 | $0.50 | $0.45 | 1 |
Reduced Example - result
CampaignId | Bid | Spend | Conversions |
12345 | $0.60 | $0.40 | 0 |
12345 | $1.00 | $1.15 | 1 |
56789 | $0.25 | $0.25 | 1 |
56789 | $0.50 | $0.70 | 1 |
Reducer Challenges
Near-real-time Processing
NRT Pipeline Design
NRT Pipeline Example
Key-value Data Example
{
DailyBudget: $100,
BidStrategy: ”TargetCPA”
TargetCPA: $5.00
}
Bonus: data serialization
Data Serialization: Example Protobuf Schema
message CampaignMetadata
{
optional double DailyBudget = 1;
optional string BidStrategy = 2; // Typically an enum
optional double TargetCPA = 3;
…
}
Protobuf: Example Code
// Raw bytes from a file or data store lookup
byte[] bytes = ‘…’;
CampaignMetadata metadata =
CampaignMetadata.parseFrom(bytes);
System.out.println(“Campaign budget: ” + metadata.DailyBudget);
Last Thoughts
Thank You!
48
Concluding Remarks
December 5, 2024
Data 101, Fall 2024 @ UC Berkeley
Lisa Yan, Michael Ball https://data101.org/fa24/
49
LECTURE 27
Announcements
This week’s lecture attendance�https://edstem.org/us/courses/63937/discussion/5816846
Final Exam Logistics�https://edstem.org/us/courses/63937/discussion/5838155
50
Class Journey
SQL review
Advanced SQL queries (views, subqueries, window functions, …)
DML, DDL, Referential integrity, index selection, performance tuning
Data transformation and preparation,�Data wrangling and cleaning
Non-relational data models (Tensors, Spreadsheets, etc.), semistructured data (and mongoDB)
Relational Model and Algebra, ER and normalization, Spreadsheets, Transactions, BI and OLAP, parallel computing, security and privacy, data pipelines, …
51
Project 0
Project 1
Project 2
Project 3
Project 4
Homework Assignments
+ Optional Final Project
+ guest lecturers!
Lisa
Class Journey
SQL review
Advanced SQL queries (views, subqueries, window functions, …)
DML, DDL, Referential integrity, index selection, performance tuning
Data transformation and preparation,�Data wrangling and cleaning
Non-relational data models (Tensors, Spreadsheets, etc.), semistructured data (and mongoDB)
Relational Model and Algebra, ER and normalization, Spreadsheets, Transactions, BI and OLAP, parallel computing, security and privacy, data pipelines, …
52
In a “classical database course”, you’d probably only learn ⅓ of these topics!
Lisa
Classical Database Ideas → Data Engineering Perspective
Relational Database Management Systems
Managing the Database Management System:
53
Declarative querying and SQL are “intergalactic dataspeak” for all sorts of relational databases!
Lisa
Different Data Models and Query Languages
Structured data models and ways to transform between them
Flexible data representation formats with pros/cons
JSON, spreadsheets, CSVs, …
Considerations:
54
Data’s representation matters for what you’re trying to accomplish!
Map new representation formats/query languages to/from the relational model
Lisa
Real life is messy
Data Lake
Data Preparation /
Integration
Data Warehouse
Raw Data
Transactions
Sensors
Log Files
Experiments
Source of Truth
Governed
Secure
Audited�Managed
Use-Case-Specific
Fit for purpose
Self-Service
Data Discovery & Assessment
Data Quality�& Integrity
Metadata�Store
Michael
Revisiting Data Science Lifecycle: Transformation, Wrangling
Practical data engineering lessons:
56
Data engineers have to deal with “E” “L” and “T” in all sorts of permutations!
Michael
Revisiting Data Science Lifecycle: Exploration
Throughout: the benefits of interaction and visualization
57
Data engineers have to deal with “E” “L” and “T” in all sorts of permutations!
Michael
Relational Model Extensions/Specializations
What to do when…
58
In all cases, data manipulation requires incremental maintenance!
Michael
Class Journey
Relational Model and Algebra
Advanced SQL queries (views, subqueries, window functions, …)
DML, DDL
Referential integrity, index selection, performance tuning
Data transformation and preparation
Data wrangling and cleaning
Non-relational data models (Tensors, Spreadsheets, etc.)
Semistructured data (and mongoDB)
Transactions, ER and normalization, \�BI and OLAP, parallel computing, data pipelines, …
59
(note: topics are grouped by project and not to scale with + not in order of the class schedule)
…What did we not cover?
Michael
(what we didn’t cover)
From the classical side:
From the non-classical side:
60
Michael
(what we didn’t cover)
Data Engineering often requires SWE skills.
Raw intelligence and skill alone are not enough to get promoted.
Try it on your own even if not doing it for a grade.
But don’t just pick Yelp, Postgres, and Mongo. Explore!
61
Michael
(what we didn’t cover - different DBs)
There are many many different types of databases and data tools…
Skim the (very long) AWS / Azure / Google Cloud Product offerings
62
Michael
Where can you go from here?
CS186/286a: Database internals (narrower + deeper)
CS286b: Research-oriented database internals class
CS298-12: Data systems and foundations seminar
CS61C: Intro to systems (open to Data students in Summer 2025)
CS169A: Software Engineering (open in Summer 2025)
…and also, no better way to learn more about data engineering than by building data systems from scratch! Or contribute to existing open-source efforts! (or try the final project, or see many links in slides, …)
Happy to talk about research if you’re interested…!!!!
63
This is an amazing time to be studying data systems and data engineering!
Lisa
What we hope you took away from this course
An appreciation for conceptual and practical challenges underlying data engineering
An understanding for design decisions, capturing issues ranging from ease-of-use to scalability to flexibility
Hands-on experience with and conceptual understanding of different types of data systems
64
Data governs a lot of decisions around us!�Infrastructure to support analysis on data is going be increasingly essential.
This is an amazing time to be studying data systems and data engineering!
Lisa
65
…finally…
Thank You!
Thank you to our wonderful, wonderful course staff for making Data 101 Fall 2024 happen!!
Lisa
Lisa
67
…finally finally…
Thank YOU!!
December 5, 2024
Data 101, Fall 2024 @ UC Berkeley
Lisa Yan, Michael Ball https://data101.org/fa24/
68
LECTURE 27
Please fill out our two course evaluations, both internal and campus!
This week’s lecture attendance�https://edstem.org/us/courses/63937/discussion/5816846