1 of 68

Data Engineering Applications in Online Advertising

Peter Sujan, Senior Software Engineer, Microsoft

December 5, 2024

Data 101, Fall 2024 @ UC Berkeley

Lisa Yan, Michael Ball https://data101.org/fa24/


LECTURE 27

2 of 68

Join at slido.com #bing


3 of 68

Background/why am I here?

4 of 68

Education

  • Attended Berkeley 2011-2016 (B.A., M.A. Statistics)
  • Math -> Statistics/CS
  • TA:
    • CS10
    • Stat 133
    • DS8 (second semester ever!!)

5 of 68

Research

6 of 68

Background/why am I here?

Met Michael (Prof. Ball) as CS10 lab assistants, Spring 2012

Roommates during Master’s, 2015-2016

7 of 68

Career

  • Microsoft – 8 years
  • Hired as software engineer, hybrid stats/CS background
  • Ads Engineering org

8 of 68

Online/Search Advertising

9 of 68

Bing Ads (later -> Microsoft Ads)

10 of 68

Anatomy of an Ad Campaign

  • Campaign – a group of keywords and ads
  • CampaignId: 54321 (unique ID)
  • Budget: $100 per day
  • Keywords (exact or fuzzy match)
    • “hawaii vacation”
    • “tahiti hotels”
  • Ad(s):
    • Title: “Flights to Honolulu, Hawaii | Compare Prices by Dates”
    • Body: “Compare Cheap Flights to Honolulu, Hawaii. View deals …”
    • Landing page: “https://www.expedia.com/...”

11 of 68

Lifetime of an ad in a request

  • Selection (lighter weight models to select many potentially relevant ads)
  • Relevance (run more expensive models to filter out some less relevant or lower quality ads)
  • Click prediction (run models that predict probability of a click, probability of a conversion, etc)
  • Automated bidding and auction to select best few ads
  • Surviving ads get impressed (i.e., shown to the user)
  • Some may be clicked
  • Some may lead to a conversion

12 of 68

Lifetime of an ad in a request

  • Selection (lighter weight models to select many potentially relevant ads) – natural language processing, text embeddings
  • Relevance (run more expensive models to filter out some less relevant or lower quality ads) - natural language processing, text embeddings, human or AI judgements/labeling
  • Click prediction (run models that predict probability of a click, probability of a conversion, etc) – regression, tree models
  • Automated bidding and auction to select best few ads – optimization (convex, linear programming), game theory, economics

13 of 68

Auctions and Auto-bidding

14 of 68

Manual Bidding - Advertiser Specified Bids

  • Keywords
    • “hawaii vacation” - $0.50 per click
    • “tahiti hotels” - $1.00 per click
  • Search advertising uses a second-price auction – meaning that the winner pays the bid of the next highest competitor
  • Cost-per-click will always be <= bid
  • Advertiser needs to manage their own performance!
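A minimal sketch of second-price pricing (not the production auction, which also weighs bids by predicted click probability and ad quality; the bid values are hypothetical):

```python
def second_price_cpc(bids):
    """Winner is the highest bidder; they pay the runner-up's bid.
    Simplified sketch: real ad auctions also factor in predicted
    click probability and ad quality."""
    if len(bids) < 2:
        return None, None  # no competing bid to price against
    ordered = sorted(bids, reverse=True)
    return ordered[0], ordered[1]  # (winning bid, cost per click)

# Hypothetical bids competing for "hawaii vacation"
winning_bid, cpc = second_price_cpc([0.50, 0.35, 0.20])
print(winning_bid, cpc)  # 0.5 0.35 -- cost per click <= winning bid
```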

15 of 68

Managing Performance

  • “hawaii vacation” - $0.50 bid
    • 50 clicks per day
    • Avg. CPC (cost per click) – $0.40
    • $20 spend per day
    • 10 conversions per day
    • Cost per acquisition (conversion) - $2.00
  • “tahiti hotels” - $1.00 bid
    • 100 clicks per day
    • Avg. CPC (cost per click) – $0.80
    • $80 spend per day
    • 10 conversions per day
    • Cost per acquisition (conversion) - $8.00
  • Overall CPA: $100 / 20 = $5.00 per conversion
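The metrics above are related by simple arithmetic; a quick sketch using the numbers from this slide:

```python
def performance(clicks, avg_cpc, conversions):
    """Spend and cost-per-acquisition implied by clicks, CPC, and conversions."""
    spend = clicks * avg_cpc
    return spend, spend / conversions

hawaii_spend, hawaii_cpa = performance(50, 0.40, 10)   # $20/day spend, $2.00 CPA
tahiti_spend, tahiti_cpa = performance(100, 0.80, 10)  # $80/day spend, $8.00 CPA
overall_cpa = (hawaii_spend + tahiti_spend) / (10 + 10)
print(round(overall_cpa, 2))  # 5.0
```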

16 of 68

Aside: Conversions

  • Advertiser-reported events that happen after user clicks on an ad and navigates to the advertiser’s website
  • Either reported in real time (JavaScript or backend code), or batch-reported later
  • Can represent things like:
    • Navigating to other pages in the website
    • User sign-ups/creating an account
    • Adding products to shopping cart
    • Checking out/placing order
  • Typically this is when an advertiser actually realizes revenue
  • Challenging to predict/model!


18 of 68

Managing Performance

  • “hawaii vacation” - $0.50 bid -> too low
    • 50 clicks per day
    • Avg. CPC (cost per click) – $0.40
    • $20 spend per day
    • 10 conversions per day
    • Cost per acquisition (conversion) - $2.00 ☺
  • “tahiti hotels” - $1.00 bid -> too high
    • 100 clicks per day
    • Avg. CPC (cost per click) – $0.80
    • $80 spend per day
    • 10 conversions per day
    • Cost per acquisition (conversion) - $8.00 ☹
  • Overall CPA: $100 / 20 = $5.00 per conversion

19 of 68

Automated Bidding

  • Example: TargetCPA
  • Objective is to maximize conversions
  • Constraint: CPA (Spend / Conversions) <= Advertiser specified target
  • TargetCPA: $4.00 per conversion
  • Budget: $100 per day
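Conceptually, TargetCPA picks the bid that maximizes predicted conversions subject to the CPA and budget constraints. A minimal sketch, with a hypothetical per-campaign curve of (bid, predicted spend, predicted conversions) points:

```python
def pick_bid(curve, target_cpa, daily_budget):
    """curve: list of (bid, predicted_spend, predicted_conversions) points.
    Choose the bid maximizing conversions subject to
    spend / conversions <= target_cpa and spend <= daily_budget.
    Hypothetical sketch, not the production optimizer."""
    feasible = [(bid, spend, conv) for bid, spend, conv in curve
                if conv > 0 and spend / conv <= target_cpa
                and spend <= daily_budget]
    if not feasible:
        return None
    return max(feasible, key=lambda point: point[2])[0]

curve = [(0.25, 25.0, 5), (0.60, 75.0, 30), (1.00, 120.0, 35)]
print(pick_bid(curve, target_cpa=4.00, daily_budget=100.0))  # 0.6
```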

20 of 68

Automated Bidding

  • “hawaii vacation” – auto-bidding sets $0.60 bid
    • 150 clicks per day
    • Avg. CPC (cost per click) – $0.50
    • $75 spend per day
    • 30 conversions per day
    • Cost per acquisition (conversion) - $2.50 ☺
  • “tahiti hotels” - auto-bidding sets $0.60 bid
    • 50 clicks per day
    • Avg. CPC (cost per click) – $0.50
    • $25 spend per day
    • 5 conversions per day
    • Cost per acquisition (conversion) - $5.00 – better!
  • Overall CPA: $100 / 35 ≈ $2.86 per conversion

21 of 68

Ads System Design

22 of 68

Scale of the Problem

  • Millions of campaigns and ads/keywords in the system
  • Around 200k QPS (queries per second) across all data centers
  • 30 million clicks, $25m revenue daily - annual revenue $10+ billion
  • Entire request (Selection -> Relevance -> Click Prediction -> Auto-bidding -> Auction) needs to run in around 500 ms
  • Not all computation can be run at request-time! Some models are simply too expensive.

23 of 68

Data Pipeline Design

Offline Batch Pipelines

Near-real-time/streaming Pipelines

Real-time Request Processing

24 of 68

Platform Comparison

  • Batch pipelines:
    • Typically slower, but allow us to run data-intensive models/aggregations
    • Used for processing information at longer time-scales, such as logs of all requests/clicks/spend in the system for an entire hour or day.
  • Near-real-time pipelines:
    • Can handle less data, but are able to respond within minutes to changes in the system
    • Example: streaming updates when advertiser changes their settings like daily Budget
  • Real-time serving system:
    • Very time-constrained, so should look up pre-prepared data produced by other layers whenever possible

25 of 68

System design - Push Model

  • Very common to employ a “push model” for data transfer
  • Data is generated at each layer and made available in some kind of data store in the next layer through independent data rollouts
  • Once data set is registered (“publisher”), various consumers can register as “subscribers” to receive new batches of data once generated
  • Example subscriber: various regional data centers for serving requests may subscribe to data from offline batch and NRT pipelines
  • Note: emphasis is more on availability than consistency (stale modeling data is often OK)
    • Some exceptions, like budget-enforcement components

26 of 68

Offline Batch Pipelines

  • Primary platform: Azure Data Lake Analytics
    • (actually a Microsoft internal version called Cosmos/Scope)
  • Scripts written in a language called U-SQL, which allows chaining of SQL-like expressions
  • Language is extensible using both C# and Python
  • Data lives in Azure Data Lake Storage, which is a kind of distributed file system
  • Data in “structured stream” format can be thought of as similar to a SQL table (includes schema of columns with their types)

27 of 68

ADLA – Job optimization

  • Script goes through compilation/optimization step where a distributed job plan is created
  • Computation is distributed amongst many workers
  • Uses properties of input data such as partitioning
  • Heuristics based on input data size, JOIN properties, etc.
  • Does typical SQL optimizations like predicate pushdown
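Predicate pushdown can be illustrated in plain Python: filtering each input before the join touches far fewer rows than joining first and filtering after, and produces the same result (rows here are hypothetical):

```python
clicks = [(12345, 10), (56789, 20), (99999, 30)]   # (CampaignId, Clicks)
spend = [(12345, 40.0), (56789, 80.0)]             # (CampaignId, Spend)

# Naive plan: join everything first, then apply the filter
joined = [(c_id, c, s) for c_id, c in clicks
          for s_id, s in spend if c_id == s_id]
late = [row for row in joined if row[0] == 12345]

# Pushed-down plan: filter each input first, then join the survivors
clicks_f = [r for r in clicks if r[0] == 12345]
spend_f = [r for r in spend if r[0] == 12345]
early = [(c_id, c, s) for c_id, c in clicks_f
         for s_id, s in spend_f if c_id == s_id]

assert late == early == [(12345, 10, 40.0)]
```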

28 of 68

ADLA U-SQL Example

A = SELECT CampaignId, Clicks FROM (SSTREAM "/home/Clicks.ss");
B = SELECT CampaignId, Spend FROM (SSTREAM "/home/Spend.ss");

Joined =
    SELECT A.CampaignId, A.Clicks, B.Spend
    FROM A
    INNER JOIN B
    ON A.CampaignId == B.CampaignId;

OUTPUT Joined TO SSTREAM "/home/JoinedResult.ss"
CLUSTERED BY CampaignId; // essentially like a key

29 of 68

ADLA - Extensions

FinalResult = SELECT CampaignId,
                     Utils.ComputeCPC(Spend, Clicks) AS CostPerClick
              FROM Joined;

#CS
public static class Utils
{
    public static double ComputeCPC(double spend, double clicks)
    {
        return clicks > 0 ? spend / clicks : 0;
    }
}
#ENDCS

30 of 68

ADLA - Reducers

  • Fancier form of script extension
  • Inspired by map-reduce paradigm
  • Given a certain key column (or set of columns), group rows according to those columns and pass each set of rows into the Reducer
  • Emit one or more rows for each group (does not have to be the same number of rows as were passed in)
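The reduce contract can be sketched in Python (the real platform is U-SQL/Scope, not Python; the toy reducer and rows here are made up):

```python
from itertools import groupby

def reduce_rows(rows, key_fn, reducer):
    """Group rows by key and pass each group to the reducer, which may
    emit any number of output rows per group (not necessarily the same
    number as were passed in)."""
    output = []
    for key, group in groupby(sorted(rows, key=key_fn), key=key_fn):
        output.extend(reducer(key, list(group)))
    return output

# Toy reducer: emit one (CampaignId, total clicks) row per campaign
rows = [(12345, 3), (56789, 1), (12345, 2)]
totals = reduce_rows(rows, key_fn=lambda r: r[0],
                     reducer=lambda k, g: [(k, sum(r[1] for r in g))])
print(totals)  # [(12345, 5), (56789, 1)]
```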

31 of 68

Reducer Example – Curve Generation

  • Take records of every request in a day
  • Simulate performance for each campaign using different bid values on that request
  • Turn those into a single “curve” of bid -> performance (clicks, spend, conversions) for each campaign
  • Use this to pick a best bid to use for the next day

32 of 68

Raw Counterfactual Data

| Request ID | CampaignId | Bid   | Spend | Conversions |
|------------|------------|-------|-------|-------------|
| abc123     | 12345      | $0.50 | $0.40 | 0           |
| abc123     | 12345      | $1.00 | $0.60 | 1           |
| abc123     | 56789      | $0.25 | $0.00 | 0           |
| abc123     | 56789      | $0.50 | $0.25 | 0           |
| xyz789     | 12345      | $0.60 | $0.00 | 0           |
| xyz789     | 12345      | $1.00 | $0.75 | 0           |
| xyz789     | 56789      | $0.25 | $0.25 | 1           |
| xyz789     | 56789      | $0.50 | $0.45 | 1           |

33 of 68

Reduced Per-campaign Curve

| CampaignId | Bid   | Spend | Conversions |
|------------|-------|-------|-------------|
| 12345      | $0.60 | $0.40 | 0           |
| 12345      | $1.00 | $1.15 | 1           |
| 56789      | $0.25 | $0.25 | 1           |
| 56789      | $0.50 | $0.70 | 1           |

34 of 68

Reducer Example Code

ReducedCurve = REDUCE RawCounterfactualData
               ON CampaignId
               PRODUCE CampaignId, Bid, Spend, Conversions
               USING CurveReducer;

35 of 68

Reducer Example Code

ReducedCurve = REDUCE RawCounterfactualData // Input data
               ON CampaignId                // Grouping key
               PRODUCE CampaignId, Bid, Spend, Conversions
               USING CurveReducer;          // Custom reducer class

36 of 68

Reducer Example - group

| Request ID | CampaignId | Bid   | Spend | Conversions |
|------------|------------|-------|-------|-------------|
| abc123     | 12345      | $0.50 | $0.40 | 0           |
| abc123     | 12345      | $1.00 | $0.60 | 1           |
| abc123     | 56789      | $0.25 | $0.00 | 0           |
| abc123     | 56789      | $0.50 | $0.25 | 0           |
| xyz789     | 12345      | $0.60 | $0.00 | 0           |
| xyz789     | 12345      | $1.00 | $0.75 | 0           |
| xyz789     | 56789      | $0.25 | $0.25 | 1           |
| xyz789     | 56789      | $0.50 | $0.45 | 1           |

37 of 68

Reducer Example - shuffle

| Request ID | CampaignId | Bid   | Spend | Conversions |
|------------|------------|-------|-------|-------------|
| abc123     | 12345      | $0.50 | $0.40 | 0           |
| abc123     | 12345      | $1.00 | $0.60 | 1           |
| xyz789     | 12345      | $0.60 | $0.00 | 0           |
| xyz789     | 12345      | $1.00 | $0.75 | 0           |
| abc123     | 56789      | $0.25 | $0.00 | 0           |
| abc123     | 56789      | $0.50 | $0.25 | 0           |
| xyz789     | 56789      | $0.25 | $0.25 | 1           |
| xyz789     | 56789      | $0.50 | $0.45 | 1           |

38 of 68

Reducer Example - result

| CampaignId | Bid   | Spend | Conversions |
|------------|-------|-------|-------------|
| 12345      | $0.60 | $0.40 | 0           |
| 12345      | $1.00 | $1.15 | 1           |
| 56789      | $0.25 | $0.25 | 1           |
| 56789      | $0.50 | $0.70 | 1           |

39 of 68

Reducer Challenges

  • Data skew – too many rows going to single reducer
    • Solution – recursive reducer
    • Allows computation to be done in stages (think divide-and-conquer, like with mergesort)
    • Requires commutativity/associativity of the operations
  • Heavy computation per reducer node
    • Solution (if applicable) - PRESORT rows going into reducer
    • Replace with native ADLA operations if possible
    • Redesign algorithm, split into multiple reducer stages
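The associativity requirement can be seen with a toy example: sums combine safely across stages, while a plain average does not (carry (sum, count) pairs and divide at the end):

```python
# Sums are associative/commutative, so partial results from separate
# reducer stages can be combined safely.
chunks = [[1, 2, 3], [4, 5], [6]]            # rows split across workers
partials = [sum(chunk) for chunk in chunks]  # stage 1: per-worker sums
staged_total = sum(partials)                 # stage 2: combine partials
assert staged_total == sum([1, 2, 3, 4, 5, 6])

# A plain average is NOT associative; carry (sum, count) pairs
# and divide only at the end.
pairs = [(sum(chunk), len(chunk)) for chunk in chunks]
total, count = map(sum, zip(*pairs))
print(total / count)  # 3.5
```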

40 of 68

Near-real-time Processing

41 of 68

NRT Pipeline Design

  • Designed to only process a small set of entities at a given time
  • Emphasis is on low latency (but not quite as low as real-time serving stack)
  • Uses technologies like Apache Kafka for queueing
  • Process in purely streaming fashion or in “mini-batches”
  • Key-value store – very simple lookups (no other filters, aggregation, etc)

42 of 68

NRT Pipeline Example

  • Entity being processed – CampaignId: 12345
  • Receive message that Budget has changed (ex: $100 -> $200)
  • Look up several data stores, using CampaignId as key:
    • Campaign metadata table – containing things like budget, TargetCPA, etc.
    • Offline batch datasets like click/spend/conversion curves
  • Feed into module written in C#, which calls algorithms to re-optimize using these inputs. One API call per entity.

43 of 68

Key-value Data Example

  • Key: { EntityType: Campaign, EntityId: 12345 }
  • Value from campaign metadata store:

{
    DailyBudget: $100,
    BidStrategy: "TargetCPA",
    TargetCPA: $5.00
}

44 of 68

Bonus: data serialization

  • Key-value data is actually just binary data
  • Producer and consumer agree on a “schema” for how that data is to be unpacked and interpreted
  • Libraries for this: Microsoft Bond, Google Protobuf
  • Given a schema, libraries do a pre-compilation step to generate classes in whatever language is desired (C#, Python, Java, …)
  • These provide serialization/deserialization methods – take raw binary data, unpack it into an object which can be used just like any other in the given language
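A rough illustration of the idea using Python's stdlib struct module (the actual Bond/Protobuf wire formats differ; the field layout here is made up for the sketch):

```python
import struct

# Hypothetical fixed schema both producer and consumer agree on:
# double DailyBudget, double TargetCPA, 16-byte BidStrategy string
SCHEMA = "dd16s"

def serialize(daily_budget, target_cpa, bid_strategy):
    # Pack the fields into raw binary; struct null-pads the string to 16 bytes
    return struct.pack(SCHEMA, daily_budget, target_cpa, bid_strategy.encode())

def deserialize(raw):
    # Unpack using the same agreed schema, stripping the null padding
    daily_budget, target_cpa, strategy = struct.unpack(SCHEMA, raw)
    return daily_budget, target_cpa, strategy.rstrip(b"\0").decode()

raw = serialize(100.0, 5.0, "TargetCPA")  # binary blob in the key-value store
print(deserialize(raw))  # (100.0, 5.0, 'TargetCPA')
```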

45 of 68

Data Serialization: Example Protobuf Schema

message CampaignMetadata
{
    optional double DailyBudget = 1;
    optional string BidStrategy = 2; // Typically an enum
    optional double TargetCPA = 3;
}

46 of 68

Protobuf: Example Code

// Raw bytes from a file or data store lookup
byte[] bytes = …;

CampaignMetadata metadata = CampaignMetadata.parseFrom(bytes);

System.out.println("Campaign budget: " + metadata.getDailyBudget());

47 of 68

Last Thoughts

  • Data science = statistics/visualization + domain knowledge
  • Data engineering = data processing + product requirements

  • Thanks for listening and good luck with final exams!

48 of 68

Thank You!


49 of 68

Concluding Remarks


50 of 68

Announcements


51 of 68

Class Journey

SQL review

Advanced SQL queries (views, subqueries, window functions, …)

DML, DDL, Referential integrity, index selection, performance tuning

Data transformation and preparation, Data wrangling and cleaning

Non-relational data models (Tensors, Spreadsheets, etc.), semistructured data (and mongoDB)

Relational Model and Algebra, ER and normalization, Spreadsheets, Transactions, BI and OLAP, parallel computing, security and privacy, data pipelines, …


Project 0

Project 1

Project 2

Project 3

Project 4

Homework Assignments

+ Optional Final Project

+ guest lecturers!


52 of 68

Class Journey



In a “classical database course”, you’d probably only learn ⅓ of these topics!


53 of 68

Classical Database Ideas → Data Engineering Perspective

Relational Database Management Systems

  • Tables and relational algebra are a basis for data-based computation.
  • Joins combine information across multiple relations.
  • Query processing and optimization is automatically handled for you.

Managing the Database Management System:

  • Transactions and the ACID principle support DBMS updates.
  • User knobs to control performance: views, indexes, materialization, normalization, etc.
  • Constraints and normalization help ensure data “looks right” over time


Declarative querying and SQL are “intergalactic dataspeak” for all sorts of relational databases!


54 of 68

Different Data Models and Query Languages

Structured data models and ways to transform between them

Flexible data representation formats with pros/cons

JSON, spreadsheets, CSVs, …

Considerations:

  • flexibility
  • efficiency
  • updatability
  • enforceability
  • querying intuitiveness
  • homogeneity


Data’s representation matters for what you’re trying to accomplish!

Map new representation formats/query languages to/from the relational model


55 of 68

Real life is messy

[Figure: raw data (transactions, sensors, log files, experiments) lands in a Data Lake, the governed, secure, audited/managed source of truth; a Data Preparation / Integration layer (data discovery & assessment, data quality & integrity, metadata store) feeds a Data Warehouse that is use-case-specific, fit for purpose, and self-service.]


56 of 68

Revisiting Data Science Lifecycle: Transformation, Wrangling

  • Data doesn’t stay static in clean well-defined relations.
  • Code-centric and query-centric viewpoints are both valuable.

Practical data engineering lessons:

  • How to get started: data “smells” and getting a sense
  • How to deal with different data types
  • How to get it to the shape you want:
    • transformation
    • aggregation
    • drilling up/rolling down
    • type induction


Data engineers have to deal with “E” “L” and “T” in all sorts of permutations!


57 of 68

Revisiting Data Science Lifecycle: Exploration

  • How to clean things up:
    • getting rid of outliers
    • extrapolation/imputation
    • entity resolution
  • How to get data where it needs to be:
    • (sort of) dataflow/messaging/workflow

Throughout: the benefits of interaction and visualization




58 of 68

Relational Model Extensions/Specializations

What to do when…

  • …your data is way too large
    • How to do approximation correctly (and the perils with respect to joins)
  • …you just want to read
    • OLAP: cubing and the curse of dimensionality
    • Deciding what to materialize so that you can reconstruct everything on demand
    • (sort of) Column stores as a specialization – and their downsides
  • …you want to employ order
    • Window queries – and their use in data preparation and cleaning
  • …you want to use multiple steps of transformations:
    • Give them names! CTAS, views, CTEs and their pros and cons
    • Create a dbt-based data pipeline


In all cases, data manipulation requires incremental maintenance!


59 of 68

Class Journey

Relational Model and Algebra

Advanced SQL queries (views, subqueries, window functions, …)

DML, DDL

Referential integrity, index selection, performance tuning

Data transformation and preparation

Data wrangling and cleaning

Non-relational data models (Tensors, Spreadsheets, etc.)

Semistructured data (and mongoDB)

Transactions, ER and normalization, BI and OLAP, parallel computing, data pipelines, …


(note: topics are grouped by project, not to scale, and not in the order of the class schedule)

…What did we not cover?


60 of 68

(what we didn’t cover)

From the classical side:

  • Database internals stuff
  • Details about query optimization, indexes, …
  • Details about functional dependencies, normalization
  • Various forms of logging and recovery; concurrency control

From the non-classical side:

  • Transaction processing at scale and weaker consistency models
  • Virtual machines, containerization, …
  • Resource management, pricing, cloud computation
  • Deep learning systems



61 of 68

(what we didn’t cover)

Data Engineering often requires SWE skills.

  • Final project is an intro to ‘open-ended’ world
  • Real life is (much) less scaffolded
  • Testing! How do you know if your tools are working?
  • Debugging: always a skill to improve!
  • Reading! Both code, and prose, reports, etc.
  • Writing, Evaluation, Communication
  • Team Work

Raw intelligence and skill alone are not enough to get promoted.

Try it on your own even if not doing it for a grade.

But don’t just pick Yelp, Postgres, and Mongo. Explore!



62 of 68

(what we didn’t cover - different DBs)

There are many many different types of databases and data tools…

  • Many other RDBMSes (MySQL, Oracle, MS SQL Server, DB2)
  • Document Stores (Mongo, Amazon Aurora)
  • Hybrid Systems (Google BigQuery / Spanner)
  • Key Value Stores (Redis)
  • Object / File Stores (AWS S3)
  • Timeseries DBs (Timescale DB)
  • Pub-Sub / Queuing Systems (Kafka, RabbitMQ, AWS SQS)
  • Text-Based Systems (Elasticsearch)
  • OLAP / OLTP Systems
    • AWS Redshift
    • Data Lakes, Data warehouses, “Lake Houses” (sigh…)

Skim the (very long) AWS / Azure / Google Cloud Product offerings



63 of 68

Where can you go from here?

CS186/286a: Database internals (narrower + deeper)

CS286b: Research-oriented database internals class

CS298-12: Data systems and foundations seminar

CS61C: Intro to systems (open to Data students in Summer 2025)

CS169A: Software Engineering (open in Summer 2025)

…and also, no better way to learn more about data engineering than by building data systems from scratch! Or contribute to existing open-source efforts! (or try the final project, or see many links in slides, …)

Happy to talk about research if you’re interested…!!!!


This is an amazing time to be studying data systems and data engineering!


64 of 68

What we hope you took away from this course

An appreciation for conceptual and practical challenges underlying data engineering

An understanding for design decisions, capturing issues ranging from ease-of-use to scalability to flexibility

Hands-on experience with and conceptual understanding of different types of data systems


Data governs a lot of decisions around us! Infrastructure to support analysis on data is going to be increasingly essential.



65 of 68


…finally…

66 of 68

Thank You!

Thank you to our wonderful, wonderful course staff for making Data 101 Fall 2024 happen!!


67 of 68


…finally finally…

68 of 68

Thank YOU!!


Please fill out our two course evaluations, both internal and campus!

This week’s lecture attendance�https://edstem.org/us/courses/63937/discussion/5816846