1 of 68

Data Engineering Applications in Online Advertising

Peter Sujan, Senior Software Engineer, Microsoft

December 5, 2024

Data 101, Fall 2024 @ UC Berkeley

Lisa Yan, Michael Ball https://data101.org/fa24/


LECTURE 27

2 of 68

Join at slido.com #bing


3 of 68

Background/why am I here?

4 of 68

Education

  • Attended Berkeley 2011-2016 (B.A., M.A. Statistics)
  • Math -> Statistics/CS
  • TA:
    • CS10
    • Stat 133
    • DS8 (second semester ever!!)

5 of 68

Research

6 of 68

Background/why am I here?

Met Michael (Prof. Ball) as CS10 lab assistants, Spring 2012

Roommates during Master’s, 2015-2016

7 of 68

Career

  • Microsoft – 8 years
  • Hired as software engineer, hybrid stats/CS background
  • Ads Engineering org

8 of 68

Online/Search Advertising

9 of 68

Bing Ads (later -> Microsoft Ads)

10 of 68

Anatomy of an Ad Campaign

  • Campaign – a group of keywords and ads
  • CampaignId: 54321 (unique ID)
  • Budget: $100 per day
  • Keywords (exact or fuzzy match)
    • “hawaii vacation”
    • “tahiti hotels”
  • Ad(s):
    • Title: “Flights to Honolulu, Hawaii | Compare Prices by Dates”
    • Body: “Compare Cheap Flights to Honolulu, Hawaii. View deals …”
    • Landing page: “https://www.expedia.com/...”

11 of 68

Lifetime of an ad in a request

  • Selection (lighter weight models to select many potentially relevant ads)
  • Relevance (run more expensive models to filter out some less relevant or lower quality ads)
  • Click prediction (run models that predict probability of a click, probability of a conversion, etc)
  • Automated bidding and auction to select best few ads
  • Surviving ads get impressed (i.e., shown to the user)
  • Some may be clicked
  • Some may lead to a conversion

12 of 68

Lifetime of an ad in a request

  • Selection (lighter weight models to select many potentially relevant ads) – natural language processing, text embeddings
  • Relevance (run more expensive models to filter out some less relevant or lower quality ads) - natural language processing, text embeddings, human or AI judgements/labeling
  • Click prediction (run models that predict probability of a click, probability of a conversion, etc) – regression, tree models
  • Automated bidding and auction to select best few ads – optimization (convex, linear programming), game theory, economics

13 of 68

Auctions and Auto-bidding

14 of 68

Manual Bidding - Advertiser Specified Bids

  • Keywords
    • “hawaii vacation” - $0.50 per click
    • “tahiti hotels” - $1.00 per click
  • Search advertising uses a second-price auction – meaning that the winner pays the bid of the next highest competitor
  • Cost-per-click will always be <= bid
  • Advertiser needs to manage their own performance!
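A minimal sketch of second-price pricing (not the production auction, which also weighs bids by predicted click probability and ad quality; the bid values are hypothetical):

```python
def second_price_cpc(bids):
    """Winner is the highest bidder; they pay the runner-up's bid.
    Simplified sketch: real ad auctions also factor in predicted
    click probability and ad quality."""
    if len(bids) < 2:
        return None, None  # no competing bid to price against
    ordered = sorted(bids, reverse=True)
    return ordered[0], ordered[1]  # (winning bid, cost per click)

# Hypothetical bids competing for "hawaii vacation"
winning_bid, cpc = second_price_cpc([0.50, 0.35, 0.20])
print(winning_bid, cpc)  # 0.5 0.35 -- cost per click <= winning bid
```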

15 of 68

Managing Performance

  • “hawaii vacation” - $0.50 bid
    • 50 clicks per day
    • Avg. CPC (cost per click) – $0.40
    • $20 spend per day
    • 10 conversions per day
    • Cost per acquisition (conversion) - $2.00
  • “tahiti hotels” - $1.00 bid
    • 100 clicks per day
    • Avg. CPC (cost per click) – $0.80
    • $80 spend per day
    • 10 conversions per day
    • Cost per acquisition (conversion) - $8.00
  • Overall CPA: $100 / 20 = $5.00 per conversion
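The metrics above are related by simple arithmetic; a quick sketch using the numbers from this slide:

```python
def performance(clicks, avg_cpc, conversions):
    """Spend and cost-per-acquisition implied by clicks, CPC, and conversions."""
    spend = clicks * avg_cpc
    return spend, spend / conversions

hawaii_spend, hawaii_cpa = performance(50, 0.40, 10)   # $20/day spend, $2.00 CPA
tahiti_spend, tahiti_cpa = performance(100, 0.80, 10)  # $80/day spend, $8.00 CPA
overall_cpa = (hawaii_spend + tahiti_spend) / (10 + 10)
print(round(overall_cpa, 2))  # 5.0
```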

16 of 68

Aside: Conversions

  • Advertiser-reported events that happen after user clicks on an ad and navigates to the advertiser’s website
  • Either reported in real time (JavaScript or backend code), or batch-reported later
  • Can represent things like:
    • Navigating to other pages in the website
    • User sign-ups/creating an account
    • Adding products to shopping cart
    • Checking out/placing order
  • Typically this is when an advertiser actually realizes revenue
  • Challenging to predict/model!


18 of 68

Managing Performance

  • “hawaii vacation” - $0.50 bid -> too low
    • 50 clicks per day
    • Avg. CPC (cost per click) – $0.40
    • $20 spend per day
    • 10 conversions per day
    • Cost per acquisition (conversion) - $2.00 ☺
  • “tahiti hotels” - $1.00 bid -> too high
    • 100 clicks per day
    • Avg. CPC (cost per click) – $0.80
    • $80 spend per day
    • 10 conversions per day
    • Cost per acquisition (conversion) - $8.00 ☹
  • Overall CPA: $100 / 20 = $5.00 per conversion

19 of 68

Automated Bidding

  • Example: TargetCPA
  • Objective is to maximize conversions
  • Constraint: CPA (Spend / Conversions) <= Advertiser specified target
  • TargetCPA: $4.00 per conversion
  • Budget: $100 per day
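Conceptually, TargetCPA picks the bid that maximizes predicted conversions subject to the CPA and budget constraints. A minimal sketch, with a hypothetical per-campaign curve of (bid, predicted spend, predicted conversions) points:

```python
def pick_bid(curve, target_cpa, daily_budget):
    """curve: list of (bid, predicted_spend, predicted_conversions) points.
    Choose the bid maximizing conversions subject to
    spend / conversions <= target_cpa and spend <= daily_budget.
    Hypothetical sketch, not the production optimizer."""
    feasible = [(bid, spend, conv) for bid, spend, conv in curve
                if conv > 0 and spend / conv <= target_cpa
                and spend <= daily_budget]
    if not feasible:
        return None
    return max(feasible, key=lambda point: point[2])[0]

curve = [(0.25, 25.0, 5), (0.60, 75.0, 30), (1.00, 120.0, 35)]
print(pick_bid(curve, target_cpa=4.00, daily_budget=100.0))  # 0.6
```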

20 of 68

Automated Bidding

  • “hawaii vacation” – auto-bidding sets $0.60 bid
    • 150 clicks per day
    • Avg. CPC (cost per click) – $0.50
    • $75 spend per day
    • 30 conversions per day
    • Cost per acquisition (conversion) - $2.50 ☺
  • “tahiti hotels” - auto-bidding sets $0.60 bid
    • 50 clicks per day
    • Avg. CPC (cost per click) – $0.50
    • $25 spend per day
    • 5 conversions per day
    • Cost per acquisition (conversion) - $5.00 – better!
  • Overall CPA: $100 / 35 ≈ $2.86 per conversion

21 of 68

Ads System Design

22 of 68

Scale of the Problem

  • Millions of campaigns and ads/keywords in the system
  • Around 200k QPS (queries per second) across all data centers
  • 30 million clicks, $25m revenue daily - annual revenue $10+ billion
  • Entire request (Selection -> Relevance -> Click Prediction -> Auto-bidding -> Auction) needs to run in around 500 ms
  • Not all computation can be run at request-time! Some models are simply too expensive.

23 of 68

Data Pipeline Design

Offline Batch Pipelines

Near-real-time/streaming Pipelines

Real-time Request Processing

24 of 68

Platform Comparison

  • Batch pipelines:
    • Typically slower, but allow us to run data-intensive models/aggregations
    • Used for processing information at longer time-scales, such as logs of all requests/clicks/spend in the system for an entire hour or day.
  • Near-real-time pipelines:
    • Can handle less data, but are able to respond within minutes to changes in the system
    • Example: streaming updates when advertiser changes their settings like daily Budget
  • Real-time serving system:
    • Very time-constrained, so should look up pre-prepared data produced by other layers whenever possible

25 of 68

System design - Push Model

  • Very common to employ a “push model” for data transfer
  • Data is generated at each layer and made available in some kind of data store in the next layer through independent data rollouts
  • Once data set is registered (“publisher”), various consumers can register as “subscribers” to receive new batches of data once generated
  • Example subscriber: various regional data centers for serving requests may subscribe to data from offline batch and NRT pipelines
  • Note: emphasis is more on availability than consistency (stale modeling data is often OK)
    • Some exceptions, like budget-enforcement components

26 of 68

Offline Batch Pipelines

  • Primary platform: Azure Data Lake Analytics
    • (actually a Microsoft internal version called Cosmos/Scope)
  • Scripts written in a language called U-SQL, which allows chaining of SQL-like expressions
  • Language is extensible using both C# and Python
  • Data lives in Azure Data Lake Storage, which is a kind of distributed file system
  • Data in “structured stream” format can be thought of as similar to a SQL table (includes schema of columns with their types)

27 of 68

ADLA – Job optimization

  • Script goes through compilation/optimization step where a distributed job plan is created
  • Computation is distributed amongst many workers
  • Uses properties of input data such as partitioning
  • Heuristics based on input data size, JOIN properties, etc.
  • Does typical SQL optimizations like predicate pushdown
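Predicate pushdown can be illustrated in plain Python: filtering each input before the join touches far fewer rows than joining first and filtering after, and produces the same result (rows here are hypothetical):

```python
clicks = [(12345, 10), (56789, 20), (99999, 30)]   # (CampaignId, Clicks)
spend = [(12345, 40.0), (56789, 80.0)]             # (CampaignId, Spend)

# Naive plan: join everything first, then apply the filter
joined = [(c_id, c, s) for c_id, c in clicks
          for s_id, s in spend if c_id == s_id]
late = [row for row in joined if row[0] == 12345]

# Pushed-down plan: filter each input first, then join the survivors
clicks_f = [r for r in clicks if r[0] == 12345]
spend_f = [r for r in spend if r[0] == 12345]
early = [(c_id, c, s) for c_id, c in clicks_f
         for s_id, s in spend_f if c_id == s_id]

assert late == early == [(12345, 10, 40.0)]
```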

28 of 68

ADLA U-SQL Example

A = SELECT CampaignId, Clicks FROM (SSTREAM "/home/Clicks.ss");
B = SELECT CampaignId, Spend FROM (SSTREAM "/home/Spend.ss");

Joined =
    SELECT A.CampaignId, A.Clicks, B.Spend
    FROM A
    INNER JOIN B
    ON A.CampaignId == B.CampaignId;

OUTPUT Joined TO SSTREAM "/home/JoinedResult.ss"
CLUSTERED BY CampaignId; // essentially like a key

29 of 68

ADLA - Extensions

FinalResult = SELECT CampaignId,
                     Utils.ComputeCPC(Spend, Clicks) AS CostPerClick
              FROM Joined;

#CS
public static class Utils
{
    public static double ComputeCPC(double spend, double clicks)
    {
        return clicks > 0 ? spend / clicks : 0;
    }
}
#ENDCS

30 of 68

ADLA - Reducers

  • Fancier form of script extension
  • Inspired by map-reduce paradigm
  • Given a certain key column (or set of columns), group rows according to those columns and pass each set of rows into the Reducer
  • Emit one or more rows for each group (does not have to be the same number of rows as were passed in)
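The reduce contract can be sketched in Python (the real platform is U-SQL/Scope, not Python; the toy reducer and rows here are made up):

```python
from itertools import groupby

def reduce_rows(rows, key_fn, reducer):
    """Group rows by key and pass each group to the reducer, which may
    emit any number of output rows per group (not necessarily the same
    number as were passed in)."""
    output = []
    for key, group in groupby(sorted(rows, key=key_fn), key=key_fn):
        output.extend(reducer(key, list(group)))
    return output

# Toy reducer: emit one (CampaignId, total clicks) row per campaign
rows = [(12345, 3), (56789, 1), (12345, 2)]
totals = reduce_rows(rows, key_fn=lambda r: r[0],
                     reducer=lambda k, g: [(k, sum(r[1] for r in g))])
print(totals)  # [(12345, 5), (56789, 1)]
```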

31 of 68

Reducer Example – Curve Generation

  • Take records of every request in a day
  • Simulate performance for each campaign using different bid values on that request
  • Turn those into a single “curve” of bid -> performance (clicks, spend, conversions) for each campaign
  • Use this to pick a best bid to use for the next day

32 of 68

Raw Counterfactual Data

| Request ID | CampaignId | Bid   | Spend | Conversions |
|------------|------------|-------|-------|-------------|
| abc123     | 12345      | $0.50 | $0.40 | 0           |
| abc123     | 12345      | $1.00 | $0.60 | 1           |
| abc123     | 56789      | $0.25 | $0.00 | 0           |
| abc123     | 56789      | $0.50 | $0.25 | 0           |
| xyz789     | 12345      | $0.60 | $0.00 | 0           |
| xyz789     | 12345      | $1.00 | $0.75 | 0           |
| xyz789     | 56789      | $0.25 | $0.25 | 1           |
| xyz789     | 56789      | $0.50 | $0.45 | 1           |

33 of 68

Reduced Per-campaign Curve

| CampaignId | Bid   | Spend | Conversions |
|------------|-------|-------|-------------|
| 12345      | $0.60 | $0.40 | 0           |
| 12345      | $1.00 | $1.15 | 1           |
| 56789      | $0.25 | $0.25 | 1           |
| 56789      | $0.50 | $0.70 | 1           |

34 of 68

Reducer Example Code

ReducedCurve = REDUCE RawCounterfactualData
               ON CampaignId
               PRODUCE CampaignId, Bid, Spend, Conversions
               USING CurveReducer;

35 of 68

Reducer Example Code

ReducedCurve = REDUCE RawCounterfactualData // Input data
               ON CampaignId                // Grouping key
               PRODUCE CampaignId, Bid, Spend, Conversions
               USING CurveReducer;          // Custom reducer class

36 of 68

Reducer Example - group

| Request ID | CampaignId | Bid   | Spend | Conversions |
|------------|------------|-------|-------|-------------|
| abc123     | 12345      | $0.50 | $0.40 | 0           |
| abc123     | 12345      | $1.00 | $0.60 | 1           |
| abc123     | 56789      | $0.25 | $0.00 | 0           |
| abc123     | 56789      | $0.50 | $0.25 | 0           |
| xyz789     | 12345      | $0.60 | $0.00 | 0           |
| xyz789     | 12345      | $1.00 | $0.75 | 0           |
| xyz789     | 56789      | $0.25 | $0.25 | 1           |
| xyz789     | 56789      | $0.50 | $0.45 | 1           |

37 of 68

Reducer Example - shuffle

| Request ID | CampaignId | Bid   | Spend | Conversions |
|------------|------------|-------|-------|-------------|
| abc123     | 12345      | $0.50 | $0.40 | 0           |
| abc123     | 12345      | $1.00 | $0.60 | 1           |
| xyz789     | 12345      | $0.60 | $0.00 | 0           |
| xyz789     | 12345      | $1.00 | $0.75 | 0           |
| abc123     | 56789      | $0.25 | $0.00 | 0           |
| abc123     | 56789      | $0.50 | $0.25 | 0           |
| xyz789     | 56789      | $0.25 | $0.25 | 1           |
| xyz789     | 56789      | $0.50 | $0.45 | 1           |

38 of 68

Reducer Example - result

| CampaignId | Bid   | Spend | Conversions |
|------------|-------|-------|-------------|
| 12345      | $0.60 | $0.40 | 0           |
| 12345      | $1.00 | $1.15 | 1           |
| 56789      | $0.25 | $0.25 | 1           |
| 56789      | $0.50 | $0.70 | 1           |

39 of 68

Reducer Challenges

  • Data skew – too many rows going to single reducer
    • Solution – recursive reducer
    • Allows computation to be done in stages (think divide-and-conquer, like with mergesort)
    • Requires commutativity/associativity of the operations
  • Heavy computation per reducer node
    • Solution (if applicable) - PRESORT rows going into reducer
    • Replace with native ADLA operations if possible
    • Redesign algorithm, split into multiple reducer stages
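The associativity requirement can be seen with a toy example: sums combine safely across stages, while a plain average does not (carry (sum, count) pairs and divide at the end):

```python
# Sums are associative/commutative, so partial results from separate
# reducer stages can be combined safely.
chunks = [[1, 2, 3], [4, 5], [6]]            # rows split across workers
partials = [sum(chunk) for chunk in chunks]  # stage 1: per-worker sums
staged_total = sum(partials)                 # stage 2: combine partials
assert staged_total == sum([1, 2, 3, 4, 5, 6])

# A plain average is NOT associative; carry (sum, count) pairs
# and divide only at the end.
pairs = [(sum(chunk), len(chunk)) for chunk in chunks]
total, count = map(sum, zip(*pairs))
print(total / count)  # 3.5
```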

40 of 68

Near-real-time Processing

41 of 68

NRT Pipeline Design

  • Designed to only process a small set of entities at a given time
  • Emphasis is on low latency (but not quite as low as real-time serving stack)
  • Uses technologies like Apache Kafka for queueing
  • Process in purely streaming fashion or in “mini-batches”
  • Key-value store – very simple lookups (no other filters, aggregation, etc)

42 of 68

NRT Pipeline Example

  • Entity being processed – CampaignId: 12345
  • Receive message that Budget has changed (ex: $100 -> $200)
  • Look up several data stores, using CampaignId as key:
    • Campaign metadata table – containing things like budget, TargetCPA, etc.
    • Offline batch datasets like click/spend/conversion curves
  • Feed into module written in C#, which calls algorithms to re-optimize using these inputs. One API call per entity.

43 of 68

Key-value Data Example

  • Key: { EntityType: Campaign, EntityId: 12345 }
  • Value from campaign metadata store:

{
    DailyBudget: $100,
    BidStrategy: "TargetCPA",
    TargetCPA: $5.00
}

44 of 68

Bonus: data serialization

  • Key-value data is actually just binary data
  • Producer and consumer agree on a “schema” for how that data is to be unpacked and interpreted
  • Libraries for this: Microsoft Bond, Google Protobuf
  • Given a schema, libraries do a pre-compilation step to generate classes in whatever language is desired (C#, Python, Java, …)
  • These provide serialization/deserialization methods – take raw binary data, unpack it into an object which can be used just like any other in the given language
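A rough illustration of the idea using Python's stdlib struct module (the actual Bond/Protobuf wire formats differ; the field layout here is made up for the sketch):

```python
import struct

# Hypothetical fixed schema both producer and consumer agree on:
# double DailyBudget, double TargetCPA, 16-byte BidStrategy string
SCHEMA = "dd16s"

def serialize(daily_budget, target_cpa, bid_strategy):
    # Pack the fields into raw binary; struct null-pads the string to 16 bytes
    return struct.pack(SCHEMA, daily_budget, target_cpa, bid_strategy.encode())

def deserialize(raw):
    # Unpack using the same agreed schema, stripping the null padding
    daily_budget, target_cpa, strategy = struct.unpack(SCHEMA, raw)
    return daily_budget, target_cpa, strategy.rstrip(b"\0").decode()

raw = serialize(100.0, 5.0, "TargetCPA")  # binary blob in the key-value store
print(deserialize(raw))  # (100.0, 5.0, 'TargetCPA')
```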

45 of 68

Data Serialization: Example Protobuf Schema

message CampaignMetadata
{
    optional double DailyBudget = 1;
    optional string BidStrategy = 2; // Typically an enum
    optional double TargetCPA = 3;
}

46 of 68

Protobuf: Example Code

// Raw bytes from a file or data store lookup
byte[] bytes = …;

CampaignMetadata metadata = CampaignMetadata.parseFrom(bytes);

System.out.println("Campaign budget: " + metadata.getDailyBudget());

47 of 68

Last Thoughts

  • Data science = statistics/visualization + domain knowledge
  • Data engineering = data processing + product requirements

  • Thanks for listening and good luck with final exams!

48 of 68

Thank You!


49 of 68

Concluding Remarks


50 of 68

Announcements


51 of 68

Class Journey

SQL review

Advanced SQL queries (views, subqueries, window functions, …)

DML, DDL, Referential integrity, index selection, performance tuning

Data transformation and preparation, Data wrangling and cleaning

Non-relational data models (Tensors, Spreadsheets, etc.), semistructured data (and mongoDB)

Relational Model and Algebra, ER and normalization, Spreadsheets, Transactions, BI and OLAP, parallel computing, security and privacy, data pipelines, …


Project 0

Project 1

Project 2

Project 3

Project 4

Homework Assignments

+ Optional Final Project

+ guest lecturers!


52 of 68

Class Journey



In a “classical database course”, you’d probably only learn ⅓ of these topics!


53 of 68

Classical Database Ideas → Data Engineering Perspective

Relational Database Management Systems

  • Tables and relational algebra are a basis for data-based computation.
  • Joins combine information across multiple relations.
  • Query processing and optimization is automatically handled for you.

Managing the Database Management System:

  • Transactions and the ACID principle support DBMS updates.
  • User knobs to control performance: views, indexes, materialization, normalization, etc.
  • Constraints and normalization help ensure data “looks right” over time


Declarative querying and SQL are “intergalactic dataspeak” for all sorts of relational databases!


54 of 68

Different Data Models and Query Languages

Structured data models and ways to transform between them

Flexible data representation formats with pros/cons

JSON, spreadsheets, CSVs, …

Considerations:

  • flexibility
  • efficiency
  • updatability
  • enforceability
  • querying intuitiveness
  • homogeneity


Data’s representation matters for what you’re trying to accomplish!

Map new representation formats/query languages to/from the relational model


55 of 68

Real life is messy

[Figure: raw data (transactions, sensors, log files, experiments) lands in a Data Lake, the governed, secure, audited/managed source of truth; a Data Preparation / Integration layer (data discovery & assessment, data quality & integrity, metadata store) feeds a Data Warehouse that is use-case-specific, fit for purpose, and self-service.]


56 of 68

Revisiting Data Science Lifecycle: Transformation, Wrangling

  • Data doesn’t stay static in clean well-defined relations.
  • Code-centric and query-centric viewpoints are both valuable.

Practical data engineering lessons:

  • How to get started: data “smells” and getting a sense
  • How to deal with different data types
  • How to get it to the shape you want:
    • transformation
    • aggregation
    • drilling up/rolling down
    • type induction


Data engineers have to deal with “E” “L” and “T” in all sorts of permutations!


57 of 68

Revisiting Data Science Lifecycle: Exploration

  • How to clean things up:
    • getting rid of outliers
    • extrapolation/imputation
    • entity resolution
  • How to get data where it needs to be:
    • (sort of) dataflow/messaging/workflow

Throughout: the benefits of interaction and visualization




58 of 68

Relational Model Extensions/Specializations

What to do when…

  • …your data is way too large
    • How to do approximation correctly (and the perils with respect to joins)
  • …you just want to read
    • OLAP: cubing and the curse of dimensionality
    • Deciding what to materialize so that you can reconstruct everything on demand
    • (sort of) Column stores as a specialization – and their downsides
  • …you want to employ order
    • Window queries – and their use in data preparation and cleaning
  • …you want to use multiple steps of transformations:
    • Give them names! CTAS, views, CTEs and their pros and cons
    • Create a dbt-based data pipeline


In all cases, data manipulation requires incremental maintenance!


59 of 68

Class Journey

Relational Model and Algebra

Advanced SQL queries (views, subqueries, window functions, …)

DML, DDL

Referential integrity, index selection, performance tuning

Data transformation and preparation

Data wrangling and cleaning

Non-relational data models (Tensors, Spreadsheets, etc.)

Semistructured data (and mongoDB)

Transactions, ER and normalization, BI and OLAP, parallel computing, data pipelines, …


(note: topics are grouped by project, not to scale, and not in the order of the class schedule)

…What did we not cover?


60 of 68

(what we didn’t cover)

From the classical side:

  • Database internals stuff
  • Details about query optimization, indexes, …
  • Details about functional dependencies, normalization
  • Various forms of logging and recovery; concurrency control

From the non-classical side:

  • Transaction processing at scale and weaker consistency models
  • Virtual machines, containerization, …
  • Resource management, pricing, cloud computation
  • Deep learning systems



61 of 68

(what we didn’t cover)

Data Engineering often requires SWE skills.

  • Final project is an intro to ‘open-ended’ world
  • Real life is (much) less scaffolded
  • Testing! How do you know if your tools are working?
  • Debugging: always a skill to improve!
  • Reading! Both code, and prose, reports, etc.
  • Writing, Evaluation, Communication
  • Team Work

Raw intelligence and skill alone are not enough to get promoted.

Try it on your own even if not doing it for a grade.

But don’t just pick Yelp, Postgres, and Mongo. Explore!



62 of 68

(what we didn’t cover - different DBs)

There are many many different types of databases and data tools…

  • Many other RDBMSes (MySQL, Oracle, MS SQL Server, DB2)
  • Document Stores (Mongo, Amazon Aurora)
  • Hybrid Systems (Google BigQuery / Spanner)
  • Key Value Stores (Redis)
  • Object / File Stores (AWS S3)
  • Timeseries DBs (Timescale DB)
  • Pub-Sub / Queuing Systems (Kafka, RabbitMQ, AWS SQS)
  • Text-Based Systems (Elasticsearch)
  • OLAP / OLTP Systems
    • AWS Redshift
    • Data Lakes, Data warehouses, “Lake Houses” (sigh…)

Skim the (very long) AWS / Azure / Google Cloud Product offerings



63 of 68

Where can you go from here?

CS186/286a: Database internals (narrower + deeper)

CS286b: Research-oriented database internals class

CS298-12: Data systems and foundations seminar

CS61C: Intro to systems (open to Data students in Summer 2025)

CS169A: Software Engineering (open in Summer 2025)

…and also, no better way to learn more about data engineering than by building data systems from scratch! Or contribute to existing open-source efforts! (or try the final project, or see many links in slides, …)

Happy to talk about research if you’re interested…!!!!


This is an amazing time to be studying data systems and data engineering!


64 of 68

What we hope you took away from this course

An appreciation for conceptual and practical challenges underlying data engineering

An understanding for design decisions, capturing issues ranging from ease-of-use to scalability to flexibility

Hands-on experience with and conceptual understanding of different types of data systems


Data governs a lot of decisions around us! Infrastructure to support analysis on data is going to be increasingly essential.



65 of 68


…finally…

66 of 68

Thank You!

Thank you to our wonderful, wonderful course staff for making Data 101 Fall 2024 happen!!


67 of 68


…finally finally…

68 of 68

Thank YOU!!


Please fill out our two course evaluations, both internal and campus!

This week’s lecture attendance�https://edstem.org/us/courses/63937/discussion/5816846