1 of 57

Dynamic Ad Performance Reporting with Amazon Redshift: Data Science and Complex Queries at Massive Scale

November 13, 2014 | Las Vegas, NV

Timon Karnezos, Neustar

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

2 of 57

Outline

  • Four Textbook Ad-Tech Problems
    • Frequency
    • Attribution
    • Overlap
    • Ad-hoc
  • Four Solutions & Lessons Learned
    • Cohort Analysis
    • Sessionization
    • Self-join
    • Business Logic Joins
  • How Amazon Redshift Makes It Possible

3 of 57

Four Problems

4 of 57

How many ads should I show you?

5 of 57

How many ads should I show you?

Frequency

6 of 57

How much should I pay for ads?

7 of 57

How much should I pay for ads?

Attribution

8 of 57

How do I reach these people for less?

9 of 57

How do I reach these people for less?

Overlap

10 of 57

Could you run a custom query?

11 of 57

Could you run a custom query?

Ad-hoc

12 of 57

Four Solutions

13 of 57

Frequency

How many ads should I show you?

14 of 57

Frequency

  1. Assign user to cohort by # ads seen.
  2. Aggregate behavior by cohort.
  3. Compute ROI by cohort.
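Steps 1 and 2 map to the queries on the following slides; step 3 isn't shown as SQL. A minimal sketch, assuming the cohort aggregates are materialized into a hypothetical cohort_statistics table with columns (campaign_id, site_id, frequency, users, cost, revenue):

-- Hypothetical step 3: ROI per frequency cohort (cohort_statistics is assumed,
-- not a table from the deck).
SELECT campaign_id,
       site_id,
       frequency,
       SUM(revenue)::float / NULLIF(SUM(cost), 0) AS roi
FROM cohort_statistics
GROUP BY 1, 2, 3;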

15 of 57

Frequency

How many “people” do we have to measure?

0.7B / day
2B / week
8B / month
21B / quarter

16 of 57

-- Number of ads seen per user
WITH frequency_intermediate AS (
    SELECT user_id,
           SUM(1)       AS impression_count,
           SUM(cost)    AS cost,
           SUM(revenue) AS revenue
    FROM impressions
    WHERE record_date BETWEEN <...>
    GROUP BY 1
)
-- Number of people who saw N ads
SELECT impression_count, SUM(1), SUM(cost), SUM(revenue)
FROM frequency_intermediate
GROUP BY 1;

Frequency

17 of 57

Frequency

Uh, that’s a big agg.

18 of 57

Frequency

Not big enough.

19 of 57

CREATE TABLE frequency_intermediate (
    record_date      date   ENCODE lzo   NOT NULL,
    campaign_id      bigint ENCODE lzo   NOT NULL,
    site_id          bigint ENCODE lzo   NOT NULL,
    user_id          bigint              NOT NULL DISTKEY,
    impression_count int    ENCODE delta NOT NULL,
    cost             bigint ENCODE delta NOT NULL,
    revenue          bigint ENCODE delta NOT NULL
) SORTKEY(record_date, campaign_id, site_id, user_id);

Frequency

20 of 57

WITH user_frequency AS (
    SELECT user_id, campaign_id, site_id,
           SUM(impression_count) AS frequency,
           SUM(cost)             AS cost,
           SUM(revenue)          AS revenue
    FROM frequency_intermediate
    WHERE record_date BETWEEN <...>
    GROUP BY 1, 2, 3
)
SELECT campaign_id, site_id, frequency,
       SUM(1), SUM(cost), SUM(revenue)
FROM user_frequency
GROUP BY 1, 2, 3;

Frequency

21 of 57

Frequency

  1. Update intermediate – 1m
  2. Compute cohorts for 90d – 10m
  3. Aggregate cohort statistics – 11s
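Step 1 isn't shown as SQL; a minimal sketch of the daily append, assuming the impressions table from earlier and using the deck's <...> placeholder for the date being loaded:

-- Hypothetical step 1: append one day of raw impressions to the intermediate.
INSERT INTO frequency_intermediate
SELECT record_date,
       campaign_id,
       site_id,
       user_id,
       SUM(1)       AS impression_count,
       SUM(cost)    AS cost,
       SUM(revenue) AS revenue
FROM impressions
WHERE record_date = <...>
GROUP BY 1, 2, 3, 4;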

22 of 57

6 date ranges, 2 groupings, all clients =

2.5 hours x (8 x dw2.8xlarge) =

$96.00
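The figure is consistent with the 2014 on-demand rate of about $4.80 per dw2.8xlarge node-hour (an assumption; the deck doesn't state the rate): 2.5 hours x 8 nodes x $4.80 = $96.00.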

Frequency

23 of 57

Lesson Learned

Massive, multi-stage aggregations are

fast and reliable.

Frequency

24 of 57

Attribution

How much should I pay for ads?

25 of 57

Attribution

[Slide graphic: a user's conversion event ($)]

26 of 57

-- Basic sessionization query, assemble user activity
-- that ended in a conversion into a timeline.
SELECT <...>
FROM impressions i
JOIN conversions c ON
    i.user_id = c.user_id AND
    i.record_date < c.record_date
ORDER BY i.record_date;

Attribution

27 of 57

Attribution

[Slide graphic: a conversion ($) preceded by three impressions, labeled Position 1, 2, 3, counting back from the conversion]

28 of 57

Attribution

[Slide graphic: the same timeline with hour offsets added; the impressions fall 3, 12, and 16 hours before the conversion, at Positions 1, 2, and 3 respectively]

29 of 57

-- Sessionize user activity per conversion, partition by campaign
-- (45-day lookback window)
SELECT c.record_date AS conversion_date,
       c.event_id    AS conversion_id,
       i.campaign_id AS campaign_id,
       i.site_id     AS site_id,
       i.user_id     AS user_id,
       c.revenue     AS conversion_revenue,
       DATEDIFF('hour', i.record_date, c.record_date) AS hour_offset,
       SUM(1) OVER (PARTITION BY i.user_id, i.campaign_id, c.event_id
                    ORDER BY i.record_date DESC
                    ROWS UNBOUNDED PRECEDING) AS position
FROM impressions i
JOIN conversions c ON
    i.user_id = c.user_id AND
    i.campaign_id = c.campaign_id AND
    i.record_date < c.record_date AND
    i.record_date > (c.record_date - interval '45 days') AND
    c.record_date BETWEEN <...>;

Attribution

30 of 57

-- Compute statistics on sessions (funnel placement, last-touch, site-count, etc...)
WITH enriched_sessions AS (
    SELECT campaign_id,
           site_id,
           conversion_date,
           conversion_revenue,
           position,
           COUNT(DISTINCT site_id)
               OVER (PARTITION BY user_id, campaign_id, conversion_id
                     ORDER BY position DESC
                     ROWS UNBOUNDED PRECEDING) AS unique_preceding_site_count
    FROM sessions
)
SELECT campaign_id,
       site_id,
       conversion_date,
       AVG(position) AS average_position,
       SUM(conversion_revenue * (position = 1)::int) AS lta_attributed,
       AVG(unique_preceding_site_count) AS average_unique_preceding_site_count
FROM enriched_sessions
GROUP BY 1, 2, 3;

Attribution

31 of 57

45d window, 45d lookback, 11 stats, all clients =

2 hours x (8 x dw2.8xlarge) =

$76.80

Attribution

32 of 57

Lesson Learned

Window functions are an effective, feature-rich way to sessionize data.

Attribution

33 of 57

Overlap

How do I reach these people for less?

34 of 57

Overlap

          Site A    Site B    Site C
Site A              20%       60%
Site B                        90%
Site C
CPM       $0.06     $1.05     $9.50

35 of 57

Overlap

          Site A    Site B    Site C
Site A              20%       60%
Site B                        90%
Site C
CPM       $0.06     $1.05     $9.50

90% of the people you see on C are also seen on B!

36 of 57

Overlap

          Site A    Site B    Site C
Site A              20%       60%
Site B                        90%
Site C
CPM       $0.06     $1.05     $9.50

B is ⅛ the price of C!

37 of 57

CREATE TABLE overlap_intermediate (
    user_id bigint              NOT NULL DISTKEY,
    site_id bigint ENCODE delta NOT NULL
) SORTKEY (user_id, site_id);

Overlap

38 of 57

WITH co_occurrences AS (
    SELECT oi.site_id  AS site1,
           oi2.site_id AS site2
    FROM overlap_intermediate oi
    JOIN overlap_intermediate oi2 ON
        oi.site_id > oi2.site_id AND
        oi.user_id = oi2.user_id
)
SELECT site1, site2, SUM(1)
FROM co_occurrences
GROUP BY 1, 2;

Overlap

39 of 57

CREATE TABLE overlap_intermediate (
    record_date date   ENCODE lzo NOT NULL,
    campaign_id bigint ENCODE lzo NOT NULL,
    site_id     bigint ENCODE lzo NOT NULL,
    user_id     bigint            NOT NULL DISTKEY
) SORTKEY (record_date, campaign_id, site_id, user_id);

Overlap

40 of 57

WITH site_overlap_intermediate AS (
    SELECT user_id, site_id, campaign_id
    FROM overlap_intermediate
    WHERE record_date BETWEEN <...>
    GROUP BY 1, 2, 3
),
site_co_occurrences AS (
    SELECT oi.campaign_id AS c_id, oi.site_id AS site1, oi2.site_id AS site2
    FROM site_overlap_intermediate oi
    JOIN site_overlap_intermediate oi2 ON
        oi.site_id > oi2.site_id AND
        oi.user_id = oi2.user_id AND
        oi.campaign_id = oi2.campaign_id
)
SELECT c_id, site1, site2, SUM(1)
FROM site_co_occurrences
GROUP BY 1, 2, 3;

Overlap

41 of 57

Overlap

  1. Update intermediate – 1m
  2. Compute 90d co-occurrences – 15m
  3. Aggregate co-occurrences – 50s
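The queries above stop at raw co-occurrence counts; producing the percentage matrix shown earlier also needs each site's unique-user count as a denominator. A sketch, assuming the final aggregate is materialized into a hypothetical co_occurrence_counts table with columns (c_id, site1, site2, users):

-- Hypothetical roll-up from co-occurrence counts to overlap percentages.
WITH site_uniques AS (
    SELECT campaign_id, site_id, COUNT(DISTINCT user_id) AS uniques
    FROM overlap_intermediate
    WHERE record_date BETWEEN <...>
    GROUP BY 1, 2
)
SELECT co.c_id,
       co.site1,
       co.site2,
       100.0 * co.users / su.uniques AS pct_of_site2_also_seen_on_site1
FROM co_occurrence_counts co
JOIN site_uniques su ON
    su.campaign_id = co.c_id AND
    su.site_id = co.site2;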

42 of 57

6 date ranges, 3 groupings, all clients =

2.5 hours x (8 x dw2.8xlarge) =

$96.00

Overlap

43 of 57

Lesson Learned

Correctly sort your table and self-joins take care of themselves.

Overlap

44 of 57

Ad-hoc

Could you run a custom query?

45 of 57

Ad-hoc

No, but you can!

46 of 57

Ad-hoc

What do we send over?

8 fact tables
26 dimension tables
7 mapping tables

47 of 57

Ad-hoc

How do you extract a client’s data?

42 views
121 joins
1100 sloc

48 of 57

$ pg_dump -Fc -f some_file --table=foo --table=bar
$ pg_restore --schema-only --clean -Fc some_file > schema.sql
$ pg_restore --data-only --table=foo -Fc some_file > foo.tsv
$ aws s3 cp schema.sql s3://metadata-bucket/YYYYMMDD/schema.sql
$ aws s3 cp foo.tsv s3://metadata-bucket/YYYYMMDD/foo.tsv

> \i schema.sql
> COPY foo FROM 's3://metadata-bucket/YYYYMMDD/foo.tsv' <...>
# or combine 'COPY <...> FROM <...> SSH' and pg_restore/psql
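The COPY-over-SSH route in the last comment looks roughly like this; the manifest location and delimiter are illustrative assumptions, and the manifest itself is a JSON file in S3 naming the host and the command whose output Redshift ingests:

COPY foo
FROM 's3://metadata-bucket/ssh_manifest'
WITH CREDENTIALS 'aws_access_key_id=<...>;aws_secret_access_key=<...>'
DELIMITER '\t'  -- assumes the remote command emits tab-delimited rows
SSH;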

Ad-hoc

49 of 57

UNLOAD
('
SELECT i.*
FROM impressions i
JOIN client_to_campaign_mapping m ON
    m.campaign_id = i.campaign_id
WHERE i.record_date >= \'{{yyyy}}-{{mm}}-{{dd}}\' - interval \'1 day\' AND
      i.record_date <  \'{{yyyy}}-{{mm}}-{{dd}}\' AND
      m.client_id = <...>
')
TO 's3://{{bucket}}/us_eastern/{{yyyy}}/{{mm}}/{{dd}}/dsdk_events/{{vers}}/impressions/'
WITH CREDENTIALS 'aws_access_key_id={{key}};aws_secret_access_key={{secret}}'
DELIMITER ',' NULL '\\N' ADDQUOTES ESCAPE GZIP MANIFEST;
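On the receiving cluster, a COPY that mirrors the UNLOAD options loads the files back; a sketch, with the manifest path inferred from the prefix above (ADDQUOTES on unload pairs with REMOVEQUOTES on load):

COPY impressions
FROM 's3://{{bucket}}/us_eastern/{{yyyy}}/{{mm}}/{{dd}}/dsdk_events/{{vers}}/impressions/manifest'
WITH CREDENTIALS 'aws_access_key_id={{key}};aws_secret_access_key={{secret}}'
DELIMITER ',' NULL '\\N' REMOVEQUOTES ESCAPE GZIP MANIFEST;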

Ad-hoc

50 of 57

1.5 hours x (8 x dw2.8xlarge) =

$57.60

Ad-hoc

51 of 57

Lesson Learned

If your business logic is already in SQL, keep it in SQL.

Ad-hoc

52 of 57

How this is possible

53 of 57

Frequency + Attribution + Overlap + Ad-hoc =

2.5 + 2 + 2.5 + 1.5 =

8.5 hours execution time

54 of 57

Workload                                     Node Count   Node Type     Restore   Maint.   Exec.
Frequency & Attribution & Overlap & Ad-hoc   16           dw2.8xlarge   2h        1h       6h

= $691.20

Four workloads, One cluster

55 of 57

Workload      Node Count   Node Type     Restore   Maint.   Exec.
Frequency     8            dw2.8xlarge   1.5h      0.5h     2.5h
Attribution   8            dw2.8xlarge   1.5h      0.5h     2h
Overlap       8            dw2.8xlarge   1h        0.5h     2.5h
Ad-hoc        8            dw2.8xlarge   0h        0.5h     1.5h

= $556.80

Four workloads, Four clusters

(-19%)
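The comparison follows from node-hours: four 8-node clusters run 4.5h + 4h + 4h + 2h = 14.5 cluster-hours = 116 node-hours, versus 16 nodes x 9h = 144 node-hours on the single cluster. At a fixed per-node-hour price that is about 19% less, matching $556.80 vs. $691.20.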

56 of 57

Lesson Learned

Orchestration of Redshift clusters is easy.

Don’t scale up, scale out.

57 of 57

ADV403

Please give us your feedback on this presentation

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Join the conversation on Twitter with #reinvent