1 of 57

Dynamic Ad Performance Reporting with Amazon Redshift: Data Science and Complex Queries at Massive Scale

November 13, 2014 | Las Vegas, NV

Timon Karnezos, Neustar

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

2 of 57

Outline

  • Four Textbook Ad-Tech Problems
    • Frequency
    • Attribution
    • Overlap
    • Ad-hoc
  • Four Solutions & Lessons Learned
    • Cohort Analysis
    • Sessionization
    • Self-join
    • Business Logic Joins
  • How Amazon Redshift Makes It Possible

3 of 57

Four Problems

4 of 57

How many ads should I show you?

5 of 57

How many ads should I show you?

Frequency

6 of 57

How much should I pay for ads?

7 of 57

How much should I pay for ads?

Attribution

8 of 57

How do I reach these people for less?

9 of 57

How do I reach these people for less?

Overlap

10 of 57

Could you run a custom query?

11 of 57

Could you run a custom query?

Ad-hoc

12 of 57

Four Solutions

13 of 57

Frequency

How many ads should I show you?

14 of 57

Frequency

  1. Assign user to cohort by # ads seen.
  2. Aggregate behavior by cohort.
  3. Compute ROI by cohort.
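Steps 1 and 2 map to the queries on the following slides; step 3 isn't shown as SQL. A minimal sketch, assuming the cohort aggregates are materialized into a hypothetical cohort_statistics table with columns (campaign_id, site_id, frequency, users, cost, revenue):

-- Hypothetical step 3: ROI per frequency cohort (cohort_statistics is assumed,
-- not a table from the deck).
SELECT campaign_id,
       site_id,
       frequency,
       SUM(revenue)::float / NULLIF(SUM(cost), 0) AS roi
FROM cohort_statistics
GROUP BY 1, 2, 3;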

15 of 57

Frequency

How many “people” do we have to measure?

0.7B / day
2B / week
8B / month
21B / quarter

16 of 57

-- Number of ads seen per user
WITH frequency_intermediate AS (
    SELECT user_id,
           SUM(1)       AS impression_count,
           SUM(cost)    AS cost,
           SUM(revenue) AS revenue
    FROM impressions
    WHERE record_date BETWEEN <...>
    GROUP BY 1
)
-- Number of people who saw N ads
SELECT impression_count, SUM(1), SUM(cost), SUM(revenue)
FROM frequency_intermediate
GROUP BY 1;

Frequency

17 of 57

Frequency

Uh, that’s a big agg.

18 of 57

Frequency

Not big enough.

19 of 57

CREATE TABLE frequency_intermediate (
    record_date      date   ENCODE lzo   NOT NULL,
    campaign_id      bigint ENCODE lzo   NOT NULL,
    site_id          bigint ENCODE lzo   NOT NULL,
    user_id          bigint              NOT NULL DISTKEY,
    impression_count int    ENCODE delta NOT NULL,
    cost             bigint ENCODE delta NOT NULL,
    revenue          bigint ENCODE delta NOT NULL
) SORTKEY(record_date, campaign_id, site_id, user_id);

Frequency

20 of 57

WITH user_frequency AS (
    SELECT user_id, campaign_id, site_id,
           SUM(impression_count) AS frequency,
           SUM(cost)             AS cost,
           SUM(revenue)          AS revenue
    FROM frequency_intermediate
    WHERE record_date BETWEEN <...>
    GROUP BY 1, 2, 3
)
SELECT campaign_id, site_id, frequency,
       SUM(1), SUM(cost), SUM(revenue)
FROM user_frequency
GROUP BY 1, 2, 3;

Frequency

21 of 57

Frequency

  1. Update intermediate – 1m
  2. Compute cohorts for 90d – 10m
  3. Aggregate cohort statistics – 11s
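Step 1 isn't shown as SQL; a minimal sketch of the daily append, assuming the impressions table from earlier and using the deck's <...> placeholder for the date being loaded:

-- Hypothetical step 1: append one day of raw impressions to the intermediate.
INSERT INTO frequency_intermediate
SELECT record_date,
       campaign_id,
       site_id,
       user_id,
       SUM(1)       AS impression_count,
       SUM(cost)    AS cost,
       SUM(revenue) AS revenue
FROM impressions
WHERE record_date = <...>
GROUP BY 1, 2, 3, 4;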

22 of 57

6 date ranges, 2 groupings, all clients =

2.5 hours x (8 x dw2.8xlarge) =

$96.00
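The figure is consistent with the 2014 on-demand rate of about $4.80 per dw2.8xlarge node-hour (an assumption; the deck doesn't state the rate): 2.5 hours x 8 nodes x $4.80 = $96.00.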

Frequency

23 of 57

Lesson Learned

Massive, multi-stage aggregations are

fast and reliable.

Frequency

24 of 57

Attribution

How much should I pay for ads?

25 of 57

Attribution

[Slide graphic: a user's conversion event ($)]

26 of 57

-- Basic sessionization query, assemble user activity
-- that ended in a conversion into a timeline.
SELECT <...>
FROM impressions i
JOIN conversions c ON
    i.user_id = c.user_id AND
    i.record_date < c.record_date
ORDER BY i.record_date;

Attribution

27 of 57

Attribution

[Slide graphic: a conversion ($) preceded by three impressions, labeled Position 1, 2, 3, counting back from the conversion]

28 of 57

Attribution

[Slide graphic: the same timeline with hour offsets added; the impressions fall 3, 12, and 16 hours before the conversion, at Positions 1, 2, and 3 respectively]

29 of 57

-- Sessionize user activity per conversion, partition by campaign
-- (45-day lookback window)
SELECT c.record_date AS conversion_date,
       c.event_id    AS conversion_id,
       i.campaign_id AS campaign_id,
       i.site_id     AS site_id,
       i.user_id     AS user_id,
       c.revenue     AS conversion_revenue,
       DATEDIFF('hour', i.record_date, c.record_date) AS hour_offset,
       SUM(1) OVER (PARTITION BY i.user_id, i.campaign_id, c.event_id
                    ORDER BY i.record_date DESC
                    ROWS UNBOUNDED PRECEDING) AS position
FROM impressions i
JOIN conversions c ON
    i.user_id = c.user_id AND
    i.campaign_id = c.campaign_id AND
    i.record_date < c.record_date AND
    i.record_date > (c.record_date - interval '45 days') AND
    c.record_date BETWEEN <...>;

Attribution

30 of 57

-- Compute statistics on sessions (funnel placement, last-touch, site-count, etc...)
WITH enriched_sessions AS (
    SELECT campaign_id,
           site_id,
           conversion_date,
           conversion_revenue,
           position,
           COUNT(DISTINCT site_id)
               OVER (PARTITION BY user_id, campaign_id, conversion_id
                     ORDER BY position DESC
                     ROWS UNBOUNDED PRECEDING) AS unique_preceding_site_count
    FROM sessions
)
SELECT campaign_id,
       site_id,
       conversion_date,
       AVG(position) AS average_position,
       SUM(conversion_revenue * (position = 1)::int) AS lta_attributed,
       AVG(unique_preceding_site_count) AS average_unique_preceding_site_count
FROM enriched_sessions
GROUP BY 1, 2, 3;

Attribution

31 of 57

45d window, 45d lookback, 11 stats, all clients =

2 hours x (8 x dw2.8xlarge) =

$76.80

Attribution

32 of 57

Lesson Learned

Window functions are an effective, feature-rich way to sessionize data.

Attribution

33 of 57

Overlap

How do I reach these people for less?

34 of 57

Overlap

          Site A    Site B    Site C
Site A              20%       60%
Site B                        90%
Site C
CPM       $0.06     $1.05     $9.50

35 of 57

Overlap

          Site A    Site B    Site C
Site A              20%       60%
Site B                        90%
Site C
CPM       $0.06     $1.05     $9.50

90% of the people you see on C are also seen on B!

36 of 57

Overlap

          Site A    Site B    Site C
Site A              20%       60%
Site B                        90%
Site C
CPM       $0.06     $1.05     $9.50

B is ⅛ the price of C!

37 of 57

CREATE TABLE overlap_intermediate (
    user_id bigint              NOT NULL DISTKEY,
    site_id bigint ENCODE delta NOT NULL
) SORTKEY (user_id, site_id);

Overlap

38 of 57

WITH co_occurrences AS (
    SELECT oi.site_id  AS site1,
           oi2.site_id AS site2
    FROM overlap_intermediate oi
    JOIN overlap_intermediate oi2 ON
        oi.site_id > oi2.site_id AND
        oi.user_id = oi2.user_id
)
SELECT site1, site2, SUM(1)
FROM co_occurrences
GROUP BY 1, 2;

Overlap

39 of 57

CREATE TABLE overlap_intermediate (
    record_date date   ENCODE lzo NOT NULL,
    campaign_id bigint ENCODE lzo NOT NULL,
    site_id     bigint ENCODE lzo NOT NULL,
    user_id     bigint            NOT NULL DISTKEY
) SORTKEY (record_date, campaign_id, site_id, user_id);

Overlap

40 of 57

WITH site_overlap_intermediate AS (
    SELECT user_id, site_id, campaign_id
    FROM overlap_intermediate
    WHERE record_date BETWEEN <...>
    GROUP BY 1, 2, 3
),
site_co_occurrences AS (
    SELECT oi.campaign_id AS c_id, oi.site_id AS site1, oi2.site_id AS site2
    FROM site_overlap_intermediate oi
    JOIN site_overlap_intermediate oi2 ON
        oi.site_id > oi2.site_id AND
        oi.user_id = oi2.user_id AND
        oi.campaign_id = oi2.campaign_id
)
SELECT c_id, site1, site2, SUM(1)
FROM site_co_occurrences
GROUP BY 1, 2, 3;

Overlap

41 of 57

Overlap

  1. Update intermediate – 1m
  2. Compute 90d co-occurrences – 15m
  3. Aggregate co-occurrences – 50s
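The queries above stop at raw co-occurrence counts; producing the percentage matrix shown earlier also needs each site's unique-user count as a denominator. A sketch, assuming the final aggregate is materialized into a hypothetical co_occurrence_counts table with columns (c_id, site1, site2, users):

-- Hypothetical roll-up from co-occurrence counts to overlap percentages.
WITH site_uniques AS (
    SELECT campaign_id, site_id, COUNT(DISTINCT user_id) AS uniques
    FROM overlap_intermediate
    WHERE record_date BETWEEN <...>
    GROUP BY 1, 2
)
SELECT co.c_id,
       co.site1,
       co.site2,
       100.0 * co.users / su.uniques AS pct_of_site2_also_seen_on_site1
FROM co_occurrence_counts co
JOIN site_uniques su ON
    su.campaign_id = co.c_id AND
    su.site_id = co.site2;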

42 of 57

6 date ranges, 3 groupings, all clients =

2.5 hours x (8 x dw2.8xlarge) =

$96.00

Overlap

43 of 57

Lesson Learned

Correctly sort your table and self-joins take care of themselves.

Overlap

44 of 57

Ad-hoc

Could you run a custom query?

45 of 57

Ad-hoc

No, but you can!

46 of 57

Ad-hoc

What do we send over?

8 fact tables
26 dimension tables
7 mapping tables

47 of 57

Ad-hoc

How do you extract a client’s data?

42 views
121 joins
1100 sloc

48 of 57

$ pg_dump -Fc -f some_file --table=foo --table=bar
$ pg_restore --schema-only --clean -Fc some_file > schema.sql
$ pg_restore --data-only --table=foo -Fc some_file > foo.tsv
$ aws s3 cp schema.sql s3://metadata-bucket/YYYYMMDD/schema.sql
$ aws s3 cp foo.tsv s3://metadata-bucket/YYYYMMDD/foo.tsv

> \i schema.sql
> COPY foo FROM 's3://metadata-bucket/YYYYMMDD/foo.tsv' <...>
# or combine 'COPY <...> FROM <...> SSH' and pg_restore/psql
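The COPY-over-SSH route in the last comment looks roughly like this; the manifest location and delimiter are illustrative assumptions, and the manifest itself is a JSON file in S3 naming the host and the command whose output Redshift ingests:

COPY foo
FROM 's3://metadata-bucket/ssh_manifest'
WITH CREDENTIALS 'aws_access_key_id=<...>;aws_secret_access_key=<...>'
DELIMITER '\t'  -- assumes the remote command emits tab-delimited rows
SSH;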

Ad-hoc

49 of 57

UNLOAD
('
SELECT i.*
FROM impressions i
JOIN client_to_campaign_mapping m ON
    m.campaign_id = i.campaign_id
WHERE i.record_date >= \'{{yyyy}}-{{mm}}-{{dd}}\' - interval \'1 day\' AND
      i.record_date <  \'{{yyyy}}-{{mm}}-{{dd}}\' AND
      m.client_id = <...>
')
TO 's3://{{bucket}}/us_eastern/{{yyyy}}/{{mm}}/{{dd}}/dsdk_events/{{vers}}/impressions/'
WITH CREDENTIALS 'aws_access_key_id={{key}};aws_secret_access_key={{secret}}'
DELIMITER ',' NULL '\\N' ADDQUOTES ESCAPE GZIP MANIFEST;
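On the receiving cluster, a COPY that mirrors the UNLOAD options loads the files back; a sketch, with the manifest path inferred from the prefix above (ADDQUOTES on unload pairs with REMOVEQUOTES on load):

COPY impressions
FROM 's3://{{bucket}}/us_eastern/{{yyyy}}/{{mm}}/{{dd}}/dsdk_events/{{vers}}/impressions/manifest'
WITH CREDENTIALS 'aws_access_key_id={{key}};aws_secret_access_key={{secret}}'
DELIMITER ',' NULL '\\N' REMOVEQUOTES ESCAPE GZIP MANIFEST;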

Ad-hoc

50 of 57

1.5 hours x (8 x dw2.8xlarge) =

$57.60

Ad-hoc

51 of 57

Lesson Learned

If your business logic is already in SQL, keep it in SQL.

Ad-hoc

52 of 57

How this is possible

53 of 57

Frequency + Attribution + Overlap + Ad-hoc =

2.5 + 2 + 2.5 + 1.5 =

8.5 hours execution time

54 of 57

Workload                                     Node Count   Node Type     Restore   Maint.   Exec.
Frequency & Attribution & Overlap & Ad-hoc   16           dw2.8xlarge   2h        1h       6h

= $691.20

Four workloads, One cluster

55 of 57

Workload      Node Count   Node Type     Restore   Maint.   Exec.
Frequency     8            dw2.8xlarge   1.5h      0.5h     2.5h
Attribution   8            dw2.8xlarge   1.5h      0.5h     2h
Overlap       8            dw2.8xlarge   1h        0.5h     2.5h
Ad-hoc        8            dw2.8xlarge   0h        0.5h     1.5h

= $556.80

Four workloads, Four clusters

(-19%)
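The comparison follows from node-hours: four 8-node clusters run 4.5h + 4h + 4h + 2h = 14.5 cluster-hours = 116 node-hours, versus 16 nodes x 9h = 144 node-hours on the single cluster. At a fixed per-node-hour price that is about 19% less, matching $556.80 vs. $691.20.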

56 of 57

Lesson Learned

Orchestration of Redshift clusters is easy.

Don’t scale up, scale out.

57 of 57

ADV403

Please give us your feedback on this presentation

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Join the conversation on Twitter with #reinvent