1 of 52

Data Engineering: OLAP and Summarization

2 of 52

[Diagram: data integration via “ELT”. Raw data from transactions, sensors, log files, and experiments is Extracted and Loaded into the data warehouse, then Transformed there. The warehouse serves as the source of truth: governed, secure, audited, and managed.]

3 of 52

[Diagram: data preparation / integration with a data lake (“ETLT+?”). Raw data (transactions, sensors, log files, experiments) is Extracted and Loaded into a data lake, Transformed into the data warehouse (the governed, secure, audited, managed source of truth), and Transformed further into use-case-specific, fit-for-purpose, self-service data sets.]

4 of 52

Data Warehousing & OLAP

  • A data warehouse is a data system targeted at performing large-scale analytics on “clean” data
    • Usually the result of an ETL-like process
    • These days it might be more like ELTLT*
  • OLAP = OnLine Analytical Processing
    • The type of processing performed in a data warehouse: reading and summarizing large volumes (PBs) of data to understand trends and patterns
      • E.g., total sales of each type of Honda car over time for each county
      • “Read-only” queries
    • Aka decision support or business intelligence (BI)
      • But now BI has diversified (e.g., ML)
  • Contrast to OLTP: OnLine Transaction Processing
    • “Read-write” queries
    • Usually touch a small amount of data
      • e.g., append a new car sale into the sales table
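
To make the contrast concrete, here is a minimal sketch (assuming a simple Sales table like the fact table introduced later in these slides; the values are illustrative): the OLTP query touches one row, while the OLAP query scans and summarizes many.

-- OLTP: a small read-write query that appends a single new car sale (illustrative values).
INSERT INTO Sales (itemid, storeid, customerid, date, number, price)
VALUES (101, 7, 42, DATE '2024-06-01', 1, 28500.00);

-- OLAP: a read-only query that scans and summarizes a large volume of data,
-- e.g., total units sold per item per year.
SELECT itemid, EXTRACT(YEAR FROM date) AS year, SUM(number) AS total_units
FROM Sales
GROUP BY itemid, EXTRACT(YEAR FROM date)
ORDER BY itemid, year;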

5 of 52

Typical Architectures

  • Imagine Honda USA
  • Many sales regions

[Diagram: five regional OLTP systems (SW, NW, MW, NE, S), each providing live/transactional support.]

6 of 52

Typical Architectures

  • Imagine Honda USA
  • Many sales regions

  • OLAP is performed on a separate data warehouse away from the critical path of OLTP.
  • Post-hoc large-scale analysis happens separate from updates

[Diagram: the regional OLTP systems feed a centralized data warehouse, which provides OLAP support separately from the regional live/transactional support.]

7 of 52

Typical Architectures

  • Imagine Honda USA
  • Many sales regions

  • Data warehouse periodically loaded w/ new data
    • E.g., regional sales data gets consolidated at the end of each day

[Diagram: as above, with periodic ETL from the regional OLTP systems into the centralized data warehouse for OLAP support.]

8 of 52

Typical Architectures

  • Consolidation happens through ETL
    • Extract, Transform, Load
    • Extract useful business info to be summarized, transform it (e.g., canonicalize, clean up), load it in the warehouse
    • Ready for analysis!


9 of 52

Typical Architectures

  • Challenge: staleness
  • Still, usually a reasonable tradeoff:
    • Large-scale OLAP (read) queries may delay updates
      • It is more important that sales are not blocked than that a manager’s report is generated promptly
      • The report takes a long time anyway, so it might as well wait a bit longer
    • OK if the analysis results are a bit stale


10 of 52

Schemas in Data Warehouses

  • Usually employs a star (or sometimes snowflake) schema
  • One fact table and many dim. tables
  • Fact tables contain dimension attr. and measure attr.

  • Canonical data warehouse example:
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Dim table: Iteminfo (itemid, itemname, color, size, category)
    • Dim table: Store (storeid, city, state, country)
    • Dim table: Customer (customerid, name, street, city, state)


11 of 52

Star Schema

  • Example Schema
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Dim table: Iteminfo (itemid, itemname, color, size, category)
    • Dim table: Store (storeid, city, state, country)
    • Dim table: Customer (customerid, name, street, city, state)

[Diagram: star schema with the Sales fact table at the center, connected by foreign keys to the Store, Iteminfo, and Customer dimension tables.]
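
As a sketch, this star schema could be declared roughly as follows (table and column names are from the slides; the key and type choices are illustrative assumptions):

-- Types and key constraints below are illustrative choices, not from the slides.
CREATE TABLE Iteminfo (
  itemid   INT PRIMARY KEY,
  itemname TEXT,
  color    TEXT,
  size     TEXT,
  category TEXT
);

CREATE TABLE Store (
  storeid INT PRIMARY KEY,
  city    TEXT,
  state   TEXT,
  country TEXT
);

CREATE TABLE Customer (
  customerid INT PRIMARY KEY,
  name       TEXT,
  street     TEXT,
  city       TEXT,
  state      TEXT
);

-- The fact table references each dimension table via a foreign key.
CREATE TABLE Sales (
  itemid     INT REFERENCES Iteminfo(itemid),
  storeid    INT REFERENCES Store(storeid),
  customerid INT REFERENCES Customer(customerid),
  date       DATE,
  number     INT,            -- measure: units sold
  price      NUMERIC(10,2)   -- measure: sale price
);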

12 of 52

Star to Snowflake

  • Extending the example
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Dim table: Iteminfo (itemid, itemname, color, size, category, manfname)
      • Dim table: Manf (name, address, owner)
    • Dim table: Store (storeid, city, state, country)
    • Dim table: Customer (customerid, name, street, city, state)

[Diagram: snowflake schema. The Sales fact table links to Store, Customer, and Iteminfo; Iteminfo in turn links to the Manf dimension table via a foreign key.]

13 of 52

OLAP Queries

  • Example Schema:
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Dim table: Iteminfo (itemid, itemname, color, size, category)
    • Dim table: Store (storeid, city, state, country)
    • Dim table: Customer (customerid, name, street, city, state)

  • Typical “report” queries GROUP BY some dim. attrs., aggregate some measure attrs.

SELECT category, country, COUNT(number)

FROM Sales NATURAL JOIN Iteminfo NATURAL JOIN Store

GROUP BY category, country


14 of 52

Dates in OLAP

  • Example Schema:
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Dim table: Iteminfo (itemid, itemname, color, size, category)
    • Dim table: Store (storeid, city, state, country)
    • Dim table: Customer (customerid, name, street, city, state)

  • Not very useful to “aggregate” date; better to treat date as an implicit dimension!
    • Allows us to group by and see trends across dates (e.g., sales by year)
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Implicit Dim table: Dateinfo (date, month, quarter, year)


15 of 52

Star Schema

  • Example Schema:
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Dim table: Iteminfo (itemid, itemname, color, size, category)
    • Dim table: Store (storeid, city, state, country)
    • Dim table: Customer (customerid, name, street, city, state)
    • Implicit Dim table: Dateinfo (date, month, quarter, year)

  • What we might want:

SELECT category, country, month, COUNT(number)

FROM Sales NATURAL JOIN Iteminfo NATURAL JOIN Store NATURAL JOIN Dateinfo

GROUP BY category, country, month

  • Actual query might be (in PostgreSQL)

SELECT category, country, date_part('month', date) AS month, COUNT(number)

FROM Sales NATURAL JOIN Iteminfo NATURAL JOIN Store

GROUP BY category, country, month


16 of 52

Introducing Data Cubes

  • For now, will operate on a “denormalized view”, where fact and dim. tables are joined
    • same considerations for star/snowflake schema
  • Consider a simpler inventory relation: (item, color, size, number)

Item    Color   Size    Number
Jacket  Blue    Small        1
Jacket  Red     Medium       1
Jeans   Black   Large        2

17 of 52

Vanilla GROUP BY across all groups

  • Q: If there are n item names, m colors, and k sizes, what is the number of rows?
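
For reference, the “vanilla” GROUP BY over all three dimensions looks like the query below (a sketch; it assumes the inventory relation is stored as a table named Inventory). It produces one row per (item, color, size) combination that appears in the data, so at most n × m × k rows.

-- Assumes the inventory relation is stored as a table named Inventory.
SELECT item, color, size, SUM(number)
FROM Inventory
GROUP BY item, color, size;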

Item    Color   Size    Number
Jacket  Blue    Small       23
Jacket  Blue    Medium      17
Jacket  Blue    Large       34
Jacket  Red     Small       18
Jeans   Blue    Small       14
…       …       …           13

18 of 52

Say I was interested in only the color and item…

  • Might want to see a cross-tabulation of aggs corr. to Item and Color

  • Q: How could you get this via a GROUP BY?

        Blue   Red   Total
Jacket    23    45     234
Jeans     24    28     462
Total     89   132    2384

19 of 52

Say I was interested in only the color and item…

  • Might want to see a cross-tabulation of aggs corr. to Item and Color

  • Q: How could you get this via a GROUP BY?
  • A: group by the (item, color) combinations, by each attribute individually, and compute the overall aggregate (see the query sketch after the table below)

        Blue   Red   Total
Jacket    23    45     234
Jeans     24    28     462
Total     89   132    2384
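
One way to get every cell of the cross-tab from a single query is SQL’s GROUPING SETS, sketched below (it assumes the inventory data lives in a table named Inventory; on systems without GROUPING SETS, a UNION ALL of four separate GROUP BY queries yields the same result):

-- Assumes the inventory relation is stored as a table named Inventory.
SELECT item, color, SUM(number)
FROM Inventory
GROUP BY GROUPING SETS (
  (item, color),  -- the inner cells
  (item),         -- row totals: each item across all colors
  (color),        -- column totals: each color across all items
  ()              -- the grand total
);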

20 of 52

Another way to view this…

  • Crosstab of item and color = vertical plane closest to us

        Blue   Red   Total
Jacket    23    45     234
Jeans     24    28     462
Total     89   132    2384

21 of 52

Another way to view this…

  • This is a data cube
  • Has 3 dimensions (hence the name dimensions for those attributes!)
  • Can be sliced and diced in various ways


22 of 52

Another way to view this…

  • Slicing = adding a condition to one/more of the dimensions
    • e.g., green color
  • Dicing = partitioning of the dimension
    • Here, we partitioned based on distinct values, but can partition in more coarse-grained ways
    • e.g., light colors and dark colors for color
  • Especially useful for the date dimension:
    • Can group by months, days, years, weeks, etc.
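
For instance (a sketch over the inventory data, again assuming a table named Inventory), a slice on one dimension combined with a dice on the remaining ones:

SELECT item, size, SUM(number)
FROM Inventory
WHERE color = 'Green'   -- slice: add a condition on one dimension
GROUP BY item, size;    -- dice: partition the remaining dimensions by their values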


23 of 52

Hierarchies for Setting the Partitioning Granularity

  • The partitioning granularity can be set based on user needs…
  • Different partitioning may be useful for different applications

[Diagram: example attribute hierarchies. The datetime dimension can be partitioned at many granularities — hour, full date, day of week, month / specific month, quarter / specific quarter, year. The location dimension rolls up from city to state to country.]

24 of 52

Hierarchies for Setting the Partitioning Granularity

  • Q: I want the aggregates per month. What if we computed a data cube based on Full Date? Can I avoid recomputing the cube (or cross-tab)? (Yes, for aggregates like SUM and COUNT: month is coarser than full date, so the per-date results can simply be rolled up into per-month results.)
  • Q: What about if we had computed a data cube based on Year? (No: month is finer than year, so monthly aggregates cannot be derived from yearly ones; they must be recomputed from finer-grained data.)


25 of 52

Moves in the Hierarchy and Corresponding Cube

  • Moving from a finer to a coarser granularity is called a rollup
  • Moving from a coarser to a finer granularity is called a drill-down

  • Unit steps from coarse to fine (drill-down):
    • Move down the hierarchy for one of the dimensions, OR
    • Move from
      • the origin to an edge, or
      • an edge to a plane, or
      • a plane to the cube


26 of 52

OLAP in SQL: CUBE

  • “NULL” is used to indicate “ALL”

SELECT item, color, SUM(number)

FROM Sales

GROUP BY CUBE (item, color)

  • Any attr. may be replaced with NULL (i.e., ALL)

  • May result in too many combinations

Item    Color   Number
Jacket  Blue        23
Jacket  Green       34
Jeans   Blue        28
Jeans   Green       17
Jacket  NULL       185
Jeans   NULL       200
NULL    Blue        94
NULL    Red         74
NULL    NULL       984

27 of 52

ROLLUP is an alternative to CUBE

ROLLUP targets a smaller # of combinations

SELECT item, color, size, SUM(number)

FROM Sales

GROUP BY ROLLUP (item), ROLLUP (color, size)

  • Every combination of:
    • specific item or ALL for the first rollup
    • for the second rollup
      • specific color and specific size
      • specific color and ALL sizes
      • ALL colors and ALL sizes
  • Thus, combinations include
    • {(item, color, size), (item, color), (item), (color, size), (color), ()}
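
Equivalently (a sketch, assuming the data is in a table named Sales as in the query above), the same combinations can be spelled out explicitly with GROUPING SETS:

SELECT item, color, size, SUM(number)
FROM Sales
GROUP BY GROUPING SETS (
  (item, color, size), (item, color), (item),
  (color, size), (color), ()
);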


28 of 52

Picking the right CUBE/ROLLUP query

  • First, why does this matter?
    • If this query is being run once on a PB-sized warehouse, it is important to get it right!
    • Results usually materialized and used in dashboards, presentations, spreadsheets, ….

  • Approach:
    • Think about all the ways you want to slice and dice your data
    • Pick granularity to recreate all aggregates you want, without blowing up the query result
      • Result size grows exponentially in the attrs; can be quite bad in large snowflake schemas
      • Known as the curse of dimensionality


29 of 52

How does this relate to visualizations?

  • Most visualizations are group-by queries

SELECT AGG(M), D
FROM R
WHERE …
GROUP BY D

SELECT COUNT(*), Country
FROM R
GROUP BY Country

30 of 52

So, why did we learn OLAP?

  • OLAP is a special class of analytical proc. and report generation queries
    • Typically done in large “batch” ops
    • Rule of thumb: pick query results as “coarse-grained” as possible while still allowing you to construct all necessary cross-tabs
  • Concepts of data cubes, hierarchies, slicing/dicing, and rollup/drill-down are valuable to describe what you’re doing when exploring your data
  • Conveniences that come in SQL: ROLLUP and CUBE operators


31 of 52

Extra Slides on Visualization

32 of 52

Why Visualizations?

  • Analyze
    • Discover trends
      • Stock price is going up/down
    • Develop hypotheses
      • House prices are down due to the downturn
    • Check hypotheses
    • Detect errors
      • Null values in a column

  • Share, record & communicate

  • Our focus will primarily be on basic visualization types, how they relate to the data types, and the data processing aspects: Marti Hearst’s classes are highly recommended for a deep-dive into visualization.


33 of 52

Types of Data: A Visualization-Oriented Perspective

  • Nominal
    • =, ≠
  • Ordinal
    • =, ≠, <, >
  • Quantitative Interval
    • =, ≠, <, >, –
    • Arbitrary zero
  • Quantitative Ratio
    • =, ≠, <, >, –, %
    • Physical quantities

[Examples: Nominal — airlines, genre; Ordinal — film ratings, batteries; Quantitative Interval — year, location; Quantitative Ratio — sales, profit, temperature.]

34 of 52

Types of Data: A Visualization-Oriented Perspective

  • Nominal
    • =, ≠
  • Ordinal
    • =, ≠, <, >
  • Quantitative Interval
    • =, ≠, <, >, –
    • Arbitrary zero
  • Quantitative Ratio
    • =, ≠, <, >, –, %
    • Physical quantities

[Examples shown: hot/cold; good/OK/bad; temperature; score; grade.]

35 of 52

A Very Quick Primer on Visualization Types

  • The most basic visualization is a table!

  • Bar Charts
  • Line Charts
  • Scatter Plot
  • Choropleth


36 of 52

Bar Charts

Q: What SQL query generates a multiple bar chart?

37 of 52

When are bar charts appropriate?

  • When plotting a Q-R vs. either an N, O, Q-I, or Q-R
  • Emphasizes differences in height rather than differences along the X axis
  • Most fundamental chart
  • From a SQL standpoint, simple aggregation of some Y axis measure, grouped by one or more dimensions
    • can generate results in the appropriate order in the X axis by doing an ORDER BY following the GROUP BY
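
As a sketch (reusing the Sales star schema from earlier), a multiple bar chart of units sold per category, with one bar per country within each category, could be generated by:

SELECT category, country, SUM(number) AS units_sold
FROM Sales NATURAL JOIN Iteminfo NATURAL JOIN Store
GROUP BY category, country
ORDER BY category, country;  -- controls the left-to-right order of the bars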


38 of 52

Line Charts


39 of 52

When are line charts appropriate?

  • When plotting a Q-R vs. a Q-I or a Q-R
  • Mainly makes sense when the X axis is ordered in some way and distances between X axis points matter
    • e.g., is the rate of change in this interval the same as in another interval?
  • Want to be able to see “trends”
    • There is an assumption of interpolation between points and dependence of the Y-axis on the X-axis
  • From a SQL standpoint, the query for generation is similar to bar charts, grouping by the X-axis
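
For example (a sketch against the Sales schema from earlier), a yearly revenue trend for a line chart could be computed as:

SELECT EXTRACT(YEAR FROM date) AS year, SUM(number * price) AS revenue
FROM Sales
GROUP BY EXTRACT(YEAR FROM date)
ORDER BY year;  -- the X axis must be ordered for the trend to make sense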


40 of 52

Scatterplots


41 of 52

When are scatterplots appropriate?

  • When plotting a Q-R vs. a Q-R
  • No assumption of interpolation unlike line charts
  • Care more about “density”, understanding of “correlation”
  • From a SQL query standpoint, one way to plot a scatterplot is to simply perform a SELECT X, Y FROM R with no grouping.
    • Additional aspects (e.g., color) can also be selected if needed
  • However, there is a danger of too many rows being returned.
    • Imagine a relation of size 1B: 1B pairs returned
    • A safer option in that case is to bin the scatterplot into grid cells
    • Q: How would we do this?
    • A: CTE to add new “binned” columns corresponding to a CASE statement, followed by a GROUP BY on the new columns
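
A minimal sketch of that approach (the column names x and y and the bin boundaries are illustrative assumptions):

-- The CTE assigns each row of R to a coarse grid cell; the outer query counts rows per cell.
-- Column names x, y and the bin boundaries below are illustrative assumptions.
WITH binned AS (
  SELECT
    CASE WHEN x < 10 THEN '0-10'
         WHEN x < 20 THEN '10-20'
         ELSE '20+' END AS x_bin,
    CASE WHEN y < 100 THEN '0-100'
         WHEN y < 200 THEN '100-200'
         ELSE '200+' END AS y_bin
  FROM R
)
SELECT x_bin, y_bin, COUNT(*) AS num_points
FROM binned
GROUP BY x_bin, y_bin;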


42 of 52

Choropleths


43 of 52

When are choropleths appropriate?

  • Choropleths map a Q-R vs. a two-dimensional Q-I variable

  • From a SQL query standpoint, grouping can be done on a per-geographical region basis followed by overlaying on a map.
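
For instance (a sketch against the Sales schema from earlier), total revenue per state can be computed as below and then joined to state boundaries in the mapping tool:

SELECT state, SUM(number * price) AS revenue
FROM Sales NATURAL JOIN Store
GROUP BY state;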


44 of 52

What type of visualization would you use?

  • A plot of rainfall by location on a map
  • A plot of average age by department
  • A plot of total sales by year
  • A plot of rainfall by average temperature for various locations


45 of 52

Visualization Tools

  • Many good visualization packages: these help you generate a visualization on your data from within a programming language
    • Matplotlib (you’ll try this out in your homework!)
    • D3/Vega/Vega-lite
    • ggplot2
    • Gnuplot
    • Google Charts

  • Plus visual analytics tools: these are tools that provide an interactive environment to explore your data visually without writing code
    • Looker
    • PowerBI
    • Spotfire
    • Tableau


46 of 52

What do Visual Analytics tools support?

  • At a high level, you specify what you want to see on the x axis and the y axis (typically a grouping variable on one and an aggregated measure on the other)
    • And the system automatically picks the “right” visualization

  • You could also encode other additional attributes as “layers” via color, shape — these could be either part of the grouping or the aggregated measure attributes

  • Some tools allow users to specify the mapping between attributes to the axes in a bit more expressive manner; we will spend a little bit of time talking about the algebra underlying this


47 of 52

Composing Attributes

  • The Polaris table algebra allows us to compose attributes prior to visualization in more sophisticated ways.
    • This algebra is the basis behind Tableau, a popular visual analytics platform

  • The table algebra treats Q-R and Q-I alike (both as quantitative) and treats N and O alike (both as “O”, imposing an arbitrary order on nominal fields if needed).


48 of 52

Table Algebra: Basic Operands

  • Ordinal fields: interpret domain as a set that partitions table into rows and columns
    • Quarter = {(Qtr1),(Qtr2),(Qtr3),(Qtr4)}

  • Quantitative fields: treat domain as a single element set and encode spatially as axes
    • Profit = {(Profit[-410,650])}


49 of 52

Concatenation Operator (+)

  • Ordered union of set interpretation
    • Quarter + ProductType
    • = {(Qtr1),(Qtr2),(Qtr3),(Qtr4)} + {(Coffee), (Espresso)}
    • = {(Qtr1),(Qtr2),(Qtr3),(Qtr4),(Coffee),(Espresso)}

  • Profit + Sales = {(Profit[-310,620]),(Sales[0,1000])}


50 of 52

Cross operator (X)

  • Cross-product of set interpretation
  • Quarter x ProductType =
  • {(Qtr1,Coffee), (Qtr1, Tea), (Qtr2, Coffee), (Qtr2, Tea), (Qtr3, Coffee), (Qtr3, Tea), (Qtr4, Coffee), (Qtr4,Tea)}

  • ProductType x Profit =


51 of 52

And then what?

  • With this algebra, users can declaratively specify what combination of attributes goes in each of the “shelves” — the X, Y, and Z shelves
    • And the visualization is automatically generated


52 of 52

Takeaways

  • Visualization is an essential means for data exploration — hypothesis generation and confirmation, spotting of outliers and trends, among others.
  • A lot can be accomplished with a small number of visualization types: often these suffice during data exploration
    • Others may be helpful when presenting results to others
  • Visual analytics tools provide interactive visualization capabilities via simple interactions
    • The Polaris table algebra forms the basis for how the data attributes are mapped to visualization axes and panes (shelves)
