1 of 52

Data Engineering: OLAP and Summarization

2 of 52

[Diagram: data integration via “ELT”. Raw data from transactions, sensors, log files, and experiments is Extracted and Loaded into the data warehouse, then Transformed there. The warehouse serves as the source of truth: governed, secure, audited, and managed.]

3 of 52

[Diagram: data preparation / integration with a data lake (“ETLT+?”). Raw data (transactions, sensors, log files, experiments) is Extracted and Loaded into a data lake, Transformed into the data warehouse (the governed, secure, audited, managed source of truth), and Transformed further into use-case-specific, fit-for-purpose, self-service data sets.]

4 of 52

Data Warehousing & OLAP

  • A data warehouse is a data system targeted at performing large-scale analytics on “clean” data
    • Usually the result of an ETL-like process
    • These days it might be more like ELTLT*
  • OLAP = OnLine Analytical Processing
    • The type of processing performed in a data warehouse: reading and summarizing large volumes (PBs) of data to understand trends and patterns
      • E.g., total sales of each type of Honda car over time for each county
      • “Read-only” queries
    • Aka decision support or business intelligence (BI)
      • But now BI has diversified (e.g., ML)
  • Contrast to OLTP: OnLine Transaction Processing
    • “Read-write” queries
    • Usually touch a small amount of data
      • e.g., append a new car sale into the sales table
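
To make the contrast concrete, here is a minimal sketch (assuming a simple Sales table like the fact table introduced later in these slides; the values are illustrative): the OLTP query touches one row, while the OLAP query scans and summarizes many.

-- OLTP: a small read-write query that appends a single new car sale (illustrative values).
INSERT INTO Sales (itemid, storeid, customerid, date, number, price)
VALUES (101, 7, 42, DATE '2024-06-01', 1, 28500.00);

-- OLAP: a read-only query that scans and summarizes a large volume of data,
-- e.g., total units sold per item per year.
SELECT itemid, EXTRACT(YEAR FROM date) AS year, SUM(number) AS total_units
FROM Sales
GROUP BY itemid, EXTRACT(YEAR FROM date)
ORDER BY itemid, year;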

5 of 52

Typical Architectures

  • Imagine Honda USA
  • Many sales regions

[Diagram: five regional OLTP systems (SW, NW, MW, NE, S), each providing live/transactional support.]

6 of 52

Typical Architectures

  • Imagine Honda USA
  • Many sales regions

  • OLAP is performed on a separate data warehouse away from the critical path of OLTP.
  • Post-hoc large-scale analysis happens separate from updates

[Diagram: the regional OLTP systems feed a centralized data warehouse, which provides OLAP support separately from the regional live/transactional support.]

7 of 52

Typical Architectures

  • Imagine Honda USA
  • Many sales regions

  • Data warehouse periodically loaded w/ new data
    • E.g., regional sales data gets consolidated at the end of each day

[Diagram: as above, with periodic ETL from the regional OLTP systems into the centralized data warehouse for OLAP support.]

8 of 52

Typical Architectures

  • Consolidation happens through ETL
    • Extract, Transform, Load
    • Extract useful business info to be summarized, transform it (e.g., canonicalize, clean up), load it in the warehouse
    • Ready for analysis!


9 of 52

Typical Architectures

  • Challenge: staleness
  • Still, usually a reasonable tradeoff:
    • Large-scale OLAP (read) queries may delay updates
      • It is more important that sales are not blocked than that a manager’s report is generated promptly
      • The report takes a long time anyway, so it might as well wait a bit longer
    • OK if the analysis results are a bit stale


10 of 52

Schemas in Data Warehouses

  • Usually employs a star (or sometimes snowflake) schema
  • One fact table and many dim. tables
  • Fact tables contain dimension attr. and measure attr.

  • Canonical data warehouse example:
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Dim table: Iteminfo (itemid, itemname, color, size, category)
    • Dim table: Store (storeid, city, state, country)
    • Dim table: Customer (customerid, name, street, city, state)


11 of 52

Star Schema

  • Example Schema
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Dim table: Iteminfo (itemid, itemname, color, size, category)
    • Dim table: Store (storeid, city, state, country)
    • Dim table: Customer (customerid, name, street, city, state)

[Diagram: star schema with the Sales fact table at the center, connected by foreign keys to the Store, Iteminfo, and Customer dimension tables.]
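
As a sketch, this star schema could be declared roughly as follows (table and column names are from the slides; the key and type choices are illustrative assumptions):

-- Types and key constraints below are illustrative choices, not from the slides.
CREATE TABLE Iteminfo (
  itemid   INT PRIMARY KEY,
  itemname TEXT,
  color    TEXT,
  size     TEXT,
  category TEXT
);

CREATE TABLE Store (
  storeid INT PRIMARY KEY,
  city    TEXT,
  state   TEXT,
  country TEXT
);

CREATE TABLE Customer (
  customerid INT PRIMARY KEY,
  name       TEXT,
  street     TEXT,
  city       TEXT,
  state      TEXT
);

-- The fact table references each dimension table via a foreign key.
CREATE TABLE Sales (
  itemid     INT REFERENCES Iteminfo(itemid),
  storeid    INT REFERENCES Store(storeid),
  customerid INT REFERENCES Customer(customerid),
  date       DATE,
  number     INT,            -- measure: units sold
  price      NUMERIC(10,2)   -- measure: sale price
);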

12 of 52

Star to Snowflake

  • Extending the example
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Dim table: Iteminfo (itemid, itemname, color, size, category, manfname)
      • Dim table: Manf (name, address, owner)
    • Dim table: Store (storeid, city, state, country)
    • Dim table: Customer (customerid, name, street, city, state)

[Diagram: snowflake schema. The Sales fact table links to Store, Customer, and Iteminfo; Iteminfo in turn links to the Manf dimension table via a foreign key.]

13 of 52

OLAP Queries

  • Example Schema:
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Dim table: Iteminfo (itemid, itemname, color, size, category)
    • Dim table: Store (storeid, city, state, country)
    • Dim table: Customer (customerid, name, street, city, state)

  • Typical “report” queries GROUP BY some dim. attrs., aggregate some measure attrs.

SELECT category, country, COUNT(number)

FROM Sales NATURAL JOIN Iteminfo NATURAL JOIN Store

GROUP BY category, country


14 of 52

Dates in OLAP

  • Example Schema:
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Dim table: Iteminfo (itemid, itemname, color, size, category)
    • Dim table: Store (storeid, city, state, country)
    • Dim table: Customer (customerid, name, street, city, state)

  • Not very useful to “aggregate” date; better to treat date as an implicit dimension!
    • Allows us to group by and see trends across dates (e.g., sales by year)
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Implicit Dim table: Dateinfo (date, month, quarter, year)


15 of 52

Star Schema

  • Example Schema:
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Dim table: Iteminfo (itemid, itemname, color, size, category)
    • Dim table: Store (storeid, city, state, country)
    • Dim table: Customer (customerid, name, street, city, state)
    • Implicit Dim table: Dateinfo (date, month, quarter, year)

  • What we might want:

SELECT category, country, month, COUNT(number)

FROM Sales NATURAL JOIN Iteminfo NATURAL JOIN Store NATURAL JOIN Dateinfo

GROUP BY category, country, month

  • Actual query might be (in PostgreSQL)

SELECT category, country, date_part('month', date) AS month, COUNT(number)

FROM Sales NATURAL JOIN Iteminfo NATURAL JOIN Store

GROUP BY category, country, month


16 of 52

Introducing Data Cubes

  • For now, will operate on a “denormalized view”, where fact and dim. tables are joined
    • same considerations for star/snowflake schema
  • Consider a simpler inventory relation: (item, color, size, number)

Item    Color   Size    Number
Jacket  Blue    Small        1
Jacket  Red     Medium       1
Jeans   Black   Large        2

17 of 52

Vanilla GROUP BY across all groups

  • Q: If there are n item names, m colors, and k sizes, what is the number of rows?
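
For reference, the “vanilla” GROUP BY over all three dimensions looks like the query below (a sketch; it assumes the inventory relation is stored as a table named Inventory). It produces one row per (item, color, size) combination that appears in the data, so at most n × m × k rows.

-- Assumes the inventory relation is stored as a table named Inventory.
SELECT item, color, size, SUM(number)
FROM Inventory
GROUP BY item, color, size;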

Item    Color   Size    Number
Jacket  Blue    Small       23
Jacket  Blue    Medium      17
Jacket  Blue    Large       34
Jacket  Red     Small       18
Jeans   Blue    Small       14
…       …       …           13

18 of 52

Say I was interested in only the color and item…

  • Might want to see a cross-tabulation of aggs corr. to Item and Color

  • Q: How could you get this via a GROUP BY?

        Blue   Red   Total
Jacket    23    45     234
Jeans     24    28     462
Total     89   132    2384

19 of 52

Say I was interested in only the color and item…

  • Might want to see a cross-tabulation of aggs corr. to Item and Color

  • Q: How could you get this via a GROUP BY?
  • A: group by the (item, color) combinations, by each attribute individually, and compute the overall aggregate (see the query sketch after the table below)

        Blue   Red   Total
Jacket    23    45     234
Jeans     24    28     462
Total     89   132    2384
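
One way to get every cell of the cross-tab from a single query is SQL’s GROUPING SETS, sketched below (it assumes the inventory data lives in a table named Inventory; on systems without GROUPING SETS, a UNION ALL of four separate GROUP BY queries yields the same result):

-- Assumes the inventory relation is stored as a table named Inventory.
SELECT item, color, SUM(number)
FROM Inventory
GROUP BY GROUPING SETS (
  (item, color),  -- the inner cells
  (item),         -- row totals: each item across all colors
  (color),        -- column totals: each color across all items
  ()              -- the grand total
);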

20 of 52

Another way to view this…

  • Crosstab of item and color = vertical plane closest to us

        Blue   Red   Total
Jacket    23    45     234
Jeans     24    28     462
Total     89   132    2384

21 of 52

Another way to view this…

  • This is a data cube
  • Has 3 dimensions (hence the name dimensions for those attributes!)
  • Can be sliced and diced in various ways


22 of 52

Another way to view this…

  • Slicing = adding a condition to one/more of the dimensions
    • e.g., green color
  • Dicing = partitioning of the dimension
    • Here, we partitioned based on distinct values, but can partition in more coarse-grained ways
    • e.g., light colors and dark colors for color
  • Especially useful for the date dimension:
    • Can group by months, days, years, weeks, etc.
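
For instance (a sketch over the inventory data, again assuming a table named Inventory), a slice on one dimension combined with a dice on the remaining ones:

SELECT item, size, SUM(number)
FROM Inventory
WHERE color = 'Green'   -- slice: add a condition on one dimension
GROUP BY item, size;    -- dice: partition the remaining dimensions by their values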


23 of 52

Hierarchies for Setting the Partitioning Granularity

  • The partitioning granularity can be set based on user needs…
  • Different partitioning may be useful for different applications

[Diagram: example attribute hierarchies. The datetime dimension can be partitioned at many granularities — hour, full date, day of week, month / specific month, quarter / specific quarter, year. The location dimension rolls up from city to state to country.]

24 of 52

Hierarchies for Setting the Partitioning Granularity

  • Q: I want the aggregates per month. What if we computed a data cube based on Full Date? Can I avoid recomputing the cube (or cross-tab)? (Yes, for aggregates like SUM and COUNT: month is coarser than full date, so the per-date results can simply be rolled up into per-month results.)
  • Q: What about if we had computed a data cube based on Year? (No: month is finer than year, so monthly aggregates cannot be derived from yearly ones; they must be recomputed from finer-grained data.)


25 of 52

Moves in the Hierarchy and Corresponding Cube

  • Moving from a finer to a coarser granularity is called a rollup
  • Moving from a coarser to a finer granularity is called a drill-down

  • Unit steps from coarse to fine (drill-down):
    • Move down the hierarchy for one of the dimensions, OR
    • Move from
      • the origin to an edge, or
      • an edge to a plane, or
      • a plane to the cube


26 of 52

OLAP in SQL: CUBE

  • “NULL” is used to indicate “ALL”

SELECT item, color, SUM(number)

FROM Sales

GROUP BY CUBE (item, color)

  • Any attr. may be replaced with NULL (i.e., ALL)

  • May result in too many combinations

Item    Color   Number
Jacket  Blue        23
Jacket  Green       34
Jeans   Blue        28
Jeans   Green       17
Jacket  NULL       185
Jeans   NULL       200
NULL    Blue        94
NULL    Red         74
NULL    NULL       984

27 of 52

ROLLUP is an alternative to CUBE

ROLLUP targets a smaller # of combinations

SELECT item, color, size, SUM(number)

FROM Sales

GROUP BY ROLLUP (item), ROLLUP (color, size)

  • Every combination of:
    • specific item or ALL for the first rollup
    • for the second rollup
      • specific color and specific size
      • specific color and ALL sizes
      • ALL colors and ALL sizes
  • Thus, combinations include
    • {(item, color, size), (item, color), (item), (color, size), (color), ()}
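
Equivalently (a sketch, assuming the data is in a table named Sales as in the query above), the same combinations can be spelled out explicitly with GROUPING SETS:

SELECT item, color, size, SUM(number)
FROM Sales
GROUP BY GROUPING SETS (
  (item, color, size), (item, color), (item),
  (color, size), (color), ()
);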


28 of 52

Picking the right CUBE/ROLLUP query

  • First, why does this matter?
    • If this query is being run once on a PB-sized warehouse, it is important to get it right!
    • Results usually materialized and used in dashboards, presentations, spreadsheets, ….

  • Approach:
    • Think about all the ways you want to slice and dice your data
    • Pick granularity to recreate all aggregates you want, without blowing up the query result
      • Result size grows exponentially in the attrs; can be quite bad in large snowflake schemas
      • Known as the curse of dimensionality


29 of 52

How does this relate to visualizations?

  • Most visualizations are group-by queries

SELECT AGG(M), D
FROM R
WHERE …
GROUP BY D

SELECT COUNT(*), Country
FROM R
GROUP BY Country

30 of 52

So, why did we learn OLAP?

  • OLAP is a special class of analytical proc. and report generation queries
    • Typically done in large “batch” ops
    • Rule of thumb: pick query results as “coarse-grained” as possible while still allowing you to construct all necessary cross-tabs
  • Concepts of data cubes, hierarchies, slicing/dicing, and rollup/drill-down are valuable to describe what you’re doing when exploring your data
  • Conveniences that come in SQL: ROLLUP and CUBE operators


31 of 52

Extra Slides on Visualization

32 of 52

Why Visualizations?

  • Analyze
    • Discover trends
      • Stock price is going up/down
    • Develop hypotheses
      • House prices are down due to the downturn
    • Check hypotheses
    • Detect errors
      • Null values in a column

  • Share, record & communicate

  • Our focus will primarily be on basic visualization types, how they relate to the data types, and the data processing aspects: Marti Hearst’s classes are highly recommended for a deep-dive into visualization.


33 of 52

Types of Data: A Visualization-Oriented Perspective

  • Nominal
    • =, ≠
  • Ordinal
    • =, ≠, <, >
  • Quantitative Interval
    • =, ≠, <, >, –
    • Arbitrary zero
  • Quantitative Ratio
    • =, ≠, <, >, –, %
    • Physical quantities

[Examples: Nominal — airlines, genre; Ordinal — film ratings, batteries; Quantitative Interval — year, location; Quantitative Ratio — sales, profit, temperature.]

34 of 52

Types of Data: A Visualization-Oriented Perspective

  • Nominal
    • =, ≠
  • Ordinal
    • =, ≠, <, >
  • Quantitative Interval
    • =, ≠, <, >, –
    • Arbitrary zero
  • Quantitative Ratio
    • =, ≠, <, >, –, %
    • Physical quantities

[Examples shown: hot/cold; good/OK/bad; temperature; score; grade.]

35 of 52

A Very Quick Primer on Visualization Types

  • The most basic visualization is a table!

  • Bar Charts
  • Line Charts
  • Scatter Plot
  • Choropleth


36 of 52

Bar Charts

Q: What SQL query generates a multiple bar chart?

37 of 52

When are bar charts appropriate?

  • When plotting a Q-R vs. either an N, O, Q-I, or Q-R
  • Emphasizes differences in height rather than differences along the X axis
  • Most fundamental chart
  • From a SQL standpoint, simple aggregation of some Y axis measure, grouped by one or more dimensions
    • can generate results in the appropriate order in the X axis by doing an ORDER BY following the GROUP BY
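
As a sketch (reusing the Sales star schema from earlier), a multiple bar chart of units sold per category, with one bar per country within each category, could be generated by:

SELECT category, country, SUM(number) AS units_sold
FROM Sales NATURAL JOIN Iteminfo NATURAL JOIN Store
GROUP BY category, country
ORDER BY category, country;  -- controls the left-to-right order of the bars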


38 of 52

Line Charts


39 of 52

When are line charts appropriate?

  • When plotting a Q-R vs. a Q-I or a Q-R
  • Mainly makes sense when the X axis is ordered in some way and distances between X axis points matter
    • e.g., is the rate of change in this interval the same as in another interval?
  • Want to be able to see “trends”
    • There is an assumption of interpolation between points and dependence of the Y-axis on the X-axis
  • From a SQL standpoint, the query for generation is similar to bar charts, grouping by the X-axis
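
For example (a sketch against the Sales schema from earlier), a yearly revenue trend for a line chart could be computed as:

SELECT EXTRACT(YEAR FROM date) AS year, SUM(number * price) AS revenue
FROM Sales
GROUP BY EXTRACT(YEAR FROM date)
ORDER BY year;  -- the X axis must be ordered for the trend to make sense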


40 of 52

Scatterplots


41 of 52

When are scatterplots appropriate?

  • When plotting a Q-R vs. a Q-R
  • No assumption of interpolation unlike line charts
  • Care more about “density”, understanding of “correlation”
  • From a SQL query standpoint, one way to plot a scatterplot is to simply perform a SELECT X, Y FROM R with no grouping.
    • Additional aspects (e.g., color) can also be selected if needed
  • However, there is a danger of too many rows being returned.
    • Imagine a relation of size 1B: 1B pairs returned
    • A safer option in that case is to bin the scatterplot into grid cells
    • Q: How would we do this?
    • A: CTE to add new “binned” columns corresponding to a CASE statement, followed by a GROUP BY on the new columns
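
A minimal sketch of that approach (the column names x and y and the bin boundaries are illustrative assumptions):

-- The CTE assigns each row of R to a coarse grid cell; the outer query counts rows per cell.
-- Column names x, y and the bin boundaries below are illustrative assumptions.
WITH binned AS (
  SELECT
    CASE WHEN x < 10 THEN '0-10'
         WHEN x < 20 THEN '10-20'
         ELSE '20+' END AS x_bin,
    CASE WHEN y < 100 THEN '0-100'
         WHEN y < 200 THEN '100-200'
         ELSE '200+' END AS y_bin
  FROM R
)
SELECT x_bin, y_bin, COUNT(*) AS num_points
FROM binned
GROUP BY x_bin, y_bin;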


42 of 52

Choropleths


43 of 52

When are choropleths appropriate?

  • Choropleths map a Q-R vs. a two-dimensional Q-I variable

  • From a SQL query standpoint, grouping can be done on a per-geographical region basis followed by overlaying on a map.
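
For instance (a sketch against the Sales schema from earlier), total revenue per state can be computed as below and then joined to state boundaries in the mapping tool:

SELECT state, SUM(number * price) AS revenue
FROM Sales NATURAL JOIN Store
GROUP BY state;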


44 of 52

What type of visualization would you use?

  • A plot of rainfall by location on a map
  • A plot of average age by department
  • A plot of total sales by year
  • A plot of rainfall by average temperature for various locations


45 of 52

Visualization Tools

  • Many good visualization packages: these help you generate a visualization on your data from within a programming language
    • Matplotlib (you’ll try this out in your homework!)
    • D3/Vega/Vega-lite
    • ggplot2
    • Gnuplot
    • Google Charts

  • Plus visual analytics tools: these are tools that provide an interactive environment to explore your data visually without writing code
    • Looker
    • PowerBI
    • Spotfire
    • Tableau


46 of 52

What do Visual Analytics tools support?

  • At a high level, you specify what you want to see on the x axis and the y axis (typically a grouping variable on one and an aggregated measure on the other)
    • And the system automatically picks the “right” visualization

  • You could also encode other additional attributes as “layers” via color, shape — these could be either part of the grouping or the aggregated measure attributes

  • Some tools allow users to specify the mapping between attributes to the axes in a bit more expressive manner; we will spend a little bit of time talking about the algebra underlying this


47 of 52

Composing Attributes

  • The Polaris table algebra allows us to compose attributes prior to visualization in more sophisticated ways.
    • This algebra is the basis behind Tableau, a popular visual analytics platform

  • The table algebra treats Q-R and Q-I alike (both as quantitative) and treats N and O alike (both as “O”, imposing an arbitrary order on nominal fields if needed).


48 of 52

Table Algebra: Basic Operands

  • Ordinal fields: interpret domain as a set that partitions table into rows and columns
    • Quarter = {(Qtr1),(Qtr2),(Qtr3),(Qtr4)}

  • Quantitative fields: treat domain as a single element set and encode spatially as axes
    • Profit = {(Profit[-410,650])}


49 of 52

Concatenation Operator (+)

  • Ordered union of set interpretation
    • Quarter + ProductType
    • = {(Qtr1),(Qtr2),(Qtr3),(Qtr4)} + {(Coffee), (Espresso)}
    • = {(Qtr1),(Qtr2),(Qtr3),(Qtr4),(Coffee),(Espresso)}

  • Profit + Sales = {(Profit[-310,620]),(Sales[0,1000])}


50 of 52

Cross operator (X)

  • Cross-product of set interpretation
  • Quarter x ProductType =
  • {(Qtr1,Coffee), (Qtr1, Tea), (Qtr2, Coffee), (Qtr2, Tea), (Qtr3, Coffee), (Qtr3, Tea), (Qtr4, Coffee), (Qtr4,Tea)}

  • ProductType x Profit =


51 of 52

And then what?

  • With this algebra, users can declaratively specify what combination of attributes goes in each of the “shelves” — the X, Y, and Z shelves
    • And the visualization is automatically generated


52 of 52

Takeaways

  • Visualization is an essential means for data exploration — hypothesis generation and confirmation, spotting of outliers and trends, among others.
  • A lot can be accomplished with a small number of visualization types: often these suffice during data exploration
    • Others may be helpful when presenting results to others
  • Visual analytics tools provide interactive visualization capabilities via simple interactions
    • The Polaris table algebra forms the basis for how the data attributes are mapped to visualization axes and panes (shelves)
