1 of 62

Visualization and Sampling

November 14, 2023

Data 101, Fall 2023 @ UC Berkeley

Lisa Yan https://fa23.data101.org/

LECTURE 25

2 of 62

Announcements

Final exam details to go out today

3 of 62

Join at slido.com�#sampling25

ⓘ

Click Present with Slido or install our Chrome extension to display joining instructions for participants while presenting.

4 of 62

MOLAP, ROLAP, and HOLAP

Review of Last Time + CI

Two Slides on Spark/Algebra

Three Slides on Pub-Sub and Message Queries

Two Slides on Workflow Systems

—

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

Lecture 25, Data 101 Fall 2023

5 of 62

Types of OLAP databases

Relational OLAP (ROLAP) Data cube is stored as a star (or snowflake) schema in relational DBMS

Can be implemented in SQL (WHERE, GROUP BY, plus CUBE/ROLL UP (see extra slides))
No pre-computation, other than the materialized views specified by the user

MOLAP (Multidimensional OLAP) Data cube stored as a multidimensional array storage

Optimized multidimensional array (i.e., not dense)
Pre-compute materialized slices, cross-tabs, different cube granularities

#sampling25

6 of 62

[Exercise] MOLAP vs. ROLAP

Relational OLAP (ROLAP) Data cube is stored as a star (or snowflake) schema in relational DBMS

Can be implemented in SQL (WHERE, GROUP BY, plus CUBE/ROLL UP (see extra slides))
No pre-computation, other than the materialized views specified by the user

MOLAP (Multidimensional OLAP) Data cube stored as a multidimensional array storage

Optimized multidimensional array (i.e., not dense)
Pre-compute materialized slices, cross-tabs, different cube granularities

1. (+) Fast response to queries (due to precomputation)

2. (+) Supports generalized SQL tools

3. (+) Supports data that does not fit into the data cube model

4. (–) Expensive ETL time (due to precomputation)

5. (–) No OLAP-specific optimizations (other than what is provided already in SQL)

6. (–) Needs SQL knowledge to specify new materialized views

7. (–) Potential data redundancy

A. MOLAP

B. ROLAP

MOLAP or ROLAP: Which OLAP system do each of the following pros/cons refer to?

🤔

#sampling25

7 of 62

MOLAP or ROLAP: Which OLAP system do each of the following pros/cons refer to? Format: 1: MOLAP

ⓘ

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

8 of 62

A third type of OLAP database

Relational OLAP (ROLAP) Data cube is stored as a star (or snowflake) schema in relational DBMS

Can be implemented in SQL (WHERE, GROUP BY, plus CUBE/ROLL UP (see extra slides))
No pre-computation, other than the materialized views specified by the user

MOLAP (Multidimensional OLAP) Data cube stored as a multidimensional array storage

Optimized multidimensional array (i.e., not dense)
Pre-compute materialized slices, cross-tabs, different cube granularities

Hybrid OLAP (HOLAP) The best of both worlds.

Some data stored in relational storage; other data stored in specialized storage.
The exact proportions are left to the specific OLAP system + the designer.

#sampling25

9 of 62

So, Why Did We Learn OLAP?

OLAP (OnLine Analytical Processing) is a special BI (Business Intelligence) system that supports analytical processing and report generation queries, typically done in large “batch” operations.

(new) data cubes, star/snowflake schema, facts, measures, dimensions
(new) slicing/dicing
(review) hierarchies, roll-up, drill-down, cross-tabs
MOLAP, ROLAP, HOLAP

[see extra] Conveniences that come in SQL: ROLLUP and CUBE operators

#sampling25

10 of 62

So, Why Did We Learn OLAP?

OLAP (OnLine Analytical Processing) is a special BI (Business Intelligence) system that supports analytical processing and report generation queries, typically done in large “batch” operations.

(new) data cubes, star/snowflake schema, facts, measures, dimensions
(new) slicing/dicing
(review) hierarchies, roll-up, drill-down, cross-tabs
MOLAP, ROLAP, HOLAP

[see extra] Conveniences that come in SQL: ROLLUP and CUBE operators

One step closer to understanding jargon!!

Microsoft Learn / Power Platform / Power BI [link]

Amazon What is OLAP [link]

#sampling25

12 of 62

Database Sampling

Lecture 25, Data 101 Fall 2023

13 of 62

Why use approximation?

Pros of approximation

Fast at "query time"
Exploration at human "Speed of thought" (HCI literature on latency Liu, Heer 2014)

Cons

Inaccuracies!

Miss data from low-frequency phenomena (“needles in haystack”)

Accidental Bias
Quantifying Uncertainty

Can be hard to do, statistically, for many queries
Can be hard for humans to interpret even when accurate (compounded with visualization of�approximated quantitative variables)

Sketches (in extra slides) have setup/materialization costs

Tip for large data set:

approximate and do EDA
then scale back up to test!

beyond, you lose train of thought

#sampling25

14 of 62

[Review] Database Sampling

2. SELECT * FROM Table TABLESAMPLE BERNOULLI(p)

Includes each row with probability p% (i.e., p=5 means 5%)

3. SELECT * FROM Table TABLESAMPLE SYSTEM(p)

Includes each memory page with probability p%
(Relations are laid out on pages on disk; pages are selected with some probability, and we include all tuples on a given page)

In data science, you often want to operate on a sample of the data—For example, the original dataset is too large and/or you want to experiment quickly on a representative subset.

Three common techniques for sampling rows from a relation Table:

SELECT * FROM Table ORDER BY RANDOM() LIMIT N;

Reasonable for exploration

Expensive for huge tables (due to sorting)

Relatively fast, but need to “flip coin” per row

Documentation 15

Fastest, but less random (due to page-level sampling)

#sampling25

15 of 62

Database Sampling, Revisited

Database sampling has several drawbacks!

Have to pick the “right” Bernoulli probability to set for queries
No control of output size
Sample as part of the initial scale; can’t sample based on data attributes themselves
Finally, not all data sources support this sort of sampling!

In the data pipeline, when do you sample the database (i.e., table sample)?

(Note: sampling is effectively a form of selection pushdown)
For efficiency: sample as close to scanning the data as possible.
This allows a smaller amount of data to flow through your pipeline, facilitating throughput.

In this class, we’ll discuss two flexible ways of sampling that address the above issues.

Reservoir sampling picks a k-sized simple random sample in a single pass.
Stratified sampling (using reservoir sampling) to get k-sized samples per grouped attribute.

#sampling25

16 of 62

Reservoir Sampling

Lecture 25, Data 101 Fall 2023

17 of 62

Reservoir Sampling: The idea

Goal(s): A k-sized simple random sample, i.e., records selected i.i.d. without replacement.� Linear runtime and single pass (i.e., total records n unknown during sampling).

Strategy:

Build a reservoir, i.e., a fixed-size array that holds k records.
Then scan the table. For record r_i, where i = 1, …, n:

If i < k, reservoir has space. Insert record r_i into reservoir.
Else, insert record r_i by evicting a random reservoir record,�with some probability P_i based on the scan order i.

Clearly this process is linear in number of rows. But it also produces a simple random sample!!

(proof left as a bonus)

#sampling25

18 of 62

Simple Reservoir Sample

# ported from �# https://en.wikipedia.org/wiki/Reservoir_sampling

from random import randrange

def reservoir_sample(data, n, k):

# fill the reservoir array

r = []

for i in range(k):

r.append(data[i])

# replace elements w/gradually decreasing probability

for i in range(k, n-1):

# generates a uniform integer between 0 and a-1

j = randrange(i+1)

if j < k:

r[j] = data[i]

return r

data = list(range(1000))

n = len(data)

k = 5

r = reservoir_sample(� data, n, k)

Demo

#sampling25

19 of 62

Simple Reservoir Sample

# ported from �# https://en.wikipedia.org/wiki/Reservoir_sampling

from random import randrange

def reservoir_sample(data, n, k):

# fill the reservoir array

r = []

for i in range(k):

r.append(data[i])

# replace elements w/gradually decreasing probability

for i in range(k, n-1):

# generates a uniform integer between 0 and a-1

j = randrange(i+1)

if j < k:

r[j] = data[i]

return r

data = list(range(1000))

n = len(data)

k = 5

r = reservoir_sample(� data, n, k)

Single pass!
Constant sample size!
n can be unbounded! (if you fiddle with this function a bit)
Can also parallelize (since you know P_i for every record r_i)

Demo

#sampling25

20 of 62

Reservoir Sampling and Probabilities

Lecture 25, Data 101 Fall 2023

21 of 62

Reservoir Sampling SRS Proof Outline

Goal(s): A k-sized simple random sample, i.e., records selected i.i.d. without replacement.� Linear runtime and single pass (i.e., total records n unknown during sampling).

def reservoir_sample(data, n, k):

r = []

for i in range(k):

r.append(data[i])

for i in range(k, n-1):

j = randrange(i+1)

if j < k:

r[j] = data[i]

return r

#sampling25

22 of 62

What is the probability of record i being in S_n, the final reservoir sample?

ⓘ

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

23 of 62

Reservoir Sampling SRS Proof Outline

Goal(s): A k-sized simple random sample, i.e., records selected i.i.d. without replacement.� Linear runtime and single pass (i.e., total records n unknown during sampling).

def reservoir_sample(data, n, k):

r = []

for i in range(k):

r.append(data[i])

for i in range(k, n-1):

j = randrange(i+1)

if j < k:

r[j] = data[i]

return r

Define: Reservoir sample S_i of size k records produced after step i of algorithm.

Goal, translated: Every record must have equal probability of being in S_n (sample at step n).

i.e., = ????

🤔

A. 1/n

B. k/n

C. (k-1)/n

D. 1/(n^k)

F. Something else

#sampling25

24 of 62

Reservoir Sampling SRS Proof Outline

Goal(s): A k-sized simple random sample, i.e., records selected i.i.d. without replacement.� Linear runtime and single pass (i.e., total records n unknown during sampling).

def reservoir_sample(data, n, k):

r = []

for i in range(k):

r.append(data[i])

for i in range(k, n-1):

j = randrange(i+1)

if j < k:

r[j] = data[i]

return r

Define: Reservoir sample S_i of size k records produced after step i of algorithm.

Goal, translated: Every record must have equal probability of being in S_n (sample at step n).

i.e., = k/n.

*proof by induction*

The gist of the proof

#sampling25

25 of 62

Empirical Proof

import numpy as np

K = 5

N_SAMPLES = 10000

samples = []

N = 1000

data = list(range(N))

for j in range(N_SAMPLES):

samples += reservoir_sample(data, N, k)

#unique, counts = np.unique(samples, return_counts=True)

ax = pd.Series(samples).plot.hist(grid=True, bins=20, rwidth=0.9,

color='#607c8e')

ax.set_title(f"Distribution of frequencies of all {N} values")

Demo

#sampling25

26 of 62

[Extra] Reservoir Sampling SRS Proof Outline

This proof assumes you understand proof by induction.

See CS70 or DATA 140.

#sampling25

27 of 62

Reservoir Sampling Optimization: Algorithm L

Lecture 25, Data 101 Fall 2023

28 of 62

Small optimization to Reservoir: Algorithm L

def reservoir_sample(data, n, k):

r = []

for i in range(k):

r.append(data[i])

for i in range(k, n-1):

j = randrange(i+1)

if j < k:

r[j] = data[i]

return r

Calling a random number generator for every record can be slow, particularly when we have trillions of records.

#sampling25

29 of 62

Small optimization to Reservoir: Algorithm L

How can we reduce the number of calls to randrange()?

Let’s check out how many rows we skip between chosen values:

def reservoir_sample(data, n, k):

r = []

for i in range(k):

r.append(data[i])

for i in range(k, n-1):

j = randrange(i+1)

if j < k:

r[j] = data[i]

return r

approximately geometric, has closed form! (we won’t prove this)
idea: don’t call randrange() on records that will be discarded anyway
Algorithm L approach: pick random gaps from our geometric distribution, then only call randrange on a much smaller set of records.

Calling a random number generator for every record can be slow, particularly when we have trillions of records.

Sampling gap distribution for S_1000 from 100k records�(gaps between chosen values 1 to 100k)

#sampling25

30 of 62

The plotting code, if you’re curious

import numpy as np

import pandas as pd

def plot_gaps(r):

r.sort()

gaps = pd.Series(np.diff(r))

gaps.plot.hist(grid=True, bins=20, rwidth=0.9,

color='#607c8e')

data = list(range(100000))

n = len(data)

k = 1000

r = reservoirSample(data, n, k)

plot_gaps(r)

Demo

#sampling25

31 of 62

[Extra] Algorithm L Implementation

# This is called Algorithm L, ported from

# https://en.wikipedia.org/wiki/Reservoir_sampling

from random import random, randrange

from math import exp, log, floor

def reservoir_sample_L(data, n, k):

# fill the reservoir array

r = []

for i in range(k):

r.append(data[i])

# generates a uniform [0,1) random number

w = exp(log(random())/k)

while i < n:

i = i + floor(log(random())/log(1-w)) + 1

if i < n:

# replace random reservoir item with item i

r[randrange(k)] = data[i]

w = w * exp(log(random())/k)

return r

Algorithm L [wikipedia]

We’ll assume this works as an optimal strategy for reservoir sampling. . For more, see proof on wikipedia.

Demo

#sampling25

32 of 62

Reservoir Sampling in SQL

Lecture 25, Data 101 Fall 2023

33 of 62

Reservoir Sampling as a Table-Valued UDF

%%sql

-- create the reservoir_swaps UDF --

DROP TYPE IF EXISTS reservoir_pair CASCADE;

CREATE TYPE reservoir_pair AS (rownum integer, pos integer);

CREATE OR REPLACE FUNCTION reservoir_swaps(k integer, n integer) RETURNS setof reservoir_pair

AS $$

# optimized reservoir sampling algorithm, � # Algorithm L

LANGUAGE 'plpython3u'

VOLATILE

RETURNS NULL ON NULL INPUT;

CREATE OR REPLACE FUNCTION

reservoir_rows(k integer, n integer)

RETURNS setof integer

AS $$

SELECT MAX(rownum) AS rownum

FROM reservoir_swaps(k, n)

GROUP by pos $$

LANGUAGE 'sql'

VOLATILE;

We skip the details of this UDF. It returns a table with one attribute reservoir_rows for the k row numbers in S_n.

Demo

#sampling25

34 of 62

Reservoir Sampling in SQL

%%sql

WITH rrows AS (� SELECT reservoir_rows(10, count(*)::integer) � AS rows

FROM batting),

rbatting AS (

SELECT ROW_NUMBER() OVER(), * FROM batting)

SELECT *

FROM rbatting, rrows

WHERE row_number = rows;

Demo

#sampling25

35 of 62

Stratified Sampling

Lecture 25, Data 101 Fall 2023

36 of 62

What about Stratified Sampling?

SQL folks might call stratified sampling GROUP BY sampling:

i.e. I want a k-sized sample per GROUP
the "GROUP BY" columns are called subpopulations

PostgreSQL's Bernoulli tablesamples do not support stratification!

Because the sample happens on the initial scan.

However, our reservoir implementation works with GROUP BY!

Demo

#sampling25

37 of 62

What about Stratified Sampling?

SQL folks might call stratified sampling GROUP BY sampling:

i.e. I want a k-sized sample per GROUP
the "GROUP BY" columns are called subpopulations

PostgreSQL's Bernoulli tablesamples do not support stratification!

Because the sample happens on the initial scan.

However, our reservoir implementation works with GROUP BY!

%%sql

-- Stratified Sampling with Reservoirs

WITH grprows AS (

SELECT teamid,

reservoir_rows(10, COUNT(*)::integer) AS rows

FROM batting

GROUP BY teamid),

rbatting AS (

SELECT row_number() over(partition by teamid), *

FROM batting)

SELECT *

FROM rbatting b, grprows g

WHERE row_number = rows

AND b.teamid = g.teamid

ORDER BY b.teamid;

Demo

#sampling25

38 of 62

Sampling and Pitfalls

Lecture 25, Data 101 Fall 2023

39 of 62

Huge pitfall: Do not join table samples!!!

SELECT SUM(E.salary / D.budget)

FROM employees E

INNER JOIN departments D

…

We cannot estimate this join with a join on samples.

They are unequal! Not statistically robust! (proof left to you)

Worse yet: The join of two table samples is often empty.

Consider: sampled employee and sampled departments. What are the odds of an employee’s department being in the sample? Likely zero if you have many in both table!

SELECT SUM(E_s.salary / D_s.budget)

FROM employee_sample E_s

INNER JOIN departments_sample D_s

…

NOT EQUIVALENT! Will likely give you garbage!!

❌

#sampling25

40 of 62

Should we ever Materialize our Samples?

Pros of approximation

Fast at "query time"
Exploration at human "Speed of thought" (HCI literature on latency Liu, Heer 2014)

Cons

Inaccuracies!

Miss data from low-frequency phenomena (“needles in haystack”)

Accidental Bias
Quantifying Uncertainty

Can be hard to do, statistically, for many queries
Can be hard for humans to interpret even when accurate (compounded with visualization of�approximated quantitative variables)

Sketches have setup/materialization costs (more later)

Materializing samples is important for fast iteration, e.g., with ML training. But if we don’t periodically resample/refresh, then we’ll pick up a lot of bias!

beyond, you lose train of thought

#sampling25

41 of 62

Materialized Sampling Tips

As you may expect in SQL: use CREATE TABLE AS SELECT …

Then, reuse this sample at will, via desktop, Pandas, etc.
But again, there are many concerns with materialized samples.

Tips to avoid bias and inaccuracies:

Update “population” estimators to sample estimators

e.g., scale COUNT or SUM by sampling rate
(Note: “population” here is the full database table, which is itself could also be a sample)

Include and advertise sample estimator uncertainty

e.g., confidence intervals, bootstrapping, etc.
pitfalls: C.I.’s are not always particularly reliable with query structures, esp. w/JOIN, etc.

Avoid bias, particularly if you reuse the sample for many different estimators!

Resample from the sample
Or, periodically refresh the sample from the database

But above all: Validate on the full dataset!

After exploratory work, run the pipeline on the full data!
Do your “desktop” work in a scalable language so that whatever pipeline�you explore works directly on the cloud.

#sampling25

42 of 62

Sketches: Materialized Samples, but Better

Sketches are like materialized views that can approximate the answer to a class of queries.�Like Materialized views, they take time to build, and need to be kept fresh.

But they're typically tiny and insensitive to input size, which is very cool
Can pass them around for all kinds of tricks: e.g. ship to apps in browser or phone, etc.

Sketches typically work as streaming algorithms, which is nice

Most support incremental additions
Some support deletions

Many sketches can be computed in parallel!

Sketches are beyond the scope of this course, but do check out the extra notebook.

Two very useful sketches:

CountMin sketches

SELECT COUNT(*) ... WHERE col = k
WHERE col BETWEEN k AND l

HyperLogLog sketches

SELECT COUNT(DISTINCT col)

Professor Jelani Nelson

Theory Professor, EECS

https://people.eecs.berkeley.edu/~minilek/

#sampling25

43 of 62

Last few notes on Samples and SQL

More complex SQL also messes up sampling/estimators

joins (as we talked about)
Window queries, subqueries, etc.

Pragmatic strategy:

Play with samples of query outputs as CTEs/Views/CTAS
*Don't* do complicated SQL with the sample
No join, window, subquery, etc.
Instead do the complicated SQL on the full tables
Go to (1)

Active research: Approximate Query Processing

not common in shipping database systems. :-)

Tip:

Run the full query of what you want in the cloud
Take a reservoir sample of that
THEN run ML in your desktop Jupyter Notebook, etc.
Return to the cloud when you need to rerun complex queries to get addtl data.

#sampling25

44 of 62

[extra] Visualization Tools

This is not in scope for the final exam.

Lecture 25, Data 101 Fall 2023

45 of 62

How does this relate to visualizations?

Today we will largely focus on

Data 100 visualization review
Visualization Tools

Then we will discuss Sampling and Sketching.

Most visualizations are group-by queries!

SELECT AGG(M), D�FROM R�WHERE …�GROUP BY D

If you are not using an OLAP system, you will inherently need to internalize best practices with visualizations—both for yourself as well as for other team members.

Discover trends (stock prices up/down)
Develop hypotheses (economic downturn)
Check hypotheses
Detect errors, e.g., Null values in a column

SELECT COUNT(*), Country�FROM R�GROUP BY Country

#sampling25

46 of 62

Visualization Tools

So far, you have likely used many good visualization packages: these help you generate a visualization on your data from within a programming language.

However, there also exist many visual analytics tools: these are tools that provide an interactive environment to explore your data visually without writing code.

D3/Vega/Vega-lite Gnuplot

ggplot2 Google Charts

matplotlib, seaborn

plotly

Looker

PowerBI

Spotfire

Tableau

#sampling25

47 of 62

What do Visual Analytics tools support?

So far, you have likely used many good visualization packages: these help you generate a visualization on your data from within a programming language.

However, there also exist many visual analytics tools: these are tools that provide an interactive environment to explore your data visually without writing code.

D3/Vega/Vega-lite Gnuplot

ggplot2 Google Charts

matplotlib, seaborn

plotly

Looker

PowerBI

Spotfire

Tableau

At a high level, you specify what you want to see in the x axis (the aggregated measure) and the y axis (the grouping variable), and the system automatically picks the “right” visualization.
You can also encode additional attributes as “layers” via color, shape — these could be either part of the grouping or the aggregated measure attributes.
Some tools allow users to specify the mapping between attributes to the axes in a bit more expressive manner.

We will spend a little bit of time talking about (1) how to systematically find the right visualizations, and (2) the algebra underlying expressivity!

#sampling25

48 of 62

[Data 100, sort of] From Data Types to Visualization types

This is not in scope for the final exam.

Lecture 25, Data 101 Fall 2023

49 of 62

Data Types, roughly retranslated from Data 100

Nominal: Labels, no order to data (e.g., names of products)

Ordinal: Named + ordered variables (e.g., Letter grades A, B, C, D, F)

Quantitative Interval: Named + ordered + proportionate interval between variables� (e.g., geolocation). Note: arbitrary zero

Quantitative Ratio: Named + ordered + proportionate interval + absolute zero� (e.g., Kelvin Temperature scale, weight, height). Physical quantities

#sampling25

50 of 62

Bar Chart: Quantitative Ratio vs. Another Type

Emphasizes the differences in height than differences in X axis

From a SQL standpoint:

Aggregate of some Y axis measure
Grouped by one or more dimensions
Generate results in the appropriate order in the X axis by doing GROUP BY, then ORDER BY

#sampling25

51 of 62

Line Chart: Quantitative Ratio vs. Quantitative Ratio, or vs. Quantitative Interval

Mainly makes sense when the X axis is ordered in some way and distances between X axis points matter.

e.g., is the rate of change in this interval the same as the other interval

Want to be able to see trends.

There is an assumption of interpolation between points and�dependence of the Y-axis on the X-axis

From a SQL standpoint,�the query for generation� is similar to bar charts, �grouping by the X-axis.

#sampling25

52 of 62

Scatterplot: Quantitative Ratio vs. Quantitative Ratio

No assumption of interpolation (unlike line charts)

Care more about density and understanding of correlation.

From a SQL query standpoint:

perform a SELECT X, Y FROM R with no grouping.

Additional aspects (e.g., color) can also be selected if needed!

However, there is a danger of too many rows being returned.

Imagine a relation of size 1B: 1B pairs returned…
Safer option: bin the scatterplot into grid cells

CTE to add new “binned” columns corresponding to a CASE statement
followed by a GROUP BY on the new columns

#sampling25

53 of 62

Choropleths map a Q-Ratio vs. a two-dimensional Q-Interval variable

From a SQL query standpoint:

Grouping can be done on a per-geographical region basis
Then overlap on a map.

#sampling25

54 of 62

[Extra]

What type of visualization would you use?

A plot of rainfall by location on a map
A plot of average age by department
A plot of total sales by year
A plot of rainfall by average temperature for various locations

#sampling25

55 of 62

Visualization Takeaways

Visualization is an essential means for data exploration — hypothesis generation and confirmation, spotting of outliers and trends, among others.

A lot can be accomplished with a small number of visualization types: often these suffice during data exploration

Others may be helpful when presenting results to others

Visual analytics tools provide interactive visualization capabilities via simple interactions

[extra] The Polaris table algebra forms the basis for how the data attributes are mapped to visualization axes and panes (shelves)

#sampling25

56 of 62

[Extra] Sketches

See Extra Jupyter Notebook

This is not in scope for the final exam.

Lecture 25, Data 101 Fall 2023

57 of 62

[Extra] Table Algebra

See Extra Jupyter Notebook

This is not in scope for the final exam.

Lecture 25, Data 101 Fall 2023

58 of 62

Table Algebra: Composing Attributes

The Polaris table algebra allows us to compose attributes prior to visualization in more sophisticated ways.

This algebra is the basis behind Tableau, a popular visual analytics platform

The table algebra treats Q-R and Q-I as “Q-R” and N and O into “O”�(imposing an arbitrary order if needed).

#sampling25

59 of 62

Table Algebra: Basic Operands

Ordinal fields: interpret domain as a set that partitions table into rows and columns

Quarter = {(Qtr1),(Qtr2),(Qtr3),(Qtr4)}

Quantitative fields: treat domain as a single element set and encode spatially as axes

Profit = {(Profit[-410,650])}

#sampling25

60 of 62

Concatenation operator (+)

Ordered union of set interpretation

Quarter + ProductType
= {(Qtr1),(Qtr2),(Qtr3),(Qtr4)} + {(Coffee), (Espresso)}
= {(Qtr1),(Qtr2),(Qtr3),(Qtr4),(Coffee),(Espresso)}

Profit + Sales = {(Profit[-310,620]),(Sales[0,1000])}

#sampling25

61 of 62

Cross operator (X)

Cross-product of set interpretation
Quarter x ProductType =
{(Qtr1,Coffee), (Qtr1, Tea), (Qtr2, Coffee), (Qtr2, Tea), (Qtr3, Coffee), (Qtr3, Tea), (Qtr4, Coffee), (Qtr4,Tea)}

ProductType x Profit =

#sampling25

62 of 62

And then what?

With this algebra, users can declaratively specify what combination of attributes goes in each of the “shelves” — the X, Y, and Z shelves

And the visualization is automatically generated.

#sampling25

1 of 62

2 of 62

3 of 62

4 of 62

5 of 62

6 of 62

7 of 62

8 of 62

9 of 62

10 of 62

11 of 62

12 of 62

13 of 62

14 of 62

15 of 62

16 of 62

17 of 62

18 of 62

19 of 62

20 of 62

21 of 62

22 of 62

23 of 62

24 of 62

25 of 62

26 of 62

27 of 62

28 of 62

29 of 62

30 of 62

31 of 62

32 of 62

33 of 62

34 of 62

35 of 62

36 of 62

37 of 62

38 of 62

39 of 62

40 of 62

41 of 62

42 of 62

43 of 62

44 of 62

45 of 62

46 of 62

47 of 62

48 of 62

49 of 62

50 of 62

51 of 62

52 of 62

53 of 62

54 of 62

55 of 62

56 of 62

57 of 62

58 of 62

59 of 62

60 of 62

61 of 62

62 of 62