Visualization and Sampling
November 14, 2023
Data 101, Fall 2023 @ UC Berkeley
Lisa Yan https://fa23.data101.org/
1
LECTURE 25
Announcements
Final exam details to go out today
2
Join at slido.com�#sampling25
ⓘ
Click Present with Slido or install our Chrome extension to display joining instructions for participants while presenting.
MOLAP, ROLAP, and HOLAP
Review of Last Time + CI
Two Slides on Spark/Algebra
Three Slides on Pub-Sub and Message Queries
Two Slides on Workflow Systems
—
BI: OLAP and OLTP
Data Cubes and OLAP Queries
Star Schema, Snowflake Schema
MOLAP, ROLAP, and HOLAP
[Extra] OLAP in SQL
[Extra] Implicit Dimensions
4
Lecture 25, Data 101 Fall 2023
Types of OLAP databases
Relational OLAP (ROLAP) Data cube is stored as a star (or snowflake) schema in relational DBMS
MOLAP (Multidimensional OLAP) Data cube stored as a multidimensional array storage
5
#sampling25
[Exercise] MOLAP vs. ROLAP
Relational OLAP (ROLAP) Data cube is stored as a star (or snowflake) schema in relational DBMS
MOLAP (Multidimensional OLAP) Data cube stored as a multidimensional array storage
1. (+) Fast response to queries (due to precomputation)
2. (+) Supports generalized SQL tools
3. (+) Supports data that does not fit into the data cube model
4. (–) Expensive ETL time (due to precomputation)
5. (–) No OLAP-specific optimizations (other than what is provided already in SQL)
6. (–) Needs SQL knowledge to specify new materialized views
7. (–) Potential data redundancy
6
A. MOLAP
B. ROLAP
MOLAP or ROLAP: Which OLAP system do each of the following pros/cons refer to?
🤔
#sampling25
MOLAP or ROLAP: Which OLAP system do each of the following pros/cons refer to? Format: 1: MOLAP
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
A third type of OLAP database
Relational OLAP (ROLAP) Data cube is stored as a star (or snowflake) schema in relational DBMS
MOLAP (Multidimensional OLAP) Data cube stored as a multidimensional array storage
Hybrid OLAP (HOLAP) The best of both worlds.
8
#sampling25
So, Why Did We Learn OLAP?
OLAP (OnLine Analytical Processing) is a special BI (Business Intelligence) system that supports analytical processing and report generation queries, typically done in large “batch” operations.
9
[see extra] Conveniences that come in SQL: ROLLUP and CUBE operators
#sampling25
So, Why Did We Learn OLAP?
OLAP (OnLine Analytical Processing) is a special BI (Business Intelligence) system that supports analytical processing and report generation queries, typically done in large “batch” operations.
10
[see extra] Conveniences that come in SQL: ROLLUP and CUBE operators
One step closer to understanding jargon!!
Microsoft Learn / Power Platform / Power BI [link]
Amazon What is OLAP [link]
#sampling25
11
Database Sampling
12
Lecture 25, Data 101 Fall 2023
Why use approximation?
Pros of approximation
Cons
Tip for large data set:
13
beyond, you lose train of thought
#sampling25
[Review] Database Sampling
2. SELECT * FROM Table TABLESAMPLE BERNOULLI(p)
3. SELECT * FROM Table TABLESAMPLE SYSTEM(p)
In data science, you often want to operate on a sample of the data—For example, the original dataset is too large and/or you want to experiment quickly on a representative subset.
Three common techniques for sampling rows from a relation Table:
14
Expensive for huge tables (due to sorting)
Relatively fast, but need to “flip coin” per row
Fastest, but less random (due to page-level sampling)
#sampling25
Database Sampling, Revisited
Database sampling has several drawbacks!
In the data pipeline, when do you sample the database (i.e., table sample)?
In this class, we’ll discuss two flexible ways of sampling that address the above issues.
15
#sampling25
Reservoir Sampling
16
Lecture 25, Data 101 Fall 2023
Reservoir Sampling: The idea
Goal(s): A k-sized simple random sample, i.e., records selected i.i.d. without replacement.� Linear runtime and single pass (i.e., total records n unknown during sampling).
Strategy:
17
Clearly this process is linear in number of rows. But it also produces a simple random sample!!
(proof left as a bonus)
#sampling25
Simple Reservoir Sample
# ported from �# https://en.wikipedia.org/wiki/Reservoir_sampling
from random import randrange
def reservoir_sample(data, n, k):
# fill the reservoir array
r = []
for i in range(k):
r.append(data[i])
# replace elements w/gradually decreasing probability
for i in range(k, n-1):
# generates a uniform integer between 0 and a-1
j = randrange(i+1)
if j < k:
r[j] = data[i]
return r
data = list(range(1000))
n = len(data)
k = 5
r = reservoir_sample(� data, n, k)
r
18
Demo
#sampling25
Simple Reservoir Sample
# ported from �# https://en.wikipedia.org/wiki/Reservoir_sampling
from random import randrange
def reservoir_sample(data, n, k):
# fill the reservoir array
r = []
for i in range(k):
r.append(data[i])
# replace elements w/gradually decreasing probability
for i in range(k, n-1):
# generates a uniform integer between 0 and a-1
j = randrange(i+1)
if j < k:
r[j] = data[i]
return r
data = list(range(1000))
n = len(data)
k = 5
r = reservoir_sample(� data, n, k)
r
19
Demo
#sampling25
Reservoir Sampling and Probabilities
20
Lecture 25, Data 101 Fall 2023
Reservoir Sampling SRS Proof Outline
Goal(s): A k-sized simple random sample, i.e., records selected i.i.d. without replacement.� Linear runtime and single pass (i.e., total records n unknown during sampling).
def reservoir_sample(data, n, k):
r = []
for i in range(k):
r.append(data[i])
for i in range(k, n-1):
j = randrange(i+1)
if j < k:
r[j] = data[i]
return r
21
#sampling25
What is the probability of record i being in S_n, the final reservoir sample?
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
Reservoir Sampling SRS Proof Outline
Goal(s): A k-sized simple random sample, i.e., records selected i.i.d. without replacement.� Linear runtime and single pass (i.e., total records n unknown during sampling).
def reservoir_sample(data, n, k):
r = []
for i in range(k):
r.append(data[i])
for i in range(k, n-1):
j = randrange(i+1)
if j < k:
r[j] = data[i]
return r
23
Define: Reservoir sample S_i of size k records produced after step i of algorithm.
Goal, translated: Every record must have equal probability of being in S_n (sample at step n).
i.e., = ????
🤔
A. 1/n
B. k/n
C. (k-1)/n
D. 1/(n^k)
E.
F. Something else
#sampling25
Reservoir Sampling SRS Proof Outline
Goal(s): A k-sized simple random sample, i.e., records selected i.i.d. without replacement.� Linear runtime and single pass (i.e., total records n unknown during sampling).
def reservoir_sample(data, n, k):
r = []
for i in range(k):
r.append(data[i])
for i in range(k, n-1):
j = randrange(i+1)
if j < k:
r[j] = data[i]
return r
24
Define: Reservoir sample S_i of size k records produced after step i of algorithm.
Goal, translated: Every record must have equal probability of being in S_n (sample at step n).
i.e., = k/n.
*proof by induction*
The gist of the proof
#sampling25
Empirical Proof
import numpy as np
K = 5
N_SAMPLES = 10000
samples = []
N = 1000
data = list(range(N))
for j in range(N_SAMPLES):
samples += reservoir_sample(data, N, k)
#unique, counts = np.unique(samples, return_counts=True)
ax = pd.Series(samples).plot.hist(grid=True, bins=20, rwidth=0.9,
color='#607c8e')
ax.set_title(f"Distribution of frequencies of all {N} values")
25
Demo
#sampling25
[Extra] Reservoir Sampling SRS Proof Outline
26
This proof assumes you understand proof by induction.
See CS70 or DATA 140.
#sampling25
Reservoir Sampling Optimization: Algorithm L
27
Lecture 25, Data 101 Fall 2023
Small optimization to Reservoir: Algorithm L
def reservoir_sample(data, n, k):
r = []
for i in range(k):
r.append(data[i])
for i in range(k, n-1):
j = randrange(i+1)
if j < k:
r[j] = data[i]
return r
28
Calling a random number generator for every record can be slow, particularly when we have trillions of records.
#sampling25
Small optimization to Reservoir: Algorithm L
How can we reduce the number of calls to randrange()?
Let’s check out how many rows we skip between chosen values:
def reservoir_sample(data, n, k):
r = []
for i in range(k):
r.append(data[i])
for i in range(k, n-1):
j = randrange(i+1)
if j < k:
r[j] = data[i]
return r
29
Calling a random number generator for every record can be slow, particularly when we have trillions of records.
Sampling gap distribution for S_1000 from 100k records�(gaps between chosen values 1 to 100k)
#sampling25
The plotting code, if you’re curious
import numpy as np
import pandas as pd
def plot_gaps(r):
r.sort()
gaps = pd.Series(np.diff(r))
gaps.plot.hist(grid=True, bins=20, rwidth=0.9,
color='#607c8e')
data = list(range(100000))
n = len(data)
k = 1000
r = reservoirSample(data, n, k)
plot_gaps(r)
30
Demo
#sampling25
[Extra] Algorithm L Implementation
# This is called Algorithm L, ported from
# https://en.wikipedia.org/wiki/Reservoir_sampling
from random import random, randrange
from math import exp, log, floor
def reservoir_sample_L(data, n, k):
# fill the reservoir array
r = []
for i in range(k):
r.append(data[i])
# generates a uniform [0,1) random number
w = exp(log(random())/k)
while i < n:
i = i + floor(log(random())/log(1-w)) + 1
if i < n:
# replace random reservoir item with item i
r[randrange(k)] = data[i]
w = w * exp(log(random())/k)
return r
31
Algorithm L [wikipedia]
We’ll assume this works as an optimal strategy for reservoir sampling. . For more, see proof on wikipedia.
Demo
#sampling25
Reservoir Sampling in SQL
32
Lecture 25, Data 101 Fall 2023
Reservoir Sampling as a Table-Valued UDF
%%sql
-- create the reservoir_swaps UDF --
DROP TYPE IF EXISTS reservoir_pair CASCADE;
CREATE TYPE reservoir_pair AS (rownum integer, pos integer);
CREATE OR REPLACE FUNCTION reservoir_swaps(k integer, n integer) RETURNS setof reservoir_pair
AS $$
# optimized reservoir sampling algorithm, � # Algorithm L
$$
LANGUAGE 'plpython3u'
VOLATILE
RETURNS NULL ON NULL INPUT;
CREATE OR REPLACE FUNCTION
reservoir_rows(k integer, n integer)
RETURNS setof integer
AS $$
SELECT MAX(rownum) AS rownum
FROM reservoir_swaps(k, n)
GROUP by pos $$
LANGUAGE 'sql'
VOLATILE;
33
We skip the details of this UDF. It returns a table with one attribute reservoir_rows for the k row numbers in S_n.
Demo
#sampling25
Reservoir Sampling in SQL
%%sql
WITH rrows AS (� SELECT reservoir_rows(10, count(*)::integer) � AS rows
FROM batting),
rbatting AS (
SELECT ROW_NUMBER() OVER(), * FROM batting)
SELECT *
FROM rbatting, rrows
WHERE row_number = rows;
34
Demo
#sampling25
Stratified Sampling
35
Lecture 25, Data 101 Fall 2023
What about Stratified Sampling?
SQL folks might call stratified sampling GROUP BY sampling:
PostgreSQL's Bernoulli tablesamples do not support stratification!
However, our reservoir implementation works with GROUP BY!
36
Demo
#sampling25
What about Stratified Sampling?
SQL folks might call stratified sampling GROUP BY sampling:
PostgreSQL's Bernoulli tablesamples do not support stratification!
However, our reservoir implementation works with GROUP BY!
%%sql
-- Stratified Sampling with Reservoirs
WITH grprows AS (
SELECT teamid,
reservoir_rows(10, COUNT(*)::integer) AS rows
FROM batting
GROUP BY teamid),
rbatting AS (
SELECT row_number() over(partition by teamid), *
FROM batting)
SELECT *
FROM rbatting b, grprows g
WHERE row_number = rows
AND b.teamid = g.teamid
ORDER BY b.teamid;
37
Demo
#sampling25
Sampling and Pitfalls
38
Lecture 25, Data 101 Fall 2023
Huge pitfall: Do not join table samples!!!
SELECT SUM(E.salary / D.budget)
FROM employees E
INNER JOIN departments D
…
We cannot estimate this join with a join on samples.
Worse yet: The join of two table samples is often empty.
SELECT SUM(E_s.salary / D_s.budget)
FROM employee_sample E_s
INNER JOIN departments_sample D_s
…
39
NOT EQUIVALENT! Will likely give you garbage!!
❌
#sampling25
Should we ever Materialize our Samples?
Pros of approximation
Cons
Materializing samples is important for fast iteration, e.g., with ML training. But if we don’t periodically resample/refresh, then we’ll pick up a lot of bias!
40
beyond, you lose train of thought
#sampling25
Materialized Sampling Tips
As you may expect in SQL: use CREATE TABLE AS SELECT …
Tips to avoid bias and inaccuracies:
41
#sampling25
Sketches: Materialized Samples, but Better
Sketches are like materialized views that can approximate the answer to a class of queries.�Like Materialized views, they take time to build, and need to be kept fresh.
Sketches typically work as streaming algorithms, which is nice
Many sketches can be computed in parallel!
Sketches are beyond the scope of this course, but do check out the extra notebook.
Two very useful sketches:
42
#sampling25
Last few notes on Samples and SQL
More complex SQL also messes up sampling/estimators
Pragmatic strategy:
Active research: Approximate Query Processing
Tip:
43
#sampling25
[extra] Visualization Tools
44
This is not in scope for the final exam.
Lecture 25, Data 101 Fall 2023
How does this relate to visualizations?
Today we will largely focus on
Then we will discuss Sampling and Sketching.
Most visualizations are group-by queries!
SELECT AGG(M), D�FROM R�WHERE …�GROUP BY D
If you are not using an OLAP system, you will inherently need to internalize best practices with visualizations—both for yourself as well as for other team members.
45
SELECT COUNT(*), Country�FROM R�GROUP BY Country
#sampling25
Visualization Tools
So far, you have likely used many good visualization packages: these help you generate a visualization on your data from within a programming language.
However, there also exist many visual analytics tools: these are tools that provide an interactive environment to explore your data visually without writing code.
46
D3/Vega/Vega-lite Gnuplot
ggplot2 Google Charts
matplotlib, seaborn
plotly
Looker
PowerBI
Spotfire
Tableau
#sampling25
What do Visual Analytics tools support?
So far, you have likely used many good visualization packages: these help you generate a visualization on your data from within a programming language.
However, there also exist many visual analytics tools: these are tools that provide an interactive environment to explore your data visually without writing code.
47
D3/Vega/Vega-lite Gnuplot
ggplot2 Google Charts
matplotlib, seaborn
plotly
Looker
PowerBI
Spotfire
Tableau
We will spend a little bit of time talking about (1) how to systematically find the right visualizations, and (2) the algebra underlying expressivity!
#sampling25
[Data 100, sort of] From Data Types to Visualization types
48
This is not in scope for the final exam.
Lecture 25, Data 101 Fall 2023
Data Types, roughly retranslated from Data 100
Nominal: Labels, no order to data (e.g., names of products)
Ordinal: Named + ordered variables (e.g., Letter grades A, B, C, D, F)
Quantitative Interval: Named + ordered + proportionate interval between variables� (e.g., geolocation). Note: arbitrary zero
Quantitative Ratio: Named + ordered + proportionate interval + absolute zero� (e.g., Kelvin Temperature scale, weight, height). Physical quantities
49
#sampling25
Bar Chart: Quantitative Ratio vs. Another Type
Emphasizes the differences in height than differences in X axis
From a SQL standpoint:
50
#sampling25
Line Chart: Quantitative Ratio vs. Quantitative Ratio, or vs. Quantitative Interval
Mainly makes sense when the X axis is ordered in some way and distances between X axis points matter.
Want to be able to see trends.
From a SQL standpoint,�the query for generation� is similar to bar charts, �grouping by the X-axis.
51
#sampling25
Scatterplot: Quantitative Ratio vs. Quantitative Ratio
No assumption of interpolation (unlike line charts)
Care more about density and understanding of correlation.
From a SQL query standpoint:
Additional aspects (e.g., color) can also be selected if needed!
However, there is a danger of too many rows being returned.
52
#sampling25
Choropleths map a Q-Ratio vs. a two-dimensional Q-Interval variable
From a SQL query standpoint:
53
#sampling25
[Extra]
What type of visualization would you use?
54
#sampling25
Visualization Takeaways
Visualization is an essential means for data exploration — hypothesis generation and confirmation, spotting of outliers and trends, among others.
A lot can be accomplished with a small number of visualization types: often these suffice during data exploration
Visual analytics tools provide interactive visualization capabilities via simple interactions
55
#sampling25
[Extra] Sketches
56
See Extra Jupyter Notebook
This is not in scope for the final exam.
Lecture 25, Data 101 Fall 2023
[Extra] Table Algebra
57
See Extra Jupyter Notebook
This is not in scope for the final exam.
Lecture 25, Data 101 Fall 2023
Table Algebra: Composing Attributes
The Polaris table algebra allows us to compose attributes prior to visualization in more sophisticated ways.
This algebra is the basis behind Tableau, a popular visual analytics platform
The table algebra treats Q-R and Q-I as “Q-R” and N and O into “O”�(imposing an arbitrary order if needed).
58
#sampling25
Table Algebra: Basic Operands
59
#sampling25
Concatenation operator (+)
60
#sampling25
Cross operator (X)
61
#sampling25
And then what?
With this algebra, users can declaratively specify what combination of attributes goes in each of the “shelves” — the X, Y, and Z shelves
62
#sampling25