Data Preparation III:�Granularity (cont.) and Outliers
October 10, 2023
Data 101, Fall 2023 @ UC Berkeley
Lisa Yan https://fa23.data101.org/
1
LECTURE 14
Join at slido.com�#12345690
ⓘ
Click Present with Slido or install our Chrome extension to display joining instructions for participants while presenting.
Roll Up, Drill Down, Roll Up, Drill Down…
[from last time]
Demo 1: Roll-Ups and Drill-Downs
Demo 2: Connections to Statistics
3
In this video game, the character Kirby rolls up (inflates) and drills down (deflates).
2
Demo
#12345690
Implementing Explicit Hierarchies
Implementing Explicit Hierarchies
—
Outlier Detection and Handling
Gaussian Outliers
Trimming and Winsorizing
Robustness
Metrics for Outliers: Hampel X84, Model-based
Data Imputation
Beyond “Fill Down” Imputation
4
[from last time]
Lecture 14, Data 101 Fall 2023
Do we need to make our own hierarchies? Yes, often so!
The GNIS dataset was great because the hierarchy was embedded into the relational schema. If we don’t have this, then we have to assemble an explicit hierarchy.
5
Take the baseball database for example. Suppose in baseball, in a given year (*Note: Lisa does not follow baseball)
If we want to do roll-ups and drill-downs on this hierarchy, we need player-year data that looks like:
(playerid, teamid, divid, leagueid,� yearid,� thing_we_want_to_aggregate)
Let’s transform our data!
Demo
#12345690
Assembling the Explicit Hierarchy
6
Two relations have components of the hierarchy that we want:� Appearances and Teams
Let’s use these relations to generate a hierarchy view, bball_tree.
If we want to do roll-ups and drill-downs on this hierarchy, we need player-year data that looks like:
(playerid, teamid, divid, leagueid,� yearid,� thing_we_want_to_aggregate)
Demo
#12345690
Assembling the Explicit Hierarchy
CREATE OR REPLACE VIEW bball_tree AS
SELECT DISTINCT a.playerid, a.teamid, t.divid,
a.lgid, a.yearid
FROM Appearances a NATURAL JOIN Teams t;
7
Two relations have components of the hierarchy that we want:� Appearances and Teams
Let’s use these relations to generate a hierarchy view, bball_tree.
If we want to do roll-ups and drill-downs on this hierarchy, we need player-year data that looks like:
(playerid, teamid, divid, leagueid,� yearid,� thing_we_want_to_aggregate)
When making an explicit hierarchy, each record represents a datum in the smallest granularity (player-years in this example).
Demo
#12345690
Performing roll-ups and drill-downs on an explicit hierarchies
Recall our homerun query:
Let’s rewrite this query for roll up/drill down�on the bball_tree hierarchy.
SELECT namefirst, namelast, yearid,
MIN(hr), MAX(hr), AVG(hr), STDDEV(hr), SUM(hr)
FROM batting b, people p
WHERE b.playerid = p.playerid
GROUP BY namelast, namefirst, yearid
ORDER BY max DESC
LIMIT 10;
8
Demo
#12345690
1. Construct the roll-up/drill-down CTE
9
Table "public.batting"
Column | Type
----------+----------------------
playerid | character varying(10)
yearid | integer
stint | integer
teamid | character varying(3)
lgid | character varying(2)
g | integer
ab | integer
r | integer
h | integer
h2b | integer
h3b | integer
hr | integer
rbi | integer
sb | integer
cs | integer
bb | integer
so | integer
ibb | integer
hbp | integer
sh | integer
sf | integer
gidp | integer
Indexes:
"batting_pkey" PRIMARY KEY, btree (playerid, yearid, stint)
View "public.bball_tree"
Column | Type
----------+-----------------------
playerid | character varying(10)
teamid | character varying(3)
divid | character varying(1)
lgid | character varying(2)
yearid | integer
🤔
Demo
#12345690
What pieces of information do we need from the hierarchy?
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
1. Construct the roll-up/drill-down CTE
11
Table "public.batting"
Column | Type
----------+----------------------
playerid | character varying(10)
yearid | integer
stint | integer
teamid | character varying(3)
lgid | character varying(2)
g | integer
ab | integer
r | integer
h | integer
h2b | integer
h3b | integer
hr | integer
rbi | integer
sb | integer
cs | integer
bb | integer
so | integer
ibb | integer
hbp | integer
sh | integer
sf | integer
gidp | integer
Indexes:
"batting_pkey" PRIMARY KEY, btree (playerid, yearid, stint)
View "public.bball_tree"
Column | Type
----------+-----------------------
playerid | character varying(10)
teamid | character varying(3)
divid | character varying(1)
lgid | character varying(2)
yearid | integer
WITH batting_tree AS
SELECT b.*, t.divid
FROM batting b, bball_tree t
WHERE b.playerid = t.playerid
AND b.yearid = t.yearid
AND b.teamid = t.teamid� -- to account for trades in-year
SELECT …
Demo
#12345690
2. Write the full query
Rewrite query to roll up/drill down on bball_tree hierarchy.
WITH batting_tree AS
(
SELECT b.*, t.divid
FROM batting b, bball_tree t
WHERE b.playerid = t.playerid
AND b.yearid = t.yearid
AND b.teamid = t.teamid -- trades within year
)
SELECT namefirst, namelast,
bt.teamid, bt.lgid, bt.divid, bt.yearid,
MIN(hr), MAX(hr), AVG(hr), STDDEV(hr), SUM(hr)
FROM batting_tree bt, people p
WHERE bt.playerid = p.playerid
GROUP BY bt.playerid, bt.teamid, bt.lgid, bt.divid, bt.yearid, namelast, namefirst
ORDER BY max DESC
LIMIT 10;
12
Demo
#12345690
Announcements
Professor Yan’s office hours just for today Tuesday 10/10 are 2-3pm around the Hearst Mining Circle outside (not Warren)
PSA: COVID-19 / flu boosters
UC policy compliance needed by 12/1/23
13
Data Preparation III: Outliers
October 9, 2023
Data 101, Fall 2023 @ UC Berkeley
Lisa Yan https://fa23.data101.org/
14
LECTURE 14
Outlier Detection and Handling
Implementing Explicit Hierarchies
—
Outlier Detection and Handling
Gaussian Outliers
Trimming and Winsorizing
Robustness
Metrics for Outliers: Hampel X84, Model-based
Data Imputation
Beyond “Fill Down” Imputation
15
Lecture 14, Data 101 Fall 2023
Onto more Data Transformations: Principles to practice
Instead of “data cleaning,” think data transformation:
16
Example: Embrace simple metadata.
#12345690
Outlier Detection and Handling: Terminology
An outlier is a value “far enough” from “average.”
Every strategy of detecting/handling outliers needs two statistical measures on a set of values:
17
Let’s explore the robustness of three outlier handling strategies:
(formal definition of robustness to follow)
We say masking occurs when an outlier influences the dispersion measure and impacts our ability to detect other outliers.
#12345690
Common Outlier Handling Strategies
18
#12345690
Gaussian Outliers
Implementing Explicit Hierarchies
—
Outlier Detection and Handling
Gaussian Outliers
Trimming and Winsorizing
Robustness
Metrics for Outliers: Hampel X84, Model-based
Data Imputation
Beyond “Fill Down” Imputation
19
Lecture 14, Data 101 Fall 2023
Gaussian Outliers
Model data as a normal (Gaussian) distribution.
20
gray: cleaned (truncated)
In practice:
#12345690
Demo: Assuming Gaussian Data
Suppose our 1000 observed datapoints (observations) were truly Normal�with mean 50, standard deviation 5.
21
%sql DROP TABLE IF EXISTS observations CASCADE;
%sql CREATE TABLE observations AS \
SELECT normal_rand AS x \
FROM normal_rand(1000, 50, 5);
results = %sql SELECT x FROM observations
sns.displot(results.dict(), fill=True, kde=True, bins=20)
#12345690
Demo: Assuming Gaussian Data
Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.
Find outliers and store them in a view normal_outliers.
22
%%sql
CREATE OR REPLACE VIEW normal_outliers AS
WITH bounds AS (
SELECT avg(x) - 2*stddev(x) AS lo,
avg(x) + 2*stddev(x) AS hi
FROM observations
)
SELECT x AS outlier
FROM observations o, bounds b
WHERE x NOT BETWEEN b.lo AND b.hi;
(bounds has a single record, therefore lo/hi are scalars)
#12345690
[Exercise] Cleaned vs. Original?
Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.
Find outliers and store them in a view normal_outliers.
23
results = %sql SELECT x, 'original' AS label \
FROM observations \
UNION ALL \
SELECT x, 'cleaned' AS label\
FROM observations \
WHERE x NOT IN \
(SELECT * FROM normal_outliers)
sns.displot(results.dict(), x="x", kind='hist',
hue='label', kde=True, bins=20)
plt.xlim(30, 70)
What does this code do?
🤔
A.
B.
C.
D.
E. Something else/error
#12345690
What does this code do?
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
[Exercise] Cleaned vs. Original?
Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.
Find outliers and store them in a view normal_outliers.
25
results = %sql SELECT x, 'original' AS label \
FROM observations \
UNION ALL \
SELECT x, 'cleaned' AS label\
FROM observations \
WHERE x NOT IN \
(SELECT * FROM normal_outliers)
sns.displot(results.dict(), x="x", kind='hist',
hue='label', kde=True, bins=20)
plt.xlim(30, 70)
The cleaned distribution, with fewer extremes, has lower dispersion.
#12345690
Demo: Breaking the Gaussian Assumption
Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.
Find outliers and store them in a view normal_outliers.
Construct a cleaned dataset from observations and plot alongside the original.
What if we corrupt just one value to be very large?
26
## corrupt one value
%sql UPDATE observations SET x = x*10 \
WHERE x = (SELECT MAX(x) FROM OBSERVATIONS);
The 10x value is masking our previous outliers.
One value is dragging the mean and standard deviation super far!
#12345690
Trimming and Winsorizing
Implementing Explicit Hierarchies
—
Outlier Detection and Handling
Gaussian Outliers
Trimming and Winsorizing
Robustness
Metrics for Outliers: Hampel X84, Model-based
Data Imputation
Beyond “Fill Down” Imputation
27
Lecture 14, Data 101 Fall 2023
Common Outlier Handling Strategies
28
#12345690
Common Outlier Handling Strategies
29
#12345690
[Exercise] Understanding Winsorization
Consider this dataset which has mean 101.5 and median 68.0.
{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, −40, 101, 86, 85, 15, 89, 89, 28, −5, 41}
Compute a 90% winsorization:
30
🤔
A. Mean bigger, median bigger
B. Mean bigger, median smaller
C. Mean smaller, median bigger
D. Mean smaller, median smaller
E. Mean same, median different
F. Mean different, median same
G. Mean same, median same
H. Something else
#12345690
What are the statistics of the winsorized data, compared to the original?
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
[Exercise] Understanding Winsorization
Consider this dataset which has mean 101.5 and median 68.0.
{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, −40, 101, 86, 85, 15, 89, 89, 28, −5, 41}
Compute a 90% winsorization:
Winsorized data:
{92, 19, 101, 58, 101, 91, 26, 78, 10, 13, −5, 101, 86, 85, 15, 89, 89, 28, −5, 41}
32
#12345690
Demo: Trimming vs. Winsorizing
33
Even with our giant outlier, we still identify outliers based on the order (percentile) and not mean/SD, so we trim successfully.
Winsorizing preserves probability density of the tails
5% trimmed mean,
5% winsorized mean
#12345690
Robustness
Implementing Explicit Hierarchies
—
Outlier Detection and Handling
Gaussian Outliers
Trimming and Winsorizing
Robustness
Metrics for Outliers: Hampel X84, Model-based
Data Imputation
Beyond “Fill Down” Imputation
34
Lecture 14, Data 101 Fall 2023
Robustness and Breakdown Points
Robustness is a worst-case analysis of an estimator (in this case, the statistics used to capture outlier detection).
Ex1 Breakdown point of the mean:
35
#12345690
Robustness and Breakdown Points
Robustness is a worst-case analysis of an estimator (in this case, the statistics used to capture outlier detection).
Ex1 Breakdown point of the mean:
Ex2 Breakdown point of the 1% trimmed mean:
36
What k% gives us a maximally robust trimmed mean?
🤔
#12345690
What k% gives us a maximally robust trimmed mean? Answer as a number with no percent sign.
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
Robustness and Breakdown Points
Robustness is a worst-case analysis of an estimator (in this case, the statistics used to capture outlier detection).
Ex1 Breakdown point of the mean:
Ex2 Breakdown point of the 1% trimmed mean:
What k% gives us a maximally robust k% trimmed mean?
38
The median of any distribution is maximally robust; it can handle up to 50% corruption of the data.
#12345690
Robust Estimators of a Distribution
Center
Maximally robust: Median
Dispersion
Maximally robust: Median Absolute� Deviation (MAD)
Recall that we are estimating the center and dispersion of our data.
39
#12345690
Robust Estimators of a Distribution
Center
Maximally robust: Median
Dispersion
Maximally robust: Median Absolute� Deviation (MAD)
Recall that we are estimating the center and dispersion of our data.
40
Other common robust measures of center:
Other common robust measures of dispersion:
Related: “Tukey numbers” are a five-number summary of univariate numeric data
#12345690
Computing MAD in SQL
(for you to review)
41
%%sql
-- percentile_disc returns an actual data value near the percentile (discrete)
-- percentile_cont returns an interpolated value at the percentile (continuous)
CREATE OR REPLACE VIEW median AS
(SELECT percentile_disc(0.5)
WITHIN GROUP (ORDER BY x) as median� FROM observations);
%%sql
CREATE OR REPLACE VIEW mad AS
WITH absdevs AS
(SELECT abs(x - median) as d
FROM observations, median)
SELECT percentile_disc(0.5)� WITHIN GROUP (ORDER BY d) as mad
FROM absdevs;
SELECT median, mad
FROM median, mad;
where
Demo
#12345690
Metrics for Outliers: Hampel X84, Model-based
Implementing Explicit Hierarchies
—
Outlier Detection and Handling
Gaussian Outliers
Trimming and Winsorizing
Robustness
Metrics for Outliers: Hampel X84, Model-based
Data Imputation
Beyond “Fill Down” Imputation
42
Lecture 14, Data 101 Fall 2023
Metrics for outliers
How much should we trim or winsorize? You may know a few common metrics:
“2 sigma”: Trim Gaussian outliers
Quartiles: Trim bottom 25% and top 25%
43
Engineering consulting hedge fund
#12345690
Metrics for outliers
How much should we trim or winsorize? You may know a few common metrics:
“2 sigma”: Trim Gaussian outliers
Quartiles: Trim bottom 25% and top 25%
44
Let’s translate this normal estimation notion of “standard unit” into robust estimators:
For standard Normal (mean 0, stddev 1),
1 stddev = 1.4826 MADs.�Try it at home!
Engineering consulting hedge fund
This is known as the Hampel X84�robust outlier detection technique.
#12345690
More reading: Joe Hellerstein’s work
Hellerstein, J.M. (2008) Quantitative Data Cleaning for Large Databases. United Nations Economic Commission for Europe (UNECE). [link]
45
Joe Hellerstein
UC Berkeley EECS Professor
Original designer of Data 101 (Spring 2021)!
Data engineering involves lots of arbitrary (but reasonable!) metrics. This is one of them!
#12345690
Let’s redo our outliers with Hampel x84
46
%%sql
CREATE OR REPLACE VIEW hampelx84x2_observations AS
SELECT o.x,
'hampelx84x2' AS label
FROM observations o, median, mad
WHERE o.x BETWEEN (median-2*1.4826*mad) AND (median+2*1.4826*mad)
UNION ALL
SELECT o.x, 'orig' AS label
FROM observations o;
CREATE OR REPLACE VIEW Hampel84x2_outliers AS
SELECT x
FROM observations o, median, mad
WHERE x NOT BETWEEN (median - 2*1.4826*mad) AND (median + 2*1.4826*mad);
Demo
#12345690
Let’s redo our outliers with Hampel x84
47
results = %sql SELECT * FROM hampelx84x2_observations
sns.displot(results.dict(), x="x", kind='hist',
hue='label', kde=True, bins=np.linspace(32, 70, 20),
height=4, aspect=1.5, rug=True)
plt.xlim(30, 70)
Demo
#12345690
Summary: Outliers
Outliers depend on some measure of center and spread.
48
#12345690
[Beyond scope] Outlier detection with model fitting
Outliers depend on some measure of center and spread.
If we fit multi-dimensional data to a model, do we have more metrics for detecting “outliers”?
49
#12345690
Data Imputation
Implementing Explicit Hierarchies
—
Outlier Detection and Handling
Gaussian Outliers
Trimming and Winsorizing
Robustness
Metrics for Outliers: Hampel X84, Model-based
Data Imputation
Beyond “Fill Down” Imputation
50
[for next time]
Lecture 14, Data 101 Fall 2023
Beyond “fill down” imputation
Implementing Explicit Hierarchies
—
Outlier Detection and Handling
Gaussian Outliers
Trimming and Winsorizing
Robustness
Metrics for Outliers: Hampel X84, Model-based
Data Imputation
Beyond “Fill Down” Imputation
51
[for next time]
Lecture 14, Data 101 Fall 2023