Data Preparation III: Outliers and Imputation
March 4, 2025
LECTURE 13
Data Preparation II: Granularity (Wrap-Up)
Recap of Hierarchies
Data granularity implies multiple hierarchies
In this context, a hierarchy dictates the transformation into smaller/larger granularities.
Three categories of hierarchies, roughly:
Explicit Hierarchies
Many hierarchical models of the world.
Related encoding: semantic triples, i.e.
RDF triples (Resource Description Framework)
One standard encoding: IsA pairs ("Is A")
(basically child/parent pairs in a tree)
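A minimal Python sketch of IsA pairs as child/parent tuples; the animal taxonomy below is made up for illustration, not course data:

```python
# Encode an IsA hierarchy as (child, parent) pairs and roll a node
# up toward the root by repeatedly following the parent link.
isa_pairs = [
    ("poodle", "dog"),
    ("dog", "mammal"),
    ("mammal", "animal"),
    ("cat", "mammal"),
]

parent = dict(isa_pairs)  # child -> parent lookup

def ancestors(node):
    """Return the chain of ancestors from node up to the root."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

print(ancestors("poodle"))  # ['dog', 'mammal', 'animal']
```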
Everything exists in Space-Time
Time
All data (not just physical phenomena) follow a bitemporal model for data:
Space
All physical phenomena have a geolocation, which can be encoded in many ways:
The most common dimensions of data!
Everything exists in Space-Time
Time
All data (not just physical phenomena) follow a bitemporal model for data.
Hierarchies are common:
Many time subdivisions are periodic: seasons, months, etc.
Space
All physical phenomena have a geolocation, which can be encoded in many ways:
Explicit and often complex hierarchies…
While we generally use general-purpose databases and tools, you may need special systems.
GIS: Geographic Information Systems
Temporal databases
Rollup/Drilldown Constructs
Roll Up / Drill Down
Implementing Explicit Hierarchies
Outlier Detection and Handling
Gaussian Outliers
Trimming and Winsorizing
Robustness
Metrics for Outliers: Hampel X84, Model-based
Data Imputation
Beyond "Fill Down" Imputation
Lecture 13, Data 101, Spring 2025
Changing granularity
"Roll up": Transform to coarser grain (e.g., go up in a hierarchy).
"Drill down": Transform to finer grain (e.g., go down in a hierarchy).
Operators:
Table "public.national"
Column | Type
-----------------+------------------
feature_id | bigint
feature_name | text
feature_class | text
state_alpha | text
state_numeric | bigint
county_name | text
county_numeric | double precision
primary_lat_dms | text
prim_long_dms | text
prim_lat_dec | double precision
prim_long_dec | double precision
source_lat_dms | text
source_long_dms | text
source_lat_dec | double precision
source_long_dec | double precision
elev_in_m | double precision
elev_in_ft | double precision
map_name | text
date_created | text
Let's see an example.
Example
Suppose we start with county-level granularity of elevation.
SELECT state_numeric, county_numeric,
AVG(elev_in_m),
STDDEV(elev_in_m), COUNT(*)
FROM national TABLESAMPLE BERNOULLI(10)
GROUP BY state_numeric, county_numeric;
County "numbers" are reused across states!
Multi-attribute GROUP BY ensures unique counties across the U.S.
How do we roll up to state level?
Example: Roll Up
Suppose we start with county-level granularity of elevation.
Let's roll up to state-level granularity.
SELECT state_numeric, county_numeric,
AVG(elev_in_m),
STDDEV(elev_in_m), COUNT(*)
FROM national TABLESAMPLE BERNOULLI(10)
GROUP BY state_numeric, county_numeric;
SELECT state_numeric,
AVG(elev_in_m),
STDDEV(elev_in_m), COUNT(*)
FROM national TABLESAMPLE Bernoulli(10)
GROUP BY state_numeric;
GROUP BY clause: More or fewer columns than before?
SELECT clause: Can we choose the same columns as before?
Example: Drill Down
Let's roll up to state-level granularity.
From state-level granularity, let's drill down into feature-class granularity as well.
SELECT state_numeric,
AVG(elev_in_m),
STDDEV(elev_in_m), COUNT(*)
FROM national TABLESAMPLE Bernoulli(10)
GROUP BY state_numeric;
SELECT state_numeric, feature_class,
AVG(elev_in_m),
STDDEV(elev_in_m), COUNT(*)
FROM national
GROUP BY state_numeric, feature_class
ORDER BY count(*) DESC;
Drilling down requires having the finer-granularity attribute!
If you don't have it, you have to sample from a model (more later).
Roll-Up and Statistics
Roll up effectively computes the marginal distribution (see DATA 100/140).
SELECT state_numeric,
AVG(elev_in_m),
STDDEV(elev_in_m), COUNT(*),
SUM(COUNT(*)) OVER () AS total,
COUNT(*)/SUM(COUNT(*)) OVER ()
AS marginal
FROM national TABLESAMPLE Bernoulli(.1)
GROUP BY state_numeric;
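The window-function arithmetic can be sketched in plain Python (toy rows, not the real GNIS data): per-group counts divided by the grand total give the marginal distribution over states.

```python
# Toy rows of (state_numeric, elev_in_m); values are invented.
rows = [(6, 100.0), (6, 200.0), (6, 300.0), (41, 50.0)]

# COUNT(*) per group
counts = {}
for state, _elev in rows:
    counts[state] = counts.get(state, 0) + 1

total = sum(counts.values())            # SUM(COUNT(*)) OVER ()
marginal = {s: c / total for s, c in counts.items()}

print(marginal)  # {6: 0.75, 41: 0.25}
```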
Drill-Down and Statistics
If we don't actually have the drill-down data, we could sample from a model.
SELECT state_numeric,
AVG(elev_in_m),
STDDEV(elev_in_m), COUNT(*)
FROM national
GROUP BY state_numeric
ORDER BY count(*) DESC;
Table "public.fips_counties"
Column        | Type
--------------+--------
fips          | bigint
county        | text
state_numeric | bigint
Indexes:
"ix_fips_counties_fips" btree (fips)
What if we wanted to drill down to all counties, including ones that might not be in national?
Possible solution: When drilling down, estimate the elevation from a normal distribution of the state's elevation.
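A sketch of that "sample from a model" idea in Python; the `state_stats` mean and standard deviation below are made-up illustrative values, not fitted from the dataset:

```python
import random

# For a county missing from `national`, draw an elevation from a
# Normal fitted to its state's features (hypothetical parameters).
random.seed(0)

state_stats = {6: (870.0, 760.0)}   # state_numeric -> (mean_m, stddev_m)

def impute_elevation(state_numeric):
    mean, sd = state_stats[state_numeric]
    return random.gauss(mean, sd)

samples = [impute_elevation(6) for _ in range(10_000)]
est_mean = sum(samples) / len(samples)
print(round(est_mean))  # close to the state mean of 870
```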
Rollup/Drilldown & Tensors
Consider a GROUP BY query with 3 grouping columns and 1 aggregate column:
SELECT state_numeric, county_numeric,
feature_class, COUNT(*)
FROM national
GROUP BY state_numeric, county_numeric, feature_class;
This returns a (sparse) 3-dimensional tensor!
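One way to picture this in Python: store only the nonzero cells of the tensor in a dict keyed by the index tuple (toy counts below, not real GNIS values). Rolling up then means summing out an axis.

```python
# Grouped result rows: (state_numeric, county_numeric, feature_class, count)
grouped = [
    (6, 1, "Summit", 12),
    (6, 1, "Stream", 30),
    (6, 3, "Summit", 7),
]

# Sparse 3-D tensor: only nonzero cells are stored.
tensor = {(s, c, f): n for s, c, f, n in grouped}

# Rolling up over feature_class = summing out that axis:
by_county = {}
for (s, c, _f), n in tensor.items():
    by_county[(s, c)] = by_county.get((s, c), 0) + n

print(by_county)  # {(6, 1): 42, (6, 3): 7}
```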
Roll Up, Drill Down, Roll Up, Drill Down…
[On your own]
Demo 1: Roll-Ups and Drill-Downs
In this video game, the character Kirby rolls up (inflates) and drills down (deflates).
SELECT state_numeric, feature_class,
avg(elev_in_m),
stddev(elev_in_m), count(*)
FROM national TABLESAMPLE Bernoulli(10)
GROUP BY state_numeric, feature_class
ORDER BY count(*) DESC;
Demo
Roll Up, Drill Down, Roll Up, Drill Down…
Demo 2: Connections to Statistics
Demo
Implementing Explicit Hierarchies
Do we need to make our own hierarchies? Yes!
The GNIS dataset was great because the hierarchy was embedded into the relational schema. If we don't have this, then we have to assemble an explicit hierarchy.
Take the baseball database, for example. Suppose that in baseball, in a given year, each player belongs to a team, each team to a division, and each division to a league.
So if we want to do roll-ups and drill-downs on this hierarchy, we need player-year data that looks like:
(player_id, team_id, div_id, lg_id, year_id, thing_we_want_to_aggregate)
Let's transform our data!
Demo
Assembling the Explicit Hierarchy
SELECT a.player_id, a.team_id, t.div_id, a.*
FROM appearances a
NATURAL JOIN Teams t
WHERE a.year_id = 2015
LIMIT 100;
CREATE OR REPLACE VIEW bball_tree AS
SELECT DISTINCT a.player_id, a.team_id, t.div_id,
a.lg_id, a.year_id
FROM appearances a NATURAL JOIN teams t;
When making an explicit hierarchy, each record represents a datum in the smallest granularity (player-years in this example).
Demo
Roll-ups & drill-downs on an explicit hierarchy
Recall our home run query:
Set up for roll up/drill down on bball_tree hierarchy.
SELECT name_first, name_last, year_id,
MIN(hr), MAX(hr), AVG(hr), STDDEV(hr), SUM(hr)
FROM batting b, people p
WHERE b.player_id = p.player_id
GROUP BY name_last, name_first, year_id
ORDER BY max DESC
LIMIT 10;
Demo
Let's construct the CTE
baseball=# \d batting
Table "public.batting"
Column | Type |
-----------+-----------------------+
player_id | character varying(10) |
year_id | integer |
stint | integer |
team_id | character varying(3) |
lg_id | character varying(2) |
g | integer |
ab | integer |
r | integer |
h | integer |
h2b | integer |
h3b | integer |
hr | integer |
rbi | integer |
sb | integer |
cs | integer |
bb | integer |
so | integer |
ibb | integer |
hbp | integer |
sh | integer |
sf | integer |
gidp | integer |
Indexes:
"batting_pkey" PRIMARY KEY, btree (player_id, year_id, stint)
View "public.bball_tree"
Column | Type
-----------+-----------------------
player_id | character varying(10)
team_id | character varying(3)
div_id | character varying(1)
lg_id | character varying(2)
year_id | integer
Demo
What pieces of information do we need from the hierarchy?
Roll-ups & drill-downs on an explicit hierarchy
Set up for roll up/drill down on bball_tree hierarchy.
WITH batting_tree AS (
SELECT b.*, t.div_id
FROM batting b, bball_tree t
WHERE b.player_id = t.player_id
AND b.year_id = t.year_id )
SELECT name_first, name_last,
bt.team_id, bt.lg_id, bt.div_id, bt.year_id,
MIN(hr), MAX(hr), AVG(hr), STDDEV(hr), SUM(hr)
FROM batting_tree bt, people p
WHERE bt.player_id = p.player_id
GROUP BY bt.player_id, bt.team_id, bt.lg_id, bt.div_id, bt.year_id, name_last, name_first
ORDER BY max DESC
LIMIT 10;
Demo
Announcements
Data Preparation III
Outlier Detection and Handling
On to More Data Transformations: Principles to Practice
Instead of "data cleaning," think data transformation:
Example: Embrace simple metadata.
Outlier Detection and Handling: Terminology
An outlier is a value "far enough" from "average."
Every strategy of detecting/handling outliers needs two statistical measures on a set of values:
Let's explore the robustness of three outlier-handling strategies:
(formal definition of robustness to follow)
We say masking occurs when an outlier influences the dispersion measure and impacts our ability to detect other outliers.
Common Outlier Handling Strategies
Gaussian Outliers
Gaussian Outliers
Model data as a normal (Gaussian) distribution.
gray: cleaned (truncated)
In practice:
Demo: Assuming Gaussian Data
Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.
%sql DROP TABLE IF EXISTS observations CASCADE;
%sql CREATE TABLE observations AS \
SELECT normal_rand AS x \
FROM normal_rand(1000, 50, 5);
(normal_rand is provided by the PostgreSQL tablefunc extension.)
results = %sql SELECT x FROM observations
sns.displot(results.dict(), fill=True, kde=True, bins=20)
Demo: Assuming Gaussian Data
Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.
Find outliers and store them in a view
normal_outliers.
%%sql
CREATE OR REPLACE VIEW normal_outliers AS
WITH bounds AS (
SELECT avg(x) - 2*stddev(x) AS lo,
avg(x) + 2*stddev(x) AS hi
FROM observations
)
SELECT x AS outlier
FROM observations o, bounds b
WHERE x NOT BETWEEN b.lo AND b.hi;
(bounds has a single record, therefore lo/hi are scalars)
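An assumed Python analogue of the view, applying the same mean ± 2·stddev rule to synthetic Normal data (a sketch, not the demo code itself):

```python
import random
import statistics

# Generate 1000 Normal(50, 5) observations, then flag values outside
# mean +/- 2*stddev as outliers, mirroring the SQL bounds CTE.
random.seed(1)
xs = [random.gauss(50, 5) for _ in range(1000)]

mu = statistics.mean(xs)
sd = statistics.stdev(xs)
lo, hi = mu - 2 * sd, mu + 2 * sd

outliers = [x for x in xs if not (lo <= x <= hi)]
# For truly Normal data, roughly 5% of points fall outside 2 sigma.
print(len(outliers))
```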
[Exercise] Cleaned vs. Original?
Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.
Find outliers and store them in a view normal_outliers.
results = %sql SELECT x, 'original' AS label \
FROM observations \
UNION ALL \
SELECT x, 'cleaned' AS label\
FROM observations \
WHERE x NOT IN \
(SELECT * FROM normal_outliers)
sns.displot(results.dict(), x="x", kind='hist',
hue='label', kde=True, bins=20)
plt.xlim(30, 70)
What does this code do?
[Exercise] Cleaned vs. Original?
Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.
Find outliers and store them in a view normal_outliers.
The cleaned distribution, with fewer extremes, has lower dispersion.
Demo: Breaking the Gaussian Assumption
Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.
Find outliers, store in view normal_outliers.
Construct a cleaned dataset from
observations and plot alongside the original.
What if we corrupt just one value to be very large?
# corrupt one value
%sql UPDATE observations SET x = x*10 \
WHERE x = (SELECT MAX(x) FROM observations);
The 10x value is masking our previous outliers.
One value is dragging the mean and standard deviation super far!
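A Python sketch of this masking effect under the same 2-sigma rule (synthetic data, not the demo's table): one huge corrupted value inflates both the mean and the stddev, so the fence widens and previously flagged outliers slip back inside it.

```python
import random
import statistics

random.seed(2)
xs = [random.gauss(50, 5) for _ in range(1000)]

def two_sigma_outliers(data):
    """Values more than 2 sample stddevs from the sample mean."""
    mu, sd = statistics.mean(data), statistics.stdev(data)
    return [x for x in data if abs(x - mu) > 2 * sd]

before = len(two_sigma_outliers(xs))

xs[xs.index(max(xs))] *= 10        # corrupt the single largest value
after = len(two_sigma_outliers(xs))

print(before, after)  # after is far smaller: the outliers are masked
```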
Trimming and Winsorizing
[Exercise] Understanding Winsorization
Consider this dataset which has mean 101.5 and median 68.0.
{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41}
Compute a 90% winsorization:
A. Mean bigger, median bigger
B. Mean bigger, median smaller
C. Mean smaller, median bigger
D. Mean smaller, median smaller
E. Mean same, median different
F. Mean different, median same
G. Mean same, median same
H. Something else
What are the statistics of the winsorized data, compared to the original?
[Exercise] Understanding Winsorization
Consider this dataset which has mean 101.5 and median 68.0.
{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41}
Compute a 90% winsorization:
Winsorized data:
{92, 19, 101, 58, 101, 91, 26, 78, 10, 13, -5, 101, 86, 85, 15, 89, 89, 28, -5, 41}
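Checking the arithmetic in Python: a 90% winsorization of 20 values replaces the bottom 5% and top 5% (one value at each end) by clamping to the nearest remaining value.

```python
import statistics

data = [92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
        -40, 101, 86, 85, 15, 89, 89, 28, -5, 41]

s = sorted(data)
lo_cut, hi_cut = s[1], s[-2]        # clamp to 2nd-smallest / 2nd-largest
winsorized = [min(max(x, lo_cut), hi_cut) for x in data]

print(statistics.mean(data), statistics.median(data))              # 101.5 68.0
print(statistics.mean(winsorized), statistics.median(winsorized))  # 55.65 68.0
```

So the mean shrinks dramatically while the median is unchanged: winsorizing only touches the tails.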
Demo: Trimming vs. Winsorizing
Even with our giant outlier, we still identify outliers based on the order (percentile) and not mean/SD, so we trim successfully.
Winsorizing preserves the probability mass of the tails (values are clamped, not dropped).
Compare the 5% trimmed mean with the 5% winsorized mean.
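A sketch of the two estimators on toy data with one giant outlier (values invented for illustration): trimming drops the extreme tails, while winsorizing clamps them to the cut points.

```python
import statistics

# 20 values near 50, plus one giant outlier of 500.
data = sorted([46, 47, 47, 48, 48, 49, 49, 49, 50, 50,
               50, 50, 51, 51, 51, 52, 52, 53, 54, 500])
k = int(0.05 * len(data))           # values to cut at each end (here: 1)

# 5% trimmed mean: drop the k smallest and k largest values.
trimmed_mean = statistics.mean(data[k:len(data) - k])

# 5% winsorized mean: clamp to the kth-smallest / kth-largest survivors.
winsorized = [min(max(x, data[k]), data[-k - 1]) for x in data]
winsorized_mean = statistics.mean(winsorized)

plain_mean = statistics.mean(data)
print(plain_mean, trimmed_mean, winsorized_mean)  # only the plain mean is dragged up
```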
Robustness
Robustness and Breakdown Points
Robustness is a worst-case analysis of an estimator (in this case, the statistics used to capture outlier detection).
Example 1: Breakdown point of the mean. A single arbitrarily bad value can drag the mean arbitrarily far, so the breakdown point is 0%.
Robustness and Breakdown Points
Robustness is a worst-case analysis of an estimator (in this case, the statistics used to capture outlier detection).
Example 2: Breakdown point of the 1% trimmed mean. Trimming 1% from each end tolerates corruption of up to 1% of the values, so the breakdown point is 1%.
What k% gives us a maximally robust trimmed mean?
Robustness and Breakdown Points
Robustness is a worst-case analysis of an estimator (in this case, the statistics used to capture outlier detection).
What k% gives us a maximally robust k% trimmed mean?
The median of any distribution is maximally robust; it can handle up to 50% corruption of the data.
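A quick Python illustration of the two breakdown points on synthetic data: one corrupted value moves the mean arbitrarily far, while the median barely moves at all.

```python
import statistics

data = list(range(1, 101))           # 1..100; mean and median are both 50.5

# Corrupt a single value (the largest) to something enormous.
corrupted = data[:-1] + [10**9]

mean_shift = statistics.mean(corrupted) - statistics.mean(data)
median_shift = statistics.median(corrupted) - statistics.median(data)

print(mean_shift, median_shift)  # huge shift vs. no shift at all
```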
Robust Estimators of a Distribution
Recall that we are estimating the center and dispersion of our data.
Dispersion
Maximally robust: Median Absolute Deviation (MAD)
Center
Maximally robust: Median
Robust Estimators of a Distribution
Recall that we are estimating the center and dispersion of our data.
Other common robust measures of center:
Other common robust measures of dispersion:
Related: āTukey numbersā are a five-number summary of univariate numeric data
Center
Maximally robust: Median
Dispersion
Maximally robust: Median Absolute Deviation (MAD)
(see previous slide)
Computing MAD in SQL
(for you to review)
%%sql
-- percentile_disc returns an actual data value near the percentile (discrete)
-- percentile_cont returns an interpolated value at the percentile (continuous)
CREATE OR REPLACE VIEW median AS
(SELECT percentile_disc(0.5)
   WITHIN GROUP (ORDER BY x) AS median
 FROM observations);
%%sql
CREATE OR REPLACE VIEW mad AS
WITH absdevs AS
(SELECT abs(x - median) as d
FROM observations, median)
SELECT percentile_disc(0.5)
  WITHIN GROUP (ORDER BY d) AS mad
FROM absdevs;
SELECT median, mad
FROM median, mad;
where MAD = median(|x_i - median(x)|).
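An assumed Python analogue of the SQL views (toy values, not the observations table); note that `statistics.median` interpolates like `percentile_cont`, while the SQL demo uses `percentile_disc`, which picks an actual data value.

```python
import statistics

# MAD = median of absolute deviations from the median.
xs = [48.0, 50.0, 51.0, 53.0, 49.0, 50.0, 47.0, 500.0]

med = statistics.median(xs)
mad = statistics.median(abs(x - med) for x in xs)

# The single huge value 500.0 barely affects either estimate.
print(med, mad)  # 50.0 1.5
```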
Demo
Metrics for Outliers: Hampel X84, Model-based
Metrics for outliers
How much should we trim or winsorize? You may know a few common metrics:
"2 sigma": Trim Gaussian outliers
Quartiles: Trim bottom 25% and top 25%
Metrics for outliers
Let's translate this normal-estimation notion of "standard unit" into robust estimators:
For a standard Normal (mean 0, stddev 1),
1 stddev ≈ 1.4826 MADs. Try it at home!
This is known as the Hampel X84 robust outlier detection technique.
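Taking up the "try it at home": a quick Monte Carlo check of the 1.4826 constant on standard Normal samples (a sketch, not part of the demo).

```python
import random
import statistics

# For Normal data, stddev ~= 1.4826 * MAD, since the MAD of a
# standard Normal equals its 0.75 quantile (~0.6745) and
# 1 / 0.6745 ~= 1.4826.
random.seed(3)
xs = [random.gauss(0, 1) for _ in range(100_000)]

med = statistics.median(xs)
mad = statistics.median([abs(x - med) for x in xs])

ratio = statistics.stdev(xs) / mad
print(ratio)  # close to 1.4826
```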
More reading: Joe Hellerstein's work
Hellerstein, J.M. (2008) Quantitative Data Cleaning for Large Databases. United Nations Economic Commission for Europe (UNECE). [link]
Joe Hellerstein
UC Berkeley EECS Professor
Original designer of Data 101 (Spring 2021)!
Data engineering involves lots of arbitrary (but reasonable!) metrics. This is one of them!
Let's redo our outliers with Hampel X84
%%sql
CREATE OR REPLACE VIEW hampelx84x2_observations AS (
SELECT o.x, 'hampelx84x2' AS label
FROM observations o, median, mad
WHERE o.x BETWEEN (median-2*1.4826*mad)
AND (median+2*1.4826*mad)
UNION ALL
SELECT o.x, 'orig' AS label
FROM observations o
);
CREATE OR REPLACE VIEW Hampel84x2_outliers AS (
SELECT x
FROM observations o, median, mad
WHERE x NOT BETWEEN (median - 2*1.4826*mad)
AND (median + 2*1.4826*mad)
);
Demo
Let's redo our outliers with Hampel X84
results = %sql SELECT * FROM hampelx84x2_observations
sns.displot(results.dict(), x="x", kind='hist',
hue='label', kde=True, bins=np.linspace(32, 70, 20),
height=4, aspect=1.5, rug=True)
plt.xlim(30, 70)
Demo
Summary: Outliers
Outliers depend on some measure of center and spread.
[Beyond scope] Outlier detection with model fitting
If we fit multi-dimensional data to a model, do we have more metrics for detecting "outliers"?