Data Preparation III: Outliers and Imputation
Data 101, Fall 2025 @ UC Berkeley
Aditya Parameswaran
Outline
Outlier Detection and Handling
  Gaussian Outliers
  Trimming and Winsorizing
  Robustness
  Metrics for Outliers: Hampel X84
  [At home] Code for Demos
Data Imputation
  Linear Interpolation Example
  [At home] “Fill Down” Imputation
On to more data transformations: Principles to practice
Instead of “data cleaning,” think data transformation.
Example: Embrace simple metadata.
Outlier Detection and Handling: Terminology
An outlier is a value “far enough” from “average.”
Every strategy for handling outliers needs two statistical measures on a set of values: a measure of center and a measure of dispersion (spread).
Let’s explore the robustness of four outlier handling strategies (formal definition of robustness to follow).
We say masking occurs when an outlier influences the dispersion measure and impacts our ability to detect other outliers.
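To make masking concrete, here is a minimal Python sketch on toy data (not from the demo notebook): a single extreme value inflates the standard deviation so much that a milder outlier is no longer flagged by a mean ± 2σ rule.

```python
import statistics

def gaussian_outliers(xs, k=2):
    """Flag values more than k standard deviations from the mean."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [x for x in xs if abs(x - mu) > k * sigma]

data = [50, 51, 49, 52, 48, 50, 51, 49, 70]   # 70 is a mild outlier
corrupted = data + [10_000]                   # add one extreme value

# Alone, 70 stands out; with the extreme value present, the inflated
# stddev "masks" it, and only 10_000 gets flagged.
print(gaussian_outliers(data))       # [70]
print(gaussian_outliers(corrupted))  # [10000]
```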
Common Outlier Handling Strategies
Gaussian Outliers
Model data as a normal (Gaussian) distribution.
[Figure: Normal curve; gray = cleaned (truncated) region]
In practice: flag values more than k standard deviations from the mean (the demo uses k = 2).
Demo: Assuming Gaussian Data
Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.
%sql DROP TABLE IF EXISTS observations CASCADE;
%sql CREATE TABLE observations AS \
     SELECT normal_rand AS x \
     FROM normal_rand(1000, 50, 5);

results = %sql SELECT x FROM observations
sns.displot(results.dict(), x="x", kind='hist', kde=True, bins=20)
Demo: Assuming Gaussian Data
Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.
Find outliers and store them in a view
normal_outliers.
%%sql
CREATE OR REPLACE VIEW normal_outliers AS
WITH bounds AS (
SELECT avg(x) - 2*stddev(x) AS lo,
avg(x) + 2*stddev(x) AS hi
FROM observations
)
SELECT x AS outlier
FROM observations o, bounds b
WHERE x NOT BETWEEN b.lo AND b.hi;
(bounds has a single record, therefore lo/hi are scalars)
Cleaned vs. Original?
Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.
Find outliers and store them in a view normal_outliers.
results = %sql SELECT x, 'original' AS label \
FROM observations \
UNION ALL \
SELECT x, 'cleaned' AS label \
FROM observations \
WHERE x NOT IN \
(SELECT * FROM normal_outliers)
sns.displot(results.dict(), x="x", kind='hist',
hue='label', kde=True, bins=20)
plt.xlim(30, 70)
The cleaned distribution, with fewer extremes, has lower dispersion.
Demo: Breaking the Gaussian Assumption
Q: What if we corrupt just one value to be very large? Would Gaussian outlier detection still work?
# corrupt one value
%sql UPDATE observations SET x = x*10 \
     WHERE x = (SELECT MAX(x) FROM observations);
Trimming with Percentiles
Detect and define outliers by order statistics, i.e., percentiles.
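As a sketch of the idea in Python on toy data (the course demos use SQL): k% trimming drops the lowest and highest k% of values by position in the sorted order.

```python
def trim(xs, pct=0.05):
    """Drop the lowest and highest pct fraction of values (k% trimming)."""
    s = sorted(xs)
    k = int(len(s) * pct)          # how many values to drop at each end
    return s[k: len(s) - k] if k else s

data = list(range(100)) + [10_000]   # 0..99 plus one huge outlier
trimmed = trim(data)                 # 5% trimming drops 5 values per end
print(min(trimmed), max(trimmed), len(trimmed))  # 5 95 91
```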
[Exercise] Understanding Winsorization
Consider this dataset which has mean 101.5 and median 68.0.
{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, −40, 101, 86, 85, 15, 89, 89, 28, −5, 41}
Compute a 90% winsorization:
🤔
A. Mean bigger, median bigger
B. Mean bigger, median smaller
C. Mean smaller, median bigger
D. Mean smaller, median smaller
E. Mean same, median different
F. Mean different, median same
G. Mean same, median same
H. Something else
Winsorized data:
{92, 19, 101, 58, 101, 91, 26, 78, 10, 13, −5, 101, 86, 85, 15, 89, 89, 28, −5, 41}
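A quick check of the exercise in Python (a hand-rolled winsorize; `scipy.stats.mstats.winsorize` does roughly the same): clamping the extremes shrinks the mean but leaves the median untouched.

```python
import statistics

data = [92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
        -40, 101, 86, 85, 15, 89, 89, 28, -5, 41]

def winsorize(xs, pct=0.05):
    """Clamp the lowest/highest pct of values to the nearest kept value."""
    s = sorted(xs)
    k = int(len(s) * pct)            # 1 value at each end for n=20, pct=0.05
    lo, hi = s[k], s[-k - 1]
    return [min(max(x, lo), hi) for x in xs]

w = winsorize(data)
print(statistics.mean(data), statistics.median(data))  # 101.5 68.0
print(statistics.mean(w), statistics.median(w))        # 55.65 68.0
```

The mean drops from 101.5 to 55.65 while the median stays at 68.0, since winsorizing only replaces order-statistic extremes.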
Trimming vs. Winsorizing with the 10x outlier
Trimming and winsorizing are more robust than Gaussian detection. These strategies give rise to summary statistics such as the “5% trimmed mean” and the “5% winsorized mean.”
More reading: Joe Hellerstein’s work
Joe Hellerstein, EECS Professor Emeritus, UC Berkeley
Original designer of Data 101 (Spring 2021)!
Data engineering involves lots of arbitrary (but reasonable) metrics.
One example: Hampel X84. Why? Let’s properly define robustness.
Hellerstein, J.M. (2008) Quantitative Data Cleaning for Large Databases. United Nations Economic Commission for Europe (UNECE). [link]
Robustness and Breakdown Points
Robustness is a worst-case analysis of an estimator.
Example of a non-robust statistic: the mean (a single corrupted value can move it arbitrarily far).
Breakdown points
The breakdown point of an estimator is the smallest fraction of the data that, if corrupted arbitrarily, can move the estimator arbitrarily far.
Ex1 Breakdown point of mean: 0% (a single corrupted value suffices).
Ex2 Breakdown point of 1% trimmed mean: 1% (it tolerates corruption of up to 1% of the values at each end).
Ex3 What k% gives us a maximally robust k% trimmed mean?
🤔
A: k = 50%, i.e., the median. The median of any distribution is maximally robust: it can handle corruption of up to 50% of the data.
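A numeric illustration of the 50% breakdown point (pure stdlib, toy data): corrupting just under half the values barely moves the median, while the mean is already destroyed.

```python
import statistics

clean = list(range(1, 101))          # 1..100: mean = median = 50.5
# Corrupt 49 of 100 values, just under the median's 50% breakdown point.
corrupt = clean[:51] + [10**9] * 49

print(statistics.median(clean))    # 50.5
print(statistics.median(corrupt))  # 50.5: the median barely notices
print(statistics.mean(corrupt))    # ~4.9e8: the mean is destroyed
```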
Robust Estimators of a Distribution
Center: the median.
Dispersion: the MAD (median absolute deviation from the median).
[Aside] Tukey’s five-number summary for univariate data: min, lower quartile, median, upper quartile, max.
(see end of slides for SQL code)
Using Median and MAD as part of Hampel X84
But why?
Why the factor 1.48? Metrics for outliers
How much should we trim with MAD?
Let’s translate this normal-estimation notion of “standard unit” into robust estimators:
For the standard Normal (mean 0, stddev 1),
1 stddev = 1.4826 MADs. Try it at home!
Flagging values outside median ± k × 1.4826 × MAD (the demo uses k = 2) is known as the Hampel X84 robust outlier detection technique.
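You can indeed try it at home: a quick stdlib simulation checking that 1.4826 × MAD ≈ 1 stddev for standard Normal samples (1.4826 is approximately 1/0.6745, the reciprocal of the Normal’s 75th-percentile z-score).

```python
import random, statistics

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(100_000)]

med = statistics.median(xs)
mad = statistics.median([abs(x - med) for x in xs])

# For Normal data, stddev ≈ 1.4826 × MAD; here stddev is 1 by construction
print(1.4826 * mad)   # close to 1.0
```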
[Beyond scope] Outlier detection with model fitting
If we fit multi-dimensional data to a model, do we have more metrics for detecting “outliers”?
Summary: Outliers
Outliers depend on some measure of center and spread (dispersion).
Computing MAD in SQL
[At Home]
%%sql
-- percentile_disc returns an actual data value near the percentile (discrete)
-- percentile_cont returns an interpolated value at the percentile (continuous)
CREATE OR REPLACE VIEW median AS
(SELECT percentile_disc(0.5)
        WITHIN GROUP (ORDER BY x) AS median
 FROM observations);

%%sql
CREATE OR REPLACE VIEW mad AS
WITH absdevs AS
  (SELECT abs(x - median) AS d
   FROM observations, median)
SELECT percentile_disc(0.5)
       WITHIN GROUP (ORDER BY d) AS mad
FROM absdevs;

SELECT median, mad
FROM median, mad;
Demo
Let’s redo our outliers with Hampel X84
[At Home]
%%sql
CREATE OR REPLACE VIEW hampelx84x2_observations AS (
SELECT o.x, 'hampelx84x2' AS label
FROM observations o, median, mad
WHERE o.x BETWEEN (median-2*1.4826*mad)
AND (median+2*1.4826*mad)
UNION ALL
SELECT o.x, 'orig' AS label
FROM observations o
);
CREATE OR REPLACE VIEW hampelx84x2_outliers AS (
SELECT x
FROM observations o, median, mad
WHERE x NOT BETWEEN (median - 2*1.4826*mad)
AND (median + 2*1.4826*mad)
);
results = %sql SELECT * FROM hampelx84x2_observations
sns.displot(results.dict(), x="x", kind='hist',
hue='label', kde=True, bins=np.linspace(32, 70, 20),
height=4, aspect=1.5, rug=True)
plt.xlim(30, 70)
Data Imputation
When data are missing, we often impute them by filling in “likely” values.
What is a good imputation scheme for your setting?
Q: What imputation schemes would you consider using to fill in missing values?
Imputation methods we will cover
1. Default values for a column
2. Correlation across columns
3. General model-based interpolation
4. Interpolation across ordered rows
(Methods 2 and 3 interpolate with columns: imputing values assuming a relationship between columns. Method 4 interpolates with rows.)
GNIS (Geographic Names Information System) dataset
For purposes of illustration, let's introduce some missing values into our data.
%%sql
SELECT setseed(0.12345);
DROP TABLE IF EXISTS holey CASCADE;
CREATE TABLE holey AS
SELECT feature_id, feature_name,
       feature_class, state_alpha, county_name,
       prim_lat_dec, prim_long_dec,
       CASE WHEN random() > 0.9 THEN NULL
            ELSE elev_in_m
       END AS elev_in_m
FROM national;

SELECT count(elev_in_m)::float / count(*)
FROM holey;
[aside] Good data lineage practice: Make a copy of the data
Method 1. Default values for a column
Impute missing data using the mean, or some other reasonable univariate data statistic.
Implement in SQL with two passes through the data:
1. An aggregate CTE to compute the mean
2. A query that fills NULLs with it
Method 1. Default values for a column
[At home]
WITH elevavg AS (
    SELECT avg(elev_in_m)::int AS avg FROM holey
)
SELECT h.*,
       CASE WHEN h.elev_in_m IS NOT NULL
            THEN h.elev_in_m
            ELSE e.avg
       END AS imputed_elev_in_m
FROM holey h, elevavg e
LIMIT 100;
[Aside] Good data lineage practice: Document imputation through attribute names (elev_in_m, imputed_elev_in_m)
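The same two-pass logic, mirrored in plain Python on a hypothetical miniature of holey (pass 1 computes the mean; pass 2 plays the role of the SQL CASE expression):

```python
import statistics

# Toy stand-in for the holey table: (feature_id, elev_in_m), NULL as None
rows = [(1, 100.0), (2, None), (3, 300.0), (4, None)]

mean_elev = statistics.mean(v for _, v in rows if v is not None)  # pass 1

# Pass 2: keep the value when present, else fill with the column mean
imputed = [(fid, v if v is not None else mean_elev) for fid, v in rows]
print(imputed)   # [(1, 100.0), (2, 200.0), (3, 300.0), (4, 200.0)]
```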
Demo
Method 2. Correlation across columns
Interpolate by assuming a relation between columns.
Ex linear regression model
[Scatter plot: elev_in_m vs. prim_long_dec, with a fitted regression line]
Method 2. Correlation across columns
[At home]
(could use CTE, but here’s a notebook alternative)
# Here we'll train the model in SQL just for fun
result = %sql SELECT \
regr_slope(elev_in_m, prim_long_dec), \
regr_intercept(elev_in_m, prim_long_dec) \
FROM holey
slope, intercept = result[0]
%%sql
SELECT *,
CASE WHEN elev_in_m IS NOT NULL
THEN elev_in_m
ELSE {{slope}}*prim_long_dec + {{intercept}}
END AS imputed_elev_in_m
FROM holey
LIMIT 10;
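The same idea in plain Python on toy (longitude, elevation) pairs, with slope and intercept computed by the textbook least-squares formulas that regr_slope/regr_intercept implement:

```python
# Toy (prim_long_dec, elev_in_m) pairs; missing elevations are None
pairs = [(-120.0, 100.0), (-121.0, None), (-122.0, 200.0),
         (-123.0, None), (-124.0, 300.0)]

known = [(x, y) for x, y in pairs if y is not None]
mx = sum(x for x, _ in known) / len(known)
my = sum(y for _, y in known) / len(known)
slope = (sum((x - mx) * (y - my) for x, y in known)
         / sum((x - mx) ** 2 for x, _ in known))
intercept = my - slope * mx

# Mirror of the SQL CASE: keep the value, else predict from the model
imputed = [y if y is not None else slope * x + intercept for x, y in pairs]
print(imputed)   # [100.0, 150.0, 200.0, 250.0, 300.0]
```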
Demo
Method 3. General model-based interpolation
Similar to Method 2, but trained on different data in advance.
We won’t show an example,�but the general idea:
SELECT *,
       CASE WHEN column IS NOT NULL
            THEN column
            ELSE model_predict(<constants>, <columns>)
       END AS imputed_column
FROM table;
Method 4. Interpolation across ordered rows
Order rows in the relation, then use that order to impute values.
Ex “fill down” until the next non-null value
Ex Linear interpolation between rows
[Figure: a column of example values (46, 401, 194, 47, 111) in row order, with gaps between them]
Typically involves a window function
More complicated than meets the eye (in SQL).
See bonus slides
Requires multiple passes through the data.
Let’s see it in action…!
Linear Interpolation Example
Linear interpolation across ordered rows:
[Figure: a column of values (46, 401, 194, 47, 111) in row order, with gaps to fill by linear interpolation]
Linear interpolation is non-trivial! It requires knowing both endpoints of each gap and the row’s position within the gap, which takes multiple passes through the data!
Q: How would you go about doing this?
Preprocess: Define a “run” of rows
The trick: Interpolate across runs (the number of not NULL values seen so far)
feature_id | feature        | elev_in_m | run | run_start | run_end | run_size | run_rank | impute_elev_in_m
-----------+----------------+-----------+-----+-----------+---------+----------+----------+-----------------
       493 | Adobe Canyon   |      1423 |  84 |      1423 |    1423 |        3 |        0 |             1423
       494 | Adobe Flats    |      None |  84 |      1423 |     517 |        3 |        1 |             1121
       495 | Adobe Lake     |      None |  84 |      1423 |     517 |        3 |        2 |              819
       496 | Adobe Mountain |       517 |  85 |       517 |     517 |        1 |        1 |              517
“run-level statistics”
Impute using rank in the current run
Once we know which “run” a row pertains to, we can interpolate via whatever scalar math we like.
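Here is that logic as a standalone Python sketch (toy input; assumes the first and last values are present, as in the slide’s example):

```python
def interpolate_runs(values):
    """Linear interpolation across ordered rows (None = missing).
    Assumes the first and last entries are non-null."""
    out = list(values)
    i = 0
    while i < len(out):
        if out[i] is None:
            start = i - 1                 # last non-null index (run_start)
            end = i
            while out[end] is None:       # find the next non-null (run_end)
                end += 1
            size = end - start            # gap width, like run_size
            for rank in range(1, size):   # rank within the run, like run_rank
                out[start + rank] = (out[start]
                                     + rank * (out[end] - out[start]) / size)
            i = end
        i += 1
    return out

# Matches the run-level statistics table: 1423 → 1121 → 819 → 517
print(interpolate_runs([1423, None, None, 517]))  # [1423, 1121.0, 819.0, 517]
```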
Strategy: Multiple passes through the data
Add run-level statistics to every row:
In SQL, how many passes through the relation do you need to compute these five attributes per row AND interpolate?
🤔
A. 0
B. 1
C. 2
D. 3
E. 4
F. Something else
A 3-pass algorithm, at least in SQL
1. Pass forward: assign run numbers, carry run_start forward, grab next_val
2. Pass backward: fill run_end, compute run_size and run_rank
3. Use these scalars to interpolate
(run_size = 3 in the earlier example because we exclude run_end, the start of the next run)
run_start: by taking the last non-null value
Didn’t cover slides, just did demo - here for reference
A custom function that we wrote that effectively coalesces values over a window frame:
coalesce_agg(elev_in_m)
OVER (ORDER BY feature_id) AS run_start
Thanks, StackOverflow!
Demo
next_val: Next row value
In first pass:
CASE WHEN elev_in_m IS NULL
THEN lead(elev_in_m, 1)
OVER (ORDER BY feature_id)
ELSE NULL
END AS next_val
In second pass: fill next_val backward within each run (coalesce_agg over a descending frame) to compute run_end.
Demo
[Code] A 3-pass algorithm
%%sql
-- 1. Forward: assign run numbers to rows, propagate run_start, get next_val
CREATE OR REPLACE VIEW forward AS
SELECT *,
SUM(CASE WHEN elev_in_m IS NULL THEN 0 ELSE 1 END)
OVER (ORDER BY feature_id) AS run,
coalesce_agg(elev_in_m) OVER (ORDER BY feature_id) AS run_start,
CASE WHEN elev_in_m IS NULL
THEN lead(elev_in_m, 1) OVER (ORDER BY feature_id)
ELSE NULL
END AS next_val
FROM holey;
%%sql
-- 2. Backward: assign run_end, run_size, run_rank
CREATE OR REPLACE VIEW backward AS
SELECT *,
    CASE WHEN elev_in_m IS NOT NULL THEN elev_in_m
         ELSE coalesce_agg(next_val) OVER
              (PARTITION BY run ORDER BY feature_id DESC)
    END AS run_end,
    count(*) OVER (PARTITION BY run) AS run_size,
    -1 + (RANK() OVER (PARTITION BY run ORDER BY feature_id)) AS run_rank
FROM forward;
SELECT * FROM backward ORDER BY feature_id ASC LIMIT 12 OFFSET 183;
%%sql
-- 3. Simple scalar pass
CREATE OR REPLACE VIEW final AS
SELECT *,
run_start + (run_rank)*((run_end-run_start)/(run_size))
AS interpolated
FROM backward;
Demo
[Code] A 3-pass algorithm
How well did PostgreSQL do?
%sql EXPLAIN SELECT * FROM final LIMIT 500;
(can you do better? what if you don’t use SQL?)
Demo
Summary: Imputation
Some imputation methods for standard single imputation:
1. Default values for a column
2. Correlation across columns
3. General model-based interpolation
4. Interpolation across ordered rows
[Aside] For fancier statistical methods, see the literature on multiple imputation.
Method 4. Impute by filling down (according to some order)
Order rows in the relation, then use that order to impute values.
[Figure: values in row order, with gaps filled downward]
The following doesn’t work in PostgreSQL
%%sql
WITH buggy AS (
    SELECT *,
        CASE WHEN elev_in_m IS NOT NULL THEN elev_in_m
             ELSE lag(elev_in_m, 1)
                  OVER (ORDER BY feature_id)
        END AS imputed_elev_in_m
    FROM holey
)
SELECT feature_id, elev_in_m,
       imputed_elev_in_m FROM buggy;
What is the bug with this “fill down” imputation?
🤔
Demo
The bug
CASE WHEN elev_in_m IS NOT NULL THEN elev_in_m
     ELSE lag(elev_in_m, 1)
          OVER (ORDER BY feature_id)
END AS imputed_elev_in_m
lag(elev_in_m, 1) looks back exactly one row, and that value may itself be NULL, so runs of two or more consecutive NULLs are not fully filled. The standard fix would be lag(...) IGNORE NULLS, but unfortunately PostgreSQL doesn’t support it!
⚠️
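For contrast, here is how simple fill-down is outside SQL: a Python sketch of both the correct carry-forward and the buggy one-row lookback (toy data, None = NULL).

```python
def fill_down(values):
    """Fill down: carry the last non-null value forward
    (what lag(...) IGNORE NULLS would give us)."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(v if v is not None else last)
    return out

def buggy_lag_fill(values):
    """The broken version: look back exactly one row, like lag(x, 1)."""
    return [v if v is not None else (values[i - 1] if i else None)
            for i, v in enumerate(values)]

xs = [401, None, None, 194]
print(buggy_lag_fill(xs))  # [401, 401, None, 194]: second NULL stays NULL
print(fill_down(xs))       # [401, 401, 401, 194]
```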
Concept is simple, but SQL implementation hard
%%sql
CREATE OR REPLACE FUNCTION coalesce_agg_sfunc(state anyelement, value anyelement) RETURNS anyelement AS
$$
SELECT coalesce(value, state);
$$ LANGUAGE SQL;
CREATE AGGREGATE coalesce_agg(anyelement) (
SFUNC = coalesce_agg_sfunc,
STYPE = anyelement);
Thanks, StackOverflow!
Demo