1 of 63

Data Preparation III: Outliers and Imputation

March 4, 2025


Data 101, Spring 2025 @ UC Berkeley

Aditya Parameswaran https://data101.org/sp25

LECTURE 13


3 of 63


Data Preparation II: Granularity (Wrap-Up)

4 of 63


Recap of Hierarchies

5 of 63

Data granularity implies multiple hierarchies

In this context, a hierarchy dictates the transformation into smaller/larger granularities.

Three categories of hierarchies, roughly:

  • Explicit hierarchy, e.g., via a hierarchy table
  • Time hierarchy, e.g., by second, by minute, hourly, daily, …
  • Space hierarchy, e.g., geolocation of a physical phenomenon


6 of 63

Explicit Hierarchies

Many hierarchical models of the world.

  • Domain, Kingdom, Phylum, Class, Order, Family, Genus, Species.
  • City, County, State, Nation
  • {Cars, Trucks, Planes} → Vehicles


Related encoding: semantic triples, i.e., RDF triples (Resource Description Framework).

One standard encoding: IsA ("is a") pairs, which are basically child/parent pairs in a tree.
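As a minimal sketch (the isa and facts tables and their contents are made up for illustration), an IsA encoding is just a child/parent table that you join through when rolling up:

-- Hypothetical IsA (child, parent) table encoding an explicit hierarchy.
CREATE TABLE isa (child text PRIMARY KEY, parent text);
INSERT INTO isa VALUES
  ('Cars', 'Vehicles'), ('Trucks', 'Vehicles'), ('Planes', 'Vehicles');

-- Rolling a fact table up one level means joining through isa:
SELECT i.parent, COUNT(*)
FROM facts f JOIN isa i ON f.category = i.child
GROUP BY i.parent;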


7 of 63

Everything exists in Space-Time

Time

All data (not just physical phenomena) follow a bitemporal model:

  • transaction time: the time a datum is recorded
  • valid time: the time range when the datum is considered to be true
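A minimal sketch of what a bitemporal table might look like (illustrative names, not from the lecture):

CREATE TABLE salary_history (
  employee_id      bigint,
  salary           numeric,
  valid_period     daterange,                 -- valid time: when the salary was in effect
  transaction_time timestamp DEFAULT now()    -- transaction time: when this row was recorded
);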

Space

All physical phenomena have a geolocation, which can be encoded in many ways:

  • (latitude, longitude)
  • place-name
  • postal code

The most common dimensions of data!


8 of 63

Everything exists in Space-Time

Time

All data (not just physical phenomena) follow a bitemporal model.

Hierarchies are common:

  • msec < sec < min < hour < day < month < year < …
  • Timezones, etc. complicate hierarchies!

Many time subdivisions are periodic: seasons, months, etc.
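In Postgres, rolling up the time hierarchy is often just date_trunc in the GROUP BY; a generic sketch (events and event_time are illustrative names):

-- Roll event counts up from raw timestamps to monthly granularity.
SELECT date_trunc('month', event_time) AS month, COUNT(*)
FROM events
GROUP BY date_trunc('month', event_time)
ORDER BY 1;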

Space

All physical phenomena have a geolocation, which can be encoded in many ways:

Explicit and often complex hierarchies…

  • …because of geopolitics!
    • U.S. congressional districts may span parts of counties
    • Not all countries are recognized by the United Nations
  • Often not strict hierarchies; have overlaps.

While general-purpose databases and tools usually suffice, you may sometimes need specialized systems:


GIS: Geographic Information Systems

  • Postgres Example: PostGIS [link]

Temporal databases

  • Postgres Example: Timescale [link]

9 of 63

Rollup/Drilldown Constructs

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

10 of 63

Changing granularity

ā€œRoll upā€: Transform to coarser grain (e.g., go up in a hierarchy).

ā€œDrill downā€: Transform to finer grain (e.g., go down in a hierarchy).

Operators:

  • pandas/SQL: GROUP BY/Aggregation with different groups
  • R terminology: Split-Apply-Combine


           Table "public.national"
     Column      |       Type
-----------------+------------------
 feature_id      | bigint
 feature_name    | text
 feature_class   | text
 state_alpha     | text
 state_numeric   | bigint
 county_name     | text
 county_numeric  | double precision
 primary_lat_dms | text
 prim_long_dms   | text
 prim_lat_dec    | double precision
 prim_long_dec   | double precision
 source_lat_dms  | text
 source_long_dms | text
 source_lat_dec  | double precision
 source_long_dec | double precision
 elev_in_m       | double precision
 elev_in_ft      | double precision
 map_name        | text
 date_created    | text

Let’s see an example.


11 of 63

Example

Suppose we start with county-level granularity of elevation.


SELECT state_numeric, county_numeric,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*)
FROM national TABLESAMPLE BERNOULLI(10)
GROUP BY state_numeric, county_numeric;

Table "public.national" (same schema as on the previous slide)

County ā€œnumbersā€ are reused across states!

Multi-attribute GROUP BY ensures unique counties across U.S.

How do we roll up to state level?


12 of 63

Example: Roll Up

Suppose we start with county-level granularity of elevation.

Let’s roll up to state-level granularity.


SELECT state_numeric, county_numeric,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*)
FROM national TABLESAMPLE BERNOULLI(10)
GROUP BY state_numeric, county_numeric;

Table "public.national" (same schema as on the previous slide)

County ā€œnumbersā€ are reused across states!

Multi-attribute GROUP BY ensures unique counties across U.S.

SELECT state_numeric,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*)
FROM national TABLESAMPLE BERNOULLI(10)
GROUP BY state_numeric;

GROUP BY clause: More or fewer columns than before?

SELECT clause: Can we choose the same columns as before?


13 of 63

Example: Drill Down

Let’s roll up to state-level granularity.

From state-level granularity, let's drill down into feature-class granularity as well.


Table "public.national" (same schema as on the previous slide)

SELECT state_numeric,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*)
FROM national TABLESAMPLE BERNOULLI(10)
GROUP BY state_numeric;

SELECT state_numeric, feature_class,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*)
FROM national
GROUP BY state_numeric, feature_class
ORDER BY count(*) DESC;

Drilling down requires having the finer-granularity attribute!

If you don’t have it, you have to sample from a model (more later).


14 of 63

Roll-Up and Statistics

Roll up effectively computes the marginal distribution (see DATA 100/140).

  • GROUP BY/COUNT(*) queries are empirical probability distributions!
    • (with some normalization by the overall count.)
  • Roll-up has fewer grouping columns and thus sums the probabilities across a subset of dimensions.
  • More later when we see Data Cubes.


SELECT state_numeric,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*),
       SUM(COUNT(*)) OVER () AS total,
       COUNT(*)/SUM(COUNT(*)) OVER () AS marginal
FROM national TABLESAMPLE BERNOULLI(.1)
GROUP BY state_numeric;


15 of 63

Drill-Down and Statistics

If we don’t actually have the drill-down data, we could sample from a model.

  • Notify any downstream users that this is simulated data!


SELECT state_numeric,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*)
FROM national
GROUP BY state_numeric
ORDER BY count(*) DESC;

  Table "public.fips_counties"
    Column     |  Type
---------------+--------
 fips          | bigint
 county        | text
 state_numeric | bigint
Indexes:
    "ix_fips_counties_fips" btree (fips)

What if we wanted to drill down to all counties, including ones that might not be in national?

Possible solution: when drilling down, estimate each county's elevation from a normal distribution fit to the state's elevation statistics.

16 of 63

Rollup/Drilldown & Tensors

Consider a GROUP BY query with 3 grouping columns and 1 aggregate column:

SELECT state_numeric, county_numeric,
       feature_class, COUNT(*)
FROM national
GROUP BY state_numeric, county_numeric, feature_class;

This returns a (sparse) 3-dimensional tensor!


17 of 63

Roll Up, Drill Down, Roll Up, Drill Down…

[On your own]

Demo 1: Roll-Ups and Drill-Downs

  • Start with county level
  • Roll up to state level
  • Drill down to state level, feature class
  • Roll up again to feature class
  • Drill down to state numeric, county numeric


In this video game, the character Kirby rolls up (inflates) and drills down (deflates).

  • Is there a key in the schema of this last output? Why (not)?
  • What would this mean for PIVOTing?


SELECT state_numeric, feature_class,
       avg(elev_in_m),
       stddev(elev_in_m), count(*)
FROM national TABLESAMPLE BERNOULLI(10)
GROUP BY state_numeric, feature_class
ORDER BY count(*) DESC;


Demo

18 of 63

Roll Up, Drill Down, Roll Up, Drill Down…

Demo 2: Connections to Statistics

  • Roll-up with marginal distributions
  • Drill-down with normally distributed elevations:
    • Start from state level
    • Join with all counties
    • Simulate each county’s elevation from a normally distributed variable using the state statistics
    • Note: Need to CREATE EXTENSION tablefunc; for the normal_rand function
    • Note 2: The LATERAL keyword in FROM lets us reference columns from earlier FROM items. [postgres docs 7.2.1.5, not strictly necessary with tablefunc] (See the sketch below.)
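A rough sketch of this simulation step, reusing the lecture's table names (the exact demo query may differ):

-- Simulate one elevation per county by sampling from Normal(state mean, state stddev).
-- Requires: CREATE EXTENSION IF NOT EXISTS tablefunc;
WITH state_stats AS (
  SELECT state_numeric, AVG(elev_in_m) AS mu, STDDEV(elev_in_m) AS sigma
  FROM national
  GROUP BY state_numeric
)
SELECT c.fips, c.county, c.state_numeric, n.x AS simulated_elev_in_m
FROM fips_counties c
JOIN state_stats s USING (state_numeric),
LATERAL normal_rand(1, s.mu, s.sigma) AS n(x);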


In this video game, the character Kirby rolls up (inflates) and drills down (deflates).


Demo

19 of 63

Implementing Explicit Hierarchies

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

20 of 63

Do we need to make our own hierarchies? Yes!

The GNIS dataset was great because the hierarchy was embedded into the relational schema. If we don’t have this, then we have to assemble an explicit hierarchy.


Take the baseball database for example. Suppose in baseball, in a given year

  • Players play for Teams
  • Teams are in Divisions
  • Divisions are in Leagues

So if we want to do roll-ups and drill-downs on this hierarchy, we need player-year data that looks like:

(player_id, team_id, div_id, lg_id, year_id, thing_we_want_to_aggregate)

Let’s transform our data!

Demo

21 of 63

Assembling the Explicit Hierarchy

SELECT a.player_id, a.team_id, t.div_id, a.*
FROM appearances a
NATURAL JOIN teams t
WHERE a.year_id = 2015
LIMIT 100;

CREATE OR REPLACE VIEW bball_tree AS
SELECT DISTINCT a.player_id, a.team_id, t.div_id,
                a.lg_id, a.year_id
FROM appearances a NATURAL JOIN teams t;


When making an explicit hierarchy, each record represents a datum in the smallest granularity (player-years in this example).


Demo

22 of 63

Roll-ups & drill-downs on an explicit hierarchy

Recall our home run query:

Set up for roll up/drill down on bball_tree hierarchy.

  • Join each (raw) person-year in batting with the associated `bball_tree` entry by (player_id, year_id) in a CTE
  • Use this result for roll-up and drill-down.

SELECT name_first, name_last, year_id,
       MIN(hr), MAX(hr), AVG(hr), STDDEV(hr), SUM(hr)
FROM batting b, people p
WHERE b.player_id = p.player_id
GROUP BY name_last, name_first, year_id
ORDER BY max DESC
LIMIT 10;


Demo

23 of 63

Let’s construct the CTE

  • What pieces of information do we need from the hierarchy?
  • How do we write our CTE?


baseball=# \d batting
          Table "public.batting"
  Column   |         Type          |
-----------+-----------------------+
 player_id | character varying(10) |
 year_id   | integer               |
 stint     | integer               |
 team_id   | character varying(3)  |
 lg_id     | character varying(2)  |
 g         | integer               |
 ab        | integer               |
 r         | integer               |
 h         | integer               |
 h2b       | integer               |
 h3b       | integer               |
 hr        | integer               |
 rbi       | integer               |
 sb        | integer               |
 cs        | integer               |
 bb        | integer               |
 so        | integer               |
 ibb       | integer               |
 hbp       | integer               |
 sh        | integer               |
 sf        | integer               |
 gidp      | integer               |
Indexes:
    "batting_pkey" PRIMARY KEY, btree (player_id, year_id, stint)

     View "public.bball_tree"
  Column   |         Type
-----------+-----------------------
 player_id | character varying(10)
 team_id   | character varying(3)
 div_id    | character varying(1)
 lg_id     | character varying(2)
 year_id   | integer

šŸ¤”

Demo


25 of 63

Roll-ups & drill-downs on an explicit hierarchy

Set up for roll up/drill down on bball_tree hierarchy.

  • Join each (raw) person-year in batting with the associated `bball_tree` entry by (player_id, year_id) in a CTE
  • Use this result for roll-up and drill-down.

WITH batting_tree AS (
    SELECT b.*, t.div_id
    FROM batting b, bball_tree t
    WHERE b.player_id = t.player_id
      AND b.year_id = t.year_id )
SELECT name_first, name_last,
       bt.team_id, bt.lg_id, bt.div_id, bt.year_id,
       MIN(hr), MAX(hr), AVG(hr), STDDEV(hr), SUM(hr)
FROM batting_tree bt, people p
WHERE bt.player_id = p.player_id
GROUP BY bt.player_id, bt.team_id, bt.lg_id, bt.div_id, bt.year_id, name_last, name_first
ORDER BY max DESC
LIMIT 10;


Demo

26 of 63

Announcements

  • Midterm logistics are out!
    • Let us know ASAP if you have any questions, concerns, etc.
  • Project 2 is due on 3/5 at 5pm!
  • Homework 3 was released on 2/28 and is due on 3/19


27 of 63


Data Preparation III

28 of 63

Outlier Detection and Handling

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

29 of 63

Onto more Data Transformations: Principles to practice

Instead of ā€œdata cleaning,ā€ think data transformation:

  • We are inherently imposing a model over our data.
    • Moves away from the ā€œhygieneā€/ā€œcleaningā€ analogy!
  • As a result, record input, output, and the data lineage of how the output is computed.


Example: Embrace simple metadata.

  • If transforming just one column
    • keep the original column
    • add a derived column to the right, and name it something meaningful (see the sketch after this list)
  • If transforming much of a dataset
    • create a new derived dataset stored ā€œnearā€ the original
      • Examples: same filesystem directory, same database schema, same git repo, etc.
      • Name the derived dataset in a way that hints at its lineage
  • In all cases, keep the transformation code and treat it like source code:
    • Manage it, version control it, document it
    • Keep it in the same repository/toolchain as the data itself
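For instance, a rough sketch of both conventions on the national table (the derived column and table names here are illustrative, not from the lecture):

-- Transforming one column: keep the original and add a clearly named derived column.
ALTER TABLE national ADD COLUMN elev_in_ft_derived double precision;
UPDATE national SET elev_in_ft_derived = elev_in_m * 3.28084;

-- Transforming much of the dataset: create a derived table "near" the original,
-- named to hint at its lineage.
CREATE TABLE national_nonnull_elev AS
SELECT * FROM national WHERE elev_in_m IS NOT NULL;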


30 of 63

Outlier Detection and Handling: Terminology

An outlier is a value ā€œfar enoughā€ from ā€œaverage.ā€

  • Detecting outliers helps us identify aberrations with respect to a model.
  • Handling our outliers helps us model the majority of the data
    • Note: We should always acknowledge these outliers in our assumptions and our reports!

Every strategy of detecting/handling outliers needs two statistical measures on a set of values:

  • Center: what is ā€œaverageā€.
    • e.g., mean, median, etc.
  • Dispersion: what is ā€œfarā€ from average.
    • e.g., standard deviation, variance, interquartile range (IQR), etc.


Let’s explore the robustness of three outlier handling strategies:

  • What is the measure of dispersion?
  • How large are masking effects?
  • What is the worst-case analysis?

(formal definition of robustness to follow)

We say masking occurs when an outlier influences the dispersion measure and impacts our ability to detect other outliers.


31 of 63

Common Outlier Handling Strategies

  1. Dropping Gaussian outliers
  2. Trimming outliers based on percentiles
  3. Winsorizing, i.e., replacing outliers with percentile statistics


32 of 63

Gaussian Outliers

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

33 of 63

Gaussian Outliers

Model data as a normal (Gaussian) distribution.

  • Center: mean
  • Dispersion unit: standard deviation
  • Define a Gaussian outlier as a value
    • more than 2 standard deviations from the mean.
    • In percentiles: roughly below p2.5 and above p97.5.


(Plot legend: gray = cleaned/truncated distribution.)

In practice:

  1. Assume a Gaussian distribution with mean, stdev from the data.
  2. Find the Gaussian outliers based on #1.
  3. Construct a ā€œcleanedā€ distribution (with outliers dropped).


34 of 63

Demo: Assuming Gaussian Data

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.


%sql DROP TABLE IF EXISTS observations CASCADE;

%sql CREATE TABLE observations AS \
     SELECT normal_rand AS x \
     FROM normal_rand(1000, 50, 5);

results = %sql SELECT x FROM observations
sns.displot(results.dict(), fill=True, kde=True, bins=20)

35 of 63

Demo: Assuming Gaussian Data

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers and store them in a view normal_outliers.


%%sql
CREATE OR REPLACE VIEW normal_outliers AS
WITH bounds AS (
    SELECT avg(x) - 2*stddev(x) AS lo,
           avg(x) + 2*stddev(x) AS hi
    FROM observations
)
SELECT x AS outlier
FROM observations o, bounds b
WHERE x NOT BETWEEN b.lo AND b.hi;

(bounds has a single record, therefore lo/hi are scalars)

36 of 63

[Exercise] Cleaned vs. Original?

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers and store them in a view normal_outliers.


results = %sql SELECT x, 'original' AS label \
               FROM observations \
               UNION ALL \
               SELECT x, 'cleaned' AS label \
               FROM observations \
               WHERE x NOT IN \
                     (SELECT * FROM normal_outliers)

sns.displot(results.dict(), x="x", kind='hist',
            hue='label', kde=True, bins=20)
plt.xlim(30, 70)

What does this code do?

šŸ¤”

A. / B. / C. / D. (answer choices shown as images on the slide)

E. Something else/error


38 of 63

[Exercise] Cleaned vs. Original?

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers and store them in a view normal_outliers.


(Same query and plotting code as on the previous slide.)

The cleaned distribution, with fewer extremes, has lower dispersion.

39 of 63

Demo: Breaking the Gaussian Assumption

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers, store in view normal_outliers.

Construct a cleaned dataset from observations and plot it alongside the original.

What if we corrupt just one value to be very large?

  • Data becomes right-skewed (no longer normal)
  • Gaussian outlier detection suddenly becomes fragile


## corrupt one value
%sql UPDATE observations SET x = x*10 \
     WHERE x = (SELECT MAX(x) FROM observations);

The 10x value is masking our previous outliers.

One value is dragging the mean and standard deviation super far!

40 of 63

Trimming and Winsorizing

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

41 of 63

Common Outlier Handling Strategies

  1. Dropping Gaussian outliers
     • Simple to implement
     • However, the definitions of dispersion (and center) are not robust; they are sensitive to the distribution of the outliers themselves.
  2. Trimming outliers based on percentiles
  3. Winsorizing, i.e., replacing outliers with percentile statistics


42 of 63

Common Outlier Handling Strategies

  1. Dropping Gaussian outliers
  2. Trimming outliers based on percentiles
     • Define outliers by order statistics, i.e., percentiles.
     • k% trimming drops both k% tails of the distribution.
     • Example: a 1% trimmed distribution drops values below p1 and above p99 (see the sketch below).
  3. Winsorizing, i.e., replacing outliers with percentile statistics
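A sketch of 1% trimming in SQL, using the observations table from the demo later in this lecture (not the exact demo code):

WITH cutoffs AS (
  SELECT percentile_cont(0.01) WITHIN GROUP (ORDER BY x) AS lo,
         percentile_cont(0.99) WITHIN GROUP (ORDER BY x) AS hi
  FROM observations
)
SELECT o.x          -- keep only the middle 98% of values
FROM observations o, cutoffs c
WHERE o.x BETWEEN c.lo AND c.hi;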


43 of 63

Common Outlier Handling Strategies

  1. Dropping Gaussian outliers
  2. Trimming outliers based on percentiles
  3. Winsorizing, i.e., replacing outliers with percentile statistics
     • Define outliers by order statistics, i.e., percentiles.
     • k% winsorization replaces the tails with the k-th and (100āˆ’k)-th percentile values.
     • Example: 5% winsorization (see the sketch below)
       • Replace values below p5 with the p5 value.
       • Replace values above p95 with the p95 value.
       • Note: sometimes defined as 90% winsorization, since the middle 90% of values is left unchanged.
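A corresponding sketch of 5% winsorization, clamping both tails to the p5/p95 values (again on the demo's observations table; not the exact demo code):

WITH cutoffs AS (
  SELECT percentile_cont(0.05) WITHIN GROUP (ORDER BY x) AS p05,
         percentile_cont(0.95) WITHIN GROUP (ORDER BY x) AS p95
  FROM observations
)
SELECT LEAST(GREATEST(o.x, c.p05), c.p95) AS x_winsorized
FROM observations o, cutoffs c;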


44 of 63

[Exercise] Understanding Winsorization

Consider this dataset which has mean 101.5 and median 68.0.

{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, āˆ’40, 101, 86, 85, 15, 89, 89, 28, āˆ’5, 41}

Compute a 90% winsorization:

  • 95th percentile (nearest) value: 101
  • 5th percentile (nearest) value: -5


  • Which values get replaced? With what?
  • What are the statistics of the winsorized data, compared to the original?

šŸ¤”

A. Mean bigger, median bigger

B. Mean bigger, median smaller

C. Mean smaller, median bigger

D. Mean smaller, median smaller

E. Mean same, median different

F. Mean different, median same

G. Mean same, median same

H. Something else



46 of 63

[Exercise] Understanding Winsorization

Consider this dataset which has mean 101.5 and median 68.0.

{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, āˆ’40, 101, 86, 85, 15, 89, 89, 28, āˆ’5, 41}

Compute a 90% winsorization:

  • 95th percentile (nearest) value: 101
  • 5th percentile (nearest) value: -5

Winsorized data:

{92, 19, 101, 58, 101, 91, 26, 78, 10, 13, āˆ’5, 101, 86, 85, 15, 89, 89, 28, āˆ’5, 41}


  • mean-median difference way smaller now!
  • mean becomes more reflective of the dataset!
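A quick arithmetic check (mine, not on the slide):

original:   sum = 2030, mean = 2030 / 20 = 101.5; median = (58 + 78) / 2 = 68.0
winsorized: sum = 2030 - 1053 + 101 - (-40) + (-5) = 1113, mean = 1113 / 20 = 55.65;
            the 10th and 11th sorted values are still 58 and 78, so the median stays 68.0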


47 of 63

Demo: Trimming vs. Winsorizing


Even with our giant outlier, we still identify outliers based on the order (percentile) and not mean/SD, so we trim successfully.

  • More robust than SD-based.
  • Minor masking on the right, but not nearly as bad.

Winsorizing keeps the probability mass of the tails (clamped to the cutoff values) instead of dropping it

  • Usually not a big difference from trimming
  • Mean stays roughly same, but stddev changes
  • Winsorizing preferred to Trimming if something downstream forbids NULL

(Plot annotations: 5% trimmed mean, 5% winsorized mean.)


48 of 63

Robustness

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

49 of 63

Robustness and Breakdown Points

Robustness is a worst-case analysis of an estimator (in this case, the statistics used to capture outlier detection).

  • Consider an adversary that can ā€œcorruptā€ data values to arbitrary values.
  • The breakdown point of an estimator is the largest fraction of corrupted values it can tolerate before the result can be made arbitrarily incorrect (e.g., arbitrarily large).
  • For data preparation: knowing the breakdown point determines when masking can occur.

Ex1 Breakdown point of the mean:

  • 0.
  • A single corrupted value can drag the entire dataset's mean arbitrarily far, if we make it large enough.


50 of 63

Robustness and Breakdown Points

(Robustness and the breakdown point are defined as on the previous slide.)

Ex2 Breakdown point of the 1% trimmed mean:

  • 1%.
  • 1% of values on each side will be trimmed out anyway, so an adversary adjusting 1% will not drastically impact the estimator.
  • Anything beyond 1% (e.g., 1% plus one extra value) will break down the estimator.


What k% gives us a maximally robust trimmed mean?

šŸ¤”



52 of 63

Robustness and Breakdown Points

(Robustness and the breakdown point are defined as on the previous slides.)

���What k% gives us a maximally robust k% trimmed mean?

  • 50%. Note: 50% trimmed mean produces the median.


The median of any distribution is maximally robust; it can handle up to 50% corruption of the data.


53 of 63

Robust Estimators of a Distribution

Recall that we are estimating the center and dispersion of our data.


Dispersion

Maximally robust: Median Absolute Deviation (MAD)

  • For a dataset X, define MAD(X) = median of |x āˆ’ median(X)| over all x in X.
  • (proof of maximal robustness: intuition, but left to you)

Center

Maximally robust: Median

  • (see previous slide)


54 of 63

Robust Estimators of a Distribution

Recall that we are estimating the center and dispersion of our data.


Other common robust measures of center:

  • k% trimmed mean
  • k% winsorized mean (why?)

Other common robust measures of dispersion:

  • Interquartile Range (IQR), p75 - p25

Related: ā€œTukey numbersā€ are a five-number summary of univariate numeric data:

  • minimum, p25, median, p75, maximum
  • center: median (robust)
  • spread: IQR (robust); min/max (not robust)
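A sketch of computing the five-number summary and IQR in SQL, on the demo's observations table:

SELECT MIN(x) AS minimum,
       percentile_cont(0.25) WITHIN GROUP (ORDER BY x) AS p25,
       percentile_cont(0.50) WITHIN GROUP (ORDER BY x) AS median,
       percentile_cont(0.75) WITHIN GROUP (ORDER BY x) AS p75,
       MAX(x) AS maximum,
       percentile_cont(0.75) WITHIN GROUP (ORDER BY x)
         - percentile_cont(0.25) WITHIN GROUP (ORDER BY x) AS iqr
FROM observations;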

Center

Maximally robust: Median

  • (see previous slide)

Dispersion

Maximally robust: Median Absolute Deviation (MAD)

(see previous slide)


55 of 63

Computing MAD in SQL

(for you to review)


%%sql
-- percentile_disc returns an actual data value near the percentile (discrete)
-- percentile_cont returns an interpolated value at the percentile (continuous)
CREATE OR REPLACE VIEW median AS
(SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY x) AS median
 FROM observations);

%%sql
CREATE OR REPLACE VIEW mad AS
WITH absdevs AS
  (SELECT abs(x - median) AS d
   FROM observations, median)
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY d) AS mad
FROM absdevs;

SELECT median, mad
FROM median, mad;


Demo

56 of 63

Metrics for Outliers: Hampel X84, Model-based

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

57 of 63

Metrics for outliers

How much should we trim or winsorize? You may know a few common metrics:

ā€œ2 sigmaā€: Trim Gaussian outliers

  • Center: mean
  • Dispersion unit: standard deviation
  • Trim: 2*standard deviations from the mean

Quartiles: Trim bottom 25% and top 25%

  • But is there an analogous way to trim on a robust ā€œdispersion unitā€ of spread?


58 of 63

Metrics for outliers


Let’s translate this normal estimation notion of ā€œstandard unitā€ into robust estimators:

  • Center: median
  • Dispersion unit: MAD
  • Trim: 2*1.4826 MADs from the median

For a standard Normal (mean 0, stddev 1), 1 stddev = 1.4826 MADs. Try it at home!

This is known as the Hampel X84 robust outlier detection technique.
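One way to ā€œtry it at homeā€ (a sketch using tablefunc's normal_rand on a fresh, genuinely normal sample; for such data stddev should come out close to 1.4826 Ɨ MAD):

WITH sample AS (SELECT normal_rand AS x FROM normal_rand(100000, 0.0, 1.0)),
     med AS (SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x) AS m FROM sample),
     madv AS (SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY abs(x - m)) AS mad
              FROM sample, med)
SELECT (SELECT stddev(x) FROM sample) AS sd,    -- should be roughly 1.0
       1.4826 * mad AS sd_from_mad              -- should be roughly the same
FROM madv;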


59 of 63

More reading: Joe Hellerstein’s work

Hellerstein, J.M. (2008) Quantitative Data Cleaning for Large Databases. United Nations Economic Commission for Europe (UNECE). [link]


Paper [link], MATLAB context: hampel [link]

Joe Hellerstein

UC Berkeley EECS Professor

Original designer of Data 101 (Spring 2021)!

Data engineering involves lots of arbitrary (but reasonable!) metrics. This is one of them!

60 of 63

Let’s redo our outliers with Hampel x84


%%sql
CREATE OR REPLACE VIEW hampelx84x2_observations AS (
    SELECT o.x, 'hampelx84x2' AS label
    FROM observations o, median, mad
    WHERE o.x BETWEEN (median - 2*1.4826*mad)
                  AND (median + 2*1.4826*mad)
    UNION ALL
    SELECT o.x, 'orig' AS label
    FROM observations o
);

CREATE OR REPLACE VIEW Hampel84x2_outliers AS (
    SELECT x
    FROM observations o, median, mad
    WHERE x NOT BETWEEN (median - 2*1.4826*mad)
                    AND (median + 2*1.4826*mad)
);

Demo

61 of 63

Let’s redo our outliers with Hampel x84


results = %sql SELECT * FROM hampelx84x2_observations
sns.displot(results.dict(), x="x", kind='hist',
            hue='label', kde=True, bins=np.linspace(32, 70, 20),
            height=4, aspect=1.5, rug=True)
plt.xlim(30, 70)

Demo

62 of 63

Summary: Outliers

Outliers depend on some measure of center and spread.

  • The normal distribution (and 2 SD truncation) gives nice intuition, but it is not robust.
  • Robustness leads us to want a high breakdown point (maximum 50%).
    • Order statistics like percentiles are robust and lead to outlier handling techniques like trimming and winsorizing.
    • Median, MAD are maximally robust estimators of center and dispersion, respectively.
  • Hampel X84: a robust outlier metric that incorporates dispersion.


63 of 63

[Beyond scope] Outlier detection with model fitting

If we fit multi-dimensional data to a model, do we have more metrics for detecting ā€œoutliersā€?

  • In linear regression:
    • Squared (L2) residuals, e.g., |predicted āˆ’ actual value|²
    • OLS assumes that your residuals are normally distributed
  • You can use the same outlier metrics, but this time applied to model residuals
  • We’ll stop here; anything further is down the path of data analysis, not engineering
