1 of 63

Data Preparation III: Outliers and Imputation

March 4, 2025


Data 101, Spring 2025 @ UC Berkeley

Aditya Parameswaran https://data101.org/sp25

LECTURE 13


3 of 63


Data Preparation II: Granularity (Wrap-Up)

4 of 63


Recap of Hierarchies

5 of 63

Data granularity implies multiple hierarchies

In this context, a hierarchy dictates the transformation into smaller/larger granularities.

Three categories of hierarchies, roughly:

  • Explicit hierarchy, e.g., via a hierarchy table
  • Time hierarchy, e.g., by second, by minute, hourly, daily, …
  • Space hierarchy, e.g., geolocation of a physical phenomenon


6 of 63

Explicit Hierarchies

Many hierarchical models of the world.

  • Domain, Kingdom, Phylum, Class, Order, Family, Genus, Species.
  • City, County, State, Nation
  • {Cars, Trucks, Planes} → Vehicles


Related encoding: semantic triples, i.e., RDF triples (Resource Description Framework).

One standard encoding: IsA ("is a") pairs, which are basically child/parent pairs in a tree.
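As a minimal sketch (the isa and facts tables and their contents are made up for illustration), an IsA encoding is just a child/parent table that you join through when rolling up:

-- Hypothetical IsA (child, parent) table encoding an explicit hierarchy.
CREATE TABLE isa (child text PRIMARY KEY, parent text);
INSERT INTO isa VALUES
  ('Cars', 'Vehicles'), ('Trucks', 'Vehicles'), ('Planes', 'Vehicles');

-- Rolling a fact table up one level means joining through isa:
SELECT i.parent, COUNT(*)
FROM facts f JOIN isa i ON f.category = i.child
GROUP BY i.parent;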


7 of 63

Everything exists in Space-Time

Time

All data (not just physical phenomena) follow a bitemporal model:

  • transaction time: the time a datum is recorded
  • valid time: the time range when the datum is considered to be true
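A minimal sketch of what a bitemporal table might look like (illustrative names, not from the lecture):

CREATE TABLE salary_history (
  employee_id      bigint,
  salary           numeric,
  valid_period     daterange,                 -- valid time: when the salary was in effect
  transaction_time timestamp DEFAULT now()    -- transaction time: when this row was recorded
);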

Space

All physical phenomena have a geolocation, which can be encoded in many ways:

  • (latitude, longitude)
  • place-name
  • postal code

The most common dimensions of data!


8 of 63

Everything exists in Space-Time

Time

All data (not just physical phenomena) follow a bitemporal model.

Hierarchies are common:

  • msec < sec < min < hour < day < month < year < …
  • Timezones, etc. complicate hierarchies!

Many time subdivisions are periodic: seasons, months, etc.
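In Postgres, rolling up the time hierarchy is often just date_trunc in the GROUP BY; a generic sketch (events and event_time are illustrative names):

-- Roll event counts up from raw timestamps to monthly granularity.
SELECT date_trunc('month', event_time) AS month, COUNT(*)
FROM events
GROUP BY date_trunc('month', event_time)
ORDER BY 1;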

Space

All physical phenomena have a geolocation, which can be encoded in many ways:

Explicit and often complex hierarchies…

  • …because of geopolitics!
    • U.S. congressional districts may span parts of counties
    • Not all countries are recognized by the United Nations
  • Often not strict hierarchies; have overlaps.

While general-purpose databases and tools usually suffice, you may sometimes need specialized systems:


GIS: Geographic Information Systems

  • Postgres Example: PostGIS [link]

Temporal databases

  • Postgres Example: Timescale [link]

9 of 63

Rollup/Drilldown Constructs

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

10 of 63

Changing granularity

ā€œRoll upā€: Transform to coarser grain (e.g., go up in a hierarchy).

ā€œDrill downā€: Transform to finer grain (e.g., go down in a hierarchy).

Operators:

  • pandas/SQL: GROUP BY/Aggregation with different groups
  • R terminology: Split-Apply-Combine


           Table "public.national"
     Column      |       Type
-----------------+------------------
 feature_id      | bigint
 feature_name    | text
 feature_class   | text
 state_alpha     | text
 state_numeric   | bigint
 county_name     | text
 county_numeric  | double precision
 primary_lat_dms | text
 prim_long_dms   | text
 prim_lat_dec    | double precision
 prim_long_dec   | double precision
 source_lat_dms  | text
 source_long_dms | text
 source_lat_dec  | double precision
 source_long_dec | double precision
 elev_in_m       | double precision
 elev_in_ft      | double precision
 map_name        | text
 date_created    | text

Let’s see an example.


11 of 63

Example

Suppose we start with county-level granularity of elevation.


SELECT state_numeric, county_numeric,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*)
FROM national TABLESAMPLE BERNOULLI(10)
GROUP BY state_numeric, county_numeric;

Table "public.national" (same schema as on the previous slide)

County ā€œnumbersā€ are reused across states!

Multi-attribute GROUP BY ensures unique counties across U.S.

How do we roll up to state level?


12 of 63

Example: Roll Up

Suppose we start with county-level granularity of elevation.

Let’s roll up to state-level granularity.


SELECT state_numeric, county_numeric,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*)
FROM national TABLESAMPLE BERNOULLI(10)
GROUP BY state_numeric, county_numeric;

Table "public.national" (same schema as on the previous slide)

County ā€œnumbersā€ are reused across states!

Multi-attribute GROUP BY ensures unique counties across U.S.

SELECT state_numeric,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*)
FROM national TABLESAMPLE BERNOULLI(10)
GROUP BY state_numeric;

GROUP BY clause: More or fewer columns than before?

SELECT clause: Can we choose the same columns as before?


13 of 63

Example: Drill Down

Let’s roll up to state-level granularity.

From state-level granularity, let's drill down into feature-class granularity as well.


Table "public.national" (same schema as on the previous slide)

SELECT state_numeric,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*)
FROM national TABLESAMPLE BERNOULLI(10)
GROUP BY state_numeric;

SELECT state_numeric, feature_class,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*)
FROM national
GROUP BY state_numeric, feature_class
ORDER BY count(*) DESC;

Drilling down requires having the finer-granularity attribute!

If you don’t have it, you have to sample from a model (more later).


14 of 63

Roll-Up and Statistics

Roll up effectively computes the marginal distribution (see DATA 100/140).

  • GROUP BY/COUNT(*) queries are empirical probability distributions!
    • (with some normalization by the overall count.)
  • Roll-up has fewer grouping columns and thus sums the probabilities across a subset of dimensions.
  • More later when we see Data Cubes.


SELECT state_numeric,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*),
       SUM(COUNT(*)) OVER () AS total,
       COUNT(*)/SUM(COUNT(*)) OVER () AS marginal
FROM national TABLESAMPLE BERNOULLI(.1)
GROUP BY state_numeric;


15 of 63

Drill-Down and Statistics

If we don’t actually have the drill-down data, we could sample from a model.

  • Notify any downstream users that this is simulated data!


SELECT state_numeric,
       AVG(elev_in_m),
       STDDEV(elev_in_m), COUNT(*)
FROM national
GROUP BY state_numeric
ORDER BY count(*) DESC;

  Table "public.fips_counties"
    Column     |  Type
---------------+--------
 fips          | bigint
 county        | text
 state_numeric | bigint
Indexes:
    "ix_fips_counties_fips" btree (fips)

What if we wanted to drill down to all counties, including ones that might not be in national?

Possible solution: when drilling down, estimate each county's elevation from a normal distribution fit to the state's elevation statistics.

16 of 63

Rollup/Drilldown & Tensors

Consider a GROUP BY query with 3 grouping columns and 1 aggregate column:

SELECT state_numeric, county_numeric,
       feature_class, COUNT(*)
FROM national
GROUP BY state_numeric, county_numeric, feature_class;

This returns a (sparse) 3-dimensional tensor!


17 of 63

Roll Up, Drill Down, Roll Up, Drill Down…

[On your own]

Demo 1: Roll-Ups and Drill-Downs

  • Start with county level
  • Roll up to state level
  • Drill down to state level, feature class
  • Roll up again to feature class
  • Drill down to state numeric, county numeric


In this video game, the character Kirby rolls up (inflates) and drills down (deflates).

  • Is there a key in the schema of this last output? Why (not)?
  • What would this mean for PIVOTing?


SELECT state_numeric, feature_class,
       avg(elev_in_m),
       stddev(elev_in_m), count(*)
FROM national TABLESAMPLE BERNOULLI(10)
GROUP BY state_numeric, feature_class
ORDER BY count(*) DESC;


Demo

18 of 63

Roll Up, Drill Down, Roll Up, Drill Down…

Demo 2: Connections to Statistics

  • Roll-up with marginal distributions
  • Drill-down with normally distributed elevations:
    • Start from state level
    • Join with all counties
    • Simulate each county’s elevation from a normally distributed variable using the state statistics
    • Note: Need to CREATE EXTENSION tablefunc; for the normal_rand function
    • Note 2: The LATERAL keyword in FROM lets us reference columns from earlier FROM items. [postgres docs 7.2.1.5, not strictly necessary with tablefunc] (See the sketch below.)
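A rough sketch of this simulation step, reusing the lecture's table names (the exact demo query may differ):

-- Simulate one elevation per county by sampling from Normal(state mean, state stddev).
-- Requires: CREATE EXTENSION IF NOT EXISTS tablefunc;
WITH state_stats AS (
  SELECT state_numeric, AVG(elev_in_m) AS mu, STDDEV(elev_in_m) AS sigma
  FROM national
  GROUP BY state_numeric
)
SELECT c.fips, c.county, c.state_numeric, n.x AS simulated_elev_in_m
FROM fips_counties c
JOIN state_stats s USING (state_numeric),
LATERAL normal_rand(1, s.mu, s.sigma) AS n(x);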


In this video game, the character Kirby rolls up (inflates) and drills down (deflates).


Demo

19 of 63

Implementing Explicit Hierarchies

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

20 of 63

Do we need to make our own hierarchies? Yes!

The GNIS dataset was great because the hierarchy was embedded into the relational schema. If we don’t have this, then we have to assemble an explicit hierarchy.


Take the baseball database for example. Suppose in baseball, in a given year

  • Players play for Teams
  • Teams are in Divisions
  • Divisions are in Leagues

So if we want to do roll-ups and drill-downs on this hierarchy, we need player-year data that looks like:

(player_id, team_id, div_id, lg_id, year_id, thing_we_want_to_aggregate)

Let’s transform our data!

Demo

21 of 63

Assembling the Explicit Hierarchy

SELECT a.player_id, a.team_id, t.div_id, a.*
FROM appearances a
NATURAL JOIN teams t
WHERE a.year_id = 2015
LIMIT 100;

CREATE OR REPLACE VIEW bball_tree AS
SELECT DISTINCT a.player_id, a.team_id, t.div_id,
                a.lg_id, a.year_id
FROM appearances a NATURAL JOIN teams t;


When making an explicit hierarchy, each record represents a datum in the smallest granularity (player-years in this example).


Demo

22 of 63

Roll-ups & drill-downs on an explicit hierarchy

Recall our home run query:

Set up for roll up/drill down on bball_tree hierarchy.

  • Join each (raw) person-year in batting with the associated `bball_tree` entry by (player_id, year_id) in a CTE
  • Use this result for roll-up and drill-down.

SELECT name_first, name_last, year_id,
       MIN(hr), MAX(hr), AVG(hr), STDDEV(hr), SUM(hr)
FROM batting b, people p
WHERE b.player_id = p.player_id
GROUP BY name_last, name_first, year_id
ORDER BY max DESC
LIMIT 10;


Demo

23 of 63

Let’s construct the CTE

  • What pieces of information do we need from the hierarchy?
  • How do we write our CTE?


baseball=# \d batting
          Table "public.batting"
  Column   |         Type          |
-----------+-----------------------+
 player_id | character varying(10) |
 year_id   | integer               |
 stint     | integer               |
 team_id   | character varying(3)  |
 lg_id     | character varying(2)  |
 g         | integer               |
 ab        | integer               |
 r         | integer               |
 h         | integer               |
 h2b       | integer               |
 h3b       | integer               |
 hr        | integer               |
 rbi       | integer               |
 sb        | integer               |
 cs        | integer               |
 bb        | integer               |
 so        | integer               |
 ibb       | integer               |
 hbp       | integer               |
 sh        | integer               |
 sf        | integer               |
 gidp      | integer               |
Indexes:
    "batting_pkey" PRIMARY KEY, btree (player_id, year_id, stint)

     View "public.bball_tree"
  Column   |         Type
-----------+-----------------------
 player_id | character varying(10)
 team_id   | character varying(3)
 div_id    | character varying(1)
 lg_id     | character varying(2)
 year_id   | integer

šŸ¤”

Demo


25 of 63

Roll-ups & drill-downs on an explicit hierarchy

Set up for roll up/drill down on bball_tree hierarchy.

  • Join each (raw) person-year in batting with the associated `bball_tree` entry by (player_id, year_id) in a CTE
  • Use this result for roll-up and drill-down.

WITH batting_tree AS (
    SELECT b.*, t.div_id
    FROM batting b, bball_tree t
    WHERE b.player_id = t.player_id
      AND b.year_id = t.year_id )
SELECT name_first, name_last,
       bt.team_id, bt.lg_id, bt.div_id, bt.year_id,
       MIN(hr), MAX(hr), AVG(hr), STDDEV(hr), SUM(hr)
FROM batting_tree bt, people p
WHERE bt.player_id = p.player_id
GROUP BY bt.player_id, bt.team_id, bt.lg_id, bt.div_id, bt.year_id, name_last, name_first
ORDER BY max DESC
LIMIT 10;


Demo

26 of 63

Announcements

  • Midterm logistics are out!
    • Let us know ASAP if you have any questions, concerns, etc.
  • Project 2 is due on 3/5 at 5pm!
  • Homework 3 was released on 2/28 and is due on 3/19


27 of 63


Data Preparation III

28 of 63

Outlier Detection and Handling

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

29 of 63

Onto more Data Transformations: Principles to practice

Instead of ā€œdata cleaning,ā€ think data transformation:

  • We are inherently imposing a model over our data.
    • Moves away from the ā€œhygieneā€/ā€œcleaningā€ analogy!
  • As a result, record input, output, and the data lineage of how the output is computed.


Example: Embrace simple metadata.

  • If transforming just one column
    • keep the original column
    • add a derived column to the right, and name it something meaningful (see the sketch after this list)
  • If transforming much of a dataset
    • create a new derived dataset stored ā€œnearā€ the original
      • Examples: same filesystem directory, same database schema, same git repo, etc.
      • Name the derived dataset in a way that hints at its lineage
  • In all cases, keep the transformation code and treat it like source code:
    • Manage it, version control it, document it
    • Keep it in the same repository/toolchain as the data itself
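For instance, a rough sketch of both conventions on the national table (the derived column and table names here are illustrative, not from the lecture):

-- Transforming one column: keep the original and add a clearly named derived column.
ALTER TABLE national ADD COLUMN elev_in_ft_derived double precision;
UPDATE national SET elev_in_ft_derived = elev_in_m * 3.28084;

-- Transforming much of the dataset: create a derived table "near" the original,
-- named to hint at its lineage.
CREATE TABLE national_nonnull_elev AS
SELECT * FROM national WHERE elev_in_m IS NOT NULL;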


30 of 63

Outlier Detection and Handling: Terminology

An outlier is a value ā€œfar enoughā€ from ā€œaverage.ā€

  • Detecting outliers helps us identify aberrations with respect to a model.
  • Handling our outliers helps us model the majority of the data
    • Note: We should always acknowledge these outliers in our assumptions and our reports!

Every strategy of detecting/handling outliers needs two statistical measures on a set of values:

  • Center: what is ā€œaverageā€.
    • e.g., mean, median, etc.
  • Dispersion: what is ā€œfarā€ from average.
    • e.g., standard deviation, variance, interquartile range (IQR), etc.


Let’s explore the robustness of three outlier handling strategies:

  • What is the measure of dispersion?
  • How large are masking effects?
  • What is the worst-case analysis?

(formal definition of robustness to follow)

We say masking occurs when an outlier influences the dispersion measure and impacts our ability to detect other outliers.


31 of 63

Common Outlier Handling Strategies

  1. Dropping Gaussian outliers
  2. Trimming outliers based on percentiles
  3. Winsorizing, i.e., replacing outliers with percentile statistics


32 of 63

Gaussian Outliers

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

33 of 63

Gaussian Outliers

Model data as a normal (Gaussian) distribution.

  • Center: mean
  • Dispersion unit: standard deviation
  • Define a Gaussian outlier as a value
    • more than 2 standard deviations from the mean.
    • In percentiles: roughly below p2.5 and above p97.5.


(Plot legend: gray = cleaned/truncated distribution.)

In practice:

  1. Assume a Gaussian distribution with mean, stdev from the data.
  2. Find the Gaussian outliers based on #1.
  3. Construct a ā€œcleanedā€ distribution (with outliers dropped).


34 of 63

Demo: Assuming Gaussian Data

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.


%sql DROP TABLE IF EXISTS observations CASCADE;

%sql CREATE TABLE observations AS \
     SELECT normal_rand AS x \
     FROM normal_rand(1000, 50, 5);

results = %sql SELECT x FROM observations
sns.displot(results.dict(), fill=True, kde=True, bins=20)

35 of 63

Demo: Assuming Gaussian Data

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers and store them in a view normal_outliers.


%%sql
CREATE OR REPLACE VIEW normal_outliers AS
WITH bounds AS (
    SELECT avg(x) - 2*stddev(x) AS lo,
           avg(x) + 2*stddev(x) AS hi
    FROM observations
)
SELECT x AS outlier
FROM observations o, bounds b
WHERE x NOT BETWEEN b.lo AND b.hi;

(bounds has a single record, therefore lo/hi are scalars)

36 of 63

[Exercise] Cleaned vs. Original?

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers and store them in a view normal_outliers.


results = %sql SELECT x, 'original' AS label \
               FROM observations \
               UNION ALL \
               SELECT x, 'cleaned' AS label \
               FROM observations \
               WHERE x NOT IN \
                     (SELECT * FROM normal_outliers)

sns.displot(results.dict(), x="x", kind='hist',
            hue='label', kde=True, bins=20)
plt.xlim(30, 70)

What does this code do?

šŸ¤”

A. / B. / C. / D. (answer choices shown as images on the slide)

E. Something else/error


38 of 63

[Exercise] Cleaned vs. Original?

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers and store them in a view normal_outliers.


(Same query and plotting code as on the previous slide.)

The cleaned distribution, with fewer extremes, has lower dispersion.

39 of 63

Demo: Breaking the Gaussian Assumption

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers, store in view normal_outliers.

Construct a cleaned dataset from observations and plot it alongside the original.

What if we corrupt just one value to be very large?

  • Data becomes right-skewed (no longer normal)
  • Gaussian outlier detection suddenly becomes fragile


## corrupt one value
%sql UPDATE observations SET x = x*10 \
     WHERE x = (SELECT MAX(x) FROM observations);

The 10x value is masking our previous outliers.

One value is dragging the mean and standard deviation super far!

40 of 63

Trimming and Winsorizing

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

41 of 63

Common Outlier Handling Strategies

  1. Dropping Gaussian outliers
     • Simple to implement
     • However, the definitions of dispersion (and center) are not robust; they are sensitive to the distribution of the outliers themselves.
  2. Trimming outliers based on percentiles
  3. Winsorizing, i.e., replacing outliers with percentile statistics


42 of 63

Common Outlier Handling Strategies

  1. Dropping Gaussian outliers
  2. Trimming outliers based on percentiles
     • Define outliers by order statistics, i.e., percentiles.
     • k% trimming drops both k% tails of the distribution.
     • Example: a 1% trimmed distribution drops values below p1 and above p99 (see the sketch below).
  3. Winsorizing, i.e., replacing outliers with percentile statistics
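A sketch of 1% trimming in SQL, using the observations table from the demo later in this lecture (not the exact demo code):

WITH cutoffs AS (
  SELECT percentile_cont(0.01) WITHIN GROUP (ORDER BY x) AS lo,
         percentile_cont(0.99) WITHIN GROUP (ORDER BY x) AS hi
  FROM observations
)
SELECT o.x          -- keep only the middle 98% of values
FROM observations o, cutoffs c
WHERE o.x BETWEEN c.lo AND c.hi;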


43 of 63

Common Outlier Handling Strategies

  1. Dropping Gaussian outliers
  2. Trimming outliers based on percentiles
  3. Winsorizing, i.e., replacing outliers with percentile statistics
     • Define outliers by order statistics, i.e., percentiles.
     • k% winsorization replaces the tails with the k-th and (100āˆ’k)-th percentile values.
     • Example: 5% winsorization (see the sketch below)
       • Replace values below p5 with the p5 value.
       • Replace values above p95 with the p95 value.
       • Note: sometimes defined as 90% winsorization, since the middle 90% of values is left unchanged.
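A corresponding sketch of 5% winsorization, clamping both tails to the p5/p95 values (again on the demo's observations table; not the exact demo code):

WITH cutoffs AS (
  SELECT percentile_cont(0.05) WITHIN GROUP (ORDER BY x) AS p05,
         percentile_cont(0.95) WITHIN GROUP (ORDER BY x) AS p95
  FROM observations
)
SELECT LEAST(GREATEST(o.x, c.p05), c.p95) AS x_winsorized
FROM observations o, cutoffs c;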


44 of 63

[Exercise] Understanding Winsorization

Consider this dataset which has mean 101.5 and median 68.0.

{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, āˆ’40, 101, 86, 85, 15, 89, 89, 28, āˆ’5, 41}

Compute a 90% winsorization:

  • 95th percentile (nearest) value: 101
  • 5th percentile (nearest) value: -5


  • Which values get replaced? With what?
  • What are the statistics of the winsorized data, compared to the original?

šŸ¤”

A. Mean bigger, median bigger

B. Mean bigger, median smaller

C. Mean smaller, median bigger

D. Mean smaller, median smaller

E. Mean same, median different

F. Mean different, median same

G. Mean same, median same

H. Something else



46 of 63

[Exercise] Understanding Winsorization

Consider this dataset which has mean 101.5 and median 68.0.

{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, āˆ’40, 101, 86, 85, 15, 89, 89, 28, āˆ’5, 41}

Compute a 90% winsorization:

  • 95th percentile (nearest) value: 101
  • 5th percentile (nearest) value: -5

Winsorized data:

{92, 19, 101, 58, 101, 91, 26, 78, 10, 13, āˆ’5, 101, 86, 85, 15, 89, 89, 28, āˆ’5, 41}


  • mean-median difference way smaller now!
  • mean becomes more reflective of the dataset!
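A quick arithmetic check (mine, not on the slide):

original:   sum = 2030, mean = 2030 / 20 = 101.5; median = (58 + 78) / 2 = 68.0
winsorized: sum = 2030 - 1053 + 101 - (-40) + (-5) = 1113, mean = 1113 / 20 = 55.65;
            the 10th and 11th sorted values are still 58 and 78, so the median stays 68.0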


47 of 63

Demo: Trimming vs. Winsorizing


Even with our giant outlier, we still identify outliers based on the order (percentile) and not mean/SD, so we trim successfully.

  • More robust than SD-based.
  • Minor masking on the right, but not nearly as bad.

Winsorizing keeps the probability mass of the tails (clamped to the cutoff values) instead of dropping it

  • Usually not a big difference from trimming
  • Mean stays roughly same, but stddev changes
  • Winsorizing preferred to Trimming if something downstream forbids NULL

(Plot annotations: 5% trimmed mean, 5% winsorized mean.)


48 of 63

Robustness

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

49 of 63

Robustness and Breakdown Points

Robustness is a worst-case analysis of an estimator (in this case, the statistics used to capture outlier detection).

  • Consider an adversary that can ā€œcorruptā€ data values to arbitrary values.
  • The breakdown point of an estimator is the largest fraction of corrupted values it can tolerate before the result can be made arbitrarily incorrect (e.g., arbitrarily large).
  • For data preparation: knowing the breakdown point determines when masking can occur.

Ex1 Breakdown point of the mean:

  • 0.
  • A single corrupted value can drag the entire dataset's mean arbitrarily far, if we make it large enough.


50 of 63

Robustness and Breakdown Points

(Robustness and the breakdown point are defined as on the previous slide.)

Ex2 Breakdown point of the 1% trimmed mean:

  • 1%.
  • 1% of values on each side will be trimmed out anyway, so an adversary adjusting 1% will not drastically impact the estimator.
  • Anything beyond 1% (e.g., 1% plus one extra value) will break down the estimator.


What k% gives us a maximally robust trimmed mean?

šŸ¤”



52 of 63

Robustness and Breakdown Points

(Robustness and the breakdown point are defined as on the previous slides.)

���What k% gives us a maximally robust k% trimmed mean?

  • 50%. Note: 50% trimmed mean produces the median.


The median of any distribution is maximally robust; it can handle up to 50% corruption of the data.


53 of 63

Robust Estimators of a Distribution

Recall that we are estimating the center and dispersion of our data.


Dispersion

Maximally robust: Median Absolute Deviation (MAD)

  • For a dataset X, define MAD(X) = median of |x āˆ’ median(X)| over all x in X.
  • (proof of maximal robustness: intuition, but left to you)

Center

Maximally robust: Median

  • (see previous slide)


54 of 63

Robust Estimators of a Distribution

Recall that we are estimating the center and dispersion of our data.


Other common robust measures of center:

  • k% trimmed mean
  • k% winsorized mean (why?)

Other common robust measures of dispersion:

  • Interquartile Range (IQR), p75 - p25

Related: ā€œTukey numbersā€ are a five-number summary of univariate numeric data:

  • minimum, p25, median, p75, maximum
  • center: median (robust)
  • spread: IQR (robust); min/max (not robust)
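A sketch of computing the five-number summary and IQR in SQL, on the demo's observations table:

SELECT MIN(x) AS minimum,
       percentile_cont(0.25) WITHIN GROUP (ORDER BY x) AS p25,
       percentile_cont(0.50) WITHIN GROUP (ORDER BY x) AS median,
       percentile_cont(0.75) WITHIN GROUP (ORDER BY x) AS p75,
       MAX(x) AS maximum,
       percentile_cont(0.75) WITHIN GROUP (ORDER BY x)
         - percentile_cont(0.25) WITHIN GROUP (ORDER BY x) AS iqr
FROM observations;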

Center

Maximally robust: Median

  • (see previous slide)

Dispersion

Maximally robust: Median Absolute Deviation (MAD)

(see previous slide)


55 of 63

Computing MAD in SQL

(for you to review)


%%sql
-- percentile_disc returns an actual data value near the percentile (discrete)
-- percentile_cont returns an interpolated value at the percentile (continuous)
CREATE OR REPLACE VIEW median AS
(SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY x) AS median
 FROM observations);

%%sql
CREATE OR REPLACE VIEW mad AS
WITH absdevs AS
  (SELECT abs(x - median) AS d
   FROM observations, median)
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY d) AS mad
FROM absdevs;

SELECT median, mad
FROM median, mad;


Demo

56 of 63

Metrics for Outliers: Hampel X84, Model-based

Roll Up / Drill Down

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond ā€œFill Downā€ Imputation


Lecture 13, Data 101, Spring 2025

57 of 63

Metrics for outliers

How much should we trim or winsorize? You may know a few common metrics:

ā€œ2 sigmaā€: Trim Gaussian outliers

  • Center: mean
  • Dispersion unit: standard deviation
  • Trim: 2*standard deviations from the mean

Quartiles: Trim bottom 25% and top 25%

  • But is there an analogous way to trim on a robust ā€œdispersion unitā€ of spread?


58 of 63

Metrics for outliers


Let’s translate this normal estimation notion of ā€œstandard unitā€ into robust estimators:

  • Center: median
  • Dispersion unit: MAD
  • Trim: 2*1.4826 MADs from the median

For a standard Normal (mean 0, stddev 1), 1 stddev = 1.4826 MADs. Try it at home!

This is known as the Hampel X84 robust outlier detection technique.
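One way to ā€œtry it at homeā€ (a sketch using tablefunc's normal_rand on a fresh, genuinely normal sample; for such data stddev should come out close to 1.4826 Ɨ MAD):

WITH sample AS (SELECT normal_rand AS x FROM normal_rand(100000, 0.0, 1.0)),
     med AS (SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x) AS m FROM sample),
     madv AS (SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY abs(x - m)) AS mad
              FROM sample, med)
SELECT (SELECT stddev(x) FROM sample) AS sd,    -- should be roughly 1.0
       1.4826 * mad AS sd_from_mad              -- should be roughly the same
FROM madv;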


59 of 63

More reading: Joe Hellerstein’s work

Hellerstein, J.M. (2008) Quantitative Data Cleaning for Large Databases. United Nations Economic Commission for Europe (UNECE). [link]


Paper [link], MATLAB context: hampel [link]

Joe Hellerstein

UC Berkeley EECS Professor

Original designer of Data 101 (Spring 2021)!

Data engineering involves lots of arbitrary (but reasonable!) metrics. This is one of them!

60 of 63

Let’s redo our outliers with Hampel x84


%%sql
CREATE OR REPLACE VIEW hampelx84x2_observations AS (
    SELECT o.x, 'hampelx84x2' AS label
    FROM observations o, median, mad
    WHERE o.x BETWEEN (median - 2*1.4826*mad)
                  AND (median + 2*1.4826*mad)
    UNION ALL
    SELECT o.x, 'orig' AS label
    FROM observations o
);

CREATE OR REPLACE VIEW Hampel84x2_outliers AS (
    SELECT x
    FROM observations o, median, mad
    WHERE x NOT BETWEEN (median - 2*1.4826*mad)
                    AND (median + 2*1.4826*mad)
);

Demo

61 of 63

Let’s redo our outliers with Hampel x84


results = %sql SELECT * FROM hampelx84x2_observations
sns.displot(results.dict(), x="x", kind='hist',
            hue='label', kde=True, bins=np.linspace(32, 70, 20),
            height=4, aspect=1.5, rug=True)
plt.xlim(30, 70)

Demo

62 of 63

Summary: Outliers

Outliers depend on some measure of center and spread.

  • The normal distribution (and 2 SD truncation) gives nice intuition, but it is not robust.
  • Robustness leads us to want a high breakdown point (maximum 50%).
    • Order statistics like percentiles are robust and lead to outlier handling techniques like trimming and winsorizing.
    • Median, MAD are maximally robust estimators of center and dispersion, respectively.
  • Hampel X84: a robust outlier metric that incorporates dispersion.


63 of 63

[Beyond scope] Outlier detection with model fitting

If we fit multi-dimensional data to a model, do we have more metrics for detecting ā€œoutliersā€?

  • In linear regression:
    • Squared (L2) residuals, e.g., |predicted āˆ’ actual value|²
    • OLS assumes that your residuals are normally distributed
  • You can use the same outlier metrics, but this time applied to model residuals
  • We’ll stop here; anything further is down the path of data analysis, not engineering
