1 of 51

Data Preparation III: Granularity (cont.) and Outliers

October 10, 2023

Data 101, Fall 2023 @ UC Berkeley

Lisa Yan https://fa23.data101.org/


LECTURE 14


3 of 51

Roll Up, Drill Down, Roll Up, Drill Down…

[from last time]

Demo 1: Roll-Ups and Drill-Downs

  • Start with county-level
  • Roll up to state level
  • Drill down to state level, feature class
  • Roll up again to feature class
  • Drill down to state numeric, county numeric

Demo 2: Connections to Statistics

  • Roll-up with marginal distributions
  • Drill-down with normally distributed elevations:
    • Start from state level
    • Join with all counties
    • Simulate each county’s elevation from a normally distributed variable using the state statistics
    • Note: Need to CREATE EXTENSION tablefunc; for the normal_rand function
    • Note 2: The LATERAL keyword in FROM lets us reference columns from earlier FROM items (see the sketch below). [postgres docs 7.2.1.5 link]
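
A minimal sketch of that drill-down simulation; state_stats(state, avg_elev, stddev_elev) and counties(state, county) are hypothetical stand-ins for the GNIS relations used in the demo, but the LATERAL + normal_rand pattern is the same:

-- Hypothetical sketch: drill down from per-state statistics to one simulated
-- elevation per county, drawn from Normal(state mean, state stddev).
CREATE EXTENSION IF NOT EXISTS tablefunc;   -- provides normal_rand

SELECT c.state, c.county, sim.elev AS simulated_elev
FROM state_stats s
     JOIN counties c USING (state),
     LATERAL normal_rand(1, s.avg_elev, s.stddev_elev) AS sim(elev);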


In this video game, the character Kirby rolls up (inflates) and drills down (deflates).

Demo

4 of 51

Implementing Explicit Hierarchies

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond “Fill Down” Imputation

[from last time]

5 of 51

Do we need to make our own hierarchies? Yes, often so!

The GNIS dataset was great because the hierarchy was embedded into the relational schema. If we don’t have this, then we have to assemble an explicit hierarchy.


Take the baseball database, for example. Suppose that in baseball, in a given year (*Note: Lisa does not follow baseball):

  • Players play for Teams
  • Teams are in Divisions
  • Divisions are in Leagues

If we want to do roll-ups and drill-downs on this hierarchy, we need player-year data that looks like:

(playerid, teamid, divid, leagueid, yearid, thing_we_want_to_aggregate)

Let’s transform our data!

Demo


6 of 51

Assembling the Explicit Hierarchy


Two relations have components of the hierarchy that we want: Appearances and Teams

Let’s use these relations to generate a hierarchy view, bball_tree.

If we want to do roll-ups and drill-downs on this hierarchy, we need player-year data that looks like:

(playerid, teamid, divid, leagueid, yearid, thing_we_want_to_aggregate)

Demo


7 of 51

Assembling the Explicit Hierarchy

CREATE OR REPLACE VIEW bball_tree AS
SELECT DISTINCT a.playerid, a.teamid, t.divid, a.lgid, a.yearid
FROM Appearances a NATURAL JOIN Teams t;


Two relations have components of the hierarchy that we want: Appearances and Teams

Let’s use these relations to generate a hierarchy view, bball_tree.

If we want to do roll-ups and drill-downs on this hierarchy, we need player-year data that looks like:

(playerid, teamid, divid, leagueid, yearid, thing_we_want_to_aggregate)

When making an explicit hierarchy, each record represents a datum at the smallest granularity (player-years in this example).

Demo


8 of 51

Performing roll-ups and drill-downs on an explicit hierarchy

Recall our homerun query:

Let’s rewrite this query for roll-up/drill-down on the bball_tree hierarchy.

  1. Join each (raw) person-year in batting with the associated bball_tree entry by (playerid, yearid) in a CTE.
  2. Use this result for roll-up and drill-down.

SELECT namefirst, namelast, yearid,
       MIN(hr), MAX(hr), AVG(hr), STDDEV(hr), SUM(hr)
FROM batting b, people p
WHERE b.playerid = p.playerid
GROUP BY namelast, namefirst, yearid
ORDER BY max DESC
LIMIT 10;

Demo

9 of 51

1. Construct the roll-up/drill-down CTE

  • What pieces of information do we need from the hierarchy?
  • (no slido) How do we write our CTE?


Table "public.batting"
  Column  |          Type
----------+------------------------
 playerid | character varying(10)
 yearid   | integer
 stint    | integer
 teamid   | character varying(3)
 lgid     | character varying(2)
 g        | integer
 ab       | integer
 r        | integer
 h        | integer
 h2b      | integer
 h3b      | integer
 hr       | integer
 rbi      | integer
 sb       | integer
 cs       | integer
 bb       | integer
 so       | integer
 ibb      | integer
 hbp      | integer
 sh       | integer
 sf       | integer
 gidp     | integer
Indexes:
    "batting_pkey" PRIMARY KEY, btree (playerid, yearid, stint)

View "public.bball_tree"
  Column  |          Type
----------+------------------------
 playerid | character varying(10)
 teamid   | character varying(3)
 divid    | character varying(1)
 lgid     | character varying(2)
 yearid   | integer

🤔

Demo



11 of 51

1. Construct the roll-up/drill-down CTE


(batting and bball_tree schemas repeated from the earlier slide)

WITH batting_tree AS (
    SELECT b.*, t.divid
    FROM batting b, bball_tree t
    WHERE b.playerid = t.playerid
      AND b.yearid = t.yearid
      AND b.teamid = t.teamid  -- to account for trades in-year
)
SELECT …

Demo


12 of 51

2. Write the full query

Rewrite the query to roll up/drill down on the bball_tree hierarchy.

  • Join each (raw) person-year in batting with the associated bball_tree entry by (playerid, yearid) in a CTE.
  • Use this result for roll-up and drill-down.

WITH batting_tree AS (
    SELECT b.*, t.divid
    FROM batting b, bball_tree t
    WHERE b.playerid = t.playerid
      AND b.yearid = t.yearid
      AND b.teamid = t.teamid  -- trades within year
)
SELECT namefirst, namelast,
       bt.teamid, bt.lgid, bt.divid, bt.yearid,
       MIN(hr), MAX(hr), AVG(hr), STDDEV(hr), SUM(hr)
FROM batting_tree bt, people p
WHERE bt.playerid = p.playerid
GROUP BY bt.playerid, bt.teamid, bt.lgid, bt.divid, bt.yearid, namelast, namefirst
ORDER BY max DESC
LIMIT 10;
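
For illustration, a roll-up of the same aggregates to the division level only needs a coarser GROUP BY over the CTE; a minimal sketch (the demo's exact roll-up queries may differ):

WITH batting_tree AS (
    SELECT b.*, t.divid
    FROM batting b, bball_tree t
    WHERE b.playerid = t.playerid
      AND b.yearid = t.yearid
      AND b.teamid = t.teamid
)
-- Roll up: aggregate home runs per (league, division, year) instead of per player-year.
SELECT lgid, divid, yearid,
       MIN(hr), MAX(hr), AVG(hr), STDDEV(hr), SUM(hr)
FROM batting_tree
GROUP BY lgid, divid, yearid
ORDER BY yearid, lgid, divid;

Drilling back down is the reverse: add columns (e.g., teamid, playerid) back into the SELECT and GROUP BY.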

Demo

13 of 51

Announcements

Professor Yan’s office hours, just for today (Tuesday 10/10), are 2-3pm outside around the Hearst Mining Circle (not Warren)

PSA: COVID-19 / flu boosters

  • walk-in at Kaiser (Kaiser HMO only)
  • appointments at CVS, Safeway (most insurances accepted)
  • appointments via UHS/eTang for students (soon)

UC policy compliance needed by 12/1/23

  • more info / UHS flu vaccine clinics: link
  • UHS COVID-19 vaccine availability: link


14 of 51

Data Preparation III: Outliers

October 10, 2023

Data 101, Fall 2023 @ UC Berkeley

Lisa Yan https://fa23.data101.org/


LECTURE 14

15 of 51

Outlier Detection and Handling

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond “Fill Down” Imputation


16 of 51

On to more Data Transformations: Principles to practice

Instead of “data cleaning,” think data transformation:

  • We are inherently imposing a model over our data.
    • Moves away from the “hygiene”/“cleaning” analogy!
  • As a result, record input, output, and the data lineage of how the output is computed.


Example: Embrace simple metadata.

  • If transforming just one column
    • keep the original column
    • Add a derived column to the right, and name it something meaningful (see the sketch after this list)
  • If transforming much of a dataset
    • create a new derived dataset stored “near” the original
      • Examples: same filesystem directory, same database schema, same git repo, etc.
      • Name the derived dataset in a way that hints at its lineage
  • In all cases, keep the transformation code and treat it like source code:
    • Manage it, version control it, document it
    • Keep it in the same repository/toolchain as the data itself
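
A minimal sketch of the derived-column convention; the table and column names (measurements, temp_f) are hypothetical, and the derived column's name is chosen to hint at its lineage:

-- Keep the original column; publish a derived dataset "near" the original
-- (same schema), named so that its lineage is obvious.
CREATE VIEW measurements_with_celsius AS
SELECT m.*,
       (m.temp_f - 32) * 5.0 / 9.0 AS temp_c_from_temp_f   -- derived column; original kept
FROM measurements m;

The view's defining query is the transformation code itself, so it can be versioned and documented alongside the data.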


17 of 51

Outlier Detection and Handling: Terminology

An outlier is a value “far enough” from “average.”

  • Detecting outliers helps us identify aberrations with respect to a model.
  • Handling our outliers helps us model the majority of the data
    • Note: We should always acknowledge these outliers in our assumptions and our reports!

Every strategy of detecting/handling outliers needs two statistical measures on a set of values (a small SQL sketch follows this list):

  • Center: what is “average.”
    • e.g., mean, median, etc.
  • Dispersion: what is “far” from average.
    • e.g., standard deviation, variance, interquartile range (IQR), etc.
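
For example, a few of these measures in Postgres, assuming the observations(x) table created in the demos later in this lecture:

SELECT avg(x)                                          AS mean,
       percentile_cont(0.5)  WITHIN GROUP (ORDER BY x) AS median,
       stddev(x)                                       AS std_dev,
       percentile_cont(0.75) WITHIN GROUP (ORDER BY x)
         - percentile_cont(0.25) WITHIN GROUP (ORDER BY x) AS iqr
FROM observations;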


Let’s explore the robustness of three outlier handling strategies:

  • What is the measure of dispersion?
  • How large are the masking effects?
  • What is the worst-case analysis?

(formal definition of robustness to follow)

We say masking occurs when an outlier influences the dispersion measure and impacts our ability to detect other outliers.


18 of 51

Common Outlier Handling Strategies

  1. Dropping Gaussian outliers
  2. Trimming outliers based on percentiles
  3. Winsorizing, i.e., replacing outliers with percentile statistics


19 of 51

Gaussian Outliers

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond “Fill Down” Imputation


20 of 51

Gaussian Outliers

Model data as a normal (Gaussian) distribution.

  • Center: mean
  • Dispersion unit: standard deviation
  • Define a Gaussian outlier as
    • ±2 standard deviations from the mean.
    • In percentiles: below p2.5 and above p97.5.


gray: cleaned (truncated)

In practice:

  1. Assume a Gaussian distribution with mean, stddev from the data.
  2. Find the Gaussian outliers based on #1.
  3. Construct a “cleaned” distribution (with outliers dropped).


21 of 51

Demo: Assuming Gaussian Data

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.


%sql DROP TABLE IF EXISTS observations CASCADE;
%sql CREATE TABLE observations AS \
     SELECT normal_rand AS x \
     FROM normal_rand(1000, 50, 5);

results = %sql SELECT x FROM observations
sns.displot(results.dict(), fill=True, kde=True, bins=20)


22 of 51

Demo: Assuming Gaussian Data

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers and store them in a view normal_outliers.


%%sql
CREATE OR REPLACE VIEW normal_outliers AS
WITH bounds AS (
    SELECT avg(x) - 2*stddev(x) AS lo,
           avg(x) + 2*stddev(x) AS hi
    FROM observations
)
SELECT x AS outlier
FROM observations o, bounds b
WHERE x NOT BETWEEN b.lo AND b.hi;

(bounds has a single record, therefore lo/hi are scalars)


23 of 51

[Exercise] Cleaned vs. Original?

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers and store them in a view normal_outliers.


results = %sql SELECT x, 'original' AS label \
          FROM observations \
          UNION ALL \
          SELECT x, 'cleaned' AS label \
          FROM observations \
          WHERE x NOT IN \
          (SELECT * FROM normal_outliers)

sns.displot(results.dict(), x="x", kind='hist',
            hue='label', kde=True, bins=20)
plt.xlim(30, 70)

What does this code do?

🤔

A.-D. [answer choices shown on the slide]

E. Something else/error



25 of 51

[Exercise] Cleaned vs. Original?

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers and store them in a view normal_outliers.


(same query and plotting code as on the previous slide)

The cleaned distribution, with fewer extremes, has lower dispersion.


26 of 51

Demo: Breaking the Gaussian Assumption

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers and store them in a view normal_outliers.

Construct a cleaned dataset from observations and plot alongside the original.

What if we corrupt just one value to be very large?

  • Data becomes right-skewed (no longer normal)
  • Gaussian outlier detection suddenly becomes fragile


## corrupt one value
%sql UPDATE observations SET x = x*10 \
     WHERE x = (SELECT MAX(x) FROM observations);

The 10x value is masking our previous outliers.

One value is dragging the mean and standard deviation super far!
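
One quick way to see the masking (a sketch; exact numbers depend on the random draw): re-inspect the mean/stddev bounds and the outlier count after the corruption. Because normal_outliers is a view, it recomputes the now-inflated bounds and flags far fewer points.

-- The bounds widen dramatically once one value is 10x too large...
SELECT avg(x) AS mean, stddev(x) AS std_dev,
       avg(x) - 2*stddev(x) AS lo,
       avg(x) + 2*stddev(x) AS hi
FROM observations;

-- ...so the view now catches far fewer outliers than before.
SELECT count(*) AS n_outliers FROM normal_outliers;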


27 of 51

Trimming and Winsorizing

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond “Fill Down” Imputation


28 of 51

Common Outlier Handling Strategies

  1. Dropping Gaussian outliers
     • Simple to implement
     • However, the definitions of dispersion (and center) are not robust and are sensitive to the distribution of the outliers themselves.

  2. Trimming outliers based on percentiles

  3. Winsorizing, i.e., replacing outliers with percentile statistics


29 of 51

Common Outlier Handling Strategies

  1. Dropping Gaussian outliers
     • Simple to implement
     • However, the definitions of dispersion (and center) are not robust and are sensitive to the distribution of the outliers themselves.

  2. Trimming outliers based on percentiles
     • Define outliers by order statistics, i.e., percentiles
     • k% trimming drops both k% tails of the distribution.
     • Example: a 1% trimmed distribution drops values below p1 and above p99

  3. Winsorizing, i.e., replacing outliers with percentile statistics
     • Define outliers by order statistics, i.e., percentiles
     • k% winsorization replaces the tails with the k- and (100−k)-percentile values
     • Example: 5% winsorization
       • Replace values below p5 with the p5 value
       • Replace values above p95 with the p95 value
       • Note: sometimes defined as 90% winsorization (why?)

A SQL sketch of trimming and winsorizing follows this list.
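
Using percentile_cont for the 5% cutoffs on observations(x); the demo's exact implementation may differ:

-- 5% trimming: drop values outside [p5, p95].
WITH cutoffs AS (
    SELECT percentile_cont(0.05) WITHIN GROUP (ORDER BY x) AS lo,
           percentile_cont(0.95) WITHIN GROUP (ORDER BY x) AS hi
    FROM observations
)
SELECT o.x
FROM observations o, cutoffs c
WHERE o.x BETWEEN c.lo AND c.hi;

-- 5% winsorizing: clamp values into [p5, p95] instead of dropping them.
WITH cutoffs AS (
    SELECT percentile_cont(0.05) WITHIN GROUP (ORDER BY x) AS lo,
           percentile_cont(0.95) WITHIN GROUP (ORDER BY x) AS hi
    FROM observations
)
SELECT LEAST(GREATEST(o.x, c.lo), c.hi) AS x_winsorized
FROM observations o, cutoffs c;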


30 of 51

[Exercise] Understanding Winsorization

Consider this dataset which has mean 101.5 and median 68.0.

{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, −40, 101, 86, 85, 15, 89, 89, 28, −5, 41}

Compute a 90% winsorization:

  • 95th percentile (nearest) value: 101
  • 5th percentile (nearest) value: -5


  • (no slido) Which values get replaced? With what?
  • What are the statistics of the winsorized data, compared to the original?

🤔

A. Mean bigger, median bigger

B. Mean bigger, median smaller

C. Mean smaller, median bigger

D. Mean smaller, median smaller

E. Mean same, median different

F. Mean different, median same

G. Mean same, median same

H. Something else



32 of 51

[Exercise] Understanding Winsorization

Consider this dataset which has mean 101.5 and median 68.0.

{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, −40, 101, 86, 85, 15, 89, 89, 28, −5, 41}

Compute a 90% winsorization:

  • 95th percentile (nearest) value: 101
  • 5th percentile (nearest) value: -5

Winsorized data:

{92, 19, 101, 58, 101, 91, 26, 78, 10, 13, −5, 101, 86, 85, 15, 89, 89, 28, −5, 41}


  • The mean-median difference is much smaller now!
  • The mean becomes more reflective of the dataset! (A quick SQL check follows.)
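
A quick check of those statistics in SQL, using the values from the slide (percentile_cont(0.5) gives the interpolated median):

WITH original(x) AS (
    VALUES (92), (19), (101), (58), (1053), (91), (26), (78), (10), (13),
           (-40), (101), (86), (85), (15), (89), (89), (28), (-5), (41)
),
winsorized(x) AS (
    -- 90% winsorization: clamp every value into [p5, p95] = [-5, 101]
    SELECT LEAST(GREATEST(x, -5), 101) FROM original
)
SELECT 'original' AS label, avg(x) AS mean,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY x) AS median
FROM original
UNION ALL
SELECT 'winsorized', avg(x),
       percentile_cont(0.5) WITHIN GROUP (ORDER BY x)
FROM winsorized;

For these 20 values this reports mean 101.5 / median 68 originally and mean 55.65 / median 68 after winsorizing: the mean moves a lot, the median not at all.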


33 of 51

Demo: Trimming vs. Winsorizing


Even with our giant outlier, we still identify outliers based on order statistics (percentiles), not the mean/SD, so we trim successfully.

  • More robust than SD-based.
  • Minor masking on the right, but not nearly as bad.

Winsorizing preserves probability density of the tails

  • Usually not a big difference from trimming
  • The mean stays roughly the same, but the stddev changes
  • Winsorizing is preferred to trimming if something downstream forbids NULLs

[Demo plot: 5% trimmed mean vs. 5% winsorized mean]
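
A sketch of the 5% trimmed mean and 5% winsorized mean over observations(x); the demo may compute them differently:

WITH cutoffs AS (
    SELECT percentile_cont(0.05) WITHIN GROUP (ORDER BY x) AS lo,
           percentile_cont(0.95) WITHIN GROUP (ORDER BY x) AS hi
    FROM observations
)
SELECT avg(CASE WHEN x BETWEEN c.lo AND c.hi THEN x END) AS trimmed_mean_5pct,
       avg(LEAST(GREATEST(x, c.lo), c.hi))               AS winsorized_mean_5pct
FROM observations, cutoffs c;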


34 of 51

Robustness

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond “Fill Down” Imputation


35 of 51

Robustness and Breakdown Points

Robustness is a worst-case analysis of an estimator (in this case, the statistics used to capture outlier detection).

  • Consider an adversary that can “corrupt” data values to arbitrary values.
  • The breakdown point of an estimator is the smallest fraction of corrupted values an estimator can handle before it produces an incorrect (e.g., arbitrarily large) result.
  • For data preparation: knowing the breakdown point determines when masking can occur.

Ex1 Breakdown point of the mean:

  • 0.
  • One outlier value changes our entire dataset’s mean, if we make it large enough.


36 of 51

Robustness and Breakdown Points

(Robustness, breakdown point, and Ex1 repeated from the previous slide.)

Ex2 Breakdown point of the 1% trimmed mean:

  • 1%.
  • 1% of values on each side will be trimmed out anyway, so an adversary adjusting 1% will not drastically impact the estimator.
  • Anything beyond 1% (e.g., 1% plus one extra value) will break down the estimator.


What k% gives us a maximally robust trimmed mean?

🤔



38 of 51

Robustness and Breakdown Points

(Robustness and breakdown point defined as above.)

Ex1 Breakdown point of the mean:

  • 0.
  • One outlier value changes our entire dataset’s mean, if we make it large/small enough.

Ex2 Breakdown point of the 1% trimmed mean:

  • 1%.
  • 1% of values on each side will be trimmed out anyway, so drastically changing just 1% of the values won’t change much.
  • Anything beyond 1% (e.g., 1% plus one extra value) will break down the estimator.

What k% gives us a maximally robust k% trimmed mean?

  • 50%. Note: 50% trimmed mean produces the median.


The median of any distribution is maximally robust; it can handle up to 50% corruption of the data.


39 of 51

Robust Estimators of a Distribution

Center

Maximally robust: Median

  • (see previous slide)

Dispersion

Maximally robust: Median Absolute Deviation (MAD)

  • For a dataset X, define MAD(X) = median( |xᵢ − median(X)| )
  • (proof: intuition, but left to you)

Recall that we are estimating the center and dispersion of our data.


40 of 51

Robust Estimators of a Distribution

(Center and dispersion estimators, median and MAD, repeated from the previous slide.)


Other common robust measures of center:

  • k% trimmed mean
  • k% winsorized mean (why?)

Other common robust measures of dispersion:

  • Interquartile Range (IQR), p75 - p25

Related: “Tukey numbers” are a five-number summary of univariate numeric data (a SQL sketch follows this list)

  • minimum, p25, median, p75, maximum
  • center: median (robust)
  • spread: IQR (robust); min/max are not robust
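
For example, over observations(x):

SELECT min(x)                                          AS minimum,
       percentile_cont(0.25) WITHIN GROUP (ORDER BY x) AS p25,
       percentile_cont(0.5)  WITHIN GROUP (ORDER BY x) AS median,
       percentile_cont(0.75) WITHIN GROUP (ORDER BY x) AS p75,
       max(x)                                          AS maximum
FROM observations;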


41 of 51

Computing MAD in SQL

(for you to review)


%%sql
-- percentile_disc returns an actual data value near the percentile (discrete)
-- percentile_cont returns an interpolated value at the percentile (continuous)
CREATE OR REPLACE VIEW median AS
(SELECT percentile_disc(0.5)
        WITHIN GROUP (ORDER BY x) AS median
 FROM observations);

%%sql
CREATE OR REPLACE VIEW mad AS
WITH absdevs AS
  (SELECT abs(x - median) AS d
   FROM observations, median)
SELECT percentile_disc(0.5)
       WITHIN GROUP (ORDER BY d) AS mad
FROM absdevs;

SELECT median, mad
FROM median, mad;


Demo


42 of 51

Metrics for Outliers: Hampel X84, Model-based

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond “Fill Down” Imputation


43 of 51

Metrics for outliers

How much should we trim or winsorize? You may know a few common metrics:

“2 sigma”: Trim Gaussian outliers

  • Center: mean
  • Dispersion unit: standard deviation
  • Trim: 2*standard deviations from the mean

Quartiles: Trim bottom 25% and top 25%

  • But is there an analogous way to trim on a robust “dispersion unit” of spread?




44 of 51

Metrics for outliers

(The 2-sigma and quartile metrics repeated from the previous slide.)


Let’s translate this Normal-distribution notion of a “standard unit” into robust estimators:

  • Center: median
  • Dispersion unit: MAD
  • Trim: 2*1.4826 MADs from the median

For a standard Normal (mean 0, stddev 1), 1 stddev = 1.4826 MADs. Try it at home!
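
One way to try it at home (a sketch; for a large sample the ratio comes out near 1.4826):

-- Draw a large standard-normal sample and compare stddev to MAD.
WITH sample(x) AS (
    SELECT normal_rand FROM normal_rand(100000, 0, 1)
),
med AS (
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x) AS m
    FROM sample
),
mad_est AS (
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY abs(x - m)) AS mad_val
    FROM sample, med
)
SELECT stddev(x)           AS std_dev,   -- roughly 1
       mad_val             AS mad,       -- roughly 0.6745
       stddev(x) / mad_val AS ratio      -- roughly 1.4826
FROM sample, mad_est
GROUP BY mad_val;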


This is known as the Hampel X84 robust outlier detection technique.


45 of 51

More reading: Joe Hellerstein’s work

Hellerstein, J.M. (2008) Quantitative Data Cleaning for Large Databases. United Nations Economic Commission for Europe (UNECE). [link]


Paper [link], MATLAB context: hampel [link]

Joe Hellerstein

UC Berkeley EECS Professor

Original designer of Data 101 (Spring 2021)!

Data engineering involves lots of arbitrary (but reasonable!) metrics. This is one of them!


46 of 51

Let’s redo our outliers with Hampel X84


%%sql
CREATE OR REPLACE VIEW hampelx84x2_observations AS
SELECT o.x, 'hampelx84x2' AS label
FROM observations o, median, mad
WHERE o.x BETWEEN (median - 2*1.4826*mad) AND (median + 2*1.4826*mad)
UNION ALL
SELECT o.x, 'orig' AS label
FROM observations o;

CREATE OR REPLACE VIEW Hampel84x2_outliers AS
SELECT x
FROM observations o, median, mad
WHERE x NOT BETWEEN (median - 2*1.4826*mad) AND (median + 2*1.4826*mad);

Demo


47 of 51

Let’s redo our outliers with Hampel X84


results = %sql SELECT * FROM hampelx84x2_observations

sns.displot(results.dict(), x="x", kind='hist',
            hue='label', kde=True, bins=np.linspace(32, 70, 20),
            height=4, aspect=1.5, rug=True)
plt.xlim(30, 70)

Demo


48 of 51

Summary: Outliers

Outliers depend on some measure of center and spread.

  • The normal distribution (and 2 SD truncation) gives nice intuition, but it is not robust.
  • Robustness leads us to want a high breakdown point (maximum 50%).
    • Order statistics like percentiles are robust and lead to outlier handling techniques like trimming and winsorizing.
    • Median, MAD are maximally robust estimators of center and dispersion, respectively.
  • Hampel X84: a robust outlier metric that incorporates dispersion.


49 of 51

[Beyond scope] Outlier detection with model fitting

(Summary bullets repeated from the previous slide.)

If we fit multi-dimensional data to a model, do we have more metrics for detecting “outliers”?

  • In linear regression:
    • L2 distance of residuals, e.g., |predicted − actual value|²
    • OLS: assumes that your residuals are normally distributed
  • You can use the same outlier metrics, but this time applied to model residuals (see the sketch after this list)
  • We’ll stop here; anything further is down the path of data analysis, not engineering
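
A sketch of that idea, assuming a hypothetical table xy(x, y): fit a simple OLS line with Postgres's regr_* aggregates, then apply Hampel X84 to the residuals.

-- Hypothetical: flag points whose residual from the fitted line is a
-- Hampel X84 outlier (median/MAD computed over the residuals).
WITH fit AS (
    SELECT regr_slope(y, x) AS slope, regr_intercept(y, x) AS intercept
    FROM xy
),
resid AS (
    SELECT xy.x, xy.y, xy.y - (f.slope * xy.x + f.intercept) AS r
    FROM xy, fit f
),
med AS (
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY r) AS m
    FROM resid
),
madr AS (
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY abs(r - m)) AS mad
    FROM resid, med
)
SELECT resid.*
FROM resid, med, madr
WHERE r NOT BETWEEN (m - 2*1.4826*mad) AND (m + 2*1.4826*mad);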


50 of 51

Data Imputation

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond “Fill Down” Imputation

[for next time]

51 of 51

Beyond “fill down” imputation

Implementing Explicit Hierarchies

Outlier Detection and Handling

Gaussian Outliers

Trimming and Winsorizing

Robustness

Metrics for Outliers: Hampel X84, Model-based

Data Imputation

Beyond “Fill Down” Imputation

[for next time]