1 of 63

Data Preparation III: Outliers and Imputation

1

Data 101, Fall 2025 @ UC Berkeley

Aditya Parameswaran

2 of 63

Outline

Outlier Detection and Handling
  • Gaussian Outliers
  • Trimming and Winsorizing
  • Robustness
  • Metrics for Outliers: Hampel X84
  • [At home] Code for Demos

Data Imputation
  • Linear interpolation Example
  • [At home] "Fill Down" Imputation

3 of 63

Onto more Data Transformations: Principles to practice

Instead of “data cleaning,” think data transformation:

  • We are inherently imposing a model over our data.
    • Moves away from the "hygiene"/"cleaning" analogy!
  • So, record input, output, and data lineage of how the output is computed.

3

Example: Embrace simple metadata.

  • If transforming just one column:
    • Keep the original, add the derived column to the right, and name it something meaningful.
  • If transforming much of a dataset:
    • Create a new derived dataset stored "near" the original
      • e.g., same directory, database schema, git repo, etc.
      • Name the derived dataset in a way that hints at its lineage.
  • In all cases, keep the transformation code and treat it like source code:
    • Manage it, version control it, document it.
    • Keep it in the same repository/toolchain as the data itself.

4 of 63

Outlier Detection and Handling: Terminology

An outlier is a value “far enough” from “average.”

  • Detecting outliers helps us identify aberrations with respect to a model.
  • Handling our outliers helps us model the majority of the data.
    • NB: always acknowledge these outliers in our assumptions and our reports!

Every outlier handling strategy needs two statistical measures on a set of values:

  • Center: what is "average."
    • e.g., mean, median, etc.
  • Dispersion: what is "far" from average.
    • e.g., standard deviation, variance, interquartile range (IQR), etc.

4

Let’s explore robustness of four outlier handling strategies:

  • What is the measure of dispersion?
  • How susceptible is it to masking?

(formal definition of robustness to follow)

We say masking occurs when an outlier influences the dispersion measure and impacts our ability to detect other outliers.

5 of 63

Common Outlier Handling Strategies

  1. Dropping Gaussian outliers
  2. Trimming outliers based on percentiles
  3. Winsorizing, i.e., replacing outliers with percentile statistics
  4. Dropping "median outliers" with Hampel X84

5

6 of 63

(Outline recap. Next: Gaussian Outliers.)

7 of 63

Gaussian Outliers

Model data as a normal (Gaussian) distribution.

  • Center: mean
  • Dispersion unit: standard deviation
  • Define a Gaussian outlier as
    • ±2 standard deviations from the mean.
    • In percentiles: below p2.5 and above p97.5.

7

[Figure: histogram of the original distribution vs. the cleaned distribution; gray = cleaned (truncated)]

In practice:

  1. Assume a Gaussian distribution with mean, stdev estimated from the data.
  2. Find the Gaussian outliers based on #1.
  3. Construct a "cleaned" distribution (with outliers dropped). (A numpy sketch follows below.)
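A minimal numpy sketch of steps 1-3 (illustrative only; the array x stands in for the observations table):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 5, size=1000)           # stand-in for the 1000 observations

mu, sigma = x.mean(), x.std()              # 1. fit a Gaussian to the data
lo, hi = mu - 2*sigma, mu + 2*sigma        # 2. bounds at +/- 2 standard deviations
outliers = x[(x < lo) | (x > hi)]          #    the Gaussian outliers
cleaned  = x[(x >= lo) & (x <= hi)]        # 3. the "cleaned" distribution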

8 of 63

Demo: Assuming Gaussian Data

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

8

%sql DROP TABLE IF EXISTS observations CASCADE;

%sql CREATE TABLE observations AS \
    SELECT normal_rand AS x \
    FROM normal_rand(1000, 50, 5);

results = %sql SELECT x FROM observations
sns.displot(results.dict(), fill=True, kde=True, bins=20)

9 of 63

Demo: Assuming Gaussian Data

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers and store them in a view normal_outliers.

9

%%sql
CREATE OR REPLACE VIEW normal_outliers AS
WITH bounds AS (
    SELECT avg(x) - 2*stddev(x) AS lo,
           avg(x) + 2*stddev(x) AS hi
    FROM observations
)
SELECT x AS outlier
FROM observations o, bounds b
WHERE x NOT BETWEEN b.lo AND b.hi;

(bounds has a single record, therefore lo/hi are scalars)

10 of 63

Cleaned vs. Original?

Suppose our 1000 observed datapoints (observations) were truly Normal with mean 50, standard deviation 5.

Find outliers and store them in a view normal_outliers.

10

results = %sql SELECT x, 'original' AS label \
    FROM observations \
    UNION ALL \
    SELECT x, 'cleaned' AS label \
    FROM observations \
    WHERE x NOT IN \
        (SELECT * FROM normal_outliers)

sns.displot(results.dict(), x="x", kind='hist',
            hue='label', kde=True, bins=20)
plt.xlim(30, 70)

The cleaned distribution, with fewer extremes, has lower dispersion.

11 of 63

Demo: Breaking the Gaussian Assumption

Q: What if we corrupt just one value to be very large? Would Gaussian outlier detection still work?

11

## corrupt one value
%sql UPDATE observations SET x = x*10 \
    WHERE x = (SELECT MAX(x) FROM observations);

  • Data becomes right-skewed (no longer Normal)
  • Gaussian outlier detection suddenly becomes fragile
  • Mean and standard deviation both move considerably
  • The 10x value is masking our previous outliers (see the sketch below)
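A rough numpy sketch of the masking effect (synthetic data standing in for the demo table, not the actual demo code):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 5, size=1000)

def gaussian_outliers(a):
    mu, sigma = a.mean(), a.std()
    return a[np.abs(a - mu) > 2*sigma]

n_before = len(gaussian_outliers(x))       # roughly 5% of the data

corrupted = x.copy()
corrupted[corrupted.argmax()] *= 10        # corrupt one value to 10x, as in the demo
n_after = len(gaussian_outliers(corrupted))

# n_after is typically far smaller than n_before: the single huge value inflates
# the mean and stddev, so the previously detected outliers are no longer flagged.
print(n_before, n_after)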

12 of 63

(Outline recap. Next: Trimming and Winsorizing.)

13 of 63

Common Outlier Handling Strategies

  1. Dropping Gaussian outliers
  2. Trimming outliers based on percentiles
  3. Winsorizing, i.e., replacing outliers with percentile statistics
  4. Dropping "median outliers" with Hampel X84

On strategy 1, which we just saw:

  • Simple to implement.
  • However, the definitions of center and dispersion are not robust and are sensitive to the distribution of the outliers themselves.

13

14 of 63

Trimming with Percentiles

(Strategies 2 and 3 from the list above: trimming and winsorizing.)

Detect and define outliers by order statistics, i.e., percentiles.

15 of 63

Trimming with Percentiles

  • k% trimming drops both k% tails of the distribution.
    • Ex: a 5% trimmed distribution drops values below p5 and above p95.
  • k% winsorization replaces the tails with the k-th and (100-k)-th percentile values.
    • Ex: 5% winsorization
      • Replace values ≤ p5 with the p5 value
      • Replace values ≥ p95 with the p95 value
      • Note: sometimes called 90% winsorization (why?)

(a numpy sketch of both operations follows below)
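A short numpy sketch of both operations for k = 5 (synthetic data stands in for the real values):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 5, size=1000)           # stand-in data

p5, p95 = np.percentile(x, [5, 95])

trimmed    = x[(x >= p5) & (x <= p95)]     # 5% trimming: drop both tails
winsorized = np.clip(x, p5, p95)           # 5% winsorization: clamp both tails

print(trimmed.mean(), winsorized.mean())   # the "5% trimmed mean" and "5% winsorized mean"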

16 of 63

[Exercise] Understanding Winsorization

Consider this dataset which has mean 101.5 and median 68.0.

{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, −40, 101, 86, 85, 15, 89, 89, 28, −5, 41}

Compute a 90% winsorization:

  • 95th percentile (nearest) value: 101
  • 5th percentile (nearest) value: -5

16

  • Which values get replaced? With what?
  • What are the statistics of the winsorized data, compared to the original?

🤔

A. Mean bigger, median bigger

B. Mean bigger, median smaller

C. Mean smaller, median bigger

D. Mean smaller, median smaller

E. Mean same, median different

F. Mean different, median same

G. Mean same, median same

H. Something else

17 of 63

[Exercise] Understanding Winsorization

Consider this dataset which has mean 101.5 and median 68.0.

{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, −40, 101, 86, 85, 15, 89, 89, 28, −5, 41}

Compute a 90% winsorization:

  • 95th percentile (nearest) value: 101
  • 5th percentile (nearest) value: -5

Winsorized data:

{92, 19, 101, 58, 101, 91, 26, 78, 10, 13, −5, 101, 86, 85, 15, 89, 89, 28, −5, 41}

17

  • The mean drops from 101.5 to 55.65, while the median stays at 68.0 (answer F: mean different, median same).
  • The mean-median difference is much smaller now!
  • The mean is more reflective of the dataset! (A quick numpy check follows below.)
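A quick numpy check of the exercise numbers:

import numpy as np

data = np.array([92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
                 -40, 101, 86, 85, 15, 89, 89, 28, -5, 41])
wins = np.clip(data, -5, 101)              # replace the tails with the p5/p95 values

print(data.mean(), np.median(data))        # 101.5  68.0
print(wins.mean(), np.median(wins))        # 55.65  68.0  (mean changes, median does not)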

18 of 63

Trimming vs. Winsorizing with the 10x outlier

18

19 of 63

Trimming vs. Winsorizing with the 10x outlier

19

Trimming and winsorizing are more robust than Gaussian detection.

  • The mean(s) stay roughly the same, but the stddev(s) change.
  • Winsorizing in particular preserves the probability mass in the tails (rows are clamped, not dropped).
  • There is usually not much difference between the two, but prefer winsorizing over trimming if something downstream forbids NULLs.
  • … What is "robust"?

The resulting centers are known as the "5% trimmed mean" and the "5% winsorized mean."

20 of 63

(Outline recap. Next: Robustness.)

21 of 63

More reading: Joe Hellerstein’s work

21

Joe Hellerstein, EECS Professor Emeritus, UC Berkeley

Original designer of Data 101 (Spring 2021)!

Data engineering involves lots of arbitrary (but reasonable) metrics.

One example: Hampel X84. Why? Let’s properly define robustness.

Hellerstein, J.M. (2008) Quantitative Data Cleaning for Large Databases. United Nations Economic Commission for Europe (UNECE). [link]

22 of 63

Robustness and Breakdown Points

Robustness is a worst-case analysis of an estimator.

  • Robust statistics “maintain their properties even if the underlying distribution is incorrect.” [wikipedia]
  • Will outliers cause our statistical measures of “center” and “dispersion” to break?
    • Consider an adversary that can “corrupt” outlier data values to arbitrary values. Will our statistic be robust to this adversary?

Example non-robust statistic: Mean

  • If an extreme value is in the dataset, the mean is thrown off.
  • Further, the mean is no longer the "center"; it could even become an outlier itself.

22

23 of 63

Breakdown points

The breakdown point of an estimator:

  • The smallest fraction of corrupted values the estimator can tolerate before producing an arbitrarily incorrect (e.g., arbitrarily large) result.
  • For data preparation, knowing the breakdown point tells us when masking can occur (i.e., when our ability to detect all outliers is impaired by a few extra extreme outliers).

23

24 of 63

Breakdown points


Ex1 Breakdown point of mean:

  • 0.
  • One outlier value changes our entire dataset’s mean, if we make it large enough.

Ex2 Breakdown point of 1% trimmed mean:

  • 1%.
  • 1% of values on each side will be trimmed out anyway, so an adversary adjusting 1% will not drastically impact the estimator.
  • Anything beyond 1% (e.g., 1% plus one extra value) will break down the estimator.

24

25 of 63

Breakdown points


Ex 3 What k% gives us a maximally robust trimmed mean?

🤔

26 of 63

Robustness and Breakdown Points

Recall: robustness is a worst-case analysis of an estimator (here, the statistics used for outlier detection), and its breakdown point is the smallest fraction of corrupted values it can tolerate.

Ex 3 What k% gives us a maximally robust k% trimmed mean?

  • 50%.
  • Equivalently, the 50% trimmed mean is the median.

26

The median of any distribution is maximally robust. It can handle up to 50% corruption of the data.
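A small sketch of the breakdown-point idea on synthetic data: corrupt an increasing fraction of values and watch which estimator survives.

import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.normal(50, 5, size=1000))

for frac in [0.0, 0.01, 0.25, 0.49]:
    corrupted = x.copy()
    k = int(frac * len(x))
    if k:
        corrupted[-k:] = 1e9               # adversary sets the top k values arbitrarily high
    print(frac, round(corrupted.mean(), 1), round(np.median(corrupted), 1))

# The mean explodes as soon as any value is corrupted (breakdown point 0);
# the median stays near 50 until the corrupted fraction approaches 50%.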

27 of 63

Robust Estimators of a Distribution

Center:

  • Median (maximally robust; the 50% trimmed mean)
  • k% trimmed mean
    • Generally, order statistics like percentiles are robust, depending on your choice of k.
  • k% winsorized mean (why?)

Dispersion:

  • Median Absolute Deviation (MAD)
    • Maximally robust
    • (proof omitted)

[Aside] Tukey numbers for univariate data:

  • 5-number summary: minimum, p25, median, p75, max
  • Center: median (robust)
  • Dispersion: interquartile range (IQR, robust); max - min (not robust; why?)

(see the end of the slides for SQL code; a numpy sketch of these estimators follows below)
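A numpy sketch of these robust estimators (MAD computed by hand; x is stand-in data):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 5, size=1000)           # stand-in data

med = np.median(x)                         # robust center (the 50% trimmed mean)
mad = np.median(np.abs(x - med))           # median absolute deviation: robust dispersion

mn, p25, p50, p75, mx = np.percentile(x, [0, 25, 50, 75, 100])   # Tukey 5-number summary
iqr = p75 - p25                            # robust dispersion
value_range = mx - mn                      # not robust: one corrupted value moves it arbitrarily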

28 of 63

Using Median and MAD as part of Hampel X84

(Strategy 4: dropping "median outliers" with Hampel X84)

Hampel X84:

  • Center: median
  • Dispersion unit: MAD (Median Absolute Deviation)
  • Trim: 2 * 1.4826 MADs from the median

Compare with Gaussian outliers:

  • Center: mean
  • Dispersion unit: standard deviation
  • Trim: 2 standard deviations from the mean

28

But why?

29 of 63

Why 1.4826? Metrics for outliers

How much should we trim with MAD?

  • Trim for Gaussian outliers: 2 standard deviations from the mean

29

Let's translate this Normal notion of a "standard unit" into robust estimators:

  • Center: median
  • Dispersion unit: MAD
  • Trim: 2 * 1.4826 MADs from the median

For a standard Normal (mean 0, stddev 1), 1 stddev = 1.4826 MADs. Try it at home!

This is known as the Hampel X84 robust outlier detection technique.
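A numpy sketch of Hampel X84, plus the at-home check that one standard deviation of a standard Normal is about 1.4826 MADs:

import numpy as np

rng = np.random.default_rng(0)

# "Try it at home": stddev vs. MAD for a standard Normal
z = rng.normal(0, 1, size=100_000)
mad_z = np.median(np.abs(z - np.median(z)))
print(1 / mad_z)                           # roughly 1.4826

def hampel_x84_outliers(a):
    med = np.median(a)
    mad = np.median(np.abs(a - med))
    half_width = 2 * 1.4826 * mad          # the robust analogue of 2 standard deviations
    return a[np.abs(a - med) > half_width]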

30 of 63

[Beyond scope] Outlier detection with model fitting

If we fit multi-dimensional data to a model, do we have more metrics for detecting “outliers”?

  • In linear regression:
    • Look at the residuals and their L2 (squared) error, e.g., (predicted - actual)^2
    • OLS assumes that your residuals are normally distributed
  • You can use the same outlier metrics as before, but this time applied to the model residuals (a sketch follows below)
  • We'll stop here; anything further is down the path of data analysis, not data engineering
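A hedged sketch of the idea: fit a line with np.polyfit, then apply the Hampel X84 metric to the residuals (x and y here are synthetic stand-ins for a predictor and response):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)                           # stand-in predictor
y = 3.0 * x + 5 + rng.normal(0, 1, size=500)               # stand-in response
y[::50] += 30                                              # inject a few outliers

slope, intercept = np.polyfit(x, y, 1)                     # ordinary least squares fit
residuals = y - (slope * x + intercept)

med = np.median(residuals)
mad = np.median(np.abs(residuals - med))
is_outlier = np.abs(residuals - med) > 2 * 1.4826 * mad    # rows far from the fitted line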

30

31 of 63

Summary: Outliers

Outliers depend on some measure of center and spread (dispersion).

  • The normal distribution (and 2 SD truncation) gives nice intuition, but it is not robust.
  • Robustness leads us to want a high breakdown point (maximum 50%).
    • Order statistics like percentiles are robust and lead to outlier handling techniques like trimming and winsorizing.
    • Median, MAD are maximally robust estimators of center and dispersion, respectively.
  • Hampel X84: a robust outlier metric.

31

Summary

So Far

32 of 63

(Outline recap. Next: [At home] Code for Demos.)

33 of 63

Computing MAD in SQL

[At Home]

33

%%sql
-- percentile_disc returns an actual data value near the percentile (discrete)
-- percentile_cont returns an interpolated value at the percentile (continuous)
CREATE OR REPLACE VIEW median AS
(SELECT percentile_disc(0.5)
        WITHIN GROUP (ORDER BY x) AS median
 FROM observations);

%%sql
CREATE OR REPLACE VIEW mad AS
WITH absdevs AS
    (SELECT abs(x - median) AS d
     FROM observations, median)
SELECT percentile_disc(0.5)
       WITHIN GROUP (ORDER BY d) AS mad
FROM absdevs;

SELECT median, mad
FROM median, mad;


Demo

34 of 63

Let’s redo our outliers with Hampel x84

[At Home]

34

%%sql
CREATE OR REPLACE VIEW hampelx84x2_observations AS (
    SELECT o.x, 'hampelx84x2' AS label
    FROM observations o, median, mad
    WHERE o.x BETWEEN (median - 2*1.4826*mad)
                  AND (median + 2*1.4826*mad)
    UNION ALL
    SELECT o.x, 'orig' AS label
    FROM observations o
);

CREATE OR REPLACE VIEW Hampel84x2_outliers AS (
    SELECT x
    FROM observations o, median, mad
    WHERE x NOT BETWEEN (median - 2*1.4826*mad)
                    AND (median + 2*1.4826*mad)
);

Demo

35 of 63

Let’s redo our outliers with Hampel x84

[At Home]

35

results = %sql SELECT * FROM hampelx84x2_observations

sns.displot(results.dict(), x="x", kind='hist',
            hue='label', kde=True, bins=np.linspace(32, 70, 20),
            height=4, aspect=1.5, rug=True)
plt.xlim(30, 70)

Demo

36 of 63

(Outline recap. Next: Data Imputation.)

37 of 63

Data Imputation

When data are missing, we often impute missing data by filling in “likely” values.

  • After all, missing data can lead to bias.
  • Furthermore, some downstream operators can’t tolerate missing data
    • e.g., a stats package that needs a dense tensor (as opposed to sparse)

37

What is a good imputation scheme for your setting?

  • It depends; this is part of the art of statistics.
  • We will not offer prescriptions here; instead, we will focus on the how instead of the what.

Q: What imputation schemes would you consider using to fill in missing values?

38 of 63

Imputation methods we will cover

1. Default values for a column

  • Usually a column aggregate, e.g., center/mean/median

2. Correlation across columns

  • e.g., conditional distributions like P(elevation|latitude)

3. General model-based interpolation

  • Trained on other data

4. Interpolation across ordered rows

  • Order rows in the relation, then use that order to impute values

38

(Methods 2 and 3 are interpolation with columns: imputing values by assuming a relationship between columns. Method 4 is interpolation with rows.)

39 of 63

GNIS

For purposes of illustration, let's introduce some missing values into our data.

39

%%sql
SELECT setseed(0.12345);

DROP TABLE IF EXISTS holey CASCADE;

CREATE TABLE holey AS
SELECT feature_id, feature_name,
       feature_class, state_alpha, county_name,
       prim_lat_dec, prim_long_dec,
       CASE WHEN random() > 0.9 THEN NULL
            ELSE elev_in_m
       END AS elev_in_m
FROM national;

SELECT count(elev_in_m)::float / count(*)
FROM holey;

[aside] Good data lineage practice: Make a copy of the data

40 of 63

Method 1. Default values for a column

Impute missing data using the mean, or some other reasonable univariate data statistic.

Implement in SQL with two passes through the data:

1. Aggregate CTE to compute the mean

2. Query that replaces NULLs with that mean

40

41 of 63

Method 1. Default values for a column

[At home]

WITH elevavg AS (
    SELECT avg(elev_in_m)::int FROM holey
)
SELECT h.*,
       CASE WHEN h.elev_in_m IS NOT NULL
            THEN h.elev_in_m
            ELSE e.avg
       END AS imputed_elev_in_m
FROM holey h, elevavg e
LIMIT 100;

41

[Aside] Good data lineage practice: Document imputation through attribute names (elev_in_m, imputed_elev_in_m)

Demo

42 of 63

Method 2. Correlation across columns

Interpolate by assuming a relation between columns.

  • Given a correlation model,
  • Apply a scalar function.

Ex: linear regression model

  • elevation in meters = f(longitude), i.e., elev_in_m = m*prim_long_dec + b

42

[Figure: scatter plot of elev_in_m vs. prim_long_dec with a fitted regression line]

43 of 63

Method 2. Correlation across columns

[At home]

(could use CTE, but here’s a notebook alternative)

43

# Here we'll train the model in SQL just for fun
result = %sql SELECT \
    regr_slope(elev_in_m, prim_long_dec), \
    regr_intercept(elev_in_m, prim_long_dec) \
    FROM holey
slope, intercept = result[0]

%%sql
SELECT *,
       CASE WHEN elev_in_m IS NOT NULL
            THEN elev_in_m
            ELSE {{slope}}*prim_long_dec + {{intercept}}
       END AS imputed_elev_in_m
FROM holey
LIMIT 10;

Demo

44 of 63

Method 3. General model-based interpolation

Similar to Method 2, but trained on different data in advance.

  • Call a scalar function taking a model prediction function
  • Pass values in the row as parameters to the model predictor.

We won't show an example, but the general idea:

SELECT *,
       CASE WHEN column IS NOT NULL
            THEN column
            ELSE model_predict(<constants>, <columns>)
       END AS imputed_column
FROM table;
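For intuition, a rough pandas sketch of the same pattern (the toy DataFrame and model_predict are placeholders; model_predict stands in for a model trained elsewhere on other data):

import pandas as pd

def model_predict(row):
    # stand-in for a pre-trained model that takes other columns as features
    return 100.0 + 2.0 * row["prim_lat_dec"]

# toy stand-in for the holey table
df = pd.DataFrame({
    "prim_lat_dec": [37.9, 38.1, 36.5],
    "elev_in_m":    [120.0, None, 300.0],
})

df["imputed_elev_in_m"] = df["elev_in_m"]              # keep the original column intact
missing = df["imputed_elev_in_m"].isna()
df.loc[missing, "imputed_elev_in_m"] = df.loc[missing].apply(model_predict, axis=1)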

45 of 63

Method 4. Interpolation across ordered rows

Order rows in the relation, then use that order to impute values.

Ex “fill down” until the next non-null value

Ex Linear interpolation between rows

45

[Figure: a column of values (46, 401, 194, 47, 111) shown in row order, with NULL gaps between them to impute]

46 of 63

Method 4. Interpolation across ordered rows

Order rows in the relation, then use that order to impute values.

Ex “fill down” until the next non-null value

Ex Linear interpolation between rows

46

  • Typically involves a window function.
  • More complicated than meets the eye (in SQL).
  • Requires multiple passes through the data.
  • See bonus slides. 🏡

Let's see it in action…!

47 of 63

(Outline recap. Next: Linear interpolation Example.)

48 of 63

Linear Interpolation Example

Linear interpolation across ordered rows:

  • Order by some column (or set of columns).
  • Between ordered rows, average to impute the missing values.

48

[Figure: the same column of values (46, 401, 194, 47, 111) in row order, with NULL gaps to fill by linear interpolation]

Linear interpolation is non-trivial!

It requires knowing

  • What row we are in
  • What prev/next values are

Multiple passes through data!

Q: How would you go about doing this?

49 of 63

Preprocess: Define a “run” of rows

The trick: interpolate within runs, where a row's run number is the count of non-NULL values seen so far.

49

| feature_id | feature        | elev_in_m | run | run_start | run_end | run_size | run_rank | impute_elev_in_m |
|------------|----------------|-----------|-----|-----------|---------|----------|----------|------------------|
| 493        | Adobe Canyon   | 1423      | 84  | 1423      | 1423    | 3        | 0        | 1423             |
| 494        | Adobe Flats    | None      | 84  | 1423      | 517     | 3        | 1        | 1121             |
| 495        | Adobe Lake     | None      | 84  | 1423      | 517     | 3        | 2        | 819              |
| 496        | Adobe Mountain | 517       | 85  | 517       | 517     | 1        | 1        | 517              |

("run-level statistics"; impute using the rank within the current run)

Once we know which “run” a row pertains to, we can interpolate via whatever scalar math we like.

50 of 63

Strategy: Multiple passes through the data

Add run-level statistics to every row:

  • run: the number of non-NULL values seen so far
  • run_start: the non-NULL value that starts the run (the closest preceding non-NULL value)
  • run_end: the next closest non-NULL value (the start of the following run)
  • run_size: the total number of rows in the run
  • run_rank: this row's index within the run

50

🤔

A. 0

B. 1

C. 2

D. 3

E. 4

F. Something else

In SQL, how many passes through the relation do you need to compute these five attributes per row AND interpolate?

51 of 63

A 3-pass algorithm, at least in SQL

1. Pass forward:

    • Compute run
    • Propagate run_start
    • Get next_val into the last row of each run

2. Pass backward

    • Partition by run
      • Compute run_size
      • Compute run_rank
    • Propagate run_end from next_val computed in Pass 1

3. Use scalars to interpolate

    • Unordered; just use whatever order streams out of Pass 2

51

run_size = 3 because we exclude run_end, the start of the next run

[Figure: the example rows at each stage: Original, 1. Forward, 2. Backward, 3. Interpolate]

52 of 63

run_start: by taking the last non-null value

Didn’t cover slides, just did demo - here for reference

A custom function that we wrote that effectively coalesces values over a window frame:

coalesce_agg(elev_in_m)
    OVER (ORDER BY feature_id) AS run_start

  • Order rows by feature_id
  • For the frame ending with this row, get the closest elev_in_m to this row
  • (implementation beyond the scope of this class. See slides)

52

Thanks, StackOverflow!

Demo

53 of 63

next_val: Next row value

Didn’t cover slides, just did demo - here for reference

In the first pass:

CASE WHEN elev_in_m IS NULL
     THEN lead(elev_in_m, 1)
          OVER (ORDER BY feature_id)
     ELSE NULL
END AS next_val

In the second pass:

  • Compute run_end by applying coalesce_agg to next_val (but going backwards).

53

Demo

54 of 63

[Code] A 3-pass algorithm

Didn’t cover slides, just did demo - here for reference

54

%%sql
-- 1. Forward: assign run numbers to rows, propagate run_start, get next_val
CREATE OR REPLACE VIEW forward AS
SELECT *,
       SUM(CASE WHEN elev_in_m IS NULL THEN 0 ELSE 1 END)
           OVER (ORDER BY feature_id) AS run,
       coalesce_agg(elev_in_m) OVER (ORDER BY feature_id) AS run_start,
       CASE WHEN elev_in_m IS NULL
            THEN lead(elev_in_m, 1) OVER (ORDER BY feature_id)
            ELSE NULL
       END AS next_val
FROM holey;

%%sql
-- 2. Backward: assign run_end, run_size, run_rank
CREATE OR REPLACE VIEW backward AS
SELECT *,
       CASE WHEN elev_in_m IS NOT NULL THEN elev_in_m
            ELSE coalesce_agg(next_val) OVER
                 (PARTITION BY run ORDER BY feature_id DESC)
       END AS run_end,
       count(*) OVER (PARTITION BY run) AS run_size,
       -1 + (RANK() OVER (PARTITION BY run ORDER BY feature_id)) AS run_rank
FROM forward;

SELECT * FROM backward ORDER BY feature_id ASC LIMIT 12 OFFSET 183;

%%sql
-- 3. Simple scalar pass
CREATE OR REPLACE VIEW final AS
SELECT *,
       run_start + (run_rank)*((run_end - run_start)/(run_size))
           AS interpolated
FROM backward;

Demo

55 of 63

Didn’t cover slides, just did demo - here for reference

(Zoomed-in view of pass 1, the forward pass, from the previous slide.)

Demo

56 of 63

[Code] A 3-pass algorithm

Didn’t cover slides, just did demo - here for reference

How well did PostgreSQL do?

%sql EXPLAIN SELECT * FROM final LIMIT 500;

  • Note: forward, backward, and final are all views
  • Two sorts! Can you do better?

56

(can you do better? what if you don’t use SQL?)
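For contrast, a sketch of the same linear interpolation outside SQL, in pandas (the toy DataFrame below stands in for the holey table, reusing the rows from the run example):

import pandas as pd

df = pd.DataFrame({
    "feature_id": [493, 494, 495, 496],
    "elev_in_m":  [1423.0, None, None, 517.0],
})

df = df.sort_values("feature_id")          # the same ordering the SQL passes rely on
df["interpolated"] = df["elev_in_m"].interpolate(method="linear")
# Values are treated as equally spaced, matching the run_rank/run_size arithmetic:
# the two NULL rows become 1121.0 and 819.0, as in the run-level table.
# Leading NULLs (before the first non-NULL value) are left untouched.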

Demo

57 of 63

Summary: Imputation

Some imputation methods for standard single imputation:

1. Default values for a column

    • Usually a column aggregate, e.g., center/mean/median

2. Correlation across columns

    • Conditional distributions like P(elevation|latitude)

3. General model-based interpolation

    • Trained on other data

4. General Imputation across Ordered Rows

    • Order rows in the relation, then pick something in-between
    • We saw a 3-pass algorithm for linear (and general) imputation

57

[Aside] For fancier statistical methods:

  • https://en.wikipedia.org/wiki/Imputation_(statistics)
  • Multiple imputation averages across multiple imputed datasets
  • Getting fancier requires even more query gymnastics
  • You’ve seen enough to already be dangerous…😈

58 of 63

(Outline recap. Next: [At home] "Fill Down" Imputation.)

59 of 63

Method 4. Impute by filling down (according to some order)

Order rows in the relation, then use that order to impute values.

  • Order by some other column (or set of columns)
  • "Fill down" until the next non-null value.
  • Typically involves a window function.

[Figure: a column of values in row order, with NULLs filled down from the most recent non-NULL value]
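For comparison, the same fill-down is a one-liner in pandas (sketch; the toy DataFrame stands in for the holey table):

import pandas as pd

df = pd.DataFrame({
    "feature_id": [493, 494, 495, 496],
    "elev_in_m":  [1423.0, None, None, 517.0],
})

df = df.sort_values("feature_id")                    # order by some column first
df["imputed_elev_in_m"] = df["elev_in_m"].ffill()    # fill down until the next non-null value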

60 of 63

The following doesn’t work in PostgreSQL

%%sql
WITH buggy AS (
    SELECT *,
           CASE WHEN elev_in_m IS NOT NULL THEN elev_in_m
                ELSE lag(elev_in_m, 1)
                     OVER (ORDER BY feature_id)
           END AS imputed_elev_in_m
    FROM holey
)
SELECT feature_id, elev_in_m,
       imputed_elev_in_m
FROM buggy;

60

What is the bug with this “fill down” imputation?

🤔

Demo

61 of 63

The bug

CASE WHEN elev_in_m IS NOT NULL THEN elev_in_m
     ELSE lag(elev_in_m, 1)
          OVER (ORDER BY feature_id)
END AS imputed_elev_in_m

61

Unfortunately, PostgreSQL doesn't support lag(...) IGNORE NULLS!

⚠️

Demo

62 of 63

Concept is simple, but the SQL implementation is hard

Unfortunately, PostgreSQL doesn't support lag(...) IGNORE NULLS!

  • Fix: define a User-Defined Function, or UDF.
  • Beyond the scope of this course; the next slide is hidden but available for your reference.

62

Thanks, StackOverflow!

Demo

63 of 63

Concept is simple, but the SQL implementation is hard

%%sql
CREATE OR REPLACE FUNCTION coalesce_agg_sfunc(state anyelement, value anyelement)
RETURNS anyelement AS
$$
    SELECT coalesce(value, state);
$$ LANGUAGE SQL;

CREATE AGGREGATE coalesce_agg(anyelement) (
    SFUNC = coalesce_agg_sfunc,
    STYPE = anyelement
);

63

Thanks, StackOverflow!

Demo