Data Preprocessing
1
Data Preprocessing
2
2
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
Chapter 3: Data Preprocessing
Data Cleaning
Incomplete (Missing) Data
How to Handle Missing Data?
Noisy Data
How to Handle Noisy Data?
Data Cleaning as a Process
Data Integration
Handling Redundancy in Data Integration
Correlation Analysis (Nominal Data)
Chi-Square Calculation: An Example
| | Play chess | Not play chess | Sum (row) |
| Like science fiction | 250 (90) | 200 (360) | 450 |
| Not like science fiction | 50 (210) | 1000 (840) | 1050 |
| Sum (col.) | 300 | 1200 | 1500 |
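The chi-square statistic for this table can be sketched in Python as follows. The sketch assumes that the numbers in parentheses are the expected counts (row sum × column sum / grand total) and that the statistic is the sum of (observed − expected)² / expected over all cells. The function name `chi_square` is a hypothetical helper, not something named on the slide.

```python
def chi_square(observed):
    """Return (expected counts, chi-square statistic) for a contingency table.

    `observed` is a list of rows of observed counts (hypothetical helper).
    """
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    # Expected count under independence: row sum * column sum / grand total.
    expected = [[r * c / total for c in col_sums] for r in row_sums]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return expected, chi2


# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
expected, chi2 = chi_square(observed)
print(expected)
print(round(chi2, 2))
```

The expected counts reproduce the parenthesized values in the table. A statistic this large for a 2 × 2 table suggests that the two attributes are correlated in this group.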
Correlation Analysis (Numeric Data)
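$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}$$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(aibi) is the sum of the AB cross-product.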
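A minimal Python sketch of the correlation coefficient for two numeric attributes, built from the quantities named here (n, the two means, the two standard deviations, and the cross-product sum). It assumes population standard deviations (divisor n). The function name `pearson_r` and the sample lists are illustrative assumptions.

```python
from math import sqrt


def pearson_r(a, b):
    """Correlation coefficient of two equal-length numeric lists (sketch)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations (divisor n), an assumption of this sketch.
    sigma_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    # Sum of the AB cross-product.
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / (n * sigma_a * sigma_b)


# Made-up data: b_up rises with a, b_down falls as a rises.
a = [1, 2, 3, 4, 5]
b_up = [3, 5, 7, 9, 11]
b_down = [10, 8, 6, 4, 2]
r_up = pearson_r(a, b_up)
r_down = pearson_r(a, b_down)
print(r_up, r_down)
```

A result near +1 indicates a strong positive linear relationship, a result near -1 a strong negative one, and a result near 0 no linear relationship.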
Visually Evaluating Correlation
Scatter plots showing correlation coefficients ranging from –1 to 1.
Correlation (viewed as linear relationship)
Covariance (Numeric Data)
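$$Cov(A,B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$$
Correlation coefficient:
$$r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and σA and σB are the respective standard deviations of A and B.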
Covariance: An Example
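A small worked example, sketched in Python. The two price series below are illustrative assumptions, not values recovered from the slide. The sketch uses the shortcut Cov(A, B) = E(A·B) − mean(A)·mean(B), which is algebraically equal to the mean of the products of deviations. The function name `covariance` is hypothetical.

```python
def covariance(a, b):
    """Covariance of two equal-length numeric lists, divisor n (sketch)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Shortcut form: E(A*B) - mean(A) * mean(B).
    return sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b


# Illustrative (assumed) prices of two stocks over five days.
stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]
cov_ab = covariance(stock_a, stock_b)
print(round(cov_ab, 2))
```

Here the means are 4 and 9.6, the mean cross-product is 42.4, and the covariance is 42.4 − 38.4 = 4. A positive covariance means the two series tend to rise together.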
Data Reduction Strategies
Data Reduction 1: Dimensionality Reduction
Mapping Data to a New Space
[Figure panels: Two Sine Waves; Two Sine Waves + Noise; horizontal axis: Frequency]
What Is Wavelet Transform?
Wavelet Transformation
[Figure: Haar-2 and Daubechies-4 wavelets]
Wavelet Decomposition
Haar Wavelet Coefficients
Original frequency distribution: 2 2 0 2 3 5 4 4
Haar wavelet coefficients: 2.75, -1.25, 0.5, 0, 0, -1, -1, 0
[Figure: hierarchical decomposition structure (a.k.a. “error tree”) showing the coefficient “supports”, i.e. the data values each coefficient contributes to with a + or - sign]
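The coefficients for the distribution 2 2 0 2 3 5 4 4 can be reproduced with a short Python sketch. It assumes the unnormalized Haar scheme of repeated pairwise averaging and differencing, with each detail coefficient taken as (left − right) / 2, and an input length that is a power of two. The function name `haar_decompose` is hypothetical.

```python
def haar_decompose(values):
    """Return [overall average, detail coefficients from coarsest to finest].

    Assumes len(values) is a power of two (sketch, not a library routine).
    """
    details = []
    current = list(values)
    while len(current) > 1:
        pairs = [(current[i], current[i + 1]) for i in range(0, len(current), 2)]
        averages = [(left + right) / 2 for left, right in pairs]
        differences = [(left - right) / 2 for left, right in pairs]
        # Coarser levels are computed later, so prepend to keep them first.
        details = differences + details
        current = averages
    return current + details


coefficients = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
print(coefficients)
```

The first value is the overall average (2.75). Each later value is the half-difference between the left and right halves of the span that coefficient supports, which is what the + and - signs in the error tree encode.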
Why Wavelet Transform?
Principal Component Analysis (PCA)
[Figure: data plotted on axes x1 and x2, with direction vector e]
Principal Component Analysis (Steps)
Attribute Subset Selection
Heuristic Search in Attribute Selection
Attribute Creation (Feature Generation)
Data Reduction 2: Numerosity Reduction
Parametric Data Reduction: Regression and Log-Linear Models
Regression Analysis
[Figure: scatter plot with axes x and y, the line y = x + 1, and the labels X1, Y1, Y1’]
Regression Analysis and Log-Linear Models
Histogram Analysis
Clustering
Sampling
Types of Sampling
Sampling: With or without Replacement
[Figure: Raw Data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
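The difference between the two schemes can be sketched with Python's standard `random` module: `sample` draws without replacement and `choices` draws with replacement. The wrapper names `srswor` and `srswr`, the `seed` parameter, and the toy data are assumptions made for illustration.

```python
import random


def srswor(data, k, seed=None):
    """Simple random sample WITHOUT replacement: no item is drawn twice."""
    rng = random.Random(seed)
    return rng.sample(data, k)


def srswr(data, k, seed=None):
    """Simple random sample WITH replacement: an item may be drawn again."""
    rng = random.Random(seed)
    return rng.choices(data, k=k)


raw_data = list(range(1, 11))  # toy data set of 10 distinct tuples
without_repl = srswor(raw_data, 5, seed=0)
with_repl = srswr(raw_data, 15, seed=0)
print(without_repl)
print(with_repl)
```

Because SRSWR returns each drawn item to the pool, the sample size may exceed the size of the data set. SRSWOR cannot draw more than len(data) items.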
Sampling: Cluster or Stratified Sampling
[Figure: Raw Data and the resulting Cluster/Stratified Sample]
Data Cube Aggregation
Data Reduction 3: Data Compression
Data Compression
[Figure: lossless compression maps Original Data to Compressed Data and back to the Original Data; lossy compression recovers only Original Data Approximated]
Data Transformation
Normalization
Normalization by decimal scaling:
$$v' = \frac{v}{10^{j}}$$
where j is the smallest integer such that max(|v'|) < 1.
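Decimal scaling divides every value by a power of ten, 10^j, chosen so that the largest absolute scaled value falls below 1. A minimal Python sketch follows. It assumes j is searched upward from 0, so it returns the smallest non-negative j. The function name `decimal_scale` and the sample values are illustrative assumptions.

```python
def decimal_scale(values):
    """Return (j, scaled values) where scaled = v / 10**j and max |scaled| < 1.

    Sketch only: searches j = 0, 1, 2, ... (non-negative j is an assumption).
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


# Illustrative values: the largest magnitude is 986, so j should be 3.
j, scaled = decimal_scale([-986, 120, 917])
print(j, scaled)
```

With a largest magnitude of 986, dividing by 10, 100, and 1000 gives 98.6, 9.86, and 0.986. The first of these below 1 corresponds to j = 3.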
Discretization
Data Discretization Methods
Simple Discretization: Binning
Binning Methods for Data Smoothing
* Sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (each mean rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
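The three steps can be sketched in Python. The sketch assumes that the number of values divides evenly into the bins, that bin means are rounded to the nearest integer, and that a value equidistant from both boundaries goes to the lower one. All function names are hypothetical.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size (assumes even split)."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(data, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```

The exact bin means are 9, 22.75, and 29.25, which round to the 9, 23, and 29 listed above.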
Discretization Without Using Class Labels (Binning vs. Clustering)
[Figure panels: Data; Equal interval width (binning); Equal frequency (binning); K-means clustering, which leads to better results]
Discretization by Classification & Correlation Analysis
Concept Hierarchy Generation
Concept Hierarchy Generation for Nominal Data
Automatic Concept Hierarchy Generation
country: 15 distinct values
province_or_state: 365 distinct values
city: 3567 distinct values
street: 674,339 distinct values
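One way to derive such a hierarchy automatically is to order the attributes by their number of distinct values, placing the attribute with the fewest distinct values at the top and the one with the most at the bottom. The Python sketch below assumes that heuristic. The function name `generate_hierarchy` is hypothetical, and the counts are the ones listed above.

```python
def generate_hierarchy(distinct_counts):
    """Order attributes from the top (fewest distinct values) to the bottom."""
    return sorted(distinct_counts, key=distinct_counts.get)


# Distinct-value counts per attribute, deliberately given in arbitrary order.
distinct_counts = {
    "street": 674339,
    "city": 3567,
    "province_or_state": 365,
    "country": 15,
}
hierarchy = generate_hierarchy(distinct_counts)
print(" < ".join(reversed(hierarchy)))
```

The heuristic is not foolproof. An attribute set such as weekday (7 values), month (12), and year would be ordered incorrectly by distinct-value counts alone, so the generated hierarchy may need manual adjustment.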
Summary
References