Data Preparation for Knowledge Discovery
Lecture #7
Jamolbek Mattiev
Databases and Data Mining
Outline: Data Preparation
Intro to ML and DM
slide 2/51
6th lecture
Data preparation
Knowledge Discovery Process flow, according to CRISP-DM
[Figure: CRISP-DM process flow diagram; surviving label: Monitoring]
Knowledge Discovery Process, in practice
Data preparation is estimated to take 70-80% of the time and effort.
[Figure: process diagram; surviving labels: Monitoring, Data Preparation]
Data Understanding: Relevance
Data Understanding: Quantity
Data Cleaning Steps
Data Cleaning: Acquisition
Data Cleaning: Example
000000000130.06.19971979-10-3080145722 #000310 111000301.01.000100000000004 0000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000. 000000000000000.000000000000000.0000000...… 000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.00 0000000000300.00 0000000000300.00
0000000001,199706,1979.833,8014,5722 , ,#000310 …. ,111,03,000101,0,04,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,0300.00
Data Cleaning: Metadata
Data Cleaning: Reformatting
Convert data to a standard format (e.g. ARFF or CSV)
Data Cleaning: Reformatting, 2
Convert nominal fields whose values have a natural order to numeric, so that “>” and “<” comparisons can be used on these fields.
Data Cleaning: Missing Values
Data Cleaning: Missing Values, 2
Data Cleaning: Unified Date Format
Data Cleaning: Unified Date Format, 2
Unified Date Format Options
KSP Date Format
KSP Date = YYYY + (days_starting_Jan_1 - 0.5) / (365 + 1_if_leap_year)
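A minimal sketch of the KSP date formula in Python. It assumes that days_starting_Jan_1 counts January 1 as day 1; the function name ksp_date is an illustrative choice, not part of the slides.

```python
import calendar
from datetime import date

def ksp_date(d):
    """Convert a calendar date to a KSP date: the year plus the fraction of the year elapsed."""
    day_of_year = d.timetuple().tm_yday  # assumption: Jan 1 counts as day 1
    days_in_year = 365 + (1 if calendar.isleap(d.year) else 0)
    return d.year + (day_of_year - 0.5) / days_in_year

print(ksp_date(date(2000, 1, 1)))
print(ksp_date(date(2001, 12, 31)))
```

Because the result is a single number, later dates always compare greater than earlier ones, and differences between two values are intervals measured in years.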
Y2K issues: 2 digit Year
Conversion: Nominal to Numeric
Conversion
Conversion: Binary to Numeric
Conversion: Ordered to Numeric
Conversion: Ordered to Numeric, 2
Natural order allows meaningful comparisons, e.g. Grade > 3.5
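A small sketch of an ordered-to-numeric conversion. The letter grades and their numeric values are assumptions chosen for illustration; the slides do not give a mapping.

```python
# Hypothetical mapping from ordered letter grades to numbers (illustrative assumption).
GRADE_TO_NUMBER = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}

def grade_to_numeric(grade):
    """Return the numeric value of an ordered nominal grade, or None if it is unknown."""
    return GRADE_TO_NUMBER.get(grade)

grades = ["B", "A-", "A", "B+"]
numeric = [grade_to_numeric(g) for g in grades]

# Once the field is numeric, a comparison such as Grade > 3.5 is meaningful.
above_threshold = [g for g, n in zip(grades, numeric) if n is not None and n > 3.5]
print(above_threshold)
```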
Conversion: Nominal, Few Values
ID | Color | … |
371 | red | |
433 | yellow | |
ID | C_red | C_orange | C_yellow | … |
371 | 1 | 0 | 0 | |
433 | 0 | 0 | 1 | |
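The conversion shown in the two tables can be sketched as follows. The function name one_hot and the list of colors are assumptions; only the three color values visible in the table are used, since the remaining columns are elided there.

```python
def one_hot(records, field, values, prefix="C_"):
    """Replace a nominal field with one 0/1 indicator column per listed value."""
    encoded = []
    for record in records:
        # Copy every field except the nominal one being converted.
        row = {k: v for k, v in record.items() if k != field}
        for value in values:
            row[prefix + value] = 1 if record[field] == value else 0
        encoded.append(row)
    return encoded

records = [{"ID": 371, "Color": "red"}, {"ID": 433, "Color": "yellow"}]
colors = ["red", "orange", "yellow"]  # assumption: only the values visible in the table
encoded = one_hot(records, "Color", colors)
for row in encoded:
    print(row)
```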
Conversion: Nominal, Many Values
Data Cleaning: Discretization
Discretization: Equal-width
Equal Width, bins Low <= value < High
Temperature values:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Bin | [64,67) | [67,70) | [70,73) | [73,76) | [76,79) | [79,82) | [82,85] |
Count | 2 | 2 | 4 | 2 | 0 | 2 | 2 |
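A minimal sketch of equal-width binning applied to the temperature values above. The function name and signature are assumptions for illustration; bins are half-open (Low <= value < High) except the last one, which also includes its upper bound.

```python
def equal_width_counts(values, low, high, n_bins):
    """Count values per equal-width bin over [low, high]; only the last bin is closed on the right."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # A value equal to `high` would index one past the end, so clamp it into the last bin.
        index = min(int((v - low) // width), n_bins - 1)
        counts[index] += 1
    return counts

temperatures = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
counts = equal_width_counts(temperatures, low=64, high=85, n_bins=7)
print(counts)
```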
Discretization: Equal-width may produce clumping
[Figure: histogram of salary in a corporation; axes: Count vs. salary; bins run from [0 – 200,000) … to [1,800,000 – 2,000,000]; labeled count: 1]
Equal-width problems
What can we do to get a more even distribution?
Discretization: Equal-height
Equal Height = 4, except for the last bin
Temperature values:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Bin | [64 .. 69] | [70 .. 72] | [73 .. 81] | [83 .. 85] |
Count | 4 | 4 | 4 | 2 |
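A minimal sketch of equal-height binning on the same temperature values. The function name is an assumption, and so is the tie-handling rule (equal values are never split across two bins), which the slides do not specify.

```python
def equal_height_bins(values, height):
    """Group sorted values into bins of about `height` items, keeping equal values together."""
    bins, current = [], []
    for v in sorted(values):
        # Start a new bin once the current one is full, unless v repeats the previous value.
        if len(current) >= height and v != current[-1]:
            bins.append(current)
            current = []
        current.append(v)
    if current:
        bins.append(current)
    return bins

temperatures = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
bins = equal_height_bins(temperatures, height=4)
for b in bins:
    print(b[0], "..", b[-1], "count:", len(b))
```

The last bin holds whatever is left over, which is why it may contain fewer than `height` values.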
Discretization: Equal-height advantages
Discretization
How else can we discretize?
What is another method from the literature?
Discretization: Class Dependent
min of 3 values per bucket
Temperature | 64 | 65 | 68 | 69 | 70 | 71 | 72 | 72 | 75 | 75 | 80 | 81 | 83 | 85 |
Class | Yes | No | Yes | Yes | Yes | No | No | Yes | Yes | Yes | No | Yes | Yes | No |
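One possible greedy scheme for class-dependent discretization is sketched below. It is an assumption about how the buckets might be formed, not necessarily the procedure intended by the slide: scan the sorted values and close a bucket once its majority class has at least min_count members, the next item belongs to a different class, and the next value differs from the current one.

```python
from collections import Counter

def class_dependent_buckets(values, labels, min_count=3):
    """Greedily split sorted (value, label) pairs into buckets with a clear majority class."""
    buckets, current, counts = [], [], Counter()
    n = len(values)
    for i, (v, label) in enumerate(zip(values, labels)):
        current.append(v)
        counts[label] += 1
        if i == n - 1:
            break  # the last item always ends the final bucket
        majority, majority_count = counts.most_common(1)[0]
        next_differs = labels[i + 1] != majority and values[i + 1] != v
        if majority_count >= min_count and next_differs:
            buckets.append(current)
            current, counts = [], Counter()
    if current:
        buckets.append(current)
    return buckets

values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
          "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]
buckets = class_dependent_buckets(values, labels, min_count=3)
for b in buckets:
    print(b)
```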
Discretization considerations
Outliers and Errors
Examine Data Statistics
Data Cleaning: Field Selection
First: Remove fields with no or little variability
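A minimal sketch of this first step, assuming records are stored as dictionaries with the same keys. The function name and the 0.5% threshold are hypothetical choices for illustration; a suitable threshold depends on the data.

```python
from collections import Counter

def low_variability_fields(records, max_other_fraction=0.005):
    """Return fields where all but a tiny fraction of records share a single value."""
    flagged = []
    n = len(records)
    for field in records[0]:
        counts = Counter(record[field] for record in records)
        most_common_count = counts.most_common(1)[0][1]
        # Fraction of records that deviate from the dominant value.
        if (n - most_common_count) / n <= max_other_fraction:
            flagged.append(field)
    return flagged

records = [{"id": i, "country": "US", "age": 20 + i % 50} for i in range(1000)]
records[0]["country"] = "CA"  # a single exception among 1000 records
flagged = low_variability_fields(records)
print(flagged)
```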
False Predictors or Information “Leakers”
False Predictor Example
Q: What is a false predictor for a student’s likelihood of passing a course?
A: The student’s final grade.
False Predictors: Find “suspects”
(Almost) Automated False Predictor Detection
Selecting Most Relevant Fields
Field Reduction Improves Classification
Derived Variables
Unbalanced Target Distribution
Handling Unbalanced Data
Building Balanced Train Sets
[Figure: diagram of records marked Y and N; labels: Raw Held, Targets, Non-Targets, Balanced set, Balanced Train, Balanced Test]
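One way to read the diagram is sketched below. It assumes the balanced set is built by sampling equal numbers of targets and non-targets, and that the balanced set is then divided into train and test parts. The function name, the 70/30 split, and the record layout are assumptions for illustration, and the raw held-out set is not modeled here.

```python
import random

def build_balanced_sets(records, is_target, train_fraction=0.7, seed=0):
    """Sample equal numbers of targets and non-targets, then split into train and test."""
    rng = random.Random(seed)
    targets = [r for r in records if is_target(r)]
    non_targets = [r for r in records if not is_target(r)]
    n = min(len(targets), len(non_targets))  # undersample the larger group
    targets = rng.sample(targets, n)
    non_targets = rng.sample(non_targets, n)
    cut = round(n * train_fraction)
    # Split each group separately so that both parts stay balanced.
    train = targets[:cut] + non_targets[:cut]
    test = targets[cut:] + non_targets[cut:]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Hypothetical data: 20 targets (Y) among 200 records.
records = [{"id": i, "target": "Y" if i % 10 == 0 else "N"} for i in range(200)]
train, test = build_balanced_sets(records, lambda r: r["target"] == "Y")
print(len(train), len(test))
```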
Learning with Unbalanced Data
Data Preparation Key Ideas
Summary
Good data preparation is key to producing valid and reliable models