1 of 11

Mining a Demographic Dataset

Senthil Shanmugam

Seth Carlson

Ishwarya Sethuraman

Daniel Ryan

2 of 11

Dataset

Goals

  • Identify the best model for classifying records in the dataset by income >= $50K
  • Identify the individual demographics that lead to association rules for income >= $50K

The Data

  • Demographic dataset of US residents, with a binary dependent variable: income >= $50K
  • The dataset includes:
    • Age
    • Work class
    • Education
    • Marital status
    • Occupation
    • Relationship
    • Race
    • Sex
    • Hours worked per week
    • Native country
    • Income from investment

3 of 11

Data description

Transformations of the Dataset

  • Binned the continuous variables
  • Created dummy variables for the categorical predictor variables
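The two transformations above might look like this in pandas. The column names and bin edges here are our own illustrative choices, not the deck's actual fields:

```python
import pandas as pd

# Hypothetical mini-sample standing in for the census records
df = pd.DataFrame({
    "age": [23, 37, 52, 61],
    "hours_per_week": [20, 40, 55, 60],
    "workclass": ["Private", "Self-emp", "Private", "Gov"],
    "sex": ["Male", "Female", "Male", "Female"],
})

# Bin the continuous variables into labeled intervals
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                       labels=["young", "middle", "older"])
df["hours_bin"] = pd.cut(df["hours_per_week"], bins=[0, 35, 45, 100],
                         labels=["part_time", "full_time", "overtime"])

# One-hot (dummy) encode the categorical predictors
dummies = pd.get_dummies(df[["workclass", "sex", "age_bin", "hours_bin"]])
```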

Modeling approaches

  • Logistic Regression
  • Naïve Bayes
  • Hierarchical Clustering
  • Classification trees
  • K-Nearest Neighbor
  • Neural Network
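A minimal sketch of how such a model bake-off could be run with scikit-learn, using synthetic data in place of the census records (the specific models, parameters, and split below are our own assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the demographic data
X, y = make_classification(n_samples=800, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.4, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=7),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Validation error rate for each candidate model
errors = {name: 1 - m.fit(X_tr, y_tr).score(X_va, y_va)
          for name, m in models.items()}
best = min(errors, key=errors.get)
```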

4 of 11

Best Models

Naïve Bayes

  • Better at higher specificities
  • Error rate of 19.20% in validation data
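Sensitivity, specificity, and the error rate all come straight from the validation confusion matrix. The counts below are invented, chosen only so the overall error works out to the reported 19.20%:

```python
# Hypothetical validation confusion-matrix counts (not the actual results)
tp, fn = 616, 184    # records truly earning >= $50K
tn, fp = 1000, 200   # records truly earning < $50K

sensitivity = tp / (tp + fn)               # true-positive rate
specificity = tn / (tn + fp)               # true-negative rate
error_rate = (fp + fn) / (tp + fn + tn + fp)
```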

5 of 11

Best Models

K-Nearest Neighbor

  • Better at higher sensitivities
  • Best error rate of 18.30% in validation data
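Choosing the neighborhood size k by validation error can be sketched as follows, again on synthetic data (the candidate k values are our own):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.4, random_state=1)

# Validation error rate for each candidate neighborhood size
knn_errors = {}
for k in (1, 3, 5, 7, 9, 11, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    knn_errors[k] = 1 - knn.score(X_va, y_va)

best_k = min(knn_errors, key=knn_errors.get)
```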

6 of 11

Best Pruned Tree

The best pruned tree classified the data almost as well as the other models, with an easier visual presentation.

From the pruned tree we can infer that if a record is a husband with an education level of more than 13 years, we classify him as earning more than $50,000.
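A cost-complexity-pruned tree and its printed if/else rules can be produced like this; the data is synthetic and the feature names are placeholders for fields such as Relationship and Education:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=5, random_state=2)
feature_names = [f"feat_{i}" for i in range(5)]

# ccp_alpha > 0 prunes subtrees whose complexity outweighs their accuracy gain
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=2).fit(X, y)

# Human-readable rules, analogous to "husband and education > 13 years"
rules = export_text(tree, feature_names=feature_names)
```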

7 of 11

Association Rules/Clusters

If a new record is to be classified as earning more than $50,000, the most probable variable values are:

  • White, with native continent North America
  • Marital status: married, civilian spouse
  • Male, with an education level of more than 13 years (college)
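The quality of such a rule is judged by its support and confidence; a toy computation in plain Python (the records and attribute labels are invented for illustration):

```python
# Each record is the set of attribute values it exhibits (toy data)
records = [
    {"White", "NorthAmerica", "Married", "Male", "Educ>13", ">50K"},
    {"White", "NorthAmerica", "Married", "Male", "Educ>13", ">50K"},
    {"White", "NorthAmerica", "Married", "Male", "<=50K"},
    {"Black", "NorthAmerica", "Single", "Female", "<=50K"},
    {"White", "Asia", "Married", "Male", "Educ>13", ">50K"},
]

antecedent = {"White", "Married", "Male", "Educ>13"}
consequent = {">50K"}

def support(itemset):
    """Fraction of records containing every item in the itemset."""
    return sum(itemset <= r for r in records) / len(records)

# confidence = P(consequent | antecedent)
confidence = support(antecedent | consequent) / support(antecedent)
```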

We used Hierarchical clustering to see if any meaningful clusters could be discovered. Any relationships uncovered were either obvious, such as continental/racial groupings, or unusable.
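Hierarchical (Ward) clustering of this kind can be sketched with SciPy; the two synthetic blobs below stand in for the demographic feature vectors:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Two well-separated synthetic groups of 20 points each
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(3.0, 0.3, (20, 2))])

# Ward linkage builds the merge tree; cut it into two flat clusters
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
```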

8 of 11

Results

  • K-Nearest Neighbor

9 of 11

Results

Classification tree: decision nodes on Relationship and binned Education level support a quick judgment for new records

We were able to create a model with which we could reliably classify a new record as having over $50,000 in income.

Our best model for doing so was the Naive Bayes algorithm, with an error rate of 19.20%.

10 of 11

Conclusion

We were able to successfully create a model to classify income over $50,000, with an error rate of 19.20%.

The classification tree and association rules tell us that education level, continent of origin (specifically North America), being married, race (White), and sex (male) are significant factors in having a high income.

11 of 11

Questions?