Mining a Demographic Dataset
Senthil Shanmugam
Seth Carlson
Ishwarya Sethuraman
Daniel Ryan
Dataset
Goals
The Data
Data description
Transformations of the Dataset
Modeling approaches
Best Models
Naïve Bayes
Best Models
K-Nearest Neighbor
Best Pruned Tree
The best pruned tree classified the data almost as well as the other models, while being much easier to interpret visually.
From the pruned tree we can infer that if a record is a husband with more than 13 years of education, we classify him as earning more than $50,000.
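The pruned-tree rule above can be sketched as follows. This is a minimal illustration using scikit-learn's cost-complexity pruning on toy records; the feature encoding and data values are assumptions, not the project's actual preprocessing of the Adult dataset.

```python
# Sketch of a small pruned decision tree like the one described.
# Toy data and encodings are illustrative only.
from sklearn.tree import DecisionTreeClassifier

# Toy records: [is_husband (0/1), education_years]
X = [[1, 14], [1, 16], [1, 15], [0, 14], [0, 9], [1, 9], [0, 12], [1, 10]]
# Label: 1 = income > $50,000, 0 = otherwise
y = [1, 1, 1, 0, 0, 0, 0, 0]

# Cost-complexity pruning (ccp_alpha) keeps the tree small, mirroring
# the two decision nodes (relationship, then education) on the slide.
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

# A husband with more than 13 years of education is classified as >$50K
print(tree.predict([[1, 15]])[0])
```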
Association Rules/Clusters
If a new record is to be classified as earning more than $50,000, the most probable attribute values are:
- race is White, with native continent North America
- marital status is married, civilian spouse
- male, with more than 13 years of education (college)
We used Hierarchical clustering to see if any meaningful clusters could be discovered. Any relationships uncovered were either obvious, such as continental/racial groupings, or unusable.
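The hierarchical clustering attempt can be sketched as below. This is a minimal example with SciPy's Ward linkage on a tiny made-up numeric matrix; the real project clustered the encoded Adult records, so the features and values here are assumptions.

```python
# Sketch: hierarchical (agglomerative) clustering, as tried on this slide.
# Toy numeric records are illustrative stand-ins for encoded Adult rows.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy records (e.g. age, education_years); standardize first in practice
X = np.array([[25, 9], [27, 10], [26, 9], [50, 16], [52, 15], [51, 16]], float)

Z = linkage(X, method="ward")                    # build the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")  # cut it into 2 clusters

print(labels)  # two clearly separated groups fall into different clusters
```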
Results
Classification tree: decision nodes at Relationship and binned Education level allow a quick judgement of income class
We were able to create a model with which we could reliably classify a new record as having over $50,000 in income.
Our best model for doing so was the Naive Bayes algorithm, with an error rate of 19.20%.
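A Naive Bayes classifier of this kind can be sketched as follows. The 19.20% error rate came from the actual Adult data; the synthetic two-feature data below is an assumption used only to make the example self-contained, so its error rate is not meaningful.

```python
# Sketch: Gaussian Naive Bayes as the best-performing model.
# Synthetic data stands in for the encoded Adult records.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Two features loosely like age and education_years for each class
X_low = rng.normal([35, 10], [8, 2], size=(200, 2))   # income <= $50K
X_high = rng.normal([45, 14], [8, 2], size=(200, 2))  # income > $50K
X = np.vstack([X_low, X_high])
y = np.array([0] * 200 + [1] * 200)

# Hold out a validation set and measure the misclassification rate
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GaussianNB().fit(X_tr, y_tr)
error_rate = 1 - model.score(X_te, y_te)
print(f"validation error rate: {error_rate:.2%}")
```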
Conclusion
We successfully created a model to classify income over $50,000 with an error rate of 19.20%.
The classification tree and association rules tell us that education level, continent of origin (specifically North America), being married, race (White), and sex (male) are significant factors in having a high income.
Questions?
Data set source - http://archive.ics.uci.edu/ml/datasets/Adult