1 of 11

Mining a Demographic Dataset

Senthil Shanmugam

Seth Carlson

Ishwarya Sethuraman

Daniel Ryan

2 of 11

Dataset

Goals

  • Identify the best model for classifying records in the dataset by income >= $50K
  • Identify the individual demographics that lead to association rules for income >= $50K

The Data

  • Demographic dataset of US residents, with a binary dependent variable: income >= $50K
  • The dataset includes:
    • Age
    • Work class
    • Education
    • Marital status
    • Occupation
    • Relationship
    • Race
    • Sex
    • Hours worked per week
    • Native country
    • Income from investment

3 of 11

Data description

Transformations of the Dataset

  • Binned the continuous variables
  • Created dummy variables for the categorical predictor variables
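The two transformations above might look like this in pandas. The column names and bin edges here are our own illustrative choices, not the deck's actual fields:

```python
import pandas as pd

# Hypothetical mini-sample standing in for the census records
df = pd.DataFrame({
    "age": [23, 37, 52, 61],
    "hours_per_week": [20, 40, 55, 60],
    "workclass": ["Private", "Self-emp", "Private", "Gov"],
    "sex": ["Male", "Female", "Male", "Female"],
})

# Bin the continuous variables into labeled intervals
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                       labels=["young", "middle", "older"])
df["hours_bin"] = pd.cut(df["hours_per_week"], bins=[0, 35, 45, 100],
                         labels=["part_time", "full_time", "overtime"])

# One-hot (dummy) encode the categorical predictors
dummies = pd.get_dummies(df[["workclass", "sex", "age_bin", "hours_bin"]])
```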

Modeling approaches

  • Logistic Regression
  • Naïve Bayes
  • Hierarchical Clustering
  • Classification trees
  • K-Nearest Neighbor
  • Neural Network
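A minimal sketch of how such a model bake-off could be run with scikit-learn, using synthetic data in place of the census records (the specific models, parameters, and split below are our own assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the demographic data
X, y = make_classification(n_samples=800, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.4, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=7),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Validation error rate for each candidate model
errors = {name: 1 - m.fit(X_tr, y_tr).score(X_va, y_va)
          for name, m in models.items()}
best = min(errors, key=errors.get)
```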

4 of 11

Best Models

Naïve Bayes

  • Better at higher specificities
  • Error rate of 19.20% in validation data
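Sensitivity, specificity, and the error rate all come straight from the validation confusion matrix. The counts below are invented, chosen only so the overall error works out to the reported 19.20%:

```python
# Hypothetical validation confusion-matrix counts (not the actual results)
tp, fn = 616, 184    # records truly earning >= $50K
tn, fp = 1000, 200   # records truly earning < $50K

sensitivity = tp / (tp + fn)               # true-positive rate
specificity = tn / (tn + fp)               # true-negative rate
error_rate = (fp + fn) / (tp + fn + tn + fp)
```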

5 of 11

Best Models

K-Nearest Neighbor

  • Better at higher sensitivities
  • Best error rate of 18.30% in validation data
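Choosing the neighborhood size k by validation error can be sketched as follows, again on synthetic data (the candidate k values are our own):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.4, random_state=1)

# Validation error rate for each candidate neighborhood size
knn_errors = {}
for k in (1, 3, 5, 7, 9, 11, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    knn_errors[k] = 1 - knn.score(X_va, y_va)

best_k = min(knn_errors, key=knn_errors.get)
```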

6 of 11

Best Pruned Tree

The best pruned tree classified the data almost as well as the other models, with an easier visual presentation.

From the pruned tree we can infer that if a record is a husband with an education level of more than 13 years, we classify him as earning more than $50,000.
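A cost-complexity-pruned tree and its printed if/else rules can be produced like this; the data is synthetic and the feature names are placeholders for fields such as Relationship and Education:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=5, random_state=2)
feature_names = [f"feat_{i}" for i in range(5)]

# ccp_alpha > 0 prunes subtrees whose complexity outweighs their accuracy gain
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=2).fit(X, y)

# Human-readable rules, analogous to "husband and education > 13 years"
rules = export_text(tree, feature_names=feature_names)
```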

7 of 11

Association Rules/Clusters

If a new record is to be classified as earning more than $50,000, the most probable variable values are:

  • White, with native continent North America
  • Marital status: married, civilian spouse
  • Male, with an education level of more than 13 years (college)
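The quality of such a rule is judged by its support and confidence; a toy computation in plain Python (the records and attribute labels are invented for illustration):

```python
# Each record is the set of attribute values it exhibits (toy data)
records = [
    {"White", "NorthAmerica", "Married", "Male", "Educ>13", ">50K"},
    {"White", "NorthAmerica", "Married", "Male", "Educ>13", ">50K"},
    {"White", "NorthAmerica", "Married", "Male", "<=50K"},
    {"Black", "NorthAmerica", "Single", "Female", "<=50K"},
    {"White", "Asia", "Married", "Male", "Educ>13", ">50K"},
]

antecedent = {"White", "Married", "Male", "Educ>13"}
consequent = {">50K"}

def support(itemset):
    """Fraction of records containing every item in the itemset."""
    return sum(itemset <= r for r in records) / len(records)

# confidence = P(consequent | antecedent)
confidence = support(antecedent | consequent) / support(antecedent)
```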

We used Hierarchical clustering to see if any meaningful clusters could be discovered. Any relationships uncovered were either obvious, such as continental/racial groupings, or unusable.
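Hierarchical (Ward) clustering of this kind can be sketched with SciPy; the two synthetic blobs below stand in for the demographic feature vectors:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Two well-separated synthetic groups of 20 points each
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(3.0, 0.3, (20, 2))])

# Ward linkage builds the merge tree; cut it into two flat clusters
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
```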

8 of 11

Results

  • K-Nearest Neighbor

9 of 11

Results

Classification tree: decision nodes on Relationship and binned Education level support a quick judgment for new records

We were able to create a model with which we could reliably classify a new record as having over $50,000 in income.

Our best model for doing so was the Naive Bayes algorithm, with an error rate of 19.20%.

10 of 11

Conclusion

We were able to successfully create a model to classify income over $50,000, with an error rate of 19.20%.

The classification tree and association rules tell us that education level, continent of origin (specifically North America), being married, race (White), and sex (male) are significant factors in having a high income.

11 of 11

Questions?