Super Research Inc.

225 Bush St,

San Francisco, CA 94104

Prepared by Dmitry Karpovich

Feature Selection Report

Here in the Data Science department of Super Research Inc., we take feature selection very seriously. Selecting the right features for predictive model building makes big data manageable, and this report aims to demonstrate how.


UCI Madelon Set

Big Postgres Madelon Set

THE DATA

Features and Targets

Why reduce?

RESOURCES

AWS t2 Micro Servers

Docker and Jupyter Notebooks

THE PROCESS

Exploratory Data Analysis

Getting the Data

Reducing the features

Pearson Correlation Coefficients

SelectKBest

Lasso with SelectFromModel

Model Building

LogisticRegression

KNN

SVC

RandomForestClassifier

Conclusion


UCI Madelon Set

4400 Rows

500 Features

5 Relevant

15 Linear Combinations of Relevant Features

The UCI Madelon Set was used in the NIPS 2003 feature selection challenge.

Big Postgres Madelon Set

200,000 Rows

1000 Features

5 Relevant

15 Linear Combinations of Relevant Features

200,000 rows and 1,001 columns seems like an impossibly large dataset to manage using standard resources.


Features and Targets

Both datasets are binary classification problems with a single-bit target (1 or 0). We have some insight into the data that we can use to our advantage as we attempt to extract the relevant features, greatly reducing the size of our datasets.

Why reduce?


AWS t2 Micro Servers

We have deployed a very cost-effective, barebones AWS server to demonstrate how we can reduce these large datasets to manageable levels, even for t2 Micro servers.

Docker and Jupyter Notebooks

Using Docker containers and Jupyter Notebooks, we can demonstrate Exploratory Data Analysis, Feature Selection and Reduction, and finally Model Building.


Exploratory Data Analysis

Getting the Data

The data for the larger Madelon dataset was uploaded to a second AWS server running PostgreSQL. A tool (connection_helper) was developed by our lead data scientists Juan Sucre and Dmitry Karpovich to quickly connect to and query the database, but from the start we encountered issues due to the size of the dataset and the limited resources of our servers. We devised the following plan to make the data more manageable.

Reducing the features

The idea was to reduce the queries to only relevant features in our sampling so that we could pull larger samples with reduced feature size and thus reduce server load.  These were the steps:

  1. Pull 3 random 1% samples from the Postgres server with all features.
  2. Save the samples as CSVs locally to reduce server load.
  3. Attempt to identify the most relevant features.
  4. Pull 3 more random samples, this time increasing the sample size to 10% while restricting the query to the reduced feature set.
  5. Again, save the 3 samples locally to reduce server load.
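The sampling steps above can be sketched as a small query builder. The table name (`madelon`), the illustrative column names, and the use of Postgres's `TABLESAMPLE BERNOULLI` clause are assumptions for the sketch; in practice our internal connection_helper handles the connection itself.

```python
# Sketch of the sampling queries described above. Table and column
# names are hypothetical; BERNOULLI sampling is one way Postgres can
# return an approximate percentage of rows without a full table scan.

def sample_query(percent, columns="*", table="madelon"):
    """Build a SQL query that pulls a random `percent`% row sample."""
    if isinstance(columns, (list, tuple)):
        columns = ", ".join(columns)
    return (f"SELECT {columns} FROM {table} "
            f"TABLESAMPLE BERNOULLI ({percent})")

# Step 1: three 1% samples with all features
queries_1pct = [sample_query(1) for _ in range(3)]

# Step 4: three 10% samples restricted to candidate features
# (feature names here are purely illustrative)
queries_10pct = [sample_query(10, ["target", "feat_257", "feat_269"])
                 for _ in range(3)]
```

Each query string can then be run through pandas' `read_sql`, and the resulting frame saved with `to_csv` to keep subsequent work off the server.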

Pearson Correlation Coefficients

Correlations of All Columns

We knew from the start that our data was parametric. We knew that there was a binary classification target and that, beyond the 5 informative features, there were 15 additional features that were linear combinations of the 5. We quickly deduced that 20 features should correlate with the target and the rest should not.

We noticed an 80% drop in the absolute value of the Pearson correlation coefficients after the 20th-ranked feature and surmised that these top 20 must be our relevant features.
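A minimal sketch of this correlation screen, with synthetic data standing in for the 1% samples (column names and the injected signal are illustrative, not the real Madelon features):

```python
import numpy as np
import pandas as pd

def top_correlated(df, target_col="target", k=20):
    """Rank features by |Pearson r| against the target; keep the top k."""
    corr = df.drop(columns=[target_col]).corrwith(df[target_col]).abs()
    return corr.sort_values(ascending=False).head(k).index.tolist()

# Synthetic stand-in: 50 noise columns, one of which gets real signal
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
df = pd.DataFrame(rng.normal(size=(500, 50)),
                  columns=[f"feat_{i:03d}" for i in range(50)])
df["feat_000"] += 2 * y   # inject a strongly correlated feature
df["target"] = y

keep = top_correlated(df, k=20)
```

On the real samples, plotting the sorted absolute coefficients makes the drop after the 20th feature visible to the eye.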

Exploring the Top 20 Correlations



SelectKBest

After interpreting the Pearson correlation coefficients and confirming our results across three 1% samples, it was time to pull out new tools to weed out the 5 true features. SelectKBest was our next choice. Using the f_classif score function gave us ANOVA F-value scores for each of our top 20 features against the target.

Results of f_classif for 3 10% samples with top 20 correlated features.

Results of f_classif for 1% sample with all 1000 features.

Our results are fairly consistent across both the 20-feature samples and our 1,000-feature sample, but this information did not prove very useful.


We attempted to see if we could glean any valuable insight by changing the scoring function to mutual_info_classif.
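Both runs use the same SelectKBest interface; only the score function changes. A minimal sketch on synthetic data (the single injected informative column is an assumption of the sketch):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Synthetic stand-in: 20 noise columns, one carrying real signal
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 400)
X = rng.normal(size=(400, 20))
X[:, 3] += y            # one informative column among 20

# ANOVA F-value scoring
skb = SelectKBest(score_func=f_classif, k=5).fit(X, y)
anova_scores = skb.scores_          # one F-value per feature

# Same interface, mutual-information scoring
skb_mi = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
```

`get_support()` on either fitted selector returns the boolean mask of the k kept features, which is how we compared the two scorers' picks.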

Results of mutual_info_classif for 3 10% samples with top 20 correlated features.

Results of mutual_info_classif for 1% sample with all 1000 features.

SelectKBest Conclusions:

Though this information helped identify features that were informative to the target, we still did not learn which 5 features were true and which were linear combinations of the 5.


Lasso with SelectFromModel

Our next approach was to use the Lasso model to push the coefficient weights of redundant features to 0 using L1 regularization. By setting a very low threshold on SelectFromModel, we attempted to extract our 5 relevant features.


LassoCV allowed us to find a regularization hyperparameter that would reduce our features to 5.  An alpha of .01 was the sweet spot.  


By tweaking the alpha and thresholds of our models, we consistently came back with the same 5 features.
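The Lasso approach can be sketched as follows. The synthetic data (5 true features, 15 linear combinations, noise) mimics the Madelon structure but is not the real dataset, so the recovered feature set and the CV-chosen alpha will differ from the report's alpha of .01:

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in shaped like Madelon: 5 true features,
# 15 linear combinations of them, and 30 pure-noise columns
rng = np.random.default_rng(2)
true = rng.normal(size=(500, 5))
copies = true @ rng.normal(size=(5, 15))
noise = rng.normal(size=(500, 30))
X = StandardScaler().fit_transform(np.hstack([true, copies, noise]))
y = (true.sum(axis=1) > 0).astype(int)

# LassoCV searches a path of alphas via cross-validation
alpha = LassoCV(cv=5, random_state=0).fit(X, y).alpha_

# A very low threshold keeps any feature with a non-trivial coefficient
selector = SelectFromModel(Lasso(alpha=alpha), threshold=1e-5).fit(X, y)
selected = np.flatnonzero(selector.get_support())
```

Raising `alpha` or the `threshold` shrinks the surviving set; on the real data, tuning the two until exactly 5 features remained was the workflow described above.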

Model Building

With what we assume to be our top 5 features, we can compare model results across three feature sets: PCA components, our 5 selected true features, and our top 20 correlated features.

For our sample, we used 50% of the Postgres Madelon Set with our top 20 correlated features.


Given this is a classification problem, the first thought was to use the lightest of our classification models, LogisticRegression.

Naive Logistic Regression Score:  .602


  1. StandardScaler
  2. PCA: 5 features
  3. LogisticRegression: C = .1



  1. StandardScaler
  2. SelectKBest: 5 features
  3. LogisticRegression: C = .1
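The two pipelines above can be sketched as follows. The synthetic data and train/test split stand in for the 50% Postgres sample, so the scores will not match the report's:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 5 informative columns among 20
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 20))
y = (X[:, :5].sum(axis=1) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pipeline 1: scale -> PCA to 5 components -> logistic regression
pca_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("lr", LogisticRegression(C=.1)),
]).fit(X_tr, y_tr)

# Pipeline 2: scale -> SelectKBest 5 features -> logistic regression
skb_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("skb", SelectKBest(f_classif, k=5)),
    ("lr", LogisticRegression(C=.1)),
]).fit(X_tr, y_tr)

scores = (pca_pipe.score(X_te, y_te), skb_pipe.score(X_te, y_te))
```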


Conclusion:  Though Logistic Regression fits and tests very quickly, it does not score well.


K Nearest Neighbors is another go-to classifier.  We attempted to see if we could get better results with that.

Naive KNN Score:  .849


  1. StandardScaler
  2. n_neighbors: 9



  1. StandardScaler
  2. PCA: 5
  3. n_neighbors: 9
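The KNN pipeline with scaling and PCA can be sketched the same way; again, synthetic data stands in for the real sample, so the train/test gap noted below will not reproduce exactly:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 5 informative columns among 20
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 20))
y = (X[:, :5].sum(axis=1) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale -> PCA to 5 components -> 9-nearest-neighbors vote
knn_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("knn", KNeighborsClassifier(n_neighbors=9)),
]).fit(X_tr, y_tr)

# Comparing the two scores is how we watched for overfitting
train_score = knn_pipe.score(X_tr, y_tr)
test_score = knn_pipe.score(X_te, y_te)
```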


Conclusion:  KNN gave us much better results than Logistic Regression, but we must be careful not to overfit.  We observed a .04 difference between train and test scores.  StandardScaler increased train scores but decreased test scores.  Changing the weights and p (distance metric) parameters only lowered the score.


SVC is a resource-intensive classifier model.  We attempted to run some fits, but our t2 Micro could not handle it.


RandomForestClassifier is a relatively quick algorithm to fit and test.  We attempted to see if we could get better results with it.  Running this classifier was a painstaking process, with many hyperparameters to test in the grid search.

Naive RandomForestClassifier Score:  .840


  1. StandardScaler
  2. PCA: 5
  3. RFC__n_estimators: 120
  4. RFC__min_samples_split: 4
  5. RFC__max_depth: 10
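The grid search over those hyperparameters can be sketched like this; the synthetic data and the candidate values in the grid are illustrative, with the winning settings above included among them:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 5 informative columns among 20
rng = np.random.default_rng(5)
X = rng.normal(size=(400, 20))
y = (X[:, :5].sum(axis=1) > 0).astype(int)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("rfc", RandomForestClassifier(random_state=0)),
])

# Step-name prefixes ("rfc__") route each parameter to its pipeline step
grid = GridSearchCV(pipe, {
    "rfc__n_estimators": [60, 120],
    "rfc__min_samples_split": [2, 4],
    "rfc__max_depth": [5, 10],
}, cv=3).fit(X, y)

best = grid.best_params_
```

Every added hyperparameter value multiplies the number of fits, which is why this search was the slow, painstaking part on a t2 Micro.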


Conclusion:  RandomForestClassifier gave us similar results to KNN but seemed much more resource hungry.


K-Nearest Neighbors has proven to be not only our most accurate classifier model with a score of .850, but also one of the fastest to fit and test.  This was all possible on a t2-Micro thanks to extensive EDA and feature selection using Pearson Correlation Coefficients, SelectKBest, as well as some undocumented tests using ExtraTreesClassifier.  With these methods, we were able to reduce to the 20 most relevant features, and gained insight into our possible 5 true features.

After reducing the number of features by 98% on our Postgres Madelon dataset, we could implement and test some serious pipelines with many different classification models.  If we were to increase our system resources, Gradient Descent and SVC models could be tested, but this was beyond the scope of our assignment.

We at Super Research Inc. hope that you find this documentation useful in understanding how we can help you manage the Data Science aspects of your Big Data.  Thank you for inquiring.