225 Bush St,
San Francisco, CA 94104
Prepared by Dmitry Karpovich
Feature Selection Report
Here in the Data Science department of Super Research Inc., we take feature selection very seriously. Selecting the right features for predictive model building makes big data manageable, and this proposal aims to demonstrate how.
THE MADELON DATASETS
UCI Madelon Set
Big Postgres Madelon Set
THE DATA
Features and Targets
Why reduce?
AWS t2.micro Servers
Docker and Jupyter Notebooks
THE PROCESS
Exploratory Data Analysis
Getting the Data
Reducing the Features
Pearson Correlation Coefficients
Lasso with SelectFromModel
Model Building
15 Linear Combinations of Relevant Features
The UCI Madelon Set was used in the NIPS 2003 feature selection challenge.
200,000 rows and 1,001 columns seem like an impossibly big dataset to manage using standard resources.
Both datasets are binary classification problems with targets that contain a single bit (1 or 0). We have some insight into the data that we can use to our advantage as we attempt to extract the relevant features, greatly reducing our datasets in size.
We have deployed a very cost-effective, barebones AWS server to demonstrate how we can reduce these large datasets to manageable levels, even on t2.micro instances.
Using Docker containers and Jupyter Notebooks, we can demonstrate Exploratory Data Analysis, Feature Selection and Reduction, and finally Model Building.
The data for the larger Madelon dataset was uploaded to a second AWS server running PostgreSQL. A tool (connection_helper) was developed by our lead data scientists Juan Sucre and Dmitry Karpovich to quickly connect to and query the database, but from the start we encountered issues due to the size of the dataset and the limited resources of our servers. We devised the following plan to make the data more manageable.
The idea was to reduce our queries to only the features that looked relevant in our sampling, so that we could pull larger samples with a reduced feature set and thus reduce server load. These were the steps:
We knew from the start that our data was parametric, that the target was a binary classification, and that alongside 5 informative features there were 15 additional features that were linear combinations of those 5. We quickly deduced that these 20 features should correlate with the target and the rest should not.
We noticed an 80% drop in the absolute value of the Pearson correlation coefficients after the 20th-ranked feature and surmised that the top 20 must be our relevant features.
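The correlation-ranking step can be sketched as follows. This is a minimal illustration on synthetic Madelon-like data (the column names, row counts, and mixing weights are our own invention, not the actual dataset): 5 informative features, 15 positive-weighted linear combinations of them, and pure-noise columns, ranked by absolute Pearson correlation with the target.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a Madelon-style sample (hypothetical layout):
# 5 informative features, 15 linear combinations of them, then noise.
rng = np.random.default_rng(0)
n_rows, n_noise = 2000, 100
informative = rng.normal(size=(n_rows, 5))
mix = rng.uniform(0.75, 1.25, size=(5, 15))   # positive combination weights
combos = informative @ mix                    # 15 linear combinations
noise = rng.normal(size=(n_rows, n_noise))
target = (informative @ rng.uniform(0.75, 1.25, size=5) > 0).astype(int)

X = pd.DataFrame(
    np.hstack([informative, combos, noise]),
    columns=[f"feat_{i:03d}" for i in range(5 + 15 + n_noise)],
)

# Rank every feature by its absolute Pearson correlation with the target.
corr = X.corrwith(pd.Series(target)).abs().sort_values(ascending=False)
top20 = corr.head(20)
print(top20)
# The sharp drop-off after the 20th-ranked feature marks the noise floor.
```

On real data the same ranking would be computed over a 1% sample pulled from Postgres, which is what makes the larger pulls cheap: subsequent queries only request the 20 surviving columns.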
After interpreting the Pearson correlation coefficients and confirming our results across three 1% samples, it was time to pull out new tools to weed out the 5 true features. SelectKBest was our next choice. Using the f_classif score function gave us ANOVA F-value scores for each of our top 20 features against the target.
Results of f_classif for three 10% samples with the top 20 correlated features.
Results of f_classif for a 1% sample with all 1,000 features.
Our results are fairly consistent across both the 20-feature samples and our 1,000-feature sample, but this information did not prove very useful.
We attempted to see whether we could glean any valuable insight by changing the scoring function to mutual_info_classif.
Results of mutual_info_classif for three 10% samples with the top 20 correlated features.
Results of mutual_info_classif for a 1% sample with all 1,000 features.
Though this information helped identify features that were informative to the target, we still did not learn which 5 features were true and which were linear combinations of them.
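The two scoring passes described above can be sketched with scikit-learn. Here make_classification stands in for a Madelon sample (its n_redundant features are linear combinations of the informative ones, much like Madelon's); the sizes and seed are illustrative assumptions, not the report's data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Madelon-like toy data: 5 informative features plus 15 linear
# combinations of them ("redundant"), the rest noise. With shuffle=False
# the informative/redundant block occupies the first 20 columns.
X, y = make_classification(
    n_samples=2000, n_features=100, n_informative=5,
    n_redundant=15, shuffle=False, random_state=0,
)

# ANOVA F-value of every feature against the target.
f_selector = SelectKBest(score_func=f_classif, k=20).fit(X, y)

# Mutual information scores as an alternative, nonlinear view.
mi_scores = mutual_info_classif(X, y, random_state=0)

top_f = np.argsort(f_selector.scores_)[::-1][:20]
top_mi = np.argsort(mi_scores)[::-1][:20]
print(sorted(top_f))
print(sorted(top_mi))
```

As in the report, both scorers separate the 20 relevant columns from the noise, but neither can distinguish the 5 true features from their linear combinations, since a combination of informative features is itself informative.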
Our next approach was to use the Lasso model to push the coefficient weights of redundant features to 0 via L1 regularization. By setting a very low threshold on SelectFromModel, we attempted to extract our 5 relevant features.
LassoCV allowed us to find a regularization hyperparameter that would reduce our feature set to 5; an alpha of 0.01 was the sweet spot.
By tweaking the alpha and thresholds of our models, we consistently came back with the same 5 features:
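A minimal sketch of the Lasso-based selection, again on synthetic stand-in data (the CV-selected alpha and the selected columns here are illustrative; on the real dataset the report tuned alpha to 0.01 by hand):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler

# Madelon-like toy data: first 5 columns informative, next 15 redundant.
X, y = make_classification(
    n_samples=2000, n_features=100, n_informative=5,
    n_redundant=15, shuffle=False, random_state=0,
)
X = StandardScaler().fit_transform(X)

# Let LassoCV search for a regularization strength, then rely on L1
# sparsity to zero out redundant features. The very low threshold on
# SelectFromModel keeps only features with a nonzero coefficient.
alpha = LassoCV(cv=5, random_state=0).fit(X, y).alpha_
selector = SelectFromModel(Lasso(alpha=alpha), threshold=1e-5).fit(X, y)
selected = np.flatnonzero(selector.get_support())
print(f"alpha={alpha:.4f}, selected columns: {selected}")
```

Raising alpha (or the threshold) shrinks the surviving set further, which is the tweaking described above: among correlated features, L1 regularization tends to keep one representative and zero out the rest.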
With what we assume to be our 5 true features, we can compare model results across three candidate datasets: PCA components, our 5 selected true features, and our 20 correlated features.
For our sample, we used 50% of the Postgres Madelon Set with our top 20 correlated features.
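The three-way comparison can be sketched like this, again on synthetic Madelon-like data (the layout, classifier choice, and scores are illustrative, not the report's results):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the 20 correlated columns: 5 informative + 15 redundant.
# With shuffle=False, the 5 informative columns come first.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    n_redundant=15, shuffle=False, random_state=0,
)

# Three candidate datasets: 5 PCA components, the (assumed) 5 true
# features, and all 20 correlated features.
datasets = {
    "pca_5": PCA(n_components=5, random_state=0).fit_transform(X),
    "true_5": X[:, :5],
    "top_20": X,
}
scores = {}
for name, data in datasets.items():
    scores[name] = cross_val_score(KNeighborsClassifier(), data, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Because the 15 redundant columns are exact linear combinations of the informative ones, the PCA components and the true-feature subset carry essentially the same signal as the full 20 columns, at a quarter of the width.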
Given that this is a classification problem, our first thought was to use the lightest of our classification models, LogisticRegression.
Naive Logistic Regression score: 0.602
Conclusion: Though Logistic Regression fits and tests very quickly, it does not score well.
K Nearest Neighbors is another go-to classifier, so we attempted to see if we could get better results with it.
Naive KNN score: 0.849
Conclusion: KNN gave us much better results than Logistic Regression, but we must be careful not to overfit. We observed a 0.04 difference between train and test scores. StandardScaler increased train scores but decreased test scores, and changing the weights and distance parameter p only lowered the score.
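The train/test comparison between the two classifiers can be sketched as follows, on synthetic stand-in data (the printed scores are illustrative and will not match the report's 0.602 and 0.849):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the reduced dataset: only the 20 "relevant" columns.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    n_redundant=15, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0,
)

results = {}
for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier()):
    model.fit(X_train, y_train)
    train = model.score(X_train, y_train)
    test = model.score(X_test, y_test)
    # A large train/test gap is the overfitting warning sign noted above.
    results[type(model).__name__] = (train, test)
    print(f"{type(model).__name__}: train={train:.3f} test={test:.3f}")
```

Comparing both scores per model, rather than the test score alone, is what flags the KNN overfitting risk discussed above.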
SVC is a resource-intensive classifier. We attempted to run some fits, but our t2.micro could not handle it.
RandomForestClassifier is a relatively quick algorithm to fit and test, so we attempted to see if we could get better results with it. Running this classifier was a painstaking process, with many hyperparameters to test in the grid search.
Naive RandomForestClassifier score: 0.840
Conclusion: RandomForestClassifier gave us results similar to KNN but seemed much more resource-hungry.
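A minimal sketch of the grid search, with a deliberately small parameter grid (the report's actual grid was larger, which is what made the search painstaking on a t2.micro; data and grid values here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the reduced dataset: only the 20 "relevant" columns.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    n_redundant=15, random_state=0,
)

# Small example grid; every added parameter multiplies the number of
# fits, so grid size dominates runtime on constrained hardware.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=3, n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```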
K-Nearest Neighbors proved to be not only our most accurate classifier, with a score of 0.850, but also one of the fastest to fit and test. This was all possible on a t2.micro thanks to extensive EDA and feature selection using Pearson correlation coefficients, SelectKBest, and some undocumented tests using ExtraTreesClassifier. With these methods, we were able to reduce the data to the 20 most relevant features and gained insight into our 5 likely true features.
After reducing the number of features by 98% on our Postgres Madelon dataset, we could implement and test some serious pipelines with many different classification models. If we were to increase our system resources, Gradient Descent and SVC models could be tested, but this was beyond the scope of our assignment.
We at Super Research Inc. hope that you find this documentation useful in understanding how we can help you manage the Data Science aspects of your Big Data. Thank you for inquiring.