What I Learned Analyzing the Earnings of 11,037 Software Developers
Nnamdi Iregbulem
Problem
Most analyses of worker pay are fundamentally unrigorous
Lack of Rigor
Simple averages do not tell the full story
The average data scientist is different in numerous ways from the average database administrator
These differences correlate with pay
Hence, we cannot isolate the impact of developer type on pay using simple averages
We need to control for other variables
Ceteris Paribus
“All else equal, a 45-year-old software developer earns X% more than a similar 30-year-old developer”
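A minimal sketch of why simple averages mislead, using synthetic data only (none of the numbers below come from the survey). It assumes a toy world in which data scientists happen to have more experience, and experience raises pay. The function names and parameter values are hypothetical, chosen for illustration.

```python
import numpy as np

def simulate_pay(n=5000, true_type_effect=0.05, seed=0):
    """Synthetic sample: developer type is correlated with experience."""
    rng = np.random.default_rng(seed)
    is_data_scientist = rng.integers(0, 2, size=n)
    # Assumption: data scientists average 4 more years of experience.
    years_experience = rng.normal(8 + 4 * is_data_scientist, 3, size=n)
    log_pay = (
        11.0
        + true_type_effect * is_data_scientist
        + 0.03 * years_experience
        + rng.normal(0, 0.2, size=n)
    )
    return is_data_scientist, years_experience, log_pay

def raw_gap(is_ds, log_pay):
    """Difference in simple averages between the two developer types."""
    return log_pay[is_ds == 1].mean() - log_pay[is_ds == 0].mean()

def adjusted_gap(is_ds, years, log_pay):
    """OLS coefficient on developer type, holding experience fixed."""
    X = np.column_stack([np.ones_like(years), is_ds, years])
    coef, *_ = np.linalg.lstsq(X, log_pay, rcond=None)
    return coef[1]

is_ds, years, log_pay = simulate_pay()
print("raw gap in log pay:     ", round(raw_gap(is_ds, log_pay), 3))
print("adjusted gap in log pay:", round(adjusted_gap(is_ds, years, log_pay), 3))
```

In this toy setup the raw gap mixes the developer-type effect with the experience effect, while the regression coefficient stays close to the simulated type effect of 0.05.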
Solution
Regression Analysis
Regressions Have Issues Too
Overfitting
Confounders / Omitted variable bias
Multicollinearity
P-hacking
Correlation is not causation
(Real) Solution
Regression Analysis*
(* with a twist)
Methodology
Double Lasso Variable Selection
Leverages two stages of lasso regression for principled covariate selection
Removes inappropriate covariates while retaining independent variables with high predictive power
Yields better (less biased) estimates of potentially causal relationships (though does not prove causality)
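The two-stage procedure described above can be sketched with scikit-learn. This is an illustration under stated assumptions, not the code behind the talk: the function name double_lasso, the standardization step, and the synthetic data in demo are all hypothetical. Lambda is chosen by cross-validation via LassoCV.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler

def double_lasso(y, t, X, cv=5, seed=0):
    """Estimate the coefficient on t after double-lasso covariate selection.

    y: outcome, t: variable of interest, X: candidate covariates (n x p).
    Returns the OLS coefficient on t and the indices of selected covariates.
    """
    # Put covariates on a common scale so the penalty treats them alike.
    Xs = StandardScaler().fit_transform(X)

    # Stage 1: covariates that predict the outcome.
    lasso_y = LassoCV(cv=cv, random_state=seed).fit(Xs, y)
    # Stage 2: covariates that predict the variable of interest.
    lasso_t = LassoCV(cv=cv, random_state=seed).fit(Xs, t)

    # Keep the union of covariates selected in either stage.
    selected = np.flatnonzero((lasso_y.coef_ != 0) | (lasso_t.coef_ != 0))

    # Final stage: plain OLS of y on t plus the selected covariates.
    design = np.column_stack([t, Xs[:, selected]])
    ols = LinearRegression().fit(design, y)
    return ols.coef_[0], selected

def demo(n=1000, p=30, seed=0):
    """Synthetic check: the first covariate confounds t and y."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    t = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
    y = 0.5 * t + 1.0 * X[:, 0] + rng.normal(size=n)  # simulated effect: 0.5
    return double_lasso(y, t, X)

effect, selected = demo()
print("estimated coefficient on t:", round(effect, 3))
print("number of covariates kept: ", len(selected))
```

Taking the union of the two selections is the point of the second lasso: a covariate that only weakly predicts the outcome but strongly predicts T is still kept, which guards against omitted variable bias in the final OLS.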
Double Lasso Variable Selection
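Traditional Lasso loss function: sum of squared residuals plus lambda times the sum of the absolute values of the coefficients
Step 1: Lasso of Y on X
Step 2: Lasso of T on X
Step 3: OLS of Y on T and the covariates selected in either Step 1 or Step 2
Naive OLS regression (for comparison): Y on T and all of X
where Y is the outcome, T is the variable of interest, and X is the set of candidate covariates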
Each lasso regression should be calibrated (thereby selecting lambda) using cross-validation
Coefficient on T will better reflect true relationship
Verifies relevance of covariates
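For reference, the pieces named on this slide can be written in standard textbook notation. The symbols below are assumptions chosen for illustration and are not taken from the slides: n observations, outcome y_i, variable of interest t_i, candidate covariates x_i, and penalty weights lambda_1 and lambda_2 chosen by cross-validation.

```latex
% Lasso objective (one common scaling of the squared-error term)
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta}\;
    \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^2
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert

% Naive OLS: outcome on the variable of interest and all covariates
y_i = \alpha\, t_i + x_i^{\top}\beta + \varepsilon_i

% Double lasso
% Step 1: lasso of Y on X, selected set A_1
A_1 = \{\, j : \hat{\beta}_j(\lambda_1) \neq 0 \,\}
% Step 2: lasso of T on X, selected set A_2
A_2 = \{\, j : \hat{\gamma}_j(\lambda_2) \neq 0 \,\}
% Step 3: OLS of Y on T and the union of selected covariates
y_i = \alpha\, t_i + \sum_{j \in A_1 \cup A_2} \beta_j\, x_{ij} + \varepsilon_i
```

The estimate of alpha from Step 3 is the coefficient on T referred to above.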
Results
We can do better
Thank You!
Full results at: