1 of 26

What I Learned Analyzing the Earnings of 11,037 Software Developers

Nnamdi Iregbulem

2 of 26

  • Problem
  • Solution
  • Methodology
  • Results

3 of 26

Problem

4 of 26

Most analyses of worker pay are fundamentally unrigorous

5 of 26

Lack of Rigor

Simple averages do not tell the full story

The average data scientist is different in numerous ways from the average database administrator

These differences correlate with pay

Hence, we cannot isolate the impact of developer type on pay using simple averages

We need to control for other variables
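
The gap between a raw average and a controlled estimate can be sketched on synthetic data. Everything below is an assumption made for illustration (the two developer types, the experience gap, the coefficients, the use of log pay); none of it comes from the survey behind this talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical setup: 1 = data scientist, 0 = database administrator.
is_ds = rng.integers(0, 2, size=n)
# Assumed: DBAs have more years of experience on average.
experience = rng.normal(loc=np.where(is_ds == 1, 5.0, 12.0), scale=3.0)
# Assumed "true" model: a +0.10 log-point premium for data scientists,
# plus 0.03 log points per year of experience.
log_pay = 11.0 + 0.10 * is_ds + 0.03 * experience + rng.normal(scale=0.2, size=n)

# Simple averages: difference in mean log pay between the two groups.
raw_gap = log_pay[is_ds == 1].mean() - log_pay[is_ds == 0].mean()

# Regression that controls for experience.
Z = np.column_stack([np.ones(n), is_ds, experience])
beta, *_ = np.linalg.lstsq(Z, log_pay, rcond=None)
adjusted_gap = beta[1]

print(f"raw gap: {raw_gap:+.3f}, gap controlling for experience: {adjusted_gap:+.3f}")
```

In this made-up setup the raw comparison has the wrong sign, because the experience gap swamps the developer-type effect; the controlled regression recovers something close to the assumed premium.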

6 of 26

Ceteris Paribus

“All else equal, a 45-year-old software developer earns X% more than a similar 30-year-old developer”
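
One way such a statement could be read off a regression, assuming a log-linear earnings model with age entering linearly (an assumption for illustration; the slides do not state the specification, and the symbols below are not from the talk):

```latex
\ln(\text{earnings}_i) = \alpha + \beta\,\text{age}_i + \gamma^{\top} z_i + \varepsilon_i
\qquad\Longrightarrow\qquad
X = 100\left(e^{(45-30)\,\beta} - 1\right)
```

Here $z_i$ stands for the other controls held fixed, which is what "all else equal" refers to.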

7 of 26

Solution

8 of 26

Regression Analysis

9 of 26

Regressions Have Issues Too

Overfitting

  • Linear regression models overfit easily when given many candidate covariates, sometimes finding spurious relationships
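
A minimal sketch of overfitting, assuming pure-noise data and hypothetical sample sizes: the outcome is unrelated to every covariate, yet the in-sample fit looks strong.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 50  # few observations, many candidate covariates (hypothetical sizes)

def r_squared(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def add_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

# Covariates and outcome are independent noise: there is nothing to find.
X_train, X_test = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_train, y_test = rng.normal(size=n), rng.normal(size=n)

beta, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)
r2_train = r_squared(y_train, add_intercept(X_train) @ beta)
r2_test = r_squared(y_test, add_intercept(X_test) @ beta)

print(f"in-sample R^2: {r2_train:.2f}, out-of-sample R^2: {r2_test:.2f}")
```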

Confounders / Omitted variable bias

  • Omitting an independent variable that is correlated with both the dependent variable and one or more of the included independent variables, which biases the coefficients on the included variables
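
For the two-regressor case, the size of the bias has a standard textbook form. The notation below is assumed for illustration and does not come from the slides:

```latex
\begin{aligned}
y   &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u   && \text{(true model)} \\
x_2 &= \delta_0 + \delta_1 x_1 + v               && \text{(how the omitted variable relates to } x_1\text{)} \\
\tilde{\beta}_1 &\;\to\; \beta_1 + \beta_2\,\delta_1 && \text{(coefficient on } x_1 \text{ when } x_2 \text{ is omitted)}
\end{aligned}
```

The bias term $\beta_2\,\delta_1$ vanishes only if the omitted variable is unrelated to the outcome or to the included regressor.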

Multicollinearity

  • Strongly correlated independent variables prevent isolation of partial effects
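
A small sketch of why this happens, on synthetic data with assumed coefficients: when two regressors are nearly identical, the standard errors on their individual coefficients grow sharply.

```python
import numpy as np

def coef_std_errors(X, y):
    """Classical OLS standard errors for the non-intercept coefficients."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (len(y) - Z.shape[1])
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    return np.sqrt(np.diag(cov))[1:]

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)
x2_collinear = x1 + 0.05 * rng.normal(size=n)  # almost a copy of x1

def simulate(x2):
    # Assumed model: both regressors have a true coefficient of 1.
    y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
    return coef_std_errors(np.column_stack([x1, x2]), y)

se_independent = simulate(x2_independent)
se_collinear = simulate(x2_collinear)

print("SE on x1, independent regressors:", round(se_independent[0], 3))
print("SE on x1, collinear regressors:  ", round(se_collinear[0], 3))
```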

P-hacking

  • Running every possible regression and reporting only those with statistically significant results
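
To see why this is a problem, the sketch below regresses a pure-noise outcome on 1,000 unrelated noise predictors, one at a time. The sample size and number of regressions are arbitrary assumptions; roughly 5% of the regressions should clear the conventional significance threshold by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_regressions = 200, 1000

y = rng.normal(size=n)                   # outcome: pure noise
X = rng.normal(size=(n, n_regressions))  # predictors: unrelated noise

# Pearson correlation of each predictor with y.
y_std = (y - y.mean()) / y.std()
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
r = X_std.T @ y_std / n

# t-statistic of the slope in each one-variable regression.
t_stats = r * np.sqrt((n - 2) / (1.0 - r ** 2))
share_significant = np.mean(np.abs(t_stats) > 1.96)

print(f"share of 'significant' regressions: {share_significant:.1%}")
```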

Correlation is not causation

  • Regression coefficients may not represent true cause-and-effect relationships

10 of 26

(Real) Solution

11 of 26

Regression Analysis*

(* with a twist)

12 of 26

Methodology

13 of 26

Double Lasso Variable Selection

Leverages two stages of lasso regression for principled covariate selection

Removes inappropriate covariates while retaining independent variables with high predictive power

Yields better (less biased) estimates of potentially causal relationships (though it does not prove causality)

14 of 26

Double Lasso Variable Selection

Traditional Lasso loss function:
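
In its usual textbook form, the lasso objective is written as follows. The notation is assumed here rather than taken from the slide, and the $1/(2n)$ scaling is one common convention among several:

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^2
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The penalty weight $\lambda \ge 0$ controls how many coefficients are shrunk exactly to zero, which is what makes the lasso usable for variable selection.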

[Figure: Lasso of Y on X; Lasso of T on X; OLS of Y on selected covariates; naive OLS regression]

  1. Fit a lasso regression of the dependent variable Y on the set of potential covariates X
  2. Fit a lasso regression of the “treatment” variable T on X
  3. Fit a linear regression of Y on T and the union of the covariates selected by either lasso

Each lasso regression's penalty (lambda) should be chosen using cross-validation

Coefficient on T will better reflect true relationship

Verifies relevance of covariates
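
The three steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the implementation behind the talk: the lasso solver is a hand-rolled coordinate descent, and the data-generating process, the lambda grid, and the helper names (`lasso_cd`, `cv_lambda`, `coef_on_T`) are all assumptions made for the example.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent.
    Minimizes (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]   # remove covariate j's contribution
            rho = X[:, j] @ resid / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * b[j]   # add the updated contribution back
    return b

def cv_lambda(X, y, lambdas, k=5):
    """Pick the penalty with the lowest mean squared error across k folds."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for lam in lambdas:
        mse = 0.0
        for hold in folds:
            train = np.setdiff1d(idx, hold)
            b = lasso_cd(X[train], y[train], lam)
            mse += np.mean((y[hold] - X[hold] @ b) ** 2)
        errors.append(mse / k)
    return lambdas[int(np.argmin(errors))]

def coef_on_T(Y, T, controls):
    """OLS of Y on an intercept, T, and the given controls; returns T's coefficient."""
    Z = np.column_stack([np.ones(len(Y)), T, controls])
    beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return beta[1]

# Synthetic data (assumed): X[:, 0] drives both T and Y, so it is a confounder.
rng = np.random.default_rng(0)
n, p = 500, 30
X = rng.normal(size=(n, p))
T = 0.8 * X[:, 0] + rng.normal(size=n)
Y = 1.0 * T + 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)  # true effect of T is 1.0

Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Yc, Tc = Y - Y.mean(), T - T.mean()
lambdas = [0.02, 0.05, 0.1, 0.2, 0.4]  # hypothetical grid

# Step 1: lasso of Y on X. Step 2: lasso of T on X.
sel_y = np.flatnonzero(lasso_cd(Xs, Yc, cv_lambda(Xs, Yc, lambdas)))
sel_t = np.flatnonzero(lasso_cd(Xs, Tc, cv_lambda(Xs, Tc, lambdas)))
selected = np.union1d(sel_y, sel_t)

# Step 3: OLS of Y on T plus the union of selected covariates, versus naive OLS.
naive = coef_on_T(Y, T, np.empty((n, 0)))
double_lasso = coef_on_T(Y, T, X[:, selected])

print(f"naive OLS: {naive:.2f}, double lasso: {double_lasso:.2f} (true effect: 1.00)")
```

In practice the hand-written solver and cross-validation loop would likely be replaced by a library routine such as scikit-learn's `LassoCV`; they are spelled out here only to keep the sketch self-contained.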

15 of 26

Results

16 of 26

17 of 26

18 of 26

19 of 26

20 of 26

21 of 26

22 of 26

23 of 26

24 of 26

25 of 26

We can do better

26 of 26

Thank You!

Full results at:

whoisnnamdi.com