1 of 25

Finding Expected + Unexpected Goals With R

Namita Nandakumar

@nnstats

2 of 25

What is hockey?

  • Violent soccer on ice.
  • 31 (soon to be 32!) teams in the NHL spend all year trying to be good enough at this that they win the Stanley Cup.

3 of 25

Why does hockey matter?

  • Some people think it’s fun?

4 of 25

Some ways to evaluate hockey teams + players:

  • Win %
  • Goals For
  • Goals Against
  • Shots For
  • Shots Against
  • Feelings

5 of 25

Shots have the nice statistical property of being among the most granular things we can observe, but people complain that not all shots are equally dangerous...

6 of 25

And they’re right! So, what other features can we account for?

7 of 25

Let’s take a look.

8 of 25

A closer look...

  • I randomly chose this shot from the 2019-20 NHL season.
  • gif via @NHLFlyers

9 of 25

What can the play-by-play data tell us?

  • It was a goal.
  • It was scored by Flyers forward Travis Konecny on Kings goalie Jack Campbell.
  • It was at even strength, 5v5.
  • It was right in front of the net, ~18 feet away.
  • It was a wrist shot.
  • It caused the Flyers to go up 2-0.
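Distance (and the shot angle discussed next) is derived from the shot's recorded x, y location. A minimal sketch of that geometry, assuming coordinates in feet with center ice at (0, 0) and the attacking goal line at x = 89; the helper name `shot_geometry` is hypothetical and not part of the play-by-play feed:

```r
# Hypothetical helper: distance and angle to the net from x/y coordinates.
# Assumes feet, center ice at (0, 0), attacking goal line at x = 89.
shot_geometry <- function(x, y, net_x = 89) {
  dx <- net_x - abs(x)  # abs() folds both ends of the rink onto one net
  data.frame(
    distance = sqrt(dx^2 + y^2),
    angle    = atan2(y, dx) * 180 / pi  # degrees; 0 = straight in front of the net
  )
}

# Three made-up shot locations, only to show the output shape.
shot_geometry(x = c(71, 80, 60), y = c(0, 9, -20))
```

A shot recorded at (71, 0) comes out as 18 feet from the net at an angle of 0, so a tracking error in x or y moves both features at once.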

10 of 25

What is the play-by-play data hiding?

  • The shot angle looks off, which is always plausible since it’s manually tracked (for now).
  • James van Riemsdyk set up Travis Konecny with a great, quick pass.
  • Konecny was wide open.
  • There were two players screening the net.

This is where I might’ve labeled the shot.

11 of 25

5v5 unblocked shots have scored ~6% of the time this season…but I think TK’s shot was better than that.

12 of 25

Couch to 5K xG

FACTS.

  • It’s worth contextualizing shots with all the extra info that we have by estimating the probability of scoring.
  • Everyone with a passable understanding of hockey and R can build an expected goals model right now.

OPINION.

  • This is a fun way to spend a weekend afternoon.

13 of 25

xG Researchers, Past + Present

  • Brian Macdonald (2012)
  • Dawson Sprigings, Asmae Toumi (2015)
  • Manny Perry (2016)
  • Peter Tanner (2016)
  • Matt Barlowe (2017)
  • Micah Blake McCurdy (2018)
  • Luke and Josh Younggren (2018)
  • and many more!
  • None of the ideas that follow are brand new or solely mine.
  • Except for the parts where I do nice ggplots and talk about my favorite players.

14 of 25

Data Source: NHL via @MoneyPuckdotcom

  • Peter Tanner scrapes NHL data every day and shares season-level .csvs of every unblocked shot.
  • We’re going to use 2017-present.

Some stuff he did for us that I did not want to do:

  • Label rush shots, rebounds, and more.
  • Adjust x, y coordinates for recording bias.
  • Clean it up so it’s ready to be #tidily analyzed.

15 of 25

This is all the code you need to get started!

16 of 25

Fun w/ Heat Maps

Unblocked Shots

Goals

ggplot(aes(arena_adjusted_x_cord_abs, arena_adjusted_y_cord)) +
  gg_rink() +
  stat_density_2d(aes(fill = ..level..), geom = 'polygon', alpha = 0.8) +
  scale_fill_viridis(option = 'D') ...

17 of 25

Lazy Logistic Regression xG

  • Estimate P(goal) based solely on the shot’s absolute distance to the net.
  • This will be our baseline model.

glm(goal ~ poly(arena_adjusted_shot_distance, 2),
    data = shots, family = 'binomial')
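To see what the fitted baseline gives back, here is a minimal sketch that runs the same formula on simulated shots and predicts P(goal) at a few distances. The data frame `shots_sim` and the coefficients used to simulate it are made up for illustration; only the `glm()` call mirrors the one above.

```r
set.seed(1)

# Simulated stand-in for the shot data: closer shots score more often.
n <- 5000
shots_sim <- data.frame(arena_adjusted_shot_distance = runif(n, 5, 70))
p_true <- plogis(-1 - 0.05 * shots_sim$arena_adjusted_shot_distance)
shots_sim$goal <- rbinom(n, 1, p_true)

# Same formula as the baseline model: quadratic in distance, logit link.
fit <- glm(goal ~ poly(arena_adjusted_shot_distance, 2),
           data = shots_sim, family = 'binomial')

# Predicted P(goal) for shots from 10, 18, and 40 feet.
new_shots <- data.frame(arena_adjusted_shot_distance = c(10, 18, 40))
x_goal <- predict(fit, newdata = new_shots, type = 'response')
round(x_goal, 3)
```

`type = 'response'` matters here: without it, `predict()` returns log-odds rather than probabilities.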

18 of 25

Slightly Less Lazy xgboost xG

  • In addition to shot distance, we can incorporate shot angle, shot type, and time/distance from last event.
  • Use xgboost so as to not assume linearity.
  • Do some 5-fold CV to choose the # of boosting rounds.
  • ???
  • Profit.
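The steps above can be sketched roughly as follows. This is a sketch on simulated data, not the model from the talk: every column name, the simulated effect sizes, and the hyperparameters (`max_depth`, `eta`, 100 rounds) are assumptions chosen only to make the example run.

```r
library(xgboost)
set.seed(1)

# Simulated stand-in for the shot data (all columns hypothetical).
n <- 3000
shots_sim <- data.frame(
  distance  = runif(n, 5, 70),
  angle     = runif(n, -80, 80),
  shot_type = factor(sample(c('wrist', 'slap', 'snap', 'backhand'), n, replace = TRUE)),
  time_since_last_event = rexp(n, 1 / 10)
)
p_true <- plogis(-1 - 0.05 * shots_sim$distance - 0.01 * abs(shots_sim$angle))
shots_sim$goal <- rbinom(n, 1, p_true)

# xgboost wants a numeric matrix, so one-hot encode shot type.
X <- model.matrix(goal ~ . - 1, data = shots_sim)
dtrain <- xgb.DMatrix(data = X, label = shots_sim$goal)

# 5-fold CV over boosting rounds, scored by log loss.
cv <- xgb.cv(
  params = list(objective = 'binary:logistic', eval_metric = 'logloss',
                max_depth = 3, eta = 0.1),
  data = dtrain, nrounds = 100, nfold = 5, verbose = 0
)

# Keep the round count with the lowest held-out log loss.
best_rounds <- which.min(cv$evaluation_log$test_logloss_mean)
best_rounds
```

Picking the round count from `evaluation_log` directly keeps the choice visible; the final model would then be refit on all shots with `nrounds = best_rounds`.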

Google image results for “machine learning”

19 of 25

Which variables am I not including?

Score State.

  • I can’t think of a reason why scoreline would be important after controlling for everything else we already know.
  • @EvolvingWild and others have confirmed this.

Shooters + Goalies.

  • For simplicity, I’ll evaluate shooters and goalies by their residuals, but one could incorporate them into the model.

20 of 25

Some unsurprising xgboost variable importance #s.

21 of 25

I love using geom_smooth() to glance at calibration.
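A minimal sketch of that kind of calibration glance, assuming a data frame with one row per shot, a predicted probability, and a 0/1 outcome. The data here is simulated to be calibrated by construction, and the names `calib`, `x_goal`, and `goal` are placeholders:

```r
library(ggplot2)
set.seed(1)

# Simulated predictions and outcomes; calibrated by construction here.
n <- 2000
calib <- data.frame(x_goal = rbeta(n, 1, 12))
calib$goal <- rbinom(n, 1, calib$x_goal)

# Smooth the 0/1 outcome against predicted probability.
# A well-calibrated model tracks the dashed y = x line.
p <- ggplot(calib, aes(x_goal, goal)) +
  geom_smooth() +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  labs(x = 'Predicted P(goal)', y = 'Observed goal rate')
p
```

Where the smooth sits below the dashed line, the model is overconfident about those shots; above it, underconfident.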

22 of 25

More Model Comparisons

xG Model                  Log Loss   AUC     TK Goal Probability
All Shots Are The Same    0.220      0.500   5.7%
Shot Distance Logistic    0.202      0.728   9.9%
xgboost (CV results)      0.193      0.769   11.8%
MoneyPuck                 0.192      0.771   14.6%
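For reference, both metrics can be computed in a few lines of base R. This is a sketch with hand-rolled helpers (`log_loss` and `auc` are names chosen here, not functions from a package) run on simulated outcomes, so the numbers it prints are not the ones in the comparison above.

```r
# Log loss: mean negative log-likelihood of the predicted probabilities.
log_loss <- function(goal, p) {
  p <- pmin(pmax(p, 1e-15), 1 - 1e-15)  # guard against log(0)
  -mean(goal * log(p) + (1 - goal) * log(1 - p))
}

# AUC via the rank (Mann-Whitney) formulation; ties get average ranks.
auc <- function(goal, p) {
  r <- rank(p)
  n_pos <- sum(goal == 1)
  n_neg <- sum(goal == 0)
  (sum(r[goal == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)
# Simulated outcomes, only to exercise the functions.
p_true <- rbeta(5000, 1, 12)
goal   <- rbinom(5000, 1, p_true)

# "All shots are the same": every shot gets the overall goal rate.
p_flat <- rep(mean(goal), length(goal))
c(log_loss = log_loss(goal, p_flat), auc = auc(goal, p_flat))
c(log_loss = log_loss(goal, p_true), auc = auc(goal, p_true))
```

A constant prediction cannot rank any goal above any non-goal, which is why the "All Shots Are The Same" model sits at an AUC of exactly 0.500.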

23 of 25

Unexpected Goals: Shooters

shots %>%
  group_by(shooter_id, shooter_name) %>%
  summarize(exp_shooting_pct = mean(x_goal_xgboost),
            actual_shooting_pct = mean(goal),
            diff = actual_shooting_pct - exp_shooting_pct,
            shot_count = n()) %>%
  filter(shot_count > 300) %>%
  arrange(-diff) %>%
  View()

24 of 25

Thank you!!

+ thanks to @MoneyPuckdotcom, @mannyelk, @IneffectiveMath, @EvolvingWild, and all the other researchers I’ve read + cited.

You can find me @nnstats.

25 of 25

Appendix: Goalies

  • There’s much more spread in actual save % than expected save % once you face 500+ unblocked shots.
  • Labeled goalies deviate > 1.5% from expectation.
  • All goalies are the same, except the bad ones?
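The goalie version of the shooter residuals can be sketched the same way. Everything below is an assumption for illustration: the data is simulated, the goalie names and the `x_goal` column are placeholders, and the 1.5% cutoff is read as 0.015 on the proportion scale.

```r
library(dplyr)
set.seed(1)

# Simulated shots against three hypothetical goalies; x_goal stands in for model output.
shots_sim <- tibble(
  goalie_name = sample(c('Goalie A', 'Goalie B', 'Goalie C'), 3000, replace = TRUE),
  x_goal = rbeta(3000, 1, 12)
) %>%
  # Goalie C is built to allow more goals than expected.
  mutate(goal = rbinom(n(), 1, pmin(x_goal * ifelse(goalie_name == 'Goalie C', 1.5, 1), 1)))

goalies <- shots_sim %>%
  group_by(goalie_name) %>%
  summarize(exp_save_pct = 1 - mean(x_goal),
            actual_save_pct = 1 - mean(goal),
            diff = actual_save_pct - exp_save_pct,
            shots_faced = n()) %>%
  filter(shots_faced >= 500) %>%               # same 500+ shot threshold as above
  mutate(labeled = abs(diff) > 0.015) %>%      # flag large deviations from expectation
  arrange(diff)
goalies
```

Sorting by `diff` ascending puts the goalies furthest below expectation at the top of the output.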