Using R for Reproducible Analysis
July 27, 2023
UNCLASSIFIED - NON CLASSIFIÉ
Who are you and where did you come from?
Before this…
Every day R user since 2017
What I’m Going to Tell You About
Please interrupt me or raise your hand if:
I can’t help you with your pay problem!
The Main Takeaway
R is accessible to you
It doesn’t take a wizard or a statistician with an advanced degree to use it
It is available to you at no charge
It can improve your data workflow
It can help you do things you can’t do with the tools you already have
… and might be the right choice for your use case
You can do it! Don’t be scared of it!
What is R?
Available to you at
Why do we use R at PAB?
Reproducible Analysis
Data
Results
Script that defines your analysis
Don’t sweat it, this is fake data!
You can try to put the analysis back together based on bread crumbs,
but you’re still going to have a hard time answering “why?”
Reproducible Analysis
Line | What’s happening? |
1-3 | Make a list of “enquiry” type transaction numbers |
5 | Choose the data set |
6 | A case is on time if it was processed before its due date |
7-8 | Filter the dataset to keep cases where |
9 | Count the number of cases each period that were and weren’t processed on time. |
What’s the definition of “on time”?
Do we count invalid cases in this metric?
Do we count cancelled cases in this metric?
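The script on the slide is not reproduced in this text, so here is a minimal sketch of the same steps, assuming the dplyr package. The data, the column names (`pay_period`, `transaction`, `processed_date`, `due_date`), the transaction numbers, and the filter rule (drop enquiry-type cases) are all made up for illustration; the actual filter condition is not given above.

```r
library(dplyr)

# Hypothetical transaction numbers that count as "enquiry" types
enquiry_types <- c(101, 102, 103)

# Fake data: one row per case (all names and values are assumptions)
cases <- data.frame(
  pay_period     = c("2021-01", "2021-01", "2021-01", "2021-02", "2021-02"),
  transaction    = c(101, 250, 260, 101, 250),
  processed_date = as.Date(c("2021-01-05", "2021-01-20", "2021-01-08",
                             "2021-01-25", "2021-02-10")),
  due_date       = as.Date(c("2021-01-10", "2021-01-15", "2021-01-12",
                             "2021-01-30", "2021-02-01"))
)

on_time_counts <- cases %>%
  # A case is on time if it was processed before its due date
  mutate(on_time = processed_date < due_date) %>%
  # Assumed filter: keep only cases that are not enquiry types
  filter(!(transaction %in% enquiry_types)) %>%
  # Count on-time and late cases in each pay period
  count(pay_period, on_time)

print(on_time_counts)
```

Because the definition of "on time" and the filter are written down in the script, the three questions above can be answered by reading it.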
R is good for tight turnarounds
A reproducible analysis means… you can do it again! Quickly!
Data
Results
Script that defines your analysis
More Data
More Results
Different Data
Different Results
A well-written R script is evergreen
The initial cost of the analysis can be higher
The marginal cost of the analysis is much lower
An illustrative hypothetical
Imagine, for a moment, that it’s October 2021…
Dune, directed by French-Canadian Denis Villeneuve, is topping the box office charts
Atlanta has just defeated the Houston Astros in the World Series
Your boss just tasked you with a forecast of the number of pay cases that will be opened in the next quarter.
🔮
Let’s look at the data
(… many more rows of data …)
Visualize data
Line Number | What’s happening? |
1 | Choose the data set |
2 | Decide we want to make a plot (ggplot2 package) |
3-4 | Set variables that map to the x and y axes |
5-6 | Draw points (dots) on the graph |
7 | Set the range (limits) of the y-axis, from 0 to whatever |
8-10 | Add labels to graph and axes |
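The plotting script itself is not reproduced in this text. A minimal sketch of the steps in the table, assuming the ggplot2 package, could look like this. The data frame `case_data`, its columns, and the labels are invented for illustration, and the line layout will not match the line numbers in the table.

```r
library(ggplot2)

# Fake data standing in for the real case counts (names and values are assumptions)
case_data <- data.frame(
  pay_period   = seq(as.Date("2021-01-06"), by = "14 days", length.out = 20),
  cases_opened = round(50000 + 2000 * sin(seq_len(20) / 3))
)

p <- ggplot(
  case_data,                                # choose the data set
  aes(x = pay_period, y = cases_opened)     # map variables to the x and y axes
) +
  geom_point() +                            # draw points (dots)
  scale_y_continuous(limits = c(0, NA)) +   # y-axis from 0 to whatever the data needs
  labs(
    title = "Pay cases opened per pay period (fake data)",
    x = "Pay period",
    y = "Cases opened"
  )

print(p)
```

Setting the upper limit to `NA` lets ggplot2 pick it from the data, which is one way to get "0 to whatever".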
Visualize data
Line Number | What’s happening? |
1 | Choose the data set |
2 | Decide we want to make a plot |
3-4 | Set variables that map to the x and y axes |
6 | Draw points (dots) on the graph |
7 | Set the range (limits) of the y-axis, from 0 to whatever |
8-10 | Add labels to graph and axes |
11 | Draw a smoothed line |
A short-term forecast
How could we go about predicting the number of cases opened in each of the six pay periods in the next quarter?
“The best predictor of future behavior is relevant past behavior” - Anonymous
Dr. Phil
We could use the mean of the past year as our prediction for the future.
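As a rough sketch of this idea in base R: take the mean of the last year of counts and repeat it for the six future pay periods. The numbers are fake, and the 26 periods per year assume biweekly pay periods.

```r
# Fake counts for the last 26 pay periods (assumed biweekly, so one year)
set.seed(1)
past_year <- round(rnorm(26, mean = 50000, sd = 2000))

# The same value is used as the prediction for each of the next six pay periods
mean_forecast <- rep(mean(past_year), times = 6)
print(mean_forecast)
```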
We could use the last value of the LOESS line as our future prediction.
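A minimal sketch of this approach with base R's `loess()`, again on fake data with hypothetical column names: fit the smoother, take its fitted value at the most recent pay period, and carry that value forward.

```r
# Fake data: 52 pay periods with a gentle upward drift (values are invented)
set.seed(3)
case_data <- data.frame(period_index = 1:52)
case_data$cases_opened <- 50000 + 40 * case_data$period_index +
  rnorm(52, sd = 1500)

loess_fit <- loess(cases_opened ~ period_index, data = case_data)

# Fitted (smoothed) values at the observed pay periods
smoothed <- unname(predict(loess_fit))

# Carry the last smoothed value forward for the next six pay periods
loess_forecast <- rep(smoothed[length(smoothed)], times = 6)
print(loess_forecast)
```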
We can’t simply extrapolate the trend from our LOESS line: outside the range of our data, the LOESS fit is poorly defined and error-prone!
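You can see this for yourself in base R. In this sketch (fake data, hypothetical names), asking a default `loess()` fit for predictions past the last observed pay period returns missing values rather than a trend.

```r
# Fake data: 52 observed pay periods (values are invented)
set.seed(3)
case_data <- data.frame(period_index = 1:52)
case_data$cases_opened <- 50000 + 40 * case_data$period_index +
  rnorm(52, sd = 1500)

loess_fit <- loess(cases_opened ~ period_index, data = case_data)

# Periods 53 to 58 are outside the range of the data used for the fit
future <- data.frame(period_index = 53:58)
beyond_range <- predict(loess_fit, newdata = future)
print(beyond_range)  # with the default settings these come back as NA
```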
“The best predictor of future behavior is relevant past behavior” - Anonymous
Dr. Phil
Re-visualize data
Line Number | What’s happening? |
1 | Choose the data set |
2 | Keep only the data from 2019 and onwards |
3 | Decide we want to make a plot |
4-6 | Set variables that map to the x, y, and colour dimensions |
8 | Draw dots (points) on the graph |
9 | Connect the points with a line |
10 | Set the range (limits) of the y-axis, from 0 to whatever |
11-13 | Add labels to graph and axes |
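The script for this plot is not reproduced here either. One possible sketch, assuming dplyr and ggplot2, puts the position of the pay period within the year on the x-axis and uses colour for the year, so that the same time of year lines up across years. The data, the column names, and that choice of axes are assumptions.

```r
library(dplyr)
library(ggplot2)

# Fake data: 26 pay periods per year, 2018 to 2021 (all values invented)
set.seed(7)
case_data <- expand.grid(period_in_year = 1:26, year = 2018:2021)
case_data$cases_opened <- round(
  50000 + 3000 * sin(2 * pi * case_data$period_in_year / 26) +
    rnorm(nrow(case_data), sd = 800)
)

p <- case_data %>%
  filter(year >= 2019) %>%                   # keep 2019 and onwards
  ggplot(aes(
    x = period_in_year,
    y = cases_opened,
    colour = factor(year)                    # one colour per year
  )) +
  geom_point() +                             # draw dots
  geom_line() +                              # connect the dots within each year
  scale_y_continuous(limits = c(0, NA)) +    # y-axis from 0 to whatever
  labs(
    title = "Cases opened by pay period and year (fake data)",
    x = "Pay period within the year",
    y = "Cases opened",
    colour = "Year"
  )

print(p)
```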
We need more robust statistical methods!
Hey! Stop!
You should be checking the series for autocorrelation!
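A quick way to do that check in base R is `acf()`. This sketch uses a fake series with a yearly cycle, assuming 26 biweekly pay periods per year, and looks at the correlation between each value and the value one year earlier.

```r
# Fake series: four years of pay periods with a yearly cycle (values invented)
set.seed(11)
periods_per_year <- 26   # assumption: biweekly pay periods
t_index <- 1:104
cases <- 50000 + 3000 * sin(2 * pi * t_index / periods_per_year) +
  rnorm(104, sd = 600)

# Autocorrelation at lags 0 to 30; plot = FALSE returns the numbers only
case_acf <- acf(cases, lag.max = 30, plot = FALSE)

# Element 1 is lag 0, so lag 26 (one year back) is element 27
lag_one_year <- case_acf$acf[27]
print(lag_one_year)
```

A clearly positive value at the one-year lag suggests that what happened in the same period a year ago carries information about today.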
Time Series Forecasting
what happened last week and what happened in the same week one year ago
R is the lingua franca of statisticians
Let’s try three different models…
Line Number | What’s happening? |
1-4 | The data in “case_data” is a time series, and the column “pay period” holds the information about when things happened |
6 | Save the results in an object called “ts_models” |
7 | Let’s create some models, and I’m going to tell you how they’re defined |
8-12 | The first model is an “autoregressive model”, with a seasonality of 52 weeks |
13-17 | The second model is a “moving average” model, with a seasonality of 52 weeks |
18-21 | The third model is a “time series linear model”, with a seasonality of 52 weeks |
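The modelling script on the slide is not reproduced in this text, and the sketch below is not that script. It swaps in base R's `arima()` and `lm()` to fit three models of the same flavour on fake data, assuming 26 biweekly pay periods make up the 52-week season. The model orders are guesses chosen for illustration.

```r
# Fake series: 104 pay periods (four years) with a yearly cycle (values invented)
set.seed(42)
periods_per_year <- 26   # assumption: biweekly pay periods, so 52 weeks = 26 periods
t_index <- seq_len(104)
cases <- 50000 + 20 * t_index +
  3000 * sin(2 * pi * t_index / periods_per_year) +
  rnorm(104, sd = 1000)
case_ts <- ts(cases, frequency = periods_per_year)

# Model 1: autoregressive, with a seasonal autoregressive term
ar_fit <- arima(case_ts, order = c(1, 0, 0),
                seasonal = list(order = c(1, 0, 0)), method = "ML")

# Model 2: moving average, with a seasonal moving-average term
ma_fit <- arima(case_ts, order = c(0, 0, 1),
                seasonal = list(order = c(0, 0, 1)), method = "ML")

# Model 3: linear model on a time trend plus a dummy for each position in the year
lm_data <- data.frame(cases = cases, trend = t_index,
                      season = factor(cycle(case_ts)))
tslm_fit <- lm(cases ~ trend + season, data = lm_data)

# Collect the information criteria in one table
fit_summary <- data.frame(
  model = c("AR_fit", "MA_fit", "TSLM_fit"),
  AIC   = c(AIC(ar_fit), AIC(ma_fit), AIC(tslm_fit)),
  BIC   = c(BIC(ar_fit), BIC(ma_fit), BIC(tslm_fit))
)
print(fit_summary)
```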
How well do these models fit our past data?
Model | Variance | Akaike information criterion | Bayesian information criterion |
AR_fit | 28200519 | 1876.9 | 1922.5 |
MA_fit | 27628477 | 1623.8 | 1659.2 |
TSLM_fit | 32923603 | 1873.4 | 1914.0 |
Lower is better
What does the forecast say?
Line Number | What’s happening? |
1 | Hey, remember those models we stored at “ts_models”? |
2 | Forecast ahead six points, please! |
3 | Then plot it alongside the historical data we saved earlier |
4 | And then split it up into three different “facets”, one for each model |
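Here is a small base R sketch of the "forecast ahead six points" step. It is not the script from the slide: it fits a single moving-average model with `arima()` on fake data, then uses `predict()` with `n.ahead = 6` and draws the forecast next to the history. Names, values, and model orders are assumptions.

```r
# Fake series: 104 pay periods with a yearly cycle (values invented)
set.seed(42)
periods_per_year <- 26   # assumption: biweekly pay periods
t_index <- seq_len(104)
cases <- 50000 + 20 * t_index +
  3000 * sin(2 * pi * t_index / periods_per_year) +
  rnorm(104, sd = 1000)
case_ts <- ts(cases, frequency = periods_per_year)

# One of the candidate models: moving average with a seasonal term
ma_fit <- arima(case_ts, order = c(0, 0, 1),
                seasonal = list(order = c(0, 0, 1)), method = "ML")

# Forecast the next six pay periods; $pred holds the point forecasts,
# $se the standard errors
ma_forecast <- predict(ma_fit, n.ahead = 6)
print(ma_forecast$pred)

# History (solid) and forecast (dashed) on the same plot
ts.plot(case_ts, ma_forecast$pred, gpars = list(lty = c(1, 2)))
```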
How did we do?
Model | Root Mean Square Error | Mean Absolute Error | Mean Absolute Percent Error |
AR_fit | 2283 | 1640 | 3.1% |
MA_fit | 1590 | 1092 | 2.1% |
TSLM_fit | 2682 | 1760 | 3.2% |
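The three error measures in the table are simple to compute by hand. This base R sketch uses three made-up actual values and forecasts, not the numbers behind the table.

```r
# Made-up actual values and forecasts
actual    <- c(100, 200, 400)
predicted <- c(110, 190, 400)

errors <- actual - predicted

rmse <- sqrt(mean(errors^2))              # root mean square error
mae  <- mean(abs(errors))                 # mean absolute error
mape <- mean(abs(errors / actual)) * 100  # mean absolute percent error

print(c(RMSE = rmse, MAE = mae, MAPE = mape))
```

RMSE penalizes large misses more heavily than MAE, and MAPE expresses the miss relative to the size of the actual value.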
To sum up…
If you’re interested, give it a shot.
Resources
R for Data Science, the Classic Introductory Text: https://r4ds.had.co.nz/
R in the Government of Canada: https://open-canada.github.io/r4gc/
Julia Silge, an R educator: https://juliasilge.com/
Streams of analyses: https://www.youtube.com/juliasilge
R Ladies Global: https://rladies.org/