1 of 33

Using R for Reproducible Analysis

July 27, 2023

UNCLASSIFIED - NON CLASSIFIÉ

2 of 33

Who are you and where did you come from?

  • Colin Douglas
    • Public Services and Procurement Canada (PSPC)
      • Pay Administration Branch (PAB)
        • Policy, Analytics and Communications Sector (PAC)
          • Data and Reporting Directorate (D&R)
            • Forecasting and Data Science Team (FADS)

Before this…

  • St. Francis Xavier University in Antigonish, NS
    • Analytical chemistry – how to use very faint signals to identify bacteria
  • University of Toronto in Toronto, ON
    • Biological chemistry – detecting how very small amounts of protein interact with each other
  • Dalhousie University in Halifax, NS
    • Faculty of Medicine – how fast certain enzymes work, and why they go so fast
  • A medical device company in Dartmouth, NS
    • Coagulation assays – proving that they are safe and effective to regulatory bodies


Daily R user since 2017


3 of 33

What I’m Going to Tell You About

  • Introduce you to the R programming language
  • Tell you why R works well for our team
  • Walk you through a demonstration analysis that shows why R might be suitable for your team
    • Reproducible
    • Quick prototyping and exploratory data analysis
    • Access to niche methods or routines


Please interrupt me or raise your hand if:

    • I am talking too fast
    • You don’t know what I’m talking about
      • Other people are probably confused too!
    • You have a question about the R programming language

I can’t help you with your pay problem!


4 of 33

The Main Takeaway


R is accessible to you

It doesn’t take a wizard or a statistician with an advanced degree to use it

It is available to you at no charge

It can improve your data workflow

It can help you do things you can’t do with the tools you already have

… and might be the right choice for your use case

You can do it! Don’t be scared of it!


5 of 33

What is R?

  • R is a programming language for statistical computing

  • It fills a niche – a language that can quickly compute against long lists of numbers (i.e., vectors)
    • Key for performant statistical calculations

  • R is a popular language and has consistently ranked in the top 20 programming languages worldwide, even when compared against general-purpose languages (like C and Java)

  • It is very popular in the fields of
    • statistics (especially for applied statistics)
    • bioinformatics (e.g., working with gene sequences)
    • analytics
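
To make "computing against vectors" concrete, here is a minimal sketch in base R. The case counts are made-up numbers used only for illustration.

```r
# Hypothetical counts of cases opened in five pay periods (made-up numbers)
cases_opened <- c(48200, 51050, 49875, 50310, 52140)

# Arithmetic applies to every element at once; no loop is needed
cases_in_thousands <- cases_opened / 1000

# Summary statistics are single function calls on the whole vector
average_cases <- mean(cases_opened)
period_change <- diff(cases_opened)  # change from one period to the next

print(cases_in_thousands)
print(average_cases)
print(period_change)
```

Each operation reads like a sentence about the whole list of numbers, which is a large part of why R suits statistical work.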


  • R is free and open source software!
    • R is free as in “free beer” (gratis)
      • It is available to you at no cost – no licensing fees, no acquisition costs
    • R is free as in “free speech” (libre)
      • It is licensed under the GNU General Public License (GPL), which grants you permission to run, study, share, and modify the software

Available to you at

https://cran.r-project.org/


6 of 33

Why do we use R at PAB?

  • In the Pay Administration Branch (PAB), we spend a lot of time reporting on the number of cases in the pay cases queue

  • Policy makers, communications professionals, and management within PAB frequently need to know how many cases are being created, how many cases are being processed, and what kinds of cases they are

  • The data set is very large!
    • There are more than 10 million cases in the Case Management Tool
    • Each pay period, approximately 50,000 cases are created and processed – it’s a lot of data that needs to be summarized and distilled down to essential trends and speaking points

  • The reporting cycle is very tight – most reports are created biweekly, but some reports are daily!


7 of 33

Reproducible Analysis

  • Using R for analysis is less about doing calculations and more about writing the instructions for analysis

  • Separating the data from the analysis is one of the key features of the workflow

  • It also means that other folks on your team can understand all of the steps you took when you did the analysis

[Diagram: Data → Script that defines your analysis → Results]


8 of 33

Reproducible Analysis

  • You’ve been tasked with calculating the service standard rate for the last pay period
  • They send you the old spreadsheet that your predecessor used to do the analysis


Don’t sweat it, this is fake data!


10 of 33

Reproducible Analysis

  • You’ve been tasked with calculating the service standard rate for the last pay period
  • They send you the old spreadsheet that your predecessor used to do the analysis


You can try to put the analysis back together based on breadcrumbs, but you’re still going to have a hard time answering “why?”

Don’t sweat it, this is fake data!


11 of 33

Reproducible Analysis

  • Using R for analysis is less about doing calculations and more about writing the instructions for analysis

  • Separating the data from the analysis is one of the key features of the workflow

  • It also means that other folks on your team can understand all of the steps you took when you did the analysis

Line    What’s happening?
1-3     Make a list of “enquiry” type transaction numbers
5       Choose the data set
6       A case is on time if it was processed before its due date
7-8     Filter the data set to keep cases where State is not “Invalid” and Transaction type is not in the list of enquiry transactions
9       Count the number of cases each period that were and weren’t processed on time

What’s the definition of “on time”?

Do we count invalid cases in this metric?

Do we count cancelled cases in this metric?
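
A script answers those three questions in plain sight. Below is a minimal sketch of the steps in the table, written in base R so it runs without installing any packages (the script on the slide may well have used different functions). The column names, transaction numbers, states, and dates are all hypothetical fake data, and the rate calculation at the end is an assumed definition.

```r
# Transaction numbers treated as "enquiry" types (hypothetical values)
enquiry_types <- c(101, 102, 103)

# Fake case data, standing in for the real data set
cases <- data.frame(
  pay_period       = c("2023-14", "2023-14", "2023-14",
                       "2023-15", "2023-15", "2023-15"),
  state            = c("Resolved", "Invalid", "Resolved",
                       "Resolved", "Cancelled", "Resolved"),
  transaction_type = c(200, 200, 101, 310, 200, 200),
  due_date         = as.Date(c("2023-07-10", "2023-07-10", "2023-07-11",
                               "2023-07-24", "2023-07-25", "2023-07-26")),
  processed_date   = as.Date(c("2023-07-07", "2023-07-12", "2023-07-08",
                               "2023-07-28", "2023-07-20", "2023-07-25"))
)

# "On time" here means processed on or before the due date.
# Whether the due date itself counts is an assumption of this sketch.
cases$on_time <- cases$processed_date <= cases$due_date

# Invalid cases and enquiry transactions are dropped; cancelled cases are kept
kept <- subset(cases,
               state != "Invalid" & !(transaction_type %in% enquiry_types))

# Count on-time and late cases in each pay period
on_time_counts <- table(pay_period = kept$pay_period, on_time = kept$on_time)
print(on_time_counts)

# Assumed definition of the rate: share of counted cases that were on time
service_standard_rate <- tapply(kept$on_time, kept$pay_period, mean)
print(service_standard_rate)
```

Anyone reading this can see the definition of "on time", that invalid cases are excluded, and that cancelled cases are counted. If one of those choices is wrong, it is a one-line change.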


12 of 33

R is good for tight turnarounds


A reproducible analysis means… you can do it again! Quickly!

[Diagram: the same script that defines your analysis is reused each time]
Data → Results
More Data → More Results
Different Data → Different Results

A well-written R script is evergreen

The initial cost of the analysis can be higher

The marginal cost of the analysis is much lower
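
One way to picture the low marginal cost: write the analysis once as a function, then hand it whatever data arrives. This is a minimal base R sketch; the function name `summarise_cases` and the column names are hypothetical, and the data are fake.

```r
# The analysis is written once, as a function of the data it receives
summarise_cases <- function(cases) {
  # Count the number of cases in each pay period
  counts <- aggregate(case_id ~ pay_period, data = cases, FUN = length)
  names(counts)[2] <- "n_cases"
  counts
}

# Fake data for one reporting cycle
first_batch <- data.frame(case_id    = 1:5,
                          pay_period = c("A", "A", "B", "B", "B"))

# More data arrives later
second_batch <- data.frame(case_id    = 6:9,
                           pay_period = c("C", "C", "C", "D"))

# Same instructions, different inputs
results      <- summarise_cases(first_batch)
more_results <- summarise_cases(rbind(first_batch, second_batch))

print(results)
print(more_results)
```

The second run costs one line, which is the sense in which the script is evergreen.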


13 of 33

An illustrative hypothetical


Imagine, for a moment, that it’s October 2021…

Dune, directed by French-Canadian Denis Villeneuve, is topping the box office charts

Atlanta has just defeated the Houston Astros in the World Series

Your boss just tasked you with a forecast of the number of pay cases that will be opened in the next quarter.

🔮


14 of 33

Let’s look at the data

  • You’re given (yet another) Excel file with a bunch of time series data in it
    • A “time series” is a sequence of observations with time or date stamps
    • Typically, the time between the data points is equal (but not always!)

  • R is built to easily read data in many formats
    • Here, we’re loading data in from Excel in a single line
    • But there are also tools that let you easily pull data from SAS or SPSS files, or even query databases directly
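
Here is a minimal sketch of the "single line" idea. Because no real Excel file ships with these slides, the sketch first writes a small stand-in CSV file (fake data) so that it runs anywhere. The file names in the closing comments are hypothetical.

```r
# Create a small stand-in file so the sketch runs anywhere (fake data)
path <- tempfile(fileext = ".csv")
writeLines(c("pay_period,cases_opened",
             "2021-09-08,50120",
             "2021-09-22,49870",
             "2021-10-06,51230"),
           path)

# Loading the data is a single line
case_data <- read.csv(path)

# Convert the date column from text to Date so it sorts and plots correctly
case_data$pay_period <- as.Date(case_data$pay_period)
str(case_data)

# With the relevant packages installed, other formats are also one line each:
#   readxl::read_excel("cases.xlsx")    # Excel workbook (hypothetical file)
#   haven::read_sas("cases.sas7bdat")   # SAS data set   (hypothetical file)
#   haven::read_sav("cases.sav")        # SPSS file      (hypothetical file)
```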


(… many more rows of data …)


15 of 33

Visualize data

  • R gives you access to very powerful plotting and graphing functions

  • The graph you’re seeing is completely described by the data set and the ten lines of code shown above it.

Line Number    What’s happening?
1              Choose the data set
2              Decide we want to make a plot (ggplot2 package)
3-4            Set variables that map to the x and y axes
5-6            Draw points (dots) on the graph
7              Set the range (limits) of the y-axis, from 0 to whatever
8-10           Add labels to graph and axes


16 of 33

Visualize data

  • R gives you access to very powerful plotting and graphing functions

  • We can look for trends in our data using a number of functions built into the plotting library

Line Number    What’s happening?
1              Choose the data set
2              Decide we want to make a plot
3-4            Set variables that map to the x and y axes
6              Draw points (dots) on the graph
7              Set the range (limits) of the y-axis, from 0 to whatever
8-10           Add labels to graph and axes
11             Draw a smoothed line
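
A plot like the one described in the table could be sketched as follows. This assumes the ggplot2 package is installed. The data are simulated, the column names `pay_period` and `cases_opened` are hypothetical, and the line numbers in the table refer to the code on the slide, not to this sketch.

```r
library(ggplot2)

# Fake time series: one observation every 14 days (made-up numbers)
set.seed(2021)
case_data <- data.frame(
  pay_period   = seq(as.Date("2019-01-02"), by = "14 days", length.out = 72),
  cases_opened = round(50000 + 4000 * sin(seq_len(72) / 4) +
                         rnorm(72, sd = 1500))
)

case_plot <- ggplot(case_data,                      # choose the data set
                    aes(x = pay_period,             # variable for the x-axis
                        y = cases_opened)) +        # variable for the y-axis
  geom_point() +                                    # draw points
  geom_smooth(method = "loess", formula = y ~ x) +  # draw a smoothed line
  scale_y_continuous(limits = c(0, NA)) +           # y-axis starts at 0
  labs(title = "Cases opened per pay period (fake data)",
       x = "Pay period",
       y = "Cases opened")

print(case_plot)
```

Each `+` adds one layer or setting, so trying a different idea usually means changing or adding a single line.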


17 of 33

A short term forecast


How could we go about predicting the number of cases opened in each of the six pay periods in the next quarter?

“The best predictor of future behavior is relevant past behavior” - Anonymous


18 of 33

A short term forecast


How could we go about predicting the number of cases opened in each of the six pay periods in the next quarter?

“The best predictor of future behavior is relevant past behavior” - Anonymous

Dr. Phil


19 of 33

A short term forecast


How could we go about predicting the number of cases opened in each of the six pay periods in the next quarter?

We could use the mean of the past year as our prediction for the future.


20 of 33

A short term forecast


How could we go about predicting the number of cases opened in each of the six pay periods in the next quarter?

We could use the last value of the LOESS line as our future prediction.


21 of 33

A short term forecast


How could we go about predicting the number of cases opened in each of the six pay periods in the next quarter?

We couldn’t extrapolate the trend from our LOESS line – outside of the range of our data, the LOESS function is poorly defined and error-prone!
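
The simple options from the last few slides can be sketched in base R. The series is simulated, and treating a year as 26 biweekly pay periods is an assumption of this sketch.

```r
# Fake series of cases opened per pay period (made-up numbers)
set.seed(42)
n_periods    <- 78
period_index <- seq_len(n_periods)
cases_opened <- 50000 + 30 * period_index + rnorm(n_periods, sd = 1500)

# Option 1: the mean of the past year (assumed to be 26 pay periods)
mean_forecast <- mean(tail(cases_opened, 26))

# Option 2: the last fitted value of a LOESS smooth
loess_fit      <- loess(cases_opened ~ period_index)
loess_forecast <- as.numeric(tail(fitted(loess_fit), 1))

# Either single number is repeated for each of the six future pay periods
forecast_table <- data.frame(
  period_index   = n_periods + 1:6,
  mean_forecast  = mean_forecast,
  loess_forecast = loess_forecast
)
print(forecast_table)

# Asking LOESS for a value beyond the data: with its default settings,
# loess() does not extrapolate and predict() returns NA
beyond_data <- predict(loess_fit,
                       newdata = data.frame(period_index = n_periods + 1))
print(beyond_data)
```

R's own default here is to refuse to answer rather than guess, which matches the warning above about extrapolating a LOESS line.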


22 of 33

A short term forecast


How could we go about predicting the number of cases opened in each of the six pay periods in the next quarter?

“The best predictor of future behavior is relevant past behavior” - Anonymous

Dr. Phil


23 of 33

Re-visualize data

  • R allows us to rapidly iterate on analyses and ideas without having to drastically rearrange or modify our data

  • We can quickly try new things and either succeed or “fail fast”

Line Number    What’s happening?
1              Choose the data set
2              Keep only the data from 2019 and onwards
3              Decide we want to make a plot
4-6            Set variables that map to the x, y, and colour dimensions
8              Draw dots (points) on the graph
9              Connect the points with a line
10             Set the range (limits) of the y-axis, from 0 to whatever
11-13          Add labels to graph and axes


24 of 33

Re-visualize data


25 of 33

We need more robust statistical methods!

  • There are best practices for applying certain methodologies

  • The burden is on you, as the analyst, to apply these best practices and validate your work

  • But many of the tools for validation conveniently exist already within the R ecosystem!

Hey! Stop! You should be checking the series for autocorrelation!
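
Checking for autocorrelation takes only a couple of lines with functions that ship with R. This is a minimal sketch on a simulated series; the choice of 26 and 10 lags is arbitrary and only for illustration.

```r
# Fake series in which each value depends on the previous one
set.seed(7)
cases_opened <- 50000 +
  as.numeric(arima.sim(model = list(ar = 0.7), n = 120, sd = 1500))

# Autocorrelation at lags 0 to 26
# (plot = FALSE returns the numbers instead of drawing the chart)
autocorrelation <- acf(cases_opened, lag.max = 26, plot = FALSE)
print(autocorrelation)

# Ljung-Box test: a small p-value suggests that the values
# are not independent of one another over time
ljung_box <- Box.test(cases_opened, lag = 10, type = "Ljung-Box")
print(ljung_box)
```

If the series is autocorrelated, methods that assume independent observations (a plain average, for example) will understate the uncertainty in a forecast.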


26 of 33

Time Series Forecasting

  • Using historical trends to try to predict what will happen in the future
    • You probably already have an intuitive understanding of it!

  • If you’re trying to predict what will happen this week, then… look at what happened last week and what happened in the same week one year ago

  • Most methods of time series forecasting are mathematical constructs based around this kind of intuition
    • But there are dozens of different ways to put all of the factors together to create forecasts and estimate the uncertainty in the forecasts

  • How does one implement and test all of these models, to see which works the best?


27 of 33

R is the lingua franca of statisticians

  • R is a commonly used language amongst academics

  • If there is a methodology you’d like to try, there’s a good chance someone has already implemented it in R

  • Implementations of different methodologies are typically distributed via packages, which simplify the process


28 of 33

Let’s try three different models…

  • Many common models are already implemented in R’s package ecosystem
    • The hard work is already done, and you don’t have to reinvent the wheel

  • Still important to understand how each of these models works
    • You will need to explain them to decision makers!

Line Number    What’s happening?
1-4            The data in “case_data” is a time series, and the column “pay period” holds the information about when things happened
6              Save the results in an object called “ts_models”
7              Let’s create some models, and I’m going to tell you how they’re defined
8-12           The first model is an “autoregressive model”, with a seasonality of 52 weeks
13-17          The second model is a “moving average” model, with a seasonality of 52 weeks
18-21          The third model is a “time series linear model”, with a seasonality of 52 weeks


29 of 33

How well do these models fit our past data?

  • R (and its package ecosystem) gives analysts the tools to quickly evaluate multiple models in parallel

Model       Variance    Akaike information criterion    Bayesian information criterion
AR_fit      28200519    1876.9                          1922.5
MA_fit      27628477    1623.8                          1659.2
TSLM_fit    32923603    1873.4                          1914.0

Lower is better
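
A comparison table like this one can be put together in a few lines. The sketch below uses only functions that ship with R and fits simplified, non-seasonal stand-ins for the three models. It is not the code from the slides, the data are simulated, and its numbers will not match the table above.

```r
# Fake series with a mild trend and autocorrelated noise (made-up numbers)
set.seed(123)
n_periods    <- 104
period_index <- seq_len(n_periods)
cases_opened <- 50000 + 10 * period_index +
  as.numeric(arima.sim(model = list(ar = 0.6), n = n_periods, sd = 1500))

# Three simplified, non-seasonal stand-ins for the models on the slide
ar_fit   <- arima(cases_opened, order = c(1, 0, 0), method = "ML")  # autoregressive
ma_fit   <- arima(cases_opened, order = c(0, 0, 1), method = "ML")  # moving average
tslm_fit <- lm(cases_opened ~ period_index)                         # linear trend

# BIC is AIC with a penalty of log(n) per parameter instead of 2
bic <- function(fit) AIC(fit, k = log(n_periods))

# Collect the same measures for every model; lower is better for each
model_summary <- data.frame(
  model    = c("AR_fit", "MA_fit", "TSLM_fit"),
  variance = c(ar_fit$sigma2, ma_fit$sigma2, summary(tslm_fit)$sigma^2),
  AIC      = c(AIC(ar_fit), AIC(ma_fit), AIC(tslm_fit)),
  BIC      = c(bic(ar_fit), bic(ma_fit), bic(tslm_fit))
)
print(model_summary)
```

Both information criteria reward a close fit and penalize extra parameters, so they help guard against picking a model only because it is more complicated.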


30 of 33

What does the forecast say?

Line Number    What’s happening?
1              Hey, remember those models we stored at “ts_models”?
2              Forecast ahead six points, please!
3              Then plot it alongside the historical data we saved earlier
4              And then split it up into three different “facets”, one for each model
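
The same "forecast ahead six points" step can be sketched with base R alone. This is a stand-in for the code on the slide: it uses a single non-seasonal autoregressive model and simulated data, and the 95% interval assumes normally distributed forecast errors.

```r
# Fake historical series (made-up numbers)
set.seed(99)
cases_opened <- 50000 +
  as.numeric(arima.sim(model = list(ar = 0.6), n = 104, sd = 1500))

# Fit a simple autoregressive model to the history
ar_fit <- arima(cases_opened, order = c(1, 0, 0), method = "ML")

# Forecast six periods ahead; predict() returns point forecasts
# and their standard errors
ar_forecast <- predict(ar_fit, n.ahead = 6)

forecast_table <- data.frame(
  steps_ahead = 1:6,
  forecast    = as.numeric(ar_forecast$pred),
  lower_95    = as.numeric(ar_forecast$pred - 1.96 * ar_forecast$se),
  upper_95    = as.numeric(ar_forecast$pred + 1.96 * ar_forecast$se)
)
print(forecast_table)

# Plot the forecast alongside the historical data
plot(cases_opened, type = "l", xlim = c(1, 110),
     xlab = "Pay period", ylab = "Cases opened (fake data)")
lines(105:110, forecast_table$forecast, lty = 2)
```

The interval widens with each step ahead, a useful reminder for decision makers that the sixth pay period is less certain than the first.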


31 of 33

How did we do?

Model       Root Mean Square Error    Mean Absolute Error    Mean Absolute Percent Error
AR_fit      2283                      1640                   3.1%
MA_fit      1590                      1092                   2.1%
TSLM_fit    2682                      1760                   3.2%
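
For reference, the three error measures in the table are short formulas. The sketch below computes them on hypothetical actual and forecast values, which are made-up numbers and not the data behind the table.

```r
# Hypothetical actual and forecast values for six pay periods
actual   <- c(50000, 52000, 48000, 51000, 49000, 50000)
forecast <- c(51000, 50000, 49000, 51500, 47000, 50500)

errors <- actual - forecast

rmse <- sqrt(mean(errors^2))              # root mean square error
mae  <- mean(abs(errors))                 # mean absolute error
mape <- mean(abs(errors / actual)) * 100  # mean absolute percent error

print(c(RMSE = rmse, MAE = mae, MAPE = mape))
```

RMSE punishes large misses more heavily than MAE does, and MAPE expresses the error relative to the size of the thing being forecast, which is often the easiest number to explain.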


32 of 33

To sum up…

  • Using R means writing down the instructions for your analysis

  • It lets you rapidly iterate to do exploratory analysis

  • It gives you access to more complex statistical functions

  • And it lets you do your work reproducibly
    • Now, other folks can see how you’ve done the work
    • And you can quickly repeat your analysis when the source data changes


If you’re interested, give it a shot.


33 of 33

Resources


R for Data Science, the Classic Introductory Text: https://r4ds.had.co.nz/

R in the Government of Canada: https://open-canada.github.io/r4gc/

Julia Silge, an R educator: https://juliasilge.com/

Streams of analyses: https://www.youtube.com/juliasilge

R Ladies Global: https://rladies.org/
