Analysing Scottish Hill Race Data with R

Introduction

R

Recap

Exercises

Other Examples

Using ggplot2 to visualise data

Conclusion

Postscript


Introduction

Anthony Atkinson (1986) published the record times for thirty-five hill races in Scotland from the 1984 fixture list of the Scottish Hill Runners Association. The records have details of the distances in miles and the climb in feet. He excluded “a third explanatory variable”, the time of year.

The data appear as Table 1 in the paper.

T1.jpeg

Anthony notes in his paper that observation number 11 “appears strongly outlying on the plot against climb”. This race has a low climb for its distance. He suggests that:

Further analysis would include checking the data in Table 1 both against the original fixture list and against the lists for other years. (1986:402)

Observation 18 (Knock Hill) has been the subject of subsequent discussions of this data set. (Atkinson, 1988; DAAG)

R

The thirty-five observations in the data set are available in a tab-delimited text file.

They are also available as the hills data frame in the MASS package for R. The package is loaded with:

> library(MASS)

For this introduction to the use of R, we are going to load the data into RStudio.

This is how the exercise starts:

Note:

# introduces a description of an action

> marks a command that you type into the RStudio console (without the > itself) before pressing enter

#Load the library

> library(MASS)

#Specify that the data being used is hills

> data(hills)

# Type hills

> hills

# When you press return this list appears

H1.jpeg

# Note that hills now appears in the Environment pane in RStudio

EE1.jpeg
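Before going further, it can help to confirm what has been loaded. The lines below are a minimal sketch using base R functions only; they report the size of the data frame, its column names and the type of each column.

```r
library(MASS)
data(hills)

# Number of rows (races) and columns (variables)
dim(hills)

# Column names and the type of each column
names(hills)
str(hills)

# The first few rows, with race names shown as row names
head(hills)
```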

# The hills data has three variables: “dist” is the distance in miles (on the map), “climb” is the total height gained during the run in feet, and “time” is the record time in minutes. To work with each variable separately, we take dist from column 1, climb from column 2 and time from column 3. Note the exact typing of these:

> dist<-hills[,1]

> climb<-hills[,2]

> time<-hills[,3]

# These variables are listed in the environment pane too

E2.jpeg
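Selecting columns by position works here, but it depends on the column order never changing. A hedged alternative is to select by name with the $ operator. The object name dist_by_name below is only illustrative and is not part of the original exercise.

```r
library(MASS)
data(hills)

# Select a column by name rather than by position
dist_by_name <- hills$dist

# Both approaches return exactly the same values
identical(dist_by_name, hills[, 1])
```

Selecting by name also makes later commands easier to read, because hills$dist says what it contains while hills[,1] does not.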

# We can look at the distribution of these variables. In this example we use “summary” to describe each distribution.

> summary(dist)

> summary(climb)

> summary(time)

# Note that when you press enter, each command prints a summary of the variable (minimum, 1st quartile, median, mean, 3rd quartile, maximum).
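The same summaries can be produced in one step by passing the whole data frame to summary. This is a small sketch, and the name hill_summary is just an illustrative choice.

```r
library(MASS)
data(hills)

# Summarise every column of the data frame at once
hill_summary <- summary(hills)
print(hill_summary)
```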

#We can now visualise these as box plots with the following instructions:

> par(mfrow=c(1,3))

> boxplot(dist, ylab="Distance in Miles")

> boxplot(climb, ylab="Height Gained in Feet")

> boxplot(time, ylab="Record Time in Minutes")

# As you press enter after each of the boxplot instructions, you should see these plots in the plots pane of RStudio. Each plot has its axis label exactly as you typed it within the quotation marks.

BP1.jpeg
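One thing to be aware of is that par(mfrow=c(1,3)) stays in force for later plots on the same device, so a single plot drawn afterwards would occupy only a third of the pane. A common pattern, sketched here with the illustrative name old_par, is to save the previous settings and restore them when the three boxplots are done.

```r
library(MASS)
data(hills)

# par() returns the previous settings, so keep them for later
old_par <- par(mfrow = c(1, 3))

boxplot(hills$dist, ylab = "Distance in Miles")
boxplot(hills$climb, ylab = "Height Gained in Feet")
boxplot(hills$time, ylab = "Record Time in Minutes")

# Go back to one plot per page
par(old_par)
```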

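# We can draw scatterplots for every pair of variables with the pairs command. With three variables there are three distinct pairs, and each appears twice (once with the axes swapped), giving six panels in total.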

> pairs(hills)

This produces this visualisation in the plots pane:

Pairs.jpeg
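The scatterplot matrix can be backed up with numbers. As a rough companion to pairs, the sketch below computes the correlation between each pair of variables with cor; hill_cor is an illustrative name. Correlations are sensitive to unusual observations, so the values should be read alongside the plots rather than instead of them.

```r
library(MASS)
data(hills)

# Pearson correlation between every pair of columns
hill_cor <- cor(hills)

# Round for easier reading
round(hill_cor, 2)
```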

# You can plot the relationships between two variables by using the plot command. For example:

> plot(dist, time)

DT.jpeg

> plot(climb, time)

CT.jpeg
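To see which race each point belongs to, the points can be labelled with the race names. This is one possible sketch using the base text function; the label size (cex) and position (pos) are arbitrary choices that you may want to adjust.

```r
library(MASS)
data(hills)

# Race names are stored as the row names of the data frame
race_names <- rownames(hills)

plot(hills$dist, hills$time,
     xlab = "Distance in Miles", ylab = "Record Time in Minutes")

# Write each race name to the right of its point, in small type
text(hills$dist, hills$time, labels = race_names, pos = 4, cex = 0.6)
```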

Recap

This has been an introductory R exercise. We have used two conventions to guide you:

# is a text message

> library(MASS) is an example of what you type into the RStudio console.

If you have received any error messages in the process please check the exact typing of your command. For example:

> library(MA)

Error in library(MA) : there is no package called ‘MA’

I did not type the package name correctly.

Patience and accuracy gave me this:

> library(MASS)

>

Note that a correct entry returns the > prompt in the next line.
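If you are unsure whether a package name is right, one way to check before loading it is requireNamespace, which returns TRUE or FALSE instead of stopping with an error. The sketch below assumes that no package called MA is installed on your machine; has_mass and has_typo are illustrative names.

```r
# TRUE if the package can be found, FALSE otherwise
has_mass <- requireNamespace("MASS", quietly = TRUE)
has_typo <- requireNamespace("MA", quietly = TRUE)

has_mass
has_typo
```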

Exercises

Using the hills data from the MASS package, can you:

Label the x axis on a boxplot.

Plot variables on x and y axes.

Identify outliers in the data set. (You might like to pay particular attention to Knock Hill.) Do the boxplot graphics help you to identify outliers?
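As one possible starting point for the third exercise, boxplot.stats returns the values that a boxplot draws as separate points (those more than 1.5 times the interquartile range beyond the box). The sketch below applies it to the record times; time_outliers and flagged_races are illustrative names.

```r
library(MASS)
data(hills)

# Values that boxplot() would draw as individual points
time_outliers <- boxplot.stats(hills$time)$out

# Names of the races holding those values
flagged_races <- rownames(hills)[hills$time %in% time_outliers]
flagged_races
```

A race can be unusual without being flagged here. The rule looks at one variable at a time, so a time that is ordinary overall but odd for its distance and climb needs a scatterplot or a model to spot.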

Other Examples

The process shared above was informed by:

A York University assignment (Toronto, Canada) from 2005.

Lab 6, Plotting Practice (Rebecca Nugent, Carnegie Mellon University) from 2006.

Others have made use of the Scottish hill data to discuss:

Regression (Randall Jennings, 2016)

Multiple Regression Models (John Maindonald and John Braun, 2003 and 2010)

Linear Regression (Hao Zhang, Purdue University)

Transformations (DAAG, 2017)

Graphical Data Analysis with R (Antony Unwin, 2015)

R Introduction and ggplot2 (Russ Hyde, 2017)

A number of textbooks use the Hill Race data to pose questions for students.

Core Statistics (Simon Wood, 2015)

A new hill race is proposed over 7 miles with a climb of 2400 feet. To generate interest in the first year, the organisers offer a prize for every runner who completes the course in less than a set time (T0). The challenge is to set a time that attracts a lot of entrants but limits the amount that has to be paid. You are asked to propose a winning time for the race using the R package MASS. Find and estimate a suitable linear model for predicting the winning time in minutes in terms of race distance (miles) and total height climbed (feet).  (Simon suggests students might consult William Naismith’s rule to help with this problem.)
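A first attempt at this question might look like the sketch below. It fits a plain linear model of time on distance and climb and asks for a prediction interval for the proposed race. This is only one reasonable starting model, not necessarily the one Simon Wood has in mind, and it uses all thirty-five observations without checking for influential points. The names fit, new_race and pred are illustrative.

```r
library(MASS)
data(hills)

# Linear model: record time explained by distance and climb
fit <- lm(time ~ dist + climb, data = hills)
summary(fit)

# The proposed race: 7 miles with 2400 feet of climb
new_race <- data.frame(dist = 7, climb = 2400)

# Predicted record time in minutes, with a 95% prediction interval
pred <- predict(fit, newdata = new_race, interval = "prediction")
pred
```

The lower end of the prediction interval is one candidate for T0, since few runners would be expected to beat it. Residual plots (for example plot(fit)) are worth checking before trusting the interval.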

Patty Solomon (University of Adelaide, 2004)

  • data(hills): This dataset is from the MASS library
  • hills: List the data.
  • pairs(hills): Show a matrix of pairwise scatterplots.
  • objects(): See which R objects you have created.
  • rm(x,y,hull): Remove objects no longer needed (i.e., cleanup).

Applied Multivariate Statistics with R (Daniel Zelterman, 2015:150)

  • Use R to find means and standard deviations of the distances, altitudes and times. Are these useful summaries of this data?
  • Why might you expect the distribution of race times, altitudes and distances to be skewed? Plot the data in R and verify if this is the case.
  • Do transformations of the data provide more symmetric distributions?
  • There was a debate about whether the time value for Knock Hill should be 18.65 or 78.65. Which of these values do you think is the correct one?
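The questions above can be approached with a few lines of base R. The sketch below computes the means and standard deviations, compares means with medians as a quick hint of skewness, tries a log transformation (one common choice, not the only one), and looks up the Knock Hill record as stored in the MASS version of the data. All object names are illustrative.

```r
library(MASS)
data(hills)

# Mean and standard deviation of each variable
hill_means <- sapply(hills, mean)
hill_sds <- sapply(hills, sd)
hill_means
hill_sds

# A mean well above the median suggests a right-skewed distribution
hill_medians <- sapply(hills, median)
hill_means - hill_medians

# Histograms after a log transformation, to compare symmetry
log_hills <- log(hills)
old_par <- par(mfrow = c(1, 3))
hist(log_hills$dist, main = "log(dist)", xlab = "")
hist(log_hills$climb, main = "log(climb)", xlab = "")
hist(log_hills$time, main = "log(time)", xlab = "")
par(old_par)

# The Knock Hill record as it is stored in MASS
knock_hill <- hills["Knock Hill", ]
knock_hill
```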

Using ggplot2 to visualise data

ggplot2 is a plotting system for R, based on the grammar of graphics. Hadley Wickham, the creator of ggplot2, has shared this presentation about ggplot2 and is the author of ggplot2: Elegant Graphics for Data Analysis. There is information about the ggplot2 package here.

There is a Quick-R introduction to ggplot2.

Fortunately, Russ Hyde has provided an example of ggplot2 with the Scottish Hill Run data in his R introduction and ggplot2 workshop. He uses the ‘dplyr’ package in his workshop. dplyr is “a fast, consistent tool for working with data frame like objects” that provides “a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges”.

Here is an example from Russ’s workshop:

# Note that I have the packages “ggplot2” and “dplyr” installed. If you do not have these, you will need to run install.packages("ggplot2") and install.packages("dplyr") first. Type the quotation marks as straight quotes, because R does not accept curly ones.

> library(MASS)

> data(hills)

> library(ggplot2)

> library(dplyr)

> head(hills)

             dist climb   time

Greenmantle   2.5   650 16.083

Carnethy      6.0  2500 48.350

Craig Dunain  6.0   900 33.650

Ben Rha       7.5   800 45.600

Ben Lomond    8.0  3070 62.267

Goatfell      8.0  2866 73.217

# Note that the race names are stored as row names, so they have no column of their own. This line copies them into a new column called peak:

> tidy_hills <- mutate(hills, peak = row.names(hills))

#List the now tidied data frame

> head(tidy_hills)

  dist climb   time         peak

1  2.5   650 16.083  Greenmantle

2  6.0  2500 48.350     Carnethy

3  6.0   900 33.650 Craig Dunain

4  7.5   800 45.600      Ben Rha

5  8.0  3070 62.267   Ben Lomond

6  8.0  2866 73.217     Goatfell
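For readers who do not have dplyr installed, a similar result can be obtained in base R. This is a sketch of an alternative, not the approach used in Russ's workshop, and tidy_hills_base is an illustrative name.

```r
library(MASS)
data(hills)

# Copy the row names into a new column and reset the row names to 1, 2, 3, ...
tidy_hills_base <- data.frame(hills, peak = rownames(hills), row.names = NULL)

head(tidy_hills_base)
```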

# Russ’s workshop then moves to some basic plots of the data. An example:

> ggplot(data = hills, mapping = aes(x = dist, y = time)) + geom_point()

# Produces

Plot2.jpeg
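The same plot can be extended by adding layers with +. The sketch below is not from Russ's workshop; it adds axis labels and a straight-line fit as an illustration of how layers build up. It assumes ggplot2 is installed and simply skips the plot if it is not. The name p is illustrative.

```r
library(MASS)
data(hills)

p <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Points, a fitted straight line without its confidence band, and labels
  p <- ggplot(data = hills, mapping = aes(x = dist, y = time)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(x = "Distance (miles)", y = "Record time (minutes)")

  print(p)
}
```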

# In another example, the code below produces an orange-red histogram with black bar outlines and labelled axes

> ggplot(tidy_hills, mapping = aes(x = time)) +
 geom_histogram(bins = 15, fill = "orangered1", col = "black") + labs(x = "Time (minutes)", y = "Count")

Hist.jpeg

Conclusion

This is a very brief introduction to some basic procedures in R, ggplot2 and dplyr.

These are important resources in sport analytics. We hope this might encourage you to extend your thinking about analysing performance.

23 September 2017

Postscript

The DAAG package provides more hill race data. Its hills2000 data set contains the record times in 2000 for 56 Scottish hill races.
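If you would like to explore the later data, something like the sketch below should load it. DAAG is an extra package that may not be installed, so the code checks first; has_daag is an illustrative name.

```r
# Assumption: DAAG is an optional extra package and may not be installed
has_daag <- requireNamespace("DAAG", quietly = TRUE)

if (has_daag) {
  # Load just this data set from the DAAG package
  data(hills2000, package = "DAAG")
  print(dim(hills2000))
  print(head(hills2000))
} else {
  message('DAAG is not installed; run install.packages("DAAG") first')
}
```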