Analysing Scottish Hill Race Data with R
Using ggplot2 to visualise data
Anthony Atkinson (1986) published the record times for thirty-five hill races in Scotland from the 1984 fixture list of the Scottish Hill Runners Association. The records have details of the distances in miles and the climb in feet. He excluded “a third explanatory variable”, the time of year.
The data appear as Table 1 in the paper.
Anthony notes in his paper that observation number 11 “appears strongly outlying on the plot against climb”. This race has a low climb for its distance. He suggests that:
Further analysis would include checking the data in Table 1 both against the original fixture list and against the lists for other years. (1986:402)
Observation 18 (Knock Hill) has been the subject of subsequent discussions of this data set. (Atkinson, 1988; DAAG)
The thirty-five observations in the data set are available in a tab-delimited text file.
They are available in the MASS package in R and can be loaded as:
> library(MASS)
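If you would rather work from the tab-delimited file than from the package, a minimal sketch follows. The file here is hypothetical: I write the MASS copy of the data to a temporary file first so that the example is self-contained, then read it back with read.delim(). The names hills_file and hills_from_file are my own.

```r
library(MASS)

# Hypothetical file: write the MASS copy of the data to a temporary
# tab-delimited file so that there is something to read back in.
hills_file <- tempfile(fileext = ".txt")
write.table(hills, hills_file, sep = "\t", quote = FALSE, col.names = NA)

# read.delim() expects tab-separated values with a header row;
# row.names = 1 uses the first column (the race names) as row names.
hills_from_file <- read.delim(hills_file, row.names = 1)

# Number of rows (races) and columns (variables)
dim(hills_from_file)
```

With a real file you would replace hills_file with the path to your own copy.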
For this introduction to the use of R, we are going to load the data into RStudio.
This is how the exercise starts:
Note:
# marks a description of an action
> marks what you type into the RStudio console before pressing Enter (you do not type the > itself)
#Load the library
> library(MASS)
#Specify that the data being used is hills
> data(hills)
# Type hills
> hills
# When you press return, the thirty-five races are listed in the console with their dist, climb and time values
# Note that hills also appears in your Environment pane in RStudio
# You can look at the distribution of the variables in these observations. We will use “dist” for distance in miles (on the map), “climb” for total height gained during the run in feet, and “time” for the record time in minutes. To work with each variable separately, we take dist from column 1 of hills, climb from column 2 and time from column 3. Type these exactly as shown:
> dist<-hills[,1]
> climb<-hills[,2]
> time<-hills[,3]
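Selecting columns by position works, but it depends on remembering the column order. As a small sketch, the same three vectors can be selected by name instead. All three forms below are standard R. The names dist and time are also names of built-in R functions, but assigning vectors to them does not stop those functions from working when they are called.

```r
library(MASS)

# The same three vectors, selected by column name rather than position.
dist  <- hills$dist
climb <- hills[["climb"]]
time  <- hills[, "time"]

# Each is a numeric vector with one value per race.
length(dist)
```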
# These variables are listed in the environment pane too
# We can look at the distribution of these variables. In this example we use “summary” to indicate the distributions.
> summary(dist)
> summary(climb)
> summary(time)
#Note that as you press enter, the commands produce the distributions for each variable (minimum, 1st quartile, median, mean, 3rd quartile, maximum).
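As a shortcut, summary() can also be given the whole data frame, in which case it reports the same six statistics for every column at once. This is a sketch; hills_summary is my own name for the result.

```r
library(MASS)

# summary() on a data frame summarises every column in one call.
hills_summary <- summary(hills)
hills_summary
```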
#We can now visualise these as box plots with the following instructions:
> par(mfrow=c(1,3))
> boxplot(dist, ylab="Distance in Miles")
> boxplot(climb, ylab="Height Gained in Feet")
> boxplot(time, ylab="Record Time in Minutes")
# As you press enter after each of the boxplot instructions, you should see the plot appear in the plots pane of RStudio. Each plot's y-axis label is the text you typed between the quotation marks.
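# We can plot scatterplots of every pair of variables with the pairs command. With three variables there are three distinct pairs, and each is drawn twice (once with the axes swapped), giving six panels in total.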
> pairs(hills)
This produces a 3 by 3 matrix of scatterplots in the plots pane, with the variable names on the diagonal.
# You can plot the relationships between two variables by using the plot command. For example:
> plot(dist, time)
> plot(climb, time)
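The plot command accepts axis labels in the same way as boxplot, and individual races can be marked on the plot. The sketch below labels the axes and highlights Knock Hill, using the fact that the race names are the row names of hills. The object name knock and the choice of red are my own.

```r
library(MASS)

# Scatterplot of record time against distance, with labelled axes.
plot(hills$dist, hills$time,
     xlab = "Distance in Miles", ylab = "Record Time in Minutes")

# Pick out one race by its row name.
knock <- hills["Knock Hill", ]

# Overplot that race as a filled red point and write its name beside it.
points(knock$dist, knock$time, pch = 19, col = "red")
text(knock$dist, knock$time, labels = "Knock Hill", pos = 4)
```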
This has been an introductory R exercise. We have used two conventions to guide you:
# marks a description of an action
> library(MASS) is an example of what you type into the RStudio console.
If you have received any error messages in the process please check the exact typing of your command. For example:
> library(MA)
Error in library(MA) : there is no package called ‘MA’
I did not enter the correct library.
Patience and accuracy gave me this:
> library(MASS)
>
Note that a correct entry returns the > prompt in the next line.
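If you are unsure whether a package is installed, you can ask R before loading it. The sketch below uses requireNamespace(), which returns TRUE or FALSE rather than stopping with an error. The variable pkg is my own.

```r
# Name of the package we want to load.
pkg <- "MASS"

# requireNamespace() returns TRUE if the package can be loaded, FALSE if not.
if (requireNamespace(pkg, quietly = TRUE)) {
  library(pkg, character.only = TRUE)
} else {
  message("Package '", pkg, "' is not installed; try install.packages(\"",
          pkg, "\")")
}
```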
Using the MASS library, can you:
Label the x axis on a boxplot.
Plot variables on x and y axes.
Identify outliers in the data set. (You might like to pay particular attention to Knock Hill.) Do the boxplot graphics help you to identify outliers?
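If you would like a hint for these questions, here is one possible approach (a sketch, not the only answer). boxplot() accepts xlab as well as ylab, and boxplot.stats() returns the values that a boxplot draws as separate points. The name time_out is my own.

```r
library(MASS)

# Label both axes of a single boxplot.
boxplot(hills$time, xlab = "All 35 races", ylab = "Record Time in Minutes")

# boxplot.stats() returns, in $out, the values lying beyond the whiskers
# (by default more than 1.5 times the box length from the box).
time_out <- boxplot.stats(hills$time)$out

# Look up which races those values belong to.
hills[hills$time %in% time_out, ]
```

Note that a boxplot looks at one variable at a time, so a race whose time is unusual only for its distance need not show up here. A scatterplot of time against dist is a useful companion.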
The process shared above was informed by:
A York University assignment (Toronto, Canada) from 2005.
Lab 6, Plotting Practice (Rebecca Nugent, Carnegie Mellon University) from 2006.
Others have made use of the Scottish hill data to discuss:
Regression (Randall Jennings, 2016)
Multiple Regression Models (John Maindonald and John Braun, 2003 and 2010)
Linear Regression (Hao Zhang, Purdue University)
Transformations (DAAG, 2017)
Graphical Data Analysis with R (Antony Unwin, 2015)
R Introduction and ggplot2 (Russ Hyde, 2017)
A number of textbooks use the Hill Race data to pose questions for students.
Core Statistics (Simon Wood, 2015)
A new hill race is proposed over 7 miles with a climb of 2400 feet. To generate interest in the first year, the organisers offer a prize for every runner who completes the course in less than a set time (T0). The challenge is to set a time that attracts a lot of entrants but limits the amount that has to be paid. You are asked to propose a winning time for the race using the R package MASS. Find and estimate a suitable linear model for predicting the winning time in minutes in terms of race distance (miles) and total height climbed (feet). (Simon suggests students might consult William Naismith’s rule to help with this problem.)
Patty Solomon (University of Adelaide, 2004)
Applied Multivariate Statistics with R (Daniel Zelterman, 2015:150)
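One possible starting point for the Core Statistics exercise above is sketched here. It is not Simon Wood's solution: it simply fits an additive linear model with lm() and asks predict() for a prediction interval for the proposed race. The object names fit, new_race and pred are my own, and the model is fitted to all thirty-five races, Knock Hill included.

```r
library(MASS)

# Fit record time (minutes) on distance (miles) and climb (feet).
fit <- lm(time ~ dist + climb, data = hills)
summary(fit)

# The proposed new race from the exercise: 7 miles with 2400 feet of climb.
new_race <- data.frame(dist = 7, climb = 2400)

# Point prediction with a 95% prediction interval for a single new race.
pred <- predict(fit, newdata = new_race, interval = "prediction")
pred
```

Checking the residuals (for example with plot(fit)) before trusting the prediction is part of deciding whether this model is "suitable".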
ggplot2 is a plotting system for R, based on the grammar of graphics. Hadley Wickham, the creator of ggplot2, has shared this presentation about ggplot2 and is the author of ggplot2: Elegant Graphics for Data Analysis. There is information about the ggplot2 package here.
There is a Quick-R introduction to ggplot2.
Fortunately, Russ Hyde has provided an example of ggplot2 with the Scottish Hill Run data in his R introduction and ggplot2 workshop. He uses the ‘dplyr’ package in his workshop. dplyr is “a fast, consistent tool for working with data frame like objects” that provides “a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges”.
Here is an example from Russ’s workshop:
# Note that I have the packages "ggplot2" and "dplyr" installed. If you do not have these you will need to use the commands install.packages("ggplot2") and install.packages("dplyr")
> library(MASS)
> data(hills)
> library(ggplot2)
> library(dplyr)
> head(hills)
             dist climb   time
Greenmantle   2.5   650 16.083
Carnethy      6.0  2500 48.350
Craig Dunain  6.0   900 33.650
Ben Rha       7.5   800 45.600
Ben Lomond    8.0  3070 62.267
Goatfell      8.0  2866 73.217
# Note that the race names are stored as row names rather than in a column of their own. The mutate command from dplyr adds them as a new column called peak:
> tidy_hills <- mutate(hills, peak = row.names(hills))
#List the now tidied data frame
> head(tidy_hills)
  dist climb   time         peak
1  2.5   650 16.083  Greenmantle
2  6.0  2500 48.350     Carnethy
3  6.0   900 33.650 Craig Dunain
4  7.5   800 45.600      Ben Rha
5  8.0  3070 62.267   Ben Lomond
6  8.0  2866 73.217     Goatfell
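If you do not have dplyr installed, the same tidying step can be done in base R. This is a sketch of an equivalent, not part of Russ's workshop, and tidy_base is my own name.

```r
library(MASS)

# Copy the data, then add the row names as an ordinary column.
tidy_base <- hills
tidy_base$peak <- rownames(hills)

# Replace the named rows with plain row numbers.
rownames(tidy_base) <- NULL

head(tidy_base)
```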
# Russ’s workshop then moves to some basic plots of the data. An example:
> ggplot(data = hills, mapping = aes(x = dist, y = time)) + geom_point()
# This produces a scatterplot of time against dist in the plots pane
# In another example, the code below produces a histogram with orange-red bars outlined in black and labelled axes
> ggplot(tidy_hills, mapping = aes(x = time)) +
geom_histogram(bins = 15, fill = "orangered1", col = "black") + labs(x = "Time (minutes)", y = "Count")
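The same layering idea extends to the scatterplot. The sketch below is my own extension rather than an example from Russ's workshop: it adds axis labels, a straight-line fit with geom_smooth(), and a text label for Knock Hill. The names hills_named and p are hypothetical.

```r
library(MASS)
library(ggplot2)

# Keep the race names in a column so that ggplot2 can use them.
hills_named <- hills
hills_named$peak <- rownames(hills)

p <- ggplot(hills_named, aes(x = dist, y = time)) +
  geom_point() +
  # Least-squares line of time on dist, without the confidence band.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Label a single race by passing a one-row subset to geom_text().
  geom_text(data = subset(hills_named, peak == "Knock Hill"),
            aes(label = peak), hjust = -0.1) +
  labs(x = "Distance (miles)", y = "Record time (minutes)")

# Printing the object draws the plot.
p
```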
This has been a very brief introduction to some basic procedures in R, ggplot2 and dplyr.
These are important resources in sport analytics. We hope this might encourage you to extend your thinking about analysing performance.
23 September 2017
There is a DAAG package that provides more hill race data. hills2000 provides the record times in 2000 for 56 Scottish hill races.