Working on a simple Dataset
by Sascha Venohr and Julian Ausserhofer
Football fans pay money to see goals from their favourite hometeam in their stadium. We want to find out how much money a fan paid for a homegoal in the Austrian Bundesliga, by buying a season ticket for 2011/2012.
We are going to work with two simple Datasets: The final standing of the Austrian Bundesliga Season 2011/2012 (table based on homegame results) and the ticket prices with different categories (Attention: use this data only for training purposes - some ticket prices are for the coming season). You find the data on two sheets in this public Google-Doc: https://docs.google.com/spreadsheet/ccc?key=0AtWOMTyA06KFdENZMi16MTBrR3g1SkpxaFpGcUpONlE#gid=6
(To be ready to start, you have to make a copy from the document File > Make copy)
A typical situation during DDJ-projects: We have to clean the data. In our case we have to split the homegoals and awaygoals in the column G “Goals”.
For splitting data in cells starting from one special character (in our case the colon in Cell G2) we use this formular:
The values separated by colons in cell G2 will be split into three columns I, J and K. When working with a spreadsheet, you can copy a formula to other cells by dragging from the bottom right corner downwards. The spreadsheet automatically adjusts the cell references in that formula to the new location (i.e. in the next cell creating a =SPLIT(G3;”:”).
Now we need the average costs out of the different ticket categories (prices are for seasontickets). We jump to the sheet “Ticketprices” and will calculated in column “I”. We could use the =AVERAGE function but it is important to keep in mind that the function =MEDIAN is much more resistant to outliers than is the =AVERAGE. We want to mitigate the effects of outliers (i.e. expensive VIP-Tickets) that is why the =MEDIAN is our choice.
To get the median out of the cells D2 - H2 we use this formula:
=MEDIAN(D2:H2) (an again we copy the formula to the other rows).
Next we need to match the two datasets from the different sheets together (again a typical situation in DDJ-projects, in our case easy because of only 10 on 10 entries).
Let’s jump to the the sheet “Hometable”
We are going to use the function
=vlookup(search_criterion, array, index, sort_order)
You’ll find an easy to understand example right here http://woorkup.com/2010/02/19/10-useful-google-spreadsheet-formulas-you-must-know/
That means in our example in cell “L2” (Sheet Hometable):
Let’s translate this expression in spoken words:
Look for the clubname you find in cell B2 in the array cells B2 to I12 (marked red) on sheet “Ticketprices”. If the clubname matches exactly with cell B2 (here FC Red Bull Salzburg) take the date out of the 8th column (marked green - our calculated average ticketprices)
We need to use the same fixed array in the other rows, too. In the other rows we just need to change the cells with the clubnames. To prevend an automatical adjusting to the cell references we fix the array using $:
Now, we can copy the formula by dragging the cell downwards, again.
Now we are able to calculate the price per homegoal for the different teams by dividing the numbers:
Lets reduce our freshly calculated numbers to a compact dataset and create a bar chart to visualize the differences between the football clubs.
We create a new Sheet by clicking on the “+” in the bleft corner. From our Sheet “Hometable” we copy the column “Clubs” and our column with the “Prices per Homegoal”. Important: We want to copy the values not the formulas. That means after copying the data into the clipboard we go to Edit > Paste Special > Paste values only
We mark our two columns and go to Insert > Chart
You can easily choose a bar chart and visualize your new dataset, now. Clicking on the chart you’ll find under “Publish chart” an ready to use embed code.