I’m co-admin of a little page on Facebook that caters to a niche audience of AFL statistics nerds known as Useless AFL Stats, where (founder and co-admin) Aaron and I discover stats that have no relevance to anything at all, and will never be useful to anyone, ever. And that got me thinking, why should I have all the fun finding these nuggets of gold?
I’m a firm believer in open source programming and data, the idea that as much data should be made publicly available as possible to the largest possible audience. After all, 1000 plus brains are going to be more innovative, better at analysing trends, faster at fact checking (and creating useless AFL stats) than just 2.
That’s why I’ll be taking you on a journey from beginner to expert, to give you the tools to be the master of your own AFL data analysis. Key points will be covered in this first post:
- Setup of your coding workspace
- R installation
- Rstudio installation and setup
- Basics of R
- Downloading AFL data
- Creating your first useless AFL stat
If you have any questions during this tutorial, you can tweet at me @crow_data_sci. If you’ve got a useless stat, feel free to tweet at us @UselessStatsAFL or message the page.
Ok, let get into the setup!
R and RStudio Installation
The program that we’ll be using to generate stats is R, a powerful tool commonly used by professional statisticians and data scientists in academia and industry, but we’ll be using it to shitpost about AFL stats.
We’ll be downloading R from here https://cloud.r-project.org/ and choose the version for your OS. I’d recommend all the default file locations and setup. If you are using a MAC OS, click R-4.0.x.pkg
which is the executable.
We’ll also be using RStudio, a Graphical User Interface (GUI) that allows for easier use of the R language. Download the free version here at https://rstudio.com/products/rstudio/download/.
Once they’ve been downloaded, open up your newly installed RStudio program. It should automatically find R and open up an interface. Think of R as the frame, steering wheel and engine of a car. You can get pretty far with just that, but RStudio completes the car, adding all the bells and whistles for a more comfortable journey. God that’s a terrible analogy.
Basics of R
Anyway, you should see a screen and tab named ‘Console’. This is where all the the commands get executed, lets try a couple out.
1+1
## [1] 2
2*4
## [1] 8
3^3
## [1] 27
Nice, what you’ll see is that the answer has been calculated and result has been produced on the next line.
Let’s save these results, we may want to use them later. In R we use an assignment command <-
, which looks like a backwards arrow. You can assign a result to any string of characters. You can also assign characters (like names, teams, locations, etc) to variables too.
a <- 1+1
b <- 2*4
c <- 3^3
first_name <- 'John'
If you look at the ‘Environment’ tab in the top right section, we can see our newly created variables, which we can use later on. Variables can be used in commands with each other too. You can’t add strings to numbers though.
a + b * c
## [1] 218
Another important concept are vectors, a data structure that can hold multiple values. We use c()
to denote a vector and we can insert multiple values.
d <- c(3,7,8,2)
teams <- c('WCE','Freo','Geel')
d
## [1] 3 7 8 2
d*2
## [1] 6 14 16 4
d[3]
## [1] 8
teams[2]
## [1] "Freo"
We can multiply and add to vectors and use square brackets to pull out certain indexes (positions) of the vector. Press the up arrow to cycle backwards through your previous commands.
Projects
Projects help contain all of your data and files in an easy to maintain structure. We’ll create one for all of our AFL data analysis.
- Click the dropdown menu in the top right corner
- New project (and save)
- New Directory
- New Project
- Name it (
AFL_Scripts
or something similar)
This initialises your new project, and we’ll do all our analysis in this project. Use the command getwd()
to find out the file path of this directory. You should see something similar to C:/Users/your_name/Documents/AFL_Scripts
.
Scripts
Scripts are an easy way to store commands that you want to come back to later, and all of our data analysis will be written in scripts. To create a new script:
- File (top left)
- New File
- R Script
Writing commands in the script and pressing enter will not execute the command, but will take you to a newline. To execute a line, use Ctrl
+ Enter
. Press Ctrl
+ S
to save your script, and you should see it appear in the ‘Files’ tab on the left hand side.
Let’s get into some AFL Analytics!
In our new script, we first need to install some packages and load them into our workspace by executing the following commands.
Installing packages
install.packages("devtools") #allows us to download from github
install.packages("dplyr") #data manipulation tools
install.packages("tidyr") #more data manipulation tools
install.packages("snakecase")#data cleaning tool
install.packages("hms") #date formatting
devtools::install_github("jimmyday12/fitzRoy") #package that loads afltables into R
Ingore the warning about “Rtools”, it’s a windows development tool that we won’t be using.
Loading packages
library(dplyr)
library(tidyr)
library(snakecase)
library(fitzRoy)
Here’s another analogy, think of install.packages
as a light bulb and library
as a switch. You only need to install a package once (unless you update R), and can use the library
function to turn them on when needed.
Side note using a hash (#
) is a programming technique called commenting. Anything after a #
will not be run and it allows the programmer to add notes, like what a certain line or function does.
Loading in AFL data
Now that we have everything set up, we can dive right in to the stats. We are going to load in data from afltables.com using fitzRoy, an R package put together by James Day that contains most of the match data in a consistent structure.
Lets load in all the data from the year 2000 onwards and assign it to a variable.
afltables <- get_afltables_stats(start_date = '2000-01-01')
## Returning data from 2000-01-01 to 2020-07-28
## Downloading data
##
## Finished downloading data. Processing XMLs
## Finished getting afltables data
# afltables <- get_afltables_stats() will pull in all the data from 1897
Now that the data is loaded into our workspace, you can see in the ‘Environment’ tab that we have 172 thousand rows (observations) and 59 columns (variables). We can confirm this with the dim
(short for dimensions) function. Loading in the data from 1897 will have over 600k rows.
dim(afltables) #rows by columns
## [1] 174284 59
Lets use the head command, which shows the top 6 or so rows of our dataset.
head(afltables)
## # A tibble: 6 x 59
## Season Round Date Local.start.time Venue Attendance Home.team HQ1G
## <dbl> <chr> <date> <int> <chr> <dbl> <chr> <int>
## 1 2000 1 2000-03-11 1450 Foot~ 37222 Adelaide 3
## 2 2000 1 2000-03-11 1450 Foot~ 37222 Adelaide 3
## 3 2000 1 2000-03-11 1450 Foot~ 37222 Adelaide 3
## 4 2000 1 2000-03-11 1450 Foot~ 37222 Adelaide 3
## 5 2000 1 2000-03-11 1450 Foot~ 37222 Adelaide 3
## 6 2000 1 2000-03-11 1450 Foot~ 37222 Adelaide 3
## # ... with 51 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>, HQ3G <int>,
## # HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>, Away.team <chr>,
## # AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>, AQ3G <int>, AQ3B <int>,
## # AQ4G <int>, AQ4B <int>, Away.score <int>, First.name <chr>, Surname <chr>,
## # ID <dbl>, Jumper.No. <dbl>, Playing.for <chr>, Kicks <dbl>, Marks <dbl>,
## # Handballs <dbl>, Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>,
## # Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>,
## # Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>,
## # Contested.Possessions <dbl>, Uncontested.Possessions <dbl>,
## # Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>,
## # Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <dbl>,
## # Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>,
## # Umpire.4 <chr>, group_id <int>
And use the command names
to check the column names, to help get a sense of what this data holds.
names(afltables)
## [1] "Season" "Round"
## [3] "Date" "Local.start.time"
## [5] "Venue" "Attendance"
## [7] "Home.team" "HQ1G"
## [9] "HQ1B" "HQ2G"
## [11] "HQ2B" "HQ3G"
## [13] "HQ3B" "HQ4G"
## [15] "HQ4B" "Home.score"
## [17] "Away.team" "AQ1G"
## [19] "AQ1B" "AQ2G"
## [21] "AQ2B" "AQ3G"
## [23] "AQ3B" "AQ4G"
## [25] "AQ4B" "Away.score"
## [27] "First.name" "Surname"
## [29] "ID" "Jumper.No."
## [31] "Playing.for" "Kicks"
## [33] "Marks" "Handballs"
## [35] "Goals" "Behinds"
## [37] "Hit.Outs" "Tackles"
## [39] "Rebounds" "Inside.50s"
## [41] "Clearances" "Clangers"
## [43] "Frees.For" "Frees.Against"
## [45] "Brownlow.Votes" "Contested.Possessions"
## [47] "Uncontested.Possessions" "Contested.Marks"
## [49] "Marks.Inside.50" "One.Percenters"
## [51] "Bounces" "Goal.Assists"
## [53] "Time.on.Ground.." "Substitute"
## [55] "Umpire.1" "Umpire.2"
## [57] "Umpire.3" "Umpire.4"
## [59] "group_id"
This next step I like to include cleans up some of the naming used, makes it more consistent format that is less likely to break a function later down the line.
#rename all the columns to a snakecase format
names(afltables) <- to_snake_case(names(afltables))
names(afltables) # now the column headers are in lowercase and have dots replaced with underscores
## [1] "season" "round"
## [3] "date" "local_start_time"
## [5] "venue" "attendance"
## [7] "home_team" "hq_1_g"
## [9] "hq_1_b" "hq_2_g"
## [11] "hq_2_b" "hq_3_g"
## [13] "hq_3_b" "hq_4_g"
## [15] "hq_4_b" "home_score"
## [17] "away_team" "aq_1_g"
## [19] "aq_1_b" "aq_2_g"
## [21] "aq_2_b" "aq_3_g"
## [23] "aq_3_b" "aq_4_g"
## [25] "aq_4_b" "away_score"
## [27] "first_name" "surname"
## [29] "id" "jumper_no"
## [31] "playing_for" "kicks"
## [33] "marks" "handballs"
## [35] "goals" "behinds"
## [37] "hit_outs" "tackles"
## [39] "rebounds" "inside_50_s"
## [41] "clearances" "clangers"
## [43] "frees_for" "frees_against"
## [45] "brownlow_votes" "contested_possessions"
## [47] "uncontested_possessions" "contested_marks"
## [49] "marks_inside_50" "one_percenters"
## [51] "bounces" "goal_assists"
## [53] "time_on_ground" "substitute"
## [55] "umpire_1" "umpire_2"
## [57] "umpire_3" "umpire_4"
## [59] "group_id"
Useless AFL Stat
Lets work towards two stats:
Who has the highest disposal tally equal to their jumper number?
and:
Which season has the highest accuracy?
Selecting columns
Now that the data is loaded and in a format we can easily manipulate, lets take a look at some basic functions. dplyr
has built in functions to make this process as painless as possible. Firstly, lets look at select
, and function that keeps the columns we want to investigate. Also we are going to be making use of %>%
, known as a pipe, to channel our data through various functions. A shortcut for the command is Ctrl
+ Shift
+ m
.
afltables %>%
select(season, round, id, first_name, surname, jumper_no, kicks, handballs)
## # A tibble: 174,284 x 8
## season round id first_name surname jumper_no kicks handballs
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2000 1 12 Mark Bickley 26 16 3
## 2 2000 1 68 Brett Burton 24 8 2
## 3 2000 1 77 James Byrne 6 5 1
## 4 2000 1 78 Matthew Clarke 4 1 3
## 5 2000 1 79 Andrew Crowell 1 4 4
## 6 2000 1 61 Andrew Eccles 33 10 7
## 7 2000 1 34 Tyson Edwards 9 17 4
## 8 2000 1 35 Shane Ellen 13 10 2
## 9 2000 1 51 Simon Goodwin 36 10 8
## 10 2000 1 22 Ben Hart 34 7 2
## # ... with 174,274 more rows
Combining (mutating) columns
Nice, now we’ve got the data want to investigate we can use a technique using a function called mutate
. Whats interesting is this data source doesn’t have a disposals count column, but we can easily recreate it by adding handballs to kicks with one line of code.
afltables %>%
select(season, round, id, first_name, surname, jumper_no, kicks, handballs) %>%
mutate(disposals = kicks + handballs) #name of our new column goes on the left hand side
## # A tibble: 174,284 x 9
## season round id first_name surname jumper_no kicks handballs disposals
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2000 1 12 Mark Bickley 26 16 3 19
## 2 2000 1 68 Brett Burton 24 8 2 10
## 3 2000 1 77 James Byrne 6 5 1 6
## 4 2000 1 78 Matthew Clarke 4 1 3 4
## 5 2000 1 79 Andrew Crowell 1 4 4 8
## 6 2000 1 61 Andrew Eccles 33 10 7 17
## 7 2000 1 34 Tyson Edwards 9 17 4 21
## 8 2000 1 35 Shane Ellen 13 10 2 12
## 9 2000 1 51 Simon Goodwin 36 10 8 18
## 10 2000 1 22 Ben Hart 34 7 2 9
## # ... with 174,274 more rows
Filtering our data
A staple of Useless AFL Stats is the use of the jumper number, mainly because its arbitrary and should have no effect on the game, but nonetheless makes for a great stat. Lets find all the times the disposal count was equal to the jumper number. We can achieve this by using the filter
function.
afltables %>%
select(season, round, id, first_name, surname, jumper_no, kicks, handballs) %>%
mutate(disposals = kicks + handballs) %>%
filter(disposals == jumper_no)
## # A tibble: 3,704 x 9
## season round id first_name surname jumper_no kicks handballs disposals
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2000 1 77 James Byrne 6 5 1 6
## 2 2000 1 78 Matthew Clarke 4 1 3 4
## 3 2000 1 436 Luke Darcy 14 11 3 14
## 4 2000 3 4 Mark Ricciuto 32 15 17 32
## 5 2000 3 558 Tim McGrath 17 9 8 17
## 6 2000 7 12 Mark Bickley 26 15 11 26
## 7 2000 8 80 Scott Welsh 17 13 4 17
## 8 2000 9 56 Nathan Bassett 8 6 2 8
## 9 2000 9 777 Glenn Archer 11 8 3 11
## 10 2000 11 79 Andrew Crowell 1 0 1 1
## # ... with 3,694 more rows
Arranging by a column
Ok, so we have all the occurrences when jumper no. was equal to disposals, what was the largest? We can use the arrange
function on a column to sort by ascending or descending order. The default arrangement for a columns is ascending (smallest at the top to biggest), so we’ll wrap the column in desc()
to get the descending order.
afltables %>%
select(season, round, id, first_name, surname, jumper_no, kicks, handballs) %>%
mutate(disposals = kicks + handballs) %>%
filter(disposals == jumper_no) %>%
arrange(desc(jumper_no))
## # A tibble: 3,704 x 9
## season round id first_name surname jumper_no kicks handballs disposals
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2016 18 11787 Tom Rockliff 38 24 14 38
## 2 2016 13 11787 Tom Rockliff 38 15 23 38
## 3 2006 2 51 Simon Goodwin 36 21 15 36
## 4 2010 13 51 Simon Goodwin 36 17 19 36
## 5 2003 4 842 Nick Stevens 36 22 14 36
## 6 2008 10 1460 Dane Swan 36 19 17 36
## 7 2009 9 1460 Dane Swan 36 21 15 36
## 8 2009 18 1460 Dane Swan 36 19 17 36
## 9 2010 16 1460 Dane Swan 36 21 15 36
## 10 2012 4 1460 Dane Swan 36 18 18 36
## # ... with 3,694 more rows
And there we have it, your first useless AFL stat. You should see Tom Rockliff up the top with 2 games of 38 disposals wearing the number 38. Useless.
Group by and Summarise
Grouping is a powerful tool we use to group certain values in columns. An example of this would be season, where each year is essentially its own category, and we can run commands that (for example) take the average amount of goals per season. Lets put this into practice with a simple example based off data we already have answering the following question:
Which season has the highest accuracy?
Lets pull in the data we need to create this stat. We need to sum the total goals and behinds per season.
afltables %>%
select(season, goals, behinds) %>%
group_by(season) %>%
summarise(
sum_g = sum(goals),
sum_b = sum(behinds),
.groups = 'drop'
)
## # A tibble: 21 x 3
## season sum_g sum_b
## <dbl> <dbl> <dbl>
## 1 2000 5626 3705
## 2 2001 5258 3586
## 3 2002 5121 3563
## 4 2003 5101 3546
## 5 2004 5012 3504
## 6 2005 5118 3439
## 7 2006 4959 3535
## 8 2007 5107 3581
## 9 2008 5238 3482
## 10 2009 4913 3515
## # ... with 11 more rows
Now that we have the total counts per season, we can use mutate
to calculate accuracy, which is \(\dfrac{Goals}{Shots}\). We are also going to arrange the result to see which season has the highest accuracy.
afltables %>%
select(season, goals, behinds) %>%
group_by(season) %>%
summarise(
sum_g = sum(goals),
sum_b = sum(behinds),
.groups = 'drop' #we also need to drop the grouping after running the command
) %>%
mutate(
accuracy = sum_g/(sum_g+sum_b)*100
) %>%
arrange(desc(accuracy))
## # A tibble: 21 x 4
## season sum_g sum_b accuracy
## <dbl> <dbl> <dbl> <dbl>
## 1 2000 5626 3705 60.3
## 2 2008 5238 3482 60.1
## 3 2005 5118 3439 59.8
## 4 2001 5258 3586 59.5
## 5 2015 5176 3536 59.4
## 6 2003 5101 3546 59.0
## 7 2002 5121 3563 59.0
## 8 2004 5012 3504 58.9
## 9 2007 5107 3581 58.8
## 10 2013 5561 3907 58.7
## # ... with 11 more rows
Nice, we can see that the year 2000 has the highest accuracy. Another question we might ask is which round in each season has the highest accuracy? We can group by a second variable, round.
afltables %>%
select(season, round, goals, behinds) %>%
group_by(season, round) %>%
summarise(
sum_g = sum(goals),
sum_b = sum(behinds),
.groups = 'drop' #we also need to drop the grouping after running the command
) %>%
mutate(
accuracy = sum_g/(sum_g+sum_b)*100
) %>%
arrange(desc(accuracy))
## # A tibble: 558 x 5
## season round sum_g sum_b accuracy
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2005 EF 64 28 69.6
## 2 2011 SF 62 28 68.9
## 3 2004 QF 60 29 67.4
## 4 2014 GF 32 16 66.7
## 5 2017 SF 48 24 66.7
## 6 2000 QF 69 35 66.3
## 7 2015 12 175 89 66.3
## 8 2008 EF 61 32 65.6
## 9 2005 PF 49 26 65.3
## 10 2007 16 250 135 64.9
## # ... with 548 more rows
Nice, 2005 Elimination finals games were at nearly 70% kicking accuracy. group_by
and summarise
work really well together, and you can switch out sum
with mean
for the average, or max
and min
for the maximum and minimum in each group. The possibilities are endless.
Conclusion
Thanks for making it this far, hopefully its given you a taste of the potential insights (useful or useless) that R and AFL have to offer. This is hopefully the first in a series of tutorials about AFL analytics in R. You can contact us on twitter or Facebook, let us know what you found interesting, insightful, difficult, any other types of stats you’d like to see recreated in upcoming posts.