I’m co-admin of a little Facebook page called Useless AFL Stats, which caters to a niche audience of AFL statistics nerds. There, founder and co-admin Aaron and I discover stats that have no relevance to anything at all and will never be useful to anyone, ever. And that got me thinking: why should I have all the fun finding these nuggets of gold?
I’m a firm believer in open source programming and open data: the philosophy that as much data as possible should be made publicly available to the largest possible audience. A thousand or so brains are going to be more innovative, better at analysing trends, and faster at fact checking (and creating useless AFL stats) than just two.
That’s why I’ll be taking you on a journey from beginner to expert, to give you the tools to be the master of your own AFL data analysis. The key points covered in this first post are:
- Setup of your coding workspace
- R installation
- Rstudio installation and setup
- Basics of R
- Downloading AFL data
- Creating your first AFL stat
If you have any questions during this tutorial, you can tweet at me @crow_data_sci. If you’ve got a useless stat, feel free to tweet at us @UselessStatsAFL or message our Facebook page.
Ok, let’s get into the setup!
R and RStudio Installation
The program that we’ll be using to generate stats is R, a powerful tool commonly used by professional statisticians and data scientists in academia and industry, but we’ll be using it to shitpost about AFL stats.
We’ll be downloading R from https://cloud.r-project.org/; choose the version for your OS. I’d recommend keeping all the default file locations and setup options. If you are using macOS, click R-4.2.x.pkg, which is the installer.
We’ll also be using RStudio, a graphical user interface (GUI) that makes working with the R language much easier. Download the free version at https://rstudio.com/products/rstudio/download/.
Once they’ve been downloaded, open up your newly installed RStudio program. It should automatically find R and open up an interface. Think of the R language as the frame, steering wheel and engine of a car. You can get pretty far with just that, but RStudio completes the car, adding all the bells and whistles for a more comfortable journey. God that’s a terrible analogy.
Basics of R
Anyway, you should see a pane with a tab named ‘Console’. This is where all the commands get executed; let’s try a couple out.
1+1
## [1] 2
2*4
## [1] 8
3^3
## [1] 27
What you’ll see is that the answer has been calculated and the result printed on the next line.
Let’s save these results, as we may want to use them later. In R we use the assignment operator <-, which looks like a backwards arrow. You can assign a result to just about any variable name. You can also assign characters (like names, teams, locations, etc.) to variables.
a <- 1+1
b <- 2*4
c <- 3^3
first_name <- 'John'
If you look at the ‘Environment’ tab in the top right section, you can see our newly created variables. Variables can also be used in commands with each other. You can’t add strings to numbers, though.
a + b * c
## [1] 218
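For example, trying to add our character variable to a number throws an error (the exact wording of the message can vary slightly between R versions):
a + first_name
## Error in a + first_name : non-numeric argument to binary operator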
Another important concept is the vector, a data structure that can hold multiple values. We use c() to create a vector, and we can insert multiple values.
d <- c(3,7,8,2)
teams <- c('WCE','Freo','Geel')
d
## [1] 3 7 8 2
d*2
## [1] 6 14 16 4
d[3]
## [1] 8
teams[2]
## [1] "Freo"
We can multiply and add to vectors, and use square brackets to pull out particular indexes (positions) of the vector. Handy tip: in the console, press the up arrow to cycle backwards through your previous commands.
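A couple of quick extras in the same vein, as a sketch to experiment with:
d + c(1,1,1,1) #adds the two vectors element by element
## [1] 4 8 9 3
teams[c(1,3)] #square brackets can pull out more than one position at once
## [1] "WCE"  "Geel"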
Projects
Projects help contain all of your data and files in an easy to maintain structure. We’ll create one for all of our AFL data analysis.
- Click the dropdown menu in the top right corner
- New project (and save)
- New Directory
- New Project
- Name it (AFL_Scripts or something similar)
This initialises your new project, and we’ll do all our analysis in this project. Use the command getwd() in the console to find out the file path of this directory. You should see something similar to C:/Users/your_name/Documents/AFL_Scripts.
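Running it looks something like this (your exact path will differ):
getwd()
## [1] "C:/Users/your_name/Documents/AFL_Scripts"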
Scripts
Scripts are an easy way to store commands that you want to come back to later, and all of our data analysis will be written in scripts. To create a new script:
- File (top left)
- New File
- R Script
Writing commands in the script and pressing Enter will not execute the command, but will take you to a new line. To execute a line, use Ctrl + Enter. Press Ctrl + S to save your script, and you should see it appear in the ‘Files’ tab on the right of RStudio.
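For example, a saved script (a hypothetical my_first_script.R) might just hold a few of the commands from earlier, one per line, ready to re-run later:
#my_first_script.R - a few commands saved for later
a <- 1+1
teams <- c('WCE','Freo','Geel')
teams[2]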
Let’s get into some AFL Analytics!
In our new script, we first need to install some packages and load them into our workspace by executing the following commands in the console (not the Script).
Installing packages
install.packages("devtools") #allows us to download from github
install.packages("dplyr") #data manipulation tools
install.packages("tidyr") #more data manipulation tools
install.packages("snakecase") #data cleaning tool
install.packages("hms") #time formatting
install.packages("fitzRoy") #get AFL data - mind the capital R
# devtools::install_github("jimmyday12/fitzRoy") #get the dev version - advanced
Loading packages
library(dplyr)
library(tidyr)
library(snakecase)
library(fitzRoy)
Here’s another analogy: think of install.packages as a light bulb and library as a switch. You only need to install a package once (unless you update R), and you can use the library function to turn it on whenever you need it.
Side note: using a hash (#) is a programming technique called commenting. Anything after a # will not be run, and it allows the programmer to add notes, like what a certain line or function does.
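A quick illustration:
#this whole line is a comment and does nothing
1+1 #anything after the hash on this line is ignored too
## [1] 2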
Loading in AFL data
Now that we have everything set up, we can dive right in to the stats. We are going to load in data from afltables.com using fitzRoy, an R package put together by James Day that contains most of the match data in a consistent structure.
Let’s load in the data for the 2021 season and assign it to a variable (the commented-out line below shows how to load a range of seasons instead).
afltables <- fetch_player_stats_afltables(season = 2021) #loads in 2021 data
## i Looking for data from 2021-01-01 to 2021-12-31
## i fetching cached data from <github.com>
## v fetching cached data from <github.com> ... done
## i No new data found - returning cached data
## Finished getting afltables data
# afltables <- fetch_player_stats_afltables(season = 2000:2010) #loads in data from 2000 to 2010
Now that the data is loaded into our workspace, you can see in the ‘Environment’ tab that we have 9,527 rows (observations) and 59 columns (variables). We can confirm this with the dim (short for dimensions) function. Loading in all the data from 1897 onwards would give you over 600k rows.
dim(afltables) #rows by columns
## [1] 9527 59
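If you only want one of the two numbers, base R also has nrow and ncol:
nrow(afltables) #just the number of rows
## [1] 9527
ncol(afltables) #just the number of columns
## [1] 59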
Let’s use the head command, which shows the first six rows of our dataset.
head(afltables)
## # A tibble: 6 x 59
## Season Round Date Local.start.time Venue Attendance Home.team HQ1G
## <dbl> <chr> <date> <int> <chr> <dbl> <chr> <int>
## 1 2021 1 2021-03-18 1925 M.C.G. 49218 Richmond 3
## 2 2021 1 2021-03-18 1925 M.C.G. 49218 Richmond 3
## 3 2021 1 2021-03-18 1925 M.C.G. 49218 Richmond 3
## 4 2021 1 2021-03-18 1925 M.C.G. 49218 Richmond 3
## 5 2021 1 2021-03-18 1925 M.C.G. 49218 Richmond 3
## 6 2021 1 2021-03-18 1925 M.C.G. 49218 Richmond 3
## # ... with 51 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>, HQ3G <int>,
## # HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>, Away.team <chr>,
## # AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>, AQ3G <int>, AQ3B <int>,
## # AQ4G <int>, AQ4B <int>, Away.score <int>, First.name <chr>, Surname <chr>,
## # ID <dbl>, Jumper.No. <chr>, Playing.for <chr>, Kicks <dbl>, Marks <dbl>,
## # Handballs <dbl>, Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>,
## # Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>, ...
And use the names command to check the column names, to help get a sense of what this data holds.
names(afltables)
## [1] "Season" "Round"
## [3] "Date" "Local.start.time"
## [5] "Venue" "Attendance"
## [7] "Home.team" "HQ1G"
## [9] "HQ1B" "HQ2G"
## [11] "HQ2B" "HQ3G"
## [13] "HQ3B" "HQ4G"
## [15] "HQ4B" "Home.score"
## [17] "Away.team" "AQ1G"
## [19] "AQ1B" "AQ2G"
## [21] "AQ2B" "AQ3G"
## [23] "AQ3B" "AQ4G"
## [25] "AQ4B" "Away.score"
## [27] "First.name" "Surname"
## [29] "ID" "Jumper.No."
## [31] "Playing.for" "Kicks"
## [33] "Marks" "Handballs"
## [35] "Goals" "Behinds"
## [37] "Hit.Outs" "Tackles"
## [39] "Rebounds" "Inside.50s"
## [41] "Clearances" "Clangers"
## [43] "Frees.For" "Frees.Against"
## [45] "Brownlow.Votes" "Contested.Possessions"
## [47] "Uncontested.Possessions" "Contested.Marks"
## [49] "Marks.Inside.50" "One.Percenters"
## [51] "Bounces" "Goal.Assists"
## [53] "Time.on.Ground.." "Substitute"
## [55] "Umpire.1" "Umpire.2"
## [57] "Umpire.3" "Umpire.4"
## [59] "group_id"
This next step, which I like to include, cleans up the column naming into a more consistent format that is less likely to break a function later down the line.
#rename all the columns to a snakecase format
names(afltables) <- to_snake_case(names(afltables))
names(afltables) # now the column headers are in lowercase and have dots replaced with underscores
## [1] "season" "round"
## [3] "date" "local_start_time"
## [5] "venue" "attendance"
## [7] "home_team" "hq_1_g"
## [9] "hq_1_b" "hq_2_g"
## [11] "hq_2_b" "hq_3_g"
## [13] "hq_3_b" "hq_4_g"
## [15] "hq_4_b" "home_score"
## [17] "away_team" "aq_1_g"
## [19] "aq_1_b" "aq_2_g"
## [21] "aq_2_b" "aq_3_g"
## [23] "aq_3_b" "aq_4_g"
## [25] "aq_4_b" "away_score"
## [27] "first_name" "surname"
## [29] "id" "jumper_no"
## [31] "playing_for" "kicks"
## [33] "marks" "handballs"
## [35] "goals" "behinds"
## [37] "hit_outs" "tackles"
## [39] "rebounds" "inside_50_s"
## [41] "clearances" "clangers"
## [43] "frees_for" "frees_against"
## [45] "brownlow_votes" "contested_possessions"
## [47] "uncontested_possessions" "contested_marks"
## [49] "marks_inside_50" "one_percenters"
## [51] "bounces" "goal_assists"
## [53] "time_on_ground" "substitute"
## [55] "umpire_1" "umpire_2"
## [57] "umpire_3" "umpire_4"
## [59] "group_id"
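If you’re curious what to_snake_case does on its own, try it on a single name:
to_snake_case("Local.start.time")
## [1] "local_start_time"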
AFL Stats
Let’s work towards two stats:
Who has the highest disposal count that is exactly equal to their tackle count?
and:
Which team has the highest accuracy?
Selecting columns
Now that the data is loaded and in a format we can easily manipulate, let’s take a look at some basic functions. dplyr has built-in functions to make this process as painless as possible. Firstly, let’s look at select, a function that keeps only the columns we want to investigate. We’re also going to make use of %>%, known as a pipe, which channels our data through a chain of functions. The keyboard shortcut for the pipe is Ctrl + Shift + M.
afltables %>%
select(season, round, id, first_name, surname, kicks, handballs, tackles)
## # A tibble: 9,527 x 8
## season round id first_name surname kicks handballs tackles
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2021 1 12790 Jake Aarts 7 5 2
## 2 2021 1 11828 David Astbury 4 5 1
## 3 2021 1 12661 Liam Baker 4 11 0
## 4 2021 1 12686 Noah Balta 10 1 1
## 5 2021 1 12535 Shai Bolton 13 12 1
## 6 2021 1 12456 Nathan Broad 4 5 0
## 7 2021 1 12010 Josh Caddy 12 5 2
## 8 2021 1 12431 Jason Castagna 8 5 0
## 9 2021 1 11557 Shane Edwards 11 16 3
## 10 2021 1 12576 Jack Graham 22 11 3
## # ... with 9,517 more rows
Combining (mutating) columns
Nice, now that we’ve got the data we want to investigate, we can use a function called mutate to create new columns. What’s interesting is that this data source doesn’t have a disposals column, but we can easily recreate it by adding handballs to kicks with one line of code.
afltables %>%
select(season, round, id, first_name, surname, kicks, handballs, tackles) %>%
mutate(disposals = kicks + handballs) #name of our new column goes on the left hand side
## # A tibble: 9,527 x 9
## season round id first_name surname kicks handballs tackles disposals
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2021 1 12790 Jake Aarts 7 5 2 12
## 2 2021 1 11828 David Astbury 4 5 1 9
## 3 2021 1 12661 Liam Baker 4 11 0 15
## 4 2021 1 12686 Noah Balta 10 1 1 11
## 5 2021 1 12535 Shai Bolton 13 12 1 25
## 6 2021 1 12456 Nathan Broad 4 5 0 9
## 7 2021 1 12010 Josh Caddy 12 5 2 17
## 8 2021 1 12431 Jason Castagna 8 5 0 13
## 9 2021 1 11557 Shane Edwards 11 16 3 27
## 10 2021 1 12576 Jack Graham 22 11 3 33
## # ... with 9,517 more rows
Filtering our data
Let’s find all the times a player’s disposal count was equal to their tackles. We can achieve this using the filter function.
afltables %>%
select(season, round, id, first_name, surname, kicks, handballs, tackles) %>%
mutate(disposals = kicks + handballs) %>%
filter(disposals == tackles)
## # A tibble: 263 x 9
## season round id first_name surname kicks handballs tackles disposals
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2021 1 12545 Callum Brown 0 0 0 0
## 2 2021 1 12748 Rhylee West 0 0 0 0
## 3 2021 1 12756 Kade Chandler 0 0 0 0
## 4 2021 1 12265 Tom Cutler 0 0 0 0
## 5 2021 1 12857 Connor Downie 0 0 0 0
## 6 2021 1 12443 Rhys Mathieson 0 0 0 0
## 7 2021 1 12509 Will Hayward 0 0 0 0
## 8 2021 1 12865 Charlie Lazzaro 1 0 1 1
## 9 2021 1 12821 Xavier OHalloran 0 0 0 0
## 10 2021 1 12312 Mason Wood 0 0 0 0
## # ... with 253 more rows
Arranging by a column
Ok, so we have all the occurrences where tackles were equal to disposals; what was the largest? We can use the arrange function on a column to sort it in ascending or descending order. The default is ascending (smallest at the top, biggest at the bottom), so we’ll wrap the column name in desc() to get descending order.
afltables %>%
select(season, round, id, first_name, surname, kicks, handballs, tackles) %>%
mutate(disposals = kicks + handballs) %>%
filter(disposals == tackles) %>%
arrange(desc(disposals))
## # A tibble: 263 x 9
## season round id first_name surname kicks handballs tackles disposals
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2021 16 12076 Dayne Zorko 9 3 12 12
## 2 2021 17 12771 Kysaiah Pickett 6 3 9 9
## 3 2021 21 12904 Kieren Briggs 6 3 9 9
## 4 2021 9 12905 Ronin OConnor 1 7 8 8
## 5 2021 16 11994 Scott Lycett 5 3 8 8
## 6 2021 PF 12695 Willem Drew 5 3 8 8
## 7 2021 8 12849 Sam Berry 3 4 7 7
## 8 2021 9 12637 Jamaine Jones 3 4 7 7
## 9 2021 13 12485 Mabior Chol 4 3 7 7
## 10 2021 17 12596 Lachie Fogarty 3 4 7 7
## # ... with 253 more rows
And there we have it, your first AFL stat. You should see Dayne Zorko up the top with 12 disposals and 12 tackles in round 16.
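If you want to double-check a single player, filter works on text columns too. A quick aside (run it yourself to see the output):
afltables %>%
  mutate(disposals = kicks + handballs) %>%
  filter(surname == 'Zorko', round == '16') %>%
  select(round, first_name, surname, disposals, tackles)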
Group by and Summarise
Grouping is a powerful tool we use to group rows by the values in certain columns. An example of this would be season, where each year is essentially its own category, and we could then run commands that (for example) calculate the average number of goals per team. Let’s put this into practice with a simple example based on data we already have, answering the following question:
Which team has the highest accuracy?
Let’s pull in the data we need to create this stat: we need to sum the total goals and behinds per team.
afltables %>%
select(playing_for, goals, behinds) %>%
group_by(playing_for) %>%
summarise(
sum_g = sum(goals),
sum_b = sum(behinds),
.groups = 'drop'
)
## # A tibble: 18 x 3
## playing_for sum_g sum_b
## <chr> <dbl> <dbl>
## 1 Adelaide 230 197
## 2 Brisbane Lions 333 222
## 3 Carlton 250 201
## 4 Collingwood 225 166
## 5 Essendon 291 200
## 6 Fremantle 219 220
## 7 Geelong 295 213
## 8 Gold Coast 201 180
## 9 Greater Western Sydney 279 190
## 10 Hawthorn 239 145
## 11 Melbourne 323 242
## 12 North Melbourne 213 157
## 13 Port Adelaide 294 213
## 14 Richmond 253 183
## 15 St Kilda 237 184
## 16 Sydney 303 195
## 17 West Coast 257 168
## 18 Western Bulldogs 339 238
Now that we have the total counts per team, we can use mutate to calculate accuracy, which is \(\dfrac{Goals}{Goals + Behinds}\), i.e. goals as a proportion of total shots. We are also going to arrange the result to see which team had the highest accuracy in 2021.
afltables %>%
select(playing_for, goals, behinds) %>%
group_by(playing_for) %>%
summarise(
sum_g = sum(goals),
sum_b = sum(behinds),
.groups = 'drop' #we also need to drop the grouping after running the command
) %>%
mutate(
accuracy = sum_g/(sum_g+sum_b)*100 #multiply by 100 to get a %
) %>%
arrange(desc(accuracy))
## # A tibble: 18 x 4
## playing_for sum_g sum_b accuracy
## <chr> <dbl> <dbl> <dbl>
## 1 Hawthorn 239 145 62.2
## 2 Sydney 303 195 60.8
## 3 West Coast 257 168 60.5
## 4 Brisbane Lions 333 222 60
## 5 Greater Western Sydney 279 190 59.5
## 6 Essendon 291 200 59.3
## 7 Western Bulldogs 339 238 58.8
## 8 Geelong 295 213 58.1
## 9 Richmond 253 183 58.0
## 10 Port Adelaide 294 213 58.0
## 11 North Melbourne 213 157 57.6
## 12 Collingwood 225 166 57.5
## 13 Melbourne 323 242 57.2
## 14 St Kilda 237 184 56.3
## 15 Carlton 250 201 55.4
## 16 Adelaide 230 197 53.9
## 17 Gold Coast 201 180 52.8
## 18 Fremantle 219 220 49.9
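As a quick arithmetic check on the top row (239 goals and 145 behinds):
239/(239+145)*100
## [1] 62.23958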
Nice, we can see that Hawthorn had the highest accuracy in 2021. Another question we might ask is: which team, in which round, had the highest accuracy? We can group by a second variable, round.
afltables %>%
select(playing_for, round, goals, behinds) %>%
group_by(playing_for, round) %>%
summarise(
sum_g = sum(goals),
sum_b = sum(behinds),
.groups = 'drop'
) %>%
mutate(
accuracy = sum_g/(sum_g+sum_b)*100
) %>%
arrange(desc(accuracy))
## # A tibble: 414 x 5
## playing_for round sum_g sum_b accuracy
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Hawthorn 21 15 2 88.2
## 2 West Coast 4 13 2 86.7
## 3 Greater Western Sydney 3 11 2 84.6
## 4 Richmond 17 11 2 84.6
## 5 Adelaide 6 16 3 84.2
## 6 Hawthorn 13 14 3 82.4
## 7 Western Bulldogs 18 14 3 82.4
## 8 Carlton 20 18 4 81.8
## 9 Adelaide 19 16 4 80
## 10 Hawthorn 5 8 2 80
## # ... with 404 more rows
Hawthorn in Round 21 topped the charts with a whopping 88% accuracy. group_by and summarise work really well together, and you can switch out sum for mean to get the average, or max and min for the maximum and minimum in each group (a quick sketch of the mean swap follows).
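Here’s a sketch of that mean swap, using average goals per player line per team as my own choice of example (output omitted, try it yourself):
afltables %>%
  group_by(playing_for) %>%
  summarise(avg_goals = mean(goals), .groups = 'drop') %>%
  arrange(desc(avg_goals))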
You could also switch out playing_for for id, first_name and surname to get individual players’ accuracy (sketched below). Another variation is to import the data from 2000 to 2022. The possibilities are endless.
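And a sketch of the individual-accuracy variation; the minimum-shots filter is my own addition so that players with only one or two shots don’t dominate the top of the list (output omitted):
afltables %>%
  group_by(id, first_name, surname) %>%
  summarise(
    sum_g = sum(goals),
    sum_b = sum(behinds),
    .groups = 'drop'
  ) %>%
  filter(sum_g + sum_b >= 20) %>% #only players with at least 20 shots
  mutate(accuracy = sum_g/(sum_g+sum_b)*100) %>%
  arrange(desc(accuracy))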
Conclusion
Thanks for making it this far; hopefully it’s given you a taste of the potential insights (useful or useless) that R and AFL data have to offer. This is hopefully the first in a series of tutorials about AFL analytics in R. You can contact us on Twitter or Facebook and let us know what you found interesting, insightful or difficult, and what other types of stats you’d like to see recreated in upcoming posts. @crow_data_sci