Introduction to AFL Analytics with Useless AFL Stats

2020-07-16 · 3262 words · 16 minute read

AFL Data Tutorial

R · Useless AFL Stats · Introduction to R

I’m co-admin of a little page on Facebook that caters to a niche audience of AFL statistics nerds known as Useless AFL Stats, where (founder and co-admin) Aaron and I discover stats that have no relevance to anything at all, and will never be useful to anyone, ever. And that got me thinking, why should I have all the fun finding these nuggets of gold?

I’m a firm believer in open source programming and data, the idea that as much data should be made publicly available as possible to the largest possible audience. After all, 1000 plus brains are going to be more innovative, better at analysing trends, faster at fact checking (and creating useless AFL stats) than just 2.

That’s why I’ll be taking you on a journey from beginner to expert, to give you the tools to be the master of your own AFL data analysis. Key points will be covered in this first post:

Setup of your coding workspace
- R installation
- Rstudio installation and setup
Basics of R
Downloading AFL data
Creating your first useless AFL stat

If you have any questions during this tutorial, you can tweet at me @crow_data_sci. If you’ve got a useless stat, feel free to tweet at us @UselessStatsAFL or message the page.

Ok, let get into the setup!

R and RStudio Installation

The program that we’ll be using to generate stats is R, a powerful tool commonly used by professional statisticians and data scientists in academia and industry, but we’ll be using it to shitpost about AFL stats.

We’ll be downloading R from here https://cloud.r-project.org/ and choose the version for your OS. I’d recommend all the default file locations and setup. If you are using a MAC OS, click R-4.0.x.pkg which is the executable.

We’ll also be using RStudio, a Graphical User Interface (GUI) that allows for easier use of the R language. Download the free version here at https://rstudio.com/products/rstudio/download/.

Once they’ve been downloaded, open up your newly installed RStudio program. It should automatically find R and open up an interface. Think of R as the frame, steering wheel and engine of a car. You can get pretty far with just that, but RStudio completes the car, adding all the bells and whistles for a more comfortable journey. God that’s a terrible analogy.

Basics of R

Anyway, you should see a screen and tab named ‘Console’. This is where all the the commands get executed, lets try a couple out.

1+1

## [1] 2

2*4

## [1] 8

3^3

## [1] 27

Nice, what you’ll see is that the answer has been calculated and result has been produced on the next line.

Let’s save these results, we may want to use them later. In R we use an assignment command <-, which looks like a backwards arrow. You can assign a result to any string of characters. You can also assign characters (like names, teams, locations, etc) to variables too.

a <- 1+1
b <- 2*4
c <- 3^3
first_name <- 'John'

If you look at the ‘Environment’ tab in the top right section, we can see our newly created variables, which we can use later on. Variables can be used in commands with each other too. You can’t add strings to numbers though.

a + b * c

## [1] 218

Another important concept are vectors, a data structure that can hold multiple values. We use c() to denote a vector and we can insert multiple values.

d <- c(3,7,8,2)
teams <- c('WCE','Freo','Geel')
d

## [1] 3 7 8 2

d*2

## [1]  6 14 16  4

d[3]

## [1] 8

teams[2]

## [1] "Freo"

We can multiply and add to vectors and use square brackets to pull out certain indexes (positions) of the vector. Press the up arrow to cycle backwards through your previous commands.

Projects

Projects help contain all of your data and files in an easy to maintain structure. We’ll create one for all of our AFL data analysis.

Click the dropdown menu in the top right corner
New project (and save)
New Directory
New Project
Name it (AFL_Scripts or something similar)

This initialises your new project, and we’ll do all our analysis in this project. Use the command getwd() to find out the file path of this directory. You should see something similar to C:/Users/your_name/Documents/AFL_Scripts.

Scripts

Scripts are an easy way to store commands that you want to come back to later, and all of our data analysis will be written in scripts. To create a new script:

File (top left)
New File
R Script

Writing commands in the script and pressing enter will not execute the command, but will take you to a newline. To execute a line, use Ctrl + Enter. Press Ctrl + Sto save your script, and you should see it appear in the ‘Files’ tab on the left hand side.

Let’s get into some AFL Analytics!

In our new script, we first need to install some packages and load them into our workspace by executing the following commands.

Installing packages

install.packages("devtools") #allows us to download from github
install.packages("dplyr")    #data manipulation tools
install.packages("tidyr")    #more data manipulation tools
install.packages("snakecase")#data cleaning tool
install.packages("hms")      #date formatting
devtools::install_github("jimmyday12/fitzRoy") #package that loads afltables into R

Ingore the warning about “Rtools”, it’s a windows development tool that we won’t be using.

Loading packages

library(dplyr)
library(tidyr)
library(snakecase)
library(fitzRoy)

Here’s another analogy, think of install.packages as a light bulb and library as a switch. You only need to install a package once (unless you update R), and can use the library function to turn them on when needed.

Side note using a hash (#) is a programming technique called commenting. Anything after a # will not be run and it allows the programmer to add notes, like what a certain line or function does.

Loading in AFL data

Now that we have everything set up, we can dive right in to the stats. We are going to load in data from afltables.com using fitzRoy, an R package put together by James Day that contains most of the match data in a consistent structure.

Lets load in all the data from the year 2000 onwards and assign it to a variable.

afltables <- get_afltables_stats(start_date = '2000-01-01')

## Returning data from 2000-01-01 to 2020-07-28

## Downloading data

## 
## Finished downloading data. Processing XMLs

## Finished getting afltables data

# afltables <- get_afltables_stats() will pull in all the data from 1897

Now that the data is loaded into our workspace, you can see in the ‘Environment’ tab that we have 172 thousand rows (observations) and 59 columns (variables). We can confirm this with the dim (short for dimensions) function. Loading in the data from 1897 will have over 600k rows.

dim(afltables) #rows by columns

## [1] 174284     59

Lets use the head command, which shows the top 6 or so rows of our dataset.

head(afltables)

## # A tibble: 6 x 59
##   Season Round Date       Local.start.time Venue Attendance Home.team  HQ1G
##    <dbl> <chr> <date>                <int> <chr>      <dbl> <chr>     <int>
## 1   2000 1     2000-03-11             1450 Foot~      37222 Adelaide      3
## 2   2000 1     2000-03-11             1450 Foot~      37222 Adelaide      3
## 3   2000 1     2000-03-11             1450 Foot~      37222 Adelaide      3
## 4   2000 1     2000-03-11             1450 Foot~      37222 Adelaide      3
## 5   2000 1     2000-03-11             1450 Foot~      37222 Adelaide      3
## 6   2000 1     2000-03-11             1450 Foot~      37222 Adelaide      3
## # ... with 51 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>, HQ3G <int>,
## #   HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>, Away.team <chr>,
## #   AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>, AQ3G <int>, AQ3B <int>,
## #   AQ4G <int>, AQ4B <int>, Away.score <int>, First.name <chr>, Surname <chr>,
## #   ID <dbl>, Jumper.No. <dbl>, Playing.for <chr>, Kicks <dbl>, Marks <dbl>,
## #   Handballs <dbl>, Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>,
## #   Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>,
## #   Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>,
## #   Contested.Possessions <dbl>, Uncontested.Possessions <dbl>,
## #   Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>,
## #   Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <dbl>,
## #   Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>,
## #   Umpire.4 <chr>, group_id <int>

And use the command names to check the column names, to help get a sense of what this data holds.

names(afltables)

##  [1] "Season"                  "Round"                  
##  [3] "Date"                    "Local.start.time"       
##  [5] "Venue"                   "Attendance"             
##  [7] "Home.team"               "HQ1G"                   
##  [9] "HQ1B"                    "HQ2G"                   
## [11] "HQ2B"                    "HQ3G"                   
## [13] "HQ3B"                    "HQ4G"                   
## [15] "HQ4B"                    "Home.score"             
## [17] "Away.team"               "AQ1G"                   
## [19] "AQ1B"                    "AQ2G"                   
## [21] "AQ2B"                    "AQ3G"                   
## [23] "AQ3B"                    "AQ4G"                   
## [25] "AQ4B"                    "Away.score"             
## [27] "First.name"              "Surname"                
## [29] "ID"                      "Jumper.No."             
## [31] "Playing.for"             "Kicks"                  
## [33] "Marks"                   "Handballs"              
## [35] "Goals"                   "Behinds"                
## [37] "Hit.Outs"                "Tackles"                
## [39] "Rebounds"                "Inside.50s"             
## [41] "Clearances"              "Clangers"               
## [43] "Frees.For"               "Frees.Against"          
## [45] "Brownlow.Votes"          "Contested.Possessions"  
## [47] "Uncontested.Possessions" "Contested.Marks"        
## [49] "Marks.Inside.50"         "One.Percenters"         
## [51] "Bounces"                 "Goal.Assists"           
## [53] "Time.on.Ground.."        "Substitute"             
## [55] "Umpire.1"                "Umpire.2"               
## [57] "Umpire.3"                "Umpire.4"               
## [59] "group_id"

This next step I like to include cleans up some of the naming used, makes it more consistent format that is less likely to break a function later down the line.

#rename all the columns to a snakecase format
names(afltables) <- to_snake_case(names(afltables))
names(afltables) # now the column headers are in lowercase and have dots replaced with underscores

##  [1] "season"                  "round"                  
##  [3] "date"                    "local_start_time"       
##  [5] "venue"                   "attendance"             
##  [7] "home_team"               "hq_1_g"                 
##  [9] "hq_1_b"                  "hq_2_g"                 
## [11] "hq_2_b"                  "hq_3_g"                 
## [13] "hq_3_b"                  "hq_4_g"                 
## [15] "hq_4_b"                  "home_score"             
## [17] "away_team"               "aq_1_g"                 
## [19] "aq_1_b"                  "aq_2_g"                 
## [21] "aq_2_b"                  "aq_3_g"                 
## [23] "aq_3_b"                  "aq_4_g"                 
## [25] "aq_4_b"                  "away_score"             
## [27] "first_name"              "surname"                
## [29] "id"                      "jumper_no"              
## [31] "playing_for"             "kicks"                  
## [33] "marks"                   "handballs"              
## [35] "goals"                   "behinds"                
## [37] "hit_outs"                "tackles"                
## [39] "rebounds"                "inside_50_s"            
## [41] "clearances"              "clangers"               
## [43] "frees_for"               "frees_against"          
## [45] "brownlow_votes"          "contested_possessions"  
## [47] "uncontested_possessions" "contested_marks"        
## [49] "marks_inside_50"         "one_percenters"         
## [51] "bounces"                 "goal_assists"           
## [53] "time_on_ground"          "substitute"             
## [55] "umpire_1"                "umpire_2"               
## [57] "umpire_3"                "umpire_4"               
## [59] "group_id"

Useless AFL Stat

Lets work towards two stats:

Who has the highest disposal tally equal to their jumper number?

and:

Which season has the highest accuracy?

Selecting columns

Now that the data is loaded and in a format we can easily manipulate, lets take a look at some basic functions. dplyr has built in functions to make this process as painless as possible. Firstly, lets look at select, and function that keeps the columns we want to investigate. Also we are going to be making use of %>%, known as a pipe, to channel our data through various functions. A shortcut for the command is Ctrl + Shift + m.

afltables %>% 
  select(season, round, id, first_name, surname, jumper_no, kicks, handballs)

## # A tibble: 174,284 x 8
##    season round    id first_name surname jumper_no kicks handballs
##     <dbl> <chr> <dbl> <chr>      <chr>       <dbl> <dbl>     <dbl>
##  1   2000 1        12 Mark       Bickley        26    16         3
##  2   2000 1        68 Brett      Burton         24     8         2
##  3   2000 1        77 James      Byrne           6     5         1
##  4   2000 1        78 Matthew    Clarke          4     1         3
##  5   2000 1        79 Andrew     Crowell         1     4         4
##  6   2000 1        61 Andrew     Eccles         33    10         7
##  7   2000 1        34 Tyson      Edwards         9    17         4
##  8   2000 1        35 Shane      Ellen          13    10         2
##  9   2000 1        51 Simon      Goodwin        36    10         8
## 10   2000 1        22 Ben        Hart           34     7         2
## # ... with 174,274 more rows

Combining (mutating) columns

Nice, now we’ve got the data want to investigate we can use a technique using a function called mutate. Whats interesting is this data source doesn’t have a disposals count column, but we can easily recreate it by adding handballs to kicks with one line of code.

afltables %>% 
  select(season, round, id, first_name, surname, jumper_no, kicks, handballs) %>% 
  mutate(disposals = kicks + handballs) #name of our new column goes on the left hand side

## # A tibble: 174,284 x 9
##    season round    id first_name surname jumper_no kicks handballs disposals
##     <dbl> <chr> <dbl> <chr>      <chr>       <dbl> <dbl>     <dbl>     <dbl>
##  1   2000 1        12 Mark       Bickley        26    16         3        19
##  2   2000 1        68 Brett      Burton         24     8         2        10
##  3   2000 1        77 James      Byrne           6     5         1         6
##  4   2000 1        78 Matthew    Clarke          4     1         3         4
##  5   2000 1        79 Andrew     Crowell         1     4         4         8
##  6   2000 1        61 Andrew     Eccles         33    10         7        17
##  7   2000 1        34 Tyson      Edwards         9    17         4        21
##  8   2000 1        35 Shane      Ellen          13    10         2        12
##  9   2000 1        51 Simon      Goodwin        36    10         8        18
## 10   2000 1        22 Ben        Hart           34     7         2         9
## # ... with 174,274 more rows

Filtering our data

A staple of Useless AFL Stats is the use of the jumper number, mainly because its arbitrary and should have no effect on the game, but nonetheless makes for a great stat. Lets find all the times the disposal count was equal to the jumper number. We can achieve this by using the filter function.

afltables %>% 
  select(season, round, id, first_name, surname, jumper_no, kicks, handballs) %>% 
  mutate(disposals = kicks + handballs) %>% 
  filter(disposals == jumper_no)

## # A tibble: 3,704 x 9
##    season round    id first_name surname  jumper_no kicks handballs disposals
##     <dbl> <chr> <dbl> <chr>      <chr>        <dbl> <dbl>     <dbl>     <dbl>
##  1   2000 1        77 James      Byrne            6     5         1         6
##  2   2000 1        78 Matthew    Clarke           4     1         3         4
##  3   2000 1       436 Luke       Darcy           14    11         3        14
##  4   2000 3         4 Mark       Ricciuto        32    15        17        32
##  5   2000 3       558 Tim        McGrath         17     9         8        17
##  6   2000 7        12 Mark       Bickley         26    15        11        26
##  7   2000 8        80 Scott      Welsh           17    13         4        17
##  8   2000 9        56 Nathan     Bassett          8     6         2         8
##  9   2000 9       777 Glenn      Archer          11     8         3        11
## 10   2000 11       79 Andrew     Crowell          1     0         1         1
## # ... with 3,694 more rows

Arranging by a column

Ok, so we have all the occurrences when jumper no. was equal to disposals, what was the largest? We can use the arrange function on a column to sort by ascending or descending order. The default arrangement for a columns is ascending (smallest at the top to biggest), so we’ll wrap the column in desc() to get the descending order.

afltables %>% 
  select(season, round, id, first_name, surname, jumper_no, kicks, handballs) %>% 
  mutate(disposals = kicks + handballs) %>% 
  filter(disposals == jumper_no) %>% 
  arrange(desc(jumper_no))

## # A tibble: 3,704 x 9
##    season round    id first_name surname  jumper_no kicks handballs disposals
##     <dbl> <chr> <dbl> <chr>      <chr>        <dbl> <dbl>     <dbl>     <dbl>
##  1   2016 18    11787 Tom        Rockliff        38    24        14        38
##  2   2016 13    11787 Tom        Rockliff        38    15        23        38
##  3   2006 2        51 Simon      Goodwin         36    21        15        36
##  4   2010 13       51 Simon      Goodwin         36    17        19        36
##  5   2003 4       842 Nick       Stevens         36    22        14        36
##  6   2008 10     1460 Dane       Swan            36    19        17        36
##  7   2009 9      1460 Dane       Swan            36    21        15        36
##  8   2009 18     1460 Dane       Swan            36    19        17        36
##  9   2010 16     1460 Dane       Swan            36    21        15        36
## 10   2012 4      1460 Dane       Swan            36    18        18        36
## # ... with 3,694 more rows

And there we have it, your first useless AFL stat. You should see Tom Rockliff up the top with 2 games of 38 disposals wearing the number 38. Useless.

Group by and Summarise

Grouping is a powerful tool we use to group certain values in columns. An example of this would be season, where each year is essentially its own category, and we can run commands that (for example) take the average amount of goals per season. Lets put this into practice with a simple example based off data we already have answering the following question:

Which season has the highest accuracy?

Lets pull in the data we need to create this stat. We need to sum the total goals and behinds per season.

afltables %>% 
  select(season, goals, behinds) %>% 
  group_by(season) %>% 
  summarise(
    sum_g = sum(goals),
    sum_b = sum(behinds),
    .groups = 'drop'
  )

## # A tibble: 21 x 3
##    season sum_g sum_b
##     <dbl> <dbl> <dbl>
##  1   2000  5626  3705
##  2   2001  5258  3586
##  3   2002  5121  3563
##  4   2003  5101  3546
##  5   2004  5012  3504
##  6   2005  5118  3439
##  7   2006  4959  3535
##  8   2007  5107  3581
##  9   2008  5238  3482
## 10   2009  4913  3515
## # ... with 11 more rows

Now that we have the total counts per season, we can use mutate to calculate accuracy, which is \(\dfrac{Goals}{Shots}\). We are also going to arrange the result to see which season has the highest accuracy.

afltables %>% 
  select(season, goals, behinds) %>% 
  group_by(season) %>% 
  summarise(
    sum_g = sum(goals),
    sum_b = sum(behinds),
    .groups = 'drop'  #we also need to drop the grouping after running the command
  ) %>% 
  mutate(
    accuracy = sum_g/(sum_g+sum_b)*100
  ) %>% 
  arrange(desc(accuracy))

## # A tibble: 21 x 4
##    season sum_g sum_b accuracy
##     <dbl> <dbl> <dbl>    <dbl>
##  1   2000  5626  3705     60.3
##  2   2008  5238  3482     60.1
##  3   2005  5118  3439     59.8
##  4   2001  5258  3586     59.5
##  5   2015  5176  3536     59.4
##  6   2003  5101  3546     59.0
##  7   2002  5121  3563     59.0
##  8   2004  5012  3504     58.9
##  9   2007  5107  3581     58.8
## 10   2013  5561  3907     58.7
## # ... with 11 more rows

Nice, we can see that the year 2000 has the highest accuracy. Another question we might ask is which round in each season has the highest accuracy? We can group by a second variable, round.

afltables %>% 
  select(season, round, goals, behinds) %>% 
  group_by(season, round) %>% 
  summarise(
    sum_g = sum(goals),
    sum_b = sum(behinds),
    .groups = 'drop'  #we also need to drop the grouping after running the command
  ) %>% 
  mutate(
    accuracy = sum_g/(sum_g+sum_b)*100
  ) %>% 
  arrange(desc(accuracy))

## # A tibble: 558 x 5
##    season round sum_g sum_b accuracy
##     <dbl> <chr> <dbl> <dbl>    <dbl>
##  1   2005 EF       64    28     69.6
##  2   2011 SF       62    28     68.9
##  3   2004 QF       60    29     67.4
##  4   2014 GF       32    16     66.7
##  5   2017 SF       48    24     66.7
##  6   2000 QF       69    35     66.3
##  7   2015 12      175    89     66.3
##  8   2008 EF       61    32     65.6
##  9   2005 PF       49    26     65.3
## 10   2007 16      250   135     64.9
## # ... with 548 more rows

Nice, 2005 Elimination finals games were at nearly 70% kicking accuracy. group_by and summarise work really well together, and you can switch out sum with mean for the average, or max and min for the maximum and minimum in each group. The possibilities are endless.

Conclusion

Thanks for making it this far, hopefully its given you a taste of the potential insights (useful or useless) that R and AFL have to offer. This is hopefully the first in a series of tutorials about AFL analytics in R. You can contact us on twitter or Facebook, let us know what you found interesting, insightful, difficult, any other types of stats you’d like to see recreated in upcoming posts.