Introduction
The Brownlow Medal, one of the most prestigious AFL awards, is notoriously difficult to predict. You can read this article from Fat Stats for a great look at why. Here’s a quick summary:
- Umpires are human and have their own biases
- Highly predictive stats from last year may not be predictive this year
- Media influence may impact umpires when they vote
Another point that I’d like to throw into the mix:
Only 3 out of 44 players can actually receive a vote!
That’s under 7% of players per game receiving a vote! Furthermore, only 1 out of 44 receives the 3 votes. This is a ridiculously small proportion, and predicting rare events is one of the trickiest things to do in predictive modelling. I’m gonna try anyway.
The plan
In this post I’ll be exploring the underlying structure of the data that players accumulate per game. To do this I’ll be employing 2 different dimensionality reduction techniques. Dimensionality reduction is a way of representing high dimensional datasets in 2 or 3 dimensions while retaining as much information about the original dataset as possible.
Once a low dimensional representation has been calculated and visualised, we expect to see different structures between 0 vote games and vote worthy games. Here are the two methods I’ll be using.
- Principal Component Analysis (PCA, linear, fairly common, robust)
- Uniform Manifold Approximation and Projection (UMAP, non-linear, relatively new, tricky)
PCA reduces the dimensions of the dataset while preserving as much of the total variability as possible. Each principal component is a linear combination of the original variables, and each component explains a proportion of the variability in the dataset. 1
UMAP is a non-linear dimension reduction technique and uses a fuzzy topological structure to model the manifold. The end result is a low dimensional projection of the data that is closest to its equivalent fuzzy topological structure. 2
Using each method, we can compare the visualisations and make judgments about which method better represents the structure of the data, and helps us differentiate between vote worthy games.
What do I mean by dimensionality reduction?
Each statistic that you can record in the AFL is a dimension (e.g. kicks, marks and handballs). Let’s plot these on a 3D chart and colour each point by the Brownlow votes they received.
This is a 3 dimensional graph where each point is a player from round 1 2019, corresponding to their stats for that game. Humans can only realistically understand 3 spatial dimensions, but computers and algorithms have the luxury of analysing hundreds of dimensions, a fact that we can exploit. You can hover over points and see who the player is and their corresponding stats.
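For a sense of how a plot like this is put together, here’s a minimal sketch with plotly. It assumes a hypothetical data frame `round1_2019` with one row per player for that round and illustrative column names; the real script may differ.

```r
# A hedged sketch of the 3D kicks/marks/handballs plot. `round1_2019` is an
# assumed data frame of round 1, 2019 player-games; column names are illustrative.
library(plotly)

plot_ly(
  round1_2019,
  x = ~kicks, y = ~marks, z = ~handballs,
  color = ~factor(brownlow_votes),   # colour each point by votes received
  text = ~player,                    # player name shown on hover
  type = "scatter3d", mode = "markers"
)
```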
In reality, the AFL records hundreds of data points including free kicks, frees against, bounces, defensive one on ones and contested possessions, just to name a few. We can’t visualise all these at once, so we need a representation of this data in a form we can understand. Dimensionality reduction is the process of taking these hundreds of dimensions and leaving us with 2 (or 3) that most closely preserve the higher dimensional shape of the data.
The first attempts
Principal component analysis (no scaling)
For the next sections I’ll be using the 2018 Home and Away Season, sourcing the data from fitzRoy. I’ll be using prcomp from the stats package to generate the principal components.
These are the statistics that are going to be used in the following analyses: kicks, marks, handballs, goals, behinds, hit_outs, tackles, rebounds, inside_50_s, clearances, clangers, frees_for, frees_against, contested_possessions, uncontested_possessions, contested_marks, marks_inside_50, one_percenters, bounces, goal_assists, time_on_ground.
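As a rough sketch of this step, assuming the 2018 data has already been pulled via fitzRoy and cleaned into a data frame `stats_2018` with the snake_case columns above plus `brownlow_votes`, the unscaled PCA looks something like this:

```r
# A minimal sketch of the unscaled PCA fit. `stats_2018` is assumed to hold one
# row per player per game for the 2018 home and away season.
library(dplyr)

stat_cols <- c("kicks", "marks", "handballs", "goals", "behinds", "hit_outs",
               "tackles", "rebounds", "inside_50_s", "clearances", "clangers",
               "frees_for", "frees_against", "contested_possessions",
               "uncontested_possessions", "contested_marks", "marks_inside_50",
               "one_percenters", "bounces", "goal_assists", "time_on_ground")

# prcomp() centres the data by default but leaves scale. = FALSE,
# matching the "no scaling" first attempt
pca_fit <- stats_2018 %>%
  select(all_of(stat_cols)) %>%
  prcomp()

# keep the first three components for plotting, with votes alongside for colour
pca_df <- as.data.frame(pca_fit$x[, 1:3]) %>%
  mutate(brownlow_votes = factor(stats_2018$brownlow_votes))
```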
A nice feature of plotly is that you can click the colours on the legend to isolate certain groups.
Here we can see PC1 and PC2, the top 2 principal components. Each dot represents a player in a game, with the coordinates giving its approximation in the lower dimensional space. I’ve added colours corresponding to the votes afterwards.
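A hedged sketch of how that plot can be built, using the `pca_df` from the earlier sketch and assuming a player name column is available in `stats_2018` for the hover text:

```r
library(plotly)

# carry the (assumed) player name across for hover text
pca_df$player <- stats_2018$player

plot_ly(
  pca_df,
  x = ~PC1, y = ~PC2,
  color = ~brownlow_votes,   # click legend entries to isolate vote groups
  text = ~player,
  type = "scatter", mode = "markers"
)
```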
What we would like to see is a nice distinction between the groups, which isn’t too evident. Most of the 1, 2 and 3 votes are clustered in the top right section, but zooming in shows that lots of 0 vote games also end up with the same lower dimensional representation. Let’s see if adding a third PC makes it any better.
Visualising PC3 arguably hasn’t added much to the interpretability of the data (but gives a cool looking plot nonetheless). Don’t be fooled by the red dots: they stand out quite a bit, but if you zoom in you can see that there are quite a few grey dots scattered in there too. If you squint, you can probably convince me that the 1-3 votes are clustering together, but the separation between the groups is fairly weak.
Uniform Manifold Approximation and Projection (no scaling)
Let’s have a look at our non-linear dimensionality reduction technique, using the same data as before, and visualise two components. An important note is that the embedding is randomly initialised, so plots may look different from run to run. 3
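As a sketch of this fit, here’s one way to do it with the uwot package; the post doesn’t specify which R UMAP implementation was used, so treat the package choice and arguments as assumptions.

```r
# A hedged sketch of the unscaled UMAP fit, reusing `stats_2018` and `stat_cols`
# from the PCA sketch above. uwot is one of several R UMAP implementations.
library(uwot)

set.seed(2018)  # the embedding has a random component, so fix a seed

umap_2d <- umap(
  as.matrix(stats_2018[, stat_cols]),
  n_components = 2    # everything else left at the package defaults
)

umap_df <- data.frame(
  UMAP1 = umap_2d[, 1],
  UMAP2 = umap_2d[, 2],
  brownlow_votes = factor(stats_2018$brownlow_votes)
)
```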
Wow, that is much better! We can clearly see some definition in the votes, as they tend to cluster towards the edge of the main group. But zooming in shows that a lot of 0 vote games are interspersed in there. Interestingly, there is a clear structure made up purely of ruckmen.
Let’s take a look at 3 components.
We can see that vote worthy games tend to clump together, and gradually filter in with the no vote games. There also seems to be a small cluster of vote games apart from the main group, representing high hit out games (mainly ruckmen) and another cluster of high goal kickers.
The second attempt (this time with scaling!)
One important thing to note is that the initial attempts treated all players equally, regardless of name, playing conditions or home advantage: purely on performance. For the next attempts I’ll be scaling the statistics by game, which removes the group level game bias. A player who leads a wet game with 25 touches will no longer be treated the same as one who racks up 35 disposals on a dry day, but scaled relative to the others they played against.
Each game will be grouped and scaled, with each stat centred to a mean of 0 and scaled to a variance of 1 (see the sketch after this list). When hovering over the points, keep in mind the following definitions:
- stat = 0: an average amount for that game
- stat > 0: higher than average
- stat < 0: lower than average
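Here’s a rough sketch of that per-game scaling, assuming `stats_2018` has a `match_id` column (the name is illustrative) identifying each game:

```r
# A hedged sketch of the game-level scaling. Within each game, every stat is
# centred to mean 0 and scaled to variance 1. Stats with no variance within a
# game (e.g. everyone recording 0 bounces) come out as NaN and would need handling.
library(dplyr)

stats_scaled <- stats_2018 %>%
  group_by(match_id) %>%   # `match_id` is an assumed per-game identifier
  mutate(across(all_of(stat_cols), ~ as.numeric(scale(.x)))) %>%
  ungroup()
```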
Principal component analysis
Not bad. 3, 2 and 1 votes are clustered together towards one side. Using the selective highlighting tool, we can see that there is overlap between 0 vote games and 3 vote games. Compared to the UMAP method, we don’t get a clear distinction between ruckmen and ball winners.
Uniform Manifold Approximation and Projection
Next up is UMAP with the game scaled data; let’s see how it fares.
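The fit itself is the same as before, just run on the scaled data. A minimal sketch, reusing `stats_scaled` and `stat_cols` from the earlier sketches:

```r
library(uwot)

set.seed(2018)
umap_scaled <- umap(as.matrix(stats_scaled[, stat_cols]), n_components = 2)
```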
Here we can see much better definition between the types of vote worthy games (remember the stats are now in game scaled units):
- Kicks and marks > 1
- Goals > 4
- Hitouts > 4
The use of a non-linear method like UMAP greatly helps our understanding of the types of structures that appear in large non-linear data sources.
Removing 0 vote games, we can see there is an overlap between these voted games, which presents an interesting question. Given the current data and methods, there is no apparent difference between these games. Given how similar these games are to each other, is there truly a 3 vote game? If not, how certain can we be of models that predict a 3 vote game?
Future Work
The next post on this topic will generate predictions using these lower dimensional embeddings, and compare their effectiveness against PCA and against using more components. Given this exploration was on 2018 data, I’ll generate predictions for 2019 and compare the results.
Grouping and scaling by game probably isn’t the most effective way to normalise the data, and future investigations might uncover better ways to scale it.
Feature selection techniques can be employed to filter out statistics (maybe bounces) that are not predictive of Brownlow votes. A simple method would be Weight of Evidence.
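For illustration, here’s a hedged sketch of a Weight of Evidence style screen on bounces, assuming the `stats_2018` data frame from earlier; the bin breaks are arbitrary and zero counts would need smoothing in practice.

```r
# Weight of Evidence compares the share of vote games vs non-vote games within
# each bin of a stat; a stat whose WoE barely moves across bins (low total
# information value) carries little signal about receiving a vote.
library(dplyr)

woe_table <- stats_2018 %>%
  mutate(got_vote   = brownlow_votes > 0,
         bounce_bin = cut(bounces, breaks = c(-Inf, 0, 2, 5, Inf))) %>%
  group_by(bounce_bin) %>%
  summarise(events = sum(got_vote), non_events = sum(!got_vote)) %>%
  mutate(woe = log((events / sum(events)) / (non_events / sum(non_events))),
         iv  = ((events / sum(events)) - (non_events / sum(non_events))) * woe)

sum(woe_table$iv)  # total information value for bounces
```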
afltables.com is not an exhaustive source of AFL statistics, and until Champion Data gets its act together and has a change of heart about releasing data to the amateur AFL analytics community, this is probably the best we can get. Additional work has been done by @Fryzigg to add higher level data, and that would be a worthwhile investigation.
I have been running UMAP with the default parameter settings, and tweaking these may produce better results (a short sketch follows the list below). These include: 4
- Number of neighbours: balances local versus global structure
- Minimum distance: how tightly points are packed together in the embedding
- Number of components: worth investigating; maybe 10 dimensions are needed to generate predictions from
- Metric: how distance is computed (e.g. Euclidean, Manhattan)
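For illustration, here’s what tweaking those knobs might look like with uwot; the values are arbitrary placeholders, not tuned recommendations.

```r
# A hedged sketch of a non-default UMAP fit on the game-scaled stats.
# Parameter names differ slightly between R UMAP implementations.
library(uwot)

umap_tuned <- umap(
  as.matrix(stats_scaled[, stat_cols]),
  n_neighbors  = 30,          # larger values emphasise global structure
  min_dist     = 0.1,         # how tightly points can pack in the embedding
  n_components = 10,          # a higher-dimensional embedding to feed a model
  metric       = "manhattan"  # distance metric in the original stat space
)
```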
Conclusion
I believe that there are 3 main takeaways from this analysis that can improve our understanding of how Brownlow votes are allocated.
- Use non-linear methods for determining Brownlow votes
- There is (most likely) no discernible difference between 1, 2 and 3 vote games
- If there’s no discernible difference between voted games, then confidence intervals should reflect it
- Scaling data per game improves interpretability
Thanks for making it this far, you can check out the script I used to generate this analysis here. You can ask me questions about the analysis on Twitter @crow_data_sci.
Also, make sure to check out my previous post, Intro to AFL Stats in R.