Being locked down gave me an idea:
If I could travel, whats the shortest distance to visit all Australian football grounds?
To answer this question, I’m going to be exploring one of the most famous optimisation problems in mathematics: the Traveling Salesman Problem, or a Travelling Footballer Problem.
The data
Here’s some data I prepared earlier, 17 venues with corresponding latitude and longitude. These are Australian based venues that have been used in the last 3 years of the AFL regular season.
library(dplyr)
library(tidyr)
afl_venue_locations <- read.csv("post_data/afl_venue_locations.csv")
library(leaflet)
leaflet(data = afl_venue_locations) %>% addTiles() %>%
addMarkers(~longitude, ~latitude, popup = ~venue, label = ~venue)
Given these 17 locations, and we want to find the shortest path that allows us to visit each of the grounds and finish where we started. To start off we need a distance matrix, a calculation of the distance between each venue. We are going to use the geosphere
package with built in functions to calculate distances from the venues’ latitude and longitude coordinates.
The data prep
library(geosphere)
afl_venue_coords <- afl_venue_locations %>% select(longitude, latitude)
distance_matrix <- as.matrix(
distm(afl_venue_coords, fun = distHaversine)
)/1000 #convert metres to kilometres
rownames(distance_matrix) <- afl_venue_locations$venue
colnames(distance_matrix) <- afl_venue_locations$venue
Our distance matrix has been computed. You’ll notice that the diagonals are all 0, which makes sense given its the same location. Also the matrix is symmetrical along the diagonal: Walking from the MCG to Docklands is the same as the return journey. Please note that this is a “As the crow flies” implementation, a straight line from venue to venue.
Model Building
I chose to use the ompr
implementation of the TSP, which I found to be pretty stable and quick*. We will also need the development version of ompr
to get access to the vectorised version of the algorithm.
*I haven’t tried any others.
install.packages("ompr")
devtools::install_github("dirkschumacher/ompr.roi") #or cran version higher than 0.8.0.9
install.packages("ROI.plugin.glpk")
library(ompr)
#specify the dimensions of the distance matrix
n <- length(afl_venue_locations$id)
#create a distance extraction function
dist_fun <- function(i, j) {
vapply(seq_along(i), function(k) distance_matrix[i[k], j[k]], numeric(1L))
}
model <- MILPModel() %>%
# we create a variable that is 1 iff we travel from city i to j
add_variable(x[i, j], i = 1:n, j = 1:n,
type = "integer", lb = 0, ub = 1) %>%
# a helper variable for the MTZ formulation of the tsp
add_variable(u[i], i = 1:n, lb = 1, ub = n) %>%
# minimize travel distance
set_objective(sum_expr(colwise(dist_fun(i, j)) * x[i, j], i = 1:n, j = 1:n), "min") %>%
# you cannot go to the same city
set_bounds(x[i, i], ub = 0, i = 1:n) %>%
# leave each city
add_constraint(sum_expr(x[i, j], j = 1:n) == 1, i = 1:n) %>%
# visit each city
add_constraint(sum_expr(x[i, j], i = 1:n) == 1, j = 1:n) %>%
# ensure no subtours (arc constraints)
add_constraint(u[i] >= 2, i = 2:n) %>%
add_constraint(u[i] - u[j] + 1 <= (n - 1) * (1 - x[i, j]), i = 2:n, j = 2:n)
model
## Mixed integer linear optimization problem
## Variables:
## Continuous: 17
## Integer: 289
## Binary: 0
## Model sense: minimize
## Constraints: 306
The model has been built, now lets solve it.
#Solving the model
library(ompr.roi)
library(ROI.plugin.glpk)
result <- solve_model(model, with_ROI(solver = "glpk", verbose = TRUE))
## <SOLVER MSG> ----
## GLPK Simplex Optimizer, v4.47
## 306 rows, 306 columns, 1330 non-zeros
## 0: obj = 0.000000000e+000 infeas = 5.000e+001 (34)
## * 54: obj = 2.453647438e+004 infeas = 0.000e+000 (1)
## * 128: obj = 8.936395032e+003 infeas = 0.000e+000 (1)
## OPTIMAL SOLUTION FOUND
## GLPK Integer Optimizer, v4.47
## 306 rows, 306 columns, 1330 non-zeros
## 289 integer variables, 272 of which are binary
## Integer optimization begins...
## + 128: mip = not found yet >= -inf (1; 0)
## + 877: >>>>> 1.603882389e+004 >= 9.723428934e+003 39.4% (82; 1)
## + 1144: >>>>> 1.507632658e+004 >= 9.723917868e+003 35.5% (109; 5)
## + 2068: >>>>> 1.507157320e+004 >= 1.051733544e+004 30.2% (158; 31)
## + 2215: >>>>> 1.167150030e+004 >= 1.069026063e+004 8.4% (163; 32)
## + 3581: >>>>> 1.166830513e+004 >= 1.160505429e+004 0.5% (11; 353)
## + 3671: mip = 1.166830513e+004 >= tree is empty 0.0% (0; 399)
## INTEGER OPTIMAL SOLUTION FOUND
## <!SOLVER MSG> ----
result_val <- round(objective_value(result), 2)
result_val
## [1] 11668.31
Nice! The shortest distance travelled to visit all venues is 11668km.
Results
How does this route look on a map?
solution <- get_solution(result, x[i, j]) %>%
filter(value > 0)
paths <- select(solution, i, j) %>%
rename(from = i, to = j) %>%
mutate(trip_id = row_number()) %>%
inner_join(afl_venue_locations, by = c("from" = "id"))
paths_leaflet <- paths[1,]
paths_row <- paths[1,]
for (i in 1:n) {
paths_row <- paths %>% filter(from == paths_row$to[1])
paths_leaflet <- rbind(paths_leaflet, paths_row)
}
leaflet() %>%
addTiles() %>%
addMarkers(data = afl_venue_locations, ~longitude, ~latitude, popup = ~venue, label = ~venue) %>%
addPolylines(data = paths_leaflet, ~longitude, ~latitude, weight = 2)
Seems pretty intuitive, essentially a big loop around Australia.
Maths behind the TSP
How many combinations were there? Lets start with a simple case of 4 locations and the following path, ending where we started.
\(A \rightarrow B \rightarrow C \rightarrow D \rightarrow A\)
is the same as
\(B \rightarrow C \rightarrow D \rightarrow A \rightarrow B\)
This means the “starting venue” is already decided, we can just apply a rotation to the cycle. We can also travel in either direction (thanks to a symmetric matrix).
\(A \rightarrow B \rightarrow C \rightarrow D \rightarrow A\)
is the same as
\(A \rightarrow D \rightarrow C \rightarrow B \rightarrow A\)
which means we can remove half of the possible cycles, which are just reflections of each other.
This gives us the resulting formula:
\[ \texttt{NComb} = \dfrac{(n-1)!}{2}\\ n=17, \therefore \texttt{NComb} = \dfrac{(17-1)!}{2} = 10,461,394,944,000 \]
To brute force a 17 venue TSP, 10 trillion combinations would need to be evaluated, even when removing cycling and reflections, and the ompr
R implementation solved it in about 1 second! Amazing!
Bonus Round
What’s the maximum distance? Easy. Replace the objective from min
to max
.
model_max <- MILPModel() %>%
# we create a variable that is 1 iff we travel from city i to j
add_variable(x[i, j], i = 1:n, j = 1:n,
type = "integer", lb = 0, ub = 1) %>%
# a helper variable for the MTZ formulation of the tsp
add_variable(u[i], i = 1:n, lb = 1, ub = n) %>%
# maximise travel distance
set_objective(sum_expr(colwise(dist_fun(i, j)) * x[i, j], i = 1:n, j = 1:n), "max") %>%
# you cannot go to the same city
set_bounds(x[i, i], ub = 0, i = 1:n) %>%
# leave each city
add_constraint(sum_expr(x[i, j], j = 1:n) == 1, i = 1:n) %>%
# visit each city
add_constraint(sum_expr(x[i, j], i = 1:n) == 1, j = 1:n) %>%
# ensure no subtours (arc constraints)
add_constraint(u[i] >= 2, i = 2:n) %>%
add_constraint(u[i] - u[j] + 1 <= (n - 1) * (1 - x[i, j]), i = 2:n, j = 2:n)
model_max
## Mixed integer linear optimization problem
## Variables:
## Continuous: 17
## Integer: 289
## Binary: 0
## Model sense: maximize
## Constraints: 306
result_max <- solve_model(model_max, with_ROI(solver = "glpk", verbose = TRUE))
## <SOLVER MSG> ----
## GLPK Simplex Optimizer, v4.47
## 306 rows, 306 columns, 1330 non-zeros
## 0: obj = 0.000000000e+000 infeas = 5.000e+001 (34)
## * 54: obj = 2.453647438e+004 infeas = 0.000e+000 (1)
## * 166: obj = 3.605414246e+004 infeas = 0.000e+000 (1)
## OPTIMAL SOLUTION FOUND
## GLPK Integer Optimizer, v4.47
## 306 rows, 306 columns, 1330 non-zeros
## 289 integer variables, 272 of which are binary
## Integer optimization begins...
## + 166: mip = not found yet <= +inf (1; 0)
## + 181: >>>>> 3.605414246e+004 <= 3.605414246e+004 < 0.1% (2; 0)
## + 181: mip = 3.605414246e+004 <= tree is empty 0.0% (0; 3)
## INTEGER OPTIMAL SOLUTION FOUND
## <!SOLVER MSG> ----
result_val_max <- round(objective_value(result_max))
paste0('Total distance: ',result_val_max,'km')
## [1] "Total distance: 36054km"
solution_max <- get_solution(result_max, x[i, j]) %>%
filter(value > 0)
paths_max <- select(solution_max, i, j) %>%
rename(from = i, to = j) %>%
mutate(trip_id = row_number()) %>%
inner_join(afl_venue_locations, by = c("from" = "id"))
paths_max_leaflet <- paths_max[1,]
paths_max_row <- paths_max[1,]
for (i in 1:n) {
paths_max_row <- paths_max %>% filter(from == paths_max_row$to[1])
paths_max_leaflet <- rbind(paths_max_leaflet, paths_max_row)
}
leaflet() %>%
addTiles() %>%
addMarkers(data = afl_venue_locations, ~longitude, ~latitude, popup = ~venue, label = ~venue) %>%
addPolylines(data = paths_max_leaflet, ~longitude, ~latitude, weight = 2)
Lots of criss-crossing paths. Imagine travelling from Canberra to Sydney via Perth.
Conclusion
Thanks for making it this far, you can check out the script I used to generate this analysis here. You can ask me questions about analysis at my twitter @crow_data_sci.
Make sure to check out ompr
on github and follow the creator of this package on twitter, @dirk_sch.