Using R for Football Data Analysis – Monte Carlo

OK, so I’m going to try my hand at a tutorial. We’re going to use R to run a Monte Carlo simulation on the expected goal (ExpG) values of the shots in the Southampton v Liverpool game (23/02/2015), and calculate the win probability of an average team given those chances. We will then build a bar plot to display that information, and lastly save a PNG image of said bar plot to our hard drive. All contained in one handy script: write the code once and never have to write it again (you hope!).

But first I just want to get a few things out of the way. I’m not an R expert. I’ve been programming for a few years now, mostly as a hobby, and mainly in iOS and Objective-C. When I got interested in football data, that iOS grounding gave me a solid base to explore more data-orientated languages such as SQL and R. Just recently I even exchanged a career in taxation for one in iOS development!

Anyway, I’d expect this tutorial might help people who want to get started in R and it would mostly suit someone who at least knows the difference between a variable and a for loop. I’m going to keep the tutorial as simple as possible, so the code might be dumbed down a little. That’s not to be condescending, just speaking from experience in that any R tutorials I read seem to get quite technical quite quickly. Also, in my opinion R doesn’t read well, it’s not a language that’s very verbose, which I think makes it look ugly at times, and that non-verbosity also makes it look far more complicated than it is. But, it’s as powerful as it is ugly when it comes to statistical analysis. As a result I’ve structured my code here in a way that I think reads a little easier. Hell, I’m not even going to write a custom function.

Things we need:

1. An understanding of basic programming concepts and what a Monte Carlo simulation is. If you need to brush up on programming, Codecademy would be a good place to start; I’d probably start with the JavaScript course there. You’ll also want at least some limited knowledge of R: what packages are, etc.

2. R installed on your machine. R download here.

3. Our IDE of choice: RStudio – you can download it here.

4. A .csv file with expected goal numbers. You can download my sample here.

5. Some knowledge of what expected goals is. Great explanation by @footballfactman here.

So with everything installed and RStudio opened up, let’s get started. In RStudio, the bottom left panel is our console and it’s where we will see all of our outputs. Ignore the rest; it will become more familiar as we go along. Before we do anything we need to find our working R directory. Trust me on this, I lost a lot of time here.
[Screenshot: the RStudio window]

Click into the console, type getwd() and press return. This will give us the path to our working directory in R. This is where we want to save our script files for the moment, and it will save us having to find the directory path to any data files (csv) when we read them in.

Next click File, New File, R Script, and a new file will open in the text editor window. Now click File again and click Save As. In the dialog box navigate to the working directory we found above, and name our file singleMatchMonteCarlo. The .R file name extension will be added automatically.

Normally I wouldn’t use csv files, as I would connect to a SQL database and run queries directly from R as a way to read in specific data. If you can, I’d highly recommend getting started with SQL, or even SQLite as a starter. There is a great and easy to understand primer here on SQL. For football data analysis, R is much more powerful this way.

Next move the sample csv file you downloaded above (myData.csv) to the same working directory as your singleMatchMonteCarlo.R file. Both files should now be in the same directory. If this sounds cumbersome, then don’t worry: we won’t have to do this every time, as you can link to any file on your hard drive once you have that file’s directory path. That’s for another day though.
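A quick check from the console can confirm that R really can see the csv file from the working directory. This is only a sketch, and it assumes the sample file kept the name myData.csv; the variable name csvIsVisible is just made up for the example.

```r
# Show the current working directory
print(getwd())

# List any csv files R can see in that directory
print(list.files(pattern = "\\.csv$"))

# TRUE means read.csv("myData.csv") should find the file without a full path
csvIsVisible = file.exists("myData.csv")
print(csvIsVisible)
```

If the last line prints FALSE, the file is sitting somewhere other than the directory that getwd() reported.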

Lastly, just a small style note. The conventional way to assign variables/objects in R is with <- but you can also use the equals sign =. I’ll be using the equals sign, simply because I just don’t see the point in an extra keystroke. Programming is enough bloody typing as it is.
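For anyone curious, here is a small sketch showing that the different assignment styles end up with the same result. The variable names are made up purely for illustration.

```r
shotsA <- 19     # the conventional R assignment arrow
shotsB = 19      # the equals sign, used throughout this tutorial
19 -> shotsC     # the rightward arrow also works, but is rarely seen

# identical() returns TRUE when two objects hold exactly the same value
print(identical(shotsA, shotsB))
print(identical(shotsA, shotsC))
```

Whichever style you pick, it is worth sticking to one within a script.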

So we have 3 columns in our csv file: team, xg and matchID. Ignore matchID for the purposes of this tutorial. In a later tutorial (if there is one!) this will become relevant when we want to loop through a number of different games.

Before I go further, if you are new to programming or just starting off in R, I’d advise typing the code rather than copying and pasting. Repetition will help. Excuse the cheese, but programming is like playing a musical instrument: if you don’t practice, you won’t be good at it. First off we need to read the data in our csv file into R. One way of doing this is like so:

myData = read.csv(file="myData.csv", header=TRUE, sep=",")

So type the above into the text editor in RStudio. Here we are using R’s built-in function read.csv, and between the parentheses we are passing in 3 parameters: the file to read, whether the first row holds the column headers, and the character that separates the columns. Note that if we had saved our myData.csv file to any folder other than our working R directory, we would have to give the full path to that file. It would probably look something like this:

myData = read.csv(file="/Users/Documents/_Stats/myData.csv", header=TRUE, sep=",")

You can if you wish change your default working R directory within R Studio itself by going to the Set Working Directory menu under Session. Play around with it, if you’re feeling brave!

[Screenshot: the Set Working Directory menu under Session]

Go ahead and run the code above that you typed in the editor. You can do this by selecting/highlighting the code in the editor itself and hitting the RUN button in the top right corner of the editor. Ctrl+a will do the trick to highlight everything in the editor. Alternatively you can copy the line and paste it directly into the console below and press return. If everything worked correctly you should receive no errors in the console and in the environment panel you will see the object myData initialised and ready for use.

[Screenshot: the environment panel showing the myData object]

You can expand and view the data directly by double clicking the small spreadsheet symbol in the environment panel, which should open up the myData object within the code editor. See image above.

Go ahead now and type the following into the console and press return.

print(myData)

The same data should now print to the console. Printing to the console is a great tool to use when your programming becomes a little more complex, as it can help to see if variables and objects have been set or initialised at arbitrary points within the control flow of your code.
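Beyond print, a few other built-in functions give a quicker feel for what was read in. The sketch below writes three made-up rows to a temporary file so it runs on its own; the values are hypothetical and are not taken from the sample csv, and demoData is just an example name.

```r
# Hypothetical rows, written to a temporary file so the example is self-contained
tmpCsv = tempfile(fileext = ".csv")
writeLines(c("team,xg,matchID",
             "Southampton,0.08,1",
             "Liverpool,0.31,1",
             "Southampton,0.12,1"), tmpCsv)

demoData = read.csv(file = tmpCsv, header = TRUE, sep = ",")

print(head(demoData))        # the first few rows
str(demoData)                # column names and the type of each column
print(nrow(demoData))        # number of rows, i.e. number of shots
print(table(demoData$team))  # how many shots each team took
print(range(demoData$xg))    # the xg values should sit between 0 and 1
```

Swapping demoData for myData runs the same checks on the real file.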

Now that we’ve read in our data we can talk about what we need to do. Looking at how the data is constructed usually determines what logic we need to write to extract that data and use it the way we want to use it. We want to use a Monte Carlo simulation, which is basically testing the likelihood an event might happen by trying it many times with random numbers. The event in our case is whether a goal is scored or not based on our ExpG numbers, each of which is just the probability of that particular shot being scored based on a number of factors such as location of shot, shot type, pass type etc. So if we pick a random whole number between 1 and 100 and that number is at or below our ExpG number multiplied by 100 (an ExpG of 0.25 becomes 25), then we register that a goal has been scored.
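To see the idea on a single shot before building the full script, here is a minimal sketch. The 0.25 chance, the number of tries and the seed are made-up values for illustration only.

```r
set.seed(1)             # fixed seed so the run is repeatable
shotXg = 0.25           # hypothetical shot with a 25% chance of being scored
numberOfTries = 10000

# One random whole number from 1 to 100 for each try
randomNumbers = sample(1:100, numberOfTries, replace = TRUE)

# A try counts as a goal when the number is at or below xg * 100
goals = randomNumbers <= shotXg * 100

# The share of tries that ended in a goal should land close to 0.25
print(mean(goals))
```

The more tries we run, the closer the share of goals should settle towards the ExpG value, which is the whole point of running a simulation many times.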

There are a number of ways to do this. You only have to spend 10 minutes on Stack Overflow to realise that every question has multiple correct answers, some more code efficient than others; there is more than one way to skin a cat. We will use arrays (or vectors, as they are called in R), the data table we read in above, and a couple of for loops to loop through the table. We use for loops when we know how many times we want to iterate through something like a vector. Vectors/arrays are just numbered lists. So we loop through our data, check who took the shot (Southampton or Liverpool), get a random number, check to see if the random number is at or below the scaled ExpG, and if so record that a goal was scored. So let’s declare and initialise some variables and vectors we’re going to use.

homeTeam = 'Southampton'
awayTeam = 'Liverpool'
outputHomeTeam = c()
outputAwayTeam = c()
numberOfTimesToRunSim = 10
randomNumber = NULL
countGoalsScoredByHomeTeam = 0
countGoalsScoredByAwayTeam = 0

First, our 2 string variables homeTeam and awayTeam will hold a reference to Southampton and Liverpool respectively. While we loop through the ExpG numbers we will need a way to check who took each individual shot. We’ll use these in an if else statement from within the for loop to check. Hit the clear button in the environment panel and re-run the code again. Play around in the console now by typing some of the variables into it and pressing return. Your IDE should now look like this:

[Screenshot: RStudio after running the variable declarations]

Now we wouldn’t normally run our sim only 10 times, but leave it as it is for the time being, as we just want to be sure our outputs are what they should be later on. So let’s loop through our ExpG numbers. We want our loop to run numberOfTimesToRunSim times, so x is our counter and numberOfTimesToRunSim is our end condition, which is set to 10. Add the following code below the variables we set above.

for (x in 1:numberOfTimesToRunSim) {
print(x)
}

Apologies for how the code is displayed, wordpress doesn’t seem to have an option to indent the code properly. I’ll re-look at this later on to see if I can format it properly. Non-indented code is horrible to look at! Again go ahead and clear the environment and re-run all of the code. (remember: click into the code editor and hit ctrl+a, this is our shortcut to highlight all the code) Numbers 1 – 10 inclusive should print to the console. Next we need another for loop within our above loop to iterate over our ExpG numbers.


for (x in 1:numberOfTimesToRunSim) {
print(x)
for (i in 1:length(myData$team)) {
print(i)
}
}

Here we are just saying iterate through the inner for loop once for every row under team in our myData object. Type length(myData$team) into the console and it should return 19, which is the total number of shots taken by both teams combined. Again clear your environment, highlight all the code in the editor and hit run. You should see a count to 19, ten times. It’s tiresome, but running and printing to the console will help you understand the machinations of what is being executed. Next we need to check which team took each shot with a conditional if else statement. Update your for loop to match the following:


for (x in 1:numberOfTimesToRunSim) {
for (i in 1:length(myData$team)) {
if(myData$team[i]==homeTeam){
print("The home team Southampton took the shot")
}else if (myData$team[i]==awayTeam){
print("The away team Liverpool took the shot")
}
}
} # end of numberOfTimesToRunSimLoop

Again clear and re-run and peruse the console to see what is going on. Now that we know which team took each shot we can get the corresponding ExpG value on each row. But first we need a random whole number between 1 and 100, and we need to scale the ExpG value (which is between 0.0 and 1.0) by 100 so the two are comparable. Then we check if our random number is less than or equal to the scaled ExpG number for each shot with another if statement; if so, a simulated goal was scored and we add 1 to our countGoalsScoredByHomeTeam or countGoalsScoredByAwayTeam variable. Of course I don’t need an else if condition here, as a plain “else” would catch every other case, but for legibility this is how I’ll continue. So remove the print statements and replace the if else block inside the inner loop with the following:

if(myData$team[i]==homeTeam){

randomNumber = sample(1:100, 1)

if (randomNumber<=myData$xg[i]*100) {

countGoalsScoredByHomeTeam = countGoalsScoredByHomeTeam+1

}

}else if (myData$team[i]==awayTeam){

randomNumber = sample(1:100, 1)

if(randomNumber<=myData$xg[i]*100) {
countGoalsScoredByAwayTeam = countGoalsScoredByAwayTeam+1

}
}
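One thing worth knowing about the check above: because sample(1:100, 1) only ever returns whole numbers, an ExpG value is effectively rounded down to the nearest whole percent. The sketch below shows the size of that effect for a made-up shot of 0.034, and compares it with drawing a decimal number using runif. The runif version is an alternative I am suggesting for comparison, not what the tutorial script uses.

```r
shotXg = 0.034   # hypothetical low-quality shot

# Exact chance of a goal with the whole-number draw:
# only 1, 2 and 3 are at or below 3.4, so 3 of the 100 numbers qualify
chanceWithSample = mean(1:100 <= shotXg * 100)
print(chanceWithSample)

# Drawing a decimal between 0 and 1 keeps the full precision of the xg value
set.seed(2)
draws = runif(200000)
chanceWithRunif = mean(draws < shotXg)
print(chanceWithRunif)
```

For a single match the difference is small, but it always pushes the simulated goals slightly downwards, so it may be worth swapping in runif(1) < myData$xg[i] if your ExpG values carry more than two decimal places.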

Your IDE should now look like so:

[Screenshot: RStudio with the updated for loops]

Again clear and re-run the code, then check the two counters in the environment panel. At the moment this gives us the total number of goals scored by each team across all of the simulations, so homeTeam scored x goals over 10 simulated games. But we need to capture that number at the end of each match simulation so we can work out if that game was won or lost. This is where the outputHomeTeam and outputAwayTeam vectors we initialised earlier come into play. So immediately before the closing brace of our numberOfTimesToRunSim for loop, type the following:

#save our results after testing each game
outputHomeTeam = c(outputHomeTeam, countGoalsScoredByHomeTeam)
outputAwayTeam = c(outputAwayTeam, countGoalsScoredByAwayTeam)

#set our counters back to zero because we want to use them again for the next match test
countGoalsScoredByHomeTeam = 0
countGoalsScoredByAwayTeam = 0
randomNumber = NULL

In our first two lines we are appending to the vectors the number of goals scored in each simulated match. So after the first simulation the first index in each vector will hold how many goals were scored, after the second simulation the second index will hold the number of goals scored, and so on up to numberOfTimesToRunSim, which in our initial case is set to 10. After each simulated game we set our counters back to zero. Our code should look like the following:

[Screenshot: the script with both loops in place]
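Pieced together from the snippets so far, the simulation part of the script would look roughly like this. The three-row data frame is a made-up stand-in for myData.csv so the block runs on its own, and the seed is only there to make the run repeatable; neither is part of the original script.

```r
# Stand-in data with hypothetical values; the real script reads myData.csv instead
myData = data.frame(team = c("Southampton", "Liverpool", "Liverpool"),
                    xg = c(0.10, 0.35, 0.05),
                    stringsAsFactors = FALSE)
set.seed(42)

homeTeam = 'Southampton'
awayTeam = 'Liverpool'
outputHomeTeam = c()
outputAwayTeam = c()
numberOfTimesToRunSim = 10
randomNumber = NULL
countGoalsScoredByHomeTeam = 0
countGoalsScoredByAwayTeam = 0

for (x in 1:numberOfTimesToRunSim) {
  # simulate every shot in the match once
  for (i in 1:length(myData$team)) {
    if (myData$team[i] == homeTeam) {
      randomNumber = sample(1:100, 1)
      if (randomNumber <= myData$xg[i] * 100) {
        countGoalsScoredByHomeTeam = countGoalsScoredByHomeTeam + 1
      }
    } else if (myData$team[i] == awayTeam) {
      randomNumber = sample(1:100, 1)
      if (randomNumber <= myData$xg[i] * 100) {
        countGoalsScoredByAwayTeam = countGoalsScoredByAwayTeam + 1
      }
    }
  }

  # save our results after testing each game
  outputHomeTeam = c(outputHomeTeam, countGoalsScoredByHomeTeam)
  outputAwayTeam = c(outputAwayTeam, countGoalsScoredByAwayTeam)

  # set our counters back to zero for the next match test
  countGoalsScoredByHomeTeam = 0
  countGoalsScoredByAwayTeam = 0
  randomNumber = NULL
} # end of numberOfTimesToRunSim loop

print(outputHomeTeam)
print(outputAwayTeam)
```

With the stand-in data the home side has one shot and the away side two, so the home score can only ever be 0 or 1 and the away score 0, 1 or 2.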

Again, clear the environment and re-run the code to test it. Now print(outputHomeTeam) and print(outputAwayTeam) to the console and you can see for each simulated game that the score of each team is saved in the corresponding vectors. We can write a quick loop to test this and paste it straight into the console:
for (s in 1:length(outputHomeTeam)){
cat("Home score is ", outputHomeTeam[s], " and away team score is ", outputAwayTeam[s], "\n")
}

cat just concatenates our strings and variables together, and \n moves the output on to a new line in the console. As a side note, if you need to clear the console in RStudio, type this into it:
cat("\014")
If you take a look in the environment panel at outputHomeTeam and outputAwayTeam you can see the score of each game, and it should match what we printed to the console directly above. So now that we have our game scores, it’s just a matter of looping through the scores and checking which team scored more than the other; if one did, they won that particular game, and we simply record the result. We then use some basic math on the outputs to calculate the percentages. So input the following code after (and outside) our numberOfTimesToRunSim for loop.

# check to see if the game was won i.e. if more goals scored by either team
homeTeamGamesWon = 0
awayTeamGamesWon = 0

for (y in 1:length(outputHomeTeam)){

if(outputHomeTeam[y]>outputAwayTeam[y]){

homeTeamGamesWon = homeTeamGamesWon+1

}else if (outputHomeTeam[y]<outputAwayTeam[y]){

awayTeamGamesWon = awayTeamGamesWon+1
}

}

percGamesWonByHomeTeam = homeTeamGamesWon*1.0/numberOfTimesToRunSim
percGamesWonByAwayTeam = awayTeamGamesWon*1.0/numberOfTimesToRunSim
percGamesDrawn = 1.0-(percGamesWonByHomeTeam+percGamesWonByAwayTeam)

# objects to use in our barplot
matchTeams = c(homeTeam, awayTeam)
matchOutcome = c(percGamesWonByHomeTeam, percGamesWonByAwayTeam)
print(matchOutcome)

Here we just loop through the vectors we saved earlier and check to see if the number of goals scored by one team was greater than the other team, if so record a win for that team. We then calculate the percentages and save them to vectors we will use in our barplot. All going well the percentages should print to the console. To make sure everything is running correctly take a look at your outputHomeTeam and outputAwayTeam vectors in the environment panel. You can do a quick calculation from that to see how many games each team won and see if it matches to the output in the console. If so, you’re done!
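As a cross-check on the loop, the same shares can be worked out in a couple of lines, because R can compare two whole vectors at once. This is just a sketch with five made-up scores, and the share... variable names are invented for the example.

```r
# Hypothetical scores from five simulated games
outputHomeTeam = c(2, 0, 1, 1, 3)
outputAwayTeam = c(1, 0, 2, 1, 0)

# Comparing two vectors gives TRUE/FALSE for each game;
# mean() then turns those into the share of games that were TRUE
shareHomeWins = mean(outputHomeTeam > outputAwayTeam)
shareAwayWins = mean(outputHomeTeam < outputAwayTeam)
shareDraws = mean(outputHomeTeam == outputAwayTeam)

print(c(home = shareHomeWins, away = shareAwayWins, draw = shareDraws))
```

Here the home side wins games 1 and 5, the away side wins game 3, and games 2 and 4 are draws, so the three shares come out as 0.4, 0.2 and 0.4.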

Now we just need to display our findings in a bar plot and save the image to our hard drive. There are several ways to do this. You could use the ggplot2 package, which is a fantastic tool, but for the time being I’m just going to keep things simple and use R’s built-in function “barplot”. Go ahead and type the following after the above code.

# build our bar plot
yMax = max(matchOutcome)+.1

barPlotTitle = 'ExpG Premier League 14/15'

bp = barplot(matchOutcome, ylim = c(0.0, yMax), col="darkgreen", main=barPlotTitle, ylab="Win Probability", xlab="Teams",
names.arg=matchTeams, cex.names=0.6)
text(bp, 0, round(matchOutcome, 2),cex=1,pos=3)

# save our bar plot as a PNG image to the hard drive
dev.copy(png, filename="test.png")
dev.off()

The first line is related to a display issue: it just gets the maximum value from the matchOutcome vector and adds .1 to it, which makes the y axis on the bar plot extend slightly (.1) higher than the highest bar. You can go ahead and move barPlotTitle to the top of the code where we declared our first set of variables, just so they are all in the one place when we want to change them. The next line calls the barplot function and passes some parameters to it: mainly our win probabilities, the range of our y axis, the colour of our bars, the plot title and various other labels and text sizes. Play around with them later to test what each does. Lastly we save the image of our barplot to our working directory.

Now for the last time, clear your environment panel and change numberOfTimesToRunSim to 20,000 (typed as 20000, without the comma), or whatever you prefer. If you have a slow machine, 20,000 could take more than a minute to run, although commenting out any print statements will speed up the process. Now go and find your barplot test.png and post it to Twitter!

Also, if you have ExpG data from any other game, then the only things you will need to change are the path to your csv file, the references to the column headers in that file, and the 2 teams you want to test in that match. You can even refactor the code further to minimise the variables you need to change. Reusable code!
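Another way to save the plot is to open the PNG file first, draw into it, and then close it. This avoids copying from the plot window, so it should also work when the script is run without RStudio open. The sketch below uses made-up probabilities, adds a third bar for draws (the script already calculates percGamesDrawn but never plots it), and writes to a temporary folder; all three of those choices are my own assumptions for the example.

```r
# Hypothetical probabilities standing in for the simulation results
matchTeams = c("Southampton", "Draw", "Liverpool")
matchOutcome = c(0.25, 0.30, 0.45)

# Swap this for "test.png" to save into the working directory instead
outFile = file.path(tempdir(), "test.png")

# Open the PNG file, draw the bar plot into it, then close the file
png(filename = outFile, width = 800, height = 600)
bp = barplot(matchOutcome, ylim = c(0, max(matchOutcome) + 0.1), col = "darkgreen",
             main = "ExpG Premier League 14/15", ylab = "Probability",
             names.arg = matchTeams)
text(bp, matchOutcome, round(matchOutcome, 2), pos = 3)   # label just above each bar
dev.off()

print(file.exists(outFile))
```

Nothing appears in the plot window with this approach, as everything drawn between png() and dev.off() goes straight into the file.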

You can also see now how easy it would be to add more games to our csv file. With another for loop and a couple more variables we could run a season for a particular team and display it. We could also run multiple seasons for a team, calculate the simulated points for each simulated season, and display them on a histogram. There are numerous possibilities. Now go ahead and have fun. I’ll post the entire code to bitbucket later on. Hopefully this will get some people started with R.
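As a rough idea of how the matchID column could come into play, here is one possible sketch that loops over each match and works out a win share for one team. Everything in it is an assumption for illustration: the data is made up (including the second match), the names seasonData, teamOfInterest and winShareByMatch are invented, and it draws all the random numbers for a match in one go with runif instead of using the nested loops from the tutorial.

```r
set.seed(7)

# Hypothetical shots from two matches
seasonData = data.frame(
  team = c("Southampton", "Liverpool", "Liverpool", "Liverpool", "Burnley", "Liverpool"),
  xg = c(0.10, 0.35, 0.05, 0.40, 0.15, 0.20),
  matchID = c(1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE)

teamOfInterest = "Liverpool"
numberOfTimesToRunSim = 1000
matchIDs = unique(seasonData$matchID)
winShareByMatch = c()

for (id in matchIDs) {
  # keep only the shots from this match
  shots = seasonData[seasonData$matchID == id, ]
  isOurShot = shots$team == teamOfInterest

  # one row per simulated game, one column per shot
  draws = matrix(runif(numberOfTimesToRunSim * nrow(shots)),
                 nrow = numberOfTimesToRunSim)
  xgPerShot = matrix(shots$xg, nrow = numberOfTimesToRunSim,
                     ncol = nrow(shots), byrow = TRUE)
  goals = draws < xgPerShot   # TRUE marks a simulated goal

  # add up the goals for each side in every simulated game
  ourGoals = rowSums(goals[, isOurShot, drop = FALSE])
  theirGoals = rowSums(goals[, !isOurShot, drop = FALSE])

  winShareByMatch = c(winShareByMatch, mean(ourGoals > theirGoals))
}

names(winShareByMatch) = matchIDs
print(winShareByMatch)
```

Because the random numbers are drawn a whole match at a time, this style also tends to run much faster than the nested loops once the number of simulations gets large.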

You can download the source code from bitbucket here.


One thought on “Using R for Football Data Analysis – Monte Carlo”

  1. joelk

    This is a really great tutorial. You’ve given a sample dataset with expected goals – do you know anywhere online with a few more datasets like that? Obviously I could use available shot data and create my own ExpG numbers, but to practice this type of R code in the short term it would be easier to have someone else’s ExpG data!
