Using R for Football Data Analysis – Monte Carlo

OK, so I’m going to try my hand at a tutorial. We’re going to use R to run a Monte Carlo simulation on the expected goal (ExpG) values of the shots in the Southampton v Liverpool game (23/02/2015), and calculate the win probability of an average team given those chances. We will then build a bar plot to display that information, and lastly save a PNG image of said bar plot to our hard drive. All contained in one handy script: write the code once, and never have to write it again (you hope!).

But first I just want to get a few things out of the way. I’m not an R expert. I’ve been programming for a few years now, mostly just as a hobby, and mainly in iOS and Objective-C. When I got interested in football data, that iOS grounding gave me a solid base to explore more data orientated languages such as SQL and R. Just recently I even exchanged a career in Taxation for one in iOS development!

Anyway, I’d expect this tutorial might help people who want to get started in R and it would mostly suit someone who at least knows the difference between a variable and a for loop. I’m going to keep the tutorial as simple as possible, so the code might be dumbed down a little. That’s not to be condescending, just speaking from experience in that any R tutorials I read seem to get quite technical quite quickly. Also, in my opinion R doesn’t read well, it’s not a language that’s very verbose, which I think makes it look ugly at times, and that non-verbosity also makes it look far more complicated than it is. But, it’s as powerful as it is ugly when it comes to statistical analysis. As a result I’ve structured my code here in a way that I think reads a little easier. Hell, I’m not even going to write a custom function.

Things we need:

1. An understanding of basic programming concepts and what a Monte Carlo simulation is. If you need to brush up on programming, Codecademy would be a good place to start. I’d probably start with the JavaScript course there. You’ll also want at least some limited knowledge of R: what packages are, etc.

2. R installed on your machine. R download here.

3. Our IDE of choice: RStudio – you can download it here.

4. A .csv file with expected goal numbers. You can download my sample here.

5. Some knowledge of what expected goals is. Great explanation by @footballfactman here.

So with everything installed and RStudio opened up, let’s get started. In RStudio, the bottom left panel is our console and it’s where we will see all of our outputs. Ignore the rest, it will become more familiar as we go along. Before we do anything we need to find our working R directory. Trust me on this, I lost a lot of time here.
fullScreenImage3

Click into the console, type getwd() and press return. This will give us the path to our working directory in R. This is where we want to save our script files for the moment, and it will save us having to find the directory path to any data files (csv) when we read them in. Next click File, New File, R Script, and a new file will open in the text editor window. Now click File again and click Save As, navigate in the dialog box to the working directory we found above, and name our file singleMatchMonteCarlo. The .R file name extension will be added automatically.

Normally I wouldn’t use csv files, as I would connect to a SQL database and run queries directly from R as a way to read in specific data. If you can, I’d highly recommend getting started with SQL, or even SQLite as a starter. There is a great and easy to understand primer here on SQL. For football data analysis, R is much more powerful this way.

Next move the sample csv file you downloaded above (myData.csv) to the same working directory as your singleMatchMonteCarlo.R file. Both files should now be in the same directory. If this sounds cumbersome, then don’t worry, we won’t have to do this every time, as you can link to any file on your hard drive once you have that file’s directory path. That’s for another day though.
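If you want to double check that the csv really is where R expects it, you can ask R directly. This is just a small sketch; the only thing it assumes is the file name myData.csv used in this tutorial, and the variable names are my own.

```r
# Show the current working directory
workingDir = getwd()
print(workingDir)

# TRUE if myData.csv sits in the working directory, FALSE otherwise
csvIsInPlace = file.exists(file.path(workingDir, "myData.csv"))
print(csvIsInPlace)
```

If the second line prints FALSE, the file is in a different folder from the one getwd() reported.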

Lastly, just a small style point. The conventional way to assign variables/objects in R is with <- but you can also use the standard way with =. I’ll be using the equals sign, simply because I just don’t see the point in an extra keystroke. Programming is enough bloody typing as it is.
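As a quick illustration, here is a tiny sketch showing that the arrow operator <- and the equals sign do the same job for simple top-level assignments like the ones in this tutorial. The variable names are made up for the example.

```r
# Both lines create a variable holding the value 10
simsWithArrow <- 10
simsWithEquals = 10

# identical() confirms the two variables hold the same value
bothMatch = identical(simsWithArrow, simsWithEquals)
print(bothMatch)
```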

So we have 3 columns in our csv file. team, xg and matchID. Ignore matchID for the purposes of this tutorial. In a later tutorial, (if there is one!) this will become relevant when we want to loop through a number of different games. Before I go further, if you are new to programming or just starting off in R I’d advise typing the code rather than copy and pasting. Repetition will help. Excuse the cheese, but programming is like playing a musical instrument, if you don’t practice, you won’t be good at it. First off we need to read the data in our csv file into R, one way of doing this is like so:

myData = read.csv(file="myData.csv", header=TRUE, sep=",")

So type the above into the text editor in R. Here we are using R’s built-in function read.csv, and between the parentheses we are passing in 3 parameters. Our file to read in is our first parameter. Note that if we had saved our myData.csv file to any folder other than our working R directory we would have to give the full path to that file, and it would probably look something like this:

myData = read.csv(file="/Users/Documents/_Stats/myData.csv", header=TRUE, sep=",")
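If you don’t have the sample file to hand, the following sketch builds a tiny csv with the same three column names and reads it back in the same way. The shot values are made up for illustration and are not the real match numbers, and the file is written to a temporary location so nothing in your working directory is touched.

```r
# Hypothetical shot data, NOT the real match numbers
sampleShots = data.frame(
  team = c("Southampton", "Liverpool", "Liverpool"),
  xg = c(0.05, 0.31, 0.08),
  matchID = c(1, 1, 1)
)

# Write the data to a temporary csv file
tempCsvPath = tempfile(fileext = ".csv")
write.csv(sampleShots, file = tempCsvPath, row.names = FALSE)

# Read it back the same way the tutorial reads myData.csv
readBack = read.csv(file = tempCsvPath, header = TRUE, sep = ",")
print(readBack)
```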

You can if you wish change your default working R directory within R Studio itself by going to the Set Working Directory menu under Session. Play around with it, if you’re feeling brave!

Screen Shot 2015-02-23 at 23.35.58

Go ahead and run the code above that you typed in the editor. You can do this by selecting/highlighting the code in the editor itself and hitting the RUN button in the top right corner of the editor. Ctrl+a will do the trick to highlight everything in the editor. Alternatively you can copy the line and paste it directly into the console below and press return. If everything worked correctly you should receive no errors in the console and in the environment panel you will see the object myData initialised and ready for use.

Screen Shot 2015-02-23 at 23.50.54

You can expand and view the data directly by double clicking the small spreadsheet symbol in the environment panel, which should open up the myData object within the code editor. See image above.

Go ahead now and type the following into the console and press return.

print(myData)

The same data should now print to the console. Printing to the console is a great tool to use when your programming becomes a little more complex, as it can help to see if variables and objects have been set or initialised at arbitrary points within the control flow of your code.
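Alongside print, a few other built-in functions are handy for a quick look at what you have read in. The sketch below uses a small made-up data frame as a stand-in for myData so that it runs on its own; swap in myData to inspect the real file.

```r
# Hypothetical stand-in for myData so this runs on its own
myDataExample = data.frame(
  team = c("Southampton", "Liverpool", "Southampton", "Liverpool"),
  xg = c(0.05, 0.31, 0.12, 0.08),
  matchID = c(1, 1, 1, 1)
)

# Column names and the type of data in each column
str(myDataExample)

# Just the first two rows
print(head(myDataExample, 2))

# Number of shots taken by each team
shotsPerTeam = table(myDataExample$team)
print(shotsPerTeam)

# Total ExpG for each team
xgPerTeam = tapply(myDataExample$xg, myDataExample$team, sum)
print(xgPerTeam)
```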

Now that we’ve read in our data we can talk about what we need to do. Looking at how the data is constructed usually determines what logic we need to write to extract that data and use it the way we want to use it. We want to use a Monte Carlo simulation, which basically means repeating a random experiment many times to estimate how likely an event is to happen. The event in our case is whether a goal is scored or not based on our ExpG numbers, where ExpG is just the probability of that particular shot being scored based on a number of factors such as location of shot, shot type, pass type etc. So if we pick a random number between 1 and 100 and that number is no higher than our ExpG number expressed as a percentage (ExpG multiplied by 100), then we register that a goal has been scored.
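To see why this works, here is a minimal sketch for a single made-up shot with an ExpG of 0.30. Note that it uses runif (a random number between 0 and 1) rather than the 1 to 100 approach used in the rest of this tutorial; the idea is the same. Over a large number of trials the share of simulated goals should settle close to the ExpG value.

```r
set.seed(2015)              # fixed seed so the run is repeatable

shotXg = 0.30               # hypothetical ExpG for one shot
numberOfTrials = 100000

# runif() draws numbers between 0 and 1; a draw below the ExpG counts as a goal
goalScored = runif(numberOfTrials) < shotXg

# Share of trials in which the shot was scored
simulatedConversion = mean(goalScored)
print(simulatedConversion)  # should land close to 0.30
```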

There are a number of ways to do this. You only have to spend 10 minutes on Stack Overflow to realise that every question has multiple correct answers, some more efficient than others: there is more than one way to skin a cat. We will use arrays (or vectors, as they are called in R), the data table we read in above, and a couple of for loops to loop through the table. We use for loops when we know how many times we want to iterate through something like a vector. Vectors/arrays are just numbered lists. So we loop through our data, check who took the shot (Southampton or Liverpool), get a random number, check to see if the random number is no higher than the scaled ExpG, and if so record that a goal was scored. So let’s declare and initialise some variables and vectors we’re going to use.

homeTeam = 'Southampton'
awayTeam = 'Liverpool'
outputHomeTeam = c()
outputAwayTeam = c()
numberOfTimesToRunSim = 10
randomNumber = NULL
countGoalsScoredByHomeTeam = 0
countGoalsScoredByAwayTeam = 0

First, our 2 string variables homeTeam and awayTeam will hold a reference to Southampton and Liverpool respectively. While we loop through the ExpG numbers we will need a way to check who took each individual shot. We’ll use these in an if else statement from within the for loop to check. Hit the clear button in the environment panel and re-run the code again. Play around in the console now by typing some of the variables into it and pressing return. Your IDE should now look like this:

fullScreenImage1

Now we wouldn’t normally run our sim only 10 times, but leave it as it is for the time being, as we just want to be sure our outputs are what they should be later on. So let’s start with the outer loop, which runs once per simulated match. x is our counter and numberOfTimesToRunSim is our end condition, which is set to 10. Add the following code below the variables we set above.

for (x in 1:numberOfTimesToRunSim) {
print(x)
}

Apologies for how the code is displayed, WordPress doesn’t seem to have an option to indent the code properly. I’ll re-look at this later on to see if I can format it properly. Non-indented code is horrible to look at! Again go ahead and clear the environment and re-run all of the code (remember: click into the code editor and hit ctrl+a, this is our shortcut to highlight all the code). Numbers 1 – 10 inclusive should print to the console. Next we need another for loop within our above loop to iterate over our ExpG numbers.


for (x in 1:numberOfTimesToRunSim) {
print(x)
for (i in 1:length(myData$team)) {
print(i)
}
}

Here we are just saying iterate through the inner for loop once for every row under team in our myData object. Type length(myData$team) into the console and it should return 19, which is the number of shots taken by both teams combined. Again clear your environment, highlight all the code in the editor and hit run. You should see a count to 19, ten times. It’s tiresome, but running and printing to the console will help you understand the machinations of what is being executed. Next we need to check which team took each shot with a conditional if else statement. Update your for loop to match the following:


for (x in 1:numberOfTimesToRunSim) {
for (i in 1:length(myData$team)) {
if(myData$team[i]==homeTeam){
print("The home team Southampton took the shot")
}else if (myData$team[i]==awayTeam){
print("The away team Liverpool took the shot")
}
}
} # end of numberOfTimesToRunSimLoop

Again clear and re-run, and peruse the console to see what is going on. Now that we know which team took each shot we can get the corresponding ExpG value on each row. But first we need a random whole number between 1 and 100, and we need to scale the ExpG value (which sits between 0.0 and 1.0) up by 100 so the two can be compared. Then we check if our random number is less than or equal to the scaled ExpG number for each shot with another if statement; if so, a simulated goal was scored and we add 1 to our countGoalsScoredByHomeTeam or countGoalsScoredByAwayTeam variable. Of course I don’t have to have an else if condition, as a plain “else” would catch every other case, but for legibility this is how I’ll continue. So remove the print statements and update the body of the inner for loop with the following:

if(myData$team[i]==homeTeam){
  randomNumber = sample(1:100, 1)
  if (randomNumber<=myData$xg[i]*100) {
    countGoalsScoredByHomeTeam = countGoalsScoredByHomeTeam+1
  }
}else if (myData$team[i]==awayTeam){
  randomNumber = sample(1:100, 1)
  if(randomNumber<=myData$xg[i]*100) {
    countGoalsScoredByAwayTeam = countGoalsScoredByAwayTeam+1
  }
}
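For comparison, the same check can be done for every shot at once without an inner loop. This is a sketch of an alternative, not the approach used in the rest of this tutorial: it uses runif (a random number between 0 and 1) instead of sample(1:100, 1), and the shot values are made up for illustration.

```r
set.seed(1)  # fixed seed so the example is repeatable

# Hypothetical shots, not the real match data
exampleShots = data.frame(
  team = c("Southampton", "Liverpool", "Southampton", "Liverpool", "Liverpool"),
  xg = c(0.05, 0.31, 0.12, 0.08, 0.45)
)
homeTeam = "Southampton"
awayTeam = "Liverpool"

# One random number between 0 and 1 per shot, compared against that shot's ExpG
isGoal = runif(nrow(exampleShots)) < exampleShots$xg

# Count the simulated goals for each team in this one simulated match
homeGoals = sum(isGoal[exampleShots$team == homeTeam])
awayGoals = sum(isGoal[exampleShots$team == awayTeam])
cat("Simulated score:", homeGoals, "-", awayGoals, "\n")
```

One small advantage of comparing against a number between 0 and 1 is that an ExpG such as 0.126 is used as it is, rather than being compared against whole numbers out of 100.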

Your IDE should now look like so:

fullScreenImage2
Again clear and re-run the code. As it stands, this gives us the number of goals scored in total by each team across all of the simulations, so homeTeam scored x goals over 10 simulated games. But we need to capture that number at the end of each match simulation so we can work out if that game was won or lost. This is where the outputHomeTeam and outputAwayTeam vectors we initialised earlier come into play. So immediately before the closing brace of our numberOfTimesToRunSim for loop, type the following:

#save our results after testing each game
outputHomeTeam = c(outputHomeTeam, countGoalsScoredByHomeTeam)
outputAwayTeam = c(outputAwayTeam, countGoalsScoredByAwayTeam)

#set our counters back to zero because we want to use them again for the next match test
countGoalsScoredByHomeTeam = 0
countGoalsScoredByAwayTeam = 0
randomNumber = NULL

In our first two lines we are updating the vectors with how many goals were scored in each simulated match. So after the first simulation the first index in each vector will hold how many goals were scored, after the second simulation the second index will hold the number of goals scored, and so on up to numberOfTimesToRunSim, which in our initial case is set to 10. After each simulated game we set our counters back to zero. Our code should look like the following:

codeAfterLoops
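Putting the pieces so far together, the script might look something like the sketch below. To keep it runnable on its own I have swapped the read.csv line for a small made-up data frame, so the shot values here are hypothetical and not the real match numbers.

```r
# Hypothetical shots; in the tutorial these come from myData.csv
myData = data.frame(
  team = c("Southampton", "Liverpool", "Southampton", "Liverpool", "Liverpool"),
  xg = c(0.05, 0.31, 0.12, 0.08, 0.45)
)

homeTeam = 'Southampton'
awayTeam = 'Liverpool'
outputHomeTeam = c()
outputAwayTeam = c()
numberOfTimesToRunSim = 10
randomNumber = NULL
countGoalsScoredByHomeTeam = 0
countGoalsScoredByAwayTeam = 0

for (x in 1:numberOfTimesToRunSim) {
  for (i in 1:length(myData$team)) {
    if (myData$team[i] == homeTeam) {
      randomNumber = sample(1:100, 1)
      if (randomNumber <= myData$xg[i] * 100) {
        countGoalsScoredByHomeTeam = countGoalsScoredByHomeTeam + 1
      }
    } else if (myData$team[i] == awayTeam) {
      randomNumber = sample(1:100, 1)
      if (randomNumber <= myData$xg[i] * 100) {
        countGoalsScoredByAwayTeam = countGoalsScoredByAwayTeam + 1
      }
    }
  }

  # save our results after testing each game
  outputHomeTeam = c(outputHomeTeam, countGoalsScoredByHomeTeam)
  outputAwayTeam = c(outputAwayTeam, countGoalsScoredByAwayTeam)

  # set our counters back to zero for the next match test
  countGoalsScoredByHomeTeam = 0
  countGoalsScoredByAwayTeam = 0
  randomNumber = NULL
} # end of numberOfTimesToRunSim loop

print(outputHomeTeam)
print(outputAwayTeam)
```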

Again, clear the environment and re-run the code to test it. Now print(outputHomeTeam) and print(outputAwayTeam) to the console and you can see for each simulated game that the score of each team is saved in the corresponding vectors. We can write a quick loop to test this and paste it straight into the console:
for (s in 1:length(outputHomeTeam)){
cat("Home score is ", outputHomeTeam[s], " and away team score is ", outputAwayTeam[s], "\n")
}

cat just concatenates our strings and variables together, and \n moves the output on to a new line in the console. As a side note, if you need to clear the console, type this into it:
cat("\014")
If you take a look in the environment panel at outputHomeTeam and outputAwayTeam you can see the score of each game, and it should match what we printed to the console directly above. So now that we have our game scores it’s just a matter of looping through the scores and checking which team scored more than the other. If one did, they won that particular game, and we simply record the result. We then use some basic math on the outputs to calculate the percentages. So input the following code after (and outside) our numberOfTimesToRunSim for loop.

# check to see if the game was won i.e. if more goals scored by either team
homeTeamGamesWon = 0
awayTeamGamesWon = 0

for (y in 1:length(outputHomeTeam)){
  if(outputHomeTeam[y]>outputAwayTeam[y]){
    homeTeamGamesWon = homeTeamGamesWon+1
  }else if (outputHomeTeam[y]<outputAwayTeam[y]){
    awayTeamGamesWon = awayTeamGamesWon+1
  }
}

percGamesWonByHomeTeam = homeTeamGamesWon*1.0/numberOfTimesToRunSim
percGamesWonByAwayTeam = awayTeamGamesWon*1.0/numberOfTimesToRunSim
percGamesDrawn = 1.0-(percGamesWonByHomeTeam+percGamesWonByAwayTeam)

# objects to use in our barplot
matchTeams = c(homeTeam, awayTeam)
matchOutcome = c(percGamesWonByHomeTeam, percGamesWonByAwayTeam)
print(matchOutcome)

Here we just loop through the vectors we saved earlier and check to see if the number of goals scored by one team was greater than the other team, if so record a win for that team. We then calculate the percentages and save them to vectors we will use in our barplot. All going well the percentages should print to the console. To make sure everything is running correctly take a look at your outputHomeTeam and outputAwayTeam vectors in the environment panel. You can do a quick calculation from that to see how many games each team won and see if it matches to the output in the console. If so, you’re done!
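As a way of double checking the loop, the same three proportions can be worked out by comparing the two score vectors directly. This sketch uses five made-up simulated scores and my own variable names, so it won’t overwrite anything in your script.

```r
# Hypothetical scores from five simulated games
exampleHomeScores = c(0, 1, 2, 0, 1)
exampleAwayScores = c(1, 1, 0, 2, 3)

# Comparing two vectors gives TRUE/FALSE per game; mean() turns that into a proportion
exampleHomeWinShare = mean(exampleHomeScores > exampleAwayScores)
exampleAwayWinShare = mean(exampleHomeScores < exampleAwayScores)
exampleDrawShare = mean(exampleHomeScores == exampleAwayScores)

print(c(exampleHomeWinShare, exampleAwayWinShare, exampleDrawShare))
```

Here the home team wins 1 of the 5 games (0.2), the away team wins 3 (0.6) and 1 is drawn (0.2).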

Now we just need to display our findings in a bar plot and save the image to our hard drive. There are several ways to do this. You could use the ggplot2 package, which is a fantastic tool, but for the time being I’m just going to keep things simple and use R’s built-in function “barplot”. Go ahead and type the following after the above code.

# build our bar plot
yMax = max(matchOutcome)+.1

barPlotTitle = 'ExpG Premier League 14/15'

bp = barplot(matchOutcome, ylim = c(0.0, yMax), col="darkgreen", main=barPlotTitle, ylab="Win Probability", xlab="Teams",
names.arg=matchTeams, cex.names=0.6)
text(bp, 0, round(matchOutcome, 2), cex=1, pos=3)

# save our bar plot as a PNG image to the hard drive
dev.copy(png, filename="test.png")
dev.off()

The first line is related to a display issue: it just gets the maximum value from the matchOutcome vector and adds .1 to it, which makes the y axis on the bar plot extend slightly (.1) higher than the highest bar. You can go ahead and move barPlotTitle to the top of the code where we declared our first set of variables, just so they are all in the one place when we want to change them. The next line just calls the barplot function and passes some parameters to it: mainly our win probabilities, the limits of our y axis, the colour of our bars, the plot title and various other labels and text sizes. Play around with them later to test what each does. Lastly we save the image of our barplot to our working directory.

Now for the last time, clear your environment panel and change numberOfTimesToRunSim to 20,000 (typed as 20000 in the code, with no comma), or whatever you prefer. If you have a slow machine 20,000 could take more than a minute to run, though commenting out any print statements will speed up the process. Now go and find your barplot test.png and post it to Twitter! Also, if you have ExpG data from any other game, then the only things you will need to change are the path to your csv file, the references to the column headers in that file, and the 2 teams you want to test in that match. You can even refactor the code further to minimise the variables you need to change. Reusable code!
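If you ever run the script somewhere without a plot window, another option is to open the png device first, draw the plot into it, and then close it. The sketch below assumes made-up win probabilities and saves to R’s temporary folder rather than the working directory; it also assumes your R installation can write PNG files, which most can.

```r
matchTeams = c("Southampton", "Liverpool")
matchOutcome = c(0.15, 0.60)    # hypothetical win probabilities

# Save to the temporary folder so nothing in the working directory is overwritten
outputPath = file.path(tempdir(), "test.png")

# Open the PNG file, draw into it, then close it to finish writing
png(filename = outputPath, width = 800, height = 600)
bp = barplot(matchOutcome, ylim = c(0, max(matchOutcome) + 0.1), col = "darkgreen",
             main = "ExpG Premier League 14/15", ylab = "Win Probability",
             xlab = "Teams", names.arg = matchTeams, cex.names = 0.6)
text(bp, 0, round(matchOutcome, 2), cex = 1, pos = 3)
invisible(dev.off())

print(file.exists(outputPath))
```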

You can also see now how easy it would be to add more games to our csv file. With another for loop and a couple more variables we could run a season for a particular team and display it. We could also run multiple seasons for a team, calculate the simulated points for each simulated season and display them on a histogram. There are numerous possibilities. Now go ahead and have fun. I’ll post the entire code to bitbucket later on. Hopefully this will get some people started with R.
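As a rough idea of how the matchID column could come into play, here is a sketch that loops over two made-up matches and estimates win probabilities for each. Everything here is an assumption for illustration: the shot values are invented, each matchID is assumed to contain exactly two teams, and runif is used in place of sample(1:100, 1).

```r
set.seed(42)  # fixed seed so the example is repeatable

# Hypothetical shots from two matches, NOT real data
allShots = data.frame(
  team = c("Southampton", "Liverpool", "Liverpool", "Liverpool", "Spurs", "Spurs"),
  xg = c(0.10, 0.30, 0.05, 0.20, 0.40, 0.15),
  matchID = c(1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)
numberOfTimesToRunSim = 1000
results = data.frame()

for (id in unique(allShots$matchID)) {
  matchShots = allShots[allShots$matchID == id, ]
  teams = unique(matchShots$team)   # assumes exactly two teams per matchID
  firstTeamWins = 0
  secondTeamWins = 0

  for (x in 1:numberOfTimesToRunSim) {
    # simulate every shot in this match once
    isGoal = runif(nrow(matchShots)) < matchShots$xg
    firstTeamGoals = sum(isGoal[matchShots$team == teams[1]])
    secondTeamGoals = sum(isGoal[matchShots$team == teams[2]])
    if (firstTeamGoals > secondTeamGoals) {
      firstTeamWins = firstTeamWins + 1
    } else if (firstTeamGoals < secondTeamGoals) {
      secondTeamWins = secondTeamWins + 1
    }
  }

  # one row of results per match
  results = rbind(results, data.frame(
    matchID = id,
    firstTeam = teams[1],
    secondTeam = teams[2],
    firstTeamWinProb = firstTeamWins / numberOfTimesToRunSim,
    secondTeamWinProb = secondTeamWins / numberOfTimesToRunSim
  ))
}

print(results)
```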

You can download the source code from bitbucket here.

Why Always Me – Mario Balotelli

Mario Balotelli’s apparent imminent signing at Liverpool looks like the most surprising transfer of the summer, especially after Brendan Rodgers explicitly denied any interest in the Italian during the summer tour of the United States. So why have Liverpool reneged on their initial public denial?

I have detailed figures going back to 10/11 so I can take a fairly comprehensive look at what he will bring to an already formidable Reds attack. So what is Balotelli’s style? Well, he won’t be a forward who will create a lot of chances for his teammates. The average key passes figure for a forward in my 10 season database is 1.4 per 90. Balotelli has hit somewhere between 0.7 and 1.6 per season in his time between City and Milan. He’ll probably create at around the rate an average forward would, so it’s not really a big tool in his attacking toolbox. His expected assists over the last 4 years have ranged between 0.10 and 0.13 per 90, with the average amongst strikers being 0.14.

Balotelli – Key Passes 13/14

Does he get involved in the attacking play? Will he make something happen in the final third? Well the average number of attempted passes per 90 into the final third for top strikers is around 14, Balotelli will give you circa 15 per 90. He likes to drift out to the right and come inside, but that’s pretty standard for a right-footed attacker to drift to that side and come in on his favourite shooting foot.

These average figures are in no way a reflection of the quality of the player, it’s just a stylistic guide, if you will. His dribbles per 90 are a little more interesting though. Whether it was tactical or not, he had a crazy first half year from the January transfer window at Milan, where he hit 6.2 attempted dribbles per 90. In his City days he was averaging around 3.3 dribbles per 90, then last season at Milan he hit 4.9 dribbles per 90. It would be interesting to know from somebody more clued up on Italian football whether this was more of a systemic issue or an individual/psychological one. Did he simply get more confidence?

Shooting

I’ve seen a lot of tweets on social media regarding Balotelli’s very poor shot conversion over a number of years. Shot conversion isn’t repeatable (expected goals is, somewhat) and thus what a player converts at in year N hasn’t much bearing on what a player might do in year N+1. Is 7% shot conversion bad? Well maybe, but in simple terms, think of this: 7% of 100 shots gives you a different goal return than 7% of 300 shots (7 goals versus 21). So is 7% conversion still bad? As always, it depends on the context.

I have expected goal numbers for Balotelli over his last 4 seasons, so I’ll divide that up into the time he spent at Man City and the time he spent at AC Milan. Looking at his numbers for City in the 12/13 season (before he was transferred to Milan in the January transfer window) seems a little pointless, as he played less than 600 minutes, and there are just too many sample size issues, not to mention strength of schedule bias.

Expected Goals EPL Per 90 10/11 & 11/12

It’s worth remembering Balotelli was 20-21 years of age in these two seasons in the Premiership. Ok so in his first season he played less than 30% of the available minutes, but scored 0.34 goals P90, expected goals per 90 was at 0.43, and shots per 90 at 3.8, these are all very good baseline numbers for a 20 year old, and kind of gave you the feeling something big was about to happen. And it did.

Balotelli Shots 11/12 – Larger=goals

Things really took off for Balotelli in the 11/12 season, which of course was Man City’s dramatic title-winning year. Again though, the problem here was he just didn’t get enough minutes. Having said that, these are some elite numbers for a striker. In the last 4 seasons in the Premier League, of the 486 players to play more than 900 minutes and take >30 shots, only 6 had an expected goals per 90 greater than 0.69. Both his goals per 90 and shots per 90 also went to an elite level in 11/12, which really was an indication that Balotelli’s career was on an upward curve. Onward to Milan.

Expected Goals Serie A 12/13 & 13/14

In 12/13 something happened to Balotelli’s shot volume. He started hitting 5.6 shots per 90. Over the last 2 seasons in the top 5 leagues only Suarez can better that number in a single season. But why had Balotelli suddenly become a shot monster? It’s difficult to figure out whether this was part of his natural progression as a striker or whether it was something more systemic that brought it out in him at Milan. Milan weren’t very good last year. It doesn’t look like it was brought about by position either, as I could only find 5 occasions in his Milan career where he started slightly wider of another striker. Incidentally, on those occasions the system used was a 4-1-2-1-2 diamond (Brendan Rodgers take note). Though this doesn’t take into account positional changes during matches, so it might be a touch misleading.

Balotelli maintained his expected goals per 90, but his expected goals per shot decreased dramatically from his time at City, plummeting from a high of 0.126 to just 0.08. On a per shot basis Balotelli had lower value chances, but he was able to maintain his expected goals per 90 by increasing his shot volume. He went from taking around 40% of his shots from prime at City to taking just 20% of his shots from prime at Milan. In that same year at Milan he took an incredible 75% of his shots from outside the box. Having done all of that, he still kept his goals per 90 at a very decent 0.47.
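The trade-off between shot volume and shot quality is easy to see with a quick calculation, since expected goals per 90 is just shots per 90 multiplied by expected goals per shot. The sketch below uses the figures quoted in this post, assuming the 0.08 per shot figure goes with the 5.6 shots per 90 from that 12/13 spell at Milan.

```r
shotsPer90 = 5.6        # Milan 12/13, from the text
xgPerShot = 0.08        # Milan, from the text

# Expected goals per 90 = shot volume x average shot quality
xgPer90 = shotsPer90 * xgPerShot
print(xgPer90)          # 0.448

# Shots per 90 needed for the same output at his best City shot quality (0.126)
shotsNeededAtCityQuality = xgPer90 / 0.126
print(round(shotsNeededAtCityQuality, 2))   # 3.56
```

In other words, at his best City shot quality he would have needed only around 3.6 shots per 90 to produce the same expected goals.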

Balotelli – Shots 12/13 (larger=goals)

Again there was a similar pattern last season. Expected goals was maintained above 0.4 per 90, not elite in itself, but a decent return for a striker, considering you’d expect your striker to outperform expected goals in probably 3 out of every 4 seasons. In the context of a full season, if he played 38 90s that would garner him 15 goals. His shots per 90 increased again in 13/14 to nearly 5.8, which is the highest of any player playing more than 900 minutes in the last 2 seasons in the top 5 leagues. And for the first time in 4 seasons Balotelli managed to play more than 2,000 minutes in a single season. Again he took a measly 20% of his shots from prime, and a massive 65% of his shots from outside the box. Except this time he outperformed his expected goals from outside the box due to scoring 4 goals from 41 free kicks. It’s unclear whether this was skill or luck, as the previous season saw just a 1 goal return from 34 free kicks.
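For anyone who wants to check the arithmetic, here is a short sketch using only the figures quoted above. The variable names are my own.

```r
# Full season projection: expected goals per 90 multiplied by 38 full games
xgPer90 = 0.4
fullSeasonOf90s = 38
expectedGoalsOverSeason = xgPer90 * fullSeasonOf90s
print(expectedGoalsOverSeason)                    # 15.2

# Free kick conversion in each of the two Milan seasons, as percentages
freeKickConversion1314 = 4 / 41 * 100
freeKickConversion1213 = 1 / 34 * 100
print(round(freeKickConversion1314, 1))           # 9.8
print(round(freeKickConversion1213, 1))           # 2.9

# Both seasons combined: 5 goals from 75 free kicks
freeKickConversionCombined = (4 + 1) / (41 + 34) * 100
print(round(freeKickConversionCombined, 1))       # 6.7
```

The gap between roughly 10% in one season and 3% in the other, on samples this small, is why it is hard to call it skill.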

Balotelli Shots – 13/14 (larger=goals)

I am always wary when I see a player score a number of goals from outside the box, so I tend to check their past record to see if they’ve previously shown any history of scoring regularly from outside. Balotelli’s done it just once in the last 4 seasons, which suggests to me he might have got a bit lucky with those long range efforts last season.

So in conclusion, what are the numbers telling us? He won’t create for his teammates at a high level. He will attempt a lot of dribbles and try to make something happen himself, and while he won’t get involved in the build up play to the extent of a striker like Suarez, he will get involved. He’s become a shot monster over the last 2 seasons, and my instinct tells me this is just a natural progression for him rather than a systemic one brought about by Milan’s tactics or deficiencies. Systemic or not though, it’s a worrying trend that only 20% of his shots came from dangerous areas and that on average at Milan 70% of his shots came from outside the box. That’s not where you want your strikers taking shots from. Lastly on the negative side, for whatever reason, he’s played less than 50% of the minutes available to him over the last 4 seasons. This is a big worry.

On the plus side, and I feel this is a major plus, he’s regularly managed greater than 0.4 expected goals per 90 in each season. In my database I could only find one other player who managed that, and it was Van Persie. Neither Suarez nor Sturridge could. A caveat applies to Balotelli’s lack of minutes in some of those seasons though. Apart from 10/11 at City, he’s also managed greater than 0.4 goals per 90 in each of his 3 other seasons. So his output is there, and this is really promising.

Weaknesses: reliability and consistency in getting minutes on the pitch. Too many shots from low value areas.

Strengths: Dribbles, shot volume (but needs to be proportioned better), consistent in expected goals and goals per 90.

Verdict: There’s a very, very good player in there. The question is can Brendan Rodgers and Liverpool bring it out of him at a consistent level. Personally, he’s never really impressed me when I’ve seen him play, I always thought, hmm “much ado about nothing”. Maybe I watched the wrong games though. But at 24 years of age, and at a good price the risk to reward ratio is very positive. If I was asked for one word to describe his career to date? Erratic. And therein lies the crux of the matter.

Liverpool Season Preview 2014/2015

Before we go into the expected line up, signings and outgoings let’s deal with the giant buck-toothed elephant in the room. The big questions that have the experts, I feel, under-valuing Liverpool for the forthcoming season, are:

1. Can they cope with the loss of Luis Suarez, and the goals, assists and key passes that go with him?

2. The added fixtures that come with being back in the elite of the Champions League. (We won’t really know the answers to this question until the competition gets underway)

First off, it’s been done to death. There is no replacing Luis Suarez, at least not directly. But there are other ways you can crawl those missed 31 goals back.

Defence

For example, in defence. Simply, you can start off by conceding fewer shots. 7 teams conceded more shots than Liverpool last season, but 4 of those teams you could consider direct rivals this season: United, Chelsea, City and Spurs. However, when you consider shot location, type of shot etc., the expected (non pen, non own goal) goals Liverpool should have conceded was 38, and they conceded 42. So a slight under-performance. Incidentally, Chelsea fared best with 31 XG, with City next at 32.

If you look a little deeper and look at shots conceded in the danger zone, then Liverpool conceded only 26 more shots than Chelsea in that zone, a difference of about 4.6 expected goals. They also conceded about 20 more shots from both left and right wide in the box, which has a conversion rate of c.4%, and lastly around 40 more shots than Chelsea and City from outside the box, shots which also have a very low probability of scoring. So all in all, Liverpool conceded a lot more shots than their rivals, but those shots tended to be very low value shots. In fact, only Chelsea conceded more shots from OPTA’s big chances metric last season than Liverpool. If you take a quick glance at the graphic below, it’s clear there is a lack of red (very high value) shots conceded.

LFC Shots Conceded 13/14 – Home on the left. Away right.

I’m not saying this was a good defensive performance by any means. In fact, it’s a worry, volume-wise, to concede so many shots, as it says to me that tactically Liverpool aren’t set up correctly when they lose the ball. Liverpool players blocked 132 shots last season, conceded 4 own goals and 4 penalties, and conceded a lot of shots from the zone outside the box. They were too loose in midfield, and when opposition attackers did get beyond the midfield zone, defenders were forced into mistakes.

You only have to look at the ball error numbers to see that. Only Spurs had more errors that led to a goal last season, and no team conceded more shots from errors last season than Liverpool. But what about opportunity? Liverpool had lots of possession so you’d expect them to have more errors. Well, Chelsea, City and Arsenal had a lot of possession too, but didn’t incur anywhere near this number of errors. And if you look at it in ‘touches per goal error’, Liverpool made a goal error every 2,611 touches. Only Spurs and Norwich made an error more often in terms of touches. The thing is, I’m not too sure whether those errors were a direct result of players just being sloppy, or whether it was a more systemic issue that permeated throughout the team. I’d be inclined to think it was a little bit of both.

Of course we can’t talk about the defence without considering the Achilles Heel, set pieces. Liverpool conceded 11 goals from headers last season (Chelsea & City conceded only 5), only WBA, Fulham and Cardiff conceded more. 2 of those teams were relegated. Liverpool’s opponents converted 13.4% of their headed shots, only Stoke’s opponents converted a higher proportion. So that says it all really. Again, I think these are both systemic and personnel issues. But both can be improved and used as a way of pulling back some of those 31 goals lost by the departure of Suarez.

So how do Liverpool fix these issues? System changes, tweaks, tightening up the midfield, and work on the training ground go a long way to ironing out defensive issues. Change of personnel is another way. Hence, the defensive additions of Lovren, Moreno, Manquillo and to a lesser extent Emre Can, who can fill in at left back and defensive mid. Full back issues were also a big problem last season. Glen Johnson’s defensive positioning is as shaky as a drunk baby stumbling around a playpen, and for a supposed attacking full back his offensive output is poor compared to other full backs.

Full Backs 13/14

The centre backs never looked happy. Skrtel, who was actually quite poor initially, gradually played himself into some kind of form, but neither Agger nor Toure looked comfortable. Sakho at times perhaps looked the most comfortable, and at a 16 million outlay, you’d have to think that eventually he will be first choice alongside Lovren.

But will the new defensive additions bring more solidity? Along with Cahill and Terry, Lovren was perhaps the best centre back in the league last year. In fact, his style of defending reminds me a lot of Sami Hyypia. Positionally sound, a good reader of the game, and a commanding presence in the centre of the box (remember those set pieces Liverpool concede from). You can get a quick idea of what he might bring to Liverpool compared to the current centre backs from the chart below.

Centre Backs Compared

*Adjusted defensive metrics – I’ll write a longer piece on this soon. I’ve adjusted each defensive metric (where you see adj prefixed) based on the number of passes conceded by the team each player plays for while that player was on the pitch. I’ve only looked at games where a player has played >75 mins.

Manquillo will likely be eased in, but I expect Moreno will get much more game time. He and Flanagan will most likely alternate quite a bit based on the opponents Liverpool will be facing home/away.

Having said all of this, I somehow feel the catalyst to Liverpool improving defensively is tightening up in midfield. Gerrard offers so much, but he needs runners alongside him to help out with the defensive side. Henderson provides that, and more, but moving to a 4-4-2 diamond last season to accommodate two centre forwards gave Henderson that little bit too much to do. I’d expect Rodgers to return to more of a 4-3-3 this season: Gerrard at the base of a midfield triangle with 2 runners either side in Henderson and possibly Emre Can. If that sacrifices attacking play too much, Coutinho is a possibility as the left sided midfield player in the three. He showed his battling qualities last season playing to the left of the diamond.

Taking all of this into account, can Liverpool claw back the Suarez goals in defence? Well certainly not the full amount, but I think they have addressed their needs in the transfer market of full backs, and a commanding centre back. Fix those systemic issues and I can’t see why they can’t improve their goals conceded by at least 10.

Attack

The huge conversion rates maintained last season will inevitably drop this season. I think the big question here is by how much? Both Suarez and Sturridge hugely over-performed in expected goals. Such over-performance has practically no year on year correlation. It’s also worth noting that Liverpool may not HAVE to score that many goals to do well. The average number of goals scored by the Premier League winners in the last 10 years is 84, which is 17 fewer than Liverpool scored last season, but given the attacking talent on display at City and Chelsea I can’t see the winners scoring fewer than 90 goals. Furthermore, Man City also hugely over-performed in XGoals, so I expect their number of goals will also decrease in the coming season.

Over / Under Performing XGoals 13/14

It can’t be emphasised enough what a record-breaking season that was from Suarez. But it’s not just his goals that will be missed; so will his all-round play, dragging defenders out of position, assists and link-up. His goal involvement P90 (goals+assists per 90) was at 1.31 last year. In the last 4 seasons in the Premiership only 1 player has bettered that tally, which was Aguero last season at 1.36, and only 7 players in the last 4 seasons have broken the 1 per 90 barrier. A huge contribution.
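For anyone wanting to reproduce a goal involvement P90 figure, here is a minimal R sketch. The function name and the example player's numbers are made up for illustration; they are not Suarez's actual totals.

```r
# Goals plus assists, scaled to a per-90-minute rate
goal_involvement_p90 <- function(goals, assists, minutes) {
  (goals + assists) / minutes * 90
}

# Hypothetical player: 20 goals and 10 assists in 2,700 minutes
example_p90 <- goal_involvement_p90(20, 10, 2700)
example_p90
```

A player at exactly 1 per 90 would be involved in a goal, on average, once every full match played.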

But there are some really positive signs in an attacking sense from Liverpool. Sturridge has grown into his role at the club. In his last 4 seasons in the EPL he’s only under-performed in XGoals once, which was a slight under-performance of 0.001 per 90 in 12/13. His XGoals per 90 in the last 4 seasons, at 0.58, 0.385, 0.718 and last season’s 0.571, give him an average of 0.56 per 90. If he can stay fit, play the majority of games and score at the rate an average player would given the chances he gets, then he’s likely to get c.20 goals this season. And herein lies the problem. If Sturridge gets injured, who’s going to replace him? Lambert’s XGoals per 90 in the last 2 seasons were 0.287 and 0.364, so any long term absence for Sturridge may be critical to Liverpool’s goal scoring. They can solve this by dipping into the transfer market. Names such as Cavani, Benzema and Falcao have been thrown around. With a Sturridge injury in mind, I believe getting a quality striker in before the season starts may be the difference between struggling to get into the Champions League places and being relatively comfortable in the Champions League places.
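The arithmetic behind that c.20 goals figure can be sketched in R. The four per-90 rates are the ones quoted above; the number of full matches played is my own assumption for illustration.

```r
# Sturridge's expected goals per 90 over the last four seasons (from the text)
xg_per_90 <- c(0.58, 0.385, 0.718, 0.571)
avg_xg_per_90 <- mean(xg_per_90)

# Assumption: he plays the equivalent of 36 full matches
assumed_nineties <- 36
projected_goals <- avg_xg_per_90 * assumed_nineties

round(avg_xg_per_90, 2)
round(projected_goals, 1)
```

Lowering `assumed_nineties` shows how quickly the projection falls away if he misses a chunk of the season.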

Markovic and Lallana have also signed, but I can’t help feeling Markovic may be a little slow getting off the ground and Lallana was signed to add depth to the squad rather than displace one of Suarez, Coutinho or Sterling in the starting eleven. My biggest worry in terms of goal scoring however can be summarised in this chart. Over/Under performance figures are marked on the labels.

Expected Goals by Position 13/14

In particular the midfield area. Over the last 3 seasons Liverpool have under-performed in expected goals. While not a huge problem per se, in a year when you lose your top goalscorer (and vitally haven’t bought another striker) and have had a huge over-performance in expected goals, it’s imperative you get as much from your midfield as possible. Can (no pun intended) Markovic, Lallana, Sterling and Coutinho step up their goal scoring performance? Particularly Coutinho, who only scored 5 goals last season and whose shooting was erratic to say the least. We know both Sterling and Coutinho can create; both were in the top 20 for expected assists per 90 last season in the EPL. So creating chances won’t be a problem for Liverpool, converting them might be though. Incidentally, both have looked unstoppable in pre-season games.

Key Pass Origins 13/14

In summary: the defensive personnel have been improved, systemic issues should have been addressed in pre-season, and a move back to an extra man in midfield should shore up that zone. Squad depth has been improved and creativity shouldn’t be an issue, but a striker hasn’t been purchased yet, which leaves a lot of goal-scoring responsibility on Sturridge and Lambert.

Lastly, there has been some suggestion last year’s title challenge was some sort of fluke. While no one expected it, the rise into the top 4 certainly wasn’t a fluke. Liverpool’s expected goal ratio (XGR), which I found to have a strong correlation (R2=0.78) with points earned, has risen since the 10/11 season, when they posted an XGR of just 0.538. Since that poor 10/11 season they’ve had an XGR of 0.636, 0.637 and last season’s 0.655. In fact, they are the only team to have an XGR >0.60 and finish outside the top 4 (twice) in the last 4 seasons. It’s almost like there was a plan in place.

Prediction 3rd.

Star Man: Raheem Sterling 

 

 

Testing Repeatability – Player Level

So yeah, this is just going to be a quick post to deal with some house-keeping. I’ve run a series of tests to check the repeatability of the various metrics I use. These are all done at player level, I plan on doing the same at team level at some stage. There will be no fancy Tableau graphics here! Just plain old Excel scatter plots. So here is a rundown of what I found. These may, or may not be useful for somebody.

GPS – Goal Probability Per Shot per 90

GPS (Expected goals/non-pen Shots)

Expected Goals per 90

Expected Non-Pen Goals Per 90

Expected Goal Difference Per 90

EXPGoalDiffP90 (Actual Goals-EXPGoals) Top 4 Leagues

Expected Goals From Shot Placement per 90

EXPGoals From Shot Placement (on target shots)

Expected Goals Shot Placement Difference Per 90

XGSPDiff P90 (Actual Goals-XGSP)

Expected Goals Shot Placement per Shot per 90

XGSP_GPS (XGSP/non-pen shots)

Shot Placement Extra Goals per 90 (SPEG)

SPEG P90 (XGSP-EXPGoals)

 

Updated Expected Goals

Originally I had all non-pen shots divided up into 4 separate locations. One prime location, (centre of box), wide right and left inside the box and all other shots outside the box. I’ve since had a serious re-think and have had more time to study the data. There are just too many discrepancies in the shot conversions within those zones and as a result of having more time recently I’ve decided to upgrade my expected goals model. It took a few months work on and off, but I got there in the end. For example, shots wide in the box with the old model are usually converted at around 4/5%. If I separate wide in the box into 2 zones, zone D gives a conversion of c.2%, whilst zone E gives a conversion of c.6%. These are significant differences over a season. And now that I know the differences, well I just can’t live with the old expected goals model knowing that.

Furthermore, whereas N was around 40k shots, I’ve since gathered more data and have increased N to around 100k. This also allows me to sub-categorise the data more heavily and not be too concerned about any sample size issues infecting the results.

(* for the purposes of describing these conversion rates, these are all non-pen shots from these specific locations, with no qualifier added. Although obviously qualifiers are added into the expected goals model, as described below.)

A picture describes a thousand words so here are the locations of the shots I’ve filtered.

Shot Breakdown Zones

So I’ve now broken down the danger zone into 3 distinct locations. Crudely represented by the letters A, B and C. Wide in the box are also separated into 2 distinct zones, F and G, and D and E.

After much studying of conversion rates outside the box I made an educated decision on these zones. I found conversion rates just outside the box differed enough between zones M, N and O and the remaining areas to divide them up into these distinct zones. For example, shots from zone N converted at c.5/6% whereas shots from R and S were converted at c.3%. As I got further out into the halfway territory the sample size got considerably smaller, as a result I felt there wasn’t enough data to separate these areas out further. Besides, conversion rates were getting to a lowly 1%, which in the grand scheme of things I don’t believe separating zones U and V out to anything smaller would have made any significant difference.

 

Qualifiers 

So what qualifiers did I account for? Firstly, for each zone I separated non-big chance shots and big chance shots. A note on big chances: I’ve really had a chance to study these in detail since the Stat Zone website released big chance location data. It’s not a perfect system by any means, but I believe it’s a really good indicator of defensive pressure. The only problem here is, it’s an all or nothing situation. To improve this metric it needs an extra qualifier to record the level of defensive pressure. For example, a big chance with just the keeper to beat is classed as equal to a big chance that is an open goal. So yeah, that’s going to cause a problem on individual shots. At a team level I’m not so sure; how many open goals do you see in football? Not many. On defensive pressure, well, blocked shots can be an indication that a player is close to the shooter: about 4.5% of big chance shots last season in the EPL were blocked, compared to around 28% of non-big chance shots. The question is though, should a big chance be classed as a big chance if there is the opportunity of it being blocked by an outfield player? These are the difficulties. As I say, it’s not perfect, but it’s all we have; it’s just important to be aware of its limitations and non-limitations. I digress.

I then sub-categorised these shots again by head, foot and yes, even other body part, and also by the type of pass the shot came from; inter alia, I controlled for corners, crosses and free kicks.
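To give a feel for how this kind of sub-categorised lookup might work, here is a rough R sketch. It is not my actual model: the shot log is made up, the column names are hypothetical, and I've left out the pass type qualifier to keep it short. The idea is simply to take the average conversion rate for each combination of qualifiers and hand that value back to every shot in the combination.

```r
# Hypothetical shot log: one row per non-penalty shot (made-up data)
shots <- data.frame(
  zone       = c("A", "A", "A", "A", "N", "N", "N", "N", "N", "N"),
  big_chance = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  body_part  = c("foot", "foot", "head", "head", "foot", "foot", "foot", "foot", "foot", "foot"),
  goal       = c(1, 0, 0, 1, 0, 0, 0, 0, 1, 0)
)

# Average conversion rate for every zone / big chance / body part combination
xg_table <- aggregate(goal ~ zone + big_chance + body_part, data = shots, FUN = mean)
names(xg_table)[names(xg_table) == "goal"] <- "exp_goal"

# Attach the matching expected goal value to each individual shot
shots_xg <- merge(shots, xg_table, by = c("zone", "big_chance", "body_part"))

# Expected goals for a player or team is the sum over their shots
sum(shots_xg$exp_goal)
```

With real data the lookup table would be built from the full 100k shot sample and then applied to any one player's or team's shots. Every extra qualifier multiplies the number of combinations, which is where the sample size worries come from.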

 

Nervous Nelly Corner

Overall I’m pretty happy with the model, I’ve controlled for almost anything I can get my hands on publicly, which makes it pretty granular. The locations aren’t picked from the top of my head, I’ve studied the data and made the best decision possible on the locations, that is, based on the data that I have collected. As alluded to above, I’m not entirely happy with the big chance data, it bugs me that a big chance with the goalkeeper to beat is classed as equal to an open goal. But without viewing every shot on video myself I’ve no way to account for this.

Future improvements: immediately what comes to mind is accounting for position, i.e. the model currently takes an average player’s conversion rate for each location and sub-category, so we are judging players based on how an average player would convert. This doesn’t recognise the fact that a forward will convert at a higher rate than a defender. I’ve already taken tentative steps towards this, but even with 100k shots, sub-categorising even more based on position dilutes the data even more and leaves it open to variance. I haven’t been using this model long, since the start of the World Cup, and I’ve even improved it since then, but I imagine the more I use it the more (big!) chances there are that some flaws will arise, which I can learn from and use to improve.

 

Finishing Skill

Lastly, and this is a small annoyance of mine. I do not think that this expected goals model is measuring finishing skill, but rather the ability to get into good positions and get good shots off. I don’t believe any model can measure finishing skill without taking into account how the ball is hit, technique is almost everything, and choice of technique is important.

For example, was the ball hit with the instep, laces, outside of the boot? Did the player volley it, or hit it along the ground (daisy cutter)? etc

Did the player apply bend to the shot? If so, there are further factors to consider. What foot did he use? (Foot and position are important when applying bend to a shot.) What position was the player in when he applied the swerve? Did the ball bend from outside to inside or vice versa? Say we want to shoot with bend from the left of the goal and to apply swerve to go in the far right corner of the goal: if you’re right-footed you need to hit the ball with your instep, if you’re left-footed you’ll need to hit the ball with the outside of the boot (a much more difficult skill). That’s before you even consider shot placement in the goal: top corner, bottom left, straight at the keeper etc. Or even whether a player is actually applying skill or not to any particular shot. This type of nuanced data is needed before anyone can properly start to measure finishing skill.

 

 

 

 

 

Expected Goals – Shot Placement

After the Premier League season ended last year I was wondering why there aren’t more shot placement models out there. There has been some work done on it over at www.statsbomb.com but nothing I could find of note since. I was surprised by this, because if you want to measure finishing skill, isn’t shot placement (along with technique and other variables) a rather large part of a player’s goal scoring armoury? There doesn’t seem to be ‘technique’ data available, at least not in the public domain, and I don’t even know if OPTA (or anyone else) collect data regarding how a player hits the ball, e.g. toe poke, instep, volley etc. It strikes me that EXPGoals is just a “quality of chance from shot location” measurement, and doesn’t directly deal with finishing skill. Indirectly you can measure the difference between actual goals scored and expected goals (which I have done, and gives you an EXPGoalDiff + or -), which could indicate whether or not a player is better than the average player at converting his chances into goals, but for me that’s taking a big leap forward without understanding why one player scores more than the average. Like others I have found no year on year correlation for over-performing in expected goals, with an R2 of just 0.002. So EXPGoalDiff can tell you what may have happened in a particular season, but has no predictive power for what might happen in the next season.

XGDiffP90 (Actual Goals-EXPGoals)

EXPGoals deals with variables up until the moment the player touches the ball to shoot. But a lot can happen between touching the ball and ending up in the back of the net. How the ball is hit, with bend, without bend, velocity, shot placement and external factors such as weather and opposition player positions etc etc. Even if EXPGoal difference was repeatable, it could INDICATE finishing ability, but it won’t tell us why, and I like to understand things, so the why really bugs the hell out of me.

Shot Placement

Of all these other factors I mentioned, I think shot placement data is the only variable that is in the public domain, and even then only for the top 5 Leagues over the last 2 seasons. So after the EPL season finished last year I started collecting shot placement data. That was quickly put on hold during the World Cup, but since then I’ve been beavering away. I managed 4 of the top 5 Leagues; France will have to wait, I just didn’t have the staying power. Sorry France. Upon finishing I got to work on the shot placement model and connecting the data between EXPGoals and shot placement. My idea was that I wanted to control for the exact same variables as the EXPGoal model. That way I could compare the same shot from both perspectives, i.e. I’d have an expected goal value just before the shot was struck, and an expected goal value after the shot was struck. I could then see the difference between the two values and by that know how much any individual player had increased/decreased their chances of scoring, just by where they placed the ball in the goal. I’d also be controlling for a whole host of actions related to shooting and thus hopefully get some decent outputs.

And as I’m writing, I tweeted about shot placement models and have just been tweeted this, which is a piece by Devin Pleuler: http://www.optasportspro.com/about/optapro-blog/posts/2014/on-the-topic-of-expected-goals-and-the-repeatability-of-finishing-skill.aspx And there I was thinking I had an original idea.

EXP Goal Zones

Obviously off target shots can’t be scored and as such have an expected goal value of zero so won’t be included. I took all on target shots, and controlled for the same inputs (location, type of shot etc) as my EXPGoal model, with the added qualifier of separating each instance into separate parts of the goal.

Goal Sections

I divided the goal up into 6 boxes, see above, and got an EXPGoal value for each location in the goal. Why these boxes specifically? I needed at least 6 to delineate central from corners, but couldn’t go any more than 6 as I’d run into sample size issues. Ideally you’d probably want at least 10 areas, an extra 2, top and bottom, either side of the central boxes. But like I said, sample size issues, and each box added creates a mountain of extra work. Let’s just take a quick example of an instance: one specific instance could be all non-headed on target shots taken from Zone C and placed in the top right corner of the goal, which are converted at 60% (or an XGSP value of 0.60). I did the same for each section of the shot placement area, top left, top centre and so on. The same for headed shots in Zone C, and for every other zone marked on the pitch above. This took a lot of bloody time, and I have to admit I nearly gave up on more than one occasion. Now each shot on target has an expected goal value before the shot is struck and after the shot is struck.
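As a rough illustration of one such instance, here is how the XGSP value per goal section could be worked out in R for a single category of shot. The data is made up (only two of the six sections, five shots each) and the names are hypothetical.

```r
# Hypothetical on-target, non-headed shots from Zone C (made-up data)
zone_c <- data.frame(
  placement = c("top right", "top right", "top right", "top right", "top right",
                "low centre", "low centre", "low centre", "low centre", "low centre"),
  goal      = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0)
)

# Conversion rate per goal section, which is the XGSP value for that section
xgsp_by_section <- tapply(zone_c$goal, zone_c$placement, mean)
xgsp_by_section
```

Repeating this for every zone, shot type and goal section is what makes the amount of work balloon.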

On to those messy acronyms. For want of a better name, I’m going to call it Expected Goals Shot Placement or XGSP for short. Let’s first take a look at whether XGSP-P90 correlates to GoalsP90.

XGSPP90_GP90

A pretty strong correlation at 0.771, which is what you would expect, the better your shot placement the more goals you should score.

Shot Placement Extra Goals

Now I’m going to introduce another pesky new acronym, SPEG, or Shot Placement Extra Goals, which is just the difference between expected goals (from on target shots, pre-shot – based on location, type of shot etc) and expected goals from shot placement (post-shot – based on all of the variables in EXPGoals, with shot placement added in). I’ve leant away from using ‘finishing skill’ as a name, because for me it’s not finishing skill, as I believe finishing skill incorporates a whole host of different skills, and shot placement is just one of those skills.
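In code terms SPEG is just a subtraction per shot, summed up and scaled to 90 minutes. A minimal R sketch, with made-up shot values and an assumed number of minutes played:

```r
# Hypothetical on-target shots for one player (values made up for illustration)
on_target <- data.frame(
  exp_goal = c(0.05, 0.30, 0.10, 0.45),  # pre-shot value from the EXPGoals model
  xgsp     = c(0.25, 0.20, 0.60, 0.50)   # post-shot value from the shot placement model
)
minutes_played <- 180  # assumption for the example

# SPEG per shot: how much the placement added to (or took from) the chance
on_target$speg <- on_target$xgsp - on_target$exp_goal

speg_total <- sum(on_target$speg)
speg_p90   <- speg_total / minutes_played * 90
speg_p90
```

Note the second shot has a negative SPEG: a good chance hit to a poor spot lowers the probability of scoring.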

So at a basic level, over a full season, if we look at each shot a player takes, give it an EXP goal value pre-shot, then give an Exp goal value post-shot, based on shot placement, and if that player can show that they have increased their probability of scoring, just by their shot placement, doesn’t that show some skill at putting the ball in the back of the net? It should do, but we could run into the same problems as EXPGoalDiff and things like Shot Conversion %. They just aren’t very repeatable year on year. I ran two tests, firstly on just the EPL alone (because I needed to test before I continued collecting data for other leagues) in the last two seasons, where R2=0.47 and then I tested the Top 4 leagues, (EPL, La Liga, Bundesliga, Serie A) where player x had >=10 shots in year N and year N+1 and here’s what I found.

Shot Placement Extra Goals

An R2 of 0.427, while probably not a good result for any other type of metric, is significant enough when it comes to conversion/goal scoring. Certainly enough to warrant more investigation. Ideally I’d like to go back at least 5 seasons to test it, but still, there is some shot placement skill evident, and these are just my initial findings, so I haven’t had much time to digest the implications. I also decided to do some further visual tests to see if things are what they seem. As a side note, the huge outlier at 0.9 is Morata. I was wondering the same myself.
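For anyone who wants to run this kind of year N versus year N+1 test in R rather than Excel, here is a minimal sketch. The player-season table is made up and the column names are hypothetical; the only parts taken from the test described above are the >=10 shots filter in both seasons and the R2 at the end.

```r
# Hypothetical player-season table (made-up numbers, for illustration only)
seasons <- data.frame(
  player   = c("P1", "P2", "P3", "P4", "P5", "P1", "P2", "P3", "P4", "P5"),
  year     = c(2013, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2014),
  shots    = c(40, 25, 12, 8, 30, 35, 28, 15, 20, 9),
  speg_p90 = c(0.20, 0.10, 0.05, 0.30, 0.15, 0.18, 0.12, 0.02, 0.25, 0.40)
)

# Apply the minimum shots filter to each season separately
year_n  <- seasons[seasons$year == 2013 & seasons$shots >= 10, ]
year_n1 <- seasons[seasons$year == 2014 & seasons$shots >= 10, ]

# Keep only players who qualify in both seasons
paired <- merge(year_n, year_n1, by = "player", suffixes = c("_n", "_n1"))

# Regress next season's value on this season's value and read off R2
fit <- lm(speg_p90_n1 ~ speg_p90_n, data = paired)
r_squared <- summary(fit)$r.squared
r_squared
```

Three made-up players is obviously far too few to mean anything; the point is only the shape of the test.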

Visual Tests

If you follow me on Twitter you’ll know I like to post these scatter plots which I call dashboards. I like the fact that they can show 4 or 5 different metrics at any one time. I mostly plot them with similar type metrics that give some context to all the metrics as a whole. Here I’ve plotted EXGoalsP90 on the vertical axis and SPEGP90 (shot placement extra goals) on the horizontal. GPS, or expected goals per shot is coloured, and goalsP90 (output is also important!) is referenced by the size of the coloured circles.

SPEGP90

Visually, SPEGP90 looks good, the players who you’d expect to do well are doing well. It’s encouraging that the likes of Messi, Ronaldo, Suarez, Dzeko, and Sturridge all appear above 2 standard deviations in both metrics for both seasons.

Edge Case – Mertens

Ok, so that’s good, let’s take a look at some edge cases (apologies – that’s the programmer in me coming out) or outliers and see what we find. First of all, Dries Mertens. Colour-wise he’s in the kind of blue-green range, which means on a per shot basis he has low expected goals, and on a per 90 basis he’s also going to be low. His shots per 90 are at 3.9, so that’s quite high. So lots of shots, but low value chances of converting, which usually means shots from outside the box. But his SPEG-P90 is above 2 standard deviations, which indicates that by way of his shot placement he’s increased his expected rate of scoring somewhat. Visually, let’s see what that looks like. First his shot chart from last season. Remember, it’s heat map orientated: the hotter the shot the higher the chance of converting, and vice versa for the colder shots. Larger dots represent goals, X’s represent headers.

Mertens Non-Pen Shots

Pretty much as expected based on his GPS and XGP90 on the scatter plot. Lots of shots from outside the box, that have an obvious low scoring probability. Next let’s take a look at his shot placement.

Mertens Shot Placement

Before we even consider the numbers, visually, if you look at the sheer volume of his low value chances on his shot chart above (blue shots), then compare his shot placement, it looks quite good. Only 5 of his 36 shots on target were placed down the centre. 26 of his on target shots had an expected goal value (pre-shot) of less than 0.06, yet after the shot was taken 30 of those on target shots had a SPEG value of greater than 0.089. So yeah, in this instance you could say his SPEG numbers match what is happening visually.

Edge Case – Destro

Lastly, let’s take a quick look at another outlier. Destro in the 12/13 season, he’s in the top left of the plot above. Here’s his shot chart.

Destro Non-Pen Shots

GPS and EXPGoals both indicated high value chances and it’s clear from his shot chart that most of his shots came from prime central in the danger zone. Only 2 of Destro’s shots came from outside the box and 12 of his 22 shots on target had an expected goal value greater than 0.30. High quality chances indeed. But his SPEGP90 indicates he increased his expected probability of scoring by 0.167 per 90, whilst the average increase over the plot is 0.133. So he’s slightly above average, which is not really that good. Let’s look at his shot placement chart.

Destro Shot Placement

Again visually, it seems clear that the reason his increase from expected goals (pre-shot) to SPEG (post shot) is low, is because he hit most of his shots low centre, which is really goalkeeper territory and has a much lower chance of being scored. It’s still early days, but it’s nice to know the model is working as it should be, and that the numbers, for now, pan out visually.

Future Improvements

Well, the inputs in the model probably won’t be improved much as I can’t sub-divide the categories any further without running into sample size issues. Not to mention the enormous amount of work it would involve to tinker with it in that way. In fact, I spent so much time on it I’m fed up looking at the numbers at this stage. For now it will interest me just to use it for the coming season and see what I can learn and what its best application is. I have no formal statistical training or background, so this is a hobby, and a very time-consuming one at that. I’ll continue collecting the data and input it into the model for the coming season, but if it becomes too much of a burden I’ll have to stop.

In a visual sense, I would like to connect both shot charts and shot placements in the goal to show the increase before and after the shot has been taken. The holy grail would be in some sort of 3D environment, but that would take an awful lot of coding and again I’m not sure I have the time.

What I would like to do before the season starts is look at SPEG at a team level. I’m aware though that shot placement is really an individual based skill, but I think it might be interesting to discover what it says at a more macroscopic level. In particular, SPEG conceded, and maybe SPEG total shot ratio. Though I’m not that hopeful of either being that predictive, nonetheless, it will be fun to find out. I think.

Feedback welcome, as I got so caught up with this I might have missed something that’s just plain obvious.

The Curious Case of Expected Goal Ratio

I’m no advanced statistician. I take more than a passing interest in numbers in football. Why? Well, because I find them interesting. Simple. Do numbers tell us everything? No. Certainly not. But in my humble opinion, combined with watching games and studying the numbers, they do give you valuable insight into what is happening on the pitch.

So I’ve been reading a lot of advanced statistical articles over the last year, and the common theme seems to be repeatability and regression/progression to the mean. Now bear with me I’m still learning, but from what I understand if you can find repeatability, then you have a metric which suggests you can predict future outputs of said metric, and if a metric regresses heavily to the mean (see PDO) then that may suggest there is a large element of luck/variance involved and thus is unpredictable and difficult to measure. As a layman, it seems to me, looking for correlation, or relationships between metrics and making predictions is one of the fundamentals of data analysis.

But how do we test a metric for repeatability? Well luckily the statistical world provides us with a way and it’s called linear regression. Ok, so don’t get bogged down by the term, the basic principle is easy to understand, with the caveat though that it can get quite technical if you go deeper. Let’s think about a simple example. We want to test if there is a relationship between 2 variables, those variables are, book sales, and shelf space say. If books are sold, book sales go up along with shelf space being increased right? Of course. We’re not going to get into the nuts and bolts of the example it’s just to get us thinking along the correct lines. Let’s move our example over to the football world. Wouldn’t it be nice if we could find a relationship between, say, the variable: points per game, and shots in the box. If we found a strong correlation then it might help us predict how the table might finish, and being repeatable it also suggests there is some skill behind the metric as opposed to luck or temporary skill which may give us a good indication of the performance of a team or player. Something solid to measure them by. Remember, both of us are learning as we go here.

What’s that got to do with Expected Goal Ratio (EXPGr)? Firstly let’s deal with what my version of expected goals (EXPg) is. It’s worth noting I am by no means the pioneer of this, it’s been done by others (and probably better), but personally I happen to think it’s a brilliant metric. I’ve broken the playing field into different areas, as you can see in the below graphic. The reason for this is that we need to judge players and teams on an even playing field, so to speak. If we have player A, who has a goal conversion rate (CONV%) of 8% from all non-pen shots but takes a high proportion of his shots from outside the box, then comparing him to player B, with a CONV% of 15%, who takes the majority of his shots from inside the box, would be an unfair comparison. Why? Because shots from inside the box are converted at a much higher rate than those from outside. Hence we ‘control for this’ by treating shots from each location differently. So expected goals from prime is: Total shots from prime × Mean League CONV% from prime. Player A takes 100 shots from prime, we apply a league mean of 15%, which gives us an expected goal rate of 15 from those 100 shots. An average player taking 100 shots will score 15 goals. Simple. I carry out this calculation for each of the specified locations and simply add each EXPg: (EXPgPrime+EXPgOutside+EXPgInBoxWideRight+EXPgInBoxWideLeft) = EXPg total. I can then further determine if a player/team is performing above or below the EXPg rate by simply looking at actual goals. Actual Goals-EXPg = EXPg difference. So if a player/team has a +8.5 EXPg difference they have scored 8.5 goals more than they should have given those shots from those locations.
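The calculation described above can be sketched in a few lines of R. Only the 100 shots from prime and the 15% prime conversion rate come from the example in the text; the other shot counts, the other conversion rates and the actual goals figure are assumptions I've made up to complete the example.

```r
# Shots by location for a hypothetical player (only the prime figure is from the text)
shots_by_location <- c(prime = 100, outside = 60, wide_right = 20, wide_left = 20)

# League mean conversion rates (prime is from the text, the rest are assumed)
league_conv <- c(prime = 0.15, outside = 0.04, wide_right = 0.05, wide_left = 0.05)

# Expected goals per location: shots multiplied by the league mean conversion rate
exp_goals_by_location <- shots_by_location * league_conv[names(shots_by_location)]
exp_goals_total <- sum(exp_goals_by_location)

# EXPg difference: actual goals minus expected goals (actual goals are hypothetical)
actual_goals  <- 25
exp_goal_diff <- actual_goals - exp_goals_total

exp_goals_by_location
exp_goal_diff
```

A positive `exp_goal_diff` means the player scored more than an average player would have from the same shots in the same locations.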

Which brings me on nicely to Total Shot Ratio (TSR). All you need to know on TSR is here. James has done a lot of work on TSR, and has a lot of data on it, and it’s been proven time and again to be a metric that is highly repeatable. Combine that with its simplicity (TSR = shots for/(shots for + shots conceded)) and you have a really nice effective metric. However, TSR doesn’t control for shot location, and assumes that all shots are equal. Which of course they are not. For example we have Spurs this year, who have taken 336 non-pen shots (only City & LFC have taken more) and conceded only 211 (only City, Chelsea & Southampton have conceded fewer). With that in mind you would expect Tottenham’s TSR to be high, and it is: 61%, which is second highest in the league. How is that a good predictor when Spurs haven’t been great this season and are in mid-table, I hear you say? Well, TSR doesn’t account for shot location. If we look deeper we see Spurs have taken only 31% of their shots from Prime locations, the 4th lowest proportion in the league. They have taken 177 shots from outside the box, the highest total in the league, and with the current CONV% being only 4.2% for shots from outside the box, we begin to understand Tottenham’s underlying problems. I’m by no means disrespecting TSR; if you have a look at James’s body of work on his blog you’ll understand he’s quite the guru on these metrics and has far more knowledge than me on these subjects.
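The Spurs TSR figure is easy to check in R using the shot counts quoted above. The function name is just my own label for the formula.

```r
# Total Shot Ratio: share of all shots in a team's games taken by that team
tsr <- function(shots_for, shots_conceded) {
  shots_for / (shots_for + shots_conceded)
}

# Spurs: 336 non-pen shots taken, 211 conceded (figures from the text)
spurs_tsr <- tsr(336, 211)
round(spurs_tsr * 100)
```

A TSR of 0.5 would mean a team takes exactly as many shots as it concedes.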

So the next question is: how do we incorporate shot location into TSR? I have thought about TSR from Prime + TSR from outside + TSR from left in the box + TSR from right in the box = Total Shot Ratio Location (TSRLo), and it’s something to investigate further, but for the moment I already have the data for EXPg collated. If I combine EXPg with TSR, then that in some way controls for shot location. So I applied the TSR formula to EXPg. The calculation is as follows: EXPg for/(EXPg for + EXPg conceded) = Expected Goal Ratio, or EXPGr, which saves me writing the whole bloody thing every time.

Let’s apply that EXPGr calculation to Spurs: 27.45/(27.45 + 20.1) = 57.7%. So Tottenham’s EXPGr is 57.7%, which is 5th best in the league and in my opinion brings them closer to their real performance level this season. Of course, Spurs is just one example. But all this doesn’t really matter if EXPGr is not repeatable and doesn’t correlate from year to year.
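The same sum in R, as a minimal sketch using the two Spurs EXPg totals quoted above (the variable names are mine):

```r
# Spurs EXPg totals from the text.
expg_for      <- 27.45
expg_conceded <- 20.1

# EXPGr = EXPg for / (EXPg for + EXPg conceded).
spurs_expgr <- expg_for / (expg_for + expg_conceded)

print(round(spurs_expgr * 100, 1))  # 57.7
```

Note that the denominator is the sum of both totals, exactly as with TSR; only the inputs have changed from raw shots to expected goals.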

Which brings us back to linear regression, which helps us test for repeatability. If we can account for the variance or luck, then we can determine the non-luck or skill element. To do that we look at linear regression, which gives us something called R2: the proportion of the variation in your dependent variable (the one on the Y axis) that is accounted for by the model. R2 is a number between 0 and 1. A small note on its close relative, the correlation coefficient (r), and a more technical definition I came across on the interwebs: r measures the strength and direction (uphill/positive or downhill/negative) of the linear relationship between two ‘quantitative’ variables x and y. It’s a number between -1 and +1 that is unit free, which means if you changed from, say, pounds to ounces the value wouldn’t change. If the relationship between x and y is uphill (as x increases, so does y), r is positive. If the relationship is downhill (as x increases, y decreases), r is negative. With a single x variable, as we have here, R2 is simply r squared, so it is never negative. See the short snapshot I took below for a guide on reading an R2 value.
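To make the relationship between the correlation coefficient and R2 concrete, here is a small R sketch on made-up numbers (nothing here is football data). It uses base R’s `cor()` and `lm()`, and reads R2 off `summary()`.

```r
# Made-up x and y values with a clear uphill relationship.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.9)

# Correlation coefficient: between -1 and +1, sign gives the direction.
r <- cor(x, y)

# Simple linear regression of y on x; R2 lives in the model summary.
fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared

# Unit free: rescaling x (say pounds to ounces) leaves r unchanged.
r_rescaled <- cor(x * 16, y)

# Flip y to get a downhill relationship: r changes sign, R2 would not.
r_down <- cor(x, -y)

print(r)
print(r_squared)
print(r_down)
```

With one x variable, `r^2` and `r_squared` come out as the same number, which is why a scatter plot’s R2 can never be negative even when the line slopes downhill.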

Screen Shot 2014-01-10 at 10.29.05

We test for a relationship between a team’s EXPGr in year N (11/12 season) and EXPGr in year N+1 (12/13 season). Due to relegation we can obviously only test the 17 teams that played in both seasons. We do that by using a scatter plot, and Excel will calculate the R2 value for us. You can also use Excel’s regression analysis tool for a more detailed breakdown, but to be honest those results are beyond the scope of this piece.

Expected Goal Ratio Plot


We’ve plotted the data, so now we can visually look for a relationship between the two variables. Do our points appear to follow the line? If they do, we can say a linear relationship exists between EXPGr in year N and EXPGr in year N+1. The closer all of the data points are to the line (in other words, the less scatter), the stronger the correlation; conversely, the more dispersed the data points, the lower the correlation. We quantify this relationship by using the R2 value. As you can see, the R2 value is 0.8406, which means EXPGr in year N accounts for 84% of the variance in EXPGr in year N+1; the remainder is down to luck and whatever else the model doesn’t capture. As a rough rule of thumb, anything above 70% is considered high and anything below 50% is considered low. Of course, there are only 17 data points, which is a relatively low sample size, and this is only an initial finding. I would be a lot happier using this metric if I had more seasons of data, but for now 2 full seasons are all I have to go on. But the metric looks promising.
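The post does this step in Excel, but the same year-on-year test could be sketched in R along these lines. The EXPGr values below are invented purely for illustration (they are not the real 11/12 and 12/13 figures), so the R2 this produces says nothing about the 0.8406 result above.

```r
# Invented EXPGr values for seven hypothetical teams in two seasons.
seasons <- data.frame(
  expgr_n  = c(0.62, 0.58, 0.55, 0.51, 0.47, 0.44, 0.40),  # year N
  expgr_n1 = c(0.60, 0.59, 0.52, 0.50, 0.49, 0.42, 0.41)   # year N+1
)

# Regress year N+1 on year N.
fit <- lm(expgr_n1 ~ expgr_n, data = seasons)

# Share of the year N+1 variance accounted for by year N.
r_squared <- summary(fit)$r.squared

# Predicted year N+1 EXPGr for a team that posted 57.7% in year N.
predicted <- predict(fit, newdata = data.frame(expgr_n = 0.577))

print(r_squared)
print(predicted)
```

To get the picture as well, `plot(seasons$expgr_n, seasons$expgr_n1)` followed by `abline(fit)` draws the scatter with the fitted line on top.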

Some future work would be to test EXPGr more rigorously. To do that I need more data, so my next step is to collect the data for season 10/11. It was also suggested to me by @gnepon to see how EXPGr correlates with points per game (PPG), which is a good idea and also not too difficult to test, but I’ll leave that for another blog. Another area which I’ve been probing is some basic prediction of the Y value, so it will be interesting to see how EXPGr correlates with PPG. If it correlates well I’ll have a stab at predicting the points for the rest of the season. I’d also like to construct a table that compares TSR standing to league place and then see how that matches up to EXPGr standing and league place. So that’s my first step into regression. I’m not expecting this to be completely correct, as it can be a difficult subject to get your head around, so I’d welcome feedback and suggestions from anybody with more experience than me in this area. Thanks for reading.