I’m no advanced statistician, but I take more than a passing interest in numbers in football. Why? Well, because I find them interesting. Simple. Do numbers tell us everything? No. Certainly not. But in my humble opinion, studying the numbers alongside watching games does give you valuable insight into what is happening on the pitch.
So I’ve been reading a lot of advanced statistical articles over the last year, and the common theme seems to be repeatability and regression/progression to the mean. Now bear with me, I’m still learning, but from what I understand, if a metric is repeatable then its past values tell you something about its future values, which means you can use it to predict. If a metric regresses heavily to the mean (see PDO), that suggests there is a large element of luck/variance involved, which makes it hard to predict from. As a layman, it seems to me that looking for correlations, or relationships between metrics, and making predictions from them is one of the fundamentals of data analysis.
But how do we test a metric for repeatability? Well, luckily the statistical world provides us with a way, and it’s called linear regression. Ok, so don’t get bogged down by the term. The basic principle is easy to understand, with the caveat that it can get quite technical if you go deeper. Let’s think about a simple example. Say we want to test if there is a relationship between two variables: book sales and shelf space. You’d expect the books given more shelf space to be the ones that sell more, right? Of course. We’re not going to get into the nuts and bolts of the example, it’s just to get us thinking along the correct lines. Let’s move our example over to the football world. Wouldn’t it be nice if we could find a relationship between, say, points per game and shots in the box? If we found a strong correlation then it might help us predict how the table might finish. And if the metric is also repeatable from season to season, that suggests there is some skill behind it, as opposed to luck or a temporary run of form, which may give us a good indication of the performance of a team or player. Something solid to measure them by. Remember, both of us are learning as we go here.
What’s that got to do with Expected Goal Ratio (EXPGr)? Firstly let’s deal with what my version of expected goals (EXPg) is. It’s worth noting I am by no means the pioneer of this, it’s been done by others (and probably better), but personally I happen to think it’s a brilliant metric. I’ve broken the playing field into different areas, as you can see in the below graphic. The reason for this is that we need to judge players and teams on an even playing field, so to speak. Say player A has a goal conversion rate (CONV%) of 8% from all non-pen shots but takes a high proportion of his shots from outside the box, while player B has a CONV% of 15% and takes the majority of his shots from inside the box. Comparing the two directly would be unfair. Why? Because shots from inside the box are converted at a much higher rate than those from outside. Hence we ‘control for this’ by treating shots from each location differently.

So expected goals from prime is: Total shots from prime x Mean League CONV% from prime. Player A takes 100 shots from prime, we apply a league mean of 15%, and that gives us 15 expected goals from those 100 shots. In other words, an average player taking 100 shots from prime will score 15 goals. Simple. I carry out this calculation for each of the specified locations and simply add them up: (EXPgPrime + EXPgOutside + EXPgInBoxWideRight + EXPgInBoxWideLeft) = EXPg total.

I can then further determine if a player/team is performing above or below their EXPg by simply looking at actual goals. Actual Goals - EXPg = EXPg difference. So if a player/team has a +8.5 EXPg difference, they have scored 8.5 goals more than an average finisher would have from those shots in those locations.
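For anyone who would rather see that as code than as a spreadsheet, here is a rough sketch of the EXPg calculation in Python. The function names and the zone labels are my own inventions for illustration, and the only conversion rate used is the 15% for prime from the worked example above. The 18 goals in the usage lines is a made-up tally, not a real player.

```python
def expected_goals(shots_by_zone, league_conv):
    """Sum of (shots from a zone x league mean conversion rate for that zone).

    Both arguments are dicts keyed by zone name. A zone that has shots but
    no league rate raises KeyError, so a missing rate can't slip through.
    """
    return sum(shots * league_conv[zone] for zone, shots in shots_by_zone.items())


def expg_difference(actual_goals, shots_by_zone, league_conv):
    """Actual goals minus expected goals; positive means over-performing."""
    return actual_goals - expected_goals(shots_by_zone, league_conv)


# Worked example from the text: 100 shots from prime at a 15% league mean.
league_conv = {"prime": 0.15}
player_a_shots = {"prime": 100}

print(round(expected_goals(player_a_shots, league_conv), 2))       # 15.0
# Hypothetical tally of 18 actual goals, purely to show the difference.
print(round(expg_difference(18, player_a_shots, league_conv), 2))  # 3.0
```

To cover all four locations you would just add the other zones, with their own league rates, to both dictionaries.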
Which brings me on nicely to Total Shot Ratio (TSR). All you need to know on TSR is here. James has done a lot of work on TSR and has a lot of data on it, and it’s been proven time and again to be a metric that is highly repeatable. Combine that with its simplicity (TSR = shots for/(shots for + shots conceded)) and you have a really nice, effective metric. However, TSR doesn’t control for shot location and assumes that all shots are equal. Which of course they are not. For example, we have Spurs this year, who have taken 336 non-pen shots (only City & LFC have taken more) and conceded only 211 (only City, Chelsea & Southampton have conceded fewer). With that in mind you would expect Tottenham’s TSR to be high, and it is: 61%, which is second highest in the league. How is that a good predictor, I hear you say, when Spurs haven’t been great this season and are in mid-table? Well, TSR doesn’t account for shot location. If we look deeper we see Spurs have taken only 31% of their shots from Prime locations, the 4th lowest proportion in the league. They have taken 177 shots from outside the box, the highest total in the league, and with the current CONV% for shots from outside the box being only 4.2%, we begin to understand Tottenham’s underlying problems. I’m by no means disrespecting TSR. If you have a look at James’s body of work on his blog you’ll understand he’s quite the guru on these metrics and has far more knowledge than me on these subjects.
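As a quick check on those Spurs numbers, here is a small Python sketch of the TSR formula. The function name is mine; the shot counts are the ones quoted above.

```python
def total_shot_ratio(shots_for, shots_conceded):
    """TSR = shots for / (shots for + shots conceded)."""
    total = shots_for + shots_conceded
    if total == 0:
        raise ValueError("no shots for or against, TSR is undefined")
    return shots_for / total


# Spurs: 336 non-pen shots taken, 211 conceded.
spurs_tsr = total_shot_ratio(336, 211)
print(f"{spurs_tsr:.1%}")  # 61.4%

# Share of Spurs' shots that came from outside the box (177 of 336).
print(f"{177 / 336:.1%}")  # 52.7%
```

So more than half of the shots behind that 61% came from the location with the lowest conversion rate, which TSR on its own can’t see.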
So the next question is how do we incorporate shot location into TSR? I have thought about TSR from Prime + TSR from outside + TSR from left in box + TSR from right in the box = Total Shot Ratio Location (TSRLo), and it’s something to investigate further, but for the moment I already have the data for EXPg collated. If I combine EXPg with TSR then that in some way controls for shot location. So I took the TSR formula and swapped shots for EXPg. The calculation is as follows: EXPg for/(EXPg for + EXPg conceded) = Expected Goal Ratio, or EXPGr, which saves me writing the whole bloody thing every time.
Let’s apply that EXPGr calculation to Spurs: 27.45/(27.45 + 20.1) = 57.7%. So Tottenham’s EXPGr is 57.7%, which is 5th best in the league, and in my opinion that is closer to their real performance level this season. Of course Spurs is just one example. And all this doesn’t really matter if EXPGr is not repeatable and doesn’t correlate from year to year.
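Here is the same Spurs sum as a short Python sketch, mostly to show that EXPGr has exactly the same shape as TSR with EXPg in place of shots. The function name is my own; 27.45 and 20.1 are the EXPg for and EXPg conceded figures used above.

```python
def expected_goal_ratio(expg_for, expg_conceded):
    """EXPGr = EXPg for / (EXPg for + EXPg conceded)."""
    total = expg_for + expg_conceded
    if total == 0:
        raise ValueError("no expected goals for or against, EXPGr is undefined")
    return expg_for / total


spurs_expgr = expected_goal_ratio(27.45, 20.1)
print(f"{spurs_expgr:.1%}")  # 57.7%
```

A team that creates and concedes equal EXPg comes out at 50%, so anything above that means a team is expected to outscore its opponents.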
Which brings us back to linear regression, which helps us test for repeatability. If we can measure how much of a metric carries over from one season to the next, we get a handle on the non-luck or skill element. To do that we look at linear regression, which gives us something called R2: the proportion of the variation in your dependent variable (the one on the Y axis) that is accounted for by the variable on the X axis. A small note on the correlation coefficient, r, which R2 is built from, and a more technical definition I came across on the interwebs: r measures the strength and direction (uphill/positive or downhill/negative) of the linear relationship between two ‘quantitative’ variables x and y. It’s a number between -1 and +1 that is unit free, which means if you changed from say pounds to ounces the value wouldn’t change. If the relationship between x and y is uphill (as x increases so does y), r is positive. If the relationship is downhill (as x increases, y decreases), r is negative. For a simple straight-line fit like ours, R2 is just r squared, so it sits between 0 and 1 and tells you how strong the relationship is but not which direction it goes. See the short snapshot I took below for a guide on reading an R2 value.
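To make the difference between r and R2 a bit more concrete, here is a small Python sketch that works out the correlation coefficient by hand. The function name and the four-point data sets are made up purely for illustration; they are not football data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient r between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [1, 2, 3, 4]
uphill = pearson_r(xs, [2, 4, 5, 9])    # y rises as x rises, so r is positive
downhill = pearson_r(xs, [9, 5, 4, 2])  # y falls as x rises, so r is negative

print(uphill > 0, downhill < 0)
# Squaring throws the sign away, so both give the same R2.
print(round(uphill ** 2, 4), round(downhill ** 2, 4))

# Unit free: rescaling x (say pounds to ounces, x16) leaves r unchanged.
print(round(pearson_r([x * 16 for x in xs], [2, 4, 5, 9]), 4) == round(uphill, 4))
```

The thing to take away is that r carries the direction, while R2 only tells you how much of the variation is accounted for.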
We test for a relationship between a team’s EXPGr in year N (11/12 season) and EXPGr in year N+1 (12/13 season). Due to relegation and promotion, only 17 teams played in both seasons, so those are the only ones we can test. We do that by using a scatter plot, and Excel will calculate the R2 value for us. You can also use Excel’s regression analysis tool for a more detailed breakdown, but to be honest those results are beyond the scope of this piece.
We’ve plotted the data, so now we can visually look for a relationship between the two variables. Do our points appear to follow the line? If they do, we can say a linear relationship exists between EXPGr in year N and EXPGr in year N+1. The closer the data points are to the line (in other words, the less scatter), the stronger the correlation, and conversely, the more dispersed the data points, the lower the correlation. We quantify this relationship by using the R2 value. As you can see, the R2 value is 0.8406, which means EXPGr in year N accounts for about 84% of the variation in EXPGr in year N+1. The remaining 16% or so is down to luck and whatever else changed between the two seasons. As a rough rule of thumb, anything above 70% is considered high and anything below 50% is considered low. Of course there are only 17 data points, which is a relatively small sample, and this is only an initial finding. I would be a lot happier using this metric if I had more seasons of data, but for now 2 full seasons are all I have to go on. Still, the metric looks promising.
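If you don’t have Excel to hand, the same straight-line fit and R2 can be worked out in a few lines of Python. This is only a sketch of ordinary least squares as I understand it, and the five pairs of EXPGr values are invented for illustration; they are not the 17 teams from my data set. The function names are my own.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys accounted for by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented EXPGr values for five imaginary teams (NOT real data).
year_n = [0.62, 0.55, 0.50, 0.45, 0.38]
year_n1 = [0.60, 0.57, 0.48, 0.47, 0.40]

slope, intercept = fit_line(year_n, year_n1)
print(f"R2 = {r_squared(year_n, year_n1, slope, intercept):.4f}")

# The fitted line can also be used to predict next season from this one.
print(f"predicted EXPGr after a 0.58 season: {intercept + slope * 0.58:.3f}")
```

With the real 17 pairs in place of the invented ones, the R2 printed here should agree with the one Excel puts on the trendline.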
Some future work would be to test EXPGr more rigorously. To do that I need more data, so my next step is to collect the data for season 10/11. It was also suggested to me by @gnepon to see how EXPGr correlates with points per game (PPG), which is a good idea and also not too difficult to test, but I’ll leave that for another blog. Another area which I’ve been probing is some basic prediction of the Y value. If EXPGr correlates well with PPG, I’ll have a stab at predicting the points for the rest of the season. I’d also like to construct a table that compares TSR standing to league place and then see how that matches up to EXPGr standing and league place. So that’s my first step into regression. I’m not expecting this to be completely correct, as it can be a difficult subject to get your head around, so I’d welcome some feedback and suggestions from anybody with more experience than me in this area. Thanks for reading.