Originally I had all non-pen shots divided up into 4 separate locations: one prime location (centre of the box), wide right and wide left inside the box, and all other shots outside the box. I’ve since had a serious re-think and have had more time to study the data. There are just too many discrepancies in the shot conversions within those zones, and as a result of having more time recently I’ve decided to upgrade my expected goals model. It took a few months’ work on and off, but I got there in the end. For example, shots wide in the box with the old model are usually converted at around 4-5%. If I separate wide in the box into 2 zones, zone D gives a conversion of c.2%, whilst zone E gives a conversion of c.6%. These are significant differences over a season. And now that I know the differences, well, I just can’t live with the old expected goals model.
Furthermore, whereas N (the number of shots in the sample) was around 40k, I’ve since gathered more data and have increased N to around 100k. This also allows me to sub-categorise the data more heavily and not be too concerned about any sample size issues affecting the results.
(* For the purposes of describing these conversion rates, these are all non-pen shots from these specific locations, with no qualifier added. Obviously, though, qualifiers are added into the expected goals model, as described below.)
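To put rough numbers on why splitting a zone matters, here is a minimal Python sketch. The shot counts and the example team are made up, chosen only so that the rates match the c.2% (zone D) and c.6% (zone E) quoted above; the function names are purely illustrative and this is not the actual model.

```python
# Hypothetical shot counts, picked only to reproduce the rough rates quoted
# in the text (c.2% for zone D, c.6% for zone E). They are not the real data.
zone_counts = {
    "D": {"shots": 1000, "goals": 20},
    "E": {"shots": 1000, "goals": 60},
}


def conversion_rate(goals, shots):
    """Goals per shot for a single zone."""
    return goals / shots


# New approach: one conversion rate per zone.
zone_rates = {
    zone: conversion_rate(c["goals"], c["shots"])
    for zone, c in zone_counts.items()
}

# Old approach: both zones lumped together as "wide in the box".
pooled_rate = conversion_rate(
    sum(c["goals"] for c in zone_counts.values()),
    sum(c["shots"] for c in zone_counts.values()),
)


def expected_goals(shots_by_zone, rates):
    """Sum of (number of shots x conversion rate) over the zones."""
    return sum(n * rates[zone] for zone, n in shots_by_zone.items())


# A made-up team that takes most of its wide shots from the better zone.
team_shots = {"D": 10, "E": 50}

old_xg = sum(team_shots.values()) * pooled_rate
new_xg = expected_goals(team_shots, zone_rates)

print(f"pooled rate: {pooled_rate:.1%}")
print(f"old model xG: {old_xg:.2f}, split-zone xG: {new_xg:.2f}")
```

With these made-up numbers the pooled rate credits the team with 2.4 expected goals, while the split zones credit it with 3.2, which is the kind of gap that adds up over a season.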
A picture is worth a thousand words, so here are the locations of the shots I’ve filtered.
So I’ve now broken down the danger zone into 3 distinct locations, crudely represented by the letters A, B and C. The wide areas in the box are also each separated into 2 distinct zones: F and G, and D and E.
After much studying of conversion rates outside the box I made an educated decision on these zones. I found conversion rates just outside the box differed enough between zones M, N and O and the remaining areas to divide them up into these distinct zones. For example, shots from zone N converted at c.5-6%, whereas shots from R and S were converted at c.3%. As I got further out towards the halfway line the sample size got considerably smaller, and as a result I felt there wasn’t enough data to separate these areas out further. Besides, conversion rates were getting down to a lowly 1%, so in the grand scheme of things I don’t believe separating zones U and V into anything smaller would have made any significant difference.
So what qualifiers did I account for? Firstly, for each zone I separated non-big chance shots and big chance shots. A note on big chances: I’ve really had a chance to study these in detail since the Stat Zone website released big chance location data. It’s not a perfect system by any means, but I believe it’s a really good indicator of defensive pressure. The only problem here is that it’s an all or nothing situation. To improve this metric it needs an extra qualifier to record the level of defensive pressure. For example, a big chance with just the keeper to beat is classed as equal to a big chance that is an open goal. So yeah, that’s going to cause a problem on individual shots. At a team level I’m not so sure: how many open goals do you see in football? Not many. On defensive pressure, well, blocked shots can be an indication that a player is close to the shooter: about 4.5% of big chance shots last season in the EPL were blocked, compared to around 28% of non-big chance shots. The question is though, should a big chance be classed as a big chance if there is the opportunity of it being blocked by an outfield player? These are the difficulties. As I say, it’s not perfect, but it’s all we have; it’s just important to be aware of its limitations and strengths. I digress.
I then sub-categorised these shots again by head, foot and yes, even other body part shots, and then also by what type of pass the shot came from; among other things, I controlled for corners, crosses and free kicks.
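The sub-categorising described above can be pictured as a lookup table of average conversion rates, keyed by zone, big chance flag, body part and pass type. The following is only a sketch under that assumption: the shot records, the category labels (such as "open_play") and the function names are invented for illustration, and the real model may well be put together differently.

```python
from collections import defaultdict

# Made-up shot records: (zone, big_chance, body_part, pass_type, scored).
# Real data would have thousands of shots per category, not a handful.
shots = [
    ("A", True, "foot", "open_play", 1),
    ("A", True, "foot", "open_play", 0),
    ("A", False, "head", "cross", 0),
    ("A", False, "head", "cross", 0),
    ("A", False, "head", "cross", 1),
    ("N", False, "foot", "open_play", 0),
]


def build_rate_table(records):
    """Average conversion rate for every combination of qualifiers."""
    totals = defaultdict(lambda: [0, 0])  # key -> [goals, shots]
    for zone, big_chance, body_part, pass_type, scored in records:
        key = (zone, big_chance, body_part, pass_type)
        totals[key][0] += scored
        totals[key][1] += 1
    return {key: goals / n for key, (goals, n) in totals.items()}


def shot_xg(table, zone, big_chance, body_part, pass_type):
    """Expected goals for one shot: the average rate of its category."""
    return table[(zone, big_chance, body_part, pass_type)]


rate_table = build_rate_table(shots)

# A big chance with the foot from zone A, in open play.
print(shot_xg(rate_table, "A", True, "foot", "open_play"))
```

A player’s or team’s expected goals would then simply be the sum of `shot_xg` over all of their shots.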
Nervous Nelly Corner
Overall I’m pretty happy with the model. I’ve controlled for almost anything I can get my hands on publicly, which makes it pretty granular. The locations aren’t picked off the top of my head; I’ve studied the data and made the best decision possible on the locations, that is, based on the data that I have collected. As alluded to above, I’m not entirely happy with the big chance data: it bugs me that a big chance with the goalkeeper to beat is classed as equal to an open goal. But without viewing every shot on video myself I’ve no way to account for this.
Future improvements: what immediately comes to mind is accounting for position. The model currently takes an average player’s conversion rate for each location and sub-category, so we are judging players based on how an average player would convert; this doesn’t recognise the fact that a forward will convert at a higher rate than a defender. I’ve already taken tentative steps towards this, but even with 100k shots, sub-categorising further by position dilutes the data even more and leaves it open to variance. I haven’t been using this model long, only since the start of the World Cup, and I’ve even improved it since then, but I imagine the more I use it the more (big!) chances there are that some flaws will arise, which I can learn from and use to improve.
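The dilution problem can be roughly quantified with the standard error of a conversion rate. The sketch below treats every shot as an independent coin flip (a simplifying assumption made for illustration, not something the model is known to do) and uses a c.3% rate like the one quoted for zones R and S; the two sample sizes are hypothetical.

```python
import math


def conversion_standard_error(rate, n_shots):
    """Standard error of an observed conversion rate, treating each shot
    as an independent Bernoulli trial (a simplifying assumption)."""
    return math.sqrt(rate * (1.0 - rate) / n_shots)


rate = 0.03           # roughly the c.3% quoted for zones R and S
all_players = 5000    # hypothetical number of shots in one sub-category
defenders_only = 200  # hypothetical slice of that sub-category by position

se_all = conversion_standard_error(rate, all_players)
se_defenders = conversion_standard_error(rate, defenders_only)

print(f"all players: {rate:.1%} +/- {se_all:.2%}")
print(f"defenders:   {rate:.1%} +/- {se_defenders:.2%}")
```

Cutting a category down to 1/25 of its size makes the uncertainty 5 times larger: with these numbers the defender-only rate of 3% carries a standard error of about 1.2 percentage points, so it could easily be noise rather than a real positional difference.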
Lastly, and this is a small annoyance of mine: I do not think that this expected goals model is measuring finishing skill, but rather the ability to get into good positions and get good shots off. I don’t believe any model can measure finishing skill without taking into account how the ball is hit. Technique is almost everything, and choice of technique is important.
For example, was the ball hit with the instep, laces, or outside of the boot? Did the player volley it, or hit it along the ground (daisy cutter)? Etc.
Did the player apply bend to the shot? If so, there are further factors to consider. What foot did he use? (Foot and position are important when applying bend to a shot.) What position was the player in when he applied the swerve? Did the ball bend from outside to inside or vice versa? Say we want to shoot with bend from the left of the goal and apply swerve so the ball goes in the far right corner of the goal: if you’re right-footed you need to hit the ball with your instep; if you’re left-footed you’ll need to hit the ball with the outside of the boot (a much more difficult skill). That’s before you even consider shot placement in the goal: top corner, bottom left, straight at the keeper etc. Or even whether a player is actually applying skill or not to any particular shot. This type of nuanced data is needed before anyone can properly start to measure finishing skill.