Predicting Premier League Points
The Premier League is the most watched sports league in the world and generates billions of pounds in revenues in the form of TV rights and sponsorship agreements, and it grows year on year. With each season that passes, transfer records are broken, player wages increase and so do ticket prices to watch the games.
The sport is so money-oriented that the teams with the deepest pockets tend to win the trophies. In fact, it’s so rare for a team outside of the ‘big six’ to win anything, or even qualify for European competitions, that when Leicester City achieved the unthinkable and beat 5000/1 odds to win the league a couple of seasons ago, it received worldwide media attention.
I wanted to dig into the data and explore the extent to which money influences a team’s success on the pitch, and whether other factors are involved. Aside from the big six teams, I think the majority of teams in the Premier League are only one bad season away from being relegated into the Championship.
For this analysis, I scraped historical Premier League table data from Wikipedia from the 2000-01 season through to the 2016-17 season; the final points each team gets in a season would serve as the target variable.
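As a rough illustration of the scraping step (not the code used for this post), a league table can be pulled out of HTML with nothing but the standard library. The class name `TableRows`, the inline `SAMPLE_HTML` and its placeholder teams are all assumptions made for the sketch; a real Wikipedia page has more complex markup and would need extra cleaning.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Placeholder table (hypothetical teams and points, for illustration only).
SAMPLE_HTML = """
<table>
  <tr><th>Pos</th><th>Team</th><th>Pts</th></tr>
  <tr><td>1</td><td>Team A</td><td>80</td></tr>
  <tr><td>2</td><td>Team B</td><td>75</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
header, *body = parser.rows

# Final points per team: the target variable for the model.
points_by_team = {row[1]: int(row[2]) for row in body}
print(points_by_team)
```

Repeating this for each season's page and tagging every row with its season would give one observation per team per season.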
For the majority of my data acquisition I focused on the comprehensive football (soccer) website, transfermarkt.com, but also collected data from a variety of other sources, including Reddit, news articles and blog posts for some of the harder to acquire data points.
I’ll give a brief rundown of the data I collected and/or used as features in my model (more charts and analysis can be seen on GitHub).
This chart shows the number of points teams have accumulated since the 2000-01 season. There's quite a high variation in the number of points the bottom placed team ends a season with. Around mid table the curve flattens somewhat, while towards the top of the table the gradient steepens. Interestingly, only once in the past 15 years has a team managed to accumulate more than 40 points and still be relegated.
By the 'big six' I mean the 6 clubs that have a notably higher turnover than the other teams in the division, giving them a clear advantage in their ability to outspend rival clubs when it comes to player transfers and wages. I'm sure it'll come as no surprise that the 'big six' clubs are Manchester United, Manchester City, Arsenal, Chelsea, Liverpool and Tottenham Hotspur.
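One simple way to hand this grouping to a regression is a 0/1 indicator column. The sketch below assumes that encoding; the names `BIG_SIX` and `big_six_flag` are made up for illustration, and the post does not say exactly how the feature was built.

```python
# The six clubs named above.
BIG_SIX = {
    "Manchester United",
    "Manchester City",
    "Arsenal",
    "Chelsea",
    "Liverpool",
    "Tottenham Hotspur",
}


def big_six_flag(team):
    """Return 1 for a big six club and 0 for any other team."""
    return 1 if team in BIG_SIX else 0


for team in ["Arsenal", "Everton"]:
    print(team, big_six_flag(team))
```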
According to some newspaper articles I have seen, the 'average' Premier League football player earns £44,000 per week, and at the top end of the scale the big clubs are purported to be shelling out more than £250,000 per week for their big name stars. Above is a pair plot depicting wages data (scaled) vs final points teams accumulated over the course of a season. There's quite a clear positive relationship. It's also quite apparent that the big six teams (in blue) really do outspend the rest of the league.
According to Sky Sports, in the 2016 summer transfer window, Premier League clubs spent £1.194 billion on players. The top 4 biggest spenders were Manchester City (£169.05 Mn), Manchester United (£141.05 Mn), Chelsea (£97.65 Mn) and Arsenal (£82.23 Mn). There's actually not a great deal of information to be gleaned from the chart, which shows no obvious relationship of the kind seen in the wages data.
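To put a number on a relationship like this, one option is the Pearson correlation between wages and points. The sketch below uses invented figures purely for illustration (they are not the data behind the chart) and assumes min-max scaling, since the post does not say which scaling was applied.

```python
from math import sqrt


def min_max_scale(values):
    """Rescale values linearly onto the range 0..1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical wage bills and season points for five teams.
wages = [20, 35, 50, 90, 120]
points = [34, 41, 47, 66, 79]

scaled_wages = min_max_scale(wages)
r_raw = pearson(wages, points)
r_scaled = pearson(scaled_wages, points)
print(round(r_raw, 3), round(r_scaled, 3))
```

Linear scaling leaves the correlation unchanged, so scaling the wages only affects how the chart and the regression coefficients read, not the strength of the relationship.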
I collected squad size data because I thought there may be an advantage for the teams with a larger pool of players to rely upon throughout the season. The data, however, shows hardly any kind of relationship. I think this is because the data from transfermarkt also included the reserve and academy players who are rarely involved in first team matches.
As with the wages data, there's a clear difference between team market values for the big six teams and the others, with 2 clusters of data points. I would assume that transfermarkt team value data is based on the values of the players in their squads, so it's no surprise really that the teams with the most money spend the most on players and therefore have the most valuable squads.
Below is the final model. I ended up choosing a standard ordinary least squares regression.
I only had 313 observations because I removed the 2016-17 data so that I could predict the 2016-17 season and compare the results.
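Holding out a whole season like this can be sketched as a simple filter. The row layout, the `split_by_season` helper and the toy values below are assumptions for illustration only, not the actual dataset.

```python
def split_by_season(rows, holdout_season):
    """Split rows into (train, test), keeping one season aside for testing."""
    train = [r for r in rows if r["season"] != holdout_season]
    test = [r for r in rows if r["season"] == holdout_season]
    return train, test


# Toy rows with placeholder teams and made-up points.
rows = [
    {"season": "2014-15", "team": "Team A", "points": 70},
    {"season": "2015-16", "team": "Team A", "points": 66},
    {"season": "2016-17", "team": "Team A", "points": 75},
    {"season": "2016-17", "team": "Team B", "points": 40},
]

train, test = split_by_season(rows, "2016-17")
print(len(train), len(test))
```

Splitting by season rather than at random keeps every 2016-17 row out of training, so the prediction is a genuine forecast of a season the model has not seen.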
Not all of the feature data I collected made it into the final model; I removed the features that offered little or no predictive power. One of those was the net transfer spend data. It took me quite a long time to collect that, so I was hoping it would be a better predictor! Never mind…
An R-squared value of 0.683 isn’t too bad. The wage, squad size and big six features are strongly statistically significant, each with a p-value that rounds to 0.00. Other factors such as number of transfers in, average time a squad has spent together and average ages of players show weaker relationships of varying strength. The first season feature isn’t convincing from a statistical perspective, but I left it in because the coefficient is negative and intuitively that makes sense to me.
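For readers unfamiliar with what an ordinary least squares fit and its R-squared involve, here is a minimal standard-library sketch. It is not the model from this post: it uses only two features (a scaled wage value and a big six flag), the helper names are made up, and the data is synthetic, generated from points = 40 + 10 × wage + 5 × big_six so that the recovered coefficients can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(features, targets):
    """Return [intercept, coef_1, ...] from the normal equations X'X b = X'y."""
    design = [[1.0] + list(row) for row in features]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Predicted target for one feature row."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def r_squared(coefs, features, targets):
    """Share of the variance in targets explained by the fitted model."""
    mean_y = sum(targets) / len(targets)
    ss_res = sum((y - predict(coefs, row)) ** 2 for row, y in zip(features, targets))
    ss_tot = sum((y - mean_y) ** 2 for y in targets)
    return 1 - ss_res / ss_tot


# Synthetic rows: (scaled wage, big six flag).
features = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1), (1.5, 0), (2.5, 1)]
targets = [40 + 10 * wage + 5 * big_six for wage, big_six in features]

coefs = fit_ols(features, targets)
r2 = r_squared(coefs, features, targets)
print([round(c, 3) for c in coefs], round(r2, 3))
```

Because the synthetic targets contain no noise, the fit here is essentially perfect; real data such as the 0.683 reported above leaves a share of the variance unexplained. A statistics library would also report the p-values discussed above, which this sketch does not compute.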
Prediction / outcome
Below are the model predictions vs the actual final Premier League table for the 2016-17 season:
It correctly relegated Middlesbrough and Hull City, but predicted that Sunderland would finish safe from relegation on 44 points. The other team that the model relegated was Watford, who in real life only finished one place higher and with 40 points vs the 38 predicted.
At the other end of the table, the model incorrectly predicted Manchester United would win the league with 79 points. In reality, they only managed 69 points and a 6th placed finish. All of the big six teams were correctly predicted to finish in the top 6 places, albeit in the wrong order. It also correctly predicted that Everton would end the season in 7th place.
As with the top 6 places, the mid-table sides are roughly the same in both the model and real life, in a slightly shuffled order.
One general observation (and something I expected might be an issue) is that at the upper end of the table, the model under-estimated the number of points the top few teams managed to accumulate. At the bottom it did the opposite: it over-estimated points and failed to predict the dismal sub-30 point tallies of Middlesbrough and Sunderland.
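The size and direction of these misses can be summarised with signed errors (predicted minus actual). The sketch below covers only the three teams whose predicted points are quoted above, so it is nowhere near a full evaluation. Sunderland's actual tally of 24 comes from the published final table rather than from the text here, and the variable names are illustrative.

```python
# team: (predicted points, actual points) for 2016-17
results = {
    "Manchester United": (79, 69),
    "Watford": (38, 40),
    "Sunderland": (44, 24),  # actual taken from the final league table
}

# Positive error means the model over-estimated the team's points.
errors = {team: pred - actual for team, (pred, actual) in results.items()}
mean_abs_error = sum(abs(e) for e in errors.values()) / len(errors)

for team, err in errors.items():
    print(f"{team}: {err:+d}")
print(f"Mean absolute error: {mean_abs_error:.1f} points")
```

Running the same calculation over all 20 teams would show whether the errors are systematically negative at the top of the table and positive at the bottom, as described above.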
I’m sure there’s additional data that could be incorporated to further refine the predictions. Managerial changes are one feature that springs to mind. Another would be recent performance or league placing data from previous seasons, which I’m sure would hugely improve the model.