Thursday, 30 May 2019

A simulation to see who will win the World Cup


One of the main purposes of statistics is to help inform decisions. Cricket statistics are often used when deciding on the selection of players, or (more often) in arguments about who is the best at a particular aspect of the game. They can help decide which strategies are best, what an equivalent score is in a reduced match (the Duckworth-Lewis-Stern method being a particular case) or which teams should automatically qualify for the World Cup (David Kendix’s rankings). They are also often used by bookmakers (both the reputable, legal variety and the more dubious underworld version) to set odds about who is going to win.

I decided to attempt to build a model to calculate the probability of each team winning, based on their previous form. This would (hopefully) allow me to predict the probability of each outcome of the World Cup by running a simulation. It didn’t prove to be as easy as I had hoped.

My first thought was to look at each team’s net run rate in each match, adjust for home advantage, and then average it out. That seemed sensible, and the first attempt at doing that looked like it would be perfect. Most teams (all except Zimbabwe) had roughly symmetrical net run rates, and they fitted a normal curve really well. The only problem was that Afghanistan was miles ahead of everyone else. The fact that they had mostly played lower quality opponents in the past 4 years meant that they had recorded a lot more convincing wins than anyone else.
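That first approach boils down to something like the following. This is only a minimal sketch: it assumes a pandas DataFrame of past ODIs with hypothetical columns 'team', 'net_run_rate' (that team’s net run rate for the match) and 'home' (1 for a home match, 0 otherwise), and the home-advantage adjustment is purely illustrative.

```python
import pandas as pd

# Purely illustrative home-advantage adjustment, in runs per over
HOME_ADVANTAGE = 0.3

def average_adjusted_nrr(matches: pd.DataFrame) -> pd.Series:
    """Average each team's per-match net run rate after stripping out an
    assumed home-advantage bonus."""
    adjusted = matches['net_run_rate'] - matches['home'] * HOME_ADVANTAGE
    return adjusted.groupby(matches['team']).mean().sort_values(ascending=False)
```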

This was clearly a problem. India and England both had negative net run rates, while Afghanistan, Bangladesh and West Indies were all expected to win most of their matches.

I then tried a different approach, based off David Kendix’s approach of using each result to adjust a ranking. But rather than having a ranking that was based off wins, I based it off net run rate. So if a team had an expected net run rate of 0.5, and another had an expected net run rate of 0.6, the first team would have an expected net run rate of -0.1 for their match. If they did better than that, they went up, and if they did worse than that, they went down.
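As a rough sketch of that update rule (the K value here is illustrative, not the one I actually used):

```python
# Each team carries an "expected net run rate" rating; after a match, both
# teams' ratings are nudged towards what actually happened.
# K controls how quickly the ratings react (illustrative value only).
K = 0.1

def update_ratings(ratings: dict, team: str, opponent: str, actual_nrr: float) -> None:
    expected = ratings[team] - ratings[opponent]  # e.g. 0.5 - 0.6 = -0.1
    error = actual_nrr - expected                 # better than expected -> positive
    ratings[team] += K * error
    ratings[opponent] -= K * error
```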

However, I found that individual results ended up carrying too much weight. If I made the rating sensitive enough to react to recent form, it swung far too much on the back of one big win or loss: England’s rating dropped by almost a whole run per over based on their series in the West Indies. So this was clearly not a good option.

Next, I decided to try logistic regression and see how that turned out. Logistic regression is a way of estimating the probability of an event when there are only two possible outcomes (here, a win or a loss for the team in question). To make that hold, I removed every tie and every match with no result, and set to work building the models.

My initial results were exciting. By using just the team, the opposition and home/away status, I was able to predict the results of the previous three World Cups quite accurately using the data from the preceding 4 years. (I could not go back further than that, as those tournaments included teams making their ODI debut, and there was accordingly no data to use to build the model.)
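As a sketch of the model itself, assuming a hypothetical DataFrame odis with columns 'team', 'opposition', 'home' (1/0) and 'won' (1/0), and with ties and no-results already removed:

```python
import statsmodels.formula.api as smf

# Logistic regression on team, opposition and home/away status.
# The reference level for each categorical term is the alphabetically first
# team, which is why the coefficients come out relative to Afghanistan.
model = smf.logit('won ~ team + opposition + home', data=odis).fit()

# Predicted probability of a win for each row of the data
odis['p_win'] = model.predict(odis)
```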

The results were really pleasing. I graphed them here, grouped to the nearest 0.2 (i.e. the point at 0.6 represents all matches where the model gave the team between a 0.5 and 0.7 chance of winning), compared to the actual result for that match. It seems the model slightly overstates the chance of an upset (possibly because upsets are more common outside World Cups, where players tend to be rested against smaller nations), but overall the probabilities were fairly reliable, and (most importantly) the team that the model predicted would win generally won.
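Reproducing that grouping is straightforward: round each predicted probability to the nearest 0.2 and compare each bucket with the observed win rate. Continuing the sketch above (the column names are still the assumed ones):

```python
# Round each predicted probability to the nearest 0.2, then compare each
# bucket's average prediction with how often the team actually won.
odis['bucket'] = (odis['p_win'] / 0.2).round() * 0.2
calibration = odis.groupby('bucket')['won'].agg(['mean', 'count'])
print(calibration)  # e.g. the 0.6 bucket covers predictions from 0.5 to 0.7
```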

I could then use this to give a ranking of each team that directly related to their likelihood of winning against each other. The model gave everything in relation to Afghanistan, whose value was set at 0; any number higher than 0 shows how much more likely that team was than Afghanistan to beat the same opponent. (Afghanistan was the reference simply because they come first in the alphabet.)

This turns out to be fairly close to the ICC rankings. So that was encouraging.

I tried adding a number of things to the model (ground types, continents, interactions, weighting the more recent matches more highly) but the added complexity did not result in better predictions when I tested them, so I stuck to a fairly simple model, only really controlling for home advantage.
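The test was essentially out-of-sample: fit each candidate model on the four years before a World Cup, then score its predictions on that World Cup. Something along these lines, where train and test are hypothetical DataFrames split that way and 'continent' stands in for one of the extra features:

```python
from sklearn.metrics import log_loss

# Fit on the four years of ODIs before a World Cup, score on the World Cup itself
baseline = smf.logit('won ~ team + opposition + home', data=train).fit()
extended = smf.logit('won ~ team + opposition + home + continent', data=train).fit()

for name, m in [('baseline', baseline), ('extended', extended)]:
    print(name, log_loss(test['won'], m.predict(test)))  # lower is better
```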
Next I applied the model’s probabilities to every group-stage match and simulated the group stage to find each team’s chance of making the semi-finals.
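The simulation itself is a straightforward Monte Carlo: draw a winner for every group-stage fixture from the model’s probability, award the points, and count how often each team finishes in the top four. A sketch, ignoring no-results and net-run-rate tiebreaks, where fixtures is a hypothetical list of (team, opposition, probability) tuples:

```python
import random
from collections import Counter

def simulate_group_stage(fixtures, teams, n_sims=10_000):
    """Return each team's estimated probability of reaching the semi-finals."""
    semi_counts = Counter()
    for _ in range(n_sims):
        points = {t: 0 for t in teams}
        for team, opposition, p in fixtures:
            winner = team if random.random() < p else opposition
            points[winner] += 2
        # Top four on points go through (ties broken arbitrarily here)
        top_four = sorted(teams, key=lambda t: points[t], reverse=True)[:4]
        semi_counts.update(top_four)
    return {t: semi_counts[t] / n_sims for t in teams}
```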


The next step was to extend the simulation past the group stage and find the winner.
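Extending it is just two more rounds per simulation: first plays fourth and second plays third in the semi-finals, and the winners meet in the final. A sketch, where p_win(a, b) is a hypothetical helper returning the model’s probability that team a beats team b:

```python
def simulate_knockouts(standings, p_win):
    """standings: the four semi-finalists in order of group finish."""
    first, second, third, fourth = standings
    finalist1 = first if random.random() < p_win(first, fourth) else fourth
    finalist2 = second if random.random() < p_win(second, third) else third
    return finalist1 if random.random() < p_win(finalist1, finalist2) else finalist2
```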

After running through the simulation a few more times, I came out with this:


A couple of points to remember here: every simulation is an estimate. The model is almost certainly not going to get the individual probabilities exactly right, but it should get them close enough to give a good estimate of the actual final probabilities. It is also likely to overstate Bangladesh’s ability, due to their incredible home record; overstate Pakistan’s ability, as in a lot of their nominally neutral matches in the UAE they have had a degree of home advantage; and understate the West Indies, who have not played their best players in a lot of matches in the past 4 years. But none of these are likely to make a massive difference to the semi-finalist predictions.



Given this, I’d suggest that if you want to bet on the winner of the World Cup, these are the odds that I would consider fair for each team:
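("Fair" here just means decimal odds equal to the reciprocal of the simulated probability; for example:)

```python
def fair_decimal_odds(probability: float) -> float:
    # e.g. a simulated 20% chance of winning the tournament -> odds of 5.0
    return 1.0 / probability
```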


I will try to update these probabilities periodically throughout the world cup, and report on their accuracy.
