By popular demand, we’ve decided to let you into our world and give you a little insight into the technology and work that went into running the Euro 2020 tournament prediction process. Here’s how we turn a collection of different data sources into a cohesive model.
Here is a high level architecture diagram:
Key components and technology used
A lot of the source data is collated into Excel, especially the predictions coming in from people. The ability to copy and paste quickly from emails into a managed spreadsheet is simpler and gives us more flexibility.
Some of the source data is scraped from websites using the R programming language.
We use Microsoft SSIS to import the key datasets of participants, fixtures, predictions and goals scored into an SQLServer database into a staging area. From the staging tables, we then check the validity of the data and load into the final storage tables with our own processes.
We also use IBM Planning Analytics as a standalone score predictor, taking into account a wide range of factors and using these to derive a likely score. Planning Analytics receives its data from the cleansed SQLServer tables via a CSV export process and turbo integrator to import the data into its model. The score predictions are then fed back into our spreadsheet and loaded into the database along with all our other predictors.
Data storage and processing
Initially the data is loaded into staging tables to provide an audit trail and to provide the ability for final data validation before being loaded into the model.
Processing predominantly includes the storage of individual predictions and the final scores, building up an accuracy score for each prediction (human or machine).
Our predictive model, uses a range of factors including goals scored and conceded for each team and relative team strengths to predict an expected score.
We capture every goal scored and the scorer and time, and aggregate these to derive the final results.
We like to use a broad range of visualisation tools and essentially show each to produce the same dashboard or report pages, this gives us a ready made catalogue that we can demonstrate
Currently we have Microsoft PowerBI, Google Data Studio, Tableau and Planning Analytics. For the Euro 2020 tournament we selected PowerBI to have a consistent look and feel, but we will also add more options over time and future tournaments as the opportunity arises.
The boring manual bits that add the real value
It’s not all automation though, some things are just too complex to solve using automation. The BBC Sport website have a group of pundits who predict the scores of every match, so for those we enter them onto our spreadsheet manually. It’s a bit of a chore, but its worth it so we can benchmark ourselves against the experts.
At the end of it we have a really interesting dataset, which we use to demonstrate different technologies. It’s a little different to the usual fictional business models that others tend to use and a lot more fun and engaging.
Planned improvements ahead of World Cup 2022
We improved our processes this time after the 2018 World Cup, adding scores and opening up to external participants. But to improve for next time we need to have a more efficient method of data capture. We will set up an online system for people to enter their score predictions, which will be integrated into our stack saving a proportion of the manual work. We will shortly be testing this and will invite a few people to join our testing programme. See you for the next round!