Stephen Birch | 30 August 2024

From predictor to pundit – how AI coped with Euro 2024 match prediction.

AI generated image of a football in digital surroundings

You might have read our blog article about developing an AI model to predict the outcomes of the matches which took place as part of the UEFA Euro 2024 football tournament. Now that the competition is over, we’ve had a chance to carry out a drains-up review of what went well and what could have worked better. 

So, did we develop a world-beating football predictor? Did the pundits excel at their jobs? Is human knowledge a match for data processing? Could the cats be the key to my fortune? 

The backroom staff 

In our previous article, we focused on the team on the pitch: the AI predictor, the DTB staff, the pundits and the cats. 

Our predictions were based on a Machine Learning model trained on international match results from the last 30 years. We then used generative AI to interact with and interrogate that data to come up with a prediction.

We used prompt engineering with the Large Language Model and retrieval augmented generation (RAG) over selected data sources, with real-time data used to augment later responses. (Check out another one of our AI blogs, which demystifies some of the AI terminology.)

But behind every first-class team is a hard-working backroom squad which rarely gets the credit they deserve. 

We have a number of reliable resources to draw on, each with their own specific skill sets and areas of expertise.  

For the large language model, we considered using IBM’s Granite models, or bringing in open-source models such as Meta’s Llama 3 or Mistral AI’s Mixtral 8x7B.

Given the relative simplicity of the model, we decided that satisfactory results for the intent classification of users’ questions could be achieved with the Llama 3-8B model, which is roughly three times cheaper than the Llama 3-70B model used for more complex implementations. Mixtral 8x7B is around three times cheaper still, and is still capable of achieving similar results for the basic model used in this scenario.

When working with these different large language models, it is necessary to be aware of the differences in the prompting required for each. The prompt format for Llama 3 is very different from the prompt format for Mixtral, as the sketch below illustrates. As a result, you can often achieve very good results with smaller and cheaper models such as Mixtral, provided the prompt is correctly configured for that specific model.
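As a rough illustration, here is how the same question might be wrapped for each model. The templates follow the publicly documented Llama 3 chat format and Mixtral instruct format, but treat the exact strings as indicative rather than a record of our production prompts.

```python
# Illustrative only: the same question wrapped in two different prompt templates.
# Llama 3 uses role headers, while Mixtral's instruct format uses [INST] tags and
# has no separate system role, so the instruction is folded into the user turn.

SYSTEM = "You are a football pundit answering questions about Euro 2024."
QUESTION = "Who do you predict will win the Euro 2024 final?"

llama3_prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    f"{SYSTEM}<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    f"{QUESTION}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

mixtral_prompt = f"<s>[INST] {SYSTEM}\n\n{QUESTION} [/INST]"
```

Sending a model the wrong template will usually still produce text, but the quality tends to drop noticeably, which is why a cheaper model can close the gap once the prompt matches the format it was trained on.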

LLM benchmark data, compiled from https://artificialanalysis.ai/models, https://bito.ai/blog/gemini-1-5-pro-vs-gpt-4-turbo-benchmarks/, https://www.vellum.ai/llm-leaderboard and https://www.vellum.ai/blog/llm-benchmarks-overview-limits-and-model-comparison

The table above shows how the different models rank across different benchmark tests. It is clear that the models perform differently, and also that cost can play a part in the decision to choose a suitable model. For example, Llama 3.1 70B is great for reasoning and language understanding, but the cost is relatively high. Gemini 1.5 Pro is excellent for multimodal tasks involving text and images, but is matched for reasoning and knowledge by the less costly Llama 3.1 70B. This demonstrates the importance of defining your needs or use case before deciding on the means to deliver it.

Mixtral is considered to be an instruct model, which typically provides more precise and relevant answers to specific instructions, such as summarising research. However, with appropriate prompting, an instruct model can perform as well as a chat model (such as OpenAI’s ChatGPT) for conversational tasks.

Building the AI Assistant

Chester and Jinx at their training camp.

We built the AI assistant interface in WatsonX Assistant, which accessed the data for the responses through a custom extension defined by an OpenAPI spec. The app itself was built using FastAPI and works by obtaining the initial intent from the user’s question, deciding whether the user is asking about the machine learning (ML) predictions that have been stored in a Google BigQuery table, or asking a generic question about the Euros tournament. For the latter, it follows a RAG approach using documents obtained from news articles and selected websites. A simplified sketch of that routing is shown below.
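This is not our production code: the endpoint, the keyword-based intent check and the two helper functions (query_predictions and answer_with_rag) are placeholders standing in for the LLM intent classifier, the BigQuery lookup and the RAG chain described above.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Question(BaseModel):
    text: str


def classify_intent(question: str) -> str:
    """Placeholder for the LLM intent classifier (Llama 3-8B in our set-up).
    Returns 'prediction' for questions about the ML match predictions,
    otherwise 'general'."""
    keywords = ("predict", "score", "who will win", "outcome")
    return "prediction" if any(k in question.lower() for k in keywords) else "general"


def query_predictions(question: str) -> str:
    """Placeholder for the lookup against the BigQuery table of ML predictions."""
    return "Prediction data would be fetched from BigQuery here."


def answer_with_rag(question: str) -> str:
    """Placeholder for the RAG chain over news articles and selected websites."""
    return "A retrieval-augmented answer would be generated here."


@app.post("/ask")
def ask(question: Question) -> dict:
    intent = classify_intent(question.text)
    if intent == "prediction":
        answer = query_predictions(question.text)
    else:
        answer = answer_with_rag(question.text)
    return {"intent": intent, "answer": answer}
```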

When considering the human factor in this experiment, the DeeperThanBlue staff were motivated by the bragging rights of coming top of the table while the pundits were paid (by their employers) for their predictions. 

The cats were powered by a combination of Felix and Dreamies. 

Dissent in the ranks 

It wasn’t all plain sailing, and we faced several problems along the way. 

The main issues we faced with the initial implementation were:

  • Inaccurate document retrieval 
  • Quality issues with Llama 3 and Mixtral models 
  • High occurrences of hallucinations 

However, these were not insurmountable. Such challenges are to be expected when setting up such a system from scratch. 

We were able to overcome the issues as follows: 

  • Few-shot prompting helped to improve the quality of answers from the models. We did notice, however, that Llama 3 was returning higher-quality answers than Mixtral. 
  • We updated the model parameters, changing the decoding method from sample to greedy, and reduced the temperature (see the parameter sketch after this list). This helped to ensure responses were consistent and stayed close to the facts. Initially we wanted to give the model some freedom to decide how it delivered its answers by using a high temperature and sample decoding, giving it a more human-like, pundit-like feel. However, this resulted in it often giving erratic and incorrect responses. 
  • The accuracy of the document retrieval was improved by utilising the Cohere re-ranker (see the re-ranking sketch after this list). This reordered the documents retrieved by the base retriever in order of relevance and significantly improved the accuracy of retrieval, given that the base retriever often did not rank the documents containing the relevant information as the most relevant. 
  • We updated the model parameters and improved the quality of retrieval to address the high occurrence of hallucinations. We also updated the prompts so that the assistant could tell the user when it was unable to retrieve information relating to the user’s question. 
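To make the parameter change concrete, the settings below map onto the standard watsonx.ai text-generation options (decoding_method, temperature and so on). The values are illustrative, not our exact configuration.

```python
# Minimal sketch of the parameter change described above: greedy decoding and a
# low temperature trade "pundit flair" for consistent, grounded answers.
# Values are illustrative, not the exact settings we used.

erratic_params = {
    "decoding_method": "sample",  # allows lower-probability tokens to be picked
    "temperature": 0.9,           # high temperature: more creative, more erratic
    "max_new_tokens": 300,
}

reliable_params = {
    "decoding_method": "greedy",  # always take the most likely next token
    "temperature": 0.2,           # only relevant when sampling, kept low regardless
    "max_new_tokens": 300,
    "repetition_penalty": 1.1,    # discourages the model from looping
}
```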

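And here is a rough sketch of the re-ranking step. It assumes the Cohere Python SDK and a particular rerank model name; both are assumptions for illustration rather than a record of our exact setup.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # illustrative; key management not shown

query = "Who scored in England's quarter-final against Switzerland?"
retrieved_docs = [
    "England beat Switzerland on penalties after a 1-1 draw...",
    "Spain cruised past Georgia in the round of 16...",
    "Saka's late equaliser forced extra time in Dusseldorf...",
]

# Re-order the base retriever's candidates by relevance to the query,
# then keep only the top documents for the prompt context.
response = co.rerank(
    model="rerank-english-v3.0",  # model name is an assumption
    query=query,
    documents=retrieved_docs,
    top_n=2,
)

top_docs = [retrieved_docs[r.index] for r in response.results]
```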
The super-sub—Harry the AI Pundit 

As the tournament approached the quarterfinals, we introduced an AI assistant, Harry the Pundit. Using Meta’s Llama 3 LLM on IBM WatsonX, Harry was able to surface knowledgeable responses to questions posed about the tournament and tie them into the prediction data made available in BigQuery, derived from our ML model. 

We wanted Harry to provide more human-like responses, in the style of a regular football pundit, so we primed it with examples of pundit speak for it to mimic.  
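That priming boiled down to few-shot prompting: a system instruction plus a couple of worked examples of the tone we wanted. The snippet below is a simplified illustration of the idea, not the actual prompt we shipped, and the example question-and-answer pairs are invented.

```python
# Simplified illustration of priming the model with "pundit speak".
# The examples below are invented for illustration, not taken from our prompt.

PUNDIT_SYSTEM = (
    "You are Harry, a football pundit covering Euro 2024. "
    "Answer in an enthusiastic, colourful pundit style, but only use facts "
    "from the context you are given. If the context does not contain the "
    "answer, say so."
)

FEW_SHOT_EXAMPLES = [
    {
        "question": "How did the group stage go for the hosts?",
        "answer": "Ah, the hosts came flying out of the blocks! Plenty of "
                  "energy, plenty of goals, and the home crowd loved every minute.",
    },
    {
        "question": "Is a 4-4-2 still viable at this level?",
        "answer": "It's old school, but don't write it off! With two hard-working "
                  "wide men and a strike partnership that clicks, it can still do a job.",
    },
]


def build_prompt(question: str, context: str) -> str:
    """Assemble a few-shot, persona-primed prompt for the LLM."""
    shots = "\n\n".join(
        f"Question: {ex['question']}\nHarry: {ex['answer']}" for ex in FEW_SHOT_EXAMPLES
    )
    return (
        f"{PUNDIT_SYSTEM}\n\nContext:\n{context}\n\n{shots}\n\n"
        f"Question: {question}\nHarry:"
    )
```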

Once again, the outcome was quite satisfactory. 

We started with an easy one: Who is playing in the Euro 2024 Final? 

Harry said, “Ah, the grand finale of Euro 2024 is setting the stage for a nail-biter between Spain and England. Both teams have shown incredible resilience and flair throughout the tournament, but it all comes down to this one match in Berlin. Let the best team win!” 

Then we asked: Who do you predict will win the Euro 2024 Final? 

Harry: “For the Euro 2024 final, it’s a clash of titans with Spain and England locking horns. Based on their journey and performances leading up to the final, my prediction leans towards England clinching the title. They’ve shown resilience and strategic prowess, particularly in their ability to navigate through tight matches, as evidenced by their penalty shootout win against Switzerland in the quarterfinals. Expect a tactical showdown, but I reckon England might just edge it with a scoreline of 2-1. The Three Lions seem poised to end their long wait for a major trophy on the continental stage!” 

If only that was the case. 

How about something more challenging: Who will be the stand-out players in the Final? 

Harry: “Ah, the Euro 2024 final, always a stage where legends are made! For Spain, look out for their midfield maestro, who has been slicing through defenses like a hot knife through butter throughout the tournament. His vision and precision are simply top-notch, and I reckon he’ll be pivotal in orchestrating Spain’s attacks. 

On the other side, England’s young gun on the wing has been absolutely electric. His pace and trickery have left many defenders in the dust, and his knack for clutch goals can’t be ignored. Expect him to be a constant threat on the break, giving the Spanish defence a real headache. 

So, keep your eyes peeled on these two; they’re likely to light up the scoreboard and dazzle us with their skills in Berlin!” 

This could be Rodri and Saka that Harry is referring to, but it’s a shame that he couldn’t pull through the names too. 

How did Harry cope with something a bit more detailed? What tactics should Gareth Southgate use? 

Harry: “Ah, the question of tactics for Gareth Southgate, a topic as rich as a deep-dish pizza! Given the high-stakes nature of the Euro 2024 final, Southgate should likely stick to his trusted 3-4-3 setup, which has served England well throughout the tournament. The three center-backs offer a solid foundation, allowing the wing-backs to bomb forward and provide width. This system also enables England to transition quickly from defense to attack, catching opponents off-guard with swift counter-attacks. 

In midfield, it’s crucial to maintain a balance between defensive solidity and creative output. A double pivot could provide the defensive shield to thwart the Spanish midfield maestros, while still allowing someone like Mason Mount or Phil Foden to operate in a more advanced role and link up with the forwards. 

Up top, having a dynamic front three can exploit Spain’s defensive vulnerabilities. Harry Kane, with his ability to drop deep and play as a false nine, could be key in pulling Spanish defenders out of position, creating space for the likes of Raheem Sterling and Bukayo Saka to penetrate with their pace and direct running. 

In short, keep it tight at the back, control the midfield battle, and use pace on the flanks to unsettle the Spanish defence. It’s all about balance and quick transitions for Gareth and the boys!” 

There’s quite a lot of detail here, and on the surface it sounds quite reasonable. Harry even picks out players in their correct positions. Unfortunately, the information is out of date: neither Mason Mount nor Raheem Sterling was in the England squad, let alone the first team. 

Was Harry caught playing away from home? 

It has to be said that Harry took a bit of coaching to become trustworthy. Early in the development, we asked Harry: “How’s Bellingham doing?” We got the response: “Jude Bellingham has been one of the standout performers in this tournament so far! The young English midfielder has been instrumental in his team’s progress to the quarterfinals, scoring a crucial goal in their 2-1 win over Senegal in the Round of 16. He’s been a box-to-box midfielder, controlling the tempo of the game and winning the battle in the middle of the park. If England wants to go all the way, they’ll need Bellingham to continue his impressive form!” 

This is a prime example of hallucination. The assistant found a reference to Jude Bellingham but thought it was appropriate to surface data from a game that never took place. England have only played Senegal once, and that was a 3-0 win for England in the 2022 World Cup. 

However, with a bit of one-to-one coaching, we were able to bring Harry into line.  

Final score 

In the face-off between AI and humans, who came out on top? 

Tables showing the final outcome of the match predictions for Euro 2024

Our AI predictor, which was fine-tuned throughout the tournament with match results as they happened, and latterly with data relating to players unavailable through injury and suspension, correctly predicted the outcome (the winner or a draw) 25 times out of 51 matches. In the group stages (before we introduced the effects of player changes), the success rate was 47% (17/36), and in the knockout stages the success rate was 53% (8/15). 

In terms of predicting the actual scores, the AI model managed to predict 6 scores correctly. 

Comparing this to the human predictions, when you aggregate the data from all participants (excluding the pundits), there were 23 correct outcomes predicted and 7 correct score lines. However, looking at individuals, one predicted 29/51 correct outcomes (57%) with 10 correct scores, while another predicted 25 correct outcomes but 13 correct scores. 

The best pundit (Chris Sutton from the BBC) predicted 25 correct outcomes and 8 correct scores. 

The cats were a bit of a letdown, and certainly no match for Paul the Octopus from the 2010 World Cup. They had a success rate of 33% (17/51). Unlike their human counterparts (or servants, as the cats see them), they are unable to predict score lines. As mentioned in the previous post, if we are generous and assume they might have got the score line right a third of the time, they would have improved their position in the table by four places. 

The introduction of Harry, the AI Pundit, was an interesting experiment. We’ve been able to show that these AI Assistants can be prompted to respond in a way that expresses some personality, a feature that can be adapted to present a brand in a unique way. 

While the assistant interpreted the questions well, the answer quality was a bit wayward at times. However, this has more to do with the data which the model can reference than with the quality of the interpretation. The larger and more current the data set, the better the results will be. 

What does this tell us? And what could be done better next time? 

The outcome from this experiment shows us that the AI predictor performed no worse than the human ‘group-wisdom’, but then again it wasn’t significantly better.  

It should be recognised that the model only had a limited data set to refer to; there’s always more data that could be incorporated. But it does highlight the fact that AI models are still dependent on the data they are given to work with. When built correctly, a model is unlikely to go rogue and bring in data that will render the output unusable. 

Prompting the AI Assistant to return answers in a particular style was an interesting exercise in customisation. If you have a customer-facing assistant, and you want it to reflect a particular characteristic of your brand, this is absolutely possible, allowing you to give your assistant some personality. 

If you are interested in finding out more about how Assistants and AI Platforms can help automate your business, save time and improve business efficiency, please feel free to get in touch using the Chat facility below or the Contact Us form.
