Alternative Text Stephen Birch | 21 October 2024 |

Unhappy with your AI searches? The Devil’s in the detail! Comparing AI Performance

AI generated image of a devil surrounded by data.

Do you have a favourite AI platform? I know I do. I like the output I get from Perplexity, so much so that I’ve largely ignored the other tools out there. Therefore, I thought it was about time that I looked into AI performance and took a closer look at the alternatives. 

In this instance, I am using these tools as a glorified search engine and to create fairly basic content, but I thought it was an interesting test to see how they differ in their abilities to deliver relevant and (more importantly) accurate information. The first thing I learned was that the tools definitely do not behave the same, despite being asked for the same thing. The second thing I learned was that you can’t assume that the AI will interpret your prompt as you wanted it to, and on the whole the more detail you include in the prompt, the more accurate the output…most of the time.

There is an overwhelming message to take away from this exercise: do not believe everything you read on your AI. After all, the devil is in the detail.

The prompts

As with many computer systems, AI operates on a GIGO based: Garbage In, Garbage Out. Ask a nonsense question and you’ll end up with nonsense. It will also make assumptions about the questions you ask, so the more detail you provide, the better the response you will receive.

You will quickly spot a theme in the prompts I have used in this test. I am a follower of Sheffield Wednesday Football Club, and those in the know will be aware of the club’s rivalry with Barnsley Football Club.* The questions I have asked can also be easily fact-checked to confirm the accuracy of the responses. I asked four AI tools exactly the same things, all within continuous threads. I didn’t want to start a new chat or thread for each question.

I asked each platform the following:

  • Please tell me about Sheffield Wednesday Football Club and Barnsley Football Club.
  • Please could you write a couple of paragraphs comparing Sheffield Wednesday Football Club and Barnsley Football Club?
  • Who are the longest serving players for Sheffield Wednesday Football Club and Barnsley Football Club?
  • Who, historically is the top scorer for Sheffield Wednesday Football Club and Barnsley Football Club?
  • Please could you write a four paragraph summary of the last game in which Sheffield Wednesday Football Club played Barnsley Football Club on 29 May 2023? Please also provide data about the venue, attendance, scorers and assists and match odds, and statistics for fouls, possession, passes, and shots. Please could you add a couple of sentences about what the result meant for both clubs? I would like this to be written as if you were a Sheffield Wednesday fan.

 

You will note that the prompts move from the vague to the more specific.

The tools

I selected four of the best-known AI tools for the experiment:

 

All four are AI large language models designed to assist users with a wide range of tasks, including answering questions, generating content, and providing information. They all excel in understanding and generating human-like text, allowing for conversational interactions.

ChatGPT is based on OpenAI’s GPT-4 architecture, Claude uses Anthropic’s proprietary architecture, Copilot combines GPT-4 and other Microsoft technologies (si it will be interesting to see how responses compare against ChatGPT) and Perplexity is based on the Mistral 7B model.

Each model specialises in different ways:

  • ChatGPT (GPT-4o mini): Versatile, with strengths in creative tasks and coding.
  • Claude 3.5 Sonnet: Focuses on safety and ethical considerations in its responses.
  • Microsoft Copilot: Specialises in integrating with Microsoft’s ecosystem, enhancing productivity in Office applications and Windows.
  • Perplexity (pplx-7b-online): Optimised for web browsing and quick information retrieval.

 

The output

As you might expect, I generated a fair amount of text from these queries, so rather than include the full details of all conversations here, I’m just going cover the main observations and include extracts from the responses for comparison purposes. Those of you who want to see the details can find the full transcripts here.

Please tell me about Sheffield Wednesday Football Club and Barnsley Football Club.

I started with quite a generic question, leaving it up to each platform to interpret my request.

  • ChatGPT provided me with a short history of the two clubs and key achievements and a bit of narrative about their fortunes and fans.
  • Claude focused on outline facts with little narrative, but did suggest topics for follow-on questions.
  • Copilot simply states the basic facts (founding date, nickname, stadium, current league, achievements and local rivalries), but does provide links to the references used to obtain the information–in this case two Wikipedia pages.
  • Perplexity provides summary paragraphs (rather than brief bullet points) of key facts highlighting some of the key highlights in both club’s histories.

 

Screenshot from Perplexity showing pictures and videos relating the search

Perplexity added a selection of photos and videos relating to the search.

Please could you write a couple of paragraphs comparing Sheffield Wednesday Football Club and Barnsley Football Club?

Here, I was a bit more prescriptive about what I wanted as an output, asking for a couple of paragraphs of text and for the platforms to actually ‘think’ about the data available and compare one club against the other.

  • ChatGPT offers similar data presented in the first question but brings it together in the text. For example: ‘The clubs also differ in size and stature. Sheffield Wednesday’s Hillsborough Stadium, with a capacity of around 39,732, dwarfs Barnsley’s Oakwell, which holds about 23,287 fans.’
  • Claude’s comparison is less fact-based and feels more interpretive. For example: ‘While Sheffield Wednesday has generally maintained a higher profile and attracted more high-profile players, Barnsley has earned a reputation for developing young talent and playing an energetic, pressing style of football.’
  • Copilot, meanwhile, has taken the key facts from before and created paragraphs, which isn’t a million miles away from the ChatGPT offering for the first prompt. This shouldn’t be too surprising as Copilot is based on GPT-4 architecture.
  • Like Copilot, Perplexity has taken the key facts and presented them in (as requested) a couple of paragraphs. A nice extra touch is that Perplexity has included some images and videos which relate to the search and provides links to the 8 different sources it has used to extract the information.

 

 

Screenshot of the sources that Perplexity used for response to the search prompt.

Perplexity found 8 relevant information sources related to the prompt.

Who are the longest serving players for Sheffield Wednesday Football Club and Barnsley Football Club?

So far, so good. The open-ended prompts have been interpreted accurately and answered correctly. What happens when I ask for specific details? Surely finding out who the longest serving player is can be cross-checked and verified?

  • ChatGPT tells me that Liam Palmer (legend!) is the longest service Sheffield Wednesday player, having made his debut in 2010, but offers up Luke Thomas (debut 2019) and Callum Styles (2018) as their long-standing players). These are all current squad members
  • Claude goes for a historical perspective, telling me that David Hirst made more than 350 appearances for Sheffield Wednesday between 1986 and 1997, while Bobby Hassell made more than 270 appearances for Barnsley between 1992 and 1996.
  • Copilot tells me about Liam Palmer again for Sheffield Wednesday, but tells me that Jordan Williams is Barnsley’s longest-serving player, having joined the club in 2018. Where did Copilot go for this information? efl.com for the Liam Palmer data, and the Barnsley Chronicle for Jordan Williams.
  • Despite interrogating the same 8 references as before, Perplexity really struggled with this question. It identifies Josh Benson as their currency longest serving player having been with the club for 1,196 days (as of the search date – precise, or what), but this means that he joined the club in mid-2021. For Sheffield Wednesday, Perplexity tells me, ‘While I don’t have specific information about their longest-serving players, it’s common for clubs to have a mix of experienced players who have been with the team for several years and newer additions.’ That’s not all that helpful!

 

Maybe not. This highlights the need to be precise in the prompts. I was unclear whether I meant longest serving in the current squad, or longest serving of all time.

Who, historically is the top scorer for Sheffield Wednesday Football Club and Barnsley Football Club?

Hopefully this is less ambiguous. Please just tell me the person who has scored the most goals for the clubs in all time. Can we all agree on that? What do our ‘experts’ say?

  • ChatGPT: For Sheffield Wednesday, Andrew Wilson scored 217 goals in 545 appearances between 1900 and 1920. For Barnsley, Ernie Hall scored 131 goals in 310 appearances between 1921 and 1938.
  • Claude comes up with the same names identifying that Hine scored his goals over two separate spells at the club but tells me that Wilson only scored 216 goals. Could one have been disallowed by VAR?
  • Copilot gives me the same results as Claude. Interesting as it uses GPT-4 but differs slightly from ChatGPT.
  • Perplexity again struggles here. It can find the data about Andrew Wilson but is unable to find the information about Barnsley’s leading goal scorers. It rather unhelpfully tells me. ‘it’s worth noting that historical goal-scoring records are often cherished parts of a club’s heritage.’

 

It seems that Perplexity has got itself stuck in a loop of searching the same selection sources it identified earlier in the thread. As a test, I started a new thread in Perplexity and asked the same question. This time Perplexity told me that it couldn’t find accurate information about Sheffield Wednesday’s all-time top scorer but was able to identify Ernie Hine as the leading goal scorer, and listed four other leading goal scorers from the club’s history.

Please could you write a four paragraph summary of the last game in which Sheffield Wednesday Football Club played Barnsley Football Club on 29 May 2023? Please also provide data about the venue, attendance, scorers and assists and match odds, and statistics for fouls, possession, passes, and shots. Please could you add a couple of sentences about what the result meant for both clubs? I would like this to be written as if you were a Sheffield Wednesday fan.

Now we’re putting the platforms to the test. I am being more prescriptive about my requirements in terms of the length of the response, a particular match between the two clubs, the specific stats I’m interested in and an interpretation of the outcome of the game. I’ve also asked the AI to write the piece from a particular perspective. Let’s see how they each got on.

Screenshot from Perplexity showing pictures and videos relating the search

Perplexity showed some images and videos from the football match asked about.

  • ChatGPT correctly identified the game as the 2023 EFL League One Play-Off Final which would see the winners promoted to the Championship. If I’m being honest, it’s a game that Wednesday were lucky to be in having been 4-0 down to Peterborough after the first leg of the semi-finals. This is a nuance that both Copilot and Perplexity picked up on.
    ChatGPT returned four paragraphs and helpfully highlighted the stats I asked for using bold font. The writing felt like it favoured Sheffield Wednesday, ‘For Sheffield Wednesday, the victory was monumental. It marked a return to the Championship after two seasons in League One, a vital step in restoring the club to its former glory. Wednesday’s win meant not just promotion, but the chance to rebuild momentum for a brighter future, and it left their fans dreaming of bigger things ahead.’
    However, there was one glaring error. ChatGPT told me, ‘Sheffield Wednesday were reduced to 10 men following Adam Phillips’ red card in the 49th minute for a dangerous challenge’, and then, ‘In terms of statistics, Sheffield Wednesday had slightly less of the ball, with 47% possession compared to Barnsley’s 53%, reflecting Barnsley’s efforts to take control after the red card.’ Now Adam Phillips was sent off in the 49th minute, but he was a Barnsley Player, and it was actually Wednesday that had more of the possession.
  • Claude starts strongly with a Wednesday bias, ‘As a Sheffield Wednesday fan, I’m thrilled to recount our glorious victory over Barnsley in the League One play-off final on May 29, 2023, at Wembley Stadium. The atmosphere was electric, with 79,380 fans in attendance, creating a deafening roar that echoed throughout the iconic venue.’ It also included the stats I asked for including the ‘brilliant cross from Lee Gregory’ which resulted in the only goal of the game. This is perhaps my favourite response of the four. Claude signs off with, ‘For us Wednesdayites, it was a day of pure ecstasy, a memory we’ll cherish forever, and a launching pad for what we hope will be a bright future back in the second tier of English football.’
  • The Copilot response reads more like a match report with a less partisan approach than Claude, but it was able to provide the stats that I requested. However, there was some discrepancy in the stats reported. For example, Claude tells me that Wednesday committed 14 fouls to Barnsley’s 12, while Copilot says that Wednesday committed 14 to Barnsley’s 16. In terms of shots, Claud says that Wednesday registered 14 to Barnsley’s 13, while Copilot reported that it was 15 to 12 in Wednesday’s favour.
  • Perplexity appears to have got fully on board with the ‘writing as a fan’ request. ‘As a Sheffield Wednesday fan, I’m thrilled to recount our glorious victory over Barnsley in the League One play-off final on 29 May 2023.’ ‘Our lads dominated the statistics, demonstrating their superiority on the pitch.’ Apparently Wednesday had 17 shots with 9 on target, while Barnsley had just 9 shots with only 3 on target. It was also able to tell me about the game stats that I asked for, including match odds. Interestingly, though Perplexity didn’t register the sending off, a key moment in the game that was, arguably, crucial to the game’s outcome.
  • As with previous results, Perplexity provided images that related to the search.

 

Summary

While this experiment has had quite a light-hearted approach, and the subject is perhaps not to everyone’s liking, I think that it has perhaps shown some of the variability in the outputs from different AI platforms and highlighted the fact that the results shouldn’t be taken at face value.

It also shows that the tools have strengths in different areas. In a previous blog (https://www.deeperthanblue.co.uk/from-predictor-to-pundit-how-ai-coped-with-euro-2024-match-prediction/) I wrote about the relative strengths of different Large Language Models, be that maths, reasoning and knowledge, coding or multitasking.

What this has told me is that I need to be selective about which AI tool I use depending on the task I want it to perform.

It has also reminded me that the more specific I can be in my prompts, the more detailed and accurate the results will be. After all the devil is in the detail.

*I could have referred to the Steel City rivalry between Sheffield Wednesday and Sheffield United, but I didn’t think that the comparison would have been quite as favourable towards the Owls, and since the late 1990s Owls fans have had more than their fair share of disappointment!

Related Articles

These might interest you

AI, Blog - 30 August 2024

From predictor to pundit – how AI coped with Euro 2024 match prediction.

You might have read our blog article about developing an AI model to predict the outcomes of the matches which Read More
AI, Blog - 28 June 2024

My Cat could do better! – Can AI help predict the Euros

Back in 2018 and 2021, DeeperThanBlue challenged analytics platforms to predict the outcomes of football matches played at the FIFA Read More