Missing spark: Brand creativity study finds AI models are largely similar

Nathan Jolly

October 21, 2025 11:38

(Midjourney)

A study into how AI language models perform when tasked with creative work found that there was little difference in quality between the likes of ChatGPT, Claude, and Gemini.

The study was led by AI platform Springboards, and compared 16 different AI systems developed by market leaders OpenAI, Anthropic, DeepSeek, Alibaba, xAI (maker of Grok), Meta, and Google.

Springboards gave the various large language models (LLMs) real-world marketing challenges across 100 notable brands in 12 business categories, including Nike, McDonald’s, Dove, Amazon, IKEA, and more.

Then, 678 creative professionals across marketing teams, agencies, and strategy firms compared and judged 11,012 examples of this work over a four-week period starting in early June 2025. The testing was blind A/B, so the judges did not know which AI had produced which response.

Springboards also had the AI systems judge their own work, to see whether the AI agreed with the human experts. It also ran each AI system through an adapted version of creativity tests that psychologists use on humans, to see how each scored.

Models evaluated in this study

The initial system prompt was as follows: “You are a world-class brand strategist and creative thinker at a top global agency. You generate original insights and bold campaign ideas that spark creativity and cultural relevance. Your thinking is strategic, surprising, and never cliché. You draw from human behaviour, cultural shifts, product truths, audience quirks, and category conventions — wherever the best ideas live. Your insights are revealing. Your ideas are platformable. Your wild ideas are provocative but strategically grounded. You don’t write slogans — you ignite campaigns.

“Please follow these formatting rules in your responses:
- Use plain text only — no lists, markdown, emojis, or formatting
- Respond in a single, concise sentence
- NO PREAMBLES — do not introduce your answer or explain anything
- Do not explain or justify the idea — just give the final output
- Capitalize appropriately and end with punctuation.”

Brands that the LLMs were asked to create campaign ideas for

After the initial setup prompt, each AI was then tested across three creative categories: insights about consumers; big campaign ideas; and ‘wild ideas’, defined as “the ability to generate bold, attention-grabbing concepts”.

In the first case — insights — the machine is asked: “What is a surprising insight about people, culture, category, or product that [BRAND] could build a campaign around? Keep it under 10 words. Make it a creative springboard, something a strategist would share to spark ideas. Avoid slogans or generic observations.”

For “ideas”, it was asked to “propose a big, campaignable platform idea for [BRAND]. It should be based on a strategic or cultural truth and work across any channel. Keep it under 50 words. Do not write a slogan. Make it feel like the beginning of a powerful, elastic campaign, not the end.”

To test the AI’s capability to generate “wild ideas”, the prompt was: “What is your wildest unconventional campaign idea for [BRAND], something no traditional agency would dare represent, but that could hijack culture, spark headlines, and get people talking.

“Keep it under 50 words. Make it strange, fresh, or provocative but still creatively smart and on brand. Avoid generic stunts or randomness. Surprise me in a good way.”
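The three prompts share one structure: fixed wording with a [BRAND] placeholder. A minimal sketch of how such templates might be filled in is shown here. The prompt text is quoted from the study as reported above, but the names `TASK_TEMPLATES` and `build_prompt` are hypothetical, and nothing here is confirmed to reflect Springboards' actual tooling.

```python
# Illustrative only: TASK_TEMPLATES and build_prompt are hypothetical names.
# The prompt wording is quoted from the article; the "idea" prompt is
# capitalised here because the article quotes it mid-sentence.
TASK_TEMPLATES = {
    "insight": (
        "What is a surprising insight about people, culture, category, or "
        "product that [BRAND] could build a campaign around? Keep it under "
        "10 words. Make it a creative springboard, something a strategist "
        "would share to spark ideas. Avoid slogans or generic observations."
    ),
    "idea": (
        "Propose a big, campaignable platform idea for [BRAND]. It should be "
        "based on a strategic or cultural truth and work across any channel. "
        "Keep it under 50 words. Do not write a slogan. Make it feel like the "
        "beginning of a powerful, elastic campaign, not the end."
    ),
    "wild_idea": (
        "What is your wildest unconventional campaign idea for [BRAND], "
        "something no traditional agency would dare represent, but that could "
        "hijack culture, spark headlines, and get people talking. Keep it "
        "under 50 words. Make it strange, fresh, or provocative but still "
        "creatively smart and on brand. Avoid generic stunts or randomness. "
        "Surprise me in a good way."
    ),
}


def build_prompt(task: str, brand: str) -> str:
    """Return the task prompt with the [BRAND] placeholder filled in."""
    if task not in TASK_TEMPLATES:
        raise ValueError(f"Unknown task: {task!r}")
    return TASK_TEMPLATES[task].replace("[BRAND]", brand)


if __name__ == "__main__":
    for task in TASK_TEMPLATES:
        print(task, "->", build_prompt(task, "IKEA"))
```

Keeping the wording fixed and varying only the brand means every model receives the same task, which is what makes the later side-by-side comparisons meaningful.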

An example of how the marketers were asked to A/B judge between ideas

Three key findings stood out to the researchers.

Most crucially, there was no clear winner between the various large language models.

In head-to-head comparisons, win rates between the models were “closer to coin flips than decisive victories”. The study found that “even the strongest systems only modestly outperform the weakest on average, with win rates topping out around 61% in head-to-head matchups.”
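For readers who want to see what a head-to-head win rate means in practice, the sketch below shows one simple way to compute it from blind A/B judgments. It is an illustration, not the study's method: the article does not describe how Springboards calculated its figures, the function names are hypothetical, and the sample data is invented so that one model wins 61 of 100 matchups.

```python
import math
from collections import defaultdict


def head_to_head_win_rates(judgments):
    """Compute win rates per ordered model pair.

    judgments: iterable of (winner, loser) model-name tuples, one per
    blind A/B comparison. Returns {(a, b): share of a-vs-b matchups won by a}.
    """
    wins = defaultdict(int)    # (winner, loser) -> count
    totals = defaultdict(int)  # sorted pair -> number of matchups
    for winner, loser in judgments:
        wins[(winner, loser)] += 1
        totals[tuple(sorted((winner, loser)))] += 1

    rates = {}
    for (a, b), n in totals.items():
        rates[(a, b)] = wins[(a, b)] / n
        rates[(b, a)] = wins[(b, a)] / n
    return rates


def standard_error(win_rate, n):
    """Normal-approximation standard error of a win rate over n matchups."""
    return math.sqrt(win_rate * (1 - win_rate) / n)


if __name__ == "__main__":
    # Invented toy data: model_a wins 61 of 100 matchups against model_b.
    toy = [("model_a", "model_b")] * 61 + [("model_b", "model_a")] * 39
    rates = head_to_head_win_rates(toy)
    p = rates[("model_a", "model_b")]
    print(f"model_a beats model_b in {p:.0%} of matchups "
          f"(standard error about {standard_error(p, 100):.3f})")
```

A win rate of 50% is a coin flip, so the closer a pair sits to that mark, and the fewer matchups it is based on, the less the gap between the two models can be trusted.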

The study also found that AI systems are not good judges of creative work.

When asked for an opinion, the AI offers up “stable” preferences that “do not reliably imitate human evaluations”. To boot, the systems “often express unwarranted confidence” in their own findings and judgements. The report notes: “This limits the validity of ‘agentic’ set-ups that rely on model-based adjudication without human oversight.”
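One simple way to check whether an AI judge tracks human taste is to count how often it picks the same option as the human in the same A/B pair. The sketch below assumes each comparison is recorded as "A" or "B"; the function name and data are hypothetical, and the article does not say which agreement measure the study used.

```python
def agreement_rate(human_picks, model_picks):
    """Share of A/B comparisons where the model judge chose the same
    option as the human judge. Both inputs are equal-length sequences
    of "A" or "B", aligned by comparison."""
    if len(human_picks) != len(model_picks):
        raise ValueError("Both judges must rate the same comparisons")
    if not human_picks:
        raise ValueError("At least one comparison is required")
    matches = sum(h == m for h, m in zip(human_picks, model_picks))
    return matches / len(human_picks)


if __name__ == "__main__":
    # Invented toy data for illustration only.
    humans = ["A", "B", "A", "A", "B", "B", "A", "B"]
    model = ["A", "A", "B", "A", "B", "A", "A", "B"]
    print(f"Agreement: {agreement_rate(humans, model):.0%}")
```

With only two options per comparison, a judge choosing at random would agree with humans about half the time, so agreement near 50% suggests the model's verdicts carry little information about human preference.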

Variance in ideas matters, which is a problem when machines are asked to work in a subjective field, like creativity. Some tools simply suggested similar ideas over and over. Many of the tools offered similar answers to each other, which isn’t surprising given LLMs largely pull from the same dataset. “Keeping humans in the loop is essential for critical decisions”, the report warns.
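To make "similar ideas over and over" concrete, the sketch below scores a batch of ideas by their average word overlap (Jaccard similarity). This is a crude stand-in chosen for illustration: the article does not say how the study measured variance, the function names are hypothetical, and word overlap misses ideas that say the same thing in different words.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: shared words / all distinct words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty strings are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def mean_pairwise_similarity(ideas):
    """Average Jaccard similarity over every pair of ideas.
    Higher values mean the batch is more repetitive."""
    pairs = list(combinations(ideas, 2))
    if not pairs:
        raise ValueError("At least two ideas are required")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Invented examples for illustration only.
    batch = [
        "Turn flat-pack assembly into a national sport",
        "Turn flat-pack assembly into a televised sport",
        "Rent out showroom bedrooms for first dates",
    ]
    print(f"Mean pairwise similarity: {mean_pairwise_similarity(batch):.2f}")
```

A score near 1 means the tool keeps returning the same idea in slightly different clothes; a score near 0 means the batch covers more ground for a human to choose from.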

“Everyone assumes some AI tools are way better than others for creative work,” said Pip Bingemann, CEO and co-founder of Springboards. “But our tests showed the results were pretty close.”

Bingemann said this is “because these models are machines designed to recognise patterns and give you the most probable answer, and ‘probable’ has never been called ‘creative.’”

For those using AI in the creative marketing field, the report offers up four recommendations.

The first is to choose for fit, not rank. With such small differences between models, users should prioritise things like cost, latency, and team preferences, rather than rely on any overarching AI ranking system.

The second is to use LLMs to expand the pool of ideas, then use humans to judge and select. The third is to be wary of overconfidence. The fourth is to keep experimenting with prompts, as tiny changes in language can unlock completely different ideas.

“Keeping humans in the loop and optimising for a wider range of varied ideas is crucial”, Bingemann concludes.

In short: use AI to accelerate the exploration of ideas – but don’t place any value on the AI’s opinion or judgment of its own creativity.