If you’re as old as I am, you might remember that the biggest hit on Guns N’ Roses’ Use Your Illusion was the iconic rock ballad, “November Rain.”
It was cold. It was sweet. And that Slash solo? Fire. It was the soundtrack of 1991.
What you might not recall (or choose to forget) was that the same album included a cover of the McCartney/Wings ode to James Bond, “Live and Let Die.” It turned out to be a sleeper hit, ranking near the top of a double album that went platinum seven times.
I bring this up because, sometimes, there’s no logic to predicting human behavior. It’s an illusion. You can try to explain why a cover of a weird post-Beatles tune – soaked with an appropriate layer of pre-Grunge acid – appealed to the kids of my generation. But it ended up being one of the most popular songs on an album that sold 39 million copies.
Apple also knows a lot about music. And phones. And movies. Heck, they seem to have all media caught in their intrepid net of tech domination. Case in point: At their annual Worldwide Developers Conference (WWDC) – which kicked off today and runs through June 13th – they introduced some stunning enhancements across nearly every one of their products.
I watched most of the opening, and I could spend a lot of time conveying how breathtaking the new Vision Pro UI is, how it brings flat 2D images to life with dramatic depth. I could wax philosophical about the emergence of spatial content, and how Apple is altering the fabric of reality with drag-to-life capabilities for retail products (try adding a couch to your living room from a web page and watching it appear IRL).
Instead, I’m going to zero in on everyone’s favorite subject: AI. Because there’s not enough fatigue around a subject that everyone wants to bank their future roadmaps on. If I sound cynical, it’s because I’m subjected to a lot of demos in our CMS and DXP corner of the software world, and I’m always trying to separate the hype from the hope.
So here's the deal: Coinciding with WWDC, Apple just released some new research that casts doubt on the reasoning abilities of advanced AI models. The study puts a sharp point on a truth that many have already suggested: AI isn’t that smart, and it’s a long way from pulling any sentient, Skynet-level shenanigans.
Good news? Maybe for those who were concerned about its propensity to go all self-aware Terminator on us. But for hype machines promoting AI’s grand promise of higher-order thinking and logic, this might sour the apples, so to speak.
The research paper, The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, is a comparative study of Large Reasoning Models (LRMs). It acknowledges that these models show real promise on a variety of reasoning benchmarks. But when the rubber meets the road, their limitations remain, as the authors put it, “insufficiently understood.”
Said another way: Not good.
In the opening salvo of the research, the authors explain how they systematically investigated the gaps in how these models handle mathematical and coding problems, and how the accuracy of their output degrades as the challenges become more difficult. Through some extensive experimentation, they arrived at a core problem:
LRMs face a complete accuracy collapse beyond certain complexities.
Wow. And a deeper dive into the paper's content – a 30-page railgun of diagrams comparing failure tracks along a horizon of problem complexity – fires a cannon of chaos, illuminating just how real these limitations might be.
BTW, that’s actually a song title on Use Your Illusion I.
So, what’s all the fuss about this research? Well, if you believe the doomsayers, it describes a “total collapse” scenario, suggesting that leading frontier LRMs from OpenAI, Anthropic, and DeepSeek (there's a great article on that one here) fall apart when trying to solve really complex problems – and they don’t yield any remarkable benefits over their standard LLM counterparts.
This scenario is based on the paper’s conclusion that these models aren’t, in fact, able to “think” or reason for themselves. They’re simply mimicking whatever they’ve been trained on. When given new, highly complex problems, even the most advanced LRMs break down. And on simpler tasks, they can actually be less reliable than standard LLMs.
This is a “big pause” moment given the untamed investment in LRMs, which are touted for superior logic and reasoning when solving mathematical proofs, making decisions, and processing diverse data types with multi-modal capabilities. We’ve known about their heftier computational requirements – more tokens, slower response times – but those limitations were pooh-poohed in light of their potential to handle higher-order problems.
To conduct its tests, Apple employed a series of “puzzles” that allowed for fine-grained control over problem complexity (they also designed them to minimize the contamination that’s common in established benchmarks). It then compared the reasoning-enabled responses of LRMs (such as o3-mini, DeepSeek-R1, and Claude-3.7-Sonnet-Thinking) against those of non-reasoning LLM counterparts.
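To make that concrete, here’s a minimal sketch of what this kind of puzzle-based test can look like, using a Tower-of-Hanoi-style setup where the number of disks acts as the complexity knob and a checker verifies whatever move sequence a model proposes. The names and structure below are my own illustration, not Apple’s actual harness.

```python
# Illustrative sketch only (not Apple's code): a Tower of Hanoi instance where
# the number of disks controls complexity, plus a checker for a proposed solution.

def solved(pegs, n_disks):
    """True when every disk has been moved to the last peg, in order."""
    return pegs[-1] == list(range(n_disks, 0, -1))

def check_solution(n_disks, moves):
    """Replay a list of (from_peg, to_peg) moves; verify legality and success."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # disk sizes, largest at the bottom
    for src, dst in moves:
        if not pegs[src]:
            return False  # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # illegal: larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return solved(pegs, n_disks)

# The optimal solution length grows as 2^n - 1, so raising n_disks scales the
# difficulty cleanly without changing the nature of the task.
print(check_solution(2, [(0, 1), (0, 2), (1, 2)]))  # True
```

The appeal of this setup is that difficulty becomes a dial rather than a vibe: you can generate fresh instances at any size, which leaves little for a model to have memorized from its training data.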
The paper provides more specifics, but what the authors discovered is that frontier LRMs experience a “complete accuracy collapse,” which results in a kind of logic bedlam. They observed surprising limitations in the LRMs’ ability to perform exact computations, including a failure to benefit from explicit algorithms and inconsistent reasoning across the different puzzle types.
There’s a weird phenomenon at play here. As the puzzles grew more complex, the models should have engaged in longer chain-of-thought computation, thus consuming more resources. Instead, beyond a certain complexity they consumed less, effectively giving up. Throwing in the towel, so to speak.
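If you wanted to spot that drop-off yourself, the bookkeeping is simple: run a batch of puzzles at each complexity level, record accuracy and reasoning-token counts, and look for the point where token usage starts falling instead of rising. A rough sketch follows, with query_model as a hypothetical stand-in for whatever LRM API you’d actually be testing.

```python
# Rough sketch of the measurement, not a real benchmark. query_model is a
# placeholder for an actual LRM call that returns correctness and token usage.

from statistics import mean

def query_model(puzzle):
    """Stand-in: should return (is_correct, reasoning_tokens) from a real LRM."""
    raise NotImplementedError

def sweep(puzzles_by_level):
    """Summarize accuracy and reasoning effort at each complexity level."""
    results = {}
    for level, puzzles in sorted(puzzles_by_level.items()):
        outcomes = [query_model(p) for p in puzzles]
        results[level] = {
            "accuracy": mean(ok for ok, _ in outcomes),
            "avg_reasoning_tokens": mean(toks for _, toks in outcomes),
        }
    return results  # falling token counts at high levels is the "giving up" signature
```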
Not knowing the “why” in this case is particularly harrowing. As Apple states in the paper, they believe the lack of systematic analyses investigating these questions is due to “limitations in current evaluation paradigms.” In other words, we don’t even know how to test for the problem. And that’s a problem.
This all might seem a little morose, but no one’s saying the future of AI is DOA. Well, not me, anyway. Still, the plague of inaccuracies has been spreading of late, and Apple has been at the center of it.
In January, shortly after launching its shiny new Apple Intelligence feature on late-model iPhones, the company temporarily pulled the feature after the technology produced fake and misleading summaries of news headlines that appeared almost identical to regular push notifications.
The reversal was jarring, given the heavy holiday marketing they had done to promote their new AI capabilities. But this isn’t just an Apple problem. AI hallucinations are as prevalent as ever. A July 2024 study from Cornell, the University of Washington, and the University of Waterloo revealed that top AI models still can’t be fully trusted, given their propensity for making stuff up.
No doubt, the timing of this research release was carefully orchestrated. With so many new product upgrades and announcements – and a gnarly video of Craig Federighi, Apple’s SVP of Software Engineering, circling the roof of the Cupertino saucer in a Formula 1 racer – there was ample cover. This was comms strategy at its finest, stimulating just the right amount of public discourse.
There are, of course, those who might suggest that Apple’s delayed arrival in the “Big AI Game” is a key motivator. While Apple has forged a reputation for innovation, they have also been known to lag on trends like TV, autonomous vehicles, and other initiatives that fizzled. Given every player’s struggles with AI accuracy, a vilification of LRMs might provide the ammunition to blow up the competitive matrix in an already tight race.
So was it an illusion? Was this the LRM hype machine in overdrive, with Slash shredding outside the chapel on a dusty plain? I think we can safely say that some of it was. And herein lies the problem: With AI, we’re moving from research to product so quickly that we don’t always have the answers. We’re often not even asking the right questions. This research might be the best example of that.
For the CMS and DXP industries, Apple is blazing some interesting trails with its portal to spatial content. There's a lot to unpack regarding how that content gets stored, structured, and expressed in three dimensions, and that's exciting stuff that companies like Cloudinary are already preparing for.
At the same time, this AI research is offering us a moment of clarity that can't be overstated. As I said earlier this year in my CMS Kickoff 25 wrap, we need to slow down and be more thoughtful in this race to harness AI. The risks and consequences are simply too great, and the accuracy of content is in the crosshairs. Maybe we can use the illusion as a signal – and find better pathways to test, analyze, and validate what we’re building.
Slowing down is hard advice to follow, especially in this hyperactive digital marketplace.
But it might be the best prescription for writing the next hit.
That – and making sure Slash gets his solo.