Most of the people I like think AI is a bubble. This is a tricky topic to discuss, because the “bubble” framing couples financial and technical issues. It’s like a sports fan debating “Is this player overrated?”. The answer depends on how good you think that player is, and how good you think other people think they are.
I don’t have anything much to add to the financial part of the “AI bubble” conversation. Various equity prices are based on very optimistic estimates about how AI will progress. This post is about the technological question. I’ll leave it to you to judge what sort of forecast any given asset price actually represents.
The main case I want to make is that performance probably won’t plateau — or at least, the common arguments for why it will plateau don’t add up. These arguments have been mostly the same for the last few years, and they’ve been consistently inconsistent with what’s actually happened.
People mostly say that AI performance will plateau because they believe that performance increases have been driven by brute-force scale. The argument goes that the industry has been spending more and more money on training and operating larger and larger models. The basic approach has diminishing returns, but there’s been exponential spending behind it, producing what looks like linear progress. That trend can’t last. Either some unknown breakthrough bails the industry out, or advancement halts at a point where the cost of providing the service is still much higher than the value people would actually be willing to pay for it. The gap matters all the more because, on the consumer side, people are mostly buying in because they expect the capabilities to improve, not because they see them as currently market-ready.
I also used to expect a plateau in capabilities, especially around GPT-1 and GPT-2. If I only had a few sentences to say to my past self, here’s one way to put it. You know how there were two big splashy lines of deep learning research: generative AI based on massive pretraining (GPT etc), and deep reinforcement learning (AlphaZero etc)? Well, what people did is put those together. It’s the least surprising thing ever when you put it like that. But there’s something importantly true about problem solving underneath it, which explains why “putting them together” works. For any answer you want, there’s a way to pose the question that makes the last leap really obvious. How do you find that question? Well, there’s going to be some intermediate step that makes the second-to-last leap obvious. Each new piece of text generated is like a move in a game, and the AI “wins” the game if it gets to the right answer.
If you asked GPT-1 a question like “Do Berlin, London and Mumbai together have a greater population than Australia?”, the model would start generating a reply that’s most similar to the sort of thing people said in response to questions like that in the web text it was trained on. But what you could do is ask it separate questions like, “What’s the population of Mumbai?”, “What’s the population of Australia?” and so on. Assume for the sake of argument it knows correct answers to all those questions. If you then feed all those answers back in as input alongside the original question, it will get the original question right.
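To make that concrete, here’s a minimal sketch of the scaffolding idea. The `complete()` argument is a hypothetical stand-in for a completion-only model; the point is just the shape of the loop: ask the sub-questions separately, then feed the answers back in as context before asking the original question.

```python
# Minimal sketch of the scaffolding idea. `complete(prompt)` is a
# hypothetical stand-in for a completion-only model like GPT-1.
def answer_with_subquestions(complete, question: str, subquestions: list[str]) -> str:
    facts = []
    for sub in subquestions:
        # Ask each sub-question on its own and record the model's answer.
        facts.append(f"Q: {sub}\nA: {complete(sub)}")
    # Feed the collected answers back in as context, then ask the original
    # question. The final leap is now a single easy step.
    context = "\n".join(facts)
    return complete(f"{context}\nQ: {question}\nA:")

# Example usage, mirroring the population question above:
# answer_with_subquestions(
#     complete,
#     "Do Berlin, London and Mumbai together have a greater population than Australia?",
#     ["What's the population of Berlin?",
#      "What's the population of London?",
#      "What's the population of Mumbai?",
#      "What's the population of Australia?"],
# )
```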
By GPT-3, the behaviour is much harder to characterise, but we can say roughly that it would generate the kind of reply that tended to be useful in the web text it was trained on, and that might happen to be a plan that breaks the question down step by step, which would guide it to the right answer. But this would only work if it had seen questions asking about sets of numbers, and seen the sorts of replies people give to them.
For Claude Opus and GPT-5, they’ve been trained to generate the intermediate questions themselves. In the same way that AlphaZero learns to play chess, when the model succeeds in generating a sequence of steps that leads it to the right answer, that gets reinforced. What it learns from this is the general process of breaking down problems. What sort of steps can I try that might lead me in the right direction? If I’m trying to do an operation like x + y + z > a, maybe I should fetch all the values, and then I can sum up x + y + z, and then when I’m looking at the two numbers together right in front of me I can generate the answer.
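Written out explicitly, that intermediate work is tiny; the point is just that each step makes the next one obvious. The values below are placeholders, not real populations.

```python
# Toy illustration of breaking x + y + z > a into steps. Placeholder
# values; this shows the shape of the intermediate work, nothing more.
x, y, z, a = 4, 9, 13, 25
total = x + y + z    # first fetch the values, then do the sum
answer = total > a   # the last step is a single comparison: 26 > 25
print(answer)        # True
```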
Describing models like GPT-1 or GPT-2 as “fancy autocomplete” was reductive but reasonable. For GPT-3 I’d call it directionally incorrect. For today’s models it’s a serious misunderstanding that will lead you to make wrong decisions that have a concrete impact on your life. I hope this post can help.
I’ll explain my credentials on the topic briefly before I go into more detail. I published my first paper on natural language processing in 2004, and did my PhD 2005-2009. I stayed in academia 2009-2014, then left to write a popular open-source NLP library and found a small startup. The stuff I worked on wasn’t in the main line of Large Language Model (LLM) research, but probably some of my software has been used in minor ways in various processing pipelines. I’d describe my research as two or three degrees of separation from the major breakthroughs. I was publishing at the same conferences, and some of my close collaborators were close collaborators of close collaborators of the people who made the most important contributions.
I was next to, rather than exactly inside, the main line of work on LLMs because I was wrong on some of the key technical questions. There’s also an odds-adjustment thing you have to do as a researcher. If the field as a whole is pursuing a few different paradigms, and your training and background is within one of them, that’s where your comparative advantage is going to be. My background was in linguistics, so that was the line of research I had the most to offer. If my background had been in neuroscience I would’ve signed up with a different camp. Nobody knows for sure which boat is heading in the right direction, so you go with whichever one seems to have a you-shaped hole in the team. The tricky thing is there’s a lot of pressure to explain why your paradigm is right, actually. It’s hard to remain exactly as agnostic as you should be, so that you can jump ship when it’s time to.
When GPT-3 was released, I thought it was fairly likely we’d invent some much more efficient neural network architecture, but that if we didn’t, the models would probably plateau. The reinforcement learning was the main thing I didn’t fully appreciate. Around the time ChatGPT was released, the idea of a “chain of thought” was introduced: in the prompt, you tell the model something like “think step by step”. One conceptual model for why this worked was that it allowed the LLM a flexible amount of computation. This isn’t wrong per se, but I don’t think it’s the most apt way to understand what’s going on. More recently, we started to get models that were explicitly trained to do “chain of thought” well. The industry term for this is “reasoning” models.
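In its simplest form it really was just prompt text, something like this (purely illustrative):

```python
# The chain-of-thought trick in its simplest prompt form (illustrative only).
prompt = (
    "Q: Do Berlin, London and Mumbai together have a greater population "
    "than Australia?\n"
    "Think step by step, then give a final answer.\n"
    "A:"
)
```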
Some people bristle at the implied anthropomorphism in terms like “chain of thought” and “reasoning”. Look, I get that, but it’s very difficult to think about, let alone talk about, complicated things without gathering a bunch of ideas up together in a plastic bag and sticking some sort of post-it note on it as a label. What I try to do is regularly take all the pieces back out and think through it all mechanistically, so I’m not just getting tricked by the implications of the words written on the post-it notes. Yes, the model is not alive and there will be some implications of the term “reasoning” that do not hold, and yes, terms of art are biased towards marketing sizzle and it’s fine to grump at that — but the grumping needs to be the start of the reply, not the end of it.
The “reasoning” mechanism is important because logic is work, even in a purely symbolic system. If you have a collection of statements and they include “if A then B” and also “if B then C”, you get to add “if A then C” to the collection. You haven’t added any information, but you’ve performed computation. If you have a bunch of statements you know are true, and you want to know whether they imply that some other statement is true, you don’t just get that for free. It takes computation.
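Here’s a minimal forward-chaining sketch to make that concrete. Deriving “if A then C” is a loop that actually has to run; it isn’t a lookup.

```python
# Minimal forward chaining over "premise -> conclusion" rules. Deriving
# new statements is computation: the loop has to keep running until
# nothing new can be added.
def forward_chain(facts: set[str], rules: list[tuple[str, str]]) -> set[str]:
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

# With the rules A -> B and B -> C, starting from A:
print(sorted(forward_chain({"A"}, [("A", "B"), ("B", "C")])))  # ['A', 'B', 'C']
```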
If the statements are in natural language, it mostly won’t be about binary true/false decisions, but you still have inference, and it still takes work. If I give you a bunch of statements, it takes work to figure out whether some of them are contradictory. You don’t just load them all in and get a database error when you attempt to believe something inconsistent with something else you already know.
LLMs of course don’t work like humans, but as far as we know, free inference is literally impossible. Nobody is expecting to invent some algorithm where you tell it the axioms of mathematics and it just, immediately tells you whether some theorem is true or false. It will always require lots of intermediate steps.
The language modelling objective allows the neural network to build a distributed representation of lots of general knowledge. The information is stored associatively, and there’s implicit acquisition of lots of layers of abstraction that allow the model to squeeze better predictions out of a given number of parameters. This distributed knowledge representation is great, it’s very powerful. But it still doesn’t just, automatically know all the things it knows. It can’t magically see all the implications. Logic is work.
A humble hypothesis behind the “reasoning” models is that there’s a general skill of coming up with intermediate steps that seem useful for working towards a conclusion. I call this hypothesis humble because I think it’s fairly self-evident. If you see two statements that seem contradictory, you can write a statement that points that out. You can reason from the general to the particular, or vice versa. It shouldn’t be controversial to say that the process of evaluating arguments or interrogating implications is mostly the same across domains, with different facts plugged in. What humans do is only tangentially relevant, but it also seems obvious to me that this is highly trainable. People are much better at it in a domain they’re familiar with, but if you’ve learned to reason in a few different domains, you end up with strategies that generalise quite well.
The “reasoning” models use reinforcement learning to acquire these general strategies. If you give some little puzzle to a purely completion-based model, and tell it to “think step by step”, it will often imitate some of the sorts of logical moves that a person might do. It’s playing the character of a person solving a problem, and this will sometimes result in the problem getting solved. If you know the answer to the problem, when it gets there you can reinforce the sorts of steps it used. If you give a person a problem, one of the things the person might say is “fuck off”, so from a language modelling perspective, that’s one of the outputs that has a certain probability. The reinforcement learning isn’t training it to say whatever a person says, though. It’s training it to produce text that leads to correct answers. If at some point during the training it tries out a step where it checks its work before getting to a final answer, and this proves very effective, it can start to do that consistently even if that’s not a common thing in the language examples.
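As a cartoon of that loop, not any lab’s actual recipe (the model object and its `sample_chain` and `reinforce` methods are hypothetical):

```python
# Cartoon of the reinforcement learning loop described above. The model
# object and its methods are hypothetical; real training recipes differ.
def train_on_problems(model, problems):
    for question, known_answer in problems:
        # The model generates its own intermediate steps plus a final answer.
        steps, final_answer = model.sample_chain(question)
        # The reward depends only on whether the conclusion is correct...
        reward = 1.0 if final_answer == known_answer else 0.0
        # ...but the update reinforces the whole sequence of steps that got
        # there, so useful moves (fetch the values, check your work) become
        # more likely next time.
        model.reinforce(question, steps, final_answer, reward)
```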
The size-based plateau argument really only applies to the completion model. If we want to get better at the language modelling task itself, yeah we need to either do some new innovation, do the details of data and training better, or just make the model bigger. There’s always going to be diminishing returns from merely bigger. The “reasoning” part gives us two additional levers to pull to make LLMs more useful, however.
The first lever is that you can let it run for longer. You can make it do more reasoning, and try to get a better result that way. Without the reasoning, you could get slightly better results by just sampling more outputs, but it was a pretty limited technique.
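The old trick looked roughly like this: sample several completions and keep the most common final answer (`generate_answer` here is a hypothetical stand-in for one sampled reply). It helps a bit, but you’re mostly re-rolling the same distribution, whereas letting a reasoning model run longer spends the extra compute on new intermediate steps.

```python
from collections import Counter

# Sketch of the "just sample more outputs" approach: majority vote over
# several sampled answers. `generate_answer` is a hypothetical stand-in.
def majority_vote(generate_answer, question: str, n: int = 16) -> str:
    answers = [generate_answer(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```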
The second lever is more reinforcement learning. There’s not much reason to believe that current models are at any sort of limit for how good they could potentially be at the general strategy of working through intermediate statements towards a conclusion. And this is even before you factor in the ability to execute commands, such as searching the web, drawing diagrams, etc.
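The tool-use part is just a loop around the text generation. Something like this sketch, where `model_step` and the tool functions are hypothetical stand-ins rather than any real provider API:

```python
# Sketch of a tool-use loop: the model either calls a tool or gives a
# final answer, and tool results get appended to the context. All names
# here are hypothetical, not a real provider API.
def run_agent(model_step, tools: dict, question: str, max_turns: int = 10):
    context = [question]
    for _ in range(max_turns):
        action = model_step(context)   # e.g. {"tool": "search", "input": "..."}
        if action.get("final_answer") is not None:
            return action["final_answer"]
        result = tools[action["tool"]](action["input"])
        context.append(f"{action['tool']} -> {result}")
    return None  # ran out of turns
```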
If you think we’ll quickly hit a ceiling on learning reasoning strategies, it’s worth thinking about the interaction of these two levers as well. The models can learn to backtrack when they see they’re at a dead end or when they hit a contradiction, and in fact they already do this to some extent. With backtracking, the LLM basically has a way to keep exploring the space, much like searching for the best plan in chess. And we’re not even doing much parallel exploration at the moment.
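To make the chess analogy concrete, here’s the classic shape of search with backtracking. The model isn’t running an explicit procedure like this, but learning when to abandon a line and try another gets it much of the same benefit; `propose_steps` and `is_solved` are hypothetical stand-ins.

```python
# Depth-first search with backtracking: if a line of steps dead-ends,
# abandon it and try another. `propose_steps` generates candidate next
# steps and `is_solved` checks whether we've reached an answer (both
# hypothetical).
def solve(state, propose_steps, is_solved, depth=0, max_depth=8):
    if is_solved(state):
        return [state]
    if depth >= max_depth:
        return None                      # dead end: give up on this line
    for next_state in propose_steps(state):
        path = solve(next_state, propose_steps, is_solved, depth + 1, max_depth)
        if path is not None:
            return [state] + path        # this line worked; keep it
    return None                          # every continuation failed: backtrack
```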
There’s also no obvious data bottleneck for the reinforcement learning. Training the completion model hits a scaling limit in that there’s only so much human text to learn from, but reinforcement learning doesn’t depend on that. Forming a chain of reasoning is like a one-way function: it’s much easier to recognise a successful path to a conclusion than it is to create one, so when the reasoning succeeds the model can recognise and reinforce that. The commercial providers are obviously training on all the usage data, and I’m sure this is better than just making it up, but even without that usage data there wouldn’t be a data bottleneck. It’s not precisely the same as the self-play objective used to train something like AlphaZero, but there’s a rough analogy — the relevant point is simply that there’s a mechanism for self-improvement.
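A toy illustration of that asymmetry: checking a proposed factorisation of a number is one multiplication, while finding one is a search. The same shape is what lets a training loop keep the chains that check out without a human having to write them.

```python
# Toy illustration of verification being much cheaper than generation.
def verify_factorisation(n: int, p: int, q: int) -> bool:
    return p > 1 and q > 1 and p * q == n          # cheap: one multiplication

def find_factorisation(n: int):
    for p in range(2, int(n ** 0.5) + 1):          # expensive: a search
        if n % p == 0:
            return p, n // p
    return None                                    # no factors: n is prime (or tiny)
```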
My research career was full of ideas that I felt “should” work, but ran into some unforeseen obstacle. I’m not saying there’s some ironclad guarantee that there’s a smooth reinforcement learning path from here into the sunset. What I’m saying is that there’s no publicly known reason to believe progress will halt. People making the plateau argument generally believe that the plateau will be the default case absent further scientific developments, but it’s actually the opposite.
The actual economic and social impact of continued progress on AI is a more complicated question, and I’m working on a second, less hinged, post about that stuff (with a big side of philosophy). I’ve seen the same stuff about the NVIDIA/OpenAI/Oracle ouroboros (or circular company centipede, if you like) that you have, and yeah, I don’t trust Sam Altman and I don’t like OpenAI. But I do think the “plateau” argument is a very important thesis in the case for a bubble, because if the technology does actually work, it’s reasonable to guess that the industry will sort it out. I’m not bullish about OpenAI specifically, but I do think the data centers will get used, and used productively.