LLM thoughts emailed to a friend

2025-11-01 · 22 minute read

One time when we went camping you brought your skim board, and I was trying to do it. I had trouble committing so I kept under-shooting, and you observed that since I kept failing in the same direction I should try harder and harder to over-commit and get the opposite failure.

I’ve been consistently directionally wrong on LLMs, despite paying close attention very early. Specifically, I’ve consistently underestimated how well they’d work. I still stand by the stuff I’ve written about how to build systems around LLMs, but I basically didn’t expect this line of research to work, and it did.

I remember the “sentiment neuron” paper well, and I remember some early seeds of this stuff in a paper by Richard Socher’s group, where they suggested that you could have a general model for any task by encoding the question in English and using generation. Sure you’d have to train it on POS tagging or whatever, but you could have just one do-everything implementation. This really seemed a dumb idea to me. Why should we give up the ability to define task-specific loss functions? Sure, a per-token loss will be correlated with the loss we probably want, but why settle like that? At a given size the model will be worse. And you can’t do structured prediction this way, which is what my last few years of research had been about.

Fast forward a little and I was pretty dismissive of the first GPT and the work leading up to it. I remember especially the “sentiment neuron” stuff. Already I had felt that the research culture coming out of the Bay Area was too aggressive and if not outright norm-violating, at least norm-stepping. SOTA wasn’t what it used to be: the hardware and associated training techniques were changing so quickly that with a little time and money you could get SOTA on some small dataset that hadn’t been optimised for the last few months. Just come up with some idea that doesn’t actively harm performance, give it a charismatic name and backstory, and evaluate using more external data, better network tricks etc. You’ll beat the last guy’s thing and “prove” your idea. Wonderful. There were also so many free parameters and so much variance between runs that repeatedly evaluating against the held-out set amounted to tuning on it, which was increasingly powerful on small data.

So I wasn’t that impressed by training for a month and getting a 10% error reduction, and I was an OpenAI hater anyway. And the press-release translation of the result seemed even more distasteful than usual. The distributions of words in positive and negative movie reviews are of course very different. You could add “in the training domain” to the end of all their claims, like how you can add “in mice” to a lot of breathless life science research PR. The thing is, “sentiment” isn’t a single scalar! It’s the training data that’s divided into two buckets, and those buckets are extremely different. Why should I be impressed here?

In hindsight, there’s a point to this work that the cynic in me saw as true but dismissed as “trivial, trivial” because it seems so obvious once stated. But it actually is important. The point is that the language modelling objective function doesn’t just result in a model that assigns higher probability to valid strings of English characters. A token that’s syntactically valid but completely contextually implausible will also get a low probability.
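
If I wanted to make that point concrete today, I’d just poke at next-token probabilities directly. Here’s a rough sketch of what I mean, assuming the HuggingFace transformers library with GPT-2 standing in for “a language model”, and with the example words picked purely for illustration:

# Score two syntactically fine continuations; only one is contextually plausible.
# Sketch only: assumes the HuggingFace transformers library and GPT-2 weights.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The lawyer cross-examined the", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
probs = torch.softmax(next_token_logits, dim=-1)

for word in [" witness", " banana"]:
    token_id = tokenizer.encode(word)[0]
    print(repr(word), float(probs[token_id]))
# Both are valid English nouns, but the contextually implausible one
# gets a far lower probability under the language modelling objective.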

We used to talk in NLP about tasks being “AI complete”. The idea was to draw a boundary between tasks which could be solved using only information about the structure of the language, and tasks which needed some unspecified degree of world knowledge. World knowledge is generally correlated to some degree. Facts cohere. We live in a somewhat orderly universe. So once a task starts to become knowledge-dependent, you cross a threshold into “AI completeness”, where you can always craft examples where the model should have no reason to prefer the correct answer unless it has arbitrary access to external knowledge and reasoning. There’s an important implication of this AI completeness idea though: transfer learning should work. If there’s an underlying capability that all of these tasks require, it doesn’t matter how you acquire that capability — you should then be able to generalise it to other tasks. And generating text is certainly AI complete.

As we progressed through the GPTs I remained skeptical of the line of work. The truth is I still don’t feel like I can explain its success. I don’t know what I should say to 2017, 2018, 2019, 2020, 2021, 2022, 2023 or 2024 me that would cause me to more correctly predict the next results. This bothers me. If I imagine a 2018 reality where performance just plateaued, how must that reality differ from ours? That alternate reality should confuse me. If I can’t tell you why it’s very unlikely given what else we know, what have I really understood and explained? Fuck all, that’s what.

I don’t expect to be able to explain everything that happened to happen. Some stuff really is just contingent. Over the same period I also didn’t predict the trajectory of Bitcoin, but I don’t feel like there are interesting structural reasons behind that — it’s not like web 3.0 was real or something; I wasn’t wrong that blockchain is basically useless. So it doesn’t give me the same sense that there’s something I misunderstood. I know some of what I would say to previous me to explain the outcome: I was insufficiently pessimistic about US politics and regulatory efforts, and there’s a backbone of demand for the financial equivalent of putting a brown paper bag around your liquor bottle that has kept Bitcoin itself from bottoming out. In combination with this, increasing wealth inequality has caused all asset prices to increase, as less global wealth goes towards consumption. In this world where asset prices increase in general, it doesn’t take much focussed manipulation to get Bitcoin moving in the same way, absent any oversight or regulation to prevent it.

The general increase in asset prices is an example of something I do think is true for structural reasons, and if I went back in time I hope I could convince myself that this would happen. If you look at the ratio of house prices to bread or energy or whatever, and you put more and more of the wealth into the hands of fewer and fewer people, the dollars demanding housing are going to rise relative to the dollars demanding bread etc. There’s only so much bread one person can ever want, and it spoils. You can’t purchase some massive stockpile of bread and perch over it like a dragon over gold. Over the last however many years the prices of assets — anything you can hoard — have gone up and up, so in every asset people are asking “Is it a bubble?”. I think predicting how one asset will move relative to another is chaotic, but I don’t think the general increase in asset prices is some sort of “Keynesian beauty contest” thing that’s contingent or chaotic. If you transported me to an alternate reality where asset prices didn’t increase relative to consumption stuff, I’d want to know where this reasoning fell down. (Corollary to the aside, since I can’t help myself: There’s a lot of discussion about zoning, building regulations etc in relation to housing. There are important improvements to make there, but if we’re talking about average people owning homes, it misses the big picture. The median person won’t own their home because the median person won’t own much of anything at all. Wealth inequality says that on the tin. The following three statements cannot all be true: ‘A few people own almost everything of value’, ‘most people own their homes’ and ‘homes are a thing of value’. I know which one I’m betting against.)

So to get back on track: I don’t expect to be able to understand and explain everything that happens. Lots of things are contingent and chaotic. But not everything, and I think it’s very important to be looking for the stuff that’s interestingly surprising. If we were living in a world where the “universe of game-changing blockchain innovations” or “web 3.0” or whatever people were saying a few years ago came true, I’d be deeply confused. Just the price of Bitcoin, whatever. Chaotic.

Chaotic is a very difficult bullet to bite though. I can’t take the same perspective for how LLMs have developed. I really was wrong about something. I still don’t think I can articulate it properly and get my predictions back on track. But here’s the best I have.

I remember when I started out in ML, I’d have all these ideas. I expected things to work. What would often happen is I’d imagine what the model should learn, but this “ought” I had in mind was nowhere encoded in the dataset and objective’s “is”. There would always be some other solution which minimised the loss that wasn’t what I’d hoped for, or the algorithm would be unable to find the true minimum for boring technical reasons. I took away from this a general skepticism. My default expectation was that nothing would work, and I was even more skeptical of ideas that sounded neat. If “sounds neat” and “effective” are largely orthogonal, but both make me more likely to hear about the idea, then if I’m hearing about it and it sounds neat it’s probably less effective. It’s basically Taleb’s “doctors shouldn’t look like doctors” point. If you get to the NBA by either being real tall or shooting real good, and you show me a guy who’s tall for an NBA player, I’ll guess he shoots badly for an NBA player.
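
That selection effect is easy to convince yourself of with a toy simulation (all the numbers here are made up): if height and shooting ability are independent in the population, then among the players who clear a high bar on the combination, the two come out negatively correlated.

# Toy simulation of the selection effect, with made-up numbers: traits that are
# independent overall become negatively correlated once you condition on their
# sum clearing a high bar ("making the NBA").
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(size=100_000)
shooting = rng.normal(size=100_000)
selected = (height + shooting) > 3.0

print("correlation, everyone:", round(float(np.corrcoef(height, shooting)[0, 1]), 2))
print("correlation, selected:", round(float(np.corrcoef(height[selected], shooting[selected])[0, 1]), 2))
# Roughly 0.0 for everyone; clearly negative among the selected group.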

So, it’s 2017 or 2018, and there’s this idea that you could learn pretty much everything just by language modelling. I agree that to accurately predict the next word you in principle need to know everything. But I kept expecting this not to work, for the same basically boring reasons so many ideas didn’t work before. The probability distribution of next possible words is so intensely narrowed by the immediate context, and predicting within those locally-plausible words is already so difficult. I expected that the training signal for longer coherence would just be too small. It would get lost in the inherent noise you get from bad training data, numeric instability, regularisation penalties, etc. My experience has always been that it’s hard enough to shepherd a neural network towards even a good valley of the training loss (let alone generalisation). It’s just perpetually seemed far-fetched.

I still find it very difficult to intuit the long-term coherence of the generated responses. It feels like small errors in prediction should propagate, and cause the model’s output to diverge from real text. It feels like this shouldn’t work. But I have to accept that it mostly does; if it didn’t “work” in the sense that I would expect for technical reasons, the output would not be anywhere near what it is. So the best I can do is zoom out a little from these very fine-grained technical expectations, and think at a more abstract level.

LLMs clearly encode a representation of a huge array of facts, including the things necessary for common-sense reasoning. They ‘know’ that sheep are white, despite ‘black sheep’ being a more common bigram than ‘white sheep’. And the ‘facts’ the models ‘know’ are integrated together to create something like a world model. The generation process itself doesn’t involve explicit symbolic reasoning, but language doesn’t losslessly map to symbolic reasoning anyway. A word in context is not an explicit variable; when we speak or write (or think!) we are usually much vaguer than that. If I say, “One of my cats is black”, this does not imply that I have in mind some predicate where given any cat I could say “is black” or “is not black”. The example we all know is “how much hair before someone isn’t bald”, but really everything is like this to some extent. There’s a cloud of matter that acts together so consistently that I’m never going to be confused about “is this cat” and “is not this cat”, but most concepts are fuzzier than that. Language isn’t symbolic like that and I see no particular reason to think our thoughts are either.

So in my best grasping towards an explanation, I think this is important. I think LLMs succeed at representing knowledge because they learn it directly from text, preventing us from getting in the way and trying to do the thing that doesn’t work. You can have a symbolic knowledge representation (and do deductive reasoning with it), or you can have a knowledge representation that can interface natively with language — but you can’t have both, because language just ain’t like that, and really neither are we. Explicit reasoning over symbolic representations really only works in closed domains. There are various attempts to do “Argument, but with formal Bayes!” and when you see it attempted really sincerely, it’s always a bit embarrassing. The best recent example was the ROOTClaim debate about COVID origins.

At some point in our discussion, I said that I feel like a lot of speech — even technical explanations — is people “running their language model”, which got some dissent. Here’s what I meant by that, so we can drill down more into where we could disagree. I think of myself as having this sort of distributed knowledge representation, and there’s some sort of motivation to express something, and I start speaking. If it’s something I’ve talked about a few times before, the words basically flow. I don’t consciously plan the sentences and I decide as I go whether to start down some diversion I think of based on what I just said. It’s all associative. I’m perfectly capable of contradicting myself when I do this, or saying things which are logically inconsistent with some other belief I would profess.

I do also stop and think of course. But I don’t see why I need to model this as more than a sort of scratch-space, or correction loop. You said “I don’t think we think in language”, and I basically shrug at that — I don’t think it’s a very important distinction in lots of situations. We obviously have the ability to imagine things visually and to manipulate those images, plus lots of other capabilities. But there’s also lots of times where we’re not doing that, and we’re in the realm of purely abstract ideas. I think our ‘thoughts’ on these topics are basically like loops of activations without the last steps of collapsing the probabilities to specific words. I think a mechanism that just has a scratch-space in words can do those ideas-only tasks. When I’m reasoning, I feel like I’m continually creating intermediate representations that follow from where I was previously. If I can put two statements side by side I can generate the contradiction, but I need to work through those steps. I can’t just interrogate my model, “What do you believe that’s contradictory?”. Or more often, I see an implication of some previous statements, and that reframing moves me closer to the conclusion. There are lots of false starts in this process, but the metaphor of a “train of thought” has been with us for a long time, and I think it captures something true and fundamental. It’s a game of states. You take some speculative step from where you are that you hope will get you closer to the answer.

Anyway. It doesn’t really matter what humans do. The question is what sort of algorithm is in principle sufficient for some task. And I claim that if you can formulate statements and then produce some subsequent statement that now seems true, and you can do that with some generally good search heuristics, you can “do reasoning”.
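
In code-shaped terms, the claim is something like the sketch below. Everything named here is a hypothetical stand-in: propose_next_statements would be “run the model”, seems_promising some learned search heuristic, is_answer the check that we’ve actually reached a conclusion.

# A cartoon of "reasoning as search over statements", not any particular system.
def reason(question, propose_next_statements, seems_promising, is_answer,
           max_steps=50, beam_width=5):
    chains = [[question]]                      # each chain is a train of thought
    for _ in range(max_steps):
        candidates = []
        for chain in chains:
            for statement in propose_next_statements(chain):
                candidates.append(chain + [statement])
        # Keep only the few chains that currently look most promising.
        candidates.sort(key=seems_promising, reverse=True)
        chains = candidates[:beam_width]
        for chain in chains:
            if is_answer(chain[-1]):
                return chain                   # a speculative path that happened to land
    return None                                # ran out of budget: all false starts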

The “generally good search heuristics” are provided by reinforcement learning, which is why I think the “thinking”/“reasoning” line of models is such a big development. The paper I’ve liked best (not that I’ve read them all, or even an appreciable fraction…) is the DeepSeek one: https://arxiv.org/pdf/2501.12948 . They train the LLM on the language modelling objective and then do reinforcement learning on tasks, without the supervised fine-tuning. I think the absence of the supervised fine-tuning step makes it all much easier to reason about. I always found that step quite mysterious, because the details I’ve seen have been so vague.
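
For what it’s worth, here’s my cartoon of that recipe — not their actual algorithm (the paper uses GRPO, a group-relative policy-gradient method), and with model.sample_completion, model.log_prob and check_answer as hypothetical stand-ins for the generation, scoring and verifiable-reward pieces:

# Cartoon of "completion-trained model + RL on verifiable task rewards, no SFT".
# model.log_prob is assumed to return a differentiable scalar (e.g. a torch tensor).
def rl_step(model, optimizer, problems, check_answer, samples_per_problem=8):
    losses = []
    for problem in problems:
        completions = [model.sample_completion(problem) for _ in range(samples_per_problem)]
        rewards = [1.0 if check_answer(problem, c) else 0.0 for c in completions]
        baseline = sum(rewards) / len(rewards)            # group-mean baseline
        for completion, reward in zip(completions, rewards):
            advantage = reward - baseline
            # Push up the log-probability of completions that beat the group average.
            losses.append(-advantage * model.log_prob(problem, completion))
    loss = sum(losses) / len(losses)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()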

Reinforcement learning (the AlphaZero, AlphaStar etc stuff) is another thing that worked better than I would have expected. I hadn’t really looked at it before the big Alpha-stuff results came out, but for what it’s worth I would not have expected the sparse reward signals you get from a very large search space and lots of intermediate steps to be sufficient; I would have thought the models would get stuck at the cold start for too long. But there’s no doubt about the answer to “can you learn to solve search problems with reinforcement learning”.

With the reasoning step there, I can almost convince myself to expect the whole contraption to work. The thing about LLMs that I’ve always struggled with the most is this idea of “emergent capabilities”: tasks that the model was not trained to do but can sort of do to some extent anyway, just as an epiphenomenon of the text generation process itself. Some of it I could wave away as just lax standards: if you give it a bunch of examples of things doctors say and ask it to generate something superficially similar, I suppose you can call that “medical advice”, in the same way that you can call any set of steps a “recipe” if you eat whatever the result is at the end.

Here’s an example of a very simple sort of thing that I would expect to not work in a purely completion-trained model:

>> Write the first paragraph of a news story, but insert the word 'potato' at
>> every second word.
> Local potato officials potato announced potato this potato morning potato that
> potato the potato city potato council potato has potato approved potato a
> potato major potato infrastructure potato project potato aimed potato at
> potato improving potato downtown potato traffic potato flow potato and potato
> pedestrian potato safety potato throughout potato the potato area potato.

The text here is corrupted in an arbitrary and novel way, so I would have thought that it would be trying to complete a low probability string and end up in a fairly random part of the output space. That output is from the latest Claude, but even much earlier models were able to do this sort of thing. I don’t understand why the instructions so often feel sort of separate from the output, in terms of its behaviours — when really it’s just generating token-by-token. I don’t have an example at hand, but there’s also a lot of stuff that feels like it should require preplanning — stuff where the first token should doom you to a bad response and you can’t reasonably guess what’s a good first token without generating the whole thing. And yet the models have always surprised me in these scenarios. I’ll try to think of some examples.

Could this stuff all be in the supervised fine-tuning? I honestly have no idea; as I said that part has always been a mystery to me. In my head I’ve always thought like, “How could this really be at a scale that changes the behaviour that much?”. But maybe the reinforcement learning has been meaningfully there all along, and I just haven’t understood it properly.

So what’s my take — how do I think LLM-stuff will develop? I think that the current usage patterns and product framings undersell the actual importance of the representation learning that’s been achieved and how powerful the reasoning can potentially be. From day one the usage story around LLMs has been basically, “Talk to it like a person! Just ask it to do stuff!”. Everything about the way consumer LLM-stuff is packaged ends up being hostile to users’ ability to form and apply a mental model of how the thing actually works. They don’t tell us enough about how it was built, they encourage massive context windows and all-in-one tasks, and the models change underneath the user so often that it’s difficult to develop good usage heuristics just by trial and error.

I think what will happen is that the reinforcement learning reasoning stuff will be pretty effective, especially if you train it on specific tasks, with task-specific rewards and control loops. This will be expensive, so it’ll really make an impact in professional contexts. It’s never going to make sense to spend money computing outputs for stuff people don’t care enough to pay for. In the sense that that’s what a lot of the current LLM usage looks like, yeah I think we can call that a “bubble”. But LLMs can really do stuff, and I don’t think we’re nearing a plateau. It doesn’t really matter whether the completion models continue to improve. So far we’re mostly just sampling one output. It remains weird to me that that works at all, but there’s much more that can be done, and the guided search version is already being rolled out.

Postscript: The chess example

I wanted to also revisit the chess example you brought up: if you try to play a game of chess with it in dialogue, it will generate incorrect moves — because it doesn’t anywhere have a model of the actual board. First of all, a point of detail. It’s certainly possible to encode the state of a chess board as neural network activations, e.g. you could have n*m boolean activations for the n pieces and m squares, with more compressed distributed representations obviously possible as well. It’s also obviously possible to have a pattern of weights that, when combined with non-linear activation functions, encode the constraints for what constitutes a legal move. So if you trained a transformer only on sequences of chess notations, whether or not it learns to construct such representations well is purely an empirical question. You might find in practice that it sucks at this, but the sucking will be contingent, not necessary.
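
To make that point of detail concrete, here’s roughly the n*m encoding I have in mind — a sketch that uses the python-chess library just as a convenient way to read off piece placements:

# One boolean plane per (piece type, colour) pair, one cell per square.
# Uses the python-chess library only to read off piece placements.
import numpy as np
import chess

def encode_board(board: chess.Board) -> np.ndarray:
    planes = np.zeros((12, 64), dtype=np.float32)   # 6 piece types x 2 colours
    for square in chess.SQUARES:
        piece = board.piece_at(square)
        if piece is not None:
            plane = (piece.piece_type - 1) + (0 if piece.color == chess.WHITE else 6)
            planes[plane, square] = 1.0
    return planes.reshape(-1)                       # 768 boolean activations

board = chess.Board()
board.push_san("e4")
print(int(encode_board(board).sum()))               # 32 pieces on the board -> 32 ones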

I don’t think it’s that significant that models trained on web text, and only seeing chess notations as an incidental fraction of their training data, don’t acquire such representations very well. The chess notation is very compressed, and I’ve played several thousand games of chess, from young enough that I can’t remember my first game or the process of learning to play — but I can’t follow a game in notation without a board. It’s not at all uncommon to be able to do that; I think any rated player would be able to. But it doesn’t come for free just from knowing how to play chess.

So here’s a different thought experiment. Imagine you had the LLM write out some ASCII representation of the chess board, let it play lots and lots of games against itself in this format, and trained it to win with reinforcement learning. This is just AlphaZero with extra carbon cost, right? Why shouldn’t this be able to learn to play chess? If the objection is that it shouldn’t be necessary for it to have this chess experience to reach some understanding of the game, we’re setting an extremely high bar; I think far beyond current human intelligence. A human cannot read a bunch of chess books and play well in their first game.
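
Sketched out, the thought experiment is just the loop below, where llm_choose_move is a hypothetical stand-in for “prompt the model with the plain-text board and parse a move out of its answer”, and python-chess supplies the rules and the result:

# Cartoon of the self-play setup: the model sees the board as plain text, the
# environment enforces the rules, and the win/loss result is the reward signal.
import chess

def self_play_game(llm_choose_move, max_moves=200):
    board = chess.Board()
    for _ in range(max_moves):
        if board.is_game_over():
            break
        move = llm_choose_move(str(board), list(board.legal_moves))
        board.push(move)
    result = board.result(claim_draw=True)            # "1-0", "0-1", "1/2-1/2" or "*"
    return {"1-0": 1.0, "0-1": -1.0}.get(result, 0.0) # reward, from White's side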

…And after having written all of that, I just tried it with Claude, and here’s what it did:

Let's play a game of chess. I start with e4
I'd be happy to play chess with you! I'll play as Black.
1. e4
I'll respond with e5 - the classical response, leading to open game positions.
The board now looks like this:
8 ♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
7 ♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
6 · · · · · · · ·
5 · · · · ♟ · · ·
4 · · · · ♙ · · ·
3 · · · · · · · ·
2 ♙ ♙ ♙ ♙ · ♙ ♙ ♙
1 ♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖
a b c d e f g h
Your move!

Did Anthropic train it to do this specifically, as some sort of party trick because people were talking about it? I have no idea! That’s one of the things I’ve found so vexing about trying to grapple with LLM capabilities. When GPT-3 was released, my Twitter feed was full of impressive examples, but when I tried it, not much worked. I ended up wondering whether OpenAI had planted a bunch of easter eggs, and then had patsies on social media “discover” them. I have no idea whether they did this, but if they didn’t, I still think it would most likely be because they either didn’t think of it or it wasn’t necessary. My trust in their integrity is entirely absent. I do think Anthropic are considerably more trustworthy, and they do tell us some things — but we’re still left guessing on almost all of it.

Regardless, I think you’ll need a new example, if not now then very soon.