Over the last few years there’s been a big debate raging with keywords like “the singularity”, “superintelligence”, and “doomers”. I propose a sort of truce on that debate. The terms of the truce are that everyone still gets to sneer at their erstwile opponents and their cringe idiot takes, but we also all agree that whatever’s being discussed there, the hypothetical “But what if the dumbest possible version of everything happens? What then?” hasn’t really been the conversation, because wtf why make that the premise, right?
Well. Times have changed.
The way current and imminent AI technologies are being deployed introduces very tangible risks. These risks don’t require superintelligence, and they’re not “existential”. They’re plenty bad though. So the truce I’m proposing is that we all get to care about these risks, without the “denialists” rushing to say “see it’s not existential!” or the “doomers” getting to say “see I told you shit could get bad”.
I promise this is a serious post, even though the situation is so stupid my tone will often crack. The basic thesis statement is that a self-replicating thing doesn’t have to be very smart to cause major problems. Generally we can plan ahead though, and contain the damage. Well, we could do that. In theory. Or we could spice things up a bit. Maybe run some bat-licking ecotours instead. Why not?
Here’s a rough sketch of a bad scenario. Imagine you have some autonomous way to convert resources into exploits — hacks, basically. Maybe you have some prompts that try to trick Claude Code or Codex into doing it, maybe you use open-source models. However works. Now, these exploits are going to pay out in various ways when you can land them. Lowest yield is just some compute, but maybe you can also steal some dollars or crypto, or steal some data to sell, or even ransomware. The question is, what happens when we reach the tipping point where exploits become cheaper to autonomously develop than they yield on average?
The general scenario is something I’ve always thought was worth worrying about. But you know, maybe it could be okay, at least for a while — after all, the stuff that’s making the exploits cheaper to develop should let us make everything more secure too, right? …Right? Lol no, this is the clownpocalypse, where the bats taste great. We use coding agents to make everything way less secure.
The general mindset in the industry at the moment is that everything’s a frantic race, and if you’re worrying you’re losing. The sheer pace of change in software systems would be a concern in itself, but there are so many other problems I almost don’t know where to start.
I guess I’ll start with an example that would be easy to fix, but captures the zeitgeist pretty well. Coding agents like Claude Code and Codex can read in “skills” files, which are basically just Markdown files that get appended to the prompt (you can have code as well, but that’s not important here). Kind of nice. So everyone rushes to publish skills, you get sites to find and install skills like Skills.sh. Except, nobody bothered to even think far enough ahead to prohibit HTML comments in the Markdown. This means any skill you browse on a website like Skills.sh could have hidden text that isn’t rendered to you, but can direct your agent to get up to various mischief. Remember that agents often have extremely broad permissions. During development loops people often give the agent access to basically everything the developer has. People leave agents running unsupervised. This problem has been known for weeks. There was even a high-profile demonstration of the vulnerability: Jamieson O’Reilly published a skill called “What Would Elon Do” (chef’s kiss), manipulated it to the top of a popular marketplace, and notified victims they’d been owned. The fix is trivial: obviously the skills format should prohibit HTML comments, but to date there’s been zero move to actually do that. It’s nobody’s problem and nobody seems to care.
O’Reilly demonstrated the unrendered text vulnerability in the OpenClaw ecosystem, which is for sure one of the four balloon animals of the AI clownpocalypse. I don’t know what the other three would be, but OpenClaw is a lock for one of them. So many stories of people just giving the agent all their keys and letting it drive, only for it to immediately drive into a wall by deleting files, distributing sensitive information, racking up usage bills, deleting emails…And all of these things can honestly be considered expected usage, it isn’t a “bug” when a classifier makes an incorrect prediction, it’s part of the game. What is a bug are the thousands of misconfigured instances open to the internet, along with the hundreds of other security vulnerabilities. Mostly nobody cared though. It was still the fastest growing project in GitHub history, before being acquihired into OpenAI.
How did we get here? I dunno man, I really don’t. Normalization of deviance I guess? The literal phrase seems to capture the current political meta, and there’s an air of resigned watch-the-world-burn apathy to everything. It doesn’t help that insecurity is baked into LLMs pretty fundamentally. When ChatGPT was first released I thought prompt injection would be this sort of quaint oversight, like oh they forgot to concatenate in a copy of the prompt vector high up in the network, so the model can tell which bit is the prompt alone and which bit is the prompt-plus-context. But nah nobody ever did that. I guess it didn’t work? Nobody talks about it, so as far as I can tell nobody’s even trying. So we’ve all just accepted that maybe one day our coding agent will read an html page that tricks it into deleting our home directory. Oopsie. Well I can run my agent sandboxed, so at least my files will be safe. But what if it tricks my agent into including a comment in the source of my docs page that will trick a lot of your agents into including a comment that… etc. Well, fortunately that hasn’t happened yet, and we all know that’s the main thing that counts when assessing the severity of a potential vulnerability, right?
You see the go-fast-but-also-meh-whatever vibe everywhere if you look for it. Google’s LLM product, Gemini, insisted on shipping with this one-click API key workflow, presumably because the product owners hated the idea of making users sign up through Google Cloud, which is a longer process than you need for something like OpenAI. Except, this introduced this whole separate auth flow, which has been recently upgraded from clusterfuck to catastrafuck. Previously I thought that the situation was just confusing: the web pages for the two rival workflows don’t mention each other, there’s no vocabulary to describe the difference, and there’s some features that only work if you auth one way but not the other. Clusterfuck. But, recently we learned that the Gemini API keys break a design assumption behind Google’s existing security posture: keys aren’t supposed to be secrets; you’re supposed to be able to embed them in client code, if you’re doing something like distributing a free app that has to access Google Maps. But now many of those existing keys are also auth keys for Gemini! So thousands of people had keys lying around that could be used to steal money from them by using Gemini (e.g. to develop malware), having done absolutely nothing wrong themselves. Well, fortunately the vulnerability was found by professionals, and reported through the proper channels, so no harm done, right? Well, almost. The researchers did contact Google correctly, but then Google first denied the problem, and only accepted it when the researchers showed Google’s own keys were affected. So then the 90 day disclosure window started, and Google shuffled their feet a bit, rolled out a patchwork fix, and ultimately blew the deadline. So the report went live without a full fix in place. Catastrafuck.
So far even when they’ve been bad, malware attacks haven’t been that bad. So okay, even if this does go wrong…how bad could the AI clownpocalypse be? This is where I ask for just a little imagination, along with some acceptance that today’s AI models are not entirely incompetent, and they’re getting more capable every day. Many current AI models are no longer really “language models”, in that the objective they’ve mostly been trained to do is predict successful reasoning paths, rather than predict likely text continuations. I wrote about this in a previous post. If there’s a malware going around suborning existing agents or co-opting hardware by installing its own agent onto it, it’s probably going to be using one of these reasoning-trained models. They’re much better for coding, and the malware probably wants to execute multi-step plans. It wants to send phishing emails, do some social engineering, hunt around for crypto or bank details, maybe send some “help stranded please send money” scam messages — you get the picture. Well, those plans will involve reading a lot of text in, and the malware probably isn’t going to use a high capability model. At any point the model’s view of its current goal can drift. Instead of telling your grandmother to send money, it could tell her to drink drain cleaner. Or it could message her “Rawr XD tackles you”. I don’t want to make out like there’s this inner kill-bot, waiting to be unleashed. It’s just that it could be anything, especially since these models probably won’t be super high capability. There’s truly no way of knowing — Anthropic call it the “hot mess” safety problem, which I think is apt. In the clownpocalypse scenario you have millions of these hot messes. How bad could that be? Hard to say, especially now that the Department of War has demanded the authority to put LLMs in charge of fully autonomous weapons.
So what can be done? Well, maybe one day we’ll work our way back to pretending to try. Honestly give it a go if you haven’t been, even just for nostalgia’s sake? At least you’ll be able to feel a bit high-and-mighty if it does go pear-shaped.
Further reading
- Bruce Schneier & Barath Raghavan, “Why AI Keeps Falling for Prompt Injection Attacks” (IEEE Spectrum)
- UK NCSC, “Prompt injection is not SQL injection (it may be worse)“
- Simon Willison, “The lethal trifecta for AI agents”
- Bruce Schneier, “The Promptware Kill Chain” (Lawfare)
- Apiiro, “4x Velocity, 10x Vulnerabilities”
- Cory Doctorow, “The Reverse-Centaur’s Guide to Criticizing AI”
- Ben Nassi et al., “Morris II: self-replicating AI worm”
- Johann Rehberger, “Agentic ProbLLMs” (39C3 talk)