2025-05-30

I come to the end of another surprisingly insane week of work and finally look at the news, and what do I see? Multiple stories about the sex life and/or bladder integrity of Elon Musk (also, there is apparently a Glenn Greenwald sex tape floating around the internet, so take care out there).

Let’s think about literally anything else instead.

There’s a story by writer Noor Al-Sibai in Futurism today about a major problem with large language models (LLMs), the jumped-up autocomplete programs we’ve collectively decided to call “AI”. In order to keep progressing, LLMs need to be fed more and more training data. But they’ve chewed through most of the publicly available/easy-to-steal data out there, so AI programmers have to find new sources. First, they tried augmenting human-generated training data with LLM-generated, or “synthetic”, data. They fed the machine what it shits out, in other words.

This cannibalism approach (or autocoprophagia approach, maybe) can lead to something called “model collapse”. Steven Vaughan-Nichols explained this concept in The Register, a UK-based tech news site:

In an AI model collapse, AI systems, which are trained on their own outputs, gradually lose accuracy, diversity, and reliability. This occurs because errors compound across successive model generations, leading to distorted data distributions and “irreversible defects” in performance. The final result? A Nature 2024 paper stated, “The model becomes poisoned with its own projection of reality.” Model collapse is the result of three different factors. The first is error accumulation, in which each model generation inherits and amplifies flaws from previous versions, causing outputs to drift from original data patterns. Next, there is the loss of tail data: In this, rare events are erased from training data, and eventually, entire concepts are blurred. Finally, feedback loops reinforce narrow patterns, creating repetitive text or biased recommendations. I like how the AI company Aquant puts it: “In simpler terms, when AI is trained on its own outputs, the results can drift further away from reality.”
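If you want to watch the “loss of tail data” part happen on your own laptop, here is a minimal toy sketch in Python. It is not how anyone trains a real LLM, and everything in it is an assumption made for illustration: the made-up 50-word vocabulary, the Zipf-like weights, the corpus size, and the function names `next_generation` and `simulate`. Each “model” just counts word frequencies in the previous generation’s output and then samples a fresh corpus from those counts.

```python
import random
from collections import Counter


def next_generation(corpus, size, rng):
    """'Train' on corpus by counting word frequencies, then emit a new
    corpus sampled from those frequencies (a stand-in for model output)."""
    counts = Counter(corpus)
    words = list(counts)
    weights = [counts[w] for w in words]
    return rng.choices(words, weights=weights, k=size)


def simulate(generations=30, size=200, seed=0):
    """Return the number of distinct words present in each generation."""
    rng = random.Random(seed)
    # Hypothetical "human" corpus: a few common words, a long tail of rare ones.
    vocab = [f"w{i}" for i in range(50)]
    weights = [1.0 / (i + 1) for i in range(50)]  # Zipf-like, an assumption
    corpus = rng.choices(vocab, weights=weights, k=size)
    distinct = [len(set(corpus))]
    for _ in range(generations):
        # Every generation sees only the previous generation's output.
        corpus = next_generation(corpus, size, rng)
        distinct.append(len(set(corpus)))
    return distinct


if __name__ == "__main__":
    print(simulate())
```

The list it prints can only stay flat or go down. Once a rare word fails to show up in one generation’s output, the next “model” has never seen it and can never produce it again, which is the tail erasure the quote describes, in miniature.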

So an AI chatbot that eats its own shit gets the AI equivalent of a prion disease, and its (metaphorical) brain turns to mush, thus squandering billions of dollars of effort. To avoid this, engineers enabled the models to do something called retrieval-augmented generation (RAG), that is, to pull in data from outside sources rather than just relying on the data they’d been trained with.

The issue, of course, is that there is now so much AI-generated slop text on the web that RAG is just causing the same problem that synthetic data does. The models query an outside source; that outside source is actually chatbot poo, and the models begin to degrade anyway. From Futurism:

So if AI is going to run out of training data — or it has already — and plugging it up to the internet doesn’t work because the internet is now full of AI slop, where do we go from here? Vaughan-Nichols notes that some folks have suggested mixing authentic and synthetic to produce a heady cocktail of good AI training data — but that would require humans to keep creating real content for training data, and the AI industry is actively undermining the incentive structures for them to continue — while pilfering their work without permission, of course. A third option, Vaughan-Nichols predicts, appears to already be in motion. “We’re going to invest more and more in AI, right up to the point that model collapse hits hard and AI answers are so bad even a brain-dead CEO can’t ignore it,” he wrote.

This is, I guess, what passes as good news in these fallen days.