Date: 2025-12-03

Guest post series from Carlo Graziani.

On Artificial Intelligence

Part 5: Hallucinations

In November 2022, Vancouver resident Jake Moffatt needed to travel to Toronto to attend his grandmother’s funeral. He asked an Air Canada chatbot about the terms of a bereavement fare, and the chatbot assured him, incorrectly, that according to the company’s rules he could receive the bereavement discount retroactively after traveling on the regular fare. When Air Canada denied Moffatt the discount, he sued the company. The Tribunal held that Air Canada was liable for its chatbot’s representations to customers on its own website, and had to pay Moffatt damages and legal fees.

In May 2023, a plaintiff attorney named Steven A. Schwartz filed a legal brief in the Southern District of New York containing references that the judge deemed to be “…bogus judicial decisions with bogus quotes and bogus internal citations.” Schwartz acknowledged that the source of the bogus references was in fact ChatGPT, representing to the judge that ChatGPT had, upon being questioned about the authenticity of the cases, responded that they were “real” and “can be found in reputable legal databases such as LexisNexis and Westlaw.”

In Spring of 2025, The Chicago Sun-Times published a 15-title summer reading list. Ten items on the list were made-up titles attributed to real authors.

Google’s AI Overview has recommended using non-toxic glue on pizza to help cheese stick to the pie.

I could go on, but it gets boring. Finding examples in the media of AI going off the rails in embarrassing ways is easier than finding inebriated people on Chicago streets at noon on Saint Patrick’s Day. Just try a web search on “AI hallucinations”. The AI hallucination is a daily phenomenon, affecting programmers attempting to speed up their coding, scientists looking for fast ways to generate or clean up papers and proposals, and anyone in need of text that must precisely reflect some legal constraints.

AI models are also notoriously bad at mathematical reasoning, making elementary arithmetic mistakes as well as serious mathematical errors. I have prompted ChatGPT to perform a certain standard physics derivation twice now, at a distance of several months, and both times I have obtained careless stupidity that no undergraduate would be capable of producing, presented with professorial polish and total didactic aplomb.

It’s fun to point and laugh, but sometimes it is no joking matter. People have received bad, even dangerous medical advice from ChatGPT. There is a high-profile effort underway to use AI to “democratize” financial advice, which is seemingly innocent of the associated risks. There’s a pending patent for “AI Traffic Control” which is exactly as terrifyingly stupid an idea as it sounds. In fact, we are living through a moment in which the Tech industry is desperately attempting to propose AI for any application offering any prospect of profitability–no firm today makes any money on AI services–so it is not surprising to see such risks minimized or hidden altogether.

To observers of this discipline, the hallucination phenomenon is a very serious problem, and is another reason to question whether “Artificial General Intelligence” (AGI) is even a remote possibility on our current technological path. Certainly it would seem that if the hallucination issue is not understood and corrected somehow, any prospective AGI will babble hilariously, and possibly dangerously, some unpredictable fraction of the time.

The Tech industry consensus on hallucinations, however, is some combination of (a) hallucinations are not really a problem, and (b) more pretraining of improved (i.e. “larger”) models with more data at higher cost in compute and power will make them go away, as AGI finally emerges. I have had conversations with people who really believe in (a) or (b), and at least one person I spoke with appeared to somehow hold both views simultaneously.

View (a) is obviously not even worth discussing, given the high stakes involved in many AI applications. What I’d like to discuss today is view (b): can we really expect bigger models trained at higher expense with more data to do away with the stubbornly persistent phenomenon of AI hallucinations?

In order to address this question, we need to understand where these hallucinations come from. For that, we first need to review what it is that LLMs do.

What Does an LLM Do?

It is helpful to recall the basic definition of statistical learning–the subject that encompasses all of AI–at this point. Here is how all this stuff works:

1. Take a set of data, and infer an approximation to the statistical distribution from which the data was sampled.
   In the case of LLMs, the data consists of hundreds of billions of words of text.
2. At the same time, optimize some decision-choosing rule in a space of such decisions, exploiting the learned distribution.
   With LLMs, a decision could be a response to a text prompt, or a judgment about whether the text expresses positive or negative sentiment, or a translation to another language, etc.
3. Now when presented with a new set of data samples, produce the decisions appropriate to each one, using facts about the data distribution and about the decision optimality criteria inferred in parts 1 and 2.

The reason that I keep bringing these up is that I find this model-agnostic view of the machine learning enterprise extremely clarifying, and helpful in directing attention towards what matters and away from irrelevant aspects of model design.

We should apply the above catechism to what LLMs do. Data from natural language text consists of sequences of words, interspersed with punctuation. LLMs learn features of the distributions over such sequences that allow them to probabilistically predict what the next response word should be, given a prompt and any response words previously supplied.

So, for example, suppose your prompt to be completed by the LLM is “Bob was nervous about his presentation to the board, despite his preparation the night before.” and the LLM completes it with “He had practiced by reading his slides and timing what he said while each one was displayed.” The LLM starts with the prompt as its context, and uses the learned distribution to compute the probability distribution of the next word. From that distribution, it samples (i.e. decides on) the word “He”. It then appends “He” to the prompt to form a new context, and calculates a new distribution for the next word. It turns out that “had” is pretty high in the probability list, and gets selected. The context is now “Bob was nervous about his presentation to the board, despite his preparation the night before. He had”. The LLM repeats the process, and probabilistically samples the word “practiced” from the new distribution. And so on.
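The loop just described can be sketched in a few lines of Python. This is a toy under loud assumptions: the table `NEXT_WORD_PROBS` and its probabilities are invented for illustration, and the toy conditions only on the last word, whereas a real LLM computes the next-token distribution from the entire context.

```python
import random

# Hypothetical "learned distribution": next-word probabilities given only the
# previous word. The numbers are made up; a real LLM uses the whole context.
NEXT_WORD_PROBS = {
    "before.": {"He": 0.7, "Bob": 0.3},
    "He": {"had": 0.8, "was": 0.2},
    "Bob": {"had": 0.5, "was": 0.5},
    "had": {"practiced": 0.6, "rehearsed": 0.4},
    "was": {"ready": 1.0},
}

def next_word_distribution(context):
    """Return the next-word distribution for a context (a list of words)."""
    return NEXT_WORD_PROBS.get(context[-1], {})

def generate(prompt, max_new_words, rng):
    """Sample one word at a time, appending each sampled word to the context."""
    context = prompt.split()
    response = []
    for _ in range(max_new_words):
        dist = next_word_distribution(context)
        if not dist:
            break  # the toy table has nothing to say about this context
        words = list(dist)
        weights = [dist[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        response.append(word)
        context.append(word)  # the new context includes the sampled word
    return response

rng = random.Random(0)
print(generate("despite his preparation the night before.", 3, rng))
```

Running it with different seeds gives different three-word continuations, which is the whole point: the response is sampled, one word at a time, from a distribution that is recomputed after every word.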

No kidding. This is all that is going on. Next response token prediction based on the prompt and all previous response tokens. That is the entire trick. Neat, eh?

Back to hallucinations: there are two places to look for their origin: the approximation to the data distribution, and the next-token decisions founded on that distribution. Let’s take them in order.

Approximating the Distribution of Human Language

The reason that text comprising hundreds of billions of words is required to train an LLM is that the statistical regularities of human language are extremely complex, and not easy to capture in a principled statistical model.

Just scroll your eyes up and down this essay briefly, and then imagine figuring out the rules by which the words are juxtaposed, without being detained by trivialities such as meaning. There are rules: you rarely see the same word repeated immediately (e.g. “immediately immediately”) which is clearly a rule. There are grammar rules, and context rules. Certain clusters of words recur together in certain types of text and not in others: you will find pairings of “octopus” and “cephalopod” within a few hundred words of each other in texts from works on marine biology, but pairings such as “octopus” and “mortgage” are probably very rare. In fact, the occurrence of “octopus” in a page probably means that the probability of encountering “mortgage” within the next 1000 words is considerably reduced from the average rate of occurrence of that word, while the occurrence probability of “shark” is likely enhanced. And so on. How would one go about describing these patterns?

The approach used in natural language processing (NLP) since time immemorial is to begin by breaking down the text into tokens, then describe the text as a sequence of such tokens. Tokenization is a subtle and arcane art. You might think that it would be logical to break things down into words, numerals, punctuation, etc. While not wrong, this approach is very inefficient. The problem is that the English language (say) has about 500,000 words, which is a huge vocabulary for an LLM to manage. Vocabulary size is a critical parameter to be managed in this game, because the larger the vocabulary, the larger and more expensive the model.

On the other hand, breaking things down into individual letters is also a bad idea. While the vocabulary size is now much smaller (less than 50 for English), the token sequences are much longer, and the patterns much harder to find. The patterns are at the word level, not the letter level. It’s just that there are so many damn words!

The secret sauce is to notice that most of those half-million words are extremely rare. Studies of natural language have shown that knowing 10,000 words in any language allows one to understand 99% of texts in that language. In English, that would be 2% of the full vocabulary. Moreover, the rare words can be built up out of smaller word pieces. Identifying an optimal set of word pieces, most of which are full words in their own right, is the name of the game here. Algorithms exist that can represent all English text using 30,000 to 50,000 tokens, which is a considerable savings in vocabulary size. So tokenization is (largely) a solved problem.
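To make the word-piece idea concrete, here is a minimal sketch of how a word can be split into pieces from a fixed vocabulary. Everything here is an assumption for illustration: the tiny `VOCAB` is invented, and the greedy longest-match rule is just one simple way to apply a vocabulary, not a description of how any particular production tokenizer learns or applies its pieces.

```python
# Hypothetical miniature vocabulary of word pieces. Real vocabularies hold
# tens of thousands of entries and are learned from data.
VOCAB = {"un", "believ", "able", "token", "ization", "the", "cat"}

def tokenize_word(word, vocab):
    """Split a word into the longest matching pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate piece first, shrinking until one matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

print(tokenize_word("unbelievable", VOCAB))  # ['un', 'believ', 'able']
print(tokenize_word("tokenization", VOCAB))  # ['token', 'ization']
print(tokenize_word("cats", VOCAB))          # ['cat', 's']
```

The single-character fallback is what lets a small vocabulary cover any string at all: common words stay whole, rare words get assembled from parts.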

Embedding

The next thing that essentially every NLP method does with its tokenized text is a process called embedding: Each token is mapped into a vector space of dimension about 1000 (basically, each token gets described by a list of 1000 numbers) endowed with a notion of distance between points similar to the notion of distance between points in 3-dimensions. At this point, all operations on a sequence of tokens become operations on such lists of numbers. So when a transformer operates on a sequence of tokens (a sentence or a set of sentences, including previously-generated text) that sequence gets embedded in a very high-dimensional space: for example, the prompt above (“Bob was nervous…”) consists of about 17 tokens, so it is mapped to a point in an approximately 17,000-dimensional space consisting of 17 copies of the original 1000-dimensional embedding space.
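The bookkeeping in that paragraph can be sketched as follows. This is only an illustration of the dimensions involved: the vectors are random stand-ins for trained embedding parameters, `EMBED_DIM` is shrunk from about 1000 to 4 so the output stays readable, and the helper names are hypothetical.

```python
import math
import random

EMBED_DIM = 4  # the essay's figure is about 1000; 4 keeps things readable

def make_embedding_table(vocab, dim, rng):
    """Assign each token a list of `dim` numbers (random here, trained in an LLM)."""
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for tok in sorted(vocab)}

def embed_sequence(tokens, table):
    """Concatenate the per-token vectors into one point in a larger space."""
    point = []
    for tok in tokens:
        point.extend(table[tok])
    return point

tokens = ["Bob", "was", "nervous"]
table = make_embedding_table(set(tokens), EMBED_DIM, random.Random(0))
point = embed_sequence(tokens, table)

print(len(point))  # 12 = 3 tokens x 4 dimensions
# The embedding space comes with a distance between any two tokens:
print(math.dist(table["Bob"], table["was"]))
```

With 17 tokens and 1000 numbers per token, the same concatenation gives the roughly 17,000-dimensional point mentioned above.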

I want to draw attention to embedding for several reasons: one is that it is an essentially universal practice in NLP, preceding the invention of transformers by many years. It turns out to be much easier to model probability distributions by operating on lists of numbers than by operating directly on sequences of discrete tokens sampled from a finite set (the vocabulary). So researchers have defaulted to the embedding strategy.

Another reason to emphasize embedding is that when transformers train embedding parameters, they appear to do something magical: the resulting embeddings cluster together words and word fragments with similar meanings or functions, in well-separated clusters in the embedding space. You can see examples of this at Kevin Gimpel’s Bert Embedding Visualization Page, where you will see visualized in 2-dimensions clusters of suffixes, of verbs with similar meanings, of types of enclosed spaces, etc. It is one of those weird effects that persuade some people that LLMs are in fact acquiring a sense of the meaning of words.

The final reason to draw attention to embedding is this: embedding almost certainly poisons the approximation to the distribution of language tokens. The embedding step destroys information about that distribution. The reason is that the original native space of token sequences is entirely innocent of vector spaces, and contains no geometric notion of spatial proximity such as arises in the embedding space. That spatial proximity structure is entirely imposed by the NLP architecture. And it almost certainly gives rise to improper notions of proximity between sentences that are sensible (i.e. have a high probability of occurrence) and other sentences that are nonsense (i.e. have a low probability of occurrence).

As an example of improper proximity, consider these two brief sentences: “My dog is fast”, and “My sparrow is fast.” Both are well-formed, grammatically correct, and obey applicable syntactic and semantic rules. The difference is that the first sentence ought to be ascribed a much higher probability than the second one, because nobody actually owns a sparrow.

As embedded points, however, the two sentences are quite similar: a dog and a sparrow are both animals, and hence live in some proximity in the embedding space. Furthermore, dogs are pets, and while sparrows are not pets, they are birds, and some birds are pets. There are enough ways to draw proximity connections in the embedding space to make the second sentence seem plausible in the distribution approximation, despite the fact that it is, obviously, a hallucination.
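Here is a toy picture of what "proximity" means in this argument. The vectors below are hand-made, not taken from any trained model, and the four axes (animal, pet, flies, financial) are hypothetical labels chosen to mirror the dog/sparrow reasoning. The sketch shows only the geometry being described; it is not evidence for the claim.

```python
import math

# Hand-made illustrative vectors, NOT from a trained model.
# Hypothetical axes: [is an animal, is kept as a pet, flies, is financial]
TOY_EMBEDDINGS = {
    "My":       [0.1, 0.0, 0.0, 0.0],
    "is":       [0.0, 0.1, 0.0, 0.0],
    "fast":     [0.0, 0.0, 0.1, 0.0],
    "dog":      [1.0, 0.9, 0.0, 0.0],
    "sparrow":  [1.0, 0.2, 1.0, 0.0],
    "mortgage": [0.0, 0.0, 0.0, 1.0],
}

def embed_sentence(words):
    """Concatenate the word vectors into a single point."""
    point = []
    for word in words:
        point.extend(TOY_EMBEDDINGS[word])
    return point

dog = embed_sentence(["My", "dog", "is", "fast"])
sparrow = embed_sentence(["My", "sparrow", "is", "fast"])
mortgage = embed_sentence(["My", "mortgage", "is", "fast"])

d_sparrow = math.dist(dog, sparrow)
d_mortgage = math.dist(dog, mortgage)
print(d_sparrow < d_mortgage)  # True: the sparrow sentence sits nearer the dog sentence
```

In the native space of token sequences, all three sentences simply differ in one token; it is only after embedding that one of them becomes "closer" to the sensible sentence than the other.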

So embedding is, in my opinion, one of the origins of hallucinations. It is the reason that the approximation made by LLMs to the distribution of language is so brittle. There is nonsense lurking “near” sense in the embedded sequences, because in their native space (token sequences) there was no notion of geometric “nearness”: that property of relative proximity is an artifact of the model.

And if true, this is very bad news for AGI, because it means that hallucinations are a structural feature of all LLMs. They all embed sequences. So you cannot just train your way out of insanity and into “General Intelligence”, because all those new tokens will have exactly the same problem of spurious proximity. The distribution will be corrupted from the outset. It may be that the most likely responses could appear sane, but insane responses will always lurk nearby, waiting to be sampled by the LLM.

The Tyranny of Sampling

I’ve been referring to the process of “sampling” tokens above, and I should say a bit more about that, because while we have seen the origin of hallucinations in a broken estimate of the distribution of language sequences (part 1 of statistical learning) we need to see how the problem is aggravated by an LLM’s response decisions (part 2).

LLMs are often referred to as “generative” models (the “G” in “GPT”). What this means is that their output is, in a sense, random rather than deterministic. They compute probability distributions over the next token, and then exploit that distribution to decide what the identity of the next token should be. They generally do this by choosing the token randomly, with a higher probability of selection ascribed to tokens judged more likely by the calculation.

You might well ask: “Why not simply select the next token by choosing the one with the highest probability?”

This is occasionally tried. It is a strategy called “greedy sampling”. It is very efficient. Unfortunately, it is also a recipe for disaster, a ticket to hallucination pandemonium.

The problem is this: what one really wants is the most likely extended response to the prompt, according to the learned distribution over language. This might consist of hundreds or thousands of tokens. The distribution, while imperfectly learned, appears to at least get the most likely extended response right, in the sense that it is the one least likely to contain a hallucination.

Unfortunately, sampling the most likely next-token at every stage does not produce the most likely extended response. This can be a surprise at first, but from a mathematical standpoint it is not surprising at all. The probability of the 17th next-token conditional on the prompt and the previous 16 next-tokens can be very different from the probability of the 17th next-token conditional on the prompt and on the entire remaining most-likely response (tokens 1-16 and 18-1000, say). Choosing the most likely token at every stage can, and usually does, lead the LLM into crazy rabbit-holes.
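A two-token toy case is enough to see the mismatch. The probabilities in `FIRST` and `SECOND` below are invented for illustration; the point is only the arithmetic, not any real model's numbers.

```python
from itertools import product

# Hypothetical two-step model: a first token, then a second token whose
# distribution depends on the first. All probabilities are made up.
FIRST = {"A": 0.6, "B": 0.4}
SECOND = {
    "A": {"x": 0.55, "y": 0.45},
    "B": {"x": 0.9, "y": 0.1},
}

def sequence_prob(first, second):
    """Probability of the full two-token response."""
    return FIRST[first] * SECOND[first][second]

def greedy_response():
    """Pick the single most likely token at each step."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return (first, second)

def most_likely_response():
    """Score every complete response and keep the best one."""
    candidates = product(FIRST, ["x", "y"])
    return max(candidates, key=lambda seq: sequence_prob(*seq))

print(greedy_response())       # ('A', 'x'), probability 0.6 * 0.55 = 0.33
print(most_likely_response())  # ('B', 'x'), probability 0.4 * 0.9 = 0.36
```

Greedy selection commits to "A" because it wins the first step, and can never recover the better overall response that starts with "B". Exhaustive scoring works here only because there are four possible responses; with a vocabulary of tens of thousands of tokens and a thousand steps, it is out of the question.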

So instead, one attempts to let the probability distribution do its thing by allowing it to somehow sample the next-token distribution. This is better, but more expensive. In principle, what one ought to do is sample the 1000-token response many times (10,000 times, say) and choose the most frequently-occurring response. That strategy would probably abate a good deal of the hallucination phenomenon. Unfortunately, it would be totally unaffordable in inference computation cost, as well as quite slow. So intermediate strategies are adopted, restricting the next-token distribution to the top 90% of candidates, and looking along a tree to the next and next-next tokens for each one of these top tokens (the so-called “beam search”). This is better, but still not great for finding the top 1000-token response.
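Two of the ideas in that paragraph can be sketched on a toy model. Both parts rest on assumptions: I read "top 90% of candidates" as keeping the most probable tokens until they account for 90% of the probability mass, and the two-step model (`FIRST`, `SECOND`) and all function names are invented for illustration. The second part is the expensive "sample many times and keep the most frequent response" strategy, affordable here only because the responses are two tokens long.

```python
import random
from collections import Counter

def top_p_filter(dist, p):
    """Keep the most probable tokens until their mass reaches p, then renormalize."""
    kept = {}
    total = 0.0
    for token, prob in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return {token: prob / total for token, prob in kept.items()}

# Hypothetical two-step model with made-up probabilities.
FIRST = {"A": 0.6, "B": 0.4}
SECOND = {
    "A": {"x": 0.55, "y": 0.45},
    "B": {"x": 0.95, "y": 0.05},
}

def draw(dist, rng):
    """Sample one token, with probability proportional to its weight."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

def sample_response(rng):
    """Sample a complete two-token response, one token at a time."""
    first = draw(FIRST, rng)
    return (first, draw(SECOND[first], rng))

def most_frequent_response(num_samples, rng):
    """Sample many complete responses and return the one seen most often."""
    counts = Counter(sample_response(rng) for _ in range(num_samples))
    return counts.most_common(1)[0][0]

print(top_p_filter({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, 0.9))
print(most_frequent_response(20000, random.Random(0)))
```

In this toy, the response starting with "B" has the highest overall probability (0.38 against 0.33 for the greedy choice), so repeated sampling should surface it. The price is 20,000 full responses to answer one prompt, which is exactly the cost problem described above.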

You might call this the Tyranny of Sampling: one must somehow sample from an LLM in order to defend its output from the worst hallucinatory offenses. But if you try to do the right thing, the computational cost will destroy the usefulness of the method. Rock, hard place.

Hallucinations Are Structural

Here’s the bottom line: Hallucinations are a structural feature of LLMs, produced by a corrupted model of the probability distribution over language sequences learned in training. The corruption is due to embedding, which is a ubiquitous feature of LLMs.

The only available hallucination abatement strategy is some form of generative sampling, which means accepting the unsettling fact that LLMs cannot produce the same output twice to the same prompt. And even accepting this non-determinism as a cost of doing business, the sampling strategy that cleans up the problem to a maximal extent is totally unaffordable. Unsatisfactory look-ahead strategies are better than nothing, but they still let a lot of nonsense through.

There is no hallucination abatement strategy that begins with more token data and larger models. That’s just not a thing, despite what the Tech industry would like to believe (and would certainly like investors to believe). More tokens and larger models likely aggravate the embedding problem, because there will be more improper proximities discovered in the embedding space.

And note that “larger” models are not “more clever” models. This discipline has not produced radical innovations to the transformer architecture since its invention, or at least none that have led to any breakthroughs comparable to what was wrought by the transformer’s first introduction in 2017. A “larger” model simply means “more parameters”, not new mechanisms that make the model more clever. Given the argument that I make here, I very much doubt that any new cleverness could be built into a transformer that could eliminate the hallucinatory mechanisms baked into its structure at its most fundamental level.

All of which is to say this: LLM-based “AGI” will be mentally ill at birth.