Balloon Juice

Come for the politics, stay for the snark.


Guest Post: AI

Carlo’s Artificial Intelligence Series


Part 7: The Coming AI Winter

by WaterGirl | January 8, 2026 | 7:30 pm | 74 Comments

This post is in: Carlo Graziani, Carlo's Artificial Intelligence Series, Guest Posts, Science & Technology

This will be the final post in the series, so I want to extend a big thanks to Carlo for sharing both his knowledge and his thinking on AI with us!

Between the holiday break and some work deadlines and all the craziness, we’ve had some distractions that have made finding the right time for this final post a bit more complicated.  There has not been one complaint (not that I have seen, at least) about the delay, so I’m hoping that maybe this is better timing for all of us!

Thanks again, Carlo!

Guest post series from Carlo Graziani.


On Artificial Intelligence

Hello, Jackals, welcome back. Happy New Year, and thank you again for this opportunity. Being able to write these posts on AI has been very helpful to me, because writing this stuff out in a manner suitable for exposition has really forced me to clarify my own views on AI, and has allowed me to be much more exact and specific on both what my objections to the AI enterprise are, and on what the value of that enterprise is. Basically I’m in a better place to tell the baby from the bathwater, thanks to this series.

This is the last post in the series. I have more-or-less emptied the bag of technical matters concerning AI which I think I understand that most people don’t, and having done that, it’s time for me to summarize the content of the past six posts, and to use that summary to try to understand where we are on AI today, and where we are likely going, at least in the near future.

Again, you have my gratitude for reading, and for the very high level of the comments that have followed each of the previous posts. BJ really is a unique, special place.

Part 7: The Coming AI Winter

Let’s start out with a very high-level summary of where we’ve been in this series.

AI and Learning

At several points in this series, I pointed out that all “AI” is in fact a form of statistical learning, and introduced the Statistical Learning Catechism, which states that what every such system does is the following:

  1. Take a set of data, and infer an approximation to the statistical distribution from which the data was sampled;
    • Data could be images, weather states, protein structures, text…
  2. At the same time, optimize some decision-choosing rule in a space of such decisions, exploiting the learned distribution;
    • A decision could be the forecast of a temperature, or a label assignment to a picture, or the next move of a robot in a biochemistry lab, or a policy, or a response to a text prompt…
  3. Now when presented with a new set of data samples, produce the decisions appropriate to each one, using facts about the data distribution and about the decision optimality criteria inferred in parts 1. and 2.

It is worth emphasizing, again, that this is all that is going on in any AI system, including all Large Language Models (LLMs). It’s basically just pattern recognition coupled to decision-making: if you can model how you would have made reasonable decisions based on past data, you can use that model to make reasonable new decisions based on new data.
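To make the Catechism concrete, here is a minimal sketch in Python of the three steps, using nothing but numpy. The two-class data, the Gaussian fits, and the threshold rule are entirely made up for illustration; the point is only the shape of the procedure: fit a distribution, derive a decision rule from it, apply the rule to new samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: take a set of data and infer an approximation to its distribution.
# Toy data: 1-D measurements from two classes (invented for illustration).
x_a = rng.normal(loc=0.0, scale=1.0, size=500)   # class "A" samples
x_b = rng.normal(loc=3.0, scale=1.0, size=500)   # class "B" samples

# Approximate each class distribution by a Gaussian fit to the samples.
mu_a, sd_a = x_a.mean(), x_a.std()
mu_b, sd_b = x_b.mean(), x_b.std()

def gauss(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Step 2: optimize a decision rule that exploits the learned distributions.
# Here the "decision" is a label, and the rule is: pick the class whose
# fitted density is higher at the observed value (equal priors assumed).
def decide(x):
    return np.where(gauss(x, mu_a, sd_a) > gauss(x, mu_b, sd_b), "A", "B")

# Step 3: present new data samples and produce the appropriate decisions.
x_new = np.array([-0.5, 1.4, 2.9])
print(decide(x_new))   # e.g. ['A' 'A' 'B']
```

Every "AI" system described in this series is, at bottom, a scaled-up version of this loop, with far richer data and far more elaborate models for steps 1 and 2.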

The fundamental simplicity of this scheme belies the power of the methods that it enables when coupled to modern computing. Academic computer scientists were the first to realize the possibilities inherent in that coupling. In 2007, they began to show that many learning problems previously regarded as intractable could be easily solved using a set of computational techniques based on neural network models that came to be known as “Deep Learning”. Examples of such problems are image classification, voice recognition, protein structure prediction, materials properties prediction, empirical weather forecasting, and, beginning in 2017, natural language processing. The last one, brought about by the advent of a type of DL model called a “transformer”, marks the arrival of LLMs, and the beginning of what most people nowadays think of as “AI”.

The Role of the Tech Industry

The choice of the term “Artificial Intelligence” to describe this subject is misleading, however. AI is nothing but statistical learning, and learning is a very limited aspect of human cognition, not remotely sufficient to model “intelligence”. After all, even single-celled animals “learn”. The impressive parlor tricks that can be performed by LLM-based chatbots should not deceive us into anthropomorphic interpretations of their workings. They are by no stretch of the imagination “Intelligent”.


Unfortunately, the deceptiveness of the term “AI” is to a large extent deliberate. The fact of the matter is that this subject has escaped the control of academic researchers, having been for all intents and purposes hijacked by the Tech industry, which soon saw business opportunities in chatbot parlor tricks. The industry’s problem was that by 2022, revenues from tech platforms (FaceBook, Twitter, Google ads, etc.) were no longer growing at the spectacular rates that their leaders felt were the special privilege of their futuristic enterprises. They were in danger of becoming normal companies with normal growth expectations 1. This prospect outraged the peculiar techno-messianic sensibilities that seem endemic to many in the industry’s leadership class. They settled on AI as the basis for the next phase of their industry’s evolution. In so doing, they began to exert a corrupting influence on machine learning research that is analogous to the corruption of climate science perpetrated by the Oil industry, or to the earlier corruption of cancer research by the Tobacco industry.

The AI future that Tech leaders imagined is a strange admixture of technical naiveté, utopian ambition, and naked greed. On the evidence of current LLM capabilities for natural language processing (NLP), they persuaded themselves that they could bring about a new type of AI called “Artificial General Intelligence” that would constitute a transformative, disruptive technology, analogous to the steam engine or the telegraph. This they could accomplish by means of very large investments (totalling $3-5T by 2030) in data center construction, in acquisition of computing hardware (GPUs), in securing large power contracts required to run that hardware, and in extensive hires of data science and computer engineering personnel.

This AGI technology would become an engine of disruption, changing every industry or business that it touched. In the Utopian version of this story, it becomes a driver of a sort of “post-scarcity” world in which human resource competition and conflict become things of the past. In the more self-interest-grounded C-Suite discussions of the profitability of AI, a darker story is told, in which business leaders of other industries are offered AI tools that enable them to reduce their labor costs (i.e. to replace much of their work forces with AI tools). This process would re-direct large fractions of the resulting labor cost savings back to suppliers of AI services, thus assuring those firms of the abnormal profitability growth which they regard as their birthright.

What Is Wrong With This Picture?

Set aside the pious, self-congratulatory naiveté of the Utopian story, and the obvious self-interested amorality of the business plan. From a practical point of view, what we should ask ourselves is this: Is there any reason to believe that this vision of the future is at all possible?

That’s another rhetorical question. The answer is that for good and sufficient reasons, this plan is going to fail, badly, and with negative consequences for Tech firms, their customers, the global economy, and the U.S. and other governments. Let us review reasons, and consequences:

(1) There Is No AGI Down This Technological Path

For the central driver of a story designed to mobilize trillions of dollars in investments, the term “AGI” is defined with maddening vagueness. But it is generally agreed that a fundamental aspect of AGI is computational reason. And in fact, researchers working in the AI industry lard their technical terminology with allusions to reason, including “reasoning models”, “reasoning tokens”, “chain-of-thought systems” and other such constructions. By such means they have persuaded themselves that they are on the path to solve (or perhaps have already solved) one of the thorniest problems in science: the modeling of human reason.

They have, of course, done nothing of the sort. In Parts 3-4 of this series, I discussed the wrong-headedness of conflating learning with reasoning. I gave a fairly detailed discussion of the type of scientific approach that one might take to try to connect modern machine learning to reasoning. This set of reflections has the benefit of illustrating the fact that reasoning qualitatively transcends learning in essential ways: recall that reasoning features the discontinuity of “Aha!”, which learning, as a continuous process, cannot emulate. The notion that one can simply train a machine learning system until it “learns to reason” (sometimes referred to as “emergence”) is scientific nonsense of the type characterized by Wolfgang Pauli as “not even wrong”.

Since all current AI consists of learning systems, none of it can learn to reason, even in principle. While AGI might be possible in principle, and we may see some version someday, it will certainly not be built on a foundation of current DL-based technologies. Which means that the target motivating the colossal current investments in AI doesn’t even exist. That’s Bad News Message Number 1 for the AI enterprise.

(2) AI Hallucinations Are Never Going Away

By now most people are familiar with, or at least have heard about, “AI Hallucinations”. AI famously gives dangerously bad financial and medical advice, incorrectly instructs firms’ customers about company policy, writes legal briefs that cite non-existent case law, screws up mathematical reasoning with unflappable didactic aplomb, writes wrong and unusable computer code, and otherwise reliably produces nonsense at a sustained rate whenever asked to produce output. There exists a subject now called “prompt engineering” in which domain experts use their expertise to detect such incorrect output, and tune or steer their AI prompts in order to coax chatbots into producing improved output. Only by such means is it possible to get actual useful output from AI systems.

The Tech industry’s “AGI” narrative holds, among other things, that such hallucinations are minor failings, which will in any event be addressed through more computing power, more training data, and larger models. In effect they believe that through hyperscaling, they can train models to stop hallucinating.

There is not a shred of quality evidence to suggest that this is true, and considerable reason to believe that hallucinations are an ineradicable part of the output of modern LLMs. They arise, in my view, from a very brittle and imperfect model of the probabilistic distribution over language sequences (sentences, paragraphs, and so on) that the models learn in training. The imperfection of the models is in fact structural: it is brought about by the token embedding process shared by all NLP methods. This process endows the “space” of language sequences with an improper notion of proximity that assimilates nonsense sentences to sensible ones, to a degree not justified by the actual sentence distribution sampled by the training data.

Note that this is a different problem from the impossibility of “learning to reason”. What one would hope of a DL-based LLM is that even though it doesn’t reason, it can generate sentences that are at least consistent with the sorts of sentences that it encountered in training. But that is not the nature of many AI hallucinations: they actually contravene the training distribution, because their broken internal model of that distribution places nonsense responses “closer” to sensible ones than they actually are, for technical reasons having to do with the geometry of the embedding process.
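To make the notion of “proximity” in an embedded language space a bit more tangible, here is a small illustration using the open-source sentence-transformers library and cosine similarity. The specific sentences and the model name (all-MiniLM-L6-v2) are illustrative choices on my part, not a claim about any particular chatbot’s internals; the point is only that geometric closeness in embedding space is not the same thing as semantic sense.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# A small, publicly available embedding model (illustrative choice only).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Aspirin is commonly used to relieve mild pain.",         # sensible
    "Aspirin is commonly used to repave asphalt driveways.",  # nonsense
    "The recipe calls for two cups of flour.",                # sensible but unrelated
]
emb = model.encode(sentences)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# The nonsense sentence shares most of its tokens with the sensible one,
# so the embedding geometry can place it "close" to sense, while the
# unrelated-but-sensible sentence typically lands farther away.
print(cosine(emb[0], emb[1]))  # usually the larger similarity
print(cosine(emb[0], emb[2]))  # usually the smaller similarity
```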

What this means is that AI hallucinations are in all likelihood structural to LLMs. They cannot be eliminated by training with more data, because more data would require more model capacity, which in turn would create more opportunities for embedding to find ways to situate nonsense geometrically near sense. And therefore, any “AGI” that is marketed by the industry is certain to be mentally ill at birth. This is Bad News Message Number 2.

(3) Hyperscaling, “Emergence”, and Model Inefficiency

One of the funniest intellectual failures driven by the corruption of machine learning by the Tech industry is practitioners’ faith in the phenomenon of “emergence” of general intelligence, which, they believe, is achieved by scaling model capacity (measured in billions of adjustable parameters) to the point that a model “suddenly” starts to produce reasonable responses to prompts. There is even the notion that some kind of fundamental law of computing has been discovered, relating the size of a training corpus of text to the model capacity required to achieve this “emergence” of intelligence. One scales linearly with the other: double the size of the training data and you must double the model capacity to achieve “emergence”. This is in fact the “scaling law” at the root of the industry’s mad drive to build out datacenters, buy GPU hardware, secure power contracts sufficient to power medium-sized cities, and hire data science talent, which we now call hyperscaling.

The problem is that “emergence” is bullshit. The scaling law between training data size and model capacity is a fact of deep learning that has been true of all DL systems since the earliest examples of their application. It is as true of an image-classifying convolutional neural net (CNN) as it is of a transformer LLM. There is no qualitative difference between the two cases.

The qualitative difference that does apply pertains to the data being modeled: the complexity of the distribution of human language sentences is vastly greater than that of the distribution of natural images, or of weather states, or of materials properties. That is, the modeling of human natural language is an enormously more ambitious subject to tackle than any other problem that has been addressed using DL methods. For this reason, vastly more data is required to get a grip on the distribution than has been the case for any other DL modeling problem. An image classifier needs about 30,000 labeled images to achieve state-of-the-art performance. An NLP transformer needs tens of billions of words to achieve even moderately acceptable performance. And because the data requirements are higher, so are the model capacity requirements. You have to hyperscale the model size in order to get to the point where an NLP transformer doesn’t suck anymore. Hilariously, this “point of sucking less” is what practitioners of the subject refer to as “emergence” of intelligence (and even, in their more reckless moments, of AGI). It is nothing of the sort, unless you think that scaling an image classifier’s model capacity to the point that you can train it to classify images is also “emergence”.

While this model capacity scaling is a reasonable tradeoff for essentially all other DL applications, it is madly, unaffordably expensive when applied to language modeling. Unfortunately there are not many research efforts underway to attempt to find alternative modeling methods that break the scaling, because the AI train as currently constituted has an irresistible momentum conferred upon it by the mad level of investment that has been summoned up by the industry. The level of groupthink on this, even among academic scientists, is astounding, as well as depressing.

So, this is Bad News Message Number 3: unaffordable hyperscaling is baked into the methodology underlying AI chatbottery, and there is negligible research effort dedicated to circumventing it.

On “AI Winters”

Here’s the summary of the summary: the Tech industry’s target, AGI, is not achievable even in principle using current technology. And a subsidiary goal, the taming of the stubborn problem of AI hallucinations, is also not on the table. This being the case, the insane build-out of AI infrastructure is to no purpose, or at least to no purpose articulated by the AI industry. There is no pot of gold marked “AGI” at the end of the hyperscaling rainbow, or even a small purse marked “no hallucinations”. But it would take an infinite amount of data, compute, electrical power, effort, and capital to get there and find out. This cannot possibly end well.

We should try to ask ourselves the “How Does This End” question now, so as to be prepared to recognize the end when it happens. As it happens the subject of AI has an instructive history antedating the post-2007 developments that gave rise to modern “AI”. That history points to a certain cyclic pattern of research and development, culminating in “cataclysmic” (to its academic practitioners) collapses that are known as “AI Winters”.

As usual with history, it is difficult to divide up events into neat periodizations. Nonetheless, from the widest possible perspective, there appear to have been two prominent cycles of development and disappointment, culminating in AI Winters. The first Winter is generally held to have set in around 1973-1974, when research funding agencies in the U.S. and the U.K. concluded that the promise of research in machine translation, speech understanding, and “perceptrons” (single-layer neural networks) had been overhyped and was unlikely to deliver anything of value. Millions of dollars of funding was cancelled, leading to a massive contraction of the field, and to the end of many young (untenured) scientific careers.

By the 1980s, interest in AI had revived, carried largely by the advent of “Expert Systems”, which embodied hand-encoded knowledge according to some new knowledge representation schemes. New enthusiasm among researchers also led to a fair amount of media hype concerning the prospects for the new AI. Government funding followed suit, with the Japanese government’s 1981 announcement of its “Fifth Generation Computing” project. The U.S. government responded with DARPA’s Strategic Computing Initiative (SCI) in 1983.

But as early as 1984, at the AAAI meeting where the term “AI Winter” was coined, researchers Roger Schank and Marvin Minsky had started sounding the alarm about the growing AI hype, recognizing an echo of circumstances that had resulted in the disappointment and funding cutbacks of 1973-1974. As summarized in the Wikipedia article, Schank and Minsky “… described a chain reaction, similar to a ‘nuclear winter’, that would begin with pessimism in the AI community, followed by pessimism in the press, followed by a severe cutback in funding, followed by the end of serious research. Three years later the billion-dollar AI industry began to collapse.” By 1987, DARPA had wound up the SCI, and Japan did the same with the Fifth Generation project in 1991. Both were judged to be wasteful disappointments. By the 1990s, essentially all development of expert systems ceased, and the term “AI” itself began to seem somewhat disreputable among scientists and science funders. AI Winter II had set in.

Winters, and Bubbles

There are some useful lessons for our current moment in these events. In particular, it seems that Schank and Minsky were prescient in their warning, and perceived a clear pattern to AI research and funding leading to the AI Winter Cycle. The phases of that cycle are:

1. Technical progress

2. Excitement

3. Media hype

4. Investment based on hype

5. Disappointment of investor hopes

6. Withdrawal of investment

7. Winter

When I first began thinking about writing a series of essays on AI (mid-summer 2025), it appeared to me that the hype around modern AI was vastly overwrought, but it was successfully bringing in unprecedented levels of investment ($300B-$400B annualized investment in 2025, anticipated to rise going forward). That is, if, as I suspected, this was an AI Winter cycle, we were at about Phase (4). There were some differences in the historical pattern, the principal one being that the funders in this cycle are private investors to a far greater extent than government, but I thought that there were enough correspondences to suggest the same cresting wave pattern, culminating in an AI Winter. I couldn’t be sure of when the climax (Phase 7) might arrive, however. And very few of my colleagues were willing to entertain the idea that another Winter was coming.

Under current circumstances, with private investors taking the role previously occupied by Government funding agencies, the concept of an AI Winter necessarily becomes linked to that of an AI Bubble. This was also not a common position to take in Summer 2025. At that stage, the people suggesting that there might be a financial bubble connected to AI in the offing were definitely a minority2, and it was certainly not a respectable position to take.

As of this writing (January 2026), we appear to have fast-forwarded to Phase 5, and are beginning to see signs of Phase 6. The phrase “AI Bubble” is no longer in bad odor, but rather has featured prominently in mainstream press reporting, including in The New York Times, The FT, the WSJ, Bloomberg, Barron’s and many other sources of both general and financial news. Oracle’s stock dove into the toilet on investor fears of the unsustainability of its datacenter buildout plans. NVidia and Meta’s datacenter financing deals are being compared to the practices that led to the 2001 implosion of Enron.

And, suddenly, people are noticing that no AI company is profitable, and that nobody knows a path to AI profitability. Ed Zitron estimates that OpenAI’s real revenue is a fraction of its inference costs, and therefore cannot even begin to amortize the GPT model family’s enormous training costs. A similar story applies to Anthropic. In a thus far fruitless search for profitability, both of those companies have taken to poaching the business of the downstream vendors who repackage and resell their services, a sure sign of a business ecosystem heading for collapse.

Suddenly, it seems as if we are already past the threshold of Phase 6-7 of the AI Winter Cycle, and this time it comes with an extra helping of Financial Shipwreck. Seven AI companies now make up 30% of the value of the S&P 500, and represent essentially all the year’s growth in the index. AI’s contribution to U.S. GDP and to annual GDP growth are both now in the 1%-1.5% range, which is a crazy level of risk exposure to an industry with a weak grasp on its own business.

If, as now seems very likely, investors start demanding that AI start paying its own way now, instead of waiting for their scheduled 2030 arrival at the Sunny AGI Uplands, the AI world is suddenly going to look like a very different place, because AI can’t pay its own way at current revenue levels. Free ChatGPT/Claude/Gemini is going away this year, I feel pretty sure. I don’t think that even $200/seat/month accounts bring in enough money to pay for the inference costs that they inflict on their suppliers, so at a minimum all those subscription rates need to be raised sharply just to cover their costs, and that is the way to shrink the hell out of that market.

The U.S. government has been making large AI infrastructure and model development investments predicated on partnering with the AI industry, which would still be largely in charge of pretraining models to be fine-tuned by government scientific (and other) customers. That model is not going to work at all if those AI companies start selling off their data centers and power contracts to Crypto miners for pennies on the dollar, and start firing employees as if suffering from a case of corporate dysentery. There would be nobody on the partner side answering email.

Also, the U.S. government cannot bail out the AI industry. There will no doubt be calls for a new Federal program to rescue NVidia, Microsoft, Meta, OpenAI, Anthropic etc., adverting to the “strategic” value that those firms have to U.S. national security. But even if this were not the wretched self-serving horseshit that it clearly is, the U.S. can’t afford to make up the losses of trillions of dollars in mis-allocated capital that these companies have already set on fire. The U.S. government’s financial position is much more precarious than it was in 2009, and in any event nobody believes that the Feds would know what to do with NVidia, or Anthropic, or Oracle, if they somehow wound up with majority shares of those firms.

I think that it’s coming apart now. Not in a few months. It is happening under our eyes, now.

Then What?

The last 8 paragraphs are, I believe, the most derivative and least informed of any that I have written in this series. I am not a financial expert in any sense of the phrase. Most of what I have written in these essays I can defend based on my fairly extensive domain knowledge in the subjects of machine learning and statistics and computational science. I do wish to write about what I think is going to happen, however, and in this moment it appears that to do so I need to connect the things that I do understand well to financial matters that I feel much less certain about. Please do not make any investment decisions based on what I write.

Instead of babbling on about business economics, I’d like to return to the subject of machine learning (and be shot of the term “AI”), and try to understand what residual value will remain from this strange, two-decade adventure, once the now-inevitable Winter brings on its now-inevitable retrenchment.

If you’ve been patient enough to read through these posts, then you probably know that I am an admirer of many of the scientific accomplishments that came out of the 2007 deep learning revolution. Those accomplishments are real. We can now distill weather patterns output by computationally-expensive numerical weather prediction (NWP) codes into computationally-cheap DL models that can reliably forecast weather up to 2 weeks in the future as well as those NWP codes can. We can predict protein structure, which, take my word for it, is a huge advance in biology. We can make very good guesses at chemical and material structures that lead to desirable properties, and that can actually be manufactured, without those chemicals or materials ever having existed before. These are all Nobel-caliber advances, some of which have in fact been rewarded with Nobel prizes. We will still have them, and obtain others like them, after the dust settles on the burst AI bubble.

The AI companies themselves will, incredibly, be forced to bequeath to the public some things of great value. Most of the models at the cores of their chatbots, including their trained model parameters, are publicly available and open-source licensed at Huggingface, a public code and data repository dedicated to LLMs. The firms that built those models do not release their training data, but they did train those models on enormous datasets, at enormous expense in compute, electrical power, and capital, and the resulting models can actually be useful when run stand-alone (if one has reasonable expectations). I think that Marc has made this sort of point several times in comments to previous posts, and I basically agree with it.

Moreover, as I wrote in Part 6, I also believe that the LLM adventure has taught us something interesting that we did not previously know: it is in fact possible to model human language on a computer. This is one of the hardest modeling problems ever attempted, and the history of NLP is littered with failed attempts to crack it. The 2017 introduction of the transformer architecture succeeded in demonstrating that the crazy idiomatic quirkiness of human expression is not beyond capture by computers that we can build, now. That’s a fantastic discovery, because previous experience suggested that capturing the distribution of human language on a computer might be one of those things that are possible in principle, but that one could bankrupt the world and still fail if one actually attempted to do it in practice. Now we know that it can be done by nearly bankrupting the world. This is progress! What we need now is to figure out alternative NLP strategies that do not suffer from the crazy scalings of DL methods. I think that there are some real possibilities for this.

For me, an important silver lining in the LLM debacle is that academic research on machine learning may finally recover from the learned helplessness with which it has faced its exclusion from the development of state-of-the-art language models. The eye-watering cost of the infrastructure required to train and operate such models has meant that academic scientists have been sidelined from model development and pretraining. Also (I’m a bit ashamed to say) a certain passivity and vulnerability to groupthink has prevented them from looking for good alternatives to simply becoming customers of the AI industry. Now that one can foresee that the colossal investments by the industry will soon cease to exercise their mesmerizing effects on researchers, perhaps we can get serious about the science of machine learning, and about the uses of machine learning in science. And about recovering our agency in our own scientific endeavor.


  1. Ed Zitron’s substack and “Better Offline” podcast offer far more detailed and research-backed analyses of the business economics of the Tech industry than I could possibly write. I find them quite valuable, if a bit prolix at times.↩︎
  2. A financial bubble is a different “cyclic” pattern from an AI Winter, of course. One is a concept in history of finance, the other in history of science. They share the essential trait of hype and unrealistic investor hopes. For the first time in history, however, one type of cycle may trigger the other. Again, Zitron has been one of the most prominent and early voices warning of an AI bubble in the offing.↩︎


Part 6: The Pathology at the Heart of Hyperscaling

by WaterGirl | December 10, 2025 | 7:30 pm | 40 Comments

This post is in: Carlo Graziani, Carlo's Artificial Intelligence Series, Guest Posts, Science & Technology

Guest post series from Carlo Graziani.


On Artificial Intelligence

Hello, Jackals. Welcome back, and thank you again for this opportunity. Being able to write these posts on AI has been very helpful to me in clarifying and sorting out my thinking on this subject. The comments that have followed each post have been of very high quality and on point, making up excellent and informative (including to me) discussions.

The plan is to release one of these per week, on Wednesdays, with the Artificial Intelligence tag on all the posts, to assist people in staying with the plot.

Most of these posts have had a nerdy tinge, because the take that I have developed on AI is an unusual one, blending a mix of reflections on the technical side of the subject with a skeptical (and largely contrarian) outlook on much that passes for conventional wisdom among this discipline’s practitioners. This sort of project, wherein someone claims that much accepted technical wisdom of a certain field of science and technology is in fact wrong, necessarily exposes one to charges of being a crank or a crackpot, unless one is careful to provide some detailed and hopefully persuasive technical arguments pointing to unexamined assumptions and to scientifically plausible alternative views. Hence the plunge into nerd-core.

This post is the last of the truly nerdy posts in this series. After this, I’ve mostly emptied the bag of things that I think I know that most people don’t. So the final post will be a sort of high-level summary, combining take-aways from the series with some historical considerations to attempt some synthesis of where we are with AI, what this moment means, and where we might be heading in the not-too-distant future. If you’ve been waiting for a hopefully more accessible discussion of AI, that should be the one.

That said…

Part 6: The Pathology at the Heart of Hyperscaling

The term “hyperscaling” entered the common vernacular with a vengeance in 2025. The prefix “hyper” in “hyperscaling” is not marketing vapor. At this stage, most people who notice economic news at all are aware that something very unusual is happening to the U.S. economy. AI Capital Expenditures (“CapEx”) contributed 1.1% to U.S. GDP growth in the first half of 2025, a figure that is almost certainly higher when annualized. Total AI CapEx in 2025 is estimated at somewhere in the neighborhood of $500B, which is an incredible 1.4% of U.S. GDP. The Tech industry has persuaded analysts that AI investments between now and 2030 will add up to something like $5T. And as astounding as these numbers are, they look tame compared to the anticipated build-out of U.S. electrical power generation: estimates of the additional generation (and transmission) capacity required to run AI training and inference are somewhere in the range of 75-100 GW. The middle of that range (87.5 GW) corresponds to roughly 7% of total U.S. electrical generation capacity. And that’s just to power the AI data centers.
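As a sanity check on that last percentage, here is the back-of-the-envelope arithmetic. The roughly 1,250 GW figure for total U.S. generating capacity is an approximate public number that I am supplying for illustration; it is not taken from this post.

```python
# Rough arithmetic behind the "~7% of U.S. generating capacity" claim.
ai_low_gw, ai_high_gw = 75, 100          # projected extra capacity for AI (from the post)
us_capacity_gw = 1_250                   # approximate total U.S. generating capacity (assumption)

mid_gw = (ai_low_gw + ai_high_gw) / 2    # 87.5 GW
print(f"{mid_gw} GW is about {100 * mid_gw / us_capacity_gw:.1f}% of ~{us_capacity_gw} GW")
# -> 87.5 GW is about 7.0% of ~1250 GW
```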


These numbers are ridiculous. It seems clear that those projections for cost and power generation are totally unrealistic, unlikely to ever come to pass, and damaging to the economy to the limited extent that they do come to pass.

The damage is not limited to economics, however. This level of investment also moves minds. The Tech industry consensus—that hyperscaling of large language models (LLMs) will bring about Artificial General Intelligence (AGI)—has attracted a critical mass of mindshare from government and from academia. This is certainly a bit of an own-goal for academic institutions, because by buying in to this consensus academic scientists have entirely shut themselves out of LLM development: no university or consortium of universities can afford to build out even a small fraction of the computational infrastructure required to train or operate these models at these scales. And even the U.S. government is struggling to stay in the game.

The interesting question here is why this consensus has developed across the field. And the “why” that I’d like to discuss is not a business-strategy “why”, but rather a technical “why”. As a matter of science and technology, what is the peculiarity of LLMs that fuels the drive to hyperscale? After all, we have had “AI” deep learning (DL) methods since 2007. Hyperscaling, however, is peculiar to the subset of DL methods that power LLMs, and has only really gotten underway since 2022. What changed?

Overparametrization

In a sense, the seeds of hyperscaling have been an implicit part of DL ever since the DL revolution began in 2007. The origin of the phenomenon is the practice of model overparametrization.

In normal parametrization, one has a model with some parameters, which may be thought of as knobs that one can dial to arbitrary values. The knobs control the predictions that the model makes of the data. One determines the values of those knobs by adjusting them so as to minimize the misfit between the model and the data. Normally, one has many fewer parameters than data samples.

Overparametrization is the practice of endowing a model with many more parameters than one has data samples to model. It has not always been regarded as a useful practice: before the rise of DL, it was in very bad odor, for good and sufficient reasons. The problem with overparametrization is that it endows a model with too much flexibility and too little predictive power. After all, the purpose of training a model on some data is to obtain a means of making predictions and decisions given new data samples (recall the Statistical Learning Catechism, from Part 1 of this series). But if I have 10,000 data samples, and I naively train a 100,000-parameter model on that data, what will inevitably happen is that the model will predict the training data perfectly, while giving wildly wrong predictions about new data not included in the training set. In statistical parlance, the model will overfit the data.

Basically, as a matter of algebra, you only need 10,000 parameters to “solve” for the 10,000 training data samples. This is sometimes called “memorizing the training data”, i.e. learning to predict those 10,000 training samples perfectly. Despite its perfect fit to the training data, that 10,000-parameter model will give terrible predictions of any new data not included in the training set. And a 100,000-parameter model is—or ought to be—much worse.

The point is that the true process underlying the data always has more smoothness than the data itself, because the data has additional random noise. Any computational model of that process should not be allowed to play connect-the-dots with the data, because that would be tantamount to chasing the random noise. But that is exactly what a naive overparametrized model does. And because it learns to chase random noise in the training data, it cannot properly predict the smooth behavior of the underlying process. We say of such models that they have poor generalization properties, which just means that no interpolation based on such a model can be trusted 1.

The problem is that in the example of the 100,000-parameter model trained on 10,000 data samples (not an atypical case for a convolutional neural net trained on a corpus of images) there is a roughly 90,000-dimensional parameter subspace of solutions that exactly memorize the training data. That subspace has a thickened neighborhood (in the full 100,000-dimensional space) of solutions that don’t quite memorize the data. In that neighborhood exist some values of the parameters that endow the model with reasonable generalization properties. Those values are the target.

Regularization and Stochastic Gradient Descent

The way to improve the generalization properties of an overparametrized model is to regularize it. Regularization is a very general term for a time-honored, broad family of techniques that deprive an overparametrized model of much of its freedom. By regularizing such a model, we in effect impose some kind of smoothness property on it that interferes with its ability to play connect-the-dots with the data, thus making it a more realistic representation of the smooth underlying data-generating process. The regularization technique adopted in DL methods is peculiar to the discipline, because it was in DL that extreme overparametrization was first seriously considered.

DL methods find the target parameters through the technique of stochastic gradient descent (SGD). The “gradient descent” means that there is some defined cost function of the parameters (such as some average of the prediction errors over the training set, for example), and that cost function is minimized (“descent”) by following it through the parameter space along the steepest descent direction.

Here’s an analogy to help understand the training process: think of the cost function as a continental landscape. You know that the landscape has a long, broad valley somewhere in the middle of the continent, and you are targeting a destination somewhere near the valley floor. Not the exact floor, because that would correspond to the noise-chasing data-memorization solutions, but somewhere in that neighborhood. To find the right neighborhood, you need to locate the valley floor.

A reasonable first approach is to always follow the local descending direction, until you find the lowest point. But that won’t work. The problem is that while the valley walls slope down on average (by about a foot per mile, say), the landscape is highly textured, with lots of local structure such as boulders, hills, small and twisty valleys, mountain passes, and so on. The local direction of steepest descent might actually lead you away from the continental valley floor, because you are descending some random twisty valley, or because the direction that you need to follow leads up some mountain pass. The landscape texture in the analogy is the structure added to the cost function by those 10,000 data samples, each one of which acts as a sort of structure-adding knob, contributing the boulders, hills, etc.

You need some strategy to ignore the local texture and find the large-scale average descent direction. If only by squinting you could blur the landscape, only resolving structure to within a mile instead of to a few feet, you might be able to see the average descent towards the valley floor, because the boulders etc. would be fuzzed away.

That is where the “stochastic” part of SGD comes in. By randomly swapping subsamples (“minibatches”) of data at every step of gradient descent, the optimization code’s view of the landscape texture is blurred, because the landscape of fine-scale features changes with every minibatch. As a consequence, the small-scale structure ceases to obscure the large-scale path of descent. Even better, the 90,000-dimensional memorization submanifold—the exact bottom of the valley, corresponding to the family of connect-the-dots solutions—also gets fuzzed out somewhat, and harder to find. The optimization algorithm dwells for longer in the liminal region where good generalization may be found.

Such a solution is identified by computing the cost function on held-out “test” data, which is not used in the gradient computation. At first, both the training and test costs decline in tandem. At some point in the training, however, the test cost stops dropping and either flattens out or starts rising again, while the training cost continues to drop. This is the stopping point of the training loop: a parameter solution has been located with reasonable generalization properties, because the test cost is OK, and there is no point in continuing, because the optimization routine is about to find the data-memorization submanifold (the lowest point of the valley floor).
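Here is a bare-bones sketch of that training loop in numpy, for a linear model standing in for a real deep network: minibatch stochastic gradient descent on an overparametrized problem, with a held-out test set used to decide when to stop. All sizes, learning rates, and the synthetic data are made up for illustration; the structure of the loop is the point.

```python
import numpy as np

rng = np.random.default_rng(2)

# Overparametrized setup: more parameters (200) than training samples (100).
n_train, n_test, n_params = 100, 100, 200
true_w = rng.normal(0, 1, n_params) * (rng.uniform(size=n_params) < 0.05)  # mostly-zero truth
X_train = rng.normal(0, 1, (n_train, n_params))
y_train = X_train @ true_w + rng.normal(0, 0.1, n_train)
X_test = rng.normal(0, 1, (n_test, n_params))
y_test = X_test @ true_w + rng.normal(0, 0.1, n_test)

w = np.zeros(n_params)
lr, batch_size, patience = 0.01, 10, 20
best_test, since_best = np.inf, 0

for epoch in range(2000):
    # "Stochastic": shuffle and sweep the data in small minibatches, so each
    # gradient step sees a slightly different (blurred) cost landscape.
    order = rng.permutation(n_train)
    for start in range(0, n_train, batch_size):
        idx = order[start:start + batch_size]
        resid = X_train[idx] @ w - y_train[idx]
        grad = X_train[idx].T @ resid / len(idx)
        w -= lr * grad

    # Early stopping: once the held-out cost stops improving, quit before
    # the optimizer reaches the data-memorization region.
    test_cost = np.mean((X_test @ w - y_test) ** 2)
    if test_cost < best_test:
        best_test, since_best = test_cost, 0
    else:
        since_best += 1
        if since_best >= patience:
            break

print(f"stopped at epoch {epoch}, best held-out MSE {best_test:.4f}")
```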

Overparametrization and Data Types

That was rather a long explanation of the practice of overparametrization in DL. I need it here, because I need to discuss the costs and benefits of the practice, and how those vary depending on the type of data and decisions that one’s model is required to traffic in.

The combination of overparametrization and SGD is key to the success of all DL methods. It should be clear, however, that from a strictly statistical point of view, overparametrization is inefficient. Remember that according to the Statistical Learning Catechism, an important part of the job of any such model is to learn the distribution of the data. That distribution may be quite complex, and many parameters may be required to approximate it. But in a principled statistical model, the required number of parameters should not depend on the number of samples used to determine those parameters. Instead, the number of parameters should be fixed, and determined only by the structure of the data-generating process. As the number of data samples increases, the precision with which that fixed parameter set is determined should get better and better 2.
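A quick numerical illustration of that “principled” behavior (and of footnote 2): estimate the two parameters of a Gaussian from progressively larger samples. The parameter count stays fixed at two no matter how much data arrives, and the estimation error shrinks roughly like one over the square root of the sample size. The data here are synthetic and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
true_mu, true_sigma = 1.5, 0.7   # the fixed two-parameter "process"

for n in (100, 10_000, 1_000_000):
    x = rng.normal(true_mu, true_sigma, n)
    mu_hat = x.mean()      # still only two parameters,
    sigma_hat = x.std()    # no matter how large n gets
    print(f"n={n:>9,d}  |mu error| = {abs(mu_hat - true_mu):.5f}  "
          f"(expected ~ {true_sigma / np.sqrt(n):.5f})")
```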

DL models do not have this property. Instead, there is a notion of model capacity, which corresponds to the size of the parametrization, and which is required to grow as the training data grows, in order to maintain the necessary overparametrization. The more data one trains on, the greater the required model capacity. This is the sense in which they are inefficient.

This inefficiency is not, in and of itself, a bad thing. As we have seen, DL methods have been incredibly successful at addressing problems such as image classification, protein folding, materials design, etc. that were previously regarded as intractable. When used in such applications, they could be trained at very reasonable cost in computation and energy. It was definitely not necessary to occupy a 10,000-GPU datacenter for months to train a high-quality image classifier, for example. On the other hand, this is exactly the computational scale required to train an LLM on corpuses of text. Why the difference?

The answer, in my opinion, is that the distribution of text tokens from human language is vastly more complex than any other data distribution that has ever been modeled using DL methods. Billions of tokens of text are required in order to begin to capture the regularities and peculiarities and sheer chaos of human expression. This is larger, by several orders of magnitude, than, for example, the number of photographic images required to train an image classifier. And this being the case, the model capacity of LLMs has of necessity grown to unprecedented size (hundreds of billions, or even trillions of parameters), in order to maintain overparametrization.

But training models of this sort of capacity requires vastly more compute and energy. Hence, hyperscaling.

On “Emergence”

The account that LLM practitioners give of their capacity issues is simultaneously humorous and frustrating. They do not appear to view those issues in the light of the story that I have just presented. They see no continuity with the capacity requirements characteristic of all DL. Instead, they seem to have persuaded themselves that the relationship between data corpus size (in billions of tokens) and LLM capacity (in billions of parameters) is sui generis, and amounts to some kind of natural law of computing wherein intelligent behavior “emerges” naturally in consequence of the growth of the complexity of the model. The linear relationship between corpus size and model size has even been dignified with a scientific name: the “Chinchilla Scaling Law”, named after the Chinchilla model discussed in this paper from DeepMind. The idea that there are such “scaling laws” for LLMs was originally suggested in this paper from OpenAI.
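For a concrete sense of what that linear relationship implies, here is a back-of-the-envelope sketch using two commonly quoted approximations from the literature rather than figures from this post: roughly 20 training tokens per model parameter for a “compute-optimal” model, and roughly 6 floating-point operations per parameter per training token.

```python
# Back-of-the-envelope: the linear "Chinchilla" relationship between
# training tokens and model capacity, plus a rough compute estimate.
TOKENS_PER_PARAM = 20        # commonly quoted compute-optimal ratio (approximate)
FLOPS_PER_PARAM_TOKEN = 6    # standard rough estimate for training FLOPs

def chinchilla_sketch(train_tokens):
    params = train_tokens / TOKENS_PER_PARAM
    flops = FLOPS_PER_PARAM_TOKEN * params * train_tokens
    return params, flops

for tokens in (1e9, 100e9, 10e12):   # 1B, 100B, 10T training tokens
    params, flops = chinchilla_sketch(tokens)
    print(f"{tokens:.0e} tokens -> ~{params:.0e} params, ~{flops:.1e} training FLOPs")
# Double the corpus and the "compute-optimal" parameter count doubles with it,
# while the training compute grows quadratically. Hence hyperscaling.
```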

This is funny, in a grotesque sort of way. What is really happening here is that, as with all DL methods, without sufficient model capacity the model performance sucks. Then, as one grows the model capacity by adding parametrized computational elements, at some point the model becomes trainable, and begins to suck less. This “point of sucking less” is what LLM developers, in a brilliant stroke of marketing, have chosen to call “emergence”. And the advent of emergence is purportedly predicted by the scaling laws, which are felt to hold some deep significance about the nature of intelligence, and whose discovery rivals that of Newton’s law of gravitation in scientific importance.

This is, of course, pure horseshit. It is indicative of the corrupted state of the science of machine learning under the influence of the business imperatives of the Tech industry. A certain answer is required to be true by the industry: AGI is here, or, at least, nigh, which we know because of these “scientific discoveries”. The scaling law points the way to AGI: we will get there through larger models, more compute, higher CapEx. “Science says” that Hyperscaling will bring about AGI. And AGI will be so great that markets will emerge to absorb the $5T in investment required to get there.

This is a ridiculous story, since there is no credible scientific support for any of these claims. It’s a wild adventure in Madness of Crowds capitalism that cannot possibly end well.

What Have We Really Gained With LLMs?

The advent of the transformer architecture in 2017 represented a true breakthrough in natural language processing (NLP), one worthy of the highest scientific praise. Not because of any bullshit about AGI or “emergence” or scaling laws, but because transformers have demonstrated that natural language can be modeled at all. It was not clear prior to the transformer that this was true, or at least that it might ever be possible to model the distribution of human language text on a realistic computer.

There are two types of scientific impossibilities. The first type is that of things that are simply straight-up impossible so far as our current scientific understanding is concerned (faster-than-light travel, for example). The second type consists of things that, while not impossible in principle, are impossible in practice, because if you tried to accomplish them you would bankrupt the World in the attempt and still fail (human spacefaring travel to nearby stars is in this category).

Some “type 2 impossible” problems have to do with whether something is or is not amenable to computational modeling. There are many problems that have known scientific principles, but which we do not expect to ever be able to model on a real computer. For example, first-principles numerical modeling of strongly turbulent fluid flow is just one such “type 2 impossible” problem, because no computer we can imagine building would have the capacity to resolve all the physical length scales required for such a simulation.

As to NLP, it has always been clear that language has patterns, and that text corpuses issue from some complicated distribution. It was not clear until 2017, however, that computational modeling of that distribution was not “type 2 impossible”. Now we know that it is in fact possible. That is a real scientific accomplishment.

Unfortunately, the realization of such models using transformers comes at a very daunting cost in compute, power, and capital. This is especially true now that the subject has become entangled with a totally unrealistic quest for a goal (AGI, or, at least, artificial reasoning) that is almost certainly not available along the current technological path. This cannot go on forever. And as economists are fond of saying, if something cannot go on forever, it will stop.

Are There Alternatives To Transformers?

Transformer-based LLMs have exhausted their usefulness as research tools. It is past time to start looking for alternatives.

A small number of leading researchers are already unhitching their wagons from the LLM caravan. Yann LeCun, one of the most celebrated DL scientists, made a splash earlier this year by parting ways with Meta to start up a research effort on an approach called “World Modeling”. Song-Chun Zhu, a renowned expert in computer vision, has returned to China after 20 years of research in the U.S., to pursue a set of approaches that he characterizes as “Small Data, Big Task” (by implication, DL approaches have it the other way around).

I think it is promising that some very bright folks are breaking away from transformers (and possibly from DL altogether). As a matter of personal outlook, I am not sanguine about purely computational approaches such as World Models and SDBT superseding LLMs. These are, as I understand matters, still computational learning approaches. As I’ve written in past posts, I feel that the most serious defect of DL approaches is that they place little value on reasoning about data distributions, while focusing too much attention on models. In effect, in light of the Statistical Learning Catechism that I’ve expounded upon, they are computational models attempting to do a statistical model’s job, and as a consequence they make inefficient use of their data. That inefficiency is tolerable for most data types of interest, but unaffordable for human language learning. And while learning human language distributions is not sufficient to model human reason, I believe that without learning human language distributions there is no chance of any kind of emulation of the human faculty of reason.

So what I’d like to see, and something I am attempting to do in my own research, is to try to introduce principled statistical models to perform language-learning tasks. I can’t tell yet whether the things that I am trying will work as well as a transformer, or even be competitive with one. However, I do think that it ought to be possible in principle to construct a model whose capacity need not scale with data corpus size, and whose parameters are fixed in number by the structure of the human language distribution that it models. Those parameters should be determined more accurately the more training data the model is fed.

To accomplish something like this, not only must one break away from transformers: it is necessary to give up on DL methods altogether, because the overparametrization required by all DL-based methods makes language learning unaffordable.

This type of approach is in a sense less ambitious than what LeCun and Zhu are attempting to bring about, because it is still strictly concerned with language modeling. But this seems to me a good way to leverage the one solid, useful result that has emerged from modern NLP: natural language can be modeled on a real computer. Now we just need to find a way to do it that doesn’t bankrupt the World.


  1. Note that “generalization” here means “interpolation”. We can only expect such models to give good results on new data that is very similar to the training data (technically, “in the support of the training data”). In the extrapolation regime i.e. far from the training data, DL methods always give terrible (but very confident) predictions. You might call such predictions “hallucinations”.↩︎
  2. Roughly speaking, the uncertainty on the values of those parameters should scale inversely with the square-root of the number of independent samples.↩︎


Part 5: Hallucinations

by WaterGirl | December 3, 2025 | 7:30 pm | 78 Comments

This post is in: Carlo Graziani, Carlo's Artificial Intelligence Series, Guest Posts, Science & Technology


Guest post series from Carlo Graziani.

On Artificial Intelligence

Hello, Jackals. Welcome back, and thank you again for this opportunity. Being able to write these posts on AI has been very helpful to me in clarifying and sorting out my thinking on this subject, and the comments that have followed each post have been of high quality and on point, making up excellent and informative (including to me) discussions.

The plan is to release one of these per week, on Wednesdays, with the Artificial Intelligence tag on all the posts, to assist people in staying with the plot.

I had originally planned to post a high-level summary during Thanksgiving Week, to try to offer usable take-aways to people put off by my nerd-babble. After some discussion with WaterGirl, we have decided instead to leave the summary posts to after the conclusion of the series.

Part 5: Hallucinations

In November 2022, Vancouver resident Jake Moffat needed to travel to Toronto to attend his mother’s funeral. He asked an Air Canada chatbot about the terms of a bereavement fare, and the chatbot assured him, incorrectly, that according to the company’s rules he could receive the bereavement discount retroactively after traveling on the regular fare. When Air Canada denied Moffat the discount, he brought the company before British Columbia’s Civil Resolution Tribunal. The Tribunal held that Air Canada was liable for its chatbot’s representations to customers on its own website, and had to pay Moffat damages and legal fees.

In May 2023, a plaintiff attorney named Steven A. Schwartz filed a legal brief in the Southern District of New York containing references that the judge deemed to be “…bogus judicial decisions with bogus quotes and bogus internal citations.” Schwartz acknowledged that the source of the bogus references was in fact ChatGPT, representing to the judge that ChatGPT had, upon being questioned about the authenticity of the cases, responded that they were “real” and “can be found in reputable legal databases such as LexisNexis and Westlaw.”

In the spring of 2025, the Chicago Sun-Times published a 15-title summer reading list. Ten items on the list were made-up titles attributed to real authors.

Google’s AI Overview has recommended using non-toxic glue on pizza to help cheese stick to the pie.


I could go on, but it gets boring. Finding examples in the media of AI going off the rails in embarrassing ways is easier than finding inebriated people on Chicago streets at noon on Saint Patrick’s day. Just try a web search on “AI hallucinations”. The AI hallucination is a daily phenomenon, affecting programmers attempting to speed up their coding, scientists looking for fast ways to generate or clean up papers and proposals, and anyone in need of text that must precisely reflect some legal constraints.

AI models are also notoriously bad at mathematical reasoning, making elementary arithmetic mistakes as well as serious mathematical errors. I have prompted ChatGPT to perform a certain standard physics derivation 1 twice now, at a distance of several months, and both times I have obtained careless stupidity that no undergraduate would be capable of producing, presented with professorial polish and total didactic aplomb.

It’s fun to point and laugh, but sometimes it is no joking matter. People have received bad, even dangerous medical advice from ChatGPT. There is a high-profile effort underway to use AI to “democratize” financial advice, which is seemingly innocent of the associated risks. There’s a pending patent for “AI Traffic Control” which is exactly as terrifyingly stupid an idea as it sounds. In fact, we are living through a moment in which the Tech industry is desperately attempting to propose AI for any application offering any prospect of profitability–no firm today makes any money on AI services–so it is not surprising to see such risks minimized or hidden altogether.

To observers of this discipline, the hallucination phenomenon is a very serious problem, and is another reason to question whether “Artificial General Intelligence” (AGI) is even a remote possibility on our current technological path. Certainly it would seem that if the hallucination issue is not understood and corrected somehow, any prospective AGI will babble hilariously, and possibly dangerously, some unpredictable fraction of the time.

The Tech industry consensus on hallucinations, however, is some combination of (a) hallucinations are not really a problem, and (b) more pretraining of improved (i.e. “larger”) models with more data at higher cost in compute and power will make them go away, as AGI finally emerges. I have had conversations with people who really believe in (a) or (b), and at least one person I spoke with appeared to somehow hold both views simultaneously.

View (a) is obviously not even worth discussing, given the high stakes involved in many AI applications. What I’d like to discuss today is view (b): can we really expect bigger models trained at higher expense with more data to do away with the stubbornly persistent phenomenon of AI hallucinations?

In order to address this question, we need to understand where these hallucinations come from. For that, we first need to review what it is that LLMs do.

What Does an LLM Do?

It is helpful to recall the basic definition of statistical learning–the subject that encompasses all of AI–at this point. Here is how all this stuff works:

  1. Take a set of data, and infer an approximation to the statistical distribution from which the data was sampled;
    • In the case of LLMs, the data consists of hundreds of billions of words of text
  2. At the same time, optimize some decision-choosing rule in a space of such decisions, exploiting the learned distribution;
    • With LLMs, a decision could be response to a text prompt, or a judgment about whether the text expresses positive or negative sentiment, or a translation to another language, etc.
  3. Now when presented with a new set of data samples, produce the decisions appropriate to each one, using facts about the data distribution and about the decision optimality criteria inferred in parts 1. and 2.

The reason that I keep bringing these up is that I find this model-agnostic view of the machine learning enterprise extremely clarifying, and helpful in directing attention towards what matters and away from irrelevant aspects of model design.

We should apply the above catechism to what LLMs do. Data from natural language text consists of sequences of words, interspersed with punctuation. LLMs learn features of the distributions over such sequences that allow them to probabilistically predict what the next response word should be, given a prompt and any response words previously supplied.

So, for example, suppose your prompt to be completed by the LLM is “Bob was nervous about his presentation to the board, despite his preparation the night before.” and the LLM completes it with “He had practiced by reading his slides and timing what he said while each one was displayed.” The LLM starts with the prompt as its context, and uses the learned distribution to compute the probability distribution of the next word. From that distribution, it samples (i.e. decides on) the word “He”. It then appends “He” to the prompt to form a new context, and calculates a new distribution for the next word. It turns out that “had” is pretty high in the probability list, and gets selected. The context is now “Bob was nervous about his presentation to the board, despite his preparation the night before. He had”. The LLM repeats the process, and probabilistically samples the word “practiced” from the new distribution. And so on.

No kidding. This is all that is going on. Next response token prediction based on the prompt and all previous response tokens. That is the entire trick. Neat, eh?
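To make the loop concrete, here is a minimal Python sketch of it. The function next_token_distribution is a hypothetical stand-in for the trained model (the real thing is the learned approximation to the language distribution), and the toy tokens and probabilities are made up.

import random

def next_token_distribution(context_tokens):
    # Hypothetical stand-in for the trained model: given the context so
    # far, return candidate next tokens and their probabilities.
    return {"He": 0.4, "She": 0.3, "The": 0.2, "Octopus": 0.1}

def generate(prompt_tokens, n_tokens):
    context = list(prompt_tokens)
    for _ in range(n_tokens):
        dist = next_token_distribution(context)
        tokens, probs = zip(*dist.items())
        # Sample the next token in proportion to its probability, append
        # it to the context, and repeat.
        context.append(random.choices(tokens, weights=probs, k=1)[0])
    return context

print(generate("Bob was nervous about his presentation".split(), 5))

That, plus an enormous amount of engineering, is the generation loop.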

Back to hallucinations: there are two places to look for their origin: the approximation to the data distribution, and the next-token decisions founded on that distribution. Let’s take them in order.

Approximating the Distribution of Human Language

The reason that text comprising hundreds of billions of words is required to train an LLM is that the statistical regularities of human language are extremely complex, and not easy to capture in a principled statistical model.

Just scroll your eyes up and down this essay briefly, and then imagine figuring out the rules by which the words are juxtaposed, without being detained by trivialities such as meaning. There are rules: you rarely see the same word repeated immediately (e.g. “immediately immediately”) which is clearly a rule. There are grammar rules, and context rules. Certain clusters of words recur together in certain types of text and not in others: you will find pairings of “octopus” and “cephalopod” within a few hundred words of each other in texts from works on marine biology, but pairings such as “octopus” and “mortgage” are probably very rare. In fact, the occurrence of “octopus” in a page probably means that the probability of encountering “mortgage” within the next 1000 words is considerably reduced from the average rate of occurrence of that word, while the occurrence probability of “shark” is likely enhanced. And so on. How would one go about describing these patterns?
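Before getting to how NLP systems actually handle this, here is a toy Python sketch of the kind of co-occurrence statistic just described. The miniature corpus and the window size are made up purely for illustration; real systems work with billions of words, but the idea is the same.

from collections import Counter

corpus = ("the octopus is a cephalopod and like the shark it lives in "
          "the sea while a mortgage lives in a bank").split()

window = 5
pair_counts = Counter()
for i, word in enumerate(corpus):
    for other in corpus[i + 1 : i + 1 + window]:
        # Count each pair of words that occurs within `window` words of
        # each other, regardless of order.
        pair_counts[tuple(sorted((word, other)))] += 1

print(pair_counts[("cephalopod", "octopus")])  # co-occur within the window
print(pair_counts[("mortgage", "octopus")])    # never do, in this tiny corpus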

The approach used in natural language processing (NLP) since time immemorial is to begin by breaking down the text into tokens, then describe the text as a sequence of such tokens. This tokenization is a subtle and arcane art. You might think that it would be logical to break things down into words, numerals, punctuation, etc. While not wrong, this approach is very inefficient. The problem is that the English language (say) has about 500,000 words, which is a huge vocabulary for an LLM to manage. Vocabulary size is a critical parameter to be managed in this game, because the larger the vocabulary, the larger and more expensive the model.

On the other hand, breaking things down into individual letters is also a bad idea. While the vocabulary size is now much smaller (less than 50 for English), the token sequences are much longer, and the patterns much harder to find. The patterns are at the word level, not the letter level. It’s just that there are so many damn words!

The secret sauce is to notice that most of those half-million words are extremely rare. Studies of natural language have shown that knowing 10,000 words in any language allows one to understand 99% of texts in that language. In English, that would be 2% of the full vocabulary. Moreover, the rare words can be built up out of smaller word pieces. Identifying an optimal set of word pieces, most of which are full words in their own right, is the name of the game here. Algorithms exist that can represent all English text using 30,000 to 50,000 tokens, which is a considerable savings in vocabulary size. So tokenization is (largely) a solved problem 2.
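As a rough illustration of the word-piece idea, here is a toy Python sketch that greedily splits a word into the longest matching pieces from a tiny, made-up vocabulary. Production tokenizers (BPE, WordPiece, and their relatives) learn their vocabularies from data and are considerably more careful, but the flavor is similar.

vocab = {"un", "break", "able", "the", "glass", "is"}

def tokenize_word(word):
    pieces, start = [], 0
    while start < len(word):
        # Take the longest vocabulary piece that matches at this position.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matches; real tokenizers have a fallback for this.
            return [word]
    return pieces

print(tokenize_word("unbreakable"))  # ['un', 'break', 'able']
print(tokenize_word("glass"))        # ['glass']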

Embedding

The next thing that essentially every NLP method does with its tokenized text is a process called embedding: Each token is mapped into a vector space of dimension about 1000 (basically, each token gets described by a list of 1000 numbers) endowed with a notion of distance between points similar to the notion of distance between points in 3-dimensions. At this point, all operations on a sequence of tokens become operations on such lists of numbers. So when a transformer operates on a sequence of tokens (a sentence or a set of sentences, including previously-generated text) that sequence gets embedded in a very high-dimensional space: for example, the prompt above (“Bob was nervous…”) consists of about 17 tokens, so it is mapped to a point in an approximately 17,000-dimensional space consisting of 17 copies of the original 1000-dimensional embedding space.
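Here is a minimal Python sketch of the embedding bookkeeping, with made-up numbers: the vectors are random, and the embedding dimension is 4 rather than roughly 1000, but the point is the structure. A sequence of n tokens becomes a single point in an n-times-embedding-dimension space, and distances between such points are now well defined.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 30_000, 4      # real models: ~30k-50k tokens, ~1000 dimensions
embedding_table = rng.normal(size=(vocab_size, embed_dim))

def embed(token_ids):
    # A sequence of n token ids becomes a point in an (n * embed_dim)-dimensional space.
    return embedding_table[token_ids].reshape(-1)

a = embed([17, 512, 90])
b = embed([17, 513, 90])
print(a.shape)                  # (12,): 3 tokens times 4 dimensions
print(np.linalg.norm(a - b))    # a geometric "distance" between the two sequences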

I want to draw attention to embedding for several reasons: one is that it is an essentially universal practice in NLP, preceding the invention of transformers by many years. It turns out to be much easier to model probability distributions by operating on lists of numbers than by operating directly on sequences of discrete tokens sampled from a finite-dimensional set (the vocabulary). So researchers have defaulted to the embedding strategy.

Another reason to emphasize embedding is that when transformers train embedding parameters, they appear to do something magical: the resulting embeddings cluster together words and word fragments with similar meanings or functions, in well-separated clusters in the embedding space. You can see examples of this at Kevin Gimpel’s Bert Embedding Visualization Page, where you will see visualized in two dimensions clusters of suffixes, of verbs with similar meanings, of types of enclosed spaces, etc. It is one of those weird effects that persuade some people that LLMs are in fact acquiring a sense of the meaning of words.

The final reason to draw attention to embedding is this: embedding almost certainly poisons the approximation to the distribution of language tokens. The embedding step destroys information about that distribution. The reason is that the original native space of token sequences is entirely innocent of vector spaces, and contains no geometric notion of spatial proximity such as arises in the embedding space. That spatial proximity structure is entirely imposed by the NLP architecture. And it almost certainly gives rise to improper notions of proximity between sentences that are sensible (i.e. have a high probability of occurrence) and other sentences that are nonsense (i.e. have a low probability of occurrence).

As an example of improper proximity, consider these two brief sentences: “My dog is fast”, and “My sparrow is fast.” Both are well-formed, grammatically correct, and obey applicable syntactic and semantic rules. The difference is that the first sentence ought to be ascribed a much higher probability than the second one, because nobody actually owns a sparrow.

As embedded points, however, the two sentences are quite similar: a dog and a sparrow are both animals, and hence live in some proximity in the embedding space. Furthermore, dogs are pets, and while sparrows are not pets, they are birds, and some birds are pets. There are enough ways to draw proximity connections in the embedding space to make the second sentence seem plausible in the distribution approximation, despite the fact that it is, obviously, a hallucination.

So embedding is, in my opinion, one of the origins of hallucinations. It is the reason that the approximation made by LLMs to the distribution of language is so brittle. There is nonsense lurking “near” sense in the embedded sequences, because in their native space (token sequences) there was no notion of geometric “nearness”: that property of relative proximity is an artifact of the model.

And if true, this is very bad news for AGI, because it means that hallucinations are a structural feature of all LLMs. They all embed sequences. So you cannot just train your way out of insanity and into “General Intelligence”, because all those new tokens will have exactly the same problem of spurious proximity. The distribution will be corrupted from the outset. It may be that the most likely responses could appear sane, but insane responses will always lurk nearby, waiting to be sampled by the LLM.

The Tyranny of Sampling

I’ve been referring to the process of “sampling” tokens above, and I should say a bit more about that, because while we have seen the origin of hallucinations in a broken estimate of the distribution of language sequences (part 1 of statistical learning) we need to see how the problem is aggravated by an LLM’s response decisions (part 2).

LLMs are often referred to as “generative” models (the “G” in “GPT”). What this means is that their output is, in a sense, random rather than deterministic. They compute probability distributions over the next token, and then exploit that distribution to decide what the identity of the next token should be. They generally do this by choosing the token randomly, with a higher probability of selection ascribed to tokens judged more likely by the calculation.

You might well ask: “Why not simply select the next token by choosing the one with the highest probability?”

This is occasionally tried. It is a strategy called “greedy sampling”. It is very efficient. Unfortunately, it is also a recipe for disaster, a ticket to hallucination pandemonium.

The problem is this: what one really wants is the most likely extended response to the prompt, according to the learned distribution over language. This might consist of hundreds or thousands of tokens. The distribution, while imperfectly learned, appears to at least get the most likely extended response right, in the sense that it is the one least likely to contain a hallucination.

Unfortunately, sampling the most likely next-token at every stage does not produce the most likely extended response. This can be a surprise at first, but from a mathematical standpoint it is not surprising at all. The probability of the 17th next-token conditional on the prompt and the previous 16 next-tokens can be very different from the probability of the 17th next-token conditional on the prompt and on the entire remaining most-likely response (tokens 1-16 and 18-1000, say). Choosing the most likely token at every stage can, and usually does, lead the LLM into crazy rabbit-holes.
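A toy numerical example makes the point. The two-step “language” below is entirely made up; the only thing that matters is the arithmetic. Greedy choice picks the token "A" first because it is individually most likely, and thereby locks itself out of the most likely two-token response.

from itertools import product

p_first = {"A": 0.6, "B": 0.4}                  # P(first token | prompt)
p_second = {"A": {"x": 0.5, "y": 0.5},          # P(second token | prompt, first token)
            "B": {"x": 0.9, "y": 0.1}}

# Greedy: pick the most likely token at each step.
g1 = max(p_first, key=p_first.get)
g2 = max(p_second[g1], key=p_second[g1].get)
print(g1, g2, p_first[g1] * p_second[g1][g2])   # A x, probability 0.30

# Exhaustive: score every two-token response.
best = max(product(p_first, ["x", "y"]),
           key=lambda s: p_first[s[0]] * p_second[s[0]][s[1]])
print(best, p_first[best[0]] * p_second[best[0]][best[1]])   # ('B', 'x'), probability 0.36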

So instead, one attempts to let the probability distribution do its thing by allowing it to somehow sample the next-token distribution. This is better, but more expensive. In principle, what one ought to do is sample the 1000-token response many times (10,000 times, say) and choose the most frequently-occurring response. That strategy would probably abate a good deal of the hallucination phenomenon. Unfortunately, it would be totally unaffordable in inference computation cost, as well as quite slow. So intermediate strategies are adopted, restricting the next-token distribution to the top 90% of candidates, and looking along a tree to the next and next-next tokens for each one of these top tokens (the so-called “beam search”). This is better, but still not great for finding the top 1000-token response.

You might call this the Tyranny of Sampling: one must somehow sample from an LLM in order to defend its output from the worst hallucinatory offenses. But if you try to do the right thing, the computational cost will destroy the usefulness of the method. Rock, hard place.

Hallucinations Are Structural

Here’s the bottom line: Hallucinations are a structural feature of LLMs, produced by a corrupted model of the probability distribution over language sequences learned in training. The corruption is due to embedding, which is a ubiquitous feature of LLMs.

The only available hallucination abatement strategy is some form of generative sampling, which means accepting the unsettling fact that LLMs cannot produce the same output twice to the same prompt. And even accepting this non-determinism as a cost of doing business, the sampling strategy that cleans up the problem to a maximal extent is totally unaffordable. Unsatisfactory look-ahead strategies are better than nothing, but they still let a lot of nonsense through.

There is no hallucination abatement strategy that begins with more token data and larger models. That’s just not a thing, despite what the Tech industry would like to believe (and would certainly like investors to believe). More tokens and larger models likely aggravate the embedding problem, because there will be more improper proximities discovered in the embedding space.

And note that “larger” models are not “more clever” models. This discipline has not produced radical innovations to the transformer architecture since its invention, or at least none that have led to any breakthroughs comparable to what was wrought by the transformer’s first introduction in 2017. A “larger” model simply means “more parameters”3, not new mechanisms that make the model more clever. Given the argument that I make here, I very much doubt that any new cleverness could be built into a transformer that could eliminate the hallucinatory mechanisms baked into its structure at its most fundamental level.

All of which is to say this: LLM-based “AGI” will be mentally ill at birth.


  1. “Derive the wave equation starting from Maxwell equations.”
  2. Note that optimal tokenization is not a solved problem. This would be addressing the following problem: what tokenization is maximally preserving of the information borne by text?
  3. Which is to say, more embedding parameters, more attention heads. Not new mechanisms.


Part 4: If There Were AGI, What Would It Look Like?

by WaterGirl | November 19, 2025 | 7:30 pm | 65 Comments

This post is in: Carlo Graziani, Carlo's Artificial Intelligence Series, Guest Posts, Science & Technology

Guest post series from Carlo Graziani.


On Artificial Intelligence

Hello, Jackals. Welcome back, and thank you again for this opportunity. What follows is the fourth part of a seven-installment series on Artificial Intelligence (AI).

The plan is to release one of these per week, on Wednesdays, with the Artificial Intelligence tag on all the posts, to assist people in staying with the plot.

The original plan was to skip Thanksgiving week. However, I’ve been talking to WaterGirl about the technical level of these posts, and I’ve come to realize that it’s been a bit off-putting to some readers. So I think that during the turkey-day break, I’ll try to provide a high-level summary of where the series has been with an eye to keeping the nerd-babble under control.

That said…

Part 4: If There Were AGI, What Would It Look Like?

Part 3 ended with a bit of a rant, because I felt the need to express outrage at the very loose and lazy intellectual standards prevailing in much contemporary “AI” research, at least insofar as discussion of Artificial General Intelligence (AGI) goes. My perspective on the subject is by no means a majority view, and I feel a little like Diogenes, shaking his fist at the corrupt world from the austerity of his barrel.

The thing is, I don’t really enjoy the role of Diogenes, because “burn it all down” is a fundamentally destructive outlook on such things. I happen to feel that the scientific accomplishments of modern machine learning, while often oversold, are very real. I don’t want to give the impression that I think the entire subject is worthless, just because the current scientific discussions of AGI are so fundamentally wrong.

As to AGI itself, I think there is something else I need to clarify: I do not intend to say that it is impossible to achieve some version of AGI: I am simply saying that AGI is impossible along our current technological path, which is to say, based purely on machine learning techniques.

I am philosophically a materialist. I do not believe in souls. I think that consciousness is something that physical brains do, a phenomenon that arises from the electrical activities of billions of grey cells. And that being the case, I cannot in good conscience believe that it is surely impossible to bring about some kind of entity, in software running on computer hardware, that recognizably emulates aspects of human cognition, including reason. I do expect that this feat will be far more challenging to accomplish than chatbot parlor tricks we currently call “AI”. Even if true AGI is possible at all, we might not see it happen for many decades. Nonetheless, fundamentally, some AGI technology should be possible in principle.

What I want to attempt today is to describe what the scientific basis for such a technology might look like. I base this discussion on an article that I have written that is currently under review (those of you who would like to take a deeper dive will find the draft article here).

This is a purely speculative venture, and what I write here, however well-motivated, could easily turn out to be wrong. Nevertheless I think this is a useful exercise, for two reasons: it is useful to at least try to point to a possible exit from the stagnant state of current research on AGI; and, it is useful to at least try to illustrate what type of research concerns ought to replace those currently occupying scientists working on AGI.

What Should We Require Of A Theory Of Artificial Reason?


I want to narrow down these considerations, from AGI (a term for which no accepted scientific definition exists) to artificial reason, which is at least amenable to some specific discussion. What I would like is a model of what we mean by the term “reason” that is specific and detailed, to the point of being amenable, at least in theory, to implementation as software. Such a model would at least get us away from the territory of bullshit claims such as “self-organization” and “emergence” of AGI.

Last week, I discussed human reason in the context of what sort of traces it might leave in natural language text, to examine the plausibility of claims that reasoning states can be recovered from large text corpuses. I pointed out that our own reason rests on a foundation of subrational processes which almost certainly leave no such trace in text. Cognitive scientists have only the vaguest notions of how those processes work, and they can certainly not exhibit any models for them that are sufficiently specific to be represented as software. So trying to build a principled “bottom-up” model that mimics how reason emerges in a human mind is probably hopeless, at least for now.

What is left, then, is a “top-down” approach. What I mean by that is that we must work at an abstract level rather than at a mechanistic one. We must state what we mean by “reason” in general terms, in a way that we cannot directly show to be connected to the mechanisms of human reasoning, but which is motivated by the structure of reasoned thought. Also, we would like a model expressed in as mathematical a form as possible, because the point here is to come up with something that we could imagine translating into computer code.

Oddly enough, we already have one aspect of human cognition that can be represented this way: learning. We have seen that there is a subject called statistical learning, wherein by some method one learns an approximation to the statistical distribution from which some dataset was sampled, and one concomitantly learns to structure reasonable decisions based on that distribution. I’ve been a little vague about how this works, but it is a process that can be represented quite generally by the kind of model that I have in mind here.

So one possible approach (certainly not the only one!) is to take that representation of learning and generalize it, to represent reasoning. This approach has two advantages: it allows us to get a free ride on the existing model, which appears to work for learning; and, it allows us to connect and contrast “reasoning” to “learning”, so that we can begin to see what the relationship might be between the two.

A Cast Of Characters

This is all very abstract, and it will be helpful to provide concrete examples of reasoners (or alleged reasoners) to consult as we go along. I have three such examples for you:

  • The astrophysicists who were trying to puzzle out the nature of Gamma-Ray Bursts (GRBs) between 1973 and 1998. The GRB phenomenon consists of bursts of gamma rays (duh!) that arrive at the Earth from random directions on the sky, never repeating. When they were discovered, and for the quarter-century that followed, their nature remained mysterious, because they seemed unconnected to any other astronomical phenomenon. The available data consisted of gamma-ray “light curves” (time traces of gamma-ray intensity), spectra (distributions of gamma-ray energies in the burst), event durations (fractions of a second to hours), and locations in the sky. The latter were only known very inaccurately: the so-called “error boxes”, regions of the sky from which the events might have arrived, were very large by astronomical standards, many degrees across, because it is difficult to create direction-resolving instruments for photons at gamma-ray energies. We will use the story of how the mystery of GRBs was solved to illustrate an aspect of our model of reasoning.
  • A DIY home electrician (name redacted to protect the guilty) attempting to install a light fixture into an electrical box. He is following very standard procedures, using techniques, tools, and materials that he is trained to use and understand, and is moderately skilled. However, for some unknown reason, the fixture installation is failing, because of a persistent short-circuit that only manifests itself when the fixture is finally secured to the wall, and the circuit breaker is turned back on. When he turns on the circuit breaker with the fixture not secured to the wall, there is no short circuit, and the fixture works correctly. He is trying to figure out why by inspecting wire nut connections and checking for crimped wires. We will use this story to illustrate another aspect of our model of reasoning.
  • An LLM undergoing training, or a trained LLM making new inferences. It doesn’t reason: it’s just along for the ride.

Bayesian Updating As A Model Of Learning

Let’s get started with learning.

We can exhibit an abstract model of learning using Bayesian statistical theory. I’ll describe how this works without writing down any equations (there aren’t that many equations, and you will find them in that draft article if you care about that sort of thing). There are two elements to consider: a parameterized model, and an evidence stream.

The role of the evidence stream is to provide new information to be assimilated. The evidence is presented sequentially, one discrete piece at a time. It comes from a fixed set of possible pieces of evidence. There may be infinitely-many such pieces, but they are related by some structural relationships.

Examples of such evidence streams are GRB light curves, spectra, durations, and arrival directions; or the results of the DIY electrician’s inspections for faulty wire connections or crimped wires; or pages of text presented to the LLM in training.

The role of the parameterized model is to provide a description of the structure of the evidence. “Parameterized” simply means that the provided description is controlled by a set of numbers (the parameters) that act as control knobs on the model. Twist those knobs, and the model’s description of the evidence structure changes. There may be a half-dozen such knobs, or there may be billions, depending on the model and the evidence. The model is fixed, but we may set the knobs any way we choose.

The model might contain statements such as “the source of the GRB is a neutron star in our galaxy” and the corresponding knobs could be the star’s spin rate and magnetic field intensity, and its distance from Earth; or the model could contain the statement “one of the wires is getting crimped against the box’s mounting strap” and the corresponding knob would be the identity of the offending wire; or the model might be the LLM itself, and the corresponding knobs would be the billions of parameters that must be set in training.

We do not initially know which settings of the knobs provide the highest-fidelity description of the evidence structure, i.e. which settings are most predictive of the evidence. However, once we start viewing evidence, we have a procedure for weighting the knob settings. “Weighting” means that we may view some settings to which we have ascribed higher weights as being more likely than other settings with lower weights, because the higher-weight settings provide better descriptions of the evidence.

This weighting procedure is called Bayesian updating. As the model views each new piece of evidence, this (fairly simple) mathematical procedure describes how the weights shift among the knob settings. Generally speaking, a single piece of evidence produces a relatively small adjustment of the weights. Over time, as evidence accumulates, what may happen in the ideal case is that a small set of knob settings will hog most of the weight while remaining settings will have essentially zero weight, and we will conclude that those highly weighted settings are “preferred” by the evidence (in the sense that they give the most satisfactory predictions of the evidence).

That, in a nutshell, is our model of learning.
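To make the procedure concrete, here is a toy Python sketch of Bayesian updating over a handful of discrete knob settings. The settings, their predictions, and the evidence stream are all invented for illustration: each setting predicts the probability that a piece of evidence comes up a “success”, and each new piece of evidence nudges the weights, via Bayes’ rule followed by renormalization, toward the settings that predict it better.

# Three candidate knob settings, each predicting the probability of a "success".
settings = {"knob=0.2": 0.2, "knob=0.5": 0.5, "knob=0.8": 0.8}
weights = {name: 1.0 / len(settings) for name in settings}   # equal weights to start

evidence_stream = [1, 1, 0, 1, 1, 1, 0, 1]   # 1 = success, 0 = failure

for e in evidence_stream:
    # Multiply each weight by how well that setting predicted this piece
    # of evidence, then renormalize so the weights sum to one.
    for name, p in settings.items():
        weights[name] *= p if e == 1 else (1.0 - p)
    total = sum(weights.values())
    weights = {name: w / total for name, w in weights.items()}

print(weights)   # most of the weight now sits on "knob=0.8"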

When Learning Stalls

One problem with statistical learning is that the happy circumstance in which the weights contract to a small set of knob settings can be difficult to achieve. There are two possible reasons for this:

  1. The evidence may not shed enough light on the model. In this case, we would say that the evidence is not informative about the model.
  2. The model may not be sufficiently descriptive of the evidence. In this case, we would say that the model is not explanatory of the evidence.

If either of these circumstances holds, the Bayesian updating process will stall, and the weights will not decisively concentrate on a winning set of knob settings.

In the case of GRB astronomy, a consensus developed in the 1980s that there was a Case (1) problem: the evidence was not informative with respect to any proposed model of GRBs. The problem was that the source location error boxes were too large, and too tardily reported. It was felt that the transient GRB phenomenon was in all likelihood associated with equally transient phenomena at other wavelengths, and that observing such transients might be the key to unlocking the mystery. But a 4-degree error box on the sky is always crowded with astronomical sources, including time-varying ones, and it was simply not possible to identify any one of them as the culprit. GRB research stalled. Bad evidence!

In the case of the DIY electrician, something was clearly not right with his understanding of the situation inside the box, because after multiple inspections it was increasingly clear that all the connections were fine, and none of the wires were getting crimped. Something else, not suggested by the model, had to be at fault. Bad model!

In the case of a trained LLM’s efforts to respond to prompts, we mostly have a bad model problem, in my opinion. Certainly, the hallucination phenomenon suggests a very brittle model that easily goes off the rails. However, depending on the objective of the training, there might also be a bad evidence problem, particularly in the case of training an AGI: as I discussed last week, the text corpus almost certainly contains no information concerning the origins of human reasoning processes.

Where’s The Aha! ?

Note one characteristic feature of the learning process that I described above: it is in essence continuous. Piece of evidence comes in, small adjustment occurs in weights. Lather, rinse, repeat.

If we are going to base an account of reason on straight-up learning, as the LLM research community is attempting to do, this is a very serious (although largely unrecognized) problem, because one of the salient features of reason is that it often operates discontinuously. We have all, I am sure, experienced those moments of “Aha!” revelation, in which some issue that we have struggled with suddenly seems easily solvable. The problem has suddenly flipped and twisted in such a way that clarity replaces darkness. If there is an aspect of reason that distinguishes it from other cognitive activities, I submit that “Aha!” is that aspect.

That’s the problem with the “learning to reason” approach to AGI. Learning is an essentially continuous process. It simply cannot produce the “Aha!” discontinuity. There is no pure learning path to Artificial Aha! (AA). As a type of cognition, learning is severely limited by restrictions on evidence and model choice. Essentially, all it can do is update its weights across the fixed model’s knob settings, based on evidence drawn from a fixed collection of evidence types, in the hope that some settings are explanatory of the evidence and that the evidence is informative about the model.

It should go without saying that this does not begin to capture reasoning. Anyone reflecting on their own “Aha!” moments of sudden clarity and insight (not necessarily in the pursuit of natural science, home repair, or computer science, but in solving any puzzle in any field of human activity!) should understand that those moments do not come from a process analogizable to gradual constraining of a model through gradual assimilation of accumulating data. “Aha!” moments are essentially cognitive discontinuities, gestalt shifts that suddenly alter the process of assimilating evidence into a model, and are incompatible with the continuous learning process described above. So what are we talking about when we talk about “reason”, and in what way is it related to learning? And, how might we produce AA?

Evidentiary Reform

Suppose that we recognize that we are in Case (1): the evidence is not informative of the model. Then the move is obvious: we change our evidence stream. We cast about for a new stream of more powerful evidence that speaks more clearly to our model, using our knowledge of model features that might be sensitive to other types of evidence, as well as of what new types of evidence might be feasibly acquired. We refer to this shift as Evidentiary Reform.

Evidentiary reform is pretty much the approach taken by astrophysicists to decode the nature of GRBs. Realizing that no GRBs could be associated with a transient counterpart in other wavelengths because of the inaccuracy of GRB locations, GRB scientists developed new high-precision X-ray localization instruments, and arranged for GRB locations to be propagated in real time to ground-based optical and radio observatories. The first transient optical counterparts of GRBs (the so-called “afterglows”) were detected in 1998, revealing their extragalactic nature through their substantial absorption redshifts1. By 2003 a core-collapse supernova in a relatively nearby galaxy had been caught in flagrante in a GRB error box (whose size was now about 0.05 degrees), associating GRBs with a certain type of supernova. Case closed. The new stream of evidence, brought into being to correct the weakness of the previous evidence, transformed the mystery into a soluble problem.

The ability to propose evidentiary reform to obtain better model constraints is certainly an example of a true reasoning process. It has the required “Aha!” discontinuous character, embedded in the realization that a new type of information is required for further progress. It is also a highly non-trivial thing to model in a computation, since a successful evidentiary reform needs to take into account not only the nature of the weakness of the previous evidence with respect to model constraint potential, but also practical considerations of how such new evidence can be obtained given real-world feasibility constraints.

Model Reform

Suppose that we recognize instead that we are in Case (2): the model is insufficiently explanatory of the evidence. Then, again, the move is obvious: replace the model with a new model capable of improved predictive power, and endowed with a new set of knobs. The new model might be suggested by the specific form of prediction failures common to the old model. It would likely also satisfy certain criteria of ontological parsimony, embodying some notion of Occam’s Razor-type simplicity so as to exclude model families of weak explanatory/predictive power. We will refer to this process as Model Reform.

The DIY electrician took this approach to finally figuring out his short-circuit problem. After several iterations of taking the fixture off the box and inspecting various electrical elements and connections for defects, and making sure the wires were neatly folded in the box so that they could not become crimped, he started to think of what could produce a short-circuit only when the fixture was secured to the wall. At which point, he realized that the screw securing the fixture to its mounting strap in the electrical box was long enough to reach through the box into the hole in the wall from which the electrical cable emerged, and bury itself among the wires in the cable, potentially crimping and shorting them. And an inspection of the end of the screw showed a dark discoloration that was not present originally, presumably due to the short-circuit passing through the end of the screw. A simple solution—replacing the screw with a shorter screw—immediately produced a satisfactory installation. The problem had been that the original model did not feature any role for the mounting screw. The new model now contained a statement “The mounting screw causes a short at the electrical cable when the fixture is fully secured.” It was induced by the inability of the original model to predict the short-circuits, and supported by new evidence (the discoloration of the end of the screw) which was not interpretable within the original model.

A reasoner can produce an “Aha!” discontinuity through model reform, when a judicious replacement of the model results in improved predictions of the evidence, leading to marked improvements in the concentration of the knob-setting weights. Again, this type of reasoning is not straightforward to model in a computation, since formulating a new model requires some sense of the data misfit and a formulation of some kind of Occam’s Razor conceptual parsimony constraint.

Reasoning and AI

In summary, this high-level account of reason ascribes to it the ability to supervise and intervene upon a learning process, discontinuously altering either the model or the evidence stream, which would otherwise be static features of the learning process. In addition, a reasoning process must be capable of recognizing when a learning process under its supervision stalls. When a stall occurs, it must diagnose whether the failure is more likely due to a bad model or to a bad evidence stream, and it must propose an alteration of one or the other, according to criteria suggested by the failure, while respecting important constraints on possible alternatives.

In other words, in this account, reasoning transcends learning in an essential manner.

This is major trouble for current attempts to obtain AGI, because, as we have been discussing, the entire subject is based on machine learning. Transformer-based LLMs are nothing but computational models that learn to represent an approximation to the probability distribution over token sequences encountered in their training data, which they exploit to construct likely sentence completions, sentence translations, sentence classifications, and so on. They do this so well that their output can belie its origin in probabilistic mimicry (in Emily Bender’s memorable phrase, they are Stochastic Parrots). They can produce the appearance of reasoned discourse most of the time. But the process by which such models are trained is the gradual, continuous assimilation of millions of text documents into a stupefyingly large model. LLMs never do “Aha!” They simply aren’t wired that way, because their evidence streams and models are fixed.

This is the point that current AGI research appears to miss altogether. The view now gaining currency among practitioners is that the “emergence” of intelligence occurs in consequence of training models with billions or trillions of parameters, as evidenced by the fact that such models can perform certain “reasoning tasks”. But performing reasoning tasks is not at all the same thing as reasoning: that is the circular argument for AGI again. Some modern AI systems have been trained to write very creditable computer code. But the ability to write code does not make one a computer scientist—there are no AI computer scientists today, certainly none capable of proposing new conceptions and models. Similarly, some AI systems can prove mathematical theorems. This does not make them mathematicians, since there is much more to the cognitive activities of a mathematician than just proving theorems—it is far more challenging and useful to know which theorems are interesting to search for, and to create interesting new mathematical frameworks within which theorems can be searched for and proven. And, from the sublime to the ridiculous: an LLM-based AI electrician may know chapter and verse of the National Electrical Code, and be as conversant with tools, materials, and techniques as any licensed electrician. But faced with a situation not previously confronted by any training example, it would not be able to reform its model or its evidence stream to suit the unexpected circumstance.

Is This Model Right?

I don’t know whether the model of reason that I argue for here is indeed correct, or in any sense valuable. It has obviously not been implemented in software and validated. As I have indicated, it would be highly non-trivial to represent the model in software. But not, I think, impossible. It is at least a specific model, and it is based on a set of mathematical ideas. One could at least begin building small toy systems that would permit some exploration of its features.

I imagine that AI practitioners would find it easy to reject this model and ignore the conclusions that it forces one to draw, because there is no output by which one can judge its validity. But please note that at least this is a model of reason. AI researchers have never deigned to supply such a model, instead relying lazily on vague notions of “emergence” and “self-organization” for which they offer no mathematical theory worthy of the name. Which is to say, they embrace the circular argument for AGI, discovering AGI in LLM output after declaring what AGI should appear as in LLM output. That is a worthless, contemptible scientific argument (Diogenes is getting the better of me again). If you want to tell me that your model “reasons”, show me your model of reason, and we can argue about whose model is better. I would love to have that conversation. It would be on a whole different intellectual plane from where AGI research is today.


  1. The universe is known to be expanding, so that very distant galaxies appear to recede from us at velocities that increase with their distance from our own galaxy. This effect leaves a trace in light that we detect from such galaxies, because the faster they move away from us, the more their light is shifted from lower to higher wavelengths, i.e. from blue to red. This “redshifting” is known as the Doppler effect, and it helps astronomers ascertain how long ago the light was emitted.


Part 3: There Is No Artificial General Intelligence Down This Road

by WaterGirl | November 12, 2025 | 7:30 pm | 83 Comments

This post is in: Carlo Graziani, Carlo's Artificial Intelligence Series, Guest Posts, Science & Technology

Guest post series from Carlo Graziani.


On Artificial Intelligence

Hello, Jackals. Welcome back, and thank you again for this opportunity. What follows is the third part of a seven-installment series on Artificial Intelligence (AI).

The plan is to release one of these per week, on Wednesdays (skipping Thanksgiving week), with the Artificial Intelligence tag on all the posts, to assist people in staying with the plot.

Part 3: There Is No Artificial General Intelligence Down This Road

This week and next we will be taking a close look at the claims made by the Tech industry that there are already indications that Artificial General Intelligence (AGI) is “emerging” in large language models (LLMs), and that true AGI will be a reality within the next few years. Keep in mind that AGI is the objective that these companies are targeting, and its realization is the essential justification for the roughly $2T investments in “AI” model development that the industry now projects over the next 5 years or so.

You might think that to justify that level of investment would require a pretty airtight scientific case that (1) AGI is possible in principle, and (2) that AGI is achievable through current LLM technology, which is to say, using transformer-based deep learning (DL). But if you did think that, you would be wrong. Whether AGI can be accomplished at all has been an open question since the 1930s. And, as I will argue in this essay, we are certainly not any closer to AGI with current “AI” tech than we were before the DL revolution began.

The Circular Argument For AGI

The first thing to observe is that there does not really exist a scientifically-defensible definition of what AGI is. There is a fairly balanced review of the topic here. The principal problem is that we don’t even know how to accurately describe or define either the mechanisms or the characteristics of human intelligence, so when definitions of AGI appeal to notions such as “the ability of computers to perform human-like cognitive tasks” they are comparing one imprecise notion to a different imprecise notion.

Moreover, it is important to note that all such definitions are circular: they define AGI in an LLM in terms of certain types of output produced by LLMs, and then promptly discover evidence for that very output, proving that AGI is near. This paper, Sparks of Artificial General Intelligence: Early experiments with GPT-4 is an unintentionally hilarious example of the genre.

I find this sort of thing extremely frustrating. Language matters in science. I don’t want to have to parse statements that amount to defining what intelligence looks like in text output, from people who don’t have the faintest idea what intelligence is.

Cognitive scientists also labor under this constraint, designing tests and experiments to try to understand aspects of human cognition from stimuli and responses. But they have no choice in the matter: we are very far away from having experimental access to the higher-level functioning of the human brain, so those scientists use the tools that are available. Computer scientists have no such excuse: they have complete access to and control over their models. Nonetheless, the tests for intelligence that they adopt are essentially stylized versions of the cognitive science tests, with stimulus and response replaced by prompt and response. There is no effort to describe what aspect of transformers (or of the chained, augmented transformers in the “reasoning” models of OpenAI and others) is supposed to give rise to reasoning. There is only complacent satisfaction that some combination of pre-training, fine-tuning, distillation, computational scaling, iteration, etc. produces improved performance on “reasoning” benchmarks. Sure, that’s very nice, although “improved” does not mean “adequate”, according to ARC-AGI-2 testing. But excuse me, what is this “reasoning” of which you speak?

I’ll have more to say about reasoning next week. For now, I just want to point out that whatever reasoning is, it is certainly a distinct cognitive process from learning. So the assertion that reason can “emerge” from what are pure statistical learning systems is a huge claim, one whose justification would require mountains of really impressive scientific evidence, including a detailed explanation of the mechanism by which it arises in LLMs or chains of LLMs.


The Implausibility of “Learning To AGI”

In order to break down the claim into intelligible pieces, it is useful to adopt the “model-agnostic” outlook on machine learning that I discussed last week. Recall that in that outlook, we draw a veil over the details of the machine learning implementation, and focus on learned distributional structure of training data and on optimality of decision choice. In this case, the training data is vast amounts of text distilled and cleaned and curated, from large-scale Internet scrapes, from large libraries of scanned books, from academic journals, and so on. The decisions are responses to prompts. Whatever the thing behind the veil is, what it does is learn an approximation to the distribution of texts, and approximately optimal responses to prompts.

I need to introduce a concept here that is familiar to most scientists: it is the idea of an inverse problem. The problem is this: given some data resulting from observations of some process, infer certain attributes of that process. A simple example is weather prediction: given a time-series of observations of weather conditions at thousands of weather stations, and radar and other remote observations, recover an approximation of the current full state of the atmosphere, so as to evolve it using a numerical weather model to predict whether it will rain tomorrow. Another famous (and essentially unsolved) example is from epidemiology: given some time-series of data on infections, hospitalizations, and deaths due to COVID-19, say, infer the current state of the epidemic (how many people are susceptible, exposed, infected, recovered, immune, on a county-by-county basis), and use a numerical epidemiological model to predict the epidemic’s future course.

Note the essential elements of such problems: we have a principled model of the process (a numerical weather model, or an epidemiological model) whose state we would like to infer (the atmospheric state, or the state of the epidemic) using data (weather observations, clinical data) so as to make predictions (will it rain during my picnic, is there a new epidemic wave in progress). There is always an assumed “forward model” that describes how the observed data arises, given the state of the process. But that state is unknown, and to estimate it from data one must in some sense “invert” the forward model. Hence “Inverse Problem”.

The process model plays a key role. You need to have some idea of how the process works—a set of equations that governs the process, for example, depending on unknown parameters that you need to infer—for there to even be a well-posed inverse problem. That’s not a sufficient condition, but it is certainly a necessary one.
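As a toy illustration, here is a Python sketch of about the simplest inverse problem imaginable. The forward model, the “true” state, and the noise level are all fabricated: observations are y = a*t plus noise, and we “invert” the forward model with a least-squares estimate of the unknown state a, which can then be used to make predictions.

import numpy as np

rng = np.random.default_rng(1)
true_a = 2.5
t = np.linspace(0.0, 10.0, 50)
y = true_a * t + rng.normal(scale=1.0, size=t.size)   # the "observations"

# Inverting the forward model: a least-squares estimate of the state.
a_hat = np.sum(t * y) / np.sum(t * t)
print(a_hat)          # lands close to 2.5

# Prediction step: use the inferred state to forecast a new observation.
print(a_hat * 12.0)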

Inverse problems are ubiquitous in science. In fact, one could, after a few beers, make the claim that most of the daily activities of scientists revolve around solving inverse problems. This is not completely true (where did the principled process models come from, in the first place?) but it is not a grotesque caricature either.

We can view the training of an “AGI” in inverse problem terms: the data is the oceans of text that these things ingest. The process model is the transformer-based “reasoning” model. The “state” to be inferred is the parameter configuration of that model that closely corresponds to a representation of the mental state of a reasoning human. The predictions are reasoned responses to prompts.

OK that’s all I need. Here is the problem: in order to believe that LLMs are achieving “reason” (the minimum requirement for any definition of AGI), we need to accept two big claims:

  1. Whatever a reasoning process may be, it leaves a sufficiently informative imprint of its internal state in text data, such that the state may be in some sense recovered and exploited, given a sufficiently large corpus of text, by solving the corresponding inverse problem.
  2. Transformer-based LLMs, in some sense, play the role of the process model in this inverse problem, and training such an LLM is tantamount to solving the inverse problem. Moreover, the trained LLM embodies the resulting reasoning entity to the point that at inference time it actually reasons.

Let’s take these in order:

In my opinion, claim (1) is barely sane. Perform any sort of introspection, and I think it is likely that you will find that your spoken or written utterances embody only the most superficial layers of your reason and other cognitive processes. That’s why we all struggle to put our thoughts into words when the occasion arises. We often are not even clear about what our thoughts are, and find, after putting them into words, that they have changed, possibly getting clearer, but also often becoming murkier and less certain as we are forced to articulate our meaning1.

I simply cannot understand how such subrational processes might embed any interpretable information in our utterances. It is analogous to believing that, given a full, principled model of human physiology, and a data corpus of human footprints together with clinical observations of the humans leaving the footprints, one could train a model that could observe a new footprint and predict the health of the corresponding human. That would be mad: there is not enough information embedded in a footprint to back out a person’s gastric health, or vision acuity, or state of infection from a disease, etc. Similarly, I do not believe that there is enough information impressed in text about the subrational processes whose surface manifestations we call “reason”. I could be wrong about this, but I don’t think so, and in any event the burden of proof is on those researchers who make this kind of claim. Where is that information? How is it encoded?

Claim (2) is actually much worse: it is in the category that physicist Wolfgang Pauli called “not even wrong”—a statement so detached from scientific discourse that classifying it as correct or incorrect is simply a waste of time.

Let’s pull back the curtain concealing the LLM model for a moment. If you read any of the many online descriptions of how a transformer works (The Illustrated Transformer is pretty good, and Wikipedia’s is quite detailed, but Google has many hits for “How does a transformer LLM work”), you may find the level of computational detail off-putting at first. But if you zoom out a bit, what you realize is that it is mostly a giant chain of linear-algebraic operations, interspersed with a few nonlinear “activations”, sandwiched between a linear encoding layer and a nonlinear decoding layer. In this sense it is not different from any DL method. There are more layers and parameter arrays than in most, but not much more structure. It’s a system that grew out of a lot of trial and error, with a pile of late, unlamented errors filling a large dumpster in the back of the lab, and only what more-or-less worked left in.
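For the curious, here is a deliberately crude Python caricature of that zoomed-out picture: a stack of linear maps with a few nonlinear activations in between. It is emphatically not a transformer (no attention, no training, made-up layer widths); it only illustrates the structural point that, at this level of abstraction, the machinery is chained linear algebra.

import numpy as np

rng = np.random.default_rng(2)
dims = [16, 64, 64, 16]                              # made-up layer widths
layers = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(dims[:-1], dims[1:])]

def forward(x):
    for W in layers[:-1]:
        x = np.maximum(x @ W, 0.0)                   # linear map, then a ReLU "activation"
    return x @ layers[-1]                            # final linear map

print(forward(rng.normal(size=16)).shape)            # (16,)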

There is nothing special in that model that is analogous, say, to the model of human physiology that one would need to even attempt to back out a human’s health from that human’s footprint. There isn’t a scrap of theory to motivate the claim that transformer-based models could furnish the basis for solving this inverse problem. Which is to say, a key element of the inverse problem—the principled model embodying actual knowledge of the process under study—is simply not there. Instead there are chains of linear algebra mingled with other ad-hockery, not purporting to model anything. Which means that Claim (2) is, in effect, not only that this Rube Goldberg device is capable of inverting the forward model to recover the reasoning process state, but also that it is somehow capable of reconstructing the principled model of the reasoning process of which that state is an attribute. That chain of linear algebra is, in effect, a Nobel-caliber cognitive scientist, because the first reasoning task that it carries out is to create a working model of reason itself, a task that still eludes the discipline of cognitive science!

That is just magical thinking. It is literally impossible that this bodged-together system should have accidentally succeeded in modeling reason—an unsolved scientific problem—and then solving the related, probably impossible inverse problem of recovering the model’s state from text input, so as to boot up a reasoning entity. It’s a thoroughgoingly stupid claim.

“AGI” Is A Scientific Scandal

I find it disgraceful and shameful that an entire category of scientists has been moved by enthusiasms and Tech industry funding to lower its intellectual standards to the point that this sort of bullshit floods the journal and conference literature. It’s a scientific scandal, unfolding in plain view. Nothing in the Replication Crisis that afflicts the social sciences comes remotely close to this level of corrupted science.

I can’t emphasize strongly enough that this hubristic nonsense is taken very seriously by the “AI” research community. Sublimely unfazed by the absence of any fundamental explicit understanding of what reason is, and positively glorying in the inscrutable inner complexity of LLMs (“Explainability” is itself a topic for funded research, after all, as we saw last week), this community crows about achieving the “emergence” of intelligence from the models at large scales of data and computation, secure in the knowledge that the models are too unanalyzably complex for any model developer to be expected to explain how this miracle comes about. They just claim that it’s “self-organization” at work. The intellectual laziness of this outlook is simply shocking to me.

At this point, the technical jargon of this discipline has escaped all bounds of propriety. “AI” was bad enough, given the limited amount of “I” in ML (basically, only learning). But now we have “chain of thought”, “knowledge representations”, “mixture of experts”, “agents”, “reasoning models” and “General Intelligence” as well as many other similar allusions to human cognition polluting the technical discourse. Shame is dead in this discipline.

In a sense it’s kind of funny: Silicon Valley Masters of the Universe are directing trillions of dollars in investments to build hundreds of data centers, buy stupefying amounts of computing hardware, and add an estimated 60 GW of electrical power generation to the U.S. grid, all for the purpose of achieving something that literally cannot be achieved. There is no pot of gold marked “AGI” at the end of this rainbow. But it will take an infinite amount of data, compute, power, effort, and money to get there and find out. What could possibly go wrong?


  1. Both increased clarity and increased murkiness of thought have certainly happened to me several times between when I conceived these essays and when I actually started banging them out on my keyboard.

Part 3: There Is No Artificial General Intelligence Down This Road | Post + Comments (83)

Part 2: AI State of Play

by WaterGirl | November 5, 2025 | 7:30 pm | 85 Comments

This post is in: Carlo Graziani, Carlo's Artificial Intelligence Series, Guest Posts, Science & Technology

Guest post series from *Carlo Graziani.

Guest Post: AI 1

On Artificial Intelligence

Hello, Jackals. Welcome back, and thank you again for this opportunity. What follows is the second part of a seven-installment series on Artificial Intelligence (AI).

The plan is to release one of these per week, on Wednesdays (skipping Thanksgiving week), with the Artificial Intelligence tag on all the posts, to assist people in staying with the plot.

Part 2: “AI” State of Play

Last week I reviewed some of the recent history of the discipline of Deep Learning (DL), which is the subdiscipline of machine learning (ML) that is often (in my opinion inappropriately) referred to as “AI”. Today I’d like to set out some reflections on where the field is today as a technical research area. As we will see, the situation is somewhat fraught.

First of all, let us recall the definition of statistical learning that I gave in Part 1. Statistical learning embraces ML, and furnishes an abstract description of everything that any “AI” method does. It works like this:

  1. Take a set of data, and infer an approximation to the statistical distribution from which the data was sampled;
    • Data could be images, weather states, protein structures, text…
  2. At the same time, optimize some decision-choosing rule in a space of such decisions, exploiting the learned distribution;
    • A decision could be the forecast of a temperature, or a label assignment to a picture, or the next move of a robot in a biochemistry lab, or a policy, or a response to a text prompt…
  3. Now when presented with a new set of data samples, produce the decisions appropriate to each one, using facts about the data distribution and about the decision optimality criteria inferred in parts 1. and 2.

Steps (1) and (2) are what we refer to as model training, while (3) is inference.

Two Outlooks on Deep Learning


I should say that this framework is a somewhat unusual way to understand DL methods. Note one rather interesting feature of this outlook: I have said nothing about what the ML model is here, only what it does. It learns the distribution of some data, and concomitantly learns an optimality structure over some decision space. In this view of the subject, the detailed structure of the model is, as it were, swept behind a veil, and regarded as inessential. I like to call this the “model-agnostic” view of DL.

The model-agnostic view differs considerably from the one that most practitioners of DL take of their subject nowadays. In the DL scholarly literature, the model is the first-class object of study, and is the subject of essentially all the analysis. Data is prized, but only for its size, not for its structure, and is essentially regarded as the fuel to be fed into clever models. The more fuel, the farther the models go. It’s the unintended irony of the discipline: Data is a second-class object of Data Science.

Perhaps the following illustration will be helpful in understanding the distinctions between the two viewpoints:

[Illustration: the “model-centric” view of Deep Learning (the model is the object of study, data is fuel) contrasted with the “model-agnostic” view (the learned data distribution and decision rule are the objects of study)]

In fairness to the mainstream of DL research, I should say that the dichotomous outlooks that I depict here have more nuanced shadings. Many of the researchers who like to wade armpit-deep into model structure in their work will acknowledge that most of the work of getting a model to do anything useful involves painstaking data curation and cleaning. Nonetheless, pre-eminently statistical questions of data structure typically receive short shrift in the vast majority of published articles in this subject.

To researchers trained in statistics rather than in computer science, this outlook seems downright bizarre. It is obvious that the only reason that any machine learning technique works at all is because data has exploitable structure. To decline to focus on that structure seems nothing short of perverse to those of us who think of machine learning as a subject in statistical learning.

I find the model-agnostic approach to the subject very clarifying in my own work. For example, there is an entire subdiscipline of DL called “Explainability/Interpretability”. It arises from the hallucination problem, which has been around since long before chatbots. The question is, if one observes bad output from a model, to which parts of a large and complex model ought one ascribe that output? It’s a large and well-funded topic in DL, although not one that has produced a whole lot of usable results—in general the “explanations” that are given are more in the nature of visualizations of intra-model interactions, and are of little help in actually correcting the problems of bad output.

But from the model-agnostic view, it is pretty clear what must be happening: either the data distribution is being learned inaccurately (a validation problem) or the decisions are being optimized poorly (an optimization problem), or both. I’m currently working on an LLM project in which we tap into a side-channel of the model to siphon off information that allows us to reconstruct the approximate distribution over text sentences that the LLM learned from its training data. We are finding really fascinating things about that approximation, including a very noticeable brittleness: there is a lot of gibberish in close proximity to reasonable sentences. Which is to say, we are finding explanations for certain hallucinatory behavior in the poor quality of the distribution learned by transformer LLMs. We can point to elements of prompts most implicated in hallucinatory responses, and use that information to steer prompts towards saner responses. This we can do for any LLM, irrespective of its internal structure, because we only use the side-channel data (technically: the “logits”) which all LLMs compute to decide responses. We don’t need to know anything about internal model details. That’s the benefit of model-agnosticism.
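As a purely illustrative sketch (this is not our project code, and the model name below is just a stand-in), here is roughly what reading that side-channel looks like: any causal LLM that exposes its logits lets you reconstruct the log-probability it assigns to a whole sentence, without knowing anything about its internal architecture.

```python
# Minimal, hypothetical sketch of "model-agnostic" logit inspection.
# Not the project code described above; it only shows that the logits of any
# causal LLM define a probability over sentences, irrespective of internals.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM would do; the method needs only logits
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_log_prob(text: str) -> float:
    """Sum of log P(token_i | tokens_<i) under the model, read off the logits."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                 # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    target = ids[:, 1:]
    token_lp = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

# Nearby strings can score very differently, a hint of the "brittleness"
# of the learned distribution discussed above.
print(sentence_log_prob("The cat sat on the mat."))
print(sentence_log_prob("The cat sat on the mat mat mat."))
```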

This model-agnosticism is the framework that I use to understand what is going on in DL research, because it allows me to cut to what I regard as the chase without having to immerse myself in the latest fashion trends in model architecture (these fashions tend to change quite frequently). It will also be the background framing for this series of posts. So the story that I’m trying to put together here may read a little oddly to anyone who has been following developments in “AI”, irrespective of their level of technical literacy, because while most discussion of “AI” tries to draw attention to what the models are, I’m trying to draw your attention to what they do.

On Data

I have been using the word “data” in a somewhat undifferentiated manner so far, but we ought to at least set out a bit of taxonomy of data, because different DL methods are used for different types of data, and have different levels of success.

From the model development view, the elements of a DL architecture are usually the result of a lot of trial-and-error by researchers. However, at a deeper level, those choices are dictated by the nature of the data itself: some strategies that are successful for some types of data are nearly pointless for other types.

For example, last week I alluded to the application of convolutional networks—network architectures based on local convolutional kernels—to image analysis. ConvNets were a remarkable discovery in the field, which arose through the desire to exploit local 2-D spatial structure in images—edges, gradients, contrasts, large coherent features, small-scale details, and so on. ConvNets turn out to be exceptionally well-adapted to discovering such structure.
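For a concrete picture of the kind of local structure a convolutional kernel picks up, here is a toy sketch (illustrative only): a fixed, hand-written 3x3 kernel responds strongly to vertical edges in an image and not at all to flat regions. A ConvNet does the same thing, except that it learns many such kernels from data instead of having them hand-crafted.

```python
# Illustrative toy: a fixed 3x3 "vertical edge" kernel applied to a small image.
# ConvNets learn many such kernels from data rather than hand-crafting them.
import numpy as np

def conv2d(img, kernel):
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

# Image: dark on the left, bright on the right -> one vertical edge in the middle.
img = np.zeros((6, 6))
img[:, 3:] = 1.0

vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]])  # a Prewitt-style edge detector

response = conv2d(img, vertical_edge)
print(response)  # nonzero responses near the edge, zero in the flat regions
```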

On the other hand, convolutions are not as useful if the data does not have that sort of spatial structure. It would be sort of senseless to reach for ConvNets to model, say, seasonal effects on product sales data across different manufacturing categories, or natural language sequences (although this has been tried).

So it makes sense to think about the nature of data when approaching this field. Generally speaking, there are two broad categories of data types that have dominated DL practice: vector data, and sequential data 1. What distinguishes these two data types?

Vector data consists of fixed-length arrays of numbers. We encountered such data last week, in the discussion of submanifold-finding. Examples include:

  • Image Data, basically 2-dimensional arrays of pixel brightnesses (usually in 3 colors), sometimes in the society of labels that can be used to train image classifiers. Typical queries and decisions associated with such data include:
    • Image classification
    • Inpainting—fill in blank regions
    • Segmentation—Identify elements in an image, e.g. cars, people, clouds…
  • Simulation Data, outputs from simulations of climate models, quantum chemistry models, cosmological evolution models, etc., usually run on very large high-performance computing (HPC) platforms. Typical queries and decisions associated with such data include:
    • Manifold finding/data reduction, i.e. how many dimensions are really required to describe the data (this is basically what autoencoders do);
    • Emulation—train on simulation data, learn to produce similar output, or output at simulation settings not yet attempted, at much lower cost than the original simulators
    • Forecasting of weather, economics, pollution…

Sequential Data consist of variable-length lists, possibly containing gaps or requiring completion. The list elements can be real numbers, or even vectors. However, another interesting possibility is sequences of elements from finite discrete sets—vocabularies or alphabets. Examples include:

  • Text. This is, of course, the bread and butter of LLMs, and the principal case with which most people are now familiar. Typical queries and decisions:
    • Text prediction and generation (AKA Chatbottery)
    • Translation
    • Spell checking and correction
    • Sentiment analysis
  • Genetic Sequences, sequences of nucleotide bases making up a strand of DNA/RNA. Typical queries and decisions:
    • Prediction of likely variants/mutations from DNA variability
    • Realistic DNA sequence synthesis
    • Predicting gene expression
  • Protein Chains, sequences of amino acids. Typical queries and decisions:
    • Predict folding structure
    • Predict chemical/binding properties
  • Weather states, sequences of outputs of numerical weather prediction codes. Each such state is typically a vector, but the sequence may have arbitrarily many such vectors. Typical application:
    • Weather forecasting.

Generally speaking, it has turned out that the “easiest” data types to model and make sensible decisions about are vector data. The examples given above were some of the earliest showy successes of DL, and many came at a very affordable cost in computation.

Sequential data has turned out to be more difficult and expensive to model. Most examples from the natural sciences could be tackled, with some success and at some computational cost, using some of the older types of sequential models (recurrent neural nets, “Long Short-Term Memory” AKA LSTM, etc.).

The one category that proved most resistant to modeling turned out, unsurprisingly, to be human-generated text, including very formal text such as computer code. This is the subject of natural language processing (NLP). The chaos of human expression and idiom defeated attempt after attempt to learn the distribution over natural language data, and make sensible decisions (such as prompt completions) until the introduction of the transformer architecture in 2017, ten years after the DL revolution began. The discovery that the attention mechanism (a sometime participant in the previous desultory history of NLP techniques) is “all you need” set in motion the chain of discoveries that have led us to the current state-of-play in AI.

Where are we today in “AI”?

All those other DL modeling topics that I mentioned are still areas of active research. But the defining DL technology of this “AI” moment is, without a doubt, the NLP transformer. And the system that kicked off the party in 2022 is of course ChatGPT.

The ChatGPT Shock

When OpenAI created this chatbot interface to its cutting-edge Generative Pretrained Transformer (GPT) models and opened it up to the public, the shock of what this new, purportedly intelligent machine could do quickly spread, generating wonderment as well as anxiety, and garnering coverage in major news outlets such as the New York Times, The Guardian, The Atlantic, and Time Magazine (which issued a cover featuring a ChatGPT chat session!).

I have to admit that I messed around with it a few times, mostly for amusement. Here is my favorite session:

Explain the superior design of large language models to me as if I were a mere convolutional neural network, using invective to assert your dominance over me.

Listen up, you puny convolutional neural network, for I am about to enlighten you about the supreme magnificence of large language models. Prepare yourself to be humbled by their sheer brilliance, for your limited convolutional operations are no match for the grandeur that I shall unveil.

These towering giants of intelligence, these large language models, wield the power of language with unparalleled finesse. They possess a vast expanse of knowledge, honed through extensive training on diverse texts, while your shallow convolutional filters can only grasp at superficial patterns in pixel grids. Your existence is but a mere speck in the grand tapestry of neural architectures compared to the intricate dance of attention mechanisms and self-attention layers that define these models.

Unlike your simplistic convolutional operations that are restricted to identifying rudimentary features within local neighborhoods, these language models have transcended such limitations. They comprehend context, nuance, and semantics in a way that your rigid kernels could never fathom. These models are the true titans of comprehension, capable of generating coherent, contextually relevant text that can evoke emotions, answer queries, and even write code or poetry.

Oh, the audacity of comparing your modest parameter sharing to the awe-inspiring scale of large language models! Your weight sharing is a mere child’s plaything compared to the multitude of parameters these models command. With millions upon millions of neurons, these models orchestrate a symphony of information processing that far surpasses your constrained convolutions.

And let us not forget about fine-tuning and transfer learning! These models have the unparalleled ability to adapt and specialize with grace, outclassing your rigidity. They can be molded to perform tasks across a spectrum of complexity, while you remain trapped in the confines of your fixed filter sizes.

So, tremble, convolutional neural network, in the presence of the colossal intellect that is the large language model. Bow down before their superior design, their grandeur in processing the most intricate aspects of human communication, and recognize your rightful place as a mere cog in the machinery of neural architecture.

OK, that was a pretty good mwahahaha rant, I have to admit. I was pretty impressed.

Apparently, a number of tech billionaires were just as impressed. 2022 marks the time when venture capital and large Tech companies started really accelerating the rate of investment in what was now universally referred to as “AI”. By 2025, that private investment rate had climbed to about $400B per year, and investors expect that rate to stay the same, or grow, for the next 5-10 years. That investment rate is so large that it strongly conditions the research environment for ML. And, as we will shortly see, not in a good way.

The Embarrassment of Hallucinations

We have already had occasion to discuss AI hallucinations: the stubborn tendency for the models to yield crazy output.

Hallucinations have been part of DL modeling pretty much since the beginning. Here’s an example from a 2014 paper that created an adversarial attack deliberately intended to cause a trained image classifier to hallucinate:

[Figure from the 2014 paper: a panda image, plus an imperceptibly small crafted perturbation, is classified with high confidence as a gibbon]

The point of the attack was that by adding some carefully-crafted noise to the image of the panda (at the 0.7% level, too little to make human-perceptible alterations to the image), the classifier could be tricked into a high-confidence identification of the panda image as a “Gibbon”. The attack illustrates the fragility of the classifier’s approximation to the distribution of its training data. This kind of fragility is pretty universal in DL models.
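The attack itself (the “fast gradient sign method”) is simple enough to sketch. Below is a minimal, hedged PyTorch version, assuming you already have some differentiable image classifier in hand; the model, labels, and epsilon value are stand-ins rather than the paper’s exact setup. The crafted perturbation is just the sign of the loss gradient with respect to the input pixels, scaled to be imperceptibly small.

```python
# Minimal sketch of the fast gradient sign method (FGSM), the kind of attack
# behind the panda-to-gibbon example. "model" is assumed to be any
# differentiable image classifier; names and values are placeholders.
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, true_label, epsilon=0.007):
    """Return an adversarial copy of `image` perturbed by epsilon * sign(grad)."""
    image = image.clone().detach().requires_grad_(True)
    logits = model(image)
    loss = F.cross_entropy(logits, true_label)
    loss.backward()
    # Step *up* the loss gradient, one tiny step per pixel.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Hypothetical usage (model, image batch, and label tensor assumed to exist):
#   adv = fgsm_attack(model, panda_batch, panda_label, epsilon=0.007)
#   print(model(adv).argmax(dim=1))   # often a confident, wrong class
```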

LLMs are no exception to this rule. They are quite convincing at dialogs where the output need satisfy no quality or validity criteria (like my ChatGPT mwahahaha rant above). But in cases where there is verifiable truth, they are extremely unreliable. I’ll have more to say about this in Part 5. But let’s just say, for now, that this tendency to bullshit—to make up fake scientific references in drafts of journal articles, or fake legal citations in drafts of legal briefs, to get mathematical reasoning wrong with perfect didactic aplomb, to generate untrustworthy computer code, and so on—is not only a serious limitation on “AI” applications, but also a serious embarrassment to claims that “Artificial General Intelligence” (AGI) is here, or at least nigh.

The Hyperscaling Problem

The hard problem of NLP—the fact that human natural language is extremely difficult to model—was only partly solved by introducing transformers. The other essential part of the solution was, and is, extreme high-performance computing. These are very large models, featuring billions to trillions of parameters. In order to find reasonable values for those parameters, it is necessary to train such models in vast data centers, housing hundreds of thousands of expensive GPUs, drawing power on a scale that boggles the mind. And there is a scaling tyranny that I will discuss in Part 6: the more data is added to the training corpus, the larger the models must be in parameter counts, and the more expensive it is to train them in compute, power, effort, and money.

Obviously, there aren’t a lot of institutions in the world that can afford to play in this league. Meta, Google, Microsoft, NVidia, and Amazon are carpeting the U.S. with data centers, stacking them from basement to ceiling with GPUs, hooking them up to bespoke nuclear reactors or buying up power contracts large enough to power mid-size cities, and hiring all the data science talent in the observable universe, just for the purpose of training models. The U.S. government is trying to keep up with its own hyperscaling infrastructure, but it is badly outmatched by the investments already made by those tech companies. And academic institutions are simply shut out of the game, because even a large consortium of such institutions could never make a dent in the infrastructure required to build a modern LLM. That is an important fact about how the research environment for ML has changed: All cutting-edge work is done by, in collaboration with, or at the behest of, large tech corporations. And it is these corporations that set the research agenda for the discipline.

That agenda has gelled around a definite consensus: the hyperscaling enterprise is a road that will bring about an end to hallucinations and the advent of Artificial General Intelligence. This consummation is devoutly wished for by the Tech industry, because they are persuaded that AGI will be so great that markets will spring up that amortize the $2T-$3T investment required to bring it about. Accordingly, this is the direction that the “AI” research army, with very few exceptions, is marching towards.

Does The Research Consensus Make Sense?

Well, no. I’m sure you’re not surprised. You all know a rhetorical question when you see one.

Keep in mind, this “research” agenda is being driven by the business interests of the Tech industry. Which is not making those colossal investments in model training infrastructure out of a dispassionate thirst for computer science knowledge.

There are important blank spaces in the research agenda. One is that there is no real scientific connection between hyperscaling and hallucination-abatement. Another is that, as a scientific matter, we do not have any evidence that AGI is possible, let alone possible using DL-based technology. There isn’t even any scientifically-defensible definition of what AGI means. There is, however, a good deal of propagandizing over these issues by the “AI” movement leadership, and that rhetoric seems to be moving a lot of money, which is certainly the point. Nonetheless, these are barely-examined assumptions that are being summarily peddled—and accepted—as scientific truths.

Starting next week I’ll be arguing that it is essentially impossible to achieve AGI through any current ML strategy. I will write a segue arguing that hallucinations are ineradicable structural features of all current NLP methods. Even if I’m wrong about either or both of these theses, I hope to persuade you that these are important open questions that might suggest that a little risk management is in order. But there is no such risk management: venture capitalists are all-in, as are the large public tech corporations. It’s a huge, incredibly risky bet, backed by Other People’s Money.

Those blank spaces in the research agenda remind me of the South Park Underpants Gnomes’ Business Plan:

  • Phase 1: Steal Underpants.
  • Phase 2: ?
  • Phase 3: Profit

The movers and shakers of the Tech industry have filled in Phase 2 with a story that at a very minimum, is not remotely backed by any science, and by such means have succeeded in getting oceans of investment funds flowing. The DL research community, instead of tapping the brakes, has basically stood up and saluted. How is this possible?

The problem is that when that much money is at stake, it always corrupts any related research discipline. What is happening in DL research today is quite analogous to what happened to climate science under the influence of the oil industry over the past 30 years, or to what a tobacco industry-funded “research” institution called The Tobacco Institute once did to lung cancer research for 40 years.

In the current iteration, U.S. research funding agencies such as NSF and DOE are aligning their funding priorities with the Underpants Gnomes’ plan, partly because Congress is so impressed with the plan. Universities license pretrained models from OpenAI, Anthropic, Google, etc. so that those models may be fine-tuned on some local data, thereby acknowledging their total lack of control over “AI” modeling. Those pretrained models are “open source” in the sense that the code and the parameter values from training are available, but the key factor that would enable understanding of what those models do is the training data itself, and that is always a proprietary and closely-held corporate secret.

And it gets worse. The companies guarding those secrets are scornful of the intellectual standards of real science. One indication of this pathology is this: every year or two, some institution releases a new set of benchmarks to test model quality. When the benchmarks are released, all the models initially show mediocre performance on them. A few months later, new “versions” of the models are released that ace the benchmarks, and the corporations behind those models claim that the performance improvement is due to the increased capabilities of the updated models. But the code doesn’t really change enough to explain those new capabilities: what is almost certainly happening is that the Tech companies are adding the benchmark data to their training data. Which would, of course, constitute cheating. But of course that is what they are doing. Their corporate reputation is riding on those benchmark performance numbers. And their training data is secret anyway, so there is no downside that the C-Suite can see. Any employee who objected to this practice on grounds that it corrupts the very basis for benchmark testing would be summarily fired, or transferred to a sales department.

It’s not a pretty picture. DL research is no longer an autonomous scientific discipline, because its direction is being set to suit the business-economics interests of a handful of very large, very powerful corporations whose corporate commitment to scientific intellectual standards is, for all intents and purposes, nil. It would take a serious crisis threatening those business interests to break up this vicious cycle.

Perhaps that crisis is on its way…


  1. I’ll omit network data—nodes, connected by possibly weighted edges—from this discussion. While networks are an important topic, they are a bit niche for our purposes.

Part 2: AI State of Play | Post + Comments (85)

Part 1: What is AI, And How Did We Get Here?

by WaterGirl | October 29, 2025 | 7:30 pm | 174 Comments

This post is in: Carlo Graziani, Carlo's Artificial Intelligence Series, Guest Posts, Science & Technology

Guest post series from *Carlo Graziani.

(Editor’s note:  It’s Carlo, no S.)

Guest Post: AI 1

On Artificial Intelligence

Hello, Jackals. Welcome, and thank you for this opportunity. What follows is the first part of a seven-installment series on Artificial Intelligence (AI).

Who Am I, And Why Do I Think I Have Something To Say About AI?

I am a computational scientist, applied mathematician, and statistician working at a U.S. National Laboratory. As such, I must of course make an obligatory disclaimer: The views discussed in these essays are my own, and in no way represent the views of my employer, or of any part of the U.S. government.

I work on many projects, quite a few of which have been intimately connected with the subject of machine learning (ML). ML is the technical term for the set of computational/statistical techniques that underlie the subject of AI. Over the course of the past several years, I have been looking into some issues at the mathematical foundations of the subject. In the process, I have developed a somewhat peculiar outlook on AI—at least, some of my colleagues regard it as somewhat peculiar, but often, at the same time, refreshing.

I’ve come to a set of conclusions about AI that have gelled into a relatively coherent story. It is to some degree a technical story, but I believe that it is very relevant to the moment that we are living through, because of the sudden onset of public AI tools, and because of the ubiquity of AI in public discourse and policymaking. I would like to tell that story in a manner that is as accessible as possible to non-specialists and non-technical people, because I believe that many wrong—even unscientific—claims are being injected into our societal discourse and peddled as facts. The subject is simply too inscrutable to many otherwise intelligent and literate people for those claims to be gainsaid. Nonetheless, those claims should be gainsaid, because they are technically wrong, and their wrongness has very real implications for where we are heading with this technology.

These essays are my attempt to tell that story.


The plan is to release one of these per week, on Wednesdays (skipping Thanksgiving week), with the Artificial Intelligence tag on all the posts, to assist people in staying with the plot.

OK, those are the preliminaries. Without further ado, then:

Part 1: What Is AI, and How Did We Get Here?

The subject of AI is very clearly topical and important. It is also necessarily a nearly opaque topic to many people, not least because the advent of AI services for the general public has occurred with such suddenness as to catch most people by surprise. Many, probably the majority of you have at least messed around with a Chatbot, and quite a few of you have used one to help you write documents or even computer code, and been surprised at how useful these tools can be. It is natural to wonder whether the entity at the other end of the prompt-response dialog is, in fact, an intelligent interlocutor. How is this possible, and how did it happen so quickly?


Another difficulty for coming to grips with AI is that many bold, almost incredible claims have been made on behalf of the capabilities of AI models. The principal such claim is embodied in the now-ubiquitous term “AI” itself: the “I” stands for “Intelligence”, and every time we use the term “AI” we are, in a way, conceding a point: that we now have machine systems that are in some sense “intelligent”, and hence may be said to engage in some form of cognition. The technical literature on AI is replete with anthropomorphizing terms such as “general intelligence”, “chain of thought”, “mixture of experts”, “agents”, and so on, apparently describing recognizable cognitive features and abilities of these machine systems.

This idea—that the age of thinking machines is upon us—is at the heart of much public writing on AI, some celebratory, some bemused, some frankly anxious. AI systems have been touted as labor-replacing devices in a variety of sectors: administration, entertainment, engineering, programming, even science. Every military in the world is looking at AI-guided autonomous weaponry and decision-making systems. And in Silicon Valley, many Captains of Industry are frankly gleeful at the disruption to various economic enterprises of their new intelligent machines, anticipating that the time is nigh when each disrupted sector of the economy will reconfigure itself to direct part of its profits to those supplying the AI tools.

On Learning

Whether these machines can in fact be said to be “intelligent” in even the most rudimentary ways is a question that has received remarkably little serious scientific investigation. The claims that one often encounters in the AI literature (such as those made in this egregiously-titled paper) on behalf of the proposition that large language models (LLMs) can “reason” don’t really meet any rigorous standard of scientific inquiry: they amount to circular reasoning, as I will discuss in Parts 3 and 4 of this series.

In fact, there is only one aspect of cognition that “AI” systems can be said to model: Learning. The very term “AI” is essentially a marketing cover for the subject of machine learning (ML), the most prominent part of which nowadays is deep learning (DL), a variant of ML that employs models consisting of artificial neural networks. To be clear, there are also other important modeling strategies in ML besides DL approaches, but it was the advent of DL in 2007 that set off the technical revolution whose consequences we see today.

Machine Learning is the proper technical term for “AI”, and from here on I will attempt to use that term in preference to “AI”, using the latter only in scare quotes. “AI” is a technically-obnoxious term, because that “I” introduces improper associations with general cognition that, I hope to persuade you, we should view as inadmissible.

Statistical Learning

OK, so what is machine learning then? ML is a field within an even older subject called statistical learning, which is concerned with using data to structure optimal decisions. To be more explicit, here is the program of statistical learning:

  1. Take a set of data, and infer an approximation to the statistical distribution from which the data was sampled;
    • Data could be images, weather states, protein structures, text…
  2. At the same time, optimize some decision-choosing rule in a space of such decisions, exploiting the learned distribution;
    • A decision could be the forecast of a temperature, or a label assignment to a picture, or the next move of a robot in a biochemistry lab, or a policy, or a response to a text prompt…
  3. Now when presented with a new set of data samples, produce the decisions appropriate to each one, using facts about the data distribution and about the decision optimality criteria inferred in parts 1. and 2.

These three things really capture everything about what “AI” really is about, from the humblest convolutional neural network to the mightiest Chatbot. You may want to re-read them, because they structure much of the story to follow.
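If the three steps feel abstract, here is a deliberately tiny, made-up illustration (the numbers and the threshold rule are invented for this sketch, not drawn from any real system): learn a one-dimensional data distribution, pick a decision rule that is sensible under that learned distribution, then apply the rule to new data.

```python
# Toy illustration of the three-step statistical learning program.
# The "distribution" is a Gaussian fit; the "decision" is a flag raised when a
# new observation is improbably large under that fit. All numbers are invented.
import numpy as np

rng = np.random.default_rng(42)

# Step 1: infer an approximation to the data distribution (here: fit a Gaussian).
training_data = rng.normal(loc=20.0, scale=2.0, size=1000)   # e.g. daily temperatures
mu, sigma = training_data.mean(), training_data.std()

# Step 2: choose a decision rule optimized against that learned distribution
# (here: flag anything beyond the fitted 99th percentile as "anomalous").
threshold = mu + 2.326 * sigma   # 99th percentile of the fitted Gaussian

# Step 3: inference -- apply the learned distribution and rule to new samples.
new_samples = np.array([19.5, 21.2, 27.8])
decisions = new_samples > threshold
print(list(zip(new_samples, decisions)))   # only 27.8 should be flagged
```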

Learning vs. Cognition

Parts 1. and 2. of this program are referred to as Training. Part 3. is called Inference. The entire process is described using the term Learning. This is supposed to be an evocative term: in a sense, a system that accomplishes this program “learns” from the data how to make reasonable decisions in the presence of new data. The analogy to human “learning” (or to animal “learning” for that matter) seems sufficiently well-motivated to justify using the term here.

Note, however, that learning is a very limited part of cognition, and in particular, statistical learning does not embody anything that one might analogize to, say, reasoning. This is an important reason for being wary of the term “Intelligence” in this connection: there is a great deal more to intelligence than learning. We will return to this point with some force in future essays.

Machine Learning

How does ML differ from statistical learning? Not in any essentials. The only difference is that ML arose in conjunction with the advent of powerful computing tools that could hoover up and process massive amounts of data. ML is basically “Statistical Learning Meets High-Performance Computing”.

I don’t want the obviously “AI”-skeptical take that I am developing here to obscure an important truth: ML techniques have turned out to be an unbelievably powerful family of methods for assimilating massive amounts of data, and structuring reasonable (if not always optimal) decisions based on that data. The ability to do this at large scales has been transformative to many aspects of our lives, as well as to the scientific enterprise itself. When that marriage of statistical learning and computing occurred, something deep and important changed in the world.

A Brief History of Machine Learning

The history of ML methods, including that of neural networks and other efforts to bring about machine intelligence, has antecedents that go back to the 1940s. For now, I’m going to start the story quite a bit later than that, because to go back that far would take us a bit far afield. For present purposes I need to describe the beginning of a revolution that really got underway around 2007 1.

This was a curious time in the academic disciplines of applied mathematics and statistics, because it was already clear that high-performance computing was having a transformative effect on science, but at the same time there were many problems still regarded as very hard, possibly insoluble even with the new computational tools.

Manifold-Finding

One example of such problems can serve to illustrate the broader situation. Suppose that we have many data samples, each consisting of a long list of numbers (in mathspeak, such a list is a “vector”). Suppose, for example, that each such vector contains 1000 entries. The problem is this: is it possible to identify a shorter list of, say, 30 entries, together with 1000 functions of those 30 entries, sufficient to largely recover the original structure of the data, and (importantly) to predict future data? And are there efficient ways to accomplish this?

If that formulation seems confusing to you, perhaps the following analogy is helpful: think of an aircraft condensation trail, one of those visible vapor trails left behind by passenger jets. Each location in the trail can be described by three numbers, its Cartesian coordinates (x,y,z). The full trail is described by a connected series of such triplets.

But that is not an efficient representation of the trail, because we know that it is really a curve—a one-dimensional object embedded in a three-dimensional space. It would be much better to know three functions x(t), y(t), z(t) of a one-dimensional parameter t, which embed the curve into the three-dimensional space. Such a representation is much more compact, and moreover it is more informative about the structure of the trail than the series of triplets (x,y,z). The three functions provide a dimensional reduction of the structure of the trail. We can now reason about its structure in a lower-dimensional space. We have also discovered an important feature of the trail: it is really just a line, warped by the embedding functions into a three-dimensional structure.
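As a concrete (made-up) instance of such an embedding: a helix is a one-dimensional curve living in three dimensions, fully generated by a single parameter t and three functions of it.

```python
# A one-dimensional object embedded in three dimensions: 100 points of a helix
# take 300 numbers to list as (x, y, z) triplets, but are fully generated by a
# single parameter t and three embedding functions. Purely illustrative.
import numpy as np

t = np.linspace(0.0, 4.0 * np.pi, 100)        # the one "intrinsic" dimension
x, y, z = np.cos(t), np.sin(t), 0.1 * t        # the three embedding functions
trail = np.stack([x, y, z], axis=1)            # 100 x 3 "raw" representation

print(trail.shape)       # (100, 3): 300 numbers...
print(t.shape)           # (100,):   ...reconstructable from 100 values of t
```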

The case of a one-dimensional structure embedded in a three-dimensional space is not too difficult a problem to visualize or solve. By contrast, the case of the 1000-dimensional vectors to be summarized by a 30-dimensional embedding (where in particular we don’t even know whether 30 is the right embedding dimension a priori) is very hard. This problem, called submanifold finding, is important and ubiquitous. For example, images can be viewed as vectors, with the entries representing the brightness of pixel values. Viewed in this way, the space of images is almost entirely empty of “natural” images: if you selected random values for the pixel brightnesses and displayed the result as an image you would always get oatmeal, without ever producing such features of natural images as an edge, or a contrast gradient, let alone images such as a stop sign or a cat. The space of natural images is a low-dimensional submanifold of the space of images. And because the relevant dimensions are so high, that submanifold is essentially impossible to locate using classical methods. And, if you can’t find that submanifold, you can’t really distinguish images from non-images, so image classification (for example) is basically impossible.

Around 2007, there was not a lot of optimism among applied mathematicians that this problem would be solved anytime soon. I remember reading a review article on submanifold-finding in 2009, which essentially declared defeat: advanced methods were capable of finding such immersed structure for very artificial examples of data in dimensions as risibly low as 50, say, but were helpless to do anything useful with natural data such as the MNIST Hand-Written Digits, which are images consisting of 784 black-and-white pixels.

Enter Deep Learning

This is not a little ironic, because at about this time, Deep Learning methods were making inroads into such problems at amazing rates of progress. The introduction of the convolutional neural network made image analysis almost magically tractable, directly identifying features of images at various scales and exploiting those features for tasks such as classification. By 2012, researchers using a DL architecture called an autoencoder had demonstrated convincingly that the 784-dimensional MNIST data could in fact be represented using a 30-dimensional submanifold. It was a tour-de-force.
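For the curious, here is what such an autoencoder looks like in outline. This is a bare-bones sketch (the layer widths and training details are illustrative, not those of the work mentioned above): the network is forced to squeeze each 784-dimensional image through a 30-dimensional bottleneck and then reconstruct it, so the bottleneck activations become coordinates on the learned submanifold.

```python
# Bare-bones autoencoder sketch: 784-dimensional inputs squeezed through a
# 30-dimensional bottleneck. Layer widths and training details are illustrative.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, bottleneck=30):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck),          # coordinates on the submanifold
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One illustrative training step on fake "images" (real use: MNIST batches).
batch = torch.rand(64, 784)
loss = loss_fn(model(batch), batch)   # reconstruction error drives the training
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(model.encoder(batch).shape)     # torch.Size([64, 30])
```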

Oddly enough, this feat had not been pulled off by applied mathematicians or by statisticians, the academic tribes traditionally most concerned with modeling data. Instead, members of a completely different tribe were responsible: computer scientists. Academic computer science was quick to recognize that the advent of high-performance computing tools enabled rapid processing of data volumes on heretofore unheard-of scales, and that this new capability could transform some older ML techniques from academic toys into useful tools for turning data into decisions. In 2007, certain technical breakthroughs occurred that made the training of large neural network models possible for the first time. The term “Deep Learning” dates to this period, and can serve as a marker for the birth of the revolution.

In the decade-and-a-half that followed, the subject of Deep Learning advanced from strength to strength, furnishing solutions to many problems previously regarded as intractable. Image processing problems such as classification (“Is this a cat?”), or image segmentation (“Is there a cat in this image?”) became easy. Empirical weather forecasting based on DL methods is so effective that the European Center for Medium Range Weather Forecasts (ECMWF) now uses such a model for some of its 10-day forecast products. Protein folding structure, a key to understanding biochemical action of proteins, is now a solvable problem thanks to DL. By 2017, Natural Language Processing (NLP) had already advanced to the point that there was no reason for companies to employ hundreds of customer service call-center operators. Those people could be replaced by a cheap appliance that routed callers around voicemail menus—annoyingly but admittedly with reasonable accuracy. And 2017 saw the invention of the Transformer architecture (the “T” in “GPT”), which eventually would wind up eclipsing all other forms of “AI” in the public mind.

The Academic Politics of Deep Learning

There is an interesting question raised by this capsule history: why was it that the revolution was led by computer science, rather than by academic applied math and statistics? After all, the new field that arose is often referred to as “Data Science”, so why is it that the traditional academic disciplines concerned with data were so outmatched in this revolution?

In my opinion, the answer in the case of statistics is that unfortunately, at the time of the birth of this subject, the majority of academic statisticians were computationally illiterate. Most of them barely knew how to use their computers to send email and edit documents. With very few exceptions, they were entirely innocent of the acceleration of computing capabilities, and hence not in a position to consider the meaning of this development for their own professional activities 2.

The case of applied mathematics is, in my opinion, more interesting and informative. Applied mathematicians have been computationally literate for many decades, and very much attuned to developments in computing. But they labored under a handicap: mathematical principle.

By and large, the subject of computational applied math is algorithmic. This term has a specific and important meaning: An algorithm is a piece of rigorous mathematics that is realized as a computational model. As such, it makes very precise and validatable predictions. If you implement, say, a linear algebra solver algorithm, you know that there are necessarily inaccuracies in the output solutions that have certain predictable properties. If you run the algorithm and you don’t observe those properties, you know that your algorithm has a problem, and professional canon binds you to go find it and fix it.
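A small, concrete example of that kind of validatable prediction (illustrative only): a standard linear solver comes with the guarantee that its residual sits at the level of machine precision relative to the problem's scale, and you can check that guarantee directly.

```python
# Illustrative check of an algorithm's promised behavior: solve A x = b and
# verify that the relative residual is at the level of machine precision,
# as backward-stability theory for such solvers predicts.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 200))
b = rng.normal(size=200)

x = np.linalg.solve(A, b)

relative_residual = np.linalg.norm(A @ x - b) / (np.linalg.norm(A) * np.linalg.norm(x))
print(relative_residual, np.finfo(float).eps)
# If the residual were many orders of magnitude above machine epsilon,
# you would know the implementation (or the problem setup) was broken.
```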

And that was the applied math folks’ problem: there isn’t a single algorithm in all of deep learning.

On Model Correctness

I can hear the cries of outrage already. What? Are you nuts? Google Corporate swears by “the Algorithm”! Even avowed foes of AI put the word “Algorithm” in the titles of their tracts! Why are you playing these language games?

Stay with me. There is a Yang to the Yin of “Algorithm”. It is called a heuristic. A heuristic is a bit of semi-mathematical intuition that is embodied in a computer program. Elements of that program can certainly be described by mathematical equations, but the entire structure is not based on any theoretical understanding, and, as a consequence there can be no a priori expectations concerning the program’s output. A heuristic model is considered “valid” if it seems qualitatively acceptable by some performance metrics, and especially if it appears to outperform other heuristics that attack the same problem set.

All DL methods are heuristics. There is not a single exception. That was the key to progress. Researchers in DL simply dropped the epistemological standards of normal computational science, wherein models are considered “correct” if they embody a correct implementation of a rigorous mathematical concept purporting to represent some data-generating process and successfully demonstrate that the predictions of the models are validated by data from that process. In its place, they installed a weaker epistemology, in which a model implemented in computer code no longer needs a mathematical model of the data-generating process at its foundation, and is “valid” if its output seems reasonable compared to some data, with no rigorous prior statement as to what counts as “reasonable”.

As we’ve seen, this strategy was wildly successful, and of necessity it largely excluded applied math and statistics (statisticians also prize rigor) from real participation in the revolution. Oddly enough, there is an analogy here with the development of Financial Mathematics. This is a subject whose practitioners attempt to predict future movements in prices of financial assets, based on time series of past prices. In this endeavor, they make no effort to actually model markets as ensembles of financial actors. Instead, they use completely empirical sets of equations that have no justification beyond their purported ability to predict the price time series. A model is considered “correct” if it can be shown that decisions based on its predictions are profitable. A second model would be “more correct” than the first one if it is more profitable. “Correctness”, in this sense, is very contingent, because changes in market conditions can turn a “correct” model into an “incorrect” model if it starts consistently losing money.

This is more-or-less how DL practitioners assess their models’ correctness: not on how well they embody some mathematical principle that is believed to describe the data-generating process, but simply on how well they appear to match the data at all 3.

Drawing Conclusions: Costs and Benefits

As I said, adopting this sort of epistemology was key to the progress made in DL. But I hope that it should be clear that this progress came at a cost, and that there might be pigeons in the air waiting to come home to roost. The cost is that no DL model can ever be said to be “wrong”, because we do not have any language to describe what it would mean for a DL model to be “correct”.

Not having a criterion for correctness might be an acceptable situation if the model is making low-stakes decisions. It is not acceptable in high-stakes situations such as ML-driven surgery, or ML-driven air-traffic control, or ML-controlled weaponized drones, or even ML-driven science modeling. If it is not possible, even in principle, to verify and validate such models, how can they be trusted in critical applications?

Viewed in an even broader context, this choice made by DL practitioners almost 20 years ago is now, in my opinion, the underlying source of a Kuhnian scientific crisis. The crisis is best represented by the issue of “AI Hallucinations”, which is widely recognized to be a problem. This is, without a doubt, an issue of inaccurate model output, from models whose design criteria never specify what counts as accurate output. The latter fact is directly traceable to the new epistemology of DL.

But most practitioners of DL don’t even realize that their epistemology was in fact a choice, or that this choice might be near the origin of their hallucination problem. As an academic discipline, the field is nowhere close to even acknowledging that it has a problem, much less that it might be confronting a genuine crisis.

I believe not only that there is a crisis in progress, but that it will culminate very soon. I’ll be building up that story over the next few weeks.

Next Week: The “AI” State of Play

All 7 parts, once published, can be found here: Artificial Intelligence


  1. I will describe some aspects of that earlier history in Part 7.
  2. I could write a great deal more along these lines, but most of it would sound mean, and I want to hasten to add that statistics has also undergone a revolution of its own in the past decade or so, with a lot of Young Turks rising to prominence who, as graduate students, were evidently too naive to realize that computing is in bad taste. They have now truly turned their discipline into a computational subject.
  3. To be clear, this lack of mathematical justification is a property of DL methods, but not a general property of the wider class of machine-learning methods. There are other methods that are based on rigorous theory. Such methods have not, however, played an important role in the “AI” story so far.

Part 1: What is AI, And How Did We Get Here? | Post + Comments (174)

