Guest post series from Carlo Graziani.

On Artificial Intelligence

Hello, Jackals. Welcome back, and thank you again for this opportunity. Being able to write these posts on AI has been very helpful to me in clarifying and sorting out my thinking on this subject. The comments that have followed each post have been of very high quality and on point, making up excellent and informative (including to me) discussions.

The plan is to release one of these per week, on Wednesdays, with the Artificial Intelligence tag on all the posts, to assist people in staying with the plot.

Most of these posts have had a nerdy tinge, because the take that I have developed on AI is an unusual one, blending a mix of reflections on the technical side of the subject with a skeptical (and largely contrarian) outlook on much that passes for conventional wisdom among this discipline’s practitioners. This sort of project, wherein someone claims that much accepted technical wisdom of a certain field of science and technology is in fact wrong, necessarily exposes one to charges of being a crank or a crackpot, unless one is careful to provide some detailed and hopefully persuasive technical arguments pointing to unexamined assumptions and to scientifically plausible alternative views. Hence the plunge into nerd-core.

This post is the last of the truly nerdy posts in this series. After this, I’ve mostly emptied the bag of things that I think I know that most people don’t. So the final post will be a sort of high-level summary, combining take-aways from the series with some historical considerations to attempt some synthesis of where we are with AI, what this moment means, and where we might be heading in the not-too-distant future. If you’ve been waiting for a hopefully more accessible discussion of AI, that should be the one.

That said…

Part 6: The Pathology at the Heart of Hyperscaling

The term “hyperscaling” entered the common vernacular with a vengeance in 2025. The prefix “hyper” in “hyperscaling” is not marketing vapor. At this stage, most people who notice economic news at all are aware that something very unusual is happening to the U.S. economy. AI Capital Expenditures (“CapEx”) contributed 1.1% to U.S. GDP growth in the first half of 2025, a figure that is almost certainly higher when annualized. Total AI CapEx in 2025 is estimated at somewhere in the neighborhood of $500B, which is an incredible 1.4% of U.S. GDP. The Tech industry has persuaded analysts that AI investments between now and 2030 will add up to something like $5T. And as astounding as these numbers are, they look tame compared to the anticipated build-out of U.S. electrical power generation: estimates of the additional generation (and transmission) capacity required to run AI training and inference are somewhere in the range of 75-100 GW. The middle of that range (87.5 GW) corresponds to roughly 7% of total U.S. electrical generation capacity. And that’s just to power the AI data centers.

These numbers are ridiculous. It seems clear that those projections for cost and power generation are totally unrealistic, unlikely to ever come to pass, and damaging to the economy to the limited extent that they do come to pass.

The damage is not limited to economics, however. This level of investment also moves minds. The Tech industry consensus—that hyperscaling of large language models (LLMs) will bring about Artificial General Intelligence (AGI)—has attracted a critical mass of mindshare from government and from academia. This is certainly a bit of an own-goal for academic institutions, because by buying in to this consensus academic scientists have entirely shut themselves out of LLM development: no university or consortium of universities can afford to build out even a small fraction of the computational infrastructure required to train or operate these models at these scales. And even the U.S. government is struggling to stay in the game.

The interesting question here is why this consensus has developed across the field. And the “why” that I’d like to discuss is not a business strategy “why”, but rather a technical “why”. As a matter of science and technology, what is the peculiarity of LLMs that fuels the drive to hyperscale? After all, we have had “AI” deep learning (DL) methods since 2007. Hyperscaling, however, is peculiar to the subset of DL methods that power LLMs, and has only really gotten underway since 2022. What changed?

Overparametrization

In a sense, the seeds of hyperscaling have been an implicit part of DL ever since the DL revolution began in 2007. The origin of the phenomenon is the practice of model overparametrization.

In normal parametrization, one has a model with some parameters, which may be thought of as knobs that one can dial to arbitrary values. The knobs control the predictions that the model makes of the data. One determines the values of those knobs by adjusting them so as to minimize the misfit between the model and the data. Normally, one has many fewer parameters than data samples.

Overparametrization is the practice of endowing a model with many more parameters than one has data samples to model. It has not always been regarded as a useful practice: before the rise of DL, it was in very bad odor, for good and sufficient reasons. The problem with overparametrization is that it endows a model with too much flexibility and too little predictive power. After all, the purpose of training a model on some data is to obtain a means of making predictions and decisions given new data samples (recall the Statistical Learning Catechism, from Part 1 of this series). But if I have 10,000 data samples, and I naively train a 100,000-parameter model on that data, what will inevitably happen is that the model will predict the training data perfectly, while giving wildly wrong predictions about new data not included in the training set. In statistical parlance, the model will overfit the data.

Basically, as a matter of algebra, you only need 10,000 parameters to “solve” for the 10,000 training data samples. This is sometimes called “memorizing the training data”, i.e. learning to predict those 10,000 training samples perfectly. Despite its perfect fit to the training data, that 10,000-parameter model will give terrible predictions of any new data not included in the training set. And a 100,000-parameter model is—or ought to be—much worse.
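Here is a toy illustration of that memorization effect, offered as a minimal sketch rather than anything taken from a real DL pipeline. It fits 10 noisy samples of a smooth curve with a 10-coefficient polynomial, so that there are exactly as many parameters as data points. The sine curve, the noise level, and the use of NumPy's `Polynomial.fit` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_process(x):
    """Smooth underlying process (an assumed stand-in for the real one)."""
    return np.sin(2.0 * np.pi * x)

# 10 noisy training samples.
n_train = 10
x_train = np.linspace(0.0, 1.0, n_train)
y_train = true_process(x_train) + rng.normal(scale=0.3, size=n_train)

# New data from the same process, never seen during the fit.
x_new = rng.uniform(0.0, 1.0, size=200)
y_new = true_process(x_new) + rng.normal(scale=0.3, size=200)

# A degree-9 polynomial has 10 coefficients: one knob per training sample.
model = np.polynomial.Polynomial.fit(x_train, y_train, deg=n_train - 1)

train_mse = np.mean((model(x_train) - y_train) ** 2)
new_mse = np.mean((model(x_new) - y_new) ** 2)

print(f"misfit on training data: {train_mse:.3e}")
print(f"misfit on new data:      {new_mse:.3e}")
```

The training misfit comes out at the level of floating-point round-off, because the polynomial passes through every training point, noise included. The misfit on new data is many orders of magnitude larger.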

The point is that the true process underlying the data always has more smoothness than the data itself, because the data has additional random noise. Any computational model of that process should not be allowed to play connect-the-dots with the data, because that would be tantamount to chasing the random noise. But that is exactly what a naive overparametrized model does. And because it learns to chase random noise in the training data, it cannot properly predict the smooth behavior of the underlying process. We say of such models that they have poor generalization properties, which just means that no interpolation based on such a model can be trusted.

The problem is that in the example of the 100,000-parameter model trained on 10,000 data samples (not an atypical case for a convolutional neural net trained on a corpus of images) there is a roughly 90,000-dimensional parameter subspace of solutions that exactly memorize the training data. That subspace has a thickened neighborhood (in the full 100,000-dimensional space) of solutions that don’t quite memorize the data. In that neighborhood exist some values of the parameters that endow the model with reasonable generalization properties. Those values are the target.
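The counting argument can be checked on a scaled-down example. The sketch below assumes a purely linear model with 100 parameters and 10 samples (a stand-in for the 100,000 / 10,000 case), which is a simplification: for a nonlinear network the memorizing solutions form a curved surface rather than a flat subspace, but the dimension count works the same way.

```python
import numpy as np

rng = np.random.default_rng(1)

# Scaled-down stand-in for 10,000 samples and 100,000 parameters.
n_samples, n_params = 10, 100
X = rng.normal(size=(n_samples, n_params))
y = rng.normal(size=n_samples)

# One memorizing solution: the minimum-norm exact fit.
w_min = np.linalg.pinv(X) @ y

# Directions in parameter space that do not change any prediction on the
# training data: the null space of X, read off from the SVD.
_, singular_values, Vt = np.linalg.svd(X, full_matrices=True)
rank = int(np.sum(singular_values > 1e-10))
null_basis = Vt[rank:]  # shape: (n_params - rank, n_params)

# Step anywhere within that null space: the result still memorizes the data.
w_other = w_min + null_basis.T @ rng.normal(size=null_basis.shape[0])

print("dimension of memorizing subspace:", null_basis.shape[0])
print("w_min fits exactly:  ", np.allclose(X @ w_min, y))
print("w_other fits exactly:", np.allclose(X @ w_other, y))
```

With 10 independent samples constraining 100 parameters, 90 directions are left completely free, which is the toy analogue of the 90,000-dimensional subspace described above.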

Regularization and Stochastic Gradient Descent

The way to improve the generalization properties of an overparametrized model is to regularize it. Regularization is a very general term for a time-honored, broad family of techniques that deprive an overparametrized model of much of its freedom. By regularizing such a model, we in effect impose some kind of smoothness properties on it that interfere with its ability to play connect-the-dots with the data, thus making it a more realistic representation of the smooth underlying data-generating process. The regularization technique adopted in DL methods is peculiar to the discipline, because it was in DL that extreme overparametrization was first seriously considered.

DL methods find the target parameters through the technique of stochastic gradient descent (SGD). The “gradient descent” means that there is some defined cost function of the parameters (such as some average of the prediction errors over the training set, for example), and that cost function is minimized (“descent”) by following it through the parameter space along the steepest descent direction.
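For readers who prefer code to words, here is a minimal sketch of the plain "gradient descent" part, with no stochastic element yet. It assumes a small linear model and a mean-squared-error cost. The model, the learning rate, and the step count are illustrative choices, not anything specific to DL practice.

```python
import numpy as np

def cost(w, X, y):
    """Mean squared prediction error over the data set."""
    return np.mean((X @ w - y) ** 2)

def cost_gradient(w, X, y):
    """Gradient of the mean squared error with respect to the parameters w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=200)

w = np.zeros(5)        # starting point in parameter space
learning_rate = 0.05   # step length (assumed value)
costs = [cost(w, X, y)]
for _ in range(500):
    # Step against the gradient, i.e. along the direction of steepest descent.
    w = w - learning_rate * cost_gradient(w, X, y)
    costs.append(cost(w, X, y))

print(f"cost at start: {costs[0]:.4f}, cost at end: {costs[-1]:.4f}")
print("recovered parameters:", np.round(w, 2))
```

This toy cost function is a smooth bowl with a single minimum, so following the local slope works fine. The difficulty discussed next arises when the landscape is not so well behaved.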

Here’s an analogy to help understand the training process: think of the cost function as a continental landscape. You know that the landscape has a long, broad valley somewhere in the middle of the continent, and you are targeting a destination somewhere near the valley floor. Not the exact floor, because that would correspond to the noise-chasing data-memorization solutions, but somewhere in that neighborhood. To find the right neighborhood, you need to locate the valley floor.

A reasonable first approach is to always follow the local descending direction, until you find the lowest point. But that won’t work. The problem is that while the valley walls slope down on average (by about a foot per mile, say), the landscape is highly textured, with lots of local structure such as boulders, hills, small and twisty valleys, mountain passes, and so on. The local direction of steepest descent might actually lead you away from the continental valley floor, because you are descending some random twisty valley, or because the direction that you need to follow leads up some mountain pass. The landscape texture in the analogy is the structure added to the cost function by those 10,000 data samples, each one of which acts as a sort of structure-adding knob, contributing the boulders, hills, etc.

You need some strategy to ignore the local texture and find the large-scale average descent direction. If only by squinting you could blur the landscape, only resolving structure to within a mile instead of to a few feet, you might be able to see the average descent towards the valley floor, because the boulders etc. would be fuzzed away.

That is where the “stochastic” part of SGD comes in. By randomly swapping subsamples (“minibatches”) of data at every step of gradient descent, the optimization code’s view of the landscape texture is blurred, because the landscape of fine-scale features changes with every minibatch. As a consequence, the small-scale structure ceases to obscure the large-scale path of descent. Even better, the 90,000-dimensional memorization submanifold (the exact bottom of the valley, corresponding to the family of connect-the-dots solutions) also gets fuzzed out somewhat, and becomes harder to find. The optimization algorithm dwells for longer in the liminal region where good generalization may be found.
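The "changing landscape" can be made concrete with a small sketch. Assuming the same kind of toy linear model and squared-error cost as a stand-in for a real network, the code below computes the gradient at one fixed point in parameter space, once using all the data and once for each of 20 random minibatches. All names and sizes are hypothetical.

```python
import numpy as np

def cost_gradient(w, X, y):
    """Gradient of the mean squared error, computed on whatever data is passed in."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(3)
n_samples, n_params, batch_size = 1000, 5, 50
X = rng.normal(size=(n_samples, n_params))
y = X @ rng.normal(size=n_params) + rng.normal(scale=0.5, size=n_samples)
w = np.zeros(n_params)  # the point in parameter space where we stand

# The descent direction as seen with all the data at once.
full_grad = cost_gradient(w, X, y)

# Randomly split the data into 20 minibatches of 50 samples each.
order = rng.permutation(n_samples)
batches = order.reshape(-1, batch_size)
batch_grads = np.array([cost_gradient(w, X[idx], y[idx]) for idx in batches])

# Each minibatch sees a slightly different local slope...
spread = batch_grads.std(axis=0)
# ...but averaged over all minibatches, the full-data slope is recovered.
mean_of_batches = batch_grads.mean(axis=0)

print("full-data gradient:       ", np.round(full_grad, 3))
print("mean of minibatch grads:  ", np.round(mean_of_batches, 3))
print("minibatch-to-batch spread:", np.round(spread, 3))
```

An SGD step uses a single row of `batch_grads` in place of `full_grad`, so the fine-scale slope it follows keeps changing from step to step while the average direction stays the same.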

Such a solution is identified by computing the cost function on held-out “test” data, which is not used in the gradient computation. At first, both the training and test costs decline in tandem. At some point in the training, however, the test cost stops dropping and either flattens out or starts rising again, while the training cost continues to drop. This is the stopping point of the training loop: a parameter solution has been located with reasonable generalization properties, because the test cost is OK, and there is no point in continuing, because the optimization routine is about to find the data-memorization submanifold (the lowest point of the valley floor).
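The stopping rule just described can be sketched as follows. This is a toy under stated assumptions, not a recipe from any DL framework: the model is linear in 30 Chebyshev polynomial features, trained on only 20 samples so that it is overparametrized. The `patience` threshold, which decides how long to wait before concluding that the test cost has stopped dropping, is a hypothetical practical detail added here.

```python
import numpy as np

rng = np.random.default_rng(4)

def make_data(n_samples, n_params=30):
    """Noisy samples of a smooth curve, expanded into polynomial features."""
    x = rng.uniform(-1.0, 1.0, size=n_samples)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n_samples)
    return np.polynomial.chebyshev.chebvander(x, n_params - 1), y

def cost(w, X, y):
    return np.mean((X @ w - y) ** 2)

X_train, y_train = make_data(20)   # fewer samples than parameters
X_test, y_test = make_data(200)    # held out: never used for gradients

w = np.zeros(X_train.shape[1])
learning_rate, batch_size, patience = 0.01, 5, 500  # assumed values
best_w, best_test_cost, best_step = w.copy(), cost(w, X_test, y_test), 0
test_history = [best_test_cost]

for step in range(1, 20001):
    # One SGD step on a random minibatch of training data.
    idx = rng.choice(len(y_train), size=batch_size, replace=False)
    residual = X_train[idx] @ w - y_train[idx]
    w -= learning_rate * 2.0 * X_train[idx].T @ residual / batch_size

    # Monitor the cost on held-out data and remember the best parameters.
    test_cost = cost(w, X_test, y_test)
    test_history.append(test_cost)
    if test_cost < best_test_cost:
        best_w, best_test_cost, best_step = w.copy(), test_cost, step
    elif step - best_step > patience:
        break  # test cost has stopped improving: stop training here

print(f"stopped at step {step}; best test cost {best_test_cost:.4f} "
      f"was reached at step {best_step}")
print(f"training cost at the kept solution: {cost(best_w, X_train, y_train):.4f}")
```

The parameters that are kept are `best_w`, not the last `w`: the solution of interest is the one found before the optimizer starts closing in on the memorizing solutions.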

Overparametrization and Data Types

That was rather a long explanation of the practice of overparametrization in DL. I need it here, because I need to discuss the costs and benefits of the practice, and how those vary depending on the type of data and decisions that one’s model is required to traffic in.

The combination of overparametrization and SGD is key to the success of all DL methods. It should be clear, however, that from a strictly statistical point of view, overparametrization is inefficient. Remember that according to the Statistical Learning Catechism, an important part of the job of any such model is to learn the distribution of the data. That distribution may be quite complex, and many parameters may be required to approximate it. But in a principled statistical model, the required number of parameters should not depend on the number of samples used to determine those parameters. Instead, the number of parameters should be fixed, and determined only by the structure of the data-generating process. As the number of data samples increases, the precision with which that fixed parameter set is determined should get better and better.
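The simplest possible instance of that last property is estimating the mean of a Gaussian. This is a deliberately trivial sketch, with made-up parameter values: the model has the same parameters no matter how much data arrives, and the scatter of the estimate shrinks roughly as one over the square root of the number of samples.

```python
import numpy as np

rng = np.random.default_rng(5)
true_mean, true_std = 2.0, 1.5  # fixed by the data-generating process (assumed values)

def spread_of_estimate(n_samples, n_trials=300):
    """Scatter of the estimated mean across many independent data sets."""
    samples = rng.normal(true_mean, true_std, size=(n_trials, n_samples))
    return samples.mean(axis=1).std()

sample_sizes = [100, 1_000, 10_000]
spreads = [spread_of_estimate(n) for n in sample_sizes]

for n, s in zip(sample_sizes, spreads):
    expected = true_std / np.sqrt(n)  # textbook standard error of the mean
    print(f"n = {n:6d}: observed scatter {s:.4f}, expected {expected:.4f}")
```

More data buys more precision on the same parameters, rather than demanding more parameters.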

DL models do not have this property. Instead, there is a notion of model capacity, which corresponds to the size of the parametrization, and which is required to grow as the training data grows, in order to maintain the necessary overparametrization. The more data one trains on, the greater the required model capacity. This is the sense in which they are inefficient.

This inefficiency is not, in and of itself, a bad thing. As we have seen, DL methods have been incredibly successful at addressing problems such as image classification, protein folding, materials design, etc. that were previously regarded as intractable. When used in such applications, they could be trained at very reasonable cost in computation and energy. It was definitely not necessary to occupy a 10,000-GPU datacenter for months to train a high-quality image classifier, for example. On the other hand, this is exactly the computational scale required to train an LLM on corpuses of text. Why the difference?

The answer, in my opinion, is that the distribution of text tokens from human language is vastly more complex than any other data distribution that has ever been modeled using DL methods. Billions of tokens of text are required in order to begin to capture the regularities and peculiarities and sheer chaos of human expression. This is larger, by several orders of magnitude, than, for example, the number of photographic images required to train an image classifier. And this being the case, the model capacity of LLMs has of necessity grown to unprecedented size (hundreds of billions, or even trillions of parameters), in order to maintain overparametrization.

But training models of this sort of capacity requires vastly more compute and energy. Hence, hyperscaling.
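A back-of-envelope sketch shows why this bites so hard. It rests on two assumptions that are not from this post: that model capacity is kept proportional to corpus size, at a commonly quoted rule-of-thumb ratio of about 20 training tokens per parameter, and that training costs roughly 6 floating-point operations per parameter per token, another widely used approximation. Neither number should be read as exact.

```python
TOKENS_PER_PARAM = 20.0        # assumed rule-of-thumb ratio of tokens to parameters
FLOPS_PER_PARAM_TOKEN = 6.0    # assumed approximate training cost per parameter per token

def required_params(n_tokens):
    """Model capacity needed if capacity must grow in proportion to the corpus."""
    return n_tokens / TOKENS_PER_PARAM

def training_flops(n_tokens):
    """Approximate training compute: cost per (parameter, token) pair, times both."""
    return FLOPS_PER_PARAM_TOKEN * required_params(n_tokens) * n_tokens

for n_tokens in (1e9, 1e10, 1e11, 1e12, 1e13):
    print(f"{n_tokens:.0e} tokens -> {required_params(n_tokens):.1e} parameters, "
          f"{training_flops(n_tokens):.1e} training FLOPs")
```

Because capacity and corpus grow together, compute grows as the square of the corpus size: under these assumptions, ten times more text means a hundred times more training compute.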

On “Emergence”

The account that LLM practitioners give of their capacity issues is simultaneously humorous and frustrating. They do not appear to view those issues in the light of the story that I have just presented. They see no continuity with the capacity requirements characteristic of all DL. Instead, they seem to have persuaded themselves that the relationship between data corpus size (in billions of tokens) and LLM capacity (in billions of parameters) is sui generis, and amounts to some kind of natural law of computing wherein intelligent behavior “emerges” naturally in consequence of the growth of the complexity of the model. The linear relationship between corpus size and model size has even been dignified with a scientific name: the “Chinchilla Scaling Law”, named after the Chinchilla model discussed in this paper from DeepMind. The idea that there are such “scaling laws” for LLMs was originally suggested in this paper from OpenAI.

This is funny, in a grotesque sort of way. What is really happening here is that, as with all DL methods, without sufficient model capacity the model performance sucks. Then, as one grows the model capacity by adding parametrized computational elements, at some point the model becomes trainable, and begins to suck less. This “point of sucking less” is what LLM developers, in a brilliant stroke of marketing, have chosen to call “emergence”. And the advent of emergence is purportedly predicted by the scaling laws, which are felt to hold some deep significance about the nature of intelligence, and whose discovery rivals that of Newton’s law of gravitation in scientific importance.

This is, of course, pure horseshit. It is indicative of the corrupted state of the science of machine learning under the influence of the business imperatives of the Tech industry. A certain answer is required to be true by the industry: AGI is here, or, at least, nigh, which we know because of these “scientific discoveries”. The scaling law points the way to AGI: we will get there through larger models, more compute, higher CapEx. “Science says” that Hyperscaling will bring about AGI. And AGI will be so great that markets will emerge to absorb the $5T in investment required to get there.

This is a ridiculous story, since there is no credible scientific support for any of these claims. It’s a wild adventure in Madness of Crowds capitalism that cannot possibly end well.

What Have We Really Gained With LLMs?

The advent of the transformer architecture in 2017 represented a true breakthrough in natural language processing (NLP), one worthy of the highest scientific praise. Not because of any bullshit about AGI or “emergence” or scaling laws, but because transformers have demonstrated that natural language can be modeled at all. It was not clear prior to the transformer that this was true, or at least that it might ever be possible to model the distribution of human language text on a realistic computer.

There are two types of scientific impossibilities. The first type is that of things that are simply straight-up impossible so far as our current scientific understanding is concerned (faster-than-light travel, for example). The second type consists of things that, while not impossible in principle, are impossible in practice, because if you tried to accomplish them you would bankrupt the World in the attempt and still fail (human spacefaring travel to nearby stars is in this category).

Some “type 2 impossible” problems have to do with whether something is or is not amenable to computational modeling. There are many problems that have known scientific principles, but which we do not expect to ever be able to model on a real computer. For example, first-principles numerical modeling of strongly turbulent fluid flow is just one such “type 2 impossible” problem, because no computer we can imagine building would have the capacity to resolve all the physical length scales required for such a simulation.

As to NLP, it has always been clear that language has patterns, and that text corpuses issue from some complicated distribution. It was not clear until 2017, however, that computational modeling of that distribution was not “type 2 impossible”. Now we know that it is in fact possible. That is a real scientific accomplishment.

Unfortunately, the realization of such models using transformers comes at a very daunting cost in compute, power, and capital. This is especially true now that the subject has become entangled with a totally unrealistic quest for a goal (AGI, or, at least, artificial reasoning) that is almost certainly not available along the current technological path. This cannot go on forever. And as economists are fond of saying, if something cannot go on forever, it will stop.

Are There Alternatives To Transformers?

Transformer-based LLMs have exhausted their usefulness as research tools. It is past time to start looking for alternatives.

A small number of leading researchers are already unhitching their wagons from the LLM caravan. Yann LeCun, one of the most celebrated DL scientists, made a splash earlier this year by parting ways with Meta to start up a research effort on an approach called “World Modeling”. Song-Chun Zhu, a renowned expert in computer vision, has returned to China after 20 years of research in the U.S., to pursue a set of approaches that he characterizes as “Small Data, Big Task” (by implication, DL approaches have it the other way around).

I think it is promising that some very bright folks are breaking away from transformers (and possibly from DL altogether). As a matter of personal outlook, I am not sanguine about purely computational approaches such as World Models and SDBT superseding LLMs. These are, as I understand matters, still computational learning approaches. As I’ve written in past posts, I feel that the most serious defect of DL approaches is that they place little value on reasoning about data distributions, while focusing too much attention on models. In effect, in light of the Statistical Learning Catechism that I’ve expounded upon, they are computational models attempting to do a statistical model’s job, and as a consequence they make inefficient use of their data. That inefficiency is tolerable for most data types of interest, but unaffordable for human language learning. And while learning human language distributions is not sufficient to model human reason, I believe that without learning human language distributions there is no chance of any kind of emulation of the human faculty of reason.

So what I’d like to see, and something I am attempting to do in my own research, is to try to introduce principled statistical models to perform language-learning tasks. I can’t tell yet whether the things that I am trying will work as well as a transformer, or even be competitive with one. However, I do think that it ought to be possible in principle to construct a model whose capacity need not scale with data corpus size, and whose parameters are fixed in number by the structure of the human language distribution that it models. Those parameters should be determined more accurately the more training data the model is fed.

To accomplish something like this, not only must one break away from transformers: it is necessary to give up on DL methods altogether, because the overparametrization required by all DL-based methods makes language learning unaffordable.

This type of approach is in a sense less ambitious than what LeCun and Zhu are attempting to bring about, because it is still strictly concerned with language modeling. But this seems to me a good way to leverage the one solid, useful result that has emerged from modern NLP: natural language can be modeled on a real computer. Now we just need to find a way to do it that doesn’t bankrupt the World.