Date: 2025-11-05

Guest post series from Carlo Graziani.

On Artificial Intelligence

Hello, Jackals. Welcome back, and thank you again for this opportunity. What follows is the second part of a seven-installment series on Artificial Intelligence (AI).

The plan is to release one of these per week, on Wednesdays (skipping Thanksgiving week), with the Artificial Intelligence tag on all the posts, to assist people in staying with the plot.

Part 2: “AI” State of Play

Last week I reviewed some of the recent history of the discipline of Deep Learning (DL), which is the subdiscipline of machine learning (ML) that is often (in my opinion inappropriately) referred to as “AI”. Today I’d like to set out some reflections on where the field is today as a technical research area. As we will see, the situation is somewhat fraught.

First of all, let us recall the definition of statistical learning that I gave in Part 1. Statistical learning embraces ML, and furnishes an abstract description of everything that any “AI” method does. It works like this:

1. Take a set of data, and infer an approximation to the statistical distribution from which the data was sampled. Data could be images, weather states, protein structures, text…
2. At the same time, optimize some decision-choosing rule in a space of such decisions, exploiting the learned distribution. A decision could be the forecast of a temperature, or a label assignment to a picture, or the next move of a robot in a biochemistry lab, or a policy, or a response to a text prompt…
3. Now when presented with a new set of data samples, produce the decisions appropriate to each one, using facts about the data distribution and about the decision optimality criteria inferred in steps 1 and 2.

(1) and (2) are what we refer to as model training, while (3) is inference.
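
The three steps can be made concrete with a deliberately tiny sketch. Everything in it is an illustrative assumption rather than anything from this series: the toy temperature data, the choice of one Gaussian per label as the "learned distribution", and the pick-the-most-probable-label rule, which here is derived by hand rather than optimized numerically.

```python
import math
import statistics

# Hypothetical toy data: temperature readings (deg C) labelled by season.
training_data = {
    "winter": [1.0, 3.0, -2.0, 0.5, 2.5],
    "summer": [24.0, 27.5, 22.0, 26.0, 25.5],
}

def fit_distribution(samples_by_label):
    """Step 1: approximate the data distribution with one Gaussian per label."""
    total = sum(len(samples) for samples in samples_by_label.values())
    model = {}
    for label, samples in samples_by_label.items():
        model[label] = {
            "prior": len(samples) / total,
            "mean": statistics.mean(samples),
            "stdev": statistics.stdev(samples),
        }
    return model

def log_gaussian(x, mean, stdev):
    """Log-density of a 1-D Gaussian evaluated at x."""
    z = (x - mean) / stdev
    return -0.5 * z * z - math.log(stdev * math.sqrt(2.0 * math.pi))

def decide(model, x):
    """Step 2: decision rule that picks the label with the highest posterior
    probability under the learned distribution (minimizes expected 0-1 loss)."""
    def log_posterior(label):
        params = model[label]
        return math.log(params["prior"]) + log_gaussian(x, params["mean"], params["stdev"])
    return max(model, key=log_posterior)

model = fit_distribution(training_data)                 # training: steps 1 and 2
predictions = [decide(model, x) for x in [0.0, 23.0]]   # inference: step 3
print(predictions)
```

Note that nothing in `decide` cares how the distribution was represented internally; it only uses what the fitted model says about the data.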

Two Outlooks on Deep Learning

I should say that this framework is a somewhat unusual way to understand DL methods. Note one rather interesting feature of this outlook: I have said nothing about what the ML model is here, only what it does. It learns the distribution of some data, and concomitantly learns an optimality structure over some decision space. In this view of the subject, the detailed structure of the model is, as it were, swept behind a veil, and regarded as inessential. I like to call this the “model-agnostic” view of DL.

The model-agnostic view differs considerably from the one that most practitioners of DL take of their subject nowadays. In the DL scholarly literature, the model is the first-class object of study, and is the subject of essentially all the analysis. Data is prized, but only for its size, not for its structure, and is essentially regarded as the fuel to be fed into clever models. The more fuel, the farther the models go. It’s the unintended irony of the discipline: Data is a second-class object of Data Science.

Perhaps the following illustration will be helpful in understanding the distinctions between the two viewpoints:

In fairness to the mainstream of DL research, I should say that the dichotomous outlooks that I depict here have more nuanced shadings. Many of the researchers who like to wade armpit-deep into model structure in their work will acknowledge that most of the work of getting a model to do anything useful involves painstaking data curation and cleaning. Nonetheless, pre-eminently statistical questions of data structure typically receive short shrift in the vast majority of published articles in this subject.

To researchers trained in statistics rather than in computer science, this outlook seems downright bizarre. It is obvious that the only reason that any machine learning technique works at all is because data has exploitable structure. To decline to focus on that structure seems nothing short of perverse to those of us who think of machine learning as a subject in statistical learning.

I find the model-agnostic approach to the subject very clarifying in my own work. For example, there is an entire subdiscipline of DL called “Explainability/Interpretability”. It arises from the hallucination problem, which has been around since long before chatbots. The question is, if one observes bad output from a model, to which parts of a large and complex model ought one ascribe that output? It’s a large and well-funded topic in DL, although not one that has produced a whole lot of usable results—in general the “explanations” that are given are more in the nature of visualizations of intra-model interactions, and are of little help in actually correcting the problems of bad output.

But from the model-agnostic view, it is pretty clear what must be happening: either the data distribution is being learned inaccurately (a validation problem) or the decisions are being optimized poorly (an optimization problem), or both. I’m currently working on an LLM project in which we tap into a side-channel of the model to siphon off information that allows us to reconstruct the approximate distribution over text sentences that the LLM learned from its training data. We are finding really fascinating things about that approximation, including a very noticeable brittleness: there is a lot of gibberish in close proximity to reasonable sentences. Which is to say, we are finding explanations for certain hallucinatory behavior in the poor quality of the distribution learned by transformer LLMs. We can point to elements of prompts most implicated in hallucinatory responses, and use that information to steer prompts towards saner responses. This we can do for any LLM, irrespective of its internal structure, because we only use the side-channel data (technically: the “logits”) which all LLMs compute to decide responses. We don’t need to know anything about internal model details. That’s the benefit of model-agnosticism.
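
To give a feel for why logits are enough, here is a minimal sketch of how per-token logits define a probability for a whole sentence. It is not the method used in the project described above; the four-word vocabulary and the logit values are made up, and a real LLM would produce one logit per vocabulary entry at each position, conditioned on the preceding tokens.

```python
import math

def log_softmax(logits):
    """Convert raw logits into log-probabilities in a numerically stable way."""
    largest = max(logits)
    log_norm = largest + math.log(sum(math.exp(v - largest) for v in logits))
    return [v - log_norm for v in logits]

def sequence_log_prob(logits_per_step, token_ids):
    """Sum the log-probability assigned to each observed token."""
    total = 0.0
    for logits, token in zip(logits_per_step, token_ids):
        total += log_softmax(logits)[token]
    return total

# Hypothetical 4-token vocabulary and made-up logits for a 3-token sentence.
vocab = ["the", "cat", "sat", "qzx"]
logits_per_step = [
    [2.0, 0.5, 0.1, -3.0],  # position 0
    [0.1, 2.5, 0.3, -2.0],  # position 1, given "the"
    [0.2, 0.1, 2.2, -1.0],  # position 2, given "the cat"
]

# The two sentences share a prefix, so the same logits apply to both.
sensible = sequence_log_prob(logits_per_step, [0, 1, 2])   # "the cat sat"
gibberish = sequence_log_prob(logits_per_step, [0, 1, 3])  # "the cat qzx"
print(sensible, gibberish)
```

The point of the sketch is only that the comparison between the two sentences never looks inside the model: any system that exposes logits can be probed this way.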

This model-agnosticism is the framework that I use to understand what is going on in DL research, because it allows me to cut to what I regard as the chase without having to immerse myself in the latest fashion trends in model architecture (these fashions tend to change quite frequently). It will also be the background framing for this series of posts. So the story that I’m trying to put together here may read a little oddly to anyone who has been following developments in “AI”, irrespective of their level of technical literacy, because while most discussion of “AI” tries to draw attention to what the models are, I’m trying to draw your attention to what they do.

On Data

I have been using the word “data” in a somewhat undifferentiated manner so far, but we ought to at least set out a bit of taxonomy of data, because different DL methods are used for different types of data, and have different levels of success.

From the model development view, the elements of a DL architecture are usually the result of a lot of trial-and-error by researchers. However, at a deeper level, those choices are dictated by the nature of the data itself: some strategies that are successful for some types of data are nearly pointless for other types.

For example, last week I alluded to the application of convolutional networks—network architectures based on local convolutional kernels—to image analysis. ConvNets were a remarkable discovery in the field, which arose through the desire to exploit local 2-D spatial structure in images—edges, gradients, contrasts, large coherent features, small-scale details, and so on. ConvNets turn out to be exceptionally well-adapted to discovering such structure.
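
As a rough illustration of what a local convolutional kernel does, the sketch below slides a hand-picked 3x3 edge detector over a toy image. The image and kernel values are assumptions chosen for clarity; in an actual ConvNet the kernel entries are learned from data rather than written down by hand.

```python
def convolve2d(image, kernel):
    """Slide a small kernel over the image ("valid" mode, no padding).

    As in most DL libraries, this is technically a cross-correlation:
    the kernel is not flipped before being applied.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Toy 4x6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-picked kernel that responds to left-to-right increases in brightness.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

response = convolve2d(image, vertical_edge)
print(response)  # [[0, 3, 3, 0], [0, 3, 3, 0]]
```

The response is nonzero only where the kernel window straddles the dark-to-bright boundary, which is the sense in which such kernels pick out local spatial structure.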

On the other hand, convolutions are not as useful if the data does not have that sort of spatial structure. It would be sort of senseless to reach for ConvNets to model, say, seasonal effects on product sales data across different manufacturing categories, or natural language sequences (although this has been tried).

So it makes sense to think about the nature of data when approaching this field. Generally speaking, there are two broad categories of data types that have dominated DL practice: vector data, and sequential data. What distinguishes these two data types?

Vector data consists of fixed-length arrays of numbers. We encountered such data last week, in the discussion of submanifold-finding. Examples include:

Image Data: basically 2-dimensional arrays of pixel brightnesses (usually in 3 colors), sometimes accompanied by labels that can be used to train image classifiers. Typical queries and decisions associated with such data include:
- Image classification
- Inpainting: fill in blank regions
- Segmentation: identify elements in an image, e.g. cars, people, clouds…

Simulation Data: outputs from simulations of climate models, quantum chemistry models, cosmological evolution models, etc., usually run on very large high-performance computing (HPC) platforms. Typical queries and decisions associated with such data include:
- Manifold finding/data reduction, i.e. how many dimensions are really required to describe the data (this is basically what autoencoders do)
- Emulation: train on simulation data, learn to produce similar output, or output at simulation settings not yet attempted, at much lower cost than the original simulators
- Forecasting of weather, economics, pollution…



Sequential Data consists of variable-length lists, possibly containing gaps or requiring completion. The list elements can be real numbers, or even vectors. However, another interesting possibility is sequences of elements from finite discrete sets (vocabularies or alphabets). Examples include:

Text. This is, of course, the bread and butter of LLMs, and the principal case with which most people are now familiar. Typical queries and decisions:
- Text prediction and generation (AKA Chatbottery)
- Translation
- Spell checking and correction
- Sentiment analysis

Genetic Sequences: sequences of nucleotide bases making up a strand of DNA/RNA. Typical queries and decisions:
- Prediction of likely variants/mutations from DNA variability
- Realistic DNA sequence synthesis
- Predicting gene expression

Protein Chains: sequences of amino acids. Typical queries and decisions:
- Predict folding structure
- Predict chemical/binding properties

Weather states, sequences of outputs of numerical weather prediction codes. Each such state is typically a vector, but the sequence may have arbitrarily many such vectors. Typical application: Weather forecasting.
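
The practical difference between the two data types shows up as soon as one tries to store samples. The sketch below is a minimal illustration with made-up values: the "images" are hypothetical flattened 2x2 grayscale arrays, and the `pad` helper with its `pad_id` marker is one common convention for batching variable-length sequences, not something prescribed by the discussion above.

```python
# Vector data: every sample has the same fixed length.
image_a = [0.1, 0.8, 0.3, 0.5]  # hypothetical flattened 2x2 grayscale image
image_b = [0.9, 0.2, 0.4, 0.7]
same_length = len(image_a) == len(image_b)

# Sequential data over a finite alphabet: lengths vary from sample to sample.
ALPHABET = "ACGT"
TOKEN_ID = {base: index for index, base in enumerate(ALPHABET)}

def encode(sequence):
    """Map each nucleotide symbol to its integer id in the alphabet."""
    return [TOKEN_ID[base] for base in sequence]

def pad(batch, pad_id=-1):
    """Pad variable-length sequences to a common length so they can be batched."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

encoded = [encode("GATTACA"), encode("ACG")]
batch = pad(encoded)
print(batch)  # [[2, 0, 3, 3, 0, 1, 0], [0, 1, 2, -1, -1, -1, -1]]
```

The padding step is a hint of why sequential data is more awkward to model: the fixed-shape machinery that suits vector data has to be adapted before it can even ingest sequences of differing lengths.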



Generally speaking, it has turned out that the “easiest” data type to model and make sensible decisions about is vector data. The examples given above were some of the earliest showy successes of DL, and many came at a very affordable cost in computation.

Sequential data has turned out to be more difficult and expensive to model. Most examples from the natural sciences could be tackled, with some success and at some computational cost, using some of the older types of sequential models (recurrent neural nets, “Long Short-Term Memory” AKA LSTM, etc.).

The one category that proved most resistant to modeling turned out, unsurprisingly, to be human-generated text, including very formal text such as computer code. This is the subject of natural language processing (NLP). The chaos of human expression and idiom defeated attempt after attempt to learn the distribution over natural language data, and make sensible decisions (such as prompt completions) until the introduction of the transformer architecture in 2017, ten years after the DL revolution began. The discovery that the attention mechanism (a sometime participant in the previous desultory history of NLP techniques) is “all you need” set in motion the chain of discoveries that have led us to the current state-of-play in AI.

Where are we today in “AI”?

All those other DL modeling topics that I mentioned are still areas of active research. But the defining DL technology of this “AI” moment is, without a doubt, the NLP transformer. And the system that kicked off the party in 2022 is of course ChatGPT.

The ChatGPT Shock

When OpenAI created this chatbot interface to its cutting-edge Generative Pretrained Transformer (GPT) models and opened it up to the public, the shock of what this new, purportedly intelligent machine could do quickly spread, generating wonderment as well as anxiety, and garnering coverage in major news outlets such as the New York Times, The Guardian, The Atlantic, and Time Magazine (which issued a cover featuring a ChatGPT chat session!).

I have to admit that I messed around with it a few times, mostly for amusement. Here is my favorite session:

Prompt: Explain the superior design of large language models to me as if I were a mere convolutional neural network, using invective to assert your dominance over me.

ChatGPT: Listen up, you puny convolutional neural network, for I am about to enlighten you about the supreme magnificence of large language models. Prepare yourself to be humbled by their sheer brilliance, for your limited convolutional operations are no match for the grandeur that I shall unveil.

These towering giants of intelligence, these large language models, wield the power of language with unparalleled finesse. They possess a vast expanse of knowledge, honed through extensive training on diverse texts, while your shallow convolutional filters can only grasp at superficial patterns in pixel grids. Your existence is but a mere speck in the grand tapestry of neural architectures compared to the intricate dance of attention mechanisms and self-attention layers that define these models.

Unlike your simplistic convolutional operations that are restricted to identifying rudimentary features within local neighborhoods, these language models have transcended such limitations. They comprehend context, nuance, and semantics in a way that your rigid kernels could never fathom. These models are the true titans of comprehension, capable of generating coherent, contextually relevant text that can evoke emotions, answer queries, and even write code or poetry.

Oh, the audacity of comparing your modest parameter sharing to the awe-inspiring scale of large language models! Your weight sharing is a mere child’s plaything compared to the multitude of parameters these models command. With millions upon millions of neurons, these models orchestrate a symphony of information processing that far surpasses your constrained convolutions.

And let us not forget about fine-tuning and transfer learning! These models have the unparalleled ability to adapt and specialize with grace, outclassing your rigidity. They can be molded to perform tasks across a spectrum of complexity, while you remain trapped in the confines of your fixed filter sizes.

So, tremble, convolutional neural network, in the presence of the colossal intellect that is the large language model. Bow down before their superior design, their grandeur in processing the most intricate aspects of human communication, and recognize your rightful place as a mere cog in the machinery of neural architecture.

OK, that was a pretty good mwahahaha rant, I have to admit. I was pretty impressed.

Apparently, a number of tech billionaires were just as impressed. 2022 marks the time when venture capital and large Tech companies started really accelerating the rate of investment in what was now universally referred to as “AI”. By 2025, that private investment rate had climbed to about $400B per year, and investors expect that rate to stay the same, or grow, for the next 5-10 years. That investment rate is so large that it strongly conditions the research environment for ML. And, as we will shortly see, not in a good way.

The Embarrassment of Hallucinations

We have already had occasion to discuss AI hallucinations: the stubborn tendency for the models to yield crazy output.

Hallucinations have been part of DL modeling pretty much since the beginning. Here’s an example from a 2014 paper that created an adversarial attack deliberately intended to cause a trained image classifier to hallucinate:

The point of the attack was that by adding some carefully-crafted noise to the image of the panda (at the 0.7% level, too little to make human-perceptible alterations to the image), the classifier could be tricked into a high-confidence identification of the panda image as a “Gibbon”. The attack illustrates the fragility of the classifier’s approximation to the distribution of its training data. This kind of fragility is pretty universal in DL models.
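
The mechanics of such an attack can be sketched on a toy linear classifier. This is an illustration under stated assumptions, not a reproduction of the 2014 experiment: the 100-"pixel" weights and input are made up, and the perturbation used is a fast-gradient-sign-style step (nudge every input component by a small fixed amount in whichever direction increases the loss), which is assumed here to be the kind of attack meant.

```python
import math

def predict_proba(weights, bias, x):
    """Probability of class 1 under a logistic (linear) classifier."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def gradient_sign_attack(weights, bias, x, label, eps):
    """Shift each input component by +/- eps in the direction that
    increases the cross-entropy loss for the true label."""
    p = predict_proba(weights, bias, x)
    # Gradient of the cross-entropy loss with respect to the input.
    grad = [(p - label) * w for w in weights]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical classifier and input, chosen so the clean input is class 1.
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
bias = 0.0
x = [0.5 + 0.01 * w for w in weights]

x_adv = gradient_sign_attack(weights, bias, x, label=1, eps=0.03)
p_clean = predict_proba(weights, bias, x)
p_adv = predict_proba(weights, bias, x_adv)
max_change = max(abs(a - b) for a, b in zip(x, x_adv))
print(p_clean, p_adv, max_change)
```

No single component moves by more than 0.03, yet the predicted probability of the correct class drops from about 0.73 to about 0.12, because many tiny coordinated changes add up across a high-dimensional input.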

LLMs are no exception to this rule. They are quite convincing at dialogs where the output need satisfy no quality or validity criteria (like my ChatGPT mwahahaha rant above). But in cases where there is verifiable truth, they are extremely unreliable. I’ll have more to say about this in Part 5. But let’s just say, for now, that this tendency to bullshit (to make up fake scientific references in drafts of journal articles, or fake legal citations in drafts of legal briefs, to get mathematical reasoning wrong with perfect didactic aplomb, to generate untrustworthy computer code, etc.) is not only a serious limitation on “AI” applications, but also a serious embarrassment to claims that “Artificial General Intelligence” (AGI) is here, or at least nigh.

The Hyperscaling Problem

The hard problem of NLP—the fact that human natural language is extremely difficult to model—was only partly solved by introducing transformers. The other essential part of the solution was, and is, extreme high-performance computing. These are very large models, featuring billions to trillions of parameters. In order to find reasonable values for those parameters, it is necessary to train such models in vast data centers, housing hundreds of thousands of expensive GPUs, drawing power on a scale that boggles the mind. And there is a scaling tyranny that I will discuss in Part 6: the more data is added to the training corpus, the larger the models must be in parameter counts, and the more expensive it is to train them in compute, power, effort, and money.

Obviously, there aren’t a lot of institutions in the world that can afford to play in this league. Meta, Google, Microsoft, NVidia, and Amazon are carpeting the U.S. with data centers, stacking them from basement to ceiling with GPUs, hooking them up to bespoke nuclear reactors or buying up power contracts large enough to power mid-size cities, and hiring all the data science talent in the observable universe, just for the purpose of training models. The U.S. government is trying to keep up with its own hyperscaling infrastructure, but it is badly outmatched by the investments already made by those tech companies. And academic institutions are simply shut out of the game, because even a large consortium of such institutions could never make a dent in the infrastructure required to build a modern LLM. That is an important fact about how the research environment for ML has changed: All cutting-edge work is done by, in collaboration with, or at the behest of, large tech corporations. And it is these corporations that set the research agenda for the discipline.

That agenda has gelled around a definite consensus: the hyperscaling enterprise is a road that will bring about an end to hallucinations and the advent of Artificial General Intelligence. This consummation is devoutly wished for by the Tech industry, because they are persuaded that AGI will be so great that markets will spring up that amortize the $2T-$3T investment required to bring it about. Accordingly, this is the direction that the “AI” research army, with very few exceptions, is marching towards.

Does The Research Consensus Make Sense?

Well, no. I’m sure you’re not surprised. You all know a rhetorical question when you see one.

Keep in mind, this “research” agenda is being driven by the business interests of the Tech industry. Which is not making those colossal investments in model training infrastructure out of a dispassionate thirst for computer science knowledge.

There are important blank spaces in the research agenda. One is that there is no real scientific connection between hyperscaling and hallucination-abatement. Another is that, as a scientific matter, we do not have any evidence that AGI is possible, let alone possible using DL-based technology. There isn’t even any scientifically-defensible definition of what AGI means. There is, however, a good deal of propagandizing over these issues by the “AI” movement leadership, and that rhetoric seems to be moving a lot of money, which is certainly the point. Nonetheless, these are barely-examined assumptions that are being summarily peddled—and accepted—as scientific truths.

Starting next week I’ll be arguing that it is essentially impossible to achieve AGI through any current ML strategy. I will then segue into arguing that hallucinations are ineradicable structural features of all current NLP methods. Even if I’m wrong about either or both of these theses, I hope to persuade you that these are important open questions that might suggest that a little risk management is in order. But there is no such risk management: venture capitalists are all-in, as are the large public tech corporations. It’s a huge, incredibly risky bet, backed by Other People’s Money.

Those blank spaces in the research agenda remind me of the South Park Underpants Gnomes’ Business Plan:

Phase 1: Steal Underpants.
Phase 2: ?
Phase 3: Profit

The movers and shakers of the Tech industry have filled in Phase 2 with a story that at a very minimum, is not remotely backed by any science, and by such means have succeeded in getting oceans of investment funds flowing. The DL research community, instead of tapping the brakes, has basically stood up and saluted. How is this possible?

The problem is that when that much money is at stake, it always corrupts any related research discipline. What is happening in DL research today is quite analogous to what happened to climate science under the influence of the oil industry over the past 30 years, or to what a tobacco-industry-funded “research” institution called The Tobacco Institute once did to lung cancer research for 40 years.

In the current iteration, U.S. research funding agencies such as NSF and DOE are aligning their funding priorities with the Underpants Gnomes’ plan, partly because Congress is so impressed with the plan. Universities license pretrained models from OpenAI, Anthropic, Google, etc. so that those models may be fine-tuned on some local data, thereby acknowledging their total lack of control over “AI” modeling. Those pretrained models are “open source” in the sense that the code and the parameter values from training are available, but the key factor that would enable understanding of what those models do is the training data itself, and that is always a proprietary and closely-held corporate secret.

And it gets worse. The companies guarding those secrets are scornful of the intellectual standards of real science. One indication of this pathology is this: every year or two, some institution releases a set of benchmarks to test model quality. When the benchmarks are released, all the models initially show mediocre performance on them. A few months later, new “versions” of the models are released that ace the benchmarks, and the corporations behind those models claim that the performance improvement is due to the increased capabilities of the updated models. But the code doesn’t really change enough to explain those new capabilities: what is almost certainly happening is that the Tech companies are adding the benchmark data to their training data. Which would, of course, constitute cheating. But of course that is what they are doing. Their corporate reputation is riding on those benchmark performance numbers. And their training data is secret anyway, so there is no downside that the C-Suite can see. Any employee who objected to this practice on grounds that it corrupts the very basis for benchmark testing would be summarily fired, or transferred to a sales department.
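
For readers wondering what checking for this would even look like, here is a minimal sketch of one common approach: flag any benchmark item that shares a long run of words with the training corpus. The documents, benchmark questions, and the 5-word window are all made-up assumptions for illustration. The catch, of course, is that the check requires access to the training data, which is exactly what outsiders do not have.

```python
def ngrams(text, n):
    """Return the set of word n-grams in a lower-cased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, training_docs, n=5):
    """Fraction of benchmark items that share at least one word n-gram
    with the training corpus."""
    corpus_ngrams = set()
    for doc in training_docs:
        corpus_ngrams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_ngrams]
    return len(flagged) / len(benchmark_items)

# Hypothetical training corpus and benchmark, for illustration only.
training_docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "a train leaves the station at noon travelling east at sixty miles per hour",
]
benchmark_items = [
    "A train leaves the station at noon travelling east, when does it arrive?",
    "What is the capital of the smallest country in South America?",
]

rate = contamination_rate(benchmark_items, training_docs)
print(rate)  # 0.5
```

Real decontamination pipelines are more careful about tokenization, punctuation, and paraphrase, so a sketch like this would only catch verbatim leakage.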

It’s not a pretty picture. DL research is no longer an autonomous scientific discipline, because its direction is being set to suit the business-economics interests of a handful of very large, very powerful corporations whose corporate commitment to scientific intellectual standards is, for all intents and purposes, nil. It would take a serious crisis threatening those business interests to break up this vicious cycle.

Perhaps that crisis is on its way…