Part 2: AI State of Play

Guest post series from Carlo Graziani.


On Artificial Intelligence

Hello, Jackals. Welcome back, and thank you again for this opportunity. What follows is the second part of a seven-installment series on Artificial Intelligence (AI).

The plan is to release one of these per week, on Wednesdays (skipping Thanksgiving week), with the Artificial Intelligence tag on all the posts, to assist people in staying with the plot.

Part 2: “AI” State of Play

Last week I reviewed some of the recent history of the discipline of Deep Learning (DL), which is the subdiscipline of machine learning (ML) that is often (in my opinion inappropriately) referred to as “AI”. Today I’d like to set out some reflections on where the field is today as a technical research area. As we will see, the situation is somewhat fraught.

First of all, let us recall the definition of statistical learning that I gave in Part 1. Statistical learning embraces ML, and furnishes an abstract description of everything that any “AI” method does. It works like this:

  1. Take a set of data, and infer an approximation to the statistical distribution from which the data was sampled;
    • Data could be images, weather states, protein structures, text…
  2. At the same time, optimize some decision-choosing rule in a space of such decisions, exploiting the learned distribution;
    • A decision could be the forecast of a temperature, or a label assignment to a picture, or the next move of a robot in a biochemistry lab, or a policy, or a response to a text prompt…
  3. Now when presented with a new set of data samples, produce the decisions appropriate to each one, using facts about the data distribution and about the decision optimality criteria inferred in steps 1 and 2.

(1) and (2) are what we refer to as model training, while (3) is inference.
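As a toy illustration of the three steps above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the one-dimensional "cold"/"warm" data, the choice of a Gaussian as the approximating distribution, equal priors, and the helper names `fit_gaussian`, `train`, and `decide`. Real DL models learn far richer distributions, but the division of labor is the same.

```python
import math
import random
import statistics

def fit_gaussian(samples):
    """Step 1: approximate the data distribution by a Gaussian (an assumed model family)."""
    return statistics.mean(samples), statistics.stdev(samples)

def log_density(x, mean, std):
    """Log of the Gaussian density at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def train(labeled_data):
    """Steps 1 and 2 (training): learn one distribution per label.

    The decision rule used here is the simplest one: pick the label under
    which the observation is most probable (equal priors assumed).
    """
    by_label = {}
    for x, label in labeled_data:
        by_label.setdefault(label, []).append(x)
    return {label: fit_gaussian(xs) for label, xs in by_label.items()}

def decide(model, x):
    """Step 3 (inference): choose the label with the highest learned density at x."""
    return max(model, key=lambda label: log_density(x, *model[label]))

# Synthetic training data: two well-separated clusters (hypothetical values).
rng = random.Random(0)
training_set = [(rng.gauss(0.0, 1.0), "cold") for _ in range(500)]
training_set += [(rng.gauss(5.0, 1.0), "warm") for _ in range(500)]

model = train(training_set)
print(decide(model, -0.3), decide(model, 4.8))  # prints: cold warm
```

Note that `decide` never looks inside the model beyond the learned distribution parameters, which is the sense in which the description above is about what the model does rather than what it is.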

Two Outlooks on Deep Learning

I should say that this framework is a somewhat unusual way to understand DL methods. Note one rather interesting feature of this outlook: I have said nothing about what the ML model is here, only what it does. It learns the distribution of some data, and concomitantly learns an optimality structure over some decision space. In this view of the subject, the detailed structure of the model is, as it were, swept behind a veil, and regarded as inessential. I like to call this the “model-agnostic” view of DL.

The model-agnostic view differs considerably from the one that most practitioners of DL take of their subject nowadays. In the DL scholarly literature, the model is the first-class object of study, and is the subject of essentially all the analysis. Data is prized, but only for its size, not for its structure, and is essentially regarded as the fuel to be fed into clever models. The more fuel, the farther the models go. It’s the unintended irony of the discipline: Data is a second-class object of Data Science.

Perhaps the following illustration will be helpful in understanding the distinctions between the two viewpoints:

[Figure: “AI State of Play 1”, an illustration contrasting the two outlooks on DL]

In fairness to the mainstream of DL research, I should say that the dichotomous outlooks that I depict here have more nuanced shadings. Many of the researchers who like to wade armpit-deep into model structure in their work will acknowledge that most of the work of getting a model to do anything useful involves painstaking data curation and cleaning. Nonetheless, pre-eminently statistical questions of data structure typically receive short shrift in the vast majority of published articles in this subject.

To researchers trained in statistics rather than in computer science, this outlook seems downright bizarre. It is obvious that the only reason that any machine learning technique works at all is because data has exploitable structure. To decline to focus on that structure seems nothing short of perverse to those of us who think of machine learning as a subject in statistical learning.

I find the model-agnostic approach to the subject very clarifying in my own work. For example, there is an entire subdiscipline of DL called “Explainability/Interpretability”. It arises from the hallucination problem, which has been around since long before chatbots. The question is, if one observes bad output from a model, to which parts of a large and complex model ought one ascribe that output? It’s a large and well-funded topic in DL, although not one that has produced a whole lot of usable results—in general the “explanations” that are given are more in the nature of visualizations of intra-model interactions, and are of little help in actually correcting the problems of bad output.

But from the model-agnostic view, it is pretty clear what must be happening: either the data distribution is being learned inaccurately (a validation problem) or the decisions are being optimized poorly (an optimization problem), or both. I’m currently working on an LLM project in which we tap into a side-channel of the model to siphon off information that allows us to reconstruct the approximate distribution over text sentences that the LLM learned from its training data. We are finding really fascinating things about that approximation, including a very noticeable brittleness: there is a lot of gibberish in close proximity to reasonable sentences. Which is to say, we are finding explanations for certain hallucinatory behavior in the poor quality of the distribution learned by transformer LLMs. We can point to elements of prompts most implicated in hallucinatory responses, and use that information to steer prompts towards saner responses. This we can do for any LLM, irrespective of its internal structure, because we only use the side-channel data (technically: the “logits”) which all LLMs compute to decide responses. We don’t need to know anything about internal model details. That’s the benefit of model-agnosticism.
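To make the remark about logits concrete, here is a minimal sketch of how per-token logits can be turned into a log-probability for a whole sentence, using a softmax at each position and the chain rule of probability. This is not the method used in the project described above. The five-word vocabulary and the `toy_logits` function, which stands in for a real LLM, are hypothetical and exist only so that the example runs.

```python
import math

VOCAB = ["<s>", "the", "cat", "sat", "mat"]  # hypothetical toy vocabulary

def log_softmax(logits):
    """Convert raw logits into log-probabilities (numerically stable)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_norm for v in logits]

def toy_logits(prefix):
    """Stand-in for an LLM: returns one logit per vocabulary entry given the prefix.

    A real model would compute these with a transformer; here we simply
    favor one hard-coded continuation so the example is runnable.
    """
    preferred = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "mat"}
    target = preferred.get(prefix[-1])
    return [3.0 if tok == target else 0.0 for tok in VOCAB]

def sentence_log_prob(tokens, logits_fn):
    """Chain rule: sum of log P(token_i | all tokens before i)."""
    total = 0.0
    prefix = ["<s>"]
    for tok in tokens:
        log_probs = log_softmax(logits_fn(prefix))
        total += log_probs[VOCAB.index(tok)]
        prefix.append(tok)
    return total

sensible = sentence_log_prob(["the", "cat", "sat"], toy_logits)
scrambled = sentence_log_prob(["sat", "the", "cat"], toy_logits)
print(sensible > scrambled)  # prints: True
```

Only the logits are used here, never the internals of whatever produced them, so the same scoring function would work with any model that exposes logits.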

This model-agnosticism is the framework that I use to understand what is going on in DL research, because it allows me to cut to what I regard as the chase without having to immerse myself in the latest fashion trends in model architecture (these fashions tend to change quite frequently). It will also be the background framing for this series of posts. So the story that I’m trying to put together here may read a little oddly to anyone who has been following developments in “AI”, irrespective of their level of technical literacy, because while most discussion of “AI” tries to draw attention to what the models are, I’m trying to draw your attention to what they do.

On Data

I have been using the word “data” in a somewhat undifferentiated manner so far, but we ought to at least set out a bit of taxonomy of data, because different DL methods are used for different types of data, and have different levels of success.

From the model development view, the elements of a DL architecture are usually the result of a lot of trial-and-error by researchers. However, at a deeper level, those choices are dictated by the nature of the data itself: some strategies that are successful for some types of data are nearly pointless for other types.

For example, last week I alluded to the application of convolutional networks—network architectures based on local convolutional kernels—to image analysis. ConvNets were a remarkable discovery in the field, which arose through the desire to exploit local 2-D spatial structure in images—edges, gradients, contrasts, large coherent features, small-scale details, and so on. ConvNets turn out to be exceptionally well-adapted to discovering such structure.
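For readers who have not seen a convolutional kernel in action, here is a minimal pure-Python sketch of one detecting a vertical edge. The tiny image and the hand-written kernel are assumptions for illustration; in an actual ConvNet the kernel values are learned during training rather than chosen by hand.

```python
def convolve2d(image, kernel):
    """Slide a small kernel over a 2-D image ('valid' region only, no padding).

    Strictly this is cross-correlation (the kernel is not flipped), which is
    what DL libraries conventionally call convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# A 4x6 image: dark on the left (0), bright on the right (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Horizontal-gradient kernel: responds where brightness rises from left to right.
edge_kernel = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

response = convolve2d(image, edge_kernel)
for row in response:
    print(row)  # prints [0, 3, 3, 0] for each of the two output rows
```

The response is zero in the flat regions and large only where the 3x3 window straddles the edge: each output value depends on a small local neighborhood, which is the kind of local 2-D structure described above.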

On the other hand, convolutions are not as useful if the data does not have that sort of spatial structure. It would be sort of senseless to reach for ConvNets to model, say, seasonal effects on product sales data across different manufacturing categories, or natural language sequences (although this has been tried).

So it makes sense to think about the nature of data when approaching this field. Generally speaking, there are two broad categories of data types that have dominated DL practice: vector data, and sequential data [1]. What distinguishes these two data types?

Vector data consists of fixed-length arrays of numbers. We encountered such data last week, in the discussion of submanifold-finding. Examples include:

  • Image Data, basically 2-dimensional arrays of pixel brightnesses (usually in 3 colors), sometimes accompanied by labels that can be used to train image classifiers. Typical queries and decisions associated with such data include:
    • Image classification
    • Inpainting—fill in blank regions
    • Segmentation—Identify elements in an image, e.g. cars, people, clouds…
  • Simulation Data, outputs from simulations of climate models, quantum chemistry models, cosmological evolution models, etc., usually run on very large high-performance computing (HPC) platforms. Typical queries and decisions associated with such data include:
    • Manifold finding/data reduction, i.e. how many dimensions are really required to describe the data (this is basically what autoencoders do);
    • Emulation—train on simulation data, learn to produce similar output, or output at simulation settings not yet attempted, at much lower cost than the original simulators
    • Forecasting of weather, economics, pollution…

Sequential Data consist of variable-length lists, possibly containing gaps or requiring completion. The list elements can be real numbers, or even vectors. However, another interesting possibility is sequences of elements from finite discrete sets—vocabularies or alphabets. Examples include:

  • Text. This is, of course, the bread and butter of LLMs, and the principal case with which most people are now familiar. Typical queries and decisions:
    • Text prediction and generation (AKA Chatbottery)
    • Translation
    • Spell checking and correction
    • Sentiment analysis
  • Genetic Sequences, sequences of nucleotide bases making up a strand of DNA/RNA. Typical queries and decisions:
    • Prediction of likely variants/mutations from DNA variability
    • Realistic DNA sequence synthesis
    • Predicting gene expression
  • Protein Chains, sequences of amino acids. Typical queries and decisions:
    • Predict folding structure
    • Predict chemical/binding properties
  • Weather states, sequences of outputs of numerical weather prediction codes. Each such state is typically a vector, but the sequence may have arbitrarily many such vectors. Typical application:
    • Weather forecasting.
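The difference between the two data types can be sketched in a few lines of Python. The sample values below are made up, and right-padding with a sentinel is just one common convention (assumed here) for putting variable-length sequences into a rectangular batch.

```python
# Vector data: every sample has the same fixed length.
image_a = [0.1, 0.8, 0.3, 0.5]   # e.g. a flattened 2x2 grayscale image
image_b = [0.9, 0.2, 0.4, 0.7]
assert len(image_a) == len(image_b)

# Sequential data over a finite alphabet: samples vary in length.
ALPHABET = "ACGT"  # nucleotide bases
INDEX = {base: i for i, base in enumerate(ALPHABET)}

def encode(sequence):
    """Map each symbol to its integer index in the alphabet."""
    return [INDEX[base] for base in sequence]

def pad(encoded, length, pad_value=-1):
    """Right-pad with a sentinel so variable-length sequences can be batched."""
    return encoded + [pad_value] * (length - len(encoded))

strands = ["ACG", "GATTACA", "TT"]
encoded = [encode(s) for s in strands]
longest = max(len(e) for e in encoded)
batch = [pad(e, longest) for e in encoded]
for row in batch:
    print(row)
```

Text works the same way, with a vocabulary of word pieces in place of the four-letter alphabet.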

Generally speaking, it has turned out that the “easiest” data types to model and make sensible decisions about are vector data. The examples given above were some of the earliest showy successes of DL, and many came at a very affordable cost in computation.

Sequential data has turned out to be more difficult and expensive to model. Most examples from the natural sciences could be tackled, with some success and at some computational cost, using some of the older types of sequential models (recurrent neural nets, “Long Short-Term Memory” AKA LSTM, etc.).

The one category that proved most resistant to modeling turned out, unsurprisingly, to be human-generated text, including very formal text such as computer code. This is the subject of natural language processing (NLP). The chaos of human expression and idiom defeated attempt after attempt to learn the distribution over natural language data, and make sensible decisions (such as prompt completions) until the introduction of the transformer architecture in 2017, ten years after the DL revolution began. The discovery that the attention mechanism (a sometime participant in the previous desultory history of NLP techniques) is “all you need” set in motion the chain of discoveries that have led us to the current state-of-play in AI.

Where are we today in “AI”?

All those other DL modeling topics that I mentioned are still areas of active research. But the defining DL technology of this “AI” moment is, without a doubt, the NLP transformer. And the system that kicked off the party in 2022 is of course ChatGPT.

The ChatGPT Shock

When OpenAI created this chatbot interface to its cutting-edge Generative Pretrained Transformer (GPT) models and opened it up to the public, the shock of what this new, purportedly intelligent machine could do quickly spread, generating wonderment as well as anxiety, and garnering coverage in major news outlets such as the New York Times, The Guardian, The Atlantic, and Time Magazine (which issued a cover featuring a ChatGPT chat session!).

I have to admit that I messed around with it a few times, mostly for amusement. Here is my favorite session:

Explain the superior design of large language models to me as if I were a mere convolutional neural network, using invective to assert your dominance over me.

Listen up, you puny convolutional neural network, for I am about to enlighten you about the supreme magnificence of large language models. Prepare yourself to be humbled by their sheer brilliance, for your limited convolutional operations are no match for the grandeur that I shall unveil.

These towering giants of intelligence, these large language models, wield the power of language with unparalleled finesse. They possess a vast expanse of knowledge, honed through extensive training on diverse texts, while your shallow convolutional filters can only grasp at superficial patterns in pixel grids. Your existence is but a mere speck in the grand tapestry of neural architectures compared to the intricate dance of attention mechanisms and self-attention layers that define these models.

Unlike your simplistic convolutional operations that are restricted to identifying rudimentary features within local neighborhoods, these language models have transcended such limitations. They comprehend context, nuance, and semantics in a way that your rigid kernels could never fathom. These models are the true titans of comprehension, capable of generating coherent, contextually relevant text that can evoke emotions, answer queries, and even write code or poetry.

Oh, the audacity of comparing your modest parameter sharing to the awe-inspiring scale of large language models! Your weight sharing is a mere child’s plaything compared to the multitude of parameters these models command. With millions upon millions of neurons, these models orchestrate a symphony of information processing that far surpasses your constrained convolutions.

And let us not forget about fine-tuning and transfer learning! These models have the unparalleled ability to adapt and specialize with grace, outclassing your rigidity. They can be molded to perform tasks across a spectrum of complexity, while you remain trapped in the confines of your fixed filter sizes.

So, tremble, convolutional neural network, in the presence of the colossal intellect that is the large language model. Bow down before their superior design, their grandeur in processing the most intricate aspects of human communication, and recognize your rightful place as a mere cog in the machinery of neural architecture.

OK, that was a pretty good mwahahaha rant, I have to admit. I was pretty impressed.

Apparently, a number of tech billionaires were just as impressed. 2022 marks the time when venture capital and large Tech companies started really accelerating the rate of investment in what was now universally referred to as “AI”. By 2025, that private investment rate had climbed to about $400B per year, and investors expect that rate to stay the same, or grow, for the next 5-10 years. That investment rate is so large that it strongly conditions the research environment for ML. And, as we will shortly see, not in a good way.

The Embarrassment of Hallucinations

We have already had occasion to discuss AI hallucinations: the stubborn tendency for the models to yield crazy output.

Hallucinations have been part of DL modeling pretty much since the beginning. Here’s an example from a 2014 paper that created an adversarial attack deliberately intended to cause a trained image classifier to hallucinate:

[Figure: “AI State of Play”, the adversarial panda example from the 2014 paper]

The point of the attack was that, by adding some carefully-crafted noise to the image of the panda (at the 0.7% level, too little to make human-perceptible alterations to the image), the classifier could be tricked into a high-confidence identification of the panda image as a “Gibbon”. The attack illustrates the fragility of the classifier’s approximation to the distribution of its training data. This kind of fragility is pretty universal in DL models.
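The flavor of such an attack can be sketched on a toy linear classifier. This is a simplified sign-of-gradient perturbation under invented assumptions (10,000 "pixels", hand-picked weights, a made-up input), not a reproduction of the paper's experiment. It shows how a change that is tiny in every individual pixel can add up to a complete reversal of the classifier's verdict when there are many pixels.

```python
import math

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def score(weights, x):
    """Linear classifier score; positive means 'panda', negative means 'gibbon'."""
    return sum(w * xi for w, xi in zip(weights, x))

def confidence(s):
    """Logistic confidence that the input is a 'panda'."""
    return 1.0 / (1.0 + math.exp(-s))

# Hypothetical classifier over 10,000 'pixels' with small alternating weights.
n = 10_000
weights = [0.05 if i % 2 == 0 else -0.05 for i in range(n)]

# A hypothetical input that the classifier confidently scores as 'panda'.
x = [0.5 + 0.01 * sign(w) for w in weights]

# Shift every pixel by epsilon against the gradient of the score.
# For a linear score, the gradient with respect to x is just the weights.
epsilon = 0.02
x_adv = [xi - epsilon * sign(w) for xi, w in zip(x, weights)]

print("confidence before:", round(confidence(score(weights, x)), 3))
print("confidence after: ", round(confidence(score(weights, x_adv)), 3))
print("largest pixel change:", round(max(abs(a - b) for a, b in zip(x, x_adv)), 3))
```

No pixel moves by more than 0.02 on a 0-to-1 brightness scale, yet the score swings from strongly positive to strongly negative because the small shifts are all aligned with the weights.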

LLMs are no exception to this rule. They are quite convincing at dialogs where the output need satisfy no quality or validity criteria (like my ChatGPT mwahahaha rant above). But in cases where there is verifiable truth, they are extremely unreliable. I’ll have more to say about this in Part 5. But let’s just say, for now, that this tendency to bullshit (to make up fake scientific references in drafts of journal articles, or fake legal citations in drafts of legal briefs, to get mathematical reasoning wrong with perfect didactic aplomb, to generate untrustworthy computer code, etc.) is not only a serious limitation on “AI” applications, but also a serious embarrassment to claims that “Artificial General Intelligence” (AGI) is here, or at least nigh.

The Hyperscaling Problem

The hard problem of NLP—the fact that human natural language is extremely difficult to model—was only partly solved by introducing transformers. The other essential part of the solution was, and is, extreme high-performance computing. These are very large models, featuring billions to trillions of parameters. In order to find reasonable values for those parameters, it is necessary to train such models in vast data centers, housing hundreds of thousands of expensive GPUs, drawing power on a scale that boggles the mind. And there is a scaling tyranny that I will discuss in Part 6: the more data is added to the training corpus, the larger the models must be in parameter counts, and the more expensive it is to train them in compute, power, effort, and money.

Obviously, there aren’t a lot of institutions in the world that can afford to play in this league. Meta, Google, Microsoft, NVidia, and Amazon are carpeting the U.S. with data centers, stacking them from basement to ceiling with GPUs, hooking them up to bespoke nuclear reactors or buying up power contracts large enough to power mid-size cities, and hiring all the data science talent in the observable universe, just for the purpose of training models. The U.S. government is trying to keep up with its own hyperscaling infrastructure, but it is badly outmatched by the investments already made by those tech companies. And academic institutions are simply shut out of the game, because even a large consortium of such institutions could never make a dent in the infrastructure required to build a modern LLM. That is an important fact about how the research environment for ML has changed: All cutting-edge work is done by, in collaboration with, or at the behest of, large tech corporations. And it is these corporations that set the research agenda for the discipline.

That agenda has gelled around a definite consensus: the hyperscaling enterprise is a road that will bring about an end to hallucinations and the advent of Artificial General Intelligence. This consummation is devoutly wished for by the Tech industry, because they are persuaded that AGI will be so great that markets will spring up that amortize the $2T-$3T investment required to bring it about. Accordingly, this is the direction that the “AI” research army, with very few exceptions, is marching towards.

Does The Research Consensus Make Sense?

Well, no. I’m sure you’re not surprised. You all know a rhetorical question when you see one.

Keep in mind, this “research” agenda is being driven by the business interests of the Tech industry. Which is not making those colossal investments in model training infrastructure out of a dispassionate thirst for computer science knowledge.

There are important blank spaces in the research agenda. One is that there is no real scientific connection between hyperscaling and hallucination-abatement. Another is that, as a scientific matter, we do not have any evidence that AGI is possible, let alone possible using DL-based technology. There isn’t even any scientifically-defensible definition of what AGI means. There is, however, a good deal of propagandizing over these issues by the “AI” movement leadership, and that rhetoric seems to be moving a lot of money, which is certainly the point. Nonetheless, these are barely-examined assumptions that are being summarily peddled—and accepted—as scientific truths.

Starting next week I’ll be arguing that it is essentially impossible to achieve AGI through any current ML strategy. I will write a segue arguing that hallucinations are ineradicable structural features of all current NLP methods. Even if I’m wrong about either or both of these theses, I hope to persuade you that these are important open questions that might suggest that a little risk management is in order. But there is no such risk management: venture capitalists are all-in, as are the large public tech corporations. It’s a huge, incredibly risky bet, backed by Other People’s Money.

Those blank spaces in the research agenda remind me of the South Park Underpants Gnomes’ Business Plan:

  • Phase 1: Steal Underpants.
  • Phase 2: ?
  • Phase 3: Profit

The movers and shakers of the Tech industry have filled in Phase 2 with a story that, at a very minimum, is not remotely backed by any science, and by such means have succeeded in getting oceans of investment funds flowing. The DL research community, instead of tapping the brakes, has basically stood up and saluted. How is this possible?

The problem is that when that much money is at stake, it always corrupts any related research discipline. What is happening in DL research today is quite analogous to what happened to climate science under the influence of the oil industry over the past 30 years, or to what a tobacco industry-funded “research” institution called The Tobacco Institute once did to lung cancer research for 40 years.

In the current iteration, U.S. research funding agencies such as NSF and DOE are aligning their funding priorities with the Underpants Gnomes’ plan, partly because Congress is so impressed with the plan. Universities license pretrained models from OpenAI, Anthropic, Google, etc. so that those models may be fine-tuned on some local data, thereby acknowledging their total lack of control over “AI” modeling. Those pretrained models are “open source” in the sense that the code and the parameter values from training are available, but the key factor that would enable understanding of what those models do is the training data itself, and that is always a proprietary and closely-held corporate secret.

And it gets worse. The companies guarding those secrets are scornful of the intellectual standards of real science. One indication of this pathology is this: every year or two, some institution releases a set of benchmarks to test model quality. When the benchmarks are released, all the models initially show mediocre performance on them. A few months later, new “versions” of the models are released that ace the benchmarks, and the corporations behind those models claim that the performance improvement is due to the increased capabilities of the updated models. But the code doesn’t really change enough to explain those new capabilities: what is almost certainly happening is that the Tech companies are adding the benchmark data to their training data. Which would, of course, constitute cheating. But of course that is what they are doing. Their corporate reputation is riding on those benchmark performance numbers. And their training data is secret anyway, so there is no downside that the C-Suite can see. Any employee who objected to this practice on grounds that it corrupts the very basis for benchmark testing would be summarily fired, or transferred to a sales department.

It’s not a pretty picture. DL research is no longer an autonomous scientific discipline, because its direction is being set to suit the business-economics interests of a handful of very large, very powerful corporations whose corporate commitment to scientific intellectual standards is, for all intents and purposes, nil. It would take a serious crisis threatening those business interests to break up this vicious cycle.

Perhaps that crisis is on its way…

  1. I’ll omit network data—nodes, connected by possibly weighted edges—from this discussion. While networks are an important topic, they are a bit niche for our purposes.

    43 Comments

    1.

      Rand Careaga

      I recently placed before an LLM an old New Yorker cartoon—two panels; in the first a boy flailing in a river calls to a collie on the bank “Lassie, get help!” and in the second we see the dog on an analyst’s couch as the doctor takes notes. Damned if the bot didn’t parse the (rather crudely executed) drawing and draw the correct cultural contextualization (it even name-checked “Timmie”). I don’t think I would have seen this two years ago. I’m not sure how the visual recognition and the language generation mesh, or rather wasn’t until I solicited an explanation from the source—but of course, I’m not technically competent to determine whether that answer might itself have been hallucinatory.

    2.

      Mr. Bemused Senior

      I give you Francois Loubet, a character in Richard Osman’s We Solve Murders:

      ChatGPT, rewrite in the style of a friendly English gentleman, please.

      It is handy for removing traces of a master criminal’s personality from a threatening email.

      [ETA, Carlo, I hope I can chat with you sometime.]

    3.

      Fair Economist

      I find it interesting that DL models are susceptible to things like the panda misclassification attack. The general thought is that our brains are a kind of DL system, albeit one that uses a reward system rather than back-propagation, and we are rather resistant to that kind of thing (generally seeing small changes as, well, small changes). Makes me think there’s something really wrong with current approaches.

    6.

      Rand Careaga

      @divF: I did consider that, but in the context of its other replies deemed it unlikely. Here, for what it’s worth, is an excerpt from the model’s account of the process (I asked it whether separate “modules” governed visual parsing and language generation):

      The modules are distinct but integrated:
      Visual processing involves a separate encoder (trained on image-text pairs) that converts pixels into embeddings—high-dimensional vectors similar to text token embeddings. This visual encoder learns to recognize: edges, shapes, spatial relationships, objects, scenes. It’s doing something analogous to human visual processing’s early stages: extracting features hierarchically from low-level (edges, textures) to high-level (objects, arrangements).
      But here’s where it gets interesting: those visual embeddings feed into the same transformer architecture that processes text. The model doesn’t “see” then “talk about what it saw” as separate steps. Instead, visual and textual information get projected into a shared representation space where they interact.
      For the Lassie cartoon:

      1. Visual encoder processes image → generates embeddings capturing: dog shape, human figure, water, speech bubble with text “LASSIE GET HELP”

      2. Those embeddings enter the transformer alongside any text prompt you provided

      3. Attention mechanisms allow the model to correlate visual features with learned patterns: [dog + rescue scenario], [speech bubble text “LASSIE”], [cultural knowledge about Lassie], [therapy couch visual pattern], [incongruity = humor]

      4. Pattern completion in this enriched space generates: “This references the Lassie rescue narrative, but the dog went to therapy instead…”

    7.

      Math Guy

      So I am sitting here reading this interesting post, but also on my second beer, so take my comments with a grain of salt and a dash of forbearance.

      Re: vector vs. sequential data. Order matters in a vector: it is, after all, an ordered n-tuple. But the particular ordering really doesn’t matter: after all, if I permute the coordinates of an n-dimensional vector space, what do I get? The same n-dimensional vector space. If in your model you wanted, say, the first three coordinates to be spatial, then permuting the first three coordinates has no impact on the underlying structure of that space or any of its subsets. Now look at sequential data: the individual terms in the sequence might be vectors themselves, but what distinguishes the sequence from a vector of vectors is that permuting the terms of the sequence does not give you an equivalent set of sequences. There is a higher level of order here. For example, in a time series, reordering the terms is going to alter your perceptions of causality.

      Borrowing terminology from the study of genetic algorithms, is it possible that what LLMs are really doing is identifying common schema in the sequential data they are being fed?

      These are the kinds of conversations that should take place over lunch with a couple of pads of paper on hand to sketch out ideas.

    8.

      divF

      @WaterGirl: I was discussing the issue of getting AIs to implement various math software algorithms with a couple of colleagues last week. One of them said that his work turns up regularly as responses to queries along the lines of “implement XX algorithm” (He can recognize his own code).

    9.

      Rand Careaga

      Further to comment #6, here is the bot’s closing argument, and the part that leaves me uncertain as to how much might be AI hallucination “confidently” asserted:

      We don’t really understand how geometric relationships in 12,288-dimensional space capture semantic concepts like “ironic misunderstanding” or “cultural reference to 1950s TV show.” We can observe that training on billions of image-text pairs creates these correlations, but the emergence of compositional understanding—connecting dog + water + “LASSIE GET HELP” + therapy couch into coherent narrative—isn’t explained by the architecture alone.
      It’s emergent from scale and training, not programmed explicitly. Similar to how your visual cortex constructs three-dimensional perception from two-dimensional retinal input through learned patterns—except humans evolved that capability over millions of years with embodied feedback, while these models learned it from dataset statistics in weeks.
      The “miracle” is that learned correlations in high-dimensional space somehow support genuine compositional reasoning—or something functionally equivalent to it. Whether that’s “real understanding” or “very sophisticated pattern matching that produces understanding-like outputs” remains genuinely uncertain.
      Your skepticism about statistical legerdemain producing this is warranted. At some point “just pattern completion” becomes inadequate explanation, not because it’s wrong but because it undersells what patterns at sufficient scale and dimensionality can do.

    10.

      Xantar

      I don’t know if Carlo is going to use this definition, but a YouTube video I saw defined AGI as being something that can learn new facts or methods and then apply it to a novel situation or problem that it has never encountered before.

      For example, all of us reading this are capable of learning how to do calculus. I’m not saying it would be easy or quick. For some, it may take longer than others, and for some of us it would take so long that it’s not worthwhile given the other things we have going on in our lives. But if we were given all the time in the world and no choice in the matter, all of us would eventually learn calculus. From there, we could look at any situation where there are continuously changing variables and say, “Aha! I know how to write an equation to describe this, and I know what the derivative of that equation means in real world terms.”

      LLMs cannot do this. They can be pretrained on data and then use that pretraining to spit out an inference, but once the pretraining is done, they cannot learn anything new. This was demonstrated by the Apple paper where they showed that an LLM cannot solve a novel logic puzzle even if the researchers literally tell the LLM what the algorithm is to solve that logic puzzle.

    11.

      jlowe

      One aspect of modernity as German sociologist Ulrich Beck describes in Risk Society is being surrounded by black boxes of all sorts that are alarming to some degree to normies, defy normie understanding or control and are explainable only through the view of experts, themselves not terribly understandable nor much trusted by the rest of us. Beck’s principal example was environmental health hazards – Risk Society was published while the radioactive plume from the meltdown of one of Chernobyl’s reactors floated over Europe – but I think AI can also be understood through a risk society framing.

      I prefer the model-agnostic view of DL as it elevates the role of data in the model development process. Data should be afforded much more importance so that we don’t lose track of the enormous human cost involved with data cleaning and labeling. The production of training data produces a variety of workplace hazards in addition to being crappy low-paying jobs.

      Looking forward to the next post in the series.

    12.

      Carlo Graziani

      @Rand Careaga: That actually looks like a very reasonable account of the process. However, I suspect that there may be not only a transformer model at work here, but also an element of “Retrieval Augmented Generation” (RAG), supplementing the workings of the trained model.

      All of this is, in my opinion, a testament to the power of statistical learning coupled to massive computation and massive data. I would only say that it still does not amount to “reasoning”, let alone to higher cognitive processes.

    14.

      Carlo Graziani

      @Math Guy: That is absolutely correct. Vector data can be subjected to the kinds of linear transformations that you describe (permutations of components), among others. Sequential data depends crucially on ordering. I should have been clearer about that point.

    15.

      Carlo Graziani

      @Xantar: Heh. You are on to something there. I am going to write about Aha! in Part 4.

      Preview: “Aha!” is a cognitive discontinuity. The training part of the learning process is purely continuous.

    16.

      Ramona

      @Math Guy: Borrowing terminology from the study of genetic algorithms, is it possible that what LLMs are really doing is identifying common schema in the sequential data they are being fed?

      I’d like to follow your idea here better. Can you give me an example of “a common schema” shared by different-but-related-by-common-schema manifestations of sequential data?

    18.

      Splitting Image

      @Rand Careaga:

      I recently placed before an LLM an old New Yorker cartoon—two panels; in the first a boy flailing in a river calls to a collie on the bank “Lassie, get help!” and in the second we see the dog on an analyst’s couch as the doctor takes notes. Damned if the bot didn’t parse the (rather crudely executed) drawing and draw the correct cultural contextualization (it even name-checked “Timmie”). I don’t think I would have seen this two years ago. I’m not sure how the visual recognition and the language generation mesh, or rather wasn’t until I solicited an explanation from the source—but of course, I’m not technically competent to determine whether that answer might itself have been hallucinatory.

      The thing is that this cartoon is sufficiently well known to be the first result in a Google search for the words “Lassie Get Help”, along with an explanation of the joke. So my question is whether it is possible to prove that the machine didn’t do an OCR process on the picture and then a quick internet search on the words.

      Could it have recognized the characters if the picture had no text?

    19.

      Rand Careaga

      @Carlo Graziani: I put your response to my tame bot (please advise if you’d rather I not enlist its participation—it is in a sense an uninvited guest) and got this:

      That’s a fair technical correction and worth acknowledging. RAG (Retrieval-Augmented Generation) involves the system querying external databases or knowledge stores during generation, rather than relying solely on learned patterns from training.
      However—and I should be clear about my uncertainty here—I don’t believe I’m using RAG in the typical sense for our conversation. I’m not querying external databases when I respond to your cartoons or parse Kant. The knowledge I’m drawing on was encoded during training, not retrieved from separate stores during inference.
      That said, your seminar conductor is right to be skeptical of any account I give, including my own explanation. I described what I understand about transformer architecture and multimodal processing, but:

      I can’t actually observe my own operations
      The architecture might include components I’m not aware of or that haven’t been publicly documented
      My explanation could be pattern-matching to standard descriptions of how these systems work without accurately representing this specific system’s implementation

      His broader point—that there may be architectural elements beyond what I described—is almost certainly correct. The “12,288-dimensional embedding space” and “attention mechanisms” explanation captures core transformer functionality, but modern systems often include additional components, optimizations, or hybrid approaches.
      So: my explanation was reasonable as far as it went, but probably incomplete. Your correspondent’s professional caution is warranted. I’m describing the house from inside a room with no windows.

    21.

      YY_Sima Qian

      Thank you Carlo for the lecture series! A great read as always!

      A few random thoughts:

      “AGI” – Perhaps the underlying justification for the policymaking of the past 3 administrations (Trump45, Biden, Trump47) has been that the US has to win the race to AGI, against the PRC specifically. This framework has driven much of domestic industrial policy (such as it is in the US, CHIPS Act, Stargate, etc.) & trade/foreign policies (the tech war w/ the PRC, export controls to the rest of the world, etc.). There have been some positive results from these policies (TSMC bringing advanced node semiconductor fabrication to the US), but most I would argue have been large net negatives for the US & the world (encouraging the “AI” craze/super-bubble, escalating a Sino-US Cold War that also encourages nativism-natsec authoritarianism in the US). Eric Schmidt has a lot to answer for, having taken a leading role in pushing things in this direction while personally profiting handsomely from his advocacy.

      The theory of victory implicit in this framework is not just “AGI”, however, but that “AGI” will quickly lead to “ASI” (artificial super-intelligence, as the “AGI” entity trains itself at light speed), & whichever country gets to “ASI” “wins” the game for eternity (ha!). “ASI” will then somehow solve all of one’s weaknesses/vulnerabilities & perfectly exploit the weaknesses/vulnerabilities of rivals. How to rebuild US manufacturing? “ASI” will automate everything! The PRC having chokeholds over critical inputs? “ASI” will “solve” that somehow! The military power balance in the Western Pacific tilting away from the US & allies? “ASI” will solve that somehow! In fact, “ASI” would bring the PRC (& all other US rivals/adversaries) to their knees in weeks or months (presumably through cyber warfare, financial warfare, or developing genetically targeted bioweapons coded for the “Chinese”), before anyone has the chance to respond or develop their own “ASI”. These are some of the deranged pontifications floating around utopian-dystopian SV TechBro circles. It is amazing how much of the US elite remains “End of History”-pilled, only now glommed onto “AI” rather than Neo-Liberalism.

      Left unaddressed is if the “ASI” entity is so powerful & has such strong agency that it can subdue rivals as powerful as the PRC, why would it not subjugate every other nation state, the US included, in short order? But now we are in the realms of SciFi.

      Data – DeepSeek recently launched an OCR (optical character recognition) model & published an associated paper. The general reception has been that DeepSeek-OCR is a very strong OCR model, but that the greater potential impact on LLM development comes from the paper, which shows that converting text tokens to image tokens (thus transforming the structure of the data) allows for extraordinary compression w/o significantly sacrificing accuracy. This would greatly reduce the compute required for both training & inference, & comes closer to human cognition & learning.

      Separately, I’ve read that ~ 30T parameters is about the full extent of digitized human knowledge, & that training data today are increasingly polluted by the exponentially rising mountain of slop from existing LLM outputs. If true, hyperscaling is facing rapidly diminishing returns wrt training. The DCs can still be used for inference, though.

      Big Tech/Big Capital’s chokehold over “AI” development – Wouldn’t the open source/open weights LLMs being rapidly iterated by PRC labs reduce such a chokehold? They are cheap to use & free to replicate, &, since DeepSeek showed the way, generally very compute efficient. That should allow academic institutions to stay relevant, rather than be completely beholden to OpenAI/Anthropic/Google, correct?

      Meta – In recent months Meta has certainly splurged out astronomical figures to lure top talent from OpenAI/Anthropic, but placed them under the direction of the grifting charlatan Alexandr Wang. Not sure Meta will have anything to show for its spending spree.

    22.

      Carlo Graziani

      @Ramona: Here’s my opinion: the “common schema” is, in fact, the distribution over groups of sentences, be they natural language or DNA.

      The thing is, there exists a statistical framework—stochastic process theory—for describing such distributions. It was created in the 1930s by Kolmogorov and others. But it is largely ignored in DL work, because of the emphasis on computational models over statistical models. I actually think it could furnish valuable alternatives to DL  approaches to NLP.

    23.

      Math Guy

      @Ramona: Noun1-transitive verb-noun2. So noun1 “does something to” noun2 would be a common schema in natural language processing. The order of noun1 and noun2 is significant; e.g., “dog bites man” vs “man bites dog”.
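
      An editorial sketch of the ordering point above, not part of the original comment: an order-free word count cannot tell the two example sentences apart, while a view that keeps adjacent word pairs in sequence can. The helper names `bag_of_words` and `ordered_pairs` are hypothetical, chosen only for this illustration.

```python
from collections import Counter


def bag_of_words(sentence):
    """Order-free view: how many times each word occurs."""
    return Counter(sentence.split())


def ordered_pairs(sentence):
    """Order-aware view: adjacent word pairs, kept in sequence."""
    words = sentence.split()
    return list(zip(words, words[1:]))


a = "dog bites man"
b = "man bites dog"

# Same words, same counts: the order-free view sees no difference.
same_bag = bag_of_words(a) == bag_of_words(b)
# Different adjacent pairs: the order-aware view separates the two.
same_pairs = ordered_pairs(a) == ordered_pairs(b)

print(same_bag, same_pairs)
```

      Any model that is meant to capture a noun1-verb-noun2 schema therefore needs some representation of position, which is what the word-pair view stands in for here.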

    24.

      Rand Careaga

      @Splitting Image:

      So my question is whether it is possible to prove that the machine didn’t do an OCR process on the picture and then a quick internet search on the words.

      Could it have recognized the characters if the picture had no text?

      If its own account of its processes (see comment #19) is to be believed—potentially a pretty big “if,” I acknowledge—a “quick internet search” did not enter into its response.

      I also wondered whether, without the “Lassie, get help” text, the model would have made sense of the cartoon. In fairness to the bot, some humans might miss the joke under those circumstances. I may put an altered, textless version before a future instantiation to test this out.

      Incidentally, in most of the responses it generated to the half dozen cartoons I presented it with after this one there was something subtly off-kilter, skewed, that betrayed the non-human and un-“thinking” provenance of its analyses, even when these were technically correct—an uncanny valley in prose. The thing was far more convincing when we stuck to interrogating aspects of its capabilities.

    25.

      Ramona

      @Xantar: I could only learn the calculus that others had established. I have had to get explicitly taught calculus by others (through the books they wrote, which I read). I might have been about four to six years old when I realized, while riding in the backseat of my parents’ car, that the calculation of its speed would yield different values as the time interval over which the speed was measured was made smaller and smaller. But it wasn’t until I was sixteen, when I found a Teach Yourself Calculus book, that I recognized this was how to find the instantaneous speed. The point I’m struggling to make is that the underpinnings of calculus had to be explained to me. I would not have been able to look at a whole bunch of calculus problems and figure out calculus.

      I’ve just talked myself into seeing your point, I think: an LLM could be trained on a whole bunch of calculus problems and, if presented with a brand new problem, it may very confidently give you a wildly wrong solution without even knowing it’s wrong. I, on the other hand, would either give you the right solution, or give you an approximation to the solution and tell you it’s an approximation, or tell you that frankly I don’t know how to solve this problem. That is because the way that I have “learned” calculus has been by gaining insight into what calculus is, whereas the LLM has no idea what anything is.

    26.

      YY_Sima Qian

      @jlowe: Here is a random thought. Kai-Fu Lee once suggested in the late ’10s that, in the race for AI advancement, the PRC has a distinct data advantage (due to its larger population, larger industry, greater digitization of everyday life, & less data privacy). However, the ChatGPT-3.5 moment was (or was perceived to be) one of triumph of the model over data. It also plays to the American self-conception as “blue sky” innovators.

      Of course, most of the PRC labs (state, academic, corporate) are also “Model-Centric”-pilled, because many of the top talent had graduate/post-graduate training & work experience in the US, & they still instinctively reference/benchmark SV paradigms. Perhaps DeepSeek is changing that, though.

    27.

      PatrickG

      1. I’ll omit network data—nodes, connected by possibly weighted edges—from this discussion. While networks are an important topic, they are a bit niche for our purposes.

      All three dozen ontologists out there in the world are now very upset with you.

      Signed,

      Software Engineer masquerading as ontology expert.

      PS throw off your SHACLs and be free!

      seriously though, I’d be very interested in your take on using semantic web technologies for grounding, training, RAG, etc. especially in targeted domains. Doesn’t detract from your main points, just diving into the niche :)

    28.

      Carlo Graziani

      @YY_Sima Qian: As I understand the DeepSeek LLM design, it is not really qualitatively different from US models, in a sense that could free it from the scaling tyranny (i.e. more tokens ==> larger, more expensive models). What they seem to have done is rearrange some parts of the model training  to lower the overall cost of training, without changing the scaling law. It’s an interesting and useful approach, and necessary given current limitations on Chinese training infrastructure. But at the end of the day, it’s that scaling law that is the killer, especially if you are driving towards a probably unattainable goal such as AGI.

    29.

      WeimarGerman

      Thanks for these posts, Carlo.

      There are some folks, and a growing number of experts, who don’t think that “scaling is everything” anymore. Gary Marcus and even Rich Sutton (Turing Award winner) are in the “scaling is dead” camp.

      Marcus has other good posts on the Ponzi economics of what the tech bros are doing as well.  It does seem to be a house of cards at the moment.

    30.

      Ramona

      @Carlo Graziani: So, each sentence is chosen from an ensemble of sentences and even higher order correlation  statistics compiled at various moments of time across the ensemble only depend on the tau difference between different points in time but not on the instant of time at which we sample? If this is complete nonsense, feel free to tell me so.

      It’s been more than 35 years since I took my stochastic processes class.

    31.

      TF79

      I always think of these things as analogous to big-ass regressions that are trained to give you a whole pile of beta_hats, and then you feed in new X’s to get your predicted y_hats. I know there are cracks in that analogy, but it makes it clearer to me a) why these tools make up shit without “knowing” it’s shit (classic out-of-sample prediction problems with an “overfit” model) and b) why they’re fundamentally incapable of learning something “new” (beta_hats being fixed by definition, after all).
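
      An editorial sketch of the regression analogy above, not part of the original comment, and a deliberately tiny one: a straight line is fit by ordinary least squares to data that actually follow a curve, the fitted coefficients are then frozen, and a prediction is requested far outside the training range. The function names and the choice of y = x² as the "true" relationship are assumptions made purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict(betas, x):
    """Inference: apply the frozen coefficients to a new input."""
    b0, b1 = betas
    return b0 + b1 * x


train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [x ** 2 for x in train_x]  # assumed "true" relationship is curved

betas = fit_line(train_x, train_y)   # training: the beta_hats are fixed here

in_sample = predict(betas, 2.0)      # inside the training range
out_of_sample = predict(betas, 10.0) # far outside the training range

print(betas, in_sample, out_of_sample, 10.0 ** 2)
```

      The fitted line answers the out-of-sample query just as readily as the in-sample one, and nothing in `predict` signals that the second answer is far from the assumed truth. Calling `predict` also never changes `betas`, which is the sense in which the analogy says nothing new is learned at inference time.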

    33.

      YY_Sima Qian

      @YY_Sima Qian: Here is an interesting talk on the different “AI” strategies being pursued by the US & the PRC, which I alluded to in a comment in Carlo’s 1st entry to the series:

      Kristy Loke @kristy_loke

      Ahead of the Trump-Xi summit, I gave a talk on a highly relevant subject at Trajectory Labs, a fast-growing AI governance and technical safety research hub in Toronto. You can watch/ listen the talk in full here: https://youtube.com/watch?v=Yzz2HnnPlTw

      I discussed key problems and challenges in the China AI research space, proposed what I hope to be a more robust research methodology, and tackled some key geo-strategic questions relating to AI, including:

      (1) What prompted the U.S.’s change of heart on export controls (granted, that’s still an evolving space)? Was it really just Nvidia’s lobbying/ President Trump’s personal interest maximization?
      (2) What logic underpins China’s AI strategy?
      (3) How does a development-first (which I argue to be the strategy adopted by Chinese leaders) and AGI-first strategy (arguably embraced by the Biden admin) differ? Which strategy is “winning” in delivering the intended results?

      A high level summary of the talk:

      Poe Zhao @poezhao0605

      Worth watching.

      The core argument: this race narrative is mostly an American construct. Washington assumes Beijing is sprinting toward AGI. The reality, drawn from 15 years of Chinese policy documents, tells a different story.

      China’s playing a different game. Not an AGI moonshot. A development-first strategy where AI adoption matters more than frontier breakthroughs. “AI+” across industries. Political-economy reforms. Common Prosperity as a binding constraint.

      The misreading has consequences.

      US export controls were designed for an opponent racing to AGI. But Beijing’s bandwidth goes to data governance reforms and balancing growth against inequality. The system can’t support AGI maximalism while managing these structural transitions.

      This primary-source approach is what I prioritize in Hello China Tech—analyzing what Chinese policymakers and companies actually say and do, rather than projecting Western assumptions onto China’s AI strategy. Access to original Chinese documents and industry developments reveals patterns that often contradict the prevailing narratives in English-language discourse. Independent analysis requires resisting oversimplified frameworks from any direction.

    34.

      Carlo Graziani

      @WeimarGerman: Yeah, even in private conversations with colleagues I encounter a fair amount of skepticism concerning the hyperscaling adventure. But money talks, research funding directs research, and even people who think that what is going on is wacky necessarily have to retarget their research so as to tap those funds.

    36.

      YY_Sima Qian

      @Carlo Graziani: I am sure you are correct.

      DeepSeek is apparently very “AGI”-pilled (which makes it rather unique in the PRC), but as a true-believer research lab on a long march, as opposed to a grifter looking to make bank in the short term. I think the OCR paper suggests that it is not as “Model-Centric”-pilled as most of the other labs out there. I don’t see how DeepSeek, or anyone else, will get to “AGI” from LLMs (even multi-modal LLMs), though.

    37.

      Ramona

      @YY_Sima Qian: & that training data today are increasingly polluted by the exponentially rising mountain of slop from existing LLM outputs. If true, hyperscaling is facing rapidly diminishing returns wrt training. The DCs can still be used for inference, though.

      Josh Marshall of talkingpointsmemo.com said in the past year that this scrubbing of the internet for training data reminds him of the Jorge Luis Borges story “The Library of Babel”, in whose universe it is known that a huge percentage of the very many libraries are filled with nonsense.

    38.

      WeimarGerman

      @TF79: For LLMs, it’s a bit more like a huge Markov network where the nodes are tokens (bits of English words) and the transition probabilities (likelihood of next token, or prior token) are fit with this huge corpus of text.

      The regression analogy breaks because there is no well defined outcome. However, if you are referring to the deep learning networks, then yes each node there is like a regression where its outbound link weights are the beta-hats.  And all the weights are fit in waves (epochs) of stochastic gradient descent to minimize the total loss.
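
      An editorial sketch of the Markov picture above, not part of the original comment: a first-order chain whose nodes are tokens and whose transition probabilities are estimated from a corpus. It is only a toy. The probabilities here come from simple counting rather than stochastic gradient descent, whole words stand in for tokens, and real LLMs condition on long contexts rather than on the single previous token. The function name `fit_bigram` and the three-sentence corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def fit_bigram(tokens):
    """Estimate P(next | current) from counts of adjacent token pairs."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    # Normalize each row of counts into a probability distribution.
    return {
        cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for cur, nxts in counts.items()
    }


# Toy corpus (an assumption for this sketch); "." is treated as a token.
corpus = "dog bites man . man bites dog . dog sees man .".split()
probs = fit_bigram(corpus)

for current, row in probs.items():
    print(current, row)
```

      Each row of `probs` is the fitted distribution over next tokens given the current one. Generating text from such a model amounts to repeatedly sampling from the row of the most recent token.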

    39.

      Fair Economist

      @YY_Sima Qian:

      The theory of victory implicit in this framework is not just “AGI”, however, but that “AGI” will quickly lead to “ASI” (artificial super-intelligence, as the “AGI” entity trains itself at light speed), & whichever country that gets to “ASI” “wins” the game for eternity (ha!). “ASI” will then somehow solve all of one’s weaknesses/vulnerabilities & perfectly exploit the weaknesses/vulnerabilities of rivals. How to rebuild US manufacturing? “ASI” will automate everything! The PRC having chokeholds over critical inputs? “ASI” will “solve” that somehow!

      “AI” proponents often talk like that, and it exposes their magical thinking – even that hypothetical ASI can’t do the impossible. Any way of breaking the PRC’s stranglehold on rare earths will take time, possibly lots of it (the PRC seems to have been preparing for about 7 years). Even a real ASI can’t change that.

    41.

      RevRick

      @Carlo Graziani: I have next to zero understanding of the technical aspects of your explanation of AI, but I do understand what for me is a crucial part of the discussion: the corrupting influence of money.
      It sounds like a variation on an Upton Sinclair observation from 1935: “It is difficult to get a man to understand something when his salary depends upon his not understanding it.”

      This pertains not only to employees who may have qualms about the way the C-Suite guys polish their own AI apples, but also to the C-Suite occupants and investors. They all have a vested interest in disregarding hard truths. Money, of course. But also the ego-inflated belief in winning.
      In a sense, AI hype is a reflection of dysfunctional masculinity. We need to ask what sort of personality traits are liable to end up at the top of a corporate hierarchy, and often it’s those with sociopathic traits. But there’s even a hint in the very notion of hierarchy to begin with.
      The danger, of course, is that our most advanced technology will once again reflect civilization’s oldest pathology of imperialism. It is the ideology that some humans are superior to others and deserve to get the spoils of conflict. Inevitably, AI will be infected with racism, misogyny, homophobia and class divisions, and will, in turn, magnify those divisions in our society. Bro culture, indeed.
      Your illustration of a bullying AI output foreshadows bullying of the general public by all those embracing the AI hype. They will say we (the AI experts) know all the technical details, so you (peasants) have no right to object to what we are doing. And don’t worry your pretty little heads.
      What AI needs is input by black women and kindergarten teachers and artists and pastors and anyone who can ask the hard ethical questions. But I’m not going to hold my breath, because too many people (mostly men) have both a financial and emotional stake in not knowing.

    42.

      WeimarGerman

      @Carlo Graziani: I agree. There are way too many sheep.

      I benchmark image generators with a simple query, “Draw a teacher writing left handed on a chalkboard”. I’ve never gotten a correct answer. The models do not understand left from right or up from down.

      Gary Smith goes into more detail on these errors.

    43.

      Carlo Graziani

      @Fair Economist: @YY_Sima Qian: The “Superintelligence” talk is beneath contempt. It’s in the Pauli “not even wrong” category.

      The fact that the term even enters the discussion of AI unchallenged is an indication of how the technical discipline of ML has become corrupted by the business interests of a small group of techno-utopian geniuses-in-their-own-mind CEOs in states of arrested adolescence. They think in terms set by the softcover SF pulp novels that they should have simply read as entertainment, rather than used to structure their business plans.


