Balloon Juice

Come for the politics, stay for the snark.


Science & Technology

Part 5: Hallucinations

by WaterGirl | December 3, 2025 | 7:30 pm | 78 Comments

This post is in: Artificial Intelligence, Guest Posts, Open Threads, Science & Technology

Guest Post: AI 1

Guest post series from *Carlo Graziani.

On Artificial Intelligence

Hello, Jackals. Welcome back, and thank you again for this opportunity. Being able to write these posts on AI has been very helpful to me in clarifying and sorting out my thinking on this subject, and the comments that have followed each post have been of high quality and on point, making up excellent and informative (including to me) discussions.

The plan is to release one of these per week, on Wednesdays, with the Artificial Intelligence tag on all the posts, to assist people in staying with the plot.

I had originally planned to post a high-level summary during Thanksgiving Week, to try to offer usable take-aways to people put off by my nerd-babble. After some discussion with WaterGirl, we have decided instead to leave the summary posts until after the conclusion of the series.

Part 5: Hallucinations

In November 2022, Vancouver resident Jake Moffat needed to travel to Toronto to attend his mother’s funeral. He asked an Air Canada chatbot about the terms of a bereavement fare, and the chatbot assured him, incorrectly, that according to the company’s rules he could receive the bereavement discount retroactively after traveling on the regular fare. When Air Canada denied Moffat the discount, he took the airline to British Columbia’s Civil Resolution Tribunal, which held that Air Canada was liable for its chatbot’s representations to customers on its own website, and ordered it to pay Moffat damages and legal fees.

In May 2023, a plaintiff attorney named Steven A. Schwartz filed a legal brief in the Southern District of New York containing references that the judge deemed to be “…bogus judicial decisions with bogus quotes and bogus internal citations.” Schwartz acknowledged that the source of the bogus references was in fact ChatGPT, representing to the judge that ChatGPT had, upon being questioned about the authenticity of the cases, responded that they were “real” and “can be found in reputable legal databases such as LexisNexis and Westlaw.”

In the spring of 2025, the Chicago Sun-Times published a 15-title summer reading list. Ten items on the list were made-up titles attributed to real authors.

Google’s AI Overview has recommended using non-toxic glue on pizza to help cheese stick to the pie.


I could go on, but it gets boring. Finding examples in the media of AI going off the rails in embarrassing ways is easier than finding inebriated people on Chicago streets at noon on Saint Patrick’s day. Just try a web search on “AI hallucinations”. The AI hallucination is a daily phenomenon, affecting programmers attempting to speed up their coding, scientists looking for fast ways to generate or clean up papers and proposals, and anyone in need of text that must precisely reflect some legal constraints.

AI models are also notoriously bad at mathematical reasoning, making elementary arithmetic mistakes as well as serious mathematical errors. I have prompted ChatGPT to perform a certain standard physics derivation 1 twice now, at a distance of several months, and both times I have obtained careless stupidity that no undergraduate would be capable of producing, presented with professorial polish and total didactic aplomb.

It’s fun to point and laugh, but sometimes it is no joking matter. People have received bad, even dangerous medical advice from ChatGPT. There is a high-profile effort underway to use AI to “democratize” financial advice, which is seemingly innocent of the associated risks. There’s a pending patent for “AI Traffic Control” which is exactly as terrifyingly stupid an idea as it sounds. In fact, we are living through a moment in which the Tech industry is desperately attempting to propose AI for any application offering any prospect of profitability–no firm today makes any money on AI services–so it is not surprising to see such risks minimized or hidden altogether.

To observers of this discipline, the hallucination phenomenon is a very serious problem, and is another reason to question whether “Artificial General Intelligence” (AGI) is even a remote possibility on our current technological path. Certainly it would seem that if the hallucination issue is not understood and corrected somehow, any prospective AGI will babble hilariously, and possibly dangerously, some unpredictable fraction of the time.

The Tech industry consensus on hallucinations, however, is some combination of (a) hallucinations are not really a problem, and (b) more pretraining of improved (i.e. “larger”) models with more data at higher cost in compute and power will make them go away, as AGI finally emerges. I have had conversations with people who really believe in (a) or (b), and at least one person I spoke with appeared to somehow hold both views simultaneously.

View (a) is obviously not even worth discussing, given the high stakes involved in many AI applications. What I’d like to discuss today is view (b): can we really expect bigger models trained at higher expense with more data to do away with the stubbornly persistent phenomenon of AI hallucinations?

In order to address this question, we need to understand where these hallucinations come from. For that, we first need to review what it is that LLMs do.

What Does an LLM Do?

It is helpful to recall the basic definition of statistical learning–the subject that encompasses all of AI–at this point. Here is how all this stuff works:

  1. Take a set of data, and infer an approximation to the statistical distribution from which the data was sampled;
    • In the case of LLMs, the data consists of hundreds of billions of words of text
  2. At the same time, optimize some decision-choosing rule in a space of such decisions, exploiting the learned distribution;
    • With LLMs, a decision could be a response to a text prompt, or a judgment about whether the text expresses positive or negative sentiment, or a translation to another language, etc.
  3. Now when presented with a new set of data samples, produce the decisions appropriate to each one, using facts about the data distribution and about the decision optimality criteria inferred in parts 1 and 2.

The reason that I keep bringing these up is that I find this model-agnostic view of the machine learning enterprise extremely clarifying, and helpful in directing attention towards what matters and away from irrelevant aspects of model design.

We should apply the above catechism to what LLMs do. Data from natural language text consists of sequences of words, interspersed with punctuation. LLMs learn features of the distributions over such sequences that allow them to probabilistically predict what the next response word should be, given a prompt and any response words previously supplied.

So, for example, suppose your prompt to be completed by the LLM is “Bob was nervous about his presentation to the board, despite his preparation the night before.” and the LLM completes it with “He had practiced by reading his slides and timing what he said while each one was displayed.” The LLM starts with the prompt as its context, and uses the learned distribution to compute the probability distribution of the next word. From that distribution, it samples (i.e. decides on) the word “He”. It then appends “He” to the prompt to form a new context, and calculates a new distribution for the next word. It turns out that “had” is pretty high in the probability list, and gets selected. The context is now “Bob was nervous about his presentation to the board, despite his preparation the night before. He had”. The LLM repeats the process, and probabilistically samples the word “practiced” from the new distribution. And so on.

No kidding. This is all that is going on. Next response token prediction based on the prompt and all previous response tokens. That is the entire trick. Neat, eh?
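To make the loop concrete, here is a minimal sketch in Python of the generation procedure just described. The toy "model" and its probabilities are made up, and it conditions only on the previous token; a real LLM conditions on the entire context through a transformer, but the loop itself is the same.

```python
import random

# Made-up "learned distribution": probability of the next token given only the
# previous token. A real LLM conditions on the whole context, but the
# generation loop below is identical.
TOY_MODEL = {
    "Bob":     {"was": 0.9, "is": 0.1},
    "was":     {"nervous": 0.6, "ready": 0.4},
    "nervous": {".": 1.0},
    "ready":   {".": 1.0},
    ".":       {"<eot>": 1.0},
}

def next_token_distribution(context):
    """Stand-in for the transformer forward pass: a distribution over the next token."""
    return TOY_MODEL.get(context[-1], {"<eot>": 1.0})

def generate(prompt_tokens, max_new_tokens=20):
    """Autoregressive generation: sample a token, append it, recompute, repeat."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)
        tokens, probs = zip(*dist.items())
        token = random.choices(tokens, weights=probs, k=1)[0]  # sample, don't just take the max
        if token == "<eot>":
            break
        context.append(token)
    return " ".join(context)

print(generate(["Bob"]))   # e.g. "Bob was nervous ."
```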

Back to hallucinations: there are two places to look for their origin: the approximation to the data distribution, and the next-token decisions founded on that distribution. Let’s take them in order.

Approximating the Distribution of Human Language

The reason that text comprising hundreds of billions of words is required to train an LLM is that the statistical regularities of human language are extremely complex, and not easy to capture in a principled statistical model.

Just scroll your eyes up and down this essay briefly, and then imagine figuring out the rules by which the words are juxtaposed, without being detained by trivialities such as meaning. There are rules: you rarely see the same word repeated immediately (e.g. “immediately immediately”) which is clearly a rule. There are grammar rules, and context rules. Certain clusters of words recur together in certain types of text and not in others: you will find pairings of “octopus” and “cephalopod” within a few hundred words of each other in texts from works on marine biology, but pairings such as “octopus” and “mortgage” are probably very rare. In fact, the occurrence of “octopus” in a page probably means that the probability of encountering “mortgage” within the next 1000 words is considerably reduced from the average rate of occurrence of that word, while the occurrence probability of “shark” is likely enhanced. And so on. How would one go about describing these patterns?
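As a crude illustration of the kind of regularity involved, here is a sketch that counts how often pairs of words co-occur within a short window of a toy corpus. The corpus and window size are made up, and real co-occurrence statistics require billions of words, but the pattern is the one described above: "octopus" and "cephalopod" keep company, "octopus" and "mortgage" do not.

```python
from collections import Counter

# Tiny made-up corpus; a real estimate would need billions of words.
corpus = (
    "the octopus is a cephalopod found on the reef near the shark "
    "the bank approved the mortgage on the house by the shore"
).split()

WINDOW = 8  # pair each word with the words in the next 7 positions
pair_counts = Counter()
for i, word in enumerate(corpus):
    for other in corpus[i + 1 : i + WINDOW]:
        pair_counts[tuple(sorted((word, other)))] += 1

print(pair_counts[("cephalopod", "octopus")])  # nonzero: they co-occur
print(pair_counts[("mortgage", "octopus")])    # zero: they never appear together here
```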

The approach used in natural language processing (NLP) since time immemorial is to begin by breaking down the text into tokens, then describe the text as a sequence of such tokens. This tokenization is a subtle and arcane art. You might think that it would be logical to break things down into words, numerals, punctuation, etc. While not wrong, this approach is very inefficient. The problem is that the English language (say) has about 500,000 words, which is a huge vocabulary for an LLM to manage. Vocabulary size is a critical parameter to be managed in this game, because the larger the vocabulary, the larger and more expensive the model.

On the other hand, breaking things down into individual letters is also a bad idea. While the vocabulary size is now much smaller (less than 50 for English), the token sequences are much longer, and the patterns much harder to find. The patterns are at the word level, not the letter level. It’s just that there are so many damn words!

The secret sauce is to notice that most of those half-million words are extremely rare. Studies of natural language have shown that knowing 10,000 words in any language allows one to understand 99% of texts in that language. In English, that would be 2% of the full vocabulary. Moreover, the rare words can be built up out of smaller word pieces. Identifying an optimal set of word pieces, most of which are full words in their own right, is the name of the game here. Algorithms exist that can represent all English text using 30,000 to 50,000 tokens, which is a considerable savings in vocabulary size. So tokenization is (largely) a solved problem 2.
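To show the flavor of the trick, here is a minimal sketch of greedy longest-match subword tokenization against a made-up mini-vocabulary. Real tokenizers learn their vocabularies from data (byte-pair encoding is the best-known algorithm), but the effect is the same: rare words decompose into common pieces.

```python
# Made-up mini-vocabulary of word pieces; real vocabularies hold 30,000-50,000
# entries learned from data (e.g. by byte-pair encoding).
VOCAB = {"un", "believ", "able", "token", "ization", "the", "cat"}

def tokenize(word, vocab=VOCAB):
    """Greedily split a word into the longest known pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):  # try the longest candidate first
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            return ["<unk>"]  # no known piece fits this word
    return pieces

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("tokenization"))  # ['token', 'ization']
```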

Embedding

The next thing that essentially every NLP method does with its tokenized text is a process called embedding: Each token is mapped into a vector space of dimension about 1000 (basically, each token gets described by a list of 1000 numbers) endowed with a notion of distance between points similar to the notion of distance between points in 3-dimensions. At this point, all operations on a sequence of tokens become operations on such lists of numbers. So when a transformer operates on a sequence of tokens (a sentence or a set of sentences, including previously-generated text) that sequence gets embedded in a very high-dimensional space: for example, the prompt above (“Bob was nervous…”) consists of about 17 tokens, so it is mapped to a point in an approximately 17,000-dimensional space consisting of 17 copies of the original 1000-dimensional embedding space.
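Here is a minimal sketch of the embedding step, with made-up 4-dimensional vectors (real models learn vectors of roughly 1000 dimensions for every token in the vocabulary, during training): a token sequence becomes a sequence of vectors, and distances between vectors become available as a notion of "nearness".

```python
import math

# Made-up 4-dimensional embedding vectors; a real model learns ~1000-dimensional
# vectors for every token in its vocabulary.
EMBEDDINGS = {
    "dog":      [0.9, 0.8, 0.1, 0.0],
    "sparrow":  [0.7, 0.9, 0.2, 0.1],
    "mortgage": [0.0, 0.1, 0.9, 0.8],
}

def embed(tokens):
    """Map a token sequence to a sequence of vectors (the embedded sequence)."""
    return [EMBEDDINGS[t] for t in tokens]

def distance(u, v):
    """Euclidean distance between two embedded tokens."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(distance(EMBEDDINGS["dog"], EMBEDDINGS["sparrow"]))   # small: "nearby" in the space
print(distance(EMBEDDINGS["dog"], EMBEDDINGS["mortgage"]))  # large: "far apart"
```

Keep the dog/sparrow pair in mind; that same kind of proximity returns below as a source of trouble.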

I want to draw attention to embedding for several reasons: one is that it is an essentially universal practice in NLP, preceding the invention of transformers by many years. It turns out to be much easier to model probability distributions by operating on lists of numbers than by operating directly on sequences of discrete tokens sampled from a finite-dimensional set (the vocabulary). So researchers have defaulted to the embedding strategy.

Another reason to emphasize embedding is that when transformers train embedding parameters, they appear to do something magical: the resulting embeddings cluster together words and word fragments with similar meanings or functions, in well-separated clusters in the embedding space. You can see examples of this at Kevin Gimpel’s Bert Embedding Visualization Page, where you will see visualized in 2 dimensions clusters of suffixes, of verbs with similar meanings, of types of enclosed spaces, etc. It is one of those weird effects that persuade some people that LLMs are in fact acquiring a sense of the meaning of words.

The final reason to draw attention to embedding is this: embedding almost certainly poisons the approximation to the distribution of language tokens. The embedding step destroys information about that distribution. The reason is that the original native space of token sequences is entirely innocent of vector spaces, and contains no geometric notion of spatial proximity such as arises in the embedding space. That spatial proximity structure is entirely imposed by the NLP architecture. And it almost certainly gives rise to improper notions of proximity between sentences that are sensible (i.e. have a high probability of occurrence) and other sentences that are nonsense (i.e. have a low probability of occurrence).

As an example of improper proximity, consider these two brief sentences: “My dog is fast”, and “My sparrow is fast.” Both are well-formed, grammatically correct, and obey applicable syntactic and semantic rules. The difference is that the first sentence ought to be ascribed a much higher probability than the second one, because nobody actually owns a sparrow.

As embedded points, however, the two sentences are quite similar: a dog and a sparrow are both animals, and hence live in some proximity in the embedding space. Furthermore, dogs are pets, and while sparrows are not pets, they are birds, and some birds are pets. There are enough ways to draw proximity connections in the embedding space to make the second sentence seem plausible in the distribution approximation, despite the fact that it is, obviously, a hallucination.

So embedding is, in my opinion, one of the origins of hallucinations. It is the reason that the approximation made by LLMs to the distribution of language is so brittle. There is nonsense lurking “near” sense in the embedded sequences, because in their native space (token sequences) there was no notion of geometric “nearness”: that property of relative proximity is an artifact of the model.

And if true, this is very bad news for AGI, because it means that hallucinations are a structural feature of all LLMs. They all embed sequences. So you cannot just train your way out of insanity and into “General Intelligence”, because all those new tokens will have exactly the same problem of spurious proximity. The distribution will be corrupted from the outset. It may be that the most likely responses could appear sane, but insane responses will always lurk nearby, waiting to be sampled by the LLM.

The Tyranny of Sampling

I’ve been referring to the process of “sampling” tokens above, and I should say a bit more about that, because while we have seen the origin of hallucinations in a broken estimate of the distribution of language sequences (part 1 of statistical learning) we need to see how the problem is aggravated by an LLM’s response decisions (part 2).

LLMs are often referred to as “generative” models (the “G” in “GPT”). What this means is that their output is, in a sense, random rather than deterministic. They compute probability distributions over the next token, and then exploit that distribution to decide what the identity of the next token should be. They generally do this by choosing the token randomly, with a higher probability of selection ascribed to tokens judged more likely by the calculation.

You might well ask: “Why not simply select the next token by choosing the one with the highest probability?”

This is occasionally tried. It is a strategy called “greedy sampling”. It is very efficient. Unfortunately, it is also a recipe for disaster, a ticket to hallucination pandemonium.

The problem is this: what one really wants is the most likely extended response to the prompt, according to the learned distribution over language. This might consist of hundreds, or thousands, of tokens. The distribution, while imperfectly learned, appears to at least get the most likely extended response right, in the sense that it is the one least likely to contain a hallucination.

Unfortunately, selecting the most likely next-token at every stage does not produce the most likely extended response. This can be a surprise at first, but from a mathematical standpoint it is not surprising at all. The probability of the 17th next-token conditional on the prompt and the previous 16 next-tokens can be very different from the probability of the 17th next-token conditional on the prompt and on the entire remaining most-likely response (tokens 1-16 and 18-1000, say). Choosing the most likely token at every stage can, and usually does, lead the LLM into crazy rabbit-holes.
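Here is a tiny made-up numerical example of the point: with just two response tokens and the probabilities below, greedy selection commits to the wrong first token and ends up with a less likely response than the true maximum.

```python
# Made-up two-step response distribution: P(first token), then P(second | first).
p_first = {"A": 0.6, "B": 0.4}
p_second_given = {
    "A": {"x": 0.3, "y": 0.3, "z": 0.4},    # after the greedy pick "A", probability is spread thin
    "B": {"x": 0.9, "y": 0.05, "z": 0.05},  # after "B", one continuation dominates
}

# Greedy: take the most likely token at each step.
greedy = ("A", "z")
greedy_prob = p_first["A"] * p_second_given["A"]["z"]  # 0.6 * 0.4 = 0.24

# The most likely complete response, found by brute-force enumeration of every pair.
# (Enumeration is affordable here; for a 1000-token response it is hopeless.)
best = max(
    ((f, s, p_first[f] * p_second_given[f][s]) for f in p_first for s in p_second_given[f]),
    key=lambda triple: triple[2],
)
print(greedy, greedy_prob)  # ('A', 'z') 0.24
print(best)                 # ('B', 'x', 0.36...): greedy missed the most likely response
```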

So instead, one attempts to let the probability distribution do its thing by allowing it to somehow sample the next-token distribution. This is better, but more expensive. In principle, what one ought to do is sample the 1000-token response many times (10,000 times, say) and choose the most frequently-occurring response. That strategy would probably abate a good deal of the hallucination phenomenon. Unfortunately, it would be totally unaffordable in inference computation cost, as well as quite slow. So intermediate strategies are adopted, restricting the next-token distribution to the top 90% of candidates, and looking along a tree to the next and next-next tokens for each one of these top tokens (the so-called “beam search”). This is better, but still not great for finding the top 1000-token response.
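One such intermediate strategy is sketched below: keep only the smallest set of highest-probability candidates whose probabilities sum to a threshold (the "top 90%" restriction mentioned above, often called nucleus or top-p sampling), renormalize, and sample from that set. The distribution and threshold here are made up; real systems combine this kind of truncation with temperature scaling and beam-style look-ahead.

```python
import random

def top_p_sample(dist, p=0.9):
    """Sample the next token from the smallest set of highest-probability tokens
    whose total probability reaches p, after renormalizing."""
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    tokens, probs = zip(*kept)
    return random.choices(tokens, weights=probs, k=1)[0]

# Made-up next-token distribution: the long tail of unlikely (often nonsensical)
# candidates is cut off before sampling.
dist = {"practiced": 0.5, "prepared": 0.3, "rehearsed": 0.15, "levitated": 0.05}
print(top_p_sample(dist, p=0.9))  # "levitated" can never be chosen
```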

You might call this the Tyranny of Sampling: one must somehow sample from an LLM in order to defend its output from the worst hallucinatory offenses. But if you try to do the right thing, the computational cost will destroy the usefulness of the method. Rock, hard place.

Hallucinations Are Structural

Here’s the bottom line: Hallucinations are a structural feature of LLMs, produced by a corrupted model of the probability distribution over language sequences learned in training. The corruption is due to embedding, which is a ubiquitous feature of LLMs.

The only available hallucination abatement strategy is some form of generative sampling, which means accepting the unsettling fact that LLMs cannot produce the same output twice to the same prompt. And even accepting this non-determinism as a cost of doing business, the sampling strategy that cleans up the problem to a maximal extent is totally unaffordable. Unsatisfactory look-ahead strategies are better than nothing, but they still let a lot of nonsense through.

There is no hallucination abatement strategy that begins with more token data and larger models. That’s just not a thing, despite what the Tech industry would like to believe (and would certainly like investors to believe). More tokens and larger models likely aggravate the embedding problem, because there will be more improper proximities discovered in the embedding space.

And note that “larger” models are not “more clever” models. This discipline has not produced radical innovations to the transformer architecture since its invention, or at least none that have led to any breakthroughs comparable to what was wrought by the transformer’s first introduction in 2017. A “larger” model simply means “more parameters”3, not new mechanisms that make the model more clever. Given the argument that I make here, I very much doubt that any new cleverness could be built into a transformer that could eliminate the hallucinatory mechanisms baked into its structure at its most fundamental level.

All of which is to say this: LLM-based “AGI” will be mentally ill at birth.


  1. “Derive the wave equation starting from Maxwell equations.”
  2. Note that optimal tokenization is not a solved problem. This would be addressing the following problem: what tokenization is maximally preserving of the information borne by text?
  3. Which is to say, more embedding parameters, more attention heads. Not new mechanisms.


Jackaltariat Bleg, E-Reader Edition

by Tom Levenson | November 25, 2025 | 8:02 pm | 81 Comments

This post is in: Books, Technology

Hey, everyone…a quick semi-technical question.

These days, I read books almost exclusively on my Kindle–a Kindle Voyage to be precise, from almost 10 years ago. Some combination of old eyes needing the capacity to increase type size as the day goes on; having a spouse who strangely enough enjoys not having me turn on the bedside light when my insomnia sets me to reading at 3 a.m.; and just being tired of accreting more stuff has made me a convinced and happy e-ink reader after decades of saying dead trees or nothing!


BUT…I’m getting tired of being locked into Jeff Bezos’s walled garden and would like to buy books in ways that support bookstores, or at least don’t help wreck them.  Bookstore.org offers e-books in a program that allegedly helps independents, though my local (Brookline Booksmith) told me they weren’t participating in the e-book offering. (They do on the physical book side.) And, of course, there’s everywhere else in the e-book space. So there are alternatives…

…but I have this by now quite huge Kindle library.

So, my question(s): A) am I being stupid in trying to navigate around Amazon in this instance? The argument that I am is that nothing I do in this space will alter any of the players’ behavior or outcomes, so why put myself through any grief to achieve a performative result?

B) If that’s not true, or at least unproven, then what are my alternatives? The obvious one (as far as I’ve been able to discover) is the Onyx Boox Go 7 (I’d opt for the grayscale one, not the color machine). It’s an Android-based e-ink reader that can read Kindle books via the Android Kindle app, and read any other format via yet more apps. My concern (besides the fact that it’s a bit spendy) is that, being a fully capable Android tablet, it offers too many possibilities for distraction. Anyone with experience in the Boox world have views on this?

And if not that particular device, any other thoughts? Again, the goal is to be able to read my existing library while sourcing new books from anyplace I want.

Whaddaya got, oh Jackaltariat?

Also too: thanks in advance.

Image: Jean-Baptiste Camille Corot, Interrupted Reading, 1870


Part 4: If There Were AGI, What Would It Look Like?

by WaterGirl | November 19, 2025 | 7:30 pm | 65 Comments

This post is in: Artificial Intelligence, Guest Posts, Science & Technology

Guest post series from *Carlo Graziani.

Guest Post: AI 1

On Artificial Intelligence

Hello, Jackals. Welcome back, and thank you again for this opportunity. What follows is the fourth part of a seven-installment series on Artificial Intelligence (AI).

The plan is to release one of these per week, on Wednesdays, with the Artificial Intelligence tag on all the posts, to assist people in staying with the plot.

The original plan was to skip Thanksgiving week. However, I’ve been talking to WaterGirl about the technical level of these posts, and I’ve come to realize that it’s been a bit off-putting to some readers. So I think that during the turkey-day break, I’ll try to provide a high-level summary of where the series has been with an eye to keeping the nerd-babble under control.

That said…

Part 4: If There Were AGI, What Would It Look Like?

Part 3 ended with a bit of a rant, because I felt the need to express outrage at the very loose and lazy intellectual standards prevailing in much contemporary “AI” research, at least insofar as discussion of Artificial General Intelligence (AGI) goes. My perspective on the subject is by no means a majority view, and I feel a little like Diogenes, shaking his fist at the corrupt world from the austerity of his barrel.

The thing is, I don’t really enjoy the role of Diogenes, because “burn it all down” is a fundamentally destructive outlook on such things. I happen to feel that the scientific accomplishments of modern machine learning, while often oversold, are very real. I don’t want to give the impression that I think the entire subject is worthless, just because the current scientific discussions of AGI are so fundamentally wrong.

As to AGI itself, I think there is something else I need to clarify: I do not intend to say that it is impossible to achieve some version of AGI. I am simply saying that AGI is impossible along our current technological path, which is to say, based purely on machine learning techniques.

I am philosophically a materialist. I do not believe in souls. I think that consciousness is something that physical brains do, a phenomenon that arises from the electrical activities of billions of grey cells. And that being the case, I cannot in good conscience believe that it is surely impossible to bring about some kind of entity, in software running on computer hardware, that recognizably emulates aspects of human cognition, including reason. I do expect that this feat will be far more challenging to accomplish than chatbot parlor tricks we currently call “AI”. Even if true AGI is possible at all, we might not see it happen for many decades. Nonetheless, fundamentally, some AGI technology should be possible in principle.

What I want to attempt today is to describe what the scientific basis for such a technology might look like. I base this discussion on an article that I have written that is currently under review (those of you who would like to take a deeper dive will find the draft article here).

This is a purely speculative venture, and what I write here, however well-motivated, could easily turn out to be wrong. Nevertheless I think this is a useful exercise, for two reasons: it is useful to at least try to point to a possible exit from the stagnant state of current research on AGI; and, it is useful to at least try to illustrate what type of research concerns ought to replace those currently occupying scientists working on AGI.

What Should We Require Of A Theory Of Artificial Reason?


I want to narrow down these considerations, from AGI (a term for which no accepted scientific definition exists) to artificial reason, which is at least amenable to some specific discussion. What I would like is a model of what we mean by the term “reason” that is specific and detailed, to the point of being amenable, at least in theory, to implementation as software. Such a model would at least get us away from the territory of bullshit claims such as “self-organization” and “emergence” of AGI.

Last week, I discussed human reason in the context of what sort of traces it might leave in natural language text, to examine the plausibility of claims that reasoning states can be recovered from large text corpuses. I pointed out that our own reason rests on a foundation of subrational processes which almost certainly leave no such trace in text. Cognitive scientists have only the vaguest notions of how those processes work, and they can certainly not exhibit any models for them that are sufficiently specific to be represented as software. So trying to build a principled “bottom-up” model that mimics how reason emerges in a human mind is probably hopeless, at least for now.

What is left, then, is a “top-down” approach. What I mean by that is that we must work at an abstract level rather than at a mechanistic one. We must state what we mean by “reason” in general terms, in a way that we cannot directly show to be connected to the mechanisms of human reasoning, but which is motivated by the structure of reasoned thought. Also, we would like a model that can be expressed in as mathematical a form as possible, because the point here is to come up with something that we could imagine translating into computer code.

Oddly enough, we already have one aspect of human cognition that can be represented this way: learning. We have seen that there is a subject called statistical learning, wherein by some method one learns an approximation to the statistical distribution from which some dataset was sampled, and one concomitantly learns to structure reasonable decisions based on that distribution. I’ve been a little vague about how this works, but it is a process that can be represented quite generally by the kind of model that I have in mind here.

So one possible approach (certainly not the only one!) is to take that representation of learning and generalize it, to represent reasoning. This approach has two advantages: it allows us to get a free ride on the existing model, which appears to work for learning; and, it allows us to connect and contrast “reasoning” to “learning”, so that we can begin to see what the relationship might be between the two.

A Cast Of Characters

This is all very abstract, and it will be helpful to provide concrete examples of reasoners (or alleged reasoners) to consult as we go along. I have three such examples for you:

  • The astrophysicists who were trying to puzzle out the nature of Gamma-Ray Bursts (GRBs) between 1973 and 1998. The GRB phenomenon consists of bursts of gamma rays (duh!) that arrive at the Earth from random directions on the sky, never repeating. When they were discovered, and for the quarter-century that followed, their nature remained mysterious, because they seemed unconnected to any other astronomical phenomenon. The available data consisted of gamma-ray “light curves” (time traces of gamma-ray intensity), spectra (distributions of gamma-ray energies in the burst), event durations (fractions of a second to hours), and locations in the sky. The latter were only known very inaccurately: the so-called “error boxes”, regions of the sky from which the events might have arrived, were very large by astronomical standards, many degrees across, because it is difficult to create direction-resolving instruments for photons at gamma-ray energies. We will use the story of how the mystery of GRBs was solved to illustrate an aspect of our model of reasoning.
  • A DIY home electrician (name redacted to protect the guilty) attempting to install a light fixture into an electrical box. He is following very standard procedures, using techniques, tools, and materials that he has trained to use and understand, and is moderately skilled. However, for some unknown reason, the fixture installation is failing, because of a persistent short-circuit that only manifests itself when the fixture is finally secured to the wall, and the circuit breaker is turned back on. When he turns on the circuit breaker with the fixture not secured to the wall, there is no short circuit, and the fixture works correctly. He is trying to figure out why by inspecting wire nut connections and checking for crimped wires. We will use this story to illustrate another aspect of our model of reasoning.
  • An LLM undergoing training, or a trained LLM making new inferences. It doesn’t reason: it’s just along for the ride.

Bayesian Updating As A Model Of Learning

Let’s get started with learning.

We can exhibit an abstract model of learning using Bayesian statistical theory. I’ll describe how this works without writing down any equations (there aren’t that many equations, and you will find them in that draft article if you care about that sort of thing). There are two elements to consider: a parameterized model, and an evidence stream.

The role of the evidence stream is to provide new information to be assimilated. The evidence is presented sequentially, one discrete piece at a time. It comes from a fixed set of possible pieces of evidence. There may be infinitely-many such pieces, but they are related by some structural relationships.

Examples of such evidence streams are GRB light curves, spectra, durations, and arrival directions; or the results of the DIY electrician’s inspections for faulty wire connections or crimped wires; or pages of text presented to the LLM in training.

The role of the parameterized model is to provide a description of the structure of the evidence. “Parameterized” simply means that the provided description is controlled by a set of numbers (the parameters) that act as control knobs on the model. Twist those knobs, and the model’s description of the evidence structure changes. There may be a half-dozen such knobs, or there may be billions, depending on the model and the evidence. The model is fixed, but we may set the knobs any way we choose.

The model might contain statements such as “the source of the GRB is a neutron star in our galaxy” and the corresponding knobs could be the star’s spin rate and magnetic field intensity, and its distance from Earth; or the model could contain the statement “one of the wires is getting crimped against the box’s mounting strap” and the corresponding knob would be the identity of the offending wire; or the model might be the LLM itself, and the corresponding knobs would be the billions of parameters that must be set in training.

We do not initially know which settings of the knobs provide the highest-fidelity description of the evidence structure, i.e. which settings are most predictive of the evidence. However, once we start viewing evidence, we have a procedure for weighting the knob settings. “Weighting” means that we may view some settings to which we have ascribed higher weights as being more likely than other settings with lower weights, because the higher-weight settings provide better descriptions of the evidence.

This weighting procedure is called Bayesian updating. As the model views each new piece of evidence, this (fairly simple) mathematical procedure describes how the weights shift among the knob settings. Generally speaking, a single piece of evidence produces a relatively small adjustment of the weights. Over time, as evidence accumulates, what may happen in the ideal case is that a small set of knob settings will hog most of the weight while remaining settings will have essentially zero weight, and we will conclude that those highly weighted settings are “preferred” by the evidence (in the sense that they give the most satisfactory predictions of the evidence).

That, in a nutshell, is our model of learning.
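Here is a minimal sketch of this updating procedure for a toy discrete case, loosely inspired by the DIY electrician example. The hypotheses and probabilities are made up, and, unlike in the story as it actually unfolded, the winning hypothesis is already present in the model: each "knob setting" assigns a probability to each piece of evidence, and the weights are multiplied by those probabilities and renormalized as evidence arrives.

```python
# Toy Bayesian updating over three discrete "knob settings" (hypotheses).
# Each hypothesis assigns a probability to each possible piece of evidence.
likelihood = {
    "crimped_wire":   {"short_when_mounted": 0.3, "short_when_loose": 0.3},
    "bad_wire_nut":   {"short_when_mounted": 0.3, "short_when_loose": 0.3},
    "mounting_screw": {"short_when_mounted": 0.9, "short_when_loose": 0.05},
}

# Start with equal weight on every knob setting.
weights = {h: 1.0 / len(likelihood) for h in likelihood}

def update(weights, evidence):
    """One step of Bayesian updating: weight <- weight * P(evidence | setting), renormalized."""
    new = {h: w * likelihood[h][evidence] for h, w in weights.items()}
    total = sum(new.values())
    return {h: w / total for h, w in new.items()}

for evidence in ["short_when_mounted"] * 3:
    weights = update(weights, evidence)
    print({h: round(w, 3) for h, w in weights.items()})
# The weight concentrates on "mounting_screw" as the evidence accumulates.
# If every hypothesis assigned the same probability to every observation,
# the weights would never move; that is the stalled learning discussed next.
```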

When Learning Stalls

One problem with statistical learning is that the happy circumstance where the weights contract to a small set of knob settings can be difficult to obtain. There are two possible causes:

  1. The evidence may not shed enough light on the model. In this case, we would say that the evidence is not informative about the model.
  2. The model may not be sufficiently descriptive of the evidence. In this case, we would say that the model is not explanatory of the evidence.

If either of these circumstances holds, the Bayesian updating process will stall, and the weights will not decisively concentrate on a winning set of knob settings.

In the case of GRB astronomy, a consensus developed in the 1980s that there was a Case (1) problem: the evidence was not informative with respect to any proposed model of GRBs. The problem was that the source location error boxes were too large, and too tardily reported. It was felt that the transient GRB phenomenon was in all likelihood associated with equally transient phenomena at other wavelengths, and that observing such transients might be the key to unlocking the mystery. But a 4-degree error box on the sky is always crowded with astronomical sources, including time-varying ones, and it was simply not possible to identify any one of them as the culprit. GRB research stalled. Bad evidence!

In the case of the DIY electrician, something was clearly not right with his understanding of the situation inside the box, because after multiple inspections it was increasingly clear that all the connections were fine, and none of the wires were getting crimped. Something else, not suggested by the model, had to be at fault. Bad model!

In the case of a trained LLM’s efforts to respond to prompts, we mostly have a bad model problem, in my opinion. Certainly, the hallucination phenomenon suggests a very brittle model that easily goes off the rails. However, depending on the objective of the training, there might also be a bad evidence problem, particularly in the case of training an AGI: as I discussed last week, the text corpus almost certainly contains no information concerning the origins of human reasoning processes.

Where’s The Aha! ?

Note one characteristic feature of the learning process that I described above: it is in essence continuous. Piece of evidence comes in, small adjustment occurs in weights. Lather, rinse, repeat.

If we are going to base an account of reason on straight-up learning, as the LLM research community is attempting to do, this is a very serious (although largely unrecognized) problem, because one of the salient features of reason is that it often operates discontinuously. We have all, I am sure, experienced those moments of “Aha!” revelation, in which some issue that we have struggled with suddenly seems easily solvable. The problem has suddenly flipped and twisted in such a way that clarity replaces darkness. If there is an aspect of reason that distinguishes it from other cognitive activities, I submit that “Aha!” is that aspect.

That’s the problem with the “learning to reason” approach to AGI. Learning is an essentially continuous process. It simply cannot produce the “Aha!” discontinuity. There is no pure learning path to Artificial Aha! (AA). As a type of cognition, learning is severely limited by restrictions on evidence and model choice. Essentially, all it can do is update its weights across the fixed model’s knob settings, based on evidence drawn from a fixed collection of evidence types, in the hope that some settings are explanatory of the evidence and that the evidence is informative about the model.

It should go without saying that this does not begin to capture reasoning. Anyone reflecting on their own “Aha!” moments of sudden clarity and insight (not necessarily in the pursuit of natural science, home repair, or computer science, but in solving any puzzle in any field of human activity!) should understand that those moments do not come from a process analogizable to gradual constraining of a model through gradual assimilation of accumulating data. “Aha!” moments are essentially cognitive discontinuities, gestalt shifts that suddenly alter the process of assimilating evidence into a model, and are incompatible with the continuous learning process described above. So what are we talking about when we talk about “reason”, and in what way is it related to learning? And, how might we produce AA?

Evidentiary Reform

Suppose that we recognize that we are in Case (1): the evidence is not informative of the model. Then the move is obvious: we change our evidence stream. We cast about for a new stream of more powerful evidence that speaks more clearly to our model, using our knowledge of model features that might be sensitive to other types of evidence, as well as of what new types of evidence might be feasibly acquired. We refer to this shift as Evidentiary Reform.

Evidentiary reform is pretty much the approach taken by astrophysicists to decode the nature of GRBs. Realizing that no GRBs could be associated with a transient counterpart at other wavelengths because of the inaccuracy of GRB locations, GRB scientists developed new high-precision X-ray localization instruments, and arranged for GRB locations to be propagated in real time to ground-based optical and radio observatories. The first transient optical counterparts of GRBs (the so-called “afterglows”) were detected in 1998, revealing their extragalactic nature through their substantial absorption redshifts 1. By 2003 a core-collapse supernova in a relatively nearby galaxy had been caught in flagrante in a GRB error box (whose size was now about 0.05 degrees), associating GRBs with a certain type of supernova. Case closed. The new stream of evidence, brought into being to correct the weakness of the previous evidence, transformed the mystery into a soluble problem.

The ability to propose evidentiary reform to obtain better model constraints is certainly an example of a true reasoning process. It has the required “Aha!” discontinuous character, embedded in the realization that a new type of information is required for further progress. It is also a highly non-trivial thing to model in a computation, since a successful evidentiary reform needs to take into account not only the nature of the weakness of the previous evidence with respect to model constraint potential, but also practical considerations of how such new evidence can be obtained given real-world feasibility constraints.

Model Reform

Suppose that we recognize instead that we are in Case (2): the model is insufficiently explanatory of the evidence. Then, again, the move is obvious: replace the model with a new model capable of improved predictive power, and endowed with a new set of knobs. The new model might be suggested by the specific form of prediction failures common to the old model. It would likely also satisfy certain criteria of ontological parsimony, embodying some notion of Occam’s Razor-type simplicity so as to exclude model families of weak explanatory/predictive power. We will refer to this process as Model Reform.

The DIY electrician took this approach to finally figuring out his short-circuit problem. After several iterations of taking the fixture off the box and inspecting various electrical elements and connections for defects, and making sure the wires were neatly folded in the box so that they could not become crimped, he started to think of what could produce a short-circuit only when the fixture was secured to the wall. At which point, he realized that the screw securing the fixture to its mounting strap in the electrical box was long enough to reach through the box into the hole in the wall from which the electrical cable emerged, and bury itself among the wires in the cable, potentially crimping and shorting them. And an inspection of the end of the screw showed a dark discoloration that was not present originally, presumably due to the short-circuit passing through the end of the screw. A simple solution—replacing the screw with a shorter screw—immediately produced a satisfactory installation. The problem had been that the original model did not feature any role for the mounting screw. The new model now contained a statement “The mounting screw causes a short at the electrical cable when the fixture is fully secured.” It was induced by the inability of the original model to predict the short-circuits, and supported by new evidence (the discoloration of the end of the screw) which was not interpretable within the original model.

A reasoner can produce an “Aha!” discontinuity through model reform, when a judicious replacement of the model results in improved predictions of the evidence, leading to marked improvements in the concentration of the knob-setting weights. Again, this type of reasoning is not straightforward to model in a computation, since formulating a new model requires some sense of the data misfit and a formulation of some kind of Occam’s Razor conceptual parsimony constraint.

Reasoning and AI

In summary, this high-level account of reason ascribes to it the ability to supervise and intervene upon a learning process, discontinuously altering either the model or the evidence stream, which would otherwise be static features of the learning process. In addition, a reasoning process must be capable of recognizing when a learning process under its supervision stalls. When a stall occurs, it must diagnose whether the failure is more likely due to a bad model or to a bad evidence stream, and it must propose an alteration of one or the other, according to criteria suggested by the failure, while respecting important constraints on possible alternatives.

In other words, in this account, reasoning transcends learning in an essential manner.
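For concreteness, here is a purely schematic sketch of the supervisory loop I have in mind. Every callable passed into it (the learner, the stall detector, the diagnosis, the two reform proposals) is a placeholder for machinery that nobody knows how to build; the sketch only shows where such machinery would sit relative to ordinary learning.

```python
def reasoning_supervisor(model, evidence_stream, learn, stalled, diagnose,
                         reform_evidence, reform_model, max_rounds=10):
    """Schematic only: a reasoning process supervising a learning process.
    The callables are stand-ins for machinery that does not yet exist."""
    for _ in range(max_rounds):
        weights = learn(model, evidence_stream)     # ordinary Bayesian updating
        if not stalled(weights):
            return model, weights                   # the weights concentrated: learning sufficed
        # The "Aha!" step: diagnose the stall and discontinuously change the setup.
        if diagnose(model, evidence_stream, weights) == "uninformative evidence":
            evidence_stream = reform_evidence(model, evidence_stream)   # evidentiary reform
        else:
            model = reform_model(model, evidence_stream, weights)       # model reform
    return model, None   # no satisfactory account found within the allotted rounds
```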

This is major trouble for current attempts to obtain AGI, because, as we have been discussing, the entire subject is based on machine learning. Transformer-based LLMs are nothing but computational models that learn to represent an approximation to the probability distribution over token sequences encountered in their training data, which they exploit to construct likely sentence completions, sentence translations, sentence classifications, and so on. They do this so well that their output can belie its origin in probabilistic mimicry (in Emily Bender’s memorable phrase, they are Stochastic Parrots). They can produce the appearance of reasoned discourse most of the time. But the process by which such models are trained is the gradual, continuous assimilation of millions of text documents into a stupefyingly large model. LLMs never do “Aha!” They simply aren’t wired that way, because their evidence streams and models are fixed.

This is the point that current AGI research appears to miss altogether. The view now gaining currency among practitioners is that the “emergence” of intelligence occurs in consequence of training models with billions, or trillions, of parameters, as evidenced by the fact that such models can perform certain “reasoning tasks”. But performing reasoning tasks is not at all the same thing as reasoning: that is the circular argument for AGI again. Some modern AI systems have been trained to write very creditable computer code. But the ability to write code does not make one a computer scientist—there are no AI computer scientists today, certainly none capable of proposing new conceptions and models. Similarly, some AI systems can prove mathematical theorems. This does not make them mathematicians, since there is much more to the cognitive activities of a mathematician than just proving theorems—it is far more challenging and useful to know which theorems are interesting to search for, and to create interesting new mathematical frameworks within which theorems can be searched for and proven. And, from the sublime to the ridiculous: an LLM-based AI electrician may know chapter and verse of the National Electrical Code, and be as conversant with tools, materials, and techniques as any licensed electrician. But faced with a situation not previously confronted by any training example it would not be able to reform its model or its evidence stream to suit the unexpected circumstance.

Is This Model Right?

I don’t know whether the model of reason that I argue for here is indeed correct, or in any sense valuable. It has obviously not been implemented in software and validated. As I have indicated, it would be highly non-trivial to represent the model in software. But not, I think, impossible. It is at least a specific model, and it is based on a set of mathematical ideas. One could at least begin building small toy systems that would permit some exploration of its features.

I imagine that AI practitioners would find it easy to reject this model and ignore the conclusions that it forces one to draw, because there is no output by which one can judge its validity. But please note that at least this is a model of reason. AI researchers have never deigned to supply such a model, instead relying lazily on vague notions of “emergence” and “self-organization” for which they offer no mathematical theory worthy of the name. Which is to say, they embrace the circular argument for AGI, discovering AGI in LLM output after declaring what AGI should appear as in LLM output. That is a worthless, contemptible scientific argument (Diogenes is getting the better of me again). If you want to tell me that your model “reasons”, show me your model of reason, and we can argue about whose model is better. I would love to have that conversation. It would be on a whole different intellectual plane from where AGI research is today.


  1. The universe is known to be expanding, so that very distant galaxies appear to recede from us at velocities that increase with their distance from our own galaxy. This effect leaves a trace in light that we detect from such galaxies, because the faster they move away from us, the more their light is shifted from lower to higher wavelengths, i.e. from blue to red. This “redshifting” is known as the Doppler effect, and it helps astronomers ascertain how long ago the light was emitted.


I Finally Found A Use For Chat GPT…

by Tom Levenson | November 15, 2025 | 3:22 pm | 44 Comments

This post is in: Artificial Intelligence, Open Threads, Something Fun

In the morning thread, Jackal Tobie pointed out that Google does not offer an AI answer for a search asking if Trump has dementia, while generating a reply if one asks about, say Bruce Willis.*

So…what should one do?

Well, if you have my somewhat unfortunate turn of mind, you might ask Chat GPT to generate an internal monologue for a Trump posited to be in the grips of cognitive decline. To its credit, I’d say, that generator of frequency distributions of strings of words declined to do so, explaining that it would be too close to offering a diagnosis of a living person.

Being a helpful sack of words+algorithms, however, the LLM offered to write up what a fictional  politician in such straits might be thinking. My first try was a bust–I fed back the word “politician” in the prompt and got a notional senator’s musings.  I tried again with a cleaned up ask, and this is what I got:

Prompt: Write an internal monologue of a fictional American president who is confused, grandiose, or erratic.


Here’s an internal monologue of a completely fictional American president—President Alderidge “A.J.” Mapleson—who is confused, grandiose, and erratic. No real person is being portrayed.


Internal Monologue of President Alderidge “A.J.” Mapleson

Where’s my nuclear football? They told me it’s always nearby, but I don’t see anything shaped like a football. Maybe it’s an actual football. Maybe they’ve been hiding it from me because they’re worried I’ll try to punt it down the hallway again. That only happened twice. People overreact.

Focus, A.J., focus. You’re the president. The President of the United States—the biggest job in the world. Or is it the second biggest? I think the CEO of MegaMart technically has more employees. I should call him. Maybe he can give me leadership tips. Or coupons.

What’s on the schedule? Something about a security briefing? Or a briefing about the briefing? Why do they brief me so much? I already know everything. I’ve read so many files, some of them twice. I could brief them if I wanted. But I won’t. Wouldn’t want to make them insecure.

Is that the Secretary of Defense coming toward me? I hope he doesn’t want an update on the “Mapleson Doctrine.” I don’t remember inventing it, but everyone keeps referencing it like I wrote it in stone tablets. Should I ask him what it is? No—no, play it cool. I’ll nod sagely. Nodding solves many problems.

I need a strong line for the speech today. Something presidential. “My fellow Americans…” Yes, good start. Then something inspiring. Maybe trees? Americans like trees. Or space. Space trees? I’ll circle back to that.

Wait—am I supposed to be in the Oval Office right now? This hallway doesn’t look oval. It looks rectangle. That’s troubling. Maybe they moved the office again. They keep rearranging things. I know they do it to keep me sharp. Or to keep me confused. Hard to tell these days.

I should announce a new initiative. Presidents announce things. “Operation… something.” Operation Lighthouse, maybe? Or Operation Lightning Shark. Yes. That sounds powerful. Dramatic. Cinematic. I’ll figure out what it actually does later, once the polls come in.

Smile, A.J. Smile with gravitas. Think majestic thoughts. Think commander-in-chief energy. Think… wait… where was I going?

Ah yes.

To lead the nation. Boldly. Confidently. And with absolutely no idea what meeting I’m walking into right now.

I don’t know about y’all, but I’m feeling that the veil of fiction is pretty damn thin.

Have fun in the comments, jackaltariat; this thread is as open as the presidential mind is welded shut.

PS: I am not an AI-involved guy, at least not yet. As you might expect, the question of how the ready availability of LLMs affects higher ed is a pretty constant topic at my workplace. But I’ve been too much involved in immediate crises to get much of a handle on what actual use such models might be, and how (if at all) to keep their use from wrecking humanistic education. I suspect I will not be permitted such blissful ignorances for much longer.

*To be fair: Google also declines to provide an AI summary for the same search asked of other politicians. I tried it on Joe Biden and Mike Johnson and each time it went “nope.”

Image: Hans Memling, X-radiograph of the painting, Portrait of an Old Man,  c. 1475


Part 3: There Is No Artificial General Intelligence Down This Road

by WaterGirl | November 12, 2025 | 7:30 pm | 83 Comments

This post is in: Artificial Intelligence, Guest Posts, Science & Technology

Guest post series from *Carlo Graziani.

Guest Post: AI 1

On Artificial Intelligence

Hello, Jackals. Welcome back, and thank you again for this opportunity. What follows is the third part of a seven-installment series on Artificial Intelligence (AI).

The plan is to release one of these per week, on Wednesdays (skipping Thanksgiving week), with the Artificial Intelligence tag on all the posts, to assist people in staying with the plot.

Part 3: There Is No Artificial General Intelligence Down This Road

This week and next we will be taking a close look at the claims made by the Tech industry that there are already indications that Artificial General Intelligence (AGI) is “emerging” in large language models (LLMs), and that true AGI will be a reality within the next few years. Keep in mind that AGI is the objective that these companies are targeting, and its realization is the essential justification for the roughly $2T in investment in “AI” model development that the industry now projects over the next 5 years or so.

You might think that justifying that level of investment would require a pretty airtight scientific case (1) that AGI is possible in principle, and (2) that AGI is achievable through current LLM technology, which is to say, using transformer-based deep learning (DL). But if you did think that, you would be wrong. Whether AGI can be accomplished at all has been an open question since the 1930s. And, as I will argue in this essay, we are certainly not any closer to AGI with current “AI” tech than we were before the DL revolution began.

The Circular Argument For AGI

The first thing to observe is that there does not really exist a scientifically-defensible definition of what AGI is. There is a fairly balanced review of the topic here. The principal problem is that we don’t even know how to accurately describe or define either the mechanisms or the characteristics of human intelligence, so when definitions of AGI appeal to notions such as “the ability of computers to perform human-like cognitive tasks” they are comparing one imprecise notion to a different imprecise notion.

Moreover, it is important to note that all such definitions are circular: they define AGI in an LLM in terms of certain types of output produced by LLMs, and then promptly discover evidence for that very output, proving that AGI is near. This paper, Sparks of Artificial General Intelligence: Early experiments with GPT-4 is an unintentionally hilarious example of the genre.

I find this sort of thing extremely frustrating. Language matters in science. I don’t want to have to parse statements that amount to defining what intelligence looks like in text output, from people who don’t have the faintest idea what intelligence is.

Cognitive scientists also labor under this constraint, designing tests and experiments to try to understand aspects of human cognition from stimuli and responses. But they have no choice in the matter: we are very far away from having experimental access to the higher-level functioning of the human brain, so those scientists use the tools that are available. Computer scientists have no such excuse: they have complete access to and control over their models. Nonetheless, the tests for intelligence that they adopt are essentially stylized versions of the cognitive science tests, with stimulus and response replaced by prompt and response. There is no effort to describe what aspect of transformers (or of the chained, augmented transformers in the “reasoning” models of OpenAI and others) is supposed to be doing the reasoning. There is only complacent satisfaction that some combination of pre-training, fine-tuning, distillation, computational scaling, iteration, etc. produces improved performance on “reasoning” benchmarks. Sure, that’s very nice, although “improved” does not mean “adequate”, according to ARC-AGI-2 testing. But excuse me, what is this “reasoning” of which you speak?

I’ll have more to say about reasoning next week. For now, I just want to point out that whatever reasoning is, it is certainly a distinct cognitive process from learning. So the assertion that reason can “emerge” from what are pure statistical learning systems is a huge claim, one whose justification would require mountains of really impressive scientific evidence, including a detailed explanation of the mechanism by which it arises in LLMs or chains of LLMs.

The Implausibility of “Learning To AGI”

In order to break the claim down into intelligible pieces, it is useful to adopt the “model-agnostic” outlook on machine learning that I discussed last week. Recall that in that outlook, we draw a veil over the details of the machine learning implementation, and focus on the learned distributional structure of the training data and on the optimality of decision choices. In this case, the training data is vast amounts of text, distilled, cleaned, and curated from large-scale Internet scrapes, from large libraries of scanned books, from academic journals, and so on. The decisions are responses to prompts. Whatever the thing behind the veil is, what it does is learn an approximation to the distribution of texts, and approximately optimal responses to prompts.
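
To make that model-agnostic picture concrete, here is a deliberately tiny sketch of my own (the corpus, function names, and greedy decoding rule are all invented for illustration, nothing like production code): “learning” is just counting word pairs in a corpus, and the “decision” is picking the most probable continuation of a prompt under the learned distribution. LLMs replace the counting with billions of parameters and the greedy pick with fancier decoding, but the shape of the thing is the same.

# A deliberately tiny, model-agnostic sketch: learn a distribution over text
# (here, just bigram counts), then answer a "prompt" by repeatedly choosing the
# most probable next word under that learned distribution. Real LLMs differ in
# scale and machinery, not in this basic shape. Everything here is a toy.
from collections import defaultdict, Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Training": estimate P(next word | current word) by counting.
counts = defaultdict(Counter)
for w, nxt in zip(corpus, corpus[1:]):
    counts[w][nxt] += 1

def respond(prompt_word, length=5):
    """Decision rule: greedily pick the most probable continuation."""
    out, w = [], prompt_word
    for _ in range(length):
        if w not in counts:
            break
        w = counts[w].most_common(1)[0][0]
        out.append(w)
    return " ".join(out)

print(respond("the"))   # e.g. "cat sat on the cat" -- fluent-ish, no understanding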

I need to introduce a concept here that is familiar to most scientists: it is the idea of an inverse problem. The problem is this: given some data resulting from observations of some process, infer certain attributes of that process. A simple example is weather prediction: given a time-series of observations of weather conditions at thousands of weather stations, and radar and other remote observations, recover an approximation of the current full state of the atmosphere, so as to evolve it using a numerical weather model to predict whether it will rain tomorrow. Another famous (and essentially unsolved) example is from epidemiology: given some time-series of data on infections, hospitalizations, and deaths due to COVID-19, say, infer the current state of the epidemic (how many people are susceptible, exposed, infected, recovered, immune, on a county-by-county basis), and use a numerical epidemiological model to predict the epidemic’s future course.

Note the essential elements of such problems: we have a principled model of the process (a numerical weather model, or an epidemiological model) whose state we would like to infer (the atmospheric state, or the state of the epidemic) using data (weather observations, clinical data) so as to make predictions (will it rain during my picnic, is there a new epidemic wave in progress). There is always an assumed “forward model” that describes how the observed data arises, given the state of the process. But that state is unknown, and to estimate it from data one must in some sense “invert” the forward model. Hence “Inverse Problem”.

The process model plays a key role. You need to have some idea of how the process works—a set of equations that governs the process, for example, depending on unknown parameters that you need to infer—for there to even be a well-posed inverse problem. That’s not a sufficient condition, but it is certainly a necessary one.

Inverse problems are ubiquitous in science. In fact, one could, after a few beers, make the claim that most of the daily activities of scientists revolve around solving inverse problems. This is not completely true (where did the principled process models come from, in the first place?) but it is not a grotesque caricature either.
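
To pin down that structure with a toy example (entirely invented: an exponential-decay “process” standing in for the weather or epidemic models, made-up parameter values and noise level): we know the form of the forward model, we observe noisy data it produced, and we “invert” it to recover the hidden state.

# Toy inverse problem, purely illustrative: we know the *form* of the forward
# model (exponential decay, standing in for a weather or epidemic model), we
# observe noisy data it produced, and we "invert" it to recover the hidden
# state (the parameters A and k). The principled forward model is what makes
# this well-posed -- the ingredient the argument below says is missing for "AGI".
import numpy as np
from scipy.optimize import curve_fit

def forward_model(t, A, k):
    """How observations arise, given the (unknown) state (A, k)."""
    return A * np.exp(-k * t)

rng = np.random.default_rng(0)
t_obs = np.linspace(0.0, 5.0, 40)
true_A, true_k = 3.0, 0.7                      # the hidden "state"
y_obs = forward_model(t_obs, true_A, true_k) + rng.normal(0.0, 0.1, t_obs.size)

# Solve the inverse problem: find the state that best explains the observations.
(est_A, est_k), _ = curve_fit(forward_model, t_obs, y_obs, p0=[1.0, 1.0])
print(f"recovered A={est_A:.2f}, k={est_k:.2f}")   # close to 3.0 and 0.7

# Prediction step: evolve the inferred state forward, like a weather forecast.
print("predicted y(6) =", forward_model(6.0, est_A, est_k))

Swap the two-parameter decay curve for atmospheric dynamics, and the two numbers for millions of grid-cell values, and you have something like operational data assimilation; the structure, and the dependence on a principled forward model, is the same.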

We can view the training of an “AGI” in inverse problem terms: the data is the oceans of text that these things ingest. The process model is the transformer-based “reasoning” model. The “state” to be inferred is the parameter configuration of that model that closely corresponds to a representation of the mental state of a reasoning human. The predictions are reasoned responses to prompts.

OK that’s all I need. Here is the problem: in order to believe that LLMs are achieving “reason” (the minimum requirement for any definition of AGI), we need to accept two big claims:

  1. Whatever a reasoning process may be, it leaves a sufficiently informative imprint of its internal state in text data, such that the state may be in some sense recovered and exploited, given a sufficiently large corpus of text, by solving the corresponding inverse problem.
  2. Transformer-based LLMs, in some sense, play the role of the process model in this inverse problem, and training such an LLM is tantamount to solving the inverse problem. Moreover, the trained LLM embodies the resulting reasoning entity to the point that at inference time it actually reasons.

Let’s take these in order:

In my opinion, claim (1) is barely sane. Perform any sort of introspection, and I think it is likely that you will find that your spoken or written utterances embody only the most superficial layers of your reason and other cognitive processes. That’s why we all struggle to put our thoughts into words when the occasion arises. We often are not even clear about what our thoughts are, and find, after putting them into words, that they have changed, possibly getting clearer, but also often becoming murkier and less certain as we are forced to articulate our meaning [1].

I simply cannot understand how such subrational processes might embed any interpretable information in our utterances. It is analogous to believing that, given a full, principled model of human physiology, and a data corpus of human footprints together with clinical observations of the humans leaving the footprints, one could train a model that could observe a new footprint and predict the health of the corresponding human. That would be mad: there is not enough information embedded in a footprint to back out a person’s gastric health, or visual acuity, or state of infection from a disease, etc. Similarly, I do not believe that there is enough information impressed in text about the subrational processes whose surface manifestations we call “reason”. I could be wrong about this, but I don’t think so, and in any event the burden of proof is on those researchers who make this kind of claim. Where is that information? How is it encoded?

Claim (2) is actually much worse: it is in the category that physicist Wolfgang Pauli called “not even wrong”—a statement so detached from scientific discourse that classifying it as correct or incorrect is simply a waste of time.

Let’s pull back the curtain concealing the LLM model for a moment. If you read any of the many online descriptions of how a transformer works (The Illustrated Transformer is pretty good, and Wikipedia’s is quite detailed, but Google has many hits for “How does a transformer LLM work”), you may find the level of computational detail off-putting at first. But if you zoom out a bit, what you realize is that it is mostly a giant chain of linear-algebraic operations, interspersed with a few nonlinear “activations”, sandwiched between a linear encoding layer and a nonlinear decoding layer. In this sense it is not different from any DL method. There are more layers and parameter arrays than in most, but not much more structure. It’s a system that grew out of a lot of trial and error, with a pile of late, unlamented errors filling a large dumpster in the back of the lab, and only what more-or-less worked left in.
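
If that description sounds too glib, here is a drastically stripped-down, single-head sketch of one transformer block in NumPy, with made-up dimensions and random weights (again, my own illustration, not anyone’s actual model). Real models add multi-head attention, layer norms, positional encodings, and many stacked layers, but every ingredient is of this kind: matrix multiplications, a softmax, an elementwise nonlinearity.

# A drastically simplified, single-head "transformer block" in NumPy, just to
# make the zoomed-out claim concrete: matrix multiplications, a softmax, one
# more matrix multiplication, and an elementwise nonlinearity. Real models add
# multi-head attention, layer norm, positional encodings, and many stacked
# layers, but the ingredients are of this kind. Weights here are random.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 8, 16                      # 8 tokens, 16-dimensional embeddings
X = rng.normal(size=(seq_len, d))       # stand-in for encoded input tokens

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Self-attention: three linear maps, a softmax, one more linear combination.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
attn = softmax(Q @ K.T / np.sqrt(d)) @ V

# Feed-forward: linear map, nonlinear activation (ReLU here), linear map.
hidden = np.maximum(0.0, (X + attn) @ W1)        # residual connection + ReLU
out = (X + attn) + hidden @ W2                   # another residual connection

print(out.shape)   # (8, 16): same shape in, same shape out, ready to stack

Stack a few dozen blocks of this shape, train the weight matrices on a planet’s worth of text, and you have roughly the thing behind the curtain.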

There is nothing special in that model that is analogous, say, to the model of human physiology that one would need to even attempt to back out a human’s health from that human’s footprint. There isn’t a scrap of theory to motivate the claim that transformer-based models could furnish the basis for solving this inverse problem. Which is to say, a key element of the inverse problem—the principled model embodying actual knowledge of the process under study—is simply not there. Instead there are chains of linear algebra mingled with other ad-hockery, not purporting to model anything. Which means that Claim (2) is, in effect, the claim not only that this Rube Goldberg device is capable of inverting the forward model to recover the state of the reasoning process, but also that it is somehow capable of reconstructing the principled model of the reasoning process of which that state is an attribute. That chain of linear algebra is, in effect, a Nobel-caliber cognitive scientist, because the first reasoning task that it carries out is to create a working model of reason itself, a task that still eludes the discipline of cognitive science!

That is just magical thinking. It is literally impossible that this bodged-together system should have accidentally succeeded in modeling reason—an unsolved scientific problem—and then solving the related, probably impossible inverse problem of recovering the model’s state from text input, so as to boot up a reasoning entity. It’s a thoroughgoingly stupid claim.

“AGI” Is A Scientific Scandal

I find it disgraceful and shameful that an entire category of scientists has been moved by enthusiasms and Tech industry funding to lower its intellectual standards to the point that this sort of bullshit floods the journal and conference literature. It’s a scientific scandal, unfolding in plain view. Nothing in the Replication Crisis that afflicts the social sciences comes remotely close to this level of corrupted science.

I can’t emphasize strongly enough that this hubristic nonsense is taken very seriously by the “AI” research community. Sublimely unfazed by the absence of any fundamental explicit understanding of what reason is, and positively glorying in the inscrutable inner complexity of LLMs (“Explainability” is itself a topic for funded research, after all, as we saw last week), this community crows about achieving the “emergence” of intelligence from the models at large scales of data and computation, secure in the knowledge that the models are too unanalyzably complex for any model developer to be expected to explain how this miracle comes about. They just claim that it’s “self-organization” at work. The intellectual laziness of this outlook is simply shocking to me.

At this point, the technical jargon of this discipline has escaped all bounds of propriety. “AI” was bad enough, given the limited amount of “I” in ML (basically, only learning). But now we have “chain of thought”, “knowledge representations”, “mixture of experts”, “agents”, “reasoning models” and “General Intelligence” as well as many other similar allusions to human cognition polluting the technical discourse. Shame is dead in this discipline.

In a sense it’s kind of funny: Silicon Valley Masters of the Universe are directing trillions of dollars in investments to build hundreds of data centers, buy stupefying amounts of computing hardware, and add an estimated 60 GW of electrical power generation to the U.S. grid, all for the purpose of achieving something that literally cannot be achieved. There is no pot of gold marked “AGI” at the end of this rainbow. But it will take an infinite amount of data, compute, power, effort, and money to get there and find out. What could possibly go wrong?


  1. Both increased clarity and increased murkiness of thought have certainly happened to me several times between when I conceived these essays and when I actually started banging them out on my keyboard.

Part 3: There Is No Artificial General Intelligence Down This Road · Post + Comments (83)

Dish Served Cold: James Watson Is Dead

by Anne Laurie | November 9, 2025, 1:00 am | 61 Comments

This post is in: Excellent Links, RIP, Science & Technology

pre-writing a devastating obituary for your enemy is god-tier hating of a kind you don’t often see anymore. renaissance haterism. beautiful stuff.

— Emily C. Hughes (@emilyhughes.bsky.social) November 8, 2025 at 7:55 PM

“James Watson, dead at 97, was a scientific legend and a pariah among his peers”:

When biologist James Watson died on Thursday at age 97, it brought down the curtain on 20th-century biology the way the deaths of John Adams and Thomas Jefferson on the same day in 1826 (July 4, since the universe apparently likes irony) marked the end of 18th-century America. All three died well into a new century, of course, and all three left behind old comrades-in-arms. Yet just as the deaths of Adams and Jefferson symbolized the passing of an era that changed the world, so Watson’s marks the end of an epoch in biology so momentous it was called “the eighth day of creation.”…

What follows is more like the B side of that record. It is based on interviews with people who knew Watson for decades, on Cold Spring Harbor’s oral history, and on Watson’s many public statements and writings.

Together, they shed light on the puzzle of Watson’s later years: a public and unrepentant racism and sexism that made him a pariah in life and poisoned his legacy in death.

Watson cared deeply about history’s verdict, which left old friends even more baffled about his statements and behavior. It started in 2007, when Watson told a British newspaper that he was “inherently gloomy about the prospect of Africa” because “social policies are based on the fact that their intelligence is the same as ours — whereas all the testing says not really.” Moreover, he continued, although one might wish that all humans had an equal genetic endowment of intelligence, “people who have to deal with Black employees find this not true.”

He had not been misquoted. He had not misspoken. He had made the same claim in his 2007 memoir, “Avoid Boring People: Lessons from a Life in Science”: “There is no firm reason to anticipate that the intellectual capacities of peoples geographically separated in their evolution should prove to have evolved identically,” Watson wrote. “Our wanting to reserve equal powers of reason as some universal heritage of humanity will not be enough to make it so.” As for women, he wrote: “Anyone sincerely interested in understanding the imbalance in the representation of men and women in science must reasonably be prepared at least to consider the extent to which nature may figure, even with the clear evidence that nurture is strongly implicated.”…

“I really don’t know what happened to Jim,” said biologist Nancy Hopkins of the Massachusetts Institute of Technology, who in the 1990s led the campaign to get MIT to recognize its discrimination against women faculty. “At a time when almost no men supported women, he insisted I get a Ph.D. and made it possible for me to do so,” she told STAT in 2018. But after 40 years of friendship, Watson turned on her after she blasted the claim by then-Harvard University president Lawrence Summers in 2005 that innate, biological factors kept women from reaching the pinnacle of science.

“He demanded I apologize to Summers,” Hopkins said of Watson. (She declined.) “Jim now holds the view that women can’t be great at anything,” and certainly not science. “He has adopted these outrageous positions as a new badge of honor, [embracing] political incorrectness.”

A partial answer to “what happened to Jim?”, she and other friends said, lies in the very triumphs that made Watson, in Hopkins’ words, unrivaled for “creativity, vision, and brilliance.” His signal achievements, and the way he accomplished them, inflated his belief not only in his genius but also in how to succeed: by listening to his intuition, by opposing the establishment consensus, and by barely glancing at the edifice of facts on which a scientific field is built.
_____

One formative influence was Watson’s making his one and only important scientific discovery when he was only 25. His next act flopped. Although “Watson’s [Harvard] lab was clearly the most exciting place in the world in molecular biology,” geneticist Richard Burgess, one of Watson’s graduate students, told the oral history, he discovered nothing afterward, even as colleagues were cracking the genetic code or deciphering how DNA is translated into the molecules that make cells (and life) work…

Watson nevertheless viewed himself “as the greatest scientist since Newton or Darwin,” a longtime colleague at CSHL told STAT in 2018.

To remain on the stage and keep receiving what he viewed as his due, he therefore needed a new act. In the 1990s, Watson became smitten with “The Bell Curve,” the 1994 book that argued for a genetics-based theory of intelligence (with African Americans having less of it) and spoke often with its co-author, conservative political scholar Charles Murray. The man who co-discovered the double helix, perhaps not surprisingly, regarded DNA as the ultimate puppet master, immeasurably more powerful than the social and other forces that lesser (much lesser) scientists studied. Then his hubris painted him into a corner…

When the friend proposed that Watson debate the genes/IQ/race hypothesis with a leading scientist in that field, for a documentary, Watson wouldn’t hear of it: “No, he’s not good enough” to be in the same camera frame as me, Watson replied, the friend recalled. “He saw himself as smarter than anyone who ever actually studied this” — which Watson had not.

Friends traced Watson’s smartest-guy-in-the-room attitude, and his disdain for experts, to 1953. When he joined Crick at England’s Cavendish laboratory, Watson knew virtually nothing about molecular structures or “the basic fundamentals of the field,” Jerry Adams, also one of Watson’s graduate students, told the oral history; Watson was “self-taught.” He saw his double-helix discovery as proof that outsiders, unburdened by establishment thinking, could see and achieve what insiders couldn’t.

That belief became cemented with his success remaking Harvard biology. The legendary biologist E.O. Wilson, who was on the losing end of Watson’s putsch, called him “the most unpleasant human being I had ever met,” one who treated eminent professors “with a revolutionary’s fervent disrespect. … Watson radiated contempt in all directions.” But in a lesson Watson apparently over-learned, “his bad manners were tolerated because of the greatness of the discovery he had made.”

Perhaps in reaction to Watson’s sky-high self-regard, in his later years his peers and others began to ask if his discovery of the double helix was just a matter of luck. After all, as a second lab colleague said, “Jim has been gliding on that one day in 1953 for 70 years.”

With Rosalind Franklin’s X-ray images (which Watson surreptitiously studied), other scientists might have cracked the mystery; after all, American chemist Linus Pauling was on the DNA trail. But Watson had something as important as raw skill and genius: “He realized that to discover the structure of DNA at that moment of history was the most important thing in biology,” Mayr told the oral history. Although Crick kept veering off into other projects, he said, “Watson was always the one who brought him back and said, ‘By god, we’ve got to work on this DNA; that’s the important thing!’” Knowing the “one important thing” to pursue, Mayr said, “was Watson’s greatness.”
_____
That was only the most successful result of following his instinct; whether getting the Human Genome Project off the ground or running CSHL, Watson was a strong believer in finding truths in his gut. “Jim is intuitive,” MIT biologist H. Robert Horvitz told the oral history. “He had an uncanny sense of science and science problems.”

He came to believe in his intuition about something else: race and IQ and genetics. His gut, he felt, was a stronger guide to truth than empirical research or logic. As a result, “he believed what he believed and wasn’t going to change his view,” the lab friend said. “It’s not as simple as courting controversy for controversy’s sake. But as the scientific environment became even less hospitable to [the “Bell Curve” thesis], he became even more adamant. He loved trashing the establishment, whatever it is.”

Watson’s loss of his CSHL position, the rescinded invitations, the pariah status, also had their effect. The setbacks made him “resentful and angry,” the lab friend said. “‘Saying the right thing’ now translated into ‘political correctness’ in his mind. And that made him say even more outrageous things.”…
_____
At age 90, Watson told friends he did care how history would see him. He did care what his obituaries would say. He knew his racist and sexist assertions would feature in them. Not even that could make him reconsider his beliefs, which only seemed to harden with criticism. Now history can reach its verdict.

Dish Served Cold: James Watson Is Dead · Post + Comments (61)

Thursday Morning Open Thread: Holding Airports Hostage

by Anne Laurie | November 6, 2025, 5:32 am | 181 Comments

This post is in: Open Threads, Technology, Trumpery

Just in time for Thanksgiving.
www.cnn.com/2025/11/05/u…

— Patrick Chovanec (@prchovanec.bsky.social) November 5, 2025 at 7:51 PM

As I remember it, blowback from an FAA shutdown ended the last record-setting Trump shutdown, but I guess Sean Duffy (who just got fired from the part of his job he *enjoyed*) believes in turning lemons into lemonade. Or into cyanide-enhanced Flavor Aid.

===

I am sure this will go over well and will not have any ramifications at all for this administrations rapidly collapsing popularity

— Schnorkles O'Bork (@schnorkles.bsky.social) November 5, 2025 at 5:54 PM

… The reductions, slated to take effect Friday, will probably increase flight delays and cancellations. Further details about the affected markets will be announced Thursday, administration officials said.

The move is unprecedented and could affect cargo operations as well as commercial travelers. It also has the potential to scramble travel plans in the run-up to the busy Thanksgiving travel period, when record numbers of Americans are expected to be on the move.

A 10 percent cut at a major hub like Chicago O’Hare International Airport, for example, could mean 121 fewer flights — or more than 14,500 fewer seats — a day, according to an estimate by Cirium, an aviation analytics data firm.

“There’ll be frustration,” Transportation Secretary Sean P. Duffy said. “But in the end, our sole role is to make sure that we keep this airspace as safe as possible.”…

Federal Aviation Administration head Bryan Bedford, who joined Duffy in making the announcement, emphasized that the aviation system continues to be safe and is running as efficiently as it was before the shutdown. But, he said, an examination of data, including voluntary safety disclosure reports from pilots, highlighted concerns about fatigue among air traffic controllers.

“We are starting to see some evidence that fatigue is building in the system in ways that we feel we need to work towards relieving some of that pressure,” Bedford said…

===

he may not have a choice in this based on staffing and security regulations, but i appreciate the administration's commitment to making sure everyone thinks it's a choice they're making

— GHOULLIKEHELLMACHINE (@golikehellmachine.com) November 5, 2025 at 4:09 PM

===

Trump and MAGA Republicans would rather shut down our air space than agree to protect your health care.

— Senator Ed Markey (@markey.senate.gov) November 5, 2025 at 9:32 PM

===

Sean Duffy, the U.S. transportation secretary, announced that if the government shutdown continued, he would cut air traffic by 10% in 40 key markets by Friday. nyti.ms/49D9uM0

— The New York Times (@nytimes.com) November 5, 2025 at 10:00 PM

===

Feeling good about having bought Amtrak tickets to visit my family this Thanksgiving.

— Nied (@nied.bsky.social) November 4, 2025 at 12:46 PM

===

The US government is aiming to ease the pressure on air traffic controllers suffering shutdown-related woes by curtailing flights. But airlines have experience with this kind of sudden disruption. www.wired.com/story/faa-pl…

— WIRED (@wired.com) November 5, 2025 at 7:02 PM

… Which flights might be canceled, and where, “is data-based,” Duffy said Wednesday. “This is based on, where is the pressure and how do we alleviate the pressure?”

When passengers fly, “they are going to make it to their destinations safely, because we’ve done our work,” Duffy said.

The FAA did not immediately respond to WIRED’s questions, and it’s unclear whether the flight cut will affect only commercial airlines or cargo and private flights as well. A 10 percent reduction in scheduled commercial flights at 40 airports could lead to some 4,000 to 5,000 canceled flights per day.

For airlines and travelers, a sudden cut in flights will likely lead to some serious logistical headaches. Duffy earlier this week warned of air travel “mass chaos” should the shutdown drag on.

But airlines have some experience responding to sudden flight reductions due to staffing issues, says Michael McCormick, a former FAA official who now heads the Air Traffic Management program at Embry-Riddle Aeronautical University.

In the spring of 2023, during another period of air traffic controller shortages, the FAA allowed airlines to reduce their capacities in New York–area airports. (Such reductions usually force airlines to forfeit the right to a takeoff or landing; the FAA temporarily nixed that penalty.) In response, airline schedulers were able to quickly “up-gauge,” compensating for the reduced number of flights by replacing small aircraft with larger ones. That way, cutting flights didn’t necessarily reduce the number of passengers flying overall.

Should the FAA follow through on Friday, airlines will likely be able to pull off a similar up-gauging process, says McCormick. While flights will be canceled and passengers moved around, this could mean that plenty are still able to get to their destinations. The move might actually give airlines more time to prepare…

===

TSA workers and air traffic controllers also have bills to pay and families to feed. I don't blame them for doing what they need to do to make ends meet.
If the shutdown is not resolved by Thanksgiving it will definitely not be safe to fly that weekend. Heck it's less safe now.

— BK. Titanji (@boghuma.bsky.social) November 4, 2025 at 6:44 AM

===
Mr. Pierce, at Esquire — “Transportation Secretary Sean Duffy Is Threatening to Close U.S. Airspace (!!) Right Before Thanksgiving”:

Secretary of Transportation Sean Duffy, whose rise would seem otherwise inexplicable were it not for the guy who put him in that job, has decided to flex and seems to have sprained his brain in the process…

Given the damage done already to the FAA by the DOGE boyos, for which he was the rubber stamp, does Duffy really think this administration could stand the political gaff of restricting air travel around Thanksgiving? Has he thought through the political ramifications of endless hours of video news clips of children sobbing in airport hallways and college students stretched out over rows of seats at every departure gate? The Republican majorities in Congress, and the White House, are already failing in their efforts to pin the shutdown on the Democrats. Does Duffy think inconveniencing thousands of citizens and their families will turn those poll numbers around? If so, I’d like him to tell me where he buys his mushrooms.

===

The U.S. government shutdown has now dragged on for over a month, and shortages of air traffic controllers are straining operations — with flight disruptions piling up nationwide.

— The Associated Press (@apnews.com) November 3, 2025 at 11:00 PM

… My flight was canceled. Now what?
If you are already at the airport, it is time to find another flight. Get in line to speak to a customer service representative, and call or go online to connect to the airline’s reservations staff. It also helps to reach out on X, the site formerly known as Twitter, because airlines might respond quickly there.

Airlines will rebook you on a later flight for no additional charge. The good news right now is that this is not a peak travel time, so travelers stand a better chance of finding seats. But the busy holiday season is right around the corner, and at those peak times, passengers might have to wait days for a seat on a new flight…

Can I ask to be booked on another airline?
You can, but airlines aren’t required to put you on another carrier’s flight. Some airlines, including the biggest ones except Southwest and the discount airlines, say they can put you on a partner airline. But even then it is often hit or miss…

Am I owed a refund?
If you no longer want to take the trip, or have found another way of getting to your destination, the airline is legally required to refund your money, even if you bought a non-refundable ticket. It doesn’t matter why the flight was canceled.

“If the airline cancels your flight, you can say, ‘Alright. I will take a refund and get my money back,’” Potter said. “The important piece is that you get your payment returned to you. You’re not just settling for an airline voucher that may expire in a year or less.”

You are also entitled to a refund of any bag fees, seat upgrades or other extras that you didn’t get to use…

Can I get compensation?
U.S. airlines are not required to pay additional cash compensation and cover lodging and meals for passengers who are stranded, even if a flight cancellation or a severe delay is the airline’s fault.

The Biden administration had proposed a rule that would have changed that and required airlines to pay passengers for delays and cancellations in more circumstances, with compensation proposed between about $200 and as high as $775. But the Trump administration scrapped that last month, and the airlines praised him for doing so…

===

Air travel during the Trump administration:
"Photos from Houston's Bush Intercontinental airport show TSA lines at Terminal E extending outside the building and down the sidewalk Sunday night.
At 6:15PM, new photos showed the same long lines hours later…"
www.click2houston.com/news/local/2…

— Rachel Maddow (@maddow.msnbc.com) November 3, 2025 at 2:32 PM

===

they keep not understanding that people aren’t going to blame the democrats if they do this as long as they keep claiming the authority to do it

— GHOULLIKEHELLMACHINE (@golikehellmachine.com) November 3, 2025 at 12:28 PM

“we will shut down all air travel the week before thanksgiving if we feel like it” would destroy any other administration if it broke through the news

— GHOULLIKEHELLMACHINE (@golikehellmachine.com) November 3, 2025 at 12:31 PM

Thursday Morning Open Thread: Holding Airports Hostage · Post + Comments (181)
