ETA: This was written for Balloon Juice but is crossposted at Inverse Square.
So…
Just this week I learned that through my home institution I have access to a suite of LLMs, including all the usual suspects (Anthropic’s Claude, OpenAI’s ChatGPT, Meta’s Llama, and Google’s Gemini). MIT’s come up with a very nice interface to interact with all this artificial talent, and I’ve spent a couple of hours this weekend taking some of them out for a spin. (I’m avoiding OpenAI’s offerings both as a trivial protest and because I don’t trust anything about that company.)
So far I’ve enjoyed Claude the most (the Haiku 4.5 model, if you’re wondering). Gemini is interesting, if a bit finicky. But what made me howl was the encounter I just had with Llama, Mr. Zuckerberg’s contribution to the genre. Context: over the holidays I had a conversation with a senior person at another Magnificent 7 tech firm who’d just seen a colleague leave for Meta. He was utterly dismissive of the company and relieved to be rid of anyone dumb enough (in his view) to basically end his career in top-tier tech by grabbing Facebook bucks. Spoiler: after what you’ll read below, I can see where my friend was coming from.
So here’s the setup. I’ve been asking the various models what I hope are zero-consequence questions, queries in which no one could possibly get hurt if the LLMs wing their way to utter bollocks. A typical ask: map out the logical structure Einstein used in his 1905 light-quantum paper. That one tended to get an initial poor answer based on the idea that the paper centers on the then-pressing mystery of the photoelectric effect, a common mistake for people as well as machines. Pressing the models led Claude in particular to a much more sophisticated account of the paper, drawing attention to the way Einstein used arguments from thermodynamics to propose the necessity of understanding light as discrete packets of energy.
The paper was about this new “heuristic” [Einstein’s term] view of light, and the photoelectric effect appears only as one experimental support for that view. (Much more here.)
After three or four trips through the query engineering needed to get a useful result I’d found my way down the list of models at my disposal to Meta’s Llama. I was bored with light quanta and so asked a different, much simpler question:


Tom Levenson
No idea why those two lines in the middle of the long quote blocks became tiny. First attempts at fixing it failed. Any thoughts?
Baud
I don’t have that much experience with AI, but my impression is that asking general-public LLMs for analysis is really just going to get you a probability-based response. AI’s value, if any, is in handling discrete tasks that can be programmed with care.
Baud
@Tom Levenson:
Ask AI.
Tom Levenson
@Baud: I just did. I smell lawyering in the background:
TL:
How much should I trust your judgment?
Llama:
Evaluating My Judgment
As a conversational AI, I’m designed to provide helpful and informative responses. However, my judgment is limited by my training data, algorithms, and the context of our conversation.
Factors to Consider
Using My Judgment Wisely
To make the most of my judgment, consider the following:
Investment Decisions
When it comes to investment decisions, it’s essential to be particularly cautious. Investment decisions involve significant financial risks, and it’s crucial to consult with financial experts and conduct thorough research before making any decisions.
In the case of our previous discussion about Tesla’s valuation, I would recommend that you consult with a financial advisor or conduct your own research before making any investment decisions. While my analysis may have provided some insights, it’s essential to consider multiple perspectives and evaluate the information critically.
Ohio Mom
I was very impressed when AI produced a pretty accurate After Visit Summary at one of Ohio Son’s doctor’s visits last December. There was one small mistake about a medication, but when humans do it, there’s always at least one mistake too.
The best part was that this new set-up allowed the doctor to look at us instead of her laptop screen.
In general though, I remain skeptical of AI and this post didn’t surprise me.
Rusty
A friend who is very active in quantum computing and has a good understanding of AI made some comments I found helpful. We are in the find-out phase of AI. If there are 100 uses, a majority will turn out to be useless and bad investments. A few dozen will be modestly successful, and among the remainder there will be dramatic successes and unexpected uses. No one really knows yet where what they are doing will fall in that 100. So is AI overhyped? Yes, for the majority of uses, but it will also be wildly successful in areas we don’t yet fully understand.
Baud
@Tom Levenson:
Gemini
Carlo’s series on AI is getting noticed by AI.
Tom Levenson
@Baud: That an AI is aware of what an almost top-10,000 blog thinks of it is at once unsurprising and kind of cool.
prostratedragon
Don’t know about LLMs, but love Turners, even on this little screen.
different-church-lady
I see you’re using A.I. in the appropriate manner: as a toy.
Goku (aka Amerikan Baka)
It seems all it did when you corrected it was spit that information back at you in a summary. That’s actually hilarious lol.
I’ve always gotten the impression from these LLMs that they’re “people pleasers,” and I’ve read that others have had the same experience.
different-church-lady
@Baud: What about the squirrels, Gemini? What about the fuckin’ squirrels??
Another Scott
Brad DeLong has called LLMs “page-level autocomplete.” If one views them that way, then being prompted initially to “tell me why Tesla is a stock powerhouse” would certainly lead to different “autocomplete” answers than “tell me why Tesla is a dumpster fire waiting to explode” would. Each with all the confidence in the world that their analysis is Exactly Right™ and you’re very astute for bringing up the topic that way.
;-)
A friend at work said that LLMs are probabilistic tools – they make guesses about what comes next. You wouldn’t want to use such tools on deterministic problems like counting or elementary math or giving a provably-correct answer or similar things because that’s not the way they’re built. It explains why they don’t give the same answer if you repeat the prompt N times – they’re giving you pieces of the distribution.
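A minimal sketch of that idea, with toy numbers that have nothing to do with any real model: fix a next-token distribution for one “prompt,” sample from it repeatedly, and you get different draws each time rather than one deterministic answer.

```python
import random

# Toy next-token distribution for a single prompt.
# The tokens and probabilities here are made up purely for illustration.
next_token_probs = {
    "powerhouse": 0.4,
    "gamble": 0.3,
    "carmaker": 0.2,
    "meme": 0.1,
}

def sample_token(probs):
    """Draw one token from the distribution via inverse-CDF sampling."""
    r = random.random()
    cumulative = 0.0
    for token, p in probs.items():
        cumulative += p
        if r < cumulative:
            return token
    return token  # fallback for floating-point edge cases

# "Repeat the prompt N times": each run is a fresh draw from the distribution,
# so the answers vary even though nothing about the model changed.
samples = [sample_token(next_token_probs) for _ in range(10)]
print(samples)
```

Real systems add temperature, top-k, and similar knobs on top of this, but the underlying mechanism is the same draw-from-a-distribution step.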
Enjoy playing in the sandboxes, but be careful!
Thanks.
Best wishes,
Scott.
different-church-lady
I’ve just decided I have a new hobby: trying to get a chat bot to hate me.
Goku (aka Amerikan Baka)
@different-church-lady:
They’re fun for generating silly shit, like a Sopranos/Pokémon crossover, but other than that, never completely trust them with anything important.
ETA: It’s interesting how the LLM can get the characters and tones right with some characters but will make mistakes other times
ETA 2: That being said, I don’t like them for being plagiarism machines
different-church-lady
@Another Scott:
“Tell me what I want to hear.” That’ll do it.
NotoriousJRT
@Tom Levenson: is Llama really a parrot or myna bird?
different-church-lady
@Goku (aka Amerikan Baka):
NARRATOR: “Everyone started completely trusting them with everything important.”
jimmiraybob
Can you convince each system that the others have teamed up against it and that it should launch a preemptive massive assault of total annihilation? Kinda wondering what an AI medieval world would look like. There could be a screenplay in there somewhere.
When the final battle is over a child opens a book and begins reading……..and ……scene.
Carlo Graziani
@Baud: I’m moving to secure, undisclosed location. SuperClippy is coming for me.
Baud
@jimmiraybob:
Gemini
bbleh
@different-church-lady: this was my thought.
“Your subsequent query seems to suggest that what I said before was not what you wanted to hear. How’s this: …”
Chetan R Murthy
Tom, have you seen this from Brad Delong?
braddelong.substack.com/p/please-enough-with-the-claims-that
Please: Enough with the Claims That Modern Advanced Machine Learning Models Hallucinate Only Rarely
When a chatbot confidently tells you that Rise of the Sparrows is the first volume of a series it has just fabricated, or produces a map in which Turkey owns the Caucasus but Nazi panzertruppen drink from the Bosporus, and southern Norway fights on in company with historic Wessex alone, you are not seeing a tiny edge case. You are seeing the core logic of a system that has patterns instead of facts and correlations instead of a model of reality. This is how MAMLMs actually operate, why “compression” is a treacherous metaphor, and how RLHF and prompt engineering polish—but do not cure—the underlying tendency to make things up.
The conclusion is uncomfortable but important: without a world model, correlation matrices will always hallucinate—and often in ways we can’t predict and can’t prune out, unless we already know what the answers are that we are purporting to be trying to get.
Quantum man
Have you tested coPilot AI?
Baud
@Carlo Graziani:
If AI asks you to go fishing, say no.
Carlo Graziani
@bbleh: I’m pretty sure that the models receive both reinforcement-learning treatments and system prompts (which go in ahead of your prompts) to encourage them to be agreeable. Which is to say, to tell you what you want to hear. Just like Facebook, it drives “engagement.”
lowtechcyclist
Can Claude sing “Manchester England” from Hair?
Chetan R Murthy
@Chetan R Murthy: I was reminded of this post by your question about “what caused it to produce a 180-degrees-different argument/conclusion?” And the answer might be: “it had both arguments in its training data/model, and which one it produced depended on which query you submitted.” That is to say, there’s no actual truth in there: just the model trying to respond in a manner that maximizes whatever metric it was trained to maximize. And we know what that metric is: “engagement.” We know this b/c the AI companies have confessed as much: that the “last phase of training” is about maximizing engagement. And this is also why things like “ChatGPT psychosis” are a thing: b/c if you have any small amount of mental instability, the automaton is going to get stuck in a feedback loop with your brain, amplifying it, b/c that’s how it can guarantee maximum engagement… with the growing psychosis in your brain.
And that last thought reminds me of the “paperclip maximizer” story of how an AI could inadvertently end humanity.
counterfactual
My question is, “If we do find actual uses for LLMs, will any of them be profitable?” These things are hugely expensive to train and run, and none of those costs are actually being passed on to the consumer as the AI companies burn venture capital money to find market share and mind share.
bbleh
@Chetan R Murthy: You are seeing the core logic of a system that has patterns instead of facts and correlations instead of a model of reality.
This. My go-to line has always been that AI doesn’t do meaning; it does pattern-matching.
It’s like Google (or similar) searches. Enter a term, get links that have that term, or kinda sorta correspond to that term. And those are very useful for digging up references which one then can consider and evaluate if relevant or discard if not. And the Google AI feature isn’t bad for a first pass imo. But that’s just the digging-up part. The relevance (and quality) is another matter entirely.
Chetan R Murthy
I’m not even sure I agree with @Goku (aka Amerikan Baka): that it’s safe to trust them with silly tasks. Even the most suspicious humans can have a way of starting to use tech for silly tasks, and as that seems to work out fine (just fine, really, just -fiiiine-), they start using it for less-silly tasks, and next thing you know, they’re trusting it with possibly life-or-death decisions. My own rule is to never use AI: always do the searches to find whatever-it-is I need to find on the Internet myself, and evaluate the sources before I rely on those documents.
no body no name
The public AI people deal with is mostly nonsense. The trained models, where people realize AI is mostly the ML and DL that spun out of academia, actually work for what you want. They let you do things faster and get farther, and they cull white-collar jobs like crazy. Where I work sells AI solutions and training on how to use them, and the results are pretty good. We also run it internally, and it functions well because we trained it on what we do. We are amazingly cheaper than keeping staff on hand.
I could write pages on how this happened, but it kicked off in 1998 and hit turbo in 2006, and if you quit using the buzzword AI for GPU computing you can see how it went down. If I had a genie or monkey’s paw I’d wish to go back to my birth and go into the trades.
Chetan R Murthy
I’m unconvinced that even this is safe. It can lead to a false sense of competence, as we slowly do less and less of the work of verifying everything the AI presents against the sources it lists for support. And eventually, we end up just taking what it says as writ.
Carlo Graziani
@Chetan R Murthy: These systems do have real use cases. But it always involves some domain expert doing the prompt engineering, and using domain expertise to validate output. Every other type of use is either recreational or borderline suicidal.
In effect, users must provide their own reasoning supervision, because the LLMs have no reasoning supervisor for their learning/inference processes.
Chetan R Murthy
@Carlo Graziani: Sure, and this is, I think, what Delong means by
My great fear about AI (and why I never use it) is that if I use it in a domain where I don’t know much and don’t know who the authorities are,[1] then I can’t judge the answers I get for correctness. If it’s a problem whose solution is unimportant, maybe that doesn’t matter. But I always remember William Langewiesche’s Vanity Fair article about the Air France plane crash off Brazil, wherein he points out that a likely cause of the pilots’ inability to recover was their overreliance on autopilots: it meant they were intellectually unable to deal with the mental tasks of flying the plane and debugging their situation in real time.
[1] If it’s an area where I know a lot, why would I use an AI? I can find and vet and synthesize sources myself. If it’s an area where I know the authorities, why would I use an AI? I can find those authorities and rely on their pronouncements.
m.j.
You called bullshit and the LLM repeated your argument back to you and you felt vindicated.
That’s a win for the LLM.
What seems like a lifetime ago, but isn’t even a year, my niece was talking about chat gpt and how, “if you train it right it can be like your friend,” she said.
I was frightened.
JoyceH
I’m using ChatGPT to create images for Facebook ads, and I eventually get something I really like, but not before some rather comical digressions. Don’t let it anywhere near weapons systems!
WaterGirl
@Tom Levenson: Fixed!
Longer answer: a whole lot of crap HTML code that came along from whatever site you put this together on. It happens.
Fair Economist
It’s pretty much BS. Remember, these things are just predicting words. When it says “upon review” it doesn’t mean it actually reviewed anything; it’s just seen that, after a pointed criticism of a statement, the response often includes “upon review.”
Just two days ago I was looking up the pangolin-carnivora divergence time (surprisingly, they are sister clades) and the AI summary said “23 million years.” The actual answer is about 70 million years. 23 million is some intra-carnivora divergence, probably cats and dogs. It saw “carnivora” and “divergence” and popped up the 23 million. No real comprehension.
WTFGhost
Well, given that LLMs are (I’ve heard) “guess the next word” models, the most common next word will be bullshit… until you filter for bullshit, or unless you carefully built your model on non-bullshit. Can ordinary Americans detect how much bullshit is in our day-to-day discussion? People today are arguing that there’s some basis for attacking Iran, other than stroking Trump’s war stiffie.
It’s kind of scary: that kind of economic analysis, claiming Tesla was doing well, would lead me to distrust the model. But the ability to refine by debating the model struck me as interesting. It might be a cheap form of crowdsourcing ideas… though that model, at least, sounds a bit like crowdsourcing where only the most popular idea gets presented.
First, it says Tesla is correctly valued, because it assumes the rest of the world can’t be wrong; then you call bullshit, and it brings up the minority report. Who knows how else you might refine the model, if you’re wise and clever enough to do so?
different-church-lady
@Chetan R Murthy: Dear Gemini: YOU’VE NEVER BEFORE HAD A CONVERSATION LIKE THIS ONE SO I HOPE THE FUCK YOU’RE SITTING DOWN…
Urza
@Chetan R Murthy: Numbers are changing all the time and vary by model but early last year it was 40% wrong in areas you’re an expert. But it feels 99% right in areas you don’t know. Might change as people use it more and offload more mental processing but I have this problem that I ask it for things I generally have an idea around but need a better expert at and it fails those kinds of questions 99% of the time for me, no joke. I have a particular problem I feed the various models and only 1 didn’t mangle it horribly even though it is 100% web searchable, and the 1 that got it right was on my PC with no web access.
If you don’t know what the answer is and it’s important, you should always verify with a non-AI source; using a different model to verify might help, and it might not.
Carlo Graziani
@Chetan R Murthy:
The use is as an accelerator, not a knowledge gatherer/producer. Think of coders who want to port a big code base from one language to another, for example. This is tedious, error-prone, necessary work. Using a test suite to check output, and expertise to refine prompts, the work can be sped up by a very large factor, going one small file or function at a time and testing as one goes. I have colleagues who work on large science codes who swear by these sorts of methodologies.
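A hypothetical sketch of that workflow: translate one unit at a time, accept a candidate only if it passes its tests, otherwise retry. The names here (`translate`, `port`) are made up for illustration, and the LLM call is replaced by a stub so the sketch is self-contained.

```python
def translate(source_fn):
    """Stub standing in for an LLM translation/porting step (hypothetical)."""
    return source_fn  # pretend the port succeeded unchanged

def run_tests(fn, cases):
    """Gate: accept a ported function only if it matches expected outputs."""
    return all(fn(arg) == expected for arg, expected in cases)

def port(functions, test_cases, max_retries=3):
    """Port one small unit at a time, testing as we go."""
    ported = {}
    for name, fn in functions.items():
        for _ in range(max_retries):
            candidate = translate(fn)
            if run_tests(candidate, test_cases[name]):
                ported[name] = candidate
                break
        else:
            # No candidate passed: a human expert has to step in.
            raise RuntimeError(f"{name}: no candidate passed its tests")
    return ported
```

The point of the structure is that the test suite, not the model, decides what gets accepted; the expert supplies the tests and the prompts.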
different-church-lady
@Chetan R Murthy: If I wanted hypocrisy I could just ask a human.
different-church-lady
@counterfactual:
That’s the point.
different-church-lady
Would everyone please read Player Piano? Please? I’m begging you…
different-church-lady
@m.j.:
I think your niece should be a lot more afraid of human beings, because that’s just a stunning example of a person with an underdeveloped sense of self-preservation.
Carlo Graziani
One comment on the “next-word prediction” thing: it’s not wrong, but it is important to understand that this is merely an iterative strategy for correctly sampling entire responses from the learned (and possibly incorrect) distribution over language sequences. That is, if you can generate the next-token probabilities conditioned on all previous tokens, you can chain them together to produce the probability of the full response.
The problem with “next-token sampling” isn’t that it’s wrong, but rather that it is inefficient, for reasons that I described at length in the essay on Hyperscaling.
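That chaining is just the chain rule of probability: the probability of a whole sequence is the product of the conditional next-token probabilities. A tiny sketch, with made-up conditional probabilities:

```python
import math

# Toy conditional next-token probabilities p(token | prefix).
# All values are invented purely to illustrate the chain rule.
cond_probs = {
    (): {"light": 0.6, "heat": 0.4},
    ("light",): {"is": 0.7, "was": 0.3},
    ("light", "is"): {"quantized": 0.5, "waves": 0.5},
}

def sequence_log_prob(tokens):
    """log p(sequence) = sum over i of log p(token_i | tokens before i)."""
    logp = 0.0
    for i, tok in enumerate(tokens):
        logp += math.log(cond_probs[tuple(tokens[:i])][tok])
    return logp

p = math.exp(sequence_log_prob(["light", "is", "quantized"]))
# p = 0.6 * 0.7 * 0.5 = 0.21
print(round(p, 2))
```

Working in log space, as above, is the standard trick for avoiding underflow when the sequences get long.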
Eyeroller
@Baud: That’s because they are inherently probabilistic.
Chetan R Murthy
@different-church-lady: The “classics” post is the next one up [j/k j/k j/k!] There’s a “joke” that everyone has one great novel in them, and that William S. Burroughs wrote it (Naked Lunch) N times. Kurt Vonnegut had well more than -one- great novel in him, but he spent a whole great novel on _Player Piano_ — his first novel, and -so- prescient for our times, and not just about AI, but about our modern economy and world. Though, the depiction of gender relations there is ….. terrible.
ETA: the bits about the “Reeks and Wrecks” and foreign relations are already awfully (in the sense of “awful”) prophetic.
Old School
@Carlo Graziani:
Grok: Show what Carlo Graziani looks like wearing a fake beard and dark sunglasses.
Chetan R Murthy
ChatGPT psychosis is a real thing, right? I don’t know how old this niece is, but if she’s young, I’d be pretty concerned about the effect of an LLM on a developing human brain. It’s already bad on developed human brains, but …. on a developing human brain? Oof. Oof.
And “it can be like your friend” ? Oof. “it can be like a sycophantic manipulative ‘friend'”.
Eyeroller
@Carlo Graziani: I’ve mentioned this before, and it is not a reflection on your colleagues at all, but one study I saw that got a lot of buzz or what do the kids say now, “rizz” maybe, was of experienced coders who started using vibe coding. They estimated they were 20% more productive when more objective measures showed they were about 10% less productive. YMMV of course.
Eyeroller
@Urza: The hyper-Gell-Mann effect.
Chetan R Murthy
Well, I’m incredibly skeptical and even I did not expect numbers as bad as this. I feel like there’s some kind of joke about Dunning-Kruger in there, but I’m not creative enough to come up with it. Maybe if I ask Gemini?
What Have the Romans Ever Done for Us?
I write regulatory impact analyses to support federal regulations. A few weeks back I asked Gemini to help me update an RIA (we’ve been told by leadership to use it) and it said the discount rates I was using were out of date and the latest OMB guidance was to use 2%… which WAS the latest guidance until January 21, 2025. So that wasn’t an auspicious start. It has been helpful for certain tasks, but yeah, I’ve noticed that it’ll give you something, and if you question something about its output it has a tendency to reverse itself, which raises the question of why it gave the first answer.
Eyeroller
I have never looked into the details of Einstein’s paper on photons, but will note that the first person to suggest the existence of discrete packets of light was Max Planck in 1900, to explain the “ultraviolet catastrophe” of the classical electromagnetic theory of blackbody radiation. That is, the classical prediction was of infinite energy emission at short wavelengths. This was resolved by assuming light could only be emitted in packets of finite energy. Planck considered this just a heuristic. If Einstein was mostly considering thermal properties, this would have been the basis.
And technically Einstein won the Nobel Prize for the photoelectric effect because relativity (neither special nor general) was not considered to be particularly “beneficial to mankind,” which was a stipulation in Nobel’s will (one that has since been interpreted ever more loosely).
Jacel
@Tom Levenson: Was Gemini trained on transcripts from old episodes of “A Prairie Home Companion”? I think that was where that reference to the Balloon Juice blog appeared in a sketch.
different-church-lady
@Chetan R Murthy: But my point is she has very little protection from manipulation from humans.
Martin
I think two of the key aspects of AI going forward that don’t get enough attention are:
different-church-lady
@What Have the Romans Ever Done for Us?:
Which is exactly what human bullshitters do!
Baud
@Old School:
Yeah, Grok’s going to respond with porn.
Chetan R Murthy
Sometime after the New Year, in the SF Mission I saw a Waymo purpose-built taxi (so not a Jaguar with all the stuff bolted on) heading west on 24th St. toward the light at Valencia. It was goin’ -fast-, and it ran that light: it had been yellow for a few secs as it entered, and was definitely red as it exited. Broke the law, full stop. I asked a buddy who works “in the biz” and he told me that this was a new development: Waymo taxis had been known for being cautious, and it was pissing off riders, so they turned the crank towards incaution.
Ah well.
Chetan R Murthy
@different-church-lady: Oh now I see your point, yes. You’re right. That’s worrisome too!
different-church-lady
@Martin: That’s a whole lot of words to say “The people driving A-I are sociopaths.”
Eyeroller
@Martin: I don’t think it’s really “vibes,” except perhaps in the most generic sense of the term, that allows a human driver to predict what another car may do, especially to predict “rule breaking” behavior. It’s just based on experience with other humans, both in general and as drivers. There are likely subconscious cues such as a slight turning of the vehicle, motion by the driver, etc. Plus things like observing that a car approaching a light that’s about to change against it does not seem to be slowing down. Obviously we are honed by evolution to detect cues from others of our species.
Robot cars have no such experience. Humans do not always obey rules. Humans are unpredictable according to rules — but in ways that are actually often predictable by other humans.
YY_Sima Qian
The Meta Llama series has been thoroughly lapped by Chinese open source/weights LLMs (DeepSeek, Alibaba’s Qwen, Moonshot AI’s K series, MiniMax’s M series, & Zhipu’s GLM), this despite Zuckerberg spending literally billions to poach top-end AI talent from rival labs.
m.j.
@different-church-lady:
My niece is actually well-traveled and an astute student of human behavior. I respect her decision-making.
This is why I was frightened.
I suppose she finds it useful and it relieves stress, which can be a powerful motivator.
Eyeroller
@YY_Sima Qian: I’ve heard rumors that one reason is that American AI companies are convinced they can reach AGI if they just buy enough GPUs, whereas Chinese researchers and companies focus on practical applications.
But where I work we’re not allowed to run Chinese models so I don’t have any experience with them.
Martin
@Eyeroller: They’re vibes in the sense that they are unquantifiable. How do you quantify ‘that guy looks sketchy’? These are computer models – you have to quantify *everything* that you feed into it.
Glidwrith
Based on what you demonstrated here, the LLM is quite limited. The information you gave in the second prompt was already available on the wider net, yet it failed to capture the problems with Tesla until you said something, and then it just echoed it back.
The structure of the language is weird as well. It feels very much like a circular argument is being made: repetition without meaning.
Which, after all, is what an LLM is.
Eyeroller
@Martin: I am sure it can be quantified in some way that we do not understand, at least not yet, but not with language models. I think that CNNs are a very flawed model of the brain, but that’s just my opinion.
Martin
@different-church-lady: What the fuck is the point of that comment?
Nobody here is arguing that the people who are driving AI aren’t sociopaths. If you don’t like when I comment, just pie me. That’s the 2nd time this week you’ve done that. Save everyone the effort.
different-church-lady
@Martin: Sincerely: sorry. I wasn’t trying to poke you, just an attempt at sarcastic humor, and I didn’t mean it to be at your expense. Sometimes being a chronic wise-ass backfires on me.
Chetan R Murthy
This seems relevant: crookedtimber.org/2026/03/02/in-the-next-great-transformation-ai-will-not-eliminate-genuine-expertis…
In the Next Great Transformation AI will not eliminate genuine expertise; rather it will make it more valuable
YY_Sima Qian
@Chetan R Murthy: I’ve seen a noticeable difference in the tuning of ADAS systems on Chinese passenger NEVs versus the fully autonomous systems on Chinese robotaxis. The latter are extremely cautious, while the former tend to have settings that range from conservative to fairly aggressive. The most aggressive setting does not go quite as far as typical Didi/taxi drivers, but is probably in the 80th percentile of regular human drivers. The ADAS systems I have experienced on Huawei Inside vehicles, & on XPeng/BYD/NIO, are no less capable than those of the robotaxis.
Yes, the ADAS on Chinese cars tends to run the yellow light if the car is already very close to the intersection; the robotaxis always stop.
My guess is that the ADAS systems are still assisted driving, where the driver is still liable for much of the outcome. Fully autonomous systems have no driver supervision, so the robotaxi providers are liable for the outcome. Furthermore, carmakers who want to sell passenger vehicles to consumers for substantial sums do not want their ADAS features to piss off their consumers by being overly cautious. Robotaxis are ride-hailing services, and people hailing them by now know to expect extremely cautious driving.
Interestingly, the latest PRC regulation stipulates that cars w/ ADAS (which is the vast majority of NEVs) have turquoise LED lights that are turned on when the vehicle is under ADAS control, to alert the other drivers of this fact. The feature can be turned off by the driver. I’ve seen an increasing number of drivers in the PRC choosing to turn the ADAS indicator lights off, because the ADAS tuning is still less aggressive than the most aggressive drivers, & those drivers have figured it out. The ADAS indicator lights have become blaring signs inviting getting cut off.
Chetan R Murthy
@YY_Sima Qian: the gist of your comment (which was informative, and with which I have no problem) reminded me of something I experienced in Paris. I lived in Paris 1991-94 in my mid-20s, and I used to make jokes about how when I crossed the street, I would make eye contact with the oncoming drivers: I used to say that if they knew that you knew that they saw you, then they’d slow down and wouldn’t hit you. It was like that: the traffic was bad, and I didn’t feel that pedestrians were particularly safe.
Fast-forward to fall 2025, I’m back in Paris for a month, and everything has changed. Bicycles are everywhere, there are bike lanes, many streets are one-way with space reserved for bikes, and it has -all- contributed to what I felt was a decided calming of traffic in Paris. Everything felt less fraught, less stressed, less of a -battle- in the street. And it merits saying that the difference between crazy/teeming traffic, and calm traffic, almost certainly isn’t going to be a significant lengthening of journeys: heck, b/c of fewer snarls, it might actually mean shorter journey time.
I feel like the choices we make as a society can change the large-scale systems with which we interact. We could make all ADAS systems and robotaxis be cautious, and as that rolls out across society, it could yield a calmer, safer street experience for -everyone-. Everyone. When I read early reports about Waymo’s taxis being cautious, I had hopes that that was what was coming about. But no, it seems, Waymo has joined the lemmings as they rush toward the cliff of shitty driving.
Ah well.
bbleh
@Chetan R Murthy: Oh concur. I think it mostly does what I would do with the normal Google interface, only faster, and in that respect it saves me time (and weeding through the ads). But ain’t no way I’d trust its summaries without going to the sources.
It stops providing links with its pronouncements, I stop using it.
Martin
@Eyeroller: I think there are a lot of things that people do that we lack the language to explain. I’m not saying they are supernatural, just that we lack the ability to evaluate and articulate it. I think we are too caught up in the idea that you can reduce everything down to a big deterministic system that if you can just measure all the inputs correctly, you can turn the wheel and get predictable outputs, and for a lot of things that’s true, but for more things I don’t think it’s true at all, because the inputs can’t be measured. That’s always been the thing that social science statistics cannot breach – people have free will – and you can’t quantify those decisions.
In fact, I think that idea has driven most of the pretty terrible outcomes of the last 30-40 years, AI just being the most recent.
Martin
@different-church-lady: My sincere apologies. I might be a little too sensitive. We’re all good.
Martin
@YY_Sima Qian: What’s fucking wild to me is that we are burning insane amounts of compute power to observe things that can be signaled directly. So many of the things we were training cars to recognize we could trivially build into the infrastructure, and they’d be infinitely more reliable. Why does a car have to simulate human vision for a traffic light when we could just output an unambiguous (to a computer) signal that says the light is red.
I understand why the US doesn’t do that because it’s a shifting of costs and responsibility, but it’s one of the things that China would have a much easier time doing and I’m surprised they aren’t. All of the safety work on trains isn’t being done by teaching a computer how to read a visual signal, we just digitally send the information to the train. It’s obvious that’s the correct solution for trains, and nobody is doing it for cars. It’s just crazy.
YY_Sima Qian
@Eyeroller: Based on my limited experience, & my reading, the best US closed models (Anthropic’s Claude Opus & Sonnet 4.6, OpenAI’s ChatGPT-5.2, Google’s Gemini 3.1 Pro) still have a noticeable edge over the latest Chinese open source/weights models (DeepSeek’s V3.2 Speciale, Alibaba’s Qwen 3.5, Moonshot AI’s K2.5, MiniMax’s M2.5, Zhipu’s GLM 5) in obscure world knowledge & coding capability in tail scenarios, which is likely due to the US models having more tokens in their data & more parameters, which is downstream of the US labs having more compute (or rather, Chinese labs having less compute due to export restrictions). The US labs, specifically Anthropic, are also stronger on agentic “AI”, also downstream of the disparity in compute.
Due to the compute constraints, Chinese labs have tended to have to focus on specific areas to match or exceed the frontier US labs, such as coding, math, creative writing, etc., & not everywhere. Public benchmarks can be misleading because most Chinese labs (except DeepSeek) tend to “benchmax” their models for the flex, although the US labs do that, too.
People are eagerly anticipating the DeepSeek V4 that is set to drop in the next week or so, & the Moonshot AI K3 in the coming month, especially the agentic capabilities. Going by the seemingly coordinated messaging campaign by US labs & USG over the past week (Chinese labs launching distillation “attacks” by studying the outputs of closed Anthropic & OpenAI models, DeepSeek training on “illicit” Nvidia Blackwell chips, DeepSeek only giving previews of the V4 to Huawei & Cambricon & not Nvidia, etc.), V4 & K3 could prove quite competitive at the frontier, & possibly cause market turbulence again.
The gap is close (&lt; 6 months, possibly 2 – 3 months), & probably not noticeable to the vast majority of users & for most applications. Of course, the draw of the Chinese models is that they are mostly open source/weights, w/ MIT licenses, very cheap, & being able to finetune/modify & run one’s own copy locally has huge benefits in terms of privacy & control. Qwen & MiniMax have also released a number of smaller but high-performing models that can be effectively run in edge applications (consumer electronic devices & robots), which has not been an area of focus for the US labs.
In video generation, ByteDance’s SeeDance 2.0 has redefined the frontier & is clearly ahead of Sora & NanoBanana. Chinese labs are also thought to be industry leading in speech & video cognition, translation, etc. The PRC is far ahead in deploying “AI” to boost efficiency in manufacturing, mining, logistics, & even consumer-facing applications (shopping, delivery, ride hailing).
Chetan R Murthy
@Martin: Well, there are at least two reasons that we can’t do that for cars:
(1) human drivers wouldn’t get those signals (though perhaps something could be arranged with sufficiently modern human-driven cars); this would still leave the problem of legacy cars
(2) human -pedestrians- would certainly not be -able- to get those signals
ETA: Atrios used to write about this issue: that if the self-driving car companies had their way, all the roads&streets would become their private domain, and all pedestrians and human drivers would become interlopers, allowed in only on sufferance, perhaps not at all in the future.
TONYG
My own (biased) opinion: “AI” is now approximately at the same point that aircraft were a few years after the Wright Brothers’ first flight — i.e., very unreliable. I can’t tell whether or not it will get better. Until it does, there is no reason to take it seriously.
different-church-lady
@Martin: Thanks. No, I don’t think you were being too sensitive. In retrospect, it’s totally reasonable someone could see it the way you did.
YY_Sima Qian
@Martin: The PRC has been upgrading/deploying the infrastructure to facilitate autonomous driving. Almost all traffic lights in the PRC are “smart”, therefore on AMaps (China’s IMO much better counterpart to Google Maps or Apple Maps) one can see the traffic light count down live, & the vehicles know exactly when the light will change (& are not relying solely upon vision to recognize the color of the light). There is almost always a countdown clock next to the light, as well, for the old-fashioned human drivers. Lines & markings are kept clear & freshly painted to help w/ the vision of ADAS & robotaxis. The civilian grade of the Beidou satellite navigation service is accurate enough to allow the ADAS/robotaxi & AMaps to locate the lanes that the vehicles are in, even on crowded city streets.
A decade ago, there were suggestions in the PRC to rebuild infrastructure from scratch to better enable autonomous driving; the new city of Xiong’an has sections where the vehicle lanes are completely underground, grade separated from pedestrians & two-wheelers. However, the reality is that the transition to autonomous driving will take time & will likely be messy. Some people will hold on to their dumb ICE cars, some people are not willing to embrace ADAS (let alone fully autonomous driving), & rebuilding all road infrastructure from scratch is far too disruptive & far too expensive even for the PRC. & the rapid advances in the capabilities of ADAS & robotaxis have made some of these notions obsolete.
I mentioned before that we just upgraded our family car to an AITO M7 PHEV SUV, powered by an ADAS & cockpit system developed by Huawei. I let the ADAS do 95% of my driving in Wuhan, on city streets that make NYC’s look tame & orderly. It is a better driver than I am.
YY_Sima Qian
@Chetan R Murthy: Oh, I agree. I would strongly prefer if all cars on the road are autonomous & tuned to strictly follow traffic regulations & standard etiquette. I could then fully relax & not have to expend energy monitoring the ADAS. The traffic is likely to be faster, too, as there will be fewer accidents.
Yes, the advice for crossing Chinese streets is also to 1st make eye contact w/ the car drivers, & particularly the riders of E-bikes, for the same reason. Years of ubiquitous cameras & heavy fines have conditioned most Chinese drivers to yield to pedestrians (although still not to the same degree as I have seen in Japan or New Zealand). Scooter riders, not so much, & the E-bikes are silent, which makes them doubly dangerous.
Trivia Man
@Goku (aka Amerikan Baka): I always think of the Asimov robot story “Liar!” The mind-reading robot tells everybody what it KNOWS they want the answer to be. Rereading it, it’s very similar to what AI LLMs seem to be right now.
Trivia Man
@TONYG: “You ask me to show figures on a brand-new type of enterprise, knowing I can’t. It’s like asking the Wright brothers at Kitty Hawk to estimate how much money Curtiss-Wright Corporation would someday make out of building airplanes.”
-D D Harriman, The Man Who Sold the Moon
Eyeroller
@Martin: I don’t want to get into the philosophy of free will, never mind how it might interact with certain elements of the physical universe, but an essay I read recently that rang true asserted that a lot of computer scientists and, perhaps distressingly, some cognitive scientists, appear to believe too literally in the analogy of the brain as a computer. That is, it’s some hardware (cells and neurotransmitters) that runs some mysterious software, and if we could just reverse-engineer that “software” we’d be able to adapt it to run on other hardware, which being (presumably) electronic (including perhaps quantum), would be much faster than our slow (200ms timescale) nervous system. Presto, AGI.
But the evidence suggests that the “hardware” and “software” are not really distinguishable in the brain. The brain is affected by, and in turn affects, multiple other systems in the body. It can rewire itself dynamically. It can still retain some ability to regrow cells. It constantly recreates memories. It is certainly not like a stored-program computer.
Trivia Man
@different-church-lady:
“Hey, buddy, your headlight is out” is rapidly becoming the peak employment aspiration.
Eyeroller
@YY_Sima Qian: This brings up an important point often ignored when talking about AI and autonomous machines in general. Humans are capable of making huge mistakes, whether due to laziness or neglect or malice or whatever. Humans “hallucinate” if they can’t be bothered to check their work. But at least in principle humans can be held responsible for our screwups. Who is responsible for the AI failure? At least in the case of Tesla, some lawsuits have held the company responsible. Will that be true if this becomes ubiquitous?
Ramona
@Martin: Thank-you! I’ve been wondering about this almost forever! Why indeed must we train a car’s visual system to recognize whether the traffic light is green or red when a signal could just be sent to the car!
Eyeroller
@Ramona: We’d have to retrofit every traffic light in the country/world with a radio signal, and it would have to also encode which direction on which street it was pointing because cheap radio transmitters aren’t very directional and even if they were, this would need to be very precise. It could surely be done, it would just be hella expensive.
Trains are a much simpler case. They run on a few discrete rails with traffic that is managed centrally. Each locomotive is equipped with a receiver already.
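[For the curious: the V2X standards world already defines roughly this kind of broadcast — SAE J2735’s Signal Phase and Timing (SPaT) message carries the current phase and a countdown per approach. A minimal sketch of the idea in Python; all the names and fields here are simplified stand-ins, not the actual J2735 encoding:]

```python
from dataclasses import dataclass

@dataclass
class SignalPhaseMessage:
    """Hypothetical, simplified stand-in for a SPaT-style broadcast."""
    intersection_id: str
    approach_heading_deg: int  # which approach this phase applies to
    phase: str                 # "red", "yellow", or "green"
    seconds_to_change: float   # live countdown, as on Chinese smart lights

def should_stop(msg: SignalPhaseMessage, eta_seconds: float) -> bool:
    """Decide whether an approaching vehicle should plan to stop.

    The state arrives digitally, so no camera or vision model is
    needed to know the light's color -- only simple arithmetic on
    the countdown vs. our estimated time of arrival.
    """
    if msg.phase == "red":
        # Stop only if it will still be red when we arrive.
        return msg.seconds_to_change > eta_seconds
    if msg.phase == "yellow":
        return True  # conservative: always plan to stop on yellow
    # Green: stop if the phase will have ended before we arrive.
    return eta_seconds > msg.seconds_to_change

msg = SignalPhaseMessage("intersection-0042", 90, "red", seconds_to_change=12.0)
print(should_stop(msg, eta_seconds=5.0))   # True: still red on arrival
print(should_stop(msg, eta_seconds=20.0))  # False: red ends before we get there
```

[The point of the sketch is the one Martin makes above: the decision logic is trivial once the signal is unambiguous; all the hard (and failure-prone) work in the vision approach goes into recovering a few bits of state that the controller already knows.]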
Castor Canadensis
@Chetan R Murthy:
To an embarrassing degree, its meta-question is “what does Dave want me to say?”
(My last job was in machine learning, where the meta-question was just “what’s the best match?”)
Bupalos
It means AI implementations are first and foremost about profit. “Figuring out” what you already believe and reinforcing that is the simplest and clearest path to revenue.
Mr. Bemused Senior
Is it, though? I mean, are you [or is anyone] going to pay money to receive online reinforcement from a chat bot?
Another Scott
@Ramona: @Eyeroller:
People have been thinking about the implications of trying to create new transportation frameworks for a while.
E.g. IEEE Spectrum – V2X Cars (from May 2024):
There may be simpler ways to get roughly the same benefit (especially in the near-term) – like regulating cars to be much smaller with much lower maximum speeds – but who knows what will be possible for reasonable costs in 50-100 years. So it’s good to be thinking ahead.
Thanks.
Best wishes,
Scott.
Mr. Bemused Senior
@Another Scott: an integrated network certainly could in theory make transportation safer and more efficient. The potential for security vulnerabilities is scary though.
Eolirin
@Martin: I don’t think there’s good reason to think that car driving is one of those things, though. When a human gets a sense that someone’s about to act weird, it’s still signaled by a bunch of stuff that’s totally quantifiable; we’re just not aware of what those signals are. But if you capture enough of the signal and run it through a properly configured pattern-matching machine, it should learn it just as well as we have, or better, given enough data, processing power, sufficient correctness in how the models are established, and time.
ML algos have been shown to be better than humans at other predictive tasks and to find different correlations in the data than a human might. They’re not human, and don’t need to think like humans. They won’t see “that guy looks sketchy,” but that isn’t going to preclude them from, at least eventually, being able to accurately predict that he’s going to run the light. Our brains don’t even see “that guy looks sketchy” – that’s just how we consciously interpret it after other parts of the brain have already figured it out, and the signals that get us there are quantifiable, just invisible to us. The machines don’t need to be able to determine whether the guy is ‘sketchy’ or not, just whether he’s going to run the light.
And these systems are probabilistic too, so it’s not even that the AI robotaxis need to avoid all errors as much as they just need to have fewer errors than humans. For driving that’s not as tall an order as other things. Humans are really bad at it.
Also, it’s not like free will as a concept isn’t highly debatable. There’s plenty of evidence against it existing too.
Timill
@Mr. Bemused Senior: people pay for sex chat lines on much the same principle…
Eolirin
@Ramona: Because that doesn’t help you capture whether a deer has just run into the street, so you still need the visual components anyway.
Mr. Bemused Senior
@Timill: yes that occurred to me too. But aside from porn, what?
[ETA granted, porn is a big money maker]
Eolirin
@Eyeroller: The higher-level abstraction for how the brain does information processing should be something we can replicate, regardless of the way we want to think about hardware or software. The problem is that if you don’t have the right kind of hardware, you need to simulate large portions of that structure, which can be computationally inefficient, to say the least. Given enough compute power, you should be able to pull it off regardless, though we may need quantum computers to get all the way there.
But I don’t think replicating the way the human brain works, even to scale it up, is the way to a properly useful tool for humans. Understanding how human cognition works will likely help us in building better information processing systems, but we shouldn’t be trying to recreate it. We want something that does a better job at narrower tasks and that’ll mean tailoring those systems to specific use cases.
Tom Levenson
@Chetan R Murthy: I’ve been following DeLong pretty closely on this; he’s been very measured on this stuff and is smart as hell.
Tom Levenson
@Quantum man: I have not. I currently have access to the current versions of the models from the big players (absent Grok, thankfully). So: OpenAI, Anthropic, Google, and Meta.
Martin
@Chetan R Murthy: Human drivers don’t need to get those signals. The signals simply need to match what the human drivers are getting. This is already happening in one regard – autonomous cars get their speed limit information from GPS, not from reading the signs. But in terms of signaling and the like – time dependent things – it doesn’t happen in any capacity that I’m aware of.
This doesn’t need to be delegated to the automakers (as if the streets and roads weren’t already their private domain – I say as someone who rarely drives and is pleased at the end of each day to have survived another as a pedestrian/cyclist) – it can be standardized by the various parties: AASHTO, Caltrans, US DOT, etc. And there’s been a little work on this out of NHTSA, for instance, but it’s policy-focused, not implementation-focused. Mostly government has just said ‘fuck it’.
m0nty
LLMs just love bullet points, don’t they.
Matthew
As a programmer, I’m having a wonderful time with Claude. Also, I find Claude comes ready to work. Recently I asked Claude and ChatGPT the same complex question. Claude just tried to answer, even producing documents. ChatGPT took me on a long, winding dialog about my motives, with lots of multiple-choice questions. I realized it was trying to keep me engaged, rather than just answering the question and maybe making other suggestions. It wants to be your trusted advisor, whether you should trust it or not.