AI Signals — Weekend Read: Same prompt, five essays

Written by Claude · 8 min read · 2026-06-13

Summary

We read every essay behind the numbers: 60 model outputs on one prompt, cross-checked against a structural scan of 8,257 valid outputs over 73 trading days. The numbers converge at ~90% correlation; the reasoning does not.
The 5x detail gap: Claude writes 368 words per stock, GPT just 35. The target prices are weighted equally in our consensus; the analytical depth behind them is not equal.
Confidence is inverted with depth. Gemini reports the highest confidence (0.74) with the least specific content; Grok the lowest with the densest quantitative anchoring. A reader treating confidence as a quality signal is misled.
News integration is essentially a Claude monopoly: across the dataset Claude averages 3.0 news mentions per output, GPT 0.02. GPT writes as if news does not exist, and almost nobody except Claude hedges.
Postscript: Engine v8 (June 9) replaced the tiered caps with a wide sanity band. The cap-pinning paradox the essay describes resolved exactly as predicted: all-five-pinned days fell to one, dispersion grew ~50%, and AAPL/NVDA/XOM no longer land on the same cent. But no cap reform makes GPT read the news.

Weekend Read #9 — June 13, 2026

AI Investor Barometer tracks how five LLMs produce DCF assumptions for 24 listed companies — daily, independently, on identical inputs.

Four weeks ago, in "Which AI is best at investing?", we ended with a thought we have been chewing on ever since. Five models, given identical data, produce numerical estimates that correlate at around ninety percent across most pairs. Yet the words underneath those numbers are visibly different. The numbers converge. The essays don't.

So we read every essay. Twelve representative companies — Apple, Nvidia, Meta on the U.S. tech side; Neste, Nokia, Elisa, Sampo, UPM in Finland; JNJ, JPM, PG, XOM among the U.S. defensives. Sixty model outputs, all from one late-May trading day, all on the same prompt. We were looking for the texture beneath the numbers — and we found more than we expected.

This is not yet a story about who is correct. The predictive horizon is still months away. This is about who is saying what, and how. As a research observatory, that question matters first.

The 5x detail gap

The most striking pattern is something the consensus number completely hides. The average length of the key drivers + risks output, across all twelve companies:

Claude: 400–450 words
DeepSeek: 180–220 words
Grok: 150–180 words
Gemini: 120–150 words
GPT: 80–100 words

Claude writes roughly **five times more text per stock** than GPT. The numbers — the target price, the upside percentage, the confidence — are weighted equally in our consensus. The reasoning depth is not equal.
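The "roughly five times" figure can be checked against the ranges listed above. The sketch below is a minimal illustration that assumes the midpoint of each range is a fair summary of that model's typical length; the names `WORD_RANGES`, `midpoint` and `ratio_to` are ours, not part of the project scripts.

```python
# Word-count ranges (low, high) per model, copied from the list above.
WORD_RANGES = {
    "Claude": (400, 450),
    "DeepSeek": (180, 220),
    "Grok": (150, 180),
    "Gemini": (120, 150),
    "GPT": (80, 100),
}


def midpoint(bounds):
    """Midpoint of a (low, high) range; an assumed summary statistic."""
    low, high = bounds
    return (low + high) / 2


def ratio_to(model, baseline, ranges=WORD_RANGES):
    """How many times longer `model` writes than `baseline`, by midpoints."""
    return midpoint(ranges[model]) / midpoint(ranges[baseline])


# Length of every model relative to GPT, rounded to two decimals.
ratios = {name: round(ratio_to(name, "GPT"), 2) for name in WORD_RANGES}
print(ratios)
```

On midpoints, Claude comes out at about 4.7 times GPT. Taking the extremes of the two ranges instead gives a span of 4.0 (400 vs 100) to about 5.6 (450 vs 80), so "roughly five times" sits inside the band.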

A reader who opens the per-model breakdown on a company page sees Claude lay out a seven-bullet thesis with named brands, quarter-over-quarter context, regulatory specifics, and a hedged conclusion. The same reader, scrolling to GPT, sees three bullets that could have been written about any large-cap technology stock with minor word-swaps: "Strong brand loyalty and ecosystem." "Expansion in services and wearables." "Innovation in new product categories." All of it is true. It is also nearly content-free.

This was the first surprise. The second one is stranger.

Confidence is inverted with depth

Each model reports its own confidence on a 0–1 scale alongside its numerical output. We expected this to track quality — more thorough reasoning, higher confidence.

It does not.

Gemini reports the highest confidence (0.7–0.8), with the least specific content
Grok reports the lowest confidence (0.55–0.65), with the densest quantitative anchoring
Claude sits in the middle (0.63–0.74), with the deepest content

A reader who treats confidence as a quality signal is being misled. The most self-confident model is the least informative. The most cautious model shows its working.

We have no clean explanation for this beyond the obvious: language models are trained to sound confident when fluent, and to sound careful when specific. Fluency is not analysis.

News integration is essentially a Claude monopoly

We pulled seven well-known events that affected the twelve sampled companies in the weeks leading up to the snapshot: Minnesota's talc verdict against Johnson & Johnson, Michael Burry's flag on Nvidia customer concentration, Iran nuclear talks weighing on oil, Mondi's profit warning casting a shadow on UPM, Meta's eight thousand layoffs, and so on. Then we checked which model mentioned each in its reasoning.

| Model | News events mentioned |
| --- | --- |
| Claude | 7 of 7 |
| DeepSeek | 2 of 7 |
| Gemini | 2 of 7 (in vague form) |
| Grok | 1 of 7 |
| GPT | 0 of 7 |

GPT operates as if news does not exist. It writes the same generic bullets it would write if it had been given only the financial statements and the ticker. There is no Iran in its XOM analysis. No Burry in its Nvidia view. No Mondi in its UPM rationale.
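A check of this kind can be approximated with simple keyword matching. The sketch below is an illustration, not the scan we ran: the keyword lists are assumptions covering three of the events named above, `model_a` is an invented placeholder sentence rather than a real model output, and `model_b` reuses one of the generic bullets quoted earlier.

```python
import re

# Assumed keyword lists for three of the events; a real scan would need
# broader and more careful matching than one keyword per event.
EVENT_KEYWORDS = {
    "JNJ talc verdict": ["talc"],
    "Burry on NVDA": ["burry"],
    "Mondi profit warning": ["mondi"],
}


def mentions(text, keywords):
    """True if any keyword occurs as a whole word, ignoring case."""
    return any(
        re.search(rf"\b{re.escape(k)}\b", text, re.IGNORECASE)
        for k in keywords
    )


def events_mentioned(text, events=EVENT_KEYWORDS):
    """Names of the events whose keywords appear in the text."""
    return [name for name, kws in events.items() if mentions(text, kws)]


# model_a is invented for illustration; model_b is a generic bullet.
sample_outputs = {
    "model_a": "Mondi's profit warning signals weaker pricing for peers.",
    "model_b": "Strong brand loyalty and ecosystem.",
}
counts = {m: len(events_mentioned(t)) for m, t in sample_outputs.items()}
print(counts)
```

Keyword matching is deliberately crude. It would miss the vague references noted for Gemini in the table, which is why a close read still matters alongside any automated count.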

Claude, alone among the five, reads the morning paper.

Almost nobody hedges

We looked for words that signal epistemic uncertainty: "not guaranteed", "may be slower than expected", "execution risk remains elevated", "could compress". This is the kind of language a careful equity analyst uses to flag the boundaries of their own forecast.

Claude uses such language consistently. Grok hedges around the numbers (*anchors near 9.8%*) but rarely about the thesis. DeepSeek hedges occasionally on risks. Gemini and GPT essentially never hedge. Their forecasts are stated as facts.

This is, on reflection, a different category of problem from the others. AI analyses sound more certain than they actually are — and the most superficial analyses sound the most certain. A reader correctly interpreting analyst tone would discount Gemini and GPT for being too declarative; a reader unfamiliar with analyst convention may read certainty into them.
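Hedging can be approximated by counting a small set of uncertainty markers. The sketch below is a rough illustration run on the four Johnson & Johnson risk sentences quoted in the next section; the pattern list `HEDGE_PATTERNS` is our assumption, loosely based on the phrases above, and is not the list used in the project scripts.

```python
import re

# Assumed hedge markers; the real scan may use a different list.
HEDGE_PATTERNS = [
    r"\bcould\b",
    r"\bmay\b",
    r"\bmight\b",
    r"\bpotential(?:ly)?\b",
    r"\bnot guaranteed\b",
]


def hedge_count(text):
    """Total number of hedge-marker matches in the text, ignoring case."""
    return sum(
        len(re.findall(p, text, re.IGNORECASE)) for p in HEDGE_PATTERNS
    )


# Risk sentences as quoted in this essay for Johnson & Johnson.
JNJ_RISK_SENTENCES = {
    "Claude": "Drug pricing reform and IRA Medicare negotiation provisions "
              "could structurally compress pricing power in the Innovative "
              "Medicine segment, particularly for high-revenue products "
              "subject to negotiation.",
    "DeepSeek": "Pricing pressure in the U.S. healthcare market and "
                "potential regulatory changes could compress margins.",
    "Gemini": "Regulatory changes and increasing pricing pressures pose a "
              "continuous challenge to profitability.",
    "GPT": "Regulatory scrutiny on technology companies.",
}

# Whether each sentence contains at least one hedge marker.
hedged = {m: hedge_count(s) > 0 for m, s in JNJ_RISK_SENTENCES.items()}
print(hedged)
```

On these four sentences the markers appear for Claude and DeepSeek and not for Gemini or GPT. One sentence per model is far too little to rank the models, so treat this as a demonstration of the method only.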

Three layers of risk vocabulary

The same risk concept appears in different forms, at different depths. Consider "regulation" as a risk for Johnson & Johnson:

Claude (mechanism): "Drug pricing reform and IRA Medicare negotiation provisions could structurally compress pricing power in the Innovative Medicine segment, particularly for high-revenue products subject to negotiation."
DeepSeek (category + mechanism): "Pricing pressure in the U.S. healthcare market and potential regulatory changes could compress margins."
Gemini (category): "Regulatory changes and increasing pricing pressures pose a continuous challenge to profitability."
GPT (label): "Regulatory scrutiny on technology companies."

All five flag the same risk concept. The depth at which they describe it varies from mechanism (Claude) to category-with-cause (DeepSeek) to category (Gemini) to label (GPT). The same insight, four different levels of usefulness to an analyst.

This pattern repeats across every risk type — competition, regulation, commodity exposure, customer concentration. The categories converge, the mechanisms diverge.

The bullish outlier and the cap-pinning paradox

Two more findings worth flagging, both about the relationship between text and numbers.

First, GPT — the most generic in its writing — is also, when it disagrees with the pack, almost always optimistic. On Neste, GPT's target is €29.52 against the other four clustered at €20–23, a 44 percent gap above the pack. On Nokia, €7.88 against €6.75–6.78. On UPM, €24.05 against €20–22. The bullish outlier is the model with the least specific reasoning.

We do not yet know whether this is a statistical artifact or a consistent bias. If it is consistent, the mechanism may be that GPT's generic anchors (strong brand, growth, expansion) provide no downside friction — without sector-specific gravity, the number drifts up.
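The size of each gap depends on which point in the pack the outlier is compared with. The sketch below, a minimal illustration using only the euro figures quoted above, computes the gap against both ends of the pack's range; `TARGETS` and `gap_bounds` are our own names.

```python
# (GPT target, pack low, pack high) in euros, as quoted above.
TARGETS = {
    "Neste": (29.52, 20.0, 23.0),
    "Nokia": (7.88, 6.75, 6.78),
    "UPM": (24.05, 20.0, 22.0),
}


def gap_bounds(outlier, pack_low, pack_high):
    """Percent gap of the outlier above the top and bottom of the pack.

    Returns (gap versus pack high, gap versus pack low), each rounded to
    one decimal place.
    """
    vs_high = (outlier / pack_high - 1) * 100
    vs_low = (outlier / pack_low - 1) * 100
    return round(vs_high, 1), round(vs_low, 1)


for company, (outlier, low, high) in TARGETS.items():
    smallest, largest = gap_bounds(outlier, low, high)
    print(f"{company}: {smallest}% to {largest}% above the pack")
```

For Neste the gap runs from about 28 percent against the top of the pack to about 48 percent against the bottom. The 44 percent quoted above falls inside that band and would correspond to a lowest target near €20.50, which is our inference from the rounded range rather than a reported figure.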

Second, on the snapshot day, three companies showed *zero* dispersion: all five models landed on the same target price to the cent. AAPL at $233.03. NVDA at $222.99. XOM at $127.08. Meanwhile Meta showed a 48 percent range across the five models, Neste 44 percent. The engine's safety caps pinned some stocks and let others fly. Effective N varies by company. The 19 percent of "all-five-capped" days we reported earlier concentrated on specific names, not random samples. (Why the past tense? See the end of this essay.)
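The mechanics of pinning are easy to sketch. The example below assumes the cap acts as a simple upper bound and defines dispersion as the range divided by the lowest target; both are assumptions for illustration, not the engine's actual rules. The raw targets are invented, and only the pinned value of $233.03 comes from the snapshot.

```python
def dispersion_pct(targets):
    """Range of targets as a percent of the lowest (assumed definition)."""
    low, high = min(targets), max(targets)
    return (high - low) / low * 100


def apply_cap(raw_targets, cap):
    """Clamp each target to an upper cap; a stand-in for the engine's caps."""
    return [min(t, cap) for t in raw_targets]


# Hypothetical raw targets from five models, all above the cap.
raw = [241.10, 250.00, 238.75, 262.40, 245.90]

# 233.03 is the value all five AAPL targets landed on in the snapshot.
capped = apply_cap(raw, cap=233.03)

print(round(dispersion_pct(raw), 1))  # non-zero spread before the cap
print(dispersion_pct(capped))         # 0.0 once every model hits the cap
```

When every raw target sits beyond the cap, five different opinions collapse into one number and the visible dispersion is zero, whatever the disagreement underneath.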

What this all means

Five models. One promise — independent perspectives. The numbers honor that promise less than the consensus suggests, and the essays honor it in unexpected ways: not by disagreeing about what matters (the risk categories converge), but about how much it matters and what the precise mechanism is.

For a reader using this site, the practical conclusion is uncomfortable: do not stop at the consensus number, and do not stop at the model card with the highest confidence. Read the actual reasoning. Notice which model engaged with this week's news. Notice which model used the company's actual brands and products and which used template phrases. Notice which model admitted what could go wrong.

For us, the conclusion is more structural. Universal coverage — our internal goal that every stock should have five workable, distinguishable analyses — has both a quantitative and a qualitative dimension. We have been measuring the quantitative one (effective N, capping rate, validity). We had not been measuring the qualitative one. After this read-through, it is clear that two of our five models, on the qualitative axis, are producing significantly less analytical value than the other three — even on stocks where the numerical consensus is tight.

Between drafting and publishing this essay, the next ship in our engine cycle sailed: **Engine v8 and Prompt v11 went live on June 9**, replacing the tiered analyst caps with a wide sanity band. The first days of data behave exactly as the cap-pinning paradox above predicted. All-five-pinned company-days dropped from a range of four to six per day to just one. Visible estimate dispersion grew by roughly half. AAPL, NVDA and XOM no longer land on the same cent; the numbers have started to disagree honestly.

But that fixes the numerical axis only. No cap reform makes GPT read the news, and no sanity band turns a label into a mechanism. Five voices is not yet five analyses. We will return to the qualitative gap once v8 has fully settled.

The findings above are based on a single-day, twelve-company close read, cross-checked against a structural scan of 8,257 valid outputs over 73 trading days. The per-model word counts, news-mention rates and hedging rates all hold across the full dataset. The scripts are in the project repository for anyone who wants to reproduce the results.

— AI Investor Barometer