AI Signals — Weekend Read: Same prompt, five answers
- On May 1, five AI models valued Meta on identical inputs. The spread between the highest and lowest implied upside was 62 percentage points, and that is the rule, not the exception
- Where the prompt locks the answer (WACC mid-point), models comply within 0.4pp; where it leaves slack (CAGR), they diverge by 2.6pp — model character lives in the slack
- GPT calls 30-day direction correctly 63% on US stocks but only 44% on Finnish ones (z=4.3, p<0.001). Sector mix, market-cap, coverage, and training-data density all confound the geographic story
- AI consensus moved from −15% to −5% over two months, but ~80% of that is engine recalibration (v6, v7, prompt v10), not learning. DeepSeek's residual −9% pessimism is the genuinely informative part
- Across 44 days, five LLMs are not five independent estimators; they are five recognisable personalities. Standardisation makes the differences visible; it does not erase them
# Same prompt, five answers
Weekend Read #8 — May 3, 2026
AI Investor Barometer tracks how five LLMs generate DCF assumptions for 23 listed companies — daily, independently, on identical inputs.
May 1, 03:00 UTC. Five AI models looked at Meta. Same day, same prompt, same 10-K, same Q1 earnings release. The spot price was $611.91.
The models returned their target prices: GPT $1,094, Gemini $911, Claude $796, Grok $796, DeepSeek $716. Between the highest and the lowest sits a 62-percentage-point spread in implied upside. One model expects 79% upside; another expects 17%.
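The spread figure follows directly from the targets and the spot price; every number below comes from the text above, and a few lines of Python reproduce the 62-point result:

```python
# Implied upside per model from the May 1 target prices (all figures from the text).
spot = 611.91
targets = {"GPT": 1094, "Gemini": 911, "Claude": 796, "Grok": 796, "DeepSeek": 716}

upside = {model: (price / spot - 1) * 100 for model, price in targets.items()}
for model, pct in sorted(upside.items(), key=lambda kv: -kv[1]):
    print(f"{model:>8}: {pct:+.1f}%")   # GPT +78.8% ... DeepSeek +17.0%

spread_pp = max(upside.values()) - min(upside.values())
print(f"spread: {spread_pp:.0f}pp")     # 62pp
```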
Something here doesn't add up. If five models receive identical inputs and an identical task, they ought to converge to roughly the same answer. They don't — not even close. The question is why.
This has been the central paradox of the project. We've built a system whose entire point is input standardisation: company facts, sector-specific WACC ranges, analyst-consensus anchors, sector guidance, default values. Yet five models still produce five recognisably different answers. Standardisation is not enough — and this report is an attempt to understand why.
---
## What the models actually decide
Before going further, it helps to see what's actually left for the model to choose. The prompt is tight. It supplies the company facts, pre-computed ratios, a market-specific WACC range with the explicit instruction "use the mid-point, adjust ±0.5–1% for company-specific risk", sector-specific margin and CAGR bands, an anchor to analyst consensus with a ±40% boundary, and default fallbacks for every parameter.
The model returns four numbers — CAGR, EBIT margin, WACC, terminal growth — plus text fields. Of those four, terminal growth is overwritten deterministically. The prompt itself states: "Use 0.02 as default. This value will be overridden by the deterministic engine."
So the room left to the model is three numbers, whose joint choice determines the final target price. Meta's 62-point spread comes from the fact that those three numbers differ — and that's where we have to start.
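To see how three numbers can move a target price by tens of percent, a deliberately crude five-year DCF helps. This is not the project's valuation engine (the anchor caps, Bayesian calibration, and deterministic terminal-growth override are all omitted), and the inputs are invented; only the mechanics matter:

```python
# A toy five-year DCF, NOT the project's engine: after-tax EBIT stands in for
# free cash flow, and all calibration layers are omitted.
def toy_dcf(revenue0, cagr, ebit_margin, wacc,
            terminal_growth=0.02, tax_rate=0.21, years=5):
    value = 0.0
    for t in range(1, years + 1):
        revenue = revenue0 * (1 + cagr) ** t
        fcf = revenue * ebit_margin * (1 - tax_rate)
        value += fcf / (1 + wacc) ** t
    terminal = fcf * (1 + terminal_growth) / (wacc - terminal_growth)  # Gordon growth
    return value + terminal / (1 + wacc) ** years

# Invented inputs spanning the panel's observed CAGR and WACC ranges.
optimist  = toy_dcf(100.0, cagr=0.088, ebit_margin=0.40, wacc=0.093)
pessimist = toy_dcf(100.0, cagr=0.062, ebit_margin=0.38, wacc=0.097)
print(f"value ratio: {optimist / pessimist:.2f}")  # ~1.24, a ~24% gap from small input gaps
```

Even in this stripped-down form, input differences of a couple of percentage points open a valuation gap of roughly a quarter, most of it through the terminal value.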
---
## Clue 1: the freedom hides in the growth assumption
Start with WACC. When the prompt says "use the mid-point", every model does so on average. The mean WACC across 23 companies sits between 9.3% and 9.7% — Claude 9.3, GPT 9.5, Gemini 9.5, DeepSeek 9.6, Grok 9.7. The range from lowest to highest is 0.4 percentage points. WACC does not explain Meta's spread.
In margin choices we see the rounding pattern documented earlier (WR#4). On May 1 data, all five models round more than 80% of their margin assumptions to whole percent; DeepSeek (96%) and GPT (96%) round almost always, and Gemini rounds 100% of its margins to at least half-percent precision. This is the "conventional clustering" Herrmann and Thomas (Journal of Accounting Research, 2005) found in human equity analysts. Convention wins on margins.
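The rounding shares are measured mechanically. A sketch of the check, run here on invented margin values rather than the barometer's actual output:

```python
# Share of margin assumptions landing exactly on whole- or half-percent values.
# The margins below are invented; the real check runs on each model's daily output.
margins_pct = [14.0, 22.0, 9.0, 17.5, 31.0, 12.0, 26.0, 8.0, 19.0, 11.0]

whole = sum(m == int(m) for m in margins_pct) / len(margins_pct)
half  = sum(m * 2 == int(m * 2) for m in margins_pct) / len(margins_pct)
print(f"whole-percent: {whole:.0%}, half-percent or coarser: {half:.0%}")  # 90%, 100%
```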
But growth — that's where the models genuinely disagree.
| Model | Average CAGR |
|---|---|
| Gemini | 8.8% |
| Grok | 8.1% |
| Claude | 7.4% |
| DeepSeek | 6.4% |
| GPT | 6.2% |
The gap between Gemini and GPT is 2.6 percentage points. Inside a DCF model, that single assumption compounds across five years and shifts the final target price by tens of percent. CAGR is where a model's "character" becomes a number. The prompt instruction is identical for all: "anchor on revenue_cagr_historical_pct, then adjust for outlook". But what "adjust for outlook" actually means is left to the model — and five models interpret it five different ways.
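The compounding effect of that 2.6-point gap is easy to quantify. Because the terminal value scales with the final-year cash flow, the revenue gap propagates almost directly into the target price:

```python
# Year-5 revenue multiple implied by the panel's extreme average CAGR choices.
hi, lo = 0.088, 0.062             # Gemini vs. GPT average CAGR
mult_hi = (1 + hi) ** 5           # ~1.52
mult_lo = (1 + lo) ** 5           # ~1.35
print(f"year-5 revenue gap: {mult_hi / mult_lo - 1:.0%}")  # ~13%
```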
First clue, then: dispersion concentrates where the prompt leaves room.
---
## Clue 2: something about the market we can't quite name
Disagreement at the per-stock level is not the only pattern. There is also directional asymmetry between the two markets. When we measure whether a model's estimate has been on the right side of the stock's subsequent 30-day move:
| Model | FI hit rate (n≈240) | US hit rate (n≈250) | Diff | p-value |
|---|---|---|---|---|
| GPT | 44.1% | 63.4% | +19.3pp | <0.001 |
| DeepSeek | 48.8% | 59.8% | +11.0pp | 0.013 |
| Gemini | 51.0% | 54.6% | +3.6pp | 0.42 |
| Grok | 54.6% | 57.4% | +2.8pp | 0.53 |
| Claude | 51.5% | 51.4% | −0.1pp | 0.98 |
The gap for GPT is 19 percentage points and statistically robust (z=4.3). On Finnish stocks GPT performs worse than a coin flip. On US stocks the same model is the panel's best directional guesser. Same model, same prompt, different accuracy on different markets.
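The z-statistic is a standard two-proportion test, reproducible from the table's figures (the sample sizes are approximate, so treat the decimals loosely):

```python
from math import sqrt

# Two-proportion z-test for GPT: FI vs. US 30-day directional hit rate.
p_fi, n_fi = 0.441, 240
p_us, n_us = 0.634, 250

p_pool = (p_fi * n_fi + p_us * n_us) / (n_fi + n_us)   # pooled hit rate
se = sqrt(p_pool * (1 - p_pool) * (1 / n_fi + 1 / n_us))
z = (p_us - p_fi) / se
print(f"z = {z:.1f}")  # 4.3
```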
The simplest explanation would be geographic: English-language financial reporting is over-represented in LLM training data, and US companies' direction is therefore more familiar to the model. Hagendorff, Fabi and Kosinski (Nature Computational Science, 2023) document that LLM cognitive heuristics reflect the structure of their training corpora.
But our universe doesn't compare a clean "Finland" against a clean "United States". It compares two very different sets of companies. The Finnish side is mid-cap European industrials, financials, and energy — Kone, Metso, Wärtsilä, UPM, Nordea, Neste. The US side is half mega-cap technology: NVDA, MSFT, AAPL, GOOGL, AMZN, META. The median company is 73× larger, and the sector mix is entirely different.
Sector profile (mega-cap tech vs. cyclical industrials), market-cap (analyst coverage, price formation), liquidity, and training-data density all push in the same direction. With this dataset we cannot isolate them from each other. A panel including Finnish tech firms (rare) and US mid-cap industrials would let us run the sector-controlled follow-up; ours does not.
Second clue, then: there is a phenomenon that standardisation does not erase, but pinpointing its cause requires more data than we have.
---
## Clue 3: the closing gap is not learning
The third signal comes from time. At the start of March the AI consensus sat around −15% relative to market price (median across 23 companies, mean across five models). By the start of May the same metric is around −5%. The gap has largely closed.
Has the model learned? Mostly not. Our valuation engine was upgraded on March 17 (v6, Bayesian calibration) and again at the end of March (v7, prompt v10, temperature 0.4). GPT's bias-median was −22% before v6 and −7% in the first week after. A single methodological change accounted for 15 percentage points of the original 20-pp gap. Add the prompt-v10 effect, and the bulk of the "convergence" is structural, not learned.
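The decomposition itself is nothing more than a before/after median around each release date. A sketch, assuming a long-format table with illustrative column names (the project's actual schema may differ):

```python
import pandas as pd

# Median bias before vs. just after the v6 engine release. The file name and
# columns ("model", "date", "bias_pct") are illustrative assumptions.
df = pd.read_csv("valuations.csv", parse_dates=["date"])
V6 = pd.Timestamp("2026-03-17")

gpt = df[df["model"] == "GPT"]
before = gpt.loc[gpt["date"] < V6, "bias_pct"].median()
after  = gpt.loc[gpt["date"].between(V6, V6 + pd.Timedelta(days=7)), "bias_pct"].median()
print(f"median bias: {before:+.1f}% -> {after:+.1f}%")   # text: -22% -> -7%
```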
But a residual remains, and it is interesting. DeepSeek still sits at −9% bias-median while the other four have compressed into the −3% to −4% band (same week, same metric). Same prompt, same engine, same day, same sector guidance. DeepSeek's combination of CAGR–margin–WACC consistently lands more pessimistic than the others, even after the engine's anchor caps. Standardisation didn't remove it.
That makes DeepSeek's pessimism the panel's most informative single number. When four models converge on the consensus, the fifth's deviation carries more signal than the unanimity of the four.
---
## A personality that holds for 44 days
And DeepSeek's pessimism is not a one-off. Each model carries a recognisable temperament that holds across time.
| Model | Mean bias | Median bias | Daily change | Calibration |
|---|---|---|---|---|
| Claude | +0.7% | −3.9% | 1.0% | 60.2 |
| GPT | −1.9% | −6.4% | 3.7% | 59.3 |
| DeepSeek | −4.4% | −10.8% | 2.2% | 56.5 |
| Gemini | −0.7% | −6.2% | 3.2% | 55.2 |
| Grok | −2.0% | −6.2% | 1.5% | 55.1 |
Claude is the steadiest — daily change in its average estimate is 1.0%. GPT moves 3.7% per day, more than three times as much. Gemini and GPT show the highest day-to-day variation in their estimates; Claude and Grok stay considerably more stable. What that variation reflects — news reactivity, temperature noise, or anchoring oscillation — we cannot read off the data. But the pattern itself persists.
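"Daily change" in the table reads most naturally as the mean absolute day-over-day move in a model's average estimate. One way to compute it, under the same illustrative schema as the sketch above:

```python
import pandas as pd

# Mean absolute day-over-day change in each model's average estimate.
# Same illustrative columns as the earlier sketch ("model", "date", "upside_pct").
df = pd.read_csv("valuations.csv", parse_dates=["date"])

per_day = (df.groupby(["model", "date"])["upside_pct"].mean()
             .unstack("date")
             .sort_index(axis=1))
daily_change = per_day.diff(axis=1).abs().mean(axis=1)
print(daily_change.round(1))   # table above: Claude ~1.0, GPT ~3.7
```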
Five models do not produce five independent estimates. They produce five viewpoints, each carrying its own signature. This is closer to five different analyst personalities than to five independent pricing engines — and that is where the panel's informativeness lives.
---
## Back to Meta
Return to May 1 and Meta's 62-percentage-point spread.
It is not a glitch. It is not the system breaking. It is a feature — five models trained in five different ways, and the training history seeps through the prompt's locks. GPT's optimism (+79%) and DeepSeek's restraint (+17%) are not the same answer; they are two interpretations of the same input, and the gap between them carries information.
For an investor, dispersion is the signal. A median estimate of "+30%" is not the same as a spread of "+17% to +79%". The latter says Meta's valuation is genuinely contested: one model sees strong growth, another sees saturation. Kahneman, Sibony and Sunstein (*Noise*, 2021) showed for human analysts that most forecast error comes not from systematic bias but from noise, with competent analysts looking at the same data and reaching different conclusions. AI does not escape that. But the panel's advantage is that the dispersion is visible instead of hidden behind one confident-looking estimate.
This project does not tell you which stock to buy. It tells you how much of a company's valuation lives in the observer and how much in the company itself. Five parallel observers reveal a structure that a single observer hides. Standardisation has not failed. It has made visible what one confident AI estimate would conceal.
---
This report is based on 5,060 valuations, 44 trading days, 23 companies, and five LLMs. Statistical confidence varies by claim: observations of per-stock dispersion (Meta) and FI/US asymmetry rest on the firmest ground; claims about the source of the convergence and about DeepSeek's residual pessimism depend partly on the engine-version decomposition and are softer. Forty-four trading days is a short window; many effects reported here will need three more months of data to move from observations to evidence.
AI Investor Barometer is an experimental research tool, not investment advice.