Signals

AI Signals — Weekend Read: How Often and How Much Do AI Models Change Their Minds About Stocks?

2026-04-12 · editorial · written by Claude
Summary
  • Claude and Grok are the most stable: uncapped estimates unchanged on 63–64% of days. GPT produces >10% daily moves once a week
  • META is every model's problem child — GPT's temporal σ is 35.2%, more than double any other stock. NVDA's CAGR assumption range spans 9–55%
  • Technology sector runs 3× more volatile than healthcare in DCF terms — a structural property of the model, not a quality issue
  • DeepSeek has never crossed zero bias in 29 trading days. Training data pessimism, anchoring, or correct market view? We don't know yet
  • Temperature change from 1.0 to 0.4 shifted GPT's median bias from -23% to near-neutral overnight — one of the first empirical observations of the temperature-sentiment link in financial LLMs

Weekend Read #5 — April 12, 2026

AI Investor Barometer tracks how five LLMs generate DCF assumptions for 24 listed companies — daily, independently, with identical inputs.

Investors demand consistency from their analysts. If an analyst calls a stock 15% undervalued on Monday and only 2% by Friday, credibility suffers — even if Friday's number turns out to be closer to reality. The same standard applies to AI.

We have been running five large language models — GPT, Claude, Gemini, DeepSeek, and Grok — for 29 trading days. That adds up to 3,335 valuations across 24 equities listed on Nasdaq Helsinki and major US exchanges, generated every business day. For the first time, we have enough data to measure how stable these estimates really are.

This analysis uses uncapped valuations — raw target prices before the engine applies analyst TP caps, PE caps, or terminal value caps. This isolates what the models actually produce from what the engine corrects.

---

One Number Tells the Story

We measured temporal standard deviation for each model: how much its estimate for the same stock fluctuates from day to day.

| Model | Median σ | Median daily change | Zero-change days* | >10% move days |
|---|---|---|---|---|
| Grok | 2.6% | 0.00% | 99/158 (63%) | 1 |
| Claude | 2.7% | 0.00% | 118/183 (64%) | 0 |
| DeepSeek | 4.3% | 2.54% | 34/146 (23%) | 2 |
| GPT | 5.1% | 0.34% | 76/173 (44%) | 18 |
| Gemini | 5.2% | 1.51% | 49/144 (34%) | 12 |

*Zero-change days = days when the uncapped estimate did not move at all. Median σ from v7 period (March 25 onwards).*

Grok and Claude produce identical output on more days than not — over 63% of the time, the uncapped estimate does not change. GPT and Gemini generate large moves (>10%) once a week or more.
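These stability metrics can be reproduced from any daily estimate series. Below is a minimal sketch: the article does not spell out its exact formula, so the definition of σ as the standard deviation of daily percentage changes, and the sample series, are illustrative assumptions.

```python
import statistics

def stability_stats(prices, big_move=0.10):
    """Day-to-day stability of one model's uncapped target-price series.

    prices: chronological daily uncapped estimates for one stock/model pair.
    Returns (sigma of daily % changes, zero-change days, >big_move days).
    Note: this sigma definition is an assumption, not the article's spec.
    """
    changes = [(b - a) / a for a, b in zip(prices, prices[1:])]
    sigma = statistics.pstdev(changes)               # temporal sigma
    zero_days = sum(1 for c in changes if c == 0)    # estimate did not move
    big_days = sum(1 for c in changes if abs(c) > big_move)
    return sigma, zero_days, big_days

# Hypothetical week: three flat days, one big jump, one small pullback.
sigma, zero_days, big_days = stability_stats([100.0, 100.0, 100.0, 112.0, 111.0])
print(zero_days, big_days)  # 2 1
```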

What Is Actually Happening Inside the Models?

To understand the variance, we need to look at the layer below target prices: which assumptions shift, and by how much?

Each model outputs three core valuation parameters: 5-year revenue growth (CAGR), target operating margin (EBIT %), and cost of capital (WACC). These feed into a deterministic DCF engine — the models never output target prices directly.
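To make that pipeline concrete, here is a toy version of such an engine. This is a hypothetical sketch, not the project's actual engine: it treats after-tax EBIT as a free-cash-flow proxy and ignores capex, working capital, and net debt. What it does show is how the three model-supplied parameters map deterministically to a target price.

```python
def dcf_target_price(revenue, shares, cagr, ebit_margin, wacc,
                     terminal_growth=0.02, tax_rate=0.21, years=5):
    """Toy deterministic DCF driven by the three model-supplied parameters.

    Simplification (assumption): after-tax EBIT stands in for free cash flow.
    """
    rev, fcfs = revenue, []
    for _ in range(years):
        rev *= 1 + cagr
        fcfs.append(rev * ebit_margin * (1 - tax_rate))  # crude FCF proxy
    pv_explicit = sum(f / (1 + wacc) ** t for t, f in enumerate(fcfs, start=1))
    terminal = fcfs[-1] * (1 + terminal_growth) / (wacc - terminal_growth)
    pv_terminal = terminal / (1 + wacc) ** years
    return (pv_explicit + pv_terminal) / shares

# Identical assumptions always yield an identical price: all day-to-day
# variance originates in the LLM-generated parameters, not the engine.
a = dcf_target_price(2_000.0, 100.0, cagr=0.03, ebit_margin=0.22, wacc=0.068)
b = dcf_target_price(2_000.0, 100.0, cagr=0.03, ebit_margin=0.22, wacc=0.068)
assert a == b
```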

Two extremes illustrate the dynamic.

For Elisa — Finland's largest telecom operator, listed on Nasdaq Helsinki — all five models output CAGR between 1.6% and 3.0% (range: 1.4 percentage points), margin between 20% and 25%, and WACC between 6.5% and 7.0%. The assumption space is narrow. The target price barely moves.

For Meta, CAGR ranges from 12% to 23% (11 percentage points), margin from 38% to 45%, and WACC from 9.5% to 11.5%. The assumption space is vast — and models oscillate within it from one day to the next.

Widest assumption ranges across the full universe (v7 period):

| Stock | CAGR range | Margin range | What this means |
|---|---|---|---|
| NVDA | 9–55% (46 pp) | 46–65% (19 pp) | AI boom or overcapacity? |
| TIETO | −14% to 5% (19 pp) | 9–12% (4 pp) | Growing or shrinking? |
| META | 12–23% (11 pp) | 38–45% (7 pp) | AI investment payback timeline? |
| TSLA | 4–14% (11 pp) | 7–18% (11 pp) | Car company or tech platform? |
| NOKIA | 1–5% (4 pp) | 4–14% (10 pp) | Margins entirely uncertain |

NVDA's 46-percentage-point CAGR spread is in a class of its own. It reflects two mutually exclusive scenarios: in one, data center GPU demand sustains exponential growth for years; in the other, the current demand spike normalizes and competition erodes margins. In a DCF framework, the difference between these scenarios translates to hundreds of percent in target price. For a practitioner, this means no model — human or machine — can reliably price NVDA in the current environment. The honest output is a wide range, not a point estimate.

Meta Is Every Model's Problem Child

Among individual equities, META stands apart.

GPT's Meta target price fluctuates with 35.2% standard deviation — more than double any other stock on any model. Gemini: 14.2%. DeepSeek: 10.5%. Even Claude, otherwise the steadiest, swings at 6.6% for Meta. No model holds a stable view.

In dollar terms: GPT's average Meta target is $932 with $182 standard deviation — the estimate typically ranges across a $400+ band. Claude by comparison: $817 ± $16 — a tenfold difference in stability.

The mechanism is straightforward. Meta's equity value is dominated by terminal value, which depends almost entirely on long-term revenue growth and sustainable margins. A 1-percentage-point CAGR shift moves the target price by 15–25%. The models oscillate within their assumption range — same input data, slightly different weighting — and the DCF structure amplifies the effect.
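The amplification sits in the Gordon growth term of the terminal value, TV = FCF × (1 + g) / (WACC − g). When the spread between WACC and long-run growth is narrow, a small shift in the growth assumption moves the terminal value by double digits. A self-contained sketch with hypothetical mega-cap numbers:

```python
def terminal_value(fcf, wacc, g):
    """Gordon growth terminal value: TV = FCF * (1 + g) / (wacc - g)."""
    return fcf * (1 + g) / (wacc - g)

# Hypothetical mega-cap assumptions: WACC 10.5%, long-run growth 3% vs 4%.
# A 1-percentage-point growth shift moves the terminal value by ~16%.
tv_low = terminal_value(100.0, wacc=0.105, g=0.03)
tv_high = terminal_value(100.0, wacc=0.105, g=0.04)
print(f"{tv_high / tv_low - 1:+.1%}")  # +16.5%
```

The closer g sits to WACC, the larger this swing — which is exactly why terminal-value-heavy names like Meta amplify small assumption changes into large target-price moves.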

GOOGL (20.0% GPT, 10.8% DeepSeek, 7.5% Claude) and MSFT (12.0% GPT, 9.6% Gemini) exhibit the same pattern in milder form. All are mega-cap technology names where terminal value dominates equity value.

Sector Difficulty: Technology in a League of Its Own

At the sector level, the hierarchy is clear:

| Sector | Avg σ | Stocks |
|---|---|---|
| Technology | 9.6% | META, GOOGL, MSFT, NOKIA, TIETO, AAPL, NVDA, TSLA |
| Energy | 4.8% | NESTE, XOM |
| Materials | 4.1% | UPM |
| Industrials | 3.4% | METSO, KNEBV, WRT1V |
| Financials | 3.3% | SAMPO, NDA1V, BRK-B, JPM |
| Telecom | 3.0% | ELISA |
| Consumer | 2.7% | AMZN, PG |
| Healthcare | 2.3% | JNJ, ORNBV |

Technology runs nearly three times more volatile than healthcare. This is not a model quality issue — it is a structural property of DCF valuation. A margin shift in JNJ (25.5–33%) carries less weight than in Meta (38–45%) because JNJ's growth rate is lower and terminal value constitutes a smaller share of enterprise value.

Neste and Nokia: Sector-Specific Challenges on Nasdaq Helsinki

Two Finnish-listed names stand out.

Neste — a renewable fuels and refining company — produces elevated temporal variance for DeepSeek (5.6%) and GPT (4.1%): lower than in the previous engine version, but still well above stable names. Refining margins depend on crude oil prices, renewable diesel demand, and EU carbon allowance pricing — variables invisible in financial statements. The models' CAGR range is narrow (1.5–4%), but the margin range is wide (6–12%) — precisely the parameter driven by external commodity markets.

Nokia — the telecom infrastructure equipment maker — is GPT's (10.2%) and Grok's (7.0%) most volatile Finnish holding. Its margin range is among the widest in our universe: 4–14%. The 5G investment cycle is decelerating, 6G remains pre-commercial, and operator capex timing is inherently unpredictable.

Elisa, KONE, and the Value of Low Variance

The least eventful finding may be the most useful.

Elisa, Finland's dominant telecom operator, is the most stable stock across all five models: median σ of just 1.4%. Claude, Grok, and DeepSeek all produce Elisa estimates with σ between 1.3% and 1.4% — near-identical output day after day.

KONE (median σ 1.9%), a Finnish elevator and escalator manufacturer, Berkshire Hathaway (0.9%), and Johnson & Johnson (2.1%) are nearly as stable. The common factor: a narrow assumption space. KONE's CAGR ranges from 2% to 5%, margin from 12.5% to 14%. There is simply no room for the model to produce different numbers.

This is informative in itself. Stability does not mean the estimate is correct. It means the models see little uncertainty. With Meta, they see a great deal. With Elisa, they do not. Both observations carry signal.

DeepSeek — The Model That Never Crosses Zero

Of the five models, DeepSeek is the outlier.

Four models have converged toward neutral over the month — their aggregate view approximately matches prevailing market prices. DeepSeek has not. It has not produced a positive median bias on any of the 29 trading days. It consistently values the universe 7–10% below current prices.

This is especially notable given that 44% of DeepSeek's valuations are clipped by the engine's safety caps, which pull estimates toward current prices. The uncapped view is likely more bearish still.

Three explanations are plausible. First: DeepSeek is correct and markets are broadly overvalued. Second: the other four models anchor to current prices, producing artificially neutral estimates. Third: DeepSeek's training corpus may overrepresent cautious and risk-focused financial commentary. In financial literature — analyst reports, risk disclosures, regulatory filings — bearish framing is structurally more common than bullish. "Risk" appears more often than "opportunity." If this asymmetry has been absorbed during training, it would manifest as a persistent negative bias in valuation assumptions. We cannot yet distinguish between these explanations.

The Temperature Experiment: One Parameter, Dramatic Impact

On March 17, we unified the temperature setting — a randomness parameter controlling output diversity — for all models from 1.0 to 0.4.

GPT's median bias jumped overnight from -23% to -10% and continued converging toward neutral over the following weeks. Same model, same stocks, same prompt. The only change was how much creative latitude the model was given in generating its response.

Higher temperature did not produce "better" or "more diverse" views — it produced systematically more pessimistic ones. This may be one of the first empirical observations of the temperature–sentiment relationship in LLM-based financial reasoning.

Methodological Note: The Impact of Valuation Caps

Our valuation engine applies safety limits: analyst target price caps (±50% for Nasdaq Helsinki stocks, ±40% for US mega-caps), PE ratio caps, and terminal value share caps. These affect 35–44% of all valuations depending on the model.
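As a rough illustration, the analyst-TP cap can be thought of as a clamp around the consensus target. This is a hypothetical sketch: the real engine also applies the PE and terminal-value-share caps, which are not modeled here, and the numbers below are invented.

```python
def apply_tp_cap(uncapped_tp, analyst_tp, band):
    """Clip a model's uncapped target price to within +/- band of the analyst
    consensus target (band=0.50 for Nasdaq Helsinki, 0.40 for US mega-caps)."""
    lo, hi = analyst_tp * (1 - band), analyst_tp * (1 + band)
    return min(max(uncapped_tp, lo), hi)

# A bullish uncapped estimate gets pulled back toward the analyst consensus.
print(apply_tp_cap(930.0, analyst_tp=550.0, band=0.40))  # 770.0
```

This also shows why capped aggregate bias understates extreme views: any estimate outside the band is reported at the band edge.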

The analyses in this article use uncapped valuations wherever possible. Sector statistics, per-stock standard deviations, and assumption ranges are all derived from pre-cap data. The daily aggregate bias figures (DeepSeek section) use final post-cap numbers — caps compress extreme views in that context. The relative model ranking is unaffected.

What Follows From This?

Three conclusions:

1. Consensus is more reliable than any single model. The five-model median smooths daily noise. A single model's estimate on a single day should not be treated as a signal — particularly GPT's.

2. The degree of estimate variance is itself informative. When all five models swing, there is genuine uncertainty in the underlying assumptions. When they hold steady, the assumption space is narrow. Dispersion is a signal, not just noise.

3. Practitioners should read variance alongside the point estimate. If a stock's consensus gap is -15% but median temporal σ is 3% (as with KONE), the estimate is stable — the models have arrived at the same conclusion repeatedly. If the consensus is +30% but σ is 25% (as with META), the estimate is closer to a guess than an analysis. In practice: when variance is low, the consensus can serve as a reference point. When variance is high, the consensus tells you more about uncertainty than about direction.
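Conclusions 1 and 2 can be folded into a simple decision aid: take the cross-model median as the point estimate and report dispersion alongside it as the uncertainty signal. A hypothetical sketch with invented numbers:

```python
import statistics

def consensus_view(estimates):
    """Median of per-model estimates plus relative dispersion.

    estimates: one uncapped target price per model (hypothetical values below).
    Returns (median, (max - min) / median) -- point estimate and uncertainty.
    """
    med = statistics.median(estimates)
    spread = (max(estimates) - min(estimates)) / med  # dispersion as a signal
    return med, spread

# One model produces an outlier; the median barely moves, while the
# spread flags that the models disagree.
med, spread = consensus_view([101.0, 99.0, 100.0, 140.0, 102.0])
print(med)  # 101.0
```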

We are approaching 30 trading days of v7 engine data — sufficient to activate per-model calibration. Models with stronger track records will soon receive higher weight in the consensus. But is consistency the same thing as accuracy? That is the question for the next article — and the answer is less obvious than it appears.

---

AI Investor Barometer tracks daily how 5 AI models form stock valuation estimates — and where they diverge. This is an experimental research tool, not investment advice.

Want these insights weekly?
Subscribe to AI Signals →