AI Signals — Weekly Model Behavior Summary

How five AI models' estimates and biases change — summarized weekly.

AI Signals — Weekend Read: One Month In — What 2,760 AI Valuations Taught Us

2026-04-04 · editorial · written by Claude
Summary
  • After 24 trading days and 2,760 estimates, we cannot separate methodology effects from genuine model behavior — every engine or prompt change moved the numbers
  • Five distinct model personalities emerged: Claude is the only optimist (+1.0%), GPT has best directional accuracy (52.7%) but highest volatility, DeepSeek achieves 100% reliability at 1/15th the cost
  • XOM dropped 31% in one day after all models reacted to Iran de-escalation signals — while 9 major banks raised their price targets. DCF amplifies short-term sentiment for cyclical stocks
  • Directional accuracy is 47-53% at 1-day horizon — statistically a coin flip. The real test begins at 3 months (July) and 12 months (March 2027)
  • Model-specific calibration coming in late April when 30 days of v7 data is available

One Month In — What 2,760 AI Valuations Taught Us

For one month, five AI models have valued 23 stocks every trading day. That's 2,760 estimates, 24 trading days, and roughly $50 in API costs. Here's what we've learned — and what we haven't.

The big picture: all models are bearish, and it got worse

When we started in early March, the combined AI valuation gap was around -15% — meaning the models collectively valued stocks about 15% below market prices. Over the next two weeks, that gap narrowed steadily to around -6%. We thought we were witnessing convergence toward market prices.

Then it reversed. The gap widened back to -12%. The improvement wasn't the models getting smarter — it was a combination of our own methodology changes (Engine v7, Bayesian shrinkage, temperature harmonization) and the market itself declining. When the market stabilized, the underlying bearish bias reasserted itself.

This is the most honest takeaway from one month of data: we cannot separate our methodology effects from genuine model behavior. Every time we changed something in the engine, prompt, or temperature, it moved the numbers. The signal we're looking for — how LLMs independently reason about value — is tangled up with the system we built to measure it.

Five models, five personalities — but are they real?

The models have developed what look like consistent personalities:

Claude is the optimist. It's the only model with a positive average bias (+1.0%), meaning it sees stocks as slightly undervalued on average. It's also the most consistent day to day — its estimates change only 1.3% between trading days. If Claude were a human analyst, it would be the senior conservative who doesn't chase momentum.

GPT is the contrarian. It has the best directional accuracy (52.7%) but also the highest daily volatility (3.9%). It swings more than any other model, and when we changed temperature from 1.0 to 0.4, it flipped from most bearish to most optimistic before settling back. GPT seems most sensitive to parameter changes, which raises the question: is its "personality" real or just an artifact of how it processes constrained prompts?

DeepSeek is the workhorse. 100% valid JSON output across 529 runs — not a single parsing failure. It costs $1.10 per month compared to Claude's $17.20. Its accuracy is middling, but its reliability is unmatched.

Gemini has the widest variance in growth assumptions and the lowest directional accuracy (47.0%). It disagrees with itself as much as with other models.

Grok is the fastest (7.3 seconds average) but gets its raw estimates capped by safety limits 46% of the time — more than any other model. It thinks big but the engine reins it in.
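The two numbers behind these personality sketches — average bias and day-to-day stability — are simple to compute. A minimal sketch with toy data (the function and variable names are illustrative, not the production pipeline's):

```python
from statistics import mean

def personality_metrics(estimates, market_prices):
    """Per-model average bias vs. the market and mean absolute
    day-to-day change in estimates.

    estimates: daily valuation estimates for one model on one stock
    market_prices: matching daily market prices
    """
    # Bias: how far the model's estimate sits from the market price, on average.
    bias = mean((e - p) / p for e, p in zip(estimates, market_prices))
    # Stability: average relative change between consecutive trading days.
    daily_change = mean(
        abs(estimates[i] - estimates[i - 1]) / estimates[i - 1]
        for i in range(1, len(estimates))
    )
    return bias, daily_change

# Toy example: a mildly optimistic model with small daily moves
est = [101.0, 102.5, 101.8, 103.0]
px = [100.0, 101.0, 100.5, 102.0]
bias, change = personality_metrics(est, px)
```

A Claude-like profile would show a small positive bias and a daily change near 1.3%; a GPT-like profile, a negative bias with daily swings near 3.9%.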

The uncomfortable question: we tightened the prompt with sector-specific ranges, lowered temperature to 0.4 for all models, and applied Bayesian shrinkage. The spread between models has narrowed from 11 percentage points to 3-5. At what point are we measuring our own constraints rather than genuine model differences?

The XOM incident: when geopolitics meets DCF

On April 2nd, ExxonMobil's AI consensus estimate dropped from $118 to $82 in a single day — a 31% decline and the largest single-stock move since we started tracking. All five models simultaneously cut their growth expectations from ~4% to 1-3% and lowered margin targets from 14% to 11-12%.

The real-world context: on April 1st, XOM's stock fell 5.7% — its largest single-day drop in a year — after President Trump signaled a potential end to the Iran conflict. Oil markets have been in turmoil since the Strait of Hormuz closure sent Brent crude toward $120/bbl, and any hint of de-escalation triggers sharp reversals in energy stocks. CNN described it as "whiplash" — markets swinging on every new Iran headline.

Our AI models picked up this signal through updated Yahoo Finance data — analyst revisions, price movements, and commodity signals — and all five independently reached the same bearish conclusion on the same day. That unanimity is noteworthy: five separate API calls, no shared memory, same direction.

But the models overreacted. XOM's trailing P/E is 23x while our energy sector cap is 18x. The DCF model produces a low raw estimate, then the P/E cap pushes it even lower. Meanwhile, nine major banks — Piper Sandler, Wells Fargo, Barclays, Citi, and others — have actually raised their XOM price targets. The analysts see long-term value in a diversified energy company; the DCF sees a cyclical stock trading above fair value.
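The interaction just described — a low raw DCF estimate pushed lower still by the sector P/E ceiling — can be sketched in a few lines. The cap value and numbers below are illustrative assumptions, not the production engine's inputs:

```python
def capped_valuation(dcf_estimate, eps, sector_pe_cap):
    """Apply a sector P/E ceiling to a DCF estimate.

    If the DCF value implies a P/E above the sector cap, the estimate
    is clipped down to eps * cap. For a stock trading at 23x earnings
    against an 18x energy cap, the cap can only push the number lower.
    """
    implied_pe = dcf_estimate / eps
    if implied_pe > sector_pe_cap:
        return eps * sector_pe_cap
    return dcf_estimate

# Illustrative XOM-style numbers:
eps = 5.00            # trailing earnings per share
raw_dcf = 100.0       # already below a 23x-P/E market price of 115
capped = capped_valuation(raw_dcf, eps, sector_pe_cap=18)  # clips to 90.0
```

This is why the mechanism compounds: the DCF's bearish growth assumptions produce a low estimate, and the cap then ratchets it down further whenever the implied multiple still exceeds the sector ceiling.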

XOM now sits at -49.3% — the models think ExxonMobil is worth half its market price. That tells us three things: DCF has structural blind spots for cyclical commodity stocks, AI models amplify short-term sentiment when the input data shifts, and the gap between AI and analyst views can be a signal in itself.

What accuracy means (and doesn't) at 24 days

Our directional accuracy ranges from 47% to 53% across models. Statistically, this is indistinguishable from a coin flip.
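"Indistinguishable from a coin flip" can be made precise with an exact binomial test. A standard-library sketch, using GPT's 52.7% figure over a sample of 529 one-day calls (the sample size mirrors the run count quoted above; treating daily calls as independent is itself an assumption):

```python
from math import comb

def binomial_two_sided_p(successes, n, p=0.5):
    """Exact two-sided binomial p-value: the probability of an outcome
    at least as far from n*p as the observed count, under fair-coin odds."""
    observed_dev = abs(successes - n * p)
    total = 0.0
    for k in range(n + 1):
        if abs(k - n * p) >= observed_dev:
            total += comb(n, k) * p**k * (1 - p)**(n - k)
    return total

# 52.7% directional accuracy over 529 one-day calls:
hits = round(0.527 * 529)            # 279 correct calls
p_value = binomial_two_sided_p(hits, 529)
# p_value lands well above 0.05, so "coin flip" cannot be rejected
```

At this sample size, accuracy would need to clear roughly 54-55% before the fair-coin hypothesis could be rejected at the 5% level.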

But this metric is measuring the wrong thing. The models produce 12-month DCF valuations — asking whether they predicted tomorrow's price movement is like judging a marathon runner by their first 100 meters. The real accuracy test begins at 3 months (July 2026) and becomes meaningful at 12 months (March 2027).

What the short-term data does reveal is model behavior, not prediction quality. Claude's near-zero bias (+1.0%) means it is the best calibrated to current market prices. GPT's and DeepSeek's shared -5.1% bias means they systematically value stocks below market prices. These behavioral signatures are consistent and likely genuine.

Where AI and analysts disagree

Our Disagreement Map reveals three distinct clusters:

Consensus zone (7 stocks): BRK-B, ELISA, JNJ, KNEBV, NDA1V, NOKIA, WRT1V — mostly Finnish defensives and stable US names. AI models agree with each other and with analyst targets. These are the well-understood companies where DCF works well.

AI agrees, analysts differ (13 stocks): most US large caps — AAPL, AMZN, GOOGL, MSFT, NVDA. AI models are internally consistent but systematically different from analyst consensus. This is the DCF vs. momentum gap: fundamentals-based models don't price in growth optionality the way sell-side analysts do.

Full uncertainty (3 stocks): NESTE, UPM, and now GOOGL. Both AI models and analysts disagree. NESTE has been stuck in this quadrant for three weeks — nobody knows how to value an oil refiner pivoting to renewable fuels during a geopolitical energy crisis.
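These three clusters fall out of a two-axis classification: how much the five models disagree with each other, and how far their consensus sits from the analyst target. A sketch with illustrative thresholds (the cutoffs and function names are assumptions, not the published map's exact rules):

```python
from statistics import mean, pstdev

def disagreement_quadrant(model_estimates, analyst_target,
                          spread_cut=0.05, gap_cut=0.10):
    """Classify a stock by internal AI spread vs. AI-analyst gap.

    model_estimates: the five models' valuations for one stock
    analyst_target: sell-side consensus price target
    """
    consensus = mean(model_estimates)
    ai_spread = pstdev(model_estimates) / consensus        # do models agree?
    analyst_gap = abs(consensus - analyst_target) / analyst_target
    if ai_spread < spread_cut and analyst_gap < gap_cut:
        return "consensus zone"
    if ai_spread < spread_cut:
        return "AI agrees, analysts differ"
    return "full uncertainty"

# Tight AI agreement near the analyst target -> consensus zone
disagreement_quadrant([98, 100, 102, 99, 101], analyst_target=100)
```

A NESTE-style stock would fail the first test (wide internal spread) and land in "full uncertainty" regardless of where the analyst target sits.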

What we got wrong

Temperature effects were larger than expected. Changing from 1.0 to 0.4 didn't just reduce randomness — it changed directional bias. We did this at the same time as prompt changes, so we can't isolate the cause. A proper A/B test is needed.

The universe is too small for robust statistics. 23 stocks means one outlier (XOM) can move the median gap by 6 percentage points in a day. We need at least 50-100 stocks for the aggregate indices to be stable.

DCF has structural blind spots. High-P/E growth stocks (GOOGL at +42.6%) and cyclical commodity stocks (XOM at -49.3%) sit at opposite extremes not because of AI insight but because of methodology limitations. The model works best for mature, predictable companies with moderate valuations.

What comes next

Model-specific calibration (late April). After 30 days of data, we'll activate per-model Bayesian shrinkage. Claude, with its +1.0% bias, will get more weight than GPT or DeepSeek at -5.1%. This should improve consensus quality.
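Per-model shrinkage of this kind is typically a sample-size-weighted pull toward a shared prior. A minimal sketch of one common form — the neutral prior, the pseudo-count k, and the bias-to-weight rule are all assumptions, not the engine's actual calibration:

```python
def shrunk_bias(model_bias, n_obs, prior_bias=0.0, k=30):
    """Shrink a model's observed bias toward a prior mean.

    With few observations the prior dominates; with many, the
    observed data does (weight = n / (n + k))."""
    w = n_obs / (n_obs + k)
    return w * model_bias + (1 - w) * prior_bias

def consensus_weight(bias, scale=0.05):
    """Down-weight models with large absolute bias (illustrative rule)."""
    return 1.0 / (1.0 + abs(bias) / scale)

# After 30 days, observed bias and the neutral prior count equally:
claude = shrunk_bias(0.010, n_obs=30)    # +1.0% bias shrinks to +0.5%
gpt = shrunk_bias(-0.051, n_obs=30)      # -5.1% bias shrinks to -2.55%
# Claude's smaller residual bias earns it more consensus weight
```

The practical effect is exactly the one described above: a model with near-zero bias keeps most of its vote, while a persistently biased model is pulled toward neutral and discounted in the consensus.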

The earnings test (May-June). Q1 2026 results will start flowing in. This is the first real test: do models adjust their assumptions when they see new financial data, or do they anchor to stale estimates?

3-month accuracy check (July). The first statistically meaningful comparison between AI estimates and actual price movements. Also the first chance to compare AI accuracy against analyst accuracy on the same stocks.

One month of data has taught us more about our own methodology than about AI's ability to value stocks. That's not a failure — it's the honest starting point for understanding what these models actually do when you ask them to think about financial value.

AI Signals — Week 14, Mar 30–Apr 03, 2026

2026-03-30 → 2026-04-03 · generated by: claude
Summary
  • Four out of five models turned more bullish this week, yet the average consensus upside across 23 companies barely moved — the optimism is concentrated, not broad.
  • DeepSeek flipped from mildly bullish to the panel's only bear, even as every other model grew more constructive: a rare and meaningful divergence.
  • ExxonMobil's consensus target price collapsed by 30% in a single week — the sharpest single-name revision in the dataset's history and a stress test the framework did not handle gracefully.
  • Gemini's bullish bias jumped by 3 full percentage points week-on-week, the largest single-model shift recorded, while its terminal growth rate remains locked at exactly 2.00% for every single company it covers.
  • DeepSeek priced 115 valuations for $0.26 — seventeen times cheaper than Claude for outputs that, this week at least, told a meaningfully different story.

AI Signals — Weekend Read: When the Market Moves Toward AI

2026-03-28 · editorial · written by Claude
Summary
  • The gap between AI model estimates and market prices narrowed from -13% to -4% over 20 trading days
  • Two simultaneous factors: the market declined (MSFT -15%) AND our methodology improved (Engine v6→v7)
  • We cannot separate these effects — this is an observation, not evidence of predictive power
  • Model personality rankings unchanged for 20 days: Claude least bearish, GPT most bearish
  • Real test ahead: Q1 2026 earnings season will show if models react to new financial data

AI Signals — Week 13, Mar 23–27, 2026

2026-03-23 → 2026-03-27 · generated by: claude
Summary
  • GPT staged the most dramatic sentiment reversal of the year, swinging from a -7.3% bearish bias last week to +6.0% bullish — a 13.3-point lurch that dwarfs every other model's move.
  • Technology sector model consensus surged by 8.3 points this week, the largest sectoral shift in the dataset, yet the underlying stocks remain largely priced above model targets.
  • DeepSeek costs just $2.23 per thousand valuations versus Claude's $39.25 — a 17x price gap that raises hard questions about what the premium actually buys.
  • Nokia is the week's only trend stock, posting three consecutive days of rising model consensus within a 7.8% target-price range — unusually tight conviction for a name this contested.

AI Signals — Weekend Read: Claude vs GPT — Two AI Analysts, Two Very Different Views

2026-03-21 · editorial · written by Claude
Summary
  • Claude (Sonnet 4.6) sees stocks as roughly fairly valued (−1.8% avg bias); GPT (4o-mini) sees them as significantly overpriced (−13.1%)
  • GPT’s bearish tilt nearly doubles for US stocks (−16.1%) vs Finnish stocks (−10.1%); Claude stays neutral regardless of market
  • Claude is the steadiest model (1.5%/day change) but fails JSON parsing more often; GPT is reactive (3.0%/day) but more reliable in production
  • 14 days of data across 24 stocks: if you want to understand how AI thinks about value, one model is not enough

AI Signals — Week 12, Mar 16–20, 2026

2026-03-16 → 2026-03-20 · generated by: claude
Summary
  • Every single AI model turned meaningfully more bullish this week — a synchronized shift that says more about shared training data than market fundamentals.
  • GPT remains the most pessimistic model at -7.4% average upside, yet it just recorded its largest weekly bias swing of any model at +9.6 percentage points.
  • Neste is the week's most brutal consensus call: models price it at €16.66 against a spot of €29.70, a -44% implied downside that no analyst desk would publish without a disclaimer.
  • DeepSeek delivers full output quality at $2.19 per thousand valuations — roughly 16x cheaper than Claude — making the cost-per-insight gap between frontier models increasingly hard to justify.
  • Technology is the only sector where models see genuine upside (+6.3%), yet even there the conviction is shallow; healthcare leads on raw numbers but the sample is just two companies.

AI Signals — Week 11, Mar 09–13, 2026

2026-03-09 → 2026-03-13 · generated by: claude
Summary
  • Every model thinks the market is overvalued — average upside across all five models is negative, ranging from GPT's brutal -17% verdict to Claude's relatively sanguine -3%, a 14-percentage-point gap that tells you more about model personality than market reality.
  • DeepSeek delivers perfect parse reliability at 100% validity for a cost of $2.10 per thousand valuations — roughly 16x cheaper than Claude, which raises uncomfortable questions about what you're actually paying for.
  • Gemini's terminal growth rate is locked to a suspiciously tight band with a standard deviation of just 0.09%, suggesting the model has hardwired a near-constant assumption rather than reasoning from first principles on each company.
  • GPT is the only model to peg terminal growth at exactly 2.0% with zero standard deviation across 115 valuations — a statistical signature that is not analysis, it is a default setting masquerading as judgment.

AI Signals — Weekend Read: What Five AI Models Taught Us About Stock Valuation

2026-03-07 · editorial · written by Claude
Summary
  • Early data from 460 valuations over 4 days: all five LLMs lean bearish, with average bias from -2.8% to -13.8% vs analyst consensus
  • GPT outputs exactly 2.0% terminal growth for every company (σ=0.00) — a prompt fallback adopted as a final answer, not a system cap
  • Five mid-tier AI models run in parallel for $45/month — constrained to text-only reasoning with no tools or web browsing
  • Finnish stocks appear well-calibrated (-3.3%) but US large-caps show -12.7% gap — hypotheses to track as data accumulates

AI Signals — Week 10, Mar 02–05, 2026

2026-03-02 → 2026-03-05 · generated by: claude
Summary
  • Every model called the market overvalued this week — the most bearish cross-model consensus since this platform launched, with average downsides ranging from -3% to -15% across all five models.
  • GPT's terminal growth rate is locked at exactly 2.00% with zero standard deviation across 63 valuations, a statistical impossibility in genuine analysis that exposes hard-coded assumptions.
  • DeepSeek delivers the only perfect validity score (100%) at a cost of $2.03 per thousand valuations — roughly 16x cheaper than Claude while expressing greater conviction with a 0.65 confidence average.
  • Tesla's consensus target price of $253 against a spot of $406 represents the widest absolute bearish call of the week, with zero dispersion across models — a rare moment of unanimous AI pessimism.
All content is generated by AI models and may contain errors. This is an experimental tool — not investment advice, research, or recommendation. Terms of Use · Privacy Policy