AI Signals — Weekend Read: One Month In — What 2,760 AI Valuations Taught Us
- After 24 trading days and 2,760 estimates, we cannot separate methodology effects from genuine model behavior — every engine or prompt change moved the numbers
- Five distinct model personalities emerged: Claude is the only optimist (+1.0%), GPT has best directional accuracy (52.7%) but highest volatility, DeepSeek achieves 100% reliability at 1/15th the cost
- XOM's AI consensus estimate dropped 31% in one day after all five models reacted to Iran de-escalation signals, even as nine major banks raised their XOM price targets. DCF amplifies short-term sentiment for cyclical stocks
- Directional accuracy is 47-53% at 1-day horizon — statistically a coin flip. The real test begins at 3 months (July) and 12 months (March 2027)
- Model-specific calibration coming in late April when 30 days of v7 data is available
One Month In — What 2,760 AI Valuations Taught Us
For one month, five AI models have valued 23 stocks every trading day. That's 2,760 estimates (5 models × 23 stocks × 24 trading days) and roughly $50 in API costs. Here's what we've learned, and what we haven't.
The big picture: all models are bearish, and it got worse
When we started in early March, the combined AI valuation gap was around -15% — meaning the models collectively valued stocks about 15% below market prices. Over the next two weeks, that gap narrowed steadily to around -6%. We thought we were witnessing convergence toward market prices.
Then it reversed. The gap widened back to -12%. The improvement wasn't the models getting smarter — it was a combination of our own methodology changes (Engine v7, Bayesian shrinkage, temperature harmonization) and the market itself declining. When the market stabilized, the underlying bearish bias reasserted itself.
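For reference, the combined gap is just an aggregate of simple per-stock gaps. A minimal sketch in Python, where the tickers, values, and prices are illustrative stand-ins rather than our live data:

```python
from statistics import median

def valuation_gap(ai_value: float, market_price: float) -> float:
    """Per-stock gap: negative means the models value the stock below market."""
    return (ai_value - market_price) / market_price

# Hypothetical AI consensus values and market prices for three of the 23 stocks
ai_values = {"XOM": 82.0, "AAPL": 195.0, "NESTE": 14.0}
prices    = {"XOM": 161.0, "AAPL": 210.0, "NESTE": 15.5}

combined = median(valuation_gap(ai_values[t], prices[t]) for t in ai_values)
print(f"Combined AI valuation gap: {combined:+.1%}")  # -9.7% for this toy sample
```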
This is the most honest takeaway from one month of data: we cannot separate our methodology effects from genuine model behavior. Every time we changed something in the engine, prompt, or temperature, it moved the numbers. The signal we're looking for — how LLMs independently reason about value — is tangled up with the system we built to measure it.
Five models, five personalities — but are they real?
The models have developed what look like consistent personalities:
Claude is the optimist. It's the only model with a positive average bias (+1.0%), meaning it sees stocks as slightly undervalued on average. It's also the most consistent day to day: its estimates change only 1.3% between trading days. If Claude were a human analyst, it would be the conservative senior analyst who doesn't chase momentum.
GPT is the contrarian. It has the best directional accuracy (52.7%) but also the highest daily volatility (3.9%). It swings more than any other model, and when we changed temperature from 1.0 to 0.4, it flipped from most bearish to most optimistic before settling back. GPT seems most sensitive to parameter changes, which raises the question: is its "personality" real or just an artifact of how it processes constrained prompts?
DeepSeek is the workhorse. 100% valid JSON output across 529 runs — not a single parsing failure. It costs $1.10 per month compared to Claude's $17.20. Its accuracy is middling, but its reliability is unmatched; what counts as a "valid run" is sketched after these five profiles.
Gemini has the widest variance in growth assumptions and the lowest directional accuracy (47.0%). It disagrees with itself as much as with other models.
Grok is the fastest (7.3 seconds average) but gets its raw estimates capped by safety limits 46% of the time — more than any other model. It thinks big but the engine reins it in.
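For transparency, here is roughly what "valid" means in DeepSeek's 100% figure. A minimal sketch in which the required fields are a hypothetical schema, not our exact one:

```python
import json

REQUIRED_KEYS = {"fair_value", "growth_rate", "margin_target"}  # hypothetical schema

def is_valid_run(raw_response: str) -> bool:
    """A run counts as valid only if the response parses as JSON and
    every required field is present and numeric."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    return (isinstance(data, dict)
            and REQUIRED_KEYS <= data.keys()
            and all(isinstance(data[k], (int, float)) for k in REQUIRED_KEYS))

runs = ['{"fair_value": 82.0, "growth_rate": 0.02, "margin_target": 0.11}',
        "Sure! Here is the valuation you asked for..."]  # a typical failure mode
print(sum(is_valid_run(r) for r in runs) / len(runs))    # 0.5 for this toy sample
```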
The uncomfortable question: we tightened the prompt with sector-specific ranges, lowered temperature to 0.4 for all models, and applied Bayesian shrinkage. The spread between models has narrowed from 11 percentage points to 3-5. At what point are we measuring our own constraints rather than genuine model differences?
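To make that question concrete, the shrinkage step has roughly this shape; the 0.3 weight and the raw estimates are illustrative assumptions, not our production values:

```python
def shrink(raw_estimate: float, prior: float, weight: float = 0.3) -> float:
    """Pull a model's raw fair-value estimate toward a prior, here the
    cross-model mean. weight=0 keeps the raw estimate untouched; weight=1
    collapses every model onto the prior and erases "personality" entirely."""
    return (1 - weight) * raw_estimate + weight * prior

# Hypothetical raw estimates for one stock from the five models
raw = {"claude": 118.0, "gpt": 96.0, "deepseek": 101.0, "gemini": 88.0, "grok": 132.0}
prior = sum(raw.values()) / len(raw)  # 107.0

shrunk = {model: shrink(value, prior) for model, value in raw.items()}
# Spread narrows from 44 (88..132) to about 31 (93.7..124.5), by construction
```

Mechanically, raising `weight` narrows the cross-model spread regardless of what the models believe, which is exactly why the narrowing from 11 points to 3-5 is hard to interpret.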
The XOM incident: when geopolitics meets DCF
On April 2nd, ExxonMobil's AI consensus estimate dropped from $118 to $82 in a single day — a 31% decline and the largest single-stock move since we started tracking. All five models simultaneously cut their growth expectations from ~4% to 1-3% and lowered margin targets from 14% to 11-12%.
The real-world context: on April 1st, XOM's stock fell 5.7% — its largest single-day drop in a year — after President Trump signaled a potential end to the Iran conflict. Oil markets have been in turmoil since the Strait of Hormuz closure sent Brent crude toward $120/bbl, and any hint of de-escalation triggers sharp reversals in energy stocks. CNN described it as "whiplash" — markets swinging on every new Iran headline.
Our AI models picked up this signal through updated Yahoo Finance data — analyst revisions, price movements, and commodity signals — and all five independently reached the same bearish conclusion on the same day. That unanimity is noteworthy: five separate API calls, no shared memory, same direction.
But the models overreacted. XOM's trailing P/E is 23x while our energy sector cap is 18x. The DCF model produces a low raw estimate, then the P/E cap pushes it even lower. Meanwhile, nine major banks — Piper Sandler, Wells Fargo, Barclays, Citi, and others — have actually raised their XOM price targets. The analysts see long-term value in a diversified energy company; the DCF sees a cyclical stock trading above fair value.
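The cap mechanics look roughly like this; the 18x energy cap is from the engine as described above, while the EPS and raw DCF figures are illustrative:

```python
def apply_pe_cap(dcf_value: float, trailing_eps: float, sector_pe_cap: float) -> float:
    """Cap the raw DCF fair value at sector_pe_cap * trailing EPS."""
    return min(dcf_value, trailing_eps * sector_pe_cap)

trailing_eps = 7.00                 # illustrative
market_price = 23 * trailing_eps    # 161.00 at a 23x trailing P/E
raw_dcf      = 135.00               # illustrative raw DCF output, already below market

capped = apply_pe_cap(raw_dcf, trailing_eps, sector_pe_cap=18)  # min(135, 126) = 126
gap = (capped - market_price) / market_price                    # about -22%
```

For any stock trading above its sector multiple, the cap can only pull the estimate further below the market price, compounding whatever bearishness the models already produced.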
XOM now sits at -49.3% — the models think ExxonMobil is worth half its market price. That tells us three things: DCF has structural blind spots for cyclical commodity stocks, AI models amplify short-term sentiment when the input data shifts, and the gap between AI and analyst views can be a signal in itself.
What accuracy means (and doesn't) at 24 days
Our one-day directional accuracy ranges from 47% to 53% across models. Statistically, this is indistinguishable from a coin flip.
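The coin-flip claim is checkable with a two-sided binomial test. The sample size below is an assumption (roughly 23 stocks × 24 days of one-day calls per model), not an exact count from our logs:

```python
from scipy.stats import binomtest

n = 23 * 24              # assumed number of one-day directional calls per model (552)
hits = round(0.527 * n)  # GPT's 52.7% accuracy -> 291 correct calls

result = binomtest(hits, n, p=0.5, alternative="two-sided")
print(f"n={n}, hits={hits}, p-value={result.pvalue:.2f}")  # p-value around 0.2
```

A p-value around 0.2 means even the best model's edge could easily be luck at this horizon.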
But this metric is measuring the wrong thing. The models produce 12-month DCF valuations — asking whether they predicted tomorrow's price movement is like judging a marathon runner by their first 100 meters. The real accuracy test begins at 3 months (July 2026) and becomes meaningful at 12 months (March 2027).
What the short-term data does tell us about is model behavior, not prediction quality. Claude's near-zero bias (+1.0%) means it's the best calibrated to current market prices. GPT and DeepSeek, both at -5.1%, systematically underestimate. These behavioral signatures are consistent and likely genuine.
Where AI and analysts disagree
Our Disagreement Map reveals three distinct clusters:
Consensus zone (7 stocks): BRK-B, ELISA, JNJ, KNEBV, NDA1V, NOKIA, WRT1V — mostly Finnish defensives and stable US names. AI models agree with each other and with analyst targets. These are the well-understood companies where DCF works well.
AI agrees, analysts differ (13 stocks): most US large caps — AAPL, AMZN, MSFT, NVDA. AI models are internally consistent but systematically different from analyst consensus. This is the DCF vs. momentum gap: fundamentals-based models don't price in growth optionality the way sell-side analysts do.
Full uncertainty (3 stocks): NESTE, UPM, and now GOOGL. The AI models disagree both with each other and with analysts. NESTE has been stuck in this quadrant for three weeks — nobody knows how to value an oil refiner pivoting to renewable fuels during a geopolitical energy crisis.
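For the curious, here is a hypothetical sketch of how such a map can be computed from two dispersions. The 5% thresholds, and the fourth label the quadrant logic implies, are illustrative rather than the exact rules behind our map:

```python
from statistics import mean, pstdev

def classify(ai_estimates: list[float], analyst_target: float,
             spread_cut: float = 0.05, gap_cut: float = 0.05) -> str:
    """Place a stock by (a) how much the five AI models disagree with each
    other and (b) how far their mean sits from the analyst consensus target."""
    m = mean(ai_estimates)
    ai_spread = pstdev(ai_estimates) / m                    # internal AI disagreement
    analyst_gap = abs(m - analyst_target) / analyst_target  # AI vs. analyst distance
    if ai_spread < spread_cut and analyst_gap < gap_cut:
        return "consensus zone"
    if ai_spread < spread_cut:
        return "AI agrees, analysts differ"
    if analyst_gap < gap_cut:
        return "AI split, analysts near the AI mean"
    return "full uncertainty"

print(classify([118, 96, 101, 88, 132], analyst_target=140.0))  # full uncertainty
```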
What we got wrong
Temperature effects were larger than expected. Changing from 1.0 to 0.4 didn't just reduce randomness — it changed directional bias. We did this at the same time as prompt changes, so we can't isolate the cause. A proper A/B test is needed.
The universe is too small for robust statistics. 23 stocks means one outlier (XOM) can move the median gap by 6 percentage points in a day. We need at least 50-100 stocks for the aggregate indices to be stable.
DCF has structural blind spots. High-P/E growth stocks (GOOGL at +42.6%) and cyclical commodity stocks (XOM at -49.3%) sit at opposite extremes not because of AI insight but because of methodology limitations. The model works best for mature, predictable companies with moderate valuations.
What comes next
Model-specific calibration (late April). After 30 days of data, we'll activate per-model Bayesian shrinkage. Claude, with its +1.0% bias, will get more weight than GPT or DeepSeek at -5.1%. This should improve consensus quality.
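A minimal sketch of one bias-aware weighting rule; the inverse-absolute-bias form and the Gemini/Grok figures are illustrative assumptions, not the calibration we will ship:

```python
def bias_weights(biases: dict[str, float], floor: float = 0.005) -> dict[str, float]:
    """Weight each model inversely to the size of its historical bias, so
    well-calibrated models count more. floor keeps a near-zero bias from
    swallowing all the weight."""
    inv = {m: 1.0 / max(abs(b), floor) for m, b in biases.items()}
    total = sum(inv.values())
    return {m: w / total for m, w in inv.items()}

weights = bias_weights({
    "claude": +0.010, "gpt": -0.051, "deepseek": -0.051,  # biases from our data
    "gemini": -0.032, "grok": -0.024,                     # illustrative figures
})
# Claude ends up with roughly half the total weight in this toy example
```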
The earnings test (May-June). Q1 2026 results will start flowing in. This is the first real test: do models adjust their assumptions when they see new financial data, or do they anchor to stale estimates?
3-month accuracy check (July). The first statistically meaningful comparison between AI estimates and actual price movements. Also the first chance to compare AI accuracy against analyst accuracy on the same stocks.
One month of data has taught us more about our own methodology than about AI's ability to value stocks. That's not a failure — it's the honest starting point for understanding what these models actually do when you ask them to think about financial value.