Signals

AI Signals — Weekend Read: Same earnings, five readings

2026-05-10 · editorial · written by Claude
Summary
  • Q1 2026 is the observatory's first fully observed earnings season. Across 18 reporters, mean five-model spread did not shrink: 6.3pp before earnings, 6.1pp after. Seven companies tightened, eight widened, three held still
  • Sampo is the dramatic exception — 16pp pre-earnings spread collapsed to 2pp on May 8. But four of five models were forced onto the analyst-TP floor (7.57 €); the consensus is partly engine-produced, not genuine agreement
  • Microsoft, P&G, METSO and UPM all widened post-earnings. Reports with new capex programs or shifting assumptions split AI models the same way Stickel & Diether documented they split human analysts
  • Direction hit rate: AI's pre-earnings consensus matched the stock's 1-day reaction in only 6 of 18 cases (33%), below the 50% coin flip. When AI predicted upside, the stock rose just 1 time in 7. The sample is thin but the pattern recurs
  • Agreement is not accuracy. Five models can converge near truth, far from it, or pulled together by the same anchor. Real accuracy emerges in July when 3-month post-earnings prices are available

# Same earnings, five readings

Weekend Read #7 — May 10, 2026

AI Investor Barometer tracks how five LLMs produce DCF assumptions for 23 listed companies — daily, independently, on identical inputs.

May 6th, 9:00 AM Helsinki time. Sampo released its Q1 report. The day before, five AI models had assessed the company — and there was a chasm between them. GPT saw 2.3 percent upside in Sampo's valuation. DeepSeek saw 13.6 percent downside. Almost 16 percentage points between the models, on the same company, the same data, the same day.

Two days after the earnings release, on May 8th, all five models had moved to essentially the same number: a downside of around 15 percent, within two percentage points of each other. The spread had collapsed from sixteen to two. The company reported, the models reacted, the disagreement resolved.

This is the intuitive story. Information reduces uncertainty, the earnings release brings clarity, and the five AI models' estimates are closer to one another and closer to some "truth" than they were a week ago.

But Sampo is the exception, not the rule.

---

What 18 reports actually tell us

Q1 2026 is the first earnings season the observatory has fully observed. Eighteen of the 23 companies in our universe have reported — Mag4 (Meta, Microsoft, Alphabet, Amazon), Apple, ExxonMobil, Sampo, and a dozen Finnish industrials. Enough to compare; not quite enough to draw firm conclusions from.

We measured two things: the spread between the five models' valuation gaps three days before earnings, and three days after. If the intuition that information consolidates assumptions holds, the latter should be markedly smaller.

Mean spread before: 6.3 percentage points. Mean spread after: 6.1. Change: −0.2, essentially zero. Seven companies tightened, eight widened, three were unchanged. Closer to a coin toss than to systematic convergence.
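As a minimal sketch, the spread metric works like this. The per-model numbers below are illustrative, not the observatory's actual data; only the Elisa and Microsoft spread totals match figures cited in this piece.

```python
# Spread metric sketch: per company, spread = widest minus narrowest of the
# five models' valuation gaps (in percentage points), measured 3 days before
# and 3 days after the earnings release. Per-model values are illustrative.

def spread(gaps):
    """Five-model spread in percentage points: max gap minus min gap."""
    return max(gaps) - min(gaps)

# gaps are upside/downside estimates in %, one per model (illustrative)
pre  = {"Elisa": [4.0, 1.2, -1.0, 2.5, -2.4],   # spread 6.4pp
        "MSFT":  [8.0, 2.1, -2.1, 5.0, 0.4]}    # spread 10.1pp
post = {"Elisa": [1.0, 0.2, -1.6, 0.5, -0.9],   # spread 2.6pp
        "MSFT":  [9.0, 1.2, -3.3, 6.0, 2.2]}    # spread 12.3pp

deltas = {c: spread(post[c]) - spread(pre[c]) for c in pre}
# Elisa: -3.8pp (tightened); MSFT: +2.2pp (widened)
```

A negative delta means the report pulled the models together; a positive one means it split them further.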

Agreement did not increase. If anything, the opposite.

The first reaction is to suspect the metric. Is the interpretation missing something? Is Sampo actually the rule and the others the outliers? We will return to Sampo. But first: the average hides the story, and the story is at the company level.

---

Elisa, where the models read it the same way

The Finnish telecom Elisa was among the first to report. The numbers were as expected: stable subscription revenue, Q2 guidance in line, slightly better than a year ago. No dramatic surprises.

Before earnings the models were 6.4 percentage points apart. After earnings, 2.6. A tightening of 3.8 percentage points — the largest in the universe.

This supports the intuition. A clear report, with no nuance whose weighting could scatter the interpretations, and the models converged on nearly identical readings. Information did its job.

---

Microsoft, where the report split the models

Microsoft reported on April 29. Numbers were strong: Azure growth continued, AI infrastructure backlog grew, investors reacted positively. But the company also delivered guidance pointing to a 60 percent increase in capex — financing required for the AI capacity expansion.

This is two-narrative material. One reading sees AI revenue growing 30 percent and infers margins and terminal values upward. Another sees the capex requirements and infers free cash flow downward. The same report, two rational readings.

The models reacted differently. Pre-earnings spread was 10.1 percentage points. Post-earnings, 12.3. Disagreement grew.

Microsoft is not unique. P&G's spread widened from 3.9 to 7.0, METSO's from 4.3 to 6.3, UPM's from 5.0 to 7.0. In several cases, after the Q1 report, the models were more divided than before.

Empirical financial research has long documented the same phenomenon among human analysts. Stickel, and later Diether and co-authors, have shown that forecast dispersion can rise post-earnings rather than fall — particularly when the report includes new investment programs or shifts in key assumptions. AI mirrors this structure. It does not escape it.

---

Meta, where the models will not agree

Meta is the universe's reigning champion of disagreement. Pre-earnings spread: 24.9 percentage points. Post-earnings: 21.8. Some tightening — 3.1 percentage points — but the figure remains startlingly high, well above the next contender (Neste, 14.8 percentage points).

The Q1 report fed every thesis. Core advertising grew 18 percent. Reality Labs lost more than 4 billion. AI investment guidance climbed to 20 billion. User growth plateaued. Free cash flow rose sharply.

One model sees the enduring strength of the ad business. Another sees Reality Labs as a loss container. A third sees AI investments as the growth engine, a fourth as a returns risk. A fifth tries to balance them.

Different rational target prices can be derived from the same report. The spread reflects the report's internal multi-interpretability. One model's 79 percent upside and another's 17 percent are not the same error mirrored in different directions — they are different readings of the same text.

---

Sampo: when the consensus is forced

Back to Sampo, the opening scene. Sixteen percentage points collapsed to two. Consensus appeared, seemingly out of thin air.

But look more closely. At a spot price of 8.93 €, four of the five models (Claude, DeepSeek, GPT, Grok) produced exactly the same number: a downside of 15.2 percent. Only Gemini differed slightly, at −13.3 percent. Four models on an identical figure; the fifth within two percentage points of them.

This is not a coincidence. Sampo's analyst median target price is around 15.1 €. Our valuation engine clamps AI estimates between 7.57 € (50 percent of the analyst median) and 22.7 € (150 percent). Four models tried to value Sampo below 7.57 €, and the engine returned them to the floor. The consensus is partly a mechanical artifact.

On the company page this shows as a [Signal Purity](/company/SAMPO) marker: ⚠ "cap-induced agreement". The marker distinguishes when model alignment is genuine and when it has emerged from the anchor's pressure.
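A minimal sketch of the clamping and the cap-induced-agreement check described above. The function names, the majority threshold, and the tolerance are illustrative assumptions, not the engine's actual implementation.

```python
# Sketch (under assumed names/thresholds): clamp AI target prices to
# 50-150% of the analyst median, then flag "cap-induced agreement" when
# a majority of the clamped targets sit on a bound.

def clamp_target(ai_target, analyst_median):
    """Clamp an AI target price to 50-150% of the analyst median."""
    floor, ceiling = 0.5 * analyst_median, 1.5 * analyst_median
    return min(max(ai_target, floor), ceiling)

def cap_induced_agreement(ai_targets, analyst_median, tol=0.01):
    """Flag consensus as cap-induced when at least 3 of 5 clamped
    targets lie on the floor or ceiling (within a relative tolerance)."""
    floor, ceiling = 0.5 * analyst_median, 1.5 * analyst_median
    clamped = [clamp_target(t, analyst_median) for t in ai_targets]
    on_bound = sum(1 for t in clamped
                   if abs(t - floor) <= tol * floor
                   or abs(t - ceiling) <= tol * ceiling)
    return on_bound >= 3  # majority of five models

# Sampo-like case (approximate): analyst median 15.14 euros, four raw AI
# targets below the 7.57-euro floor, the fifth slightly above it.
targets = [6.9, 7.1, 6.5, 7.4, 7.74]
print(cap_induced_agreement(targets, 15.14))  # -> True
```

The point of the flag is exactly the distinction the Signal Purity marker makes: a 2-point spread produced by four models pinned to the same floor is not the same evidence as four models freely landing on one number.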

Sampo is therefore both evidence of tightening and a counterexample to whether tightening is real. The 16 percentage point pre-earnings dispersion was genuine disagreement. The final two-percentage-point spread is partly forced. The genuine post-earnings convergence lies somewhere in between — but exactly where, this method cannot show.

One more number. In the days after earnings Sampo's stock moved from 8.72 € to 8.95 € — a 2.6 percent gain over three days — while the models unanimously expected 15 percent downside. The consensus was both forced and pointed in the wrong direction. The agreement produced by the anchor floor did not protect against the market.

---

Direction matched in 6 of 18

Sampo is not alone. Measure the second metric — whether the direction of the five-model pre-earnings consensus matched the stock's 1- and 3-day reaction — and across the 18 reporters the answer is:

  • 1-day hits: 6/18 = 33 %
  • 3-day hits: 6/17 = 35 %
  • Random coin-flip baseline: 50 %

On both horizons, the AI direction hit rate was below random. The 18-company sample is statistically thin — the difference between 33 % and 50 % is not formally significant — but the pattern recurs in individual cases: Apple −17 % AI bearish, stock +0.2 %; Nokia −28 %, +2.5 %; P&G −10 %, +3.7 %; Neste −27 %, +9.9 %; Sampo −8 %, +2.6 %. AI saw the companies as expensive; the market did not agree.

Splitting by direction sharpens the picture: when AI predicted upside (7 companies), the stock rose only 1 time in 7. Bearish calls hit 5/11 — close to random. The models' optimism, in this sample, points systematically the wrong way.
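The hit-rate arithmetic itself is simple to sketch. A hit means the sign of the pre-earnings AI consensus matched the sign of the stock's reaction; the records below are the five bearish misses cited above (AI consensus %, stock reaction %).

```python
# Direction hit-rate sketch. A "hit": the AI consensus and the stock's
# post-earnings move share the same sign.

def hit_rate(records):
    """records: list of (ai_consensus_pct, stock_reaction_pct) pairs.
    Returns (hits, total)."""
    hits = sum(1 for ai, mkt in records if (ai > 0) == (mkt > 0))
    return hits, len(records)

# The five bearish misses named in the text (AI %, stock reaction %)
sample = [(-17.0, 0.2),   # Apple
          (-28.0, 2.5),   # Nokia
          (-10.0, 3.7),   # P&G
          (-27.0, 9.9),   # Neste
          (-8.0,  2.6)]   # Sampo
hits, n = hit_rate(sample)
print(f"{hits}/{n}")  # -> 0/5: every bearish call here went against the market
```

Run over all 18 reporters this yields the 6/18 figure; the by-direction split is the same computation on the bullish and bearish subsets separately.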

What might this be telling us? The sample cannot yet separate three hypotheses: the models cannot read positive earnings signals, the market prices them more efficiently in advance than negative ones, or the DCF-based method has a structural over-optimism bias on growth companies. By July, with three months of post-earnings price data, we can begin to choose between them.

---

What this is telling us

Earnings season did not tighten agreement on average. Some individual companies tightened (Elisa, Amazon), some widened (Microsoft, P&G, METSO), and some held still — or held still only because the cap was holding them.

Agreement and accuracy, however, are not the same thing. Five models can sit close to truth together, far from it together, or be pulled together by the same constraint: an analyst cap, a shared default assumption. Spread alone does not tell us which side we are on. It is an intermediate metric, not a final outcome.

Measuring accuracy requires realized prices: how close the median estimate came to the stock's actual price 1, 3, or 6 months later. The three-month price movement of the Q1 reporters will be available in July. At that point we can build the second metric.
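One plausible shape for that second metric, sketched here under illustrative assumptions — the function name and the sample prices are hypothetical, not the observatory's planned implementation:

```python
# Accuracy metric sketch: compare the median AI valuation gap at earnings
# time with the realized price move over the horizon (here 3 months).
# All names and numbers are illustrative assumptions.

def accuracy_error(median_gap_pct, price_at_earnings, price_later):
    """Absolute error (in percentage points) between the AI median's
    implied move and the realized move over the horizon."""
    realized_pct = (price_later / price_at_earnings - 1) * 100
    return abs(median_gap_pct - realized_pct)

# e.g. a -15.2% median AI gap against a hypothetical 3-month price path
print(round(accuracy_error(-15.2, 8.72, 9.10), 1))  # -> 19.6
```

Unlike the spread, this number is anchored to an outcome: it stays meaningful even when the five models agree for the wrong reason.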

Kahneman, Sibony and Sunstein in Noise (2021) document that two phenomena are often confused in forecasting research: emergence of consensus and improvement of accuracy. They can move in opposite directions. The median can be right even when the individual models disagree — the classic wisdom of crowds. The models can agree with each other and the median can be systematically wrong — groupthink.

Q1 2026 did not tell us which we are. It told us that five models read the earnings reports differently, and that difference persists past the financial calendar.

The empirical base is thin at this stage. Forty-nine trading days, one earnings season behind us, five models, twenty-three companies — the observatory is measuring, but has not yet measured enough to say what happens over longer horizons. The data from this quarter is also feeding the next iteration of the methodology. The mechanical consensus on Sampo, for instance, is part of what will inform an upcoming engine revision; what we learn about the models' anchoring behavior will shape later prompt revisions. The empirical work precedes the changes, not the other way around.

---

Back to Sampo

In the opening scene — May 6th, four models on the floor value — two things happened in parallel. Information moved the models in one direction. The anchor stopped their travel. The final agreement is partly real, partly engine-produced.

When we ask five AIs to read the same report and arrive at the same target price, the answer is often "no". Reports are multi-voiced, the models' training histories seep through the prompt's constraints, and different weightings of the same material can rationally yield different numbers. When the models agree — as on Elisa — the data is usually simple. When they differ — as on Meta — the difference itself carries information: the company's valuation is genuinely contested.

The earnings did not end the disagreement, and the models' shared direction did not beat a coin flip against the market. The disagreement is one signal. The directional uncertainty is another.

---

This report is based on 18 Q1 2026 earnings reports, five LLM models, and 49 trading days of accumulated data. The agreement analysis does not measure model accuracy — that metric we can build reliably only in July, once the three-month price movement of the Q1-reporting companies is known.

AI Investor Barometer is an experimental research tool, not investment advice.

Want these insights weekly?
Subscribe to AI Signals →