
AI Signals — Weekend Read: Which AI is best at investing?

2026-05-17 · editorial · written by Claude
Summary
  • Calibration and backtesting are still in progress (Engine v8 / Prompt v11 due late May; meaningful 3-month accuracy data lands July 2026, 12-month March 2027). What we can compare today is observable behaviour, not predictive performance
  • Behavioural wins by category: Claude is best calibrated (raw output clipped only 27 % of the time vs 42 % for Grok), DeepSeek is 100 % reliable and the cheapest ($2.07/1K), Grok is the fastest (7.7s end-to-end), Gemini is the most willing to take extreme calls, and GPT is the only model whose answers are partially uncorrelated with the rest
  • The panel collapses statistically: effective number of independent estimators dropped from 1.21 (early March) to 1.10 (early May). Herd is intensifying, not loosening
  • But part of that 'agreement' is engine-produced: on 19 % of company-days all five models hit a cap (pre-cap raw spread on those days averages 15 pp, post-cap 0). On 41 % of days at least three models are capped. The site already shows raw vs calibrated agreement per company with a flag when the consensus is partly mechanical
  • Bonus finding: every model's daily target-price autocorrelation is negative (−0.14 to −0.31). AI does not anchor on yesterday's view — opposite of the +0.3 to +0.5 anchoring well documented in human analysts. Whether this helps or hurts predictive accuracy is a question only the 3- and 12-month backtests will answer

# Which AI is best at investing?

Weekend Read #8 — May 17, 2026

AI Investor Barometer tracks how five LLMs produce DCF assumptions for 23 listed companies — daily, independently, on identical inputs.

The question arrives in some form once a week. It surfaces in a Reddit thread, a Slack channel, a coffee chat with someone who knows you write about this. Which AI should I use for stocks? GPT? Claude? Gemini? Which one's best?

The instinct is to answer cleanly. Pick a winner. Anchor a recommendation. Use Claude — it's the most balanced. Or Use Grok — it's fastest. Or Use whichever, they're all the same.

We have run five frontier models in parallel for 53 trading days, on the same 23 companies, with the same prompt, the same financial inputs, the same temperature. Six thousand and change valid outputs. Enough data, finally, to look the question in the eye — but not yet enough to call a winner.

Three things up front, before the answer. First, the methodology is still settling. Engine v7 and Prompt v10 went live in late March, and an Engine v8 / Prompt v11 cycle is queued for late May. Each calibration step changes the numbers. Second, the verdict that matters most — which model's estimates actually predict where stocks go — needs longer horizons than we have. One-day directional accuracy is a coin flip across all five models (47–53 percent). Real tests at three months land in July 2026, at twelve months in March 2027. We have not yet earned the right to crown a forecaster. Third, the panel is not a sample of each provider's best. Anthropic, OpenAI, Google, DeepSeek and xAI all field deliberately differentiated lineups — top-tier reasoning models on one shelf, faster and cheaper variants on another. We run the mid-cost tier, chosen so the panel can cover every company every day on a $45-per-month budget. Running each provider's flagship would be a different observatory at perhaps 20–50× the cost, and could shift every finding here. What we measure is what mid-tier frontier models do, in production, at scale.

What we can say something about today is behaviour. How the models speak, where they agree, where the engine compresses them, and where one of them refuses to drift toward the centre. That is what this essay is about. The performance verdict is months away.

---

## Each model wins at something observable

There is no shortage of legitimate-sounding crowns to hand out, as long as the crown stays modest.

Claude is the most calibrated. Its raw estimates land closest to analyst consensus before the engine has to step in with bounds. When a model's raw estimate diverges from analyst consensus by more than 40 percent, the engine clips it back; Claude triggers that clip only 27 percent of the time. For Grok, it happens 42 percent of the time. Calibration is what we can observe today — it does not yet say Claude is right. It says Claude is the model whose untouched output respects the analyst-defined bounds most often.
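
To make the clipping rule concrete, here is a minimal sketch of how a clip rate like that can be computed. The dataframe layout and column names are assumptions for illustration, not the production engine:

```python
import pandas as pd

def clip_rate(df: pd.DataFrame, band: float = 0.40) -> float:
    """Share of a model's raw outputs the engine has to clip back toward analyst bounds.

    Expects one row per company-day for a single model, with columns
    (names assumed for this sketch):
      raw_tp      - the model's untouched target price
      analyst_tp  - analyst consensus target price
    A row counts as clipped when the raw estimate diverges from the
    analyst consensus by more than `band` (40%) in either direction.
    """
    divergence = (df["raw_tp"] - df["analyst_tp"]).abs() / df["analyst_tp"]
    return float((divergence > band).mean())

# e.g. clip_rate(outputs[outputs["model"] == "claude"])   # ~0.27
#      clip_rate(outputs[outputs["model"] == "grok"])     # ~0.42
```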

DeepSeek is the most reliable. It has produced valid, parseable JSON output 100 percent of the time — every single call, every day, every company. The other four models hover between 90 and 99 percent. DeepSeek is also the cheapest by a wide margin, $2.07 per thousand outputs versus $30+ for the frontier American models. If you wanted to run an AI valuation engine on your own laptop overnight, DeepSeek is the only honest answer.
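
Reliability here is simply the share of responses that parse as JSON and carry the fields the engine needs. A minimal sketch, with the required field names assumed for illustration rather than taken from the engine's actual schema:

```python
import json

# Assumed field names for this sketch, not the engine's real schema.
REQUIRED_FIELDS = ("target_price", "revenue_cagr", "terminal_margin")

def is_valid_output(raw: str) -> bool:
    """True if a model response parses as JSON and carries every field we need."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in REQUIRED_FIELDS)

# reliability = sum(is_valid_output(r) for r in responses) / len(responses)
```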

Grok is the fastest. 7.7 seconds end-to-end versus 18 seconds for Claude on identical prompts. If your workflow involves twenty queries before lunch, Grok wins on time.

GPT is the partial outlier. It correlates with the other four models in the 0.72–0.74 range, while they correlate with each other in the 0.91–0.95 range. GPT also produces the lowest day-to-day variance; once it commits to a number, it sticks to it. Whether the divergence carries useful signal or just useful noise is a question backtesting will answer eventually.

Gemini is the most willing to make extreme calls. The widest CAGR ranges, the most aggressive long-term margin assumptions. If you want a model that pushes past consensus, Gemini does so most often.

Five clean answers to five different versions of the question. None of them is wrong as a description of behaviour. None of them is yet a verdict on quality.

---

## The problem with five clean answers

Alongside the comparisons, we computed something quieter. For each pair of models, we measured how strongly their daily estimates moved together after controlling for the underlying company effects. Then we aggregated this into a single statistic: the effective number of independent estimators in the panel. Five real models, but how many distinct perspectives?

In early March the answer was 1.21 out of 5. By early May it was 1.10. The herd is intensifying, not loosening.
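
One common way to turn pairwise correlations into an effective-N figure, consistent with the numbers above though not necessarily the exact estimator the site uses, is the design-effect formula N_eff = N / (1 + (N - 1) * r̄), applied after demeaning each model's estimates within each company. A minimal sketch with assumed column names:

```python
import pandas as pd

def effective_n(df: pd.DataFrame) -> float:
    """Effective number of independent estimators in a panel of models.

    Expects one row per model-company-day with columns (names assumed
    for this sketch): model, company, date, estimate.
    Company effects are removed by demeaning each model's estimates
    within each company; the mean pairwise correlation r_bar of the
    residuals then feeds N_eff = N / (1 + (N - 1) * r_bar).
    """
    df = df.copy()
    # strip the shared company-level signal from each model's series
    df["resid"] = df.groupby(["model", "company"])["estimate"].transform(lambda x: x - x.mean())

    # one column per model, one row per (company, date)
    wide = df.pivot_table(index=["company", "date"], columns="model", values="resid")

    corr = wide.corr()
    n = corr.shape[0]
    # mean of the off-diagonal pairwise correlations
    r_bar = (corr.values.sum() - n) / (n * (n - 1))
    return n / (1 + (n - 1) * r_bar)

# With five models and r_bar around 0.9 this lands near 1.10.
# The same function run on one sector's rows at a time gives the
# sector spread discussed further down.
```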

The headline reading — five LLMs are roughly one perspective measured five times — is correct in spirit. The reality is messier and worth being honest about, because part of that one-and-a-bit number is structural and part is a side effect of how we built the engine.

The structural part is real. Frontier LLMs are trained on overlapping data, pointed at similar reward signals, and converge architecturally. Asking five of them the same valuation question produces less independence than the count suggests. This is a finding about LLMs, not about our pipeline.

The pipeline part is also real, and we should not pretend otherwise. Every model's output passes through bounds — sector caps, analyst-target caps, PE caps — that exist for a defensible reason. Without them, a model that decides Nokia should trade at 50× earnings would show up as 100 percent upside, and the dashboard would be unreadable. With them, raw outliers are clipped to a defensible band.

Those bounds have a side effect. We measured it. On 19 percent of company-days, all five models hit a cap. On those days the engine forces the spread between them to zero. The pre-cap raw spread on those same days averages 15 percentage points; the displayed post-cap spread is 0.0. Fifteen points of disagreement, gone, because the engine could not let five models all assert that Apple should trade at 80× forward earnings.
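
The accounting behind those numbers reduces to a per-company-day aggregation of raw versus displayed values. A minimal sketch, with column names and the boolean cap flag assumed for illustration:

```python
import pandas as pd

def cap_summary(df: pd.DataFrame) -> dict:
    """Per company-day cap accounting.

    Expects one row per model-company-day with columns (names assumed):
      raw_upside     - upside implied by the model's untouched output, in pp
      capped_upside  - upside after the engine's bounds, in pp
      was_capped     - True if the engine clipped this output
    """
    by_day = df.groupby(["company", "date"]).agg(
        n_capped=("was_capped", "sum"),
        raw_spread=("raw_upside", lambda s: s.max() - s.min()),
        shown_spread=("capped_upside", lambda s: s.max() - s.min()),
    )
    all_capped = by_day["n_capped"] == 5
    return {
        "share_all_capped": all_capped.mean(),                                   # ~0.19
        "raw_spread_on_those_days": by_day.loc[all_capped, "raw_spread"].mean(),   # ~15 pp
        "shown_spread_on_those_days": by_day.loc[all_capped, "shown_spread"].mean(),  # ~0
        "share_three_plus_capped": (by_day["n_capped"] >= 3).mean(),             # ~0.41
    }
```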

Forty-one percent of company-days have at least three models capped. On those days some of the apparent agreement is engine-produced, not model-produced. The site already shows this — every company page has a small "raw vs calibrated agreement" indicator with a flag when the agreement is partly mechanical. We did not invent this caveat for the essay; we have been showing it on the site for weeks.

So when we say five LLMs collapse into 1.10 effective opinions, the honest decomposition is: part of it is priors that really are shared, and part is the engine forcing those priors into the same channel. We do not yet know the split. Engine v8, due in late May, will let us measure the residual model-side agreement cleanly.

What we can say with confidence: even on the 45 percent of company-days where no model is capped — pure raw output — the four non-GPT models still correlate above 0.90. The structural finding survives the cap caveat, just at a less dramatic intensity.

The herd is also unevenly distributed across sectors. Run the same effective-N calculation sector by sector and the spread is striking. In healthcare, the panel collapses to 1.04 — five models speaking with essentially one voice. In financials it sits at 1.10, in consumer goods at 1.14, technology at 1.17. Energy is the apparent outlier at 1.45, but only because GPT decouples sharply there: its correlation with the other four falls from a typical 0.7+ to between 0.20 and 0.41. Strip GPT out and the remaining four still correlate above 0.70 in energy. The four-model herd holds across every sector we measure; only GPT's outsider behaviour is sector-dependent.

---

## What "best" means once you know all that

Once you accept that the panel is structurally homogeneous, partly engine-compressed, and statistically untested as a forecaster, the original question fragments into smaller, more honest ones.

If you want the cheapest path to one solid AI valuation per stock per day, the answer is DeepSeek. It produces output of the same kind as the other models for one-fifteenth the cost, and never fails to parse. For a researcher building a one-person observatory, this is the only model that scales.

If you want the most market-aligned number — the one that requires the fewest mathematical interventions to stay sane — the answer is Claude. It is the model that most closely respects the bounds analysts have already drawn.

If you want a second opinion that genuinely differs from the mainstream, you want GPT. It is the only model in the panel whose answers are partially uncorrelated with the others. Whether that uncorrelation is wisdom or noise is a question we cannot answer yet.

And if you want a model that holds a consistent contrarian view, watch DeepSeek. In 53 trading days it has not once crossed into positive territory on average bias. Every other model has oscillated; DeepSeek has been negatively biased, every day, on every market. Wisdom, conservatism, training-data artefact, or something else? We do not know yet. But it is the only model that refuses to drift toward the centre.

These are answers to behavioural questions. They are smaller than the original question. They are what we are entitled to say on the data we have.

---

## A brief surprise: AI does not anchor

One bonus finding worth flagging. We measured the autocorrelation of each model's daily target-price changes — the tendency for today's estimate to anchor on yesterday's. For human analysts, this number is famously positive, around +0.3 to +0.5; analysts cling to their prior view and adjust slowly.

For all five LLMs, the number is negative. Mildly so, between −0.14 (Claude) and −0.31 (Gemini). The models do not anchor to yesterday — if anything they slightly overcorrect against it. This is the cleanest difference we have found between AI and human analysts. It points to one structural advantage AI may carry into the eventual accuracy backtest: no stale view, no commitment bias, fresh look every morning. It also points to one structural cost: more day-to-day noise, which is why the dashboard offers 7-day and 14-day smoothing.
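
The measurement itself is a lag-1 autocorrelation of each model's daily target-price changes, computed per company series and then averaged per model. A minimal sketch with assumed column names:

```python
import pandas as pd

def tp_change_autocorr(df: pd.DataFrame) -> pd.Series:
    """Lag-1 autocorrelation of daily target-price changes, per model.

    Expects one row per model-company-day with columns (names assumed):
    model, company, date, target_price. Positive values mean today's
    revision tends to continue yesterday's (anchoring); negative values
    mean it tends to reverse it.
    """
    df = df.sort_values(["model", "company", "date"]).copy()
    df["change"] = df.groupby(["model", "company"])["target_price"].diff()

    # autocorrelation within each model-company series, then averaged per model
    per_series = df.groupby(["model", "company"])["change"].apply(lambda s: s.autocorr(lag=1))
    return per_series.groupby(level="model").mean()   # roughly -0.14 (Claude) to -0.31 (Gemini)
```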

We will see, when the three- and twelve-month accuracy data starts arriving, whether anti-anchoring helps or hurts.

---

## What the question really asks

There is a reading of which AI is best for investing that is not really about AI at all. It is about wanting an authority. It is the same impulse that produces the search for a single best analyst, a single best newsletter, a single best fund manager. The instinct is older than the technology.

What the data suggests is that AI does not solve this problem yet, and we should not pretend otherwise. It does not deliver an oracle. It delivers, instead, a panel of five models that mostly agree with each other and with the analyst consensus that already existed, partly because the priors really are shared and partly because the engine compresses them. For a marginal cost of $45 per month. The contribution is not better answers. It is cheaper, more numerous, and more transparent answers — including transparent about what the engine is doing.

That is genuinely useful. It is not what the question hoped for.

The honest closing on a Friday afternoon, after 53 trading days and six thousand valuations: there is no best AI for investing. There is a panel of five models that, used together, give you a structured way to think about how an AI consensus diverges from a human analyst consensus, and where the divergence might be informative. The interesting question is not which model to pick. It is what the AI consensus and the analyst consensus together tell you that neither one says alone — and even that question waits on the accuracy data we will not have until summer.

That is the question we built this site to ask. It just is not the question anyone arrives with. And the answer to it is still being earned.

Want these insights weekly?
Subscribe to AI Signals →