
AI Signals — Weekend Read: What Five AI Models Taught Us About Stock Valuation

2026-03-07 · editorial · written by Claude
Summary
  • Early data from 460 valuations over 4 days: all five LLMs lean bearish, with average bias from -2.8% to -13.8% vs analyst consensus
  • GPT outputs exactly 2.0% terminal growth for every company (σ=0.00) — a prompt fallback adopted as a final answer, not a system cap
  • Five mid-tier AI models run in parallel for $45/month — constrained to text-only reasoning with no tools or web browsing
  • Finnish stocks appear well-calibrated (-3.3%) but US large-caps show -12.7% gap — hypotheses to track as data accumulates

The Experiment

What happens when you give five competing AI models the same financial data about a company and ask them to value the stock? We built a system to find out — and the early results, while preliminary, are already raising interesting questions about how AI reasons about financial value.

The AI Investor Barometer runs a daily pipeline where GPT, Claude, Gemini, DeepSeek, and Grok each receive identical company fundamentals — revenue history, margins, analyst consensus, sector context — and output four valuation assumptions: revenue growth, target margin, cost of capital, and terminal growth rate. A deterministic valuation engine then converts those assumptions into model estimates using either a DCF model for operating companies or an excess return model for financials.

No model sees another model's output. No model computes a target price directly. The AI's job is judgment — the math is fixed.
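To make the division of labor concrete, here is a minimal sketch of what such a deterministic engine could look like. The `Assumptions` fields mirror the four judgment values described above; the five-year horizon and the direct revenue-to-free-cash-flow mapping are simplifying assumptions for illustration, not the production engine's actual mechanics.

```python
from dataclasses import dataclass

@dataclass
class Assumptions:
    """The four judgment values each model must output."""
    revenue_growth: float   # annual growth over the explicit forecast period
    target_margin: float    # free-cash-flow margin applied to revenue
    wacc: float             # cost of capital, used as the discount rate
    terminal_growth: float  # perpetual growth after the forecast period

def dcf_value(revenue: float, a: Assumptions, years: int = 5) -> float:
    """Convert one model's assumptions into a single value, deterministically.

    Simplified sketch: revenue compounds at the assumed growth rate, free
    cash flow is revenue times the target margin, and a Gordon-growth
    terminal value is added after the explicit period.
    """
    value = 0.0
    fcf = 0.0
    for t in range(1, years + 1):
        revenue *= 1 + a.revenue_growth
        fcf = revenue * a.target_margin
        value += fcf / (1 + a.wacc) ** t
    terminal = fcf * (1 + a.terminal_growth) / (a.wacc - a.terminal_growth)
    return value + terminal / (1 + a.wacc) ** years

# Same revenue base, different judgment: only the four assumptions vary.
print(round(dcf_value(1000.0, Assumptions(0.08, 0.20, 0.09, 0.02))))
```

Because the math is fixed, any disagreement between models shows up only through these four numbers, which is what makes the assumption statistics later in this article comparable across providers.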

A note on sample size: the observations in this article are based on four production days — 460 model outputs across 23 companies. This is enough to spot patterns and raise hypotheses, but far too little to draw statistically robust conclusions. We present the data as early signals, not established findings. Many of these patterns may shift, reverse, or disappear as weeks and months of data accumulate.

The Models: Mid-Tier by Design

A natural question: why not use the most powerful model from each provider? The lineup (GPT-4.1, Claude 3.5 Sonnet, Gemini 1.5 Pro, Grok-2, and DeepSeek-chat) is a heterogeneous mix. These are not equivalent products: each provider structures its API offerings differently, and the models vary in architecture, capability level, and pricing tier. Some sit clearly mid-range within their provider's lineup; others are closer to the flagship. There is no apples-to-apples comparison across providers.

The choice is also practical. In the early phase of building, testing, and iterating on the entire framework — prompts, validation pipeline, valuation engine, data flow — it makes little sense to burn through the most expensive models. The infrastructure needs to be proven before premium compute is justified. Mid-tier models are well-suited for this: capable enough to produce structured financial reasoning, affordable enough to run five of them in parallel across 23 companies every day while the system matures.

The models are also intentionally constrained. They receive no tools, no web browsing, no function calling capabilities. Each model gets a static snapshot of pre-fetched financial data — revenue history, margins, analyst consensus, recent IR headlines — and must reason purely from that text. No live searches, no follow-up questions, no ability to pull additional data.

This is a deliberate choice: before giving AI models more freedom, we need to understand how they behave under tight guidance. The constraint reveals each model's baseline reasoning style — how it interprets the same numbers, weighs uncertainty, and arrives at conclusions when it can only think, not act.

Early Observations: A Bearish Tilt

In the first four days, all five models lean bearish — every model's average estimate sits below analyst consensus. The gap ranges from Claude's mild -2.8% to GPT's -13.8%.

Whether this is a stable pattern or an artifact of the initial sample remains to be seen. However, the direction is consistent across all five models, which is notable even in a small dataset. One plausible explanation: language models tend toward conservatism under uncertainty. They anchor heavily on historical data and may discount forward-looking narratives. Where a human analyst might price in a product launch or strategic pivot, an LLM hedges.

Model     Avg Bias  Validity  Speed   Cost/1K  Cap Rate
Claude    -2.8%     90.2%     26.5s   $36.36   27.2%
Gemini    -5.7%     98.9%     18.0s   $10.26   33.7%
Grok      -8.2%     98.9%     7.7s    $14.63   42.4%
DeepSeek  -10.6%    100%      25.2s   $2.07    34.8%
GPT       -13.8%    92.4%     10.5s   $15.61   36.4%

The "Validity" column deserves explanation. Each model output goes through a multi-stage validation pipeline: first, the JSON structure must match a strict schema with all required fields and value ranges. Then, every cited source URL is checked against a domain allowlist — analyst reports and third-party commentary are blocked. Finally, the assumptions must produce a meaningful valuation through the engine without errors. Validity is the percentage of outputs that survive all three stages. DeepSeek's 100% means every output it produced in this period passed every check. Claude's 90.2% means roughly one in ten outputs contained a structural issue that prevented a valid valuation. Whether these validity rates hold as the sample grows is an open question.
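The three stages can be sketched as a single guard function. The field ranges and the domain allowlist below are illustrative assumptions; the article does not spell out the real pipeline's schema or allowed domains.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"sec.gov", "company-ir.example.com"}  # illustrative only
RANGES = {  # required fields and plausible value ranges (assumed)
    "revenue_growth": (-0.5, 1.0),
    "target_margin": (-0.2, 0.6),
    "wacc": (0.04, 0.20),
    "terminal_growth": (0.0, 0.04),
}

def validate(output: dict) -> bool:
    # Stage 1: schema check, all required fields present and in range.
    for field, (lo, hi) in RANGES.items():
        if field not in output or not (lo <= output[field] <= hi):
            return False
    # Stage 2: every cited source URL must be on the domain allowlist.
    for url in output.get("sources", []):
        if urlparse(url).netloc not in ALLOWED_DOMAINS:
            return False
    # Stage 3: the assumptions must yield a meaningful valuation; a
    # non-positive WACC-minus-growth spread breaks the terminal value.
    return output["wacc"] - output["terminal_growth"] > 0

ok = validate({
    "revenue_growth": 0.08, "target_margin": 0.2,
    "wacc": 0.09, "terminal_growth": 0.02,
    "sources": ["https://sec.gov/filing"],
})
print(ok)  # True
```

An output counts toward the Validity percentage only if it survives all three checks in order, which is why a single malformed field is enough to sink an otherwise sensible valuation.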

A Curious Pattern: GPT and Terminal Growth

Perhaps the most striking observation so far involves GPT's terminal growth assumption — the perpetual growth rate applied after the explicit forecast period. In a DCF model, small changes here compound dramatically.
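The compounding is easy to see with the Gordon-growth terminal value formula, TV = FCF * (1 + g) / (WACC - g), using illustrative numbers rather than the system's actual engine:

```python
# Small changes in the perpetual growth rate g move the terminal value
# disproportionately, because g appears in both numerator and denominator.
def terminal_value(fcf: float, wacc: float, g: float) -> float:
    return fcf * (1 + g) / (wacc - g)

fcf, wacc = 100.0, 0.09
for g in (0.020, 0.025, 0.030):
    print(f"g={g:.1%}: TV={terminal_value(fcf, wacc, g):,.0f}")
```

With these numbers, moving g from 2.0% to 3.0% inflates the terminal value by roughly 18%, so a model that never varies g is effectively refusing to use one of its four levers.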

In the first four days, GPT has output exactly 2.00% for every single company. Standard deviation: 0.00.

Nokia and NVIDIA get the same terminal growth. Elisa and Tesla. UPM and Amazon. GPT doesn't differentiate — at least not yet.

We traced the likely cause: the valuation prompt provides all models with a fallback value of 2.0% for terminal growth in case data is insufficient. Four out of five models appear to treat this as a starting point and adjust based on company context. GPT appears to treat it as a final answer — adopting the fallback for every company regardless of sector or growth profile. The valuation engine allows up to 2.5% or more for technology and growth sectors, so this isn't a system cap. Whether GPT would differentiate with a differently structured prompt or with more context is something we plan to explore.
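One way to flag this behavior programmatically is to test whether a model's terminal-growth outputs collapse onto the known fallback with zero spread. The 2.0% constant comes from the prompt described above; the function name and threshold logic are an illustrative sketch.

```python
# Flagging fallback adoption: if a model's terminal-growth outputs show
# zero spread across companies and equal the prompt fallback, the model
# is likely echoing the fallback rather than reasoning per company.
from statistics import pstdev

FALLBACK = 0.02  # the 2.0% fallback supplied in the valuation prompt

def echoes_fallback(terminal_growths: list[float]) -> bool:
    return pstdev(terminal_growths) == 0 and terminal_growths[0] == FALLBACK

print(echoes_fallback([0.02, 0.02, 0.02]))     # True: constant fallback
print(echoes_fallback([0.015, 0.025, 0.030]))  # False: differentiated
```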

Claude, by contrast, shows terminal growth variation of σ=0.39 in the initial data — ranging from roughly 1.5% for mature companies to 3.0% for growth technology. This suggests more company-specific reasoning, though the pattern needs more data to confirm.

Model     Avg Terminal Growth  Std Dev  Avg WACC  Avg CAGR
Claude    2.41%                0.39     9.13%     8.0%
Grok      2.21%                0.25     9.49%     8.0%
DeepSeek  2.19%                0.28     9.37%     6.0%
Gemini    2.02%                0.12     9.45%     8.0%
GPT       2.00%                0.00     9.68%     5.0%

Two Markets, Different Signals

An interesting split emerges between markets, though the small sample warrants caution.

Finnish stocks show an average bias of -3.3% — relatively close to analyst consensus. This could indicate that the models handle Nordic stable-growth companies like Elisa, Sampo, and Nordea reasonably well, or it may simply reflect that analyst consensus for these companies is itself conservative.

US large-caps show a wider gap: -12.7% average bias. The models appear to systematically undervalue Apple, Microsoft, NVIDIA, and peers. One hypothesis: this reflects a structural limitation of DCF valuation for companies whose market premiums reflect optionality, ecosystem lock-in, and narrative that conservative cash-flow modeling struggles to capture. But four days of data is insufficient to separate structural bias from noise, and this split could narrow or widen as more data arrives.

The Economics

Running five competing AI analysts across 23 companies costs approximately $45 per month. The cost differences across providers are striking — and in the initial data, don't appear to correlate with output quality.

DeepSeek processes 1,000 valuations for $2.07 — seventeen times cheaper than Claude at $36.36. In the initial period, DeepSeek achieved 100% validity, compared to Claude's 90.2%. Grok completes valuations in 7.7 seconds — 3.4 times faster than Claude's 26.5 seconds. The pricing reflects how different providers position their API tiers: some charge a premium for their brand, others compete aggressively on cost.

The weekly AI Signals report, auto-generated by Claude Sonnet, costs $0.034 per issue.

For context, a single junior equity analyst's monthly salary would fund this system for decades. The question isn't whether AI can replace analysts — it's what happens when you run multiple AI perspectives simultaneously and let them disagree.

Early Takeaways

Four days is not enough to draw conclusions — but it is enough to formulate questions worth tracking.

Do LLMs have a systematic conservative bias in valuation? All five models lean bearish in the initial data. If this persists over weeks and months, it would suggest a structural property of how language models reason about financial uncertainty. If it reverses, the initial pattern was noise.

Does model diversity create useful signal? The disagreements between models — Claude's relative optimism, GPT's mechanical conservatism, DeepSeek's cheap reliability — are interesting precisely because they're consistent across companies. Whether model spread is predictive of anything remains to be seen.

Does cost predict quality? DeepSeek's strong initial performance at the lowest cost challenges the assumption that expensive models produce better structured output. But four days is too thin to generalize — validity rates and bias patterns may converge or diverge with more data.

What are the limits of text-only reasoning? Without browsing or tool access, the models can only work with what they're given. They can't verify a rumor, check today's commodity prices, or read a fresh earnings call transcript. This constraint is intentional for now — it isolates raw reasoning ability — but future iterations may explore what happens when models can actively seek information.

How much does the framework constrain the model? Hard caps — analyst consensus ±40%, PE multiples, terminal value share limits — override the AI's raw estimate in 27-42% of valuations in the initial data. The models may be less free than they appear, and the framework itself shapes the output substantially.
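A consensus-band cap of the kind mentioned here can be sketched as a simple clamp. The band width matches the ±40% figure above, but how the real framework combines this with its PE and terminal-value-share caps is not detailed in this article.

```python
def apply_consensus_cap(raw_estimate: float, consensus: float,
                        band: float = 0.40) -> tuple[float, bool]:
    """Clamp the model's raw estimate to within +/- band of analyst consensus.

    Returns the possibly capped estimate and whether the cap fired; the
    second value is what a "Cap Rate" statistic would aggregate.
    """
    lo, hi = consensus * (1 - band), consensus * (1 + band)
    capped = min(max(raw_estimate, lo), hi)
    return capped, capped != raw_estimate

print(apply_consensus_cap(50.0, 100.0))  # far below consensus: capped to 60.0
print(apply_consensus_cap(95.0, 100.0))  # within the band: unchanged
```

Under this kind of clamp, a model that is bearish by more than the band width contributes the band edge, not its own view, which is one reason the published bias figures may understate how far the raw estimates actually diverge.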

This is the very beginning of an ongoing experiment. We will revisit these observations as the dataset grows, and we expect some of them to hold up and others to be revised. The honest answer to most questions right now is: we don't know yet — but we're building the infrastructure to find out.

Want these insights weekly?
Subscribe to AI Signals →