Selected findings

This page collects findings that are easy to cite but easy to overstate. Each entry separates the result from the evidence behind it, the interpretation it supports, and the claim it does not support.

AI and strategy

Frontier LLMs outperformed the tested human groups on one bounded strategic foresight task

Finding: Several frontier LLMs predicted live venture fundraising outcomes more accurately than the tested managers, investors, and crowd benchmark.
Evidence: In a fully prospective tournament, 30 U.S. Kickstarter technology ventures were evaluated while fundraising was still in progress and outcomes were unknown. Each LLM completed 870 pairwise comparisons. The human benchmarks included 346 employed U.S. managers recruited through Prolific and three MBA-trained investors working under monitored conditions. Gemini 2.5 Pro reached a Spearman rank correlation of 0.74 with realized funds raised and correctly ordered about 79% of venture pairs; the best human expert reached 0.45 and 67%, respectively.
Correct interpretation: This is prospective evidence that frontier models can outperform experienced human evaluators on a specific venture-outcome forecasting task with standardized information and a clear realized criterion.
Do not overclaim: This does not show that AI can run companies, make all strategic choices, or outperform humans on every kind of strategy problem.
Source: The strategic foresight of LLMs: Evidence from a fully prospective venture tournament
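
Illustration: The two accuracy measures cited in the evidence, Spearman rank correlation and the share of correctly ordered pairs, can be sketched in a few lines of Python. This is a minimal sketch, not the paper's code. The scores and outcomes are made-up toy numbers, and the function names are hypothetical. The last line rests on an assumption the source does not state: that the 870 comparisons correspond to every ordered pair of the 30 ventures (30 × 29).

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def pairwise_accuracy(predicted, realized):
    """Share of pairs whose predicted order matches the realized order."""
    pairs = [(i, j) for i, j in combinations(range(len(predicted)), 2)
             if realized[i] != realized[j]]
    correct = sum(
        (predicted[i] - predicted[j]) * (realized[i] - realized[j]) > 0
        for i, j in pairs
    )
    return correct / len(pairs)


# Toy data (illustrative only, not from the study):
# one evaluator's scores and the funds each venture eventually raised.
predicted = [0.9, 0.4, 0.7, 0.2, 0.5]
realized = [120_000, 15_000, 40_000, 30_000, 60_000]

rho = spearman(predicted, realized)
acc = pairwise_accuracy(predicted, realized)
ordered_pairs = 30 * 29  # assumption: 870 = every ordered pair of 30 ventures

print(f"Spearman: {rho:.2f}, pairwise accuracy: {acc:.0%}")
```

The two measures answer related but different questions: the rank correlation summarizes agreement across the whole ordering, while pairwise accuracy reports how often a head-to-head call between two ventures is right.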

Adding humans to a strong model can reduce forecast quality

Finding: In the tested foresight ensembles, human-AI aggregation did not improve on the best standalone model.
Evidence: Combining the best human expert with the best LLM produced a rank correlation of 0.67, below Gemini 2.5 Pro alone at 0.74. Larger ensembles, including a human-only ensemble and a grand ensemble combining all evaluators, also fell below the best individual model. The paper describes this as an augmentation trap in this setting.
Correct interpretation: “Human in the loop” is not automatically better. Hybrid systems need design, testing, and a clear account of when human judgment adds independent signal rather than noise.
Do not overclaim: This does not mean humans are useless. The paper argues that human contributions may matter more before and after the forecast: framing the question, assembling relevant information, judging applicability, and acting on the result.
Source: The strategic foresight of LLMs: Evidence from a fully prospective venture tournament

AI-generated business plans can be rated more favorably than entrepreneur plans

Finding: In one accelerator-application experiment, GPT-3.5-generated plans received higher evaluator ratings than the entrepreneur-written versions.
Evidence: GPT-3.5 completed ten accelerator applications from the entrepreneurs’ problem descriptions. A preregistered experiment with 250 experienced investment evaluators produced 2,500 blind evaluations. On average, the LLM-generated plans scored 0.14 standard deviations higher and were five percentage points more likely to receive an acceptance recommendation. The advantage was especially concentrated among plans the accelerator had rejected.
Correct interpretation: LLMs can raise the evaluated quality of strategy generation in some bounded settings, especially where the original plan is weak.
Do not overclaim: This does not show that the AI-generated plans would build better ventures or that AI can replace entrepreneurial judgment.
Source: Artificial intelligence and strategic decision-making: Evidence from entrepreneurs and investors

AI evaluations can align with investor consensus more than individual investors do

Finding: In one business-plan scoring task, LLM evaluations aligned more closely with average investor scores than individual investors aligned with one another.
Evidence: The model scored 138 textual business plans from a startup competition using the same rubric as 137 venture-capital and angel investors, who had supplied 541 evaluations. LLM scores correlated 0.52 with average investor scores and explained roughly one-quarter of their variation. The intraclass correlation between AI and VC scores was 0.51, compared with 0.25 among individual VC scores.
Correct interpretation: AI can approximate a panel’s consensus judgment in a bounded evaluation task.
Do not overclaim: Consensus is not ground truth. Agreement with investors does not prove that the model’s evaluations, or the investors’ evaluations, predicted later venture success.
Source: Artificial intelligence and strategic decision-making: Evidence from entrepreneurs and investors

Strategy learning and representation

Strategy courses measurably change how MBA students make strategic judgments

Finding: A strategy course improved measured decision accuracy and changed students’ mental representations of strategic problems.
Evidence: The study followed 2,269 MBA students who evaluated four real startup videos before and after a core strategy course. Average decision accuracy increased by about seven percentage points. Students listed more considerations, paid more attention to industry structure and imitability, used more language expressing uncertainty, became more confident, and found the task less difficult.
Correct interpretation: Strategy education can change strategic judgment and representation, not merely transmit vocabulary.
Do not overclaim: The evidence comes from a specific course, population, and task design. It does not show that every strategy course produces the same effects or that short-run case judgment maps directly onto long-run managerial performance.
Source: Learning strategic representations: Exploring the effects of taking a strategy course

External representations can improve or degrade strategy work

Finding: Strategy frameworks, diagrams, maps, spreadsheets, and other external representations shape the thinking they are used to support.
Evidence: The external-representations framework argues that visuals support four cognitive functions: working memory, long-term memory, pattern recognition, and knowledge transfer and transformation. Decision quality depends on the task, the representation’s usability and malleability, and the manager’s representational capability. Malleability can help a skilled manager search productively but can send a less skilled manager through an unhelpfully large or badly specified problem space.
Correct interpretation: Frameworks and visuals are cognitive tools, not neutral containers for ideas. Their value depends on fit with the problem and the user.
Do not overclaim: A polished visual does not automatically improve thinking, and a more flexible tool is not always better.
Source: External representations in strategic decision-making: Understanding strategy’s reliance on visuals

Distributed representations help under some conditions and hurt under others

Finding: Combining specialists’ partial models can outperform individual judgment, but only when the aggregation rule fits the task environment.
Evidence: The distributed-representations model compares specialists, generalists, unanimity, and averaging across environments that vary in munificence, dominance, complexity, uncertainty, and experience. Experienced generalists perform best when available, but they may be rare. Specialists work well when one cue dominates. Unanimity helps when good projects are rare and evaluators are inexperienced. Averaging is a robust default in many other conditions.
Correct interpretation: Organizational judgment depends on the fit among what people know, how their judgments are aggregated, and the environment being evaluated.
Do not overclaim: Diversity of partial models is not enough. No single aggregation rule is universally best.
Source: The power and limits of distributed representations in strategic decision-making

Aggregation and organization design

Consensus can hurt organizations

Finding: Requiring more agreement can reduce mistaken approvals while increasing missed opportunities.
Evidence: The consensus article and related organization-design work show that higher consensus thresholds lower commission errors but raise omission errors. Evidence from mutual funds shows the same trade-off: funds using unanimity tended to miss good investments, while funds whose managers could act independently tended to acquire more stocks that later performed poorly.
Correct interpretation: The right decision rule depends on which mistake is costlier, and the rule has to be chosen before the group knows which proposal it will evaluate.
Do not overclaim: Consensus is not always bad. It can be appropriate when approving a bad proposal would be much worse than missing a good one.
Source: When consensus hurts the company; Organizational structure as a determinant of performance
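
Illustration: The trade-off between commission and omission errors can be made concrete with a small calculation. This is a minimal sketch under simplifying assumptions that are not taken from the cited work: five evaluators vote independently, each approves a good proposal with probability 0.7 and a bad proposal with probability 0.3, and the proposal passes when at least k evaluators approve. The function names and numbers are hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(at least k of n independent evaluators approve), each with prob p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def error_rates(n, k, p_yes_good, p_yes_bad):
    """Return (omission, commission) error rates for approval threshold k.

    omission   = a good proposal is rejected
    commission = a bad proposal is approved
    """
    omission = 1 - prob_at_least(k, n, p_yes_good)
    commission = prob_at_least(k, n, p_yes_bad)
    return omission, commission


N = 5              # assumed committee size
P_YES_GOOD = 0.7   # assumed chance an evaluator approves a good proposal
P_YES_BAD = 0.3    # assumed chance an evaluator approves a bad proposal

rates = {k: error_rates(N, k, P_YES_GOOD, P_YES_BAD) for k in range(1, N + 1)}

for k, (omission, commission) in rates.items():
    print(f"threshold {k}/{N}: omission {omission:.3f}, commission {commission:.3f}")
```

In this sketch, k = 1 corresponds to any single evaluator being able to approve and k = 5 to unanimity. Raising k lowers the commission rate and raises the omission rate at every step, so choosing a threshold amounts to choosing which error to tolerate.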

More data can reduce predictive accuracy when group-based predictions are inconsistent

Finding: A noncausal group cue can make predictions less accurate when a fallible decision-maker uses it inconsistently.
Evidence: The statistical-discrimination model compares rules that use a causal cue, a correlated but noncausal discriminatory cue, both cues, or neither. Across many modeled situations, not using the discriminatory cue produces the most accurate predictions. Even when discrimination improves prediction, the gains are usually small in the model.
Correct interpretation: Information helps only when it is integrated coherently with the prediction problem. Extra cues can add another weight for a fallible decision-maker to misapply.
Do not overclaim: This is not a general argument for less data. It is a formal result about group-based, noncausal cues under specified conditions.
Source: When ‘less is more’: How statistical discrimination can decrease predictive accuracy

The best crowd is not always the largest crowd

Finding: Adding voters can reduce idea-selection quality when later recruits are less accurate than the people already in the crowd.
Evidence: The crowd-selection model varies population size, the distribution of evaluator accuracy, recruiting ability, and majority voting. It shows that increasing crowd size can decrease performance, that near-optimal performance often requires a much smaller group than exact optimality, and that large crowds are needed mainly when everyone available has low accuracy.
Correct interpretation: Crowd size and recruitment quality are design choices. Firms should first ask whether they can identify accurate judges, then decide how many judgments to aggregate.
Do not overclaim: Crowds can still work well, especially for generating ideas or when individual accuracy is low and errors can be usefully pooled.
Source: Limits to the wisdom of the crowd in idea selection
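
Illustration: The effect of adding less accurate voters can be checked with an exact calculation. This is a minimal sketch, not the paper's model. It assumes voters are independent, each is correct with a known probability, and the crowd decides by strict majority on a single yes-or-no call. The function name and the accuracy values are hypothetical.

```python
def majority_correct(accuracies):
    """Probability that a strict majority of independent voters is correct.

    accuracies: one probability of being correct per voter.
    Use an odd number of voters; with an even number, ties count as failures.
    """
    # dist[c] = probability that exactly c of the voters seen so far are correct
    dist = [1.0]
    for p in accuracies:
        new = [0.0] * (len(dist) + 1)
        for c, prob in enumerate(dist):
            new[c] += prob * (1 - p)      # this voter is wrong
            new[c + 1] += prob * p        # this voter is correct
        dist = new
    needed = len(accuracies) // 2 + 1
    return sum(dist[needed:])


# Assumed accuracies, for illustration only.
small_expert_crowd = [0.8, 0.8, 0.8]
diluted_crowd = small_expert_crowd + [0.55, 0.55]   # add two weaker voters

small_weak_crowd = [0.55] * 3
large_weak_crowd = [0.55] * 5

print(f"3 strong voters:          {majority_correct(small_expert_crowd):.3f}")
print(f"3 strong + 2 weak voters: {majority_correct(diluted_crowd):.3f}")
print(f"3 weak voters:            {majority_correct(small_weak_crowd):.3f}")
print(f"5 weak voters:            {majority_correct(large_weak_crowd):.3f}")
```

Under these assumptions, adding two weaker voters to three strong ones lowers the chance of a correct majority, while growing a crowd in which everyone is only slightly better than chance raises it. That matches the pattern described above: size helps most when no accurate judges can be identified.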

Organization structure changes performance by changing decision errors

Finding: Organization structure changes which proposals are approved, which good opportunities are missed, and which bad opportunities slip through.
Evidence: The mutual-funds study examines more than 150,000 stock-picking decisions by 609 funds. Decentralized funds bought more stocks, made fewer omission errors, and made more commission errors than centralized funds. Increasing the consensus threshold moved decisions in the opposite direction.
Correct interpretation: Structure changes the threshold for action. It changes what the organization sees, approves, and misses; coordination cost is only part of the design problem.
Do not overclaim: The setting is mutual funds. The general mechanism travels, but the best structure depends on the error costs and decision environment in the setting at hand.
Source: Organizational structure as a determinant of performance

Search and imitation

Imitating high performers more broadly can be worse

Finding: Copying more practices from a high-performing firm is not always better; broader imitation can lower performance under some conditions.
Evidence: The imitation model has firms copy practices from the highest-performing firm after local search stalls. It varies imitation breadth, interdependence among practices, context similarity, and time horizon. When context similarity is high, broader imitation is generally useful, but not always: with high interdependence, intermediate breadth can hurt in the short run, and very broad imitation can hurt over long horizons. When context similarity is low, increasing imitation breadth is generally harmful, especially over short horizons.
Correct interpretation: The question is not whether winners should be copied. It is how much of a winner’s system to imitate, whether the winner’s context matches the imitator’s context, and whether imitation is meant to mimic a coherent system or dislodge the firm into new search.
Do not overclaim: Imitation of high performers can work. Broad imitation can help in similar, low-complexity settings, and limited imitation from dissimilar firms can help exploration over long horizons.
Source: How much to copy? Determinants of effective imitation breadth