The Berkshire Test for AI: A Compounding Diagnostic for Leaders Allocating Capital to AI

The Two Proposals on Her Desk

A CFO I worked with at a mid-sized industrial manufacturer had two AI proposals on her desk last March, and she made the wrong choice. The first was a polished slide deck from the largest consultancy in the country, pitching an LLM-based sales-assistant rollout to her 600-person revenue organization. Projected first-year value: $4.2M in productivity time savings, $1.1M in upsell capture, NPV positive in eighteen months. The deck had testimonials from a dozen Fortune 500 sales leaders, demo videos, and a bottom-up implementation timeline that came to twelve weeks. The second was a one-page memo from her plant manager. The memo argued for computer-vision quality control at a single weld-inspection station, the one that had been the production bottleneck for fourteen of the prior eighteen months. Projected first-year value: $2.8M, almost all of it from reduced rework and warranty exposure. No testimonials. No deck. A budget of $800K, mostly cameras and one data engineer.

She picked the LLM rollout. The plant memo was filed.

A year later we sat down in her office and walked through what had happened. The sales-assistant pilot had launched in two regional teams, generated three quarters of "promising early signals" presentations, and quietly stalled. Adoption among sellers had peaked at 22% in week six and drifted to 8% by month four. The productivity numbers had not materialized. The sellers who needed the assistant most were the ones least likely to log into a tool that asked them to type. The assistant's outputs needed manual review against the CRM, which added a step the sellers were already trying to skip. The $4.2M projection had become roughly $180K in surfaced quotes that probably would have closed anyway. The weld-inspection station was still the bottleneck.

The CFO is a serious operator who reads the AI literature, attends the conferences, and gets briefings from the right consultancies. What she lacked was a test: not a maturity model, which she had read; not an ROI template, which she had three copies of; but a question shaped to filter compounders from Roman Candles before the capital was committed. This essay is about that test.

The Berkshire Discipline, Transferred

In January 2026 Warren Buffett handed Berkshire Hathaway to Greg Abel. Abel's first letter to shareholders, published a few weeks later, made one argument worth pulling out: what he had inherited was a discipline, not one person's instinct, and a discipline can outlive the person who built it.

What is the discipline? Buffett spent fifty years describing it in successive annual letters, and the cleanest summary lives in the 2007 letter, where he draws the now-famous distinction between three kinds of businesses: the great ones, the good ones, and the gruesome ones. The great ones are characterized by what he called compounding economics: a sustainable competitive advantage that earns high returns on incremental capital, so that every dollar reinvested in the business produces a multiplier on the next dollar of returns. See's Candy is the canonical example. In 1972 Berkshire paid $25M for See's; over the following decades See's required almost no additional capital to sustain its earnings power, and the cash it threw off was redeployed into other Berkshire holdings at compounding rates. The arithmetic is unforgiving: a business that returns 25% on incremental capital and can reinvest it produces vastly more long-run value than one that returns 80% on initial capital but cannot reinvest at any meaningful rate. That arithmetic is the Berkshire discipline in one sentence.
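The arithmetic can be made concrete with a small sketch. The figures are illustrative assumptions, not Berkshire data: both businesses start with $100 of capital, the horizon is twenty years, the first business reinvests all of its earnings at 25%, and the second earns 80% on its initial capital every year but has nowhere to redeploy the cash.

```python
def reinvesting_business(capital: float, rate: float, years: int) -> float:
    """Cumulative earnings when each year's earnings are reinvested at `rate`."""
    total_earnings = 0.0
    for _ in range(years):
        earnings = capital * rate
        total_earnings += earnings
        capital += earnings  # earnings become incremental capital
    return total_earnings


def non_reinvesting_business(capital: float, rate: float, years: int) -> float:
    """Cumulative earnings when capital stays fixed and payouts are not redeployed."""
    return capital * rate * years


# Hypothetical inputs: $100 starting capital, 20-year horizon.
compounder = reinvesting_business(100.0, 0.25, 20)
one_shot = non_reinvesting_business(100.0, 0.80, 20)

print(f"25% on incremental capital, reinvested: {compounder:,.0f}")
print(f"80% on initial capital, not reinvested: {one_shot:,.0f}")
```

Under these assumptions the reinvesting business ends with more than five times the cumulative earnings of the other, even though it trails for roughly the first nine years. That early lag is one reason the second kind of business tends to look better in a first-year projection.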

The 2007 letter also names the failure mode I want to anchor on for this essay: what Buffett calls a Roman Candle. A Roman Candle is a business with extraordinary short-term economics that cannot be sustained because the underlying moat, the structural advantage that generates the returns, has to be continuously rebuilt against competition. "A moat that must be continuously rebuilt will eventually be no moat at all," Buffett writes in the 2007 letter. That sentence is the test in negative form: ask any investment to demonstrate that its advantage is structural and self-sustaining, not just bright and current.

The AI capital allocation problem in 2026 is structurally the same problem Buffett faced in 1972 and described in 2007. The decision in front of the CFO from the previous section was not a technology decision; it was a capital-allocation decision, and her two proposals (the polished LLM rollout and the boring weld-inspection station) represented exactly the See's-Candy-versus-Roman-Candle choice that the Berkshire discipline was designed to surface. She did not have the discipline. Most non-technology leaders evaluating AI proposals in 2026 do not have the discipline. That is the gap this essay is built to close.

The Berkshire Test for AI

The discipline reduces, for an AI investment, to two questions. Both must be answered yes for the investment to compound.

Axis A — Location. Is the AI being applied at the bottleneck of the process? This is Eliyahu Goldratt's territory. In The Goal (1984) Goldratt described an operational truth that has held in every manufacturing, logistics, and service operation since. Every system has one binding constraint (Goldratt's term; the everyday word is bottleneck) at any moment, and the system's throughput is determined by it. Optimizing non-constraint steps does not increase throughput; it only piles inventory in front of the bottleneck. Goldratt's Five Focusing Steps (identify, exploit, subordinate, elevate, repeat) are the operational discipline. The capital-allocation discipline that follows is brutal: do not invest in non-bottleneck steps, no matter how much the investment improves them in isolation.
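Goldratt's claim that improving a non-constraint step leaves throughput unchanged can be sketched in a few lines. This is a minimal illustration that assumes a strictly serial line; the step names and hourly capacities are hypothetical, not figures from the manufacturer in the opening section.

```python
def throughput(capacities: dict[str, float]) -> float:
    """Units per hour a serial line can ship: the slowest step sets the pace."""
    return min(capacities.values())


def bottleneck(capacities: dict[str, float]) -> str:
    """Name of the step with the lowest capacity."""
    return min(capacities, key=capacities.get)


# Hypothetical capacities, in units per hour.
line = {"cutting": 120.0, "welding": 90.0, "weld_inspection": 40.0, "packing": 150.0}

baseline = throughput(line)                                        # 40.0
faster_packing = throughput({**line, "packing": 300.0})            # still 40.0
faster_inspection = throughput({**line, "weld_inspection": 60.0})  # 60.0

print(bottleneck(line), baseline, faster_packing, faster_inspection)
```

Doubling packing capacity changes nothing the customer sees, while a 50% gain at the inspection step raises output for the whole line by the same 50%.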

Axis B — Compounding. At that constraint, do the four conditions hold that turn an AI improvement into a flywheel? This is Charlie Munger's territory, joined to the contemporary AI-strategy literature. Munger's USC 1994 latticework talk gives the mental model. Compounding is what happens when each unit of progress makes the next unit cheaper or more accurate to produce. Marco Iansiti and Karim Lakhani's Competing in the Age of AI (HBR, January 2020) supplies the modern operational shape: the AI factory. In an AI factory, data, labels, and predictions are produced continuously by the operation itself. Each cycle's outputs sharpen the next cycle's inputs. Compounding requires four specific conditions; the next section defines them. For now the structural point is the geometry: location is where, compounding is whether.

Both axes must hold. Either one alone produces a failure mode the field has learned to name.

Failure mode 1 — Compounding the wrong thing. A self-improving model attached to a non-binding step in the process. The model looks beautiful: data flows in, predictions get more accurate, the system gets better over time on its own metrics. The trouble is that none of the improvement reaches the constraint, so none of it shows up in throughput or P&L. The CFO from the opening vignette bought a textbook example. Her sales-assistant pilot would have compounded (sellers' usage data labeling future training cycles) but not at the location that limited her business. The bottleneck was production yield at the weld station. The model got better; the business did not.

Failure mode 2 — The one-shot win. A real gain at the bottleneck that cannot be sustained because the four compounding conditions do not all hold. This is the dynamic Buffett framed in the 2007 letter, now applied to AI: a moat that must be continuously rebuilt against commodity competition will eventually be no moat at all. The vendor's first-generation product gave you a competitive lead; the vendor's third-generation product, available to every competitor, takes the lead away. Twelve to eighteen months of advantage, then a treadmill. An initiative that fails on both axes, away from the constraint and with no flywheel, is the full Roman Candle.

Both axes together produce a 2×2 verdict matrix. Every initiative lands in exactly one quadrant.

                            AXIS A — LOCATION (Goldratt / TOC)

                          Applied AT          Applied AWAY FROM
                          the binding         the binding
                          constraint          constraint
                      ┌──────────────────┬──────────────────┐
                      │                  │                  │
                      │   COMPOUNDER     │   "Compounding   │
        Four          │                  │    the wrong     │
        conditions    │   Widens an      │    thing"        │
        HOLD          │   already-good   │                  │
                      │   moat           │   Self-improving │
AXIS B —              │                  │   model attached │
COMPOUNDING           │   (Progressive,  │   to non-binding │
(Munger /             │    Deere,        │   step. Great    │
Iansiti)              │    Mastercard,   │   demo, no P&L   │
                      │    Mayo)         │   signal.        │
                      ├──────────────────┼──────────────────┤
        Four          │                  │                  │
        conditions    │   One-shot win   │   ROMAN CANDLE   │
        DO NOT        │                  │                  │
        HOLD          │   Real gain at   │   (Buffett 2007: │
                      │   the bottleneck │    "a moat that  │
                      │   but no flywheel│    must be       │
                      │   — competitors  │    continuously  │
                      │   catch up in    │    rebuilt will  │
                      │   ~1 year        │    eventually be │
                      │                  │    no moat at    │
                      │                  │    all")         │
                      └──────────────────┴──────────────────┘

The matrix is a teaching tool, not a scoring scale. The four quadrants are categorical: an initiative is one or the other on each axis, and the combination determines whether the investment compounds, sputters, or burns out. The rest of this essay defines the four compounding conditions, walks the location axis through Goldratt's five steps, and presents four portrait companies that pass the test at scale.

The Four Conditions

For AI at the bottleneck to compound rather than sputter, four conditions must hold simultaneously. Each is drawn from a distinct cited source; together they form a single test. Pass three of four and you do not compound. The conditions are not a checklist where partial credit averages out; they follow Liebig's Law of the Minimum, where the weakest condition gates the whole flywheel.

Condition 1 — Proprietary data origin. The data feeding the model is generated by your operation, not bought, scraped, or shared with peers. James Currier of NFX names this directly in his analysis of data network effects: one of the necessary conditions for a real data network effect is "you are where the data is generated." If you buy your training data from a vendor, your competitor can buy the same training data from the same vendor next quarter; the data is, by construction, not yours. César Hidalgo, in Why Information Grows (2015), supplies the deeper reason: tacit operational knowledge is geographically and organizationally sticky in a way that internet-scraped corpora cannot replicate. The knowledge that Progressive accumulates about which driving patterns predict claims is not lying in a public dataset. It is sitting in the trajectory of Snapshot beacons across millions of policies, attached to specific outcomes only Progressive observes. The data has provenance, and the provenance is the moat.

Condition 2 — Self-labeling workflow. The work itself produces the labels. Every Progressive policy generates a claim outcome that confirms or revises the underwriting assumption. Every Mastercard transaction either does or does not later produce a chargeback that supplies the fraud label. Every Mayo Clinic patient outcome confirms or revises a prior diagnosis. This is the AI factory pattern named in the previous section, applied to one condition. The contrast case is a model that requires humans to hand-label training data outside the operational loop. That model improves only as fast as the labeling budget grows; it does not compound with usage.

Condition 3 — Decreasing marginal cost per cycle. Each iteration of the model is cheaper to produce than the last. Erik Brynjolfsson, Daniel Rock, and Chad Syverson, in The Productivity J-Curve (NBER 2018, AEJ Macroeconomics 2021), name the underlying economics. General Purpose Technologies require massive complementary intangible investment. That investment suppresses measured productivity early and amplifies it later, producing a J-curve in firm output. The compounding condition is being on the rising side of that J-curve. The marginal cost of each labeled training cycle drops because the operational infrastructure, the labeling pipeline, the feedback instrumentation, and the deployment surface have already been paid for. The first model cycle is expensive. The hundredth is nearly free. Firms still climbing the early J-curve, where the infrastructure investment is consuming all the productivity gain, are not yet compounding; they are paying tuition.
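A rough sketch of the cost curve behind this condition follows. It assumes a deliberately simple model that the cited paper does not supply: the infrastructure is paid for once, and each training cycle then adds only a fixed per-cycle cost. All dollar figures are hypothetical.

```python
def average_cost_per_cycle(fixed_cost: float, cost_per_cycle: float, cycles: int) -> float:
    """Average cost of one training cycle after `cycles` cycles.

    Assumes the infrastructure is paid for once and each cycle then adds
    only `cost_per_cycle`.
    """
    if cycles < 1:
        raise ValueError("cycles must be at least 1")
    return fixed_cost / cycles + cost_per_cycle


# Hypothetical figures: a $2M pipeline build, then $5K per self-labeled cycle,
# against $60K per cycle when humans label the data outside the workflow.
for n in (1, 10, 100):
    self_labeled = average_cost_per_cycle(2_000_000, 5_000, n)
    hand_labeled = average_cost_per_cycle(2_000_000, 60_000, n)
    print(f"cycle {n:>3}: self-labeled {self_labeled:>12,.0f}   hand-labeled {hand_labeled:>12,.0f}")
```

In this model the average cost falls toward the per-cycle figure as cycles accumulate. For the hand-labeled loop that floor is the labeling budget itself, which is why it does not compound with usage.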

Condition 4 — Defensible asymmetry. Competitors cannot catch up by buying the same vendor's product. Carlota Perez, in Technological Revolutions and Financial Capital (2002), traced the historical pattern. Every prior General Purpose Technology moved through two phases. In the installation phase, pure-tech entrants dominate and incumbents look obsolete. In the deployment phase, incumbents with embedded process knowledge win by retrofitting the new technology into their existing operations. Electricity, the internal combustion engine, and the integrated circuit all followed this arc. AI in 2026 is transitioning from installation to deployment. The deployment-phase advantage is asymmetric because embedded operational knowledge cannot be purchased. The incumbent spent decades accumulating it. The entrant must spend decades acquiring it before competing on the same surface.

The strongest skeptical case here comes from Martin Casado and Peter Lauten in The Empty Promise of Data Moats. Most claimed data moats, they argue, are real for a quarter and gone in a year. The marginal value of training data plateaus quickly, and competitors can usually reach the plateau with less data than the incumbent had. Casado is right. The four conditions above are written specifically to filter for what survives his critique. A system that satisfies all four (proprietary data origin, self-labeling work, decreasing marginal cost, defensible asymmetry) is exactly the configuration Casado's argument does not refute. The four conditions are the residue after the empty promises burn off.
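The weakest-link rule from the start of this section (pass three of four and you do not compound) can be sketched as a gate rather than an average. The identifiers and the example initiative below are hypothetical; the sketch only contrasts checklist-style partial credit with an all-or-nothing test.

```python
CONDITIONS = (
    "proprietary_data_origin",
    "self_labeling_workflow",
    "decreasing_marginal_cost",
    "defensible_asymmetry",
)


def partial_credit(holds: dict[str, bool]) -> float:
    """Checklist-style score: the share of conditions that hold."""
    return sum(holds[name] for name in CONDITIONS) / len(CONDITIONS)


def flywheel_turns(holds: dict[str, bool]) -> bool:
    """Weakest-link gate: a single missing condition stops the flywheel."""
    return all(holds[name] for name in CONDITIONS)


def missing_conditions(holds: dict[str, bool]) -> list[str]:
    """Conditions that fail, in the order the essay lists them."""
    return [name for name in CONDITIONS if not holds[name]]


# Hypothetical initiative that buys its training data from a vendor.
vendor_data_initiative = {
    "proprietary_data_origin": False,
    "self_labeling_workflow": True,
    "decreasing_marginal_cost": True,
    "defensible_asymmetry": True,
}

print(partial_credit(vendor_data_initiative))      # 0.75
print(flywheel_turns(vendor_data_initiative))      # False
print(missing_conditions(vendor_data_initiative))  # ['proprietary_data_origin']
```

A 75% score would look respectable on a maturity checklist. Under the gate it is a fail, and the list of missing conditions says where the work is.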

The Location Axis: Goldratt's Five Steps

The location axis is the Theory of Constraints applied to AI capital allocation. Goldratt's Five Focusing Steps, translated into plain language:

1. Identify the bottleneck. Where is throughput limited today? Not where the work is hardest, not where the demo is most polished; where the next unit of throughput is being held back.
2. Exploit the constraint. Apply AI here. Make the constraint produce everything it can within current capacity. This is the question of where to deploy.
3. Subordinate all non-constraint activity to the constraint. Stop deploying AI at steps that are not currently binding, even when those deployments would improve a local metric.
4. Elevate the constraint. Raise its capacity. This is where the four compounding conditions deliver their work: every cycle through the loop expands what the constraint can absorb.
5. Repeat. The constraint moves. A successful Elevate step shifts the bottleneck somewhere else; identify the new constraint and restart at Step 1.
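The way the constraint relocates after each successful Elevate step can be sketched as a loop. This is a simplified illustration, not Goldratt's own formulation: it collapses Exploit, Subordinate, and Elevate into a single capacity multiplier, and the step names, capacities, and uplift factor are hypothetical.

```python
def identify(capacities: dict[str, float]) -> str:
    """Step 1: the step with the lowest capacity is the current constraint."""
    return min(capacities, key=capacities.get)


def run_focusing_cycles(
    capacities: dict[str, float], uplift: float, cycles: int
) -> tuple[list[tuple[str, float]], dict[str, float]]:
    """Repeat identify-then-elevate, recording which step was binding each time."""
    state = dict(capacities)  # copy so the caller's data is untouched
    history = []
    for _ in range(cycles):
        constraint = identify(state)
        history.append((constraint, state[constraint]))
        state[constraint] *= uplift  # Steps 2-4 collapsed into one capacity gain
    return history, state


# Hypothetical capacities, in units per hour.
line = {"cutting": 120.0, "welding": 90.0, "weld_inspection": 40.0, "packing": 150.0}
history, final_state = run_focusing_cycles(line, uplift=2.5, cycles=3)

for constraint, capacity in history:
    print(f"binding constraint: {constraint} at {capacity:.0f}/hour")
print(f"line throughput after three cycles: {min(final_state.values()):.0f}/hour")
```

In this run the constraint moves from inspection to welding and back to inspection, and by the end cutting has become the slowest step. Money spent on inspection a fourth time would buy nothing.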

Two observations from practice land harder than the others.

The most-violated step is Step 3. In every AI portfolio I have audited in the past two years the symptom has been the same: AI everywhere, no clear bottleneck owner, no policy on what to stop. Subordinate is the discipline of saying no to AI investments that are not at the constraint. It is unpopular for the same reason all subordination is unpopular: it asks leaders to refuse a thing that would, in isolation, make a local metric look better. Munger argued repeatedly that the most important investment skill is the discipline of saying no, and at the 2023 Berkshire annual meeting he applied a sharp version of that advice to AI hype directly. The CFO from the opening vignette did not violate Step 3 because she did not understand it; she violated it because the LLM proposal had more sponsors, more presentations, and more political momentum than the weld-inspection memo. Subordinate is the action that costs political capital, and the action without which the rest of the discipline does not work.

The step where compounding actually lives is Step 5. A firm that has elevated its current constraint and produced the labeled outcomes the four conditions require has, in the same motion, generated the diagnostic capability to see the next constraint sooner than the competition does. The AI factory is a Step 5 machine: each loop's outputs sharpen the next loop's Identify step. The firm with the better Step 5 moves its next AI investment to the new constraint before the competitor has finished celebrating the last one. Compounding, on this axis, is the rate at which the constraint relocates and the firm follows it.

The Four Portraits, Previewed

Four companies operating outside the technology sector have built compounding AI loops at the bottleneck of their businesses. Each portrait below walks the full evidence base; the previews here set up the four mechanisms.

Progressive — the risk-selection flywheel. The bottleneck in personal auto insurance is adverse selection in pricing: the insurer that consistently misjudges driver risk loses money on every policy. Progressive's Snapshot telematics program, now collecting tens of billions of vehicle-miles of behavior data (2024 Annual Report), is AI applied at exactly that constraint. Each claim outcome labels the underwriting assumption. The driver-behavior corpus is attached to claim outcomes Progressive uniquely observes, and it is structurally inaccessible to a rival without two decades of Snapshot adoption. Full portrait: Progressive's Risk-Selection Flywheel.

Deere — the physical-world data loop. The bottleneck in row-crop agriculture is herbicide cost per acre at the spraying step. John Deere's See & Spray applies computer vision in real time at the boom of the sprayer (Deere investor disclosures) to distinguish crop from weed and discharge herbicide only where needed. Every pass produces labeled weed-and-crop images attached to yield and cost data. The corpus is physical-world data Deere uniquely owns; the asymmetry is the embedded equipment fleet, which competitors cannot replicate by software alone. Full portrait: Deere's Physical-World Data Loop.

Mastercard — the network of labeled outcomes. The bottleneck in card-payments fraud is the false-positive-to-false-negative balance: aggressive blocking pushes customers to competing networks, lax blocking lets fraud through. Mastercard's Decision Intelligence Pro, announced in February 2024 (press release), applies a generative-AI fraud-scoring model at network-rail speed. Every transaction either does or does not later produce a chargeback that labels the prior decision; the network's scale sustains decreasing marginal cost per training cycle. Full portrait: Mastercard's Network of Labeled Outcomes.

Mayo Clinic — the outcome-labeled corpus. The bottleneck in clinical diagnosis of hidden cardiac conditions is diagnostic latency: by the time symptoms present, the underlying pathology has been advancing silently for years. Mayo's AI-ECG program, anchored in the 2019 Lancet study on atrial fibrillation detection from routine sinus ECGs, applies a deep-learning model to standard ECG readings to surface conditions the cardiologist would not yet see clinically. Every patient's eventual diagnosis labels the prior ECG; the outcome data is structurally Mayo's, and the embedded clinical context is what makes the labels load-bearing for retraining. Full portrait: Mayo Clinic's Outcome-Labeled Corpus.

Where the Test Would Fail

The Berkshire Test filters out a category of failure. It does not guarantee success, and three limits are worth surfacing up front.

The first limit is the one Casado and Lauten named. Most claimed data moats are real for a quarter and gone in a year. Even the moats that satisfy the four conditions degrade if the operator stops investing in the labeling pipeline. Passing the test today does not exempt the operator from re-passing it next year.

The second limit is the one Luke Sernau named in the leaked Google memo "We Have No Moat, and Neither Does OpenAI". Commodity LLM capability moves faster than any incumbent's product roadmap. Capabilities that look like a competitive lead this quarter become free-tier defaults in the next vendor release. Sernau's argument applies most sharply to the capability layer. The four conditions anchor the moat in the operational layer that his argument leaves alone. A portrait that drifts from operational-layer compounding into capability-layer competition has lost its protection.

The third limit is the one Ilia Shumailov and colleagues documented in 2024: model collapse. Models trained on the output of earlier models progressively lose tail behavior and degrade across generations. The four conditions protect against this only when the self-labeling workflow stays anchored to real-world outcomes rather than to other models' predictions. A self-labeling loop where the "labels" are actually a second model's outputs is a Shumailov failure mode in slow motion.

The test is a filter. Pass it and the investment has a structural reason to compound; fail it and it almost certainly will not. Passing does not foreclose Sernau's commodity dynamics, Shumailov's collapse, or Casado's plateau. The test tells you which investments are worth defending against those failures.

What to Do Monday

Four actions follow from the test. Each is concrete enough to put on a Monday morning agenda. The first three are diagnostic; the fourth is the one that costs political capital and produces the compounding gain.

1. Write your bottleneck in one sentence. Not a list of priority areas. Not the top five themes from your operating-plan offsite. One sentence that names the single step in your operation where the next unit of throughput is being held back today. If you cannot write the sentence in under ten minutes, you do not know where your AI investments belong, and the diagnostic work is the prerequisite for any new spending decision.

2. Audit your current AI portfolio against the 2×2. For each active or proposed initiative, locate it in one of the four quadrants. How many are in the Compounder quadrant (at the bottleneck, four conditions hold)? How many are Compounding-the-Wrong-Thing (self-improving but at a non-constraint)? How many are One-Shot Wins (at the constraint but no compounding mechanism)? How many are Roman Candles (neither)? Most portfolios I have seen are concentrated in the wrong quadrants. That is not a personal failure; it is the structural consequence of the discipline never having been applied.

3. Run any new proposal through the diagnostic. Two versions are available depending on the depth you need. The five-question deterministic diagnostic at /labs/compounding-test walks the test in under a minute and produces a verdict that maps to one of the four quadrants — useful in the proposal-review meeting itself. The AI-powered narrative diagnostic at /labs/compounding-test-ai reads a full prose description of the initiative and returns a tailored writeup that names the bottleneck, scores each of the four conditions with rationale quoting the description back, and surfaces the closest-fit portrait company — useful when the proposal warrants a longer-form second opinion or when you want something to circulate with a colleague. The failure mode either version catches most often is "we deferred the binding-constraint question because the demo was so good."

4. Decide what to STOP. This is the action that costs political capital and produces the compounding gain. For every AI investment in your portfolio that is not at the bottleneck, decide whether to defund, sunset, or absorb the work into another line item. Subordinate is the discipline of saying no to investments that would, in isolation, improve a local metric. Stopping a sponsor's pet AI initiative is unpopular for the same reason all subordination is unpopular: it is visible, immediate, and feels like loss. The compounding gain is invisible, deferred, and feels like nothing for several quarters. The discipline is to do it anyway.

The fourth action is the one without which the first three do not work. Identify is easy. Exploit is satisfying. Subordinate is the action that hurts, and the action that produces every other gain.
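The audit in action 2 and the stop list in action 4 can be sketched together. The version below tallies budget rather than initiative count, which is one reasonable choice among several. The initiative names, budgets, and axis judgments are hypothetical; in practice the two booleans come from the diagnostic work, not from code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Initiative:
    name: str
    budget: float          # annual spend, in dollars
    at_bottleneck: bool    # Axis A: applied at the binding constraint
    conditions_hold: bool  # Axis B: all four compounding conditions hold


# Keys are (at_bottleneck, conditions_hold), matching the 2x2 matrix.
QUADRANTS = {
    (True, True): "Compounder",
    (False, True): "Compounding the wrong thing",
    (True, False): "One-shot win",
    (False, False): "Roman Candle",
}


def budget_by_quadrant(portfolio: list[Initiative]) -> dict[str, float]:
    """Total budget committed to each quadrant."""
    totals: dict[str, float] = defaultdict(float)
    for item in portfolio:
        totals[QUADRANTS[(item.at_bottleneck, item.conditions_hold)]] += item.budget
    return dict(totals)


def stop_candidates(portfolio: list[Initiative]) -> list[str]:
    """Action 4: everything away from the bottleneck is up for review."""
    return [item.name for item in portfolio if not item.at_bottleneck]


# Hypothetical portfolio.
portfolio = [
    Initiative("line-3 vision QC", 800_000, True, True),
    Initiative("marketing copy generator", 1_500_000, False, True),
    Initiative("invoice matching", 300_000, False, False),
    Initiative("maintenance scheduler", 400_000, True, False),
]

print(budget_by_quadrant(portfolio))
print(stop_candidates(portfolio))
```

In this made-up portfolio more than half the spend sits away from the bottleneck, which is the pattern the audit is meant to expose before the stop decision is argued on politics.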

Closing Reflection

Charlie Munger, in his 2007 USC Law School commencement address, described the people he most admired as "learning machines". They compounded their understanding deliberately, over decades, by stacking models on top of working models and pruning the ones that did not match reality. The Berkshire Test is the firm-level version of the same discipline. The companies in the four portraits that follow are learning machines at the operational scale: their AI investments are at the bottleneck, the four conditions hold, and each cycle of operation sharpens the next.

The discipline is the test. The test is a question pair, not a maturity model, because the question pair is what filters the Roman Candles before the capital is committed. The hard part is Action #4: the discipline of saying no to investments that would, in isolation, look good.

If you have run an AI portfolio through the test and the results surprised you, or if you have applied the four conditions to a company outside the four portraits and found a different mechanism, I would genuinely like to hear about it. The literature is still shallow on the operator side; the better source is the field.
