The AI Value Gap

In May 2026, Uber's COO said of the company's AI coding spend: "it's very hard to draw a line" between usage and shipped customer value - weeks after the company spent its annual AI budget in four months. That sentence, from that company, is the AI value gap in the wild. This essay defines the gap, prices it, explains why it persists, and sets out the only reliable way to close it.

Figure: bar comparison of AI cost (tracked precisely) against AI value proof (small). Cost visibility is mature; value proof is not. The space between is the value gap.

1. The gap, defined operationally

The AI Value Gap is the difference between what an organisation spends on AI and the share of that spend it can connect to a measured outcome. The measure is attribution coverage: the percentage of AI expenditure linked to a baseline, an owner, and evidence of change. The gap is its complement.
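The definition can be written as a pair of formulas. The symbols are assumptions introduced here for illustration, not notation from the essay: S_total is total AI spend in a period, S_linked is the part of it linked to a baseline, an owner, and evidence of change, C is attribution coverage, and G is the gap.

```latex
\[
C = \frac{S_{\text{linked}}}{S_{\text{total}}}, \qquad
G = 1 - C = \frac{S_{\text{total}} - S_{\text{linked}}}{S_{\text{total}}}
\]
```

Both quantities are shares of spend, so they sit between 0 and 1 and always sum to 1.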

The root cause is a two-meter asymmetry. Cost telemetry is token-level, granular and real-time. Value telemetry is manually defined, delayed and contested. Every incident in the case files reduces to this: organisations can see what they spend to the millisecond, but cannot state what they gained to the quarter.

The gap can emerge at multiple points in the conversion chain:

Unused capacity: Reserved or committed AI capacity that sits idle
Wasteful token consumption: Agent loops, retries and context bloat that produce no useful output
Rejected output: AI-generated work that fails review or is discarded
Review and rework: Human effort correcting, validating or redoing AI outputs
Recommendations not acted upon: Insights generated but never implemented
Benefits not captured: Time saved but not redeployed; capacity released but not banked
Strategic opportunity cost: Capital allocated to low-return AI rather than higher-value alternatives

Understanding Token Economics helps organisations see the cost meter clearly. But cost visibility alone does not close the gap. The organisation must also build work-unit attribution that connects tokens to tasks, tasks to accepted outputs, and outputs to measured outcomes with named owners and capture status.
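One way to picture work-unit attribution is as a record per task that carries its token cost along the chain, so that the spend surviving each stage can be summed. The sketch below is a minimal illustration under assumed names: the `WorkUnit` fields, the `conversion_waterfall` function, and all figures are hypothetical, not a schema the essay prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkUnit:
    """One AI-assisted task traced along the conversion chain (hypothetical schema)."""
    task_id: str
    token_cost: float                     # spend the cost meter assigns to this task
    output_accepted: bool                 # False for rejected or discarded output
    outcome_metric: Optional[str] = None  # measured outcome this task feeds, if any
    owner: Optional[str] = None           # named owner of that outcome
    captured: bool = False                # benefit redeployed or banked, not just claimed


def conversion_waterfall(units: list[WorkUnit]) -> dict[str, float]:
    """Sum the token cost that survives each stage of the chain."""
    accepted = [u for u in units if u.output_accepted]
    linked = [u for u in accepted if u.outcome_metric and u.owner]
    captured = [u for u in linked if u.captured]
    return {
        "spent": sum(u.token_cost for u in units),
        "accepted": sum(u.token_cost for u in accepted),
        "outcome_linked": sum(u.token_cost for u in linked),
        "captured": sum(u.token_cost for u in captured),
    }


# Illustrative figures only.
units = [
    WorkUnit("T-1", 40.0, True, "cycle time", "Head of Engineering", True),
    WorkUnit("T-2", 35.0, False),  # rejected output
    WorkUnit("T-3", 25.0, True),   # accepted, but no outcome or owner
    WorkUnit("T-4", 20.0, True, "cycle time", "Head of Engineering"),  # not yet captured
]
waterfall = conversion_waterfall(units)
print(waterfall)
```

Each drop between stages corresponds to one of the leak points listed above: rejected output between spent and accepted, missing ownership between accepted and outcome-linked, and uncaptured benefit at the end.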

This is a house position, stated plainly. Attribution at task level is not technically impossible. It is organisationally uncomfortable, and most enterprises have chosen comfort over clarity. The evidence below shows the cost of that choice.

2. The evidence that the gap is real

Tier 1: On-the-record incidents

Uber. In May 2026, COO Andrew Macdonald told the Rapid Response podcast: "it's very hard to draw a line between one of those stats and 'okay now we're actually producing like 25% more useful consumer features'... That link is not there yet." This was weeks after the company spent its entire 2026 AI budget in four months. Uber is not an AI laggard. It has run machine learning in production for a decade. If the AI coding budget can outrun governance at Uber, the default assumption for every other large organisation should be that it is happening to them too, less visibly. Full case file analysis.

UK government Copilot evaluations. The Department for Business and Trade ran controlled trials of Microsoft 365 Copilot across multiple departments and published the findings in full. The conclusion: "no robust evidence that time savings are leading to improved productivity". Users reported time saved. Measured output did not change. The gap between self-reported benefit and measured outcome is the value gap at task level, and the UK trials are the exception that proves the rule: they ran baselines and control groups, which is why they could state the finding. Most organisations skip the baseline and accept the self-report.

Klarna's partial reversal. Klarna announced in early 2024 that its AI customer service assistant was doing the work of 700 full-time agents. By May 2025, the company's CEO had acknowledged that the cost focus reduced service quality and that Klarna was recruiting human agents again, particularly for complex cases. The initial announcement measured deflection. The reversal measured resolution quality and customer satisfaction. Benefits and costs landed in different ledgers, and the gap appeared when the ledgers were reconciled.

Tier 2: Surveys with sample sizes stated

Bain, April 2026. Survey of 951 firms with revenue above $100 million. Bain's own report: 40% saw cost improvements of 10% or less; 4% achieved over 30%; 44% are funding the next AI wave from savings that have not yet materialised. The conclusion, verbatim: "The value didn't arrive." The 44% figure is particularly revealing: next year's budget justified by this year's unproven savings. The gap does not stay still; it leverages.

KPMG scaling-versus-measurement gap. KPMG's 2025 research found organisations scaling AI faster than they are building the measurement systems to prove it works. The gap between deployment velocity and evidence velocity is the governance failure in survey form.

MIT NANDA 95%. MIT's 2025 reporting on generative AI pilots put the failure rate at around 95%. The figure should be read with its method caveats: it includes pilots that were never intended to reach production, and "failure" is defined broadly. Even with those caveats, the directional finding is consistent with the other sources cited here: most AI initiatives do not produce measurable returns.

Tier 3: The market's own behaviour

The Wall Street Journal reported in May 2026: "Corporate America Is Starting to Ration AI as Cost Skyrockets." A senior bank technology executive, quoted: "The free-money period for AI is definitively over."

Rationing is what organisations do when they cannot attribute. If the value were clear, the response to rising cost would be reallocation toward the highest-return uses, not rationing. The fact that rationing is the dominant response tells you the gap is real and wide.

3. Why the gap persists

Five mechanisms explain why the gap opens and why it is hard to close. Each gets one case-file anchor.

1. Budgets are licence-shaped; spend is consumption-shaped

Uber budgeted AI coding tools as if they were licences: size it annually, then push adoption. But the spend was metered and unbounded. A fixed budget and an instruction to maximise a metered variable is a fuse, not a governance model. Nobody would budget a fixed annual fuel cost and then run a leaderboard for miles driven, but that is the structure in most enterprise AI programmes today. Understanding Token Economics is essential for managing this metered consumption model.
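The mismatch between a fixed budget and metered, growing spend can be made concrete with a small runway calculation. This is a sketch under stated assumptions: the function name, the constant monthly growth rate, and every figure are hypothetical and are not Uber's actual numbers.

```python
from typing import Optional


def month_budget_is_exceeded(
    annual_budget: float,
    first_month_spend: float,
    monthly_growth: float,
    horizon: int = 12,
) -> Optional[int]:
    """Return the 1-based month in which cumulative spend first exceeds the budget.

    Returns None if the budget lasts the whole horizon. Assumes spend grows by a
    constant factor each month, which is a simplification.
    """
    spent = 0.0
    monthly = first_month_spend
    for month in range(1, horizon + 1):
        spent += monthly
        if spent > annual_budget:
            return month
        monthly *= monthly_growth
    return None


# Licence-shaped plan: flat spend of 1 unit a month against a budget of 12.
flat = month_budget_is_exceeded(12.0, 1.0, 1.0)

# Consumption-shaped reality: adoption pushes metered spend up 35% a month.
metered = month_budget_is_exceeded(12.0, 2.0, 1.35)

print(flat, metered)
```

In the flat case the budget lasts the year; in the metered case the same budget is exceeded in month four, which is the shape of the problem the paragraph above describes.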

2. No baseline before rollout, so no counterfactual after

The UK government trials are the exception. Most organisations roll out AI tools without capturing pre-rollout baselines for cycle time, throughput, or quality. Without a baseline, "maybe implicitly there's more that is getting shipped" - Uber COO Andrew Macdonald's actual words - is the strongest claim available. A company that measures everything about a ride could not measure this, because measurement was never designed in.
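What a baseline buys can be shown with a difference-in-differences comparison, a standard technique named here as an illustration; the essay does not say which method the UK trials used. The function and all figures below are hypothetical.

```python
from statistics import mean


def difference_in_differences(
    treated_before: list[float],
    treated_after: list[float],
    control_before: list[float],
    control_after: list[float],
) -> float:
    """Change in the AI group minus change in the control group.

    The control group's change stands in for the counterfactual: what would
    have happened to the AI group without the tool.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical weekly tickets resolved per FTE.
effect = difference_in_differences(
    treated_before=[20, 22, 21],
    treated_after=[24, 25, 26],
    control_before=[20, 21, 22],
    control_after=[23, 24, 22],
)
print(effect)
```

Without the pre-rollout figures, neither change can be computed and only the after-values remain. Without the control group, the AI group's whole improvement would be credited to the tool, even though part of it happened everywhere.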

3. Benefits and costs land in different ledgers

Time saved accrues to individuals. Licence costs land in IT. Rework costs land in operations. Klarna's initial claim measured deflection (a customer service metric). The reversal measured resolution quality and satisfaction (different metrics, different owners). When benefits and costs are in different ledgers, the gap can persist for quarters before anyone reconciles them.

4. Incentives attach to usage, which destroys the data

Uber ran internal leaderboards ranking AI coding tool usage. Once usage is ranked, usage data stops telling you about productivity. Some unknown share of consumption becomes performative. The data you need to prove value is the data the incentive scheme destroyed. This pattern is reported across multiple organisations but requires validation in each case.

5. A growing share of AI cost is invisible by design

Microsoft bundles Copilot into Microsoft 365 renewals and raised consumer pricing by over 30% in some markets. Cursor repriced its coding assistant in June 2025, moving from flat to usage-based. AI cost is increasingly embedded in SaaS renewals and repriced mid-contract. The invoice is visible. The AI-attributable share is not, unless the buyer demands it in writing. This opacity makes AI TCO harder to calculate and value attribution nearly impossible.

4. Pricing the gap

A worked illustration, labelled hypothetical, in the currency a board uses.

A 10,000-person organisation. £6 million annual AI spend across copilots, coding tools, and embedded tiers. 20% attribution coverage - meaning 20% of spend is connected to a measured outcome with a named owner and a baseline. That leaves £4.8 million of spend that is, in audit terms, unevidenced.

Not necessarily wasted. Unevidenced. The board question is whether any other £4.8 million programme would survive that status.
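The arithmetic behind the illustration is short enough to sketch. The function name is an assumption; the £6 million and 20% figures are the hypothetical ones from the worked illustration, and the other coverage levels are added only to show how the unevidenced amount moves.

```python
def unevidenced_spend(total_spend: float, coverage: float) -> float:
    """Spend not connected to a measured outcome, given attribution coverage (0 to 1)."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return total_spend * (1.0 - coverage)


TOTAL_AI_SPEND = 6_000_000  # pounds per year, hypothetical

for coverage in (0.20, 0.50, 0.60):
    gap = unevidenced_spend(TOTAL_AI_SPEND, coverage)
    print(f"{coverage:.0%} coverage -> £{gap:,.0f} unevidenced")
```

At 20% coverage this reproduces the £4.8 million figure above.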

The compounding version

44% of firms are funding next year's AI from this year's unproven savings

Bain's April 2026 survey. The gap does not stay still. It leverages. Next year's budget justified by this year's unproven savings is the gap compounding.

This is a dated, falsifiable house position. Industry analysts forecast inference costs falling significantly by 2030, with some projections suggesting declines up to 90%. But agentic systems multiply tokens per task, and providers do not fully pass price declines through. Cheaper tokens, bigger invoices. This is Jevons paradox in the wild: efficiency improvements increase total consumption. The site can be held to this position. If total AI invoices fall in 2027 or 2028, this position was wrong.

5. Why "wait for prices to fall" is not a strategy

Industry analysts forecast inference costs falling significantly by 2030, with some projections suggesting declines up to 90%. That sounds like a reason to wait. It is not.

Agentic systems multiply tokens per task. An agent iterating on a codebase resends growing context with every step. The marginal task is more expensive than the average task. Market forecasts also project agent software spend rising substantially from 2025 to 2026, potentially more than doubling.
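The effect of resending a growing context can be sketched with a toy model. The assumptions are loud ones: the agent resends its full context on every step, each step adds a fixed number of tokens, and no prompt caching or context trimming applies. The function name and figures are hypothetical.

```python
def input_tokens_per_step(base_context: int, added_per_step: int, steps: int) -> list[int]:
    """Input tokens sent at each step when the whole context is resent every time."""
    return [base_context + added_per_step * step for step in range(steps)]


per_step = input_tokens_per_step(base_context=2_000, added_per_step=1_000, steps=10)

total_tokens = sum(per_step)
average_step = total_tokens / len(per_step)
last_step = per_step[-1]

print(total_tokens, average_step, last_step)
```

Under these assumptions the total grows roughly with the square of the number of steps, and the last step costs well above the average step. The model works at step level, but the same shape is what makes the marginal task more expensive than the average one.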

Per-token prices fall. Total invoices rise. Anyone who ran cloud budgets through the 2010s has seen this movie. Waiting for prices to fall is not a strategy. It is a way to arrive at the same governance failure with a larger invoice.
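The claim has a simple break-even form. The notation is assumed for illustration: I is the invoice, p the price per token, T the token volume, d the fractional price decline, and m the multiple by which token volume grows.

```latex
\[
I = p\,T, \qquad
\frac{I'}{I} = \frac{(1 - d)\,p \cdot m\,T}{p\,T} = (1 - d)\,m
\]
\[
I' > I \iff m > \frac{1}{1 - d}, \qquad d = 0.9 \implies m > 10
\]
```

Even at the top of the forecast range, a 90% price decline, the invoice still rises once token volume grows more than tenfold. Smaller declines need far less growth: a 50% decline is offset by a doubling.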

6. Closing the gap: attribution coverage as an operating metric

Attribution coverage is the share of AI spend connected to a measured outcome. It is the headline metric the whole discipline improves over time. Here is how to compute it this quarter.

How to compute attribution coverage

Inventory all AI spend. Direct consumption (API and token spend), licensed seats (copilots, coding assistants), embedded tiers (AI components inside SaaS contracts), infrastructure (GPU, hosting, model platform costs).
Classify each line. Outcome-linked (a measured result exists, with a baseline and an owner), activity-linked (usage data only), unattributed (neither).
Publish the ratio. Outcome-linked spend divided by total AI spend. That is attribution coverage, version one.
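The three steps above can be sketched as a register of spend lines, a classification rule, and a ratio. This is a minimal sketch, not a prescribed implementation: the `SpendLine` fields, the function names, and the register entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpendLine:
    """One row of the AI spend inventory (hypothetical schema)."""
    name: str
    category: str             # direct, seats, embedded, or infrastructure
    annual_cost: float
    has_usage_data: bool
    has_measured_result: bool
    has_baseline: bool
    has_owner: bool


def classify(line: SpendLine) -> str:
    """Outcome-linked needs a measured result, a baseline, and an owner together."""
    if line.has_measured_result and line.has_baseline and line.has_owner:
        return "outcome-linked"
    if line.has_usage_data:
        return "activity-linked"
    return "unattributed"


def attribution_coverage(register: list[SpendLine]) -> float:
    """Outcome-linked spend divided by total AI spend."""
    total = sum(line.annual_cost for line in register)
    if total <= 0:
        raise ValueError("register has no spend")
    linked = sum(
        line.annual_cost for line in register if classify(line) == "outcome-linked"
    )
    return linked / total


# Illustrative register only.
register = [
    SpendLine("Support assistant API", "direct", 1_200_000, True, True, True, True),
    SpendLine("Coding assistant seats", "seats", 2_400_000, True, False, False, False),
    SpendLine("CRM AI tier", "embedded", 1_500_000, False, False, False, False),
    SpendLine("GPU hosting", "infrastructure", 900_000, True, False, False, True),
]
coverage = attribution_coverage(register)
print(f"Attribution coverage v1: {coverage:.0%}")
```

The rule is deliberately strict: a line with an owner but no baseline, like the last entry, still counts as activity-linked. Loosening it would flatter the ratio without adding evidence.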

Version one is allowed to be embarrassing. Most organisations will start at 10-20%. A credible trajectory is 20% to 50-60% inside a year for a mid-size estate. The trajectory matters more than the starting point.

Who owns it

One named executive, not a committee. The owner needs authority to ask awkward questions across budget lines and to make stop, redesign, or scale decisions mid-year. Without a named owner, attribution coverage becomes a dashboard metric rather than a governance lever.

The four value dimensions

Each dimension has a metric that evidences it and a trap that fakes it.

1. Productivity. Metric: redeployed capacity (FTE hours reassigned to other work, measured). Trap: self-reported time saved (a claim, not evidence).

2. Cost avoided. Metric: baseline cost per unit of work, measured before and after, with adoption and quality held constant. Trap: deflection without resolution (Klarna's initial claim).

3. Revenue. Metric: incremental revenue isolated from other commercial changes, with a control group or time-series analysis. Trap: revenue growth during an AI rollout, attributed to AI without isolating its contribution.

4. Risk. Metric: incidents avoided, with a baseline rate and a mechanism that explains how AI reduced it. Trap: risk reduction asserted without showing what was avoided or how.

7. What to do in the next 30 days

A compressed pointer to the playbook. Five steps, each with a deliverable.

Name an owner. One person accountable for AI economics across the estate, with a mandate letter from the CFO or CIO.
Build the spend inventory. Every AI cost, in one register, in four categories: direct consumption, licensed seats, embedded tiers, infrastructure.
Capture baselines for the top three use cases by spend. Whatever pre-AI history exists - cycle times, tickets resolved per FTE, drafting turnaround - capture now, before it ages out of systems.
Define one unit cost per major use case. Cost per accepted code change; cost per resolved ticket; cost per document processed, fully loaded with review time.
Hold the first value review and publish the pack. One hour, CFO or delegate in the room. Contents: total AI spend and trend; adoption versus seats; attribution coverage v1; top-three unit costs; three decisions.
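The unit cost step is the one most often computed too generously, so a short sketch may help. It assumes a fully loaded unit cost is AI cost plus human review cost, divided by accepted units rather than generated ones; the function name and figures are hypothetical.

```python
def cost_per_accepted_unit(
    ai_cost: float,
    review_hours: float,
    hourly_rate: float,
    accepted_units: int,
) -> float:
    """Fully loaded cost per accepted unit of work (for example, per accepted code change)."""
    if accepted_units <= 0:
        raise ValueError("unit cost is undefined with no accepted units")
    return (ai_cost + review_hours * hourly_rate) / accepted_units


# Hypothetical quarter for a coding assistant.
ai_cost = 30_000.0        # tokens and seats
generated_changes = 2_000
accepted_changes = 1_200

naive = ai_cost / generated_changes
loaded = cost_per_accepted_unit(ai_cost, review_hours=400, hourly_rate=75.0,
                                accepted_units=accepted_changes)
print(naive, loaded)
```

In this example the fully loaded figure is more than three times the naive one. The difference is review time and rejected output, which are the costs that usually land in someone else's ledger.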

Full playbook: The First 30 Days of AI Value Management.

8. Where we might be wrong

Three ways this analysis might be wrong

Attribution at task level really is harder than cloud allocation ever was

Maybe. Cloud allocation had cost pools, shared services, and indirect demand. AI attribution has self-reported time savings and workflow changes that are genuinely harder to instrument. But the UK government trials show it is possible when the organisation decides it matters. The question is not whether it is hard. The question is whether it is harder than the alternative, which is governing a growing cost base with no evidence.

Self-reported time savings deserve more credit as leading indicators

Possibly. If time saved is real, it should eventually show up in redeployed capacity, faster cycle times, or reduced headcount. Self-reports could be leading indicators of those lagging measures. But the UK trials found self-reported time savings with no measured productivity change. Until the lagging measures move, the self-report is a hypothesis, not evidence.

Heavy governance this early kills the experimentation that produces the eventual return

This is the strongest objection. Governance overhead can kill velocity, and velocity matters in a fast-moving technology domain. The counter is that the governance proposed here is not approval gates or compliance theatre. It is baselines, unit costs, and named owners. Those disciplines make experimentation faster to evaluate, not slower to run. But if attribution coverage becomes a bureaucratic exercise rather than a decision tool, this objection will be proven right.

The analysis presented here rests on three core assumptions that deserve scrutiny. First, that attribution at task level is both technically feasible and economically justified for most enterprise AI deployments. Second, that the gap between cost visibility and value evidence represents a governance failure rather than a temporary measurement lag that will resolve as AI matures. Third, that the disciplines proposed (baselines, unit costs, named owners) can scale across diverse AI use cases without creating prohibitive overhead.

Each assumption could be wrong in specific contexts. Small-scale experimentation may genuinely benefit from lighter governance until patterns emerge. Highly exploratory AI research initiatives may resist standardised measurement frameworks. And some AI capabilities, particularly those that enhance decision quality rather than automate tasks, may require fundamentally different value proof approaches than the productivity-focused methods emphasised here.

The strongest challenge to this framework comes from organisations that have achieved genuine AI value without formal attribution systems. If such cases exist at scale and can be documented, they would suggest that the Value Gap analysis overweights measurement discipline relative to other success factors such as executive sponsorship, technical capability, or organisational readiness. The framework should be tested against those counterexamples and refined accordingly.

9. The call to action

The next phase of enterprise AI will be defined less by model access and more by management discipline. Organisations that close the AI Value Gap earliest will not simply reduce waste. They will create a more credible basis for sustained investment, faster reallocation, and stronger trust between finance, technology, and business leaders.

The four questions remain straightforward:

What are we really spending?
Who owns the economic result?
What proof standard governs the next tranche of scale?
Which initiatives deserve more capital, less capital, or redesign?

Organisations that can answer those questions consistently are not just measuring AI better. They are governing it better.
