The Metrics Framework

The reference set for AI economics

Attribution coverage, cost per successful outcome, four value dimensions, honest maturity tags. This is the reference set: what to measure, why it matters, where it fails, and what maturity level you need to use it.

Measurement framework

Start with the distinction ladder

Before you choose metrics, understand where they sit on the distinction ladder. Every metric belongs on one of five rungs: activity, adoption, productivity, value, or strategic impact. The higher the rung, the harder the metric is to game, and the more it matters.

This framework focuses on rungs 3-5: productivity, value, and strategic impact. Activity and adoption metrics are necessary for operations, but they do not prove value. If you measure only activity and adoption, you are measuring usage, not value.

The lead metric: attribution coverage

If you can only track one metric across your AI portfolio, track attribution coverage. It is the percentage of your total AI cost that can be linked to a measured business outcome.

Attribution coverage (%) = (Cost attributed to measured outcomes / Total AI cost) × 100

The unattributed remainder is the value gap. One number to lead with, one gap to close.

Attribution coverage is a lead metric because it predicts value realisation. If you cannot attribute cost to outcome, you cannot prove value. If you cannot prove value, you cannot defend the spend. Coverage is the early warning.
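
As an illustration, the coverage ratio and its value gap can be computed in a few lines of Python. This is a minimal sketch, not a reference implementation: the `Initiative` record, its field names and the example figures are hypothetical, and it assumes each initiative's cost is either fully attributed or not attributed at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Initiative:
    name: str
    cost: float                       # total AI cost for the period (hypothetical figures)
    measured_outcome: Optional[str]   # None when no measured outcome is linked


def attribution_coverage(initiatives: list[Initiative]) -> float:
    """Share of total AI cost linked to a measured business outcome (0.0 to 1.0)."""
    total = sum(i.cost for i in initiatives)
    if total == 0:
        raise ValueError("total AI cost is zero; coverage is undefined")
    attributed = sum(i.cost for i in initiatives if i.measured_outcome is not None)
    return attributed / total


portfolio = [
    Initiative("support-assistant", 400_000.0, "tickets resolved without escalation"),
    Initiative("code-completion", 250_000.0, None),
    Initiative("claims-triage", 350_000.0, "claims processed within SLA"),
]

coverage = attribution_coverage(portfolio)
value_gap = 1.0 - coverage  # the unattributed remainder
print(f"Attribution coverage: {coverage:.0%}, value gap: {value_gap:.0%}")
```

In practice cost is often only partly attributable within an initiative, so a real calculation would probably carry an attributed amount per initiative rather than an all-or-nothing flag.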

Refined attribution coverage requirements

An attributed outcome must include a minimum set of metadata, so that weak attribution cannot inflate the score:

  • Named owner
  • Baseline and period
  • Outcome event
  • Evidence source
  • Cost boundary
  • Attribution method
  • Confidence rating
  • Capture status

Maturity: Emerging
Requires: cost visibility by initiative, outcome measurement for major use cases, attribution logic to link cost to outcome.
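
One way to enforce the minimum metadata is to exclude any record with a blank field from the attributed cost. The sketch below assumes outcome records are plain dictionaries; the key names are hypothetical spellings of the eight items listed above, and the example values are invented.

```python
# Hypothetical key names for the eight required metadata items.
REQUIRED_FIELDS = (
    "named_owner",
    "baseline_and_period",
    "outcome_event",
    "evidence_source",
    "cost_boundary",
    "attribution_method",
    "confidence_rating",
    "capture_status",
)


def missing_metadata(record: dict) -> list[str]:
    """Return the required metadata fields that are absent or blank."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f) or "").strip()]


def qualifying_cost(records: list[dict]) -> float:
    """Sum cost only for records that carry every required field."""
    return sum(r["cost"] for r in records if not missing_metadata(r))


complete = {
    "cost": 100.0,
    "named_owner": "Head of Claims",
    "baseline_and_period": "Q1 baseline, measured over Q2",
    "outcome_event": "claim settled within SLA",
    "evidence_source": "claims system export",
    "cost_boundary": "inference, platform and review time",
    "attribution_method": "matched control group",
    "confidence_rating": "medium",
    "capture_status": "attributed",
}
# Same record, but with no owner and no confidence rating.
incomplete = {**complete, "cost": 50.0, "named_owner": "", "confidence_rating": None}

print(missing_metadata(incomplete))
print(qualifying_cost([complete, incomplete]))
```

Only the complete record counts towards attributed cost here; the incomplete one stays in the value gap until its metadata is filled in.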

Capture status: from potential to sustained

Not all claimed value is equal. Every value item should be tagged with its capture status to distinguish modelled opportunities from realised benefits.

Potential

A modelled opportunity. No operating change yet observed.

Observed

An operating measure changed. Correlation exists, but causation is not yet established.

Attributed

Evidence supports an AI contribution. Attribution method is documented.

Captured

A financial, capacity, service or risk benefit was acted upon. Value is banked or redeployed.

Sustained

Benefit persists through a defined period. Durability is demonstrated.

Interpretation

Most AI value claims stop at “potential” or “observed”. Few reach “captured”. Almost none track “sustained”. This progression explains why portfolio-level ROI remains elusive despite impressive pilot results.
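
The capture statuses form an ordered scale, so they can be modelled as an ordered enumeration and used to report how much claimed value has reached a given rung. This is a sketch under assumptions: the `CaptureStatus` type, the helper names and the example values are hypothetical.

```python
from collections import defaultdict
from enum import IntEnum


class CaptureStatus(IntEnum):
    """Ordered so that later statuses compare greater than earlier ones."""
    POTENTIAL = 1
    OBSERVED = 2
    ATTRIBUTED = 3
    CAPTURED = 4
    SUSTAINED = 5


def value_by_status(items: list[tuple[CaptureStatus, float]]) -> dict[CaptureStatus, float]:
    """Total claimed value per capture status."""
    totals: dict[CaptureStatus, float] = defaultdict(float)
    for status, value in items:
        totals[status] += value
    return dict(totals)


def share_at_or_beyond(items: list[tuple[CaptureStatus, float]],
                       threshold: CaptureStatus) -> float:
    """Fraction of claimed value that has reached `threshold` or a later status."""
    total = sum(value for _, value in items)
    if total == 0:
        return 0.0
    return sum(value for status, value in items if status >= threshold) / total


# Hypothetical value items: (status, claimed value).
claims = [
    (CaptureStatus.POTENTIAL, 500.0),
    (CaptureStatus.OBSERVED, 300.0),
    (CaptureStatus.CAPTURED, 150.0),
    (CaptureStatus.SUSTAINED, 50.0),
]

captured_share = share_at_or_beyond(claims, CaptureStatus.CAPTURED)
print(f"Captured or sustained: {captured_share:.0%} of claimed value")
```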

The comprehensive metric set

The full reference set contains 42 metrics across 12 categories. Each metric includes: definition, what it is good for, where it fails, and what maturity level you need to use it. Below are the key metrics from each category.

Portfolio Coverage

Attribution coverage

Definition: Percentage of total AI cost linked to measured business outcomes

Good for: Portfolio-level value proof, board reporting

Failure mode: Can be gamed by attributing cost to weak outcomes

Maturity: Emerging

Initiative coverage

Definition: Percentage of AI initiatives with defined success metrics

Good for: Governance maturity, ensuring clear goals

Failure mode: Metrics can be defined but not measured

Maturity: Foundational

Cost Efficiency

Cost per successful outcome

Definition: Total AI cost divided by number of successful outcomes delivered

Good for: Unit economics, comparing efficiency across use cases

Failure mode: Requires clear definition of ‘successful outcome’

Maturity: Emerging
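
The failure mode becomes concrete in code: the metric only exists once "successful outcome" is written down as an explicit rule. In this sketch the success rule (a ticket that was resolved and not reopened), the record fields and the cost figure are all hypothetical.

```python
from typing import Callable, Optional


def cost_per_successful_outcome(
    total_cost: float,
    outcomes: list[dict],
    is_success: Callable[[dict], bool],
) -> Optional[float]:
    """Total AI cost divided by the number of outcomes that pass the success rule.

    Returns None when there are no successes, because the ratio is undefined.
    """
    successes = sum(1 for outcome in outcomes if is_success(outcome))
    if successes == 0:
        return None
    return total_cost / successes


def resolved_and_stayed_closed(ticket: dict) -> bool:
    """Hypothetical success rule: resolved, and not reopened afterwards."""
    return ticket["resolved"] and not ticket["reopened"]


tickets = [
    {"resolved": True, "reopened": False},
    {"resolved": True, "reopened": True},    # resolved, but came back
    {"resolved": False, "reopened": False},
    {"resolved": True, "reopened": False},
    {"resolved": True, "reopened": False},
]

per_success = cost_per_successful_outcome(1200.0, tickets, resolved_and_stayed_closed)
per_attempt = 1200.0 / len(tickets)
print(per_success, per_attempt)
```

With these figures the cost per attempt is 240 while the cost per successful outcome is 400. The difference is the cost of failed and reopened work, which a per-attempt figure hides.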

Fully loaded cost per use case

Definition: Total cost including infrastructure, operations, governance, not just inference

Good for: True TCO visibility, portfolio prioritisation

Failure mode: Requires cost allocation methodology

Maturity: Systematic

Productivity Impact

Realised productivity gain

Definition: Measured output increase or cost reduction, not self-reported time savings

Good for: Honest productivity measurement, avoiding self-report bias

Failure mode: Requires baseline data and control groups

Maturity: Systematic

Value Realisation

Multi-dimensional value

Definition: Value measured across four dimensions: revenue growth, cost reduction, quality improvement, risk mitigation

Good for: Capturing full value, avoiding single-dimension optimisation

Failure mode: Requires weighting across dimensions

Maturity: Systematic

ROI and Payback

Simple ROI

Definition: (Value delivered - Cost) / Cost, expressed as a percentage

Good for: Quick business case assessment

Failure mode: Ignores time value of money and strategic value

Maturity: Foundational
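
A short worked example shows both the formula and the time-value failure mode. The discounted variant is not part of the metric definition above; it is a standard present-value adjustment added for contrast, assuming the cost is paid upfront and value arrives at the end of each year. All figures are hypothetical.

```python
def simple_roi(value_delivered: float, cost: float) -> float:
    """(Value delivered - Cost) / Cost, expressed as a percentage."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (value_delivered - cost) / cost * 100.0


def discounted_roi(yearly_values: list[float], cost: float, discount_rate: float) -> float:
    """Simple ROI computed on the present value of year-end value flows.

    Assumes cost is incurred upfront (year 0) and is therefore not discounted.
    """
    present_value = sum(
        value / (1.0 + discount_rate) ** year
        for year, value in enumerate(yearly_values, start=1)
    )
    return simple_roi(present_value, cost)


yearly_values = [50.0, 50.0, 50.0]   # value delivered at the end of years 1 to 3
cost = 100.0

undiscounted = simple_roi(sum(yearly_values), cost)
discounted = discounted_roi(yearly_values, cost, discount_rate=0.10)
print(f"Simple ROI: {undiscounted:.1f}%, discounted at 10%: {discounted:.1f}%")
```

The same cash flows give 50% on the simple formula and roughly 24% once discounted at 10%, which is why Simple ROI suits quick screening rather than multi-year comparisons.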

Risk and Compliance

Model risk coverage

Definition: Percentage of production models with completed risk assessments

Good for: Governance maturity, regulatory readiness

Failure mode: Assessment completion does not equal risk mitigation

Maturity: Emerging

Note: The full reference set includes 42 metrics across 12 categories. See the complete AI Economics KPIs library for the full set with filters by role, maturity level, and governance domain.

Economic yield families

Economic yield measures compare worthwhile outcomes with fully loaded cost, risk and operational burden. There is no single universal yield metric. Use the family that matches your value dimension.

Workflow yield

Measured operating benefit / full workflow cost

Use for: Claims processing, support resolution, document review, software delivery, forecasting

Revenue yield

Incremental contribution margin / full AI cost

Use when AI changes: Conversion, retention, pricing, product adoption, sales capacity

Risk yield

Expected loss avoided / full AI cost

Use for: Fraud, cyber, compliance, safety, operational resilience

Capacity yield

Productive capacity redeployed / full AI cost

Note: Stricter than hours saved. Time has value only when it changes output, cost, service or strategic capacity.

Quality-adjusted yield

(Successful outcomes × quality weight) / full cost

Use when: Output quality varies significantly and affects downstream value

Warning: Do not compare unlike outcomes using one league table. A medical decision, code completion and fraud alert should not be ranked by a single yield score.
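
As a sketch of the quality-adjusted family, the function below reads "successful outcomes × quality weight" as the sum of per-outcome quality weights over successful outcomes only. That reading, the weight scale (0 to 1) and the example figures are assumptions, not part of the definition above.

```python
def quality_adjusted_yield(outcomes: list[tuple[bool, float]], full_cost: float) -> float:
    """Sum of quality weights for successful outcomes, divided by full cost.

    Each outcome is (successful, quality_weight); failed outcomes contribute nothing.
    """
    if full_cost <= 0:
        raise ValueError("full cost must be positive")
    weighted_successes = sum(weight for successful, weight in outcomes if successful)
    return weighted_successes / full_cost


# Hypothetical document-review outcomes: (successful, quality weight between 0 and 1).
reviews = [
    (True, 1.0),    # accepted without changes
    (True, 0.5),    # accepted after heavy editing
    (False, 1.0),   # rejected; the weight is ignored
    (True, 0.75),   # accepted after light editing
]

qa_yield = quality_adjusted_yield(reviews, full_cost=9.0)
unweighted_yield = sum(1 for ok, _ in reviews if ok) / 9.0
print(qa_yield, unweighted_yield)
```

The result is only comparable with other document-review figures computed on the same weights, in line with the warning above about league tables.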

Behavioural metrics

Behavioural outcomes determine whether AI outputs become useful work. These metrics track trust, adoption quality, review burden and capability effects.

Appropriate-use rate

Share of AI use occurring in tasks where evidence shows net benefit

Trust calibration gap

Difference between user confidence and actual performance

Review debt

Backlog or superficial approval that accumulates when outputs are generated faster than they can be validated

Rework displacement

Whether AI removes rework or moves it downstream

Learning transfer

Whether users improve independent task performance over time

Shadow-use exposure

Estimated use outside approved systems or evidence standards

For a complete behavioural measurement framework, see The Behavioural P&L of AI.
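
Of the behavioural metrics above, the trust calibration gap is the most directly computable. A minimal sketch, assuming users give a confidence score between 0 and 1 for each AI output and that each output is later checked for correctness; the data is hypothetical.

```python
def trust_calibration_gap(confidences: list[float], correct: list[bool]) -> float:
    """Mean user confidence minus observed accuracy.

    Positive values suggest over-trust; negative values suggest under-trust.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need one confidence score per checked output")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# Hypothetical sample: users were highly confident, but half the outputs were wrong.
confidences = [0.9, 0.8, 0.9, 1.0]
correct = [True, False, True, False]

gap = trust_calibration_gap(confidences, correct)
print(f"Trust calibration gap: {gap:+.2f}")
```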

Agentic demand metrics

Agentic systems create machine-generated demand that is not limited by headcount or working hours. These metrics track autonomous consumption and value per objective.

Autonomous calls per human request

Model calls generated by agents per initial user objective

Cost per completed objective

Full cost including retries, tool calls and verification per successful agent objective

Retry rate

Frequency of agent retries and correction loops

Idle loop rate

Agent calls that produce no useful outcome or forward progress

Value at risk from runaway demand

Potential cost exposure if agent demand expands without governance

Interpretation

Agentic demand can scale faster than value. Track cost per completed objective, not cost per call, to avoid rewarding activity over outcomes.
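
The agentic metrics can be derived from one record per user objective. In this sketch the `ObjectiveRun` fields are hypothetical, and it assumes retries and idle calls are already tagged in the agent logs, which many systems will not do by default.

```python
from dataclasses import dataclass


@dataclass
class ObjectiveRun:
    calls: int          # all model calls made for this objective
    retries: int        # calls that were retries or correction loops
    idle_calls: int     # calls that produced no forward progress
    cost: float         # full cost: calls, tool use and verification
    completed: bool     # whether the objective was successfully completed


def agentic_metrics(runs: list[ObjectiveRun]) -> dict[str, float]:
    total_calls = sum(r.calls for r in runs)
    total_cost = sum(r.cost for r in runs)
    completed = sum(1 for r in runs if r.completed)
    if not runs or total_calls == 0 or completed == 0:
        raise ValueError("need at least one call and one completed objective")
    return {
        "calls_per_request": total_calls / len(runs),
        # Failed objectives still cost money, so their cost stays in the numerator.
        "cost_per_completed_objective": total_cost / completed,
        "cost_per_call": total_cost / total_calls,
        "retry_rate": sum(r.retries for r in runs) / total_calls,
        "idle_loop_rate": sum(r.idle_calls for r in runs) / total_calls,
    }


runs = [
    ObjectiveRun(calls=10, retries=2, idle_calls=1, cost=0.50, completed=True),
    ObjectiveRun(calls=30, retries=10, idle_calls=9, cost=1.50, completed=False),
    ObjectiveRun(calls=20, retries=4, idle_calls=2, cost=1.00, completed=True),
]

metrics = agentic_metrics(runs)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here cost per call is 0.05 regardless of outcome, while cost per completed objective is 1.50 because the failed run's spend is carried by the two successes.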

The instrumentation gap

Most organisations can measure activity and adoption (rungs 1-2). Some can measure productivity (rung 3). Few can measure value (rung 4). Almost none can measure strategic impact (rung 5).

The gap is not conceptual. It is instrumentation. The systems that track AI usage do not connect to the systems that track business outcomes. The token meter does not talk to the P&L. The user analytics do not talk to the CRM. The model logs do not talk to the quality system.

Closing the instrumentation gap requires integration work, not just metric definition. You need to connect AI telemetry to business telemetry. That is an engineering problem, not a measurement problem.
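
At its simplest, connecting AI telemetry to business telemetry is a join on a shared key. The sketch below assumes AI usage records can carry a business identifier (a hypothetical `case_id`) that also appears in the outcome system; in many organisations, adding that key to the telemetry is the integration work itself.

```python
from collections import defaultdict


def link_cost_to_outcomes(
    usage_records: list[dict],
    outcomes_by_case: dict[str, str],
) -> tuple[dict[str, float], float]:
    """Join AI usage cost to business outcomes on case_id.

    Returns (cost per outcome, cost that could not be linked to any outcome).
    """
    cost_by_outcome: dict[str, float] = defaultdict(float)
    unlinked_cost = 0.0
    for record in usage_records:
        outcome = outcomes_by_case.get(record.get("case_id"))
        if outcome is None:
            unlinked_cost += record["cost"]   # no key, or no matching outcome
        else:
            cost_by_outcome[outcome] += record["cost"]
    return dict(cost_by_outcome), unlinked_cost


# Hypothetical AI telemetry (for example, from model logs).
usage = [
    {"case_id": "C1", "cost": 2.0},
    {"case_id": "C2", "cost": 3.0},
    {"case_id": None, "cost": 5.0},    # usage with no business key attached
    {"case_id": "C1", "cost": 1.0},
]
# Hypothetical business telemetry (for example, from a case management system).
outcomes = {"C1": "resolved", "C2": "escalated"}

linked, unlinked = link_cost_to_outcomes(usage, outcomes)
print(linked, unlinked)
```

The unlinked cost is the raw material of the value gap: spend that exists in the AI logs but has no counterpart in the business systems.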

Short self-audit

Use this checklist to audit your metric selection. Answer honestly.

Metric selection checklist

  • We have at least one rung 4 (value) metric for each major initiative
  • We can calculate attribution coverage across the portfolio
  • We track cost per successful outcome for our top 3 use cases
  • We measure realised productivity gain, not just self-reported time savings
  • We track risk metrics alongside value metrics
  • We know which metrics require which maturity level
  • We have a plan to close the instrumentation gap
  • We review metrics quarterly and retire ones that no longer matter
  • We can explain each metric’s failure mode to leadership
  • We have stopped using at least one vanity metric in the past year

Goodhart’s Law warning

Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. Every metric in this framework can be gamed. The question is not whether gaming is possible, but whether the cost of gaming exceeds the cost of honest measurement.

Interpretation

The tokenmaxxing leaderboards collapsed in early 2026 because they optimised for the wrong thing. Organisations competed to minimise token cost per task, which incentivised shorter outputs, simpler prompts, and cheaper models. The result was lower quality, not lower cost. The leaderboards rewarded gaming the metric, not improving the outcome.

The defence against Goodhart’s Law is to measure multiple dimensions, track failure modes, and retire metrics when they stop being useful. No single metric is sufficient. The portfolio of metrics matters more than any individual metric.
