Start with the distinction ladder
Before you choose metrics, understand where they sit on the distinction ladder. Every metric belongs on one of five rungs: activity, adoption, productivity, value, or strategic impact. The higher the rung, the harder the metric is to game, and the more it matters.
This framework focuses on rungs 3-5: productivity, value, and strategic impact. Activity and adoption metrics are necessary for operations, but they do not prove value. If you only measure activity and adoption, you are measuring the wrong thing.
The lead metric: attribution coverage
If you can only track one metric across your AI portfolio, track attribution coverage. It is the percentage of your total AI cost that can be linked to a measured business outcome.
Attribution coverage = (Cost attributed to measured outcomes) / (Total AI cost)
The unattributed remainder is the value gap. One number to lead with, one gap to close.
Attribution coverage is a lead metric because it predicts value realisation. If you cannot attribute cost to outcome, you cannot prove value. If you cannot prove value, you cannot defend the spend. Coverage is the early warning.
Maturity: Emerging
Requires: cost visibility by initiative, outcome measurement for major use cases, attribution logic to link cost to outcome.
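The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Initiative` record, its field names and the sample figures are all hypothetical, and it assumes each initiative's cost is either fully attributed to a measured outcome or not attributed at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Initiative:
    name: str
    cost: float                      # total AI cost for the reporting period
    measured_outcome: Optional[str]  # None when no measured outcome is linked


def attribution_coverage(initiatives):
    """Share of total AI cost that is linked to a measured business outcome."""
    total = sum(i.cost for i in initiatives)
    if total == 0:
        raise ValueError("total AI cost is zero; coverage is undefined")
    attributed = sum(i.cost for i in initiatives if i.measured_outcome is not None)
    return attributed / total


# Hypothetical portfolio, for illustration only.
portfolio = [
    Initiative("support-assistant", 400_000, "tickets resolved without escalation"),
    Initiative("code-assistant", 350_000, None),
    Initiative("fraud-triage", 250_000, "fraud losses avoided"),
]

coverage = attribution_coverage(portfolio)
value_gap = 1 - coverage  # the unattributed remainder
print(f"coverage {coverage:.0%}, value gap {value_gap:.0%}")
```

In practice, partial attribution within an initiative is likely; the same ratio still applies if `cost` is split into attributed and unattributed amounts.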
Capture status: from potential to sustained
Not all claimed value is equal. Every value item should be tagged with its capture status to distinguish modelled opportunities from realised benefits.
Potential
A modelled opportunity. No operating change yet observed.
Observed
An operating measure changed. Correlation exists, but causation is not yet established.
Captured
Financial, capacity, service or risk benefit was acted upon. Value is banked or redeployed.
Sustained
Benefit persists through a defined period. Durability is demonstrated.
Interpretation: Most AI value claims stop at “potential” or “observed”. Few reach “captured”. Almost none track “sustained”. This progression explains why portfolio-level ROI remains elusive despite impressive pilot results.
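One way to make the tagging concrete is to treat capture status as an ordered value and report claimed value by status. The sketch below is an assumption about how such a register might look; the `CaptureStatus` enum, the helper names and the sample amounts are hypothetical.

```python
from collections import defaultdict
from enum import IntEnum


class CaptureStatus(IntEnum):
    # Ordered so that a higher value means stronger evidence of realised value.
    POTENTIAL = 1
    OBSERVED = 2
    CAPTURED = 3
    SUSTAINED = 4


def value_by_status(items):
    """Sum claimed value per capture status; items are (amount, status) pairs."""
    totals = defaultdict(float)
    for amount, status in items:
        totals[status] += amount
    return {status: totals[status] for status in CaptureStatus}


def realised_share(items):
    """Share of claimed value that has reached at least CAPTURED."""
    total = sum(amount for amount, _ in items)
    realised = sum(amount for amount, status in items
                   if status >= CaptureStatus.CAPTURED)
    return realised / total if total else 0.0


# Hypothetical value register, for illustration only.
claims = [
    (500_000, CaptureStatus.POTENTIAL),
    (300_000, CaptureStatus.OBSERVED),
    (150_000, CaptureStatus.CAPTURED),
    (50_000, CaptureStatus.SUSTAINED),
]

breakdown = value_by_status(claims)
share = realised_share(claims)
print(f"realised share of claimed value: {share:.0%}")
```

Reporting the breakdown alongside the headline figure keeps modelled opportunities from being read as banked benefit.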
The comprehensive metric set
The full reference set contains 42 metrics across 12 categories. Each metric includes: definition, what it is good for, where it fails, and what maturity level you need to use it. Below are the key metrics from each category.
Portfolio Coverage
Attribution coverage
Definition: Percentage of total AI cost linked to measured business outcomes
Good for: Portfolio-level value proof, board reporting
Failure mode: Can be gamed by attributing cost to weak outcomes
Initiative coverage
Definition: Percentage of AI initiatives with defined success metrics
Good for: Governance maturity, ensuring clear goals
Failure mode: Metrics can be defined but not measured
Maturity: Foundational
Cost Efficiency
Cost per successful outcome
Definition: Total AI cost divided by number of successful outcomes delivered
Good for: Unit economics, comparing efficiency across use cases
Failure mode: Requires clear definition of ‘successful outcome’
Fully loaded cost per use case
Definition: Total cost including infrastructure, operations, governance, not just inference
Good for: True TCO visibility, portfolio prioritisation
Failure mode: Requires cost allocation methodology
Maturity: Systematic
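The two cost efficiency metrics can be combined in a short sketch. The cost components follow the definition above; the function names and figures are hypothetical, and a real cost allocation methodology would be considerably more involved.

```python
def fully_loaded_cost(inference, infrastructure, operations, governance):
    """Total cost of a use case, not just inference spend."""
    return inference + infrastructure + operations + governance


def cost_per_successful_outcome(total_cost, successful_outcomes):
    """Total AI cost divided by the number of successful outcomes delivered."""
    if successful_outcomes <= 0:
        raise ValueError("at least one successful outcome is required")
    return total_cost / successful_outcomes


# Hypothetical monthly figures for one use case.
total = fully_loaded_cost(
    inference=20_000, infrastructure=15_000, operations=10_000, governance=5_000
)
inference_only = cost_per_successful_outcome(20_000, 2_500)
loaded = cost_per_successful_outcome(total, 2_500)
print(f"inference only: {inference_only:.2f}, fully loaded: {loaded:.2f}")
```

With these illustrative numbers the fully loaded unit cost is 2.5 times the inference-only figure, which is the kind of distortion the fully loaded view is meant to expose.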
Productivity Impact
Realised productivity gain
Definition: Measured output increase or cost reduction, not self-reported time savings
Good for: Honest productivity measurement, avoiding self-report bias
Failure mode: Requires baseline data and control groups
Maturity: Systematic
Value Realisation
Multi-dimensional value
Definition: Value measured across four dimensions: revenue growth, cost reduction, quality improvement, risk mitigation
Good for: Capturing full value, avoiding single-dimension optimisation
Failure mode: Requires weighting across dimensions
Maturity: Systematic
ROI and Payback
Simple ROI
Definition: (Value delivered - Cost) / Cost, expressed as percentage
Good for: Quick business case assessment
Failure mode: Ignores time value of money and strategic value
Maturity: Foundational
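A brief sketch of the simple ROI formula, with an illustration of its stated failure mode. The discounting helper, the 10% annual rate and the three-year delay are assumptions added here to show what ignoring the time value of money means; they are not part of the metric definition.

```python
def simple_roi_pct(value_delivered, cost):
    """(Value delivered - Cost) / Cost, expressed as a percentage."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (value_delivered - cost) / cost * 100


def present_value(amount, annual_rate, years):
    """Discount a future amount back to today (illustrative assumption)."""
    return amount / (1 + annual_rate) ** years


cost = 100_000
value = 150_000

# Simple ROI treats the value as if it arrived today.
roi_now = simple_roi_pct(value, cost)

# The same value arriving in three years, discounted at an assumed 10% a year.
roi_discounted = simple_roi_pct(present_value(value, 0.10, 3), cost)

print(f"simple ROI: {roi_now:.1f}%, discounted: {roi_discounted:.1f}%")
```

The gap between the two figures grows with the delay, which is why simple ROI suits a quick business case check rather than a multi-year comparison.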
Risk and Compliance
Model risk coverage
Definition: Percentage of production models with completed risk assessments
Good for: Governance maturity, regulatory readiness
Failure mode: Assessment completion does not equal risk mitigation
Note: The categories above are a selection. See the complete AI Economics KPIs library for all 42 metrics across 12 categories, with filters by role, maturity level, and governance domain.
Economic yield families
Economic yield measures compare worthwhile outcomes with fully loaded cost, risk and operational burden. There is no single universal yield metric. Use the family that matches your value dimension.
Workflow yield
Measured operating benefit / full workflow cost
Use for: Claims processing, support resolution, document review, software delivery, forecasting
Revenue yield
Incremental contribution margin / full AI cost
Use when AI changes: Conversion, retention, pricing, product adoption, sales capacity
Risk yield
Expected loss avoided / full AI cost
Use for: Fraud, cyber, compliance, safety, operational resilience
Capacity yield
Productive capacity redeployed / full AI cost
Note: Stricter than hours saved. Time has value only when it changes output, cost, service or strategic capacity.
Quality-adjusted yield
Successful outcomes × quality weight / full cost
Use when: Output quality varies significantly and affects downstream value
Warning: Do not compare unlike outcomes using one league table. A medical decision, code completion and fraud alert should not be ranked by a single yield score.
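The yield families share one shape: a benefit measure divided by full cost. The sketch below shows that shape under stated assumptions; the function names and figures are hypothetical, and results are deliberately kept keyed by family rather than sorted into one ranking.

```python
def economic_yield(benefit, full_cost):
    """Generic yield: a benefit measure divided by fully loaded cost."""
    if full_cost <= 0:
        raise ValueError("full cost must be positive")
    return benefit / full_cost


def quality_adjusted_yield(outcomes, full_cost):
    """outcomes is an iterable of (successful_count, quality_weight) pairs."""
    weighted = sum(count * weight for count, weight in outcomes)
    return economic_yield(weighted, full_cost)


# Hypothetical figures. Each family uses its own benefit measure and units.
yields = {
    "workflow": economic_yield(180_000, 120_000),  # operating benefit / workflow cost
    "risk": economic_yield(900_000, 300_000),      # expected loss avoided / AI cost
    "quality_adjusted": quality_adjusted_yield(
        [(800, 1.0), (200, 0.5)], 45_000           # weighted outcomes / full cost
    ),
}

for family, result in yields.items():
    print(f"{family}: {result:.2f}")
```

Note that the quality-adjusted figure is in weighted outcomes per unit of cost while the other two are currency ratios, so the numbers are not comparable across families.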
Behavioural metrics
Behavioural outcomes determine whether AI outputs become useful work. These metrics track trust, adoption quality, review burden and capability effects.
Appropriate-use rate
Share of AI use occurring in tasks where evidence shows net benefit
Trust calibration gap
Difference between user confidence and actual performance
Review debt
Backlog or superficial approval that builds up when outputs are generated faster than they can be validated
Rework displacement
Whether AI removes rework or moves it downstream
Learning transfer
Whether users improve independent task performance over time
Shadow-use exposure
Estimated use outside approved systems or evidence standards
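Two of these metrics reduce to simple ratios once the underlying data exists. The following is a minimal sketch under assumptions: usage is counted per task type, a set of task types with evidence of net benefit is already known, and user confidence is recorded on a 0 to 1 scale. All names and figures are hypothetical.

```python
def appropriate_use_rate(usage_by_task, evidenced_tasks):
    """Share of AI use that falls in tasks where evidence shows net benefit."""
    total = sum(usage_by_task.values())
    if total == 0:
        return 0.0
    appropriate = sum(count for task, count in usage_by_task.items()
                      if task in evidenced_tasks)
    return appropriate / total


def trust_calibration_gap(confidences, correct):
    """Mean stated confidence (0 to 1) minus observed accuracy.

    Positive means users trust the output more than its performance warrants.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equally sized, non-empty samples")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# Hypothetical usage counts and review samples.
usage = {"summarise": 600, "draft-contract": 300, "code-review": 100}
evidenced = {"summarise", "code-review"}
use_rate = appropriate_use_rate(usage, evidenced)

gap = trust_calibration_gap([0.9, 0.8, 0.9, 0.8], [True, False, True, False])

print(f"appropriate-use rate: {use_rate:.0%}, trust calibration gap: {gap:+.2f}")
```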
For a complete behavioural measurement framework, see The Behavioural P&L of AI.
Agentic demand metrics
Agentic systems create machine-generated demand that is not limited by headcount or working hours. These metrics track autonomous consumption and value per objective.
Autonomous calls per human request
Model calls generated by agents per initial user objective
Cost per completed objective
Full cost including retries, tool calls and verification per successful agent objective
Retry rate
Frequency of agent retries and correction loops
Idle loop rate
Agent calls that produce no useful outcome or forward progress
Value at risk from runaway demand
Potential cost exposure if agent demand expands without governance
Interpretation: Agentic demand can scale faster than value. Track cost per completed objective, not cost per call, to avoid rewarding activity over outcomes.
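The difference between cost per call and cost per completed objective is easy to see with a small example. This sketch assumes one record per agent objective; the `ObjectiveRun` fields are hypothetical, and retry rate and idle loop rate are computed here as shares of total model calls, which is one possible reading of the definitions above.

```python
from dataclasses import dataclass


@dataclass
class ObjectiveRun:
    model_calls: int   # all model calls made for this objective
    retries: int       # calls that were retries or correction loops
    idle_calls: int    # calls that produced no forward progress
    cost: float        # full cost, including tool calls and verification
    completed: bool    # whether the objective was successfully completed


def agentic_metrics(runs):
    total_calls = sum(r.model_calls for r in runs)
    if not runs or total_calls == 0:
        raise ValueError("need at least one run with model calls")
    total_cost = sum(r.cost for r in runs)
    completed = sum(1 for r in runs if r.completed)
    return {
        "calls_per_objective": total_calls / len(runs),
        "retry_rate": sum(r.retries for r in runs) / total_calls,
        "idle_loop_rate": sum(r.idle_calls for r in runs) / total_calls,
        "cost_per_call": total_cost / total_calls,
        # Failed runs still cost money, so their cost is spread over successes.
        "cost_per_completed_objective":
            total_cost / completed if completed else float("inf"),
    }


# Hypothetical runs: the second loops heavily and never completes.
runs = [
    ObjectiveRun(10, 2, 1, 0.50, True),
    ObjectiveRun(40, 15, 10, 2.00, False),
    ObjectiveRun(10, 1, 1, 0.50, True),
]
metrics = agentic_metrics(runs)
for name, result in metrics.items():
    print(f"{name}: {result:.2f}")
```

In this illustration the looping run leaves cost per call unchanged, because every call is priced the same, while cost per completed objective absorbs its full cost.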
The instrumentation gap
Most organisations can measure activity and adoption (rungs 1-2). Some can measure productivity (rung 3). Few can measure value (rung 4). Almost none can measure strategic impact (rung 5).
The gap is not conceptual. It is instrumentation. The systems that track AI usage do not connect to the systems that track business outcomes. The token meter does not talk to the P&L. The user analytics do not talk to the CRM. The model logs do not talk to the quality system.
Closing the instrumentation gap requires integration work, not just metric definition. You need to connect AI telemetry to business telemetry. That is an engineering problem, not a measurement problem.
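At its simplest, the integration work is a join between two systems that do not share a schema. The sketch below assumes both AI cost records and business outcome records can be tagged with a common `initiative_id`; that key, the record shapes and the sample data are hypothetical, and in practice establishing the shared key is usually the hard part.

```python
def join_cost_to_outcomes(cost_records, outcome_records):
    """Left-join AI cost records to business outcomes on initiative_id.

    Returns (linked, unlinked): cost records with a matching outcome,
    and cost records that no business system can account for.
    """
    outcomes = {o["initiative_id"]: o for o in outcome_records}
    linked, unlinked = [], []
    for cost in cost_records:
        outcome = outcomes.get(cost["initiative_id"])
        if outcome is None:
            unlinked.append(cost)
        else:
            linked.append({**cost, **outcome})
    return linked, unlinked


# Hypothetical exports from a token meter and a business system.
ai_costs = [
    {"initiative_id": "support-assistant", "ai_cost": 400_000},
    {"initiative_id": "code-assistant", "ai_cost": 350_000},
]
business_outcomes = [
    {"initiative_id": "support-assistant", "outcome": "escalations avoided",
     "measured_value": 620_000},
]

linked, unlinked = join_cost_to_outcomes(ai_costs, business_outcomes)
print("linked:", [r["initiative_id"] for r in linked])
print("unlinked:", [r["initiative_id"] for r in unlinked])
```

The unlinked list is the instrumentation backlog: spend that is visible in AI telemetry but invisible to any business measurement.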
Short self-audit
Use this checklist to audit your metric selection. Answer honestly.
Metric selection checklist
Goodhart’s Law warning
Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. Every metric in this framework can be gamed. The question is not whether gaming is possible, but whether the cost of gaming exceeds the cost of honest measurement.
The tokenmaxxing leaderboards collapsed in early 2026 because they optimised for the wrong thing. Organisations competed to minimise token cost per task, which incentivised shorter outputs, simpler prompts, and cheaper models. The result was lower quality, not lower cost. Participants were gaming the metric, not improving the outcome.
The defence against Goodhart’s Law is to measure multiple dimensions, track failure modes, and retire metrics when they stop being useful. No single metric is sufficient. The portfolio of metrics matters more than any individual metric.