The Metrics Framework: AI Economics Reference Set

Start with the distinction ladder

Before you choose metrics, understand where they sit on the distinction ladder. Every metric belongs on one of five rungs: activity, adoption, productivity, value, or strategic impact. The higher the rung, the harder the metric is to game, and the more it matters.

This framework focuses on rungs 3-5: productivity, value, and strategic impact. Activity and adoption metrics are necessary for operations, but they do not prove value. If you only measure activity and adoption, you are measuring the wrong thing.

The lead metric: attribution coverage

If you can only track one metric across your AI portfolio, track attribution coverage. It is the percentage of your total AI cost that can be linked to a measured business outcome.

Attribution coverage = (Cost attributed to measured outcomes) / (Total AI cost)

The unattributed remainder is the value gap. One number to lead with, one gap to close.

Attribution coverage is a lead metric because it predicts value realisation. If you cannot attribute cost to outcome, you cannot prove value. If you cannot prove value, you cannot defend the spend. Coverage is the early warning.
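As a minimal sketch of the calculation above, assume each initiative reports its total AI cost and the portion of that cost already linked to a measured outcome. The `Initiative` record, the initiative names and the figures are hypothetical illustrations, not part of the framework.

```python
from dataclasses import dataclass


@dataclass
class Initiative:
    name: str
    cost: float        # total AI cost for the period (assumed input)
    attributed: float  # portion of that cost linked to a measured outcome


def attribution_coverage(initiatives):
    """Cost attributed to measured outcomes divided by total AI cost."""
    total = sum(i.cost for i in initiatives)
    if total == 0:
        raise ValueError("total AI cost is zero; coverage is undefined")
    # Never count more attributed cost than the initiative actually spent.
    attributed = sum(min(i.attributed, i.cost) for i in initiatives)
    return attributed / total


portfolio = [
    Initiative("support-assistant", cost=400_000, attributed=300_000),
    Initiative("code-completion", cost=250_000, attributed=0),
    Initiative("fraud-triage", cost=350_000, attributed=350_000),
]

coverage = attribution_coverage(portfolio)
value_gap = 1 - coverage  # the unattributed remainder
print(f"coverage={coverage:.0%}, value gap={value_gap:.0%}")
```

In this made-up portfolio, 650,000 of 1,000,000 is attributed, so coverage is 65% and the value gap is 35%.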

Refined attribution coverage requirements

An attributed outcome must include minimum metadata to prevent weak attribution inflating the score:

• Named owner
• Baseline and period
• Outcome event
• Evidence source
• Cost boundary
• Attribution method
• Confidence rating
• Capture status

Maturity: Emerging
Requires: cost visibility by initiative, outcome measurement for major use cases, attribution logic to link cost to outcome.
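One way to enforce the minimum metadata is to count a cost as attributed only when all eight fields are present. The sketch below assumes attribution records are plain dictionaries; the field names and example values are hypothetical mappings of the list above, not a prescribed schema.

```python
# Hypothetical field names, one per required metadata item.
REQUIRED_FIELDS = (
    "owner",
    "baseline_and_period",
    "outcome_event",
    "evidence_source",
    "cost_boundary",
    "attribution_method",
    "confidence_rating",
    "capture_status",
)


def is_complete(record):
    """True when every required metadata field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def refined_coverage(records, total_ai_cost):
    """Coverage that only counts cost from fully documented attributions."""
    if total_ai_cost <= 0:
        raise ValueError("total AI cost must be positive")
    qualifying = sum(r["cost"] for r in records if is_complete(r))
    return qualifying / total_ai_cost


complete = {
    "cost": 300_000,
    "owner": "Head of Claims",
    "baseline_and_period": "prior year, same quarter",
    "outcome_event": "claim settled",
    "evidence_source": "claims system export",
    "cost_boundary": "inference plus operations",
    "attribution_method": "matched control group",
    "confidence_rating": "medium",
    "capture_status": "captured",
}
# Same shape, but with no named owner and no evidence source.
weak = {**complete, "cost": 200_000, "owner": "", "evidence_source": None}

records = [complete, weak]
naive = sum(r["cost"] for r in records) / 1_000_000
refined = refined_coverage(records, total_ai_cost=1_000_000)
print(f"naive={naive:.0%}, refined={refined:.0%}")
```

The weakly documented record inflates the naive score to 50%; the refined score stays at 30%.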

Capture status: from potential to sustained

Not all claimed value is equal. Every value item should be tagged with its capture status to distinguish modelled opportunities from realised benefits.

Figure: Funnel from potential to observed, attributed, captured and sustained, narrowing sharply.

Interpretation: Most AI value claims stop at “potential” or “observed”. Few reach “captured”. Almost none track “sustained”. This progression explains why portfolio-level ROI remains elusive despite impressive pilot results.
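The tagging can be sketched as an ordered enumeration plus a cumulative count per stage. The stage names come from the text; the numeric ordering, the cumulative reading of the funnel and the example counts are assumptions made for illustration.

```python
from enum import IntEnum


class CaptureStatus(IntEnum):
    # Ordered so that a later stage compares greater than an earlier one.
    POTENTIAL = 1
    OBSERVED = 2
    ATTRIBUTED = 3
    CAPTURED = 4
    SUSTAINED = 5


def funnel(statuses):
    """Number of value items that reached at least each stage."""
    return {
        stage.name: sum(1 for s in statuses if s >= stage)
        for stage in CaptureStatus
    }


# Hypothetical portfolio of ten value items, tagged with their furthest stage.
items = (
    [CaptureStatus.POTENTIAL] * 5
    + [CaptureStatus.OBSERVED] * 3
    + [CaptureStatus.ATTRIBUTED]
    + [CaptureStatus.CAPTURED]
)

for stage, count in funnel(items).items():
    print(f"{stage:<10} {count}")
```

Here ten items enter the funnel, one reaches captured and none is tracked as sustained.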

The comprehensive metric set

The full reference set contains 42 metrics across 12 categories. Each metric includes: definition, what it is good for, where it fails, and what maturity level you need to use it. Below are key metrics from six of those categories.

Portfolio Coverage

Attribution coverage

Definition: Percentage of total AI cost linked to measured business outcomes

Good for: Portfolio-level value proof, board reporting

Failure mode: Can be gamed by attributing cost to weak outcomes

Maturity: Emerging

Initiative coverage

Definition: Percentage of AI initiatives with defined success metrics

Good for: Governance maturity, ensuring clear goals

Failure mode: Metrics can be defined but not measured

Maturity: Foundational

Cost Efficiency

Cost per successful outcome

Definition: Total AI cost divided by number of successful outcomes delivered

Good for: Unit economics, comparing efficiency across use cases

Failure mode: Requires clear definition of ‘successful outcome’

Maturity: Emerging

Fully loaded cost per use case

Definition: Total cost including infrastructure, operations, governance, not just inference

Good for: True TCO visibility, portfolio prioritisation

Failure mode: Requires cost allocation methodology

Maturity: Systematic
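To illustrate how the two cost efficiency metrics interact, the sketch below computes cost per successful outcome twice: once from inference cost alone and once from the fully loaded cost. The cost components follow the definition above; the amounts and the outcome count are invented.

```python
def fully_loaded_cost(components):
    """Sum every cost component, not just inference."""
    return sum(components.values())


def cost_per_successful_outcome(total_cost, successful_outcomes):
    """Total AI cost divided by the number of successful outcomes."""
    if successful_outcomes <= 0:
        return None  # undefined until at least one outcome counts as successful
    return total_cost / successful_outcomes


# Hypothetical annual costs for one use case.
claims_costs = {
    "inference": 120_000,
    "infrastructure": 40_000,
    "operations": 60_000,
    "governance": 20_000,
}
successful_outcomes = 48_000  # depends on the agreed definition of "successful"

loaded = fully_loaded_cost(claims_costs)
inference_only_unit = cost_per_successful_outcome(
    claims_costs["inference"], successful_outcomes
)
loaded_unit = cost_per_successful_outcome(loaded, successful_outcomes)
print(f"inference only: {inference_only_unit:.2f}, fully loaded: {loaded_unit:.2f}")
```

With these figures the unit cost doubles from 2.50 to 5.00 once infrastructure, operations and governance are included.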

Productivity Impact

Realised productivity gain

Definition: Measured output increase or cost reduction, not self-reported time savings

Good for: Honest productivity measurement, avoiding self-report bias

Failure mode: Requires baseline data and control groups

Maturity: Systematic
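One common way to use baseline data and a control group together is a difference-in-differences comparison. The framework does not prescribe this method; it is shown here as an assumed example, with invented weekly output figures for a team using AI (treated) and a comparable team that is not (control).

```python
from statistics import mean


def realised_gain(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus the change in the control group."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical cases closed per person per week.
gain = realised_gain(
    treated_before=[20, 22, 21],
    treated_after=[26, 27, 25],
    control_before=[20, 21, 22],
    control_after=[22, 23, 21],
)
print(f"realised gain: {gain} cases per person per week")
```

The treated team improved by 5 cases, but the control team also improved by 1, so only 4 are credited to the AI deployment. A real analysis would also need to check that the groups are comparable and that the difference is statistically meaningful.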

Value Realisation

Multi-dimensional value

Definition: Value measured across four dimensions: revenue growth, cost reduction, quality improvement, risk mitigation

Good for: Capturing full value, avoiding single-dimension optimisation

Failure mode: Requires weighting across dimensions

Maturity: Systematic

ROI and Payback

Simple ROI

Definition: (Value delivered - Cost) / Cost, expressed as percentage

Good for: Quick business case assessment

Failure mode: Ignores time value of money and strategic value

Maturity: Foundational
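The Simple ROI definition translates directly into code. The function name and the figures below are illustrative only.

```python
def simple_roi(value_delivered, cost):
    """(Value delivered - Cost) / Cost, expressed as a percentage."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (value_delivered - cost) / cost * 100


# Hypothetical business case: 1.5m of value delivered for 1.0m of cost.
roi = simple_roi(value_delivered=1_500_000, cost=1_000_000)
print(f"simple ROI: {roi:.0f}%")
```

The calculation has no notion of when the value arrives, which is the failure mode noted above: value delivered in year one and the same value delivered in year five produce the same figure.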

Risk and Compliance

Model risk coverage

Definition: Percentage of production models with completed risk assessments

Good for: Governance maturity, regulatory readiness

Failure mode: Assessment completion does not equal risk mitigation

Maturity: Emerging

Note: The full reference set includes 42 metrics across 12 categories. See the complete AI Economics KPIs library for the full set with filters by role, maturity level, and governance domain.

Economic yield families

Economic yield measures compare worthwhile outcomes with fully loaded cost, risk and operational burden. There is no single universal yield metric. Use the family that matches your value dimension.

Workflow yield

Measured operating benefit / full workflow cost

Use for: Claims processing, support resolution, document review, software delivery, forecasting

Revenue yield

Incremental contribution margin / full AI cost

Use when AI changes: Conversion, retention, pricing, product adoption, sales capacity

Risk yield

Expected loss avoided / full AI cost

Use for: Fraud, cyber, compliance, safety, operational resilience

Capacity yield

Productive capacity redeployed / full AI cost

Note: Stricter than hours saved. Time has value only when it changes output, cost, service or strategic capacity.

Quality-adjusted yield

(Successful outcomes × quality weight) / full cost

Use when: Output quality varies significantly and affects downstream value

Warning: Do not compare unlike outcomes using one league table. A medical decision, code completion and fraud alert should not be ranked by a single yield score.
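The warning can be built into tooling. In the sketch below each yield carries its family, and ranking is refused when families are mixed. The `Yield` record, the `rank` helper and all figures are hypothetical; the ratio is simply benefit over full cost, as in the family definitions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Yield:
    name: str
    family: str      # e.g. "workflow", "revenue", "risk", "capacity"
    benefit: float   # numerator for the chosen family
    full_cost: float

    @property
    def ratio(self):
        if self.full_cost <= 0:
            raise ValueError("full cost must be positive")
        return self.benefit / self.full_cost


def rank(yields):
    """Rank by yield ratio, but only within a single family."""
    families = {y.family for y in yields}
    if len(families) > 1:
        raise ValueError(f"refusing to rank across families: {sorted(families)}")
    return sorted(yields, key=lambda y: y.ratio, reverse=True)


workflow = [
    Yield("support resolution", "workflow", benefit=300_000, full_cost=200_000),
    Yield("claims processing", "workflow", benefit=600_000, full_cost=240_000),
]
fraud = Yield("fraud alerts", "risk", benefit=900_000, full_cost=300_000)

for y in rank(workflow):
    print(f"{y.name}: {y.ratio:.1f}")

try:
    rank(workflow + [fraud])
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```

Raising an error is a deliberately blunt design choice: it forces a separate table per family instead of a single league table.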

Behavioural metrics

Behavioural outcomes determine whether AI outputs become useful work. These metrics track trust, adoption quality, review burden and capability effects.

Appropriate-use rate

Share of AI use occurring in tasks where evidence shows net benefit

Trust calibration gap

Difference between user confidence and actual performance

Review debt

Backlog or superficial approval when outputs are generated faster than validation capacity

Rework displacement

Whether AI removes rework or moves it downstream

Learning transfer

Whether users improve independent task performance over time

Shadow-use exposure

Estimated use outside approved systems or evidence standards
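As one example of how a behavioural metric might be computed, the sketch below estimates the trust calibration gap as mean stated user confidence minus observed accuracy. That operationalisation, the 0 to 1 confidence scale and the sample data are assumptions; the framework only defines the metric as the difference between user confidence and actual performance.

```python
def trust_calibration_gap(confidences, correct):
    """Mean user confidence (0 to 1) minus the share of outputs that were correct.

    Positive values suggest over-trust; negative values suggest under-trust.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need one confidence and one correctness flag per output")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# Hypothetical sample: users were very confident, but half the outputs were wrong.
gap = trust_calibration_gap(
    confidences=[0.9, 0.8, 0.9, 1.0],
    correct=[True, False, True, False],
)
print(f"trust calibration gap: {gap:+.2f}")
```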

For a complete behavioural measurement framework, see The Behavioural P&L of AI.

Agentic demand metrics

Agentic systems create machine-generated demand that is not limited by headcount or working hours. These metrics track autonomous consumption and value per objective.

Autonomous calls per human request

Model calls generated by agents per initial user objective

Cost per completed objective

Full cost including retries, tool calls and verification per successful agent objective

Retry rate

Frequency of agent retries and correction loops

Idle loop rate

Agent calls that produce no useful outcome or forward progress

Value at risk from runaway demand

Potential cost exposure if agent demand expands without governance

Interpretation: Agentic demand can scale faster than value. Track cost per completed objective, not cost per call, to avoid rewarding activity over outcomes.
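A rough sketch of these metrics, assuming each agent run is logged per user objective with its call counts, full cost and completion flag. The `ObjectiveRun` record and the numbers are hypothetical. Charging the cost of failed runs to the completed objectives is also an assumption about how "full cost" should be read.

```python
from dataclasses import dataclass


@dataclass
class ObjectiveRun:
    model_calls: int   # all model calls the agent made for this objective
    retries: int       # calls that were retries or correction loops
    idle_calls: int    # calls that produced no forward progress
    full_cost: float   # including retries, tool calls and verification
    completed: bool


def agentic_summary(runs):
    total_calls = sum(r.model_calls for r in runs)
    if not runs or total_calls == 0:
        raise ValueError("need at least one run with at least one model call")
    total_cost = sum(r.full_cost for r in runs)
    completed = sum(1 for r in runs if r.completed)
    return {
        "calls_per_request": total_calls / len(runs),
        "retry_rate": sum(r.retries for r in runs) / total_calls,
        "idle_loop_rate": sum(r.idle_calls for r in runs) / total_calls,
        "cost_per_call": total_cost / total_calls,
        # Failed runs still cost money, so their cost stays in the numerator.
        "cost_per_completed_objective": total_cost / completed if completed else None,
    }


runs = [
    ObjectiveRun(model_calls=10, retries=2, idle_calls=1, full_cost=0.50, completed=True),
    ObjectiveRun(model_calls=30, retries=12, idle_calls=9, full_cost=1.50, completed=False),
    ObjectiveRun(model_calls=10, retries=1, idle_calls=0, full_cost=0.50, completed=True),
]
summary = agentic_summary(runs)
for metric, value in summary.items():
    print(f"{metric}: {value:.2f}")
```

In this example cost per call is 0.05, which looks cheap, while cost per completed objective is 1.25 because one run looped without finishing.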

The instrumentation gap

Most organisations can measure activity and adoption (rungs 1-2). Some can measure productivity (rung 3). Few can measure value (rung 4). Almost none can measure strategic impact (rung 5).

The gap is not conceptual. It is instrumentation. The systems that track AI usage do not connect to the systems that track business outcomes. The token meter does not talk to the P&L. The user analytics do not talk to the CRM. The model logs do not talk to the quality system.

Closing the instrumentation gap requires integration work, not just metric definition. You need to connect AI telemetry to business telemetry. That is an engineering problem, not a measurement problem.
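At its simplest, the integration is a join on a shared identifier. The sketch below assumes AI usage events and business outcome records both carry a `case_id`; that key, the record shapes and the values are hypothetical. In practice, getting such a key into both systems is usually the hard part.

```python
from collections import defaultdict


def join_telemetry(ai_events, business_outcomes):
    """Link AI cost to business outcomes by case_id and report what stays unlinked."""
    cost_by_case = defaultdict(float)
    for event in ai_events:
        cost_by_case[event["case_id"]] += event["cost"]

    outcome_by_case = {o["case_id"]: o["outcome"] for o in business_outcomes}

    linked = {
        case_id: (cost, outcome_by_case[case_id])
        for case_id, cost in cost_by_case.items()
        if case_id in outcome_by_case
    }
    unlinked_cost = sum(
        cost for case_id, cost in cost_by_case.items()
        if case_id not in outcome_by_case
    )
    return linked, unlinked_cost


# AI telemetry (e.g. from model logs) and business telemetry (e.g. from a CRM).
ai_events = [
    {"case_id": "A-1", "cost": 0.25},
    {"case_id": "A-1", "cost": 0.25},
    {"case_id": "A-2", "cost": 0.5},
    {"case_id": "A-3", "cost": 1.0},
]
business_outcomes = [
    {"case_id": "A-1", "outcome": "resolved"},
    {"case_id": "A-2", "outcome": "escalated"},
]

linked, unlinked_cost = join_telemetry(ai_events, business_outcomes)
print(linked)
print(f"cost with no outcome record: {unlinked_cost}")
```

The unlinked cost is the part of spend that cannot yet be attributed to any outcome.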

Short self-audit

Use this checklist to audit your metric selection. Answer honestly.

Metric selection checklist

☐ We have at least one rung 4 (value) metric for each major initiative

☐ We can calculate attribution coverage across the portfolio

☐ We track cost per successful outcome for our top 3 use cases

☐ We measure realised productivity gain, not just self-reported time savings

☐ We track risk metrics alongside value metrics

☐ We know which metrics require which maturity level

☐ We have a plan to close the instrumentation gap

☐ We review metrics quarterly and retire ones that no longer matter

☐ We can explain each metric’s failure mode to leadership

☐ We have stopped using at least one vanity metric in the past year

Goodhart’s Law warning

Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. Every metric in this framework can be gamed. The question is not whether gaming is possible, but whether the cost of gaming exceeds the cost of honest measurement.

Interpretation

The tokenmaxxing leaderboards collapsed in early 2026 because they optimised for the wrong thing. Organisations competed to minimise token cost per task, which incentivised shorter outputs, simpler prompts, and cheaper models. The result was lower quality, not lower cost. The leaderboards rewarded gaming the metric, not improving the outcome.

The defence against Goodhart’s Law is to measure multiple dimensions, track failure modes, and retire metrics when they stop being useful. No single metric is sufficient. The portfolio of metrics matters more than any individual metric.

Where we might be wrong

The metrics framework assumes that measurement drives improvement. In practice, measurement can drive gaming, compliance theatre, or paralysis. Some organisations measure everything and improve nothing. Others improve without measuring much at all. The framework does not tell you when measurement is worth the cost.

It also assumes that maturity levels are sequential: you need foundational metrics before emerging metrics, emerging before systematic, systematic before advanced. In practice, some organisations skip levels, or regress, or operate at different levels for different use cases. The maturity model is a simplification.

Finally, the framework is silent on weighting. It does not tell you which metrics matter most, or how to trade off between them. Attribution coverage is the lead metric, but it is not the only metric. The framework does not solve the prioritisation problem.

What would change our mind: a real organisation where comprehensive measurement led to worse outcomes than selective measurement, or where skipping maturity levels led to better outcomes than following them. That would suggest the framework is over-prescriptive, or that measurement has diminishing returns we have not accounted for.
