A Proof of Concept That Proves the Technology Works Has Proved Almost Nothing

Most AI proofs of concept answer one question: can the technology do something impressive? Investment committees need answers to a different set of questions.

1. The demo trap

An AI team runs a pilot.

The model can summarise documents, answer questions or generate code. Users like it. The sponsor presents positive feedback. The programme asks for production funding.

The technology may have proved it can perform the task.

It has not yet proved:

  • that the task matters
  • that performance is reliable enough
  • that users will change their behaviour
  • that the workflow will improve
  • that the full production cost is acceptable
  • that risk is manageable
  • that the organisation can capture the benefit
  • that the use case is better than alternatives
  • that it deserves scarce capital

Gartner recommends turning a proof of concept into a proof of value by weighing achieved benefits against AI cost.[^gartner] That sounds simple. In practice, it requires a different pilot design.

2. Four proofs, not one

Proof 1: feasibility

Can the system technically perform the task?

Evidence:

  • functional completion
  • integration viability
  • latency
  • security feasibility
  • data access
  • model availability

Proof 2: performance

Does it perform at the required standard?

Evidence:

  • accuracy
  • acceptance
  • error and hallucination rates
  • robustness
  • consistency
  • escalation rate
  • performance by case type

Proof 3: operating adoption

Will people and systems use it in the real workflow?

Evidence:

  • repeat use
  • task fit
  • trust
  • workarounds
  • shadow use
  • review behaviour
  • manager and process-owner acceptance
  • agent or system integration

Proof 4: value

Did a meaningful outcome change, and can the organisation capture it?

Evidence:

  • baseline movement
  • cost per successful outcome
  • throughput
  • revenue or margin
  • quality
  • risk or loss
  • released and redeployed capacity
  • customer or employee outcome
  • confidence in attribution

A pilot that completes only Proof 1 is a technical experiment, not an investment case.

3. Start with a value contract

Before the pilot begins, write a one-page value contract.

It should state:

  • problem and current baseline
  • business owner
  • users and affected workflow
  • target outcome
  • value dimension
  • acceptable quality and risk
  • full-cost hypothesis
  • evidence method
  • production-scale scenario
  • scale, redesign and stop criteria
  • date of decision

Fixing these terms in advance prevents the definition of success from shifting after the results arrive.

4. Build the baseline first

Without a baseline, every improvement claim becomes anecdotal.

Relevant baselines may include:

  • time per case
  • cases per employee
  • first-time resolution
  • defect rate
  • conversion
  • loss rate
  • cycle time
  • backlog
  • customer effort
  • review effort
  • cost per completed task
  • risk incidents

The baseline should capture the distribution, not just the average, and should be split by case type where possible. AI may improve easy cases and damage complex ones, and an average hides that.
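As a minimal sketch of a distribution-aware baseline, the snippet below summarises time per case by case type rather than as one average. The case types and all durations are invented for illustration; only the idea of reporting spread alongside the mean comes from the text.

```python
from statistics import mean, quantiles

# Hypothetical baseline: minutes per case, split by case type (illustrative numbers only).
baseline_minutes = {
    "simple": [12, 14, 15, 13, 16, 14, 12, 15],
    "complex": [55, 70, 48, 90, 62, 75, 58, 80],
}

def summarise(samples):
    """Return the mean, median and 90th percentile of a list of durations."""
    deciles = quantiles(samples, n=10, method="inclusive")  # nine cut points
    return {
        "mean": round(mean(samples), 1),
        "p50": round(deciles[4], 1),
        "p90": round(deciles[8], 1),
    }

summary_by_type = {case_type: summarise(v) for case_type, v in baseline_minutes.items()}
all_cases = [m for v in baseline_minutes.values() for m in v]
overall = summarise(all_cases)

for case_type, stats in summary_by_type.items():
    print(case_type, stats)
print("overall", overall)
```

In this made-up data the overall mean sits between the two case types and describes neither, which is why a pilot result compared only against the overall average can mask a regression on complex cases.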

5. Measure the full workflow

A model benchmark is not a workflow benchmark.

The pilot should measure:

  • input preparation
  • retrieval
  • model processing
  • tool execution
  • human review
  • correction
  • escalation
  • downstream processing
  • exception handling
  • governance and audit work

A system that produces an answer in seconds may still make the workflow slower if review and repair increase.
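The arithmetic behind that point can be sketched as follows. The stage names loosely follow the list above, and every timing is a hypothetical figure chosen for illustration, not measured data.

```python
# Hypothetical per-case stage timings in minutes (illustrative, not measured data).
before = {
    "input preparation": 5.0,
    "manual drafting": 30.0,
    "human review": 10.0,
    "correction": 5.0,
    "downstream processing": 10.0,
}
after = {
    "input preparation": 8.0,
    "retrieval and model processing": 0.5,  # the step a demo usually shows
    "human review": 25.0,
    "correction": 18.0,
    "escalation": 4.0,
    "downstream processing": 10.0,
}

def end_to_end(stages):
    """Total minutes per case across every workflow stage."""
    return sum(stages.values())

before_total = end_to_end(before)
after_total = end_to_end(after)

print(f"before: {before_total} min, after: {after_total} min")
print("workflow slower after AI:", after_total > before_total)
```

Under these assumed numbers the drafting step shrinks from 30 minutes to half a minute, yet the end-to-end time per case rises because review, correction and escalation grow.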

6. Bridge pilot cost to production cost

Pilot economics are structurally favourable:

  • small volumes
  • motivated users
  • limited integration
  • manual support
  • subsidised vendor access
  • reduced controls
  • selected data
  • hidden engineering effort

Production introduces:

  • higher and less predictable volume
  • integration and observability
  • identity and access
  • evaluation
  • incident response
  • support
  • compliance
  • change management
  • data pipelines
  • vendor commitments
  • resilience
  • model drift
  • human oversight

Deloitte's build-versus-buy analysis shows how cost structure can change as volume and complexity scale.[^deloitte] Deloitte's own TCO model also excludes several cost categories, which is why a pilot should be costed against the site's full AI TCO Framework rather than a model invoice alone.
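One way to make the pilot-to-production bridge explicit is a small cost scenario model, sketched below. The `CostScenario` class, the cost categories chosen and every figure are assumptions for illustration; they are not taken from Deloitte's model or from any framework cited here.

```python
from dataclasses import dataclass, field

@dataclass
class CostScenario:
    """Hypothetical monthly cost scenario for one AI use case."""
    name: str
    monthly_cases: int
    variable_cost_per_case: float  # e.g. model usage billed per case
    monthly_fixed_costs: dict = field(default_factory=dict)

    def monthly_total(self) -> float:
        variable = self.monthly_cases * self.variable_cost_per_case
        return variable + sum(self.monthly_fixed_costs.values())

    def cost_per_case(self) -> float:
        return self.monthly_total() / self.monthly_cases

# Pilot: only the model invoice is visible (all figures invented).
pilot = CostScenario("pilot", monthly_cases=500, variable_cost_per_case=0.40)

# Production: the same unit price plus categories the pilot did not carry.
production = CostScenario(
    "production",
    monthly_cases=20_000,
    variable_cost_per_case=0.40,
    monthly_fixed_costs={
        "integration and observability": 6_000,
        "evaluation": 3_000,
        "support and incident response": 5_000,
        "compliance and audit": 2_500,
        "human oversight": 12_000,
    },
)

for scenario in (pilot, production):
    print(scenario.name, round(scenario.cost_per_case(), 3))
```

With these assumed inputs the per-case cost in production is several times the pilot figure even though the model price is unchanged, which is the gap a model invoice alone would hide.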

7. Include behavioural evidence

A pilot changes people as well as process.

Measure:

  • who adopts and who avoids
  • whether users over-trust outputs
  • whether experienced and inexperienced workers perform differently
  • whether people create shadow workflows
  • whether review becomes superficial
  • whether junior learning is displaced
  • whether people retain the ability to work without the system
  • whether managers redesign work or simply add AI on top

Gartner argues that behavioural outcomes deserve the same rigour as business and technology outcomes.[^gartner]

8. Prove capture, not just potential

A common claim is:

"AI saves ten minutes per task across 100,000 tasks."

That is capacity potential.

Value capture requires an explicit path:

  • Will headcount reduce?
  • Will output increase?
  • Will queues fall?
  • Will service improve?
  • Will people move to higher-value work?
  • Is there demand for the additional capacity?
  • Is the receiving process ready?
  • Is the benefit owner accountable?

If nobody can answer these questions, the benefit remains theoretical.
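The gap between capacity potential and captured value can be worked through numerically. The ten minutes and 100,000 tasks come from the claim quoted above; the capture paths and their shares are hypothetical assumptions that a real value-capture plan would have to evidence.

```python
MINUTES_SAVED_PER_TASK = 10  # from the quoted claim
TASKS = 100_000              # from the quoted claim

# Capacity potential: hours released on paper.
potential_hours = MINUTES_SAVED_PER_TASK * TASKS / 60

# Hypothetical capture plan: share of released hours with an explicit,
# owned path to a business outcome (assumed values for illustration).
capture_plan = {
    "queue reduction with an accountable owner": 0.25,
    "redeployment to work with confirmed demand": 0.10,
}

captured_hours = potential_hours * sum(capture_plan.values())
uncaptured_hours = potential_hours - captured_hours

print(f"potential:  {potential_hours:,.0f} hours")
print(f"captured:   {captured_hours:,.0f} hours")
print(f"uncaptured: {uncaptured_hours:,.0f} hours")
```

The headline claim corresponds to roughly 16,667 hours of potential. Under the assumed plan, about a third of that has a defined route to value; the remainder disperses into the working day unless someone owns it.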

9. Decision gates

Gate 0: permission to experiment

Required:

  • problem worth testing
  • owner
  • baseline plan
  • risk boundary
  • small budget
  • decision date

Gate 1: technical viability

Required:

  • integration feasible
  • data available
  • minimum performance met
  • no disqualifying risk

Decision:

  • stop
  • redesign
  • continue to controlled workflow test

Gate 2: operating viability

Required:

  • workflow fit
  • repeat adoption
  • review burden known
  • support model known
  • production cost range

Decision:

  • stop
  • redesign
  • limited production

Gate 3: value evidence

Required:

  • outcome movement
  • credible attribution
  • benefit owner
  • value-capture plan
  • acceptable cost per successful outcome

Decision:

  • scale
  • optimise
  • contain
  • stop

Gate 4: portfolio scale

Required:

  • comparison with competing investments
  • strategic fit
  • vendor and sovereignty assessment
  • capability and risk implications
  • funding source
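The gates above can be treated as an ordered checklist. The sketch below encodes the required evidence for each gate as listed in this section and reports the first gate whose requirements are not yet met. The `first_unmet_gate` helper and the idea of tracking evidence as a set of strings are illustrative assumptions, not a prescribed tool.

```python
# Required evidence per gate, copied from the lists in this section.
GATES = {
    "Gate 0: permission to experiment": [
        "problem worth testing", "owner", "baseline plan",
        "risk boundary", "small budget", "decision date",
    ],
    "Gate 1: technical viability": [
        "integration feasible", "data available",
        "minimum performance met", "no disqualifying risk",
    ],
    "Gate 2: operating viability": [
        "workflow fit", "repeat adoption", "review burden known",
        "support model known", "production cost range",
    ],
    "Gate 3: value evidence": [
        "outcome movement", "credible attribution", "benefit owner",
        "value-capture plan", "acceptable cost per successful outcome",
    ],
    "Gate 4: portfolio scale": [
        "comparison with competing investments", "strategic fit",
        "vendor and sovereignty assessment",
        "capability and risk implications", "funding source",
    ],
}

def first_unmet_gate(evidence):
    """Return (gate, missing items) for the first gate not fully evidenced."""
    for gate, required in GATES.items():  # dicts preserve insertion order
        missing = [item for item in required if item not in evidence]
        if missing:
            return gate, missing
    return None, []

# Hypothetical pilot that has cleared Gates 0 and 1 only.
evidence = set(GATES["Gate 0: permission to experiment"])
evidence |= set(GATES["Gate 1: technical viability"])

gate, missing = first_unmet_gate(evidence)
print(gate)
print(missing)
```

A pilot in this state has demonstrated feasibility but has nothing yet to show for operating viability, so a request for production funding would be premature under this scheme.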

10. What should count as failure

Failure includes:

  • no outcome movement
  • value below hurdle rate
  • unacceptable review burden
  • cost scaling faster than benefit
  • adoption without workflow impact
  • material quality disparity
  • inability to capture released capacity
  • vendor dependency beyond risk tolerance
  • evidence too weak for the next funding stage

Stopping can be a successful governance outcome.

11. A worked illustration

A contract-review assistant reduces first-pass review from 90 minutes to 25 minutes.

A conventional pilot declares success.

A proof of value asks:

  • Does legal review time fall after correction and escalation?
  • Are material clauses missed?
  • Does the workflow handle complex and multilingual contracts?
  • Does faster review increase completed deals or merely move the queue?
  • Can lawyers redeploy time?
  • What is the production cost including retrieval, audit and human review?
  • Who owns the benefit?
  • What happens at ten times the volume?
  • Is the model allowed to process the data in required jurisdictions?

The initial speed result remains useful. It is one input, not the conclusion.
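To show how the first question might change the headline, the sketch below nets correction and escalation against the first-pass saving. The 90 and 25 minute figures come from the illustration; the correction time, escalation rate and escalation cost are hypothetical assumptions.

```python
BASELINE_FIRST_PASS_MIN = 90  # from the illustration
ASSISTED_FIRST_PASS_MIN = 25  # from the illustration

# Hypothetical downstream effort per contract (assumed for illustration).
avg_correction_min = 20     # lawyer time spent fixing assistant output
escalation_rate = 0.15      # share of contracts escalated to a senior lawyer
escalation_cost_min = 60    # extra minutes per escalated contract

headline_saving = BASELINE_FIRST_PASS_MIN - ASSISTED_FIRST_PASS_MIN

# Expected minutes per contract once correction and escalation are included.
net_minutes = (
    ASSISTED_FIRST_PASS_MIN
    + avg_correction_min
    + escalation_rate * escalation_cost_min
)
net_saving = BASELINE_FIRST_PASS_MIN - net_minutes

print(f"headline saving: {headline_saving} min per contract")
print(f"net saving:      {net_saving:.0f} min per contract")
```

Under these assumptions the saving per contract is still positive but a little over half the headline figure, and this is before production cost, value capture or jurisdictional constraints are considered.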

Conclusion

A proof of concept demonstrates possibility.

A proof of value demonstrates enough technical, operating and economic evidence to justify the next decision.

The distinction is not administrative. It is the difference between funding learning and funding hope.

Downloadable

AI Proof of Value Scorecard - A structured template for evaluating AI pilots across feasibility, performance, adoption, cost, risk and value evidence.

Sources

[^gartner]: Gartner, "For AI Value, Focus on Your Use Cases", https://www.gartner.com/en/articles/ai-value. Accessed June 2026.

[^deloitte]: Deloitte, "The pivot to tokenomics: Navigating AI's new spend dynamics", pp. 13-19 and pp. 25-26. The report models infrastructure sourcing decisions and explicitly lists TCO exclusions.

[^nist]: NIST AI Risk Management Framework, https://www.nist.gov/itl/ai-risk-management-framework. Emphasises explicit human roles, responsibilities and oversight across AI systems.

[^oecd]: OECD, "The effects of generative AI on productivity, innovation and entrepreneurship", https://www.oecd.org/en/publications/the-effects-of-generative-ai-on-productivity-innovation-and-entrepreneurship_b21df222-en.html. Reviews task-specific productivity evidence and variation by context.