Skip to content

Glossary entry

Inference Optimisation

The practice of reducing the cost, latency, or resource intensity of AI inference without unacceptable loss of quality.

Why it matters

Inference optimisation matters because recurring production cost, not one-off model development cost, often becomes the main economic bottleneck in enterprise AI.

Common techniques include routing smaller models to simpler tasks, caching, batching, and improving prompt efficiency. For the operating discipline behind that work, see FinOps & AI.