Latency
The delay between sending a request to an AI system and receiving the response.
Definition
Latency is the time elapsed between submitting a prompt and receiving the complete response. It depends on model size, prompt and response length, server load, and network conditions; for streaming APIs, time to first token is often measured separately from total completion time.
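To make the definition concrete, here is a minimal Python sketch of measuring end-to-end latency by timing a single request. The `fake_generate` function is a hypothetical stand-in for a real model call; any callable works the same way.

```python
import time

def timed_call(fn, *args, **kwargs):
    # Measure wall-clock latency of a single request.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def fake_generate(prompt: str) -> str:
    time.sleep(0.3)  # simulate model inference plus network time
    return f"echo: {prompt}"

reply, latency_s = timed_call(fake_generate, "What is latency?")
print(f"latency: {latency_s:.3f}s")
```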
Low latency is critical for interactive applications where users expect near-instant responses; higher latency is acceptable for background processing.
Why It Matters
Latency directly impacts user experience. Applications requiring real-time interaction must choose models and infrastructure that deliver acceptable response times.
Understanding the tradeoff between speed and quality helps teams decide when a smaller, faster model is the better fit and when a slower, more capable one is worth the wait.
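One common pattern is to route requests by latency budget. The sketch below assumes a hypothetical fast/slow model pair (the names `small-chat-model` and `large-writing-model` are placeholders) and picks between them based on whether the request is interactive.

```python
# Hypothetical model names; any fast/slow pair works the same way.
FAST_MODEL = "small-chat-model"
QUALITY_MODEL = "large-writing-model"

def pick_model(interactive: bool, latency_budget_s: float) -> str:
    # Interactive traffic with a tight budget goes to the fast model;
    # everything else can afford the slower, higher-quality one.
    if interactive and latency_budget_s <= 2.0:
        return FAST_MODEL
    return QUALITY_MODEL

print(pick_model(interactive=True, latency_budget_s=1.5))    # small-chat-model
print(pick_model(interactive=False, latency_budget_s=30.0))  # large-writing-model
```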
Examples in Practice
A conversational AI uses a smaller, faster model to keep response latency under 2 seconds for natural dialogue.
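One simple way to hold such a budget is to enforce a timeout and fall back to a canned reply when the model runs long. This sketch uses Python's `asyncio.wait_for`; `fake_model_call` is a hypothetical stand-in for a real API call.

```python
import asyncio

async def fake_model_call(prompt: str) -> str:
    # Placeholder for a real model API call; sleeps to simulate work.
    await asyncio.sleep(0.5)
    return f"response to: {prompt}"

async def respond_within_budget(prompt: str, budget_s: float = 2.0) -> str:
    # Enforce the latency budget: fall back to a canned reply on timeout.
    try:
        return await asyncio.wait_for(fake_model_call(prompt), timeout=budget_s)
    except asyncio.TimeoutError:
        return "Sorry, that's taking longer than expected."

print(asyncio.run(respond_within_budget("hello")))
```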
A content generation system accepts higher latency because quality matters more than speed for long-form output.
An engineering team caches responses to common queries, so frequently asked questions are answered from the cache instead of requiring a fresh model call.
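A minimal version of that caching pattern, using Python's `functools.lru_cache` with a hypothetical `expensive_model_call` standing in for the model: the first query pays full model latency, and repeats of the same query return from memory.

```python
import time
from functools import lru_cache

def expensive_model_call(query: str) -> str:
    time.sleep(1.0)  # simulate a slow model call
    return f"answer to: {query}"

@lru_cache(maxsize=1024)
def cached_answer(normalized_query: str) -> str:
    # First call pays full model latency; repeats are served from memory.
    return expensive_model_call(normalized_query)

# Normalizing queries (lowercasing, trimming) raises the hit rate for FAQs.
print(cached_answer("what are your hours?"))  # slow: hits the model
print(cached_answer("what are your hours?"))  # fast: cache hit
```

In practice the key design choice is the cache key: normalizing or deduplicating near-identical queries trades a little matching logic for a much higher hit rate.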