Quantization
Reducing the numerical precision of AI model weights to shrink model size and speed up inference with minimal quality loss.
Definition
Quantization converts AI model weights from high-precision numbers (typically 32-bit floats) to lower-precision formats (such as 8-bit integers), dramatically reducing memory requirements and computational cost.
This technique enables powerful models to run on consumer hardware, mobile devices, and edge computing environments.
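To make the mapping concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization in NumPy. The scaling scheme, function names, and tensor sizes are illustrative assumptions, not any specific library's implementation.

import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights onto the int8 range [-127, 127] with one scale per tensor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 values."""
    return q.astype(np.float32) * scale

weights = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(f"original:  {weights.nbytes / 1e6:.1f} MB")  # ~4.2 MB at 32 bits per weight
print(f"quantized: {q.nbytes / 1e6:.1f} MB")        # ~1.0 MB at 8 bits per weight
print(f"mean abs error: {np.abs(weights - restored).mean():.5f}")

Storing 8 bits per weight instead of 32 cuts memory roughly fourfold, at the cost of a small rounding error per weight.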
Why It Matters
Quantization makes AI deployment practical and affordable. Models that previously required expensive GPUs can run locally on laptops or phones.
Understanding the tradeoffs involved, chiefly memory and compute savings versus potential accuracy loss, helps when choosing between cloud APIs and local deployment.
Examples in Practice
A 70-billion parameter model is quantized to run on a gaming laptop, enabling local AI without cloud costs.
A mobile app uses a quantized model for on-device processing, avoiding network latency and privacy concerns.
A developer compares the quality of a quantized model against the original and finds that 4-bit quantization barely affects the results, as sketched below.
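The following sketch shows one way such a comparison might look: quantize a toy layer's weights to 4 bits and measure how much its outputs drift from the full-precision version. The per-row symmetric scheme, layer shape, and error metric are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((4096, 4096)).astype(np.float32)
x = rng.standard_normal((8, 4096)).astype(np.float32)  # small batch of inputs

# Quantize to 4-bit integer values in [-7, 7], one scale per output row.
scales = np.abs(weights).max(axis=1, keepdims=True) / 7.0
q = np.clip(np.round(weights / scales), -7, 7)
restored = (q * scales).astype(np.float32)

# Compare the layer outputs produced by original vs. quantized weights.
y_full = x @ weights.T
y_quant = x @ restored.T
rel_error = np.linalg.norm(y_full - y_quant) / np.linalg.norm(y_full)
print(f"relative output error: {rel_error:.2%}")

Measuring error on the layer's outputs, rather than on the weights alone, is closer to what matters in practice: whether the quantized model still produces nearly the same results.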