Model Quantization
Reducing AI model size and computational requirements by using lower precision.
Definition
Model Quantization is a technique that reduces the precision of numbers used in neural network weights and activations, typically from 32-bit floating point to 8-bit integers or even lower. This significantly reduces model size and speeds up inference.
Quantization can be applied during training (quantization-aware training) or after training (post-training quantization), with trade-offs between ease of implementation and accuracy preservation.
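To make the idea concrete, here is a minimal sketch of post-training affine quantization of a single weight tensor to 8-bit integers, using only NumPy. The function names, the per-tensor (rather than per-channel) scale, and the random "weights" are illustrative assumptions, not the API of any particular framework.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float32 values to uint8 using a per-tensor scale and zero point."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = max((x_max - x_min) / 255.0, 1e-8)       # step size between integer levels
    zero_point = int(round(-x_min / scale))          # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float values from the integer representation."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)   # stand-in for one layer's weights
q, scale, zp = quantize_int8(weights)
error = np.abs(weights - dequantize(q, scale, zp)).max()
print(f"max rounding error: {error:.5f}")            # small relative to the weight range
```

The integers take a quarter of the memory of 32-bit floats, and the only information kept alongside them is the scale and zero point; the rounding error introduced here is the accuracy cost that quantization-aware training tries to compensate for.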
Why It Matters
Quantized models can run on resource-constrained devices like smartphones and edge hardware, enabling AI deployment where full-precision models would be too large or slow.
Businesses use quantization to reduce cloud computing costs, improve response times, and enable offline AI capabilities in mobile and embedded applications.
Examples in Practice
A language model quantized to 4-bit precision might shrink from 30GB to 8GB, enabling it to run on consumer hardware. Image recognition models quantized for mobile devices can classify images in real-time without cloud connectivity.
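A back-of-envelope calculation shows where numbers like these come from. The parameter count and the extra fraction of a bit per weight for quantization scales below are illustrative assumptions chosen to match the example, not measured values.

```python
def model_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage needed for the weights alone, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

params = 15e9                                            # roughly a 15B-parameter model
print(f"16-bit: {model_size_gb(params, 16):.1f} GB")     # ~30 GB at full half precision
print(f" 4-bit: {model_size_gb(params, 4.5):.1f} GB")    # ~8 GB, incl. per-block scale overhead
```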
Tools such as llama.cpp, using the GGUF model format, allow large language models to run on laptops through aggressive quantization with minimal quality loss.