
Understanding Quantization in Deep Learning

Mar 13th, 2025 · 15 min read


Understanding Memory Footprint

When working with deep learning models, understanding memory usage is crucial. The total memory consumption can be broken down into two main components:

Total memory     = Training memory + Inference memory
Training memory  = Weights + Gradient memory + Optimizer memory + Activations
Inference memory = Weights + Activation memory (forward pass)

Example: Calculating Model Memory

Let's calculate memory for a simple neural network:

  • Input layer: 784 neurons (28x28 image)
  • Hidden layer: 512 neurons
  • Output layer: 10 neurons (digits 0-9)

Weights memory:
  • Layer 1: 784 × 512 = 401,408 parameters
  • Layer 2: 512 × 10 = 5,120 parameters
  • Total parameters: 406,528

Using float32 (4 bytes):
  • Weights: 406,528 × 4 ≈ 1.6 MB
  • Gradients: 1.6 MB
  • Optimizer (Adam, 2 states): 3.2 MB
  • Activations (batch size 32): ~0.2 MB

Total Training Memory: ~6.6 MB
Inference Memory: ~1.8 MB
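
If you'd like to replay the arithmetic, here is a small Python sketch of the same calculation. The layer sizes and the two-state Adam assumption come straight from the example above; the activation term is a rough batch-size × total-neurons estimate, not a framework measurement.

def mlp_memory_mb(layers=(784, 512, 10), batch_size=32, dtype_bytes=4):
    # Weight matrices between consecutive layers (biases ignored for simplicity)
    params = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))

    weights = params * dtype_bytes                        # 406,528 × 4 bytes
    gradients = weights                                   # one gradient per weight
    optimizer = 2 * weights                               # Adam keeps two moments per weight
    activations = batch_size * sum(layers) * dtype_bytes  # rough forward-pass estimate

    mb = 1e6  # decimal megabytes, matching the rounding used above
    training = (weights + gradients + optimizer + activations) / mb
    inference = (weights + activations) / mb
    return round(training, 1), round(inference, 1)

print(mlp_memory_mb())  # (6.7, 1.8); the ~6.6 MB above comes from summing the rounded terms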

Memory Optimization Techniques

Let's explore key techniques for reducing memory usage in deep learning models, starting with an important but often overlooked approach:

1. Gradient Checkpointing

Gradient checkpointing is a powerful technique that trades computation time for memory savings. Instead of storing all activations in memory during the forward pass, we do the following:

Strategy:

  1. Store activations at checkpoints only
  2. Recompute intermediate activations when needed
  3. Free memory after gradients are computed

Trade-offs:

  • Memory: activation storage drops substantially, since only the checkpointed activations are kept.
  • Compute: the backward pass must recompute the discarded activations, adding roughly one extra forward pass of work.
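
If you want to try this, here is a minimal PyTorch sketch using torch.utils.checkpoint (assuming a recent PyTorch release); the three-block MLP is just an illustrative stand-in, not a prescribed architecture.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self, dim=512, depth=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            # Only block inputs (the checkpoints) are stored; activations inside
            # each block are recomputed during the backward pass.
            x = checkpoint(block, x, use_reentrant=False)
        return x

x = torch.randn(32, 512, requires_grad=True)
CheckpointedMLP()(x).sum().backward()  # recomputes, then frees, block activations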

Precision Formats: Basics and Lookup Table

Numeric Formats Comparison:

Format   Bytes   Precision     Common Use Case
─────────────────────────────────────────────────────
FP64     8       15-17 dec     Scientific computing (rare in DL)
FP32     4       6-9 dec       Training (standard)
FP16     2       3-4 dec       Inference / Training
INT8     1       256 levels    Quantized inference
INT4     0.5     16 levels     Extreme compression
INT1     0.125   2 levels      Experimental (e.g., Blackwell)
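
As a quick sanity check on the byte counts, NumPy reports the item size of each native format (sub-byte formats like INT4 and INT1 have no native dtype and are stored packed, several values per byte):

import numpy as np

for name in ("float64", "float32", "float16", "int8"):
    print(f"{name}: {np.dtype(name).itemsize} byte(s)")
# float64: 8, float32: 4, float16: 2, int8: 1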

Understanding INT4 Range

When we say INT4 (4-bit integer) has a range of -8 to 7, we're describing the minimum and maximum values that can be represented using 4 bits in signed integer format. Let's break this down:

1. 4 Bits = 4 Binary Digits
   • Each bit can be either 0 or 1
   • So, 4 bits can represent 2⁴ = 16 unique values

2. Signed vs. Unsigned
   • Unsigned INT4: represents non-negative values only
     Range: 0 to 15
   • Signed INT4 (common in ML quantization): uses two's complement representation
     Range: -8 to 7

Binary Representation of INT4 (Signed):

Binary   Decimal
─────────────────
1000     -8  (most negative)
1111     -1
0000      0
0001      1
0111      7  (most positive)
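
To see all 16 signed values at once, here is a tiny Python sketch that interprets each 4-bit pattern as a two's complement number:

# Interpret every 4-bit pattern as a signed (two's complement) INT4 value.
for bits in range(16):
    value = bits - 16 if bits >= 8 else bits  # patterns 1000-1111 are negative
    print(f"{bits:04b} -> {value}")
# 0000 -> 0 ... 0111 -> 7, 1000 -> -8 ... 1111 -> -1
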
Key Points:

  • While reducing precision can significantly decrease memory usage, it can also introduce numerical errors.
  • Always validate model performance after precision reduction.

2. Understanding Quantization

Figure: Quantization Impact Diagram. Quantization can be applied to four key areas: weights, activations, training time, and inference time.

From the figure, we now understand that:

  1. Quantization can be applied to weights and activations.
  2. It can also be applied at inference time and at training time.

Let's look at each in turn.

1. Quantizing Weights (Static and Stable)

Weights are ideal candidates for quantization because they change less frequently. Once trained, weights remain constant unless the model is fine-tuned, making them perfect for one-time quantization.

Original Weights (FP32): [0.45, -0.23, 0.89, -0.75]
Range of weights: min = -0.75, max = 0.89

Quantization Process:
  • The INT8 range is from -128 to 127
  • Calculate the scaling factor (S):
    S = (Max - Min) / 255 = (0.89 - (-0.75)) / 255 ≈ 0.00643
  • Quantize each weight using:
    Q = round((Original - Min) / S) - 128

Original Weight   Quantized Value (INT8)
────────────────────────────────────────
 0.45              59
-0.23             -47
 0.89             127
-0.75            -128
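
Here is a minimal NumPy sketch of this asymmetric (min/max) INT8 scheme using the same four weights; the helper names are mine, not from any particular library.

import numpy as np

def quantize_int8(weights):
    # Asymmetric min/max quantization of FP32 weights to signed INT8.
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0                  # spread the range over 256 levels
    q = np.round((weights - w_min) / scale) - 128    # shift into [-128, 127]
    return q.astype(np.int8), scale, w_min

def dequantize_int8(q, scale, w_min):
    # Map INT8 codes back to approximate FP32 values.
    return (q.astype(np.float32) + 128) * scale + w_min

w = np.array([0.45, -0.23, 0.89, -0.75], dtype=np.float32)
q, scale, w_min = quantize_int8(w)
print(q)                                  # [  59  -47  127 -128]
print(dequantize_int8(q, scale, w_min))   # close to the original weights
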
Key Benefits:

  • Weights only need to be quantized once, offline, after training.
  • INT8 weights take 4x less memory than FP32 (8x for INT4).
  • The stable value range makes calibration straightforward and the accuracy impact predictable.

2. Quantizing Activations (Dynamic Values)

Unlike weights, activations change with every inference because they depend on the input data. This makes activation quantization more challenging and requires careful consideration of the dynamic range.

Example: ReLU Activation Values for Different Inputs

Input Image 1 (digit 7): [0.0, 4.2, 0.0, 3.1, 0.0]   Range: 0.0 to 4.2
Input Image 2 (digit 4): [2.1, 0.0, 5.7, 0.0, 1.9]   Range: 0.0 to 5.7
Input Image 3 (digit 1): [0.0, 0.0, 7.2, 0.0, 0.0]   Range: 0.0 to 7.2

Observation:
  • Activation ranges vary significantly between inputs
  • Need dynamic scaling for effective quantization
  • Common to use running statistics for range estimation
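
One common way to handle this is to track running statistics of the observed range during calibration. Here is a minimal sketch using an exponential moving average of the min/max; the decay value and class name are illustrative assumptions, not a standard API.

import numpy as np

class RangeTracker:
    """Track an exponential moving average of activation min/max for calibration."""
    def __init__(self, decay=0.9):
        self.decay = decay
        self.min = None
        self.max = None

    def update(self, activations):
        lo, hi = float(activations.min()), float(activations.max())
        if self.min is None:
            self.min, self.max = lo, hi
        else:
            self.min = self.decay * self.min + (1 - self.decay) * lo
            self.max = self.decay * self.max + (1 - self.decay) * hi

    def scale(self, levels=256):
        return (self.max - self.min) / (levels - 1)

tracker = RangeTracker()
for acts in ([0.0, 4.2, 0.0, 3.1, 0.0],
             [2.1, 0.0, 5.7, 0.0, 1.9],
             [0.0, 0.0, 7.2, 0.0, 0.0]):
    tracker.update(np.array(acts))
print(tracker.min, tracker.max, tracker.scale())  # smoothed range and INT8 step size
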
Challenges with Activation Quantization:

  • The value range depends on the input, so a single fixed scale may clip values or waste resolution.
  • Calibration data (or running statistics collected at runtime) is needed to estimate the range.
  • Occasional outliers can stretch the range and squash most values into a few levels.
  • Scaling must happen on the fly at inference, which adds a small runtime cost.

Inference Time Quantization

Inference time quantization focuses on serving the model in low precision to accelerate computation. Modern approaches have moved beyond simple quantization to mixed precision strategies, which offer a better balance between performance and accuracy.

In a typical mixed precision setup:

  • Weights are stored in low precision (INT8 or INT4).
  • Activations are kept in FP16 or dynamically quantized to INT8.
  • Accumulations inside matrix multiplies use a wider type (INT32 or FP32) to limit rounding error.
  • Numerically sensitive pieces, such as the first and last layers, normalization, and softmax, often stay in higher precision.
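
As one concrete example of inference-time quantization, PyTorch's dynamic quantization stores Linear weights in INT8 and quantizes activations on the fly from their per-batch range. A minimal sketch (the toy model is purely illustrative):

import torch
import torch.nn as nn

# Toy FP32 model standing in for a trained network.
model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Dynamic quantization: Linear weights become INT8; activations are
# quantized at runtime using their observed range.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 784)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 10])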

Quantization-Aware Training (QAT)

QAT is a training-time technique designed to maintain high accuracy when models are deployed with low-bit quantization (like INT8, INT4). Unlike post-training quantization, QAT allows the model to adapt to quantization effects during the training process itself.

How QAT Works

Figure: Quantization-Aware Training Process. The QAT pipeline uses fake quantization during training and real quantization for deployment.

The QAT process involves these key steps:

1. Simulate Quantization During Training

During the forward pass, weights and activations are "fake quantized" to simulate deployment conditions. This involves rounding and clipping values based on the target precision (like INT8), helping the model learn to work within quantization constraints.

Simple Example of How Learning Happens:

  1. Original Weight (FP32): W = 0.45
  2. Fake Quantized (during the forward pass, targeting INT8):
     • Using a scale of 0.1, the quantized value is Q = round(0.45 / 0.1) = 4 (rounding half to even)
     • Dequantized back for calculations: Q_dequantized = 4 × 0.1 = 0.4
  3. Forward Pass Calculation (with quantization noise):
     • The model predicts an output based on 0.4 and computes a loss
  4. Loss Function Result: Loss = 0.2
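
Here is a minimal Python sketch of that quantize-dequantize ("fake quantization") step. The fixed scale of 0.1 matches the toy example above; real QAT derives the scale from observed value ranges.

import numpy as np

def fake_quantize(w, scale=0.1, qmin=-128, qmax=127):
    # Round to the INT8 grid, clip, and map straight back to float.
    q = np.clip(np.round(w / scale), qmin, qmax)
    return q * scale  # the value the forward pass actually sees

print(fake_quantize(0.45))  # 0.4, i.e. ~0.05 of quantization noise on this weight
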
2. Backpropagate Using High Precision

The backward pass maintains high-precision gradients (typically FP32) to ensure accurate learning. This dual approach allows stable gradient updates while still preparing the model for quantized deployment.

Example: High-Precision Gradient Calculation

Given from the previous step:
  • Original weight (W) = 0.45
  • Quantized forward value = 0.4
  • Loss = 0.2

Backward Pass (in FP32):
  • Gradient = ∂Loss/∂W = -0.15
  • Learning rate (η) = 0.01

Weight Update:
  W_new = W - η × gradient
  W_new = 0.45 - 0.01 × (-0.15)
  W_new = 0.4515 (kept in FP32 during training)
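
In practice, the gradient is usually passed straight through the rounding operation (the straight-through estimator), so the FP32 master weight keeps receiving full-precision updates. A minimal PyTorch sketch of that idea, not a production QAT implementation:

import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale=0.1):
        # Forward: quantize-dequantize so the network sees quantization noise.
        return torch.round(w / scale).clamp(-128, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Backward: treat rounding as the identity (straight-through estimator),
        # so the FP32 master weight gets the full-precision gradient.
        return grad_output, None

w = torch.tensor(0.45, requires_grad=True)
loss = (FakeQuantSTE.apply(w) - 0.1) ** 2  # toy loss on the fake-quantized weight
loss.backward()
print(w.grad)  # gradient reaches the FP32 weight unchanged by the rounding step
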
Key Insights:

  • The forward pass sees quantization noise, so the model learns weights that tolerate it.
  • A full-precision (FP32) copy of the weights is kept and updated throughout training.
  • Only at deployment are the weights actually converted to the low-bit format.

Common Challenges

When implementing QAT, teams typically face several challenges:

  • Training is slower and more complex than post-training quantization, since fake quantization runs in every forward pass.
  • Some layers (first and last layers, normalization, softmax) are more sensitive and may need to stay in higher precision.
  • Hyperparameters such as the learning rate and the point at which quantization is enabled often need retuning.
  • The fake-quantization setup must match the target hardware's arithmetic, or deployed accuracy can differ from what was measured during training.

What's Next?

In our next article, we'll explore advanced memory-efficient techniques like LoRA (Low-Rank Adaptation) and other parameter-efficient fine-tuning methods that are revolutionizing how we train large language models.

Did you find this article helpful? Have questions about implementing these techniques? I'd love to hear your thoughts and experiences in the comments below! Your feedback helps make these explanations better for everyone.