
Understanding Quantization in Deep Learning

Mar 13th, 2025 · 15 min read


Understanding Memory Footprint

When working with deep learning models, understanding memory usage is crucial. The total memory consumption can be broken down into two main components:

Total memory     = Training memory + Inference memory
Training memory  = Weights + Gradient memory + Optimizer memory + Activations
Inference memory = Weights + Activation memory (forward pass)

Example: Calculating Model Memory

Let's calculate memory for a simple neural network:

  • Input layer: 784 neurons (28x28 image)
  • Hidden layer: 512 neurons
  • Output layer: 10 neurons (digits 0-9)

Weights memory:
  • Layer 1: 784 × 512 = 401,408 parameters
  • Layer 2: 512 × 10 = 5,120 parameters
  • Total parameters: 406,528

Using float32 (4 bytes):
  • Weights: 406,528 × 4 ≈ 1.6 MB
  • Gradients: 1.6 MB
  • Optimizer (Adam, 2 states): 3.2 MB
  • Activations (batch size 32): ~0.2 MB

Total Training Memory: ~6.6 MB
Inference Memory: ~1.8 MB
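
If you'd like to replay the arithmetic, here is a small Python sketch of the same calculation. The layer sizes and the two-state Adam assumption come straight from the example above; the activation term is a rough batch-size × total-neurons estimate, not a framework measurement.

def mlp_memory_mb(layers=(784, 512, 10), batch_size=32, dtype_bytes=4):
    # Weight matrices between consecutive layers (biases ignored for simplicity)
    params = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))

    weights = params * dtype_bytes                        # 406,528 × 4 bytes
    gradients = weights                                   # one gradient per weight
    optimizer = 2 * weights                               # Adam keeps two moments per weight
    activations = batch_size * sum(layers) * dtype_bytes  # rough forward-pass estimate

    mb = 1e6  # decimal megabytes, matching the rounding used above
    training = (weights + gradients + optimizer + activations) / mb
    inference = (weights + activations) / mb
    return round(training, 1), round(inference, 1)

print(mlp_memory_mb())  # (6.7, 1.8); the ~6.6 MB above comes from summing the rounded terms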

Memory Optimization Techniques

Let's explore key techniques for reducing memory usage in deep learning models, starting with an important but often overlooked approach:

1. Gradient Checkpointing

Gradient checkpointing is a powerful technique that trades computation time for memory savings. Instead of storing all activations in memory during the forward pass, we do the following:

Strategy:

  1. Store activations at checkpoints only
  2. Recompute intermediate activations when needed
  3. Free memory after gradients are computed

Trade-offs:

  • Memory: activation storage drops substantially, since only the checkpointed activations are kept.
  • Compute: the backward pass must recompute the discarded activations, adding roughly one extra forward pass of work.
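
If you want to try this, here is a minimal PyTorch sketch using torch.utils.checkpoint (assuming a recent PyTorch release); the three-block MLP is just an illustrative stand-in, not a prescribed architecture.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self, dim=512, depth=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            # Only block inputs (the checkpoints) are stored; activations inside
            # each block are recomputed during the backward pass.
            x = checkpoint(block, x, use_reentrant=False)
        return x

x = torch.randn(32, 512, requires_grad=True)
CheckpointedMLP()(x).sum().backward()  # recomputes, then frees, block activations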

Precision Formats: Basics and Lookup Table

Numeric Formats Comparison:

Format   Bytes   Precision     Common Use Case
─────────────────────────────────────────────────────
FP64     8       15-17 dec     Scientific computing (rare in DL)
FP32     4       6-9 dec       Training (standard)
FP16     2       3-4 dec       Inference / Training
INT8     1       256 levels    Quantized inference
INT4     0.5     16 levels     Extreme compression
INT1     0.125   2 levels      Experimental (e.g., Blackwell)
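
As a quick sanity check on the byte counts, NumPy reports the item size of each native format (sub-byte formats like INT4 and INT1 have no native dtype and are stored packed, several values per byte):

import numpy as np

for name in ("float64", "float32", "float16", "int8"):
    print(f"{name}: {np.dtype(name).itemsize} byte(s)")
# float64: 8, float32: 4, float16: 2, int8: 1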

Understanding INT4 Range

When we say INT4 (4-bit integer) has a range of -8 to 7, we're describing the minimum and maximum values that can be represented using 4 bits in signed integer format. Let's break this down:

1. 4 Bits = 4 Binary Digits
   • Each bit can be either 0 or 1
   • So, 4 bits can represent 2⁴ = 16 unique values

2. Signed vs. Unsigned
   • Unsigned INT4: represents non-negative values only
     Range: 0 to 15
   • Signed INT4 (common in ML quantization): uses two's complement representation
     Range: -8 to 7

Binary Representation of INT4 (Signed):

Binary   Decimal
─────────────────
1000     -8  (most negative)
1111     -1
0000      0
0001      1
0111      7  (most positive)
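
To see all 16 signed values at once, here is a tiny Python sketch that interprets each 4-bit pattern as a two's complement number:

# Interpret every 4-bit pattern as a signed (two's complement) INT4 value.
for bits in range(16):
    value = bits - 16 if bits >= 8 else bits  # patterns 1000-1111 are negative
    print(f"{bits:04b} -> {value}")
# 0000 -> 0 ... 0111 -> 7, 1000 -> -8 ... 1111 -> -1
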
Key Points:

  • While reducing precision can significantly decrease memory usage, it can also introduce numerical errors.
  • Always validate model performance after precision reduction.

2. Understanding Quantization

Figure: Quantization Impact Diagram. Quantization can be applied to four key areas: weights, activations, training time, and inference time.

From the figure, we now understand that:

  1. Quantization can be applied to weights and activations.
  2. It can also be applied at inference time and at training time.

Let's look at each in turn.

1. Quantizing Weights (Static and Stable)

Weights are ideal candidates for quantization because they change less frequently. Once trained, weights remain constant unless the model is fine-tuned, making them perfect for one-time quantization.

Original Weights (FP32): [0.45, -0.23, 0.89, -0.75]
Range of weights: min = -0.75, max = 0.89

Quantization Process:
  • The INT8 range is from -128 to 127
  • Calculate the scaling factor (S):
    S = (Max - Min) / 255 = (0.89 - (-0.75)) / 255 ≈ 0.00643
  • Quantize each weight using:
    Q = round((Original - Min) / S) - 128

Original Weight   Quantized Value (INT8)
────────────────────────────────────────
 0.45              59
-0.23             -47
 0.89             127
-0.75            -128
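
Here is a minimal NumPy sketch of this asymmetric (min/max) INT8 scheme using the same four weights; the helper names are mine, not from any particular library.

import numpy as np

def quantize_int8(weights):
    # Asymmetric min/max quantization of FP32 weights to signed INT8.
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0                  # spread the range over 256 levels
    q = np.round((weights - w_min) / scale) - 128    # shift into [-128, 127]
    return q.astype(np.int8), scale, w_min

def dequantize_int8(q, scale, w_min):
    # Map INT8 codes back to approximate FP32 values.
    return (q.astype(np.float32) + 128) * scale + w_min

w = np.array([0.45, -0.23, 0.89, -0.75], dtype=np.float32)
q, scale, w_min = quantize_int8(w)
print(q)                                  # [  59  -47  127 -128]
print(dequantize_int8(q, scale, w_min))   # close to the original weights
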
Key Benefits:

  • Weights only need to be quantized once, offline, after training.
  • INT8 weights take 4x less memory than FP32 (8x for INT4).
  • The stable value range makes calibration straightforward and the accuracy impact predictable.

2. Quantizing Activations (Dynamic Values)

Unlike weights, activations change with every inference because they depend on the input data. This makes activation quantization more challenging and requires careful consideration of the dynamic range.

Example: ReLU Activation Values for Different Inputs

Input Image 1 (digit 7): [0.0, 4.2, 0.0, 3.1, 0.0]   Range: 0.0 to 4.2
Input Image 2 (digit 4): [2.1, 0.0, 5.7, 0.0, 1.9]   Range: 0.0 to 5.7
Input Image 3 (digit 1): [0.0, 0.0, 7.2, 0.0, 0.0]   Range: 0.0 to 7.2

Observation:
  • Activation ranges vary significantly between inputs
  • Need dynamic scaling for effective quantization
  • Common to use running statistics for range estimation
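
One common way to handle this is to track running statistics of the observed range during calibration. Here is a minimal sketch using an exponential moving average of the min/max; the decay value and class name are illustrative assumptions, not a standard API.

import numpy as np

class RangeTracker:
    """Track an exponential moving average of activation min/max for calibration."""
    def __init__(self, decay=0.9):
        self.decay = decay
        self.min = None
        self.max = None

    def update(self, activations):
        lo, hi = float(activations.min()), float(activations.max())
        if self.min is None:
            self.min, self.max = lo, hi
        else:
            self.min = self.decay * self.min + (1 - self.decay) * lo
            self.max = self.decay * self.max + (1 - self.decay) * hi

    def scale(self, levels=256):
        return (self.max - self.min) / (levels - 1)

tracker = RangeTracker()
for acts in ([0.0, 4.2, 0.0, 3.1, 0.0],
             [2.1, 0.0, 5.7, 0.0, 1.9],
             [0.0, 0.0, 7.2, 0.0, 0.0]):
    tracker.update(np.array(acts))
print(tracker.min, tracker.max, tracker.scale())  # smoothed range and INT8 step size
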
Challenges with Activation Quantization:

  • The value range depends on the input, so a single fixed scale may clip values or waste resolution.
  • Calibration data (or running statistics collected at runtime) is needed to estimate the range.
  • Occasional outliers can stretch the range and squash most values into a few levels.
  • Scaling must happen on the fly at inference, which adds a small runtime cost.

Inference Time Quantization

Inference time quantization focuses on serving the model in low precision to accelerate computation. Modern approaches have moved beyond simple quantization to mixed precision strategies, which offer a better balance between performance and accuracy.

In a typical mixed precision setup:

  • Weights are stored in low precision (INT8 or INT4).
  • Activations are kept in FP16 or dynamically quantized to INT8.
  • Accumulations inside matrix multiplies use a wider type (INT32 or FP32) to limit rounding error.
  • Numerically sensitive pieces, such as the first and last layers, normalization, and softmax, often stay in higher precision.
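
As one concrete example of inference-time quantization, PyTorch's dynamic quantization stores Linear weights in INT8 and quantizes activations on the fly from their per-batch range. A minimal sketch (the toy model is purely illustrative):

import torch
import torch.nn as nn

# Toy FP32 model standing in for a trained network.
model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Dynamic quantization: Linear weights become INT8; activations are
# quantized at runtime using their observed range.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 784)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 10])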

Quantization-Aware Training (QAT)

QAT is a training-time technique designed to maintain high accuracy when models are deployed with low-bit quantization (like INT8, INT4). Unlike post-training quantization, QAT allows the model to adapt to quantization effects during the training process itself.

How QAT Works

Figure: Quantization-Aware Training Process. The QAT pipeline uses fake quantization during training and real quantization for deployment.

The QAT process involves these key steps:

1. Simulate Quantization During Training

During the forward pass, weights and activations are "fake quantized" to simulate deployment conditions. This involves rounding and clipping values based on the target precision (like INT8), helping the model learn to work within quantization constraints.

Simple Example of How Learning Happens:

  1. Original Weight (FP32): W = 0.45
  2. Fake Quantized (during the forward pass, targeting INT8):
     • Using a scale of 0.1, the quantized value is Q = round(0.45 / 0.1) = 4 (rounding half to even)
     • Dequantized back for calculations: Q_dequantized = 4 × 0.1 = 0.4
  3. Forward Pass Calculation (with quantization noise):
     • The model predicts an output based on 0.4 and computes a loss
  4. Loss Function Result: Loss = 0.2
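
Here is a minimal Python sketch of that quantize-dequantize ("fake quantization") step. The fixed scale of 0.1 matches the toy example above; real QAT derives the scale from observed value ranges.

import numpy as np

def fake_quantize(w, scale=0.1, qmin=-128, qmax=127):
    # Round to the INT8 grid, clip, and map straight back to float.
    q = np.clip(np.round(w / scale), qmin, qmax)
    return q * scale  # the value the forward pass actually sees

print(fake_quantize(0.45))  # 0.4, i.e. ~0.05 of quantization noise on this weight
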
2. Backpropagate Using High Precision

The backward pass maintains high-precision gradients (typically FP32) to ensure accurate learning. This dual approach allows stable gradient updates while still preparing the model for quantized deployment.

Example: High-Precision Gradient Calculation

Given from the previous step:
  • Original weight (W) = 0.45
  • Quantized forward value = 0.4
  • Loss = 0.2

Backward Pass (in FP32):
  • Gradient = ∂Loss/∂W = -0.15
  • Learning rate (η) = 0.01

Weight Update:
  W_new = W - η × gradient
  W_new = 0.45 - 0.01 × (-0.15)
  W_new = 0.4515 (kept in FP32 during training)
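
In practice, the gradient is usually passed straight through the rounding operation (the straight-through estimator), so the FP32 master weight keeps receiving full-precision updates. A minimal PyTorch sketch of that idea, not a production QAT implementation:

import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale=0.1):
        # Forward: quantize-dequantize so the network sees quantization noise.
        return torch.round(w / scale).clamp(-128, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Backward: treat rounding as the identity (straight-through estimator),
        # so the FP32 master weight gets the full-precision gradient.
        return grad_output, None

w = torch.tensor(0.45, requires_grad=True)
loss = (FakeQuantSTE.apply(w) - 0.1) ** 2  # toy loss on the fake-quantized weight
loss.backward()
print(w.grad)  # gradient reaches the FP32 weight unchanged by the rounding step
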
Key Insights:

  • The forward pass sees quantization noise, so the model learns weights that tolerate it.
  • A full-precision (FP32) copy of the weights is kept and updated throughout training.
  • Only at deployment are the weights actually converted to the low-bit format.

Common Challenges

When implementing QAT, teams typically face several challenges:

  • Training is slower and more complex than post-training quantization, since fake quantization runs in every forward pass.
  • Some layers (first and last layers, normalization, softmax) are more sensitive and may need to stay in higher precision.
  • Hyperparameters such as the learning rate and the point at which quantization is enabled often need retuning.
  • The fake-quantization setup must match the target hardware's arithmetic, or deployed accuracy can differ from what was measured during training.

What's Next?

In our next article, we'll explore advanced memory-efficient techniques like LoRA (Low-Rank Adaptation) and other parameter-efficient fine-tuning methods that are revolutionizing how we train large language models.

Did you find this article helpful? Have questions about implementing these techniques? I'd love to hear your thoughts and experiences in the comments below! Your feedback helps make these explanations better for everyone.