โ† Back to Blog

ML Training Optimization: FLOPs, Profiling, and Learning Strategies

Oct 20th, 2025 · 12 min read


💻 Fun disclaimer: Used GPT to get all the beautiful visual gradients, but the content is mine!

When training large-scale machine learning models, optimization goes beyond just hyperparameter tuning. This guide covers the essential aspects of efficient ML training: computational constraints, performance profiling, and learning strategies that can save you significant costs and time.

1. FLOPs and Chinchilla Scaling Law

When training large-scale ML models, you typically have FLOPs (Floating Point Operations) constraints. The Chinchilla scaling law provides crucial guidance on how to allocate your compute budget effectively.

Chinchilla Scaling Law

For a fixed compute budget (FLOPs), you need to decide between having more parameters (bigger model) or training the model for longer (showing it more data).
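
As a rough illustration: the Chinchilla paper's commonly cited approximations are that training compute is about C ≈ 6·N·D FLOPs (N parameters, D training tokens) and that compute-optimal training uses roughly 20 tokens per parameter. Below is a minimal sketch of splitting a FLOP budget under those assumptions; the 20 tokens/parameter ratio is a heuristic, not a universal rule.

import math

def chinchilla_split(flop_budget: float, tokens_per_param: float = 20.0):
    # Assumes C ~= 6 * N * D and D ~= tokens_per_param * N,
    # so N = sqrt(C / (6 * tokens_per_param)).
    n_params = math.sqrt(flop_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_split(1e21)  # e.g. a 1e21 FLOP budget
print(f"~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")

Real allocations depend on your data quality, architecture, and training setup, but this is the basic trade-off the scaling law formalizes: for a fixed budget, more parameters means fewer tokens, and vice versa.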

Two Critical Cases to Avoid

1. Compute Inefficient Training

Example: Compute Inefficiency

Scenario:
• You've built a model that can handle 20 data points effectively
• But you only show it 10 data points
• You keep training beyond what's necessary

Problem:
• You're wasting the model's potential
• Training longer won't help if you're not using enough data
• This is compute inefficient because you're not utilizing your model's capacity

2. Data Inefficient Training

Example: Data Inefficiency

Scenario:
• You've built a model that can handle 10 data points
• But you're showing it 20 data points
• The model can't effectively process all the information

Problem:
• You're wasting valuable data
• The model can't learn from the excess information
• This is data inefficient because you're not utilizing your data effectively

Key Takeaway: Always match your model capacity with your data size and training duration to avoid both compute and data inefficiency.

2. Profiling Your Code

Profiling your training code is essential for maximizing GPU utilization and getting the best performance for your investment. This is different from hyperparameter tuning, which focuses on model learning rather than computational efficiency.

Key Bottlenecks to Monitor

I/O Bottleneck

Don't assume that just because your GPU can handle a larger batch size, you should use it. PyTorch data loaders work on CPU threads, and if your GPU finishes processing batch 1 but your data loader isn't ready with batch 2, your GPU sits idle.

I/O Bottleneck Example:

GPU Timeline:
  Batch 1: [████████████] Processing
  Batch 2: [            ] Waiting for data...
  Batch 3: [            ] Still waiting...

CPU DataLoader:
  Batch 1: [████████████] Loading
  Batch 2: [████████████] Loading (slow)
  Batch 3: [            ] Not ready yet

Result: GPU utilization drops significantly
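
One common mitigation is to give the data loader enough parallelism and prefetching to stay ahead of the GPU. A minimal PyTorch sketch follows; the dataset is a stand-in, and num_workers / prefetch_factor are values you would tune for your own machine.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for your real one
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    num_workers=4,            # CPU worker processes load batches in parallel
    pin_memory=True,          # page-locked host memory -> faster host-to-device copies
    prefetch_factor=2,        # each worker keeps 2 batches ready ahead of time
    persistent_workers=True,  # don't re-spawn workers every epoch
)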

Memory Bottleneck

Good profiling reveals what's consuming your memory. Common culprits include activations stored for the backward pass, optimizer states (e.g., Adam's moment estimates), and gradient buffers.

Memory Optimization Techniques
Gradient Checkpointing
  • Trades computation for memory
  • Recomputes activations during backward pass
  • Can reduce memory by 50-80%
Mixed Precision
  • Runs the forward and backward passes in FP16 (or BF16)
  • Keeps FP32 master weights for stable optimizer updates
  • Reduces activation memory by ~50%
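
A minimal sketch of both techniques together in PyTorch, assuming a CUDA device; the model, shapes, and hyperparameters are placeholders.

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
head = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.AdamW(list(block.parameters()) + list(head.parameters()), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # scales FP16 losses to avoid gradient underflow

x = torch.randn(64, 1024, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type="cuda", dtype=torch.float16):   # mixed precision
    h = checkpoint(block, x, use_reentrant=False)                # activations recomputed in backward
    loss = nn.functional.cross_entropy(head(h), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()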

CPU ↔ GPU Transfer Bottleneck

Moving data between CPU and GPU is often a major bottleneck due to bandwidth limitations. Common scenarios that cause this issue:

Avoid These CPU-GPU Transfers:

❌ Bad: forces a GPU sync and a device-to-host copy on every step
loss_value = loss.item()   # Copies the scalar to CPU and blocks until the GPU is done
if loss_value < threshold:
    ...  # Do something

✅ Better: keep values on the GPU and synchronize only occasionally
running_loss += loss.detach()        # Stays on the GPU
if step % log_interval == 0:
    print(running_loss.item())       # One transfer every log_interval steps

(Note: a plain Python if on a CUDA tensor still forces a sync, because Python needs a concrete boolean, so defer or batch such checks.)
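
Relatedly, host-to-device copies can overlap with GPU work when the source tensor sits in pinned (page-locked) memory; a tiny sketch, assuming a CUDA device:

import torch

batch_cpu = torch.randn(256, 1024).pin_memory()      # page-locked host memory
batch_gpu = batch_cpu.to("cuda", non_blocking=True)  # asynchronous copy that can overlap with compute

This is also what the DataLoader's pin_memory=True option enables for you automatically.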

Kernel Overhead

Launching many small kernels creates overhead: each launch costs the CPU a few microseconds, and when the kernels themselves are tiny, the GPU finishes them faster than the CPU can enqueue new ones, leaving the GPU partially idle.
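
One possible mitigation in PyTorch 2.x is torch.compile, which captures the model graph and fuses many small operations into fewer, larger kernels; a minimal sketch, assuming a CUDA device and a placeholder model:

import torch
from torch import nn

layers = []
for _ in range(8):
    layers += [nn.Linear(512, 512), nn.GELU()]
model = nn.Sequential(*layers).cuda()

compiled = torch.compile(model)   # fuses small ops, reducing per-kernel launch overhead

x = torch.randn(32, 512, device="cuda")
out = compiled(x)                 # first call compiles; later calls reuse the fused kernels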

Profiling Priority

Always profile your code first to identify bottlenecks before focusing on accuracy improvements. This approach will save you significant costs.
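
A minimal sketch of the built-in PyTorch profiler wrapped around a few steps; train_step is a placeholder for your own training step, and the resulting trace can be inspected in TensorBoard.

from torch.profiler import profile, schedule, ProfilerActivity, tensorboard_trace_handler

def train_step():
    pass  # placeholder for your forward pass, backward pass, and optimizer.step()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),   # skip 1 step, warm up 1, record 3
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step in range(6):
        train_step()
        prof.step()   # tells the profiler that one training step has finished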

3. Learning Strategies

Once you've optimized your computational efficiency, focus on improving model performance through effective learning strategies.

Batch Size Selection

Choose the highest batch size your GPU and data loader can handle, but ensure you maintain some stochasticity in your updates. When you change batch size, adjust your learning rate accordingly (usually linearly).

Batch Size Guidelines:

• MNIST Example: Don't use the entire dataset as one batch
  - Too-smooth learning can get stuck in poor minima
  - Always maintain some randomness in updates
• Learning Rate Adjustment:
  - If you double the batch size, consider doubling the learning rate (see the sketch below)
  - Monitor training dynamics carefully
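
A minimal sketch of the linear scaling rule mentioned above; the base values are illustrative, not recommendations.

base_batch_size = 256
base_lr = 1e-3

new_batch_size = 1024
new_lr = base_lr * (new_batch_size / base_batch_size)   # linear scaling rule -> 4e-3
print(new_lr)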

Gradient Accumulation

If your learning is too noisy (loss oscillates up and down), consider gradient accumulation to smooth the updates:

Gradient Accumulation Example:

# Instead of updating after every batch:
loss = model(batch)
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Accumulate gradients over several smaller batches, then update once:
optimizer.zero_grad()
for i in range(accumulation_steps):
    loss = model(batch[i]) / accumulation_steps   # scale so the accumulated gradient matches one large batch
    loss.backward()                               # gradients add up in .grad
optimizer.step()                                  # single update with the accumulated gradients

Frequently Asked Questions

When do you stop training? What is the ideal loss?

Stopping Criteria:

Keep training while:
✓ Validation loss decreases alongside training loss
✓ You have budget remaining
✓ No signs of overfitting

Stop when:
✗ Validation loss flattens or increases
✗ Training loss keeps dropping but validation loss rises
✗ Early stopping triggers
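
A minimal sketch of an early-stopping check on validation loss; the patience and min_delta values are assumptions you would tune.

class EarlyStopping:
    # Stop when validation loss hasn't improved for `patience` evaluations.
    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopping(patience=5)
# Inside the training loop, after each validation pass:
# if stopper.should_stop(val_loss):
#     break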

What if training loss keeps dropping but validation loss increases?

This is classic overfitting. Solutions include stronger regularization (dropout, weight decay), data augmentation or more training data, reducing model capacity, and early stopping.

How do I know if my learning rate is too high or low?

Learning Rate Too High
  • Loss oscillates or spikes
  • Gradients explode
  • Training becomes unstable
Learning Rate Too Low
  • Loss crawls down slowly
  • Training stalls early
  • Very slow convergence

Finding the Sweet Spot:
1. Sweep the learning rate over a log scale
2. Plot loss against learning rate
3. The sweet spot is the steepest descent before instability
4. Use learning rate schedulers for dynamic adjustment (see the sketch below)
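
For step 4, a minimal sketch with a cosine-annealing scheduler; the model, optimizer, and T_max are placeholders to match your own setup.

import torch
from torch import nn

model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... run the training batches for this epoch ...
    optimizer.step()      # normally called once per batch; shown once per epoch here for brevity
    scheduler.step()      # decays the learning rate along a cosine curve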

Key Takeaways

Effective ML training optimization requires balancing computational efficiency, proper profiling, and smart learning strategies. Always profile first to identify bottlenecks, then focus on model performance improvements. This systematic approach will save you both time and money.