LLM Inference Engine from Scratch 2026
Python, PyTorch, Triton, CUDA
  • Triton-fused FlashAttention kernels and paged KV-cache; 2.3x speedup over Python-level implementations on 4K–16K token workloads.
  • Speculative decoding (Qwen3-0.6B draft, Qwen3-4B verifier) reducing decode latency by ~28% on long-context workloads.
  • Dynamic batching scheduler improving GPU utilization by 22%; ~78% of peak HBM bandwidth on A100, measured with Nsight Compute.
Distributed LLM Serving with NVIDIA Dynamo 2026
NVIDIA Dynamo, vLLM, Python, CUDA
  • NVIDIA Dynamo orchestrating vLLM backends with disaggregated prefill/decode pools for better GPU utilization on mixed workloads.
  • Benchmarked KV-aware routing vs. round-robin on TTFT and redundant prefill when requests sharing prompt prefixes land on different workers.
  • Compared single-node vLLM (RunPod) against Dynamo-managed deployments on throughput, tail latency, and memory efficiency.
Distributed Training from First Principles 2026
PyTorch, NCCL, CUDA, Multi-GPU
  • GPipe and 1F1B pipeline scheduling on a multi-GPU transformer with stage-wise model partitioning and micro-batch execution.
  • Core NCCL primitives (send/recv, broadcast, all-reduce) to synchronize activations and gradients across pipeline stages.
  • 1F1B improved pipeline utilization by ~3% over GPipe; now extending to tensor parallelism for hybrid 3D-parallel training.
AI Agent Framework 2025
Python, OpenAI, MCP, FastAPI
  • Multi-step reasoning loop with OpenAI function calling, MCP integration for external tool servers, and sliding-window memory with compaction.
  • Tool orchestration (web search, code execution, file handling) with callback-driven observability and session persistence via FastAPI.
  • Evaluated on the GAIA benchmark against GPT-4o and Claude across multi-step real-world tasks.