Builds
What I built and why.
LLM Inference Engine from Scratch
2026
- Triton-fused FlashAttention kernels and paged KV-cache — 2.3x speedup over Python-level implementations on 4K–16K token workloads.
- Speculative decoding (Qwen3-0.6B draft, Qwen3-4B verifier) reducing decode latency by ~28% on long-context workloads.
- Dynamic batching scheduler improving GPU utilization by 22%; ~78% of peak HBM bandwidth on A100 via Nsight Compute profiling.
Distributed LLM Serving with NVIDIA Dynamo
2026
- NVIDIA Dynamo orchestrating vLLM backends with disaggregated prefill/decode pools for better GPU utilization on mixed workloads.
- Benchmarked KV-aware routing vs round-robin on TTFT and redundant prefill when requests share prompt prefixes across workers.
- Compared single-node vLLM (RunPod) against Dynamo-managed deployments on throughput, tail latency, and memory efficiency.
Distributed Training from First Principles
2026
- GPipe and 1F1B pipeline scheduling on a multi-GPU transformer with stage-wise model partitioning and micro-batch execution.
- Core NCCL primitives (send/recv, broadcast, all-reduce) to synchronize activations and gradients across pipeline stages.
- 1F1B improved pipeline utilization by ~3% over GPipe; extending to tensor parallelism for hybrid 3D parallel training.
AI Agent Framework
2025
- Multi-step reasoning loop with OpenAI function calling, MCP integration for external tool servers, and sliding-window memory with compaction.
- Tool orchestration (web search, code execution, file handling) with callback-driven observability and session persistence via FastAPI.
- Evaluated on the GAIA benchmark against GPT-4o and Claude across multi-step real-world tasks.