What I built, how I built it, and what it achieved.
Complex real-world tasks need multi-step reasoning that single LLM calls can't handle — they need tool use, memory, and the ability to decompose problems.
Designed a multi-step reasoning loop with OpenAI function calling, MCP integration for external tool servers, Pydantic schemas for structured outputs, and sliding-window memory with compaction for long conversations.
Evaluated on GAIA benchmark · Web chat interface · Extensible tool registration
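The sliding-window memory with compaction can be sketched in a few lines. `SlidingWindowMemory` is an illustrative stand-in, not the project's actual class, and the truncation-based compactor below is a placeholder for what would really be an LLM summarization call:

```python
from dataclasses import dataclass, field

@dataclass
class SlidingWindowMemory:
    window: int = 6                      # max recent messages kept verbatim
    summary: str = ""                    # compacted older history
    messages: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.window:
            evicted = self.messages[: -self.window]
            self.messages = self.messages[-self.window:]
            # Placeholder compaction: a real system would summarize with an LLM
            self.summary += " ".join(m["content"][:40] for m in evicted) + " "

    def context(self) -> list:
        """Messages to send to the model: compacted summary + recent window."""
        out = []
        if self.summary:
            out.append({"role": "system", "content": "Summary: " + self.summary.strip()})
        return out + self.messages
```

Keeping the compacted summary as a system message keeps the prompt bounded while preserving long-range context.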
Most engineers use FSDP or DeepSpeed as black boxes. Understanding distributed training requires implementing the communication primitives and scheduling algorithms from scratch.
Implemented GPipe and 1F1B scheduling for a multi-GPU transformer pipeline with stage-wise model partitioning and micro-batch execution. Built the distributed training infrastructure on core NCCL primitives (send/recv, broadcast, all-reduce) to synchronize activations and gradients across pipeline stages.
1F1B improved pipeline utilization by ~3% · Reduced communication overhead by ~4% · Extending to tensor parallelism for hybrid 3D parallel training
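The per-stage 1F1B schedule can be generated directly from the stage index: a warm-up of forwards, a steady state that interleaves one forward with one backward, then a cool-down that drains the remaining backwards. `one_f1b_schedule` is a hypothetical helper for illustration, not the project's actual scheduler:

```python
def one_f1b_schedule(num_stages: int, stage: int, num_microbatches: int):
    """Return the op sequence [('F', mb), ('B', mb), ...] for one pipeline
    stage under 1F1B scheduling."""
    warmup = min(num_stages - 1 - stage, num_microbatches)
    sched, f, b = [], 0, 0
    for _ in range(warmup):            # warm-up: forwards only
        sched.append(("F", f)); f += 1
    while f < num_microbatches:        # steady state: one forward, one backward
        sched.append(("F", f)); f += 1
        sched.append(("B", b)); b += 1
    while b < num_microbatches:        # cool-down: drain remaining backwards
        sched.append(("B", b)); b += 1
    return sched
```

Unlike GPipe, which runs all forwards before any backward, the steady state here releases each micro-batch's activations as soon as its backward completes, which is where the memory and utilization gains come from.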
Single-trajectory RL limits exploration and tool-use diversity. Multi-agent architectures with group-relative policy optimization unlock richer reasoning strategies.
Designed and built a modular multi-agent GRPO system (Planner, Executor, Verifier, Memory) with tool-grounded reasoning, supporting up to 20 decisions/query (4 rollouts × 5 turns). Engineered a data pipeline merging DeepMath-103K + FlashRAG-NQ into a unified 182,190-sample training corpus. Implemented 4-bit QLoRA fine-tuning for a 1.5B policy model, training only 1.10% of parameters.
+300% exploration gain over single-trajectory RL · Reward: 29% → 87% · Loss decreased by 97.7%
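The "group-relative" part of GRPO reduces to a small computation: each rollout's reward is normalized against the mean and standard deviation of its own sampling group, so no learned critic is needed. A minimal sketch (function name illustrative):

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantages for one group of rollouts sampled from the
    same prompt: (reward - group mean) / group std."""
    mu = statistics.fmean(group_rewards)
    sd = statistics.pstdev(group_rewards) or 1.0   # guard: identical rewards
    return [(r - mu) / sd for r in group_rewards]
```

With 4 rollouts per query, each rollout's advantage is relative to its three siblings, which is what makes diverse tool-use trajectories within a group informative.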
Commercial inference engines like vLLM and TGI are powerful but opaque — understanding how LLM serving actually works requires building one from the ground up.
Built a modular inference engine with Triton-fused FlashAttention kernels and a paged KV-cache, achieving a 2.3x speedup on 4K–16K-token workloads. Implemented speculative decoding with a Qwen3-0.6B draft model verified by Qwen3-4B, reducing decoding latency by ~28%. Designed a dynamic batching and request scheduler that improved GPU utilization by 22%.
2.3x speedup · ~28% latency reduction via speculative decoding · ~78% peak HBM bandwidth on A100
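The acceptance step of speculative decoding can be sketched with greedy verification; this is a simplification (real engines typically use sampling-based rejection), and `verify_draft` is an illustrative name, not the engine's API:

```python
def verify_draft(draft_tokens, verifier_argmax):
    """Greedy speculative-decoding acceptance: accept draft tokens while they
    match the verifier model's argmax at each position; on the first mismatch,
    emit the verifier's token instead. Every verification pass therefore
    yields at least one committed token."""
    accepted = []
    for d, v in zip(draft_tokens, verifier_argmax):
        if d == v:
            accepted.append(d)          # draft token confirmed
        else:
            accepted.append(v)          # verifier's correction, then stop
            break
    return accepted
```

The latency win comes from the verifier scoring all draft positions in one forward pass, so accepted runs of tokens cost one large-model step instead of several.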
A fun weekend project: upload a resume and a job description, and get an embedding-based match score plus LLM-generated feedback on what to improve.
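The embedding-based score reduces to cosine similarity between the two embeddings; the 0–100 rescaling below is illustrative, and `match_score` is a hypothetical name rather than the app's actual function:

```python
import math

def match_score(resume_vec, jd_vec):
    """Cosine similarity between resume and job-description embeddings,
    mapped onto a 0-100 match score."""
    dot = sum(a * b for a, b in zip(resume_vec, jd_vec))
    norm = (math.sqrt(sum(a * a for a in resume_vec))
            * math.sqrt(sum(b * b for b in jd_vec)))
    cos = dot / norm
    return round(50 * (cos + 1))        # map [-1, 1] onto [0, 100]
```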
Base language models can generate text but lack the ability to hold conversations or call external tools — both critical for production AI applications.
Took the base Qwen 3B model and fine-tuned it with LoRA in two stages: first on conversational data to give it chat capabilities, then on function-calling datasets to teach it structured tool use. Used QLoRA for memory-efficient training on consumer GPUs.
Base → Chat → Function Calling pipeline · LoRA adapters for each stage · Runs on single GPU
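The LoRA idea behind both stages is a low-rank update added to a frozen weight: y = xW + (α/r)·xAB. A plain-list sketch for one linear layer, purely for illustration (a real implementation operates on tensors via a library such as PEFT):

```python
def lora_forward(x, W, A, B, alpha=16, r=4):
    """LoRA forward pass for one linear layer: y = xW + (alpha/r) * xAB.
    W (d x k) stays frozen; only the low-rank factors A (d x r) and
    B (r x k) are trained, which is why so few parameters update."""
    def vecmat(v, M):                    # row-vector times matrix
        return [sum(v[i] * M[i][j] for i in range(len(v)))
                for j in range(len(M[0]))]
    base = vecmat(x, W)                  # frozen pretrained path
    delta = vecmat(vecmat(x, A), B)      # trainable low-rank path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]
```

Swapping the (A, B) adapter pair per stage is what makes the Base → Chat → Function Calling pipeline cheap: the frozen base weights are shared across both stages.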
Manual code review is slow and inconsistent. Large models can review code but are too expensive and heavy for production deployment.
Fine-tuned a 220M-parameter Microsoft model with LoRA on 150k code samples, then distilled it to 80M parameters. Containerized with Docker and tracked experiments with ClearML.
60% cost reduction via distillation · Production-ready containerized APIs · ClearML versioning
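A common distillation objective, sketched below, is Hinton-style soft-target KD: a KL divergence between temperature-scaled teacher and student distributions. This is an assumption about the training setup, not the project's confirmed loss:

```python
import math

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation loss: KL(teacher || student) over
    temperature-scaled softmaxes, multiplied by T^2 to keep gradient
    magnitudes comparable across temperatures."""
    def softmax_T(z):
        m = max(z)                       # subtract max for numerical stability
        e = [math.exp((v - m) / T) for v in z]
        s = sum(e)
        return [v / s for v in e]
    p = softmax_T(teacher_logits)        # soft targets from the 220M teacher
    q = softmax_T(student_logits)        # 80M student predictions
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The temperature softens the teacher's distribution so the student also learns the relative rankings of incorrect classes, not just the argmax.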
Standard diffusion-based super-resolution requires hundreds of denoising steps, making it too slow for practical use. The ResShift paper offers an efficient alternative.
Implemented from scratch: a U-Net with a 4-stage encoder-decoder and a Swin Transformer bottleneck, a residual-shifting mechanism that cuts denoising to just 15 steps, and sinusoidal time conditioning; trained on the DIV2K dataset.
Competitive PSNR/SSIM/LPIPS · 15 diffusion steps · Full from-scratch implementation
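The residual-shifting forward process can be sketched as follows. Instead of diffusing the HR image to pure noise, ResShift shifts it toward the LR image through the residual e0 = y_lr − x0, which is why so few steps suffice. The geometric η schedule here is a simplification of the paper's:

```python
import math
import random

def resshift_forward(x0, y_lr, t, T=15, kappa=2.0):
    """Sketch of ResShift's forward (noising) step:
        x_t = x0 + eta_t * e0 + kappa * sqrt(eta_t) * noise,
    where e0 = y_lr - x0 is the LR-HR residual. At t = T, x_t sits near the
    LR image plus moderate noise rather than at pure Gaussian noise."""
    eta_1, eta_T = 1e-3, 0.99
    # geometric interpolation between eta_1 and eta_T (illustrative schedule)
    eta_t = eta_1 * (eta_T / eta_1) ** ((t - 1) / (T - 1))
    e0 = [y - x for x, y in zip(x0, y_lr)]
    return [x + eta_t * e + kappa * math.sqrt(eta_t) * random.gauss(0.0, 1.0)
            for x, e in zip(x0, e0)]
```

Because the reverse process only has to undo this short shift (plus bounded noise) rather than a full noising trajectory, 15 steps replace the hundreds needed by standard diffusion super-resolution.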