What I built, how I built it, and what it achieved.
Complex real-world tasks require multi-step reasoning that a single LLM call can't provide: tool use, memory, and the ability to decompose problems.
Designed a multi-step reasoning loop with OpenAI function calling, MCP integration for external tool servers, Pydantic schemas for structured outputs, and sliding-window memory with compaction for long conversations.
Evaluated on GAIA benchmark · Web chat interface · Extensible tool registration
Models like DeepSeek-R1 and OpenAI o1 show that reasoning ability emerges through deliberate post-training, but these pipelines are proprietary and poorly documented.
Implemented the complete post-training pipeline to turn a base LLM (Qwen-2.5) into a reasoning model. Built a custom inference engine with KV-cache and temperature/top-p sampling. Created a math evaluation harness with SymPy-based symbolic equivalence checking. Implemented inference-time scaling (chain-of-thought, self-consistency via majority voting, self-refinement with log-probability scoring). Trained the model using GRPO reinforcement learning from scratch: rollout sampling, group-relative advantage normalization, clipped policy gradient with KL penalty.
Evaluated on GSM8K & MATH benchmarks · GRPO from raw PyTorch (no TRL) · Runs on single GPU
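The two GRPO ingredients named above can be sketched in a few lines of PyTorch. This is a simplified reading of the GRPO objective from the DeepSeekMath paper, not the project's exact code; the hyperparameters (`clip_eps`, `kl_coef`) and the k3 KL estimator are assumptions from that paper.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantage: normalize each rollout's reward against its own group
    (rollouts sampled from the same prompt), replacing a learned critic.
    rewards: (num_prompts, group_size)"""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logp_new, logp_old, adv, clip_eps=0.2, kl_coef=0.04,
              logp_ref=None):
    """PPO-style clipped policy gradient on per-token log-probs, with an
    optional KL penalty against a frozen reference policy."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    loss = -torch.min(unclipped, clipped).mean()
    if logp_ref is not None:
        # k3 KL estimator: exp(ref - new) - (ref - new) - 1, always >= 0.
        diff = logp_ref - logp_new
        loss = loss + kl_coef * (torch.exp(diff) - diff - 1).mean()
    return loss
```

Because advantages are computed per group, a prompt where every rollout fails (or every rollout succeeds) contributes near-zero gradient, which is what makes the group-relative baseline work without a value network.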
Commercial inference engines like vLLM and TGI are powerful but opaque — understanding how LLM serving actually works requires building one from the ground up.
Built a mini inference engine from scratch covering the full serving stack: tokenization, KV-cache management, batching strategies, and basic quantization. Designed to demystify what happens between receiving a prompt and streaming back tokens.
End-to-end from-scratch implementation · KV-cache & batching · Deep understanding of LLM serving internals
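The KV-cache idea at the heart of the engine can be shown in a toy single-head decoder, sketched here with NumPy for brevity (no batching, no real projections; shapes and names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Append-only cache: each decode step stores the new token's key/value
    and attends over everything cached so far, so past tokens are never
    re-projected — decoding cost per step stays linear in sequence length."""
    def __init__(self):
        self.k, self.v = [], []

    def attend(self, q, k_new, v_new):
        self.k.append(k_new)
        self.v.append(v_new)
        K = np.stack(self.k)                      # (seq_len, d)
        V = np.stack(self.v)
        scores = softmax(q @ K.T / np.sqrt(q.shape[-1]))
        return scores @ V                         # (d,)

cache = KVCache()
d = 8
for _ in range(3):                               # three decode steps
    q = k = v = np.random.randn(d)
    out = cache.attend(q, k, v)
```

In a real engine the cache is preallocated per layer and per head, and batching/quantization decisions largely revolve around how this memory is laid out and shared.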
A fun weekend project. Upload a resume and a job description, get an embedding-based match score and LLM-generated feedback on what to improve.
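The match score boils down to cosine similarity between two embeddings. A minimal sketch, with the rescaling to a 0–100 score as an assumption (the embedding model itself is interchangeable):

```python
import numpy as np

def match_score(resume_vec, job_vec):
    """Cosine similarity between resume and job-description embeddings,
    rescaled from [-1, 1] to a 0-100 match score. The rescaling is an
    illustrative choice, not a fixed part of the method."""
    cos = np.dot(resume_vec, job_vec) / (
        np.linalg.norm(resume_vec) * np.linalg.norm(job_vec))
    return round(50 * (cos + 1), 1)

print(match_score(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 100.0
```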
Base language models can generate text but lack the ability to hold conversations or call external tools — both critical for production AI applications.
Took the base Qwen 3B model and LoRA fine-tuned it in two stages: first on conversational data to give it chat capabilities, then on function-calling datasets to teach it structured tool use. Used QLoRA for memory-efficient training on consumer GPUs.
Base → Chat → Function Calling pipeline · LoRA adapters for each stage · Runs on single GPU
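The LoRA mechanism that makes the staged approach work can be sketched as a single adapted layer. This is the core math only (NumPy, toy shapes), not the QLoRA training setup; rank, alpha, and initialization follow common defaults and are assumptions here.

```python
import numpy as np

class LoRALinear:
    """A frozen base weight plus a trainable low-rank update:
    y = Wx + (alpha/r) * B(Ax). One adapter pair (A, B) per stage
    (chat, then function calling) can be trained and swapped in while
    W stays fixed, which is what keeps memory use low."""
    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                  # frozen
        self.A = rng.normal(0, 0.02, (r, d_in))     # trainable
        self.B = np.zeros((d_out, r))               # trainable, init to zero
        self.scale = alpha / r

    def __call__(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

W = np.eye(4)
layer = LoRALinear(W)
x = np.ones(4)
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(layer(x), W @ x)
```

Initializing `B` to zero means training begins from the unmodified base model, so each stage perturbs the previous one smoothly rather than resetting it.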
Manual code review is slow and inconsistent. Large models can review code but are too expensive and heavy for production deployment.
LoRA fine-tuned a 220M-parameter Microsoft model on 150k code samples, then distilled it down to 80M parameters. Containerized the service with Docker and tracked experiments with ClearML.
60% cost reduction via distillation · Production-ready containerized APIs · ClearML versioning
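One standard way to distill a large model into a smaller one is Hinton-style soft-label distillation; the project's exact recipe isn't specified above, so take this as an assumed baseline sketch rather than the implementation used:

```python
import numpy as np

def softmax(z, T):
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label distillation: KL(teacher || student) on temperature-
    softened distributions, scaled by T^2 (Hinton et al., 2015). The
    smaller student learns to mimic the teacher's full output
    distribution rather than just its argmax label."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T**2 * np.sum(p * (np.log(p) - np.log(q)))
```

In practice this term is mixed with the ordinary cross-entropy on ground-truth labels, with the temperature controlling how much "dark knowledge" in the teacher's near-miss classes the student absorbs.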
Standard diffusion-based super-resolution requires hundreds of denoising steps, making it too slow for practical use. The ResShift paper offers an efficient alternative.
Implemented from scratch: U-Net with 4-stage encoder-decoder and Swin Transformer bottleneck, residual shifting mechanism reducing denoising to just 15 steps, and sinusoidal time conditioning; trained on the DIV2K dataset.
Competitive PSNR/SSIM/LPIPS · 15 diffusion steps · Full from-scratch implementation
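Two of the pieces above are compact enough to sketch directly: the sinusoidal time embedding that tells the U-Net which step it is denoising, and the residual-shifted forward mean from the ResShift paper. The embedding dimension and the exact `eta_t` schedule are assumptions; only the functional forms are from the paper and standard practice.

```python
import numpy as np

def sinusoidal_time_embedding(t, dim):
    """Standard sinusoidal conditioning: (sin, cos) pairs at geometrically
    spaced frequencies, concatenated into a dim-length vector for step t."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def shifted_mean(x0, y, eta_t):
    """ResShift forward-process mean: shift the clean image x0 toward the
    low-resolution input y by a monotonically increasing schedule eta_t,
    so the chain starts near y instead of pure noise — the key to needing
    only 15 steps instead of hundreds."""
    return x0 + eta_t * (y - x0)
```

At `eta_t = 0` the mean is the clean image; at `eta_t = 1` it is the degraded input, so the reverse process only has to bridge the residual between the two rather than denoise from a Gaussian.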