Akhil Shekkari

AI Engineer — Agents, Post-Training, Inference

Excited about building agents that can plan, reason, and act across complex multi-step tasks; post-training techniques that unlock new capabilities; and making inference faster at scale.

Currently: MS in Applied Machine Learning at University of Maryland, College Park
Previously: AI Engineer at Atrium (client: Pfizer) · 3 years building ML systems at Tezo

My Best Past Work

Atrium · Pfizer · United States
AI Engineer
Jun 2025 – Aug 2025

Worked at Atrium with Pfizer's AI team to automate how statisticians write analysis plans. Built a RAG pipeline that cut drafting time by 60%, and an LLM-as-a-Judge system that catches hallucinations with 80% precision.

60% faster drafting · 80% hallucination-detection precision
Tezo · 3 years · India
Software Developer (ML)
Jul 2021 – Jul 2024

Built a RAG-powered chatbot that let employees search across 10,000+ internal documents. Reduced document lookup time by 60%. Also trained fraud detection models that improved recall by 15%.

10k+ docs indexed · 60% faster lookups · +15% fraud recall

Projects

AI Agents from Scratch

Featured

Designed and implemented a modular AI agent framework in Python, without relying on LangChain, CrewAI, or any existing agent library. Features a think-act reasoning loop that enables LLMs to autonomously chain tool calls across multiple steps to solve complex tasks.

  • Agent core: Async-first agent loop with Pydantic data models, structured output support, and execution tracing for full observability
  • Tool system + MCP: Extensible @tool decorator with auto function-to-JSON-schema conversion, MCP client for dynamically loading external tool servers, and human-in-the-loop confirmation for dangerous operations
  • Memory & context: Hierarchical optimization combining sliding window truncation, tool result compaction, tiktoken counting, and LLM-based summarization to stay within context limits
  • RAG pipeline: Text chunking, OpenAI embeddings, and cosine-similarity search to compress long search results before feeding them back to the agent
  • Multi-format files: Unified tool for text, CSV, Excel, PDF, images, and audio with multimodal LLM vision/audio analysis
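The cosine-similarity step in the RAG pipeline can be sketched in a few lines (a minimal illustration with plain Python lists; the function names here are hypothetical stand-ins, and real embeddings would come from the OpenAI API rather than be hand-written):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=3):
    # Rank chunk embeddings by similarity to the query and keep the best k.
    scored = sorted(
        ((cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]
```

Only the top-k chunks are fed back to the agent, which is what keeps long search results from blowing the context budget.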
Python · OpenAI · MCP · Pydantic · LiteLLM · FastAPI

Agent loop: Message → Reason → Tool Call → Callbacks → Memory → (repeat)
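The think-act loop at the core of the framework can be sketched as follows (a toy version with a stubbed model call; `call_llm` and the `TOOLS` registry are hypothetical stand-ins, not the project's actual API):

```python
# Minimal think-act loop: the model either requests a tool call or
# returns a final answer, and tool results are appended to the history.
TOOLS = {"add": lambda a, b: a + b}  # toy tool registry

def call_llm(messages):
    # Stub: a real agent would query an LLM here. This stand-in asks
    # for one tool call, then finishes with the tool's result.
    last = messages[-1]
    if last["role"] == "user":
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"final": f"The answer is {last['content']}"}

def run_agent(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_llm(messages)                       # think
        if "final" in action:
            return action["final"]
        result = TOOLS[action["tool"]](**action["args"])  # act
        messages.append({"role": "tool", "content": result})
    return "step limit reached"
```

The step cap is what keeps an autonomous loop from running away; the real framework adds async execution, tracing, and human-in-the-loop confirmation on top of this skeleton.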

Building a Reasoning LLM from Scratch

New

Implemented the complete post-training pipeline to turn a base LLM into a reasoning model. Covers inference-time scaling, self-refinement, and reinforcement learning with verifiable rewards (GRPO). No TRL, no alignment libraries. Inspired by DeepSeek-R1.

  • Inference engine: Custom text generation with KV-cache, temperature/top-p sampling, and torch.compile optimization
  • Evaluation harness: Math verifier with SymPy symbolic equivalence checking, structured answer extraction, and GSM8K/MATH benchmarking
  • Inference-time scaling: Chain-of-thought prompting, self-consistency via majority voting (N=10), and systematic temperature analysis
  • Self-refinement: Iterative generate-score-critique loop using token log-probability confidence scoring
  • GRPO from scratch: full RL pipeline with rollout sampling, group-relative advantages, clipped policy gradient with KL penalty, and multi-epoch training with checkpointing
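Self-consistency reduces to majority voting over sampled answers. A minimal sketch (the `noisy_solver` below is a hypothetical stand-in for temperature-sampled chain-of-thought generations):

```python
import random
from collections import Counter

def self_consistency(sample_answer, n=10, seed=0):
    # Draw N samples and return the most common final answer.
    rng = random.Random(seed)
    answers = [sample_answer(rng) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy sampler: returns the right answer 70% of the time, a distractor otherwise.
def noisy_solver(rng):
    return "42" if rng.random() < 0.7 else "41"
```

Even with a noisy per-sample solver, the vote usually recovers the majority answer, which is why N=10 sampling lifts accuracy over a single greedy decode.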
Python · PyTorch · HuggingFace · SymPy · W&B · CUDA

Pipeline: Base Model → Evaluate → CoT + Sampling → Self-Refine → GRPO Train → Reasoning LLM
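The group-relative part of GRPO is just per-group reward normalization: each rollout's advantage is its reward standardized against the other rollouts sampled for the same prompt. A minimal sketch (plain-Python illustration, not the project's PyTorch implementation):

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO advantage: (reward - group mean) / group std.
    # Rollouts that beat their group get positive advantage,
    # rollouts that lag it get negative advantage.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline is the group mean rather than a learned value function, no critic network is needed; the advantages then feed the clipped policy-gradient objective.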

Latest Blog Posts

Improving Context: Reasoning & Long Context in LLMs

Feb 6, 2026 · 25 min read

A deep dive into how LLMs reason over long-horizon tasks, the mechanics behind context length, and why smart agents with surgical retrieval beat brute-force long context windows.

GPU Fundamentals & LLM Inference Mental Models

20 min read

Build intuition for LLM inference from first principles: GPU architecture, the roofline model, memory estimation, and latency.

Serving LLMs with vLLM on RunPod: A Complete Guide

Feb 3, 2026 · 12 min read

A deep dive into self-hosting LLMs using vLLM on RunPod. Covers PagedAttention, continuous batching, and cost analysis.


Get in Touch