Akhil Shekkari
AI Engineer & ML Researcher
akhil.masters21@gmail.com | +1 (425) 426-8292 | shekkari1999.github.io | github.com/shekkari1999 | linkedin.com/in/akhilshekkari
Download one-page resume (PDF)
Summary

AI engineer with 3+ years of industry experience building production-grade AI systems. My current focus is post-training LLMs to become reliable agents using reinforcement learning — specifically GRPO-based RLVR for tool use and multi-step reasoning. I have implemented GRPO in raw PyTorch, fine-tuned models for function calling, and built an agent framework from scratch. I believe the bottleneck in agentic AI is not scaffolding or prompting — it is making the underlying model learn to decompose tasks, recover from failures, and know when to stop. That is what I work on.

Education
University of Maryland, College Park Expected May 2026
Master of Science in Applied Machine Learning
Relevant coursework: Advanced Machine Learning · Deep Learning · Natural Language Processing · Probability & Statistics · Optimization
Work Experience
AI Engineer June 2025 – August 2025
Atrium · Client: Pfizer · Silver Spring, MD, USA
  • Worked within a regulated pharma environment to automate the generation of Statistical Analysis Plans (SAPs) — lengthy, structured clinical trial documents that statisticians typically draft by hand over several days. The core challenge was not just retrieval but accurate generation: SAPs must reference specific protocol sections, apply ICH E9 statistical guidelines, and use precise terminology without hallucination.
  • Built a RAG pipeline that ingested clinical trial protocols, prior SAP templates, and regulatory guidance documents. Implemented hybrid retrieval combining dense embeddings with BM25 keyword search to handle both semantic queries ("what endpoint analysis approach was used for this trial type?") and exact lookups ("retrieve the randomization scheme from section 5.2"); a minimal sketch of the fusion step follows this list. Reduced statisticians' manual drafting time by up to 60%.
  • Designed and implemented a custom LLM-as-a-Judge evaluation system to detect hallucinations and guideline deviations in generated SAPs. The judge used structured rubrics broken into discrete criteria (e.g., does the generated analysis plan match the stated primary endpoint? does the covariate list align with the protocol?), combining rule-based checks with prompted LLM scoring. Achieved over 80% precision in flagging problematic outputs before human review.
  • Ran systematic prompt evaluation experiments varying temperature, system prompt framing, few-shot example selection, and input chunking strategies. Tracked results across a held-out set of SAP sections and iterated toward a 20% improvement in generation consistency for the highest-stakes SAP components (primary efficacy analysis, handling of missing data, multiplicity adjustments).
  • Collaborated directly with biostatisticians to align the system's output format with ICH E9 guidelines, ensuring outputs were interpretable and trustworthy enough for regulatory submission workflows.
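A minimal sketch of the hybrid-retrieval idea from the second bullet above, assuming the rank_bm25 and sentence-transformers libraries; the sample documents, the all-MiniLM-L6-v2 encoder choice, and the hybrid_search helper are illustrative, not the production system. Reciprocal rank fusion stands in here for whatever score combination the deployed pipeline used.

```python
# Hybrid retrieval sketch: BM25 keyword scores fused with dense cosine
# similarity via reciprocal rank fusion (RRF). Illustrative only.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Section 5.2: randomization scheme uses permuted blocks of size four.",
    "Primary endpoint analyzed with MMRM per ICH E9 guidance.",
    "Missing data handled via multiple imputation under MAR.",
]

# Sparse index: BM25 over whitespace-tokenized chunks.
bm25 = BM25Okapi([d.lower().split() for d in docs])

# Dense index: sentence-transformer embeddings, cosine similarity.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = encoder.encode(docs, normalize_embeddings=True)

def hybrid_search(query: str, k: int = 2, rrf_k: int = 60):
    sparse = bm25.get_scores(query.lower().split())
    dense = doc_emb @ encoder.encode(query, normalize_embeddings=True)
    # Fuse the two *rankings* rather than the raw scores, so BM25's
    # unbounded scale never needs calibrating against cosine similarity.
    ranks_sparse = np.argsort(-sparse).argsort()
    ranks_dense = np.argsort(-dense).argsort()
    fused = 1.0 / (rrf_k + ranks_sparse) + 1.0 / (rrf_k + ranks_dense)
    return [docs[i] for i in np.argsort(-fused)[:k]]

print(hybrid_search("retrieve the randomization scheme from section 5.2"))
```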
Machine Learning Engineer July 2022 – July 2024
Tezo · India
  • Led the design and deployment of a RAG-powered enterprise chatbot integrated with SharePoint project repositories. The core problem was that employees had no unified way to search across 1,000+ internal documents spanning project charters, design specs, architecture decisions, and weekly status reports — knowledge was siloed by team and project. The system gave any employee a single interface to query across all of it in natural language.
  • Built the full semantic search pipeline from scratch: document ingestion and chunking strategies (tested fixed-size, sentence-boundary, and recursive character splitting), embedding generation using a fine-tuned sentence transformer model, vector database indexing with metadata filtering for project-level and date-range scoping, and a hybrid retrieval layer combining dense and sparse signals. Reduced document lookup time by approximately 60% compared to manual SharePoint navigation.
  • Implemented LLM-based summarization of lengthy project documentation using map-reduce summarization for documents exceeding context limits, with section-level summaries feeding into a final synthesis pass; the pattern is sketched after this list. Enabled managers to get 3-sentence overviews of any project's current status, cutting manual review time by approximately 35%.
  • Containerized the full system using Docker with environment-specific configuration management, set up a CI/CD pipeline for automated testing and deployment, and configured on-premises access control to integrate with the company's existing Active Directory authentication — a hard requirement given the sensitivity of client project data.
  • Presented the system to senior leadership, leading to its adoption as the standard internal knowledge tool across two additional business units beyond the original team.
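A sketch of the map-reduce summarization pattern from the third bullet above, with a generic llm callable standing in for the actual chat-completion client; the prompts and the two-sentence/three-sentence budgets are illustrative.

```python
# Map-reduce summarization for documents that exceed the context window:
# summarize each section ("map"), then synthesize the section summaries
# into one overview ("reduce"). `llm` is a stand-in for whatever
# chat-completion client the deployment actually used.
from typing import Callable, List

def map_reduce_summary(sections: List[str], llm: Callable[[str], str]) -> str:
    # Map: one focused summary per section keeps each call inside the
    # model's context limit.
    partials = [
        llm(f"Summarize this project-document section in 2 sentences:\n{s}")
        for s in sections
    ]
    # Reduce: synthesize the partials into the 3-sentence status overview
    # surfaced to managers.
    joined = "\n".join(f"- {p}" for p in partials)
    return llm(
        "Combine these section summaries into a 3-sentence project "
        f"status overview:\n{joined}"
    )
```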
Junior Machine Learning Engineer July 2021 – July 2022
Tezo · India
  • Migrated insurance policy and claims data from a legacy on-premises warehouse to Snowflake. Redesigned the ETL pipeline architecture to take advantage of Snowflake's columnar storage and query pruning — rewrote the most expensive recurring queries as optimized SQL with proper clustering keys, reducing average query time by approximately 40% and cutting monthly compute costs.
  • Trained ML classification models for fraud detection on an imbalanced claims dataset. Evaluated Logistic Regression, Random Forest, Gradient Boosting, and XGBoost with class-weight adjustments and SMOTE oversampling; the setup is sketched after this list. The final ensemble improved recall of fraudulent claims by approximately 15%, enabling the claims team to flag and investigate cases earlier in the processing pipeline before payouts occurred.
  • Built SQL-based BI dashboards for actuarial and operations teams covering claims frequency, severity distributions, loss ratios, and fraud flags. Standardized the report definitions in collaboration with actuarial leads to ensure consistent metric definitions across teams. Cut manual report preparation time by approximately 25% and grew active dashboard users to 50+ across actuarial, operations, and management functions.
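A sketch of the imbalanced-classification setup from the second bullet above, using scikit-learn, imbalanced-learn, and xgboost; load_claims_features is a hypothetical loader, and the hyperparameters are illustrative rather than the values used in production.

```python
# SMOTE oversampling applied only to training folds (inside a pipeline,
# to avoid leakage into validation splits), compared against a
# class-weighted model. Illustrative sketch, not the production code.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_claims_features()  # hypothetical loader: features, 0/1 fraud labels

smote_rf = Pipeline([
    ("smote", SMOTE(random_state=0)),  # resample the minority (fraud) class
    ("clf", RandomForestClassifier(n_estimators=300, random_state=0)),
])

# Alternative: reweight instead of resample. scale_pos_weight ~ N_neg / N_pos.
weighted_xgb = XGBClassifier(scale_pos_weight=(y == 0).sum() / (y == 1).sum())

for name, model in [("SMOTE+RF", smote_rf), ("weighted XGB", weighted_xgb)]:
    recall = cross_val_score(model, X, y, scoring="recall", cv=5).mean()
    print(f"{name}: mean fraud recall = {recall:.3f}")
```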
Projects
Multi-Step Agent Framework from Scratch · 2025
Python · OpenAI Function Calling · MCP · Pydantic · FastAPI · GAIA Benchmark
  • Built a complete multi-step reasoning agent from scratch without relying on LangChain or similar scaffolding frameworks. The goal was to understand — and control — every layer of the agent loop: how tool schemas are passed to the model, how results are formatted back into the context, how memory is managed across turns, and how the agent decides when it has gathered enough information to produce a final answer.
  • Implemented the full tool orchestration layer: web search, code execution in a sandboxed environment, file read/write, and calculator. Each tool is defined as a Pydantic schema, serialized into the model's function-calling API, and dispatched via a callback-driven executor that logs every call and result for observability; the schema-to-dispatch path is sketched after this list. The agent supports parallel tool calls — the model can invoke multiple tools in a single turn when the reasoning chain warrants it.
  • Integrated Model Context Protocol (MCP) support to connect the agent to external tool servers without writing custom integrations for each one. This decouples the agent core from tool implementations and allows hot-swapping or extending the tool set at runtime.
  • Designed a sliding-window memory system with automatic compaction: the full conversation history is maintained in a rolling buffer; when the context approaches the model's limit, older turns are summarized and compressed by a secondary LLM call, then re-injected as a compact memory block. This preserves relevant context across long sessions without truncating the conversation.
  • Built a FastAPI backend that exposes the agent as a REST service with streaming support, session persistence (conversation history stored per session ID), and tool execution logs queryable by session.
  • Evaluated the agent on GAIA, a benchmark of real-world multi-step tasks requiring web search, file analysis, arithmetic, and code execution. Compared GPT-4o and Claude across task categories to understand where each model's tool-use strategy differs, and used these insights to improve the agent's system prompt and tool dispatch logic.
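A sketch of the schema-to-dispatch path from the second bullet above, assuming Pydantic v2 and the OpenAI tools format; the Calculator tool, the to_openai_tool helper, and the dispatch function are illustrative names, and the eval-based executor stands in for the real sandbox.

```python
# Each tool is a Pydantic model whose JSON schema is handed to the model's
# function-calling API, and whose validated instance drives dispatch.
import json
from pydantic import BaseModel, Field

class Calculator(BaseModel):
    """Evaluate a basic arithmetic expression."""
    expression: str = Field(description="e.g. '17 * (3 + 4)'")

    def run(self) -> str:
        # Sandboxing elided; a real executor would not eval raw input.
        return str(eval(self.expression, {"__builtins__": {}}))

def to_openai_tool(model_cls: type) -> dict:
    # Serialize the Pydantic model into the OpenAI "tools" payload format.
    return {
        "type": "function",
        "function": {
            "name": model_cls.__name__.lower(),
            "description": model_cls.__doc__ or "",
            "parameters": model_cls.model_json_schema(),
        },
    }

TOOLS = {"calculator": Calculator}

def dispatch(tool_call: dict) -> str:
    # Validate the model-emitted arguments, then execute and log the result.
    tool = TOOLS[tool_call["name"]](**json.loads(tool_call["arguments"]))
    result = tool.run()
    print(f"[tool] {tool_call['name']} -> {result}")  # observability hook
    return result
```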
Reasoning Model Post-Training with GRPO · 2025
PyTorch · GRPO · Qwen-2.5 · SymPy · KV-Cache · GSM8K · MATH Benchmark
  • Implemented the complete post-training pipeline to turn a base language model (Qwen-2.5) into a reasoning model, starting from the raw base weights with no fine-tuned instruct model as a shortcut. The project covered three inference-time compute strategies — chain-of-thought prompting, self-consistency via majority voting, and iterative self-refinement with confidence scoring — followed by online reinforcement learning via GRPO to bake the reasoning behavior into the weights.
  • Implemented GRPO entirely from scratch in PyTorch with no dependency on TRL or other training libraries. The implementation covers: group sampling (G rollouts per prompt at temperature T), per-token advantage computation using group-relative normalization (subtract group mean, divide by group std), clipped surrogate objective with epsilon = 0.2, KL divergence penalty against a frozen reference model to prevent reward hacking, and degenerate group detection and skipping (groups where all rollouts receive identical rewards, making advantage estimation numerically unstable). The advantage and loss computation, together with the SymPy reward from the next bullet, are sketched after this list.
  • Built a custom reward function using SymPy for symbolic math equivalence checking — critical because string matching alone fails on mathematically equivalent answers in different forms (e.g., "3/4" vs "0.75" vs "\frac{3}{4}"). The verifier parses both the model's generated answer and the ground truth into SymPy expressions and checks symbolic equality, returning a binary reward of 0 or 1.
  • Built a custom KV-cache inference engine for rollout generation: caches the key-value tensors for the shared prompt prefix across all G rollouts in a group, then runs the per-rollout decode passes without redundant prefill computation. This reduced rollout generation time by approximately 40% on longer prompts, which was the primary bottleneck in the training loop.
  • Implemented chain-of-thought prompting with structured output parsing to separate reasoning traces from final answers, self-consistency using majority voting over 8–16 sampled solutions with SymPy-based answer normalization for aggregation, and iterative self-refinement that uses the model's own log-probability on the current answer as a confidence proxy to decide whether to regenerate with a critique prompt.
  • Evaluated across GSM8K (grade school math) and a subset of the MATH benchmark (competition math). Tracked reward curves, KL divergence from reference, response entropy, and per-difficulty-level accuracy throughout training to diagnose reward hacking and mode collapse.
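A simplified sketch of two pieces described above: the SymPy binary reward, and the group-relative advantage with the clipped surrogate and KL penalty. It assumes per-token log-probs have already been gathered for the sampled rollout tokens; math_reward and grpo_loss are illustrative names, and parse_latex requires the antlr4 runtime.

```python
import torch
from sympy import simplify
from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

def math_reward(pred: str, truth: str) -> float:
    # Binary reward: 1.0 if the two expressions are symbolically equal
    # (so "3/4", "0.75", and "\frac{3}{4}" all match).
    try:
        return float(simplify(parse_latex(pred) - parse_latex(truth)) == 0)
    except Exception:
        return 0.0  # unparseable answer gets zero reward

def grpo_loss(logp, logp_old, logp_ref, rewards, mask, eps=0.2, beta=0.04):
    # rewards: (G,) scalar reward per rollout in the group.
    # logp / logp_old / logp_ref: (G, T) per-token log-probs for the
    # current policy, the sampling policy, and the frozen reference.
    # mask: (G, T), 1 on response tokens, 0 on prompt/padding.
    if rewards.std() == 0:
        return None  # degenerate group: identical rewards, no gradient signal
    # Group-relative advantage, broadcast to every token of each rollout.
    adv = ((rewards - rewards.mean()) / (rewards.std() + 1e-6)).unsqueeze(1)
    ratio = torch.exp(logp - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    pg = -torch.min(ratio * adv, clipped * adv)  # clipped surrogate
    # k3 estimator of KL(policy || reference), penalizing drift.
    kl = torch.exp(logp_ref - logp) - (logp_ref - logp) - 1
    per_token = pg + beta * kl
    return (per_token * mask).sum() / mask.sum()
```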
Transformer Inference Engine · 2025
PyTorch · Triton · Paged KV-Cache · Streaming Attention · Online Softmax
  • Built a modular transformer inference engine from the ground up to study LLM serving internals in full detail. The engine implements three attention variants as interchangeable modules: naive attention (materializing the full QK^T matrix in HBM), streaming attention using the online softmax algorithm (Flash Attention–style, O(T) memory), and a Triton-fused kernel that fuses the QK^T multiplication, softmax, and AV product into a single GPU kernel to eliminate intermediate memory writes. Benchmarked all three variants on latency and peak memory across sequence lengths from 512 to 16,384 tokens; the online-softmax recurrence is sketched after this list.
  • Implemented two KV-cache designs side by side for comparison. The contiguous KV-cache pre-allocates a fixed-size tensor for the full maximum sequence length at the start of each request — simple but wasteful when sequences terminate early. The paged KV-cache allocates memory in fixed-size blocks (pages) on demand: each request maintains a page table that maps logical sequence positions to physical pages in a shared page pool, and pages are returned to the pool on request completion. This eliminates the internal memory fragmentation that makes contiguous caches wasteful at scale.
  • Benchmarked prefill throughput (tokens processed per second during the prompt encoding phase) and decode throughput (tokens generated per second during autoregressive generation) across context lengths of 4K, 8K, and 16K tokens. Documented the O(T²) memory growth of naive attention, the O(T) memory of streaming attention, and the practical speedup of the Triton kernel over both Python-level implementations.
  • Implemented temperature and top-p (nucleus) sampling in the decoding loop, with reproducible seeding for benchmarking. Also implemented speculative decoding as a research experiment: a small draft model proposes K tokens at a time, which the target model verifies in a single forward pass — reducing the number of target model forward passes required per generated token.
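A single-query reduction of the online-softmax recurrence from the first bullet above: keys and values are processed in chunks while a running max and normalizer are maintained, so the full score row is never materialized. The streaming_attention helper and the chunk size are illustrative.

```python
import torch

def streaming_attention(q, K, V, chunk=256):
    # q: (d,), K/V: (T, d). Returns softmax(qK^T / sqrt(d)) V while only
    # ever holding `chunk` scores in memory at a time.
    d = q.shape[-1]
    m = torch.tensor(float("-inf"))  # running max of scores
    z = torch.tensor(0.0)            # running softmax normalizer
    acc = torch.zeros_like(q)        # running weighted sum of values
    for i in range(0, K.shape[0], chunk):
        s = (K[i:i + chunk] @ q) / d ** 0.5        # scores for this chunk
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)               # rescale old accumulator
        p = torch.exp(s - m_new)                   # unnormalized weights
        acc = acc * scale + p @ V[i:i + chunk]
        z = z * scale + p.sum()
        m = m_new
    return acc / z

# Matches the naive reference up to float error:
q, K, V = torch.randn(64), torch.randn(4096, 64), torch.randn(4096, 64)
ref = torch.softmax((K @ q) / 64 ** 0.5, dim=0) @ V
assert torch.allclose(streaming_attention(q, K, V), ref, atol=1e-4)
```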
Two-Stage LoRA Fine-Tuning for Function Calling · 2025
LoRA · QLoRA · Qwen-2.5 3B · HuggingFace Transformers · SFT · Function Calling
  • Fine-tuned the base Qwen-2.5 3B model in two sequential stages using LoRA adapters. Stage 1 used conversational SFT data (ShareGPT-format multi-turn dialogues) to add instruction-following and chat behavior to the base model. Stage 2 used function-calling datasets (structured JSON tool schemas with ground-truth call sequences) to teach the model to emit syntactically valid tool calls in the correct format when a user query requires tool use. Each stage produces a separate LoRA adapter that can be applied independently or composed.
  • Used QLoRA (quantized LoRA) for memory-efficient training: the base model weights are loaded in 4-bit NormalFloat quantization and kept frozen; only the LoRA adapter parameters (rank 16, alpha 32, targeting all linear layers) are trained in bf16; the configuration is sketched after this list. This allowed training on a single consumer GPU with 24GB VRAM, where full fine-tuning of the 3B model would require significantly more memory.
  • This project served as the SFT cold-start baseline for the RLVR training pipeline. A model that cannot produce syntactically valid tool calls receives reward 0 on every rollout, making advantage estimation degenerate (all zeros, no gradient signal). The LoRA fine-tuned model provides the format-level capability that RLVR then builds on top of, teaching the model when and how to use tools strategically rather than just what the format looks like.
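A sketch of the Stage-2 QLoRA configuration described above, using the HuggingFace transformers, peft, and bitsandbytes APIs. The rank/alpha values come from the bullet; the module names reflect Qwen-2.5's standard projection layers, and lora_dropout is an illustrative choice rather than a value copied from the training script.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B", quantization_config=bnb, device_map="auto"
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # all linear layers
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora)
model.print_trainable_parameters()  # adapters only; the 4-bit base stays frozen
```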
Code Review Model Distillation · 2025
LoRA · Knowledge Distillation · ClearML · Docker · Microsoft CodeReview model
  • Fine-tuned a 220M-parameter Microsoft code review model with LoRA on a dataset of 150,000 (code diff, review comment) pairs. The model was trained to generate actionable code review comments given a unified diff as input, targeting the style and content of real human reviewers in the training data.
  • Distilled the fine-tuned 220M teacher into an 80M student model using soft label distillation — the student learns to match the teacher's output distribution (soft targets) rather than the hard ground-truth labels, which transfers the teacher's uncertainty and nuance rather than just its argmax predictions; the loss is sketched after this list. The 80M student achieved approximately 60% lower inference latency while retaining the majority of the teacher's review quality as measured on a held-out evaluation set.
  • Tracked all experiments, dataset versions, and model checkpoints in ClearML for reproducibility. Containerized the training and inference pipeline with Docker, using separate training and serving images, and configured ClearML's remote execution feature to launch training runs on GPU instances without manual setup.
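A sketch of the soft-label distillation loss from the second bullet above, blending a temperature-scaled KL term against the teacher's distribution with hard cross-entropy on the ground-truth review tokens; the temperature and mixing weight alpha are illustrative, not the tuned values.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # student_logits / teacher_logits: (batch, seq, vocab); labels: (batch, seq).
    # Soft targets: KL between temperature-softened distributions, scaled by
    # T^2 so gradient magnitudes stay comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard next-token cross-entropy on ground truth.
    hard = F.cross_entropy(
        student_logits.flatten(0, 1), labels.flatten(), ignore_index=-100
    )
    return alpha * soft + (1 - alpha) * hard
```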
ResShift Diffusion Super-Resolution · 2025
PyTorch · U-Net · Swin Transformer · Diffusion · PSNR · SSIM · LPIPS · DIV2K
  • Implemented the ResShift super-resolution architecture from scratch. The core idea of ResShift is to define the diffusion process over the residual between the low-resolution input and the high-resolution target rather than over pure Gaussian noise — this dramatically reduces the number of diffusion steps needed (15 vs 1000 in standard DDPM) because the model starts from a meaningful initialization (the upsampled LR image) rather than pure noise, and only needs to learn a small residual correction.
  • The network architecture is a U-Net with a 4-stage encoder-decoder. Each encoder stage applies strided convolutions for downsampling followed by residual blocks with GroupNorm. The bottleneck is a Swin Transformer block — a shifted-window self-attention module that captures long-range spatial dependencies at the compressed feature map scale without the quadratic cost of full self-attention. Skip connections from encoder to decoder preserve spatial detail. The diffusion timestep is conditioned via sinusoidal embeddings injected through AdaGN layers; the embedding itself is sketched after this list.
  • Trained for 4× upscaling on the DIV2K high-resolution image dataset. Evaluated using three metrics capturing different aspects of quality: PSNR (pixel-level fidelity, measures signal-to-noise ratio in dB), SSIM (structural similarity, correlates with human perception of sharpness and contrast), and LPIPS (learned perceptual image patch similarity, measures perceptual distance using VGG features; lower is better). Achieved competitive results with significantly fewer inference steps than standard diffusion super-resolution methods.
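A sketch of the sinusoidal timestep embedding mentioned above, as commonly implemented in diffusion U-Nets; the embedding dimension is illustrative, and the downstream MLP and AdaGN projection are omitted.

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    # t: (batch,) integer diffusion steps -> (batch, dim) embedding, using
    # the same geometric frequency spacing as transformer positional codes.
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000) * torch.arange(half, dtype=torch.float32) / half
    )
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# e.g. a 15-step ResShift-style schedule conditions on t in [0, 14]
emb = timestep_embedding(torch.tensor([0, 7, 14]), dim=128)
```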
Publications & Writing
A Comparative Analysis of Machine Learning Algorithms for Breast Cancer Detection
IJRASET · 2021 · DOI: 10.22214/ijraset.2021.38940
Evaluated SVM, Random Forest, Logistic Regression, and KNN on the Wisconsin Breast Cancer Dataset. Analyzed accuracy, precision, recall, F1, and AUC to identify the best-performing classifier for early detection. Included hyperparameter sensitivity analysis and an examination of feature importance for clinical interpretability.
RLVR for LLM Agents — In Preparation
arXiv preprint · Expected 2026
Training LLMs to become reliable tool-use agents using Group Relative Policy Optimization. Covers verifier design, environment and tool sandbox construction, trajectory tokenization and gradient masking, rollout generation at scale, advantage estimation under sparse rewards, and training stability diagnostics. Evaluated on the GAIA benchmark and a custom multi-step tool-use evaluation suite.
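A schematic of the trajectory gradient-masking idea listed in the outline above, under the assumption that environment-generated tokens (tool results, scaffolding) are excluded from the policy loss while model-generated tokens carry gradient; build_loss_mask and the segment encoding are illustrative.

```python
import torch

def build_loss_mask(segments):
    # segments: list of (num_tokens, source) pairs in trajectory order,
    # where source is "model" (reasoning, tool calls, final answer) or
    # "env" (tool results injected by the environment).
    mask = [
        float(source == "model")
        for num_tokens, source in segments
        for _ in range(num_tokens)
    ]
    return torch.tensor(mask)

trajectory = [(12, "model"), (30, "env"), (9, "model")]  # call, result, answer
mask = build_loss_mask(trajectory)  # gradient flows only where mask == 1
```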
Technical Blog — "Under the Hood"
Technical Skills

Languages & Libraries: Python, PyTorch, Triton, SQL, NumPy, Scikit-learn
LLM Training & Fine-Tuning: SFT, RLHF, GRPO, LoRA, QLoRA, FSDP, HuggingFace Transformers, TRL
Inference & Serving: KV-Cache, Paged Attention, Streaming Attention, Triton Kernels, vLLM
Agent & RAG Systems: Function Calling, MCP, LangChain, LangGraph, LlamaIndex, Llama Parse, FastAPI
MLOps & Infrastructure: Docker, Kubernetes, AWS SageMaker, Azure ML, ClearML, GitHub Actions, CI/CD
Data & Evaluation: Snowflake, Vector Databases, SymPy, LLM-as-Judge, GAIA Benchmark