Akhil Shekkari

AI Engineer — Agents, Large Scale Training, Inference

Excited about building agents that plan, reason, and act across complex multi-step tasks, developing large-scale training techniques that unlock new capabilities, and making inference faster at scale.

Currently: MS in Applied Machine Learning at University of Maryland, College Park
Previously: AI Engineer at Atrium (client: Pfizer) · 3 years building ML systems at Tezo

Peer-Reviewed Journal Publication
Leveraging Generative AI to Transform Statistical Analysis Plan Authoring in Clinical Trials
Clinical Trials (SAGE Publications) · 2025 · Co-authored with Pfizer R&D
Read Paper

My Best Past Work

Atrium · Pfizer · United States
AI Engineer
Jun 2025 – Aug 2025

Co-authored a peer-reviewed publication in Clinical Trials (SAGE). Automated statistical analysis plan (SAP) generation for Pfizer, cutting drafting time by 60%, and built an LLM-as-a-Judge system that detects hallucinations with 82% precision.

60% faster drafting · 82% hallucination-detection precision
Tezo · 3 years · India
Machine Learning Engineer
Jul 2021 – Jul 2024

Built a RAG-powered chatbot that let employees search across 1,000+ internal documents. Reduced document lookup time by 61%. Also trained fraud detection models that improved recall by 12%.

1k+ docs indexed · 61% faster lookups · 12% fraud-recall improvement
View Full Timeline

Projects

AI Agents from Scratch

Featured

Designed and implemented a modular AI agent framework in Python, without relying on LangChain, CrewAI, or any existing agent library. Features a think-act reasoning loop that enables LLMs to autonomously chain tool calls across multiple steps to solve complex tasks.

  • Agent core: Async-first agent loop with Pydantic data models, structured output support, and execution tracing for full observability
  • Tool system + MCP: Extensible @tool decorator with auto function-to-JSON-schema conversion, MCP client for dynamically loading external tool servers, and human-in-the-loop confirmation for dangerous operations
  • Memory & context: Hierarchical optimization combining sliding window truncation, tool result compaction, tiktoken counting, and LLM-based summarization to stay within context limits
  • RAG pipeline: Text chunking, OpenAI embeddings, and cosine-similarity search to compress long search results before feeding them back to the agent
  • Multi-format files: Unified tool for text, CSV, Excel, PDF, images, and audio with multimodal LLM vision/audio analysis
Python
OpenAI
MCP
Pydantic
LiteLLM
FastAPI
Pipeline: Message → Reason → Tool Call → Callbacks → Memory → Loop
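The think-act loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not the framework's code: `fake_llm` is a hypothetical offline stand-in for the real LiteLLM-backed model call, and the `@tool` decorator is reduced to a bare registry.

```python
# Minimal think-act loop sketch. The real framework uses LiteLLM/OpenAI and
# Pydantic models; `fake_llm` here is a hypothetical offline stand-in.
import json

TOOLS = {}

def tool(fn):
    """Register a plain function as an agent tool (simplified @tool)."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def add(a: int, b: int) -> int:
    return a + b

def fake_llm(messages):
    """Stand-in policy: request one `add` call, then answer with its result."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"content": f"The sum is {last['content']}", "tool_call": None}
    return {"content": None, "tool_call": {"name": "add", "args": {"a": 2, "b": 3}}}

def run_agent(user_msg, max_steps=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):              # think-act loop
        reply = fake_llm(messages)          # think: model decides next action
        if reply["tool_call"] is None:      # final answer, stop looping
            return reply["content"]
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["args"])  # act: run the tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("max steps exceeded")

print(run_agent("What is 2 + 3?"))  # prints: The sum is 5
```

The key design point the sketch captures: the model never executes anything itself; it only emits structured tool requests, and the loop feeds results back as messages until the model produces a final answer or hits the step budget.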

LLM Inference Engine from Scratch

New

Commercial inference engines like vLLM and TGI are powerful but opaque. Built a full inference engine from scratch to understand every layer of the serving stack — from kernel-level attention to request scheduling.

  • FlashAttention: Triton-fused attention kernels with paged KV-cache, achieving 2.3x speedup over Python-level implementations on 4K–16K token workloads
  • Speculative decoding: Qwen3-0.6B draft model with Qwen3-4B as verifier, reducing autoregressive decoding latency by ~28% and increasing throughput by ~1.4x
  • Dynamic batching: Request scheduler for concurrent inference, improving GPU utilization by 22% and sustaining 1.6x higher tokens/sec vs sequential decoding
  • GPU profiling: Nsight Compute to identify memory-bound bottlenecks; optimized tiling and memory access to reach ~78% of peak HBM bandwidth on A100
Python
PyTorch
Triton
CUDA
KV Cache
Nsight
Pipeline: Request In → Dynamic Batch → FlashAttention + KV-Cache → Speculative Decode → Stream Tokens
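The speculative-decoding stage can be sketched with toy next-token functions standing in for the Qwen3 models. Both `draft_next` and `target_next` below are illustrative assumptions, not the project's code; they exist only so the accept/reject logic runs standalone.

```python
# Greedy speculative-decoding sketch. The project uses Qwen3-0.6B as the
# draft model and Qwen3-4B as the verifier; `draft_next` and `target_next`
# are toy stand-ins so the accept/reject logic runs without GPUs.

def target_next(seq):
    """Toy 'large' model: next token is (last + 1) mod 10."""
    return (seq[-1] + 1) % 10

def draft_next(seq):
    """Toy 'small' model: agrees with the target except after token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

def speculative_decode(prompt, n_tokens, k=5):
    seq, rounds = list(prompt), 0
    while len(seq) < n_tokens:
        # Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies all k positions in one batched pass (one round).
        rounds += 1
        ctx = list(seq)
        for t in proposal:
            expected = target_next(ctx)
            if t == expected:
                ctx.append(t)             # accept the draft token
            else:
                ctx.append(expected)      # reject: take target's token, stop
                break
        else:
            ctx.append(target_next(ctx))  # bonus token when all k accepted
        seq = ctx[:n_tokens]
    return seq, rounds

out, rounds = speculative_decode([0], 12)
print(out, rounds)  # same tokens as plain greedy decoding, in 2 rounds not 11
```

Because every accepted prefix matches the target's greedy choice exactly, the output is identical to decoding with the large model alone; the speedup comes from replacing sequential target-model passes with a few batched verification rounds.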
View All Projects

Latest Blog Posts

Improving Context: Reasoning & Long Context in LLMs

Feb 6, 2026 · 25 min read

A deep dive into how LLMs reason over long-horizon tasks, the mechanics behind context length, and why smart agents with surgical retrieval beat brute-force long context windows.

GPU Fundamentals & LLM Inference Mental Models

Summary · 20 min read

Build intuition for LLM inference from first principles: GPU architecture, the roofline model, memory estimation, and latency.

Serving LLMs with vLLM on RunPod: A Complete Guide

Feb 3, 2026 · 12 min read

A deep dive into self-hosting LLMs using vLLM on RunPod. Covers PagedAttention, continuous batching, and cost analysis.

Read All Posts

Get in Touch