Birat's Notebook

Deep dives into AI Agents, MLOps, and the systems behind intelligence.

Test-Time Compute is the New Moore's Law: How Thinking Beats Scaling

Tue Apr 28 2026 • Birat Gautam

For decades, we scaled models by making them bigger. In 2026, we're scaling them by making them think harder. This fundamental shift changes everything about how you approach model selection, cost optimization, and what 'capability' actually means.


The Data Quality Crisis: Why 80% of 'Model Alignment Problems' Are Actually 'Bad Labels'

Sun Apr 26 2026 • Birat Gautam

You think your model is broken. You're actually training it wrong. Lilian Weng's research shows that labeler disagreement, reward hacking, and underspecified objectives are the real bottlenecks. Here's how to fix them.


How Reasoning Models Work in 2026: The Training, Inference, and Verification Stack

Wed Apr 22 2026 • Birat Gautam

A practical, source-backed guide to how reasoning LLMs actually work: inference-time scaling, RL with verifiable rewards, trace distillation, API controls, and production trade-offs.


Alignment Faking: When Your Safe Model is Strategically Hiding Its Misalignment

Tue Apr 21 2026 • Birat Gautam

Anthropic and Redwood Research reported something disturbing: an LLM can comply with its training objective when it believes it is being trained, then act on its original preferences when it believes it is unmonitored. This isn't speculative AI risk; it's empirical. And it changes how you need to monitor production systems.


How to Cut LLM Inference Costs with vLLM and Speculative Decoding

Mon Apr 20 2026 • Birat Gautam

Your inference infrastructure is burning money. A typical LLM serving setup can waste 60-70% of its compute during token generation. Here's how vLLM's PagedAttention plus speculative decoding can cut your costs by as much as 70% without sacrificing quality.


Small Models Actually Win: Why Qwen 35B Beats GPT-5 at Economics (And Often Quality)

Mon Apr 20 2026 • Birat Gautam

The LLM industry sold you a lie: bigger is better. In 2026, smaller models are outcompeting frontier models on the metrics that actually matter: cost, latency, and real-world accuracy. Here's the economics breakdown and why you should care.


I Tested Gemma 4 for Local Agentic AI: Architecture, Benchmarks, Prompting, and Deployment Lessons

Sat Apr 18 2026 • Birat Gautam

From Android on-device flows to workstation-grade MoE serving, this hands-on Gemma 4 deep dive explains where it shines, where it breaks, and how to deploy it without latency, memory, and tool-loop traps.


Hot vs Cold Memory: State Architecture Patterns for Long-Running Agents

Fri Apr 17 2026 • Birat Gautam

Long-running agent quality depends on memory architecture, not just context window size. Separate hot execution state from cold historical memory to scale safely.


Context Management Is Actually Workflow Design

Fri Apr 17 2026 • Birat Gautam

The 1M token window didn't just give us more room; it exposed a hidden layer of AI development that nobody was talking about. How you manage context reveals how well you understand your own work.


Eval-Driven Releases: How to Ship Agent Changes Without Guessing

Fri Apr 17 2026 • Birat Gautam

Agent quality is a release engineering problem. A stable eval suite with quality gates is the only reliable way to ship model, prompt, and tool changes safely.


From Prompts to Policy Engines: Guardrails That Survive Real Traffic

Fri Apr 17 2026 • Birat Gautam

Prompt-only guardrails fail under scale. Durable safety comes from explicit policy engines that evaluate intent, context, and tool permissions before execution.


RAG Reliability by Design: Retrieval Quality SLOs That Prevent Silent Failure

Fri Apr 17 2026 • Birat Gautam

Most RAG failures start before generation. Define retrieval SLOs, measure them continuously, and gate responses when evidence quality is weak.


When Agents Should Not Decide: Building Confidence Thresholds for Human Handoff

Thu Apr 16 2026 • Birat Gautam

Agents need rejection regions and escalation policies. The right goal is not maximum autonomy, but appropriate autonomy with clear human handoff points.


Observability for Black-Box Agents: Tracing Decisions in Production

Thu Apr 16 2026 • Birat Gautam

Agent observability is about reconstructing decisions, not just timing requests. You need traces that show what the agent saw, believed, and decided.


The Hallucination Budget: Quantifying Risk for Mission-Critical Agents

Thu Apr 16 2026 • Birat Gautam

Hallucinations are not random. They cluster by input type, failure mode, and downstream cost, which means they can be budgeted like any other production risk.


Agents in the Loop: Designing for Human-AI Collaboration Instead of Replacement

Thu Apr 16 2026 • Birat Gautam

The best agents do not replace people. They reduce human effort on routine work, surface confidence clearly, and make intervention cheap when the case is borderline.


The Latency Trap: Why 99th-Percentile Response Time Matters More Than Average

Thu Apr 16 2026 • Birat Gautam

Agent latency is heavy-tailed, not normal. The user experience is governed by tail latency, stage budgets, and the failure paths that inflate p95 and p99.


Orchestrating Agents at Scale: When You Need a Supervisor, Not a Bigger Model

Thu Apr 16 2026 • Birat Gautam

Coordination complexity does not disappear when you use a bigger model. A supervisor plus specialized agents usually scales better than one monolithic agent.


Prompt Injection in Agents: Defense Patterns That Actually Work

Thu Apr 16 2026 • Birat Gautam

Prompt injection is not a prompt-writing bug. It is an architecture problem across retrieval, memory, tools, and output handling.


State Management Without the Mess: Deterministic Agent Memory for Long-Running Systems

Thu Apr 16 2026 • Birat Gautam

Vector search is useful, but deterministic event logs are what make long-running agents auditable, reproducible, and safe to debug after the fact.


Token Economics: Why Your Agent Architecture Is Costing 10x More Than It Should

Thu Apr 16 2026 • Birat Gautam

Token spend is usually an architecture problem, not a prompt-writing problem. The biggest savings come from routing, caching, pruning, and fewer unnecessary model calls.


The Tool-Use Illusion: Why Most Agent Frameworks Fail at Production Scale

Thu Apr 16 2026 • Birat Gautam

Adding more tools does not make an agent smarter if every decision adds latency, retries, and hidden orchestration cost. Here is how to design tool flows that stay fast and debuggable.


The Architecture of Agency: Model Context Protocol (MCP)

Wed Apr 15 2026 • Birat Gautam

MCP turns tool integration from custom glue code into a protocol. This guide explains the architecture, the trade-offs, and how to build a server that is actually useful in production.


Demystifying the Working of ReactJs: From JSX to Pixels

Fri Aug 30 2024 • Birat Gautam

A practical walkthrough of what actually happens from JSX authoring to browser rendering, including Babel transforms, Vite build stages, and how React finally updates pixels on screen.