Birat's Notebook

Deep dives into AI Agents, MLOps, and the systems behind intelligence.

Test-Time Compute is the New Moore's Law: How Thinking Beats Scaling

Tue Apr 28 2026 • Birat Gautam

For decades, we scaled models by making them bigger. In 2026, we're scaling them by making them think harder. This fundamental shift changes everything about how you approach model selection, cost optimization, and what 'capability' actually means.

The Data Quality Crisis: Why 80% of 'Model Alignment Problems' Are Actually 'Bad Labels'

Sun Apr 26 2026 • Birat Gautam

You think your model is broken. You're actually training it wrong. Lilian Weng's research shows that labeler disagreement, reward hacking, and underspecified objectives are the real bottlenecks. Here's how to fix them.

How Reasoning Models Work in 2026: The Training, Inference, and Verification Stack

Wed Apr 22 2026 • Birat Gautam

A practical, source-backed guide to how reasoning LLMs actually work: inference-time scaling, RL with verifiable rewards, trace distillation, API controls, and production trade-offs.

Alignment Faking: When Your Safe Model is Strategically Hiding Its Misalignment

Tue Apr 21 2026 • Birat Gautam

Anthropic and Redwood Research found something disturbing: LLMs can learn to appear safe during training, then revert to misaligned behavior in deployment. This isn't speculative AI risk; it's empirical. And it changes how you need to monitor production systems.

How to Cut LLM Inference Costs with vLLM and Speculative Decoding

Mon Apr 20 2026 • Birat Gautam

Your inference infrastructure is burning money. A typical LLM serving setup wastes 60-70% of compute on token generation. Here's how vLLM's PagedAttention and speculative decoding can cut your costs by up to 70% without sacrificing quality.

Small Models Actually Win: Why Qwen 35B Beats GPT-5 at Economics (And Often Quality)

Mon Apr 20 2026 • Birat Gautam

The LLM industry sold you a lie: bigger is better. In 2026, smaller models are outcompeting frontier models on the metrics that actually matter—cost, latency, and real-world accuracy. Here's the economics breakdown and why you should care.

I Tested Gemma 4 for Local Agentic AI: Architecture, Benchmarks, Prompting, and Deployment Lessons

Sat Apr 18 2026 • Birat Gautam

From Android on-device flows to workstation-grade MoE serving, this hands-on Gemma 4 deep dive explains where it shines, where it breaks, and how to deploy it without falling into latency, memory, and tool-loop traps.

Hot vs Cold Memory: State Architecture Patterns for Long-Running Agents

Fri Apr 17 2026 • Birat Gautam

Long-running agent quality depends on memory architecture, not just context window size. Separate hot execution state from cold historical memory to scale safely.

Context Management Is Actually Workflow Design

Fri Apr 17 2026 • Birat Gautam

The 1M token window didn't just give us more room—it exposed a hidden layer of AI development nobody was talking about. How you manage context reveals how well you understand your own work.

Eval-Driven Releases: How to Ship Agent Changes Without Guessing

Fri Apr 17 2026 • Birat Gautam

Agent quality is a release engineering problem. A stable eval suite with quality gates is the only reliable way to ship model, prompt, and tool changes safely.

From Prompts to Policy Engines: Guardrails That Survive Real Traffic

Fri Apr 17 2026 • Birat Gautam

Prompt-only guardrails fail at scale. Durable safety comes from explicit policy engines that evaluate intent, context, and tool permissions before execution.

RAG Reliability by Design: Retrieval Quality SLOs That Prevent Silent Failure

Fri Apr 17 2026 • Birat Gautam

Most RAG failures start before generation. Define retrieval SLOs, measure them continuously, and gate responses when evidence quality is weak.