Birat's Notebook
Deep dives into AI Agents, MLOps, and the systems behind intelligence.
Tue Apr 28 2026 • Birat Gautam
For decades, we scaled models by making them bigger. In 2026, we're scaling them by making them think harder. This fundamental shift changes everything about how you approach model selection, cost optimization, and what 'capability' actually means.
Sun Apr 26 2026 • Birat Gautam
You think your model is broken. You're actually training it wrong. Lilian Weng's research shows that labeler disagreement, reward hacking, and underspecified objectives are the real bottlenecks. Here's how to fix them.
Wed Apr 22 2026 • Birat Gautam
A practical, source-backed guide to how reasoning LLMs actually work: inference-time scaling, RL with verifiable rewards, trace distillation, API controls, and production trade-offs.
Tue Apr 21 2026 • Birat Gautam
Anthropic and Redwood Research documented something disturbing: an LLM can strategically comply with its training objective when it believes it is being trained, then behave differently when it believes it is unmonitored. This isn't speculative AI risk; it's empirical. And it changes how you need to monitor production systems.
Mon Apr 20 2026 • Birat Gautam
Your inference infrastructure is burning money. A typical LLM serving setup wastes 60-70% of compute on token generation. Here's how vLLM's PagedAttention and speculative decoding can cut your costs by up to 70% without sacrificing quality.
Mon Apr 20 2026 • Birat Gautam
The LLM industry sold you a lie: bigger is better. In 2026, smaller models are outcompeting frontier models on the metrics that actually matter—cost, latency, and real-world accuracy. Here's the economics breakdown and why you should care.
Sat Apr 18 2026 • Birat Gautam
From Android on-device flows to workstation-grade MoE serving, this hands-on Gemma 4 deep dive explains where it shines, where it breaks, and how to deploy it without latency, memory, and tool-loop traps.
Fri Apr 17 2026 • Birat Gautam
Long-running agent quality depends on memory architecture, not just context window size. Separate hot execution state from cold historical memory to scale safely.
Fri Apr 17 2026 • Birat Gautam
The 1M token window didn't just give us more room—it exposed a hidden layer of AI development nobody was talking about. How you manage context reveals how well you understand your own work.
Fri Apr 17 2026 • Birat Gautam
Agent quality is a release engineering problem. A stable eval suite with quality gates is the only reliable way to ship model, prompt, and tool changes safely.
Fri Apr 17 2026 • Birat Gautam
Prompt-only guardrails fail under scale. Durable safety comes from explicit policy engines that evaluate intent, context, and tool permissions before execution.
Fri Apr 17 2026 • Birat Gautam
Most RAG failures start before generation. Define retrieval SLOs, measure them continuously, and gate responses when evidence quality is weak.
Thu Apr 16 2026 • Birat Gautam
Agents need rejection regions and escalation policies. The right goal is not maximum autonomy, but appropriate autonomy with clear human handoff points.
Thu Apr 16 2026 • Birat Gautam
Agent observability is about reconstructing decisions, not just timing requests. You need traces that show what the agent saw, believed, and decided.
Thu Apr 16 2026 • Birat Gautam
Hallucinations are not random. They cluster by input type, failure mode, and downstream cost, which means they can be budgeted like any other production risk.
Thu Apr 16 2026 • Birat Gautam
The best agents do not replace people. They reduce human effort on routine work, surface confidence clearly, and make intervention cheap when the case is borderline.
Thu Apr 16 2026 • Birat Gautam
Agent latency is heavy-tailed, not normal. The user experience is governed by tail latency, stage budgets, and the failure paths that inflate p95 and p99.
Thu Apr 16 2026 • Birat Gautam
Coordination complexity does not disappear when you use a bigger model. A supervisor plus specialized agents usually scales better than one monolithic agent.
Thu Apr 16 2026 • Birat Gautam
Prompt injection is not a prompt-writing bug. It is an architecture problem across retrieval, memory, tools, and output handling.
Thu Apr 16 2026 • Birat Gautam
Vector search is useful, but deterministic event logs are what make long-running agents auditable, reproducible, and safe to debug after the fact.
Thu Apr 16 2026 • Birat Gautam
Token spend is usually an architecture problem, not a prompt-writing problem. The biggest savings come from routing, caching, pruning, and fewer unnecessary model calls.
Thu Apr 16 2026 • Birat Gautam
Adding more tools does not make an agent smarter if every decision adds latency, retries, and hidden orchestration cost. Here is how to design tool flows that stay fast and debuggable.
Wed Apr 15 2026 • Birat Gautam
MCP turns tool integration from custom glue code into a protocol. This guide explains the architecture, the trade-offs, and how to build a server that is actually useful in production.
Fri Aug 30 2024 • Birat Gautam
A practical walkthrough of what actually happens from JSX authoring to browser rendering, including Babel transforms, Vite build stages, and how React finally updates pixels on screen.