Sahil Khundiya | Backend & AI Infrastructure Engineer

The "Always-LLM" Anti-Pattern

Early RAG implementations tend to treat every user query as a complex semantic puzzle, which is wasteful. If a user asks "What is the status of ticket #402?", you don't need an LLM to "understand" the intent; you need a regex and a SQL lookup.
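To make the ticket example concrete, here is a minimal sketch of the regex-plus-SQL path. The pattern, the tickets table, its columns, and the helper names are all hypothetical, and an in-memory SQLite database stands in for whatever store actually holds the tickets.

```python
import re
import sqlite3
from typing import Optional

# Hypothetical pattern: matches "ticket #402" or "ticket 402", case-insensitive.
TICKET_RE = re.compile(r"\bticket\s*#?(\d+)\b", re.IGNORECASE)


def lookup_ticket_status(conn: sqlite3.Connection, query: str) -> Optional[str]:
    """Return the ticket status if the query names a known ticket, else None."""
    match = TICKET_RE.search(query)
    if match is None:
        return None  # Not a ticket lookup; let a later routing tier handle it.
    ticket_id = int(match.group(1))
    # Parameterized query: the extracted ID is never interpolated into SQL.
    row = conn.execute(
        "SELECT status FROM tickets WHERE id = ?", (ticket_id,)
    ).fetchone()
    return row[0] if row else None


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with one illustrative ticket (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO tickets (id, status) VALUES (402, 'open')")
    return conn
```

Calling lookup_ticket_status(make_demo_db(), "What is the status of ticket #402?") resolves the request without a single model call. A None result means the query should fall through to the next tier.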

The Routing Hierarchy

We built a three-tier routing engine to optimize for latency and cost:

Deterministic Router (Regex/Keyword): Handles exact matches for IDs, technical terms, and navigation commands. Latency: <1ms.
Classifier Router (Small Model): A lightweight BERT classifier that labels each query as "Simple" (retrieval only) or "Complex" (multi-document reasoning).
LLM Router (The Fallback): Used only when the intent is ambiguous or the query requires zero-shot classification.
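The three tiers above can be sketched as a cascade that stops at the first tier able to decide. This is an illustration under stated assumptions, not the production engine: the keyword list, the "EXACT_MATCH" label, and the 0.8 confidence threshold are hypothetical, and the classifier and LLM are passed in as plain callables so that no specific model API is implied.

```python
import re
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical deterministic rules: an ID like "#402" or a navigation verb.
ID_RE = re.compile(r"#\d+")
NAV_PREFIXES = ("open ", "go to ", "show ")


@dataclass(frozen=True)
class RouteDecision:
    label: str  # e.g. "EXACT_MATCH", "SIMPLE", "COMPLEX"
    tier: str   # which tier made the decision


def deterministic_match(query: str) -> bool:
    """Tier 1: cheap regex/keyword checks, no model involved."""
    q = query.lower()
    return bool(ID_RE.search(q)) or q.startswith(NAV_PREFIXES)


def route(
    query: str,
    classifier: Callable[[str], Tuple[str, float]],  # returns (label, confidence)
    llm_router: Callable[[str], str],                # returns a label
    threshold: float = 0.8,                          # assumed cutoff, tune per dataset
) -> RouteDecision:
    if deterministic_match(query):
        return RouteDecision("EXACT_MATCH", "deterministic")
    # Tier 2: small classifier; trust it only when it is confident.
    label, confidence = classifier(query)
    if confidence >= threshold:
        return RouteDecision(label, "classifier")
    # Tier 3: fall back to the LLM for ambiguous queries.
    return RouteDecision(llm_router(query), "llm")
```

Because each tier returns early, the expensive model is reached only by queries that both cheaper tiers declined to decide.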

Implementation: The Intent Matrix

We mapped each user intent to a single destination:

NAVIGATIONAL: Route to DB directly.
INFORMATIONAL: Route to Hybrid Search.
ANALYTICAL: Route to Agentic Reasoning Pipeline.
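One way to express this mapping in code is a plain dictionary from intent to handler. The sketch below assumes three placeholder handlers (db_lookup, hybrid_search, agentic_pipeline) that only tag the query with its destination; the real backends would replace them.

```python
from enum import Enum
from typing import Callable, Dict


class Intent(Enum):
    NAVIGATIONAL = "navigational"
    INFORMATIONAL = "informational"
    ANALYTICAL = "analytical"


# Placeholder handlers (hypothetical): each just tags the query with its destination.
def db_lookup(query: str) -> str:
    return f"db:{query}"


def hybrid_search(query: str) -> str:
    return f"hybrid_search:{query}"


def agentic_pipeline(query: str) -> str:
    return f"agentic:{query}"


# The intent matrix as a hash map: one destination per intent.
HANDLERS: Dict[Intent, Callable[[str], str]] = {
    Intent.NAVIGATIONAL: db_lookup,
    Intent.INFORMATIONAL: hybrid_search,
    Intent.ANALYTICAL: agentic_pipeline,
}


def dispatch(intent: Intent, query: str) -> str:
    """Send the query to the handler registered for its intent."""
    return HANDLERS[intent](query)
```

Keeping the table in one dictionary makes it easy to check that every intent has a destination and to add a new intent without touching the dispatch logic.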

Result: 60% Cost Reduction

By keeping navigational and simple informational queries away from the LLM routing stage, we cut daily token usage by 60% and improved p50 latency from 1.2 s to 450 ms.

Summary

Senior AI engineering is the art of not using an LLM when a hash map will do.