The "Always-LLM" Anti-Pattern
Many early RAG implementations treat every user query as a complex semantic puzzle. This is wasteful. If a user asks "What is the status of ticket #402?", you don't need an LLM to "understand" the intent; you need a regex and a SQL lookup.
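To make the ticket example concrete, here is a minimal sketch of the regex-plus-SQL path. The `tickets` table, its columns, the `TICKET_RE` pattern, and the `lookup_ticket_status` helper are all assumptions made for illustration, not the actual schema or code behind this post.

```python
import re
import sqlite3

# Hypothetical pattern: matches "ticket #402" or "ticket 402", case-insensitive.
TICKET_RE = re.compile(r"\bticket\s*#?(\d+)\b", re.IGNORECASE)


def lookup_ticket_status(conn, query):
    """Return the ticket status if the query names a ticket ID, else None."""
    match = TICKET_RE.search(query)
    if match is None:
        return None  # Not a ticket lookup; leave it for a later routing stage.
    row = conn.execute(
        "SELECT status FROM tickets WHERE id = ?",  # parameterized, no string formatting
        (int(match.group(1)),),
    ).fetchone()
    return row[0] if row else None


# Demo against an in-memory database with an assumed schema and made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO tickets VALUES (402, 'open')")

status = lookup_ticket_status(conn, "What is the status of ticket #402?")
print(status)  # open
```

No model call is involved anywhere on this path, which is the point: the query is fully resolved by pattern matching and an indexed lookup.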
The Routing Hierarchy
We built a three-tier routing engine to optimize for latency and cost:
- Deterministic Router (Regex/Keyword): Handles exact matches for IDs, technical terms, and navigation commands. Latency: <1ms.
- Classifier Router (Small Model): A lightweight BERT classifier that labels each query as "Simple" (retrieval only) or "Complex" (multi-document reasoning).
- LLM Router (The Fallback): Used only when the first two tiers cannot settle the intent, that is, when the query is ambiguous or requires zero-shot classification.
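The three tiers above can be sketched as a simple cascade. This is an illustration under stated assumptions rather than our production router: the rule patterns, the `0.8` confidence threshold used to decide when a query counts as "ambiguous", and the two stub functions standing in for the BERT classifier and the LLM call are all hypothetical.

```python
import re
from typing import Callable, Optional, Tuple

# Tier 1: deterministic patterns (illustrative rules only).
DETERMINISTIC_RULES = [
    (re.compile(r"\bticket\s*#?\d+\b", re.IGNORECASE), "ticket_lookup"),
    (re.compile(r"^(go to|open)\b", re.IGNORECASE), "navigation"),
]


def deterministic_route(query: str) -> Optional[str]:
    """Return a route name if any rule matches, else None."""
    for pattern, route_name in DETERMINISTIC_RULES:
        if pattern.search(query):
            return route_name
    return None


def route(
    query: str,
    classifier: Callable[[str], Tuple[str, float]],
    llm_router: Callable[[str], str],
    threshold: float = 0.8,  # assumed cutoff for "confident enough"
) -> Tuple[str, str]:
    """Return (tier, label), trying the cheapest tier first."""
    hit = deterministic_route(query)
    if hit is not None:
        return "deterministic", hit
    label, confidence = classifier(query)
    if confidence >= threshold:
        return "classifier", label
    # Tier 3: only reached when the cheaper tiers could not decide.
    return "llm", llm_router(query)


def stub_classifier(query: str) -> Tuple[str, float]:
    """Stand-in for a small BERT classifier returning (label, confidence)."""
    if "compare" in query.lower():
        return "complex", 0.95
    if len(query.split()) <= 6:
        return "simple", 0.9
    return "simple", 0.5


def stub_llm_router(query: str) -> str:
    """Stand-in for an LLM call; a real one would prompt a model."""
    return "complex"


print(route("What is the status of ticket #402?", stub_classifier, stub_llm_router))
# ('deterministic', 'ticket_lookup')
```

Passing the classifier and LLM router in as callables keeps the cascade testable without loading a model, and makes it easy to log which tier resolved each query.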
Implementation: The Intent Matrix
We mapped user intents into a matrix:
- NAVIGATIONAL: Route to DB directly.
- INFORMATIONAL: Route to Hybrid Search.
- ANALYTICAL: Route to Agentic Reasoning Pipeline.
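One way to express this mapping in code is a plain dispatch table keyed by intent. The sketch below assumes three placeholder handlers (`query_database`, `hybrid_search`, `agentic_pipeline`) that return tagged strings; the names are hypothetical, and real handlers would call the database, the search stack, and the reasoning pipeline.

```python
from enum import Enum
from typing import Callable, Dict


class Intent(Enum):
    NAVIGATIONAL = "navigational"
    INFORMATIONAL = "informational"
    ANALYTICAL = "analytical"


# Placeholder handlers (hypothetical names) that just tag the query.
def query_database(query: str) -> str:
    return f"db:{query}"


def hybrid_search(query: str) -> str:
    return f"hybrid:{query}"


def agentic_pipeline(query: str) -> str:
    return f"agent:{query}"


# The intent matrix as a dictionary: one handler per intent.
INTENT_MATRIX: Dict[Intent, Callable[[str], str]] = {
    Intent.NAVIGATIONAL: query_database,
    Intent.INFORMATIONAL: hybrid_search,
    Intent.ANALYTICAL: agentic_pipeline,
}


def dispatch(intent: Intent, query: str) -> str:
    """Send the query to the handler registered for its intent."""
    return INTENT_MATRIX[intent](query)


print(dispatch(Intent.NAVIGATIONAL, "ticket 402"))  # db:ticket 402
```

Using an `Enum` for the keys means a misspelled intent fails at the call site instead of silently falling through to the wrong pipeline.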
Result: 60% Cost Reduction
By keeping navigational and simple informational queries out of the LLM routing stage, we reduced our daily token usage by 60% and improved p50 latency from 1.2 s to 450 ms.
Summary
Senior AI engineering is the art of not using an LLM when a hash map will do.