From 50 Seconds to 3: Cutting LLM Inference Latency in a Production RAG System
When we first deployed our document intelligence system at AI-RE, the average query response time sat around 50 seconds. Users submitted a question about a property document, waited nearly a minute, and sometimes timed out entirely. That's not a product. That's a loading screen pretending to be one.
Over several months we brought that number down to roughly 3 seconds. Here’s what actually moved the needle, ordered by impact.
Send less to the model
The biggest latency win had nothing to do with the model. Our initial pipeline retrieved too many chunks and stuffed them all into the prompt. More context means more tokens, which means slower inference and a bigger bill. We improved retrieval precision with better embeddings, metadata pre-filtering, and a re-ranking step, which cut the context we send by roughly 60 percent without losing answer quality. That single change was the largest speed improvement we made.
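A rough sketch of what that leaner retrieval step can look like. The vector store, embedder, and reranker objects here are placeholder interfaces for illustration, not our actual stack:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_type: str
    score: float = 0.0

def retrieve_context(query: str, vector_store, embedder, reranker,
                     doc_type: str, top_k: int = 20, keep: int = 5) -> list[Chunk]:
    """Retrieve broadly, pre-filter on metadata, then re-rank and keep only the best chunks."""
    query_vec = embedder.embed(query)

    # Metadata pre-filter: restrict the vector search to the relevant document type
    # so obviously irrelevant material never reaches the candidate set.
    candidates = vector_store.search(query_vec, top_k=top_k,
                                     filter={"doc_type": doc_type})

    # Re-rank with a cross-encoder style scorer and keep only a small slice,
    # which is what lets you shrink the prompt without losing answer quality.
    for chunk in candidates:
        chunk.score = reranker.score(query, chunk.text)
    candidates.sort(key=lambda c: c.score, reverse=True)
    return candidates[:keep]
```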
This matters even more now. With frontier models like Claude Opus 4.6 and Gemini 3.1 Pro supporting context windows up to 2 million tokens, the temptation is to just throw everything in. Don’t. Token count still directly drives latency and cost. The discipline of sending only what’s relevant hasn’t gone away just because context windows got bigger.
Stream the output
Perceived latency matters just as much as actual latency. We switched from waiting for the complete response to streaming tokens as they’re generated. A 15-second response suddenly feels like 2 seconds because users start reading immediately instead of staring at a spinner. This is a frontend change, not an infrastructure overhaul, but the experience improvement is massive. Every production LLM application should stream by default at this point.
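To make this concrete, here is a minimal sketch using the OpenAI Python client (the same pattern applies to any provider with a streaming API); the model name and question are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Instead of waiting for the full completion, iterate over chunks as they arrive
# and forward each one to the user (SSE, WebSocket, etc.) immediately.
stream = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "What is the purchase price in this contract?"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # in production: push to the browser as it arrives
```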
Semantic caching
Many document queries are just variations of the same question. “What’s the purchase price?” and “How much does the property cost?” should return the same answer. We embed the query and check similarity against recent queries before calling the LLM. Hit rates of 30 to 40 percent on recurring document types meant a third of all queries came back in under 500 milliseconds. Some teams report semantic caching cutting LLM costs by up to 68% in typical production workloads. That’s not a marginal improvement; it’s a fundamentally different cost structure.
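A minimal sketch of the idea: normalize query embeddings and do a cosine-similarity check against recently cached ones. The embedder is a placeholder, and the 0.92 threshold is illustrative rather than a tuned value from our system:

```python
import numpy as np

class SemanticCache:
    """Return a cached answer when a new query is semantically close to a recent one."""

    def __init__(self, embedder, threshold: float = 0.92, max_entries: int = 1000):
        self.embedder = embedder      # any text-embedding model (placeholder interface)
        self.threshold = threshold    # similarity above this counts as a cache hit
        self.max_entries = max_entries
        self.entries: list[tuple[np.ndarray, str]] = []  # (normalized embedding, answer)

    def lookup(self, query: str) -> str | None:
        q = self.embedder.embed(query)
        q = q / np.linalg.norm(q)
        for vec, answer in self.entries:
            if float(np.dot(q, vec)) >= self.threshold:
                return answer          # hit: skip the LLM call entirely
        return None

    def store(self, query: str, answer: str) -> None:
        q = self.embedder.embed(query)
        q = q / np.linalg.norm(q)
        self.entries.append((q, answer))
        if len(self.entries) > self.max_entries:
            self.entries.pop(0)        # simple FIFO eviction
```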
Parallelize everything
Our original pipeline was fully sequential: parse, chunk, embed, retrieve, generate. But several of these steps don’t depend on each other. We pre-computed embeddings at ingestion time instead of query time, ran retrieval and metadata lookups in parallel, and pre-loaded model weights. None of this is glamorous work. Async processing is where we found the last 40 percent of our latency savings.
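As a sketch of the parallelization point, here is how retrieval and a metadata lookup can run concurrently with asyncio; both functions below are stand-ins for whatever your pipeline actually does:

```python
import asyncio

async def retrieve_chunks(query: str) -> list[str]:
    # placeholder: vector search against embeddings pre-computed at ingestion time
    await asyncio.sleep(0.4)
    return ["chunk about purchase price", "chunk about closing date"]

async def fetch_metadata(doc_id: str) -> dict:
    # placeholder: database lookup for document metadata
    await asyncio.sleep(0.3)
    return {"doc_type": "purchase_agreement", "pages": 42}

async def build_prompt_inputs(query: str, doc_id: str):
    # Run both I/O-bound steps concurrently instead of one after the other:
    # the wait is max(0.4, 0.3) seconds rather than 0.7.
    chunks, metadata = await asyncio.gather(
        retrieve_chunks(query),
        fetch_metadata(doc_id),
    )
    return chunks, metadata

chunks, metadata = asyncio.run(build_prompt_inputs("What's the purchase price?", "doc-123"))
```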
Why this matters for agentic AI
With Gartner predicting that 40% of enterprise applications will embed AI agents by the end of 2026, latency optimization isn’t just about user experience anymore. Agents call LLMs in loops. An agent that needs 5 sequential LLM calls to complete a task multiplies your latency problem by 5. Every second you shave off a single inference call compounds across the entire agentic workflow. The teams that built fast, efficient LLM pipelines in 2025 are the ones successfully deploying multi-agent systems now. Everyone else is stuck wondering why their agents are so slow.