
Spending your latency budget where users feel it

Streaming, speculative tool calls, and caching — where milliseconds matter and where they quietly don't.

Pooja Nair · October 8, 2025 · 10 min read

A latency budget is the total time a user will quietly tolerate before an agent feels broken. It is not a number your infra team gets to set — your users set it, the moment they hit enter and start counting. The job is to spend that budget where they can feel the difference, and to stop paying for milliseconds nobody will ever notice. Most teams do the opposite: they optimize the parts that are easy to measure and ignore the parts that are easy to feel.

We have spent a lot of time profiling agent turns that were technically fast and still felt slow, and slow turns that users described as snappy. The gap between those two experiences is almost never the model's raw token throughput. It is where you chose to spend the wait — and whether the user had anything to look at while it happened.

Key takeaways

  • Time-to-first-token (TTFT) is the metric users punish hardest: getting something on screen within 400 ms feels alive, while silence past 1 second triggers doubt and abandonment.
  • Streaming is the highest-leverage latency move available and costs almost nothing, because the model is already producing tokens incrementally.
  • Parallel tool calls eliminate additive sequential overhead; the LLMCompiler paper (ICML 2024) demonstrated up to 3.7× latency speedup over sequential ReAct agents (Kim et al., 2024).
  • Speculative tool execution (pre-firing the likely next call before the model formally decides) cut average task completion time by 48.5% in the PASTE evaluation (PASTE, 2025).
  • Prompt-prefix caching reduces TTFT meaningfully on every turn after the first and slashes input-token cost: OpenAI reports up to 80% latency reduction and 50% cost reduction; Anthropic charges only 0.1× the base price for cache hits.
  • Perceived latency is a design variable, not just an engineering metric — status messages, streaming prose, and transparent progress all compress felt wait time even when the clock does not move.

Perceived latency is the only one users grade

There are two clocks running on every turn. One is actual latency: the wall-clock time from request to the final token. The other is perceived latency: how long the wait felt to the person staring at the screen. They are correlated, but they are not the same number, and your users only ever experience the second one.

The classic example is a six-second answer. Render it all at once after six seconds of a spinner and people will reload, retype, or leave. Stream it token by token starting at 350 ms and the same six seconds reads as a thoughtful, capable assistant working through the problem. Identical actual latency. Completely different product. The wait did not get shorter — it got occupied.

This is why we treat time-to-first-token (TTFT) as a first-class metric, separate from total time. TTFT is the gap before anything happens, and it is the part users punish hardest. A turn with a 300 ms TTFT and a 7-second total feels dramatically better than one with a 2-second TTFT and a 4-second total — even though the second one finishes three seconds sooner. The fast-finishing turn loses because it made the user wait in silence first.
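A minimal sketch of recording both clocks from a token stream. The names `TurnTiming` and `timed_stream` are hypothetical and not part of any SDK; the clock is injectable so the logic can be checked without real waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class TurnTiming:
    """Timings for one turn, in seconds from the start of iteration."""
    ttft: Optional[float] = None   # time until the first token arrived
    total: Optional[float] = None  # time until the stream finished


def timed_stream(
    tokens: Iterable[str],
    timing: TurnTiming,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Pass tokens through unchanged while recording TTFT and total time.

    The clock starts on the first next() call, so begin consuming the
    stream as soon as the user's request is sent.
    """
    start = clock()
    for token in tokens:
        if timing.ttft is None:
            timing.ttft = clock() - start
        yield token
    timing.total = clock() - start
```

Wrapping the model's stream this way gives one `TurnTiming` per turn, which is enough to report TTFT and total time as separate series instead of a single "response time" number.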

Users don't experience your p95. They experience the silence before the first token.

This maps directly onto Jakob Nielsen's foundational response-time thresholds, first published in Usability Engineering (1993) and grounded in research going back to Miller (1968): 0.1 s feels instantaneous; 1.0 s is the limit for keeping the user's flow of thought uninterrupted; 10 s is the limit before users disengage and start doing something else. Those numbers were established for general software interactions, but they hold for LLM agents — the cognitive machinery is the same. The one wrinkle specific to agents is that streaming can effectively move the perceived 10-second ceiling, because the user is reading rather than waiting.

A study on human-LLM interaction found an unexpected reversal worth knowing: participants who received responses in 2 seconds rated the output as less thoughtful and useful than those who waited 9 or 20 seconds, suggesting users partly interpret processing time as a signal of deliberation (arXiv:2604.06183). The lesson is nuanced: silence before the first token is damaging, but once generation is streaming, a longer completion can carry implicit quality signals. Do not mistake "users want it faster" for "users want it instant in every dimension."

Watch two clocks on every turn: time-to-first-token (what users feel) and total response time (what actually finishes).

Streaming: the cheapest second you'll ever buy

If you do exactly one thing from this article, stream your responses. It is the highest-leverage latency move available, and it costs you almost nothing because the model is already producing tokens incrementally — you are just choosing whether to show them.

The mechanism is simple but the discipline is not. Streaming only helps if the first token actually arrives early, which means everything you do before generation starts is on the critical path the user feels. A 1.2-second authentication round-trip, a cold retrieval index, a system prompt you rebuild from scratch every turn — each of those pushes your TTFT out and undoes the benefit. We have seen teams add streaming, see no improvement, and conclude it does not work, when the real problem was 1.8 seconds of setup standing between enter and the first token.

The other trap is streaming something users cannot read. If your agent emits a long tool-call plan or a chunk of JSON before any prose, the stream is technically flowing but the screen is showing the user nothing they care about. Stream the human-facing answer first, or at minimum stream a short, honest status — "searching your last 90 days of tickets" — so the early bytes carry meaning, not machinery.

Research on progress feedback confirms this: HCI studies find that visual or textual feedback during delays reduces abandonment and perceived wait time, while a 2026 CHI paper found that elapsed-time displays reduced frustration compared to countdown timers — and that no progress feedback at all made delays "feel longer and heightened ambiguity" (arXiv:2602.04138). Status text during a tool call is not a cosmetic touch; it is engineering perceived latency.
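One way to make the early bytes carry meaning is to yield a readable status before the tool call even starts. This is a sketch under assumptions: `stream_with_status` is a hypothetical helper, and `run_tool` stands in for whatever slow lookup your agent performs.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def stream_with_status(
    status: str,
    run_tool: Callable[[], Awaitable[str]],
) -> AsyncIterator[str]:
    """Yield a human-readable status first, then the tool-backed answer."""
    yield status                  # reaches the screen before the tool call starts
    answer = await run_tool()     # the slow part happens while the user reads the status
    for word in answer.split():
        yield word                # stream the answer in small chunks
```

The consumer sees the status line as the first chunk, so the time the user spends in silence is bounded by your setup cost rather than by the tool's latency.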

How fast is fast enough to stream? Human reading speed averages roughly 200–250 words per minute (3–4 words per second). At approximately 1.3 tokens per word, comfortable reading for an average reader requires around 4–5 tokens per second. Research on cognitive load-aware streaming found that the 99th-percentile reading speed for complex content (such as technical explanations) is about 12 words per second and that for simple content it reaches 21 words per second, and that streaming faster than a user's reading speed does not enhance their experience (arXiv:2504.17999). In practice, once you clear roughly 10 tokens per second, users stop noticing throughput and their sensitivity shifts entirely to TTFT.
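The conversion from reading speed to a token rate is simple enough to check directly. A small sketch, using the 1.3 tokens-per-word ratio quoted above, which is an approximation that varies by tokenizer and language:

```python
def tokens_per_second(words_per_minute: float, tokens_per_word: float = 1.3) -> float:
    """Convert a reading speed in words per minute to a token rate."""
    return words_per_minute / 60.0 * tokens_per_word


slow_reader = tokens_per_second(200)  # about 4.3 tokens per second
fast_reader = tokens_per_second(250)  # about 5.4 tokens per second
```

On these assumptions an average reader consumes roughly 4.3 to 5.4 tokens per second, so a stream running at 10 tokens per second stays ahead of the reader by about a factor of two.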

A rule of thumb — Aim for TTFT under 400 ms and you are in the zone where the interface feels alive. Between 400 ms and 1 second, it feels deliberate. Past about 1 second of silence, users start to doubt that anything is happening — and that doubt is what makes them reload, double-send, or walk away.

Speculative and parallel tool calls

Most agent latency does not live in the model. It lives in the tool calls — the retrievals, the database queries, the third-party APIs — and in the round-trips between them. A turn that calls three tools sequentially, each waiting for the model to decide on the next, can spend four seconds in orchestration overhead before the model has done any real thinking.

Two techniques claw that time back. The first is parallelism: any tools whose inputs do not depend on each other's outputs should fire at the same time. If your agent needs the customer's account status and their recent orders, those are independent — fan them out and await them together. The slowest call sets your floor instead of the sum of all of them. This sounds obvious, and yet most orchestration code we audit runs these calls back to back because that is how the loop was first written.
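The fan-out described above can be sketched with `asyncio.gather`. The two lookup functions are hypothetical stand-ins for real tools, and the sleeps simulate network latency.

```python
import asyncio
import time


async def fetch_account_status(customer_id: str) -> dict:
    await asyncio.sleep(0.10)  # stands in for a network round-trip
    return {"customer_id": customer_id, "status": "active"}


async def fetch_recent_orders(customer_id: str) -> list:
    await asyncio.sleep(0.15)  # the slower of the two calls
    return [{"order_id": "A-1"}, {"order_id": "A-2"}]


async def gather_context(customer_id: str) -> tuple:
    """Fan out both independent lookups and await them together."""
    status, orders = await asyncio.gather(
        fetch_account_status(customer_id),
        fetch_recent_orders(customer_id),
    )
    return status, orders
```

Run back to back, these two calls would take about 0.25 seconds; gathered, the turn waits only for the slower one, about 0.15 seconds. Note that `gather` raises the first exception it sees by default, so real orchestration code usually needs a per-tool timeout and an error policy as well.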

The LLMCompiler paper (ICML 2024) formalized this into a compiler-style framework that automatically detects independent function calls and executes them in parallel. The results against sequential ReAct baselines were stark: up to 3.7× latency speedup, up to 6.7× cost reduction, and up to ~9% accuracy improvement. Those numbers represent the realistic ceiling for fan-out gains; real-world improvement depends on how much parallelism your task graph actually contains.

The second, more aggressive technique is speculation: when the next step is highly predictable, start it before the model formally asks. If 90% of "where is my order" turns end in the same shipment lookup, kick that lookup off the moment you classify the intent, in parallel with the model's first generation. If the model takes a different path, you throw the result away. You are spending a little extra compute and a few wasted calls to buy back a full round-trip of latency on the common case — and on a high-volume support agent, the common case is almost everything.

The PASTE paper (2025) demonstrated this rigorously, exploiting the fact that agents exhibit stable application-level control flows even when requests are semantically diverse. By detecting recurring tool-call sequences and pre-executing the likely next call, PASTE achieved a 48.5% reduction in average task completion time and 1.8× improvement in tool execution throughput compared to standard sequential execution.

The trade-off is real and worth naming. Speculation raises your tool-call volume and can hit rate limits or rack up cost on metered APIs. We gate it on confidence: only speculate when the predicted action is read-only, idempotent, and cheap, and never speculate on anything that writes, charges, or sends. A speculative read of a shipment record is free to be wrong. A speculative refund is not.
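A sketch of gated speculation under stated assumptions: `ToolSpec`, `safe_to_speculate`, and `run_turn` are hypothetical names, `call_tool` stands in for your tool dispatcher, and `decide` stands in for the model choosing a tool by name. This is not the PASTE implementation, only the keep-or-discard pattern with the read-only gate.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass(frozen=True)
class ToolSpec:
    name: str
    read_only: bool
    idempotent: bool
    cheap: bool


def safe_to_speculate(tool: ToolSpec) -> bool:
    """Only speculate on calls that are free to be wrong."""
    return tool.read_only and tool.idempotent and tool.cheap


async def run_turn(
    predicted: ToolSpec,
    call_tool: Callable[[str], Awaitable[str]],
    decide: Callable[[], Awaitable[str]],
) -> str:
    """Start the predicted tool early, then keep or discard it once the model decides."""
    speculative: Optional[asyncio.Task] = None
    if safe_to_speculate(predicted):
        speculative = asyncio.create_task(call_tool(predicted.name))

    chosen = await decide()  # the model's actual choice of tool name

    if speculative is not None and chosen == predicted.name:
        return await speculative   # hit: the lookup is already in flight or done
    if speculative is not None:
        speculative.cancel()       # miss: throw the speculative work away
    return await call_tool(chosen)
```

A tool that writes, charges, or sends never passes the gate, so the worst case for a wrong prediction is one wasted read.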

Most agent latency lives in the infrastructure around the model: tool calls, round-trips, and the cache that short-circuits them.

Caching: prompt prefixes and results

There are two kinds of caching that matter for agents, and they pay off in different places.

Prompt-prefix caching is about the input. A production agent carries a large, stable preamble on every single turn: the system prompt, the tool definitions, the policy text, often a chunk of retrieved context that does not change within a session. Most modern model APIs let you cache that prefix so it is processed once and reused, cutting both the cost and the time-to-first-token of every subsequent turn. On a long system prompt — and they are all long now — this routinely shaves hundreds of milliseconds off TTFT and a meaningful fraction off the bill.

Both Anthropic and OpenAI have documented this concretely:

  • Anthropic charges 0.1× the base input-token price on cache hits (a 90% discount) for supported Claude models, with a minimum cacheable prefix of 1,024 tokens on current Sonnet and Opus models. The documentation notes "you will generally see improved time-to-first-token for long documents." (Anthropic prompt caching docs)
  • OpenAI reports up to 80% latency reduction and up to 50% cost reduction on prompts longer than 1,024 tokens, with caching applied automatically at no additional charge on supported models (gpt-4o and newer).

The catch is that the cache keys on an exact prefix match, so a single dynamic token near the top of your prompt — a timestamp, a request ID — invalidates the whole thing. Push everything volatile to the end of the prompt and keep the top byte-for-byte stable across turns.
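A sketch of that ordering rule. `build_prompt` and `prefix_fingerprint` are hypothetical helpers; how a provider actually matches prefixes is its own implementation detail, and the hash here is only a local check you could run in tests to catch a prefix that drifts between turns.

```python
import hashlib
from typing import Tuple


def build_prompt(
    system_prompt: str,
    tool_definitions: str,
    policy_text: str,
    user_message: str,
    request_id: str,
    timestamp: str,
) -> Tuple[str, str]:
    """Return (stable_prefix, volatile_suffix); the full prompt is prefix + suffix."""
    # Everything that is identical across turns goes first.
    stable_prefix = "\n\n".join([system_prompt, tool_definitions, policy_text]) + "\n\n"
    # Everything that changes per request goes last.
    volatile_suffix = (
        f"request_id: {request_id}\n"
        f"timestamp: {timestamp}\n\n"
        f"{user_message}"
    )
    return stable_prefix, volatile_suffix


def prefix_fingerprint(stable_prefix: str) -> str:
    """Hash the prefix so a test can assert it is byte-for-byte stable."""
    return hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()
```

If the fingerprint changes between two turns of the same session, something volatile has leaked into the top of the prompt.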

Result caching is about the output of your tools. The same forty users ask about the same five help articles; the same dashboard query runs every time someone opens the same view. Caching those results with a sane TTL turns a 600 ms database round-trip into a sub-millisecond memory hit. In RAG pipelines specifically, retrieval has been measured to account for 41% of end-to-end latency and 45–47% of TTFT — making result caching one of the highest-leverage interventions available for retrieval-heavy agents. The hard part is invalidation, as always — cache something that changed and you have traded latency for a confidently wrong answer, which is far more expensive than a slow one. We keep result-cache TTLs short for anything a user can mutate and longer for genuinely static reference data.
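A minimal in-memory sketch of a result cache with a TTL. `TTLCache` is a hypothetical class written for illustration; a production agent would more likely use a shared store, and would also need explicit invalidation for anything a user can mutate.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Cache tool results for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # fresh hit: skip the round-trip
        value = compute()            # miss or expired: pay for the real call
        self._entries[key] = (now, value)
        return value
```

Using one cache instance per data class makes the short-versus-long TTL policy explicit: for example, a 30-second cache for order status and a much longer one for static help articles.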

A cache that returns the wrong answer instantly is the most expensive optimization you can ship.

Where milliseconds matter — and where they don't

The whole point of a budget is that it is finite, so spend it where it is felt. After profiling a lot of turns, the map is fairly consistent.

Milliseconds matter enormously in two places. The first is TTFT, as we have said — the silence before anything appears is the most expensive latency you own. The second is interactive, conversational turns where the user is actively waiting on a reply they intend to read and respond to. In a live chat, the difference between a 1-second and a 3-second turn is the difference between a conversation and an interrogation.

Milliseconds matter far less than people assume in a handful of common cases. Background and asynchronous work — an agent drafting a report it will hand off in five minutes — has a budget measured in seconds or minutes, not milliseconds, and optimizing it hard is wasted engineering. Long-form generation past the first screen of text matters less too: once the user is reading, the stream just needs to stay ahead of their eyes, which is a low bar. And the tail end of a multi-step task, after the user has seen the agent commit to a plan, buys you patience you did not have at the start. People will wait for an agent that has visibly understood them; they will not wait for one that is still silent.

The practical implication is uncomfortable for tidy engineers: it is correct to leave some turns slow on purpose. If a nightly batch agent takes 40 seconds per item and no human is watching, spending a sprint to get it to 25 is a sprint you stole from the interactive path where the same effort would have moved your TTFT and changed how the product feels.

A reference table of latency impact by context

| Context | User tolerance | Highest-leverage tactic |
| --- | --- | --- |
| Interactive chat (user waiting) | TTFT < 400 ms p50, < 1 s p95 | Stream immediately; cache prefix |
| Multi-tool agentic turn | Total < 3 s for simple tasks | Parallel + speculative tool calls |
| Voice / real-time assistant | TTFT < 500 ms | Smaller routing model + streaming |
| Long-form generation (reading) | First screenful < 2 s | TTFT; throughput only needs ~10 t/s |
| Background / async batch | Minutes acceptable | Do not optimize; redirect effort |

The four tactics, ranked by what they buy

Here is the budget in order of leverage. Streaming first because it changes perception for free; the rest because they change the actual clock the perception is built on.

Stream the first token. The single highest-leverage move you can make. Time-to-first-token is what a user reads as "fast" — getting words on screen in under 400 ms makes a 6-second answer feel responsive. Nothing else in this list comes close to its return.

Fire tool calls speculatively. When the next step is obvious, do not wait for the model to ask. Pre-warm the retrieval, kick off the likely API call, and discard it if the plan changes. You trade a few wasted requests for a turn that lands a full second sooner.

Cache the parts that repeat. Prompt-prefix caching cuts the cost of a long, stable system prompt to near zero on every turn after the first. Result caching kills the duplicate tool call you make forty times an hour. Both are free latency you are leaving on the table.

Run independent steps in parallel. Two retrievals that do not depend on each other should never run back to back. Fan them out, await them together, and the slow one sets your floor instead of the sum. Most multi-step turns have more parallelism in them than the code admits.

How to measure and budget

You cannot spend a budget you have not written down. Start by measuring the two clocks separately on real traffic: TTFT and total time, broken out by turn type, at the percentiles that matter. We watch p50 to understand the typical experience and p95 to understand the experience that drives complaints — averages hide the tail, and the tail is what people remember.

A reasonable starting set of SLOs for an interactive agent: TTFT under 400 ms at p50 and under 1 second at p95; total turn under 3 seconds at p50 for a simple answer, with a longer ceiling for turns that legitimately need multiple tool calls. For context, production benchmarks in 2026 show p95/p50 TTFT ratios averaging around 2.1×, meaning your p95 will routinely run more than twice your median — design your alert thresholds with that spread in mind. Write those numbers down, alert on them, and treat a regression as a bug, not a vibe. The teams that keep agents fast are the ones who made latency a number on a dashboard that someone owns, not a feeling someone occasionally has in a demo.
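A sketch of checking the TTFT targets above against a batch of samples. `percentile` uses the simple nearest-rank method, which is one of several common definitions, and `check_ttft_slo` is a hypothetical helper whose default limits are the 400 ms and 1 second figures from this section.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_ttft_slo(
    ttft_seconds: Sequence[float],
    p50_limit: float = 0.4,
    p95_limit: float = 1.0,
) -> Dict[str, object]:
    """Report p50, p95, and whether both are inside their limits."""
    p50 = percentile(ttft_seconds, 50)
    p95 = percentile(ttft_seconds, 95)
    return {"p50": p50, "p95": p95, "ok": p50 <= p50_limit and p95 <= p95_limit}
```

For twenty samples where eighteen are 0.2 s and the last two are 0.9 s and 2.5 s, the mean is 0.35 s and looks healthy, while the p95 of 0.9 s shows how close the tail is to the limit.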

Finally, instrument the spans inside a turn — setup, first model call, each tool call, generation — so that when a turn is slow you can see which part ate the budget. Almost every "the model is slow" complaint we have chased turned out to be a cold index, a sequential call that should have been parallel, or 1.5 seconds of setup that streaming was supposed to hide and could not. You optimize what you can see. Make the budget visible, spend it where users feel it, and the milliseconds you save will be the ones that actually count.
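A minimal sketch of per-span timing inside a turn. `TurnTrace` is a hypothetical class for illustration; in production you would more likely emit these spans through your existing tracing system rather than keep them in a list.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator, List, Tuple


class TurnTrace:
    """Collect (name, seconds) for each span inside one agent turn."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self._clock = clock
        self.spans: List[Tuple[str, float]] = []

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Record the span even if the wrapped step raised.
            self.spans.append((name, self._clock() - start))

    def slowest(self) -> Tuple[str, float]:
        """Return the span that consumed the most time."""
        return max(self.spans, key=lambda s: s[1])
```

Wrapping each stage in `with trace.span("setup"):`, `with trace.span("first_model_call"):`, and so on turns "the model is slow" into a concrete answer about which span ate the budget.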

Frequently asked questions

What is a good TTFT target for a production LLM agent?

For interactive chat agents, aim for TTFT under 400 ms at p50 and under 1 second at p95. Benchmarks of major API providers in 2026 show leading models achieving median TTFTs between 300 ms and 600 ms under normal load, with p95 typically running 2–3× higher. For real-time voice agents the practical ceiling is tighter (roughly 500 ms median) because voice tolerates less pre-speech silence than text. Reasoning models that perform chain-of-thought before generating can see TTFTs of 10–150 seconds; if you are deploying those, streaming intermediate reasoning text or a status message is essential to keep users from concluding the agent has stalled and abandoning the turn.

Why does streaming sometimes not feel faster even after I add it?

Streaming only helps if the first token arrives early. If you have 1–2 seconds of setup before generation begins (authentication, context retrieval, prompt assembly), that entire block is on the critical path and pushes TTFT out before streaming starts. Adding streaming to a pipeline with a 2-second pre-generation delay does not move the 2-second silence the user experiences. Profile the spans before the first model call; that is almost always where the real budget is being consumed.

When is speculative tool execution not worth it?

Speculation is unsafe or uneconomical when: (1) the predicted action has side effects (writes, charges, sends notifications), (2) the prediction accuracy is too low to justify the wasted API calls or compute, or (3) the tool is expensive, rate-limited, or metered per call. A useful gate: only speculate on actions that are read-only, idempotent, cheap, and triggered on at least 70–80% of turns in the relevant intent class. Below that threshold, the wasted-call overhead eats the latency savings.

How does prompt caching affect cost, not just latency?

Significantly. Anthropic charges 0.1× the base input-token price on cache hits (a 90% discount) once a prefix is cached; OpenAI reports up to 50% cost reduction on cached prompts. For an agent with a 20,000-token system prompt running thousands of turns per day, prompt-prefix caching can be the single largest cost reduction available, outpacing model downgrades in many workloads. The write cost (1.25–2× base price depending on TTL) is recovered within a handful of cache hits.
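The break-even point follows from the multipliers quoted above. A small sketch, assuming the first turn pays the cache-write price and every later turn is a cache hit within the TTL; `cached_cost_ratio` is a hypothetical helper and the multipliers are parameters, not fixed prices.

```python
def cached_cost_ratio(
    turns: int,
    write_multiplier: float = 1.25,
    hit_multiplier: float = 0.1,
) -> float:
    """Prefix cost with caching, relative to paying full input price on every turn.

    Assumes one cache write on the first turn and a cache hit on each later turn.
    A value below 1.0 means caching is already cheaper.
    """
    if turns < 1:
        raise ValueError("turns must be at least 1")
    cached = write_multiplier + (turns - 1) * hit_multiplier
    return cached / turns
```

With a 1.25× write price the prefix is already cheaper by the second turn (a ratio of about 0.68); with a 2× write price it takes until the third turn. Over 100 turns the ratio falls to roughly 0.11, which is where the large savings on long system prompts come from.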

