Stop streaming tokens at people who can't read that fast
Streaming the answer one word at a time looks alive, but it often hides latency you should have just fixed.
Most "thinking…" animations are an apology in disguise. The model is slow, the answer isn't ready, and rather than fix that, the team decided to dribble the response out one word at a time so the wait feels productive. It works, for a while. A cursor blinking its way across the screen reads as effort, as a mind at work. But the trick has a ceiling, and most products blow past it without noticing.
Here is the thing nobody measures: a comfortable adult reads at roughly four to five words per second. Plenty of streaming endpoints emit faster than that. Once the model outpaces the reader, a typed-out reveal is no longer covering latency; it is adding it. You took a response that was ready and decided to release it slower than the person could consume it. That is not a feature. That is a governor on your own product.
Streaming solves one problem and fakes the rest
Streaming earns its keep in exactly one case: when time-to-first-token is short but time-to-last-token is long, and the output is the kind of thing a human consumes as it arrives. Long-form prose. A code explanation they read top to bottom. A chain of reasoning where seeing the early steps changes what they do next. There, showing the first sentence at 400 ms instead of the whole thing at 6 s is a real gift. The user starts reading while the model keeps writing, and the two roughly keep pace.
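To put rough numbers on that, here is a back-of-envelope sketch in Python. It assumes a constant generation rate and a constant reading speed, both simplifications, and the function name and the 4 words-per-second default are illustrative choices, not measurements from any product.

```python
def seconds_until_read(words: int, ttft_s: float, ttlt_s: float,
                       stream: bool, read_wps: float = 4.0) -> float:
    """Estimate when the reader finishes the response, in seconds from the request.

    Assumes constant generation and reading rates (a simplification).
    """
    reading_s = words / read_wps
    if not stream:
        # The whole response appears at time-to-last-token; reading starts then.
        return ttlt_s + reading_s
    # Reading starts at the first token, but cannot finish before the last one lands.
    return max(ttlt_s, ttft_s + reading_s)


# Long-form prose: 120 words, first token at 0.4 s, last token at 6 s.
long_saved = (seconds_until_read(120, 0.4, 6.0, stream=False)
              - seconds_until_read(120, 0.4, 6.0, stream=True))

# Short answer: 8 words, first token at 0.4 s, last token at 0.9 s.
short_saved = (seconds_until_read(8, 0.4, 0.9, stream=False)
               - seconds_until_read(8, 0.4, 0.9, stream=True))

print(f"long-form saves ~{long_saved:.1f} s, short answer saves ~{short_saved:.1f} s")
```

Under these assumptions the long answer is finished about 5.6 seconds sooner when streamed, and the short one about half a second sooner. The saving can never exceed the gap between first and last token, and the model only applies when partial text is usable at all; for output the user cannot act on until it is complete, the saving is zero.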
Now look at where streaming actually gets bolted on. A three-line answer. A yes/no with a caveat. A number. A JSON object the UI is going to parse anyway, rendered as fake-typed text the user watches assemble character by character before it snaps into a table. None of these are consumed progressively. The user can't act on half a price or two-thirds of a date. They wait for the end regardless — except now the end arrives later, dressed up as motion.
If the user can't act on a partial answer, streaming it is just a slow reveal of a thing that was already done.
The tell is simple. Ask whether a partial response has value. If a person can read the first half and start doing something useful, stream it. If they have to see the whole thing before it means anything, you are animating for the sake of animating, and you should have spent that engineering effort cutting the latency instead.
Latency you can fix, hidden by latency you chose
The reason this matters beyond aesthetics: the typewriter effect quietly excuses real performance problems. Once the response feels alive, nobody on the team files the ticket. The p95 is ugly, the prompt is bloated, half the wait is a cold retrieval step that could be cached — but the demo looks great, so it ships. Streaming became the painkiller that let the underlying injury go untreated.
I have watched a team shave four seconds off a response by doing the unglamorous work: a smaller model for the classification step, a cached embedding lookup, dropping two paragraphs of system prompt nobody could prove earned their tokens. The answer went from "watch it type for six seconds" to "it's just there." Users described the second version as smarter. It wasn't smarter. It was faster, and fast reads as competent in a way that fake typing never will.
When the answer is short and structured, the right move is usually to wait for the whole thing and render it at once, with an honest indicator while you wait:
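A minimal sketch of that wait-then-render pattern in Python, assuming the model output arrives as an async iterator of string tokens and that the payload is JSON. `render_whole`, `show_status`, `show_result`, and `fake_tokens` are hypothetical names standing in for whatever your client and UI layer actually provide.

```python
import asyncio
import json
from typing import Any, AsyncIterator, Callable


async def render_whole(tokens: AsyncIterator[str],
                       show_status: Callable[[str], None],
                       show_result: Callable[[Any], None],
                       status: str) -> None:
    """Buffer a token stream and hand the parsed result to the UI in one step."""
    show_status(status)            # say what is actually happening
    parts: list[str] = []
    async for token in tokens:     # keep consuming the stream, but show nothing yet
        parts.append(token)
    show_result(json.loads("".join(parts)))  # parse once, render once


async def fake_tokens() -> AsyncIterator[str]:
    """Stand-in for a model stream; a real client would yield tokens here."""
    for token in ['{"price"', ': 42.5, ', '"currency": ', '"EUR"}']:
        await asyncio.sleep(0)
        yield token


# Record what the UI would have been asked to show, in order.
events: list[tuple[str, Any]] = []
asyncio.run(render_whole(
    fake_tokens(),
    show_status=lambda text: events.append(("status", text)),
    show_result=lambda data: events.append(("result", data)),
    status="Searching three sources",
))
print(events)
```

The UI receives exactly two updates: the status line, then the finished object. Nothing half-parsed ever reaches the screen.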
Stream only what a human reads as it arrives.
A progress indicator that says "searching three sources" tells the user more than a cursor pretending the answer is being composed live when it was computed in full 200 milliseconds ago.
Match the pace to the reader, not the model
The deeper principle is that output cadence is a design decision, not a property of the model you inherited. The token stream is raw material. What the user sees is a choice you make on top of it. You can buffer it, chunk it by sentence, hold structured output until it parses, or release it whole. Defaulting to "pipe every token straight to the DOM the instant it arrives" is abdicating that choice and calling the result a UX.
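One of those options, chunking by sentence, might look like the following Python sketch. It assumes tokens arrive as an async iterator of strings and uses a deliberately naive sentence boundary (terminal punctuation followed by whitespace), which will mis-split abbreviations such as "e.g."; `by_sentence` and `fake_tokens` are illustrative names, not part of any library.

```python
import asyncio
import re
from typing import AsyncIterator

# Naive boundary: whitespace that follows ".", "!" or "?".
# An assumption for illustration; real text needs a proper sentence splitter.
_SENTENCE_GAP = re.compile(r"(?<=[.!?])\s+")


async def by_sentence(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Re-chunk a token stream so the UI only ever receives whole sentences."""
    buffer = ""
    async for token in tokens:
        buffer += token
        parts = _SENTENCE_GAP.split(buffer)
        # Everything before the last piece is a complete sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]          # keep the unfinished tail for the next token
    if buffer.strip():
        yield buffer.strip()        # flush whatever is left when the stream ends


async def fake_tokens() -> AsyncIterator[str]:
    """Stand-in for a model stream."""
    for token in ["Stream", " prose", ".", " Hold", " JSON", ".", " Done"]:
        await asyncio.sleep(0)
        yield token


async def collect() -> list[str]:
    return [sentence async for sentence in by_sentence(fake_tokens())]


sentences = asyncio.run(collect())
print(sentences)
```

The same wrapper shape works for the other cadences: swap the boundary rule for "the buffer parses" or "the stream has ended" and the rendering code does not need to change.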
So set the cadence to the human:
- Long prose someone reads linearly → stream it; the reading and the writing keep pace.
- Short or structured answers → wait, render whole, show honest progress.
- Anything the UI parses before displaying → never fake-type it; you are animating data, not language.
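Those three rules are small enough to write down as code. The sketch below is one possible encoding in Python; `ResponseShape`, `Cadence`, `choose_cadence`, and especially the 40-word cutoff for "short" are assumptions made for illustration, and the cutoff should be tuned against your own responses.

```python
from dataclasses import dataclass
from enum import Enum


class Cadence(Enum):
    STREAM = "stream"
    RENDER_WHOLE = "render_whole"


@dataclass(frozen=True)
class ResponseShape:
    parsed_by_ui: bool      # e.g. JSON that becomes a table or a chart
    read_linearly: bool     # prose the user reads top to bottom
    expected_words: int     # rough size estimate for the answer


# Hypothetical threshold for "short"; not a measured value.
SHORT_ANSWER_WORDS = 40


def choose_cadence(shape: ResponseShape) -> Cadence:
    """Pick how to release a response, following the three rules above."""
    if shape.parsed_by_ui:
        # Data, not language: never fake-type it.
        return Cadence.RENDER_WHOLE
    if not shape.read_linearly or shape.expected_words <= SHORT_ANSWER_WORDS:
        # Short or structured: wait and show it whole.
        return Cadence.RENDER_WHOLE
    # Long prose read in order: reading and writing can keep pace.
    return Cadence.STREAM


essay = ResponseShape(parsed_by_ui=False, read_linearly=True, expected_words=600)
price = ResponseShape(parsed_by_ui=True, read_linearly=False, expected_words=5)
print(choose_cadence(essay).value, choose_cadence(price).value)
```

Making the decision explicit per response type is the point; the specific fields matter less than the fact that someone chose them.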
The question is never "should we stream." It is "what is the fastest the user can actually absorb this, and does drip-feeding get us closer to that or further from it." Most of the time the most respectful thing you can do with a finished answer is hand it over and get out of the way.
