TTFT, TPOT, throughput, and what actually matters when you're serving a fine-tuned model

Benchmark your vLLM deployment with four methods: latency (offline), throughput (offline), serving (online with percentiles), and custom async. In our production benchmarks, a fine-tuned 0.6B model on an A40 delivers 30-40 req/s. Key optimizations: prefix caching (40% TTFT improvement in our tests), chunked prefill, and 0.90 GPU memory utilization. Capacity planning: GPUs needed = (daily volume / 86,400) / throughput per GPU.
You fine-tuned a model. It works. Now you need to know: how fast can you serve it?
Not a theoretical number from a blog post. Your model, your hardware, your input distribution. Benchmarks on someone else's setup won't tell you.
I spent a while benchmarking inference for a fine-tuned Qwen3 model served through vLLM. Here's what I learned.
There are too many metrics floating around. These are the ones that actually affect production decisions:
TTFT (Time to First Token) is how long until the user sees the first token. This is perceived latency. For interactive applications, this is the metric that determines whether your service feels fast.
TPOT (Time Per Output Token) is how long between each subsequent token. Determines streaming speed. Less important if you're not streaming.
Throughput (requests/sec) is how many complete requests you can serve per second. This is what determines your infrastructure cost. If you need to process 1M items per day, this is the number you divide by.
Total token throughput (tok/sec) is how many tokens your system processes per second, input and output combined. Useful for comparing across different input/output length distributions.
| Metric | What it tells you | Who cares |
|---|---|---|
| TTFT | Perceived latency | User-facing apps |
| TPOT | Streaming speed | Chat interfaces |
| Throughput (req/s) | Capacity planning | Batch processing |
| P99 latency | Worst-case experience | SLA commitments |
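To make the definitions concrete, here is a minimal sketch of how these metrics fall out of raw timestamps. The `RequestTiming` record and its field names are hypothetical, not something vLLM exposes; the built-in tools compute the same quantities for you.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Hypothetical per-request record; all times are seconds on one clock."""
    sent_at: float
    token_times: List[float]  # arrival time of each output token, in order


def ttft(t: RequestTiming) -> float:
    # Time to first token: request sent -> first output token received
    return t.token_times[0] - t.sent_at


def tpot(t: RequestTiming) -> float:
    # Mean gap between subsequent tokens; with a single token there is no gap
    n = len(t.token_times)
    if n < 2:
        return 0.0
    return (t.token_times[-1] - t.token_times[0]) / (n - 1)


def request_throughput(timings: List[RequestTiming]) -> float:
    # Completed requests divided by the wall-clock span of the whole run
    start = min(t.sent_at for t in timings)
    end = max(t.token_times[-1] for t in timings)
    return len(timings) / (end - start)


r = RequestTiming(sent_at=0.0, token_times=[0.020, 0.030, 0.040, 0.050])
print(f"TTFT={ttft(r) * 1000:.0f} ms, TPOT={tpot(r) * 1000:.0f} ms")
```

Note that throughput is computed over the whole run, not per request, which is why it can rise with concurrency even as individual latencies get worse.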
vLLM has three built-in benchmark tools, and you can write a fourth for custom metrics. Each answers a different question.
"How fast is a single request with no contention?"
```bash
vllm bench latency \
  --model your-model-name \
  --input-len 600 \
  --output-len 30 \
  --batch-size 1 \
  --num-iters 100 \
  --enable-prefix-caching
```
This loads the model directly, no server needed. Batch size 1 gives the purest latency measurement. Increase batch size to see how latency degrades under load.
```bash
# Sweep across batch sizes
for bs in 1 4 8 16 32; do
  echo "Batch size: $bs"
  vllm bench latency \
    --model your-model-name \
    --input-len 600 \
    --output-len 30 \
    --batch-size "$bs" \
    --num-iters 50
done
```
"What's the maximum throughput on this hardware?"
```bash
vllm bench throughput \
  --model your-model-name \
  --dataset-name sharegpt \
  --dataset-path benchmark_data.json \
  --num-prompts 1000 \
  --enable-prefix-caching
```
This pushes the model as hard as possible. No network overhead, no API layer. The number you get here is your theoretical ceiling.
Use your actual data distribution, not synthetic prompts. Input length variation matters a lot for throughput because vLLM's continuous batching behaves differently with mixed-length sequences.
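One way to feed your own data to the `sharegpt` loader is to convert it into ShareGPT-style conversations. This is a sketch under assumptions: it assumes your records have `instruction`, `input`, and `output` keys (the `output` key is hypothetical), and that the loader reads the first two turns of each conversation as prompt and reference completion. Check the loader in your vLLM version before relying on it.

```python
import json


def to_sharegpt(samples):
    """Convert instruction/input/output records into ShareGPT-style conversations."""
    records = []
    for i, s in enumerate(samples):
        # Fold the instruction and the input into a single human turn
        prompt = f'{s["instruction"]}\n\n{s["input"]}'
        records.append({
            "id": str(i),
            "conversations": [
                {"from": "human", "value": prompt},
                # Reference completion; "output" is an assumed key in your data
                {"from": "gpt", "value": s["output"]},
            ],
        })
    return records


samples = [{
    "instruction": "Classify the sentiment.",
    "input": "Great product!",
    "output": "positive",
}]
print(json.dumps(to_sharegpt(samples), indent=2))
```

Dumping the result to `benchmark_data.json` gives you a file you can pass to `--dataset-path`.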
"What's the performance of my actual deployed server?"
```bash
# Terminal 1: Start the server
vllm serve your-model-name \
  --port 8000 \
  --enable-prefix-caching \
  --enable-chunked-prefill

# Terminal 2: Benchmark it
# --backend openai-chat sends chat-style payloads, matching the chat completions endpoint
vllm bench serve \
  --backend openai-chat \
  --model your-model-name \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --dataset-name sharegpt \
  --dataset-path benchmark_data.json \
  --num-prompts 1000 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,99 \
  --save-result
```
This is the most realistic test. It includes network overhead, API parsing, scheduling, everything your production requests hit.
The --request-rate flag matters more than you'd think. Set it to inf for max throughput testing, or to a specific number (like 50) to simulate steady-state production traffic. Latency numbers look very different at different request rates.
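To see why the rate changes the picture, it helps to look at what a finite request rate means for send times. My understanding is that the benchmark spaces requests as a Poisson process when given a finite rate; the sketch below reproduces that idea with the standard library. The function name and seed are illustrative, not part of vLLM.

```python
import random


def poisson_send_times(num_requests, request_rate, seed=0):
    """Return send-time offsets (seconds) for a Poisson arrival process."""
    if request_rate == float("inf"):
        # Infinite rate: every request is sent immediately
        return [0.0] * num_requests
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_requests):
        # Exponential inter-arrival gaps with mean 1 / request_rate
        t += rng.expovariate(request_rate)
        times.append(t)
    return times


times = poisson_send_times(1000, request_rate=50)
print(f"1000 requests at 50 req/s are spread over ~{times[-1]:.1f} s")
```

With an infinite rate the server sees the whole queue at once, so latency includes queueing time. With a finite rate, requests trickle in and latency reflects steady-state load.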

When you need metrics that vLLM's built-in tools don't track (per-request token counts, success rates, custom output validation), write your own:
```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

import requests


def run_single_request(sample, url, model, max_tokens):
    """Send one chat completion request and return its token usage."""
    response = requests.post(url, json={
        "model": model,
        "messages": [
            {"role": "system", "content": sample["instruction"]},
            {"role": "user", "content": sample["input"]},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }, timeout=60)
    response.raise_for_status()  # surface HTTP errors instead of counting them as successes
    usage = response.json().get("usage", {})
    return {
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
    }


def run_threaded_benchmark(samples, url, model, max_tokens=100, max_workers=50):
    """Run all samples with 50 parallel workers by default; return (results, req/s)."""
    results = [None] * len(samples)  # keep results in input order
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(run_single_request, s, url, model, max_tokens): i
                   for i, s in enumerate(samples)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    elapsed = time.perf_counter() - start
    throughput = len(samples) / elapsed
    return results, throughput
```
For better performance, use aiohttp instead of threaded requests:
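```python
import asyncio
import aiohttp

async def call_api(session, sample, url, model, semaphore, max_tokens=100):
    # Assumed helper (not in the original snippet): same payload as the threaded version
    messages = [{"role": "system", "content": sample["instruction"]},
                {"role": "user", "content": sample["input"]}]
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens, "temperature": 0.0}
    async with semaphore:  # cap in-flight requests at max_concurrency
        async with session.post(url, json=payload) as response:
            return await response.json()

async def run_async_benchmark(samples, url, model, max_concurrency=50):
    semaphore = asyncio.Semaphore(max_concurrency)
    connector = aiohttp.TCPConnector(limit=max_concurrency, keepalive_timeout=30)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [call_api(session, s, url, model, semaphore) for s in samples]
        results = await asyncio.gather(*tasks)
    return results
```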
In our A/B tests (same 1,000-request workload, same server config on an A40), the async aiohttp version outperformed ThreadPoolExecutor by 10-30%. Thread context switching adds up, especially at high concurrency.
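Whichever client you use, the custom metrics mentioned earlier (success rates, per-request token counts) come down to a small aggregation step. A sketch, assuming each request is recorded as a dict with `ok`, `latency_s`, and `completion_tokens` keys; those names are hypothetical and you would add them in your own request function.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]; None for an empty list."""
    ordered = sorted(values)
    if not ordered:
        return None
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records, elapsed_s):
    """Aggregate per-request records into the headline benchmark numbers."""
    ok = [r for r in records if r["ok"]]
    latencies = [r["latency_s"] for r in ok]
    return {
        "success_rate": len(ok) / len(records),
        "req_per_s": len(ok) / elapsed_s,
        "output_tok_per_s": sum(r["completion_tokens"] for r in ok) / elapsed_s,
        "p50_s": percentile(latencies, 50),
        "p99_s": percentile(latencies, 99),
    }


records = [{"ok": True, "latency_s": i / 100, "completion_tokens": 30}
           for i in range(1, 101)]
print(summarize(records, elapsed_s=10.0))
```

Only successful requests count toward throughput here, so a server that fails fast does not look artificially quick.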
For a fine-tuned 0.6B model on an A40 (48GB) with ~600 token inputs and ~30 token outputs, here's what we measured in our internal benchmarks (vLLM with PagedAttention, prefix caching enabled):
| Metric | Value |
|---|---|
| Request throughput | 30-40 req/s |
| Output tokens/sec | 1,000-1,500 tok/s |
| Total tokens/sec | 20,000-25,000 tok/s |
| TTFT (median) | 15-30 ms |
| P99 latency | 50-100 ms |
In production terms:
Items/minute: ~2,000
Items/hour: ~130,000
Items/day: ~3 million
On a single GPU. For a 0.6B model with ~600 token inputs and ~30 token outputs.
Your numbers will be different. Run the benchmarks yourself.
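Once you have your own requests/sec figure, the same conversions take a few lines. The sketch below uses 35 req/s (the midpoint of the measured range) with 600 input and 30 output tokens; the function name is just for illustration.

```python
def production_rates(req_per_s, input_tokens, output_tokens):
    """Convert sustained requests/sec into volume and token throughput."""
    return {
        "items_per_minute": req_per_s * 60,
        "items_per_hour": req_per_s * 3600,
        "items_per_day": req_per_s * 86_400,
        "output_tok_per_s": req_per_s * output_tokens,
        "total_tok_per_s": req_per_s * (input_tokens + output_tokens),
    }


for name, value in production_rates(35, 600, 30).items():
    print(f"{name}: {value:,}")
```

It is also a quick sanity check on a benchmark report: if requests/sec times average tokens per request does not roughly match the reported tokens/sec, something in the measurement is off.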
```bash
# --gpu-memory-utilization 0.90: use more VRAM
# --enable-prefix-caching: cache shared prefixes (system prompts)
# --enable-chunked-prefill: better scheduling for mixed batches
# --max-num-seqs 128: more concurrent sequences
# --disable-log-requests: reduce logging overhead
vllm serve your-model \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 128 \
  --disable-log-requests
```
Prefix caching was our biggest win. vLLM stores the KV cache in fixed-size blocks, the design introduced in the PagedAttention paper (Kwon et al., 2023), and with prefix caching enabled it reuses the blocks of a shared prefix so each new request only processes the unique part. If your requests share a common system prompt, this optimization applies directly. In our benchmarks (1,000 requests with a shared system prompt on an A40), this cut our median TTFT by about 40%.
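A toy model helps show why a shared system prompt cuts prefill work. This is not vLLM's implementation, only a sketch of the block-hashing idea: each full block of tokens is keyed by its contents plus the key of the block before it, so a block only hits the cache if the entire prefix up to it matches. The block size of 16 and the class name are illustrative.

```python
BLOCK_SIZE = 16  # tokens per KV block; illustrative value


class ToyPrefixCache:
    def __init__(self):
        self.blocks = set()  # keys of blocks whose KV values are "cached"

    def prefill_cost(self, token_ids):
        """Return (cached_tokens, computed_tokens) and update the cache."""
        parent = 0
        cached = 0
        for b in range(len(token_ids) // BLOCK_SIZE):
            block = tuple(token_ids[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE])
            # Chain the parent key in, so a hit implies the whole prefix matched
            key = hash((parent, block))
            if key in self.blocks:
                cached += BLOCK_SIZE
            else:
                self.blocks.add(key)
            parent = key
        return cached, len(token_ids) - cached


system_prompt = list(range(512))                    # shared 512-token prefix
request_a = system_prompt + list(range(1000, 1088)) # 88 unique tokens
request_b = system_prompt + list(range(2000, 2088)) # 88 different tokens

cache = ToyPrefixCache()
print(cache.prefill_cost(request_a))  # cold cache: everything is computed
print(cache.prefill_cost(request_b))  # warm cache: only the unique tail is computed
```

In this toy setup the second request computes 88 tokens instead of 600, which is the mechanism behind the TTFT drop.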
Chunked prefill helps when you have a mix of long and short inputs. Instead of blocking on a long prefill before scheduling new requests, vLLM splits the prefill into chunks and interleaves them with decoding steps.
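The interleaving can be sketched with a per-step token budget: running decodes take one token each, and whatever budget is left goes to the next slice of the pending prefill. This is a simplified illustration, not vLLM's scheduler, and the budget of 512 is an assumed value.

```python
def schedule_steps(prefill_tokens, num_decoding, token_budget):
    """Return one (decode_tokens, prefill_chunk) pair per step until prefill is done."""
    steps = []
    remaining = prefill_tokens
    while remaining > 0:
        # Ongoing decodes are served first, one token per running sequence
        decode = min(num_decoding, token_budget)
        # The rest of the budget goes to a slice of the long prompt
        chunk = min(remaining, token_budget - decode)
        if chunk == 0:
            raise ValueError("token_budget leaves no room for prefill")
        steps.append((decode, chunk))
        remaining -= chunk
    return steps


steps = schedule_steps(prefill_tokens=4096, num_decoding=8, token_budget=512)
print(len(steps), steps[0], steps[-1])
```

The 8 decoding sequences keep producing a token every step while the 4,096-token prompt is prefilled in slices, instead of stalling for one long prefill.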
| GPU | Memory util | Max seqs | Max model len |
|---|---|---|---|
| A40 (48GB) | 0.90 | 128 | 4096 |
| A100 (80GB) | 0.95 | 256 | 8192 |
| H100 (80GB) | 0.95 | 512 | 16384 |
More VRAM means more concurrent sequences, which means higher throughput. The relationship is roughly linear, but past ~256 concurrent sequences, scheduling overhead starts eating into gains.
The number of parallel workers matters. Too few and you're underutilizing the server. Too many and you overwhelm it.
Start at 50 workers for a single-GPU vLLM server and adjust based on P99 latency. If P99 stays under your SLA, increase workers. If it spikes, back off.
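That tuning loop is easy to automate. In the sketch below, `measure_p99` is a hypothetical callable you supply: it runs your benchmark at a given worker count and returns P99 latency in seconds. The step size and upper bound are arbitrary defaults.

```python
def tune_workers(measure_p99, sla_s, start=50, step=10, max_workers=500):
    """Return the highest tested worker count whose P99 meets the SLA, or None."""
    workers = start
    if measure_p99(workers) <= sla_s:
        # Under the SLA: keep adding workers while the next level still passes
        while workers + step <= max_workers and measure_p99(workers + step) <= sla_s:
            workers += step
        return workers
    # Over the SLA: back off until some level passes
    while workers - step > 0:
        workers -= step
        if measure_p99(workers) <= sla_s:
            return workers
    return None


# Stand-in for a real benchmark run: P99 grows linearly with concurrency
fake_p99 = lambda workers: workers * 0.002
print(tune_workers(fake_p99, sla_s=0.15))
```

Each call to `measure_p99` is a full benchmark run, so in practice a coarse step is usually enough.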
Benchmarking with synthetic data is the big one. If your benchmark uses fixed-length random inputs but your production data has variable-length structured text, your numbers are wrong. Always benchmark with representative data.
Ignoring warmup throws off results too. The first few requests after server startup are slow (model loading, CUDA graph capture, KV cache allocation). Exclude the first 10-20 requests.
Measuring throughput at low concurrency is misleading. Throughput of 5 req/s with 1 worker doesn't mean your server tops out at 5 req/s. It means your latency is 200ms. With 50 concurrent workers, you might hit 40 req/s.
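The relationship behind this is Little's law: requests in flight equals throughput times latency. A short sketch with the numbers from the paragraph above:

```python
def expected_throughput(concurrency, mean_latency_s):
    # Little's law rearranged: throughput = in-flight requests / latency
    return concurrency / mean_latency_s


def implied_latency(concurrency, throughput):
    # Same law, solved for mean latency
    return concurrency / throughput


print(expected_throughput(1, 0.2))  # one worker at 200 ms per request
print(implied_latency(50, 40))      # 50 workers sustaining 40 req/s
```

The second number is worth noticing: 50 workers at 40 req/s implies a mean latency of 1.25 s, so throughput went up 8x while per-request latency went up too. That trade-off is why you read throughput and latency together.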
And if all your requests share a system prompt and you haven't enabled --enable-prefix-caching, you're leaving free performance on the table.
Once you have your throughput number, the math is simple:
Required throughput = daily_volume / seconds_per_day
GPUs needed = required_throughput / throughput_per_gpu
Example: You need to process 5 million items per day.
5,000,000 / 86,400 = ~58 req/s
With 35 req/s per A40: ceil(58 / 35) = 2 GPUs
Add headroom for traffic spikes. 2x is standard, so 4 GPUs for this workload.
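The same calculation as a small function, so you can plug in your own volume and measured throughput. The 2x headroom default follows the rule of thumb above; the function name is illustrative.

```python
import math

SECONDS_PER_DAY = 86_400


def gpus_needed(daily_volume, throughput_per_gpu, headroom=2.0):
    """Return (baseline GPU count, GPU count with headroom)."""
    required_throughput = daily_volume / SECONDS_PER_DAY
    base = math.ceil(required_throughput / throughput_per_gpu)
    return base, math.ceil(base * headroom)


base, with_headroom = gpus_needed(5_000_000, throughput_per_gpu=35)
print(f"baseline: {base} GPUs, with headroom: {with_headroom} GPUs")
```

This assumes traffic is spread evenly across the day. If most of your volume arrives in a few peak hours, divide by the seconds in that window instead of 86,400.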

The difference between a benchmarked deployment and a guess is about 3x in infrastructure cost. Measure first.

AI Consultant. 9+ years building production AI. Previously Chief Data Scientist at recruitRyte. IIT Dhanbad.