What metrics should you use to benchmark LLM inference?

The four metrics that matter are TTFT (Time to First Token, determines perceived latency), TPOT (Time Per Output Token, determines streaming speed), throughput in requests/sec (determines infrastructure cost), and P99 latency (determines worst-case SLA).

How does vLLM prefix caching improve performance?

Prefix caching computes the KV-cache for shared prefixes (like system prompts) once and reuses it across requests, so each new request only processes the unique part. In production benchmarks on an A40, prefix caching cut median TTFT by about 40%.

What's the difference between latency and throughput benchmarks for LLM inference?

Latency benchmarks measure single-request TTFT and TPOT at batch sizes 1-8 to capture user-perceived speed. Throughput benchmarks load the server with hundreds of concurrent requests to find max requests/sec, which determines infrastructure cost per million tokens served.

How do you run an online serving benchmark for vLLM?

Start the server in one terminal with `vllm serve your-model-name`. In a second terminal, run `vllm bench serve` against the OpenAI-compatible endpoint with 50-100 concurrent workers to saturate the server. Capture per-percentile TTFT and TPOT alongside requests/sec for a full picture.

What are the main optimization levers for vLLM throughput?

The main levers are tensor parallelism for models above 13B, prefix caching when system prompts are shared, KV cache quantization on H100s, and batch size tuning. Switching from FP16 to FP8 on H100s nearly doubles throughput with minimal output-quality loss on most tasks.

Benchmarking LLM Inference with vLLM

TTFT, TPOT, throughput, and what actually matters when you're serving a fine-tuned model

TL;DR

Benchmark your vLLM deployment with four methods: latency (offline), throughput (offline), serving (online with percentiles), and custom async. In our production benchmarks, a fine-tuned 0.6B model on an A40 delivers 30-40 req/s. Key optimizations: prefix caching (40% TTFT improvement in our tests), chunked prefill, and 0.90 GPU memory utilization. Capacity planning: GPUs needed = (daily volume / 86,400) / throughput per GPU.

The metrics that matter
Four ways to benchmark
What the numbers look like
Optimization levers
Common mistakes
Capacity planning

You fine-tuned a model. It works. Now you need to know: how fast can you serve it?

Not a theoretical number from a blog post. Your model, your hardware, your input distribution. Benchmarks on someone else's setup won't tell you.

I spent a while benchmarking inference for a fine-tuned Qwen3 model served through vLLM. Here's what I learned.

The metrics that matter

There are too many metrics floating around. These are the ones that actually affect production decisions:

TTFT (Time to First Token) is how long until the user sees the first token. This is perceived latency. For interactive applications, this is the metric that determines whether your service feels fast.

TPOT (Time Per Output Token) is how long between each subsequent token. Determines streaming speed. Less important if you're not streaming.

Throughput (requests/sec) is how many complete requests you can serve per second. This is what determines your infrastructure cost. If you need to process 1M items per day, this is the number you divide by.

Total token throughput (tok/sec) is how many tokens your system processes per second, input and output combined. Useful for comparing across different input/output length distributions.

Metric	What it tells you	Who cares
TTFT	Perceived latency	User-facing apps
TPOT	Streaming speed	Chat interfaces
Throughput (req/s)	Capacity planning	Batch processing
P99 latency	Worst-case experience	SLA commitments
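
To make the definitions concrete, here is a minimal sketch of how these metrics fall out of per-token timestamps. The `RequestTrace` type and its fields are hypothetical, standing in for whatever your streaming client records; this is not how vLLM's own tools compute them internally.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps (seconds) recorded by a hypothetical streaming client."""
    sent_at: float
    token_times: list[float]  # arrival time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token arrival minus send time."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Mean gap between consecutive output tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def request_throughput(traces: list[RequestTrace]) -> float:
    """Completed requests per second over the run's wall-clock span."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return len(traces) / (end - start)
```

Note that TPOT excludes the first token on purpose: the wait for it is already counted in TTFT.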

Four ways to benchmark

vLLM has three built-in benchmark tools, and you can write a fourth for custom metrics. Each answers a different question.

1. Latency benchmark (offline)

"How fast is a single request with no contention?"

vllm bench latency \
  --model your-model-name \
  --input-len 600 \
  --output-len 30 \
  --batch-size 1 \
  --num-iters 100 \
  --enable-prefix-caching

This loads the model directly, no server needed. Batch size 1 gives the purest latency measurement. Increase batch size to see how latency degrades under load.

# Sweep across batch sizes
for bs in 1 4 8 16 32; do
  echo "Batch size: $bs"
  vllm bench latency \
    --model your-model-name \
    --input-len 600 \
    --output-len 30 \
    --batch-size $bs \
    --num-iters 50
done

2. Throughput benchmark (offline)

"What's the maximum throughput on this hardware?"

vllm bench throughput \
  --model your-model-name \
  --dataset-name sharegpt \
  --dataset-path benchmark_data.json \
  --num-prompts 1000 \
  --enable-prefix-caching

This pushes the model as hard as possible. No network overhead, no API layer. The number you get here is the ceiling for this hardware; a live server will land somewhere below it.

Use your actual data distribution, not synthetic prompts. Input length variation matters a lot for throughput because vLLM's continuous batching behaves differently with mixed-length sequences.
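
One way to get your own data into the benchmark is to write it out in the ShareGPT conversation layout. The sketch below assumes each sample is a dict with `instruction`, `input`, and `output` keys; those names and the `to_sharegpt` helper are illustrative, so adapt them to your dataset and check the result against the loader in your vLLM version.

```python
import json


def to_sharegpt(samples, path):
    """Write samples as ShareGPT-style conversations and return the count.

    Each sample is assumed to be a dict with "instruction", "input", and
    "output" keys (a hypothetical shape, not something vLLM requires).
    """
    records = []
    for i, s in enumerate(samples):
        prompt = f'{s["instruction"]}\n\n{s["input"]}'
        records.append({
            "id": str(i),
            "conversations": [
                {"from": "human", "value": prompt},   # first turn: the prompt
                {"from": "gpt", "value": s["output"]},  # second turn: the completion
            ],
        })
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
    return len(records)
```

Including a realistic reference output matters because the benchmark uses it to decide how many tokens to generate per request.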

3. Serving benchmark (online)

"What's the performance of my actual deployed server?"

# Terminal 1: Start the server
vllm serve your-model-name \
  --port 8000 \
  --enable-prefix-caching \
  --enable-chunked-prefill

# Terminal 2: Benchmark it
vllm bench serve \
  --model your-model-name \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --dataset-name sharegpt \
  --dataset-path benchmark_data.json \
  --num-prompts 1000 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,99 \
  --save-result

This is the most realistic test. It includes network overhead, API parsing, scheduling, everything your production requests hit.

The --request-rate flag matters more than you'd think. Set it to inf for max throughput testing, or to a specific number (like 50) to simulate steady-state production traffic. Latency numbers look very different at different request rates.
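
To see why the request rate changes the picture, it helps to look at how a load generator spaces its requests. The sketch below assumes Poisson arrivals (exponentially distributed gaps), which is a common choice for simulating production traffic; the `arrival_offsets` helper is hypothetical and is not vLLM's code.

```python
import random


def arrival_offsets(num_requests, request_rate, seed=0):
    """Return send-time offsets (seconds) for a Poisson arrival process.

    request_rate is in requests/sec; float("inf") sends everything at once.
    """
    if request_rate == float("inf"):
        return [0.0] * num_requests
    rng = random.Random(seed)
    offsets, t = [], 0.0
    for _ in range(num_requests):
        offsets.append(t)
        t += rng.expovariate(request_rate)  # mean gap = 1 / request_rate
    return offsets
```

At an infinite rate every request lands in the queue immediately, so latency mostly measures queueing. At a finite rate the server sees bursts and lulls, which is closer to what users experience.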

4. Custom benchmark (parallel API calls)

When you need metrics that vLLM's built-in tools don't track (per-request token counts, success rates, custom output validation), write your own:

from concurrent.futures import ThreadPoolExecutor, as_completed
import time

import requests

# Placeholder values: point these at your own server, model, and data
url = "http://localhost:8000/v1/chat/completions"
model = "your-model-name"
samples = [{"instruction": "Classify the sentiment.", "input": "Great product!"}]

def run_single_request(sample, url, model, max_tokens):
    response = requests.post(url, json={
        "model": model,
        "messages": [
            {"role": "system", "content": sample["instruction"]},
            {"role": "user", "content": sample["input"]}
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0
    }, timeout=60)
    response.raise_for_status()  # fail loudly instead of parsing an error body
    result = response.json()
    usage = result.get("usage", {})
    return {
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
    }

# Run with 50 parallel workers; results keep the order of samples
results = [None] * len(samples)
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=50) as executor:
    futures = {executor.submit(run_single_request, s, url, model, 100): i
               for i, s in enumerate(samples)}
    for future in as_completed(futures):
        results[futures[future]] = future.result()

elapsed = time.perf_counter() - start
throughput = len(samples) / elapsed  # requests per second

For better performance, use aiohttp instead of threaded requests:

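import asyncio
import aiohttp

async def call_api(session, sample, url, model, semaphore, max_tokens=100):
    # The semaphore caps the number of in-flight requests
    async with semaphore:
        payload = {"model": model, "max_tokens": max_tokens, "temperature": 0.0,
                   "messages": [{"role": "system", "content": sample["instruction"]},
                                {"role": "user", "content": sample["input"]}]}
        async with session.post(url, json=payload) as resp:
            resp.raise_for_status()
            return (await resp.json()).get("usage", {})

async def run_async_benchmark(samples, url, model, max_concurrency=50):
    semaphore = asyncio.Semaphore(max_concurrency)
    connector = aiohttp.TCPConnector(limit=max_concurrency, keepalive_timeout=30)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [call_api(session, s, url, model, semaphore) for s in samples]
        results = await asyncio.gather(*tasks)
    return results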

In our A/B tests (same 1,000-request workload, same server config on an A40), the async aiohttp version outperformed ThreadPoolExecutor by 10-30%. Thread context switching adds up, especially at high concurrency.

What the numbers look like

For a fine-tuned 0.6B model on an A40 (48GB) with ~600 token inputs and ~30 token outputs, here's what we measured in our internal benchmarks (vLLM with PagedAttention, prefix caching enabled):

Metric	Value
Request throughput	30-40 req/s
Output tokens/sec	1,000-1,500 tok/s
Total tokens/sec	20,000-25,000 tok/s
TTFT (median)	15-30 ms
P99 latency	50-100 ms

In production terms:

Items/minute:  ~2,000
Items/hour:    ~130,000
Items/day:     ~3 million

On a single GPU. For a 0.6B model with ~600 token inputs and ~30 token outputs.
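
The production figures are just the sustained request rate scaled to longer windows. A quick sketch using 35 req/s, the midpoint of the measured range, and assuming the rate holds around the clock:

```python
def items_per_period(req_per_sec):
    """Scale a sustained requests/sec figure to larger time windows."""
    return {
        "minute": req_per_sec * 60,
        "hour": req_per_sec * 3_600,
        "day": req_per_sec * 86_400,
    }


print(items_per_period(35))
# {'minute': 2100, 'hour': 126000, 'day': 3024000}
```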

Your numbers will be different. Run the benchmarks yourself.

Optimization levers

Server-side

# --gpu-memory-utilization: use more VRAM
# --enable-prefix-caching:  cache shared prefixes (system prompts)
# --enable-chunked-prefill: better scheduling for mixed batches
# --max-num-seqs:           more concurrent sequences
# --disable-log-requests:   reduce logging overhead
vllm serve your-model \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 128 \
  --disable-log-requests

Prefix caching was our biggest win. As described in the vLLM PagedAttention paper (Kwon et al., 2023), vLLM caches the KV-cache for shared prefixes so each new request only processes the unique part. If your requests share a common system prompt, this optimization applies directly. In our benchmarks (1,000 requests with a shared system prompt on an A40), this cut our median TTFT by about 40%.
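
The idea behind prefix reuse can be shown with a toy model. This is a simplified sketch, not vLLM's implementation: it assumes the cache works on fixed-size token blocks (16 here, an arbitrary choice) and that a block can only be reused when every block before it also matched. The `cached_prefix_tokens` helper is hypothetical.

```python
def cached_prefix_tokens(tokens, cache, block_size=16):
    """Count leading tokens whose KV blocks are already in `cache`.

    Toy model: each full block is keyed by (previous block key, block tokens),
    so a block only matches if everything before it matched too.
    Blocks that miss are added to `cache` as a side effect.
    """
    prev_key, hits, still_matching = None, 0, True
    # Only full blocks are considered; a trailing partial block is ignored.
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tuple(tokens[start:start + block_size])
        key = hash((prev_key, block))
        if still_matching and key in cache:
            hits += block_size
        else:
            still_matching = False
            cache.add(key)
        prev_key = key
    return hits


cache = set()
system_prompt = list(range(32))                           # 32 shared tokens = 2 blocks
print(cached_prefix_tokens(system_prompt + [100] * 20, cache))  # 0
print(cached_prefix_tokens(system_prompt + [200] * 20, cache))  # 32
```

The first request pays for the whole prompt. The second skips the 32 shared tokens and only computes its own suffix, which is where the TTFT saving comes from.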

Chunked prefill helps when you have a mix of long and short inputs. Instead of blocking on a long prefill before scheduling new requests, vLLM splits the prefill into chunks and interleaves them with decoding steps.

Hardware-specific

GPU	Memory util	Max seqs	Max model len
A40 (48GB)	0.90	128	4096
A100 (80GB)	0.95	256	8192
H100 (80GB)	0.95	512	16384

More VRAM means more concurrent sequences, which means higher throughput. The relationship is roughly linear, but past ~256 concurrent sequences, scheduling overhead starts eating into gains.

Client-side

The number of parallel workers matters. Too few and you're underutilizing the server. Too many and you overwhelm it.

Start at 50 workers for a single-GPU vLLM server and adjust based on P99 latency. If P99 stays under your SLA, increase workers. If it spikes, back off.
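
That tuning loop is easy to automate. In the sketch below, `measure_p99_ms` is a hypothetical callable you supply: it runs your benchmark at a given worker count and returns the observed P99 latency in milliseconds. The step size and upper bound are arbitrary assumptions.

```python
def tune_workers(measure_p99_ms, sla_ms, start=50, step=25, max_workers=500):
    """Raise concurrency until P99 latency breaches the SLA.

    Returns the highest worker count that stayed within the SLA, or None
    if even the starting point breached it (in which case, start lower).
    """
    best = None
    workers = start
    while workers <= max_workers:
        if measure_p99_ms(workers) > sla_ms:
            break  # P99 spiked: keep the last good setting
        best = workers
        workers += step
    return best
```

Each call to `measure_p99_ms` is a full benchmark run, so a coarse step keeps the search short.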

Common mistakes

Benchmarking with synthetic data is the big one. If your benchmark uses fixed-length random inputs but your production data has variable-length structured text, your numbers are wrong. Always benchmark with representative data.

Ignoring warmup throws off results too. The first few requests after server startup are slow (model loading, CUDA graph capture, KV cache allocation). Exclude the first 10-20 requests.
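
A minimal sketch of dropping warmup requests before computing percentiles. It assumes `latencies_ms` is in send order and uses the nearest-rank percentile definition; both function names are illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, warmup=20):
    """Drop the first `warmup` requests, then report P50/P90/P99."""
    steady = latencies_ms[warmup:]
    return {f"p{p}": percentile(steady, p) for p in (50, 90, 99)}
```

Without the slice, a handful of slow startup requests can dominate P99 on a 1,000-request run.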

Measuring throughput at low concurrency is misleading. Throughput of 5 req/s with 1 worker doesn't mean your server tops out at 5 req/s. It means your latency is 200ms. With 50 concurrent workers, you might hit 40 req/s.
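
The relationship here is Little's law: in a closed loop, throughput is roughly the number of in-flight requests divided by per-request latency. A sketch, where the 1.25 s latency under load is an illustrative assumption rather than a measurement:

```python
def throughput_req_s(concurrency, latency_s):
    """Little's law for a closed loop: in-flight requests / latency."""
    return concurrency / latency_s


print(throughput_req_s(1, 0.200))  # 5.0
print(throughput_req_s(50, 1.25))  # 40.0
```

Latency per request goes up under load, but throughput still rises because the server batches many requests together.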

And if all your requests share a system prompt and you haven't enabled --enable-prefix-caching, you're leaving free performance on the table.

Capacity planning

Once you have your throughput number, the math is simple:

Required throughput = daily_volume / seconds_per_day
GPUs needed = required_throughput / throughput_per_gpu

Example: You need to process 5 million items per day.

5,000,000 / 86,400 = ~58 req/s
With 35 req/s per A40: ceil(58 / 35) = 2 GPUs

Add headroom for traffic spikes. 2x is standard, so 4 GPUs for this workload.
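
The same arithmetic as a small helper, as a sketch. It assumes traffic is spread evenly over the day before headroom is applied, and the `gpus_needed` name and `headroom` parameter are illustrative.

```python
import math


def gpus_needed(daily_volume, throughput_per_gpu, headroom=2.0):
    """GPUs required for a daily volume, with a multiplier for traffic spikes."""
    required = daily_volume / 86_400  # average req/s over a day
    base = math.ceil(required / throughput_per_gpu)
    return math.ceil(base * headroom)


print(gpus_needed(5_000_000, 35))                # 4
print(gpus_needed(5_000_000, 35, headroom=1.0))  # 2
```

If your traffic is concentrated in business hours, divide by the seconds in your peak window instead of 86,400.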

The difference between a benchmarked deployment and a guess is about 3x in infrastructure cost. Measure first.